GeoDa is a free software package that conducts spatial data analysis, geovisualization, spatial autocorrelation and spatial modeling.
It runs on different versions of Windows, Mac OS, and Linux. The package was initially developed by the Spatial Analysis Laboratory of the University of Illinois at Urbana-Champaign under the direction of Luc Anselin. Since 2016, development has continued at the Center for Spatial Data Science (CSDS) at the University of Chicago. [1]
GeoDa has powerful capabilities to perform spatial analysis, multivariate exploratory data analysis, and global and local spatial autocorrelation analysis. It also performs basic linear regression. Its spatial models include the spatial lag model and the spatial error model, both estimated by maximum likelihood.
GeoDa replaced what was previously called DynESDA, a module that worked under the old ArcView 3.x to perform exploratory spatial data analysis (or ESDA). Current releases of GeoDa no longer depend on the presence of ArcView or other GIS packages on a system.
Projects in GeoDa basically consist of a shapefile that defines the lattice data and an attribute table in .dbf format. The attribute table can be edited inside GeoDa.
The package specializes in exploratory data analysis and geovisualization, where it exploits techniques for dynamic linking and brushing. This means that when the user has multiple views or windows in a project, selecting an object in one of them will highlight the same object in all other windows.
GeoDa can also produce histograms, box plots, and scatter plots for simple exploratory analyses of the data. Its most important feature, however, is the ability to map and link those statistical devices with the spatial distribution of the phenomenon that the users are studying.
Dynamic linking and brushing are powerful devices because they allow users to interactively discover or confirm suspected patterns of spatial arrangement in the data, or to rule such patterns out. They allow users to extract information from spatially arranged data that might otherwise require very heavy computer routines to process the numbers and yield useful statistical results. Such routines may also cost the users quite a bit in terms of expert knowledge and software capabilities.
A very interesting device available in GeoDa to explore global patterns of autocorrelation in space is Anselin's Moran scatterplot. This graph depicts a standardized variable on the x-axis versus the spatial lag of that standardized variable on the y-axis. The spatial lag is a summary of the values in the neighboring spatial units. That summary is obtained by means of a spatial weights matrix, which can take various forms; a very commonly used one is the contiguity matrix. The contiguity matrix is an array that has a value of one in position (i, j) whenever spatial unit j is contiguous to unit i, and zero otherwise. For convenience, that matrix is standardized so that the rows sum to one by dividing each value by the row sum of the original matrix.
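The row standardization and the spatial lag described above can be sketched in a few lines of Python. This is only an illustration, not GeoDa's own code: the function names `row_standardize` and `spatial_lag` and the four-unit example are hypothetical.

```python
def row_standardize(contiguity):
    """Divide each row by its sum so that every non-empty row sums to one."""
    weights = []
    for row in contiguity:
        total = sum(row)
        # A unit without neighbors keeps an all-zero row.
        weights.append([v / total if total else 0.0 for v in row])
    return weights


def spatial_lag(weights, values):
    """For each unit, the weighted average of the values in its neighbors."""
    return [sum(w * x for w, x in zip(row, values)) for row in weights]


# Hypothetical example: four units arranged in a line, 0 - 1 - 2 - 3.
contiguity = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
values = [10.0, 20.0, 30.0, 40.0]

weights = row_standardize(contiguity)
lag = spatial_lag(weights, values)
print(lag)  # [20.0, 20.0, 30.0, 30.0]
```

With row-standardized weights, the lag of each unit is simply the mean of its neighbors' values, which is why the two end units take the value of their single neighbor.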
In essence, Anselin's Moran scatterplot presents the relation of the variable in location i with respect to the values of that variable in the neighboring locations. By construction, the slope of the line in the scatter plot is equivalent to the Moran's I coefficient. The latter is a well-known statistic that measures global spatial autocorrelation. If that slope is positive, there is positive spatial autocorrelation: high values of the variable in location i tend to be clustered with high values of the same variable in locations that are neighbors of i, and low values with low values. If the slope in the scatter plot is negative, there is a sort of checkerboard pattern, or a sort of spatial competition, in which high values of a variable in location i tend to be co-located with lower values in the neighboring locations.
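The equivalence between the slope and Moran's I can be checked numerically. The sketch below assumes row-standardized weights and uses hypothetical function names and a hypothetical four-unit example; it computes Moran's I from its usual formula and, separately, the least-squares slope of the spatial lag on the standardized variable.

```python
def morans_i(weights, values):
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]          # deviations from the mean
    s0 = sum(sum(row) for row in weights)   # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in z)


def scatterplot_slope(weights, values):
    """Least-squares slope of the spatial lag on the standardized variable."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    z = [(v - mean) / sd for v in values]
    lag = [sum(w * x for w, x in zip(row, z)) for row in weights]
    lag_mean = sum(lag) / n
    return (sum(zi * (li - lag_mean) for zi, li in zip(z, lag))
            / sum(zi * zi for zi in z))


# Hypothetical example: four units in a line with row-standardized weights.
weights = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
values = [10.0, 20.0, 30.0, 40.0]

i_stat = morans_i(weights, values)
slope = scatterplot_slope(weights, values)
print(round(i_stat, 6), round(slope, 6))  # 0.4 0.4
```

The two numbers agree because, with row-standardized weights, the sum of all weights equals the number of units, so the scaling factor n / S0 in Moran's I is one.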
In Anselin's Moran scatter plot, the slope of the fitted line is calculated and displayed at the top of the graph. In this case, the value is positive, which means that areas with a high crime rate tend to have neighbors with high rates as well, and vice versa.
At the global level we can talk about clustering, i.e. the general tendency of the map to be clustered; at the local level, we can talk about clusters, i.e. we are able to pinpoint the locations of the clusters. The latter can be assessed by means of Local Indicators of Spatial Association (LISA). LISA analysis allows us to identify the areas with high values of a variable that are surrounded by high values in the neighboring areas, i.e. the so-called high-high clusters. The low-low clusters are identified by the same analysis.
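A minimal sketch of one common local indicator, the local Moran statistic, is given below. It assumes row-standardized weights and a variable standardized with the population standard deviation; the function name `local_moran` and the example data are hypothetical, and no significance test is applied at this stage.

```python
def local_moran(weights, values):
    """Return (local statistic, quadrant label) for every spatial unit."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    z = [(v - mean) / sd for v in values]
    lag = [sum(w * x for w, x in zip(row, z)) for row in weights]
    results = []
    for zi, li in zip(z, lag):
        own = "high" if zi > 0 else "low"        # the unit's own value
        neighbors = "high" if li > 0 else "low"  # average of its neighbors
        results.append((zi * li, own + "-" + neighbors))
    return results


# Hypothetical example: four units in a line with row-standardized weights.
weights = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
values = [10.0, 20.0, 30.0, 40.0]

lisa = local_moran(weights, values)
quadrants = [label for _, label in lisa]
print(quadrants)  # ['low-low', 'low-low', 'high-high', 'high-high']
```

Under these assumptions the average of the local statistics equals the global Moran's I, which is the sense in which the local values decompose the global one. The mixed labels, high-low and low-high, mark candidate spatial outliers rather than clusters.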
Another type of phenomenon that is important to analyze in this context is the existence of outliers, which represent high values of the variable in a given location surrounded by low values in the neighboring locations. This functionality is available in GeoDa by means of Anselin's Moran scatter plot. Note, however, that the fact that a value is high in comparison with the values in neighboring locations does not necessarily mean that it is an outlier, as we need to assess the statistical significance of that relationship. In other words, we may find areas where there seems to be clustering, or where there seem to be clusters, but when the statistical procedures are conducted they turn out not to be statistically significant clusters or outliers. The procedure employed to assess statistical significance consists of a Monte Carlo simulation of different arrangements of the data and the construction of an empirical distribution of simulated statistics. Afterward, the value obtained originally is compared to the distribution of simulated values, and if the value exceeds the 95th percentile, the relation found is said to be significant at the 5% level.
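The Monte Carlo procedure can be illustrated with a simple permutation test on the global Moran's I. This is a sketch under stated assumptions rather than GeoDa's implementation: values are reshuffled across all locations, the pseudo p-value is taken as (r + 1) / (m + 1), where r is the number of simulated statistics at least as large as the observed one and m is the number of permutations, and all names and data are hypothetical.

```python
import random


def morans_i(weights, values):
    """Global Moran's I for a square weights matrix given as nested lists."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in z)


def permutation_test(weights, values, permutations=999, seed=0):
    """Return the observed Moran's I and a one-sided pseudo p-value."""
    rng = random.Random(seed)        # fixed seed for reproducible results
    observed = morans_i(weights, values)
    shuffled = list(values)
    at_least = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)        # random rearrangement over locations
        if morans_i(weights, shuffled) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (permutations + 1)


def line_weights(n):
    """Row-standardized contiguity weights for n units arranged in a line."""
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < n]
        for j in neighbors:
            weights[i][j] = 1.0 / len(neighbors)
    return weights


# Hypothetical example: a smooth gradient along 20 units.
weights = line_weights(20)
values = [float(i) for i in range(20)]
observed, p_value = permutation_test(weights, values)
print(observed, p_value)
```

For this strongly clustered gradient, the observed statistic lies far in the upper tail of the simulated distribution, so the pseudo p-value falls well below 0.05.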
Autocorrelation, sometimes known as serial correlation in the discrete time case, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations of a random variable as a function of the time lag between them. The analysis of autocorrelation is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals.
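As a rough illustration of how autocorrelation reveals a repeating pattern, the sketch below computes the sample autocorrelation of a series at a given lag. The function name `autocorrelation` and the example signal are hypothetical, and the estimator shown (lagged cross-products divided by the total sum of squares) is one common convention among several.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with a copy of itself delayed by lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    numerator = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    return numerator / sum(d * d for d in dev)


# Hypothetical signal that repeats every four samples.
signal = [1.0, 0.0, -1.0, 0.0] * 5

r2 = autocorrelation(signal, 2)  # half a period: strongly negative
r4 = autocorrelation(signal, 4)  # a full period: strongly positive
print(r2, r4)  # -0.9 0.8
```

The peak at lag 4 reflects the period of the signal, while the trough at lag 2 reflects the fact that samples half a period apart have opposite signs.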
Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable, i.e., multivariate random variables. Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate analyses in order to understand the relationships between variables and their relevance to the problem being studied.
Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data. Formally, PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where the variation in the data can be described with fewer dimensions than the initial data. Many studies use the first two principal components in order to plot the data in two dimensions and to visually identify clusters of closely related data points. Principal component analysis has applications in many fields such as population genetics, microbiome studies, and atmospheric science.
A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond formal modeling, and it thereby contrasts with traditional hypothesis testing. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.
Spatial analysis is any of the formal techniques that study entities using their topological, geometric, or geographic properties. Spatial analysis includes a variety of techniques using different analytic approaches, especially spatial statistics. It may be applied in fields as diverse as astronomy, with its studies of the placement of galaxies in the cosmos, or chip fabrication engineering, with its use of "place and route" algorithms to build complex wiring structures. In a more restricted sense, spatial analysis is geospatial analysis, the technique applied to structures at the human scale, most notably in the analysis of geographic data. It may also be applied to genomics, as in transcriptomics data.
Data and information visualization is the practice of designing and creating easy-to-communicate and easy-to-understand graphic or visual representations of a large amount of complex quantitative and qualitative data and information from a certain domain of expertise with the help of static, dynamic or interactive visual items for a broader audience to help them visually explore and discover, quickly understand, interpret and gain important insights into otherwise difficult-to-identify structures, relationships, correlations, local and global patterns, trends, variations, constancy, clusters, outliers and unusual groupings within data. When intended for the general public to convey a concise version of known, specific information in a clear and engaging manner, it is typically called information graphics.
A heat map is a 2-dimensional data visualization technique that represents the magnitude of individual values within a dataset as a color. The variation in color may be by hue or intensity.
Galton's problem, named after Sir Francis Galton, is the problem of drawing inferences from cross-cultural data, due to the statistical phenomenon now called autocorrelation. The problem is now recognized as a general one that applies to all nonexperimental studies and to experimental design as well. It is most simply described as the problem of external dependencies in making statistical estimates when the elements sampled are not statistically independent. Asking two people in the same household whether they watch TV, for example, does not give you statistically independent answers. The sample size, n, for independent observations in this case is one, not two. Once proper adjustments are made that deal with external dependencies, then the axioms of probability theory concerning statistical independence will apply. These axioms are important for deriving measures of variance, for example, or tests of statistical significance.
Indicators of spatial association are statistics that evaluate the existence of clusters in the spatial arrangement of a given variable. For instance, if we are studying cancer rates among census tracts in a given city local clusters in the rates mean that there are areas that have higher or lower rates than is to be expected by chance alone; that is, the values occurring are above or below those of a random distribution in space.
Luc E. Anselin is one of the developers of the field of spatial econometrics.
GGobi is a free statistical software tool for interactive data visualization. GGobi allows extensive exploration of the data with interactive dynamic graphics. It is also a tool for looking at multivariate data. R can be used in sync with GGobi. The GGobi software can be embedded as a library in other programs and program packages using an application programming interface (API) or as an add-on to existing languages and scripting environments, e.g., with the R command line or from Perl or Python scripts. GGobi prides itself on its ability to link multiple graphs together.
Biplots are a type of exploratory graph used in statistics, a generalization of the simple two-variable scatterplot. A biplot overlays a score plot with a loading plot. A biplot allows information on both samples and variables of a data matrix to be displayed graphically. Samples are displayed as points while variables are displayed either as vectors, linear axes or nonlinear trajectories. In the case of categorical variables, category level points may be used to represent the levels of a categorical variable. A generalised biplot displays information on both continuous and categorical variables.
Geary's C is a measure of spatial autocorrelation that attempts to determine if observations of the same variable are spatially autocorrelated globally. Spatial autocorrelation is more complex than autocorrelation because the correlation is multi-dimensional and bi-directional.
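For concreteness, Geary's C can be computed as sketched below, using the standard formula C = (n − 1) Σ_ij w_ij (x_i − x_j)² / (2 S0 Σ_i (x_i − x̄)²), where S0 is the sum of all weights. The function name `gearys_c` and the four-unit example are hypothetical.

```python
def gearys_c(weights, values):
    """Geary's C; values below 1 suggest positive spatial autocorrelation."""
    n = len(values)
    mean = sum(values) / n
    s0 = sum(sum(row) for row in weights)   # sum of all weights
    # Weighted squared differences between every pair of units.
    pair_sum = sum(weights[i][j] * (values[i] - values[j]) ** 2
                   for i in range(n) for j in range(n))
    total_ss = sum((v - mean) ** 2 for v in values)
    return (n - 1) * pair_sum / (2.0 * s0 * total_ss)


# Hypothetical example: four units in a line with row-standardized weights.
weights = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
values = [10.0, 20.0, 30.0, 40.0]

c_stat = gearys_c(weights, values)
print(round(c_stat, 6))  # 0.3
```

Because the statistic is built from squared differences between neighbors rather than cross-products of deviations, a value well below 1, as here, indicates that neighboring units are more alike than would be expected under spatial randomness.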
In statistics, Moran's I is a measure of spatial autocorrelation developed by Patrick Alfred Pierce Moran. Spatial autocorrelation is characterized by a correlation in a signal among nearby locations in space. Spatial autocorrelation is more complex than one-dimensional autocorrelation because spatial correlation is multi-dimensional and multi-directional.
A plot is a graphical technique for representing a data set, usually as a graph showing the relationship between two or more variables. The plot can be drawn by hand or by a computer. In the past, sometimes mechanical or electronic plotters were used. Graphs are a visual representation of the relationship between variables, which are very useful for humans who can then quickly derive an understanding which may not have come from lists of values. Given a scale or ruler, graphs can also be used to read off the value of an unknown variable plotted as a function of a known one, but this can also be done with data presented in tabular form. Graphs of functions are used in mathematics, sciences, engineering, technology, finance, and other areas.
The following outline is provided as an overview of and topical guide to regression analysis:
Mondrian is a general-purpose statistical data-visualization system, for interactive data visualization.
CrimeStat is a crime mapping software program. CrimeStat is a Windows-based program that conducts spatial and statistical analysis and is designed to interface with a geographic information system (GIS). The program is developed by Ned Levine & Associates under the direction of Ned Levine, with funding by the National Institute of Justice (NIJ), an agency of the United States Department of Justice. The program and manual are distributed for free by NIJ.