Geary's C


Geary's C is a measure of spatial autocorrelation that attempts to determine whether observations of the same variable are spatially autocorrelated globally (rather than at the neighborhood level). Spatial autocorrelation is more complex than one-dimensional autocorrelation because the correlation is multi-dimensional and bi-directional.


Global Geary's C


Geary's C is defined as

$$C = \frac{(N-1) \sum_{i} \sum_{j} w_{ij} (x_i - x_j)^2}{2W \sum_{i} (x_i - \bar{x})^2}$$

where $N$ is the number of spatial units indexed by $i$ and $j$; $x$ is the variable of interest; $\bar{x}$ is the mean of $x$; $w_{ij}$ is an element of the spatial weights matrix, which has zeroes on the diagonal (i.e., $w_{ii} = 0$); and $W$ is the sum of all weights $w_{ij}$.
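As a concrete illustration of this definition, the statistic can be computed directly from the formula. The following is a minimal NumPy sketch (the function name and toy data are illustrative additions, not part of the original article):

```python
import numpy as np

def gearys_c(x, w):
    """Global Geary's C for values x and a spatial weights matrix w (zero diagonal)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(x)
    W = w.sum()                                   # sum of all weights
    diff2 = (x[:, None] - x[None, :]) ** 2        # (x_i - x_j)^2 for every pair
    num = (n - 1) * (w * diff2).sum()             # (N - 1) * sum_ij w_ij (x_i - x_j)^2
    den = 2.0 * W * ((x - x.mean()) ** 2).sum()   # 2W * sum_i (x_i - x_bar)^2
    return num / den

# Toy example: five units on a line, each weighted 1 with its immediate neighbours.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
print(gearys_c(x, w))   # well below 1 for this smooth gradient
```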

[Figure: Geary example.png] Geary's C statistic computed for different spatial patterns, using 'rook' neighbours for each grid cell, setting $w_{ij} = 1$ for neighbours $j$ of $i$ and then row-normalizing the weight matrix. Top left: a pattern giving $C > 1$, indicating anti-correlation. Top right: a spatial gradient giving $C < 1$, indicating positive correlation. Bottom left: random data giving $C \approx 1$, indicating no correlation. Bottom right: a spreading pattern with positive autocorrelation.

The value of Geary's C lies between 0 and some unspecified value greater than 1. Values significantly lower than 1 indicate increasing positive spatial autocorrelation, whilst values significantly higher than 1 indicate increasing negative spatial autocorrelation.
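These interpretations can be checked numerically. The sketch below reuses the gearys_c helper above; the grid size, the rook-weight helper, and the test patterns are illustrative assumptions for this example, echoing the figure:

```python
def rook_weights(nrows, ncols, row_normalize=True):
    """Rook-contiguity weights for an nrows x ncols grid, optionally row-normalized."""
    n = nrows * ncols
    w = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrows and 0 <= cc < ncols:
                    w[i, rr * ncols + cc] = 1.0
    if row_normalize:
        w = w / w.sum(axis=1, keepdims=True)
    return w

rng = np.random.default_rng(0)
nrows = ncols = 8
w_grid = rook_weights(nrows, ncols)
checkerboard = np.indices((nrows, ncols)).sum(axis=0) % 2    # alternating 0/1 cells
gradient = np.add.outer(np.arange(nrows), np.arange(ncols))  # smooth spatial trend
noise = rng.random((nrows, ncols))

for name, pattern in [("checkerboard", checkerboard), ("gradient", gradient), ("random", noise)]:
    print(name, round(gearys_c(pattern.ravel(), w_grid), 3))
# Qualitatively: checkerboard gives C > 1, gradient gives C < 1, random noise gives C near 1.
```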

Geary's C is inversely related to Moran's I, but it is not identical. While Moran's I and Geary's C are both measures of global spatial autocorrelation, they are slightly different: Geary's C uses the sum of squared differences between neighbouring values, whereas Moran's I uses standardized spatial covariance. By using squared differences Geary's C is less sensitive to linear associations and may pick up autocorrelation where Moran's I may not.[1]
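For comparison, Moran's I can be computed on the same data from its standard definition, $I = \frac{N}{W} \frac{\sum_i \sum_j w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}$. The short sketch below (again illustrative, reusing numpy and the helpers defined above) places the two statistics side by side; note that positive spatial autocorrelation pushes I above 0 but pushes C below 1:

```python
def morans_i(x, w):
    """Global Moran's I, based on standardized spatial covariance."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()
    return (len(x) / w.sum()) * (w * np.outer(z, z)).sum() / (z ** 2).sum()

print("Geary's C:", round(gearys_c(gradient.ravel(), w_grid), 3))
print("Moran's I:", round(morans_i(gradient.ravel(), w_grid), 3))
```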

Geary's C is also known as Geary's contiguity ratio or simply Geary's ratio. [2]

This statistic was developed by Roy C. Geary. [3]

Local Geary's C

Like Moran's I, Geary's C can be decomposed into a sum of Local Indicators of Spatial Association (LISA) statistics. LISA statistics can be used to find local clusters through significance testing, though because a large number of tests must be performed (one per sampling area) this approach suffers from the multiple comparisons problem. As noted by Anselin, [4] this means the analysis of the local Geary statistic is aimed at identifying interesting points which should then be subject to further investigation. This is therefore a type of exploratory data analysis.

A local version $c_i$ of $C$ is given by[5]

$$c_i = \sum_{j} w_{ij} (z_i - z_j)^2$$

where

$$z_i = \frac{x_i - \bar{x}}{\sqrt{\sum_{k} (x_k - \bar{x})^2 / (N - 1)}}$$

then,

$$C = \frac{1}{2W} \sum_{i} c_i$$
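Under the same illustrative NumPy setup as above (reusing gearys_c, gradient, and w_grid; the name local_gearys_c is an assumption for this sketch), the local values $c_i$ and their relation to the global statistic can be checked directly:

```python
def local_gearys_c(x, w):
    """Local Geary values c_i = sum_j w_ij (z_i - z_j)^2, with z standardized as above."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = (x - x.mean()) / np.sqrt(((x - x.mean()) ** 2).sum() / (len(x) - 1))
    return (w * (z[:, None] - z[None, :]) ** 2).sum(axis=1)

c_i = local_gearys_c(gradient.ravel(), w_grid)
# The local values sum back to the global statistic: C = (1 / 2W) * sum_i c_i
print(np.allclose(c_i.sum() / (2 * w_grid.sum()), gearys_c(gradient.ravel(), w_grid)))  # True
```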

Local Geary's C can be calculated in GeoDa and PySAL. [6]
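For example, a hedged sketch of the PySAL route (assuming the esda and libpysal packages; the grid size and random data are illustrative, and attribute names should be checked against the installed version) might look like:

```python
import numpy as np
from libpysal.weights import lat2W   # contiguity weights on a regular lattice
from esda.geary import Geary

rng = np.random.default_rng(0)
y = rng.random(10 * 10)
w = lat2W(10, 10, rook=True)         # 10 x 10 grid with rook neighbours
gc = Geary(y, w, permutations=999)   # permutation-based inference
print(gc.C, gc.p_sim)                # statistic and pseudo p-value
```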


Sources

  1. Anselin, Luc (April 2019). "A Local Indicator of Multivariate Spatial Association: Extending Geary's c". Geographical Analysis. 51 (2): 133–150. doi:10.1111/gean.12164.
  2. Jeffers, J. N. R. (1973). "A Basic Subroutine for Geary's Contiguity Ratio". Journal of the Royal Statistical Society, Series D. 22 (4): 299–302. doi:10.2307/2986827. JSTOR 2986827.
  3. Geary, R. C. (1954). "The Contiguity Ratio and Statistical Mapping". The Incorporated Statistician. 5 (3): 115–145. doi:10.2307/2986645. JSTOR 2986645.
  4. https://geodacenter.github.io/workbook/6b_local_adv/lab6b.html#local-geary
  5. Anselin, L. (2019). "A local indicator of multivariate spatial association: extending Geary's C". Geographical Analysis. 51 (2): 133–150. doi:10.1111/gean.12164.
  6. https://pysal.org/esda/

