Spatial descriptive statistics

Spatial descriptive statistics is the intersection of spatial statistics and descriptive statistics; these methods are used for a variety of purposes in geography, particularly in quantitative data analyses involving Geographic Information Systems (GIS).

Types of spatial data

The simplest forms of spatial data are gridded data, in which a scalar quantity is measured for each point in a regular grid of points, and point sets, in which a set of coordinates (e.g. of points in the plane) is observed. An example of gridded data would be a satellite image of forest density that has been digitized on a grid. An example of a point set would be the latitude/longitude coordinates of all elm trees in a particular plot of land. More complicated forms of data include marked point sets and spatial time series.

Measures of spatial central tendency

The coordinate-wise mean of a point set is the centroid, which solves the same variational problem in the plane (or higher-dimensional Euclidean space) that the familiar average solves on the real line: that is, the centroid has the smallest possible average squared distance to all points in the set.
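As a minimal sketch (not part of the original article), the centroid of a hypothetical planar point set can be computed as the coordinate-wise mean, here using NumPy:

```python
import numpy as np

# Hypothetical point set: (x, y) coordinates of observed locations
points = np.array([[2.0, 1.0],
                   [3.5, 4.0],
                   [5.0, 2.5],
                   [1.5, 3.0]])

# The centroid is the coordinate-wise mean of the points
centroid = points.mean(axis=0)

# The centroid minimizes the average squared distance to all points in the set
avg_sq_dist = np.mean(np.sum((points - centroid) ** 2, axis=1))
print(centroid, avg_sq_dist)
```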

Measures of spatial dispersion

Dispersion captures the degree to which points in a point set are separated from each other. For most applications, spatial dispersion should be quantified in a way that is invariant to rotations and reflections. Several simple measures of spatial dispersion for a point set can be defined using the covariance matrix of the coordinates of the points. The trace, the determinant, and the largest eigenvalue of the covariance matrix can be used as measures of spatial dispersion.
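A brief sketch of these covariance-based summaries, assuming NumPy and a hypothetical set of planar coordinates:

```python
import numpy as np

# Hypothetical point set: 100 random locations in the unit square
points = np.random.default_rng(0).uniform(size=(100, 2))

# Covariance matrix of the coordinates (2 x 2 for planar data)
cov = np.cov(points, rowvar=False)

# Rotation- and reflection-invariant measures of spatial dispersion
trace = np.trace(cov)                      # total variance
det = np.linalg.det(cov)                   # generalized variance
largest_eig = np.linalg.eigvalsh(cov)[-1]  # variance along the principal axis
```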

A measure of spatial dispersion that is not based on the covariance matrix is the average distance between nearest neighbors.[1]
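Assuming SciPy is available, the average nearest-neighbor distance for a hypothetical point set might be computed along these lines:

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical point set
points = np.random.default_rng(1).uniform(size=(200, 2))

# Query k=2 because the closest point to each point is itself (distance 0)
dist, _ = cKDTree(points).query(points, k=2)
mean_nn_distance = dist[:, 1].mean()  # average distance to the nearest neighbor
```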

Measures of spatial homogeneity

A homogeneous set of points in the plane is a set that is distributed such that approximately the same number of points occurs in any circular region of a given area. A set of points that lacks homogeneity may be spatially clustered at a certain spatial scale. A simple probability model for spatially homogeneous points is the Poisson process in the plane with constant intensity function.
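As an illustrative sketch, a homogeneous Poisson process with constant intensity λ on a unit square can be simulated by drawing the point count from a Poisson distribution with mean λ·A and placing the points uniformly over the region:

```python
import numpy as np

rng = np.random.default_rng(2)
intensity = 50.0   # constant intensity lambda (expected points per unit area)
area = 1.0         # unit square [0, 1] x [0, 1]

# Number of points is Poisson(lambda * A); locations are uniform over the region
n = rng.poisson(intensity * area)
points = rng.uniform(size=(n, 2))
```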

Ripley's K and L functions

Ripley's K and L functions, introduced by Brian D. Ripley, [2] are closely related descriptive statistics for detecting deviations from spatial homogeneity. The K function (technically its sample-based estimate) is defined as

$$\hat{K}(t) = \lambda^{-1} \sum_{i \neq j} \frac{I(d_{ij} < t)}{n},$$

where $d_{ij}$ is the Euclidean distance between the ith and jth points in a data set of n points, t is the search radius, λ is the average density of points (generally estimated as n/A, where A is the area of the region containing all points), and I is the indicator function (i.e. 1 if its operand is true, 0 otherwise). [3] In two dimensions, if the points are approximately homogeneous, $\hat{K}(t)$ should be approximately equal to $\pi t^2$.
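A naive sample estimate of this K function, ignoring edge corrections, might look like the following sketch (the points array and study area are hypothetical inputs):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def ripley_k(points, t, area):
    """Naive (edge-effect-ignoring) estimate of Ripley's K at search radius t."""
    n = len(points)
    lam = n / area                      # estimated intensity lambda = n / A
    d = squareform(pdist(points))       # pairwise Euclidean distances d_ij
    np.fill_diagonal(d, np.inf)         # exclude the i == j terms from the sum
    return (d < t).sum() / (lam * n)    # lambda^{-1} * sum_{i != j} I(d_ij < t) / n
```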

For data analysis, the variance-stabilized Ripley K function, called the L function, is generally used. The sample version of the L function is defined as

$$\hat{L}(t) = \left( \frac{\hat{K}(t)}{\pi} \right)^{1/2}.$$

For approximately homogeneous data, the L function has expected value t and its variance is approximately constant in t. A common plot is a graph of $t - \hat{L}(t)$ against t, which will approximately follow the horizontal zero axis with constant dispersion if the data follow a homogeneous Poisson process.
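A self-contained sketch of the sample L function and the $t - \hat{L}(t)$ deviation described above, under the same naive (no edge correction) assumptions:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def ripley_l(points, t, area):
    """Variance-stabilized L function: L(t) = sqrt(K(t) / pi) in two dimensions."""
    n = len(points)
    d = squareform(pdist(points))
    np.fill_diagonal(d, np.inf)
    k_hat = (d < t).sum() * area / (n * n)  # K(t) with lambda estimated as n / area
    return np.sqrt(k_hat / np.pi)

# For a homogeneous Poisson pattern on the unit square, t - L(t) should stay near zero
rng = np.random.default_rng(3)
points = rng.uniform(size=(200, 2))
radii = np.linspace(0.01, 0.25, 25)
deviation = [t - ripley_l(points, t, area=1.0) for t in radii]
```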

Using Ripley's K function, it can be determined whether points have a random, dispersed, or clustered distribution pattern at a certain scale.[4]

See also

References

  1. Clark, Philip; Evans, Francis (1954). "Distance to nearest neighbor as a measure of spatial relationships in populations". Ecology. 35 (4): 445–453. doi:10.2307/1931034. JSTOR 1931034.
  2. Ripley, B.D. (1976). "The second-order analysis of stationary point processes". Journal of Applied Probability. 13 (2): 255–266. doi:10.2307/3212829. JSTOR 3212829.
  3. Dixon, Philip M. (2002). "Ripley's K function" (PDF). In El-Shaarawi, Abdel H.; Piegorsch, Walter W. (eds.). Encyclopedia of Environmetrics. John Wiley & Sons. pp. 1796–1803. ISBN 978-0-471-89997-6. Retrieved April 25, 2014.
  4. Wilschut, L.I.; Laudisoit, A.; Hughes, N.K.; Addink, E.A.; de Jong, S.M.; Heesterbeek, J.A.P.; Reijniers, J.; Eagle, S.; Dubyanskiy, V.M.; Begon, M. (2015). "Spatial distribution patterns of plague hosts: point pattern analysis of the burrows of great gerbils in Kazakhstan". Journal of Biogeography. 42 (7): 1281–1292. doi:10.1111/jbi.12534. PMC 4737218. PMID 26877580.