Spatial descriptive statistics

Spatial descriptive statistics is the intersection of spatial statistics and descriptive statistics; these methods are used for a variety of purposes in geography, particularly in quantitative data analyses involving Geographic Information Systems (GIS).

Types of spatial data

The simplest forms of spatial data are gridded data, in which a scalar quantity is measured for each point in a regular grid of points, and point sets, in which a set of coordinates (e.g. of points in the plane) is observed. An example of gridded data would be a satellite image of forest density that has been digitized on a grid. An example of a point set would be the latitude/longitude coordinates of all elm trees in a particular plot of land. More complicated forms of data include marked point sets and spatial time series.
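As a minimal illustration of these two data types (a sketch assuming NumPy; the values are hypothetical), gridded data can be stored as a two-dimensional array of measurements, while a point set is simply an n × 2 array of coordinates:

    import numpy as np

    # Gridded data: one scalar value (e.g. forest density) per cell of a regular grid.
    grid = np.random.default_rng(0).random((100, 100))

    # Point set: one (x, y) coordinate pair per observation (e.g. per elm tree).
    points = np.array([[1.5, 3.2],
                       [4.0, 0.7],
                       [2.2, 2.9]])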

Measures of spatial central tendency

The coordinate-wise mean of a point set is the centroid, which solves the same variational problem in the plane (or higher-dimensional Euclidean space) that the familiar average solves on the real line: that is, the centroid has the smallest possible average squared distance to all points in the set.
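A minimal sketch of this computation, assuming NumPy and a hypothetical point set stored as an n × 2 array of coordinates:

    import numpy as np

    # Hypothetical point set: one (x, y) row per point.
    points = np.array([[1.5, 3.2], [4.0, 0.7], [2.2, 2.9]])

    # The centroid is the coordinate-wise mean; it minimizes the average
    # squared Euclidean distance to the points of the set.
    centroid = points.mean(axis=0)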

Measures of spatial dispersion

Dispersion captures the degree to which points in a point set are separated from each other. For most applications, spatial dispersion should be quantified in a way that is invariant to rotations and reflections. Several simple measures of spatial dispersion for a point set can be defined using the covariance matrix of the coordinates of the points. The trace, the determinant, and the largest eigenvalue of the covariance matrix can be used as measures of spatial dispersion.
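The following sketch, assuming NumPy and a hypothetical planar point set, computes all three covariance-based summaries:

    import numpy as np

    points = np.random.default_rng(1).random((50, 2))  # hypothetical planar point set

    # 2 x 2 covariance matrix of the x and y coordinates.
    cov = np.cov(points, rowvar=False)

    # Rotation- and reflection-invariant summaries of spatial dispersion.
    total_variance = np.trace(cov)                       # trace
    generalized_variance = np.linalg.det(cov)            # determinant
    principal_variance = np.linalg.eigvalsh(cov).max()   # largest eigenvalue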

A measure of spatial dispersion that is not based on the covariance matrix is the average distance between nearest neighbors. [1]
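This quantity can be computed with a k-d tree; the sketch below assumes SciPy is available and uses a hypothetical point set:

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.default_rng(2).random((50, 2))  # hypothetical planar point set

    # Query the two closest points to each point; the closest is the point itself
    # (distance 0), so the second column holds the nearest-neighbor distances.
    distances, _ = cKDTree(points).query(points, k=2)
    mean_nearest_neighbor_distance = distances[:, 1].mean()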

Measures of spatial homogeneity

A homogeneous set of points in the plane is a set that is distributed such that approximately the same number of points occurs in any circular region of a given area. A set of points that lacks homogeneity may be spatially clustered at a certain spatial scale. A simple probability model for spatially homogeneous points is the Poisson process in the plane with constant intensity function.
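For illustration, such a process on a rectangular region can be simulated by drawing a Poisson-distributed number of points and placing them uniformly at random; a minimal sketch assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(3)
    intensity = 100.0         # lambda: expected number of points per unit area
    width, height = 1.0, 1.0  # rectangular study region

    # The number of points is Poisson with mean lambda * area; conditional on
    # that count, the point locations are independent and uniform on the region.
    n = rng.poisson(intensity * width * height)
    points = rng.uniform(low=[0.0, 0.0], high=[width, height], size=(n, 2))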

Ripley's K and L functions

Ripley's K and L functions, introduced by Brian D. Ripley [2], are closely related descriptive statistics for detecting deviations from spatial homogeneity. The K function (technically, its sample-based estimate) is defined as

\hat{K}(t) = \lambda^{-1} \sum_{i \neq j} \frac{I(d_{ij} < t)}{n},

where d_{ij} is the Euclidean distance between the ith and jth points in a data set of n points, t is the search radius, λ is the average density of points (generally estimated as n/A, where A is the area of the region containing all points), and I is the indicator function (1 if its operand is true, 0 otherwise). [3] In two dimensions, if the points are approximately homogeneous, \hat{K}(t) should be approximately equal to πt².
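A direct O(n²) sketch of this estimate under the definitions above, assuming NumPy, hypothetical data, and no edge correction:

    import numpy as np

    def ripley_k(points, t, area):
        """Naive sample estimate of Ripley's K at search radius t (no edge correction)."""
        n = len(points)
        lam = n / area                                   # lambda estimated as n / A
        # Pairwise Euclidean distances d_ij; exclude the i == j terms.
        d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        # K(t) = lambda^{-1} * sum_{i != j} I(d_ij < t) / n
        return (d < t).sum() / (lam * n)

    points = np.random.default_rng(4).random((200, 2))   # hypothetical data on the unit square
    print(ripley_k(points, t=0.1, area=1.0))             # compare with pi * 0.1**2 for homogeneity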

For data analysis, the variance-stabilized version of Ripley's K function, called the L function, is generally used. The sample version of the L function is defined as

\hat{L}(t) = \left( \frac{\hat{K}(t)}{\pi} \right)^{1/2}.

For approximately homogeneous data, the L function has expected value t and its variance is approximately constant in t. A common plot is a graph of \hat{L}(t) - t against t, which will approximately follow the horizontal zero-axis with constant dispersion if the data follow a homogeneous Poisson process.
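Building on the hypothetical ripley_k sketch above, the following illustrates the sample L function and the diagnostic plot of \hat{L}(t) - t, assuming NumPy and Matplotlib:

    import numpy as np
    import matplotlib.pyplot as plt

    # Uses the ripley_k function sketched above; hypothetical data on the unit square.
    points = np.random.default_rng(5).random((200, 2))
    radii = np.linspace(0.01, 0.25, 50)

    # L(t) = sqrt(K(t) / pi); for homogeneous data its expected value is roughly t.
    l_values = np.array([np.sqrt(ripley_k(points, t, area=1.0) / np.pi) for t in radii])

    # Plot L(t) - t: values near zero are consistent with a homogeneous Poisson process,
    # positive values suggest clustering, negative values suggest dispersion.
    plt.plot(radii, l_values - radii)
    plt.axhline(0.0, color="grey")
    plt.xlabel("t")
    plt.ylabel("L(t) - t")
    plt.show()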

Using Ripley's K function, one can determine whether points have a random, dispersed, or clustered distribution pattern at a given spatial scale. [4]

See also


References

  1. Clark, Philip; Evans, Francis (1954). "Distance to nearest neighbor as a measure of spatial relationships in populations". Ecology. 35 (4): 445–453. doi:10.2307/1931034. JSTOR 1931034.
  2. Ripley, B.D. (1976). "The second-order analysis of stationary point processes". Journal of Applied Probability. 13 (2): 255–266. doi:10.2307/3212829. JSTOR 3212829.
  3. Dixon, Philip M. (2002). "Ripley's K function" (PDF). In El-Shaarawi, Abdel H.; Piegorsch, Walter W. (eds.). Encyclopedia of Environmetrics. John Wiley & Sons. pp. 1796–1803. ISBN 978-0-471-89997-6. Retrieved April 25, 2014.
  4. Wilschut, L.I.; Laudisoit, A.; Hughes, N.K.; Addink, E.A.; de Jong, S.M.; Heesterbeek, J.A.P.; Reijniers, J.; Eagle, S.; Dubyanskiy, V.M.; Begon, M. (2015). "Spatial distribution patterns of plague hosts: point pattern analysis of the burrows of great gerbils in Kazakhstan". Journal of Biogeography. 42 (7): 1281–1292. doi:10.1111/jbi.12534. PMC 4737218. PMID 26877580.