Geostatistics is a branch of statistics focusing on spatial or spatiotemporal datasets. Developed originally to predict probability distributions of ore grades for mining operations, [1] it is currently applied in diverse disciplines including petroleum geology, hydrogeology, hydrology, meteorology, oceanography, geochemistry, geometallurgy, geography, forestry, environmental control, landscape ecology, soil science, and agriculture (esp. in precision farming). Geostatistics is applied in varied branches of geography, particularly those involving the spread of diseases (epidemiology), the practice of commerce and military planning (logistics), and the development of efficient spatial networks. Geostatistical algorithms are incorporated in many places, including geographic information systems (GIS).
Geostatistics is intimately related to interpolation methods, but extends far beyond simple interpolation problems. Geostatistical techniques rely on statistical models that are based on random function (or random variable) theory to model the uncertainty associated with spatial estimation and simulation.
A number of simpler interpolation methods/algorithms, such as inverse distance weighting, bilinear interpolation and nearest-neighbor interpolation, were already well known before geostatistics. [2] Geostatistics goes beyond the interpolation problem by considering the studied phenomenon at unknown locations as a set of correlated random variables.
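As an illustration of these pre-geostatistical methods, the following is a minimal sketch of inverse distance weighting; the sample coordinates, values, and power parameter are illustrative choices, not prescribed by any particular source.

```python
# A minimal inverse distance weighting (IDW) sketch; all data are illustrative.
import numpy as np

def idw(xy_known, z_known, xy_query, power=2.0, eps=1e-12):
    """Estimate values at xy_query as distance-weighted averages of z_known."""
    d = np.linalg.norm(xy_query[:, None, :] - xy_known[None, :, :], axis=2)
    w = 1.0 / (d + eps) ** power          # closer samples get larger weights
    return (w @ z_known) / w.sum(axis=1)  # normalized weighted average

xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = np.array([1.0, 2.0, 3.0, 4.0])
print(idw(xy, z, np.array([[0.5, 0.5]])))  # interior point, near the mean
```

The design point is that the weights decay purely with distance; unlike the geostatistical methods below, they ignore the spatial correlation structure of the data.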
Let Z(x) be the value of the variable of interest (e.g. temperature, rainfall, piezometric level, geological facies) at a certain location x. This value is unknown. Although there exists a value at location x that could be measured, geostatistics considers this value as random since it was not measured, or has not been measured yet. However, the randomness of Z(x) is not complete, but defined by a cumulative distribution function (CDF) that depends on certain information that is known about the value Z(x):
Typically, if the value of Z is known at locations close to x (or in the neighborhood of x), one can constrain the CDF of Z(x) by this neighborhood: if a high spatial continuity is assumed, Z(x) can only have values similar to the ones found in the neighborhood. Conversely, in the absence of spatial continuity Z(x) can take any value. The spatial continuity of the random variables is described by a model of spatial continuity that can be either a parametric function in the case of variogram-based geostatistics, or have a non-parametric form when using other methods such as multiple-point simulation [3] or pseudo-genetic techniques.
By applying a single spatial model to an entire domain, one makes the assumption that Z is a stationary process, meaning that the same statistical properties apply over the entire domain. Several geostatistical methods provide ways of relaxing this stationarity assumption.
In this framework, one can distinguish two modeling goals: estimating the value of Z(x) at unsampled locations, typically through a summary of its conditional CDF, and simulating multiple realizations of Z(x) that sample its full probability distribution.
A number of methods exist for both the geostatistical estimation and the multiple-realization (simulation) approaches. Several reference books provide a comprehensive overview of the discipline. [6] [2] [7] [8] [9] [10] [11] [12] [13] [14] [15]
Kriging is a group of geostatistical techniques to interpolate the value of a random field (e.g., the elevation, z, of the landscape as a function of the geographic location) at an unobserved location from observations of its value at nearby locations.
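A minimal ordinary-kriging sketch in one dimension follows. The exponential covariance model and its sill and range parameters are illustrative assumptions; in practice these would be estimated from a fitted variogram.

```python
# A minimal ordinary-kriging sketch; covariance model and data are illustrative.
import numpy as np

def cov(h, sill=1.0, rho=2.0):
    return sill * np.exp(-h / rho)  # exponential covariance model

def ordinary_kriging(x_obs, z_obs, x0):
    n = len(x_obs)
    H = np.abs(x_obs[:, None] - x_obs[None, :])  # pairwise distances
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = cov(H)
    A[n, :n] = 1.0                                # unbiasedness constraint
    A[:n, n] = 1.0
    b = np.append(cov(np.abs(x_obs - x0)), 1.0)
    lam = np.linalg.solve(A, b)[:n]               # kriging weights
    return lam @ z_obs

x_obs = np.array([0.0, 1.0, 3.0])
z_obs = np.array([1.0, 2.0, 1.5])
print(ordinary_kriging(x_obs, z_obs, 2.0))  # prediction at an unsampled point
```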
Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update a probability model as more evidence or information becomes available. Bayesian inference is playing an increasingly important role in geostatistics. [16] Bayesian estimation implements kriging through a spatial process, most commonly a Gaussian process, and updates the process using Bayes' theorem to calculate its posterior. Bayesian methods have also been developed for high-dimensional geostatistical problems. [17]
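As a toy illustration of the updating step, the following sketch applies a conjugate Gaussian update of an unknown mean as observations arrive; all numbers are illustrative assumptions.

```python
# A hedged sketch of Bayesian updating: a Gaussian prior on an unknown mean
# is revised with batches of Gaussian observations.
import numpy as np

def update_gaussian_mean(prior_mu, prior_var, obs, noise_var):
    """Posterior for an unknown mean under a conjugate Gaussian model."""
    n = len(obs)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mu = post_var * (prior_mu / prior_var + np.sum(obs) / noise_var)
    return post_mu, post_var

mu, var = 0.0, 10.0                      # vague prior
for batch in ([1.2, 0.8], [1.1, 0.9, 1.0]):
    mu, var = update_gaussian_mean(mu, var, np.array(batch), noise_var=0.5)
    print(mu, var)                       # posterior tightens as evidence arrives
```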
Based on the principle of conservation of probability, recurrent difference equations (finite difference equations) have been used in conjunction with lattices to compute probabilities quantifying uncertainty about geological structures. This procedure is a numerical alternative to Markov chains and Bayesian models. [18]
A likelihood function measures how well a statistical model explains observed data by calculating the probability of seeing that data under different parameter values of the model. It is constructed from the joint probability distribution of the random variable that (presumably) generated the observations. When evaluated on the actual data points, it becomes a function solely of the model parameters.
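The following short sketch makes this concrete: the Gaussian log-likelihood of a fixed data set, scanned over candidate means. The data and parameter grid are illustrative assumptions.

```python
# The likelihood as a function of parameters for fixed data (illustrative).
import numpy as np

def gaussian_loglik(mu, sigma, data):
    """Log-likelihood of i.i.d. Gaussian data at parameters (mu, sigma)."""
    n = len(data)
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - np.sum((data - mu) ** 2) / (2 * sigma**2))

data = np.array([4.8, 5.1, 5.3, 4.9, 5.0])
for mu in (4.0, 5.0, 6.0):
    print(mu, gaussian_loglik(mu, 1.0, data))  # peaks near the sample mean
```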
In probability theory and statistics, a Gaussian process is a stochastic process (a collection of random variables indexed by time or space) such that every finite collection of those random variables has a multivariate normal distribution. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.
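A minimal sketch of this definition: evaluating a Gaussian process at any finite set of inputs yields a multivariate normal, from which one can draw function samples. The squared-exponential kernel and its length scale are assumed choices.

```python
# Drawing function samples from a Gaussian process (illustrative kernel).
import numpy as np

def rbf_kernel(x, length=1.0):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

x = np.linspace(0, 5, 50)
K = rbf_kernel(x) + 1e-8 * np.eye(len(x))     # jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)                           # three function draws over x
```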
In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step. It can be used, for example, to estimate a mixture of Gaussians, or to solve the multiple linear regression problem.
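A compact EM sketch for a two-component one-dimensional Gaussian mixture follows; the synthetic data, initial values, and iteration count are illustrative assumptions.

```python
# Hand-rolled EM for a two-component 1-D Gaussian mixture (illustrative).
import numpy as np

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    # E step: responsibilities of each component for each point
    dens = (w / np.sqrt(2 * np.pi * var)
            * np.exp(-0.5 * (data[:, None] - mu) ** 2 / var))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M step: re-estimate weights, means, variances from responsibilities
    nk = r.sum(axis=0)
    w = nk / len(data)
    mu = (r * data[:, None]).sum(axis=0) / nk
    var = (r * (data[:, None] - mu) ** 2).sum(axis=0) / nk
print(w, mu, var)  # should approach the generating parameters
```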
In statistics, originally in geostatistics, kriging (also known as Gaussian process regression) is a method of interpolation based on a Gaussian process governed by prior covariances. Under suitable assumptions of the prior, kriging gives the best linear unbiased prediction (BLUP) at unsampled locations. Interpolating methods based on other criteria such as smoothness may not yield the BLUP. The method is widely used in the domain of spatial analysis and computer experiments. The technique is also known as Wiener–Kolmogorov prediction, after Norbert Wiener and Andrey Kolmogorov.
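A brief sketch of the Gaussian-process-regression view of kriging, using scikit-learn's GaussianProcessRegressor as an assumed tool choice (classical kriging software parameterizes the same model through variograms rather than kernels); data are illustrative.

```python
# Kriging as Gaussian process regression via scikit-learn (assumed tooling).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

x = np.array([[0.0], [1.0], [3.0], [4.0]])     # sampled locations
z = np.array([1.0, 2.0, 1.5, 1.8])             # observed values
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(x, z)
mean, std = gpr.predict(np.array([[2.0]]), return_std=True)
print(mean, std)   # BLUP-style prediction with its uncertainty
```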
In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. Mixture models are used for clustering, under the name model-based clustering, and also for density estimation.
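A hedged sketch of model-based clustering with a mixture model, using scikit-learn's GaussianMixture (assumed to be available in the environment); the synthetic two-cluster data are illustrative.

```python
# Model-based clustering with a Gaussian mixture (assumed sklearn tooling).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)            # inferred sub-population for each point
print(gm.means_)                  # recovered component means
print(np.bincount(labels))        # cluster sizes
```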
In spatial statistics the theoretical variogram, denoted 2γ(x, y), is a function describing the degree of spatial dependence of a spatial random field or stochastic process Z(x). The semivariogram γ(x, y) is half the variogram.
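An empirical semivariogram can be computed by averaging half squared differences of observations, binned by separation distance. The following one-dimensional sketch uses illustrative data and bin edges.

```python
# A minimal empirical semivariogram in 1-D (illustrative data and bins).
import numpy as np

def empirical_semivariogram(x, z, edges):
    h = np.abs(x[:, None] - x[None, :])           # pairwise separations
    sq = 0.5 * (z[:, None] - z[None, :]) ** 2     # half squared differences
    iu = np.triu_indices(len(x), k=1)             # count each pair once
    gamma = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (h[iu] >= lo) & (h[iu] < hi)
        gamma.append(sq[iu][mask].mean() if mask.any() else np.nan)
    return np.array(gamma)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 100))
z = np.sin(x) + 0.1 * rng.normal(size=100)        # spatially continuous signal
print(empirical_semivariogram(x, z, np.linspace(0, 5, 6)))
```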
Markov chain geostatistics uses Markov chain spatial models, simulation algorithms and associated spatial correlation measures based on the Markov chain random field theory, which extends a single Markov chain into a multi-dimensional random field for geostatistical modeling. A Markov chain random field is still a single spatial Markov chain. The spatial Markov chain moves or jumps in a space and decides its state at any unobserved location through interactions with its nearest known neighbors in different directions. The data interaction process can be well explained as a local sequential Bayesian updating process within a neighborhood. Because single-step transition probability matrices are difficult to estimate from sparse sample data and are impractical in representing the complex spatial heterogeneity of states, the transiogram, which is defined as a transition probability function over the distance lag, is proposed as the accompanying spatial measure of Markov chain random fields.
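As a toy illustration of the transiogram idea, the following sketch estimates empirical transition frequencies over lag distance from a one-dimensional categorical sequence; the synthetic states and lag range are illustrative assumptions.

```python
# An empirical transiogram for a 1-D categorical sequence (illustrative).
import numpy as np

def transiogram(states, i, j, max_lag):
    """Empirical Pr(state j at x+h | state i at x) for lags h = 1..max_lag."""
    p = []
    for h in range(1, max_lag + 1):
        src, dst = states[:-h], states[h:]
        at_i = src == i
        p.append((dst[at_i] == j).mean() if at_i.any() else np.nan)
    return np.array(p)

rng = np.random.default_rng(4)
states = rng.choice([0, 1], size=500, p=[0.7, 0.3])
print(transiogram(states, 0, 1, max_lag=5))  # should hover near Pr(state 1)
```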
Data assimilation is a mathematical discipline that seeks to optimally combine theory with observations. There may be a number of different goals sought – for example, to determine the optimal state estimate of a system, to determine initial conditions for a numerical forecast model, to interpolate sparse observation data using knowledge of the system being observed, to set numerical parameters based on training a model from observed data. Depending on the goal, different solution methods may be used. Data assimilation is distinguished from other forms of machine learning, image analysis, and statistical methods in that it utilizes a dynamical model of the system being analyzed.
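A minimal data-assimilation sketch follows: a scalar Kalman-filter-style cycle that blends a dynamical model forecast with noisy observations. All constants are illustrative assumptions.

```python
# A scalar forecast/analysis assimilation cycle (illustrative parameters).
import numpy as np

x_est, p_est = 0.0, 1.0          # state estimate and its variance
a, q, r = 0.9, 0.1, 0.5          # model dynamics, model noise, obs noise
rng = np.random.default_rng(5)
truth = 2.0
for _ in range(10):
    truth = a * truth + rng.normal(0, np.sqrt(q))       # true system evolves
    obs = truth + rng.normal(0, np.sqrt(r))             # noisy measurement
    x_fc, p_fc = a * x_est, a * a * p_est + q           # forecast step
    k = p_fc / (p_fc + r)                               # Kalman gain
    x_est, p_est = x_fc + k * (obs - x_fc), (1 - k) * p_fc  # analysis step
    print(round(truth, 3), round(x_est, 3))
```

The defining feature relative to generic statistical estimation is visible in the forecast step, which uses the dynamical model of the system being analyzed.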
Spatial analysis is any of the formal techniques which study entities using their topological, geometric, or geographic properties. Spatial analysis includes a variety of techniques using different analytic approaches, especially spatial statistics. It may be applied in fields as diverse as astronomy, with its studies of the placement of galaxies in the cosmos, or to chip fabrication engineering, with its use of "place and route" algorithms to build complex wiring structures. In a more restricted sense, spatial analysis is geospatial analysis, the technique applied to structures at the human scale, most notably in the analysis of geographic data. It may also be applied to genomics, as in transcriptomics data.
Georges François Paul Marie Matheron was a French mathematician and civil engineer of mines, known as the founder of geostatistics and a co-founder of mathematical morphology. In 1968, he created the Centre de Géostatistique et de Morphologie Mathématique at the Paris School of Mines in Fontainebleau. He is known for his contributions on Kriging and mathematical morphology. His seminal work is available for study and review in the Online Library of the Centre de Géostatistique, Fontainebleau, France.
The cross-entropy (CE) method is a Monte Carlo method for importance sampling and optimization. It is applicable to both combinatorial and continuous problems, with either a static or noisy objective.
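A hedged sketch of the CE method for continuous optimization follows: sample candidates from a Gaussian, keep an elite fraction, refit the sampler, and repeat. The objective function and tuning constants are illustrative assumptions.

```python
# Cross-entropy method for 1-D continuous maximization (illustrative).
import numpy as np

def cross_entropy_maximize(f, mu=0.0, sigma=5.0, n=100, elite=10, iters=30):
    rng = np.random.default_rng(6)
    for _ in range(iters):
        xs = rng.normal(mu, sigma, n)               # sample candidate solutions
        best = xs[np.argsort(f(xs))[-elite:]]       # keep the elite samples
        mu, sigma = best.mean(), best.std() + 1e-6  # refit the sampler
    return mu

print(cross_entropy_maximize(lambda x: -(x - 3.0) ** 2))  # converges near 3
```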
Uncertainty quantification (UQ) is the science of quantitative characterization and estimation of uncertainties in both computational and real-world applications. It tries to determine how likely certain outcomes are if some aspects of the system are not exactly known. An example would be to predict the acceleration of a human body in a head-on crash with another car: even if the speed were exactly known, small differences in the manufacturing of individual cars, in how tightly every bolt has been tightened, and so on, will lead to different results that can only be predicted in a statistical sense.
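In the spirit of that example, the following Monte Carlo sketch propagates input uncertainty through a model and summarizes the output distribution; the toy model and input distributions are illustrative assumptions.

```python
# Monte Carlo uncertainty propagation through a toy model (illustrative).
import numpy as np

rng = np.random.default_rng(7)
speed = rng.normal(15.0, 0.1, 100_000)      # m/s, almost exactly known
stiffness = rng.normal(1.0, 0.2, 100_000)   # manufacturing variation
decel = stiffness * speed / 0.05            # toy crash-deceleration model
print(decel.mean(), decel.std())            # outcome only known statistically
print(np.percentile(decel, [5, 95]))        # a 90% interval for the outcome
```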
In numerical analysis, multivariate interpolation is interpolation on functions of more than one variable; when the variates are spatial coordinates, it is also known as spatial interpolation.
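A brief spatial-interpolation sketch using scipy's griddata (an assumed tool choice); the scattered sample points and query locations are illustrative.

```python
# Scattered-data spatial interpolation with scipy.interpolate.griddata.
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(8)
pts = rng.uniform(0, 1, (50, 2))                 # scattered sample locations
vals = np.sin(pts[:, 0] * 3) + pts[:, 1] ** 2    # observed values
xi = np.array([[0.5, 0.5], [0.2, 0.8]])          # query locations
print(griddata(pts, vals, xi, method="linear"))  # piecewise-linear interpolation
print(griddata(pts, vals, xi, method="nearest")) # nearest-neighbor variant
```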
Polynomial chaos (PC), also called polynomial chaos expansion (PCE) and Wiener chaos expansion, is a method for representing a random variable in terms of a polynomial function of other random variables. The polynomials are chosen to be orthogonal with respect to the joint probability distribution of these random variables. Note that despite its name, PCE has no immediate connections to chaos theory. The word "chaos" here should be understood as "random".
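A hedged PCE sketch follows: a function of a standard normal variable is expanded in probabilists' Hermite polynomials, which are orthogonal under the Gaussian density, with coefficients computed by Gauss-Hermite quadrature. The target function and truncation order are illustrative assumptions.

```python
# Hermite polynomial chaos expansion of f(X), X ~ N(0, 1) (illustrative).
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt, pi

f = lambda x: np.exp(0.5 * x)                # function of a standard normal
order = 6
nodes, weights = He.hermegauss(40)           # quadrature for weight e^{-x^2/2}
weights = weights / sqrt(2 * pi)             # normalize to the N(0,1) density

# c_k = E[f(X) He_k(X)] / k!  (He_k has norm k! under the N(0,1) density)
coeffs = np.array([
    np.sum(weights * f(nodes) * He.hermeval(nodes, np.eye(order + 1)[k]))
    / factorial(k)
    for k in range(order + 1)
])

x = 1.0
print(He.hermeval(x, coeffs), f(x))          # truncated PCE vs the true value
```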
Pedometric mapping, or statistical soil mapping, is data-driven generation of soil property and class maps that is based on use of statistical methods. Its main objectives are to predict values of some soil variable at unobserved locations, and to assess the uncertainty of that estimate using statistical inference, i.e. statistically optimal approaches. From the application point of view, its main objective is to accurately predict response of a soil-plant ecosystem to various soil management strategies—that is, to generate maps of soil properties and soil classes that can be used for other environmental models and decision-making. It is largely based on applying geostatistics in soil science, and other statistical methods used in pedometrics.
In applied statistics and geostatistics, regression-kriging (RK) is a spatial prediction technique that combines a regression of the dependent variable on auxiliary variables with interpolation (kriging) of the regression residuals. It is mathematically equivalent to the interpolation method variously called universal kriging and kriging with external drift, where auxiliary predictors are used directly to solve the kriging weights.
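A hedged regression-kriging sketch: fit a regression on an auxiliary variable, krige the residuals, and add the two parts. The covariance model, its range parameter, and the synthetic data are illustrative assumptions; the residual step here uses a simple-kriging solve for brevity.

```python
# Regression-kriging: regression trend plus kriged residuals (illustrative).
import numpy as np

rng = np.random.default_rng(9)
x = np.sort(rng.uniform(0, 10, 30))            # locations
aux = x / 10.0                                 # auxiliary predictor
z = 2.0 + 3.0 * aux + np.sin(x) * 0.5          # trend plus spatial residual

beta = np.polyfit(aux, z, 1)                   # regression on the auxiliary
resid = z - np.polyval(beta, aux)

def krige_resid(x_obs, r_obs, x0, rho=2.0):
    C = np.exp(-np.abs(x_obs[:, None] - x_obs[None, :]) / rho)
    c0 = np.exp(-np.abs(x_obs - x0) / rho)
    lam = np.linalg.solve(C + 1e-9 * np.eye(len(x_obs)), c0)  # kriging weights
    return lam @ r_obs

x0 = 5.5
pred = np.polyval(beta, x0 / 10.0) + krige_resid(x, resid, x0)
print(pred)  # trend estimate plus interpolated residual
```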
Mean-field particle methods are a broad class of interacting-type Monte Carlo algorithms for simulating from a sequence of probability distributions satisfying a nonlinear evolution equation. These flows of probability measures can always be interpreted as the distributions of the random states of a Markov process whose transition probabilities depend on the distributions of the current random states. A natural way to simulate these sophisticated nonlinear Markov processes is to sample a large number of copies of the process, replacing in the evolution equation the unknown distributions of the random states by the sampled empirical measures. In contrast with traditional Monte Carlo and Markov chain Monte Carlo methods, these mean-field particle techniques rely on sequential interacting samples. The terminology mean-field reflects the fact that each of the samples interacts with the empirical measures of the process. When the size of the system tends to infinity, these random empirical measures converge to the deterministic distribution of the random states of the nonlinear Markov chain, so that the statistical interaction between particles vanishes. In other words, starting with a chaotic configuration based on independent copies of the initial state of the nonlinear Markov chain model, the chaos propagates at any time horizon as the size of the system tends to infinity; that is, finite blocks of particles reduce to independent copies of the nonlinear Markov process. This result is called the propagation of chaos property. The terminology "propagation of chaos" originated with the work of Mark Kac in 1976 on a colliding mean-field kinetic gas model.
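A minimal sketch of the particle approximation for a toy mean-field dynamic dX = (E[X] - X) dt + sigma dW: the unknown expectation E[X] in the evolution equation is replaced by the empirical mean of N interacting particles. The dynamic and all parameters are illustrative assumptions.

```python
# Mean-field particle simulation of a toy McKean-Vlasov dynamic (illustrative).
import numpy as np

rng = np.random.default_rng(10)
n, dt, sigma, steps = 5000, 0.01, 0.5, 500
x = rng.normal(3.0, 2.0, n)                   # chaotic initial configuration
for _ in range(steps):
    m = x.mean()                              # empirical measure replaces E[X]
    x += (m - x) * dt + sigma * np.sqrt(dt) * rng.normal(size=n)
print(x.mean(), x.std())                      # particles contract around the mean
```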
André Georges Journel is a French American engineer who excelled in formulating and promoting geostatistics in the earth sciences and engineering, first from the Centre of Mathematical Morphology in Fontainebleau, France and later from Stanford University.
In statistics and machine learning, Gaussian process approximation is a computational method that accelerates inference tasks in the context of a Gaussian process model, most commonly likelihood evaluation and prediction. Like approximations of other models, they can often be expressed as additional assumptions imposed on the model, which do not correspond to any actual feature, but which retain its key properties while simplifying calculations. Many of these approximation methods can be expressed in purely linear algebraic or functional analytic terms as matrix or function approximations. Others are purely algorithmic and cannot easily be rephrased as a modification of a statistical model.
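One common linear-algebraic approximation of this kind is a low-rank (Nystrom-style) factorization of the kernel matrix built from a subset of inducing points; the following sketch compares it against the exact kernel matrix. The kernel, sizes, and inducing-point choice are illustrative assumptions.

```python
# Low-rank Nystrom approximation of a Gaussian-process kernel matrix.
import numpy as np

def rbf(a, b, length=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

rng = np.random.default_rng(11)
x = np.sort(rng.uniform(0, 10, 500))
u = x[::50]                                    # 10 inducing points
Kuu, Kxu = rbf(u, u), rbf(x, u)
K_approx = Kxu @ np.linalg.solve(Kuu + 1e-8 * np.eye(len(u)), Kxu.T)
K_exact = rbf(x, x)
print(np.abs(K_exact - K_approx).max())        # low-rank approximation error
```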