Optimal estimation

Last updated April 29, 2019

In applied statistics, optimal estimation is a regularized matrix inverse method based on Bayes' theorem. It is widely used in the geosciences, particularly for atmospheric sounding. A matrix inverse problem looks like this:

\mathbf {A} {\vec {x}}={\vec {y}}

The essential concept is to transform the matrix, A, into a conditional probability and the variables, ${\vec {x}}$ and ${\vec {y}}$, into probability distributions by assuming Gaussian statistics and using empirically determined covariance matrices.

Derivation

Typically, one expects the statistics of most measurements to be Gaussian. So, for example, for $P({\vec {y}}|{\vec {x}})$ we can write:

P({\vec {y}}|{\vec {x}})={\frac {1}{(2\pi )^{n/2}|{\boldsymbol {S_{y}}}|^{1/2}}}\exp \left[-{\frac {1}{2}}({\boldsymbol {A}}{\vec {x}}-{\vec {y}})^{T}{\boldsymbol {S_{y}}}^{-1}({\boldsymbol {A}}{\vec {x}}-{\vec {y}})\right]

where m and n are the numbers of elements in ${\vec {x}}$ and ${\vec {y}}$ respectively, ${\boldsymbol {A}}$ is the matrix to be solved (the linear or linearised forward model), and ${\boldsymbol {S_{y}}}$ is the covariance matrix of the vector ${\vec {y}}$ . The same can be done for ${\vec {x}}$ :

P({\vec {x}})={\frac {1}{(2\pi )^{m/2}|{\boldsymbol {S_{x_{a}}}}|^{1/2}}}\exp \left[-{\frac {1}{2}}({\vec {x}}-{\widehat {x_{a}}})^{T}{\boldsymbol {S_{x_{a}}}}^{-1}({\vec {x}}-{\widehat {x_{a}}})\right]

Here $P({\vec {x}})$ is taken to be the so-called "a-priori" distribution: ${\widehat {x_{a}}}$ denotes the a-priori values for ${\vec {x}}$ while ${\boldsymbol {S_{x_{a}}}}$ is its covariance matrix.

A convenient property of Gaussian distributions is that only two parameters, a mean and a covariance, are needed to describe them, and so the whole problem can be converted once again to matrices. Assume that $P({\vec {x}}|{\vec {y}})$ takes the following form:

P({\vec {x}}|{\vec {y}})={\frac {1}{(2\pi )^{m/2}|{\boldsymbol {S_{x}}}|^{1/2}}}\exp \left[-{\frac {1}{2}}({\vec {x}}-{\widehat {x}})^{T}{\boldsymbol {S_{x}}}^{-1}({\vec {x}}-{\widehat {x}})\right]

$P({\vec {y}})$ may be neglected since it does not depend on ${\vec {x}}$ and is therefore simply a constant scaling term. Now it is possible to solve for both the expectation value of ${\vec {x}}$ , ${\widehat {x}}$ , and for its covariance matrix by equating $P({\vec {x}}|{\vec {y}})$ with $P({\vec {y}}|{\vec {x}})P({\vec {x}})$ , up to that constant factor. This produces the following equations:

{\boldsymbol {S_{x}}}=({\boldsymbol {A}}^{T}{\boldsymbol {S_{y}^{-1}}}{\boldsymbol {A}}+{\boldsymbol {S_{x_{a}}^{-1}}})^{-1}

{\widehat {x}}={\widehat {x_{a}}}+{\boldsymbol {S_{x}}}{\boldsymbol {A}}^{T}{\boldsymbol {S_{y}}}^{-1}({\vec {y}}-{\boldsymbol {A}}{\widehat {x_{a}}})
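
The step from Bayes' theorem to these two equations can be sketched by comparing exponents. The following is an outline under the Gaussian assumptions stated above, not a full proof: take $-2\ln$ of both sides, drop terms that do not depend on ${\vec {x}}$ (collected in "const"), and match the quadratic and linear terms in ${\vec {x}}$.

```latex
% Exponent of P(y|x) P(x), expanded in powers of x:
(\boldsymbol{A}\vec{x}-\vec{y})^{T}\boldsymbol{S_{y}}^{-1}(\boldsymbol{A}\vec{x}-\vec{y})
  + (\vec{x}-\widehat{x_{a}})^{T}\boldsymbol{S_{x_{a}}}^{-1}(\vec{x}-\widehat{x_{a}})
  = \vec{x}^{T}\left(\boldsymbol{A}^{T}\boldsymbol{S_{y}}^{-1}\boldsymbol{A}
      + \boldsymbol{S_{x_{a}}}^{-1}\right)\vec{x}
  - 2\,\vec{x}^{T}\left(\boldsymbol{A}^{T}\boldsymbol{S_{y}}^{-1}\vec{y}
      + \boldsymbol{S_{x_{a}}}^{-1}\widehat{x_{a}}\right) + \text{const}

% Exponent of the assumed form of P(x|y):
(\vec{x}-\widehat{x})^{T}\boldsymbol{S_{x}}^{-1}(\vec{x}-\widehat{x})
  = \vec{x}^{T}\boldsymbol{S_{x}}^{-1}\vec{x}
  - 2\,\vec{x}^{T}\boldsymbol{S_{x}}^{-1}\widehat{x} + \text{const}

% Matching the quadratic and the linear terms:
\boldsymbol{S_{x}}^{-1} = \boldsymbol{A}^{T}\boldsymbol{S_{y}}^{-1}\boldsymbol{A}
  + \boldsymbol{S_{x_{a}}}^{-1},
\qquad
\boldsymbol{S_{x}}^{-1}\widehat{x} = \boldsymbol{A}^{T}\boldsymbol{S_{y}}^{-1}\vec{y}
  + \boldsymbol{S_{x_{a}}}^{-1}\widehat{x_{a}}

% Substituting S_{x_a}^{-1} = S_x^{-1} - A^T S_y^{-1} A into the second relation:
\widehat{x} = \widehat{x_{a}}
  + \boldsymbol{S_{x}}\boldsymbol{A}^{T}\boldsymbol{S_{y}}^{-1}
    (\vec{y}-\boldsymbol{A}\widehat{x_{a}})
```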

Because we are using Gaussians, the expected value coincides with the most probable value, that is, the maximum of the posterior distribution $P({\vec {x}}|{\vec {y}})$ , and so the method is also described as a form of maximum likelihood estimation.
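
As an illustration, the two retrieval equations can be evaluated directly with NumPy. This is a minimal sketch, not a reference implementation: the function name `optimal_estimate`, the toy forward model `A`, and all numerical values are made up for the example, and explicit matrix inverses are used only because they mirror the formulas (for large or poorly conditioned problems a linear solver would normally be preferred).

```python
import numpy as np

def optimal_estimate(A, y, x_a, S_y, S_xa):
    """Return (x_hat, S_x) for the linear Gaussian problem A x = y.

    A    : (n, m) forward-model matrix
    y    : (n,)   measurement vector
    x_a  : (m,)   a-priori state
    S_y  : (n, n) measurement covariance
    S_xa : (m, m) a-priori covariance
    """
    S_y_inv = np.linalg.inv(S_y)
    S_xa_inv = np.linalg.inv(S_xa)
    # Posterior covariance: S_x = (A^T S_y^-1 A + S_xa^-1)^-1
    S_x = np.linalg.inv(A.T @ S_y_inv @ A + S_xa_inv)
    # Posterior mean: x_hat = x_a + S_x A^T S_y^-1 (y - A x_a)
    x_hat = x_a + S_x @ A.T @ S_y_inv @ (y - A @ x_a)
    return x_hat, S_x

# Hypothetical toy problem: two measurements of a three-element state,
# so the measurements alone cannot determine the state.
A = np.array([[1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0]])
x_true = np.array([1.0, 2.0, 3.0])
y = A @ x_true                      # noise-free synthetic measurement
x_a = np.array([1.5, 1.5, 1.5])     # assumed a-priori state
S_y = 0.01 * np.eye(2)              # assumed measurement covariance
S_xa = np.eye(3)                    # assumed a-priori covariance

x_hat, S_x = optimal_estimate(A, y, x_a, S_y, S_xa)
print("retrieved state:", x_hat)
print("posterior variances:", np.diag(S_x))
```

Because the toy problem is under-determined, the a-priori term is what makes the matrix being inverted non-singular, which is the regularizing role described at the start of the article.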

Typically with optimal estimation, in addition to the vector of retrieved quantities, one extra matrix is returned along with the covariance matrix. This is sometimes called the resolution matrix or the averaging kernel and is calculated as follows:

{\boldsymbol {R}}=({\boldsymbol {A}}^{T}{\boldsymbol {S_{y}}}^{-1}{\boldsymbol {A}}+{\boldsymbol {S_{x_{a}}}}^{-1})^{-1}{\boldsymbol {A}}^{T}{\boldsymbol {S_{y}}}^{-1}{\boldsymbol {A}}

This tells us, for a given element of the retrieved vector, how much of the other elements of the vector are mixed in. In the case of a retrieval of profile information, it typically indicates the altitude resolution for a given altitude. For instance, if the resolution vectors for all the altitudes contain non-zero elements (to a numerical tolerance) in their four nearest neighbours, then the altitude resolution is only one fourth of that implied by the grid spacing.
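
The resolution matrix can be computed in the same way. The sketch below is illustrative only: `averaging_kernel` is a hypothetical helper name and the matrices are toy values chosen so that the result is easy to check by hand. With an identity forward model and equal, uncorrelated measurement and a-priori covariances, each retrieved element is an equal mix of measurement and prior, so the matrix is one half of the identity; with a very loose prior it approaches the identity.

```python
import numpy as np

def averaging_kernel(A, S_y, S_xa):
    """Return R = (A^T S_y^-1 A + S_xa^-1)^-1 A^T S_y^-1 A."""
    info = A.T @ np.linalg.inv(S_y) @ A
    # Solve (info + S_xa^-1) R = info rather than forming the outer inverse.
    return np.linalg.solve(info + np.linalg.inv(S_xa), info)

# Toy case: identity forward model, unit covariances (assumed values).
A = np.eye(3)
S_y = np.eye(3)
R_equal = averaging_kernel(A, S_y, np.eye(3))

# Same measurement, but a very loose prior (large a-priori variance).
R_loose = averaging_kernel(A, S_y, 1e6 * np.eye(3))

print(R_equal)
print(R_loose)
```

Each row of the returned matrix is the resolution vector for one retrieved element; off-diagonal entries show how much of the neighbouring elements is mixed in.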

References

Clive D. Rodgers (1976). "Retrieval of Atmospheric Temperature and Composition From Remote Measurements of Thermal Radiation". Reviews of Geophysics and Space Physics. 14 (4). p. 609.

Clive D. Rodgers (2000). Inverse Methods for Atmospheric Sounding: Theory and Practice. World Scientific.

Clive D. Rodgers (2002). "Atmospheric Remote Sensing: The Inverse Problem". Proceedings of the Fourth Oxford/RAL Spring School in Quantitative Earth Observation. University of Oxford.
