V-statistic

Last updated January 31, 2024

V-statistics are a class of statistics named for Richard von Mises, who developed their asymptotic distribution theory in a fundamental paper in 1947.^[1] V-statistics are closely related to U-statistics^[2]^[3] (U for "unbiased"), introduced by Wassily Hoeffding in 1948.^[4] A V-statistic is a statistical function (of a sample) defined by a particular statistical functional of a probability distribution.

Statistical functions

Statistics that can be represented as functionals $T(F_{n})$ of the empirical distribution function $F_{n}$ are called statistical functionals.^[5] Differentiability of the functional $T$ plays a key role in the von Mises approach; thus von Mises considers differentiable statistical functionals.^[1]

Examples of statistical functions

1. The k-th central moment is the functional $T(F)=\int (x-\mu )^{k}\,dF(x)$, where $\mu =E[X]$ is the expected value of X. The associated statistical function is the sample k-th central moment,
$T_{n}=m_{k}=T(F_{n})={\frac {1}{n}}\sum _{i=1}^{n}(x_{i}-{\overline {x}})^{k}.$
2. The chi-squared goodness-of-fit statistic is a statistical function $T(F_{n})$, corresponding to the statistical functional
$T(F)=\sum _{i=1}^{k}{\frac {(\int _{A_{i}}\,dF-p_{i})^{2}}{p_{i}}},$
where $A_{i}$ are the k cells and $p_{i}$ are the specified probabilities of the cells under the null hypothesis.
3. The Cramér–von Mises and Anderson–Darling goodness-of-fit statistics are based on the functional
$T(F)=\int (F(x)-F_{0}(x))^{2}\,w(x;F_{0})\,dF_{0}(x),$
where $w(x;F_{0})$ is a specified weight function and $F_{0}$ is a specified null distribution. If $w\equiv 1$ (the weight is constant) then $T(F_{n})$ is the well-known Cramér–von Mises goodness-of-fit statistic; if $w(x;F_{0})=[F_{0}(x)(1-F_{0}(x))]^{-1}$ then $T(F_{n})$ is the Anderson–Darling statistic.
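The plug-in idea behind the first two examples can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names and the encoding of the cells as half-open intervals `[lo, hi)` are assumptions made here, not part of the source.

```python
from statistics import fmean


def central_moment(sample, k):
    """Plug-in value T(F_n) of the k-th central moment functional."""
    mean = fmean(sample)
    return fmean([(x - mean) ** k for x in sample])


def chi_squared_functional(sample, cells, probs):
    """Plug-in value T(F_n) of the chi-squared functional.

    cells: list of (lo, hi) pairs read as half-open intervals [lo, hi)
           (an assumed encoding of the cells A_i)
    probs: null probabilities p_i, one per cell
    """
    n = len(sample)
    total = 0.0
    for (lo, hi), p in zip(cells, probs):
        # Integral of dF_n over the cell = fraction of observations in it.
        freq = sum(1 for x in sample if lo <= x < hi) / n
        total += (freq - p) ** 2 / p
    return total


sample = [1.0, 2.0, 3.0, 4.0]
m2 = central_moment(sample, 2)
chi = chi_squared_functional(sample, [(0.0, 1.5), (1.5, 5.0)], [0.5, 0.5])
```

Under this encoding, multiplying `chi_squared_functional` by the sample size gives the familiar Pearson form with observed and expected counts.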

Representation as a V-statistic

Suppose x₁, ..., x_n is a sample. In typical applications the statistical function has a representation as the V-statistic

$V_{mn}={\frac {1}{n^{m}}}\sum _{i_{1}=1}^{n}\cdots \sum _{i_{m}=1}^{n}h(x_{i_{1}},x_{i_{2}},\dots ,x_{i_{m}}),$

where h is a symmetric kernel function. Serfling^[6] discusses how to find the kernel in practice. $V_{mn}$ is called a V-statistic of degree m.

A symmetric kernel of degree 2 is a function h(x, y) such that h(x, y) = h(y, x) for all x and y in the domain of h. For a sample x₁, ..., x_n, the corresponding V-statistic is defined as

$V_{2,n}={\frac {1}{n^{2}}}\sum _{i=1}^{n}\sum _{j=1}^{n}h(x_{i},x_{j}).$
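The definition translates directly into code: average the kernel over all n^m index tuples, repetitions included. The sketch below is illustrative only. The name `v_statistic` is chosen here, and the brute-force loop costs O(n^m) kernel evaluations, so it suits small samples only.

```python
from itertools import product


def v_statistic(sample, kernel, m):
    """Degree-m V-statistic: mean of the kernel over all n**m tuples.

    product(sample, repeat=m) enumerates every (x_{i_1}, ..., x_{i_m}),
    including tuples that repeat the same observation.
    """
    n = len(sample)
    total = sum(kernel(*args) for args in product(sample, repeat=m))
    return total / n ** m


sample = [1.0, 2.0, 3.0, 4.0]

# Degree 1 with h(x) = x gives the sample mean.
mean = v_statistic(sample, lambda x: x, 1)

# Degree 2 with h(x, y) = x * y gives the squared sample mean.
mean_squared = v_statistic(sample, lambda x, y: x * y, 2)
```

Including the diagonal terms (i = j) is what separates a V-statistic from the corresponding U-statistic, which averages over distinct indices only.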

Example of a V-statistic

4. An example of a degree-2 V-statistic is the second central moment m₂. If h(x, y) = (x−y)²/2, the corresponding V-statistic is
$V_{2,n}={\frac {1}{n^{2}}}\sum _{i=1}^{n}\sum _{j=1}^{n}{\frac {1}{2}}(x_{i}-x_{j})^{2}={\frac {1}{n}}\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2},$
which is the maximum likelihood estimator of variance. With the same kernel, the corresponding U-statistic is the (unbiased) sample variance:
$s^{2}={n \choose 2}^{-1}\sum _{i<j}{\frac {1}{2}}(x_{i}-x_{j})^{2}={\frac {1}{n-1}}\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}.$
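Both identities are easy to check numerically. The following sketch uses made-up data and helper names chosen here (`v2`, `u2`), and compares the two statistics with the population and sample variance functions from Python's standard library.

```python
from itertools import combinations, product
from statistics import pvariance, variance


def h(x, y):
    """Variance kernel h(x, y) = (x - y)^2 / 2."""
    return 0.5 * (x - y) ** 2


def v2(sample):
    """V-statistic: average of h over all n^2 ordered pairs (i = j included)."""
    n = len(sample)
    return sum(h(x, y) for x, y in product(sample, repeat=2)) / n ** 2


def u2(sample):
    """U-statistic: average of h over the n-choose-2 pairs with i < j."""
    pairs = list(combinations(sample, 2))
    return sum(h(x, y) for x, y in pairs) / len(pairs)


data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # illustrative values only

v_matches_mle = abs(v2(data) - pvariance(data)) < 1e-12
u_matches_s2 = abs(u2(data) - variance(data)) < 1e-12
```

The two values differ only by the factor (n − 1)/n, which is why the V-statistic is biased for the variance while the U-statistic is not.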

Asymptotic distribution

In examples 1–3, the asymptotic distribution of the statistic is different: in (1) it is normal, in (2) it is chi-squared, and in (3) it is a weighted sum of chi-squared variables.

Von Mises' approach is a unifying theory that covers all of the cases above.^[1] Informally, the type of asymptotic distribution of a statistical function depends on the order of "degeneracy," which is determined by which term is the first non-vanishing term in the Taylor expansion of the functional T. In case it is the linear term, the limit distribution is normal; otherwise higher order types of distributions arise (under suitable conditions such that a central limit theorem holds).

There is a hierarchy of cases parallel to the asymptotic theory of U-statistics.^[7] Let A(m) be the property defined by:

A(m):

(i) Var(h(X₁, ..., X_k)) = 0 for k < m, and Var(h(X₁, ..., X_k)) > 0 for k = m;
(ii) $n^{m/2}R_{mn}$ tends to zero (in probability). ($R_{mn}$ is the remainder term in the Taylor series for T.)

Case m = 1 (Non-degenerate kernel):

If A(1) is true, the statistic is a sample mean and the Central Limit Theorem implies that T(F_n) is asymptotically normal.

In the variance example (4), m₂ is asymptotically normal with mean $\sigma ^{2}$ and variance $(\mu _{4}-\sigma ^{4})/n$ , where $\mu _{4}=E(X-E(X))^{4}$ .
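As a worked special case, under an assumption not made in the text above: if the observations are normally distributed with variance $\sigma ^{2}$, the fourth central moment is $\mu _{4}=3\sigma ^{4}$, and the asymptotic variance simplifies as follows.

```latex
\frac{\mu_4 - \sigma^4}{n}
  = \frac{3\sigma^4 - \sigma^4}{n}
  = \frac{2\sigma^4}{n},
\qquad\text{so}\qquad
\sqrt{n}\,\bigl(m_2 - \sigma^2\bigr) \xrightarrow{d} N\bigl(0,\; 2\sigma^4\bigr).
```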

Case m = 2 (Degenerate kernel):

Suppose A(2) is true, and $E[h^{2}(X_{1},X_{2})]<\infty ,\,E|h(X_{1},X_{1})|<\infty ,$ and $E[h(x,X_{1})]\equiv 0$. Then $nV_{2,n}$ converges in distribution to a weighted sum of independent chi-squared variables:

$nV_{2,n}{\stackrel {d}{\longrightarrow }}\sum _{k=1}^{\infty }\lambda _{k}Z_{k}^{2},$

where $Z_{k}$ are independent standard normal variables and $\lambda _{k}$ are constants that depend on the distribution F and the functional T. In this case the asymptotic distribution is called a quadratic form of centered Gaussian random variables. The statistic $V_{2,n}$ is called a degenerate kernel V-statistic. The V-statistic associated with the Cramér–von Mises functional^[1] (Example 3) is an example of a degenerate kernel V-statistic.^[8]
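A simple illustration, sketched under assumptions that are not in the text above: take the kernel $h(x,y)=xy$ and suppose $E[X]=0$ with finite variance $\sigma ^{2}>0$. Then $E[h(x,X_{1})]=x\,E[X_{1}]=0$ for every x, so the kernel is degenerate, and the moment conditions hold because $E[h^{2}(X_{1},X_{2})]=\sigma ^{4}$ and $E|h(X_{1},X_{1})|=\sigma ^{2}$. The V-statistic and its limit can be written out explicitly:

```latex
V_{2,n} = \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j = \bar{x}^{2},
\qquad
n V_{2,n} = \bigl(\sqrt{n}\,\bar{x}\bigr)^{2} \xrightarrow{d} \sigma^{2} Z_1^{2}.
```

The limit follows from the central limit theorem applied to $\sqrt{n}\,{\bar {x}}$, so in this special case the weighted sum has a single nonzero weight, $\lambda _{1}=\sigma ^{2}$.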

Notes

[1] von Mises (1947)
[2] Lee (1990)
[3] Koroljuk & Borovskich (1994)
[4] Hoeffding (1948)
[5] von Mises (1947), p. 309; Serfling (1980), p. 210.
[6] Serfling (1980, Section 6.5)
[7] Serfling (1980, Ch. 5–6); Lee (1990, Ch. 3)
[8] See Lee (1990, p. 160) for the kernel function.

References

Hoeffding, W. (1948). "A class of statistics with asymptotically normal distribution". Annals of Mathematical Statistics. 19 (3): 293–325. doi:10.1214/aoms/1177730196. JSTOR 2235637.
Koroljuk, V.S.; Borovskich, Yu.V. (1994). Theory of U-statistics (English translation by P.V. Malyshev and D.V. Malyshev from the 1989 Ukrainian ed.). Dordrecht: Kluwer Academic Publishers. ISBN 0-7923-2608-3.
Lee, A.J. (1990). U-Statistics: theory and practice. New York: Marcel Dekker, Inc. ISBN 0-8247-8253-4.
Neuhaus, G. (1977). "Functional limit theorems for U-statistics in the degenerate case". Journal of Multivariate Analysis. 7 (3): 424–439. doi:10.1016/0047-259X(77)90083-5.
Rosenblatt, M. (1952). "Limit theorems associated with variants of the von Mises statistic". Annals of Mathematical Statistics. 23 (4): 617–623. doi:10.1214/aoms/1177729341. JSTOR 2236587.
Serfling, R.J. (1980). Approximation theorems of mathematical statistics. New York: John Wiley & Sons. ISBN 0-471-02403-1.
Taylor, R.L.; Daffer, P.Z.; Patterson, R.F. (1985). Limit theorems for sums of exchangeable random variables. New Jersey: Rowman and Allanheld.
von Mises, R. (1947). "On the asymptotic distribution of differentiable statistical functions". Annals of Mathematical Statistics. 18 (2): 309–348. doi:10.1214/aoms/1177730385. JSTOR 2235734.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
