# U-statistic


In statistical theory, U-statistics are a class of statistics especially important in estimation theory; the letter "U" stands for unbiased. In elementary statistics, U-statistics arise naturally in producing minimum-variance unbiased estimators.


The theory of U-statistics allows a minimum-variance unbiased estimator to be derived from each unbiased estimator of an estimable parameter (alternatively, statistical functional) for large classes of probability distributions. [1] [2] An estimable parameter is a measurable function of the population's cumulative probability distribution: for example, for every probability distribution, the population median is an estimable parameter.

Many statistics originally derived for particular parametric families have been recognized as U-statistics for general distributions. In non-parametric statistics, the theory of U-statistics is used to establish the asymptotic normality and the finite-sample variance of statistical procedures such as estimators and tests. [3] The theory has been used to study more general statistics as well as stochastic processes, such as random graphs. [4] [5] [6]

Suppose that a problem involves independent and identically-distributed random variables and that estimation of a certain parameter is required. Suppose that a simple unbiased estimate can be constructed based on only a few observations: this defines the basic estimator based on a given number of observations. For example, a single observation is itself an unbiased estimate of the mean and a pair of observations can be used to derive an unbiased estimate of the variance. The U-statistic based on this estimator is defined as the average (across all combinatorial selections of the given size from the full set of observations) of the basic estimator applied to the sub-samples.
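The construction just described can be sketched in a few lines of Python; the helper names `u_statistic` and `pair_var` below are illustrative, not standard:

```python
from itertools import combinations
from statistics import variance

def u_statistic(xs, kernel, r):
    """Average a basic unbiased estimator (a kernel of r observations)
    over all size-r subsets of the full sample."""
    subsets = list(combinations(xs, r))
    return sum(kernel(*s) for s in subsets) / len(subsets)

def pair_var(a, b):
    """Unbiased estimate of the variance from just two observations."""
    return (a - b) ** 2 / 2

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(u_statistic(xs, pair_var, 2))  # matches statistics.variance(xs)
```

Averaging the two-observation estimator over all pairs reproduces the usual sample variance with divisor n − 1.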

Sen (1992) provides a review of the paper by Wassily Hoeffding (1948), which introduced U-statistics and set out the theory relating to them, and in doing so Sen outlines the importance U-statistics have in statistical theory. Sen says [7] "The impact of Hoeffding (1948) is overwhelming at the present time and is very likely to continue in the years to come". Note that the theory of U-statistics is not limited to the case of independent and identically-distributed random variables [8] or to scalar random variables. [9]

## Definition

The term U-statistic, due to Hoeffding (1948), is defined as follows.

Let ${\displaystyle f\colon R^{r}\to R}$ be a real-valued or complex-valued function of ${\displaystyle r}$ variables. For each ${\displaystyle n\geq r}$ the associated U-statistic ${\displaystyle f_{n}\colon R^{n}\to R}$ is the average of ${\displaystyle f}$ over ordered ${\displaystyle r}$-tuples of distinct indices ${\displaystyle \varphi (1),\ldots ,\varphi (r)}$ drawn from ${\displaystyle \{1,\ldots ,n\}}$. In other words, ${\displaystyle f_{n}(x_{1},\ldots ,x_{n})=\operatorname {ave} f(x_{\varphi (1)},\ldots ,x_{\varphi (r)})}$, the average being taken over all ${\displaystyle n(n-1)\cdots (n-r+1)}$ such tuples. Each U-statistic ${\displaystyle f_{n}(x_{1},\ldots ,x_{n})}$ is necessarily a symmetric function of its arguments.
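This definition can be implemented directly by averaging over ordered tuples of distinct indices; the sketch below (function name ours) also illustrates why ${\displaystyle f_{n}}$ is symmetric even when the kernel is not:

```python
from itertools import permutations

def u_stat_ordered(xs, f, r):
    """f_n(x_1, ..., x_n): average of f over all ordered r-tuples of
    distinct indices from {1, ..., n}, as in Hoeffding's definition."""
    tuples = list(permutations(range(len(xs)), r))
    return sum(f(*(xs[i] for i in t)) for t in tuples) / len(tuples)

# Even for a kernel that is not symmetric, the resulting f_n is a
# symmetric function of the observations, because the average runs
# over every ordering of each index set.
f = lambda a, b: a * a - b              # asymmetric in (a, b)
print(u_stat_ordered([1.0, 3.0, 2.0], f, 2))
print(u_stat_ordered([2.0, 3.0, 1.0], f, 2))  # same value
```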

U-statistics are very natural in statistical work, particularly in Hoeffding's context of independent and identically-distributed random variables, or more generally for exchangeable sequences, such as in simple random sampling from a finite population, where the defining property is termed 'inheritance on the average'.

Fisher's k-statistics and Tukey's polykays are examples of homogeneous polynomial U-statistics (Fisher, 1929; Tukey, 1950). For a simple random sample φ of size n taken from a population of size N, the U-statistic has the property that the average of the sample values ${\displaystyle f_{n}(x_{\varphi })}$ over all such samples is exactly equal to the population value ${\displaystyle f_{N}(x)}$.
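This 'inheritance on the average' property can be checked numerically. In the sketch below (function and variable names illustrative), the population value is the pair-kernel U-statistic computed on the whole population of size N = 5:

```python
from itertools import combinations

def u_statistic(xs, kernel, r):
    subs = list(combinations(xs, r))
    return sum(kernel(*s) for s in subs) / len(subs)

pop = [1.0, 2.0, 4.0, 8.0, 16.0]        # population, N = 5
kernel = lambda a, b: (a - b) ** 2 / 2  # variance kernel, r = 2
n = 3                                   # sample size

# Average the U-statistic over every simple random sample of size n:
samples = list(combinations(pop, n))
avg_over_samples = sum(u_statistic(s, kernel, 2)
                       for s in samples) / len(samples)

print(avg_over_samples)             # equals the population value ...
print(u_statistic(pop, kernel, 2))  # ... f_N, up to rounding
```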

## Examples

If ${\displaystyle f(x)=x}$, the U-statistic ${\displaystyle f_{n}(x)={\bar {x}}_{n}=(x_{1}+\cdots +x_{n})/n}$ is the sample mean.

If ${\displaystyle f(x_{1},x_{2})=|x_{1}-x_{2}|}$, the U-statistic is the mean pairwise deviation ${\displaystyle f_{n}(x_{1},\ldots ,x_{n})=2/(n(n-1))\sum _{i>j}|x_{i}-x_{j}|}$, defined for ${\displaystyle n\geq 2}$.

If ${\displaystyle f(x_{1},x_{2})=(x_{1}-x_{2})^{2}/2}$, the U-statistic is the sample variance ${\displaystyle f_{n}(x)=\sum (x_{i}-{\bar {x}}_{n})^{2}/(n-1)}$ with divisor ${\displaystyle n-1}$, defined for ${\displaystyle n\geq 2}$.

The third ${\displaystyle k}$-statistic ${\displaystyle k_{3,n}(x)=\sum (x_{i}-{\bar {x}}_{n})^{3}n/((n-1)(n-2))}$, the sample skewness defined for ${\displaystyle n\geq 3}$, is a U-statistic.
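Each of these identities can be verified numerically. The sketch below (function names ours) checks every kernel against its closed form; for the third k-statistic we assume the standard symmetrized cumulant kernel built from ${\displaystyle x_{1}^{3}-3x_{1}^{2}x_{2}+2x_{1}x_{2}x_{3}}$:

```python
from itertools import combinations

def u_stat(xs, kernel, r):
    subs = list(combinations(xs, r))
    return sum(kernel(*s) for s in subs) / len(subs)

xs = [1.0, 2.0, 4.0, 7.0]
n = len(xs)
m = sum(xs) / n

# f(x) = x  ->  sample mean
assert u_stat(xs, lambda a: a, 1) == m

# f(x1, x2) = |x1 - x2|  ->  mean pairwise deviation
mpd = 2 / (n * (n - 1)) * sum(abs(xs[i] - xs[j])
                              for i in range(n) for j in range(i))
assert abs(u_stat(xs, lambda a, b: abs(a - b), 2) - mpd) < 1e-12

# f(x1, x2) = (x1 - x2)^2 / 2  ->  sample variance with divisor n - 1
s2 = sum((x - m) ** 2 for x in xs) / (n - 1)
assert abs(u_stat(xs, lambda a, b: (a - b) ** 2 / 2, 2) - s2) < 1e-12

# Third k-statistic: symmetrization of the (assumed, standard)
# cumulant kernel x1^3 - 3*x1^2*x2 + 2*x1*x2*x3.
def k3_kernel(a, b, c):
    return ((a**3 + b**3 + c**3) / 3
            - (a*a*b + a*a*c + b*b*a + b*b*c + c*c*a + c*c*b) / 2
            + 2 * a * b * c)

k3 = n * sum((x - m) ** 3 for x in xs) / ((n - 1) * (n - 2))
assert abs(u_stat(xs, k3_kernel, 3) - k3) < 1e-9

print("all four example kernels verified")
```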

The following case highlights an important point. If ${\displaystyle f(x_{1},x_{2},x_{3})}$ is the median of three values, ${\displaystyle f_{n}(x_{1},\ldots ,x_{n})}$ is not the median of ${\displaystyle n}$ values. However, it is a minimum-variance unbiased estimate of the expected value of the median of three values, not of the population median. Similar estimates play a central role where the parameters of a family of probability distributions are being estimated by probability weighted moments or L-moments.
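A quick numerical illustration of this point (function name ours): with the median-of-three kernel, the U-statistic generally differs from the sample median.

```python
from itertools import combinations
from statistics import median

def u_stat(xs, kernel, r):
    subs = list(combinations(xs, r))
    return sum(kernel(*s) for s in subs) / len(subs)

xs = [1.0, 2.0, 3.0, 7.0, 8.0]

# Kernel: median of three observations.
med3_u = u_stat(xs, lambda a, b, c: sorted((a, b, c))[1], 3)
print(med3_u)       # 3.9 -- estimates E[median of three draws]
print(median(xs))   # 3.0 -- the sample median, a different statistic
```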

## Notes

1. Cox & Hinkley (1974), pp. 200, 258
2. Hoeffding (1948), between Eqs. (4.3) and (4.4)
3. Sen (1992)
4. Page 508 in Koroljuk, V. S.; Borovskich, Yu. V. (1994). Theory of U-statistics. Mathematics and its Applications. 273 (Translated by P. V. Malyshev and D. V. Malyshev from the 1989 Russian original ed.). Dordrecht: Kluwer Academic Publishers Group. pp. x+552. ISBN 0-7923-2608-3. MR 1472486.
5. Pages 381–382 in Borovskikh, Yu. V. (1996). U-statistics in Banach spaces. Utrecht: VSP. pp. xii+420. ISBN 90-6764-200-2. MR 1419498.
6. Page xii in Kwapień, Stanisƚaw; Woyczyński, Wojbor A. (1992). Random series and stochastic integrals: Single and multiple. Probability and its Applications. Boston, MA: Birkhäuser Boston, Inc. pp. xvi+360. ISBN 0-8176-3572-6. MR 1167198.
7. Sen (1992) p. 307
8. Sen (1992), p. 306
9. Borovskikh's last chapter discusses U-statistics for exchangeable random elements taking values in a vector space (separable Banach space).
