Empirical distribution function

Figure: The green curve, which asymptotically approaches heights of 0 and 1 without reaching them, is the true cumulative distribution function of the standard normal distribution. The grey hash marks represent the observations in a particular sample drawn from that distribution, and the horizontal steps of the blue step function (including the leftmost point in each step but not including the rightmost point) form the empirical distribution function of that sample.

In statistics, an empirical distribution function (commonly also called an empirical cumulative distribution function, eCDF) is the distribution function associated with the empirical measure of a sample. [1] This cumulative distribution function is a step function that jumps up by 1/n at each of the n data points. Its value at any specified value of the measured variable is the fraction of observations of the measured variable that are less than or equal to the specified value.

The empirical distribution function is an estimate of the cumulative distribution function that generated the points in the sample. It converges with probability 1 to that underlying distribution, according to the Glivenko–Cantelli theorem. A number of results exist to quantify the rate of convergence of the empirical distribution function to the underlying cumulative distribution function.

Definition

Let (X1, …, Xn) be independent, identically distributed real random variables with the common cumulative distribution function F(t). Then the empirical distribution function is defined as [2]

$$\widehat{F}_n(t) = \frac{\text{number of elements in the sample} \le t}{n} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{X_i \le t},$$

where $\mathbf{1}_{A}$ is the indicator of event A. For a fixed t, the indicator $\mathbf{1}_{X_i \le t}$ is a Bernoulli random variable with parameter p = F(t); hence $n\widehat{F}_n(t)$ is a binomial random variable with mean nF(t) and variance nF(t)(1 − F(t)). This implies that $\widehat{F}_n(t)$ is an unbiased estimator for F(t).
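For illustration, a minimal Python sketch of this definition (not taken from the cited references; the sample, seed, and evaluation points are arbitrary) computes $\widehat{F}_n(t)$ as the average of the indicators $\mathbf{1}_{X_i \le t}$:

```python
import numpy as np

def ecdf(sample, t):
    """Empirical distribution function: fraction of observations <= t."""
    sample = np.asarray(sample)
    t = np.atleast_1d(t)
    # Average of the indicators 1{X_i <= t}, evaluated for each value of t.
    return (sample[:, None] <= t[None, :]).mean(axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal(100)           # n = 100 draws from the standard normal
print(ecdf(x, [-1.0, 0.0, 1.0]))       # step-function values at three points
```

Because each indicator is a Bernoulli variable with parameter F(t), averaging them over the sample reflects the unbiasedness noted above.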

However, in some textbooks, the definition is given as

$$\widehat{F}_n(t) = \frac{1}{n+1} \sum_{i=1}^{n} \mathbf{1}_{X_i \le t}.$$ [3] [4]

Asymptotic properties

Since the ratio (n + 1)/n approaches 1 as n goes to infinity, the asymptotic properties of the two definitions that are given above are the same.

By the strong law of large numbers, the estimator $\widehat{F}_n(t)$ converges to F(t) as n → ∞ almost surely, for every value of t: [2]

$$\widehat{F}_n(t)\ \xrightarrow{\text{a.s.}}\ F(t),$$

thus the estimator $\widehat{F}_n(t)$ is consistent. This expression asserts the pointwise convergence of the empirical distribution function to the true cumulative distribution function. There is a stronger result, called the Glivenko–Cantelli theorem, which states that the convergence in fact happens uniformly over t: [5]

$$\|\widehat{F}_n - F\|_{\infty} \equiv \sup_{t\in\mathbb{R}} \bigl|\widehat{F}_n(t) - F(t)\bigr|\ \xrightarrow{\text{a.s.}}\ 0.$$

The sup-norm in this expression is called the Kolmogorov–Smirnov statistic for testing the goodness-of-fit between the empirical distribution and the assumed true cumulative distribution function F. Other norm functions may be reasonably used here instead of the sup-norm. For example, the L2-norm gives rise to the Cramér–von Mises statistic.
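For illustration, this sup-norm can be computed directly in Python and compared against SciPy's built-in test (a sketch only; the sample is arbitrary, and a standard normal reference distribution is assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.sort(rng.standard_normal(200))
n = len(x)

# The supremum of |F_n(t) - F(t)| is attained at a jump of the ECDF, so it is
# enough to compare F(x_(i)) with the step heights i/n and (i-1)/n.
cdf_vals = stats.norm.cdf(x)
d_plus = np.max(np.arange(1, n + 1) / n - cdf_vals)
d_minus = np.max(cdf_vals - np.arange(0, n) / n)

print(max(d_plus, d_minus))                  # Kolmogorov–Smirnov statistic
print(stats.kstest(x, "norm").statistic)     # agrees with the manual computation
```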

The asymptotic distribution can be further characterized in several different ways. First, the central limit theorem states that pointwise, $\widehat{F}_n(t)$ has an asymptotically normal distribution with the standard $\sqrt{n}$ rate of convergence: [2]

$$\sqrt{n}\,\bigl(\widehat{F}_n(t) - F(t)\bigr)\ \xrightarrow{d}\ \mathcal{N}\bigl(0,\, F(t)\bigl(1 - F(t)\bigr)\bigr).$$
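A quick Monte Carlo check of this pointwise normal approximation (purely illustrative; the standard normal distribution, the point t, and the sample size are arbitrary choices) compares the simulated variance of $\sqrt{n}\,(\widehat{F}_n(t) - F(t))$ with $F(t)(1 - F(t))$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
t, n, reps = 0.5, 500, 20_000
F_t = stats.norm.cdf(t)

# Each row is one replication: draw a sample of size n and evaluate F_n(t).
Fn_t = (rng.standard_normal((reps, n)) <= t).mean(axis=1)
z = np.sqrt(n) * (Fn_t - F_t)

print(z.var())           # close to the asymptotic variance F(t)(1 - F(t))
print(F_t * (1 - F_t))
```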

This result is extended by Donsker's theorem, which asserts that the empirical process $\sqrt{n}\,(\widehat{F}_n - F)$, viewed as a function indexed by $t \in \mathbb{R}$, converges in distribution in the Skorokhod space $D(-\infty, \infty)$ to the mean-zero Gaussian process $G_F = B \circ F$, where B is the standard Brownian bridge. [5] The covariance structure of this Gaussian process is

$$\operatorname{Cov}\bigl[G_F(t_1),\, G_F(t_2)\bigr] = F(t_1 \wedge t_2) - F(t_1)\,F(t_2).$$
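The covariance formula can likewise be checked by simulation (an illustrative sketch only; the two evaluation points and the standard normal distribution are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
t1, t2, n, reps = -0.5, 1.0, 400, 20_000
F1, F2 = stats.norm.cdf([t1, t2])

samples = rng.standard_normal((reps, n))
G1 = np.sqrt(n) * ((samples <= t1).mean(axis=1) - F1)   # scaled empirical process at t1
G2 = np.sqrt(n) * ((samples <= t2).mean(axis=1) - F2)   # scaled empirical process at t2

print(np.mean(G1 * G2))          # simulated covariance
print(min(F1, F2) - F1 * F2)     # F(t1 ∧ t2) - F(t1) F(t2)
```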

The uniform rate of convergence in Donsker's theorem can be quantified by the result known as the Hungarian embedding: [6]

$$\limsup_{n\to\infty} \frac{\sqrt{n}}{\ln^2 n}\, \bigl\| \sqrt{n}\,(\widehat{F}_n - F) - G_{F,n} \bigr\|_{\infty} < \infty \quad \text{almost surely,}$$

where $G_{F,n}$ denotes a version of the Gaussian process $G_F$ constructed on the same probability space as $\widehat{F}_n$.

Alternatively, the rate of convergence of $\sqrt{n}\,(\widehat{F}_n - F)$ can also be quantified in terms of the asymptotic behavior of the sup-norm of this expression. A number of results exist in this vein; for example, the Dvoretzky–Kiefer–Wolfowitz inequality provides a bound on the tail probabilities of $\sqrt{n}\,\|\widehat{F}_n - F\|_{\infty}$: [6]

$$\Pr\Bigl(\sqrt{n}\,\|\widehat{F}_n - F\|_{\infty} > z\Bigr) \le 2 e^{-2 z^2}.$$

In fact, Kolmogorov has shown that if the cumulative distribution function F is continuous, then the expression $\sqrt{n}\,\|\widehat{F}_n - F\|_{\infty}$ converges in distribution to $\|B\|_{\infty}$, the supremum of the absolute value of a standard Brownian bridge, which has the Kolmogorov distribution that does not depend on the form of F.

Another result, which follows from the law of the iterated logarithm, is that [6]

$$\limsup_{n\to\infty} \frac{\sqrt{n}\,\|\widehat{F}_n - F\|_{\infty}}{\sqrt{2 \ln\ln n}} = \frac{1}{2} \quad \text{almost surely,}$$

and

$$\liminf_{n\to\infty} \sqrt{2 n \ln\ln n}\;\|\widehat{F}_n - F\|_{\infty} = \frac{\pi}{2} \quad \text{almost surely.}$$

Confidence intervals

Figure: Empirical CDF, CDF and confidence interval plots for various sample sizes of the normal distribution.
Figure: Empirical CDF, CDF and confidence interval plots for various sample sizes of the Cauchy distribution.
Figure: Empirical CDF, CDF and confidence interval plots for various sample sizes of the triangle distribution.

As per the Dvoretzky–Kiefer–Wolfowitz inequality, the interval that contains the true CDF, F(x), with probability 1 − α is specified as

$$\widehat{F}_n(x) - \varepsilon \le F(x) \le \widehat{F}_n(x) + \varepsilon, \qquad \text{where } \varepsilon = \sqrt{\frac{\ln\frac{2}{\alpha}}{2n}}.$$
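A minimal Python sketch of this band (illustrative only; the sample, the evaluation grid, and α = 0.05 are arbitrary, and the band is clipped to [0, 1]):

```python
import numpy as np

def dkw_band(sample, t, alpha=0.05):
    """ECDF values together with a simultaneous 1 - alpha DKW confidence band."""
    sample = np.asarray(sample)
    t = np.atleast_1d(t)
    n = len(sample)
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    F_n = (sample[:, None] <= t[None, :]).mean(axis=0)
    return F_n, np.clip(F_n - eps, 0.0, 1.0), np.clip(F_n + eps, 0.0, 1.0)

rng = np.random.default_rng(4)
x = rng.standard_normal(100)
print(dkw_band(x, np.linspace(-3, 3, 7)))
```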

Using these bounds, the empirical CDF, the true CDF, and the corresponding confidence band can be plotted for different distributions with any of the statistical implementations listed below.

Statistical implementation

A non-exhaustive list of software implementations of the empirical distribution function includes:

  * In R, the ecdf function (in the base stats package) computes and plots the empirical distribution function of a sample.
  * In MATLAB, the ecdf function (Statistics and Machine Learning Toolbox) returns the empirical cumulative distribution function.
  * In Python, statsmodels provides statsmodels.distributions.empirical_distribution.ECDF, and SciPy provides scipy.stats.ecdf.
  * Matplotlib, since version 3.8.0, provides the Axes.ecdf plotting method. [7]
  * Seaborn provides the ecdfplot function.
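As a usage sketch based on the packages listed above (assuming statsmodels is installed and Matplotlib is at least version 3.8), the ECDF class evaluates the step function at arbitrary points and Matplotlib can draw it directly:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF

rng = np.random.default_rng(5)
x = rng.standard_normal(100)

F_n = ECDF(x)                   # callable step function
print(F_n([-1.0, 0.0, 1.0]))    # ECDF evaluated at a few points

fig, ax = plt.subplots()
ax.ecdf(x)                      # Axes.ecdf requires Matplotlib 3.8 or later [7]
plt.show()
```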

See also

References

  1. Dekking, Michel; et al. (2005). A Modern Introduction to Probability and Statistics: Understanding Why and How. London: Springer. p. 219. ISBN 978-1-85233-896-1. OCLC 262680588.
  2. van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press. p. 265. ISBN 0-521-78450-6.
  3. Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. Springer. p. 36, Definition 2.4. ISBN 978-1-4471-3675-0.
  4. Madsen, H.O.; Krenk, S.; Lind, S.C. (2006). Methods of Structural Safety. Dover Publications. pp. 148–149. ISBN 0486445976.
  5. van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press. p. 266. ISBN 0-521-78450-6.
  6. van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press. p. 268. ISBN 0-521-78450-6.
  7. "What's new in Matplotlib 3.8.0 (Sept 13, 2023)". Matplotlib 3.8.3 documentation.

Further reading