Range (statistics)

Last updated

In statistics, the range of a set of data is the difference between the largest and smallest values. Statistics is a branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation. In applying statistics to, for example, a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model process to be studied. Populations can be diverse topics such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments. See glossary of probability and statistics.

Contents

However, in descriptive statistics, this concept of range has a more complex meaning. The range is the size of the smallest interval (statistics) which contains all the data and provides an indication of statistical dispersion. It is measured in the same units as the data. Since it only depends on two of the observations, it is most useful in representing the dispersion of small data sets. 

A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information, while descriptive statistics in the mass noun sense is the process of using and analyzing those statistics. Descriptive statistics is distinguished from inferential statistics, in that descriptive statistics aims to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, is not developed on the basis of probability theory, and are frequently nonparametric statistics. Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example, in papers reporting on human subjects, typically a table is included giving the overall sample size, sample sizes in important subgroups, and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, the proportion of subjects with related comorbidities, etc. In statistics, dispersion is the extent to which a distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance, standard deviation, and interquartile range.

For continuous IID random variables

For n independent and identically distributed continuous random variables X1, X2, ..., Xn with cumulative distribution function G(x) and probability density function g(x). Let T denote the range of a sample of size n from a population with distribution function G(x).

In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent. This property is usually abbreviated as i.i.d. or iid or IID. Herein, i.i.d. is used, because it is the most prevalent. In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable , or just distribution function of , evaluated at , is the probability that will take a value less than or equal to . In probability theory, a probability density function (PDF), or density of a continuous random variable, is a function whose value at any given sample in the sample space can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample. In other words, while the absolute likelihood for a continuous random variable to take on any particular value is 0, the value of the PDF at two different samples can be used to infer, in any particular draw of the random variable, how much more likely it is that the random variable would equal one sample compared to the other sample.

Distribution

The range has cumulative distribution function  

$F(t)=n\int _{-\infty }^{\infty }g(x)[G(x+t)-G(x)]^{n-1}{\text{d}}x.$ Gumbel notes that the "beauty of this formula is completely marred by the facts that, in general, we cannot express G(x + t) by G(x), and that the numerical integration is lengthy and tiresome." 

Emil Julius Gumbel was a German mathematician and political writer.

If the distribution of each Xi is limited to the right (or left) then the asymptotic distribution of the range is equal to the asymptotic distribution of the largest (smallest) value. For more general distributions the asymptotic distribution can be expressed as a Bessel function. Bessel functions, first defined by the mathematician Daniel Bernoulli and then generalized by Friedrich Bessel, are the canonical solutions y(x) of Bessel's differential equation

Moments

The mean range is given by 

$n\int _{0}^{1}x(G)[G^{n-1}-(1-G)^{n-1}]\,{\text{d}}G$ where x(G) is the inverse function. In the case where each of the Xi has a standard normal distribution, the mean range is given by 

$\int _{-\infty }^{\infty }(1-(1-\Phi (x))^{n}-\Phi (x)^{n})\,{\text{d}}x.$ For continuous non-IID random variables

For n nonidentically distributed independent continuous random variables X1, X2, ..., Xn with cumulative distribution functions G1(x), G2(x), ..., Gn(x) and probability density functions g1(x), g2(x), ..., gn(x), the range has cumulative distribution function 

$F(t)=\sum _{i=1}^{n}\int _{-\infty }^{\infty }g_{i}(x)\prod _{j=1,j\neq i}^{n}[G_{j}(x+t)-G_{j}(x)]\,{\text{d}}x.$ For discrete IID random variables

For n independent and identically distributed discrete random variables X1, X2, ..., Xn with cumulative distribution function G(x) and probability mass function g(x) the range of the Xi is the range of a sample of size n from a population with distribution function G(x). We can assume without loss of generality that the support of each Xi is {1,2,3,...,N} where N is a positive integer or infinity.  

Distribution

The range has probability mass function   

f(t)={\begin{cases}\sum _{x=1}^{N}[g(x)]^{n}&t=0\\[6pt]\sum _{x=1}^{N-t}\left({\begin{alignedat}{2}&[G(x+t)-G(x-1)]^{n}\\{}-{}&[G(x+t)-G(x)]^{n}\\{}-{}&[G(x+t-1)-G(x-1)]^{n}\\{}+{}&[G(x+t-1)-G(x)]^{n}\\\end{alignedat}}\right)&t=1,2,3\ldots ,N-1.\end{cases}} Example

If we suppose that g(x) = 1/N, the discrete uniform distribution for all x, then we find  

$f(t)={\begin{cases}{\frac {1}{N^{n-1}}}&t=0\\[4pt]\sum _{x=1}^{N-t}\left([{\frac {t+1}{N}}]^{n}-2[{\frac {t}{N}}]^{n}+[{\frac {t-1}{N}}]^{n}\right)&t=1,2,3\ldots ,N-1.\end{cases}}$ Derivation

The probability of having a specific range value, t, can be determined by adding the probabilities of having two samples differing by t, and every other sample having a value between the two extremes. The probability of one sample having a value of x is $n*g\left(x\right)$ . The probability of another having a value t greater than x is:

$\left(n-1\right)g\left(x+t\right)$ .

The probability of all other values lying between these two extremes is:

$\left(\int _{x}^{x+t}g\left(x\right){\text{d}}x\right)^{n-2}=\left(G\left(x+t\right)-G\left(x\right)\right)^{n-2}$ .

Combining the three together yields:

$f(t)=n\left(n-1\right)\int _{-\infty }^{\infty }g\left(x\right)g\left(x+t\right)\left[G\left(x+t\right)-G\left(x\right)\right]^{n-2}{\text{d}}x$ The range is a simple function of the sample maximum and minimum and these are specific examples of order statistics. In particular, the range is a linear function of order statistics, which brings it into the scope of L-estimation.

Related Research Articles The Cauchy distribution, named after Augustin Cauchy, is a continuous probability distribution. It is also known, especially among physicists, as the Lorentz distribution, Cauchy–Lorentz distribution, Lorentz(ian) function, or Breit–Wigner distribution. The Cauchy distribution is the distribution of the x-intercept of a ray issuing from with a uniformly distributed angle. It is also the distribution of the ratio of two independent normally distributed random variables if the denominator distribution has mean zero. In statistics, the Kolmogorov–Smirnov test is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution, or to compare two samples. It is named after Andrey Kolmogorov and Nikolai Smirnov. In economics, the Lorenz curve is a graphical representation of the distribution of income or of wealth. It was developed by Max O. Lorenz in 1905 for representing inequality of the wealth distribution. The median is the value separating the higher half from the lower half of a data sample. For a data set, it may be thought of as the "middle" value. For example, in the data set {1, 3, 3, 6, 7, 8, 9}, the median is 6, the fourth largest, and also the fourth smallest, number in the sample. For a continuous probability distribution, the median is the value such that a number is equally likely to fall above or below it.

Probability theory is the branch of mathematics concerned with probability. Although there are several different probability interpretations, probability theory treats the concept in a rigorous mathematical manner by expressing it through a set of axioms. Typically these axioms formalise probability in terms of a probability space, which assigns a measure taking values between 0 and 1, termed the probability measure, to a set of outcomes called the sample space. Any specified subset of these outcomes is called an event.

In probability theory and statistics, a probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment. In more technical terms, the probability distribution is a description of a random phenomenon in terms of the probabilities of events. For instance, if the random variable X is used to denote the outcome of a coin toss, then the probability distribution of X would take the value 0.5 for X = heads, and 0.5 for X = tails. Examples of random phenomena can include the results of an experiment or survey. In mathematics, the Dirac delta function is a generalized function or distribution introduced by the physicist Paul Dirac. It is used to model the density of an idealized point mass or point charge as a function equal to zero everywhere except for zero and whose integral over the entire real line is equal to one. As there is no function that has these properties, the computations made by the theoretical physicists appeared to mathematicians as nonsense until the introduction of distributions by Laurent Schwartz to formalize and validate the computations. As a distribution, the Dirac delta function is a linear functional that maps every function to its value at zero. The Kronecker delta function, which is usually defined on a discrete domain and takes values 0 and 1, is a discrete analog of the Dirac delta function.

In probability theory and statistics, the moment-generating function of a real-valued random variable is an alternative specification of its probability distribution. Thus, it provides the basis of an alternative route to analytical results compared with working directly with probability density functions or cumulative distribution functions. There are particularly simple results for the moment-generating functions of distributions defined by the weighted sums of random variables. However, not all random variables have moment-generating functions.

In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the form:

In mathematics, a moment is a specific quantitative measure of the shape of a function. It is used in both mechanics and statistics. If the function represents physical density, then the zeroth moment is the total mass, the first moment divided by the total mass is the center of mass, and the second moment is the rotational inertia. If the function is a probability distribution, then the zeroth moment is the total probability, the first moment is the mean, the second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis. The mathematical concept is closely related to the concept of moment in physics.

The Gram–Charlier A series, and the Edgeworth series are series that approximate a probability distribution in terms of its cumulants. The series are the same; but, the arrangement of terms differ. The key idea of these expansions is to write the characteristic function of the distribution whose probability density function f is to be approximated in terms of the characteristic function of a distribution with known and suitable properties, and to recover f through the inverse Fourier transform. In statistics, an empirical distribution function is the distribution function associated with the empirical measure of a sample. This cumulative distribution function is a step function that jumps up by 1/n at each of the n data points. Its value at any specified value of the measured variable is the fraction of observations of the measured variable that are less than or equal to the specified value. In probability theory and statistics, the characteristic function of any real-valued random variable completely defines its probability distribution. If a random variable admits a probability density function, then the characteristic function is the Fourier transform of the probability density function. Thus it provides the basis of an alternative route to analytical results compared with working directly with probability density functions or cumulative distribution functions. There are particularly simple results for the characteristic functions of distributions defined by the weighted sums of random variables.

The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution. In its basic form, the test assumes that there are no parameters to be estimated in the distribution being tested, in which case the test and its set of critical values is distribution-free. However, the test is most often used in contexts where a family of distributions is being tested, in which case the parameters of that family need to be estimated and account must be taken of this in adjusting either the test-statistic or its critical values. When applied to testing whether a normal distribution adequately describes a set of data, it is one of the most powerful statistical tools for detecting most departures from normality. K-sample Anderson–Darling tests are available for testing whether several collections of observations can be modelled as coming from a single population, where the distribution function does not have to be specified.

Differential entropy is a concept in information theory that began as an attempt by Shannon to extend the idea of (Shannon) entropy, a measure of average surprisal of a random variable, to continuous probability distributions. Unfortunately, Shannon did not derive this formula, and rather just assumed it was the correct continuous analogue of discrete entropy, but it is not. The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy is commonly encountered in the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete entropy.

In probability theory, an empirical process is a stochastic process that describes the proportion of objects in a system in a given state. For a process in a discrete state space a population continuous time Markov chain or Markov population model is a process which counts the number of objects in a given state . In mean field theory, limit theorems are considered and generalise the central limit theorem for empirical measures. Applications of the theory of empirical processes arise in non-parametric statistics. In probability and statistics, studentized range distribution is the continuous probability distribution of the studentized range of an i.i.d. sample from a normally distributed population.

A product distribution is a probability distribution constructed as the distribution of the product of random variables having two other known distributions. Given two statistically independent random variables X and Y, the distribution of the random variable Z that is formed as the product

1. George Woodbury (2001). An Introduction to Statistics. Cengage Learning. p. 74. ISBN   0534377556.
2. Carin Viljoen (2000). Elementary Statistics: Vol 2. Pearson South Africa. pp. 7–27. ISBN   186891075X.
3. E. J. Gumbel (1947). "The Distribution of the Range". The Annals of Mathematical Statistics. 18 (3): 384–412. doi:10.1214/aoms/1177730387. JSTOR   2235736.
4. Tsimashenka, I.; Knottenbelt, W.; Harrison, P. (2012). "Controlling Variability in Split-Merge Systems". Analytical and Stochastic Modeling Techniques and Applications (PDF). Lecture Notes in Computer Science. 7314. p. 165. doi:10.1007/978-3-642-30782-9_12. ISBN   978-3-642-30781-2.
5. H. O. Hartley; H. A. David (1954). "Universal Bounds for Mean Range and Extreme Observation". The Annals of Mathematical Statistics. 25 (1): 85–99. doi:10.1214/aoms/1177728848. JSTOR   2236514.
6. L. H. C. Tippett (1925). "On the Extreme Individuals and the Range of Samples Taken from a Normal Population". Biometrika. 17 (3/4): 364–387. doi:10.1093/biomet/17.3-4.364. JSTOR   2332087.
7. Evans, D. L.; Leemis, L. M.; Drew, J. H. (2006). "The Distribution of Order Statistics for Discrete Random Variables with Applications to Bootstrapping". INFORMS Journal on Computing. 18: 19. doi:10.1287/ijoc.1040.0105.
8. Irving W. Burr (1955). "Calculation of Exact Sampling Distribution of Ranges from a Discrete Population". The Annals of Mathematical Statistics. 26 (3): 530–532. doi:10.1214/aoms/1177728500. JSTOR   2236482.
9. Abdel-Aty, S. H. (1954). "Ordered variables in discontinuous distributions". Statistica Neerlandica. 8 (2): 61–82. doi:10.1111/j.1467-9574.1954.tb00442.x.
10. Siotani, M. (1956). "Order statistics for discrete case with a numerical application to the binomial distribution". Annals of the Institute of Statistical Mathematics. 8: 95–96. doi:10.1007/BF02863574.
11. Paul R. Rider (1951). "The Distribution of the Range in Samples from a Discrete Rectangular Population". Journal of the American Statistical Association . 46 (255): 375–378. doi:10.1080/01621459.1951.10500796. JSTOR   2280515.