Mid-range

In statistics, the mid-range or mid-extreme is a measure of central tendency of a sample, defined as the arithmetic mean of the maximum and minimum values of the data set: [1]

\[
M = \frac{\max_i x_i + \min_i x_i}{2}.
\]

The mid-range is closely related to the range, a measure of statistical dispersion defined as the difference between the maximum and minimum values. The two measures are complementary in the sense that if one knows the mid-range and the range, one can find the sample maximum and minimum values.
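Concretely, if M denotes the mid-range and R the range of a sample, the two extremes are recovered as

\[
\max_i x_i = M + \frac{R}{2}, \qquad \min_i x_i = M - \frac{R}{2}.
\]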

The mid-range is rarely used in practical statistical analysis, as it lacks efficiency as an estimator for most distributions of interest, because it ignores all intermediate points, and lacks robustness, as outliers change it significantly. Indeed, for many distributions it is one of the least efficient and least robust statistics. However, it finds some use in special cases: it is the maximally efficient estimator for the center of a uniform distribution, trimmed mid-ranges address robustness, and as an L-estimator, it is simple to understand and compute.

Robustness

The midrange is highly sensitive to outliers and ignores all but two data points. It is therefore a very non-robust statistic, having a breakdown point of 0, meaning that a single observation can change it arbitrarily. Increasing the sample maximum or decreasing the sample minimum by x changes the mid-range by x/2, while it changes the sample mean, which also has a breakdown point of 0, by only x/n. It is thus of little use in practical statistics, unless outliers are already handled.
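A minimal sketch of this sensitivity (the data set and the shift x = 10 are arbitrary choices for illustration):

```python
def midrange(xs):
    """Arithmetic mean of the sample maximum and minimum."""
    return (max(xs) + min(xs)) / 2

def mean(xs):
    return sum(xs) / len(xs)

data = [2.0, 3.0, 4.0, 5.0, 6.0]         # n = 5; midrange = mean = 4.0
shifted = data[:-1] + [data[-1] + 10.0]  # increase the maximum by x = 10

print(midrange(shifted) - midrange(data))  # 5.0 = x/2
print(mean(shifted) - mean(data))          # 2.0 = x/n
```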

A trimmed midrange is known as a midsummary – the n% trimmed midrange is the average of the n-th and (100 − n)-th percentiles, and is more robust, having a breakdown point of n%. In the middle of these is the midhinge, which is the 25% midsummary. The median can be interpreted as the fully trimmed (50%) mid-range; this accords with the convention that the median of an even number of data points is the mean of the two middle points.

These trimmed midranges are also of interest as descriptive statistics or as L-estimators of central location or skewness: differences of midsummaries, such as midhinge minus the median, give measures of skewness at different points in the tail. [2]
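A minimal sketch of the midsummary in Python, using the standard library's quantiles helper; the "inclusive" percentile convention and the sample data are assumptions for illustration, and other conventions shift small-sample results slightly:

```python
from statistics import median, quantiles

def midsummary(xs, p):
    """p% trimmed mid-range: average of the p-th and (100 - p)-th percentiles."""
    if p == 0:
        return (min(xs) + max(xs)) / 2               # plain mid-range
    cuts = quantiles(xs, n=100, method="inclusive")  # 99 percentile cut points
    return (cuts[p - 1] + cuts[100 - p - 1]) / 2

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]          # one large outlier
print(midsummary(data, 0))                  # mid-range: 50.5, dragged by the outlier
print(midsummary(data, 25))                 # midhinge: 5.5, barely affected
print(midsummary(data, 25) - median(data))  # a simple skewness measure: 0.0 here
```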

Efficiency

Despite these drawbacks, the midrange is useful in some cases: it is a highly efficient estimator of μ given a small sample from a sufficiently platykurtic distribution, but it is inefficient for mesokurtic distributions, such as the normal.

For example, for a continuous uniform distribution with unknown maximum and minimum, the mid-range is the uniformly minimum-variance unbiased (UMVU) estimator of the mean. The sample maximum and sample minimum, together with the sample size, form a sufficient statistic for the population maximum and minimum: conditional on a given maximum and minimum, the remaining points are simply uniformly distributed between them, and thus add no information. See German tank problem for further discussion. The mid-range, being an unbiased estimator based on this sufficient statistic, is therefore the UMVU estimator: using the sample mean instead just adds noise from the uninformative distribution of points within this range.
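This gap is easy to see by simulation; the sketch below (sample size and trial count are arbitrary) compares the Monte Carlo variance of the two estimators of the centre of the uniform distribution on [0, 1]:

```python
import random

def midrange(xs):
    return (max(xs) + min(xs)) / 2

def mc_variance(estimator, n, trials=20_000):
    """Monte Carlo variance of an estimator over samples of size n from U(0, 1)."""
    est = [estimator([random.random() for _ in range(n)]) for _ in range(trials)]
    m = sum(est) / trials
    return sum((e - m) ** 2 for e in est) / trials

n = 20
print(mc_variance(midrange, n))                      # ≈ 1/(2(n+1)(n+2)) ≈ 0.0011
print(mc_variance(lambda xs: sum(xs) / len(xs), n))  # ≈ 1/(12n)         ≈ 0.0042
```

For the uniform case the midrange's variance decays like 1/n², against 1/n for the sample mean.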

Conversely, for the normal distribution, the sample mean is the UMVU estimator of the mean. Thus for platykurtic distributions, which can often be thought of as lying between the uniform and the normal, the informativeness of the middle sample points relative to the extreme values varies from "equal" for the normal to "uninformative" for the uniform, and for different distributions one or the other (or some combination of the two) may be most efficient. A robust analog is the trimean, which averages the midhinge (25% trimmed mid-range) and the median.
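A matching sketch of the trimean (the same caveat about the quantile convention applies):

```python
from statistics import median, quantiles

def trimean(xs):
    """Average of the midhinge (mean of the two quartiles) and the median."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    return ((q1 + q3) / 2 + median(xs)) / 2

print(trimean([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))  # 5.5: the outlier has little pull
```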

Small samples

For small sample sizes (n from 4 to 20) drawn from a sufficiently platykurtic distribution (negative excess kurtosis, defined as γ2 = μ4/μ2² − 3), the mid-range is an efficient estimator of the mean μ. The following table summarizes empirical data comparing three estimators of the mean for distributions of varied kurtosis; the modified mean is the truncated mean, where the maximum and minimum are eliminated. [3] [4]

Excess kurtosis (γ2)    Most efficient estimator of μ
−1.2 to −0.8            Midrange
−0.8 to 2.0             Mean
2.0 to 6.0              Modified mean

For n = 1 or 2, the midrange and the mean are equal (and coincide with the median), and are most efficient for all distributions. For n = 3, the modified mean reduces to the median, and instead the mean is the most efficient measure of central tendency for values of γ2 from −0.8 to 6.0.

Sampling properties

For a sample of size n from the standard normal distribution, the mid-range M is unbiased, and for large n has a variance given by [5]

\[
\operatorname{Var}(M) \approx \frac{\pi^2}{24 \ln n}.
\]

For a sample of size n from the standard Laplace distribution, the mid-range M is unbiased, and for large n has a variance given by [6]

\[
\operatorname{Var}(M) \approx \frac{\pi^2}{12},
\]

and, in particular, the variance does not decrease to zero as the sample size grows.

For a sample of size n from a zero-centred uniform distribution, the mid-range M is unbiased, and nM has an asymptotic distribution which is a Laplace distribution. [7]
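These rates can be checked by simulation; in this sketch (sample sizes and trial count are arbitrary) the uniform column shrinks roughly like 1/(2(n+1)(n+2)), while the Laplace column levels off near π²/12 ≈ 0.82:

```python
import random

def midrange(xs):
    return (max(xs) + min(xs)) / 2

def mc_variance(sampler, n, trials=5_000):
    """Monte Carlo variance of the mid-range over samples of size n."""
    est = [midrange([sampler() for _ in range(n)]) for _ in range(trials)]
    m = sum(est) / trials
    return sum((e - m) ** 2 for e in est) / trials

uniform = lambda: random.uniform(-0.5, 0.5)                       # zero-centred uniform
laplace = lambda: random.expovariate(1) * random.choice([-1, 1])  # standard Laplace

for n in (10, 100, 1000):
    print(n, mc_variance(uniform, n), mc_variance(laplace, n))
```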

Deviation

While the mean of a set of values minimizes the sum of squared deviations and the median minimizes the average absolute deviation, the midrange minimizes the maximum deviation (defined as max_i |x_i − m|): it is a solution to a variational problem.
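In symbols, the mid-range solves the minimax problem

\[
\hat{m} = \operatorname*{arg\,min}_m \, \max_i |x_i - m| = \frac{\min_i x_i + \max_i x_i}{2},
\]

and the attained minimum is half the range: any other choice of m lies farther than R/2 from one of the two extremes.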

Related Research Articles

In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule, the quantity of interest and its result are distinguished. For example, the sample mean is a commonly used estimator of the population mean.

In probability theory and statistics, kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes a particular aspect of a probability distribution. There are different ways to quantify kurtosis for a theoretical distribution, and there are corresponding ways of estimating it using a sample from a population. Different measures of kurtosis may have different interpretations.

In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value. The basic feature of the median in describing data compared to the mean is that it is not skewed by a small proportion of extremely large or small values, and therefore provides a better representation of the center. Median income, for example, may be a better way to describe the center of the income distribution because increases in the largest incomes alone have no effect on the median. For this reason, the median is of central importance in robust statistics.

In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined.

In statistics, point estimation involves the use of sample data to calculate a single value which is to serve as a "best guess" or "best estimate" of an unknown population parameter. More formally, it is the application of a point estimator to the data to obtain a point estimate.

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive is because of randomness or because the estimator does not account for information that could produce a more accurate estimate. In machine learning, specifically empirical risk minimization, MSE may refer to the empirical risk, as an estimate of the true MSE.

The average absolute deviation (AAD) of a data set is the average of the absolute deviations from a central point. It is a summary statistic of statistical dispersion or variability. In the general form, the central point can be a mean, median, mode, or the result of any other measure of central tendency or any reference value related to the given data set. AAD includes the mean absolute deviation and the median absolute deviation.

A truncated mean or trimmed mean is a statistical measure of central tendency, much like the mean and median. It involves the calculation of the mean after discarding given parts of a probability distribution or sample at the high and low end, and typically discarding an equal amount of both. This number of points to be discarded is usually given as a percentage of the total number of points, but may also be given as a fixed number of points.

In statistics a minimum-variance unbiased estimator (MVUE) or uniformly minimum-variance unbiased estimator (UMVUE) is an unbiased estimator that has lower variance than any other unbiased estimator for all possible values of the parameter.

In probability theory and statistics, the discrete uniform distribution is a symmetric probability distribution wherein a finite number of values are equally likely to be observed; every one of n values has equal probability 1/n. Another way of saying "discrete uniform distribution" would be "a known, finite number of outcomes equally likely to happen".

In probability theory and statistics, the continuous uniform distributions or rectangular distributions are a family of symmetric probability distributions. Such a distribution describes an experiment where there is an arbitrary outcome that lies between certain bounds. The bounds are defined by the parameters a and b, which are the minimum and maximum values. The interval can either be closed or open. Therefore, the distribution is often abbreviated U(a, b), where U stands for uniform distribution. The difference between the bounds defines the interval length; all intervals of the same length on the distribution's support are equally probable. It is the maximum entropy probability distribution for a random variable under no constraint other than that it is contained in the distribution's support.

Robust statistics are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal. Robust statistical methods have been developed for many common problems, such as estimating location, scale, and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from a parametric distribution. For example, robust methods work well for mixtures of two normal distributions with different standard deviations; under this model, non-robust methods like a t-test work poorly.

In statistics, the bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. In statistics, "bias" is an objective property of an estimator. Bias is a distinct concept from consistency: consistent estimators converge in probability to the true value of the parameter, but may be biased or unbiased; see bias versus consistency for more.

The sample mean and the sample covariance are statistics computed from a sample of data on one or more random variables.

In statistics, an L-estimator is an estimator which is a linear combination of order statistics of the measurements. This can be as little as a single point, as in the median, or as many as all points, as in the mean.

In statistics, a trimmed estimator is an estimator derived from another estimator by excluding some of the extreme values, a process called truncation. This is generally done to obtain a more robust statistic, and the extreme values are considered outliers. Trimmed estimators also often have higher efficiency for mixture distributions and heavy-tailed distributions than the corresponding untrimmed estimator, at the cost of lower efficiency for other distributions, such as the normal distribution.

In statistics, the sample maximum and sample minimum, also called the largest observation and smallest observation, are the values of the greatest and least elements of a sample. They are basic summary statistics, used in descriptive statistics such as the five-number summary and Bowley's seven-figure summary and the associated box plot.

In statistics, robust measures of scale are methods that quantify the statistical dispersion in a sample of numerical data while resisting outliers. The most common such robust statistics are the interquartile range (IQR) and the median absolute deviation (MAD). These are contrasted with conventional or non-robust measures of scale, such as sample variance or standard deviation, which are greatly influenced by outliers.

In statistics, the Huber loss is a loss function used in robust regression, that is less sensitive to outliers in data than the squared error loss. A variant for classification is also sometimes used.

In statistics, efficiency is a measure of quality of an estimator, of an experimental design, or of a hypothesis testing procedure. Essentially, a more efficient estimator needs fewer input data or observations than a less efficient one to achieve the Cramér–Rao bound. An efficient estimator is characterized by having the smallest possible variance, indicating that there is a small deviance between the estimated value and the "true" value in the L2 norm sense.

References

  1. Dodge 2003.
  2. Velleman & Hoaglin 1981.
  3. Vinson, William Daniel (1951). An Investigation of Measures of Central Tendency Used in Quality Control (Master's). University of North Carolina at Chapel Hill. Table (4.1), pp. 32–34.
  4. Cowden, Dudley Johnstone (1957). Statistical Methods in Quality Control. Prentice-Hall. pp. 67–68.
  5. Kendall & Stuart 1969, Example 14.4.
  6. Kendall & Stuart 1969, Example 14.5.
  7. Kendall & Stuart 1969, Example 14.12.