Robust measures of scale

In statistics, a robust measure of scale is a robust statistic that quantifies the statistical dispersion in a set of numerical data. The most common such statistics are the interquartile range (IQR) and the median absolute deviation (MAD). These are contrasted with conventional measures of scale, such as sample variance or sample standard deviation, which are non-robust, meaning greatly influenced by outliers.

These robust statistics are particularly used as estimators of a scale parameter, and have the advantages of both robustness and superior efficiency on contaminated data, at the cost of inferior efficiency on clean data from distributions such as the normal distribution. To illustrate robustness, the standard deviation can be made arbitrarily large by increasing exactly one observation (it has a breakdown point of 0, as it can be contaminated by a single point), a defect that is not shared by robust statistics.
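The breakdown behaviour described above can be illustrated with a minimal sketch. The helper name `mad` and the sample values are assumptions chosen for illustration only; the point is that moving one observation inflates the standard deviation without bound while the MAD does not move.

```python
import statistics


def mad(values):
    """Unscaled median absolute deviation about the sample median."""
    center = statistics.median(values)
    return statistics.median(abs(v - center) for v in values)


# Hypothetical sample: only the largest observation is varied.
base = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
for largest in (8.0, 80.0, 8000.0):
    sample = base + [largest]
    # The standard deviation grows with the single moved point;
    # the MAD stays at 2.0 for all three samples.
    print(largest, statistics.stdev(sample), mad(sample))
```

Because the median and the median of the absolute deviations both ignore how far the largest point lies from the rest, the MAD is unchanged in all three samples.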

IQR and MAD

One of the most common robust measures of scale is the interquartile range (IQR), the difference between the 75th percentile and the 25th percentile of a sample; this is the 25% trimmed range, an example of an L-estimator. Other trimmed ranges, such as the interdecile range (10% trimmed range), can also be used.
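As a rough sketch, both trimmed ranges can be computed with the Python standard library. The function names `iqr` and `interdecile_range` are illustrative, and the choice of the "inclusive" quantile method is an assumption: several quantile conventions exist, and they give slightly different values on small samples.

```python
import statistics


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q = statistics.quantiles(values, n=4, method="inclusive")
    return q[2] - q[0]


def interdecile_range(values):
    """Interdecile range: 90th percentile minus 10th percentile."""
    d = statistics.quantiles(values, n=10, method="inclusive")
    return d[8] - d[0]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
contaminated = [1, 2, 3, 4, 5, 6, 7, 8, 900]  # one gross outlier

print(iqr(data), iqr(contaminated))
print(interdecile_range(data), interdecile_range(contaminated))
```

In this example the IQR is the same for both samples, since the outlier lies in the trimmed 25% tail. The interdecile range trims less and, with only nine points, still interpolates toward the outlier.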

Another familiar robust measure of scale is the median absolute deviation (MAD), the median of the absolute values of the differences between the data values and the overall median of the data set; for a Gaussian distribution, the MAD is related to the standard deviation $\sigma$ as $\sigma \approx 1.4826\,\operatorname{MAD}$.

Estimation

Robust measures of scale can be used as estimators of properties of the population, either for parameter estimation or as estimators of their own expected value.

For example, robust estimators of scale are used to estimate the population variance or population standard deviation, generally by multiplying by a scale factor to make the result an unbiased consistent estimator; see scale parameter: estimation. For instance, dividing the IQR by $2\sqrt{2}\,\operatorname{erf}^{-1}(1/2)$ (approximately 1.349) makes it an unbiased, consistent estimator for the population standard deviation if the data follow a normal distribution.
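The scale factor for the IQR can be checked numerically. The sketch below relies on the identity $\operatorname{erf}^{-1}(1/2) = \Phi^{-1}(0.75)/\sqrt{2}$, where $\Phi$ is the standard normal distribution function, so the divisor equals $2\,\Phi^{-1}(0.75)$. The same quantile gives the usual normal-consistency multiplier for the MAD, about 1.4826. The function names are illustrative assumptions.

```python
import math
from statistics import NormalDist

# 75th percentile of the standard normal distribution, about 0.6745.
z75 = NormalDist().inv_cdf(0.75)

# Divisor for the IQR: 2 * sqrt(2) * erfinv(1/2) = 2 * z75.
IQR_DIVISOR = 2 * z75

# Multiplier for the MAD under normality: 1 / z75.
MAD_MULTIPLIER = 1 / z75

# Sanity check of the identity: erf(z75 / sqrt(2)) should be 1/2.
erf_check = math.erf(z75 / math.sqrt(2))


def sigma_from_iqr(iqr_value):
    """Estimate a normal standard deviation from an IQR."""
    return iqr_value / IQR_DIVISOR


def sigma_from_mad(mad_value):
    """Estimate a normal standard deviation from an unscaled MAD."""
    return mad_value * MAD_MULTIPLIER


print(IQR_DIVISOR, MAD_MULTIPLIER, erf_check)
```

These factors are only appropriate when the bulk of the data is assumed to be normal; other distributions need other constants.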

In other situations, it makes more sense to think of a robust measure of scale as an estimator of its own expected value, interpreted as an alternative to the population variance or standard deviation as a measure of scale. For example, the MAD of a sample from a standard Cauchy distribution is an estimator of the population MAD, which in this case is 1, whereas the population variance does not exist.
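The Cauchy example can be checked by simulation. This is a sketch under stated assumptions: the seed, the sample size, and the helper name `standard_cauchy` are arbitrary choices, and the draws use the inverse distribution function $\tan(\pi(U - 1/2))$ with $U$ uniform on $[0, 1)$. The population MAD equals 1 because $P(|X| \le 1) = 2\arctan(1)/\pi = 1/2$ for a standard Cauchy variable $X$.

```python
import math
import random
import statistics


def standard_cauchy(rng):
    """Draw one standard Cauchy variate by inverting its distribution function."""
    return math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(12345)  # fixed seed, chosen arbitrarily
sample = [standard_cauchy(rng) for _ in range(100_001)]

center = statistics.median(sample)
sample_mad = statistics.median(abs(x - center) for x in sample)

# Probability that |X| <= 1 for a standard Cauchy variable: exactly one half,
# which is why the population MAD is 1.
prob_within_one = 2 * math.atan(1.0) / math.pi

print(sample_mad, prob_within_one)
```

The sample MAD settles near 1 as the sample grows, whereas the sample variance of the same draws does not converge to anything.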

Efficiency

These robust estimators typically have inferior statistical efficiency compared to conventional estimators for data drawn from a distribution without outliers (such as a normal distribution), but have superior efficiency for data drawn from a mixture distribution or from a heavy-tailed distribution, for which non-robust measures such as the standard deviation should not be used.

For example, for data drawn from the normal distribution, the MAD is 37% as efficient as the sample standard deviation, while the Rousseeuw–Croux estimator Qn is 82% as efficient as the sample standard deviation.

Absolute pairwise differences

Rousseeuw and Croux [1] propose alternatives to the MAD, motivated by two weaknesses of it:

  1. It is inefficient (37% efficiency) at Gaussian distributions.
  2. It computes a symmetric statistic about a location estimate, and so does not deal with skewness.

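They propose two alternative statistics based on pairwise differences, Sn and Qn, defined as:

$$S_n := 1.1926 \, \operatorname{med}_i \left( \operatorname{med}_j \left| x_i - x_j \right| \right),$$

$$Q_n := c_n \cdot \text{first quartile of } \left\{ \left| x_i - x_j \right| : i < j \right\},$$

where $c_n$ is a constant depending on $n$.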

These can be computed in O(n log n) time and O(n) space.

Neither of these requires location estimation, as they are based only on differences between values. They are both more efficient than the MAD under a Gaussian distribution: Sn is 58% efficient, while Qn is 82% efficient.
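A naive sketch of both statistics follows. It is not the O(n log n) algorithm mentioned above: it enumerates all pairwise differences and therefore needs O(n²) time. Several details are assumptions taken from the usual presentation of the Rousseeuw–Croux estimators rather than from this article: the inner median of Sn is a high median and the outer one a low median, Qn uses the k-th smallest pairwise difference with k = h(h − 1)/2 and h = ⌊n/2⌋ + 1, the asymptotic constant 2.2191 stands in for the sample-size-dependent factor, and finite-sample corrections are omitted. The function names are illustrative.

```python
from itertools import combinations


def sn_naive(x, c=1.1926):
    """Naive O(n^2 log n) Sn: c * lomed_i( himed_j |x_i - x_j| )."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    inner = []
    for xi in x:
        diffs = sorted(abs(xi - xj) for xj in x)
        inner.append(diffs[n // 2])  # high median: order statistic floor(n/2) + 1
    inner.sort()
    return c * inner[(n + 1) // 2 - 1]  # low median: order statistic floor((n+1)/2)


def qn_naive(x, d=2.2191):
    """Naive O(n^2 log n) Qn: d * (k-th smallest |x_i - x_j|, i < j)."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    h = n // 2 + 1
    k = h * (h - 1) // 2  # roughly the first quartile of the pairwise differences
    diffs = sorted(abs(a - b) for a, b in combinations(x, 2))
    return d * diffs[k - 1]


data = [1, 2, 3, 4, 5]
print(sn_naive(data), qn_naive(data))
```

Because only differences between observations enter, adding a constant to every value leaves both results unchanged, and no location estimate is computed at any point.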

For a sample from a normal distribution, Sn is approximately unbiased for the population standard deviation even down to very modest sample sizes (<1% bias for n = 10). For a large sample from a normal distribution, 2.2191 Qn is approximately unbiased for the population standard deviation. For small or moderate samples, the expected value of Qn under a normal distribution depends markedly on the sample size, so finite-sample correction factors (obtained from a table or from simulations) are used to calibrate the scale of Qn.

The biweight midvariance

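Like Sn and Qn, the biweight midvariance aims to be robust without sacrificing too much efficiency. It is defined as

$$\frac{n \sum_{i=1}^{n} (x_i - Q)^2 \left(1 - u_i^2\right)^4 I(|u_i| < 1)}{\left( \sum_{i=1}^{n} \left(1 - u_i^2\right)\left(1 - 5 u_i^2\right) I(|u_i| < 1) \right)^2},$$

where I is the indicator function, Q is the sample median of the $x_i$, and

$$u_i = \frac{x_i - Q}{9 \cdot \operatorname{MAD}}.$$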

Its square root is a robust estimator of scale, since data points are downweighted as their distance from the median increases, with points more than 9 MAD units from the median having no influence at all.
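A minimal sketch of the computation follows, assuming the standard form of the statistic: each point gets $u_i = (x_i - Q)/(9\,\operatorname{MAD})$, points with $|u_i| \ge 1$ are dropped from both sums, the numerator sums $(x_i - Q)^2 (1 - u_i^2)^4$, the denominator is the square of the sum of $(1 - u_i^2)(1 - 5u_i^2)$, and the ratio is multiplied by n. The function name and the error raised for a zero MAD are illustrative choices.

```python
import statistics


def biweight_midvariance(x, c=9.0):
    """Biweight midvariance with cutoff at c MAD units from the median."""
    n = len(x)
    q = statistics.median(x)
    mad = statistics.median(abs(v - q) for v in x)
    if mad == 0:
        raise ValueError("MAD is zero; the statistic is undefined")
    numerator = 0.0
    denominator = 0.0
    for v in x:
        u = (v - q) / (c * mad)
        if abs(u) < 1:  # indicator I(|u_i| < 1): distant points contribute nothing
            numerator += (v - q) ** 2 * (1 - u * u) ** 4
            denominator += (1 - u * u) * (1 - 5 * u * u)
    return n * numerator / denominator ** 2


near = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
far = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10000]
print(biweight_midvariance(near), biweight_midvariance(far))
```

Both samples share the same median (5.5) and MAD (2.5), and the last point lies beyond 9 MAD units in each, so the two results coincide: moving the outlier further away has no effect.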

Extensions

Mizera & Müller (2004) propose a robust depth-based estimator for location and scale simultaneously. [2]

References

  1. Rousseeuw, Peter J.; Croux, Christophe (December 1993), "Alternatives to the Median Absolute Deviation", Journal of the American Statistical Association, American Statistical Association, 88 (424): 1273–1283, doi:10.2307/2291267, JSTOR 2291267.
  2. Mizera, I.; Müller, C. H. (2004), "Location-scale depth", Journal of the American Statistical Association, 99 (468): 949–966, doi:10.1198/016214504000001312.