Interquartile range

Boxplot (with an interquartile range) and a probability density function (pdf) of a Normal N(0,σ²) Population

In descriptive statistics, the interquartile range (IQR) is a measure of statistical dispersion, which is the spread of the data. [1] The IQR may also be called the midspread, middle 50%, or H‑spread. It is defined as the difference between the 75th and 25th percentiles of the data. [2] [3] [4] To calculate the IQR, the data set is divided into quartiles, or four rank-ordered even parts via linear interpolation. [1] These quartiles are denoted by Q1 (also called the lower quartile), Q2 (the median), and Q3 (also called the upper quartile). The lower quartile corresponds with the 25th percentile and the upper quartile corresponds with the 75th percentile, so IQR = Q3 − Q1. [1]

The IQR is an example of a trimmed estimator, defined as the 25% trimmed range, which enhances the accuracy of dataset statistics by dropping lower contribution, outlying points. [5] It is also used as a robust measure of scale. [5] It can be clearly visualized by the box on a box plot. [1]

Use

The primary use of the IQR is to represent the difference between the upper and lower quartiles of a data set. This can be used as an indicator for variability of the dataset. [1]

It is also used to build box plots, which are a graphical representation of a probability distribution. In the box plot, the IQR is the height of the box itself, and the whiskers extend up to 1.5*IQR beyond the box. [1] Any data point located outside of the whiskers is referred to as an outlier (see below). [1]

The IQR is often preferred to the total range as a measure of variability because it has a breakdown point of 25%, whereas the total range has a breakdown point of 0%. The median absolute deviation (MAD) is more robust still, with a breakdown point of 50%. [6]
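The effect of the breakdown point can be illustrated with a minimal sketch. It assumes Python's standard `statistics` module and a made-up sample; the helper names `iqr` and `total_range` are hypothetical. Replacing a single observation with a gross error inflates the total range, while the IQR of this sample does not change.

```python
import statistics

def iqr(values):
    """IQR from the quartiles of statistics.quantiles (its default 'exclusive' method)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def total_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

# Hypothetical sample, and a copy in which the largest value is a gross error.
clean = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15]
corrupted = clean[:-1] + [1500]

print(total_range(clean), total_range(corrupted))  # the range is dominated by the bad value
print(iqr(clean), iqr(corrupted))                  # the IQR is identical for both samples
```

Other quartile conventions give slightly different IQR values, but the qualitative behaviour is the same as long as fewer than a quarter of the points are corrupted.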

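The IQR has been applied in a number of recent studies. Some of these uses include adaptive sampling for design space exploration, [7] truncating outliers when predicting stock returns, [8] and image denoising. [9]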

Algorithm

Discrete Variables

The IQR of a set of values is calculated as the difference between the upper and lower quartiles, Q3 and Q1. Each quartile is a median calculated as follows.

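Given an even 2n or odd 2n+1 number of values:

  1. the first quartile Q1 is the median of the n smallest values
  2. the third quartile Q3 is the median of the n largest values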

The second quartile Q2 is the same as the ordinary median. [10]
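The median-based rule can be sketched as follows. This is one reading of the rule, assuming Q1 and Q3 are the medians of the n smallest and n largest values, so that the middle value is left out of both halves when the count is odd. The function names are hypothetical.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as medians of the lower and upper halves."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    n = len(data) // 2           # size of each half; the middle value is skipped for odd counts
    q1 = median(data[:n])        # median of the n smallest values
    q2 = median(data)            # ordinary median
    q3 = median(data[-n:])       # median of the n largest values
    return q1, q2, q3

def iqr_by_halves(values):
    q1, _, q3 = quartiles_by_halves(values)
    return q3 - q1

# The 13 values used in the table example of this article.
sample = [7, 7, 31, 31, 47, 75, 87, 115, 116, 119, 119, 155, 177]
print(quartiles_by_halves(sample))
print(iqr_by_halves(sample))
```

Statistical packages often use interpolation-based quartile definitions instead, so their results can differ slightly from this rule on small samples.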

Continuous Variables

The interquartile range of a continuous distribution can be calculated by integrating the probability density function over specific intervals. The lower quartile, Q1, is a number such that the integral of the PDF from −∞ to Q1 equals 0.25, while the upper quartile, Q3, is a number such that the integral from −∞ to Q3 equals 0.75. [1]

In terms of the CDF, the quartiles can be defined as follows: Q1 = CDF⁻¹(0.25) and Q3 = CDF⁻¹(0.75), where CDF⁻¹ is the quantile function. [1]
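As a minimal sketch of the quantile-function definition, the IQR of a normal distribution can be computed with `NormalDist.inv_cdf` from Python's standard `statistics` module. The helper name and the value of `sigma` are assumptions chosen for illustration.

```python
from statistics import NormalDist

def iqr_from_quantile_function(inv_cdf):
    """IQR = CDF^-1(0.75) - CDF^-1(0.25) for any quantile function inv_cdf."""
    return inv_cdf(0.75) - inv_cdf(0.25)

sigma = 2.0  # hypothetical standard deviation
dist = NormalDist(mu=0.0, sigma=sigma)

iqr = iqr_from_quantile_function(dist.inv_cdf)
print(iqr / sigma)  # approximately 1.349, independent of sigma
```

Any callable that maps a probability to a quantile can be passed in, so the same helper works for other distributions whose quantile function is known.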

Examples

Data set in a table

The following table has 13 rows, and follows the rules for the odd number of entries.

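| i | x[i] |
| 1 | 7 |
| 2 | 7 |
| 3 | 31 |
| 4 | 31 |
| 5 | 47 |
| 6 | 75 |
| 7 | 87 |
| 8 | 115 |
| 9 | 116 |
| 10 | 119 |
| 11 | 119 |
| 12 | 155 |
| 13 | 177 |

Q1 = 31 (median of the lower half of the values, rows 1 to 6)
Q2 = 87 (median of the whole table)
Q3 = 119 (median of the upper half of the values, rows 8 to 13)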

For the data in this table the interquartile range is IQR = Q3 − Q1 = 119 − 31 = 88.

Data set in a plain-text box plot

                             +−−−−−+−+
               * |−−−−−−−−−−−|     | |−−−−−−−−−−−|
                             +−−−−−+−+

 +−−−+−−−+−−−+−−−+−−−+−−−+−−−+−−−+−−−+−−−+−−−+−−−+   number line
 0   1   2   3   4   5   6   7   8   9   10  11  12

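For the data set in this box plot:

  1. lower (first) quartile Q1 = 7
  2. median (second quartile) Q2 = 8.5
  3. upper (third) quartile Q3 = 9
  4. interquartile range IQR = Q3 − Q1 = 2
  5. lower 1.5*IQR whisker = Q1 − 1.5 * IQR = 7 − 3 = 4 (if there is no data point at 4, the whisker ends at the lowest point greater than 4)
  6. upper 1.5*IQR whisker = Q3 + 1.5 * IQR = 9 + 3 = 12 (if there is no data point at 12, the whisker ends at the highest point less than 12)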

This means the 1.5*IQR whiskers can be of unequal length. The median, minimum, maximum, and the first and third quartiles constitute the five-number summary. [1] [11]

Distributions

The interquartile range and median of some common distributions are shown below:

| Distribution | Median | IQR |
| Normal | μ | 2 Φ⁻¹(0.75)σ ≈ 1.349σ ≈ (27/20)σ |
| Laplace | μ | 2b ln(2) ≈ 1.386b |
| Cauchy | μ | 2γ |
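The Laplace and Cauchy entries can be checked numerically from the closed-form quantile functions of those distributions. The sketch below is illustrative only: the function names are hypothetical, `b` is the Laplace scale parameter, and `gamma` is assumed here to denote the Cauchy scale parameter, for which the same calculation gives an IQR of 2γ.

```python
import math

def laplace_quantile(p, mu=0.0, b=1.0):
    """Inverse CDF of the Laplace distribution with location mu and scale b."""
    if p < 0.5:
        return mu + b * math.log(2 * p)
    return mu - b * math.log(2 * (1 - p))

def cauchy_quantile(p, x0=0.0, gamma=1.0):
    """Inverse CDF of the Cauchy distribution with location x0 and scale gamma."""
    return x0 + gamma * math.tan(math.pi * (p - 0.5))

def iqr(quantile, **params):
    """Difference between the 0.75 and 0.25 quantiles."""
    return quantile(0.75, **params) - quantile(0.25, **params)

print(iqr(laplace_quantile, b=1.0))     # close to 2 ln(2), about 1.386
print(iqr(cauchy_quantile, gamma=1.0))  # close to 2
```

The location parameters cancel in the difference, which is why the IQR column depends only on the scale parameter of each distribution.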

If both the median and mean of a distribution fall inside the interquartile range, the distribution is considered to be reasonably symmetrical. [12]

Outliers

Box-and-whisker plot with four mild outliers and one extreme outlier. In this chart, outliers are defined as mild above Q3 + 1.5 * IQR and extreme above Q3 + 3 * IQR.

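The interquartile range is often used to find outliers in data. A fence is used to identify and categorize types of outliers from the data, or on a box plot. [13] There are four relevant fences:

  1. lower inner fence: Q1 − 1.5 * IQR
  2. upper inner fence: Q3 + 1.5 * IQR
  3. lower outer fence: Q1 − 3 * IQR
  4. upper outer fence: Q3 + 3 * IQR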

Any data points that fall between the inner and outer fences are called mild outliers. Points that fall beyond the outer fences are called extreme outliers. [13]
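The classification can be sketched as follows, assuming inner fences at 1.5 * IQR and outer fences at 3 * IQR beyond the quartiles, as in the chart caption above. The function names are hypothetical, and treating a point that lies exactly on a fence as not beyond it is an assumption of this sketch.

```python
def fences(q1, q3):
    """Return the four fences derived from the quartiles."""
    iqr = q3 - q1
    return {
        "lower_outer": q1 - 3 * iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3 * iqr,
    }

def classify(x, q1, q3):
    """Label x as 'extreme', 'mild', or 'not an outlier'."""
    f = fences(q1, q3)
    if x < f["lower_outer"] or x > f["upper_outer"]:
        return "extreme"
    if x < f["lower_inner"] or x > f["upper_inner"]:
        return "mild"
    return "not an outlier"

# Quartiles from the table example in this article: Q1 = 31, Q3 = 119, so IQR = 88.
print(fences(31, 119))
for value in (177, 300, 400):   # 300 and 400 are hypothetical extra observations
    print(value, classify(value, 31, 119))
```

With these quartiles the inner fences are −101 and 251 and the outer fences are −233 and 383, so the largest value in the table, 177, is not flagged.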

References

  1. Dekking, Frederik Michel; Kraaikamp, Cornelis; Lopuhaä, Hendrik Paul; Meester, Ludolf Erwin (2005). A Modern Introduction to Probability and Statistics. Springer Texts in Statistics. London: Springer London. doi:10.1007/1-84628-168-7. ISBN 978-1-85233-896-1.
  2. Upton, Graham; Cook, Ian (1996). Understanding Statistics. Oxford University Press. p. 55. ISBN 0-19-914391-9.
  3. Zwillinger, D.; Kokoska, S. (2000). CRC Standard Probability and Statistics Tables and Formulae. CRC Press. p. 18. ISBN 1-58488-059-7.
  4. Ross, Sheldon (2010). Introductory Statistics. Burlington, MA: Elsevier. pp. 103–104. ISBN 978-0-12-374388-6.
  5. Kaltenbach, Hans-Michael (2012). A concise guide to statistics. Heidelberg: Springer. ISBN 978-3-642-23502-3. OCLC 763157853.
  6. Rousseeuw, Peter J.; Croux, Christophe (1992). Y. Dodge (ed.). "Explicit Scale Estimators with High Breakdown Point" (PDF). L1-Statistical Analysis and Related Methods. Amsterdam: North-Holland. pp. 77–92.
  7. Zhang, Yiming; Kim, Nam H.; Haftka, Raphael T. (2019-11-20). "General-Surrogate Adaptive Sampling Using Interquartile Range for Design Space Exploration". Journal of Mechanical Design. 142 (5). doi:10.1115/1.4044432. ISSN 1050-0472.
  8. Dai, Zhifeng; Chang, Xiaomin (2021). "Predicting Stock Return with Economic Constraint: Can Interquartile Range Truncate the Outliers?".
  9. Ajil, Jassim, Firas (2013-02-05). Image Denoising Using Interquartile Range Filter with Local Averaging. OCLC 1106182050.
  10. Westergren, Bertil (1988). Beta [beta] mathematics handbook: concepts, theorems, methods, algorithms, formulas, graphs, tables. Studentlitteratur. p. 348. ISBN 9144250517. OCLC 18454776.
  11. Tukey, J. W. (1977). Exploratory Data Analysis. Reading: Addison-Wesley.
  12. Whitley, Elise; Ball, Jonathan (2002). "Statistics review 1: Presenting and summarising data". Critical Care. 6 (1): 66–71. ISSN 1364-8535. PMID 11940268.
  13. "NIST/SEMATECH e-Handbook of Statistical Methods". www.itl.nist.gov. doi:10.18434/m32189. Retrieved 2021-12-14.