Box plot

Figure 1. Box plot of data from the Michelson–Morley experiment

In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length). The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers. In addition to the points themselves, box plots allow one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically. They take their name from the box in the middle.

History of the box plot

The range-bar was introduced by Mary Eleanor Spear in 1952 [1] and again in 1969. [2] The box and whiskers plot was first introduced in 1970 by John Tukey, who later published on the subject in 1977. [3]

Elements of a box plot

Figure 2. Boxplot with whiskers from minimum to maximum
Figure 3. The same box plot with whiskers extending at most 1.5 IQR from the box

A boxplot is a standardized way of displaying the dataset based on a five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles.

Minimum : the lowest data point excluding any outliers.

Maximum : the largest data point excluding any outliers.

Median (Q2 / 50th percentile) : the middle value of the dataset.

First quartile (Q1 / 25th percentile) : also known as the lower quartile $q_n(0.25)$, it is the median of the lower half of the dataset.

Third quartile (Q3 / 75th percentile) : also known as the upper quartile $q_n(0.75)$, it is the median of the upper half of the dataset. [4]

One further quantity is not part of the five-number summary but is used to construct the box plot, because it determines the smallest and largest data values that are not treated as outliers. This is the interquartile range, or IQR:

Interquartile range (IQR) : the distance between the upper and lower quartiles, IQR = Q3 − Q1.
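The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the "median of each half" rule for the quartiles and assuming that the middle value is left out of both halves when the number of points is odd (other conventions exist). The function name `five_number_summary` and the sample values are hypothetical. The function returns the sample extremes; any exclusion of outliers is a separate step.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2
    lower = data[:half]        # values below the middle position
    upper = data[n - half:]    # values above the middle position
    return data[0], median(lower), median(data), median(upper), data[-1]

# Hypothetical sample, not taken from the article.
sample = [7, 1, 3, 9, 5, 11, 13]
minimum, q1, q2, q3, maximum = five_number_summary(sample)
iqr = q3 - q1                  # interquartile range, Q3 - Q1
print(minimum, q1, q2, q3, maximum, iqr)
```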

A boxplot is constructed of two parts, a box and a set of whiskers shown in Figure 2. The lowest point is the minimum of the data set and the highest point is the maximum of the data set. The box is drawn from Q1 to Q3 with a horizontal line drawn in the middle to denote the median.

The same data set can also be represented as the boxplot shown in Figure 3. From above the upper quartile, a distance of 1.5 times the IQR is measured out and a whisker is drawn up to the largest observed point from the dataset that falls within this distance. Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile and a whisker is drawn down to the lowest observed point from the dataset that falls within this distance. All other observed points are plotted as outliers. [5]
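The 1.5 IQR rule can be written compactly as follows. The labels "lower fence" and "upper fence" are introduced here only for convenience, and treating points that lie exactly on a fence as non-outliers is an assumed convention.

```latex
\begin{aligned}
\mathrm{IQR} &= Q_3 - Q_1 \\
\text{lower fence} &= Q_1 - 1.5\,\mathrm{IQR}, \qquad
\text{upper fence} = Q_3 + 1.5\,\mathrm{IQR} \\
\text{lower whisker end} &= \min\{\, x_i : x_i \ge Q_1 - 1.5\,\mathrm{IQR} \,\} \\
\text{upper whisker end} &= \max\{\, x_i : x_i \le Q_3 + 1.5\,\mathrm{IQR} \,\} \\
\text{outliers} &= \{\, x_i : x_i < Q_1 - 1.5\,\mathrm{IQR} \ \text{or}\ x_i > Q_3 + 1.5\,\mathrm{IQR} \,\}
\end{aligned}
```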

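However, the whiskers can represent several possible alternative values, among them the minimum and maximum of all of the data (as in Figure 2) and fixed percentiles such as the 2nd and 98th or the 9th and 91st.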

Any data not included between the whiskers should be plotted as an outlier with a dot, small circle, or star, but occasionally this is not done.

Some box plots include an additional character to represent the mean of the data. [6] [7]

On some box plots a crosshatch is placed on each whisker, before the end of the whisker.

Rarely, box plots can be presented with no whiskers at all.

Because of this variability, it is appropriate to describe the convention being used for the whiskers and outliers in the caption for the plot.

The unusual percentiles 2%, 9%, 91%, 98% are sometimes used for whisker cross-hatches and whisker ends to show the seven-number summary. If the data are normally distributed, the locations of the seven marks on the box plot will be approximately equally spaced.
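The spacing claim can be checked numerically. The sketch below assumes the seven marks are the 2nd, 9th, 25th, 50th, 75th, 91st and 98th percentiles and evaluates them for a standard normal distribution with Python's `statistics.NormalDist`. The gaps between neighbouring marks come out close to one another rather than exactly equal.

```python
from statistics import NormalDist

# Percentiles assumed for the seven-number summary: whisker marks plus the box.
percentiles = [0.02, 0.09, 0.25, 0.50, 0.75, 0.91, 0.98]

standard_normal = NormalDist(mu=0.0, sigma=1.0)
marks = [standard_normal.inv_cdf(p) for p in percentiles]
gaps = [b - a for a, b in zip(marks, marks[1:])]

for p, z in zip(percentiles, marks):
    print(f"{p:.0%} percentile: z = {z:+.3f}")
print("gaps between neighbouring marks:", [round(g, 3) for g in gaps])
```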

Variations

Figure 4. Four box plots, with and without notches and variable width

Since the mathematician John W. Tukey introduced this type of visual data display in 1970, several variations on the traditional box plot have been described. Two of the most common are variable width box plots and notched box plots (see Figure 4).

Variable width box plots illustrate the size of each group whose data is being plotted by making the width of the box proportional to the size of the group. A popular convention is to make the box width proportional to the square root of the size of the group. [8]
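As a small illustration of the square-root convention, the sketch below scales box widths so that the largest group gets a chosen maximum width. The function name `box_widths`, the `max_width` default, and the group sizes are hypothetical.

```python
from math import sqrt

def box_widths(group_sizes, max_width=0.8):
    """Scale box widths so that each is proportional to sqrt(group size)."""
    roots = [sqrt(n) for n in group_sizes]
    largest = max(roots)
    return [max_width * r / largest for r in roots]

# Hypothetical group sizes, chosen only for illustration.
sizes = [25, 100, 400]
widths = box_widths(sizes)
print(widths)
```

Quadrupling the group size doubles the box width, so large groups stand out without dominating the plot.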

Notched box plots apply a "notch" or narrowing of the box around the median. Notches are useful in offering a rough guide to significance of difference of medians; if the notches of two boxes do not overlap, this offers evidence of a statistically significant difference between the medians. [8] The width of the notches is proportional to the interquartile range (IQR) of the sample and inversely proportional to the square root of the size of the sample. However, there is uncertainty about the most appropriate multiplier (as this may vary depending on the similarity of the variances of the samples). [8] One convention is to use $\pm 1.58 \times \mathrm{IQR} / \sqrt{n}$. [9]
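A minimal sketch of the notch calculation follows, assuming the multiplier 1.58 used by R's box plot statistics, so that the notch spans the median plus or minus 1.58 × IQR / √n. The function names and the two group summaries are hypothetical, and non-overlap is only a rough guide, not a formal test.

```python
from math import sqrt

def notch_limits(median, iqr, n, multiplier=1.58):
    """Return (lower, upper) notch limits: median -/+ multiplier * IQR / sqrt(n)."""
    half_width = multiplier * iqr / sqrt(n)
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True if two (lower, upper) notch intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical summaries for two groups (median, IQR, sample size).
group_a = notch_limits(median=70.0, iqr=9.0, n=24)
group_b = notch_limits(median=78.0, iqr=8.0, n=36)
print(group_a, group_b, notches_overlap(group_a, group_b))
```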

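Adjusted box plots are intended for skew distributions. They rely on the medcouple statistic of skewness. [10] For a medcouple value of MC, the lengths of the upper and lower whiskers are respectively defined to be

$$1.5\,\mathrm{IQR} \cdot e^{3\,\mathrm{MC}}, \quad 1.5\,\mathrm{IQR} \cdot e^{-4\,\mathrm{MC}} \quad \text{if } \mathrm{MC} \ge 0,$$

$$1.5\,\mathrm{IQR} \cdot e^{4\,\mathrm{MC}}, \quad 1.5\,\mathrm{IQR} \cdot e^{-3\,\mathrm{MC}} \quad \text{if } \mathrm{MC} \le 0.$$

For symmetrical distributions, the medcouple will be zero, and this reduces to Tukey's boxplot with equal whisker lengths of $1.5\,\mathrm{IQR}$ for both whiskers.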
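The adjustment can be sketched as a function of the IQR and a medcouple value that is assumed to have been computed elsewhere. The exponents 3 and 4 follow the commonly cited form of the adjusted boxplot of Hubert and Vandervieren [10]. The function name and the example inputs are hypothetical.

```python
from math import exp

def adjusted_whisker_lengths(iqr, mc):
    """Return (upper, lower) maximum whisker lengths for a medcouple value mc."""
    if mc >= 0:
        upper = 1.5 * iqr * exp(3 * mc)
        lower = 1.5 * iqr * exp(-4 * mc)
    else:
        upper = 1.5 * iqr * exp(4 * mc)
        lower = 1.5 * iqr * exp(-3 * mc)
    return upper, lower

# mc = 0 reproduces the symmetric 1.5 * IQR rule; mc > 0 lengthens the upper whisker.
symmetric = adjusted_whisker_lengths(iqr=9.0, mc=0.0)
right_skewed = adjusted_whisker_lengths(iqr=9.0, mc=0.2)
print(symmetric, right_skewed)
```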

Other kinds of plots such as violin plots and bean plots can show the difference between single-modal and multimodal distributions, a difference that cannot be seen with the original boxplot. [11]

Examples

Example without outliers

Figure 5. The box plot generated for the example without outliers.

A series of hourly temperatures were measured throughout the day in degrees Fahrenheit. The recorded values are listed in order as follows: 57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81.

A box plot of the data can be generated by calculating five relevant values: minimum, maximum, median, first quartile, and third quartile.

The minimum is the smallest number of the set. In this case, the minimum day temperature is 57 °F.

The maximum is the largest number of the set. In this case, the maximum day temperature is 81 °F.

The median is the "middle" number of the ordered set. This means that half of the elements lie at or below the median and half lie at or above it. With 24 readings, the median is the average of the 12th and 13th values, so the median of this ordered set is 70 °F.

The first quartile value is the number that marks one quarter of the ordered set. In other words, about 25% of the elements are at or below the first quartile and about 75% are at or above it. The first quartile value can be determined by finding the median of the lower half of the ordered set, that is, of the values between the minimum and the median. For the hourly temperatures, the median of the 12 lowest readings is 66 °F.

The third quartile value is the number that marks three quarters of the ordered set. In other words, about 75% of the elements are at or below the third quartile and about 25% are at or above it. The third quartile value can be determined by finding the median of the upper half of the ordered set, that is, of the values between the median and the maximum. For the hourly temperatures, the median of the 12 highest readings is 75 °F.

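The interquartile range, or IQR, can be calculated:

$$\mathrm{IQR} = Q_3 - Q_1 = 75 - 66 = 9 \text{ °F}$$

Hence,

$$1.5\,\mathrm{IQR} = 1.5 \times 9 = 13.5 \text{ °F}$$

1.5 IQR above the third quartile is:

$$Q_3 + 1.5\,\mathrm{IQR} = 75 + 13.5 = 88.5 \text{ °F}$$

1.5 IQR below the first quartile is:

$$Q_1 - 1.5\,\mathrm{IQR} = 66 - 13.5 = 52.5 \text{ °F}$$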

The upper whisker of the box plot is the largest dataset number smaller than 1.5 IQR above the third quartile. Here, 1.5 IQR above the third quartile is 88.5 °F and the maximum is 81 °F. Therefore, the upper whisker is drawn at the value of the maximum, 81 °F.

Similarly, the lower whisker of the box plot is the smallest dataset number larger than 1.5 IQR below the first quartile. Here, 1.5 IQR below the first quartile is 52.5 °F and the minimum is 57 °F. Therefore, the lower whisker is drawn at the value of the minimum, 57 °F.
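The arithmetic of this example can be reproduced with the short script below. It is a sketch that assumes the median-of-halves rule for the quartiles; the variable names are illustrative only.

```python
from statistics import median

# Hourly temperatures from the example, already in ascending order (degrees Fahrenheit).
temps = [57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81]

half = len(temps) // 2
q1 = median(temps[:half])      # median of the 12 lowest readings
q2 = median(temps)             # median of all 24 readings
q3 = median(temps[half:])      # median of the 12 highest readings

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme readings that still lie within the fences.
lower_whisker = min(t for t in temps if t >= lower_fence)
upper_whisker = max(t for t in temps if t <= upper_fence)

print(q1, q2, q3, iqr, lower_fence, upper_fence, lower_whisker, upper_whisker)
```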

Example with outliers

Figure 6. The box plot generated for the example with outliers.

Above is an example without outliers. Here is a followup example with outliers:

The ordered set is: 52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89.

In this example, only the first and the last number are changed. The median, third quartile, and first quartile remain the same.

In this case, the maximum is 89 °F and 1.5 IQR above the third quartile is 88.5 °F. The maximum is greater than the third quartile plus 1.5 IQR, so the maximum is an outlier. Therefore, the upper whisker is drawn at the greatest value smaller than 1.5 IQR above the third quartile, which is 79 °F.

Similarly, the minimum is 52 °F and 1.5 IQR below the first quartile is 52.5 °F. The minimum is smaller than the first quartile minus 1.5 IQR, so the minimum is also an outlier. Therefore, the lower whisker is drawn at the smallest value greater than 1.5 IQR below the first quartile, which is 57 °F.
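The same steps can be wrapped in a small helper that also lists the outliers. This is a sketch under the same median-of-halves assumption; the function name `tukey_whiskers` and its parameter `k` are hypothetical.

```python
from statistics import median

def tukey_whiskers(values, k=1.5):
    """Return (lower_whisker, upper_whisker, outliers) using fences at k * IQR."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[len(data) - half:])
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if low <= x <= high]
    outliers = [x for x in data if x < low or x > high]
    return inside[0], inside[-1], outliers

# Temperatures from the example with outliers (degrees Fahrenheit).
temps = [52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89]
print(tukey_whiskers(temps))
```

For this data set the whiskers end at 57 °F and 79 °F, and the readings 52 °F and 89 °F are reported as outliers.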

In the case of large datasets

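General equation to compute empirical quantiles

$$q_n(p) = x_{(k)} + \alpha\,\bigl(x_{(k+1)} - x_{(k)}\bigr), \qquad k = \lfloor p\,(n+1) \rfloor, \quad \alpha = p\,(n+1) - k$$

Here $x_{(k)}$ denotes the $k$th smallest data point.

Using the example from above with 24 data points, meaning n = 24, one can also calculate the median and the first and third quartiles from this formula rather than reading them off the ordered list.

Median : $q_n(0.5) = x_{(12)} + (0.5 \cdot 25 - 12)\,(x_{(13)} - x_{(12)}) = 70 + 0.5\,(70 - 70) = 70$ °F

First quartile : $q_n(0.25) = x_{(6)} + (0.25 \cdot 25 - 6)\,(x_{(7)} - x_{(6)}) = 66 + 0.25\,(66 - 66) = 66$ °F

Third quartile : $q_n(0.75) = x_{(18)} + (0.75 \cdot 25 - 18)\,(x_{(19)} - x_{(18)}) = 75 + 0.75\,(75 - 75) = 75$ °F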
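A minimal sketch of the quantile calculation follows, assuming the interpolation rule q_n(p) = x_(k) + α (x_(k+1) − x_(k)) with k = ⌊p (n + 1)⌋ and α = p (n + 1) − k, where x_(k) is the kth smallest value. The function name `empirical_quantile` is hypothetical, and clamping to the smallest or largest value when k falls outside 1..n−1 is an added assumption.

```python
from math import floor

def empirical_quantile(values, p):
    """Interpolated empirical quantile q_n(p) based on the position p * (n + 1)."""
    data = sorted(values)
    n = len(data)
    position = p * (n + 1)
    k = floor(position)
    alpha = position - k
    if k < 1:        # position falls before the first order statistic (assumed clamp)
        return data[0]
    if k >= n:       # position falls at or beyond the last order statistic (assumed clamp)
        return data[-1]
    # data is 0-indexed, so x_(k) is data[k - 1] and x_(k+1) is data[k].
    return data[k - 1] + alpha * (data[k] - data[k - 1])

temps = [57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81]
for p in (0.25, 0.5, 0.75):
    print(p, empirical_quantile(temps, p))
```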

Visualization

Figure 7. Boxplot and a probability density function (pdf) of a normal N(0, σ²) population

The box plot allows quick graphical examination of one or more data sets. Box plots may seem more primitive than a histogram or kernel density estimate, but they do have some advantages. They take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data (see Figure 1 for an example). The choice of the number and width of bins can heavily influence the appearance of a histogram, and the choice of bandwidth can heavily influence the appearance of a kernel density estimate.

Because most readers are more familiar with a statistical distribution than with a box plot, comparing the box plot against the probability density function (theoretical histogram) for a normal N(0, σ²) distribution may be a useful tool for understanding the box plot (Figure 7).
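The comparison with a normal population can also be made numerically. The sketch below takes σ = 1 and uses Python's `statistics.NormalDist` to locate the quartiles and the 1.5 IQR fences, then computes the share of the population that lies beyond the fences. The variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()                          # standard normal, sigma = 1
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
share_outside = z.cdf(lower_fence) + (1.0 - z.cdf(upper_fence))

print(f"Q1 = {q1:.4f} sigma, Q3 = {q3:.4f} sigma, IQR = {iqr:.4f} sigma")
print(f"fences at {lower_fence:.3f} and {upper_fence:.3f} sigma")
print(f"share of the population beyond the fences: {share_outside:.2%}")
```

Under these assumptions the box spans roughly ±0.67σ, the fences sit near ±2.7σ, and well under 1% of a normal population would be flagged as outliers.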

Figure 8. Boxplots displaying skew


References

  1. Spear, Mary Eleanor (1952). Charting Statistics. McGraw Hill. p. 166.
  2. Spear, Mary Eleanor (1969). Practical Charting Techniques. New York: McGraw-Hill. ISBN 0070600104. OCLC 924909765.
  3. Wickham, Hadley; Stryjewski, Lisa (November 29, 2011). "40 years of boxplots" (PDF). Retrieved December 11, 2019.
  4. Holmes, Alexander; Illowsky, Barbara; Dean, Susan. "Introductory Business Statistics". OpenStax.
  5. Dekking, F.M. (2005). A Modern Introduction to Probability and Statistics. Springer. pp. 234–238. ISBN 1-85233-896-2.
  6. Frigge, Michael; Hoaglin, David C.; Iglewicz, Boris (February 1989). "Some Implementations of the Boxplot". The American Statistician. 43 (1): 50–54. doi:10.2307/2685173. JSTOR 2685173.
  7. Marmolejo-Ramos, F.; Tian, S. (2010). "The shifting boxplot. A boxplot based on essential summary statistics around the mean". International Journal of Psychological Research. 3 (1): 37–46. doi:10.21500/20112084.823.
  8. McGill, Robert; Tukey, John W.; Larsen, Wayne A. (February 1978). "Variations of Box Plots". The American Statistician. 32 (1): 12–16. doi:10.2307/2683468. JSTOR 2683468.
  9. "R: Box Plot Statistics". R manual. Retrieved 26 June 2011.
  10. Hubert, M.; Vandervieren, E. (2008). "An adjusted boxplot for skewed distribution". Computational Statistics and Data Analysis. 52 (12): 5186–5201. CiteSeerX 10.1.1.90.9812. doi:10.1016/j.csda.2007.11.008.
  11. Wickham, Hadley; Stryjewski, Lisa (2011). "40 years of boxplots" (PDF).
