Seven-number summary


In descriptive statistics, the seven-number summary is a collection of seven summary statistics, and is an extension of the five-number summary. There are three similar, common forms.


As with the five-number summary, it can be represented by a modified box plot, adding hatch-marks on the "whiskers" for two of the additional numbers.

Seven-number summary

The following percentiles are (approximately) evenly spaced in value for a normally distributed variable:

Normal distribution seven summary numbers
Nr.  Percentile  More precise  Alternate name(s)
#1   2nd         2.15%         lower whisker bottom end
#2   9th         8.87%         lower whisker crosshatch mark
#3   25th        25.00%        lower quartile or first quartile
#4   50th        50.00%        median, middle value, or second quartile
#5   75th        75.00%        upper quartile or third quartile
#6   91st        91.13%        upper whisker crosshatch mark
#7   98th        97.85%        upper whisker top end

The middle three values (the lower quartile, median, and upper quartile) are the usual statistics from the five-number summary and are the standard values for the box in a box plot.

The two unusual percentiles at either end are used because the locations of all seven values will be approximately equally spaced if the data is normally distributed. [a] Some statistical tests require normally distributed data, so the plotted values provide a convenient visual check for validity of later tests, simply by scanning to see if the marks for those seven percentiles appear to be equal distances apart on the graph.
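The percentages in the table can be reproduced by stepping outward from the median in units of the quartile's z-score. The following is a minimal sketch using Python's standard `statistics.NormalDist`; the function name `equally_spaced_percentiles` is a hypothetical helper chosen for illustration, not something taken from the cited sources.

```python
from statistics import NormalDist

def equally_spaced_percentiles():
    """Percentiles of the standard normal at -3..3 equal steps from the median.

    One step is the z-score of the upper quartile, so steps -1, 0, and 1
    land exactly on the lower quartile, median, and upper quartile.
    """
    std_normal = NormalDist()            # mean 0, standard deviation 1
    step = std_normal.inv_cdf(0.75)      # about 0.6745
    return [100 * std_normal.cdf(k * step) for k in range(-3, 4)]

for p in equally_spaced_percentiles():
    print(f"{p:6.2f}%")
```

Rounded to two decimal places, the printed values match the "More precise" column: 2.15%, 8.87%, 25.00%, 50.00%, 75.00%, 91.13%, and 97.85%.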

Whereas the extreme values of the five-number summary depend on the number of samples, this seven-number summary does not. It is also somewhat more stable, because its whisker ends are the steadier 2nd and 98th percentiles rather than the sample extremes, which can swing wildly from one sample to the next.

The values can be represented using a modified box plot. The 2nd and 98th percentiles are represented by the ends of the whiskers, and hatch-marks across the whiskers mark the 9th and 91st percentiles.
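As a rough illustration of the visual check, the seven values can be computed from a sample and the gaps between neighbouring values compared numerically. The sketch below is only one way to do this. The helper names (`percentile`, `seven_number_summary`, `gaps`) are hypothetical, the simulated sample is made up for the example, and the linear-interpolation rule is just one of several common percentile conventions, so other software may give slightly different values.

```python
import random

# Percentiles from the "More precise" column of the table above.
PERCENTS = (2.15, 8.87, 25.0, 50.0, 75.0, 91.13, 97.85)

def percentile(sorted_data, pct):
    """Percentile by linear interpolation between the two closest ranks."""
    pos = (len(sorted_data) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_data) - 1)
    return sorted_data[lo] + (pos - lo) * (sorted_data[hi] - sorted_data[lo])

def seven_number_summary(data):
    """Return the seven summary values in increasing order."""
    ordered = sorted(data)
    return [percentile(ordered, p) for p in PERCENTS]

def gaps(summary):
    """Distances between neighbouring summary values."""
    return [b - a for a, b in zip(summary, summary[1:])]

# Simulated standard-normal sample (an assumption made for this example).
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
summary = seven_number_summary(sample)
print([round(g, 2) for g in gaps(summary)])
```

For normally distributed data the six gaps should all be close to 0.6745 standard deviations. Clearly unequal gaps suggest a departure from normality, although this is an informal check rather than a formal test.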

Bowley’s seven-figure summary

Arthur Bowley used a set of non-parametric statistics, called a "seven-figure summary", including the extremes, deciles, and quartiles, along with the median. [1]

Thus the numbers are:

Bowley’s seven summary figures [1]
Nr.  Percentile  Alternate name(s)
#1   0%          sample minimum (nominal: highest zero-th percentile)
#2   10%         first decile
#3   25%         lower quartile or first quartile
#4   50%         median, middle value, or second quartile
#5   75%         upper quartile or third quartile
#6   90%         last decile
#7   100%        sample maximum (nominal: lowest hundredth percentile)

Note that the middle five of the seven numbers are very nearly the same as those of the seven-number summary above.

The addition of the deciles allows one to compute the interdecile range, which for a normal distribution can be scaled to give a reasonably efficient estimate of the standard deviation, and the 10% midsummary, which when compared to the median gives an idea of the skewness in the tails.
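A minimal sketch of both quantities follows, assuming Python's standard `statistics` module. The function name `decile_stats` is hypothetical, and the choice of `method="inclusive"` is an assumption made for the example; other quantile conventions give slightly different deciles. The scaling uses the fact that, for a normal distribution, the first and last deciles lie about 1.2816 standard deviations on either side of the mean.

```python
from statistics import NormalDist, median, quantiles

def decile_stats(data):
    """Interdecile range, a normal-theory sigma estimate, and the 10% midsummary."""
    deciles = quantiles(data, n=10, method="inclusive")  # nine cut points
    d1, d9 = deciles[0], deciles[-1]
    idr = d9 - d1
    # For normal data, D9 - D1 = 2 * z_0.9 * sigma, with z_0.9 about 1.2816.
    sigma_hat = idr / (2 * NormalDist().inv_cdf(0.9))
    midsummary = (d1 + d9) / 2
    # A midsummary above the median hints at a longer upper tail, and vice versa.
    skew_hint = midsummary - median(data)
    return idr, sigma_hat, midsummary, skew_hint

idr, sigma_hat, midsummary, skew_hint = decile_stats(list(range(101)))
print(idr, round(sigma_hat, 2), midsummary, skew_hint)
```

The sigma estimate is only meaningful when the data are roughly normal. For the evenly spaced toy data used here it simply shows the arithmetic.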

Tukey’s seven-number summary

John Tukey used a seven-number summary consisting of the extremes, octiles, quartiles, and the median. [2]

The seven numbers are:

Tukey’s seven summary figures [2]
Nr.  Percentile  Alternate name(s)
#1   0%          sample minimum (nominal: highest zero-th percentile)
#2   12.5%       first octile
#3   25.0%       lower quartile or first quartile
#4   50.0%       median, middle value, or second quartile
#5   75.0%       upper quartile or third quartile
#6   87.5%       last octile
#7   100%        sample maximum (nominal: lowest hundredth percentile)

Note that the middle five of the seven numbers can all be obtained by successive partitioning of the ordered data into subsets of equal size. Extending the seven-number summary by continued partitioning produces the nine-number summary, the eleven-number summary, and so on.
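The successive halving can be sketched as follows. The function name `partition_summary_probs` and its `levels` parameter are hypothetical, and the sketch works with cumulative probabilities rather than with any particular rule for locating values in the ordered data, which is a simplification.

```python
from fractions import Fraction

def partition_summary_probs(levels):
    """Cumulative probabilities for a summary built by repeated halving.

    levels=2 gives the five-number summary, levels=3 the seven-number
    summary (extremes, octiles, quartiles, median), levels=4 the
    nine-number summary, and so on.
    """
    lower = [Fraction(1, 2 ** k) for k in range(levels, 0, -1)]  # ..., 1/8, 1/4, 1/2
    upper = [1 - p for p in reversed(lower[:-1])]                # 3/4, 7/8, ...
    return [Fraction(0)] + lower + upper + [Fraction(1)]

for levels in (3, 4):
    print([str(p) for p in partition_summary_probs(levels)])
```

With `levels=3` the probabilities are 0, 1/8, 1/4, 1/2, 3/4, 7/8, and 1, matching the table above. Each extra level adds one new pair of cut points, one just inside each extreme.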

  a. The seven equally spaced percentiles, to two decimal places, are 2.15%, 8.87%, 25.00%, 50.00%, 75.00%, 91.13%, and 97.85%. If one wishes to check for a symmetric distribution other than the normal (Gaussian) distribution, the listed outer pairs of quantiles (2.15% and 8.87% on the lower whisker, and 91.13% and 97.85% on the upper whisker) may be replaced by quantiles from the desired distribution whose spacing has been calculated to match the spacing between the median and the quartiles.


  1. Bowley, A. (1920). Elementary Manual of Statistics (3rd ed.). p. 62. "the seven positions are the maximum and minimum, median, quartiles, and two deciles"
  2. Tukey, J. (1977). Exploratory Data Analysis. Addison-Wesley Publishing Company. p. 53. ISBN 978-0-201-07616-5.