Seven-number summary

In descriptive statistics, the seven-number summary is a collection of seven summary statistics, and is an extension of the five-number summary. There are two similar, common forms.

As with the five-number summary, it can be represented by a modified box plot, adding hatch-marks on the "whiskers" for two of the additional numbers.

Seven-number summary

The following percentiles are (approximately) evenly spaced for a normally distributed variable:

  1. the 2nd percentile
  2. the 9th percentile
  3. the 25th percentile or lower quartile or first quartile
  4. the 50th percentile or median (middle value, or second quartile)
  5. the 75th percentile or upper quartile or third quartile
  6. the 91st percentile
  7. the 98th percentile

The middle three values (the lower quartile, median, and upper quartile) are the usual statistics from the five-number summary and are the standard values for the box in a box plot.

The two unusual percentiles at either end are used because the values of all seven percentiles will be approximately equally spaced if the data is normally distributed (to three digits of precision, the first four equally spaced percentiles are 2.15, 8.87, 25.0, and 50.0; the upper three mirror the lower three). Some statistical tests require normally distributed data, so the plotted values provide a convenient visual check on the validity of later tests: one simply scans the graph to see whether the marks for the seven percentiles appear to be equal distances apart.
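The equal-spacing claim can be checked numerically. The following is a minimal sketch, assuming a standard normal distribution and using the quartile's z-score as the spacing unit; the variable names are illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# z-score of the upper quartile (about 0.6745), used as the spacing unit.
step = std_normal.inv_cdf(0.75)

# Seven equally spaced z-scores: -3, -2, ..., +3 steps from the median.
z_scores = [k * step for k in range(-3, 4)]

# Convert each z-score back to a percentile.
percentiles = [100 * std_normal.cdf(z) for z in z_scores]

for k, p in zip(range(-3, 4), percentiles):
    print(f"{k:+d} steps from the median: {p:6.2f}th percentile")
```

The first four results should agree with the values quoted above (about 2.15, 8.87, 25.0, and 50.0), which the summary rounds to the 2nd, 9th, 25th, and 50th percentiles.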

Notice that whereas the extreme values of the five-number summary (the sample minimum and maximum) depend on the sample size, the percentiles of the seven-number summary do not.

The values can be represented using a modified box plot. The 2nd and 98th percentiles are represented by the ends of the whiskers, and hatch-marks across the whiskers mark the 9th and 91st percentiles.
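The seven values to be drawn could be computed from a sample along the following lines. This is only a sketch: the function name seven_number_summary and the sample data are hypothetical, and the "inclusive" interpolation method is an assumption, since several conventions for estimating percentiles exist and they give slightly different results.

```python
from statistics import quantiles

def seven_number_summary(data):
    """Return the 2nd, 9th, 25th, 50th, 75th, 91st and 98th percentiles."""
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99,
    # so the p-th percentile sits at index p - 1.
    # method="inclusive" is an assumed convention, not part of the definition.
    cuts = quantiles(data, n=100, method="inclusive")
    return [cuts[p - 1] for p in (2, 9, 25, 50, 75, 91, 98)]

sample = list(range(101))  # hypothetical data: 0, 1, ..., 100
print(seven_number_summary(sample))
```

In a plot, the first and last values would give the whisker ends, the second and sixth the hatch-marks, and the middle three the box.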

Bowley’s seven-figure summary

Arthur Bowley used a set of non-parametric statistics, called a "seven-figure summary", including the extremes, deciles, and quartiles, along with the median.[1]

Thus the numbers are:

  1. the sample minimum
  2. the 10th percentile (first decile)
  3. the 25th percentile or lower quartile or first quartile
  4. the 50th percentile or median (middle value, or second quartile)
  5. the 75th percentile or upper quartile or third quartile
  6. the 90th percentile (last decile)
  7. the sample maximum

Note that the middle five of the seven numbers are very nearly the same as for the seven-number summary, above.

The addition of the deciles allows one to compute the interdecile range, which for a normal distribution can be scaled to give a reasonably efficient estimate of the standard deviation, and the 10% midsummary, which when compared to the median gives an idea of the skewness in the tails.
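The quantities just described can be sketched as follows. The function names and the sample data are hypothetical, the "inclusive" percentile method is an assumed convention, and the scaling factor assumes normally distributed data, for which the first and ninth deciles lie about 1.2816 standard deviations either side of the mean.

```python
from statistics import NormalDist, quantiles

def bowley_summary(data):
    """Return the minimum, 10th, 25th, 50th, 75th, 90th percentiles, maximum."""
    # The p-th percentile is at index p - 1 of the 99 cut points.
    cuts = quantiles(data, n=100, method="inclusive")
    inner = [cuts[p - 1] for p in (10, 25, 50, 75, 90)]
    return [min(data), *inner, max(data)]

def sd_from_interdecile_range(summary):
    """Estimate the standard deviation, assuming normal data."""
    interdecile_range = summary[5] - summary[1]
    # Width of the interdecile range in standard deviations (about 2.563).
    scale = 2 * NormalDist().inv_cdf(0.9)
    return interdecile_range / scale

def ten_percent_midsummary(summary):
    """Average of the first and ninth deciles."""
    return (summary[1] + summary[5]) / 2

sample = list(range(101))  # hypothetical data: 0, 1, ..., 100
summary = bowley_summary(sample)
print(summary)
print(sd_from_interdecile_range(summary))
print(ten_percent_midsummary(summary), "vs median", summary[3])
```

A 10% midsummary well above or below the median would suggest skewness in the tails; for this symmetric sample the two coincide.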

References

  1. Bowley, Arthur (1920). Elementary Manual of Statistics (3rd ed.). p. 62. "the seven positions are the maximum and minimum, median, quartiles, and two deciles"