Stem-and-leaf display

Last updated June 01, 2023

A stem-and-leaf display or stem-and-leaf plot is a device for presenting quantitative data in a graphical format, similar to a histogram, to assist in visualizing the shape of a distribution. They evolved from Arthur Bowley's work in the early 1900s, and are useful tools in exploratory data analysis. Stemplots became more commonly used in the 1980s after the publication of John Tukey's book on exploratory data analysis in 1977.^[1] The popularity during those years is attributable to their use of monospaced (typewriter) typestyles that allowed computer technology of the time to easily produce the graphics. Modern computers' superior graphic capabilities have meant these techniques are less often used.

A stem-and-leaf plot is also called a stemplot, but the latter term often refers to another chart type. A simple stem plot may refer to plotting a matrix of y values onto a common x axis, and identifying the common x value with a vertical line, and the individual y values with symbols on the line.^[4]

Unlike histograms, stem-and-leaf displays retain the original data to at least two significant digits, and put the data in order, thereby easing the move to order-based inference and non-parametric statistics.

Construction

To construct a stem-and-leaf display, the observations must first be sorted in ascending order: this can be done most easily if working by hand by constructing a draft of the stem-and-leaf display with the leaves unsorted, then sorting the leaves to produce the final stem-and-leaf display. Here is the sorted set of data values that will be used in the following example:

44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106

Next, it must be determined what the stems will represent and what the leaves will represent. Typically, the leaf contains the last digit of the number and the stem contains all of the other digits. In the case of very large numbers, the data values may be rounded to a particular place value (such as the hundreds place) that will be used for the leaves. The remaining digits to the left of the rounded place value are used as the stem.

In this example, the leaf represents the ones place and the stem will represent the rest of the number (tens place and higher).

The stem-and-leaf display is drawn with two columns separated by a vertical line. The stems are listed to the left of the vertical line. It is important that each stem is listed only once and that no numbers are skipped, even if it means that some stems have no leaves. The leaves are listed in increasing order in a row to the right of each stem.

It is important to note that when there is a repeated number in the data (such as two 72s) then the plot must reflect such (so the plot would look like 7 | 2 2 5 6 7 when it has the numbers 72 72 75 76 77).

{\begin{array}{r|l}{\text{Stem}}&{\text{Leaf}}\\\hline 4&4~6~7~9\\5&\\6&3~4~6~8~8\\7&2~2~5~6\\8&1~4~8\\9&\\10&6\end{array}}

Key:

6\mid 3=63

Leaf unit: 1.0

Stem unit: 10.0

Rounding may be needed to create a stem-and-leaf display. Based on the following set of data, the stem plot below would be created:

−23.678758, −12.45, −3.4, 4.43, 5.5, 5.678, 16.87, 24.7, 56.8

For negative numbers, a negative is placed in front of the stem unit, which is still the value X / 10. Non-integers are rounded. This allowed the stem and leaf plot to retain its shape, even for more complicated data sets. As in this example below:

{\begin{array}{r|l}{\text{Stem}}&{\text{Leaf}}\\\hline -2&4\\-1&2\\-0&3\\0&4~6~6\\1&7\\2&5\\3&\\4&\\5&7\end{array}}

Key:

-2\mid 4=-24

Usage

Stem-and-leaf displays are useful for displaying the relative density and shape of the data, giving the reader a quick overview of the distribution. They retain (most of) the raw numerical data, often with perfect integrity. They are also useful for highlighting outliers and finding the mode. However, stem-and-leaf displays are only useful for moderately sized data sets (around 15–150 data points). With very small data sets a stem-and-leaf displays can be of little use, as a reasonable number of data points are required to establish definitive distribution properties. A dot plot may be better suited for such data. With very large data sets, a stem-and-leaf display will become very cluttered, since each data point must be represented numerically. A box plot or histogram may become more appropriate as the data size increases.

Non-numerical use

a│abdeghilmnrstwxy b│aeioy c│h d│aeio e│adefhlmnrstwx f|aey g│iou h│aeimo i│dfnost j│ao k│aioy l│aio m│aeimouy n│aeouy o│bdefhikmnoprsuwxy p│aeio q│i r│e s│hiot t│aeio u│ghmnprst v│ w│eo x│iu y│aeou z│aeo

Stem-and-leaf displays can also be used to convey non-numerical information. In this example of valid two-letter words in Collins Scrabble Words (the word list used in Scrabble tournaments outside the US) with their initials as stems, it can be easily seen that the three most common initials are o, a and e.^[5]

{{{annotations}}}

Some railway timetables use stem-and-leaf displays with hours as stems and minutes as leaves

Notes

↑ Tukey, John W. (1977). Exploratory Data Analysis (1 ed.). Pearson. ISBN 0-201-07616-0.
↑ Function in Octave
↑ Function in R
↑ Examples: MATLAB's and Matplotlib's stem functions. They do not create a stem-and-leaf display.
↑ Gideon Goldin, Two-Letter Scrabble Words Visualized as Stem and Leaf Plot, 2020-10-01

Related Research Articles

A descriptive statistic is a summary statistic that quantitatively describes or summarizes features from a collection of information, while descriptive statistics is the process of using and analysing those statistics. Descriptive statistics is distinguished from inferential statistics by its aim to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, is not developed on the basis of probability theory, and are frequently nonparametric statistics. Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example, in papers reporting on human subjects, typically a table is included giving the overall sample size, sample sizes in important subgroups, and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, the proportion of subjects with related co-morbidities, etc.

A histogram is an approximate representation of the distribution of numerical data. The term was first introduced by Karl Pearson. To construct a histogram, the first step is to "bin" the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often of equal size.

A numeral system is a writing system for expressing numbers; that is, a mathematical notation for representing numbers of a given set, using digits or other symbols in a consistent manner.

In statistics, a quartile is a type of quantile which divides the number of data points into four parts, or quarters, of more-or-less equal size. The data must be ordered from smallest to largest to compute quartiles; as such, quartiles are a form of order statistic. The three main quartiles are as follows:

In mathematics and computing, Fibonacci coding is a universal code which encodes positive integers into binary code words. It is one example of representations of integers based on Fibonacci numbers. Each code word ends with "11" and contains no other instances of "11" before the end.

<span class="mw-page-title-main">Subtraction</span> One of the four basic arithmetic operations

Subtraction is one of the four arithmetic operations along with addition, multiplication and division. Subtraction is an operation that represents removal of objects from a collection. For example, in the adjacent picture, there are 5 − 2 peaches—meaning 5 peaches with 2 taken away, resulting in a total of 3 peaches. Therefore, the difference of 5 and 2 is 3; that is, 5 − 2 = 3. While primarily associated with natural numbers in arithmetic, subtraction can also represent removing or decreasing physical and abstract quantities using different kinds of objects including negative numbers, fractions, irrational numbers, vectors, decimals, functions, and matrices.

<span class="mw-page-title-main">Box plot</span> Data visualization

In descriptive statistics, a box plot or boxplot is a method for graphically demonstrating the locality, spread and skewness groups of numerical data through their quartiles. In addition to the box on a box plot, there can be lines extending from the box indicating variability outside the upper and lower quartiles, thus, the plot is also termed as the box-and-whisker plot and the box-and-whisker diagram. Outliers that differ significantly from the rest of the dataset may be plotted as individual points beyond the whiskers on the box-plot. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution. The spacings in each subsection of the box-plot indicate the degree of dispersion (spread) and skewness of the data, which are usually described using the five-number summary. In addition, the box-plot allows one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically.

In analytic number theory and related branches of mathematics, a complex-valued arithmetic function $:\mathbb {Z} \rightarrow \mathbb {C} }$ is a Dirichlet character of modulus if for all integers $and :$

<span class="mw-page-title-main">Synthetic division</span> Algorithm for Euclidean division of polynomials

In algebra, synthetic division is a method for manually performing Euclidean division of polynomials, with less writing and fewer calculations than long division.

An odds ratio (OR) is a statistic that quantifies the strength of the association between two events, A and B. The odds ratio is defined as the ratio of the odds of A in the presence of B and the odds of A in the absence of B, or equivalently, the ratio of the odds of B in the presence of A and the odds of B in the absence of A. Two events are independent if and only if the OR equals 1, i.e., the odds of one event are the same in either the presence or absence of the other event. If the OR is greater than 1, then A and B are associated (correlated) in the sense that, compared to the absence of B, the presence of B raises the odds of A, and symmetrically the presence of A raises the odds of B. Conversely, if the OR is less than 1, then A and B are negatively correlated, and the presence of one event reduces the odds of the other event.

In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling and thereby contrasts traditional hypothesis testing. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.

In image processing and photography, a color histogram is a representation of the distribution of colors in an image. For digital images, a color histogram represents the number of pixels that have colors in each of a fixed list of color ranges, that span the image's color space, the set of all possible colors.

In mathematics, the Kronecker product, sometimes denoted by ⊗, is an operation on two matrices of arbitrary size resulting in a block matrix. It is a specialization of the tensor product from vectors to matrices and gives the matrix of the tensor product linear map with respect to a standard choice of basis. The Kronecker product is to be distinguished from the usual matrix multiplication, which is an entirely different operation. The Kronecker product is also sometimes called matrix direct product.

In theoretical physics, a minimal model or Virasoro minimal model is a two-dimensional conformal field theory whose spectrum is built from finitely many irreducible representations of the Virasoro algebra. Minimal models have been classified and solved, and found to obey an ADE classification. The term minimal model can also refer to a rational CFT based on an algebra that is larger than the Virasoro algebra, such as a W-algebra.

In statistics, McNemar's test is a statistical test used on paired nominal data. It is applied to 2 × 2 contingency tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal. It is named after Quinn McNemar, who introduced it in 1947. An application of the test in genetics is the transmission disequilibrium test for detecting linkage disequilibrium.

<span class="mw-page-title-main">Plot (graphics)</span>

A plot is a graphical technique for representing a data set, usually as a graph showing the relationship between two or more variables. The plot can be drawn by hand or by a computer. In the past, sometimes mechanical or electronic plotters were used. Graphs are a visual representation of the relationship between variables, which are very useful for humans who can then quickly derive an understanding which may not have come from lists of values. Given a scale or ruler, graphs can also be used to read off the value of an unknown variable plotted as a function of a known one, but this can also be done with data presented in tabular form. Graphs of functions are used in mathematics, sciences, engineering, technology, finance, and other areas.

The table of chords, created by the Greek astronomer, geometer, and geographer Ptolemy in Egypt during the 2nd century AD, is a trigonometric table in Book I, chapter 11 of Ptolemy's Almagest, a treatise on mathematical astronomy. It is essentially equivalent to a table of values of the sine function. It was the earliest trigonometric table extensive enough for many practical purposes, including those of astronomy. Since the 8th and 9th centuries, the sine and other trigonometric functions have been used in Islamic mathematics and astronomy, reforming the production of sine tables. Khwarizmi and Habash al-Hasib later produced a set of trigonometric tables.

In computer science, a segmented scan is a modification of the prefix sum with an equal-sized array of flag bits to denote segment boundaries on which the scan should be performed.

In mathematics, the Khatri–Rao product of matrices is defined as

References

Wild, C. and Seber, G. (2000) Chance Encounters: A First Course in Data Analysis and Inference pp. 49–54 John Wiley and Sons. ISBN 0-471-32936-3
Elliott, Jane; Catherine Marsh (2008). Exploring Data: An Introduction to Data Analysis for Social Scientists (2nd ed.). Polity Press. ISBN 0-7456-2282-8.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Tukey, John W. (1977). Exploratory Data Analysis (1 ed.). Pearson. ISBN 0-201-07616-0.

[2] Function in Octave

[3] Function in R

[4] Examples: MATLAB's and Matplotlib's stem functions. They do not create a stem-and-leaf display.

[5] Gideon Goldin, Two-Letter Scrabble Words Visualized as Stem and Leaf Plot, 2020-10-01

[1]

[2]

[3]

[4]

[5]