Summary statistics

Box plot of the Michelson–Morley experiment, showing several summary statistics.

In descriptive statistics, summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible. Statisticians commonly try to describe the observations in a measure of location, or central tendency, such as the arithmetic mean; a measure of statistical dispersion, like the standard deviation; a measure of the shape of the distribution, like skewness or kurtosis; and, if more than one variable is measured, a measure of statistical dependence, such as a correlation coefficient.


A common collection of order statistics used as summary statistics is the five-number summary, sometimes extended to a seven-number summary, and the associated box plot.
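As an illustration, the five-number summary (minimum, lower quartile, median, upper quartile, maximum) can be read off directly from sample percentiles. The sketch below is a minimal example using NumPy; the data are illustrative, not drawn from any particular source.

    import numpy as np

    data = np.array([7, 15, 36, 39, 40, 41], dtype=float)  # illustrative sample

    # Five-number summary: the 0th, 25th, 50th, 75th and 100th percentiles
    minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
    print(minimum, q1, median, q3, maximum)

These five values are also what a box plot displays graphically.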

Entries in an analysis of variance table can also be regarded as summary statistics.[1]

Examples

Location

Common measures of location, or central tendency, are the arithmetic mean, median, mode, and interquartile mean.[2][3]
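A minimal sketch of these location measures in Python (the unweighted interquartile mean below is one simple convention; implementations differ in how they treat observations exactly at the quartiles):

    import statistics
    import numpy as np

    data = [2, 3, 3, 5, 7, 10, 11, 13, 30]  # illustrative sample

    mean = statistics.mean(data)      # arithmetic mean
    median = statistics.median(data)  # middle value
    mode = statistics.mode(data)      # most frequent value

    # Interquartile mean: mean of the observations lying between Q1 and Q3
    q1, q3 = np.percentile(data, [25, 75])
    iq_mean = statistics.mean(x for x in data if q1 <= x <= q3)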

Spread

Common measures of statistical dispersion are the standard deviation, variance, range, interquartile range, absolute deviation, mean absolute difference and the distance standard deviation. Measures that assess spread in comparison to the typical size of data values include the coefficient of variation.
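The following sketch computes several of these dispersion measures with NumPy (sample rather than population conventions are assumed via ddof=1):

    import numpy as np

    data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)  # illustrative sample

    std = data.std(ddof=1)                         # sample standard deviation
    var = data.var(ddof=1)                         # sample variance
    value_range = data.max() - data.min()          # range
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1                                  # interquartile range
    abs_dev = np.mean(np.abs(data - data.mean()))  # mean absolute deviation about the mean
    cv = std / data.mean()                         # coefficient of variation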

The Gini coefficient was originally developed to measure income inequality and is equivalent to one of the L-moments.
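That equivalence makes the Gini coefficient straightforward to compute from the mean absolute difference: for a sample, the Gini coefficient is half the relative mean absolute difference. A naive O(n²) sketch, with illustrative income figures:

    import numpy as np

    incomes = np.array([20_000, 30_000, 30_000, 50_000, 120_000], dtype=float)

    # Mean absolute difference over all ordered pairs (self-pairs included)
    mean_abs_diff = np.abs(incomes[:, None] - incomes[None, :]).mean()

    # Gini coefficient = (mean absolute difference / mean) / 2
    gini = mean_abs_diff / incomes.mean() / 2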

A simple summary of a dataset is sometimes given by quoting particular order statistics as approximations to selected percentiles of a distribution.
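For example, under the nearest-rank convention the p-th percentile of a sorted sample is approximated by its ceil(p*n/100)-th order statistic; other conventions interpolate between neighbouring order statistics instead. A sketch:

    import math

    def nearest_rank_percentile(sorted_data, p):
        """Approximate the p-th percentile by the ceil(p*n/100)-th order statistic."""
        n = len(sorted_data)
        k = max(1, math.ceil(p * n / 100))
        return sorted_data[k - 1]

    data = sorted([15, 20, 35, 40, 50, 55, 60, 70, 80, 95])  # illustrative
    print(nearest_rank_percentile(data, 50))  # median-like value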

Shape

Common measures of the shape of a distribution are skewness or kurtosis, while alternatives can be based on L-moments. A different measure is the distance skewness, for which a value of zero implies central symmetry.
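A minimal sketch using SciPy for the moment-based measures, plus one simple sample version of distance skewness (applied here to mean-centred data, so that a value near zero suggests symmetry about the mean; the pairwise computation is O(n²)):

    import numpy as np
    from scipy import stats

    data = np.array([2, 3, 3, 4, 4, 4, 5, 5, 9, 14], dtype=float)  # right-skewed

    skewness = stats.skew(data)      # third standardized moment
    kurtosis = stats.kurtosis(data)  # excess kurtosis (normal distribution -> 0)

    # Sample distance skewness on centred data: 1 - E|Y - Y'| / E|Y + Y'|
    y = data - data.mean()
    d_skew = 1 - np.abs(y[:, None] - y[None, :]).mean() / np.abs(y[:, None] + y[None, :]).mean()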

Dependence

The common measure of dependence between paired random variables is the Pearson product-moment correlation coefficient, while a common alternative summary statistic is Spearman's rank correlation coefficient. A value of zero for the distance correlation implies independence.
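A minimal sketch with SciPy (the distance correlation is not part of SciPy itself; third-party packages such as dcor provide it):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])  # illustrative paired data

    pearson_r, _ = stats.pearsonr(x, y)      # strength of linear association
    spearman_rho, _ = stats.spearmanr(x, y)  # strength of monotonic association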

Human perception of summary statistics

Humans use summary statistics to quickly and efficiently perceive the gist of auditory and visual information.[4][5][6]

Related Research Articles

In statistics, a central tendency is a central or typical value for a probability distribution. It may also be called a center or location of the distribution. Colloquially, measures of central tendency are often called averages. The term central tendency dates from the late 1920s.

A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information, while descriptive statistics is the process of using and analysing those statistics. Descriptive statistics is distinguished from inferential statistics by its aim to summarize a sample, rather than use the data to learn about the population that the sample is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory and are frequently non-parametric statistics. Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example, in papers reporting on human subjects, a table is typically included giving the overall sample size, sample sizes in important subgroups, and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, and the proportion of subjects with related co-morbidities.

In probability theory and statistics, kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution and, like skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population. Different measures of kurtosis may have different interpretations.

Median: the middle quantile of a data set or probability distribution

In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as the "middle" value. For example, in the data set [1, 3, 3, 6, 7, 8, 9], the median is 6, the fourth largest and also the fourth smallest number in the sample. For a continuous probability distribution, the median is the value such that a number is equally likely to fall above or below it.

Skewness: a measure of the asymmetry of random variables

In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or undefined.

Correlation and dependence

In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the broadest sense correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the price of a good and the quantity consumers are willing to purchase, as depicted in the demand curve.

Note: in some papers, the mean absolute error (MAE) is abbreviated as MAD.

In mathematics, a moment is a specific quantitative measure of the shape of a function. It is used in both mechanics and statistics. If the function represents physical density, then the zeroth moment is the total mass, the first moment divided by the total mass is the center of mass, and the second moment is the rotational inertia. If the function is a probability distribution, then the zeroth moment is the total probability, the first moment is the expected value, the second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis. The mathematical concept is closely related to the concept of moment in physics.
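As a sketch of the statistical case, the standardized sample moments can be built up from central moments (SciPy's stats.moment computes central moments; the kurtosis here is the plain fourth standardized moment, not excess kurtosis):

    import numpy as np
    from scipy import stats

    data = np.array([1.2, 1.9, 2.4, 2.8, 3.1, 3.9, 5.5])  # illustrative sample

    mean = data.mean()                                  # first raw moment
    variance = stats.moment(data, 2)                    # second central moment
    skewness = stats.moment(data, 3) / variance ** 1.5  # third standardized moment
    kurtosis = stats.moment(data, 4) / variance ** 2    # fourth standardized moment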

In probability theory and statistics, the coefficient of variation (CV), also known as relative standard deviation (RSD), is a standardized measure of dispersion of a probability distribution or frequency distribution. It is often expressed as a percentage, and is defined as the ratio of the standard deviation to the mean. The CV or RSD is widely used in analytical chemistry to express the precision and repeatability of an assay. It is also commonly used in fields such as engineering or physics when doing quality assurance studies and ANOVA gauge R&R. In addition, CV is utilized by economists and investors in economic models.
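A minimal sketch (SciPy's stats.variation uses the population standard deviation by default; pass ddof=1 for the sample convention):

    import numpy as np
    from scipy import stats

    replicates = np.array([9.8, 10.1, 10.0, 9.9, 10.2])  # illustrative assay readings

    cv = stats.variation(replicates, ddof=1)  # standard deviation / mean
    print(f"CV = {cv:.2%}")                   # expressed as a percentage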


The mean absolute difference (univariate) is a measure of statistical dispersion equal to the average absolute difference of two independent values drawn from a probability distribution. A related statistic is the relative mean absolute difference, which is the mean absolute difference divided by the arithmetic mean, and equal to twice the Gini coefficient. The mean absolute difference is also known as the absolute mean difference and the Gini mean difference (GMD). The mean absolute difference is sometimes denoted by Δ or as MD.

In statistics, normality tests are used to determine if a data set is well-modeled by a normal distribution and to compute how likely it is for a random variable underlying the data set to be normally distributed.
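As one illustration, the Shapiro–Wilk test in SciPy returns a test statistic and a p-value; a small p-value is evidence against normality:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    sample = rng.normal(loc=0.0, scale=1.0, size=200)  # simulated normal data

    stat, p_value = stats.shapiro(sample)  # Shapiro-Wilk normality test
    # Here a large p-value is expected, since the sample really is normal.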

In probability theory and statistics, the index of dispersion, dispersion index, coefficient of dispersion, relative variance, or variance-to-mean ratio (VMR), like the coefficient of variation, is a normalized measure of the dispersion of a probability distribution: it is a measure used to quantify whether a set of observed occurrences is clustered or dispersed compared to a standard statistical model.
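A minimal sketch: for Poisson counts the variance-to-mean ratio is about 1, so a VMR well above 1 suggests clustering (over-dispersion) and well below 1 suggests regularity:

    import numpy as np

    rng = np.random.default_rng(0)
    counts = rng.poisson(lam=4.0, size=10_000)  # simulated Poisson counts

    vmr = counts.var(ddof=1) / counts.mean()    # variance-to-mean ratio
    print(vmr)  # close to 1 for Poisson data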

L-estimator

In statistics, an L-estimator is an estimator which is an L-statistic – a linear combination of order statistics of the measurements. This can be as little as a single point, as in the median, or as many as all points, as in the mean.
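Tukey's trimean and the midrange are two simple L-estimators; a sketch:

    import numpy as np

    data = np.array([1, 2, 3, 5, 8, 13, 21, 34], dtype=float)  # illustrative

    # Trimean: quartiles weighted 1:2:1, a linear combination of order statistics
    q1, med, q3 = np.percentile(data, [25, 50, 75])
    trimean = (q1 + 2 * med + q3) / 4

    # Midrange: uses only the two extreme order statistics
    midrange = (data.min() + data.max()) / 2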

In statistics, a trimmed estimator is an estimator derived from another estimator by excluding some of the extreme values, a process called truncation. This is generally done to obtain a more robust statistic, and the extreme values are considered outliers. Trimmed estimators also often have higher efficiency for mixture distributions and heavy-tailed distributions than the corresponding untrimmed estimator, at the cost of lower efficiency for other distributions, such as the normal distribution.
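SciPy's trim_mean is one such estimator; the sketch below trims the most extreme observation from each tail of an illustrative sample containing a gross outlier:

    import numpy as np
    from scipy import stats

    data = np.array([2.0, 2.1, 2.3, 2.4, 2.5, 2.7, 2.8, 95.0])  # one outlier

    plain_mean = data.mean()                    # pulled upward by the outlier
    robust_mean = stats.trim_mean(data, 0.125)  # cut 12.5% from each tail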

Statistical dispersion: general term for how spread out the data are

In statistics, dispersion is the extent to which a distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance, standard deviation, and interquartile range.

Univariate analysis is perhaps the simplest form of statistical analysis. Like other forms of statistics, it can be inferential or descriptive. The key fact is that only one variable is involved.

In statistics and probability theory, the nonparametric skew is a statistic occasionally used with random variables that take real values. It is a measure of the skewness of a random variable's distribution—that is, the distribution's tendency to "lean" to one side or the other of the mean. Its calculation does not require any knowledge of the form of the underlying distribution—hence the name nonparametric. It has some desirable properties: it is zero for any symmetric distribution; it is unaffected by a scale shift; and it reveals either left- or right-skewness equally well. In some statistical samples it has been shown to be less powerful than the usual measures of skewness in detecting departures of the population from normality.
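A minimal sketch of the statistic, (mean - median) / standard deviation:

    import numpy as np

    data = np.array([1, 2, 2, 3, 3, 3, 4, 9], dtype=float)  # right-skewed sample

    s = (data.mean() - np.median(data)) / data.std(ddof=1)
    print(s)  # positive here, indicating right skew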

References

  1. Upton, G.; Cook, I. (2006). Oxford Dictionary of Statistics. OUP. ISBN 978-0-19-954145-4.
  2. Bullen, P. (2003). Handbook of Means and Their Inequalities. Springer.
  3. Grabisch, M.; Marichal, J.-L.; Mesiar, R.; Pap, E. (2009). Aggregation Functions. Oxford University Press.
  4. Piazza, Elise A.; Sweeny, Timothy D.; Wessel, David; Silver, Michael A.; Whitney, David (2013). "Humans Use Summary Statistics to Perceive Auditory Sequences". Psychological Science. 24 (8): 1389–1397. doi:10.1177/0956797612473759. PMC 4381997. PMID 23761928.
  5. Alexander, R. G.; Schmidt, J.; Zelinsky, G. Z. (2014). "Are summary statistics enough? Evidence for the importance of shape in guiding visual search". Visual Cognition. 22 (3–4): 595–609. doi:10.1080/13506285.2014.890989. PMC 4500174. PMID 26180505.
  6. Utochkin, Igor S. (2015). "Ensemble summary statistics as a basis for rapid visual categorization". Journal of Vision. 15 (4): 8. doi:10.1167/15.4.8. PMID 26317396.