Bagplot

Last updated April 16, 2024

A bagplot, or starburst plot,^[1]^[2] is a method in robust statistics for visualizing two- or three-dimensional statistical data, analogous to the one-dimensional box plot. Introduced in 1999 by Rousseeuw et al., the bagplot allows one to visualize the location, spread, skewness, and outliers of a data set.^[3]

Construction

The bagplot consists of three nested polygons, called the "bag", the "fence", and the "loop".

The inner polygon, called the bag, is constructed on the basis of Tukey depth, the smallest number of observations that can be contained by a half-plane that also contains a given point.^[4] It contains at most 50% of the data points.
The outermost of the three polygons, called the fence, is not drawn as part of the bagplot, but is used to construct it. It is formed by inflating the bag by a certain factor (usually 3). Observations outside the fence are flagged as outliers.^[5]
The observations that are not marked as outliers are surrounded by a loop, the convex hull of the observations within the fence.^[6]
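
The Tukey depth used to build the bag can be illustrated with a minimal sketch. The function name `tukey_depth` and the brute-force strategy are assumptions made for illustration, not the algorithm of the cited sources: the code tries one direction inside every angular gap between the data points as seen from the query point, which takes quadratic time and relies on ordinary floating-point arithmetic, so it is not robust for degenerate (for example, exactly collinear) inputs.

```python
import math


def tukey_depth(p, points):
    """Smallest number of points in any closed half-plane whose boundary passes through p.

    Illustrative O(n^2) sketch; `p` is an (x, y) pair and `points` a list of pairs.
    """
    px, py = p
    offsets = [(x - px, y - py) for x, y in points]

    # Directions u at which some data point lies exactly on the boundary line.
    critical = []
    for dx, dy in offsets:
        if dx == 0 and dy == 0:
            continue  # a point equal to p is in every closed half-plane
        phi = math.atan2(dy, dx)
        critical.append((phi + math.pi / 2) % (2 * math.pi))
        critical.append((phi - math.pi / 2) % (2 * math.pi))

    if not critical:
        return len(points)

    critical.sort()
    best = len(points)
    for k, a in enumerate(critical):
        # Next critical angle, wrapping around the circle after the last one.
        b = critical[(k + 1) % len(critical)]
        if k == len(critical) - 1:
            b += 2 * math.pi
        mid = (a + b) / 2
        ux, uy = math.cos(mid), math.sin(mid)
        # Count the points in the closed half-plane {x : u . (x - p) >= 0}.
        count = sum(1 for dx, dy in offsets if ux * dx + uy * dy >= 0)
        best = min(best, count)
    return best
```

For the four corners of a unit square plus its center, the center has the highest depth in the set and the corners the lowest, while a point far outside the data has depth zero.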

An asterisk (*) near the center of the graph marks the depth median, the point with the highest possible Tukey depth. Each observation between the bag and the fence is marked by a line segment connecting it to the bag, drawn along the line from the observation to the depth median.
The three-dimensional version consists of an inner and outer bag.^[7] The outer bag must be drawn in transparent colors so that the inner bag remains visible.
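
The fence and loop steps of the two-dimensional construction can be sketched as follows. This is a sketch under stated assumptions rather than a reference implementation: the bag is taken as given (a convex polygon with vertices in counter-clockwise order), the inflation is assumed to be centered on the depth median, and all function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def inflate(polygon, center, factor=3.0):
    """Scale a polygon about `center`; assumed here to be the depth median."""
    cx, cy = center
    return [(cx + factor * (x - cx), cy + factor * (y - cy)) for x, y in polygon]


def inside_convex(point, polygon):
    """True if `point` is in or on a convex polygon given counter-clockwise."""
    n = len(polygon)
    return all(cross(polygon[i], polygon[(i + 1) % n], point) >= 0 for i in range(n))


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def fence_and_loop(bag, depth_median, data, factor=3.0):
    """Return (fence, loop, outliers) for a given bag polygon and data set."""
    fence = inflate(bag, depth_median, factor)
    inliers = [p for p in data if inside_convex(p, fence)]
    outliers = [p for p in data if not inside_convex(p, fence)]
    # The loop is the convex hull of the observations that are not outliers.
    return fence, convex_hull(inliers), outliers
```

Computing the bag itself from Tukey depth regions is the harder part of the construction and is deliberately left out of this sketch.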

Properties

The bagplot is invariant under affine transformations of the plane, and robust against outliers.^[8]
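
The invariance can be written out for the Tukey depth on which the bag is based. The notation below (d for the depth, X = {x_1, ..., x_n} for the data set, A and b for the transformation) is introduced here for illustration and is not taken from the cited sources.

$$
d(\theta; X) = \min_{\|u\| = 1} \#\{\, i : u^{\top}(x_i - \theta) \ge 0 \,\},
\qquad
d(A\theta + b;\ AX + b) = d(\theta; X)
\quad \text{for every nonsingular } A \in \mathbb{R}^{2 \times 2},\ b \in \mathbb{R}^{2}.
$$

The equality holds because an affine map sends half-planes to half-planes and preserves which observations they contain, so the depth median and the bag of the transformed data are the images of the original ones under the same map.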

Related Research Articles

In descriptive statistics, the interquartile range (IQR) is a measure of statistical dispersion, which is the spread of the data. The IQR may also be called the midspread, middle 50%, fourth spread, or H‑spread. It is defined as the difference between the 75th and 25th percentiles of the data. To calculate the IQR, the data set is divided into quartiles, or four rank-ordered even parts via linear interpolation. These quartiles are denoted by Q₁ (also called the lower quartile), Q₂ (the median), and Q₃ (also called the upper quartile). The lower quartile corresponds with the 25th percentile and the upper quartile corresponds with the 75th percentile, so IQR = Q₃ − Q₁.

In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value. The basic feature of the median in describing data compared to the mean is that it is not skewed by a small proportion of extremely large or small values, and therefore provides a better representation of the center. Median income, for example, may be a better way to describe the center of the income distribution because increases in the largest incomes alone have no effect on the median. For this reason, the median is of central importance in robust statistics.

In statistics, quartiles are a type of quantile that divides the number of data points into four parts, or quarters, of more-or-less equal size. The data must be ordered from smallest to largest to compute quartiles; as such, quartiles are a form of order statistic. The three quartiles result in four data divisions.

In geometry, the convex hull, convex envelope or convex closure of a shape is the smallest convex set that contains it. The convex hull may be defined either as the intersection of all convex sets containing a given subset of a Euclidean space, or equivalently as the set of all convex combinations of points in the subset. For a bounded subset of the plane, the convex hull may be visualized as the shape enclosed by a rubber band stretched around the subset.

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, an indication of novel data, or the result of experimental error; the latter are sometimes excluded from the data set. An outlier can indicate an exciting possibility, but can also cause serious problems in statistical analyses.

In descriptive statistics, a box plot or boxplot is a method for graphically demonstrating the locality, spread and skewness of groups of numerical data through their quartiles. In addition to the box on a box plot, there can be lines extending from the box indicating variability outside the upper and lower quartiles; the plot is therefore also called the box-and-whisker plot or the box-and-whisker diagram. Outliers that differ significantly from the rest of the dataset may be plotted as individual points beyond the whiskers on the box plot. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution. The spacings in each subsection of the box plot indicate the degree of dispersion (spread) and skewness of the data, which are usually described using the five-number summary. In addition, the box plot allows one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically.

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling, and it thereby contrasts with traditional hypothesis testing. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, handling missing values, and making transformations of variables as needed. EDA encompasses IDA.

Robust statistics are statistics which maintain their properties even if the underlying distributional assumptions are incorrect. Robust statistical methods have been developed for many common problems, such as estimating location, scale, and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from a parametric distribution. For example, robust methods work well for mixtures of two normal distributions with different standard deviations; under this model, non-robust methods like a t-test work poorly.

In data analysis, anomaly detection is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behavior. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.

A radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. The relative position and angle of the axes is typically uninformative, but various heuristics, such as algorithms that plot data as the maximal total area, can be applied to sort the variables (axes) into relative positions that reveal distinct correlations, trade-offs, and a multitude of other comparative measures.

In statistics, the median absolute deviation (MAD) is a robust measure of the variability of a univariate sample of quantitative data. It can also refer to the population parameter that is estimated by the MAD calculated from a sample.

In statistics and computational geometry, the notion of centerpoint is a generalization of the median to data in higher-dimensional Euclidean space. Given a set of points in d-dimensional space, a centerpoint of the set is a point such that any hyperplane that goes through that point divides the set of points in two roughly equal subsets: the smaller part should have at least a 1/(d + 1) fraction of the points. Like the median, a centerpoint need not be one of the data points. Every non-empty set of points (with no duplicates) has at least one centerpoint.

Least absolute deviations (LAD), also known as least absolute errors (LAE), least absolute residuals (LAR), or least absolute values (LAV), is a statistical optimality criterion and a statistical optimization technique based on minimizing the sum of absolute deviations or the L₁ norm of such values. It is analogous to the least squares technique, except that it is based on absolute values instead of squared values. It attempts to find a function which closely approximates a set of data by minimizing residuals between points generated by the function and corresponding data points. The LAD estimate also arises as the maximum likelihood estimate if the errors have a Laplace distribution. It was introduced in 1757 by Roger Joseph Boscovich.

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology can produce many measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.

ELKI is a data mining software framework developed for use in research and teaching. It was originally created by the database systems research unit at the Ludwig Maximilian University of Munich, Germany, led by Professor Hans-Peter Kriegel. The project has continued at the Technical University of Dortmund, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

In anomaly detection, the local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbours.

Peter J. Rousseeuw is a statistician known for his work on robust statistics and cluster analysis. He obtained his PhD in 1981 at the Vrije Universiteit Brussel, following research carried out at the ETH in Zurich, which led to a book on influence functions. Later he was professor at the Delft University of Technology, The Netherlands, at the University of Fribourg, Switzerland, and at the University of Antwerp, Belgium. Next he was a senior researcher at Renaissance Technologies. He then returned to Belgium as professor at KU Leuven, until becoming emeritus in 2022. His former PhD students include Annick Leroy, Hendrik Lopuhaä, Geert Molenberghs, Christophe Croux, Mia Hubert, Stefan Van Aelst, Tim Verdonck and Jakob Raymaekers.

In non-parametric statistics, the Theil–Sen estimator is a method for robustly fitting a line to sample points in the plane by choosing the median of the slopes of all lines through pairs of points. It has also been called Sen's slope estimator, slope selection, the single median method, the Kendall robust line-fit method, and the Kendall–Theil robust line. It is named after Henri Theil and Pranab K. Sen, who published papers on this method in 1950 and 1968 respectively, and after Maurice Kendall because of its relation to the Kendall tau rank correlation coefficient.

In statistical graphics, the functional boxplot is an informative exploratory tool that has been proposed for visualizing functional data. Analogous to the classical boxplot, the descriptive statistics of a functional boxplot are: the envelope of the 50% central region, the median curve and the maximum non-outlying envelope.

In statistical graphics and scientific visualization, the contour boxplot is an exploratory tool that has been proposed for visualizing ensembles of feature-sets determined by a threshold on some scalar function. Analogous to the classical boxplot and considered an expansion of the concepts defining functional boxplot, the descriptive statistics of a contour boxplot are: the envelope of the 50% central region, the median curve and the maximum non-outlying envelope.

References

[1] Rousseeuw, Peter J.; Ruts, I.; Tukey, J. W. (1999). "The Bagplot: A Bivariate Boxplot". The American Statistician. 53 (4): 382–387. doi:10.1080/00031305.1999.10474494.
[2] Ronald K. Pearson (1 April 2005). Mining Imperfect Data: Dealing with Contamination and Incomplete Records. SIAM. pp. 204–. ISBN 978-0-89871-582-8.
[3] Dominique Haughton; Jonathan Haughton (18 September 2011). Living Standards Analytics: Development through the Lens of Household Survey Data. Springer. pp. 14–. ISBN 978-1-4614-0385-2.
[4] Sophie Dabo-Niang; Frédéric Ferraty (21 May 2008). Functional and Operatorial Statistics. Springer. pp. 204–. ISBN 978-3-7908-2062-1.
[5] John C. Gower; Sugnet Gardner Lubbe; Niel J. Le Roux (23 February 2011). Understanding Biplots. John Wiley & Sons. pp. 59–. ISBN 978-1-119-97290-7.
[6] Prabhanjan Narayanachar Tattar (24 July 2013). R Statistical Application Development by Example Beginner's Guide. Packt Publishing Ltd. pp. 203–. ISBN 978-1-84951-945-8.
[7] Kruppa, Jochen J.; Jung, K. (2017). "Automated multigroup outlier identification in molecular high-throughput data using bagplots and gemplots". BMC Bioinformatics. 18: 232. doi:10.1186/s12859-017-1645-5. PMC 5414140. PMID 28464790.
[8] Rajeev Raman; Robert Sedgewick; Matthias F. Stallmann (1 January 2006). Proceedings of the Eighth Workshop on Algorithm Engineering and Experiments and the Third Workshop on Analytic Algorithmics and Combinatorics. SIAM. pp. 62–. ISBN 978-0-89871-610-8.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
