Violin plot

Last updated
Example of a violin plot Violin plot.gif
Example of a violin plot
Example of a violin plot in a scientific publication in PLOS Pathogens. Violinplot-hiv-paper-plot-pathogens.svg
Example of a violin plot in a scientific publication in PLOS Pathogens.

A violin plot is a method of plotting numeric data. It is similar to a box plot, with the addition of a rotated kernel density plot on each side. [1]

Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator. Typically a violin plot will include all the data that is in a box plot: a marker for the median of the data; a box or marker indicating the interquartile range; and possibly all sample points, if the number of samples is not too high.

Violin plots are available as extensions to a number of software packages such as DataVisualization on CRAN [2] and the md-plot package on PyPI. [3]

A violin plot is more informative than a plain box plot. While a box plot only shows summary statistics such as mean/median and interquartile ranges, the violin plot shows the full distribution of the data. The difference is particularly useful when the data distribution is multimodal (more than one peak). In this case a violin plot shows the presence of different peaks, their position and relative amplitude.

Like box plots, violin plots are used to represent comparison of a variable distribution (or sample distribution) across different "categories" (for example, temperature distribution compared between day and night, or distribution of car prices compared across different car makers).

A violin plot can have multiple layers. For instance, the outer shape represents all possible results. The next layer inside might represent the values that occur 95% of the time. The next layer (if it exists) inside might represent the values that occur 50% of the time.

Although more informative than box plots, they are less popular. Because of their unpopularity, they may be harder to understand for readers not familiar with them. In this case, a more accessible alternative is to plot a series of stacked histograms or kernel density distributions.

Violin plots are available as extensions to a number of software packages, including the R packages vioplot, wvioplot, caroline, UsingR, lattice and ggplot2, the Stata add-on command vioplot, [4] and the Python libraries matplotlib, [5] Plotly, [6] ROOT [7] and Seaborn, [8] a graph type in Origin, [9] IGOR Pro, [10] Julia statistical plotting package StatsPlots.jl [11] and DistributionChart in Mathematica.

Related Research Articles

In statistics, a quartile is a type of quantile which divides the number of data points into four parts, or quarters, of more-or-less equal size. The data must be ordered from smallest to largest to compute quartiles; as such, quartiles are a form of order statistic. The three main quartiles are as follows:

Quantile Statistical method of dividing data into equal-sized intervals for analysis

In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. There is one fewer quantile than the number of groups created. Common quantiles have special names, such as quartiles, deciles, and percentiles. The groups created are termed halves, thirds, quarters, etc., though sometimes the terms for the quantile are used for the groups created, rather than for the cut points.

Box plot Data visualization

In descriptive statistics, a box plot or boxplot is a method for graphically demonstrating the locality, spread and skewness groups of numerical data through their quartiles. In addition to the box on a box plot, there can be lines extending from the box indicating variability outside the upper and lower quartiles, thus, the plot is also termed as the box-and-whisker plot and the box-and-whisker diagram. Outliers that differ significantly from the rest of the dataset may be plotted as individual points beyond the whiskers on the box-plot. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution. The spacings in each subsection of the box-plot indicate the degree of dispersion (spread) and skewness of the data, which are usually described using the five-number summary. In addition, the box-plot allows one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically.

Stata Statistical software package

Stata is a general-purpose statistical software package developed by StataCorp for data manipulation, visualization, statistics, and automated reporting. It is used by researchers in many fields, including biomedicine, epidemiology, sociology and science.

Kernel density estimation

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form. One of the famous applications of kernel density estimation is in estimating the class-conditional marginal densities of data when using a naive Bayes classifier, which can improve its prediction accuracy.

This glossary of statistics and probability is a list of definitions of terms and concepts used in the mathematical sciences of statistics and probability, their sub-disciplines, and related fields. For additional related terms, see Glossary of mathematics.

Matplotlib Library for creating static, animated, and interactive visualizations in Python.

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK. There is also a procedural "pylab" interface based on a state machine, designed to closely resemble that of MATLAB, though its use is discouraged. SciPy makes use of Matplotlib.

Correlogram

In the analysis of data, a correlogram is a chart of correlation statistics. For example, in time series analysis, a plot of the sample autocorrelations versus is an autocorrelogram. If cross-correlation is plotted, the result is called a cross-correlogram.

IPython Advanced interactive shell for Python

IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, that offers introspection, rich media, shell syntax, tab completion, and history. IPython provides the following features:

Enthought, Inc. is a software company based in Austin, Texas, United States that develops scientific and analytic computing solutions using primarily the Python programming language. It is best known for the early development and maintenance of the SciPy library of mathematics, science, and engineering algorithms and for its Python for scientific computing distribution Enthought Canopy.

Plot (graphics)

A plot is a graphical technique for representing a data set, usually as a graph showing the relationship between two or more variables. The plot can be drawn by hand or by a computer. In the past, sometimes mechanical or electronic plotters were used. Graphs are a visual representation of the relationship between variables, which are very useful for humans who can then quickly derive an understanding which may not have come from lists of values. Given a scale or ruler, graphs can also be used to read off the value of an unknown variable plotted as a function of a known one, but this can also be done with data presented in tabular form. Graphs of functions are used in mathematics, sciences, engineering, technology, finance, and other areas.

Python Package Index Software repository

The Python Package Index, abbreviated as PyPI and also known as the Cheese Shop, is the official third-party software repository for Python. It is analogous to the CPAN repository for Perl and to the CRAN repository for R. PyPI is run by the Python Software Foundation, a charity. Some package managers, including pip, use PyPI as the default source for packages and their dependencies.

ggplot2

ggplot2 is an open-source data visualization package for the statistical programming language R. Created by Hadley Wickham in 2005, ggplot2 is an implementation of Leland Wilkinson's Grammar of Graphics—a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers. ggplot2 can serve as a replacement for the base graphics in R and contains a number of defaults for web and print display of common scales. Since 2005, ggplot2 has grown in use to become one of the most popular R packages.

In statistical graphics, the functional boxplot is an informative exploratory tool that has been proposed for visualizing functional data. Analogous to the classical boxplot, the descriptive statistics of a functional boxplot are: the envelope of the 50% central region, the median curve and the maximum non-outlying envelope.

Plotly Canadian computing company

Plotly is a technical computing company headquartered in Montreal, Quebec, that develops online data analytics and visualization tools. Plotly provides online graphing, analytics, and statistics tools for individuals and collaboration, as well as scientific graphing libraries for Python, R, MATLAB, Perl, Julia, Arduino, and REST.

Statsmodels is a Python package that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator. It complements SciPy's stats module.

glue (software) Data visualization software

glue is an interactive linked-view data visualization package for exploring relationships within and between related datasets.

ArviZ is a Python package for exploratory analysis of Bayesian models it offers data structures for manipulating data common in Bayesian analysis, like numerical samples from the posterior, prior predictive and posterior predictive distributions as well as observed data. Additionally, many numerical and visual diagnostics as well as plots are available. The ArviZ name is derived from reading "rvs" as a word instead of spelling it and also using the particle "viz" usually used to abbreviate visualization.

References

  1. "Violin Plot". NIST DataPlot. National Institute of Standards and Technology. 2015-10-13.
  2. "CRAN - Package DataVisualization". 12 January 2021.
  3. "md-plot . PyPI".
  4. Hintze, Jerry L.; Nelson, Ray D. (1998). "Violin Plots: A Box Plot-Density Trace Synergism". The American Statistician. 52 (2): 181–4. doi:10.1080/00031305.1998.10480559.
  5. "violin plots". What's new in matplotlib.
  6. "Violin Plots in Python". Plotly Python API Library Reference.
  7. "The Violin option" . Retrieved 2020-05-05.
  8. Waskom, Michael. "Violinplot from a wide-form dataset". Seaborn: statistical data visualization.
  9. "Violin Plot in Origin 2019". 10 September 2018. Retrieved 2018-10-29.
  10. "Igor Pro 8 Highlights". Wavemetrics. Retrieved 2019-07-28.
  11. "boxplot, dotplot, and violin" . Retrieved 2020-08-15.

PD-icon.svg This article incorporates  public domain material from the National Institute of Standards and Technology document: "Dataplot reference manual: Violin plot".