Volcano plot (statistics)

Volcano plot showing metabolomic data. The red arrows indicate points-of-interest that display both large magnitude fold-changes (x axis) and high statistical significance (-log10 of p value, y axis). The dashed red line shows where p = 0.05 with points above the line having p < 0.05 and points below the line having p > 0.05. This plot is colored such that those points having a fold-change less than 2 (log2 = 1) are shown in gray.

In statistics, a volcano plot is a type of scatter-plot that is used to quickly identify changes in large data sets composed of replicate data. [1] [2] It plots significance versus fold-change on the y and x axes, respectively. These plots are increasingly common in omic experiments such as genomics, proteomics, and metabolomics where one often has a list of many thousands of replicate data points between two conditions and one wishes to quickly identify the most meaningful changes. A volcano plot combines a measure of statistical significance from a statistical test (e.g., a p value from an ANOVA model) with the magnitude of the change, enabling quick visual identification of those data-points (genes, etc.) that display large magnitude changes that are also statistically significant.

A volcano plot is constructed by plotting the negative logarithm (usually base 10) of the p value on the y axis. This results in data points with low p values (highly significant) appearing toward the top of the plot. The x axis is the logarithm of the fold change between the two conditions. The logarithm of the fold change is used so that changes in both directions appear equidistant from the center. Plotting points in this way results in two regions of interest in the plot: the points toward the top of the plot that are also far to either the left- or right-hand side. These represent values that display large magnitude fold changes (hence being left or right of center) as well as high statistical significance (hence being toward the top).
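The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the feature names, group means, and p values are made up, the function names are hypothetical, and the cut-offs (a fold change of at least 2, i.e. |log2 fold change| >= 1, and p < 0.05) are simply the ones used in the example figure's caption.

```python
import math

def volcano_coordinates(mean_a, mean_b, p_value):
    """Return (x, y): log2 fold change of B relative to A, and -log10 of p."""
    x = math.log2(mean_b / mean_a)   # symmetric around 0 for up/down changes
    y = -math.log10(p_value)         # small p values end up high on the plot
    return x, y

def is_of_interest(x, y, min_abs_log2_fc=1.0, max_p=0.05):
    """True for points in the upper-left or upper-right regions of the plot."""
    return abs(x) >= min_abs_log2_fc and y > -math.log10(max_p)

# Hypothetical data: name -> (mean in condition A, mean in condition B, p value)
features = {
    "feature_up": (10.0, 40.0, 0.001),     # 4-fold increase, significant
    "feature_down": (40.0, 10.0, 0.001),   # 4-fold decrease, significant
    "feature_small": (10.0, 12.0, 0.001),  # significant but small change
    "feature_noisy": (10.0, 40.0, 0.5),    # large change but not significant
}

points = {name: volcano_coordinates(*values) for name, values in features.items()}
flagged = sorted(name for name, (x, y) in points.items() if is_of_interest(x, y))
```

Note that the 4-fold increase and the 4-fold decrease land at x = 2 and x = -2 respectively, which is the equidistance the logarithm is meant to provide.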

Additional information can be added by coloring the points according to a third dimension of data (such as signal intensity), but this is not uniformly employed. Volcano plots are also used to graphically display a significance analysis of microarrays (SAM) gene selection criterion, an example of regularization. [3]

The concept of the volcano plot can be generalized to other applications, where the x axis is related to a measure of the strength of a statistical signal, and the y axis is related to a measure of the statistical significance of the signal. For example, in a genetic association case-control study, such as a genome-wide association study, a point in a volcano plot represents a single-nucleotide polymorphism. Its x value can be the logarithm of the odds ratio and its y value can be -log10 of the p value from a chi-squared test, or the chi-squared test statistic itself. [4]
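As a rough sketch of the case-control variant, the snippet below turns a 2×2 table of allele counts for one polymorphism into a volcano-plot point. The function name and the counts are hypothetical, and the choice of an uncorrected Pearson chi-squared test with one degree of freedom is an assumption made for illustration; the cited work may use a different statistic.

```python
import math

def snp_volcano_point(case_alt, case_ref, control_alt, control_ref):
    """Return (log odds ratio, -log10 p, chi-squared statistic) for a 2x2 table.

    Assumes all four counts are positive. For extremely large statistics the
    p value can underflow to 0, in which case log10 would raise ValueError.
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    log_odds_ratio = math.log((a * d) / (b * c))
    # Pearson chi-squared statistic for a 2x2 table, no continuity correction
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return log_odds_ratio, -math.log10(p_value), chi2

# Hypothetical counts: alternate/reference alleles in cases and in controls
x, y, chi2 = snp_volcano_point(60, 40, 40, 60)
x_null, y_null, chi2_null = snp_volcano_point(50, 50, 50, 50)
```

With these made-up counts the first polymorphism has an odds ratio of 2.25 and a chi-squared statistic of 8, so it sits to the right of center and above the p = 0.05 line, while the balanced table sits at the origin.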

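Volcano plots show a characteristic upward two-armed shape because the x values, i.e. the underlying log2 fold changes, are generally approximately normally distributed, whereas the y values, the -log10 p values, tend toward greater significance for fold changes that deviate more strongly from zero. The density of the normal distribution with mean $\mu$ and standard deviation $\sigma$ takes the form

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}.$$

So the logarithm of that is

$$\ln f(x) = -\ln\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}$$

and the negative logarithm is

$$-\ln f(x) = \ln\left(\sigma\sqrt{2\pi}\right) + \frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2},$$

which is a parabola in $x$ whose arms reach upwards on the left and right sides. The upper bound of the data is one parabola and the lower bound is another parabola.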
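The parabola claim can be checked numerically. The sketch below compares the negative natural logarithm of a normal density with the quadratic ln(σ√(2π)) + ½((x − μ)/σ)², using the standard normal (μ = 0, σ = 1) as an assumed example; the function names are illustrative only.

```python
import math

def neg_log_normal_density(x, mu=0.0, sigma=1.0):
    """Negative natural log of the normal density, computed from the density."""
    density = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    return -math.log(density)

def parabola(x, mu=0.0, sigma=1.0):
    """Constant offset plus a quadratic term in x: an upward-opening parabola."""
    return math.log(sigma * math.sqrt(2.0 * math.pi)) + 0.5 * ((x - mu) / sigma) ** 2

xs = [-3.0, -1.5, 0.0, 1.5, 3.0]
# Largest disagreement between the two expressions over the sample points
max_gap = max(abs(neg_log_normal_density(x) - parabola(x)) for x in xs)
```

The two expressions agree up to floating-point rounding, and the curve is lowest at x = μ and rises symmetrically on both sides, which matches the two upward arms seen in the plot.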

References

  1. Jin, W.; Riley, R. M.; Wolfinger, R. D.; White, K. P.; Passador-Gurgel, G.; Gibson, G. (2001). "Contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster". Nature Genetics. 29 (4): 389–395. doi:10.1038/ng766. PMID 11726925. S2CID 16841881.
  2. Cui, X.; Churchill, G. A. (2003). "Statistical tests for differential expression in cDNA microarray experiments". Genome Biology. 4 (4): 210. doi:10.1186/gb-2003-4-4-210. PMC 154570. PMID 12702200.
  3. Li, W. (2012). "Volcano plots in analyzing differential expressions with mRNA microarrays". Journal of Bioinformatics and Computational Biology. 10 (6): 1231003. arXiv:1103.3434. doi:10.1142/S0219720012310038. PMID 23075208. S2CID 204899379.
  4. Li, W.; Freudenberg, J.; Suh, Y. J.; Yang, Y. (2014). "Using volcano plots and regularized-chi statistics in genetic association studies". Computational Biology and Chemistry. 48: 77–83. arXiv:1308.6245. doi:10.1016/j.compbiolchem.2013.02.003. PMID 23602812. S2CID 12399345.