RA plot

Last updated February 29, 2020

The ratio average (RA) plot is an integer-based version of an MA plot for visualizing two-condition count data. Its distinctive arrow-like shape derives from the way it incorporates condition-unique points, (0, n) or (n, 0), into the plot via an epsilon factor.

Definition

An RA plot, like its cousin the MA plot, is a rescaled and (45-degree) rotated version of a simple two-dimensional scatter plot of a versus b, where a and b are equal-length vectors of positive measurements. This rescaling and rotation gives better visibility and emphasis to important outlier points that vary between the two measurement conditions.^[1] Essentially, it is a plot of the log ratio (R) versus the average log (A) of each pairing of the elements of a and b. Unlike an MA plot, however, the RA plot takes non-negative integer counts as input, so it must employ work-arounds to include points that would otherwise be mathematically invisible on a log scale (points where one or both elements of the pair are zero).

If we modify our original a (or b) vector via:

a={\begin{cases}a+\varepsilon ,&{\text{if }}a=0\\a,&{\text{if }}a>0\end{cases}}

where

0<\varepsilon <0.5

then R and A can be defined as:

R=\log _{2}(a/b)

A={\frac {1}{2}}\times (\log _{2}a+\log _{2}b)
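
As a hedged aside on the "rescaled and rotated" description: assuming the underlying scatter plot places \log _{2}b on the horizontal axis and \log _{2}a on the vertical axis (an assumption about axis order that the text does not state), the two definitions above can be written as a clockwise 45-degree rotation followed by a rescaling of each axis:

{\begin{pmatrix}A\\R\end{pmatrix}}={\begin{pmatrix}{\frac {1}{\sqrt {2}}}&0\\0&{\sqrt {2}}\end{pmatrix}}{\begin{pmatrix}\cos 45^{\circ }&\sin 45^{\circ }\\-\sin 45^{\circ }&\cos 45^{\circ }\end{pmatrix}}{\begin{pmatrix}\log _{2}b\\\log _{2}a\end{pmatrix}}

Multiplying out gives A={\frac {1}{2}}(\log _{2}a+\log _{2}b) and R=\log _{2}a-\log _{2}b, which agrees with the definitions above.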

R, like M, is plotted on the y-axis and represents a log (fold change) ratio between a and b. A is plotted on the x-axis and represents the average abundance for a coordinate pair. The RA plot provides a quick overview of the distribution and size of a dataset consisting of non-zero counts.
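
The definitions above can be sketched in a few lines of base R. This is a minimal illustration rather than the caroline implementation: the helper name `ra_coords` and the default `epsilon = 0.25` are assumptions made for the example, and the substitution is applied to both vectors as the "(or b)" in the text suggests.

```r
# Hypothetical helper (not part of the caroline package): compute the
# R and A coordinates for two equal-length vectors of non-negative counts.
ra_coords <- function(a, b, epsilon = 0.25) {
  stopifnot(length(a) == length(b), epsilon > 0, epsilon < 0.5)

  # Replace zeros with epsilon so that the logarithms are defined.
  a <- ifelse(a == 0, a + epsilon, a)
  b <- ifelse(b == 0, b + epsilon, b)

  data.frame(
    R = log2(a / b),                  # log ratio (y-axis)
    A = 0.5 * (log2(a) + log2(b))     # average log abundance (x-axis)
  )
}

# Three pairs: unique to b, shared, unique to a.
ra <- ra_coords(a = c(0, 4, 8), b = c(4, 4, 0))
print(ra)
```

With epsilon set to 0.25, the pair (0, 4) maps to R = -4 and A = 0, the pair (4, 4) to R = 0 and A = 2, and the pair (8, 0) to R = 5 and A = 0.5.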

Etymology

The acronym prefix "R.A." is sometimes pronounced as the one-syllable word "ray" because of the plot's strong resemblance to a geometric ray. This characteristic arrow-like shape derives from two key features: on the right, at the vector origin, a long asymptotic tail, and on the left (forming the arrowhead) two often dense patches of condition-unique points.

Work-arounds for point visibility and inclusion

Condition-unique points

Because a large portion of the pairs of a and b contain a zero in one or both conditions, these pairs cannot be plotted as-is on a log scale. Other MA plotting functions artificially include these condition-unique points by spreading them vertically as a "smear" on the left, or horizontally as a "rug" at the very top and bottom of the plot. In an RA plot, by contrast, the uniques are included via the addition of a small epsilon factor (between 0.1 and 0.5), which places them in a more statistically appropriate location in the plot.

MA plot with condition-unique and zero points as a "smear" (via the edgeR Bioconductor package)

RA plot with condition-unique and zero points as diagonal "arms" giving it a distinct ray-like shape

Two different ways of artificially adding condition-unique points into an MA-style plot.
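
The diagonal "arms" follow directly from the epsilon substitution. As a sketch, assuming an illustrative `epsilon` of 0.25 and counts `n` from 1 to 20 (both chosen only for this example), the condition-unique points fall on two straight lines in the (A, R) plane with slopes -2 and +2:

```r
# Sketch only: epsilon and the counts below are illustrative assumptions.
epsilon <- 0.25
n <- 1:20                        # counts in the non-zero condition

# Points unique to condition b: a = 0 is replaced by epsilon.
R_lower <- log2(epsilon / n)
A_lower <- 0.5 * (log2(epsilon) + log2(n))

# Points unique to condition a: b = 0 is replaced by epsilon.
R_upper <- log2(n / epsilon)
A_upper <- 0.5 * (log2(n) + log2(epsilon))

# Both arms are straight lines: R = -/+ 2 * (A - log2(epsilon)).
lower_on_line <- isTRUE(all.equal(R_lower, -2 * (A_lower - log2(epsilon))))
upper_on_line <- isTRUE(all.equal(R_upper,  2 * (A_upper - log2(epsilon))))
print(c(lower_on_line, upper_on_line))
```

Under these assumptions the two lines meet at A = \log _{2}\varepsilon and R = 0, which is where a (0, 0) pair lands after substitution, so the arms open toward larger A like an arrowhead pointing left.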

Overplotting

Another problem with plotting this (or any) type of count data is overplotting. The RA plot addresses it by jittering the points away from each other, but not so far that they merge with points from neighboring coordinates. The result is a patchwork-like appearance that fades away as A increases.

An RA plot: many points have identical coordinates and are hidden from each other

A jittered RA plot: contiguous patches have identical original coordinates
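
One way to picture the jittering is the base R sketch below. It is not the caroline implementation: the noise half-width `amount = 0.4`, the choice to jitter in count space, and the restriction to pairs without zeros are all assumptions made to keep the example short.

```r
# Sketch only: jitter width and the non-zero restriction are assumptions
# for illustration, not the behavior of the caroline package.
set.seed(42)
a <- rnbinom(n = 2000, mu = 5, size = 2)
b <- rnbinom(n = 2000, mu = 5, size = 2)

keep <- a > 0 & b > 0            # zero handling is left to the epsilon step
a <- a[keep]
b <- b[keep]

amount <- 0.4                    # half-width of the noise, kept below 0.5
aj <- a + runif(length(a), -amount, amount)
bj <- b + runif(length(b), -amount, amount)

# Many raw pairs coincide exactly; the jittered pairs do not.
n_overplotted <- sum(duplicated(cbind(a, b)))
n_overplotted_jittered <- sum(duplicated(cbind(aj, bj)))

# Each jittered value still rounds back to its original count, so patches
# from neighboring integer coordinates do not merge.
stays_in_cell <- all(round(aj) == a) && all(round(bj) == b)

R <- log2(aj / bj)
A <- 0.5 * (log2(aj) + log2(bj))
if (interactive()) plot(A, R, pch = ".")
```

Because a unit cell in count space covers less distance on a log scale as the counts grow, the patches produced this way shrink toward larger A, which is consistent with the fading patchwork described above.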

RA plot in the caroline package

Packages

The caroline CRAN R package contains the only known implementation of an RA plot. The meta-transcriptomics "manta" R package provides a wrapper around this implementation; it is used for assessing fold change in the transcription of genes (the points) while simultaneously visualizing each gene's taxonomic distribution as an individual pie-chart point.^[2]

Examples

  library(caroline)
  a <- rnbinom(n=10000, mu=5, size=2)
  b <- rnbinom(n=10000, mu=5, size=2)
  raPlot(a, b)

References

[1] Dudoit, S., Yang, Y. H., Callow, M. J., Speed, T. P. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sin. 12(1): 111–139.
[2] Schruth, D., Marchetti, A. (2011). Microbial Assemblage Normalized Transcript Analysis. R package version 0.9.5.