# Biplot

A biplot is a type of exploratory graph used in statistics that generalizes the simple two-variable scatterplot. It displays information on both the samples and the variables of a data matrix in a single graphic. Samples are displayed as points, while variables are displayed as vectors, linear axes or nonlinear trajectories. For categorical variables, category level points may be used to represent the levels of each variable. A generalised biplot displays information on both continuous and categorical variables.

## Introduction and history

The biplot was introduced by K. Ruben Gabriel (1971). [1] Gower and Hand (1996) wrote a monograph on biplots. Yan and Kang (2003) described various methods for visualizing and interpreting a biplot. The book by Greenacre (2010) [2] is a practical, user-oriented guide to biplots, with scripts in the open-source R programming language that generate biplots associated with principal component analysis (PCA), multidimensional scaling (MDS), log-ratio analysis (LRA), also known as spectral mapping, [3] [4] discriminant analysis (DA) and various forms of correspondence analysis: simple correspondence analysis (CA), multiple correspondence analysis (MCA) and canonical correspondence analysis (CCA) (Greenacre 2016 [5] ). The book by Gower, Lubbe and le Roux (2011) aims to popularize biplots as a useful and reliable method for visualizing multivariate data when researchers want to consider, for example, principal component analysis (PCA), canonical variates analysis (CVA) or various types of correspondence analysis.

## Construction

A biplot is constructed by using the singular value decomposition (SVD) to obtain a low-rank approximation to a transformed version of the data matrix $X$, whose $n$ rows are the samples (also called the cases, or objects), and whose $p$ columns are the variables. The transformed data matrix $Y$ is obtained from the original matrix $X$ by centering and optionally standardizing the columns (the variables). Using the SVD, we can write $Y = \sum_{k=1}^{p} d_k u_k v_k^{T}$, where the $u_k$ are $n$-dimensional column vectors, the $v_k$ are $p$-dimensional column vectors, and the $d_k$ are a non-increasing sequence of non-negative scalars. The biplot is formed from two scatterplots that share a common set of axes and have a between-set scalar product interpretation. The first scatterplot is formed from the points $(d_1^{\alpha} u_{1i},\ d_2^{\alpha} u_{2i})$, for $i = 1, \ldots, n$. The second plot is formed from the points $(d_1^{1-\alpha} v_{1j},\ d_2^{1-\alpha} v_{2j})$, for $j = 1, \ldots, p$. This is the biplot formed by the dominant two terms of the SVD, which can then be represented in a two-dimensional display. Typical choices of $\alpha$ are 1 (to give a distance interpretation to the row display) and 0 (to give a distance interpretation to the column display), and in some rare cases $\alpha = 1/2$ to obtain a symmetrically scaled biplot (which gives no distance interpretation to the rows or the columns, but only the scalar product interpretation). The set of points depicting the variables can be drawn as arrows from the origin to reinforce the idea that they represent biplot axes onto which the samples can be projected to approximate the original data.
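
The construction can be sketched in a few lines of Python. This is a minimal illustration using NumPy, not a reference implementation: the function name `biplot_coordinates` and the parameters `alpha` and `standardize` are hypothetical names chosen here, and the choice of the sample standard deviation for standardizing is an assumption.

```python
import numpy as np

def biplot_coordinates(X, alpha=1.0, standardize=False):
    """Return 2-D row and column coordinates for a biplot of X.

    Illustrative helper (hypothetical name); alpha is the scaling
    exponent described in the text.
    """
    X = np.asarray(X, dtype=float)
    # Transform X into Y: center each column, optionally standardize it
    # (assumption: divide by the sample standard deviation).
    Y = X - X.mean(axis=0)
    if standardize:
        Y = Y / Y.std(axis=0, ddof=1)
    # Thin SVD: Y = U @ diag(d) @ Vt, with d in non-increasing order.
    U, d, Vt = np.linalg.svd(Y, full_matrices=False)
    # Keep the two dominant terms and split the singular values by alpha.
    rows = U[:, :2] * d[:2] ** alpha            # (d1^a u_1i, d2^a u_2i)
    cols = Vt[:2].T * d[:2] ** (1.0 - alpha)    # (d1^(1-a) v_1j, d2^(1-a) v_2j)
    return rows, cols, Y, d

# Small made-up data set: 8 samples, 3 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
rows, cols, Y, d = biplot_coordinates(X, alpha=1.0)

# Scalar products between row and column points give the
# rank-2 approximation of Y, whatever the value of alpha.
Y_rank2 = rows @ cols.T
```

Whatever value of `alpha` is used, the product `rows @ cols.T` equals the sum of the first two SVD terms, which is the between-set scalar product interpretation mentioned above. Only the distances within each set of points depend on `alpha`.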

## References

1. Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika, 58(3), 453–467.
2. Greenacre, M. (2010). Biplots in Practice. BBVA Foundation, Bilbao, Spain. Available for free at http://www.multivariatestatistics.org
3. Lewi, Paul J. (2005). "Spectral mapping, a personal and historical account of an adventure in multivariate data analysis". Chemometrics and Intelligent Laboratory Systems. 77 (1–2): 215–223. doi:10.1016/j.chemolab.2004.07.010.
4. David Livingstone (2009). A Practical Guide to Scientific Data Analysis. Chichester, John Wiley & Sons Ltd, 233–238. ISBN 978-0-470-85153-1
5. Greenacre, M. (2016). Correspondence Analysis in Practice. Third Edition. Chapman and Hall / CRC Press. ISBN 978-84-923846-8-6

## Sources

• Gabriel, K.R. (1971). "The biplot graphic display of matrices with application to principal component analysis". Biometrika. 58 (3): 453–467. doi:10.1093/biomet/58.3.453.
• Gower, J.C., Lubbe, S. and le Roux, N. (2010). Understanding Biplots. Wiley. ISBN 978-0-470-01255-0
• Gower, J.C. and Hand, D.J. (1996). Biplots. Chapman & Hall, London, UK. ISBN 0-412-71630-5
• Yan, W. and Kang, M.S. (2003). GGE Biplot Analysis. CRC Press, Boca Raton, Florida. ISBN 0-8493-1338-4
• Demey, J.R., Vicente-Villardón, J.L., Galindo-Villardón, M.P. and Zambrano, A.Y. (2008). Identifying molecular markers associated with classification of genotypes by External Logistic Biplots. Bioinformatics. 24(24):2832–2838