Multivariate statistics

Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable. Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate analyses in order to understand the relationships between variables and their relevance to the problem being studied.

In addition, multivariate statistics is concerned with multivariate probability distributions, in terms of both

  • how these can be used to represent the distributions of observed data;
  • how they can be used as part of statistical inference, particularly where several different quantities are of interest to the same analysis.

Certain types of problems involving multivariate data, for example simple linear regression and multiple regression, are not usually considered to be special cases of multivariate statistics because the analysis is dealt with by considering the (univariate) conditional distribution of a single outcome variable given the other variables.
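The distinction can be written out for the normal linear model. The notation below is an assumption introduced for illustration (it does not appear in the text above): x_i is the vector of predictors for unit i, y_i is a scalar outcome in the first line, and a vector of p outcomes in the second.

```latex
% Multiple regression: one outcome, univariate conditional distribution
y_i \mid \mathbf{x}_i \sim \mathcal{N}\!\left(\mathbf{x}_i^{\top}\boldsymbol{\beta},\ \sigma^{2}\right)

% Multivariate regression: p outcomes modelled jointly
\mathbf{y}_i \mid \mathbf{x}_i \sim \mathcal{N}_{p}\!\left(\mathbf{B}^{\top}\mathbf{x}_i,\ \boldsymbol{\Sigma}\right)
```

In the first case only a scalar variance is estimated; in the second, the p-by-p covariance matrix Σ captures the dependence among the outcomes, which is what makes the analysis multivariate.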

Multivariate analysis

Multivariate analysis (MVA) is based on the principles of multivariate statistics. Typically, MVA is used in situations where multiple measurements are made on each experimental unit and the relations among these measurements and their structures are important. [1] A modern, overlapping categorization of MVA includes: [1]

Multivariate analysis can be complicated by the desire to include physics-based analysis to calculate the effects of variables for a hierarchical "system-of-systems". Often, studies that wish to use multivariate analysis are stalled by the dimensionality of the problem. These concerns are often eased through the use of surrogate models, highly accurate approximations of the physics-based code. Since surrogate models take the form of an equation, they can be evaluated very quickly. This becomes an enabler for large-scale MVA studies: while a Monte Carlo simulation across the design space is difficult with physics-based codes, it becomes trivial when evaluating surrogate models, which often take the form of response-surface equations.
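The surrogate-model idea can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed workflow: `expensive_model` is a hypothetical stand-in for a physics-based code, the two inputs are assumed to be uniform on [-1, 1], and the surrogate is a full quadratic response surface fitted by ordinary least squares.

```python
import numpy as np

def expensive_model(x1, x2):
    """Hypothetical stand-in for a slow physics-based code."""
    return 1.0 + 2.0 * x1 - 0.5 * x2 + 0.3 * x1 * x2 + 0.8 * x1**2

def design_matrix(x1, x2):
    """Terms of a full quadratic response surface in two inputs."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

rng = np.random.default_rng(0)

# A small number of "expensive" runs is used to fit the surrogate.
x1_train = rng.uniform(-1.0, 1.0, size=30)
x2_train = rng.uniform(-1.0, 1.0, size=30)
y_train = expensive_model(x1_train, x2_train)

# Least-squares fit of the response-surface coefficients.
coef, *_ = np.linalg.lstsq(design_matrix(x1_train, x2_train), y_train, rcond=None)

def surrogate(x1, x2):
    """Cheap approximation: a single matrix-vector product."""
    return design_matrix(x1, x2) @ coef

# Monte Carlo over the design space, evaluating only the surrogate.
x1_mc = rng.uniform(-1.0, 1.0, size=100_000)
x2_mc = rng.uniform(-1.0, 1.0, size=100_000)
y_mc = surrogate(x1_mc, x2_mc)

print("fitted coefficients:", np.round(coef, 3))
print("Monte Carlo mean and std of response:", y_mc.mean(), y_mc.std())
```

Because the stand-in model here is itself quadratic, the surrogate reproduces it essentially exactly; with a real physics-based code the fit would be approximate and should be checked against held-out runs.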

Types of analysis

There are many different models, each with its own type of analysis:

  1. Multivariate analysis of variance (MANOVA) extends the analysis of variance to cover cases where there is more than one dependent variable to be analyzed simultaneously; see also Multivariate analysis of covariance (MANCOVA).
  2. Multivariate regression attempts to determine a formula that can describe how elements in a vector of variables respond simultaneously to changes in others. For linear relations, regression analyses here are based on forms of the general linear model. Some suggest that multivariate regression is distinct from multivariable regression; however, that is debated and not consistently true across scientific fields. [2]
  3. Principal components analysis (PCA) creates a new set of orthogonal variables that contain the same information as the original set. It rotates the axes of variation to give a new set of orthogonal axes, ordered so that they summarize decreasing proportions of the variation.
  4. Factor analysis is similar to PCA but allows the user to extract a specified number of synthetic variables, fewer than the original set, leaving the remaining unexplained variation as error. The extracted variables are known as latent variables or factors; each one may be supposed to account for covariation in a group of observed variables.
  5. Canonical correlation analysis finds linear relationships between two sets of variables; it is the generalised (i.e. canonical) version of bivariate [3] correlation.
  6. Redundancy analysis (RDA) is similar to canonical correlation analysis but allows the user to derive a specified number of synthetic variables from one set of (independent) variables that explain as much variance as possible in another (dependent) set. It is a multivariate analogue of regression.
  7. Correspondence analysis (CA), or reciprocal averaging, finds (like PCA) a set of synthetic variables that summarise the original set. The underlying model assumes chi-squared dissimilarities among records (cases).
  8. Canonical (or "constrained") correspondence analysis (CCA) summarises the joint variation in two sets of variables (like redundancy analysis); it is a combination of correspondence analysis and multivariate regression analysis. The underlying model assumes chi-squared dissimilarities among records (cases).
  9. Multidimensional scaling comprises various algorithms to determine a set of synthetic variables that best represent the pairwise distances between records. The original method is principal coordinates analysis (PCoA; based on PCA).
  10. Discriminant analysis, or canonical variate analysis, attempts to establish whether a set of variables can be used to distinguish between two or more groups of cases.
  11. Linear discriminant analysis (LDA) computes a linear predictor from two sets of normally distributed data to allow for classification of new observations.
  12. Clustering systems assign objects into groups (called clusters) so that objects (cases) from the same cluster are more similar to each other than objects from different clusters.
  13. Recursive partitioning creates a decision tree that attempts to correctly classify members of the population based on a dichotomous dependent variable.
  14. Artificial neural networks extend regression and clustering methods to non-linear multivariate models.
  15. Statistical graphics such as tours, parallel coordinate plots, and scatterplot matrices can be used to explore multivariate data.
  16. Simultaneous equations models involve more than one regression equation, with different dependent variables, estimated together.
  17. Vector autoregression involves simultaneous regressions of various time series variables on their own and each other's lagged values.
  18. Principal response curves analysis (PRC) is a method based on RDA that allows the user to focus on treatment effects over time by correcting for changes in control treatments over time. [4]
  19. Iconography of correlations consists of replacing a correlation matrix with a diagram in which the “remarkable” correlations are represented by a solid line (positive correlation) or a dotted line (negative correlation).
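As an illustration of one entry in the list above, principal components analysis can be sketched with an eigendecomposition of the sample covariance matrix. This is a minimal sketch on synthetic data, assuming NumPy; the function name `pca` and the data-generating choices are hypothetical and chosen only for the example.

```python
import numpy as np

def pca(X):
    """Return (eigenvalues, components, scores), eigenvalues in decreasing order."""
    Xc = X - X.mean(axis=0)                 # centre each variable
    cov = np.cov(Xc, rowvar=False)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]       # reorder to decreasing variance
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    scores = Xc @ eigvecs                   # data expressed on the rotated axes
    return eigvals, eigvecs, scores

rng = np.random.default_rng(0)

# Synthetic data: two variables driven by a shared factor, plus one unrelated variable.
z = rng.normal(size=(500, 1))
X = np.hstack([
    z + 0.1 * rng.normal(size=(500, 1)),
    2.0 * z + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

eigvals, components, scores = pca(X)
explained = eigvals / eigvals.sum()
print("proportion of variance per component:", np.round(explained, 3))
```

The components are orthogonal and the scores are uncorrelated, while the eigenvalues sum to the total variance of the original variables, which is the sense in which the new variables carry the same information in decreasing proportions.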

Important probability distributions

There is a set of probability distributions used in multivariate analyses that play a similar role to the corresponding set of distributions that are used in univariate analysis when the normal distribution is appropriate to a dataset. These multivariate distributions are:

The Inverse-Wishart distribution is important in Bayesian inference, for example in Bayesian multivariate linear regression. Additionally, Hotelling's T-squared distribution is a multivariate distribution, generalising Student's t-distribution, that is used in multivariate hypothesis testing.
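A one-sample Hotelling's T-squared statistic can be computed as below. This is a minimal sketch assuming NumPy; the function name `hotelling_t2` and the synthetic data are hypothetical. It returns the statistic T² = n (x̄ − μ₀)ᵀ S⁻¹ (x̄ − μ₀) and its rescaled form (n − p) / (p (n − 1)) · T², which follows an F distribution with p and n − p degrees of freedom under the null hypothesis when the data are multivariate normal.

```python
import numpy as np

def hotelling_t2(X, mu0):
    """One-sample Hotelling T^2 and its F-scaled version for H0: mean == mu0."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    diff = X.mean(axis=0) - np.asarray(mu0, dtype=float)
    # atleast_2d keeps the p == 1 case working, where np.cov returns a scalar.
    S = np.atleast_2d(np.cov(X, rowvar=False))
    t2 = n * diff @ np.linalg.solve(S, diff)
    f_stat = (n - p) / (p * (n - 1)) * t2
    return t2, f_stat

rng = np.random.default_rng(0)
X = rng.normal(loc=[0.0, 0.5, 1.0], scale=1.0, size=(40, 3))  # synthetic sample

t2, f_stat = hotelling_t2(X, mu0=[0.0, 0.0, 0.0])
print("T^2:", t2, " F-scaled:", f_stat)
```

With a single variable (p = 1) the statistic reduces to the square of the usual one-sample Student's t statistic, which is the sense in which it generalises the t-distribution.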

History

Anderson's 1958 textbook, An Introduction to Multivariate Statistical Analysis, [5] educated a generation of theorists and applied statisticians; Anderson's book emphasizes hypothesis testing via likelihood ratio tests and the properties of power functions: admissibility, unbiasedness and monotonicity. [6] [7]

MVA was once confined largely to statistical theory because of the size and complexity of the underlying data sets and the high computational cost involved. With the dramatic growth of computational power, MVA now plays an increasingly important role in data analysis and has wide application in omics fields.

Applications

Software and tools

There are an enormous number of software packages and other tools for multivariate analysis, including:

See also

References

  1. Olkin, I.; Sampson, A. R. (2001-01-01), "Multivariate Analysis: Overview", in Smelser, Neil J.; Baltes, Paul B. (eds.), International Encyclopedia of the Social & Behavioral Sciences, Pergamon, pp. 10240–10247, ISBN 9780080430768, retrieved 2019-09-02
  2. Hidalgo, B; Goodman, M (2013). "Multivariate or multivariable regression?". Am J Public Health. 103: 39–40. doi:10.2105/AJPH.2012.300897. PMC 3518362. PMID 23153131.
  3. Unsophisticated analysts of bivariate Gaussian problems may find useful a crude but accurate method of accurately gauging probability by simply taking the sum S of the N residuals' squares, subtracting the sum Sm at minimum, dividing this difference by Sm, multiplying the result by (N - 2) and taking the inverse anti-ln of half that product.
  4. ter Braak, Cajo J.F. & Šmilauer, Petr (2012). Canoco reference manual and user's guide: software for ordination (version 5.0), p292. Microcomputer Power, Ithaca, NY.
  5. T.W. Anderson (1958) An Introduction to Multivariate Statistical Analysis, New York: Wiley ISBN 0471026409; 2e (1984) ISBN 0471889873; 3e (2003) ISBN 0471360910
  6. Sen, Pranab Kumar; Anderson, T. W.; Arnold, S. F.; Eaton, M. L.; Giri, N. C.; Gnanadesikan, R.; Kendall, M. G.; Kshirsagar, A. M.; et al. (June 1986). "Review: Contemporary Textbooks on Multivariate Statistical Analysis: A Panoramic Appraisal and Critique". Journal of the American Statistical Association. 81 (394): 560–564. doi:10.2307/2289251. ISSN 0162-1459. JSTOR 2289251. (Pages 560–561)
  7. Schervish, Mark J. (November 1987). "A Review of Multivariate Analysis". Statistical Science. 2 (4): 396–413. doi:10.1214/ss/1177013111. ISSN 0883-4237. JSTOR 2245530.
  8. CRAN has details on the packages available for multivariate data analysis

Further reading