Ordination (statistics)

Ordination or gradient analysis, in multivariate analysis, is a method complementary to data clustering, used mainly in exploratory data analysis rather than in hypothesis testing. In contrast to cluster analysis, which assigns objects to discrete groups, ordination orders objects in a (usually lower-dimensional) latent space. In the ordination space, objects that are near each other share attributes (i.e., are similar to some degree), and dissimilar objects are farther apart. These relationships between the objects, on each of several axes or latent variables, are then characterized numerically and/or graphically, for example in a biplot.

The first ordination method, principal components analysis, was suggested by Karl Pearson in 1901.

Methods

Ordination methods can broadly be categorized into eigenvector-based, algorithm-based, and model-based methods. Many classical ordination techniques belong to the first group, including principal components analysis, correspondence analysis (CA) and its derivatives (detrended correspondence analysis and canonical correspondence analysis), and redundancy analysis.
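
As a minimal sketch of an eigenvector-based ordination, the following code places four objects along the first principal axis of a small data table. The site names and values are invented for illustration, and power iteration is used only to keep the example free of dependencies; in practice a library eigendecomposition or singular value decomposition would normally be used.

```python
import math


def center_columns(rows):
    """Subtract each column mean so every variable has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of column-centered data."""
    n = len(centered)
    p = len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def leading_eigenvector(matrix, iterations=500):
    """Approximate the dominant eigenvector by power iteration."""
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def first_axis_scores(rows):
    """Project each object onto the first principal axis."""
    centered = center_columns(rows)
    axis = leading_eigenvector(covariance_matrix(centered))
    return [sum(x * a for x, a in zip(row, axis)) for row in centered]


# Hypothetical data: four sites (objects) described by three variables.
sites = {
    "A": [9.0, 1.0, 0.0],
    "B": [7.0, 3.0, 1.0],
    "C": [4.0, 5.0, 3.0],
    "D": [1.0, 4.0, 8.0],
}

scores = first_axis_scores(list(sites.values()))
# Sorting by score gives the order of the sites along the first axis.
ordering = [name for _, name in sorted(zip(scores, sites))]
print(ordering)
```

Sites with similar values receive nearby scores, so their order along the axis reflects the main gradient in the table. The sign of an axis is arbitrary, so the reversed order would be an equally valid result.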

The second group includes some distance-based methods, such as non-metric multidimensional scaling, and machine learning methods, such as t-distributed stochastic neighbor embedding and other nonlinear dimensionality reduction techniques.
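
Distance-based methods start from a matrix of pairwise dissimilarities between objects, and non-metric multidimensional scaling in particular tries to preserve only the rank order of those dissimilarities. The sketch below builds such a matrix and ranks the pairs. The Bray-Curtis measure and the sample data are assumptions chosen for illustration (the measure is common for ecological abundance data); any other dissimilarity could be substituted.

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("both vectors are empty; dissimilarity is undefined")
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def dissimilarity_matrix(samples):
    """Symmetric matrix of pairwise dissimilarities, stored as nested dicts."""
    names = list(samples)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = bray_curtis(samples[a], samples[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix


# Hypothetical abundance counts for four samples and three species.
samples = {
    "A": [9, 1, 0],
    "B": [7, 3, 1],
    "C": [4, 5, 3],
    "D": [1, 4, 8],
}

dist = dissimilarity_matrix(samples)

# Pairs ordered from most similar to least similar; this rank order is
# what a non-metric method attempts to reproduce in the ordination space.
ranked_pairs = sorted(
    combinations(samples, 2), key=lambda pair: dist[pair[0]][pair[1]]
)
print(ranked_pairs)
```

An ordination algorithm would then search for coordinates in a low-dimensional space whose inter-point distances agree with this ranking as closely as possible; that optimization step is not shown here.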

The third group includes model-based ordination methods, which can be considered multivariate extensions of generalized linear models. [1] [2] [3] [4] Model-based ordination methods are more flexible in their application than classical ordination methods; for example, it is possible to include random effects. [5] Unlike in the first two groups, there is no (implicit or explicit) distance measure in the ordination. Instead, a distribution needs to be specified for the responses, as is typical for statistical models. These and other assumptions, such as the assumed mean-variance relationship, can be checked using residual diagnostics, which is not possible in other ordination methods.
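
As a rough illustration of what specifying a distribution for the responses involves, one common way to write an unconstrained model-based ordination is as a generalized linear latent variable model. The notation below is introduced here for illustration and is not taken from the text above; individual methods differ in the details.

```latex
y_{ij} \mid \mathbf{z}_i \sim F\bigl(\mu_{ij}, \phi_j\bigr), \qquad
g(\mu_{ij}) = \beta_{0j} + \mathbf{z}_i^{\top} \boldsymbol{\gamma}_j, \qquad
\mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)
```

Here y_ij is the response of variable j for object i, F is the assumed response distribution with mean mu_ij and an optional dispersion parameter phi_j, g is a link function, beta_0j is a variable-specific intercept, z_i is the d-dimensional vector of latent variables that serves as the ordination scores of object i, and gamma_j holds the loadings of variable j. The choice of F fixes the mean-variance relationship that residual diagnostics can then check.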

Applications

Ordination can be used in the analysis of any set of multivariate objects. It is frequently used in the environmental and ecological sciences, particularly plant community ecology. It is also used in genetics and systems biology for microarray data analysis, and in psychometrics.

References

  1. Hui, Francis K. C.; Taskinen, Sara; Pledger, Shirley; Foster, Scott D.; Warton, David I. (2015). O'Hara, Robert B. (ed.). "Model-based approaches to unconstrained ordination". Methods in Ecology and Evolution. 6 (4): 399–411. doi:10.1111/2041-210X.12236. ISSN 2041-210X. S2CID 62624917.
  2. Warton, David I.; Blanchet, F. Guillaume; O'Hara, Robert B.; Ovaskainen, Otso; Taskinen, Sara; Walker, Steven C.; Hui, Francis K. C. (2015-12-01). "So Many Variables: Joint Modeling in Community Ecology". Trends in Ecology & Evolution. 30 (12): 766–779. doi:10.1016/j.tree.2015.09.007. ISSN 0169-5347. PMID 26519235.
  3. Yee, Thomas W. (2004). "A New Technique for Maximum-Likelihood Canonical Gaussian Ordination". Ecological Monographs. 74 (4): 685–701. doi:10.1890/03-0078. ISSN 0012-9615.
  4. Hawinkel, Stijn; Kerckhof, Frederiek-Maarten; Bijnens, Luc; Thas, Olivier (2019-02-13). "A unified framework for unconstrained and constrained ordination of microbiome read count data". PLOS ONE. 14 (2): e0205474. doi:10.1371/journal.pone.0205474. ISSN 1932-6203. PMC 6373939. PMID 30759084.
  5. van der Veen, Bert; Hui, Francis K. C.; Hovstad, Knut A.; O'Hara, Robert B. (2023). "Concurrent ordination: Simultaneous unconstrained and constrained latent variable modelling". Methods in Ecology and Evolution. 14 (2): 683–695. doi:10.1111/2041-210X.14035. hdl:11250/3050891. ISSN 2041-210X.
