In multivariate analysis, canonical correspondence analysis (CCA) is an ordination technique that determines axes from the response data as a linear combination of measured predictors. CCA is commonly used in ecology to extract gradients that drive the composition of ecological communities. CCA extends correspondence analysis (CA) with regression in order to incorporate predictor variables.
CCA was developed in 1986 by Cajo ter Braak [1] and implemented in the program CANOCO, an extension of DECORANA. [2] To date, CCA is one of the most popular multivariate methods in ecology, despite the availability of contemporary alternatives. [3] CCA was originally derived and implemented using an algorithm of weighted averaging, though Legendre & Legendre (1998) derived an alternative algorithm. [4]
CCA requires that the samples are random and independent, that the data are categorical, and that the independent variables are consistent within the sample site and error-free. [5] The original publication states the need for equal species tolerances, equal species maxima, and equispaced or uniformly distributed species optima and site scores. [1]
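The description of CCA as correspondence analysis combined with regression can be illustrated with a short numerical sketch. The code below is a minimal illustration under stated assumptions, not the weighted-averaging algorithm used in CANOCO: it assumes a sites-by-species abundance matrix `Y` in which every row and column has a positive total, and a sites-by-predictors matrix `X`. The function name `cca_sketch` and the returned keys are hypothetical names chosen for this example.

```python
import numpy as np

def cca_sketch(Y, X):
    """Illustrative CCA: regress CA residuals on predictors, then decompose.

    Y : (n_sites, n_species) non-negative abundances, no empty rows/columns.
    X : (n_sites, n_predictors) measured predictors.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)

    # Correspondence-analysis step: relative frequencies and margins.
    P = Y / Y.sum()
    r = P.sum(axis=1)            # site (row) weights, sum to 1
    c = P.sum(axis=0)            # species (column) weights, sum to 1
    expected = np.outer(r, c)
    Q = (P - expected) / np.sqrt(expected)   # chi-square standardized residuals

    # Regression step: weighted-center the predictors and weight rows by sqrt(r).
    Xc = X - r @ X               # subtract the site-weighted mean of each predictor
    Xw = Xc * np.sqrt(r)[:, None]
    B, *_ = np.linalg.lstsq(Xw, Q, rcond=None)
    Q_hat = Xw @ B               # part of Q that is linear in the predictors

    # Ordination step: singular value decomposition of the fitted values.
    U, s, Vt = np.linalg.svd(Q_hat, full_matrices=False)
    return {
        "eigenvalues": s ** 2,                   # inertia per constrained axis
        "total_inertia": float((Q ** 2).sum()),
        "constrained_inertia": float((Q_hat ** 2).sum()),
    }

# Example with synthetic data (values are arbitrary and only for illustration).
rng = np.random.default_rng(0)
Y_demo = rng.poisson(5, size=(10, 6)) + 1    # +1 keeps every margin positive
X_demo = rng.normal(size=(10, 2))
result = cca_sketch(Y_demo, X_demo)
```

In this sketch the constrained inertia is the portion of the total CA inertia that the predictors account for, and the number of non-trivial axes cannot exceed the number of predictors.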
Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable, i.e., multivariate random variables. Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate analyses in order to understand the relationships between variables and their relevance to the problem being studied.
The following outline is provided as an overview of and topical guide to statistics:
Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing.
In statistics, canonical-correlation analysis (CCA), also called canonical variates analysis, is a way of inferring information from cross-covariance matrices. If we have two vectors X = (X1, ..., Xn) and Y = (Y1, ..., Ym) of random variables, and there are correlations among the variables, then canonical-correlation analysis will find linear combinations of X and Y that have a maximum correlation with each other. T. R. Knapp notes that "virtually all of the commonly encountered parametric tests of significance can be treated as special cases of canonical-correlation analysis, which is the general procedure for investigating the relationships between two sets of variables." The method was first introduced by Harold Hotelling in 1936, although in the context of angles between flats the mathematical concept was published by Camille Jordan in 1875.
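The idea of finding maximally correlated linear combinations can be sketched numerically. The example below is one possible approach, assuming more observations than variables and full column rank in both data sets: it centers each block, orthonormalizes it with a QR decomposition, and takes the singular values of the cross-product as the canonical correlations. The function name `canonical_correlations` is a hypothetical name used only for this illustration.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X and the columns of Y.

    X : (n_samples, n) observations of the first set of variables.
    Y : (n_samples, m) observations of the second set of variables.
    Assumes n_samples > max(n, m) and full column rank after centering.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)

    # Center each variable so that cross-products behave like covariances.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Orthonormal bases for the column spaces of the centered data.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)

    # Singular values of Qx^T Qy are the cosines of the principal angles,
    # which equal the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against rounding slightly above 1

# Example: Y is an exact linear transform of two columns of X,
# so both canonical correlations should be numerically equal to 1.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))
A = np.array([[2.0, 1.0], [0.0, 3.0]])
Y_demo = X_demo[:, :2] @ A
rho = canonical_correlations(X_demo, Y_demo)
```

The link to angles between flats mentioned above is visible here: the singular values are cosines of the principal angles between the two column spaces.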
Partial least squares (PLS) regression is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods are known as bilinear factor models. Partial least squares discriminant analysis (PLS-DA) is a variant used when the Y is categorical.
Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
When classification is performed by a computer, statistical methods are normally used to develop the algorithm.
Ordination or gradient analysis, in multivariate analysis, is a method complementary to data clustering, and used mainly in exploratory data analysis. In contrast to cluster analysis, ordination orders quantities in a latent space. In the ordination space, quantities that are near each other share attributes, and dissimilar objects are farther from each other. Such relationships between the objects, on each of several axes or latent variables, are then characterized numerically and/or graphically in a biplot.
Spatial ecology studies the ultimate distributional or spatial unit occupied by a species. In a particular habitat shared by several species, each of the species is usually confined to its own microhabitat or spatial niche because two species in the same general territory cannot usually occupy the same ecological niche for any significant length of time.
Phytosociology, also known as phytocoenology or simply plant sociology, is the study of groups of plant species that are usually found together. Phytosociology aims to empirically describe the vegetative environment of a given territory. A specific community of plants is considered a social unit, the product of definite conditions, present and past, and can exist only when such conditions are met. In phytosociology, such a unit is known as a phytocoenosis. A phytocoenosis is more commonly known as a plant community, and consists of the sum of all plants in a given area. It is a subset of a biocoenosis, which consists of all organisms in a given area. More strictly speaking, a phytocoenosis is a set of plants in an area that are interacting with each other through competition or other ecological processes. Coenoses are not equivalent to ecosystems, which consist of organisms and the physical environment that they interact with. A phytocoenosis has a distribution which can be mapped. Phytosociology has a system for describing and classifying these phytocoenoses in a hierarchy, known as syntaxonomy, and this system has a nomenclature. The science is most advanced in Europe, Africa and Asia.
Nonparametric regression is a category of regression analysis in which the predictor does not take a predetermined form but is constructed according to information derived from the data. That is, no parametric form is assumed for the relationship between predictors and dependent variable. Nonparametric regression requires larger sample sizes than regression based on parametric models because the data must supply the model structure as well as the model estimates.
Correspondence analysis (CA) is a multivariate statistical technique proposed by Herman Otto Hartley (Hirschfeld) and later developed by Jean-Paul Benzécri. It is conceptually similar to principal component analysis, but applies to categorical rather than continuous data. In a similar manner to principal component analysis, it provides a means of displaying or summarising a set of data in two-dimensional graphical form. Its aim is to display in a biplot any structure hidden in the multivariate setting of the data table. As such it is a technique from the field of multivariate ordination. Since the variant of CA described here can be applied either with a focus on the rows or on the columns it should in fact be called simple (symmetric) correspondence analysis.
Detrended correspondence analysis (DCA) is a multivariate statistical technique widely used by ecologists to find the main factors or gradients in large, species-rich but usually sparse data matrices that typify ecological community data. DCA is frequently used to suppress artifacts inherent in most other multivariate analyses when applied to gradient data.
The Unscrambler X is a commercial software product for multivariate data analysis, used for calibration of multivariate data which is often in the application of analytical data such as near infrared spectroscopy and Raman spectroscopy, and development of predictive models for use in real-time spectroscopic analysis of materials. The software was originally developed in 1986 by Harald Martens and later by CAMO Software.
Species distribution modelling (SDM), also known as environmental (or ecological) niche modelling (ENM), habitat modelling, predictive habitat distribution modelling, and range mapping, uses ecological models to predict the distribution of a species across geographic space and time using environmental data. The environmental data are most often climate data (e.g. temperature, precipitation), but can include other variables such as soil type, water depth, and land cover. SDMs are used in several research areas in conservation biology, ecology and evolution. These models can be used to understand how environmental conditions influence the occurrence or abundance of a species, and for predictive purposes (ecological forecasting). Predictions from an SDM may be of a species' future distribution under climate change, a species' past distribution in order to assess evolutionary relationships, or the potential future distribution of an invasive species. Predictions of current and/or future habitat suitability can be useful for management applications (e.g. reintroduction or translocation of vulnerable species, reserve placement in anticipation of climate change).
In statistics, linear regression is a statistical model which estimates the linear relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable. If the explanatory variables are measured with error then errors-in-variables models are required, also known as measurement error models.
In statistics, the class of vector generalized linear models (VGLMs) was proposed to enlarge the scope of models catered for by generalized linear models (GLMs). In particular, VGLMs allow for response variables outside the classical exponential family and for more than one parameter. Each parameter can be transformed by a link function. The VGLM framework is also large enough to naturally accommodate multiple responses; these are several independent responses each coming from a particular statistical distribution with possibly different parameter values.
In multivariate statistics, principal response curves (PRC) are used for analysis of treatment effects in experiments with a repeated measures design.
Marti J. Anderson is an American researcher based in New Zealand. Her work in ecological statistics is interdisciplinary, ranging from marine biology and ecology to mathematical and applied statistics. Her core areas of research and expertise are community ecology, biodiversity, multivariate analysis, resampling methods, experimental designs, and statistical models of species abundances. She is a Distinguished Professor in the New Zealand Institute for Advanced Study at Massey University and also the Director of the New Zealand research and software-development company PRIMER-e.