Detrended correspondence analysis

Detrended correspondence analysis (DCA) is a multivariate statistical technique widely used by ecologists to find the main factors or gradients in large, species-rich but usually sparse data matrices that typify ecological community data. DCA is frequently used to suppress artifacts inherent in most other multivariate analyses when applied to gradient data. [1]

History

DCA, a correspondence analysis method, was created in 1979 by Mark Hill of the United Kingdom's Institute for Terrestrial Ecology (now merged into the Centre for Ecology and Hydrology) and implemented in a FORTRAN code package called DECORANA (Detrended Correspondence Analysis). DCA is sometimes erroneously referred to as DECORANA; however, DCA is the underlying algorithm, while DECORANA is a tool implementing it.

Issues addressed

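According to Hill and Gauch, [2] DCA suppresses two artifacts inherent in most other multivariate analyses when applied to gradient data. An example is a time series of plant species colonising a new habitat: early successional species are replaced by mid-successional species, then by late successional ones (see example below). When such data are analysed by a standard ordination such as correspondence analysis, two artifacts appear: the edge effect, in which the sample scores are compressed at the ends of the first axis, and the arch effect, in which the points follow a horseshoe-shaped curve rather than a straight line.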

Outside ecology, the same artifacts occur when gradient data are analysed (e.g. soil properties along a transect running between two different geologies, or behavioural data over the lifespan of an individual), because the curved projection is an accurate representation of the shape of the data in multivariate space.

Ter Braak and Prentice (1987, p. 121) cite a simulation study of two-dimensional species packing models in which DCA performed better than CA.

Method

DCA is an iterative algorithm that has shown itself to be a highly reliable and useful tool for data exploration and summary in community ecology (Shaw 2003). It starts by running a standard ordination (CA or reciprocal averaging) on the data to produce the initial horseshoe curve, in which the first ordination axis distorts into the second axis. It then divides the first axis into segments (default = 26) and rescales each segment to have a mean value of zero on the second axis, which effectively squashes the curve flat. It also rescales the axis so that the ends are no longer compressed relative to the middle, so that one DCA unit approximates the same rate of turnover all the way through the data; as a rule of thumb, four DCA units mean that there has been a total turnover in the community. Ter Braak and Prentice (1987, p. 122) warn against the non-linear rescaling of the axes because of robustness issues and recommend using detrending by polynomials only.
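The flattening step can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the DECORANA code: the function name `detrend_by_segments` is hypothetical, the segments are taken to be of equal width, the arched input is a synthetic parabola rather than a real ordination, and the non-linear rescaling of the axis is left out entirely.

```python
import numpy as np

def detrend_by_segments(axis1, axis2, n_segments=26):
    """Subtract from axis2 its mean within each equal-width segment of axis1.

    Hypothetical helper for illustration only; it shows the idea of
    detrending by segments, not the full DCA algorithm.
    """
    axis1 = np.asarray(axis1, dtype=float)
    axis2 = np.asarray(axis2, dtype=float)
    # Equal-width segment boundaries spanning the range of the first axis.
    edges = np.linspace(axis1.min(), axis1.max(), n_segments + 1)
    # Segment index 0..n_segments-1 per point; the maximum joins the last segment.
    seg = np.clip(np.searchsorted(edges, axis1, side="right") - 1,
                  0, n_segments - 1)
    detrended = axis2.copy()
    for s in np.unique(seg):
        members = seg == s
        # Give every segment a mean of zero on the second axis.
        detrended[members] -= axis2[members].mean()
    return detrended

# Synthetic arched ordination (an assumption): axis 2 is quadratic in axis 1.
axis1 = np.linspace(-1.0, 1.0, 200)
axis2 = axis1 ** 2 - np.mean(axis1 ** 2)

flat = detrend_by_segments(axis1, axis2, n_segments=26)
print("spread of axis 2 before detrending:", axis2.std())
print("spread of axis 2 after detrending: ", flat.std())
```

Because every segment is centred separately, the systematic curvature along the first axis is removed and only the small variation within each segment remains on the second axis.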

Drawbacks

No significance tests are available with DCA, although there is a constrained (canonical) version called DCCA in which the axes are forced by multiple linear regression to correlate optimally with a linear combination of other (usually environmental) variables; this allows a null model to be tested by Monte Carlo permutation analysis.

Example

The example shows an ideal data set: species are in rows and samples in columns. For each sample along the gradient, a new species is introduced and another species drops out. The result is a sparse matrix in which ones indicate the presence of a species in a sample. Except at the edges, each sample contains five species.

Figure: Comparison of Correspondence Analysis and Detrended Correspondence Analysis on example (ideal) data. See the arch effect in CA and its solution in DCA.
Ideal ordination data
1234567891011121314151617181920
SP111100000000000000000
SP211110000000000000000
SP311111000000000000000
SP401111100000000000000
SP500111110000000000000
SP600011111000000000000
SP700001111100000000000
SP800000111110000000000
SP900000011111000000000
SP1000000001111100000000
SP1100000000111110000000
SP1200000000011111000000
SP1300000000001111100000
SP1400000000000111110000
SP1500000000000011111000
SP1600000000000001111100
SP1700000000000000111110
SP1800000000000000011111
SP1900000000000000001111
SP2000000000000000000111
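
The table follows a simple rule: species i is present in sample j whenever the two indices differ by at most two. A minimal NumPy sketch that rebuilds the matrix from that rule is given here; the variable name `ideal` is an illustrative choice, not something used in the source.

```python
import numpy as np

n = 20
species = np.arange(1, n + 1)[:, None]   # row index: SP1 .. SP20
samples = np.arange(1, n + 1)[None, :]   # column index: samples 1 .. 20

# Species i occurs in sample j when |i - j| <= 2, giving a banded 0/1 matrix.
ideal = (np.abs(species - samples) <= 2).astype(int)

print(ideal[0])            # presence pattern of SP1 across the 20 samples
print(ideal.sum(axis=0))   # number of species in each sample
```

The column sums confirm the description of the data set: five species in every interior sample, and fewer in the first two and last two samples.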

The plot of the first two axes of the correspondence analysis result clearly shows the disadvantages of this procedure: the edge effect, in which the points are clustered at the ends of the first axis, and the arch effect.
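
Both effects can be checked numerically on the ideal data. The sketch below uses a textbook formulation of CA (singular value decomposition of the standardised residuals of the table) written with NumPy. It is an illustrative assumption about how one might compute the scores, not the code used to produce the figure, and it performs no detrending.

```python
import numpy as np

# Ideal data: species i present in sample j when |i - j| <= 2.
n = 20
idx = np.arange(n)
X = (np.abs(idx[:, None] - idx[None, :]) <= 2).astype(float)  # species x samples

# Correspondence analysis via SVD of the standardised residuals.
P = X / X.sum()                      # correspondence matrix
r = P.sum(axis=1)                    # row (species) masses
c = P.sum(axis=0)                    # column (sample) masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Sample scores (principal coordinates) on all axes; keep the first two.
sample_scores = (Vt.T / np.sqrt(c)[:, None]) * sv
ax1, ax2 = sample_scores[:, 0], sample_scores[:, 1]
if ax1[0] > ax1[-1]:
    ax1 = -ax1                       # the sign of an axis is arbitrary

gaps = np.abs(np.diff(ax1))          # spacing between neighbouring samples
print("axis 1 gap at the edge:  ", gaps[0])
print("axis 1 gap in the middle:", gaps[n // 2 - 1])
print("axis 2 at start, middle, end:", ax2[0], ax2[n // 2], ax2[-1])
```

Neighbouring samples sit closer together at the ends of the first axis than in the middle (edge effect), and the second axis takes the same sign at both ends of the gradient and the opposite sign in the middle (arch effect).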

References

  1. Hill and Gauch (1980)
  2. Hill and Gauch (1980)