Canonical analysis

In statistics, canonical analysis (from Ancient Greek κανών 'bar, measuring rod, ruler') belongs to the family of regression methods for data analysis. Regression analysis quantifies a relationship between a predictor variable and a criterion variable by the coefficient of correlation r, the coefficient of determination r², and the standard regression coefficient β. Multiple regression analysis expresses a relationship between a set of predictor variables and a single criterion variable by the multiple correlation R, the multiple coefficient of determination R², and a set of standard partial regression weights β1, β2, etc. Canonical variate analysis captures a relationship between a set of predictor variables and a set of criterion variables by the canonical correlations ρ1, ρ2, ..., and by the sets of canonical weights C and D.
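
To make this progression concrete, here is a small synthetic illustration (added here; the data and variable names are hypothetical, not from the article) computing r for a single predictor and R for a pair of predictors; the canonical correlations ρ1, ρ2, ... play the corresponding role when the criterion side is also a set of variables, as sketched in the sections below.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Hypothetical data: two predictors x1, x2 and one criterion y
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 0.8 * x1 + 0.3 * x2 + rng.normal(scale=0.5, size=n)

# Simple regression: correlation r between one predictor and the criterion
r = np.corrcoef(x1, y)[0, 1]

# Multiple regression: multiple correlation R, i.e. the correlation between y
# and its least-squares fitted values from both predictors
A = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
R = np.corrcoef(A @ beta, y)[0, 1]

print(f"r = {r:.3f}, R = {R:.3f}")
```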

Canonical analysis

Canonical analysis belongs to a group of methods which involve solving the characteristic equation for its latent roots and vectors. It describes formal structures in hyperspace that are invariant with respect to the rotation of their coordinates. In this type of solution, rotation preserves many optimizing properties, provided it takes place in certain ways and in an appropriate subspace of the corresponding hyperspace. This rotation from the maximum intervariate correlation structure into a different, simpler and more meaningful structure increases the interpretability of the canonical weights C and D. In this respect canonical analysis differs from Harold Hotelling's (1936) canonical variate analysis (also called canonical correlation analysis), which is designed to obtain maximum (canonical) correlations between the predictor and criterion canonical variates. The difference between canonical variate analysis and canonical analysis is analogous to the difference between principal components analysis and factor analysis, each with its characteristic set of communalities, eigenvalues and eigenvectors.
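
For concreteness, the characteristic equation referred to here can be written in terms of the within-set correlation matrices Rxx, Ryy and the between-set correlation matrix Rxy; this display is an added illustration of the standard formulation rather than a formula given in the article:

```latex
\det\left( \mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}\,\mathbf{R}_{yy}^{-1}\mathbf{R}_{yx} \;-\; \rho^{2}\mathbf{I} \right) = 0
```

Its latent roots ρ1² ≥ ρ2² ≥ ... are the squared canonical correlations, and the associated latent vectors give the weights C for the predictor set; the weights D for the criterion set follow from the companion equation with the roles of the two sets interchanged.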


Canonical analysis (simple)

Canonical analysis is a multivariate technique concerned with determining the relationships between groups of variables in a data set. The data set is split into two groups, X and Y, based on some common characteristics. The purpose of canonical analysis is then to find the relationship between X and Y, i.e. whether some linear combination of the X variables can represent the Y variables. It works by finding the linear combination of the X variables, i.e. X1, X2 etc., and the linear combination of the Y variables, i.e. Y1, Y2 etc., that are most highly correlated. These combinations are known as the first canonical variates, usually denoted U1 and V1, and the pair (U1, V1) is called a canonical function. The next canonical function, (U2, V2), is restricted so that it is uncorrelated with U1 and V1. All canonical variates are scaled to have unit variance.
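
As an illustration of this description (the data below are synthetic, and the library call is one common implementation rather than something prescribed by the article), the canonical variate pairs and their correlations can be computed with scikit-learn's CCA:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
# Synthetic data: 100 cases, three X variables and two Y variables
X = rng.normal(size=(100, 3))
Y = X @ rng.normal(size=(3, 2)) + 0.5 * rng.normal(size=(100, 2))

cca = CCA(n_components=2)
U, V = cca.fit_transform(X, Y)   # columns are the canonical variates (U1, U2) and (V1, V2)

for k in range(2):
    rho = np.corrcoef(U[:, k], V[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {rho:.3f}")

# Variates from different canonical functions are (near) uncorrelated, e.g. corr(U1, V2)
print(f"corr(U1, V2): {np.corrcoef(U[:, 0], V[:, 1])[0, 1]:.3f}")
```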

One can also construct relationships that are required to satisfy constraints arising from theory or from common sense/intuition; these are called maximum correlation models (Tofallis, 1999).

Mathematically, canonical analysis maximizes U′X′YV subject to U′X′XU = I and V′Y′YV = I, where X and Y are the data matrices (rows corresponding to instances and columns to features) and U and V are the matrices whose columns hold the canonical weights.
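
A minimal NumPy sketch of this maximization (an added illustration, assuming centered data matrices with nonsingular X′X and Y′Y) whitens each block and takes the singular value decomposition of the whitened cross-product; the singular values are the canonical correlations ρ1, ρ2, ...:

```python
import numpy as np

def canonical_weights(X, Y):
    """Find weight matrices U, V maximizing trace(U' X' Y V)
    subject to U' X' X U = I and V' Y' Y V = I.
    Assumes X and Y are centered and X'X, Y'Y are nonsingular."""
    Cxx, Cyy, Cxy = X.T @ X, Y.T @ Y, X.T @ Y

    def inv_sqrt(C):
        # Symmetric inverse square root via eigendecomposition
        w, Q = np.linalg.eigh(C)
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    A, rho, Bt = np.linalg.svd(Kx @ Cxy @ Ky, full_matrices=False)
    U = Kx @ A        # canonical weights for the X block
    V = Ky @ Bt.T     # canonical weights for the Y block
    return U, V, rho  # rho holds the canonical correlations

```

Up to sign and scaling of the weight vectors, this agrees with dedicated canonical correlation routines.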

See also

- Multivariate statistics
- Correlation and dependence
- Standard score
- Logistic regression
- Canonical correlation
- General linear model
- Total least squares
- Multiple correlation
- Multicollinearity
- Coefficient of determination
- Multilevel model
- Omnibus test
- Partial correlation
- Segmented regression
- RV coefficient
- Bivariate analysis
- Linear predictor function
- Linear regression
- Functional correlation

References

Hotelling, H. (1936). "Relations Between Two Sets of Variates". Biometrika, 28 (3/4), 321–377.

Tofallis, C. (1999). "Model Building with Multiple Dependent Variables and Constraints". Journal of the Royal Statistical Society, Series D (The Statistician), 48 (3), 371–378.