Multicollinearity

In statistics, multicollinearity or collinearity is a situation where the predictors in a regression model are linearly dependent.

Perfect multicollinearity refers to a situation where the predictive variables have an exact linear relationship. When there is perfect collinearity, the design matrix has less than full rank, and therefore the moment matrix cannot be inverted. In this situation, the parameter estimates of the regression are not well-defined, as the system of equations has infinitely many solutions.

Imperfect multicollinearity refers to a situation where the predictive variables have a nearly exact linear relationship.

Contrary to popular belief, neither the Gauss–Markov theorem nor the more common maximum likelihood justification for ordinary least squares relies on any kind of correlation structure between dependent predictors [1] [2] [3] (although perfect collinearity can cause problems with some software).

There is no justification for the practice of removing collinear variables as part of regression analysis, [1] [4] [5] [6] [7] and doing so may constitute scientific misconduct. Econometricians and statisticians have facetiously referred to imperfect collinearity as "micronumerosity", noting that it is only a problem when working with an insufficient sample size. [3] [4] Including collinear variables does not reduce the predictive power or reliability of the model as a whole, [6] and does not reduce the accuracy of coefficient estimates. [1]

High collinearity indicates that it is exceptionally important to include all collinear variables, as excluding any will cause worse coefficient estimates, strong confounding, and downward-biased estimates of standard errors. [2]

Perfect multicollinearity

[Figure: A depiction of multicollinearity.]

[Figure: In a linear regression with true parameters $a_1 = 2$ and $a_2 = 4$, the parameters are reliably estimated when $X_1$ and $X_2$ are uncorrelated (black case) but unreliably estimated when $X_1$ and $X_2$ are correlated (red case).]

Perfect multicollinearity refers to a situation where the predictors are linearly dependent (one can be written as an exact linear function of the others). Ordinary least squares requires inverting the matrix $X^{\mathsf{T}}X$, where

$X = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1N} & \cdots & X_{kN} \end{bmatrix}$

is an $N \times (k+1)$ matrix, where $N$ is the number of observations, $k$ is the number of explanatory variables, and $N \geq k+1$. If there is an exact linear relationship among the independent variables, then at least one of the columns of $X$ is a linear combination of the others, and so the rank of $X$ (and therefore of $X^{\mathsf{T}}X$) is less than $k+1$, and the matrix $X^{\mathsf{T}}X$ will not be invertible.
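A minimal numerical sketch of this rank deficiency, using NumPy (the simulated variables are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
x3 = 2.0 * x1 - x2                  # exact linear function of x1 and x2

# Design matrix with an intercept: k + 1 = 4 columns, but only rank 3.
X = np.column_stack([np.ones(N), x1, x2, x3])
print(np.linalg.matrix_rank(X))     # 3

# The moment matrix X'X is singular (up to rounding): inverting it
# either raises LinAlgError or returns numerically meaningless values.
print(np.linalg.cond(X.T @ X))      # enormous condition number
```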

Resolution

Perfect collinearity is typically caused by including redundant variables in a regression. For example, a dataset may include variables for income, expenses, and savings. However, because income is equal to expenses plus savings by definition, it is incorrect to include all 3 variables in a regression simultaneously. Similarly, including a dummy variable for every category (e.g., summer, autumn, winter, and spring) as well as an intercept term will result in perfect collinearity. This is known as the dummy variable trap. [8]
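A short sketch of the dummy variable trap, assuming pandas-style one-hot encoding of a hypothetical season variable:

```python
import numpy as np
import pandas as pd

seasons = pd.Series(["summer", "autumn", "winter", "spring"] * 25)

# All four dummies plus an intercept: the dummy columns sum to the
# intercept column, so the design matrix is rank-deficient.
trap = pd.get_dummies(seasons).astype(float)
trap.insert(0, "intercept", 1.0)
print(np.linalg.matrix_rank(trap.to_numpy()))   # 4, despite 5 columns

# Dropping one reference category avoids the trap.
safe = pd.get_dummies(seasons, drop_first=True).astype(float)
safe.insert(0, "intercept", 1.0)
print(np.linalg.matrix_rank(safe.to_numpy()))   # 4 columns, full rank
```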

The other common cause of perfect collinearity is attempting to use ordinary least squares when working with very wide datasets (those with more variables than observations). These require more advanced data analysis techniques, like Bayesian hierarchical modeling, to produce meaningful results.

Numerical issues

Sometimes, the variables are nearly collinear. In this case, the matrix $X^{\mathsf{T}}X$ has an inverse, but it is ill-conditioned. A computer algorithm may or may not be able to compute an approximate inverse; even if it can, the resulting inverse may have large rounding errors.

The standard measure of ill-conditioning in a matrix is the condition number (or condition index). This determines whether inversion of the matrix is numerically unstable with finite-precision numbers, indicating the potential sensitivity of the computed inverse to small changes in the original matrix. The condition number is computed by dividing the maximum singular value of the design matrix by its minimum singular value. [9] In the context of collinear variables, the variance inflation factor is the condition number for a particular coefficient.
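A hedged sketch of both diagnostics on simulated nearly-collinear data (the 0.05 noise scale is invented for illustration); statsmodels provides a per-column variance inflation factor:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])

# Condition number: largest singular value over smallest.
s = np.linalg.svd(X, compute_uv=False)
print(s.max() / s.min())                   # large => ill-conditioned

# Variance inflation factor for each slope column.
for i in (1, 2):
    print(variance_inflation_factor(X, i))
```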

Solutions

Numerical problems in estimating $\beta$ can be solved by applying standard techniques from linear algebra to estimate the equations more precisely:

  1. Standardizing predictor variables. Working with polynomial terms (e.g. $x^2$, $x^3$), including interaction terms (i.e. $x_1 \times x_2$), can cause multicollinearity. This is especially true when the variable in question has a limited range. Standardizing predictor variables will eliminate this special kind of multicollinearity for polynomials of up to 3rd order. [10]
  2. Use an orthogonal representation of the data. [11] Poorly-written statistical software will sometimes fail to converge to a correct representation when variables are strongly correlated. However, it is still possible to rewrite the regression to use only uncorrelated variables by performing a change of basis, as in the sketch after this list.
    • For polynomial terms in particular, it is possible to rewrite the regression as a function of uncorrelated variables using orthogonal polynomials.
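A sketch of both remedies on simulated data: centering a limited-range variable removes most of the correlation with its square, and a QR decomposition gives an exactly orthogonal basis (for polynomial columns, this amounts to discrete orthogonal polynomials):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(10.0, 12.0, 500)     # limited range: x and x^2 nearly collinear
print(np.corrcoef(x, x**2)[0, 1])    # ~0.9999

# Centering (part of standardization) removes most of this collinearity.
z = (x - x.mean()) / x.std()
print(np.corrcoef(z, z**2)[0, 1])    # near zero for a symmetric distribution

# A QR decomposition of the design matrix is a change of basis that
# yields exactly orthogonal regressors.
X = np.column_stack([np.ones_like(x), x, x**2])
Q, R = np.linalg.qr(X)
print(np.round(Q.T @ Q, 10))         # identity matrix: orthonormal columns
```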

Effects on coefficient estimates

In addition to causing numerical problems, imperfect collinearity makes precise estimation of variables difficult. In other words, highly correlated variables lead to poor estimates and large standard errors.

As an example, say that we notice Alice wears her boots whenever it is raining and that there are only puddles when it rains. Then, we cannot tell whether she wears boots to keep the rain from landing on her feet, or to keep her feet dry if she steps in a puddle.

The problem with trying to identify how much each of the two variables matters is that they are confounded with each other: our observations are explained equally well by either variable, so we do not know which one of them causes the observed correlations.

There are two ways to discover this information:

  1. Using prior information or theory. For example, if we notice Alice never steps in puddles, we can reasonably argue puddles are not why she wears boots, as she does not need the boots to avoid puddles.
  2. Collecting more data. If we observe Alice enough times, we will eventually see her on days where there are puddles but not rain (e.g. because the rain stops before she leaves home).
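The effect of collecting more data can be seen in a small simulation (a sketch using statsmodels, with true coefficients 2 and 4 as in the figure above and predictors correlated at 0.95):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

def slope_standard_errors(n, rho=0.95):
    # Two predictors correlated at rho, true coefficients 2 and 4.
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    y = 2.0 * x1 + 4.0 * x2 + rng.normal(size=n)
    fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    return fit.bse[1:]          # standard errors of the two slopes

print(slope_standard_errors(50))      # wide: "micronumerosity"
print(slope_standard_errors(50_000))  # collinearity is harmless with enough data
```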

This confounding becomes substantially worse when researchers attempt to ignore or suppress it by excluding these variables from the regression (see #Misuse). Excluding multicollinear variables from regressions will invalidate causal inference and produce worse estimates by removing important confounders.

Remedies

There are many ways to prevent multicollinearity from affecting results by planning ahead of time. However, these methods require researchers to decide on a procedure and analysis before data has been collected (see post hoc analysis and #Misuse).

Regularized estimators

Many regression methods are naturally "robust" to multicollinearity and generally perform better than ordinary least squares regression, even when variables are independent. Regularized regression techniques such as ridge regression, LASSO, elastic net regression, or spike-and-slab regression are less sensitive to including "useless" predictors, a common cause of collinearity. These techniques can detect and remove these predictors automatically to avoid problems. Bayesian hierarchical models (provided by software like BRMS) can perform such regularization automatically, learning informative priors from the data.
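A brief sketch of how regularized estimators behave on nearly duplicated predictors, using scikit-learn (the data-generating process is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV

rng = np.random.default_rng(4)
n = 80
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly a duplicate of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(size=n)          # only x1 truly matters

# OLS splits the effect between the near-duplicates erratically; ridge
# shrinks the pair toward sharing it; the lasso can drop one entirely.
print(LinearRegression().fit(X, y).coef_)
print(RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y).coef_)
print(LassoCV(cv=5).fit(X, y).coef_)
```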

Often, problems caused by the use of frequentist estimation are misunderstood or misdiagnosed as being related to multicollinearity. [3] Researchers are often frustrated not by multicollinearity, but by their inability to incorporate relevant prior information in regressions. For example, complaints that coefficients have "wrong signs" or confidence intervals that "include unrealistic values" indicate there is important prior information that is not being incorporated into the model. When this information is available, it should be incorporated into the prior using Bayesian regression techniques. [3]
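For instance, under the standard conjugate normal model with known noise variance, prior information about the coefficients enters the estimate in closed form (a minimal sketch; the prior values and data below are illustrative, and this is one simple Bayesian regression technique among many):

```python
import numpy as np

def posterior_mean(X, y, prior_mean, prior_cov, noise_var=1.0):
    # Conjugate normal model: beta ~ N(prior_mean, prior_cov), Gaussian
    # noise with known variance. The posterior mean solves a
    # prior-regularized version of the normal equations.
    prior_prec = np.linalg.inv(prior_cov)
    post_prec = X.T @ X / noise_var + prior_prec
    return np.linalg.solve(post_prec, X.T @ y / noise_var + prior_prec @ prior_mean)

rng = np.random.default_rng(5)
n = 30
x1 = rng.normal(size=n)
x2 = 0.98 * x1 + 0.2 * rng.normal(size=n)      # strongly collinear pair
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 4.0 * x2 + rng.normal(size=n)

# A prior centered on plausible values pulls the collinearity-inflated
# estimates toward them; OLS is recovered as the prior variance grows.
print(posterior_mean(X, y, np.array([2.0, 4.0]), np.eye(2)))
```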

Stepwise regression (the procedure of excluding "collinear" or "insignificant" variables) is especially vulnerable to multicollinearity, and is one of the few procedures wholly invalidated by it (with any collinearity resulting in heavily biased estimates and invalidated p-values). [2]

Improved experimental design

When conducting experiments where researchers have control over the predictive variables, researchers can often avoid collinearity by choosing an optimal experimental design in consultation with a statistician.

Acceptance

While the above strategies work in some situations, they typically do not have a substantial effect. More advanced techniques may still result in large standard errors. Thus the most common response to multicollinearity should be to "do nothing". [1] The scientific process often involves null or inconclusive results; not every experiment will be "successful" in the sense of providing decisive confirmation of the researcher's original hypothesis.

Edward Leamer notes that "The solution to the weak evidence problem is more and better data. Within the confines of the given data set there is nothing that can be done about weak evidence"; [3] researchers who believe there is a problem with the regression results should look at the prior probability, not the likelihood function.

Damodar Gujarati writes that "we should rightly accept [our data] are sometimes not very informative about parameters of interest". [1] Olivier Blanchard quips that "multicollinearity is God's will, not a problem with OLS"; [7] in other words, when working with observational data, researchers cannot "fix" multicollinearity, only accept it.

Misuse

Variance inflation factors are often misused as criteria in stepwise regression (i.e. for variable inclusion/exclusion), a use that "lacks any logical basis but also is fundamentally misleading as a rule-of-thumb". [2]

Excluding collinear variables leads to artificially small estimates for standard errors, but does not reduce the true (not estimated) standard errors for regression coefficients. [1] Excluding variables with a high variance inflation factor also invalidates the calculated standard errors and p-values, by turning the results of the regression into a post hoc analysis. [13]

Because collinearity leads to large standard errors and p-values, which can make publishing articles more difficult, some researchers will try to suppress inconvenient data by removing strongly-correlated variables from their regression. This procedure falls into the broader categories of p-hacking, data dredging, and post hoc analysis. Dropping (useful) collinear predictors will generally worsen the accuracy of the model and coefficient estimates.

Similarly, trying many different models or estimation procedures (e.g. ordinary least squares, ridge regression, etc.) until finding one that can "deal with" the collinearity creates a forking paths problem. P-values and confidence intervals derived from post hoc analyses are invalidated by ignoring the uncertainty in the model selection procedure.

It is reasonable to exclude unimportant predictors if they are known ahead of time to have little or no effect on the outcome; for example, local cheese production should not be used to predict the height of skyscrapers. However, this must be done when first specifying the model, prior to observing any data, and potentially-informative variables should always be included.

See also

  • Least squares
  • Gauss–Markov theorem
  • Regression analysis
  • Ridge regression
  • Coefficient of determination
  • Instrumental variables estimation
  • Ordinary least squares
  • Residual sum of squares
  • Seemingly unrelated regressions
  • Moment matrix
  • Bayesian linear regression
  • Bayesian multivariate linear regression
  • Variance inflation factor
  • Principal component regression
  • Polynomial regression
  • Linear least squares
  • Linear predictor function
  • Influential observation
  • Linear regression
  • Homoscedasticity and heteroscedasticity

References

  1. Gujarati, Damodar (2009). "Multicollinearity: what happens if the regressors are correlated?". Basic Econometrics (4th ed.). McGraw-Hill. p. 363. ISBN 9780073375779.
  2. Kalnins, Arturs; Praitis Hill, Kendall (13 December 2023). "The VIF Score. What is it Good For? Absolutely Nothing". Organizational Research Methods. doi:10.1177/10944281231216381. ISSN 1094-4281.
  3. Leamer, Edward E. (1973). "Multicollinearity: A Bayesian Interpretation". The Review of Economics and Statistics. 55 (3): 371–380. doi:10.2307/1927962. ISSN 0034-6535.
  4. Giles, Dave (15 September 2011). "Micronumerosity". Econometrics Beat: Dave Giles' Blog. Retrieved 3 September 2023.
  5. Goldberger, A. S. (1964). Econometric Theory. New York: Wiley.
  6. Goldberger, A. S. "Chapter 23.3". A Course in Econometrics. Cambridge, MA: Harvard University Press.
  7. Blanchard, Olivier Jean (October 1987). "Comment". Journal of Business & Economic Statistics. 5 (4): 449–451. doi:10.1080/07350015.1987.10509611. ISSN 0735-0015.
  8. Karabiber, Fatih. "Dummy Variable Trap - What is the Dummy Variable Trap?". LearnDataSci (www.learndatasci.com). Retrieved 18 January 2024.
  9. Belsley, David (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. New York: Wiley. ISBN 978-0-471-52889-0.
  10. "12.6 - Reducing Structural Multicollinearity | STAT 501". newonlinecourses.science.psu.edu. Retrieved 16 March 2019.
  11. "Computational Tricks with Turing (Non-Centered Parametrization and QR Decomposition)". storopoli.io. Retrieved 3 September 2023.
  12. Gelman, Andrew; Imbens, Guido (3 July 2019). "Why High-Order Polynomials Should Not Be Used in Regression Discontinuity Designs". Journal of Business & Economic Statistics. 37 (3): 447–456. doi:10.1080/07350015.2017.1366909. ISSN 0735-0015.
  13. Gelman, Andrew; Loken, Eric (14 November 2013). "The garden of forking paths" (PDF). Unpublished manuscript, Columbia University.
