Variance inflation factor

In statistics, the variance inflation factor (VIF) is the ratio (quotient) of the variance of a parameter estimate when fitting a full model that includes other parameters to the variance of the parameter estimate if the model is fit with only the parameter on its own. [1] The VIF provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity.

Cuthbert Daniel claims to have invented the concept behind the variance inflation factor, but did not come up with the name. [2]

Definition

Consider the following linear model with k independent variables:

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε.

The standard error of the estimate of βj is the square root of the (j + 1)-th diagonal element of s²(XᵀX)⁻¹, where s is the root mean squared error (RMSE) (note that RMSE² is a consistent estimator of σ², the true variance of the error term); X is the regression design matrix, that is, a matrix such that Xi,j+1 is the value of the jth independent variable for the ith case or observation, and such that Xi,1, the predictor vector associated with the intercept term, equals 1 for all i. It turns out that the square of this standard error, the estimated variance of the estimate of βj, can be equivalently expressed as: [3] [4]

Var(β̂j) = s² / ((n − 1) Var(Xj)) × 1 / (1 − Rj²),

where Rj² is the multiple R² for the regression of Xj on the other covariates (a regression that does not involve the response variable Y) and Var(Xj) is the sample variance of Xj. This identity separates the influences of several distinct factors on the variance of the coefficient estimate:

- s²: greater scatter in the data around the regression surface leads to proportionately more variance in the coefficient estimates;
- n: a greater sample size results in proportionately less variance in the coefficient estimates;
- Var(Xj): greater variability in a particular covariate leads to proportionately less variance in the corresponding coefficient estimate.

The remaining term, 1 / (1 − Rj²), is the VIF. It reflects all other factors that influence the uncertainty in the coefficient estimates. The VIF equals 1 when the vector Xj is orthogonal to each column of the design matrix for the regression of Xj on the other covariates. By contrast, the VIF is greater than 1 when the vector Xj is not orthogonal to all columns of that design matrix. Finally, note that the VIF is invariant to the scaling of the variables (that is, we could scale each variable Xj by a constant cj without changing the VIF).

Now let Z denote the submatrix of X consisting of all columns other than the column of Xj (including the column of ones for the intercept), and, without losing generality, reorder the columns of X to set the first column to be Xj:

X = [Xj, Z].

By using the Schur complement, the element in the first row and first column of (XᵀX)⁻¹ is

[(XᵀX)⁻¹]₁₁ = (XjᵀXj − XjᵀZ (ZᵀZ)⁻¹ ZᵀXj)⁻¹.

Then we have

Var(β̂j) = s² [(XᵀX)⁻¹]₁₁
         = s² / (XjᵀXj − XjᵀZ (ZᵀZ)⁻¹ ZᵀXj)
         = s² / RSSj
         = s² / ((n − 1) Var(Xj)) × 1 / (1 − Rj²).

Here (ZᵀZ)⁻¹ ZᵀXj is the coefficient vector of the regression of the dependent variable Xj on the covariates Z, and RSSj = XjᵀXj − XjᵀZ (ZᵀZ)⁻¹ ZᵀXj is the corresponding residual sum of squares, so that RSSj = (n − 1) Var(Xj) (1 − Rj²).
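
The identity above can be checked numerically. The following is a minimal Python/NumPy sketch with made-up data (the variable names, seed, and coefficients are illustrative, not part of the article): it computes Var(β̂j) directly from s²(XᵀX)⁻¹ and again from s² / ((n − 1) Var(Xj)) × 1 / (1 − Rj²), and the two values agree.

```python
# Minimal sketch (illustrative data): verify the variance identity numerically.
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
W = rng.normal(size=(n, k))                # the k independent variables
W[:, 2] += 0.8 * W[:, 1]                   # induce some collinearity
y = 1.0 + W @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), W])       # design matrix with intercept column
beta = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimates
resid = y - X @ beta
s2 = resid @ resid / (n - X.shape[1])      # RMSE^2

j = 2                                      # look at the j-th independent variable
var_direct = s2 * np.linalg.inv(X.T @ X)[j, j]   # (j+1)-th diagonal element (0-indexed: j)

# Auxiliary regression of X_j on the other covariates (intercept included)
xj = W[:, j - 1]
Z = np.column_stack([np.ones(n), np.delete(W, j - 1, axis=1)])
gamma = np.linalg.lstsq(Z, xj, rcond=None)[0]
rss_j = np.sum((xj - Z @ gamma) ** 2)
r2_j = 1.0 - rss_j / np.sum((xj - xj.mean()) ** 2)
vif_j = 1.0 / (1.0 - r2_j)

var_identity = s2 / ((n - 1) * xj.var(ddof=1)) * vif_j
print(var_direct, var_identity)            # the two values agree
```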

Calculation and analysis

We can calculate k different VIFs (one for each Xi) in three steps:

Step one

First we run an ordinary least squares regression that has Xi as a function of all the other explanatory variables in the first equation.

If i = 1, for example, the equation would be

X1 = α0 + α2X2 + α3X3 + ... + αkXk + e,

where α0 is a constant and e is the error term.
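
As an illustration of step one, a minimal NumPy sketch with made-up data (not part of the original text) runs this auxiliary regression for i = 1 and records its coefficient of determination for use in step two:

```python
# Step one (illustrative): regress X_1 on a constant and all other explanatory variables.
import numpy as np

rng = np.random.default_rng(1)
n = 100
W = rng.normal(size=(n, 3))                    # columns are X_1, X_2, X_3
W[:, 0] += 0.9 * W[:, 1]                       # make X_1 correlated with X_2

x1 = W[:, 0]
Z = np.column_stack([np.ones(n), W[:, 1:]])    # constant plus X_2, ..., X_k
alpha = np.linalg.lstsq(Z, x1, rcond=None)[0]  # OLS estimates alpha_0, alpha_2, ..., alpha_k
resid = x1 - Z @ alpha
r2_1 = 1.0 - resid @ resid / np.sum((x1 - x1.mean()) ** 2)   # R^2 of the auxiliary regression
```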

Step two

Then, calculate the VIF factor for β̂i with the following formula:

VIFi = 1 / (1 − Ri²),

where Ri² is the coefficient of determination of the regression equation in step one, with Xi on the left hand side, and all other predictor variables (all the other X variables) on the right hand side.
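
A small self-contained sketch of this formula (the helper name is ours, not from any particular library):

```python
# Step two (illustrative): the VIF implied by the auxiliary regression's R^2.
def vif_from_r2(r2: float) -> float:
    return 1.0 / (1.0 - r2)

print(vif_from_r2(0.0))   # 1.0  -> no inflation
print(vif_from_r2(0.9))   # 10.0 -> exactly at the common rule-of-thumb cutoff for high multicollinearity
```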

Step three

Analyze the magnitude of multicollinearity by considering the size of VIF(β̂i). A rule of thumb is that if VIF(β̂i) > 10 then multicollinearity is high [5] (a cutoff of 5 is also commonly used [6] ). However, there is no value of the VIF greater than 1 at which the variance of the slopes of predictors is not inflated at all. As a result, including two or more variables in a multiple regression that are not orthogonal (i.e. that have correlation ≠ 0) will alter each other's slope, SE of the slope, and P-value, because there is shared variance between the predictors that cannot be uniquely attributed to any one of them.

Some software instead calculates the tolerance, which is simply the reciprocal of the VIF. The choice of which to use is a matter of personal preference.
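
Putting the three steps together, a minimal self-contained sketch (made-up data; the cutoffs of 5 and 10 follow the rules of thumb above) computes each VIF and its tolerance and flags the problematic predictors:

```python
# Steps one to three for every predictor (illustrative, made-up data).
import numpy as np

rng = np.random.default_rng(2)
n = 200
W = rng.normal(size=(n, 4))                                # columns X_1, ..., X_4
W[:, 3] = W[:, 0] + W[:, 1] + 0.2 * rng.normal(size=n)     # X_4 nearly collinear with X_1, X_2

for i in range(W.shape[1]):
    xi = W[:, i]
    Z = np.column_stack([np.ones(n), np.delete(W, i, axis=1)])  # constant + other predictors
    coef = np.linalg.lstsq(Z, xi, rcond=None)[0]                 # step one: auxiliary OLS fit
    rss = np.sum((xi - Z @ coef) ** 2)
    r2_i = 1.0 - rss / np.sum((xi - xi.mean()) ** 2)
    vif_i = 1.0 / (1.0 - r2_i)                                   # step two
    tol_i = 1.0 / vif_i                                          # tolerance = 1 / VIF
    flag = "high" if vif_i > 10 else ("moderate" if vif_i > 5 else "low")  # step three
    print(f"X_{i + 1}: R^2 = {r2_i:.3f}, VIF = {vif_i:.2f}, tolerance = {tol_i:.3f} ({flag})")
```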

Interpretation

The square root of the variance inflation factor indicates how much larger the standard error is, compared with what it would be if that variable were uncorrelated with the other predictor variables in the model.

Example

If the variance inflation factor of a predictor variable were 5.27 (√5.27 ≈ 2.3), this means that the standard error for the coefficient of that predictor variable is 2.3 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.

Implementation
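
One readily available implementation is the variance_inflation_factor function in the Python statsmodels package. A minimal usage sketch (the data and column names are illustrative):

```python
# Illustrative use of statsmodels' variance_inflation_factor.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)          # strongly related to x1
x3 = rng.normal(size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

exog = sm.add_constant(X)                   # design matrix including the intercept
vifs = {col: variance_inflation_factor(exog.values, i)
        for i, col in enumerate(exog.columns)}
print(vifs)  # the VIF reported for 'const' is usually ignored; x1 and x2 will be large
```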

Related Research Articles

In statistics, the Gauss–Markov theorem states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero. The errors do not need to be normal for the theorem to apply, nor do they need to be independent and identically distributed.

In statistics, a studentized residual is the dimensionless ratio resulting from the division of a residual by an estimate of its standard deviation, both expressed in the same units. It is a form of a Student's t-statistic, with the estimate of error varying between points.

<span class="mw-page-title-main">Coefficient of determination</span> Indicator for how well data points fit a line or curve

In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. Intuitively, IVs are used when an explanatory variable of interest is correlated with the error term (endogenous), in which case ordinary least squares and ANOVA give biased results. A valid instrument induces changes in the explanatory variable but has no independent effect on the dependent variable and is not correlated with the error term, allowing a researcher to uncover the causal effect of the explanatory variable on the dependent variable.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the input dataset and the output of the (linear) function of the independent variable.

<span class="mw-page-title-main">Regression dilution</span> Statistical bias in linear regressions

Regression dilution, also known as regression attenuation, is the biasing of the linear regression slope towards zero, caused by errors in the independent variable.

In statistics, the residual sum of squares (RSS), also known as the sum of squared residuals (SSR) or the sum of squared estimate of errors (SSE), is the sum of the squares of residuals. It is a measure of the discrepancy between the data and an estimation model, such as a linear regression. A small RSS indicates a tight fit of the model to the data. It is used as an optimality criterion in parameter selection and model selection.

<span class="mw-page-title-main">Simple linear regression</span> Linear regression model with a single explanatory variable

In statistics, simple linear regression (SLR) is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable and finds a linear function that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.

In statistics, Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modeled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.

In statistics, generalized least squares (GLS) is a method used to estimate the unknown parameters in a linear regression model. It is used when there is a non-zero amount of correlation between the residuals in the regression model. GLS is employed to improve statistical efficiency and reduce the risk of drawing erroneous inferences, as compared to conventional least squares and weighted least squares methods. It was first described by Alexander Aitken in 1935.

In statistics, the Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation at lag 1 in the residuals from a regression analysis. It is named after James Durbin and Geoffrey Watson. The small sample distribution of this ratio was derived by John von Neumann. Durbin and Watson applied this statistic to the residuals from least squares regressions, and developed bounds tests for the null hypothesis that the errors are serially uncorrelated against the alternative that they follow a first order autoregressive process. Note that the distribution of this test statistic does not depend on the estimated regression coefficients and the variance of the errors.

In statistics, the fraction of variance unexplained (FVU) in the context of a regression task is the fraction of variance of the regressand Y which cannot be explained, i.e., which is not correctly predicted, by the explanatory variables X.

In regression, mean response and predicted response, also known as mean outcome and predicted outcome, are values of the dependent variable calculated from the regression parameters and a given value of the independent variable. The values of these two responses are the same, but their calculated variances are different. The concept is a generalization of the distinction between the standard error of the mean and the sample standard deviation.

In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.

In statistical theory, the field of high-dimensional statistics studies data whose dimension is larger than typically considered in classical multivariate analysis. The area arose owing to the emergence of many modern data sets in which the dimension of the data vectors may be comparable to, or even larger than, the sample size, so that justification for the use of traditional techniques, often based on asymptotic arguments with the dimension held fixed as the sample size increased, was lacking.

In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables. That is, high-leverage points have no neighboring points in space, where is the number of independent variables in a regression model. This makes the fitted model likely to pass close to a high leverage observation. Hence high-leverage points have the potential to cause large changes in the parameter estimates when they are deleted i.e., to be influential points. Although an influential point will typically have high leverage, a high leverage point is not necessarily an influential point. The leverage is typically defined as the diagonal elements of the hat matrix.

<span class="mw-page-title-main">Errors-in-variables models</span> Regression models accounting for possible errors in independent variables

In statistics, errors-in-variables models or measurement error models are regression models that account for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.

The generalized functional linear model (GFLM) is an extension of the generalized linear model (GLM) that allows one to regress univariate responses of various types on functional predictors, which are mostly random trajectories generated by a square-integrable stochastic processes. Similarly to GLM, a link function relates the expected value of the response variable to a linear predictor, which in case of GFLM is obtained by forming the scalar product of the random predictor function with a smooth parameter function . Functional Linear Regression, Functional Poisson Regression and Functional Binomial Regression, with the important Functional Logistic Regression included, are special cases of GFLM. Applications of GFLM include classification and discrimination of stochastic processes and functional data.

In statistics, linear regression is a statistical model which estimates the linear relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable. If the explanatory variables are measured with error then errors-in-variables models are required, also known as measurement error models.

Functional regression is a version of regression analysis when responses or covariates include functional data. Functional regression models can be classified into four types depending on whether the responses or covariates are functional or scalar: (i) scalar responses with functional covariates, (ii) functional responses with scalar covariates, (iii) functional responses with functional covariates, and (iv) scalar or functional responses with functional and scalar covariates. In addition, functional regression models can be linear, partially linear, or nonlinear. In particular, functional polynomial models, functional single and multiple index models and functional additive models are three special cases of functional nonlinear models.

References

  1. James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert (2017). An Introduction to Statistical Learning (8th ed.). Springer Science+Business Media New York. ISBN 978-1-4614-7138-7.
  2. Snee, Ron (1981). Origins of the Variance Inflation Factor as Recalled by Cuthbert Daniel (Technical report). Snee Associates.
  3. Rawlings, John O.; Pantula, Sastry G.; Dickey, David A. (1998). Applied Regression Analysis: A Research Tool (2nd ed.). New York: Springer. pp. 372–373. ISBN 0387227539. OCLC 54851769.
  4. Faraway, Julian J. (2002). Practical Regression and Anova using R (PDF). pp. 117–118.
  5. Kutner, M. H.; Nachtsheim, C. J.; Neter, J. (2004). Applied Linear Regression Models (4th ed.). McGraw-Hill Irwin.
  6. Sheather, Simon (2009). A Modern Approach to Regression with R. New York, NY: Springer. ISBN 978-0-387-09607-0.
