Leverage (statistics)

In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables. That is, high-leverage points have no neighboring points in $\mathbb{R}^{p}$ space, where $p$ is the number of independent variables in a regression model. This makes the fitted model likely to pass close to a high-leverage observation. [1] Hence high-leverage points have the potential to cause large changes in the parameter estimates when they are deleted, i.e., to be influential points. Although an influential point will typically have high leverage, a high-leverage point is not necessarily an influential point. The leverage is typically defined as the diagonal elements of the hat matrix.

Definition and interpretations

Consider the linear regression model $y_i = \boldsymbol{x}_i^{\top} \boldsymbol{\beta} + \varepsilon_i$, $i = 1, \ldots, n$. That is, $\boldsymbol{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{X}$ is the $n \times p$ design matrix whose rows correspond to the observations and whose columns correspond to the independent or explanatory variables. The leverage score for the $i$-th independent observation $\boldsymbol{x}_i$ is given as:

$h_{ii} = \left[\mathbf{H}\right]_{ii} = \boldsymbol{x}_i^{\top} \left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \boldsymbol{x}_i$,

the $i$-th diagonal element of the ortho-projection matrix (a.k.a. hat matrix) $\mathbf{H} = \mathbf{X} \left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top}$.

Thus the $i$-th leverage score can be viewed as the 'weighted' distance between $\boldsymbol{x}_i$ and the mean of the $\boldsymbol{x}_i$'s (see its relation with Mahalanobis distance). It can also be interpreted as the degree by which the $i$-th measured (dependent) value (i.e., $y_i$) influences the $i$-th fitted (predicted) value (i.e., $\hat{y}_i$): mathematically,

$h_{ii} = \dfrac{\partial \hat{y}_i}{\partial y_i}$.

Hence, the leverage score is also known as the observation self-sensitivity or self-influence. [2] Using the fact that $\hat{\boldsymbol{y}} = \mathbf{H} \boldsymbol{y}$ (i.e., the prediction is the ortho-projection of $\boldsymbol{y}$ onto the range space of $\mathbf{X}$) in the above expression, we get $h_{ii} = \left[\mathbf{H}\right]_{ii}$. Note that the leverage depends on the values of the explanatory variables $\mathbf{X}$ of all observations but not on any of the values of the dependent variables $y_i$.
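
As a concrete illustration of the definition, the following sketch computes the hat-matrix diagonal for a small synthetic design matrix. The data, the helper name leverage_scores, and the use of NumPy are illustrative assumptions, not part of the original article.

```python
import numpy as np

def leverage_scores(X):
    """Return the diagonal of the hat matrix H = X (X'X)^{-1} X'."""
    # Solve (X'X) G = X' instead of forming an explicit inverse.
    G = np.linalg.solve(X.T @ X, X.T)     # (X'X)^{-1} X'
    return np.einsum('ij,ji->i', X, G)    # diag(X (X'X)^{-1} X')

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # includes an intercept column
h = leverage_scores(X)
print(h.min(), h.max(), h.sum())  # each h_ii lies in [0, 1]; the sum equals p
```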

Properties

  1. The leverage $h_{ii}$ is a number between 0 and 1, $0 \leq h_{ii} \leq 1$. Proof: Note that $\mathbf{H}$ is an idempotent matrix ($\mathbf{H}^2 = \mathbf{H}$) and symmetric ($h_{ij} = h_{ji}$). Thus, by using the fact that $\left[\mathbf{H}^2\right]_{ii} = \left[\mathbf{H}\right]_{ii}$, we have $h_{ii} = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2$. Since $\sum_{j \neq i} h_{ij}^2 \geq 0$, we have $h_{ii} \geq h_{ii}^2$, which implies $0 \leq h_{ii} \leq 1$.
  2. The sum of leverages is equal to the number of parameters $p$ in $\boldsymbol{\beta}$ (including the intercept). Proof: $\sum_{i=1}^{n} h_{ii} = \operatorname{tr}(\mathbf{H}) = \operatorname{tr}\left(\mathbf{X} \left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top}\right) = \operatorname{tr}\left(\left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top} \mathbf{X}\right) = \operatorname{tr}(\mathbf{I}_p) = p$.
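
Both properties are easy to verify numerically. The sketch below uses an assumed random design matrix (not part of the original article) purely as a sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # n = 50, p = 3
H = X @ np.linalg.solve(X.T @ X, X.T)                         # hat matrix

assert np.allclose(H @ H, H)          # idempotent: H^2 = H
assert np.allclose(H, H.T)            # symmetric
h = np.diag(H)
assert np.all((h >= 0) & (h <= 1))    # property 1: 0 <= h_ii <= 1
assert np.isclose(h.sum(), 3)         # property 2: sum of leverages = p
```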

Determination of outliers in X using leverages

Large leverage $h_{ii}$ corresponds to an $\boldsymbol{x}_i$ that is extreme. A common rule is to identify observations whose leverage value is more than 2 times larger than the mean leverage $\bar{h} = p/n$ (see property 2 above). That is, if $h_{ii} > 2p/n$, the observation $\boldsymbol{x}_i$ shall be considered an outlier. Some statisticians prefer the threshold of $3p/n$ instead of $2p/n$.
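
A rule of this form can be applied directly to the leverage scores. The following sketch plants one extreme covariate value in assumed synthetic data (not from the original article) and flags observations exceeding the $2p/n$ threshold:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
X[0, 1:] = 8.0                                                # plant one extreme x-value

h = np.einsum('ij,ji->i', X, np.linalg.solve(X.T @ X, X.T))   # leverage scores
threshold = 2 * p / n                                         # twice the mean leverage p/n
print(np.flatnonzero(h > threshold))                          # indices of high-leverage points
```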

Relation to Mahalanobis distance

Leverage is closely related to the Mahalanobis distance (proof [3]). Specifically, for some $n \times p$ matrix $\mathbf{X}$, the squared Mahalanobis distance of $\boldsymbol{x}_i$ (where $\boldsymbol{x}_i^{\top}$ is the $i$-th row of $\mathbf{X}$) from the vector of means $\widehat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{i=1}^{n} \boldsymbol{x}_i$ of length $p$ is $D^2(\boldsymbol{x}_i) = (\boldsymbol{x}_i - \widehat{\boldsymbol{\mu}})^{\top} \mathbf{S}^{-1} (\boldsymbol{x}_i - \widehat{\boldsymbol{\mu}})$, where $\mathbf{S}$ is the estimated covariance matrix of the $\boldsymbol{x}_i$'s. This is related to the leverage $h_{ii}$ of the hat matrix of $\mathbf{X}$ after appending a column vector of 1's to it. The relationship between the two is:

$D^2(\boldsymbol{x}_i) = (n - 1)\left(h_{ii} - \tfrac{1}{n}\right)$
This relationship enables us to decompose leverage into meaningful components so that some sources of high leverage can be investigated analytically. [4]
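
The identity can be checked numerically. The sketch below assumes a random predictor matrix (an illustrative assumption, not part of the original article) and compares the two quantities directly:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 4
Z = rng.normal(size=(n, p))                      # raw predictors, without an intercept column

# Leverage from the design matrix with an appended column of 1's.
X = np.column_stack([np.ones(n), Z])
h = np.einsum('ij,ji->i', X, np.linalg.solve(X.T @ X, X.T))

# Squared Mahalanobis distance of each row of Z from the column means.
mu = Z.mean(axis=0)
S = np.cov(Z, rowvar=False)                      # covariance estimate with (n-1) denominator
d2 = np.einsum('ij,ji->i', Z - mu, np.linalg.solve(S, (Z - mu).T))

assert np.allclose(d2, (n - 1) * (h - 1.0 / n))  # D^2 = (n-1)(h_ii - 1/n)
```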

Relation to influence functions

In a regression context, we combine leverage and influence functions to compute the degree to which estimated coefficients would change if we removed a single data point. Denoting the regression residuals as $e_i = y_i - \boldsymbol{x}_i^{\top} \widehat{\boldsymbol{\beta}}$, one can compare the estimated coefficient $\widehat{\boldsymbol{\beta}}$ to the leave-one-out estimated coefficient $\widehat{\boldsymbol{\beta}}^{(-i)}$ using the formula [5] [6]

$\widehat{\boldsymbol{\beta}} - \widehat{\boldsymbol{\beta}}^{(-i)} = \dfrac{\left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \boldsymbol{x}_i e_i}{1 - h_{ii}}$

Young (2019) uses a version of this formula after residualizing controls. [7] To gain intuition for this formula, note that $\left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \boldsymbol{x}_i$ captures the potential for an observation to affect the regression parameters, and therefore $\left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \boldsymbol{x}_i e_i$ captures the actual influence of that observation's deviation from its fitted value on the regression parameters. The formula then divides by $1 - h_{ii}$ to account for the fact that we remove the observation rather than adjusting its value, reflecting the fact that removal changes the distribution of covariates more when applied to high-leverage observations (i.e., those with outlier covariate values). Similar formulas arise when applying general formulas for statistical influence functions in the regression context. [8] [9]
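
The leave-one-out formula is straightforward to verify numerically. The sketch below uses assumed synthetic data (not from the original article) and compares the formula against an explicit refit without observation $i$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                       # full-sample OLS estimate
e = y - X @ beta                               # residuals
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)    # leverage scores

i = 0                                          # observation to remove
mask = np.arange(n) != i
beta_loo = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]  # refit without observation i

# beta - beta^{(-i)} = (X'X)^{-1} x_i e_i / (1 - h_ii)
assert np.allclose(beta - beta_loo, XtX_inv @ X[i] * e[i] / (1 - h[i]))
```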

Effect on residual variance

If we are in an ordinary least squares setting with fixed $\mathbf{X}$ and homoscedastic regression errors $\varepsilon_i$, so that $\boldsymbol{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\operatorname{Var}(\varepsilon_i) = \sigma^2$, then the $i$-th regression residual, $e_i = y_i - \hat{y}_i$, has variance

$\operatorname{Var}(e_i) = (1 - h_{ii})\,\sigma^2$.

In other words, an observation's leverage score determines the degree of noise in the model's misprediction of that observation, with higher leverage leading to less noise. This follows from the fact that $\mathbf{I} - \mathbf{H}$ is idempotent and symmetric and $\hat{\boldsymbol{y}} = \mathbf{H} \boldsymbol{y}$, hence, $\operatorname{Var}(\boldsymbol{e}) = \operatorname{Var}\left((\mathbf{I} - \mathbf{H}) \boldsymbol{y}\right) = (\mathbf{I} - \mathbf{H}) \operatorname{Var}(\boldsymbol{y}) (\mathbf{I} - \mathbf{H})^{\top} = \sigma^2 (\mathbf{I} - \mathbf{H})^2 = \sigma^2 (\mathbf{I} - \mathbf{H})$.

The corresponding studentized residual, the residual adjusted for its observation-specific estimated residual variance, is then

$t_i = \dfrac{e_i}{\widehat{\sigma} \sqrt{1 - h_{ii}}}$

where $\widehat{\sigma}$ is an appropriate estimate of $\sigma$.
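
The sketch below computes internally studentized residuals from these quantities, using assumed synthetic data and the common estimate $\widehat{\sigma}^2 = \sum_i e_i^2 / (n - p)$ (both are illustrative choices, not from the original article):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta                                              # residuals
h = np.einsum('ij,ji->i', X, np.linalg.solve(X.T @ X, X.T))   # leverage scores
sigma_hat = np.sqrt(e @ e / (n - p))                          # residual standard error

t = e / (sigma_hat * np.sqrt(1 - h))                          # studentized residuals
print(t[:5])
```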

Partial leverage

Partial leverage (PL) is a measure of the contribution of the individual independent variables to the total leverage of each observation. That is, PL is a measure of how $h_{ii}$ changes as a variable is added to the regression model. It is computed as:

$\left(\mathrm{PL}_j\right)_i = \dfrac{\left(\boldsymbol{X}_{j \bullet [j]}\right)_i^2}{\sum_{k=1}^{n} \left(\boldsymbol{X}_{j \bullet [j]}\right)_k^2}$

where $j$ is the index of the independent variable, $i$ is the index of the observation, and $\boldsymbol{X}_{j \bullet [j]}$ are the residuals from regressing $\boldsymbol{X}_j$ against the remaining independent variables. Note that the partial leverage is the leverage of the $i$-th point in the partial regression plot for the $j$-th variable. Data points with large partial leverage for an independent variable can exert undue influence on the selection of that variable in automatic regression model-building procedures.
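
A minimal sketch of this computation, with an assumed random design matrix and a hypothetical helper name partial_leverage (neither is part of the original article):

```python
import numpy as np

def partial_leverage(X, j):
    """Partial leverage of each observation for column j of the design matrix X."""
    others = np.delete(X, j, axis=1)
    # Residuals from regressing X[:, j] on the remaining independent variables.
    coef = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
    r = X[:, j] - others @ coef
    return r**2 / np.sum(r**2)

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
print(partial_leverage(X, j=2))   # partial leverages for one variable sum to 1
```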

Software implementations

Many programs and statistics packages, such as R, Python, etc., include implementations of leverage.

| Language/Program | Function | Notes |
| --- | --- | --- |
| R | `hat(x, intercept = TRUE)` or `hatvalues(model, ...)` | See |
| Python | `(x * np.linalg.pinv(x).T).sum(-1)` | See |
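
For example, the Python expression from the table can be applied directly to a design matrix; the sketch below uses an assumed random matrix purely for illustration:

```python
import numpy as np

x = np.column_stack([np.ones(20), np.random.default_rng(6).normal(size=(20, 2))])
h = (x * np.linalg.pinv(x).T).sum(-1)   # diagonal of the hat matrix
print(h)
```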

See also

Cook's distance
Mahalanobis distance
Projection matrix
Studentized residual

References

  1. Everitt, B. S. (2002). Cambridge Dictionary of Statistics. Cambridge University Press. ISBN 0-521-81099-X.
  2. Cardinali, C. (June 2013). "Data Assimilation: Observation influence diagnostic of a data assimilation system" (PDF).
  3. "Prove the relation between Mahalanobis distance and Leverage?".
  4. Kim, M. G. (2004). "Sources of high leverage in linear regression model". Journal of Applied Mathematics and Computing. 16: 509–513. arXiv:2006.04024 [math.ST].
  5. Miller, Rupert G. (September 1974). "An Unbalanced Jackknife". Annals of Statistics. 2 (5): 880–891. doi:10.1214/aos/1176342811. ISSN 0090-5364.
  6. Hayashi, Fumio (2000). Econometrics. Princeton University Press. p. 21.
  7. Young, Alwyn (2019). "Channeling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results". The Quarterly Journal of Economics. 134 (2): 567. doi:10.1093/qje/qjy029.
  8. Chatterjee, Samprit; Hadi, Ali S. (August 1986). "Influential Observations, High Leverage Points, and Outliers in Linear Regression". Statistical Science. 1 (3): 379–393. doi:10.1214/ss/1177013622. ISSN 0883-4237.
  9. "regression - Influence functions and OLS". Cross Validated. Retrieved 2020-12-06.