Cook's distance

In statistics, Cook's distance or Cook's D is a commonly used estimate of the influence of a data point when performing a least-squares regression analysis. [1] In a practical ordinary least squares analysis, Cook's distance can be used in several ways: to indicate influential data points that are particularly worth checking for validity; or to indicate regions of the design space where it would be good to be able to obtain more data points. It is named after the American statistician R. Dennis Cook, who introduced the concept in 1977. [2] [3]


Definition

Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression. Cook's distance measures the effect of deleting a given observation. Points with a large Cook's distance are considered to merit closer examination in the analysis.

For the algebraic expression, first define

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where $\boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^{2}\mathbf{I})$ is the error term, $\boldsymbol{\beta} = [\beta_{0}\; \beta_{1} \dots \beta_{p-1}]^{\mathsf{T}}$ is the coefficient vector, $p$ is the number of covariates or predictors for each observation, and $\mathbf{X}$ is the design matrix including a constant. The least squares estimator then is $\mathbf{b} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}$, and consequently the fitted (predicted) values for the mean of $\mathbf{y}$ are

$$\mathbf{\hat{y}} = \mathbf{X}\mathbf{b} = \mathbf{X}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y} = \mathbf{H}\mathbf{y}$$

where $\mathbf{H} \equiv \mathbf{X}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}$ is the projection matrix (or hat matrix). The $i$-th diagonal element of $\mathbf{H}$, given by $h_{ii} \equiv \mathbf{x}_i^{\mathsf{T}}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{x}_i$, [4] is known as the leverage of the $i$-th observation. Similarly, the $i$-th element of the residual vector $\mathbf{e} = \mathbf{y} - \mathbf{\hat{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y}$ is denoted by $e_i$.
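As a small illustration, the following NumPy sketch on synthetic data (all names and values are illustrative, not from the article) constructs the design matrix, the least squares fit, the hat matrix, the leverages and the residuals defined above:

```python
# Minimal sketch with synthetic data (all variable names are illustrative):
# design matrix, least squares fit, hat matrix, leverages and residuals.
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])        # design matrix including a constant
b = np.linalg.solve(X.T @ X, X.T @ y)       # least squares estimator b
H = X @ np.linalg.solve(X.T @ X, X.T)       # projection (hat) matrix H
h = np.diag(H)                              # leverages h_ii
e = y - X @ b                               # residuals e_i
```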

Cook's distance $D_i$ of observation $i$ (for $i = 1, \dots, n$) is defined as the sum of all the changes in the regression model when observation $i$ is removed from it [5]

$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\, s^2}$$

where $p$ is the rank of the model, $\hat{y}_{j(i)}$ is the fitted response value obtained when excluding observation $i$, and $s^2 = \frac{\mathbf{e}^{\mathsf{T}}\mathbf{e}}{n - p}$ is the mean squared error of the regression model. [6]

Equivalently, it can be expressed using the leverage [5] ($h_{ii}$):

$$D_i = \frac{e_i^2}{p\, s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$
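A minimal sketch, again on synthetic data, computes Cook's distance both from the leave-one-out definition and from the closed-form leverage expression above; the two agree up to floating-point error:

```python
# Sketch on synthetic data: Cook's distance from the leave-one-out definition
# versus the closed-form leverage expression; the two coincide.
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]                              # rank of the model

b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b
e = y - yhat
s2 = e @ e / (n - p)                        # mean squared error
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

# Closed form: D_i = e_i^2 / (p s^2) * h_ii / (1 - h_ii)^2
D_closed = e**2 / (p * s2) * h / (1.0 - h)**2

# Definition: refit without observation i, sum the squared changes in all fits
D_loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    D_loo[i] = np.sum((yhat - X @ b_i)**2) / (p * s2)

assert np.allclose(D_closed, D_loo)
```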

Detecting highly influential observations

There are different opinions regarding what cut-off values to use for spotting highly influential points. Since Cook's distance is in the metric of an F distribution with $p$ and $n - p$ (as defined for the design matrix $\mathbf{X}$ above) degrees of freedom, the median point (i.e., $F_{0.5}(p,\, n - p)$) can be used as a cut-off. [7] Since this value is close to 1 for large $n$, a simple operational guideline of $D_i > 1$ has been suggested. [8]
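A short sketch of both cut-off rules on synthetic data containing one deliberately influential observation (the F quantile comes from SciPy; the data and thresholds are illustrative only):

```python
# Sketch of the two cut-off rules on synthetic data containing one deliberately
# influential observation (high leverage and shifted response).
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
X[0, 1:] = 5.0                              # give observation 0 high leverage
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
y[0] += 10.0                                # and pull it away from the true plane

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
s2 = e @ e / (n - p)
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
D = e**2 / (p * s2) * h / (1 - h)**2

cutoff = f.ppf(0.5, p, n - p)               # median of an F(p, n-p) distribution
print(np.where(D > cutoff)[0])              # flagged by the F-median rule
print(np.where(D > 1.0)[0])                 # flagged by the simpler D_i > 1 rule
```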

The $p$-dimensional random vector $\mathbf{b} - \mathbf{b}_{(i)}$, which is the change of $\mathbf{b}$ due to a deletion of the $i$-th case, has a covariance matrix of rank one and therefore it is distributed entirely over a one-dimensional subspace (a line) of the $p$-dimensional space. However, in the introduction of Cook's distance, a scaling matrix of full rank $p$ is chosen and as a result $\mathbf{b} - \mathbf{b}_{(i)}$ is treated as if it were a random vector distributed over the whole space of $p$ dimensions. Hence the Cook's distance measure is likely to distort the real influence of observations, misleading the identification of influential observations. [9] [10]

Relationship to other influence measures (and interpretation)

$D_i$ can be expressed using the leverage [5] ($0 \le h_{ii} \le 1$) and the square of the internally Studentized residual ($0 \le t_i^2$), as follows:

$$D_i = \frac{e_i^2}{p\, s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2} = \frac{1}{p} \cdot t_i^2 \cdot \frac{h_{ii}}{1 - h_{ii}}$$

The benefit of the last formulation is that it clearly shows the relationship between $t_i^2$ and $h_{ii}$ to $D_i$ (while $p$ and $n$ are the same for all observations). If $t_i^2$ is large then it (for non-extreme values of $h_{ii}$) will increase $D_i$. If $h_{ii}$ is close to 0 then $D_i$ will be small, while if $h_{ii}$ is close to 1 then $D_i$ will become very large (as long as $t_i^2 > 0$, i.e., the observation $i$ is not exactly on the regression line that was fitted without observation $i$).
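A brief numerical check, on synthetic data, of the identity above relating Cook's distance to the internally studentized residual and the leverage:

```python
# Numerical check on synthetic data of D_i = t_i^2 / p * h_ii / (1 - h_ii),
# where t_i is the internally studentized residual.
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
s2 = e @ e / (n - p)
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

t = e / np.sqrt(s2 * (1 - h))               # internally studentized residuals
D_from_t = t**2 / p * h / (1 - h)
D_direct = e**2 / (p * s2) * h / (1 - h)**2
assert np.allclose(D_from_t, D_direct)
```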

$D_i$ is related to DFFITS through the following relationship (note that $\hat{\sigma}\, t_i = \hat{\sigma}_{(i)}\, t_{i(i)}$, where $t_{i(i)}$ is the externally studentized residual and $\hat{\sigma}$, $\hat{\sigma}_{(i)}$ denote the residual standard deviations of the fits with and without observation $i$, with $\hat{\sigma}^2 = s^2$):

$$D_i = \frac{1}{p}\, t_i^2\, \frac{h_{ii}}{1 - h_{ii}} = \frac{1}{p}\, \frac{\hat{\sigma}_{(i)}^2}{\hat{\sigma}^2}\, t_{i(i)}^2\, \frac{h_{ii}}{1 - h_{ii}} = \frac{1}{p}\, \frac{\hat{\sigma}_{(i)}^2}{\hat{\sigma}^2}\, \text{DFFITS}_i^2$$
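A similar check of the DFFITS relationship on synthetic data, using the standard leave-one-out residual variance:

```python
# Check of the DFFITS relationship on synthetic data. sigma2_i is the usual
# leave-one-out residual variance and t_ext the externally studentized residual.
import numpy as np

rng = np.random.default_rng(3)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -2.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
s2 = e @ e / (n - p)
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

sigma2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)   # sigma_hat_(i)^2
t_ext = e / np.sqrt(sigma2_i * (1 - h))                    # t_i(i)
dffits = t_ext * np.sqrt(h / (1 - h))

D = e**2 / (p * s2) * h / (1 - h)**2
assert np.allclose(D, sigma2_i / s2 * dffits**2 / p)
```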

$D_i$ can be interpreted as the distance one's estimates move within the confidence ellipsoid that represents a region of plausible values for the parameters. This is shown by an alternative but equivalent representation of Cook's distance in terms of changes to the estimates of the regression parameters between the cases where the particular observation is either included or excluded from the regression analysis:

$$D_i = \frac{\left(\mathbf{b} - \mathbf{b}_{(i)}\right)^{\mathsf{T}}\, \mathbf{X}^{\mathsf{T}}\mathbf{X}\, \left(\mathbf{b} - \mathbf{b}_{(i)}\right)}{p\, s^2}$$

where $\mathbf{b}_{(i)}$ is the least squares estimate obtained with observation $i$ excluded.
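The quadratic-form representation can be checked directly by refitting without each observation; the following sketch (synthetic data) compares it with the closed-form expression:

```python
# Sketch on synthetic data: the quadratic form in b - b_(i) reproduces the
# closed-form Cook's distance.
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
s2 = e @ e / (n - p)
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
D_closed = e**2 / (p * s2) * h / (1 - h)**2

XtX = X.T @ X
D_param = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    d = b - b_i
    D_param[i] = d @ XtX @ d / (p * s2)     # (b - b_(i))' X'X (b - b_(i)) / (p s^2)

assert np.allclose(D_closed, D_param)
```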

An alternative to $D_i$ has been proposed. Instead of considering the influence a single observation has on the overall model, the statistic $S_i$ serves as a measure of how sensitive the prediction of the $i$-th observation is to the deletion of each observation in the original data set. It can be formulated as a weighted linear combination of the $D_j$'s of all data points. Again, the projection matrix is involved in the calculation to obtain the required weights:

$$S_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_i - \hat{y}_{i(j)}\right)^2}{p\, s^2\, h_{ii}} = \sum_{j=1}^{n} \frac{h_{ij}^2}{h_{ii}\, h_{jj}}\, D_j = \sum_{j=1}^{n} \rho_{ij}^2\, D_j$$

In this context, $\rho_{ij}$ ($\le 1$) resembles the correlation between the predictions $\hat{y}_i$ and $\hat{y}_j$. [note 1]
In contrast to $D_i$, the distribution of $S_i$ is asymptotically normal for large sample sizes and models with many predictors. In the absence of outliers the expected value of $S_i$ is approximately $p^{-1}$. An influential observation can be identified if

$$\left| S_i - \operatorname{med}(S) \right| \ge 4.5 \cdot \operatorname{MAD}(S)$$

with $\operatorname{med}(S)$ as the median and $\operatorname{MAD}(S)$ as the median absolute deviation of all $S$-values within the original data set, i.e., a robust measure of location and a robust measure of scale for the distribution of $S_i$. The factor 4.5 covers approximately 3 standard deviations of $S$ around its centre.
When compared to Cook's distance, $S_i$ was found to perform well for high- and intermediate-leverage outliers, even in the presence of masking effects for which $D_i$ failed. [12]
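A sketch on synthetic data with one planted high-leverage outlier: $S_i$ computed as the weighted combination of the $D_j$ described above (with weights $\rho_{ij}^2 = h_{ij}^2 / (h_{ii} h_{jj})$, following the formula as reconstructed here), together with the median/MAD flagging rule:

```python
# Sketch on synthetic data with one planted high-leverage outlier: S_i as the
# weighted combination of the D_j (weights rho_ij^2 = h_ij^2 / (h_ii h_jj)),
# followed by the median/MAD flagging rule.
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
X[0, 1:] = 5.0                              # high-leverage observation
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
y[0] += 10.0                                # that is also an outlier in y

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
s2 = e @ e / (n - p)
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
D = e**2 / (p * s2) * h / (1 - h)**2

rho2 = H**2 / np.outer(h, h)                # squared prediction correlations
S = rho2 @ D                                # S_i = sum_j rho_ij^2 D_j

med = np.median(S)
mad = np.median(np.abs(S - med))            # median absolute deviation
print(np.where(np.abs(S - med) >= 4.5 * mad)[0])
```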
Interestingly, $D_i$ and $S_i$ are closely related because they can both be expressed in terms of the matrix $\mathbf{T}$, which holds the effects of the deletion of the $j$-th data point on the $i$-th prediction:

$$\mathbf{T} = \left[\mathbf{\hat{y}} - \mathbf{\hat{y}}_{(1)},\; \mathbf{\hat{y}} - \mathbf{\hat{y}}_{(2)},\; \dots,\; \mathbf{\hat{y}} - \mathbf{\hat{y}}_{(n)}\right] = \mathbf{H}\, \mathbf{E}\, \left[\operatorname{diag}(\mathbf{I} - \mathbf{H})\right]^{-1}, \qquad \mathbf{E} = \operatorname{diag}(e_1, \dots, e_n)$$

With $\mathbf{T}$ at hand, the vector $\mathbf{D} = (D_1, \dots, D_n)^{\mathsf{T}}$ is given by:

$$\mathbf{D} = \frac{1}{p\, s^2}\, \operatorname{diag}\!\left(\mathbf{T}^{\mathsf{T}}\mathbf{T}\right) = \frac{1}{p\, s^2}\, \operatorname{diag}\!\left(\left[\operatorname{diag}(\mathbf{I} - \mathbf{H})\right]^{-1} \mathbf{E}\, \mathbf{H}^{\mathsf{T}} \mathbf{H}\, \mathbf{E}\, \left[\operatorname{diag}(\mathbf{I} - \mathbf{H})\right]^{-1}\right)$$

where $\mathbf{H}^{\mathsf{T}}\mathbf{H} = \mathbf{H}$ if $\mathbf{H}$ is symmetric and idempotent, which is not necessarily the case. In contrast, $\mathbf{S} = (S_1, \dots, S_n)^{\mathsf{T}}$ can be calculated as:

$$\mathbf{S} = \frac{1}{p\, s^2}\, \left[\operatorname{diag}(\mathbf{H})\right]^{-1} \operatorname{diag}\!\left(\mathbf{T}\,\mathbf{T}^{\mathsf{T}}\right) = \frac{1}{p\, s^2}\, \left[\operatorname{diag}(\mathbf{H})\right]^{-1} \operatorname{diag}\!\left(\mathbf{H}\, \mathbf{E}\, \left[\operatorname{diag}(\mathbf{I} - \mathbf{H})\right]^{-2} \mathbf{E}\, \mathbf{H}^{\mathsf{T}}\right)$$

where $\operatorname{diag}(\mathbf{A})$ extracts the main diagonal of a square matrix $\mathbf{A}$. In this context, $\mathbf{T}^{\mathsf{T}}\mathbf{T}$ is referred to as the influence matrix whereas $\mathbf{T}\,\mathbf{T}^{\mathsf{T}}$ resembles the so-called sensitivity matrix. An eigenvector analysis of $\mathbf{T}^{\mathsf{T}}\mathbf{T}$ and $\mathbf{T}\,\mathbf{T}^{\mathsf{T}}$, which both share the same eigenvalues, serves as a tool in outlier detection, although the eigenvectors of the sensitivity matrix are more powerful. [13]
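The following sketch builds $\mathbf{T}$ for synthetic data and recovers both $\mathbf{D}$ and $\mathbf{S}$ from its two Gram matrices, matching the expressions above:

```python
# Sketch on synthetic data: build T = H E [diag(I - H)]^{-1} and recover the
# vectors D and S from its two Gram matrices, as in the expressions above.
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
s2 = e @ e / (n - p)
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

T = H @ np.diag(e) @ np.diag(1.0 / (1.0 - h))   # T_ij = yhat_i - yhat_i(j)

D_from_T = np.diag(T.T @ T) / (p * s2)
S_from_T = np.diag(T @ T.T) / (p * s2 * h)

assert np.allclose(D_from_T, e**2 / (p * s2) * h / (1 - h)**2)
```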

Software implementations

Many programs and statistics packages, such as R, Python, Julia, and Stata, include implementations of Cook's distance.

Language/Program | Function
Stata | predict, cooksd
R | cooks.distance(model, ...)
Python | CooksDistance().fit(X, y)
Julia | cooksdistance(model, ...)
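As an example in Python, Cook's distances of a fitted model can also be obtained through statsmodels' influence diagnostics (a different API from the CooksDistance().fit(X, y) call listed in the table; the data below are synthetic and purely illustrative):

```python
# Illustrative use of statsmodels' influence diagnostics on synthetic data;
# this is a different API from the CooksDistance().fit(X, y) call in the table.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.normal(size=40)
y = 1.0 + 2.0 * x + rng.normal(size=40)

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()
cooks_d, pvalues = results.get_influence().cooks_distance  # one D_i per observation
print(cooks_d[:5])
```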

Extensions

High-dimensional Influence Measure (HIM) is an alternative to Cook's distance for when $p > n$ (i.e., when there are more predictors than observations). [14] While Cook's distance quantifies the individual observation's influence on the least squares regression coefficient estimate, the HIM measures the influence of an observation on the marginal correlations.

Notes

  1. The indices $i$ and $j$ are often interchanged in the original publication, as the projection matrix is symmetric in ordinary linear regression, i.e., $h_{ij} = h_{ji}$. Since this is not always the case, e.g., in weighted linear regression, the indices have been written consistently here to account for potential asymmetry and thus allow for direct usage. [11]


References

  1. Mendenhall, William; Sincich, Terry (1996). A Second Course in Statistics: Regression Analysis (5th ed.). Upper Saddle River, NJ: Prentice-Hall. p. 422. ISBN   0-13-396821-9. A measure of overall influence an outlying observation has on the estimated coefficients was proposed by R. D. Cook (1979). Cook's distance, Di, is calculated...
  2. Cook, R. Dennis (February 1977). "Detection of Influential Observations in Linear Regression". Technometrics. American Statistical Association. 19 (1): 15–18. doi:10.2307/1268249. JSTOR   1268249. MR   0436478.
  3. Cook, R. Dennis (March 1979). "Influential Observations in Linear Regression". Journal of the American Statistical Association . American Statistical Association. 74 (365): 169–174. doi:10.2307/2286747. hdl: 11299/199280 . JSTOR   2286747. MR   0529533.
  4. Hayashi, Fumio (2000). Econometrics. Princeton University Press. pp. 21–23. ISBN   1400823838.
  5. "Cook's Distance".
  6. "Statistics 512: Applied Linear Models" (PDF). Purdue University. Archived from the original (PDF) on 2016-11-30. Retrieved 2016-03-25.
  7. Bollen, Kenneth A.; Jackman, Robert W. (1990). "Regression Diagnostics: An Expository Treatment of Outliers and Influential Cases". In Fox, John; Long, J. Scott (eds.). Modern Methods of Data Analysis. Newbury Park, CA: Sage. pp.  266. ISBN   0-8039-3366-5.
  8. Cook, R. Dennis; Weisberg, Sanford (1982). Residuals and Influence in Regression. New York, NY: Chapman & Hall. hdl:11299/37076. ISBN   0-412-24280-X.
  9. Kim, Myung Geun (31 May 2017). "A cautionary note on the use of Cook's distance". Communications for Statistical Applications and Methods. 24 (3): 317–324. doi: 10.5351/csam.2017.24.3.317 . ISSN   2383-4757.
  10. On deletion diagnostic statistic in regression
  11. Peña 2005, p. 2.
  12. Peña, Daniel (2005). "A New Statistic for Influence in Linear Regression". Technometrics. American Society for Quality and the American Statistical Association. 47 (1): 1–12. doi:10.1198/004017004000000662. S2CID   1802937.
  13. Peña, Daniel (2006). Pham, Hoang (ed.). Springer Handbook of Engineering Statistics. Springer London. pp. 523–536. doi:10.1007/978-1-84628-288-1. ISBN   978-1-84628-288-1. S2CID   60460007.
  14. High-dimensional influence measure
