High-dimensional statistics


In statistical theory, the field of high-dimensional statistics studies data whose dimension is larger (relative to the number of datapoints) than typically considered in classical multivariate analysis. The area arose owing to the emergence of many modern data sets in which the dimension of the data vectors may be comparable to, or even larger than, the sample size, so that justification for the use of traditional techniques, often based on asymptotic arguments with the dimension held fixed as the sample size increased, was lacking. [1] [2]

There are several notions of high-dimensional analysis of statistical methods, including non-asymptotic results that hold for all finite $n$ and $p$, and asymptotic analyses in which the dimension $p$ is allowed to grow with the sample size $n$.

Examples

Parameter estimation in linear models

[Figure: High-dimensional linear model.svg] Illustration of the linear model in high dimensions: a data set consists of a response vector $Y \in \mathbb{R}^n$ and a design matrix $X \in \mathbb{R}^{n \times p}$ with $p \gg n$. The goal is to estimate the unknown vector $\beta = (\beta_1, \dots, \beta_p) \in \mathbb{R}^p$ of regression coefficients, where $\beta$ is often assumed to be sparse, in the sense that the cardinality of the set $S := \{j : \beta_j \neq 0\}$ is small by comparison with $p$.

The most basic statistical model for the relationship between a covariate vector $x \in \mathbb{R}^p$ and a response variable $y \in \mathbb{R}$ is the linear model

$y = x^\top \beta + \epsilon ,$

where $\beta \in \mathbb{R}^p$ is an unknown parameter vector, and $\epsilon$ is random noise with mean zero and variance $\sigma^2$. Given independent responses $Y_1, \dots, Y_n$, with corresponding covariates $x_1, \dots, x_n$, from this model, we can form the response vector $Y = (Y_1, \dots, Y_n)^\top$ and design matrix $X = (x_1, \dots, x_n)^\top \in \mathbb{R}^{n \times p}$. When $n \geq p$ and the design matrix $X$ has full column rank (i.e. its columns are linearly independent), the ordinary least squares estimator of $\beta$ is

$\hat{\beta} := (X^\top X)^{-1} X^\top Y .$

When $\epsilon \sim N_n(0, \sigma^2 I)$, it is known that $\hat{\beta} \sim N_p\bigl(\beta, \sigma^2 (X^\top X)^{-1}\bigr)$. Thus $\hat{\beta}$ is an unbiased estimator of $\beta$, and the Gauss–Markov theorem tells us that it is the best linear unbiased estimator.

However, overfitting is a concern when $p$ is of comparable magnitude to $n$: the matrix $X^\top X$ in the definition of $\hat{\beta}$ may become ill-conditioned, with a small minimum eigenvalue. In such circumstances the risk $\mathbb{E}\bigl(\|\hat{\beta} - \beta\|^2\bigr) = \sigma^2 \, \mathrm{tr}\bigl((X^\top X)^{-1}\bigr)$ will be large (since the trace of a matrix is the sum of its eigenvalues, and the eigenvalues of $(X^\top X)^{-1}$ are the reciprocals of those of $X^\top X$). Even worse, when $p > n$, the matrix $X^\top X$ is singular. (See Section 1.2 and Exercise 1.2 in [1].)
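To make the ill-conditioning phenomenon concrete, the following minimal Python sketch (an illustration, not taken from the references; the sample size, dimensions and noise level are arbitrary choices) simulates a Gaussian design and compares $\sigma^2 \, \mathrm{tr}\bigl((X^\top X)^{-1}\bigr)$ with the observed error of the least squares estimator as $p$ approaches $n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100, 1.0

for p in (10, 50, 90, 99):
    X = rng.standard_normal((n, p))        # random design; full column rank with high probability
    beta = rng.standard_normal(p)
    y = X @ beta + sigma * rng.standard_normal(n)

    # Ordinary least squares: beta_hat = (X^T X)^{-1} X^T y
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)

    # Risk of OLS for this fixed design: sigma^2 * tr((X^T X)^{-1})
    risk = sigma**2 * np.trace(np.linalg.inv(XtX))
    lam_min = np.linalg.eigvalsh(XtX).min()
    print(f"p={p:3d}  min eigenvalue of X^T X = {lam_min:9.3f}  "
          f"theoretical risk = {risk:9.3f}  "
          f"observed squared error = {np.sum((beta_hat - beta)**2):9.3f}")
```

As $p$ approaches $n$, the minimum eigenvalue of $X^\top X$ shrinks and both the theoretical risk and the observed error grow; for $p > n$, $X^\top X$ is singular and the estimator is no longer well defined.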

The deterioration in estimation performance in high dimensions observed in the previous paragraph is not limited to the ordinary least squares estimator. In fact, statistical inference in high dimensions is intrinsically hard, a phenomenon known as the curse of dimensionality, and it can be shown that no estimator can do better in a worst-case sense without additional information (see Example 15.10 of [2]). Nevertheless, the situation in high-dimensional statistics may not be hopeless when the data possess some low-dimensional structure. One common assumption for high-dimensional linear regression is that the vector of regression coefficients is sparse, in the sense that most coordinates of $\beta$ are zero. Many statistical procedures, including the Lasso, have been proposed to fit high-dimensional linear models under such sparsity assumptions.
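As a hedged illustration of fitting a sparse high-dimensional linear model, the sketch below uses the Lasso implementation in scikit-learn; the dimensions, sparsity level and regularisation strength alpha are arbitrary choices made for the example rather than recommendations:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, s = 100, 1000, 5                      # p >> n, with only s non-zero coefficients

beta = np.zeros(p)
beta[:s] = 3.0                              # sparse truth: first s coordinates are non-zero
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

# l1-penalised least squares; alpha controls the strength of the penalty
lasso = Lasso(alpha=0.2)
lasso.fit(X, y)

support_hat = np.flatnonzero(lasso.coef_)   # estimated set S of non-zero coefficients
print("estimated support:", support_hat)
print("squared estimation error:", np.sum((lasso.coef_ - beta) ** 2))
```

In practice the penalty level would typically be chosen by cross-validation (for example with scikit-learn's LassoCV) rather than fixed in advance.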

Covariance matrix estimation

Another example of a high-dimensional statistical phenomenon can be found in the problem of covariance matrix estimation. Suppose that we observe $X_1, \dots, X_n \in \mathbb{R}^p$, which are i.i.d. draws from some zero-mean distribution with an unknown covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$. A natural unbiased estimator of $\Sigma$ is the sample covariance matrix

$\widehat{\Sigma} := \frac{1}{n} \sum_{i=1}^{n} X_i X_i^\top .$

In the low-dimensional setting where $n$ increases and $p$ is held fixed, $\widehat{\Sigma}$ is a consistent estimator of $\Sigma$ in any matrix norm. When $p$ grows with $n$, on the other hand, this consistency result may fail to hold. As an illustration, suppose that each $X_i \sim N_p(0, I)$ and that $p/n \to \alpha \in (0, 1)$. If $\widehat{\Sigma}$ were to consistently estimate $\Sigma = I$, then the eigenvalues of $\widehat{\Sigma}$ should approach one as $n$ increases. It turns out that this is not the case in this high-dimensional setting. Indeed, the largest and smallest eigenvalues of $\widehat{\Sigma}$ concentrate around $(1 + \sqrt{\alpha})^2$ and $(1 - \sqrt{\alpha})^2$, respectively, according to the limiting distribution derived by Tracy and Widom, and these clearly deviate from the unit eigenvalues of $\Sigma$. Further information on the asymptotic behaviour of the eigenvalues of $\widehat{\Sigma}$ can be obtained from the Marchenko–Pastur law. From a non-asymptotic point of view, the maximum eigenvalue $\lambda_{\max}(\widehat{\Sigma})$ of $\widehat{\Sigma}$ satisfies

$\mathbb{P}\Bigl(\lambda_{\max}(\widehat{\Sigma}) \geq \bigl(1 + \sqrt{p/n} + \delta\bigr)^2\Bigr) \leq e^{-n\delta^2/2}$

for any $\delta \geq 0$ and all choices of pairs of $n$ and $p$. [2]
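The eigenvalue spreading described above is easy to reproduce numerically. The following sketch (with an arbitrary choice of $n$ and $p$ giving aspect ratio $\alpha = p/n = 0.25$) draws i.i.d. $N_p(0, I)$ vectors and compares the extreme eigenvalues of the sample covariance matrix with $(1 \pm \sqrt{\alpha})^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 2000, 500                             # aspect ratio alpha = p / n = 0.25
alpha = p / n

X = rng.standard_normal((n, p))              # rows are i.i.d. N_p(0, I) draws
Sigma_hat = X.T @ X / n                      # sample covariance (true covariance is the identity)

eigs = np.linalg.eigvalsh(Sigma_hat)         # eigenvalues in ascending order
print(f"smallest eigenvalue: {eigs[0]:.3f}  vs (1 - sqrt(alpha))^2 = {(1 - np.sqrt(alpha))**2:.3f}")
print(f"largest  eigenvalue: {eigs[-1]:.3f}  vs (1 + sqrt(alpha))^2 = {(1 + np.sqrt(alpha))**2:.3f}")
```

Despite the true covariance being the identity, the sample eigenvalues spread out over roughly the interval $[(1 - \sqrt{\alpha})^2, (1 + \sqrt{\alpha})^2]$, as predicted by the Marchenko–Pastur law.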

Again, additional low-dimensional structure is needed for successful covariance matrix estimation in high dimensions. Examples of such structures include sparsity, low-rankness and bandedness. Similar remarks apply when estimating an inverse covariance matrix (precision matrix).
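As a rough sketch of how such structure can be exploited, the code below hard-thresholds the off-diagonal entries of the sample covariance matrix when $\Sigma$ is assumed to be sparse (here a banded, tridiagonal example); the threshold level 0.2 is an illustrative choice rather than a principled tuning rule:

```python
import numpy as np

def threshold_covariance(X, level):
    """Hard-threshold the off-diagonal entries of the sample covariance of zero-mean data."""
    n = X.shape[0]
    S = X.T @ X / n
    T = np.where(np.abs(S) >= level, S, 0.0)  # keep only entries above the threshold
    np.fill_diagonal(T, np.diag(S))           # never threshold the diagonal
    return T

rng = np.random.default_rng(3)
n, p = 200, 400

# Sparse (tridiagonal) true covariance matrix in a p > n regime
Sigma = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T         # rows are i.i.d. N_p(0, Sigma) draws

S = X.T @ X / n                               # raw sample covariance
T = threshold_covariance(X, level=0.2)        # structure-exploiting estimator

print("operator-norm error, sample covariance:     ", np.linalg.norm(S - Sigma, ord=2))
print("operator-norm error, thresholded covariance:", np.linalg.norm(T - Sigma, ord=2))
```

Thresholding is only one of several structure-exploiting approaches; banding and tapering estimators proceed similarly by shrinking or zeroing entries far from the diagonal.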

History

From an applied perspective, research in high-dimensional statistics was motivated by the realisation that advances in computing technology had dramatically increased the ability to collect and store data, and that traditional statistical techniques such as those described in the examples above were often ill-equipped to handle the resulting challenges. Theoretical advances in the area can be traced back to the remarkable result of Charles Stein in 1956, [4] where he proved that the usual estimator of a multivariate normal mean was inadmissible with respect to squared error loss in three or more dimensions. Indeed, the James–Stein estimator [5] provided the insight that in high-dimensional settings, one may obtain improved estimation performance through shrinkage, which reduces variance at the expense of introducing a small amount of bias. This bias–variance tradeoff was further exploited in the context of high-dimensional linear models by Hoerl and Kennard in 1970 with the introduction of ridge regression. [6] Another major impetus for the field was provided by Robert Tibshirani's work on the Lasso in 1996, which used $\ell_1$-regularisation to achieve simultaneous model selection and parameter estimation in high-dimensional sparse linear regression. [7] Since then, a large number of other shrinkage estimators have been proposed to exploit different low-dimensional structures in a wide range of high-dimensional statistical problems.

Topics in high-dimensional statistics

Topics that have received considerable attention in the high-dimensional statistics literature in recent years include penalised regression methods such as the Dantzig selector, [8] the elastic net, [9] the group lasso [10] and the fused lasso; [11] stability selection for variable selection with error control; [12] [13] estimation of high-dimensional covariance and precision matrices; [14] [15] sparse principal component analysis; [16] [17] and high-dimensional classification. [18] [19] [20]

Notes

  1. Lederer, Johannes (2022). Fundamentals of High-Dimensional Statistics: With Exercises and R Labs. Springer Texts in Statistics. Cham: Springer.
  2. Wainwright, Martin J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. doi:10.1017/9781108627771. ISBN 9781108498029. S2CID 128095693.
  3. Wainwright MJ. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge: Cambridge University Press; 2019. doi:10.1017/9781108627771
  4. Stein, C. (1956), "Inadmissibility of the usual estimator for the mean of a multivariate distribution", Proc. Third Berkeley Symp. Math. Statist. Prob., vol. 1, pp. 197–206, MR   0084922, Zbl   0073.35602
  5. James, W.; Stein, C. (1961), "Estimation with quadratic loss", Proc. Fourth Berkeley Symp. Math. Statist. Prob., vol. 1, pp. 361–379, MR   0133191
  6. Hoerl, Arthur E.; Kennard, Robert W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems". Technometrics. 12 (1): 55–67. JSTOR 1267351. Accessed 13 March 2021.
  7. Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the lasso". Journal of the Royal Statistical Society. Series B (methodological). Wiley. 58 (1): 267–88. JSTOR   2346178.
  8. Candes, Emmanuel; Tao, Terence (2007). "The Dantzig selector: Statistical estimation when p is much larger than n". Annals of Statistics. 35 (6): 2313–2351. arXiv: math/0506081 . doi:10.1214/009053606000001523. MR   2382644. S2CID   88524200.
  9. Zou, Hui; Hastie, Trevor (2005). "Regularization and Variable Selection via the Elastic Net". Journal of the Royal Statistical Society. Series B (statistical Methodology). Wiley. 67 (2): 301–20. doi: 10.1111/j.1467-9868.2005.00503.x . JSTOR   3647580.
  10. Yuan, Ming; Lin, Yi (2006). "Model Selection and Estimation in Regression with Grouped Variables". Journal of the Royal Statistical Society. Series B (statistical Methodology). Wiley. 68 (1): 49–67. doi: 10.1111/j.1467-9868.2005.00532.x . JSTOR   3647556. S2CID   6162124.
  11. Tibshirani, Robert, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. 2005. “Sparsity and Smoothness via the Fused lasso”. Journal of the Royal Statistical Society. Series B (statistical Methodology) 67 (1). Wiley: 91–108. https://www.jstor.org/stable/3647602.
  12. Meinshausen, Nicolai; Bühlmann, Peter (2010). "Stability selection". Journal of the Royal Statistical Society, Series B (Statistical Methodology). 72 (4): 417–473. doi: 10.1111/j.1467-9868.2010.00740.x . ISSN   1467-9868. S2CID   1231300.
  13. Shah, Rajen D.; Samworth, Richard J. (2013). "Variable selection with error control: another look at stability selection". Journal of the Royal Statistical Society. Series B (Statistical Methodology). 75 (1): 55–80. arXiv: 1105.5578 . doi: 10.1111/j.1467-9868.2011.01034.x . ISSN   1369-7412. JSTOR   23361014. S2CID   18211609.
  14. Cai, T. Tony; Zhang, Cun-Hui; Zhou, Harrison H. (August 2010). "Optimal rates of convergence for covariance matrix estimation". The Annals of Statistics. 38 (4): 2118–2144. arXiv: 1010.3866 . doi:10.1214/09-AOS752. ISSN   0090-5364. S2CID   14038500 . Retrieved 2021-04-06.
  15. Cai, Tony; Liu, Weidong; Luo, Xi (2011-06-01). "A Constrained Minimization Approach to Sparse Precision Matrix Estimation". Journal of the American Statistical Association. 106 (494): 594–607. arXiv: 1102.2233 . doi:10.1198/jasa.2011.tm10155. ISSN   0162-1459. S2CID   15900101 . Retrieved 2021-04-06.
  16. Johnstone, Iain M.; Lu, Arthur Yu (2009-06-01). "On Consistency and Sparsity for Principal Components Analysis in High Dimensions". Journal of the American Statistical Association. 104 (486): 682–693. doi:10.1198/jasa.2009.0121. ISSN   0162-1459. PMC   2898454 . PMID   20617121.
  17. Vu, Vincent Q.; Lei, Jing (December 2013). "Minimax sparse principal subspace estimation in high dimensions". The Annals of Statistics. 41 (6): 2905–2947. arXiv: 1211.0373 . doi: 10.1214/13-AOS1151 . ISSN   0090-5364. S2CID   562591.
  18. Bickel, Peter J.; Levina, Elizaveta (2004). "Some theory for Fisher's linear discriminant function, naive Bayes', and some alternatives when there are many more variables than observations". Bernoulli. 10 (6): 989–1010. doi: 10.3150/bj/1106314847 .
  19. Fan, Jianqing; Fan, Yingying (December 2008). "High-dimensional classification using features annealed independence rules". The Annals of Statistics. 36 (6): 2605–2637. arXiv: math/0701108 . doi: 10.1214/07-AOS504 . PMC   2630123 . PMID   19169416. S2CID   2982392.
  20. Cannings, Timothy I.; Samworth, Richard J. (2017). "Random-projection ensemble classification". Journal of the Royal Statistical Society, Series B (Statistical Methodology). 79 (4): 959–1035. arXiv: 1504.04595 . doi: 10.1111/rssb.12228 . S2CID   88520328.
