In statistical theory

, the field of high-dimensional statistics studies data whose

dimension

is larger than typically considered in classical

multivariate analysis

. The area arose owing to the emergence of many modern data sets in which the dimension of the data vectors may be comparable to, or even larger than, the

sample size

, so that justification for the use of traditional techniques, often based on asymptotic arguments with the dimension held fixed as the sample size increased, was lacking.

Examples

Parameter estimation in linear models

The most basic statistical model for the relationship between a

covariate

vector

x \in \mathbb{R}^p

and a

response variable

y \in \mathbb{R}

is the

linear model

y = x^\top \beta + \epsilon,

where

\beta \in \mathbb{R}^p

is an unknown parameter vector, and

\epsilon

is random noise with mean zero and variance

\sigma^2

. Given independent responses

Y_1,\ldots,Y_n

, with corresponding covariates

x_1,\ldots,x_n

, from this model, we can form the response vector

Y = (Y_1,\ldots,Y_n)^\top

, and

design matrix

X = (x_1,\ldots,x_n)^\top \in \mathbb{R}^{n \times p}

. When

n \geq p

and the design matrix has full

column rank

(i.e. its columns are

linearly independent

), the

ordinary least squares

estimator of

\beta

is

\hat{\beta} := (X^\top X)^{-1} X^\top Y.

When

\epsilon \sim N(0,\sigma^2)

, it is

known

that

\hat{\beta} \sim N_p\bigl(\beta,\sigma^2(X^\top X)^{-1}\bigr)

. Thus,

\hat{\beta}

is an

unbiased estimator of

\beta

, and the Gauss-Markov theorem tells us that it is the

Best Linear Unbiased Estimator

. However,

overfitting

is a concern when

p

is of comparable magnitude to

n

: the matrix

X^\top X

in the definition of

\hat{\beta}

may become

ill-conditioned

, with a small minimum

eigenvalue

. In such circumstances

\mathbb{E}(\|\hat{\beta} - \beta\|^2) = \sigma^2 \mathrm{tr}\bigl((X^\top X)^{-1}\bigr)

will be large (since the

trace

of a matrix is the sum of its eigenvalues, and the eigenvalues of an inverse matrix are the reciprocals of those of the original matrix). Even worse, when

p > n

, the matrix

X^\top X

is singular

. (See Section 1.2 and Exercise 1.2 in .) It is important to note that the deterioration in estimation performance in high dimensions observed in the previous paragraph is not limited to the ordinary least squares estimator. In fact, statistical inference in high dimensions is intrinsically hard, a phenomenon known as the

curse of dimensionality

, and it can be shown that no estimator can do better in a worst-case sense without additional information (see Example 15.10). Nevertheless, the situation in high-dimensional statistics may not be hopeless when the data possess some low-dimensional structure. One common assumption for high-dimensional linear regression is that the vector of regression coefficients is sparse, in the sense that most coordinates of

\beta

are zero. Many statistical procedures, including the

Lasso

, have been proposed to fit high-dimensional linear models under such sparsity assumptions.
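The effect of ill-conditioning described in this section can be illustrated numerically. The following is a minimal sketch, assuming a design matrix with independent standard Gaussian entries (an assumption made here for illustration, not part of the model above). The helper name `ols_risk_factor` and the chosen values of n and p are hypothetical. The function returns tr((X^T X)^{-1}), which equals the expected squared estimation error of ordinary least squares divided by sigma^2.

```python
import numpy as np

def ols_risk_factor(n, p, seed=0):
    """Return tr((X^T X)^{-1}) for an n x p standard Gaussian design (p < n).

    Multiplying the result by sigma^2 gives E||beta_hat - beta||^2 for OLS.
    The Gaussian design is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    gram = X.T @ X
    # The trace of the inverse is the sum of the reciprocal eigenvalues
    # of the symmetric Gram matrix.
    eigvals = np.linalg.eigvalsh(gram)
    return float(np.sum(1.0 / eigvals))

n = 200
# The risk factor grows sharply as p approaches n.
risks = {p: ols_risk_factor(n, p) for p in (10, 100, 190)}

# When p > n, the p x p Gram matrix has rank at most n, so it is singular.
X_wide = np.random.default_rng(1).standard_normal((50, 80))
rank_wide = int(np.linalg.matrix_rank(X_wide.T @ X_wide))
```

Comparing the entries of `risks` shows how the estimation error deteriorates as p grows towards n, while `rank_wide` is at most 50 even though the Gram matrix is 80 by 80.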

Covariance matrix estimation

Another example of a high-dimensional statistical phenomenon can be found in the problem of covariance matrix estimation. Suppose that we observe

X_1,\ldots,X_n \in \mathbb{R}^p

, which are

i.i.d.

draws from some zero mean distribution with an unknown covariance matrix

\Sigma \in \mathbb{R}^{p \times p}

. A natural estimator of

\Sigma

is the

sample covariance matrix

\widehat{\Sigma} := \frac{1}{n} \sum_{i=1}^n X_i X_i^\top.

In the low-dimensional setting where

n

increases and

p

is held fixed,

\widehat{\Sigma}

is a

consistent estimator of

\Sigma

in any

matrix norm

. When

p

grows with

n

, on the other hand, this consistency result may fail to hold. As an illustration, suppose that each

X_i \sim N_p(0,I)

and

p/n \rightarrow \alpha \in (0,1)

. If

\widehat{\Sigma}

were to consistently estimate

\Sigma = I

, then the eigenvalues of

\widehat{\Sigma}

should approach one as

n

increases. It turns out that this is not the case in this high-dimensional setting. Indeed, the largest and smallest eigenvalues of

\widehat{\Sigma}

concentrate around

(1 + \sqrt{\alpha})^2

and

(1 - \sqrt{\alpha})^2

, respectively, according to the limiting distribution derived by Tracy and Widom, and these clearly deviate from the unit eigenvalues of

\Sigma

. Further information on the asymptotic behaviour of the eigenvalues of

\widehat{\Sigma}

can be obtained from the Marchenko–Pastur law. From a non-asymptotic point of view, the maximum eigenvalue

\lambda_{\mathrm{max}}(\widehat{\Sigma})

of

\widehat{\Sigma}

satisfies

\mathbb{P}\left(\lambda_{\mathrm{max}}(\widehat{\Sigma}) \geq (1 + \sqrt{p/n} + \delta)^2\right) \leq e^{-n\delta^2/2},

for any

\delta \geq 0

and all choices of pairs of

n,p

. Again, additional low-dimensional structure is needed for successful covariance matrix estimation in high dimensions. Examples of such structures include

sparsity

, low rankness and bandedness. Similar remarks apply when estimating an inverse covariance matrix (precision matrix).
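The failure of consistency described in this section can be checked by simulation. The sketch below assumes the setting of the illustration above (each observation drawn from N_p(0, I)) and uses hypothetical values n = 2000 and p = 500, so that p/n = 0.25. It compares the extreme eigenvalues of the sample covariance matrix with the limiting values (1 - sqrt(alpha))^2 and (1 + sqrt(alpha))^2. The function name `extreme_eigenvalues` is introduced here for illustration only.

```python
import numpy as np

def extreme_eigenvalues(n, p, seed=0):
    """Smallest and largest eigenvalue of the sample covariance of n draws from N_p(0, I)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))          # rows are the observations X_i
    sigma_hat = X.T @ X / n                  # sample covariance (mean known to be zero)
    eigvals = np.linalg.eigvalsh(sigma_hat)  # returned in ascending order
    return float(eigvals[0]), float(eigvals[-1])

n, p = 2000, 500
alpha = p / n
lam_min, lam_max = extreme_eigenvalues(n, p)

# Limiting edges of the spectrum for this ratio alpha.
edge_min = (1 - np.sqrt(alpha)) ** 2
edge_max = (1 + np.sqrt(alpha)) ** 2
```

Although every population eigenvalue equals one, `lam_min` and `lam_max` should land close to `edge_min` and `edge_max` rather than close to one.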

History

From an applied perspective, research in high-dimensional statistics was motivated by the realisation that advances in computing technology had dramatically increased the ability to collect and store

data

, and that traditional statistical techniques such as those described in the examples above were often ill-equipped to handle the resulting challenges. Theoretical advances in the area can be traced back to the remarkable result of Charles Stein in 1956, where he proved that the usual estimator of a multivariate normal mean was inadmissible with respect to squared error loss in three or more dimensions. Indeed, the James-Stein estimator provided the insight that in high-dimensional settings, one may obtain improved estimation performance through shrinkage, which reduces variance at the expense of introducing a small amount of bias. This bias-variance tradeoff was further exploited in the context of high-dimensional

linear models

by Hoerl and Kennard in 1970 with the introduction of

ridge regression

. Another major impetus for the field was provided by

Robert Tibshirani

's work on the

Lasso

in 1996, which used

\ell_1

regularisation to achieve simultaneous model selection and parameter estimation in high-dimensional sparse linear regression. Since then, a large number of other shrinkage estimators have been proposed to exploit different low-dimensional structures in a wide range of high-dimensional statistical problems.
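The gain from shrinkage mentioned above can be illustrated with a small simulation. This is a minimal sketch, assuming a single observation x drawn from N_p(theta, I) with unit variance, and using the standard James-Stein form (1 - (p - 2)/||x||^2) x. The true mean `theta`, the dimension and the number of replications are hypothetical choices made for illustration.

```python
import numpy as np

def james_stein(x):
    """James-Stein estimate of theta from x ~ N_p(theta, I), for p >= 3.

    Works on a single vector or on a batch of vectors stored as rows.
    """
    p = x.shape[-1]
    shrink = 1.0 - (p - 2) / np.sum(x ** 2, axis=-1, keepdims=True)
    return shrink * x

rng = np.random.default_rng(0)
p, reps = 20, 5000
theta = np.full(p, 0.5)                      # hypothetical true mean
x = theta + rng.standard_normal((reps, p))   # one observation per replication

# Monte Carlo estimates of the squared error risk of each estimator.
risk_usual = float(np.mean(np.sum((x - theta) ** 2, axis=1)))
risk_js = float(np.mean(np.sum((james_stein(x) - theta) ** 2, axis=1)))
```

The usual estimator has risk equal to p, whereas the shrinkage estimator trades a small bias for a reduction in variance, so `risk_js` should come out below `risk_usual`.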

Topics in high-dimensional statistics

The following are examples of topics that have received considerable attention in the high-dimensional statistics literature in recent years:

* Linear models in high dimensions. Linear models are one of the most widely used tools in statistics and its applications. As such, sparse linear regression is one of the most well-studied topics in high-dimensional statistical research. Building upon earlier works on

ridge regression

and the

Lasso

, several other shrinkage estimators have been proposed and studied in this and related problems. They include

** The Dantzig selector, which minimises the maximum covariate-residual correlation, instead of the residual sum of squares as in the Lasso, subject to an

\ell_1

constraint on the coefficients.

** Elastic net, which combines

\ell_1

regularisation of the

Lasso

with

\ell_2

regularisation of

ridge regression

to allow highly correlated covariates to be simultaneously selected with similar regression coefficients.

** The Group Lasso, which allows predefined groups of covariates to be selected jointly.

** The Fused lasso, which regularises the difference between nearby coefficients when the regression coefficients reflect spatial or temporal relationships, so as to enforce a piecewise constant structure. (Tibshirani, Robert, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. 2005. “Sparsity and Smoothness via the Fused lasso”. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 67 (1). Wiley: 91–108. https://www.jstor.org/stable/3647602.)

* High-dimensional variable selection. In addition to estimating the underlying parameter in regression models, another important topic is identifying the non-zero coefficients, as these correspond to variables that are needed in a final model. Each of the techniques listed under the previous heading can be used for this purpose, and they are sometimes combined with ideas such as subsampling through Stability Selection.

* High-dimensional covariance and precision matrix estimation. These problems were introduced above; see also shrinkage estimation. Methods include tapering estimators and the constrained

\ell_1

minimisation estimator.

* Sparse principal component analysis.

Principal Component Analysis

is another technique that breaks down in high dimensions; more precisely, under appropriate conditions, the leading eigenvector of the sample covariance matrix is an inconsistent estimator of its population counterpart when the ratio of the number of variables

p

to the number of observations

n

is bounded away from zero. Under the assumption that this leading eigenvector is sparse (which can aid interpretability), consistency can be restored.

* Matrix completion

. This topic, which concerns the task of filling in the missing entries of a partially observed matrix, became popular owing in large part to the

Netflix prize

for predicting user ratings for films.

* High-dimensional classification.

Linear discriminant analysis

cannot be used when

p > n

, because the sample covariance matrix is singular

. Alternative approaches have been proposed based on

naive Bayes

, feature selection and

random projections

.

* Graphical models for high-dimensional data. Graphical models are used to encode the conditional dependence structure between different variables. Under a Gaussianity assumption, the problem reduces to that of estimating a sparse precision matrix, discussed above.
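As a rough illustration of the sparse linear regression estimators listed above, the following sketch fits a Lasso-type estimate when p exceeds n. It assumes the objective (1/(2n))||y - Xb||^2 + lam ||b||_1 and solves it by proximal gradient descent (iterative soft thresholding). That solver is one common choice and is not taken from the works cited here. The function names, the simulated data and the value of `lam` are all hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each entry towards zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Approximately minimise (1/(2n))||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    # Step size 1/L, where L is the largest eigenvalue of X^T X / n.
    step = n / np.linalg.norm(X, ord=2) ** 2
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n          # gradient of the smooth part
        b = soft_threshold(b - step * grad, step * lam)
    return b

# Simulated sparse problem with more covariates than observations.
rng = np.random.default_rng(0)
n, p, s = 100, 300, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 3.0                                # only the first s coefficients are non-zero
y = X @ beta + 0.5 * rng.standard_normal(n)

beta_hat = lasso_ista(X, y, lam=0.3)
support = np.flatnonzero(np.abs(beta_hat) > 1e-8)
```

In this simulated setting ordinary least squares is not defined because X^T X is singular, whereas the penalised estimate is sparse and `support` indicates which covariates were selected. The estimated coefficients are biased towards zero by the penalty.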
