In
statistics
Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...
, particularly in
hypothesis testing
A statistical hypothesis test is a method of statistical inference used to decide whether the data provide sufficient evidence to reject a particular hypothesis. A statistical hypothesis test typically involves a calculation of a test statistic. T ...
, the Hotelling's ''T''-squared distribution (''T''
2), proposed by
Harold Hotelling
Harold Hotelling (; September 29, 1895 – December 26, 1973) was an American mathematical statistician and an influential economic theorist, known for Hotelling's law, Hotelling's lemma, and Hotelling's rule in economics, as well as Hotelling ...
,
is a
multivariate probability distribution that is tightly related to the
''F''-distribution and is most notable for arising as the distribution of a set of
sample statistics
A statistic (singular) or sample statistic is any quantity computed from values in a sample which is considered for a statistical purpose. Statistical purposes include estimating a population parameter, describing a sample, or evaluating a hypot ...
that are natural generalizations of the statistics underlying the
Student's ''t''-distribution.
The Hotelling's ''t''-squared statistic (''t''
2) is a generalization of
Student's ''t''-statistic that is used in
multivariate hypothesis testing
A statistical hypothesis test is a method of statistical inference used to decide whether the data provide sufficient evidence to reject a particular hypothesis. A statistical hypothesis test typically involves a calculation of a test statistic. T ...
.
Motivation
The distribution arises in
multivariate statistics
Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable, i.e., '' multivariate random variables''.
Multivariate statistics concerns understanding the differ ...
in undertaking
tests
Test(s), testing, or TEST may refer to:
* Test (assessment), an educational assessment intended to measure the respondents' knowledge or other abilities
Arts and entertainment
* ''Test'' (2013 film), an American film
* ''Test'' (2014 film) ...
of the differences between the (multivariate) means of different populations, where tests for univariate problems would make use of a
''t''-test.
The distribution is named for
Harold Hotelling
Harold Hotelling (; September 29, 1895 – December 26, 1973) was an American mathematical statistician and an influential economic theorist, known for Hotelling's law, Hotelling's lemma, and Hotelling's rule in economics, as well as Hotelling ...
, who developed it as a generalization of Student's ''t''-distribution.
Definition
If the vector
is
Gaussian multivariate-distributed with zero mean and unit
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
and
is a
random matrix with a
Wishart distribution
In statistics, the Wishart distribution is a generalization of the gamma distribution to multiple dimensions. It is named in honor of John Wishart (statistician), John Wishart, who first formulated the distribution in 1928. Other names include Wi ...
with unit
scale matrix
In affine geometry, uniform scaling (or isotropic scaling) is a linear transformation that enlarges (increases) or shrinks (diminishes) objects by a '' scale factor'' that is the same in all directions ( isotropically). The result of uniform sc ...
and ''m''
degrees of freedom
In many scientific fields, the degrees of freedom of a system is the number of parameters of the system that may vary independently. For example, a point in the plane has two degrees of freedom for translation: its two coordinates; a non-infinite ...
, and ''d'' and ''M'' are independent of each other, then the
quadratic form
In mathematics, a quadratic form is a polynomial with terms all of degree two (" form" is another name for a homogeneous polynomial). For example,
4x^2 + 2xy - 3y^2
is a quadratic form in the variables and . The coefficients usually belong t ...
has a Hotelling distribution (with parameters
and
):
:
It can be shown that if a random variable ''X'' has Hotelling's ''T''-squared distribution,
, then:
[
:
where is the ''F''-distribution with parameters ''p'' and ''m'' − ''p'' + 1.
]
Hotelling ''t''-squared statistic
Let be the sample covariance
The sample mean (sample average) or empirical mean (empirical average), and the sample covariance or empirical covariance are statistics computed from a sample of data on one or more random variables.
The sample mean is the average value (or me ...
:
:
where we denote transpose
In linear algebra, the transpose of a Matrix (mathematics), matrix is an operator which flips a matrix over its diagonal;
that is, it switches the row and column indices of the matrix by producing another matrix, often denoted by (among other ...
by an apostrophe
The apostrophe (, ) is a punctuation mark, and sometimes a diacritical mark, in languages that use the Latin alphabet and some other alphabets. In English, the apostrophe is used for two basic purposes:
* The marking of the omission of one o ...
. It can be shown that is a positive (semi) definite matrix and follows a ''p''-variate Wishart distribution
In statistics, the Wishart distribution is a generalization of the gamma distribution to multiple dimensions. It is named in honor of John Wishart (statistician), John Wishart, who first formulated the distribution in 1928. Other names include Wi ...
with ''n'' − 1 degrees of freedom.
The sample covariance matrix of the mean reads .
The Hotelling's ''t''-squared statistic is then defined as:
:
which is proportional to the Mahalanobis distance
The Mahalanobis distance is a distance measure, measure of the distance between a point P and a probability distribution D, introduced by Prasanta Chandra Mahalanobis, P. C. Mahalanobis in 1936. The mathematical details of Mahalanobis distance ...
between the sample mean and . Because of this, one should expect the statistic to assume low values if , and high values if they are different.
From the distribution Distribution may refer to:
Mathematics
*Distribution (mathematics), generalized functions used to formulate solutions of partial differential equations
*Probability distribution, the probability of a particular value or value range of a varia ...
,
:
where is the ''F''-distribution with parameters ''p'' and ''n'' − ''p''.
In order to calculate a ''p''-value (unrelated to ''p'' variable here), note that the distribution of equivalently implies that
:
Then, use the quantity on the left hand side to evaluate the ''p''-value corresponding to the sample, which comes from the ''F''-distribution. A confidence region In statistics, a confidence region is a multi-dimensional generalization of a confidence interval. For a bivariate normal distribution, it is an ellipse, also known as the error ellipse. More generally, it is a set of points in an ''n''-dimension ...
may also be determined using similar logic.
Motivation
Let denote a ''p''-variate normal distribution with location
In geography, location or place is used to denote a region (point, line, or area) on Earth's surface. The term ''location'' generally implies a higher degree of certainty than ''place'', the latter often indicating an entity with an ambiguous bou ...
and known covariance
In probability theory and statistics, covariance is a measure of the joint variability of two random variables.
The sign of the covariance, therefore, shows the tendency in the linear relationship between the variables. If greater values of one ...
. Let
:
be ''n'' independent identically distributed (iid) random variable
A random variable (also called random quantity, aleatory variable, or stochastic variable) is a Mathematics, mathematical formalization of a quantity or object which depends on randomness, random events. The term 'random variable' in its mathema ...
s, which may be represented as column vectors of real numbers. Define
:
to be the sample mean
The sample mean (sample average) or empirical mean (empirical average), and the sample covariance or empirical covariance are statistics computed from a sample of data on one or more random variables.
The sample mean is the average value (or me ...
with covariance . It can be shown that
:
where is the chi-squared distribution
In probability theory and statistics, the \chi^2-distribution with k Degrees of freedom (statistics), degrees of freedom is the distribution of a sum of the squares of k Independence (probability theory), independent standard normal random vari ...
with ''p'' degrees of freedom.
Alternatively, one can argue using density functions and characteristic functions, as follows.
Two-sample statistic
If and , with the samples independently drawn from two independent
Independent or Independents may refer to:
Arts, entertainment, and media Artist groups
* Independents (artist group), a group of modernist painters based in Pennsylvania, United States
* Independentes (English: Independents), a Portuguese artist ...
multivariate normal distribution
In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional ( univariate) normal distribution to higher dimensions. One d ...
s with the same mean and covariance, and we define
:
as the sample means, and
:
:
as the respective sample covariance matrices. Then
:
is the unbiased pooled covariance matrix estimate (an extension of pooled variance
In statistics, pooled variance (also known as combined variance, composite variance, or overall variance, and written \sigma^2) is a method for estimating variance of several different populations when the mean of each population may be differen ...
).
Finally, the Hotelling's two-sample ''t''-squared statistic is
:
Related concepts
It can be related to the F-distribution by
:
The non-null distribution of this statistic is the noncentral F-distribution (the ratio of a non-central Chi-squared random variable and an independent central Chi-squared random variable)
:
with
:
where is the difference vector between the population means.
In the two-variable case, the formula simplifies nicely allowing appreciation of how the correlation, ,
between the variables affects . If we define
:
and
:
then
:
Thus, if the differences in the two rows of the vector are of the same sign, in general, becomes smaller as becomes more positive. If the differences are of opposite sign becomes larger as becomes more positive.
A univariate special case can be found in Welch's t-test.
More robust and powerful tests than Hotelling's two-sample test have been proposed in the literature, see for example the interpoint distance based tests which can be applied also when the number of variables is comparable with, or even larger than, the number of subjects.
See also
* Student's ''t''-test in univariate statistics
* Student's ''t''-distribution in univariate probability theory
* Multivariate Student distribution
* ''F''-distribution (commonly tabulated or available in software libraries, and hence used for testing the ''T''-squared statistic using the relationship given above)
* Wilks's lambda distribution (in multivariate statistics
Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable, i.e., '' multivariate random variables''.
Multivariate statistics concerns understanding the differ ...
, Wilks's ''Λ'' is to Hotelling's ''T''2 as Snedecor's ''F'' is to Student's ''t'' in univariate statistics)
References
External links
*
{{DEFAULTSORT:Hotelling's T-Squared Distribution
Continuous distributions