In statistics and in probability theory, distance correlation or distance covariance is a measure of dependence between two paired random vectors of arbitrary, not necessarily equal, dimension. The population distance correlation coefficient is zero if and only if the random vectors are independent. Thus, distance correlation measures both linear and nonlinear association between two random variables or random vectors. This is in contrast to Pearson's correlation, which can only detect linear association between two random variables.
Distance correlation can be used to perform a statistical test of dependence with a permutation test. One first computes the distance correlation (involving the re-centering of Euclidean distance matrices) between two random vectors, and then compares this value to the distance correlations of many shuffles of the data.
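The permutation procedure just described can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation used by any particular package: the function names, the default of 199 shuffles, the fixed seed, and the add-one p-value correction are all assumptions made for the example. It relies on the fact that shuffling one sample only permutes the rows and columns of its centered distance matrix, so the matrices are computed once.

```python
import math
import random


def _dist(p, q):
    """Euclidean distance between two scalars or two equal-length sequences."""
    if isinstance(p, (int, float)):
        return abs(p - q)
    return math.dist(p, q)


def _centered(points):
    """Doubly centered pairwise-distance matrix of a sample."""
    n = len(points)
    d = [[_dist(p, q) for q in points] for p in points]
    row = [sum(r) / n for r in d]      # row means (equal to column means: d is symmetric)
    grand = sum(row) / n               # grand mean
    return [[d[j][k] - row[j] - row[k] + grand for k in range(n)] for j in range(n)]


def dcor_permutation_test(xs, ys, n_perm=199, seed=0):
    """Return (observed sample dCor, permutation p-value). Hypothetical helper."""
    rng = random.Random(seed)
    n = len(xs)
    a, b = _centered(xs), _centered(ys)
    # Sample distance variances are unchanged by shuffling, so compute them once.
    dvar_x = sum(v * v for r in a for v in r) / n**2
    dvar_y = sum(v * v for r in b for v in r) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0, 1.0                # a constant sample shows no dependence

    def stat(perm):
        # Shuffling ys by `perm` permutes rows and columns of b in the same way.
        s = sum(a[j][k] * b[perm[j]][perm[k]] for j in range(n) for k in range(n))
        return math.sqrt(max(s / n**2, 0.0) / denom)

    perm = list(range(n))
    observed = stat(perm)              # identity permutation: the original pairing
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)              # break the pairing between xs and ys
        if stat(perm) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


xs = [i / 30 for i in range(-30, 31)]
ys = [x * x for x in xs]               # nonlinear dependence on a symmetric grid
observed, p_value = dcor_permutation_test(xs, ys)
print(f"dCor = {observed:.3f}, p = {p_value:.3f}")
```

A small p-value indicates that the observed distance correlation is larger than what shuffled (and therefore unpaired) data typically produce.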
Background
The classical measure of dependence, the Pearson correlation coefficient, is mainly sensitive to a linear relationship between two variables. Distance correlation was introduced in 2005 by Gábor J. Székely in several lectures to address this deficiency of Pearson's correlation, namely that it can easily be zero for dependent variables. Correlation = 0 (uncorrelatedness) does not imply independence, while distance correlation = 0 does imply independence. The first results on distance correlation were published in 2007 and 2009. It was proved that distance covariance is the same as the Brownian covariance. These measures are examples of energy distances.
The distance correlation is derived from a number of other quantities that are used in its specification, specifically: distance variance, distance standard deviation, and distance covariance. These quantities take the same roles as the ordinary moments with corresponding names in the specification of the Pearson product-moment correlation coefficient.
Definitions
Distance covariance
Let us start with the definition of the sample distance covariance. Let (''X''<sub>''k''</sub>, ''Y''<sub>''k''</sub>), ''k'' = 1, 2, ..., ''n'' be a statistical sample from a pair of real valued or vector valued random variables (''X'', ''Y''). First, compute the ''n'' by ''n'' distance matrices (''a''<sub>''j'', ''k''</sub>) and (''b''<sub>''j'', ''k''</sub>) containing all pairwise distances
: <math>a_{j,k} = \|X_j - X_k\|, \qquad j, k = 1, 2, \ldots, n,</math>
: <math>b_{j,k} = \|Y_j - Y_k\|, \qquad j, k = 1, 2, \ldots, n,</math>
where ||⋅|| denotes the Euclidean norm. Then take all doubly centered distances
: <math>A_{j,k} := a_{j,k} - \overline{a}_{j\cdot} - \overline{a}_{\cdot k} + \overline{a}_{\cdot\cdot}, \qquad B_{j,k} := b_{j,k} - \overline{b}_{j\cdot} - \overline{b}_{\cdot k} + \overline{b}_{\cdot\cdot},</math>
where <math>\overline{a}_{j\cdot}</math> is the ''j''-th row mean, <math>\overline{a}_{\cdot k}</math> is the ''k''-th column mean, and <math>\overline{a}_{\cdot\cdot}</math> is the grand mean of the distance matrix of the ''X'' sample. The notation is similar for the ''b'' values. (In the matrices of centered distances (''A''<sub>''j'', ''k''</sub>) and (''B''<sub>''j'', ''k''</sub>) all rows and all columns sum to zero.) The squared sample distance covariance (a scalar) is simply the arithmetic average of the products ''A''<sub>''j'', ''k''</sub> ''B''<sub>''j'', ''k''</sub>:
: <math>\operatorname{dCov}^2_n(X, Y) := \frac{1}{n^2} \sum_{j=1}^n \sum_{k=1}^n A_{j,k}\, B_{j,k}.</math>
The statistic ''T''<sub>''n''</sub> = ''n'' dCov<sup>2</sup><sub>''n''</sub>(''X'', ''Y'') determines a consistent multivariate test of independence of random vectors in arbitrary dimensions. For an implementation see the ''dcov.test'' function in the ''energy'' package for R.
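The recipe above (pairwise distances, double centering, average of products) translates directly into code. The following is a minimal standard-library Python sketch under stated assumptions: observations are passed as equal-length tuples, and the function names are illustrative rather than taken from any library. It is not the ''energy'' implementation.

```python
import math


def distance_matrix(points):
    """n-by-n matrix of pairwise Euclidean distances; points are equal-length tuples."""
    return [[math.dist(p, q) for q in points] for p in points]


def double_center(d):
    """Subtract the row mean and column mean of each entry and add the grand mean."""
    n = len(d)
    row_means = [sum(row) / n for row in d]
    col_means = [sum(d[j][k] for j in range(n)) / n for k in range(n)]
    grand_mean = sum(row_means) / n
    return [[d[j][k] - row_means[j] - col_means[k] + grand_mean
             for k in range(n)] for j in range(n)]


def dcov_sq(xs, ys):
    """Squared sample distance covariance: the average of A[j][k] * B[j][k]."""
    n = len(xs)
    a = double_center(distance_matrix(xs))
    b = double_center(distance_matrix(ys))
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n**2


# X is 2-dimensional and Y is 1-dimensional: the dimensions need not match.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
ys = [(0.5,), (1.5,), (2.0,), (4.0,)]
centered = double_center(distance_matrix(xs))
t_n = len(xs) * dcov_sq(xs, ys)        # the statistic n * dCov_n^2 from the text
```

Every row and column of `centered` sums to zero up to floating-point rounding, which is the property of centered distances noted in the definition.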
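The population value of distance covariance can be defined along the same lines. Let ''X'' be a random variable that takes values in a ''p''-dimensional Euclidean space with probability distribution ''μ'' and let ''Y'' be a random variable that takes values in a ''q''-dimensional Euclidean space with probability distribution ''ν'', and suppose that ''X'' and ''Y'' have finite expectations. Write
: <math>a_\mu(x) := \operatorname{E}[\|X - x\|], \quad D(\mu) := \operatorname{E}[a_\mu(X)], \quad d_\mu(x, x') := \|x - x'\| - a_\mu(x) - a_\mu(x') + D(\mu).</math>
Finally, define the population value of squared distance covariance of ''X'' and ''Y'' as
: <math>\operatorname{dCov}^2(X, Y) := \operatorname{E}\big[d_\mu(X, X')\, d_\nu(Y, Y')\big].</math>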
One can show that this is equivalent to the following definition:
: <math>\operatorname{dCov}^2(X, Y) := \operatorname{E}[\|X - X'\|\,\|Y - Y'\|] + \operatorname{E}[\|X - X'\|]\,\operatorname{E}[\|Y - Y'\|] - 2\operatorname{E}[\|X - X'\|\,\|Y - Y''\|],</math>
where ''E'' denotes expected value, and (''X'', ''Y''), (''X''′, ''Y''′), and (''X''″, ''Y''″) are independent and identically distributed (iid); that is, the primed random variables are iid copies of the pair (''X'', ''Y''). Distance covariance can be expressed in terms of the classical Pearson's covariance, cov, as follows:
: <math>\operatorname{dCov}^2(X, Y) = \operatorname{cov}(\|X - X'\|, \|Y - Y'\|) - 2\operatorname{cov}(\|X - X'\|, \|Y - Y''\|).</math>
This identity shows that the distance covariance is not the same as the covariance of distances, cov(||''X'' − ''X''′||, ||''Y'' − ''Y''′||). The latter can be zero even if ''X'' and ''Y'' are not independent.
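The equivalence between the centered-distance form and the form written with expectations of raw distances has an exact sample analogue, which can be checked numerically. The sketch below is an illustration under assumptions: expectations are replaced by averages over all index pairs (including equal indices), and both function names are hypothetical.

```python
import math


def dcov_sq_centered(xs, ys):
    """Average of the products of doubly centered distances."""
    n = len(xs)

    def centered(points):
        d = [[math.dist(p, q) for q in points] for p in points]
        row = [sum(r) / n for r in d]  # row means (equal to column means)
        grand = sum(row) / n
        return [[d[j][k] - row[j] - row[k] + grand for k in range(n)]
                for j in range(n)]

    a, b = centered(xs), centered(ys)
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n**2


def dcov_sq_moments(xs, ys):
    """Sample analogue of E[ab] + E[a] E[b] - 2 E[a b''] using raw distances."""
    n = len(xs)
    a = [[math.dist(p, q) for q in xs] for p in xs]
    b = [[math.dist(p, q) for q in ys] for p in ys]
    # E[|X-X'| |Y-Y'|]: both distances use the same index pair (j, k).
    joint = sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n**2
    # E[|X-X'|] E[|Y-Y'|]: product of the two grand means.
    product = (sum(map(sum, a)) / n**2) * (sum(map(sum, b)) / n**2)
    # E[|X-X'| |Y-Y''|]: the two distances share only the first index j.
    cross = sum(sum(a[j]) * sum(b[j]) for j in range(n)) / n**3
    return joint + product - 2.0 * cross


xs = [(0.0, 1.0), (2.0, -1.0), (1.5, 0.5), (-1.0, 3.0), (0.2, 0.2)]
ys = [(1.0,), (4.0,), (2.5,), (0.0,), (3.0,)]
gap = abs(dcov_sq_centered(xs, ys) - dcov_sq_moments(xs, ys))
```

The two functions agree up to rounding error, so `gap` is on the order of machine precision.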
Alternatively, the distance covariance can be defined as the weighted ''L''<sup>2</sup> norm of the distance between the joint characteristic function of the random variables and the product of their marginal characteristic functions [Theorem 7, (3.7)]:
: <math>\operatorname{dCov}^2(X, Y) = \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{\left|\varphi_{X,Y}(s, t) - \varphi_X(s)\,\varphi_Y(t)\right|^2}{|s|_p^{1+p}\, |t|_q^{1+q}} \, dt \, ds,</math>
where <math>\varphi_{X,Y}(s, t)</math>, <math>\varphi_X(s)</math>, and <math>\varphi_Y(t)</math> are the characteristic functions of (''X'', ''Y''), ''X'', and ''Y'', respectively, ''p'', ''q'' denote the Euclidean dimension of ''X'' and ''Y'', and thus of ''s'' and ''t'', and ''c''<sub>''p''</sub>, ''c''<sub>''q''</sub> are constants. The weight function <math>\left(c_p c_q |s|_p^{1+p} |t|_q^{1+q}\right)^{-1}</math> is chosen to produce a scale equivariant and rotation invariant measure that does not go to zero for dependent variables.
One interpretation of the characteristic function definition is that the variables ''e''<sup>''isX''</sup> and ''e''<sup>''itY''</sup> are cyclic representations of ''X'' and ''Y'' with different periods given by ''s'' and ''t'', and the expression in the numerator of the characteristic function definition of distance covariance is simply the classical covariance of ''e''<sup>''isX''</sup> and ''e''<sup>''itY''</sup>. The characteristic function definition clearly shows that dCov<sup>2</sup>(''X'', ''Y'') = 0 if and only if ''X'' and ''Y'' are independent.
Distance variance and distance standard deviation
The ''distance variance'' is a special case of distance covariance when the two variables are identical. The population value of distance variance is the square root of
: <math>\operatorname{dVar}^2(X) := \operatorname{E}[\|X - X'\|^2] + \operatorname{E}^2[\|X - X'\|] - 2\operatorname{E}[\|X - X'\|\,\|X - X''\|],</math>
where ''X'', ''X''′, and ''X''″ are independent and identically distributed random variables, ''E'' denotes the expected value, and <math>f^2(\cdot) = (f(\cdot))^2</math> for function <math>f(\cdot)</math>, e.g., <math>\operatorname{E}^2[\cdot] = (\operatorname{E}[\cdot])^2</math>.
The ''sample distance variance'' is the square root of
: <math>\operatorname{dVar}^2_n(X) := \operatorname{dCov}^2_n(X, X) = \frac{1}{n^2} \sum_{j,k} A_{j,k}^2,</math>
which is a relative of Corrado Gini's mean difference introduced in 1912 (but Gini did not work with centered distances).
The ''distance standard deviation'' is the square root of the ''distance variance''.
Distance correlation
The ''distance correlation'' of two random variables is obtained by dividing their ''distance covariance'' by the product of their ''distance standard deviations''. The distance correlation is the square root of
: <math>\operatorname{dCor}^2(X, Y) = \frac{\operatorname{dCov}^2(X, Y)}{\sqrt{\operatorname{dVar}^2(X)\,\operatorname{dVar}^2(Y)}},</math>
and the ''sample distance correlation'' is defined by substituting the sample distance covariance and distance variances for the population coefficients above.
For easy computation of sample distance correlation see the ''dcor'' function in the ''energy'' package for R.
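To make the contrast with Pearson's correlation concrete, the sketch below computes the sample distance correlation for scalar data and compares it with Pearson's ''r'' on a sample where ''Y'' is a deterministic but nonlinear function of ''X''. It is a minimal standard-library Python illustration; the function names are assumptions for this example and are unrelated to the R ''dcor'' function.

```python
import math


def _centered(values):
    """Doubly centered matrix of pairwise absolute differences (scalar data)."""
    n = len(values)
    d = [[abs(u - v) for v in values] for u in values]
    row = [sum(r) / n for r in d]      # row means (equal to column means)
    grand = sum(row) / n
    return [[d[j][k] - row[j] - row[k] + grand for k in range(n)] for j in range(n)]


def _avg_product(a, b):
    """Arithmetic average of the entrywise products of two n-by-n matrices."""
    n = len(a)
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n**2


def dcor(xs, ys):
    """Sample distance correlation: sqrt(dCov_n^2 / sqrt(dVar_n^2(X) dVar_n^2(Y)))."""
    a, b = _centered(xs), _centered(ys)
    denom = math.sqrt(_avg_product(a, a) * _avg_product(b, b))
    if denom == 0.0:
        return 0.0                     # at least one sample is constant
    return math.sqrt(max(_avg_product(a, b), 0.0) / denom)


def pearson(xs, ys):
    """Classical product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]               # Y is a function of X, but not a linear one
r = pearson(xs, ys)                    # zero: the linear association vanishes
d = dcor(xs, ys)                       # clearly positive: the dependence is detected
```

On this sample Pearson's ''r'' is zero by symmetry while the distance correlation is well above zero; for an exactly linear relationship such as ''Y'' = 2''X'' + 1 the function returns 1 up to rounding.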
Properties
Distance correlation
Distance covariance
This last property is the most important effect of working with centered distances.
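The statistic dCov<sup>2</sup><sub>''n''</sub>(''X'', ''Y'') is a biased estimator of dCov<sup>2</sup>(''X'', ''Y''). Under independence of ''X'' and ''Y''
: <math>\operatorname{E}[\operatorname{dCov}^2_n(X, Y)] = \frac{n-1}{n^2}\, \operatorname{E}[\|X - X'\|]\, \operatorname{E}[\|Y - Y'\|].</math>
An unbiased estimator of dCov<sup>2</sup>(''X'', ''Y'') is given by Székely and Rizzo.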
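The bias under independence can be verified exactly on a tiny example by enumerating every possible sample. The sketch below assumes that ''X'' and ''Y'' are independent and uniform on {0, 1} with ''n'' = 3, and compares the exact mean of the statistic with (''n'' − 1)/''n''<sup>2</sup> · E|''X'' − ''X''′| · E|''Y'' − ''Y''′|. That expression follows from expanding the statistic in raw distances and is treated here as a formula to check, not as a quoted result. All names are illustrative.

```python
import itertools


def dcov_sq(xs, ys):
    """Squared sample distance covariance for scalar samples."""
    n = len(xs)

    def centered(values):
        d = [[abs(u - v) for v in values] for u in values]
        row = [sum(r) / n for r in d]  # row means (equal to column means)
        grand = sum(row) / n
        return [[d[j][k] - row[j] - row[k] + grand for k in range(n)]
                for j in range(n)]

    a, b = centered(xs), centered(ys)
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n**2


n = 3
support = (0.0, 1.0)                   # assumed distribution: uniform on {0, 1}
samples = list(itertools.product(support, repeat=n))

# Exact expectation: every (X sample, Y sample) pair is equally likely
# because X and Y are independent and uniform on the support.
mean_stat = sum(dcov_sq(xs, ys) for xs in samples for ys in samples) / len(samples) ** 2

# E|X - X'| for two independent draws from the support (equals 1/2 here).
mean_dist = sum(abs(u - v) for u in support for v in support) / len(support) ** 2
predicted = (n - 1) / n**2 * mean_dist * mean_dist
```

Both `mean_stat` and `predicted` equal 1/18 up to rounding, although the population distance covariance is zero because the variables are independent. This is the bias that the unbiased estimator removes.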
Distance variance
Equality holds in (iv) if and only if one of the random variables ''X'' or ''Y'' is a constant.
Generalization
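Distance covariance can be generalized to include powers of Euclidean distance. Define
: <math>\operatorname{dCov}^2(X, Y; \alpha) := \operatorname{E}[\|X - X'\|^\alpha\, \|Y - Y'\|^\alpha] + \operatorname{E}[\|X - X'\|^\alpha]\, \operatorname{E}[\|Y - Y'\|^\alpha] - 2\operatorname{E}[\|X - X'\|^\alpha\, \|Y - Y''\|^\alpha].</math>
Then for every 0 < ''α'' < 2, ''X'' and ''Y'' are independent if and only if dCov<sup>2</sup>(''X'', ''Y''; ''α'') = 0. This characterization does not hold for exponent ''α'' = 2; in this case for bivariate (''X'', ''Y''), dCor(''X'', ''Y''; ''α'' = 2) is a deterministic function of the Pearson correlation. If ''a''<sub>''j'', ''k''</sub> and ''b''<sub>''j'', ''k''</sub> are ''α'' powers of the corresponding distances, 0 < ''α'' ≤ 2, then the ''α'' sample distance covariance can be defined as the nonnegative number for which
: <math>\operatorname{dCov}^2_n(X, Y; \alpha) := \frac{1}{n^2} \sum_{j,k} A_{j,k}\, B_{j,k}.</math>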
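The role of the exponent can be seen numerically. The sketch below raises the pairwise distances to a power before double centering; the parameter name `alpha` and the function names are assumptions for this illustration. For scalar data and exponent 2, the centered entries reduce to −2(''x''<sub>''j''</sub> − x̄)(''x''<sub>''k''</sub> − x̄), so the statistic equals four times the squared sample covariance and can no longer detect a dependence that is uncorrelated.

```python
def dcov_sq_alpha(xs, ys, alpha=1.0):
    """Squared sample distance covariance using distances raised to `alpha`."""
    n = len(xs)

    def centered(values):
        d = [[abs(u - v) ** alpha for v in values] for u in values]
        row = [sum(r) / n for r in d]  # row means (equal to column means)
        grand = sum(row) / n
        return [[d[j][k] - row[j] - row[k] + grand for k in range(n)]
                for j in range(n)]

    a, b = centered(xs), centered(ys)
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n**2


def cov_n(xs, ys):
    """Sample covariance with divisor n."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]               # dependent but uncorrelated

with_alpha_1 = dcov_sq_alpha(xs, ys, alpha=1.0)   # positive: dependence detected
with_alpha_2 = dcov_sq_alpha(xs, ys, alpha=2.0)   # zero up to rounding

us = [1.0, 2.0, 4.0, 7.0]
vs = [3.0, 1.0, 5.0, 2.0]
lhs = dcov_sq_alpha(us, vs, alpha=2.0)
rhs = 4.0 * cov_n(us, vs) ** 2         # agrees with lhs for scalar data
```

This is the sample counterpart of the statement that the exponent 2 case is determined by the Pearson correlation.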
One can extend dCov to metric-space-valued random variables ''X'' and ''Y'': If ''X'' has law ''μ'' in a metric space with metric ''d'', then define