In statistics, probability theory, and information theory, a statistical distance quantifies the distance between two statistical objects, which can be two random variables, or two probability distributions or samples, or the distance can be between an individual sample point and a population or a wider sample of points.
A distance between populations can be interpreted as measuring the distance between two probability distributions, and hence such distances are essentially measures of distances between probability measures. Where statistical distance measures relate to the differences between random variables, these may have statistical dependence,[Dodge, Y. (2003)—entry for distance] and hence these distances are not directly related to measures of distances between probability measures. Again, a measure of distance between random variables may relate to the extent of dependence between them, rather than to their individual values.
Statistical distance measures are not typically metrics, and they need not be symmetric. Some types of distance measures, which generalize ''squared'' distance, are referred to as (statistical) ''divergences''.
Terminology
Many terms are used to refer to various notions of distance; these are often confusingly similar, and may be used inconsistently between authors and over time, either loosely or with precise technical meaning. In addition to "distance", similar terms include deviance, deviation, discrepancy, discrimination, and divergence, as well as others such as contrast function and metric. Terms from information theory include cross entropy, relative entropy, discrimination information, and information gain.
Distances as metrics
Metrics
A metric on a set ''X'' is a function (called the ''distance function'' or simply distance) ''d'' : ''X'' × ''X'' → R+ (where R+ is the set of non-negative real numbers). For all ''x'', ''y'', ''z'' in ''X'', this function is required to satisfy the following conditions:
# ''d''(''x'', ''y'') ≥ 0 (''non-negativity'')
# ''d''(''x'', ''y'') = 0 if and only if ''x'' = ''y'' (''identity of indiscernibles''; note that conditions 1 and 2 together produce ''positive definiteness'')
# ''d''(''x'', ''y'') = ''d''(''y'', ''x'') (''symmetry'')
# ''d''(''x'', ''z'') ≤ ''d''(''x'', ''y'') + ''d''(''y'', ''z'') (''subadditivity'' / ''triangle inequality'')
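For instance, the total variation distance between discrete probability distributions satisfies all four conditions. The following is a minimal sketch, not from the source; the `total_variation` helper and the example distributions are illustrative:

```python
import itertools

def total_variation(p, q):
    """Total variation distance between two discrete distributions
    given as probability vectors over the same sample space."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Three example distributions on a three-point sample space.
dists = [
    (0.2, 0.5, 0.3),
    (0.1, 0.1, 0.8),
    (0.4, 0.4, 0.2),
]

# Spot-check the four metric conditions on these points.
for x, y, z in itertools.product(dists, repeat=3):
    assert total_variation(x, y) >= 0                      # 1: non-negativity
    assert (total_variation(x, y) == 0) == (x == y)        # 2: identity of indiscernibles
    assert total_variation(x, y) == total_variation(y, x)  # 3: symmetry
    assert total_variation(x, z) <= (                      # 4: triangle inequality
        total_variation(x, y) + total_variation(y, z))
```

Such spot checks do not prove the conditions hold in general, but they quickly expose candidate "distances" that fail one of them.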
Generalized metrics
Many statistical distances are not metrics, because they lack one or more properties of proper metrics. For example, pseudometrics violate property (2), identity of indiscernibles; quasimetrics violate property (3), symmetry; and semimetrics violate property (4), the triangle inequality. Statistical distances that satisfy (1) and (2) are referred to as divergences.
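The Kullback–Leibler divergence is the standard example of a divergence that is not a metric: it satisfies (1) and (2) but violates symmetry. A hedged sketch in Python (function name and example distributions are illustrative, not from the source):

```python
from math import log

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) between discrete
    distributions, assuming q is positive wherever p is."""
    return sum(a * log(a / b) for a, b in zip(p, q) if a > 0)

p = (0.9, 0.1)
q = (0.5, 0.5)

assert kl_divergence(p, p) == 0   # property (2): zero only for identical inputs
assert kl_divergence(p, q) > 0    # property (1): non-negative

# Not symmetric, so it is a divergence rather than a metric:
assert kl_divergence(p, q) != kl_divergence(q, p)
```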
Examples
Metrics
* Total variation distance (sometimes just called "the" statistical distance)
* Hellinger distance
* Lévy–Prokhorov metric
* Wasserstein metric: also known as the Kantorovich metric, or earth mover's distance
* Mahalanobis distance
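Two of the metrics above are straightforward to compute for discrete distributions. The sketch below (helper names assumed, not from the source) computes the total variation and Hellinger distances for a pair of probability vectors and checks the standard inequality H² ≤ TV ≤ √2·H relating them:

```python
from math import sqrt

def total_variation(p, q):
    # Half the L1 distance between the probability vectors.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def hellinger(p, q):
    # 1/sqrt(2) times the L2 distance between the square-root vectors.
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))) / sqrt(2)

p = (0.36, 0.48, 0.16)
q = (0.30, 0.40, 0.30)

tv = total_variation(p, q)
h = hellinger(p, q)

# The two metrics bound each other: H^2 <= TV <= sqrt(2) * H.
assert h ** 2 <= tv <= sqrt(2) * h
```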
Divergences
* Kullback–Leibler divergence
* Rényi's divergence
* Jensen–Shannon divergence
* Bhattacharyya distance (despite its name it is not a distance, as it violates the triangle inequality)
* f-divergence: generalizes several distances and divergences
* Discriminability index, specifically the Bayes discriminability index, a positive-definite symmetric measure of the overlap of two distributions
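The f-divergence entry above can be made concrete: choosing f(t) = t log t recovers the Kullback–Leibler divergence, while f(t) = |t − 1|/2 recovers the total variation distance. A sketch in Python (names and example distributions are illustrative):

```python
from math import log

def f_divergence(f, p, q):
    """Generic f-divergence D_f(p || q) = sum_i q_i * f(p_i / q_i),
    assuming strictly positive discrete distributions."""
    return sum(b * f(a / b) for a, b in zip(p, q))

p = (0.2, 0.5, 0.3)
q = (0.4, 0.4, 0.2)

# f(t) = t log t gives the Kullback-Leibler divergence.
kl = f_divergence(lambda t: t * log(t), p, q)

# f(t) = |t - 1| / 2 gives the total variation distance.
tv = f_divergence(lambda t: abs(t - 1) / 2, p, q)

assert kl > 0
assert abs(tv - 0.5 * sum(abs(a - b) for a, b in zip(p, q))) < 1e-12
```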
See also
* Probabilistic metric space
* Similarity measure
Notes
External links
Distance and Similarity Measures (Wolfram Alpha)
References
* Dodge, Y. (2003) ''Oxford Dictionary of Statistical Terms'', OUP. ISBN 0-19-920613-9