In statistics, the Bhattacharyya distance measures the similarity of two probability distributions. It is closely related to the Bhattacharyya coefficient, which is a measure of the amount of overlap between two statistical samples or populations. Despite being called a "distance", it is not a metric, since it does not obey the triangle inequality.


Definition

For probability distributions P and Q on the same domain \mathcal{X}, the Bhattacharyya distance is defined as

:D_B(P,Q) = -\ln \left( BC(P,Q) \right)

where

:BC(P,Q) = \sum_{x \in \mathcal{X}} \sqrt{P(x) Q(x)}

is the Bhattacharyya coefficient for discrete probability distributions.

For continuous probability distributions, with P(dx) = p(x)\,dx and Q(dx) = q(x)\,dx where p(x) and q(x) are the probability density functions, the Bhattacharyya coefficient is defined as

:BC(P,Q) = \int_{\mathcal{X}} \sqrt{p(x) q(x)}\, dx.

More generally, given two probability measures P, Q on a measurable space (\mathcal X, \mathcal B), let \lambda be a (sigma-finite) measure such that P and Q are absolutely continuous with respect to \lambda, i.e. such that P(dx) = p(x)\lambda(dx) and Q(dx) = q(x)\lambda(dx) for probability density functions p, q with respect to \lambda defined \lambda-almost everywhere. Such a measure, even such a probability measure, always exists, e.g. \lambda = \tfrac12(P + Q). Then define the Bhattacharyya measure on (\mathcal X, \mathcal B) by

:bc(dx \mid P,Q) = \sqrt{p(x)q(x)}\, \lambda(dx).

It does not depend on the measure \lambda: if we choose a measure \mu such that \lambda and another measure choice \lambda' are absolutely continuous with respect to \mu, i.e. \lambda(dx) = l(x)\mu(dx) and \lambda'(dx) = l'(x)\mu(dx), then

:P(dx) = p(x)\lambda(dx) = p'(x)\lambda'(dx) = p(x)l(x)\,\mu(dx) = p'(x)l'(x)\,\mu(dx),

and similarly for Q. We then have

:bc(dx \mid P,Q) = \sqrt{p(x)q(x)}\, \lambda(dx) = \sqrt{p(x)q(x)}\, l(x)\mu(dx) = \sqrt{p(x)l(x)\,q(x)l(x)}\, \mu(dx) = \sqrt{p'(x)l'(x)\,q'(x)l'(x)}\, \mu(dx) = \sqrt{p'(x)q'(x)}\, \lambda'(dx).

We finally define the Bhattacharyya coefficient

:BC(P,Q) = \int_{\mathcal X} bc(dx \mid P,Q) = \int_{\mathcal X} \sqrt{p(x)q(x)}\, \lambda(dx).

By the above, the quantity BC(P,Q) does not depend on \lambda, and by the Cauchy–Schwarz inequality 0 \le BC(P,Q) \le 1. In particular, if P is absolutely continuous with respect to Q with Radon–Nikodym derivative p(x) = \frac{dP}{dQ}(x), then

:BC(P,Q) = \int_{\mathcal X} \sqrt{p(x)}\, Q(dx) = \int_{\mathcal X} \sqrt{\frac{dP}{dQ}}\, Q(dx) = E_Q\left[\sqrt{\frac{dP}{dQ}}\right].
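For discrete distributions, the definition above translates directly into code. A minimal sketch in plain Python (the function name is our own, not from any library):

```python
import math

def bhattacharyya(p, q):
    """Bhattacharyya coefficient BC and distance D_B for two discrete
    distributions given as equal-length probability vectors."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # D_B = -ln(BC); disjoint supports give BC = 0, i.e. infinite distance.
    db = -math.log(bc) if bc > 0 else math.inf
    return bc, db

# Identical distributions overlap completely: BC = 1, D_B = 0.
bc, db = bhattacharyya([0.5, 0.5], [0.5, 0.5])
```

Note that the coefficient is symmetric in p and q, which is one way to see that the distance is symmetric as well.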


Properties

0 \le BC \le 1 and 0 \le D_B \le \infty. D_B does not obey the triangle inequality, though the Hellinger distance \sqrt{1 - BC(P,Q)} does.

Let p \sim \mathcal{N}(\mu_p, \sigma_p^2), q \sim \mathcal{N}(\mu_q, \sigma_q^2), where \mathcal{N}(\mu, \sigma^2) is the normal distribution with mean \mu and variance \sigma^2; then

:D_B(p,q) = \frac{1}{4} \frac{(\mu_p - \mu_q)^2}{\sigma_p^2 + \sigma_q^2} + \frac{1}{2} \ln\left(\frac{\sigma_p^2 + \sigma_q^2}{2\sigma_p\sigma_q}\right).

And in general, given two multivariate normal distributions p_i = \mathcal{N}(\boldsymbol\mu_i,\, \boldsymbol\Sigma_i),

:D_B(p_1, p_2) = \frac{1}{8}(\boldsymbol\mu_1 - \boldsymbol\mu_2)^T \boldsymbol\Sigma^{-1}(\boldsymbol\mu_1 - \boldsymbol\mu_2) + \frac{1}{2}\ln\left(\frac{\det\boldsymbol\Sigma}{\sqrt{\det\boldsymbol\Sigma_1 \det\boldsymbol\Sigma_2}}\right),

where \boldsymbol\Sigma = \frac{\boldsymbol\Sigma_1 + \boldsymbol\Sigma_2}{2}. Note that the first term is a squared Mahalanobis distance.
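The univariate normal formula above can be sketched as a small function (an illustration of the closed form, not a library API):

```python
import math

def bhattacharyya_normal(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form Bhattacharyya distance between N(mu_p, sigma_p^2)
    and N(mu_q, sigma_q^2)."""
    var_sum = sigma_p**2 + sigma_q**2
    # First term: mean separation; second term: variance mismatch.
    return (0.25 * (mu_p - mu_q)**2 / var_sum
            + 0.5 * math.log(var_sum / (2.0 * sigma_p * sigma_q)))

# Identical Gaussians are at distance zero, and the distance is symmetric.
d_same = bhattacharyya_normal(0.0, 1.0, 0.0, 1.0)
d_ab = bhattacharyya_normal(0.0, 1.0, 2.0, 3.0)
d_ba = bhattacharyya_normal(2.0, 3.0, 0.0, 1.0)
```

Both terms vanish exactly when the two distributions coincide: the first when the means agree, the second when the variances agree.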


Applications

The Bhattacharyya coefficient quantifies the "closeness" of two random statistical samples. Given two sequences of samples from distributions P and Q, bin them into n buckets, and let the frequency of samples from P in bucket i be p_i, and similarly for q_i; then the sample Bhattacharyya coefficient is

:BC(\mathbf{p},\mathbf{q}) = \sum_{i=1}^n \sqrt{p_i q_i},

which is an estimator of BC(P, Q). The quality of estimation depends on the choice of buckets: too few buckets would overestimate BC(P, Q), while too many would underestimate it.

A common task in classification is estimating the separability of classes. Up to a multiplicative factor, the squared Mahalanobis distance is a special case of the Bhattacharyya distance when the two classes are normally distributed with the same variances. When two classes have similar means but significantly different variances, the Mahalanobis distance would be close to zero, while the Bhattacharyya distance would not be.

The Bhattacharyya coefficient is used in the construction of polar codes. The Bhattacharyya distance is used in feature extraction and selection,Euisun Choi, Chulhee Lee, "Feature extraction based on the Bhattacharyya distance", ''Pattern Recognition'', Volume 36, Issue 8, August 2003, pp. 1703–1709 image processing,François Goudail, Philippe Réfrégier, Guillaume Delyon, "Bhattacharyya distance as a contrast parameter for statistical processing of noisy optical images", ''JOSA A'', Vol. 21, Issue 7, pp. 1231–1240 (2004) speaker recognition,Chang Huai You, "An SVM Kernel With GMM-Supervector Based on the Bhattacharyya Distance for Speaker Recognition", ''Signal Processing Letters'', IEEE, Vol. 16, Issue 1, pp. 49–52 and phone clustering.Mak, B., "Phone clustering using the Bhattacharyya distance", ''Spoken Language'', 1996. ICSLP 96. Proceedings, Fourth International Conference on, Vol. 4, pp. 2005–2008, 3–6 Oct 1996 A "Bhattacharyya space" has been proposed as a feature selection technique that can be applied to texture segmentation.Reyes-Aldasoro, C.C., and A. Bhalerao, "The Bhattacharyya space for feature selection and its application to texture segmentation", ''Pattern Recognition'', Vol. 39, Issue 5, May 2006, pp. 812–826
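The binned estimator described above can be sketched as follows (the bucket count and sample range here are illustrative choices, not prescribed by the source):

```python
import math

def bc_from_samples(xs, ys, n_bins, lo, hi):
    """Estimate the Bhattacharyya coefficient from two samples by
    binning each into n_bins equal-width buckets over [lo, hi)."""
    def freqs(samples):
        counts = [0] * n_bins
        width = (hi - lo) / n_bins
        for x in samples:
            # Clamp the top edge so x == hi falls in the last bucket.
            i = min(int((x - lo) / width), n_bins - 1)
            counts[i] += 1
        return [c / len(samples) for c in counts]
    p, q = freqs(xs), freqs(ys)
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

# Two copies of the same sample give BC = 1; disjoint samples give BC = 0.
same = bc_from_samples([0.1, 0.2, 0.8], [0.1, 0.2, 0.8], 4, 0.0, 1.0)
disjoint = bc_from_samples([0.1, 0.2], [0.8, 0.9], 4, 0.0, 1.0)
```

The bucket-count trade-off noted above is visible here: with a single bucket every pair of samples has BC = 1, while with very many buckets most products p_i q_i are zero and the estimate collapses.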


History

Both the Bhattacharyya distance and the Bhattacharyya coefficient are named after Anil Kumar Bhattacharyya, a statistician who worked in the 1930s at the Indian Statistical Institute. He developed the method to measure the distance between two non-normal distributions and illustrated this with the classical multinomial populations as well as probability distributions that are absolutely continuous with respect to the Lebesgue measure. The latter work appeared partly in 1943 in the Bulletin of the Calcutta Mathematical Society, while the former part, despite being submitted for publication in 1941, appeared almost five years later in Sankhya.


See also

* Bhattacharyya angle
* Kullback–Leibler divergence
* Hellinger distance
* Mahalanobis distance
* Chernoff bound
* Rényi entropy
* ''f''-divergence
* Fidelity of quantum states


References

* For a short list of properties, see: http://www.mtm.ufsc.br/~taneja/book/node20.html


External links

* Bhattacharyya's distance measure as a precursor of genetic distance measures, ''Journal of Biosciences'', 2004
* Statistical Intuition of Bhattacharyya's distance