In statistics, the earth mover's distance (EMD) is a measure of the distance between two

probability distribution In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomeno ...

s over a region ''D''. In mathematics, this is known as the

Wasserstein metric In mathematics, the Wasserstein distance or Kantorovich–Rubinstein metric is a distance function defined between probability distributions on a given metric space M. It is named after Leonid Vaseršteĭn. Intuitively, if each distribution is ...

. Informally, if the distributions are interpreted as two different ways of piling up a certain amount of

earth Earth is the third planet from the Sun and the only astronomical object known to harbor life. While large volumes of water can be found throughout the Solar System, only Earth sustains liquid surface water. About 71% of Earth's surf ...

(dirt) over the region ''D'', the EMD is the minimum cost of turning one pile into the other; where the cost is assumed to be the amount of dirt moved times the

distance Distance is a numerical or occasionally qualitative measurement of how far apart objects or points are. In physics or everyday usage, distance may refer to a physical length or an estimation based on other criteria (e.g. "two counties over"). ...

by which it is moved. The above definition is valid only if the two distributions have the same integral (informally, if the two piles have the same amount of dirt), as in normalized

histogram A histogram is an approximate representation of the distribution of numerical data. The term was first introduced by Karl Pearson. To construct a histogram, the first step is to " bin" (or " bucket") the range of values—that is, divide the ent ...

s or

probability density function In probability theory, a probability density function (PDF), or density of a continuous random variable, is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) c ...

s. In that case, the EMD is equivalent to the 1st Mallows distance or 1st Wasserstein distance between the two distributions.

Theory

Assume that we have a set of points in

\mathbb^d

(dimension

d

). Instead of assigning one distribution to the set of points, we can cluster them and represent the point set in terms of the clusters. Thus, each cluster is a single point in

\mathbb^d

and the weight of the cluster is decided by the fraction of the distribution present in that cluster. This representation of a distribution by a set of clusters is called the ''signature''. Two signatures can have different sizes, for example, a bimodal distribution has shorter signature (2 clusters) than complex ones. One cluster representation (mean or mode in

\mathbb^d

) can be thought of as a single feature in a signature. The distance between each of the features is called as ''ground distance''. The Earth Mover's Distance can be formulated and solved as a

transportation problem In mathematics and economics, transportation theory or transport theory is a name given to the study of optimal transportation and allocation of resources. The problem was formalized by the French mathematician Gaspard Monge in 1781.G. Monge. ' ...

. Suppose that several suppliers, each with a given amount of goods, are required to supply several consumers, each with a given limited capacity. For each supplier-consumer pair, the cost of transporting a single unit of goods is given. The transportation problem is then to find a least-expensive flow of goods from the suppliers to the consumers that satisfies the consumers' demand. Similarly, here the problem is transforming one signature(

P

) to another(

Q

) with minimum work done. Assume that signature

P

has

m

clusters with

P=\

, where

p_i

is the cluster representative and

w_ > 0

is the weight of the cluster. Similarly another signature

Q=\

has

n

clusters. Let

D=_

be the ground distance between clusters

p_i

and

q_j

. We want to find a flow

F=_

, with

f_

the flow between

p_i

and

q_j

, that minimizes the overall cost. :

\min\limits_F

subject to the constraints: :

f_\ge0,    1\le i \le m, 1\le j \le n

\sum_^n  \le w_, 1 \le i \le m

\sum_^m  \le w_, 1 \le j \le n

\sum_^m\sum_^n f_ = \sum_^m w_ = \sum_^n w_

The optimal flow

F

is found by solving this linear optimization problem. The earth mover's distance is defined as the work normalized by the total flow: :

\text(P,Q) = \frac

Equivalent Form

Equivalently, we can write: :

\text(P,Q) = \inf\limits_ \mathbb_x - y\,,

where

\Pi(P, Q)

is the set of all joint distributions whose marginals are

P

and

Q

Kantorovich-Rubinstein duality

By Kantorovich-Rubinstein duality, the following is an equivalent expression: :

\text(P,Q) = \sup\limits_ \mathbb_

, f(x) The comma is a punctuation mark that appears in several variants in different languages. It has the same shape as an apostrophe or single closing quotation mark () in many typefaces, but it differs from them in being placed on the baseline o ...

, - \mathbb_

, that is, to supremum over all 1-Lipschitz continuous functions, i.e.

\,  \nabla f(x)\,  \leq 1 \quad \forall x

Extensions

Some applications may require the comparison of distributions with different total masses. One approach is to allow for a '' partial match'', where dirt from the most massive distribution is rearranged to make the least massive, and any leftover "dirt" is discarded at no cost. Under this approach, the EMD is no longer a true distance between distributions. Another approach is to allow for mass to be created or destroyed, on a global and/or local level, as an alternative to transportation, but with a cost penalty. In that case one must specify a real parameter σ, the ratio between the cost of creating or destroying one unit of "dirt", and the cost of transporting it by a unit distance. This is equivalent to minimizing the sum of the earth moving cost plus σ times the ''L''¹ distance between the rearranged pile and the second distribution. Notationally, if

\pi:P\to Q

is a

partial function In mathematics, a partial function from a set to a set is a function from a subset of (possibly itself) to . The subset , that is, the domain of viewed as a function, is called the domain of definition of . If equals , that is, if is ...

which is a bijection on subsets

P'\subset P

and

Q'\subset Q

, then one is interested in the distance function :

, P-Q, _\pi = , P\setminus P',  + , Q\setminus Q',

where

P\setminus P'

denotes

set minus In set theory, the complement of a set , often denoted by (or ), is the set of elements not in . When all sets in the universe, i.e. all sets under consideration, are considered to be members of a given set , the absolute complement of is the ...

. Here,

P'

would be the portion of the earth that was moved; thus

P\setminus P'

would be the portion not moved, and

, P\setminus P',

the size of the pile not moved. By symmetry, one contemplates

Q'\subset Q

as the pile at the destination that 'got there' from ''P'', as compared to the total ''Q'' that we ''want to have there''. Formally, this distance indicates how much an

injective In mathematics, an injective function (also known as injection, or one-to-one function) is a function that maps distinct elements of its domain to distinct elements; that is, implies . (Equivalently, implies in the equivalent contraposi ...

correspondence differs from an

isomorphism In mathematics, an isomorphism is a structure-preserving mapping between two structures of the same type that can be reversed by an inverse mapping. Two mathematical structures are isomorphic if an isomorphism exists between them. The word i ...

. The EMD can be extended naturally to the case where more than two distributions are compared. In this case, the "distance" between the many distributions is defined as the optimal value of a linear program. This generalized EMD may be computed exactly using a greedy algorithm, and the resulting functional has been shown to be Minkowski additive and convex monotone.

Computing the EMD

The EMD can be computed by solving an instance of

, using any algorithm for

minimum-cost flow problem The minimum-cost flow problem (MCFP) is an optimization and decision problem to find the cheapest possible way of sending a certain amount of flow through a flow network. A typical application of this problem involves finding the best delivery ...

, e.g. the network simplex algorithm. The

Hungarian algorithm The Hungarian method is a combinatorial optimization algorithm that solves the assignment problem in polynomial time and which anticipated later primal–dual methods. It was developed and published in 1955 by Harold Kuhn, who gave the name "Hu ...

can be used to get the solution if the domain ''D'' is the set . If the domain is integral, it can be translated for the same algorithm by representing integral bins as multiple binary bins. As a special case, if ''D'' is a one-dimensional array of "bins" of size ''n'', the EMD can be efficiently computed by scanning the array and keeping track of how much dirt needs to be transported between consecutive bins. Here the bins are zero-indexed: :

\begin
\text_0 &= 0 \\
\text_ &= P_i + \text_i - Q_i \\
\text &= \sum_^, \text_i, 
\end

PyTorch Implementation

import torch def EMD(p, q): x = p - q y = torch.cumsum(x, dim=0) return y.abs().sum()

EMD-based similarity analysis

EMD-based similarity analysis (EMDSA) is an important and effective tool in many

multimedia information retrieval Multimedia information retrieval (MMIR or MIR) is a research discipline of computer science that aims at extracting semantic information from multimedia data sources.H Eidenberger. ''Fundamental Media Understanding'', atpress, 2011, p. 1. Data sour ...

and

pattern recognition Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphic ...

applications. However, the computational cost of EMD is super-cubic to the number of the "bins" given an arbitrary "D". Efficient and scalable EMD computation techniques for large scale data have been investigated using

MapReduce MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a ''map'' procedure, which performs filteri ...

, as well as

bulk synchronous parallel The bulk synchronous parallel (BSP) abstract computer is a bridging model for designing parallel algorithms. It is similar to the parallel random access machine (PRAM) model, but unlike PRAM, BSP does not take communication and synchronization fo ...

and resilient distributed dataset.

Applications

An early application of the EMD in computer science was to compare two

grayscale In digital photography, computer-generated imagery, and colorimetry, a grayscale image is one in which the value of each pixel is a single sample representing only an ''amount'' of light; that is, it carries only intensity information. Graysc ...

images that may differ due to

dithering Dither is an intentionally applied form of noise used to randomize quantization error, preventing large-scale patterns such as color banding in images. Dither is routinely used in processing of both digital audio and video data, and is ofte ...

, blurring, or local deformations. In this case, the region is the image's domain, and the total amount of light (or ink) is the "dirt" to be rearranged. The EMD is widely used in content-based image retrieval to compute distances between the

color histogram In image processing and photography, a color histogram is a representation of the distribution of colors in an image. For digital images, a color histogram represents the number of pixels that have colors in each of a fixed list of color ranges, ...

s of two

digital image A digital image is an image composed of picture elements, also known as ''pixels'', each with '' finite'', '' discrete quantities'' of numeric representation for its intensity or gray level that is an output from its two-dimensional functions f ...

s. In this case, the region is the RGB color cube, and each image pixel is a parcel of "dirt". The same technique can be used for any other quantitative

pixel In digital imaging, a pixel (abbreviated px), pel, or picture element is the smallest addressable element in a raster image, or the smallest point in an all points addressable display device. In most digital display devices, pixels are the s ...

attribute, such as

luminance Luminance is a photometric measure of the luminous intensity per unit area of light travelling in a given direction. It describes the amount of light that passes through, is emitted from, or is reflected from a particular area, and falls withi ...

gradient In vector calculus, the gradient of a scalar-valued differentiable function of several variables is the vector field (or vector-valued function) \nabla f whose value at a point p is the "direction and rate of fastest increase". If the gr ...

, apparent motion in a

video frame In filmmaking, video production, animation, and related fields, a frame is one of the many '' still images'' which compose the complete '' moving picture''. The term is derived from the historical development of film stock, in which the sequent ...

, etc.. More generally, the EMD is used in

to compare generic summaries or surrogates of data records called

signatures A signature (; from la, signare, "to sign") is a handwritten (and often stylized) depiction of someone's name, nickname, or even a simple "X" or other mark that a person writes on documents as a proof of identity and intent. The writer of a ...

. A typical signature consists of list of pairs ((''x''₁,''m''₁), ... (''x''_''n'',''m''_''n'')), where each ''x''_''i'' is a certain "feature" (e.g., color in an image, letter in a text, etc.), and ''m''_''i'' is "mass" (how many times that feature occurs in the record). Alternatively, ''x''_''i'' may be the

centroid In mathematics and physics, the centroid, also known as geometric center or center of figure, of a plane figure or solid figure is the arithmetic mean position of all the points in the surface of the figure. The same definition extends to any ...

of a data cluster, and ''m''_''i'' the number of entities in that cluster. To compare two such signatures with the EMD, one must define a distance between features, which is interpreted as the cost of turning a unit mass of one feature into a unit mass of the other. The EMD between two signatures is then the minimum cost of turning one of them into the other. EMD analysis has been used for quantitating multivariate changes in

biomarkers In biomedical contexts, a biomarker, or biological marker, is a measurable indicator of some biological state or condition. Biomarkers are often measured and evaluated using blood, urine, or soft tissues to examine normal biological processes, ...

measured by

flow cytometry Flow cytometry (FC) is a technique used to detect and measure physical and chemical characteristics of a population of cells or particles. In this process, a sample containing cells or particles is suspended in a fluid and injected into the fl ...

, with potential applications to other technologies that report distributions of measurements.

History

The concept was first introduced by

Gaspard Monge Gaspard Monge, Comte de Péluse (9 May 1746 – 28 July 1818) was a French mathematician, commonly presented as the inventor of descriptive geometry, (the mathematical basis of) technical drawing, and the father of differential geometry. Duri ...

in 1781, in the context of transportation theory. The use of the EMD as a distance measure for monochromatic images was described in 1989 by S. Peleg, M. Werman and H. Rom. The name "earth movers' distance" was proposed by J. Stolfi in 1994, and was used in print in 1998 by Y. Rubner, C. Tomasi and L. G. Guibas.

References

{{Reflist

External links

C code for the Earth Mover's Distance
(archive
here

Python implementation with referencesPython2 wrapper for the C implementation of the Earth Mover's DistanceC++ and Matlab and Java wrappers code for the Earth Mover's Distance, especially efficient for thresholded ground distancesJava implementation of a generic generator for evaluating large-scale Earth Mover's Distance based similarity analysis
Statistical distance