In statistics, probability theory and information theory, pointwise mutual information (PMI), or point mutual information, is a measure of association. It compares the probability of two events occurring together to what this probability would be if the events were independent.

Dan Jurafsky and James H. Martin: ''Speech and Language Processing'' (3rd ed. draft), December 29, 2021, chapter 6.
PMI (especially in its positive pointwise mutual information variant) has been described as "one of the most important concepts in NLP", where it "draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in corpus than we would have expected them to appear by chance." The concept was introduced in 1961 by
Robert Fano under the name of "mutual information", but today that term is instead used for a related measure of dependence between random variables: the mutual information (MI) of two discrete random variables refers to the average PMI over all possible events.


Definition

The PMI of a pair of outcomes ''x'' and ''y'' belonging to discrete random variables ''X'' and ''Y'' quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence. Mathematically:

: \operatorname{pmi}(x;y) \equiv \log_2\frac{p(x,y)}{p(x)\,p(y)} = \log_2\frac{p(x\mid y)}{p(x)} = \log_2\frac{p(y\mid x)}{p(y)}

(with the latter two expressions being equal to the first by Bayes' theorem). The mutual information (MI) of the random variables ''X'' and ''Y'' is the expected value of the PMI (over all possible outcomes).

The measure is symmetric (\operatorname{pmi}(x;y)=\operatorname{pmi}(y;x)). It can take positive or negative values, but is zero if ''X'' and ''Y'' are independent. Note that even though PMI may be negative or positive, its expected value over all joint events (the MI) is non-negative. PMI is maximal when ''X'' and ''Y'' are perfectly associated (i.e. p(x\mid y)=1 or p(y\mid x)=1), yielding the following bounds:

: -\infty \leq \operatorname{pmi}(x;y) \leq \min\left[-\log p(x), -\log p(y)\right].

Finally, \operatorname{pmi}(x;y) will increase if p(x\mid y) is fixed but p(x) decreases.

Here is an example to illustrate. Consider two binary random variables with the joint distribution

: p(x{=}0,y{=}0)=0.1, \quad p(x{=}0,y{=}1)=0.7, \quad p(x{=}1,y{=}0)=0.15, \quad p(x{=}1,y{=}1)=0.05.

Marginalizing gives the individual distributions

: p(x{=}0)=0.8, \quad p(x{=}1)=0.2, \quad p(y{=}0)=0.25, \quad p(y{=}1)=0.75.

With this example, we can compute four values for \operatorname{pmi}(x;y). Using base-2 logarithms:

: \operatorname{pmi}(x{=}0;y{=}0) = \log_2\frac{0.1}{0.8 \times 0.25} = -1
: \operatorname{pmi}(x{=}0;y{=}1) = \log_2\frac{0.7}{0.8 \times 0.75} \approx 0.222392
: \operatorname{pmi}(x{=}1;y{=}0) = \log_2\frac{0.15}{0.2 \times 0.25} \approx 1.584963
: \operatorname{pmi}(x{=}1;y{=}1) = \log_2\frac{0.05}{0.2 \times 0.75} \approx -1.584963

(For reference, the mutual information \operatorname{I}(X;Y) would then be 0.2141709.)
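For illustration, the following Python sketch reproduces these values from the joint distribution above (the variable names are only illustrative):

<syntaxhighlight lang="python">
import math

# Joint distribution p(x, y) from the example above.
p_xy = {
    (0, 0): 0.10, (0, 1): 0.70,
    (1, 0): 0.15, (1, 1): 0.05,
}

# Marginal distributions obtained by summing over the other variable.
p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in p_xy.items() if yi == y) for y in (0, 1)}

def pmi(x, y):
    """Pointwise mutual information in bits (base-2 logarithm)."""
    return math.log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

for (x, y) in p_xy:
    print(f"pmi(x={x}; y={y}) = {pmi(x, y):+.6f}")

# Mutual information: expectation of the PMI under the joint distribution.
mi = sum(p * pmi(x, y) for (x, y), p in p_xy.items())
print(f"I(X;Y) = {mi:.7f}")  # approx. 0.2141709
</syntaxhighlight>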


Similarities to mutual information

Pointwise mutual information has many of the same relationships as mutual information. In particular,

: \begin{align} \operatorname{pmi}(x;y) &= h(x) + h(y) - h(x,y) \\ &= h(x) - h(x \mid y) \\ &= h(y) - h(y \mid x) \end{align}

where h(x) is the self-information -\log_2 p(x), and h(x,y) and h(x \mid y) are the corresponding joint and conditional self-informations -\log_2 p(x,y) and -\log_2 p(x \mid y).
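These identities can be verified numerically; the following Python sketch checks the first and second forms on one cell of the example distribution above (the specific probabilities are only illustrative):

<syntaxhighlight lang="python">
import math

def h(p):
    """Self-information in bits of an event with probability p."""
    return -math.log2(p)

# Check pmi(x;y) = h(x) + h(y) - h(x,y) on the cell x = 0, y = 1 of the
# example above: p(x) = 0.8, p(y) = 0.75, p(x,y) = 0.7.
p_x, p_y, p_xy = 0.8, 0.75, 0.7
pmi = math.log2(p_xy / (p_x * p_y))
assert abs(pmi - (h(p_x) + h(p_y) - h(p_xy))) < 1e-12

# The conditional form h(x) - h(x|y) follows the same way, with
# p(x|y) = p(x,y) / p(y).
assert abs(pmi - (h(p_x) - h(p_xy / p_y))) < 1e-12
</syntaxhighlight>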


Variants

Several variations of PMI have been proposed, in particular to address what has been described as its "two main limitations":François Role, Mohamed Nadif. Handling the Impact of Low Frequency Events on Co-occurrence-based Measures of Word Similarity: A Case Study of Pointwise Mutual Information. Proceedings of KDIR 2011: International Conference on Knowledge Discovery and Information Retrieval, Paris, October 26–29, 2011.
# PMI can take both positive and negative values and has no fixed bounds, which makes it harder to interpret.
# PMI has "a well-known tendency to give higher scores to low-frequency events", but in applications such as measuring word similarity, it is preferable to have "a higher score for pairs of words whose relatedness is supported by more evidence."


Positive PMI

The positive pointwise mutual information (PPMI) measure is defined by setting negative values of PMI to zero:

: \operatorname{ppmi}(x;y) \equiv \max\left(\log_2\frac{p(x,y)}{p(x)\,p(y)},\ 0\right)

This definition is motivated by the observation that "negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous" and also by a concern that "it's not clear whether it's even possible to evaluate such scores of 'unrelatedness' with human judgment". It also avoids having to deal with -\infty values for events that never occur together (p(x,y)=0), by setting PPMI for these to 0.
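In practice, PPMI is typically computed over a word–context co-occurrence count matrix. The following Python sketch shows one possible implementation (the matrix layout and example counts are hypothetical):

<syntaxhighlight lang="python">
import numpy as np

def ppmi_matrix(counts):
    """Positive PMI from a word-context co-occurrence count matrix.

    counts[i, j] is how often target word i appears with context word j.
    Zero-count cells (and any negative PMI) are mapped to 0, as in PPMI.
    """
    total = counts.sum()
    p_xy = counts / total
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal over contexts
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal over targets
    expected = p_x * p_y
    ratio = np.divide(p_xy, expected, out=np.zeros_like(p_xy),
                      where=expected > 0)
    pmi = np.log2(ratio, out=np.full_like(ratio, -np.inf), where=ratio > 0)
    return np.maximum(pmi, 0.0)

# Example: a tiny hypothetical 3-word x 3-context count matrix.
counts = np.array([[10.0, 0.0, 2.0],
                   [3.0,  5.0, 0.0],
                   [0.0,  1.0, 8.0]])
print(ppmi_matrix(counts))
</syntaxhighlight>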


Normalized pointwise mutual information (npmi)

Pointwise mutual information can be normalized between [-1, +1], resulting in -1 (in the limit) for never occurring together, 0 for independence, and +1 for complete co-occurrence:

: \operatorname{npmi}(x;y) = \frac{\operatorname{pmi}(x;y)}{h(x,y)}

where h(x,y) is the joint self-information -\log_2 p(x,y).
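A Python sketch of this normalization, reusing example probabilities from the Definition section (the function name is illustrative):

<syntaxhighlight lang="python">
import math

def npmi(p_xy, p_x, p_y):
    """Normalized PMI: pmi(x;y) divided by the joint self-information
    h(x,y) = -log2 p(x,y). Returns a value in [-1, 1]."""
    if p_xy == 0:
        return -1.0  # limiting value for events that never co-occur
    # (the degenerate case p(x,y) = 1 is not handled in this sketch)
    return math.log2(p_xy / (p_x * p_y)) / -math.log2(p_xy)

# Example values taken from the joint distribution in the Definition section:
print(npmi(0.7, 0.8, 0.75))    # > 0: the pair co-occurs more than chance
print(npmi(0.05, 0.2, 0.75))   # < 0: the pair co-occurs less than chance
</syntaxhighlight>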


PMI^k family

The PMI^k measure (for k = 2, 3, etc.), which was introduced by Béatrice Daille around 1994, and as of 2011 was described as being "among the most widely used variants", is defined as

: \operatorname{pmi}^k(x;y) \equiv \log_2\frac{p(x,y)^k}{p(x)\,p(y)} = \operatorname{pmi}(x;y) - \left(-(k-1)\log_2 p(x,y)\right)

In particular, \operatorname{pmi}^1(x;y) = \operatorname{pmi}(x;y). The additional factors of p(x,y) inside the logarithm are intended to correct the bias of PMI towards low-frequency events, by boosting the scores of frequent pairs. A 2011 case study demonstrated the success of PMI^3 in correcting this bias on a corpus drawn from English Wikipedia. Taking x to be the word "football", its most strongly associated words y according to the PMI measure (i.e. those maximizing \operatorname{pmi}(x;y)) were domain-specific ("midfielder", "cornerbacks", "goalkeepers"), whereas the terms ranked most highly by PMI^3 were much more general ("league", "clubs", "england").
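The following Python sketch illustrates the effect of the exponent k, with hypothetical probabilities chosen so that a frequent pair and a rare pair have the same ordinary PMI:

<syntaxhighlight lang="python">
import math

def pmi_k(p_xy, p_x, p_y, k):
    """PMI^k: log2( p(x,y)^k / (p(x) p(y)) ); k = 1 gives ordinary PMI."""
    return math.log2(p_xy ** k / (p_x * p_y))

# A frequent pair and a rare pair with identical ordinary PMI (2 bits each):
frequent = (0.01,   0.05,  0.05)
rare     = (0.0001, 0.005, 0.005)

print(pmi_k(*frequent, k=1), pmi_k(*rare, k=1))  # 2.0  2.0
print(pmi_k(*frequent, k=3), pmi_k(*rare, k=3))  # frequent pair now ranks higher
</syntaxhighlight>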


Specific Correlation

Total correlation is an extension of mutual information to multiple variables. Analogously to the definition of total correlation, the extension of PMI to multiple variables is "specific correlation". The SI of a tuple of outcomes \boldsymbol{x} = (x_1, x_2, \ldots, x_n) is expressed as follows:

: \operatorname{SI}(x_1, x_2, \ldots, x_n) \equiv \log \frac{p(\boldsymbol{x})}{\prod_{i=1}^n p(x_i)} = \log p(\boldsymbol{x}) - \log \prod_{i=1}^n p(x_i)
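A Python sketch of this multivariate form (the function name and example probabilities are hypothetical):

<syntaxhighlight lang="python">
import math

def specific_correlation(p_joint, p_marginals):
    """Specific correlation (multivariate PMI): the log of the joint
    probability over the product of the marginal probabilities."""
    return math.log2(p_joint / math.prod(p_marginals))

# Hypothetical example with three binary variables: the outcome (1, 1, 1)
# has joint probability 0.2 but marginal probability 0.5 for each variable.
print(specific_correlation(0.2, [0.5, 0.5, 0.5]))  # log2(0.2 / 0.125) ~ 0.678
</syntaxhighlight>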


Chain-rule

Like mutual information, pointwise mutual information follows the chain rule, that is,

: \operatorname{pmi}(x;yz) = \operatorname{pmi}(x;y) + \operatorname{pmi}(x;z \mid y)

This is proven through application of Bayes' theorem:

: \begin{align}
\operatorname{pmi}(x;y) + \operatorname{pmi}(x;z \mid y) & = \log\frac{p(x,y)}{p(x)\,p(y)} + \log\frac{p(x,z \mid y)}{p(x \mid y)\,p(z \mid y)} \\
& = \log\left[\frac{p(x,y)}{p(x)\,p(y)} \, \frac{p(x,z \mid y)}{p(x \mid y)\,p(z \mid y)}\right] \\
& = \log\frac{p(x,z \mid y)}{p(x)\,p(z \mid y)} \\
& = \log\frac{p(x,y,z)}{p(x)\,p(y,z)} \\
& = \operatorname{pmi}(x;yz)
\end{align}
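The chain rule can also be checked numerically, as in the following Python sketch with a small hypothetical joint distribution over three binary variables:

<syntaxhighlight lang="python">
import math
from itertools import product

# A hypothetical joint distribution over three binary variables (x, y, z),
# used only to check the chain rule pmi(x;yz) = pmi(x;y) + pmi(x;z|y).
probs = [0.10, 0.05, 0.20, 0.15, 0.05, 0.15, 0.10, 0.20]
p = dict(zip(product((0, 1), repeat=3), probs))  # keys are (x, y, z)

def marg(**fixed):
    """Probability that the named variables take the given values."""
    names = ("x", "y", "z")
    return sum(v for key, v in p.items()
               if all(key[names.index(n)] == val for n, val in fixed.items()))

x, y, z = 1, 0, 1
pmi_x_yz = math.log2(marg(x=x, y=y, z=z) / (marg(x=x) * marg(y=y, z=z)))
pmi_x_y = math.log2(marg(x=x, y=y) / (marg(x=x) * marg(y=y)))
# pmi(x;z|y) uses probabilities conditioned on y.
p_y = marg(y=y)
pmi_x_z_given_y = math.log2((marg(x=x, y=y, z=z) / p_y) /
                            ((marg(x=x, y=y) / p_y) * (marg(y=y, z=z) / p_y)))
assert abs(pmi_x_yz - (pmi_x_y + pmi_x_z_given_y)) < 1e-9
print("chain rule holds:", pmi_x_yz, "=", pmi_x_y, "+", pmi_x_z_given_y)
</syntaxhighlight>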


Applications

PMI can be used in various disciplines, e.g. in information theory, linguistics, or chemistry (in profiling and analysis of chemical compounds). In computational linguistics, PMI has been used for finding collocations and associations between words. For instance, counts of occurrences and co-occurrences of words in a text corpus can be used to approximate the probabilities p(x) and p(x,y) respectively.

The following table shows counts of pairs of words getting the most and the least PMI scores in the first 50 million words in Wikipedia (dump of October 2015), filtering by 1,000 or more co-occurrences. The frequency of each count can be obtained by dividing its value by 50,000,952. (Note: natural log is used to calculate the PMI values in this example, instead of log base 2.) Good collocation pairs have high PMI because the probability of co-occurrence is only slightly lower than the probabilities of occurrence of each word. Conversely, a pair of words whose probabilities of occurrence are considerably higher than their probability of co-occurrence gets a small PMI score.
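The following Python sketch illustrates this kind of corpus-based estimate (the sliding-window scheme and threshold are illustrative choices, not the exact procedure behind the Wikipedia counts mentioned above):

<syntaxhighlight lang="python">
import math
from collections import Counter

def word_pair_pmi(tokens, window=2, min_pair_count=2):
    """PMI of word pairs estimated from corpus counts.

    p(x) is approximated from unigram counts and p(x, y) from co-occurrence
    counts within a sliding window; natural log is used, as in the table
    described above.
    """
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            pairs[(w, v)] += 1
    n_tokens = len(tokens)
    n_pairs = sum(pairs.values())
    scores = {}
    for (w, v), c in pairs.items():
        if c < min_pair_count:
            continue  # filter rare pairs, which PMI otherwise over-rewards
        p_xy = c / n_pairs
        p_x, p_y = unigrams[w] / n_tokens, unigrams[v] / n_tokens
        scores[(w, v)] = math.log(p_xy / (p_x * p_y))
    return scores

# Toy usage (a real corpus would be far larger):
tokens = "the cat sat on the mat the cat ate the fish".split()
print(word_pair_pmi(tokens))
</syntaxhighlight>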


References

* {{cite book |last1=Fano |first1=R. M. |authorlink=Robert Fano |year=1961 |title=Transmission of Information: A Statistical Theory of Communications |publisher=MIT Press, Cambridge, MA |url=https://archive.org/details/TransmissionOfInformationAStatisticalTheoryOfCommunicationRobertFano |chapter=chapter 2 |isbn=978-0262561693}}


External links


Demo at Rensselaer MSR Server
(PMI values normalized to be between 0 and 1)