In computational linguistics, second-order co-occurrence pointwise mutual information is a semantic similarity measure. To assess the degree of association between two given words, it uses pointwise mutual information (PMI) to sort lists of important neighbor words of the two target words from a large corpus.
History
The PMI-IR method used AltaVista's Advanced Search query syntax to calculate probabilities. Note that the "NEAR" search operator of AltaVista was essential to the PMI-IR method; it is no longer available, so, from the implementation point of view, the PMI-IR method cannot be used in the same form in new systems. From the algorithmic point of view, the advantage of SOC-PMI is that it can calculate the similarity between two words that do not co-occur frequently, because they co-occur with the same neighboring words. The British National Corpus (BNC), for example, has been used as a source of frequencies and contexts.
Methodology
The method considers the words that are common in both lists and aggregates their PMI values (from the opposite list) to calculate the relative semantic similarity. We define the ''pointwise mutual information'' function for only those words having <math>f^b(t_i, w) > 0</math>,
:<math>f^\text{pmi}(t_i, w) = \log_2 \frac{f^b(t_i, w) \times m}{f^t(t_i)\, f^t(w)},</math>
where <math>f^t(t_i)</math> tells us how many times the type <math>t_i</math> appeared in the entire corpus, <math>f^b(t_i, w)</math> tells us how many times word <math>t_i</math> appeared with word <math>w</math> in a context window, and <math>m</math> is the total number of tokens in the corpus. Now, for word <math>w</math>, we define a set of words, <math>X^w</math>, sorted in descending order by their PMI values with <math>w</math>, and take the top-most <math>\beta</math> words having <math>f^\text{pmi}(t_i, w) > 0</math>.
The set <math>X^w</math> contains words <math>X^w_i</math>, where <math>i = 1, 2, \ldots, \beta</math>, and
:<math>f^\text{pmi}(X^w_1, w) \ge f^\text{pmi}(X^w_2, w) \ge \cdots \ge f^\text{pmi}(X^w_{\beta-1}, w) \ge f^\text{pmi}(X^w_\beta, w).</math>
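These two steps (scoring every word's PMI with a target word and keeping the top-most <math>\beta</math> neighbors) can be sketched in Python. The sketch below is illustrative only: it assumes the frequency counts <code>f_t</code>, the co-occurrence counts <code>f_b</code> and the corpus size <code>m</code> have already been collected, and the function names are not taken from the original papers.

<syntaxhighlight lang="python">
# Illustrative sketch, not the authors' reference implementation.
import math


def pmi(f_b, f_t, m, t, w):
    """f_pmi(t, w) computed from corpus counts.

    f_b -- dict mapping (neighbour, word) pairs to co-occurrence counts
           within a context window (assumed symmetric in real use)
    f_t -- dict mapping each word type to its frequency in the corpus
    m   -- total number of tokens in the corpus
    """
    joint = f_b.get((t, w), 0)
    if joint == 0:
        return float("-inf")  # only pairs with f_b(t, w) > 0 are kept
    return math.log2(joint * m / (f_t[t] * f_t[w]))


def top_neighbours(w, f_b, f_t, m, beta):
    """The set X^w: the top-most beta words with positive PMI with w,
    sorted in descending order of their PMI values."""
    scored = [(t, pmi(f_b, f_t, m, t, w)) for t in f_t if t != w]
    scored = [(t, v) for t, v in scored if v > 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:beta]
</syntaxhighlight>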
A rule of thumb is used to choose the value of <math>\beta</math>. The ''<math>\beta</math>-PMI summation'' function of a word is defined with respect to another word. For word <math>w_1</math> with respect to word <math>w_2</math> it is:
:<math>f(w_1, w_2, \beta) = \sum_{i=1}^{\beta} \left(f^\text{pmi}(X^{w_1}_i, w_2)\right)^\gamma,</math>
where <math>f^\text{pmi}(X^{w_1}_i, w_2) > 0</math>, which sums all the positive PMI values of words in the set <math>X^{w_2}</math> that are also common to the words in the set <math>X^{w_1}</math>. In other words, this function aggregates the positive PMI values of all the semantically close words of <math>w_2</math> which are also common in <math>w_1</math>'s list.
<math>\gamma</math> should have a value greater than 1. So, the ''<math>\beta</math>-PMI summation'' function for word <math>w_1</math> with respect to word <math>w_2</math> having <math>\beta = \beta_1</math> and the ''<math>\beta</math>-PMI summation'' function for word <math>w_2</math> with respect to word <math>w_1</math> having <math>\beta = \beta_2</math> are
:<math>f(w_1, w_2, \beta_1) = \sum_{i=1}^{\beta_1} \left(f^\text{pmi}(X^{w_1}_i, w_2)\right)^\gamma</math>
and
:<math>f(w_2, w_1, \beta_2) = \sum_{i=1}^{\beta_2} \left(f^\text{pmi}(X^{w_2}_i, w_1)\right)^\gamma,</math>
respectively.
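Continuing the illustrative helpers above, the <math>\beta</math>-PMI summation can be sketched as follows; the parameter <code>gamma</code> stands for the exponent <math>\gamma</math>.

<syntaxhighlight lang="python">
def beta_pmi_summation(w1, w2, beta, gamma, f_b, f_t, m):
    """f(w1, w2, beta): over the top-beta PMI neighbours of w1, sum the
    gamma-th power of each neighbour's positive PMI with w2."""
    total = 0.0
    for neighbour, _ in top_neighbours(w1, f_b, f_t, m, beta):
        value = pmi(f_b, f_t, m, neighbour, w2)
        if value > 0:  # only positive PMI values are aggregated
            total += value ** gamma
    return total
</syntaxhighlight>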
Finally, the ''semantic PMI similarity'' function between the two words, <math>w_1</math> and <math>w_2</math>, is defined as
:<math>\mathrm{Sim}(w_1, w_2) = \frac{f(w_1, w_2, \beta_1)}{\beta_1} + \frac{f(w_2, w_1, \beta_2)}{\beta_2}.</math>
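A sketch of this final, unnormalized similarity, again building on the illustrative helpers above; the counts and parameter values in the usage example are made up for demonstration and are not real corpus statistics.

<syntaxhighlight lang="python">
def soc_pmi_similarity(w1, w2, beta1, beta2, gamma, f_b, f_t, m):
    """Sim(w1, w2) = f(w1, w2, beta1)/beta1 + f(w2, w1, beta2)/beta2."""
    return (beta_pmi_summation(w1, w2, beta1, gamma, f_b, f_t, m) / beta1
            + beta_pmi_summation(w2, w1, beta2, gamma, f_b, f_t, m) / beta2)


# Illustrative usage with tiny made-up counts (only the pair orderings
# queried below are filled in).
f_t = {"cemetery": 40, "graveyard": 30, "grave": 120, "burial": 80, "tomb": 60}
f_b = {("grave", "cemetery"): 25, ("grave", "graveyard"): 20,
       ("burial", "cemetery"): 15, ("burial", "graveyard"): 12,
       ("tomb", "cemetery"): 10, ("tomb", "graveyard"): 8}
m = 100000
print(soc_pmi_similarity("cemetery", "graveyard",
                         beta1=3, beta2=3, gamma=3.0, f_b=f_b, f_t=f_t, m=m))
</syntaxhighlight>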
The semantic word similarity is normalized so that it provides a similarity score between 0 and 1 inclusively. The normalization algorithm returns a normalized similarity score for two words: it takes as arguments the two words, <math>w_1</math> and <math>w_2</math>, and a maximum value, <math>\lambda</math>, that is returned by the semantic similarity function Sim(). For example, the algorithm returns 0.986 for the words ''cemetery'' and ''graveyard'' with the SOC-PMI method.
References
* Islam, A. and Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ''ACM Transactions on Knowledge Discovery from Data'', 2(2), 1–25.
* Islam, A. and Inkpen, D. (2006). Second order co-occurrence PMI for determining the semantic similarity of words. In ''Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006)'', Genoa, Italy, pp. 1033–1038.