In information theory, perplexity is a measure of how well a probability distribution or probability model predicts a sample. It may be used to compare probability models. A low perplexity indicates that the probability distribution is good at predicting the sample.


Perplexity of a probability distribution

The perplexity ''PP'' of a discrete probability distribution ''p'' is defined as

:\mathit{PP}(p) := 2^{H(p)} = 2^{-\sum_x p(x)\log_2 p(x)} = \prod_x p(x)^{-p(x)}

where ''H''(''p'') is the entropy (in bits) of the distribution and ''x'' ranges over events. (The base need not be 2: the perplexity is independent of the base, provided that the entropy and the exponentiation use the ''same'' base.) This measure is also known in some domains as the ''(order-1 true) diversity''.

The perplexity of a random variable ''X'' may be defined as the perplexity of the distribution over its possible values ''x''. In the special case where ''p'' models a fair ''k''-sided die (a uniform distribution over ''k'' discrete events), its perplexity is ''k''. A random variable with perplexity ''k'' has the same uncertainty as a fair ''k''-sided die, and one is said to be "''k''-ways perplexed" about the value of the random variable. (Unless the distribution is uniform, more than ''k'' values may be possible, but the overall uncertainty is no greater, because some of these values have probability greater than 1/''k'', which lowers the entropy.)

Perplexity is sometimes used as a measure of how hard a prediction problem is. This is not always accurate. If there are two choices, one with probability 0.9, then the optimal strategy guesses correctly 90 percent of the time, yet the perplexity is

:2^{-0.9 \log_2 0.9 \,-\, 0.1 \log_2 0.1} \approx 1.38.

The inverse of the perplexity (which, in the case of the fair ''k''-sided die, represents the probability of guessing correctly) is 1/1.38 = 0.72, not 0.9.

The perplexity is the exponentiation of the entropy, which is a more clear-cut quantity. The entropy is a measure of the expected, or "average", number of bits required to encode the outcome of the random variable using a theoretically optimal variable-length code. It can equivalently be regarded as the expected information gain from learning the outcome of the random variable.
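As a concrete check of the definition, here is a minimal Python sketch (an illustration, not part of the original article; the function name ''perplexity'' is an assumption) that computes the perplexity of a discrete distribution as the exponentiated entropy. A fair ''k''-sided die comes out to exactly ''k'', and the two-choice 0.9/0.1 example above comes out to about 1.38:

    import math

    def perplexity(probs, base=2.0):
        """Perplexity of a discrete distribution: base ** entropy.

        probs is a sequence of probabilities summing to 1; events with
        zero probability contribute nothing (0 * log 0 == 0 by convention).
        """
        entropy = -sum(p * math.log(p, base) for p in probs if p > 0)
        return base ** entropy

    print(perplexity([1/6] * 6))   # fair 6-sided die -> 6.0
    print(perplexity([0.9, 0.1]))  # -> ~1.38, though the optimal guess wins 90% of the time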


Perplexity of a probability model

A model of an unknown probability distribution ''p'' may be proposed based on a training sample that was drawn from ''p''. Given a proposed probability model ''q'', one may evaluate ''q'' by asking how well it predicts a separate test sample ''x''1, ''x''2, ..., ''xN'' also drawn from ''p''. The perplexity of the model ''q'' is defined as

:b^{-\frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}

where ''b'' is customarily 2. Better models ''q'' of the unknown distribution ''p'' will tend to assign higher probabilities ''q''(''xi'') to the test events. Thus, they have lower perplexity: they are less surprised by the test sample.

The exponent above may be regarded as the average number of bits needed to represent a test event ''xi'' if one uses an optimal code based on ''q''. Low-perplexity models do a better job of compressing the test sample, requiring few bits per test element on average because ''q''(''xi'') tends to be high. The exponent may also be regarded as a cross-entropy,

:H(\tilde{p}, q) = -\sum_x \tilde{p}(x) \log_2 q(x)

where \tilde{p} denotes the empirical distribution of the test sample (i.e., \tilde{p}(x) = n/N if ''x'' appeared ''n'' times in the test sample of size ''N'').
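As an illustrative sketch (not from the article; the names are assumptions), the model perplexity above can be computed in Python from the probabilities ''q''(''xi'') that the model assigns to each test event, by exponentiating the average negative log-probability:

    import math

    def model_perplexity(q_probs, base=2.0):
        """Perplexity of a model q on a test sample.

        q_probs[i] is the model probability q(x_i) of the i-th test event.
        The exponent is the cross-entropy between the empirical test
        distribution and q, in units of log base `base`.
        """
        n = len(q_probs)
        avg_neg_log = -sum(math.log(q, base) for q in q_probs) / n
        return base ** avg_neg_log

    # A model assigning 1/8 to every test event is "8-ways perplexed":
    print(model_perplexity([1/8, 1/8, 1/8, 1/8]))  # -> 8.0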


Perplexity per word

In natural language processing, perplexity is a way of evaluating language models. A language model is a probability distribution over entire sentences or texts.

Using the definition of perplexity for a probability model, one might find, for example, that the average sentence ''xi'' in the test sample could be coded in 190 bits (i.e., the test sentences had an average log2-probability of -190). This would give an enormous model perplexity of 2^190 per sentence. However, it is more common to normalize for sentence length and consider only the number of bits per word. Thus, if the test sample's sentences comprised a total of 1,000 words and could be coded using a total of 7.95 bits per word, one could report a model perplexity of 2^7.95 = 247 ''per word''. In other words, the model is as confused on test data as if it had to choose uniformly and independently among 247 possibilities for each word.

The lowest perplexity that had been published on the Brown Corpus (1 million words of American English of varying topics and genres) as of 1992 is indeed about 247 per word, corresponding to a cross-entropy of log2 247 = 7.95 bits per word, or 1.75 bits per letter, using a trigram model. It is often possible to achieve lower perplexity on more specialized corpora, as they are more predictable.

Again, simply guessing that the next word in the Brown Corpus is "the" will have an accuracy of 7 percent, not 1/247 = 0.4 percent, as a naive use of perplexity as a measure of predictiveness might lead one to believe. This guess is based on the unigram statistics of the Brown Corpus, not on the trigram statistics that yielded the word perplexity of 247; using trigram statistics would further improve the chances of a correct guess.
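To make the per-word arithmetic concrete, a small sketch (illustrative; the inputs are assumed to come from some language model's scoring of a test text) converts a total log2-probability and a word count into a per-word perplexity, reproducing the 2^7.95 = 247 figure above:

    def per_word_perplexity(total_log2_prob, num_words):
        """Per-word perplexity from a test text's total log2-probability.

        total_log2_prob is the sum of log2 q(word | context) over all
        words; negating and dividing by the word count gives bits per word.
        """
        bits_per_word = -total_log2_prob / num_words
        return 2.0 ** bits_per_word

    # 1,000 test words coded at 7.95 bits per word, as in the example:
    print(per_word_perplexity(-7950.0, 1000))  # -> about 247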


See also

* Statistical model validation

