
Zipf's law is an empirical law stating that when a list of measured values is sorted in decreasing order, the value of the $n$-th entry is often approximately inversely proportional to $n$. The best known instance of Zipf's law applies to the frequency table of words in a text or corpus of natural language:

$$\mathsf{word\ frequency} \propto \frac{1}{\mathsf{word\ rank}}.$$

It is usually found that the most common word occurs approximately twice as often as the next most common one, three times as often as the third most common, and so on. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). The law is often used in the following form, called the Zipf–Mandelbrot law:

$$\mathsf{frequency} \propto \frac{1}{(\mathsf{rank} + b)^{a}}$$

where $a$ and $b$ are fitted parameters, with $a \approx 1$ and $b \approx 2.7$.

The law is named after the American linguist George Kingsley Zipf, and is still an important concept in quantitative linguistics. It has been found to apply to many other types of data studied in the physical and social sciences.

In mathematical statistics, the concept has been formalized as the Zipfian distribution: a family of related discrete probability distributions whose rank-frequency distribution is an inverse power law relation. They are related to Benford's law and the Pareto distribution.

Some sets of time-dependent empirical data deviate somewhat from Zipf's law. Such empirical distributions are said to be quasi-Zipfian.
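
The Brown Corpus figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch: the three counts are taken from the text, while the helper name `zipf_ratios` is purely illustrative. Under an ideal Zipf law with exponent 1, the ratio of the top count to the count at rank $n$ would equal $n$.

```python
# Counts for the three most frequent Brown Corpus words, as quoted in the text.
brown_counts = {"the": 69_971, "of": 36_411, "and": 28_852}

def zipf_ratios(counts):
    """Return (word, rank, top_count / count) for words sorted by decreasing count."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    top = ranked[0][1]
    return [(word, rank, top / count)
            for rank, (word, count) in enumerate(ranked, start=1)]

for word, rank, ratio in zipf_ratios(brown_counts):
    # For an ideal Zipf law with exponent 1, ratio == rank.
    print(f"{word!r}: rank {rank}, top/count = {ratio:.2f}")
```

The ratios come out near 1.92 and 2.43 rather than exactly 2 and 3, which illustrates that the law is only approximate even for this standard example.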


History

In 1913, the German physicist Felix Auerbach observed an inverse proportionality between the population sizes of cities and their ranks when sorted in decreasing order of that variable.

For word frequencies, Zipf's law had been discovered before Zipf, first by the French stenographer Jean-Baptiste Estoup in 1916, and also by G. Dewey in 1923 and by E. Condon in 1928. The same relation for frequencies of words in natural language texts was observed by George Zipf in 1932, but he never claimed to have originated it. In fact, Zipf did not like mathematics. In his 1932 publication, he speaks with disdain about mathematical involvement in linguistics (among other places, ibidem, p. 21):

: "... let me say here for the sake of any mathematician who may plan to formulate the ensuing data more exactly, the ability of the highly intense positive to become the highly intense negative, in my opinion, introduces the devil into the formula in the form of $\sqrt{-i}$."

The only mathematical expression Zipf used looks like $a.b^2 = \text{constant}$, which he "borrowed" from Alfred J. Lotka's 1926 publication.

The same relationship was found to occur in many other contexts, and for other variables besides frequency. For example, when corporations are ranked by decreasing size, their sizes are found to be inversely proportional to the rank. The same relation is found for personal incomes (where it is called the Pareto principle), the number of people watching the same TV channel, notes in music, cells' transcriptomes, and more.

In 1992 the bioinformatician Wentian Li published a short paper showing that Zipf's law emerges even in randomly generated texts. It included a proof that the power law form of Zipf's law was a byproduct of ordering words by rank.


Formal definition

Formally, the Zipf distribution on $N$ elements assigns to the element of rank $k$ (counting from 1) the probability:

$$f(k;N) = \begin{cases} \dfrac{1}{H_N}\,\dfrac{1}{k}, & \text{if } 1 \le k \le N, \\ 0, & \text{if } k < 1 \text{ or } N < k, \end{cases}$$

where $H_N$ is a normalization constant, the $N$-th harmonic number:

$$H_N \equiv \sum_{k=1}^{N} \frac{1}{k}.$$

The distribution is sometimes generalized to an inverse power law with exponent $s$ instead of 1. Namely,

$$f(k;N,s) = \frac{1}{H_{N,s}}\,\frac{1}{k^{s}}$$

where $H_{N,s}$ is a generalized harmonic number

$$H_{N,s} = \sum_{k=1}^{N} \frac{1}{k^{s}}.$$

The generalized Zipf distribution can be extended to infinitely many items ($N = \infty$) only if the exponent $s$ exceeds 1. In that case, the normalization constant $H_{N,s}$ becomes Riemann's zeta function,

$$\zeta(s) = \sum_{k=1}^{\infty} \frac{1}{k^{s}} < \infty.$$

The infinite item case is characterized by the Zeta distribution and is called Lotka's law. If the exponent $s$ is 1 or less, the normalization constant $H_{N,s}$ diverges as $N$ tends to infinity.
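
The definition above translates almost directly into code. The following is a minimal sketch in plain Python; the function names `generalized_harmonic` and `zipf_pmf` are illustrative choices, not part of any library. The normalization is computed by direct summation, which is adequate for moderate $N$.

```python
def generalized_harmonic(n, s=1.0):
    """H_{N,s}: the sum of 1/k**s for k = 1..N (the harmonic number H_N when s = 1)."""
    return sum(1.0 / k**s for k in range(1, n + 1))

def zipf_pmf(k, n, s=1.0):
    """Probability of rank k under a Zipf distribution on n elements with exponent s."""
    if k < 1 or k > n:
        return 0.0  # ranks outside 1..N have probability zero
    return 1.0 / (generalized_harmonic(n, s) * k**s)

probs = [zipf_pmf(k, 10) for k in range(1, 11)]
print(sum(probs))           # probabilities sum to 1 up to rounding
print(probs[0] / probs[1])  # rank 1 is twice as likely as rank 2 when s = 1
```

Recomputing the harmonic number on every call keeps the sketch short; code that evaluates many ranks would compute the normalization once and reuse it.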


Empirical testing

Empirically, a data set can be tested to see whether Zipf's law applies by checking the goodness of fit of an empirical distribution to the hypothesized power law distribution with a Kolmogorov–Smirnov test, and then comparing the (log) likelihood ratio of the power law distribution to alternative distributions such as an exponential distribution or a lognormal distribution.

Zipf's law can be visualized by plotting the item frequency data on a log-log graph, with the axes being the logarithm of rank order and the logarithm of frequency. The data conform to Zipf's law with exponent $s$ to the extent that the plot approximates a linear (more precisely, affine) function with slope $-s$. For exponent $s = 1$, one can also plot the reciprocal of the frequency (mean interword interval) against rank, or the reciprocal of rank against frequency, and compare the result with the line through the origin with slope 1.
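
The log-log check described above can be approximated numerically by fitting a straight line to log-frequency against log-rank. The sketch below uses ordinary least squares in plain Python; `loglog_slope` is a hypothetical helper name. This is only the visual diagnostic, not the Kolmogorov–Smirnov and likelihood-ratio procedure mentioned earlier, and a least-squares slope on log-log data should be treated as a rough estimate of $-s$.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Values must be positive and there must be at least two of them; they are
    sorted in decreasing order here, so rank 1 is the largest value.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic data that follows Zipf's law exactly with exponent 1.
ideal = [1000.0 / rank for rank in range(1, 201)]
print(loglog_slope(ideal))  # close to -1 for this ideal input
```

On real word counts the fitted slope depends strongly on which ranks are included, since the low-frequency tail is dominated by ties.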


Statistical explanations

Although Zipf's law holds for most natural languages, and even for certain artificial ones such as Esperanto and Toki Pona, the reason is still not well understood. Recent reviews of generative processes for Zipf's law include Mitzenmacher, "A Brief History of Generative Models for Power Law and Lognormal Distributions", and Simkin, "Re-inventing Willis".

However, the law may be partly explained by statistical analysis of randomly generated texts. Wentian Li has shown that in a document in which each character has been chosen randomly from a uniform distribution of all letters (plus a space character), the "words" of different lengths follow the macro-trend of Zipf's law (the more probable words are the shortest, and words of equal length have equal probability). In 1959, Vitold Belevitch observed that if any of a large class of well-behaved statistical distributions (not only the normal distribution) is expressed in terms of rank and expanded into a Taylor series, the first-order truncation of the series results in Zipf's law. Further, a second-order truncation of the Taylor series results in Mandelbrot's law.

The principle of least effort is another possible explanation: Zipf himself proposed that neither speakers nor hearers using a given language want to work any harder than necessary to reach understanding, and that the process that results in approximately equal distribution of effort leads to the observed Zipf distribution.

A minimal explanation assumes that words are generated by monkeys typing randomly. If language is generated by a single monkey typing randomly, with a fixed and nonzero probability of hitting each letter key or white space, then the words (letter strings separated by white spaces) produced by the monkey follow Zipf's law.

Another possible cause for the Zipf distribution is a preferential attachment process, in which the value $x$ of an item tends to grow at a rate proportional to $x$ (intuitively, "the rich get richer" or "success breeds success"). Such a growth process results in the Yule–Simon distribution, which has been shown to fit word frequency versus rank in language and population versus city rank better than Zipf's law. It was originally derived to explain population versus rank in species by Yule, and applied to cities by Simon.

A similar explanation is based on Atlas models, systems of exchangeable positive-valued diffusion processes with drift and variance parameters that depend only on the rank of the process. It has been shown mathematically that Zipf's law holds for Atlas models that satisfy certain natural regularity conditions.
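
The random-typing explanation is easy to simulate. The following is a rough sketch under stated assumptions: a four-letter alphabet, uniform letter probabilities, a space probability of 0.2, and a fixed seed are all arbitrary choices made for illustration, and the names `monkey_text` and `rank_frequency` are hypothetical.

```python
import random
from collections import Counter

def monkey_text(n_chars, alphabet="abcd", space_prob=0.2, seed=0):
    """Type n_chars characters: a space with probability space_prob, else a uniform letter."""
    rng = random.Random(seed)
    return "".join(
        " " if rng.random() < space_prob else rng.choice(alphabet)
        for _ in range(n_chars)
    )

def rank_frequency(text):
    """Word counts sorted in decreasing order (index 0 holds rank 1)."""
    return sorted(Counter(text.split()).values(), reverse=True)

counts = rank_frequency(monkey_text(200_000))
# With 4 letters there are 4 one-letter, 16 two-letter and 64 three-letter words,
# so ranks 4/5 and 20/21 straddle the boundaries between word lengths.
for rank in (1, 4, 5, 20, 21, 84):
    print(rank, counts[rank - 1])
```

Words of the same length are equally likely, so the counts form plateaus that drop sharply at each length boundary. It is the overall trend across plateaus, not the shape within one, that follows a power law.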


Related laws

A generalization of Zipf's law is the Zipf–Mandelbrot law, proposed by Benoit Mandelbrot, whose frequencies are:

$$f(k;N,q,s) = \frac{1}{C}\,\frac{1}{(k+q)^{s}}.$$

The constant $C$ is the Hurwitz zeta function evaluated at $s$.

Zipfian distributions can be obtained from Pareto distributions by an exchange of variables. The Zipf distribution is sometimes called the discrete Pareto distribution because it is analogous to the continuous Pareto distribution in the same way that the discrete uniform distribution is analogous to the continuous uniform distribution.

The tail frequencies of the Yule–Simon distribution are approximately

$$f(k;\rho) \approx \frac{[\mathsf{constant}]}{k^{\rho+1}}$$

for any choice of $\rho > 0$.

In the parabolic fractal distribution, the logarithm of the frequency is a quadratic polynomial of the logarithm of the rank. This can markedly improve the fit over a simple power-law relationship. As with fractal dimension, it is possible to calculate a Zipf dimension, which is a useful parameter in the analysis of texts.

It has been argued that Benford's law is a special bounded case of Zipf's law, with the connection between the two laws being explained by their both originating from scale-invariant functional relations from statistical physics and critical phenomena. The ratios of probabilities in Benford's law are not constant. The leading digits of data satisfying Zipf's law with $s = 1$ satisfy Benford's law.
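
The effect of the shift parameter in the Zipf–Mandelbrot law can be seen in a small numerical sketch. As an assumption made here for simplicity, the probabilities are normalized by a finite sum over ranks 1 to $N$ instead of by evaluating the Hurwitz zeta function. The function name `zipf_mandelbrot_pmf` is illustrative, and the value $q = 2.7$ simply reuses the fitted shift of about 2.7 quoted earlier in the article.

```python
def zipf_mandelbrot_pmf(n, q, s):
    """Probabilities for ranks 1..n, proportional to 1 / (k + q)**s."""
    weights = [1.0 / (k + q) ** s for k in range(1, n + 1)]
    total = sum(weights)  # finite-N stand-in for the constant C (an assumption)
    return [w / total for w in weights]

plain = zipf_mandelbrot_pmf(1000, q=0.0, s=1.0)    # q = 0 reduces to the Zipf distribution
shifted = zipf_mandelbrot_pmf(1000, q=2.7, s=1.0)  # a positive shift flattens the head

# Ratio of the rank-1 to the rank-2 probability: (2 + q) / (1 + q) when s = 1.
print(plain[0] / plain[1], shifted[0] / shifted[1])
```

With $q = 0$ the top two ranks differ by a factor of 2. With $q = 2.7$ the factor drops to $4.7/3.7$, so the most frequent items are much closer together while the tail still decays like a power law.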


Occurrences


City sizes

Following Auerbach's 1913 observation, there has been substantial examination of Zipf's law for city sizes. However, more recent empirical and theoretical studies have challenged the relevance of Zipf's law for cities.


Word frequencies in natural languages

In many texts in human languages, word frequencies approximately follow a Zipf distribution with exponent $s$ close to 1; that is, the most common word occurs about $n$ times as often as the $n$-th most common one.

The actual rank-frequency plot of a natural language text deviates to some extent from the ideal Zipf distribution, especially at the two ends of the range. The deviations may depend on the language, on the topic of the text, on the author, on whether the text was translated from another language, and on the spelling rules used. Some deviation is inevitable because of sampling error.

At the low-frequency end, where the rank approaches $N$, the plot takes a staircase shape, because each word can occur only an integer number of times.

[Figures: Zipf plots for (1) German (1669), Russian (1972), French (1865), Italian (1840), and Medieval English (1460); (2) Ge'ez (14th century), Arabic (7th century), and Hebrew (500–800), all with vowels; (3) Lhasa Tibetan, Chinese, and Vietnamese, all with separated syllables; (4) the first five books of the Old Testament (the Torah) in Hebrew, with vowels; (5) the first five books of the Old Testament (the Pentateuch) in the Latin Vulgate version; (6) the first four books of the New Testament (the Gospels) in the Latin Vulgate version.]

In some Romance languages, the frequencies of the dozen or so most frequent words deviate significantly from the ideal Zipf distribution, because those words include articles inflected for grammatical gender and number.

In many East Asian languages, such as Chinese, Tibetan, and Vietnamese, each morpheme (word or word piece) consists of a single syllable; a word of English is often translated to a compound of two such syllables. The rank-frequency table for those morphemes deviates significantly from the ideal Zipf law, at both ends of the range.

Even in English, the deviations from the ideal Zipf's law become more apparent as one examines large collections of texts. Analysis of a corpus of 30,000 English texts showed that only about 15% of the texts in it have a good fit to Zipf's law. Slight changes in the definition of Zipf's law can increase this percentage to close to 50%.

In these cases, the observed frequency-rank relation can be modeled more accurately by separate Zipf–Mandelbrot distributions for different subsets or subtypes of words. This is the case for the frequency-rank plot of the first 10 million words of the English Wikipedia. In particular, the frequencies of the closed class of function words in English are better described with $s$ lower than 1, while open-ended vocabulary growth with document size and corpus size requires $s$ greater than 1 for convergence of the generalized harmonic series.

When a text is encrypted in such a way that every occurrence of each distinct plaintext word is always mapped to the same encrypted word (as in the case of simple substitution ciphers, like the Caesar ciphers, or simple codebook ciphers), the frequency-rank distribution is not affected. On the other hand, if separate occurrences of the same word may be mapped to two or more different words (as happens with the Vigenère cipher), the Zipf distribution will typically have a flat part at the high-frequency end.
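
The claim that a simple substitution cipher leaves the frequency-rank distribution unchanged can be illustrated with a toy example. This is a minimal sketch: the sample sentence, the shift of 3, and the helper names `caesar` and `rank_frequency` are all illustrative choices. Because a Caesar cipher maps each distinct word to exactly one ciphertext word, the sorted list of counts is identical before and after encryption.

```python
from collections import Counter

def caesar(text, shift=3):
    """Shift each lowercase ASCII letter by `shift` positions; leave other characters alone."""
    return "".join(
        chr((ord(c) - ord("a") + shift) % 26 + ord("a")) if "a" <= c <= "z" else c
        for c in text
    )

def rank_frequency(text):
    """Word counts sorted in decreasing order, ignoring which word has which count."""
    return sorted(Counter(text.split()).values(), reverse=True)

sample = "the cat and the dog and the bird saw the cat"
print(rank_frequency(sample))          # [4, 2, 2, 1, 1, 1]
print(rank_frequency(caesar(sample)))  # same list: only the labels changed
```

A polyalphabetic cipher such as Vigenère would break this property, since the same plaintext word can encrypt to different ciphertext words depending on its position.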


Applications

Zipf's law has been used for extraction of parallel fragments of texts out of comparable corpora. Laurance Doyle and others have suggested the application of Zipf's law for detection of alien language in the search for extraterrestrial intelligence.

The frequency-rank word distribution is often characteristic of the author and changes little over time. This feature has been used in the analysis of texts for authorship attribution.

The word-like sign groups of the 15th-century codex known as the Voynich Manuscript have been found to satisfy Zipf's law, suggesting that the text is most likely not a hoax but rather written in an obscure language or cipher.


Whale communication

Recent analysis of whale vocalization samples shows that they contain recurring phonemes whose distribution appears to closely obey Zipf's law. This does not prove that whale communication is a natural language, but it is an intriguing finding.


See also

* Letter frequency
* Most common words in English


External links

* An article on Zipf's law applied to city populations
* Seeing Around Corners (Artificial societies turn up Zipf's law)
* An analysis of income distribution (https://www.newscientist.com/article.ns?id=mg18524904.300)
* Zipf List of French words
* Zipf list for English, French, Spanish, Italian, Swedish, Icelandic, Latin, Portuguese and Finnish from Gutenberg Project and online calculator to rank words in texts
* Citations and the Zipf–Mandelbrot's law
* Zipf's Law examples and modelling (1985)
* Benford's law, Zipf's law, and the Pareto distribution, by Terence Tao