In probability theory and statistics, the Zipf–Mandelbrot law is a discrete probability distribution. Also known as the Pareto–Zipf law, it is a power-law distribution on ranked data, named after the linguist George Kingsley Zipf, who suggested a simpler distribution called Zipf's law, and the mathematician Benoit Mandelbrot, who subsequently generalized it.
The probability mass function is given by:

:f(k; N, q, s) = \frac{1/(k+q)^s}{H_{N,q,s}}

where H_{N,q,s} is given by:

:H_{N,q,s} = \sum_{i=1}^N \frac{1}{(i+q)^s},
which may be thought of as a generalization of a
harmonic number
. In the formula, ''k'' is the rank of the data, and ''q'' and ''s'' are parameters of the distribution. In the limit as ''N'' approaches infinity, this becomes the Hurwitz zeta function ζ(''s'', ''q''). For finite ''N'' and ''q'' = 0 the Zipf–Mandelbrot law becomes Zipf's law. For infinite ''N'' and ''q'' = 0 it becomes a Zeta distribution.
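The definitions above are straightforward to evaluate numerically; the following minimal sketch (the function name and parameter values are chosen for illustration only) computes the probability mass function together with its normalizing constant H_{N,q,s}:

```python
def zipf_mandelbrot_pmf(k, N, q, s):
    """f(k; N, q, s) = (1/(k+q)^s) / H_{N,q,s} for ranks k = 1..N."""
    # Normalizing constant: the generalized harmonic number H_{N,q,s}
    H = sum(1.0 / (i + q) ** s for i in range(1, N + 1))
    return (1.0 / (k + q) ** s) / H

# The probabilities over all N ranks sum to 1
total = sum(zipf_mandelbrot_pmf(k, N=100, q=2.7, s=1.2) for k in range(1, 101))
print(abs(total - 1.0) < 1e-9)  # True

# With q = 0 and finite N, the law reduces to Zipf's law on N elements
p1 = zipf_mandelbrot_pmf(1, N=10, q=0, s=1.0)
```

With q = 0 the normalizing constant reduces to the ordinary harmonic number H_N, matching the statement above that the Zipf–Mandelbrot law specializes to Zipf's law in that case.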
Applications
The distribution of words ranked by their frequency in a random text corpus is approximated by a power-law distribution, known as Zipf's law.
If one plots the frequency rank of words in a moderately sized corpus of text data versus the number of occurrences or actual frequencies, one obtains a power-law distribution with exponent close to one (but see Powers, 1998 and Gelbukh & Sidorov, 2001). Zipf's law implicitly assumes a fixed vocabulary size, but the harmonic series with ''s'' = 1 does not converge, while the Zipf–Mandelbrot generalization with ''s'' > 1 does. Furthermore, there is evidence that the closed class of functional words that define a language obeys a Zipf–Mandelbrot distribution with parameters different from those of the open classes of contentive words, which vary by topic, field and register.
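The rank-frequency relationship described above can be illustrated with a toy example (the corpus here is made up for the illustration; a real study would use a moderately sized text corpus):

```python
from collections import Counter

# A tiny made-up corpus standing in for real text data
text = ("the quick brown fox jumps over the lazy dog "
        "the dog barks and the fox runs over the hill")

# Rank words by frequency: rank 1 is the most common word
ranked = Counter(text.split()).most_common()
for rank, (word, freq) in enumerate(ranked[:3], start=1):
    print(rank, word, freq)
```

In even this tiny sample, the functional word "the" dominates the top rank, a pattern that becomes the characteristic near-straight line on a log-log rank-frequency plot as the corpus grows.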
In ecological field studies, the
relative abundance distribution (i.e. the graph of the number of species observed as a function of their abundance) is often found to conform to a Zipf–Mandelbrot law.
In music, many metrics for measuring how "pleasing" a piece is have been found to conform to Zipf–Mandelbrot distributions.
Notes
References
* Van Droogenbroeck F.J., 'An essential rephrasing of the Zipf–Mandelbrot law to solve authorship attribution applications by Gaussian statistics' (2019)
External links
* Z. K. Silagadze: Citations and the Zipf–Mandelbrot's law
* https://github.com/gkohri/discreteRNG: C++ library for generating random Zipf–Mandelbrot deviates.
{{DEFAULTSORT:Zipf-Mandelbrot Law}}
Discrete distributions
Power laws
Computational linguistics
Quantitative linguistics
Corpus linguistics