Heaps' Law

In linguistics, Heaps' law (also called Herdan's law) is an empirical law which describes the number of distinct words in a document (or set of documents) as a function of the document length (the so-called type-token relation). It can be formulated as
: V_R(n) = Kn^\beta
where ''V_R'' is the number of distinct words in an instance text of size ''n'', and ''K'' and β are free parameters determined empirically. With English text corpora, ''K'' is typically between 10 and 100, and β is between 0.4 and 0.6.

The law is frequently attributed to Harold Stanley Heaps, but was originally discovered by Gustav Herdan. Under mild assumptions, the Herdan–Heaps law is asymptotically equivalent to Zipf's law concerning the frequencies of individual words within a text. This is a consequence of the fact that the type-token relation (in general) of a homogeneous text can be derived from the distribution of its types.

Heaps' law means that as more instance text is gathered, there will be diminishing returns in terms of discovery of the full vocabulary from which the distinct terms are drawn: because β is less than 1, the number of distinct words grows more slowly than the length of the text.

Heaps' law also applies to situations in which the "vocabulary" is just some set of distinct types which are attributes of some collection of objects. For example, the objects could be people, and the types could be the country of origin of each person. If persons are selected randomly (that is, not on the basis of country of origin), then Heaps' law says we will quickly have representatives from most countries (in proportion to their population), but it will become increasingly difficult to cover the entire set of countries by continuing this method of sampling.

Heaps' law has also been observed in single-cell transcriptomes, considering genes as the distinct objects in the "vocabulary".
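
The relation V_R(n) = Kn^\beta can be illustrated with a short sketch that measures the type-token curve of a token sequence and estimates ''K'' and β. The function names (`vocabulary_growth`, `fit_heaps`, `predict_vocabulary`) are hypothetical, and ordinary least squares on the log-log curve is only one simple way to estimate the parameters, chosen here for brevity; it is not necessarily the method used in the literature.

```python
import math


def vocabulary_growth(tokens):
    """Return [(n, V(n))]: distinct types seen among the first n tokens."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


def fit_heaps(curve):
    """Estimate (K, beta) by least squares on log V = log K + beta * log n.

    Assumption: a plain log-log regression is an adequate estimator for
    illustration; `curve` is a list of (n, V) pairs with n > 0 and V > 0.
    """
    if len(curve) < 2:
        raise ValueError("need at least two points to fit a line")
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all points have the same n")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                          # slope of the log-log line
    K = math.exp(mean_y - beta * mean_x)      # intercept, mapped back from log space
    return K, beta


def predict_vocabulary(n, K, beta):
    """Evaluate V_R(n) = K * n**beta."""
    return K * n ** beta


if __name__ == "__main__":
    # Toy input; a real estimate needs a corpus of many thousands of tokens.
    tokens = "the cat sat on the mat and the dog sat on the cat".split()
    K, beta = fit_heaps(vocabulary_growth(tokens))
    print(f"K = {K:.3f}, beta = {beta:.3f}")
```

As a worked example with illustrative parameters inside the ranges quoted above, ''K'' = 30 and β = 0.5 give `predict_vocabulary(1_000_000, 30, 0.5)` = 30 × 1000 = 30,000 distinct words for a text of one million tokens, while doubling the text to two million tokens raises the estimate only to about 42,400.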


See also

* Zipf's law
* Brevity law
* Menzerath's law
* Bradford's law
* Benford's law
* Pareto distribution
* Principle of least effort
* Rank-size distribution



