In linguistics, Heaps' law (also called Herdan's law) is an empirical law which describes the number of distinct words in a document (or set of documents) as a function of the document length (the so-called type-token relation). It can be formulated as
:<math>V_R(n) = K n^\beta</math>
where ''V<sub>R</sub>'' is the number of distinct words in an instance text of size ''n''. ''K'' and β are free parameters determined empirically. With English text corpora, typically ''K'' is between 10 and 100, and β is between 0.4 and 0.6.
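As a rough illustration of how ''K'' and β might be determined empirically, the following Python sketch counts vocabulary growth over a token sequence and fits the two parameters by ordinary least squares on the log-log form log ''V'' = log ''K'' + β log ''n''. The function names (<code>vocabulary_growth</code>, <code>fit_heaps</code>, <code>heaps_prediction</code>) are illustrative, and fitting in log space is only one possible estimation choice, assumed here for simplicity.

```python
import math


def vocabulary_growth(tokens):
    """Return [V(1), ..., V(n)]: the number of distinct types after each token."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def fit_heaps(growth):
    """Least-squares fit of log V = log K + beta * log n.

    `growth[i]` is the vocabulary size after i + 1 tokens.
    Needs at least two points; returns the pair (K, beta).
    """
    xs = [math.log(n) for n in range(1, len(growth) + 1)]
    ys = [math.log(v) for v in growth]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


def heaps_prediction(n, K, beta):
    """Vocabulary size predicted by Heaps' law for a text of n tokens."""
    return K * n ** beta
```

Under these assumptions, calling <code>fit_heaps(vocabulary_growth(tokens))</code> on a tokenized corpus gives estimates that can be compared with the typical ranges quoted above. The result depends on how the text is tokenized and on which part of the growth curve is included in the fit.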
The law is frequently attributed to Harold Stanley Heaps, but was originally discovered by Gustav Herdan. Under mild assumptions, the Herdan–Heaps law is asymptotically equivalent to Zipf's law concerning the frequencies of individual words within a text. This is a consequence of the fact that the type-token relation (in general) of a homogeneous text can be derived from the distribution of its types.
Empirically, Heaps' law is preserved even when the document is randomly shuffled, meaning that it does not depend on the ordering of words, but only on their frequencies. This is used as evidence for deriving Heaps' law from Zipf's law.
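The dependence on frequencies alone can be made concrete with a short Python sketch. For a uniformly shuffled text of ''n'' tokens, a type that occurs ''f'' times is missing from the first ''m'' positions with probability C(''n'' − ''f'', ''m'') / C(''n'', ''m''), so the expected number of distinct types follows from the type frequencies by a standard hypergeometric argument. The function names below are illustrative, and the simulation is only a sanity check of that formula, not a derivation of Heaps' law itself.

```python
import math
import random
from collections import Counter


def expected_types(tokens, m):
    """Expected number of distinct types in the first m tokens of a
    uniformly shuffled copy of `tokens` (requires 0 <= m <= len(tokens))."""
    freqs = Counter(tokens).values()
    n = len(tokens)
    total = math.comb(n, m)
    # A type with frequency f is absent with probability C(n - f, m) / C(n, m).
    return sum(1 - math.comb(n - f, m) / total for f in freqs)


def mean_types_after_shuffling(tokens, m, trials=2000, seed=0):
    """Monte Carlo estimate of the same quantity, by shuffling repeatedly."""
    rng = random.Random(seed)
    tokens = list(tokens)
    total = 0
    for _ in range(trials):
        rng.shuffle(tokens)
        total += len(set(tokens[:m]))
    return total / trials
```

Only the multiset of frequencies enters <code>expected_types</code>; the original word order never appears, which is the sense in which the type-token curve of a shuffled text is determined by the distribution of its types.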
Heaps' law means that as more instance text is gathered, there will be diminishing returns in terms of discovery of the full vocabulary from which the distinct terms are drawn.
Corpora generated with large language models have been found to deviate from Heaps' law as it is typically observed in English text corpora.
Heaps' law also applies to situations in which the "vocabulary" is just some set of distinct types which are attributes of some collection of objects. For example, the objects could be people, and the types could be country of origin of the person. If persons are selected randomly (that is, we are not selecting based on country of origin), then Heaps' law says we will quickly have representatives from most countries (in proportion to their population) but it will become increasingly difficult to cover the entire set of countries by continuing this method of sampling.
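The sampling example can be sketched numerically. The Python snippet below assumes a hypothetical set of 200 "countries" whose population shares are proportional to 1/rank; these weights are invented for illustration and are not real population data. It computes the expected number of distinct countries seen after ''n'' independent random draws, and runs one seeded simulation for comparison.

```python
import random


def expected_distinct(weights, n):
    """Expected number of distinct types after n independent draws,
    where type i is drawn with probability weights[i] / sum(weights)."""
    total = sum(weights)
    return sum(1 - (1 - w / total) ** n for w in weights)


def simulate_distinct(weights, n, seed=0):
    """Number of distinct types observed in one simulated sample of size n."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(weights)), weights=weights, k=n)
    return len(set(draws))


# Hypothetical Zipf-like population shares for 200 countries (not real data).
weights = [1 / rank for rank in range(1, 201)]

for n in (100, 1000, 10000):
    print(n, round(expected_distinct(weights, n), 1), simulate_distinct(weights, n))
```

In this sketch each additional draw adds less to the expected count than the one before it, which is the diminishing-returns behaviour described above: the populous countries appear early, while the rarest ones take many more draws to show up.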
Heaps' law has also been observed in single-cell transcriptomes, considering genes as the distinct objects in the "vocabulary".
References
Sources
* . Heaps' law is proposed in Section 7.5 (pp. 206–208).