Heaps' law

Figure: Verification of Heaps' law on War and Peace, as well as a randomly shuffled version of it. Both cases fit Heaps' law well, with very similar exponents β but different K.

Figure: A schematic Heaps' law plot. The x-axis represents the text size, and the y-axis represents the number of distinct vocabulary elements present in the text. Compare the values of the two axes.

In linguistics, Heaps' law (also called Herdan's law) is an empirical law which describes the number of distinct words in a document (or set of documents) as a function of the document length (the so-called type–token relation). It can be formulated as

V_R(n) = Kn^β

where V_R(n) is the number of distinct words in an instance text of size n. K and β are free parameters determined empirically. With English text corpora, typically K is between 10 and 100, and β is between 0.4 and 0.6.
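In practice, K and β are usually estimated by fitting the vocabulary-growth curve of a corpus, for example by linear regression in log–log space. The following is a minimal Python sketch of such a fit; the whitespace tokenization, lowercasing, and the file name war_and_peace.txt are illustrative assumptions, not part of the law itself.

```python
# Minimal sketch (not the canonical method): estimate Heaps' law parameters
# K and beta from a text by fitting log V(n) = log K + beta * log n.
# Assumptions: whitespace tokenization, lowercasing, and the file name.
import numpy as np

def vocabulary_growth(tokens):
    """Return V(n), the number of distinct words after the first n tokens."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return np.array(growth, dtype=float)

def fit_heaps(growth):
    """Least-squares fit in log-log space; returns (K, beta)."""
    n = np.arange(1, len(growth) + 1, dtype=float)
    beta, log_k = np.polyfit(np.log(n), np.log(growth), 1)
    return np.exp(log_k), beta

if __name__ == "__main__":
    tokens = open("war_and_peace.txt", encoding="utf-8").read().lower().split()
    K, beta = fit_heaps(vocabulary_growth(tokens))
    print(f"K = {K:.1f}, beta = {beta:.2f}")  # English prose: beta typically 0.4-0.6
```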

The law is frequently attributed to Harold Stanley Heaps, but was originally discovered by Gustav Herdan (1960). [1] Under mild assumptions, the Herdan–Heaps law is asymptotically equivalent to Zipf's law concerning the frequencies of individual words within a text. [2] This is a consequence of the fact that the type–token relation (in general) of a homogeneous text can be derived from the distribution of its types. [3]

Empirically, Heaps' law is preserved even when the document is randomly shuffled, [4] meaning that it does not depend on the ordering of words, but only on their frequencies. [5] This is used as evidence for deriving Heaps' law from Zipf's law. [4]
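A minimal sketch of this shuffling check is given below; it reuses the vocabulary_growth() and fit_heaps() helpers from the previous sketch, and the file name is again an illustrative assumption. Because shuffling preserves word frequencies exactly, the fitted exponent β is expected to change very little, while K may differ.

```python
# Minimal sketch of the shuffling check, reusing vocabulary_growth() and
# fit_heaps() from the sketch above. Shuffling destroys word order but keeps
# word frequencies, so the fitted exponent beta should change very little.
import random

tokens = open("war_and_peace.txt", encoding="utf-8").read().lower().split()
shuffled = list(tokens)
random.Random(0).shuffle(shuffled)  # fixed seed only for reproducibility

K_orig, beta_orig = fit_heaps(vocabulary_growth(tokens))
K_shuf, beta_shuf = fit_heaps(vocabulary_growth(shuffled))
print(f"original: K = {K_orig:.1f}, beta = {beta_orig:.2f}")
print(f"shuffled: K = {K_shuf:.1f}, beta = {beta_shuf:.2f}")
```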

Heaps' law means that as more instance text is gathered, there will be diminishing returns in terms of discovery of the full vocabulary from which the distinct terms are drawn.

Deviations from Heaps' law, as typically observed in English text corpora, have been identified in corpora generated with large language models. [6]

Heaps' law also applies to situations in which the "vocabulary" is just some set of distinct types which are attributes of some collection of objects. For example, the objects could be people, and the types could be country of origin of the person. If persons are selected randomly (that is, we are not selecting based on country of origin), then Heaps' law says we will quickly have representatives from most countries (in proportion to their population), but it will become increasingly difficult to cover the entire set of countries by continuing this method of sampling. Heaps' law has also been observed in single-cell transcriptomes, [7] with genes playing the role of the distinct types in the "vocabulary".
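As an illustration of this non-linguistic reading, the following Python sketch samples "people" whose "type" (say, country of origin) follows a Zipf-like frequency distribution and reports how slowly the full set of types is covered; the number of types, sample size, and the Zipf-like distribution are all illustrative assumptions.

```python
# Minimal sketch of Heaps-like behaviour outside linguistics: "people" are
# sampled whose "type" (e.g. country of origin) follows a Zipf-like frequency
# distribution. Common types are seen almost immediately; covering the full
# set of types takes far longer. All numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
num_types = 200                                 # e.g. number of countries
ranks = np.arange(1, num_types + 1)
probs = (1.0 / ranks) / np.sum(1.0 / ranks)     # Zipf-like type frequencies

samples = rng.choice(num_types, size=100_000, p=probs)
seen = set()
for n, t in enumerate(samples, start=1):
    seen.add(int(t))
    if n in (100, 1_000, 10_000, 100_000):
        print(f"after {n:>6} samples: {len(seen):>3} of {num_types} types seen")
```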

See also

Related Research Articles

Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

Natural language processing – Field of linguistics and computer science

Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Zipf's law – Probability distribution

Zipf's law is an empirical law that often holds, approximately, when a list of measured values is sorted in decreasing order. It states that the value of the nth entry is inversely proportional to n.

In linguistics and natural language processing, a corpus or text corpus is a dataset consisting of natively digital and older, digitized language resources, either annotated or unannotated.

In probability theory and statistics, the Zipf–Mandelbrot law is a discrete probability distribution. Also known as the Pareto–Zipf law, it is a power-law distribution on ranked data, named after the linguist George Kingsley Zipf who suggested a simpler distribution called Zipf's law, and the mathematician Benoit Mandelbrot, who subsequently generalized it.

Brown Corpus – Data set of American English in 1961

The Brown University Standard Corpus of Present-Day American English is an electronic collection of text samples of American English, the first major structured corpus of varied genres. This corpus first set the bar for the scientific study of the frequency and distribution of word categories in everyday language use. Compiled by Henry Kučera and W. Nelson Francis at Brown University, in Rhode Island, it is a general language corpus containing 500 samples of English, totaling roughly one million words, compiled from works published in the United States in 1961.

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in each document in a collection. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a document-feature matrix where "features" may refer to other properties of a document besides terms. It is also common to encounter the transpose, or term-document matrix, where documents are the columns and terms are the rows. They are useful in the field of natural language processing and computational text analysis.

Informetrics – Study of the quantitative aspects of information

Informetrics is the study of quantitative aspects of information. It is an extension and evolution of traditional bibliometrics and scientometrics. Informetrics uses bibliometric and scientometric methods to study mainly the problems of literature information management and the evaluation of science and technology. It is an independent discipline that uses quantitative methods from mathematics and statistics to study the processes, phenomena, and laws of information. Informetrics has gained more attention as it is a common scientific method for academic evaluation, research hotspots in disciplines, and trend analysis.

A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.

In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

An empirical statistical law or a law of statistics represents a type of behaviour that has been found across a number of datasets and, indeed, across a range of types of datasets. Many of these observations have been formulated and proved as statistical or probabilistic theorems, and the term "law" has been carried over to these theorems. There are other statistical and probabilistic theorems that also have "law" as a part of their names but that are not obviously derived from empirical observations. However, both types of "law" may be considered instances of a scientific law in the field of statistics. What distinguishes an empirical statistical law from a formal statistical theorem is the way these patterns simply appear in natural distributions, without prior theoretical reasoning about the data.

Menzerath's law, or Menzerath–Altmann law, is a linguistic law according to which the increase of the size of a linguistic construct results in a decrease of the size of its constituents, and vice versa.

Quantitative linguistics (QL) is a sub-discipline of general linguistics and, more specifically, of mathematical linguistics. Quantitative linguistics deals with language learning, language change, and application as well as structure of natural languages. QL investigates languages using statistical methods; its most demanding objective is the formulation of language laws and, ultimately, of a general theory of language in the sense of a set of interrelated language laws. Synergetic linguistics was from its very beginning specifically designed for this purpose. QL is empirically based on the results of language statistics, a field which can be interpreted as statistics of languages or as statistics of any linguistic object. This field is not necessarily connected to substantial theoretical ambitions. Corpus linguistics and computational linguistics are other fields which contribute important empirical evidence.

Macmillan English Dictionary for Advanced Learners, also known as MEDAL, is an advanced learner's dictionary first published in 2002 by Macmillan Education. It shares most of the features of this type of dictionary: it provides definitions in simple language, using a controlled defining vocabulary; most words have example sentences to illustrate how they are typically used; and information is given about how words combine grammatically or in collocations. MEDAL also introduced a number of innovations.

The following outline is provided as an overview of and topical guide to natural-language processing:

Word embedding – Method in natural language processing

In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.

Brevity law – Linguistics law

In linguistics, the brevity law is a linguistic law that qualitatively states that the more frequently a word is used, the shorter that word tends to be, and vice versa; the less frequently a word is used, the longer it tends to be. This is a statistical regularity found in natural languages and other natural systems, and it is claimed to hold as a general rule.

Bidirectional Encoder Representations from Transformers (BERT) is a language model based on the transformer architecture, notable for its dramatic improvement over previous state of the art models. It was introduced in October 2018 by researchers at Google. A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in Natural Language Processing (NLP) experiments counting over 150 research publications analyzing and improving the model."

A word n-gram language model is a purely statistical model of language. It has been superseded by recurrent neural network-based models, which have in turn been superseded by large language models. It is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word is considered, it is called a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model. Special tokens are introduced to denote the start and end of a sentence.

References

Citations

  1. Egghe (2007): "Herdan's law in linguistics and Heaps' law in information retrieval are different formulations of the same phenomenon".
  2. Kornai (1999); Baeza-Yates & Navarro (2000); van Leijenhorst & van der Weide (2005).
  3. Milička (2009)
  4. Sano, Yukie; Takayasu, Hideki; Takayasu, Misako (2012). "Zipf's Law and Heaps' Law Can Predict the Size of Potential Words". Progress of Theoretical Physics Supplement. 194: 202–209. Bibcode:2012PThPS.194..202S. doi:10.1143/PTPS.194.202. ISSN 0375-9687.
  5. Najafi, Elham; Darooneh, Amir H. (2015-06-19). Esteban, Francisco J. (ed.). "The Fractal Patterns of Words in a Text: A Method for Automatic Keyword Extraction". PLOS ONE. 10 (6): e0130617. Bibcode:2015PLoSO..1030617N. doi:10.1371/journal.pone.0130617. ISSN 1932-6203. PMC 4474631. PMID 26091207.
  6. Lai, Uyen; Randhawa, Gurjit; Sheridan, Paul (2023-12-12). "Heaps' Law in GPT-Neo Large Language Model Emulated Corpora". Proceedings of the Tenth International Workshop on Evaluating Information Access (EVIA 2023), a Satellite Workshop of the NTCIR-17 Conference. Tokyo, Japan. pp. 20–23. doi:10.20736/0002001352.
  7. Lazzardi, Silvia; Valle, Filippo; Mazzolini, Andrea; Scialdone, Antonio; Caselle, Michele; Osella, Matteo (2021-06-17). "Emergent Statistical Laws in Single-Cell Transcriptomic Data". bioRxiv: 2021.06.16.448706. doi:10.1101/2021.06.16.448706. S2CID 235482777. Retrieved 2021-06-18.

Sources

  • Baeza-Yates, Ricardo; Navarro, Gonzalo (2000), "Block addressing indices for approximate text retrieval", Journal of the American Society for Information Science, 51 (1): 69–82, CiteSeerX 10.1.1.31.4832, doi:10.1002/(sici)1097-4571(2000)51:1<69::aid-asi10>3.0.co;2-c.
  • Egghe, L. (2007), "Untangling Herdan's law and Heaps' law: Mathematical and informetric arguments", Journal of the American Society for Information Science and Technology, 58 (5): 702–709, doi:10.1002/asi.20524.
  • Heaps, Harold Stanley (1978), Information Retrieval: Computational and Theoretical Aspects, Academic Press. Heaps' law is proposed in Section 7.5 (pp. 206–208).
  • Herdan, Gustav (1960), Type-token mathematics, The Hague: Mouton.
  • Kornai, Andras (1999), "Zipf's law outside the middle range", in Rogers, James (ed.), Proceedings of the Sixth Meeting on Mathematics of Language, University of Central Florida, pp. 347–356.
  • Milička, Jiří (2009), "Type-token & Hapax-token Relation: A Combinatorial Model", Glottotheory. International Journal of Theoretical Linguistics, 1 (2): 99–110, doi:10.1515/glot-2009-0009, S2CID 124490442.
  • van Leijenhorst, D. C.; van der Weide, Th. P. (2005), "A formal derivation of Heaps' Law", Information Sciences, 170 (2–4): 263–272, doi:10.1016/j.ins.2004.03.006.
  • This article incorporates material from Heaps' law on PlanetMath, which is licensed under the Creative Commons Attribution/Share-Alike License.