This article has an unclear citation style .(March 2021) |
This article includes a list of general references, but it lacks sufficient corresponding inline citations .(December 2023) |
A word list (or lexicon) is a list of a language's lexicon (generally sorted by frequency of occurrence either by levels or as a ranked list) within some given text corpus, serving the purpose of vocabulary acquisition. A lexicon sorted by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort" (Nation 1997), but is mainly intended for course writers, not directly for learners. Frequency lists are also made for lexicographical purposes, serving as a sort of checklist to ensure that common words are not left out. Some major pitfalls are the corpus content, the corpus register, and the definition of "word". While word counting is a thousand years old, with still gigantic analysis done by hand in the mid-20th century, natural language electronic processing of large corpora such as movie subtitles (SUBTLEX megastudy) has accelerated the research field.
In computational linguistics, a frequency list is a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.
Type | Occurrences | Rank |
---|---|---|
the | 3,789,654 | 1st |
he | 2,098,762 | 2nd |
[...] | ||
king | 57,897 | 1,356th |
boy | 56,975 | 1,357th |
[...] | ||
stringyfy | 5 | 34,589th |
[...] | ||
transducionalify | 1 | 123,567th |
Nation (Nation 1997) noted the incredible help provided by computing capabilities, making corpus analysis much easier. He cited several key issues which influence the construction of frequency lists:
Most of currently available studies are based on written text corpus, more easily available and easy to process.
However, New et al. 2007 proposed to tap into the large number of subtitles available online to analyse large numbers of speeches. Brysbaert & New 2009 made a long critical evaluation of this traditional textual analysis approach, and support a move toward speech analysis and analysis of film subtitles available online. This has recently been followed by a handful of follow-up studies, [1] providing valuable frequency count analysis for various languages. Indeed, the SUBTLEX movement completed in five years full studies for French (New et al. 2007), American English (Brysbaert & New 2009; Brysbaert, New & Keuleers 2012), Dutch (Keuleers & New 2010), Chinese (Cai & Brysbaert 2010), Spanish (Cuetos et al. 2011), Greek (Dimitropoulou et al. 2010), Vietnamese (Pham, Bolger & Baayen 2011), Brazil Portuguese (Tang 2012) and Portugal Portuguese (Soares et al. 2015), Albanian (Avdyli & Cuetos 2013), Polish (Mandera et al. 2014) and Catalan (2019 [2] ). SUBTLEX-IT (2015) provides raw data only. [3]
In any case, the basic "word" unit should be defined. For Latin scripts, words are usually one or several characters separated either by spaces or punctuation. But exceptions can arise, such as English "can't", French "aujourd'hui", or idioms. It may also be preferable to group words of a word family under the representation of its base word. Thus, possible, impossible, possibility are words of the same word family, represented by the base word *possib*. For statistical purpose, all these words are summed up under the base word form *possib*, allowing the ranking of a concept and form occurrence. Moreover, other languages may present specific difficulties. Such is the case of Chinese, which does not use spaces between words, and where a specified chain of several characters can be interpreted as either a phrase of unique-character words, or as a multi-character word.
It seems that Zipf's law holds for frequency lists drawn from longer texts of any natural language. Frequency lists are a useful tool when building an electronic dictionary, which is a prerequisite for a wide range of applications in computational linguistics.
German linguists define the Häufigkeitsklasse (frequency class) of an item in the list using the base 2 logarithm of the ratio between its frequency and the frequency of the most frequent item. The most common item belongs to frequency class 0 (zero) and any item that is approximately half as frequent belongs in class 1. In the example list above, the misspelled word outragious has a ratio of 76/3789654 and belongs in class 16.
where is the floor function.
Frequency lists, together with semantic networks, are used to identify the least common, specialized terms to be replaced by their hypernyms in a process of semantic compression.
Those lists are not intended to be given directly to students, but rather to serve as a guideline for teachers and textbook authors (Nation 1997). Paul Nation's modern language teaching summary encourages first to "move from high frequency vocabulary and special purposes [thematic] vocabulary to low frequency vocabulary, then to teach learners strategies to sustain autonomous vocabulary expansion" (Nation 2006).
Word frequency is known to have various effects (Brysbaert et al. 2011; Rudell 1993). Memorization is positively affected by higher word frequency, likely because the learner is subject to more exposures (Laufer 1997). Lexical access is positively influenced by high word frequency, a phenomenon called word frequency effect (Segui et al.). The effect of word frequency is related to the effect of age-of-acquisition, the age at which the word was learned.
Below is a review of available resources.
Word counting is an ancient field, [4] with known discussion back to Hellenistic time. In 1944, Edward Thorndike, Irvin Lorge and colleagues [5] hand-counted 18,000,000 running words to provide the first large-scale English language frequency list, before modern computers made such projects far easier (Nation 1997). 20th century's works all suffer from their age. In particular, words relating to technology, such as "blog," which, in 2014, was #7665 in frequency [6] in the Corpus of Contemporary American English, [7] was first attested to in 1999, [8] [9] [10] and does not appear in any of these three lists.
The Teacher Word Book contains 30,000 lemmas or ~13,000 word families (Goulden, Nation and Read, 1990). A corpus of 18 million written words was hand analysed. The size of its source corpus increased its usefulness, but its age, and language changes, have reduced its applicability (Nation 1997).
The General Service List contains 2,000 headwords divided into two sets of 1,000 words. A corpus of 5 million written words was analyzed in the 1940s. The rate of occurrence (%) for different meanings, and parts of speech, of the headword are provided. Various criteria, other than frequence and range, were carefully applied to the corpus. Thus, despite its age, some errors, and its corpus being entirely written text, it is still an excellent database of word frequency, frequency of meanings, and reduction of noise (Nation 1997). This list was updated in 2013 by Dr. Charles Browne, Dr. Brent Culligan and Joseph Phillips as the New General Service List.
A corpus of 5 million running words, from written texts used in United States schools (various grades, various subject areas). Its value is in its focus on school teaching materials, and its tagging of words by the frequency of each word, in each of the school grade, and in each of the subject areas (Nation 1997).
These now contain 1 million words from a written corpus representing different dialects of English. These sources are used to produce frequency lists (Nation 1997).
A review has been made by New & Pallier. An attempt was made in the 1950s–60s with the Français fondamental. It includes the F.F.1 list with 1,500 high-frequency words, completed by a later F.F.2 list with 1,700 mid-frequency words, and the most used syntax rules. [11] It is claimed that 70 grammatical words constitute 50% of the communicatives sentence, [12] [13] while 3,680 words make about 95~98% of coverage. [14] A list of 3,000 frequent words is available. [15]
The French Ministry of the Education also provide a ranked list of the 1,500 most frequent word families, provided by the lexicologue Étienne Brunet. [16] Jean Baudot made a study on the model of the American Brown study, entitled "Fréquences d'utilisation des mots en français écrit contemporain". [17]
More recently, the project Lexique3 provides 142,000 French words, with orthography, phonetic, syllabation, part of speech, gender, number of occurrence in the source corpus, frequency rank, associated lexemes, etc., available under an open license CC-by-sa-4.0. [18]
This Lexique3 is a continuous study from which originate the Subtlex movement cited above. New et al. 2007 made a completely new counting based on online film subtitles.
There have been several studies of Spanish word frequency (Cuetos et al. 2011). [19]
Chinese corpora have long been studied from the perspective of frequency lists. The historical way to learn Chinese vocabulary is based on characters frequency (Allanic 2003). American sinologist John DeFrancis mentioned its importance for Chinese as a foreign language learning and teaching in Why Johnny Can't Read Chinese (DeFrancis 1966). As a frequency toolkit, Da (Da 1998) and the Taiwanese Ministry of Education (TME 1997) provided large databases with frequency ranks for characters and words. The HSK list of 8,848 high and medium frequency words in the People's Republic of China, and the Republic of China (Taiwan)'s TOP list of about 8,600 common traditional Chinese words are two other lists displaying common Chinese words and characters. Following the SUBTLEX movement, Cai & Brysbaert 2010 recently made a rich study of Chinese word and character frequencies.
Wiktionary:Frequency lists contains frequency lists in more languages. [20]
Most frequently used words in different languages based on Wikipedia or combined corpora. [21]
Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real world", text of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.
In linguistics and natural language processing, a corpus or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.
Word-sense disambiguation is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious.
A vocabulary is a set of words, typically the set in a language or the set known to an individual. The word vocabulary originated from the Latin vocabulum, meaning "a word, name". It forms an essential component of language and communication, helping convey thoughts, ideas, emotions, and information. Vocabulary can be oral, written, or signed and can be categorized into two main types: active vocabulary and passive vocabulary. An individual's vocabulary continually evolves through various methods, including direct instruction, independent reading, and natural language exposure, but it can also shrink due to forgetting, trauma, or disease. Furthermore, vocabulary is a significant focus of study across various disciplines, like linguistics, education, psychology, and artificial intelligence. Vocabulary is not limited to single words; it also encompasses multi-word units known as collocations, idioms, and other types of phraseology. Acquiring an adequate vocabulary is one of the largest challenges in learning a second language.
In corpus linguistics a key word is a word which occurs in a text more often than we would expect to occur by chance alone. Key words are calculated by carrying out a statistical test which compares the word frequencies in a text against their expected frequencies derived in a much larger corpus, which acts as a reference for general language use. Keyness is then the quality a word or phrase has of being "key" in its context. Combinations of nouns with parts of speech that human readers would not likely notice, such as prepositions, time adverbs, and pronouns can be a relevant part of keyness. Even separate pronouns can constitute keywords.
A concordance is an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context. Historically, concordances have been compiled only for works of special importance, such as the Vedas, Bible, Qur'an or the works of Shakespeare, James Joyce or classical Latin and Greek authors, because of the time, difficulty, and expense involved in creating a concordance in the pre-computer era.
In linguistics, the term lexis designates the complete set of all possible words in a language, or a particular subset of words that are grouped by some specific linguistic criteria. For example, the general term English lexis refers to all words of the English language, while more specific term English religious lexis refers to a particular subset within English lexis, encompassing only words that are semantically related to the religious sphere of life.
Studies that estimate and rank the most common words in English examine texts written in English. Perhaps the most comprehensive such analysis is one that was conducted against the Oxford English Corpus (OEC), a massive text corpus that is written in the English language.
In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.
The Academic Word List (AWL) is a word list of 570 English word families which appear with great frequency in a broad range of academic texts. The target readership is English as a second or foreign language students intending to enter English-medium higher education, and teachers of such students. The AWL was developed by Averil Coxhead at the School of Linguistics and Applied Language Studies at Victoria University of Wellington, New Zealand. This list replaced the previously widely used University Word List, developed by Xue and Nation in 1986. The words included in the AWL were selected based on their range, frequency, and dispersion, and were divided into ten sublists, each containing 1000 words in decreasing order of frequency. The AWL excludes words from the General Service List. Many words in the AWL are general vocabulary not restricted to an academic domain, such as the words area, approach, create, similar, and occur, found in Sublist One, and the AWL only accounts for a small percentage of the actual word occurrences in academic texts.
The General Service List (GSL) is a list of roughly 2,000 words published by Michael West in 1953. The words were selected to represent the most frequent words of English and were taken from a corpus of written English. The target audience was English language learners and ESL teachers. To maximize the utility of the list, some frequent words that overlapped broadly in meaning with words already on the list were omitted. In the original publication the relative frequencies of various senses of the words were also included.
The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of corpus linguistics at Brigham Young University (BYU).
Paul Nation is an internationally recognized scholar in the field of linguistics and teaching methodology. As a professor in the field of applied linguistics with a specialization in pedagogical methodology, he has been able to create a language teaching framework to identify key areas of language teaching focus. Paul Nation is best known for this framework, which has been labelled The Four Strands. He has also made notable contributions through his research in the field of language acquisition that focuses on the benefits of extensive reading and repetition as well as intensive reading. Nation's numerous contributions to the linguistics research community through his published work has allowed him to share his knowledge and experience so that others may adopt and adapt it. He is credited with bringing « legitimization to second language vocabulary researches » in 1990.
Macmillan English Dictionary for Advanced Learners, also known as MEDAL, is an advanced learner's dictionary first published in 2002 by Macmillan Education. It shares most of the features of this type of dictionary: it provides definitions in simple language, using a controlled defining vocabulary; most words have example sentences to illustrate how they are typically used; and information is given about how words combine grammatically or in collocations. MEDAL also introduced a number of innovations. These include:
Mark E. Davies is an American linguist. He specializes in corpus linguistics and language variation and change. He is the creator of most of the text corpora from English-Corpora.org as well as the Corpus del español and the Corpus do português. He has also created large datasets of word frequency, collocates, and n-grams data, which have been used by many large companies in the fields of technology and also language learning.
The Google Books Ngram Viewer is an online search engine that charts the frequencies of any set of search strings using a yearly count of n-grams found in printed sources published between 1500 and 2022 in Google's text corpora in English, Chinese (simplified), French, German, Hebrew, Italian, Russian, or Spanish. There are also some specialized English corpora, such as American English, British English, and English Fiction.
Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in over 90 languages.
Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov and colleagues at Google and published in 2013.
Age of acquisition is a psycholinguistic variable referring to the age at which a word is typically learned. For example, the word 'penguin' is typically learned at a younger age than the word 'albatross'. Studies in psycholinguistics suggest that age of acquisition has an effect on the speed of reading words. The findings have demonstrated that early-acquired words are processed more quickly than later-acquired words. It is a particularly strong variable in predicting the speed of picture naming. It has been generally found that words that are more frequent, shorter, more familiar and refer to concrete concepts are learned earlier than the counterparts in simple words and compound words, together with their respective lexemes. In addition, the AoA effect has been demonstrated in several languages and in bilingual speakers.
{{citation}}
: CS1 maint: multiple names: authors list (link){{citation}}
: CS1 maint: DOI inactive as of November 2024 (link)