Studies that estimate and rank the most common words in English examine texts written in English. Perhaps the most comprehensive such analysis was conducted on the Oxford English Corpus (OEC), a very large corpus of English-language texts.
In total, the texts in the Oxford English Corpus contain more than 2 billion words. [1] The OEC includes a wide variety of writing samples, such as literary works, novels, academic journals, newspapers, magazines, Hansard's Parliamentary Debates, blogs, chat logs, and emails. [2]
Another English corpus that has been used to study word frequency is the Brown Corpus, which was compiled by researchers at Brown University in the 1960s. The researchers published their analysis of the Brown Corpus in 1967. Their findings were similar, but not identical, to the findings of the OEC analysis.
According to The Reading Teacher's Book of Lists, the first 25 words in the OEC make up about one-third of all printed material in English, and the first 100 words make up about half of all written English. [3] According to a study cited by Robert McCrum in The Story of English, all of the first hundred of the most common words in English are of Old English origin, [4] except for "people", ultimately from Latin "populus", and "because", in part from Latin "causa".
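The coverage figures above (the share of all running words accounted for by the top N words) can be illustrated with a small calculation. The following is a minimal Python sketch, not the method used by any of the cited sources: the function name `coverage`, the simple regular-expression tokeniser, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def coverage(text, top_n):
    """Return the fraction of all tokens covered by the top_n most frequent words.

    Tokenisation here is a deliberately naive assumption: lowercase the text
    and keep runs of letters and apostrophes. Real corpora use richer rules.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Sum the counts of the top_n most frequent word forms.
    top_total = sum(count for _, count in counts.most_common(top_n))
    return top_total / len(tokens)


# Hypothetical toy text; a real estimate needs a large, balanced corpus.
sample = "the cat sat on the mat and the dog sat on the rug"
print(round(coverage(sample, 1), 3))  # share covered by the single most common word
print(round(coverage(sample, 3), 3))  # share covered by the three most common words
```

On a toy sentence the numbers are not meaningful in themselves; the point is only that coverage is a cumulative sum over a frequency-ranked list divided by the total token count.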
Some lists of common words distinguish between word forms, while others rank all forms of a word as a single lexeme, represented by its lemma (the form of the word as it would appear in a dictionary). For example, the lexeme be (as in to be) comprises all its conjugations (is, was, am, are, were, etc.) and contractions of those conjugations. [5] The top 100 lemmas listed below account for 50% of all the words in the Oxford English Corpus. [1]
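The difference between counting word forms and counting lemmas can be sketched in a few lines of Python. This is an illustrative sketch only: the `LEMMA_OF` table is a tiny hand-written assumption covering some forms of be, whereas real corpus analyses rely on a full lemmatiser and part-of-speech information.

```python
from collections import Counter

# Hypothetical, hand-written mapping from word forms to a lemma.
# It covers only some forms of "be"; any form not listed is its own lemma.
LEMMA_OF = {
    "am": "be", "is": "be", "are": "be",
    "was": "be", "were": "be", "been": "be", "being": "be",
    "'m": "be", "'re": "be",
}


def count_forms(tokens):
    """Count each distinct word form separately."""
    return Counter(tokens)


def count_lemmas(tokens):
    """Count tokens after mapping every form to its lemma."""
    return Counter(LEMMA_OF.get(token, token) for token in tokens)


# Toy, pre-tokenised input (an assumption; tokenisation is a separate step).
tokens = ["i", "am", "here", "and", "you", "are", "there", "it", "was", "late"]

form_counts = count_forms(tokens)
lemma_counts = count_lemmas(tokens)

print(form_counts["am"], form_counts["are"], form_counts["was"])
print(lemma_counts["be"])
```

Counted by form, am, are and was each occur once and none stands out; counted by lemma, be becomes the most frequent item in the sample. This is why lists built on lemmas and lists built on word forms can rank the same vocabulary differently.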
A list of 100 words that occur most frequently in written English is given below, based on an analysis of the Oxford English Corpus (a collection of texts in the English language, comprising over 2 billion words). [1] A part of speech is provided for most of the words, but part-of-speech categories vary between analyses, and not all possibilities are listed. For example, "I" may be a pronoun or a Roman numeral; "to" may be a preposition or an infinitive marker; "time" may be a noun or a verb. Also, a single spelling can represent more than one root word. For example, "singer" may be a form of either "sing" or "singe". Different corpora may treat such differences differently.
The number of distinct senses that are listed in Wiktionary is shown in the polysemy column. For example, "out" can refer to an escape, a removal from play in baseball, or any of 36 other concepts. On average, each word in the list has 15.38 senses. The sense count does not include the use of terms in phrasal verbs such as "put out" (as in "inconvenienced") and other multiword expressions such as the interjection "get out!", where the word "out" does not have an individual meaning. [6] As an example, "out" occurs in at least 560 phrasal verbs [7] and appears in nearly 1700 multiword expressions. [8]
The table also includes frequencies from other corpora. In addition to usage differences, lemmatisation may differ from corpus to corpus, for example by splitting the prepositional use of "to" from its use as a particle. Also, the Corpus of Contemporary American English (COCA) list uses dispersion as well as frequency to calculate rank.
Word | Parts of speech | OEC rank | COCA rank [9] | Dolch level | Polysemy |
---|---|---|---|---|---|
the | Article | 1 | 1 | Pre-primer | 12 |
be | Verb | 2 | 2 | Primer | 21 |
to | Preposition | 3 | 7, 9 | Pre-primer | 17 |
of | Preposition | 4 | 4 | Grade 1 | 12 |
and | Coordinator | 5 | 3 | Pre-primer | 16 |
a | Article | 6 | 5 | Pre-primer | 20 |
in | Preposition | 7 | 6, 128, 3038 | Pre-primer | 23 |
that | Subordinator, determiner | 8 | 12, 27, 903 | Primer | 17 |
have | Verb | 9 | 8 | Primer | 25 |
I | Pronoun | 10 | 11 | Pre-primer | 7 |
it | Pronoun | 11 | 10 | Pre-primer | 18 |
for | Preposition | 12 | 13, 2339 | Pre-primer | 19 |
not | Adverb et al. | 13 | 28, 2929 | Pre-primer | 5 |
on | Preposition | 14 | 17, 155 | Primer | 43 |
with | Preposition | 15 | 16 | Primer | 11 |
he | Pronoun | 16 | 15 | Primer | 7 |
as | Adverb, preposition | 17 | 33, 49, 129 | Grade 1 | 17 |
you | Pronoun | 18 | 14 | Pre-primer | 9 |
do | Verb, noun | 19 | 18 | Primer | 38 |
at | Preposition | 20 | 22 | Primer | 14 |
this | Determiner, adverb, noun | 21 | 20, 4665 | Primer | 9 |
but | Preposition, adverb, coordinator | 22 | 23, 1715 | Primer | 17 |
his | Possessive pronoun | 23 | 25, 1887 | Grade 1 | 6 |
by | Preposition | 24 | 30, 1190 | Grade 1 | 19 |
from | Preposition | 25 | 26 | Grade 1 | 4 |
they | Pronoun | 26 | 21 | Primer | 6 |
we | Pronoun | 27 | 24 | Pre-primer | 6 |
say | Verb et al. | 28 | 19 | Primer | 17 |
her | Possessive pronoun | 29 | 42, 106 | Grade 1 | 3 |
she | Pronoun | 30 | 31 | Primer | 7 |
or | Coordinator | 31 | 32 | Grade 2 | 11 |
an | Article | 32 | (a) | Grade 1 | 6 |
will | Verb, noun | 33 | 48, 1506 | Primer | 16 |
my | Possessive pronoun | 34 | 44 | Pre-primer | 5 |
one | Noun, adjective, et al. | 35 | 51, 104, 839 | Pre-primer | 24 |
all | Adjective | 36 | 43, 222 | Primer | 15 |
would | Verb | 37 | 41 | Grade 2 | 13 |
there | Adverb, pronoun, et al. | 38 | 53, 116 | Primer | 14 |
their | Possessive pronoun | 39 | 36 | Grade 2 | 2 |
what | Pronoun, adverb, et al. | 40 | 34 | Primer | 19 |
so | Coordinator, adverb, et al. | 41 | 55, 196 | Primer | 18 |
up | Adverb, preposition, et al. | 42 | 50, 456 | Pre-primer | 50 |
out | Preposition | 43 | 64, 149 | Primer | 38 |
if | Preposition | 44 | 40 | Grade 3 | 9 |
about | Preposition, adverb, et al. | 45 | 46, 179 | Grade 3 | 18 |
who | Pronoun, noun | 46 | 38 | Primer | 5 |
get | Verb | 47 | 39 | Primer | 37 |
which | Pronoun | 48 | 58 | Grade 2 | 7 |
go | Verb, noun | 49 | 35 | Pre-primer | 54 |
me | Pronoun | 50 | 61 | Pre-primer | 10 |
when | Adverb | 51 | 57, 136 | Grade 1 | 11 |
make | Verb, noun | 52 | 45 | Grade 2 [as "made"] | 48 |
can | Verb, noun | 53 | 37, 2973 | Pre-primer | 18 |
like | Preposition, verb | 54 | 74, 208, 1123, 1684, 2702 | Primer | 26 |
time | Noun | 55 | 52 | Dolch list of 95 nouns | 14 |
no | Determiner, adverb | 56 | 93, 699, 916, 1111, 4555 | Primer | 10 |
just | Adjective | 57 | 66, 1823 | Grade 1 | 14 |
him | Pronoun | 58 | 68 | Grade 1 | 5 |
know | Verb, noun | 59 | 47 | Grade 1 | 13 |
take | Verb, noun | 60 | 63 | Grade 1 | 66 |
people | Noun | 61 | 62 | | 9 |
into | Preposition | 62 | 65 | Primer | 10 |
year | Noun | 63 | 54 | | 7 |
your | Possessive pronoun | 64 | 69 | Grade 2 | 4 |
good | Adjective | 65 | 110, 2280 | Primer | 32 |
some | Determiner | 66 | 60 | Grade 1 | 10 |
could | Verb | 67 | 71 | Grade 1 | 6 |
them | Pronoun | 68 | 59 | Grade 1 | 3 |
see | Verb | 69 | 67 | | 25 |
other | Adjective, pronoun | 70 | 75, 715, 2355 | | 12 |
than | Preposition | 71 | 73, 712 | | 4 |
then | Adverb | 72 | 77 | Grade 1 | 10 |
now | Preposition | 73 | 72, 1906 | Primer | 13 |
look | Verb | 74 | 85, 604 | Pre-primer | 17 |
only | Adverb | 75 | 101, 329 | Grade 3 | 11 |
come | Verb | 76 | 70 | Pre-primer | 20 |
its | Possessive pronoun | 77 | 78 | Grade 2 | 2 |
over | Preposition | 78 | 124, 182 | Grade 1 | 19 |
think | Verb | 79 | 56 | Grade 1 | 10 |
also | Adverb | 80 | 87 | | 2 |
back | Noun, adverb | 81 | 108, 323, 1877 | Dolch list of 95 nouns | 36 |
after | Preposition | 82 | 120, 260 | Grade 1 | 14 |
use | Verb, noun | 83 | 92, 429 | Grade 2 | 17 |
two | Noun | 84 | 80 | Pre-primer | 6 |
how | Adverb | 85 | 76 | Grade 1 | 11 |
our | Possessive pronoun | 86 | 79 | Primer | 3 |
work | Verb, noun | 87 | 117, 199 | Grade 2 | 28 |
first | Adjective | 88 | 86, 2064 | Grade 2 | 10 |
well | Adverb | 89 | 100, 644 | Primer | 30 |
way | Noun, adverb | 90 | 84, 4090 | Dolch list of 95 nouns | 16 |
even | Adjective | 91 | 107, 484 | | 23 |
new | Adjective et al. | 92 | 88 | Primer | 18 |
want | Verb | 93 | 83 | Primer | 10 |
because | Preposition | 94 | 89, 509 | Grade 2 | 7 |
any | Pronoun | 95 | 109, 4720 | Grade 1 | 4 |
these | Pronoun | 96 | 82 | Grade 2 | 2 |
give | Verb | 97 | 98 | Grade 1 | 19 |
day | Noun | 98 | 90 | Dolch list of 95 nouns | 9 |
most | Adverb | 99 | 144, 187 | | 12 |
us | Pronoun | 100 | 113 | Grade 2 | 6 |
The following is a very similar list, also from the OEC, subdivided by part of speech. [1] The list labeled "Others" includes pronouns, possessives, articles, modal verbs, adverbs, and conjunctions.
Rank | Nouns | Verbs | Adjectives | Prepositions | Others |
---|---|---|---|---|---|
1 | time | be | good | to | the |
2 | person | have | new | of | and |
3 | year | do | first | in | a |
4 | way | say | last | for | that |
5 | day | get | long | on | I |
6 | thing | make | great | with | it |
7 | man | go | little | at | not |
8 | world | know | own | by | he |
9 | life | take | other | from | as |
10 | hand | see | old | up | you |
11 | part | come | right | about | this |
12 | child | think | big | into | but |
13 | eye | look | high | over | his |
14 | woman | want | different | after | they |
15 | place | give | small | | her |
16 | work | use | large | | she |
17 | week | find | next | | or |
18 | case | tell | early | | an |
19 | point | ask | young | | will |
20 | government | work | important | | my |
21 | company | seem | few | | one |
22 | number | feel | public | | all |
23 | group | try | bad | | would |
24 | problem | leave | same | | there |
25 | fact | call | able | | their |