N-gram

Last updated

Six n-grams frequently found in titles of publications about Coronavirus disease 2019 (COVID-19), as of 7 May 2020 LARGER FONT VERSION Six n-grams frequently found in titles of publications about Coronavirus disease 2019, as of 7 May 2020.svg
Six n-grams frequently found in titles of publications about Coronavirus disease 2019 (COVID-19), as of 7 May 2020

An n-gram is a sequence of n adjacent symbols in particular order. [1] The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are collected from a text corpus or speech corpus.

Contents

N-gram is actually the parent of a family of names term, where family members can be (depending on n numeral) 1-gram, 2-gram etc., or the same using spoken numeral prefixes.

If Latin numerical prefixes are used, then n-gram of size 1 is called a "unigram", size 2 a "bigram" (or, less commonly, a "digram") etc. If, instead of the Latin ones, the English cardinal numbers are furtherly used, then they are called "four-gram", "five-gram", etc. Similarly, using Greek numerical prefixes such as "monomer", "dimer", "trimer", "tetramer", "pentamer", etc., or English cardinal numbers, "one-mer", "two-mer", "three-mer", etc. are used in computational biology, for polymers or oligomers of a known size, called k-mers. When the items are words, n-grams may also be called shingles. [2]

In the context of natural language processing (NLP), the use of n-grams allows bag-of-words models to capture information such as word order, which would not be possible in the traditional bag of words setting.

Examples

(Shannon 1951) [3] discussed n-gram models of English. For example:

Figure 1 n-gram examples from various disciplines
FieldUnitSample sequence1-gram sequence2-gram sequence3-gram sequence
Vernacular nameunigrambigramtrigram
Order of resulting Markov model 012
Protein sequencing amino acid ... Cys-Gly-Leu-Ser-Trp ......, Cys, Gly, Leu, Ser, Trp, ......, Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp, ......, Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp, ...
DNA sequencing base pair ...AGCTTCGA......, A, G, C, T, T, C, G, A, ......, AG, GC, CT, TT, TC, CG, GA, ......, AGC, GCT, CTT, TTC, TCG, CGA, ...
Language model character ...to_be_or_not_to_be......, t, o, _, b, e, _, o, r, _, n, o, t, _, t, o, _, b, e, ......, to, o_, _b, be, e_, _o, or, r_, _n, no, ot, t_, _t, to, o_, _b, be, ......, to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be, ...
Word n-gram language model word ... to be or not to be ......, to, be, or, not, to, be, ......, to be, be or, or not, not to, to be, ......, to be or, be or not, or not to, not to be, ...

Figure 1 shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences.

Here are further examples; these are word-level 3-grams and 4-grams (and counts of the number of times they appeared) from the Google n-gram corpus. [4]

3-grams

4-grams

References

  1. "n-gram language model - an overview | ScienceDirect Topics". www.sciencedirect.com. Retrieved 12 December 2024.
  2. Broder, Andrei Z.; Glassman, Steven C.; Manasse, Mark S.; Zweig, Geoffrey (1997). "Syntactic clustering of the web". Computer Networks and ISDN Systems. 29 (8): 1157–1166. doi:10.1016/s0169-7552(97)00031-7. S2CID   9022773.
  3. Shannon, Claude E. "The redundancy of English." Cybernetics; Transactions of the 7th Conference, New York: Josiah Macy, Jr. Foundation. 1951.
  4. Franz, Alex; Brants, Thorsten (2006). "All Our N-gram are Belong to You". Google Research Blog. Archived from the original on 17 October 2006. Retrieved 16 December 2011.

Further reading

See also