Google Ngram Viewer

Example of an Ngram query

The Google Ngram Viewer or Google Books Ngram Viewer is an online search engine that charts the frequencies of any set of search strings using a yearly count of n-grams found in printed sources published between 1500 and 2019 [1] [2] [3] [4] in Google's text corpora in English, Chinese (simplified), French, German, Hebrew, Italian, Russian, or Spanish. [1] [2] [5] There are also some specialized English corpora, such as American English, British English, and English Fiction. [6]

The program can search for a word or a phrase, including misspellings or gibberish. [5] The n-grams are matched with the text within the selected corpus, and if found in 40 or more books, are then displayed as a graph. [6] The Google Ngram Viewer supports searches for parts of speech and wildcards. [6] It is routinely used in research. [7] [8]

History

During development, Google worked with two Harvard researchers, Jean-Baptiste Michel and Erez Lieberman Aiden, and quietly released the program on December 16, 2010. [2] [9] According to Steven Pinker, [10] one of the co-authors of the Science paper published on the same day, [1] it had previously been difficult to quantify the rate of linguistic change because no database had been designed for that purpose. The Google Ngram Viewer was therefore developed in the hope of opening a new window onto quantitative research in the humanities, and the database, publicly available from the start, contained 500 billion words from 5.2 million books. [2] [3] [9]

The intended audience was scholarly, but the Google Ngram Viewer made it easy for anyone with a computer to see a graph of the diachronic change in the use of words and phrases. Lieberman Aiden told The New York Times that the developers aimed to give even children the ability to browse cultural trends throughout history. [9] In the Science paper, Lieberman Aiden and his collaborators called the method of high-volume data analysis of digitized texts "culturomics". [1] [9]

Usage

Commas delimit user-entered search terms, where each comma-separated term is searched in the database as an n-gram (for example, "nursery school" is a 2-gram or bigram). [6] The Ngram Viewer then returns a plotted line chart. Note that due to limitations on the size of the Ngram database, only matches found in at least 40 books are indexed. [6]
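The query handling described above can be sketched in a few lines of Python. This is only an illustration, not Google's implementation: the function names `parse_query` and `is_indexed` are hypothetical, and counting whitespace-separated tokens to get n is an assumption, since the real tokenizer may treat punctuation differently. The threshold of 40 books is the figure given in the text.

```python
MIN_BOOKS = 40  # indexing threshold stated in the text above


def parse_query(query: str) -> list[tuple[str, int]]:
    """Split a comma-delimited query into (term, n) pairs.

    Assumption: n is the number of whitespace-separated tokens in the term.
    """
    terms = [part.strip() for part in query.split(",")]
    return [(term, len(term.split())) for term in terms if term]


def is_indexed(book_count: int) -> bool:
    """Return True if an n-gram occurs in enough books to be indexed."""
    return book_count >= MIN_BOOKS


print(parse_query("nursery school, kindergarten, child care"))
# [('nursery school', 2), ('kindergarten', 1), ('child care', 2)]
```

Here "nursery school" and "child care" are treated as bigrams and "kindergarten" as a unigram; an n-gram found in only 39 books would fail the `is_indexed` check and not appear in the chart.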

Limitations

The data sets of the Ngram Viewer have been criticized for their reliance on inaccurate OCR and for including large numbers of incorrectly dated and categorized texts. [11] [12] Because of these errors, and because the corpora are not controlled for bias [13] (for example, the growing share of scientific literature causes other terms to appear to decline in popularity), it is risky to use them to study language or test theories. [14] Furthermore, the data sets may not reflect general linguistic or cultural change and can only hint at such an effect, because, to avoid potential copyright infringement, they do not include metadata such as publication date, author, length, or genre. [15]
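The composition bias mentioned above can be illustrated with a small calculation. The sketch below assumes that a charted frequency is a term's yearly count divided by the total number of n-grams for that year; all counts are invented for illustration and are not taken from the Google Books data.

```python
import math


def relative_frequency(term_count: int, total_count: int) -> float:
    """Share of a year's n-grams accounted for by one term."""
    return term_count / total_count


# Hypothetical counts: the term is used exactly as often in both years,
# but the corpus quadruples in size (for example, through added
# scientific literature that never uses the term).
yearly_counts = {
    1900: (1_000, 1_000_000),
    2000: (1_000, 4_000_000),
}

frequencies = {
    year: relative_frequency(term, total)
    for year, (term, total) in yearly_counts.items()
}

for year, freq in frequencies.items():
    print(year, f"{freq:.5%}")
```

The term's absolute usage is unchanged, yet its relative frequency falls from 0.1% to 0.025%, which on a chart looks like a decline in popularity.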

Guidelines for doing research with data from Google Ngram have been proposed that address many of the issues discussed above. [16]

OCR issues

Optical character recognition, or OCR, is not always reliable, and some characters may not be scanned correctly. In particular, systematic errors such as the confusion of s and f in pre-19th-century texts (due to the use of ſ, the long s, which is similar in appearance to f) can introduce systemic bias. [14] Although Google Ngram Viewer claims that the results are reliable from 1800 onwards, poor OCR and insufficient data mean that frequencies given for languages such as Chinese may only be accurate from 1970 onward, with earlier parts of the corpus showing no results at all for common terms, and data for some years containing more than 50% noise. [17] [18]
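The long s problem can be made concrete with a toy example. The sketch below does not model any real OCR engine: it simply assumes that every ſ (U+017F) is misread as f, and compares the result with the correct reading as s. The function names and the word list are hypothetical.

```python
LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S


def misread_long_s(word: str) -> str:
    """Simulate an OCR pass that mistakes every long s for f."""
    return word.replace(LONG_S, "f")


def normalize_long_s(word: str) -> str:
    """Map every long s to the modern s it represents."""
    return word.replace(LONG_S, "s")


# Words as they might be printed in an 18th-century book.
printed = ["be\u017ft", "\u017fame", "ca\u017fe"]

print([misread_long_s(w) for w in printed])    # ['beft', 'fame', 'cafe']
print([normalize_long_s(w) for w in printed])  # ['best', 'same', 'case']
```

Under this assumption, "beſt" becomes the non-word "beft", while "ſame" and "caſe" become the real words "fame" and "cafe", so early counts for those words would be inflated and counts for "same" and "case" deflated.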

References

  1. Michel, Jean-Baptiste; Shen, Yuan K.; Aiden, Aviva P.; Veres, Adrian; Gray, Matthew K.; The Google Books Team; Pickett, Joseph P.; Hoiberg, Dale; Clancy, Dan; Norvig, Peter; Orwant, Jon; Pinker, Steven; Nowak, Martin A.; Aiden, Erez L. (2010). "Quantitative Analysis of Culture Using Millions of Digitized Books". Science. 331 (6014): 176–182.
  2. Bosker, Bianca (2010-12-17). "Google Ngram Database Tracks Popularity Of 500 Billion Words". The Huffington Post. Retrieved 2012-05-31.
  3. Whitney, Lance (2010-12-17). "Google's Ngram Viewer: A time machine for wordplay". Cnet.com. Archived from the original on 2014-01-23. Retrieved 2012-05-31.
  4. @searchliaison (2020-07-13). "The Google Books Ngram Viewer has now been updated with fresh data through 2019" (Tweet). Retrieved 2020-08-11 via Twitter.
  5. "Google Books Ngram Viewer - University at Buffalo Libraries". Lib.Buffalo.edu. 2011-08-22. Archived from the original on 2013-07-02. Retrieved 2012-05-31.
  6. "Google Books Ngram Viewer - Information". Retrieved 2024-06-01.
  7. Greenfield, Patricia M. (2013). "The Changing Psychology of Culture From 1800 Through 2000". Psychological Science. 24 (9): 1722–1731. doi:10.1177/0956797613479387. ISSN 0956-7976. PMID 23925305. S2CID 6123553.
  8. Younes, Nadja; Reips, Ulf-Dietrich (2018). "The changing psychology of culture in German-speaking countries: A Google Ngram study". International Journal of Psychology. 53: 53–62. doi:10.1002/ijop.12428. PMID 28474338. S2CID 7440938.
  9. "In 500 Billion Words, New Window on Culture". The New York Times. 2010-12-16. Retrieved 2024-06-01.
  10. The RSA (2010-02-04). "Steven Pinker – The Stuff of Thought: Language as a window into human nature". Retrieved 2024-06-02 via YouTube.
  11. "Google Ngrams: OCR and Metadata". ResourceShelf. 2010-12-19. Archived from the original on 2016-04-27. Retrieved 2015-04-19.
  12. Nunberg, Geoff (2010-12-16). "Humanities research with the Google Books corpus". Archived from the original on 2016-03-10. Retrieved 2015-04-19.
  13. Pechenick, Eitan Adam; Danforth, Christopher M.; Dodds, Peter Sheridan; Barrat, Alain (2015-10-07). "Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution". PLOS ONE. 10 (10): e0137041. arXiv:1501.00960. Bibcode:2015PLoSO..1037041P. doi:10.1371/journal.pone.0137041. PMC 4596490. PMID 26445406.
  14. Zhang, Sarah. "The Pitfalls of Using Google Ngram to Study Language". WIRED. Retrieved 2017-05-24.
  15. Koplenig, Alexander (2015-09-02). "The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets—Reconstructing the composition of the German corpus in times of WWII". Digital Scholarship in the Humanities. 32 (1) (published 2017-04-01): 169–188. doi:10.1093/llc/fqv037. ISSN 2055-7671.
  16. Younes, Nadja; Reips, Ulf-Dietrich (2019-03-22). "Guideline for improving the reliability of Google Ngram studies: Evidence from religious terms". PLOS ONE. 14 (3): e0213554. Bibcode:2019PLoSO..1413554Y. doi:10.1371/journal.pone.0213554. ISSN 1932-6203. PMC 6430395. PMID 30901329.
  17. "Google n-grams and pre-modern Chinese". digitalsinology.org. Retrieved 2015-04-19.
  18. "When n-grams go bad". digitalsinology.org. Retrieved 2015-04-19.
