Culturomics

Culturomics is a form of computational lexicology that studies human behavior and cultural trends through the quantitative analysis of digitized texts. [1] [2] Researchers data-mine large digital archives to investigate cultural phenomena reflected in language and word usage. [3] The term is an American neologism first described in a 2010 Science article, "Quantitative Analysis of Culture Using Millions of Digitized Books", co-authored by Harvard researchers Jean-Baptiste Michel and Erez Lieberman Aiden. [4]

Michel and Aiden helped create the Google Labs project Google Ngram Viewer, which uses n-grams to analyze the Google Books digital library for cultural patterns in language use over time.
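
As a rough illustration of the kind of counting involved, the sketch below computes the relative yearly frequency of an n-gram in a tiny made-up corpus. The corpus, the function names `ngrams` and `relative_frequency`, and the whitespace tokenization are assumptions made for this example; they do not describe how the Google Ngram Viewer itself is implemented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(corpus_by_year, phrase):
    """Map each year to the share of that year's n-grams equal to `phrase`.

    corpus_by_year: dict of year -> list of text strings (hypothetical data).
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    result = {}
    for year, texts in sorted(corpus_by_year.items()):
        counts = Counter()
        for text in texts:
            # Naive whitespace tokenization; real corpora need more care.
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        result[year] = counts[target] / total if total else 0.0
    return result


# Toy corpus, invented for this example only.
toy_corpus = {
    1900: ["the steam engine changed the world"],
    2000: ["the search engine changed the web", "a search engine indexes pages"],
}
freq = relative_frequency(toy_corpus, "search engine")
```

Dividing by the total number of n-grams per year, rather than reporting raw counts, keeps years with more digitized text from dominating the trend.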

Because the Google Ngram data set is not an unbiased sample [5] and does not include metadata, [6] there are several pitfalls when using it to study language or the popularity of terms. [7] Medical literature accounts for a large but shifting share of the corpus, [8] and the corpus does not take into account how often a work is printed or read.

Studies

Narrative network of US Elections 2012

In a study called Culturomics 2.0, Kalev H. Leetaru examined news archives including print and broadcast media (television and radio transcripts) for words that imparted tone or "mood" as well as geographic data. [10] [11] The research retroactively predicted the 2011 Arab Spring and successfully estimated the final location of Osama bin Laden to within 124 miles (200 km). [10] [11]

A 2012 paper by Alexander M. Petersen and co-authors [12] found a "dramatic shift in the birth rate and death rates of words": [13] deaths have increased and births have slowed. The authors also identified a universal "tipping point" in the life cycle of new words: about 30 to 50 years after their origin, they either enter the long-term lexicon or fall into disuse. [13]

Culturomic approaches have been applied to the analysis of newspaper content in a number of studies by I. Flaounas and co-authors. These studies showed macroscopic trends across different news outlets and countries. In 2012, a study of 2.5 million articles suggested that gender bias in news coverage depends on topic, and that the readability of newspaper articles is also related to topic. [14] A separate study by the same researchers, covering 1.3 million articles from 27 countries, [15] showed macroscopic patterns in the choice of stories to cover. In particular, countries made similar choices when they were related by economic, geographical and cultural links. The cultural links were revealed by similarity in voting for the Eurovision Song Contest. The study was made possible at this scale by statistical machine translation, text categorisation and information extraction techniques.

The possibility of detecting mood shifts in a large population by analysing Twitter content was demonstrated in a study by T. Lansdall-Welfare and co-authors. [16] The study considered 84 million tweets generated by more than 9.8 million users from the United Kingdom over a period of 31 months, showing how public sentiment in the UK changed with the announcement of spending cuts.

In a 2013 study by S. Sudhahar and co-authors, the automatic parsing of textual corpora enabled the extraction of actors and their relational networks on a vast scale, turning textual data into network data. The resulting networks, which can contain thousands of nodes, are then analysed using tools from network theory to identify the key actors, the key communities or parties, and general properties such as the robustness or structural stability of the overall network, or the centrality of certain nodes. [17]
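
The step from text to network can be sketched in a few lines. The example below assumes that parsing has already produced (subject, verb, object) triplets, builds an undirected actor network from them, and ranks actors by degree centrality. The triplets and the function names `build_network` and `degree_centrality` are hypothetical, and degree centrality is only one of several possible centrality measures; this is not the pipeline used in the study.

```python
from collections import defaultdict


def build_network(triplets):
    """Build an undirected actor network from (subject, verb, object) triplets."""
    neighbours = defaultdict(set)
    for subject, _verb, obj in triplets:
        if subject != obj:
            neighbours[subject].add(obj)
            neighbours[obj].add(subject)
    return neighbours


def degree_centrality(neighbours):
    """Degree centrality: fraction of the other nodes each node is linked to."""
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical triplets, not taken from the study.
triplets = [
    ("Senator", "criticises", "Governor"),
    ("Senator", "supports", "Union"),
    ("Governor", "opposes", "Union"),
    ("Senator", "praises", "Mayor"),
]
network = build_network(triplets)
centrality = degree_centrality(network)
key_actor = max(centrality, key=centrality.get)
```

In this toy network, `key_actor` is the node linked to the most other actors. On networks with thousands of nodes, a dedicated graph library would normally replace the hand-written functions.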

In a 2014 study by T. Lansdall-Welfare and co-authors, 5 million news articles were collected over 5 years [18] and then analyzed, suggesting a significant shift in sentiment in coverage of nuclear power that coincided with the Fukushima disaster. The study also extracted concepts that were associated with nuclear power before and after the disaster, explaining the change in sentiment with a change in narrative framing.

In 2015, a study revealed the bias of the Google Books data set, which "suffers from a number of limitations which make it an obscure mask of cultural popularity," [5] calling into question the significance of many of the earlier results.

Culturomic approaches can also contribute to conservation science through a better understanding of human-nature relationships. The first such research was published by McCallum and Bury in 2013; [19] it revealed a precipitous decline in public interest in environmental issues. In 2016, a publication by Richard Ladle and colleagues [20] highlighted five key areas where culturomics can be used to advance the practice and science of conservation: recognizing conservation-oriented constituencies and demonstrating public interest in nature; identifying conservation emblems; providing new metrics and tools for near-real-time environmental monitoring and for supporting conservation decision making; assessing the cultural impact of conservation interventions; and framing conservation issues and promoting public understanding.

In 2017, a study correlated joint pain with Google search activity and temperature. [21] While the study observed higher search activity for hip and knee pain (but not arthritis) at higher temperatures, it does not (and cannot) control for other relevant factors such as activity. Mass media misinterpreted this as "myth busted: rain does not increase joint pain", [22] [23] whereas the authors speculated that the observed correlation is due to "changes in physical activity levels". [24]

Criticism

Linguists and lexicographers have expressed skepticism regarding the methods and results of some of these studies, including one by Petersen et al. [25] Others have demonstrated bias in the Ngram data set. Their results "call into question the vast majority of existing claims drawn from the Google Books corpus": [5] "Instead of speaking about general linguistic or cultural change, it seems to be preferable to explicitly restrict the results to linguistic or cultural change ‘as it is represented in the Google Ngram data’" [6] because it is unclear what caused the observed change in the sample. Ficetola critiqued the use of Google Trends, suggesting that interest in the environment was actually increasing. [26] In their rebuttal, McCallum and Bury [27] argued that, as far as public policy is concerned, proportional data are what matter and absolute numbers are irrelevant: policy is driven by the opinion of the largest portion of the population, not by an absolute number of people, with decisions made according to majority influence, not simply the number of votes.
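
The difference between absolute and proportional measures in this exchange can be made concrete with a small calculation. The figures below are invented purely for illustration and are not taken from either paper; `share` is a hypothetical helper.

```python
def share(term_count, total_count):
    """Proportion of all searches accounted for by one term."""
    return term_count / total_count


# Hypothetical yearly search counts, invented for this example.
term_2004, total_2004 = 1_000, 100_000
term_2014, total_2014 = 1_500, 300_000

# Absolute interest rises by 500 searches...
absolute_change = term_2014 - term_2004

# ...while the term's share of all searches halves (0.01 to 0.005).
share_2004 = share(term_2004, total_2004)
share_2014 = share(term_2014, total_2014)
```

With these numbers, the absolute count grows even as the proportion falls, which is how both readings of the same trend can coexist.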

References

  1. Cohen, Patricia (16 December 2010). "In 500 Billion Words, New Window on Culture". New York Times.
  2. Hayes, Brian (May–June 2011). "Bit Lit". American Scientist. 99 (3): 190. doi:10.1511/2011.90.190. Archived from the original on 2016-10-18. Retrieved 2011-09-09.
  3. Letcher, David W. (April 6, 2011). "Cultoromics: A New Way to See Temporal Changes in the Prevalence of Words and Phrases" (PDF). American Institute of Higher Education 6th International Conference Proceedings. 4 (1): 228. Archived from the original (PDF) on March 3, 2016. Retrieved September 9, 2011.
  4. Michel, Jean-Baptiste; Lieberman Aiden, Erez (16 December 2010). "Quantitative Analysis of Culture Using Millions of Digitized Books". Science. 331 (6014): 176–82. doi:10.1126/science.1199644. PMC 3279742. PMID 21163965.
  5. Pechenick, Eitan Adam; Danforth, Christopher M.; Dodds, Peter Sheridan (2015-10-07). "Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution". PLOS ONE. 10 (10): e0137041. arXiv:1501.00960. Bibcode:2015PLoSO..1037041P. doi:10.1371/journal.pone.0137041. ISSN 1932-6203. PMC 4596490. PMID 26445406.
  6. Koplenig, Alexander (April 2017). "The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets—Reconstructing the composition of the German corpus in times of WWII". Digital Scholarship in the Humanities. 32 (1): 169–188. doi:10.1093/llc/fqv037. ISSN 2055-7671.
  7. Zhang, Sarah. "The Pitfalls of Using Google Ngram to Study Language". WIRED. Retrieved 2017-05-24.
  8. Comparison of example terms
  9. Sudhahar, Saatviga; Veltri, Giuseppe A.; Cristianini, Nello (2015). "Automated analysis of the US presidential elections using Big Data and network analysis". Big Data & Society. 2. doi:10.1177/2053951715572916. hdl:2381/31767. S2CID 62188746.
  10. Leetaru, Kalev H. (5 September 2011). "Culturomics 2.0: Forecasting Large-Scale Human Behavior Using Global News Media Tone In Time And Space". First Monday. 16 (9). doi:10.5210/fm.v16i9.3663.
  11. Quick, Darren (7 September 2011). "Culturomics research uses quarter-century of media coverage to forecast human behavior". Gizmag.com. Retrieved 9 September 2011.
  12. Petersen, Alexander M. (15 March 2012). "Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death". Scientific Reports. 2: 313. arXiv:1107.3707. Bibcode:2012NatSR...2E.313P. doi:10.1038/srep00313. PMC 3304511. PMID 22423321.
  13. Shea, Christopher (March 16, 2012). "The New Science of the Birth and Death of Words". Wall Street Journal.
  14. Flaounas, Ilias; Ali, Omar; Lansdall-Welfare, Thomas; De Bie, Tijl; Mosdell, Nick; Lewis, Justin; Cristianini, Nello (2013). "Research Methods in the Age of Digital Journalism". Digital Journalism. 1: 102–116. doi:10.1080/21670811.2012.714928. S2CID 61080552.
  15. Flaounas, Ilias; Turchi, Marco; Ali, Omar; Fyson, Nick; De Bie, Tijl; Mosdell, Nick; Lewis, Justin; Cristianini, Nello (2010). "The Structure of the EU Mediasphere". PLOS ONE. 5 (12): e14243. Bibcode:2010PLoSO...514243F. doi:10.1371/journal.pone.0014243. PMC 2999531. PMID 21170383.
  16. Lansdall-Welfare, Thomas; Lampos, Vasileios; Cristianini, Nello (2012). "Effects of the recession on public mood in the UK". Proceedings of the 21st international conference companion on World Wide Web - WWW '12 Companion. p. 1221. doi:10.1145/2187980.2188264. ISBN 9781450312301. S2CID 1825992.
  17. Sudhahar, Saatviga; De Fazio, Gianluca; Franzosi, Roberto; Cristianini, Nello (2015). "Network analysis of narrative content in large corpora". Natural Language Engineering. 21: 81–112. doi:10.1017/S1351324913000247. hdl:1983/dfb87140-42e2-486a-91d5-55f9007042df. S2CID 3385681.
  18. Lansdall-Welfare, Thomas; Sudhahar, Saatviga; Veltri, Giuseppe A.; Cristianini, Nello (2014). "On the coverage of science in the media: A big data study on the impact of the Fukushima disaster". 2014 IEEE International Conference on Big Data (Big Data). pp. 60–66. doi:10.1109/BigData.2014.7004454. hdl:2381/31439. ISBN 978-1-4799-5666-1. S2CID 7686818.
  19. McCallum, Malcolm L; Bury, Gwendolynn W (2013). "Google search patterns suggest declining interest in the environment". Biodiversity and Conservation. 22 (6–7): 1355–1367.
  20. Ladle, Richard J.; Correia, Ricardo A.; Do, Yuno; Joo, Gea-Jae; Malhado, Ana CM; Proulx, Raphaël; Roberge, Jean-Michel; Jepson, Paul (2016). "Conservation culturomics". Frontiers in Ecology and the Environment. 14 (5): 269–275. doi:10.1002/fee.1260. S2CID 199392763.
  21. Telfer, Scott; Obradovich, Nick (2017-08-09). "Local weather is associated with rates of online searches for musculoskeletal pain symptoms". PLOS ONE. 12 (8): e0181266. Bibcode:2017PLoSO..1281266T. doi:10.1371/journal.pone.0181266. ISSN 1932-6203. PMC 5549896. PMID 28792953.
  22. "Are achy joints associated with rain? Google suggests otherwise". NBC News. Retrieved 2017-08-10.
  23. "This Myth About Joint Pain Is Total Crap". Men's Health. 2017-08-10. Retrieved 2017-08-10.
  24. "Rain increases joint pain? Google suggests otherwise: People's activity levels -- increasing as temperatures rise, to a point -- are likelier than the weather itself to cause pain that motivates online searches, researchers say". ScienceDaily. Retrieved 2017-08-10.
  25. Zimmer, Ben (February 10, 2013). "When physicists do linguistics". Boston Globe.
  26. Ficetola, G. F. (2014). "Is interest toward the environment really declining? The complexity of analysing trends using internet search data". Biodiversity and Conservation. 23 (12): 2983–2988. doi:10.1007/s10531-013-0552-y. S2CID 17003129.
  27. McCallum, Malcolm L. (2014). "Public interest in the environment is falling: a response to Ficetola (2013)". Biodiversity and Conservation. 23 (2): 1057–1062. doi:10.1007/s10531-014-0640-7. S2CID 7056654.
