GDELT Project

The GDELT Project, or Global Database of Events, Language, and Tone, created by Kalev Leetaru of Yahoo! and Georgetown University, along with Philip Schrodt and others, describes itself as "an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day." [1] [2] [3] Early explorations leading up to the creation of GDELT were described by co-creator Philip Schrodt in a conference paper in January 2011. [4] The dataset is available on Google Cloud Platform. [5]

Data

GDELT includes data from 1979 to the present. The data is available as zip files in tab-separated value format, using a CSV extension for easy import into Microsoft Excel or similar spreadsheet software. [6] Data from 1979 to 2005 is available as one zip file per year, with the file size gradually increasing from 14.3 MB in 1979 to 125.9 MB in 2005, reflecting growth in the number of news media and in the frequency and comprehensiveness of event recording. [7] Data files from January 2006 to March 2013 are available at monthly granularity, with the zipped file size rising from 11 MB in January 2006 to 103.2 MB in March 2013. Data files from April 1, 2013 onward are available at daily granularity. The data file for each date is made available by 6 AM Eastern Standard Time the next day. As of June 2014, the size of a daily zipped file is about 5 to 12 MB. [6] [7] The data files use Conflict and Mediation Event Observations (CAMEO) coding for recording events. [8]
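The file layout described above (a zip archive containing a tab-separated file with a CSV extension) can be read with standard-library tools. The following is a minimal sketch, not GDELT's own tooling: the member name `sample.export.CSV`, the three-column rows, and the helper names are hypothetical and chosen only for illustration, and the absence of a header row is an assumption.

```python
import csv
import io
import zipfile


def read_tsv_from_zip(zip_bytes, member_name):
    """Yield rows (lists of strings) from a tab-separated member of a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member_name) as raw:
            # The member carries a CSV extension but is tab-delimited,
            # so the delimiter is set explicitly and quoting is disabled.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)


def make_sample_zip():
    """Build a tiny in-memory archive with hypothetical rows (assumption: no header row)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(
            "sample.export.CSV",
            "1\t20140601\t141\n2\t20140601\t042\n",
        )
    return buffer.getvalue()


rows = list(read_tsv_from_zip(make_sample_zip(), "sample.export.CSV"))
for row in rows:
    print(row)
```

Reading values as strings keeps leading zeros in codes such as `042`, which a spreadsheet import may silently drop.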

In a blog post for Foreign Policy, co-creator Kalev Leetaru attempted to use GDELT data to answer the question of whether the Arab Spring sparked protests worldwide. As a measure of protest intensity, he used the quotient of the number of protest-related events to the total number of events recorded, and then studied its trend over time. [9] Political scientist and data science/forecasting expert Jay Ulfelder critiqued the post on his personal blog, saying that Leetaru's normalization method may not have adequately accounted for the change in the nature and composition of media coverage. [10]
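The protest-intensity measure described above, protest-related events divided by all recorded events in a period, can be sketched as follows. This is an illustrative sketch and not Leetaru's actual code: the `(period, root_code)` input shape, the function name, and the sample events are hypothetical, and treating CAMEO root code "14" as the protest category is an assumption that should be checked against the CAMEO documentation.

```python
from collections import Counter


def protest_intensity(events, protest_root="14"):
    """Return {period: protest events / all events} for (period, root_code) pairs."""
    totals = Counter()
    protests = Counter()
    for period, root_code in events:
        totals[period] += 1
        if root_code == protest_root:
            protests[period] += 1
    # Every period in `totals` has at least one event, so no division by zero.
    return {period: protests[period] / totals[period] for period in sorted(totals)}


# Hypothetical sample: one protest out of four events, then one out of two.
sample_events = [
    ("2011-01", "14"),
    ("2011-01", "04"),
    ("2011-01", "19"),
    ("2011-01", "05"),
    ("2011-02", "14"),
    ("2011-02", "04"),
]

intensity = protest_intensity(sample_events)
print(intensity)
```

Dividing by the total event count adjusts for growth in overall volume, but, as the critique above notes, it does not by itself correct for changes in which sources are covered or what they report.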

The dataset is also available on Google Cloud Platform and can be accessed using Google BigQuery. [5]

Reception

Academic reception

GDELT has been cited and used in a number of academic studies, such as a study of visual and predictive analytics of Singapore news (along with Wikipedia and the Straits Times Index) [11] and a study of political conflict. [12]

The challenge problem at the 2014 International Social Computing, Behavioral Modeling and Prediction Conference (SBP) asked participants to explore GDELT and apply it to the analysis of social networks, behavior, and prediction. [13]

Reception in blogs and media

GDELT has been covered on the website of the Center for Data Innovation [14] as well as the GIS Lounge. [15] It has also been discussed and critiqued on blogs about political violence and crisis prediction. [10] [16] [17] The dataset has been cited and critiqued repeatedly in Foreign Policy, [2] [18] including in discussions of political events in Syria, [19] the Arab Spring, [9] [20] and Nigeria. [21] It has also been cited in New Scientist, [22] on the FiveThirtyEight website, [23] and on Andrew Sullivan's blog. [24]

The Predictive Heuristics blog and other blogs have compared GDELT with the Integrated Crisis Early Warning System (ICEWS). [25] [26] Alex Hanna blogged about her experiment assessing GDELT against handcoded data by comparing it with the Dynamics of Collective Action dataset. [27]

In May 2014, the Google Cloud Platform blog announced that the entire GDELT dataset would be available as a public dataset in Google BigQuery. [5]

References

  1. "About GDELT: The Global Database of Events, Language, and Tone". Retrieved June 2, 2014.
  2. "Mapped: Every Protest on the Planet Since 1979". Foreign Policy. Retrieved June 2, 2014.
  3. "Global Database of Events, Language, and Tone". datahub.io. Retrieved June 2, 2014.
  4. Schrodt, Philip (January 20, 2011). "Automated Production of High-Volume, Near-Real-Time Political Event Data" (PDF). Archived from the original (PDF) on July 2, 2017. Retrieved June 12, 2014.
  5. "World's largest event dataset now publicly available in BigQuery". Google Cloud Platform. May 29, 2014. Retrieved June 2, 2014.
  6. "Raw data files". Global Database of Events, Language, and Tone.
  7. "All GDELT Event Files". Retrieved June 12, 2014.
  8. "Documentation". Global Database of Events, Language, and Tone.
  9. Leetaru, Kalev (May 29, 2014). "Did the Arab Spring Really Spark a Wave of Global Protests? The world may look like it's roiling now, but the 1980s were far worse". Foreign Policy. Retrieved June 2, 2014.
  10. Ulfelder, Jay (June 6, 2014). "Another Note on the Limitations of Event Data". Retrieved June 12, 2014.
  11. Phua, Clifton; Feng, Yuzhang; Ji, Junyao; Soh, Timothy (2014). "Visual and Predictive Analytics on Singapore News: Experiments on GDELT, Wikipedia, and ^STI". arXiv:1404.1996 [cs.OH].
  12. Yonamine, James E. "A nuanced study of political conflict using the Global Datasets of Events Location and Tone (GDELT) dataset". Retrieved June 2, 2014.
  13. "SBP 2014 Grand Challenge: explore GDELT, Global Database of Events, Language and Tone". Retrieved June 2, 2014.
  14. "Creating a Real-Time Global Database of Events, People, and Places in the News". Center for Data Innovation. December 15, 2013. Retrieved June 2, 2014.
  15. Morais, Caitlin Dempsey (September 5, 2013). "Mapping Global Events Since 1979". GIS Lounge. Retrieved June 2, 2014.
  16. "Raining on the Parade: Some Cautions Regarding the Global Database of Events, Language and Tone Dataset". Political Violence at a Glance. February 20, 2014. Retrieved June 2, 2014.
  17. Jongman, Berto (January 5, 2014). "Global Database of Events, Language, and Tone (GDELT) — (Old) Big Data to See (New) Crises?". Public Intelligence Blog. Retrieved June 2, 2014.
  18. Keating, Joshua (April 10, 2013). "What can we learn from the last 200 million things that happened in the world?". Foreign Policy. Archived from the original on June 6, 2014. Retrieved June 2, 2014.
  19. Keating, Joshua (July 9, 2013). "How Well Does GDELT Follow Events in Syria?". Foreign Policy. Archived from the original on June 6, 2014. Retrieved June 2, 2014.
  20. Steinert-Threlkeld, Zachary (September 27, 2013). "The Arab Spring and GDELT". Retrieved June 18, 2014.
  21. Leetaru, Kalev (March 13, 2014). "Mapping Violence and Protests in Nigeria: How Big Data can find the big story". Foreign Policy. Retrieved June 2, 2014.
  22. Heaven, Douglas (May 13, 2013). "World's largest events database could predict conflict". New Scientist. Retrieved June 2, 2014.
  23. Chalabi, Mona (May 6, 2014). "Kidnapping of Girls in Nigeria Is Part of a Worsening Problem (Updated)". FiveThirtyEight. Retrieved June 2, 2014.
  24. Sullivan, Andrew (May 30, 2014). "Not Your Father's Global Uprising". Retrieved June 2, 2014.
  25. mdwardlab (October 17, 2013). "GDELT and ICEWS, a short comparison". Predictive Heuristics. Archived from the original on July 17, 2014. Retrieved June 18, 2014.
  26. Beieler, John (October 28, 2013). "Noise in GDELT". Retrieved June 21, 2014.
  27. Hanna, Alex (February 24, 2014). "Assessing GDELT with handcoded protest data". Bad Hessian. Retrieved June 21, 2014.