Data-driven journalism

Data-driven journalism, often shortened to "ddj" (a term in use since 2009), is a journalistic process based on analyzing and filtering large data sets for the purpose of creating or elevating a news story. Many data-driven stories begin with newly available resources such as open source software, open access publishing and open data, while others are products of public records requests or leaked materials. This approach to journalism builds on older practices, most notably on computer-assisted reporting (CAR), a label used mainly in the US for decades. One label for a partially similar approach is "precision journalism", based on a book by Philipp Meyer, [1] published in 1972, in which he advocated the use of techniques from the social sciences in researching stories.

Compared to these predecessors, data-driven journalism has a wider approach. At its core, the process builds on the growing availability of open data that is freely accessible online and can be analyzed with open source tools. [2] Data-driven journalism strives to reach new levels of service for the public, helping the general public or specific groups or individuals to understand patterns and make decisions based on the findings. As such, data-driven journalism might help to put journalists into a role that is relevant for society in a new way.

Since the introduction of the concept, a number of media companies have created "data teams" which develop visualizations for newsrooms. Notable examples include the teams at Reuters, [3] ProPublica, [4] and La Nación (Argentina). [5] In Europe, The Guardian [6] and Berliner Morgenpost [7] have very productive teams, as do several public broadcasters.

As projects like the MP expenses scandal (2009) and the 2013 release of the "offshore leaks" demonstrate, data-driven journalism can assume an investigative role, dealing on occasion with "not-so-open", i.e. secret, data.

The annual Data Journalism Awards [8] recognize outstanding reporting in the field of data journalism, and numerous Pulitzer Prizes in recent years have been awarded to data-driven storytelling, including the 2018 Pulitzer Prize in International Reporting [9] and the 2017 Pulitzer Prize in Public Service. [10]

Definitions

[Figure: the data-driven journalism process.]

According to architect and multimedia journalist Mirko Lorenz, data-driven journalism is primarily a workflow that consists of the following elements: digging deep into data by scraping, cleansing and structuring it, filtering by mining for specific information, visualizing and making a story. [11] This process can be extended to provide results that cater to individual interests and the broader public.

Data journalism trainer and writer Paul Bradshaw describes the process of data-driven journalism in a similar manner: data must be found, which may require specialized skills like MySQL or Python, then interrogated, for which understanding of jargon and statistics is necessary, and finally visualized and mashed with the aid of open-source tools. [12]
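To make the "interrogation" step concrete, here is a minimal sketch in Python that loads a hypothetical CSV of council spending into an in-memory SQLite database and queries it with SQL; the file name and column names are invented for the example.

    import csv
    import sqlite3

    # Load a hypothetical CSV of council spending into an in-memory SQLite database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE spending (department TEXT, supplier TEXT, amount REAL)")

    with open("council_spending.csv", newline="", encoding="utf-8") as f:
        rows = [(r["department"], r["supplier"], float(r["amount"]))
                for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO spending VALUES (?, ?, ?)", rows)

    # Interrogate the data: which departments spend the most?
    query = """
        SELECT department, SUM(amount) AS total
        FROM spending
        GROUP BY department
        ORDER BY total DESC
        LIMIT 5
    """
    for department, total in conn.execute(query):
        print(f"{department}: {total:,.2f}")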

A more results-driven definition comes from data reporter and web strategist Henk van Ess (2012). [13] "Data-driven journalism enables reporters to tell untold stories, find new angles or complete stories via a workflow of finding, processing and presenting significant amounts of data (in any given form) with or without open tools." Van Ess claims that some data-driven workflows lead to products that "are not in orbit with the laws of good story telling" because the result emphasizes showing the problem, not explaining the problem. "A good data driven production has different layers. It allows you to find the personalized details that are only important for you, by drilling down to what is relevant, but also enables you to zoom out to get the big picture."

In 2013, Van Ess came up with a shorter definition [14] that doesn't involve visualisation per se:

"Data journalism is journalism based on data that has to be processed first with tools before a relevant story is possible."

Reporting based on data

Telling stories based on the data is the primary goal. The findings from data can be transformed into any form of journalistic writing. Visualizations can be used to create a clear understanding of a complex situation. Furthermore, elements of storytelling can be used to illustrate what the findings actually mean, from the perspective of someone who is affected by a development. This connection between data and story can be viewed as a "new arc" trying to span the gap between developments that are relevant but poorly understood and a story that is verifiable, trustworthy, relevant and easy to remember.

Data quality

In many investigations the data that can be found may have omissions or be misleading. As one layer of data-driven journalism, a critical examination of data quality is important. In other cases the data might not be public or might not be in the right format for further analysis, e.g. might only be available in a PDF. Here the process of data-driven journalism can turn into stories about data quality or about institutions' refusals to provide the data. As the practice as a whole is still in its early stages of development, examinations of data sources, data sets, data quality and data formats are therefore an equally important part of this work.

Data-driven journalism and the value of trust

Based on the perspective of looking deeper into facts and drivers of events, there is a suggested change in media strategies: in this view the idea is to move "from attention to trust". The creation of attention, which has been a pillar of media business models, has lost its relevance because reports of new events are often distributed faster via new platforms such as Twitter than through traditional media channels. Trust, on the other hand, can be understood as a scarce resource. While distributing information is much easier and faster via the web, the abundance of offerings makes verifying and checking the content of any story costly, which creates an opportunity for media companies. The view that media companies should transform into trusted data hubs has been described in an article cross-published in February 2011 on Owni.eu [15] and Nieman Lab. [16]

Process of data-driven journalism

The process of transforming raw data into stories is akin to refinement and transformation. The main goal is to extract information recipients can act upon. The task of a data journalist is to extract what is hidden. This approach can be applied to almost any context, such as finances, health, the environment or other areas of public interest.

Inverted pyramid of data journalism

In 2011, Paul Bradshaw introduced a model he called "The Inverted Pyramid of Data Journalism".

Steps of the process

To achieve this, the process should be split into several steps. While the steps leading to results can differ, a basic distinction can be made by looking at six phases:

  1. Find: Searching for data on the web
  2. Clean: Process to filter and transform data, preparation for visualization
  3. Visualize: Displaying the pattern, either as a static or animated visual
  4. Publish: Integrating the visuals, attaching data to stories
  5. Distribute: Enabling access on a variety of devices, such as the web, tablets and mobile
  6. Measure: Tracking usage of data stories over time and across the spectrum of uses.

Description of the steps

Finding data

Data can be obtained directly from governmental databases such as data.gov, data.gov.uk and the World Bank Data API, [17] but also by placing Freedom of Information requests to government agencies; some requests are made and aggregated on websites like the UK's What Do They Know. While there is a worldwide trend towards opening data, there are national differences as to what extent that information is freely available in usable formats. If the data is in a webpage, scrapers are used to extract it into a spreadsheet. Examples of scrapers are Import.io, ScraperWiki, OutWit Hub and Needlebase (retired in 2012 [18]). In other cases OCR software can be used to get data from PDFs.
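As an illustration of the "find" step, the following Python sketch pulls population figures from the World Bank Data API mentioned above; the endpoint layout reflects version 2 of that API, but treat the details as assumptions to verify against the current documentation.

    import requests

    # Fetch total population for one country from the World Bank Data API (v2).
    # Endpoint layout: /v2/country/{code}/indicator/{indicator}?format=json
    url = "https://api.worldbank.org/v2/country/DE/indicator/SP.POP.TOTL"
    resp = requests.get(url, params={"format": "json", "per_page": 100}, timeout=30)
    resp.raise_for_status()

    # The JSON response is a two-element list: [paging metadata, data records].
    metadata, records = resp.json()
    for record in records[:10]:
        if record["value"] is not None:
            print(record["date"], record["value"])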

Data can also be created by the public through crowd sourcing, as shown in March 2012 at the Datajournalism Conference in Hamburg by Henk van Ess. [19]

Cleaning data

Usually data is not in a format that is easy to visualize. Examples are that there are too many data points or that the rows and columns need to be sorted differently. Another issue is that, once investigated, many datasets need to be cleaned, structured and transformed. Various tools like Google Refine (open source), Data Wrangler and Google Spreadsheets [20] allow uploading, extracting or formatting data.
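What such cleaning looks like in practice can be sketched with the pandas library in Python; the file and column names below are invented for the example.

    import pandas as pd

    # Read a hypothetical raw export; real-world files often need this kind of repair.
    df = pd.read_csv("raw_spending.csv")

    # Normalize column names, e.g. " Total Amount " -> "total_amount".
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

    # Strip currency symbols and thousands separators, then convert to numbers.
    df["total_amount"] = (
        df["total_amount"].astype(str)
        .str.replace(r"[£$,]", "", regex=True)
        .astype(float)
    )

    # Drop exact duplicates and rows missing key fields, then re-sort.
    df = df.drop_duplicates().dropna(subset=["department", "total_amount"])
    df = df.sort_values("total_amount", ascending=False)

    df.to_csv("clean_spending.csv", index=False)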

Visualizing data

To visualize data in the form of graphs and charts, applications such as Many Eyes or Tableau Public are available. Yahoo! Pipes and Open Heat Map [21] are examples of tools that enable the creation of maps based on data spreadsheets. The number of options and platforms is expanding. Some new offerings provide options to search, display and embed data, an example being Timetric. [22]
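Beyond such hosted tools, charts can also be scripted directly. Below is a minimal sketch using Python's matplotlib, with made-up figures standing in for real findings.

    import matplotlib.pyplot as plt

    # Made-up example data: spending totals per department, in millions.
    departments = ["Health", "Education", "Transport", "Housing"]
    totals = [420.5, 310.2, 150.8, 98.4]

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(departments, totals, color="steelblue")
    ax.set_xlabel("Spending (millions)")
    ax.set_title("Departmental spending (example data)")
    ax.invert_yaxis()  # largest category at the top

    fig.tight_layout()
    fig.savefig("spending_chart.png", dpi=150)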

To create meaningful and relevant visualizations, journalists use a growing number of tools. There are, by now, several descriptions of what to look for and how to do it. Notable published articles include:

  • Joel Gunter: "#ijf11: Lessons in data journalism from the New York Times" [23]
  • Steve Myers: "Using Data Visualization as a Reporting Tool Can Reveal Story’s Shape", including a link to a tutorial by Sarah Cohen [24]

As of 2011, the use of HTML5 libraries using the canvas tag is gaining in popularity. There are numerous libraries enabling users to graph data in a growing variety of forms. One example is RGraph. [25] As of 2011 there is a growing list of JavaScript libraries for visualizing data. [26]

Publishing data story

There are different options to publish data and visualizations. A basic approach is to attach the data to single stories, similar to embedding web videos. More advanced concepts allow the creation of dossiers, e.g. displaying a number of visualizations, articles and links to the data on one page. Often such specials have to be coded individually, as many content management systems are designed to display single posts based on the date of publication.

Distributing data

Providing access to existing data is another phase, which is gaining importance. Think of the sites as "marketplaces" (commercial or not), where datasets can be found easily by others. Especially if the insights for an article were gained from open data, journalists should provide a link to the data they used so that others can investigate it (potentially starting another cycle of interrogation, leading to new insights).

Providing access to data and enabling groups to discuss what information could be extracted is the main idea behind Buzzdata, [27] a site using the concepts of social media such as sharing and following to create a community for data investigations.

Other platforms (which can be used both to gather and to distribute data):

  • Help Me Investigate (created by Paul Bradshaw) [28]
  • Timetric [29]
  • ScraperWiki [30]

Measuring the impact of data stories

A final step of the process is to measure how often a dataset or visualization is viewed.

In the context of data-driven journalism, the extent of such tracking, such as collecting user data or any other information that could be used for marketing reasons or other uses beyond the control of the user, can be seen as problematic. One newer, non-intrusive option to measure usage is a lightweight tracker called PixelPing. The tracker is the result of a project by ProPublica and DocumentCloud. [31] There is a corresponding service to collect the data. The software is open source and can be downloaded via GitHub. [32]
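The principle behind such a lightweight tracker can be sketched in a few lines: a tiny endpoint serves a transparent 1x1 GIF and counts each request. The Python/Flask sketch below only illustrates the idea and is not PixelPing's actual implementation (which, per the reference above, is written in Node.js).

    import base64
    from collections import Counter

    from flask import Flask, Response, request

    app = Flask(__name__)
    hits = Counter()  # in-memory counts; a real tracker would aggregate and persist these

    # A transparent 1x1 GIF, base64-encoded.
    PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

    @app.route("/pixel.gif")
    def pixel():
        # Count one view for the story identified by ?key=..., then serve the pixel.
        hits[request.args.get("key", "unknown")] += 1
        return Response(PIXEL, mimetype="image/gif")

    @app.route("/stats")
    def stats():
        # Expose the raw counts as JSON.
        return dict(hits)

    if __name__ == "__main__":
        app.run(port=5000)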

Examples

There is a growing list of examples of how data-driven journalism can be applied. The Guardian, one of the pioneering media companies in this space, [33] has compiled an extensive list of its data stories in a single spreadsheet. [34]

Other prominent uses of data-driven journalism are related to the release by whistle-blower organization WikiLeaks of the Afghan War Diary, a compendium of 91,000 secret military reports covering the war in Afghanistan from 2004 to 2010. [35] Three global broadsheets, namely The Guardian, The New York Times and Der Spiegel, dedicated extensive sections [36] [37] [38] to the documents. The Guardian's reporting included an interactive map pointing out the type, location and casualties caused by 16,000 IED attacks; [39] The New York Times published a selection of reports that permits rolling over underlined text to reveal explanations of military terms; [40] and Der Spiegel provided hybrid visualizations (containing both graphs and maps) on topics like the number of deaths related to insurgent bomb attacks. [41] For the Iraq War logs release, The Guardian used Google Fusion Tables to create an interactive map of every incident where someone died, [42] a technique it used again in the England riots of 2011. [43]

Related Research Articles

Pulitzer Prize: Award for achievements in journalism, literature, and musical composition within the United States

The Pulitzer Prize is an award for achievements in newspaper, magazine and online journalism, literature and musical composition within the United States. It was established in 1917 by provisions in the will of Joseph Pulitzer, who had made his fortune as a newspaper publisher, and is administered by Columbia University. Prizes are awarded yearly in twenty-one categories. In twenty of the categories, each winner receives a certificate and a US$15,000 cash award. The winner in the public service category is awarded a gold medal.

Columbia University Graduate School of Journalism: Journalism school at Columbia University

The Columbia University Graduate School of Journalism is located in Pulitzer Hall on Columbia's Morningside Heights campus in New York City.

Visualization (graphics)

Visualization or visualisation is any technique for creating images, diagrams, or animations to communicate a message. Visualization through visual imagery has been an effective way to communicate both abstract and concrete ideas since the dawn of humanity. Examples from history include cave paintings, Egyptian hieroglyphs, Greek geometry, and Leonardo da Vinci's revolutionary methods of technical drawing for engineering and scientific purposes.

Center for Public Integrity: American nonprofit investigative journalism organization

The Center for Public Integrity (CPI) is an American nonprofit investigative journalism organization whose stated mission is "to reveal abuses of power, corruption and dereliction of duty by powerful public and private institutions in order to cause them to operate with honesty, integrity, accountability and to put the public interest first." With over 50 staff members, the CPI is one of the largest nonprofit investigative centers in America. It won the 2014 Pulitzer Prize for Investigative Reporting.

Data visualization: Creation and study of the visual representation of data

Data visualization is an interdisciplinary field that deals with the graphic representation of data. It is a particularly efficient way of communicating when the data is numerous, as for example a time series. From an academic point of view, this representation can be considered as a mapping between the original data and graphic elements. The mapping determines how the attributes of these elements vary according to the data. In this light, a bar chart is a mapping of the length of a bar to a magnitude of a variable. Since the graphic design of the mapping can adversely affect the readability of a chart, mapping is a core competency of data visualization. Data visualization has its roots in the field of statistics and is therefore generally considered a branch of descriptive statistics. However, because both design skills and statistical and computing skills are required to visualize effectively, it is argued by some authors that it is both an art and a science.

The Daily Beast is an American news publication focused on politics, media and pop culture, founded in 2008.

ParaView

ParaView is an open-source multiple-platform application for interactive, scientific visualization. It has a client–server architecture to facilitate remote visualization of datasets, and generates level of detail (LOD) models to maintain interactive frame rates for large datasets. It is an application built on top of the Visualization Toolkit (VTK) libraries. ParaView is an application designed for data parallelism on shared-memory or distributed-memory multicomputers and clusters. It can also be run as a single-computer application.

Database journalism or structured journalism is a principle in information management whereby news content is organized around structured pieces of data, as opposed to news stories. See also: data journalism.

Tableau Software is an American interactive data visualization software company focused on business intelligence. It was founded in 2003 in Mountain View, California, and is currently headquartered in Seattle, Washington. In 2019 the company was acquired by Salesforce for $15.7 billion. CNBC reported that this was the largest acquisition by Salesforce, which is considered the strongest company in the CRM field, since its foundation.

California Watch, part of the nonprofit Center for Investigative Reporting, began producing stories in 2009. The official launch of the California Watch website took place in January 2010. The team was best known for producing well-researched and widely distributed investigative stories on topics of interest to Californians. In small ways, the newsroom pioneered in the digital space, including listing the names of editors and copy editors at the bottom of each story, custom-editing stories for multiple partners, developing unique methods to engage with audiences and distributing the same essential investigative stories to newsrooms across the state. It worked with many news outlets, including newspapers throughout the state, all of the ABC television affiliates in California, KQED radio and television and dozens of websites. The Center for Investigative Reporting created California Watch with $3.5 million in seed funding. The team won several industry awards for its public interest reporting, including the George Polk Award in 2012. In addition to numerous awards won for its investigative reports, the California Watch website also won an Online Journalism Award in the general excellence category from the Online News Association in its first year of existence.

QuickCode was a web-based platform for collaboratively building programs to extract and analyze public (online) data, in a wiki-like fashion. "Scraper" refers to screen scrapers, programs that extract data from websites. "Wiki" means that any user with programming experience can create or edit such programs for extracting new data, or for analyzing existing datasets. The main use of the website is providing a place for programmers and journalists to collaborate on analyzing public data.

Google Fusion Tables: Data management web service

Google Fusion Tables was a web service provided by Google for data management. Fusion Tables could be used for gathering, visualising and sharing data tables. Data were stored in multiple tables that Internet users could view and download.

Analytic journalism is a field of journalism that seeks to make sense of complex reality in order to create public understanding. It combines aspects of investigative journalism and explanatory reporting. Analytic journalism can be seen as a response to professionalized communication from powerful agents, information overload, and growing complexity in a globalised world. It aims to create evidence-based interpretations of reality, often confronting dominant ways of understanding a specific phenomenon.

Data journalism is "a way of enhancing reporting and news writing with the use and examination of statistics in order to provide a deeper insight into a news story and to highlight relevant data. One trend in the digital era of journalism has been to disseminate information to the public via interactive online content through data visualization tools such as tables, graphs, maps, infographics, microsites, and visual worlds. The in-depth examination of such data sets can lead to more concrete results and observations regarding timely topics of interest. In addition, data journalism may reveal hidden issues that seemingly were not a priority in the news coverage". Data journalism is a type of journalism reflecting the increased role that numerical data is used in the production and distribution of information in the digital era. It reflects the increased interaction between content producers (journalist) and several other fields such as design, computer science and statistics. From the point of view of journalists, it represents "an overlapping set of competencies drawn from disparate fields".

Sarah Cohen (journalist): American journalist and professor

Sarah Cohen is an American journalist, author, and professor. Cohen is a proponent of, and teaches classes on, computational journalism and authored the book "Numbers in the Newsroom: Using math and statistics in the news."

Google Sheets: Cloud-based spreadsheet software

Google Sheets is a spreadsheet program included as part of the free, web-based Google Docs Editors suite offered by Google. The service also includes Google Docs, Google Slides, Google Drawings, Google Forms, Google Sites, and Google Keep. Google Sheets is available as a web application, mobile app for Android, iOS, Windows, BlackBerry, and as a desktop application on Google's Chrome OS. The app is compatible with Microsoft Excel file formats. The app allows users to create and edit files online while collaborating with other users in real-time. Edits are tracked by user with a revision history presenting changes. An editor's position is highlighted with an editor-specific color and cursor and a permissions system regulates what users can do. Updates have introduced features using machine learning, including "Explore", offering answers based on natural language questions in a spreadsheet.

Alberto Cairo

Alberto Cairo is a Spanish information designer and professor. Cairo is the Knight Chair in Visual Journalism at the School of Communication of the University of Miami.

OutWit Hub is a Web data extraction software application designed to automatically extract information from online or local resources. It recognizes and grabs links, images, documents, contacts, recurring vocabulary and phrases, and RSS feeds, and converts structured and unstructured data into formatted tables which can be exported to spreadsheets or databases. The first version was released in 2010. Version 8.0 was released in June 2019.

In automated journalism, also known as algorithmic journalism or robot journalism, news articles are generated by computer programs. Through artificial intelligence (AI) software, stories are produced automatically by computers rather than human reporters. These programs interpret, organize, and present data in human-readable ways. Typically, the process involves an algorithm that scans large amounts of provided data, selects from an assortment of pre-programmed article structures, orders key points, and inserts details such as names, places, amounts, rankings, statistics, and other figures. The output can also be customized to fit a certain voice, tone, or style.

Mark Henry Hansen is an American statistician, professor at the Columbia University Graduate School of Journalism and Director of the David and Helen Gurley Brown Institute for Media Innovation. His special interest is the intersection of data, art and technology. He adopts an interdisciplinary approach to data science, drawing on various branches of applied mathematics, information theory and new media arts. Within the field of journalism, Hansen has promoted coding literacy for journalists.

References

  1. "Philipp Meyer". festivaldelgiornalismo.com. Archived from the original on 4 March 2016. Retrieved 31 January 2019.
  2. Lorenz, Mirko (2010) Data driven journalism: What is there to learn? Edited conference documentation, based on presentations of participants, 24 August 2010, Amsterdam, The Netherlands
  3. "Special Reports from Reuters journalists around the world". Reuters. Retrieved 31 January 2019.
  4. "News Apps". ProPublica. Retrieved 31 January 2019.
  5. "How the Argentinian daily La Nación became a data journalism powerhouse in Latin America". niemanlab.org. Retrieved 31 January 2019.
  6. "Data". The Guardian. Retrieved 31 January 2019.
  7. "Portfolio Interaktiv-Team". Berliner Morgenpost. Retrieved 31 January 2019.
  8. "Data Journalism Awards". datajournalismawards.org. Archived from the original on 21 July 2018. Retrieved 31 January 2019.
  9. "The Pulitzer Prizes". www.Pulitzer.org. Retrieved 31 January 2019.
  10. "The Pulitzer Prizes". www.Pulitzer.org. Retrieved 31 January 2019.
  11. Lorenz, Mirko. (2010). Data driven journalism: What is there to learn? Presented at IJ-7 Innovation Journalism Conference, 7–9 June 2010, Stanford, CA
  12. Bradshaw, Paul (1 October 2010). "How to be a data journalist". The Guardian
  13. van Ess, Henk. (2012). Gory of data driven journalism
  14. van Ess, Henk. (2013). Handboek Datajournalistiek Archived 2013-10-21 at the Wayback Machine
  15. Media Companies Must Become Trusted Data Hubs » OWNI.eu, News, Augmented Archived 2011-08-24 at the Wayback Machine . Owni.eu (2011-02-28). Retrieved on 2013-08-16.
  16. Voices: News organizations must become hubs of trusted data in a market seeking (and valuing) trust » Nieman Journalism Lab. Niemanlab.org (2013-08-09). Retrieved on 2013-08-16.
  17. "Developer Information – World Bank Data Help Desk". datahelpdesk.worldbank.org. Retrieved 31 January 2019.
  18. "Renewing old resolutions for the new year". googleblog.blogspot.com. Retrieved 31 January 2019.
  19. Crowdsourcing: how to find a crowd. Presented at the ARD/ZDF Academy. Slideshare.net (2010-09-17). Retrieved on 2013-08-16.
  20. Hirst, Tony (14 October 2008). "Data Scraping Wikipedia with Google Spreadsheets". ouseful.info. Retrieved 31 January 2019.
  21. "OpenHeatMap". www.openheatmap.com. Retrieved 31 January 2019.
  22. "Home - Timetric". www.timetric.com. Retrieved 31 January 2019.
  23. Gunter, Joel (16 April 2011). "#ijf11: Lessons in data journalism from the New York Times". journalism.co.uk. Retrieved 31 January 2019.
  24. "Using Data Visualization as a Reporting Tool Can Reveal Story's Shape". Poynter.org. Retrieved 31 January 2019.
  25. "RGraph is a Free and Open Source JavaScript charts library for the web". www.rgraph.net. Retrieved 31 January 2019.
  26. JavaScript libraries
  27. "BuzzData. BuzzData. Retrieved on 2013-08-16". Archived from the original on 2011-08-12. Retrieved 2011-08-17.
  28. "Help Me Investigate - A network helping people investigate questions in the public interest". helpmeinvestigate.com. Retrieved 31 January 2019.
  29. "Home - Timetric". www.timetric.com. Retrieved 31 January 2019.
  30. "ScraperWiki" . Retrieved 31 January 2019.
  31. Larson, Jeff. (2010-09-08) Pixel Ping: A node.js Stats Tracker. ProPublica. Retrieved on 2013-08-16.
  32. documentcloud/pixel-ping · GitHub. Retrieved on 2013-08-16.
  33. Rogers, Simon (28 July 2011). "Data journalism at the Guardian: what is it and how do we do it?" . Retrieved 31 January 2019 via www.theguardian.com.
  34. Evans, Lisa (27 January 2011). "All of our data journalism in one spreadsheet". The Guardian. Retrieved 31 January 2019.
  35. Kabul War Diary, 26 July 2010, WikiLeaks
  36. Afghanistan The War Logs, 26 July 2010, The Guardian
  37. The War Logs, 26 July 2010 The New York Times
  38. The Afghanistan Protocol: Explosive Leaks Provide Image of War from Those Fighting It, 26 July 2010, Der Spiegel
  39. Afghanistan war logs: IED attacks on civilians, coalition and Afghan troops, 26 July 2010, The Guardian
  40. Text From a Selection of the Secret Dispatches, 26 July 2010, The New York Times
  41. Deathly Toll: Death as a result of insurgent bomb attacks, 26 July 2010, Der Spiegel
  42. Wikileaks Iraq war logs: every death mapped, 22 October 2010, Guardian Datablog
  43. UK riots: every verified incident - interactive map, 11 August 2011, Guardian Datablog