Data curation


Data curation is the organization and integration of data collected from various sources. It involves annotation, publication and presentation of the data so that the value of the data is maintained over time, and the data remains available for reuse and preservation. Data curation includes "all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data". [1] In science, data curation may refer to the process in which experts extract important information from scientific texts, such as research articles, and convert it into an electronic format, such as an entry in a biological database. [2]


In the modern era of big data, the curation of data has become more prominent, particularly for software that processes high-volume and complex data systems. [3] The term is also used in historical contexts and the humanities, [4] where increasing cultural and scholarly data from digital humanities projects requires the expertise and analytical practices of data curation. [5] In broad terms, curation means a range of activities and processes done to create, manage, maintain, and validate a component. [6] Specifically, data curation is the attempt to determine what information is worth saving and for how long. [7]

History and practice

The user, rather than the database itself, typically initiates data curation and maintains metadata. [8] According to the University of Illinois' Graduate School of Library and Information Science, "Data curation is the active and on-going management of data through its lifecycle of interest and usefulness to scholarship, science, and education; curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time." [9] The data curation workflow is distinct from data quality management, data protection, lifecycle management, and data movement. [8]

Census data has been available in tabulated punch card form since the early 20th century and has been electronic since the 1960s. [10] The Inter-university Consortium for Political and Social Research (ICPSR) website marks 1962 as the date of their first Survey Data Archive. [11]

Detailed background on data libraries appeared in a 1982 issue of the Illinois journal, Library Trends. [12] For historical background on the data archive movement, see "Social Scientific Information Needs for Numeric Data: The Evolution of the International Data Archive Infrastructure." [13] The exact curation process undertaken within any organisation depends on the volume of data, how much noise the data contains, and how the expected future use of the data affects its dissemination. [3]

The crises in space data led to the 1999 creation of the Open Archival Information System (OAIS) model, [14] stewarded by the Consultative Committee for Space Data Systems (CCSDS), which was formed in 1982. [15]

The term data curation is sometimes used in the context of biological databases, where specific biological information is first obtained from a range of research articles and then stored within a specific category of database. For instance, information about anti-depressant drugs can be obtained from various sources and, after checking whether it is already available in a database, saved under the anti-depressant category of a drug database. Enterprises also use data curation within their operational and strategic processes to ensure data quality and accuracy. [16] [17]

Projects and studies

The Dissemination Information Packages (DIPS) for Information Reuse (DIPIR) project is studying research data produced and used by quantitative social scientists, archaeologists, and zoologists. The intended audience is researchers who use secondary data and the digital curators, digital repository managers, data center staff, and others who collect, manage, and store digital information. [18]

The Protein Data Bank was established in 1971 at Brookhaven National Laboratory, and has grown into a global project. [19] A database for three-dimensional structural data of proteins and other large biological molecules, the PDB contains over 120,000 structures, all standardized, validated against experimental data, and annotated.

FlyBase, the primary repository of genetic and molecular data for the insect family Drosophilidae, dates back to 1992. FlyBase annotates the entire Drosophila melanogaster genome. [20]

The Linguistic Data Consortium is a data repository for linguistic data, dating back to 1992. [21]

The Sloan Digital Sky Survey began surveying the night sky in 2000. [22] Computer scientist Jim Gray, while working on the data architecture of the SDSS, championed the idea of data curation in the sciences. [23]

DataNet was a research program of the U.S. National Science Foundation Office of Cyberinfrastructure, funding data management projects in the sciences. [24] DataONE (Data Observation Network for Earth) is one of the projects funded through DataNet, helping the environmental science community preserve and share data. [25]


Related Research Articles

Open Archives Initiative (informal organisation)

The Open Archives Initiative (OAI) was an informal organization, formed around the colleagues Herbert Van de Sompel, Carl Lagoze, Michael L. Nelson and Simeon Warner, to develop and apply technical interoperability standards for archives to share catalogue information (metadata). The group got together in the late 1990s and was active for around twenty years. OAI coordinated three specification activities in particular: OAI-PMH, OAI-ORE and ResourceSync. Throughout, the group worked towards building a "low-barrier interoperability framework" for archives containing digital content, to allow people to harvest metadata. Such sets of metadata have since been harvested to provide "value-added services", often by combining different data sets.

The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cryo-electron microscopy, and submitted by biologists and biochemists from around the world, are freely accessible on the Internet via the websites of its member organisations. The PDB is overseen by an organization called the Worldwide Protein Data Bank, wwPDB.

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed for harvesting metadata descriptions of records in an archive so that services can be built using metadata from many archives. An implementation of OAI-PMH must support representing metadata in Dublin Core, but may also support additional representations.
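As a rough illustration of the harvesting described above, the following Python sketch builds a `ListRecords` request that asks for Dublin Core (`oai_dc`) metadata and pulls titles out of a response. The base URL `https://example.org/oai`, the helper names, and the sample response are hypothetical and exist only for this example; no network call is made, and a real harvester would also need to handle resumption tokens and error responses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH responses and by Dublin Core elements.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Return a ListRecords request URL for a repository's base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def extract_titles(response_xml):
    """Collect every Dublin Core title found in the harvested records."""
    root = ET.fromstring(response_xml)
    titles = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        for title in record.iter(f"{{{DC_NS}}}title"):
            titles.append(title.text)
    return titles


# Hand-written sample response (hypothetical content), standing in for what
# a repository might return for the request above.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hypothetical survey dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_list_records_url("https://example.org/oai"))
    print(extract_titles(SAMPLE_RESPONSE))
```

Because every repository supports the same `oai_dc` representation, a service built this way could, in principle, combine titles from many archives without per-archive parsing logic.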

In library and archival science, digital preservation is a formal process to ensure that digital information of continuing value remains accessible and usable in the long term. It involves planning, resource allocation, and application of preservation methods and technologies, and combines policies, strategies and actions to ensure access to reformatted and "born-digital" content, regardless of the challenges of media failure and technological change. The goal of digital preservation is the accurate rendering of authenticated content over time.

An institutional repository (IR) is an archive for collecting, preserving, and disseminating digital copies of the intellectual output of an institution, particularly a research institution. Academics also utilize their IRs for archiving published works to increase their visibility and collaboration with other academics. However, most of these outputs produced by universities are not effectively accessed and shared by researchers and other stakeholders. As a result, academics should be involved in the implementation and development of an IR project so that they can learn the benefits and purpose of building an IR.

Fedora Commons

Fedora is a digital asset management (DAM) content repository architecture upon which institutional repositories, digital archives, and digital library systems might be built. Fedora is the underlying architecture for a digital repository, and is not a complete management, indexing, discovery, and delivery application. It is a modular architecture built on the principle that interoperability and extensibility are best achieved by the integration of data, interfaces, and mechanisms as clearly defined modules.

The California Digital Library (CDL) was founded by the University of California in 1997. Under the leadership of then UC President Richard C. Atkinson, the CDL's original mission was to forge a better system for scholarly information management and improved support for teaching and research. In collaboration with the ten University of California Libraries and other partners, CDL assembled one of the world's largest digital research libraries. CDL facilitates the licensing of online materials and develops shared services used throughout the UC system. Building on the foundations of the Melvyl Catalog, CDL has developed one of the largest online library catalogs in the country and works in partnership with the UC campuses to bring the treasures of California's libraries, museums, and cultural heritage organizations to the world. CDL continues to explore how services such as digital curation, scholarly publishing, archiving and preservation support research throughout the information lifecycle.

The term Open Archival Information System refers to the ISO OAIS Reference Model for an OAIS. This reference model is defined by recommendation CCSDS 650.0-B-2 of the Consultative Committee for Space Data Systems; this text is identical to ISO 14721:2012. The CCSDS's purview is space agencies, but the OAIS model it developed has proved useful to other organizations and institutions with digital archiving needs. OAIS, known as ISO 14721:2003, is widely accepted and utilized by various organizations and disciplines, both national and international, and was designed to ensure preservation. The OAIS standard, published in 2005, is considered the optimum standard to create and maintain a digital repository over a long period of time.

The Digital Curation Centre (DCC) was established to help solve the extensive challenges of digital preservation and digital curation and to lead research, development, advice, and support services for higher education institutions in the United Kingdom.

UK Data Archive

The UK Data Archive is a national centre of expertise in data archiving in the United Kingdom. It houses the largest collection of social sciences and population digital data in the UK. It is certified under CoreTrustSeal as a trusted digital repository. It is also certified under the international ISO 27001 standard for information security. Located in Colchester, the UK Data Archive is a specialist department of the University of Essex, co-located with the Institute for Social and Economic Research (ISER). It is primarily funded by the Economic and Social Research Council (ESRC) and the University of Essex.

Research data archiving is the long-term storage of scholarly research data, including the natural sciences, social sciences, and life sciences. The various academic journals have differing policies regarding how much of their data and methods researchers are required to store in a public archive, and what is actually archived varies widely between different disciplines. Similarly, the major grant-giving institutions have varying attitudes towards public archival of data. In general, the tradition of science has been for publications to contain sufficient information to allow fellow researchers to replicate and therefore test the research. In recent years this approach has become increasingly strained as research in some areas depends on large datasets which cannot easily be replicated independently.

Preservation metadata is item-level information that describes the context and structure of a digital object. It provides background details pertaining to a digital object's provenance, authenticity, and environment. Preservation metadata is a specific type of metadata that works to maintain a digital object's viability while ensuring continued access by providing contextual information, usage details, and rights.
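To make the idea concrete, here is a minimal Python sketch of an item-level record that captures provenance, rights, and a checksum that can later be used to check that the object is unchanged. The class name, field names, and helper functions are assumptions made for this example and are not drawn from any preservation metadata standard.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PreservationRecord:
    """Illustrative item-level record; the fields are hypothetical."""
    identifier: str
    sha256: str        # checksum supporting later authenticity checks
    size_bytes: int
    file_format: str   # environment: what is needed to render the object
    source: str        # provenance: where the object came from
    rights: str
    recorded_at: str   # ISO 8601 timestamp of when the record was made


def describe(identifier, content, file_format, source, rights):
    """Build a record for a digital object held in memory as bytes."""
    return PreservationRecord(
        identifier=identifier,
        sha256=hashlib.sha256(content).hexdigest(),
        size_bytes=len(content),
        file_format=file_format,
        source=source,
        rights=rights,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )


def is_intact(record, content):
    """Return True if the content still matches the recorded checksum."""
    return hashlib.sha256(content).hexdigest() == record.sha256


if __name__ == "__main__":
    data = b"sample,values\n1,2\n"
    rec = describe("obj-1", data, "text/csv", "hypothetical lab export", "CC BY 4.0")
    print(rec.size_bytes, is_intact(rec, data), is_intact(rec, data + b"x"))
```

Storing the checksum alongside the contextual fields means a repository can re-run `is_intact` after each migration or copy and detect silent corruption.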

Trustworthy Repositories Audit & Certification (TRAC) is a document describing the metrics of an OAIS-compliant digital repository that developed from work done by the OCLC/RLG Programs and National Archives and Records Administration (NARA) task force initiative.

Digital curation is the selection, preservation, maintenance, collection, and archiving of digital assets. Digital curation establishes, maintains, and adds value to repositories of digital data for present and future use. This is often accomplished by archivists, librarians, scientists, historians, and scholars. Enterprises are starting to use digital curation to improve the quality of information and data within their operational and strategic processes. Successful digital curation will mitigate digital obsolescence, keeping the information accessible to users indefinitely. Digital curation includes digital asset management, data curation, digital preservation, and electronic records management.

A data management plan or DMP is a formal document that outlines how data are to be handled both during a research project, and after the project is completed. The goal of a data management plan is to consider the many aspects of data management, metadata generation, data preservation, and analysis before the project begins; this may lead to data being well-managed in the present, and prepared for preservation in the future.

Islandora is a free and open-source digital repository system based on Drupal and integrating with additional applications, including Fedora Commons. Islandora was originally developed at the University of Prince Edward Island by the Robertson Library and is now maintained by the Islandora Foundation, which has a mission to "promote collaboration through transparency and consensus building among Islandora community members, and to steward their shared vision for digital curation features through a body of software and knowledge."

Centre pour la communication scientifique directe (organization in France promoting open access to research)

The Centre pour la Communication Scientifique Directe (CCSD) is a French organization of the Centre National de la Recherche Scientifique (CNRS) devoted to the development of the open access repositories HAL, TEL and MediHal, and the web platform SciencesConf.org. It is involved in the international open access movement.

The UK Data Service is the largest digital repository for quantitative and qualitative social science and humanities research data in the United Kingdom. The organisation is funded by the UK government through the Economic and Social Research Council and is led by the UK Data Archive at the University of Essex, in partnership with other universities.

Inter-university Consortium for Political and Social Research (organization of research institutions)

ICPSR, the Inter-university Consortium for Political and Social Research, was established in 1962. An integral part of the infrastructure of social science research, ICPSR maintains and provides access to a vast archive of social science data for research and instruction. Since 1963, ICPSR has offered training in quantitative methods to facilitate effective data use. The ICPSR Summer Program in Quantitative Methods of Social Research offers a comprehensive curriculum in research design, statistics, data analysis, and methodology. To ensure that data resources are available to future generations of scholars, ICPSR curates and preserves data, migrating them to new storage media and file formats as changes in technology warrant. In addition, ICPSR provides user support to assist researchers in identifying relevant data for analysis and in conducting their research projects.

Digital Repository of Ireland

The Digital Repository of Ireland (DRI) is a digital repository for Ireland's humanities, social science and cultural heritage data. It was designed as an open access infrastructure that allows for interactive use and sustained growth. Three institutions, Royal Irish Academy (RIA), Trinity College Dublin (TCD), and Maynooth, currently manage the repository and implement its policies, guidelines and training. The Department of Education and Skills has primarily funded DRI since 2016 through the Higher Education Authority and the Irish Research Council. As of 2018, DRI is home to over 28,000 items.

References

  1. Renée J. Miller, “Big Data Curation” in 20th International Conference on Management of Data (COMAD) 2014, Hyderabad, India, December 17–19, 2014
  2. BioCreative Glossary. Retrieved on 3 October 2016.
  3. Furht, Borko; Armando Escalante (2011). Handbook of Data Intensive Computing. Springer Science & Business Media. p. 32. ISBN 9781461414155. Retrieved 2 October 2016.
  4. Sabharwal, Arjun (2015). Digital Curation in the Digital Humanities: Preserving and Promoting Archival and Special Collections. Chandos Publishing. p. 60. ISBN 9780081001783. Retrieved 2 October 2016.
  5. "An Introduction to Humanities Data Curation" by Julia Flanders and Trevor Muñoz http://guide.dhcuration.org/intro/. No longer available at the original location; see archive.org.
  6. Pilin Glossary. No longer available at the original location; see archive.org.
  7. Borgman, C (2015). Big data, little data, no data: Scholarship in the networked world. Cambridge, Massachusetts: MIT Press. p. 13. ISBN 978-0-262-02856-1.
  8. Chessell, Mandy; Nigel L Jones; Jay Limburn; David Radley; Kevin Shank (2015). Designing and Operating a Data Reservoir. IBM Redbooks. pp. 111–113. ISBN 9780837440668. Retrieved 2 October 2016.
  9. Cragin, Melissa; Heidorn, P. Bryan; Palmer, Carole L.; Smith, Linda C. (2007). "An Educational Program on Data Curation". ALA Science & Technology Section Conference. Retrieved 7 October 2013.
  10. "Preserving Digital Information (PDI) report" (PDF). 1996. Retrieved 2018-03-13.
  11. "ICPSR: History". www.icpsr.umich.edu. Retrieved 2018-03-15.
  12. Heim, Kathleen M. (November 29, 1982). "Library Trends 30 (3) Winter 1982: Data Libraries for the Social Sciences". Library Trends via www.ideals.illinois.edu.
  13. Kathleen M. Heim, "Social Scientific Information Needs for Numeric Data: The Evolution of the International Data Archive Infrastructure." in Collection Management 9 (Spring 1987): 1-53.
  14. "The OAIS reference model". 2015-12-09. Retrieved 2018-03-15.
  15. "CCSDS.org - The Consultative Committee for Space Data Systems (CCSDS)". public.ccsds.org. Retrieved 2018-03-14.
  16. E. Curry, A. Freitas, and S. O’Riáin, “The Role of Community-Driven Data Curation for Enterprises,” Archived 2012-01-23 at the Wayback Machine in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25-47. ISBN 978-1-4419-7664-2
  17. A. Freitas, E. Curry, “Big Data Curation,” Archived 2016-09-13 at the Wayback Machine in New Horizons for a Data-Driven Economy, Springer (Open Access), 2015.
  18. Dissemination Information Packages for Information Reuse (DIPIR) project http://www.oclc.org/research/themes/user-studies/dipir.html
  19. "RCSB PDB: About the PDB Archive and the RCSB PDB". Retrieved 15 March 2018.
  20. Gramates, LS; Marygold, SJ; dos Santos, G; Urbano, J-M; Antonazzo, G; Matthews, BB; Rey, AJ; Tabone, CJ; Crosby, MA; Emmert, DB; Falls, K; Goodman, JL; Hu, Y; Ponting, L; Schroeder, AJ; Strelets, VB; Thurmond, J; Zhou, P; FlyBase Consortium (2017). "FlyBase at 25: looking to the future". Nucleic Acids Res. 45 (D1): D663–D671. doi:10.1093/nar/gkw1016. PMC 5210523. PMID 27799470.
  21. "About LDC". Linguistic Data Consortium. Retrieved 15 March 2018.
  22. "Sloan Digital Sky Survey". SDSS. Retrieved 15 March 2018.
  23. Palmer, Carole L.; Weber, Nicholas M.; Muñoz, Trevor; Renear, Allen H. (June 2013). "Foundations of Data Curation: The Pedagogy and Practice of "Purposeful Work" with Research Data". Archive Journal. 3. hdl:2142/78099.
  24. "Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Summary". National Science Foundation. September 28, 2007. Retrieved March 15, 2018.
  25. "What is DataONE?". Archived from the original on 26 April 2019. Retrieved 15 March 2018.