Research data archiving

Last updated

Research data archiving is the long-term storage of scholarly research data, including the natural sciences, social sciences, and life sciences. The various academic journals have differing policies regarding how much of their data and methods researchers are required to store in a public archive, and what is actually archived varies widely between different disciplines. Similarly, the major grant-giving institutions have varying attitudes towards public archival of data. In general, the tradition of science has been for publications to contain sufficient information to allow fellow researchers to replicate and therefore test the research. In recent years this approach has become increasingly strained as research in some areas depends on large datasets which cannot easily be replicated independently.

Contents

Data archiving is more important in some fields than others. In a few fields, all of the data necessary to replicate the work is already available in the journal article. In drug development, a great deal of data is generated and must be archived so researchers can verify that the reports the drug companies publish accurately reflect the data.

The requirement of data archiving is a recent development in the history of science. It was made possible by advances in information technology allowing large amounts of data to be stored and accessed from central locations. For example, the American Geophysical Union (AGU) adopted their first policy on data archiving in 1993, about three years after the beginning of the WWW. [1] This policy mandates that datasets cited in AGU papers must be archived by a recognised data center; it permits the creation of "data papers"; and it establishes AGU's role in maintaining data archives. But it makes no requirements on paper authors to archive their data.

Prior to organized data archiving, researchers wanting to evaluate or replicate a paper would have to request data and methods information from the author. The academic community expects authors to share supplemental data. This process was recognized as wasteful of time and energy and obtained mixed results. Information could become lost or corrupted over the years. In some cases, authors simply refuse to provide the information.

The need for data archiving and due diligence is greatly increased when the research deals with health issues or public policy formation. [2] [3]

Selected policies by journals

Biotropica

Biotropica requires, as a condition for publication, that the data supporting the results in the paper and metadata describing them must be archived in an appropriate public archive such as Dryad, Figshare, GenBank, TreeBASE, or NCBI. Authors may elect to make the data publicly available as soon as the article is published or, if the technology of the archive allows, embargo access to the data up to three years after article publication. A statement describing Data Availability will be included in the manuscript as described in the instructions to authors. Exceptions to the required archiving of data may be granted at the discretion of the Editor-in-Chief for studies that include sensitive information (e.g., the location of endangered species). Our Editorial explaining the motivation for this policy can be found here. A more comprehensive list of data repositories is available here. Promoting a culture of collaboration with researchers who collect and archive data: The data collected by tropical biologists are often long-term, complex, and expensive to collect. The Board of Editors of Biotropica strongly encourages authors who re-use data archives archived data sets to include as fully engaged collaborators the scientists who originally collected them. We feel this will greatly enhance the quality and impact of the resulting research by drawing on the data collector’s profound insights into the natural history of the study system, reducing the risk of errors in novel analyses, and stimulating the cross-disciplinary and cross-cultural collaboration and training for which the ATBC and Biotropica are widely recognized.

NB:Biotropica is one of only two journals that pays the fees for authors depositing data at Dryad.

The American Naturalist

The American Naturalist requires authors to deposit the data associated with accepted papers in a public archive. For gene sequence data and phylogenetic trees, deposition in GenBank or TreeBASE, respectively, is required. There are many possible archives that may suit a particular data set, including the Dryad repository for ecological and evolutionary biology data. All accession numbers for GenBank, TreeBASE, and Dryad must be included in accepted manuscripts before they go to Production. If the data is deposited somewhere else, please provide a link. If the data is culled from published literature, please deposit the collated data in Dryad for the convenience of your readers. Any impediments to data sharing should be brought to the attention of the editors at the time of submission so that appropriate arrangements can be worked out. [4]

Journal of Heredity

The primary data underlying the conclusions of an article are critical to the verifiability and transparency of the scientific enterprise, and should be preserved in usable form for decades in the future. For this reason, Journal of Heredity requires that newly reported nucleotide or amino acid sequences, and structural coordinates, be submitted to appropriate public databases (e.g., GenBank; the EMBL Nucleotide Sequence Database; DNA Database of Japan; the Protein Data Bank; and Swiss-Prot). Accession numbers must be included in the final version of the manuscript. For other forms of data (e.g., microsatellite genotypes, linkage maps, images), the Journal endorses the principles of the Joint Data Archiving Policy (JDAP) in encouraging all authors to archive primary datasets in an appropriate public archive, such as Dryad, TreeBASE, or the Knowledge Network for Biocomplexity. Authors are encouraged to make data publicly available at time of publication or, if the technology of the archive allows, opt to embargo access to the data for a period up to a year after publication. The American Genetic Association also recognizes the vast investment of individual researchers in generating and curating large datasets. Consequently, we recommend that this investment be respected in secondary analyses or meta-analyses in a gracious collaborative spirit.

oxfordjournals.org [5]

Molecular Ecology

Molecular Ecology expects that data supporting the results in the paper should be archived in an appropriate public archive, such as GenBank, Gene Expression Omnibus, TreeBASE, Dryad, the Knowledge Network for Biocomplexity, your own institutional or funder repository, or as Supporting Information on the Molecular Ecology web site. Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species.

Wiley [6]

Nature

Such material must be hosted on an accredited independent site (URL and accession numbers to be provided by the author), or sent to the Nature journal at submission, either uploaded via the journal's online submission service, or if the files are too large or in an unsuitable format for this purpose, on CD/DVD (five copies). Such material cannot solely be hosted on an author's personal or institutional web site. [7] Nature requires the reviewer to determine if all of the supplementary data and methods have been archived. The policy advises reviewers to consider several questions, including: "Should the authors be asked to provide supplementary methods or data to accompany the paper online? (Such data might include source code for modelling studies, detailed experimental protocols or mathematical derivations.)

Nature [8]

Science

Science supports the efforts of databases that aggregate published data for the use of the scientific community. Therefore, before publication, large data sets (including microarray data, protein or DNA sequences, and atomic coordinates or electron microscopy maps for macromolecular structures) must be deposited in an approved database and an accession number provided for inclusion in the published paper. [9] "Materials and methods" – Science now requests that, in general, authors place the bulk of their description of materials and methods online as supporting material, providing only as much methods description in the print manuscript as is necessary to follow the logic of the text. (Obviously, this restriction will not apply if the paper is fundamentally a study of a new method or technique.)

Science [10]

Royal Society

To allow others to verify and build on the work published in Royal Society journals, it is a condition of publication that authors make available the data, code and research materials supporting the results in the article.

Datasets and code should be deposited in an appropriate, recognised, publicly available repository. Where no data-specific repository exists, authors should deposit their datasets in a general repository such as Dryad (repository) or Figshare.

Journal of Archaeological Science

The Journal of Archaeological Science has had a data disclosure policy since at least 2013. Their policy states that 'all data relating to the article must be made available in Supplementary files or deposited in external repositories and linked to within the article. The policy recommends that data are deposited in a repository such as the Archaeology Data Service, the Digital Archaeological Record, or PANGAEA. A 2018 study found a data availability rate of 53%, reflecting either weak enforcement of this policy or an incomplete understanding among editors, reviewers, and authors of how to interpret and implement this policy. [12]

Policies by funding agencies

In the United States, the National Science Foundation (NSF) has tightened requirements on data archiving. Researchers seeking funding from NSF are now required to file a data management plan as a two-page supplement to the grant application. [13]

The NSF Datanet initiative has resulted in funding of the Data Observation Network for Earth (DataONE) project, which will provide scientific data archiving for ecological and environmental data produced by scientists worldwide. DataONE's stated goal is to preserve and provide access to multi-scale, multi-discipline, and multi-national data. The community of users for DataONE includes scientists, ecosystem managers, policy makers, students, educators, and the public.

The German DFG requires that research data should be archived in the researcher's own institution or an appropriate nationwide infrastructure for at least 10 years. [14]

The British Digital Curation Centre maintains an overview of funder's data policies. [15]

Data library

Data Repository and an Archive Repository The difference between a Data Repository and an Archive Repository.svg
Data Repository and an Archive Repository

Research data is archived in data libraries or data archives. A data library, data archive, or data repository is a collection of numeric and/or geospatial data sets for secondary use in research. A data library is normally part of a larger institution (academic, corporate, scientific, medical, governmental, etc.). established for research data archiving and to serve the data users of that organisation. The data library tends to house local data collections and provides access to them through various means (CD-/DVD-ROMs or central server for download). A data library may also maintain subscriptions to licensed data resources for its users to access the information. Whether a data library is also considered a data archive may depend on the extent of unique holdings in the collection, whether long-term preservation services are offered, and whether it serves a broader community (as national data archives do). Most public data libraries are listed in the Registry of Research Data Repositories.

Importance and services

In August 2001, the Association of Research Libraries (ARL) published a report [16] presenting results from a survey of ARL member institutions involved in collecting and providing services for numeric data resources.

Library service providing support at the institutional level for the use of numerical and other types of datasets in research. Amongst the support activities typically available:

Examples of data libraries

Natural sciences

The following list refers to scientific data archives.

Social sciences

In the social sciences, data libraries are referred to as data archives. [17] Data archives are professional institutions for the acquisition, preparation, preservation, and dissemination of social and behavioral data. Data archives in the social sciences evolved in the 1950s and have been perceived as an international movement:

By 1964 the International Social Science Council (ISSC) had sponsored a second conference on Social Science Data Archives and had a standing Committee on Social Science Data, both of which stimulated the data archives movement. By the beginning of the twenty-first century, most developed countries and some developing countries had organized formal and well-functioning national data archives. In addition, college and university campuses often have `data libraries' that make data available to their faculty, staff, and students; most of these bear minimal archival responsibility, relying for that function on a national institution (Rockwell, 2001, p. 3227). [18]

See also

Related Research Articles

PubMed Central (PMC) is a free digital repository that archives open access full-text scholarly articles that have been published in biomedical and life sciences journals. As one of the major research databases developed by the National Center for Biotechnology Information (NCBI), PubMed Central is more than a document repository. Submissions to PMC are indexed and formatted for enhanced metadata, medical ontology, and unique identifiers which enrich the XML structured data for each article. Content within PMC can be linked to other NCBI databases and accessed via Entrez search and retrieval systems, further enhancing the public's ability to discover, read and build upon its biomedical knowledge.

<span class="mw-page-title-main">Self-archiving</span> Authorial deposit of documents to provide open access

Self-archiving is the act of depositing a free copy of an electronic document online in order to provide open access to it. The term usually refers to the self-archiving of peer-reviewed research journal and conference articles, as well as theses and book chapters, deposited in the author's own institutional repository or open archive for the purpose of maximizing its accessibility, usage and citation impact. The term green open access has become common in recent years, distinguishing this approach from gold open access, where the journal itself makes the articles publicly available without charge to the reader.

The California Digital Library (CDL) was founded by the University of California in 1997. Under the leadership of then UC President Richard C. Atkinson, the CDL's original mission was to forge a better system for scholarly information management and improved support for teaching and research. In collaboration with the ten University of California Libraries and other partners, CDL assembled one of the world's largest digital research libraries. CDL facilitates the licensing of online materials and develops shared services used throughout the UC system. Building on the foundations of the Melvyl Catalog, CDL has developed one of the largest online library catalogs in the country and works in partnership with the UC campuses to bring the treasures of California's libraries, museums, and cultural heritage organizations to the world. CDL continues to explore how services such as digital curation, scholarly publishing, archiving and preservation support research throughout the information lifecycle.

<span class="mw-page-title-main">Open data</span> Openly accessible data

Open data is data that is openly accessible, exploitable, editable and shared by anyone for any purpose. Open data is licensed under an open license.

<span class="mw-page-title-main">UK Data Archive</span>

The UK Data Archive is a national centre of expertise in data archiving in the United Kingdom. It houses the largest collection of social sciences and population digital data in the UK. It is certified under CoreTrustSeal as a trusted digital repository. It is also certified under the international ISO 27001 standard for information security. Located in Colchester, the UK Data Archive is a specialist department of the University of Essex, co-located with the Institute for Social and Economic Research (ISER). It is primarily funded by the Economic and Social Research Council (ESRC) and the University of Essex.

Scholarly communication involves the creation, publication, dissemination and discovery of academic research, primarily in peer-reviewed journals and books. It is “the system through which research and other scholarly writings are created, evaluated for quality, disseminated to the scholarly community, and preserved for future use." This primarily involves the publication of peer-reviewed academic journals, books and conference papers.

<span class="mw-page-title-main">Data sharing</span>

Data sharing is the practice of making data used for scholarly research available to other investigators. Many funding agencies, institutions, and publication venues have policies regarding data sharing because transparency and openness are considered by many to be part of the scientific method.

The Dataverse is an open source web application to share, preserve, cite, explore and analyze research data. Researchers, data authors, publishers, data distributors, and affiliated institutions all receive appropriate credit via a data citation with a persistent identifier.

<span class="mw-page-title-main">German National Library of Economics</span> Research library of economics

The National Library of Economics is the world's largest research infrastructure for economic literature, online as well as offline. The ZBW is a member of the Leibniz Association and has been a foundation under public law since 2007. Several times the ZBW received the international LIBER Award for its innovative work in librarianship. The ZBW allows for access of millions of documents and research on economics, partnering with over 40 research institutions to create a connective Open Access portal and social web of research. Through its EconStor and EconBiz, researchers and students have accessed millions of datasets and thousands of articles. The ZBW also edits two journals: Wirtschaftsdienst and Intereconomics.

AGRIS is a global public domain database with more than 12 million structured bibliographical records on agricultural science and technology. It became operational in 1975 and the database was maintained by Coherence in Information for Agricultural Research for Development, and its content is provided by more than 150 participating institutions from 65 countries. The AGRIS Search system, allows scientists, researchers and students to perform sophisticated searches using keywords from the AGROVOC thesaurus, specific journal titles or names of countries, institutions, and authors.

An open-access mandate is a policy adopted by a research institution, research funder, or government which requires or recommends researchers—usually university faculty or research staff and/or research grant recipients—to make their published, peer-reviewed journal articles and conference papers open access (1) by self-archiving their final, peer-reviewed drafts in a freely accessible institutional repository or disciplinary repository or (2) by publishing them in an open-access journal or both.

<span class="mw-page-title-main">DataONE</span> International federation of data repositories

DataONE is a network of interoperable data repositories facilitating data sharing, data discovery, and open science. Originally supported by $21.2 million in funding from the US National Science Foundation as one of the initial DataNet programs in 2009, funding was renewed in 2014 through 2020 with an additional $15 million. DataONE helps preserve, access, use, and reuse of multi-discipline scientific data through the construction of primary cyberinfrastructure and an education and outreach program. DataONE provides scientific data archiving for ecological and environmental data produced by scientists. DataONE's goal is to preserve and provide access to multi-scale, multi-discipline, and multi-national data. Users include scientists, ecosystem managers, policy makers, students, educators, librarians, and the public.

<span class="mw-page-title-main">Dryad (repository)</span>

Dryad is an international open-access repository of research data, especially data underlying scientific and medical publications. Dryad is a curated general-purpose repository that makes data discoverable, freely reusable, and citable. The scientific, educational, and charitable mission of Dryad is to provide the infrastructure for and promote the re-use of scholarly research data.

Open scientific data or open research data is a type of open data focused on publishing observations and results of scientific activities available for anyone to analyze and reuse. A major purpose of the drive for open data is to allow the verification of scientific claims, by allowing others to look at the reproducibility of results, and to allow data from many sources to be integrated to give new knowledge.

The UK Data Service is the largest digital repository for quantitative and qualitative social science and humanities research data in the United Kingdom. The organisation is funded by the UK government through the Economic and Social Research Council and is led by the UK Data Archive at the University of Essex, in partnership with other universities.

The NIH Public Access Policy is an open access mandate, drafted in 2004 and mandated in 2008, requiring that research papers describing research funded by the National Institutes of Health must be available to the public free through PubMed Central within 12 months of publication. PubMed Central is the self-archiving repository in which authors or their publishers deposit their publications. Copyright is retained by the usual holders, but authors may submit papers with one of the Creative Commons licenses.

Figshare is an online open access repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos. It is free to upload content and free to access, in adherence to the principle of open data. Figshare is one of a number of portfolio businesses supported by Digital Science, a subsidiary of Springer Nature.

Data publishing is the act of releasing research data in published form for use by others. It is a practice consisting in preparing certain data or data set(s) for public use thus to make them available to everyone to use as they wish. This practice is an integral part of the open science movement. There is a large and multidisciplinary consensus on the benefits resulting from this practice.

<span class="mw-page-title-main">Zenodo</span> Research data repository

Zenodo is a general-purpose open repository developed under the European OpenAIRE program and operated by CERN. It allows researchers to deposit research papers, data sets, research software, reports, and any other research related digital artefacts. For each submission, a persistent digital object identifier (DOI) is minted, which makes the stored items easily citeable.

<span class="mw-page-title-main">Inter-university Consortium for Political and Social Research</span> Organization of research institutions

ICPSR, the Inter-university Consortium for Political and Social Research, was established in 1962. An integral part of the infrastructure of social science research, ICPSR maintains and provides access to a vast archive of social science data for research and instruction. Since 1963, ICPSR has offered training in quantitative methods to facilitate effective data use. The ICPSR Summer Program in Quantitative Methods of Social Research offers a comprehensive curriculum in research design, statistics, data analysis, and methodology. To ensure that data resources are available to future generations of scholars, ICPSR curates and preserves data, migrating them to new storage media and file formats as changes in technology warrant. In addition, ICPSR provides user support to assist researchers in identifying relevant data for analysis and in conducting their research projects.

References

  1. ”Policy on Referencing Data in and Archiving Data for AGU Publications”
  2. "The Case for Due Diligence When Empirical Research is Used in Policy Formation" by Bruce McCullough and Ross McKitrick.
  3. "Data Sharing and Replication" a website by Gary King Archived 2007-03-28 at the Wayback Machine
  4. Supporting Data and Material
  5. Data archiving policy
  6. Policy on data archiving
  7. "Availability of Data and Materials: The Policy of Nature Magazine
  8. "Guide to Publication Policies of the Nature Journals" (PDF). March 14, 2007.
  9. "General Policies of Science Magazine"
  10. ”Preparing Your Supporting Online Material”
  11. "Data sharing and mining"
  12. Marwick, Ben; Birch, Suzanne E. Pilaar (5 April 2018). "A Standard for the Scholarly Citation of Archaeological Data as an Incentive to Data Sharing". Advances in Archaeological Practice. 6 (2): 125–143. doi: 10.1017/aap.2018.3 .
  13. ”NSF to Ask Every Grant Applicant for Data Management Plan”
  14. ”DFG Guidelines on the Handling of Research Data”
  15. "Overview of funders' data policies | Digital Curation Centre"
  16. SPEC Kit 263: Numeric Data Products and Services
  17. White, Howard D. (1977). Machine-Readable Social Science Data. Drexel Library Quarterly 13 (January, 1977):1-110.
  18. Rockwell, R. C. (2001). Data Archives: International. IN: Smelser, N. J. & Baltes, P. B. (eds.) International Encyclopedia of the Social and Behavioral Sciences (vol. 5, pp. 3225- 3230). Amsterdam: Elsevier

Notes

Further reading

Associations