Data publishing

Data publishing (also data publication) is the act of releasing research data in published form for use by others. It consists of preparing data or data sets for public use so that they are available to everyone to use as they wish. This practice is an integral part of the open science movement. There is a broad, multidisciplinary consensus on its benefits. [1] [2] [3]

The main goal is to elevate data to first-class research outputs. [4] A number of initiatives are underway, and there are both points of consensus and issues still in contention. [5]

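There are several distinct ways to make research data available, including attaching data files to a research article as supplementary material, depositing them in a data repository, and publishing a data paper.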

Publishing data allows researchers to make their data available for others to use and enables datasets to be cited like other research publication types (such as articles or books), so that producers of datasets can gain academic credit for their work.

The motivations for publishing data range from a desire to make research more accessible or to make datasets citable, to research funder or publisher mandates that require open data publishing. The UK Data Service is one key organisation working with others to raise awareness of the importance of citing data correctly [7] and to help researchers do so.

Solutions to preserve privacy within data publishing have been proposed, including privacy protection algorithms, data "masking" methods, and regional privacy level calculation algorithms. [8]
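
As a rough illustration of what data masking can mean in practice, the sketch below replaces a direct identifier with a salted hash and coarsens an exact age into a ten-year band before a record is released. The field names (`participant_id`, `age`), the salt, and the band width are assumptions made for this example; they are not taken from the cited work, and hashing plus generalisation alone does not guarantee anonymity.

```python
import hashlib


def mask_record(record, salt):
    """Return a copy of `record` with the identifier pseudonymised and the age banded.

    Assumes hypothetical fields `participant_id` (any value) and `age` (an integer).
    """
    masked = dict(record)
    # Replace the direct identifier with a truncated, salted SHA-256 digest.
    raw = (salt + str(record["participant_id"])).encode("utf-8")
    masked["participant_id"] = hashlib.sha256(raw).hexdigest()[:12]
    # Generalise the exact age to a ten-year band, e.g. 37 becomes "30-39".
    lower = (record["age"] // 10) * 10
    masked["age"] = f"{lower}-{lower + 9}"
    return masked
```

Other fields are copied unchanged, so a real workflow would still need to review them for indirect identifiers.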

Methods for publishing data

Data files as supplementary material

A large number of journals and publishers support supplementary material being attached to research articles, including datasets. Though historically such material might have been distributed only by request or on microform to libraries, journals today typically host such material online. Supplementary material is available to subscribers to the journal or, if the article or journal is open access, to everyone.

Data repositories

There are a large number of data repositories, on both general and specialized topics. Many are disciplinary repositories focused on a particular research discipline, such as the UK Data Service, a trusted digital repository of social, economic and humanities data. Repositories may be free for researchers to upload their data or may charge a one-time or ongoing fee for hosting the data. These repositories offer a publicly accessible web interface for searching and browsing hosted datasets, and may include additional features such as a digital object identifier for permanent citation of the data and links to associated published papers and code.

Data papers

Data papers or data articles are “scholarly publication of a searchable metadata document describing a particular on-line accessible dataset, or a group of datasets, published in accordance to the standard academic practices”. [9] Their final aim is to provide “information on the what, where, why, how and who of the data”. [4] The intent of a data paper is to offer descriptive information on the related dataset(s), focusing on data collection, distinguishing features, access and potential reuse rather than on data processing and analysis. [10] Because data papers are considered academic publications no different than other types of papers, they allow scientists sharing data to receive credit in a currency recognizable within the academic system, thus "making data sharing count". [11] This not only provides an additional incentive to share data but also, through the peer review process, increases the quality of metadata and thus the reusability of the shared data.

Thus data papers represent the scholarly communication approach to data sharing. Despite their potential, data papers are not a complete solution to all data sharing and reuse issues and, in some cases, they are considered to induce false expectations in the research community. [12]

Data journals

Data papers are supported by a rich array of data journals, some of which are "pure", i.e. they are dedicated to publishing data papers only, while others (the majority) are "mixed", i.e. they publish a number of article types including data papers.

A comprehensive survey on data journals is available. [13] A non-exhaustive list of data journals has been compiled by staff at the University of Edinburgh. [14]

Examples of "pure" data journals are: Earth System Science Data, Journal of Open Archaeology Data, Open Health Data, Polar Data Journal, and Scientific Data.

Examples of "mixed" journals publishing data papers are: Biodiversity Data Journal, F1000Research, GigaScience, GigaByte, PLOS ONE, and SpringerPlus.

Data citation

A data citation example

Data citation is the provision of accurate, consistent and standardised referencing for datasets, just as bibliographic citations are provided for other published sources such as research articles or monographs. Typically the well-established Digital Object Identifier (DOI) approach is used, with DOIs taking users to a website that contains the metadata on the dataset and the dataset itself. [15] [16]
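
A minimal sketch of how a dataset reference might be assembled from its metadata follows. The `DatasetRecord` fields, the citation layout, and the DOI `10.1234/example.5678` are hypothetical and chosen only for illustration; real repositories and style guides prescribe their own formats. The only external fact relied on is that a DOI becomes a resolvable link when prefixed with `https://doi.org/`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    creators: List[str]
    year: int
    title: str
    publisher: str      # the repository hosting the data
    doi: str
    version: str = ""   # optional version label


def doi_url(doi: str) -> str:
    """Turn a bare DOI into a link served by the doi.org resolver."""
    return "https://doi.org/" + doi.strip()


def format_data_citation(rec: DatasetRecord) -> str:
    """Build a one-line citation: creators (year). title (version). publisher. DOI link."""
    creators = "; ".join(rec.creators)
    version = f" (Version {rec.version})" if rec.version else ""
    return f"{creators} ({rec.year}). {rec.title}{version}. {rec.publisher}. {doi_url(rec.doi)}"


example = DatasetRecord(
    creators=["Doe, J.", "Roe, R."],      # hypothetical creators
    year=2020,
    title="Example survey responses",     # hypothetical dataset
    publisher="Example Data Repository",  # hypothetical repository
    doi="10.1234/example.5678",           # made-up DOI, not expected to resolve
    version="2",
)
print(format_data_citation(example))
```

Keeping the DOI as a separate field, rather than storing only the formatted string, lets the same record be rendered in whatever citation style a journal requires.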

History of development

A 2011 paper reported an inability to determine how often data citation happened in social sciences. [17]

Papers from 2012 and 2013 reported that data citation was becoming more common but that the practice was not standardised. [18] [19] [20]

In 2014 FORCE11 published the Joint Declaration of Data Citation Principles covering the purpose, function and attributes of data citation. [21]

In October 2018 Crossref expressed its support for cataloging datasets and recommending their citation. [22]

In April 2019 the journal Scientific Data reported that it would now use data citations. [23]

A June 2019 paper suggested that increased data citation will make the practice more valuable for everyone by encouraging data sharing and also by increasing the prestige of people who share. [24]

Data citation is an emerging topic in computer science and it has been defined as a computational problem. [25] Indeed, citing data poses significant challenges to computer scientists and the main problems to address are related to: [26]

  • the use of heterogeneous data models and formats – e.g., relational databases, Comma-Separated Values (CSV), Extensible Markup Language (XML), [27] [28] Resource Description Framework (RDF); [29]
  • the transience of data;
  • the necessity to cite data at different levels of coarseness – i.e., deep citations; [30]
  • the necessity to automatically generate citations to data with variable granularity.
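
The last two points can be made concrete with a small sketch. Assuming a dataset modelled as nested dictionaries and a hypothetical base citation string, the function below produces a citation for any subset addressed by a path of keys, so the same code serves the whole dataset, one branch, or a single record. It also stamps the citation with a version label, one simple way of coping with data that change over time. The data, names, and citation layout are invented for illustration and do not reproduce any of the cited systems.

```python
from typing import Any, Dict, Sequence


def resolve(data: Dict[str, Any], path: Sequence[str]) -> Any:
    """Walk `path` through nested dictionaries; raises KeyError if a key is missing."""
    node: Any = data
    for key in path:
        node = node[key]
    return node


def cite_subset(base_citation: str, version: str,
                data: Dict[str, Any], path: Sequence[str] = ()) -> str:
    """Return a citation for the whole dataset (empty path) or for the subset at `path`."""
    resolve(data, path)  # fail early if the cited subset does not exist
    citation = f"{base_citation}, version {version}"
    if path:
        citation += ", subset " + "/".join(path)
    return citation + "."


# Hypothetical hierarchical dataset: stations -> years -> measurements.
dataset = {"station_a": {"2019": {"temp": 11.2}, "2020": {"temp": 11.9}}}
base = "Doe, J. (2020). Example climate records. Example Data Repository"

print(cite_subset(base, "1.0", dataset))                         # whole dataset
print(cite_subset(base, "1.0", dataset, ["station_a", "2020"]))  # deep citation
```

A path of keys is only one possible way to address a subset; a query string or a persistent identifier per fragment would serve the same purpose.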

References

  1. Costello MJ (2009). "Motivating online publication of data". BioScience. 59 (5): 418–427. doi:10.1525/bio.2009.59.5.9. S2CID 55591360.
  2. Smith VS (2009). "Data publication: towards a database of everything". BMC Research Notes. 2 (113): 113. doi:10.1186/1756-0500-2-113. PMC 2702265. PMID 19552813.
  3. Lawrence, B.; Jones, C.; Matthews, B.; Pepler, S.; Callaghan, S. (2011). "Citation and Peer Review of Data: Moving Towards Formal Data Publication". International Journal of Digital Curation. 6 (2): 4–37. doi:10.2218/ijdc.v6i2.205.
  4. Callaghan S, Donegan S, Pepler S, Thorley M, Cunningham N, Kirsch P, Ault L, Bell P, Bowie R, Leadbetter A, Lowry R, Moncoiffé G, Harrison K, Smith-Haddon B, Weatherby A, Wright D (2012). "Making data a first class scientific output: Data citation and publication by NERCs environmental data centres". International Journal of Digital Curation. 7 (1): 107–113. doi:10.2218/ijdc.v7i1.218.
  5. Kratz J, Strasser C (2014). "Data publication consensus and controversies". F1000Research. 3 (94): 94. doi:10.12688/f1000research.4518. PMC 4097345. PMID 25075301.
  6. Assante, M.; Candela, L.; Castelli, D.; Tani, A. (2016). "Are Scientific Data Repositories Coping with Research Data Publishing?". Data Science Journal. 15. doi:10.5334/dsj-2016-006.
  7. UK Data Service. "New to using data". UK Data Service.
  8. Zhang, Longbin; Wang, Yuxiang; Xu, Xiaoliang (August 2017). "Logic-Partition Based Gaussian Sampling for Online Aggregation". 2017 Fifth International Conference on Advanced Cloud and Big Data (CBD). IEEE. pp. 182–187. doi:10.1109/cbd.2017.39. ISBN 978-1-5386-1072-5. S2CID 40025084.
  9. Chavan, V.; Penev, L. (2011). "The data paper: a mechanism to incentivize data publishing in biodiversity science". BMC Bioinformatics. 12 (15): S2. doi:10.1186/1471-2105-12-S15-S2. PMC 3287445. PMID 22373175.
  10. Newman, Paul; Corke, Peter (2009). "Data papers — peer reviewed publication of high quality data sets". International Journal of Robotics Research. 28 (5): 587. doi:10.1177/0278364909104283. S2CID 209308576.
  11. Gorgolewski KJ, Margulies DS, Milham MP (2013). "Making data sharing count: a publication-based solution". Frontiers in Neuroscience. 7: 9. doi:10.3389/fnins.2013.00009. PMC 3565154. PMID 23390412.
  12. Parsons, M.A.; Fox, P.A. (2013). "Is data publication the right metaphor?". Data Science Journal. 12: WDS31–WDS46. doi:10.2481/dsj.WDS-042.
  13. Candela L, Castelli D, Manghi P, Tani A (2015). "Data Journals: A Survey". Journal of the Association for Information Science and Technology. 66 (1): 1747–1762. doi:10.1002/asi.23358. S2CID 31358007.
  14. "Sources of dataset peer review - datashare - Wiki Service".
  15. Australian National Data Service: Data Citation Awareness. Archived 2012-03-07 at the Wayback Machine. Accessed 20 March 2012.
  16. Ball, A.; Duke, M. (2011). "Data Citation and Linking". DCC Briefing Papers. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/briefing-papers/
  17. Mooney, Hailey (April 2011). "Citing data sources in the social sciences: do authors do it?". Learned Publishing. 24 (2): 99–108. doi:10.1087/20110204. S2CID 34513423.
  18. Edmunds, Scott C.; Pollard, Tom J.; Hole, Brian; Basford, Alexandra T. (2012-07-02). "Adventures in data citation: sorghum genome data exemplifies the new gold standard". BMC Research Notes. 5 (1): 223. doi:10.1186/1756-0500-5-223. ISSN 1756-0500. PMC 3392744. PMID 22571506.
  19. "Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data". Data Science Journal. 12: CIDCR1–CIDCR75. 2013. doi:10.2481/dsj.OSOM13-043.
  20. Mooney, Hailey; Newton, Mark P. (2012). "The Anatomy of a Data Citation: Discovery, Reuse, and Credit". Academic Commons. Columbia University. 1 (1): eP1035. doi:10.7916/D8MW2STM.
  21. Data Citation Synthesis Group (2014). Martone, M. (ed.). "Joint Declaration of Data Citation Principles". San Diego: Force11 Scholarly Communication Institute. doi:10.25490/a97f-egyk.
  22. Lin, Jennifer (4 October 2018). "Data citation: let's do this". Crossref.
  23. "Data citation needed". Scientific Data. 6 (1): 27. 10 April 2019. Bibcode:2019NatSD...6...27.. doi:10.1038/s41597-019-0026-5. PMC 6472333. PMID 30971699.
  24. Pierce, Heather H.; Dev, Anurupa; Statham, Emily; Bierer, Barbara E. (4 June 2019). "Credit data generators for data reuse". Nature. 570 (7759): 30–32. Bibcode:2019Natur.570...30P. doi:10.1038/d41586-019-01715-4. PMID 31164773. S2CID 174809246.
  25. Buneman, Peter; Davidson, Susan; Frew, James (September 2016). "Why data citation is a computational problem". Communications of the ACM. 59 (9): 50–57. doi:10.1145/2893181. ISSN 0001-0782. PMC 5687090. PMID 29151602.
  26. Silvello, G. (2018). "Theory and Practice of Data Citation". Journal of the Association for Information Science and Technology (JASIST) (AIS Review), vol. 69, issue 1, pp. 6–20. Available online (open access): https://onlinelibrary.wiley.com/doi/full/10.1002/asi.23917
  27. Buneman, P.; Silvello, G. (2010). "A Rule-Based Citation System for Structured and Evolving Datasets". IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 3, No. 3. IEEE Computer Society, pp. 33–41, September 2010. Available online: http://sites.computer.org/debull/A10sept/buneman.pdf
  28. Silvello, G. (2017). "Learning to Cite Framework: How to Automatically Construct Citations for Hierarchical Data". Journal of the Association for Information Science and Technology (JASIST), vol. 68, issue 6, pp. 1505–1524, June 2017. Available online: http://www.dei.unipd.it/~silvello/papers/2016-DataCitation-JASIST-Silvello.pdf
  29. Silvello, G. (2015). "A Methodology for Citing Linked Open Data Subsets". D-Lib Magazine 21 (1/2). Available online: http://www.dlib.org/dlib/january15/silvello/01silvello.html
  30. Buneman, P. (2006). "How to Cite Curated Databases and how to Make Them Citable". In Proc. of the 18th International Conference on Scientific and Statistical Database Management, SSDBM 2006, pp. 195–203.