The Cancer Imaging Archive

The Cancer Imaging Archive (TCIA) is an open-access database of medical images for cancer research. The site is funded by the National Cancer Institute's (NCI) Cancer Imaging Program, and the contract is operated by the University of Arkansas for Medical Sciences. Data within the archive is organized into collections which typically share a common cancer type and/or anatomical site. The majority of the data consists of CT, MRI, and nuclear medicine (e.g. PET) images stored in DICOM format, but many other types of supporting data are also provided or linked to, in order to enhance research utility. [1] All data are de-identified in order to comply with the Health Insurance Portability and Accountability Act and National Institutes of Health data sharing policies.

TCIA resources are intended to support:

TCIA is recognized as a recommended repository for the Scientific Data, PLOS One, [3] and F1000Research journals. [4] It is also listed in the Registry of Research Data Repositories. [5]

History

Prior to the creation of TCIA, the NCI funded development of the National Biomedical Imaging Archive. NBIA is an open-source Web application which was designed to allow the storage and query of DICOM images. TCIA was subsequently initiated in December 2010 to expand data sharing activities by funding a service component which would help address the technical and policy challenges associated with medical imaging research. TCIA leverages open-source tools such as NBIA and Clinical Trials Processor in order to provide its services. [6] [7]

Organization of the archive

The site content is organized into five categories: [8]

Methods for accessing data

Most collections on the Cancer Imaging Archive can be accessed without an account, but a few are restricted to specific users and therefore require an account to access them. [2] TCIA has several ways to browse, filter, and download data. They include:

Browsing, bulk downloading and access to supporting data

The home page includes a list of all available collections. Basic information about the data, such as the cancer type, cancer location, modalities, and number of subjects, is also provided. Clicking on a collection name presents a page which describes the data, including its original research purpose, how the data were generated, and how they might be useful to other TCIA users. For example, doi:10.7937/K9/TCIA.2015.L4FRET6Z describes the NSCLC-Radiomics-Genomics Collection. In the lower section of the page, the Data Access tab contains links to search or download the images and any available supporting data. Additional tabs provide information about data versions and how to cite the data if used in publications.

Many collections contain additional data types such as genomics, patient demographics, treatment details, and expert analyses of the images. This data is usually only found by browsing the collection pages as opposed to searching in NBIA or using the API.

Filtering or searching with NBIA

On each Collection page and also in the main menu of the site there are links to "Search TCIA". This will load the NBIA application which allows simple, advanced and free text searches. Search results follow the conventional DICOM hierarchy of patient -> study -> series. TCIA provides comprehensive documentation on the various features of the NBIA software. [9]
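The patient -> study -> series hierarchy can be illustrated with a short Python sketch. The records below are hypothetical and are not real TCIA data; only the attribute names (PatientID, StudyInstanceUID, SeriesInstanceUID, Modality) come from the DICOM standard. The helper `build_hierarchy` is an illustrative name, not part of NBIA.

```python
from collections import defaultdict

# Hypothetical series-level search results; the identifiers are made up
# for illustration and do not refer to real TCIA subjects.
series_records = [
    {"PatientID": "P001", "StudyInstanceUID": "1.2.1", "SeriesInstanceUID": "1.2.1.1", "Modality": "CT"},
    {"PatientID": "P001", "StudyInstanceUID": "1.2.1", "SeriesInstanceUID": "1.2.1.2", "Modality": "PT"},
    {"PatientID": "P001", "StudyInstanceUID": "1.2.2", "SeriesInstanceUID": "1.2.2.1", "Modality": "MR"},
    {"PatientID": "P002", "StudyInstanceUID": "1.3.1", "SeriesInstanceUID": "1.3.1.1", "Modality": "CT"},
]


def build_hierarchy(records):
    """Group flat series records into {patient: {study: [series, ...]}}."""
    tree = defaultdict(lambda: defaultdict(list))
    for rec in records:
        tree[rec["PatientID"]][rec["StudyInstanceUID"]].append(rec["SeriesInstanceUID"])
    # Convert the nested defaultdicts to plain dicts for predictable behaviour.
    return {patient: dict(studies) for patient, studies in tree.items()}


hierarchy = build_hierarchy(series_records)
for patient, studies in hierarchy.items():
    print(patient)
    for study, series in studies.items():
        print("  ", study, "->", series)
```

Each patient maps to one or more studies, and each study to one or more series, which is the order in which search results are presented.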

RESTful API

A number of search and download commands are also available through the API. New iterations on the API are released as new versions, so that existing applications developed against older versions of the API continue to function. [10]
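As a minimal sketch of how a client might pin itself to one API version, the Python snippet below builds a query URL with the version as an explicit path segment. The base URL, the `v1` version label, the `getSeries` endpoint and the `Collection` and `Modality` parameter names are assumptions made for illustration; the usage guide [10] is the authoritative source for the actual endpoints. The snippet only constructs a URL and does not contact any server.

```python
from urllib.parse import urlencode

# Assumed base URL; verify against the REST API usage guide before use.
BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services"


def build_query_url(endpoint, version="v1", **params):
    """Return a query URL for a given API version and endpoint.

    Keeping the version in the path means code written against an
    older version keeps requesting that version after newer ones appear.
    """
    # Drop parameters that were not supplied.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{BASE_URL}/{version}/{endpoint}"
    return f"{url}?{query}" if query else url


# Endpoint and parameter names here are assumptions, not confirmed API.
url = build_query_url("getSeries", Collection="TCGA-GBM", Modality="MR")
print(url)
```

Passing a different `version` argument changes only the path segment, so the rest of the client code stays the same when moving between API versions.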

Research activities

A list of known publications based on TCIA data is maintained as a convenience to researchers who want to investigate how the data have been used previously. [11] In addition to peer-reviewed publications, several major research initiatives are described in the Research Activities section of the site.

The CIP TCGA Radiology Initiative for Radiogenomics Research

A large number of collections contain subjects which were also analyzed as part of the NIH/NHGRI database known as The Cancer Genome Atlas (TCGA). Because the two resources share unique subject identifiers, researchers can correlate the clinical images of each case with the extensive genomic analyses, digital pathology slides, and downloadable demographic and clinical data held in TCGA. A multi-institutional network of investigators volunteering their time is using the data to develop methods to determine prognosis or predict the response to therapy. [12] TCGA collections are designated by nomenclature shared with the TCGA Data Portal [13] (e.g. TCGA-BRCA, TCGA-GBM). They are subject to a special publication policy which differs from the policy for other public data on TCIA. [14]
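Linking records through a shared identifier can be sketched in Python as a simple join. All identifiers and field values below are hypothetical placeholders, not real TCGA or TCIA records, and `link_by_subject` is an illustrative helper rather than part of any TCIA tool.

```python
# Hypothetical imaging index keyed by subject identifier.
imaging = {
    "TCGA-XX-0001": {"modality": "MR", "series_count": 4},
    "TCGA-XX-0002": {"modality": "CT", "series_count": 2},
}

# Hypothetical genomic or clinical annotations keyed by the same identifier.
genomic = {
    "TCGA-XX-0001": {"subtype": "A"},
    "TCGA-XX-0003": {"subtype": "B"},
}


def link_by_subject(imaging_index, genomic_index):
    """Merge records for subjects that appear in both sources."""
    shared_ids = sorted(imaging_index.keys() & genomic_index.keys())
    return {sid: {**imaging_index[sid], **genomic_index[sid]} for sid in shared_ids}


linked = link_by_subject(imaging, genomic)
for subject_id, record in linked.items():
    print(subject_id, record)
```

Only subjects present in both sources appear in the result, which is the usual starting point for a radiogenomic analysis.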

Challenge competitions

TCIA also provides specific data sets used for "Challenge" competitions, such as those associated with international professional societies focused on digital imaging like MICCAI, SPIE, or ISBI. A directory of previous and upcoming challenges is maintained on the site. [15]

Digital object identifiers

To facilitate data sharing, many publications encourage authors to cite the data used to produce the results described in their scholarly papers. In addition, journals dedicated to describing data collections are now available (e.g., Nature Scientific Data). TCIA assigns digital object identifiers (DOIs) to all collections when they are submitted, and can also create persistent identifiers linked to subsets of data held within TCIA, which authors may use for data citations in their scholarly papers. [16]
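A DOI becomes a persistent link when it is appended to the public resolver address https://doi.org/. The Python sketch below normalizes a DOI string and builds that link, using the collection DOI quoted earlier in this article as input. The function name `doi_to_url` and the validation pattern are illustrative assumptions, not a TCIA utility, and the pattern is a loose check rather than a full DOI grammar.

```python
import re

# Loose pattern for a DOI: "10.", a numeric registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi):
    """Strip common prefixes from a DOI string and return its resolver URL."""
    cleaned = doi.strip()
    # Remove a leading "doi:" label or an existing resolver prefix, if present.
    cleaned = re.sub(
        r"^(?:doi\s*:\s*|https?://(?:dx\.)?doi\.org/)",
        "",
        cleaned,
        flags=re.IGNORECASE,
    )
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"not a recognizable DOI: {doi!r}")
    return "https://doi.org/" + cleaned


print(doi_to_url("doi:10.7937/K9/TCIA.2015.L4FRET6Z"))
```

Citing the resolver URL rather than a page address keeps the citation valid even if the collection page moves.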

References

  1. Clark, Kenneth; Vendt, Bruce; et al. (2013). "The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository". Journal of Digital Imaging. 26 (6): 1045–1057. doi:10.1007/s10278-013-9622-7. PMC 3824915. PMID 23884657.
  2. "About The Cancer Imaging Archive (TCIA) - The Cancer Imaging Archive (TCIA)". The Cancer Imaging Archive (TCIA). Retrieved 2016-03-24.
  3. "PLOS ONE: accelerating the publication of peer-reviewed science". journals.plos.org. Retrieved 2016-06-11.
  4. "Data guidelines - F1000Research". f1000research.com. Retrieved 2016-06-11.
  5. "The Cancer Imaging Archive | re3data.org". service.re3data.org. Retrieved 2016-06-11.
  6. Clark, Kenneth; Vendt, Bruce; Smith, Kirk; Freymann, John; Kirby, Justin; Koppel, Paul; Moore, Stephen; Phillips, Stanley; Maffitt, David (2013-07-25). "The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository". Journal of Digital Imaging. 26 (6): 1045–1057. doi:10.1007/s10278-013-9622-7. ISSN 0897-1889. PMC 3824915. PMID 23884657.
  7. "The Cancer Imaging Archive". Cancer Imaging Program - National Cancer Institute. Retrieved 2016-03-24.
  8. "The Cancer Imaging Archive (TCIA) - A growing archive of medical images of cancer". The Cancer Imaging Archive (TCIA). Retrieved 2016-03-24.
  9. "Cancer Imaging Archive User's Guide - TCIA Online Help - Cancer Imaging Archive Wiki". wiki.cancerimagingarchive.net. Retrieved 2016-03-24.
  10. "TCIA Programmatic Interface (REST API) Usage Guide - The Cancer Imaging Archive (TCIA) Public Access - Cancer Imaging Archive Wiki". wiki.cancerimagingarchive.net. Retrieved 2016-03-24.
  11. "Publications - The Cancer Imaging Archive (TCIA) Public Access - Cancer Imaging Archive Wiki". wiki.cancerimagingarchive.net. Retrieved 2016-03-24.
  12. "CIP TCGA Radiology Initiative - The Cancer Imaging Archive (TCIA) Public Access - Cancer Imaging Archive Wiki". wiki.cancerimagingarchive.net. Retrieved 2016-03-24.
  13. TCGA Data Portal
  14. "Data Usage Policies and Restrictions - The Cancer Imaging Archive (TCIA) Public Access - Cancer Imaging Archive Wiki". wiki.cancerimagingarchive.net. Retrieved 2016-03-24.
  15. "Challenge competitions - The Cancer Imaging Archive (TCIA) Public Access - Cancer Imaging Archive Wiki". wiki.cancerimagingarchive.net. Retrieved 2016-03-24.
  16. "TCIA Digital Object Identifiers - TCIA DOIs - Cancer Imaging Archive Wiki". wiki.cancerimagingarchive.net. Retrieved 2016-03-24.