Biomedical data science

Last updated

Biomedical data science is a multidisciplinary field which leverages large volumes of data to promote biomedical innovation and discovery. Biomedical data science draws from various fields including Biostatistics, Biomedical informatics, and machine learning, with the goal of understanding biological and medical data. It can be viewed as the study and application of data science to solve biomedical problems. [1] Modern biomedical datasets often have specific features which make their analyses difficult, including:

Contents

Many biomedical data science projects apply machine learning to such datasets. [2] [3] These characteristics, while also present in many data science applications more generally, make biomedical data science a specific field. Examples of biomedical data science research include:

Training in Biomedical Data Science

The National Library of Medicine of the US National Institutes of Health (NIH) identified key biomedical data scientist attributes in an NIH-wide review: general biomedical subject matter knowledge; programming language expertise; predictive analytics, modeling, and machine learning; team science and communication; and responsible data stewardship. [6]

University Departments and Programs

Biomedical Data Science Research in Academia

Scholarly Journals

The first journal dedicated to biomedical data science appeared in 2018 – Annual Review of Biomedical Data Science .

“The Annual Review of Biomedical Data Science provides comprehensive expert reviews in biomedical data science, focusing on advanced methods to store, retrieve, analyze, and organize biomedical data and knowledge. The scope of the journal encompasses informatics, computational, and statistical approaches to biomedical data, including the sub-fields of bioinformatics, computational biology, biomedical informatics, clinical and clinical research informatics, biostatistics, and imaging informatics. The mission of the journal is to identify both emerging and established areas of biomedical data science, and the leaders in these fields.” [7]

Other journals have a more general scope than biomedical data science, but regularly publish biomedical data science research such as Health Data Science [8] and Nature Machine Intelligence. [9] Data science would not exist without curated datasets and the field has seen the rise of journals that are dedicated to describing and validating such datasets, some of which are useful for biomedical applications, including Scientific Data, [10] Biomedical Data, [11] and Data. [12]

Example

The Human Genome Project (HGP), which uncovered the DNA sequences that compose human genes, would not have been possible without biomedical data science. Significant computational resources were required to process the data in the HGP, as the human genome contains over 6 billion DNA base pairs. [13] Scientists constructed the genome by piecing together small fragments of DNA, and computing overlaps between these sequences alone required over 10,000 CPU hours. At this massive data scale, scientists relied on advanced algorithms to perform data processing steps such as sequence assembly and sequence alignment for quality control. [14] Some of these algorithms, such as BLAST, are still used in modern bioinformatics. Scientists in the HGP also had to address complexities often associated with biomedical data including noisy data, such as DNA read errors, and privacy rights of the research subjects. [15] The HGP, completed in 2004, has had immense impact both biologically, shedding light on human evolution, and medically, launching the field of bioinformatics and leading to technologies such as genetic screening and gene therapy.

Related Research Articles

<span class="mw-page-title-main">Bioinformatics</span> Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.

<span class="mw-page-title-main">Genomics</span> Discipline in genetics

Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dimensional structural configuration. In contrast to genetics, which refers to the study of individual genes and their roles in inheritance, genomics aims at the collective characterization and quantification of all of an organism's genes, their interrelations and influence on the organism. Genes may direct the production of proteins with the assistance of enzymes and messenger molecules. In turn, proteins make up body structures such as organs and tissues as well as control chemical reactions and carry signals between cells. Genomics also involves the sequencing and analysis of genomes through uses of high throughput DNA sequencing and bioinformatics to assemble and analyze the function and structure of entire genomes. Advances in genomics have triggered a revolution in discovery-based research and systems biology to facilitate understanding of even the most complex biological systems such as the brain.

<span class="mw-page-title-main">Computational biology</span> Branch of biology

Computational biology refers to the use of data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the field also has foundations in applied mathematics, chemistry, and genetics. It differs from biological computing, a subfield of computer engineering which uses bioengineering to build computers.

<span class="mw-page-title-main">Health informatics</span> Computational approaches to health care

Health informatics is the field of science and engineering that aims at developing methods and technologies for the acquisition, processing, and study of patient data, which can come from different sources and modalities, such as electronic health records, diagnostic test results, medical scans. The health domain provides an extremely wide variety of problems that can be tackled using computational techniques.

<span class="mw-page-title-main">Biological database</span>

Biological databases are libraries of biological sciences, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures.

Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical domain. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies in this field have been applied to the biomedical literature available through services such as PubMed.

<span class="mw-page-title-main">Human Genome Project</span> Human genome sequencing programme

The Human Genome Project (HGP) was an international scientific research project with the goal of determining the base pairs that make up human DNA, and of identifying, mapping and sequencing all of the genes of the human genome from both a physical and a functional standpoint. It started in 1990 and was completed in 2003. It remains the world's largest collaborative biological project. Planning for the project started after it was adopted in 1984 by the US government, and it officially launched in 1990. It was declared complete on April 14, 2003, and included about 92% of the genome. Level "complete genome" was achieved in May 2021, with a remaining only 0.3% bases covered by potential issues. The final gapless assembly was finished in January 2022.

<span class="mw-page-title-main">Steven Salzberg</span> American biologist and computer scientist

Steven Lloyd Salzberg is an American computational biologist and computer scientist who is a Bloomberg Distinguished Professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University, where he is also Director of the Center for Computational Biology.

Mark Bender Gerstein is an American scientist working in bioinformatics and Data Science. As of 2009, he is co-director of the Yale Computational Biology and Bioinformatics program.

<span class="mw-page-title-main">David Haussler</span> American bioinformatician

David Haussler is an American bioinformatician known for his work leading the team that assembled the first human genome sequence in the race to complete the Human Genome Project and subsequently for comparative genome analysis that deepens understanding the molecular function and evolution of the genome.

<span class="mw-page-title-main">Eugene Myers</span> American scientist

Eugene Wimberly "Gene" Myers, Jr. is an American computer scientist and bioinformatician, who is best known for contributing to the early development of the NCBI's BLAST tool for sequence analysis.

<span class="mw-page-title-main">Computational epigenetics</span>

Computational epigenetics uses statistical methods and mathematical modelling in epigenetic research. Due to the recent explosion of epigenome datasets, computational methods play an increasing role in all areas of epigenetic research.

Biology data visualization is a branch of bioinformatics concerned with the application of computer graphics, scientific visualization, and information visualization to different areas of the life sciences. This includes visualization of sequences, genomes, alignments, phylogenies, macromolecular structures, systems biology, microscopy, and magnetic resonance imaging data. Software tools used for visualizing biological data range from simple, standalone programs to complex, integrated systems.

Jason H. Moore is a translational bioinformatics scientist, biomedical informatician, and human geneticist, the Edward Rose Professor of Informatics and Director of the Institute for Biomedical Informatics at the Perelman School of Medicine at the University of Pennsylvania, where he is also Senior Associate Dean for Informatics and Director of the Division of Informatics in the Department of Biostatistics, Epidemiology, and Informatics.

Translational bioinformatics (TBI) is a field that emerged in the 2010s to study health informatics, focused on the convergence of molecular bioinformatics, biostatistics, statistical genetics and clinical informatics. Its focus is on applying informatics methodology to the increasing amount of biomedical and genomic data to formulate knowledge and medical tools, which can be utilized by scientists, clinicians, and patients. Furthermore, it involves applying biomedical research to improve human health through the use of computer-based information system. TBI employs data mining and analyzing biomedical informatics in order to generate clinical knowledge for application. Clinical knowledge includes finding similarities in patient populations, interpreting biological information to suggest therapy treatments and predict health outcomes.

In bioinformatics, a Gene Disease Database is a systematized collection of data, typically structured to model aspects of reality, in a way to comprehend the underlying mechanisms of complex diseases, by understanding multiple composite interactions between phenotype-genotype relationships and gene-disease mechanisms. Gene Disease Databases integrate human gene-disease associations from various expert curated databases and text mining derived associations including Mendelian, complex and environmental diseases.

<span class="mw-page-title-main">Population informatics</span>

The field of population informatics is the systematic study of populations via secondary analysis of massive data collections about people. Scientists in the field refer to this massive data collection as the social genome, denoting the collective digital footprint of our society. Population informatics applies data science to social genome data to answer fundamental questions about human society and population health much like bioinformatics applies data science to human genome data to answer questions about individual health. It is an emerging research area at the intersection of SBEH sciences, computer science, and statistics in which quantitative methods and computational tools are used to answer fundamental questions about our society. [[File:Data science.png|alt=Data Science|thumb|Data Science]

DNA encryption is the process of hiding or perplexing genetic information by a computational method in order to improve genetic privacy in DNA sequencing processes. The human genome is complex and long, but it is very possible to interpret important, and identifying, information from smaller variabilities, rather than reading the entire genome. A whole human genome is a string of 3.2 billion base paired nucleotides, the building blocks of life, but between individuals the genetic variation differs only by 0.5%, an important 0.5% that accounts for all of human diversity, the pathology of different diseases, and ancestral story. Emerging strategies incorporate different methods, such as randomization algorithms and cryptographic approaches, to de-identify the genetic sequence from the individual, and fundamentally, isolate only the necessary information while protecting the rest of the genome from unnecessary inquiry. The priority now is to ascertain which methods are robust, and how policy should ensure the ongoing protection of genetic privacy.

<span class="mw-page-title-main">Genome informatics</span>

Genome Informatics is a scientific study of information processing in genomes.

<span class="mw-page-title-main">Biological data</span>

Biological data refers to a compound or information derived from living organisms and their products. A medicinal compound made from living organisms, such as a serum or a vaccine, could be characterized as biological data. Biological data is highly complex when compared with other forms of data. There are many forms of biological data, including text, sequence data, protein structure, genomic data and amino acids, and links among others.

References

  1. Altman, Russ; Levitt, Michael (2018). "What is Biomedical Data Science and Do We Need an Annual Review of It?". Annual Review of Biomedical Data Science. 1: i–iii. doi: 10.1146/annurev-bd-01-041718-100001 . S2CID   134950609.
  2. Baldi, Pierre (2018). "Deep learning in biomedical data science". Annual Review of Biomedical Data Science. 1: 181–205. doi:10.1146/annurev-biodatasci-080917-013343. S2CID   67381478.
  3. 1 2 Ronneberger, Olaf; Fischer, Philipp; Brox, Thomas (2015). "U-net: Convolutional networks for biomedical image segmentation". International Conference on Medical Image Computing and Computer-Assisted Intervention. arXiv: 1505.04597 .
  4. Duncan, James S; Insana, Michael F; Ayache, Nicholas (2020). "Biomedical imaging and analysis in the age of big data and deep learning [scanning the issue]". Proceedings of the IEEE. 108: 3–10. doi: 10.1109/JPROC.2019.2956422 . S2CID   210077608.
  5. Su, Chang; Tong, Jie; Zhu, Yongjun; Cu, Peng; Wang, Fei (2020). "Network embedding in biomedical data science". Briefings in Bioinformatics. 21 (1): 182–197. doi:10.1093/bib/bby117. PMID   30535359.
  6. Zaringhalam, Maryam; Federer, Lisa; Huerta, Michael. "Core Skills for Biomedical Data Scientists" (PDF). US National Library of Medicine. US National Institutes of Health. Retrieved 21 February 2022.{{cite web}}: CS1 maint: url-status (link)
  7. "Annual Review of Biomedical Data Science". annualreviews.org. Retrieved 2022-02-21.
  8. "Health Data Science". spj.sciencemag.org. Retrieved 2022-07-05.
  9. "Nature Machine Intelligence". nature.com. Retrieved 2022-07-05.
  10. "Scientific Data". nature.com. Retrieved 2022-07-05.
  11. "Biomedical Data Journal". biomed-data.eu. Retrieved 2022-07-05.
  12. "Data". mdpi.com. Retrieved 2022-07-05.
  13. Piovesan, Allison; Pelleri, Maria C; Antonaros, Francesca; Strippoli, Pierluigi; Vitale, Lorenza (2019). "On the length, weight and GC content of the human genome". BMC Research Notes. 12 (1): 106. doi:10.1186/s13104-019-4137-z. PMC   6391780 . PMID   30813969.
  14. Altschul, Stephen F; Gish, Warren; Miller, Webb; Myers, Eugene W; Lipman, David J (1990). "Basic local alignment search tool". Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID   2231712. S2CID   14441902.
  15. Venter, J. Craig; et al. (2001). "The sequence of the human genome". Science. 291 (5507): 1304–1351. Bibcode:2001Sci...291.1304V. doi:10.1126/science.1058040. PMID   11181995.