Computational genomics

Computational genomics refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data, [1] including both DNA and RNA sequence as well as other "post-genomic" data (i.e., experimental data obtained with technologies that require the genome sequence, such as genomic DNA microarrays). Because these data are combined with computational and statistical approaches to understanding gene function and with statistical association analysis, the field is also often referred to as computational and statistical genetics/genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes (rather than individual genes) to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means of biological discovery. [2]

History

The roots of computational genomics are shared with those of bioinformatics. During the 1960s, Margaret Dayhoff and others at the National Biomedical Research Foundation assembled databases of homologous protein sequences for evolutionary study. [3] Their research developed a phylogenetic tree that determined the evolutionary changes that were required for a particular protein to change into another protein based on the underlying amino acid sequences. This led them to create a scoring matrix that assessed the likelihood of one protein being related to another.

Beginning in the 1980s, databases of genome sequences began to be recorded, but this presented new challenges in searching and comparing the databases of gene information. Unlike text-searching algorithms used on websites such as Google or Wikipedia, searching for sections of genetic similarity requires finding strings that are not simply identical, but similar. Such comparisons rely on dynamic programming methods such as the Needleman-Wunsch algorithm (first published in 1970), which aligns pairs of sequences and can score amino acid sequences using matrices derived from the earlier research by Dayhoff. Later, the BLAST algorithm was developed for performing fast, optimized searches of gene sequence databases. BLAST and its derivatives are probably the most widely used algorithms for this purpose. [4]
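As an illustration of the dynamic programming idea behind the Needleman-Wunsch algorithm, the following is a minimal sketch of global alignment. The scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions; a practical protein aligner would replace the simple match/mismatch rule with a substitution matrix such as those derived from Dayhoff's work.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    The scoring parameters are illustrative assumptions, not a Dayhoff matrix.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("GATTACA", "GATACA"))
```

The table has (n+1)(m+1) cells, so time and memory grow with the product of the two sequence lengths. That cost is one reason heuristic tools such as BLAST are used for database-scale searches.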

The emergence of the phrase "computational genomics" coincides with the availability of complete sequenced genomes in the mid-to-late 1990s. The first meeting of the Annual Conference on Computational Genomics was organized by scientists from The Institute for Genomic Research (TIGR) in 1998, providing a forum for this speciality and effectively distinguishing this area of science from the more general fields of genomics or computational biology. [citation needed] The first use of this term in scientific literature, according to MEDLINE abstracts, was just one year earlier in Nucleic Acids Research. [5] The final Computational Genomics conference was held in 2006, featuring a keynote talk by Nobel Laureate Barry Marshall, co-discoverer of the link between Helicobacter pylori and stomach ulcers. As of 2014, the leading conferences in the field include Intelligent Systems for Molecular Biology (ISMB) and Research in Computational Molecular Biology (RECOMB).

The development of computer-assisted mathematics (using products such as Mathematica or Matlab) has helped engineers, mathematicians and computer scientists to start operating in this domain, and a public collection of case studies and demonstrations is growing, ranging from whole-genome comparisons to gene expression analysis. [6] This has broadened the range of ideas brought into the field, including concepts from systems and control, information theory, string analysis and data mining. It is anticipated that computational approaches will become and remain a standard topic for research and teaching, while students fluent in both biology and computation are being trained in the many courses created in the past few years.

Contributions of computational genomics research to biology

Contributions of computational genomics research to biology include: [2]

Genome comparison

Computational tools have been developed to assess the similarity of genomic sequences. Some rely on alignment-based measures such as Average Nucleotide Identity. [7] These methods are highly specific but computationally slow. Other, alignment-free methods include statistical and probabilistic approaches. One example is Mash, [8] a probabilistic approach using MinHash. In this method, given a number k, a genomic sequence is transformed into a shorter sketch by applying a random hash function to the possible k-mers. For example, if k = 2, sketches of size 4 are being constructed, and the hash function is

(AA,0) (AC,8) (AT,2) (AG,14)
(CA,6) (CC,13) (CT,5) (CG,4)
(GA,15) (GC,12) (GT,10) (GG,1)
(TA,3) (TC,11) (TT,9) (TG,7)

the sketch of the sequence

CTGACCTTAACGGGAGACTATGATGACGACCGCAT

is {0,1,2,3}, the four smallest distinct hash values of its k-mers of size 2 (the values 1 and 2 each occur more than once among the k-mers but are counted once). These sketches are then compared to estimate the fraction of shared k-mers (Jaccard index) of the corresponding sequences. It is worth noting that a hash value is a binary number. In a real genomic setting, a useful k-mer size ranges from 14 to 21, and the sketch size would be around 1000. [8]
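The toy example can be reproduced with a short sketch. It assumes that a sketch keeps only distinct hash values, so a k-mer occurring several times contributes its hash once. It estimates the Jaccard index from the smallest values of the merged sketches. The function names and the second sequence (`AAGGAT`) are illustrative and do not come from Mash itself.

```python
# Toy hash function copied from the table above (2-mer -> hash value).
TOY_HASH = {
    "AA": 0, "AC": 8, "AT": 2, "AG": 14,
    "CA": 6, "CC": 13, "CT": 5, "CG": 4,
    "GA": 15, "GC": 12, "GT": 10, "GG": 1,
    "TA": 3, "TC": 11, "TT": 9, "TG": 7,
}


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def sketch(seq, k=2, size=4, table=TOY_HASH):
    """Return the `size` smallest distinct hash values of the k-mers of seq."""
    hashes = {table[kmer] for kmer in kmers(seq, k)}
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=4):
    """Estimate the Jaccard index from two sketches.

    Take the `size` smallest values of the merged sketches and count how
    many of them appear in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    s1 = sketch("CTGACCTTAACGGGAGACTATGATGACGACCGCAT")
    s2 = sketch("AAGGAT")  # hypothetical second sequence
    print(s1, s2, jaccard_estimate(s1, s2))
```

With a real hash function in place of the lookup table, the same two functions apply unchanged to longer k-mers and larger sketches.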

By reducing the size of the sequences, even by a factor of hundreds, and comparing them in an alignment-free way, this method significantly reduces the time needed to estimate the similarity of sequences.

Clusterization of genomic data

Clustering is a tool used to simplify statistical analysis of a genomic sample. For example, the authors of [9] developed a tool (BiG-SCAPE) to analyze sequence similarity networks of biosynthetic gene clusters (BGCs). In [10], successive layers of clustering of biosynthetic gene clusters are used in the automated tool BiG-MAP, both to filter redundant data and to identify gene cluster families. This tool profiles the abundance and expression levels of BGCs in microbiome samples.

Biosynthetic gene clusters

Bioinformatic tools have been developed to predict, and determine the abundance and expression of, this kind of gene cluster in microbiome samples, from metagenomic data. [11] Since the size of metagenomic data is considerable, filtering and clustering are important parts of these tools. These processes can consist of dimensionality-reduction techniques, such as MinHash, [8] and clustering algorithms such as k-medoids and affinity propagation. Several metrics and similarity measures have also been developed to compare gene clusters.
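To make the clustering step concrete, here is a minimal sketch of k-medoids on a precomputed distance matrix. It uses the simple alternating (assign, then update) variant with a naive deterministic initialisation. Both choices are assumptions made for illustration and are not taken from any of the tools mentioned above.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster items given a symmetric distance matrix `dist` (list of lists).

    Returns (medoids, labels): the indices of the medoid items and, for each
    item, the position of its medoid in `medoids`.
    Initialisation with the first k items is a simplifying assumption.
    """
    n = len(dist)
    medoids = list(range(k))

    def assign(current):
        # Each item joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][current[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimises the total distance to its cluster.
            best = min(members, key=lambda cand: sum(dist[cand][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


if __name__ == "__main__":
    points = [0, 1, 2, 10, 11, 12]  # toy one-dimensional data
    dist = [[abs(p - q) for q in points] for p in points]
    print(k_medoids(dist, k=2))
```

Because the medoids are always actual items, the method only needs pairwise distances, which suits similarity measures that are not defined on a vector space.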

Genome mining for biosynthetic gene clusters (BGCs) has become an integral part of natural product discovery. The >200,000 microbial genomes now publicly available hold information on abundant novel chemistry. One way to navigate this vast genomic diversity is through comparative analysis of homologous BGCs, which allows identification of cross-species patterns that can be matched to the presence of metabolites or biological activities. However, current tools are hindered by a bottleneck caused by the expensive network-based approach used to group these BGCs into gene cluster families (GCFs). BiG-SLiCE (Biosynthetic Genes Super-Linear Clustering Engine) is a tool designed to cluster massive numbers of BGCs. By representing them in Euclidean space, BiG-SLiCE can group BGCs into GCFs in a non-pairwise, near-linear fashion.

Kautsar et al. (2021) [12] used BiG-SLiCE to demonstrate the utility of such analyses by reconstructing a global map of secondary metabolic diversity across taxonomy to identify uncharted biosynthetic potential. This opens up new possibilities to accelerate natural product discovery and offers a first step towards constructing a global and searchable interconnected network of BGCs. As more genomes are sequenced from understudied taxa, more information can be mined to highlight their potentially novel chemistry. [12]

Compression algorithms

Genetics compression algorithms are the latest generation of lossless algorithms that compress data (typically sequences of nucleotides) using both conventional compression algorithms and genetic algorithms adapted to the specific datatype. In 2012, a team of scientists from Johns Hopkins University published a genetic compression algorithm that does not use a reference genome for compression. HapZipper was tailored for HapMap data and achieves over 20-fold compression (95% reduction in file size), providing 2- to 4-fold better compression than the leading general-purpose compression utilities while being less computationally intensive. For this, Chanda, Elhaik, and Bader introduced MAF-based encoding (MAFE), which reduces the heterogeneity of the dataset by sorting SNPs by their minor allele frequency, thus homogenizing the dataset. [13] Other algorithms developed in 2009 and 2013 (DNAZip and GenomeZip) have compression ratios of up to 1200-fold, allowing 6 billion basepair diploid human genomes to be stored in 2.5 megabytes (relative to a reference genome or averaged over many genomes). [14] [15] For a benchmark of genetics/genomics data compressors, see [16].
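The sorting step described for MAF-based encoding can be sketched as follows. This covers only the reordering of SNP columns by minor allele frequency and does not reproduce the actual HapZipper encoding. The 0/1 haplotype matrix layout and the function names are assumptions made for the example.

```python
def minor_allele_frequency(column):
    """Frequency of the rarer allele in one SNP column of 0/1 values."""
    freq = sum(column) / len(column)
    return min(freq, 1 - freq)


def sort_snps_by_maf(haplotypes):
    """Reorder SNP columns by increasing minor allele frequency.

    `haplotypes` is a list of equal-length rows (one per haplotype), each a
    list of 0/1 alleles (one per SNP). Returns (order, reordered), where
    `order` lists the original column indices in their new order so the
    permutation can be undone when decoding.
    """
    n_snps = len(haplotypes[0])
    columns = [[row[j] for row in haplotypes] for j in range(n_snps)]
    # sorted() is stable, so SNPs with equal MAF keep their original order.
    order = sorted(range(n_snps), key=lambda j: minor_allele_frequency(columns[j]))
    reordered = [[row[j] for j in order] for row in haplotypes]
    return order, reordered


if __name__ == "__main__":
    toy = [
        [0, 1, 0, 1],
        [0, 1, 1, 1],
        [0, 0, 1, 1],
        [1, 0, 0, 1],
    ]
    order, reordered = sort_snps_by_maf(toy)
    print(order)
    for row in reordered:
        print(row)
```

After sorting, columns with few minor alleles sit next to each other, which tends to produce longer runs of identical symbols for a downstream compressor to exploit.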

References

  1. Koonin EV (March 2001). "Computational genomics". Current Biology. 11 (5): R155–8. doi:10.1016/S0960-9822(01)00081-1. PMID 11267880. S2CID 17202180.
  2. "Computational Genomics and Proteomics at MIT". Archived from the original on 2018-03-22. Retrieved 2006-12-29.
  3. Mount D (2000). Bioinformatics, Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press. pp. 2–3. ISBN 978-0-87969-597-2.
  4. Brown TA (1999). Genomes. Wiley. ISBN 978-0-471-31618-3.
  5. Wagner A (September 1997). "A computational genomics approach to the identification of gene networks". Nucleic Acids Research. 25 (18): 3594–604. doi:10.1093/nar/25.18.3594. PMC 146952. PMID 9278479.
  6. Cristianini N, Hahn M (2006). Introduction to Computational Genomics. Cambridge University Press. ISBN 978-0-521-67191-0.
  7. Konstantinidis KT, Tiedje JM (2005). "Genomic insights that advance the species definition for prokaryotes". Proc Natl Acad Sci U S A. 102 (7): 2567–72. Bibcode:2005PNAS..102.2567K. doi:10.1073/pnas.0409727102. PMC 549018. PMID 15701695.
  8. Ondov B, Treangen T, Melsted P, Mallonee A, Bergman N, Koren S, Phillippy A (2016). "Mash: fast genome and metagenome distance estimation using MinHash". Genome Biology. 17 (32): 14. doi:10.1186/s13059-016-0997-x. PMC 4915045. PMID 27323842.
  9. Navarro-Muñoz J, Selem-Mojica N, Mullowney M, Kautsar S, Tryon J, Parkinson E, De Los Santos E, Yeong M, Cruz-Morales P, Abubucker S, Roeters A, Lokhorst W, Fernandez-Guerra A, Dias-Cappelini L, Goering A, Thomson R, Metcalf W, Kelleher N, Barona-Gomez F, Medema M (2020). "A computational framework to explore large-scale biosynthetic diversity". Nat Chem Biol. 16 (1): 60–68. doi:10.1038/s41589-019-0400-9. PMC 6917865. PMID 31768033.
  10. Pascal-Andreu V, Augustijn H, van den Berg K, van der Hooft J, Fischbach M, Medema M (2020). "BiG-MAP: an automated pipeline to profile metabolic gene cluster abundance and expression in microbiomes". mSystems. 6 (5): e00937-21. bioRxiv 10.1101/2020.12.14.422671. doi:10.1128/msystems.00937-21. PMC 8547482. PMID 34581602.
  11. Pascal-Andreu V, Augustijn H, van den Berg K, van der Hooft J, Fischbach M, Medema M (2020). "BiG-MAP: an automated pipeline to profile metabolic gene cluster abundance and expression in microbiomes". bioRxiv. 6 (5): e00937-21. doi:10.1101/2020.12.14.422671. PMC 8547482. PMID 34581602.
  12. Kautsar SA, van der Hooft JJJ, de Ridder D, Medema MH (13 January 2021). "BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters". GigaScience. 10 (1): giaa154. doi:10.1093/gigascience/giaa154. PMC 7804863. PMID 33438731.
  13. Chanda P, Bader JS, Elhaik E (27 July 2012). "HapZipper: sharing HapMap populations just got easier". Nucleic Acids Research. 40 (20): e159. doi:10.1093/nar/gks709. PMC 3488212. PMID 22844100.
  14. Christley S, Lu Y, Li C, Xie X (15 January 2009). "Human genomes as email attachments". Bioinformatics. 25 (2): 274–5. doi:10.1093/bioinformatics/btn582. PMID 18996942.
  15. Pavlichin DS, Weissman T, Yona G (September 2013). "The human genome contracts again". Bioinformatics. 29 (17): 2199–202. doi:10.1093/bioinformatics/btt362. PMID 23793748.
  16. Hosseini M, Pratas D, Pinho A (2016). "A Survey on Data Compression Methods for Biological Sequences". Information. 7 (4): 56. doi:10.3390/info7040056.