Phylogenetic profiling

Phylogenetic profiling is a bioinformatics technique in which the joint presence or joint absence of two traits across large numbers of species is used to infer a meaningful biological connection, such as the involvement of two different proteins in the same biological pathway. Like the examination of conserved synteny, conserved operon structure, and "Rosetta Stone" domain fusions, the comparison of phylogenetic profiles is a "post-homology" technique: its essential computation begins only after it has been determined which proteins are homologous to which. A number of these techniques were developed by David Eisenberg and colleagues; phylogenetic profile comparison was introduced in 1999 by Pellegrini et al. [1]

Method

Over 2000 species of bacteria, archaea, and eukaryotes are now represented by complete DNA genome sequences. Typically, each gene in a genome encodes a protein that can be assigned to a particular protein family on the basis of homology. In the original, binary formulation, the presence or absence of a given protein family in each genome is represented by 1 (present) or 0 (absent). The phylogenetic distribution of the protein family can therefore be represented by a long binary number with one digit per genome; such binary representations are easily compared with each other to search for correlated phylogenetic distributions. The large number of complete genomes makes these profiles rich in information. The advantage of using only complete genomes is that the 0 values, representing the absence of a trait, tend to be reliable: a family not found in a complete genome is unlikely to have been missed simply because the relevant region was never sequenced.
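The binary encoding described above can be illustrated with a minimal Python sketch. The genome names, family names, and the helper functions `build_profiles` and `hamming` are hypothetical and chosen only for illustration; the example assumes that homology-based family assignment has already been done and that each family is given as the set of genomes in which it occurs. Counting differing bits (the Hamming distance) is just one simple way to compare two profiles.

```python
from typing import Dict, List, Set


def build_profiles(genomes: List[str],
                   family_members: Dict[str, Set[str]]) -> Dict[str, int]:
    """Encode each family's presence (1) or absence (0) per genome as an integer.

    The first genome in `genomes` becomes the most significant bit.
    """
    profiles = {}
    for family, present_in in family_members.items():
        bits = 0
        for genome in genomes:
            bits = (bits << 1) | (1 if genome in present_in else 0)
        profiles[family] = bits
    return profiles


def hamming(a: int, b: int) -> int:
    """Number of genomes in which two profiles disagree."""
    return bin(a ^ b).count("1")


# Hypothetical toy data: six complete genomes and four protein families.
genomes = ["G1", "G2", "G3", "G4", "G5", "G6"]
family_members = {
    "famA": {"G1", "G2", "G4"},
    "famB": {"G1", "G2", "G4"},
    "famC": {"G3", "G5", "G6"},
    "famD": {"G1", "G2", "G4", "G5"},
}

profiles = build_profiles(genomes, family_members)

# Fixed-width bit strings, one digit per genome, for display.
profile_strings = {
    family: format(bits, f"0{len(genomes)}b") for family, bits in profiles.items()
}

for family, text in profile_strings.items():
    print(family, text, "distance to famA:",
          hamming(profiles["famA"], profiles[family]))
```

In this toy data set famA and famB have identical profiles, so they would be flagged as candidates for a functional link, whereas famC has the exactly complementary distribution.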

Theory

Closely related species should be expected to have very similar sets of genes. However, changes accumulate between more distantly related species by processes that include horizontal gene transfer and gene loss. Individual proteins have specific molecular functions, such as carrying out a single enzymatic reaction or serving as one subunit of a larger protein complex. A biological process such as photosynthesis, methanogenesis, or histidine biosynthesis may require the concerted action of many proteins. If some protein critical to a process is lost, other proteins dedicated to that process would become useless; natural selection makes it unlikely these useless proteins will be retained over evolutionary time. Therefore, should two different protein families consistently tend to be either present or absent together, a likely hypothesis is that the two proteins cooperate in some biological process.
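Whether two families "consistently tend to be either present or absent together" can be quantified in several ways. The following Python sketch shows one possible score, a one-sided hypergeometric test on the number of genomes containing both families. It is an illustrative assumption rather than the scoring used in the original publication; the function name `cooccurrence_pvalue` and the toy profiles are hypothetical. The test treats genomes as independent observations, which ignores the shared ancestry of related species.

```python
from math import comb


def cooccurrence_pvalue(profile_a: str, profile_b: str) -> float:
    """Probability of seeing at least the observed number of shared genomes
    if family B were distributed at random with respect to family A.

    Profiles are equal-length strings of '0' and '1', one character per genome.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    n = len(profile_a)               # total number of genomes
    a = profile_a.count("1")         # genomes containing family A
    b = profile_b.count("1")         # genomes containing family B
    # Genomes containing both families.
    k = sum(1 for x, y in zip(profile_a, profile_b) if x == "1" and y == "1")
    # Upper tail of the hypergeometric distribution.
    tail = sum(comb(a, i) * comb(n - a, b - i) for i in range(k, min(a, b) + 1))
    return tail / comb(n, b)


# Hypothetical toy profiles over ten genomes.
p_same = cooccurrence_pvalue("1111100000", "1111100000")
p_mixed = cooccurrence_pvalue("1111100000", "1010101010")
print(p_same)   # 1/252, about 0.004: perfectly matching distributions
print(p_mixed)  # 0.5: no evidence of association
```

A small value suggests that the two families co-occur more often than expected by chance, which is the pattern the argument above interprets as evidence of cooperation in a common process.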

Advances and challenges

Phylogenetic profiling has led to numerous discoveries in biology, including previously unknown enzymes in metabolic pathways, transcription factors that bind to conserved regulatory sites, and explanations for the roles of certain mutations in human disease. [2] Improving the method is an active area of research because it faces several limitations. First, co-occurrence of two protein families often reflects recent common ancestry of two species rather than a conserved functional relationship; disambiguating these two sources of correlation may require improved statistical methods. Second, proteins grouped as homologs may differ in function, and proteins conserved in function may fail to register as homologs; better methods for tailoring the size of each protein family to reflect functional conservation should improve results.

Tools

Tools include PLEX (Protein Link Explorer), which is now defunct, [3] and the JGI IMG (Integrated Microbial Genomes) Phylogenetic Profiler (for both single genes and gene cassettes). [4]

Notes

  1. Pellegrini, Matteo; Marcotte, Edward M; Thompson, Michael J; Eisenberg, David; Yeates, Todd O (1999). "Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles". Proceedings of the National Academy of Sciences USA. 96 (8): 4285–4288. doi:10.1073/pnas.96.8.4285. PMC 16324. PMID 10200254.
  2. Kensche, Philip R; van Noort, Vera; Dutilh, Bas E; Huynen, Martijn A (2008). "Practical and theoretical advances in predicting the function of a protein by its phylogenetic distribution". Journal of the Royal Society Interface. 5 (19): 151–170. doi:10.1098/rsif.2007.1047. PMC 2405902. PMID 17535793.
  3. Date, Shailesh V.; Marcotte, Edward M. (2005-05-15). "Protein function prediction using the Protein Link EXplorer (PLEX)". Bioinformatics. 21 (10): 2558–2559. doi:10.1093/bioinformatics/bti313. ISSN 1367-4803. PMID 15701682.
  4. Chen, I.-Min A.; Chu, Ken; Palaniappan, Krishna; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Huntemann, Marcel; Varghese, Neha; White, James R. (2018-10-05). "IMG/M v.5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes". Nucleic Acids Research. 47 (D1): D666–D677. doi:10.1093/nar/gky901. ISSN 1362-4962. PMC 6323987. PMID 30289528.

Related Research Articles

Bioinformatics: Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.

Sequence alignment: Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data.

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. Methodologies used include sequence alignment, searches against biological databases, and others.

Protein family: Group of evolutionarily related proteins

A protein family is a group of evolutionarily related proteins. In many cases, a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term "protein family" should not be confused with family as it is used in taxonomy.

In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

Comparative genomics

Comparative genomics is a field of biological research in which the genomic features of different organisms are compared. The genomic features may include the DNA sequence, genes, gene order, regulatory sequences, and other genomic structural landmarks. In this branch of genomics, whole or large parts of genomes resulting from genome projects are compared to study basic biological similarities and differences as well as evolutionary relationships between organisms. The major principle of comparative genomics is that common features of two organisms will often be encoded within the DNA that is evolutionarily conserved between them. Therefore, comparative genomic approaches start with making some form of alignment of genome sequences and looking for orthologous sequences in the aligned genomes and checking to what extent those sequences are conserved. Based on these, genome and molecular evolution are inferred and this may in turn be put in the context of, for example, phenotypic evolution or population genetics.

Interactome: Complete set of molecular interactions in a biological cell

In molecular biology, an interactome is the whole set of molecular interactions in a particular cell. The term specifically refers to physical interactions among molecules but can also describe sets of indirect interactions among genes.

Sequence homology: Shared ancestry between DNA, RNA or protein sequences

Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a speciation event (orthologs), or a duplication event (paralogs), or else a horizontal gene transfer event (xenologs).

Conserved sequence: Similar DNA, RNA or protein sequences within genomes or among species

In evolutionary biology, conserved sequences are identical or similar sequences in nucleic acids or proteins across species, or within a genome, or between donor and receptor taxa. Conservation indicates that a sequence has been maintained by natural selection.

Protein–protein interaction prediction is a field combining bioinformatics and structural biology in an attempt to identify and catalog physical interactions between pairs or groups of proteins. Understanding protein–protein interactions is important for the investigation of intracellular signaling pathways, modelling of protein complex structures and for gaining insights into various biochemical processes.

Genomic phylostratigraphy is a genetic statistical method developed to date the origin of specific genes by looking at their homologs across species. It was first developed at the Ruđer Bošković Institute in Zagreb, Croatia. The system links genes to their founder gene, which allows their age to be determined. This could help to better understand many evolutionary processes, such as patterns of gene birth throughout evolution or how the age of a transcriptome changes throughout embryonic development. Bioinformatic tools like GenEra have been developed to calculate relative gene ages based on genomic phylostratigraphy.

MicrobesOnline

MicrobesOnline is a publicly and freely accessible website that hosts multiple comparative genomic tools for comparing microbial species at the genomic, transcriptomic and functional levels. MicrobesOnline was developed by the Virtual Institute for Microbial Stress and Survival, which is based at the Lawrence Berkeley National Laboratory in Berkeley, California. The site was launched in 2005, with regular updates until 2011.

SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins that have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Edward Marcotte is a professor of biochemistry at The University of Texas at Austin, working in genetics, proteomics, and bioinformatics. Marcotte is an example of a computational biologist who also relies on experiments to validate bioinformatics-based predictions.

Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins. These proteins are usually ones that are poorly studied or predicted based on genomic sequence data. These predictions are often driven by data-intensive computational procedures. Information may come from nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interaction. Protein function is a broad term: the roles of proteins range from catalysis of biochemical reactions to transport to signal transduction, and a single protein may play a role in multiple processes or cellular pathways.

DNA annotation: The process of describing the structure and function of a genome

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred. Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence similarity is evident. Sequence homology can then be deduced even if not apparent. Superfamilies typically contain several protein families which show sequence similarity within each family. The term protein clan is commonly used for protease and glycosyl hydrolase superfamilies based on the MEROPS and CAZy classification systems.

Horizontal or lateral gene transfer is the transmission of portions of genomic DNA between organisms through a process decoupled from vertical inheritance. In the presence of HGT events, different fragments of the genome are the result of different evolutionary histories. This can therefore complicate investigations of the evolutionary relatedness of lineages and species. Also, as HGT can bring into genomes radically different genotypes from distant lineages, or even new genes bearing new functions, it is a major source of phenotypic innovation and a mechanism of niche adaptation. For example, of particular relevance to human health is the lateral transfer of antibiotic resistance and pathogenicity determinants, leading to the emergence of pathogenic lineages.

In molecular phylogenetics, relationships among individuals are determined using character traits, such as DNA, RNA or protein, which may be obtained using a variety of sequencing technologies. High-throughput next-generation sequencing has become a popular technique in transcriptomics, which represents a snapshot of gene expression. In eukaryotes, making phylogenetic inferences using RNA is complicated by alternative splicing, which produces multiple transcripts from a single gene. As such, a variety of approaches may be used to improve phylogenetic inference using transcriptomic data obtained from RNA-Seq and processed using computational phylogenetics.

Protein tandem repeats

An array of protein tandem repeats is defined as several adjacent copies having the same or similar sequence motifs. These periodic sequences are generated by internal duplications in both coding and non-coding genomic sequences. Repetitive units of protein tandem repeats are considerably diverse, ranging from the repetition of a single amino acid to domains of 100 or more residues.