Critical Assessment of Functional Annotation


The Critical Assessment of Functional Annotation (CAFA) is an experiment designed to provide a large-scale assessment of computational methods dedicated to predicting protein function. [1] Different algorithms are evaluated by their ability to predict the Gene Ontology (GO) terms in the categories of Molecular Function, Biological Process, and Cellular Component.


The experiment consists of two tracks: (i) the eukaryotic track and (ii) the prokaryotic track. In each track, the organizers provide a set of target proteins. Participants submit their predictions by the submission deadline, after which the predictions are assessed according to a set of specific metrics.

Motivation

The genome of an organism may consist of hundreds to tens of thousands of genes, which encode hundreds of thousands of different protein sequences. Due to the relatively low cost of genome sequencing, determining gene and protein sequences is fast and inexpensive. Thousands of species have been sequenced so far, yet many of their proteins are not well characterized. [2] Experimentally determining the role of a protein in the cell is an expensive and time-consuming task. Further, even when functional assays are performed, they are unlikely to provide complete insight into protein function. It has therefore become important to use computational tools to functionally annotate proteins. Several computational methods can infer protein function from a variety of biological and evolutionary data, but there is significant room for improvement. Accurate prediction of protein function can have long-lasting implications for biomedical and pharmaceutical research.

The CAFA experiment is designed to provide an unbiased assessment of computational methods, to stimulate research in computational function prediction, and to provide insights into the overall state of the art in function prediction.

Organization

The experiment consists of three phases:

  1. Prediction phase: ~4 months

Organizers provide protein sequences with unknown or incomplete function to the community and set the deadline for the submission of predictions.

  2. Target accumulation phase: 6–12 months

After all predictions are stored, the experiment enters a waiting period in which experimentally determined protein functions are expected to accumulate in public databases.

  3. Analysis phase: 1 month

Predictors are ranked according to their performance. The results are publicly shared in scientific meetings and published after peer review.
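
The ranking step in the analysis phase can be illustrated with a minimal sketch. It assumes that each method submits a confidence score between 0 and 1 for every predicted GO term, and that performance is summarized by the maximum protein-centric F-measure over all score thresholds, a metric of the kind used in CAFA evaluations. The function name `f_max`, the data layout, and the GO identifiers are hypothetical placeholders, and the sketch omits the propagation of terms to their ancestors in the ontology that a real evaluation would need.

```python
from typing import Dict, Set

# Hypothetical data layout, chosen for illustration only:
Predictions = Dict[str, Dict[str, float]]  # protein -> {GO term: confidence in [0, 1]}
Truth = Dict[str, Set[str]]                # protein -> experimentally annotated GO terms


def f_max(predictions: Predictions, truth: Truth, steps: int = 100) -> float:
    """Return the maximum protein-centric F-measure over a grid of thresholds."""
    # Only proteins that gained at least one annotation form the benchmark.
    benchmark = {p: terms for p, terms in truth.items() if terms}
    if not benchmark:
        return 0.0

    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions = []   # one entry per protein with at least one prediction
        recall_sum = 0.0  # summed over every benchmark protein

        for protein, true_terms in benchmark.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= threshold}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)

        if not precisions:
            continue  # nothing predicted at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(benchmark)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example with placeholder GO identifiers (not real annotations).
predictions = {
    "P1": {"GO:0000001": 0.9, "GO:0000002": 0.4},
    "P2": {"GO:0000003": 0.6},
}
truth = {
    "P1": {"GO:0000001"},
    "P2": {"GO:0000003", "GO:0000004"},
}
print(round(f_max(predictions, truth), 3))
```

In this sketch precision is averaged only over proteins for which the method makes at least one prediction at the given threshold, while recall is averaged over all benchmark proteins, so a method cannot raise its score simply by staying silent on difficult targets.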

History

The CAFA experiment is conducted by the Automated Function Prediction (AFP) Special Interest Group (AFP/SIG). CAFA was conceived by Dr. Inbal (Halperin) Landsberg and was organized by her along with Prof. Russ Altman and Dr. Iddo Friedberg. AFP/SIG meetings were held alongside the Intelligent Systems for Molecular Biology conference in 2005, 2006, 2008, 2011, and 2012. [3] [4] [5]

CAFA 2010-2012

The first CAFA experiment was organized between fall 2010 and spring 2012. The organizers provided 48,000 sequences to the community, with the task of predicting Gene Ontology annotations for each of these sequences. Of those 48,000 proteins, 866 were experimentally annotated during the target accumulation phase. The results showed that current function prediction algorithms perform significantly better than simple domain assignment or a straightforward use of the BLAST package. However, they also revealed that accurate prediction of a protein's biological function is still an open and challenging problem.

CAFA 2013-2014

The second CAFA experiment began in fall 2013. Starting in August, interested parties could download more than 100,000 target sequences from 27 species. Registered teams were challenged to annotate the sequences with Gene Ontology terms, with an additional challenge to annotate human sequences with Human Phenotype Ontology terms. The submission deadline was January 15, 2014. The assessment of predictions was scheduled for June 2014.

See also

CASP: Critical Assessment of protein Structure Prediction
CAPRI: Critical Assessment of Prediction of Interactions

References

  1. Radivojac, Predrag; et al. (2013). "A large-scale evaluation of computational protein function prediction". Nature Methods. 10 (3): 221–227. doi:10.1038/nmeth.2340. PMC 3584181. PMID 23353650.
  2. Bernal, Axel; Uy Ear; Nikos Kyrpides (2001). "Genomes OnLine Database (GOLD): a monitor of genome projects world-wide". Nucleic Acids Research. 29 (1): 126–127. doi:10.1093/nar/29.1.126. PMC 29859. PMID 11125068.
  3. Friedberg, Iddo; Martin Jambon; Adam Godzik (June 2006). "New avenues in protein function prediction". Protein Science. 15 (6): 1527–1529. doi:10.1110/ps.062158406. PMC 2242544. PMID 16731984.
  4. Rodrigues, Ana; Barry Grant; Adam Godzik; Iddo Friedberg (2007). "The 2006 Automated Function Prediction Meeting". BMC Bioinformatics. 8 (Suppl 4): S1–4. doi:10.1186/1471-2105-8-s4-s1. PMC 1892079. PMID 17570143.
  5. Gillis, Jesse; Paul Pavlidis (April 2013). "Characterizing the state of the art in the computational assignment of gene function: lessons from the first critical assessment of functional annotation (CAFA)". BMC Bioinformatics. 14 (Suppl 3): S15. doi:10.1186/1471-2105-14-s3-s15. PMC 3633048. PMID 23630983.