Hypothetical protein


In biochemistry, a hypothetical protein is a protein whose existence has been predicted but for which there is no experimental evidence that it is expressed in vivo. Genome sequencing has produced numerous predicted open reading frames to which functions cannot be readily assigned. These proteins, either orphan or conserved hypothetical proteins, make up an estimated 20% to 40% of the proteins encoded in each newly sequenced genome. Evidence that a hypothetical protein functions in the metabolism of the organism can be inferred by comparing its sequence or structure with those of known proteins, for example through conserved domain analysis. [1] Even when techniques such as microarray analysis and mass spectrometry show that the gene product is expressed, it is difficult to assign a function to it because it lacks identity to protein sequences with annotated biochemical function.

Most protein sequences are now inferred from computational analysis of genomic DNA sequence, and hypothetical proteins are produced by gene prediction software during genome analysis. When the bioinformatic tool used for gene identification finds a large open reading frame without a characterised homologue in the protein database, it returns "hypothetical protein" as an annotation remark.
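The annotation step described above can be illustrated with a minimal Python sketch. It is not how any particular gene prediction tool works: the names `find_orfs`, `annotate`, and `known_genes`, the toy genome, and the exact-match lookup are all assumptions made for illustration, and real pipelines scan both strands and search a protein database for homologues rather than exact matches.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=30):
    """Return (start, end, frame) for forward-strand ORFs (ATG ... stop).

    Only the three forward reading frames are scanned; a real gene finder
    would also scan the reverse complement and use a statistical model.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                # length in codons, not counting the stop codon
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return sorted(orfs)

def annotate(orf_seq, known_genes):
    """Toy stand-in for a homology search: exact lookup, else the default remark."""
    return known_genes.get(orf_seq, "hypothetical protein")

# Hypothetical data: one "characterised" entry and a toy genome with two ORFs.
known_genes = {"ATGAAATTTTAA": "toy kinase (made-up entry)"}
genome = "CCATGAAATTTTAAGATGCCCGGGTTTTAGC"

for start, end, frame in find_orfs(genome, min_codons=3):
    print(start, end, frame, annotate(genome[start:end], known_genes))
```

In this toy run the first open reading frame matches the made-up database entry, while the second has no match and therefore receives the "hypothetical protein" remark.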


The function of a hypothetical protein can be predicted by domain homology searches, with varying levels of confidence. [2] Conserved domains found in a hypothetical protein can be compared with known family domains, allowing the protein to be classified into a particular protein family even though it has not been investigated in vivo. Function can also be predicted by homology modelling, in which the hypothetical protein is aligned with a protein sequence whose three-dimensional structure is known; if a structure can be modelled, the capability of the hypothetical protein to function can be assessed computationally. [2] [3] [4] Further approaches to annotating the function of hypothetical proteins include determination of their three-dimensional structure by structural genomics initiatives, understanding the nature and mode of prosthetic group or metal ion binding, identifying fold similarity with proteins of known function, and annotating possible catalytic and regulatory sites. [5] Structure prediction combined with biochemical function assessment by screening against various substrates is another promising approach to function annotation. [2]
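As a rough illustration of classifying a sequence by a conserved signature, the sketch below scans a protein sequence for the Walker A (P-loop) motif, written in PROSITE style as [AG]-x(4)-G-K-[ST]. The `SIGNATURES` table, the `classify` function, and both example sequences are assumptions made for this example; domain databases use profile-based models rather than a single regular expression, and a motif hit is only a hint about family membership, not evidence of function.

```python
import re

# Toy signature table with a single entry. The pattern is the Walker A
# (P-loop) motif [AG]-x(4)-G-K-[ST] expressed as a regular expression.
SIGNATURES = {
    "P-loop NTPase (Walker A motif)": re.compile(r"[AG].{4}GK[ST]"),
}

def classify(protein_seq):
    """Return (family, position, matched_text) for each signature found."""
    hits = []
    for family, pattern in SIGNATURES.items():
        match = pattern.search(protein_seq.upper())
        if match:
            hits.append((family, match.start(), match.group()))
    return hits

# Made-up sequences: the first contains a Walker A motif, the second does not.
print(classify("MSLKGPSGSGKSTLLRAL"))
print(classify("MKLVWQERTY"))  # no hit: the protein stays "hypothetical"
```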


Related Research Articles

Bioinformatics: Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.

Structural genomics

Structural genomics seeks to describe the 3-dimensional structure of every protein encoded by a given genome. This genome-based approach allows for a high-throughput method of structure determination by a combination of experimental and modeling approaches. The principal difference between structural genomics and traditional structural prediction is that structural genomics attempts to determine the structure of every protein encoded by the genome, rather than focusing on one particular protein. With full-genome sequences available, structure prediction can be done more quickly through a combination of experimental and modeling approaches, especially because the availability of a large number of sequenced genomes and previously solved protein structures allows scientists to model protein structure on the structures of previously solved homologs.

Protein structure prediction: Type of biological prediction

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence, that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by computational biology, and it is important in medicine and biotechnology.

In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

In molecular biology, protein threading, also known as fold recognition, is a method of protein modeling which is used to model those proteins which have the same fold as proteins of known structures, but do not have homologous proteins with known structure. It differs from the homology modeling method of structure prediction as it is used for proteins which do not have their homologous protein structures deposited in the Protein Data Bank (PDB), whereas homology modeling is used for those proteins which do. Threading works by using statistical knowledge of the relationship between the structures deposited in the PDB and the sequence of the protein which one wishes to model.

Metabolic network modelling: Form of biological modelling

Metabolic network modelling, also known as metabolic network reconstruction or metabolic pathway analysis, allows for an in-depth insight into the molecular mechanisms of a particular organism. In particular, these models correlate the genome with molecular physiology. A reconstruction breaks down metabolic pathways into their respective reactions and enzymes, and analyzes them within the perspective of the entire network. In simplified terms, a reconstruction collects all of the relevant metabolic information of an organism and compiles it in a mathematical model. Validation and analysis of reconstructions can allow identification of key features of metabolism such as growth yield, resource distribution, network robustness, and gene essentiality. This knowledge can then be applied to create novel biotechnology.

InterPro is a database of protein families, protein domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them.

Protein–protein interaction prediction is a field combining bioinformatics and structural biology in an attempt to identify and catalog physical interactions between pairs or groups of proteins. Understanding protein–protein interactions is important for the investigation of intracellular signaling pathways, modelling of protein complex structures and for gaining insights into various biochemical processes.

Homology modeling: Method of protein structure prediction using other known proteins

Homology modeling, also known as comparative modeling of protein, refers to constructing an atomic-resolution model of the "target" protein from its amino acid sequence and an experimental three-dimensional structure of a related homologous protein. Homology modeling relies on the identification of one or more known protein structures likely to resemble the structure of the query sequence, and on the production of an alignment that maps residues in the query sequence to residues in the template sequence. It has been seen that protein structures are more conserved than protein sequences amongst homologues, but sequences falling below a 20% sequence identity can have very different structure.
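The sequence identity mentioned above can be computed from a pairwise alignment. The sketch below is one simple convention, assuming identity is measured over alignment columns where neither sequence has a gap; other definitions (for example, dividing by the length of the shorter sequence) give different numbers. The function name `percent_identity` and the aligned sequences are made up for illustration.

```python
def percent_identity(aligned_query, aligned_template):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = matches = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == "-" or t == "-":
            continue  # skip gap columns
        aligned += 1
        if q == t:
            matches += 1
    return 100.0 * matches / aligned if aligned else 0.0

# Made-up alignment: 9 ungapped columns, 7 of them identical.
identity = percent_identity("MKT-AYIAKQR", "MKSLAY-ARQR")
print(f"{identity:.1f}% identity")
```

A template choice would then be judged against a threshold such as the 20% figure above, bearing in mind that the cut-off depends on which identity definition is used.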

SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.

Computational Resources for Drug Discovery (CRDD) is one of the important in silico modules of Open Source for Drug Discovery (OSDD). The CRDD web portal provides computer resources related to drug discovery on a single platform. It provides computational resources for researchers in computer-aided drug design, a discussion forum, and resources to maintain a Wikipedia related to drug discovery, predict inhibitors, and predict the ADME-Tox property of molecules. One of the major objectives of CRDD is to promote open source software in the field of chemoinformatics and pharmacoinformatics.

Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins. These proteins are usually ones that are poorly studied or predicted based on genomic sequence data. These predictions are often driven by data-intensive computational procedures. Information may come from nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interaction. Protein function is a broad term: the roles of proteins range from catalysis of biochemical reactions to transport to signal transduction, and a single protein may play a role in multiple processes or cellular pathways.

DNA annotation: The process of describing the structure and function of a genome

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

Phyre and Phyre2 are free web-based services for protein structure prediction. Phyre is among the most popular methods for protein structure prediction having been cited over 1500 times. Like other remote homology recognition techniques, it is able to regularly generate reliable protein models when other widely used methods such as PSI-BLAST cannot. Phyre2 has been designed to ensure a user-friendly interface for users inexpert in protein structure prediction methods. Its development is funded by the Biotechnology and Biological Sciences Research Council.

Enzyme Function Initiative: Collaborative project to determine enzyme function

The Enzyme Function Initiative (EFI) is a large-scale collaborative project aiming to develop and disseminate a robust strategy to determine enzyme function through an integrated sequence–structure-based approach. The project was funded in May 2010 by the National Institute of General Medical Sciences as a Glue Grant, which supports the research of complex biological problems that cannot be solved by a single research group. The EFI was largely spurred by the need to develop methods to identify the functions of the enormous number of proteins discovered through genomic sequencing projects.

αr15 is a family of bacterial small non-coding RNAs with representatives in a broad group of α-proteobacteria from the order Rhizobiales. The first members of this family were found tandemly arranged in the same intergenic region (IGR) of the Sinorhizobium meliloti 1021 chromosome (C). Further homology and structure conservation analyses have identified full-length Smr15C1 and Smr15C2 homologs in several nitrogen-fixing symbiotic rhizobia, in the plant pathogens belonging to Agrobacterium species as well as in a broad spectrum of Brucella species. The Smr15C1 and Smr15C2 homologs are also encoded in tandem within the same IGR of Rhizobium and Agrobacterium species, whereas in Brucella species the αr15C loci are spread in the IGRs of Chromosome I. Moreover, this analysis also identified a third αr15 locus in extrachromosomal replicons of these nitrogen-fixing α-proteobacteria and in Chromosome II of Brucella species. αr15 RNA species are 99-121 nt long and share a well-defined common secondary structure consisting of three stem loops. The transcripts of the αr15 family can be catalogued as trans-acting sRNAs encoded by independent transcription units with recognizable promoter and transcription termination signatures within intergenic regions (IGRs) of the α-proteobacterial genomes.

The HH-suite is an open-source software package for sensitive protein sequence searching. It contains programs that can search for similar protein sequences in protein sequence databases. Sequence searches are a standard tool in modern biology with which the function of unknown proteins can be inferred from the functions of proteins with similar sequences. HHsearch and HHblits are two main programs in the package and the entry point to its search function, the latter being a faster iteration. HHpred is an online server for protein structure prediction that uses homology information from HH-suite.

A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred. Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence similarity is evident. Sequence homology can then be deduced even if not apparent. Superfamilies typically contain several protein families which show sequence similarity within each family. The term protein clan is commonly used for protease and glycosyl hydrolase superfamilies based on the MEROPS and CAZy classification systems.

Single nucleotide polymorphism annotation is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.

Non-coding RNAs have been discovered using both experimental and bioinformatic approaches. Bioinformatic approaches can be divided into three main categories. The first involves homology search, although these techniques are by definition unable to find new classes of ncRNAs. The second category includes algorithms designed to discover specific types of ncRNAs that have similar properties. Finally, some discovery methods are based on very general properties of RNA, and are thus able to discover entirely new kinds of ncRNAs.

References

  1. Galperin MY (2001). "Conserved 'hypothetical' proteins: new hints and new puzzles". Comparative and Functional Genomics. 2 (1): 14–18. doi:10.1002/cfg.66. PMC 2447192. PMID 18628897.
  2. Srinivasan B; et al. (2015). "Prediction of substrate specificity and preliminary kinetic characterization of the hypothetical protein PVX_123945 from Plasmodium vivax". Exp. Parasitol. 151–152: 56–63. doi:10.1016/j.exppara.2015.01.013. PMID 25655405.
  3. Kewate PS; Urade RC; Gore DG; Soni MA; Kopulwar AP (2015). "In silico enzyme function prediction in hypothetical proteins of Mycobacterium bovis AF2122/97". Journal of Pharmacy Research. 9 (3): 182–189.
  4. Gore D (2009). "In silico Prediction of Structure and Enzymatic Activity for Hypothetical Proteins of Shigella flexneri". Biofrontiers. 1 (2): 1–10.
  5. Eisenstein E; et al. (2000). "Biological function made crystal clear - annotation of hypothetical proteins via structural genomics". Curr Opin Biotechnol. 11 (1): 25–30. PMID 10679350.