PredictProtein

PredictProtein
Original author(s) Burkhard Rost
Developer(s) Guy Yachdav Laszlo Kajan
Initial release 1992
Stable release 1.0.88
Operating system UNIX-based
Type Bioinformatics
License GPLv2
Website www.predictprotein.org

PredictProtein (PP) is an automatic service that searches up-to-date public sequence databases, creates alignments, and predicts aspects of protein structure and function. Users submit a protein sequence and receive a single file with results from database comparisons and prediction methods. PP went online in 1992 at the European Molecular Biology Laboratory; from 1999 it operated from Columbia University, and in 2009 it moved to the Technische Universität München. Although many servers have implemented particular aspects, PP remains the most widely used public server for structure prediction: it has handled over 1.5 million requests from users in 104 countries, and over 13,000 users have submitted 10 or more different queries. PP web pages are mirrored in 17 countries on 4 continents. The system is optimized to meet the demands of experimentalists who are not experienced in bioinformatics. To that end, the developers focused on incorporating only high-quality methods and on collating results while omitting the less reliable or less important ones.

Attempt to simplify output by incorporating a hierarchy of thresholds

The attempt to ‘pre-digest’ as much information as possible, so that results are easier to interpret, is a defining feature of PP. For example, by default PP returns only those proteins found in the database that are very likely to have a structure similar to that of the query protein. [1] Particular predictions, such as those for membrane helices, coiled-coil regions, signal peptides and nuclear localization signals, are not returned if they fall below given probability thresholds.
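The thresholding idea described above can be illustrated with a minimal sketch. It is not PredictProtein's code: the `Prediction` record, the `filter_predictions` function and every cutoff value in `THRESHOLDS` are hypothetical, chosen only to show how per-feature probability cutoffs suppress weak predictions before they reach the user.

```python
from dataclasses import dataclass

# Hypothetical cutoffs for illustration; the real thresholds used by
# PredictProtein are not stated in the text.
THRESHOLDS = {
    "membrane_helix": 0.8,
    "coiled_coil": 0.9,
    "signal_peptide": 0.7,
    "nuclear_localization_signal": 0.95,
}


@dataclass(frozen=True)
class Prediction:
    kind: str           # feature type, e.g. "coiled_coil"
    start: int          # first residue of the predicted region
    end: int            # last residue of the predicted region
    probability: float  # confidence reported by the method


def filter_predictions(predictions, thresholds=THRESHOLDS):
    """Keep predictions whose probability meets the cutoff for their kind.

    Kinds without a configured cutoff are passed through unchanged.
    """
    kept = []
    for p in predictions:
        cutoff = thresholds.get(p.kind)
        if cutoff is None or p.probability >= cutoff:
            kept.append(p)
    return kept


if __name__ == "__main__":
    raw = [
        Prediction("membrane_helix", 12, 30, 0.93),
        Prediction("coiled_coil", 45, 80, 0.40),
    ]
    for p in filter_predictions(raw):
        print(p)
```

In this sketch the weak coiled-coil prediction is dropped while the confident membrane helix is kept, which mirrors the behaviour the paragraph describes.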

Each request triggers the application of over 20 different methods

Users receive a single output file with the following results.

Database searches: similar sequences are reported and aligned by a standard, pairwise BLAST, [2] and by an iterated PSI-BLAST search. [3] Although the pairwise BLAST searches are identical to those obtainable from the NCBI site, the iterated PSI-BLAST is performed on a carefully filtered database to avoid accumulating false positives during the iteration. [4] [5] A standard search for functional motifs is run against the PROSITE database. [6] PP now also identifies putative boundaries for structural domains through the CHOP procedure.

Structure prediction methods: secondary structure, solvent accessibility and membrane helices are predicted by the PHD and PROF programs, [7] [8] membrane strands by PROFtmb, [9] coiled-coil regions by COILS, [10] and inter-residue contacts by PROFcon. [11] Low-complexity regions are marked by SEG, [12] and long regions with no regular secondary structure are identified by NORSp. [13] [14] The PHD/PROF programs are only available through PP. The particular way in which PP automatically iterates PSI-BLAST searches, and the way in which it decides what to include in sequence families, are also unique to PP.

Function prediction: the aspects of function that are currently embedded explicitly in PP are all related to sub-cellular localization. Nuclear localization signals are detected through PredictNLS, [15] [16] localization independent of targeting signals is predicted through LOCnet, [17] and homology to proteins involved in cell-cycle control is annotated. [18]
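As a rough sketch of how many methods can feed one output file, the catalogue below groups the method names from the paragraph above and joins whatever results are available into a single text report. The grouping, the `build_report` function and the `runners` mapping are assumptions made for illustration; the real PredictProtein pipeline and its output format are not described in the text.

```python
# Method names are taken from the text; the grouping is an
# illustrative reading of it, not an official classification.
METHODS = {
    "Database searches": ["BLAST", "PSI-BLAST", "PROSITE", "CHOP"],
    "Structure prediction": [
        "PHD", "PROF", "PROFtmb", "COILS", "PROFcon", "SEG", "NORSp",
    ],
    "Function prediction": ["PredictNLS", "LOCnet"],
}


def build_report(sequence, runners):
    """Run every method that has a runner and join the results.

    `runners` is a hypothetical mapping from method name to a callable
    that takes the sequence and returns a short text result.
    """
    lines = [f"Query length: {len(sequence)}"]
    for category, names in METHODS.items():
        lines.append(f"== {category} ==")
        for name in names:
            runner = runners.get(name)
            if runner is None:
                continue  # method not available in this sketch
            lines.append(f"{name}: {runner(sequence)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in runners; real methods would call external programs.
    demo = {"SEG": lambda s: "no low-complexity regions"}
    print(build_report("MKTAYIAKQR", demo))
```

Keeping the catalogue separate from the runners means a method can be added or withdrawn without changing how the report is assembled.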

Availability

Web Service

The PredictProtein web service is available at www.predictprotein.org. Users submit an amino acid sequence and receive a set of automatic annotations for it. The service is backed by a database of pre-calculated results, which shortens response times.
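A database of pre-calculated results can be thought of as a cache keyed by the submitted sequence. The sketch below shows that idea under stated assumptions: `sequence_key`, `ResultCache` and the use of a SHA-256 digest of the normalized sequence are all hypothetical choices, not a description of how the PredictProtein service stores its results.

```python
import hashlib


def sequence_key(sequence: str) -> str:
    """Return a stable key for a sequence, ignoring whitespace and case."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


class ResultCache:
    """In-memory stand-in for a store of pre-calculated annotations."""

    def __init__(self, compute):
        self._compute = compute  # callable that produces annotations
        self._store = {}
        self.misses = 0          # number of times compute was called

    def annotate(self, sequence: str):
        key = sequence_key(sequence)
        if key not in self._store:
            self.misses += 1
            normalized = "".join(sequence.split()).upper()
            self._store[key] = self._compute(normalized)
        return self._store[key]


if __name__ == "__main__":
    cache = ResultCache(lambda seq: {"length": len(seq)})
    print(cache.annotate("mkt ay\nIAK"))
    print(cache.annotate("MKTAYIAK"))  # served from the cache
    print("computations:", cache.misses)
```

Normalizing before hashing lets differently formatted submissions of the same sequence share one stored result.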

Cloud Solution

The PredictProtein cloud solution builds upon the open source operating system Debian, [19] and provides its functionality as a set of free [20] Debian software packages. Bio-Linux is an operating system for bioinformatics and computational biology; its release 7 provides more than 500 bioinformatics programs on an Ubuntu Linux base. [21] Ubuntu is a Debian derivative, that is, an operating system based on Debian with its own additions. Cloud BioLinux is a comprehensive cloud solution derived from Bio-Linux and Ubuntu. Debian derivatives can easily share packages with one another. For example, Debian packages are automatically incorporated in Ubuntu, [22] and are also usable in Cloud BioLinux (the procedure is described in [23]).

References

  1. Rost, B. (1999). "Twilight zone of protein sequence alignments". Protein Engineering. 12 (2): 85–94. doi:10.1093/protein/12.2.85. PMID 10195279.
  2. Altschul, S.F. and Gish, W. (1996) Local alignment statistics. Methods Enzymol., 266, 460–480.
  3. Altschul, S., Madden, T., Shaffer, A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. (1997) Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
  4. Przybylski, D. and Rost, B. (2002) Alignments grow, secondary structure prediction improves. Proteins, 46, 195–205.
  5. Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.
  6. Hofmann, K., Bucher, P., Falquet, L. and Bairoch, A. (1999) The PROSITE database, its status in 1999. Nucleic Acids Res., 27, 215–219.
  7. Rost, B. (1996) PHD: predicting one-dimensional protein structure by profile based neural networks. Methods Enzymol., 266, 525–539.
  8. Rost, B. (2001) Protein secondary structure prediction continues to rise. J. Struct. Biol., 134, 204–218.
  9. Bigelow, H.; Rost, B. (2006). "PROFtmb: A web server for predicting bacterial transmembrane beta barrel proteins". Nucleic Acids Research. 34 (Web Server issue): W186–W188. doi:10.1093/nar/gkl262. PMC 1538807. PMID 16844988.
  10. Lupas, A., Van Dyke, M. and Stock, J. (1991) Predicting coiled coils from protein sequences. Science, 252, 1162–1164.
  11. Punta, M.; Rost, B. (2005). "PROFcon: Novel prediction of long-range contacts". Bioinformatics. 21 (13): 2960–2968. doi:10.1093/bioinformatics/bti454. PMID 15890748.
  12. Wootton, J.C. and Federhen, S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266, 554–571.
  13. Liu, J., Tan, H. and Rost, B. (2002) Loopy proteins appear conserved in evolution. J. Mol. Biol., 322, 53–64.
  14. Liu, J. and Rost, B. (2003) NORSp: predictions of long regions without regular secondary structure. Nucleic Acids Res., 31, 3833–3835.
  15. Cokol, M., Nair, R. and Rost, B. (2000) Finding nuclear localisation signals. EMBO Rep., 1, 411–415.
  16. Nair, R., Carter, P. and Rost, B. (2003) NLSdb: database of nuclear localization signals. Nucleic Acids Res., 31, 397–399.
  17. Nair, R. and Rost, B. (2003) Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins, 53, 917–930.
  18. Wrzeszczynski, K.O. and Rost, B. (2004) Cataloguing proteins in cell cycle control. Methods Mol. Biol., 241, 219–233.
  19. Amor, J.J., et al. From pigs to stripes: A travel through debian. In Proceedings of the DebConf5 (Debian Annual Developers Meeting). 2005. Citeseer.
  20. The Debian Free Software Guidelines (DFSG). Available from: http://www.debian.org/social_contract#guidelines
  21. Dawn Field, B.T., Tim Booth, Stewart Houten, Dan Swan, Nicolas Bertrand, Milo Thurston. Bio-Linux 7. 2012. Available from: http://nebc.nerc.ac.uk/tools/bio-linux/bio-linux-7-info
  22. NEW packages through Debian. Available from: https://wiki.ubuntu.com/UbuntuDevelopment/NewPackages#NEW_packages_through_Debian
  23. Krampis, K., et al. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics, 2012. 13: p. 42.