Similarity Matrix of Proteins

Similarity Matrix of Proteins (SIMAP) is a database of protein similarities created using volunteer computing.[1][2] It is freely accessible for scientific purposes. SIMAP uses the FASTA algorithm to precalculate protein similarity, while another application uses hidden Markov models to search for protein domains. SIMAP is a joint project of the Technical University of Munich, the Helmholtz Zentrum München, and the University of Vienna.

Project

The project usually received new work units at the beginning of each month. From 2010, the inclusion of environmental sequences in the database required longer periods of activity, for example several months of continuous work. These updates typically occurred twice a year.[citation needed]

In the fourth quarter of 2010, the project relocated to the University of Vienna because of failing electrical infrastructure at the Technical University of Munich. As part of the move, a project-specific URL was created, which required existing volunteers and users to detach from and reattach to the project.

On May 30, 2014, project administrators announced that, after a 10-year history, SIMAP would leave BOINC by the end of 2014. SIMAP research would nevertheless continue on local hardware consisting of "ordinary multi-core CPUs (some hundreds), crunching a SSE-optimized version of the Smith-Waterman algorithm."

Computing platform

SIMAP used the Berkeley Open Infrastructure for Network Computing (BOINC) distributed computing platform.

Application performance notes

Work unit CPU times varied widely, from 15 minutes to 3 hours. Work units ranged in size from 1.5 to 2.2 MB, averaging around 2 MB. SIMAP provided client software optimized for SSE-enabled and x86-64 processors. For older processors, non-SSE applications were provided but required manual installation. SIMAP supported Linux, Windows, Mac OS, Android, and other UNIX platforms. Because the database was at times complete, with all publicly known protein sequences and metagenomes already precalculated, the available work consisted of newly published protein sequences and metagenomes that still needed to be precomputed for SIMAP.

Related Research Articles

Bioinformatics: Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combines biology, chemistry, physics, computer science, information engineering, mathematics and statistics to analyze and interpret the biological data. Bioinformatics has been used for in silico analyses of biological queries using computational and statistical techniques.

Sequence alignment: Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data.

Computer science is the study of the theoretical foundations of information and computation and their implementation and application in computer systems. One well known subject classification system for computer science is the ACM Computing Classification System devised by the Association for Computing Machinery.

SETI@home: BOINC-based volunteer computing project searching for signs of extraterrestrial intelligence

SETI@home is a project of the Berkeley SETI Research Center to analyze radio signals, searching for signs of extraterrestrial intelligence. Until March 2020, it was run as an Internet-based public volunteer computing project that employed the BOINC software platform. It is hosted by the Space Sciences Laboratory at the University of California, Berkeley, and is one of many activities undertaken as part of the worldwide SETI effort.

In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences.
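The log-odds form mentioned above can be illustrated with a small sketch. It assumes the common definition in which the score for a pair of residues is the logarithm of the ratio between the observed frequency of the aligned pair and the frequency expected by chance from the background frequencies. The function name, the example frequencies, and the half-bit scaling with rounding are illustrative assumptions, not values taken from any published matrix.

```python
import math


def log_odds_score(p_ab, q_a, q_b, scale=2.0):
    """Return a rounded log-odds substitution score.

    p_ab  -- observed frequency of residues a and b being aligned (assumed input)
    q_a   -- background frequency of residue a
    q_b   -- background frequency of residue b
    scale -- multiplier applied to the base-2 logarithm; 2.0 gives
             half-bit units (an assumption made for this sketch)
    """
    if p_ab <= 0 or q_a <= 0 or q_b <= 0:
        raise ValueError("frequencies must be positive")
    # Ratio > 1: the pair is aligned more often than chance predicts.
    # Ratio < 1: the pair is aligned less often than chance predicts.
    odds = p_ab / (q_a * q_b)
    return round(scale * math.log2(odds))


# Hypothetical frequencies, chosen only to show the sign of the score.
favoured = log_odds_score(0.04, 0.1, 0.1)      # pair seen 4x more often than chance
disfavoured = log_odds_score(0.005, 0.1, 0.1)  # pair seen half as often as chance
```

A positive score marks a substitution that is observed more often than expected by chance, and a negative score marks one that is observed less often.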

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

In the field of bioinformatics, a sequence database is a type of biological database that is composed of a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other polymer sequences stored on a computer. The UniProt database is an example of a protein sequence database. As of 2013 it contained over 40 million sequences and is growing at an exponential rate. Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.

Structural alignment: Aligning molecular sequences using sequence and structural information

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

In mathematics, computer science and especially graph theory, a distance matrix is a square matrix containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the distance being used to define this matrix may or may not be a metric. If there are N elements, this matrix will have size N×N. In graph-theoretic applications the elements are more often referred to as points, nodes or vertices.
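As a minimal sketch of this definition, the following builds the N×N matrix of pairwise distances for a small set of elements. The helper names and the choice of Hamming distance on equal-length strings are assumptions made for illustration; any pairwise distance function could be passed in instead.

```python
def hamming(x, y):
    """Count the positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(c1 != c2 for c1, c2 in zip(x, y))


def distance_matrix(items, dist):
    """Return the square matrix of pairwise distances between items.

    Entry [i][j] is dist(items[i], items[j]), so N items give an
    N x N matrix.
    """
    n = len(items)
    return [[dist(items[i], items[j]) for j in range(n)] for i in range(n)]


# Three hypothetical short sequences, used only as example input.
sequences = ["ACGT", "ACGA", "TCGA"]
matrix = distance_matrix(sequences, hamming)
```

With a symmetric distance such as this one, the matrix is symmetric and has zeros on its diagonal, so only the entries above the diagonal need to be computed in practice.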

Predictor@home: BOINC-based volunteer computing project to predict protein structure

Predictor@home was a volunteer computing project that used BOINC software to predict protein structure from protein sequence in the context of the 6th biennial CASP, or Critical Assessment of Techniques for Protein Structure Prediction. A major goal of the project was the testing and evaluation of new algorithms to predict both known and unknown protein structures.

Smith–Waterman algorithm: Algorithm for determining similar regions between two molecular sequences

The Smith–Waterman algorithm performs local sequence alignment; that is, it determines similar regions between two nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.
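The local alignment idea can be sketched with a short dynamic-programming routine. This is a simplified illustration, not the SSE-optimized implementation referred to earlier in the article: it assumes a fixed match/mismatch score in place of a substitution matrix and a linear gap penalty, and the function name and parameter values are chosen only for this example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring is simplified for illustration: a fixed match/mismatch score
    and a linear gap penalty (all three values are assumptions).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] is the best score of a local alignment that ends
    # at a[i-1] and b[j-1]; row 0 and column 0 stay at zero.
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + pair  # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap             # gap in b
            left = score[i][j - 1] + gap           # gap in a
            # Clamping at zero lets an alignment restart anywhere,
            # which is what makes the alignment local.
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


# Two hypothetical sequences sharing the segment "ACG".
example = smith_waterman_score("TTACGTTT", "GGACGGG")
```

The score reflects only the best-matching segment, so unrelated flanking regions do not lower it. Recovering the aligned segments themselves would additionally require a traceback from the highest-scoring cell, which is omitted here.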

BOINC Credit System: Tracking of CPU time donated to BOINC projects

Within the BOINC platform for volunteer computing, the BOINC Credit System helps volunteers keep track of how much CPU time they have donated to various projects. This ensures users are returning accurate results for both scientific and statistical reasons.

Rosetta@home: BOINC-based volunteer computing project researching protein folding

Rosetta@home is a volunteer computing project researching protein structure prediction on the Berkeley Open Infrastructure for Network Computing (BOINC) platform, run by the Baker laboratory at the University of Washington. Rosetta@home aims to predict protein–protein docking and design new proteins with the help of about fifty-five thousand active volunteered computers processing at over 487,946 GigaFLOPS on average as of September 19, 2020. Foldit, a Rosetta@home videogame, aims to reach these goals with a crowdsourcing approach. Though much of the project is oriented toward basic research to improve the accuracy and robustness of proteomics methods, Rosetta@home also does applied research on malaria, Alzheimer's disease, and other pathologies.

Multiple sequence alignment: Alignment of more than two molecular sequences

Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation events such as point mutations that appear as differing characters in a single alignment column, and insertion or deletion mutations that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

BOINC client–server technology: BOINC volunteer computing client–server structure

BOINC client–server technology refers to the model under which BOINC works. The BOINC framework consists of two layers which operate under the client–server architecture. Once the BOINC software is installed in a machine, the server starts sending tasks to the client. The operations are performed client-side and the results are uploaded to the server-side.

Blast2GO

Blast2GO, first published in 2005, is a bioinformatics software tool for the automatic, high-throughput functional annotation of novel sequence data. It uses the BLAST algorithm to identify similar sequences and then transfers existing functional annotation from already characterised sequences to the novel one. The functional information is represented via the Gene Ontology (GO), a controlled vocabulary of functional attributes. The Gene Ontology is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species.

OProject@Home: BOINC-based volunteer computing project

OProject@Home was a volunteer computing project running on the Berkeley Open Infrastructure for Network Computing (BOINC) and was based on a dedicated library OLib. The project was directed by Lukasz Swierczewski, an IT student at the College of Computer Science and Business Administration in Łomża, Computer Science and Automation Institute. As of 2016 it seems to have been abandoned.

Charity Engine is a free PC app based on Berkeley University's BOINC software, run by The Worldwide Computer Company Limited. The project works by selling spare home computing power to universities and corporations, then sharing the profits between eight partner charities and periodic cash prize draws for the users, that is, those running the Charity Engine BOINC software on their home computers. When no corporations are purchasing the computing power, Charity Engine donates it to existing volunteer computing projects such as Rosetta@home, Einstein@Home, and Malaria Control, and prize draws are funded by donations.

MG-RAST is an open-source web application server that offers automatic phylogenetic and functional analysis of metagenomes. It is also one of the biggest repositories for metagenomic data. The name is an abbreviation of Metagenomic Rapid Annotations using Subsystems Technology. The pipeline automatically produces functional assignments to the sequences that belong to the metagenome by performing sequence comparisons to databases at both the nucleotide and amino-acid levels. The applications supply phylogenetic and functional assignments of the metagenome being analysed, as well as tools for comparing different metagenomes. It also provides a RESTful API for programmatic access.

ProBiS is computer software that allows prediction of binding sites and their corresponding ligands for a given protein structure. ProBiS was initially developed as the ProBiS algorithm by Janez Konc and Dušanka Janežič in 2010 and is now available as ProBiS server, ProBiS CHARMMing server, ProBiS algorithm and ProBiS plugin. The name ProBiS originates from the purpose of the software itself, that is, to predict for a given Protein structure Binding Sites and their corresponding ligands.

References

  1. Arnold, R.; Rattei, T.; Tischler, P.; Truong, M.-D.; Stümpflen, V.; Mewes, H. W. (2005). "SIMAP--The similarity matrix of proteins". Bioinformatics. 21 (Suppl 2): ii42–ii46. doi:10.1093/bioinformatics/bti1107. ISSN 1367-4803. PMID 16204123.
  2. Rattei, T.; Arnold, R.; Tischler, P.; Lindner, D.; Stümpflen, V.; Mewes, H. W. (2006). "SIMAP: the similarity matrix of proteins". Nucleic Acids Research. 34 (90001): D252–D256. doi:10.1093/nar/gkj106. ISSN 0305-1048. PMC 1347468. PMID 16381858.