CS-BLAST

Last updated
CS-BLAST
Developer(s) Angermueller C, Biegert A, and Soeding J
Stable release
2.2.3 / December 7, 2013 (2013-12-07)
Preview release
1.1 / April 14, 2009;14 years ago (2009-04-14)
Repository
Written in C++
Available inEnglish
Type Bioinformatics tool
License GNU GPL v3
Website http://wwwuser.gwdg.de/~compbiol/data/csblast/releases/, https://github.com/soedinglab/csblast

CS-BLAST [1] [2] [3] (Context-Specific BLAST) is a tool that searches a protein sequence that extends BLAST (Basic Local Alignment Search Tool), [4] using context-specific mutation probabilities. More specifically, CS-BLAST derives context-specific amino-acid similarities on each query sequence from short windows on the query sequences. Using CS-BLAST doubles sensitivity and significantly improves alignment quality without a loss of speed in comparison to BLAST. CSI-BLAST (Context-Specific Iterated BLAST) is the context-specific analog of PSI-BLAST [5] (Position-Specific Iterated BLAST), which computes the mutation profile with substitution probabilities and mixes it with the query profile. CSI-BLAST (Context-Specific Iterated BLAST) is the context specific analog of PSI-BLAST (Position-Specific Iterated BLAST). Both of these programs are available as web-server and are available for free download.

Contents

Background

Homology is the relationship between biological structures or sequences derived from a common ancestor. Homologous proteins (proteins who have common ancestry) are inferred from their sequence similarity. Inferring homologous relationships involves calculating scores of aligned pairs minus penalties for gaps. Aligning pairs of proteins identify regions of similarity indicating a relationship between the two, or more, proteins. In order to have a homologous relationship, the sum of scores over all the aligned pairs of amino acids or nucleotides must be sufficiently high [2]. Standard methods of sequence comparisons use a substitution matrix to accomplish this [4]. Similarities between amino acids or nucleotides are quantified in these substitution matrices. The substitution score () of amino acids and can we written as follows:

where denotes the probability of amino acid mutating into amino acid [2]. In a large set of sequence alignments, counting the number of amino acids as well as the number of aligned pairs will allow you to derive the probabilities and .

Since protein sequences need to maintain a stable structure, a residue’s substitution probabilities are largely determined by the structural context of where it is found. As a result, substitution matrices are trained for structural contexts. Since context information is encoded in transition probabilities between states, mixing mutation probabilities from substitution matrices weighted for corresponding states achieves improved alignment qualities when compared to standard substitution matrices. CS-BLAST improves further upon this concept. The figure illustrates the sequence to sequence and profile to sequence equivalence with the alignment matrix. The query profile results from the artificial mutations in which the bar heights are proportional to the corresponding amino acid probabilities.

(A FIGURE NEEDS TO GO HERE THIS IS THE CAPTION) “Sequence search/alignment algorithms find the path that maximizes the sum of similarity scores (color-coded blue to red). Substitution matrix scores are equivalent to profile scores if the sequence profile (colored histogram) is generated from the query sequence by adding artificial mutations with the substitution matrix pseudocount scheme. Histogram bar heights represent the fraction of amino acids in profile columns”.

Performance

CS-BLAST greatly improves alignment quality over the entire range of sequence identities and especially for difficult alignments in comparison to regular BLAST and PSI-BLAST. PSI-BLAST (Position-Specific Iterated BLAST) runs at about the same speed per iteration as regular BLAST, but is able to detect weaker sequence similarities that are still biologically relevant. Alignment quality is based on alignment sensitivity and alignment precision.

Alignment Quality

Alignment sensitivity is measured by correctly comparing predicted alignments of residue pairs to the total number of possible alignable pairs. This is calculated with the fraction: (pairs correctly aligned)/(pairs structurally alignable)

Alignment precision is measured by the correctness of aligned residue pairs. This is calculated with the fraction: (pairs correctly aligned)/(pairs aligned)

Search Performance

The graph is the benchmark Biegert and Söding used to evaluate homology detection. The benchmark compares CS-BLAST to BLAST using true positives from the same superfamily versus false positive of pairs from different folds. (A GRAPH NEEDS TO GO HERE)

The other graph uses detects true positives (with a different scale than the previous graph) and false positives of PSI-BLAST and CSI-BLAST and compares the two for one to five iterations. (A DIFFERENT GRAPH NEEDS TO GO HERE)

CS-BLAST offers improved sensitivity and alignment quality in sequence comparison. Sequence searches with CS-BLAST are more than twice as sensitive as BLAST. It produces higher quality alignments and generates reliable E-values without a loss of speed. CS-BLAST detects 139% more homologous proteins at a cumulative error rate of 20%. At a 10% error rate, 138% more homologs are detected, and for the easiest cases at a 1% error rate, CS-BLAST was still 96% more effective than BLAST. Additionally, CS-BLAST in 2 iterations is more sensitive than 5 iterations of PSI-BLAST. About 15% more homologs were detected in comparison.

Method

The CS-BLAST method derives similarities between sequence context-specific amino acids for 13 residue windows centered on each residue. CS-BLAST works by generating a sequence profile for a query sequence by using context-specific mutations and then jumpstarting a profile-to-sequence search method.

CS-BLAST starts by predicting the expected mutation probabilities for each position. For a certain residue, a sequence window of ten total surrounding residues is selected as seen in the image. Then, Biegert and Söding compared the sequence window to a library with thousands of context profiles. The library is generated by clustering a representative set of sequence profile windows. The actual predicting of mutation probabilities is achieved by weighted mixing of the central columns of the most similar context profiles. This aligns short profiles that are nonhomologous and ungapped which gives higher weight to better matching profiles, making them easier to detect. A sequence profile represents a multiple alignment of homologous sequences and describes what amino acids are likely to occur at each position in related sequences. With this method substitution matrices are unnecessary. In addition, there is no need for transition probabilities as a result of the fact that context information is encoded within the context profiles. This makes computation simpler and allows for runtime to be scaled linearly instead of quadratically.

The context specific mutation probability, the probability of observing a specific amino acid in a homologous sequence given a context, is calculated by a weighted mixing of the amino acids in the central columns of the most similar context profiles. The image illustrates the calculation of expected mutation probabilities for a specific residue at a certain position. As seen in the image, the library of context profiles all contribute based on similarity to the context specific sequence profile for the query sequence.

Models

In predicting substitution probabilities using only the amino acid’s local sequence context, you gain the advantage of not needing to know the structure of the query protein while still allowing for the detection of more homologous proteins than standard substitution matrices [4]. Bigert and Söding’s approach to predicting substitution probabilities was based on a generative model. In another paper in collaboration with Angermüller, they develop a discriminative machine learning method that improves prediction accuracy [2].

Generative Model

Given an observed variable and a target variable , a generative model defines the probabilities and separately. In order to predict the unobserved target variable, , Bayes’ theorem,

is used. A generative model, as the name suggests, allows one to generate new data points . The joint distribution is described as . To train a generative model, the following equation is used to maximize the joint probability .

Discriminative Model

The discriminative model is a logistic regression maximum entropy classifier. With the discriminative model, the goal is to predict a context specific substitution probability given a query sequence. The discriminative approach for modeling substitution probabilities, where describes a sequence of amino acids around position of a sequence, is based on context states. Context states are characterized by parameters emission weight (), bias weight (), and context weight () [2]. Emission probabilities from a context state are given by the emission weights as follows for to :

where is the emission probability and is the context state. In the discriminative approach, probability for a context state given context is modeled directly by the exponential of an affine function of the context account profile where is the context count profile with a normalization constant normalizes the probability to 1. This equation is as follows where the first summation takes to and the second summation takes to : .

As with the generative model, target distribution is obtained by mixing the emission probabilities of each context state weighted by the similarity.

Using CS-BLAST

The MPI Bioinformatics toolkit in an interactive website and service that allows anyone to do comprehensive and collaborative protein analysis with a variety of different tools including CS-BLAST as well as PSI-BLAST [1]. This tool allows for input of a protein and select options for you to customize your analysis. It also can forward the output to other tools as well.

See also

Related Research Articles

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data.

Grammar theory to model symbol strings originated from work in computational linguistics aiming to understand the structure of natural languages. Probabilistic context free grammars (PCFGs) have been applied in probabilistic modeling of RNA structures almost 40 years after they were introduced in computational linguistics.

<span class="mw-page-title-main">Protein structure prediction</span> Type of biological prediction

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by computational biology; and it is important in medicine and biotechnology.

In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

A Gap penalty is a method of scoring alignments of two or more sequences. When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can. However, minimizing gaps in an alignment is important to create a useful alignment. Too many gaps can cause an alignment to become meaningless. Gap penalties are used to adjust alignment scores based on the number and length of gaps. The five main types of gap penalties are constant, linear, affine, convex, and profile-based.

<span class="mw-page-title-main">Smith–Waterman algorithm</span> Algorithm for determining similar regions between two molecular sequences

The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

<span class="mw-page-title-main">Substitution model</span> Description of the process by which states in sequences change into each other and back

In biology, a substitution model, also called models of DNA sequence evolution, are Markov models that describe changes over evolutionary time. These models describe evolutionary changes in macromolecules represented as sequence of symbols. Substitution models are used to calculate the likelihood of phylogenetic trees using multiple sequence alignment data. Thus, substitution models are central to maximum likelihood estimation of phylogeny as well as Bayesian inference in phylogeny. Estimates of evolutionary distances are typically calculated using substitution models. Substitution models are also central to phylogenetic invariants because they are necessary to predict site pattern frequencies given a tree topology. Substitution models are also necessary to simulate sequence data for a group of organisms related by a specific tree.

<span class="mw-page-title-main">Conserved sequence</span> Similar DNA, RNA or protein sequences within genomes or among species

In evolutionary biology, conserved sequences are identical or similar sequences in nucleic acids or proteins across species, or within a genome, or between donor and receptor taxa. Conservation indicates that a sequence has been maintained by natural selection.

<span class="mw-page-title-main">Point accepted mutation</span>

A point accepted mutation — also known as a PAM — is the replacement of a single amino acid in the primary structure of a protein with another single amino acid, which is accepted by the processes of natural selection. This definition does not include all point mutations in the DNA of an organism. In particular, silent mutations are not point accepted mutations, nor are mutations that are lethal or that are rejected by natural selection in other ways.

<span class="mw-page-title-main">Multiple sequence alignment</span> Alignment of more than two molecular sequences

Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation events such as point mutations that appear as differing characters in a single alignment column, and insertion or deletion mutations that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

<span class="mw-page-title-main">BLOSUM</span> Bioinformatics tool

In bioinformatics, the BLOSUM matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. BLOSUM matrices were first introduced in a paper by Steven Henikoff and Jorja Henikoff. They scanned the BLOCKS database for very conserved regions of protein families and then counted the relative frequencies of amino acids and their substitution probabilities. Then, they calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM Matrices.

ProbCons is an open source probabilistic consistency-based multiple alignment of amino acid sequences. It is one of the most efficient protein multiple sequence alignment programs, since it has repeatedly demonstrated a statistically significant advantage in accuracy over similar tools, including Clustal and MAFFT.

<span class="mw-page-title-main">Homology modeling</span> Method of protein structure prediction using other known proteins

Homology modeling, also known as comparative modeling of protein, refers to constructing an atomic-resolution model of the "target" protein from its amino acid sequence and an experimental three-dimensional structure of a related homologous protein. Homology modeling relies on the identification of one or more known protein structures likely to resemble the structure of the query sequence, and on the production of an alignment that maps residues in the query sequence to residues in the template sequence. It has been seen that protein structures are more conserved than protein sequences amongst homologues, but sequences falling below a 20% sequence identity can have very different structure.

<span class="mw-page-title-main">HMMER</span> Software package for sequence analysis

HMMER is a free and commonly used software package for sequence analysis written by Sean Eddy. Its general usage is to identify homologous protein or nucleotide sequences, and to perform sequence alignments. It detects homology by comparing a profile-HMM to either a single sequence or a database of sequences. Sequences that score significantly better to the profile-HMM compared to a null model are considered to be homologous to the sequences that were used to construct the profile-HMM. Profile-HMMs are constructed from a multiple sequence alignment in the HMMER package using the hmmbuild program. The profile-HMM implementation used in the HMMER software was based on the work of Krogh and colleagues. HMMER is a console utility ported to every major operating system, including different versions of Linux, Windows, and macOS.

Phyre and Phyre2 are free web-based services for protein structure prediction. Phyre is among the most popular methods for protein structure prediction having been cited over 1500 times. Like other remote homology recognition techniques, it is able to regularly generate reliable protein models when other widely used methods such as PSI-BLAST cannot. Phyre2 has been designed to ensure a user-friendly interface for users inexpert in protein structure prediction methods. Its development is funded by the Biotechnology and Biological Sciences Research Council.

The HH-suite is an open-source software package for sensitive protein sequence searching. It contains programs that can search for similar protein sequences in protein sequence databases. Sequence searches are a standard tool in modern biology with which the function of unknown proteins can be inferred from the functions of proteins with similar sequences. HHsearch and HHblits are two main programs in the package and the entry point to its search function, the latter being a faster iteration. HHpred is an online server for protein structure prediction that uses homology information from HH-suite.

PredictProtein (PP) is an automatic service that searches up-to-date public sequence databases, creates alignments, and predicts aspects of protein structure and function. Users send a protein sequence and receive a single file with results from database comparisons and prediction methods. PP went online in 1992 at the European Molecular Biology Laboratory; since 1999 it has operated from Columbia University and in 2009 it moved to the Technische Universität München. Although many servers have implemented particular aspects, PP remains the most widely used public server for structure prediction: over 1.5 million requests from users in 104 countries have been handled; over 13000 users submitted 10 or more different queries. PP web pages are mirrored in 17 countries on 4 continents. The system is optimized to meet the demands of experimentalists not experienced in bioinformatics. This implied that we focused on incorporating only high-quality methods, and tried to collate results omitting less reliable or less important ones.

Direct coupling analysis or DCA is an umbrella term comprising several methods for analyzing sequence data in computational biology. The common idea of these methods is to use statistical modeling to quantify the strength of the direct relationship between two positions of a biological sequence, excluding effects from other positions. This contrasts usual measures of correlation, which can be large even if there is no direct relationship between the positions. Such a direct relationship can for example be the evolutionary pressure for two positions to maintain mutual compatibility in the biomolecular structure of the sequence, leading to molecular coevolution between the two positions.

References

  1. Angermüller, C.; Biegert, A.; Söding, J. (Dec 2012). "Discriminative modelling of context-specific amino acid substitution probabilities". Bioinformatics. 28 (24): 3240–7. doi: 10.1093/bioinformatics/bts622 . hdl: 11858/00-001M-0000-0015-8D22-F . PMID   23080114.
  2. Biegert, A.; Söding, J. (Mar 2009). "Sequence context-specific profiles for homology searching" (PDF). Proc Natl Acad Sci U S A. 106 (10): 3770–5. Bibcode:2009PNAS..106.3770B. doi: 10.1073/pnas.0810767106 . PMC   2645910 . PMID   19234132.
  3. "Better Sequence Searches Of Genes And Proteins Devised". ScienceDaily. Mar 7, 2009. Retrieved 2009-08-14.
  4. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). "Basic local alignment search tool". J Mol Biol. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID   2231712.
  5. Altschul SF; Madden TL; Schäffer AA; Zhang J; Zhang Z; Miller W; Lipman DJ. (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs". Nucleic Acids Res. 25 (17): 3389–3402. doi:10.1093/nar/25.17.3389. PMC   146917 . PMID   9254694.

[1] Alva, Vikram, Seung-Zin Nam, Johannes Söding, and Andrei N. Lupas. “The MPI Bioinformatics Toolkit as an Integrative Platform for Advanced Protein Sequence and Structure Analysis.” Nucleic Acids Research 44.Web server Issue (2016): W410-415. NCBI. Web. 2 Nov. 2016.

[2] Angermüller, Christof, Andreas Biegert, and Johannes Söding. “Discriminative Modelling of Context-specific Amino Acid Substitution Properties” BIOINFORMATICS 28.24 (2012): 3240-247. Oxford Journals. Web. 2 Nov. 2016.

[3] Astschul, Stephen F., et al. “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs.” Nucleic Acids Research 25.17 (1997): 3389-402. Oxford University Press. Print

[4] Bigert, A., and J. Söding. “Sequence Context-specific Profiles for Homology Searching.” Proceedings of the National Academy of Sciences 106.10 (2009): 3770-3775. PNAS. Web. 23 Oct. 2016.