Substitution matrix

Last updated

In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences. [1]

Contents

Background

In the process of evolution, from one generation to the next the amino acid sequences of an organism's proteins are gradually altered through the action of DNA mutations. For example, the sequence

ALEIRYLRD

could mutate into the sequence

ALEINYLRD

in one step, and possibly

AQEINYQRD

over a longer period of evolutionary time. Each amino acid is more or less likely to mutate into various other amino acids. For instance, a hydrophilic residue such as arginine is more likely to be replaced by another hydrophilic residue such as glutamine, than it is to be mutated into a hydrophobic residue such as leucine. (Here, a residue refers to an amino acid stripped of a hydrogen and/or a hydroxyl group and inserted in the polymeric chain of a protein.) This is primarily due to redundancy in the genetic code, which translates similar codons into similar amino acids. Furthermore, mutating an amino acid to a residue with significantly different properties could affect the folding and/or activity of the protein. This type of disruptive substitution is likely to be removed from populations by the action of purifying selection because the substitution has a higher likelihood of rendering a protein nonfunctional. [2]

If we have two amino acid sequences in front of us, we should be able to say something about how likely they are to be derived from a common ancestor, or homologous. If we can line up the two sequences using a sequence alignment algorithm such that the mutations required to transform a hypothetical ancestor sequence into both of the current sequences would be evolutionarily plausible, then we'd like to assign a high score to the comparison of the sequences.

To this end, we will construct a 20x20 matrix where the th entry is equal to the probability of the th amino acid being transformed into the th amino acid in a certain amount of evolutionary time. There are many different ways to construct such a matrix, called a substitution matrix. Here are the most commonly used ones:

Identity matrix

The simplest possible substitution matrix would be one in which each amino acid is considered maximally similar to itself, but not able to transform into any other amino acid. This matrix would look like

This identity matrix will succeed in the alignment of very similar amino acid sequences but will be miserable at aligning two distantly related sequences. We need to figure out all the probabilities in a more rigorous fashion. It turns out that an empirical examination of previously aligned sequences works best.

Log-odds matrices

We express the probabilities of transformation in what are called log-odds scores. The scores matrix S is defined as

where is the probability that amino acid transforms into amino acid , and , are the frequencies of amino acids i and j. The base of the logarithm is not important, and the same substitution matrix is often expressed in different bases.

Example matrices

PAM

One of the first amino acid substitution matrices, the PAM (Point Accepted Mutation) matrix was developed by Margaret Dayhoff in the 1970s. This matrix is calculated by observing the differences in closely related proteins. Because the use of very closely related homologs, the observed mutations are not expected to significantly change the common functions of the proteins. Thus the observed substitutions (by point mutations) are considered to be accepted by natural selection.

One PAM unit is defined as 1% of the amino acid positions that have been changed. To create a PAM1 substitution matrix, a group of very closely related sequences with mutation frequencies corresponding to one PAM unit is chosen. Based on collected mutational data from this group of sequences, a substitution matrix can be derived. This PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed.

The PAM1 matrix is used as the basis for calculating other matrices by assuming that repeated mutations would follow the same pattern as those in the PAM1 matrix, and multiple substitutions can occur at the same site. With this assumption, the PAM2 matrix can estimated by squaring the probabilities. Using this logic, Dayhoff derived matrices as high as PAM250. Usually the PAM 30 and the PAM70 are used.

BLOSUM

Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences. Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales. The BLOSUM (BLOck SUbstitution Matrix) series of matrices rectifies this problem. Henikoff & Henikoff constructed these matrices using multiple alignments of evolutionarily divergent proteins. The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments. These conserved sequences are assumed to be of functional importance within related proteins and will therefore have lower substitution rates than less conserved regions. To reduce bias from closely related sequences on substitution rates, segments in a block with a sequence identity above a certain threshold were clustered, reducing the weight of each such cluster (Henikoff and Henikoff). For the BLOSUM62 matrix, this threshold was set at 62%. Pairs frequencies were then counted between clusters, hence pairs were only counted between segments less than 62% identical. One would use a higher numbered BLOSUM matrix for aligning two closely related sequences and a lower number for more divergent sequences.

It turns out that the BLOSUM62 matrix does an excellent job detecting similarities in distant sequences, and this is the matrix used by default in most recent alignment applications such as BLAST.

Differences between PAM and BLOSUM

  1. PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the branches of a phylogenetic tree: maximum parismony), whereas the BLOSUM matrices are based on an implicit model of evolution.
  2. The PAM matrices are based on mutations observed throughout a global alignment, this includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps.
  3. The method used to count the replacements is different: unlike the PAM matrix, the BLOSUM procedure uses groups of sequences within which not all mutations are counted the same.
  4. Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100; BLOSUM62 is used for closer sequences than BLOSUM50.

Newer matrices

A number of newer substitution matrices have been proposed to deal with inadequacies in earlier designs.

Specialized substitution matrices and their extensions

The real substitution rates in a protein depends not only on the identity of the amino acid, but also on the specific structural or sequence context it is in. Many specialized matrices have been developed for these contexts, such as in transmembrane alpha helices, [4] for combinations of secondary structure states and solvent accessibility states, [5] [6] [7] or for local sequence-structure contexts. [8] These context-specific substitution matrices lead to generally improved alignment quality at some cost of speed but are not yet widely used.

Recently, sequence context-specific amino acid similarities have been derived that do not need substitution matrices but that rely on a library of sequence contexts instead. Using this idea, a context-specific extension of the popular BLAST program has been demonstrated to achieve a twofold sensitivity improvement for remotely related sequences over BLAST at similar speeds (CS-BLAST).

Terminology

Although "transition matrix" is often used interchangeably with "substitution matrix" in fields other than bioinformatics, the former term is problematic in bioinformatics. With regards to nucleotide substitutions, "transition" is also used to indicate those substitutions that are between the two-ring purines (A  G and G  A) or are between the one-ring pyrimidines (C  T and T  C). Because these substitutions do not require a change in the number of rings, they occur more frequently than the other substitutions. "Transversion" is the term used to indicate the slower-rate substitutions that change a purine to a pyrimidine or vice versa (A  C, A  T, G  C, and G  T).

See also

Related Research Articles

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

Grammar theory to model symbol strings originated from work in computational linguistics aiming to understand the structure of natural languages. Probabilistic context free grammars (PCFGs) have been applied in probabilistic modeling of RNA structures almost 40 years after they were introduced in computational linguistics.

<span class="mw-page-title-main">Protein structure prediction</span> Type of biological prediction

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by computational biology; and it is important in medicine and biotechnology.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such measures are in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. Though, in more broad terms, a similarity function may also satisfy metric axioms.

A Gap penalty is a method of scoring alignments of two or more sequences. When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can. However, minimizing gaps in an alignment is important to create a useful alignment. Too many gaps can cause an alignment to become meaningless. Gap penalties are used to adjust alignment scores based on the number and length of gaps. The five main types of gap penalties are constant, linear, affine, convex, and profile-based.

<span class="mw-page-title-main">Sequence logo</span>

In bioinformatics, a sequence logo is a graphical representation of the sequence conservation of nucleotides or amino acids . A sequence logo is created from a collection of aligned sequences and depicts the consensus sequence and diversity of the sequences. Sequence logos are frequently used to depict sequence characteristics such as protein-binding sites in DNA or functional units in proteins.

<span class="mw-page-title-main">Smith–Waterman algorithm</span> Algorithm for determining similar regions between two molecular sequences

The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

<span class="mw-page-title-main">Margaret Oakley Dayhoff</span> American biochemist

Margaret Belle (Oakley) Dayhoff was an American physical chemist and a pioneer in the field of bioinformatics. Dayhoff was a professor at Georgetown University Medical Center and a noted research biochemist at the National Biomedical Research Foundation, where she pioneered the application of mathematics and computational methods to the field of biochemistry. She dedicated her career to applying the evolving computational technologies to support advances in biology and medicine, most notably the creation of protein and nucleic acid databases and tools to interrogate the databases. She originated one of the first substitution matrices, point accepted mutations (PAM). The one-letter code used for amino acids was developed by her, reflecting an attempt to reduce the size of the data files used to describe amino acid sequences in an era of punch-card computing.

<span class="mw-page-title-main">Substitution model</span> Description of the process by which states in sequences change into each other and back

In biology, a substitution model, also called models of DNA sequence evolution, are Markov models that describe changes over evolutionary time. These models describe evolutionary changes in macromolecules represented as sequence of symbols. Substitution models are used to calculate the likelihood of phylogenetic trees using multiple sequence alignment data. Thus, substitution models are central to maximum likelihood estimation of phylogeny as well as Bayesian inference in phylogeny. Estimates of evolutionary distances are typically calculated using substitution models. Substitution models are also central to phylogenetic invariants because they are necessary to predict site pattern frequencies given a tree topology. Substitution models are also necessary to simulate sequence data for a group of organisms related by a specific tree.

<span class="mw-page-title-main">Point accepted mutation</span>

A point accepted mutation — also known as a PAM — is the replacement of a single amino acid in the primary structure of a protein with another single amino acid, which is accepted by the processes of natural selection. This definition does not include all point mutations in the DNA of an organism. In particular, silent mutations are not point accepted mutations, nor are mutations that are lethal or that are rejected by natural selection in other ways.

A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences.

<span class="mw-page-title-main">Multiple sequence alignment</span> Alignment of more than two molecular sequences

Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation events such as point mutations that appear as differing characters in a single alignment column, and insertion or deletion mutations that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

<span class="mw-page-title-main">BLOSUM</span> Bioinformatics tool

In bioinformatics, the BLOSUM matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. BLOSUM matrices were first introduced in a paper by Steven Henikoff and Jorja Henikoff. They scanned the BLOCKS database for very conserved regions of protein families and then counted the relative frequencies of amino acids and their substitution probabilities. Then, they calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM Matrices.

<span class="mw-page-title-main">Dot plot (bioinformatics)</span>

In bioinformatics a dot plot is a graphical method for comparing two biological sequences and identifying regions of close similarity after sequence alignment. It is a type of recurrence plot.

The GOR method is an information theory-based method for the prediction of secondary structures in proteins. It was developed in the late 1970s shortly after the simpler Chou–Fasman method. Like Chou–Fasman, the GOR method is based on probability parameters derived from empirical studies of known protein tertiary structures solved by X-ray crystallography. However, unlike Chou–Fasman, the GOR method takes into account not only the propensities of individual amino acids to form particular secondary structures, but also the conditional probability of the amino acid to form a secondary structure given that its immediate neighbors have already formed that structure. The method is therefore essentially Bayesian in its analysis.

<span class="mw-page-title-main">HMMER</span> Software package for sequence analysis

HMMER is a free and commonly used software package for sequence analysis written by Sean Eddy. Its general usage is to identify homologous protein or nucleotide sequences, and to perform sequence alignments. It detects homology by comparing a profile-HMM to either a single sequence or a database of sequences. Sequences that score significantly better to the profile-HMM compared to a null model are considered to be homologous to the sequences that were used to construct the profile-HMM. Profile-HMMs are constructed from a multiple sequence alignment in the HMMER package using the hmmbuild program. The profile-HMM implementation used in the HMMER software was based on the work of Krogh and colleagues. HMMER is a console utility ported to every major operating system, including different versions of Linux, Windows, and macOS.

CS-BLAST (Context-Specific BLAST) is a tool that searches a protein sequence that extends BLAST, using context-specific mutation probabilities. More specifically, CS-BLAST derives context-specific amino-acid similarities on each query sequence from short windows on the query sequences. Using CS-BLAST doubles sensitivity and significantly improves alignment quality without a loss of speed in comparison to BLAST. CSI-BLAST is the context-specific analog of PSI-BLAST, which computes the mutation profile with substitution probabilities and mixes it with the query profile. CSI-BLAST is the context specific analog of PSI-BLAST. Both of these programs are available as web-server and are available for free download.

Direct coupling analysis or DCA is an umbrella term comprising several methods for analyzing sequence data in computational biology. The common idea of these methods is to use statistical modeling to quantify the strength of the direct relationship between two positions of a biological sequence, excluding effects from other positions. This contrasts usual measures of correlation, which can be large even if there is no direct relationship between the positions. Such a direct relationship can for example be the evolutionary pressure for two positions to maintain mutual compatibility in the biomolecular structure of the sequence, leading to molecular coevolution between the two positions.

Steven Henikoff is a scientist at the Fred Hutchinson Cancer Research Center, and an HHMI Investigator. His field of study is chromatin-related transcriptional regulation. He earned his BS in chemistry at the University of Chicago. He earned his PhD in biochemistry and molecular biology from Harvard University in the lab of Matt Meselson in 1977. He did a postdoctoral fellowship at the University of Washington. His research has been funded by the National Science Foundation, National Institutes of Health, and HHMI. In 1992, Steven Henikoff, together with his wife Jorja Henikoff, introduced the BLOSUM substitution matrices. The BLOSUM matrices are widely used for sequence alignment of proteins. In 2005, Henikoff was elected to the National Academy of Sciences.

References

  1. Zvelebil, Marketa J. (2008). Understanding bioinformatics. New York: Garland Science. pp. 117–127, 747. ISBN   978-0-8153-4024-9.
  2. Xiong, Jin (2006). Essential Bioinformatics. Cambridge: Cambridge University Press. doi:10.1017/cbo9780511806087.004. ISBN   978-0-511-80608-7.
  3. 1 2 Whelan, Simon; Goldman, Nick (1 May 2001). "A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach". Molecular Biology and Evolution. 18 (5): 691–699. doi: 10.1093/oxfordjournals.molbev.a003851 . ISSN   0737-4038. PMID   11319253.
  4. Müller, T; Rahmann, S; Rehmsmeier, M (2001). "Non-symmetric score matrices and the detection of homologous transmembrane proteins". Bioinformatics. 17 (Suppl 1): S182–9. doi: 10.1093/bioinformatics/17.suppl_1.s182 . PMID   11473008.
  5. Rice, DW; Eisenberg, D (1997). "A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence". Journal of Molecular Biology. 267 (4): 1026–38. CiteSeerX   10.1.1.44.1143 . doi:10.1006/jmbi.1997.0924. PMID   9135128.
  6. Gong, Sungsam; Blundell, Tom L. (2008). Levitt, Michael (ed.). "Discarding functional residues from the substitution table improves predictions of active sites within three-dimensional structures". PLOS Computational Biology. 4 (10): e1000179. Bibcode:2008PLSCB...4E0179G. doi: 10.1371/journal.pcbi.1000179 . PMC   2527532 . PMID   18833291.
  7. Goonesekere, NC; Lee, B (2008). "Context-specific amino acid substitution matrices and their use in the detection of protein homologs". Proteins. 71 (2): 910–9. doi:10.1002/prot.21775. PMID   18004781. S2CID   27443393.
  8. Huang, YM; Bystroff, C (2006). "Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions". Bioinformatics. 22 (4): 413–22. doi: 10.1093/bioinformatics/bti828 . PMID   16352653.

Further reading