BLOSUM

Last updated

The BLOSUM62 matrix, the amino acids have been grouped and coloured based on Margaret Dayhoff's classification scheme. Positive and zero values have been highlighted. Blosum62-dayhoff-ordering.svg
The BLOSUM62 matrix, the amino acids have been grouped and coloured based on Margaret Dayhoff's classification scheme. Positive and zero values have been highlighted.

In bioinformatics, the BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. BLOSUM matrices were first introduced in a paper by Steven Henikoff and Jorja Henikoff. [1] They scanned the BLOCKS database for very conserved regions of protein families (that do not have gaps in the sequence alignment) and then counted the relative frequencies of amino acids and their substitution probabilities. Then, they calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM Matrices.

Contents

Biological background

The genetic instructions of every replicating cell in a living organism are contained within its DNA. [2] Throughout the cell's lifetime, this information is transcribed and replicated by cellular mechanisms to produce proteins or to provide instructions for daughter cells during cell division, and the possibility exists that the DNA may be altered during these processes. [2] [3] This is known as a mutation. At the molecular level, there are regulatory systems that correct most — but not all — of these changes to the DNA before it is replicated. [3] [4]

The functionality of a protein is highly dependent on its structure. [5] Changing a single amino acid in a protein may reduce its ability to carry out this function, or the mutation may even change the function that the protein carries out. [3] Changes like these may severely impact a crucial function in a cell, potentially causing the cell — and in extreme cases, the organism — to die. [6] Conversely, the change may allow the cell to continue functioning albeit differently, and the mutation can be passed on to the organism's offspring. If this change does not result in any significant physical disadvantage to the offspring, the possibility exists that this mutation will persist within the population. The possibility also exists that the change in function becomes advantageous.

The 20 amino acids translated by the genetic code vary greatly by the physical and chemical properties of their side chains. [5] However, these amino acids can be categorised into groups with similar physicochemical properties. [5] Substituting an amino acid with another from the same category is more likely to have a smaller impact on the structure and function of a protein than replacement with an amino acid from a different category.

Sequence alignment is a fundamental research method for modern biology. The most common sequence alignment for protein is to look for similarity between different sequences in order to infer function or establish evolutionary relationships. This helps researchers better understand the origin and function of genes through the nature of homology and conservation. Substitution matrices are utilized in algorithms to calculate the similarity of different sequences of proteins; however, the utility of Dayhoff PAM Matrix has decreased over time due to the requirement of sequences with a similarity more than 85%. In order to fill in this gap, Henikoff and Henikoff introduced BLOSUM (BLOcks SUbstitution Matrix) matrix which led to marked improvements in alignments and in searches using queries from each of the groups of related proteins. [1]

Terminology

BLOSUM
Blocks Substitution Matrix, a substitution matrix used for sequence alignment of proteins.
Scoring metrics (statistical versus biological)
When evaluating a sequence alignment, one would like to know how meaningful it is. This requires a scoring matrix, or a table of values that describes the probability of a biologically meaningful amino-acid or nucleotide residue-pair occurring in an alignment. Scores for each position are obtained frequencies of substitutions in blocks of local alignments of protein sequences. [7]
BLOSUM r
The matrix built from blocks with at least r% of similarity
  • E.g., BLOSUM62 is the matrix built using sequences with less than 62% similarity (sequences with ≥ 62% identity were clustered)
  • Note: BLOSUM 62 is the default matrix for protein BLAST. Experimentation has shown that the BLOSUM-62 matrix is among the best for detecting most weak protein similarities. [1]

Several sets of BLOSUM matrices exist using different alignment databases, named with numbers. BLOSUM matrices with high numbers are designed for comparing closely related sequences, while those with low numbers are designed for comparing distant related sequences. For example, BLOSUM80 is used for closely related alignments, and BLOSUM45 is used for more distantly related alignments. The matrices were created by merging (clustering) all sequences that were more similar than a given percentage into one single sequence and then comparing those sequences (that were all more divergent than the given percentage value) only; thus reducing the contribution of closely related sequences. The percentage used was appended to the name, giving BLOSUM80 for example where sequences that were more than 80% identical were clustered.

Construction of BLOSUM matrices

BLOSUM matrices are obtained by using blocks of similar amino acid sequences as data, then applying statistical methods to the data to obtain the similarity scores. Statistical Methods Steps : [8]

Eliminating Sequences

Eliminate the sequences that are more than r% identical. There are two ways to eliminate the sequences. It can be done either by removing sequences from the block or just by finding similar sequences and replace them by new sequences which could represent the cluster. Elimination is done to remove protein sequences that are more similar than the specified threshold.

Calculating Frequency & Probability

A database storing the sequence alignments of the most conserved regions of protein families. These alignments are used to derive the BLOSUM matrices. Only the sequences with a percentage of identity lower than the threshold are used. By using the block, counting the pairs of amino acids in each column of the multiple alignment.

Log odds ratio

It gives the ratio of the occurrence each amino acid combination in the observed data to the expected value of occurrence of the pair. It is rounded off and used in the substitution matrix.

where is the probability of observing the pair and is the expected probability of such a pair occurring, given the background probabilities of each amino acid.

BLOSUM Matrices

The odds for relatedness are calculated from log odd ratio, which are then rounded off to get the substitution matrices BLOSUM matrices.

Score of the BLOSUM matrices

A scoring matrix or a table of values is required for evaluating the significance of a sequence alignment, such as describing the probability of a biologically meaningful amino-acid or nucleotide residue-pair occurring in an alignment. Typically, when two nucleotide sequences are being compared, all that is being scored is whether or not two bases are the same at one position. All matches and mismatches are respectively given the same score (typically +1 or +5 for matches, and -1 or -4 for mismatches). [9] But it is different for proteins. Substitution matrices for amino acids are more complicated and implicitly take into account everything that might affect the frequency with which any amino acid is substituted for another. The objective is to provide a relatively heavy penalty for aligning two residues together if they have a low probability of being homologous (correctly aligned by evolutionary descent). Two major forces drive the amino-acid substitution rates away from uniformity: substitutions occur with the different frequencies, and lessen functionally tolerated than others. Thus, substitutions are selected against. [7]

Commonly used substitution matrices include the blocks substitution (BLOSUM) [1] and point accepted mutation (PAM) [10] [11] matrices. Both are based on taking sets of high-confidence alignments of many homologous proteins and assessing the frequencies of all substitutions, but they are computed using different methods. [7]

Scores within a BLOSUM are log-odds scores that measure, in an alignment, the logarithm for the ratio of the likelihood of two amino acids appearing with a biological sense and the likelihood of the same amino acids appearing by chance. The matrices are based on the minimum percentage identity of the aligned protein sequence used in calculating them. [12] Every possible identity or substitution is assigned a score based on its observed frequencies in the alignment of related proteins. [13] A positive score is given to the more likely substitutions while a negative score is given to the less likely substitutions.

To calculate a BLOSUM matrix, the following equation is used:

Here, is the probability of two amino acids and replacing each other in a homologous sequence, and and are the background probabilities of finding the amino acids and in any protein sequence. The factor is a scaling factor, set such that the matrix contains easily computable integer values.

An example - BLOSUM62

BLOSUM80: more related proteins

BLOSUM62: midrange

BLOSUM45: distantly related proteins

An article in Nature Biotechnology [14] revealed that the BLOSUM62 used for so many years as a standard is not exactly accurate according to the algorithm described by Henikoff and Henikoff. [1] Surprisingly, the miscalculated BLOSUM62 improves search performance. [14]

The BLOSUM62 matrix with the amino acids in the table grouped according to the chemistry of the side chain, as in (a). Each value in the matrix is calculated by dividing the frequency of occurrence of the amino acid pair in the BLOCKS database, clustered at the 62% level, divided by the probability that the same two amino acids might align by chance. The ratio is then converted to a logarithm and expressed as a log odds score, as for PAM. BLOSUM matrices are usually scaled in half-bit units. A score of zero indicates that the frequency with which a given two amino acids were found aligned in the database was as expected by chance, while a positive score indicates that the alignment was found more often than by chance, and negative score indicates that the alignment was found less often than by chance.

Some uses in bioinformatics

Research applications

BLOSUM scores was used to predict and understand the surface gene variants among hepatitis B virus carriers [15] and T-cell epitopes. [16]

Surface gene variants among hepatitis B virus carriers

DNA sequences of HBsAg were obtained from 180 patients, in which 51 were chronic HBV carrier and 129 newly diagnosed patients, and compared with consensus sequences built with 168 HBV sequences imported from GenBank. Literature review and BLOSUM scores were used to define potentially altered antigenicity. [15]

Reliable prediction of T-cell epitopes

A novel input representation has been developed consisting of a combination of sparse encoding, Blosum encoding, and input derived from hidden Markov models. this method predicts T-cell epitopes for the genome of hepatitis C virus and discuss possible applications of the prediction method to guide the process of rational vaccine design. [16]

Use in BLAST

BLOSUM matrices are also used as a scoring matrix when comparing DNA sequences or protein sequences to judge the quality of the alignment. This form of scoring system is utilized by a wide range of alignment software including BLAST. [17]

Comparing PAM and BLOSUM

In addition to BLOSUM matrices, a previously developed scoring matrix can be used. This is known as a PAM. The two result in the same scoring outcome, but use differing methodologies. BLOSUM looks directly at mutations in motifs of related sequences while PAM's extrapolate evolutionary information based on closely related sequences. [1]

Since both PAM and BLOSUM are different methods for showing the same scoring information, the two can be compared but due to the very different method of obtaining this score, a PAM100 does not equal a BLOSUM100. [18]

PAMBLOSUM
PAM100BLOSUM90
PAM120BLOSUM80
PAM160BLOSUM62
PAM200BLOSUM50
PAM250BLOSUM45
The relationship between PAM and BLOSUM
PAMBLOSUM
To compare closely related sequences, PAM matrices with lower numbers are created.To compare closely related sequences, BLOSUM matrices with higher numbers are created.
To compare distantly related proteins, PAM matrices with high numbers are created.To compare distantly related proteins, BLOSUM matrices with low numbers are created.
The differences between PAM and BLOSUM
PAMBLOSUM
Based on global alignments of closely related proteins.Based on local alignments.
PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence but corresponds to 99% sequence identity.BLOSUM 62 is a matrix calculated from comparisons of sequences with a pairwise identity of no more than 62%.
Other PAM matrices are extrapolated from PAM1.Based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
Higher numbers in matrices naming scheme denote larger evolutionary distance.Larger numbers in matrices naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. [19]
Software Packages

There are several software packages in different programming languages that allow easy use of Blosum matrices.

Examples are the blosum module for Python, or the BioJava library for Java.

See also

Related Research Articles

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

<span class="mw-page-title-main">Protein structure prediction</span> Type of biological prediction

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by computational biology; it is important in medicine and biotechnology.

In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

In mathematics, computer science and especially graph theory, a distance matrix is a square matrix containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the distance being used to define this matrix may or may not be a metric. If there are N elements, this matrix will have size N×N. In graph-theoretic applications, the elements are more often referred to as points, nodes or vertices.

<span class="mw-page-title-main">Needleman–Wunsch algorithm</span> Method for aligning biological sequences

The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970. The algorithm essentially divides a large problem into a series of smaller problems, and it uses the solutions to the smaller problems to find an optimal solution to the larger problem. It is also sometimes referred to as the optimal matching algorithm and the global alignment technique. The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance. The algorithm assigns a score to every possible alignment, and the purpose of the algorithm is to find all possible alignments having the highest score.

In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such measures are in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. Though, in more broad terms, a similarity function may also satisfy metric axioms.

A Gap penalty is a method of scoring alignments of two or more sequences. When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can. However, minimizing gaps in an alignment is important to create a useful alignment. Too many gaps can cause an alignment to become meaningless. Gap penalties are used to adjust alignment scores based on the number and length of gaps. The five main types of gap penalties are constant, linear, affine, convex, and profile-based.

<span class="mw-page-title-main">Smith–Waterman algorithm</span> Algorithm for determining similar regions between two molecular sequences

The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

<span class="mw-page-title-main">Substitution model</span> Description of the process by which states in sequences change into each other and back

In biology, a substitution model, also called models of sequence evolution, are Markov models that describe changes over evolutionary time. These models describe evolutionary changes in macromolecules, such as DNA sequences or protein sequences, that can be represented as sequence of symbols. Substitution models are used to calculate the likelihood of phylogenetic trees using multiple sequence alignment data. Thus, substitution models are central to maximum likelihood estimation of phylogeny as well as Bayesian inference in phylogeny. Estimates of evolutionary distances are typically calculated using substitution models. Substitution models are also central to phylogenetic invariants because they are necessary to predict site pattern frequencies given a tree topology. Substitution models are also necessary to simulate sequence data for a group of organisms related by a specific tree.

<span class="mw-page-title-main">Point accepted mutation</span>

A point accepted mutation — also known as a PAM — is the replacement of a single amino acid in the primary structure of a protein with another single amino acid, which is accepted by the processes of natural selection. This definition does not include all point mutations in the DNA of an organism. In particular, silent mutations are not point accepted mutations, nor are mutations that are lethal or that are rejected by natural selection in other ways.

A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences.

Computational phylogenetics, phylogeny inference, or phylogenetic inference focuses on computational and optimization algorithms, heuristics, and approaches involved in phylogenetic analyses. The goal is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of genes, species, or taxa. Maximum likelihood, parsimony, Bayesian, and minimum evolution are typical optimality criteria used to assess how well a phylogenetic tree topology describes the sequence data. Nearest Neighbour Interchange (NNI), Subtree Prune and Regraft (SPR), and Tree Bisection and Reconnection (TBR), known as tree rearrangements, are deterministic algorithms to search for optimal or the best phylogenetic tree. The space and the landscape of searching for the optimal phylogenetic tree is known as phylogeny search space.

<span class="mw-page-title-main">Multiple sequence alignment</span> Alignment of more than two molecular sequences

Multiple sequence alignment (MSA) is the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. These alignments are used to infer evolutionary relationships via phylogenetic analysis and can highlight homologous features between sequences. Alignments highlight mutation events such as point mutations, insertion mutations and deletion mutations, and alignments are used to assess sequence conservation and infer the presence and activity of protein domains, tertiary structures, secondary structures, and individual amino acids or nucleotides.

Protein–protein interaction prediction is a field combining bioinformatics and structural biology in an attempt to identify and catalog physical interactions between pairs or groups of proteins. Understanding protein–protein interactions is important for the investigation of intracellular signaling pathways, modelling of protein complex structures and for gaining insights into various biochemical processes.

The GOR method is an information theory-based method for the prediction of secondary structures in proteins. It was developed in the late 1970s shortly after the simpler Chou–Fasman method. Like Chou–Fasman, the GOR method is based on probability parameters derived from empirical studies of known protein tertiary structures solved by X-ray crystallography. However, unlike Chou–Fasman, the GOR method takes into account not only the propensities of individual amino acids to form particular secondary structures, but also the conditional probability of the amino acid to form a secondary structure given that its immediate neighbors have already formed that structure. The method is therefore essentially Bayesian in its analysis.

<span class="mw-page-title-main">HMMER</span> Software package for sequence analysis

HMMER is a free and commonly used software package for sequence analysis written by Sean Eddy. Its general usage is to identify homologous protein or nucleotide sequences, and to perform sequence alignments. It detects homology by comparing a profile-HMM to either a single sequence or a database of sequences. Sequences that score significantly better to the profile-HMM compared to a null model are considered to be homologous to the sequences that were used to construct the profile-HMM. Profile-HMMs are constructed from a multiple sequence alignment in the HMMER package using the hmmbuild program. The profile-HMM implementation used in the HMMER software was based on the work of Krogh and colleagues. HMMER is a console utility ported to every major operating system, including different versions of Linux, Windows, and macOS.

CS-BLAST (Context-Specific BLAST) is a tool that searches a protein sequence that extends BLAST, using context-specific mutation probabilities. More specifically, CS-BLAST derives context-specific amino-acid similarities on each query sequence from short windows on the query sequences. Using CS-BLAST doubles sensitivity and significantly improves alignment quality without a loss of speed in comparison to BLAST. CSI-BLAST is the context-specific analog of PSI-BLAST, which computes the mutation profile with substitution probabilities and mixes it with the query profile. CSI-BLAST is the context specific analog of PSI-BLAST. Both of these programs are available as web-server and are available for free download.

The HH-suite is an open-source software package for sensitive protein sequence searching. It contains programs that can search for similar protein sequences in protein sequence databases. Sequence searches are a standard tool in modern biology with which the function of unknown proteins can be inferred from the functions of proteins with similar sequences. HHsearch and HHblits are two main programs in the package and the entry point to its search function, the latter being a faster iteration. HHpred is an online server for protein structure prediction that uses homology information from HH-suite.

Steven Henikoff is a scientist at the Fred Hutchinson Cancer Research Center, and an HHMI Investigator. His field of study is chromatin-related transcriptional regulation. He earned his BS in chemistry at the University of Chicago. He earned his PhD in biochemistry and molecular biology from Harvard University in the lab of Matt Meselson in 1977. He did a postdoctoral fellowship at the University of Washington. His research has been funded by the National Science Foundation, National Institutes of Health, and HHMI. In 1992, Steven Henikoff, together with his wife Jorja Henikoff, introduced the BLOSUM substitution matrices. The BLOSUM matrices are widely used for sequence alignment of proteins. In 2005, Henikoff was elected to the National Academy of Sciences.

References

  1. 1 2 3 4 5 6 Henikoff, S.; Henikoff, J.G. (1992). "Amino Acid Substitution Matrices from Protein Blocks". PNAS. 89 (22): 10915–10919. Bibcode:1992PNAS...8910915H. doi: 10.1073/pnas.89.22.10915 . PMC   50453 . PMID   1438297.
  2. 1 2 Campbell NA; Reece JB; Meyers N; Urry LA; Cain ML; Wasserman SA; Minorsky PV; Jackson RB (2009). "The Molecular Basis of Inheritance". Biology: Australian Version (8th ed.). Pearson Education Australia. pp. 307–325. ISBN   9781442502215.
  3. 1 2 3 Campbell NA; Reece JB; Meyers N; Urry LA; Cain ML; Wasserman SA; Minorsky PV; Jackson RB (2009). "From Gene to Protein". Biology: Australian Version (8th ed.). Pearson Education Australia. pp. 327–350. ISBN   9781442502215.
  4. Pal JK, Ghaskadbi SS (2009). "DNA Damage, Repair and Recombination". Fundamentals of Molecular Biology (1st ed.). Oxford University Press. pp.  187–203. ISBN   9780195697810.
  5. 1 2 3 Campbell NA; Reece JB; Meyers N; Urry LA; Cain ML; Wasserman SA; Minorsky PV; Jackson RB (2009). "The Structure and Function of Large Biological Molecules". Biology: Australian Version (8th ed.). Pearson Education Australia. pp. 68–89. ISBN   9781442502215.
  6. Lobo, Ingrid (2008). "Mendelian Ratios and Lethal Genes". Nature . Retrieved 19 October 2013.
  7. 1 2 3 pertsemlidis A.; Fondon JW.3rd (September 2001). "Having a BLAST with bioinformatics (and avoiding BLASTphemy)". Genome Biology. 2 (10): reviews2002.1–2002.10. doi: 10.1186/gb-2001-2-10-reviews2002 . PMC   138974 . PMID   11597340.{{cite journal}}: CS1 maint: numeric names: authors list (link)
  8. "BLOSSUM MATRICES: Introduction to BIOINFORMATICS" (PDF). UNIVERSITI TEKNOLOGI MALAYSIA. 2009. Retrieved 9 September 2014.[ permanent dead link ]
  9. Murali Sivaramakrishnan; Ognjen Perisic; Shashi Ranjan. "CS#594 - Group 13 (Tools and softwares)" (PDF). University of Illinois at Chicago - UIC. Retrieved 9 September 2014.
  10. Margaret O., Dayhoff (1978). "22". Atlas of Protein Sequence and Structure. Vol. 5. Washington DC: National Biomedical Research Foundation. pp. 345–352.
  11. States DJ.; Gish W.; Altschul SF. (1991). "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices". Methods: A Companion to Methods in Enzymology. 3: 66–70. CiteSeerX   10.1.1.114.8183 . doi:10.1016/s1046-2023(05)80165-3. ISSN   1046-2023.
  12. Albert Y. Zomaya (2006). Handbook of Nature-Inspired And Innovative Computing. New York, NY: Springer. ISBN   978-0-387-40532-2.page 673
  13. NIH "Scoring Systems"
  14. 1 2 Mark P Styczynski; Kyle L Jensen; Isidore Rigoutsos; Gregory Stephanopoulos (2008). "BLOSUM62 miscalculations improve search performance". Nat. Biotechnol. 26 (3): 274–275. doi:10.1038/nbt0308-274. PMID   18327232. S2CID   205266180.
  15. 1 2 Roque-Afonso AM, Ferey MP, Ly TD (2007). "Viral and clinical factors associated with surface gene variants among hepatitis B virus carriers". Antivir Ther. 12 (8): 1255–1263. doi: 10.1177/135965350701200801 . PMID   18240865. S2CID   9822759.
  16. 1 2 Nielsen M, Lundegaard C, Worning P, et al. (2003). "Reliable prediction of T‐cell epitopes using neural networks with novel sequence representations" (PDF). Protein Science. 12 (5): 1007–1017. doi:10.1110/ps.0239403. PMC   2323871 . PMID   12717023.
  17. "The Statistics of Sequence Similarity Scores". National Centre for Biotechnology Information. Retrieved 20 October 2013.
  18. Saud, Omama (2009). "PAM and BLOSUM Substitution Matrices". Birec. Archived from the original on 9 March 2013. Retrieved 20 October 2013.
  19. "The art of aligning protein sequences Part 1 Matrices". Dai hoc Can Tho - Can Tho University. Archived from the original on 11 September 2014. Retrieved 7 September 2014.