Gap penalty

Last updated July 03, 2024

A Gap penalty is a method of scoring alignments of two or more sequences. When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can. However, minimizing gaps in an alignment is important to create a useful alignment. Too many gaps can cause an alignment to become meaningless. Gap penalties are used to adjust alignment scores based on the number and length of gaps. The five main types of gap penalties are constant, linear, affine, convex, and profile-based.^[1]

Applications

Genetic sequence alignment - In bioinformatics, gaps are used to account for genetic mutations occurring from insertions or deletions in the sequence, sometimes referred to as indels. Insertions or deletions can occur due to single mutations, unbalanced crossover in meiosis, slipped strand mispairing, and chromosomal translocation.^[2] The notion of a gap in an alignment is important in many biological applications, since the insertions or deletions comprise an entire sub-sequence and often occur from a single mutational event.^[3] Furthermore, single mutational events can create gaps of different sizes. Therefore, when scoring, the gaps need to be scored as a whole when aligning two sequences of DNA. Considering multiple gaps in a sequence as a larger single gap will reduce the assignment of a high cost to the mutations. For instance, two protein sequences may be relatively similar but differ at certain intervals as one protein may have a different subunit compared to the other. Representing these differing sub-sequences as gaps will allow us to treat these cases as “good matches” even though there are long consecutive runs with indel operations in the sequence. Therefore, using a good gap penalty model will avoid low scores in alignments and improve the chances of finding a true alignment.^[3] In genetic sequence alignments, gaps are represented as dashes(-) on a protein/DNA sequence alignment.^[1]
Unix diff function - computes the minimal difference between two files similarly to plagiarism detection.
Spell checking - Gap penalties can help find correctly spelled words with the shortest edit distance to a misspelled word. Gaps can indicate a missing letter in the incorrectly spelled word.
Plagiarism detection - Gap penalties allow algorithms to detect where sections of a document are plagiarized by placing gaps in original sections and matching what is identical. The gap penalty for a certain document quantifies how much of a given document is probably original or plagiarized.

Bioinformatics applications

Global alignment

A global alignment performs an end-to-end alignment of the query sequence with the reference sequence. Ideally, this alignment technique is most suitable for closely related sequences of similar lengths. The Needleman-Wunsch algorithm is a dynamic programming technique used to conduct global alignment. Essentially, the algorithm divides the problem into a set of sub-problems, then uses the results of the sub-problems to reconstruct a solution to the original query.^[4]

Semi-global alignment

The use of semi-global alignment exists to find a particular match within a large sequence. An example includes seeking promoters within a DNA sequence. Unlike global alignment, it compromises of no end gaps in one or both sequences. If the end gaps are penalized in one sequence 1 but not in sequence 2, it produces an alignment that contains sequence 2 within sequence 1.

Local alignment

A local sequence alignment matches a contiguous sub-section of one sequence with a contiguous sub-section of another.^[5] The Smith-Waterman algorithm is motivated by giving scores for matches and mismatches. Matches increase the overall score of an alignment whereas mismatches decrease the score. A good alignment then has a positive score and a poor alignment has a negative score. The local algorithm finds an alignment with the highest score by considering only alignments that score positives and picking the best one from those. The algorithm is a dynamic programming algorithm. When comparing proteins, one uses a similarity matrix which assigns a score to each possible residue pair. The score should be positive for similar residues and negative for dissimilar residue pairs. Gaps are usually penalized using a linear gap function that assigns an initial penalty for a gap opening, and an additional penalty for gap extensions, increasing the gap length.

Scoring matrix

Blosum-62 Matrix BLOSUM62.png — Blosum-62 Matrix

Substitution matrices such as BLOSUM are used for sequence alignment of proteins.^[6] A Substitution matrix assigns a score for aligning any possible pair of residues.^[6] In general, different substitution matrices are tailored to detecting similarities among sequences that are diverged by differing degrees. A single matrix may be reasonably efficient over a relatively broad range of evolutionary change.^[6] The BLOSUM-62 matrix is one of the best substitution matrices for detecting weak protein similarities.^[6] BLOSUM matrices with high numbers are designed for comparing closely related sequences, while those with low numbers are designed for comparing distant related sequences. For example, BLOSUM-80 is used for alignments that are more similar in sequence, and BLOSUM-45 is used for alignments that have diverged from each other.^[6] For particularly long and weak alignments, the BLOSUM-45 matrix may provide the best results. Short alignments are more easily detected using a matrix with a higher "relative entropy" than that of BLOSUM-62. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries.^[6]

Indels

During DNA replication, the cellular replication machinery is prone to making two types of errors while duplicating the DNA. These two replication errors are insertions and deletions of single DNA bases from the DNA strand (indels).^[7] Indels can have severe biological consequences by causing mutations in the DNA strand that could result in the inactivation or over activation of the target protein. For example, if a one or two nucleotide indel occurs in a coding sequence the result will be a shift in the reading frame, or a frameshift mutation that may render the protein inactive.^[7] The biological consequences of indels are often deleterious and are frequently associated with pathologies such as cancer. However, not all indels are frameshift mutations. If indels occur in trinucleotides, the result is an extension of the protein sequence that may also have implications on protein function.^[7]

Types

Constant

This is the simplest type of gap penalty: a fixed negative score is given to every gap, regardless of its length.^[3]^[8] This encourages the algorithm to make fewer, larger, gaps leaving larger contiguous sections.

ATTGACCTGA ||   ||||| AT---CCTGA

Aligning two short DNA sequences, with '-' depicting a gap of one base pair. If each match was worth 1 point and the whole gap -1, the total score: 7 − 1 = 6.

Linear

Compared to the constant gap penalty, the linear gap penalty takes into account the length (L) of each insertion/deletion in the gap. Therefore, if the penalty for each inserted/deleted element is B and the length of the gap L; the total gap penalty would be the product of the two BL.^[9] This method favors shorter gaps, with total score decreasing with each additional gap.

ATTGACCTGA ||   ||||| AT---CCTGA

Unlike constant gap penalty, the size of the gap is considered. With a match with score 1 and each gap -1, the score here is (7 − 3 = 4).

Affine

The most widely used gap penalty function is the affine gap penalty. The affine gap penalty combines the components in both the constant and linear gap penalty, taking the form $A+B\cdot (L-1)$ . This introduces new terms, A is known as the gap opening penalty, B the gap extension penalty and L the length of the gap. Gap opening refers to the cost required to open a gap of any length, and gap extension the cost to extend the length of an existing gap by 1.^[10] Often it is unclear as to what the values A and B should be as it differs according to purpose. In general, if the interest is to find closely related matches (e.g. removal of vector sequence during genome sequencing), a higher gap penalty should be used to reduce gap openings. On the other hand, gap penalty should be lowered when interested in finding a more distant match.^[9] The relationship between A and B also have an effect on gap size. If the size of the gap is important, a small A and large B (more costly to extend a gap) is used and vice versa. Only the ratio A/B is important, as multiplying both by the same positive constant $k$ will increase all penalties by $k$ : $kA+kB(L-1)=k(A+B(L-1))$ which does not change the relative penalty between different alignments.

Convex

Using the affine gap penalty requires the assigning of fixed penalty values for both opening and extending a gap. This can be too rigid for use in a biological context.^[11]

The logarithmic gap takes the form $G(L)=A+C\ln L$ and was proposed as studies had shown the distribution of indel sizes obey a power law.^[12] Another proposed issue with the use of affine gaps is the favoritism of aligning sequences with shorter gaps. Logarithmic gap penalty was invented to modify the affine gap so that long gaps are desirable.^[11] However, in contrast to this, it has been found that using logarithmatic models had produced poor alignments when compared to affine models.^[12]

Profile-based

Profile–profile alignment algorithms are powerful tools for detecting protein homology relationships with improved alignment accuracy.^[13] Profile-profile alignments are based on the statistical indel frequency profiles from multiple sequence alignments generated by PSI-BLAST searches.^[13] Rather than using substitution matrices to measure the similarity of amino acid pairs, profile–profile alignment methods require a profile-based scoring function to measure the similarity of profile vector pairs.^[13] Profile-profile alignments employ gap penalty functions. The gap information is usually used in the form of indel frequency profiles, which is more specific for the sequences to be aligned. ClustalW and MAFFT adopted this kind of gap penalty determination for their multiple sequence alignments.^[13] Alignment accuracies can be improved using this model, especially for proteins with low sequence identity. Some profile–profile alignment algorithms also run the secondary structure information as one term in their scoring functions, which improves alignment accuracy.^[13]

Comparing time complexities

The use of alignment in computational biology often involves sequences of varying lengths. It is important to pick a model that would efficiently run at a known input size. The time taken to run the algorithm is known as the time complexity.

Time complexities for various gap penalty models
Type	Time
Constant gap penalty	O(mn)
Affine gap penalty	O(mn)
Convex gap penalty	O(mn lg(m+n))

Challenges

There are a few challenges when it comes to working with gaps. When working with popular algorithms there seems to be little theoretical basis for the form of the gap penalty functions.^[14] Consequently, for any alignment situation gap placement must be empirically determined.^[14] Also, pairwise alignment gap penalties, such as the affine gap penalty, are often implemented independent of the amino acid types in the inserted or deleted fragment or at the broken ends, despite evidence that specific residue types are preferred in gap regions.^[14] Finally, alignment of sequences implies alignment of the corresponding structures, but the relationships between structural features of gaps in proteins and their corresponding sequences are only imperfectly known. Because of this incorporating structural information into gap penalties is difficult to do.^[14] Some algorithms use predicted or actual structural information to bias the placement of gaps. However, only a minority of sequences have known structures, and most alignment problems involve sequences of unknown secondary and tertiary structure.^[14]

Related Research Articles

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. It can be performed on the entire genome, transcriptome or proteome of an organism, and can also involve only selected segments or regions, like tandem repeats and transposable elements. Methodologies used include sequence alignment, searches against biological databases, and others.

<span class="mw-page-title-main">Protein structure prediction</span> Type of biological prediction

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by computational biology; it is important in medicine and biotechnology.

In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

In mathematics, computer science and especially graph theory, a distance matrix is a square matrix containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the distance being used to define this matrix may or may not be a metric. If there are $N$ elements, this matrix will have size $N \times N$ . In graph-theoretic applications, the elements are more often referred to as points, nodes or vertices.

The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970. The algorithm essentially divides a large problem into a series of smaller problems, and it uses the solutions to the smaller problems to find an optimal solution to the larger problem. It is also sometimes referred to as the optimal matching algorithm and the global alignment technique. The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance. The algorithm assigns a score to every possible alignment, and the purpose of the algorithm is to find all possible alignments having the highest score.

FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.

<span class="mw-page-title-main">Smith–Waterman algorithm</span> Algorithm for determining similar regions between two molecular sequences

The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

Clustal is a computer program used for multiple sequence alignment in bioinformatics. The software and its algorithms have gone through several iterations, with ClustalΩ (Omega) being the latest version as of 2011. It is available as standalone software, via a web interface, and through a server hosted by the European Bioinformatics Institute.

The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff. Institute leaders such as Rolf Apweiler, Alex Bateman, Ewan Birney, and Guy Cochrane, an adviser on the National Genomics Data Center Scientific Advisory Board, serve as part of the international research network of the BIG Data Center at the Beijing Institute of Genomics.

A point accepted mutation — also known as a PAM — is the replacement of a single amino acid in the primary structure of a protein with another single amino acid, which is accepted by the processes of natural selection. This definition does not include all point mutations in the DNA of an organism. In particular, silent mutations are not point accepted mutations, nor are mutations that are lethal or that are rejected by natural selection in other ways.

Multiple sequence alignment (MSA) is the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. These alignments are used to infer evolutionary relationships via phylogenetic analysis and can highlight homologous features between sequences. Alignments highlight mutation events such as point mutations, insertion mutations and deletion mutations, and alignments are used to assess sequence conservation and infer the presence and activity of protein domains, tertiary structures, secondary structures, and individual amino acids or nucleotides.

<span class="mw-page-title-main">BLOSUM</span> Bioinformatics tool

In bioinformatics, the BLOSUM matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. BLOSUM matrices were first introduced in a paper by Steven Henikoff and Jorja Henikoff. They scanned the BLOCKS database for very conserved regions of protein families and then counted the relative frequencies of amino acids and their substitution probabilities. Then, they calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM Matrices.

In bioinformatics, MAFFT is a program used to create multiple sequence alignments of amino acid or nucleotide sequences. Published in 2002, the first version of MAFFT used an algorithm based on progressive alignment, in which the sequences were clustered with the help of the fast Fourier transform. Subsequent versions of MAFFT have added other algorithms and modes of operation, including options for faster alignment of large numbers of sequences, higher accuracy alignments, alignment of non-coding RNA sequences, and the addition of new sequences to existing alignments.

<span class="mw-page-title-main">Dot plot (bioinformatics)</span>

In bioinformatics a dot plot is a graphical method for comparing two biological sequences and identifying regions of close similarity after sequence alignment. It is a type of recurrence plot.

<span class="mw-page-title-main">HMMER</span> Software package for sequence analysis

HMMER is a free and commonly used software package for sequence analysis written by Sean Eddy. Its general usage is to identify homologous protein or nucleotide sequences, and to perform sequence alignments. It detects homology by comparing a profile-HMM to either a single sequence or a database of sequences. Sequences that score significantly better to the profile-HMM compared to a null model are considered to be homologous to the sequences that were used to construct the profile-HMM. Profile-HMMs are constructed from a multiple sequence alignment in the HMMER package using the hmmbuild program. The profile-HMM implementation used in the HMMER software was based on the work of Krogh and colleagues. HMMER is a console utility ported to every major operating system, including different versions of Linux, Windows, and macOS.

CS-BLAST (Context-Specific BLAST) is a tool that searches a protein sequence that extends BLAST, using context-specific mutation probabilities. More specifically, CS-BLAST derives context-specific amino-acid similarities on each query sequence from short windows on the query sequences. Using CS-BLAST doubles sensitivity and significantly improves alignment quality without a loss of speed in comparison to BLAST. CSI-BLAST is the context-specific analog of PSI-BLAST, which computes the mutation profile with substitution probabilities and mixes it with the query profile. CSI-BLAST is the context specific analog of PSI-BLAST. Both of these programs are available as web-server and are available for free download.

A Pairwise Algorithm is an algorithmic technique with its origins in Dynamic programming. Pairwise algorithms have several uses including comparing a protein profile against the three translation frames of a DNA strand, allowing frameshifting. The most remarkable feature of PairWise as compared to other Protein-DNA alignment tools is that PairWise allows frameshifting during alignment.

References

1 2 "Glossary". Rosalind. Rosalind Team. Retrieved 2021-05-20.
↑ Carroll, Ridge, Clement, Snell, Hyrum , Perry, Mark, Quinn (January 1, 2007). "Effects of Gap Open and Gap Extension Penalties". International Journal of Bioinformatics Research and Applications. Retrieved 2014-09-09.{{cite journal}}: CS1 maint: multiple names: authors list (link)
1 2 3 "Gap Penalty" (PDF). Algorithms for Molecular Biology. 2006-01-01. Archived from the original (PDF) on 2013-06-26. Retrieved 2014-09-13.
↑ Lesk, Arthur M (2013-07-26). "bioinformatics". Encyclopædia Britannica. Retrieved 2014-09-12.
↑ Vingron, M.; Waterman, M. S. (1994). "Sequence alignment and penalty choice. Review of concepts, case studies and implications". Journal of Molecular Biology. 235 (1): 1–12. doi:10.1016/S0022-2836(05)80006-3. PMID 8289235.
1 2 3 4 5 6 "BLAST substitution matrices". NCBI. Retrieved 2012-11-27.
1 2 3 Garcia-Diaz, Miguel (2006). "Mechanism of a genetic glissando: structural biology of indel mutations". Trends in Biochemical Sciences. 31 (4): 206–214. doi:10.1016/j.tibs.2006.02.004. PMID 16545956.
↑ "Glossary - Constant Gap Penalty". Rosalind. Rosalind Team. 12 Aug 2014. Retrieved 12 Aug 2014.
1 2 Hodgman C, French A, Westhead D (2009). BIOS Instant Notes in Bioinformatics. Garland Science. pp. 143–144. ISBN 978-0203967249.
↑ "Global Alignment with Scoring Matrix and Affine Gap Penalty". Rosalind. Rosalind Team. 2012-07-02. Retrieved 2014-09-12.
1 2 Sung, Wing-Kin (2011). Algorithms in Bioinformatics : A Practical Introduction. CRC Press. pp. 42–47. ISBN 978-1420070347.
1 2 Cartwright, Reed (2006-12-05). "Logarithmic gap costs decrease alignment accuracy". BMC Bioinformatics. 7: 527. doi: 10.1186/1471-2105-7-527 . PMC 1770940 . PMID 17147805.
1 2 3 4 5 Wang C, Yan RX, Wang XF, Si JN, Zhang Z (12 October 2011). "Comparison of linear gap penalties and profile-based variable gap penalties in profile-profile alignments". Comput Biol Chem. 35 (5): 308–318. doi:10.1016/j.compbiolchem.2011.07.006. PMID 22000802.
1 2 3 4 5 Wrabl JO, Grishin NV (1 January 2004). "Gaps in structurally similar proteins: towards improvement of multiple sequence alignment". Proteins. 54 (1): 71–87. doi:10.1002/prot.10508. PMID 14705025. S2CID 20474119.