Spaced seed

Last updated
An animated example to show the utility of a spaced seed. First, an attempt to identify a local candidate match without a space seed is made (unsuccessfully), before attempting the same task with a simple spaced seed, where a hit is found successfully. Green indicates a matching base position. See here for more details. Sequence Alignment Example.gif
An animated example to show the utility of a spaced seed. First, an attempt to identify a local candidate match without a space seed is made (unsuccessfully), before attempting the same task with a simple spaced seed, where a hit is found successfully. Green indicates a matching base position. See here for more details.

In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence and a method of approximate string matching that allows for substitutions. They are a straightforward modification to the earliest heuristic-based alignment efforts that allow for minor differences between the sequences of interest. Spaced seeds have been used in homology search., [1] alignment, [2] assembly, [3] and metagenomics. [4] They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant and dashes or asterisks for irrelevant positions.

Contents

Principle

Due to a number of functional and evolutionary constraints, nucleic acid sequences between individuals tend to be highly conserved, with the typical difference between two human genomes estimated on the order of 0.6% (or around 20 million base pairs). [5] Identification of highly similar regions in the genome may indicate functional importance, as mutations in these areas that would result in cessation of function or loss of regulatory ability would be evolutionary unfavorable. More observed differences between two sequences may arise as a result of stochastic sequencing errors. Similarly, when performing assembly of a previously characterized genome, an attempt is made to align the newly sequenced DNA fragments to the existing genome sequence.

In both cases, it is useful to be able to directly compare nucleic acid sequences. Since the sequences are not expected to be exactly identical, however, it is beneficial to focus on smaller subsequences that are more likely to be locally identical. Spaced seeds allow for even more permissive local matching by allowing certain base pairs (defined by the pattern of the specific spaced seed) to mismatch without penalty, thus allowing algorithms that use the general "hit-extend" strategy of alignment to explore additional potential matches that would be otherwise ignored.

Example

As a simple example, consider the following DNA sequences:

CTAAGTCACG
CTAACACACG
1111001111

Upon visual inspection, it's easy to see that there is a mismatch between the two sequences at the fifth and six base positions (in bold, above). However, the sequences still share 80% sequence similarity. The mismatches may be due to a real (biological) change or a sequencing error. In a non-spaced model, this putative match would be ignored if a seed size greater than 4 is specified. But a spaced seed of could be used to effectively zero-weighting the mismatch sites, treating the sequences as same for the purposes of hit identification. In reality, of course, we don't know the relative positioning of the "true" mismatches, so there can be different spaced seed patterns depending on where the mismatches are anticipated.

History

The concept of spaced seeds has been explored in literature under different names. One of the early uses was in sequence homology where the FLASH algorithm from 1993 referred to it as "non-contiguous sub-sequences of tokens" that were generated from all combinations of positions within a sequence window. [6] In 1995, a similar concept was used in approximate string matching where "gapped tuples" of positions in a sequence were explored to identify common substrings between a large text and a query. [7] The term "shape" was used in a 2001 paper to describe gapped q-grams where it refers to a set of relevant positions in a substring [8] and soon after in 2002, PatternHunter introduced "spaced model" which was proposed as an improvement upon the consecutive seeds used in BLAST, [1] which was ultimately adopted by newer versions of it. Finally, in 2003, PatternHunter II settled on the term "spaced seed" to refer to the approach used in PatternHunter [9]

Spaced versus Non-Spaced Models

Popular alignment algorithms such as BLAST and MegaBLAST use a non-spaced model, where the entire length of the seed is made of exact matches. [10] Thus, any mismatching base pair along the length of the seed will result in the program ignoring the potential hit. In a spaced model (such as PatternHunter), the matches are not necessarily consecutive.

More formally, the difference in a spaced seed model as compared to a non-spaced model is the relative positioning (otherwise known as weight, ) of the matched bases. In a non-spaced model, the length of the seed model, and the weight, of the seeds are the same, as they must be consecutive while in a spaced model, the weight is not necessarily equal to the length of the seed model, since match positions may be non-consecutive. [1] Therefore, a spaced seed model may be longer than a non-spaced seed model but have the same weight. For example, a non-spaced seed has the same weight as a spaced model , but their lengths differ.

The predicted number of hits can be calculated from PatternHunter [1] using the following lemma:

Where is the length of the sequence the model is compared to, is the length of the seed model, is the probability of a match and is the weight of the seed used.

Applications

The type of seed model used for sequence alignment can affect the processing time and memory usage when doing large-scale homology searches – two considerations that have been central in the development of modern homology search algorithms. It may also affect the sensitivity. Using spaced seed models has been demonstrated to allow for faster homology searches as seen with PatternHunter wherein homology searches were twenty times faster and used less memory than BLASTn (using a non-spaced model). [1]

Sequence Alignment

Most aligners first find candidate locations (or seeds) in the target sequence and then inspect those more closely to verify the alignment. [11] [2] [12] [13] Ideally, this first step would find all relevant locations in the target so sensitivity is prioritized but due to computational intensity, many popular algorithms (such as the earlier implementations of BLAST and FASTA) use heuristics to "short-cut" exploring all locations, ultimately missing many but running relatively quickly. One possible way to increase sensitivity, as done in the SHRiMP2 [2] algorithm, is to use spaced seeds to allow for small differences between the query and the candidate locations so that somewhat more locations are identified as candidates. SHRiMP2 specifically uses multiple spaced seeds for this and requires multiple matches, increasing sensitivity as it allows different possible combinations of differences while maintaining speed comparable to original methods.

Sequence Assembly

A variation of spaced seeds with a single contiguous gap has been used in de novo sequence assembly. [3] In this instance, the design has an equal number of ones at either end of the sequence with a run of zeroes in between. The reasoning behind this design is that in assemblers that utilize De Bruijn graphs, increasing k-mer size inflates memory usage, as k-mers are more likely to be unique. However, the most important parts of a k-mer are its ends, as they are what are used to extend sequences in a graph. Thus, to circumvent the problem with memory usage, the less-important middle part (covered by the gap) is ignored. This approach has the additional advantage, as in other uses of spaced seeds, of taking into account any sequencing errors that may have occurred in the gap area. It has been noted [3] that increasing the length of the gap also increases the uniqueness of k-mers in both E. coli and H. sapiens genomes.

Metagenomics

A metagenomics study will commonly start with the high-throughput sequencing of a mixture of distinct species (e.g. from the human gut), yielding a set of sequences but with unknown origins. As such, one common goal is to identify which genome each sequence is phylogenetically most similar to. One approach could be to take k-mers from each sequence and see which, in a set of genomes, it has most sequence similarity with. Spaced seeds have been successfully utilized for this purpose [4] by finding how many k-mers are found in each genome (hit number) and the total number of positions these k-mers cover (coverage).

Multiple Spaced Seeds

An improvement to (single) spaced seeds was first demonstrated by PatternHunter in 2002 [1] where a set of spaced seeds were used, in which a "hit" was called whenever one of the set matched. PatternHunter II, in 2003, demonstrated that this approach could offer higher sensitivity than BLAST while maintaining similar speed. [9] Identifying an optimal set of spaced seeds is an NP-hard problem, however, even finding a "good" set of spaced seeds remains difficult, although several attempts have been made to computationally identify them. [14] [15] [16] Since the speed of the algorithm must decrease with an increasing number of spaced seeds, it makes sense to only consider multiple seeds when all offer some useful contribution. There is ongoing research on how to quickly calculate good multiple spaced seeds, as previous homology search software calculated and hard-coded their seeds [16] – it would be advantageous to be able to calculate purpose-driven multiple spaced seeds instead.

Related Research Articles

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

<span class="mw-page-title-main">Structural alignment</span> Aligning molecular sequences using sequence and structural information

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

<span class="mw-page-title-main">Needleman–Wunsch algorithm</span> Method for aligning biological sequences

The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970. The algorithm essentially divides a large problem into a series of smaller problems, and it uses the solutions to the smaller problems to find an optimal solution to the larger problem. It is also sometimes referred to as the optimal matching algorithm and the global alignment technique. The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance. The algorithm assigns a score to every possible alignment, and the purpose of the algorithm is to find all possible alignments having the highest score.

A Gap penalty is a method of scoring alignments of two or more sequences. When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can. However, minimizing gaps in an alignment is important to create a useful alignment. Too many gaps can cause an alignment to become meaningless. Gap penalties are used to adjust alignment scores based on the number and length of gaps. The five main types of gap penalties are constant, linear, affine, convex, and profile-based.

FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.

<span class="mw-page-title-main">Smith–Waterman algorithm</span> Algorithm for determining similar regions between two molecular sequences

The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

Computational genomics refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data, including both DNA and RNA sequence as well as other "post-genomic" data. These, in combination with computational and statistical approaches to understanding the function of the genes and statistical association analysis, this field is also often referred to as Computational and Statistical Genetics/genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means to biological discovery.

<span class="mw-page-title-main">Multiple sequence alignment</span> Alignment of more than two molecular sequences

Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation events such as point mutations that appear as differing characters in a single alignment column, and insertion or deletion mutations that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database originally developed at the Wellcome Trust Sanger Institute in collaboration with Janelia Farm, and currently hosted at the European Bioinformatics Institute. Rfam is designed to be similar to the Pfam database for annotating protein families.

BLAT is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome. It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence. The alignment tools of the time were not capable of performing these operations in a manner that would allow a regular update of the human genome assembly. Compared to pre-existing tools, BLAT was ~500 times faster with performing mRNA/DNA alignments and ~50 times faster with protein/protein alignments.

Warren Richard Gish is the owner of Advanced Biocomputing LLC. He joined Washington University in St. Louis as a junior faculty member in 1994, and was a Research Associate Professor of Genetics from 2002 to 2007.

YASS (Yet Another Similarity Searcher) is a free software, pairwise sequence alignment software for nucleotide sequences, that is, it can search for similarities between DNA or RNA sequences. YASS accepts nucleotide sequences in either plain text or the FASTA format and the output format includes the BLAST tabular output. YASS uses several transition-constrained spaced seed k-mers, which allow considerably improved sensitivity. YASS can be used locally on a user's machine, or as SaaS on the YASS web server, which produces a browser based dot-plot.

<span class="mw-page-title-main">HMMER</span> Software package for sequence analysis

HMMER is a free and commonly used software package for sequence analysis written by Sean Eddy. Its general usage is to identify homologous protein or nucleotide sequences, and to perform sequence alignments. It detects homology by comparing a profile-HMM to either a single sequence or a database of sequences. Sequences that score significantly better to the profile-HMM compared to a null model are considered to be homologous to the sequences that were used to construct the profile-HMM. Profile-HMMs are constructed from a multiple sequence alignment in the HMMER package using the hmmbuild program. The profile-HMM implementation used in the HMMER software was based on the work of Krogh and colleagues. HMMER is a console utility ported to every major operating system, including different versions of Linux, Windows, and macOS.

CS-BLAST (Context-Specific BLAST) is a tool that searches a protein sequence that extends BLAST, using context-specific mutation probabilities. More specifically, CS-BLAST derives context-specific amino-acid similarities on each query sequence from short windows on the query sequences. Using CS-BLAST doubles sensitivity and significantly improves alignment quality without a loss of speed in comparison to BLAST. CSI-BLAST is the context-specific analog of PSI-BLAST, which computes the mutation profile with substitution probabilities and mixes it with the query profile. CSI-BLAST is the context specific analog of PSI-BLAST. Both of these programs are available as web-server and are available for free download.

Phyre and Phyre2 are free web-based services for protein structure prediction. Phyre is among the most popular methods for protein structure prediction having been cited over 1500 times. Like other remote homology recognition techniques, it is able to regularly generate reliable protein models when other widely used methods such as PSI-BLAST cannot. Phyre2 has been designed to ensure a user-friendly interface for users inexpert in protein structure prediction methods. Its development is funded by the Biotechnology and Biological Sciences Research Council.

In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches.

PatternHunter is a commercially available homology search instrument software that uses sequence alignment techniques. It was initially developed in the year 2002 by three scientists: Bin Ma, John Tramp and Ming Li. These scientists were driven by the desire to solve the problem that many investigators face during studies that involve genomics and proteomics. These scientists realized that such studies greatly relied on homology studies that established short seed matches that were subsequently lengthened. Describing homologous genes was an essential part of most evolutionary studies and was crucial to the understanding of the evolution of gene families, the relationship between domains and families. Homologous genes could only be studied effectively using search tools that established like portions or local placement between two proteins or nucleic acid sequences. Homology was quantified by scores obtained from matching sequences, “mismatch and gap scores”.

Non-coding RNAs have been discovered using both experimental and bioinformatic approaches. Bioinformatic approaches can be divided into three main categories. The first involves homology search, although these techniques are by definition unable to find new classes of ncRNAs. The second category includes algorithms designed to discover specific types of ncRNAs that have similar properties. Finally, some discovery methods are based on very general properties of RNA, and are thus able to discover entirely new kinds of ncRNAs.

References

  1. 1 2 3 4 5 6 Ma, Bin; Tromp, John; Li, Ming (March 2002). "PatternHunter: faster and more sensitive homology search". Bioinformatics. 18 (3): 440–445. doi: 10.1093/bioinformatics/18.3.440 . PMID   11934743.
  2. 1 2 3 David, Matei; Dzamba, Misko; Lister, Dan; Ilie, Lucian; Brudno, Michael (April 2011). "SHRiMP2: Sensitive yet Practical Short Read Mapping". Bioinformatics. 27 (7): 1011–1012. doi: 10.1093/bioinformatics/btr046 . PMID   21278192.
  3. 1 2 3 Birol, I; Chu, J; Mohamadi, H; Jackman, S. D.; Raghavan, K; Vandervalk, B. P.; Raymond, A; Warren, René L. (2015). "Spaced Seed Data Structures for De Novo Assembly". International Journal of Genomics. 2015: 196591. doi: 10.1155/2015/196591 . PMC   4619942 . PMID   26539459.
  4. 1 2 Břinda, Karel; Sykulski, Maciej; Kucherov, Gregory (November 2015). "Spaced seeds improve k-mer-based metagenomic classification". Bioinformatics. 31 (22): 3584–3592. arXiv: 1502.06256 . Bibcode:2015arXiv150206256B. doi:10.1093/bioinformatics/btv419. PMID   26209798.
  5. Auton, A.; Brooks, L. D.; Durbin, R. M.; Garrison, E. P.; Kang, H. M.; Korbel, J. O.; Marchini, J. L.; McCarthy, S.; McVean, G. A.; Abecasis, G. R.; Flicek, Paul; Gabriel, Stacey B.; Gibbs, Richard A.; Green, Eric D.; Hurles, Matthew E.; Knoppers, Bartha M.; Korbel, Jan O.; Lander, Eric S.; Lee, Charles; Lehrach, Hans; Mardis, Elaine R.; Marth, Gabor T.; McVean, Gil A.; Nickerson, Deborah A.; Schmidt, Jeanette P.; Sherry, Stephen T.; Wang, Jun; Gibbs, Richard A.; Boerwinkle, Eric; et al. (2015-10-01). "A global reference for human genetic variation". Nature. 526 (7571): 68–74. Bibcode:2015Natur.526...68T. doi:10.1038/nature15393. ISSN   0028-0836. PMC   4750478 . PMID   26432245.
  6. Califano, A.; Rigoutsos, I. (1993). "FLASH: A fast look-up algorithm for string homology". Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Vol. 1. pp. 353–359. doi:10.1109/CVPR.1993.341106. ISBN   0-8186-3880-X. PMID   7584371. S2CID   2862905.
  7. Pevzner, P.A.; Waterman, M.S. (1995). "Multiple filtration and approximate pattern matching". Algorithmica. 13 (1–2): 135–154. doi:10.1007/BF01188584. S2CID   10243441.
  8. Burkhardt, Stefan; Kärkkäinen, Juha (June 2001). "Better Filtering with Gapped q-Grams". Combinatorial Pattern Matching. Lecture Notes in Computer Science. Vol. 2089. pp. 73–85. doi:10.1007/3-540-48194-X_6. ISBN   978-3-540-42271-6.
  9. 1 2 Li, Ming; Ma, Bin; Kisman, Derek; Tromp, John (2003). "PatternHunter II: Highly Sensitive and Fast Homology Search". Genome Informatics. 14: 164–175. doi:10.11234/gi1990.14.164. PMID   15706531.
  10. Gotea, Valer; Veeramachaneni, Vamsi; Makałowski, Wojciech (2003-12-01). "Mastering seeds for genomic size nucleotide BLAST searches". Nucleic Acids Research. 31 (23): 6935–6941. doi:10.1093/nar/gkg886. ISSN   0305-1048. PMC   290255 . PMID   14627826.
  11. Altschul, Stephen F.; Gish, Warren; Miller, Webb; Myers, Eugene W.; Lipman, David J. (15 May 1990). "Basic local alignment search tool". Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID   2231712. S2CID   14441902.
  12. Li, Heng (15 September 2018). "Minimap2: pairwise alignment for nucleotide sequences". Bioinformatics. 34 (18): 3094–3100. doi: 10.1093/bioinformatics/bty191 . PMC   6137996 . PMID   29750242.
  13. Langmead, B.; Salzberg, S. (2012). "Fast gapped-read alignment with Bowtie 2". Nature Methods. 9 (4): 357–359. doi:10.1038/nmeth.1923. PMC   3322381 . PMID   22388286.
  14. Choi, Kwok Pui; Zhang, Louxin (2004-02-01). "Sensitivity analysis and efficient method for identifying optimal spaced seeds". Journal of Computer and System Sciences. 68 (1): 22–40. doi: 10.1016/j.jcss.2003.04.002 . ISSN   0022-0000.
  15. Kong, Yong (2007-03-01). "Generalized Correlation Functions and Their Applications in Selection of Optimal Multiple Spaced Seeds for Homology Search". Journal of Computational Biology. 14 (2): 238–254. doi:10.1089/cmb.2006.0008. PMID   17456017.
  16. 1 2 Ilie, Lucian; Ilie, Silvana (2007-11-15). "Multiple spaced seeds for homology search". Bioinformatics. 23 (22): 2969–2977. doi: 10.1093/bioinformatics/btm422 . ISSN   1367-4803. PMID   17804438.