Bowtie (sequence analysis)

Bowtie
Original author(s): Ben Langmead, Cole Trapnell, Mihai Pop, Steven Salzberg
Developer(s): Ben Langmead et al.
Stable release:
  Bowtie: 1.3.0 / July 22, 2020 [1]
  Bowtie 2: 2.4.2 / October 5, 2020 [2]
Repository:
  Bowtie: github.com/BenLangmead/bowtie
  Bowtie 2: github.com/BenLangmead/bowtie2
Written in: C++
Operating system: Linux, macOS, Windows
Type: Bioinformatics
Website: www.bowtie-bio.sourceforge.net

Bowtie is a software package commonly used for sequence alignment and sequence analysis in bioinformatics. [3] The source code for the package is distributed freely and compiled binaries are available for Linux, macOS and Windows platforms. As of 2017, the Genome Biology paper describing the original Bowtie method has been cited more than 11,000 times. [3] Bowtie is open-source software and is currently maintained by Johns Hopkins University.


History

The Bowtie sequence aligner was originally developed by Ben Langmead et al. at the University of Maryland in 2009. [3] The aligner is typically used with short reads and a large reference genome, or for whole-genome analysis. Bowtie is promoted as "an ultrafast, memory-efficient short read aligner". Its speed is partly due to indexing the reference with the Burrows–Wheeler transform, [4] which also keeps the memory footprint small (typically around 2.2 GB for the human genome); [5] a similar method is used by the BWA [6] and SOAP2 [7] aligners. [5]

Bowtie conducts a quality-aware, greedy, randomized, depth-first search through the space of possible alignments. Because the search is greedy, the first valid alignment encountered by Bowtie will not necessarily be the 'best' in terms of the number of mismatches or in terms of quality.
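The effect of a greedy search can be seen in a toy Python sketch. It is not Bowtie's implementation: it ignores base qualities and randomization, tracks candidate positions in a plain set rather than in a Burrows–Wheeler index, and the names `first_alignment` and `best_alignment` are hypothetical. The search extends the read from right to left, always tries the read's own base before any substitution, and stops at the first alignment that fits the mismatch budget.

```python
def first_alignment(text, read, max_mismatches):
    """Greedy depth-first search; returns (start, mismatches) or None."""

    def dfs(remaining, starts, mismatches):
        # `starts` holds every text offset where the read suffix matched so
        # far (including the substitutions already chosen) begins.
        if remaining == 0:
            return min(starts), mismatches
        wanted = read[remaining - 1]
        # Greedy order: the read's own base first, then substitutions.
        for base in [wanted] + [b for b in "ACGT" if b != wanted]:
            cost = mismatches + (base != wanted)
            if cost > max_mismatches:
                continue
            extended = {s - 1 for s in starts if s > 0 and text[s - 1] == base}
            if extended:
                hit = dfs(remaining - 1, extended, cost)
                if hit is not None:
                    return hit  # stop at the first valid alignment
        return None

    return dfs(len(read), set(range(len(text) + 1)), 0)


def best_alignment(text, read):
    """Exhaustive scan for the start offset with the fewest mismatches."""
    scored = [
        (sum(a != b for a, b in zip(text[i:i + len(read)], read)), i)
        for i in range(len(text) - len(read) + 1)
    ]
    mismatches, start = min(scored)
    return start, mismatches
```

For the text `CCGGAAGT`, the read `AAGG` and a budget of two mismatches, the greedy search stops at offset 0 with two mismatches, although the exhaustive scan finds a one-mismatch alignment at offset 4.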

Bowtie is used as a sequence aligner by a number of related bioinformatics tools, including TopHat, [8] Cufflinks and the CummeRbund Bioconductor package. [9]

Bowtie 2

On 16 October 2011, the developers released a beta fork of the project called Bowtie 2. [10] In addition to the Burrows–Wheeler transform, Bowtie 2 uses an FM-index (similar to a suffix array) to keep its memory footprint small. Because of its implementation, Bowtie 2 is better suited than the original Bowtie to finding longer, gapped alignments. Bowtie 2 places no upper limit on read length and allows alignments to overlap ambiguous characters in the reference.
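How an FM-index answers exact-match queries can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Bowtie 2's code: the suffix array is built by naive sorting and stored in full, the occurrence table is unsampled, `$` is assumed to be absent from the text, and the names `build_fm_index` and `locate` are hypothetical.

```python
def build_fm_index(text):
    """Return (suffix array, first-column offsets, occurrence table)."""
    text += "$"  # sentinel that sorts before every other symbol
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    # first[c]: number of characters in the text that sort before c.
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] for c in first}
    for ch in bwt:
        for c in occ:
            occ[c].append(occ[c][-1] + (ch == c))
    return suffix_array, first, occ


def locate(index, pattern):
    """Backward search: sorted start offsets of exact matches of pattern."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)  # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])
```

Each pattern character narrows the row range with two table lookups, so the query time depends on the pattern length rather than on the size of the reference. Practical indexes save memory by sampling the suffix array and the occurrence table instead of storing them whole.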

Related Research Articles

The Burrows–Wheeler transform (BWT) rearranges a character string into runs of similar characters. This is useful for compression, since a string with runs of repeated characters tends to be easy to compress with techniques such as the move-to-front transform and run-length encoding. More importantly, the transformation is reversible without storing any additional data except the position of the first original character. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation, and it is used to prepare data for compressors such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at the DEC Systems Research Center in Palo Alto, California, and is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be implemented efficiently using a suffix array, thus reaching linear time complexity.
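A minimal Python sketch of the transform and its inverse follows. It uses the textbook sorted-rotations construction, which is far slower than the suffix-array approach mentioned above; it assumes a non-empty input, and the function names are illustrative. The only extra value kept is the row index of the original string among the sorted rotations, which plays the role of the stored position.

```python
def bwt(text):
    """Return (last column of the sorted rotations, row of the original)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    last_column = "".join(rotation[-1] for rotation in rotations)
    return last_column, rotations.index(text)


def inverse_bwt(last_column, original_row):
    """Rebuild the sorted rotation table one column at a time."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the last column to every row, then sort again.
        table = sorted(ch + row for ch, row in zip(last_column, table))
    return table[original_row]
```

For example, `bwt("banana")` returns `("nnbaaa", 3)`: the three `a` characters and the two `n` characters end up adjacent, which is what makes the output easier to compress, and `inverse_bwt("nnbaaa", 3)` recovers `"banana"`.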

Sequence alignment: Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data.
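The row-and-gap representation can be illustrated with a small global alignment routine in Python. This is a sketch of the classic Needleman–Wunsch dynamic programme, chosen here for illustration; the scoring values (match +1, mismatch -1, gap -1) and the name `global_align` are assumptions, and the routine is unrelated to how Bowtie aligns reads.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (aligned a, aligned b, score) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for placing a[i-1] and b[j-1] in the same column.
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align two residues
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[-1][-1]
```

Aligning `GATTACA` with `GATACA` yields two rows of equal length with a single gap inserted into the shorter sequence and a score of 5 (six matches, one gap). Which of the two `T` columns receives the gap depends on tie-breaking in the traceback.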

T-Coffee is a multiple sequence alignment program that uses a progressive approach. It generates a library of pairwise alignments to guide the multiple sequence alignment. It can also combine multiple sequence alignments obtained previously, and in the latest versions it can use structural information from PDB files (3D-Coffee). It has advanced features to evaluate the quality of the alignments and some capacity for identifying occurrences of motifs (Mocca). It produces alignments in the aln format (Clustal) by default, but can also produce PIR, MSF, and FASTA formats. The most common input formats are supported.

Steven Salzberg: American biologist and computer scientist

Steven Lloyd Salzberg is an American computational biologist and computer scientist who is a Bloomberg Distinguished Professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University, where he is also Director of the Center for Computational Biology.

MUMmer is a bioinformatics software system for sequence alignment. It is based on the suffix tree data structure and is one of the fastest and most efficient systems available for this task, enabling it to be applied to very long sequences. It has been widely used for comparing different genomes to one another. In recent years, it has become a popular tool for comparing genome assemblies to one another, which allows scientists to determine how a genome has changed after adding more DNA sequence or after running a different genome assembly program. The acronym "MUMmer" comes from "Maximal Unique Matches", or MUMs. The original algorithms in the MUMmer software package were designed by Art Delcher, Simon Kasif and Steven Salzberg. MUMmer was the first whole-genome comparison system developed in bioinformatics. It was originally applied to the comparison of two related strains of bacteria.

UGENE

UGENE is computer software for bioinformatics. It works on personal computer operating systems such as Windows, macOS, or Linux. It is released as free and open-source software, under a GNU General Public License (GPL) version 2.

RNA-Seq: Lab technique in cellular biology

RNA-Seq is a sequencing technique that uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample, representing an aggregated snapshot of the cells' dynamic pool of RNAs, also known as the transcriptome.

SOAP is a suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment, and analysis of next-generation DNA sequencing data. It is particularly suited to short-read sequencing data.

Single-cell DNA template strand sequencing, or Strand-seq, is a technique for the selective sequencing of a daughter cell's parental template strands. This technique offers a wide variety of applications, including the identification of sister chromatid exchanges in the parental cell prior to segregation, the assessment of non-random segregation of sister chromatids, the identification of misoriented contigs in genome assemblies, de novo genome assembly of both haplotypes in diploid organisms including humans, whole-chromosome haplotyping, and the identification of germline and somatic genomic structural variation, the latter of which can be detected robustly even in single cells.

Lior Pachter

Lior Samuel Pachter is a computational biologist. He works at the California Institute of Technology, where he is the Bren Professor of Computational Biology. He has widely varied research interests including genomics, combinatorics, computational geometry, machine learning, scientific computing, and statistics.

TopHat is an open-source bioinformatics tool for the high-throughput alignment of shotgun cDNA sequencing reads generated by transcriptomics technologies. It first uses Bowtie and then maps reads to a reference genome to discover RNA splice sites de novo. TopHat aligns RNA-Seq reads to mammalian-sized genomes.

Third-generation sequencing is a class of DNA sequencing methods currently under active development.

Ben Langmead is a computational biologist and associate professor in the Computational Biology & Medicine Group at Johns Hopkins University.

Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.

Cole Trapnell

Bruce Colston Trapnell Jr. is an assistant professor in the Department of Genome Sciences at the University of Washington. He was awarded the Overton Prize by the International Society for Computational Biology (ISCB) for “outstanding accomplishment in the early to mid stage of his career” in 2018.

Bloom filters are space-efficient probabilistic data structures used to test whether an element is a member of a set. They require much less space than other data structures for representing sets; the downside is that queries have a false positive rate. Since multiple elements may have the same hash values for a number of hash functions, querying for a non-existent element may return a positive if another element with the same hash values has been added to the Bloom filter. Assuming that each hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of a query is a function of the number of bits, the number of hash functions and the number of elements in the Bloom filter. This allows the user to manage the risk of getting a false positive by compromising on the space benefits of the Bloom filter.
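The trade-off can be sketched in Python. The class below is a minimal illustration rather than a production design: it spends one byte per bit for readability, derives its hash functions by salting SHA-256 with a counter (an arbitrary choice), and uses hypothetical names throughout. The rate function is the commonly quoted approximation (1 - e^(-kn/m))^k for m bits, k hash functions and n inserted elements, which assumes independent, uniformly distributed hashes.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _indexes(self, item):
        # One index per hash function; the counter acts as a salt.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for index in self._indexes(item):
            self.bits[index] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[index] for index in self._indexes(item))


def expected_false_positive_rate(num_bits, num_hashes, num_items):
    """Approximate false positive rate under the uniform-hash assumption."""
    return (1 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes
```

With 1,000 bits, 3 hash functions and 100 stored elements the approximation gives a rate of about 1.7%. Doubling the number of bits at the same load lowers the rate, which is the space-for-accuracy compromise described above.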

In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence, and a method of approximate string matching that allows for substitutions. Spaced seeds are a straightforward modification of the earliest heuristic-based alignment methods that allows for minor differences between the sequences of interest. They have been used in homology search, alignment, assembly, and metagenomics. They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant positions and dashes or asterisks for irrelevant ones.
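The zero-and-one representation translates directly into code. The following Python sketch compares every window of a query against every window of a target and reports the pairs that agree at all positions marked `1`. It is a brute-force illustration with hypothetical function names; practical tools index the seeds instead of scanning all pairs.

```python
def seed_matches(mask, left, right):
    """True if the two windows agree wherever the mask holds a '1'."""
    return all(m == "0" or a == b for m, a, b in zip(mask, left, right))


def spaced_seed_hits(mask, query, target):
    """Return (query offset, target offset) pairs that match under mask."""
    span = len(mask)
    hits = []
    for q in range(len(query) - span + 1):
        for t in range(len(target) - span + 1):
            if seed_matches(mask, query[q:q + span], target[t:t + span]):
                hits.append((q, t))
    return hits
```

With the mask `1101`, the query `ACGT` hits the target `TTACTTAC` at offset 2 (window `ACTT`) because the substituted third base falls on the irrelevant position. The contiguous mask `1111` finds no hit for the same pair of sequences.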

References

  1. "Bowtie: An ultrafast, memory-efficient short read aligner". bowtie-bio.sourceforge.net. Retrieved 2021-03-28.
  2. "Bowtie 2: fast and sensitive read alignment". bowtie-bio.sourceforge.net. Retrieved 2021-03-28.
  3. Langmead, Ben; Trapnell, Cole; Pop, Mihai; Salzberg, Steven L. (4 March 2009). "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome" (PDF). Genome Biology. 10 (3): R25. doi:10.1186/gb-2009-10-3-r25. PMC 2690996. PMID 19261174. Archived from the original (PDF) on 2012-10-20. Retrieved 29 November 2013.
  4. Ferragina, Paolo; Manzini, Giovanni. "Indexing compressed text". Journal of the ACM. 52 (4): 552–581. doi:10.1145/1082036.1082039.
  5. "Bowtie: An ultrafast, memory-efficient short read aligner - SourceForge". Retrieved 29 November 2013.
  6. Li, H.; Durbin, R. (18 May 2009). "Fast and accurate short read alignment with Burrows-Wheeler transform". Bioinformatics. 25 (14): 1754–1760. doi:10.1093/bioinformatics/btp324. PMC 2705234. PMID 19451168.
  7. Li, R.; Yu, C.; Li, Y.; Lam, T.-W.; Yiu, S.-M.; Kristiansen, K.; Wang, J. (3 June 2009). "SOAP2: an improved ultrafast tool for short read alignment". Bioinformatics. 25 (15): 1966–1967. doi:10.1093/bioinformatics/btp336. PMID 19497933.
  8. Trapnell, C.; Pachter, L.; Salzberg, S. L. (16 March 2009). "TopHat: discovering splice junctions with RNA-Seq". Bioinformatics. 25 (9): 1105–1111. doi:10.1093/bioinformatics/btp120. PMC 2672628. PMID 19289445.
  9. "CummeRbund - An R package for persistent storage, analysis, and visualization of RNA-Seq from cufflinks output". Retrieved 11 August 2015.
  10. Langmead, Ben; Salzberg, Steven L. (4 March 2012). "Fast gapped-read alignment with Bowtie 2". Nature Methods. 9 (4): 357–359. doi:10.1038/nmeth.1923. PMC 3322381. PMID 22388286.