Bowtie (sequence analysis)

Bowtie
Original author(s)	Ben Langmead; Cole Trapnell; Mihai Pop; Steven Salzberg
Developer(s)	Ben Langmead et al.
Stable release
Bowtie	1.3.0 / July 22, 2020
Bowtie 2	2.4.2 / October 5, 2020
Repository	Bowtie: github.com/BenLangmead/bowtie; Bowtie 2: github.com/BenLangmead/bowtie2
Written in	C++
Operating system	Linux; macOS; Windows
Type	Bioinformatics
Website	www.bowtie-bio.sourceforge.net

Last updated June 09, 2023

Bowtie is a software package commonly used for sequence alignment and sequence analysis in bioinformatics.^[3] Its source code is freely distributed, and compiled binaries are available for Linux, macOS and Windows. As of 2017, the Genome Biology paper describing the original Bowtie method had been cited more than 11,000 times.^[3] Bowtie is open-source software and is currently maintained by Johns Hopkins University.

History

The Bowtie sequence aligner was originally developed by Ben Langmead et al. at the University of Maryland in 2009.^[3] The aligner is typically used with short reads and a large reference genome, or for whole-genome analysis. Bowtie is promoted as "an ultrafast, memory-efficient short read aligner" for short DNA sequences. Bowtie's speed is partly due to its use of the Burrows–Wheeler transform for alignment,^[4] which also reduces the memory footprint (typically to around 2.2 GB for the human genome);^[5] a similar method is used by the BWA^[6] and SOAP2^[7] aligners.^[5]

Bowtie conducts a quality-aware, greedy, randomized, depth-first search through the space of possible alignments. Because the search is greedy, the first valid alignment encountered by Bowtie will not necessarily be the 'best' in terms of the number of mismatches or in terms of quality.

Bowtie is used as a sequence aligner by a number of related bioinformatics tools, including TopHat,^[8] Cufflinks^[9] and the CummeRbund Bioconductor package.^[10]

Bowtie 2

On 16 October 2011, the developers released a beta fork of the project called Bowtie 2.^[11] Alongside the Burrows–Wheeler transform, Bowtie 2 uses an FM-index (similar to a suffix array) to keep its memory footprint small. Due to its implementation, Bowtie 2 is better suited than the original Bowtie method to finding longer, gapped alignments. Bowtie 2 places no upper limit on read length, and it allows alignments to overlap ambiguous characters in the reference.
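The way an FM-index answers exact-match queries can be sketched in a few lines of Python. This is a minimal illustration, not Bowtie's or Bowtie 2's implementation: the function names `build_fm_index` and `find_exact` are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored uncompressed, whereas production aligners use sampled and compressed structures.

```python
def build_fm_index(text, sentinel="$"):
    """Build a toy, uncompressed FM-index for text (illustrative only)."""
    s = text + sentinel
    # Suffix array via naive sorting; fine for short demo strings.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (wraps to the sentinel).
    bwt = "".join(s[i - 1] for i in sa)
    alphabet = sorted(set(s))
    # c_table[c]: number of characters in s strictly smaller than c.
    c_table, total = {}, 0
    for c in alphabet:
        c_table[c] = total
        total += s.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, c_table, occ


def find_exact(pattern, index):
    """Return sorted start positions of pattern, using backward search."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_fm_index("banana")
print(find_exact("ana", index))
```

Each pattern character narrows the range of matching suffix-array rows, so the search cost depends on the pattern length rather than on the length of the indexed text.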

Related Research Articles

The Burrows–Wheeler transform (BWT) rearranges a character string into runs of similar characters. This is useful for compression, since a string with runs of repeated characters tends to be easy to compress with techniques such as the move-to-front transform and run-length encoding. More importantly, the transformation is reversible without storing any additional data except the position of the first original character. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation. It is used to prepare data in compression tools such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto, California, and is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be implemented efficiently using a suffix array, thus reaching linear time complexity.
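The transform and its reversibility can be illustrated with a short Python sketch. It makes two simplifying assumptions that are not part of the description above: it appends a sentinel character `$` to mark the end of the string instead of storing the position of the first original character, and it sorts all rotations naively rather than using a suffix array, so it is only suitable for short strings. The function names are hypothetical.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (naive, for short text)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Invert the transform by repeatedly prepending the column and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


encoded = bwt("banana")
print(encoded)               # annb$aa
print(inverse_bwt(encoded))  # banana
```

In the output `annb$aa`, the repeated letters of `banana` are grouped into short runs, which is the property that later compression stages exploit.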

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data.
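As a concrete illustration of inserting gaps so that matching residues line up, the sketch below performs a global pairwise alignment with the classic Needleman–Wunsch dynamic-programming method. The scoring values (match +1, mismatch -1, gap -1) and the function name `global_align` are assumptions chosen for the example; practical tools use more elaborate scoring schemes.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (aligned_a, aligned_b, score) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align two residues
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[-1][-1]


row_a, row_b, total = global_align("GATTACA", "GATACA")
print(row_a)
print(row_b)
print(total)
```

The two returned strings have equal length and can be read as two rows of the alignment matrix, with `-` marking an inserted gap.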

T-Coffee is a multiple sequence alignment software using a progressive approach. It generates a library of pairwise alignments to guide the multiple sequence alignment. It can also combine multiple sequence alignments obtained previously, and in the latest versions it can use structural information from PDB files (3D-Coffee). It has advanced features to evaluate the quality of the alignments and some capacity for identifying occurrences of motifs (Mocca). It produces alignments in the aln format (Clustal) by default, but can also produce PIR, MSF, and FASTA formats. The most common input formats are supported.

Steven Lloyd Salzberg is an American computational biologist and computer scientist who is a Bloomberg Distinguished Professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University, where he is also Director of the Center for Computational Biology.

MUMmer is a bioinformatics software system for sequence alignment. It is based on the suffix tree data structure and is one of the fastest and most efficient systems available for this task, enabling it to be applied to very long sequences. It has been widely used for comparing different genomes to one another. In recent years, it has become a popular tool for comparing genome assemblies to one another, which allows scientists to determine how a genome has changed after adding more DNA sequence or after running a different genome assembly program. The name "MUMmer" comes from "Maximal Unique Matches", or MUMs. The original algorithms in the MUMmer software package were designed by Art Delcher, Simon Kasif and Steven Salzberg. MUMmer was the first whole-genome comparison system developed in bioinformatics. It was originally applied to the comparison of two related strains of bacteria.

UGENE is computer software for bioinformatics. It works on personal computer operating systems such as Windows, macOS, or Linux. It is released as free and open-source software, under a GNU General Public License (GPL) version 2.

RNA-Seq is a sequencing technique that uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample, representing an aggregated snapshot of the cells' dynamic pool of RNAs, also known as transcriptome.

SOAP is a suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment, and analysis of next generation DNA sequencing data. It is particularly suited to short read sequencing data.

Single-cell DNA template strand sequencing, or Strand-seq, is a technique for the selective sequencing of a daughter cell's parental template strands. This technique offers a wide variety of applications, including the identification of sister chromatid exchanges in the parental cell prior to segregation, the assessment of non-random segregation of sister chromatids, the identification of misoriented contigs in genome assemblies, de novo genome assembly of both haplotypes in diploid organisms including humans, whole-chromosome haplotyping, and the identification of germline and somatic genomic structural variation, the latter of which can be detected robustly even in single cells.

Lior Samuel Pachter is a computational biologist. He works at the California Institute of Technology, where he is the Bren Professor of Computational Biology. He has widely varied research interests including genomics, combinatorics, computational geometry, machine learning, scientific computing, and statistics.

TopHat is an open-source bioinformatics tool for the high-throughput alignment of shotgun cDNA sequencing reads generated by transcriptomics technologies. It uses Bowtie first and then maps reads to a reference genome to discover RNA splice sites de novo. TopHat aligns RNA-Seq reads to mammalian-sized genomes.

Third-generation sequencing is a class of DNA sequencing methods currently under active development.

Ben Langmead is a computational biologist and associate professor in the Computational Biology & Medicine Group at Johns Hopkins University.

Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.

Bruce Colston Trapnell Jr. is an assistant professor in the Department of Genome Sciences at the University of Washington. He was awarded the Overton Prize by the International Society for Computational Biology (ISCB) for “outstanding accomplishment in the early to mid stage of his career” in 2018.

Bloom filters are space-efficient probabilistic data structures used to test whether an element is part of a set. Bloom filters require much less space than other data structures for representing sets; the downside is that queries have a false positive rate. Since multiple elements may have the same hash values for a number of hash functions, querying for a non-existent element may return a positive if another element with the same hash values has been added to the Bloom filter. Assuming that the hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of a query is a function of the number of bits, the number of hash functions and the number of elements in the Bloom filter. This allows the user to manage the risk of getting a false positive by compromising on the space benefits of the Bloom filter.
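The dependence of the false positive rate on the number of bits m, hash functions k and stored elements n is commonly approximated as (1 - e^(-kn/m))^k under the uniform-hashing assumption stated above. The Python sketch below pairs that estimate with a toy Bloom filter. The class, its method names, and the choice of salted SHA-256 digests as hash functions are assumptions made for illustration, and one byte is used per bit for readability rather than space efficiency.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter; hash functions are salted SHA-256 digests (an assumption)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, item):
        # Derive num_hashes indices by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


def expected_false_positive_rate(num_bits, num_hashes, num_items):
    """Standard approximation (1 - exp(-k*n/m))**k under uniform hashing."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


kmers = BloomFilter(num_bits=1024, num_hashes=3)
kmers.add("ACGT")
print("ACGT" in kmers)
print(expected_false_positive_rate(1024, 3, 100))
```

Increasing `num_bits` for a fixed number of items lowers the estimated rate, which is the space-versus-accuracy trade-off described above.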

In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence and a method of approximate string matching that allows for substitutions. Spaced seeds are a straightforward modification to the earliest heuristic-based alignment efforts that allows for minor differences between the sequences of interest. They have been used in homology search, alignment, assembly, and metagenomics. They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant positions and dashes or asterisks for irrelevant positions.
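The zero-and-one representation translates directly into code. The following Python sketch, with hypothetical function names, checks whether two equal-length windows agree at every position marked `1` and then scans two sequences for such hits by brute force; real tools index the seed positions instead of comparing every pair of windows.

```python
def seed_matches(seed, window_a, window_b):
    """True if the windows agree at every position where seed has a '1'."""
    if not (len(seed) == len(window_a) == len(window_b)):
        raise ValueError("seed and both windows must have the same length")
    return all(x == y for s, x, y in zip(seed, window_a, window_b) if s == "1")


def find_seed_hits(seed, query, target):
    """Return (query_offset, target_offset) pairs where the seed matches."""
    k = len(seed)
    hits = []
    for i in range(len(query) - k + 1):
        for j in range(len(target) - k + 1):
            if seed_matches(seed, query[i:i + k], target[j:j + k]):
                hits.append((i, j))
    return hits


# The third position is irrelevant, so the G/T substitution is tolerated.
print(seed_matches("1101", "ACGT", "ACTT"))
# A contiguous seed of the same length rejects the same pair.
print(seed_matches("1111", "ACGT", "ACTT"))
print(find_seed_hits("101", "ACG", "TAGGA"))
```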

References

[1] "Bowtie: An ultrafast, memory-efficient short read aligner". bowtie-bio.sourceforge.net. Retrieved 2021-03-28.
[2] "Bowtie 2: fast and sensitive read alignment". bowtie-bio.sourceforge.net. Retrieved 2021-03-28.
[3] Langmead, Ben; Trapnell, Cole; Pop, Mihai; Salzberg, Steven L. (4 March 2009). "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome" (PDF). Genome Biology. 10 (3): R25. doi:10.1186/gb-2009-10-3-r25. PMC 2690996. PMID 19261174. Archived from the original (PDF) on 2012-10-20. Retrieved 29 November 2013.
[4] Ferragina, Paolo; Manzini, Giovanni. "Indexing compressed text". Journal of the ACM. 52 (4): 552–581. doi:10.1145/1082036.1082039.
[5] "Bowtie: An ultrafast, memory-efficient short read aligner - SourceForge". Retrieved 29 November 2013.
[6] Li, H.; Durbin, R. (18 May 2009). "Fast and accurate short read alignment with Burrows-Wheeler transform". Bioinformatics. 25 (14): 1754–1760. doi:10.1093/bioinformatics/btp324. PMC 2705234. PMID 19451168.
[7] Li, R.; Yu, C.; Li, Y.; Lam, T.-W.; Yiu, S.-M.; Kristiansen, K.; Wang, J. (3 June 2009). "SOAP2: an improved ultrafast tool for short read alignment". Bioinformatics. 25 (15): 1966–1967. doi:10.1093/bioinformatics/btp336. PMID 19497933.
[8] Trapnell, C.; Pachter, L.; Salzberg, S. L. (16 March 2009). "TopHat: discovering splice junctions with RNA-Seq". Bioinformatics. 25 (9): 1105–1111. doi:10.1093/bioinformatics/btp120. PMC 2672628. PMID 19289445.
[9] Trapnell, Cole; Roberts, Adam; Goff, Loyal; Pertea, Geo; Kim, Daehwan; Kelley, David R; Pimentel, Harold; Salzberg, Steven L; Rinn, John L; Pachter, Lior (1 March 2012). "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks". Nature Protocols. 7 (3): 562–578. doi:10.1038/nprot.2012.016. PMC 3334321. PMID 22383036.
[10] "CummeRbund - An R package for persistent storage, analysis, and visualization of RNA-Seq from cufflinks output". Retrieved 11 August 2015.
[11] Langmead, Ben; Salzberg, Steven L (4 March 2012). "Fast gapped-read alignment with Bowtie 2". Nature Methods. 9 (4): 357–359. doi:10.1038/nmeth.1923. PMC 3322381. PMID 22388286.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
