TopHat (bioinformatics)

Last updated November 07, 2023

TopHat is an open-source bioinformatics tool for the throughput alignment of shotgun cDNA sequencing reads generated by transcriptomics technologies (e.g. RNA-Seq) using Bowtie first and then mapping to a reference genome to discover RNA splice sites de novo.^[1] TopHat aligns RNA-Seq reads to mammalian-sized genomes.^[2]

History

TopHat was originally developed in 2009 by Cole Trapnell, Lior Pachter and Steven Salzberg at the Center for Bioinformatics and Computational Biology at the University of Maryland, College Park and at the Mathematics Department, UC Berkeley.^[1] TopHat2 was a collaborative effort of Daehwan Kim and Steven Salzberg, initially at the University of Maryland, College Park and later at the Center for Computational Biology at Johns Hopkins University. Kim re-wrote some of Trapnell's original TopHat code in C++ to make it much faster, and added many heuristics to improve its accuracy, in a collaboration with Cole Trapnell and others. Kim and Salzberg also developed TopHat-fusion which used transcriptome data to discover gene fusions in cancer tissues.^[3]

Uses

TopHat is used to align reads from an RNA-Seq experiment. It is a read-mapping algorithm and it aligns the reads to a reference genome. It is useful because it does not need to rely on known splice sites.^[1] TopHat can be used with the Tuxedo pipeline, and is frequently used with Bowtie.

Advantages/Disadvantages

Advantages

When TopHat first came out, it was faster than previous systems. It mapped more than 2.2 million reads per CPU hour. That speed allowed the user to process and entire RNA-Seq experiment in less than a day, even on a standard desktop computer.^[1] Tophat uses Bowtie in the beginning to analyze the reads, but then does more to analyze the reads that span exon-exon junctions. If you are using TopHat for RNA-Seq data, you will get more read aligned against the reference genome.^[4]

Another advantage for TopHat is that it does not need to rely on known splice sites when aligning reads to a reference genome.^[1]

Disadvantages

TopHat is in a low maintenance, low support stage, and contains software bugs that have spawned 3rd party post-processing software to correct.^[5] It has been superseded by HISAT2, which is more efficient and accurate and provides the same core functionality (spliced alignment of RNA-Seq reads).^[2]

Related Research Articles

Alternative splicing, or alternative RNA splicing, or differential splicing, is an alternative splicing process during gene expression that allows a single gene to code for multiple proteins. In this process, particular exons of a gene may be included within or excluded from the final, processed messenger RNA (mRNA) produced from that gene. This means the exons are joined in different combinations, leading to different (alternative) mRNA strands. Consequently, the proteins translated from alternatively spliced mRNAs usually contain differences in their amino acid sequence and, often, in their biological functions.

In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

Steven Lloyd Salzberg is an American computational biologist and computer scientist who is a Bloomberg Distinguished Professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University, where he is also Director of the Center for Computational Biology.

UGENE is computer software for bioinformatics. It works on personal computer operating systems such as Windows, macOS, or Linux. It is released as free and open-source software, under a GNU General Public License (GPL) version 2.

RNA-Seq is a sequencing technique that uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample, representing an aggregated snapshot of the cells' dynamic pool of RNAs, also known as transcriptome.

SOAP is a suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment, and analysis of next generation DNA sequencing data. It is particularly suited to short read sequencing data.

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

De novo transcriptome assembly is the de novo sequence assembly method of creating a transcriptome without the aid of a reference genome.

Chimeric RNA, sometimes referred to as a fusion transcript, is composed of exons from two or more different genes that have the potential to encode novel proteins. These mRNAs are different from those produced by conventional splicing as they are produced by two or more gene loci.

Bowtie is a software package commonly used for sequence alignment and sequence analysis in bioinformatics. The source code for the package is distributed freely and compiled binaries are available for Linux, macOS and Windows platforms. As of 2017, the Genome Biology paper describing the original Bowtie method has been cited more than 11,000 times. Bowtie is open-source software and is currently maintained by Johns Hopkins University.

Metatranscriptomics is the set of techniques used to study gene expression of microbes within natural environments, i.e., the metatranscriptome.

Lior Samuel Pachter is a computational biologist. He works at the California Institute of Technology, where he is the Bren Professor of Computational Biology. He has widely varied research interests including genomics, combinatorics, computational geometry, machine learning, scientific computing, and statistics.

In molecular phylogenetics, relationships among individuals are determined using character traits, such as DNA, RNA or protein, which may be obtained using a variety of sequencing technologies. High-throughput next-generation sequencing has become a popular technique in transcriptomics, which represent a snapshot of gene expression. In eukaryotes, making phylogenetic inferences using RNA is complicated by alternative splicing, which produces multiple transcripts from a single gene. As such, a variety of approaches may be used to improve phylogenetic inference using transcriptomic data obtained from RNA-Seq and processed using computational phylogenetics.

Perturb-seq refers to a high-throughput method of performing single cell RNA sequencing (scRNA-seq) on pooled genetic perturbation screens. Perturb-seq combines multiplexed CRISPR mediated gene inactivations with single cell RNA sequencing to assess comprehensive gene expression phenotypes for each perturbation. Inferring a gene’s function by applying genetic perturbations to knock down or knock out a gene and studying the resulting phenotype is known as reverse genetics. Perturb-seq is a reverse genetics approach that allows for the investigation of phenotypes at the level of the transcriptome, to elucidate gene functions in many cells, in a massively parallel fashion.

Third-generation sequencing is a class of DNA sequencing methods currently under active development.

Ben Langmead is a computational biologist and associate professor in the Computational Biology & Medicine Group at Johns Hopkins University.

Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.

Bruce Colston Trapnell Jr. is an assistant professor in the Department of Genome Sciences at the University of Washington. He was awarded the Overton Prize by the International Society for Computational Biology (ISCB) for “outstanding accomplishment in the early to mid stage of his career” in 2018.

References

1 2 3 4 5 Trapnell C, Pachter L, Salzberg SL (May 2009). "TopHat: discovering splice junctions with RNA-Seq". Bioinformatics. 25 (9): 1105–11. doi:10.1093/bioinformatics/btp120. PMC 2672628 . PMID 19289445.
1 2 "TopHat". ccb.jhu.edu. Retrieved 2018-04-17.
↑ Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (April 2013). "TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions". Genome Biology. 14 (4): R36. doi:10.1186/gb-2013-14-4-r36. PMC 4053844 . PMID 23618408.
↑ "Bowtie & Tophat". www.biostars.org. Retrieved 2018-04-24.
↑ Brueffer C, Saal LH (May 2016). "TopHat-Recondition: a post-processor for TopHat unmapped reads". BMC Bioinformatics. 17 (1): 199. doi:10.1186/s12859-016-1058-x. PMC 4855331 . PMID 27142976.

External links

TopHat page on Center for Computational Biology at JHU

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Trapnell_2009-1] 1 2 3 4 5 Trapnell C, Pachter L, Salzberg SL (May 2009). "TopHat: discovering splice junctions with RNA-Seq". Bioinformatics. 25 (9): 1105–11. doi:10.1093/bioinformatics/btp120. PMC 2672628 . PMID 19289445.

[:0-2] 1 2 "TopHat". ccb.jhu.edu. Retrieved 2018-04-17.

[tophat2-3] Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (April 2013). "TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions". Genome Biology. 14 (4): R36. doi:10.1186/gb-2013-14-4-r36. PMC 4053844 . PMID 23618408.

[4] "Bowtie & Tophat". www.biostars.org. Retrieved 2018-04-24.

[5] Brueffer C, Saal LH (May 2016). "TopHat-Recondition: a post-processor for TopHat unmapped reads". BMC Bioinformatics. 17 (1): 199. doi:10.1186/s12859-016-1058-x. PMC 4855331 . PMID 27142976.

[1]

[2]

[3]

[4]

[5]

v t e Bioinformatics
Databases	Sequence databases: GenBank, European Nucleotide Archive, DNA Data Bank of Japan and China National GeneBank Secondary databases: UniProt, database of protein sequences grouping together Swiss-Prot, TrEMBL and Protein Information Resource Other databases: BioNumbers, Protein Data Bank, Ensembl and InterPro Specialised genomic databases: BOLD, Saccharomyces Genome Database, FlyBase, VectorBase, WormBase, Rat Genome Database, PHI-base, Arabidopsis Information Resource, GISAID and Zebrafish Information Network
Software	BLAST Bowtie Clustal EMBOSS HMMER MUSCLE PANGOLIN SAMtools SOAP suite TopHat
Other	Server: ExPASy Ontology: Gene Ontology Rosalind (education platform)
Institutions	Broad Institute Computational Biology Department (CBD) Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI) Database Center for Life Science (DBCLS) DNA Data Bank of Japan (DDBJ) European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory (EMBL) Flatiron Institute J. Craig Venter Institute (JCVI) Max Planck Institute of Molecular Cell Biology and Genetics (MPI-CBG) US National Center for Biotechnology Information (NCBI) Japanese Institute of Genetics Netherlands Bioinformatics Centre (NBIC) Philippine Genome Center (PGC) Scripps Research Swiss Institute of Bioinformatics (SIB) Wellcome Sanger Institute Whitehead Institute
Organizations	African Society for Bioinformatics and Computational Biology (ASBCB) Australia Bioinformatics Resource (EMBL-AR) European Molecular Biology network (EMBnet) International Nucleotide Sequence Database Collaboration (INSDC) International Society for Biocuration (ISB) International Society for Computational Biology (ISCB) Student Council (ISCB-SC) Institute of Genomics and Integrative Biology (CSIR-IGIB) Japanese Society for Bioinformatics (JSBi)
Meetings	Basel Computational Biology Conference‎ ([BC²]) European Conference on Computational Biology (ECCB) Intelligent Systems for Molecular Biology (ISMB) International Conference on Bioinformatics (InCoB) International Conference on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB) ISCB Africa ASBCB Conference on Bioinformatics Pacific Symposium on Biocomputing (PSB) Research in Computational Molecular Biology (RECOMB)
File formats	CRAM format FASTA format FASTQ format NeXML format Nexus format Pileup format SAM format Stockholm format VCF format GFF format
Related topics	Computational biology List of biobanks List of biological databases Molecular phylogenetics Sequencing Sequence database Sequence alignment
Category Commons