List of RNA-Seq bioinformatics tools

RNA-Seq [1] [2] [3] is a technique [4] that allows transcriptome studies (see also Transcriptomics technologies) based on next-generation sequencing technologies. This technique is largely dependent on bioinformatics tools developed to support the different steps of the process. Listed here are some of the principal tools commonly employed, along with links to some important web resources.



Design is a fundamental step of any RNA-Seq experiment. Important questions, such as the sequencing depth/coverage and the number of biological or technical replicates, must be carefully considered. For a review of experimental design, see [5].

Quality control, trimming, error correction and pre-processing of data

Quality assessment of raw data [6] is the first step of the bioinformatics pipeline of RNA-Seq. Often, it is necessary to filter the data by removing low-quality sequences or bases (trimming), adapters, contaminants and overrepresented sequences, or by correcting errors, to assure a coherent final result.

Quality control

Improving the quality

Improving the quality of RNA-Seq data by correcting bias is a complex subject. [16] [17] Each RNA-Seq protocol introduces a specific type of bias, and each step of the process (such as the sequencing technology used) is liable to generate some sort of noise or error. Furthermore, even the species under investigation and the biological context of the samples can influence the results and introduce bias. Many sources of bias have already been reported: GC content and PCR enrichment, [18] [19] rRNA depletion, [20] errors produced during sequencing, [21] and priming of reverse transcription caused by random hexamers. [22]

Different tools were developed to attempt to solve each of the detected errors.

Trimming and adapters removal

  • AlienTrimmer [23] implements a very fast approach (based on k-mers) to trim low-quality base pairs and clip technical (alien) oligonucleotides from single- or paired-end sequencing reads in plain or gzip-compressed FASTQ files (for more details, see AlienTrimmer).
  • BBDuk multithreaded tool to trim adapters and filter or mask contaminants based on kmer-matching, allowing a hamming- or edit-distance, as well as degenerate bases. Also performs optimal quality-trimming and filtering, format conversion, contaminant concentration reporting, gc-filtering, length-filtering, entropy-filtering, chastity-filtering, and generates text histograms for most operations. Interconverts between fastq, fasta, sam, scarf, interleaved and 2-file paired, gzipped, bzipped, ASCII-33 and ASCII-64. Keeps pairs together. Open-source, written in pure Java; supports all platforms with no recompilation and no other dependencies.
  • clean_reads cleans NGS (Sanger, 454, Illumina and SOLiD) reads. It can trim bad quality regions, adaptors, vectors, and regular expressions. It also filters out the reads that do not meet minimum quality criteria based on the sequence length and the mean quality.
  • condetri [24] is a method for content dependent read trimming for Illumina data using quality scores of each base individually. It is independent from sequencing coverage and user interaction. The main focus of the implementation is on usability and to incorporate read trimming in next-generation sequencing data processing and analysis pipelines. It can process single-end and paired-end sequencing data of arbitrary length.
  • cutadapt [25] removes adapter sequences from next-generation sequencing data (Illumina, SOLiD and 454). It is used especially when the read length of the sequencing machine is longer than the sequenced molecule, like the microRNA case.
  • Deconseq detects and removes contamination from sequence data.
  • Erne-Filter [26] is part of ERNE, a short string alignment package whose goal is to provide an all-inclusive set of tools to handle short (NGS-like) reads. ERNE comprises ERNE-FILTER (read trimming and contamination filtering), ERNE-MAP (core alignment tool/algorithm), ERNE-BS5 (bisulfite treated reads aligner), and ERNE-PMAP/ERNE-PBS5 (distributed versions of the aligners).
  • FastqMcf Fastq-mcf attempts to: Detect & remove sequencing adapters and primers; Detect limited skewing at the ends of reads and clip; Detect poor quality at the ends of reads and clip; Detect Ns, and remove from ends; Remove reads with CASAVA 'Y' flag (purity filtering); Discard sequences that are too short after all of the above; Keep multiple mate-reads in sync while doing all of the above.
  • fastp is a tool designed to provide all-in-one preprocessing for FastQ files. This tool is developed in C++ with multithreading supported.
  • fastq-trim is a very fast and memory-efficient tool written in C, following object-oriented design and strict code-quality practices. It runs in a fraction of the time of most popular trimming tools, while using only a few megabytes of RAM. It can be easily extended to support additional alignment algorithms. The default algorithm is similar to that used by cutadapt, and the results produced are nearly identical.
  • FASTX-Toolkit is a set of command line tools to manipulate reads in FASTA or FASTQ format files. These commands make it possible to preprocess the files before mapping with tools like Bowtie. Some of the tasks allowed are: conversion from FASTQ to FASTA format, reporting of quality statistics, removal of sequencing adapters, filtering and cutting of sequences based on quality, and DNA/RNA conversion.
  • Flexbar performs removal of adapter sequences, trimming and filtering features.
  • FreClu improves overall alignment accuracy performing sequencing-error correction by trimming short reads, based on a clustering methodology.
  • htSeqTools is a Bioconductor package able to perform quality control, processing of data and visualization. htSeqTools makes it possible to visualize sample correlations, remove over-amplification artifacts, assess enrichment efficiency, correct strand bias and visualize hits.
  • NxTrim Adapter trimming and virtual library creation routine for Illumina Nextera Mate Pair libraries.
  • PRINSEQ [27] generates statistics of your sequence data for sequence length, GC content, quality scores, n-plicates, complexity, tag sequences, poly-A/T tails, odds ratios. Filter the data, reformat and trim sequences.
  • Sabre A barcode demultiplexing and trimming tool for FastQ files.
  • Scythe A 3'-end adapter contaminant trimmer.
  • SEECER is a sequencing error correction algorithm for RNA-seq data sets. It takes the raw read sequences produced by a next generation sequencing platform like machines from Illumina or Roche. SEECER removes mismatch and indel errors from the raw reads and significantly improves downstream analysis of the data. Especially if the RNA-Seq data is used to produce a de novo transcriptome assembly, running SEECER can have tremendous impact on the quality of the assembly.
  • Sickle A windowed adaptive trimming tool for FASTQ files using quality.
  • SnoWhite [28] is a pipeline designed to flexibly and aggressively clean sequence reads (gDNA or cDNA) prior to assembly. It takes in and returns fastq or fasta formatted sequence files.
  • ShortRead is a package provided in the R (programming language) / BioConductor environments and allows input, manipulation, quality assessment and output of next-generation sequencing data. This tool makes possible manipulation of data, such as filter solutions to remove reads based on predefined criteria. ShortRead could be complemented with several Bioconductor packages to further analysis and visualization solutions (BioStrings, BSgenome, IRanges, and so on).
  • SortMeRNA is a program tool for filtering, mapping and OTU-picking NGS reads in metatranscriptomic and metagenomic data. The core algorithm is based on approximate seeds and allows for analyses of nucleotide sequences. The main application of SortMeRNA is filtering ribosomal RNA from metatranscriptomic data.
  • TagCleaner The TagCleaner tool can be used to automatically detect and efficiently remove tag sequences (e.g. WTA tags) from genomic and metagenomic datasets. It is easily configurable and provides a user-friendly interface.
  • Trimmomatic [29] performs trimming for Illumina platforms and works with FASTQ reads (single or pair-ended). Some of the tasks executed are: cut adapters, cut bases in optional positions based on quality thresholds, cut reads to a specific length, converts quality scores to Phred-33/64.
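Several of the tools above trim reads using a quality window. The following is a minimal sketch of that general idea, not the implementation of Sickle, Trimmomatic or any other listed tool. It assumes Phred+33 encoded qualities; the function names, the window size and the threshold are illustrative choices.

```python
def phred33_scores(quality: str) -> list:
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality]


def sliding_window_trim(seq: str, qual: str, window: int = 4, min_mean_q: float = 20.0):
    """Cut the read at the start of the first window whose mean quality
    falls below min_mean_q; everything from that point on is discarded.

    Reads shorter than the window are returned unchanged (an arbitrary
    choice made for this sketch).
    """
    scores = phred33_scores(qual)
    for start in range(0, len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean_q:
            return seq[:start], qual[:start]
    return seq, qual


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
    print(sliding_window_trim("ACGTACGTAC", "IIIIII####"))
```

A real trimmer would additionally handle adapters, paired reads and minimum-length filtering, which are deliberately left out here.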

Detection of chimeric reads

Recent sequencing technologies normally require DNA samples to be amplified via polymerase chain reaction (PCR). Amplification often generates chimeric elements (especially of ribosomal origin), that is, sequences formed from two or more original sequences joined together.

  • UCHIME is an algorithm for detecting chimeric sequences.
  • ChimeraSlayer is a chimeric sequence detection utility, compatible with near-full length Sanger sequences and shorter 454-FLX sequences (~500 bp).

Error correction

High-throughput sequencing errors characterization and their eventual correction. [30]

  • Acacia Error-corrector for pyrosequenced amplicon reads.
  • AllPathsLG error correction.
  • AmpliconNoise [31] is a collection of programs for the removal of noise from 454 sequenced PCR amplicons. It involves two steps: the removal of noise from the sequencing itself and the removal of PCR point errors. This project also includes the Perseus algorithm for chimera removal.
  • BayesHammer. Bayesian clustering for error correction. This algorithm is based on Hamming graphs and Bayesian subclustering. While BAYES HAMMER was designed for single-cell sequencing, it also improves on existing error correction tools for bulk sequencing data.
  • Bless [32] A bloom filter-based error correction solution for high-throughput sequencing reads.
  • Blue [33] Blue is a short-read error-correction tool based on k-mer consensus and context.
  • BFC A sequencing error corrector designed for Illumina short reads. It uses a non-greedy algorithm with a speed comparable to implementations based on greedy methods.
  • Denoiser Denoiser is designed to address issues of noise in pyrosequencing data. Denoiser is a heuristic variant of PyroNoise. Developers of denoiser report a good agreement with PyroNoise on several test datasets.
  • Echo A reference-free short-read error correction algorithm.
  • Lighter. A sequencing error correction without counting.
  • LSC uses short Illumina reads to correct errors in long reads.
  • Karect Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data.
  • NoDe NoDe: an error-correction algorithm for pyrosequencing amplicon reads.
  • PyroTagger PyroTagger: A fast, accurate pipeline for analysis of rRNA amplicon pyrosequence data.
  • Quake is a tool to correct substitution sequencing errors in experiments with deep coverage for Illumina sequencing reads.
  • QuorUM: An Error Corrector for Illumina Reads.
  • Rcorrector. Error correction for Illumina RNA-seq reads.
  • Reptile is a software developed in C++ for correcting sequencing errors in short reads from next-gen sequencing platforms.
  • Seecer SEquencing Error CorrEction for Rna reads.
  • SGA
  • SOAPdenovo

Bias correction

  • Alpine [34] Modeling and correcting fragment sequence bias for RNA-seq.
  • cqn [35] is a normalization tool for RNA-Seq data, implementing the conditional quantile normalization method.
  • EDASeq [36] is a Bioconductor package to perform GC-Content Normalization for RNA-Seq Data.
  • GeneScissors A comprehensive approach to detecting and correcting spurious transcriptome inference due to RNAseq reads misalignment.
  • Peer [37] is a collection of Bayesian approaches to infer hidden determinants and their effects from gene expression profiles using factor analysis methods. Applications of PEER have: a) detected batch effects and experimental confounders, b) increased the number of expression QTL findings by threefold, c) allowed inference of intermediate cellular traits, such as transcription factor or pathway activations.
  • RUV [38] is a R package that implements the remove unwanted variation (RUV) methods of Risso et al. (2014) for the normalization of RNA-Seq read counts between samples.
  • sva Surrogate Variable Analysis.
  • svaseq removing batch effects and other unwanted noise from sequencing data.
  • SysCall [39] is a classifier tool for the identification and correction of systematic error in high-throughput sequence data.

Other tasks/pre-processing data

Further tasks performed before alignment, namely paired-read mergers.

  • AuPairWise A Method to Estimate RNA-Seq Replicability through Co-expression.
  • BamHash is a checksum based method to ensure that the read pairs in FASTQ files match exactly the read pairs stored in BAM files, regardless of the ordering of reads. BamHash can be used to verify the integrity of the files stored and discover any discrepancies. Thus, BamHash can be used to determine if it is safe to delete the FASTQ files storing raw sequencing reads after alignment, without the loss of data.
  • BBMerge Merges paired reads based on overlap to create longer reads, and an insert-size histogram. Fast, multithreaded, and yields extremely few false positives. Open-source, written in pure Java; supports all platforms with no recompilation and no other dependencies. Distributed with BBMap.
  • Biopieces are a collection of bioinformatics tools that can be pieced together in a very easy and flexible manner to perform both simple and complex tasks. The Biopieces work on a data stream in such a way that the data stream can be passed through several different Biopieces, each performing one specific task: modifying or adding records to the data stream, creating plots, or uploading data to databases and web services.
  • COPE [40] COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly.
  • DeconRNASeq is an R package for deconvolution of heterogeneous tissues based on mRNA-Seq data.
  • FastQ Screen screens FASTQ format sequences to a set of databases to confirm that the sequences contain what is expected (such as species content, adapters, vectors, etc.).
  • FLASH is a read pre-processing tool. FLASH combines paired-end reads which overlap and converts them to single long reads.
  • IDCheck
  • ORNA and ORNA Q/K A tool for reducing redundancy in RNA-seq data which reduces the computational resource requirements of an assembler
  • a program to align Illumina reads, optionally with PCR primers embedded in the sequence, and reconstruct an overlapping sequence.
  • PEAR [41] PEAR: Illumina Paired-End reAd mergeR.
  • qRNASeq script The qRNAseq tool can be used to accurately eliminate PCR duplicates from RNA-Seq data if Molecular Indexes™ or other stochastic labels have been used during library prep.
  • SHERA [42] a SHortread Error-Reducing Aligner.
  • XORRO Rapid Paired-End Read Overlapper.
  • DecontaMiner [43] detects contamination in RNA-Seq data.
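Several tools in this list merge overlapping paired-end reads into one longer read. The sketch below illustrates the basic principle only: the second read is reverse-complemented and the longest acceptable overlap between the end of read 1 and the start of that mate is searched for. It is not the algorithm of FLASH, PEAR, BBMerge or any other listed tool; the function names, the minimum overlap and the mismatch tolerance are assumptions, and base qualities are ignored.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence (upper-case A, C, G, T, N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10,
               max_mismatch_frac: float = 0.1) -> Optional[str]:
    """Merge a read pair if the 3' end of read1 overlaps the start of the
    reverse-complemented read2. Longer overlaps are tried first.
    Returns None when no acceptable overlap is found."""
    mate = reverse_complement(read2)
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail = read1[-overlap:]
        head = mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * overlap:
            # Keep read1 as-is and append the non-overlapping part of the mate.
            return read1 + mate[overlap:]
    return None


if __name__ == "__main__":
    fragment = "AACCGGTTACGTAGCTAGGATCCATGCAAT"
    r1 = fragment[:20]
    r2 = reverse_complement(fragment[10:])
    print(merge_pair(r1, r2) == fragment)
```

Production mergers typically use base qualities to resolve mismatches inside the overlap, which this sketch does not attempt.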

Alignment tools

After quality control, the first step of RNA-Seq analysis involves alignment of the sequenced reads to a reference genome (if available) or to a transcriptome database. See also List of sequence alignment software.

Short (unspliced) aligners

Short aligners are able to align continuous reads (not containing gaps resulting from splicing) to a reference genome. There are basically two types: 1) those based on the Burrows–Wheeler transform, such as Bowtie and BWA, and 2) those based on seed-extend methods, using the Needleman–Wunsch or Smith–Waterman algorithms. The first group (Bowtie and BWA) is many times faster; however, some tools of the second group tend to be more sensitive, generating more correctly aligned reads.
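To illustrate the first type, here is a minimal sketch of the Burrows–Wheeler transform and of exact pattern counting by backward search over it. It is a teaching example under strong simplifications (naive construction by sorting rotations, linear-time occurrence counting, exact matches only) and does not reflect how Bowtie or BWA are implemented; the function names are illustrative.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform of text, built naively from sorted rotations.
    '$' is appended as a terminator and must not occur in the text."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of pattern in the original text using
    backward search on the BWT."""
    # first_index[c] = number of characters in the text smaller than c.
    first_index = {}
    for i, ch in enumerate(sorted(bwt_str)):
        first_index.setdefault(ch, i)

    def occ(ch: str, i: int) -> int:
        # Occurrences of ch in bwt_str[:i]; real indexes precompute this.
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in first_index:
            return 0
        lo = first_index[ch] + occ(ch, lo)
        hi = first_index[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("ACGTACGA")
    print(count_occurrences(index, "ACG"))
```

Practical aligners add sampled suffix arrays to recover match positions and allow mismatches or gaps during the search, both of which are omitted here.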

Spliced aligners

Many reads span exon-exon junctions and cannot be aligned directly by short aligners; specific aligners, called spliced aligners, were therefore necessary. Some spliced aligners first employ short aligners to align the unspliced/continuous reads (exon-first approach), and then follow a different strategy to align the remaining reads, which contain spliced regions: normally these reads are split into smaller segments and mapped independently. See also [45] [46].

Aligners based on known splice junctions (annotation-guided aligners)

In this case the detection of splice junctions is based on data available in databases about known junctions. This type of tool cannot identify new splice junctions. Some of this data comes from other expression methods like expressed sequence tags (EST).

  • Erange is a tool for alignment and data quantification for mammalian transcriptomes.
  • IsoformEx
  • MapAL
  • OSA
  • RNA-MATE is a computational pipeline for alignment of data from Applied Biosystems SOLID system. Provides the possibility of quality control and trimming of reads. The genome alignments are performed using mapreads and the splice junctions are identified based on a library of known exon-junction sequences. This tool allows visualization of alignments and tag counting.
  • RUM performs alignment based on a pipeline, being able to manipulate reads with splice junctions, using Bowtie and Blat. The flowchart starts doing alignment against a genome and a transcriptome database executed by Bowtie. The next step is to perform alignment of unmapped sequences to the genome of reference using BLAT. In the final step all alignments are merged to get the final alignment. The input files can be in FASTA or FASTQ format. The output is presented in RUM and SAM format.
  • SAMMate
  • SpliceSeq
  • X-Mate

De novo splice aligners

De novo splice aligners allow the detection of new splice junctions without the need for previously annotated information (some of these tools accept annotation as a supplementary option).

  • ABMapper
  • BBMap Uses short kmers to align reads directly to the genome (spanning introns to find novel isoforms) or transcriptome. Highly tolerant of substitution errors and indels, and very fast. Supports output of all SAM tags needed by Cufflinks. No limit to genome size or number of splices per read. Supports Illumina, 454, Sanger, Ion Torrent, PacBio, and Oxford Nanopore reads, paired or single-ended. Does not use any splice-site-finding heuristics optimized for a single taxonomic branch, but rather finds optimally-scoring multi-affine-transform global alignments, and thus is ideal for studying new organisms with no annotation and unknown splice motifs. Open-source, written in pure Java; supports all platforms with no recompilation and no other dependencies.
  • ContextMap was developed to overcome some limitations of other mapping approaches, such as resolution of ambiguities. The central idea of this tool is to consider reads in gene expression context, improving this way alignment accuracy. ContextMap can be used as a stand-alone program and supported by mappers producing a SAM file in the output (e.g.: TopHat or MapSplice). In stand-alone mode aligns reads to a genome, to a transcriptome database or both.
  • CRAC proposes a novel way of analyzing reads that integrates genomic locations and local coverage, and detects candidate mutations, indels, splice or fusion junctions in each single read. Importantly, CRAC improves its predictive performance when supplied with e.g. 200 nt reads and should fit future needs of read analyses.
  • GMAP A Genomic Mapping and Alignment Program for mRNA and EST Sequences.
  • HISAT is a spliced alignment program for mapping RNA-seq reads. In addition to one global FM-index that represents a whole genome, HISAT uses a large set of small FM-indexes that collectively cover the whole genome (each index represents a genomic region of ~64,000 bp and ~48,000 indexes are needed to cover the human genome). These small indexes (called local indexes) combined with several alignment strategies enable effective alignment of RNA-seq reads, in particular, reads spanning multiple exons. The memory footprint of HISAT is relatively low (~4.3GB for the human genome). HISAT was developed based on the Bowtie2 implementation to handle most of the operations on the FM-index.
  • HISAT2 is an alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Based on an extension of BWT for graphs [Sirén et al. 2014], its authors designed and implemented a graph FM-index (GFM), which they describe as an original approach and, to the best of their knowledge, its first implementation. In addition to using one global GFM index that represents a population of human genomes, HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome (each index representing a genomic region of 56 Kbp, with 55,000 indexes needed to cover the human population). These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. This new indexing scheme is called a Hierarchical Graph FM index (HGFM).
  • HMMSplicer can identify canonical and non-canonical splice junctions in short-reads. Firstly, unspliced reads are removed with Bowtie. After that, the remaining reads are one at a time divided in half, then each part is seeded against a genome and the exon borders are determined based on the Hidden Markov Model. A quality score is assigned to each junction, useful to detect false positive rates.
  • MapSplice
  • PALMapper
  • Pass [47] aligns gapped and ungapped reads, as well as bisulfite sequencing data. It includes the possibility to filter data before alignment (removal of adapters). Pass uses Needleman–Wunsch and Smith–Waterman algorithms, and performs alignment in 3 stages: scanning positions of seed sequences in the genome, testing the contiguous regions and finally refining the alignment.
  • PASSion
  • QPALMA predicts splice junctions supported on machine learning algorithms. In this case the training set is a set of spliced reads with quality information and already known alignments.
  • RASER: [48] reads aligner for SNPs and editing sites of RNA.
  • SeqSaw
  • SoapSplice A tool for genome-wide ab initio detection of splice junction sites from RNA-Seq, a method using new generation sequencing technologies to sequence the messenger RNA.
  • SpliceMap
  • SplitSeek
  • SuperSplat was developed to find all types of splice junctions. The algorithm splits each read into all possible two-chunk combinations in an iterative way, and alignment is attempted for each chunk. Output is in "Supersplat" format.
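The two-chunk splitting strategy mentioned in this list can be sketched as follows. This toy version tries every split point of a read, looks for exact matches of the left and right chunks in the genome, and reports pairs separated by a plausible intron. It is only an illustration of the idea, not the actual implementation of SuperSplat or any other listed aligner; the function name, the exact-match requirement and the size limits are assumptions.

```python
def find_spliced_alignments(read, genome, min_chunk=4, min_intron=2, max_intron=1000):
    """Return (left_start, donor_pos, acceptor_pos) tuples, where the left
    chunk matches genome[left_start:donor_pos] and the right chunk matches
    the genome starting at acceptor_pos, downstream of the left chunk."""
    hits = []
    for split in range(min_chunk, len(read) - min_chunk + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        while left_start != -1:
            left_end = left_start + len(left)
            right_start = genome.find(right, left_end + min_intron)
            while right_start != -1 and right_start - left_end <= max_intron:
                hits.append((left_start, left_end, right_start))
                right_start = genome.find(right, right_start + 1)
            left_start = genome.find(left, left_start + 1)
    return hits


if __name__ == "__main__":
    exon1 = "ATGGCCAAGT"
    intron = "GTAAGTCCCCCCTTTTAG"
    exon2 = "CTGACGTTCA"
    genome = exon1 + intron + exon2
    read = exon1[-6:] + exon2[:6]  # read spanning the exon-exon junction
    print(find_spliced_alignments(read, genome))
```

With real data the junction position is often ambiguous by a few bases, which is one reason practical tools score candidates, for example using splice-site motifs.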
De novo splice aligners that also use annotation optionally
  • MapNext
  • OLego
  • STAR is a tool that employs "sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure", and detects canonical and non-canonical splice junctions and chimeric-fusion sequences. It is already adapted to align long reads (third-generation sequencing technologies) and can reach speeds of 45 million paired reads per hour per processor. [49]
  • Subjunc [44] is a specialized version of Subread. It uses all mappable regions in an RNA-seq read to discover exons and exon-exon junctions. It uses the donor/acceptor signals to find the exact splicing locations. Subjunc yields full alignments for every RNA-seq read including exon-spanning reads, in addition to the discovered exon-exon junctions. Subjunc should be used for the purpose of junction detection and genomic variation detection in RNA-seq data.
  • TopHat [50] is designed to find de novo junctions. TopHat aligns reads in two steps. First, unspliced reads are aligned with Bowtie, and the aligned reads are then assembled with Maq, resulting in islands of sequences. Second, the splice junctions are determined based on the initially unmapped reads and the possible canonical donor and acceptor sites within the island sequences.
Other spliced aligners
  • G.Mo.R-Se is a method that uses RNA-Seq reads to build de novo gene models.

Evaluation of alignment tools

Normalization, quantitative analysis and differential expression

General tools

These tools perform normalization and calculate the abundance of each gene expressed in a sample. [51] RPKM, FPKM and TPM [52] are some of the units employed to quantify expression. Some software is also designed to study the variability of gene expression between samples (differential expression). Quantitative and differential studies are largely determined by the quality of the read alignment and the accuracy of isoform reconstruction. Several studies are available comparing differential expression methods. [53] [54] [55]
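As an illustration of these units, the sketch below computes RPKM and TPM from raw read counts and gene lengths using their usual definitions (reads per kilobase of transcript per million mapped reads, and transcripts per million). FPKM follows the same formula as RPKM with fragments counted instead of reads. The gene names, counts and lengths are made-up example values, and real tools typically use effective rather than annotated lengths.

```python
def rpkm(counts, lengths_bp):
    """RPKM = reads * 1e9 / (gene length in bp * total mapped reads)."""
    total_reads = sum(counts.values())
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total_reads)
        for gene, count in counts.items()
    }


def tpm(counts, lengths_bp):
    """TPM = (reads / length) scaled so that all values sum to one million."""
    rates = {gene: count / lengths_bp[gene] for gene, count in counts.items()}
    rate_sum = sum(rates.values())
    return {gene: rate / rate_sum * 1e6 for gene, rate in rates.items()}


if __name__ == "__main__":
    # Hypothetical counts and lengths, for illustration only.
    counts = {"geneA": 100, "geneB": 300}
    lengths_bp = {"geneA": 1000, "geneB": 3000}
    print(rpkm(counts, lengths_bp))
    print(tpm(counts, lengths_bp))
```

In this example geneB has three times the reads of geneA but is also three times longer, so both units assign the two genes the same value.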

Evaluation of quantification and differential expression

Multi-tool solutions

Transposable Element expression

Workbench (analysis pipeline / integrated solutions)

Commercial solutions

Open (free) source solutions

Alternative splicing analysis

General tools

Intron retention analysis

Differential isoform/transcript usage

Fusion genes/chimeras/translocation finders/structural variations

Genome rearrangements resulting from diseases like cancer can produce aberrant genetic modifications such as fusions or translocations. Identification of these modifications plays an important role in carcinogenesis studies. [85]

Copy number variation identification

Single cell RNA-Seq

See also Single cell sequencing. The traditional RNA-Seq methodology is commonly known as "bulk RNA-Seq": in this case RNA is extracted from a group of cells or tissues, not from an individual cell as in single cell methods. Some tools available for bulk RNA-Seq are also applied to single cell analysis; however, new algorithms were developed to address the specificities of this technique.

Integrated Packages

Quality Control and Gene Filtering

Data cleaning and denoising


Dimension Reduction

Differential Expression


RNA-Seq simulators

These simulators generate in silico reads and are useful tools to compare and test the efficiency of algorithms developed to handle RNA-Seq data. Moreover, some of them make it possible to analyse and model RNA-Seq protocols.
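A minimal sketch of what such a simulator does is given below: reads are sampled uniformly from a transcript and random substitution errors are introduced at a fixed rate. Real simulators model far more (expression levels, fragment sizes, positional and sequence-specific biases, quality scores); the function name, the uniform sampling and the error model here are simplifying assumptions.

```python
import random


def simulate_reads(transcript, n_reads, read_len, error_rate=0.01, seed=0):
    """Sample n_reads single-end reads of length read_len from transcript.
    Each base is replaced by a different base with probability error_rate.
    Returns a list of (start_position, read_sequence) tuples, so the true
    origin of every read is known."""
    if read_len > len(transcript):
        raise ValueError("read_len must not exceed the transcript length")
    rng = random.Random(seed)  # fixed seed makes the simulation reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(transcript) - read_len + 1)
        bases = list(transcript[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads


if __name__ == "__main__":
    transcript = "ATGGCCAAGTCTGACGTTCAGGATCCATGCAATTGACC"
    for start, read in simulate_reads(transcript, n_reads=3, read_len=12):
        print(start, read)
```

Because the true origin of each read is recorded, the output can serve as ground truth when checking an aligner or an error corrector.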

Transcriptome assemblers

The transcriptome is the total population of RNAs expressed in one cell or group of cells, including non-coding and protein-coding RNAs. There are two types of approaches to assemble transcriptomes. Genome-guided methods use a reference genome (if possible a finished, high-quality genome) as a template to align reads and assemble them into transcripts. Genome-independent methods do not require a reference genome and are normally used when a genome is not available. In this case reads are assembled directly into transcripts.

Genome-guided assemblers

Genome-independent (de novo) assemblers

Assembly evaluation tools

Co-expression networks

miRNA prediction and analysis

Visualization tools

Functional, network and pathway analysis tools

Further annotation tools for RNA-Seq data

Compression tools

RNA-Seq databases

Single species' RNA-Seq databases

