De novo transcriptome assembly

De novo transcriptome assembly is the de novo sequence assembly method of creating a transcriptome without the aid of a reference genome.

Introduction

As a result of the development of novel sequencing technologies, the years between 2008 and 2012 saw a large drop in the cost of sequencing. The cost per megabase dropped to 1/100,000th, and the cost per genome to 1/10,000th, of the earlier price. [1] Prior to this, only transcriptomes of organisms that were of broad interest and utility to scientific research were sequenced; however, the high-throughput sequencing (also called next-generation sequencing) technologies developed in the 2010s are both cost- and labor-effective, and the range of organisms studied via these methods is expanding. [2] Transcriptomes have subsequently been created for chickpea, [3] planarians, [4] Parhyale hawaiensis, [5] as well as the brains of the Nile crocodile, the corn snake, the bearded dragon, and the red-eared slider, to name just a few. [6]

Examining non-model organisms can provide novel insights into the mechanisms underlying the "diversity of fascinating morphological innovations" that have enabled the abundance of life on planet Earth. [7] In animals and plants, the "innovations" that cannot be examined in common model organisms include mimicry, mutualism, parasitism, and asexual reproduction. De novo transcriptome assembly is often the preferred method for studying non-model organisms, since it is cheaper and easier than building a genome, and reference-based methods are not possible without an existing genome. The transcriptomes of these organisms can thus reveal novel proteins and their isoforms that are implicated in such unique biological phenomena.

De novo vs. reference-based assembly

A set of assembled transcripts allows for initial gene expression studies. Prior to the development of transcriptome assembly computer programs, transcriptome data were analyzed primarily by mapping reads onto a reference genome. Though genome alignment is a robust way of characterizing transcript sequences, this method is disadvantaged by its inability to account for instances of structural alteration of mRNA transcripts, such as alternative splicing. [8] Since a genome contains the sum of all introns and exons that may be present in a transcript, spliced variants that do not align continuously along the genome may be discounted as actual protein isoforms. Even if a reference genome is available, de novo assembly should be performed, as it can recover transcripts that are transcribed from segments of the genome that are missing from the reference genome assembly. [9]

Transcriptome vs. genome assembly

Unlike genome sequence coverage levels, which can vary randomly as a result of repeat content in non-coding intron regions of DNA, transcriptome sequence coverage levels can be directly indicative of gene expression levels. These repeated sequences also create ambiguities in the formation of contigs in genome assembly, while ambiguities in transcriptome assembly contigs usually correspond to spliced isoforms, or minor variation among members of a gene family. [8] Genome assemblers cannot be used directly for transcriptome assembly for several reasons. First, genome sequencing depth is usually the same across a genome, but the depth of transcripts can vary. Second, both strands are always sequenced in genome sequencing, but RNA-seq can be strand-specific. Third, transcriptome assembly is more challenging because transcript variants from the same gene can share exons and are difficult to resolve unambiguously. [9]

Method

RNA-seq

Once RNA is extracted and purified from cells, it is sent to a high-throughput sequencing facility, where it is first reverse transcribed to create a cDNA library. This cDNA can then be fragmented into various lengths depending on the platform used for sequencing. Each of the following platforms utilizes a different type of technology to sequence millions of short reads: 454 Sequencing, Illumina, and SOLiD.

Assembly algorithms

The cDNA sequence reads are assembled into transcripts via a short read transcript assembly program. Most likely, some amino acid variations among transcripts that are otherwise similar reflect different protein isoforms. It is also possible that they represent different genes within the same gene family, or even genes that share only a conserved domain, depending on the degree of variation.

A number of assembly programs are available (see Assemblers). Although these programs have been generally successful in assembling genomes, transcriptome assembly presents some unique challenges. Whereas high sequence coverage in a genome may indicate the presence of repetitive sequences (which are therefore masked), in a transcriptome it may indicate transcript abundance. In addition, unlike genome sequencing, transcriptome sequencing can be strand-specific, due to the possibility of both sense and antisense transcripts. Finally, it can be difficult to reconstruct and tease apart all splicing isoforms. [9]

Short read assemblers generally use one of two basic algorithms: overlap graphs and de Bruijn graphs. [10] Overlap graphs are utilized by most assemblers designed for Sanger-sequenced reads. The overlaps between each pair of reads are computed and compiled into a graph, in which each node represents a single sequence read. This algorithm is more computationally intensive than de Bruijn graphs, and is most effective in assembling fewer reads with a high degree of overlap. [10] De Bruijn graph assemblers break reads into k-mers (usually 25-50 bp) and join k-mers that overlap by k-1 bases to create contigs. Because the k-mers are shorter than the reads, they can be hashed quickly, so the operations in de Bruijn graphs are generally less computationally intensive. [10]
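The de Bruijn graph idea described above can be illustrated with a minimal Python sketch. It is a toy example under simplifying assumptions (error-free reads, a single strand, no branching), not the method of any particular assembler; the function names `build_de_bruijn` and `extend_contig` and the three reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a graph mapping each (k-1)-mer to the set of (k-1)-mers that follow it.

    Every k-mer in a read contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor.

    Simplification: only out-degree is checked, so converging paths are not detected.
    """
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle instead of looping forever
            break
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping, error-free toy reads (hypothetical data).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ATG")
print(contig)
```

Real assemblers additionally handle sequencing errors, reverse complements, and branches in the graph; in a transcriptome, such branches often correspond to alternative isoforms.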

Functional annotation

Functional annotation of the assembled transcripts allows for insight into the particular molecular functions, cellular components, and biological processes in which the putative proteins are involved. Blast2GO (B2G) enables Gene Ontology (GO) based data mining to annotate sequence data for which no GO annotation is available yet. It is a research tool often employed in functional genomics research on non-model species. [11] It works by running BLAST searches of assembled contigs against a non-redundant protein database (at NCBI), then annotating them based on sequence similarity. GOanna is another GO annotation program, specific for animal and agricultural plant gene products, that works in a similar fashion. It is part of AgBase, a curated, publicly accessible suite of computational tools for GO annotation and analysis. [12] Following annotation, KEGG (Kyoto Encyclopedia of Genes and Genomes) enables visualization of metabolic pathways and molecular interaction networks captured in the transcriptome. [13]

In addition to being annotated for GO terms, contigs can also be screened for open reading frames (ORFs) in order to predict the amino acid sequence of proteins derived from these transcripts. Another approach is to annotate protein domains and determine the presence of gene families, rather than specific genes.
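As an illustration of ORF screening, the following Python sketch scans all six reading frames of a contig for ATG-to-stop ORFs and translates them with the standard genetic code. It is a simplified example rather than the procedure of any specific annotation tool: the function name `find_orfs`, the `min_aa` parameter, and the toy contig are assumptions, and ORFs that lack a stop codon are ignored.

```python
from itertools import product

# Standard genetic code, built in TCAG codon order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_aa=40):
    """Return (strand, frame, protein) for each complete ORF of at least min_aa residues.

    An ORF here starts at the first ATG after the previous stop and ends at the
    next in-frame stop codon. Codons with ambiguous bases translate to "X".
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            protein = []
            in_orf = False
            for i in range(frame, len(s) - 2, 3):
                aa = CODON_TABLE.get(s[i:i + 3], "X")
                if not in_orf:
                    if aa == "M":
                        in_orf = True
                        protein = ["M"]
                elif aa == "*":
                    if len(protein) >= min_aa:
                        orfs.append((strand, frame, "".join(protein)))
                    in_orf = False
                else:
                    protein.append(aa)
    return orfs


# Toy contig; min_aa is lowered so that the short example ORF is reported.
print(find_orfs("CCATGGCTTGGAAATAGCC", min_aa=3))
```

The default `min_aa=40` is only an illustrative threshold; an appropriate cutoff depends on the dataset and on the downstream analysis.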

Verification and quality control

Since a well-resolved reference genome is rarely available, the quality of computer-assembled contigs may be verified either by comparing the assembled sequences to the reads used to generate them (reference-free), or by aligning the sequences of conserved gene domains found in mRNA transcripts to transcriptomes or genomes of closely related species (reference-based). Tools such as Transrate [14] and DETONATE [15] allow statistical analysis of assembly quality by these methods. Another method is to design PCR primers for predicted transcripts, then attempt to amplify them from the cDNA library. Often, exceptionally short reads are filtered out. Short sequences (< 40 amino acids) are unlikely to represent functional proteins, as they are unable to fold independently and form hydrophobic cores. [16]

Complementary to these metrics, a quantitative assessment of the gene content may provide additional insights into the quality of the assembly. To perform this step, tools that model the expected gene space based on conserved genes, such as BUSCO, [17] can be used. For eukaryotes, CEGMA [18] may also be used, although it has not been officially supported since 2015. [19]

Assemblers

The following is a partial compendium of assembly software that has been used to generate transcriptomes, and has also been cited in scientific literature.

SeqMan NGen

SOAPdenovo-Trans

SOAPdenovo-Trans is a de novo transcriptome assembler derived from the SOAPdenovo2 framework, designed for assembling transcriptomes with alternative splicing and varying expression levels. Compared to SOAPdenovo2, the assembler provides a more comprehensive way to construct full-length transcript sets.

Velvet/Oases

The Velvet algorithm uses de Bruijn graphs to assemble transcripts. In simulations, Velvet can produce contigs up to 50-kb N50 length using prokaryotic data and 3-kb N50 in mammalian bacterial artificial chromosomes (BACs). [20] These preliminary transcripts are transferred to Oases, which uses paired end read and long read information to build transcript isoforms. [21]
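The N50 statistic quoted above is the contig length at which contigs of that length or longer contain at least half of the total assembled bases. A minimal Python sketch of the calculation follows; the function name `n50` and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 for an empty collection)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


print(n50([100, 200, 300, 400]))
```

For transcriptomes, N50 should be interpreted with care, since transcript lengths vary naturally and a higher N50 does not by itself indicate a better assembly.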

Trans-ABySS

ABySS is a parallel, paired-end sequence assembler. Trans-ABySS (Assembly By Short Sequences) is a software pipeline written in Python and Perl for analyzing ABySS-assembled transcriptome contigs. This pipeline can be applied to assemblies generated across a wide range of k values. It first reduces the dataset into smaller sets of non-redundant contigs, and identifies splicing events including exon-skipping, novel exons, retained introns, novel introns, and alternative splicing. The Trans-ABySS algorithms are also able to estimate gene expression levels, identify potential polyadenylation sites, as well as candidate gene-fusion events. [22]
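One simple way to reduce contigs pooled from several k values to a non-redundant set is to discard any contig that is wholly contained in a longer one, on either strand. The Python sketch below illustrates that idea only; it is not the Trans-ABySS implementation, and the function name `remove_contained` and the example contigs are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def remove_contained(contigs):
    """Keep only contigs that are not substrings of a longer kept contig.

    Containment is checked on both strands. Exact duplicates are collapsed first.
    This is a quadratic-time toy approach, suitable only for small examples.
    """
    kept = []
    for contig in sorted(set(contigs), key=len, reverse=True):
        rc = reverse_complement(contig)
        if not any(contig in longer or rc in longer for longer in kept):
            kept.append(contig)
    return kept


# Contigs as they might be pooled from assemblies at different k values (toy data).
pooled = ["ATGGCGTGCAAT", "GGCGTG", "CACGCC", "TTTTAAAC", "GGCGTG"]
print(remove_contained(pooled))
```

Here "GGCGTG" is dropped because it occurs inside the longest contig, and "CACGCC" is dropped because its reverse complement does.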

Trinity

Trinity [23] first divides the sequence data into a number of de Bruijn graphs, each representing transcriptional variations at a single gene or locus. It then processes each graph separately, extracting full-length splicing isoforms and distinguishing transcripts derived from paralogous genes. Trinity consists of three independent software modules, Inchworm, Chrysalis, and Butterfly, which are used sequentially to produce transcripts.


References

  1. Wetterstrand, KA. "The Cost of Sequencing a Human Genome". Genome.gov. Retrieved 6 May 2021.
  2. Surget-Groba Y, Montoya-Burgos JI (2010). "Optimization of de novo transcriptome assembly from next-generation sequencing data". Genome Res. 20 (10): 1432–1440. doi:10.1101/gr.103846.109. PMC 2945192. PMID 20693479.
  3. Garg R, Patel RK, Tyagi AK, Jain M (2011). "De novo assembly of chickpea transcriptome using short reads for gene discovery and marker identification". DNA Res. 18 (1): 53–63. doi:10.1093/dnares/dsq028. PMC 3041503. PMID 21217129.
  4. Adamidi C; et al. (2011). "De novo assembly and validation of planaria transcriptome by massive parallel sequencing and shotgun proteomics". Genome Res. 21 (7): 1193–1200. doi:10.1101/gr.113779.110. PMC 3129261. PMID 21536722.
  5. Zeng V; et al. (2011). "De novo assembly and characterization of a maternal and developmental transcriptome for the emerging model crustacean Parhyale hawaiensis". BMC Genomics. 12: 581. doi:10.1186/1471-2164-12-581. PMC 3282834. PMID 22118449.
  6. Tzika AC; et al. (2011). "Reptilian transcriptome v1.0, a glimpse in the brain transcriptome of five divergent Sauropsida lineages and the phylogenetic position of turtles" (PDF). EvoDevo. 2 (1): 19. doi:10.1186/2041-9139-2-19. PMC 3192992. PMID 21943375.
  7. Rowan BA, Weigel D, Koenig D (2011). "Developmental genetics and new sequencing technologies: the rise of nonmodel organisms". Developmental Cell. 21 (1): 65–76. doi:10.1016/j.devcel.2011.05.021. PMID 21763609.
  8. Birol I; et al. (2009). "De novo transcriptome assembly with ABySS". Bioinformatics. 25 (21): 2872–2877. doi:10.1093/bioinformatics/btp367. PMID 19528083.
  9. Martin JA, Wang Z (2011). "Next-generation transcriptome assembly". Nature Reviews Genetics. 12 (10): 671–682. doi:10.1038/nrg3068. PMID 21897427. S2CID 3447321.
  10. Illumina, Inc. (2010). "De Novo Assembly Using Illumina Reads" (PDF).
  11. Conesa A; et al. (2005). "Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research". Bioinformatics. 21 (18): 3674–3676. doi:10.1093/bioinformatics/bti610. PMID 16081474.
  12. McCarthy FM; et al. (2006). "AgBase: a functional genomics resource for agriculture". BMC Genomics. 7: 229. doi:10.1186/1471-2164-7-229. PMC 1618847. PMID 16961921.
  13. "KEGG PATHWAY Database".
  14. "Transrate: understand your transcriptome assembly". http://hibberdlab.com/transrate
  15. Li B; et al. (2014). "Evaluation of de novo transcriptome assemblies from RNA-Seq data". Genome Biology. 15 (12): 553. doi:10.1186/s13059-014-0553-5. PMC 4298084. PMID 25608678.
  16. Karplus, K. "pdb-l: Minimum length of Protein Sequence". https://lists.sdsc.edu/pipermail/pdb-l/2011-January/005317.html
  17. Seppey M, Manni M, Zdobnov EM (2019). "BUSCO: Assessing Genome Assembly and Annotation Completeness". In Kollmar M (ed.). Gene Prediction. New York, NY: Springer New York. vol. 1962, pp. 227–245. doi:10.1007/978-1-4939-9173-0_14. ISBN 978-1-4939-9172-3. PMID 31020564. S2CID 131774987. Retrieved 2021-04-24.
  18. Parra G, Bradnam K, Korf I (2007). "CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes". Bioinformatics. 23 (9): 1061–1067. doi:10.1093/bioinformatics/btm071. ISSN 1367-4803. PMID 17332020.
  19. "CEGMA". korflab.ucdavis.edu. Retrieved 2021-04-24.
  20. Zerbino DR, Birney E (2008). "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs". Genome Res. 18 (5): 821–829. doi:10.1101/gr.074492.107. PMC 2336801. PMID 18349386.
  21. "Oases: de novo transcriptome assembler for very short reads". Archived from the original on 2018-11-29. Retrieved 2011-11-28.
  22. "Trans-ABySS: Analyze ABySS multi-k assembled shotgun transcriptome data".
  23. "Trinity". 2018-11-24.
  24. "Trinity RNA-Seq Assembly – software for the reconstruction of full-length transcripts and alternatively spliced isoforms". Archived from the original on July 12, 2011.