De novo sequence assemblers

Last updated

De novo sequence assemblers are a type of program that assembles short nucleotide sequences into longer ones without the use of a reference genome. These are most commonly used in bioinformatic studies to assemble genomes or transcriptomes. Two common types of de novo assemblers are greedy algorithm assemblers and De Bruijn graph assemblers.

Contents

Types of de novo assemblers

There are two types of algorithms that are commonly utilized by these assemblers: greedy, which aim for local optima, and graph method algorithms, which aim for global optima. Different assemblers are tailored for particular needs, such as the assembly of (small) bacterial genomes, (large) eukaryotic genomes, or transcriptomes.

Greedy algorithm assemblers are assemblers that find local optima in alignments of smaller reads. Greedy algorithm assemblers typically feature several steps: 1) pairwise distance calculation of reads, 2) clustering of reads with greatest overlap, 3) assembly of overlapping reads into larger contigs, and 4) repeat. These algorithms typically do not work well for larger read sets, as they do not easily reach a global optimum in the assembly, and do not perform well on read sets that contain repeat regions. [1] Early de novo sequence assemblers, such as SEQAID [2] (1984) and CAP [3] (1992), used greedy algorithms, such as overlap-layout-consensus (OLC) algorithms. These algorithms find overlap between all reads, use the overlap to determine a layout (or tiling) of the reads, and then produce a consensus sequence. Some programs that used OLC algorithms featured filtration (to remove read pairs that will not overlap) and heuristic methods to increase speed of the analyses.

Graph method assemblers [4] come in two varieties: string and De Bruijn. String graph and De Bruijn graph method assemblers were introduced at a DIMACS [5] workshop in 1994 by Waterman [6] and Gene Myers. [7] These methods represented an important step forward in sequence assembly, as they both use algorithms to reach a global optimum instead of a local optimum. While both of these methods made progress towards better assemblies, the De Bruijn graph method has become the most popular in the age of next-generation sequencing. During the assembly of the De Bruijn graph, reads are broken into smaller fragments of a specified size, k. The k-mers are then used as edges in the graph assembly. Nodes are built as (k-1)-mers connect by an edge. The assembler will then construct sequences based on the De Bruijn graph. De Bruijn graph assemblers typically perform better on larger read sets than greedy algorithm assemblers (especially when they contain repeat regions).

Commonly used programs

List of de-novo assemblers
NameDescription /

Methodology

TechnologiesAuthorPresented /

Last updated

Licence*Homepage
ABySSparallel, paired-end sequence assembler designed for large genome assembly of short reads (genomic and transcriptomic), employ a Bloom filter to De Bruijn graphIllumina [8] [9] 2009 / 2017OS link
CanuSmall and large, haploid/diploid genomesPacBio/Oxford Nanopore reads [10] 2001 / 2018OS link
CodonCode AlignerAssemble fragments using local, end-to-end, or large gap (cDNA to genomic) algorithms. Supports sequence qualities. Run Phred and Phrap directly.Sanger, Illumina, 454, SolexaCodonCode Corporation2003 / 2025C link
DISCOVARpaired-end PCR-free reads (successor of ALLPATHS-LG)Illumina (MiSeq or HiSeq 2500) [11] 2014OS link
DNA Baser Sequence AssemblerDNA sequence assembly with automatic end trimming & ambiguity correction. Includes a base caller.Sanger, IlluminaHeracle BioSoft SRL2018.09C ($69)NA
DNASTAR Lasergene GenomicsLarge genomes, exomes, transcriptomes, metagenomes, ESTs. Also de novo assembly and polishing of long read sequencing data from Oxford Nanopore and PacBio, including PacBio Hifi reads.Illumina, ABI SOLiD, Roche 454, Ion Torrent, Solexa, Sanger DNASTAR 2007 / 2023C link
FalconDiploid genomesPacBio reads [12] 2014 / 2017OS link
Flyegenomes and metagenomes. Makes use of repeat graphsPacBio/Oxford Nanopore reads [13] 2018/2023OS link
HGAPGenomes up to 130 MBPacBio reads [14] 2011 / 2015OS link
hifiasmGraph assemblyPacBio HiFi reads [15] 2021OS link
HingeSmall microbial genomesPacBio/Oxford Nanopore reads [16] 2016 / 2018OS link
MaSuRCAAny size, haploid/diploid genomesIllumina and PacBio/Oxford Nanopore data, legacy 454 and Sanger data [17] 2011 / 2018OS link
Newbler genomes, ESTs454, Sanger 454 Life Sciences 2004/2012C link
Phrap genomesSanger, 454, Solexa Green, P. 1994 / 2008C / NC-A link
PlassProtein-level assembler: assembles six-frame-translated sequencing reads into protein sequencesIllumina [18] 2018 / 2019OS link
Raya suite of assemblers including de novo, metagenomic, ontology and taxonomic profiling; uses a De Bruijn graph [19] 2010OS link
SPAdes (small) genomes, single-cellIllumina, Solexa, Sanger, 454, Ion Torrent, PacBio, Oxford Nanopore [20] 2012 / 2021OS link
Trinitytranscriptome assemblies by de Bruijn graphIllumina RNA-seq [21] 2011 link
Velvet (small) genomesSanger, 454, Solexa, SOLiD [22] 2007 / 2011OS link
SOAPdenovo2An empirically improved memory-efficient short-read de novo assemblerSolexa, Illumina, DNBSEQ [23] 2012GPLv3 link
*Licences: OS = Open Source; C = Commercial; C / NC-A = Commercial but free for non-commercial and academics

Different assemblers are designed for different type of read technologies. Reads from second generation technologies (called short read technologies) like Illumina are typically short (with lengths of the order of 50-200 base pairs) and have error rates of around 0.5-2%, with the errors chiefly being substitution errors. However, reads from third generation technologies like PacBio and fourth generation technologies like Oxford Nanopore (called long read technologies) are longer with read lengths typically in the thousands or tens of thousands and have much higher error rates of around 10-20% with errors being chiefly insertions and deletions. This necessitates different algorithms for assembly from short and long read technologies.

Assemblathon

There are numerous programs for de novo sequence assembly and many have been compared in the Assemblathon. The Assemblathon is a periodic, collaborative effort to test and improve the numerous assemblers available. Thus far, two assemblathons have been completed (2011 and 2013) and a third is in progress (as of April 2017). Teams of researchers from across the world choose a program and assemble simulated genomes (Assemblathon 1) and the genomes of model organisms whose that have been previously assembled and annotated (Assemblathon 2). The assemblies are then compared and evaluated using numerous metrics.

Assemblathon 1

Assemblathon 1 [24] was conducted in 2011 and featured 59 assemblies from 17 different groups and the organizers. The goal of this Assembalthon was to most accurately and completely assemble a genome that consisted of two haplotypes (each with three chromosomes of 76.3, 18.5, and 17.7 Mb, respectively) that was generated using Evolver. Numerous metrics were used to assess the assemblies, including: NG50 (point at which 50% of the total genome size is reached when scaffold lengths are summed from the longest to the shortest), LG50 (number of scaffolds that are greater than, or equal to, the N50 length), genome coverage, and substitution error rate.

Assemblathon 2

Assemblathon 2 [25] improved on Assemblathon 1 by incorporating the genomes of multiples vertebrates (a bird (Melopsittacus undulatus), a fish (Maylandia zebra), and a snake (Boa constrictor constrictor)) with genomes estimated to be 1.2, 1.0, and 1.6Gbp in length) and assessment by over 100 metrics. Each team was given four months to assemble their genome from Next-Generation Sequence (NGS) data, including Illumina and Roche 454 sequence data.

See also

References

  1. J. Bang-Jensen; G. Gutin; A. Yeo (2004). "When the greedy algorithm fails". Discrete Optimization. 1 (2): 121–127. doi: 10.1016/j.disopt.2004.03.007 .
  2. Peltola, Hannu; Söderlund, Hans; Ukkonen, Esko (1984-01-11). "SEQAID: a DNA sequence assembling program based on a mathematical model". Nucleic Acids Research. 12 (1Part1): 307–321. doi:10.1093/nar/12.1Part1.307. ISSN   0305-1048. PMC   321006 . PMID   6320092.
  3. Huang, Xiaoqiu (1992-09-01). "A contig assembly program based on sensitive detection of fragment overlaps". Genomics. 14 (1): 18–25. doi:10.1016/S0888-7543(05)80277-0. PMID   1427824.
  4. Compeau, Phillip EC; Pavel A. Pevzner; Glenn Tesler (2011). "How to apply de Bruijn graphs to genome assembly". Nature Biotechnology. 29 (11): 987–991. doi:10.1038/nbt.2023. PMC   5531759 . PMID   22068540.
  5. "DIMACS Workshop on Combinatorial Methods for DNA Mapping and Sequencing". October 1994.
  6. Idury, R. M.; Waterman, M. S. (1995-01-01). "A new algorithm for DNA sequence assembly". Journal of Computational Biology. 2 (2): 291–306. CiteSeerX   10.1.1.79.6459 . doi:10.1089/cmb.1995.2.291. ISSN   1066-5277. PMID   7497130.
  7. Myers, E. W. (1995-01-01). "Toward simplifying and accurately formulating fragment assembly". Journal of Computational Biology. 2 (2): 275–290. doi:10.1089/cmb.1995.2.275. ISSN   1066-5277. PMID   7497129.
  8. Simpson, Jared T.; et al. (2009). "ABySS: a parallel assembler for short read sequence data". Genome Research. 19 (6): 1117–1123. doi:10.1101/gr.089532.108. PMC   2694472 . PMID   19251739.
  9. Birol, Inanç; et al. (2009). "De novo transcriptome assembly with ABySS". Bioinformatics. 25 (21): 2872–2877. doi: 10.1093/bioinformatics/btp367 . PMID   19528083.
  10. Koren, Sergey, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, and Adam M. Phillippy. "Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation." Genome research 27, no. 5 (2017): 722-736. Available here
  11. Love, R. Rebecca; Weisenfeld, Neil I.; Jaffe, David B.; Besansky, Nora J.; Neafsey, Daniel E. (December 2016). "Evaluation of DISCOVAR de novo using a mosquito sample for cost-effective short-read genome assembly". BMC Genomics. 17 (1): 187. doi: 10.1186/s12864-016-2531-7 . ISSN   1471-2164. PMC   4779211 . PMID   26944054.
  12. Chin, Chen-Shan, Paul Peluso, Fritz J. Sedlazeck, Maria Nattestad, Gregory T. Concepcion, Alicia Clum, Christopher Dunn et al. "Phased diploid genome assembly with single-molecule real-time sequencing." Nature methods 13, no. 12 (2016): 1050-1054. Available here
  13. Kolmogorov, Mikhail; Yuan, Jeffrey; Lin, Yu; Pevzner, Pavel A. (2019-04-01). "Assembly of long, error-prone reads using repeat graphs" (PDF). Nature Biotechnology. 37 (5): 540–546. doi:10.1038/s41587-019-0072-8. ISSN   1087-0156. PMID   30936562. S2CID   89616540.
  14. Chin, Chen-Shan, David H. Alexander, Patrick Marks, Aaron A. Klammer, James Drake, Cheryl Heiner, Alicia Clum et al. "Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data." Nature methods 10, no. 6 (2013): 563-569. Available online
  15. Cheng, Haoyu; Concepcion, Gregory T.; Feng, Xiaowen; Zhang, Haowen; Li, Heng (February 2021). "Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm". Nature Methods. 18 (2): 170–175. arXiv: 2008.01237 . doi:10.1038/s41592-020-01056-5. ISSN   1548-7105. PMC   7961889 . PMID   33526886.
  16. Kamath, Govinda M., Ilan Shomorony, Fei Xia, Thomas A. Courtade, and N. Tse David. "HINGE: long-read assembly achieves optimal repeat resolution." Genome research 27, no. 5 (2017): 747-756. Available here
  17. Zimin, Aleksey V.; Marçais, Guillaume; Puiu, Daniela; Roberts, Michael; Salzberg, Steven L.; Yorke, James A. (November 2013). "The MaSuRCA genome assembler". Bioinformatics. 29 (21): 2669–2677. doi:10.1093/bioinformatics/btt476. ISSN   1367-4803. PMC   3799473 . PMID   23990416.
  18. Steinegger, Martin; Mirdita, Milot; Söding, Johannes (2019-06-24). "Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold" (PDF). Nature Methods. 16 (7): 603–606. doi: 10.1038/s41592-019-0437-4 . hdl:21.11116/0000-0003-E0DD-7. PMID   31235882.
  19. Boisvert, Sébastien; François Laviolette; Jacques Corbeil (2010). "Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies". Journal of Computational Biology. 17 (11): 1519–1533. doi:10.1089/cmb.2009.0238. PMC   3119603 . PMID   20958248.
  20. Bankevich, Anton; Nurk, Sergey; Antipov, Dmitry; Gurevich, Alexey A.; Dvorkin, Mikhail; Kulikov, Alexander S.; Lesin, Valery M.; Nikolenko, Sergey I.; Pham, Son; Prjibelski, Andrey D.; Pyshkin, Alexey V. (May 2012). "SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing". Journal of Computational Biology. 19 (5): 455–477. doi:10.1089/cmb.2012.0021. ISSN   1066-5277. PMC   3342519 . PMID   22506599.
  21. Grabherr, Manfred G.; et al. (2011). "Full-length transcriptome assembly from RNA-Seq data without a reference genome". Nature Biotechnology. 29 (7): 644–652. doi:10.1038/nbt.1883. PMC   3571712 . PMID   21572440.
  22. Zerbino, D. R.; Birney, E. (2008-02-21). "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs". Genome Research. 18 (5): 821–829. doi:10.1101/gr.074492.107. ISSN   1088-9051. PMC   2336801 . PMID   18349386.
  23. https://doi.org/10.1186/2047-217X-1-18.{{cite web}}: Missing or empty |title= (help)
  24. Earl, Dent; et al. (December 2011). "Assemblathon 1: A competitive assessment of de novo short read assembly methods". Genome Research. 21 (12): 2224–2241. doi:10.1101/gr.126599.111. PMC   3227110 . PMID   21926179.
  25. Bradnam, Keith R.; et al. (2013). "Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species". GigaScience. 2 (1): 10. arXiv: 1301.5406 . doi: 10.1186/2047-217X-2-10 . PMC   3844414 . PMID   23870653.