De novo sequence assemblers

Last updated July 09, 2024

De novo sequence assemblers are a type of program that assembles short nucleotide sequences into longer ones without the use of a reference genome. These are most commonly used in bioinformatic studies to assemble genomes or transcriptomes. Two common types of de novo assemblers are greedy algorithm assemblers and De Bruijn graph assemblers.

Types of de novo assemblers

There are two types of algorithms that are commonly utilized by these assemblers: greedy, which aim for local optima, and graph method algorithms, which aim for global optima. Different assemblers are tailored for particular needs, such as the assembly of (small) bacterial genomes, (large) eukaryotic genomes, or transcriptomes.

Greedy algorithm assemblers find local optima in alignments of smaller reads. They typically proceed in several steps: 1) pairwise distance calculation between reads, 2) clustering of the reads with the greatest overlap, 3) assembly of overlapping reads into larger contigs, and 4) repetition of these steps until no further reads can be merged. These algorithms typically do not work well for larger read sets, as they do not easily reach a global optimum in the assembly, and they do not perform well on read sets that contain repeat regions.^[1] Early de novo sequence assemblers, such as SEQAID^[2] (1984) and CAP^[3] (1992), used greedy algorithms, such as overlap-layout-consensus (OLC) algorithms. These algorithms find overlaps between all reads, use the overlaps to determine a layout (or tiling) of the reads, and then produce a consensus sequence. Some programs that used OLC algorithms featured filtration (to remove read pairs that will not overlap) and heuristic methods to increase the speed of the analyses.
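
The greedy steps above can be illustrated with a minimal Python sketch. It is only an illustration under simplifying assumptions (error-free reads, exact suffix-prefix matches, a single strand), and the function names `overlap_length` and `greedy_assemble` and the `min_overlap` parameter are hypothetical; it is not the method of SEQAID, CAP, or any other specific assembler.

```python
from itertools import permutations


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_pair = 0, None
        # Step 1 and 2: score every ordered pair and keep the best overlap.
        for a, b in permutations(range(len(contigs)), 2):
            size = overlap_length(contigs[a], contigs[b], min_overlap)
            if size > best_size:
                best_size, best_pair = size, (a, b)
        if best_pair is None:
            break  # no remaining pair overlaps enough; stop merging
        # Step 3: merge the best pair into one longer contig.
        a, b = best_pair
        merged = contigs[a] + contigs[b][best_size:]
        contigs = [c for i, c in enumerate(contigs) if i not in (a, b)]
        contigs.append(merged)  # Step 4: repeat with the reduced set
    return contigs


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"]))
```

Because each merge is chosen only by the current best overlap, a repeat that is longer than the reads can cause the wrong pair to be joined, which is the weakness described above. The all-pairs comparison in every round also scales poorly with the number of reads.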

Graph method assemblers^[4] come in two varieties: string and De Bruijn. String graph and De Bruijn graph method assemblers were introduced at a DIMACS^[5] workshop in 1994 by Waterman^[6] and Gene Myers.^[7] These methods represented an important step forward in sequence assembly, as they both use algorithms to reach a global optimum instead of a local optimum. While both of these methods made progress towards better assemblies, the De Bruijn graph method has become the most popular in the age of next-generation sequencing. During the assembly of the De Bruijn graph, reads are broken into smaller fragments of a specified size, k. The resulting k-mers are used as the edges of the graph. The nodes are (k-1)-mers: each k-mer edge connects the node for its first k-1 bases to the node for its last k-1 bases. The assembler then constructs sequences by traversing the De Bruijn graph. De Bruijn graph assemblers typically perform better on larger read sets than greedy algorithm assemblers (especially when the read sets contain repeat regions).
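
The construction described above, with k-mers as edges and (k-1)-mers as nodes, can be sketched in a few lines of Python. This is a minimal illustration that assumes error-free, single-stranded reads. The names `build_de_bruijn` and `walk_unbranched` are hypothetical, and the traversal simply follows outgoing edges while there is exactly one choice, whereas real assemblers also handle branches, sequencing errors, and reverse complements.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the list of (k-1)-mer nodes it points to."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:  # each distinct k-mer contributes one edge
                continue
            seen.add(kmer)
            # Edge from the k-mer's prefix node to its suffix node.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Spell a sequence by following edges while the next step is unambiguous."""
    sequence, node, visited = start, start, {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:  # stop on a cycle instead of looping forever
            break
        visited.add(node)
        sequence += node[-1]  # each edge adds one new base
    return sequence


if __name__ == "__main__":
    graph = build_de_bruijn(["ATGGCGT", "GCGTGCA", "TGCAATC"], k=4)
    print(walk_unbranched(graph, "ATG"))
```

Unlike the pairwise overlap approach, no read is ever compared with another read: overlapping reads share k-mers and therefore share edges, which is one reason this representation scales to large short-read data sets. The choice of k is a trade-off, since a small k collapses more repeats into shared nodes while a large k needs higher coverage to keep the graph connected.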

Commonly used programs

List of de-novo assemblers
Name	Description / Methodology	Technologies	Author	Presented / Last updated	Licence*	Homepage
ABySS	parallel, paired-end sequence assembler designed for large genome assembly of short reads (genomic and transcriptomic); employs a Bloom filter to represent the De Bruijn graph	Illumina	^[8]^[9]	2009 / 2017	OS	link
Canu	Small and large, haploid/diploid genomes	PacBio/Oxford Nanopore reads	^[10]	2001 / 2018	OS	link
DISCOVAR	paired-end PCR-free reads (successor of ALLPATHS-LG)	Illumina (MiSeq or HiSeq 2500)	^[11]	2014	OS	link
DNA Baser Sequence Assembler	DNA sequence assembly with automatic end trimming & ambiguity correction. Includes a base caller.	Sanger, Illumina	Heracle BioSoft SRL	2018.09	C ($69)	NA
DNASTAR Lasergene Genomics	Large genomes, exomes, transcriptomes, metagenomes, ESTs. Also de novo assembly and polishing of long read sequencing data from Oxford Nanopore and PacBio, including PacBio Hifi reads.	Illumina, ABI SOLiD, Roche 454, Ion Torrent, Solexa, Sanger	DNASTAR	2007 / 2023	C	link
Falcon	Diploid genomes	PacBio reads	^[12]	2014 / 2017	OS	link
Flye	genomes and metagenomes. Makes use of repeat graphs	PacBio/Oxford Nanopore reads	^[13]	2018/2023	OS	link
HGAP	Genomes up to 130 MB	PacBio reads	^[14]	2011 / 2015	OS	link
hifiasm	Graph assembly	PacBio HiFi reads	^[15]	2021	OS	link
Hinge	Small microbial genomes	PacBio/Oxford Nanopore reads	^[16]	2016 / 2018	OS	link
MaSuRCA	Any size, haploid/diploid genomes	Illumina and PacBio/Oxford Nanopore data, legacy 454 and Sanger data	^[17]	2011 / 2018	OS	link
Newbler	genomes, ESTs	454, Sanger	454 Life Sciences	2004/2012	C	link
Phrap	genomes	Sanger, 454, Solexa	Green, P.	1994 / 2008	C / NC-A	link
Plass	Protein-level assembler: assembles six-frame-translated sequencing reads into protein sequences	Illumina	^[18]	2018 / 2019	OS	link
Ray	a suite of assemblers including de novo, metagenomic, ontology and taxonomic profiling; uses a De Bruijn graph		^[19]	2010	OS	link
SPAdes	(small) genomes, single-cell	Illumina, Solexa, Sanger, 454, Ion Torrent, PacBio, Oxford Nanopore	^[20]	2012 / 2021	OS	link
Trinity	transcriptome assemblies by de Bruijn graph	Illumina RNA-seq	^[21]	2011		link
Velvet	(small) genomes	Sanger, 454, Solexa, SOLiD	^[22]	2007 / 2011	OS	link
*Licences: OS = Open Source; C = Commercial; C / NC-A = Commercial but free for non-commercial and academics

Different assemblers are designed for different types of read technologies. Reads from second-generation (short-read) technologies such as Illumina are typically short (on the order of 50-200 base pairs) and have error rates of around 0.5-2%, with the errors chiefly being substitutions. In contrast, reads from third-generation technologies such as PacBio and fourth-generation technologies such as Oxford Nanopore (collectively called long-read technologies) typically have lengths in the thousands or tens of thousands of base pairs and much higher error rates of around 10-20%, with the errors chiefly being insertions and deletions. Short and long reads therefore require different assembly algorithms.

Assemblathon

There are numerous programs for de novo sequence assembly, and many have been compared in the Assemblathon. The Assemblathon is a periodic, collaborative effort to test and improve the numerous assemblers available. Thus far, two Assemblathons have been completed (2011 and 2013) and a third is in progress (as of April 2017). Teams of researchers from across the world choose a program and assemble simulated genomes (Assemblathon 1) or the genomes of model organisms that have been previously assembled and annotated (Assemblathon 2). The assemblies are then compared and evaluated using numerous metrics.

Assemblathon 1

Assemblathon 1^[23] was conducted in 2011 and featured 59 assemblies from 17 different groups and the organizers. The goal of this Assemblathon was to most accurately and completely assemble a simulated genome, generated using Evolver, that consisted of two haplotypes (each with three chromosomes of 76.3, 18.5, and 17.7 Mb, respectively). Numerous metrics were used to assess the assemblies, including: NG50 (the scaffold length at which 50% of the total genome size is reached when scaffold lengths are summed from the longest to the shortest), LG50 (the number of scaffolds that are greater than, or equal to, the NG50 length), genome coverage, and substitution error rate.
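
The NG50 and LG50 definitions above translate directly into a short calculation. The following Python sketch is an illustration only: the function name `ng50_lg50` and the example scaffold lengths are hypothetical and are not taken from the Assemblathon evaluation scripts or data.

```python
def ng50_lg50(scaffold_lengths, genome_size):
    """Return (NG50, LG50) for a list of scaffold lengths.

    NG50 is the scaffold length at which the running total, summed from the
    longest scaffold to the shortest, first reaches half of `genome_size`.
    LG50 is the number of scaffolds at least as long as the NG50 length.
    Returns (None, None) if the assembly covers less than half the genome.
    """
    half = genome_size / 2
    running = 0
    for length in sorted(scaffold_lengths, reverse=True):
        running += length
        if running >= half:
            ng50 = length
            lg50 = sum(1 for x in scaffold_lengths if x >= ng50)
            return ng50, lg50
    return None, None


if __name__ == "__main__":
    # Hypothetical assembly: four scaffolds against a genome of 120 units.
    print(ng50_lg50([40, 30, 20, 10], genome_size=120))
```

In this example the two longest scaffolds (40 + 30 = 70) are the first to reach half of the genome size (60), so the NG50 is 30 and the LG50 is 2. Because NG50 is normalized by the genome size rather than by the total assembly length (as N50 is), assemblies of different total sizes can be compared on the same scale.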

Software compared: ABySS, Phusion2, phrap, Velvet, SOAPdenovo, PRICE, ALLPATHS-LG
N50 analysis: assemblies by the Plant Genome Assembly Group (using the assembler Meraculous) and ALLPATHS, Broad Institute, USA (using ALLPATHS-LG) performed the best in this category, by an order of magnitude over other groups. These assemblies scored an N50 of >8,000,000 bases.
Coverage of genome by assembly: for this metric, BGI's assembly via SOAPdenovo performed best, with 98.8% of the total genome being covered. All assemblers performed relatively well in this category, with all but three groups having coverage of 90% and higher, and the lowest total coverage being 78.5% (Dept. of Comp. Sci., University of Chicago, USA via Kiki).
Substitution errors: the assembly with the lowest substitution error rate was submitted by the Wellcome Trust Sanger Institute, UK team using the software SGA.
Overall: No single assembler performed significantly better than the others in all categories. While some assemblers excelled in one category, they did not in others, suggesting that there is still much room for improvement in assembler software quality.

Assemblathon 2

Assemblathon 2^[24] improved on Assemblathon 1 by incorporating the genomes of multiple vertebrates (a bird (Melopsittacus undulatus), a fish (Maylandia zebra), and a snake (Boa constrictor constrictor), with genomes estimated to be 1.2, 1.0, and 1.6 Gbp in length, respectively) and assessment by over 100 metrics. Each team was given four months to assemble their genome from next-generation sequencing (NGS) data, including Illumina and Roche 454 sequence data.

Software compared: ABySS, ALLPATHS-LG, PRICE, Ray, and SOAPdenovo
N50 analysis: for the assembly of the bird genome, the Baylor College of Medicine Human Genome Sequencing Center and ALLPATHS teams had the highest NG50s, at over 16,000,000 and over 14,000,000 bp, respectively.
Presence of core genes: Most assemblies performed well in this category (~80% or higher), with only one dropping to just over 50% in their bird genome assembly (Wayne State University via HyDA).
Overall: The Baylor College of Medicine Human Genome Sequencing Center, utilizing a variety of assembly methods (SeqPrep, KmerFreq, Quake, BWA, Newbler, ALLPATHS-LG, Atlas-Link, Atlas-GapFill, Phrap, CrossMatch, Velvet, BLAST, and BLASR), performed the best for the bird and fish assemblies. For the snake genome assembly, the Wellcome Trust Sanger Institute, using SGA, performed best. Across all assemblies, SGA, BCM, Meraculous, and Ray submitted competitive assemblies and evaluations. The results of the many assemblies and evaluations described here suggest that while one assembler may perform well on one species, it may not perform as well on another. The authors make several suggestions for assembly: 1) use more than one assembler, 2) use more than one metric for evaluation, 3) select an assembler that excels in the metrics of most interest (e.g., N50, coverage), 4) do not necessarily be concerned by low N50s or assembly sizes, depending on user needs, and 5) assess the levels of heterozygosity in the genome of interest.

References

[1] J. Bang-Jensen; G. Gutin; A. Yeo (2004). "When the greedy algorithm fails". Discrete Optimization. 1 (2): 121–127. doi:10.1016/j.disopt.2004.03.007.
[2] Peltola, Hannu; Söderlund, Hans; Ukkonen, Esko (1984-01-11). "SEQAID: a DNA sequence assembling program based on a mathematical model". Nucleic Acids Research. 12 (1Part1): 307–321. doi:10.1093/nar/12.1Part1.307. ISSN 0305-1048. PMC 321006. PMID 6320092.
[3] Huang, Xiaoqiu (1992-09-01). "A contig assembly program based on sensitive detection of fragment overlaps". Genomics. 14 (1): 18–25. doi:10.1016/S0888-7543(05)80277-0. PMID 1427824.
[4] Compeau, Phillip EC; Pavel A. Pevzner; Glenn Tesler (2011). "How to apply de Bruijn graphs to genome assembly". Nature Biotechnology. 29 (11): 987–991. doi:10.1038/nbt.2023. PMC 5531759. PMID 22068540.
[5] "DIMACS Workshop on Combinatorial Methods for DNA Mapping and Sequencing". October 1994.
[6] Idury, R. M.; Waterman, M. S. (1995-01-01). "A new algorithm for DNA sequence assembly". Journal of Computational Biology. 2 (2): 291–306. CiteSeerX 10.1.1.79.6459. doi:10.1089/cmb.1995.2.291. ISSN 1066-5277. PMID 7497130.
[7] Myers, E. W. (1995-01-01). "Toward simplifying and accurately formulating fragment assembly". Journal of Computational Biology. 2 (2): 275–290. doi:10.1089/cmb.1995.2.275. ISSN 1066-5277. PMID 7497129.
[8] Simpson, Jared T.; et al. (2009). "ABySS: a parallel assembler for short read sequence data". Genome Research. 19 (6): 1117–1123. doi:10.1101/gr.089532.108. PMC 2694472. PMID 19251739.
[9] Birol, Inanç; et al. (2009). "De novo transcriptome assembly with ABySS". Bioinformatics. 25 (21): 2872–2877. doi:10.1093/bioinformatics/btp367. PMID 19528083.
[10] Koren, Sergey; Walenz, Brian P.; Berlin, Konstantin; Miller, Jason R.; Bergman, Nicholas H.; Phillippy, Adam M. (2017). "Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation". Genome Research. 27 (5): 722–736.
[11] Love, R. Rebecca; Weisenfeld, Neil I.; Jaffe, David B.; Besansky, Nora J.; Neafsey, Daniel E. (December 2016). "Evaluation of DISCOVAR de novo using a mosquito sample for cost-effective short-read genome assembly". BMC Genomics. 17 (1): 187. doi:10.1186/s12864-016-2531-7. ISSN 1471-2164. PMC 4779211. PMID 26944054.
[12] Chin, Chen-Shan; Peluso, Paul; Sedlazeck, Fritz J.; Nattestad, Maria; Concepcion, Gregory T.; Clum, Alicia; Dunn, Christopher; et al. (2016). "Phased diploid genome assembly with single-molecule real-time sequencing". Nature Methods. 13 (12): 1050–1054.
[13] Kolmogorov, Mikhail; Yuan, Jeffrey; Lin, Yu; Pevzner, Pavel A. (2019-04-01). "Assembly of long, error-prone reads using repeat graphs". Nature Biotechnology. 37 (5): 540–546. doi:10.1038/s41587-019-0072-8. ISSN 1087-0156. PMID 30936562. S2CID 89616540.
[14] Chin, Chen-Shan; Alexander, David H.; Marks, Patrick; Klammer, Aaron A.; Drake, James; Heiner, Cheryl; Clum, Alicia; et al. (2013). "Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data". Nature Methods. 10 (6): 563–569.
[15] Cheng, Haoyu; Concepcion, Gregory T.; Feng, Xiaowen; Zhang, Haowen; Li, Heng (February 2021). "Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm". Nature Methods. 18 (2): 170–175. arXiv:2008.01237. doi:10.1038/s41592-020-01056-5. ISSN 1548-7105. PMC 7961889. PMID 33526886.
[16] Kamath, Govinda M.; Shomorony, Ilan; Xia, Fei; Courtade, Thomas A.; Tse, David N. (2017). "HINGE: long-read assembly achieves optimal repeat resolution". Genome Research. 27 (5): 747–756.
[17] Zimin, Aleksey V.; Marçais, Guillaume; Puiu, Daniela; Roberts, Michael; Salzberg, Steven L.; Yorke, James A. (November 2013). "The MaSuRCA genome assembler". Bioinformatics. 29 (21): 2669–2677. doi:10.1093/bioinformatics/btt476. ISSN 1367-4803. PMC 3799473. PMID 23990416.
[18] Steinegger, Martin; Mirdita, Milot; Söding, Johannes (2019-06-24). "Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold". Nature Methods. 16 (7): 603–606. doi:10.1038/s41592-019-0437-4. hdl:21.11116/0000-0003-E0DD-7. PMID 31235882.
[19] Boisvert, Sébastien; François Laviolette; Jacques Corbeil (2010). "Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies". Journal of Computational Biology. 17 (11): 1519–1533. doi:10.1089/cmb.2009.0238. PMC 3119603. PMID 20958248.
[20] Bankevich, Anton; Nurk, Sergey; Antipov, Dmitry; Gurevich, Alexey A.; Dvorkin, Mikhail; Kulikov, Alexander S.; Lesin, Valery M.; Nikolenko, Sergey I.; Pham, Son; Prjibelski, Andrey D.; Pyshkin, Alexey V. (May 2012). "SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing". Journal of Computational Biology. 19 (5): 455–477. doi:10.1089/cmb.2012.0021. ISSN 1066-5277. PMC 3342519. PMID 22506599.
[21] Grabherr, Manfred G.; et al. (2011). "Full-length transcriptome assembly from RNA-Seq data without a reference genome". Nature Biotechnology. 29 (7): 644–652. doi:10.1038/nbt.1883. PMC 3571712. PMID 21572440.
[22] Zerbino, D. R.; Birney, E. (2008-02-21). "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs". Genome Research. 18 (5): 821–829. doi:10.1101/gr.074492.107. ISSN 1088-9051. PMC 2336801. PMID 18349386.
[23] Earl, Dent; et al. (December 2011). "Assemblathon 1: A competitive assessment of de novo short read assembly methods". Genome Research. 21 (12): 2224–2241. doi:10.1101/gr.126599.111. PMC 3227110. PMID 21926179.
[24] Bradnam, Keith R.; et al. (2013). "Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species". GigaScience. 2 (1): 10. arXiv:1301.5406. doi:10.1186/2047-217X-2-10. PMC 3844414. PMID 23870653.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
