Sequence assembly

Last updated

In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology might not be able to 'read' whole genomes in one go, but rather reads small pieces of between 20 and 30,000 bases, depending on the technology used. Typically, the short fragments (reads) result from shotgun sequencing genomic DNA, or gene transcript (ESTs).

Contents

The problem of sequence assembly can be compared to taking many copies of a book, passing each of them through a shredder with a different cutter, and piecing the text of the book back together just by looking at the shredded pieces. Besides the obvious difficulty of this task, there are some extra practical issues: the original may have many repeated paragraphs, and some shreds may be modified during shredding to have typos. Excerpts from another book may also be added in, and some shreds may be completely unrecognizable.

Genome assemblers

The first sequence assemblers began to appear in the late 1980s and early 1990s as variants of simpler sequence alignment programs to piece together vast quantities of fragments generated by automated sequencing instruments called DNA sequencers. As the sequenced organisms grew in size and complexity (from small viruses over plasmids to bacteria and finally eukaryotes), the assembly programs used in these genome projects needed increasingly sophisticated strategies to handle:

Faced with the challenge of assembling the first larger eukaryotic genomesthe fruit fly Drosophila melanogaster in 2000 and the human genome just a year later,scientists developed assemblers like Celera Assembler [1] and Arachne [2] able to handle genomes of 130 million (e.g., the fruit fly D. melanogaster) to 3 billion (e.g., the human genome) base pairs. Subsequent to these efforts, several other groups, mostly at the major genome sequencing centers, built large-scale assemblers, and an open source effort known as AMOS [3] was launched to bring together all the innovations in genome assembly technology under the open source framework.

Strategy how a sequence assembler would take fragments (shown below the black bar) and match overlaps among them to assembly the final sequence (in black). Potentially problematic repeats are shown above the sequence (in pink above). Without overlapping fragments it may be impossible to assign these segments to any specific region. Seqassemble.png
Strategy how a sequence assembler would take fragments (shown below the black bar) and match overlaps among them to assembly the final sequence (in black). Potentially problematic repeats are shown above the sequence (in pink above). Without overlapping fragments it may be impossible to assign these segments to any specific region.

EST assemblers

Expressed sequence tag or EST assembly was an early strategy, dating from the mid-1990s to the mid-2000s, to assemble individual genes rather than whole genomes. The problem differs from genome assembly in several ways. The input sequences for EST assembly are fragments of the transcribed mRNA of a cell and represent only a subset of the whole genome. A number of algorithmical problems differ between genome and EST assembly. For instance, genomes often have large amounts of repetitive sequences, concentrated in the intergenic regions. Transcribed genes contain many fewer repeats, making assembly somewhat easier. On the other hand, some genes are expressed (transcribed) in very high numbers (e.g., housekeeping genes), which means that unlike whole-genome shotgun sequencing, the reads are not uniformly sampled across the genome.

EST assembly is made much more complicated by features like (cis-) alternative splicing, trans-splicing, single-nucleotide polymorphism, and post-transcriptional modification. Beginning in 2008 when RNA-Seq was invented, EST sequencing was replaced by this far more efficient technology, described under de novo transcriptome assembly.

Types of sequence assembly

Types of sequence assembly Types of sequencing assembly.png
Types of sequence assembly

There are three approaches to assembling sequencing data:

  1. De-novo: assembling sequencing reads to create full-length (sometimes novel) sequences, without using a template (see de novo sequence assemblers, de novo transcriptome assembly)
  2. Mapping/Aligning: assembling reads by aligning reads against a template (AKA reference). The assembled consensus may not be identical to the template.
  3. Reference-guided: grouping of reads by similarity to the most similar region within the reference (step wise mapping). Reads within each group are then shortened down to mimic short reads quality. A typical method to do so is the k-mer approach. Reference-guided assembly is most useful using long-reads.

Referenced-guided assembly is a combination of the other types. This type is applied on long reads to mimic short reads advantages (i.e. call quality). The logic behind it is to group the reads by smaller windows within the reference. Reads in each group will then be reduced in size using the k-mere approach to select the highest quality and most probable contiguous (contig). Contigs will then will be joined together to create a scaffold. The final consense is made by closing any gaps in the scaffold.

De-novo vs. mapping assembly

In terms of complexity and time requirements, de-novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies. This is mostly due to the fact that the assembly algorithm needs to compare every read with every other read (an operation that has a naive time complexity of O(n2)). Current de-novo genome assemblers may use different types of graph-based algorithms, such as the:

Referring to the comparison drawn to shredded books in the introduction: while for mapping assemblies one would have a very similar book as a template (perhaps with the names of the main characters and a few locations changed), de-novo assemblies present a more daunting challenge in that one would not know beforehand whether this would become a science book, a novel, a catalogue, or even several books. Also, every shred would be compared with every other shred.

Handling repeats in de-novo assembly requires the construction of a graph representing neighboring repeats. Such information can be derived from reading a long fragment covering the repeats in full or only its two ends. On the other hand, in a mapping assembly, parts with multiple or no matches are usually left for another assembling technique to look into. [5]

Sequence assembly pipeline (bioinformatics)

In general, there are three steps in assembling sequencing reads into a scaffold:

  1. Pre-assembly: This step is essential to ensure the integrity of downstream analysis such as variant calling or final scaffold sequence. This step consists of two chronological workflows:
    1. Quality check: Depending on the type of sequencing technology, different errors might arise that would lead to a false base call. For example, sequencing "NAAAAAAAAAAAAN" and "NAAAAAAAAAAAN" which include 12 adenine might be wrongfully called with 11 adenine instead. Sequencing a highly repetitive segment of the target DNA/RNA might result in a call that is one base shorter or one base longer. Read quality is typically measured by Phred which is an encoded score of each nucleotide quality within a read's sequence. Some sequencing technologies such as PacBio do not have a scoring method for their sequenced reads. A common tool used in this step is FastQC. [6]
    2. Filtering of reads: Reads that failed to pass the quality check should be removed from the FASTQ file to get the best assembly contigs.
  2. Assembly: During this step, reads alignment will be utilized with different criteria to map each read to the possible location. The predicted position of a read is based on either how much of its sequence aligns with other reads or a reference. Different alignment algorithms are used for reads from different sequencing technologies. Some of the commonly used approaches in the assembly are de Bruijn graph and overlapping. Read length, coverage, quality, and the sequencing technique used plays a major role in choosing the best alignment algorithm in the case of Next Generation Sequencing. [7] On the other hand, algorithms aligning 3rd generation sequencing reads requires advance approaches to account for the high error rate associated with them.
  3. Post-assembly: This step is focusing on extracting valuable information from the assembled sequence. Comparative genomics and population analysis are examples of post-assembly analysis.

Influence of technological changes

The complexity of sequence assembly is driven by two major factors: the number of fragments and their lengths. While more and longer fragments allow better identification of sequence overlaps, they also pose problems as the underlying algorithms show quadratic or even exponential complexity behaviour to both number of fragments and their length. And while shorter sequences are faster to align, they also complicate the layout phase of an assembly as shorter reads are more difficult to use with repeats or near identical repeats.

In the earliest days of DNA sequencing, scientists could only gain a few sequences of short length (some dozen bases) after weeks of work in laboratories. Hence, these sequences could be aligned in a few minutes by hand.

In 1975, the dideoxy termination method (AKA Sanger sequencing) was invented and until shortly after 2000, the technology was improved up to a point where fully automated machines could churn out sequences in a highly parallelised mode 24 hours a day. Large genome centers around the world housed complete farms of these sequencing machines, which in turn led to the necessity of assemblers to be optimised for sequences from whole-genome shotgun sequencing projects where the reads

With the Sanger technology, bacterial projects with 20,000 to 200,000 reads could easily be assembled on one computer. Larger projects, like the human genome with approximately 35 million reads, needed large computing farms and distributed computing.

By 2004 / 2005, pyrosequencing had been brought to commercial viability by 454 Life Sciences. This new sequencing method generated reads much shorter than those of Sanger sequencing: initially about 100 bases, now 400-500 bases. Its much higher throughput and lower cost (compared to Sanger sequencing) pushed the adoption of this technology by genome centers, which in turn pushed development of sequence assemblers that could efficiently handle the read sets. The sheer amount of data coupled with technology-specific error patterns in the reads delayed development of assemblers; at the beginning in 2004 only the Newbler assembler from 454 was available. Released in mid-2007, [8] the hybrid version of the MIRA assembler by Chevreux et al. was the first freely available assembler that could assemble 454 reads as well as mixtures of 454 reads and Sanger reads. Assembling sequences from different sequencing technologies was subsequently coined hybrid assembly.

From 2006, the Illumina (previously Solexa) technology has been available and can generate about 100 million reads per run on a single sequencing machine. Compare this to the 35 million reads of the human genome project which needed several years to be produced on hundreds of sequencing machines. Illumina was initially limited to a length of only 36 bases, making it less suitable for de novo assembly (such as de novo transcriptome assembly), but newer iterations of the technology achieve read lengths above 100 bases from both ends of a 3-400bp clone. Announced at the end of 2007, the SHARCGS assembler [9] by Dohm et al. was the first published assembler that was used for an assembly with Solexa reads. It was quickly followed by a number of others.

Later, new technologies like SOLiD from Applied Biosystems, Ion Torrent and SMRT were released and new technologies (e.g. Nanopore sequencing) continue to emerge. Despite the higher error rates of these technologies they are important for assembly because their longer read length helps to address the repeat problem. It is impossible to assemble through a perfect repeat that is longer than the maximum read length; however, as reads become longer the chance of a perfect repeat that large becomes small. This gives longer sequencing reads an advantage in assembling repeats even if they have low accuracy (~85%).

Assembly algorithms

Different organisms have a distinct region of higher complexity within their genome. Hence, the need of different computational approaches is needed. Some of the commonly used algorithms are:

Given a set of sequence fragments, the object is to find a longer sequence that contains all the fragments (see figure under Types of Sequence Assembly):

  1. Сalculate pairwise alignments of all fragments.
  2. Choose two fragments with the largest overlap.
  3. Merge chosen fragments.
  4. Repeat step 2 and 3 until only one fragment is left.

The result might not be an optimal solution to the problem.

Quality control

Most sequence assemblers have some algorithms built in for quality control, such as Phred. However, such measures do not assess assembly completeness in terms of gene content. Some tools evaluate the quality of an assembly after the fact.

For instance, BUSCO (Benchmarking Universal Single-Copy Orthologs) is a measure of gene completeness in a genome, gene set, or transcriptome, using the fact that many genes are present only as single-copy genes in most genomes. [10] The initial BUSCO sets represented 3023 genes for vertebrates, 2675 for arthropods, 843 for metazoans, 1438 for fungi and 429 for eukaryotes. This table shows an example for human and fruit fly genomes: [10]

BUSCO notation assessment results (C,D,F, M in %)
SpeciesgenesC:D:F:M:n:
Homo sapiens 20,364991.70.00.03,023
Drosophila melanogaster 13,918993.70.20.02,675

Programs

For a lists of de-novo assemblers, see De novo sequence assemblers. For a list of mapping aligners, see List of sequence alignment software § Short-read sequence alignment.

Some of the common tools used in different assembly steps are listed in the following table:

Sequence Assembly Tools
SoftwareRead typeTool web pageNotes
FastQCMultiple https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ This is a common tool used to check reads quality from different sequencing technologies such as Illumina, 454, and PacBio.
BWAShort & Long reads https://sourceforge.net/projects/bio-bwa/files/ This is a command line tool. Mostly known for lightweight run and accurate sequence alignment.
MiniMap2Long reads https://github.com/lh3/minimap2 This command line tool is designed to handle PacBio & Oxford Nanopore and reads with 15% error rate.
LoReTTALong reads https://github.com/salvocamiolo/LoReTTA/releases/tag/v0.1 This tool is designed to assemble (reference-guided) viral genomes at a greater accuracy using PacBio CCS reads.
SPAdes Short & Long reads http://cab.spbu.ru/software/spades/ This is an assembly tool that runs on the command line.
Samtools Alignment analysis https://samtools.github.io This is useful post-assembly. It can generate different statistics and perform multiple filtering steps to the alignment file.

See also

Related Research Articles

In genetics, shotgun sequencing is a method used for sequencing random DNA strands. It is named by analogy with the rapidly expanding, quasi-random shot grouping of a shotgun.

A contig is a set of overlapping DNA segments that together represent a consensus region of DNA. In bottom-up sequencing projects, a contig refers to overlapping sequence data (reads); in top-down sequencing projects, contig refers to the overlapping clones that form a physical map of the genome that is used to guide sequencing and assembly. Contigs can thus refer both to overlapping DNA sequences and to overlapping physical segments (fragments) contained in clones depending on the context.

Velvet is an algorithm package that has been designed to deal with de novo genome assembly and short read sequencing alignments. This is achieved through the manipulation of de Bruijn graphs for genomic sequence assembly via the removal of errors and the simplification of repeated regions. Velvet has also been implemented in commercial packages, such as Sequencher, Geneious, MacVector and BioNumerics.

<span class="mw-page-title-main">RNA-Seq</span> Lab technique in cellular biology

RNA-Seq is a technique that uses next-generation sequencing to reveal the presence and quantity of RNA molecules in a biological sample, providing a snapshot of gene expression in the sample, also known as transcriptome.

SOAP is a suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment, and analysis of next generation DNA sequencing data. It is particularly suited to short read sequencing data.

<span class="mw-page-title-main">Hybrid genome assembly</span>

In bioinformatics, hybrid genome assembly refers to utilizing various sequencing technologies to achieve the task of assembling a genome from fragmented, sequenced DNA resulting from shotgun sequencing. Genome assembly presents one of the most challenging tasks in genome sequencing as most modern DNA sequencing technologies can only produce reads that are, on average, 25-300 base pairs in length. This is orders of magnitude smaller than the average size of a genome. This assembly is computationally difficult and has some inherent challenges, one of these challenges being that genomes often contain complex tandem repeats of sequences that can be thousands of base pairs in length. These repeats can be long enough that second generation sequencing reads are not long enough to bridge the repeat, and, as such, determining the location of each repeat in the genome can be difficult. Resolving these tandem repeats can be accomplished by utilizing long third generation sequencing reads, such as those obtained using the PacBio RS DNA sequencer. These sequences are, on average, 10,000-15,000 base pairs in length and are long enough to span most repeated regions. Using a hybrid approach to this process can increase the fidelity of assembling tandem repeats by being able to accurately place them along a linear scaffold and make the process more computationally efficient.

De novo transcriptome assembly is the de novo sequence assembly method of creating a transcriptome without the aid of a reference genome.

In metagenomics, binning is the process of grouping reads or contigs and assigning them to individual genome. Binning methods can be based on either compositional features or alignment (similarity), or both.

In DNA sequencing, a read is an inferred sequence of base pairs corresponding to all or part of a single DNA fragment. A typical sequencing experiment involves fragmentation of the genome into millions of molecules, which are size-selected and ligated to adapters. The set of fragments is referred to as a sequencing library, which is sequenced to produce a set of reads.

<span class="mw-page-title-main">Scaffolding (bioinformatics)</span> Bioinformatics technique

Scaffolding is a technique used in bioinformatics. It is defined as follows:

Link together a non-contiguous series of genomic sequences into a scaffold, consisting of sequences separated by gaps of known length. The sequences that are linked are typically contiguous sequences corresponding to read overlaps.

SPAdes is a genome assembly algorithm which was designed for single cell and multi-cells bacterial data sets. Therefore, it might not be suitable for large genomes projects.

TopHat is an open-source bioinformatics tool for the throughput alignment of shotgun cDNA sequencing reads generated by transcriptomics technologies using Bowtie first and then mapping to a reference genome to discover RNA splice sites de novo. TopHat aligns RNA-Seq reads to mammalian-sized genomes.

De novo sequence assemblers are a type of program that assembles short nucleotide sequences into longer ones without the use of a reference genome. These are most commonly used in bioinformatic studies to assemble genomes or transcriptomes. Two common types of de novo assemblers are greedy algorithm assemblers and De Bruijn graph assemblers.

In molecular phylogenetics, relationships among individuals are determined using character traits, such as DNA, RNA or protein, which may be obtained using a variety of sequencing technologies. High-throughput next-generation sequencing has become a popular technique in transcriptomics, which represent a snapshot of gene expression. In eukaryotes, making phylogenetic inferences using RNA is complicated by alternative splicing, which produces multiple transcripts from a single gene. As such, a variety of approaches may be used to improve phylogenetic inference using transcriptomic data obtained from RNA-Seq and processed using computational phylogenetics.

Third-generation sequencing is a class of DNA sequencing methods currently under active development.

Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.

A plant genome assembly represents the complete genomic sequence of a plant species, which is assembled into chromosomes and other organelles by using DNA fragments that are obtained from different types of sequencing technology.

<span class="mw-page-title-main">Genome skimming</span> Method of genome sequencing

Genome skimming is a sequencing approach that uses low-pass, shallow sequencing of a genome, to generate fragments of DNA, known as genome skims. These genome skims contain information about the high-copy fraction of the genome. The high-copy fraction of the genome consists of the ribosomal DNA, plastid genome (plastome), mitochondrial genome (mitogenome), and nuclear repeats such as microsatellites and transposable elements. It employs high-throughput, next generation sequencing technology to generate these skims. Although these skims are merely 'the tip of the genomic iceberg', phylogenomic analysis of them can still provide insights on evolutionary history and biodiversity at a lower cost and larger scale than traditional methods. Due to the small amount of DNA required for genome skimming, its methodology can be applied in other fields other than genomics. Tasks like this include determining the traceability of products in the food industry, enforcing international regulations regarding biodiversity and biological resources, and forensics.

<span class="mw-page-title-main">Linked-read sequencing</span>

Linked-read sequencing, a type of DNA sequencing technology, uses specialized technique that tags DNA molecules with unique barcodes before fragmenting them. Unlike traditional sequencing technology, where DNA is broken into small fragments and then sequenced individually, resulting in short read lengths that has difficulties in accurately reconstructing the original DNA sequence, the unique barcodes of linked-read sequencing allows scientists to link together DNA fragments that come from the same DNA molecule. A pivotal benefit of this technology lies in the small quantities of DNA required for large genome information output, effectively combining the advantages of long-read and short-read technologies.

References

  1. Myers, E. W.; Sutton, GG; Delcher, AL; Dew, IM; Fasulo, DP; Flanigan, MJ; Kravitz, SA; Mobarry, CM; et al. (March 2000). "A whole-genome assembly of Drosophila". Science. 287 (5461): 2196–204. Bibcode:2000Sci...287.2196M. CiteSeerX   10.1.1.79.9822 . doi:10.1126/science.287.5461.2196. PMID   10731133. S2CID   6049420.
  2. Batzoglou, S.; Jaffe, DB; Stanley, K; Butler, J; Gnerre, S; Mauceli, E; Berger, B; Mesirov, JP; Lander, ES (January 2002). "ARACHNE: a whole-genome shotgun assembler". Genome Research. 12 (1): 177–89. doi:10.1101/gr.208902. PMC   155255 . PMID   11779843.
  3. "AMOS WIKI". amos.sourceforge.net. Retrieved 2023-01-02.
  4. Miller, Jason R.; Koren, Sergey; Sutton, Granger (2010-03-06). "Assembly algorithms for next-generation sequencing data". Genomics. 95 (6): 315–327. doi:10.1016/j.ygeno.2010.03.001. PMC   2874646 . PMID   20211242.
  5. Wolf, Beat. "De novo genome assembly versus mapping to a reference genome" (PDF). University of Applied Sciences Western Switzerland. Retrieved 6 April 2019.
  6. "Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data". www.bioinformatics.babraham.ac.uk. Retrieved 2022-05-09.
  7. Ruffalo, M.; LaFramboise, T.; Koyuturk, M. (2011-10-15). "Comparative analysis of algorithms for next-generation sequencing read alignment". Bioinformatics. 27 (20): 2790–2796. doi: 10.1093/bioinformatics/btr477 . ISSN   1367-4803. PMID   21856737.
  8. "MIRA 2.9.8 for 454 and 454 / Sanger hybrid assembly". groups.google.com. Retrieved 2023-01-02.
  9. Dohm, J. C.; Lottaz, C.; Borodina, T.; Himmelbauer, H. (November 2007). "SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing". Genome Research. 17 (11): 1697–706. doi:10.1101/gr.6435207. PMC   2045152 . PMID   17908823.
  10. 1 2 Simão, Felipe A.; Waterhouse, Robert M.; Ioannidis, Panagiotis; Kriventseva, Evgenia V.; Zdobnov, Evgeny M. (2015-10-01). "BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs". Bioinformatics. 31 (19): 3210–3212. doi:10.1093/bioinformatics/btv351. ISSN   1367-4811.