Velvet assembler

Last updated

Velvet assembler
Developer(s) Daniel Zerbino, [1] Ewan Birney
Initial release2008
Stable release
1.2.10
Operating system Unix-like
Available in C
Type Bioinformatics
License GPL
Website www.github.com/dzerbino/velvet/

Velvet is an algorithm package that has been designed to deal with de novo genome assembly and short read sequencing alignments. This is achieved through the manipulation of de Bruijn graphs for genomic sequence assembly via the removal of errors and the simplification of repeated regions. [2] Velvet has also been implemented in commercial packages, such as Sequencher, Geneious, MacVector and BioNumerics.

Contents

Introduction

The development of next-generation sequencers (NGS) allowed for increased cost effectiveness on very short read sequencing. The manipulation of de Bruijn graphs as a method for alignment became more realistic but further developments were needed to address issues with errors and repeats. [3] This led to the development of Velvet by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute in the United Kingdom. [4]

Velvet works by efficiently manipulating de Bruijn graphs through simplification and compression, without the loss of graph information, by converging non-intersecting paths into single nodes. It eliminates errors and resolves repeats by first using an error correction algorithm that merges sequences together. Repeats are then removed from the sequence via the repeat solver that separates paths which share local overlaps.

The combination of short reads and read pairs allows Velvet to resolve small repeats and produce contigs of reasonable length. This application of Velvet can produce contigs with a N50 length of 50 kb on paired-end prokaryotic data and a 3 kb length for regions of mammalian data.

Algorithm

As already mentioned Velvet uses the de Bruijn graph to assemble short reads. More specifically Velvet represents each different k-mer obtained from the reads by a unique node on the graph. Two nodes are connected if its k-mers have a k-1 overlap. In other words, an arc from node A to node B exists if the last k-1 characters of the k-mer, represented by A, are the first k-1 characters of the k-mer represented by B. The following figure shows an example of a de Bruijn graph generated with Velvet:

Figure 1: Example of read hashing and respective de Bruijn graph Example 1seq.pdf
Figure 1: Example of read hashing and respective de Bruijn graph

The same process is simultaneously done with the reverse complement of all the k-mers to take into account the overlaps between the reads of opposite strands. A number of optimizations can be done over the graph which includes simplification and error removal.

Simplification

An easy way to save memory costs is to merge nodes that do not affect the path generated in the graph, i.e., whenever a node A has only one outgoing arc that points to node B, with only one ingoing arc, the nodes can be merged. It is possible to represent both nodes as one, merging them and all their information together. The next figure illustrates this process in the simplification of the initial example.

Figure 2: Simplified de Bruijn graph Example 2simp.pdf
Figure 2: Simplified de Bruijn graph

Error removal

Errors in the graph can be caused by the sequencing process or it could simply be that the biological sample contains some errors (for example polymorphisms). Velvet recognizes three kinds of errors: tips; bubbles; and erroneous connections.

Tips

A node is considered a tip and should be erased if it is disconnected on one of its ends, the length of the information stored in the node is shorter than 2k, and the arc leading to this node has a low multiplicity (number of times the arc was found during the construction of the graph) and as a result cannot be compared to other alternative paths. Once these errors are removed, the graph once again undergoes simplification.

Figure 3: Example of tips Example 3.pdf
Figure 3: Example of tips

Bubbles

Bubbles are generated when two distinct paths start and end at the same nodes. Normally bubbles are caused by errors or biological variants. These errors are removed using the Tour Bus algorithm, which is similar to a Dijkstra's algorithm, a breadth-first search that detects the best path to follow and determines which ones should be erased. A simple example is shown in figure 4.

Figure 4: Example of bubble erase Example 4.pdf
Figure 4: Example of bubble erase

This process is also shown in figure 5 following on from the examples shown in figures 1 and 2.

Figure 5: Example of bubble detection Example 5.pdf
Figure 5: Example of bubble detection

Erroneous connections

These are connections that do not generate correct paths or do not create any recognizable structures within the graph. Velvet erases these errors after completion of the Tour Bus algorithm, applying a simple coverage cut-off that must be defined by the user.

Velvet commands

Velvet provides the following functions:

velveth
This command helps to construct the data set (hashes the reads) for velvetg and includes information about the meaning of each sequence files.
velvetg
This command builds the de Bruijn graph from the k-mers obtained by velveth and runs simplification and error correction over the graph. It then extracts the contigs.

After running velvetg a number of files are generated. Most importantly, a file of contigs contains the sequences of the contigs longer than 2k, where k is the word-length used in velveth.

For more detail and examples refer to the Velvet Manual [5]

Motivation

Current DNA sequencing technologies, including NGS, are limited on the basis that genomes are much larger than any read length. Typically, NGS operate with small reads, less than 400 bp, and have a much lower cost per read than previous first generation machines. They are also simpler to operate with higher parallel operation and higher yield. [3]

However, short reads contain less information than larger reads thus requiring a higher assembly read coverage to allow for detectable overlaps. This in turn increases the complexity of the sequencing and significantly increases computational requirements. A larger number of reads also increases the size of the overlap graph, making it more difficult and lengthy to compute. The connections between the reads become more indistinct due to the decrease in overlapping sections leading to a greater possibility of errors.

To overcome these issues, dynamic sequencing programs that are efficient, highly cost effective and able to resolve errors and repeats were developed. Velvet algorithms was designed for this and are able to perform short read de novo sequencing alignment in relatively short computational time and with lower memory usage compared to other assemblers. [6]

Graphical interface

One of the main drawbacks in the use of Velvet is the use of the command-line interface and the difficulties users, especially beginners, face in the implementation of their data. A graphical user interface for the Velvet assembler was developed in 2012 and designed to overcome this problem and simplify the running of Velvet. [7]

See also

Related Research Articles

In genetics, shotgun sequencing is a method used for sequencing random DNA strands. It is named by analogy with the rapidly expanding, quasi-random shot grouping of a shotgun.

<span class="mw-page-title-main">Genome project</span>

Genome projects are scientific endeavours that ultimately aim to determine the complete genome sequence of an organism and to annotate protein-coding genes and other important genome-encoded features. The genome sequence of an organism includes the collective DNA sequences of each chromosome in the organism. For a bacterium containing a single chromosome, a genome project will aim to map the sequence of that chromosome. For the human species, whose genome includes 22 pairs of autosomes and 2 sex chromosomes, a complete genome sequence will involve 46 separate chromosome sequences.

A contig is a set of overlapping DNA segments that together represent a consensus region of DNA. In bottom-up sequencing projects, a contig refers to overlapping sequence data (reads); in top-down sequencing projects, contig refers to the overlapping clones that form a physical map of the genome that is used to guide sequencing and assembly. Contigs can thus refer both to overlapping DNA sequences and to overlapping physical segments (fragments) contained in clones depending on the context.

In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology might not be able to 'read' whole genomes in one go, but rather reads small pieces of between 20 and 30,000 bases, depending on the technology used. Typically, the short fragments (reads) result from shotgun sequencing genomic DNA, or gene transcript (ESTs).

In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has mn vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence. For a set of m symbols S = {s1, …, sm}, the set of vertices is:

<span class="mw-page-title-main">Ewan Birney</span> English businessman

John Frederick William Birney is joint director of EMBL's European Bioinformatics Institute (EMBL-EBI), in Hinxton, Cambridgeshire and deputy director general of the European Molecular Biology Laboratory (EMBL). He also serves as non-executive director of Genomics England, chair of the Global Alliance for Genomics and Health (GA4GH) and honorary professor of bioinformatics at the University of Cambridge. Birney has made significant contributions to genomics, through his development of innovative bioinformatics and computational biology tools. He previously served as an associate faculty member at the Wellcome Trust Sanger Institute.

<i>k</i>-mer

In bioinformatics, k-mers are substrings of length contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which k-mers are composed of nucleotides, k-mers are capitalized upon to assemble DNA sequences, improve heterologous gene expression, identify species in metagenomic samples, and create attenuated vaccines. Usually, the term k-mer refers to all of a sequence's subsequences of length , such that the sequence AGAT would have four monomers, three 2-mers, two 3-mers and one 4-mer (AGAT). More generally, a sequence of length will have k-mers and total possible k-mers, where is number of possible monomers.

<span class="mw-page-title-main">RNA-Seq</span> Lab technique in cellular biology

RNA-Seq is a sequencing technique which uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment, analyzing the continuously changing cellular transcriptome.

SOAP is a suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment, and analysis of next generation DNA sequencing data. It is particularly suited to short read sequencing data.

<span class="mw-page-title-main">Hybrid genome assembly</span>

In bioinformatics, hybrid genome assembly refers to utilizing various sequencing technologies to achieve the task of assembling a genome from fragmented, sequenced DNA resulting from shotgun sequencing. Genome assembly presents one of the most challenging tasks in genome sequencing as most modern DNA sequencing technologies can only produce reads that are, on average, 25-300 base pairs in length. This is orders of magnitude smaller than the average size of a genome. This assembly is computationally difficult and has some inherent challenges, one of these challenges being that genomes often contain complex tandem repeats of sequences that can be thousands of base pairs in length. These repeats can be long enough that second generation sequencing reads are not long enough to bridge the repeat, and, as such, determining the location of each repeat in the genome can be difficult. Resolving these tandem repeats can be accomplished by utilizing long third generation sequencing reads, such as those obtained using the PacBio RS DNA sequencer. These sequences are, on average, 10,000-15,000 base pairs in length and are long enough to span most repeated regions. Using a hybrid approach to this process can increase the fidelity of assembling tandem repeats by being able to accurately place them along a linear scaffold and make the process more computationally efficient.

De novo transcriptome assembly is the de novo sequence assembly method of creating a transcriptome without the aid of a reference genome.

In DNA sequencing, a read is an inferred sequence of base pairs corresponding to all or part of a single DNA fragment. A typical sequencing experiment involves fragmentation of the genome into millions of molecules, which are size-selected and ligated to adapters. The set of fragments is referred to as a sequencing library, which is sequenced to produce a set of reads.

<span class="mw-page-title-main">Scaffolding (bioinformatics)</span>

Scaffolding is a technique used in bioinformatics. It is defined as follows:

Link together a non-contiguous series of genomic sequences into a scaffold, consisting of sequences separated by gaps of known length. The sequences that are linked are typically contiguous sequences corresponding to read overlaps.

SPAdes is a genome assembly algorithm which was designed for single cell and multi-cells bacterial data sets. Therefore, it might not be suitable for large genomes projects.

In bioinformatics, a DNA read error occurs when a sequence assembler changes one DNA base for a different base. The reads from the sequence assembler can then be used to create a de Bruijn graph, which can be used in various ways to find errors.

De novo sequence assemblers are a type of program that assembles short nucleotide sequences into longer ones without the use of a reference genome. These are most commonly used in bioinformatic studies to assemble genomes or transcriptomes. Two common types of de novo assemblers are greedy algorithm assemblers and De Bruijn graph assemblers.

A plant genome assembly represents the complete genomic sequence of a plant species, which is assembled into chromosomes and other organelles by using DNA fragments that are obtained from different types of sequencing technology.

Bloom filters are space-efficient probabilistic data structures used to test whether an element is a part of a set. Bloom filters require much less space than other data structures for representing sets, however the downside of Bloom filters is that there is a false positive rate when querying the data structure. Since multiple elements may have the same hash values for a number of hash functions, then there is a probability that querying for a non-existent element may return a positive if another element with the same hash values has been added to the Bloom filter. Assuming that the hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of querying a Bloom filter is a function of the number of bits, number of hash functions and number of elements of the Bloom filter. This allows the user to manage the risk of a getting a false positive by compromising on the space benefits of the Bloom filter.

In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence and a method of approximate string matching that allows for substitutions. They are a straightforward modification to the earliest heuristic-based alignment efforts that allow for minor differences between the sequences of interest. Spaced seeds have been used in homology search., alignment, assembly, and metagenomics. They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant and dashes or asterisks for irrelevant positions.

References

  1. Zerbino, D. R. (2010). "Using the Velvetde novo Assembler for Short-Read Sequencing Technologies". In Andreas D. Baxevanis (ed.). Using the Velvet de novo assembler for short-read sequencing technologies. Current Protocols in Bioinformatics. Vol. 31. pp. Unit 11.5. doi:10.1002/0471250953.bi1105s31. ISBN   978-0471250951. PMC   2952100 . PMID   20836074.
  2. Zerbino, D. R.; Birney, E. (2008). "Velvet: de novo assembly using very short reads". Retrieved 18 October 2013.
  3. 1 2 Miller, J. R.; Koren, S; Sutton, G (2010). "Assembly algorithms for next-generation sequencing data". Genomics. 95 (6): 315–27. doi:10.1016/j.ygeno.2010.03.001. PMC   2874646 . PMID   20211242.
  4. Zerbino, D. R.; Birney, E. (2008). "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs". Genome Research. 18 (5): 821–829. doi:10.1101/gr.074492.107. PMC   2336801 . PMID   18349386.
  5. "Velvet Manual" Retrieved 18 October 2013
  6. Zhang, W.; Chen, J.; Yang, Y.; Tang, Y.; Shang, J.; Shen, B. (2011). "A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies". PLOS ONE. 6 (3): e17915. Bibcode:2011PLoSO...617915Z. doi: 10.1371/journal.pone.0017915 . PMC   3056720 . PMID   21423806.
  7. Powell, D. R.; Seemann, T (2013). "VAGUE: A graphical user interface for the Velvet assembler". Bioinformatics. 29 (2): 264–5. doi: 10.1093/bioinformatics/bts664 . PMID   23162059.