SPAdes (software)

Last updated

SPAdes
Developer(s) St. Petersburg State University, Russia
St. Petersburg Academic University, Russia
University of California, San Diego, USA
Stable release
4.0.0 / June 3rd, 2024
Repository github.com/ablab/spades
Written in C++, C, Python, Perl.
Operating system Linux, macOS
Type Bioinformatics
License GNU General Public License version 2 (GPLv2)
Website ablab.github.io/spades/

SPAdes (St. Petersburg genome assembler) [1] is a genome assembly algorithm which was designed for single cell and multi-cells bacterial data sets. Therefore, it might not be suitable for large genomes projects. [1] [2]

Contents

SPAdes works with Ion Torrent, PacBio, Oxford Nanopore, and Illumina paired-end, mate-pairs and single reads. [1] SPAdes has been integrated into Galaxy pipelines by Guy Lionel and Philip Mabon. [3]

Background

Studying the genome of single cells will help to track changes that occur in DNA over time or associated with exposure to different conditions. Additionally, many projects such as Human Microbiome Project and antibiotics discovery would greatly benefit from Single-cell sequencing (SCS). [4] [5] SCS has an advantage over sequencing DNA extracted from large number of cells. The problem of averaging out the significant variations between cells can be overcome by using SCS. [6] Experimental and computational technologies are being optimized to allow researchers to sequence single cells. For instance, amplification of DNA extracted from a single cell is one of the experimental challenges. To maximize the accuracy and quality of SCS, a uniform DNA amplification is needed. It was demonstrated that using multiple annealing and looping-based amplification cycles (MALBAC) for DNA amplification generates less biasness compared to polymerase chain reaction (PCR) or multiple displacement amplification (MDA). [7] Furthermore, it has been recognized that the challenges facing SCS are computational rather than experimental. [8] Currently available assembler, such as Velvet, [9] String Graph Assembler (SGA) [10] and EULER-SR, [11] were not designed to handle SCS assembly. [2] Assembly of single cell data is difficult due to non-uniform read coverage, variation in insert length, high levels of sequencing errors and chimeric reads. [8] [12] [13] Therefore, the new algorithmic approach, SPAdes, was designed to address these issues.

SPAdes assembly approach

SPAdes uses k-mers for building the initial de Bruijn graph and on following stages it performs graph-theoretical operations which are based on graph structure, coverage and sequence lengths. Moreover, it adjusts errors iteratively. [2] The stages of assembly in SPAdes are: [2]

Details on SPAdes assembly

Logarithmic coverage plot for the single-cell sequencing data for E. coli genome. Coverage2.jpg
Logarithmic coverage plot for the single-cell sequencing data for E. coli genome.

SPAdes was designed to overcome the problems associated with the assembly of single cell data as follows: [2]

1. Non-uniform coverage. SPAdes utilizes multisized de Bruijn graph which allows employing different values of k. It has been suggested to use smaller values of k in low-coverage regions to minimize fragmentation, and larger values of k in high coverage regions to decrease repeat collapsing (Stage 1 above).

2. Variable insert sizes of paired-end reads. SPAdes employs the basic concept of paired de Bruijn graphs. However, paired de Bruijn works well on paired-end reads with fixed insert size. Therefore, SPAdes estimates 'distances' instead of using 'insert sizes'. Distance (d) of a paired-end read is defined as, for a read length L, d = insert size – L. By utilizing k-bimer adjustment approach, distances are exactly estimated. A k-bimer consisting of k-mers ‘α’ and ‘β’ together with the estimated distance between them in a genome (α|β,d). This approach breaks the paired–end reads into pairs of k-mers which are transformed to define pairs of edges (biedges) in the de Bruijn graphs. These sets of biedges are involved in the estimation of distances between edges paths between k-mers α and β. By clustering, the optimal distance estimate is chosen from each cluster (stage 2, above). To construct paired de Bruijn graph, the rectangle graphs are employed in SPAdes (stage 3). Rectangle graphs approach was first introduced in 2012 [15] to construct paired de Bruijn graphs with doubtful distances.

3. Bulge, tips and chimeras. Bulges and tips occur due to errors in the middle and ends of reads, respectively. A chimeric connection joins two unrelated substrings of the genome. SPAdes identifies these based on graph topology, the length and coverage of the non-branching paths included in them. SPAdes keeps a data structure to be able to backtrack all corrections or removals.

SPAdes modifies the previously used bulge removal approach [16] and iterative de Bruijn graph approach from Peng et al (2010) [17] and creates a new approach called ‘‘bulge corremoval’’, which stands for bulge correction and removal. The bulge corremoval algorithm can be summarized as follows: a simple bulge is formed by two small and similar paths (P and Q) connecting the same hubs. If P is a non-branching path (h-path), then SPAdes maps every edge in P to an edge projection in Q and removes P from the graph, as a result the coverage of Q increases. Unlike other assemblers, which use a fixed coverage cut-off bulge removal, SPAdes removes or projects the h-paths with low coverage step by step. This is achieved by employing gradually increasing cut-off thresholds and iterating through all h-paths in increasing order of coverage (for bulge corremoval and chimeric removal) or length (for tip removal). Moreover, in order to guarantee that no new sources/sinks are introduced to the graph, SPAdes deletes an h-path (in chimeric h-path removal) or projects (in bulge corremoval) only if its start and end vertices have at least two outgoing and ingoing edges. This helps to remove low coverage h-paths occurring from sequencing errors and chimeric reads but not from repeats.

SPAdes pipelines and performance

SPAdes is composed of the following tools: [1]

Comparing assemblers

A study [18] compared several genome assemblers on single cell E. coli samples. These assemblers are EULER-SR, [11] Velvet, [9] SOAPdenovo, [19] Velvet-SC, EULER+ Velvet-SC (E+V-SC), [16] IDBA-UD [20] and SPAdes. It was demonstrated that IDBA-UD and SPAdes performed the best. [18] SPAdes had the largest NG50 (99,913, NG50 statistics is the same as the N50 except that the genome size is used rather than the assembly size). [21] Moreover, using E. coli reference genome, [22] SPAdes assembled the highest percentage of genome (97%) and the highest number of complete genes (4,071 out of 4,324). [18] The assemblers’ performances were as follows: [18]

IDBA-UD < Velvet < E+V-SC < SPAdes < EULER-SR < Velvet-SC < SOAPdenovo

SPAdes > IDBA-UD >>> E+V-SC > EULER-SR >Velvet >Velvet-SC > SOAPdenovo

IDBA-UD > SPAdes > > EULER-SR > Velvet= E+V-SC > Velvet-SC > SOAPdenovo

SPAdes > IDBA-UD > E+V-SC > Velvet-SC > EULER-SR > SOAPdenovo > Velvet

E+V-SC = Velvet = Velvet-SC < SOAPdenovo < IDBA-UD < SPADes < EULER-SR

See also

Related Research Articles

In genetics, shotgun sequencing is a method used for sequencing random DNA strands. It is named by analogy with the rapidly expanding, quasi-random shot grouping of a shotgun.

In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology might not be able to 'read' whole genomes in one go, but rather reads small pieces of between 20 and 30,000 bases, depending on the technology used. Typically, the short fragments (reads) result from shotgun sequencing genomic DNA, or gene transcript (ESTs).

<span class="mw-page-title-main">Metagenomics</span> Study of genes found in the environment

Metagenomics is the study of genetic material recovered directly from environmental or clinical samples by a method called sequencing. The broad field may also be referred to as environmental genomics, ecogenomics, community genomics or microbiomics.

In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has mn vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence. For a set of m symbols S = {s1, …, sm}, the set of vertices is:

<i>k</i>-mer Substrings of length k contained in a biological sequence

In bioinformatics, k-mers are substrings of length contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which k-mers are composed of nucleotides, k-mers are capitalized upon to assemble DNA sequences, improve heterologous gene expression, identify species in metagenomic samples, and create attenuated vaccines. Usually, the term k-mer refers to all of a sequence's subsequences of length , such that the sequence AGAT would have four monomers, three 2-mers, two 3-mers and one 4-mer (AGAT). More generally, a sequence of length will have k-mers and total possible k-mers, where is number of possible monomers.

Velvet is an algorithm package that has been designed to deal with de novo genome assembly and short read sequencing alignments. This is achieved through the manipulation of de Bruijn graphs for genomic sequence assembly via the removal of errors and the simplification of repeated regions. Velvet has also been implemented in commercial packages, such as Sequencher, Geneious, MacVector and BioNumerics.

<span class="mw-page-title-main">RNA-Seq</span> Lab technique in cellular biology

RNA-Seq is a technique that uses next-generation sequencing to reveal the presence and quantity of RNA molecules in a biological sample, providing a snapshot of gene expression in the sample, also known as transcriptome.

SOAP is a suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment, and analysis of next generation DNA sequencing data. It is particularly suited to short read sequencing data.

<span class="mw-page-title-main">Hybrid genome assembly</span>

In bioinformatics, hybrid genome assembly refers to utilizing various sequencing technologies to achieve the task of assembling a genome from fragmented, sequenced DNA resulting from shotgun sequencing. Genome assembly presents one of the most challenging tasks in genome sequencing as most modern DNA sequencing technologies can only produce reads that are, on average, 25-300 base pairs in length. This is orders of magnitude smaller than the average size of a genome. This assembly is computationally difficult and has some inherent challenges, one of these challenges being that genomes often contain complex tandem repeats of sequences that can be thousands of base pairs in length. These repeats can be long enough that second generation sequencing reads are not long enough to bridge the repeat, and, as such, determining the location of each repeat in the genome can be difficult. Resolving these tandem repeats can be accomplished by utilizing long third generation sequencing reads, such as those obtained using the PacBio RS DNA sequencer. These sequences are, on average, 10,000-15,000 base pairs in length and are long enough to span most repeated regions. Using a hybrid approach to this process can increase the fidelity of assembling tandem repeats by being able to accurately place them along a linear scaffold and make the process more computationally efficient.

De novo transcriptome assembly is the de novo sequence assembly method of creating a transcriptome without the aid of a reference genome.

In DNA sequencing, a read is an inferred sequence of base pairs corresponding to all or part of a single DNA fragment. A typical sequencing experiment involves fragmentation of the genome into millions of molecules, which are size-selected and ligated to adapters. The set of fragments is referred to as a sequencing library, which is sequenced to produce a set of reads.

<span class="mw-page-title-main">Scaffolding (bioinformatics)</span> Bioinformatics technique

Scaffolding is a technique used in bioinformatics. It is defined as follows:

Link together a non-contiguous series of genomic sequences into a scaffold, consisting of sequences separated by gaps of known length. The sequences that are linked are typically contiguous sequences corresponding to read overlaps.

Single-cell DNA template strand sequencing, or Strand-seq, is a technique for the selective sequencing of a daughter cell's parental template strands. This technique offers a wide variety of applications, including the identification of sister chromatid exchanges in the parental cell prior to segregation, the assessment of non-random segregation of sister chromatids, the identification of misoriented contigs in genome assemblies, de novo genome assembly of both haplotypes in diploid organisms including humans, whole-chromosome haplotyping, and the identification of germline and somatic genomic structural variation, the latter of which can be detected robustly even in single cells.

In bioinformatics, a DNA read error occurs when a sequence assembler changes one DNA base for a different base. The reads from the sequence assembler can then be used to create a de Bruijn graph, which can be used in various ways to find errors.

De novo sequence assemblers are a type of program that assembles short nucleotide sequences into longer ones without the use of a reference genome. These are most commonly used in bioinformatic studies to assemble genomes or transcriptomes. Two common types of de novo assemblers are greedy algorithm assemblers and De Bruijn graph assemblers.

Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.

A plant genome assembly represents the complete genomic sequence of a plant species, which is assembled into chromosomes and other organelles by using DNA fragments that are obtained from different types of sequencing technology.

Bloom filters are space-efficient probabilistic data structures used to test whether an element is a part of a set. Bloom filters require much less space than other data structures for representing sets, however the downside of Bloom filters is that there is a false positive rate when querying the data structure. Since multiple elements may have the same hash values for a number of hash functions, then there is a probability that querying for a non-existent element may return a positive if another element with the same hash values has been added to the Bloom filter. Assuming that the hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of querying a Bloom filter is a function of the number of bits, number of hash functions and number of elements of the Bloom filter. This allows the user to manage the risk of a getting a false positive by compromising on the space benefits of the Bloom filter.

<span class="mw-page-title-main">Genome skimming</span> Method of genome sequencing

Genome skimming is a sequencing approach that uses low-pass, shallow sequencing of a genome, to generate fragments of DNA, known as genome skims. These genome skims contain information about the high-copy fraction of the genome. The high-copy fraction of the genome consists of the ribosomal DNA, plastid genome (plastome), mitochondrial genome (mitogenome), and nuclear repeats such as microsatellites and transposable elements. It employs high-throughput, next generation sequencing technology to generate these skims. Although these skims are merely 'the tip of the genomic iceberg', phylogenomic analysis of them can still provide insights on evolutionary history and biodiversity at a lower cost and larger scale than traditional methods. Due to the small amount of DNA required for genome skimming, its methodology can be applied in other fields other than genomics. Tasks like this include determining the traceability of products in the food industry, enforcing international regulations regarding biodiversity and biological resources, and forensics.

References

  1. 1 2 3 4 "SPAdes 3.0.0 Manual". Archived from the original on February 2, 2014. Retrieved January 26, 2014.
  2. 1 2 3 4 5 Bankevich A; Nurk S; Antipov D; Gurevich AA; Dvorkin M; Kulikov AS; Lesin VM; Nikolenko SI; Pham S; Prjibelski AD; Pyshkin AV; Sirotkin AV; Vyahhi N; Tesler G; Alekseyev MA; Pevzner PA. (2012). "SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing". Journal of Computational Biology. 19 (5): 455–477. doi:10.1089/cmb.2012.0021. PMC   3342519 . PMID   22506599.
  3. Galaxy tool shed
  4. Gill S; Pop M; Deboy R; Eckburg P; Turnbaugh P; Samuel B; Gordon J; Relman D; Fraser-Liggett C; Nelson K (2006). "Metagenomic analysis of the human distal gut microbiome". Science. 312 (5778): 1355–1359. Bibcode:2006Sci...312.1355G. doi:10.1126/science.1124234. PMC   3027896 . PMID   16741115.
  5. Li J; Vederas J (2009). "Drug discovery and natural products: end of an era or an endless frontier?" (PDF). Science. 325 (5937): 161–165. Bibcode:2009Sci...325..161L. doi:10.1126/science.1168243. PMID   19589993. S2CID   206517350.
  6. Lu S; Zong C; Fan W; Yang M; Li J; Chapman A; Zhu P; Hu X; Xu L; Yan L; F B; Qiao J; Tang F; Li R; Xie X (2012). "Probing meiotic recombination and aneuploidy of single sperm cells by whole-genome sequencing". Science. 338 (6114): 1627–1630. Bibcode:2012Sci...338.1627L. doi:10.1126/science.1229112. PMC   3590491 . PMID   23258895.
  7. "One cell is all you need". January 4, 2013.
  8. 1 2 Rodrigue S; Malmstrom RR; Berlin AM; Birren BW; Henn MR; Chisholm SW (2009). "Whole genome amplification and de novo assembly of single bacterial cells". PLOS ONE. 4 (9): e6864. Bibcode:2009PLoSO...4.6864R. doi: 10.1371/journal.pone.0006864 . PMC   2731171 . PMID   19724646.
  9. 1 2 Zerbino D; Birney E (2008). "Velvet: algorithms for de novo short read assembly using de Bruijn graphs". Genome Research. 18 (5): 821–829. doi:10.1101/gr.074492.107. PMC   2336801 . PMID   18349386.
  10. Simpson JT; Durbin R (2012). "Efficient de novo assembly of large genomes using compressed data structures". Genome Research. 22 (3): 549–556. doi:10.1101/gr.126953.111. PMC   3290790 . PMID   22156294.
  11. 1 2 Pevzner PA; Tang H; Waterman MS (2001). "An Eulerian path approach to DNA fragment assembly". Proceedings of the National Academy of Sciences of the United States of America. 98 (17): 9748–9753. Bibcode:2001PNAS...98.9748P. doi: 10.1073/pnas.171285098 . PMC   55524 . PMID   11504945.
  12. Medvedev P; Scott E; Kakaradov B; Pevzner P (2011). "Error correction of high-throughput sequencing datasets with non-uniform coverage". Bioinformatics. 27 (13): i137–141. doi:10.1093/bioinformatics/btr208. PMC   3117386 . PMID   21685062.
  13. Ishoey T; Woyke T; Stepanauskas R; Novotny M; Lasken RS (2008). "Genomic sequencing of single microbial cells from environmental samples". Current Opinion in Microbiology. 11 (3): 198–204. doi:10.1016/j.mib.2008.05.006. PMC   3635501 . PMID   18550420.
  14. 1 2 3 Nikolenko SI; Korobeynikov AI; Alekseyev MA. (2012). "BayesHammer: Bayesian clustering for error correction in single-cell sequencing". BMC Genomics. 14 (Suppl 1): S7. arXiv: 1211.2756 . doi: 10.1186/1471-2164-14-S1-S7 . PMC   3549815 . PMID   23368723.
  15. Vyahhi N; Pham SK; Pevzner P (2012). "From de Bruijn Graphs to Rectangle Graphs for Genome Assembly". Algorithms in Bioinformatics. Lecture Notes in Bioinformatics. Vol. 7534. pp. 249–261. doi:10.1007/978-3-642-33122-0_20. ISBN   978-3-642-33121-3.
  16. 1 2 Chitsaz H; Yee-Greenbaum JL; Tesler G; Lombardo MJ; Dupont CL; Badger JH; Novotny M; Rusch DB; Fraser LJ; Gormley NA; Schulz-Trieglaff O; Smith GP; Evers DJ; Pevzner PA; Lasken RS (2011). "Efficient de novo assembly of single-cell bacterial genomes from short-read data sets". Nat Biotechnol. 29 (10): 915–921. doi:10.1038/nbt.1966. PMC   3558281 . PMID   21926975.
  17. Peng Y.; Leung H.C.M.; Yiu S.-M; Chin FYL (2010). "IDBA – A Practical Iterative de Bruijn Graph de Novo Assembler" . Research in Computational Molecular Biology. Lecture Notes in Computer Science. Vol. 6044. pp.  426–440. Bibcode:2010LNCS.6044..426P. CiteSeerX   10.1.1.157.195 . doi:10.1007/978-3-642-12683-3_28. hdl:10722/129571. ISBN   978-3-642-12682-6. S2CID   16328443.{{cite book}}: |journal= ignored (help)
  18. 1 2 3 4 Gurevich A; Saveliev V; Vyahhi N; Tesler G (2013). "QUAST: quality assessment tool for genome assemblies". Bioinformatics. 29 (8): 1072–1075. doi:10.1093/bioinformatics/btt086. PMC   3624806 . PMID   23422339.
  19. Li R; Zhu H; Ruan J; Qian W; Fang X; Shi Z; Li Y; Li S; Shan G; Kristiansen K; Li S; Yang H; Wang J; Wang J (2010). "De novo assembly of human genomes with massively parallel short read sequencing" (PDF). Genome Research. 20 (2): 265–272. doi:10.1101/gr.097261.109. PMC   2813482 . PMID   20019144.
  20. Peng Y; Leung HCM; Yiu SM; Chin FYL (2012). "IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth". Bioinformatics. 28 (11): 1–8. doi: 10.1093/bioinformatics/bts174 . hdl: 10722/152505 . PMID   22495754.
  21. "SPAdes Genome Assembler | Algorithmic Biology Lab".
  22. Blattner FR; Plunkett G; Bloch C; Perna N; Burland V; Riley M; Collado-Vides J; Glasner J; Rode C; Mayhew G; Gregor J; Davis N; Kirkpatrick H; Goeden M; Rose D; Mau B; Shao Y (1997). "The complete genome sequence of Escherichia coli K-12". Science. 277 (5331): 1453–1462. doi: 10.1126/science.277.5331.1453 . PMID   9278503.