Developer(s) | St. Petersburg State University, Russia; St. Petersburg Academic University, Russia; University of California, San Diego, USA |
---|---|
Stable release | 4.0.0 / June 3rd, 2024 |
Repository | github |
Written in | C++, C, Python, Perl |
Operating system | Linux, macOS |
Type | Bioinformatics |
License | GNU General Public License version 2 (GPLv2) |
Website | ablab |
SPAdes (St. Petersburg genome assembler) [1] is a genome assembly algorithm designed for single-cell and multi-cell bacterial data sets. It may therefore be unsuitable for large genome projects. [1] [2]
SPAdes works with Ion Torrent, PacBio, Oxford Nanopore, and Illumina paired-end, mate-pair and single reads. [1] SPAdes has been integrated into Galaxy pipelines by Guy Lionel and Philip Mabon. [3]
Studying the genomes of single cells helps to track changes that occur in DNA over time or in response to different conditions. Many projects, such as the Human Microbiome Project and antibiotic discovery, would also benefit greatly from single-cell sequencing (SCS). [4] [5] SCS has an advantage over sequencing DNA extracted from a large number of cells: it avoids averaging out the significant variation between cells. [6] Experimental and computational technologies are being optimized to allow researchers to sequence single cells. Amplifying the DNA extracted from a single cell, for instance, is one of the experimental challenges, because maximizing the accuracy and quality of SCS requires uniform DNA amplification. Multiple annealing and looping-based amplification cycles (MALBAC) were shown to introduce less amplification bias than polymerase chain reaction (PCR) or multiple displacement amplification (MDA). [7] Furthermore, it has been recognized that the challenges facing SCS are computational rather than experimental. [8]
Existing assemblers such as Velvet, [9] String Graph Assembler (SGA) [10] and EULER-SR [11] were not designed to handle SCS assembly. [2] Assembly of single-cell data is difficult because of non-uniform read coverage, variation in insert length, high levels of sequencing errors and chimeric reads. [8] [12] [13] SPAdes was designed as a new algorithmic approach to address these issues.
SPAdes uses k-mers to build the initial de Bruijn graph; in subsequent stages it performs graph-theoretical operations based on graph structure, coverage and sequence lengths. It also corrects errors iteratively. [2] The stages of assembly in SPAdes are: [2]
SPAdes was designed to overcome the problems associated with the assembly of single-cell data as follows: [2]
1. Non-uniform coverage. SPAdes uses a multisized de Bruijn graph, which allows different values of k to be employed. Smaller values of k are suggested for low-coverage regions to minimize fragmentation, and larger values of k for high-coverage regions to reduce repeat collapsing (stage 1 above).
2. Variable insert sizes of paired-end reads. SPAdes builds on the basic concept of paired de Bruijn graphs. However, paired de Bruijn graphs work well only on paired-end reads with a fixed insert size, so SPAdes estimates 'distances' instead of using 'insert sizes'. For a paired-end read with read length L, the distance d is defined as d = insert size – L. The k-bimer adjustment approach allows distances to be estimated accurately. A k-bimer (α|β,d) consists of k-mers α and β together with the estimated distance d between them in the genome. This approach breaks the paired-end reads into pairs of k-mers, which are transformed into pairs of edges (biedges) in the de Bruijn graph. These sets of biedges are used to estimate the distances between edges on paths between k-mers α and β. The estimates are clustered, and the optimal distance estimate is chosen from each cluster (stage 2 above). SPAdes then uses rectangle graphs to construct the paired de Bruijn graph (stage 3). The rectangle graph approach was first introduced in 2012 [15] to construct paired de Bruijn graphs with doubtful distances.
3. Bulges, tips and chimeras. Bulges and tips arise from errors in the middle and at the ends of reads, respectively. A chimeric connection joins two unrelated substrings of the genome. SPAdes identifies these based on graph topology and on the length and coverage of the non-branching paths they contain. SPAdes keeps a data structure that allows it to backtrack all corrections and removals.
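The trade-off behind item 1 can be illustrated with a toy example. The following is a minimal sketch, not SPAdes code: it builds an ordinary de Bruijn graph for one value of k at a time and does not combine the graphs as a multisized de Bruijn graph would. The function names and the toy reads are assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def count_branching_nodes(graph):
    """Nodes with more than one successor; repeats collapsed by a small k show up here."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


def count_components(graph):
    """Number of weakly connected components; fragmentation shows up here."""
    neighbours = defaultdict(set)
    for node, successors in graph.items():
        for nxt in successors:
            neighbours[node].add(nxt)
            neighbours[nxt].add(node)
    neighbours = dict(neighbours)
    seen, components = set(), 0
    for start in neighbours:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbours[node] - seen)
    return components


repeat_read = ["ACGTACGA"]           # hypothetical read in which "ACG" occurs twice
sparse_reads = ["ACGTAC", "TACGGA"]  # hypothetical reads overlapping by only 3 bases

for k in (3, 5):
    print(k,
          count_branching_nodes(de_bruijn_graph(repeat_read, k)),
          count_components(de_bruijn_graph(sparse_reads, k)))
```

With k = 3 the repeated "ACG" creates one branching node but the two sparse reads stay connected (output `3 1 1`); with k = 5 the repeat is resolved but the sparse reads fall into two components (output `5 0 2`).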
SPAdes modifies the previously used bulge removal approach [16] and the iterative de Bruijn graph approach of Peng et al. (2010) [17] to create a new approach called ‘‘bulge corremoval’’, which stands for bulge correction and removal. The algorithm can be summarized as follows. A simple bulge is formed by two short, similar paths (P and Q) connecting the same hubs. If P is a non-branching path (h-path), SPAdes maps every edge in P to its projection in Q and removes P from the graph, which increases the coverage of Q. Unlike other assemblers, which remove bulges using a fixed coverage cut-off, SPAdes removes or projects low-coverage h-paths step by step. It does so by using gradually increasing cut-off thresholds and iterating through all h-paths in increasing order of coverage (for bulge corremoval and chimeric removal) or length (for tip removal). Moreover, to guarantee that no new sources or sinks are introduced into the graph, SPAdes deletes an h-path (in chimeric h-path removal) or projects it (in bulge corremoval) only if its start vertex has at least two outgoing edges and its end vertex has at least two ingoing edges. This helps to remove low-coverage h-paths that arise from sequencing errors and chimeric reads, but not those that arise from repeats.
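The threshold schedule and the source/sink guard described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the SPAdes implementation: each h-path is condensed to a hypothetical `(start, end, coverage)` tuple, paths are only deleted (no projection onto a sibling path is performed), and the function name and threshold values are invented for the example.

```python
def remove_low_coverage_paths(paths, thresholds):
    """Delete low-coverage h-paths using gradually increasing coverage cut-offs.

    paths: dict mapping a path name to (start_vertex, end_vertex, coverage).
    A path is deleted only if its start vertex has at least two outgoing paths
    and its end vertex has at least two ingoing paths, so that the deletion
    cannot create a new source or sink.
    """
    paths = dict(paths)  # work on a copy
    for cutoff in sorted(thresholds):
        # Visit paths in increasing order of coverage.
        for name in sorted(paths, key=lambda n: paths[n][2]):
            start, end, coverage = paths[name]
            if coverage >= cutoff:
                break  # every remaining path is at or above this cut-off
            out_degree = sum(1 for s, _, _ in paths.values() if s == start)
            in_degree = sum(1 for _, e, _ in paths.values() if e == end)
            if out_degree >= 2 and in_degree >= 2:
                del paths[name]
    return paths


toy_paths = {
    "main": ("A", "B", 30),
    "err": ("A", "B", 2),    # low-coverage path parallel to "main"
    "next": ("B", "C", 28),
    "low": ("C", "D", 3),    # low coverage, but the only path leaving C
}
print(sorted(remove_low_coverage_paths(toy_paths, thresholds=(5, 10))))
```

In this toy graph the path "err" is deleted, while "low" survives despite its coverage because removing it would turn C into a sink; the script prints `['low', 'main', 'next']`.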
SPAdes is composed of the following tools: [1]
A study [18] compared several genome assemblers on single-cell E. coli samples: EULER-SR, [11] Velvet, [9] SOAPdenovo, [19] Velvet-SC, EULER + Velvet-SC (E+V-SC), [16] IDBA-UD [20] and SPAdes. IDBA-UD and SPAdes performed best. [18] SPAdes had the largest NG50 (99,913; the NG50 statistic is the same as N50 except that the genome size is used rather than the assembly size). [21] Moreover, measured against the E. coli reference genome, [22] SPAdes assembled the highest percentage of the genome (97%) and the highest number of complete genes (4,071 out of 4,324). [18] The assemblers’ performances were as follows: [18]
IDBA-UD < Velvet < E+V-SC < SPAdes < EULER-SR < Velvet-SC < SOAPdenovo
SPAdes > IDBA-UD >>> E+V-SC > EULER-SR > Velvet > Velvet-SC > SOAPdenovo
IDBA-UD > SPAdes >> EULER-SR > Velvet = E+V-SC > Velvet-SC > SOAPdenovo
SPAdes > IDBA-UD > E+V-SC > Velvet-SC > EULER-SR > SOAPdenovo > Velvet
E+V-SC = Velvet = Velvet-SC < SOAPdenovo < IDBA-UD < SPAdes < EULER-SR
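The difference between N50 and NG50 mentioned above can be made concrete with a short sketch. The function names and the contig lengths below are illustrative assumptions, not data from the cited study.

```python
def _length_at_half(contig_lengths, total):
    """Length of the contig at which the running sum first reaches half of `total`."""
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None  # the contigs cover less than half of `total`


def n50(contig_lengths):
    """N50: the reference total is the assembly size."""
    return _length_at_half(contig_lengths, sum(contig_lengths))


def ng50(contig_lengths, genome_size):
    """NG50: the reference total is the genome size."""
    return _length_at_half(contig_lengths, genome_size)


lengths = [80, 70, 50, 40, 30, 20]     # hypothetical contig lengths, total 290
print(n50(lengths))                    # 70
print(ng50(lengths, genome_size=400))  # 50
```

Because NG50 is measured against the genome size rather than the assembly size, it does not reward an assembler for simply producing a smaller assembly, which makes it better suited to comparing different assemblers on the same genome.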
In genetics, shotgun sequencing is a method used for sequencing random DNA strands. It is named by analogy with the rapidly expanding, quasi-random shot grouping of a shotgun.
In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology might not be able to 'read' whole genomes in one go, but rather reads small pieces of between 20 and 30,000 bases, depending on the technology used. Typically, the short fragments (reads) result from shotgun sequencing genomic DNA, or gene transcript (ESTs).
Metagenomics is the study of genetic material recovered directly from environmental or clinical samples by a method called sequencing. The broad field may also be referred to as environmental genomics, ecogenomics, community genomics or microbiomics.
In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has m^n vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence. For a set of m symbols S = {s1, …, sm}, the set of vertices is V = S^n, that is, every length-n sequence from (s1, …, s1, s1) to (sm, …, sm, sm).
In bioinformatics, k-mers are substrings of length k contained within a biological sequence. They are used primarily in computational genomics and sequence analysis, where k-mers are composed of nucleotides, to assemble DNA sequences, improve heterologous gene expression, identify species in metagenomic samples, and create attenuated vaccines. Usually, the term k-mer refers to all of a sequence's subsequences of length k, such that the sequence AGAT would have four monomers, three 2-mers, two 3-mers and one 4-mer (AGAT). More generally, a sequence of length L will have L − k + 1 k-mers and n^k total possible k-mers, where n is the number of possible monomers.
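The AGAT example can be checked with a few lines of code. This is a minimal sketch; the function name `kmers` and the symbols L (sequence length) and n (alphabet size) are notation chosen here for illustration.

```python
def kmers(sequence, k):
    """Return every substring of length k, in order of position (repeats included)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


sequence = "AGAT"
for k in range(1, len(sequence) + 1):
    found = kmers(sequence, k)
    # A sequence of length L contains L - k + 1 k-mers.
    assert len(found) == len(sequence) - k + 1
    print(k, found)

# Over an alphabet of n monomers there are n**k possible k-mers,
# for example 4**3 = 64 possible DNA 3-mers.
print(4 ** 3)
```

For AGAT this lists four monomers, three 2-mers (AG, GA, AT), two 3-mers (AGA, GAT) and one 4-mer.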
Velvet is an algorithm package that has been designed to deal with de novo genome assembly and short read sequencing alignments. This is achieved through the manipulation of de Bruijn graphs for genomic sequence assembly via the removal of errors and the simplification of repeated regions. Velvet has also been implemented in commercial packages, such as Sequencher, Geneious, MacVector and BioNumerics.
RNA-Seq is a technique that uses next-generation sequencing to reveal the presence and quantity of RNA molecules in a biological sample, providing a snapshot of gene expression in the sample, also known as transcriptome.
SOAP is a suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment, and analysis of next generation DNA sequencing data. It is particularly suited to short read sequencing data.
In bioinformatics, hybrid genome assembly refers to utilizing various sequencing technologies to achieve the task of assembling a genome from fragmented, sequenced DNA resulting from shotgun sequencing. Genome assembly presents one of the most challenging tasks in genome sequencing as most modern DNA sequencing technologies can only produce reads that are, on average, 25-300 base pairs in length. This is orders of magnitude smaller than the average size of a genome. This assembly is computationally difficult and has some inherent challenges, one of these challenges being that genomes often contain complex tandem repeats of sequences that can be thousands of base pairs in length. These repeats can be long enough that second generation sequencing reads are not long enough to bridge the repeat, and, as such, determining the location of each repeat in the genome can be difficult. Resolving these tandem repeats can be accomplished by utilizing long third generation sequencing reads, such as those obtained using the PacBio RS DNA sequencer. These sequences are, on average, 10,000-15,000 base pairs in length and are long enough to span most repeated regions. Using a hybrid approach to this process can increase the fidelity of assembling tandem repeats by being able to accurately place them along a linear scaffold and make the process more computationally efficient.
De novo transcriptome assembly is the de novo sequence assembly method of creating a transcriptome without the aid of a reference genome.
In DNA sequencing, a read is an inferred sequence of base pairs corresponding to all or part of a single DNA fragment. A typical sequencing experiment involves fragmentation of the genome into millions of molecules, which are size-selected and ligated to adapters. The set of fragments is referred to as a sequencing library, which is sequenced to produce a set of reads.
Scaffolding is a technique used in bioinformatics. It is defined as follows:
Link together a non-contiguous series of genomic sequences into a scaffold, consisting of sequences separated by gaps of known length. The sequences that are linked are typically contiguous sequences corresponding to read overlaps.
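The definition above can be sketched in a few lines. This is an illustration only: it assumes the contig order, orientation and gap lengths are already known, and it represents each gap as a run of `N` characters, a common convention in sequence files. The function name and the toy contigs are hypothetical.

```python
def build_scaffold(contigs, gap_lengths):
    """Join contigs into one scaffold, filling each gap with 'N' characters.

    contigs: contig sequences in scaffold order.
    gap_lengths: known gap length between each pair of neighbouring contigs.
    """
    if not contigs or len(gap_lengths) != len(contigs) - 1:
        raise ValueError("need exactly one gap length between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_lengths, contigs[1:]):
        parts.append("N" * gap)
        parts.append(contig)
    return "".join(parts)


print(build_scaffold(["ACGT", "GGA", "TTC"], [3, 2]))  # ACGTNNNGGANNTTC
```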
Single-cell DNA template strand sequencing, or Strand-seq, is a technique for the selective sequencing of a daughter cell's parental template strands. This technique offers a wide variety of applications, including the identification of sister chromatid exchanges in the parental cell prior to segregation, the assessment of non-random segregation of sister chromatids, the identification of misoriented contigs in genome assemblies, de novo genome assembly of both haplotypes in diploid organisms including humans, whole-chromosome haplotyping, and the identification of germline and somatic genomic structural variation, the latter of which can be detected robustly even in single cells.
In bioinformatics, a DNA read error occurs when a sequencer reports one DNA base as a different base. The reads can then be used to create a de Bruijn graph, which can be used in various ways to find errors.
De novo sequence assemblers are a type of program that assembles short nucleotide sequences into longer ones without the use of a reference genome. These are most commonly used in bioinformatic studies to assemble genomes or transcriptomes. Two common types of de novo assemblers are greedy algorithm assemblers and De Bruijn graph assemblers.
Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.
A plant genome assembly represents the complete genomic sequence of a plant species, which is assembled into chromosomes and other organelles by using DNA fragments that are obtained from different types of sequencing technology.
Bloom filters are space-efficient probabilistic data structures used to test whether an element is part of a set. Bloom filters require much less space than other data structures for representing sets; the downside is that queries have a false positive rate. Because multiple elements may have the same hash values for a number of hash functions, querying for a non-existent element may return a positive if another element with the same hash values has been added to the Bloom filter. Assuming that the hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of querying a Bloom filter is a function of the number of bits, the number of hash functions and the number of elements in the Bloom filter. This allows the user to manage the risk of getting a false positive by compromising on the space benefits of the Bloom filter.
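A minimal sketch of such a filter follows, together with the standard approximation of its false positive rate, (1 − e^(−hn/m))^h for m bits, h hash functions and n stored elements. The class and function names are illustrative, and deriving the hash functions from salted SHA-256 digests is an assumption made for simplicity rather than a recommendation.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _indexes(self, item):
        # Derive num_hashes indexes by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for index in self._indexes(item):
            self.bits[index] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[index] for index in self._indexes(item))


def expected_false_positive_rate(num_bits, num_hashes, num_items):
    """Approximate false positive rate under the uniform-hashing assumption."""
    return (1 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


kmer_filter = BloomFilter(num_bits=1000, num_hashes=3)
for kmer in ("ACG", "CGT", "GTA"):
    kmer_filter.add(kmer)
print("CGT" in kmer_filter)  # True: added elements are always reported as present
print(round(expected_false_positive_rate(1000, 3, 100), 4))  # 0.0174
```

Increasing `num_bits` for the same number of elements lowers the expected false positive rate, which is the space-versus-accuracy trade-off described above.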
Genome skimming is a sequencing approach that uses low-pass, shallow sequencing of a genome to generate fragments of DNA, known as genome skims. These genome skims contain information about the high-copy fraction of the genome, which consists of the ribosomal DNA, plastid genome (plastome), mitochondrial genome (mitogenome), and nuclear repeats such as microsatellites and transposable elements. Genome skimming employs high-throughput, next-generation sequencing technology to generate these skims. Although the skims are merely 'the tip of the genomic iceberg', phylogenomic analysis of them can still provide insights into evolutionary history and biodiversity at a lower cost and larger scale than traditional methods. Because genome skimming requires only a small amount of DNA, its methodology can be applied in fields other than genomics. Such applications include determining the traceability of products in the food industry, enforcing international regulations regarding biodiversity and biological resources, and forensics.