Short Oligonucleotide Analysis Package

SOAP (Short Oligonucleotide Analysis Package) is a suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment, and analysis of next generation DNA sequencing data. It is particularly suited to short read sequencing data.

All programs in the SOAP package may be used free of charge and are distributed under the GPL open source software license.

Functionality

The SOAP suite of tools can be used to perform the following sequence analysis tasks:

Sequence Alignment

SOAPaligner (SOAP2) is designed specifically for fast alignment of short reads and compares favorably with similar alignment tools such as Bowtie and MAQ. [1]

Genome Assembly

SOAPdenovo is a short-read de novo assembler based on De Bruijn graph construction. It is optimized for short reads such as those generated by Illumina and is capable of assembling large genomes such as the human genome. [2] SOAPdenovo was used to assemble the genome of the giant panda. [3] It was later upgraded to SOAPdenovo2, which was optimized for large genomes and included the widely used GapCloser module. [4]
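As a rough illustration of the De Bruijn graph idea, the sketch below splits reads into k-mers, links each k-mer's (k-1)-base prefix to its (k-1)-base suffix, and follows the graph while the path is unambiguous. It is a toy example, not SOAPdenovo's implementation: the function names, the reads, and the choice of k = 4 are all hypothetical, and real assemblers also handle sequencing errors, repeats, reverse complements, and scaffolding.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has one distinct successor."""
    contig = start
    node = start
    visited = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        nxt = successors.pop()
        if nxt in visited:
            break  # cycle: stop to avoid looping forever
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads drawn from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)  # ATGGCGTGCA
```

Because every 3-mer in this toy sequence is unique, the walk recovers the full sequence. With repeated k-mers the graph branches, and the walk stops at the branch point.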

Transcriptome Assembly

SOAPdenovo-Trans is a de novo transcriptome assembler designed specifically for RNA-Seq that was created for the 1000 Plant Genomes project. [5]

Indel Discovery

SOAPindel is a tool to find insertions and deletions from next generation paired-end sequencing data, providing a list of candidate indels with quality scores. [6]

SNP Discovery

SOAPsnp is a consensus sequence builder. This tool uses the output from SOAPaligner to generate a consensus sequence which enables SNPs to be called on a newly sequenced individual.
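The general idea of building a consensus from aligned reads and then reading off SNPs can be sketched as follows. This is a simplified majority-vote illustration, not SOAPsnp's actual statistical model or its input format: the function name, the `(start, read)` alignment representation, and the data are assumptions made for the example, and base qualities, indels, and diploid genotypes are ignored.

```python
from collections import Counter


def call_consensus(reference, alignments):
    """Return (consensus, snps) from ungapped alignments given as (start, read)."""
    # One base counter per reference position (a minimal "pileup").
    pileup = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    consensus = []
    snps = []
    for pos, counts in enumerate(pileup):
        # Most frequent base at this position, or N if no read covers it.
        base = counts.most_common(1)[0][0] if counts else "N"
        consensus.append(base)
        if base != "N" and base != reference[pos]:
            snps.append((pos, reference[pos], base))
    return "".join(consensus), snps


# Hypothetical reference and read alignments (0-based start positions).
reference = "ACGTACGT"
alignments = [(0, "ACGAAC"), (2, "GAACGT"), (1, "CGAACG")]
consensus, snps = call_consensus(reference, alignments)
print(consensus)  # ACGAACGT
print(snps)       # [(3, 'T', 'A')]
```

Here all three reads carry an A at position 3 where the reference has a T, so the consensus differs from the reference at that one site.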

Structural Variation Discovery

SOAPsv is a tool to find structural variations using whole genome assembly. [7]

Quality control and preprocessing

SOAPnuke is a tool for integrated quality control and preprocessing of datasets from genomic, small RNA, Digital Gene Expression, and metagenomic experiments. [8]

History

SOAP v1

The first release of SOAP consisted only of the sequence alignment tool SOAPaligner. [9]

SOAP v2

SOAP v2 [1] improved on SOAP v1 by significantly increasing the performance of the SOAPaligner tool. Alignment time was reduced by a factor of 20 to 30, and memory usage by a factor of 3. Support for compressed file formats was also added.

The SOAP suite was then expanded to include the new tools SOAPdenovo 1 and 2, SOAPindel, SOAPsnp, and SOAPsv.

SOAP v3

SOAP v3 extended the alignment tool, making it the first short-read aligner to utilize GPU processors. [10] As a result, SOAPaligner significantly outperformed the competing aligners Bowtie and BWA in terms of speed.

References

  1. Li, R.; Yu, C.; Li, Y.; Lam, T.-W.; Yiu, S.-M.; Kristiansen, K.; Wang, J. (2009). "SOAP2: an improved ultrafast tool for short read alignment". Bioinformatics. 25 (15): 1966–1967. doi:10.1093/bioinformatics/btp336. ISSN 1367-4803. PMID 19497933.
  2. Li, R.; Zhu, H.; Ruan, J.; Qian, W.; Fang, X.; Shi, Z.; Li, Y.; Li, S.; Shan, G.; Kristiansen, K.; Li, S.; Yang, H.; Wang, J.; Wang, J. (2009). "De novo assembly of human genomes with massively parallel short read sequencing". Genome Research. 20 (2): 265–272. doi:10.1101/gr.097261.109. ISSN 1088-9051. PMC 2813482. PMID 20019144.
  3. Li, Ruiqiang; Fan, Wei; Tian, Geng; Zhu, Hongmei; He, Lin; Cai, Jing; Huang, Quanfei; Cai, Qingle; Li, Bo; Bai, Yinqi; Zhang, Zhihe; Zhang, Yaping; Wang, Wen; Li, Jun; Wei, Fuwen; Li, Heng; Jian, Min; Li, Jianwen; Zhang, Zhaolei; Nielsen, Rasmus; Li, Dawei; Gu, Wanjun; Yang, Zhentao; Xuan, Zhaoling; Ryder, Oliver A.; Leung, Frederick Chi-Ching; Zhou, Yan; Cao, Jianjun; Sun, Xiao; et al. (2009). "The sequence and de novo assembly of the giant panda genome". Nature. 463 (7279): 311–317. Bibcode:2010Natur.463..311L. doi:10.1038/nature08696. ISSN 0028-0836. PMC 3951497. PMID 20010809.
  4. Luo, Ruibang; Liu, Binghang; Xie, Yinlong; Li, Zhenyu; Huang, Weihua; Yuan, Jianying; He, Guangzhu; Chen, Yanxiang; Pan, Qi; Liu, Yunjie; Tang, Jingbo (2012-12-01). "SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler". GigaScience. 1 (1): 18. doi:10.1186/2047-217X-1-18. PMC 3626529. PMID 23587118.
  5. Xie, Yinlong; Wu, Gengxiong; Tang, Jingbo; Luo, Ruibang; Patterson, Jordan; Liu, Shanlin; Huang, Weihua; He, Guangzhu; Gu, Shengchang; Li, Shengkang; Zhou, Xin (2014-06-15). "SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads". Bioinformatics. 30 (12): 1660–1666. arXiv:1305.6760. doi:10.1093/bioinformatics/btu077. ISSN 1367-4803. PMID 24532719.
  6. Li, Shengting; Li, Ruiqiang; Li, Heng; Lu, Jianliang; Li, Yingrui; Bolund, Lars; Schierup, Mikkel H.; Wang, Jun (2013-01-01). "SOAPindel: Efficient identification of indels from short paired reads". Genome Research. 23 (1): 195–200. doi:10.1101/gr.132480.111. ISSN 1088-9051. PMC 3530679. PMID 22972939.
  7. Li, Yingrui; Zheng, Hancheng; Luo, Ruibang; Wu, Honglong; Zhu, Hongmei; Li, Ruiqiang; Cao, Hongzhi; Wu, Boxin; Huang, Shujia; Shao, Haojing; Ma, Hanzhou (August 2011). "Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly". Nature Biotechnology. 29 (8): 723–730. doi:10.1038/nbt.1904. ISSN 1546-1696. PMID 21785424.
  8. Chen, Yuxin; Chen, Yongsheng; Shi, Chunmei; Huang, Zhibo; Zhang, Yong; Li, Shengkang; Li, Yan; Ye, Jia; Yu, Chang; Li, Zhuo; Zhang, Xiuqing (2018-01-01). "SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data". GigaScience. 7 (1): 1–6. doi:10.1093/gigascience/gix120. PMC 5788068. PMID 29220494.
  9. Li, R.; Li, Y.; Kristiansen, K.; Wang, J. (2008). "SOAP: short oligonucleotide alignment program". Bioinformatics. 24 (5): 713–714. doi:10.1093/bioinformatics/btn025. ISSN 1367-4803. PMID 18227114.
  10. Liu, C.-M.; Wong, T.; Wu, E.; Luo, R.; Yiu, S.-M.; Li, Y.; Wang, B.; Yu, C.; Chu, X.; Zhao, K.; Li, R.; Lam, T.-W. (2012). "SOAP3: ultra-fast GPU-based parallel alignment tool for short reads". Bioinformatics. 28 (6): 878–879. doi:10.1093/bioinformatics/bts061. ISSN 1367-4803. PMID 22285832.