Shotgun sequencing

In genetics, shotgun sequencing is a method used for sequencing random DNA strands. It is named by analogy with the rapidly expanding, quasi-random shot grouping of a shotgun.

The chain-termination method of DNA sequencing ("Sanger sequencing") can only be used for short DNA strands of 100 to 1000 base pairs. Due to this size limit, longer sequences are subdivided into smaller fragments that can be sequenced separately, and these sequences are assembled to give the overall sequence.

In shotgun sequencing, [1] [2] DNA is broken up randomly into numerous small segments, which are sequenced using the chain termination method to obtain reads. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence. [1]
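The random sampling step described above can be mimicked in a few lines. This is only a toy sketch: the function name `shotgun_reads`, the fixed read length, and the seeded random generator are illustrative assumptions, and real fragmentation yields fragments of varying length with sequencing errors.

```python
import random


def shotgun_reads(genome: str, num_reads: int, read_length: int, seed: int = 0) -> list[str]:
    """Sample reads of a fixed length from uniformly random start positions."""
    rng = random.Random(seed)  # seeded so that the sketch is reproducible
    reads = []
    for _ in range(num_reads):
        # Choose a start so that the read fits entirely inside the genome.
        start = rng.randrange(0, len(genome) - read_length + 1)
        reads.append(genome[start:start + read_length])
    return reads


if __name__ == "__main__":
    target = "AGCATGCTGCAGTCATGCTTAGGCTA"
    for read in shotgun_reads(target, num_reads=8, read_length=10):
        print(read)
```

Because the start positions are random, some bases are sampled many times and others possibly not at all, which is why many overlapping reads are needed.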

Shotgun sequencing was one of the precursor technologies that enabled whole genome sequencing.

Example

For example, consider the following two rounds of shotgun reads:

Strand                    Sequence
Original                  AGCATGCTGCAGTCATGCTTAGGCTA
First shotgun sequence    AGCATGCTGCAGTCATGCT-------
                          -------------------TAGGCTA
Second shotgun sequence   AGCATG--------------------
                          ------CTGCAGTCATGCTTAGGCTA
Reconstruction            AGCATGCTGCAGTCATGCTTAGGCTA

In this extremely simplified example, none of the reads cover the full length of the original sequence, but the four reads can be assembled into the original sequence using the overlap of their ends to align and order them. In reality, this process uses enormous amounts of information that are rife with ambiguities and sequencing errors. Assembly of complex genomes is additionally complicated by the great abundance of repetitive sequences, meaning similar short reads could come from completely different parts of the sequence.
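The reconstruction in this example can be sketched with a greedy overlap merge. This is a toy illustration rather than how production assemblers work: the function names, the `min_len` threshold, and the assumption of error-free reads from a single strand are all illustrative.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the longest suffix/prefix overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Reads wholly contained in another read add no new sequence.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the result stays as separate pieces
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    four_reads = [
        "AGCATGCTGCAGTCATGCT",
        "TAGGCTA",
        "AGCATG",
        "CTGCAGTCATGCTTAGGCTA",
    ]
    print(greedy_assemble(four_reads))  # ['AGCATGCTGCAGTCATGCTTAGGCTA']
```

Greedy merging is easily misled by repeats, since a repeated segment produces equally good overlaps between reads from different parts of the sequence.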

Many overlapping reads for each segment of the original DNA are necessary to overcome these difficulties and accurately assemble the sequence. For example, to complete the Human Genome Project, most of the human genome was sequenced at 12X or greater coverage; that is, each base in the final sequence was present on average in 12 different reads. Even so, as of 2004, the methods then available had failed to isolate or assemble reliable sequence for approximately 1% of the (euchromatic) human genome. [3]

Whole genome shotgun sequencing

History

Whole genome shotgun sequencing for small (4000- to 7000-base-pair) genomes was first suggested in 1979. [1] The first genome sequenced by shotgun sequencing was that of cauliflower mosaic virus, published in 1981. [4] [5]

Paired-end sequencing

Broader application benefited from pairwise end sequencing, known colloquially as double-barrel shotgun sequencing. As sequencing projects began to take on longer and more complicated DNA sequences, multiple groups began to realize that useful information could be obtained by sequencing both ends of a fragment of DNA. Although sequencing both ends of the same fragment and keeping track of the paired data was more cumbersome than sequencing a single end of two distinct fragments, the knowledge that the two sequences were oriented in opposite directions and were about the length of a fragment apart from each other was valuable in reconstructing the sequence of the original target fragment.

History. The first published description of the use of paired ends was in 1990 [6] as part of the sequencing of the human HGPRT locus, although the use of paired ends was limited to closing gaps after the application of a traditional shotgun sequencing approach. The first theoretical description of a pure pairwise end sequencing strategy, assuming fragments of constant length, was in 1991. [7] At the time, there was community consensus that the optimal fragment length for pairwise end sequencing would be three times the sequence read length. In 1995 Roach et al. [8] introduced the innovation of using fragments of varying sizes, and demonstrated that a pure pairwise end-sequencing strategy would be possible on large targets. The strategy was subsequently adopted by The Institute for Genomic Research (TIGR) to sequence the genome of the bacterium Haemophilus influenzae in 1995, [9] and then by Celera Genomics to sequence the Drosophila melanogaster (fruit fly) genome in 2000, [10] and subsequently the human genome.

Approach

To apply the strategy, a high-molecular-weight DNA strand is sheared into random fragments, size-selected (usually 2, 10, 50, and 150 kb), and cloned into an appropriate vector. The clones are then sequenced from both ends using the chain termination method, yielding two short sequences. Each sequence is called an end-read (read 1 and read 2), and the two reads from the same clone are referred to as mate pairs. Since the chain termination method can usually only produce reads between 500 and 1000 bases long, mate pairs will rarely overlap in all but the smallest clones.
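A minimal sketch of how the two end-reads relate to a cloned insert is shown below. The names `mate_pair` and `mates_overlap` are hypothetical, and the sketch assumes that read 2 is reported as the reverse complement of the far end of the insert, which is what makes the two mates point toward each other.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def mate_pair(insert: str, read_length: int) -> tuple[str, str]:
    """Read 1 comes from the start of the insert, read 2 from the far end (opposite strand)."""
    read1 = insert[:read_length]
    read2 = reverse_complement(insert[-read_length:])
    return read1, read2


def mates_overlap(insert_length: int, read_length: int) -> bool:
    """The two end-reads overlap only when the insert is shorter than two read lengths."""
    return 2 * read_length > insert_length


if __name__ == "__main__":
    print(mate_pair("AGCATGCTGCAGTCATGCTTAGGCTA", 6))  # ('AGCATG', 'TAGCCT')
    print(mates_overlap(insert_length=2000, read_length=800))  # False
```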

Assembly

The original sequence is reconstructed from the reads using sequence assembly software. First, overlapping reads are collected into longer composite sequences known as contigs. Contigs can be linked together into scaffolds by following connections between mate pairs. The distance between contigs can be inferred from the mate pair positions if the average fragment length of the library is known and has a narrow window of deviation. Depending on the size of the gap between contigs, different techniques can be used to find the sequence in the gaps. If the gap is small (5–20 kb), polymerase chain reaction (PCR) is used to amplify the region, which is then sequenced. If the gap is large (>20 kb), the large fragment is cloned in special vectors such as bacterial artificial chromosomes (BACs), and the cloned insert is then sequenced.
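The gap inference from mate pairs can be illustrated with simple arithmetic. The sketch below assumes a hypothetical input format: each link gives the position where read 1 starts in the left contig and the position where read 2 ends in the right contig, and every fragment is assumed to have the library's average length.

```python
from statistics import mean


def estimate_gap(insert_size: int, left_contig_length: int, links: list[tuple[int, int]]) -> float:
    """Average gap implied by mate pairs that bridge two neighbouring contigs.

    Each link is (start_in_left, end_in_right): the fragment covers the left
    contig from start_in_left to its end, then the gap, then the right contig
    from its first base up to end_in_right.
    """
    gaps = []
    for start_in_left, end_in_right in links:
        covered_in_left = left_contig_length - start_in_left
        gaps.append(insert_size - covered_in_left - end_in_right)
    return mean(gaps)


if __name__ == "__main__":
    # Two bridging mate pairs from a 2,000 bp library; the left contig is 5,000 bp long.
    print(estimate_gap(2000, 5000, [(4200, 700), (4500, 1000)]))  # 500
```

The estimate is only as good as the fragment-length distribution is narrow, which is why the spread of the library matters.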

Pros and cons

Proponents of this approach argue that it is possible to sequence the whole genome at once using large arrays of sequencers, which makes the whole process much more efficient than more traditional approaches. Detractors argue that although the technique quickly sequences large regions of DNA, its ability to correctly link these regions is suspect, particularly for eukaryotic genomes with repeating regions. As sequence assembly programs become more sophisticated and computing power becomes cheaper, it may be possible to overcome this limitation.

Coverage

Coverage (read depth or depth) is the average number of reads representing a given nucleotide in the reconstructed sequence. It can be calculated from the length of the original genome (G), the number of reads (N), and the average read length (L) as N × L / G. For example, a hypothetical genome with 2,000 base pairs reconstructed from 8 reads with an average length of 500 nucleotides will have 8 × 500 / 2,000 = 2x redundancy. This parameter also enables one to estimate other quantities, such as the percentage of the genome covered by reads (sometimes also called coverage). A high coverage in shotgun sequencing is desired because it can overcome errors in base calling and assembly. The subject of DNA sequencing theory addresses the relationships of such quantities.
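The coverage calculation, N × L / G, is short enough to write out directly. The second function is an added assumption not stated in the text above: it uses the common Poisson (Lander–Waterman style) approximation, under which the expected fraction of bases hit by at least one read is 1 − e^(−coverage) when reads land uniformly at random.

```python
import math


def coverage(genome_length: int, num_reads: int, avg_read_length: float) -> float:
    """Average depth: total sequenced bases divided by genome length (N * L / G)."""
    return num_reads * avg_read_length / genome_length


def expected_fraction_covered(depth: float) -> float:
    """Expected fraction of bases covered at least once, assuming Poisson-distributed depth."""
    return 1.0 - math.exp(-depth)


if __name__ == "__main__":
    depth = coverage(genome_length=2000, num_reads=8, avg_read_length=500)
    print(depth)  # 2.0
    print(round(expected_fraction_covered(depth), 4))  # 0.8647
```

Under this approximation, 2x coverage would still leave roughly 13.5% of the bases unsampled, which illustrates why much higher depth is used in practice.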

Sometimes a distinction is made between sequence coverage and physical coverage. Sequence coverage is the average number of times a base is read (as described above). Physical coverage is the average number of times a base is read or spanned by mate paired reads. [11]
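The difference between the two measures can be shown with a small calculation. In this sketch the input format is an assumption: each mate-paired fragment is given by its start and end position on the genome, and both ends are read to the same fixed length.

```python
def sequence_and_physical_coverage(
    genome_length: int,
    fragments: list[tuple[int, int]],
    read_length: int,
) -> tuple[float, float]:
    """Return (sequence coverage, physical coverage) for mate-paired fragments.

    Sequence coverage counts only the bases inside the two end-reads;
    physical coverage counts every base spanned by the whole fragment.
    """
    read_bases = 0
    spanned_bases = 0
    for start, end in fragments:
        fragment_length = end - start
        # The two end-reads cannot cover more bases than the fragment contains.
        read_bases += min(2 * read_length, fragment_length)
        spanned_bases += fragment_length
    return read_bases / genome_length, spanned_bases / genome_length


if __name__ == "__main__":
    frags = [(0, 2000), (1500, 3500), (5000, 7000)]
    print(sequence_and_physical_coverage(10_000, frags, read_length=500))  # (0.3, 0.6)
```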

Hierarchical shotgun sequencing

In whole genome shotgun sequencing (top), the entire genome is sheared randomly into small fragments (appropriately sized for sequencing) and then reassembled. In hierarchical shotgun sequencing (bottom), the genome is first broken into larger segments. After the order of these segments is deduced, they are further sheared into fragments appropriately sized for sequencing.

Although shotgun sequencing can in theory be applied to a genome of any size, its direct application to the sequencing of large genomes (for instance, the human genome) was limited until the late 1990s, when technological advances made practical the handling of the vast quantities of complex data involved in the process. [12] Historically, full-genome shotgun sequencing was believed to be limited by both the sheer size of large genomes and by the complexity added by the high percentage of repetitive DNA (greater than 50% for the human genome) present in large genomes. [13] It was not widely accepted that a full-genome shotgun sequence of a large genome would provide reliable data. For these reasons, other strategies that lowered the computational load of sequence assembly had to be utilized before shotgun sequencing was performed. [13] In hierarchical sequencing, also known as top-down sequencing, a low-resolution physical map of the genome is made prior to actual sequencing. From this map, a minimal number of fragments that cover the entire chromosome are selected for sequencing. [14] In this way, the minimum amount of high-throughput sequencing and assembly is required.

The amplified genome is first sheared into larger pieces (50-200kb) and cloned into a bacterial host using BACs or P1-derived artificial chromosomes (PAC). Because multiple genome copies have been sheared at random, the fragments contained in these clones have different ends, and with enough coverage (see section above) finding the smallest possible scaffold of BAC contigs that covers the entire genome is theoretically possible. This scaffold is called the minimum tiling path.
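If the clone positions are taken as already known intervals, choosing a minimum tiling path reduces to a classic interval-covering problem. The sketch below uses a greedy strategy under that assumption; the clone names and coordinates are made up, and in a real project the positions come from an approximate physical map rather than exact coordinates.

```python
def minimum_tiling_path(
    clones: dict[str, tuple[int, int]],
    region_start: int,
    region_end: int,
) -> list[str]:
    """Greedy interval cover: repeatedly pick the clone that extends coverage the furthest."""
    ordered = sorted(clones.items(), key=lambda item: item[1][0])
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best_name, best_end = None, covered_to
        # Consider every clone that starts at or before the covered boundary.
        while i < len(ordered) and ordered[i][1][0] <= covered_to:
            name, (_, end) = ordered[i]
            if end > best_end:
                best_name, best_end = name, end
            i += 1
        if best_name is None:
            raise ValueError(f"no clone covers position {covered_to}")
        path.append(best_name)
        covered_to = best_end
    return path


if __name__ == "__main__":
    # Hypothetical BAC clones with positions in kb.
    bacs = {
        "A": (0, 150), "B": (50, 200), "C": (100, 300),
        "D": (180, 350), "E": (290, 450), "F": (320, 500),
    }
    print(minimum_tiling_path(bacs, 0, 500))  # ['A', 'C', 'E', 'F']
```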

A BAC contig that covers the entire genomic area of interest makes up the tiling path.

Once a tiling path has been found, the BACs that form this path are sheared at random into smaller fragments and can be sequenced using the shotgun method on a smaller scale. [15]

Although the full sequences of the BAC contigs are not known, their orientations relative to one another are known. There are several methods for deducing this order and selecting the BACs that make up a tiling path. The general strategy involves identifying the positions of the clones relative to one another and then selecting the fewest clones required to form a contiguous scaffold that covers the entire area of interest. The order of the clones is deduced by determining the way in which they overlap. [16] Overlapping clones can be identified in several ways. A small radioactively or chemically labeled probe containing a sequence-tagged site (STS) can be hybridized onto a microarray upon which the clones are printed. [16] In this way, all the clones that contain a particular sequence in the genome are identified. The end of one of these clones can then be sequenced to yield a new probe, and the process is repeated in a method called chromosome walking.

Alternatively, the BAC library can be restriction-digested. Two clones that have several fragment sizes in common are inferred to overlap because they contain multiple similarly spaced restriction sites in common. [16] This method of genomic mapping is called restriction or BAC fingerprinting because it identifies a set of restriction sites contained in each clone. Once the overlap between the clones has been found and their order relative to the genome known, a scaffold of a minimal subset of these contigs that covers the entire genome is shotgun-sequenced. [14]
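The fingerprint comparison can be sketched as counting restriction fragment sizes that two clones share. The fragment sizes, the `tolerance` parameter, and the `min_shared` threshold below are illustrative assumptions; real fingerprinting software scores overlaps statistically rather than with a fixed cutoff.

```python
def shared_fragments(sizes_a: list[int], sizes_b: list[int], tolerance: int = 0) -> int:
    """Count fragment sizes present in both digests, matching each fragment at most once."""
    unmatched = sorted(sizes_b)
    shared = 0
    for size in sorted(sizes_a):
        for idx, other in enumerate(unmatched):
            if abs(size - other) <= tolerance:
                shared += 1
                del unmatched[idx]  # a fragment in b can match only one fragment in a
                break
    return shared


def likely_overlap(sizes_a: list[int], sizes_b: list[int], min_shared: int = 3, tolerance: int = 0) -> bool:
    """Infer overlap when enough fragment sizes are shared between the two clones."""
    return shared_fragments(sizes_a, sizes_b, tolerance) >= min_shared


if __name__ == "__main__":
    clone1 = [1200, 3400, 560, 7800, 2100]
    clone2 = [3400, 7800, 2100, 910, 4300]
    clone3 = [650, 1500, 9900]
    print(likely_overlap(clone1, clone2))  # True
    print(likely_overlap(clone1, clone3))  # False
```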

Because it involves first creating a low-resolution map of the genome, hierarchical shotgun sequencing is slower than whole-genome shotgun sequencing, but it relies less heavily on computer algorithms. The extensive BAC library creation and tiling path selection, however, make hierarchical shotgun sequencing slow and labor-intensive. Now that the technology is available and the reliability of the data has been demonstrated, [13] the speed and cost efficiency of whole-genome shotgun sequencing have made it the primary method for genome sequencing.

Newer sequencing technologies

Classical shotgun sequencing was based on the Sanger sequencing method, which was the most advanced technique for sequencing genomes from about 1995 to 2005. The shotgun strategy is still applied today, but with other sequencing technologies, such as short-read sequencing and long-read sequencing.

Short-read or "next-gen" sequencing produces shorter reads (anywhere from 25–500bp) but many hundreds of thousands or millions of reads in a relatively short time (on the order of a day). [17] This results in high coverage, but the assembly process is much more computationally intensive. These technologies are vastly superior to Sanger sequencing due to the high volume of data and the relatively short time it takes to sequence a whole genome. [18]

Metagenomic shotgun sequencing

Reads of 400–500 base pairs are long enough to determine the species or strain of the organism the DNA comes from, provided its genome is already known, for example by using k-mer based taxonomic classifier software. With millions of reads from next generation sequencing of an environmental sample, it is possible to get a complete overview of any complex microbiome with thousands of species, like the gut flora. Advantages over 16S rRNA amplicon sequencing are: not being limited to bacteria; strain-level classification where amplicon sequencing only gets the genus; and the possibility to extract whole genes and specify their function as part of the metagenome. [19] The sensitivity of metagenomic sequencing makes it an attractive choice for clinical use. [20] However, this sensitivity also heightens the problem of contamination of the sample or the sequencing pipeline. [21]

References

  1. Staden, R (1979). "A strategy of DNA sequencing employing computer programs". Nucleic Acids Research. 6 (7): 2601–10. doi:10.1093/nar/6.7.2601. PMC 327874. PMID 461197.
  2. Anderson, S (1981). "Shotgun DNA sequencing using cloned DNase I-generated fragments". Nucleic Acids Research. 9 (13): 3015–27. doi:10.1093/nar/9.13.3015. PMC 327328. PMID 6269069.
  3. Human Genome Sequencing Consortium, International (21 October 2004). "Finishing the euchromatic sequence of the human genome". Nature. 431 (7011): 931–945. Bibcode:2004Natur.431..931H. doi:10.1038/nature03001. PMID 15496913.
  4. Gardner, Richard C.; Howarth, Alan J.; Hahn, Peter; Brown-Luedi, Marianne; Shepherd, Robert J.; Messing, Joachim (1981-06-25). "The complete nucleotide sequence of an infectious clone of cauliflower mosaic virus by M13mp7 shotgun sequencing". Nucleic Acids Research. 9 (12): 2871–2888. doi:10.1093/nar/9.12.2871. ISSN 0305-1048. PMC 326899. PMID 6269062.
  5. Doctrow, Brian (2016-07-19). "Profile of Joachim Messing". Proceedings of the National Academy of Sciences. 113 (29): 7935–7937. Bibcode:2016PNAS..113.7935D. doi:10.1073/pnas.1608857113. ISSN 0027-8424. PMC 4961156. PMID 27382176.
  6. Edwards, A; Voss, H.; Rice, P.; Civitello, A.; Stegemann, J.; Schwager, C.; Zimmerman, J.; Erfle, H.; Caskey, T.; Ansorge, W. (1990). "Automated DNA sequencing of the human HPRT locus". Genomics. 6 (4): 593–608. doi:10.1016/0888-7543(90)90493-E. PMID 2341149.
  7. Edwards, A; Caskey, T (1991). "Closure strategies for random DNA sequencing". Methods: A Companion to Methods in Enzymology. 3 (1): 41–47. doi:10.1016/S1046-2023(05)80162-8.
  8. Roach, JC; Boysen, C; Wang, K; Hood, L (1995). "Pairwise end sequencing: a unified approach to genomic mapping and sequencing". Genomics. 26 (2): 345–353. doi:10.1016/0888-7543(95)80219-C. PMID 7601461.
  9. Fleischmann, RD; et al. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd". Science. 269 (5223): 496–512. Bibcode:1995Sci...269..496F. doi:10.1126/science.7542800. PMID 7542800. S2CID 10423613.
  10. Adams, MD; et al. (2000). "The genome sequence of Drosophila melanogaster" (PDF). Science. 287 (5461): 2185–95. Bibcode:2000Sci...287.2185.. CiteSeerX 10.1.1.549.8639. doi:10.1126/science.287.5461.2185. PMID 10731132. Archived from the original (PDF) on 2018-07-22. Retrieved 2017-10-25.
  11. Meyerson, M.; Gabriel, S.; Getz, G. (2010). "Advances in understanding cancer genomes through second-generation sequencing". Nature Reviews Genetics. 11 (10): 685–696. doi:10.1038/nrg2841. PMID 20847746. S2CID 2544266.
  12. Dunham, I. Genome Sequencing. Encyclopedia of Life Sciences, 2005. doi:10.1038/npg.els.0005378.
  13. Venter, J. C. "Shotgunning the Human Genome: A Personal View." Encyclopedia of Life Sciences, 2006.
  14. Gibson, G. and Muse, S. V. A Primer of Genome Science. 3rd ed. p. 84.
  15. Bozdag, Serdar; Close, Timothy J.; Lonardi, Stefano (March 2013). "A Graph-Theoretical Approach to the Selection of the Minimum Tiling Path from a Physical Map". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 10 (2): 352–360. doi:10.1109/tcbb.2013.26. ISSN 1545-5963.
  16. Dear, P. H. Genome Mapping. Encyclopedia of Life Sciences, 2005. doi:10.1038/npg.els.0005353.
  17. Karl, V; et al. (2009). "Next Generation Sequencing: From Basic Research to Diagnostics". Clinical Chemistry. 55 (4): 41–47. doi:10.1373/clinchem.2008.112789. PMID 19246620.
  18. Metzker, Michael L. (2010). "Sequencing technologies - the next generation" (PDF). Nat Rev Genet. 11 (1): 31–46. CiteSeerX 10.1.1.719.3885. doi:10.1038/nrg2626. PMID 19997069. S2CID 205484500.
  19. Roumpeka, Despoina D.; et al. (2017). "A review of bioinformatics tools for bio-prospecting from metagenomic sequence data". Frontiers in Genetics. 8: 23. doi:10.3389/fgene.2017.00023. PMC 5337752. PMID 28321234.
  20. Gu, Wei; et al. (2018). "Clinical Metagenomic Next-Generation Sequencing for Pathogen Detection". Annual Review of Pathology: Mechanisms of Disease. 14: 319–338. doi:10.1146/annurev-pathmechdis-012418-012751. PMC 6345613. PMID 30355154.
  21. Thoendel, Matthew; et al. (2017). "Impact of contaminating DNA in whole genome amplification kits used for metagenomic shotgun sequencing for infection diagnosis". Journal of Clinical Microbiology. 55 (6): 1789–1801. doi:10.1128/JCM.02402-16. PMC 5442535. PMID 28356418.

This article incorporates public domain material from NCBI Handbook. National Center for Biotechnology Information.