DNA sequencing theory

DNA sequencing theory is the broad body of work that attempts to lay analytical foundations for determining the order of specific nucleotides in a sequence of DNA, otherwise known as DNA sequencing. The practical aspects revolve around designing and optimizing sequencing projects (known as "strategic genomics"), predicting project performance, troubleshooting experimental results, characterizing factors such as sequence bias and the effects of software processing algorithms, and comparing various sequencing methods to one another. In this sense, it could be considered a branch of systems engineering or operations research. The permanent archive of work is primarily mathematical, although numerical calculations are often conducted for particular problems too. DNA sequencing theory addresses physical processes related to sequencing DNA and should not be confused with theories of analyzing resultant DNA sequences, e.g. sequence alignment. Publications [1] sometimes do not make a careful distinction, but the latter are primarily concerned with algorithmic issues. Sequencing theory is based on elements of mathematics, biology, and systems engineering, so it is highly interdisciplinary. The subject may be studied within the context of computational biology.

Theory and sequencing strategies

Sequencing as a covering problem

All mainstream methods of DNA sequencing rely on reading small fragments of DNA and subsequently reconstructing these data to infer the original DNA target, either via assembly or alignment to a reference. The abstraction common to these methods is that of a mathematical covering problem. [2] For example, one can imagine a line segment representing the target and a subsequent process where smaller segments are "dropped" onto random locations of the target. The target is considered "sequenced" when adequate coverage accumulates (e.g., when no gaps remain).
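
This covering abstraction is easy to explore numerically. The following minimal Monte Carlo sketch, with purely illustrative parameters (the target length G, fragment length L, and fragment count N are assumptions, not values from the cited sources), drops fragments uniformly at random onto a linear target and reports the fraction covered and the number of remaining gaps.

    # Minimal Monte Carlo sketch of sequencing as a covering problem.
    # All parameters are illustrative assumptions, not values from the cited literature.
    import random

    def simulate_coverage(G=100_000, L=500, N=600, seed=1):
        random.seed(seed)
        covered = [False] * G
        for _ in range(N):
            start = random.randrange(G - L + 1)   # fragment lies fully inside the target
            for i in range(start, start + L):
                covered[i] = True
        fraction = sum(covered) / G
        # A "gap" is a maximal run of uncovered bases; count the start of each run.
        gaps = sum(1 for i in range(G) if not covered[i] and (i == 0 or covered[i - 1]))
        return fraction, gaps

    frac, gaps = simulate_coverage()
    print(f"fraction covered = {frac:.4f}, gaps remaining = {gaps}")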

The abstract properties of covering have been studied by mathematicians for over a century. [3] However, direct application of these results has not generally been possible. Closed-form mathematical solutions, especially for probability distributions, often cannot be readily evaluated. That is, they involve inordinately large amounts of computer time for parameters characteristic of DNA sequencing. Stevens' configuration is one such example. [4] Results obtained from the perspective of pure mathematics also do not account for factors that are actually important in sequencing, for instance detectable overlap in sequencing fragments, double-stranding, edge-effects, and target multiplicity. Consequently, development of sequencing theory has proceeded more according to the philosophy of applied mathematics. In particular, it has been problem-focused and makes expedient use of approximations, simulations, etc.

Early uses derived from elementary probability theory

The earliest result may be found directly from elementary probability theory. Suppose we model the above process taking $L$ and $G$ as the fragment length and target length, respectively. The probability of "covering" any given location on the target with one particular fragment is then $L/G$. (This presumes $L \ll G$, which is valid often, but not for all real-world cases.) The probability of a single fragment not covering a given location on the target is therefore $1 - L/G$, and $(1 - L/G)^{N}$ for $N$ fragments. The probability of covering a given location on the target with at least one fragment is therefore

$$P = 1 - \left(1 - \frac{L}{G}\right)^{N}.$$

This equation was first used to characterize plasmid libraries, [5] but it may appear in a modified form. For most projects $N \gg 1$, so that, to a good degree of approximation

$$\left(1 - \frac{L}{G}\right)^{N} \approx e^{-NL/G} = e^{-R},$$

where $R = NL/G$ is called the redundancy. Note the significance of redundancy as representing the average number of times a position is covered with fragments. Note also that in considering the covering process over all positions in the target, this probability is identical to the expected value of the random variable $C$, the fraction of the target coverage. The final result,

$$P \approx 1 - e^{-R},$$

remains in widespread use as a "back of the envelope" estimator and predicts that coverage for all projects evolves along a universal curve that is a function only of the redundancy.
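
As a quick numerical illustration (with assumed, arbitrary project parameters), the exact per-position probability $1 - (1 - L/G)^{N}$ and the estimator $1 - e^{-R}$ can be compared directly; they agree closely for values of $L/G$ typical of sequencing projects.

    # Compare the exact per-position coverage probability with the
    # approximation 1 - exp(-R); parameters are illustrative assumptions.
    import math

    def coverage_exact(L, G, N):
        return 1.0 - (1.0 - L / G) ** N

    def coverage_approx(L, G, N):
        R = N * L / G                    # redundancy
        return 1.0 - math.exp(-R)

    G, L = 3_000_000, 500                # assumed target and fragment lengths
    for N in (3_000, 12_000, 30_000):
        R = N * L / G
        print(f"R = {R:4.1f}  exact = {coverage_exact(L, G, N):.6f}  "
              f"approx = {coverage_approx(L, G, N):.6f}")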

Lander–Waterman theory

In 1988, Eric Lander and Michael Waterman published an important paper [6] examining the covering problem from the standpoint of gaps. Although they focused on the so-called mapping problem, the abstraction to sequencing is much the same. They furnished a number of useful results that were adopted as the standard theory from the earliest days of "large-scale" genome sequencing. [7] Their model was also used in designing the Human Genome Project and continues to play an important role in DNA sequencing.

Ultimately, the main goal of a sequencing project is to close all gaps, so the "gap perspective" was a logical basis of developing a sequencing model. One of the more frequently used results from this model is the expected number of contigs, given the number of fragments sequenced. If one neglects the amount of sequence that is essentially "wasted" by having to detect overlaps, their theory yields

$$E[\mathrm{contigs}] = N e^{-R}.$$
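
The expression can be tabulated directly. The sketch below uses illustrative parameters and also evaluates the commonly quoted variant $N e^{-R(1-\theta)}$, in which $\theta$ is the minimum overlap fraction required to detect a join; the $\theta$ term is included as an assumption for comparison rather than as a restatement of the original paper.

    # Expected number of contigs under the simplified Lander-Waterman result
    # E[contigs] = N * exp(-R); the theta variant models a required minimum
    # overlap fraction for detecting joins. Parameters are illustrative.
    import math

    def expected_contigs(N, R, theta=0.0):
        return N * math.exp(-R * (1.0 - theta))

    G, L = 3_000_000, 500
    for N in (6_000, 18_000, 30_000, 60_000):
        R = N * L / G
        print(f"N = {N:6d}  R = {R:4.1f}  "
              f"contigs (theta=0) = {expected_contigs(N, R):8.1f}  "
              f"contigs (theta=0.25) = {expected_contigs(N, R, 0.25):8.1f}")

Raising $\theta$ increases the apparent number of contigs, reflecting overlaps that are too short to be detected.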

In 1995, Roach [8] published improvements to this theory, enabling it to be applied to sequencing projects in which the goal was to completely sequence a target genome. Michael Wendl and Bob Waterston [9] confirmed, based on Stevens' method, [4] that both models produced similar results when the number of contigs was substantial, such as in low coverage mapping or sequencing projects. As sequencing projects ramped up in the 1990s, and projects approached completion, low coverage approximations became inadequate, and the exact model of Roach was necessary. However, as the cost of sequencing dropped, parameters of sequencing projects became easier to directly test empirically, and interest and funding for strategic genomics diminished.

The basic ideas of Lander–Waterman theory led to a number of additional results for particular variations in mapping techniques. [10] [11] [12] However, technological advancements have rendered mapping theories largely obsolete, except for organisms outside the set of highly studied model organisms (e.g., yeast, flies, mice, and humans).

Parking strategy

The parking strategy for sequencing resembles the process of parking cars along a curb. Each car is a sequenced clone, and the curb is the genomic target. [13] Each clone sequenced is screened to ensure that subsequently sequenced clones do not overlap any previously sequenced clone. No sequencing effort is redundant in this strategy. However, much like the gaps between parked cars, unsequenced gaps less than the length of a clone accumulate between sequenced clones. There can be considerable cost to close such gaps.
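
The analogy can be simulated directly: candidate clones of fixed length are proposed at uniformly random positions and accepted only if they do not overlap any previously accepted clone. The sketch below (illustrative clone and target lengths, not drawn from the cited work) estimates how much of the target ends up in unsequenced gaps once very few further clones can be placed.

    # Toy simulation of the parking (non-overlapping clone) strategy.
    # Clone length, target length, and attempt count are illustrative assumptions.
    import random

    def park_clones(G=100_000, L=500, attempts=50_000, seed=7):
        random.seed(seed)
        accepted = []                                  # list of (start, end) intervals
        for _ in range(attempts):
            s = random.randrange(G - L + 1)
            e = s + L
            if all(e <= a or s >= b for a, b in accepted):   # reject any overlap
                accepted.append((s, e))
        sequenced = sum(b - a for a, b in accepted)
        return len(accepted), 1.0 - sequenced / G

    clones, gap_fraction = park_clones()
    print(f"accepted clones = {clones}, fraction left in gaps = {gap_fraction:.3f}")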

Pairwise end-sequencing

In 1995, Roach et al. [14] proposed and demonstrated through simulations a generalization of a set of strategies explored earlier by Edwards and Caskey. [15] This whole-genome sequencing method became immensely popular as it was championed by Celera and used to sequence several model organisms before Celera applied it to the human genome. Today, most sequencing projects employ this strategy, often called paired end sequencing.

Post Human Genome Project advancements

The physical processes and protocols of DNA sequencing have continued to evolve, largely driven by advancements in bio-chemical methods, instrumentation, and automation techniques. DNA sequencing has now made inroads into a wide range of problems, including metagenomics and medical (cancer) sequencing, and these scenarios involve important factors that classical theory does not account for. Recent work has begun to focus on resolving the effects of some of these issues, and the level of mathematics has become commensurately more sophisticated.

Various artifacts of large-insert sequencing

Biologists have developed methods to filter highly repetitive, essentially un-sequenceable regions of genomes. These procedures are important for organisms whose genomes consist mostly of such DNA, for example corn. They yield multitudes of small islands of sequenceable DNA products. Wendl and Barbazuk [16] proposed an extension to Lander–Waterman theory to account for "gaps" in the target due to filtering and the so-called "edge-effect". The latter is a position-specific sampling bias: a base at the very end of an island can be reached by far fewer fragment placements than an interior base, and so has a correspondingly lower chance of being covered. At low redundancy, classical Lander–Waterman theory still gives good predictions, but the dynamics change for higher redundancies.
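
The edge effect itself is easy to demonstrate: when fragments must lie entirely within a finite linear target, a position near either end can be covered by fewer fragment placements than an interior position. The sketch below estimates the per-fragment coverage probability at a few probe positions under assumed parameters; it is a toy illustration, not a reimplementation of the Wendl–Barbazuk model.

    # Demonstrate position-specific (edge) coverage bias on a finite linear
    # target when fragments cannot overhang the ends; parameters are illustrative.
    import random

    def positional_coverage(G=5_000, L=500, N=20_000, seed=3, probes=(0, 250, 2_500)):
        """Estimate, for selected probe positions, the probability that a
        single randomly placed fragment covers that position."""
        random.seed(seed)
        counts = {p: 0 for p in probes}
        for _ in range(N):
            s = random.randrange(G - L + 1)            # fragment fully inside the target
            for p in probes:
                if s <= p < s + L:
                    counts[p] += 1
        return {p: counts[p] / N for p in probes}

    p = positional_coverage()
    print(f"per-fragment coverage probability: terminal = {p[0]:.5f}, "
          f"near edge = {p[250]:.5f}, interior = {p[2500]:.5f}")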

Modern sequencing methods usually sequence both ends of a larger fragment, which provides linking information for de novo assembly and improved probabilities for alignment to a reference sequence. Researchers generally believe that longer lengths of data (read lengths) enhance performance for very large DNA targets, an idea consistent with predictions from distribution models. [17] However, Wendl [18] showed that smaller fragments provide better coverage on small, linear targets because they reduce the edge effect in linear molecules. These findings have implications for sequencing the products of DNA filtering procedures. Read-pairing and fragment size evidently have negligible influence on coverage for large, whole-genome class targets.

Individual and population sequencing

Sequencing is emerging as an important tool in medicine, for example in cancer research. Here, the ability to detect heterozygous mutations is important, and this can only be done if the sequence of the diploid genome is obtained. In the pioneering efforts to sequence individuals, Levy et al. [19] and Wheeler et al., [20] who sequenced Craig Venter and Jim Watson, respectively, outlined models for covering both alleles in a genome. Wendl and Wilson [21] followed with a more general theory that allowed for an arbitrary number of coverings of each allele and arbitrary ploidy. These results point to the general conclusion that the amount of data needed for such projects is significantly higher than for traditional haploid projects. Generally, at least 30-fold redundancy, i.e. each nucleotide spanned by an average of 30 sequence reads, is now standard. [22] However, requirements can be even greater, depending upon what kinds of genomic events are to be found. For example, in the so-called "discordant read pairs method", insertions and deletions can be inferred when the distance between mapped read pairs deviates from its expected value. Calculations show that around 50-fold redundancy is needed to avoid false-positive errors at a 1% threshold. [23]
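
Why diploid projects need more data can be illustrated with a deliberately simplified model: if reads fall on either allele with equal probability, the depth on each allele at a position is approximately Poisson with mean $R/2$, and both alleles must independently reach some minimum depth $k$. The sketch below uses this Poisson assumption, which is a simplification and not the specific model of any of the papers cited above.

    # Probability that BOTH alleles at a position reach depth >= k, assuming
    # per-allele read depth ~ Poisson(R/2). A simplified illustration only,
    # not the exact models of Levy et al., Wheeler et al., or Wendl and Wilson.
    import math

    def poisson_sf(k, mu):
        """P(X >= k) for X ~ Poisson(mu)."""
        return 1.0 - sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))

    def both_alleles_covered(R, k):
        return poisson_sf(k, R / 2.0) ** 2

    for R in (10, 20, 30, 50):
        print(f"R = {R:2d}  P(both alleles >= 3 reads) = {both_alleles_covered(R, 3):.4f}")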

The advent of next-generation sequencing has also made large-scale population sequencing feasible, for example the 1000 Genomes Project to characterize variation in human population groups. While common variation is easily captured, rare variation poses a design challenge: too few samples with high sequence redundancy risk not having a variant carrier in the sample group at all, while many samples with light redundancy risk not capturing a variant in the read set even when it is present in the sample group. Wendl and Wilson [24] report a simple set of optimization rules that maximize the probability of discovery for a given set of parameters. For example, for observing a rare allele at least twice (to eliminate the possibility that it is unique to a single individual), a little less than 4-fold redundancy should be used, regardless of the sample size.
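
The trade-off can be made concrete with a simplified model (the assumptions below are illustrative and are not the cited paper's formulation): a fixed total redundancy budget is split evenly over $n$ diploid samples; each individual carries the rare allele with probability about $2f$; a carrier is confidently observed when its variant allele is covered by at least two reads, with per-allele depth approximately Poisson($\rho/2$); and discovery requires observing the allele in at least two distinct individuals. Under these assumptions the optimum per-sample redundancy indeed falls a little below 4-fold.

    # Illustrative trade-off between sample size and per-sample redundancy for
    # discovering a rare variant under a fixed total sequencing budget.
    # Assumptions (not from the cited paper): carrier probability ~2f per
    # individual, per-allele depth ~ Poisson(rho/2), a carrier is observed when
    # its variant allele has >= 2 reads, and discovery needs >= 2 observed carriers.
    import math

    def poisson_sf(k, mu):
        """P(X >= k) for X ~ Poisson(mu)."""
        return 1.0 - sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))

    def discovery_prob(n_samples, budget, f=0.001, min_reads=2):
        rho = budget / n_samples                       # per-sample redundancy
        q = poisson_sf(min_reads, rho / 2.0)           # P(a carrier is observed)
        lam = n_samples * 2.0 * f * q                  # expected observed carriers
        return poisson_sf(2, lam)                      # P(>= 2 observed carriers)

    budget = 4_000   # total fold-redundancy summed over all samples (assumed)
    for n in (250, 500, 1_000, 1_100, 1_250, 2_000):
        print(f"samples = {n:5d}  per-sample R = {budget/n:5.2f}  "
              f"P(discovery) = {discovery_prob(n, budget):.3f}")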

Metagenomic sequencing

Next-generation instruments are now also enabling the sequencing of whole, uncultured metagenomic communities. The sequencing scenario is more complicated here, and there are various ways of framing design theories for a given project. For example, Stanhope [25] developed a probabilistic model for the amount of sequence needed to obtain at least one contig of a given size from each novel organism of the community, while Wendl et al. reported analysis for the average contig size and the probability of completely recovering a novel organism of a given rareness within the community. [26] Alternatively, Hooper et al. propose a semi-empirical model based on the gamma distribution. [27]
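
A rough sense of the design difficulty comes from noting that each organism's effective redundancy scales with its relative abundance: a species contributing a fraction $a$ of the community DNA receives roughly a fraction $a$ of the reads. The sketch below simply applies the haploid $1 - e^{-R}$ estimator per organism under this proportional-sampling assumption, with a hypothetical community; it is not a reimplementation of the models cited above.

    # Rough per-organism coverage estimate in a metagenomic sample, assuming
    # reads are drawn in proportion to each organism's relative abundance.
    # Abundances, genome sizes, and sequencing yield are illustrative.
    import math

    def organism_coverage(total_bases, abundance, genome_size):
        R = total_bases * abundance / genome_size      # per-organism redundancy
        return R, 1.0 - math.exp(-R)                   # expected fraction covered

    total_bases = 10e9                                 # 10 Gb of sequence (assumed)
    community = {                                      # hypothetical community
        "abundant sp.": (0.30, 5e6),                   # (relative abundance, genome size)
        "moderate sp.": (0.03, 4e6),
        "rare sp.":     (0.0001, 6e6),
    }
    for name, (abund, gsize) in community.items():
        R, cov = organism_coverage(total_bases, abund, gsize)
        print(f"{name:13s} redundancy = {R:8.2f}  expected coverage = {cov:.4f}")

In this toy community the abundant species is saturated long before the rare one approaches completeness, which is the essence of the design problem.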

Limitations

DNA sequencing theories often invoke the assumption that certain random variables in a model are independent and identically distributed. For example, in Lander–Waterman theory, a sequenced fragment is presumed to have the same probability of covering each region of a genome and all fragments are assumed to be independent of one another. In actuality, sequencing projects are subject to various types of bias, including differences of how well regions can be cloned, sequencing anomalies, biases in the target sequence (which is not random), and software-dependent errors and biases. In general, theory will agree well with observation up to the point that enough data have been generated to expose latent biases. [21] The kinds of biases related to the underlying target sequence are particularly difficult to model, since the sequence itself may not be known a priori. This presents a type of catch-22 problem.

See also

Shotgun sequencing
Bacterial artificial chromosome
Contig
Expressed sequence tag
Sequence assembly
Sanger sequencing
Gene mapping
Genomic library
Phred quality score
Sulston score
Methylated DNA immunoprecipitation
Paired-end tag
Optical mapping
Michael Wendl
Jumping library
Denaturation mapping
End-sequence profiling
Coverage (genetics)
Plant genome assembly
Physical map

References

  1. Waterman, Michael S. (1995). Introduction to Computational Biology. Boca Raton: Chapman and Hall/CRC. ISBN   978-0-412-99391-6.
  2. Hall, P. (1988). Introduction to the Theory of Coverage Processes. New York: Wiley. ISBN   978-0-471-85702-0.
  3. Solomon, H. (1978). Geometric Probability. Philadelphia: Society for Industrial and Applied Mathematics. ISBN   978-0-898-71025-0.
  4. Stevens WL (1939). "Solution to a Geometrical Problem in Probability". Annals of Eugenics. 9 (4): 315–320. doi:10.1111/j.1469-1809.1939.tb02216.x.
  5. Clarke L, Carbon J (1976). "A colony bank containing synthetic Col-El hybrid plasmids representative of the entire E. coli genome". Cell. 9 (1): 91–99. doi:10.1016/0092-8674(76)90055-6. PMID   788919. S2CID   2535372.
  6. Lander ES, Waterman MS (1988). "Genomic mapping by fingerprinting random clones: a mathematical analysis". Genomics. 2 (3): 231–239. doi:10.1016/0888-7543(88)90007-9. PMID   3294162.
  7. Fleischmann RD; et al. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd". Science. 269 (5223): 496–512. Bibcode:1995Sci...269..496F. doi:10.1126/science.7542800. PMID 7542800.
  8. Roach JC (1995). "Random subcloning". Genome Research. 5 (5): 464–473. doi: 10.1101/gr.5.5.464 . PMID   8808467.
  9. Wendl MC, Waterston RH (2002). "Generalized gap model for bacterial artificial chromosome clone fingerprint mapping and shotgun sequencing". Genome Research. 12 (12): 1943–1949. doi:10.1101/gr.655102. PMC   187573 . PMID   12466299.
  10. Arratia R; et al. (1991). "Genomic mapping by anchoring random clones: a mathematical analysis". Genomics. 11 (4): 806–827. CiteSeerX   10.1.1.80.8788 . doi:10.1016/0888-7543(91)90004-X. PMID   1783390.
  11. Port E; et al. (1995). "Genomic mapping by end-characterized random clones: a mathematical analysis". Genomics. 26 (1): 84–100. CiteSeerX   10.1.1.74.4380 . doi:10.1016/0888-7543(95)80086-2. PMID   7782090.
  12. Zhang MQ, Marr TG (1993). "Genome mapping by nonrandom anchoring: a discrete theoretical analysis". Proceedings of the National Academy of Sciences . 90 (2): 600–604. Bibcode:1993PNAS...90..600Z. doi: 10.1073/pnas.90.2.600 . PMC   45711 . PMID   8421694.
  13. Roach JC; et al. (2000). "Parking strategies for genome sequencing". Genome Research. 10 (7): 1020–1030. doi:10.1101/gr.10.7.1020. PMC   310895 . PMID   10899151.
  14. Roach JC, Boysen C, Wang K, Hood L (1995). "Pairwise end sequencing: a unified approach to genomic mapping and sequencing". Genomics. 26 (2): 345–353. doi:10.1016/0888-7543(95)80219-C. PMID 7601461.
  15. Edwards, A.; Caskey, T. (1991). Closure strategies for random DNA sequencing. Vol. 3. A Companion to Methods in Enzymology. pp. 41–47.
  16. Wendl MC, Barbazuk WB (2005). "Extension of Lander–Waterman Theory for sequencing filtered DNA libraries". BMC Bioinformatics. 6: article 245. doi: 10.1186/1471-2105-6-245 . PMC   1280921 . PMID   16216129.
  17. Wendl MC (2006). "Occupancy modeling of coverage distribution for whole genome shotgun DNA sequencing". Bulletin of Mathematical Biology. 68 (1): 179–196. doi:10.1007/s11538-005-9021-4. PMID   16794926. S2CID   23889071.
  18. Wendl MC (2006). "A general coverage theory for shotgun DNA sequencing". Journal of Computational Biology. 13 (6): 1177–1196. doi:10.1089/cmb.2006.13.1177. PMID   16901236. S2CID   17112274.
  19. Levy S; et al. (2007). "The diploid genome sequence of an individual human". PLOS Biology. 5 (10): article e254. doi: 10.1371/journal.pbio.0050254 . PMC   1964779 . PMID   17803354.
  20. Wheeler DA; et al. (2008). "The complete genome of an individual by massively parallel DNA sequencing". Nature. 452 (7189): 872–876. Bibcode:2008Natur.452..872W. doi: 10.1038/nature06884 . PMID   18421352.
  21. Wendl MC, Wilson RK (2008). "Aspects of coverage in medical DNA sequencing". BMC Bioinformatics. 9: article 239. doi:10.1186/1471-2105-9-239. PMC 2430974. PMID 18485222.
  22. Ley TJ; et al. (2008). "DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome". Nature. 456 (7218): 66–72. Bibcode:2008Natur.456...66L. doi:10.1038/nature07485. PMC   2603574 . PMID   18987736.
  23. Wendl MC, Wilson RK (2009). "Statistical aspects of discerning indel-type structural variation via DNA sequence alignment". BMC Genomics. 10: article 359. doi: 10.1186/1471-2164-10-359 . PMC   2748092 . PMID   19656394.
  24. Wendl MC, Wilson RK (2009). "The theory of discovering rare variants via DNA sequencing". BMC Genomics. 10: article 485. doi: 10.1186/1471-2164-10-485 . PMC   2778663 . PMID   19843339.
  25. Stanhope SA (2010). "Occupancy modeling maximum contig size probabilities and designing metagenomics experiments". PLOS ONE. 5 (7): article e11652. Bibcode:2010PLoSO...511652S. doi: 10.1371/journal.pone.0011652 . PMC   2912229 . PMID   20686599.
  26. Wendl MC; et al. (2012). "Coverage theories for metagenomic DNA sequencing based on a generalization of Stevens' theorem". Journal of Mathematical Biology. 67 (5): 1141–1161. doi:10.1007/s00285-012-0586-x. PMC   3795925 . PMID   22965653.
  27. Hooper SD; et al. (2010). "Estimating DNA coverage and abundance in metagenomes using a gamma approximation". Bioinformatics. 26 (3): 295–301. doi:10.1093/bioinformatics/btp687. PMC   2815663 . PMID   20008478.