Oligotyping (sequencing)

Oligotyping is the process of correcting DNA sequences obtained by DNA sequencing, using frequency data for related sequences across related samples.

History

DNA sequences were originally read from sequencing gels by eye. With the advent of computerized base callers, humans no longer 'called' the bases and instead 'corrected' the called bases. The bases were called by the software using the relative intensity of each putative basepair signal and the local spacing of the signals.

With the advent of high throughput sequencing, the volume of sequence to be corrected exceeded human capacity for sequence correction.

Use

Multiple applications require single-base-pair accuracy across populations of closely related sequences. An example is amplicon sequencing, which is used to assess the relative contribution of DNA from diverse organisms to a sample.

The requirement for single-base-pair accuracy led to the development of methods that draw on frequency data distributed across several samples to identify variant sequences that share the same frequency profile and are thus likely errors derived from the same original sequence. [1] [2] The ability to use such cross-sample statistics to correct sequences is an important element in decreasing the burden of error in DNA sequence datasets.
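The idea of grouping variants by shared frequency profile can be illustrated with a minimal sketch. This is not the algorithm of either cited method; it is a simplified illustration under stated assumptions: sequences have equal length, a candidate error differs from a more abundant sequence by at most one base, and "same frequency profile" is approximated by a Pearson correlation threshold. The function names, the `max_mismatches` and `min_corr` parameters, and the example counts are all hypothetical.

```python
from math import sqrt

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def pearson(xs, ys):
    """Pearson correlation of two equal-length count vectors (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def merge_likely_errors(counts, max_mismatches=1, min_corr=0.95):
    """Assign each sequence to itself or to a more abundant, similar sequence.

    counts maps sequence -> list of per-sample read counts.
    Thresholds are illustrative assumptions, not published defaults.
    """
    # Visit sequences from most to least abundant so that rare variants
    # are compared against abundant candidates for their true origin.
    order = sorted(counts, key=lambda s: sum(counts[s]), reverse=True)
    parents = []
    assignment = {}
    for seq in order:
        target = seq
        for parent in parents:
            if (len(parent) == len(seq)
                    and hamming(parent, seq) <= max_mismatches
                    and pearson(counts[parent], counts[seq]) >= min_corr):
                target = parent
                break
        if target == seq:
            parents.append(seq)
        assignment[seq] = target
    return assignment

# Hypothetical counts for three sequences across four samples.
counts = {
    "ACGTACGT": [1000, 500, 10, 0],
    "ACGTACGA": [10, 5, 0, 0],      # rare, tracks the first sequence
    "ACGTACCT": [0, 10, 400, 800],  # one base away, but a different profile
}
assignment = merge_likely_errors(counts)
```

In this example the rare variant that rises and falls with the abundant sequence is folded into it, while the variant with a distinct distribution across samples is kept as a separate sequence even though it also differs by a single base.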

Related Research Articles

DNA sequencer

A DNA sequencer is a scientific instrument used to automate the DNA sequencing process. Given a sample of DNA, a DNA sequencer is used to determine the order of the four bases: G (guanine), C (cytosine), A (adenine) and T (thymine). This is then reported as a text string, called a read. Some DNA sequencers can also be considered optical instruments, as they analyze light signals originating from fluorochromes attached to nucleotides.

Nanopore sequencing DNA / RNA sequencing technique

Nanopore sequencing is a third-generation approach used in the sequencing of biopolymers, specifically polynucleotides in the form of DNA or RNA.

DNA sequencing Process of determining the order of nucleotides in DNA molecules

DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.

Metagenomics Study of genes found in the environment

Metagenomics is the study of genetic material recovered directly from environmental samples. The broad field may also be referred to as environmental genomics, ecogenomics or community genomics.

Molecular ecology A field of evolutionary biology that applies molecular population genetics, molecular phylogenetics, and genomics to traditional ecological questions

Molecular ecology is a field of evolutionary biology that is concerned with applying molecular population genetics, molecular phylogenetics, and more recently genomics to traditional ecological questions. It is virtually synonymous with the field of "Ecological Genetics" as pioneered by Theodosius Dobzhansky, E. B. Ford, Godfrey M. Hewitt, and others. These fields are united in their attempt to study genetic-based questions "out in the field" as opposed to the laboratory. Molecular ecology is related to the field of conservation genetics.

RNA spike-in

An RNA spike-in is an RNA transcript of known sequence and quantity used to calibrate measurements in RNA hybridization assays, such as DNA microarray experiments, RT-qPCR, and RNA-Seq.

Phred quality score

A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing. It was originally developed for Phred base calling to help in the automation of DNA sequencing in the Human Genome Project. Phred quality scores are assigned to each nucleotide base call in automated sequencer traces. The FASTQ format encodes Phred scores as ASCII characters alongside the read sequences. Phred quality scores have become widely accepted to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods. Perhaps the most important use of Phred quality scores is the automatic determination of accurate, quality-based consensus sequences.
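As a brief illustration of how these scores work, a Phred score Q is conventionally related to the estimated base-calling error probability P by Q = -10 log10(P). The sketch below converts between the two and decodes a FASTQ quality string. The ASCII offset of 33 is an assumption (it is the common Sanger-style encoding; some older files use an offset of 64), and the function names are hypothetical.

```python
import math

def phred_to_error_prob(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def decode_fastq_quality(qual, offset=33):
    """Decode a FASTQ quality string into a list of integer Phred scores.

    offset=33 is assumed here; adjust it for files that use another encoding.
    """
    return [ord(c) - offset for c in qual]

scores = decode_fastq_quality("I5!")
```

Under this relationship, a score of 30 corresponds to an estimated one-in-a-thousand chance that the base call is incorrect.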

2 base encoding

2 Base Encoding, also called SOLiD, is a next-generation sequencing technology developed by Applied Biosystems and has been commercially available since 2008. These technologies generate hundreds of thousands of small sequence reads at one time. Well-known examples of such DNA sequencing methods include 454 pyrosequencing, the Solexa system and the SOLiD system. These methods have reduced the cost from $0.01/base in 2004 to nearly $0.0001/base in 2006 and increased the sequencing capacity from 1,000,000 bases/machine/day in 2004 to more than 100,000,000 bases/machine/day in 2006.

Single-molecule real-time (SMRT) sequencing is a parallelized single molecule DNA sequencing method. Single-molecule real-time sequencing utilizes a zero-mode waveguide (ZMW). A single DNA polymerase enzyme is affixed at the bottom of a ZMW with a single molecule of DNA as a template. The ZMW is a structure that creates an illuminated observation volume that is small enough to observe only a single nucleotide of DNA being incorporated by DNA polymerase. Each of the four DNA bases is attached to one of four different fluorescent dyes. When a nucleotide is incorporated by the DNA polymerase, the fluorescent tag is cleaved off and diffuses out of the observation area of the ZMW where its fluorescence is no longer observable. A detector detects the fluorescent signal of the nucleotide incorporation, and the base call is made according to the corresponding fluorescence of the dye.

Phred base calling is a computer program for identifying a base (nucleobase) sequence from fluorescence "trace" data generated by an automated DNA sequencer that uses electrophoresis and the four-fluorescent-dye method. When originally developed, Phred produced significantly fewer errors in the data sets examined than other methods, averaging 40–50% fewer errors. Phred quality scores have become widely accepted to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods.

An operational taxonomic unit (OTU) is an operational definition used to classify groups of closely related individuals. The term was originally introduced in 1963 by Robert R. Sokal and Peter H. A. Sneath in the context of numerical taxonomy, where an "Operational Taxonomic Unit" is simply the group of organisms currently being studied. In this sense, an OTU is a pragmatic definition to group individuals by similarity, equivalent to but not necessarily in line with classical Linnaean taxonomy or modern evolutionary taxonomy.

DNA barcoding Method of species identification using a short section of DNA

DNA barcoding is a method of species identification using a short section of DNA from a specific gene or genes. The premise of DNA barcoding is that, by comparison with a reference library of such DNA sections, an individual sequence can be used to uniquely identify an organism to species, in the same way that a supermarket scanner uses the familiar black stripes of the UPC barcode to identify an item in its stock against its reference database. These "barcodes" are sometimes used in an effort to identify unknown species, parts of an organism, or simply to catalog as many taxa as possible, or to compare with traditional taxonomy in an effort to determine species boundaries.

Duplex sequencing

Duplex sequencing is a library preparation and analysis method for next-generation sequencing (NGS) platforms that employs random tagging of double-stranded DNA to detect mutations with higher accuracy and lower error rates.

Third-generation sequencing is a class of DNA sequencing methods currently under active development.

Aquatic macroinvertebrate DNA barcoding

DNA barcoding is an alternative method to the traditional morphological taxonomic classification, and has frequently been used to identify species of aquatic macroinvertebrates. Many are crucial indicator organisms in the bioassessment of freshwater and marine ecosystems.

Microbial DNA barcoding is the use of DNA metabarcoding to characterize a mixture of microorganisms. DNA metabarcoding is a method of DNA barcoding that uses universal genetic markers to identify DNA of a mixture of organisms.

DNA barcoding in diet assessment is the use of DNA barcoding to analyse the diet of organisms and to detect and describe their trophic interactions. This approach is based on identifying consumed species by characterizing the DNA present in dietary samples, e.g. individual food remains, regurgitates, gut and fecal samples, or the homogenized body of the host organism that is the target of the diet study.

Amplicon sequence variant

Amplicon sequence variant (ASV) is a term used to refer to single DNA sequences recovered from a high-throughput marker gene analysis. These amplicon reads are created following the removal of erroneous sequences generated during PCR and sequencing. This allows ASVs to distinguish sequence variation by a single nucleotide change. ASVs are used to classify groups of species based on DNA sequences, to find biological and environmental variation, and to determine ecological patterns. For many years the standard unit for marker gene analysis was the operational taxonomic unit (OTU), generated by clustering sequences based on a shared similarity threshold. These traditional units were constructed either by clustering sequencing reads by similarity or by clustering against reference databases to define and label an OTU. Instead of using exact sequence variants, OTUs are delimited by a dissimilarity threshold, most commonly 3%, which means that sequences in the same unit share at least 97% of their DNA sequence. ASV methods, on the other hand, can resolve sequence differences as small as a single nucleotide change, which allows them to avoid similarity-based operational clustering units altogether. ASVs therefore provide a more precise measurement of sequence variation, since they rely on DNA differences rather than user-defined OTU thresholds. ASVs are also referred to as exact sequence variants (ESVs), zero-radius OTUs (zOTUs), sub-OTUs (sOTUs), haplotypes, or oligotypes.
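The contrast between threshold-based OTUs and exact sequence variants can be shown with a small sketch. It is a toy illustration rather than any published pipeline: it assumes pre-aligned reads of equal length, measures identity as the fraction of matching positions, and uses a simple greedy clustering seeded by the most abundant sequences. It also omits the error-removal step that ASV methods perform before reporting variants. All names and example reads are hypothetical.

```python
from collections import Counter

def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def exact_variants(reads):
    """Count each distinct sequence separately (single-nucleotide resolution)."""
    return Counter(reads)

def greedy_otus(reads, threshold=0.97):
    """Greedily cluster reads around abundant centroids at a similarity threshold."""
    centroids = []
    otus = Counter()
    for seq, n in Counter(reads).most_common():
        for c in centroids:
            if len(c) == len(seq) and identity(c, seq) >= threshold:
                otus[c] += n  # close enough: absorbed into an existing OTU
                break
        else:
            centroids.append(seq)  # too different: seeds a new OTU
            otus[seq] += n
    return otus

# Hypothetical 100-base reads: a reference, a 1-mismatch variant (99% identical)
# and a 5-mismatch variant (95% identical).
base = "ACGT" * 25
one_off = "T" + base[1:]
five_off = "TTTCT" + base[5:]
reads = [base] * 5 + [one_off] * 3 + [five_off] * 2

variants = exact_variants(reads)
otus = greedy_otus(reads)
```

With a 97% threshold the single-mismatch variant is absorbed into the reference OTU, whereas exact counting keeps all three sequences distinct.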

Metabarcoding

Metabarcoding is the barcoding of DNA/RNA in a manner that allows for the simultaneous identification of many taxa within the same sample. The main difference between barcoding and metabarcoding is that metabarcoding does not focus on one specific organism, but instead aims to determine species composition within a sample.

A. Murat Eren (Meren) is a computer scientist known for his work on microbial ecology and for developing novel, open-source computational tools for the analysis of large data sets.

References

  1. Eren, A. Murat; Maignien, Loïs; Sul, Woo Jun; Murphy, Leslie G.; Grim, Sharon L.; Morrison, Hilary G.; Sogin, Mitchell L. (1 December 2013). "Oligotyping: differentiating between closely related microbial taxa using 16S rRNA gene data". Methods in Ecology and Evolution. 4 (12): 1111–1119. doi:10.1111/2041-210X.12114. ISSN 2041-210X. PMC 3864673. PMID 24358444.
  2. Preheim, Sarah P.; Perrotta, Allison R.; Martin-Platero, Antonio M.; Gupta, Anika; Alm, Eric J. (1 November 2013). "Distribution-Based Clustering: Using Ecology To Refine the Operational Taxonomic Unit". Applied and Environmental Microbiology. 79 (21): 6593–6603. doi:10.1128/AEM.00342-13. ISSN 0099-2240. PMC 3811501. PMID 23974136.