Amplicon sequence variant

An amplicon sequence variant (ASV) is any one of the inferred single DNA sequences recovered from a high-throughput analysis of marker genes. Because ASVs are inferred from the sequences these analyses produce (amplicon reads) only after erroneous sequences generated during PCR and sequencing have been removed, they make it possible to distinguish sequence variation down to a single nucleotide change. The uses of ASVs include classifying groups of species based on DNA sequences, finding biological and environmental variation, and determining ecological patterns.

ASVs were first described in 2013 by Eren and colleagues. [1] Before that, the standard unit for marker-gene analysis had for many years been the operational taxonomic unit (OTU), which is generated by clustering sequences based on a similarity threshold. Compared to ASVs, OTUs reflect a coarser notion of similarity. Though there is no single threshold, the most commonly chosen dissimilarity value is 3%, which means sequences within a unit share at least 97% of their DNA sequence. ASV methods, on the other hand, can resolve sequence differences as small as a single nucleotide change, thus avoiding similarity-based operational clustering units altogether. ASVs therefore represent a finer distinction between sequences.
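
The difference in resolution can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the procedure of any particular OTU or ASV tool: the sequences are synthetic, error-free, and of equal length; `identity`, `cluster_otus`, and `substitute` are hypothetical helper names; greedy centroid clustering stands in for OTU picking; and the set of exact sequences stands in for ASVs.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy assumes pre-aligned, equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_otus(seqs, threshold=0.97):
    """Greedy centroid clustering: a sequence joins the first centroid it
    matches at or above the identity threshold, else it founds a new cluster."""
    clusters = {}  # centroid sequence -> list of member sequences
    for seq in seqs:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


def substitute(seq: str, positions, base: str) -> str:
    """Return a copy of seq with the given positions set to base."""
    chars = list(seq)
    for p in positions:
        chars[p] = base
    return "".join(chars)


# Synthetic 100-nt "marker gene" and two variants (illustrative data only).
reference = "ACGT" * 25
one_off = substitute(reference, [50], "A")                   # 99% identical
five_off = substitute(reference, [0, 20, 40, 60, 80], "T")   # 95% identical

reads = [reference, one_off, five_off, reference, one_off]

otus = cluster_otus(reads, threshold=0.97)
asvs = set(reads)  # exact sequences, assuming the reads are error-free

print(len(otus), "OTUs;", len(asvs), "exact sequence variants")
```

Under these assumptions the single-nucleotide variant is merged with the reference into one OTU at the 97% threshold, giving two OTUs, while the exact-sequence view keeps all three sequences distinct.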

ASVs are also referred to as exact sequence variants (ESVs), zero-radius OTUs (ZOTUs), sub-OTUs (sOTUs), haplotypes, or oligotypes. [2] [3]

This chart compares ASVs and OTUs. A check mark indicates whether each marker-gene analysis method is precise, traceable, reproducible, or comprehensive.
This graph shows a real sequence that was sequenced over a hundred times. The black dots are called the error cloud, with the Y-axis showing how many times each specific error appeared in the set. The red vertical line represents the 3% cut-off: everything to the right of this line is treated as new biology and everything to the left as error. This demonstrates the errors or new biology that can be missed when using OTUs, since OTUs absorb anything that falls within the 3% dissimilarity threshold.
This graph shows the same real sequence, sequenced over a hundred times, as the graph above. The black dots are again the error cloud, with the Y-axis showing how many times each specific error appeared in the set. Here, ASVs keep the errors associated with OTUs out of the data set: dots below the curved black line are treated as errors, and dots above it as new biology. This means that ASVs are more exact in measuring differences among sequences.
This diagram shows how OTUs pick up erroneous amplicon reads created during PCR and sequencing. When sequences are grouped into clustered units, these errors are picked up and placed into the units as well. OTUs therefore take in a wider set of data points and can accidentally group two distinct DNA sequences into the same unit, as seen by only two colors (DNA sequences) being recovered as OTUs instead of four.
This diagram shows how ASVs remove and correct errors from PCR, in contrast to the OTU diagram above. ASVs are able to create groups for all four colors (DNA sequences) observed. This allows ASVs to be more precise in finding sequence variation.

Uses of ASVs versus OTUs

The introduction of ASV methods was marked by a debate about their utility. Although OTUs do not measure sequence variation as precisely or accurately, they remain an acceptable and valuable approach. In one research study, Glassman and Martiny confirmed the suitability of OTUs for investigating broad-scale ecological diversity. [4] They concluded that OTUs and ASVs provided similar results, with ASVs enabling slightly stronger detection of fungal and bacterial diversity. Their work indicated that even though species diversification can be measured more accurately with ASVs, the use of OTUs in well-constructed studies is generally valid for demonstrating diversification at broad scales.

Some have argued that ASVs should replace OTUs in marker-gene analysis. Their arguments focus on the precision, tractability, reproducibility, and comprehensiveness that ASVs can bring to marker-gene analysis. For these researchers, the utility of finer sequence resolution (precision) and the advantage of being able to easily compare sequences between different studies (tractability and reproducibility) make ASVs the better option for analyzing sequence differences. By contrast, since OTUs depend on the specifics of the similarity thresholds used to generate them, the sequences grouped into any OTU can vary across researchers, experiments, and databases. Thus comparison across OTU-based studies and datasets can be very challenging. [3]

ASV methods

Popular methods for resolving ASVs include DADA2, [5] Deblur, [6] MED, [7] and UNOISE. [8] These methods work broadly by generating an error model tailored to an individual sequencing run and employing algorithms that use the model to distinguish between true biological sequences and those generated by error.
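
The general idea of using an error model to separate true sequences from error-derived ones can be sketched in Python. This is a deliberately simplified toy and not the algorithm of DADA2, Deblur, MED, or UNOISE: a single fixed per-base `error_rate` stands in for the run-specific error model those tools estimate, only substitutions between equal-length reads are considered, and the function names and read counts are hypothetical.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy assumes equal-length reads")
    return sum(x != y for x, y in zip(a, b))


def denoise(reads, error_rate=0.01, max_dist=1):
    """Toy denoiser. A sequence is treated as an error of a more abundant
    'parent' if it lies within max_dist substitutions of the parent and its
    count is no larger than the number of such errors the assumed error
    rate could plausibly produce (parent_count * error_rate ** distance)."""
    raw = Counter(reads)
    # Most abundant sequences first; ties broken alphabetically.
    ordered = sorted(raw.items(), key=lambda kv: (-kv[1], kv[0]))
    variants = {}  # accepted sequence -> count, including absorbed error reads
    for seq, count in ordered:
        for parent in variants:
            d = hamming(seq, parent)
            if d <= max_dist and count <= raw[parent] * error_rate ** d:
                variants[parent] += count  # fold the error reads into the parent
                break
        else:
            variants[seq] = count  # too abundant to be explained by error
    return variants


# Illustrative counts only: one dominant sequence, one genuine single-nucleotide
# variant, and one rare sequence that looks like a sequencing error.
reads = (
    ["ACGTACGTAC"] * 1000
    + ["ACGAACGTAC"] * 200
    + ["ACGTACGTAT"] * 5
)

variants = denoise(reads)
for seq, count in variants.items():
    print(seq, count)
```

In this sketch the sequence seen 200 times is far more abundant than the assumed error rate could explain, so it is kept as a separate variant even though it differs from the dominant sequence by one nucleotide, while the sequence seen 5 times is folded into its parent. Real tools replace the fixed rate with a model fitted to the sequencing run and use more careful statistics.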

References

  1. Eren AM, Maignien L, Sul WJ, Murphy LG, Grim SL, Morrison HG, Sogin ML (December 2013). "Oligotyping: Differentiating between closely related microbial taxa using 16S rRNA gene data". Methods in Ecology and Evolution. 4 (12): 1111–1119. doi:10.1111/2041-210X.12114. hdl:1912/6377. PMC 3864673. PMID 24358444.
  2. Porter TM, Hajibabaei M (January 2018). "Scaling up: A guide to high-throughput genomic approaches for biodiversity analysis". Molecular Ecology. 27 (2): 313–338. doi:10.1111/mec.14478. PMID 29292539.
  3. Callahan BJ, McMurdie PJ, Holmes SP (December 2017). "Exact sequence variants should replace operational taxonomic units in marker-gene data analysis". The ISME Journal. 11 (12): 2639–2643. doi:10.1038/ismej.2017.119. PMC 5702726. PMID 28731476.
  4. Glassman SI, Martiny JB (July 2018). "Broadscale Ecological Patterns Are Robust to Use of Exact Sequence Variants versus Operational Taxonomic Units". mSphere. 3 (4). doi:10.1128/mSphere.00148-18. PMC 6052340. PMID 30021874.
  5. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP (2015-08-06). "DADA2: High resolution sample inference from amplicon data". bioRxiv. doi:10.1101/024034.
  6. Amir A, McDonald D, Navas-Molina JA, Kopylova E, Morton JT, Zech Xu Z, et al. (2017-04-25). Gilbert JA (ed.). "Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns". mSystems. 2 (2). doi:10.1128/mSystems.00191-16. PMC 5340863. PMID 28289731.
  7. Eren AM, Morrison HG, Lescault PJ, Reveillaud J, Vineis JH, Sogin ML (March 2015). "Minimum entropy decomposition: unsupervised oligotyping for sensitive partitioning of high-throughput marker gene sequences". The ISME Journal. 9 (4): 968–979. doi:10.1038/ismej.2014.195. PMC 4817710. PMID 25325381.
  8. Edgar RC (2016-10-15). "UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing". bioRxiv. doi:10.1101/081257.