2 base encoding

Two-base encoding scheme. In two-base encoding, each unique pair of bases on the 3' end of the probe is assigned one out of four possible colors. For example, "AA" is assigned to blue, "AC" is assigned to green, and so on for all 16 unique pairs. During sequencing, each base in the template is sequenced twice, and the resulting data are decoded according to this scheme.

2 Base Encoding, also called SOLiD (sequencing by oligonucleotide ligation and detection), is a next-generation sequencing technology developed by Applied Biosystems that has been commercially available since 2008. Next-generation technologies generate hundreds of thousands of small sequence reads at one time. Well-known examples of such DNA sequencing methods include 454 pyrosequencing (introduced in 2005), the Solexa system (introduced in 2006) and the SOLiD system (introduced in 2007). These methods reduced the cost of sequencing from $0.01/base in 2004 to nearly $0.0001/base in 2006 and increased sequencing capacity from 1,000,000 bases/machine/day in 2004 to more than 100,000,000 bases/machine/day in 2006.

2-base encoding is based on ligation sequencing rather than sequencing by synthesis. [1] However, instead of using fluorescently labeled 9-mer probes that distinguish only 6 bases, 2-base encoding uses fluorescently labeled 8-mer probes that distinguish the two 3'-most bases and can be cycled in a manner similar to the Macevicz method, so reads longer than 6 bp can be obtained (25-50 bp published, [2] 50 bp in NCBI in Feb 2008). Two-base encoding enables reading each base twice without performing twice the work. [3] [4] [5] [6]

General features

The general steps common to many of these next-generation sequencing techniques include:

  1. Random fragmentation of genomic DNA
  2. Immobilization of single DNA fragments on a solid support like a bead or a planar solid surface
  3. Amplification of DNA fragments on the solid surface using PCR and making polymerase colonies [7]
  4. Sequencing and subsequent in situ interrogation after each cycle using fluorescence scanning or chemiluminescence. [8]

In 1988, Whiteley et al. demonstrated the use of fluorescently labeled oligonucleotide ligation for the detection of DNA variants. [9] In 1995, Macevicz [10] demonstrated repeated ligation of oligonucleotides to detect contiguous DNA variants. In 2003, Dressman et al. [11] demonstrated the use of emulsion PCR to generate millions of clonally amplified beads on which these repeated ligation assays could be performed. In 2005, Shendure et al. performed a sequencing procedure that combined the Whiteley and Dressman techniques, ligating fluorescently labeled "8 base degenerate" 9-mer probes, each of which distinguished a different base according to the probe's label and non-degenerate base. This process was repeated (without regenerating an extendable end as in Macevicz) using identical primers but with probes whose labels identified a different non-degenerate base, to sequence 6 bp reads in the 5'->3' direction and 7 bp reads in the 3'->5' direction.

How it works

The SOLiD Sequencing System uses probes with dual base encoding.

The underlying chemistry is summarized in the following steps: [12]

- Step 1, Preparing a Library: This step begins with shearing the genomic DNA into small fragments. Then, two different adapters are added (for example A1 and A2). The resulting library contains template DNA fragments, which are tagged with one adapter at each end (A1-template-A2).

- Step 2, Emulsion PCR: In this step, an emulsion PCR (droplets of water suspended in oil) is performed using DNA fragments from the library, two primers (P1 and P2) that complement the previously used adapters (P1 with A1 and P2 with A2), other PCR reaction components, and 1 μm beads coupled with one of the primers (e.g. P1). The DNA library is diluted to maximize the number of droplets that contain exactly one DNA fragment and one bead.

In each droplet, the DNA template anneals to the P1-coupled bead from its A1 side. DNA polymerase then extends from P1 to make the complementary sequence, which eventually results in a bead enriched with PCR products from a single template. After the PCR reaction, templates are denatured and dissociate from the beads. Dressman et al. first described this technique in 2003.

- Step 3, Bead Enrichment: In practice, only 30% of beads have target DNA. To increase the proportion of beads that have target DNA, large polystyrene beads coated with A2 are added to the solution. Any bead carrying extended products binds a polystyrene bead through its P2 end. The resulting complexes are separated from untargeted beads and then melted to dissociate the targeted beads from the polystyrene. This step can raise the fraction of beads carrying target DNA from 30% before enrichment to 80% after enrichment, increasing the throughput of the system.

After enrichment, the 3'-end of the products (P2 end) is modified so that it can bond covalently in the next step. The products of this step are therefore DNA-coupled beads with a 3' modification on each DNA strand.

- Step 4, Bead Deposition: In this step, the products of the previous step are deposited onto a glass slide. Beads attach to the glass surface randomly through covalent bonds between the 3'-modified DNA on the beads and the glass.

- Step 5, Sequencing Reaction: As mentioned earlier, unlike other next-generation methods which perform sequencing through synthesis, 2-base encoding is based on sequencing by ligation. The ligation is performed using specific 8-mer probes:

These probes are eight bases in length with a free hydroxyl group at the 3’ end, a fluorescent dye at the 5’ end and a cleavage site between the fifth and sixth nucleotide. The first two bases (starting at the 3' end) are complementary to the nucleotides being sequenced. Bases 3 through 5 are degenerate and able to pair with any nucleotides on the template sequence. Bases 6-8 are also degenerate but are cleaved off, along with the fluorescent dye, as the reaction continues. Cleavage of the fluorescent dye and bases 6-8 leaves a free 5' phosphate group ready for further ligation. In this manner positions n+1 and n+2 are correctly base-paired followed by n+6 and n+7 being correctly paired, etc. The composition of bases n+3, n+4 and n+5 remains undetermined until further rounds of the sequencing reaction.

The sequencing step consists of five rounds, and each round consists of about 5-7 cycles (Figure 2). Each round begins with the addition of a P1-complementary universal primer. This primer has, for example, n nucleotides, and its 5'-end matches exactly with the 3'-end of P1. In each cycle, 8-mer probes are added and ligated according to their first and second bases. Then the remaining unbound probes are washed out, the fluorescent signal from the bound probe is measured, and the bound probe is cleaved between its fifth and sixth nucleotides. Finally, once all cycles of the round are complete, the primer and probes are reset for the next round.

In the next round, a new universal primer anneals at position n-1 (its 5'-end matches the base immediately before the 3'-end of P1), and the subsequent cycles are repeated as in the first round. The remaining three rounds are performed with new universal primers annealing at positions n-2, n-3 and n-4 relative to the 3'-end of P1.

A complete reaction of five rounds allows the sequencing of about 25 base pairs of the template from P1.
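The round and cycle arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the instrument's software: the function name `interrogated_pairs` and the coordinate convention (position 0 is the last base of P1, the text's position n; position 1 is the first template base) are assumptions made for the example, and exactly five cycles per round are assumed.

```python
from collections import Counter


def interrogated_pairs(rounds=5, cycles=5):
    """Return (round, cycle, first_pos, second_pos) for every ligation.

    Assumed convention: position 0 is the last base of P1 (position n in
    the text) and position 1 is the first template base. Round r uses a
    universal primer that ends at position n - r.
    """
    pairs = []
    for r in range(rounds):
        for k in range(cycles):
            # After cleavage, each cycle has extended the strand by 5 bases,
            # so the next probe reads 5 positions further along.
            first = 5 * k - r + 1
            pairs.append((r, k, first, first + 1))
    return pairs


pairs = interrogated_pairs()

# Count how often each position is covered by a color call.
coverage = Counter(pos for _, _, a, b in pairs for pos in (a, b))

for r in range(5):
    print("round", r + 1, [(a, b) for rr, _, a, b in pairs if rr == r])
print({pos: coverage[pos] for pos in range(1, 22)})
```

Under these assumptions the 25 ligations tile every dinucleotide exactly once, so each interior template position is covered by two color calls, which is the sense in which each base is read twice.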

- Step 6, Decoding Data: To decode the data, which are represented as colors, two things must be known. First, each color indicates two bases. Second, one of the bases in the sequence must be known: this base is incorporated in the sequence in the last (fifth) round of Step 5, and it is the last nucleotide at the 3'-end of the known P1. Because each color represents two nucleotides, and the second base of each dinucleotide unit is the first base of the following dinucleotide, knowing just one base in the sequence is enough to interpret the whole sequence (Figure 2). [13]
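The decoding step can be illustrated with a short Python sketch. It assumes the commonly described XOR form of the color code, in which bases are numbered A=0, C=1, G=2, T=3 and the color of a dinucleotide is the XOR of its two base numbers; this agrees with the examples given here ("AA" is color 0, blue; "AC" is color 1, green), but the full table is an assumption of the sketch rather than something stated in the text. The function names are illustrative.

```python
BASES = "ACGT"  # assumed numbering: A=0, C=1, G=2, T=3


def color_of(b1, b2):
    """Color (0-3) of a dinucleotide under the assumed XOR code."""
    return BASES.index(b1) ^ BASES.index(b2)


def decode(known_base, colors):
    """Translate a color read into bases, starting from one known base.

    known_base is the last base of the P1 adapter; each color, together
    with the previous base, determines the next base.
    """
    seq = [known_base]
    for c in colors:
        seq.append(BASES[BASES.index(seq[-1]) ^ c])
    return "".join(seq[1:])  # drop the adapter base


colors = [3, 2, 1, 0, 2]
print(decode("T", colors))  # AGTTC
```

Because each color depends on the base before it, the decoder needs only the single known adapter base to unwind the whole read.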

2 Base Encoding considerations

In practice, direct translation of color reads into base reads is not advised: as soon as one color call is wrong, every base call downstream of it is wrong as well, much like a frameshift. To best leverage the "error correction" properties of two-base encoding, it is best to convert the base reference sequence into color-space. A base reference sequence converts unambiguously into color-space; the reverse conversion is also unambiguous, but it can be wildly inaccurate if there are any sequencing errors. [14]
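A small Python experiment shows why. It is a sketch under the assumed XOR color code (A=0, C=1, G=2, T=3, color = XOR of adjacent base numbers); the reference sequence and the position of the miscalled color are made up for illustration.

```python
BASES = "ACGT"  # assumed numbering: A=0, C=1, G=2, T=3


def encode(seq):
    """Convert a base sequence into color-space (one color per adjacent pair)."""
    idx = [BASES.index(b) for b in seq]
    return [a ^ b for a, b in zip(idx, idx[1:])]


def decode(first_base, colors):
    """Translate colors back into bases, keeping the known first base."""
    seq = [first_base]
    for c in colors:
        seq.append(BASES[BASES.index(seq[-1]) ^ c])
    return "".join(seq)


reference = "TAGTTCGACA"  # hypothetical reference
ref_colors = encode(reference)

read_colors = list(ref_colors)
read_colors[3] ^= 1  # one miscalled color (hypothetical measurement error)

translated = decode(reference[0], read_colors)
base_mismatches = [i for i, (a, b) in enumerate(zip(reference, translated)) if a != b]
color_mismatches = [i for i, (a, b) in enumerate(zip(ref_colors, read_colors)) if a != b]

print(translated)        # every base after the error differs from the reference
print(base_mismatches)   # [4, 5, 6, 7, 8, 9]
print(color_mismatches)  # [3]
```

Compared in color-space, the read differs from the reference at a single position; translated into bases first, the same read differs at every position after the error.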

Mapping color-space reads to a color-space reference can properly utilize the two-base encoding rules where only adjacent color differences can represent a true base polymorphism. Direct decoding or translation of the color reads into bases cannot do this efficiently without other knowledge.

More specifically, this method is not an error correction tool but an error transformation tool. Color-space makes the most common error mode (a single measurement error, which changes one color) look different from the most common form of DNA variation (SNPs, or single base changes, which change two adjacent colors). Logical rules then classify adjacent color changes as 'valid' or 'invalid'.
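One way to express such a rule is sketched below in Python. It assumes the XOR form of the color code, under which a single base change leaves the XOR of the two affected colors unchanged; the function name `classify_mismatches`, the labels, and the example color strings are all illustrative. Runs of three or more adjacent mismatches are handled greedily, in pairs, which is a simplification.

```python
def classify_mismatches(ref_colors, read_colors):
    """Label color mismatches between a read and a color-space reference."""
    diff = [i for i, (r, c) in enumerate(zip(ref_colors, read_colors)) if r != c]
    labels = {}
    i = 0
    while i < len(diff):
        p = diff[i]
        if i + 1 < len(diff) and diff[i + 1] == p + 1:
            # A single base change keeps the XOR of the two colors it touches.
            same_xor = (ref_colors[p] ^ ref_colors[p + 1]) == (
                read_colors[p] ^ read_colors[p + 1]
            )
            labels[(p, p + 1)] = (
                "valid adjacent change (SNP candidate)"
                if same_xor
                else "invalid adjacent change"
            )
            i += 2
        else:
            labels[(p,)] = "isolated change (likely measurement error)"
            i += 1
    return labels


# Colors of the hypothetical reference TAGTTCGACA under the assumed code.
ref_colors = [3, 2, 1, 0, 2, 3, 2, 1, 1]
# Read with an isolated error at 0, a C->G change seen at colors 4-5,
# and an inconsistent pair of changes at colors 7-8.
read_colors = [1, 2, 1, 0, 1, 0, 2, 2, 0]

for positions, label in classify_mismatches(ref_colors, read_colors).items():
    print(positions, label)
```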

The likelihood of getting two adjacent errors in a 50-bp read can be estimated. There are 49 ways of making two adjacent changes to a 50-letter string (50-bp read). There are 1225 ways of making any two changes to a 50-letter string (50 choose 2). Simplistically, if one assumes errors are completely random (in practice they are more frequent at the end of reads), only 49 out of 1225 double errors will be candidates for SNPs. In addition, only one third of the adjacent errors are valid according to the known labeling of the probes, leaving only about 16 out of 1225 double errors as candidates for SNPs. This is particularly useful for low-coverage SNP detection, as it reduces false positives at low coverage (Smith et al. [15]).
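The counts in this estimate can be checked directly. The Python sketch below reproduces the 49 and 1225 figures and derives the one-third factor by brute force, assuming the XOR form of the color code (a single base change preserves the XOR of the two adjacent colors it affects); variable names are illustrative.

```python
from itertools import product
from math import comb

read_length = 50
total_pairs = comb(read_length, 2)  # any two positions: 1225
adjacent_pairs = read_length - 1    # two neighbouring positions: 49

# Enumerate every way of changing both of two adjacent colors (c1, c2)
# into (d1, d2), and count those consistent with a single base change.
changes = 0
valid = 0
for c1, c2, d1, d2 in product(range(4), repeat=4):
    if d1 != c1 and d2 != c2:
        changes += 1
        if c1 ^ c2 == d1 ^ d2:
            valid += 1

valid_fraction = valid / changes  # 48 / 144, i.e. one third
snp_like = adjacent_pairs * valid_fraction

print(total_pairs, adjacent_pairs)
print(valid, changes, valid_fraction)
print(round(snp_like, 1), round(snp_like / total_pairs, 4))
```

Under the uniform-error assumption, roughly 16 of the 1225 possible double errors (about 1.3%) would mimic a SNP.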

Advantages

Each base in this sequencing method is read twice, so a single base change (SNP) changes two adjacent color-space calls. Therefore, in order to miscall a SNP, two adjacent colors must be miscalled. Because of this, the SNP miscall rate is on the order of e^2, where e is the device error rate.

Disadvantages

When base calling, a single color miscall causes errors in the remaining portion of the read. In SNP calling this can be corrected, which results in a lower SNP calling error rate. However, for simplistic de novo assembly one is left with the raw device error rate, which will be significantly higher than the 0.06% reported for SNP calling. Quality filtering can deliver reads of higher raw accuracy which, when aligned to form color contigs, can deliver reference sequences where 2 base encoding can be better leveraged. Hybrid assemblies with other technologies can also better utilize the 2 base encoding.

References

  1. Jay Shendure et al. (2005) Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome. Science 309(5741), 1728 - 1732
  2. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu Y, Tsung EF, Clouser CR, Duncan C, Ichikawa JK, Lee CC, Zhang Z, Ranade SS, Dimalanta ET, Hyland FC, Sokolsky TD, Zhang L, Sheridan A, Fu H, Hendrickson CL, Li B, Kotler L, Stuart JR, Malek JA, Manning JM, Antipova AA, Perez DS, Moore MP, Hayashibara KC, Lyons MR, Beaudoin RE, Coleman BE, Laptewicz MW, Sannicandro AE, Rhodes MD, Gottimukkala RK, Yang S, Bafna V, Bashir A, MacBride A, Alkan C, Kidd JM, Eichler EE, Reese MG, De La Vega FM, Blanchard AP. Genome Res. 2009 Sep;19(9):1527-41. Epub 2009 Jun 22.
  3. Patent: Reagents,Methods and Libraries for Bead-Based Sequencing
  4. Article: A high-resolution, nucleosome position map of C. elegans reveals a lack of universal...
  5. Article: Stem cell transcriptome profiling via massive-scale mRNA sequencing
  6. Rapid whole-genome mutational profiling using next-generation sequencing technologies, Genome Research, 2008 18:1638-1642
  7. Chetverin, NAR, 1993, Vol.21, No. 10 2349-2353
  8. Matthew E. Hudson (2008) Sequencing breakthroughs for genomic ecology and evolutionary biology. Molecular Ecology Resources 8 (1), 3–17
  9. Whiteley US patent number 4,883,750
  10. Macevicz US patent number 5,750,341
  11. Dressman D, Yan H, Traverso G, Kinzler KW, Vogelstein B (2003) Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. PNAS July 22, 2003 Vol. 100 no. 15, 8817-8822
  12. Applied Biosystems
  13. Tech Summary: ABI's SOLiD (Seq. by Oligo Ligation/Detection) - SEQanswers
  14. Colorspace to FastQ example
  15. Smith et al., Genome Research 2008 18:1638-1642