Illumina dye sequencing

Illumina dye sequencing is a DNA sequencing technique, that is, a method for determining the order of base pairs in DNA. The reversible terminator chemistry concept was invented by Bruno Canard and Simon Sarfati at the Pasteur Institute in Paris. [1] [2] It was developed by Shankar Balasubramanian and David Klenerman of Cambridge University, [3] who subsequently founded Solexa, a company later acquired by Illumina. This sequencing method is based on reversible dye-terminators that enable the identification of single nucleotides as they are washed over DNA strands. It can be used for whole-genome and region sequencing, transcriptome analysis, metagenomics, small RNA discovery, methylation profiling, and genome-wide protein-nucleic acid interaction analysis. [4] [5]

The DNA attaches to the flow cell via complementary sequences. The strand bends over and attaches to a second oligo forming a bridge. A polymerase synthesizes the reverse strand. The two strands release and straighten. Each forms a new bridge (bridge amplification). The result is a cluster of DNA forward and reverse strand clones.

Overview

Illumina sequencing works in three basic steps: amplify, sequence, and analyze. The process begins with purified DNA. The DNA is fragmented, and adapters are added that contain segments that act as reference points during amplification, sequencing, and analysis. The modified DNA is loaded onto a flow cell where amplification and sequencing will take place. The flow cell contains nanowells that space out fragments and help prevent overcrowding. [6] Each nanowell contains oligonucleotides that provide an anchoring point for the adapters to attach. Once the fragments have attached, a phase called cluster generation begins. This step makes about a thousand copies of each fragment of DNA and is done by bridge amplification PCR. Next, primers and modified nucleotides are washed onto the chip. These nucleotides have a reversible fluorescent blocker so the DNA polymerase can only add one nucleotide at a time onto the DNA fragment. [6] After each round of synthesis, a camera takes a picture of the chip. A computer determines what base was added by the wavelength of the fluorescent tag and records it for every spot on the chip. After each round, non-incorporated molecules are washed away. A chemical deblocking step is then used to remove the 3' fluorescent terminal blocking group. The process continues until the full DNA molecule is sequenced. [5] With this technology, thousands of places throughout the genome are sequenced at once via massively parallel sequencing.

Procedure

Genomic library

After the DNA is purified, a DNA library (genomic library) needs to be generated. A genomic library can be created in two ways: sonication and tagmentation. With tagmentation, transposases randomly cut the DNA into fragments of 50 to 500 bp and add adapters at the same time. [6] Alternatively, sonication fragments genomic DNA into similar sizes using ultrasonic sound waves. After sonication, right and left adapters need to be attached by T7 DNA polymerase and T4 DNA ligase. Strands that fail to have adapters ligated are washed away. [7]

Double-stranded DNA is cleaved by transposomes. The cut ends are repaired and adapters, indices, primer binding sites, and terminal sites are added to each strand of the DNA. Image based in part on Illumina's sequencing video.

Adapters

Adapters contain three different segments: the sequence complementary to the solid support (oligonucleotides on the flow cell), the barcode sequence (indices), and the binding site for the sequencing primer. [6] Indices are usually six base pairs long and are used during DNA sequence analysis to identify samples. Indices allow up to 96 different samples to be run together, a practice known as multiplexing. During analysis, the computer groups all reads with the same index together. [8] [9] Illumina uses a "sequence by synthesis" approach. [9] This process takes place inside an acrylamide-coated glass flow cell. [10] The flow cell has oligonucleotides (short nucleotide sequences) coating the bottom of the cell, and they serve as the solid support that holds the DNA strands in place during sequencing. As the fragmented DNA is washed over the flow cell, the appropriate adapter attaches to the complementary solid support.
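
The grouping of reads by index can be sketched in a few lines of Python. This is only an illustration: the read layout (a six-base index at the start of each read string), the index sequences, and the sample names are assumptions made for the example and do not reflect Illumina's actual file formats or software.

```python
from collections import defaultdict

INDEX_LENGTH = 6  # the text notes that indices are usually six bases long


def demultiplex(reads, index_to_sample):
    """Group reads by their leading index; unknown indices go to 'undetermined'."""
    groups = defaultdict(list)
    for read in reads:
        index, insert = read[:INDEX_LENGTH], read[INDEX_LENGTH:]
        sample = index_to_sample.get(index, "undetermined")
        groups[sample].append(insert)
    return dict(groups)


# Hypothetical indices and reads, for illustration only.
samples = {"ACGTAC": "sample_A", "TTGCAA": "sample_B"}
reads = ["ACGTACGGATTC", "TTGCAACCGTAA", "ACGTACTTTGGA", "GGGGGGACGTAC"]
grouped = demultiplex(reads, samples)
```

Reads whose index matches no known sample are kept in a separate group rather than discarded, so that index errors remain visible in the output.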

Millions of oligos line the bottom of each flow cell lane.

Bridge amplification

Once the fragments are attached, cluster generation can begin. The goal is to create hundreds of identical strands of DNA. Some will be the forward strand; the rest, the reverse. This is why right and left adapters are used. Clusters are generated through bridge amplification. DNA polymerase moves along a strand of DNA, creating its complementary strand. The original strand is washed away, leaving only the reverse strand. At the top of the reverse strand there is an adapter sequence. The DNA strand bends and attaches to the oligo that is complementary to the top adapter sequence. Polymerases attach to the reverse strand, and its complementary strand (which is identical to the original) is made. The now double-stranded DNA is denatured so that each strand can separately attach to an oligonucleotide sequence anchored to the flow cell. One will be the reverse strand; the other, the forward. This process is called bridge amplification, and it happens for thousands of clusters all over the flow cell at once. [11]

Clonal amplification

Over and over again, DNA strands will bend and attach to the solid support. DNA polymerase will synthesize a new strand to create a double stranded segment, and that will be denatured so that all of the DNA strands in one area are from a single source (clonal amplification). Clonal amplification is important for quality control purposes. If a strand is found to have an odd sequence, then scientists can check the reverse strand to make sure that it has the complement of the same oddity. The forward and reverse strands act as checks to guard against artefacts. Because Illumina sequencing uses DNA polymerase, base substitution errors have been observed, [12] especially at the 3' end. [13] Paired end reads combined with cluster generation can confirm an error took place. The reverse and forward strands should be complementary to each other, all reverse reads should match each other, and all forward reads should match each other. If a read is not similar enough to its counterparts (with which it should be a clone), an error may have occurred. A minimum threshold of 97% similarity has been used in some labs' analyses. [13]
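
The forward/reverse consistency check can be illustrated with a minimal Python sketch. The 97% threshold comes from the text above; measuring similarity as per-position identity between equal-length reads is a simplifying assumption (real pipelines typically align reads first), and the function names and sequences are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def similarity(a, b):
    """Fraction of positions at which two equal-length reads agree."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def is_consistent(forward, reverse, threshold=0.97):
    """True if the reverse read, once reverse-complemented, matches the forward read."""
    return similarity(forward, reverse_complement(reverse)) >= threshold


forward = "ACCGTTAGCA" * 10            # hypothetical 100-base forward read
reverse = reverse_complement(forward)  # an error-free reverse read
two_errors = "AA" + reverse[2:]        # 2 substituted bases -> 98% identity
four_errors = "AAAA" + reverse[4:]     # 4 substituted bases -> 96% identity

results = [is_consistent(forward, r) for r in (reverse, two_errors, four_errors)]
```

With these inputs the first two reads pass the threshold and the third is flagged as a possible error.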

Sequence by synthesis

At the end of clonal amplification, all of the reverse strands are washed off the flow cell, leaving only forward strands. A primer attaches to the primer binding site in the forward strand's adapter, and a polymerase adds a fluorescently tagged dNTP to the DNA strand. Only one base can be added per round because the fluorophore acts as a blocking group; however, the blocking group is reversible. [6] In four-color chemistry, each of the four bases has a unique emission, and after each round the machine records which base was added. Once the color is recorded, the fluorophore is washed away, another dNTP is washed over the flow cell, and the process is repeated.
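
Base calling under four-color chemistry can be pictured as choosing, for each cycle, the channel with the strongest signal. The sketch below assumes one normalized intensity per base channel in the order A, C, G, T; that ordering and the intensity values are illustrative assumptions, and production base callers also correct for effects such as channel cross-talk, which are omitted here.

```python
BASES = ("A", "C", "G", "T")  # assumed channel order, for illustration only


def call_bases(cycles):
    """Call one base per cycle as the channel with the highest intensity."""
    calls = []
    for intensities in cycles:
        if len(intensities) != len(BASES):
            raise ValueError("expected one intensity per base channel")
        brightest = max(range(len(BASES)), key=lambda i: intensities[i])
        calls.append(BASES[brightest])
    return "".join(calls)


# Hypothetical intensities for one cluster over four sequencing cycles.
cluster_signal = [
    (0.91, 0.05, 0.02, 0.02),
    (0.03, 0.04, 0.88, 0.05),
    (0.10, 0.07, 0.06, 0.77),
    (0.02, 0.85, 0.08, 0.05),
]
read = call_bases(cluster_signal)
```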

Starting with the launch of the NextSeq and later the MiniSeq, Illumina introduced a new two-color sequencing chemistry. Nucleotides are distinguished by one of two colors (red or green), by no color ("black"), or by a combination of both colors (appearing orange as a mixture of red and green).
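
Two-color decoding can be sketched as a lookup on whether each channel is on or off. The text above does not say which base corresponds to which color state; the assignment used here (C red, T green, A both, G neither) is an assumption for illustration, as are the threshold and the intensity values.

```python
# Assumed dye assignment (not stated in the text above):
# C = red only, T = green only, A = both dyes, G = no dye.
TWO_COLOR_CALLS = {
    (True, False): "C",
    (False, True): "T",
    (True, True): "A",
    (False, False): "G",
}


def call_base_two_color(red, green, threshold=0.5):
    """Call a base from red and green intensities scaled to the range 0..1."""
    return TWO_COLOR_CALLS[(red >= threshold, green >= threshold)]


# Hypothetical (red, green) intensities for four cycles of one cluster.
signals = [(0.9, 0.1), (0.1, 0.8), (0.7, 0.9), (0.05, 0.1)]
read = "".join(call_base_two_color(r, g) for r, g in signals)
```

Because one base is represented by the absence of signal, a cluster that has simply gone dark is indistinguishable from that base in this simple model.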

Tagged nucleotides are added in order to the DNA strand. Each of the four nucleotides has an identifying label that can be excited to emit a characteristic wavelength. A computer records all of the emissions, and from this data, base calls are made.

Once the DNA strand has been read, the strand that was just added is washed away. Then, the index 1 primer attaches, polymerizes the index 1 sequence, and is washed away. The strand forms a bridge again, and the 3' end of the DNA strand attaches to an oligo on the flow cell. The index 2 primer attaches, polymerizes the sequence, and is washed away.

A polymerase synthesizes the complementary strand on top of the arched strand. They separate, and the 3' end of each strand is blocked. The forward strand is washed away, and the process of sequence by synthesis repeats for the reverse strand.

Data analysis

The sequencing occurs for millions of clusters at once, and each cluster has ~1,000 identical copies of a DNA insert. [12] The sequence data is analyzed by finding reads with overlapping areas and lining them up into longer contiguous sequences, called contigs. If a reference sequence is known, the contigs are then compared to it for variant identification.
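
The idea of lining up overlapping reads and then comparing the result to a reference can be shown with a toy Python sketch. Greedy suffix-prefix merging is used here only because it is short; it is not the method used by production assemblers, which typically build overlap or de Bruijn graphs and tolerate sequencing errors. All reads, the reference, and the function names are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


def find_substitutions(contig, reference):
    """List (position, reference_base, observed_base) for equal-length sequences."""
    return [(i, r, c) for i, (r, c) in enumerate(zip(reference, contig)) if r != c]


reads = ["ATGGCGTA", "CGTACGTT", "CGTTAGC"]  # hypothetical short reads
contigs = greedy_assemble(reads)
reference = "ATGGCGTACGCTAGC"                # hypothetical reference sequence
variants = find_substitutions(contigs[0], reference)
```

Here the three reads merge into a single 15-base contig, and the comparison reports one substitution relative to the reference.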

This piecemeal process allows scientists to see the complete sequence even though an unfragmented sequence was never run; however, because Illumina read lengths are not very long [13] (HiSeq sequencing can produce read lengths around 90 bp long [8] ), it can be a struggle to resolve short tandem repeat areas. [8] [12] Also, if the sequence is de novo and a reference does not exist, repeated areas can cause a lot of difficulty in sequence assembly. [12] Additional difficulties include base substitutions (especially at the 3' end of reads [13] ) by inaccurate polymerases, chimeric sequences, and PCR-bias, all of which can contribute to generating an incorrect sequence. [13]

Comparison with other sequencing methods

This technique offers several advantages over traditional sequencing methods such as Sanger sequencing. Sanger sequencing requires two reactions, one for the forward primer and another for the reverse primer. Unlike Illumina, Sanger sequencing uses fluorescently labeled dideoxynucleoside triphosphates (ddNTPs) to determine the sequence of the DNA fragment. ddNTPs are missing the 3' OH group and terminate DNA synthesis permanently. [6] In each reaction tube, dNTPs and ddNTPs are added, along with DNA polymerase and primers. The ratio of ddNTPs to dNTPs matters because the template DNA needs to be completely synthesized, and an overabundance of ddNTPs will create multiple fragments of the same size and position of the DNA template. When the DNA polymerase adds a ddNTP, the fragment is terminated and a new fragment is synthesized. Each fragment synthesized is one nucleotide longer than the last. Once the DNA template has been completely synthesized, the fragments are separated by capillary electrophoresis. At the bottom of the capillary tube, a laser excites the fluorescently labeled ddNTPs and a camera captures the color emitted.
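
The way Sanger sequencing recovers a sequence from terminated fragments can be illustrated with a short Python sketch. It is an idealized model: it assumes exactly one fragment terminates at every position and that the fragments separate perfectly by length, and the strand and function names are hypothetical.

```python
import random


def terminated_fragments(full_strand):
    """One fragment per position: (length, base of the terminating ddNTP)."""
    return [(n, full_strand[n - 1]) for n in range(1, len(full_strand) + 1)]


def read_by_size(fragments):
    """Order fragments by length, as electrophoresis does, and read the labels."""
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    return "".join(base for _, base in ordered)


strand = "GATTACAGC"                 # hypothetical newly synthesized strand
fragments = terminated_fragments(strand)
random.Random(0).shuffle(fragments)  # fragments are mixed together in the tube
recovered = read_by_size(fragments)
```

Sorting by length restores the order of the bases regardless of how the fragments were mixed, which is why each fragment needs to differ in length by a single nucleotide.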

Due to the automated nature of Illumina dye sequencing, it is possible to sequence multiple strands at once and obtain sequencing data quickly. With Sanger sequencing, only one strand can be sequenced at a time, which makes it relatively slow. Illumina sequencing uses only DNA polymerase, as opposed to the multiple, expensive enzymes required by other sequencing techniques (e.g. pyrosequencing). [14]

References

  1. CA 2158975, Canard, Bruno & Sarfati, Simon, "Novel derivatives usable for the sequencing of nucleic acids", published 1994-10-13, assigned to Pasteur Institute.
  2. Canard B, Sarfati RS (October 1994). "DNA polymerase fluorescent substrates with reversible 3'-tags". Gene. 148 (1): 1–6. doi:10.1016/0378-1119(94)90226-7. PMID 7523248.
  3. "History of Illumina Sequencing". Archived from the original on 12 October 2014.
  4. "Illumina - Sequencing and array-based solutions for genetic research". www.illumina.com.
  5. Meyer M, Kircher M (June 2010). "Illumina sequencing library preparation for highly multiplexed target capture and sequencing". Cold Spring Harbor Protocols. 2010 (6): pdb.prot5448. doi:10.1101/pdb.prot5448. PMID 20516186.
  6. Clark, David P.; Pazdernik, Nanette Jean; McGehee, Michelle R. (2 November 2018). Molecular Biology (Third ed.). London. ISBN 978-0-12-813289-0. OCLC 1062496183.
  7. "Illumina Sequencing Technology". YouTube. Retrieved 24 September 2015.
  8. Feng YJ, Liu QF, Chen MY, Liang D, Zhang P (January 2016). "Parallel tagged amplicon sequencing of relatively long PCR products using the Illumina HiSeq platform and transcriptome assembly". Molecular Ecology Resources. 16 (1): 91–102. doi:10.1111/1755-0998.12429. PMID 25959587. S2CID 36882760.
  9. Illumina, Inc. "Multiplexed Sequencing with the Illumina Genome Analyzer System" (PDF). Retrieved 25 September 2015.
  10. Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, et al. (July 2012). "A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers". BMC Genomics. 13: 341. doi:10.1186/1471-2164-13-341. PMC 3431227. PMID 22827831.
  11. Clark, David P.; Pazdernik, Nanette J.; McGehee, Michelle R. (2019). Molecular Biology. Academic Cell. pp. 253–255. ISBN 9780128132883.
  12. Morozova O, Marra MA (November 2008). "Applications of next-generation sequencing technologies in functional genomics". Genomics. 92 (5): 255–64. doi:10.1016/j.ygeno.2008.07.001. PMID 18703132.
  13. Jeon YS, Park SC, Lim J, Chun J, Kim BS (January 2015). "Improved pipeline for reducing erroneous identification by 16S rRNA sequences using the Illumina MiSeq platform". Journal of Microbiology. 53 (1): 60–9. doi:10.1007/s12275-015-4601-y. PMID 25557481. S2CID 17210846.
  14. Ronaghi, Mostafa; Uhlén, Mathias; Nyrén, Pål (1998-07-17). "A Sequencing Method Based on Real-Time Pyrophosphate". Science. 281 (5375): 363–365. doi:10.1126/science.281.5375.363. ISSN 0036-8075. PMID 9705713.