Read (biology)

In DNA sequencing, a read is an inferred sequence of base pairs (or base pair probabilities) corresponding to all or part of a single DNA fragment. A typical sequencing experiment involves fragmentation of the genome into millions of molecules, which are size-selected and ligated to adapters. The set of fragments is referred to as a sequencing library, which is sequenced to produce a set of reads. [1]

Read length

Sequencing technologies vary in the length of reads produced. Reads of length 20–40 base pairs (bp) are referred to as ultra-short. [2] Typical sequencers produce read lengths in the range of 100–500 bp. [3] A 2012 platform comparison, however, reported read lengths of approximately 1500 bp for the Pacific Biosciences platform. [4] Read length is a factor that can affect the results of biological studies. [5] For example, longer read lengths improve the resolution of de novo genome assembly and the detection of structural variants. It is estimated that read lengths greater than 100 kilobases (kb) will be required for routine de novo human genome assembly. [6] Bioinformatic pipelines that analyze sequencing data usually take read lengths into account. [7]
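As a rough illustration of how a pipeline might inspect read lengths, the following sketch parses FASTQ-formatted text (assuming the common four-lines-per-record layout) and labels each read using the ranges quoted above. The function names, the "other" label, and the example records are hypothetical and are not taken from any particular tool.

```python
def read_lengths(fastq_text):
    """Return the length of each read in FASTQ text (assumes 4 lines per record)."""
    lines = fastq_text.strip().splitlines()
    # In a four-line FASTQ record, the sequence is the second line.
    return [len(lines[i + 1].strip()) for i in range(0, len(lines), 4)]


def length_class(n):
    """Label a read length using the ranges quoted in the text (labels are illustrative)."""
    if 20 <= n <= 40:
        return "ultra-short"
    if 100 <= n <= 500:
        return "typical"
    return "other"


# Two made-up records: one of 24 bases and one of 120 bases.
example = "\n".join([
    "@read1", "ACGT" * 6, "+", "I" * 24,
    "@read2", "ACGT" * 30, "+", "I" * 120,
])

lengths = read_lengths(example)
labels = [length_class(n) for n in lengths]
print(list(zip(lengths, labels)))  # [(24, 'ultra-short'), (120, 'typical')]
```

Real FASTQ files are often compressed or very large, so a production pipeline would more likely stream records than load the whole text at once.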

Generations of sequencing and read lengths

A genome is the complete genetic information of an organism or a cell. Single- or double-stranded nucleic acids store this information in a linear or circular sequence. To determine this sequence precisely, increasingly efficient technologies with greater accuracy, throughput and sequencing speed have been developed over time. The Sanger and Maxam-Gilbert sequencing methods, both published in 1977, initiated the field of DNA sequencing and are classified as first-generation sequencing technologies. [8] First-generation sequencing typically has read lengths of 400 to 900 base pairs. [citation needed]

In 2005, Roche's 454 technology introduced a new sequencing approach that was capable of high throughput at low cost. [9] This and similar technologies came to be known as second-generation sequencing or next-generation sequencing (NGS). One of the hallmarks of NGS is short sequence reads. NGS methods may sequence millions to billions of reads in a single run, and producing gigabases of sequence takes only hours to a few days, giving NGS far higher throughput than first-generation techniques such as Sanger sequencing. NGS techniques produce short reads, i.e. 80–200 bases, as opposed to the longer reads produced by Sanger sequencing. [10]

Beginning in the 2010s, new technologies ushered in the era of third-generation sequencing (TGS). TGS describes methods that are capable of sequencing single DNA molecules without amplification. While Sanger and short-read sequencing (SRS) techniques produce read lengths of at most about one kilobase pair, third-generation sequencing technologies can produce read lengths of 5 to 30 kilobase pairs. As of the 2019 review cited here, the longest read generated by a third-generation sequencing technology was 2 million base pairs. [11]

NGS and read mapping

Historically, only one individual per species was sequenced because of time and expense constraints, and its sequence served as the species' "reference" genome. These reference genomes can guide resequencing efforts in the same species by serving as a read mapping template. Read mapping is the process of aligning NGS reads to a reference genome. [12] Many NGS applications, such as genome variation calling, transcriptome analysis, transcription factor binding site calling, epigenetic mark calling and metagenomics, require read mapping. The performance of these applications depends on the accuracy of the alignment. Furthermore, because the number of reads is so large, the mapping process must be efficient. Different methods are used to align reads to a reference genome, depending on how many mismatches and indels are allowed. Roughly speaking, the methods can be divided into two categories: the seed-and-extend approach and the filtering approach. Many short-read aligners use the seed-and-extend strategy, including BWA-SW, Bowtie 2, BatAlign, LAST, Cushaw2 and BWA-MEM. A filter-based approach is used by methods such as SeqAlto, GEM and MASAI. [13]
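The seed-and-extend idea can be sketched in a few lines. The toy example below is only an illustration under simplifying assumptions: it indexes every k-mer of the reference in a dictionary, uses exact k-mer matches as seeds, and "extends" by counting mismatches over the whole read without allowing indels. It does not reproduce the algorithm of any aligner named above, and all function names and parameters are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) for the best ungapped placement, or None."""
    best = None
    tried = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        # Seed step: exact k-mer hits propose candidate start positions.
        for hit in index.get(seed, ()):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference) or start in tried:
                continue
            tried.add(start)
            # Extend step (simplified): count mismatches over the full read.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


reference = "ACGTTGCATGCCGATAGCTAGGCTTACG"
read = "TGCCGCTAGCTA"  # reference[8:20] with one substituted base
index = build_kmer_index(reference, k=4)
print(map_read(read, reference, index, k=4))  # (8, 1)
```

Practical aligners replace the dictionary with compressed indexes and use gapped dynamic programming in the extension step, which is what lets them tolerate indels as well as mismatches.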

Genome assembly and sequence reads

In genomics, reconstructing a genome from sequencing reads is a significant challenge. Because fragments are sampled at random, the reads are expected to cover the entire genome roughly uniformly. Reads are stitched together computationally to reconstruct the genome. This process is known as de novo genome assembly.

Sanger sequencing has a longer read length than NGS. Two assemblers were developed for assembling Sanger sequencing reads: the overlap-layout-consensus (OLC) assembler Celera and the de Bruijn graph assembler Euler. These two methods were used to put together the human reference genome. However, since Sanger sequencing is low-throughput and expensive, only a few genomes have been assembled with Sanger sequencing.
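To make the de Bruijn graph idea concrete, the sketch below builds a graph whose nodes are (k-1)-mers and whose edges are the distinct k-mers observed in the reads, then walks the graph while each node has exactly one successor. It is a toy illustration that assumes error-free reads and a repeat-free sequence; it is not the Euler assembler, and the function names and example reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a graph with one edge (prefix -> suffix) per distinct k-mer in the reads."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                continue  # overlapping reads contribute the same k-mer only once
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unique_path(graph, start):
    """Follow edges from start while the path is unambiguous; return the spelled contig."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:
            break  # stop rather than loop forever on a cycle
        visited.add(node)
        contig += node[-1]
    return contig


# Three overlapping, error-free reads sampled from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
print(walk_unique_path(graph, "ATG"))  # ATGGCGTGCA
```

When a repeat is longer than k-1 bases, some node has more than one successor and the walk stops early, which is one way to see why short reads tend to yield fragmented assemblies.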

Second-generation sequencing reads are short, but these sequencing techniques can efficiently and cost-effectively sequence hundreds of millions of reads. Custom genome assemblers have been built to reconstruct genomes from such short sequences, and their success spawned several de novo genome assembly projects. Although this method is cost-effective, the reads are short relative to the repeat regions, resulting in fragmented assemblies.

The arrival of third-generation sequencing has made very long reads (on the order of 10,000 bp) available. Long reads are capable of resolving the ordering of repeat regions, although they have a high error rate (15–18%). A number of computational methods have been devised to correct errors in third-generation sequencing reads.

Assembling with short reads and assembling with long reads have different advantages and disadvantages owing to error rates and ease of assembly. Sometimes a hybrid method is preferred, in which short reads and long reads are combined to get a better result. There are two approaches. The first uses mate-pair reads and long reads to improve the assembly built from the short reads. The second uses short reads to correct the errors in long reads.

Advantages and disadvantages of short reads

Second-generation sequencing generates short reads (of length < 300 bp) that are highly accurate (sequencing error rate of about 1%). Short-read sequencing technologies have made sequencing much easier, faster and cheaper than Sanger sequencing. The August 2019 report from the National Human Genome Research Institute put the cost of sequencing a complete human genome at US$942. [14] [15]
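Error rates like the roughly 1% figure above are often expressed as Phred quality scores, a convention not otherwise discussed in this article, using Q = -10 · log10(p), where p is the probability that a base call is wrong. The minimal sketch below converts between the two; the function names are illustrative.

```python
import math


def phred_from_error(p):
    """Convert a per-base error probability into a Phred quality score."""
    return -10.0 * math.log10(p)


def error_from_phred(q):
    """Convert a Phred quality score back into a per-base error probability."""
    return 10.0 ** (-q / 10.0)


# An error rate of about 1% corresponds to a quality score of about 20.
print(round(phred_from_error(0.01), 2))
# A 15% error rate, as quoted for uncorrected long reads, is a much lower score.
print(round(phred_from_error(0.15), 2))
```

Under this convention each increase of 10 in the score corresponds to a tenfold drop in the error probability.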

The inability to sequence lengthy sections of DNA is a drawback shared by all second-generation sequencing technologies. To use NGS to sequence a large genome such as the human genome, the DNA must be fragmented and amplified in clones ranging from 75 to 400 base pairs, which is why NGS is also known as "short-read sequencing" (SRS). Once the short reads have been sequenced, reconstruction becomes a computational problem, and many computer programs and techniques have been developed to assemble the random clones into a contiguous sequence. [16]

A necessary step in SRS is the polymerase chain reaction (PCR), which causes preferential amplification of repetitive DNA. SRS also fails to generate sufficient overlapping sequence from the DNA fragments. This constitutes a major challenge for de novo sequencing of a highly complex and repetitive genome such as the human genome. [17] Another challenge with SRS is the detection of large sequence changes, which is a major roadblock to studying structural variations. [18]

Advantages and disadvantages of long reads

Third-generation sequencing produces long reads and is often referred to as long-read sequencing (LRS). LRS technologies are capable of sequencing single DNA molecules without amplification. The availability of long reads is a great advantage: overlaps between short NGS reads are difficult to detect, which makes it hard to generate a long continuous consensus sequence and lowers the overall quality of the assembly. LRS has been shown to considerably improve the quality of genome assemblies in several studies. [19] [20] Another advantage of LRS over NGS is that it can characterize a variety of epigenetic marks at the same time as the DNA sequence. [21] [22]

The major challenges of LRS are accuracy and cost, although LRS is improving quickly in both areas.


References

  1. "Sequencing library: what is it?". Breda Genetics. 2016-08-12. Retrieved 23 July 2017.
  2. Chaisson, Mark J. (2009). "De novo fragment assembly with short mate-paired reads: Does the read length matter?". Genome Research. 19 (2): 336–346. doi:10.1101/gr.079053.108. PMC 2652199. PMID 19056694. Retrieved 23 July 2017.
  3. Junemann, Sebastian (2013). "Updating benchtop sequencing performance comparison". Nature Biotechnology. 31 (4): 294–296. doi:10.1038/nbt.2522. PMID 23563421.
  4. Quail, Michael A. (2012). "A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers". BMC Genomics. 13 (1): 341. doi:10.1186/1471-2164-13-341. PMC 3431227. PMID 22827831.
  5. Chhangawala, Sagar; Rudy, Gabe; Mason, Christopher E.; Rosenfeld, Jeffrey A. (23 June 2015). "The impact of read length on quantification of differentially expressed genes and splice junction detection". Genome Biology. 16 (1): 131. doi:10.1186/s13059-015-0697-y. PMC 4531809. PMID 26100517.
  6. Chaisson, Mark J.P. (2015). "Genetic variation and the de novo assembly of human genomes". Nature Reviews Genetics. 16 (11): 627–640. doi:10.1038/nrg3933. PMC 4745987. PMID 26442640.
  7. Conesa, Ana; Madrigal, Pedro; Tarazona, Sonia; Gomez-Cabrero, David; Cervera, Alejandra; McPherson, Andrew; Szcześniak, Michał Wojciech; Gaffney, Daniel J.; Elo, Laura L.; Zhang, Xuegong; Mortazavi, Ali (26 January 2016). "A survey of best practices for RNA-seq data analysis". Genome Biology. 17 (1): 13. doi:10.1186/s13059-016-0881-8. PMC 4728800. PMID 26813401.
  8. Giani, Alice Maria; Gallo, Guido Roberto; Gianfranceschi, Luca; Formenti, Giulio (2020). "Long walk to genomics: History and current approaches to genome sequencing and assembly". Computational and Structural Biotechnology Journal. 18: 9–19. doi:10.1016/j.csbj.2019.11.002. PMC 6926122. PMID 31890139.
  9. Qiang-long, Zhu; Shi, Liu; Peng, Gao; Fei-shi, Luan (1 September 2014). "High-throughput Sequencing Technology and Its Application". Journal of Northeast Agricultural University (English Edition). 21 (3): 84–96. doi:10.1016/S1006-8104(14)60073-8.
  10. Chaisson, M.; Pevzner, P.; Tang, H. (1 September 2004). "Fragment assembly with short reads". Bioinformatics. 20 (13): 2067–2074. doi:10.1093/bioinformatics/bth205. PMID 15059830.
  11. Kraft, Florian; Kurth, Ingo (16 July 2019). "Long-read sequencing in human genetics". Medizinische Genetik. 31 (2): 198–204. doi:10.1007/s11825-019-0249-z. S2CID 197402652.
  12. Sung, Wing-Kin (2017). Algorithms for next-generation sequencing. Boca Raton. ISBN 978-1466565500.
  13. Sung, Wing-Kin (2017). Algorithms for next-generation sequencing. Boca Raton. ISBN 978-1466565500.
  14. Adewale, Boluwatife A. (26 November 2020). "Will long-read sequencing technologies replace short-read sequencing technologies in the next 10 years?". African Journal of Laboratory Medicine. 9 (1): 5. doi:10.4102/ajlm.v9i1.1340. PMC 7736650. PMID 33354530.
  15. "DNA Sequencing Costs: Data". Genome.gov.
  16. Mardis, Elaine R (February 2017). "DNA sequencing technologies: 2006–2016". Nature Protocols. 12 (2): 213–218. doi:10.1038/nprot.2016.182. PMID 28055035. S2CID 205466745.
  17. Mardis, Elaine R (February 2017). "DNA sequencing technologies: 2006–2016". Nature Protocols. 12 (2): 213–218. doi:10.1038/nprot.2016.182. PMID 28055035. S2CID 205466745.
  18. Ho, Steve S.; Urban, Alexander E.; Mills, Ryan E. (March 2020). "Structural variation in the sequencing era". Nature Reviews Genetics. 21 (3): 171–189. doi:10.1038/s41576-019-0180-9. PMC 7402362. PMID 31729472.
  19. Rhoads, Anthony; Au, Kin Fai (October 2015). "PacBio Sequencing and Its Applications". Genomics, Proteomics & Bioinformatics. 13 (5): 278–289. doi:10.1016/j.gpb.2015.08.002. PMC 4678779. PMID 26542840.
  20. Wenger, Aaron M.; Peluso, Paul; Rowell, William J.; Chang, Pi-Chuan; Hall, Richard J.; Concepcion, Gregory T.; Ebler, Jana; Fungtammasan, Arkarachai; Kolesnikov, Alexey; Olson, Nathan D.; Töpfer, Armin; Alonge, Michael; Mahmoud, Medhat; Qian, Yufeng; Chin, Chen-Shan; Phillippy, Adam M.; Schatz, Michael C.; Myers, Gene; DePristo, Mark A.; Ruan, Jue; Marschall, Tobias; Sedlazeck, Fritz J.; Zook, Justin M.; Li, Heng; Koren, Sergey; Carroll, Andrew; Rank, David R.; Hunkapiller, Michael W. (October 2019). "Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome". Nature Biotechnology. 37 (10): 1155–1162. doi:10.1038/s41587-019-0217-9. PMC 6776680. PMID 31406327.
  21. Flusberg, Benjamin A; Webster, Dale R; Lee, Jessica H; Travers, Kevin J; Olivares, Eric C; Clark, Tyson A; Korlach, Jonas; Turner, Stephen W (June 2010). "Direct detection of DNA methylation during single-molecule, real-time sequencing". Nature Methods. 7 (6): 461–465. doi:10.1038/nmeth.1459. PMC 2879396. PMID 20453866.
  22. Simpson, Jared T; Workman, Rachael E; Zuzarte, P C; David, Matei; Dursi, L J; Timp, Winston (April 2017). "Detecting DNA cytosine methylation using nanopore sequencing". Nature Methods. 14 (4): 407–410. doi:10.1038/nmeth.4184. PMID 28218898. S2CID 16152628.