DNA read errors

Last updated

In bioinformatics, a DNA read error occurs when a sequence assembler changes one DNA base for a different base. The reads from the sequence assembler can then be used to create a de Bruijn graph, which can be used in various ways to find errors.

Contents

Overview

SEQUENCE 1.png
sequence 1
Simple de Bruijn Graph.png
figure 1
A sample DNA sequence and its respective de Bruijn Graph

In a de Bruijn graph, there is a possibility of 4^k different nodes to make arrangements of a genome. The number of nodes used to create the graph can be reduced in number by considering only the k-mers found within the DNA strand of interest. Given sequence 1, it is possible to determine the nodes of size 7, or 7-mers, that will be in the graph. These 7-mers then create the graph shown in figure 1. [1]

The graph shown in figure 1 is a very simple version of what a graph could look like. [2] This graph is formed by taking the last 6 elements of the 7-mer and linking it to the node whose first 6 elements are the same. Figure 1 is the most simplistic a de Bruijn graph can be, since each node has exactly one path into it and one path out. Most of the time, graphs will have more than one edge directed to a node and/or more than one edge leaving a node. This happens due to the way nodes are connected. The nodes are connected by edges pointing to nodes if the last k-1 elements of the k-mer match the first k-1 elements of any node. This allows for a multiple-edged de Bruijn graph to form. These more complicated graphs happen due to either read errors or variations in DNA strands. Both causes make it difficult to determine the correct structure of the DNA, and what is causing the differences. Since most DNA strands will likely include read errors and variations, scientists hope to use an assembly process that can merge nodes of the graph when they are unambiguously connected after the graph has been cleaned of vertices and edges created by the errors. [3]

Tips and bubbles

When a graph is formed from sequenced data, the read errors form tips and bubbles. A tip is where an error occurred during the sequencing process and has caused the graph to end prematurely and includes both correct and incorrect k-mers. A bubble is also formed when an error occurs during the sequence reading process; however, wherever the error happens, there is a path for the k-mer reads to reconnect with the main graph and continue as though nothing had ever happened. When there are tips and bubbles present in a de Bruijn graph formed from the data, they may be removed only if an error is what caused the tip or bubble to appear. When scientists are using a reference genome, they can quickly and easily tell where tips are located by comparing the graph of the reference genome and the graph of the sequence. If there is not a reference genome, tips are eliminated by tracing the branches backward until a point of ambiguity is found. Tips are then removed only if the branch containing the tip is shorter than a set threshold length. [3] The process of removing bubbles is slightly more complicated. The first thing that needs done is to identify the beginning of the bubble. From there, each path from the beginning of the bubble is followed until the point of reconnection. The point of reconnection can be different for each path. Since there can be paths of various lengths from the beginning node, the path which has a lower coverage is removed. [3]

Example

sequence 2 Sequence.png
sequence 2

Given a sequence of any length, the first step that needs done is to enter the sequence into a sequencing program, have it sequenced, and a return base pair (bp) reads of a certain length. Since there is not a sequencing program that is completely accurate, there will always be some reads which contain errors. The most common sequencing method is the shotgun method, which is the method most probably used on sequence 2. Once a method is decided on, you have to specify the length of the bp reads you would like it to return. In the case of sequence 2, it returned 7-bp reads with all errors made during the process noted in red. [4]

Once the reads are obtained, they are hashed into k-mers. The k-mers then are recorded in a table with how many times each k-mer appeared in the reads. For this example, each read was hashed into 4-mers and if there was an error it was recorded in red. All of the 4-mers were then recorded, with their frequency in the following table.

ACAG (5X)ACGCAGAAAGAC (9X)AGAG (9X)AGAT (8X)
AGGC (16X)AGTC (7X)ATCC (7X)ATGA (8X)CCGA (7X)CGAC
CGAG (8X)CGAT (6X)CTAG (2X)CTCTCTTT (8X)GACA (8X)
GACGGAGA (12X)GAGG (16X)GATG (5X)GATC (8X)GATT
GCTC (2X)GCTT (8X)GGCT (11X)GTCG (9X)TAGA (16X)TAGT (3X)
TCCG (7X)TCGA (10X)TCTA (2X)TGAG (9X)TTAG (12X)TTTA (8X)

Each individual cell of the table will then form a node, allowing a de Bruijn graph to be formed from the given k-mers. In figure 2, linear stretches are identified and then another graph, figure 3, is formed where the linear stretches have become a single node, of a different k-mer size, allowing for a more concise graph. In this simplified graph, it is easy to identify various tips and bubbles, as shown in figure 4. These bubbles and tips can then be removed, since we can identify that they were formed from errors in the bp reads, giving us a graph structure that should accurately and completely reflect the original sequence. [4] If you follow the de Bruijn graph shown in figure 5, you will see that the sequence formed does indeed match the DNA sequence given in sequence 2.

figure 2
The de Bruijn graph with linear stretches identified Sample DNA flow chart.png
figure 2
The de Bruijn graph with linear stretches identified
figure 3
Simplified de Bruijn graph Simple DNA flow chart - New Page.png
figure 3
Simplified de Bruijn graph
figure 4
The de Bruijn graph with tips and bubbles identified Simplified Sample DNA flow chart.png
figure 4
The de Bruijn graph with tips and bubbles identified
figure 5
Final de Bruijn graph from DNA strand Final Sample DNA flow chart.png
figure 5
Final de Bruijn graph from DNA strand

Comparing two DNA strands

When comparing two strands of DNA, colored de Bruijn graphs are frequently used to identify errors. These errors, often polymorphisms, cause bubbles, similar to the ones mentioned above, to form. Currently there are four main algorithms used to generalize the data and locate bubbles. The four algorithms extend de Bruijn graphs by allowing the nodes and edges in the graph to be colored by the samples from which they were observed [5]

Bubble calling

The simplest use of a colored de Bruijn graph is known as the bubble calling algorithm. This algorithm looks, and locates, bubbles on the genome that differ from the original. These bubbles must be “clean”, or simply a divergence from the reference genome, but cannot be caused by deletions of DNA bases. This algorithm can have high false positive rates since there is a difficulty of separating repeat- and variant-induced bubbles; however, there is often a reference genome to help improve reliability. The reference genome also helps in the detection of variants and is essential to detect variant sites. [5] Recently, scientists have discovered a way to use the bubble calling algorithm with copy number variation detection to allow for an opportunity of unbiased detection of these variations in the future [6] [7]

Path divergence

When looking at complex variants, there is a very low chance that they will make a clean contig. Since this is the case most often, the path divergence algorithm is useful, especially when considering where deletions occur and the variant is so complex it is constrained to the reference allele. When there is a bubble formed, the path divergence algorithm is used the most frequently and allows detected bubbles to get deleted in a very systematic procedure. The algorithm first locates each point of divergence. Then from each point of divergence, the strands that form the bubble are traced to find where the two paths join after n nodes. If the two paths join, then the path with a lower coverage is removed and stored in a file. [3] [8]

Multiple sample analysis

Using multiple samples substantially improves the power and false discovery rate of detecting variants. In the simplest cases, the samples are combined into a group of a single color and the data is analysed as described previously. However, by maintaining separate colors for each sample set, additional information on how the bubbles were formed, whether by error or by repeats, presents itself. [5] In 1997, the Department of Technology at Genzyme Genetics in Framingham, Massachusetts developed a new approach that provided a breakthrough in dealing with bubbles using the multiplex allele-specific diagnostic assay (MASDA). This program combines forward dot-blot, complex simultaneous probe hybridization and direct mutation detection to help solve the dual problem of multiple sample analysis. [9]

Genotyping

The colored de Bruijn graphs can be used to genotype any DNA sample at a known loci, even when the coverage is less than sufficient for variant assembly. [5] The first step to this process is to construct a graph of the reference allele, known variants and data from the sample. The algorithm then calculates the likelihood of each genotype and accounts for the structure of the graph, both of the local and genome-wide sequence. This then generalizes to multiple allelic types and helps genotype complex and compound variants. [5] This algorithm is used frequently, as there are no bubbles formed to deal with. This also directly helps find the more complicated issues in genes more direct than any of the three algorithms previously mentioned.[ citation needed ]

Related Research Articles

<span class="mw-page-title-main">Bioinformatics</span> Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The process of analyzing and interpreting data can sometimes be referred to as computational biology, however this distinction between the two terms is often disputed. To some, the term computational biology refers to building and using models of biological systems.

<span class="mw-page-title-main">Shotgun sequencing</span> Method used for sequencing random DNA strands

In genetics, shotgun sequencing is a method used for sequencing random DNA strands. It is named by analogy with the rapidly expanding, quasi-random shot grouping of a shotgun.

<span class="mw-page-title-main">Computational biology</span> Branch of biology

Computational biology refers to the use of techniques in computer science, data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and data science, the field also has foundations in applied mathematics, molecular biology, cell biology, chemistry, and genetics.

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. It can be performed on the entire genome, transcriptome or proteome of an organism, and can also involve only selected segments or regions, like tandem repeats and transposable elements. Methodologies used include sequence alignment, searches against biological databases, and others.

In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology might not be able to 'read' whole genomes in one go, but rather reads small pieces of between 20 and 30,000 bases, depending on the technology used. Typically, the short fragments (reads) result from shotgun sequencing genomic DNA, or gene transcript (ESTs).

In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has mn vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence. For a set of m symbols S = {s1, …, sm}, the set of vertices is:

<i>k</i>-mer Substrings of length k contained in a biological sequence

In bioinformatics, k-mers are substrings of length contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which k-mers are composed of nucleotides, k-mers are capitalized upon to assemble DNA sequences, improve heterologous gene expression, identify species in metagenomic samples, and create attenuated vaccines. Usually, the term k-mer refers to all of a sequence's subsequences of length , such that the sequence AGAT would have four monomers, three 2-mers, two 3-mers and one 4-mer (AGAT). More generally, a sequence of length will have k-mers and total possible k-mers, where is number of possible monomers.

Velvet is an algorithm package that has been designed to deal with de novo genome assembly and short read sequencing alignments. This is achieved through the manipulation of de Bruijn graphs for genomic sequence assembly via the removal of errors and the simplification of repeated regions. Velvet has also been implemented in commercial packages, such as Sequencher, Geneious, MacVector and BioNumerics.

<span class="mw-page-title-main">RNA-Seq</span> Lab technique in cellular biology

RNA-Seq is a technique that uses next-generation sequencing to reveal the presence and quantity of RNA molecules in a biological sample, providing a snapshot of gene expression in the sample, also known as transcriptome.

<span class="mw-page-title-main">Hybrid genome assembly</span>

In bioinformatics, hybrid genome assembly refers to utilizing various sequencing technologies to achieve the task of assembling a genome from fragmented, sequenced DNA resulting from shotgun sequencing. Genome assembly presents one of the most challenging tasks in genome sequencing as most modern DNA sequencing technologies can only produce reads that are, on average, 25–300 base pairs in length. This is orders of magnitude smaller than the average size of a genome. This assembly is computationally difficult and has some inherent challenges, one of these challenges being that genomes often contain complex tandem repeats of sequences that can be thousands of base pairs in length. These repeats can be long enough that second generation sequencing reads are not long enough to bridge the repeat, and, as such, determining the location of each repeat in the genome can be difficult. Resolving these tandem repeats can be accomplished by utilizing long third generation sequencing reads, such as those obtained using the PacBio RS DNA sequencer. These sequences are, on average, 10,000–15,000 base pairs in length and are long enough to span most repeated regions. Using a hybrid approach to this process can increase the fidelity of assembling tandem repeats by being able to accurately place them along a linear scaffold and make the process more computationally efficient.

De novo transcriptome assembly is the de novo sequence assembly method of creating a transcriptome without the aid of a reference genome.

In metagenomics, binning is the process of grouping reads or contigs and assigning them to individual genome. Binning methods can be based on either compositional features or alignment (similarity), or both.

In DNA sequencing, a read is an inferred sequence of base pairs corresponding to all or part of a single DNA fragment. A typical sequencing experiment involves fragmentation of the genome into millions of molecules, which are size-selected and ligated to adapters. The set of fragments is referred to as a sequencing library, which is sequenced to produce a set of reads.

In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches.

SPAdes is a genome assembly algorithm which was designed for single cell and multi-cells bacterial data sets. Therefore, it might not be suitable for large genomes projects.

SNV calling from NGS data is any of a range of methods for identifying the existence of single nucleotide variants (SNVs) from the results of next generation sequencing (NGS) experiments. These are computational techniques, and are in contrast to special experimental methods based on known population-wide single nucleotide polymorphisms. Due to the increasing abundance of NGS data, these techniques are becoming increasingly popular for performing SNP genotyping, with a wide variety of algorithms designed for specific experimental designs and applications. In addition to the usual application domain of SNP genotyping, these techniques have been successfully adapted to identify rare SNPs within a population, as well as detecting somatic SNVs within an individual using multiple tissue samples.

De novo sequence assemblers are a type of program that assembles short nucleotide sequences into longer ones without the use of a reference genome. These are most commonly used in bioinformatic studies to assemble genomes or transcriptomes. Two common types of de novo assemblers are greedy algorithm assemblers and De Bruijn graph assemblers.

Bloom filters are space-efficient probabilistic data structures used to test whether an element is a part of a set. Bloom filters require much less space than other data structures for representing sets, however the downside of Bloom filters is that there is a false positive rate when querying the data structure. Since multiple elements may have the same hash values for a number of hash functions, then there is a probability that querying for a non-existent element may return a positive if another element with the same hash values has been added to the Bloom filter. Assuming that the hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of querying a Bloom filter is a function of the number of bits, number of hash functions and number of elements of the Bloom filter. This allows the user to manage the risk of a getting a false positive by compromising on the space benefits of the Bloom filter.

In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence and a method of approximate string matching that allows for substitutions. They are a straightforward modification to the earliest heuristic-based alignment efforts that allow for minor differences between the sequences of interest. Spaced seeds have been used in homology search., alignment, assembly, and metagenomics. They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant and dashes or asterisks for irrelevant positions.

References

  1. Brown, Alexander J.; Al-Soodani, Aneesa T.; Saul, Miles; Her, Stephanie; Garcia, Juan C.; Ramsden, Dale A.; Her, Chengtao; Roberts, Steven A. (2018). "High-Throughput Analysis of DNA Break-Induced Chromosome Rearrangements by Amplicon Sequencing". Mechanisms of DNA Recombination and Genome Rearrangements: Intersection between Homologous Recombination, DNA Replication and DNA Repair. Methods in Enzymology. Vol. 601. pp. 111–144. doi:10.1016/bs.mie.2017.11.028. ISBN   978-0-12-813979-0. PMC   9703968 . PMID   29523230.
  2. De Bruijn Graph of a small sequence. (2011). Retrieved Feb 7, 2015, from Homolog.us — Bioinformatics: http://www.homolog.us/Tutorials/index.php?p=2.1&s=1 Archived 2014-10-30 at the Wayback Machine
  3. 1 2 3 4 Simpson, Jared T.; Wong, Kim; Jackman, Shaun D.; Schein, Jacqueline E.; Jones, Steven J.M.; Birol, İnanç (June 2009). "ABySS: A parallel assembler for short read sequence data". Genome Research. 19 (6): 1117–1123. doi:10.1101/gr.089532.108. PMC   2694472 . PMID   19251739.
  4. 1 2 Flicek, Paul; Birney, Ewan (November 2009). "Sense from sequence reads: methods for alignment and assembly". Nature Methods. 6 (S11): S6 –S12. doi:10.1038/nmeth.1376. PMID   19844229.
  5. 1 2 3 4 5 Iqbal, Zamin; Caccamo, Mario; Turner, Isaac; Flicek, Paul; McVean, Gil (February 2012). "De novo assembly and genotyping of variants using colored de Bruijn graphs". Nature Genetics. 44 (2): 226–232. doi:10.1038/ng.1028. PMC   3272472 . PMID   22231483.
  6. Nijkamp, Jurgen F.; van den Broek, Marcel A.; Geertman, Jan-Maarten A.; Reinders, Marcel J. T.; Daran, Jean-Marc G.; de Ridder, Dick (December 2012). "De novo detection of copy number variation by co-assembly". Bioinformatics. 28 (24): 3195–3202. doi:10.1093/bioinformatics/bts601. PMID   23047563.
  7. Mesner, Larry D.; Valsakumar, Veena; Cieślik, Marcin; Pickin, Rebecca; Hamlin, Joyce L.; Bekiranov, Stefan (November 2013). "Bubble-seq analysis of the human genome reveals distinct chromatin-mediated mechanisms for regulating early- and late-firing origins". Genome Research. 23 (11): 1774–1788. doi:10.1101/gr.155218.113. PMC   3814878 . PMID   23861383.
  8. "Path Divergence - Project Management Knowledge" . Retrieved 2020-10-09.
  9. Shuber, A. P.; Michalowsky, L. A.; Scott Nass, G.; Skoletsky, J.; Hire, L. M.; Kotsopoulos, S. K.; Phipps, M. F.; Barberio, D. M.; Klinger, K. W. (March 1997). "High Throughput Parallel Analysis of Hundreds of Patient Samples for More Than 100 Mutations in Multiple Disease Genes". Human Molecular Genetics. 6 (3): 337–347. doi:10.1093/hmg/6.3.337. PMID   9147636.