Phred (software)

Phred is a computer program for base calling, that is, identifying a nucleobase sequence from the fluorescence "trace" data generated by an automated DNA sequencer that uses electrophoresis and the four-fluorescent-dye method. [1] [2] When originally developed, Phred produced significantly fewer errors in the data sets examined than other methods, averaging 40–50% fewer errors. Phred quality scores have become widely accepted as a way to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods.

Background

Fluorescent-dye DNA sequencing is a molecular biology technique that involves labeling single-stranded DNA fragments of varied length with four fluorescent dyes (corresponding to the four bases used in DNA) and subsequently separating the fragments by slab-gel or capillary electrophoresis (see DNA Sequencing). The electrophoresis run is monitored by a CCD on the DNA sequencer, which produces time-series "trace" data (or a "chromatogram") of the fluorescent "peaks" that pass the CCD point. Examining the fluorescence peaks in the trace data makes it possible to determine the order of individual bases (nucleobases) in the DNA. However, because the intensity, shape, and location of a fluorescence peak are not always consistent or unambiguous, determining (or "calling") the correct bases accurately can be difficult or time-consuming when done manually.

Automated DNA sequencing techniques have revolutionized the field of molecular biology by generating vast amounts of DNA sequence data. However, trace data is produced at a significantly higher rate than it can be manually interpreted to yield sequence data, which creates a bottleneck. Removing the bottleneck requires both automated software that speeds up processing with improved accuracy and a reliable measure of that accuracy. Many software programs have been developed to meet this need. One such program is Phred.

History

Phred was originally conceived in the early 1990s by Phil Green, then a professor at Washington University in St. Louis. LaDeana Hillier, Michael Wendl, David Ficenec, Tim Gleeson, Alan Blanchard, and Richard Mott also contributed to the codebase and algorithm. Green moved to the University of Washington in the mid-1990s, after which development was primarily managed by Green and Brent Ewing. Phred played a notable role in the Human Genome Project, where large amounts of sequence data were processed by automated scripts. At the time, it was the base-calling program most widely used by both academic and commercial DNA sequencing laboratories because of its high base-calling accuracy. [3] Phred is distributed commercially by CodonCode Corporation, and is used to perform the "Call bases" function in the program CodonCode Aligner. It is also used by the MacVector plugin Assembler.

Methods

Phred uses a four-phase procedure, as outlined by Ewing et al., to determine a sequence of base calls from the processed DNA sequence trace:

  1. Predicted peak locations are determined, based on the assumption that fragments are relatively evenly spaced, on average, in most regions of the gel. This establishes the correct number of bases and their idealized, evenly spaced locations in regions where the peaks are not well resolved, noisy, or displaced (as in compressions).
  2. Observed peaks are identified in the trace.
  3. Observed peaks are matched to the predicted peak locations, omitting some peaks and splitting others. Because each observed peak comes from a specific array and is thus associated with one of the four bases (A, G, T, or C), the ordered list of matched observed peaks determines a base sequence for the trace.
  4. The unmatched observed peaks are checked for any peak that appears to represent a base but could not be assigned to a predicted peak in the third phase. If such a peak is found, the corresponding base is inserted into the read sequence.
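
The matching idea in phases 1, 3, and 4 can be illustrated with a toy sketch. This is not Phred's actual algorithm: the `Peak` type, the fixed `spacing`, the `tolerance` window, and the use of "N" for a predicted location with no nearby peak are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    position: float  # location of the observed peak in the trace
    base: str        # base associated with the peak's array: A, C, G or T


def predict_locations(start, spacing, count):
    """Phase 1 (toy version): idealized, evenly spaced peak locations."""
    return [start + i * spacing for i in range(count)]


def match_peaks(predicted, observed, tolerance):
    """Phase 3 (toy version): give each predicted location the nearest
    unused observed peak within `tolerance`.

    Returns the called sequence and the peaks left unmatched, which a
    phase-4 style check could then examine for missed bases.
    """
    unused = sorted(observed, key=lambda p: p.position)
    calls = []
    for loc in predicted:
        candidates = [p for p in unused if abs(p.position - loc) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda p: abs(p.position - loc))
            unused.remove(best)
            calls.append(best.base)
        else:
            calls.append("N")  # hypothetical placeholder: no peak nearby
    return "".join(calls), unused


# Made-up example data: four predicted locations, four observed peaks.
observed = [Peak(10.5, "A"), Peak(19.0, "C"), Peak(41.0, "T"), Peak(44.0, "G")]
sequence, leftover = match_peaks(predict_locations(10, 10, 4), observed, tolerance=3)
```

In this made-up example the third predicted location has no observed peak within the tolerance window, and the peak at position 44.0 is left over for a later check, mirroring in miniature the roles of the third and fourth phases.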

The entire procedure is rapid, usually taking less than half a second per trace. The results can be output as a PHD file, which contains base data as triples consisting of the base call, quality, and position. [4]
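
As a minimal sketch of reading those triples, the following assumes that the base calls in a PHD file appear one per line, as whitespace-separated base, quality, and position values, between `BEGIN_DNA` and `END_DNA` marker lines. The function name `parse_phd_dna` and the sample text are hypothetical; consult the PHRED documentation [4] for the authoritative file layout.

```python
from typing import List, NamedTuple


class BaseCall(NamedTuple):
    base: str      # called base
    quality: int   # Phred quality score for the call
    position: int  # peak position in the trace


def parse_phd_dna(text: str) -> List[BaseCall]:
    """Collect (base, quality, position) triples from the DNA block.

    Assumes the block is delimited by BEGIN_DNA / END_DNA lines.
    """
    calls = []
    in_dna = False
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            base, quality, position = line.split()[:3]
            calls.append(BaseCall(base, int(quality), int(position)))
    return calls


# Made-up sample text for illustration only.
SAMPLE = """BEGIN_SEQUENCE example_read
BEGIN_DNA
t 9 6
c 12 18
a 35 31
END_DNA
END_SEQUENCE
"""

calls = parse_phd_dna(SAMPLE)
```

Keeping each call as a named triple makes it straightforward to filter or trim a read by quality afterwards, for example `[c for c in calls if c.quality >= 20]`.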

Applications

Phred is often used together with Phrap, a program for DNA sequence assembly. Phrap was routinely used in some of the largest sequencing efforts of the Human Genome Project and is currently one of the most widely used DNA sequence assembly programs in the biotech industry. Phrap uses Phred quality scores to determine highly accurate consensus sequences and to estimate the quality of those consensus sequences. Phrap also uses Phred quality scores to estimate whether discrepancies between two overlapping sequences are more likely to arise from random errors or from different copies of a repeated sequence.

Related Research Articles

In genetics and biochemistry, sequencing means to determine the primary structure of an unbranched biopolymer. Sequencing results in a symbolic linear depiction known as a sequence which succinctly summarizes much of the atomic-level structure of the sequenced molecule.

DNA sequencer

A DNA sequencer is a scientific instrument used to automate the DNA sequencing process. Given a sample of DNA, a DNA sequencer is used to determine the order of the four bases: G (guanine), C (cytosine), A (adenine) and T (thymine). This is then reported as a text string, called a read. Some DNA sequencers can also be considered optical instruments, as they analyze light signals originating from fluorochromes attached to nucleotides.

DNA sequencing Process of determining the order of nucleotides in DNA molecules

DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.

Sanger sequencing Method of DNA sequencing developed in 1977

Sanger sequencing is a method of DNA sequencing based on the selective incorporation of chain-terminating dideoxynucleotides by DNA polymerase during in vitro DNA replication. After first being developed by Frederick Sanger and colleagues in 1977, it became the most widely used sequencing method for approximately 40 years. It was first commercialized by Applied Biosystems in 1986. More recently, higher volume Sanger sequencing has been replaced by "Next-Gen" sequencing methods, especially for large-scale, automated genome analyses. However, the Sanger method remains in wide use, for smaller-scale projects, and for validation of Next-Gen results. It still has the advantage over short-read sequencing technologies in that it can produce DNA sequence reads of > 500 nucleotides.

Genetic analysis

Genetic analysis is the overall process of studying and researching in fields of science that involve genetics and molecular biology. There are a number of applications that are developed from this research, and these are also considered parts of the process. The base system of analysis revolves around general genetics. Basic studies include identification of genes and inherited disorders. This research has been conducted for centuries on both a large-scale physical observation basis and on a more microscopic scale. Genetic analysis can be used generally to describe methods both used in and resulting from the sciences of genetics and molecular biology, or to applications resulting from this research.

SNP genotyping is the measurement of genetic variations of single nucleotide polymorphisms (SNPs) between members of a species. It is a form of genotyping, which is the measurement of more general genetic variation. SNPs are one of the most common types of genetic variation. A SNP is a single base pair mutation at a specific locus, usually consisting of two alleles. SNPs are found to be involved in the etiology of many human diseases and are becoming of particular interest in pharmacogenetics. Because SNPs are conserved during evolution, they have been proposed as markers for use in quantitative trait loci (QTL) analysis and in association studies in place of microsatellites. The use of SNPs is being extended in the HapMap project, which aims to provide the minimal set of SNPs needed to genotype the human genome. SNPs can also provide a genetic fingerprint for use in identity testing. The increase of interest in SNPs has been reflected by the furious development of a diverse range of SNP genotyping methods.

Phred quality score

A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing. It was originally developed for the computer program Phred to help in the automation of DNA sequencing in the Human Genome Project. Phred quality scores are assigned to each nucleotide base call in automated sequencer traces. The FASTQ format encodes phred scores as ASCII characters alongside the read sequences. Phred quality scores have become widely accepted to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods. Perhaps the most important use of Phred quality scores is the automatic determination of accurate, quality-based consensus sequences.
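
A Phred quality score Q is conventionally related to the estimated probability P that a base call is wrong by Q = −10 log10(P), so Q = 20 corresponds to a 1-in-100 chance of error. The sketch below shows that conversion and the decoding of a FASTQ quality string. The function names are hypothetical, and the ASCII offset of 33 is an assumption that holds for the common "Phred+33" FASTQ variant but not for every variant.

```python
import math


def error_probability(q: float) -> float:
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)


def phred_quality(p: float) -> float:
    """Phred quality score for an estimated error probability p."""
    return -10 * math.log10(p)


def decode_fastq_qualities(quality_string: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 is assumed (Phred+33 encoding).
    """
    return [ord(ch) - offset for ch in quality_string]


scores = decode_fastq_qualities("!5I")  # made-up quality string
```

Under the assumed offset, the characters `!`, `5`, and `I` decode to scores of 0, 20, and 40.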

2 base encoding

2 Base Encoding, also called SOLiD, is a next-generation sequencing technology developed by Applied Biosystems and has been commercially available since 2008. These technologies generate hundreds of thousands of small sequence reads at one time. Well-known examples of such DNA sequencing methods include 454 pyrosequencing, the Solexa system and the SOLiD system. These methods have reduced the cost from $0.01/base in 2004 to nearly $0.0001/base in 2006 and increased the sequencing capacity from 1,000,000 bases/machine/day in 2004 to more than 100,000,000 bases/machine/day in 2006.

Single-molecule real-time (SMRT) sequencing is a parallelized single molecule DNA sequencing method. Single-molecule real-time sequencing utilizes a zero-mode waveguide (ZMW). A single DNA polymerase enzyme is affixed at the bottom of a ZMW with a single molecule of DNA as a template. The ZMW is a structure that creates an illuminated observation volume that is small enough to observe only a single nucleotide of DNA being incorporated by DNA polymerase. Each of the four DNA bases is attached to one of four different fluorescent dyes. When a nucleotide is incorporated by the DNA polymerase, the fluorescent tag is cleaved off and diffuses out of the observation area of the ZMW where its fluorescence is no longer observable. A detector detects the fluorescent signal of the nucleotide incorporation, and the base call is made according to the corresponding fluorescence of the dye.

Consed is a program for viewing, editing, and finishing DNA sequence assemblies. Originally developed for sequence assemblies created with phrap, recent versions also support other sequence assembly programs like Newbler.

Phrap is a widely used program for DNA sequence assembly. It is part of the Phred-Phrap-Consed package.

The Staden Package is computer software, a set of tools for DNA sequence assembly, editing, and sequence analysis. It is open-source software, released under a BSD 3-clause license.

Hybrid genome assembly

In bioinformatics, hybrid genome assembly refers to utilizing various sequencing technologies to achieve the task of assembling a genome from fragmented, sequenced DNA resulting from shotgun sequencing. Genome assembly presents one of the most challenging tasks in genome sequencing as most modern DNA sequencing technologies can only produce reads that are, on average, 25-300 base pairs in length. This is orders of magnitude smaller than the average size of a genome. This assembly is computationally difficult and has some inherent challenges, one of these challenges being that genomes often contain complex tandem repeats of sequences that can be thousands of base pairs in length. These repeats can be long enough that second generation sequencing reads are not long enough to bridge the repeat, and, as such, determining the location of each repeat in the genome can be difficult. Resolving these tandem repeats can be accomplished by utilizing long third generation sequencing reads, such as those obtained using the PacBio RS DNA sequencer. These sequences are, on average, 10,000-15,000 base pairs in length and are long enough to span most repeated regions. Using a hybrid approach to this process can increase the fidelity of assembling tandem repeats by being able to accurately place them along a linear scaffold and make the process more computationally efficient.

Massive parallel sequencing or massively parallel sequencing is any of several high-throughput approaches to DNA sequencing using the concept of massively parallel processing; it is also called next-generation sequencing (NGS) or second-generation sequencing. Some of these technologies emerged in 1994-1998 and have been commercially available since 2005. These technologies use miniaturized and parallelized platforms for sequencing of 1 million to 43 billion short reads per instrument run.

Illumina dye sequencing

Illumina dye sequencing is a technique used to determine the series of base pairs in DNA, also known as DNA sequencing. The reversible terminated chemistry concept was invented by Bruno Canard and Simon Sarfati at the Pasteur Institute in Paris. It was developed by Shankar Balasubramanian and David Klenerman of Cambridge University, who subsequently founded Solexa, a company later acquired by Illumina. This sequencing method is based on reversible dye-terminators that enable the identification of single nucleotides as they are washed over DNA strands. It can also be used for whole-genome and region sequencing, transcriptome analysis, metagenomics, small RNA discovery, methylation profiling, and genome-wide protein-nucleic acid interaction analysis.

Philip Palmer Green is a theoretical and computational biologist noted for developing important algorithms and procedures used in gene mapping and DNA sequencing. He earned his doctorate in mathematics from Berkeley in 1976 with a dissertation on C*-algebra under the direction of Marc Rieffel, but transitioned from pure mathematics into applied work in biology and bioinformatics. Green has obtained numerous important results, including the development of Phred, a widely used DNA trace analyzer, as well as results in mapping techniques and genetic analysis. Green was elected to the National Academy of Sciences in 2001 and won the Gairdner Award in 2002.

LaDeana Hillier is a biomedical engineer and computational biologist. She was one of the earliest scientists involved in the Human Genome Project and is noted for her work in various branches of DNA sequencing, as well as for having co-developed Phred, a widely used DNA trace analyzer.

Michael Christopher Wendl is a mathematician and biomedical engineer who has worked on DNA sequencing theory, covering and matching problems in probability, and theoretical fluid mechanics, and who co-wrote Phred. He was a scientist on the Human Genome Project and has done bioinformatics and biostatistics work in cancer. Wendl is of ethnic German heritage and is the son of the aerospace engineer Michael J. Wendl.

Base calling is the process of assigning nucleobases to chromatogram peaks or electrical current changes resulting from nucleotides passing through a nanopore. One computer program for accomplishing this job is Phred, which is a widely used base calling software program by both academic and commercial DNA sequencing laboratories because of its high base calling accuracy.

SNV calling from NGS data is any of a range of methods for identifying the existence of single nucleotide variants (SNVs) from the results of next generation sequencing (NGS) experiments. These are computational techniques, and are in contrast to special experimental methods based on known population-wide single nucleotide polymorphisms. Due to the increasing abundance of NGS data, these techniques are becoming increasingly popular for performing SNP genotyping, with a wide variety of algorithms designed for specific experimental designs and applications. In addition to the usual application domain of SNP genotyping, these techniques have been successfully adapted to identify rare SNPs within a population, as well as detecting somatic SNVs within an individual using multiple tissue samples.

References

  1. Ewing, B.; Hillier, L.; Wendl, M. C.; Green, P. (1998). "Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment". Genome Research. 8 (3): 175–185. PMID 9521921.
  2. Ewing, Brent; Green, Phil (1998-03-01). "Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities". Genome Research. Cold Spring Harbor Laboratory. 8 (3): 186–194. doi: 10.1101/gr.8.3.186. ISSN 1088-9051. PMID 9521922.
  3. Richterich, P. (1998). "Estimation of Errors in 'Raw' DNA Sequences: A Validation Study". Genome Research. 8 (3): 251–259. PMID 9521928.
  4. Green, Phil; Ewing, Brent. "PHRED Documentation". Laboratory of Phil Green. University of Washington. Retrieved 30 September 2021.