Phred quality score

Phred quality scores shown on a DNA sequence trace

A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing. [1] [2] It was originally developed for the computer program Phred to help in the automation of DNA sequencing in the Human Genome Project. Phred quality scores are assigned to each nucleotide base call in automated sequencer traces. [1] [2] The FASTQ format encodes Phred scores as ASCII characters alongside the read sequences. Phred quality scores have become widely accepted as a way to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods. Perhaps the most important use of Phred quality scores is the automatic determination of accurate, quality-based consensus sequences.

Definition

Phred quality scores $Q$ are logarithmically related to the base-calling error probabilities $P$ and defined as [2]

$$Q = -10 \log_{10} P.$$

This relation can also be written as

$$P = 10^{-Q/10}.$$

For example, if Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1000.
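
The logarithmic relation Q = -10 log10(P), and its inverse P = 10^(-Q/10), can be checked numerically. The following is a minimal sketch; the function names are illustrative and are not part of Phred or any particular library.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Return the error probability P = 10^(-Q/10) for a Phred score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Return Q = -10 * log10(P) for an error probability 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print error probability and base call accuracy for a few scores.
    for q in (10, 20, 30, 40):
        p = phred_to_error_prob(q)
        print(f"Q={q}: P={p:.4g}, accuracy={100 * (1 - p):.4g}%")
```

Each increase of 10 in the quality score lowers the error probability by a factor of ten, which is the pattern the example with a score of 30 illustrates.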

Phred quality scores are logarithmically linked to error probabilities
Phred quality score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90%
20                    1 in 100                             99%
30                    1 in 1000                            99.9%
40                    1 in 10,000                          99.99%
50                    1 in 100,000                         99.999%
60                    1 in 1,000,000                       99.9999%

The Phred quality score is thus the negative of the error probability expressed in decibels (dB) relative to a reference level of 1.

History

The idea of sequence quality scores can be traced back to the original description of the SCF file format by Staden's group in 1992. [3] In 1995, Bonfield and Staden proposed a method to use base-specific quality scores to improve the accuracy of consensus sequences in DNA sequencing projects. [4]

However, early attempts to develop base-specific quality scores [5] [6] had only limited success.

The first program to develop accurate and powerful base-specific quality scores was the program Phred. Phred was able to calculate highly accurate quality scores that were logarithmically linked to the error probabilities. Phred was quickly adopted by all the major genome sequencing centers as well as many other laboratories; the vast majority of the DNA sequences produced during the Human Genome Project were processed with Phred.

After Phred quality scores became the required standard in DNA sequencing, other manufacturers of DNA sequencing instruments, including Li-Cor and ABI, developed similar quality scoring metrics for their base calling software. [7]

Methods

Phred's approach to base calling and calculating quality scores was outlined by Ewing et al. To determine quality scores, Phred first calculates several parameters related to peak shape and peak resolution at each base. Phred then uses these parameters to look up a corresponding quality score in large lookup tables. These lookup tables were generated from sequence traces where the correct sequence was known, and are hard-coded in Phred; different lookup tables are used for different sequencing chemistries and machines. An evaluation of the accuracy of Phred quality scores for a number of variations in sequencing chemistry and instrumentation showed that Phred quality scores are highly accurate. [8]

Phred was originally developed for "slab gel" sequencing machines like the ABI373. When originally developed, Phred had a lower base calling error rate than the manufacturer's base calling software, which also did not provide quality scores. However, Phred was only partially adapted to the capillary DNA sequencers that became popular later. In contrast, instrument manufacturers like ABI continued to adapt their base calling software to changes in sequencing chemistry, and have included the ability to create Phred-like quality scores. Therefore, the need to use Phred for base calling of DNA sequencing traces has diminished, and using the manufacturer's current software versions can often give more accurate results.

Applications

Phred quality scores are used for assessment of sequence quality, recognition and removal of low-quality sequence (end clipping), and determination of accurate consensus sequences.
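
End clipping can be illustrated with a small sketch. The approach below keeps the contiguous segment of a read that maximizes the sum of (quality - cutoff), a generic maximum-scoring-segment heuristic. It is an assumption made for illustration and is not taken from Phred, Phrap, or any program named in this article; the function name and the cutoff of 20 are likewise hypothetical.

```python
from typing import List, Tuple


def clip_low_quality_ends(quals: List[int], cutoff: int = 20) -> Tuple[int, int]:
    """Return (start, end) of the half-open slice maximizing sum(q - cutoff).

    Returns (0, 0) when no base has a quality above the cutoff.
    """
    best_sum = 0
    best = (0, 0)
    run_sum = 0
    run_start = 0
    for i, q in enumerate(quals):
        run_sum += q - cutoff
        if run_sum <= 0:
            # The running segment no longer pays for itself; restart after this base.
            run_sum = 0
            run_start = i + 1
        elif run_sum > best_sum:
            best_sum = run_sum
            best = (run_start, i + 1)
    return best


if __name__ == "__main__":
    quals = [5, 8, 12, 30, 35, 40, 38, 15, 36, 9, 4]
    start, end = clip_low_quality_ends(quals)
    print(start, end, quals[start:end])
```

Note that a single low-quality base inside an otherwise good region (the 15 in the example) is retained, because clipping there would discard more high-quality sequence than it removes.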

Originally, Phred quality scores were primarily used by the sequence assembly program Phrap. Phrap was routinely used in some of the largest sequencing projects within the Human Genome Project and is one of the most widely used DNA sequence assembly programs in the biotech industry. Phrap uses Phred quality scores to determine highly accurate consensus sequences and to estimate the quality of the consensus sequences. Phrap also uses Phred quality scores to estimate whether discrepancies between two overlapping sequences are more likely to arise from random errors, or from different copies of a repeated sequence.

Within the Human Genome Project, the most important use of Phred quality scores was for automatic determination of consensus sequences. Before Phred and Phrap, scientists had to carefully look at discrepancies between overlapping DNA fragments; often, this involved manual determination of the highest-quality sequence, and manual editing of any errors. Phrap's use of Phred quality scores effectively automated finding the highest-quality consensus sequence; in most cases, this completely circumvents the need for any manual editing. As a result, the estimated error rate in assemblies that were created automatically with Phred and Phrap is typically substantially lower than the error rate of manually edited sequence.
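
The idea of a quality-based consensus can be sketched for a single alignment column. The example below simply sums the quality scores supporting each base and picks the base with the largest total; because Phred scores are logarithmic, adding them corresponds to multiplying error probabilities under the assumption that errors are independent. This is a deliberately simplified illustration and not Phrap's actual algorithm, which the text above does not describe; the function name and the confidence proxy are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def consensus_base(calls: Iterable[Tuple[str, int]]) -> Tuple[str, int]:
    """Pick the base with the highest summed quality among aligned base calls.

    The returned score is the winner's total minus the total of all
    competing bases, floored at 0. It is a crude confidence proxy only.
    """
    totals = defaultdict(int)
    for base, qual in calls:
        totals[base.upper()] += qual
    if not totals:
        raise ValueError("no base calls supplied")
    # Iterate in sorted order so that ties are broken deterministically.
    winner = max(sorted(totals), key=lambda b: totals[b])
    against = sum(q for b, q in totals.items() if b != winner)
    return winner, max(totals[winner] - against, 0)


if __name__ == "__main__":
    # One high-quality T outweighs two low-quality C calls.
    print(consensus_base([("C", 15), ("T", 40), ("C", 15)]))
```

The usage example shows why quality scores matter for consensus calling: a simple majority vote would choose C, whereas the quality-weighted vote chooses the single high-confidence T.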

As of 2009, many commonly used software packages made use of Phred quality scores, albeit to differing extents. Programs like Sequencher use quality scores for display, end clipping, and consensus determination; other programs like CodonCode Aligner also implement quality-based consensus methods.

Compression

Quality scores are normally stored together with the nucleotide sequence in the widely accepted FASTQ format. They account for about half of the required disk space in the FASTQ format (before compression), so compressing the quality values can significantly reduce storage requirements and speed up analysis and transmission of sequencing data. Both lossless and lossy compression have been considered in the literature. For example, the algorithm QualComp [9] performs lossy compression with a rate (number of bits per quality value) specified by the user. Based on results from rate-distortion theory, it allocates the bits so as to minimize the mean squared error (MSE) between the original (uncompressed) and the reconstructed (after compression) quality values. Other algorithms for compression of quality values include SCALCE, [10] Fastqz [11] and, more recently, QVZ, [12] AQUa [13] and the MPEG-G standard, which is currently under development by the MPEG standardisation working group. SCALCE and Fastqz are both lossless compression algorithms that provide an optional controlled lossy transformation approach. For example, SCALCE reduces the alphabet size based on the observation that “neighboring” quality values are similar in general.
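
The effect of a lossy alphabet reduction on the mean squared error can be illustrated with a toy example. The sketch below decodes a FASTQ quality string, replaces each score by the midpoint of a fixed-width bin, and measures the resulting MSE. Several points are assumptions not stated in the text above: the ASCII offset of 33 (the common Sanger-style "Phred+33" encoding), the bin width of 10, and all function names. Fixed-width binning is a generic technique used here for illustration; it is not the method of QualComp, SCALCE, or any other tool mentioned.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Sanger-style "Phred+33" FASTQ encoding


def decode_quality(qual_string: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in qual_string]


def bin_quality(quals: List[int], bin_width: int = 10) -> List[int]:
    """Lossy transform: replace each score by the midpoint of its fixed-width bin."""
    return [(q // bin_width) * bin_width + bin_width // 2 for q in quals]


def mean_squared_error(original: List[int], reconstructed: List[int]) -> float:
    """Mean squared error between original and reconstructed quality values."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


if __name__ == "__main__":
    quals = decode_quality("IIHG:5+!")
    binned = bin_quality(quals)
    print(quals)
    print(binned)
    print(len(set(quals)), "distinct values reduced to", len(set(binned)))
    print("MSE:", mean_squared_error(quals, binned))
```

A smaller alphabet generally makes the quality string easier for a downstream entropy coder to compress, at the cost of a nonzero MSE; wider bins trade more distortion for fewer distinct symbols.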

References

  1. Ewing B, Hillier L, Wendl MC, Green P (1998). "Base-calling of automated sequencer traces using phred. I. Accuracy assessment". Genome Research. 8 (3): 175–185. doi:10.1101/gr.8.3.175. PMID 9521921.
  2. Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Research. 8 (3): 186–194. doi:10.1101/gr.8.3.186. PMID 9521922.
  3. Dear S, Staden R (1992). "A standard file format for data from DNA sequencing instruments". DNA Sequence. 3 (2): 107–110. doi:10.3109/10425179209034003. PMID 1457811.
  4. Bonfield JK, Staden R (25 Apr 1995). "The application of numerical estimates of base calling accuracy to DNA sequencing projects". Nucleic Acids Research. 23 (8): 1406–1410. doi:10.1093/nar/23.8.1406. PMC 306869. PMID 7753633.
  5. Churchill GA, Waterman MS (Sep 1992). "The accuracy of DNA sequences: estimating sequence quality". Genomics. 14 (1): 89–98. doi:10.1016/S0888-7543(05)80288-5. hdl:1813/31678. PMID 1358801.
  6. Lawrence CB, Solovyev VV (1994). "Assignment of position-specific error probability to primary DNA sequence data". Nucleic Acids Research. 22 (7): 1272–1280. doi:10.1093/nar/22.7.1272. PMC 523653. PMID 8165143.
  7. "Life Technologies - US" (PDF).
  8. Richterich P (1998). "Estimation of errors in "raw" DNA sequences: a validation study". Genome Research. 8 (3): 251–259. doi:10.1101/gr.8.3.251. PMC 310698. PMID 9521928.
  9. Ochoa, Idoia; Asnani, Himanshu; Bharadia, Dinesh; Chowdhury, Mainak; Weissman, Tsachy; Yona, Golan (2013). "QualComp: A new lossy compressor for quality scores based on rate distortion theory". BMC Bioinformatics. 14: 187. doi:10.1186/1471-2105-14-187. PMC 3698011. PMID 23758828.
  10. Hach, F; Numanagic, I; Alkan, C; Sahinalp, S. C. (2012). "SCALCE: Boosting sequence compression algorithms using locally consistent encoding". Bioinformatics. 28 (23): 3051–3057. doi:10.1093/bioinformatics/bts593. PMC 3509486. PMID 23047557.
  11. "fastqz - FASTQ compressor".
  12. Malysa, Greg; Hernaez, Mikel; Ochoa, Idoia; Rao, Milind; Ganesan, Karthik; Weissman, Tsachy (2015-10-01). "QVZ: lossy compression of quality values". Bioinformatics. 31 (19): 3122–3129. doi:10.1093/bioinformatics/btv330. ISSN 1367-4803. PMC 5856090. PMID 26026138.
  13. Paridaens, Tom; Van Wallendael, Glenn; De Neve, Wesley; Lambert, Peter (2018). "AQUa: an adaptive framework for compression of sequencing quality scores with random access functionality". Bioinformatics. 34 (3): 425–433. doi:10.1093/bioinformatics/btx607. PMID 29028894.