
A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing. It was originally developed for the computer program Phred to help in the automation of DNA sequencing in the Human Genome Project. Phred quality scores are assigned to each nucleotide base call in automated sequencer traces. The FASTQ format encodes Phred scores as ASCII characters alongside the read sequences. Phred quality scores have become widely accepted to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods. Perhaps the most important use of Phred quality scores is the automatic determination of accurate, quality-based consensus sequences.


Definition

Phred quality scores $Q$ are logarithmically related to the base-calling error probabilities $P$ and are defined as $Q = -10 \log_{10} P$. This relation can also be written as $P = 10^{-Q/10}$. For example, if Phred assigns a quality score of 30 to a base, the chance that this base is called incorrectly is 1 in 1000. Equivalently, the Phred quality score is the negative of the ratio of the error probability to the reference level $P = 1$, expressed in decibels (dB).
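The two relations in this definition can be sketched in a few lines of Python. This is only an illustration of the formulas; the function names `phred_to_error_prob` and `error_prob_to_phred` are made up for this example and are not part of Phred.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Return the error probability P = 10^(-Q/10) for a quality score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Return the quality score Q = -10 * log10(P) for 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Print the error probability and base call accuracy for some common scores.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q={q}: P={p:g}, base call accuracy={1 - p:.4%}")
```

Each increase of 10 in the score corresponds to a tenfold decrease in the error probability, which is why the scale is convenient for comparing base calls of very different reliability.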


History

The idea of sequence quality scores can be traced back to the original description of the SCF file format by Staden's group in 1992. In 1995, Bonfield and Staden proposed a method to use base-specific quality scores to improve the accuracy of consensus sequences in DNA sequencing projects. However, early attempts to develop base-specific quality scores had only limited success. The first program to develop accurate and powerful base-specific quality scores was the program Phred. Phred was able to calculate highly accurate quality scores that were logarithmically linked to the error probabilities. Phred was quickly adopted by all the major genome sequencing centers as well as many other laboratories; the vast majority of the DNA sequences produced during the Human Genome Project were processed with Phred. After Phred quality scores became the required standard in DNA sequencing, other manufacturers of DNA sequencing instruments, including Li-Cor and ABI, developed similar quality scoring metrics for their base calling software.


Methods

Phred's approach to base calling and calculating quality scores was outlined by Ewing ''et al.''. To determine quality scores, Phred first calculates several parameters related to peak shape and peak resolution at each base. Phred then uses these parameters to look up a corresponding quality score in large lookup tables. These lookup tables were generated from sequence traces where the correct sequence was known, and are hard coded in Phred; different lookup tables are used for different sequencing chemistries and machines. An evaluation of the accuracy of Phred quality scores for a number of variations in sequencing chemistry and instrumentation showed that Phred quality scores are highly accurate.

Phred was originally developed for "slab gel" sequencing machines like the ABI373. When originally developed, Phred had a lower base calling error rate than the manufacturer's base calling software, which also did not provide quality scores. However, Phred was only partially adapted to the capillary DNA sequencers that became popular later. In contrast, instrument manufacturers like ABI continued to adapt their base calling software to changes in sequencing chemistry, and have included the ability to create Phred-like quality scores. Therefore, the need to use Phred for base calling of DNA sequencing traces has diminished, and using the manufacturer's current software versions can often give more accurate results.


Applications

Phred quality scores are used for assessment of sequence quality, recognition and removal of low-quality sequence (end clipping), and determination of accurate consensus sequences. Originally, Phred quality scores were primarily used by the sequence assembly program Phrap. Phrap was routinely used in some of the largest sequencing projects in the Human Genome Project and is one of the most widely used DNA sequence assembly programs in the biotech industry. Phrap uses Phred quality scores to determine highly accurate consensus sequences and to estimate the quality of the consensus sequences. Phrap also uses Phred quality scores to estimate whether discrepancies between two overlapping sequences are more likely to arise from random errors, or from different copies of a repeated sequence.

Within the Human Genome Project, the most important use of Phred quality scores was for automatic determination of consensus sequences. Before Phred and Phrap, scientists had to carefully look at discrepancies between overlapping DNA fragments; often, this involved manual determination of the highest-quality sequence, and manual editing of any errors. Phrap's use of Phred quality scores effectively automated finding the highest-quality consensus sequence; in most cases, this completely circumvents the need for any manual editing. As a result, the estimated error rate in assemblies that were created automatically with Phred and Phrap is typically substantially lower than the error rate of manually edited sequence.

As of 2009, many commonly used software packages made use of Phred quality scores, albeit to differing extents. Programs like Sequencher use quality scores for display, end clipping, and consensus determination; other programs like CodonCode Aligner also implement quality-based consensus methods.
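Two of the uses named in this section, end clipping and quality-based consensus determination, can be illustrated with a deliberately simplified Python sketch. It is not the algorithm used by Phrap, Sequencher, or CodonCode Aligner: the function names, the threshold of 20, and the rule of summing quality scores per base are assumptions made for this example, and the reads are assumed to be already aligned, gap-free, and of equal length.

```python
from collections import defaultdict


def clip_ends(quals, min_q=20):
    """Return (start, end) so that quals[start:end] has no low-quality ends.

    Leading and trailing bases with a score below min_q are removed.
    If no base reaches min_q, the returned slice is empty.
    """
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return start, end


def quality_weighted_consensus(reads):
    """Pick, for each column, the base with the highest summed quality.

    reads is a list of (sequence, qualities) pairs that are assumed to be
    aligned to each other, gap-free, and of equal length.
    """
    length = len(reads[0][0])
    consensus = []
    for i in range(length):
        support = defaultdict(int)
        for seq, quals in reads:
            support[seq[i]] += quals[i]
        consensus.append(max(support, key=support.get))
    return "".join(consensus)


# Hypothetical example data: one confident G against two low-quality C calls.
reads = [
    ("ACGT", [30, 30, 30, 30]),
    ("ACCT", [30, 30, 10, 30]),
    ("ACCT", [30, 30, 10, 30]),
]
print(quality_weighted_consensus(reads))
print(clip_ends([5, 8, 25, 30, 32, 12, 6]))
```

In the example data, a simple majority vote would choose C at the third position, whereas weighting by quality favours the single high-confidence G. This is the kind of discrepancy that previously required manual inspection.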


Compression

Quality scores are normally stored together with the nucleotide sequence in the widely accepted FASTQ format. They account for about half of the required disk space in the FASTQ format (before compression), and therefore the compression of the quality values can significantly reduce storage requirements and speed up analysis and transmission of sequencing data. Both lossless and lossy compression have been considered in the literature. For example, the algorithm QualComp performs lossy compression with a rate (number of bits per quality value) specified by the user. Based on rate-distortion theory results, it allocates the number of bits so as to minimize the mean squared error (MSE) between the original (uncompressed) and the reconstructed (after compression) quality values. Other algorithms for compression of quality values include SCALCE and Fastqz and, more recently, QVZ, AQUa, and the MPEG-G standard, which is being developed by the MPEG standardisation working group. SCALCE and Fastqz are both lossless compression algorithms that provide an optional controlled lossy transformation approach. For example, SCALCE reduces the alphabet size based on the observation that "neighboring" quality values are similar in general.
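The general idea of a lossy transformation that shrinks the quality alphabet, and of measuring the resulting distortion as a mean squared error, can be sketched as follows. This is a generic illustration and not the method of QualComp, SCALCE, or any other tool named above. It assumes the common Phred+33 convention, in which each score is stored as the ASCII character with code Q + 33; the fixed-width binning scheme, the function names, and the sample quality string are hypothetical.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string to integer scores (assumes Phred+33)."""
    return [ord(char) - 33 for char in quality_string]


def encode_phred33(scores):
    """Convert integer scores back to a FASTQ quality string (assumes Phred+33)."""
    return "".join(chr(q + 33) for q in scores)


def bin_scores(scores, bin_width=10):
    """Replace each score by the midpoint of its fixed-width bin.

    This is lossy: fewer distinct values remain, so a general-purpose
    compressor has less information to store.
    """
    return [(q // bin_width) * bin_width + bin_width // 2 for q in scores]


def mean_squared_error(original, reconstructed):
    """Return the MSE between two equally long lists of scores."""
    squared = [(a - b) ** 2 for a, b in zip(original, reconstructed)]
    return sum(squared) / len(squared)


# Hypothetical quality string, as it might appear on line 4 of a FASTQ record.
quality_line = "IIIHHG?5+!"
scores = decode_phred33(quality_line)
binned = bin_scores(scores)
print("distinct values before:", len(set(scores)))
print("distinct values after: ", len(set(binned)))
print("MSE:", mean_squared_error(scores, binned))
```

A wider bin lowers the number of distinct symbols at the cost of a higher mean squared error, which is the trade-off between rate and distortion described in this section.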




External links


Long Reads with the KB Basecaller: comparison of Phred accuracy with a competing program, ABI's KB Basecaller
The Laboratory of Phil Green: Phrap's homepage