Phred (software)

Phred is a computer program for base calling, that is, identifying a nucleobase sequence from the fluorescence "trace" data generated by an automated DNA sequencer that uses electrophoresis and the four-fluorescent-dye method. When originally developed, Phred produced significantly fewer errors in the data sets examined than other methods, averaging 40–50% fewer errors. Phred quality scores have become widely accepted as a way to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods.
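A Phred quality score Q is conventionally defined from the estimated probability P that a base call is wrong, as Q = −10 log10(P). The following minimal sketch illustrates that relationship; the function names are illustrative and are not part of Phred itself.

```python
import math


def error_probability(q: float) -> float:
    """Estimated probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def phred_quality(p: float) -> float:
    """Phred quality score for an estimated error probability p (0 < p <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Each step of 10 in quality corresponds to a tenfold drop in error rate.
for q in (10, 20, 30, 40):
    print(f"Q{q}: about 1 error in {round(1 / error_probability(q))} calls")
```

Under this definition, a score of 20 corresponds to an estimated 1% chance of a wrong call and a score of 30 to 0.1%.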


Background

Fluorescent-dye DNA sequencing is a molecular biology technique that involves labeling single-stranded DNA fragments of varied length with four fluorescent dyes (corresponding to the four bases used in DNA) and then separating the fragments by slab-gel or capillary electrophoresis (see DNA sequencing). The electrophoresis run is monitored by a CCD on the DNA sequencer, which produces time-series "trace" data (or a "chromatogram") of the fluorescent "peaks" that pass the CCD point. The order of the individual bases (nucleobases) in the DNA can be determined by examining the fluorescence peaks in the trace data. However, because the intensity, shape, and location of a fluorescence peak are not always consistent or unambiguous, determining (or "calling") the correct bases for the peaks manually can be difficult and time-consuming.

Automated DNA sequencing techniques have revolutionized the field of molecular biology, generating vast amounts of DNA sequence data. However, the sequence data is produced at a significantly higher rate than it can be processed manually (that is, by interpreting the trace data to produce the sequence data), which creates a bottleneck. Removing the bottleneck requires both automated software that speeds up processing with improved accuracy and a reliable measure of that accuracy. Many software programs have been developed to meet this need. One such program is Phred.


History

Phred was originally conceived in the early 1990s by Phil Green, then a professor at Washington University in St. Louis. LaDeana Hillier, Michael Wendl, David Ficenec, Tim Gleeson, Alan Blanchard, and Richard Mott also contributed to the codebase and algorithm. Green moved to the University of Washington in the mid-1990s, after which development was primarily managed by Green and Brent Ewing. Phred played a notable role in the Human Genome Project, where large amounts of sequence data were processed by automated scripts. At the time, it was the most widely used base-calling program in both academic and commercial DNA sequencing laboratories because of its high base-calling accuracy. Phred is distributed commercially by CodonCode Corporation and is used to perform the "Call bases" function in the program CodonCode Aligner. It is also used by the MacVector plugin Assembler.


Methods

Phred uses a four-phase procedure, as outlined by Ewing et al., to determine a sequence of base calls from the processed DNA sequence tracing:

1. Predicted peak locations are determined, based on the assumption that fragments are relatively evenly spaced, on average, in most regions of the gel. This establishes the correct number of bases and their idealized, evenly spaced locations in regions where the peaks are not well resolved, noisy, or displaced (as in compressions).
2. Observed peaks are identified in the trace.
3. Observed peaks are matched to the predicted peak locations, omitting some peaks and splitting others. Because each observed peak comes from a specific array and is thus associated with one of the four bases (A, G, T, or C), the ordered list of matched observed peaks determines a base sequence for the trace.
4. The unmatched observed peaks are checked for any peak that appears to represent a base but could not be assigned to a predicted peak in the third phase. If such a peak is found, the corresponding base is inserted into the read sequence.

The entire procedure is rapid, usually taking less than half a second per trace. The results can be output as a PHD file, which contains base data as triples consisting of the base call, quality, and position (Green, Phil; Ewing, Brent. "PHRED Documentation". Laboratory of Phil Green, University of Washington. http://bozeman.mbt.washington.edu/phrap.docs/phred.html. Accessed 30 September 2021).
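The base-call triples described above could be read with a short parser such as the sketch below. It assumes that the triples appear one per line, separated by whitespace, between `BEGIN_DNA` and `END_DNA` marker lines; that layout, the `BaseCall` type, and the `parse_phd_dna` function are assumptions made for illustration, and the sample data is invented. The PHRED documentation remains the authority on the actual file format.

```python
from typing import List, NamedTuple


class BaseCall(NamedTuple):
    base: str      # called base, e.g. "a", "c", "g" or "t"
    quality: int   # Phred quality score of the call
    position: int  # peak position of the call in the trace


def parse_phd_dna(text: str) -> List[BaseCall]:
    """Collect (base, quality, position) triples from PHD-style text.

    Assumes the triples sit between BEGIN_DNA and END_DNA marker lines.
    """
    calls: List[BaseCall] = []
    in_dna = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            base, quality, position = line.split()[:3]
            calls.append(BaseCall(base, int(quality), int(position)))
    return calls


# Hypothetical example data, not output from a real Phred run.
example = """\
BEGIN_SEQUENCE example_read
BEGIN_DNA
g 12 8
a 34 21
t 40 33
END_DNA
END_SEQUENCE
"""

calls = parse_phd_dna(example)
sequence = "".join(call.base for call in calls)
print(sequence, [call.quality for call in calls])
```

Keeping the three values together in one record makes it straightforward to trim low-quality ends or to map a call back to its peak in the trace.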


Applications

Phred is often used together with another software program, Phrap, which performs DNA sequence assembly. Phrap was routinely used in some of the largest sequencing projects in the Human Genome Sequencing Project and is one of the most widely used DNA sequence assembly programs in the biotech industry. Phrap uses Phred quality scores to determine highly accurate consensus sequences and to estimate the quality of those consensus sequences. Phrap also uses Phred quality scores to estimate whether discrepancies between two overlapping sequences are more likely to arise from random errors or from different copies of a repeated sequence.
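The intuition behind using quality scores to judge discrepancies can be shown with a toy calculation. This is not Phrap's actual scoring method: it simply assumes that base-calling errors in two reads are independent, converts each quality score to an error probability, and flags mismatches where an error in either call would be very unlikely. The function names, the threshold, and the sample reads are all hypothetical.

```python
from typing import List, Tuple

Read = List[Tuple[str, int]]  # aligned (base, Phred quality) pairs


def error_probability(q: int) -> float:
    """Estimated probability that a call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def chance_of_error(q1: int, q2: int) -> float:
    """Probability that at least one of two calls is wrong,
    assuming the two errors are independent."""
    p1, p2 = error_probability(q1), error_probability(q2)
    return 1 - (1 - p1) * (1 - p2)


def suspicious_mismatches(read_a: Read, read_b: Read,
                          threshold: float = 0.01) -> List[int]:
    """Indices where the bases differ and a calling error is unlikely."""
    flagged = []
    for i, ((base_a, q_a), (base_b, q_b)) in enumerate(zip(read_a, read_b)):
        if base_a != base_b and chance_of_error(q_a, q_b) < threshold:
            flagged.append(i)
    return flagged


# Hypothetical aligned reads: position 1 is a low-quality mismatch,
# position 2 is a high-quality mismatch.
read_a = [("a", 40), ("c", 12), ("g", 45), ("t", 40)]
read_b = [("a", 38), ("t", 10), ("c", 50), ("t", 35)]
flagged = suspicious_mismatches(read_a, read_b)
print(flagged)
```

In this toy model, a mismatch between two low-quality calls is plausibly a random error, whereas a mismatch between two high-quality calls is harder to explain by error and might instead point to different copies of a repeat.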


External links


The Laboratory of Phil Green
Phrap's homepage