HOME

TheInfoList



OR:

DNA sequencing theory is the broad body of work that attempts to lay analytical foundations for determining the order of specific
nucleotide Nucleotides are organic molecules consisting of a nucleoside and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both of which are essential biomolecules wi ...
s in a sequence of DNA, otherwise known as
DNA sequencing DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. Th ...
. The practical aspects revolve around designing and optimizing sequencing projects (known as "strategic genomics"), predicting project performance, troubleshooting experimental results, characterizing factors such as sequence bias and the effects of software processing algorithms, and comparing various sequencing methods to one another. In this sense, it could be considered a branch of
systems engineering Systems engineering is an interdisciplinary field of engineering and engineering management that focuses on how to design, integrate, and manage complex systems over their enterprise life cycle, life cycles. At its core, systems engineering util ...
or
operations research Operations research ( en-GB, operational research) (U.S. Air Force Specialty Code: Operations Analysis), often shortened to the initialism OR, is a discipline that deals with the development and application of analytical methods to improve deci ...
. The permanent archive of work is primarily mathematical, although numerical calculations are often conducted for particular problems too. DNA sequencing theory addresses ''physical processes'' related to sequencing DNA and should not be confused with theories of analyzing resultant DNA sequences, e.g.
sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Alig ...
. Publications sometimes do not make a careful distinction, but the latter are primarily concerned with
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...
ic issues. Sequencing theory is based on elements of
mathematics Mathematics is an area of knowledge that includes the topics of numbers, formulas and related structures, shapes and the spaces in which they are contained, and quantities and their changes. These topics are represented in modern mathematics ...
,
biology Biology is the scientific study of life. It is a natural science with a broad scope but has several unifying themes that tie it together as a single, coherent field. For instance, all organisms are made up of cells that process hereditary i ...
, and
systems engineering Systems engineering is an interdisciplinary field of engineering and engineering management that focuses on how to design, integrate, and manage complex systems over their enterprise life cycle, life cycles. At its core, systems engineering util ...
, so it is highly interdisciplinary. The subject may be studied within the context of
computational biology Computational biology refers to the use of data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the field also has fo ...
.


Theory and sequencing strategies


Sequencing as a covering problem

All mainstream methods of
DNA sequencing DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. Th ...
rely on reading small fragments of DNA and subsequently reconstructing these data to infer the original DNA target, either via
assembly Assembly may refer to: Organisations and meetings * Deliberative assembly, a gathering of members who use parliamentary procedure for making decisions * General assembly, an official meeting of the members of an organization or of their representa ...
or
alignment Alignment may refer to: Archaeology * Alignment (archaeology), a co-linear arrangement of features or structures with external landmarks * Stone alignment, a linear arrangement of upright, parallel megalithic standing stones Biology * Structu ...
to a reference. The
abstraction Abstraction in its main sense is a conceptual process wherein general rules and concepts are derived from the usage and classification of specific examples, literal ("real" or "concrete") signifiers, first principles, or other methods. "An abstr ...
common to these methods is that of a mathematical
covering problem In combinatorics and computer science, covering problems are computational problems that ask whether a certain combinatorial structure 'covers' another, or how large the structure has to be to do that. Covering problems are minimization problems ...
. For example, one can imagine a line segment representing the target and a subsequent process where smaller segments are "dropped" onto random locations of the target. The target is considered "sequenced" when adequate coverage accumulates (e.g., when no gaps remain). The abstract properties of covering have been studied by mathematicians for over a century. However, direct application of these results has not generally been possible. Closed-form mathematical solutions, especially for probability distributions, often cannot be readily evaluated. That is, they involve inordinately large amounts of computer time for parameters characteristic of
DNA sequencing DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. Th ...
. Stevens' configuration is one such example. Results obtained from the perspective of
pure mathematics Pure mathematics is the study of mathematical concepts independently of any application outside mathematics. These concepts may originate in real-world concerns, and the results obtained may later turn out to be useful for practical applications, ...
also do not account for factors that are actually important in sequencing, for instance detectable overlap in sequencing fragments, double-stranding, edge-effects, and target multiplicity. Consequently, development of sequencing theory has proceeded more according to the philosophy of
applied mathematics Applied mathematics is the application of mathematical methods by different fields such as physics, engineering, medicine, biology, finance, business, computer science, and industry. Thus, applied mathematics is a combination of mathematical s ...
. In particular, it has been problem-focused and makes expedient use of approximations, simulations, etc.


Early uses derived from elementary probability theory

The earliest result may be found directly from elementary probability theory. Suppose we model the above process taking L and G as the fragment length and target length, respectively. The probability of "covering" any given location on the target ''with one particular fragment'' is then L / G. (This presumes L \ll G, which is valid often, but not for all real-world cases.) The probability of a single fragment not covering a given location on the target is therefore 1 - L / G, and \left - L / G\rightN for N fragments. The probability of covering a given location on the target with ''at least one'' fragment is therefore :P = 1 - \left - \frac\rightN. This equation was first used to characterize plasmid libraries, but it may appear in a modified form. For most projects N \gg 1, so that, to a good degree of approximation :\left - \frac\rightN \sim \exp(-NL/G), where R = NL/G is called the ''redundancy''. Note the significance of redundancy as representing the average number of times a position is covered with fragments. Note also that in considering the covering process over all positions in the target, this probability is identical to the
expected value In probability theory, the expected value (also called expectation, expectancy, mathematical expectation, mean, average, or first moment) is a generalization of the weighted average. Informally, the expected value is the arithmetic mean of a l ...
of the random variable C, the fraction of the target coverage. The final result, :E\langle C \rangle = 1 - e^, remains in widespread use as a " back of the envelope" estimator and predicts that coverage for all projects evolves along a universal curve that is a function only of the redundancy.


Lander-Waterman theory

In 1988,
Eric Lander Eric Steven Lander (born February 3, 1957) is an American mathematician and geneticist who served as the 11th director of the Office of Science and Technology Policy and Science Advisor to the President, serving on the presidential Cabinet. Lan ...
and
Michael Waterman Michael Spencer Waterman (born June 28, 1942) is a Professor of Biology, Mathematics and Computer Science at the University of Southern California (USC), where he holds an Endowed Associates Chair in Biological Sciences, Mathematics and Computer S ...
published an important paper examining the covering problem from the standpoint of gaps. Although they focused on the so-called mapping problem, the abstraction to sequencing is much the same. They furnished a number of useful results that were adopted as the standard theory from the earliest days of "large-scale" genome sequencing. Their model was also used in designing the
Human Genome Project The Human Genome Project (HGP) was an international scientific research project with the goal of determining the base pairs that make up human DNA, and of identifying, mapping and sequencing all of the genes of the human genome from both a ...
and continues to play an important role in DNA sequencing. Ultimately, the main goal of a sequencing project is to close all gaps, so the "gap perspective" was a logical basis of developing a sequencing model. One of the more frequently used results from this model is the expected number of
contig A contig (from ''contiguous'') is a set of overlapping DNA segments that together represent a consensus region of DNA.Gregory, S. ''Contig Assembly''. Encyclopedia of Life Sciences, 2005. In bottom-up sequencing projects, a contig refers to ov ...
s, given the number of fragments sequenced. If one neglects the amount of sequence that is essentially "wasted" by having to detect overlaps, their theory yields :E\langle contigs \rangle = N e^. In 1995,
Roach Roach may refer to: Animals * Cockroach, various insect species of the order Blattodea * Common roach (''Rutilus rutilus''), a fresh and brackish water fish of the family Cyprinidae ** ''Rutilus'' or roaches, a genus of fishes * California roach ...
published improvements to this theory, enabling it to be applied to sequencing projects in which the goal was to completely sequence a target genome. Michael Wendl and Bob Waterston confirmed, based on Stevens' method, that both models produced similar results when the number of contigs was substantial, such as in low coverage mapping or sequencing projects. As sequencing projects ramped up in the 1990s, and projects approached completion, low coverage approximations became inadequate, and the exact model of Roach was necessary. However, as the cost of sequencing dropped, parameters of sequencing projects became easier to directly test empirically, and interest and funding for strategic genomics diminished. The basic ideas of Lander–Waterman theory led to a number of additional results for particular variations in mapping techniques. However, technological advancements have rendered mapping theories largely obsolete except in organisms other than highly studied model organisms (e.g., yeast, flies, mice, and humans).


Parking strategy

The parking strategy for sequencing resembles the process of parking cars along a curb. Each car is a sequenced clone, and the curb is the genomic target. Each clone sequenced is screened to ensure that subsequently sequenced clones do not overlap any previously sequenced clone. No sequencing effort is redundant in this strategy. However, much like the gaps between parked cars, unsequenced gaps less than the length of a clone accumulate between sequenced clones. There can be considerable cost to close such gaps.


Pairwise end-sequencing

In 1995, Roach ''et al.'' proposed and demonstrated through simulations a generalization of a set of strategies explored earlier by Edwards and Caskey. This
whole-genome sequencing Whole genome sequencing (WGS), also known as full genome sequencing, complete genome sequencing, or entire genome sequencing, is the process of determining the entirety, or nearly the entirety, of the DNA sequence of an organism's genome at a ...
method became immensely popular as it was championed by Celera and used to sequence several model organisms before Celera applied it to the human genome. Today, most sequencing projects employ this strategy, often called paired end sequencing.


Post Human Genome Project advancements

The physical processes and protocols of DNA sequencing have continued to evolve, largely driven by advancements in bio-chemical methods, instrumentation, and automation techniques. There is now a wide range of problems that
DNA sequencing DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. Th ...
has made in-roads into, including
metagenomics Metagenomics is the study of genetic material recovered directly from environmental or clinical samples by a method called sequencing. The broad field may also be referred to as environmental genomics, ecogenomics, community genomics or microb ...
and medical (cancer) sequencing. There are important factors in these scenarios that classical theory does not account for. Recent work has begun to focus on resolving the effects of some of these issues. The level of mathematics becomes commensurately more sophisticated.


Various artifacts of large-insert sequencing

Biologists have developed methods to filter highly-repetitive, essentially un-sequenceable regions of genomes. These procedures are important for organisms whose genomes consist mostly of such DNA, for example corn. They yield multitudes of small islands of sequenceable DNA products. Wendl and Barbazuk proposed an extension to Lander–Waterman Theory to account for "gaps" in the target due to filtering and the so-called "edge-effect". The latter is a position-specific sampling bias, for example the terminal base position has only a 1 / G chance of being covered, as opposed to L / G for interior positions. For R < 1, classical Lander–Waterman Theory still gives good predictions, but dynamics change for higher redundancies. Modern sequencing methods usually sequence both ends of a larger fragment, which provides linking information for ''de novo'' assembly and improved probabilities for alignment to reference sequence. Researchers generally believe that longer lengths of data (read lengths) enhance performance for very large DNA targets, an idea consistent with predictions from distribution models. However, Wendl showed that smaller fragments provide better coverage on small, linear targets because they reduce the edge effect in linear molecules. These findings have implications for sequencing the products of DNA filtering procedures. Read-pairing and fragment size evidently have negligible influence for large, whole-genome class targets.


Individual and population sequencing

Sequencing is emerging as an important tool in medicine, for example in cancer research. Here, the ability to detect heterozygous mutations is important and this can only be done if the sequence of the diploid genome is obtained. In the pioneering efforts to sequence individuals, Levy ''et al.'' and Wheeler ''et al.'', who sequenced
Craig Venter John Craig Venter (born October 14, 1946) is an American biotechnologist and businessman. He is known for leading one of the first draft sequences of the human genome and assembled the first team to transfect a cell with a synthetic chromosome. ...
and Jim Watson, respectively, outlined models for covering both alleles in a genome. Wendl and Wilson followed with a more general theory that allowed for an arbitrary number of coverings of each allele and arbitrary
ploidy Ploidy () is the number of complete sets of chromosomes in a cell (biology), cell, and hence the number of possible alleles for Autosome, autosomal and Pseudoautosomal region, pseudoautosomal genes. Sets of chromosomes refer to the number of mat ...
. These results point to the general conclusion that the amount of data needed for such projects is significantly higher than for traditional haploid projects. Generally, at least 30-fold redundancy, i.e. each nucleotide spanned by an average of 30 sequence reads, is now standard. However, requirements can be even greater, depending upon what kinds of genomic events are to be found. For example, in the so-called "discordant read pairs method", DNA insertions can be inferred if the distance between read pairs is larger than expected. Calculations show that around 50-fold redundancy is needed to avoid false-positive errors at 1% threshold. The advent of
next-generation sequencing Massive parallel sequencing or massively parallel sequencing is any of several high-throughput approaches to DNA sequencing using the concept of massively parallel processing; it is also called next-generation sequencing (NGS) or second-generation s ...
has also made large-scale population sequencing feasible, for example the
1000 Genomes Project The 1000 Genomes Project (abbreviated as 1KGP), launched in January 2008, was an international research effort to establish by far the most detailed catalogue of human genetic variation. Scientists planned to sequence the genomes of at least one th ...
to characterize variation in human population groups. While common variation is easily captured, rare variation poses a design challenge: too few samples with significant sequence redundancy risks not having a variant in the sample group, but large samples with light redundancy risk not capturing a variant in the read set that is actually in the sample group. Wendl and Wilson report a simple set of optimization rules that maximize the probability of discovery for a given set of parameters. For example, for observing a rare allele at least twice (to eliminate the possibility is unique to an individual) a little less than 4-fold redundancy should be used, regardless of the sample size.


Metagenomic sequencing

Next-generation instruments are now also enabling the sequencing of whole uncultured metagenomic communities. The sequencing scenario is more complicated here and there are various ways of framing design theories for a given project. For example, Stanhope developed a probabilistic model for the amount of sequence needed to obtain at least one contig of a given size from each novel organism of the community, while Wendl et al. reported analysis for the average contig size or the probability of completely recovering a novel organism for a given rareness within the community. Conversely, Hooper et al. propose a semi-empirical model based on the
gamma distribution In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-square distribution are special cases of the gamma distri ...
.


Limitations

DNA sequencing theories often invoke the assumption that certain random variables in a model are
independent and identically distributed In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent. This property is usua ...
. For example, in Lander–Waterman Theory, a sequenced fragment is presumed to have the same probability of covering each region of a genome and all fragments are assumed to be independent of one another. In actuality, sequencing projects are subject to various types of bias, including differences of how well regions can be cloned, sequencing anomalies, biases in the target sequence (which is ''not'' random), and software-dependent errors and biases. In general, theory will agree well with observation up to the point that enough data have been generated to expose latent biases. The kinds of biases related to the underlying target sequence are particularly difficult to model, since the sequence itself may not be known ''a priori''. This presents a type of
Catch-22 (logic) A catch-22 is a paradoxical situation from which an individual cannot escape because of contradictory rules or limitations. The term was coined by Joseph Heller, who used it in his 1961 novel ''Catch-22''. An example is Brantley Foster in '' The S ...
problem.


See also

*
Computational biology Computational biology refers to the use of data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the field also has fo ...
*
Bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...
*
Mathematical biology Mathematical and theoretical biology, or biomathematics, is a branch of biology which employs theoretical analysis, mathematical models and abstractions of the living organisms to investigate the principles that govern the structure, development a ...
* Sulston score


References

{{reflist, 2 Bioinformatics Mathematical and theoretical biology DNA sequencing