
In bioinformatics, a DNA read error occurs when a sequence assembler changes one DNA base for a different base. The reads from the sequence assembler can then be used to create a de Bruijn graph, which can be used in various ways to find errors.


Overview

In a de Bruijn graph built from k-mers, there are 4^k possible nodes from which to arrange a genome. The number of nodes used to create the graph can be reduced by considering only the k-mers found within the DNA strand of interest. Given sequence 1, it is possible to determine the nodes of size 7, or 7-mers, that will be in the graph. These 7-mers then create the graph shown in figure 1. The graph shown in figure 1 is a very simple version of what a graph could look like (De Bruijn Graph of a small sequence. (2011). Retrieved Feb 7, 2015, from Homolog.us - Bioinformatics: http://www.homolog.us/Tutorials/index.php?p=2.1&s=1). This graph is formed by taking the last 6 elements of each 7-mer and linking it to the node whose first 6 elements are the same.

Figure 1 is the simplest form a de Bruijn graph can take, since each node has exactly one path into it and one path out. Most of the time, graphs will have more than one edge directed to a node and/or more than one edge leaving a node. This follows from the way nodes are connected: an edge points from one node to another whenever the last k-1 elements of the first k-mer match the first k-1 elements of the second. This allows a de Bruijn graph with multiple edges per node to form. These more complicated graphs arise from either read errors or variations in DNA strands. Both causes make it difficult to determine the correct structure of the DNA and what is causing the differences. Since most DNA strands will likely include read errors and variations, scientists hope to use an assembly process that can merge nodes of the graph when they are unambiguously connected, after the graph has been cleaned of vertices and edges created by the errors (Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J., & Birol, I. (2009). ABySS: a parallel assembler for short read sequence data. Genome Research, 19(6), 1117-1123).
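The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular assembler: the function names `kmers` and `build_graph` and the example sequence are hypothetical, and k is set to 3 only to keep the output small.

```python
def kmers(sequence, k):
    """Return every substring of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_graph(sequence, k):
    """Map each distinct k-mer to the set of k-mers it points to.

    An edge runs from node u to node v when the last k-1 characters
    of u equal the first k-1 characters of v.
    """
    nodes = set(kmers(sequence, k))

    # Index the nodes by their first k-1 characters for quick lookup.
    by_prefix = {}
    for node in nodes:
        by_prefix.setdefault(node[:k - 1], set()).add(node)

    return {node: set(by_prefix.get(node[1:], set())) for node in nodes}


# Hypothetical example sequence, chosen so that one node branches.
graph = build_graph("ATGGCGTGCA", 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this example the node ATG has two outgoing edges, because both TGG and TGC begin with TG. This is the kind of branching the text attributes to repeats, read errors, or variation.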


Tips and bubbles

When a graph is formed from sequenced data, the read errors form tips and bubbles. A tip occurs where an error during the sequencing process has caused a branch of the graph to end prematurely; the branch includes both correct and incorrect k-mers. A bubble is also formed when an error occurs during the sequence reading process; however, wherever the error happens, there is a path for the k-mer reads to reconnect with the main graph and continue as though nothing had happened. When tips and bubbles are present in a de Bruijn graph formed from the data, they may be removed only if an error is what caused the tip or bubble to appear. When scientists are using a reference genome, they can quickly tell where tips are located by comparing the graph of the reference genome with the graph of the sequence. If there is no reference genome, tips are eliminated by tracing the branches backward until a point of ambiguity is found. Tips are then removed only if the branch containing the tip is shorter than a set threshold length.

The process of removing bubbles is slightly more complicated. The first step is to identify the beginning of the bubble. From there, each path from the beginning of the bubble is followed until the point of reconnection. The point of reconnection can be different for each path. Since there can be paths of various lengths from the beginning node, the path with the lower coverage is removed.
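The reference-free tip removal described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the graph is a plain dictionary mapping each node to a set of successors, and the names `clip_tips` and `max_tip_length` are hypothetical. It only handles dead-end branches and makes no attempt to check whether a tip was really caused by an error.

```python
def predecessors(graph):
    """Invert a successor map into a predecessor map."""
    preds = {}
    for node, succs in graph.items():
        preds.setdefault(node, set())
        for succ in succs:
            preds.setdefault(succ, set()).add(node)
    return preds


def clip_tips(graph, max_tip_length):
    """Return a copy of graph without dead-end branches shorter than max_tip_length."""
    pruned = {node: set(succs) for node, succs in graph.items()}
    preds = predecessors(pruned)
    dead_ends = [node for node, succs in pruned.items() if not succs]

    for end in dead_ends:
        branch, node = [end], end
        # Trace backward while the branch is unambiguous and still short.
        while len(preds[node]) == 1 and len(branch) < max_tip_length:
            parent = next(iter(preds[node]))
            if len(pruned[parent]) > 1:
                # Point of ambiguity reached: cut the branch off here.
                pruned[parent].discard(node)
                for tip_node in branch:
                    del pruned[tip_node]
                break
            branch.append(parent)
            node = parent
    return pruned


# Hypothetical graph: a main path A-B-C-D-E with a one-node tip X off B.
example = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}, "D": {"E"}, "E": set(), "X": set()}
cleaned = clip_tips(example, max_tip_length=2)
```

With a threshold of 2, the single-node branch X is removed, while the end of the main path (D, E) is kept because tracing back from E reaches the threshold before any branching point is found.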


Example

Given a sequence of any length, the first step is to enter the sequence into a sequencing program, have it sequenced, and have it return base pair (bp) reads of a certain length. Since no sequencing program is completely accurate, there will always be some reads that contain errors. The most common sequencing method is the shotgun method, which is most probably the method used on sequence 2. Once a method is decided on, the length of the bp reads to be returned must be specified. In the case of sequence 2, it returned 7-bp reads, with all errors made during the process noted in red (Flicek, P., & Birney, E. (2009). Sense from sequence reads: methods for alignment and assembly. Nature Methods, 6, S6-S12. Figure 3).

Once the reads are obtained, they are hashed into k-mers. The k-mers are then recorded in a table along with how many times each k-mer appeared in the reads. For this example, each read was hashed into 4-mers, and any error was recorded in red. All of the 4-mers were then recorded in a table with their frequency. Each individual cell of the table then forms a node, allowing a de Bruijn graph to be formed from the given k-mers. In figure 2, linear stretches are identified, and then another graph, figure 3, is formed in which each linear stretch has become a single node of a different k-mer size, giving a more concise graph. In this simplified graph, it is easy to identify various tips and bubbles, as shown in figure 4. These bubbles and tips can then be removed, since they can be identified as having been formed from errors in the bp reads, giving a graph structure that should accurately and completely reflect the original sequence. Following the de Bruijn graph shown in figure 5 shows that the sequence formed does indeed match the DNA sequence given in sequence 2.
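The step of hashing reads into k-mers and tabulating their frequencies can be illustrated with a short sketch. The reads below are hypothetical 7-bp reads, not the ones from sequence 2, and the last read contains a deliberate substitution (C to A) to stand in for a read error.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how many times each k-mer appears across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Hypothetical 7-bp reads; the last one carries a simulated error.
reads = ["ATGGCGT", "TGGCGTG", "GGCGTGC", "ATGGAGT"]
counts = count_kmers(reads, 4)

for kmer, frequency in sorted(counts.items()):
    print(kmer, frequency)
```

Each key of `counts` corresponds to one cell of the frequency table and would become one node of the graph. In this toy data, the 4-mers that overlap the simulated error (TGGA, GGAG, GAGT) each appear only once, which is the kind of low coverage the text uses when deciding which path to remove.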


Comparing two DNA strands

When comparing two strands of DNA, colored de Bruijn graphs are frequently used to identify errors. These errors, often polymorphisms, cause bubbles similar to the ones mentioned above to form. Currently there are four main algorithms used to generalize the data and locate bubbles. The four algorithms extend de Bruijn graphs by allowing the nodes and edges in the graph to be colored by the samples from which they were observed (Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., & McVean, G. (2012). De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genetics, 44(2), 226-232).
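The idea of coloring nodes by the samples in which they were observed can be sketched as a mapping from each k-mer to a set of sample names. This is only an illustration of the data structure, not the implementation from the cited paper; the function name `color_kmers`, the sample names, and the sequences are all hypothetical.

```python
def color_kmers(samples, k):
    """Map each k-mer to the set of sample names (colors) it was seen in."""
    colors = {}
    for name, sequence in samples.items():
        for i in range(len(sequence) - k + 1):
            colors.setdefault(sequence[i:i + k], set()).add(name)
    return colors


# Two hypothetical strands that differ at a single base.
samples = {"reference": "ATGGCGTGCA", "sample": "ATGGAGTGCA"}
colors = color_kmers(samples, 4)

shared = sorted(kmer for kmer, seen in colors.items() if len(seen) == 2)
private = sorted(kmer for kmer, seen in colors.items() if len(seen) == 1)
```

Here the shared k-mers flank the difference, and the k-mers carrying a single color form the two sides of the bubble between them.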


Bubble calling

The simplest use of a colored de Bruijn graph is known as the bubble calling algorithm. This algorithm looks for and locates bubbles in the genome that differ from the original. These bubbles must be “clean”, that is, simple divergences from the reference genome, and cannot be caused by deletions of DNA bases. This algorithm can have high false positive rates because it is difficult to separate repeat-induced bubbles from variant-induced ones; however, there is often a reference genome to help improve reliability. The reference genome also helps in the detection of variants and is essential to detect variant sites. Scientists have also found a way to use the bubble calling algorithm with copy number variation detection, which opens the possibility of unbiased detection of these variations in the future (Nijkamp, J. F., van den Broek, M. A., Geertman, J. M. A., Reinders, M. J., Daran, J. M. G., & de Ridder, D. (2012). De novo detection of copy number variation by co-assembly. Bioinformatics, 28(24), 3195-3202).


Path divergence

When looking at complex variants, there is a very low chance that they will make a clean contig. Since this is most often the case, the path divergence algorithm is useful, especially when considering where deletions occur and the variant is so complex that it is constrained to the reference allele. When a bubble is formed, the path divergence algorithm is the one used most frequently, and it allows detected bubbles to be deleted in a systematic procedure. The algorithm first locates each point of divergence. Then, from each point of divergence, the strands that form the bubble are traced to find where the two paths join after n nodes. If the two paths join, the path with the lower coverage is removed and stored in a file.
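The procedure just described (locate a divergence, trace both paths for up to n nodes, and remove the path with lower coverage if they rejoin) can be sketched as below. This is a simplified illustration under several assumptions: the graph is a dictionary of successor sets, `coverage` is a hypothetical per-node count, only two-way divergences with unbranched arms are handled, and removed paths are returned in a list rather than stored in a file.

```python
def trace(graph, start, max_nodes):
    """Follow unambiguous successors from start for at most max_nodes nodes."""
    path = [start]
    while len(path) < max_nodes and len(graph.get(path[-1], ())) == 1:
        nxt = next(iter(graph[path[-1]]))
        if nxt in path:  # stop rather than loop around a cycle
            break
        path.append(nxt)
    return path


def mean_coverage(branch, coverage):
    return sum(coverage.get(node, 0) for node in branch) / len(branch)


def pop_bubbles(graph, coverage, max_nodes):
    """Return (pruned graph, removed paths) after removing low-coverage bubble arms."""
    pruned = {node: set(succs) for node, succs in graph.items()}
    removed = []
    for node in list(pruned):
        succs = sorted(pruned.get(node, ()))
        if len(succs) != 2:
            continue  # not a simple two-way point of divergence
        left, right = (trace(pruned, s, max_nodes) for s in succs)
        joins = [n for n in left if n in right]
        if not joins:
            continue  # the paths do not rejoin within max_nodes
        arm_a = left[:left.index(joins[0])]
        arm_b = right[:right.index(joins[0])]
        if not arm_a or not arm_b:
            continue  # one arm is empty; not handled in this sketch
        weaker = min(arm_a, arm_b, key=lambda arm: mean_coverage(arm, coverage))
        for arm_node in weaker:
            del pruned[arm_node]
        for remaining in pruned.values():
            remaining.difference_update(weaker)
        removed.append(weaker)
    return pruned, removed


# Hypothetical bubble: S splits into A1-A2 and B1-B2, which rejoin at J.
bubble = {"S": {"A1", "B1"}, "A1": {"A2"}, "A2": {"J"},
          "B1": {"B2"}, "B2": {"J"}, "J": {"E"}, "E": set()}
coverage = {"S": 10, "A1": 10, "A2": 10, "B1": 1, "B2": 1, "J": 10, "E": 10}
pruned, removed = pop_bubbles(bubble, coverage, max_nodes=5)
```

In this example the arm B1-B2 has the lower mean coverage, so it is removed and reported, leaving a single path from S through A1 and A2 to J.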


Multiple sample analysis

Using multiple samples substantially improves the power and false discovery rate of variant detection. In the simplest cases, the samples are combined into a group of a single color and the data is analysed as described previously. However, maintaining a separate color for each sample set provides additional information on how the bubbles were formed, whether by error or by repeats. In 1997, the Department of Technology at Genzyme Genetics in Framingham, Massachusetts developed a new approach that provided a breakthrough in dealing with bubbles, using the multiplex allele-specific diagnostic assay (MASDA). This program combines forward dot-blot, complex simultaneous probe hybridization, and direct mutation detection to help solve the dual problem of multiple sample analysis (Shuber, A. P., Michalowsky, L. A., Nass, G. S., Skoletsky, J., Hire, L. M., Kotsopoulos, S. K., ... & Klinger, K. W. (1997). High throughput parallel analysis of hundreds of patient samples for more than 100 mutations in multiple disease genes. Human Molecular Genetics, 6(3), 337-347).


Genotyping

Colored de Bruijn graphs can be used to genotype any DNA sample at a known locus, even when the coverage is less than sufficient for variant assembly. The first step in this process is to construct a graph of the reference allele, known variants, and data from the sample. The algorithm then calculates the likelihood of each genotype, accounting for the structure of the graph at both the local and genome-wide level. This generalizes to multiple allelic types and helps genotype complex and compound variants. This algorithm is used frequently, as there are no bubbles to deal with. It also helps find the more complicated issues in genes more directly than any of the three algorithms previously mentioned (Genotyping - an overview | ScienceDirect Topics. https://www.sciencedirect.com/topics/agricultural-and-biological-sciences/genotyping, accessed 2020-10-09).
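To give a feel for what "calculating the likelihood of each genotype" can mean, the toy sketch below scores three diploid genotypes from the k-mer coverage observed on a reference allele and an alternative allele. The Poisson coverage model, the `error_rate` parameter, and all function names are assumptions made for illustration; the sketch ignores graph structure entirely and should not be read as the algorithm from the cited work.

```python
import math


def poisson_log_pmf(count, rate):
    """Log-probability of observing count under a Poisson distribution."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


def genotype_log_likelihoods(ref_count, alt_count, depth, error_rate=0.01):
    """Score each genotype given coverage on the two alleles.

    depth is the assumed expected coverage per allele copy; an absent
    allele is assumed to collect only error-derived coverage.
    """
    noise = error_rate * depth
    expected = {
        "ref/ref": (2 * depth, noise),
        "ref/alt": (depth, depth),
        "alt/alt": (noise, 2 * depth),
    }
    return {
        genotype: poisson_log_pmf(ref_count, ref_rate) + poisson_log_pmf(alt_count, alt_rate)
        for genotype, (ref_rate, alt_rate) in expected.items()
    }


def call_genotype(ref_count, alt_count, depth, error_rate=0.01):
    """Return the genotype with the highest log-likelihood."""
    scores = genotype_log_likelihoods(ref_count, alt_count, depth, error_rate)
    return max(scores, key=scores.get)
```

For instance, with an assumed per-copy depth of 10, roughly equal coverage on both alleles favors the heterozygous call, while coverage concentrated on one allele favors the corresponding homozygous call.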

