HOME

TheInfoList



OR:

In
bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...
, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as
DNA sequencing DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. T ...
technology might not be able to 'read' whole genomes in one go, but rather reads small pieces of between 20 and 30,000 bases, depending on the technology used. Typically, the short fragments (reads) result from
shotgun sequencing In genetics, shotgun sequencing is a method used for sequencing random DNA strands. It is named by analogy with the rapidly expanding, quasi-random shot grouping of a shotgun. The chain-termination method of DNA sequencing ("Sanger sequencing ...
genomic Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dim ...
DNA, or gene transcript ( ESTs). The problem of sequence assembly can be compared to taking many copies of a book, passing each of them through a shredder with a different cutter, and piecing the text of the book back together just by looking at the shredded pieces. Besides the obvious difficulty of this task, there are some extra practical issues: the original may have many repeated paragraphs, and some shreds may be modified during shredding to have typos. Excerpts from another book may also be added in, and some shreds may be completely unrecognizable.


Genome assemblers

The first sequence assemblers began to appear in the late 1980s and early 1990s as variants of simpler
sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Al ...
programs to piece together vast quantities of fragments generated by automated sequencing instruments called DNA sequencers. As the sequenced organisms grew in size and complexity (from small
viruses A virus is a submicroscopic infectious agent that replicates only inside the living cells of an organism. Viruses infect all life forms, from animals and plants to microorganisms, including bacteria and archaea. Since Dmitri Ivanovsky's ...
over
plasmids A plasmid is a small, extrachromosomal DNA molecule within a cell that is physically separated from chromosomal DNA and can replicate independently. They are most commonly found as small circular, double-stranded DNA molecules in bacteria; how ...
to
bacteria Bacteria (; singular: bacterium) are ubiquitous, mostly free-living organisms often consisting of one biological cell. They constitute a large domain of prokaryotic microorganisms. Typically a few micrometres in length, bacteria were am ...
and finally
eukaryotes Eukaryotes () are organisms whose cells have a nucleus. All animals, plants, fungi, and many unicellular organisms, are Eukaryotes. They belong to the group of organisms Eukaryota or Eukarya, which is one of the three domains of life. Bacter ...
), the assembly programs used in these
genome project Genome projects are scientific endeavours that ultimately aim to determine the complete genome sequence of an organism (be it an animal, a plant, a fungus, a bacterium, an archaean, a protist or a virus) and to annotate protein-coding genes ...
s needed increasingly sophisticated strategies to handle: * terabytes of sequencing data which need processing on computing clusters; * identical and nearly identical sequences (known as ''repeats'') which can, in the worst case, increase the time and space complexity of algorithms quadratically; * DNA read errors in the fragments from the sequencing instruments, which can confound assembly. Faced with the challenge of assembling the first larger eukaryotic genomes—the fruit fly ''
Drosophila melanogaster ''Drosophila melanogaster'' is a species of fly (the taxonomic order Diptera) in the family Drosophilidae. The species is often referred to as the fruit fly or lesser fruit fly, or less commonly the " vinegar fly" or "pomace fly". Starting with ...
'' in 2000 and the human genome just a year later,—scientists developed assemblers like Celera Assembler and Arachne able to handle genomes of 130 million (e.g., the fruit fly ''D. melanogaster'') to 3 billion (e.g., the human genome) base pairs. Subsequent to these efforts, several other groups, mostly at the major genome sequencing centers, built large-scale assemblers, and an open source effort known as AMOS was launched to bring together all the innovations in genome assembly technology under the
open source Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use the source code, design documents, or content of the product. The open-source model is a decentralized so ...
framework.


EST assemblers

Expressed sequence tag or EST assembly was an early strategy, dating from the mid-1990s to the mid-2000s, to assemble individual genes rather than whole genomes. The problem differs from genome assembly in several ways. The input sequences for EST assembly are fragments of the transcribed
mRNA In molecular biology, messenger ribonucleic acid (mRNA) is a single-stranded molecule of RNA that corresponds to the genetic sequence of a gene, and is read by a ribosome in the process of synthesizing a protein. mRNA is created during the ...
of a cell and represent only a subset of the whole genome. A number of algorithmical problems differ between genome and EST assembly. For instance, genomes often have large amounts of repetitive sequences, concentrated in the intergenic regions. Transcribed genes contain many fewer repeats, making assembly somewhat easier. On the other hand, some genes are expressed (transcribed) in very high numbers (e.g., housekeeping genes), which means that unlike whole-genome shotgun sequencing, the reads are not uniformly sampled across the genome. EST assembly is made much more complicated by features like (cis-)
alternative splicing Alternative splicing, or alternative RNA splicing, or differential splicing, is an alternative splicing process during gene expression that allows a single gene to code for multiple proteins. In this process, particular exons of a gene may be i ...
, trans-splicing,
single-nucleotide polymorphism In genetics, a single-nucleotide polymorphism (SNP ; plural SNPs ) is a germline substitution of a single nucleotide at a specific position in the genome. Although certain definitions require the substitution to be present in a sufficiently ...
, and
post-transcriptional modification Transcriptional modification or co-transcriptional modification is a set of biological processes common to most eukaryotic cells by which an RNA primary transcript is chemically altered following transcription from a gene to produce a mature, ...
. Beginning in 2008 when
RNA-Seq RNA-Seq (named as an abbreviation of RNA sequencing) is a sequencing technique which uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment, analyzing the continuously changing ...
was invented, EST sequencing was replaced by this far more efficient technology, described under de novo transcriptome assembly.


Types of sequence assembly

There are three approaches to assembling sequencing data: # De-novo: assembling sequencing reads to create full-length (sometimes novel) sequences, without using a template (see de novo sequence assemblers, de novo transcriptome assembly) # Mapping/Aligning: assembling reads by aligning reads against a template (AKA reference). The assembled consensus may not be identical to the template. # Reference-guided: grouping of reads by similarity to the most similar region within the reference (step wise mapping). Reads within each group are then shortened down to mimic short reads quality. A typical method to do so is the k-mer approach. Reference-guided assembly is most useful using long-reads. Referenced-guided assembly is a combination of the other types. This type is applied on long reads to mimic short reads advantages (i.e. call quality). The logic behind it is to group the reads by smaller windows within the reference. Reads in each group will then be reduced in size using the k-mere approach to select the highest quality and most probable contiguous (contig). Contigs will then will be joined together to create a scaffold. The final consense is made by closing any gaps in the scaffold.


De-novo vs. mapping assembly

In terms of complexity and time requirements, de-novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies. This is mostly due to the fact that the assembly algorithm needs to compare every read with every other read (an operation that has a naive time complexity of O(n2)). Current de-novo genome assemblers may use different types of graph-based algorithms, such as the: * ''Overlap/Layout/Consensus'' (OLC) approach, which was typical of the Sanger-data assemblers and relies on an overlap graph. * ''de Bruijn Graph'' (DBG) approach, which is most widely applied to the short reads from the Solexa and SOLiD platforms. It relies on K-mer graphs, which performs well with vast quantities of short reads. * ''Greedy graph-based'' approach, which may also use one of the OLC or DBG approaches. With greedy graph-based algorithms, the contigs grow by greedy extension, always taking on the read that is found by following the highest-scoring overlap. Referring to the comparison drawn to shredded books in the introduction: while for mapping assemblies one would have a very similar book as a template (perhaps with the names of the main characters and a few locations changed), de-novo assemblies present a more daunting challenge in that one would not know beforehand whether this would become a science book, a novel, a catalogue, or even several books. Also, every shred would be compared with every other shred. Handling repeats in de-novo assembly requires the construction of a graph representing neighboring repeats. Such information can be derived from reading a long fragment covering the repeats in full or only its two ends. On the other hand, in a mapping assembly, parts with multiple or no matches are usually left for another assembling technique to look into.


Sequence assembly pipeline (bioinformatics)

In general, there are three steps in assembling sequencing reads into a scaffold: 1) Pre-assembly: this step is essential to ensure the integrity of downline analysis such as variant calling or final scaffold sequence. This step consists of two chronological workflow: A) Quality check: Depending on the types of sequencing technology, different errors might arise that would lead to a false base call. For example, sequencing "NAAAAAAAAAAAAN" and "NAAAAAAAAAAAN" which include 12 adenine might be wrongfully called with 11 adenine instead. Sequencing a highly repetitive segment of the target DNA/RNA might result in a call that is one short or one more base. Read quality is typically measured by Phred which is an encoded score of each nucleotide quality within a read's sequence. Some sequencing technologies such as
PacBio Pacific Biosciences of California, Inc. (aka PacBio) is an American biotechnology company founded in 2004 that develops and manufactures systems for gene sequencing and some novel real time biological observation. PacBio describes its platform ...
don't have a scoring method for the their sequenced reads. A common tool used in this step is FastQC. B) Filtering of reads: Reads that failed to pass the quality check should be removed from the FastQ file to get the best assembly contigs. 2) Assembly: during this step, reads alignment will be utilized with different criteria to map each read to the possible location. The predicted position of a read is based on either how much of its sequence aligns with other reads or a reference. Different alignment algorithms are used for reads from different sequencing technologies. Some of the commonly used approaches in the assembly are
de Bruijn De Bruijn is a Dutch surname meaning "the brown". Notable people with the surname include: * (1887–1968), Dutch politician * Brian de Bruijn (b. 1954), Dutch-Canadian ice hockey player * Chantal de Bruijn (b. 1976), Dutch field hockey defender ...
graph and overlapping. Read length,
coverage Coverage may refer to: Filmmaking * Coverage (lens), the size of the image a lens can produce * Camera coverage, the amount of footage shot and different camera setups used in filming a scene * Script coverage, a short summary of a script, writ ...
, quality, and the sequencing technique used plays a major role in choosing the best alignment algorithm in the case of Next Generation Sequencing. On the other hand, algorithms aligning 3rd generation sequencing reads requires advance approaches to account for the high error rate associated with them. 3) Post assembly: This step focusing on extracting valuable information from the assembled sequence.
Comparative genomics Comparative genomics is a field of biological research in which the genomic features of different organisms are compared. The genomic features may include the DNA sequence, genes, gene order, regulatory sequences, and other genomic structural ...
, and population analysis are examples go post-assemble analysis.


Influence of technological changes

The complexity of sequence assembly is driven by two major factors: the number of fragments and their lengths. While more and longer fragments allow better identification of sequence overlaps, they also pose problems as the underlying algorithms show quadratic or even exponential complexity behaviour to both number of fragments and their length. And while shorter sequences are faster to align, they also complicate the layout phase of an assembly as shorter reads are more difficult to use with repeats or near identical repeats. In the earliest days of DNA sequencing, scientists could only gain a few sequences of short length (some dozen bases) after weeks of work in laboratories. Hence, these sequences could be aligned in a few minutes by hand. In 1975, the ''dideoxy termination'' method (AKA ''Sanger sequencing'') was invented and until shortly after 2000, the technology was improved up to a point where fully automated machines could churn out sequences in a highly parallelised mode 24 hours a day. Large genome centers around the world housed complete farms of these sequencing machines, which in turn led to the necessity of assemblers to be optimised for sequences from whole-genome
shotgun sequencing In genetics, shotgun sequencing is a method used for sequencing random DNA strands. It is named by analogy with the rapidly expanding, quasi-random shot grouping of a shotgun. The chain-termination method of DNA sequencing ("Sanger sequencing ...
projects where the reads * are about 800–900 bases long * contain sequencing artifacts like sequencing and cloning vectors * have error rates between 0.5 and 10% With the Sanger technology, bacterial projects with 20,000 to 200,000 reads could easily be assembled on one computer. Larger projects, like the human genome with approximately 35 million reads, needed large computing farms and distributed computing. By 2004 / 2005,
pyrosequencing Pyrosequencing is a method of DNA sequencing (determining the order of nucleotides in DNA) based on the "sequencing by synthesis" principle, in which the sequencing is performed by detecting the nucleotide incorporated by a DNA polymerase. Pyrosequ ...
had been brought to commercial viability by 454 Life Sciences. This new sequencing method generated reads much shorter than those of Sanger sequencing: initially about 100 bases, now 400-500 bases. Its much higher throughput and lower cost (compared to Sanger sequencing) pushed the adoption of this technology by genome centers, which in turn pushed development of sequence assemblers that could efficiently handle the read sets. The sheer amount of data coupled with technology-specific error patterns in the reads delayed development of assemblers; at the beginning in 2004 only the
Newbler Newbler is a software package for '' de novo'' DNA sequence assembly. It is designed specifically for assembling sequence data generated by the 454 GS-series of pyrosequencing platforms sold by 454 Life Sciences, a Roche Diagnostics company. Usag ...
assembler from 454 was available. Released in mid-2007, the hybrid version of the MIRA assembler by Chevreux et al. was the first freely available assembler that could assemble 454 reads as well as mixtures of 454 reads and Sanger reads. Assembling sequences from different sequencing technologies was subsequently coined ''hybrid assembly''. From 2006, the Illumina (previously Solexa) technology has been available and can generate about 100 million reads per run on a single sequencing machine. Compare this to the 35 million reads of the human genome project which needed several years to be produced on hundreds of sequencing machines. Illumina was initially limited to a length of only 36 bases, making it less suitable for de novo assembly (such as de novo transcriptome assembly), but newer iterations of the technology achieve read lengths above 100 bases from both ends of a 3-400bp clone. Announced at the end of 2007, the SHARCGS assembler by Dohm et al. was the first published assembler that was used for an assembly with Solexa reads. It was quickly followed by a number of others. Later, new technologies like
SOLiD Solid is one of the four fundamental states of matter (the others being liquid, gas, and plasma). The molecules in a solid are closely packed together and contain the least amount of kinetic energy. A solid is characterized by structur ...
from
Applied Biosystems Applied Biosystems is one of various brands under the Life Technologies brand of Thermo Fisher Scientific corporation. The brand is focused on integrated systems for genetic analysis, which include computerized machines and the consumables used w ...
,
Ion Torrent Ion semiconductor sequencing is a method of DNA sequencing based on the detection of hydrogen ions that are released during the polymerization of DNA. This is a method of "sequencing by synthesis", during which a complementary strand is built based ...
and SMRT were released and new technologies (e.g.
Nanopore sequencing Nanopore sequencing is a third generation approach used in the sequencing of biopolymers — specifically, polynucleotides in the form of DNA or RNA. Using nanopore sequencing, a single molecule of DNA or RNA can be sequenced without the nee ...
) continue to emerge. Despite the higher error rates of these technologies they are important for assembly because their longer read length helps to address the repeat problem. It is impossible to assemble through a perfect repeat that is longer than the maximum read length; however, as reads become longer the chance of a perfect repeat that large becomes small. This gives longer sequencing reads an advantage in assembling repeats even if they have low accuracy (~85%).


Assembly algorithms

Different organisms have a distinct region of higher complexity within their genome. Hence, the need of different computational approaches is needed. Some of the commonly used algorithms are: * Graph Assembly: is based on Graph theory in computer science. The
de Bruijn Graph In graph theory, an -dimensional De Bruijn graph of symbols is a directed graph representing overlaps between sequences of symbols. It has vertices, consisting of all possible sequences of the given symbols; the same symbol may appear multipl ...
is an example of this approach and utilizes k-mers to assemble a contiguous from reads. * Greedy Graph Assembly: this approach score each added read to the assembly and selects the highest possible score from the overlapping region. Given a set of sequence fragments, the object is to find a longer sequence that contains all the fragments (see figure under ''Types of Sequence Assembly''): # Сalculate pairwise alignments of all fragments. # Choose two fragments with the largest overlap. # Merge chosen fragments. # Repeat step 2 and 3 until only one fragment is left. The result might not be an optimal solution to the problem.


Programs

For a lists of ''de-novo'' assemblers, see De novo sequence assemblers. For a list of mapping aligners, see List of sequence alignment software § Short-read sequence alignment. Some of the common tools used in different assembly steps are listed in the following table:


See also

* De novo sequence assemblers *
Sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Al ...
* De novo transcriptome assembly *
Set cover problem The set cover problem is a classical question in combinatorics, computer science, operations research, and complexity theory. It is one of Karp's 21 NP-complete problems shown to be NP-complete in 1972. Given a set of elements (called the un ...
*
List of sequenced animal genomes This list of sequenced animal genomes contains animal species for which complete genome sequences have been assembled, annotated and published. Substantially complete draft genomes are included, but not partial genome sequences or organelle-only se ...


References

{{DEFAULTSORT:Sequence Assembly Bioinformatics DNA sequencing methods