Velvet assembler
   HOME

TheInfoList



OR:

Velvet is an algorithm package that has been designed to deal with ''de novo''
genome assembly In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology might not be able to 'read' whole genomes in on ...
and short read sequencing alignments. This is achieved through the manipulation of
de Bruijn graph In graph theory, an -dimensional De Bruijn graph of symbols is a directed graph representing overlaps between sequences of symbols. It has vertices, consisting of all possible sequences of the given symbols; the same symbol may appear multiple ...
s for genomic sequence assembly via the removal of errors and the simplification of repeated regions.Zerbino, D. R.; Birney, E. (2008)
"Velvet: de novo assembly using very short reads"
Retrieved 18 October 2013.
Velvet has also been implemented in commercial packages, such as
Sequencher Gene Codes Corporation is a privately owned international firm based in Ann Arbor, Michigan, which specializes in bioinformatics software for genetic sequence analysis In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA ...
, Geneious,
MacVector MacVector is a commercial sequence analysis application for Apple Macintosh computers running Mac OS X. It is intended to be used by Molecular biology, molecular biologists to help analyze, design, research and document their experiments in the la ...
and BioNumerics.


Introduction

The development of next-generation sequencers (NGS) allowed for increased cost effectiveness on very short read sequencing. The manipulation of de Bruijn graphs as a method for alignment became more realistic but further developments were needed to address issues with errors and repeats. This led to the development of Velvet by Daniel Zerbino and
Ewan Birney John Frederick William Birney (known as Ewan Birney) (born 6 December 1972) is joint director of EMBL's European Bioinformatics Institute (EMBL-EBI), in Hinxton, Cambridgeshire and deputy director general of the European Molecular Biology Labora ...
at the
European Bioinformatics Institute The European Bioinformatics Institute (EMBL-EBI) is an Intergovernmental Organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Well ...
in the United Kingdom. Velvet works by efficiently manipulating de Bruijn graphs through simplification and compression, without the loss of graph information, by converging non-intersecting paths into single nodes. It eliminates errors and resolves repeats by first using an error correction algorithm that merges sequences together. Repeats are then removed from the sequence via the repeat solver that separates paths which share local overlaps. The combination of short reads and read pairs allows Velvet to resolve small repeats and produce
contig A contig (from ''contiguous'') is a set of overlapping DNA segments that together represent a consensus region of DNA.Gregory, S. ''Contig Assembly''. Encyclopedia of Life Sciences, 2005. In bottom-up sequencing projects, a contig refers to ov ...
s of reasonable length. This application of Velvet can produce contigs with a N50 length of 50 kb on paired-end
prokaryotic A prokaryote () is a Unicellular organism, single-celled organism that lacks a cell nucleus, nucleus and other membrane-bound organelles. The word ''prokaryote'' comes from the Greek language, Greek wikt:πρό#Ancient Greek, πρό (, 'before') a ...
data and a 3 kb length for regions of
mammal Mammals () are a group of vertebrate animals constituting the class Mammalia (), characterized by the presence of mammary glands which in females produce milk for feeding (nursing) their young, a neocortex (a region of the brain), fur or ...
ian data.


Algorithm

As already mentioned Velvet uses the de Bruijn graph to assemble short reads. More specifically Velvet represents each different
k-mer In bioinformatics, ''k''-mers are substrings of length k contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which ''k''-mers are composed of nucleotides (''i.e''. A, T, G ...
obtained from the reads by a unique node on the graph. Two nodes are connected if its k-mers have a k-1 overlap. In other words, an arc from node A to node B exists if the last k-1 characters of the k-mer, represented by A, are the first k-1 characters of the k-mer represented by B. The following figure shows an example of a de Bruijn graph generated with Velvet: The same process is simultaneously done with the
reverse complement In molecular biology, complementarity describes a relationship between two structures each following the lock-and-key principle. In nature complementarity is the base principle of DNA replication and transcription as it is a property shared b ...
of all the k-mers to take into account the overlaps between the reads of opposite strands. A number of optimizations can be done over the graph which includes simplification and error removal.


Simplification

An easy way to save memory costs is to merge nodes that do not affect the path generated in the graph, i.e., whenever a node A has only one outgoing arc that points to node B, with only one ingoing arc, the nodes can be merged. It is possible to represent both nodes as one, merging them and all their information together. The next figure illustrates this process in the simplification of the initial example.


Error removal

Errors in the graph can be caused by the sequencing process or it could simply be that the biological sample contains some errors (for example polymorphisms). Velvet recognizes three kinds of errors: tips; bubbles; and erroneous connections.


Tips

A node is considered a tip and should be erased if it is disconnected on one of its ends, the length of the information stored in the node is shorter than 2k, and the arc leading to this node has a low multiplicity (number of times the arc was found during the construction of the graph) and as a result cannot be compared to other alternative paths. Once these errors are removed, the graph once again undergoes simplification.


Bubbles

Bubbles are generated when two distinct paths start and end at the same nodes. Normally bubbles are caused by errors or biological variants. These errors are removed using the Tour Bus algorithm, which is similar to a
Dijkstra's algorithm Dijkstra's algorithm ( ) is an algorithm for finding the shortest paths between nodes in a graph, which may represent, for example, road networks. It was conceived by computer scientist Edsger W. Dijkstra in 1956 and published three years ...
, a
breadth-first search Breadth-first search (BFS) is an algorithm for searching a tree data structure for a node that satisfies a given property. It starts at the tree root and explores all nodes at the present depth prior to moving on to the nodes at the next de ...
that detects the best path to follow and determines which ones should be erased. A simple example is shown in figure 4. This process is also shown in figure 5 following on from the examples shown in figures 1 and 2.


Erroneous connections

These are connections that do not generate correct paths or do not create any recognizable structures within the graph. Velvet erases these errors after completion of the Tour Bus algorithm, applying a simple coverage cut-off that must be defined by the user.


Velvet commands

Velvet provides the following functions: ; velveth : This command helps to construct the data set (hashes the reads) for velvetg and includes information about the meaning of each sequence files. ; velvetg : This command builds the de Bruijn graph from the k-mers obtained by velveth and runs simplification and error correction over the graph. It then extracts the contigs. After running velvetg a number of files are generated. Most importantly, a file of contigs contains the sequences of the contigs longer than 2k, where k is the word-length used in velveth. For more detail and examples refer to th
Velvet Manual
"Velvet Manual"
Retrieved 18 October 2013


Motivation

Current DNA sequencing technologies, including NGS, are limited on the basis that
genome In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ge ...
s are much larger than any read length. Typically, NGS operate with small reads, less than 400 bp, and have a much lower cost per read than previous first generation machines. They are also simpler to operate with higher parallel operation and higher yield. However, short reads contain less information than larger reads thus requiring a higher assembly read coverage to allow for detectable overlaps. This in turn increases the complexity of the sequencing and significantly increases computational requirements. A larger number of reads also increases the size of the overlap graph, making it more difficult and lengthy to compute. The connections between the reads become more indistinct due to the decrease in overlapping sections leading to a greater possibility of errors. To overcome these issues, dynamic sequencing programs that are efficient, highly cost effective and able to resolve errors and repeats were developed. Velvet algorithms was designed for this and are able to perform short read de novo sequencing alignment in relatively short computational time and with lower memory usage compared to other assemblers.


Graphical interface

One of the main drawbacks in the use of Velvet is the use of the command-line interface and the difficulties users, especially beginners, face in the implementation of their data. A graphical user interface for the Velvet assembler was developed in 2012 and designed to overcome this problem and simplify the running of Velvet.


See also

*
De novo sequence assemblers De novo sequence assemblers are a type of program that assembles short nucleotide sequences into longer ones without the use of a reference genome. These are most commonly used in bioinformatic studies to assemble genomes or transcriptomes. Two comm ...
*
SPAdes (software) SPAdes (St. Petersburg genome assembler) is a genome assembly algorithm which was designed for single cell and multi-cells bacterial data sets. Therefore, it might not be suitable for large genomes projects. SPAdes works with Ion Torrent, PacBi ...


References

{{reflist Bioinformatics algorithms Bioinformatics software DNA sequencing Metagenomics software