SPAdes (St. Petersburg genome assembler) is a genome assembly algorithm designed for single-cell and multi-cell bacterial data sets. Therefore, it might not be suitable for large genome projects. SPAdes works with Ion Torrent, PacBio, Oxford Nanopore, and Illumina paired-end, mate-pair and single reads. SPAdes has been integrated into Galaxy pipelines by Guy Lionel and Philip Mabon.

Background

Studying the genomes of single cells helps to track changes that occur in DNA over time or in response to different conditions. Additionally, many projects, such as the Human Microbiome Project and antibiotics discovery, would greatly benefit from single-cell sequencing (SCS). SCS has an advantage over sequencing DNA extracted from a large number of cells: it avoids averaging out the significant variations between cells. Experimental and computational technologies are being optimized to allow researchers to sequence single cells. For instance, amplification of DNA extracted from a single cell is one of the experimental challenges. To maximize the accuracy and quality of SCS, uniform DNA amplification is needed. It was demonstrated that using multiple annealing and looping-based amplification cycles (MALBAC) for DNA amplification generates less bias than polymerase chain reaction (PCR) or multiple displacement amplification (MDA). Furthermore, it has been recognized that the challenges facing SCS are computational rather than experimental. Existing assemblers, such as Velvet, String Graph Assembler (SGA) and EULER-SR, were not designed to handle SCS assembly. Assembly of single-cell data is difficult due to non-uniform read coverage, variation in insert length, high levels of sequencing errors and chimeric reads. SPAdes was therefore designed as a new algorithmic approach to address these issues.

SPAdes assembly approach

SPAdes uses k-mers to build the initial de Bruijn graph; in the following stages it performs graph-theoretical operations based on graph structure, coverage and sequence lengths. Moreover, it corrects errors iteratively. The stages of assembly in SPAdes are:

* Stage 1: assembly graph construction. SPAdes employs a multisized de Bruijn graph (see below), and detects and removes bulges/bubbles and chimeric reads.
* Stage 2: k-bimer (pairs of k-mers) adjustment. Exact distances between k-mers in the genome (edges in the assembly graph) are estimated.
* Stage 3: paired assembly graph construction.
* Stage 4: contig construction. SPAdes outputs contigs and allows reads to be mapped back to their positions in the assembly graph after graph simplification (backtracking).
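The idea of building a de Bruijn graph from k-mers can be illustrated with a minimal Python sketch. This is not SPAdes code: the function names, the toy reads and the convention used here (vertices are (k-1)-mers, each k-mer is an edge, and the edge count stands in for coverage) are assumptions made for illustration only.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every substring of length k contained in the read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph.

    Vertices are (k-1)-mers. Each k-mer adds an edge from its prefix
    to its suffix. The stored integer counts how often the edge was
    seen, a crude stand-in for coverage.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


# Hypothetical toy reads taken from a short circular sequence.
reads = ["ACGTAC", "CGTACG"]
graph = de_bruijn_graph(reads, k=3)

for source in sorted(graph):
    for target, count in sorted(graph[source].items()):
        print(f"{source} -> {target} (seen {count}x)")
```

A multisized graph, as used in Stage 1, would combine graphs built with several values of k; the sketch above covers only a single k.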

Details on SPAdes assembly

SPAdes was designed to overcome the problems associated with the assembly of single-cell data as follows:

1. Non-uniform coverage. SPAdes uses a multisized de Bruijn graph, which allows different values of k to be employed. Smaller values of k are suggested for low-coverage regions to minimize fragmentation, and larger values of k for high-coverage regions to reduce repeat collapsing (Stage 1 above).

2. Variable insert sizes of paired-end reads. SPAdes employs the basic concept of paired de Bruijn graphs. However, paired de Bruijn graphs work well only on paired-end reads with a fixed insert size. Therefore, SPAdes estimates 'distances' instead of using 'insert sizes'. For reads of length L, the distance d of a paired-end read is defined as d = insert size - L. The k-bimer adjustment approach is used to estimate distances exactly. A k-bimer is a triple (α, β, d) consisting of k-mers α and β together with the estimated distance d between them in the genome. This approach breaks the paired-end reads into pairs of k-mers, which are transformed into pairs of edges (biedges) in the de Bruijn graph. These sets of biedges are used to estimate the distances between edges on paths between k-mers α and β. The biedges are clustered, and the optimal distance estimate is chosen from each cluster (Stage 2 above). To construct the paired de Bruijn graph, SPAdes employs rectangle graphs (Stage 3). The rectangle graph approach was first introduced in 2012 to construct paired de Bruijn graphs with uncertain distances.

3. Bulges, tips and chimeras. Bulges and tips occur due to errors in the middle and at the ends of reads, respectively. A chimeric connection joins two unrelated substrings of the genome. SPAdes identifies these based on graph topology and on the length and coverage of the non-branching paths they contain. SPAdes keeps a data structure that allows it to backtrack all corrections or removals.

SPAdes modifies the previously used bulge removal approach and the iterative de Bruijn graph approach of Peng ''et al.'' (2010) and creates a new approach called "bulge corremoval", which stands for bulge correction and removal. The bulge corremoval algorithm can be summarized as follows: a simple bulge is formed by two small and similar paths (P and Q) connecting the same hubs. If P is a non-branching path (h-path), then SPAdes maps every edge in P to an edge projection in Q and removes P from the graph; as a result, the coverage of Q increases. Unlike other assemblers, which use a fixed coverage cut-off for bulge removal, SPAdes removes or projects the h-paths with low coverage step by step. This is achieved by employing gradually increasing cut-off thresholds and iterating through all h-paths in increasing order of coverage (for bulge corremoval and chimeric removal) or length (for tip removal). Moreover, to guarantee that no new sources or sinks are introduced into the graph, SPAdes deletes an h-path (in chimeric h-path removal) or projects it (in bulge corremoval) only if its start and end vertices have at least two outgoing and ingoing edges, respectively. This helps to remove low-coverage h-paths that arise from sequencing errors and chimeric reads, but not those that arise from repeats.
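The step-by-step use of increasing coverage cut-offs in bulge corremoval can be sketched in Python as follows. This is a heavily simplified illustration, not the SPAdes implementation: the `HPath` class, the `max_len_diff` similarity test and the toy data are hypothetical. The sketch treats each h-path as a single unit instead of projecting it edge by edge, and it leaves out tip and chimeric removal. The source/sink condition holds automatically here, because a path is only projected when an alternative path shares both of its hubs.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare paths by identity, not by field values
class HPath:
    start: str       # hub where the non-branching path begins
    end: str         # hub where it ends
    length: int
    coverage: float


def corremove_bulges(paths, thresholds, max_len_diff=2):
    """Project low-coverage bulge paths onto a similar alternative path.

    `thresholds` are applied in increasing order. Within each round,
    paths are visited in increasing order of coverage. Every projection
    is logged so that it can be backtracked later.
    """
    paths = list(paths)
    projections = []  # (removed path P, surviving path Q)
    for cutoff in sorted(thresholds):
        for p in sorted(paths, key=lambda h: h.coverage):
            if p not in paths or p.coverage > cutoff:
                continue
            alternatives = [
                q for q in paths
                if q is not p
                and q.start == p.start
                and q.end == p.end
                and abs(q.length - p.length) <= max_len_diff
            ]
            if not alternatives:
                continue  # not a bulge: keep it, it may be a low-coverage region
            q = max(alternatives, key=lambda h: h.coverage)
            q.coverage += p.coverage  # the coverage of Q increases
            paths.remove(p)
            projections.append((p, q))
    return paths, projections


# Hypothetical toy graph: one bulge between hubs A and B, plus a
# genuine low-coverage path from B to C that has no alternative.
main = HPath("A", "B", length=10, coverage=30)
bulge = HPath("A", "B", length=10, coverage=2)
sparse = HPath("B", "C", length=50, coverage=3)

remaining, log = corremove_bulges([main, bulge, sparse], thresholds=[1, 5])
print(len(remaining), main.coverage, len(log))
```

In this toy run the low-coverage path from B to C survives because no similar path connects the same hubs, whereas a plain coverage cut-off of 5 would have deleted it together with the bulge.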

SPAdes pipelines and performance

SPAdes is composed of the following tools:

* Read error correction tools: BayesHammer (for Illumina data) and IonHammer (for Ion Torrent data). In traditional error correction, rare k-mers are considered errors. This cannot be applied to SCS because of non-uniform coverage. Therefore, BayesHammer employs probabilistic subclustering, which examines multiple central nucleotides of similar k-mers, as these are better covered than the others. It was claimed that for an ''Escherichia coli'' (''E. coli'') single-cell data set, BayesHammer runs in about 75 min, takes up to 10 Gb of RAM to carry out read error correction and requires 10 Gb of additional disk space for temporary files.
* Iterative short-read genome assembler, SPAdes. For the same data set, this step runs for ~75 min. About 40% of this time is spent on stage 1 (see SPAdes assembly approach above) when using three iterations (k = 22, 34 and 56), and about 45%, 14% and 1% on stages 2, 3 and 4, respectively. It also takes up to 5 Gb of RAM to perform assembly and needs 8 Gb of additional disk space.
* Mismatch corrector (which uses the BWA tool). This module requires the longest time (~120 min) and the largest additional disk space (~21 Gb) for temporary files. It takes up to 9 Gb of RAM to complete mismatch correction of the assembled ''E. coli'' single-cell data set.
* Module for assembling highly polymorphic diploid genomes, dipSPAdes. dipSPAdes constructs longer contigs by taking advantage of divergence between haplomes in repetitive genome regions. Afterwards, it constructs consensus contigs and performs haplotype assembly.
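The limitation of traditional error correction under non-uniform coverage can be seen in a small Python sketch. It shows only the traditional rare-k-mer rule, not BayesHammer's probabilistic subclustering. The function names, the `min_count` threshold and the toy reads are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmers(counts, min_count):
    """Traditional rule: any k-mer seen fewer than min_count times is an error."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Hypothetical toy data:
# - region "ACGTT" is well covered (5 correct reads, 1 read with an error)
# - region "GGCAT" is correct but covered by a single read
reads = ["ACGTT"] * 5 + ["ACCTT"] + ["GGCAT"]

counts = kmer_counts(reads, k=3)
flagged = rare_kmers(counts, min_count=2)
print(sorted(flagged))
```

The k-mers from the erroneous read (ACC, CCT, CTT) are flagged as intended, but so are the correct k-mers of the poorly covered region (GGC, GCA, CAT). A single count threshold cannot separate the two cases.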

Comparing assemblers

A study compared several genome assemblers on single-cell ''E. coli'' samples: EULER-SR, Velvet, SOAPdenovo, Velvet-SC, EULER + Velvet-SC (E+V-SC), IDBA-UD and SPAdes. IDBA-UD and SPAdes performed the best. SPAdes had the largest NG50 (99,913; the NG50 statistic is the same as N50 except that the genome size is used rather than the assembly size). Moreover, measured against the ''E. coli'' reference genome, SPAdes assembled the highest percentage of the genome (97%) and the highest number of complete genes (4,071 out of 4,324). The assemblers' performances were as follows:

* Number of contigs: IDBA-UD < Velvet < E+V-SC < ''SPAdes'' < EULER-SR < Velvet-SC < SOAPdenovo
* NG50: ''SPAdes'' > IDBA-UD >>> E+V-SC > EULER-SR > Velvet > Velvet-SC > SOAPdenovo
* Largest contig: IDBA-UD > ''SPAdes'' >> EULER-SR > Velvet = E+V-SC > Velvet-SC > SOAPdenovo
* Mapped genome (%): ''SPAdes'' > IDBA-UD > E+V-SC > Velvet-SC > EULER-SR > SOAPdenovo > Velvet
* Number of misassemblies: E+V-SC = Velvet = Velvet-SC < SOAPdenovo < IDBA-UD < ''SPAdes'' < EULER-SR
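The difference between N50 and NG50 described above can be made concrete with a short Python sketch. The function names and the toy contig lengths are illustrative assumptions and are not taken from the study.

```python
def nx50(contig_lengths, total):
    """Return the largest length L such that contigs of length >= L
    together cover at least half of `total`. Returns 0 if the contigs
    cover less than half of `total`."""
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def n50(contig_lengths):
    """N50: the reference total is the assembly size."""
    return nx50(contig_lengths, sum(contig_lengths))


def ng50(contig_lengths, genome_size):
    """NG50: the reference total is the genome size."""
    return nx50(contig_lengths, genome_size)


# Hypothetical assembly of 290 bases for a genome of 400 bases.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))                    # half of 290 is reached at the 70 contig
print(ng50(contigs, genome_size=400))  # half of 400 is reached at the 50 contig
```

Because NG50 is measured against the genome size, it does not reward an assembly for being incomplete, which makes it the fairer statistic when comparing assemblers on the same genome.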
