Introduction
Metagenomic samples can contain reads from a huge number of organisms. For example, a single gram of soil can contain up to 18,000 different types of organisms, each with its own genome. Metagenomic studies sample DNA from the whole community and make it available as nucleotide sequences of a certain length. In most cases, the incomplete nature of the obtained sequences makes it hard to assemble individual genes, much less to recover full genomes.
Algorithms
Binning algorithms can employ previous information, and thus act as supervised classifiers, or they can try to find new groups, in which case they act as unsupervised classifiers. Many, of course, do both. The classifiers exploit previously known sequences by performing alignments against them.
TETRA
TETRA is a statistical classifier that uses tetranucleotide usage patterns in genomic fragments. There are four possible nucleotides in DNA, so there can be 4^4 = 256 different fragments of four consecutive nucleotides; these fragments are called tetramers. TETRA works by tabulating the frequency of each tetramer in a given sequence. From these frequencies, z-scores are then calculated, which indicate how over- or under-represented each tetramer is compared with what would be expected from the individual nucleotide composition. The z-scores for all tetramers are assembled into a vector, and the vectors corresponding to different sequences are compared pairwise to yield a measure of how similar the sequences in the sample are. The most similar sequences are expected to belong to organisms in the same OTU.
MEGAN
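The TETRA procedure described above (tabulate tetramer frequencies, convert them to z-scores, compare the resulting vectors pairwise) can be illustrated with a minimal Python sketch. It is not TETRA's actual implementation. As a simplifying assumption, the expected count of each tetramer is taken from the product of single-nucleotide frequencies with a binomial variance, and vectors are compared with a Pearson correlation; the published tool may compute its expectations, variances and similarity measure differently. The function names `tetramer_zscores` and `pearson` are hypothetical.

```python
from itertools import product
from math import sqrt

BASES = "ACGT"
# All 4**4 = 256 possible tetramers, in a fixed order.
TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]


def tetramer_zscores(seq):
    """Return a 256-element z-score vector for one sequence.

    Assumption: expected counts come from single-nucleotide frequencies
    (a zero-order model) with a binomial variance. This is a simplification
    for illustration, not TETRA's published statistics.
    """
    seq = seq.upper()
    windows = [seq[i:i + 4] for i in range(len(seq) - 3)]
    # Ignore windows containing ambiguous characters such as N.
    windows = [w for w in windows if set(w) <= set(BASES)]
    n = len(windows)
    if n == 0:
        raise ValueError("sequence has no valid tetramers")

    total_bases = sum(seq.count(b) for b in BASES)
    base_freq = {b: seq.count(b) / total_bases for b in BASES}

    counts = {}
    for w in windows:
        counts[w] = counts.get(w, 0) + 1

    zscores = []
    for t in TETRAMERS:
        p = 1.0
        for b in t:
            p *= base_freq[b]
        expected = n * p
        variance = n * p * (1.0 - p)
        observed = counts.get(t, 0)
        z = (observed - expected) / sqrt(variance) if variance > 0 else 0.0
        zscores.append(z)
    return zscores


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0
```

Calling `pearson(tetramer_zscores(a), tetramer_zscores(b))` for every pair of fragments `a` and `b` gives a similarity matrix in which high values suggest fragments with similar tetranucleotide usage.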
In the DIAMOND+MEGAN approach, all reads are first aligned against a protein reference database, such as NCBI-nr, and the resulting alignments are then analyzed using the naive LCA algorithm, which places a read on the lowest taxonomic node in the NCBI taxonomy that lies above all taxa to which the read has a significant alignment. Here, an alignment is usually deemed "significant" if its bit score lies above a given threshold (which depends on the length of the reads) and is within 10%, say, of the best score seen for that read. The rationale for using protein reference sequences, rather than DNA reference sequences, is that current DNA reference databases cover only a small fraction of the true diversity of genomes that exist in the environment.
Phylopythia
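The naive LCA placement described above for the DIAMOND+MEGAN approach can be sketched in a few lines of Python. This is an illustrative sketch, not MEGAN's code. The taxonomy is assumed to be a plain dictionary mapping each taxon to its parent, hits are assumed to be `(taxon, bit_score)` pairs, and the names `lineage`, `naive_lca`, `min_score` and `top_percent` are hypothetical, as is the toy taxonomy.

```python
def lineage(taxon, parent):
    """Path from a taxon up to the root, inclusive."""
    path = [taxon]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def naive_lca(hits, parent, min_score=50.0, top_percent=10.0):
    """Place a read on the lowest node above all significantly hit taxa.

    A hit counts as significant if its bit score is at least `min_score`
    and within `top_percent` percent of the best score for the read.
    Returns None when no hit is significant.
    """
    if not hits:
        return None
    best = max(score for _, score in hits)
    cutoff = best * (1.0 - top_percent / 100.0)
    taxa = {t for t, s in hits if s >= min_score and s >= cutoff}
    if not taxa:
        return None

    paths = [lineage(t, parent) for t in taxa]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking upward from a leaf, the first shared node is the LCA.
    for node in paths[0]:
        if node in common:
            return node
    return None


# Toy taxonomy (hypothetical, heavily simplified).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Salmonella enterica": "Proteobacteria",
    "Bacillus subtilis": "Firmicutes",
}

hits = [
    ("Escherichia coli", 200.0),
    ("Salmonella enterica", 190.0),
    ("Bacillus subtilis", 120.0),  # more than 10% below the best score
]
print(naive_lca(hits, PARENT))
```

In this toy example only the two Proteobacteria hits pass the 10% filter, so the read is placed on their shared parent rather than on either species. A read whose significant hits span both phyla would instead be pushed up to Bacteria.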
Phylopythia is a supervised classifier developed by researchers at IBM labs; it is essentially a support vector machine trained with DNA k-mers from known sequences.
SOrt-ITEMS
SOrt-ITEMS (Monzoorul et al., 2009) is an alignment-based binning algorithm developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd., India. Users need to perform a similarity search of the input metagenomic sequences (reads) against the nr protein database using a BLASTx search. The generated BLASTx output is then taken as input by the SOrt-ITEMS program. The method uses a range of BLAST alignment parameter thresholds to first identify an appropriate taxonomic level (or rank) at which the read can be assigned. An orthology-based approach is then adopted for the final assignment of the metagenomic read. Other alignment-based binning algorithms developed by the Innovation Labs of Tata Consultancy Services (TCS) include DiScRIBinATE, ProViDE and SPHINX. The methodologies of these algorithms are summarized below.
DiScRIBinATE
DiScRIBinATE (Ghosh et al., 2010) is an alignment-based binning algorithm developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd., India. DiScRIBinATE replaces the orthology approach of SOrt-ITEMS with a quicker 'alignment-free' approach. Incorporating this alternate strategy was observed to reduce the binning time by half without any significant loss in the accuracy and specificity of assignments. In addition, a novel reclassification strategy incorporated in DiScRIBinATE was seen to reduce the overall misclassification rate.
ProViDE
ProViDE (Ghosh et al., 2011) is an alignment-based binning approach developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd. for the estimation of viral diversity in metagenomic samples. ProViDE adopts a reverse-orthology-based approach similar to that of SOrt-ITEMS for the taxonomic classification of metagenomic sequences obtained from virome datasets. It uses a customized set of BLAST parameter thresholds specifically suited to viral metagenomic sequences. These thresholds capture the pattern of sequence divergence and the non-uniform taxonomic hierarchy observed within and across various taxonomic groups of the viral kingdom.
PCAHIER
PCAHIER (Zheng et al., 2010), a binning algorithm developed at the Georgia Institute of Technology, employs n-mer oligonucleotide frequencies as features and adopts a hierarchical classifier for binning short metagenomic fragments. Principal component analysis was used to reduce the high dimensionality of the feature space. The effectiveness of PCAHIER was demonstrated through comparisons against a non-hierarchical classifier and two existing binning algorithms (TETRA and Phylopythia).
SPHINX
SPHINX (Mohammed et al., 2011), another binning algorithm developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd., adopts a hybrid strategy that achieves high binning efficiency by utilizing the principles of both 'composition'- and 'alignment'-based binning algorithms. The approach was designed with the objective of analyzing metagenomic datasets as rapidly as composition-based approaches, but nevertheless with the accuracy and specificity of alignment-based algorithms. SPHINX was observed to classify metagenomic sequences as rapidly as composition-based algorithms. In addition, the binning efficiency (in terms of accuracy and specificity of assignments) of SPHINX was observed to be comparable with results obtained using alignment-based algorithms.
INDUS and TWARIT
INDUS and TWARIT are other composition-based binning algorithms developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd. These algorithms utilize a range of oligonucleotide compositional (as well as statistical) parameters to improve binning time while maintaining the accuracy and specificity of taxonomic assignments.
Other algorithms
This list is not exhaustive:
* TACOA (Diaz et al., 2009)
* Parallel-META (Su et al., 2011)
* PhyloPythiaS (Patil et al., 2011)
* RITA (MacDonald et al., 2012)
* BiMeta (Le et al., 2015)
* MetaPhlAn (Segata et al., 2012)
* SeMeta (Le et al., 2016)
* Quikr (Koslicki et al., 2013)
* Taxoner (Pongor et al., 2014)
* MaxBin (Wu et al., 2014)
* MetaBAT 2 (Kang et al., 2019)
* CONCOCT (Alneberg et al., 2014)
* Anvi’o (Eren et al., 2015)
* DAS Tool (Sieber et al., 2018), a wrapper that combines multiple binning algorithms
All these algorithms employ different schemes for binning sequences.
References
* Schloss, Patrick D.; Handelsman, Jo (2006-07-21). "Toward a Census of Bacteria in Soil". PLOS Comput Biol. 2 (7): e92. Bibcode: 2006PLSCB...2...92S. doi: 10.1371/journal.pcbi.0020092. PMC 1513271. PMID 16848637.