MUltiple Sequence Comparison by Log-Expectation (MUSCLE) is computer software for multiple sequence alignment of protein and nucleotide sequences. It is in the public domain. The method was published by Robert C. Edgar in two papers in 2004. The first paper, published in ''Nucleic Acids Research'', introduced the sequence alignment algorithm. The second paper, published in ''BMC Bioinformatics'', presented more technical details.

Algorithm

The MUSCLE algorithm proceeds in three stages: the ''draft progressive'', ''improved progressive'', and ''refinement'' stage.

Stage 1: Draft Progressive

In this first stage, the algorithm produces a multiple alignment, emphasizing speed over accuracy. This step begins by computing the k-mer distance for every pair of input sequences to create a distance matrix. UPGMA clusters the distance matrix to produce a binary tree. From this tree a progressive alignment is constructed, beginning with the creation of a profile for each leaf of the tree. For every internal node in the tree, a pairwise alignment of the two child profiles is constructed, creating a new profile that is assigned to that node. This continues until there is a multiple sequence alignment of all input sequences at the root of the tree.
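The first two steps of this stage can be sketched in a few lines. This is a simplified illustration, not MUSCLE's implementation: the k-mer distance here is one minus the fraction of shared k-mers on the plain alphabet (MUSCLE can also use a compressed amino acid alphabet), and the function names `kmer_distance`, `distance_matrix`, and `upgma` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Simplified k-mer distance: 1 - (shared k-mers / k-mers in the shorter sequence)."""
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        raise ValueError("sequences must be at least k residues long")
    # Counter & Counter keeps the minimum count of each k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / denom


def distance_matrix(seqs, k=3):
    """Map frozenset({i, j}) -> k-mer distance for every pair of sequences."""
    return {
        frozenset((i, j)): kmer_distance(seqs[i], seqs[j], k)
        for i, j in combinations(range(len(seqs)), 2)
    }


def upgma(names, dist):
    """Cluster with UPGMA and return the binary tree as nested tuples."""
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (subtree, size)
    d = dict(dist)
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        # New distances are size-weighted averages of the merged clusters' distances.
        for c in clusters:
            d[frozenset((next_id, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


names = ["A", "B", "C"]
seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
tree = upgma(names, distance_matrix(seqs))
```

In this toy input, sequences A and B share most of their 3-mers and are joined first; C shares none and joins at the root. The progressive alignment step would then walk this tree from the leaves upward.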

Stage 2: Improved Progressive

This stage aims to obtain a better tree. The ''Kimura distance'' is calculated for each pair of input sequences from the multiple sequence alignment obtained in Stage 1, producing a second distance matrix. UPGMA clusters this distance matrix to obtain a second binary tree. A progressive alignment is then performed as in Stage 1 to obtain a more accurate multiple sequence alignment. To save time, alignments are recomputed only in subtrees whose branching order has changed from the first binary tree.
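As a rough illustration of the distance used in this stage, the sketch below applies Kimura's correction for protein sequences, d = -ln(1 - D - D²/5), where D is the fraction of differing residues among aligned, ungapped positions of two rows of the alignment. The function names are hypothetical, and the handling of very divergent pairs (returning infinity when the logarithm is undefined) is an assumption made for this example; real implementations typically treat large D differently.

```python
import math


def fraction_different(row_a, row_b):
    """Fraction of differing residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("rows share no ungapped columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def kimura_distance(row_a, row_b):
    """Kimura-corrected distance between two rows of a multiple alignment."""
    d_obs = fraction_different(row_a, row_b)
    arg = 1.0 - d_obs - d_obs * d_obs / 5.0
    if arg <= 0.0:
        # Assumption for this sketch: treat saturated pairs as infinitely distant.
        return math.inf
    return -math.log(arg)
```

For example, the rows `ACDE-F` and `ACDQ-F` differ at one of five ungapped columns, so D = 0.2 and the corrected distance is -ln(0.792), slightly larger than D. The correction grows faster than D because it accounts for multiple substitutions at the same site.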

Stage 3: Refinement

In this final stage, an edge is chosen from the second tree, with edges being visited in order of decreasing distance from the root. The chosen edge is deleted, dividing the tree into two subtrees. The profile of the multiple alignment is then computed for each subtree. A new multiple sequence alignment is produced by re-aligning the two subtree profiles. If the sum-of-pairs (SP) score improves, the new alignment is kept; otherwise, it is discarded. The process of deleting an edge and re-aligning is repeated until convergence or until a user-defined limit is reached.
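The accept-or-discard loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the SP score uses a toy scheme (match +1, mismatch -1, residue against gap -2) rather than a substitution matrix with MUSCLE's gap penalties, and `realign` is a hypothetical caller-supplied function standing in for the profile-profile re-alignment of the two subtrees.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score: add up the score of every pair of rows in every column."""
    score = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # gap against gap contributes nothing
            if x == "-" or y == "-":
                score += gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score


def refine(alignment, bipartitions, realign, max_passes=10):
    """Try each tree bipartition in turn; keep a re-alignment only if SP improves.

    bipartitions: list of (left_rows, right_rows) index lists, one per tree edge.
    realign: hypothetical callable (alignment, left_rows, right_rows) -> new alignment.
    """
    best = list(alignment)
    best_score = sp_score(best)
    for _ in range(max_passes):  # user-defined limit
        improved = False
        for left, right in bipartitions:
            candidate = realign(best, left, right)
            candidate_score = sp_score(candidate)
            if candidate_score > best_score:
                best, best_score, improved = list(candidate), candidate_score, True
        if not improved:
            break  # converged: no edge produced a better alignment
    return best, best_score
```

The loop stops either when a full pass over the edges yields no improvement or when `max_passes` is reached, mirroring the two termination conditions in the text.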

Complexity and Comparison

For N sequences of typical length L, the time complexity of the first two stages of the algorithm is O(N²L + NL²) and the space complexity is O(N² + NL + L²). The ''refinement'' stage adds another term, O(N³L), to the time complexity. MUSCLE is often used as a replacement for Clustal, since it usually (but not always) gives better sequence alignments, depending on the chosen options. MUSCLE is also significantly faster than Clustal, more so for larger alignments.

Integration

MUSCLE is integrated into DNASTAR's Lasergene software, Geneious, and MacVector, and is available in Sequencher, MEGA, and UGENE as a plug-in. MUSCLE is also available as a web service via the European Molecular Biology Laboratory (EMBL)-European Bioinformatics Institute (EBI). As of September 2016, the two papers describing MUSCLE had been cited more than 19,000 times in total.

External links

* MUSCLE Web Server (EMBL-EBI)