MUltiple Sequence Comparison by Log-Expectation (MUSCLE) is computer software for multiple sequence alignment of protein and nucleotide sequences. It is licensed as public domain. The method was published by Robert C. Edgar in two papers in 2004. The first paper, published in ''Nucleic Acids Research'', introduced the sequence alignment algorithm. The second paper, published in ''BMC Bioinformatics'', presented more technical details.


Algorithm

The MUSCLE algorithm proceeds in three stages: the ''draft progressive'', ''improved progressive'', and ''refinement'' stage.


Stage 1: Draft Progressive

In this first stage, the algorithm produces a multiple alignment, emphasizing speed over accuracy. This step begins by computing the ''k''-mer distance for every pair of input sequences to create a distance matrix. UPGMA clusters the distance matrix to produce a binary tree. From this tree a progressive alignment is constructed, beginning with the creation of a profile for each leaf of the tree. For every internal node in the tree, a pairwise alignment of the two child profiles is constructed, creating a new profile that is assigned to that node. This continues until there is a multiple sequence alignment of all input sequences at the root of the tree.
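The first two steps of this stage can be sketched in a few lines of Python. This is a minimal illustration, not MUSCLE's implementation: the distance used here (one minus the fraction of shared ''k''-mers) is a simplified stand-in, and the function names `kmer_distance`, `upgma`, and `guide_tree` are hypothetical. Sequence names are assumed to be unique.

```python
from collections import Counter
from itertools import combinations


def kmer_distance(a, b, k=3):
    """Simplified k-mer distance: 1 minus the fraction of shared k-mers."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((ca & cb).values())  # multiset intersection of k-mer counts
    denom = min(len(a), len(b)) - k + 1
    return 1.0 - shared / denom if denom > 0 else 1.0


def upgma(names, dist):
    """Cluster with UPGMA and return the binary tree as nested tuples.

    dist maps frozenset({x, y}) to the distance between x and y.
    """
    sizes = {n: 1 for n in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Join the closest pair of clusters.
        x, y = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        nx, ny = sizes.pop(x), sizes.pop(y)
        new = (x, y)
        # Distance to the new cluster is the size-weighted average.
        for z in sizes:
            d[frozenset((new, z))] = (
                nx * d[frozenset((x, z))] + ny * d[frozenset((y, z))]
            ) / (nx + ny)
        sizes[new] = nx + ny
    return next(iter(sizes))


def guide_tree(seqs, k=3):
    """Build the distance matrix for a dict name -> sequence, then cluster."""
    dist = {
        frozenset((p, q)): kmer_distance(seqs[p], seqs[q], k)
        for p, q in combinations(seqs, 2)
    }
    return upgma(list(seqs), dist)


tree = guide_tree({"a": "ACDEFGHIK", "b": "ACDEFGHIR", "c": "WWYYPPLLM"})
```

In this toy input, sequences `a` and `b` share most of their 3-mers, so they are joined first and `c` is attached at the root. The progressive alignment step would then walk this tree from the leaves upward, aligning the two child profiles at each node.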


Stage 2: Improved Progressive

This stage aims to obtain a better tree. It calculates the ''Kimura distance'' for each pair of input sequences using the multiple sequence alignment obtained in Stage 1, which yields a second distance matrix. UPGMA clusters this distance matrix to obtain a second binary tree. A progressive alignment is then performed as in Stage 1 to obtain a new multiple sequence alignment. To save time, alignments are recomputed only in subtrees whose branching order has changed relative to the first binary tree. Because the second tree is built from aligned rather than unaligned sequences, the resulting alignment is more accurate.
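As a rough illustration, the distance between two rows of an existing alignment can be computed as below. The sketch assumes the commonly used Kimura correction for protein sequences, d = -ln(1 - D - D²/5), where D is the fraction of aligned positions that differ. The function name `kimura_distance` and the handling of very divergent pairs (returning infinity) are assumptions made for this example, not details taken from MUSCLE.

```python
import math


def kimura_distance(row_a, row_b, gap="-"):
    """Kimura-corrected distance between two rows of one alignment."""
    # Only columns where both rows hold a residue are compared.
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("rows share no aligned residue pairs")
    d_frac = sum(x != y for x, y in pairs) / len(pairs)
    arg = 1.0 - d_frac - d_frac * d_frac / 5.0
    if arg <= 0.0:
        # The formula is undefined for very divergent pairs (assumed handling).
        return float("inf")
    return -math.log(arg)


d = kimura_distance("ACDEF", "ACDEY")  # one difference in five columns, D = 0.2
```

Applying this function to every pair of rows gives the second distance matrix, which can be clustered in the same way as the first.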


Stage 3: Refinement

In this final stage, an edge is chosen from the second tree, with edges being visited in order of decreasing distance from the root. The chosen edge is deleted, dividing the tree into two subtrees. The profile of the multiple alignment is then computed for each subtree. A new multiple sequence alignment is produced by re-aligning the two subtree profiles. If the sum-of-pairs (SP) score improves, the new alignment is kept; otherwise, it is discarded. The process of deleting an edge and re-aligning is repeated until convergence or until a user-defined limit is reached.
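The accept-or-discard rule can be sketched as follows. The scoring values (match, mismatch, and a flat per-column gap penalty) are placeholders chosen for illustration; a real aligner would use a substitution matrix and a more detailed gap model. The function names `sp_score` and `keep_if_better` are hypothetical.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score: sum the column scores over every pair of rows."""
    total = 0
    for a, b in combinations(alignment, 2):
        for x, y in zip(a, b):
            if x == "-" and y == "-":
                continue  # a column that is a gap in both rows scores nothing
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


def keep_if_better(current, candidate):
    """Keep the re-aligned candidate only if its SP score is strictly higher."""
    return candidate if sp_score(candidate) > sp_score(current) else current


current = ["A-CG", "AC-G"]
candidate = ["ACG", "ACG"]
best = keep_if_better(current, candidate)
```

Here the candidate removes two unnecessary gap columns, so its SP score is higher and it replaces the current alignment.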


Complexity and Comparison

In the first two stages of the algorithm, the time complexity is O(N²L + NL²) and the space complexity is O(N² + NL + L²), where N is the number of sequences and L is their typical length. The ''refinement'' stage adds a further term, O(N³L), to the time complexity. MUSCLE is often used as a replacement for Clustal, since it usually (but not always) gives better sequence alignments, depending on the chosen options. MUSCLE is also significantly faster than Clustal, more so for larger alignments.


Integration

MUSCLE is integrated into DNASTAR's Lasergene software, Geneious, and MacVector, and is available as a plug-in in Sequencher, MEGA, and UGENE. MUSCLE is also available as a web service via the European Molecular Biology Laboratory (EMBL)-European Bioinformatics Institute (EBI). As of September 2016, the two papers describing MUSCLE had been cited more than 19,000 times in total.


See also

* Sequence alignment software
* DNASTAR
* Clustal
* ProbCons
* AMAP
* T-COFFEE
* MAFFT


External links

* MUSCLE Web Server (EMBL-EBI)