In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are somehow related. The sequences can be of genomic, transcriptomic (expressed sequence tags, ESTs), or protein origin.
For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important to group sequences originating from the same gene before the ESTs are assembled to reconstruct the original mRNA.
Some clustering algorithms use single-linkage clustering, constructing a transitive closure of sequences with a similarity over a particular threshold. UCLUST and CD-HIT use a greedy algorithm that identifies a representative sequence for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence is not matched, it becomes the representative sequence for a new cluster. The similarity score is often based on sequence alignment. Sequence clustering is often used to make a non-redundant set of representative sequences.
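The greedy, representative-based scheme described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used by UCLUST or CD-HIT: the names `identity` and `greedy_cluster` are hypothetical, sequences are processed in input order, and the similarity score is a toy position-by-position identity rather than an alignment-based score.

```python
def identity(a, b):
    """Toy similarity (assumption): fraction of matching positions.

    Real tools typically derive the score from a sequence alignment;
    this stand-in compares characters position by position.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold=0.9):
    """Greedy clustering around representative sequences.

    Each sequence joins the first cluster whose representative it
    matches at or above the threshold; otherwise it founds a new
    cluster and becomes that cluster's representative.
    """
    clusters = []  # list of (representative, members) pairs
    for seq in seqs:
        for rep, members in clusters:
            if identity(seq, rep) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


if __name__ == "__main__":
    example = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC",
               "ACGTACGTGC", "TTTTGGGGCA"]
    for rep, members in greedy_cluster(example, threshold=0.85):
        print(rep, members)
```

With the example input and a threshold of 0.85, the sketch yields two clusters, represented by `ACGTACGTAC` and `TTTTGGGGCC`. The list of representatives is itself a non-redundant set at that threshold. Because a sequence is compared only against representatives, the result depends on the order in which sequences are processed.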
Sequence clusters often correspond to protein families, although the two are not identical. Determining a representative tertiary structure for each sequence cluster is the aim of many structural genomics initiatives.
Sequence clustering algorithms and packages
* CD-HIT
* UCLUST in USEARCH
* Starcode: a fast sequence clustering algorithm based on exact all-pairs search
* OrthoFinder: a fast, scalable and accurate method for clustering proteins into gene families (orthogroups)
* Linclust: first algorithm whose runtime scales linearly with input set size, very fast, part of the MMseqs2 software suite for fast, sensitive sequence searching and clustering of large sequence sets
* TribeMCL: a method for clustering proteins into related groups
* BAG: a graph theoretic sequence clustering algorithm
* JESAM: Open source parallel scalable DNA alignment engine with optional clustering software component
* UICluster: Parallel Clustering of EST (Gene) Sequences
* BLASTClust: single-linkage clustering with BLAST
* Clusterer: extendable Java application for sequence grouping and cluster analyses
* PATDB: a program for rapidly identifying perfect substrings
* nrdb: a program for merging trivially redundant (identical) sequences
* CluSTr: a single-linkage protein sequence clustering database built from Smith-Waterman sequence similarities; covers over 7 million sequences, including UniProt and IPI
* ICAtools: original (ancient) DNA clustering package with many algorithms useful for artifact discovery or EST clustering
* Skipredundant: EMBOSS tool to remove redundant sequences from a set
* CLUSS: algorithm to identify groups of structurally, functionally, or evolutionarily related hard-to-align protein sequences. CLUSS webserver
* CLUSS2: algorithm for clustering families of hard-to-align protein sequences with multiple biological functions. CLUSS2 webserver
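Several of the packages listed above, such as BLASTClust and CluSTr, are described as single-linkage methods. The idea can be sketched as follows: link every pair of sequences whose similarity meets a threshold, then take the connected components of the resulting graph (the transitive closure of the "similar" relation). The sketch is illustrative only. The function names are hypothetical, the similarity score is a toy position-wise identity rather than a BLAST or Smith-Waterman score, and the all-pairs loop would not scale to large data sets.

```python
from itertools import combinations


def identity(a, b):
    """Toy similarity (assumption): fraction of matching positions."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def single_linkage_clusters(seqs, threshold=0.8):
    """Group sequences into connected components of the similarity graph."""
    parent = list(range(len(seqs)))  # union-find forest over indices

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the components of every pair that meets the threshold.
    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    # Collect members by root; clusters appear in order of first member.
    groups = {}
    for i, seq in enumerate(seqs):
        groups.setdefault(find(i), []).append(seq)
    return list(groups.values())


if __name__ == "__main__":
    example = ["AAAAAAAAAA", "AAAAAAAATT", "AAAAAATTTT", "CCCCCCCCCC"]
    print(single_linkage_clusters(example, threshold=0.7))
```

In the example, the first and third sequences share only 60% of their positions, yet they end up in the same cluster because each is 80% identical to the second. This chaining effect is characteristic of single-linkage clustering and distinguishes it from approaches that compare each sequence only against a cluster representative.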
Non-redundant sequence databases
* PISCES: A Protein Sequence Culling Server
* RDB90
* UniRef: A non-redundant UniProt sequence database
* Uniclust: UniProtKB sequences clustered at the levels of 90%, 50% and 30% pairwise sequence identity
* Virus Orthologous Clusters: A viral protein sequence clustering database; contains all predicted genes from eleven virus families organized into ortholog groups by BLASTP similarity
See also
* Cluster analysis
* Social sequence analysis