In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are somehow related. The sequences can be of genomic, "transcriptomic" (ESTs), or protein origin.
For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important to group sequences originating from the same gene before the ESTs are assembled to reconstruct the original mRNA.
Some clustering algorithms use single-linkage clustering, constructing a transitive closure of sequences with a similarity over a particular threshold. UCLUST and CD-HIT use a greedy algorithm that identifies a representative sequence for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence is not matched, it becomes the representative sequence for a new cluster. The similarity score is often based on sequence alignment. Sequence clustering is often used to make a non-redundant set of representative sequences.
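The two strategies described above can be contrasted in a minimal sketch. It is illustrative only and is not the implementation used by UCLUST, CD-HIT, or any other package: the function names are hypothetical, and `kmer_similarity` (Jaccard similarity of k-mer sets) is an assumed stand-in for the alignment-based score that real tools typically use.

```python
from itertools import combinations


def kmer_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of k-mer sets (assumed stand-in for an alignment score)."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def single_linkage(seqs, threshold, sim=kmer_similarity):
    """Transitive closure of 'similarity >= threshold', computed with union-find."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # Link every pair that passes the threshold (all-pairs, O(n^2) comparisons).
    for i, j in combinations(range(len(seqs)), 2):
        if sim(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, s in enumerate(seqs):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())


def greedy_clusters(seqs, threshold, sim=kmer_similarity):
    """Greedy clustering: compare each sequence only to cluster representatives."""
    clusters = []  # list of (representative, members) in order of creation
    for s in seqs:  # input order is used here; the result depends on it
        for rep, members in clusters:
            if sim(s, rep) >= threshold:
                members.append(s)
                break
        else:
            # No representative matched: s founds a new cluster.
            clusters.append((s, [s]))
    return clusters
```

With `seqs = ["AAAACCCC", "CCCCGGGG", "GGGGTTTT"]` and a threshold of 0.1, the first and third sequences share no 3-mers, but each shares one with the second. `single_linkage` therefore chains all three into a single cluster, while `greedy_clusters` returns two clusters, because the third sequence is compared only to the representative `"AAAACCCC"`.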
Sequence clusters are often treated as synonymous with protein families, although the two are not identical. Determining a representative tertiary structure for each sequence cluster is the aim of many structural genomics initiatives.
Sequence clustering algorithms and packages
* CD-HIT
* UCLUST in USEARCH
* Starcode: a fast sequence clustering algorithm based on exact all-pairs search
* OrthoFinder: a fast, scalable and accurate method for clustering proteins into gene families (orthogroups)
* Linclust: first algorithm whose runtime scales linearly with input set size, very fast, part of the MMseqs2 software suite for fast, sensitive sequence searching and clustering of large sequence sets
* TribeMCL: a method for clustering proteins into related groups
* BAG: a graph theoretic sequence clustering algorithm
* JESAM: Open source parallel scalable DNA alignment engine with optional clustering software component
* UICluster: Parallel Clustering of EST (Gene) Sequences
* BLASTClust: single-linkage clustering with BLAST
* Clusterer: extendable Java application for sequence grouping and cluster analyses
* PATDB: a program for rapidly identifying perfect substrings
* nrdb: a program for merging trivially redundant (identical) sequences
* CluSTr: a single-linkage protein sequence clustering database from Smith-Waterman sequence similarities; covers over 7 million sequences including UniProt and IPI
* ICAtools: original (ancient) DNA clustering package with many algorithms useful for artifact discovery or EST clustering
* Skipredundant: EMBOSS tool to remove redundant sequences from a set
* CLUSS: algorithm to identify groups of structurally, functionally, or evolutionarily related hard-to-align protein sequences (CLUSS webserver)
* CLUSS2: algorithm for clustering families of hard-to-align protein sequences with multiple biological functions (CLUSS2 webserver)
Non-redundant sequence databases
* PISCES: A Protein Sequence Culling Server
* RDB90
* UniRef: a non-redundant UniProt sequence database
* Uniclust: UniProtKB sequences clustered at the levels of 90%, 50% and 30% pairwise sequence identity
* Virus Orthologous Clusters: A viral protein sequence clustering database; contains all predicted genes from eleven virus families organized into ortholog groups by BLASTP similarity
See also
*Cluster analysis
*Social sequence analysis