In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are related in some way. The sequences can be of genomic, "transcriptomic" (ESTs) or protein origin. For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important to group sequences originating from the same gene before the ESTs are assembled to reconstruct the original mRNA.

Some clustering algorithms use single-linkage clustering, constructing a transitive closure of sequences with a similarity over a particular threshold. UCLUST and CD-HIT use a greedy algorithm that identifies a representative sequence for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence is not matched, it becomes the representative sequence for a new cluster. The similarity score is often based on sequence alignment. Sequence clustering is often used to make a non-redundant set of representative sequences.

Sequence clusters are often treated as synonymous with protein families, although the two are not identical. Determining a representative tertiary structure for each sequence cluster is the aim of many structural genomics initiatives.
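The greedy, representative-based approach described for UCLUST and CD-HIT can be illustrated with a minimal Python sketch. This is not the implementation used by either tool: the function names `identity` and `greedy_cluster` are hypothetical, the position-by-position identity score is a simple stand-in for an alignment-based similarity, and processing the longest sequences first is an assumption made for this example.

```python
def identity(a: str, b: str) -> float:
    """Stand-in similarity: fraction of identical positions, without alignment.

    Real tools typically derive the score from a sequence alignment.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(sequences, threshold=0.8):
    """Assign each sequence to the first cluster whose representative is
    at least `threshold` similar; otherwise start a new cluster."""
    clusters = []  # list of (representative, members) pairs
    # Assumption for this sketch: longer sequences are considered first,
    # so they tend to become the representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if identity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            # No representative matched: seq founds a new cluster.
            clusters.append((seq, [seq]))
    return clusters


if __name__ == "__main__":
    example = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "ACGTACGTA"]
    for representative, members in greedy_cluster(example):
        print(representative, members)
```

Because each sequence is compared only with cluster representatives and never with other members, the result depends on the input order and on which sequences happen to become representatives.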

Sequence clustering algorithms and packages

* CD-HIT
* UCLUST in USEARCH
* Starcode: a fast sequence clustering algorithm based on exact all-pairs search
* OrthoFinder: a fast, scalable and accurate method for clustering proteins into gene families (orthogroups)
* Linclust: first algorithm whose runtime scales linearly with input set size; very fast; part of the MMseqs2 software suite for fast, sensitive sequence searching and clustering of large sequence sets
* TribeMCL: a method for clustering proteins into related groups
* BAG: a graph-theoretic sequence clustering algorithm
* JESAM: open-source parallel scalable DNA alignment engine with optional clustering software component
* UICluster: parallel clustering of EST (gene) sequences
* BLASTClust: single-linkage clustering with BLAST
* Clusterer: extendable Java application for sequence grouping and cluster analyses
* PATDB: a program for rapidly identifying perfect substrings
* nrdb: a program for merging trivially redundant (identical) sequences
* CluSTr: a single-linkage protein sequence clustering database built from Smith-Waterman sequence similarities; covers over 7 million sequences, including UniProt and IPI
* ICAtools: original (ancient) DNA clustering package with many algorithms useful for artifact discovery or EST clustering
* Skipredundant: EMBOSS tool to remove redundant sequences from a set
* CLUSS: algorithm to identify groups of structurally, functionally, or evolutionarily related hard-to-align protein sequences (CLUSS webserver)
* CLUSS2: algorithm for clustering families of hard-to-align protein sequences with multiple biological functions (CLUSS2 webserver)
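Single-linkage clustering, which the list above mentions for BLASTClust and CluSTr, amounts to taking the transitive closure of all pairwise links whose similarity exceeds a threshold. The following Python sketch shows the idea with a union-find structure. It is an illustration under stated assumptions, not the code of any listed tool: `identity` and `single_linkage_clusters` are hypothetical names, and the position-by-position identity score stands in for BLAST or Smith-Waterman similarities.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Stand-in similarity: fraction of identical positions, without alignment."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def single_linkage_clusters(sequences, threshold=0.75):
    """Group sequences connected by any chain of pairwise links whose
    similarity is at least `threshold` (transitive closure)."""
    parent = list(range(len(sequences)))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison: quadratic in the number of sequences.
    for i, j in combinations(range(len(sequences)), 2):
        if identity(sequences[i], sequences[j]) >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i  # merge the two clusters

    groups = {}
    for i, seq in enumerate(sequences):
        groups.setdefault(find(i), []).append(seq)
    return list(groups.values())


if __name__ == "__main__":
    example = ["AAAAAAAAAA", "AAAAAAAACC", "AAAAAACCCC", "GGGGGGGGGG"]
    print(single_linkage_clusters(example))
```

In the example, the first and third sequences are only 60% identical to each other, yet they end up in the same cluster because each is 80% identical to the second. This chaining effect is characteristic of single-linkage clustering and distinguishes it from representative-based greedy clustering.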

Non-redundant sequence databases

* PISCES: a protein sequence culling server
* RDB90
* UniRef: a non-redundant UniProt sequence database
* Uniclust: UniProtKB sequences clustered at 90%, 50% and 30% pairwise sequence identity
* Virus Orthologous Clusters: a viral protein sequence clustering database; contains all predicted genes from eleven virus families, organized into ortholog groups by BLASTP similarity
