bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combin ...

, sequence clustering

algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...

s attempt to group

biological sequence Biomolecular structure is the intricate folded, three-dimensional shape that is formed by a molecule of protein, DNA, or RNA, and that is important to its function. The structure of these molecules may be considered at any of several length sc ...

s that are somehow related. The sequences can be either of

genomic Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dim ...

, " transcriptomic" ( ESTs) or

protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, respon ...

origin. For proteins,

homologous sequence Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a spe ...

s are typically grouped into

families Family (from la, familia) is a group of people related either by consanguinity (by recognized birth) or affinity (by marriage or other relationship). The purpose of the family is to maintain the well-being of its members and of society. Idea ...

. For EST data, clustering is important to group sequences originating from the same

gene In biology, the word gene (from , ; "... Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a b ...

before the ESTs are assembled to reconstruct the original

mRNA In molecular biology, messenger ribonucleic acid (mRNA) is a single-stranded molecule of RNA that corresponds to the genetic sequence of a gene, and is read by a ribosome in the process of synthesizing a protein. mRNA is created during the ...

. Some clustering algorithms use

single-linkage clustering In statistics, single-linkage clustering is one of several methods of hierarchical clustering. It is based on grouping clusters in bottom-up fashion (agglomerative clustering), at each step combining two clusters that contain the closest pair of e ...

, constructing a

transitive closure In mathematics, the transitive closure of a binary relation on a set is the smallest relation on that contains and is transitive. For finite sets, "smallest" can be taken in its usual sense, of having the fewest related pairs; for infinit ...

of sequences with a

similarity Similarity may refer to: In mathematics and computing * Similarity (geometry), the property of sharing the same shape * Matrix similarity, a relation between matrices * Similarity measure, a function that quantifies the similarity of two objects * ...

over a particular threshold. UCLUST and CD-HIT use a

greedy algorithm A greedy algorithm is any algorithm that follows the problem-solving heuristic of making the locally optimal choice at each stage. In many problems, a greedy strategy does not produce an optimal solution, but a greedy heuristic can yield locall ...

that identifies a representative sequence for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence is not matched then it becomes the representative sequence for a new cluster. The similarity score is often based on

sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Ali ...

. Sequence clustering is often used to make a non-redundant set of

representative sequences In social sciences and other domains, representative sequences are whole sequences that best characterize or summarize a set of sequences. In bioinformatics, representative sequences also designate substrings of a sequence that characterize the se ...

. Sequence clusters are often synonymous with (but not identical to)

protein families A protein family is a group of evolutionarily related proteins. In many cases, a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term "protein family" should not be ...

. Determining a representative

tertiary structure Protein tertiary structure is the three dimensional shape of a protein. The tertiary structure will have a single polypeptide chain "backbone" with one or more protein secondary structures, the protein domains. Amino acid side chains may int ...

for each sequence cluster is the aim of many

structural genomics Structural genomics seeks to describe the 3-dimensional structure of every protein encoded by a given genome. This genome-based approach allows for a high-throughput method of structure determination by a combination of experimental and modeling ...

initiatives.

Sequence clustering algorithms and packages

* CD-HIT *

UCLUST UCLUST is an algorithm designed to cluster nucleotide or amino-acid sequences into clusters based on sequence similarity. The algorithm was published in 2010 and implemented in a program also named UCLUST. The algorithm is described by the author ...

in USEARCH * Starcode: a fast sequence clustering algorithm based on exact all-pairs search. * OrthoFinder: a fast, scalable and accurate method for clustering proteins into gene families (orthogroups) * Linclust: first algorithm whose runtime scales linearly with input set size, very fast, part o
MMseqs2
ref name="pmid29035372"> software suite for fast, sensitive sequence searching and clustering of large sequence sets * TribeMCL: a method for clustering proteins into related groups * BAG: a graph theoretic sequence clustering algorithm * JESAM: Open source parallel scalable DNA alignment engine with optional clustering software component * UICluster: Parallel Clustering of EST (Gene) Sequences * BLASTClust single-linkage clustering with BLAST * Clusterer: extendable java application for sequence grouping and cluster analyses * PATDB: a program for rapidly identifying perfect substrings * nrdb: a program for merging trivially redundant (identical) sequences * CluSTr: A single-linkage protein sequence clustering database from Smith-Waterman sequence similarities; covers over 7 mln sequences including UniProt and IPI * ICAtools - original (ancient) DNA clustering package with many algorithms useful for artifact discovery or EST clustering * Skipredudant EMBOSS tool to remove redundant sequences from a set * CLUSS Algorithm to identify groups of structurally, functionally, or evolutionarily related hard-to-align protein sequences. CLUSS webserver * CLUSS2 Algorithm for clustering families of hard-to-align protein sequences with multiple biological functions. CLUSS2 webserver

Non-redundant sequence databases

* PISCES: A Protein Sequence Culling Server * RDB90 * UniRef: A non-redundant

UniProt UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived fro ...

sequence database * Uniclust: A clustered UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. * Virus Orthologous Clusters: A viral protein sequence clustering database; contains all predicted genes from eleven virus families organized into ortholog groups by BLASTP similarity

References

{{reflist Bioinformatics

Sequence clustering algorithms and packages

Non-redundant sequence databases

See also

References