In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are related in some way. The sequences can be of genomic, "transcriptomic" (ESTs) or protein origin. For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important to group sequences originating from the same gene before the ESTs are assembled to reconstruct the original mRNA.

Some clustering algorithms use single-linkage clustering, constructing a transitive closure of sequences with a similarity over a particular threshold. UCLUST and CD-HIT use a greedy algorithm that identifies a representative sequence for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence is not matched then it becomes the representative sequence for a new cluster. The similarity score is often based on sequence alignment.

Sequence clustering is often used to make a non-redundant set of representative sequences. Sequence clusters often correspond closely to protein families, but the two are not identical. Determining a representative tertiary structure for each sequence cluster is the aim of many structural genomics initiatives.
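
The difference between the single-linkage and greedy representative strategies described above can be illustrated with a minimal Python sketch. It is not the implementation used by CD-HIT, UCLUST or any other package: the function names `identity`, `single_linkage` and `greedy` are hypothetical, the similarity measure is a toy ungapped identity rather than an alignment score, and real tools add input ordering and fast pre-filters that are omitted here.

```python
from typing import Callable, Dict, List


def identity(a: str, b: str) -> float:
    """Toy similarity (an assumption): fraction of matching positions, no gaps."""
    n = max(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / n


def single_linkage(seqs: List[str], threshold: float,
                   sim: Callable[[str, str], float] = identity) -> List[List[int]]:
    """Transitive closure of the relation sim >= threshold, via union-find."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare all pairs; merge the clusters of any pair above the threshold.
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if sim(seqs[i], seqs[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    clusters: Dict[int, List[int]] = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def greedy(seqs: List[str], threshold: float,
           sim: Callable[[str, str], float] = identity) -> List[List[int]]:
    """Assign each sequence to the first sufficiently similar representative;
    an unmatched sequence becomes the representative of a new cluster."""
    reps: List[int] = []            # index of each cluster's representative
    clusters: List[List[int]] = []
    for i, s in enumerate(seqs):
        for c, r in enumerate(reps):
            if sim(s, seqs[r]) >= threshold:
                clusters[c].append(i)
                break
        else:
            reps.append(i)
            clusters.append([i])
    return clusters


# Sequence 1 is 75% identical to both 0 and 2, but 0 and 2 share only 50%.
seqs = ["AAAAAAAA", "AAAAAATT", "AAAATTTT"]
print(single_linkage(seqs, 0.75))  # [[0, 1, 2]]
print(greedy(seqs, 0.75))          # [[0, 1], [2]]
```

In this toy input, single-linkage chains all three sequences into one cluster through the intermediate sequence, whereas the greedy approach compares only against representatives and therefore places the third sequence in its own cluster.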

Sequence clustering algorithms and packages

* CD-HIT
* UCLUST in USEARCH
* Starcode: a fast sequence clustering algorithm based on exact all-pairs search
* OrthoFinder: a fast, scalable and accurate method for clustering proteins into gene families (orthogroups)
* Linclust: first algorithm whose runtime scales linearly with input set size, very fast, part of the MMseqs2 software suite for fast, sensitive sequence searching and clustering of large sequence sets
* TribeMCL: a method for clustering proteins into related groups
* BAG: a graph theoretic sequence clustering algorithm
* JESAM: open source parallel scalable DNA alignment engine with optional clustering software component
* UICluster: parallel clustering of EST (gene) sequences
* BLASTClust: single-linkage clustering with BLAST
* Clusterer: extendable Java application for sequence grouping and cluster analyses
* PATDB: a program for rapidly identifying perfect substrings
* nrdb: a program for merging trivially redundant (identical) sequences
* CluSTr: a single-linkage protein sequence clustering database from Smith-Waterman sequence similarities; covers over 7 million sequences including UniProt and IPI
* ICAtools: original (ancient) DNA clustering package with many algorithms useful for artifact discovery or EST clustering
* Skipredundant: EMBOSS tool to remove redundant sequences from a set
* CLUSS: algorithm to identify groups of structurally, functionally, or evolutionarily related hard-to-align protein sequences (CLUSS webserver)
* CLUSS2: algorithm for clustering families of hard-to-align protein sequences with multiple biological functions (CLUSS2 webserver)

Non-redundant sequence databases

* PISCES: a protein sequence culling server
* RDB90
* UniRef: a non-redundant UniProt sequence database
* Uniclust: UniProtKB sequences clustered at the level of 90%, 50% and 30% pairwise sequence identity
* Virus Orthologous Clusters: a viral protein sequence clustering database; contains all predicted genes from eleven virus families organized into ortholog groups by BLASTP similarity
