Sequence Clustering

In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are related in some way. The sequences can be of genomic, "transcriptomic" (ESTs) or protein origin. For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important to group sequences originating from the same gene before the ESTs are assembled to reconstruct the original mRNA.

Some clustering algorithms use single-linkage clustering, constructing a transitive closure of sequences with a similarity over a particular threshold. UCLUST and CD-HIT use a greedy algorithm that identifies a representative sequence for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence is not matched, it becomes the representative sequence for a new cluster. The similarity score is often based on sequence alignment. Sequence clustering is often used to make a non-redundant set of representative sequences.

Sequence clusters often correspond closely to protein families, although the two are not identical. Determining a representative tertiary structure for each sequence cluster is the aim of many structural genomics initiatives.
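The difference between the single-linkage and greedy representative-based strategies described above can be illustrated with a minimal Python sketch. It is not the implementation used by UCLUST, CD-HIT or any other package. The function names, the input order and the `identity` score (a position-by-position match fraction with no alignment) are assumptions made for illustration; real tools typically score similarity with sequence alignment or fast approximations of it.

```python
from typing import Callable, Dict, List


def identity(a: str, b: str) -> float:
    """Toy similarity (assumption): fraction of matching positions, no alignment."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def single_linkage(seqs: List[str], threshold: float,
                   sim: Callable[[str, str], float] = identity) -> List[List[str]]:
    """Transitive closure of the relation 'similarity >= threshold' (union-find)."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison: any pair above the threshold merges two clusters.
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if sim(seqs[i], seqs[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    groups: Dict[int, List[str]] = {}
    for i, s in enumerate(seqs):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())


def greedy(seqs: List[str], threshold: float,
           sim: Callable[[str, str], float] = identity) -> Dict[str, List[str]]:
    """Greedy clustering: compare each sequence only to cluster representatives."""
    clusters: Dict[str, List[str]] = {}
    for s in seqs:
        for rep in clusters:
            if sim(s, rep) >= threshold:
                clusters[rep].append(s)  # joins the first matching representative
                break
        else:
            clusters[s] = [s]  # no match: s becomes a new representative
    return clusters


seqs = ["AAAAAAAAAA", "AAAAAAAACC", "AAAAAACCCC", "GGGGGGGGGG"]
linked = single_linkage(seqs, 0.8)
reps = greedy(seqs, 0.8)
print(linked)
print(reps)
```

With these toy sequences, single-linkage chains the first three sequences into one cluster because each neighbouring pair reaches the threshold, even though the first and third are only 60% identical. The greedy method compares the third sequence only with the first representative, so it starts a new cluster. The keys of the dictionary returned by `greedy` form a non-redundant set of representatives.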


Sequence clustering algorithms and packages

* CD-HIT
* UCLUST in USEARCH
* Starcode: a fast sequence clustering algorithm based on exact all-pairs search
* OrthoFinder: a fast, scalable and accurate method for clustering proteins into gene families (orthogroups)
* Linclust: first algorithm whose runtime scales linearly with input set size; very fast; part of the MMseqs2 (PMID 29035372) software suite for fast, sensitive sequence searching and clustering of large sequence sets
* TribeMCL: a method for clustering proteins into related groups
* BAG: a graph-theoretic sequence clustering algorithm
* JESAM: open-source parallel scalable DNA alignment engine with optional clustering software component
* UICluster: parallel clustering of EST (gene) sequences
* BLASTClust: single-linkage clustering with BLAST
* Clusterer: extendable Java application for sequence grouping and cluster analyses
* PATDB: a program for rapidly identifying perfect substrings
* nrdb: a program for merging trivially redundant (identical) sequences
* CluSTr: a single-linkage protein sequence clustering database built from Smith-Waterman sequence similarities; covers over 7 million sequences, including UniProt and IPI
* ICAtools: original (ancient) DNA clustering package with many algorithms useful for artifact discovery or EST clustering
* Skipredundant: EMBOSS tool to remove redundant sequences from a set
* CLUSS: algorithm to identify groups of structurally, functionally, or evolutionarily related hard-to-align protein sequences (CLUSS webserver)
* CLUSS2: algorithm for clustering families of hard-to-align protein sequences with multiple biological functions (CLUSS2 webserver)


Non-redundant sequence databases

* PISCES: a protein sequence culling server
* RDB90
* UniRef: a non-redundant UniProt sequence database
* Uniclust: UniProtKB sequences clustered at 90%, 50% and 30% pairwise sequence identity
* Virus Orthologous Clusters: a viral protein sequence clustering database; contains all predicted genes from eleven virus families, organized into ortholog groups by BLASTP similarity


See also

* Cluster analysis
* Social sequence analysis

