In social sciences and other domains, representative sequences are whole sequences that best characterize or summarize a set of sequences. In bioinformatics, representative sequences also designate substrings of a sequence that characterize the sequence.

Social sciences

In sequence analysis in social sciences, representative sequences are used to summarize sets of sequences describing, for example, the family life courses or professional careers of several thousand individuals. The identification of representative sequences proceeds from the pairwise dissimilarities between sequences. One typical solution is the medoid sequence, i.e., the observed sequence that minimizes the sum of its distances to all other sequences in the set. Another solution is the densest observed sequence, i.e., the sequence with the greatest number of other sequences in its neighborhood. When the diversity of the sequences is large, a single representative is often insufficient to characterize the set efficiently. In such cases, one searches for the smallest possible set of representative sequences that covers a given percentage of all sequences, where a sequence is covered if it lies in the neighborhood of at least one representative. Another solution is to select the medoids of relative frequency groups. More specifically, the method consists of sorting the sequences (for example, according to the first principal coordinate of the pairwise dissimilarity matrix), splitting the sorted list into equal-sized groups (called relative frequency groups), and selecting the medoid of each group. The methods for identifying representative sequences described above have been implemented in the R package TraMineR.
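
The medoid, the densest sequence, and a covering set of representatives can all be derived from a pairwise dissimilarity matrix. The following Python sketch illustrates the three definitions on a toy example. It is not the TraMineR implementation: the function names, the Hamming dissimilarity, the fixed neighborhood radius, and the greedy covering heuristic are all assumptions made for illustration, and the greedy search is not guaranteed to return the smallest possible set.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length state sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def medoid(dist: Matrix) -> int:
    """Index of the sequence with the smallest sum of distances to all others."""
    return min(range(len(dist)), key=lambda i: sum(dist[i]))


def densest(dist: Matrix, radius: float) -> int:
    """Index of the sequence with the most other sequences within `radius`."""
    n = len(dist)
    sizes = [sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
             for i in range(n)]
    return max(range(n), key=lambda i: sizes[i])


def covering_representatives(dist: Matrix, radius: float,
                             coverage: float) -> List[int]:
    """Greedy heuristic (an assumption, not TraMineR's procedure): repeatedly
    pick the sequence whose neighborhood contains the most uncovered
    sequences until the requested fraction of sequences is covered."""
    n = len(dist)
    uncovered = set(range(n))
    reps: List[int] = []
    while uncovered and (n - len(uncovered)) < coverage * n:
        best = max(range(n),
                   key=lambda i: sum(1 for j in uncovered
                                     if dist[i][j] <= radius))
        reps.append(best)
        uncovered -= {j for j in uncovered if dist[best][j] <= radius}
    return reps


# Hypothetical yearly states: S = single, M = married, C = with children.
sequences = ["SSMMC", "SSMCC", "SMMCC", "SSSSS", "SSSSM"]
dist = [[hamming(a, b) for b in sequences] for a in sequences]

print(medoid(dist))                              # 1, i.e. "SSMCC"
print(densest(dist, radius=1))                   # 1
print(covering_representatives(dist, 1, 1.0))    # [1, 3]
```

In this toy set a single representative covers only three of the five sequences at radius 1, so full coverage requires a second representative drawn from the "mostly single" trajectories.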

Bioinformatics

Representative sequences are short regions within protein sequences that can be used to approximate the evolutionary relationships of those proteins, or of the organisms from which they come. Representative sequences are contiguous subsequences (typically 300 residues) from ubiquitous, conserved proteins, such that each orthologous family of representative sequences taken alone gives a distance matrix in close agreement with the consensus matrix.
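
One way to make "close agreement with the consensus matrix" concrete is sketched below in Python. The text does not specify how distances, the consensus, or agreement are computed, so every choice here is an assumption: distances are simple p-distances between pre-aligned segments, the consensus is taken as the element-wise mean over families, and agreement is measured by the Pearson correlation of the upper triangles of the matrices. All names and sequences are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import List


def p_distance(a: str, b: str) -> float:
    """Proportion of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_vector(family: List[str]) -> List[float]:
    """Upper triangle of a family's pairwise distance matrix. `family` holds
    one aligned segment per organism, in the same organism order for every
    family."""
    return [p_distance(family[i], family[j])
            for i, j in combinations(range(len(family)), 2)]


def pearson(u: List[float], v: List[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either vector
    is constant."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    return cov / sqrt(sum((x - mu) ** 2 for x in u) *
                      sum((y - mv) ** 2 for y in v))


def agreement_with_consensus(families: List[List[str]]) -> List[float]:
    """Correlation of each family's distances with the mean over families
    (the mean is an assumed stand-in for the consensus matrix)."""
    vectors = [distance_vector(f) for f in families]
    consensus = [sum(col) / len(col) for col in zip(*vectors)]
    return [pearson(v, consensus) for v in vectors]


# Two toy families, each with one aligned segment from organisms 0, 1 and 2.
family_a = ["AAAAAAAAAA", "AAAAAAAAAC", "AACCCCCCCC"]
family_b = ["GGGGGGGGGG", "GGGGGGGGTT", "GTTTTTTTTT"]
scores = agreement_with_consensus([family_a, family_b])
print(scores)
```

A family whose score is close to 1 orders the organisms by distance in nearly the same way as the consensus; a family with a low score would be a poor candidate under this assumed criterion.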

Use

Protein sequences can provide data about the biological function and evolution of proteins and protein domains. Grouping and interrelating protein sequences can therefore provide information about both human biological processes and the evolutionary development of biological processes on Earth; such sequence clusters allow for the effective coverage of sequence space. Sequence clusters can reduce a large database of sequences to a smaller set of sequence representatives, each of which should represent its cluster at the sequence level. Sequence representatives allow the effective coverage of the original database with fewer sequences. The database of sequence representatives is called non-redundant, as similar (or redundant) sequences have been removed at a certain similarity threshold.
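
The reduction of a sequence database to a non-redundant set can be illustrated with a simplified greedy incremental clustering in Python. This is a sketch under stated assumptions rather than the algorithm of any particular tool: the `identity` measure compares sequences position by position without alignment, and the function names, threshold, and sequences are hypothetical.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of identical positions over the shorter sequence.
    A toy measure: real tools align the sequences first."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def non_redundant(seqs: List[str], threshold: float) -> Dict[str, List[str]]:
    """Greedy incremental clustering. Sequences are visited longest first;
    each joins the first representative it matches at or above `threshold`,
    otherwise it becomes the representative of a new cluster."""
    clusters: Dict[str, List[str]] = {}
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


database = [
    "MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR",   # three near-identical entries
    "GSHMLEDPVA", "GSHMLEDPVV",                 # two near-identical entries
]
clusters = non_redundant(database, threshold=0.9)
representatives = list(clusters)
print(representatives)   # ['MKTAYIAKQR', 'GSHMLEDPVA']
```

Lowering the threshold merges more sequences into each cluster and yields fewer representatives, while raising it keeps more of the original database.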
