In social sciences and other domains, representative sequences are whole sequences that best characterize or summarize a set of sequences.
In bioinformatics, representative sequences also designate substrings of a sequence that characterize the sequence.
Social sciences
In
Sequence analysis in social sciences
In social sciences, sequence analysis (SA) is concerned with the analysis of sets of categorical sequences that typically describe longitudinal data. Analyzed sequences are encoded representations of, for example, individual life trajectories such ...
, representative sequences are used to summarize sets of sequences describing for example the family life course or professional career of several thousands individuals.
The identification of representative sequences
proceeds from the pairwise dissimilarities between sequences. One typical solution is the medoid sequence, i.e., the observed sequence that minimizes the sum of its distances to all other sequences in the set. An other solution is the densest observed sequence, i.e., the sequence with the greatest number of other sequences in its neighborhood. When the diversity of the sequences is large, a single representative is often insufficient to efficiently characterize the set. In such cases, an as small as possible set of representative sequences covering (i.e., which includes in at least one neighborhood of a representative) a given percentage of all sequences is searched.
A solution also considered is to select the medoids of relative frequency groups. More specifically, the method consists in sorting the sequences (for example, according to the first principal coordinate of the pairwise dissimilarity matrix), splitting the sorted list into equal sized groups (called relative frequency groups), and selecting the medoids of the equal sized groups.
The methods for identifying representative sequences described above have been implemented in the R packag
TraMineR
Bioinformatics
Representative sequences are short regions within
protein sequences
Protein primary structure is the linear sequence of amino acids in a peptide or protein. By convention, the primary structure of a protein is reported starting from the amino-terminal (N) end to the carboxyl-terminal (C) end. Protein biosynthes ...
that can be used to approximate the
evolutionary relationships
Evolution is change in the heritable characteristics of biological populations over successive generations. These characteristics are the expressions of genes, which are passed on from parent to offspring during reproduction. Variation t ...
of those proteins, or the organisms from which they come. Representative sequences are contiguous subsequences (typically 300
residues) from
ubiquitous
Omnipresence or ubiquity is the property of being present anywhere and everywhere. The term omnipresence is most often used in a religious context as an attribute of a deity or supreme being, while the term ubiquity is generally used to describe ...
, conserved proteins, such that each
orthologous
Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a spec ...
family of representative sequences taken alone gives a
distance matrix
In mathematics, computer science and especially graph theory, a distance matrix is a square matrix (two-dimensional array) containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the ''dist ...
in close agreement with the consensus matrix.
Use
Protein sequence
Protein primary structure is the linear sequence of amino acids in a peptide or protein. By convention, the primary structure of a protein is reported starting from the amino-terminal (N) end to the carboxyl-terminal (C) end. Protein biosynthesi ...
s can provide data about the
biological function
In evolutionary biology, function is the reason some object or process occurred in a system that evolved through natural selection. That reason is typically that it achieves some result, such as that chlorophyll helps to capture the energy of sunl ...
and
evolution
Evolution is change in the heritable characteristics of biological populations over successive generations. These characteristics are the expressions of genes, which are passed on from parent to offspring during reproduction. Variation ...
of proteins and
protein domain
In molecular biology, a protein domain is a region of a protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of s ...
s. Grouping and interrelating protein sequences can therefore provide information about both human biological processes, and the evolutionary development of biological processes on earth; such
sequence clusters allow for the effective coverage of sequence space. Sequence clusters can reduce a large database of sequences to a smaller set of ''sequence representatives'', each of which should represent its cluster at the sequence level. Sequence representatives allow the effective coverage of the original database with fewer sequences. The database of sequence representatives is called ''non-redundant'', as similar (or redundant) sequences have been removed at a certain similarity threshold.
See also
Sequence analysis in social sciences
In social sciences, sequence analysis (SA) is concerned with the analysis of sets of categorical sequences that typically describe longitudinal data. Analyzed sequences are encoded representations of, for example, individual life trajectories such ...
Sequence analysis
In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. Methodologies used include sequence alignm ...
in bioinformatics
References
Protein structure
Bioinformatics
{{Bioinformatics-stub
Social sciences