In evolutionary biology, sequence space is a way of representing all possible sequences (for a protein, gene or genome). The sequence space has one dimension per amino acid or nucleotide in the sequence, leading to high-dimensional spaces. Most sequences in sequence space have no function, leaving relatively small regions that are populated by naturally occurring genes. Each protein sequence is adjacent to all other sequences that can be reached through a single mutation. It has been estimated that the whole functional protein sequence space has been explored by life on Earth. Evolution can be visualised as the process of sampling nearby sequences in sequence space and moving to any with improved fitness over the current one.
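The adjacency and sampling picture described above can be illustrated with a minimal sketch. It assumes a toy fitness function (`toy_fitness`, the number of positions matching an arbitrary target sequence) and a greedy walk (`adaptive_walk`). Both names and the target are hypothetical choices for illustration, not a model of any real protein.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids


def single_mutant_neighbours(seq, alphabet=AMINO_ACIDS):
    """Return every sequence that differs from seq by exactly one substitution."""
    neighbours = []
    for i, original in enumerate(seq):
        for letter in alphabet:
            if letter != original:
                neighbours.append(seq[:i] + letter + seq[i + 1:])
    return neighbours


def toy_fitness(seq, target="MKV"):
    """Hypothetical fitness: count of positions that match an arbitrary target."""
    return sum(a == b for a, b in zip(seq, target))


def adaptive_walk(start, fitness=toy_fitness):
    """Move to the fittest neighbour until no neighbour improves on the current sequence."""
    path = [start]
    current = start
    while True:
        best = max(single_mutant_neighbours(current), key=fitness)
        if fitness(best) <= fitness(current):
            return path  # local optimum reached
        current = best
        path.append(current)


neighbours = single_mutant_neighbours("AAA")
walk = adaptive_walk("AAA")
print(len(neighbours))  # 3 positions x 19 alternatives = 57
print(walk)
```

A sequence of length L over 20 amino acids has 19 x L single-substitution neighbours, so the neighbourhood grows only linearly with length even though the space itself grows exponentially.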


Representation

A sequence space is usually laid out as a grid. For protein sequence spaces, each residue in the protein is represented by a dimension with 20 possible positions along that axis, corresponding to the possible amino acids. Hence there are 400 possible dipeptides arranged in a 20x20 space, but that expands to about 10^130 sequences for even a small protein of 100 amino acids, arranged in a space with 100 dimensions. Although such overwhelming multidimensionality cannot be visualised or represented diagrammatically, it provides a useful abstract model to think about the range of proteins and evolution from one sequence to another. These highly multidimensional spaces can be compressed to 2 or 3 dimensions using principal component analysis. A fitness landscape is simply a sequence space with an extra vertical axis of fitness added for each sequence.
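The counting argument can be checked directly. The sketch below assumes an arbitrary alphabetical ordering of the amino acids along each axis, and the helper names (`space_size`, `to_coordinates`, `hamming_distance`) are illustrative only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed axis ordering; any fixed order works


def space_size(length, alphabet_size=len(AMINO_ACIDS)):
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length


def to_coordinates(seq, alphabet=AMINO_ACIDS):
    """Map a sequence to a grid point: one axis per residue, 20 positions per axis."""
    return tuple(alphabet.index(residue) for residue in seq)


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


dipeptides = space_size(2)
magnitude = len(str(space_size(100))) - 1  # decimal order of magnitude of 20**100

print(dipeptides)                      # 400
print(magnitude)                       # 130
print(to_coordinates("AC"))            # (0, 1)
print(hamming_distance("ACD", "ACE"))  # 1
```

In this representation two sequences are adjacent exactly when their Hamming distance is 1, and a fitness landscape could be stored as a mapping from each coordinate tuple to a fitness value.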


Functional sequences in sequence space

Despite the diversity of protein superfamilies, sequence space is extremely sparsely populated by functional proteins. Most random protein sequences have no fold or function. Enzyme superfamilies, therefore, exist as tiny clusters of active proteins in a vast empty space of non-functional sequence. The density of functional proteins in sequence space, and the proximity of different functions to one another, is a key determinant in understanding evolvability. The degree of interpenetration of two neutral networks of different activities in sequence space will determine how easy it is to evolve from one activity to another. The more overlap there is between different activities in sequence space, the more cryptic variation for promiscuous activity there will be.

Protein sequence space has been compared to the Library of Babel, a theoretical library containing all possible books that are 410 pages long. In the Library of Babel, finding any book that made sense was impossible due to the sheer number of books and the lack of order. The same would be true of protein sequences if it were not for natural selection, which has selected out only protein sequences that make sense. Additionally, each protein sequence is surrounded by a set of neighbours (point mutants) that are likely to have at least some function.

On the other hand, the effective "alphabet" of the sequence space may in fact be quite small, reducing the useful number of amino acids from 20 to a much lower number. For example, in an extremely simplified view, all amino acids can be sorted into two classes (hydrophobic/polar) by hydrophobicity and still allow many common structures to show up. Early life on Earth may have had only four or five types of amino acids to work with, and research has shown that functional proteins can be created from wild-type ones by a similar alphabet-reduction process. Reduced alphabets are also useful in bioinformatics, as they provide an easy way of analyzing protein similarity.
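As a rough illustration of a reduced alphabet, the sketch below recodes sequences into a two-letter hydrophobic/polar alphabet and compares them before and after reduction. The particular grouping in `HYDROPHOBIC` is an assumption made for this example; published classification schemes differ in where they place residues such as glycine, proline and tyrosine.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVLIMFWC")  # assumed grouping for illustration; other schemes exist


def reduce_to_hp(seq):
    """Recode a protein sequence as H (hydrophobic) or P (polar) per residue."""
    reduced = []
    for residue in seq:
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unknown residue: {residue!r}")
        reduced.append("H" if residue in HYDROPHOBIC else "P")
    return "".join(reduced)


def identity(a, b):
    """Fraction of positions that are identical in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


seq_a, seq_b = "MKVLAT", "LRIMGS"  # made-up sequences, not real proteins
full_identity = identity(seq_a, seq_b)
hp_identity = identity(reduce_to_hp(seq_a), reduce_to_hp(seq_b))

print(reduce_to_hp(seq_a))  # HPHHHP
print(reduce_to_hp(seq_b))  # HPHHPP
print(full_identity)        # 0.0
print(hp_identity)
```

The two made-up sequences share no residues in the 20-letter alphabet, yet agree at five of six positions once reduced. This is the sense in which a reduced alphabet can expose similarity, and it also shrinks the space from 20^L to 2^L sequences of length L.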


Exploration through directed evolution and rational design

A major focus in the field of protein engineering is on creating DNA libraries that sample regions of sequence space, often with the goal of finding mutants of proteins with enhanced functions compared to the wild type. These libraries are created either by using a wild-type sequence as a template and applying one or more mutagenesis techniques to make different variants of it, or by creating proteins from scratch using artificial gene synthesis. The libraries are then screened or selected, and the variants with improved phenotypes are used for the next round of mutagenesis.
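The mutagenesis, screening and selection cycle can be sketched as a toy simulation. Everything here is an assumption made for illustration: the per-residue mutation rate, the library size, the number of variants kept, and especially `toy_screen`, which stands in for a laboratory assay by scoring matches to an arbitrary target sequence.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, rng, rate=0.1):
    """Random mutagenesis: redraw each residue with probability `rate`.

    A redraw may pick the same residue again, so the effective rate is slightly lower.
    """
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else residue
        for residue in seq
    )


def toy_screen(seq, target="MKVLATGH"):
    """Hypothetical assay: number of positions matching an arbitrary target."""
    return sum(a == b for a, b in zip(seq, target))


def directed_evolution(wild_type, rounds=10, library_size=200, keep=5, seed=0):
    """Iterate library creation, screening and selection; return best variant and history."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    parents = [wild_type]
    best_per_round = []
    for _ in range(rounds):
        library = [mutate(rng.choice(parents), rng) for _ in range(library_size)]
        # Parents stay in the pool, so the best score can never decrease.
        # Sorting alphabetically first makes tie-breaking deterministic.
        pool = sorted(sorted(set(library + parents)), key=toy_screen, reverse=True)
        parents = pool[:keep]
        best_per_round.append(toy_screen(parents[0]))
    return parents[0], best_per_round


best, history = directed_evolution("AAAAAAAA")
print(best, history)
```

Keeping several parents per round rather than only the single best is a design choice in this sketch; it keeps some diversity in the library at the cost of spending part of each round on weaker templates.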


See also

* Protein
* Sequence space (mathematics)
* Directed evolution
* Protein engineering
* High-dimensional space

