
In
evolutionary biology
Evolutionary biology is the subfield of biology that studies the evolutionary processes such as natural selection, common descent, and speciation that produced the diversity of life on Earth. In the 1930s, the discipline of evolutionary biolo ...
, sequence space is a way of representing all possible sequences (for a
protein
Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residue (biochemistry), residues. Proteins perform a vast array of functions within organisms, including Enzyme catalysis, catalysing metab ...
,
gene
In biology, the word gene has two meanings. The Mendelian gene is a basic unit of heredity. The molecular gene is a sequence of nucleotides in DNA that is transcribed to produce a functional RNA. There are two types of molecular genes: protei ...
or
genome
A genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding genes, other functional regions of the genome such as ...
). The sequence space has one dimension per
amino acid
Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although over 500 amino acids exist in nature, by far the most important are the 22 α-amino acids incorporated into proteins. Only these 22 a ...
or
nucleotide
Nucleotides are Organic compound, organic molecules composed of a nitrogenous base, a pentose sugar and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both o ...
in the sequence leading to
highly dimensional spaces.
Most sequences in sequence space have no function, leaving relatively small regions that are populated by naturally occurring genes. Each protein sequence is adjacent to all other sequences that can be reached through a single
mutation
In biology, a mutation is an alteration in the nucleic acid sequence of the genome of an organism, virus, or extrachromosomal DNA. Viral genomes contain either DNA or RNA. Mutations result from errors during DNA or viral replication, ...
. It has been estimated that the whole functional protein sequence space has been explored by life on the Earth. Evolution by natural selection can be visualised as the process of sampling nearby sequences in sequence space and moving to any with improved
fitness over the current one.
Representation
A sequence space is usually laid out as a grid. For
protein
Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residue (biochemistry), residues. Proteins perform a vast array of functions within organisms, including Enzyme catalysis, catalysing metab ...
sequence spaces, each
residue in the protein is represented by a
dimension
In physics and mathematics, the dimension of a mathematical space (or object) is informally defined as the minimum number of coordinates needed to specify any point within it. Thus, a line has a dimension of one (1D) because only one coo ...
with 20 possible positions along that axis corresponding to the possible amino acids.
Hence there are 400 possible
dipeptide
A dipeptide is an organic compound derived from two amino acids. The constituent amino acids can be the same or different. When different, two isomers of the dipeptide are possible, depending on the sequence. Several dipeptides are physiological ...
s arranged in a 20x20 space but that expands to 10
130 for even a small protein of 100 amino acids arranged in a space with 100 dimensions. Although such overwhelming multidimensionality cannot be visualised or represented diagrammatically, it provides a useful abstract model to think about the range of proteins and
evolution
Evolution is the change in the heritable Phenotypic trait, characteristics of biological populations over successive generations. It occurs when evolutionary processes such as natural selection and genetic drift act on genetic variation, re ...
from one sequence to another.
These highly multidimensional spaces can be compressed to 2 or 3 dimensions using
principal component analysis
Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing.
The data is linearly transformed onto a new coordinate system such that th ...
. A fitness landscape is simply a sequence space with an extra vertical axis of fitness added for each sequence.
Functional sequences in sequence space
Despite the diversity of protein superfamilies, sequence space is extremely sparsely populated by functional proteins. Most random protein sequences have no fold or function.
Enzyme superfamilies
A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred (see homology). Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence simila ...
, therefore, exist as tiny clusters of active proteins in a vast empty space of non-functional sequence.
The density of functional proteins in sequence space, and the proximity of different functions to one another is a key determinant in understanding
evolvability
Evolvability is defined as the capacity of a system for adaptive evolution. Evolvability is the ability of a population of organisms to not merely generate genetic diversity, but to generate '' adaptive'' genetic diversity, and thereby evolve thr ...
. The degree of interpenetration of two
neutral networks of different
activities in sequence space will determine how easy it is to evolve from one activity to another. The more overlap between different activities in sequence space, the more
cryptic variation for
promiscuous activity will be.
Protein sequence space has been compared to the ''
Library of Babel'', a theoretical library containing all possible books that are 410 pages long. In the ''Library of Babel'', finding any book that made sense was impossible due to the sheer number and lack of order. The same would be true of protein sequences if it were not for natural selection, which has selected out only protein sequences that make sense. Additionally, each protein sequences is surrounded by a set of neighbours (point mutants) that are likely to have at least some function.
On the other hand, the effective "alphabet" of the sequence space may in fact be quite small, reducing the useful number of amino acids from 20 to a much lower number. For example, in an extremely simplified view, all amino acids can be sorted into two classes (hydrophobic/polar) by
hydrophobicity
In chemistry, hydrophobicity is the chemical property of a molecule (called a hydrophobe) that is seemingly intermolecular force, repelled from a mass of water. In contrast, hydrophiles are attracted to water.
Hydrophobic molecules tend to b ...
and still allow many common structures to show up. Early life on Earth may have only four or five types of amino acids to work with, and researches have shown that functional proteins can be created from wild-type ones by a similar alphabet-reduction process. Reduced alphabets are also useful in
bioinformatics
Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...
, as they provide an easy way of analyzing protein similarity.
Exploration through directed evolution and rational design
A major focus in the field of
protein engineering is on creating
DNA libraries that
sample regions of sequence space, often with the goal of finding mutants of proteins with enhanced functions compared to the
wild type
The wild type (WT) is the phenotype of the typical form of a species as it occurs in nature. Originally, the wild type was conceptualized as a product of the standard "normal" allele at a locus, in contrast to that produced by a non-standard, " ...
. These libraries are created either by using a wild type sequence as a template and applying one or more
mutagenesis
Mutagenesis () is a process by which the genetic information of an organism is changed by the production of a mutation. It may occur spontaneously in nature, or as a result of exposure to mutagens. It can also be achieved experimentally using lab ...
techniques to make different variants of it, or by creating proteins from scratch using
artificial gene synthesis. These libraries are then
screened or selected, and ones with improved
phenotype
In genetics, the phenotype () is the set of observable characteristics or traits of an organism. The term covers the organism's morphology (physical form and structure), its developmental processes, its biochemical and physiological propert ...
s are used for the next round of mutagenesis.
See also
*
Protein
Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residue (biochemistry), residues. Proteins perform a vast array of functions within organisms, including Enzyme catalysis, catalysing metab ...
*
Sequence space
In functional analysis and related areas of mathematics, a sequence space is a vector space whose elements are infinite sequences of real or complex numbers. Equivalently, it is a function space whose elements are functions from the natural num ...
*
Directed evolution
Directed evolution (DE) is a method used in protein engineering that mimics the process of natural selection to steer proteins or nucleic acids toward a user-defined goal. It consists of subjecting a gene to iterative rounds of mutagenesis (cre ...
*
Protein engineering
*
High-dimensional space
In physics and mathematics, the dimension of a mathematical space (or object) is informally defined as the minimum number of coordinates needed to specify any point within it. Thus, a line has a dimension of one (1D) because only one coordi ...
References
{{genarch
Evolutionary biology
Genetics
Biochemistry