
In evolutionary biology, conserved sequences are identical or similar sequences in nucleic acids (DNA and RNA) or proteins across species (orthologous sequences), or within a genome (paralogous sequences), or between donor and receptor taxa (xenologous sequences). Conservation indicates that a sequence has been maintained by natural selection.
A highly conserved sequence is one that has remained relatively unchanged far back up the phylogenetic tree, and hence far back in geological time. Examples of highly conserved sequences include the RNA components of ribosomes present in all domains of life, the homeobox sequences widespread amongst eukaryotes, and the tmRNA in bacteria. The study of sequence conservation overlaps with the fields of genomics, proteomics, evolutionary biology, phylogenetics, bioinformatics and mathematics.
History
The discovery of the role of DNA in heredity, and observations by Frederick Sanger of variation between animal insulins in 1949, prompted early molecular biologists to study taxonomy from a molecular perspective.
Studies in the 1960s used DNA hybridization and protein cross-reactivity techniques to measure similarity between known orthologous proteins, such as hemoglobin and cytochrome c. In 1965, Émile Zuckerkandl and Linus Pauling introduced the concept of the molecular clock, proposing that steady rates of amino acid replacement could be used to estimate the time since two organisms diverged. While initial phylogenies closely matched the fossil record, observations that some genes appeared to evolve at different rates led to the development of theories of molecular evolution.
Margaret Dayhoff's 1966 comparison of ferredoxin sequences showed that natural selection would act to conserve and optimise protein sequences essential to life.
Mechanisms
Over many generations, nucleic acid sequences in the genome of an evolutionary lineage can gradually change due to random mutations and deletions. Sequences may also recombine or be deleted due to chromosomal rearrangements. Conserved sequences are sequences which persist in the genome despite such forces, and which change more slowly than the background mutation rate would predict.
Conservation can occur in coding and non-coding nucleic acid sequences. Highly conserved DNA sequences are thought to have functional value, although the role of many highly conserved non-coding DNA sequences is poorly understood. The extent to which a sequence is conserved can be affected by varying selection pressures, its robustness to mutation, population size and genetic drift. Many functional sequences are also modular, containing regions which may be subject to independent selection pressures, such as protein domains.
Coding sequence
In coding sequences, the nucleic acid and amino acid sequence may be conserved to different extents, as the degeneracy of the genetic code means that synonymous mutations in a coding sequence do not affect the amino acid sequence of its protein product.
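The effect of degeneracy can be illustrated with a minimal sketch. The codon table below is only the subset of the standard genetic code needed for the example, and the function names and the two sequences are hypothetical, chosen for illustration.

```python
# Partial standard genetic code: only the codons used in this example.
CODON_TABLE = {
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "AAA": "Lys", "AAG": "Lys",
    "GAU": "Asp", "GAC": "Asp",
    "GAA": "Glu", "GAG": "Glu",
}

def translate(mrna):
    """Translate an mRNA string codon by codon using CODON_TABLE."""
    codons = [mrna[i:i + 3] for i in range(0, len(mrna), 3)]
    return [CODON_TABLE[c] for c in codons]

def classify_changes(seq_a, seq_b):
    """Label each differing codon as synonymous or non-synonymous."""
    changes = []
    for i in range(0, len(seq_a), 3):
        a, b = seq_a[i:i + 3], seq_b[i:i + 3]
        if a != b:
            same_residue = CODON_TABLE[a] == CODON_TABLE[b]
            kind = "synonymous" if same_residue else "non-synonymous"
            changes.append((i // 3, a, b, kind))
    return changes

# Hypothetical reference and variant coding sequences (mRNA, in frame).
reference = "GCUAAAGAU"  # Ala-Lys-Asp
variant = "GCCAAGGAA"    # Ala-Lys-Glu

print(translate(reference))
print(translate(variant))
print(classify_changes(reference, variant))
```

In this example the nucleotide sequences differ at three positions, yet only the last codon changes the encoded amino acid, so the protein sequence is more conserved than the nucleic acid sequence.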
Amino acid sequences can be conserved to maintain the structure or function of a protein or domain. Conserved proteins undergo fewer amino acid replacements, or are more likely to substitute amino acids with similar biochemical properties. Within a sequence, amino acids that are important for folding or structural stability, or that form a binding site, may be more highly conserved.
The nucleic acid sequence of a protein coding gene may also be conserved by other selective pressures. The codon usage bias in some organisms may restrict the types of synonymous mutations in a sequence. Nucleic acid sequences that cause secondary structure in the mRNA of a coding gene may be selected against, as some structures may negatively affect translation, or conserved where the mRNA also acts as a functional non-coding RNA.
Non-coding
Non-coding sequences important for gene regulation, such as the binding or recognition sites of ribosomes and transcription factors, may be conserved within a genome. For example, the promoter of a conserved gene or operon may also be conserved. As with proteins, nucleotides that are important for the structure and function of non-coding RNA (ncRNA) can also be conserved. However, sequence conservation in ncRNAs is generally poor compared to protein-coding sequences, and base pairs that contribute to structure or function are often conserved instead.
Identification
Conserved sequences are typically identified by bioinformatics approaches based on sequence alignment. Advances in high-throughput DNA sequencing and protein mass spectrometry have substantially increased the availability of protein sequences and whole genomes for comparison since the early 2000s.
Homology search
Conserved sequences may be identified by homology search, using tools such as BLAST, HMMER, OrthologR and Infernal. Homology search tools may take an individual nucleic acid or protein sequence as input, or use statistical models generated from multiple sequence alignments of known related sequences. Statistical models such as profile-HMMs, and RNA covariance models which also incorporate structural information, can be helpful when searching for more distantly related sequences. Input sequences are then aligned against a database of sequences from related individuals or other species. The resulting alignments are then scored based on the number of matching amino acids or bases, and the number of gaps or deletions generated by the alignment. Acceptable conservative substitutions may be identified using substitution matrices such as PAM and BLOSUM. Highly scoring alignments are assumed to be from homologous sequences. The conservation of a sequence may then be inferred by detection of highly similar homologs over a broad phylogenetic range.
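The scoring step described above can be sketched for a pair of rows that have already been aligned. This is a simplified illustration rather than the scoring scheme of BLAST or any other tool: the similarity groups and the score values are toy assumptions standing in for a real substitution matrix such as PAM or BLOSUM, and all names are hypothetical.

```python
# Toy groups of biochemically similar amino acids (illustrative only,
# not taken from PAM or BLOSUM).
SIMILAR_GROUPS = [set("ILVM"), set("KRH"), set("DE"), set("ST"), set("FYW")]

# Assumed score values for this sketch.
MATCH, CONSERVATIVE, MISMATCH, GAP = 2, 1, -1, -2

def pair_score(a, b):
    """Score one aligned column of a pairwise alignment."""
    if a == "-" or b == "-":
        return GAP
    if a == b:
        return MATCH
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return CONSERVATIVE
    return MISMATCH

def alignment_score(row_a, row_b):
    """Sum column scores over two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(pair_score(a, b) for a, b in zip(row_a, row_b))

# Hypothetical aligned sequences ('-' marks a gap).
query = "MKV-LDE"
related_hit = "MRVALEE"
unrelated_hit = "GPWACNQ"

print(alignment_score(query, related_hit))    # identities and conservative changes
print(alignment_score(query, unrelated_hit))  # mismatches throughout
```

Here the related sequence scores well above the unrelated one because identities and conservative substitutions outweigh the single gap. Real tools additionally search for the best-scoring alignment and assess its statistical significance.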
Multiple sequence alignment

Multiple sequence alignments can be used to visualise conserved sequences. The CLUSTAL format includes a plain-text key to annotate conserved columns of the alignment, denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ). Sequence logos can also show conserved sequence by representing the proportions of characters at each point in the alignment by height.
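Both ideas can be sketched per alignment column. The example below marks fully conserved columns with an asterisk and computes the information content in bits that determines column height in a sequence logo. It is a minimal sketch under stated assumptions: the conservative (:) and semi-conservative (.) classes are not computed, no small-sample correction is applied, gap characters would be counted like any other symbol, and the function names and the toy DNA alignment are hypothetical.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Information content in bits: log2(alphabet_size) minus Shannon entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

def annotate(alignment):
    """Return a '*' / ' ' conservation line and per-column bits."""
    columns = list(zip(*alignment))
    marks = "".join("*" if len(set(col)) == 1 else " " for col in columns)
    bits = [round(column_information(col), 2) for col in columns]
    return marks, bits

# Hypothetical alignment of four DNA sequences.
alignment = ["ACGT", "ACGA", "ACTC", "ACTG"]

marks, bits = annotate(alignment)
for row in alignment:
    print(row)
print(marks)
print(bits)
```

The first two columns are identical in every sequence and carry the maximum of 2 bits for a four-letter alphabet, while the last column contains every base once and carries no information.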
Genome alignment
Whole genome alignments (WGAs) may also be used to identify highly conserved regions across species. Currently the accuracy and scalability of WGA tools remains limited due to the computational complexity of dealing with rearrangements, repeat regions and the large size of many eukaryotic genomes. However, WGAs of 30 or more closely related bacteria (prokaryotes) are now increasingly feasible.
Scoring systems
Other approaches use measurements of conservation based on statistical tests that attempt to identify sequences which mutate differently to an expected background (neutral) mutation rate.
The GERP (Genomic Evolutionary Rate Profiling) framework scores conservation of genetic sequences across species. This approach estimates the rate of neutral mutation in a set of species from a multiple sequence alignment, and then identifies regions of the sequence that exhibit fewer mutations than expected. These regions are then assigned scores based on the difference between the observed mutation rate and expected background mutation rate. A high GERP score then indicates a highly conserved sequence.
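The idea of scoring a position by the gap between expected and observed change can be sketched as follows. This is not the GERP implementation: the observed substitution counts, the neutral rate, the tree length and the cut-off are all hypothetical inputs, whereas GERP estimates these quantities from the alignment and a phylogenetic tree.

```python
def rejected_substitutions(observed, neutral_rate, tree_length):
    """Expected substitutions under neutrality minus observed substitutions.

    A large positive value means far fewer changes than neutrality predicts.
    """
    expected = neutral_rate * tree_length
    return expected - observed

# Hypothetical per-column substitution estimates from an alignment.
observed_counts = [0.0, 0.5, 3.0, 2.8, 0.2]

NEUTRAL_RATE = 1.0  # assumed substitutions per site per unit branch length
TREE_LENGTH = 3.0   # assumed total branch length of the species tree
THRESHOLD = 2.0     # arbitrary cut-off chosen for this sketch

scores = [
    rejected_substitutions(obs, NEUTRAL_RATE, TREE_LENGTH)
    for obs in observed_counts
]
constrained = [i for i, score in enumerate(scores) if score >= THRESHOLD]

print(scores)
print(constrained)
```

Columns with few or no observed substitutions receive the highest scores, matching the convention that a high score indicates conservation.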
LIST (Local Identity and Shared Taxa) is based on the assumption that variations observed in species closely related to humans are more significant when assessing conservation than those in distantly related species. Thus, LIST uses the local alignment identity around each position to identify relevant sequences in the multiple sequence alignment (MSA), and then estimates conservation based on the taxonomic distances of these sequences to human. Unlike other tools, LIST ignores the count or frequency of variations in the MSA.
Aminode combines multiple alignments with phylogenetic analysis to analyze changes in homologous proteins and produce a plot that indicates the local rates of evolutionary changes. This approach identifies the Evolutionarily Constrained Regions in a protein, which are segments that are subject to purifying selection and are typically critical for normal protein function.
Other approaches such as PhyloP and PhyloHMM incorporate statistical phylogenetics methods to compare probability distributions of substitution rates, which allows the detection of both conservation and accelerated mutation. First, a background probability distribution is generated of the number of substitutions expected to occur for a column in a multiple sequence alignment, based on a phylogenetic tree. The estimated evolutionary relationships between the species of interest are used to calculate the significance of any substitutions (i.e. a substitution between two closely related species may be less likely to occur than one between distantly related species, and is therefore more significant). To detect conservation, a probability distribution is calculated for a subset of the multiple sequence alignment, and compared to the background distribution using a statistical test such as a likelihood-ratio test or score test. P-values generated from comparing the two distributions are then used to identify conserved regions. PhyloHMM uses hidden Markov models to generate probability distributions. The PhyloP software package compares probability distributions using a likelihood-ratio test or score test, as well as using a GERP-like scoring system.
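The logic of a likelihood-ratio test against a neutral background can be sketched with a deliberately simplified model. The example below treats the number of substitutions in a region as a Poisson count and compares the neutral expectation with the freely fitted rate. This is a stand-in for illustration only: PhyloP fits phylogenetic substitution models on a tree rather than Poisson counts, and the function names and input numbers are hypothetical.

```python
import math

def poisson_loglik(count, rate):
    """Log-likelihood of observing `count` events under Poisson(rate)."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)

def conservation_lrt(observed, expected):
    """Likelihood-ratio test of the neutral rate against the fitted rate.

    Returns (statistic, p_value, direction).
    """
    # The maximum-likelihood Poisson rate for a single count is the count
    # itself; a tiny floor avoids log(0) when nothing was observed.
    fitted = max(observed, 1e-9)
    stat = 2.0 * (poisson_loglik(observed, fitted)
                  - poisson_loglik(observed, expected))
    stat = max(stat, 0.0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    if observed < expected:
        direction = "conserved"
    elif observed > expected:
        direction = "accelerated"
    else:
        direction = "neutral"
    return stat, p_value, direction

# Hypothetical region: 2 substitutions observed where 10 are expected.
print(conservation_lrt(2, 10.0))
# Hypothetical region evolving at the neutral expectation.
print(conservation_lrt(10, 10.0))
```

Because the test is two-sided in the rate, the same statistic flags both unusually slow and unusually fast regions, and the direction of the deviation distinguishes conservation from acceleration.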
Extreme conservation
Ultra-conserved elements
Ultra-conserved elements or UCEs are sequences that are highly similar or identical across multiple taxonomic groupings. These were first discovered in vertebrates, and have subsequently been identified within widely differing taxa. While the origin and function of UCEs are poorly understood, they have been used to investigate deep-time divergences in amniotes, insects, and between animals and plants.
Universally conserved genes
The most highly conserved genes are those that can be found in all organisms. These consist mainly of the ncRNAs and proteins required for transcription and translation, which are assumed to have been conserved from the last universal common ancestor of all life.
Genes or gene families that have been found to be universally conserved include GTP-binding elongation factors, methionine aminopeptidase 2, serine hydroxymethyltransferase, and ATP transporters. Components of the transcription machinery, such as RNA polymerase and helicases, and of the translation machinery, such as ribosomal RNAs, tRNAs and ribosomal proteins, are also universally conserved.
Applications
Phylogenetics and taxonomy
Sets of conserved sequences are often used for generating phylogenetic trees, as it can be assumed that organisms with similar sequences are closely related. The choice of sequences may vary depending on the taxonomic scope of the study. For example, the most highly conserved genes such as the 16S rRNA and other ribosomal sequences are useful for reconstructing deep phylogenetic relationships and identifying bacterial phyla in metagenomics studies. Sequences that are conserved within a clade but undergo some mutations, such as housekeeping genes, can be used to study species relationships. The internal transcribed spacer (ITS) region, which is required for spacing conserved rRNA genes but undergoes rapid evolution, is commonly used to classify fungi and strains of rapidly evolving bacteria.
Medical research
As highly conserved sequences often have important biological functions, they can be a useful starting point for identifying the cause of genetic diseases. Many congenital metabolic disorders and lysosomal storage diseases are the result of changes to individual conserved genes, resulting in missing or faulty enzymes that are the underlying cause of the symptoms of the disease. Genetic diseases may be predicted by identifying sequences that are conserved between humans and lab organisms such as mice or fruit flies, and studying the effects of knock-outs of these genes.
Genome-wide association studies can also be used to identify variation in conserved sequences associated with disease or health outcomes. More than two dozen novel potential susceptibility loci have been discovered for Alzheimer's disease.
Functional annotation
Identifying conserved sequences can be used to discover and predict functional sequences such as genes. Conserved sequences with a known function, such as protein domains, can also be used to predict the function of a sequence. Databases of conserved protein domains such as Pfam and the Conserved Domain Database can be used to annotate functional domains in predicted protein coding genes.
See also
* Evolutionary developmental biology
* NAPP (database)
* Segregating site
* Sequence alignment
* Sequence alignment software
* UCbase
* Ultra-conserved element