HOME

TheInfoList



OR:

In
evolutionary biology Evolutionary biology is the subfield of biology that studies the evolutionary processes such as natural selection, common descent, and speciation that produced the diversity of life on Earth. In the 1930s, the discipline of evolutionary biolo ...
, conserved sequences are identical or similar sequences in
nucleic acid Nucleic acids are large biomolecules that are crucial in all cells and viruses. They are composed of nucleotides, which are the monomer components: a pentose, 5-carbon sugar, a phosphate group and a nitrogenous base. The two main classes of nuclei ...
s (
DNA Deoxyribonucleic acid (; DNA) is a polymer composed of two polynucleotide chains that coil around each other to form a double helix. The polymer carries genetic instructions for the development, functioning, growth and reproduction of al ...
and
RNA Ribonucleic acid (RNA) is a polymeric molecule that is essential for most biological functions, either by performing the function itself (non-coding RNA) or by forming a template for the production of proteins (messenger RNA). RNA and deoxyrib ...
) or
proteins Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, re ...
across species ( orthologous sequences), or within a
genome A genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding genes, other functional regions of the genome such as ...
( paralogous sequences), or between donor and receptor taxa ( xenologous sequences). Conservation indicates that a sequence has been maintained by
natural selection Natural selection is the differential survival and reproduction of individuals due to differences in phenotype. It is a key mechanism of evolution, the change in the Heredity, heritable traits characteristic of a population over generation ...
. A highly conserved sequence is one that has remained relatively unchanged far back up the
phylogenetic tree A phylogenetic tree or phylogeny is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time.Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA. In ...
, and hence far back in geological time. Examples of highly conserved sequences include the RNA components of
ribosome Ribosomes () are molecular machine, macromolecular machines, found within all cell (biology), cells, that perform Translation (biology), biological protein synthesis (messenger RNA translation). Ribosomes link amino acids together in the order s ...
s present in all domains of life, the
homeobox A homeobox is a Nucleic acid sequence, DNA sequence, around 180 base pairs long, that regulates large-scale anatomical features in the early stages of embryonic development. Mutations in a homeobox may change large-scale anatomical features of ...
sequences widespread amongst
eukaryotes The eukaryotes ( ) constitute the domain of Eukaryota or Eukarya, organisms whose cells have a membrane-bound nucleus. All animals, plants, fungi, seaweeds, and many unicellular organisms are eukaryotes. They constitute a major group of ...
, and the tmRNA in
bacteria Bacteria (; : bacterium) are ubiquitous, mostly free-living organisms often consisting of one Cell (biology), biological cell. They constitute a large domain (biology), domain of Prokaryote, prokaryotic microorganisms. Typically a few micr ...
. The study of sequence conservation overlaps with the fields of
genomics Genomics is an interdisciplinary field of molecular biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, ...
,
proteomics Proteomics is the large-scale study of proteins. Proteins are vital macromolecules of all living organisms, with many functions such as the formation of structural fibers of muscle tissue, enzymatic digestion of food, or synthesis and replicatio ...
,
evolutionary biology Evolutionary biology is the subfield of biology that studies the evolutionary processes such as natural selection, common descent, and speciation that produced the diversity of life on Earth. In the 1930s, the discipline of evolutionary biolo ...
,
phylogenetics In biology, phylogenetics () is the study of the evolutionary history of life using observable characteristics of organisms (or genes), which is known as phylogenetic inference. It infers the relationship among organisms based on empirical dat ...
,
bioinformatics Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...
and
mathematics Mathematics is a field of study that discovers and organizes methods, Mathematical theory, theories and theorems that are developed and Mathematical proof, proved for the needs of empirical sciences and mathematics itself. There are many ar ...
.


History

The discovery of the role of
DNA Deoxyribonucleic acid (; DNA) is a polymer composed of two polynucleotide chains that coil around each other to form a double helix. The polymer carries genetic instructions for the development, functioning, growth and reproduction of al ...
in
heredity Heredity, also called inheritance or biological inheritance, is the passing on of traits from parents to their offspring; either through asexual reproduction or sexual reproduction, the offspring cells or organisms acquire the genetic infor ...
, and observations by Frederick Sanger of variation between animal
insulin Insulin (, from Latin ''insula'', 'island') is a peptide hormone produced by beta cells of the pancreatic islets encoded in humans by the insulin (''INS)'' gene. It is the main Anabolism, anabolic hormone of the body. It regulates the metabol ...
s in 1949, prompted early molecular biologists to study
taxonomy image:Hierarchical clustering diagram.png, 280px, Generalized scheme of taxonomy Taxonomy is a practice and science concerned with classification or categorization. Typically, there are two parts to it: the development of an underlying scheme o ...
from a molecular perspective. Studies in the 1960s used DNA hybridization and protein cross-reactivity techniques to measure similarity between known orthologous proteins, such as
hemoglobin Hemoglobin (haemoglobin, Hb or Hgb) is a protein containing iron that facilitates the transportation of oxygen in red blood cells. Almost all vertebrates contain hemoglobin, with the sole exception of the fish family Channichthyidae. Hemoglobin ...
and cytochrome c. In 1965, Émile Zuckerkandl and
Linus Pauling Linus Carl Pauling ( ; February 28, 1901August 19, 1994) was an American chemist and peace activist. He published more than 1,200 papers and books, of which about 850 dealt with scientific topics. ''New Scientist'' called him one of the 20 gre ...
introduced the concept of the molecular clock, proposing that steady rates of amino acid replacement could be used to estimate the time since two organisms diverged. While initial phylogenies closely matched the
fossil record A fossil (from Classical Latin , ) is any preserved remains, impression, or trace of any once-living thing from a past geological age. Examples include bones, shells, exoskeletons, stone imprints of animals or microbes, objects preserved ...
, observations that some genes appeared to evolve at different rates led to the development of theories of
molecular evolution Molecular evolution describes how Heredity, inherited DNA and/or RNA change over evolutionary time, and the consequences of this for proteins and other components of Cell (biology), cells and organisms. Molecular evolution is the basis of phylogen ...
. Margaret Dayhoff's 1966 comparison of
ferredoxin Ferredoxins (from Latin ''ferrum'': iron + redox, often abbreviated "fd") are iron–sulfur proteins that mediate electron transfer in a range of metabolic reactions. The term "ferredoxin" was coined by D.C. Wharton of the DuPont Co. and applied t ...
sequences showed that
natural selection Natural selection is the differential survival and reproduction of individuals due to differences in phenotype. It is a key mechanism of evolution, the change in the Heredity, heritable traits characteristic of a population over generation ...
would act to conserve and optimise protein sequences essential to life.


Mechanisms

Over many generations, nucleic acid sequences in the
genome A genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding genes, other functional regions of the genome such as ...
of an evolutionary lineage can gradually change over time due to random mutations and deletions. Sequences may also recombine or be deleted due to chromosomal rearrangements. Conserved sequences are sequences which persist in the genome despite such forces, and have slower rates of mutation than the background mutation rate. Conservation can occur in coding and non-coding nucleic acid sequences. Highly conserved DNA sequences are thought to have functional value, although the role for many highly conserved
non-coding DNA Non-coding DNA (ncDNA) sequences are components of an organism's DNA that do not encode protein sequences. Some non-coding DNA is transcribed into functional non-coding RNA molecules (e.g. transfer RNA, microRNA, piRNA, ribosomal RNA, and reg ...
sequences is poorly understood. The extent to which a sequence is conserved can be affected by varying selection pressures, its robustness to mutation, population size and
genetic drift Genetic drift, also known as random genetic drift, allelic drift or the Wright effect, is the change in the Allele frequency, frequency of an existing gene variant (allele) in a population due to random chance. Genetic drift may cause gene va ...
. Many functional sequences are also modular, containing regions which may be subject to independent selection pressures, such as protein domains.


Coding sequence

In coding sequences, the nucleic acid and amino acid sequence may be conserved to different extents, as the degeneracy of the
genetic code Genetic code is a set of rules used by living cell (biology), cells to Translation (biology), translate information encoded within genetic material (DNA or RNA sequences of nucleotide triplets or codons) into proteins. Translation is accomplished ...
means that synonymous mutations in a coding sequence do not affect the amino acid sequence of its protein product. Amino acid sequences can be conserved to maintain the
structure A structure is an arrangement and organization of interrelated elements in a material object or system, or the object or system so organized. Material structures include man-made objects such as buildings and machines and natural objects such as ...
or function of a protein or domain. Conserved proteins undergo fewer amino acid replacements, or are more likely to substitute amino acids with similar biochemical properties. Within a sequence, amino acids that are important for folding, structural stability, or that form a
binding site In biochemistry and molecular biology, a binding site is a region on a macromolecule such as a protein that binds to another molecule with specificity. The binding partner of the macromolecule is often referred to as a ligand. Ligands may includ ...
may be more highly conserved. The nucleic acid sequence of a protein coding gene may also be conserved by other selective pressures. The codon usage bias in some organisms may restrict the types of synonymous mutations in a sequence. Nucleic acid sequences that cause
secondary structure Protein secondary structure is the local spatial conformation of the polypeptide backbone excluding the side chains. The two most common Protein structure#Secondary structure, secondary structural elements are alpha helix, alpha helices and beta ...
in the mRNA of a coding gene may be selected against, as some structures may negatively affect translation, or conserved where the mRNA also acts as a functional non-coding RNA.


Non-coding

Non-coding sequences important for gene regulation, such as the binding or recognition sites of ribosomes and
transcription factor In molecular biology, a transcription factor (TF) (or sequence-specific DNA-binding factor) is a protein that controls the rate of transcription (genetics), transcription of genetics, genetic information from DNA to messenger RNA, by binding t ...
s, may be conserved within a genome. For example, the promoter of a conserved gene or operon may also be conserved. As with proteins, nucleic acids that are important for the structure and function of
non-coding RNA A non-coding RNA (ncRNA) is a functional RNA molecule that is not Translation (genetics), translated into a protein. The DNA sequence from which a functional non-coding RNA is transcribed is often called an RNA gene. Abundant and functionally imp ...
(ncRNA) can also be conserved. However, sequence conservation in ncRNAs is generally poor compared to protein-coding sequences, and base pairs that contribute to structure or function are often conserved instead.


Identification

Conserved sequences are typically identified by
bioinformatics Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...
approaches based on
sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural biology, structural, or evolutionary relationships between ...
. Advances in high-throughput DNA sequencing and protein mass spectrometry has substantially increased the availability of protein sequences and whole genomes for comparison since the early 2000s.


Homology search

Conserved sequences may be identified by homology search, using tools such as BLAST, HMMER
OrthologR
and Infernal. Homology search tools may take an individual nucleic acid or protein sequence as input, or use statistical models generated from multiple sequence alignments of known related sequences. Statistical models such as profile-HMMs, and RNA covariance models which also incorporate structural information, can be helpful when searching for more distantly related sequences. Input sequences are then aligned against a database of sequences from related individuals or other species. The resulting alignments are then scored based on the number of matching amino acids or bases, and the number of gaps or deletions generated by the alignment. Acceptable conservative substitutions may be identified using substitution matrices such as PAM and BLOSUM. Highly scoring alignments are assumed to be from homologous sequences. The conservation of a sequence may then be inferred by detection of highly similar homologs over a broad phylogenetic range.


Multiple sequence alignment

Multiple sequence alignments can be used to visualise conserved sequences. The CLUSTAL format includes a plain-text key to annotate conserved columns of the alignment, denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ) Sequence logos can also show conserved sequence by representing the proportions of characters at each point in the alignment by height.


Genome alignment

Whole genome alignments (WGAs) may also be used to identify highly conserved regions across species. Currently the accuracy and
scalability Scalability is the property of a system to handle a growing amount of work. One definition for software systems specifies that this may be done by adding resources to the system. In an economic context, a scalable business model implies that ...
of WGA tools remains limited due to the computational complexity of dealing with rearrangements, repeat regions and the large size of many eukaryotic genomes. However, WGAs of 30 or more closely related bacteria (prokaryotes) are now increasingly feasible.


Scoring systems

Other approaches use measurements of conservation based on statistical tests that attempt to identify sequences which mutate differently to an expected background (neutral) mutation rate. The GERP (Genomic Evolutionary Rate Profiling) framework scores conservation of genetic sequences across species. This approach estimates the rate of neutral mutation in a set of species from a multiple sequence alignment, and then identifies regions of the sequence that exhibit fewer mutations than expected. These regions are then assigned scores based on the difference between the observed mutation rate and expected background mutation rate. A high GERP score then indicates a highly conserved sequence. LIST (Local Identity and Shared Taxa) is based on the assumption that variations observed in species closely related to human are more significant when assessing conservation compared to those in distantly related species. Thus, LIST utilizes the local alignment identity around each position to identify relevant sequences in the multiple sequence alignment (MSA) and then it estimates conservation based on the taxonomy distances of these sequences to human. Unlike other tools, LIST ignores the count/frequency of variations in the MSA. Aminode combines multiple alignments with phylogenetic analysis to analyze changes in homologous proteins and produce a plot that indicates the local rates of evolutionary changes. This approach identifies the Evolutionarily Constrained Regions in a protein, which are segments that are subject to purifying selection and are typically critical for normal protein function. Other approaches such as PhyloP and PhyloHMM incorporate statistical phylogenetics methods to compare
probability distribution In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...
s of substitution rates, which allows the detection of both conservation and accelerated mutation. First, a background probability distribution is generated of the number of substitutions expected to occur for a column in a multiple sequence alignment, based on a
phylogenetic tree A phylogenetic tree or phylogeny is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time.Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA. In ...
. The estimated evolutionary relationships between the species of interest are used to calculate the significance of any substitutions (i.e. a substitution between two closely related species may be less likely to occur than distantly related ones, and therefore more significant). To detect conservation, a probability distribution is calculated for a subset of the multiple sequence alignment, and compared to the background distribution using a statistical test such as a likelihood-ratio test or score test. P-values generated from comparing the two distributions are then used to identify conserved regions. PhyloHMM uses hidden Markov models to generate probability distributions. The PhyloP software package compares probability distributions using a likelihood-ratio test or score test, as well as using a GERP-like scoring system.


Extreme conservation


Ultra-conserved elements

Ultra-conserved elements or UCEs are sequences that are highly similar or identical across multiple taxonomic groupings. These were first discovered in
vertebrates Vertebrates () are animals with a vertebral column (backbone or spine), and a cranium, or skull. The vertebral column surrounds and protects the spinal cord, while the cranium protects the brain. The vertebrates make up the subphylum Vertebra ...
, and have subsequently been identified within widely-differing taxa. While the origin and function of UCEs are poorly understood, they have been used to investigate deep-time divergences in
amniote Amniotes are tetrapod vertebrate animals belonging to the clade Amniota, a large group that comprises the vast majority of living terrestrial animal, terrestrial and semiaquatic vertebrates. Amniotes evolution, evolved from amphibious Stem tet ...
s,
insects Insects (from Latin ') are hexapod invertebrates of the class Insecta. They are the largest group within the arthropod phylum. Insects have a chitinous exoskeleton, a three-part body (head, thorax and abdomen), three pairs of jointed ...
, and between
animals Animals are multicellular, eukaryotic organisms in the biological kingdom Animalia (). With few exceptions, animals consume organic material, breathe oxygen, have myocytes and are able to move, can reproduce sexually, and grow from a ...
and
plants Plants are the eukaryotes that form the kingdom Plantae; they are predominantly photosynthetic. This means that they obtain their energy from sunlight, using chloroplasts derived from endosymbiosis with cyanobacteria to produce sugars f ...
.


Universally conserved genes

The most highly conserved genes are those that can be found in all organisms. These consist mainly of the ncRNAs and proteins required for transcription and
translation Translation is the communication of the semantics, meaning of a #Source and target languages, source-language text by means of an Dynamic and formal equivalence, equivalent #Source and target languages, target-language text. The English la ...
, which are assumed to have been conserved from the
last universal common ancestor The last universal common ancestor (LUCA) is the hypothesized common ancestral cell from which the three domains of life, the Bacteria, the Archaea, and the Eukarya originated. The cell had a lipid bilayer; it possessed the genetic code a ...
of all life. Genes or gene families that have been found to be universally conserved include GTP-binding elongation factors, Methionine aminopeptidase 2, Serine hydroxymethyltransferase, and ATP transporters. Components of the transcription machinery, such as
RNA polymerase In molecular biology, RNA polymerase (abbreviated RNAP or RNApol), or more specifically DNA-directed/dependent RNA polymerase (DdRP), is an enzyme that catalyzes the chemical reactions that synthesize RNA from a DNA template. Using the e ...
and helicases, and of the translation machinery, such as
ribosomal RNA Ribosomal ribonucleic acid (rRNA) is a type of non-coding RNA which is the primary component of ribosomes, essential to all cells. rRNA is a ribozyme which carries out protein synthesis in ribosomes. Ribosomal RNA is transcribed from ribosomal ...
s, tRNAs and
ribosomal protein A ribosomal protein (r-protein or rProtein) is any of the proteins that, in conjunction with rRNA, make up the ribosomal subunits involved in the cellular process of translation. ''E. coli'', other bacteria and Archaea have a 30S small subunit ...
s are also universally conserved.


Applications


Phylogenetics and taxonomy

Sets of conserved sequences are often used for generating
phylogenetic tree A phylogenetic tree or phylogeny is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time.Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA. In ...
s, as it can be assumed that organisms with similar sequences are closely related. The choice of sequences may vary depending on the taxonomic scope of the study. For example, the most highly conserved genes such as the 16S RNA and other ribosomal sequences are useful for reconstructing deep phylogenetic relationships and identifying bacterial
phyla Phyla, the plural of ''phylum'', may refer to: * Phylum, a biological taxon between Kingdom and Class * by analogy, in linguistics, a large division of possibly related languages, or a major language family which is not subordinate to another Phy ...
in
metagenomics Metagenomics is the study of all genetics, genetic material from all organisms in a particular environment, providing insights into their composition, diversity, and functional potential. Metagenomics has allowed researchers to profile the mic ...
studies. Sequences that are conserved within a
clade In biology, a clade (), also known as a Monophyly, monophyletic group or natural group, is a group of organisms that is composed of a common ancestor and all of its descendants. Clades are the fundamental unit of cladistics, a modern approach t ...
but undergo some mutations, such as housekeeping genes, can be used to study species relationships. The internal transcribed spacer (ITS) region, which is required for spacing conserved rRNA genes but undergoes rapid evolution, is commonly used to classify
fungi A fungus (: fungi , , , or ; or funguses) is any member of the group of eukaryotic organisms that includes microorganisms such as yeasts and mold (fungus), molds, as well as the more familiar mushrooms. These organisms are classified as one ...
and strains of rapidly evolving bacteria.


Medical research

As highly conserved sequences often have important biological functions, they can be useful a starting point for identifying the cause of genetic diseases. Many congenital metabolic disorders and Lysosomal storage diseases are the result of changes to individual conserved genes, resulting in missing or faulty enzymes that are the underlying cause of the symptoms of the disease. Genetic diseases may be predicted by identifying sequences that are conserved between humans and lab organisms such as
mice A mouse (: mice) is a small rodent. Characteristically, mice are known to have a pointed snout, small rounded ears, a body-length scaly tail, and a high breeding rate. The best known mouse species is the common house mouse (''Mus musculus' ...
or fruit flies, and studying the effects of
knock-outs An electrical enclosure is a Cabinetry, cabinet for Electrical equipment, electrical or electronic equipment to mount switches, Control knob, knobs and Display device, displays and to prevent electrical shock to equipment users and protect t ...
of these genes. Genome-wide association studies can also be used to identify variation in conserved sequences associated with disease or health outcomes. More than two dozen novel potential susceptibility loci have been discovered for Alzehimer's disease.


Functional annotation

Identifying conserved sequences can be used to discover and predict functional sequences such as genes. Conserved sequences with a known function, such as protein domains, can also be used to predict the function of a sequence. Databases of conserved protein domains such as Pfam and the Conserved Domain Database can be used to annotate functional domains in predicted protein coding genes.


See also

*
Evolutionary developmental biology Evolutionary developmental biology, informally known as evo-devo, is a field of biological research that compares the developmental biology, developmental processes of different organisms to infer how developmental processes evolution, evolved. ...
* NAPP (database) * Segregating site *
Sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural biology, structural, or evolutionary relationships between ...
* Sequence alignment software * UCbase * Ultra-conserved element


References

{{Use dmy dates, date=April 2017 Computational phylogenetics Nucleic acids Protein structure Population genetics Molecular genetics Evolutionary developmental biology