In evolutionary biology, conserved sequences are identical or similar sequences in nucleic acids (DNA and RNA) or proteins across species (orthologous sequences), or within a genome (paralogous sequences), or between donor and receptor taxa (xenologous sequences). Conservation indicates that a sequence has been maintained by natural selection.
A highly conserved sequence is one that has remained relatively unchanged far back up the phylogenetic tree, and hence far back in geological time. Examples of highly conserved sequences include the RNA components of ribosomes present in all domains of life, the homeobox sequences widespread amongst eukaryotes, and the tmRNA in bacteria. The study of sequence conservation overlaps with the fields of genomics, proteomics, evolutionary biology, phylogenetics, bioinformatics and mathematics.
History
The discovery of the role of DNA in heredity, and observations by Frederick Sanger of variation between animal insulins in 1949, prompted early molecular biologists to study taxonomy from a molecular perspective.
Studies in the 1960s used DNA hybridization and protein cross-reactivity techniques to measure similarity between known orthologous proteins, such as hemoglobin and cytochrome c. In 1965, Émile Zuckerkandl and Linus Pauling introduced the concept of the molecular clock, proposing that steady rates of amino acid replacement could be used to estimate the time since two organisms diverged. While initial phylogenies closely matched the fossil record, observations that some genes appeared to evolve at different rates led to the development of theories of molecular evolution.
Margaret Dayhoff's 1966 comparison of ferredoxin sequences showed that natural selection would act to conserve and optimise protein sequences essential to life.
Mechanisms
Over many generations, nucleic acid sequences in the genome of an evolutionary lineage can gradually change over time due to random mutations and deletions. Sequences may also recombine or be deleted due to chromosomal rearrangements. Conserved sequences are sequences which persist in the genome despite such forces, and have slower rates of mutation than the background mutation rate.
Conservation can occur in coding and non-coding nucleic acid sequences. Highly conserved DNA sequences are thought to have functional value, although the role of many highly conserved non-coding DNA sequences is poorly understood. The extent to which a sequence is conserved can be affected by varying selection pressures, its robustness to mutation, population size and genetic drift. Many functional sequences are also modular, containing regions which may be subject to independent selection pressures, such as protein domains.
Coding sequence
In coding sequences, the nucleic acid and amino acid sequence may be conserved to different extents, as the degeneracy of the genetic code means that synonymous mutations in a coding sequence do not affect the amino acid sequence of its protein product.
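The effect of degeneracy can be illustrated with a minimal Python sketch. The codon table below is a small excerpt of the standard genetic code, and the function names and example sequences are hypothetical, chosen only for illustration.

```python
# Excerpt of the standard genetic code (RNA codons); only the codons
# needed for the example sequences below are included.
CODON_TABLE = {
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "GAU": "Asp", "GAC": "Asp",
    "GAA": "Glu", "GAG": "Glu",
    "AUG": "Met",
}


def translate(rna):
    """Translate an RNA coding sequence into a list of amino acids."""
    if len(rna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return [CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna), 3)]


def classify_mutation(reference, mutant):
    """Label a substitution-only mutant relative to the reference."""
    if len(reference) != len(mutant):
        raise ValueError("sequences must have equal length")
    if reference == mutant:
        return "identical"
    if translate(reference) == translate(mutant):
        return "synonymous"      # nucleotides differ, protein unchanged
    return "nonsynonymous"       # at least one amino acid changed


reference = "AUGGCUGAU"  # Met-Ala-Asp
silent = "AUGGCCGAU"     # GCU -> GCC, still Ala
missense = "AUGGCUGAA"   # GAU -> GAA, Asp -> Glu

print(classify_mutation(reference, silent))    # synonymous
print(classify_mutation(reference, missense))  # nonsynonymous
```

In this toy case the silent mutant differs from the reference at the nucleotide level but is identical at the protein level, which is why the two levels can show different degrees of conservation.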
Amino acid sequences can be conserved to maintain the structure or function of a protein or domain. Conserved proteins undergo fewer amino acid replacements, or are more likely to substitute amino acids with similar biochemical properties. Within a sequence, amino acids that are important for folding, structural stability, or that form a binding site may be more highly conserved.
The nucleic acid sequence of a protein coding gene may also be conserved by other selective pressures. The codon usage bias in some organisms may restrict the types of synonymous mutations in a sequence. Nucleic acid sequences that cause secondary structure in the mRNA of a coding gene may be selected against, as some structures may negatively affect translation, or conserved where the mRNA also acts as a functional non-coding RNA.
Non-coding
Non-coding sequences important for gene regulation, such as the binding or recognition sites of ribosomes and transcription factors, may be conserved within a genome. For example, the promoter of a conserved gene or operon may also be conserved. As with proteins, nucleotides that are important for the structure and function of non-coding RNA (ncRNA) can also be conserved. However, sequence conservation in ncRNAs is generally poor compared to protein-coding sequences, and base pairs that contribute to structure or function are often conserved instead.
Identification
Conserved sequences are typically identified by bioinformatics approaches based on sequence alignment. Advances in high-throughput DNA sequencing and protein mass spectrometry have substantially increased the availability of protein sequences and whole genomes for comparison since the early 2000s.
Homology search
Conserved sequences may be identified by homology search, using tools such as BLAST, HMMER, OrthologR and Infernal. Homology search tools may take an individual nucleic acid or protein sequence as input, or use statistical models generated from multiple sequence alignments of known related sequences. Statistical models such as profile-HMMs, and RNA covariance models which also incorporate structural information, can be helpful when searching for more distantly related sequences. Input sequences are then aligned against a database of sequences from related individuals or other species. The resulting alignments are then scored based on the number of matching amino acids or bases, and the number of gaps or deletions generated by the alignment. Acceptable conservative substitutions may be identified using substitution matrices such as PAM and BLOSUM. High-scoring alignments are assumed to be from homologous sequences. The conservation of a sequence may then be inferred by detection of highly similar homologs over a broad phylogenetic range.
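The scoring step described above can be sketched in Python for a pair of sequences that have already been aligned. The function name and the match, mismatch and gap values are illustrative assumptions; real tools typically use substitution matrices such as PAM or BLOSUM and more elaborate gap penalties.

```python
def score_alignment(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Score two pre-aligned sequences of equal length.

    '-' marks a gap. The default scores are arbitrary illustrative
    values, not those of any particular tool.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score += gap        # any column containing a gap is penalised
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # substitution
    return score


# Six matches and one gap: 6 * 1 + (-2) = 4
print(score_alignment("GATTACA", "GA-TACA"))  # 4
# Five matches and two mismatches: 5 * 1 + 2 * (-1) = 3
print(score_alignment("GATTACA", "GACTATA"))  # 3
```

Replacing the fixed match and mismatch values with a lookup in a substitution matrix would let conservative substitutions score higher than non-conservative ones.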
Multiple sequence alignment
Multiple sequence alignments can be used to visualise conserved sequences. The CLUSTAL format includes a plain-text key to annotate conserved columns of the alignment, denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ). Sequence logos can also show conserved sequence by representing the proportions of characters at each point in the alignment by height.
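Both ideas can be approximated with a short Python sketch. It is a simplification under stated assumptions: the annotation line marks only fully identical columns with `*` (the `:` and `.` classes depend on residue similarity groups that are not reproduced here), and the per-column information content uses the uncorrected formula log2(alphabet size) minus Shannon entropy, which is one common way to set letter-stack heights in a sequence logo. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def identity_line(alignment):
    """Return '*' under fully identical columns and ' ' elsewhere."""
    return "".join(
        "*" if len(set(column)) == 1 else " "
        for column in zip(*alignment)
    )


def column_information(column, alphabet_size=4):
    """Information content of one column in bits (no small-sample correction)."""
    total = len(column)
    entropy = -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(column).values()
    )
    return math.log2(alphabet_size) - entropy


# Toy DNA alignment: the first two columns are fully conserved.
alignment = ["ACGT", "ACGA", "ACCA", "ACTA"]

print(repr(identity_line(alignment)))  # '**  '
for column in zip(*alignment):
    print("".join(column), round(column_information(column), 3))
```

A fully conserved DNA column reaches the maximum of 2 bits, while a column with all four bases at equal frequency carries 0 bits.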
Genome alignment
Whole genome alignments (WGAs) may also be used to identify highly conserved regions across species. Currently the accuracy and scalability of WGA tools remain limited due to the computational complexity of dealing with rearrangements, repeat regions and the large size of many eukaryotic genomes. However, WGAs of 30 or more closely related bacteria (prokaryotes) are now increasingly feasible.
Scoring systems
Other approaches use measurements of conservation based on statistical tests that attempt to identify sequences which mutate differently to an expected background (neutral) mutation rate.
The GERP (Genomic Evolutionary Rate Profiling) framework scores conservation of genetic sequences across species. This approach estimates the rate of neutral mutation in a set of species from a multiple sequence alignment, and then identifies regions of the sequence that exhibit fewer mutations than expected. These regions are then assigned scores based on the difference between the observed mutation rate and expected background mutation rate. A high GERP score then indicates a highly conserved sequence.
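The scoring idea can be sketched in Python as the difference between expected and observed substitutions at each alignment position. This illustrates the principle described above and is not the GERP implementation; the function names, the threshold and the numbers are made up for the example.

```python
def conservation_scores(observed, expected):
    """Expected minus observed substitutions for each alignment position.

    Positive scores mean fewer substitutions than the neutral
    expectation, i.e. apparent constraint.
    """
    if len(observed) != len(expected):
        raise ValueError("inputs must have equal length")
    return [exp - obs for obs, exp in zip(observed, expected)]


def conserved_positions(scores, threshold):
    """Indices of positions whose score meets an arbitrary threshold."""
    return [i for i, score in enumerate(scores) if score >= threshold]


# Hypothetical per-position substitution counts across a set of species.
observed = [0.1, 2.9, 0.0, 3.2]
expected = [3.0, 3.0, 3.0, 3.0]  # assumed neutral expectation

scores = conservation_scores(observed, expected)
print(conserved_positions(scores, threshold=2.0))  # [0, 2]
```

Positions 0 and 2 show far fewer substitutions than the assumed neutral expectation and would therefore receive high scores in a scheme of this kind.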
LIST (Local Identity and Shared Taxa) is based on the assumption that variations observed in species closely related to human are more significant when assessing conservation compared to those in distantly related species. Thus, LIST utilizes the local alignment identity around each position to identify relevant sequences in the multiple sequence alignment (MSA) and then it estimates conservation based on the taxonomy distances of these sequences to human. Unlike other tools, LIST ignores the count/frequency of variations in the MSA.
Aminode combines multiple alignments with phylogenetic analysis to analyze changes in homologous proteins and produce a plot that indicates the local rates of evolutionary changes. This approach identifies the Evolutionarily Constrained Regions in a protein, which are segments that are subject to purifying selection and are typically critical for normal protein function.
Other approaches such as PhyloP and PhyloHMM incorporate statistical phylogenetics methods to compare probability distributions of substitution rates, which allows the detection of both conservation and accelerated mutation. First, a background probability distribution is generated of the number of substitutions expected to occur for a column in a multiple sequence alignment, based on a phylogenetic tree. The estimated evolutionary relationships between the species of interest are used to calculate the significance of any substitutions (i.e. a substitution between two closely related species may be less likely to occur than one between distantly related species, and is therefore more significant). To detect conservation, a probability distribution is calculated for a subset of the multiple sequence alignment, and compared to the background distribution using a statistical test such as a likelihood-ratio test or score test. P-values generated from comparing the two distributions are then used to identify conserved regions. PhyloHMM uses hidden Markov models to generate probability distributions. The PhyloP software package compares probability distributions using a likelihood-ratio test or score test, as well as using a GERP-like scoring system.
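The logic of a likelihood-ratio test for conservation or acceleration can be sketched in Python under a deliberately simplified model. The sketch assumes that the number of substitutions in a region is Poisson distributed, compares a null model whose mean equals the neutral expectation against an alternative whose mean is fitted to the observed count, and uses the asymptotic chi-squared distribution with one degree of freedom to obtain a p-value. This is not how PhyloP or PhyloHMM are implemented, since they work with full phylogenetic substitution models; the function name and the numbers are hypothetical.

```python
import math


def poisson_lrt(observed, expected):
    """Likelihood-ratio test of observed vs. expected substitution counts.

    Null model: Poisson mean equals `expected` (neutral rate).
    Alternative: Poisson mean equals `observed` (its maximum-likelihood fit).
    Returns (test statistic, p-value, direction).
    """
    if expected <= 0 or observed < 0:
        raise ValueError("expected must be > 0 and observed >= 0")
    # 2 * (logL_alt - logL_null); the log(k!) terms cancel.
    log_term = observed * math.log(observed / expected) if observed > 0 else 0.0
    statistic = max(2.0 * (log_term - observed + expected), 0.0)
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    if observed < expected:
        direction = "conserved"
    elif observed > expected:
        direction = "accelerated"
    else:
        direction = "neutral"
    return statistic, p_value, direction


# Hypothetical region: 2 substitutions seen where 10 are expected.
statistic, p_value, direction = poisson_lrt(observed=2, expected=10)
print(round(statistic, 3), direction, p_value < 0.01)
```

Because the test is two-sided in the fitted rate, the same statistic flags both slower-than-neutral and faster-than-neutral regions, and the direction is reported separately.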
Extreme conservation
Ultra-conserved elements
Ultra-conserved elements or UCEs are sequences that are highly similar or identical across multiple taxonomic groupings. These were first discovered in vertebrates, and have subsequently been identified within widely-differing taxa. While the origin and function of UCEs are poorly understood, they have been used to investigate deep-time divergences in amniotes, insects, and between animals and plants.
Universally conserved genes
The most highly conserved genes are those that can be found in all organisms. These consist mainly of the ncRNAs and proteins required for transcription and translation, which are assumed to have been conserved from the last universal common ancestor of all life.
Genes or gene families that have been found to be universally conserved include GTP-binding elongation factors, methionine aminopeptidase 2, serine hydroxymethyltransferase, and ATP transporters. Components of the transcription machinery, such as RNA polymerase and helicases, and of the translation machinery, such as ribosomal RNAs, tRNAs and ribosomal proteins, are also universally conserved.
Applications
Phylogenetics and taxonomy
Sets of conserved sequences are often used for generating phylogenetic trees, as it can be assumed that organisms with similar sequences are closely related. The choice of sequences may vary depending on the taxonomic scope of the study. For example, the most highly conserved genes such as the 16S rRNA and other ribosomal sequences are useful for reconstructing deep phylogenetic relationships and identifying bacterial phyla in metagenomics studies. Sequences that are conserved within a clade but undergo some mutations, such as housekeeping genes, can be used to study species relationships. The internal transcribed spacer (ITS) region, which is required for spacing conserved rRNA genes but undergoes rapid evolution, is commonly used to classify fungi and strains of rapidly evolving bacteria.
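The assumption that similar sequences indicate close relationship is often made concrete as a pairwise distance matrix, which tree-building methods can take as input. The Python sketch below computes the proportion of differing sites (the p-distance) between aligned sequences; the function names, taxon labels and sequences are hypothetical, and practical analyses usually apply a substitution model rather than raw proportions.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in compared) / len(compared)


def distance_matrix(sequences):
    """Pairwise p-distances keyed by (name_a, name_b)."""
    return {
        (a, b): p_distance(sequences[a], sequences[b])
        for a, b in combinations(sequences, 2)
    }


# Made-up aligned sequences for three taxa.
sequences = {
    "taxon_a": "ACGTACGT",
    "taxon_b": "ACGTACGA",
    "taxon_c": "ACCTTCGA",
}

for pair, distance in distance_matrix(sequences).items():
    print(pair, distance)
```

Here taxon_a and taxon_b differ at one site in eight, so a distance-based method would group them together before joining taxon_c.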
Medical research
As highly conserved sequences often have important biological functions, they can be a useful starting point for identifying the cause of genetic diseases. Many congenital metabolic disorders and lysosomal storage diseases are the result of changes to individual conserved genes, resulting in missing or faulty enzymes that are the underlying cause of the symptoms of the disease. Genetic diseases may be predicted by identifying sequences that are conserved between humans and lab organisms such as mice or fruit flies, and studying the effects of knock-outs of these genes. Genome-wide association studies can also be used to identify variation in conserved sequences associated with disease or health outcomes. In Alzheimer's disease, for example, over two dozen novel potential susceptibility loci have been discovered.
Functional annotation
Identifying conserved sequences can be used to discover and predict functional sequences such as genes. Conserved sequences with a known function, such as protein domains, can also be used to predict the function of a sequence. Databases of conserved protein domains such as Pfam and the Conserved Domain Database can be used to annotate functional domains in predicted protein coding genes.
See also
* Evolutionary developmental biology
* NAPP (database)
* Segregating site
* Sequence alignment
* Sequence alignment software
* UCbase
* Ultra-conserved element