Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a speciation event (orthologs), or a duplication event (paralogs), or else a horizontal (or lateral) gene transfer event (xenologs).
Homology among DNA, RNA, or proteins is typically inferred from their nucleotide or amino acid sequence similarity. Significant similarity is strong evidence that two sequences are related by evolutionary changes from a common ancestral sequence. Alignments of multiple sequences are used to indicate which regions of each sequence are homologous.
Identity, similarity, and conservation
The term "percent homology" is often used to mean "sequence similarity": the percentage of identical residues (''percent identity''), or the percentage of residues conserved with similar physicochemical properties (''percent similarity''), e.g. leucine and isoleucine, is usually used to "quantify the homology." Based on the definition of homology specified above, this terminology is incorrect, since sequence similarity is the observation and homology is the conclusion. Sequences are either homologous or not. This means that the term "percent homology" is a misnomer.
As with morphological and anatomical structures, sequence similarity might occur because of convergent evolution, or, as with shorter sequences, by chance, meaning that the sequences are not homologous. Homologous sequence regions are also called conserved. This is not to be confused with conservation in amino acid sequences, where the amino acid at a specific position has been substituted with a different one that has functionally equivalent physicochemical properties.
Partial homology can occur where a segment of the compared sequences has a shared origin, while the rest does not. Such partial homology may result from a gene fusion event.
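The difference between percent identity and percent similarity described in this section can be illustrated with a minimal sketch. It assumes the two sequences are already aligned and of equal length, excludes gapped columns from the denominator (one of several conventions), and uses a hypothetical grouping of amino acids (`SIMILAR_GROUPS`) that is not taken from the article; real tools usually derive similarity from a substitution matrix. Both numbers are observations about similarity and do not by themselves establish homology.

```python
# Hypothetical grouping of amino acids by broad physicochemical class.
# Real tools usually derive "similar" pairs from a substitution matrix instead.
SIMILAR_GROUPS = [
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("KRH"),    # positively charged
    set("DE"),     # negatively charged
    set("STNQ"),   # polar, uncharged
]


def _similar(a, b):
    """True if a and b are identical or fall in the same illustrative group."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)


def percent_identity_and_similarity(seq1, seq2, gap="-"):
    """Return (percent identity, percent similarity) for two aligned sequences.

    Columns where either sequence has a gap are left out of the denominator.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
               if a != gap and b != gap]
    if not columns:
        return 0.0, 0.0
    identical = sum(a == b for a, b in columns)
    similar = sum(_similar(a, b) for a, b in columns)
    n = len(columns)
    return 100.0 * identical / n, 100.0 * similar / n


# Toy aligned peptides: L/I and E/D count as similar but not identical.
identity, similarity = percent_identity_and_similarity("MKLV-DE", "MKIVADD")
print(f"identity={identity:.1f}% similarity={similarity:.1f}%")
```

In the toy alignment, the leucine/isoleucine column raises the similarity score without raising the identity score, which is the distinction the text draws.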
Orthology
Homologous sequences are orthologous if they are inferred to be descended from the same ancestral sequence separated by a speciation event: when a species diverges into two separate species, the copies of a single gene in the two resulting species are said to be orthologous. Orthologs, or orthologous genes, are genes in different species that originated by vertical descent from a single gene of the last common ancestor. The term "ortholog" was coined in 1970 by the molecular evolutionist Walter Fitch.
For instance, the plant Flu regulatory protein is present both in ''Arabidopsis'' (a multicellular higher plant) and ''Chlamydomonas'' (a single-celled green alga). The ''Chlamydomonas'' version is more complex: it crosses the membrane twice rather than once, contains additional domains and undergoes alternative splicing. However, it can fully substitute for the much simpler ''Arabidopsis'' protein if transferred from the algal to the plant genome by means of genetic engineering. Significant sequence similarity and shared functional domains indicate that these two genes are orthologous genes, inherited from the shared ancestor.
Orthology is strictly defined in terms of ancestry. Given that the exact ancestry of genes in different organisms is difficult to ascertain due to gene duplication and genome rearrangement events, the strongest evidence that two similar genes are orthologous is usually found by carrying out phylogenetic analysis of the gene lineage. Orthologs often, but not always, have the same function.
Orthologous sequences provide useful information in taxonomic classification and phylogenetic studies of organisms. The pattern of genetic divergence can be used to trace the relatedness of organisms. Two organisms that are very closely related are likely to display very similar DNA sequences between two orthologs. Conversely, an organism that is further removed evolutionarily from another organism is likely to display a greater divergence in the sequence of the orthologs being studied.
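As a rough illustration of how divergence between orthologs can be quantified, the sketch below computes the proportion of differing sites (the p-distance) between aligned DNA sequences and applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). The choice of the Jukes-Cantor model and the toy sequences are assumptions made for this example, not something prescribed by the text above.

```python
import math


def p_distance(seq1, seq2, gap="-"):
    """Proportion of differing sites between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    sites = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a != gap and b != gap]
    if not sites:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in sites) / len(sites)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; the formula is undefined for p >= 0.75."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical orthologous fragments from a reference species, a close
# relative, and a more distant relative.
reference = "ATGGCCAAGTTCGACCTGAA"
close_rel = "ATGGCCAAGTTCGACCTGAG"
far_rel = "ATGACCAAATTTGATCTAAA"

for name, seq in [("close", close_rel), ("distant", far_rel)]:
    p = p_distance(reference, seq)
    print(name, round(p, 3), round(jukes_cantor(p), 3))
```

The more distant relative yields the larger distance, matching the expectation that orthologs diverge further as organisms become less closely related.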
Databases of orthologous genes
Given their tremendous importance for biology and bioinformatics, orthologous genes have been organized in several specialized databases that provide tools to identify and analyze orthologous gene sequences. These resources employ approaches that can be generally classified into those that use heuristic analysis of all pairwise sequence comparisons, and those that use phylogenetic methods. Sequence comparison methods were first pioneered in the COGs database in 1997. These methods have been extended and automated in twelve different databases, the most advanced being AYbRAH (Analyzing Yeasts by Reconstructing Ancestry of Homologs), alongside the following:
* eggNOG
* GreenPhylDB for plants
* InParanoid focuses on pairwise ortholog relationships
* OHNOLOGS is a repository of the genes retained from whole genome duplications in the vertebrate genomes including human and mouse.
* OMA
* OrthoDB appreciates that the orthology concept is relative to different speciation points by providing a hierarchy of orthologs along the species tree.
* OrthoInspector is a repository of orthologous genes for 4753 organisms covering the three domains of life
* OrthologID
* OrthoMaM for mammals
* OrthoMCL
* Roundup
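One widely used heuristic based on pairwise sequence comparisons is the reciprocal best hit criterion: two genes from different genomes are proposed as orthologs if each is the other's highest-scoring match. The sketch below is a minimal illustration of that idea only; it is not the procedure of any specific database listed above, and the gene names and scores are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) pairs to a similarity score, for example
    an alignment bit score (higher is better).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy all-against-all scores between genome A and genome B.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b1"): 120.0}
b_vs_a = {("b1", "a1"): 245.0, ("b1", "a3"): 118.0,
          ("b2", "a2"): 175.0, ("b2", "a1"): 88.0}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here `a3` has `b1` as its best hit, but `b1` prefers `a1`, so `a3` is left unpaired. A heuristic of this kind can miss or mislabel orthologs when duplications or gene losses have occurred, which is one motivation for the tree-based methods.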
Tree-based phylogenetic approaches aim to distinguish speciation from gene duplication events by comparing gene trees with species trees, as implemented in databases and software tools such as:
* LOFT
* TreeFam
* OrthoFinder
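The idea of separating speciation from duplication nodes in a gene tree can be sketched with a simplified "species overlap" rule: an internal node is labeled a duplication if its two child subtrees share at least one species, and a speciation otherwise. This is a swapped-in simplification that does not consult a species tree, so it is not the full gene-tree/species-tree comparison described above, nor the algorithm of any tool in the list. The nested-tuple tree encoding and the `species|gene` leaf naming are assumptions made for this example.

```python
def species_of(leaf):
    """Leaf labels are assumed to look like 'species|gene'."""
    return leaf.split("|")[0]


def label_events(tree):
    """Label each internal node of a binary gene tree.

    `tree` is either a leaf label (str) or a 2-tuple of subtrees.
    Returns a list of (leaves_under_node, label) with children listed
    before their parents.
    """
    events = []

    def visit(node):
        if isinstance(node, str):
            return {species_of(node)}, [node]
        left, right = node
        left_species, left_leaves = visit(left)
        right_species, right_leaves = visit(right)
        # Shared species on both sides implies the split predates speciation.
        kind = "duplication" if left_species & right_species else "speciation"
        leaves = left_leaves + right_leaves
        events.append((tuple(leaves), kind))
        return left_species | right_species, leaves

    visit(tree)
    return events


# A gene duplicated into copies A and B before human and mouse diverged.
gene_tree = (("human|A", "mouse|A"), ("human|B", "mouse|B"))
events = label_events(gene_tree)
for leaves, kind in events:
    print(kind, leaves)
```

Gene pairs whose most recent common node is a speciation (human A and mouse A) would be called orthologs, while pairs joined at a duplication node (human A and mouse B) would be called paralogs.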
A third category of hybrid approaches uses both heuristic and phylogenetic methods to construct clusters and determine trees, for example:
* EnsemblCompara GeneTrees
* HomoloGene
* Ortholuge
Paralogy
Paralogous genes are genes that are related via duplication events in the last common ancestor (LCA) of the species being compared. They result from the mutation of duplicated genes during separate speciation events. When descendants from the LCA share mutated homologs of the original duplicated genes, those genes are considered paralogs.
As an example, in the LCA, one gene (gene A) may get duplicated to make a separate similar gene (gene B); those two genes will continue to be passed to subsequent generations. During speciation, one environment will favor a mutation in gene A (gene A1), producing a new species with genes A1 and B. Then, in a separate speciation event, one environment will favor a mutation in gene B (gene B1), giving rise to a new species with genes A and B1. The descendants' genes A1 and B1 are paralogous to each other because they are homologs that are related via a duplication event in the last common ancestor of the two species.
Additional classifications of paralogs include alloparalogs (out-paralogs) and symparalogs (in-paralogs). Alloparalogs are paralogs that evolved from gene duplications that preceded the given speciation event. In other words, alloparalogs are paralogs that evolved from duplication events that happened in the LCA of the organisms being compared. The example above is an example of alloparalogy. Symparalogs are paralogs that evolved from gene duplication of paralogous genes in subsequent speciation events. From the example above, if the descendant with genes A1 and B underwent another speciation event where gene A1 duplicated, the new species would have genes B, A1a, and A1b. In this example, genes A1a and A1b are symparalogs.
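The distinction between alloparalogs and symparalogs depends only on whether the duplication came before or after the speciation event of interest. A minimal sketch of that rule follows; the function name and the numeric ages (time before present, in arbitrary units) are hypothetical and chosen to mirror the gene A / gene B example.

```python
def classify_paralogs(duplication_age, speciation_age):
    """Classify a paralog pair relative to one speciation event.

    Both ages are times before present in the same (arbitrary) unit.
    A duplication older than the speciation gives alloparalogs
    (out-paralogs); a younger one gives symparalogs (in-paralogs).
    """
    if duplication_age < 0 or speciation_age < 0:
        raise ValueError("ages must be non-negative")
    if duplication_age == speciation_age:
        raise ValueError("order of duplication and speciation is unresolved")
    if duplication_age > speciation_age:
        return "alloparalogs (out-paralogs)"
    return "symparalogs (in-paralogs)"


# Gene A duplicated into A and B in the LCA (age 100), before the
# speciation considered here (age 60): A1 and B1 are alloparalogs.
print(classify_paralogs(duplication_age=100, speciation_age=60))

# Gene A1 duplicated into A1a and A1b (age 20), after that speciation:
# A1a and A1b are symparalogs.
print(classify_paralogs(duplication_age=20, speciation_age=60))
```

The same pair of genes can therefore be classified differently depending on which speciation event is used as the reference point.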
Paralogous genes can shape the structure of whole genomes and thus explain genome evolution to a large extent. Examples include the Homeobox (Hox) genes in animals. These genes not only underwent gene duplications within chromosomes but also whole genome duplications. As a result, Hox genes in most vertebrates are clustered across multiple chromosomes, with the HoxA-D clusters being the best studied.
Another example is the globin genes, which encode myoglobin and hemoglobin and are considered to be ancient paralogs. Similarly, the four known classes of hemoglobins (hemoglobin A, hemoglobin A2, hemoglobin B, and hemoglobin F) are paralogs of each other. While each of these proteins serves the same basic function of oxygen transport, they have already diverged slightly in function: fetal hemoglobin (hemoglobin F) has a higher affinity for oxygen than adult hemoglobin. Function is not always conserved, however. Human angiogenin diverged from ribonuclease, for example, and while the two paralogs remain similar in tertiary structure, their functions within the cell are now quite different.
It is often asserted that orthologs are more functionally similar than paralogs of similar divergence, but several papers have challenged this notion.
Regulation
Paralogs are often regulated differently, e.g. by having different tissue-specific expression patterns (see Hox genes). However, they can also be regulated differently on the protein level. For instance, ''Bacillus subtilis'' encodes two paralogues of glutamate dehydrogenase: GudB is constitutively transcribed whereas RocG is tightly regulated. In their active, oligomeric states, both enzymes show similar enzymatic rates. However, swaps of enzymes and promoters cause severe fitness losses, thus indicating promoter–enzyme coevolution. Characterization of the proteins shows that, compared to RocG, GudB's enzymatic activity is highly dependent on glutamate and pH.
Paralogous chromosomal regions
Sometimes, large regions of chromosomes share gene content similar to other chromosomal regions within the same genome. They are well characterised in the human genome, where they have been used as evidence to support the 2R hypothesis. Sets of duplicated, triplicated and quadruplicated genes, with the related genes on different chromosomes, are deduced to be remnants from genome or chromosomal duplications. A set of paralogy regions is together called a paralogon. Well-studied sets of paralogy regions include regions of human chromosomes 2, 7, 12 and 17 containing Hox gene clusters, collagen genes, keratin genes and other duplicated genes; regions of human chromosomes 4, 5, 8 and 10 containing neuropeptide receptor genes, NK class homeobox genes and many more gene families; and parts of human chromosomes 13, 4, 5 and X containing the ParaHox genes and their neighbors. The major histocompatibility complex (MHC) on human chromosome 6 has paralogy regions on chromosomes 1, 9 and 19. Much of the human genome seems to be assignable to paralogy regions.
Ohnology
Ohnologous genes are paralogous genes that have originated by a process of 2R whole-genome duplication. The name was first given in honour of Susumu Ohno by Ken Wolfe. Ohnologues are useful for evolutionary analysis because all ohnologues in a genome have been diverging for the same length of time (since their common origin in the whole genome duplication). Ohnologues are also known to show greater association with cancers, dominant genetic disorders, and pathogenic copy number variations.
Xenology
Homologs resulting from horizontal gene transfer between two organisms are termed xenologs. Xenologs can have different functions if the new environment is vastly different for the horizontally moving gene. In general, though, xenologs have similar functions in both organisms. The term was coined by Walter Fitch.
Homoeology
Homoeologous (also spelled homeologous) chromosomes or parts of chromosomes are those brought together following inter-species hybridization and allopolyploidization to form a hybrid genome, and whose relationship was completely homologous in an ancestral species. In allopolyploids, the homologous chromosomes within each parental sub-genome should pair faithfully during meiosis, leading to disomic inheritance; however, in some allopolyploids, the homoeologous chromosomes of the parental genomes may be nearly as similar to one another as the homologous chromosomes, leading to tetrasomic inheritance (four chromosomes pairing at meiosis), intergenomic recombination, and reduced fertility.
Gametology
Gametology denotes the relationship between homologous genes on non-recombining, opposite sex chromosomes. The term was coined by García-Moreno and Mindell in 2000. Gametologs result from the origination of genetic sex determination and barriers to recombination between sex chromosomes. Examples of gametologs include CHDW and CHDZ in birds.
See also
* Deep homology
* EggNOG (database)
* OrthoDB
* Orthologous MAtrix (OMA)
* PhEVER
* Phylogenetics
* Protein family
* Protein superfamily
* TreeFam
* Syntelog