Molecular phylogenetics is the branch of phylogeny that analyzes genetic, hereditary molecular differences, predominantly in DNA sequences, to gain information on an organism's evolutionary relationships. From these analyses, it is possible to determine the processes by which diversity among species has been achieved. The result of a molecular phylogenetic analysis is expressed in a phylogenetic tree. Molecular phylogenetics is one aspect of molecular systematics, a broader term that also includes the use of molecular data in taxonomy and biogeography.
Molecular phylogenetics and molecular evolution are closely linked. Molecular evolution is the process of selective changes (mutations) at a molecular level (genes, proteins, etc.) throughout various branches in the tree of life (evolution). Molecular phylogenetics makes inferences about the evolutionary relationships that arise due to molecular evolution and results in the construction of a phylogenetic tree.
History
The theoretical frameworks for molecular systematics were laid in the 1960s in the works of Emile Zuckerkandl, Emanuel Margoliash, Linus Pauling, and Walter M. Fitch. Applications of molecular systematics were pioneered by Charles G. Sibley (birds), Herbert C. Dessauer (herpetology), and Morris Goodman (primates), followed by Allan C. Wilson, Robert K. Selander, and John C. Avise (who studied various groups). Work with protein electrophoresis began around 1956. Although the results were not quantitative and did not initially improve on morphological classification, they provided tantalizing hints that long-held notions of the classifications of birds, for example, needed substantial revision. In the period of 1974–1986, DNA–DNA hybridization was the dominant technique used to measure genetic difference.
Theoretical background
Early attempts at molecular systematics were also termed chemotaxonomy and made use of proteins, enzymes, carbohydrates, and other molecules that were separated and characterized using techniques such as chromatography. These have been replaced in recent times largely by DNA sequencing, which produces the exact sequences of nucleotides or ''bases'' in either DNA or RNA segments extracted using different techniques. In general, these are considered superior for evolutionary studies, since the actions of evolution are ultimately reflected in the genetic sequences. At present, it is still a long and expensive process to sequence the entire DNA of an organism (its genome). However, it is quite feasible to determine the sequence of a defined area of a particular chromosome. Typical molecular systematic analyses require the sequencing of around 1000 base pairs. At any location within such a sequence, the bases found in a given position may vary between organisms. The particular sequence found in a given organism is referred to as its haplotype. In principle, since there are four base types, with 1000 base pairs, we could have 4^1000 distinct haplotypes. However, for organisms within a particular species or in a group of related species, it has been found empirically that only a minority of sites show any variation at all, and most of the variations that are found are correlated, so that the number of distinct haplotypes that are found is relatively small.
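The gap between the theoretical number of haplotypes and the number actually observed can be illustrated with a minimal Python sketch. The five sequences below are invented toy data, and the helper names `variable_sites` and `haplotype_counts` are hypothetical, chosen only for this illustration; the sequences are assumed to be already aligned and of equal length.

```python
from collections import Counter


def variable_sites(sequences):
    """Return the 0-based positions at which the aligned sequences differ."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i in range(length) if len({s[i] for s in sequences}) > 1]


def haplotype_counts(sequences):
    """Count how many sampled individuals carry each distinct haplotype."""
    return Counter(sequences)


# Hypothetical toy sample: five individuals, eight aligned sites.
sample = [
    "ACGTACGT",
    "ACGTACGT",
    "ACGTACGA",
    "ACGAACGA",
    "ACGTACGT",
]

sites = variable_sites(sample)
counts = haplotype_counts(sample)
print("variable sites:", sites)
print("distinct haplotypes observed:", len(counts))
print("theoretical maximum for this length:", 4 ** len(sample[0]))
```

In this toy sample only two of the eight sites vary and three haplotypes occur, far fewer than the 4^8 sequences that are possible in principle.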

In a molecular systematic analysis, the haplotypes are determined for a defined area of genetic material. A substantial sample of individuals of the target species or other taxon is used, although many current studies are based on single individuals. Haplotypes of individuals of closely related, yet different, taxa are also determined. Finally, haplotypes from a smaller number of individuals from a definitely different taxon are determined: these are referred to as an outgroup. The base sequences for the haplotypes are then compared. In the simplest case, the difference between two haplotypes is assessed by counting the number of locations where they have different bases: this is referred to as the number of ''substitutions'' (other kinds of differences between haplotypes can also occur, for example, the ''insertion'' of a section of nucleic acid in one haplotype that is not present in another). The difference between organisms is usually re-expressed as a ''percentage divergence'', by dividing the number of substitutions by the number of base pairs analysed: the hope is that this measure will be independent of the location and length of the section of DNA that is sequenced.
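The substitution count and percentage divergence described above can be sketched in a few lines of Python. This is an illustrative simplification: the function names are hypothetical, the two haplotypes are invented, and the sketch assumes equal-length, gap-free sequences, so insertions and deletions are not handled.

```python
def count_substitutions(hap_a, hap_b):
    """Count the positions at which two equal-length haplotypes differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must have the same length")
    return sum(1 for a, b in zip(hap_a, hap_b) if a != b)


def percentage_divergence(hap_a, hap_b):
    """Substitutions divided by the number of sites compared, as a percentage."""
    return 100.0 * count_substitutions(hap_a, hap_b) / len(hap_a)


# Hypothetical haplotypes of ten sites that differ at two positions.
first = "ACGTACGTAC"
second = "ACGTTCGTAA"
print(count_substitutions(first, second))    # 2
print(percentage_divergence(first, second))  # 20.0
```

Dividing by the sequence length is what makes the measure comparable between regions of different size, which is the independence the text hopes for.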
An older and superseded approach was to determine the divergences between the genotypes of individuals by DNA–DNA hybridization. The advantage claimed for using hybridization rather than gene sequencing was that it was based on the entire genotype, rather than on particular sections of DNA. Modern sequence comparison techniques overcome this objection by the use of multiple sequences.
Once the divergences between all pairs of samples have been determined, the resulting triangular matrix of differences is submitted to some form of statistical cluster analysis, and the resulting dendrogram is examined in order to see whether the samples cluster in the way that would be expected from current ideas about the taxonomy of the group. Any group of haplotypes that are all more similar to one another than any of them is to any other haplotype may be said to constitute a clade, which may be represented visually, as the accompanying figure demonstrates. Statistical techniques such as bootstrapping and jackknifing help in providing reliability estimates for the positions of haplotypes within the evolutionary trees.
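As a rough illustration of the step from a matrix of pairwise divergences to a dendrogram, the following Python sketch applies UPGMA, which is only one of several possible clustering methods and is chosen here for brevity. The sequences, the taxon labels, and the function names `p_distance` and `upgma` are assumptions made for the example, and the dendrogram is returned as nested tuples without branch lengths.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(names, sequences):
    """Cluster sequences by UPGMA and return the dendrogram as nested tuples."""
    # Each cluster id maps to (subtree, number of leaves in the subtree).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by the unordered pair of cluster ids.
    dist = {
        frozenset((i, j)): p_distance(sequences[i], sequences[j])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two clusters that are currently closest.
        pair = min(
            (frozenset(p) for p in combinations(clusters, 2)),
            key=lambda p: dist[p],
        )
        i, j = sorted(pair)
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Distance to every remaining cluster is the size-weighted average.
        for k in clusters:
            d_ik = dist[frozenset((i, k))]
            d_jk = dist[frozenset((j, k))]
            dist[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


# Hypothetical aligned haplotypes: three ingroup samples and one outgroup.
names = ["A", "B", "C", "outgroup"]
sequences = ["ACGTACGTAC", "ACGTACGTAA", "ACGAACGTTA", "TCGATCCTTG"]
print(upgma(names, sequences))
```

In this toy data set, A and B are joined first and are then grouped with C before the outgroup is attached, which is the kind of nested grouping that is compared against the expected taxonomy.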
Techniques and applications
Every living organism contains deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and proteins. In general, closely related organisms have a high degree of similarity in the molecular structure of these substances, while the molecules of organisms distantly related often show a pattern of dissimilarity. Conserved sequences, such as mitochondrial DNA, are expected to accumulate mutations over time and, assuming a constant rate of mutation, provide a molecular clock for dating divergence. Molecular phylogeny uses such data to build a "relationship tree" that shows the probable evolution of various organisms. With the invention of Sanger sequencing in 1977, it became possible to isolate and identify these molecular structures. High-throughput sequencing may also be used to obtain the transcriptome of an organism, allowing inference of phylogenetic relationships using transcriptomic data.
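The molecular clock idea mentioned in this section can be made concrete with a small Python sketch. It assumes a strict clock, in which both lineages accumulate substitutions at the same constant rate, so that the time since divergence is the pairwise distance divided by twice the rate. The function name and the numerical values are hypothetical and are not taken from any real data set.

```python
def divergence_time(distance, rate_per_site_per_year):
    """Estimate the time since two lineages split under a strict molecular clock.

    distance: substitutions per site separating the two sequences.
    rate_per_site_per_year: substitutions per site per year along one lineage.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    # Both lineages have been diverging for the same time, hence the factor 2.
    return distance / (2.0 * rate_per_site_per_year)


# Hypothetical values: 2% divergence and 1e-8 substitutions per site per year.
years = divergence_time(0.02, 1e-8)
print(f"estimated divergence time: about {years:.3g} years")
```

The estimate is only as good as the assumption of a constant rate, which is why the text hedges the clock with "assuming a constant rate of mutation".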
The most common approach is the comparison of homologous sequences for genes using sequence alignment techniques to identify similarity. Another application of molecular phylogeny is in DNA barcoding, wherein the species of an individual organism is identified using small sections of mitochondrial DNA or chloroplast DNA. Another application of the techniques that make this possible can be seen in the very limited field of human genetics, such as the ever-more-popular use of genetic testing to determine a child's paternity, as well as the emergence of a new branch of criminal forensics focused on evidence known as genetic fingerprinting.
Molecular phylogenetic analysis
There are several methods available for performing a molecular phylogenetic analysis. One comprehensive step-by-step protocol for constructing a phylogenetic tree, covering DNA/amino acid contiguous sequence assembly, multiple sequence alignment, model testing (selecting the best-fitting substitution model), and phylogeny reconstruction using maximum likelihood and Bayesian inference, is available in ''Nature Protocols''.
Another molecular phylogenetic analysis technique has been described by Pevsner and is summarized here (Pevsner, 2015). A phylogenetic analysis typically consists of five major steps. The first stage comprises sequence acquisition. The second step consists of performing a multiple sequence alignment, which is the fundamental basis of constructing a phylogenetic tree. The third stage involves choosing a model of DNA or amino acid substitution. Several models of substitution exist; examples include the Hamming distance, the Jukes and Cantor one-parameter model, and the Kimura two-parameter model (see Models of DNA evolution). The normalized Hamming distance and the Jukes–Cantor correction formulas provide the degree of divergence and the probability that a nucleotide changes to another, respectively. The fourth stage consists of tree building, using either distance-based or character-based methods. Common tree-building methods include the unweighted pair group method using arithmetic mean (UPGMA) and neighbor joining, which are distance-based methods; maximum parsimony, which is a character-based method; and maximum likelihood estimation and Bayesian inference, which are character-based/model-based methods. UPGMA is a simple method; however, it is less accurate than the neighbor-joining approach. Finally, the last step comprises evaluating the trees. This assessment of accuracy is composed of consistency, efficiency, and robustness.
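The three distance measures named in the third stage can be compared with a short Python sketch. The formulas used are the standard Jukes–Cantor correction, d = −(3/4) ln(1 − 4p/3), and the Kimura two-parameter distance, d = −(1/2) ln(1 − 2P − Q) − (1/4) ln(1 − 2Q), where p is the proportion of differing sites and P and Q are the proportions of transitions and transversions. The function names and the two sequences are assumptions made for the example, and gap-free aligned DNA sequences are assumed.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def p_distance(a, b):
    """Normalized Hamming distance: the proportion of differing sites."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("sequences are saturated; the correction is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(a, b):
    """Kimura two-parameter distance, treating transitions and transversions separately."""
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1   # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    big_p = transitions / len(a)
    big_q = transversions / len(a)
    if 1.0 - 2.0 * big_p - big_q <= 0 or 1.0 - 2.0 * big_q <= 0:
        raise ValueError("sequences are too divergent for this correction")
    return -0.5 * math.log(1.0 - 2.0 * big_p - big_q) - 0.25 * math.log(1.0 - 2.0 * big_q)


# Hypothetical pair with one transition (A/G) and one transversion (C/A).
seq_a = "ACGTACGTAC"
seq_b = "GCGTACGTAA"
p = p_distance(seq_a, seq_b)
print("p-distance:", p)
print("Jukes-Cantor:", round(jukes_cantor(p), 4))
print("Kimura 2-parameter:", round(kimura_2p(seq_a, seq_b), 4))
```

Both corrected distances come out larger than the raw proportion of differences, because the models allow for multiple substitutions having occurred at the same site.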
MEGA (Molecular Evolutionary Genetics Analysis) is a user-friendly analysis program that is free to download and use. It supports both distance-based and character-based tree-building methods, and offers options such as heuristic search approaches and bootstrapping.
Bootstrapping is an approach that is commonly used to measure the robustness of the topology of a phylogenetic tree: it reports the percentage of replicates in which each clade is recovered. In general, a value greater than 70% is considered significant. The accompanying flow chart shows the order of the five stages of Pevsner's molecular phylogenetic analysis technique described above.
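The resampling idea behind bootstrap support can be sketched in Python as follows. This is a deliberately simplified stand-in: instead of rebuilding a full tree for every replicate and checking each clade, as phylogenetics software does, it only asks how often two chosen sequences remain the strictly closest pair when alignment columns are resampled with replacement. The function names, the toy alignment, and the choice of 200 replicates are all assumptions made for the example.

```python
import random


def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement to form one bootstrap replicate."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def pair_is_closest(alignment, i, j):
    """True if sequences i and j are strictly closer than every other pair."""
    n = len(alignment)
    target = p_distance(alignment[i], alignment[j])
    others = [
        p_distance(alignment[a], alignment[b])
        for a in range(n)
        for b in range(a + 1, n)
        if {a, b} != {i, j}
    ]
    return all(target < d for d in others)


def bootstrap_support(alignment, i, j, replicates=1000, seed=0):
    """Percentage of bootstrap replicates in which i and j form the closest pair."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    hits = sum(
        pair_is_closest(resample_columns(alignment, rng), i, j)
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


# Hypothetical alignment: is the grouping of the first two sequences robust?
alignment = ["ACGTACGTAC", "ACGTACGTAA", "ACGAACGTTA", "TCGATCCTTG"]
support = bootstrap_support(alignment, 0, 1, replicates=200, seed=1)
print(f"support for grouping sequences 0 and 1: {support:.1f}%")
```

With so few sites the support value fluctuates strongly between data sets, which is one reason real analyses use much longer alignments and many replicates.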
Limitations
Molecular systematics is an essentially cladistic approach: it assumes that classification must correspond to phylogenetic descent, and that all valid taxa must be monophyletic. This is a limitation when attempting to determine the optimal tree(s), which often involves bisecting and reconnecting portions of the phylogenetic tree(s).
The recent discovery of extensive horizontal gene transfer (HGT) among organisms provides a significant complication to molecular systematics, indicating that different genes within the same organism can have different phylogenies. HGTs can be detected and excluded using a number of phylogenetic methods.
In addition, molecular phylogenies are sensitive to the assumptions and models that go into making them. Firstly, sequences must be aligned; then, issues such as long-branch attraction, saturation, and taxon sampling problems must be addressed. This means that strikingly different results can be obtained by applying different models to the same dataset.
The tree-building method also brings with it specific assumptions about tree topology, evolutionary rates, and sampling. The simplistic UPGMA assumes a rooted tree and a uniform molecular clock, both of which can be incorrect.
The low resolving power of single genes has been overcome using multigene phylogenies, though the issue of HGT still merits careful algorithmic design.
See also
* Computational phylogenetics
* Microbial phylogenetics
* Molecular clock
* Molecular evolution
* PhyloCode
* Phylogenetic nomenclature
External links
* NCBI – Systematics and Molecular Phylogenetics
* MEGA Software
* Molecular phylogenetics from ''Encyclopædia Britannica''.