In biology, a substitution model, also called models of DNA sequence evolution, are

Markov model In probability theory, a Markov model is a stochastic model used to model pseudo-randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Mark ...

s that describe changes over evolutionary time. These models describe evolutionary changes in macromolecules (e.g.,

DNA sequence DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. T ...

s) represented as

sequence of symbols In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). ...

(A, C, G, and T in the case of DNA). Substitution models are used to calculate the

likelihood The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood functi ...

phylogenetic tree A phylogenetic tree (also phylogeny or evolutionary tree Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA.) is a branching diagram or a tree showing the evolutionary relationships among various biological spec ...

s using

multiple sequence alignment Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutio ...

data. Thus, substitution models are central to maximum likelihood estimation of phylogeny as well as

Bayesian inference in phylogeny Bayesian inference of phylogeny combines the information in the prior and in the data likelihood to create the so-called posterior probability of trees, which is the probability that the tree is correct given the data, the prior and the likelihood ...

. Estimates of evolutionary distances (numbers of substitutions that have occurred since a pair of sequences diverged from a common ancestor) are typically calculated using substitution models (evolutionary distances are used input for distance methods such as neighbor joining). Substitution models are also central to phylogenetic invariants because they are necessary to predict site pattern frequencies given a tree topology. Substitution models are also necessary to simulate sequence data for a group of organisms related by a specific tree. Site pattern frequencies models

Phylogenetic tree topologies and other parameters

Phylogenetic tree topologies are often the parameter of interest; thus, branch lengths and any other parameters describing the substitution process are often viewed as

nuisance parameter Nuisance (from archaic ''nocence'', through Fr. ''noisance'', ''nuisance'', from Lat. ''nocere'', "to hurt") is a common law tort. It means that which causes offence, annoyance, trouble or injury. A nuisance can be either public (also "commo ...

s. However, biologists are sometimes interested in the other aspects of the model. For example, branch lengths, especially when those branch lengths are combined with information from the

fossil record A fossil (from Classical Latin , ) is any preserved remains, impression, or trace of any once-living thing from a past geological age. Examples include bones, shells, exoskeletons, stone imprints of animals or microbes, objects preserved ...

and a model to estimate the timeframe for evolution. Other model parameters have been used to gain insights into various aspects of the process of evolution. The K_a/K_s ratio (also called ω in codon substitution models) is a parameter of interest in many studies. The K_a/K_s ratio can be used to examine the action of natural selection on protein-coding regions, it provides information about the relative rates of nucleotide substitutions that change amino acids (non-synonymous substitutions) to those that do not change the encoded amino acid (synonymous substitutions).

Application to sequence data

Most of the work on substitution models has focused on DNA/ RNA and

protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, res ...

sequence evolution. Models of DNA sequence evolution, where the

alphabet An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a syllab ...

corresponds to the four

nucleotide Nucleotides are organic molecules consisting of a nucleoside and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both of which are essential biomolecu ...

s (A, C, G, and T), are probably the easiest models to understand. DNA models can also be used to examine

RNA virus An RNA virus is a virusother than a retrovirusthat has ribonucleic acid ( RNA) as its genetic material. The nucleic acid is usually single-stranded RNA (ssRNA) but it may be double-stranded (dsRNA). Notable human diseases caused by RNA virus ...

evolution; this reflects the fact that RNA also has a four nucleotide alphabet (A, C, G, and U). However, substitution models can be used for alphabets of any size; the alphabet is the 20

proteinogenic amino acid Proteinogenic amino acids are amino acids that are incorporated biosynthetically into proteins during translation. The word "proteinogenic" means "protein creating". Throughout known life, there are 22 genetically encoded (proteinogenic) amino aci ...

s for proteins and the sense codons (i.e., the 61 codons that encode amino acids in the standard genetic code) for aligned protein-coding gene sequences. In fact, substitution models can be developed for any biological characters that can be encoded using a specific alphabet (e.g., amino acid sequences combined with information about the conformation of those amino acids in three-dimensional protein structures). The majority of substitution models used for evolutionary research assume independence among sites (i.e., the probability of observing any specific site pattern is identical regardless of where the site pattern is in the sequence alignment). This simplifies likelihood calculations because it is only necessary to calculate the probability of all site patterns that appear in the alignment then use those values to calculate the overall likelihood of the alignment (e.g., the probability of three "GGGG" site patterns given some model of DNA sequence evolution is simply the probability of a single "GGGG" site pattern raised to the third power). This means that substitution models can be viewed as implying a specific multinomial distribution for site pattern frequencies. If we consider a multiple sequence alignment of four DNA sequences there are 256 possible site patterns so there are 255

degrees of freedom Degrees of freedom (often abbreviated df or DOF) refers to the number of independent variables or parameters of a thermodynamic system. In various scientific fields, the word "freedom" is used to describe the limits to which physical movement or ...

for the site pattern frequencies. However, it is possible to specify the expected site pattern frequencies using five degrees of freedom if using the Jukes-Cantor model of DNA evolution, which is a simple substitution model that allows one to calculate the expected site pattern frequencies only the tree topology and the branch lengths (given four taxa an unrooted bifurcating tree has five branch lengths). Substitution models also make it possible to simulate sequence data using

Monte Carlo method Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying concept is to use randomness to solve problems that might be deter ...

s. Simulated multiple sequence alignments can be used to assess the performance of phylogenetic methods and generate the null distribution for certain statistical tests in the fields of

molecular evolution Molecular evolution is the process of change in the sequence composition of cellular molecules such as DNA, RNA, and proteins across generations. The field of molecular evolution uses principles of evolutionary biology and population genet ...

and molecular phylogenetics. Examples of these tests include tests of model fit and the "SOWH test" that can be used to examine tree topologies.

Application to morphological data

The fact that substitution models can be used to analyze any biological alphabet has made it possible to develop models of evolution for phenotypic datasets (e.g., morphological and behavioural traits). Typically, "0" is. used to indicate the absence of a trait and "1" is used to indicate the presence of a trait, although it is also possible to score characters using multiple states. Using this framework, we might encode a set of phenotypes as binary strings (this could be generalized to ''k''-state strings for characters with more than two states) before analyses using an appropriate mode. This can be illustrated using a "toy" example: we can use a binary alphabet to score the following phenotypic traits "has feathers", "lays eggs", "has fur", "is warm-blooded", and "capable of powered flight". In this toy example

hummingbird Hummingbirds are birds native to the Americas and comprise the Family (biology), biological family Trochilidae. With about 361 species and 113 genus, genera, they occur from Alaska to Tierra del Fuego, but the vast majority of the species are ...

s would have sequence 11011 (most other

bird Birds are a group of warm-blooded vertebrates constituting the class Aves (), characterised by feathers, toothless beaked jaws, the laying of hard-shelled eggs, a high metabolic rate, a four-chambered heart, and a strong yet lightweig ...

s would have the same string),

ostrich Ostriches are large flightless birds of the genus ''Struthio'' in the order Struthioniformes, part of the infra-class Palaeognathae, a diverse group of flightless birds also known as ratites that includes the emus, rheas, and kiwis. There ...

es would have the sequence 11010,

cattle Cattle (''Bos taurus'') are large, domesticated, cloven-hooved, herbivores. They are a prominent modern member of the subfamily Bovinae and the most widespread species of the genus '' Bos''. Adult females are referred to as cows and adult ...

(and most other land

mammal Mammals () are a group of vertebrate animals constituting the class Mammalia (), characterized by the presence of mammary glands which in females produce milk for feeding (nursing) their young, a neocortex (a region of the brain), fur ...

s) would have 00110, and bats would have 00111. The likelihood of a phylogenetic tree can then be calculated using those binary sequences and an appropriate substitution model. The existence of these morphological models make it possible to analyze data matrices with fossil taxa, either using the morphological data alone or a combination of morphological and molecular data (with the latter scored as missing data for the fossil taxa). There is an obvious similarity between use of molecular or phenotypic data in the field of

cladistics Cladistics (; ) is an approach to biological classification in which organisms are categorized in groups (" clades") based on hypotheses of most recent common ancestry. The evidence for hypothesized relationships is typically shared derived cha ...

and analyses of morphological characters using a substitution model. However, there has been
vociferous debate
in the

systematics Biological systematics is the study of the diversification of living forms, both past and present, and the relationships among living things through time. Relationships are visualized as evolutionary trees (synonyms: cladograms, phylogenetic t ...

community regarding the question of whether or not cladistic analyses should be viewed as "model-free". The field of cladistics (defined in the strictest sense) favor the use of the

maximum parsimony In phylogenetics, maximum parsimony is an optimality criterion under which the phylogenetic tree that minimizes the total number of character-state changes (or miminizes the cost of differentially weighted character-state changes) is preferred. ...

criterion for phylogenetic inference. Many cladists reject the position that maximum parsimony is based on a substitution model and (in many cases) they justify the use of parsimony using the philosophy of

Karl Popper Sir Karl Raimund Popper (28 July 1902 – 17 September 1994) was an Austrian-British philosopher, academic and social commentator. One of the 20th century's most influential philosophers of science, Popper is known for his rejection of the ...

. However, the existence of "parsimony-equivalent" models (i.e., substitution models that yield the maximum parsimony tree when used for analyses) makes it possible to view parsimony as a substitution model.

The molecular clock and the units of time

Typically, a branch length of a phylogenetic tree is expressed as the expected number of substitutions per site; if the evolutionary model indicates that each site within an ancestral sequence will typically experience ''x'' substitutions by the time it evolves to a particular descendant's sequence then the ancestor and descendant are considered to be separated by branch length ''x''. Sometimes a branch length is measured in terms of geological years. For example, a fossil record may make it possible to determine the number of years between an ancestral species and a descendant species. Because some species evolve at faster rates than others, these two measures of branch length are not always in direct proportion. The expected number of substitutions per site per year is often indicated with the Greek letter mu (μ). A model is said to have a strict

molecular clock The molecular clock is a figurative term for a technique that uses the mutation rate of biomolecules to deduce the time in prehistory when two or more life forms diverged. The biomolecular data used for such calculations are usually nucleo ...

if the expected number of substitutions per year μ is constant regardless of which species' evolution is being examined. An important implication of a strict molecular clock is that the number of expected substitutions between an ancestral species and any of its present-day descendants must be independent of which descendant species is examined. Note that the assumption of a strict molecular clock is often unrealistic, especially across long periods of evolution. For example, even though

rodent Rodents (from Latin , 'to gnaw') are mammals of the order Rodentia (), which are characterized by a single pair of continuously growing incisors in each of the upper and lower jaws. About 40% of all mammal species are rodents. They are n ...

s are genetically very similar to

primate Primates are a diverse order of mammals. They are divided into the strepsirrhines, which include the lemurs, galagos, and lorisids, and the haplorhines, which include the tarsiers and the simians ( monkeys and apes, the latter includin ...

s, they have undergone a much higher number of substitutions in the estimated time since divergence in some regions of the

genome In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ...

. This could be due to their shorter

generation time In population biology and demography, generation time is the average time between two consecutive generations in the lineages of a population. In human populations, generation time typically ranges from 22 to 33 years. Historians sometimes use this ...

, higher

metabolic rate Metabolism (, from el, μεταβολή ''metabolē'', "change") is the set of life-sustaining chemical reactions in organisms. The three main functions of metabolism are: the conversion of the energy in food to energy available to run cell ...

, increased population structuring, increased rate of

speciation Speciation is the evolutionary process by which populations evolve to become distinct species. The biologist Orator F. Cook coined the term in 1906 for cladogenesis, the splitting of lineages, as opposed to anagenesis, phyletic evolution withi ...

, or smaller body size. When studying ancient events like the

Cambrian explosion The Cambrian explosion, Cambrian radiation, Cambrian diversification, or the Biological Big Bang refers to an interval of time approximately in the Cambrian Period when practically all major animal phyla started appearing in the fossil record. ...

under a molecular clock assumption, poor concurrence between

cladistic Cladistics (; ) is an approach to biological classification in which organisms are categorized in groups ("clades") based on hypotheses of most recent common ancestry. The evidence for hypothesized relationships is typically shared derived char ...

and phylogenetic data is often observed. There has been some work on models allowing variable rate of evolution. Models that can take into account variability of the rate of the molecular clock between different evolutionary lineages in the phylogeny are called “relaxed” in opposition to “strict”. In such models the rate can be assumed to be correlated or not between ancestors and descendants and rate variation among lineages can be drawn from many distributions but usually exponential and lognormal distributions are applied. There is a special case, called “local molecular clock” when a phylogeny is divided into at least two partitions (sets of lineages) and a strict molecular clock is applied in each, but with different rates.

Time-reversible and stationary models

Many useful substitution models are

time-reversible A mathematical or physical process is time-reversible if the dynamics of the process remain well-defined when the sequence of time-states is reversed. A Deterministic system, deterministic process is time-reversible if the time-reversed process sa ...

; in terms of the mathematics, the model does not care which sequence is the ancestor and which is the descendant so long as all other parameters (such as the number of substitutions per site that is expected between the two sequences) are held constant. When an analysis of real biological data is performed, there is generally no access to the sequences of ancestral species, only to the present-day species. However, when a model is time-reversible, which species was the ancestral species is irrelevant. Instead, the phylogenetic tree can be rooted using any of the species, re-rooted later based on new knowledge, or left unrooted. This is because there is no 'special' species, all species will eventually derive from one another with the same probability. A model is time reversible if and only if it satisfies the property (the notation is explained below) :

\pi_iQ_ = \pi_jQ_

or, equivalently, the

detailed balance The principle of detailed balance can be used in kinetic systems which are decomposed into elementary processes (collisions, or steps, or elementary reactions). It states that at equilibrium, each elementary process is in equilibrium with its reve ...

property, :

\pi_iP(t)_ = \pi_jP(t)_

for every ''i'', ''j'', and ''t''. Time-reversibility should not be confused with stationarity. A model is stationary if ''Q'' does not change with time. The analysis below assumes a stationary model.

The mathematics of substitution models

Stationary, neutral, independent, finite sites models (assuming a constant rate of evolution) have two parameters, ''π'', an equilibrium vector of base (or character) frequencies and a rate matrix, ''Q'', which describes the rate at which bases of one type change into bases of another type; element

Q_

for ''i'' ≠ ''j'' is the rate at which base ''i'' goes to base ''j''. The diagonals of the ''Q'' matrix are chosen so that the rows sum to zero: :

Q_ = -  \,,

The equilibrium row vector ''π'' must be annihilated by the rate matrix ''Q'': :

\pi \, Q = 0\,.

The transition matrix function is a function from the branch lengths (in some units of time, possibly in substitutions), to a

matrix Matrix most commonly refers to: * ''The Matrix'' (franchise), an American media franchise ** '' The Matrix'', a 1999 science-fiction action film ** "The Matrix", a fictional setting, a virtual reality environment, within ''The Matrix'' (franchi ...

of conditional probabilities. It is denoted

P(t)

. The entry in the ''i''^th column and the ''j''^th row,

P_(t)

, is the probability, after time ''t'', that there is a base ''j'' at a given position, conditional on there being a base ''i'' in that position at time 0. When the model is time reversible, this can be performed between any two sequences, even if one is not the ancestor of the other, if you know the total branch length between them. The asymptotic properties of ''P''_''ij''(t) are such that ''P''_''ij''(0) = δ_''ij'', where δ_''ij'' is the

Kronecker delta In mathematics, the Kronecker delta (named after Leopold Kronecker) is a function of two variables, usually just non-negative integers. The function is 1 if the variables are equal, and 0 otherwise: \delta_ = \begin 0 &\text i \neq j, \\ 1 & ...

function. That is, there is no change in base composition between a sequence and itself. At the other extreme,

\lim_ P_(t) = \pi_\,,

or, in other words, as time goes to infinity the probability of finding base ''j'' at a position given there was a base ''i'' at that position originally goes to the equilibrium probability that there is base ''j'' at that position, regardless of the original base. Furthermore, it follows that

\pi P(t) = \pi

for all ''t''. The transition matrix can be computed from the rate matrix via matrix exponentiation: :

P(t) = e^ = \sum_^\infty Q^n\frac\,,

where ''Q''^''n'' is the matrix ''Q'' multiplied by itself enough times to give its ''n''^th power. If ''Q'' is

diagonalizable In linear algebra, a square matrix A is called diagonalizable or non-defective if it is similar to a diagonal matrix, i.e., if there exists an invertible matrix P and a diagonal matrix D such that or equivalently (Such D are not unique.) F ...

, the matrix exponential can be computed directly: let ''Q'' = ''U''⁻¹ Λ ''U'' be a diagonalization of ''Q'', with :

\Lambda = \begin
\lambda_1 & \ldots & 0 \\
\vdots & \ddots & \vdots \\
0 & \ldots & \lambda_4
\end\,,

where Λ is a diagonal matrix and where

\lbrace \lambda_i \rbrace

are the eigenvalues of ''Q'', each repeated according to its multiplicity. Then :

P(t) = e^ = e^ = U^ e^\,U\,,

where the diagonal matrix ''e''^Λt is given by :

e^ = \begin
e^ & \ldots & 0 \\
\vdots & \ddots & \vdots \\
0 & \ldots & e^
\end\,.

Generalised time reversible

Generalised time reversible (GTR) is the most general neutral, independent, finite-sites, time-reversible model possible. It was first described in a general form by Simon Tavaré in 1986. The GTR model is often called the general time reversible model in publications; it has also been called the REV model. The GTR parameters for nucleotides consist of an equilibrium base frequency vector,

\vec = (\pi_1, \pi_2, \pi_3, \pi_4)

, giving the frequency at which each base occurs at each site, and the rate matrix :

Q = \begin  & x_1 & x_2 & x_3 \\  &  & x_4 & x_5 \\  &  &  & x_6 \\   &  &  &  \end

Because the model must be time reversible and must approach the equilibrium nucleotide (base) frequencies at long times, each rate below the diagonal equals the reciprocal rate above the diagonal multiplied by the equilibrium ratio of the two bases. As such, the nucleotide GTR requires 6 substitution rate parameters and 4 equilibrium base frequency parameters. Since the 4 frequency parameters must sum to 1, there are only 3 free frequency parameters. The total of 9 free parameters is often further reduced to 8 parameters plus

\mu

, the overall number of substitutions per unit time. When measuring time in substitutions (

\mu

=1) only 8 free parameters remain. In general, to compute the number of parameters, you count the number of entries above the diagonal in the matrix, i.e. for n trait values per site

, and then add ''n-1'' for the equilibrium frequencies, and subtract 1 because

\mu

is fixed. You get :

+ (n - 1) - 1 = n^2 + n - 2.

For example, for an amino acid sequence (there are 20 "standard"

amino acids Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although hundreds of amino acids exist in nature, by far the most important are the alpha-amino acids, which comprise proteins. Only 22 alpha am ...

that make up

proteins Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, respo ...

), you would find there are 208 parameters. However, when studying coding regions of the genome, it is more common to work with a

codon The genetic code is the set of rules used by living cells to translate information encoded within genetic material ( DNA or RNA sequences of nucleotide triplets, or codons) into proteins. Translation is accomplished by the ribosome, which links ...

substitution model (a codon is three bases and codes for one amino acid in a protein). There are

4^3 = 64

codons, resulting in 2078 free parameters. However, the rates for transitions between codons which differ by more than one base are often assumed to be zero, reducing the number of free parameters to only

+ 63 - 1 = 632

parameters. Another common practice is to reduce the number of codons by forbidding the stop (or

nonsense Nonsense is a communication, via speech, writing, or any other symbolic system, that lacks any coherent meaning. Sometimes in ordinary usage, nonsense is synonymous with absurdity or the ridiculous. Many poets, novelists and songwriters have u ...

) codons. This is a biologically reasonable assumption because including the stop codons would mean that one is calculating the probability of finding sense codon ''

j

'' after time ''

t

'' given that the ancestral codon is ''

i

'' would involve the possibility of passing through a state with a premature stop codon. An alternative (and commonly used) way to write the instantaneous rate matrix (''

Q

'' matrix) for the nucleotide GTR model is:

Q = \begin  & a\pi_C & b\pi_G & c\pi_T \\ a\pi_A &  & d\pi_G & e\pi_T \\ b\pi_A & d\pi_C &  & f\pi_T \\ c\pi_A & e\pi_C & f\pi_G &  \end

The ''

Q

'' matrix is normalized so

-\sum_^4 \pi_i Q_ = 1

. This notation is easier to understand than the notation originally used by Tavaré, because all model parameters correspond either to "exchangeability" parameters (''

a

'' through ''

f

'', which can also be written using the notation ''

r_

'') or to equilibrium

frequencies

\vec = (\pi_A, \pi_C, \pi_G, \pi_T)

. Note that the nucleotides in the ''

Q

'' matrix have been written in alphabetical order. In other words, the transition probability matrix for the ''

Q

'' matrix above would be:

P(t) = e^ = \begin
            p_\mathrm(t) & p_\mathrm(t) & p_\mathrm(t) & p_\mathrm(t) \\
            p_\mathrm(t) & p_\mathrm(t) & p_\mathrm(t) & p_\mathrm(t) \\
            p_\mathrm(t) & p_\mathrm(t) & p_\mathrm(t) & p_\mathrm(t) \\
            p_\mathrm(t) & p_\mathrm(t) & p_\mathrm(t) & p_\mathrm(t)
       \end

Some publications write the nucleotides in a different order (e.g., some authors choose to group two

purine Purine is a heterocyclic aromatic organic compound that consists of two rings ( pyrimidine and imidazole) fused together. It is water-soluble. Purine also gives its name to the wider class of molecules, purines, which include substituted purines ...

s together and the two

pyrimidine Pyrimidine (; ) is an aromatic, heterocyclic, organic compound similar to pyridine (). One of the three diazines (six-membered heterocyclics with two nitrogen atoms in the ring), it has nitrogen atoms at positions 1 and 3 in the ring. The othe ...

s together; see also

models of DNA evolution A number of different Markov models of DNA sequence evolution have been proposed. These substitution models differ in terms of the parameters used to describe the rates at which one nucleotide replaces another during evolution. These models are ...

). These difference in notation make it important to be clear regarding the order of the states when writing the ''

Q

'' matrix. The value of this notation is that instantaneous rate of change from nucleotide ''

i

'' to nucleotide ''

j

'' can always be written as ''

r_\pi_j

'', where ''

r_

'' is the exchangeability of nucleotides ''

i

'' and ''

j

'' and ''

\pi_j

'' is the equilibrium frequency of the ''

j^

'' nucleotide. The matrix shown above uses the letters ''

a

'' through ''

f

'' for the exchangeability parameters in the interest of readability, but those parameters could also be to written in a systematic manner using the ''

r_

'' notation (e.g.,

a = r_

b = r_

, and so forth). Note that the ordering of the nucleotide subscripts for exchangeability parameters is irrelevant (e.g.,

r_ = r_

) but the transition probability matrix values are not (i.e.,

p_\mathrm(t)

is the probability of observing A in sequence 1 and C in sequence 2 when the evolutionary distance between those sequences is ''

t

'' whereas

p_\mathrm(t)

is the probability of observing C in sequence 1 and A in sequence 2 at the same evolutionary distance). An arbitrarily chosen exchangeability parameters (e.g., ''

f=r_

'') is typically set to a value of 1 to increase the readability of the exchangeability parameter estimates (since it allows users to express those values relative to chosen exchangeability parameter). The practice of expressing the exchangeability parameters in relative terms is not problematic because the ''

Q

'' matrix is normalized. Normalization allows ''

t

'' (time) in the matrix exponentiation

P(t) = e^

to be expressed in units of expected substitutions per site (standard practice in molecular phylogenetics). This is the equivalent to the statement that one is setting the mutation rate

\mu

to 1) and reducing the number of free parameters to eight. Specifically, there are five free exchangeability parameters (''

a

'' through ''

e

'', which are expressed relative to the fixed ''

f=r_=1

'' in this example) and three equilibrium base frequency parameters (as described above, only three ''

\pi_i

'' values need to be specified because

\vec

must sum to 1). The alternative notation also makes it easier to understand the sub-models of the GTR model, which simply correspond to cases where exchangeability and/or equilibrium base frequency parameters are constrained to take on equal values. A number of specific sub-models have been named, largely based on their original publications: There are 203 possible ways that the exchangeability parameters can be restricted to form sub-models of GTR, ranging from the JC69 and F81 models (where all exchangeability parameters are equal) to the SYM model and the full GTR (or REV) model (where all exchangeability parameters are free). The equilibrium base frequencies are typically treated in two different ways: 1) all

\pi_i

values are constrained to be equal (i.e.,

\pi_A = \pi_C = \pi_G = \pi_T = 0.25

); or 2) all

\pi_i

values are treated as free parameters. Although the equilibrium base frequencies can be constrained in other ways most constraints that link some but not all

\pi_i

values are unrealistic from a biological standpoint. The possible exception is enforcing strand symmetry (i.e., constraining

\pi_A = \pi_T

and

\pi_C = \pi_G

but allowing

\pi_A + \pi_T \neq \pi_C + \pi_G

). The alternative notation also makes it straightforward to see how the GTR model can be applied to biological alphabets with a larger state-space (e.g.,

codons The genetic code is the set of rules used by living cells to translate information encoded within genetic material ( DNA or RNA sequences of nucleotide triplets, or codons) into proteins. Translation is accomplished by the ribosome, which links ...

). It is possible to write a set of equilibrium state frequencies as

\pi_1

\pi_2

, ...

\pi_k

and a set of exchangeability parameters (''

r_

'') for any alphabet of

k

character states. These values can the be used to populate the ''

Q

'' matrix by setting the off-diagonal elements as shown above (the general notation would be

Q_=r_\pi_j

), setting the diagonal elements

Q_

to the negative sum of the off-diagonal elements on the same row, and normalizing. Obviously,

k=20

for

and

k=61

for

(assuming the standard genetic code). However, the generality of this notation is beneficial because one can use reduced alphabets for amino acids. For example, one can use

k=6

and encode amino acids by recoding the amino acids using the six categories proposed by Margaret Dayhoff. Reduced amino acid alphabets are viewed as a way to reduce the impact of compositional variation and saturation.

Mechanistic vs. empirical models

A main difference in evolutionary models is how many parameters are estimated every time for the data set under consideration and how many of them are estimated once on a large data set. Mechanistic models describe all substitutions as a function of a number of parameters which are estimated for every data set analyzed, preferably using

maximum likelihood In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed stat ...

. This has the advantage that the model can be adjusted to the particularities of a specific data set (e.g. different composition biases in DNA). Problems can arise when too many parameters are used, particularly if they can compensate for each other (this can lead to non-identifiability). Then it is often the case that the data set is too small to yield enough information to estimate all parameters accurately. Empirical models are created by estimating many parameters (typically all entries of the rate matrix as well as the character frequencies, see the GTR model above) from a large data set. These parameters are then fixed and will be reused for every data set. This has the advantage that those parameters can be estimated more accurately. Normally, it is not possible to estimate all entries of the

substitution matrix In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in ...

from the current data set only. On the downside, the parameters estimated from the training data might be too generic and therefore have a poor fit to any particular dataset. A potential solution for that problem is to estimate some parameters from the data using

(or some other method). In studies of protein evolution the equilibrium amino acid frequencies

\vec = (\pi_A, \pi_R, \pi_N, ... \pi_V)

(using the one-letter IUPAC codes for amino acids to indicate their equilibrium frequencies) are often estimated from the data while keeping the exchangeability matrix fixed. Beyond the common practice of estimating amino acid frequencies from the data, methods to estimate exchangeability parameters or adjust the ''

Q

'' matrix for protein evolution in other ways have been proposed. With the large-scale genome sequencing still producing very large amounts of DNA and protein sequences, there is enough data available to create empirical models with any number of parameters, including empirical codon models. Because of the problems mentioned above, the two approaches are often combined, by estimating most of the parameters once on large-scale data, while a few remaining parameters are then adjusted to the data set under consideration. The following sections give an overview of the different approaches taken for DNA, protein or codon-based models.

DNA substitution models

The first models of DNA evolution was proposed

Jukes Jukes is a surname. Notable people with the surname include: * Andrew Jukes (theologian) (1815–1901) *Andrew Jukes (missionary) (1847–1931), Anglican missionary * Betty Jukes (1910–2006), British sculptor * Bill Jukes (c.1883–1939), English ...

and

Cantor A cantor or chanter is a person who leads people in singing or sometimes in prayer. In formal Jewish worship, a cantor is a person who sings solo verses or passages to which the choir or congregation responds. In Judaism, a cantor sings and lead ...

in 1969. The Jukes-Cantor (JC or JC69) model assumes equal transition rates as well as equal equilibrium frequencies for all bases and it is the simplest sub-model of the GTR model. In 1980,

Motoo Kimura (November 13, 1924 – November 13, 1994) was a Japanese biologist best known for introducing the neutral theory of molecular evolution in 1968. He became one of the most influential theoretical population geneticists. He is remembered in genet ...

introduced a model with two parameters (K2P or K80): one for the transition and one for the

transversion Transversion, in molecular biology, refers to a point mutation in DNA in which a single (two ring) purine ( A or G) is changed for a (one ring) pyrimidine ( T or C), or vice versa. A transversion can be spontaneous, or it can be caused by i ...

rate. A year later,

Kimura Kimura (written: lit. "tree village") is the 17th most common Japanese surname. Notable people with the surname include: *, Japanese novelist *, Japanese footballer *, Japanese botanist *, Japanese idol and singer *, Japanese footballer *, Japanes ...

introduced a second model (K3ST, K3P, or K81) with three substitution types: one for the transition rate, one for the rate of

s that conserve the strong/weak properties of nucleotides (

A\leftrightarrow T

and

C\leftrightarrow G

, designated

\beta

by Kimura), and one for rate of

s that conserve the amino/keto properties of nucleotides (

A\leftrightarrow C

and

G\leftrightarrow T

, designated

\gamma

by Kimura). An 1981,

Joseph Felsenstein Joseph "Joe" Felsenstein (born May 9, 1942) is a Professor Emeritus in the Departments of Genome Sciences and Biology at the University of Washington in Seattle. He is best known for his work on phylogenetic inference, and is the author of ''Infer ...

proposed a four-parameter model (F81) in which the substitution rate corresponds to the equilibrium frequency of the target nucleotide. Hasegawa, Kishino, and Yano unified the two last models to a five-parameter model (HKY). After these pioneering efforts, many additional sub-models of the GTR model were introduced into the literature (and common use) in the 1990s. Other models that move beyond the GTR model in specific ways were also developed and refined by several researchers. Almost all DNA substitution models are mechanistic models (as described above). The small number of parameters that one needs to estimate for these models makes it feasible to estimate those parameters from the data. It is also necessary because the patterns of DNA sequence evolution often differ among organisms and among genes within organisms. The later may reflect optimization by the action of selection for specific purposes (e.g. fast

expression Expression may refer to: Linguistics * Expression (linguistics), a word, phrase, or sentence * Fixed expression, a form of words with a specific meaning * Idiom, a type of fixed expression * Metaphorical expression, a particular word, phrase, o ...

or messenger RNA stability) or it might reflect neutral variation in the patterns of substitution. Thus, depending on the organism and the type of gene, it is likely necessary to adjust the model to these circumstances.

Two-state substitution models

An alternative way to analyze DNA sequence data is to recode the nucleotides as purines (R) and pyrimidines (Y); this practice is often called RY-coding. Insertions and deletions in multiple sequence alignments can also be encoded as binary data and analyzed in using a two-state model. The simplest two-state model of sequence evolution is called the Cavender-Farris model or the Cavender-Farris- Neyman (CFN) model; the name of this model reflects the fact that it was described independently in several different publications. The CFN model is identical to the Jukes-Cantor model adapted to two states and it has even been implemented as the "JC2" model in the popular IQ-TREE software package (using this model in IQ-TREE requires coding the data as 0 and 1 rather than R and Y; the popular

PAUP* PAUP* (Phylogenetic Analysis Using Parsimony *and other methods) is a computational phylogenetics program for inferring evolutionary trees (Phylogenetics, phylogenies), written by David L. Swofford. Originally, as the name implies, PAUP only implem ...

software package can interpret a data matrix comprising only R and Y as data to be analyzed using the CFN model). It is also straightforward to analyze binary data using the phylogenetic

Hadamard transform The Hadamard transform (also known as the Walsh–Hadamard transform, Hadamard–Rademacher–Walsh transform, Walsh transform, or Walsh–Fourier transform) is an example of a generalized class of Fourier transforms. It performs an orthogonal ...

. The alternative two-state model allows the equilibrium frequency parameters of R and Y (or 0 and 1) to take on values other than 0.5 by adding single free parameter; this model is variously called CFu or GTR2 (in IQ-TREE).

Amino acid substitution models

For many analyses, particularly for longer evolutionary distances, the evolution is modeled on the amino acid level. Since not all DNA substitution also alter the encoded amino acid, information is lost when looking at amino acids instead of nucleotide bases. However, several advantages speak in favor of using the amino acid information: DNA is much more inclined to show compositional bias than amino acids, not all positions in the DNA evolve at the same speed ( non-synonymous mutations are less likely to become fixed in the population than

synonymous A synonym is a word, morpheme, or phrase that means exactly or nearly the same as another word, morpheme, or phrase in a given language. For example, in the English language, the words ''begin'', ''start'', ''commence'', and ''initiate'' are a ...

ones), but probably most important, because of those fast evolving positions and the limited alphabet size (only four possible states), the DNA suffers from more back substitutions, making it difficult to accurately estimate evolutionary longer distances. Unlike the DNA models, amino acid models traditionally are empirical models. They were pioneered in the 1960s and 1970s by Dayhoff and co-workers by estimating replacement rates from protein alignments with at least 85% identity (originally with very limited data and ultimately culminating in the Dayhoff PAM model of 1978). This minimized the chances of observing multiple substitutions at a site. From the estimated rate matrix, a series of replacement probability matrices were derived, known under names such as PAM250. Log-odds matrices based on the Dayhoff PAM model were commonly used to assess the significance of homology search results, although the BLOSUM matrices have superseded the PAM log-odds matrices in this context because the BLOSUM matrices appear to be more sensitive across a variety of evolutionary distances, unlike the PAM log-odds matrices. The Dayhoff PAM matrix was the source of the exchangeability parameters used in one of the first maximum-likelihood analyses of phylogeny that used protein data and the PAM model (or an improved version of the PAM model called DCMut) continues to be used in phylogenetics. However, the limited number of alignments used to generate the PAM model (reflecting the limited amount of sequence data available in the 1970s) almost certainly inflated the variance of some rate matrix parameters (alternatively, the proteins used to generate the PAM model could have been a non-representative set). Regardless, it is clear that the PAM model seldom has as good of a fit to most datasets as more modern empirical models (Keane et al. 2006 tested thousands of

vertebrate Vertebrates () comprise all animal taxa within the subphylum Vertebrata () ( chordates with backbones), including all mammals, birds, reptiles, amphibians, and fish. Vertebrates represent the overwhelming majority of the phylum Chordata, with ...

bacteria Bacteria (; singular: bacterium) are ubiquitous, mostly free-living organisms often consisting of one biological cell. They constitute a large domain of prokaryotic microorganisms. Typically a few micrometres in length, bacteria were am ...

l, and

archaea Archaea ( ; singular archaeon ) is a domain of single-celled organisms. These microorganisms lack cell nuclei and are therefore prokaryotes. Archaea were initially classified as bacteria, receiving the name archaebacteria (in the Archaeba ...

l proteins and they found that the Dayhoff PAM model had the best-fit to at most <4% of the proteins). Starting in the 1990s, the rapid expansion of sequence databases due to improved sequencing technologies led to the estimation of many new empirical matrices (see for a complete list). The earliest efforts used methods similar to those used by Dayhoff, using large-scale matching of the protein database to generate a new log-odds matrix and the JTT (Jones-Taylor-Thornton) model. The rapid increases in compute power during this time (reflecting factors such as

Moore's law Moore's law is the observation that the number of transistors in a dense integrated circuit (IC) doubles about every two years. Moore's law is an observation and projection of a historical trend. Rather than a law of physics, it is an empi ...

) made it feasible to estimate parameters for empirical models using

(e.g., the WAG and LG models) and other methods (e.g., the VT and PMB models). The IQ-Tree software package allows users to infer their own time reversible model using QMaker, or non-time-reversible using nQMaker. Another set of substitution models of protein evolution directly incorporate information from the protein structure and provide a more realistic modeling in terms of likelihood and biological meaning.

The no common mechanism (NCM) model and maximum parsimony

In 1997, Tuffley and Steel described a model that they named the no common mechanism (NCM) model. The topology of the

tree for a specific dataset given the NCM model is identical to the topology of the optimal tree for the same data given the

criterion. The NCM model assumes all of the data (e.g., homologous nucleotides, amino acids, or morphological characters) are related by a common phylogenetic tree. Then ''

2T-3

'' parameters are introduced for each homologous character, where ''

T

'' is the number of sequences. This can be viewed as estimating a separate rate parameter for every character × branch pair in the dataset (note that the number of branches in a fully resolved phylogenetic tree is ''

2T-3

''). Thus, the number of free parameters in the NCM model always exceeds the number of homologous characters in the data matrix, and the NCM model has been criticized as consistently "over-parameterized."

References

External links

Empirical Models of Amino Acid Substitution

Notes

{{notelist Bioinformatics Stochastic models Computational phylogenetics Statistical genetics