Distance matrices are used in phylogeny as

non-parametric Nonparametric statistics is the branch of statistics that is not based solely on parametrized families of probability distributions (common examples of parameters are the mean and variance). Nonparametric statistics is based on either being distri ...

distance methods and were originally applied to

phenetic In biology, phenetics ( el, phainein – to appear) , also known as taximetrics, is an attempt to classify organisms based on overall similarity, usually in morphology or other observable traits, regardless of their phylogeny or evolutionary rel ...

data using a matrix of pairwise distances. These distances are then reconciled to produce a tree (a phylogram, with informative branch lengths). The

distance matrix In mathematics, computer science and especially graph theory, a distance matrix is a square matrix (two-dimensional array) containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the ''dist ...

can come from a number of different sources, including measured distance (for example from immunological studies) or morphometric analysis, various pairwise distance formulae (such as

euclidean distance In mathematics, the Euclidean distance between two points in Euclidean space is the length of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, therefor ...

) applied to discrete morphological characters, or

genetic distance Genetic distance is a measure of the genetic divergence between species or between populations within a species, whether the distance measures time from common ancestor or degree of differentiation. Populations with many similar alleles have sma ...

from sequence, restriction fragment, or

allozyme Alloenzymes (or also called allozymes) are variant forms of an enzyme which differ structurally but not functionally from other allozymes coded for by different alleles at the same locus. These are opposed to isozymes, which are enzymes that perfo ...

data. For phylogenetic character data, raw distance values can be calculated by simply counting the number of pairwise differences in character states (

Hamming distance In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of ''substitutions'' required to chan ...

Distance-matrix methods

Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore they require an MSA (multiple sequence alignment) as an input. Distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches.Mount DM. (2004). ''Bioinformatics: Sequence and Genome Analysis'' 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY. Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. From this is constructed a phylogenetic tree that places closely related sequences under the same

interior node In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children (depending on the type of tree), but must be con ...

and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them. They are frequently used as the basis for progressive and iterative types of

multiple sequence alignment Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutio ...

. The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA.

Neighbor-joining

Neighbor-joining methods apply general

data clustering Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...

techniques to sequence analysis using genetic distance as a clustering metric. The simple

neighbor-joining In bioinformatics, neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees, created by Naruya Saitou and Masatoshi Nei in 1987. Usually based on DNA or protein sequence data, the algorithm require ...

method produces unrooted trees, but it does not assume a constant rate of evolution (i.e., a

molecular clock The molecular clock is a figurative term for a technique that uses the mutation rate of biomolecules to deduce the time in prehistory when two or more life forms diverged. The biomolecular data used for such calculations are usually nucleoti ...

) across lineages.

UPGMA and WPGMA

The

UPGMA UPGMA (unweighted pair group method with arithmetic mean) is a simple agglomerative (bottom-up) hierarchical clustering method. The method is generally attributed to Sokal and Michener. The UPGMA method is similar to its ''weighted'' variant, the ...

(''Unweighted Pair Group Method with Arithmetic mean'') and

WPGMA WPGMA (Weighted Pair Group Method with Arithmetic Mean) is a simple agglomerative (bottom-up) hierarchical clustering method, generally attributed to Sokal and Michener. The WPGMA method is similar to its ''unweighted'' variant, the UPGMA method ...

(''Weighted Pair Group Method with Arithmetic mean'') methods produce rooted trees and require a constant-rate assumption – that is, it assumes an

ultrametric In mathematics, an ultrametric space is a metric space in which the triangle inequality is strengthened to d(x,z)\leq\max\left\. Sometimes the associated metric is also called a non-Archimedean metric or super-metric. Although some of the theorems ...

tree in which the distances from the root to every branch tip are equal.

Fitch–Margoliash method

The Fitch–Margoliash method uses a weighted

least squares The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems (sets of equations in which there are more equations than unknowns) by minimizing the sum of the squares of the res ...

method for clustering based on genetic distance. Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. In practice, the distance correction is only necessary when the evolution rates differ among branches. The distances used as input to the algorithm must be normalized to prevent large artifacts in computing relationships between closely related and distantly related groups. The distances calculated by this method must be

linear Linearity is the property of a mathematical relationship (''function'') that can be graphically represented as a straight line. Linearity is closely related to '' proportionality''. Examples in physics include rectilinear motion, the linear r ...

; the linearity criterion for distances requires that the

expected value In probability theory, the expected value (also called expectation, expectancy, mathematical expectation, mean, average, or first moment) is a generalization of the weighted average. Informally, the expected value is the arithmetic mean of a l ...

s of the branch lengths for two individual branches must equal the expected value of the sum of the two branch distances – a property that applies to biological sequences only when they have been corrected for the possibility of back mutations at individual sites. This correction is done through the use of a

substitution matrix In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in ...

such as that derived from the

Jukes–Cantor model In biology, a substitution model, also called models of DNA sequence evolution, are Markov models that describe changes over evolutionary time. These models describe evolutionary changes in macromolecules (e.g., DNA sequences) represented as sequen ...

of DNA evolution. The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost. Finding the optimal least-squares tree with any correction factor is

NP-complete In computational complexity theory, a problem is NP-complete when: # it is a problem for which the correctness of each solution can be verified quickly (namely, in polynomial time) and a brute-force search algorithm can find a solution by tryi ...

, so

heuristic A heuristic (; ), or heuristic technique, is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, ...

search methods like those used in maximum-parsimony analysis are applied to the search through tree space.

Using outgroups

Independent information about the relationship between sequences or groups can be used to help reduce the tree search space and root unrooted trees. Standard usage of distance-matrix methods involves the inclusion of at least one outgroup sequence known to be only distantly related to the sequences of interest in the query set. This usage can be seen as a type of

experimental control A scientific control is an experiment or observation designed to minimize the effects of variables other than the independent variable (i.e. confounding variables). This increases the reliability of the results, often through a comparison betw ...

. If the outgroup has been appropriately chosen, it will have a much greater

and thus a longer branch length than any other sequence, and it will appear near the root of a rooted tree. Choosing an appropriate outgroup requires the selection of a sequence that is moderately related to the sequences of interest; too close a relationship defeats the purpose of the outgroup and too distant adds

noise Noise is unwanted sound considered unpleasant, loud or disruptive to hearing. From a physics standpoint, there is no distinction between noise and desired sound, as both are vibrations through a medium, such as air or water. The difference arise ...

to the analysis. Care should also be taken to avoid situations in which the species from which the sequences were taken are distantly related, but the gene encoded by the sequences is highly conserved across lineages.

Horizontal gene transfer Horizontal gene transfer (HGT) or lateral gene transfer (LGT) is the movement of genetic material between Unicellular organism, unicellular and/or multicellular organisms other than by the ("vertical") transmission of DNA from parent to offsprin ...

, especially between otherwise divergent

bacteria Bacteria (; singular: bacterium) are ubiquitous, mostly free-living organisms often consisting of one biological cell. They constitute a large domain of prokaryotic microorganisms. Typically a few micrometres in length, bacteria were among ...

, can also confound outgroup usage.

Weaknesses of different methods

In general, pairwise distance data are an underestimate of the path-distance between taxa on a phylogram. Pairwise distances effectively "cut corners" in a manner analogous to geographic distance: the distance between two cities may be 100 miles "as the crow flies," but a traveler may actually be obligated to travel 120 miles because of the layout of roads, the terrain, stops along the way, etc. Between pairs of taxa, some character changes that took place in ancestral lineages will be undetectable, because later changes have erased the evidence (often called multiple hits and back mutations in sequence data). This problem is common to all phylogenetic estimation, but it is particularly acute for distance methods, because only two samples are used for each distance calculation; other methods benefit from evidence of these hidden changes found in other taxa not considered in pairwise comparisons. For

nucleotide Nucleotides are organic molecules consisting of a nucleoside and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both of which are essential biomolecules wi ...

and

amino acid Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although hundreds of amino acids exist in nature, by far the most important are the alpha-amino acids, which comprise proteins. Only 22 alpha am ...

sequence data, the same stochastic models of nucleotide change used in maximum likelihood analysis can be employed to "correct" distances, rendering the analysis "semi-parametric." Several simple algorithms exist to construct a tree directly from pairwise distances, including

and

neighbor joining In bioinformatics, neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees, created by Naruya Saitou and Masatoshi Nei in 1987. Usually based on DNA or protein sequence data, the algorithm require ...

(NJ), but these will not necessarily produce the best tree for the data. To counter potential complications noted above, and to find the best tree for the data, distance analysis can also incorporate a tree-search protocol that seeks to satisfy an explicit optimality criterion. Two optimality criteria are commonly applied to distance data, minimum evolution (ME) and least squares inference. Least squares is part of a broader class of regression-based methods lumped together here for simplicity. These regression formulae minimize the residual differences between path-distances along the tree and pairwise distances in the data matrix, effectively "fitting" the tree to the empirical distances. In contrast, ME accepts the tree with the shortest sum of branch lengths, and thus minimizes the total amount of evolution assumed. ME is closely akin to parsimony, and under certain conditions, ME analysis of distances based on a discrete character dataset will favor the same tree as conventional parsimony analysis of the same data. Phylogeny estimation using distance methods has produced a number of controversies.

assumes an

tree (a tree where all the path-lengths from the root to the tips are equal). If the rate of evolution were equal in all sampled lineages (a

), and if the tree were completely balanced (equal numbers of taxa on both sides of any split, to counter the node density effect), UPGMA should not produce a biased result. These expectations are not met by most datasets, and although UPGMA is somewhat robust to their violation, it is not commonly used for phylogeny estimation. The advantage of UPGMA is that it is fast and can handle many sequences.

Neighbor-joining In bioinformatics, neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees, created by Naruya Saitou and Masatoshi Nei in 1987. Usually based on DNA or protein sequence data, the algorithm require ...

is a form of star decomposition and, as a

method, is generally the least computationally intensive of these methods. It is very often used on its own, and in fact quite frequently produces reasonable trees. However, it lacks any sort of tree search and optimality criterion, and so there is no guarantee that the recovered tree is the one that best fits the data. A more appropriate analytical procedure would be to use NJ to produce a starting tree, then employ a tree search using an optimality criterion, to ensure that the best tree is recovered. Many scientists eschew distance methods, for various reasons. A commonly cited reason is that distances are inherently

rather than

phylogenetic In biology, phylogenetics (; from Greek φυλή/ φῦλον [] "tribe, clan, race", and wikt:γενετικός, γενετικός [] "origin, source, birth") is the study of the evolutionary history and relationships among or within groups o ...

, in that they do not distinguish between ancestral similarity ( symplesiomorphy) and derived similarity (

synapomorphy In phylogenetics, an apomorphy (or derived trait) is a novel character or character state that has evolved from its ancestral form (or plesiomorphy). A synapomorphy is an apomorphy shared by two or more taxa and is therefore hypothesized to have ...

). This criticism is not entirely fair: most currently implementations of parsimony, likelihood, and Bayesian phylogenetic inference use time-reversible character models, and thus accord no special status to derived or ancestral character states. Under these models, the tree is estimated unrooted; rooting, and consequently determination of polarity, is performed after the analysis. The primary difference between these methods and distances is that parsimony, likelihood, and Bayesian methods fit individual characters to the tree, whereas distance methods fit all the characters at once. There is nothing inherently less phylogenetic about this approach. More practically, distance methods are avoided because the relationship between individual characters and the tree is lost in the process of reducing characters to distances. These methods do not use character data directly, and information locked in the distribution of character states can be lost in the pairwise comparisons. Also, some complex phylogenetic relationships may produce biased distances. On any phylogram, branch lengths will be underestimated because some changes cannot be discovered at all due to failure to sample some species due to either experimental design or extinction (a phenomenon called the node density effect). However, even if pairwise distances from genetic data are "corrected" using stochastic models of evolution as mentioned above, they may more easily sum to a different tree than one produced from analysis of the same data and model using

maximum likelihood In statistics, maximum likelihood estimation (MLE) is a method of estimation theory, estimating the Statistical parameter, parameters of an assumed probability distribution, given some observed data. This is achieved by Mathematical optimization, ...

. This is because pairwise distances are not independent; each branch on a tree is represented in the distance measurements of all taxa it separates. Error resulting from any characteristic of that branch that might confound phylogeny (stochastic variability, change in evolutionary parameters, an abnormally long or short branch length) will be propagated through all of the relevant distance measurements. The resulting distance matrix may then better fit an alternate (presumably less optimal) tree. Despite these potential problems, distance methods are extremely fast, and they often produce a reasonable estimate of phylogeny. They also have certain benefits over the methods that use characters directly. Notably, distance methods allow use of data that may not be easily converted to character data, such as DNA-DNA hybridization assays. They also permit analyses that account for the possibility that the rate at which particular nucleotides are incorporated into sequences may vary over the tree, using LogDet distances. For some network-estimation methods (notably

NeighborNet NeighborNet is an algorithm for constructing phylogenetic networks which is loosely based on the neighbor joining algorithm. Like neighbor joining, the method takes a distance matrix as input, and works by agglomerating clusters. However, the Nei ...

), the abstraction of information about individual characters in distance data are an advantage. When considered character-by character, conflict between character and a tree due to reticulation cannot be told from conflict due either to homoplasy or error. However, pronounced conflict in distance data, which represents an amalgamation of many characters, is less likely due to error or homoplasy unless the data are strongly biased, and is thus more likely to be a result of reticulation. Distance methods are popular among molecular systematists, a substantial number of whom use NJ without an optimization stage almost exclusively. With the increasing speed of character-based analyses, some of the advantages of distance methods will probably wane. However, the nearly instantaneous NJ implementations, the ability to incorporate an evolutionary model in a speedy analysis, LogDet distances, network estimation methods, and the occasional need to summarize relationships with a single number all mean that distance methods will probably stay in the mainstream for a long time to come.

References

{{phylo Computational phylogenetics