Least squares inference in phylogeny
   HOME

TheInfoList



OR:

Least squares inference in phylogeny generates a
phylogenetic tree A phylogenetic tree (also phylogeny or evolutionary tree Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA.) is a branching diagram or a tree showing the evolutionary relationships among various biological spec ...
based on an observed matrix of pairwise
genetic distance Genetic distance is a measure of the genetic divergence between species or between populations within a species, whether the distance measures time from common ancestor or degree of differentiation. Populations with many similar alleles have s ...
s and optionally a weight matrix. The goal is to find a tree which satisfies the distance constraints as best as possible.


Ordinary and weighted least squares

The discrepancy between the observed pairwise distances D_ and the distances T_ over a phylogenetic tree (i.e. the sum of the branch lengths in the path from leaf i to leaf j) is measured by : S = \sum_ w_ (D_-T_)^2 where the weights w_ depend on the least squares method used. Least squares distance tree construction aims to find the tree (topology and branch lengths) with minimal S. This is a non-trivial problem. It involves searching the discrete space of unrooted binary tree topologies whose size is exponential in the number of leaves. For n leaves there are 1 • 3 • 5 • ... • (2n-3) different topologies. Enumerating them is not feasible already for a small number of leaves. Heuristic search methods are used to find a reasonably good topology. The evaluation of S for a given topology (which includes the computation of the branch lengths) is a linear least squares problem. There are several ways to weight the squared errors (D_-T_)^2, depending on the knowledge and assumptions about the variances of the observed distances. When nothing is known about the errors, or if they are assumed to be independently distributed and equal for all observed distances, then all the weights w_ are set to one. This leads to an ordinary least squares estimate. In the weighted least squares case the errors are assumed to be independent (or their correlations are not known). Given independent errors, a particular weight should ideally be set to the inverse of the variance of the corresponding distance estimate. Sometimes the variances may not be known, but they can be modeled as a function of the distance estimates. In the Fitch and Margoliash method Fitch WM, Margoliash E. (1967). Construction of phylogenetic trees. ''Science'' 155: 279-84. for instance it is assumed that the variances are proportional to the squared distances.


Generalized least squares

The ordinary and weighted least squares methods described above assume independent distance estimates. If the distances are derived from genomic data their estimates covary, because evolutionary events on internal branches (of the true tree) can push several distances up or down at the same time. The resulting covariances can be taken into account using the method of generalized least squares, i.e. minimizing the following quantity :\sum_ w_ (D_-T_) (D_-T_) where w_ are the entries of the inverse of the
covariance matrix In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
of the distance estimates.


Computational Complexity

Finding the tree and branch lengths minimizing the least squares residual is an
NP-complete In computational complexity theory, a problem is NP-complete when: # it is a problem for which the correctness of each solution can be verified quickly (namely, in polynomial time) and a brute-force search algorithm can find a solution by trying ...
problem. However, for a given tree, the optimal branch lengths can be determined in O(n^2) time for ordinary least squares, O(n^3) time for weighted least squares, and O(n^4) time for generalised least squares (given the inverse of the
covariance matrix In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
).David Bryant, Peter Waddell,
Rapid Evaluation of Least-Squares and Minimum-Evolution Criteria on Phylogenetic Trees
Mol Biol Evol (1998) 15(10): 1346


External links



a freely distributed phylogenetic analysis package containing an implementation of the weighted least squares method
PAUP
a similar package available for purchase
Darwin
a programming environment with a library of functions for statistics, numerics, sequence and phylogenetic analysis


References

{{Phylogenetics Computational phylogenetics