Types of phylogenetic trees and networks
Phylogenetic trees generated by computational phylogenetics can be either ''rooted'' or ''unrooted'' depending on the input data and the algorithm used. A rooted tree is aCoding characters and defining homology
Morphological analysis
The basic problem in morphological phylogenetics is the assembly of a matrix representing a mapping from each of the taxa being compared to representative measurements for each of the phenotypic characteristics being used as a classifier. The types of phenotypic data used to construct this matrix depend on the taxa being compared; for individual species, they may involve measurements of average body size, lengths or sizes of particular bones or other physical features, or even behavioral manifestations. Of course, since not every possible phenotypic characteristic could be measured and encoded for analysis, the selection of which features to measure is a major inherent obstacle to the method. The decision of which traits to use as a basis for the matrix necessarily represents a hypothesis about which traits of a species or higher taxon are evolutionarily relevant. Morphological studies can be confounded by examples ofMolecular analysis
The problem of character coding is very different in molecular analyses, as the characters in biological sequence data are immediate and discretely defined - distinctDistance-matrix methods
Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore, they require an MSA as an input. Distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches. Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. From this is constructed a phylogenetic tree that places closely related sequences under the same interior node and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them. They are frequently used as the basis for progressive and iterative types ofUPGMA and WPGMA
TheNeighbor-joining
Neighbor-joining methods apply general cluster analysis techniques to sequence analysis using genetic distance as a clustering metric. The simpleFitch–Margoliash method
The Fitch–Margoliash method uses a weighted least squares method for clustering based on genetic distance. Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. The distances used as input to the algorithm must be normalized to prevent large artifacts in computing relationships between closely related and distantly related groups. The distances calculated by this method must beUsing outgroups
Independent information about the relationship between sequences or groups can be used to help reduce the tree search space and root unrooted trees. Standard usage of distance-matrix methods involves the inclusion of at least one outgroup sequence known to be only distantly related to the sequences of interest in the query set. This usage can be seen as a type ofMaximum parsimony
Branch and bound
TheSankoff-Morel-Cedergren algorithm
The Sankoff-Morel-Cedergren algorithm was among the first published methods to simultaneously produce an MSA and a phylogenetic tree for nucleotide sequences. The method uses aMALIGN and POY
More recent phylogenetic tree/MSA methods use heuristics to isolate high-scoring, but not necessarily optimal, trees. The MALIGN method uses a maximum-parsimony technique to compute a multiple alignment by maximizing aMaximum likelihood
TheBayesian inference
Bayesian inference can be used to produce phylogenetic trees in a manner closely related to the maximum likelihood methods. Bayesian methods assume a prior probability distribution of the possible trees, which may simply be the probability of any one tree among all the possible trees that could be generated from the data, or may be a more sophisticated estimate derived from the assumption that divergence events such as speciation occur as stochastic processes. The choice of prior distribution is a point of contention among users of Bayesian-inference phylogenetics methods. Implementations of Bayesian methods generally useModel selection
Molecular phylogenetics methods rely on a defined substitution model that encodes a hypothesis about the relative rates ofTypes of models
All substitution models assign a set of weights to each possible change of state represented in the sequence. The most common model types are implicitly reversible because they assign the same weight to, for example, a G>C nucleotide mutation as to a C>G mutation. The simplest possible model, the Jukes-Cantor model, assigns an equal probability to every possible change of state for a given nucleotide base. The rate of change between any two distinct nucleotides will be one-third of the overall substitution rate. More advanced models distinguish between transitions andChoosing the best model
The selection of an appropriate model is critical for the production of good phylogenetic analyses, both because underparameterized or overly restrictive models may produce aberrant behavior when their underlying assumptions are violated, and because overly complex or overparameterized models are computationally expensive and the parameters may be overfit. The most common method of model selection is theEvaluating tree support
As with all statistical analysis, the estimation of phylogenies from character data requires an evaluation of confidence. A number of methods exist to test the amount of support for a phylogenetic tree, either by evaluating the support for each sub-tree in the phylogeny (nodal support) or evaluating whether the phylogeny is significantly different from other possible trees (alternative tree hypothesis tests).Nodal support
The most common method for assessing tree support is to evaluate the statistical support for each node on the tree. Typically, a node with very low support is not considered valid in further analysis, and visually may be collapsed into aConsensus tree
Many methods for assessing nodal support involve consideration of multiple phylogenies. The consensus tree summarizes the nodes that are shared among a set of trees. In a *strict consensus,* only nodes found in every tree are shown, and the rest are collapsed into an unresolvedBootstrapping and jackknifing
In statistics, the bootstrap is a method for inferring the variability of data that has an unknown distribution using pseudoreplications of the original data. For example, given a set of 100 data points, a pseudoreplicate is a data set of the same size (100 points) randomly sampled from the original data, with replacement. That is, each original data point may be represented more than once in the pseudoreplicate, or not at all. Statistical support involves evaluation of whether the original data has similar properties to a large set of pseudoreplicates. In phylogenetics, bootstrapping is conducted using the columns of the character matrix. Each pseudoreplicate contains the same number of species (rows) and characters (columns) randomly sampled from the original matrix, with replacement. A phylogeny is reconstructed from each pseudoreplicate, with the same methods used to reconstruct the phylogeny from the original data. For each node on the phylogeny, the nodal support is the percentage of pseudoreplicates containing that node. The statistical rigor of the bootstrap test has been empirically evaluated using viral populations with known evolutionary histories, finding that 70% bootstrap support corresponds to a 95% probability that the clade exists. However, this was tested under ideal conditions (e.g. no change in evolutionary rates, symmetric phylogenies). In practice, values above 70% are generally supported and left to the researcher or reader to evaluate confidence. Nodes with support lower than 70% are typically considered unresolved. Jackknifing in phylogenetics is a similar procedure, except the columns of the matrix are sampled without replacement. Pseudoreplicates are generated by randomly subsampling the data—for example, a "10% jackknife" would involve randomly sampling 10% of the matrix many times to evaluate nodal support.Posterior probability
Reconstruction of phylogenies using Bayesian inference generates a posterior distribution of highly probable trees given the data and evolutionary model, rather than a single "best" tree. The trees in the posterior distribution generally have many different topologies. When the input data is variant allelic frequency data (VAF), the tool EXACT can compute the probabilities of trees exactly, for small, biologically relevant tree sizes, by exhaustively searching the entire tree space. Most Bayesian inference methods utilize a Markov-chain Monte Carlo iteration, and the initial steps of this chain are not considered reliable reconstructions of the phylogeny. Trees generated early in the chain are usually discarded asStep counting methods
Bremer support counts the number of extra steps needed to contradict a clade.Shortcomings
These measures each have their weaknesses. For example, smaller or larger clades tend to attract larger support values than mid-sized clades, simply as a result of the number of taxa in them. Bootstrap support can provide high estimates of node support as a result of noise in the data rather than the true existence of a clade.Limitations and workarounds
Ultimately, there is no way to measure whether a particular phylogenetic hypothesis is accurate or not, unless the true relationships among the taxa being examined are already known (which may happen with bacteria or viruses under laboratory conditions). The best result an empirical phylogeneticist can hope to attain is a tree with branches that are well supported by the available evidence. Several potential pitfalls have been identified:Homoplasy
Certain characters are more likely to evolve convergently than others; logically, such characters should be given less weight in the reconstruction of a tree. Weights in the form of a model of evolution can be inferred from sets of molecular data, so thatHorizontal gene transfer
In general, organisms can inherit genes in two ways: vertical gene transfer andHybrids, speciation, introgressions and incomplete lineage sorting
The basic assumption underlying the mathematical model of cladistics is a situation where species split neatly in bifurcating fashion. While such an assumption may hold on a larger scale (bar horizontal gene transfer, see above), speciation is often much less orderly. Research since the cladistic method was introduced has shown thatTaxon sampling
Owing to the development of advanced sequencing techniques inPhylogenetic signal
Another important factor that affects the accuracy of tree reconstruction is whether the data analyzed actually contain a useful phylogenetic signal, a term that is used generally to denote whether a character evolves slowly enough to have the same state in closely related taxa as opposed to varying randomly. Tests for phylogenetic signal exist.Continuous characters
Morphological characters that sample a continuum may contain phylogenetic signal, but are hard to code as discrete characters. Several methods have been used, one of which is gap coding, and there are variations on gap coding. In the original form of gap coding:group means for a character are first ordered by size. The pooled within-group standard deviation is calculated ... and differences between adjacent means ... are compared relative to this standard deviation. Any pair of adjacent means is considered different and given different integer scores ... if the means are separated by a "gap" greater than the within-group standard deviation ... times some arbitrary constant.If more taxa are added to the analysis, the gaps between taxa may become so small that all information is lost. Generalized gap coding works around that problem by comparing individual pairs of taxa rather than considering one set that contains all of the taxa.
Missing data
In general, the more data that are available when constructing a tree, the more accurate and reliable the resulting tree will be. Missing data are no more detrimental than simply having fewer data, although the impact is greatest when most of the missing data are in a small number of taxa. Concentrating the missing data across a small number of characters produces a more robust tree.The role of fossils
Because many characters involve embryological, or soft-tissue or molecular characters that (at best) hardly ever fossilize, and the interpretation of fossils is more ambiguous than that of living taxa, extinct taxa almost invariably have higher proportions of missing data than living ones. However, despite these limitations, the inclusion of fossils is invaluable, as they can provide information in sparse areas of trees, breaking up long branches and constraining intermediate character states; thus, fossil taxa contribute as much to tree resolution as modern taxa. Fossils can also constrain the age of lineages and thus demonstrate how consistent a tree is with the stratigraphic record; stratocladistics incorporates age information into data matrices for phylogenetic analyses.See also
*References
Further reading
* * * *External links
* {{Use dmy dates, date=April 2017