Tuesday, 14 March 2017

The trees of life: what are they?

Evolution has become synonymous with the image of the tree of life, where each branch depicts the diversification of life on our planet. While these images are often nothing more than a reflection of an evolutionary narrative, their validity is rooted in the mathematically derived structures produced from data sets painstakingly extracted over hours of lab work. But not all of these trees tell you the same things. Some have branches whose lengths represent evolutionary time between taxa (ultrametric trees), while others (cladograms) only give you the relative pattern of common ancestry.
The various approaches to constructing a tree can be separated into two general groups: distance methods (e.g. Neighbour-joining) and discrete methods (e.g. Maximum-likelihood, Maximum-parsimony and Bayesian inference). Distance methods have the advantage that they can analyse both sequence data and banding patterns, as these formats are easily converted into pairwise distance matrices, which are then used in the construction of a tree. Discrete methods are limited to sequence data, where each character is processed as information. While this limits the type of data that can be analysed, the explicit functions that relate the tree to the data allow different evolutionary hypotheses to be analysed and compared against the observed data. These different methods are discussed below.
While some trees may look alike, they are telling us different stories. A) Neighbour joining trees and C) Maximum-likelihood trees both have scales representing evolutionary distance, while B) Maximum-parsimony does not. Trees taken from Potts et al., 2004, doi:10.1093/sysbio/syt052  
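The conversion of sequence data into a pairwise distance matrix can be sketched as follows. This is a minimal illustration that assumes the simplest possible measure, the proportion of differing sites (often called the p-distance), rather than a model-corrected distance. The function names and the toy alignment are made up for illustration.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)

def distance_matrix(seqs):
    """Build a symmetric pairwise distance matrix as a dict of dicts."""
    names = list(seqs)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = p_distance(seqs[n], seqs[m])
        matrix[n][m] = matrix[m][n] = d
    return matrix

# Hypothetical toy alignment (made-up sequences, for illustration only)
alignment = {
    "taxonA": "ACGTACGTAC",
    "taxonB": "ACGTACGTTC",
    "taxonC": "ACGAACGTTC",
    "taxonD": "TCGAACGATC",
}
dm = distance_matrix(alignment)
for name, row in dm.items():
    print(name, [round(row[other], 2) for other in dm])
```

Banding patterns could be handled the same way by treating each band as a present/absent character, which is why distance methods accept both kinds of data.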

Neighbour joining (shortest steps wins)


Neighbour joining involves the calculation of evolutionary distances between each pair of taxa. These values are then placed into a pairwise distance matrix, and the relationships between these distance values are used to construct a tree. The process starts with a completely unresolved tree (a star with arms of equal length). The pair of taxa with the lowest score is identified, where the score is the pairwise distance corrected for how far each of the two taxa lies from all the others, and the pair is connected to form a new node. The branch lengths (distances) from these paired taxa to the node are then calculated. New distance values are calculated between the remaining taxa and the node, which replaces the paired taxa involved in its formation. This new distance matrix is then evaluated and the process begins again, with the next lowest-scoring pair joined to form the next new node.
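The joining procedure can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation. It assumes the standard selection criterion Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of distances from taxon i to all others, and the function name, node labels and toy distance matrix are made up for illustration.

```python
def neighbour_joining(names, dist):
    """Return a list of (child, parent, branch_length) edges.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the others
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Pick the pair that minimises Q(i, j) = (n - 2) d(i, j) - r(i) - r(j)
        best = None
        for a in range(n):
            for b in range(a + 1, n):
                i, j = nodes[a], nodes[b]
                q = (n - 2) * d[frozenset((i, j))] - r[i] - r[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        dij = d[frozenset((i, j))]
        # Branch lengths from the two joined taxa to the new node
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        u = f"node{next_id}"
        next_id += 1
        edges.append((i, u, li))
        edges.append((j, u, lj))
        # Distances from the new node to every remaining taxon
        for k in nodes:
            if k not in (i, j):
                dik = d[frozenset((i, k))]
                djk = d[frozenset((j, k))]
                d[frozenset((u, k))] = 0.5 * (dik + djk - dij)
        # The new node replaces the pair in the working set
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    # Connect the last two nodes with a single branch
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

# Hypothetical additive distance matrix for five taxa
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {frozenset(pair): float(v) for pair, v in raw.items()}
edges = neighbour_joining(list("abcde"), dist)
for child, parent, length in edges:
    print(child, "->", parent, length)
```

In this toy matrix the closest pair is d and e (distance 3), yet the first pair joined is a and b (distance 5). That is the effect of the correction term: taxa that are far from everything else are pulled together first, which is what allows the method to cope with unequal rates of evolution.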
The main benefit of using a neighbour joining approach for tree construction is that it is much faster than other, computationally demanding, options. This allows for the analysis of large data sets (with more than 100 taxa) as well as validity tests such as bootstrapping and jack-knifing. Furthermore, the correctness of the tree produced is generally good, and the method maintains its statistical consistency when the matrix is of an additive nature. It also does not assume a constant rate of evolution, allowing for a variety of evolutionary theories to be evaluated. However, this relatively straightforward approach follows a problem-solving heuristic, making the best possible choice at each step by reassessing the distance matrix based on newly produced nodes. Such an approach does not always identify the shortest tree overall, resulting in tree topologies that don’t achieve a Balanced Minimum Evolution (BME). Neighbour joining approaches have generally been superseded by more accurate discrete data methods at the cost of speed and computational ease.

Maximum-parsimony (most simple tree wins)

Maximum-parsimony attempts to construct a tree that requires the fewest steps to explain the observed variability in the data. Evolutionarily speaking, this means that the most likely phylogeny is taken to be the one that requires the least evolutionary change. In its exhaustive form, this requires that every possible tree topology be scored on how parsimoniously it distributes the data. Once every potential tree has been scored, the tree that produces the most parsimonious data distribution is selected. This requires a great deal of computational power, limiting the number of samples that can be analysed in this way. Heuristic methods have been developed to overcome this challenge. This comes with a trade-off though: the most parsimonious tree is not guaranteed to be found when a hill-climbing algorithm is adopted.
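Scoring every topology is only feasible for tiny examples, but with four taxa there are just three unrooted trees, so the whole idea fits in a short sketch. The code below assumes Fitch's counting method for unordered characters, which is one common way to count steps. The function names and the toy alignment are made up for illustration.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one character on a rooted binary tree.

    tree:   nested 2-tuples with leaf names as strings
    states: dict mapping leaf name -> character state
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        (left_set, left_cost) = visit(node[0])
        (right_set, right_cost) = visit(node[1])
        common = left_set & right_set
        if common:
            # The children can agree on a state: no extra step needed
            return common, left_cost + right_cost
        # The children disagree: one extra step is required
        return left_set | right_set, left_cost + right_cost + 1
    return visit(tree)[1]

def parsimony_length(tree, alignment):
    """Total number of steps over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )

# Hypothetical toy alignment (made-up sequences)
alignment = {"A": "AAGCT", "B": "AAGTT", "C": "GGACT", "D": "GGATA"}

# The three possible topologies for four taxa, rooted on the internal branch
topologies = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
scores = {t: parsimony_length(t, alignment) for t in topologies}
best = min(scores, key=scores.get)
for t, s in scores.items():
    print(t, s)
print("most parsimonious:", best)
```

The number of topologies grows explosively with the number of taxa, which is why the exhaustive loop over `topologies` has to be replaced by a heuristic search for anything but very small data sets.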
Further issues associated with this method include its tendency to underestimate evolutionary change, since homoplasies are likely to be overlooked in order to produce the most parsimonious tree, although this might not always be the case. A second problem is that the method has been reported to be statistically inconsistent. This means that under some conditions there is no guarantee that the true evolutionary tree will be produced, even with sufficient data available. Maximum parsimony does offer value for intraspecific phylogenies, which involve a relatively small number of taxa with relatively recent evolutionary divergence histories, as this reduces the potential for long-branch attraction. Long-branch attraction is likely to occur when there is a high level of divergence between sequences, or when rates of evolution between sequences are variable.

Maximum-likelihood (most probable tree wins)

Maximum likelihood is used to estimate some unknown descriptor of a probability model. In phylogenetic analyses there may be many parameters that need to be addressed in a model, and maximum likelihood will select the value for each parameter that results in the maximum probability of the observed data. This results in an evolutionary tree that makes the observed data most probable. Following its introduction, maximum likelihood phylogenetics underwent steady development, with the computational challenges associated with the method being overcome and the models becoming more biologically realistic. A general maximum likelihood approach was finally produced, overcoming the computational difficulty of the analyses. While sequence data are the most practical for maximum likelihood analysis, because the rate of genetic divergence is associated with nucleotide changes, there are cases where restriction site data have been used where an appropriate model was available.
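The idea of choosing the parameter value that makes the observed data most probable can be shown with the smallest possible case: a single branch length separating two aligned sequences. The sketch below assumes the Jukes-Cantor (JC69) substitution model, which this post does not otherwise name, and uses a simple grid search. The counts (100 sites, 20 differences) are made up for illustration.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d (substitutions per site) for two
    aligned sequences under the Jukes-Cantor model."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)   # probability of one identical site
    p_diff = 0.25 * (0.25 - 0.25 * e)   # probability of one specific mismatch
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)

# Hypothetical data: 100 aligned sites, 20 of which differ
n_sites, n_diff = 100, 20

# Try many candidate distances and keep the one with the highest likelihood
grid = [i / 10000 for i in range(1, 20001)]  # d from 0.0001 to 2.0
d_hat = max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

# For this simple model the maximum also has a closed form
d_closed = -0.75 * math.log(1 - (4.0 / 3.0) * n_diff / n_sites)

print("grid search estimate:", d_hat)
print("closed-form estimate:", round(d_closed, 4))
```

The estimate comes out larger than the raw proportion of differing sites (0.2), because the model allows for multiple substitutions at the same site. A full analysis does the same thing over every branch length and model parameter at once, which is where the computational cost comes from.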
Because maximum likelihood methods can account for differences in evolutionary rate between sites and lineages, they are well suited to the analysis of distantly related taxa. However, the sample size is often limited by computational power, as the method analyses all potential combinations of tree topology and branch length. This limitation has, however, been addressed by pruning algorithms that reduce the amount of data being processed independently: rather than treating every combination separately, the likelihoods of subtrees are calculated and reused. The advantages of using maximum likelihood to produce a phylogenetic tree are that the statistics behind the selection of the most likely tree topology are well understood, the explicit model of evolution can be made to fit the data, and the branch lengths are better accounted for, resulting in more realistic branch lengths that reflect evolutionary rates.
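The subtree trick can be sketched as a small recursive function. This is a minimal illustration of a pruning calculation for a single site, assuming the Jukes-Cantor (JC69) model with equal base frequencies. The tree representation, function names, branch lengths and site pattern are all made up for illustration.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """Probability of base i being replaced by base j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def conditional(node, site):
    """Likelihood of the data below `node` for each possible base at `node`."""
    if isinstance(node, str):
        # Leaf: the observed base has likelihood 1, all others 0
        return [1.0 if b == site[node] else 0.0 for b in BASES]
    partial = [1.0] * 4
    for child, t in node:
        # Each subtree is summarised once as four numbers and then reused
        below = conditional(child, site)
        for x, bx in enumerate(BASES):
            partial[x] *= sum(
                jc69_prob(bx, by, t) * below[y] for y, by in enumerate(BASES)
            )
    return partial

def site_likelihood(tree, site):
    """Likelihood of one alignment column, averaging over the base at the root."""
    return sum(0.25 * v for v in conditional(tree, site))

# Hypothetical tree: an internal node is a tuple of (child, branch length) pairs
inner_ab = (("A", 0.1), ("B", 0.1))
inner_cd = (("C", 0.2), ("D", 0.2))
root = ((inner_ab, 0.05), (inner_cd, 0.05))

# Hypothetical alignment column: taxa A and B carry G, taxa C and D carry A
site = {"A": "G", "B": "G", "C": "A", "D": "A"}
lik = site_likelihood(root, site)
print("site likelihood:", lik)
```

Without this shortcut every combination of bases at the internal nodes would have to be enumerated separately. With it, the work grows roughly in proportion to the number of taxa for a given tree.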

Bayesian inference (most visited tree wins)

Bayesian inference of phylogeny applies a likelihood function to determine the posterior probability of trees, that is, the probability of a tree after new evidence or background information has been taken into account. Initially, all possible trees are considered equally probable, and the likelihood of each is calculated under a Markov model of character evolution. This involves evaluating every potential tree and, for each tree, integrating over every combination of branch length and model parameter. This would be exceptionally computationally taxing, but fortunately there are numerical methods such as Markov chain Monte Carlo (MCMC) that have allowed Bayesian inference to be applied to the determination of evolutionary tree topologies. This involves two steps. First, a new tree is proposed through a stochastic modification of the existing tree. This new tree is then accepted or rejected based on its probability relative to the current tree. If the new tree is accepted, it is subjected to further modifications and retested. If the Markov chain is correctly constructed and run, the number of times a specific tree is visited is proportional to its posterior probability.
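The propose-then-accept loop can be sketched with a toy example. The code below assumes a Metropolis acceptance rule with a symmetric proposal and only three candidate trees, each given a made-up unnormalised posterior weight in place of a real likelihood calculation. The function name and the weights are hypothetical.

```python
import random

def metropolis_tree_sampler(weights, n_steps, seed=0):
    """Sample trees in proportion to their (unnormalised) posterior weights."""
    rng = random.Random(seed)
    trees = list(weights)
    current = trees[0]
    visits = {t: 0 for t in trees}
    for _ in range(n_steps):
        # Step 1: propose a new tree by randomly changing the current one
        proposal = rng.choice([t for t in trees if t != current])
        # Step 2: accept or reject based on the ratio of the two probabilities
        ratio = weights[proposal] / weights[current]
        if rng.random() < min(1.0, ratio):
            current = proposal
        visits[current] += 1
    return {t: count / n_steps for t, count in visits.items()}

# Hypothetical unnormalised posterior weights for the three four-taxon trees
weights = {
    "((A,B),(C,D))": 6.0,
    "((A,C),(B,D))": 3.0,
    "((A,D),(B,C))": 1.0,
}
freqs = metropolis_tree_sampler(weights, n_steps=200_000)
for tree, freq in freqs.items():
    print(tree, round(freq, 3))
```

The visit frequencies should settle close to 0.6, 0.3 and 0.1, the normalised weights, even though the sampler only ever compares two trees at a time. Real analyses also propose changes to branch lengths and model parameters, and discard an initial burn-in period before counting visits.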
Bayesian inference of phylogeny offers the opportunity to analyse large data sets and produce tree topologies that are correlated to an evolutionary model of your choice, allowing for the selection of a model that best fits the data. It is, however, important to select the correct model, as it has been shown that oversimplified models are likely to produce higher posterior probability values. Furthermore, posterior probability values tend to be higher than bootstrap values calculated for maximum likelihood and parsimony phylogenies.

So there you have it, the trees of life. 

Pretty straightforward, right? But what happens when using a tree doesn’t make sense, like when a new species results from hybridization? No two tree branches grow into each other and then split again into three or maybe even five new branches, although this is the case for some species phylogenies. In my next post I will be discussing gene networks, the solution to these issues of reticulate evolution.
