Saturday, December 3, 2016

Molecular clock models in three different programs

There are quite a few molecular clock models implemented in various phylogenetic programs. What I find somewhat annoying is that there is generally only very cryptic information on how they work and how they relate to each other.

What we usually get is the manual or documentation merely saying "our software offers models A, B, and C", without any details on what A means. If we are extremely lucky we find a reference to a paper. In that paper there will be a lot of complicated formulas but rarely will we find a sentence that simply says "model A assumes that rates on neighbouring branches are not correlated".

So I just slogged through documentation, references, and some more or less helpful websites to try and figure out the clock models in R's chronos function, which turns a pre-existing phylogram into a chronogram, and in MrBayes 3 and BEAST 2, which produce chronograms directly.

R: ape package: chronos function

Correlated

As far as I understand this is probably the original Penalised Likelihood approach as also implemented in the software r8s. If so, it allows rates to vary across the tree but with neighbouring internodes of the tree having more similar rates than distant branches are allowed to have. How strongly the rate can vary across the tree depends on the smoothness factor lambda of the model. Here a default of 1 is often used, and it has to changed by orders of magnitude to explore better settings (10, 100...).

Relaxed

Each internode of the tree has its own rate of evolution, with the rate distribution drawn from a gamma parameter.

Discrete

Attempts to estimate a number of categories of internodes, where all the branches of the same category have the same rate of evolution.

MrBayes 3

This is a bit more complicated because MrBayes has several strict clock models, and the relaxed models are always based on an underlying strict model to which they add at least one parameter to describe how rates vary. So how can there be different strict clock models, when a strict clock model means nothing but a constant rate? Well, really this is about the prior (expectation) for the distribution of divergence times across the tree.

Uniform (a.k.a. simple)

Basically the expectation is that a divergence is equally likely to have happened at any point in time between the divergences immediately before and after it.

Birth-death

This model has parameters for speciation rate, extinction rate, and sampling probability. It looks at the tree as something that comes in existence through a process of lineage splits and lineage extinctions, forward through time. Crucially, trees can have rather non-uniform branching patterns depending on the speciation and extinction rates. If both are high, for example, there will be many recent splits but long branches deep in time.

Coalescent

Parameters are theta and the ploidy level, to estimate effective population sizes. The perspective is the opposite of the previous one, with the tree being seen as the coalescence of contemporary lineages into common ancestors as we go back in time.

There are also a birth-death based fossilisation model and the species tree model as additional alternatives. But coming now to how the clock can be relaxed, MrBayes has three options for that (clockvarpr):

Independent Gamma Rate (igr)

I assume this is the same as the relaxed model of ape's chronos function, see above, as the MrBayes manual says it is uncorrelated, and they both have the gamma parameter.

Brownian motion / autocorrelated lognormal distribution (tk02)

The terms Brownian motion and autocorrelated suggest that it is a correlated clock model, i.e. neighbouring branches do not vary independently.

Compound Poisson Process (cpp)

The way I read the paper where this was first suggested it seems to me as if rates are correlated. It allows shifts in rate to occur anywhere on the tree.

BEAST 2

My understanding is that BEAST always uses the coalescent model. It then offers the following options for the clock model:

Strict

The same rate across all branches. So this should be equivalent to using the coalescent divergence time prior with a strict model in MrBayes.

Relaxed exponential or relaxed log normal

Two uncorrelated models differing in the shape of the rate distribution. No idea which one could a priori be expected to be more realistic.

Random local

There are different clocks applying to different parts of the phylogeny (reference). This should work well with sudden rate shifts, but I don't know if it will make clock rooting any more reliable. The way I understand it, it also means that the model is uncorrelated.

In other words, it looks as if "strict" is the only setting that is not an uncorrelated model. This brings us to two final points.

What model to use?

First, I have spoken to a senior colleague and browsed a few sources online, and the general thought seems to be that correlated clock models are more biologically realistic than uncorrelated ones. That makes a lot of sense to me, although with the caveat that there seem to be obvious shifts in some phylogenies, generally associated with a change in a crucial trait such as generation time or metabolism.

Second, one of the sources I read opined that only the strict clock model is really appropriate with the coalescent model, because the latter kind of assumes the former. I have no real opinion on this; I assume that the BEAST developers would have a reason to offer several relaxed models in their coalescent-based package.

Summary

In summary, and hoping I didn't get anything wrong, this is how I currently understand the hierarchy of the relevant clock models:

Strict clock (strict in BEAST; default in MrBayes if no clockvarpr added)
Relaxed clocks
    Correlated rates
        Brownian motion (tk02 in MrBayes)
        Compound Poisson Process (cpp in MrBayes)
        Penalised Likelihood (correlated in chronos)
    Uncorrelated rates
        Independent Gamma Rates (relaxed in chronos; igr in MrBayes)
        Discrete rate categories (discrete in chronos)
        Relaxed exponential (just that name in BEAST)
        Relaxed log normal (just that name in BEAST)
        Random local (just that name in BEAST)

No comments:

Post a Comment