Tuesday, June 21, 2016

Nucleotide substitution models

I recently had reason to consider alternative nucleotide substitution models and got a bit frustrated at the scarcity of clear, easily understood summaries of (1) their differences, (2) their availability in different phylogenetic programs, and (3) how to implement them in the various programs.

This is not to say that the information isn't out there. It is out there, and other people may be very satisfied with the form in which it can be found, e.g. in Wikipedia articles full of mathematical matrices. But because I had a hard time finding exactly what I would have wanted I am going to write a few posts over the next few days summarising it in a way that seems intuitive to me. And perhaps this will also be useful to others.

Some throat clearing will be necessary.


First, what are models? I am certainly not a modeller but was originally trained in parsimony analysis. Consequently I am not the ideal person to introduce this, and I may make a few slips further down the line. But the principle is rather simple: models are mathematical descriptions or abstractions that allow us to describe a process and ideally to make predictions, at least stochastically.

Models may differ vastly in their complexity, in the number of their parameters. Imagine we want a model that allows us to describe how fast I am cycling under different circumstances, assuming that I never change gears. A simple model would perhaps have only one parameter, how much effort I invest. We can now put a value in that describes the energy I am using to pedal and get out a speed value.

We may compare this model against observations of me cycling, and we will find that while there is some fit it is far from ideal. Most likely adding another parameter, the force of headwind I have to work against, will improve the fit of the model to the data. Perhaps adding a third parameter, the incline on which I am cycling at any given moment, will further improve the model, and so on.


It is often the case that a model becomes better the more parameters are added. There are, however, most likely diminishing returns, as adding the tenth factor will improve the model less than adding the second. Worse, there is a concept called 'overparameterisation', meaning that a model can be too complicated for its own good. I found a very nice explanation of what that means here.

In case the link is dead when you read this, their example is predicting the seventh point on a graph from the first six points. A very simple model would be a straight line (the usual y = a * x + b), more complex, quadratic models would be monotonous curves, and even more complex, higher polynomial ones would be wavy lines.

If the six points are not exactly on a straight line (perhaps just because there was some noise in the data), it is trivial to produce a very complex model that fits them better than the simple line, because it could get much closer to each of them. But they would work much less well at predicting a seventh point outside of the range of the first six if the underlying pattern was really just a straight line, because they'd tend to shoot off to very high or low values at either end of the range to which they were fitted.

This also related to the reason why I am currently looking into models. It has now happened for the second time this year that jModelTest suggested substitutions models that turned out to be too parameter-rich when I tried to use them in BEAST, prompting me to consider which one I should use instead and why.

Nucleotide substitution models

As their name indicates, nucleotide substitution models are used in phylogenetic analysis of DNA sequence data. They differ in the number of parameters but also have several things in common. Most importantly, all commonly used models assume reversibility, meaning a given substitution is as likely as the the reversal of that substitution.

Parameters of the nucleotide substitution models

Obviously, this cuts the number of possible parameters that can be used to describe substitutions in half, but it still leaves the following:

Each of the six double-headed arrows is a possible parameter describing the likelihood of substitution of one nucleotide in the DNA with another (and the reverse). They can be grouped into classes, most obviously transitions and transversions. Nucleotides fall into two chemically different groups, purines and pyrimidines. A substitution within a group is called a transition (Ts), a substitution between purine and pyrimidine is called a transversion (Tv). Accordingly there are parameter-poorer models that assign the same probability to all transversions and/or to all transitions, or something in-between.

Relevant for model complexity is, by the way, not the number of parameters, but the number of free parameters. Even in a model that has different probabilities assigned to all six substitutions the sixth can be calculated from the other five, meaning that there is always one free parameter less than there are parameters.

In addition to these up to six substitution parameters, models may have the four base frequencies as additional parameters. Again the number of free parameters is one less, i.e. three, because obviously the fourth base frequency must be one minus the other three. Other models simply assume equal base frequencies; in these cases the number of free parameters from that part of the model is zero.

Any of the substitution models described by these sets of parameters can then be modified as '+ I', '+ G', and '+ I + G'. I is the invariant model, involving an estimate of the ratio of invariant sites in the data. G is the gamma model which allows sites to evolve at different rates using a set number of gamma categories and a factor alpha describing the shape of the frequency distribution of sites across the categories. I found this explanation rather helpful.

What this is going to be about

Finally, note that the following posts will not be about how the likelihood function is maximised, how phylogenetic software searches for the best tree given a model, or any theoretical or mathematical background. Honestly, at least to some degree the end user of phylogenetic software would be justified to treat much of this as a black box. The point is rather to summarise information on what parameters the different models have, how they are related, where and how they are implemented, and even such minor but useful to know things as which two or three names confusingly refer to the very same model.

No comments:

Post a Comment