Sunday, June 26, 2016

Implementation of substitution models in phylogenetic software

To conclude the little series of posts on nucleotide substitution models, below is a summary of my current understanding of how to set up, in PAUP and MrBayes, several of the models discussed in the previous posts. But first a few comments.

For PAUP there are three possible options for the basefreq parameter. My current understanding is that equal sets equal base frequencies as in the JC model (duh), empirical fixes them to the frequencies observed in the data set, and estimate has them estimated by maximum likelihood during the analysis. My understanding is further that while estimate is more rigorous, empirical is often 'close enough' and saves computation time. The point is, wherever one of the latter two options appears below, I believe one could also use the other without changing the model as such. I hope that is accurate.
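
To make that concrete, the two lines below should, as far as I can tell, both specify HKY85 and differ only in where the base frequencies come from; treat them as a sketch rather than as tested commands:

PAUP: lset nst=2 variant=hky tratio=estimate basefreq=empirical;
PAUP: lset nst=2 variant=hky tratio=estimate basefreq=estimate;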

What I do not quite understand is how one would set up models like Tamura-Nei in PAUP. At least one of the sources I consulted when researching this post (see below) suggests that one can set up models in PAUP with, for example, variable transition rates but equal transversion rates, but the PAUP 4.0b10 manual states that the only options for the number of substitution rate categories are 1 (all equal), 2 (presumably to distinguish Tv and Ts), and 6 (all different). Would one not need nst = 3 or nst = 5 to set up the relevant models? Perhaps the trick is to set nst = 6 but fix the substitution matrix? But that would mean one cannot estimate it during the analysis.

For the MrBayes commands, note that the applyto parameter needs to be specified with whatever part(s) of the partition the particular model should apply to, as in applyto = (2) for the second part. The MrBayes commands I found in my sources appear to be rather short; I assume that default settings do the rest. Note also that the old MrBayes versions that I am familiar with are now being superseded by RevBayes, but I have no experience with the latter program's idiosyncratic scripting language.
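
As a purely hypothetical illustration (the gene names and alignment coordinates below are invented), a two-gene data set might be set up along these lines, with HKY + G applied to the first part and GTR + I + G to the second:

begin mrbayes;
  charset ITS = 1-650;            [first gene]
  charset matK = 651-1900;        [second gene]
  partition genes = 2: ITS, matK;
  set partition = genes;
  lset applyto=(1) nst=2 rates=gamma;      [HKY + G for part 1]
  lset applyto=(2) nst=6 rates=invgamma;   [GTR + I + G for part 2]
end;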

In addition to PAUP and MrBayes, I mention whether a model can be selected in three other programs I am familiar with: RAxML, BEAST and PhyML. I have no personal experience with most other Likelihood phylogenetics packages. I tried to check what is available in MEGA, but it would not install for me, and I am not sure whether the list of models in the online manual shows only those that are actually available for Likelihood searches or whether it includes some that are only available for Distance analysis. The relevant list is linked from a page on Likelihood, but its own text implies it is about Distance. Either way, MEGA appears to have lots of options, but I have not indicated them below.

General Time-Reversible (GTR)

PAUP: lset nst = 6 basefreq = empirical rmatrix = estimate;

MrBayes: lset applyto=() nst=6;

Available in RAxML, BEAST 2.1.3 and PhyML 3.0.

Tamura-Nei (TN93 / TrN)

Available in BEAST 2.1.3 and PhyML 3.0.

Symmetrical (SYM)

PAUP: lset nst = 6 basefreq = equal rmatrix = estimate;

MrBayes: lset applyto=() nst=6; prset applyto=() statefreqpr=fixed(equal);

Hasegawa-Kishino-Yano (HKY / HKY85)

PAUP: lset nst=2 basefreq=estimate variant=hky tratio=estimate;

MrBayes: lset applyto=() nst=2;

Available in BEAST 2.1.3 and PhyML 3.0.

Felsenstein 84 (F84)

PAUP: lset nst=2 basefreq=estimate variant=F84 tratio=estimate;

Available in PhyML 3.0.

Felsenstein 81 (F81 / TN84)

PAUP: lset nst=1 basefreq=empirical;

MrBayes: lset applyto=() nst=1;

Available in PhyML 3.0.

Kimura 2-parameter (K80 / K2P)

PAUP: lset nst=2 basefreq=equal tratio=estimate;

MrBayes: lset applyto=() nst=2; prset applyto=() statefreqpr=fixed(equal);

Available in PhyML 3.0.

Jukes-Cantor (JC / JC69)

PAUP: lset nst=1 basefreq=equal;

MrBayes: lset applyto=() nst=1; prset applyto=() statefreqpr=fixed(equal);

Available in BEAST 2.1.3 and PhyML 3.0.

Invariant sites (+ I)

For PAUP, add the following to the lset command: pinvar = estimate

For MrBayes, add the following to the lset command: rates = propinv

Gamma (+ G)

For PAUP, add the following to the lset command: pinvar = 0 rates = gamma ncat = 5 shape = estimate (or another number of categories of your choice for ncat).

For MrBayes, add the following to the lset command: rates = gamma

Invariant sites and Gamma (+ I + G)

For PAUP, add the following to the lset command: pinvar = estimate rates = gamma ncat = 5 shape = estimate (or another number of categories of your choice for ncat).

For MrBayes, add the following to the lset command: rates = invgamma
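
Putting the pieces together, GTR + I + G, for example, should then come out as follows, simply by combining the commands above (with the usual caveat that I may have misunderstood something):

PAUP: lset nst=6 basefreq=empirical rmatrix=estimate pinvar=estimate rates=gamma ncat=5 shape=estimate;

MrBayes: lset applyto=() nst=6 rates=invgamma;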

References

This post is partly based on the following very helpful sources, all accessed c. 20 June 2016.

Faircloth B. Substitution models in mrbayes. https://gist.github.com/brantfaircloth/895282

Lewis PO. PAUP* Lab. http://evolution.gs.washington.edu/sisg/2014/2014_SISG_12_7.pdf

Lewis PO. BIOL848 Phylogenetic Methods Lab 5. http://phylo.bio.ku.edu/slides/lab5ML/lab5ML.html

And a PDF posted on molecularevolution.org.

Friday, June 24, 2016

How many substitution schemes to test in jModelTest to get which substitution models

Another thing that occurred to me that might be useful to write about in the context of substitution models concerns jModelTest.

The background here is that before setting a model in a phylogenetic analysis, one would usually conduct model-testing. There are some programs that do their own testing, but others don't, and consequently people have written software that examines your dataset and suggests the best model. The classic program was simply called ModelTest. It was developed by David Posada, and it was primarily used through and for PAUP. It actually wrote out the PAUP commands for the best-fitting model, so that one could simply copy and paste them into the Nexus file, and off we go.

Then people wanted to use the results of the model test for MrBayes. Problem was, MrBayes didn't do all the models that PAUP did, and it was annoying to see a model suggested that one couldn't implement. So Johan Nylander produced a modified version called MrModeltest, and it conveniently wrote out the MrBayes and PAUP commands for the best-fitting model.

These programs have now been superseded by jModelTest. On the plus side, this tool is highly portable and extremely user-friendly thanks to its GUI. The user also does not need to have PAUP; instead jModelTest simply comes packaged with PhyML and hands the data over to that software. What is more, it is parallelised. On the other hand, in contrast to its predecessors it does not appear to write out the PAUP or MrBayes code. But well, that is what my next post is going to deal with.

For the moment I want to address a different problem: when starting the Likelihood calculations for the various models, the user can select whether 3, 5, 7, 11 or 203 (!) substitution schemes, and thus twice that number of models (each scheme with equal and with unequal base frequencies), are going to be tested, with or without invariant sites and gamma if desired. But the thing is, it is not clear from just looking at the interface which models, for example, the seven schemes are going to cover. If I select seven, will all the models I could implement in BEAST be included? Or will I just waste a lot of computing time on testing models that BEAST can't do anyway?

So I just ran all the numbers across a very small dataset, and this is what models are being tested in each case:

3 substitution schemes: JC, F81, K80, HKY, SYM, GTR

5 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TPM1, TPM1uf, SYM, GTR

I am writing them out as named in jModelTest, but note that TPM1 is synonymous with K81 or K3P, and TrN is called TN93 in some software. An "uf" clearly means a variant with unequal base frequencies, and as mentioned in the previous post "ef" is similarly a variant with equal base frequencies.

Note that five schemes is the highest number you need to run if you are model-testing for a BEAST analysis, because all of BEAST's models are covered at that level. It also seems to me that, with the exception of F84, all models available in PhyML are covered, and that one does not appear to be tested under higher numbers of substitution schemes either. So the same consideration applies, unless jModelTest tests it under a different name (?).

7 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TIM1ef, TIM1, TVMef, TVM, TPM1, TPM1uf, SYM, GTR

11 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TPM1, TPM1uf, TPM2, TPM2uf, TPM3, TPM3uf, TIM1ef, TIM1, TIM2ef, TIM2, TIM3ef, TIM3, TVMef, TVM, SYM, GTR

At this stage it becomes clear that there are a lot of strange variants of the Three-Parameter and Transitional Models that I overlooked in my previous post. I don't think they are used very often though...

203 substitution schemes: This adds a gazillion models named with a scheme of six letters or six letters plus F. Not sure how many people ever use them, so I will just stop here. I have a strong tendency to find GTR or HKY suggested for my data anyway...

Thursday, June 23, 2016

Overview of the substitution models

Continuing from the previous post, here is a run-down of the commonly used models, their abbreviations, alternative names and abbreviations, and parameters. They are sorted from most parameter-rich to simplest.

General Time-Reversible (GTR)
Base frequencies variable.
All six substitution rates variable.
Number of free parameters: 8

Transversional (TVM)
Base frequencies variable.
All four transversion rates variable, transitions equal.
Number of free parameters: 7

Transitional (TIM)
Base frequencies variable.
Two different transversion rates, two different transition rates.
Number of free parameters: 6

Symmetrical (SYM)
Base frequencies equal.
All six substitution rates variable.
Number of free parameters: 5

Kimura 3-parameter (K81, K3P, TPM1)
Base frequencies variable.
Two different transversion rates, transition rates equal.
Number of free parameters: 5
Note that K81/K3P in the strict, original sense (and jModelTest's TPM1) assume equal base frequencies, which leaves only 2 free parameters; the variable-frequency version described here is what jModelTest calls TPM1uf.

Tamura-Nei (TN93, TrN)
Base frequencies variable.
Transversion rates equal, transition rates variable.
Number of free parameters: 5

Hasegawa-Kishino-Yano (HKY)
Base frequencies variable.
Transversion rates equal, transition rates equal.
Number of free parameters: 4
Differs from F84 only in how the single transition/transversion rate parameter interacts with the base frequencies in the rate matrix.

Felsenstein 84 (F84)
Base frequencies variable.
Transversion rates equal, transition rates equal.
Number of free parameters: 4
Differs from HKY only in how the single transition/transversion rate parameter interacts with the base frequencies in the rate matrix.

Felsenstein 81 (F81, TN84)
Base frequencies variable.
All substitution rates equal.
Number of free parameters: 3

Kimura 2-parameter (K80, K2P)
Base frequencies equal.
Transversion rates equal, transition rates equal.
Number of free parameters: 1

Jukes-Cantor (JC, JC69)
Base frequencies equal.
All substitution rates equal.
Number of free parameters: 0

And below an overview of the same information in graphical form. The numbers in brackets are the numbers of free parameters that the relevant axis is contributing.


Apparently at least some people fill in the empty fields on the left by taking the name of the relevant model to the right and adding "ef" to it. So TVMef would, for example, be a model with all transversion rates variable, transition rates equal, and equal base frequencies. But there is probably a reason why those missing models don't have any well-known names; they seem to be rarely used, if ever.

Tuesday, June 21, 2016

Nucleotide substitution models

I recently had reason to consider alternative nucleotide substitution models and got a bit frustrated at the scarcity of clear, easily understood summaries of (1) their differences, (2) their availability in different phylogenetic programs, and (3) how to implement them in the various programs.

This is not to say that the information isn't out there. It is out there, and other people may be very satisfied with the form in which it can be found, e.g. in Wikipedia articles full of mathematical matrices. But because I had a hard time finding exactly what I would have wanted I am going to write a few posts over the next few days summarising it in a way that seems intuitive to me. And perhaps this will also be useful to others.

Some throat clearing will be necessary.

Models

First, what are models? I am certainly not a modeller but was originally trained in parsimony analysis. Consequently I am not the ideal person to introduce this, and I may make a few slips further down the line. But the principle is rather simple: models are mathematical descriptions or abstractions that allow us to describe a process and ideally to make predictions, at least stochastically.

Models may differ vastly in their complexity, in the number of their parameters. Imagine we want a model that allows us to describe how fast I am cycling under different circumstances, assuming that I never change gears. A simple model would perhaps have only one parameter, how much effort I invest. We can now put a value in that describes the energy I am using to pedal and get out a speed value.

We may compare this model against observations of me cycling, and we will find that while there is some fit it is far from ideal. Most likely adding another parameter, the force of headwind I have to work against, will improve the fit of the model to the data. Perhaps adding a third parameter, the incline on which I am cycling at any given moment, will further improve the model, and so on.

Overparameterisation

It is often the case that a model fits the data better the more parameters are added. There are, however, most likely diminishing returns, as adding the tenth factor will improve the fit less than adding the second did. Worse, there is a concept called 'overparameterisation', meaning that a model can be too complicated for its own good. I found a very nice explanation of what that means here.

In case the link is dead when you read this, their example is predicting the seventh point on a graph from the first six points. A very simple model would be a straight line (the usual y = a * x + b), more complex, quadratic models would be smooth curves with a single bend, and even more complex, higher-order polynomials would be wavy lines.

If the six points are not exactly on a straight line (perhaps just because there was some noise in the data), it is trivial to produce a very complex model that fits them better than the simple line, because it can get much closer to each of them. But such a model would work much less well at predicting a seventh point outside the range of the first six if the underlying pattern really was just a straight line, because very complex models tend to shoot off to extremely high or low values at either end of the range to which they were fitted.
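
To put a number on that: a straight line y = a + b*x has only two free parameters and will generally miss some of the six points, whereas a fifth-degree polynomial y = a0 + a1*x + a2*x^2 + ... + a5*x^5 has six free parameters and can be made to pass exactly through all six points (as long as their x values differ). Outside the fitted range, however, the x^5 term quickly dominates, so the curve races off towards very large positive or negative values far faster than a straight line ever would.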

This is also related to the reason why I am currently looking into models. It has now happened for the second time this year that jModelTest suggested substitution models that turned out to be too parameter-rich when I tried to use them in BEAST, prompting me to consider which one I should use instead and why.

Nucleotide substitution models

As their name indicates, nucleotide substitution models are used in phylogenetic analysis of DNA sequence data. They differ in the number of parameters but also have several things in common. Most importantly, all commonly used models assume reversibility, meaning, roughly, that a given substitution is as likely as the reversal of that substitution.

Parameters of the nucleotide substitution models

Obviously, this cuts the number of possible parameters that can be used to describe substitutions in half, but it still leaves the following:


Each of the six double-headed arrows is a possible parameter describing the probability of substitution of one nucleotide in the DNA with another (and the reverse). They can be grouped into classes, most obviously transitions and transversions. Nucleotides fall into two chemically different groups, purines (A and G) and pyrimidines (C and T). A substitution within a group is called a transition (Ts), while a substitution between a purine and a pyrimidine is called a transversion (Tv). Accordingly, there are parameter-poorer models that assign the same probability to all transversions and/or to all transitions, or something in-between.
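
Spelled out, the six possible substitution classes are: A↔G and C↔T, the two transitions (purine with purine and pyrimidine with pyrimidine, respectively), and A↔C, A↔T, C↔G and G↔T, the four transversions.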

Relevant for model complexity is, by the way, not the number of parameters but the number of free parameters. Because the substitution rates are only defined relative to each other (the overall amount of change is absorbed by the branch lengths), one of them can always be fixed, say to 1. So even in a model that assigns different probabilities to all six substitutions, only five of them are free, meaning that there is always one free parameter fewer than there are rate parameters.

In addition to these up to six substitution parameters, models may have the four base frequencies as additional parameters. Again the number of free parameters is one less, i.e. three, because obviously the fourth base frequency must equal one minus the sum of the other three. Other models simply assume equal base frequencies; in these cases the number of free parameters contributed by that part of the model is zero.
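
As a worked example, a model in which all six substitution rates and all four base frequencies are allowed to vary therefore has (6 - 1) + (4 - 1) = 5 + 3 = 8 free parameters. Fixing the base frequencies to be equal drops that to five, and additionally collapsing the six rates into just transitions versus transversions drops it to a single free parameter.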

Any of the substitution models described by these sets of parameters can then be modified as '+ I', '+ G', and '+ I + G'. I is the invariant-sites model, which involves estimating the proportion of invariant sites in the data. G is the gamma model, which allows sites to evolve at different rates, using a set number of gamma rate categories and a shape parameter alpha that describes how the rates are distributed across those categories. I found this explanation rather helpful.
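
In terms of parameter counting, + I adds one free parameter (the proportion of invariant sites), + G adds one (the shape parameter alpha; the number of gamma categories is chosen by the user rather than estimated), and + I + G accordingly adds two, so a model with all rates and base frequencies variable goes from 8 to 10 free parameters under + I + G.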

What this is going to be about

Finally, note that the following posts will not be about how the likelihood function is maximised, how phylogenetic software searches for the best tree given a model, or any theoretical or mathematical background. Honestly, at least to some degree the end user of phylogenetic software is justified in treating much of this as a black box. The point is rather to summarise information on what parameters the different models have, how they are related, where and how they are implemented, and even such minor but useful-to-know things as which two or three names confusingly refer to the very same model.

Saturday, June 18, 2016

Taxonomic terminology again

I have often complained about idiotically unusable identification keys. Unclear descriptions of species are a related issue.

Yesterday I read the description of a daisy and encountered the term 'peracute'. Acute means pointy and is usually understood to mean that the apex of an organ shows an angle of less than 90 degrees (as opposed to obtuse for more than 90 degrees). The most frequently encountered problem is somebody writing 'subacute' to indicate... well, that is a good question. Pointy but not very pointy? And then of course we have issues with some taxonomists using related terms for pointy apices that show some kind of curvature, like apiculate or attenuate, in inconsistent ways.

But I have never before come across peracute. My first reaction was to try and translate it. Again, acute means pointy. Per means through, so through-acute? That does not make any sense whatsoever. A more modern meaning is 'for each', as in twenty kilometres per hour. For each acute? Yeah, not very sensible either.

But wait, I have that old glossary I wrote about in March. But no, sadly it doesn't contain that term either. So really, what is the point if a taxonomist uses a word that simply doesn't exist? Who will understand the description under those circumstances? Wouldn't it be nice if we could define four to six clear and universally agreed-on terms for, in this case, apex shapes, and use them, and only them, consistently?

Ah well. At least I had the opportunity to peruse the glossary again. And of course peracute is nothing compared to what is actually in there. This book is the mother lode of bad ideas.

Pampinus - n. Tendril.
Then why not write tendril? Ye gods.

Panduriform - a. Fiddle-shaped.
Then why not... oh, we had that already.

Parastomon - n. An abortive stamen, a staminodium.
You know, we already have a word for that. It is 'staminodium' or 'staminode'.

Phoranthium - n. The receptacle of the capitulum of Compositae.
Ah, I have heard of that structure, only everybody calls it a ... receptacle.

Argh. Argh. Argh. Argh. Headdesk. Are some people actually deliberately trying to make their taxonomic publications pointless, useless and maximally infuriating?

Friday, June 17, 2016

Botany picture #230: Jacobaea vulgaris


Jacobaea vulgaris (Asteraceae), also known as Senecio jacobaea, Germany, 2005. This species is a widely introduced weed, and among the places it has been introduced to is Australia.

I had known for some time that it and several relatives had been officially segregated out into Jacobaea. Having met it first as a student under a Senecio name, not seeing any obvious morphological differences to species that remained under that genus, and, perhaps most importantly, without any pressing need to research the matter, I was unsure what to make of it.

Perhaps this was just some random splitting based on a whim? Or perhaps there had been a splitting and a lumping option for making something monophyletic, and the splitting option, which creates two indistinguishable genera, was preferred over a lumping option that would have created one heterogeneous genus? For all I knew this may well have been one of those scenarios that 'evolutionary' systematists always complain about.

Recently, however, a colleague discussed the matter with me, and that prompted me to properly study the literature. The phylogeny of the Senecioneae was thoroughly resolved in a series of papers published around ten years ago by Pieter Pelser and collaborators. Perhaps the best overview for present purposes is provided by figure 1 of Pelser et al. (2007, Taxon 56: 1077-1104). It is a single phylogeny broken up over ten pages. The part that concerns Jacobaea is figure 1F, and it can conveniently be summarised as follows:


So this is not just paraphyletic. Even 'evolutionary' systematists should be unable to accept a genus that includes Jacobaea and Senecio in the strict sense. But surely then two clades separated by so many other clades, among them illustrious genera like Crassocephalum and Dendrosenecio, must have some diagnostic characters? Sadly, no:
Although Jacobaea is distinguished from Senecio s.str. and other genera in Senecioneae by DNA sequence characters (Pelser & al. 2002), clear morphological synapomorphies for Jacobaea have not been identified to date.
That may be very unsatisfying, but life isn't always fair. There is no rule that says that we will always find diagnostic characters to tell all distantly related groups apart.

Sunday, June 12, 2016

Parsimony in phylogenetics again

Just some short observations:

A few days ago I learned that somebody has found my TNT script for the Templeton test useful and is not only using it but also improving on it. A few days before that I found that my post on using TNT had made it into the acknowledgements of a publication. That is really nice to see; my expectation was never that this blog would be home to a lot of real-time discussion, but rather that people would find something useful to them even years after I posted it.

---

I checked the 'parsimonygate' hashtag again and found a few interesting (or perhaps revealing) tweets. The first of them comments on one of the graphs from my surprisingly popular post on the popularity of phylogenetic programs over the years with a laconic "TNT is thin green". Now I have no idea what the tweeter meant by that. His profile clarifies that "re-tweet does not equal endorsement", so any comment could be about anything. But in the context of the parsimonygate hashtag, it could be read as an argumentum ad populum, along the lines of: see, hardly anybody uses parsimony these days, those guys are fringe.

That, however, would make little sense regardless of one's position on the silly parsimony versus model controversy. It would be much harder to figure out how often people use methods than how often they cite programs, but it should be obvious that many of the people citing PAUP, PHYLIP or MEGA have also used the parsimony methods implemented in those programs. TNT is just one of the parsimony programs out there, and it is unsurprising that it is not the most popular one, seeing how it uses an idiosyncratic data format and scripting language instead of the more widely used Nexus, Newick and Phylip formats.

The other notable tweets are the series of comments that appeared after the publication of a recent study comparing the performance of Bayesian and parsimony methods on discrete morphological characters. (This is of some interest to me. My preference when faced with DNA sequence data is to use likelihood, but the results of using the Mk model on morphology generally seem nonsensical to me.) Samples:
Bayesian phylogenetic methods outperform parsimony in estimating phylogeny from discrete morphological data (link to paper)
and
Time to abandon all parsimony? (link to paper)
Wow, that paper must have really shown parsimony to have trouble! Let's look at the paper then:
Only minor differences are seen in the accuracy of phylogenetic topology reconstruction between the Bayesian implementation of the Mk-model and parsimony methods. Our findings both support and contradict elements of the results of Wright & Hillis [5] in that we can corroborate their observation, that the Mk-model outperforms equal-weights parsimony in accuracy, but the Mk-model achieves this at the expense of precision.
Oh.

Again, I cannot stress enough that I am a methods pragmatist who regularly uses both parsimony and model-based approaches. I also appreciate that there are indeed phylogeneticists who are irrationally opposed to model-based methods. But are these not examples of rather, shall we say, selective perception of what turned out to be a trade-off situation with minor differences either way?