Friday, June 24, 2016

How many substitution schemes to test in jModelTest to get which substitution models

Another thing that occurred to me that might be useful to write about in the context of substitution models concerns jModelTest.

The background here is that before setting a model in a phylogenetic analysis, one would usually conduct model testing. Some programs do their own testing, but others don't, and consequently people have written software that examines your dataset and suggests the best-fitting model. The classic program was simply called ModelTest. It was developed by David Posada, and it was used primarily with PAUP. It actually wrote out the PAUP commands for the best-fitting model, so that one could simply copy and paste them into the Nexus file, and off we go.

Then people wanted to use the results of the model test for MrBayes. Problem was, MrBayes didn't do all the models that PAUP did, and it was annoying to see a model suggested that one couldn't implement. So Johan Nylander produced a modified version called MrModeltest, and it conveniently wrote out the MrBayes and PAUP commands for the best-fitting model.

These programs have now been superseded by jModelTest. On the plus side, this tool is highly portable and extremely user-friendly thanks to its GUI. The user does not need to have PAUP either, because jModelTest comes packaged with PhyML and hands the data over to that software. It is also parallelised. On the other hand, in contrast to its predecessors it does not appear to write out the PAUP or MrBayes code. But well, that is what my next post is going to deal with.

For the moment I want to address a different problem: when starting the likelihood calculations for various models, the user can select whether 3, 5, 7, 11 or 203 (!) substitution schemes are going to be tested. Each scheme is tested with equal and with unequal base frequencies, so the number of models is double the number of schemes. (And each of those with or without invariant sites and gamma, if desired.) But the thing is, it is not clear from just looking at the interface which models, for example, the seven schemes are going to cover. If I select seven, will all the models I could implement in BEAST be included? Or will I just waste a lot of computing time on testing models that BEAST can't do anyway?
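To get a feeling for how quickly the number of likelihood calculations grows, here is a small Python sketch of the arithmetic. The function name and its parameters are my own invention, not anything from jModelTest, and I am assuming that ticking invariant sites and gamma each doubles the number of models (so both together give the plain, +I, +G and +I+G variants).

```python
# Scheme counts offered in the jModelTest interface.
SCHEME_OPTIONS = (3, 5, 7, 11, 203)


def models_tested(schemes, base_freqs=True, inv_sites=False, gamma=False):
    """Return how many models are evaluated for a given number of schemes.

    Hypothetical helper: assumes each ticked option doubles the count.
    """
    count = schemes
    if base_freqs:   # equal and unequal base frequencies
        count *= 2
    if inv_sites:    # with and without +I
        count *= 2
    if gamma:        # with and without +G
        count *= 2
    return count


if __name__ == "__main__":
    for n in SCHEME_OPTIONS:
        print(n, models_tested(n), models_tested(n, inv_sites=True, gamma=True))
```

Under these assumptions eleven schemes with everything ticked come to 88 models, and 203 schemes come to 1624, which is why picking the smallest sufficient number of schemes matters.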

So I just ran all the numbers across a very small dataset, and this is what models are being tested in each case:

3 substitution schemes: JC, F81, K80, HKY, SYM, GTR

Three schemes is the most you need to test if deciding on a model for RAxML or MrBayes. Note that as of updating this post (30 Nov 2016) RAxML can only do JC, K80, HKY, and GTR, while MrBayes can do all six.

5 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TPM1, TPM1uf, SYM, GTR

I am writing them out as named in jModelTest, but note that TPM1 is synonymous with K81 or K3P, and TrN is called TN93 in some software. A "uf" clearly means a variant with unequal base frequencies, and as mentioned in the previous post "ef" is similarly a variant with equal base frequencies.

Five schemes is the most you need to test if you are model-testing for a BEAST run, because all its models are covered. It also seems to me that all models available in PhyML are covered, with the exception of F84. That one doesn't appear to be tested under higher numbers of substitution schemes either, so the same consideration applies, unless jModelTest uses a different name for it (?).

7 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TIM1ef, TIM1, TVMef, TVM, TPM1, TPM1uf, SYM, GTR

11 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TPM1, TPM1uf, TPM2, TPM2uf, TPM3, TPM3uf, TIM1ef, TIM1, TIM2ef, TIM2, TIM3ef, TIM3, TVMef, TVM, SYM, GTR

At this stage it becomes clear that there are a lot of strange variants of the Three-Parameter and Transitional Models that I overlooked in my previous post. I don't think they are used very often though...
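For anybody who would rather look this up than read through the lists, the results above can be put into a small Python lookup. This is only a sketch: the names MODELS_BY_SCHEMES, PROGRAM_MODELS and smallest_scheme_count are made up for illustration, and the RAxML and MrBayes model sets are simply the ones mentioned in this post.

```python
# Models tested for each number of substitution schemes, as listed in this post.
MODELS_BY_SCHEMES = {
    3: {"JC", "F81", "K80", "HKY", "SYM", "GTR"},
    5: {"JC", "F81", "K80", "HKY", "TrNef", "TrN", "TPM1", "TPM1uf",
        "SYM", "GTR"},
    7: {"JC", "F81", "K80", "HKY", "TrNef", "TrN", "TIM1ef", "TIM1",
        "TVMef", "TVM", "TPM1", "TPM1uf", "SYM", "GTR"},
    11: {"JC", "F81", "K80", "HKY", "TrNef", "TrN", "TPM1", "TPM1uf",
         "TPM2", "TPM2uf", "TPM3", "TPM3uf", "TIM1ef", "TIM1", "TIM2ef",
         "TIM2", "TIM3ef", "TIM3", "TVMef", "TVM", "SYM", "GTR"},
}

# Models each program can use, taken from the text of this post only.
PROGRAM_MODELS = {
    "RAxML": {"JC", "K80", "HKY", "GTR"},
    "MrBayes": {"JC", "F81", "K80", "HKY", "SYM", "GTR"},
}


def smallest_scheme_count(wanted):
    """Return the lowest scheme count whose model set covers `wanted`.

    Returns None if no listed scheme count tests all wanted models.
    """
    for n in sorted(MODELS_BY_SCHEMES):
        if set(wanted) <= MODELS_BY_SCHEMES[n]:
            return n
    return None


if __name__ == "__main__":
    for program, models in PROGRAM_MODELS.items():
        print(program, smallest_scheme_count(models))
    print("F84", smallest_scheme_count({"F84"}))
```

Asking for a model that is in none of the lists, such as F84, returns None, which matches my observation that it is not tested under any of these settings.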

203 substitution schemes: This adds a gazillion models named with a scheme of six letters or six letters plus F. Not sure how many people ever use them, so I will just stop here. I have a strong tendency to find GTR or HKY suggested for my data anyway...
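As an aside, 203 is not an arbitrary number: it is the number of ways to group the six substitution rates of a time-reversible model (AC, AG, AT, CG, CT, GT) into classes that share a rate. The Python sketch below enumerates those groupings as six-character codes. I am assuming here that the six-character model names follow this convention, with each position giving the rate class of one substitution type in the order above; the function name is my own.

```python
def rate_class_codes(n=6):
    """Yield every grouping of n rates as a string of class labels.

    Each code is a "restricted growth string": a position may reuse a
    class that already occurred or open exactly one new class, so every
    grouping is produced once regardless of how the classes are labelled.
    """
    def extend(prefix, used):
        if len(prefix) == n:
            yield "".join(str(c) for c in prefix)
            return
        for c in range(used + 1):
            yield from extend(prefix + [c], max(used, c + 1))

    yield from extend([], 0)


if __name__ == "__main__":
    codes = list(rate_class_codes())
    print(len(codes))          # 203
    print(codes[0])            # 000000: one rate for everything (JC, F81)
    print("010010" in codes)   # True: transitions vs. transversions (K80, HKY)
    print(codes[-1])           # 012345: six separate rates (SYM, GTR)
```

So the familiar named models are just a handful of these 203 groupings, and the rest are the ones that nobody bothered to give a name.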

(Updated 30 Nov 2016 to include information for RAxML and MrBayes.)
