PhyloBotanist: The Markov k model for discrete morphological data

The most frequently used model of character evolution for morphological data is called the Markov k (Mk) model. It was suggested by Lewis (2001) and is implemented in a few Likelihood or Bayesian phylogenetics programs.
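To make the model a bit more concrete, here is a minimal sketch of the transition probabilities under the Mk model, assuming k states with equal rates between all pairs and a branch length v measured in expected changes per character. The function name is my own invention and does not come from any phylogenetics package.

```python
import math


def mk_transition_probs(k, v):
    """Return (p_same, p_diff) for a k-state Mk model.

    p_same is the probability of ending a branch in the starting state,
    p_diff the probability of ending in one particular other state.
    v is the branch length in expected changes per character.
    """
    if k < 2:
        raise ValueError("the Mk model needs at least two states")
    if v < 0:
        raise ValueError("branch length cannot be negative")
    # All off-diagonal rates are equal, so a single decay term suffices.
    decay = math.exp(-k * v / (k - 1))
    p_same = 1.0 / k + (k - 1.0) / k * decay
    p_diff = 1.0 / k - 1.0 / k * decay
    return p_same, p_diff
```

At v = 0 nothing can have changed, so the function returns (1.0, 0.0); for very long branches both values approach 1/k, i.e. the starting state is forgotten.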

The idea here is that there are several discrete character states. For continuous traits like organ lengths one would therefore divide the continuum into categories, e.g. character state 0 for smaller than 5 cm and state 1 for larger than 5 cm. As that is also how most people build their datasets for parsimony analysis, the same data can often be used for both analyses.

Some software allows the states of one character to be ordered, so that to change from state 0 to state 2 a lineage has to pass through state 1, counting as two mutation steps. Some also allow for a gamma parameter, so that the different characters can fall into categories with different rates of change (some faster-evolving and some slower-evolving).
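The difference between unordered and ordered characters can be sketched as a step count between two states. This is only an illustration with a made-up function name, assuming that states are coded as consecutive integers; it is not how any particular program implements ordering internally.

```python
def step_cost(i, j, ordered=False):
    """Number of steps needed to change from state i to state j.

    Unordered: any change between different states is a single step.
    Ordered: a lineage has to pass through all intermediate states.
    """
    if ordered:
        return abs(i - j)
    return 0 if i == j else 1


# Changing from state 0 to state 2:
unordered_steps = step_cost(0, 2)              # one step
ordered_steps = step_cost(0, 2, ordered=True)  # two steps, via state 1
```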

Another important consideration with morphological data is the scoring approach. Datasets of sequence regions generally contain all the sequence data that were obtained, i.e. both the positions that are variable and the ones that are entirely constant across the study group. When scoring morphological data, however, people tend not to include characters that are constant. Imagine building a trait list for several species of frogs - would you add a column for "wings" only to have "no" as the only state across the entire group? Probably not. However, some datasets may contain constant characters, and they may or may not contain characters that differ for only one species. The analysis has to be told what to expect so that branch lengths in the resulting phylogeny are modelled well.

After my recent dive into nucleotide substitution models I also looked up how to properly set the Mk model in PAUP and MrBayes.

The Mk model in MrBayes

The Mk model is set automatically for matrices with datatype = standard. These data can have states 0-9, which should generally be enough.

Depending on which kinds of characters were scored, one can then use lset coding = all if the dataset includes constant characters. Alternative options are variable if there are no constant characters, and informative if there are neither constant characters nor characters that differ for only one species. The Mk model with only variable characters is also sometimes called the Mkv model.
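Checking by eye which of the three settings applies gets tedious for larger matrices, so here is a rough sketch of how one might check it in Python. The function names and the input format (a dictionary of taxon name to a string of single-character states, with "?" and "-" treated as missing) are my own assumptions. Polymorphic codings are not handled, and all rows are assumed to have the same length.

```python
from collections import Counter

MISSING = {"?", "-"}  # assumed symbols for missing or inapplicable data


def classify_character(column):
    """Classify one character as 'constant', 'uninformative' or 'informative'."""
    counts = Counter(state for state in column if state not in MISSING)
    if len(counts) <= 1:
        return "constant"
    # Informative: at least two states that each occur in two or more taxa.
    shared = sum(1 for n in counts.values() if n >= 2)
    return "informative" if shared >= 2 else "uninformative"


def suggest_coding(matrix):
    """Suggest a value for lset coding based on what the matrix contains."""
    kinds = {classify_character(col) for col in zip(*matrix.values())}
    if "constant" in kinds:
        return "all"
    if "uninformative" in kinds:
        return "variable"
    return "informative"


frogs = {"A": "000", "B": "000", "C": "100", "D": "110"}
coding = suggest_coding(frogs)  # third character is constant, so "all"
```

Note that "uninformative" here means variable but not parsimony-informative, which includes the case of a state found in only one species.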

If there are no constant characters, equal rates of change for all characters can be assumed with lset rates = equal, variable rates with lset rates = gamma. If constant characters are included, my understanding is that propinv and invgamma should be used instead.

The default is that all characters are unordered. They can be changed to ordered by using the ctype command, as in ctype ordered: 2 4 for characters number two and number four.
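Putting the settings above together, a small helper can write the corresponding MrBayes block so that it can be pasted at the end of a NEXUS file. This is a sketch under my own assumptions: the function name is hypothetical, it only covers the lset and ctype options discussed in this post, and it deliberately leaves out everything needed to actually start the run.

```python
CODING_OPTIONS = {"all", "variable", "informative"}
RATE_OPTIONS = {"equal", "gamma", "propinv", "invgamma"}


def mrbayes_block(coding="variable", rates="gamma", ordered=()):
    """Return a MrBayes block as text for the chosen Mk model settings.

    ordered is a sequence of character numbers (starting at 1) that
    should be treated as ordered; all others stay unordered.
    """
    if coding not in CODING_OPTIONS:
        raise ValueError(f"unknown coding option: {coding}")
    if rates not in RATE_OPTIONS:
        raise ValueError(f"unknown rates option: {rates}")
    lines = ["begin mrbayes;"]
    lines.append(f"  lset coding = {coding} rates = {rates};")
    if ordered:
        numbers = " ".join(str(n) for n in ordered)
        lines.append(f"  ctype ordered: {numbers};")
    lines.append("end;")
    return "\n".join(lines)


print(mrbayes_block(coding="variable", rates="gamma", ordered=(2, 4)))
```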

The Mk model in PAUP

I have tried setting the Mk model in one of the new test versions of PAUP, specifically 4a149. To set the model, use lset nst = Mkv. Unfortunately, beyond that the options are rather limited. The model always assumes equal rates, and as that little v at the end indicates it also seems to assume that all constant characters have been excluded.

Mk model versus parsimony: my admittedly anecdotal experience

I have always made clear that I am not really that terribly interested in philosophical foundations or statistical theory when using a phylogenetic method. For me the real questions are pragmatic ones:

Does the method produce sensible results with empirical data, i.e. results that fit information that we have from other data?
Does the method produce the correct results with simulated data?
Is the method computationally feasible? (What good is a robust Bayesian coalescent approach if it takes weeks on a supercomputer even for six species?)
Can the method be misled in certain scenarios? But if so, are these scenarios likely to be frequent, or are there other ways of dealing with them than discarding the method? (E.g. different data or better taxon sampling to deal with Long Branch Attraction.)

For the Mk model, the problem is mostly the first point. Just for the giggles, I have in the past used it on a few morphological datasets from small genera, and the results were generally much less convincing than the ones from parsimony analysis. I have also used it in Mesquite for ancestral character reconstruction along trees obtained from e.g. Bayesian analysis of sequence data, and the results were rather nonsensical.

That being said, after the recent publication claiming that Bayesian phylogenetics outperforms parsimony on simulated data, I tried again with a little dataset I am generating, at that moment only 23 traits for 13 species. I am happy to report that the results of running those data through MrBayes were much more meaningful than what I had seen in the past. So I will definitely keep that in mind as an option.

Another interesting observation, however, is that Likelihood or Bayesian analysis of morphological data tends to produce fully resolved trees where parsimony shows uncertainty clearly as polytomies. This is rather ironic given that one of the main arguments of Bayesians is that their preferred approach better shows uncertainty in the data. Of course one could point at low Posterior Probabilities and say, see, there is your measure of uncertainty, but then again support values are always worse for morphological data than for sequences simply because there are far fewer characters. It is not rare to have a dataset with fifty taxa but only twenty characters; of course you will never see a lot of 100% bootstraps or 1.00 PPs under those circumstances, even in the best cases. Thus a fully resolved tree will look very suggestive even at 0.57 PP where a polytomy tells us that we really don't know.

A final reason why I will not soon drop parsimony analysis for morphological data (even as I will give the Mk model more attention) is that there are numerous well-established ways of doing parsimony according to how a character can be expected to evolve. Assume, for example, that you have four states 0, 1, 2, and 3, and that 1-3 can all arise from 0 but not from each other (meaning that to get from 1 to 2 you have to pass through 0). Or assume that you want to set a character state so that it was gained precisely once but cannot be regained once it is lost.
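Such assumptions are usually expressed as a step matrix giving the cost of every possible change. As a minimal sketch, the matrix can be derived from a list of allowed single-step changes by finding the shortest path between each pair of states. The function name and the input format are my own, and the two-state example only encodes "no regain after loss"; the "gained precisely once" part would need additional handling in the parsimony program.

```python
import math


def step_matrix(n_states, allowed):
    """Build a step matrix from allowed single-step changes.

    allowed is a list of (from_state, to_state) pairs, each costing one
    step. Changes that are impossible get a cost of infinity.
    """
    cost = [[0 if i == j else math.inf for j in range(n_states)]
            for i in range(n_states)]
    for a, b in allowed:
        cost[a][b] = 1
    # Floyd-Warshall: allow changes that pass through intermediate states.
    for mid in range(n_states):
        for i in range(n_states):
            for j in range(n_states):
                via = cost[i][mid] + cost[mid][j]
                if via < cost[i][j]:
                    cost[i][j] = via
    return cost


# States 1-3 arise from 0 (and revert to 0) but not from each other.
star = [(0, s) for s in (1, 2, 3)] + [(s, 0) for s in (1, 2, 3)]
star_costs = step_matrix(4, star)        # star_costs[1][2] == 2

# State 1 (present) can be lost to 0 (absent) but never regained.
no_regain = step_matrix(2, [(1, 0)])     # no_regain[0][1] is infinity
```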

It would be easy to set this up in parsimony. Maybe it is possible to do this in a model-based analysis, but if so then it is at least not part of standard implementations. More generally, the assumption behind a model that there is a general process across all the characters in the analysis makes a lot of sense for molecular data. A base pair is a base pair, and all the sequence positions will be affected by polymerase errors. But does it make nearly as much sense for morphology? A fruit shape is not the same thing as the presence or absence of stipules, and a collar bone shape is not the same thing as the possession of a red patch on the throat.

Again, I am happy to admit that the Mk model in MrBayes surpassed my expectations, and I will use it more often in the future. I am, however, still not ready to do without the option of parsimony, at least for the admittedly rare cases when I want to analyse morphological data.

References

Lewis PO, 2001. A Likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology 50: 913-925.

2 comments:

April Wright, July 6, 2016 at 4:03 AM
You might be interested in this paper, which I recently wrote with David Hillis and Graeme Lloyd: http://sysbio.oxfordjournals.org/content/early/2015/12/22/sysbio.syv122.short

(blog here if you don't want to read the whole thing: http://wrightaprilm.github.io/posts/morph-paper.html)
I might pick this project up a bit more in RevBayes to elaborate the model for use with a dataset I'm assembling. It's a pretty cool time to be interested in modeling morphology.

Sunday, July 3, 2016
