Sunday, May 3, 2015

Ancestral character state reconstruction in Mesquite 3: parsimony versus likelihood

One of the more curious recent developments in my area is that some journals now make all reviewer reports available to all of the peer reviewers of a given manuscript. I like it because it allows me to get a better feeling for whether I have been too lenient or too critical, see other colleagues' style of making comments, and so on.

I recently reviewed a manuscript, and just two days ago I saw what the second reviewer thought. Our recommendations turned out to be generally the same, but one sentence of theirs really annoyed me. When discussing ancestral character state reconstruction, they complained that all reconstructions in the present study were done "only" with parsimony.

What this is about is figuring out what the ancestor of a group of species may have looked like. Let us say you consider spore-producing plants (lycophytes and the fern/horsetail lineage) versus seed-producing plants (seed plants, obviously) and want to know which of the two character states the common ancestor had; was the first vascular plant spore- or seed-producing? Did the first land animal have scales or hair? Especially in the absence of fossils, that is what these algorithms and models are for.

I have complained before about quasi-religious Bayesians who think that everything has to be Bayesian or it is worthless. Conversely, I also know of cladists who reject all modelling and likelihood analyses. This methodological fundamentalism all seems very silly to me. These methods are all tools with their own advantages and disadvantages (in particular, there is a clear trade-off between simplicity, transparency and speed on one side and sophistication on the other), and I prefer to be pragmatic and use whatever is most handy in any given situation.

So this "I only believe you if you do a likelihood reconstruction" mindset annoys me quite profoundly. For heaven's sake, they were criticising a simple ancestral state reconstruction of binary discrete characters; do they really expect that likelihood would suddenly give a totally different and much better result?

But then I thought, who knows? Let's actually take a look at the behaviour of the two approaches with a contrived dataset. So I started Mesquite 3.01 on my computer at home and gave it a simple phylogeny of six species (A,(B,((C,D),(E,F)))). I defined a few binary discrete characters and told Mesquite to trace them across this phylogeny using either parsimony or likelihood reconstruction, in the latter case using the usual Mk1 model of character evolution.

To my surprise, there is some difference between the two even in this very simple case. To my even greater surprise, the likelihood reconstruction doesn't seem to make a lot of sense.


Above is one example. Here species D, E and F in the four-species clade have the black state, whereas the early diverging A and B as well as the remaining member of that clade, C, have the white state. Parsimony (left) infers the ancestor to have the white state, but likelihood (right) is undecided at 50/50.
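The parsimony side of this example can be checked by hand or with a few lines of Python. What follows is a minimal sketch of the Fitch downpass, not Mesquite's own code. It assumes that the white species in the larger clade is C (my reading of the description above), and the names `TREE`, `STATES` and `fitch` are made up for illustration.

```python
# The six-species phylogeny (A,(B,((C,D),(E,F)))) as nested tuples.
TREE = ("A", ("B", (("C", "D"), ("E", "F"))))

# Tip states for the first example; C being the white species
# inside the four-species clade is an assumption.
STATES = {
    "A": {"white"}, "B": {"white"}, "C": {"white"},
    "D": {"black"}, "E": {"black"}, "F": {"black"},
}

def fitch(node, tip_states):
    """Fitch downpass: return (state set, number of changes) for a subtree."""
    if isinstance(node, str):          # a tip: its observed state, no changes
        return frozenset(tip_states[node]), 0
    left, right = node
    left_set, left_cost = fitch(left, tip_states)
    right_set, right_cost = fitch(right, tip_states)
    shared = left_set & right_set
    if shared:                         # children agree: keep the intersection
        return shared, left_cost + right_cost
    # children disagree: take the union and count one extra change
    return left_set | right_set, left_cost + right_cost + 1

root_set, steps = fitch(TREE, STATES)
print(sorted(root_set), steps)
```

At the root the downpass set is already the set of most parsimonious states, so for these tip states the sketch returns white only, with two changes on the tree.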

Now maybe it is just because I have a pro-parsimony bias, but that doesn't make a lick of sense to me. If you have two sister groups, one of which is shrubby and the other herbaceous, okay, then in the absence of any independent information it could go either way. But if there are several branches all of shrubs, and then only one group deeply nested in the phylogeny is herbaceous, surely that must budge likelihoods for the ancestral state towards what the early diverging lineages have in common? Perhaps to 75% or so?
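One way to probe that intuition is a rough sketch of marginal reconstruction under a symmetric two-state model, using the pruning algorithm. This is not what Mesquite does internally. It assumes every branch has length 1 (the post does not give branch lengths), a flat prior on the root state, C as the white species in the larger clade, and a rate that is set by hand rather than estimated. The function names are hypothetical.

```python
import math

TREE = ("A", ("B", (("C", "D"), ("E", "F"))))
TIPS = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}  # 0 = white, 1 = black

def transition(rate, t=1.0):
    """Two-state symmetric model: P(same state), P(different state) after time t."""
    decay = math.exp(-2.0 * rate * t)
    same = 0.5 + 0.5 * decay
    return same, 1.0 - same

def conditional(node, rate):
    """Conditional likelihoods [L(white), L(black)] of the data below a node."""
    if isinstance(node, str):
        return [1.0 if TIPS[node] == s else 0.0 for s in (0, 1)]
    same, diff = transition(rate)      # assumed unit branch length to each child
    result = [1.0, 1.0]
    for child in node:
        c = conditional(child, rate)
        result[0] *= same * c[0] + diff * c[1]
        result[1] *= diff * c[0] + same * c[1]
    return result

def root_white_probability(rate):
    """Marginal probability of white at the root under a flat root prior."""
    white, black = conditional(TREE, rate)
    return white / (white + black)

for rate in (0.01, 0.1, 1.0, 10.0):
    print(rate, round(root_white_probability(rate), 3))
```

Under these assumptions the probability of a white root is close to 1 when the rate is low and approaches 0.5 as the rate grows, because at high rates the tips carry almost no information about the root. Whether the rate Mesquite estimated for this character falls in that regime is not something the sketch can tell.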

The second example, with which I originally wanted to test the effect of missing data, is even worse:


Here, only two deeply nested species have the black state; all others have the white state except for the one where we don't know. And even in this case likelihood reconstruction strangely gives a near-50% probability of the common ancestor having had the black state when in my eyes it should be somewhere south of 10% if not pretty much zero.

No idea how the model figures that. At any rate this soundly failed to convince me that the Mk1 model works better than parsimony reconstruction; quite the opposite.

Which is probably going to be relevant in a few weeks. I have a paper under review where I also used parsimony reconstruction of ancestral character states. It is a really minor side issue, as the main point of the paper is ancestral area reconstruction, and for that we used Bayesian approaches. But I fully expect at least one reviewer to complain about me using parsimony for the characters. Because it is too unsophisticated a model to trip over its own shoelaces, I guess.

By the way, Mesquite and R have at least one thing in common. Yes, the former is GUI-based and for phylogenetics whereas the latter is user-unfriendly and for statistics, but both are open, collaborative projects that get updated frequently. And consequently both of them constantly complain that you should reinstall them. In my case, Mesquite told me off for using version 3.01 instead of - wait for it - 3.03. Yeah, that'll make a difference. And if I do, then in two weeks it will condescend towards me because I haven't got 3.04, as if I never had anything better to do than reinstall software.

(Updated to clarify phrasing.)
