Friday, November 4, 2016

CBA seminar on molecular phylogenetics

Today I went to a Centre of Biodiversity Analysis seminar over at the Australian National University: Prof. Lindell Bromham on Reading the story in DNA - the core principles of molecular phylogenetic inference. This was very refreshing, as I have spent most of the year doing non-phylogenetic work such as cytology, programming, species delimitation, and building identification keys.

The seminar was packed, the audience was lively and from very diverse fields, and the speaker was clear and engaging. As can be expected, Prof. Bromham started with the very basics but had nearly two hours (!) to get to very complicated topics: sequence alignments, signal saturation, distance methods, parsimony analysis, likelihood phylogenetics, Bayesian phylogenetics, and finally various problems with the latter, including choice of priors or when results merely restate the priors.

The following is a slightly unsystematic run-down of what I found particularly interesting. Certainly other participants will have a different perspective.

Signal saturation or homoplasy at the DNA level erases the historical evidence. Not merely: makes the evidence harder to find. Erases. It is gone. That means that strictly speaking we cannot infer or even estimate phylogenies even with a superb model; we can only ever build hypotheses.
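A quick toy simulation (my own sketch, not from the seminar) shows what saturation means in practice: keep hitting a sequence with random substitutions and the observed proportion of differing sites plateaus near 75%, no matter how much more change accumulates. Past that point, a pair of sequences that diverged long ago looks the same as a pair of random sequences; the extra history is simply gone.

```python
import random

random.seed(1)
BASES = "ACGT"

def evolve(seq, n_subs):
    """Apply n_subs substitutions to a copy of seq: a random site is
    changed to one of the three other bases (Jukes-Cantor-like)."""
    seq = list(seq)
    for _ in range(n_subs):
        i = random.randrange(len(seq))
        seq[i] = random.choice([b for b in BASES if b != seq[i]])
    return "".join(seq)

def p_distance(a, b):
    """Observed proportion of sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

ancestor = "".join(random.choice(BASES) for _ in range(2000))
for subs in (200, 2000, 20000):
    derived = evolve(ancestor, subs)
    print(subs, round(p_distance(ancestor, derived), 3))
```

The printed distances rise at first but flatten out near 0.75: a tenfold increase in the true amount of change leaves almost no trace in the observable difference, which is exactly the "erased, not hidden" point.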

Phylogenetics is a social activity. The point is that fads and fashions, irrational likes and dislikes, groupthink, the age of a method, and quite simply the availability and user-friendliness of software determine the choice of analysis quite as much as the appropriateness of the analysis. Even if one were able to show that parsimony, for example, works well for a particular dataset, one would still not be able to get the paper into any prestigious journal except Cladistics. And yes, she stressed that there is no method that is automatically inappropriate, even distance or parsimony. It depends on the data.

Any phylogenetic approach taken in a study can be characterised with three elements: a search strategy, an optimality criterion, and a model of how evolution works. For parsimony, for example, the search strategy is usually heuristic (not her words, see below), the optimality criterion is the minimal number of character changes, and the implicit model is that character changes are rare and homoplasy is absent.
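The optimality criterion part is mechanical enough to sketch in a few lines. For a single character on a fixed rooted binary tree, the Fitch (1971) algorithm counts the minimum number of state changes; a parsimony search then just asks which tree minimises the sum of these counts over all characters. This is my own minimal illustration, not anything shown in the seminar:

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a rooted
    binary tree (Fitch algorithm). tree: nested tuples of leaf names;
    states: dict mapping each leaf name to its character state."""
    changes = 0

    def post(node):
        nonlocal changes
        if isinstance(node, str):              # leaf: its observed state
            return {states[node]}
        left, right = (post(child) for child in node)
        common = left & right
        if common:                             # intersection: no change needed
            return common
        changes += 1                           # disjoint sets: one change forced
        return left | right

    post(tree)
    return changes

# ((A,B),(C,D)) with A=0, B=0, C=1, D=1 needs a single change
tree = (("A", "B"), ("C", "D"))
print(fitch_score(tree, {"A": "0", "B": "0", "C": "1", "D": "1"}))  # 1
```

Nothing in this scoring function is a search strategy: exhaustive enumeration, branch-and-bound, or heuristic branch swapping would all call the same criterion, which is precisely why the two elements should be kept apart.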

The more sophisticated the method, the harder it gets to state its assumptions. Just saying out loud all the assumptions behind a BEAST run would take a lot of time. Of course that does not mean that the simpler methods do not make assumptions - they are merely implicit. (I guess if one were to spell them out, they would then often be "this factor can safely be ignored".)

Nominally Bayesian phylogeneticists often behave in very un-Bayesian ways. Examples are use of arbitrary Bayes factor cut-offs, not updating priors but treating every analysis as independent, and frowning upon informative topology priors.

Unfortunately, in Bayesian phylogenetics priors determine the posterior more often than most people realise. This brought me back to discussions with a very outspoken Bayesian seven years ago; his argument was "a wrong prior doesn't matter if you have strong data", which if true would kind of make me wonder what the point is of doing Bayesian analysis in the first place.
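The "strong data" argument can be made concrete with the simplest conjugate toy example I can think of (mine, not the seminar's): a Beta prior on a proportion with binomial data. With few observations the posterior sits almost on top of the prior; only with a lot of data does it move towards what was actually observed. The catch in phylogenetics is that for some parameters the data are never strong, so the posterior for those parameters is the prior.

```python
def posterior_mean(a, b, k, n):
    """Posterior mean of a proportion under a Beta(a, b) prior after
    observing k successes in n trials; the posterior is Beta(a+k, b+n-k)."""
    return (a + k) / (a + b + n)

# a strongly skewed prior: Beta(50, 1) insists the proportion is near 1
prior_a, prior_b = 50, 1

# weak data (6 of 10): the posterior barely budges from the prior
print(round(posterior_mean(prior_a, prior_b, 6, 10), 2))      # ~0.92

# strong data (600 of 1000): the data finally overwhelm the prior
print(round(posterior_mean(prior_a, prior_b, 600, 1000), 2))  # ~0.62
```

So "a wrong prior doesn't matter if you have strong data" is true as far as it goes; the problem is knowing, for every parameter in a complex model, whether the data are actually strong enough for the condition to hold.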

However, Prof. Bromham also said a few things that I found a bit odd, or at least potentially in need of some clarification.

She implied that parsimony analysis generally used exhaustive searches. Although there was also a half-sentence to the effect of "at least originally", I would stress that search strategy and optimality criterion are two very different things. Nothing keeps a likelihood analysis from using an exhaustive search (except that it would not stop before the heat death of the universe), and conversely no TNT user today who has a large dataset would dream of doing anything but heuristic searches. Indeed the whole point of that program was to offer ways of cutting even more corners in the search.

Parsimony analysis is also a form of likelihood analysis. Well, I would certainly never claim, as some people do, that it comes without assumptions. I would say that parsimony has a model of evolution in the same sense as the word model is used across science, yes. I can also understand how and why people interpret parsimony as a model in the specific sense of likelihood phylogenetics and examine what that means for its behaviour and parameterisation compared to other models. But calling it a subset of likelihood analysis still leaves me a bit uncomfortable, because it does not use likelihood as a criterion but simply tree length. Maybe I am overlooking something, in fact most likely I am overlooking something, but to me the logic of the analysis seems to be rather different, for better or for worse.

One of the reasons why parsimony has fallen out of fashion is that "cladistics" is an emotional and controversial topic; this was illustrated with a caricature of Willi Hennig dressed up as a saint. I feel that this may conflate Hennig's phylogenetic systematics with parsimony analysis, in other words a principle of classification with an optimality criterion. Although the topic is indeed still hotly debated by a small minority, phylogenetic systematics is today state of the art, even as people have moved to using Bayesian methods to figure out whether a group is monophyletic or not.

The main reasons for the popularity of Bayesian methods are (a) that they allow more complex models and (b) that they are much faster than likelihood analyses. The second claim surprised me greatly because it does not at all reflect my personal experience. When I later discussed it with somebody at work, I realised that it depends greatly on what software we choose for comparison. I was thinking BEAST versus RAxML with fast bootstrapping, i.e. several days on a supercomputer versus less than an hour on my desktop. But if we compare MrBayes versus likelihood analysis in PAUP with thorough bootstrapping, well, suddenly I see where this comes from.

These days you can only get published if you use Bayesian methods. Again, that is not at all my experience. It seems to depend on the data, not least because huge genomic datasets often cannot be processed with Bayesian approaches anyway. We can see likelihood trees of transcriptome data published in Nature, or ASTRAL trees in other prestigious journals. Definitely not Bayesian.

In summary, this was a great seminar to go to especially because I am planning some phylogenetics work over summer. It definitely got the old cogs turning again. Also, Prof. Bromham provided perhaps the clearest explanation I have ever heard of how Bayesian/MCMC analyses work, and that may become useful for when I have to discuss them with a student myself...
