Showing posts with label parsimony. Show all posts

Thursday, September 26, 2019

Incongruence Length Difference test in TNT

Because I am fed up with figuring it out anew every time I need to use the Incongruence Length Difference (ILD) test (Farris et al., 1994) in TNT, I will post it once and for all here:

Download TNT and the script "ildtnt.run" from PhyloWiki. In the script, you may have to replace all instances of "numreps" with "num_reps" to make it work; I at least get the error "numreps is a reserved expression", suggesting that the programmer should not have used a reserved word as a variable name.

Open TNT, increase the available memory, and set the data type to DNA with gaps treated as missing data. Then load your data matrix, which should of course be in TNT format:

mxram 200 ;
nstates DNA ;
nstates NOGAPS ;
proc (your_alignment_file_name) ;

Look up how many characters your first partition has, then run the test with:

run ildtnt.run (length_of_first_partition) (replicates) ;
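Putting the whole session together, it might look like the following. The alignment file name, partition size, and replicate count here are purely hypothetical placeholders; substitute your own values:

```
mxram 200 ;
nstates DNA ;
nstates NOGAPS ;
proc combined_matrix.tnt ;
run ildtnt.run 800 1000 ;
```

This assumes a matrix whose first partition spans the first 800 characters, tested with 1,000 replicates.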

There is an alternative script for the test called Ild.run, but I have so far failed to set its number of user variables high enough to accommodate my datasets; it appears to be capped at 1,000.

Perhaps this guide will also be useful to somebody besides me.

Reference

Farris JS, Källersjö M, Kluge AG, Bult C, 1994. Testing significance of incongruence. Cladistics 10: 315-319.

Friday, April 7, 2017

Parsimony versus models for morphological data: a recent paper

I have written on this blog before about the use of likelihood or Bayesian phylogenetics for morphological data. In our journal club this week we discussed another of the small but growing number of recent papers arguing that parsimony should be dropped in favour of model-based analyses even for morphology:
Puttick et al., 2017. Uncertain-tree: discriminating among competing approaches to the phylogenetic analysis of phenotype data. Proceedings of the Royal Society B: Biological Sciences 284, doi 10.1098/rspb.2016.2290
Puttick et al. constructed maximally balanced and maximally unbalanced phylogenies, simulated sequence data for them under the HKY + G model of nucleotide substitution, and turned the data matrices into binary and presumably unordered multistate integer characters. They then used equal weights parsimony, implied weights parsimony, and Bayesian and likelihood analyses under the Mk model to try to recover the phylogenies, with an eye on accuracy (correctness) and tree resolution. In a second approach, they reanalysed previously published morphological datasets to see what happened to controversial taxon placements under the different approaches.

One of the problems with simulation studies is always that they can come out as kind of circular: if you simulate data under a model it is no surprise that the same model would perform best when trying to infer the input into the simulations. In this case Puttick et al. were admirably circumspect in that not only did they simulate their data under a different model (HKY + G) than that ultimately used in phylogenetic analysis (Mk), but they also repeated the analyses until they had achieved a distribution of homoplasy that mirrored the one found in empirical datasets. This is important because morphology datasets for parsimony analysis are scored to minimise homoplasy, while uncritically simulating matrices may lead to much higher levels of homoplasy, thus putting parsimony at a disadvantage.
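The level of homoplasy being calibrated here is commonly summarised by the consistency index: the minimum possible number of steps for a character divided by the number of steps it actually requires on the tree. As a sketch of where those step counts come from, here is the standard Fitch counting pass for a single unordered character; the four-taxon tree, tip states, and all names are invented purely for illustration:

```python
def fitch_steps(tree, tip_states):
    """Count parsimony steps for one unordered character (Fitch algorithm).

    tree: dict mapping each internal node to its (left, right) children;
    tip_states: dict mapping each tip name to its observed state.
    """
    steps = 0

    def postorder(node):
        nonlocal steps
        if node in tip_states:              # tip: its observed state
            return {tip_states[node]}
        left, right = tree[node]
        a, b = postorder(left), postorder(right)
        common = a & b
        if common:                          # children agree: no extra step
            return common
        steps += 1                          # disjoint state sets cost one step
        return a | b

    postorder("root")
    return steps

# Hypothetical tree ((t1,t2),(t3,t4)) with one homoplasious binary character
tree = {"root": ("n1", "n2"), "n1": ("t1", "t2"), "n2": ("t3", "t4")}
tip_states = {"t1": 0, "t2": 1, "t3": 0, "t4": 1}

observed = fitch_steps(tree, tip_states)    # 2 steps on this tree
minimum = len(set(tip_states.values())) - 1 # 1 step is the best possible
ci = minimum / observed                     # consistency index = 0.5
```

A consistency index of 1 means no homoplasy for that character; the further it drops below 1, the more parallel changes or reversals the tree has to assume.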

Still, it should be observed that the HKY + G model is nonetheless unlikely to have produced data that are a realistic representation of morphological datasets, especially considering that the latter would at a minimum also include multistate characters with ordered states. Also, from a cladist's perspective homoplasy in a morphological dataset is a character scoring error waiting to be corrected in a subsequent analysis. But well, of course using zero homoplasy datasets would also have been unrealistic because real life datasets do have homoplasy in them. (And of course parsimony would "win" all the time if there was zero homoplasy, pretty much by definition.)

Now what are the results? To simplify, Bayesian was best at getting the tree topology right, followed by equal weights parsimony and implied weights parsimony, with likelihood coming in last. Likelihood always produces fully resolved trees, and Bayesian produces the least resolved ones. The authors argue, as Bayesians would, that this is exactly how it should be, as it simply tells us that the data aren't strong enough; the other approaches may give us false confidence. (Although of course parsimony and likelihood analyses can likewise involve several different ways of quantifying support or confidence.)

In conclusion, Puttick et al. make the following recommendations:

First, Bayesian inference should be the preferred approach.

Second, future morphological datasets should be scored with model-based approaches in mind. This means that the number of characters should be maximised by including homoplasious ones, because that will allow a better estimate of rates. As this is the exact opposite of the scoring strategy that parsimony analysis requires, it will be hard to change habits.

What is more, I have to smile at Puttick et al.'s expectations here: they simulated data matrices of 100, 350 and 1,000 characters. Maybe you can get 400 or so for some animals (if the fossils are well enough preserved), but for any plant group I have worked on I would struggle to get 30. And wouldn't you know it, the single empirical botanical dataset they re-analysed had only 48.

Third, researchers should lower their expectations and get used to living with unresolved relationships, as Bayesian analysis produces less resolved phylogenies.

Our discussion of the paper was wide-ranging. When I commented that one of the advantages of traditional parsimony software is that it easily allows the implementation of any step matrix that is needed (imagine a character where state 0 can change into states 1, 2 or 3, but 1-3 cannot change into each other) I was informed that that is in fact possible in BEAST. That is a pleasant surprise, as I had assumed that it was limited to setting a few simple models such as standard Mk for unordered states, nothing more. However, those who have written XML files for BEAST may want to consider if that is "easy" compared with writing a Nexus file for PAUP. Personally I find BEAST input files very hard to understand.

Another concern was that while nucleotide substitution models are based on a fairly good understanding of what can happen to DNA nucleotides which, after all, have a limited number of states and transitions between those states, it is considerably less clear what the most appropriate model for any given morphological character is.

What is more, somebody pointed out that there are essentially two options in a model based analysis: either the likelihood of state transitions is fixed, which is a difficult decision to make, or it is estimated during the analysis. But in the latter case the probability of, for example, changing the number of petals would be influenced by the probability of shifting between opposite and alternate leaf arrangement. And clearly that idea is immediately nonsensical.

In summary, the drumbeat of papers on the lines of "we are the Bayesians; you will be assimilated; resistance is futile" is not going to stop any time soon. I use Bayesian and likelihood analyses all the time for molecular data, no problem. But I am still not convinced that the Mk model would be my go-to approach the next time I have to deal with morphological data. It seems to me that it is much easier to justify one's model selection in the case of DNA than in the case of, say, flower colour or leaf length; that the idea of setting one model and estimating gamma across totally incomparable traits is odd; and that I would hardly ever have enough characters for Bayesian analysis to produce more than a large polytomy.

But I guess all that depends on the study group. I can imagine there would be morphometric data for some groups of organisms for which stochastic models work quite well.

Sunday, December 11, 2016

Cladistics textbook, part 2

Coming back to the textbook

Kitching IJ, Forey PL, Humphries CJ, Williams DM, 1998. Cladistics second edition - the theory and practice of parsimony analysis. The Systematics Association Publication No. 11. Oxford Science Publications.

..., in my previous post I mentioned that I also ran into a section that I find hard to agree with. The chapter on support values opens with the following:

Page 118: The study of phylogeny is an historical science, concerned with the discovery of historical singularities. Consequently, we do not consider phylogenetic inference per se to be fundamentally a statistical question, open to discoverable and objectively definable confidence limits. Hence, we are in diametric opposition to those who would include such a standard statistical framework as part of cladistic theory and practice.
I can only repeat in slightly different words what I wrote some time ago about the same question in the context of biogeographic studies. I find it hard to draw a line between historical science and non-historical science, not least because, to take just one example, any physical experiment, be it ever so reproducible, turns into a singular historical event a split second after it has been conducted.

To me there is really no big difference. We always infer what is most likely to have happened in individual instances in the past and then draw more general conclusions from those instances, no matter whether it is history or social science, archeology or engineering, paleobotany or (extant) plant taxonomy, evolutionary biology or population genetics.

I assume that a big part of the difference in perspective here is about what organismal characters people are thinking of. Reading through the cladistics textbook, the focus is pretty much always on morphology. Reading through works that introduce likelihood or Bayesian phylogenetics, in other words probabilistic and model-based evolutionary analysis, the focus is pretty much always on nucleotide sequence data, with protein sequence data coming a distant second.

It makes sense to me that somebody who thinks predominantly in terms of trait shifts like the evolution of bird feathers from scales or of angiosperm gynoecia from ovules sitting nakedly on a stalk would have reason to favour parsimony analysis. In fact I myself, despite frequently using likelihood and Bayesian phylogenetics for sequence data, would still have to be counted among those who are highly sceptical whether the Mk model works better with morphological traits than parsimony.

These kinds of characters have very low homoplasy, at least if scored correctly; and where they do show homoplasy, I would say that is due to a scoring error that can be rectified (e.g. if double fertilisation has evolved independently in angiosperms and gnetophytes then the two should be scored as separate character states). And it just so happens that parsimony analysis is a better tool for the data the less homoplasy there is. What is more, it seems a bit odd to try and apply the same model to all morphological characters, given how vastly different they are.

It also makes a lot of sense to me that somebody who thinks predominantly in terms of trait shifts like an A in the DNA sequence turning into T would see reason to favour analyses using models of sequence evolution. As Prof. Bromham pointed out during her talk I heard a few weeks ago, if that A has changed into a T in two parallel instances and then all the A-carrying individuals died out there is no way in which we can ever find evidence for that.

In other words, in the case of our four letter soup of DNA sequence characters homoplasy is not a scoring error to be discovered by looking closer but a hard fact of life that we cannot rid ourselves of (except to the degree that we can choose slower-evolving markers). And it just so happens that parsimony analysis is a worse tool for the data the more homoplasy there is, while the right model-based approach can deal with that. (Or at least somewhat better - obviously, once homoplasy is so rampant that all signal is lost no phylogenetic method will work, and likelihood analysis has also been shown to suffer from long branch attraction.) What is more, it seems logical to apply the same model to all DNA sequence characters, given that they are equivalent nucleotides along a chain.

So when I call myself a cladist, what I mean is not that I prefer parsimony analysis for all data, but that I acknowledge Willi Hennig's legacy, the idea that systematists should classify consistently by relatedness.

Tuesday, November 29, 2016

Cladistics textbook

In my office I have two 'proper' phylogenetics textbooks, that is counting only those that cover the principles and theory as opposed to offering only a practical how-to manual. One is Felsenstein's, who is strongly associated with likelihood phylogenetics, although his book covers all approaches. The second is:

Kitching IJ, Forey PL, Humphries CJ, Williams DM, 1998. Cladistics second edition - the theory and practice of parsimony analysis. The Systematics Association Publication No. 11. Oxford Science Publications.

As the title implies, it is entirely about parsimony phylogenetics.

Having recently looked into Kitching et al., I noticed two short sections that I found interesting enough to discuss here. I will start with the question of ancestors. Proponents of paraphyletic taxa often make claims on the lines of cladists "not accepting the existence of ancestral species", of "ignoring ancestors", or of "treating all species as sister taxa".

Here now we have a textbook written by cladists, in other words the official version, to the degree that an official version exists. It is, of course, not as easy as that, because the only thing that unites cladists in the sense of what paraphylists argue against is that supraspecific taxa should be monophyletic. Many other details differ from cladist to cladist, and in that sense the term cladist includes even those who use e.g. Bayesian phylogenetics.

I also do not want to give the impression that I, personally, take what Kitching et al. promote on this or that detailed question to necessarily be The Correct View. It is well possible that I, a cladist, find myself in disagreement with some chapter of that textbook. I am not even arguing here, in this instance, that making taxa monophyletic is the way to go (although of course I do believe that).

No, the point of the post is merely this: if Kitching et al. argue not-XYZ, then this demonstrates decisively that any claim of all cladists arguing XYZ is nonsense.

So, about ancestors, and turning to page 14 of the textbook:
In fact, to date, Archaeopteryx has no recognized autapomorphies. Indeed, if there were, Archaeopteryx would have to be placed as the sister-group to the rest of the birds.
It does not matter here whether more recent analyses have demonstrated Archaeopteryx to have autapomorphies and to actually have been a side branch relative to modern birds. We should here simply think of any species that looks exactly as the ancestral species of a later-existing clade is inferred to have looked.

It should be clear that the above section is correct. An ancestral species would not have any systematically useful characters relative to its descendants, because that descendant clade would have started out as that species. My view - and here other cladists may differ - is actually that the ancestral species and the clade are one and the same. The ancestral species has over time turned (diversified) into the clade.
In terms of unique characters, Archaeopteryx simply does not exist. This is absurd, for its remains have been excavated and studied. To circumvent this logical dilemma, cladists place likely ancestors on the cladogram as the sister-group to their putative descendants and accept that they must be nominal paraphyletic taxa (Fig. 1.9c). Ancestors, just like paraphyletic taxa in general, can only be recognized by a particular combination of characters that they have and characters that they do not have. The unique attribute of possible ancestors is the time at which they lived.
Here is the reason why paraphylists complain about ancestors being treated as sister to their descendants: they are treated like that, crucially, so that we can do the analysis. It is a practical, not a philosophical reason.

Note also that at least the cladists who wrote the textbook do not have any problem with paraphyletic species. Whether we think that this use of the word paraphyletic makes sense or not (I, for one, do not), it is discussions like this one which make me groan in frustration whenever I read a paraphylist claim that cladists only accepted paraphyletic species as a cop-out once they could no longer deny that they existed. No, cladism was founded on the principle that monophyly applies above the species level, so it never had to backpedal like that.
After a cladistic analysis has been completed the cladogram may be reinterpreted as a tree (see below)
What they mean here is that they see a cladogram as such (merely) as a different visualisation of the data from the data matrix, while the "tree" is the cladogram's interpretation in terms of evolutionary relationships, of actual genealogical relatedness of the terminals.
and at this stage some palaeontologists may choose to recognize these paraphyletic taxa as ancestors, particularly when they do not overlap in time with their putative descendants (see Smith 199a for a discussion).
And this is the main point. Here we have a group of senior cladists who wrote, to put it in the simplest possible terms, "we need to treat every species as a terminal to get a cladogram, but then if you wish you can interpret a terminal without autapomorphies as an ancestor".

It is as if the people who claim that cladists do not accept the existence of ancestors haven't even bothered to figure out what any cladists really think.

Next time I will look at a short section of the textbook that I definitely disagree with.

Friday, November 4, 2016

CBA seminar on molecular phylogenetics

Today I went to a Centre of Biodiversity Analysis seminar over at the Australian National University: Prof. Lindell Bromham on Reading the story in DNA - the core principles of molecular phylogenetic inference. This was very refreshing, as I have spent most of the year doing non-phylogenetic work such as cytology, programming, species delimitation, and building identification keys.

The seminar was packed, the audience was lively and from very diverse fields, and the speaker was clear and engaging. As can be expected, Prof. Bromham started with the very basics but had nearly two hours (!) to get to very complicated topics: sequence alignments, signal saturation, distance methods, parsimony analysis, likelihood phylogenetics, Bayesian phylogenetics, and finally various problems with the latter, including choice of priors or when results merely restate the priors.

The following is a slightly unsystematic run-down of what I found particularly interesting. Certainly other participants will have a different perspective.

Signal saturation or homoplasy at the DNA level erases the historical evidence. Not merely: makes the evidence harder to find. Erases. It is gone. That means that strictly speaking we cannot infer or even estimate phylogenies, even with a superb model, we can only ever build hypotheses.

Phylogenetics is a social activity. The point is that fads and fashions, irrational likes and dislikes, groupthink, the age of a method, and quite simply the availability and user-friendliness of software determine the choice of analysis quite as much as the appropriateness of the analysis. Even if one were able to show that parsimony, for example, works well for a particular dataset one would still not be able to get the paper into any prestigious journal except Cladistics. And yes, she stressed that there is no method that is automatically inappropriate, even distance or parsimony. It depends on the data.

Any phylogenetic approach taken in a study can be characterised by three elements: a search strategy, an optimality criterion, and a model of how evolution works. For parsimony, for example, the search strategy is usually heuristic (not her words, see below), the optimality criterion is the minimal number of character changes, and the implicit model is that character changes are rare and homoplasy is absent.

The more sophisticated the method, the harder it gets to state its assumptions. Just saying out loud all the assumptions behind a BEAST run would take a lot of time. Of course that does not mean that the simpler methods do not make assumptions - they are merely implicit. (I guess if one were to spell them out, they would then often be "this factor can safely be ignored".)

Nominally Bayesian phylogeneticists often behave in very un-Bayesian ways. Examples are use of arbitrary Bayes factor cut-offs, not updating priors but treating every analysis as independent, and frowning upon informative topology priors.

Unfortunately, in Bayesian phylogenetics priors determine the posterior more often than most people realise. This brought me back to discussions with a very outspoken Bayesian seven years ago; his argument was "a wrong prior doesn't matter if you have strong data", which if true would kind of make me wonder what the point is of doing Bayesian analysis in the first place.

However, Prof. Bromham also said a few things that I found a bit odd, or at least potentially in need of some clarification.

She implied that parsimony analysis generally used exhaustive searches. Although there was also a half-sentence to the effect of "at least originally", I would stress that search strategy and optimality criterion are two very different things. Nothing keeps a likelihood analysis from using an exhaustive search (except that it would not finish before the heat death of the universe), and conversely no TNT user today with a large dataset would dream of doing anything but heuristic searches. Indeed the whole point of that program is to offer ways of cutting even more corners in the search.
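The combinatorics behind that quip are easy to check: the number of distinct unrooted binary trees for n taxa is (2n-5)!!, a standard result, so exhaustive search stops being an option very quickly. A minimal sketch (the comments give only rough orders of magnitude):

```python
def num_unrooted_trees(n):
    """Number of distinct unrooted binary trees for n labelled taxa: (2n-5)!!"""
    count = 1
    for k in range(4, n + 1):   # each added taxon can attach to 2k-5 branches
        count *= 2 * k - 5
    return count

num_unrooted_trees(10)   # 2,027,025 trees: exhaustive search is still feasible
num_unrooted_trees(50)   # roughly 10**74 trees: heuristics are the only option
```

Already at around 10 to 12 taxa exhaustive and branch-and-bound searches are near their practical limit, which is why every serious parsimony and likelihood program defaults to heuristics.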

Parsimony analysis is also a form of likelihood analysis. Well, I would certainly never claim, as some people do, that it comes without assumptions. I would say that parsimony has a model of evolution in the same sense as the word model is used across science, yes. I can also understand how and why people interpret parsimony as a model in the specific sense of likelihood phylogenetics and examine what that means for its behaviour and parameterisation compared to other models. But calling it a subset of likelihood analysis still leaves me a bit uncomfortable, because it does not use likelihood as a criterion but simply tree length. Maybe I am overlooking something, in fact most likely I am overlooking something, but to me the logic of the analysis seems to be rather different, for better or for worse.

One of the reasons why parsimony has fallen out of fashion is that "cladistics" is an emotional and controversial topic; this was illustrated with a caricature of Willi Hennig dressed up as a saint. I feel that this may conflate Hennig's phylogenetic systematics with parsimony analysis, in other words a principle of classification with an optimality criterion. Although the topic is indeed still hotly debated by a small minority, phylogenetic systematics is today state of the art, even as people have moved to using Bayesian methods to figure out whether a group is monophyletic or not.

The main reasons for the popularity of Bayesian methods are (a) that they allow more complex models and (b) that they are much faster than likelihood analyses. The second claim surprised me greatly because it does not at all reflect my personal experience. When I later discussed it with somebody at work, I realised that it depends greatly on what software we choose for comparison. I was thinking BEAST versus RAxML with fast bootstrapping, i.e. several days on a supercomputer versus less than an hour on my desktop. But if we compare MrBayes versus likelihood analysis in PAUP with thorough bootstrapping, well, suddenly I see where this comes from.

These days you can only get published if you use Bayesian methods. Again, that is not at all my experience. It seems to depend on the data, not least because huge genomic datasets can often not be processed with Bayesian approaches anyway. We can see likelihood trees of transcriptome data published in Nature, or ASTRAL trees in other prestigious journals. Definitely not Bayesian.

In summary, this was a great seminar to go to especially because I am planning some phylogenetics work over summer. It definitely got the old cogs turning again. Also, Prof. Bromham provided perhaps the clearest explanation I have ever heard of how Bayesian/MCMC analyses work, and that may become useful for when I have to discuss them with a student myself...

Sunday, June 12, 2016

Parsimony in phylogenetics again

Just some short observations:

A few days ago I learned that somebody has found my TNT script for the Templeton test useful and is not only using it but also improving on it. A few days before that I found that my post on using TNT made it into the acknowledgements of a publication. That is really nice to see; my expectation was never that this blog would be home to a lot of real-time discussion, but rather that people can find something useful to them even years after I posted it.

---

I checked that 'parsimonygate' hash tag again, and found a few interesting (or perhaps revealing) tweets. The first comments on one of the graphs from my surprisingly popular post on the popularity of phylogenetic programs over the years with a laconic "TNT is thin green". Now I have no idea what the tweeter meant with that. His profile clarifies that "re-tweet does not equal endorsement", so any comment could be about anything. But in the context of the parsimonygate hash tag, it could be read as an argumentum ad populum, on the lines of: see, hardly anybody uses parsimony these days, those guys are fringe.

That, however, would make little sense regardless of one's position on the silly parsimony versus model controversy. It would be much harder to figure out how often people use methods than how often they cite programs, but it should be obvious that many of the people citing PAUP, PHYLIP or MEGA have also used the parsimony methods implemented in those programs. TNT is just one of the parsimony programs out there, and it is unsurprising that it is not the most popular one, seeing how it uses an idiosyncratic data format and scripting language instead of the more widely used Nexus, Newick and Phylip formats.

The other notable tweets are the series of comments that appeared after the publication of a recent study comparing the performance of Bayesian and parsimony methods on discrete morphological characters. (This is of some interest to myself. My preference when faced with DNA sequence data is to use likelihood, but the results of using the Mk model on morphology generally seem nonsensical to me.) Samples:
Bayesian phylogenetic methods outperform parsimony in estimating phylogeny from discrete morphological data (link to paper)
and
Time to abandon all parsimony? (link to paper)
Wow, that paper must have really shown parsimony to have trouble! Let's look at the paper then:
Only minor differences are seen in the accuracy of phylogenetic topology reconstruction between the Bayesian implementation of the Mk-model and parsimony methods. Our findings both support and contradict elements of the results of Wright & Hillis [5] in that we can corroborate their observation, that the Mk-model outperforms equal-weights parsimony in accuracy, but the Mk-model achieves this at the expense of precision.
Oh.

Again, I cannot stress enough that I am a methods pragmatist who regularly uses both parsimony and model-based approaches. I also appreciate that there are indeed phylogeneticists who are irrationally opposed to model-based methods. But are these not examples of rather, shall we say, selective perception of what turned out to be a trade-off situation with minor differences either way?

Sunday, May 1, 2016

That editorial in BMC Evolutionary Biology

On Friday I looked at the website of the open access journal BMC Evolutionary Biology, after a colleague mentioned it as an option. Apart from the whopping article processing fee I noticed the little field "submitting a phylogenetic study? Please consult our editorial for guidance on the methodologies we consider to be of a suitable standard". That sounded interesting.

The editorial published in 2013 lists "common pipeline steps" as follows:
  1. Detecting homologs
  2. Multiple Sequence Alignment
  3. Quality control
  4. Model selection
Ah. What if one is not using a model-based approach? At that point I pressed ctrl + F and entered "parsimony" to see what they had to say on it. I found this:
Until the early nineties, parsimony and distance-based tree-building methods were preferred. More recently, probabilistic model-based methods, namely the maximum likelihood (ML) and the Bayesian approaches have grown to prominence due to their statistical properties and inferential powers. Moreover, these approaches go beyond simple phylogeny inference, providing a convenient statistical framework for further model selection and biological hypothesis testing. While parsimony is sometimes justified as model-free, it has mathematical properties and is not assumption-free; therefore explicit models should be generated for many biological problems. Likewise, distance-based methods may be unreliable for highly diverged data, yet they are often model-based and have nice mathematical properties and thus they may enable very fast and relatively accurate estimation of relevant biological parameters. Distance-based methods for tree reconstruction, such as neighbor joining, are extremely fast, and can provide reasonable solutions for extremely large data sets, something that would be much more computationally challenging with ML or Bayesian methods, even with recent computational advances.
Well, they say "many" biological problems instead of all, and maybe I am missing some nuances here - I am not a native speaker of English, after all is said and done - but to the best of my understanding this seems to say that BMC Evol Biol accepts any phylogenetic method except parsimony analysis.

I want to make perfectly clear that personally I have nothing against model based, statistical approaches. My first instinct when faced with a single, small DNA sequence alignment would be to run it through PhyML as packaged in my version of SeaView. For large supermatrices I use RAxML, and for smaller multi-gene datasets BEAST. For morphological datasets parsimony analysis in PAUP is my default approach, and for population genetic type data I would use distance methods. Really, I am a methods pragmatist and not irrationally attached to parsimony analysis as the proverbial hammer that makes everything look like a nail.

So that being out of the way, I have to say that I just do not see how the above section is anything but the mirror image of the much-maligned Cladistics editorial from earlier this year.

Thursday, September 17, 2015

The Templeton Test in parsimony analysis: part 1, principles

When we want to know whether a taxon, for example a genus, is monophyletic (and thus acceptable in a phylogenetic classification), the first thing to do is to infer a phylogeny. We may then find a topology like the following:


Genus A is non-monophyletic on this tree because B is sister to A2. However, the support value for clade A2 + B (the red number, 63 of 100) is not exactly stunning; if this is Bayesian Posterior Probability, you would want 95 or higher, and if it is bootstrap or jackknife you would still want to see at least 80 or so, preferably more. With so little support it could just as well be that these relationships are wrong and genus A is monophyletic after all.

(It is amazing, by the way, how many people find it hard to intuit that when discussing the status of A the red support value is indeed the relevant one. If it is high then precisely that number provides support for the non-monophyly of A while the black value is irrelevant - it merely shows that A1, A2 and B belong together but doesn't tell us anything about A versus B. Some time ago I even had a peer reviewer who got that wrong at first. Perhaps people are just so trained to look for how strong the evidence for monophyly is that they can get confused when they need to look for evidence for non-monophyly.)

So yes, the tree shows A as non-monophyletic, but we can't be sure if the evidence is strong enough. Is there another way of testing whether, let us say, A is "significantly" non-monophyletic?

This is where, for parsimony analysis, the Templeton Test comes in, which by the way doesn't have anything to do with the Templeton Foundation.
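In outline, the test compares the per-character step counts of two competing trees (say, the best tree overall against the best tree constrained to keep A monophyletic) with a Wilcoxon signed-rank test: if the constrained tree requires significantly more steps, non-monophyly is supported. The following minimal Python sketch of the underlying statistic is purely illustrative (the function name and input format are my own); in practice PAUP* computes the test for you:

```python
def templeton_statistic(steps_a, steps_b):
    """Wilcoxon signed-rank statistic underlying the Templeton test.

    steps_a, steps_b: parsimony step counts per character on the two
    competing trees (e.g. the best tree vs. the best tree constrained
    to keep the genus monophyletic). Characters with identical counts
    on both trees are dropped; the remaining absolute differences are
    ranked, ties receiving average ranks, and the statistic is the
    smaller of the positive- and negative-rank sums.
    """
    diffs = [a - b for a, b in zip(steps_a, steps_b) if a != b]
    ranked = sorted(abs(d) for d in diffs)

    def avg_rank(value):
        # average rank over all positions holding this absolute difference
        positions = [i + 1 for i, v in enumerate(ranked) if v == value]
        return sum(positions) / len(positions)

    w_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)
    w_minus = sum(avg_rank(abs(d)) for d in diffs if d < 0)
    return min(w_plus, w_minus), len(diffs)
```

The returned statistic is then compared against the Wilcoxon signed-rank distribution for the given number of non-zero differences to obtain a significance level.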

Sunday, August 2, 2015

Vicariance and parsimony in biogeography, continued

On and off I have read a few more of the papers from the late 1980s and 1990s dealing with vicariance biogeography and searching for ways to apply that (then still) awesome new idea of parsimony analysis to biogeography.

Friday, July 17, 2015

Parsimony Analysis of Endemism and similar

Quiiiite some time ago now I started a little series on the uses of parsimony in systematics, evolutionary biology and biogeography, and then kind of dropped the ball before coming to the biogeography part. Having recently read a few more papers on methods in biogeography, this seems like an opportune time to pick the thread up again.

Specifically, I came across an approach that was apparently very popular in the early noughties but then seems to have disappeared again: Parsimony Analysis of Endemicity (PAE; e.g. Nihei, 2006) and its variant Cladistic Analysis of Distributions and Endemism (CADE; e.g. Porzecanski & Cracraft, 2005).
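The mechanics of PAE are at least easy to sketch: areas are treated as terminals and species as binary presence/absence characters, with a hypothetical all-zero area added to root the tree, and the matrix is then analysed with ordinary parsimony. A minimal sketch in Python (the occurrence-record format and names are my own choices, not a standard):

```python
def pae_matrix(occurrences):
    """Area x species presence/absence matrix for a PAE analysis.

    occurrences: iterable of (species, area) records. Areas play the
    role of terminals and species the role of binary characters; a
    hypothetical all-zero area is added to root the tree, as usual
    in PAE.
    """
    occurrences = list(occurrences)
    species = sorted({s for s, _ in occurrences})
    areas = sorted({a for _, a in occurrences})
    present = set(occurrences)
    matrix = {'ROOT': '0' * len(species)}      # hypothetical all-absent area
    for area in areas:
        matrix[area] = ''.join('1' if (sp, area) in present else '0'
                               for sp in species)
    return matrix
```

The resulting 0/1 strings can then be pasted into a TNT or PAUP matrix and run like any morphological dataset.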

But before I consider if PAE is worth trying out, it would be interesting to know what it is supposed to be good for, and for that let's consider...

What is biogeography about?

This is actually not an easy question to answer. I mean, it is very simple for phylogenetics (inferring relationships between species) or taxonomy (naming and classifying groups of organisms), but certain other fields like ecology or evolutionary biology are much more complex, broad and fuzzy. And biogeography is one of them, at least in my eyes.

At a minimum, biogeography as a discipline seems to encompass all of the following:

Sunday, May 3, 2015

Ancestral character state reconstruction in Mesquite 3: parsimony versus likelihood

One of the more curious recent developments in my area is that some journals now make all reviewer reports available to all of the peer reviewers of a given manuscript. I like it because it allows me to get a better feeling for whether I have been too lenient or too critical, see other colleagues' style of making comments, and so on.

Very recently I have reviewed a manuscript, and just two days ago I saw what the second reviewer thought. Our recommendations turned out to be generally the same, but one sentence of theirs really annoyed me. When discussing ancestral character state reconstruction, they complained that all reconstructions in the present study were done "only" with parsimony.

Sunday, March 1, 2015

Parsimony analysis in TNT using the command line version

I guess I can just as well make it a habit to blog some advice whenever I have dealt with a recalcitrant piece of software. Today: Tree analysis using New Technology (TNT).

As I have mentioned before, there are four main ways of inferring phylogenetic trees of evolutionary relationships:
  • Distance/clustering analysis. This is not really a phylogenetic analysis in the strict sense, as it merely clusters terminals by their similarity; on the plus side, clustering is always extremely fast. There are several programs that can do it, including good old PAUP and MEGA.
  • Likelihood analysis. Simplifying a bit, one could say it searches for the tree with the best log likelihood score given a model of sequence evolution and the data. Again there are several programs available to do this kind of analysis, including PAUP, MEGA and PHYLIP. Calculating likelihood values across large phylogenetic trees is computationally intensive, so these analyses can take quite some time for larger datasets. This is why somebody wrote the software RAxML, which is designed to do complex likelihood searches with seemingly ridiculous speed by cutting a few corners.
  • Bayesian phylogenetics. This approach estimates the posterior probability of phylogenetic relationships with a Markov chain Monte Carlo (MCMC) method. Standard software packages for this are MrBayes and BEAST. If you want a quick answer, you are out of luck though, because MCMC always takes time.
  • Parsimony analysis. The logic here is to find the tree with the lowest number of character changes along the branches, under the assumption that, all else being equal, the simplest explanation is the best. It is often considered less sophisticated than the previous two approaches, but it comes with fewer assumptions; I like that I know where the computer has its hands, so to speak. Once more PAUP, MEGA and PHYLIP implement parsimony searches, but they are fairly slow for larger datasets.
This is where TNT comes in. Published in 2003 and made freeware through a subsidy of the Willi Hennig Society in 2007, TNT could be called the RAxML of parsimony analysis. It can take a fairly large dataset and finish the tree search before PAUP has got its shoes on. What is more, in addition to the already fast standard search it implements the innovative search strategies behind the New Technology part of its name, such as the Parsimony Ratchet. When you use these you will know what speed means!

Sadly, the program has a few downsides. First, its input and output formats are rather idiosyncratic. Second, only the Windows version has a GUI, so on Mac or Linux you will have to use the command line and scripting. Third, the documentation is unsystematic and unhelpful, making it very hard to figure out how to use the command line effectively. Actually, that is not quite true; the documentation on scripting per se seems to be okay, it is rather the simple standard analyses that aren't explained anywhere.

This is why I am writing this post. I have just done a simple analysis, and I want to spare others the same investment in time and frustration, and I want to be able to look up my own post in the future, especially should some time pass before I use TNT again.
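For reference, a minimal command-line session of the kind discussed here might look as follows; I am writing the command spellings down from memory of the TNT documentation, so treat the exact options as assumptions to double-check against your own TNT version:

```
mxram 200 ;
proc my_matrix.tnt ;
hold 10000 ;
mult ;
bbreak = tbr ;
nelsen * ;
tsave * my_trees.tre ;
save ;
tsave / ;
```

Here mxram increases the available memory, proc reads the matrix, hold sets the size of the tree buffer, mult runs the standard multi-replicate search, bbreak swaps the resulting trees with TBR, nelsen* computes the strict consensus, and the tsave/save/tsave/ sequence opens a tree file, writes the trees to it, and closes it again.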

Thursday, August 14, 2014

In praise of PAUP*

Hell is freezing over! Pigs are flying! PAUP* is getting updated for the first time in twelve years!

Jokes aside, this is great news. PAUP*, short for Phylogenetic Analysis Using Parsimony (* and other methods), is one of the best known software tools for phylogenetics. Indeed, to me it is pretty much the phylogenetic software tool. Yes, depending on the task at hand I also use TNT, RAxML, Mesquite, MrBayes and BEAST with its various add-ons, but PAUP* is the one I started out with while writing my thesis, and it is still the one I feel most comfortable using.

Another major issue is what you can and cannot do with the various programs. The downside of PAUP*, or at least of the previous version, is that it is comparatively slow. So if you have a large dataset with many taxa, you are better off using TNT for parsimony and RAxML for likelihood analyses. But PAUP* can do various kinds of analyses that no other software can do; for example, I would not know how to conduct a Templeton test without it.

(My experience with PHYLIP is limited. Maybe it can do some of the same things. The problem is that its combination of rather excessive modularity and a call-centre-style user interface - along the lines of "press 3 for this kind of analysis" - has put me off using it so far.)

So over the past few years I have sometimes worried about the day when PAUP* would suddenly stop working on the newest computers. It is good to know that a new version is coming up!

The idea is that ultimately there will be GUIs for Win and Mac that one has to buy, but that command line versions for Win, Mac and Linux will be free. I guess I will be happy to use command line myself, but it might be a good idea to get a GUI licence for small student projects where the student cannot necessarily be expected to learn the PAUP* commands.

Saturday, May 17, 2014

Matrix Representation Parsimony supertrees

Continuing, for the moment, my little series of posts on the use of parsimony methods in phylogenetics and biogeography, we come to the topic of supertrees.

Some phylogenetic studies deal with higher level groups. For example, one might see an evolutionary tree of the land vertebrates or of the land plants. But in those cases the sampling of the individual groups is very restricted, so that a whole family of mammals or a whole order of plants might be represented with only one terminal.

Other studies deal with more fine-scale relationships. For example, there are publications only on the phylogeny of one medium-sized genus of daisies or one genus of birds. In this case the species within the genera in question are well sampled (hopefully complete or nearly so), but obviously everything outside the study group is represented by only a few close relatives.

At some point one might want to put all of this information together to arrive at the complete tree of life or, perhaps less ambitiously, at a complete evolutionary tree of all birds or of all flowering plants. How can we take all these individual studies, all dealing with different species and often using very different types of data, and get one tree out of them?

There are two main approaches. It should be obvious that both necessarily require that there is some overlap between the various trees.
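One of them, the one that gives this post its title, codes every clade of every source tree as a binary character and analyses the resulting matrix with ordinary parsimony. The coding itself (usually attributed to Baum and Ragan) is simple enough to sketch in a few lines of Python; the input format here is my own choice for the illustration:

```python
def mrp_matrix(source_trees):
    """Matrix representation of a set of source trees, one binary
    character per clade.

    source_trees: list of (taxa, clades) pairs, where taxa is the set
    of terminals sampled in that study and clades is a list of subsets
    of those taxa. A taxon scores 1 for clades it belongs to, 0 for
    clades of trees it was sampled in but does not belong to, and ?
    for trees that did not sample it at all.
    """
    all_taxa = set()
    for taxa, _ in source_trees:
        all_taxa |= set(taxa)
    rows = {t: [] for t in all_taxa}
    for taxa, clades in source_trees:
        for clade in clades:
            for t in all_taxa:
                if t not in taxa:
                    rows[t].append('?')   # taxon missing from this study
                elif t in clade:
                    rows[t].append('1')
                else:
                    rows[t].append('0')
    return {t: ''.join(r) for t, r in rows.items()}
```

A parsimony analysis of this matrix then yields the supertree; the question marks are what allow studies with only partly overlapping taxon sets to be combined.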

Monday, April 21, 2014

Parsimony reconstruction of species trees from gene trees

Continuing my series on the uses of parsimony analysis in phylogenetics and biogeography, we come to the inference of species trees from gene trees.

I have written before about the problems with inferring species relationships directly from the relationships of genes sampled from those species. In short, healthy species contain genetic diversity, with potentially several different alleles for any given locus (this is, for example, how human eyes can be different colours). The same was true for all ancestral species in evolutionary history, and the two descendant species arising from a speciation event may each initially have inherited part of that genetic diversity.

Because any species has only limited space available for alleles, some of them will be lost in the descendant species, if only through the random process of genetic drift. However, it takes some time for this loss to happen, and so it is possible that by the next lineage split a gene may still carry alleles that diverged in the earlier ancestor. If that is the case, then the descendants may inherit a random selection of alleles that show different relationships to each other than the real species relationships, potentially misleading our phylogenetic inference.

For example, although we know from multiple lines of evidence (including most genetic data) that the chimpanzees are our closest living relatives, a minority of our genes is more closely related to those of the gorilla than to those of the chimpanzees. So if only one of those genes were sampled and all other evidence ignored, one might mistakenly infer that the gorillas are our sister species. And in some plant and animal groups mistakes like this can easily be made.

The solution is to sample more than one individual per species, to use more than one gene in the molecular analysis, and to use species tree methods. As indicated in my earlier post on species tree software, there is a parsimony approach to this issue. In fact there are two different ways of doing species tree parsimony, depending on what kind of gene trees we are dealing with.

Tuesday, April 8, 2014

Character optimisation in parsimony phylogenetics

As mentioned in my last post on parsimony analysis, there are different forms of parsimony that are used in the reconstruction of phylogenetic relationships. We could describe them as different ways of counting the number of character changes needed to explain a given phylogenetic tree.
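To make that concrete, here is a minimal sketch in Python using the Sankoff dynamic-programming formulation, which covers the different forms of parsimony simply by swapping the cost function: unordered (Fitch-style) parsimony charges one step for any change, while ordered (Wagner-style) parsimony charges by the distance between states. The toy tree and character are my own invention:

```python
def min_cost(tree, leaf_state, cost, states):
    """Sankoff dynamic programming for one character on a rooted binary tree.

    tree: nested 2-tuples of leaf names; leaf_state: dict leaf -> state;
    cost(a, b): price of changing state a to b along a branch;
    states: the full tuple of allowed states.
    Returns a dict: minimum subtree cost for each possible root state.
    """
    if isinstance(tree, str):  # a leaf: only its observed state is free
        return {s: (0 if s == leaf_state[tree] else float('inf'))
                for s in states}
    left = min_cost(tree[0], leaf_state, cost, states)
    right = min_cost(tree[1], leaf_state, cost, states)
    return {s: min(left[t] + cost(s, t) for t in states)
             + min(right[t] + cost(s, t) for t in states)
            for s in states}

# One three-state character on the tree ((A,B),(C,D)):
states = (0, 1, 2)
tree = (('A', 'B'), ('C', 'D'))
observed = {'A': 0, 'B': 2, 'C': 2, 'D': 2}

fitch_cost = lambda a, b: 0 if a == b else 1   # unordered: any change = 1 step
wagner_cost = lambda a, b: abs(a - b)          # ordered: 0 -> 2 passes through 1

fitch_length = min(min_cost(tree, observed, fitch_cost, states).values())   # 1
wagner_length = min(min_cost(tree, observed, wagner_cost, states).values()) # 2
```

The same data thus cost one step under unordered parsimony but two under ordered parsimony, which is exactly the kind of difference between the forms that this post is about.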

Tuesday, April 1, 2014

Phylogenetic analysis using the parsimony criterion

One of the simplest ways to reconstruct the phylogenetic relationships between different organisms is parsimony analysis. As explained in the previous post of this series, the principle as applied to tree inference is very straightforward: compare possible solutions by counting the number of events in each and accept the solution that needs the smallest such number.

Now what are the events, and how does that work in practice?
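As a toy preview of where this is going, the criterion itself can be spelled out in a few lines of code: score each candidate tree by its total number of character changes (counted here with Fitch's algorithm for unordered characters) and keep the shortest one. The function names and the little four-taxon dataset are my own:

```python
def fitch_steps(tree, states):
    """Number of changes one unordered character needs on a nested-tuple tree."""
    def post(node):
        if isinstance(node, str):          # leaf: its observed state, no steps
            return {states[node]}, 0
        (ls, ln), (rs, rn) = post(node[0]), post(node[1])
        shared = ls & rs
        if shared:                         # children agree: no extra change
            return shared, ln + rn
        return ls | rs, ln + rn + 1        # children disagree: one more change
    return post(tree)[1]

def shortest_tree(trees, characters):
    """The parsimony criterion: keep the candidate tree that needs the
    fewest character changes in total."""
    lengths = {t: sum(fitch_steps(t, ch) for ch in characters) for t in trees}
    best = min(lengths, key=lengths.get)
    return best, lengths[best]

# The three possible unrooted quartets for taxa A-D, written as pairs of cherries:
quartets = [(('A', 'B'), ('C', 'D')),
            (('A', 'C'), ('B', 'D')),
            (('A', 'D'), ('B', 'C'))]
characters = [{'A': 0, 'B': 0, 'C': 1, 'D': 1},
              {'A': 0, 'B': 0, 'C': 1, 'D': 0}]
```

With these two characters the first quartet needs two changes in total and the other two need three each, so the parsimony criterion picks ((A,B),(C,D)).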

Tuesday, March 25, 2014

Parsimony in phylogenetics, evolutionary biology, and biogeography

The first thing most people think of when hearing 'science' is hypothesis testing. However, many a philosopher of science will be quick to point out that there is much more to science than that. It would be too much to say that science has "moved beyond" falsificationism, because falsifying hypotheses still plays a, and perhaps the, central role in empirical research and probably always will. But there are many other tools, such as modelling.

Another indispensable tool of the scientist, but one that is rarely mentioned, is the principle of parsimony, also known as Occam's Razor. It is the principle of accepting, all else being equal, the explanation that is simplest. This approach hardly needs a theoretical justification; we only have to think up a few everyday scenarios to see that it makes sense, and that everybody unconsciously uses it all the time.