Friday, April 28, 2017

Arguments for paraphyletic taxa: orchid taxonomy edition

As usual, the following is my personal opinion and not necessarily the official stance of any person or institution that I am affiliated with or related to, and so on.

One of the recurrent topics of this blog is the controversy around the acceptance of paraphyletic taxa. Although I have become a bit jaded over the years, my original stance was, and to a certain degree still is, that I am trying to understand the reasoning offered by colleagues who think that paraphyletic taxa are acceptable or even unavoidable. Because, who knows?, there may be a novel argument that shows cladism to be misguided after all, and I want to keep an open mind.

Sadly, however, it is mostly the same few talking points that lost the discussion in the 1970s and 1980s, resurfacing again and again. It is rare, although not unheard of, that a new and truly interesting argument is presented.

Today's candidate paper freshly online is
Baranow et al. 2017. Brasolia, a new genus highlighted from Sobralia (Orchidaceae). Plant Systematics and Evolution. DOI 10.1007/s00606-017-1413-z
The authors present phylogenetic analyses and change the classification of the titular orchid genus. The only point of interest for present purposes is that they argue for the recognition of Sobralia section Sobralia at the genus level despite that group being paraphyletic, and in what follows I do not want to imply any criticism of any other part of the publication or of the hard work the authors have put into their study. It is only the theory of classification that I would like to hash out.

The argumentation in favour of paraphyletic taxa runs across three paragraphs in the discussion section. Let's see if I can learn something new!
In the light of phylogenetic outcomes, the proposed taxon is paraphyletic, which means that its species have a common ancestor, but the taxon does not include all its descendants (e.g., Elleanthus).
Polyphyletic taxa also have a common ancestor, so by the reasoning implied here one could justify any classification whatsoever. I am consequently unsure what the point of this first sentence is.
Monophyly in its broader definition describes groups with a common ancestry, including both paraphyletic and monophyletic groups (sensu Hennig 1950); therefore, Hörandl and Stuessy (2010) advocate returning to this broader definition of monophyly and, adopting Ashlock's term, holophyly for monophyly s.str.
Again I am afraid I must be missing the point. The controversy is really about whether we should consistently classify by relatedness or not. I don't mean to be uncharitable, but this could potentially be taken to mean the authors hope that recognising non-monophyletic taxa would become more palatable to mainstream systematists if one could hoodwink them into forgetting what monophyletic means. It would then be equivalent to hoping that your child will accept a mountain hike instead of the promised trip to the beach if you just said "mountains are also a kind of beach" with enough conviction. Nice try, but there will still be no swimming in the ocean, and little Tommy sees right through it.
Paraphyly is a natural transition stage in the evolution of taxa (Hörandl and Stuessy 2010). According to Brummitt (2002), paraphyletic taxa are ''products of the evolutionary process, which is left behind as evolution moves on to a new level of organization.''
The logic of these quotations appears to be as follows: "We really, really want to recognise paraphyletic taxa. So we draw a paraphyletic taxon onto the phylogenetic tree. Look, cladist, there is a paraphyletic taxon in the evolutionary process! Why are you so unreasonable not to accept it?" Unfortunately, circular reasoning does not become more convincing just because it has been published somewhere and can now be cited.

To clarify, there are no paraphyletic taxa out there in nature; there is only a tree of life, and phylogenetic systematists consistently circumscribe taxa on that tree to be monophyletic, while 'evolutionary' taxonomists circumscribe some taxa on that tree to be paraphyletic.
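
The distinction can be made concrete with a small sketch. The tree below is a toy topology with placeholder tip names (S1 to S3 for the group in question, E1 and E2 for a genus nested inside it); it is an assumption for illustration and not the phylogeny from the paper. A group is monophyletic exactly when the smallest clade containing all its members contains nothing else.

```python
# Toy rooted tree as nested tuples; tip names are placeholders, not real data.
TREE = ("S1", ("S2", ("S3", ("E1", "E2"))))

def tips(node):
    """Return the set of tip names below a node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(tips(child) for child in node))

def mrca_tips(node, group):
    """Return the tips of the smallest clade that contains every member of group."""
    if not isinstance(node, str):
        for child in node:
            if group <= tips(child):
                return mrca_tips(child, group)
    return tips(node)

def excluded_descendants(tree, group):
    """Tips descended from the group's common ancestor but left out of the group.

    An empty result means the group is monophyletic on this tree.
    """
    group = set(group)
    return mrca_tips(tree, group) - group

print(excluded_descendants(TREE, {"S1", "S2", "S3"}))              # E1 and E2 are left out
print(excluded_descendants(TREE, {"S1", "S2", "S3", "E1", "E2"}))  # nothing is left out
```

On this toy tree the S group alone leaves out E1 and E2, whereas the S group plus the nested E group leaves out nothing. The tree is the same in both cases; only the circumscription drawn onto it differs.
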
We realize that this is in conflict with commonly accepted phylogenetic methods which declare that monophyly s.str. should be the only criterion for grouping organisms.
A "phylogenetic method" is what produced the orchid phylogeny, so I assume what is meant here is "approach to classification". But whatever, that is not the point, so onwards.
However, a somewhat analogical situation has been recognized within Coelogyne (Gravendeel et al. 2001). In this case, the authors interpreted the morphology of the studied species as not corresponding to the cladograms, probably as a result of convergent evolution and they decided to maintain polyphyletic Coelogyne. Kolanowska and Szlachetko (2016) postulate to maintain paraphyletic Odontoglossum.
This appears to be an instance of the argumentum ad populum, and not even very much populum at that. Consider: is it a good idea to shoot a stapler into your own foot? Okay, so there will have been at least two people in the history of humanity who have done that, so you could now cite them for support. But does that make shooting a stapler into your foot any more sensible? Exactly; a better argument is needed here.

Also, as I only realised some time after first drafting this, the senior author of the present paper is the same as in one of those two references. So this is apparently also an instance of the rarely seen ipse dixit. (It is, of course, valid to cite one's own prior research results, but in this case we are dealing not with an empirical question but simply with the argument that an action is acceptable because it is not unprecedented.)
Recognition of distinctive characters which have evolved in a group is essential for an understanding its evolution (Brummitt 2006).
Quite the opposite, in my eyes: having an accurate classification is essential for understanding evolution, because paraphyletic taxa mislead us about relationships. In the present case, treating Elleanthus as a subgroup of Sobralia would (correctly) show that Elleanthus evolved out of Sobralia, whereas treating Sobralia and Elleanthus as separate genera implies (wrongly) that they are evolutionarily distinct units, side by side.
This point of view is shared by numerous other authors (Sosef 1997; Dias et al. 2005; Nordal and Stedje 2005) who state that traditional classification is the optimal tool for cataloging biodiversity and requires recognition of paraphyletic taxa.
This reads like more argumentum ad populum, and sadly it is left unmentioned why paraphyletic taxa are supposedly required.
We decided to follow the Darwinian (evolutionary) classification, which requires consideration of two criteria: similarity and common descent.
Leaving aside the obvious argument from name-checking here, which is exactly as relevant as using Newton to reject Einstein (and for the same reasons), the problem remains that trying to classify by two criteria at the same time will lead to a useless classification that does not reliably reflect either.

Assume I have never heard of Sobralia before, and then it is mentioned to me for the first time. Given a phylogenetic classification, I know that it constitutes a natural group whose members are each other's closest relatives. Given a classification as argued for in the present paper, it could be a natural group... but it could also be a group defined by similarity that includes species more closely related to another genus than to any other species of Sobralia. I just won't know.
The approach will allow us to propose a classification based on the phylogenetic relationships, but at the same time it will be practical--with clearly defined and recognizable units.
No, sorry to say so, but it quite simply will not. First, it will not be based on phylogenetic relationships, because in one crucial instance phylogenetic relationships will be ignored. Second, and again, it will not be practical, because if two criteria are mixed the end user cannot know without going back to the original publications whether a given group was circumscribed based on relatedness or based on 'similarity', see above.

Now obviously I understand that this is not a theory paper arguing for a wholesale shift in our practice of classification. What is more, I know we cannot expect all solutions to be easy or all groups immediately to be circumscribed as monophyletic the moment somebody looks at them. I can happily accept a paper concluding "we know this group is probably paraphyletic, but for the moment we don't have a better solution, let's wait until more data are in", or "the group is clearly polyphyletic, but at this moment we do not want to make hasty taxonomic changes", or something along those lines.

But the three paragraphs quoted above were specifically meant to justify the ultimate recognition of paraphyletic genera, so one would expect to find a convincing justification. Sadly I, personally, have to admit to being anti-convinced by this paper, which, as previously mentioned, I take to mean that an argument had the effect of making me even more convinced of the idea it was meant to refute, in this case classification by relatedness.

Sunday, April 23, 2017

The unexpected dangers of rerooting phylogenies

A couple of days ago a colleague circulated the following recently published paper,
Czech L, Huerta-Cepas J, Stamatakis A, 2017. A critical review on the use of support values in tree viewers and bioinformatics toolkits. Molecular Biology and Evolution. DOI: 10.1093/molbev/msx055
The authors found something that, in retrospect, seems glaringly obvious. Phylogenetic trees are nearly always saved in the Newick format of nested brackets, for example as follows:
In this case we are dealing with a rooted tree of only three taxa. A is sister to a clade of B and C. The numbers after the colons indicate branch lengths, and the 99 directly after the brackets is a support value, most likely bootstrap, for the sister group relationship (B,C).

The problem explored by Czech et al. is ultimately that under the Newick format branch support values or other branch annotations are not actually attached to branches; they are attached to nodes. In this case, for example, the 99 is attached to the node that is the hypothetical common ancestor of B and C. Logically, because the tree is rooted we can assume that the support value is meant for the branch leading down from the ancestor of B and C towards the root.
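
To illustrate the point about nodes, here is a minimal sketch of a Newick reader. The tree string is a hypothetical one made up to match the description above (A sister to a clade of B and C, with 99 after the closing bracket); its branch lengths are invented. The parser is deliberately simple: it assumes no quoted names or comments and a terminating semicolon.

```python
import re

PUNCTUATION = {"(", ")", ",", ":", ";"}

class Node:
    def __init__(self):
        self.children = []
        self.label = None    # tip name, or whatever was written after ")"
        self.length = None   # length of the branch leading towards the root

def parse_newick(text):
    """Parse a simple Newick string into nested Node objects."""
    tokens = re.findall(r"[(),:;]|[^(),:;\s]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        node = Node()
        if tokens[pos] == "(":
            pos += 1                               # consume "("
            node.children.append(parse_node())
            while tokens[pos] == ",":
                pos += 1                           # consume ","
                node.children.append(parse_node())
            pos += 1                               # consume ")"
        if tokens[pos] not in PUNCTUATION:
            node.label = tokens[pos]               # name or annotation
            pos += 1
        if tokens[pos] == ":":
            node.length = float(tokens[pos + 1])
            pos += 2
        return node

    return parse_node()

# Hypothetical example string; the branch lengths are made up.
root = parse_newick("(A:0.2,(B:0.1,C:0.1)99:0.1);")
inner = root.children[1]
print(inner.label, [child.label for child in inner.children])
```

The format itself does not say whether the 99 is a name for the ancestor of B and C or a support value for the branch below it. The parser can only store it on the node, in the same slot as a tip name, and everything else is interpretation by whatever software reads the tree next.
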

But what if we reroot a tree with node annotations that are really meant to be branch annotations a posteriori? (My post on the various options for rooting phylogenies can be found here.) Czech et al. found that the behaviour of these values is undefined. For some software they were able to demonstrate that the branch annotation ended up on the wrong branch after rerooting.
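
One way this can go wrong is sketched below. The five-taxon tree, its support values and the function names are all hypothetical, and the "naive" variant is only my illustration of node-based bookkeeping, not the code of any particular program. The tree is stored as `(((A,B)95,C)70,D,E);` with a pseudo-root `r`, and is then rerooted at the ancestor of A and B.

```python
# Child -> parent map for the hypothetical tree (((A,B)95,C)70,D,E);
PARENT = {"A": "ab", "B": "ab", "ab": "abc", "C": "abc",
          "abc": "r", "D": "r", "E": "r"}
NODE_LABEL = {"ab": "95", "abc": "70"}   # annotations as stored on nodes

def reroot(parent, new_root):
    """Return a new child -> parent map with all edges oriented towards new_root."""
    neighbours = {}
    for child, par in parent.items():
        neighbours.setdefault(child, set()).add(par)
        neighbours.setdefault(par, set()).add(child)
    new_parent, seen, stack = {}, {new_root}, [new_root]
    while stack:
        node = stack.pop()
        for other in neighbours[node]:
            if other not in seen:
                seen.add(other)
                new_parent[other] = node
                stack.append(other)
    return new_parent

def clade(parent, node):
    """Set of tips below node under the given orientation."""
    kids = [c for c, p in parent.items() if p == node]
    if not kids:
        return {node}
    return set().union(*(clade(parent, k) for k in kids))

def naive_labels(parent, labels, new_root):
    """Leave every annotation on its node and only re-orient the edges."""
    new_parent = reroot(parent, new_root)
    return {frozenset(clade(new_parent, n)): v
            for n, v in labels.items() if n in new_parent}

def edge_aware_labels(parent, labels, new_root):
    """Treat each annotation as belonging to the edge towards the old parent."""
    edge_label = {frozenset((n, parent[n])): v for n, v in labels.items()}
    new_parent = reroot(parent, new_root)
    return {frozenset(clade(new_parent, child)): edge_label[frozenset((child, par))]
            for child, par in new_parent.items()
            if frozenset((child, par)) in edge_label}

print(naive_labels(PARENT, NODE_LABEL, "ab"))       # clade C,D,E reported with 70
print(edge_aware_labels(PARENT, NODE_LABEL, "ab"))  # clade C,D,E with 95, clade D,E with 70
```

In the stored tree the value 95 belongs to the split between A, B and everything else. After rerooting, that same split is the branch subtending the clade of C, D and E, yet the naive bookkeeping reports 70 for it and loses the 95 altogether. Only the two nodes on the path between the old and the new root are affected; annotations elsewhere in the tree would stay where they belong.
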

How serious an issue is that? I guess it depends on what one's practice is. The problem should be pretty much limited to analyses producing unrooted trees (e.g. in RAxML, PAUP or MrBayes) under the assumption of reversibility, where the user then uses outgroup rooting to polarise the tree a posteriori. Any analysis using a clock model would avoid it, as would asymmetric step-matrices or, crucially, those analyses specifying the outgroup before the start of the analysis.

In addition, it seems as if the problem would be limited to a few branches between the pseudo-root used to save unrooted trees and the new root after rerooting, so that most relationships should be fine. I may look at one or two of my published phylogenies to see if I ever had that problem, but I am not worried; in the most recent case where support values were a critical part of my argumentation, for example, they are fairly deep inside the tree, because we sampled widely around the ingroup, and I also used Templeton tests and suchlike to demonstrate the non-monophyly of certain taxa.

Apparently Czech et al. have already achieved some success at getting software providers to make changes that will help solve the confusion around where the branch annotations end up. But nonetheless my main take-home from this is to be less blasé about a posteriori rooting. In the future I will make sure to define an outgroup whenever I set up a PAUP or RAxML run, so that the need to reroot does not arise.

Thursday, April 20, 2017

Botany picture #242: Gentianella muelleriana

Gentianella muelleriana (Gentianaceae) as seen today on the ascent to Mount Stillwell, Kosciuszko National Park, New South Wales. One of the few plants still in flower this late in the season.

In the European Alps, gentians are, of course, generally blue and rarely yellow, but here white seems to be the preferred colour.

Friday, April 14, 2017

Back from Queensland

Unfortunately I was unable to transfer the pictures I had taken to a computer until I got back home, so here are the ones I want to put on the blog all in one post. We drove west from Brisbane to Chinchilla with a major stop along the way, had a day trip north to the vicinity of Wandoan, spent half a day around Chinchilla and Kogan the following day, and then returned to Brisbane.

Rainforest of Boombana in D'Aguilar National Park just west of Brisbane.

A fern climbing up a liana that climbs up a tree trunk.

Not many daisy species like rainforests, but this one does: Acomis acoma (Asteraceae). It was the reason for our detour into D'Aguilar. Admittedly it is not found in the darkest and wettest parts.

View from Jolly's lookout, still in D'Aguilar National Park.

In the Chinchilla area ecologists showed us several field sites and conservation management actions. Near Wandoan we happened to see this population of treelets with rather impressive fruits. Still need to figure this species out; we suspected it may be a native Australian lemon (Citrus, Rutaceae). But I have not seen one of those before, only other Rutaceae genera.

We learned more about what is clearly the most problematic weed in the area, buffel grass (Cenchrus ciliaris, Poaceae). As seen in the picture it forms clumps that suppress a lot of other vegetation but are not dense enough to avoid soil erosion from the gaps between individual plants - the worst of both worlds! It also accumulates litter causing very intense bush fires in a local habitat (dry rainforest and vine thicket) whose key species are not fire-adapted. On the other hand, we were told that farmers liked buffel grass due to its drought resistance and high food value for stock.

One of the species the trip was about is this phyllodinous wattle, Acacia wardellii (Fabaceae). Although currently not in flower it is quite attractive due to its straight growth and strikingly white stem. It is locally common after disturbance but has a very restricted range.

Near Kogan we were shown this site, which I found particularly interesting. The habitat is on a ridge with very poor, rocky, shallow soil, and features species that are very localised to those conditions.

Scattered across the ground was Brunoniella (Acanthaceae). I worked on a genus of the Acanthaceae family for my Diplom thesis (roughly equivalent to honours), so that brought back nice memories. However, while my study group then were large shrubs, this species is herbaceous and in fact seems to remain fairly small. I assume it spends most of its life as dormant root-stock underground and then sends these little shoots up if there has been enough rain to be worth the while.

Monday, April 10, 2017

Back to Queensland

Another trip to south-eastern Queensland, only for a few days this time.

First, the most disappointing window seat I have ever had on a flight. It is not even clear to me why this segment was the only one without a window, and only on my side :-)

The skyline of Brisbane as seen from the cultural district.

The Queensland Herbarium, which is located at the Botanic Gardens. I am very grateful to Ailsa Holland and Tony Bean for the kindness they showed us during our visit today.

Friday, April 7, 2017

Parsimony versus models for morphological data: a recent paper

I have written on this blog before about the use of likelihood or Bayesian phylogenetics for morphological data. In our journal club this week we discussed another of the small but growing number of recent papers arguing that parsimony should be dropped in favour of model-based analyses even for morphology:
Puttick et al. 2017. Uncertain-tree: discriminating among competing approaches to the phylogenetic analysis of phenotype data. Proceedings of the Royal Society B: Biological Sciences 284. DOI 10.1098/rspb.2016.2290
Puttick et al. constructed maximally balanced and unbalanced phylogenies, simulated sequence data for them under the HKY + G model of nucleotide substitution, turned the data matrices into binary and presumably unordered multistate integer characters, and then used equal weights parsimony, implied weights parsimony, and Bayesian and likelihood analyses under the Mk model to try and get the phylogenies back with an eye on accuracy (correctness) and tree resolution. In a second approach, they reanalysed previously published morphological datasets to see what happened to controversial taxon placement under the different approaches.

One of the problems with simulation studies is always that they can come out as kind of circular: if you simulate data under a model it is no surprise that the same model would perform best when trying to infer the input into the simulations. In this case Puttick et al. were admirably circumspect in that not only did they simulate their data under a different model (HKY + G) than that ultimately used in phylogenetic analysis (Mk), but they also repeated the analyses until they had achieved a distribution of homoplasy that mirrored the one found in empirical datasets. This is important because morphology datasets for parsimony analysis are scored to minimise homoplasy, while uncritically simulating matrices may lead to much higher levels of homoplasy, thus putting parsimony at a disadvantage.

Still, it should be observed that the HKY + G model is nonetheless unlikely to have produced data that are a realistic representation of morphological datasets, especially considering that the latter would at a minimum also include multistate characters with ordered states. Also, from a cladist's perspective homoplasy in a morphological dataset is a character scoring error waiting to be corrected in a subsequent analysis. But well, of course using zero homoplasy datasets would also have been unrealistic because real life datasets do have homoplasy in them. (And of course parsimony would "win" all the time if there were zero homoplasy, pretty much by definition.)

Now what are the results? To simplify, Bayesian was best at getting the tree topology right, followed by equal weights parsimony and implied weights parsimony, with likelihood coming in last. Likelihood always produces fully resolved trees, and Bayesian produces the least resolved ones. The authors argue, as Bayesians would, that this is exactly how it should be, as it simply tells us that the data aren't strong enough; the other approaches may give us false confidence. (Although of course parsimony and likelihood analyses can likewise involve several different ways of quantifying support or confidence.)

In conclusion, Puttick et al. make the following recommendations:

First, Bayesian inference should be the preferred approach.

Second, future morphological datasets should be scored with model-based approaches in mind. This means that the number of characters should be maximised by including homoplasious ones, because that will allow a better estimate of rates. As this is the exact opposite of the scoring strategy that parsimony analysis requires, habits will be hard to change.

What is more, I have to smile at Puttick et al.'s expectations here: they simulated data matrices of 100, 350 and 1,000 characters. Maybe you can get 400 or so for some animals (if the fossils are well enough preserved), but for any plant group I have worked on I would struggle to get 30. And wouldn't you know it, the single empirical botanical dataset they re-analysed had only 48.

Third, researchers should lower their expectations and get used to living with unresolved relationships, as Bayesian analysis produces less resolved phylogenies.

Our discussion of the paper was wide-ranging. When I commented that one of the advantages of traditional parsimony software is that it easily allows the implementation of any step matrix that is needed (imagine a character where state 0 can change into states 1, 2 or 3, but 1-3 cannot change into each other) I was informed that that is in fact possible in BEAST. That is a pleasant surprise, as I had assumed that it was limited to setting a few simple models such as standard Mk for unordered states, nothing more. However, those who have written XML files for BEAST may want to consider if that is "easy" compared with writing a Nexus file for PAUP. Personally I find BEAST input files very hard to understand.
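
For readers who have not met step matrices, the following is a minimal sketch of how such a character is counted under parsimony, using Sankoff's dynamic programming approach. The tree and tip states are made up. The matrix encodes the character imagined above, with one additional assumption of mine: reversals from states 1-3 back to 0 are also forbidden, which the description does not specify. Forbidden changes get infinite cost.

```python
import math

INF = math.inf

# COST[i][j] = steps for a change from ancestral state i to descendant state j.
# State 0 may change into 1, 2 or 3; every other change is forbidden.
# Forbidding reversals to 0 is an assumption made for this sketch.
STAR = [
    [0,   1,   1,   1],
    [INF, 0,   INF, INF],
    [INF, INF, 0,   INF],
    [INF, INF, INF, 0],
]
# Ordinary unordered character for comparison: any change costs one step.
UNORDERED = [[0 if i == j else 1 for j in range(4)] for i in range(4)]

def sankoff(node, tip_states, cost):
    """Minimum cost of the subtree below node, for each possible state at node."""
    n = len(cost)
    if isinstance(node, str):   # tip: only the observed state is allowed
        return [0 if s == tip_states[node] else INF for s in range(n)]
    totals = [0] * n
    for child in node:
        below = sankoff(child, tip_states, cost)
        for s in range(n):
            totals[s] += min(cost[s][t] + below[t] for t in range(n))
    return totals

def tree_length(tree, tip_states, cost):
    """Number of steps the character requires on the tree."""
    return min(sankoff(tree, tip_states, cost))

# Hypothetical four-taxon tree and character states.
TREE = (("A", "B"), ("C", "D"))
STATES = {"A": 1, "B": 2, "C": 1, "D": 2}

print(tree_length(TREE, STATES, UNORDERED))
print(tree_length(TREE, STATES, STAR))
```

With these made-up data the unordered coding needs two steps, whereas the step matrix forces every ancestor to be in state 0 and therefore needs four. The same character scored for the same taxa can thus favour different trees depending on which transitions are allowed.
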

Another concern was that while nucleotide substitution models are based on a fairly good understanding of what can happen to DNA nucleotides which, after all, have a limited number of states and transitions between those states, it is considerably less clear what the most appropriate model for any given morphological character is.

What is more, somebody pointed out that there are essentially two options in a model-based analysis: either the likelihood of state transitions is fixed, which is a difficult decision to make, or it is estimated during the analysis. But in the latter case the probability of, for example, changing the number of petals would be influenced by the probability of shifting between opposite and alternate leaf arrangement. And clearly that idea is immediately nonsensical.

In summary, the drumbeat of papers on the lines of "we are the Bayesians; you will be assimilated; resistance is futile" is not going to stop any time soon. I use Bayesian and likelihood analyses all the time for molecular data, no problem. But I am still not convinced that the Mk model would be my go-to approach the next time I have to deal with morphological data. It seems to me that it is much easier to justify one's model selection in the case of DNA than in the case of, say, flower colour or leaf length; that the idea of setting one model and estimating gamma across totally incomparable traits is odd; and that I would hardly ever have enough characters for Bayesian analysis to produce more than a large polytomy.

But I guess all that depends on the study group. I can imagine there would be morphometric data for some groups of organisms for which stochastic models work quite well.

Tuesday, April 4, 2017


There is so much science spam these days that a message has to be particularly remarkable to even register; mostly I just mark such messages as junk or report them without even thinking about it. But this one is a beauty.

Let's count the ways:
  1. The message uses four different text colours (counting the links), several different font types, and more different font sizes than anybody in their right mind could consider tasteful.
  2. The title - International Journal of Humanities and Social Science Invention - is likely among the top five most convoluted titles I have ever seen, and given the competition that is saying something.
  3. The title does not make any sense either, but I guess that goes without saying.
  4. The spammer did not even write their script to personalise the message. At least other spammers have it insert the name of the recipient, but this one merely reads "dear author/researcher". Lazy.
  5. The first sentence randomly capitalises "international journal" and is poorly written.
  6. The second sentence claims the journal is indexed in "major indexing" (major indexing what?) and then lists four names none of which I have ever heard of. So whatever they are, they are certainly not "major".
  7. "IJHSSI follows the rapid publication process." So there is a rapid publication process, just one?
  8. Like many other spammers, this one sets arbitrary paper submission deadlines, presumably to create a sense of urgency. Why would a journal, which by definition publishes regular issues, ever do that?
  9. The sentence in bold and red is ungrammatical.
  10. The spammer does not even bother to invent a name for their imaginary editor-in-chief IJHSSI. Remember Robest Pual Ashcraft? That was fun. But no, here we only get a generic title.
  11. Note that there is very conspicuously no mention of the article processing fees in this message.
I think this is another, ahem, "journal" that I will pass on.