Sunday, April 23, 2017

The unexpected dangers of rerooting phylogenies

A couple of days ago a colleague circulated the following recently published paper,
Czech L, Huerta-Cepas J, Stamatakis A, 2017. A critical review on the use of support values in tree viewers and bioinformatics toolkits. Molecular Biology and Evolution. DOI: 10.1093/molbev/msx055
The authors found something that, in retrospect, seems glaringly obvious. Phylogenetic trees are nearly always saved in the Newick format of nested brackets, for example as follows:
(A:2,(B:1,C:2)99:1);
In this case we are dealing with a rooted tree of only three taxa. A is sister to a clade of B and C. The numbers after the colons indicate branch lengths, and the 99 directly after the brackets is a support value, most likely bootstrap, for the sister group relationship (B,C).

The problem explored by Czech et al. is ultimately that under the Newick format branch support values or other branch annotations are not actually attached to branches; they are attached to nodes. In this case, for example, the 99 is attached to the node that is the hypothetical common ancestor of B and C. Logically, because the tree is rooted we can assume that the support value is meant for the branch leading down from the ancestor of B and C towards the root.
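To make the point concrete, here is a minimal sketch of a Newick reader in Python. It is not taken from Czech et al. or from any existing toolkit; the Node class and the parse_newick function are hypothetical names invented for this illustration, and the parser assumes a simple string without quoted labels or comments. All it is meant to show is that the format gives the 99 no place to live other than on the node for (B,C).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a parsed tree (hypothetical helper, for illustration only)."""
    label: str = ""                 # taxon name, or whatever follows a ')'
    length: Optional[float] = None  # length of the branch towards the parent
    children: List["Node"] = field(default_factory=list)


def parse_newick(text: str) -> Node:
    """Parse a simple Newick string (assumes no quoted labels, no comments)."""
    text = text.strip()
    pos = 0

    def read_token(stop: str) -> str:
        # Read characters until one of the stop characters (or the end).
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stop:
            pos += 1
        return text[start:pos]

    def parse_node() -> Node:
        nonlocal pos
        node = Node()
        if text[pos] == "(":
            pos += 1  # skip '('
            node.children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip ','
                node.children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ')'
        # Whatever follows is stored on the node itself, support values included.
        node.label = read_token(",():;")
        if pos < len(text) and text[pos] == ":":
            pos += 1  # skip ':'
            node.length = float(read_token(",();"))
        return node

    return parse_node()


tree = parse_newick("(A:2,(B:1,C:2)99:1);")
clade_bc = tree.children[1]
print(clade_bc.label, clade_bc.length)  # prints: 99 1.0
```

The support value and the branch length both end up as attributes of the (B,C) node; that they describe the branch below that node is a convention, not something the file states.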

But what if we reroot, a posteriori, a tree whose node annotations are really meant to be branch annotations? (My post on the various options for rooting phylogenies can be found here.) Czech et al. found that the behaviour of these values is undefined. For some software they were able to demonstrate that the branch annotation ended up on the wrong branch after rerooting.
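The following Python sketch shows how such a mix-up can arise. It is a toy illustration under my own assumptions, not a reproduction of what any particular program does: trees are written as nested (label, children) tuples, the input is an unrooted tree saved with a trifurcating pseudo-root, and the function names build_graph, reroot and to_newick are made up for this example. The "nodes" mode leaves each label on the node it was read from, while the "edges" mode first moves each label onto the branch between the node and its original parent and only then reroots.

```python
from itertools import count


def build_graph(tree):
    """Turn nested (label, children) tuples into an undirected graph."""
    ids = count()
    adj, name, node_label, edge_label = {}, {}, {}, {}

    def add(sub, parent):
        nid = next(ids)
        adj[nid] = []
        if parent is not None:
            adj[nid].append(parent)
            adj[parent].append(nid)
        if isinstance(sub, str):
            name[nid] = sub  # leaf: remember the taxon name
        else:
            label, children = sub
            node_label[nid] = label
            if parent is not None:
                # Convention: the label describes the branch to the parent.
                edge_label[frozenset((nid, parent))] = label
            for child in children:
                add(child, nid)

    add(tree, None)
    return adj, name, node_label, edge_label


def reroot(tree, outgroup, labels_on="edges"):
    """Reroot on the branch leading to a single outgroup taxon."""
    adj, name, node_label, edge_label = build_graph(tree)
    leaf = next(n for n, taxon in name.items() if taxon == outgroup)

    def walk(node, parent):
        if node in name:
            return name[node]
        children = [walk(n, node) for n in adj[node] if n != parent]
        if labels_on == "edges":
            label = edge_label.get(frozenset((node, parent)), "")
        else:  # "nodes": the label stays wherever it was read
            label = node_label.get(node, "")
        return (label, children)

    return ("", [name[leaf], walk(adj[leaf][0], leaf)])


def to_newick(tree):
    if isinstance(tree, str):
        return tree
    label, children = tree
    return "(" + ",".join(to_newick(c) for c in children) + ")" + label


# Unrooted tree with pseudo-root: (A,B,(C,(D,E)80)70);
# 70 supports the split AB|CDE, 80 supports the split ABC|DE.
unrooted = ("", ["A", "B", ("70", ["C", ("80", ["D", "E"])])])

print(to_newick(reroot(unrooted, "E", "edges")) + ";")
print(to_newick(reroot(unrooted, "E", "nodes")) + ";")
```

With the labels carried on the branches, the clade (A,B) gets 70 and the clade ((A,B),C) gets 80, which matches the original splits AB|CDE and ABC|DE. With the labels left on the nodes, the 70 lands on the branch separating A, B and C from D and E, whose support was actually 80, and the 80 ends up on the branch next to the new root. Note that only the branches between the old pseudo-root and the new root are affected.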

How serious an issue is that? I guess it depends on what one's practice is. The problem should be pretty much limited to analyses producing unrooted trees (e.g. in RAxML, PAUP or MrBayes) under the assumption of reversibility, where the user then uses outgroup rooting to polarise the tree a posteriori. Any analysis using a clock model would avoid it, as would asymmetric step-matrices or, crucially, those analyses specifying the outgroup before the start of the analysis.

In addition, it seems as if the problem would be limited to a few branches between the pseudo-root used to save unrooted trees and the new root after rerooting, so that most relationships should be fine. I may look at one or two of my published phylogenies to see if I ever had that problem, but I am not worried; in the most recent case where support values were a critical part of my argumentation, for example, they are fairly deep inside the tree, because we sampled widely around the ingroup, and I also used Templeton tests and suchlike to demonstrate the non-monophyly of certain taxa.

Apparently Czech et al. have already achieved some success at getting software providers to make changes that will help solve the confusion around where the branch annotations end up. But nonetheless my main take-home from this is to be less blasé about a posteriori rooting. In the future I will make sure to always define an outgroup when I set up a PAUP or RAxML run, so that the need to reroot does not arise.

Thursday, April 20, 2017

Botany picture #242: Gentianella muelleriana


Gentianella muelleriana (Gentianaceae) as seen today on the ascent to Mount Stillwell, Kosciuszko National Park, New South Wales. One of the few plants still in flower this late in the season.

In the European Alps, gentians are, of course, generally blue and rarely yellow, but here white seems to be the preferred colour.

Friday, April 14, 2017

Back from Queensland

Unfortunately I was unable to transfer the pictures I had taken to a computer until I got back home, so here are the ones I want to put on the blog all in one post. We drove west from Brisbane to Chinchilla with a major stop along the way, had a day trip north to the vicinity of Wandoan, spent half a day around Chinchilla and Kogan the following day, and then returned to Brisbane.


Rainforest of Boombana in D'Aguilar National Park just west of Brisbane.


A fern climbing up a liana that climbs up a tree trunk.


Not many daisy species like rainforests, but this one does: Acomis acoma (Asteraceae). It was the reason for our detour into D'Aguilar. Admittedly it is not found in the darkest and wettest parts.


View from Jolly's lookout, still in D'Aguilar National Park.


In the Chinchilla area ecologists showed us several field sites and conservation management actions. Near Wandoan we happened to see this population of treelets with rather impressive fruits. Still need to figure this species out; we suspected it may be a native Australian lemon (Citrus, Rutaceae). But I have not seen one of those before, only other Rutaceae genera.


We learned more about what is clearly the most problematic weed in the area, buffel grass (Cenchrus ciliaris, Poaceae). As seen in the picture it forms clumps that suppress a lot of other vegetation but are not dense enough to avoid soil erosion from the gaps between individual plants - the worst of both worlds! It also accumulates litter causing very intense bush fires in a local habitat (dry rainforest and vine thicket) whose key species are not fire-adapted. On the other hand, we were told that farmers liked buffel grass due to its drought resistance and high food value for stock.


One of the species the trip was about is this phyllodinous wattle, Acacia wardellii (Fabaceae). Although currently not in flower it is quite attractive due to its straight growth and strikingly white stem. It is locally common after disturbance but has a very restricted range.


Near Kogan we were shown this site, which I found particularly interesting. The habitat is on a ridge with very poor, rocky, shallow soil, and features species that are very localised to those conditions.


Scattered across the ground was Brunoniella (Acanthaceae). I worked on a genus of the Acanthaceae family for my Diplom thesis (roughly equivalent to honours), so that brought back nice memories. However, while my study group then were large shrubs, this species is herbaceous and in fact seems to remain fairly small. I assume it spends most of its life as dormant root-stock underground and then sends these little shoots up if there has been enough rain to be worth the while.

Monday, April 10, 2017

Back to Queensland

Another trip to south-eastern Queensland, only for a few days this time.


First, the most disappointing window seat I have ever had on a flight. It is not even clear to me why this segment was the only one without a window, and only on my side :-)


The skyline of Brisbane as seen from the cultural district.


The Queensland Herbarium, which is located at the Botanic Gardens. I am very grateful to Ailsa Holland and Tony Bean for the kindness they showed us during our visit today.

Friday, April 7, 2017

Parsimony versus models for morphological data: a recent paper

I have written on this blog before about the use of likelihood or Bayesian phylogenetics for morphological data. In our journal club this week we discussed another of the small but growing number of recent papers arguing that parsimony should be dropped in favour of model-based analyses even for morphology:
Puttick et al., 2017. Uncertain-tree: discriminating among competing approaches to the phylogenetic analysis of phenotype data. Proceedings of the Royal Society B: Biological Sciences 284, doi 10.1098/rspb.2016.2290
Puttick et al. constructed maximally balanced and unbalanced phylogenies, simulated sequence data for them under the HKY + G model of nucleotide substitution, turned the data matrices into binary and presumably unordered multistate integer characters, and then used equal weights parsimony, implied weights parsimony, and Bayesian and likelihood analyses under the Mk model to try and get the phylogenies back with an eye on accuracy (correctness) and tree resolution. In a second approach, they reanalysed previously published morphological datasets to see what happened to controversial taxon placement under the different approaches.

One of the problems with simulation studies is always that they can come out as kind of circular: if you simulate data under a model it is no surprise that the same model would perform best when trying to infer the input into the simulations. In this case Puttick et al. were admirably circumspect in that not only did they simulate their data under a different model (HKY + G) than that ultimately used in phylogenetic analysis (Mk), but they also repeated the analyses until they had achieved a distribution of homoplasy that mirrored the one found in empirical datasets. This is important because morphology datasets for parsimony analysis are scored to minimise homoplasy, while uncritically simulating matrices may lead to much higher levels of homoplasy, thus putting parsimony at a disadvantage.

Still, it should be observed that the HKY + G model is nonetheless unlikely to have produced data that are a realistic representation of morphological datasets, especially considering that the latter would at a minimum also include multistate characters with ordered states. Also, from a cladist's perspective homoplasy in a morphological dataset is a character scoring error waiting to be corrected in a subsequent analysis. But well, of course using zero homoplasy datasets would also have been unrealistic because real life datasets do have homoplasy in them. (And of course parsimony would "win" all the time if there was zero homoplasy, pretty much by definition.)

Now what are the results? To simplify, Bayesian was best at getting the tree topology right, followed by equal weights parsimony and implied weights parsimony, with likelihood coming in last. Likelihood always produces fully resolved trees, and Bayesian produces the least resolved ones. The authors argue, as Bayesians would, that this is exactly how it should be, as it simply tells us that the data aren't strong enough; the other approaches may give us false confidence. (Although of course parsimony and likelihood analyses can likewise involve several different ways of quantifying support or confidence.)

In conclusion, Puttick et al. make the following recommendations:

First, Bayesian inference should be the preferred approach.

Second, future morphological datasets should be scored with model-based approaches in mind. This means that the number of characters should be maximised by including homoplasious ones, because that will allow a better estimate of rates. As this is the exact opposite of the scoring strategy that parsimony analysis requires, habits will be hard to change.

What is more, I have to smile at Puttick et al.'s expectations here: they simulated data matrices of 100, 350 and 1,000 characters. Maybe you can get 400 or so for some animals (if the fossils are well enough preserved), but for any plant group I have worked on I would struggle to get 30. And wouldn't you know it, the single empirical botanical dataset they re-analysed had only 48.

Third, researchers should lower their expectations and get used to living with unresolved relationships, as Bayesian analysis produces less resolved phylogenies.

Our discussion of the paper was wide-ranging. When I commented that one of the advantages of traditional parsimony software is that it easily allows the implementation of any step matrix that is needed (imagine a character where state 0 can change into states 1, 2 or 3, but 1-3 cannot change into each other) I was informed that that is in fact possible in BEAST. That is a pleasant surprise, as I had assumed that it was limited to setting a few simple models such as standard Mk for unordered states, nothing more. However, those who have written XML files for BEAST may want to consider if that is "easy" compared with writing a Nexus file for PAUP. Personally I find BEAST input files very hard to understand.
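For readers who have not met step matrices, here is a small Python sketch of the character described above, scored with Sankoff-style generalised parsimony. It is an illustration under my own assumptions and says nothing about how PAUP or BEAST implement this internally: the function name sankoff, the four-taxon tree and the tip states are all made up, and I have assumed that changes from states 1-3 back to 0 are forbidden as well, which the description above leaves open.

```python
import math

INF = math.inf  # cost of a forbidden change


def sankoff(tree, states, cost):
    """Minimum total cost of one character on a rooted tree.

    tree   -- nested tuples of taxon names, e.g. (("A", "B"), ("C", "D"))
    states -- dict mapping taxon name to observed state (0 .. k-1)
    cost   -- k x k step matrix, cost[i][j] = cost of changing i into j
    """
    k = len(cost)

    def vector(sub):
        # For each possible state at this node, the cheapest cost of
        # explaining all tips below it.
        if isinstance(sub, str):
            return [0 if s == states[sub] else INF for s in range(k)]
        child_vectors = [vector(child) for child in sub]
        return [
            sum(min(cost[s][t] + cv[t] for t in range(k)) for cv in child_vectors)
            for s in range(k)
        ]

    return min(vector(tree))


# Standard unordered character: every change costs one step.
unordered = [[0 if i == j else 1 for j in range(4)] for i in range(4)]

# Step matrix: 0 can change into 1, 2 or 3; nothing else is allowed.
# (Assumption: reversals to 0 are forbidden too.)
star = [
    [0,   1,   1,   1],
    [INF, 0,   INF, INF],
    [INF, INF, 0,   INF],
    [INF, INF, INF, 0],
]

tree = (("A", "B"), ("C", "D"))
tip_states = {"A": 1, "B": 2, "C": 1, "D": 2}

print(sankoff(tree, tip_states, unordered))  # prints: 2
print(sankoff(tree, tip_states, star))       # prints: 4
```

Under the unordered matrix two steps suffice, because state 1 can change into state 2 directly. Under the step matrix every tip has to derive its state independently from an ancestral 0, so the same character costs four steps.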

Another concern was that while nucleotide substitution models are based on a fairly good understanding of what can happen to DNA nucleotides which, after all, have a limited number of states and transitions between those states, it is considerably less clear what the most appropriate model for any given morphological character is.

What is more, somebody pointed out that there are essentially two options in a model based analysis: either the likelihood of state transitions is fixed, which is a difficult decision to make, or it is estimated during the analysis. But in the latter case the probability of, for example, changing the number of petals would be influenced by the probability of shifting between opposite and alternate leaf arrangement. And clearly that idea is immediately nonsensical.

In summary, the drumbeat of papers on the lines of "we are the Bayesians; you will be assimilated; resistance is futile" is not going to stop any time soon. I use Bayesian and likelihood analyses all the time for molecular data, no problem. But I am still not convinced that the Mk model would be my go-to approach the next time I have to deal with morphological data. It seems to me that it is much easier to justify one's model selection in the case of DNA than in the case of, say, flower colour or leaf length; that the idea of setting one model and estimating gamma across totally incomparable traits is odd; and that I would hardly ever have enough characters for Bayesian analysis to produce more than a large polytomy.

But I guess all that depends on the study group. I can imagine there would be morphometric data for some groups of organisms for which stochastic models work quite well.

Tuesday, April 4, 2017

IJHSSIOMGWTFBBQ

There is so much science spam these days that a message has to be particularly remarkable to even register; mostly I just mark them as junk or report them without even thinking about it. But this one is a beauty.


Let's count the ways:
  1. The message uses four different text colours (counting the links), several different font types, and more different font sizes than anybody in their right mind could consider tasteful.
  2. The title - International Journal of Humanities and Social Science Invention - is likely among the top five most convoluted titles I have ever seen, and given the competition that is saying something.
  3. The title does not make any sense either, but I guess that goes without saying.
  4. The spammer did not even write their script to personalise the message. At least other spammers have it insert the name of the recipient, but this one merely reads "dear author/researcher". Lazy.
  5. The first sentence randomly capitalises "international journal" and is poorly written.
  6. The second sentence claims the journal is indexed in "major indexing" (major indexing what?) and then lists four names none of which I have ever heard of. So whatever they are, they are certainly not "major".
  7. "IJHSSI follows the rapid publication process." So there is a rapid publication process, just one?
  8. Like many other spammers, this one sets arbitrary paper submission deadlines, presumably to create a sense of urgency. Why would a journal, which by definition publishes regular issues, ever do that?
  9. The sentence in bold and red is ungrammatical.
  10. The spammer does not even bother to invent a name for their imaginary editor-in-chief IJHSSI. Remember Robest Pual Ashcraft? That was fun. But no, here we only get a generic title.
  11. Note that there is very conspicuously no mention of the article processing fees in this message.
I think this is another, ahem, "journal" that I will pass on.

Sunday, April 2, 2017

The taxonomic impediment as illustrated by journals' criteria for the acceptance of manuscripts

About two weeks ago I learned from a co-author, who in that case is the corresponding author, that a certain systematic botany journal would consider our manuscript unacceptable no matter how much we improved it simply because it was out of scope. You see, our work was only "revisionary", as in dealing with species delimitation, and it would have to be a phylogenetic study to be acceptable. A few thoughts:

I do understand why higher-profile systematics journals do not accept descriptions of taxonomic novelties that take a qualitative approach like "hey, that looks different to that other species", or papers that merely validate taxonomic changes based on evidence presented elsewhere. But I completely fail to understand what the problem is with papers that, as in our case, use integrative, quantitative analyses of morphological, genetic and environmental data to resolve difficult species complexes. I would love to understand how a phylogenetic study is more serious than that. The conservation impact is, for example, much higher in studies finding a previously unrecognised, rare species than in those that only change the circumscription of a genus.

The journal in question is TAXON. Think about it: a journal literally called "taxon" has decided to accept no more taxonomic studies going forward. No word on when Evolution will stop accepting studies dealing with evolutionary biology, or when Heredity will reject all manuscripts dealing with genetics.

Note also that TAXON is still the go-to journal for nomenclatural suggestions in botany. In the latest issue as of writing, for example, we find Brownsey & Perrie, "Proposal to conserve the name Asplenium richardii with a conserved type" and Dorr & Gulledge, "Request for a binding decision on whether Briquetastrum Robyns & Lebrun (Lamiaceae) and Briquetiastrum Bovini (Malvaceae) are sufficiently alike to be confused". Those papers are important and need a forum, and it is good that TAXON is that forum. But the same is true for revisionary studies, and I cannot help but feel that in terms of editorial policy accepting nomenclatural suggestions like these but not evidence-based revisionary studies is the equivalent of saying, "we don't serve alcohol to minors, but we make an exception if you are under six months old."

The general problem is that there are quite a few systematics journals that have made the same decision over the last few years. I have thought about what journals there are in my field, and I cannot at the moment think of one with an impact factor of more than approximately one that would still accept revisionary studies. Most of the options are local journals published by university or state herbaria, usually named after a 19th century taxonomist or a plant genus, that either do not have an IF or have one of around 0.3-0.7. As valuable as those outlets are for publishing new species or smaller taxonomic revisions, they just do not seem to be the right venue or to have the right audience for a two-year study using complex analyses of genomic data. Surely if we have molecular phylogenetics journals with IFs of 2 to 5 it should be possible to have journals in that range that publish what might be called molecular taxonomy? If not, why not?

If we do not have journals like that, if the only option for a researcher doing species delimitation with cutting edge, expensive methods is to publish in journals that a job or promotion committee might consider to be a liability to publish in, then it is no wonder that fewer and fewer people will be willing to figure out how many and what species there are on our planet, and that those who are willing to do it will find it hard to get a job in academia. That is known as the taxonomic impediment: There are still many species to be discovered before we are even in a position to know what we need to conserve, but the number of people, institutions and resources assigned to that task is dwindling.

Which brings me to the final point. A year and a half ago I wrote about a study published in Systematic Biology that claimed to have disproved (!) the citation impediment to taxonomy. The authors actually mentioned the non-acceptance of taxonomic papers by high impact journals as one of the arguments underlying the citation impediment, but then argued the latter does not exist. As I wrote at the time, my interpretation of their paper is that they reached their conclusion based on defining phylogenetic studies that happen to include a taxonomic act as taxonomic papers, and then comparing them against phylogenetic studies that do not include a taxonomic act. For example, they had the Botanical Journal of the Linnean Society in their data, which by then had officially not been accepting taxonomic papers for several years. In other words, the study's approach seems to have been the equivalent of examining discrimination against women by comparing men who grow a beard with men who do not grow a beard.

In the light of my recent experience, that paper now seems even more upsetting.

Saturday, April 1, 2017

People don't understand the value of biodiversity collections

An American university's decision to eliminate its natural history collection to make room for, no joke!, a running track is currently making the news. Apparently, if no other institution takes it by July it will be destroyed; and of course other institutions are likely operating under tight budgets and have no space to accommodate millions of additional specimens at short notice.

To expand on what I commented at another website:

Collection specimens are the basis of research because whenever scientists present data - morphology, anatomy, cytology, chemistry, DNA - they need to refer to the specimen ("voucher") they got them from, and that specimen needs to be deposited at an accessible, curated collection, so that the research is reproducible. I am not talking Arabidopsis, zebra fish or fruit flies here, but if somebody is doing work on non-model organisms serious journals will not publish a paper unless each data point is vouchered.

Collection specimens are the basis of research because more and more of them are databased, resulting in large databases such as GBIF or ALA, which are then used by species distribution modellers, biogeographers, conservation scientists etc. to conduct spatial studies that would have been unthinkable even just 20 years ago. And who knows what people will come up with in another 20 years? Think about it: millions and millions of data points saying "this individual was found at this time of the year in this location so and so many years ago, and according to this expert it belonged to this species". This is an invaluable resource for research.

Collections are, of course, our only access to specimens from the past. I have seen a talk by a researcher who used insect specimens collected over decades to study how pesticide resistance evolved and spread in a population, hoping to gain knowledge that will be useful for pest management in the future. Without broadly and deeply sampled natural history collections such research would be impossible.

Collections are also our only access to specimens of species that have since gone extinct. Just yesterday I handled two specimens of a plant that was last collected in the 19th century and is presumed extinct; but with modern techniques you could now study its genome! Again, who knows what other things we can do with 150 year old herbarium specimens in fifty years, things that we would not have expected to be possible?

Finally, collection specimens represent a massive investment. Even while acknowledging that they are not really replaceable because you will never again be able to collect in 1859 or from an area that is now covered in apartment blocks, natural history collections can be valued based on how much it would cost to replace them, in the sense of collecting the same number of specimens again. This includes work hours, fuel and other transport costs, equipment, specimen processing, databasing, and much more. People should look at that number and realise that this is the value that they have the responsibility to safeguard. It is not only part of our cultural heritage, it is also an investment that should not be thrown away merely to make room for a sports facility.

And make no mistake, the number that comes out of such a valuation is always going to be in "holy s***, no way" territory even for a small university museum, the kind of number that will make the institution's accountants break out in a cold sweat. What is more, the specimens do not depreciate - they only become more valuable over time, because, again, you can perhaps go back and replace a specimen that was collected five years ago in the forest next door but not one that was collected two hundred years ago where the forest has since been turned into pasture.

As I have written before, I am constantly astonished that people would even so much as consider destroying a biodiversity collection, not least because the same people would not do the same to a humanities collection. Seriously, can you imagine what would happen if they said, "if you can't find somebody else to take it, we will throw all our Rembrandt and Dali paintings into the trash" or "either find a new building, or our collection of bronze age artifacts goes to landfill"?