Sunday, April 23, 2017

The unexpected dangers of rerooting phylogenies

A couple of days ago a colleague circulated the following recently published paper,
Czech L, Huerta-Cepas J, Stamatakis A, 2017. A critical review on the use of support values in tree viewers and bioinformatics toolkits. Molecular Biology and Evolution. DOI: 10.1093/molbev/msx055
The authors found something that, in retrospect, seems glaringly obvious. Phylogenetic trees are nearly always saved in the Newick format of nested brackets, for example as follows:
(A:2,(B:1,C:2)99:1);
In this case we are dealing with a rooted tree of only three taxa. A is sister to a clade of B and C. The numbers after the colons indicate branch lengths, and the 99 directly after the brackets is a support value, most likely bootstrap, for the sister group relationship (B,C).

The problem explored by Czech et al. is ultimately that under the Newick format branch support values or other branch annotations are not actually attached to branches; they are attached to nodes. In this case, for example, the 99 is attached to the node that is the hypothetical common ancestor of B and C. Logically, because the tree is rooted we can assume that the support value is meant for the branch leading down from the ancestor of B and C towards the root.

But what if we reroot, a posteriori, a tree with node annotations that are really meant to be branch annotations? (My post on the various options for rooting phylogenies can be found here.) Czech et al. found that the behaviour of these values is undefined. For some software they were able to demonstrate that the branch annotation ended up on the wrong branch after rerooting.
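
To make the mechanism concrete, here is a minimal Python sketch of my own; it is not code from Czech et al. or from any tree viewer. It assumes a five-taxon tree that would be written as (((A,B)90,C)70,D,E); in Newick, stored as a child-to-parent map. The internal node names u, v, r and new_root are made up for the illustration. The tree is rerooted on the branch leading to A, once leaving the support values on their original nodes and once moving them along with the branches they describe.

```python
# Toy tree (((A,B)90,C)70,D,E); as a child -> parent map.
# The internal node names u, v, r are hypothetical labels for this sketch.
parent = {"A": "u", "B": "u", "u": "v", "C": "v", "v": "r", "D": "r", "E": "r"}
support = {"u": 90, "v": 70}  # Newick stores branch support on the child node


def reroot(parent, outgroup):
    """Place a new root on the branch above `outgroup`; return map and reversed path."""
    new_parent = dict(parent)
    path = []  # nodes from the outgroup's old parent up to the old root
    node = parent[outgroup]
    while node is not None:
        path.append(node)
        node = parent.get(node)  # the old root has no parent, so this ends the loop
    new_parent[outgroup] = "new_root"
    new_parent[path[0]] = "new_root"
    for child, old_parent in zip(path, path[1:]):
        new_parent[old_parent] = child  # this branch now points the other way
    return new_parent, path


def move_labels_with_branches(node_label, path):
    """Shift each label on the reversed path to the node now below its branch."""
    fixed = {n: s for n, s in node_label.items() if n not in path}
    for child, old_parent in zip(path, path[1:]):
        if child in node_label:
            fixed[old_parent] = node_label[child]
    return fixed


def leaves_under(parent, node):
    """Set of tip names descending from `node`."""
    children = [c for c, p in parent.items() if p == node]
    if not children:
        return {node}
    return set().union(*(leaves_under(parent, c) for c in children))


new_parent, path = reroot(parent, "A")
naive = support  # labels simply stay on the nodes they were read from
fixed = move_labels_with_branches(support, path)

for name, labels in (("naive", naive), ("branch-aware", fixed)):
    for node, value in sorted(labels.items()):
        print(name, sorted(leaves_under(new_parent, node)), value)
```

In the original tree the 90 belongs to the branch separating A and B from C, D and E, and the 70 to the branch separating A, B and C from D and E. After rerooting on A, the branch-aware version accordingly reports 90 for the clade (C,D,E) and 70 for (D,E), whereas the naive version reports 70 for (C,D,E), leaves (D,E) without a value, and puts the 90 on the branch next to the new root.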

How serious an issue is that? I guess it depends on what one's practice is. The problem should be pretty much limited to analyses producing unrooted trees (e.g. in RAxML, PAUP or MrBayes) under the assumption of reversibility, where the user then uses outgroup rooting to polarise the tree a posteriori. Any analysis using a clock model would avoid it, as would asymmetric step-matrices or, crucially, those analyses specifying the outgroup before the start of the analysis.

In addition, it seems as if the problem would be limited to a few branches between the pseudo-root used to save unrooted trees and the new root after rerooting, so that most relationships should be fine. I may look at one or two of my published phylogenies to see if I ever had that problem, but I am not worried; in the most recent case where support values were a critical part of my argumentation, for example, they are fairly deep inside the tree, because we sampled widely around the ingroup, and I also used Templeton tests and suchlike to demonstrate the non-monophyly of certain taxa.

Apparently Czech et al. have already achieved some success at getting software providers to make changes that will help solve the confusion around where the branch annotations end up. But nonetheless my main take-home from this is to be less blasé about a posteriori rooting. In the future I will make sure to always define an outgroup already when I set up a PAUP or RAxML run, so that the need to reroot does not arise.

Thursday, April 20, 2017

Botany picture #242: Gentianella muelleriana


Gentianella muelleriana (Gentianaceae) as seen today on the ascent to Mount Stillwell, Kosciuszko National Park, New South Wales. One of the few plants still in flower this late in the season.

In the European Alps, gentians are, of course, generally blue and rarely yellow, but here white seems to be the preferred colour.

Friday, April 14, 2017

Back from Queensland

Unfortunately I was unable to transfer the pictures I had taken to a computer until I got back home, so here are the ones I want to put on the blog all in one post. We drove west from Brisbane to Chinchilla with a major stop along the way, had a day trip north to the vicinity of Wandoan, spent half a day around Chinchilla and Kogan the following day, and then returned to Brisbane.


Rainforest of Boombana in D'Aguilar National Park just west of Brisbane.


A fern climbing up a liana that climbs up a tree trunk.


Not many daisy species like rainforests, but this one does: Acomis acoma (Asteraceae). It was the reason for our detour into D'Aguilar. Admittedly it is not found in the darkest and wettest parts.


View from Jolly's lookout, still in D'Aguilar National Park.


In the Chinchilla area ecologists showed us several field sites and conservation management actions. Near Wandoan we happened to see this population of treelets with rather impressive fruits. Still need to figure this species out; we suspected it may be a native Australian lemon (Citrus, Rutaceae). But I have not seen one of those before, only other Rutaceae genera.


We learned more about what is clearly the most problematic weed in the area, buffel grass (Cenchrus ciliaris, Poaceae). As seen in the picture it forms clumps that suppress a lot of other vegetation but are not dense enough to avoid soil erosion from the gaps between individual plants - the worst of both worlds! It also accumulates litter causing very intense bush fires in a local habitat (dry rainforest and vine thicket) whose key species are not fire-adapted. On the other hand, we were told that farmers liked buffel grass due to its drought resistance and high food value for stock.


One of the species the trip was about is this phyllodinous wattle, Acacia wardellii (Fabaceae). Although currently not in flower it is quite attractive due to its straight growth and strikingly white stem. It is locally common after disturbance but has a very restricted range.


Near Kogan we were shown this site, which I found particularly interesting. The habitat is on a ridge with very poor, rocky, shallow soil, and features species that are very localised to those conditions.


Scattered across the ground was Brunoniella (Acanthaceae). I worked on a genus of the Acanthaceae family for my Diplom thesis (roughly equivalent to honours), so that brought back nice memories. However, while my study group then were large shrubs, this species is herbaceous and in fact seems to remain fairly small. I assume it spends most of its life as dormant root-stock underground and then sends these little shoots up if there has been enough rain to be worth the while.

Monday, April 10, 2017

Back to Queensland

Another trip to south-eastern Queensland, only for a few days this time.


First, the most disappointing window seat I have ever had on a flight. It is not even clear to me why this segment was the only one without a window, and only on my side :-)


The skyline of Brisbane as seen from the cultural district.


The Queensland Herbarium, which is located at the Botanic Gardens. I am very grateful to Ailsa Holland and Tony Bean for the kindness they showed us during our visit today.

Friday, April 7, 2017

Parsimony versus models for morphological data: a recent paper

I have written on this blog before about the use of likelihood or Bayesian phylogenetics for morphological data. In our journal club this week we discussed another of the small but growing number of recent papers arguing that parsimony should be dropped in favour of model-based analyses even for morphology:
Puttick et al., 2017. Uncertain-tree: discriminating among competing approaches to the phylogenetic analysis of phenotype data. Proceedings of the Royal Society B: Biological Sciences 284, doi 10.1098/rspb.2016.2290
Puttick et al. constructed maximally balanced and unbalanced phylogenies, simulated sequence data for them under the HKY + G model of nucleotide substitution, turned the data matrices into binary and presumably unordered multistate integer characters, and then used equal weights parsimony, implied weights parsimony, and Bayesian and likelihood analyses under the Mk model to try and get the phylogenies back with an eye on accuracy (correctness) and tree resolution. In a second approach, they reanalysed previously published morphological datasets to see what happened to controversial taxon placement under the different approaches.

One of the problems with simulation studies is always that they can come out as kind of circular: if you simulate data under a model it is no surprise that the same model would perform best when trying to infer the input into the simulations. In this case Puttick et al. were admirably circumspect in that not only did they simulate their data under a different model (HKY + G) than that ultimately used in phylogenetic analysis (Mk), but they also repeated the analyses until they had achieved a distribution of homoplasy that mirrored the one found in empirical datasets. This is important because morphology datasets for parsimony analysis are scored to minimise homoplasy, while uncritically simulating matrices may lead to much higher levels of homoplasy, thus putting parsimony at a disadvantage.

Still, it should be observed that the HKY + G model is nonetheless unlikely to have produced data that are a realistic representation of morphological datasets, especially considering that the latter would at a minimum also include multistate characters with ordered states. Also, from a cladist's perspective homoplasy in a morphological dataset is a character scoring error waiting to be corrected in a subsequent analysis. But well, of course using zero homoplasy datasets would also have been unrealistic because real life datasets do have homoplasy in them. (And of course parsimony would "win" all the time if there was zero homoplasy, pretty much by definition.)

Now what are the results? To simplify, Bayesian was best at getting the tree topology right, followed by equal weights parsimony and implied weights parsimony, with likelihood coming in last. Likelihood always produces fully resolved trees, and Bayesian produces the least resolved ones. The authors argue, as Bayesians would, that this is exactly how it should be, as it simply tells us that the data aren't strong enough; the other approaches may give us false confidence. (Although of course parsimony and likelihood analyses can likewise involve several different ways of quantifying support or confidence.)

In conclusion, Puttick et al. make the following recommendations:

First, Bayesian inference should be the preferred approach.

Second, future morphological datasets should be scored with model-based approaches in mind. This means that the number of characters should be maximised by including homoplasious ones, because that will allow a better estimate of rates. As this is the exact opposite of the scoring strategy that parsimony analysis requires, habits will be hard to change.

What is more, I have to smile at Puttick et al.'s expectations here: they simulated data matrices of 100, 350 and 1,000 characters. Maybe you can get 400 or so for some animals (if the fossils are well enough preserved), but for any plant group I have worked on I would struggle to get 30. And wouldn't you know it, the single empirical botanical dataset they re-analysed had only 48.

Third, researchers should lower their expectations and get used to living with unresolved relationships, as Bayesian analysis produces less resolved phylogenies.

Our discussion of the paper was wide-ranging. When I commented that one of the advantages of traditional parsimony software is that it easily allows the implementation of any step matrix that is needed (imagine a character where state 0 can change into states 1, 2 or 3, but 1-3 cannot change into each other) I was informed that that is in fact possible in BEAST. That is a pleasant surprise, as I had assumed that it was limited to setting a few simple models such as standard Mk for unordered states, nothing more. However, those who have written XML files for BEAST may want to consider if that is "easy" compared with writing a Nexus file for PAUP. Personally I find BEAST input files very hard to understand.
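
For readers who have not used step matrices, here is a rough Python sketch of the example character under Sankoff (generalised) parsimony. It is my own illustration and not how PAUP or BEAST implement it. One assumption goes beyond what is described above: reversals from states 1-3 back to 0 are given a cost of one step. The toy tree and its tip states are likewise invented.

```python
from math import inf

N_STATES = 4

# STEP[s][t] = cost of changing from ancestral state s to descendant state t.
# State 0 may change into 1, 2 or 3; states 1-3 may not change into each other.
# Assumption of this sketch: reversal to state 0 costs one step.
STEP = [
    [0, 1, 1, 1],
    [1, 0, inf, inf],
    [1, inf, 0, inf],
    [1, inf, inf, 0],
]

# Standard unordered character for comparison: every change costs one step.
UNORDERED = [[0 if s == t else 1 for t in range(N_STATES)] for s in range(N_STATES)]


def sankoff_vector(tree, cost):
    """Minimum cost of the subtree for each possible state at its base."""
    if isinstance(tree, int):  # a tip, given as its observed state
        return [0 if s == tree else inf for s in range(N_STATES)]
    child_vectors = [sankoff_vector(child, cost) for child in tree]
    return [
        sum(min(cost[s][t] + vec[t] for t in range(N_STATES)) for vec in child_vectors)
        for s in range(N_STATES)
    ]


def sankoff_score(tree, cost):
    """Parsimony length of one character on a rooted tree of nested tuples."""
    return min(sankoff_vector(tree, cost))


# Hypothetical four-tip tree with tip states 1, 2, 1, 2.
tree = ((1, 2), (1, 2))
print("step matrix:", sankoff_score(tree, STEP))
print("unordered:", sankoff_score(tree, UNORDERED))
```

On this toy tree the unordered character needs two steps, while the step matrix forces four, because every tip has to be derived independently from an ancestor in state 0.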

Another concern was that while nucleotide substitution models are based on a fairly good understanding of what can happen to DNA nucleotides which, after all, have a limited number of states and transitions between those states, it is considerably less clear what the most appropriate model for any given morphological character is.

What is more, somebody pointed out that there are essentially two options in a model-based analysis: either the likelihood of state transitions is fixed, which is a difficult decision to make, or it is estimated during the analysis. But in the latter case the probability of, for example, changing the number of petals would be influenced by the probability of shifting between opposite and alternate leaf arrangement. And clearly that idea is immediately nonsensical.

In summary, the drumbeat of papers on the lines of "we are the Bayesians; you will be assimilated; resistance is futile" is not going to stop any time soon. I use Bayesian and likelihood analyses all the time for molecular data, no problem. But I am still not convinced that the Mk model would be my go-to approach the next time I have to deal with morphological data. It seems to me that it is much easier to justify one's model selection in the case of DNA than in the case of, say, flower colour or leaf length; that the idea of setting one model and estimating gamma across totally incomparable traits is odd; and that I would hardly ever have enough characters for Bayesian analysis to produce more than a large polytomy.

But I guess all that depends on the study group. I can imagine there would be morphometric data for some groups of organisms for which stochastic models work quite well.

Tuesday, April 4, 2017

IJHSSIOMGWTFBBQ

There is so much science spam these days that a message has to be particularly remarkable to even register; mostly I just mark them as junk or report them without even thinking about it. But this one is a beauty.


Let's count the ways:
  1. The message uses four different text colours (counting the links), several different font types, and more different font sizes than anybody in their right mind could consider tasteful.
  2. The title - International Journal of Humanities and Social Science Invention - is likely among the top five most convoluted titles I have ever seen, and given the competition that is saying something.
  3. The title does not make any sense either, but I guess that goes without saying.
  4. The spammer did not even write their script to personalise the message. At least other spammers have it insert the name of the recipient, but this one merely reads "dear author/researcher". Lazy.
  5. The first sentence randomly capitalises "international journal" and is poorly written.
  6. The second sentence claims the journal is indexed in "major indexing" (major indexing what?) and then lists four names none of which I have ever heard of. So whatever they are, they are certainly not "major".
  7. "IJHSSI follows the rapid publication process." So there is a rapid publication process, just one?
  8. Like many other spammers, this one sets arbitrary paper submission deadlines, presumably to create a sense of urgency. Why would a journal, which by definition publishes regular issues, ever do that?
  9. The sentence in bold and red is ungrammatical.
  10. The spammer does not even bother to invent a name for their imaginary editor-in-chief IJHSSI. Remember Robest Pual Ashcraft? That was fun. But no, here we only get a generic title.
  11. Note that there is very conspicuously no mention of the article processing fees in this message.
I think this is another, ahem, "journal" that I will pass on.

Sunday, April 2, 2017

The taxonomic impediment as illustrated by journals' criteria for the acceptance of manuscripts

About two weeks ago I learned from a co-author, who in that case is the corresponding author, that a certain systematic botany journal would consider our manuscript unacceptable no matter how much we improved it simply because it was out of scope. You see, our work was only "revisionary", as in dealing with species delimitation, and it would have to be a phylogenetic study to be acceptable. A few thoughts:

I do understand why higher-profile systematics journals do not accept descriptions of taxonomic novelties that take a qualitative approach like "hey, that looks different to that other species", or papers that merely validate taxonomic changes based on evidence presented elsewhere. But I completely fail to understand what the problem is with papers that, as in our case, use integrative, quantitative analyses of morphological, genetic and environmental data to resolve difficult species complexes. I would love to understand how a phylogenetic study is more serious than that. The conservation impact is, for example, much higher in studies finding a previously unrecognised, rare species than in those that only change the circumscription of a genus.

The journal in question is TAXON. Think about it: a journal literally called "taxon" has decided to accept no more taxonomic studies going forward. No word on when Evolution will stop accepting studies dealing with evolutionary biology, or when Heredity will reject all manuscripts dealing with genetics.

Note also that TAXON is still the go-to journal for nomenclatural suggestions in botany. In the latest issue as of writing, for example, we find Brownsey & Perrie, "Proposal to conserve the name Asplenium richardii with a conserved type" and Dorr & Gulledge, "Request for a binding decision on whether Briquetastrum Robyns & Lebrun (Lamiaceae) and Briquetiastrum Bovini (Malvaceae) are sufficiently alike to be confused". Those papers are important and need a forum, and it is good that TAXON is that forum. But the same is true for revisionary studies, and I cannot help but feel that in terms of editorial policy accepting nomenclatural suggestions like these but not evidence-based revisionary studies is the equivalent of saying, "we don't serve alcohol to minors, but we make an exception if you are under six months old."

The general problem is that there are quite a few systematics journals that have made the same decision over the last few years. I have thought about what journals there are in my field, and I cannot at the moment think of one with an impact factor of more than approximately one that would still accept revisionary studies. Most of the options are local journals published by university or state herbaria, usually named after a 19th century taxonomist or a plant genus, that either do not have an IF or have one of around 0.3-0.7. As valuable as those outlets are for publishing new species or smaller taxonomic revisions, they just do not seem to be the right venue or to have the right audience for a two-year study using complex analyses of genomic data. Surely if we have molecular phylogenetics journals with IFs of 2 to 5 it should be possible to have journals in that range that publish what might be called molecular taxonomy? If not, why not?

If we do not have journals like that, if the only option for a researcher doing species delimitation with cutting edge, expensive methods is to publish in journals that a job or promotion committee might consider to be a liability to publish in, then it is no wonder that fewer and fewer people will be willing to figure out how many and what species there are on our planet, and that those who are willing to do it will find it hard to get a job in academia. That is known as the taxonomic impediment: There are still many species to be discovered before we are even in a position to know what we need to conserve, but the number of people, institutions and resources assigned to that task is dwindling.

Which brings me to the final point. A year and a half ago I wrote about a study published in Systematic Biology that claimed to have disproved (!) the citation impediment to taxonomy. The authors actually mentioned the non-acceptance of taxonomic papers by high impact journals as one of the arguments underlying the citation impediment, but then argued the latter does not exist. As I wrote at the time, my interpretation of their paper is that they reached their conclusion based on defining phylogenetic studies that happen to include a taxonomic act as taxonomic papers, and then comparing them against phylogenetic studies that do not include a taxonomic act. For example, they had the Botanical Journal of the Linnean Society in their data, which at that point had officially not been accepting taxonomic papers for several years. In other words, the study's approach seems to have been the equivalent of examining discrimination against women by comparing men who grow a beard with men who do not grow a beard.

In the light of my recent experience, that paper now seems even more upsetting.

Saturday, April 1, 2017

People don't understand the value of biodiversity collections

An American university's decision to eliminate its natural history collection to make room for, no joke!, a running track is currently making the news. Apparently, if no other institution takes it by July it will be destroyed; and of course other institutions are likely operating under tight budgets and have no space to accommodate millions of additional specimens at short notice.

To expand on what I commented at another website:

Collection specimens are the basis of research because whenever scientists present data - morphology, anatomy, cytology, chemistry, DNA - they need to refer to the specimen ("voucher") they got them from, and that specimen needs to be deposited at an accessible, curated collection, so that the research is reproducible. I am not talking Arabidopsis, zebra fish or fruit flies here, but if somebody is doing work on non-model organisms serious journals will not publish a paper unless each data point is vouchered.

Collection specimens are the basis of research because more and more of them are databased, resulting in large databases such as GBIF or ALA, which are then used by species distribution modellers, biogeographers, conservation scientists etc. to conduct spatial studies that would have been unthinkable even just 20 years ago. And who knows what people will come up with in another 20 years? Think about it: millions and millions of data points saying "this individual was found at this time of the year in this location so and so many years ago, and according to this expert it belonged to this species". This is an invaluable resource for research.

Collections are, of course, our only access to specimens from the past. I have seen a talk by a researcher who used insect specimens collected over decades to study how pesticide resistance evolved and spread in a population, hoping to gain knowledge that will be useful for pest management in the future. Without broadly and deeply sampled natural history collections such research would be impossible.

Collections are also our only access to specimens of species that have since gone extinct. Just yesterday I handled two specimens of a plant that was last collected in the 19th century and is presumed extinct; but with modern techniques you could now study its genome! Again, who knows what other things we can do with 150 year old herbarium specimens in fifty years, things that we would not have expected to be possible?

Finally, collection specimens represent a massive investment. Even while acknowledging that they are not really replaceable because you will never again be able to collect in 1859 or from an area that is now covered in apartment blocks, natural history collections can be valued based on how much it would cost to replace them, in the sense of collecting the same number of specimens again. This includes work hours, fuel and other transport costs, equipment, specimen processing, databasing, and much more. People should look at that number and realise that this is the value that they have the responsibility to safeguard. It is not only part of our cultural heritage, it is also an investment that should not be thrown away merely to make room for a sports facility.

And make no mistake, the number that comes out of such a valuation is always going to be in "holy s***, no way" territory even for a small university museum, the kind of number that will make the institution's accountants break out in a cold sweat. What is more, the specimens do not depreciate - they only become more valuable over time, because, again, you can perhaps go back and replace a specimen that was collected five years ago in the forest next door but not one that was collected two hundred years ago where the forest has since been turned into pasture.

As I have written before, I am constantly astonished that people would even so much as consider destroying a biodiversity collection, not least because the same people would not do the same to a humanities collection. Seriously, can you imagine what would happen if they said, "if you can't find somebody else to take it, we will throw all our Rembrandt and Dali paintings into the trash" or "either find a new building, or our collection of bronze age artifacts goes to landfill"?

Saturday, March 25, 2017

How not to convince a scientist that comic artists make good science communicators

Thanks to RationalWiki I found a blog post by a comic artist on science communication. It left me confused at several levels. As always I write the following not in any official capacity, and my opinion is mine alone and not necessarily shared by any person or institution I am affiliated with.
I don't know much about science, and even less about climate science.
This right here may well be the core problem of what follows.
So as a practical matter, I like to side with the majority of scientists until they change their collective minds. They might be wrong, but their guess is probably better than mine.
On the other hand, this is a very insightful paragraph. It would be helpful if we could all respect each other's expertise a bit more. Unless I have good reason not to, I assume that fully qualified primary school teachers know more about teaching primary school children than I do, plumbers know more about plumbing than I do, and so on.
That said, it is mind-boggling to me that the scientific community can't make a case for climate science that sounds convincing, even to some of the people on their side, such as me. In other words, I think scientists are right (because I play the odds), but I am puzzled by why they can't put together a convincing argument, whereas the skeptics can, and easily do. Shouldn't it be the other way around?
The implication is that it is the climate scientists' fault that there are climate change denialists, because scientists are poor communicators. Fair enough, many of us scientists probably could be better communicators. But in this context the argument only works if one assumes that everybody is rational and open to evidence in the first place. The fact is, it is just a really, really uncomfortable idea that our daily comforts like driving the car to work or cranking up air conditioning might be destroying our collective future. It is understandable that many people would reject such an idea regardless of how good a case could be made.

Whether denialists actually do make a better case than scientists is, of course, yet another matter. I do not think so, but then again, I am also a scientist, so I may not be representative.
As a public service, and to save the planet, obviously, I will tell you what it would take to convince skeptics that climate science is a problem that we must fix. Please avoid the following persuasion mistakes.
A comic book author telling scientists how to communicate science. Next up: a dentist telling comic artists how to draw, followed by a philosopher telling structural engineers how to design a bridge.
1. Stop telling me the "models" (plural) are good. If you told me one specific model was good, that might sound convincing. But if climate scientists have multiple models, and they all point in the same general direction, something sounds fishy. If climate science is relatively "settled," wouldn't we all use the same models and assumptions?

And why can't science tell me which one of the different models is the good one, so we can ignore the less-good ones? What's up with that? If you can't tell me which model is better than the others, why would I believe anything about them?
So as his first point the author assumes that there can only ever be one model in any area of science, and all the rest should be discarded. That is not how this works. That is not how any of this works. I am currently envisioning somebody applying the same logic to molecular phylogenetics: "If evolution was settled, wouldn't you all use the same model of character evolution? Why do you still have GTR, JC, F81, and all those other models?"

And how is it "fishy" if scientists have several models that "all point in the same general direction"? Logically, wouldn't the exact opposite look fishy, if each model led to a different conclusion?
2. Stop telling me the climate models are excellent at hindcasting, meaning they work when you look at history. That is also true of financial models, and we know financial models can NOT predict the future. We also know that investment advisors like to show you their pure-luck past performance to scam you into thinking they can do it in the future. To put it bluntly, climate science is using the most well-known scam method (predicting the past) to gain credibility. That doesn't mean climate models are scams. It only means scientists picked the least credible way to claim credibility. Were there no options for presenting their case in a credible way?

Just to be clear, hindcasting is a necessary check-off for knowing your models are rational and worthy of testing in the future. But it tells you nothing of their ability to predict the future. If scientists were honest about that point, they would be more credible.
This seems more like a personal hang-up than a general problem. How many members of the general public will think "ah, the scientists say that their models work well if tested against past observations, but precisely that is a very good reason not to trust their capacity to predict the future"? Cannot imagine it would be many.

And I find the comparison with investment advisors a bit misguided; we are not talking stock performance here, where one tries to predict the future of one particular investment. We are talking something more comparable to macro-economic modeling, and while there is certainly a lot of motivated reasoning in economics such high-level processes can be predicted with some confidence. It would be hard to say where exactly IBM shares will be in two years, but it should be no problem to provide a prediction on whether inflation will go up or down if the central bank of a country prints a lot more money. (Even I know that increasing the amount of money raises inflation, all else being equal.) Likewise, it might be hard to say exactly how much rain Madrid will have in the year 2100, but it should be no problem to provide a prediction on whether temperature will go up or down if CO2 levels in the atmosphere are doubled, and by how much approximately. (Apparently up by between 1.5 and 4.5 °C.)
3. Tell me what percentage of warming is caused by humans versus natural causes. If humans are 10% of the cause, I am not so worried. If we are 90%, you have my attention. And if you leave out the percentage caused by humans, I have to assume the omission is intentional. And why would you leave out the most important number if you were being straight with people? Sounds fishy.
This is, again, very strange. If somebody says, "I will now push you over the cliff edge" they have your attention, but if they say "get back, quick, the cliff is crumbling under your feet!", you ignore them? What? I at least would say that even if warming were natural we should not ignore it but still prepare for flooded coastal cities and failed harvests.
There might be a good reason why science doesn't know the percentage of human-made warming and still has a good reason for being alarmed. I just haven't seen it, and I've been looking for it. Why would climate science ignore the only important fact for persuasion?
No idea where the idea comes from that climate science ignores this factor. It is widely agreed among the climate science community that humans are the main factor in what is currently happening, and in turn that expert consensus is widely known to exist.
Today I saw an article saying humans are responsible for MORE than 100% of warming because the earth would otherwise be in a cooling state. No links provided. Credibility = zero.
Why credibility = zero? Does the author not know that the earth underwent some noticeable cooling during the early modern period? Little ice age, anyone? There is also a good argument to be made, based on the timing of previous glacial cycles, that we are due for the start of another ice age, although of course such a change would take hundreds to thousands of years. I haven't looked into it deeply, but the idea that the earth would be cooling a bit if not for the use of fossil fuels is, in fact, at the very least credible to me given these considerations.
4. Stop attacking some of the messengers for believing that our reality holds evidence of Intelligent Design.
What "messengers"? What has any of this to do with Intelligent Design - where does that suddenly come from?
Climate science alarmists need to update their thinking to the "simulated universe" idea that makes a convincing case that we are a trillion times more likely to be a simulation than we are likely to be the first creatures who can create one. No God is required in that theory, and it is entirely compatible with accepted science. (Even if it is wrong.)
Ye gods, the simulated universe... Although I cannot find the link again I once read a very nice analogy for it. "Look, we can do simulations - so probably we are also simulated" is entirely equivalent to some Renaissance philosopher seeing the first paintings that used realistic perspective and concluding that because the real world also has perspective we must be paint pigments on another being's canvas.

It is all about getting caught up in enthusiasm about a new technology, with no evidence being involved anywhere along the chain of reasoning. There is no evidence that something like us could even be simulated, and it seems rather implausible that somebody would be motivated to run such a simulation. I guess one could play the mysterious ways card regarding the simulator's motivations, but then we are deeply in religious apologetics territory.

But still, the main point is that point #4 is completely beside the point.
5. Skeptics produce charts of the earth's temperature going up and down for ages before humans were industrialized. If you can't explain-away that chart, I can't hear anything else you say. I believe the climate alarmists are talking about the rate of increase, not the actual temperatures. But why do I never see their chart overlayed on the skeptics' chart so we can see the difference? That seems like the obvious thing to do. In fact, climate alarmists should throw out everything but that one chart.
Sorry to say, but reading this item I cannot help but think of the term Not Even Wrong. Of course temperatures go up and down naturally, so no scientist is ever going to "explain that away". The implied claim that climate scientists assume no non-anthropogenic climate change has ever taken place is shades of crocoduck, a ridiculous straw-man that would only be brought up by somebody who has not made the slightest effort at understanding the science in question. Scientific publications "produce" the very same charts of natural change, that is where the denialists get them from. The question is, do I have to "explain away" the fact that people die of natural causes all the time before I can object to somebody trying to kill me?

And why rates of increase? Of course a higher rate of change is a problem because it gives us less time to adapt and wildlife less time to move with their climate zone, but ultimately that is not all that "alarmists are talking about". Yes, if Miami is going to turn into Atlantis it may matter whether rates of change are different to, say, the onset of the current interglacial, but first and foremost it matters that the population of Miami will have to move, right?
6. Stop telling me the arctic ice on one pole is decreasing if you are ignoring the increase on the other pole. Or tell me why the experts observing the ice increase are wrong. When you ignore the claim, it feels fishy.
Maybe I missed something, but to the best of my understanding ice is shrinking on both poles. But even if this refers to some reference saying that ice is growing in some part of the Antarctic (a weblink would have been helpful), nobody would claim that every place on earth will experience the same effect with the same effect size. It is, for example, entirely to be expected that it will get drier in one place but wetter in another. In fact, the reason the former place is now drier is most likely that the rain it usually got is now falling in the latter place!
7. When skeptics point out that the Earth has not warmed as predicted, don't change the subject to sea levels. That sounds fishy.
This must either refer to some isolated incident that is not referenced or represent a misunderstanding: It sounds like a garbled version of the observation that the ocean has absorbed some of the warming that was expected to be absorbed by the atmosphere.
8. Don't let the skeptics talk last. The typical arc I see online is that Climate Scientists point out that temperatures are rising, then skeptics produce a chart saying the temperatures are always fluctuating, and have for as far as we can measure. If the real argument is about rate of change, stop telling me about record high temperatures as if they are proof of something.
This is merely a repeat of #5.
9. Stop pointing to record warmth in one place when we're also having record cold in others. How is one relevant and the other is not?
I already touched on this with regard to #6. North America seems to have unusually cold winters precisely because the north pole has unusually warm ones, due to shifting air currents. Truth be told, this objection really astonishes me. Some denialists sound as if they would be surprised by workplaces being empty at the same time as when beaches are full of people. "So are there more people or less people? You don't make sense!"
10. Don't tell me how well your models predict the past. Tell me how many climate models have ever been created, since we started doing this sort of thing, and tell me how many have now been discarded because they didn't predict correctly. If the answer is "All of the old ones failed and we were totally surprised because they were good at hindcasting," then why would I trust the new ones?
This is partly a repeat of #1 and partly a severe misunderstanding of how science works. "If Newton's theory of gravity was superseded by Einstein's theory, why should I now trust Einstein?"

Also, this.
11. When you claim the oceans have risen dramatically, you need to explain why insurance companies are ignoring this risk and why my local beaches look exactly the same to me.
To the best of my understanding, even Donald Trump's Irish golf course has lobbied the local government for a sea wall to protect against rising sea levels...
Also, when I Google this question, why are half of the top search results debunking the rise? How can I tell who is right? They all sound credible to me.
Yes, when I google about health, the search results variously suggest certified pharmaceuticals, homeopathy, reiki, acupuncture, chiropractics, and much more. There are quacks on one side and science-based medical research on the other. How can I tell who is right? I am so confused!
12. If you want me to believe warmer temperatures are bad, you need to produce a chart telling me how humankind thrived during various warmer and colder eras. Was warming usually good or usually bad?

You also need to convince me that economic models are accurate. Sure, we might have warming, but you have to run economic models to figure out how that affects things. And economic models are, as you know, usually worthless.
To be fair, the author may not realise that the last time global temperatures underwent several degrees of change we did not have billions of people living in coastal areas that are going to be flooded, or billions of people to be fed by crops that will suddenly find themselves under heat and drought stress.
13. Stop conflating the basic science and the measurements with the models. Each has its own credibility. The basic science and even the measurements are credible. The models are less so. If you don't make that distinction, I see the message as manipulation, not an honest transfer of knowledge.
Once more this probably refers to an unreferenced incident, so it is difficult to address. More generally, every mathematical description of a system is a model. If I say, "every day this plant grows 5 mm" I have formulated an (admittedly simplistic) model. I am not sure how that is so much less credible than a chart showing the plant to have a stem height of 4.3 cm, 4.8 cm, and 5.3 cm on successive days. It is merely a different way of expressing the same pattern.
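To make the plant example concrete, here is a minimal sketch in Python. The function and variable names are made up for illustration; the only inputs are the three stem heights and the growth rate mentioned above, converted to millimetres.

```python
# Stem heights on three successive days (4.3 cm, 4.8 cm, 5.3 cm), in mm.
measured_mm = [43, 48, 53]

# The "model": every day this plant grows 5 mm.
GROWTH_MM_PER_DAY = 5


def predicted_height_mm(day, start_mm=measured_mm[0]):
    """Height predicted by the constant-growth model after `day` days."""
    return start_mm + GROWTH_MM_PER_DAY * day


# The model reproduces the chart exactly ...
predictions = [predicted_height_mm(day) for day in range(len(measured_mm))]
print(predictions == measured_mm)

# ... and, unlike the chart, it also says something about days not yet measured.
print(predicted_height_mm(10))
```

The chart and the model carry the same pattern; the model simply states it in a form that can be extended beyond the measured days, and that can then be checked against new measurements.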
14. If skeptics make you retreat to Pascal's Wager as your main argument for aggressively responding the climate change, please understand that you lost the debate. The world is full of risks that might happen. We don't treat all of them as real. And we can't rank any of these risks to know how to allocate our capital to the best path. Should we put a trillion dollars into climate remediation or use that money for a missile defense system to better protect us from North Korea?
Yet another instance of what was presumably an unreferenced incident experienced by the author. I would not know why any serious climate scientist would ever have to resort to Pascal's Wager, given that the action of CO2 as a greenhouse gas has been established for more than a century and that evidence of rising sea levels, shrinking glaciers, rising atmospheric temperatures, and increasingly extreme weather events is all around us. But then again, I am not a climate scientist myself, so I do not know much about how they generally argue.
Anyway, to me it seems brutally wrong to call skeptics on climate science "anti-science" when all they want is for science to make its case in a way that doesn't look exactly like a financial scam.* Is that asking a lot?
This is a hilariously naive understanding of denialism. Sure, everybody everywhere is totally open to argument and merely "want[s] for science to make its case in a way that doesn't look exactly like a financial scam". Financial and political interests or tribal instincts do not exist. Riiight.

So in summary, I am sure that many scientists, me included, could learn a lot more about how to communicate. This post, however, was the equivalent of "hey medical profession, you could convince people not to use homeopathy if only you admitted that magic works, and you should stop all that double-blind experiment nonsense, because that just looks as if you have something to hide".

Thursday, March 23, 2017

Species delimitation using the coalescent model

For two weeks or so now a new paper has been making the rounds, and we discussed it in our journal club today:
Sukumaran J, Knowles LL, 2017. Multispecies coalescent delimits structure, not species. PNAS 114(7): 1607-1612.
The context is species delimitation: given a bunch of individuals, how many species are there, and which individuals belong to which species? There are a number of ways to address these questions, and they partly depend on the available data and technology and partly on the species concept the researcher is using.

Very traditionally, of course, a taxonomist would look at the morphology of the specimens and more or less intuitively try to form clusters of similar specimens separated from each other by gaps in morphological variation. In other words, a qualitative application of the Genotypic Cluster Species Concept. More dubious approaches would involve ideal "types" (in a Platonic sense), "central identities", or rules of thumb on the lines of "one difference means subspecies, two differences means species", none of which seem to have much basis in what we know about genetics or evolutionary biology.

More formally, one can take the same theoretical approach but conduct an explicit, quantitative analysis. Score the morphological data and produce a pair-wise distance matrix for example with the Gower metric, then do a Principal Coordinates Analysis to visualise potential clusters and gaps between them, or do hierarchical or non-hierarchical clustering. The same can be done with non-morphological data, such as environmental data from the collecting localities, in that case to show that putative species have different ecological niches.
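As a rough illustration of the first steps of such an analysis, the following Python sketch computes a Gower distance matrix for a handful of specimens and then groups them by joining every pair closer than a cut-off, which corresponds to cutting a single-linkage clustering at that distance. The four specimens, the trait values, the cut-off of 0.5 and the function names are all invented for the example, missing data are not handled, and the ordination step is left out; a real analysis would use an established statistics package.

```python
def gower_matrix(rows, is_numeric):
    """Pairwise Gower distances between specimens.

    Numeric traits contribute |a - b| / range of the trait,
    categorical traits contribute 0 if equal and 1 otherwise;
    the distance is the mean contribution across all traits.
    """
    n, k = len(rows), len(is_numeric)
    ranges = []
    for t in range(k):
        if is_numeric[t]:
            values = [row[t] for row in rows]
            ranges.append(max(values) - min(values))
        else:
            ranges.append(None)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for t in range(k):
                a, b = rows[i][t], rows[j][t]
                if is_numeric[t]:
                    # A trait without any variation contributes nothing.
                    total += abs(a - b) / ranges[t] if ranges[t] else 0.0
                else:
                    total += 0.0 if a == b else 1.0
            dist[i][j] = dist[j][i] = total / k
    return dist


def groups_below_threshold(dist, threshold):
    """Label specimens so that any pair closer than `threshold` shares a label."""
    n = len(dist)
    group = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                old, new = group[j], group[i]
                group = [new if g == old else g for g in group]
    return group


# Invented scores: leaf length (mm), petal number, flower colour.
specimens = [
    (10.0, 5, "white"),
    (11.0, 5, "white"),
    (20.0, 8, "yellow"),
    (19.0, 8, "yellow"),
]
trait_is_numeric = [True, True, False]

distances = gower_matrix(specimens, trait_is_numeric)
clusters = groups_below_threshold(distances, threshold=0.5)
print(clusters)
```

In this toy data set the first two and the last two specimens end up in separate groups with a large gap between them, which is the kind of pattern one would then interpret as evidence for two putative species.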

A clustering approach can also, of course, be used for genetic data. In that case one would use some kind of genotyping approach, for example microsatellites, AFLP or genome-wide SNPs, and do hierarchical clustering or use software such as STRUCTURE. Although using a population genetics model, the results produced by the latter are at a practical level comparable to the non-hierarchical clustering in that we get an optimal number of clusters and information on which sample belongs to which cluster; we then need to make the additional interpretative step of assuming that the clusters are the species. (Meaning we have solved the grouping problem but need additional arguments to solve the ranking problem.)

But today more and more people have multi-locus sequence data at their disposal. They are used for phylogenetics under the coalescent model and using species tree approaches, so it was probably unavoidable that the coalescent model would also be applied to species delimitation. The idea behind the relevant software tools such as the currently very popular BPP (disclosure: I have never used it) is that the information from multiple loci can be used to figure out how many species there are among the samples, under the assumption that samples belonging to the same species should have a history of reticulation but samples belonging to different species should have a history of (permanent) lineage divergence.

That sounds logical, but the aforementioned paper seems to hit this idea below the waterline: as the title suggests, the authors conclude that species delimitation under the coalescent resolves population structure, not species limits.

Frankly, although the method has been extremely popular lately, there has also been a lot of scepticism in the community. After all, its application has produced rather one-sided results, nearly always splitting species into several smaller species. I have heard a talk that amounted to a scathing criticism of the approach, arguing that genetic isolation of a small population for less than 200 years would be enough to make it show up as a separate "species" under the coalescent, surely a ridiculous outcome.

Consequently, the present paper fits my thinking on the issue; I, personally, would rather use clustering approaches to search for gaps in variation. But that being said, the way the authors addressed the issue still seems a bit odd to me and leaves me wondering how far their particular argument will carry.

The thing is, the study does not involve any empirical data; it is entirely based on simulations. The authors used a model under which at first only populations split and then some of them may turn into separate species after varying lag times; although there does not appear to be an explicit process in the model, I guess the assumption is that it takes a bit of time to accumulate enough differences that a population cannot reunite with its sister population even if they get back into contact with each other. They then simulated species lineages under that model, and then gene trees in those species lineages, and then sequence matrices for those gene trees. And then they analysed the sequence matrices with the coalescent-based species delimitation approach, trying to get the original species back.

Surprise, surprise, the coalescent species delimitation approach recovered the population splits, not the species splits. But what has this really shown? As far as I can tell, it has shown that an approach using a model counting all population splits immediately as species splits will not produce the results expected under a model not counting all population splits immediately as species splits.

Maybe I am missing something, but that is exactly what I would have expected before complex simulations on supercomputers had been conducted. If I simulate bicycle rides under a model that assumes I cycle to work at 20 km/h and then try to fit the results back to a model that assumes I cycle to work at 100 km/h I will also likely find that there is poor fit, right? But that does not tell me anything about how fast I really cycle to work, or in other words, anything about which of the two models is a better fit to reality.

Consequently I have to admit that arguments on the lines of "this real-life population that is clearly not a separate species but has merely been isolated for 200 years comes out as a new species under the coalescent approach" seem to be more impressive.

Thursday, March 16, 2017

Some very, very basic notes on paper writing

The following are just a few notes on manuscript or student report writing for my area of science, which I would circumscribe as plant systematics, biogeography and evolutionary biology. I will probably at some point use this or a revised version for other purposes, but thought it might be interesting to blog about.

As should be well known, the typical research paper (and a student report modelled on it) in my area has, in this order, a title, author names and contact info, an abstract, key words, introduction, materials and methods, results, discussion, optionally conclusions (or they may be part of the discussion), acknowledgements, reference list, figure legends, tables, and potentially appendices or supplementary data. I will only deal with some of these for now. The main text, i.e. without reference list, should probably not be longer than c. 5,000 words, and the shorter the better.

Abstract

The abstract should summarise all the other sections in really abbreviated form: what is the paper about, what main methods were used, what are the main results, and what do they mean. Do not cite references in the abstract, and do not provide taxonomic authorities after plant names. They are provided on the first mention (and only on the first mention) of a name in the main text.

Introduction

This section provides background information, describes the question or problem, and ends with aims of the study. Ideally start with general, well-established, and unproblematic claims ("biodiversity loss is accelerating", "genomic data have become increasingly available for use in phylogenetic studies", etc.) and move relatively quickly to the problem ("guidance is needed to prioritise conservation planning", "but analysis of the large amounts of data produced by high-throughput sequencing is computationally challenging"). Do not begin with your study group, unless the paper is a purely taxonomic or phylogenetic one, as this will not draw in as many readers; the introduction of the study group should come after the general question has been established.

Obviously the introduction needs to be full of references, and all but the most widely accepted statements should be followed by at least one of them.

The aims should follow logically from the questions or problems and need to tie in logically with the methods, results and discussion. They can be formulated as hypotheses or questions, but should not be too vague. Think "we will test if patterns agree with those postulated by Smith (1980)" instead of "we want to explore the patterns".

Methods

Be as concise as you can. Cite software, tests and methods, but not the most basic ones; PCR or t-tests, for example, can be considered sufficiently established. Start with the materials (sampling strategy etc.), then work logically through the lab and analysis pipeline.

Results

This section presents the results and nothing more. Any sentence that interprets the results or explains them does not belong here but in the discussion. References do not belong here. Explanation of how something was done does not belong here but in the methods.

On the other hand, all observations need to be stated explicitly in the main text of the results section, with a reference to the figure or table where they are presented. The figure legend itself should, in turn, be very concise and merely provide enough information to understand the figure, but it should not restate what the reader can see in the figure just above said legend anyway. For example, the figure legend might say "A, map of Australia showing endemism hotspots", but something like "hotspots are found in the southwest and southeast" is superfluous here and goes into the main text: "endemism hotspots are found in the southwest and southeast (Fig. 2A)".

Discussion

Perhaps the hardest part to write, it explains what the results mean and how they fit into the wider context. It may also end on suggestions for further study. The main problem for beginners is to avoid repeating what was already said in the introduction and results sections.

The discussion should again have lots of references - not because we always need to cite lots of papers per se, but because any paragraph in the discussion that does not have at least one reference can be considered under suspicion of either belonging in the results section or being mere padding.

Acknowledgements

Collaborators, funding sources, peer reviewers. For student reports, one would expect the supervisor(s) to be mentioned. It is not clear to me why many people feel the need to reduce the given names of those they are thanking to initials, as writing them out makes it easier to identify them, especially if they have common family names.

Figures

I haven't really made a tally, but the typical article would probably have between two and six figures; beyond that it becomes excessive.

Journals in my area expect the manuscript text and figures to be uploaded as separate files, and they expect high quality file formats such as EPS for vector graphics and TIFF with lossless compression for bitmaps.

For exchanging a manuscript draft between co-authors, or a student report draft between student and supervisor, it is, however, probably best to insert figures into the manuscript file, because then your collaborator has fewer files to handle and to print. I would suggest the following approach:

Have a separate section for the figure legends after the references. This is as it should be for journal submission.

Produce EPS (vector) or JPGs (bitmap) of the figures to save on file size and insert each of them into the text directly above the relevant figure legend. Although Word allows for them, I would very strongly advise against using convoluted text boxes-inside-object boxes-inside-object boxes. A student report I once had to reformat froze my computer for several minutes when I tried to resize one of its "figures". Ultimately I had to export the page as a PDF and then export the PDF from Inkscape into yet another format, whereupon I reinserted the figure into the Word document. A journal would, of course, not accept anything like that anyway.

Select the wrapping option "in line with text". This will treat the figure like a character, meaning that it will remain anchored to its position relative to the text no matter what you do upwards of its position. If you use other wrapping options such as "on top of the text" the figures will float around freely, and changing line spacing or font size, or deleting paragraphs, will mess everything up to no end. I once dealt with a student report where three figures ended up on top of each other!

Do not use text boxes for the figure legends, or for tables, for that matter. Really, why would you? They can be normal text, just like the rest of the manuscript. In fact I have yet to see a use case for Word's text boxes in my line of work (PowerPoint is a different matter).

Language

There is no reason to use words like "whilst" where a simple "while" will do.

Often sentences can be simplified greatly; some people seem to have a penchant for writing something on the lines of "for pollination, it has been demonstrated that hummingbirds are more efficient", but the same could be said as "hummingbirds are more efficient pollinators" - saved us six words! Similarly, "in the literature it is documented that the sky is blue (Smith, 1980)" can be reduced to "the sky is blue (Smith, 1980)". And yes, I have seen sentences like these, particularly as a reviewer.

Small stuff

Do not use double spaces at the end of a sentence.

Do not have spaces at the end of a paragraph.

There needs to be a space between any measurement and its unit (5 km, 5 h), with the exception of temperatures (5°C) and angles (5°).

This might seem a bit OCD, but believe me, you do not have to make some poor copy editor's life harder than it is.

Saturday, March 11, 2017

Promiscuity

Recently I participated in an interesting discussion on the internet. The main topic was how some people reject scientific evidence if it contradicts their religious or ideological commitments, but the example was the nexus of evolutionary biology, male and female reproductive strategies, and differences between men and women.

It seems rather self-evident that males of nearly every species can potentially, if they are lucky and pursue the "right" strategy to achieve that end, have many more children than females. That is, after all, how female is defined in biology: it is the sex that makes the greater investment in offspring, usually at a minimum by producing a few large, immobile gametes, while the male is defined as the sex that makes the lower investment into each individual potential descendant, usually at a minimum by producing many small, mobile gametes. On top of that many species have layered additional female investment into the developing offspring, be it by giving live birth (or its botanical counterpart of producing seeds instead of spores), producing milk, or providing parental care.

It is at this stage that the situation can, rarely, be flipped, e.g. by male sea-horses taking over the pregnancy, or male ratites raising the young; or parental care can be shared by the sexes. But for most species, the female is the bottleneck, so to speak: How many offspring a female and a male can have is capped by the female's fertility.

It follows logically that a male can increase its number of offspring by being promiscuous, while a female cannot. The conclusion for reproductive strategies is that males in your modal species should evolve to be non-discriminating with regard to sexual encounters, and to maximise the number of partners. Females, on the other hand, do not get anything out of such behaviour. (Unless other considerations come into play, such as earning money with prostitution, or using casual sex as social glue, as it is said bonobos do.)

Whether, for example, human men are more interested in having many partners or more willing to cheat than women is a testable hypothesis. But the answer to that question is not really what I want to dwell on.

What interested me was that a lot of people who argue from reproductive strategies as discussed above write things on the lines of "men cheat more than women" or "men are more promiscuous than women". Also quite interestingly, rarely somebody will pop up who argues the opposite, claiming that "women cheat more than men". Honestly I do not understand the logic for that latter claim, as it does not even have the advantage of making sense from an evolutionary biology perspective; the idea that it is based entirely on misogyny is at least not easily dismissed.

But really for present purposes both claims can be treated as equivalent: I think both of them are, equally, mathematically impossible.

Yes, perhaps it can be shown that men are wired to seek more partners; maybe that is even biological as opposed to cultural. But that does not mean that they will be successful at having more partners, and that is unfortunately what being promiscuous means. Wishing is not doing.

Assume equal numbers of men and women, and disregard homosexual pairings, as neither of these factors is what those who claim "[gender] cheats more than [other gender]" are concerned with. Make a row of female circles on the left and a row of male circles on the right. Now draw lines between female and male circles to indicate pairings.

You can end up with very different network structures, of course. You could have three quarters of all men unpaired, while a quarter of them is paired with four women each, a harem scenario. You could first have each man paired with one woman, and very few women also paired with lots of men, a prostitution / men cheat a lot scenario.

But it is simply impossible to have more average promiscuity on the left than on the right, or vice versa, because obviously all connections start on the left and end on the right, meaning that average promiscuity is in all cases = number of connections / number of people of that gender, and we assumed equal numbers of men and women.

Arguments could perhaps be made about the median, but that is not what people intuitively mean or understand when somebody says, for example, "women cheat more than men". Claims like those just don't make any sense, and one doesn't even have to collect evidence on that. They fail right out of the gate, on basic logic.
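For those who prefer to see the counting argument run rather than drawn, here is a small Python sketch of the circles-and-lines exercise. The names and the two example networks (four women, four men) are made up for illustration; each pairing is a line from one woman to one man, and the number of partners of a person is the number of lines touching their circle.

```python
from collections import Counter
from statistics import mean, median


def partner_counts(pairings, women, men):
    """Number of distinct partners for every woman and every man."""
    unique = set(pairings)  # count each woman-man pairing only once
    by_woman = Counter(w for w, _ in unique)
    by_man = Counter(m for _, m in unique)
    # Counter returns 0 for people who have no pairing at all.
    return [by_woman[w] for w in women], [by_man[m] for m in men]


women = ["w1", "w2", "w3", "w4"]
men = ["m1", "m2", "m3", "m4"]

# Harem scenario: one man in four is paired with four women, the rest are unpaired.
harem = [("w1", "m1"), ("w2", "m1"), ("w3", "m1"), ("w4", "m1")]

# Each man is paired with one woman, and one woman is also paired with all other men.
cheating = [("w1", "m1"), ("w2", "m2"), ("w3", "m3"), ("w4", "m4"),
            ("w1", "m2"), ("w1", "m3"), ("w1", "m4")]

for name, network in [("harem", harem), ("cheating", cheating)]:
    w_counts, m_counts = partner_counts(network, women, men)
    print(name,
          "mean:", mean(w_counts), mean(m_counts),
          "median:", median(w_counts), median(m_counts))
```

However the lines are drawn, the two means come out identical, because every line adds exactly one partner on each side. Only the medians differ between the sexes, and in opposite directions in the two scenarios.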

Saturday, February 25, 2017

More Brisbane impressions


A bridge.


A sign along the way.


Tall but surprisingly thin buildings. When it got dark it was weird to see fruit bats flying between the skyscrapers.


What might be called the culture district. The blueish building is the performing arts centre.

Friday, February 24, 2017

Botany picture #241: Psilotum


Currently I am in the Bane of Bris, in the Land of the Queen, having today co-organised a workshop. I may have mentioned before that there are certain groups that seemed weird and exotic but that some exposure to Australia has suddenly made seem rather more mundane and everyday. One example is the cycads, which to my botany student self in Germany seemed to be this rare dinosaur plant of which you might see one displayed in the greenhouse of a botanic garden, but which cover the entire forest floor just a bit east of where I live now.


Today's example is Psilotum, a weird fern that is morphologically so reduced that it was for a long time considered to be a living fossil representing the first vascular plants before the invention of roots. We now know that it is instead nested within the monilophytes and has lost the roots secondarily, but still, weird. So again my student self knew it as a rare and fascinating object studied in a first year botany course but would not have thought to ever see it in the wild. And today I walked past a specimen growing on an alley tree in the city centre of Brisbane, between the workshop venue and our hotel. That was unexpected.

Sunday, February 19, 2017

Bombing of veterans


Did anybody else read this and wonder if bombing veterans in celebration of an anniversary isn't a bit cruel? Writing news headlines is an art.

Another one I always wonder about is the form, "suspect still at large: police". I have yet to understand how the fact that a suspect is still at large can say the word police.

Saturday, February 18, 2017

Robert Lanfear on the state of molecular phylogenetics

I had hoped to write this up earlier, but there we are. On 9 February I went to a presentation by Robert Lanfear, the author among other things of PartitionFinder, a software that assists in the selection of models of nucleotide evolution and, as the name implies, dataset partitioning. His talk gave an overview of where he sees the field of (model-based) molecular phylogenetics, its problems and potential solutions.

I will structure my notes on his talk and my own thoughts about it as a kind of numbered list, for easier cross-reference, with no claim to having written this up in a particularly beautiful way.

1. The problem

Lanfear started out with the observation that the current practice in molecular phylogenetics works well, but it works increasingly less well. What he means here is that if there is a phylogenetic question that has a clear and strongly supported answer, then even cutting a lot of corners and making some mistakes will produce that correct answer.

Now, however, those "low-hanging fruit" have largely been harvested, and what is left are really hard to resolve relationships. In those cases small differences in how the analysis is done will lead to different answers (see point 2 below). An example he referred to at least twice during the talk was the relationship between crocodiles, birds and turtles; another was the relationships between major clades of birds.

What I find interesting here is how people set their priorities. Apparently there are a lot of researchers who care very deeply about, for example, whether crocodiles are sister to birds or to turtles. Honestly I couldn't care less, and the same would be true for comparable cases in plant phylogenetics. What phylogenetics is about for me is to identify monophyletic groups for classification and to provide phylogenies for downstream analyses in biogeography and evolutionary biology. For the former, the most relevant observation is that turtles, crocs and birds form three reciprocally monophyletic groups, but if we don't know their relationships to each other we can simply classify them next to each other at the same rank, problem solved. For the latter, there are ways of taking uncertainty into consideration, problem solved.

In other words, where I see need for more work in the field is in the many clades of plants, insects, nematodes, mites, etc., that have so far not been well studied, as opposed to re-analysing over and over and over the same few charismatic but overstudied groups of vertebrates. Each to their own I guess, but the thing is that all the considerations that follow assume first that being unable to decisively resolve every single node in a phylogeny is at all important to anything or anybody. I am just not sure I see that.

2. How do we know that the current practice is working less well now?

Partly because people get very different results with high confidence. Lanfear called this the "new normal": large amounts of genomic data give strong statistical support for contradictory results.

This is a very good observation that will hopefully also be convincing to those who like to stress our inability to know the truth, and that we can only hope to build hypotheses.

3. The current best practice for genomic sequencing

Data cleaning of genomic data is crucial because everything is full of microbes. Even DNA extraction kits are contaminated, so never do genomic sequencing without a negative control.

I must admit that I have not always followed that advice, but with amplicon sequencing or target enrichment for example it may not be that relevant, given that non-targeted DNA is unlikely to amplify and you know if a sequence comes totally out of left field. The example Lanfear used, however, was a de novo genome assembly where contaminants were presented as evidence of horizontal gene transfer. That would have been embarrassing.

He also argued for inclusion of a positive control, as in adding a known genome to check for contamination percentage. That does of course assume that you always have a known genome in your study group, which is unlikely to be the case in most groups.

Finally, there should be biological and technical replicates, probably the sampling guideline that the largest number of people are aware of and follow.

4. The current best practice for assembling the data matrices

Remove parts of the alignment that cannot be trusted. Lanfear mentioned the software GBlocks, which I personally have never used. However, he cited a paper that argues it doesn't seem to help (Tan et al. 2015, Syst Biol 64: 778) and seemed to advise against using it. His own preference is to pragmatically make an automated alignment and then check by eye and delete non-homologous sites manually.

5. Examining the individual gene trees

Next comes paralog detection, if that is relevant to the data type. One of the most stunning observations Lanfear mentioned was that in multi-locus species tree analyses some loci may have massive leverage on the results. He cited a case in which two undetected paralogs made the difference between 100% support for one and 100% support for the other answer.

His suggested positive control here: be suspicious if a gene tree does not show a very well established clade. Keep that one in mind as it will come up again.
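As a rough illustration of that positive control, here is a minimal Python sketch that checks whether a gene tree contains a given clade. It assumes the tree is already rooted and is written as nested tuples with tip names as strings; the function names and the toy tree are made up for this example and are not taken from the talk or from any particular software.

```python
def leaves(tree):
    """Return the set of tip labels under a node (tips are strings, clades are tuples)."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def has_clade(tree, taxa):
    """True if some node of the rooted tree has exactly `taxa` as its tips."""
    taxa = set(taxa)
    if leaves(tree) == taxa:
        return True
    if isinstance(tree, str):
        return False
    # Otherwise look for the clade further down, in each child subtree.
    return any(has_clade(child, taxa) for child in tree)


# Hypothetical gene tree, for illustration only.
gene_tree = (("crocodile", "bird"), ("turtle", "lizard"))
print(has_clade(gene_tree, {"crocodile", "bird"}))  # clade present in this tree
print(has_clade(gene_tree, {"bird", "turtle"}))     # clade absent in this tree
```

A gene tree that fails such a check for a clade nobody seriously doubts would be a candidate for a closer look, for example for undetected paralogs.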

6. Multi-locus analysis versus concatenation

We are talking phylogenomics here, so there are always multiple independent loci. A full Bayesian analysis of gene trees and species tree together in StarBEAST is best but limited to max. 50 species. I wasn't aware of the ballpark number, so this is good to know. Interestingly, the next best thing is concatenation, because according to Lanfear short-cut methods using previously inferred gene trees to infer the species tree in a second step (ASTRAL et al.) perform worst. Not sure how easy it will generally be to convince peer reviewers of this.

7. Model selection

Not many people are aware that we have to guess a topology to even do an alignment, and also to do model selection. Then we co-estimate all model parameters at the same time as the final topology is inferred.

We may need a separate model for each codon position and gene, stem vs. loop for rDNA; even for only three genes, the number of possible partitioning schemes is already too large to explore exhaustively. Also, there is a trade-off between having enough parameters to describe the data and having so many that they cannot all be estimated reliably. Here cometh PartitionFinder to help with that. However, even as the author of that software, Lanfear himself stresses that thinking carefully about data may be better than using the automated approach.
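To put a rough number on how quickly the possibilities grow: if we assume that each of three protein-coding genes is split by codon position, we have nine data blocks, and the number of ways to group n blocks into partitions is the Bell number B(n). The following back-of-the-envelope Python sketch is not from the talk; the nine-block setup is my assumption and it ignores stem vs. loop regions.

```python
def bell_number(n):
    """Number of ways to group n items into non-empty subsets (Bell number)."""
    # Bell triangle: each row starts with the last entry of the previous row,
    # and each following entry adds the entry above-left from the previous row.
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


# 1, 2 and 3 genes, each split into three codon positions
for blocks in (3, 6, 9):
    print(blocks, bell_number(blocks))
# prints:
# 3 5
# 6 203
# 9 21147
```

So nine data blocks already allow 21,147 different partitioning schemes, before even choosing a substitution model for each partition.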

He was what I cannot help but call surprisingly cynical about how little we know about model selection and alignment.

8. Tree inference

Be aware that all software has bugs and limitations. Lanfear cited a few examples including a known but so far unresolvable branch length bug in RAxML (10x branch length inflation in 5 of 34 datasets tested). He also said that RAxML does not implement linked branch lengths across parts of the partition, and that few people were aware of that. Me neither.

At any rate he suggested using more than one software package and comparing the results, as a "sanity check". His suggestions for likelihood were RAxML, PhyML, and IQ-tree; for Bayesian phylogenetics obviously MrBayes and BEAST.

Parsimony seemed to be The Method That May Not Be Named, although there is a long tradition in the area I am working in to run at least Bayesian and parsimony analyses and then perhaps also likelihood for comparison. Indeed if I remember correctly the word parsimony was only mentioned once at the beginning of the talk, and it was in the context of something like "parsimony also makes assumptions". Hardly anybody would doubt that; the arguments of parsimony advocates appear to be mostly epistemological (I have discussed before why that doesn't convince me personally) and on the lines of modelling making too many and/or unjustified assumptions, whether that is true or not.

From my own perspective as a methods pragmatist who happily uses all of them as long as they are a good match for the data and computationally feasible, I was once more surprised that a likelihood phylogeneticist like Lanfear explicitly mentioned Neighbor Joining as perfectly fine, something that I had seen previously in that BMC Evolutionary Biology editorial. I am sorry to say that I don't really get it. It seems like saying that you shouldn't use your kitchen knife for emergency surgery because it wasn't properly sterilised, but the muddy shovel from the garden shed will do in a pinch.

9. Special considerations for Bayesian phylogenetics

Keep an eye on sampling and convergence using software such as Tracer and RWTY; effective sample size needs to be > 200 so that samples are independent enough. None of this should be news to anybody who is using Bayesian phylogenetics, one would hope, but I haven't tried RWTY so far.

Two things Lanfear mentioned were less familiar to me, unsurprisingly given that I am not really a Bayesian. First, in theory Markov Chain Monte Carlo only works if run for infinite time, but it "works in practice". Second, apparently there is no good way yet of calculating ESS for tree topology or convergence, but "RWTY helps".
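For a single continuous parameter such as the log-likelihood, the idea behind the ESS threshold can be sketched in a few lines of Python. This is a deliberately simple estimator (chain length divided by one plus twice the sum of autocorrelations, truncated at the first non-positive lag) and not necessarily what Tracer or RWTY compute internally; the function name and the toy chains are made up for illustration.

```python
import random


def effective_sample_size(trace):
    """Rough ESS: n / (1 + 2 * sum of autocorrelations at positive lags)."""
    n = len(trace)
    mean = sum(trace) / n
    centred = [x - mean for x in trace]
    variance = sum(c * c for c in centred) / n
    if variance == 0:
        raise ValueError("ESS is undefined for a constant trace")
    rho_sum = 0.0
    for lag in range(1, n):
        # Sample autocorrelation at this lag.
        acf = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / (n * variance)
        if acf <= 0:
            break  # truncate the sum at the first non-positive autocorrelation
        rho_sum += acf
    return n / (1 + 2 * rho_sum)


random.seed(42)
# Toy "well-mixed" chain: independent draws.
independent = [random.gauss(0, 1) for _ in range(2000)]
# Toy "sticky" chain: each sample depends strongly on the previous one.
sticky = [0.0]
for _ in range(1999):
    sticky.append(0.99 * sticky[-1] + random.gauss(0, 1))

print("independent draws:", round(effective_sample_size(independent)))
print("sticky chain:     ", round(effective_sample_size(sticky)))
```

Both toy chains contain 2000 samples, but the strongly autocorrelated one is worth far fewer independent draws, which is the situation the ESS > 200 rule of thumb is meant to catch.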

10. The way forward

Lanfear's hopes for improving molecular phylogenetics in the future are based on what he called "integrated analyses". They include trying to infer the model of evolution at the same time as tree topology.

Next there is the need for "better" models, e.g. non-reversible ones, which he mentioned as coming soon to IQ-tree and PartitionFinder, and different models for different parts of tree, which however may be computationally too hard.

Stationarity of model parameters across evolutionary history, reversibility, homogeneity, and tree-likeness (no recombination) are model assumptions that are universal and hardly ever tested. But tests are possible, and then the data that don't fit the model can be removed. Most generally, instead of big data use the data that can reliably be modelled only. I found this really refreshing to hear, as many people seem to prefer throwing more data at a problem in the hope it goes away.

Finally, Lanfear suggested conducting blinded analyses. He said that in many cases there was a hidden extra step after tree inference: is the tree the one we wanted? If yes, it gets published; if no, if it disagrees with preconceived notions, some people go back and tweak the data. Clearly this is problematic, but I was not the only one in the audience who thought back to what I have here written up as point number 5 and observed a bit of a self-contradiction.

I assume the answer is that there is a difference between being sceptical about a gene tree that contradicts really well established knowledge and tweaking the results that your study is really about. To use a non-phylogenetic example, if you want to find out if one brand of car can go faster than another it is not okay to tweak data after the results show that your favoured brand is the slower one; but it is okay to go back to check your data if they show one of them to have a speed of 50,000 km/h, because that just doesn't seem plausible.

Wednesday, February 15, 2017

Why is public reporting about science often so frustrating?

Reading a bit of ABC online over breakfast, I was surprised at the claim, to quote the title of the piece, that a "pregnant reptile fossil suggests bird ancestors gave birth to live young". Wow, that would be quite something, if the ancestor of the birds had given birth to live young and then later down the lineage they had re-invented egg laying. I would not have thought something like that possible, Dollo's Law and all.

Closer examination of the article shows that the title is quite a bit at variance with the rest. There is no mention of the reptile in question being the actual ancestor of the birds. It is sitting on a side branch of the phylogeny, and the conclusion made by the authors is merely "scientists can at least rule out the possibility that animals in this group", i.e. the clade that birds and crocodiles belong to, "were somehow incapable of evolving the ability to give birth to live young". They actually show the phylogenetic tree from the original paper and it shows the relevant reptile on a side branch.

So the title is not merely misleading but actually downright wrong. Don't science journalists know what an "ancestor" is? Did they not show the final article to somebody who knows that stuff and ask for feedback?

Sunday, February 12, 2017

Sturgeon's law

While on the topic of the book fair, I have to say that as much as I love browsing through the books and finding gems, it is also one of the moments that produce a certain sense of alienation from the majority of humanity in me. The only other moment that parallels it is "standing in front of the magazine rack in a supermarket".

As far as I am concerned, there are generally no more than two to three journals in the average magazine rack that one could reasonably count as a loss if somebody were to torch the lot. In fact, not only would there be no loss to the wealth and welfare of humanity if titles like "Kim Kardashian's new bikini body" or "Nicole Kidman's relationship crisis", most of them blatantly invented anyway, went up in flames, but burning the paper to generate energy would be considerably more productive than using it to print this kind of dreck. And people are actually wasting hard-earned money on all of it.

Similarly, I cannot help but observe, as I look across the dozens of tables in the book fair, that there are entire sections on astrology and "alternative medicine". These kinds of books have only one goal, and that is to make their readers more ignorant and less capable of critical thought. (You might argue that the ultimate goal is to sell, okay. But they will only sell if they first achieve the goal I mentioned. A swindler first has to swindle, only then can they extract money.) In a way it is, of course, nice to see them being sold again for a few bucks to finance a crisis hotline, but there is no way around the fact that as long as they are in circulation some of these works will continue to harm gullible people by getting them to rely on snake oil and forgo real treatment for their illnesses.

As for fantasy and science fiction novels, there are so many crappy books out there that it is extremely hard to find the few worthwhile ones between them. And I don't even have very high standards - some of the ones mentioned in my previous post are not exactly Nobel Prize in literature material either. But for an example of the 90% crud that makes browsing books so hard, I would like to present a novel that I bought on a whim at the previous fair we went to:

Stan Nicholls, Legion of Thunder. Book 2 of Orcs: First Blood.

Being part of a series is not decisive evidence of being crud, but it is a first warning sign. At a minimum I am starting to think that the better authors are the ones that write a series so that each novel can stand by itself. Think Martin Scott's Thraxas, Barry Hughart's Master Li chronicles, or Terry Pratchett's Discworld novels; each book is a self-contained story. When everything has to end on a cliff-hanger, however, it just looks cheap and like trying too hard. There is also the risk that the story will never be brought to a resolution and instead end with author existence failure.

Now as for the book itself, I was fooled into buying it because I had read other, fairly good books by different authors written from the perspective of the usual fantasy underdogs like orcs or dark elves. In the present case, however, the plot of the novel can comfortably be summarised as follows:

Protagonists search for McGuffins (yes, plural; they have to collect several).
Protagonists get into fight.
Protagonists search for McGuffins.
Protagonists get into fight.
Protagonists search for McGuffins.
Protagonists get into fight.
Novel ends on a cliff-hanger.

The fights appear to be the main attraction here, as they are written in a very voyeuristic manner. Apparently some readers really look forward to knowing which evil mook gets a knife into the eye, which one gets its arm cut off, and how far the blood sprays.

But the insults to the reader's intelligence don't stop there. In the background there is a big bad sorceress who is so comically evil and so prone to randomly killing her own followers that she should have been murdered in a palace coup years ago. During what is clearly meant to be a pivotal moment in her character development, she demands of one of her sisters, who is ruling over a people of aquatic semi-humanoids, to help her hunt for the protagonists, who are moving entirely on land. Her sister rejects the demand, and so she magics her dead.

The thing is, it never really becomes clear what helping would have looked like. Why didn't her sister simply agree, on the lines of: "I will gladly help you, let me just command all my soldiers who can operate on dry land to assist you OH WAIT I DON'T HAVE ANY"?

Seriously, the world does not need this kind of book to use up paper that could be used to print decent ones.

Saturday, February 11, 2017

This season's Lifeline Bookfair haul so far

Not sure if I will go again tomorrow, but so far today's visit to the Lifeline Bookfair here in Canberra has netted the following:

Tolkien JRR, The Silmarillion.
I have read that one before, although in German I think (?). But we didn't own the book ourselves, and I may want to read it again.

Orwell G, Animal Farm.
Another one that I have read once before, but as a teenager. Again I did not have the book myself, having at that time borrowed it from a friend.

Wells HG, The Invisible Man.
A classic that caught my interest.

Scott W, Ivanhoe.
Likely not the best book I have bought today. My understanding is that it is pretty cheesy. But when I was younger I played Defender of the Crown and watched Ivanhoe movies, so it might be nice to read the novel that started it all.

MacDonald G, The Wise Woman and other Fantastic Stories.
Sounds interesting because the author is billed as "the great nineteenth-century innovator of modern fantasy" who "came to influence" CS Lewis, Charles Williams and JRR Tolkien. The back cover further calls the book one of a set of four, but sadly the other three were nowhere to be seen.

Silverberg R, The Longest Way Home.
A science fiction novel from an author some of whose books I read in German translation years ago (mostly Majipoor novels). Not sure how it will turn out.

Bramah E, Kai Lung Unrolls His Mat.
Finally, this is probably the weirdest of them all. My hope is it will be something in the vein of Barry Hughart's chronicles of Master Li. We shall see.

In addition, we bought several books and a puzzle for our daughter, and yesterday my wife already went for several books and CDs herself. May have to donate some books back one of these days, or the bookshelf with the novels will fold into itself and turn into a singularity.

Update 12 Feb 2017: Went back again today and spent more time in non-fiction.

James W, The Varieties of Religious Experience.
A very famous book originally published in 1902, it examines the origin of religion from a psychological perspective. The critical introduction claims that the author was actually fairly charitable ("a classic that is ... too religious to have influenced much psychological research"), but one can imagine that the whole idea behind the work wouldn't have sat too well with many of the faithful.

Baggini J, Freedom Regained.
Having participated in the never-ending online discussion on Free Will I thought it might be good to read something by a philosopher on the subject. Admittedly there might be some bias on my side, as the author clearly has the same stance as I have, at least in the broad outlines.

Machiavelli N, Il Principe.
The classic's classic of all the books I bought, this is the 16th century book that Machiavelli is famous for. I got the German translation.

Astonishing, by the way, how much has been sold since yesterday.

Friday, February 10, 2017

Botany picture #240: Neottia nidus-avis, and parasitic plants in general


At the moment parasitic plant expert Sasa Stefanovic is visiting our herbarium to study the genus Cuscuta (Convolvulaceae; but unfortunately I do not have a good picture of it). Today he gave a seminar at the ANU, and I noted with interest what terminology he used to distinguish the two main groups of parasitic plants.

The first group are plants that have haustoria, organs that they use to attack the phloem of other plants and draw water, nutrients and energy from them. This adaptation occurs across several groups of eudicots but interestingly not in the monocots.

The second group are plants that parasitise fungi. An example is the strange European orchid Neottia nidus-avis depicted above. They have clearly evolved from ancestors that used mutually beneficial mycorrhiza, trading sugar against otherwise hard to obtain nutrients, but then turned the relationship into pure exploitation. This adaptation is found in Asterids, Monocots and one truly bizarre New Caledonian conifer, but apparently not in Rosids. In the past, people often believed that this second group was saprophytic, and one can even now see books making that mistake. In reality, there are no saprophytic plants; they are all either photosynthetic or parasitic.

Now the interesting thing is that according to Sasa Stefanovic, the community of parasitic plant researchers calls only the first group parasites, whereas the second group is called mycotrophic or heterotrophic. I must admit I find this a bit strange, as they are clearly both parasitic, only on different groups of organisms, and both heterotrophic. What is more, the people who are still stuck with the impression that the second group is not parasitic would not have their confusion cleared up if they heard these two terms used in this way.

But well, if that is what the community has decided, that is that. Not my area. At any rate it was interesting to learn how these two forms of parasitism are distributed phylogenetically.