Friday, April 5, 2019

Still not convinced by Vicariance Biogeography

When reading recent methodological papers, review articles, or publications on my study group I sometimes add to the mix the odd paper that is not directly relevant for my work and maybe not even very recent but which is relevant to my broader interests. In this case I decided to take a look at Heads 2009, Inferring biogeographic history from molecular phylogenies, Biol J Linn Soc 98: 757-774.

Michael Heads is perhaps the most published proponent of Vicariance Biogeography, the school of biogeography that rejects speciation following long-distance dispersal (LDD) because... and that is where it gets interesting, because I still find that rejection puzzling. To the best of my understanding at least some vicariance biogeographers consider the conclusion of LDD to be unscientific because they believe it can explain any possible contemporary range, on the lines of 'if your hypothesis can explain every observation it explains nothing'. This does not make sense to me, because LDD would still be more or less plausible depending on the dating of cladogenesis events relative to tektonic events or island ages, prevailing wind and water currents, dispersal ecology, and many other factors. It also seems rather more unscientific to reject a possible explanation a priori, regardless of any evidence in its favour. But to get a better understanding of the arguments of vicariance biogeographers is precisely my reason for picking up this paper. So, on with it.

In a section titled "critique of founder dispersal in population genetic studies", Heads first describes the concept as "the founder individual has been isolated from its parent population by dispersing over a barrier (an apparent contradiction)". Right out of the gate this seems odd. I may be missing something, but it appears as if Heads would accept only extremes: either there is a barrier, meaning zero dispersal, or there is none, meaning panmixis. I have previously observed similar arguments in other papers from the vicariance school.

Assume I have a garden with a fence around it, and then one day a cat jumps over it. Does this mean I have no barrier around the garden? Of course not, it may still have kept various stray dogs and neighbours' children out. On the other hand, it was never a barrier to birds or insects. The same in biogeography. No barrier on this planet is absolute, and each barrier has a different force for different groups of organisms. A channel that is near-insurmountable to a monkey may be crossed by insects if blown over by a strong enough storm, and it may be no barrier at all to fern spores. Perhaps even more importantly, dispersal is a stochastic process. The Atlantic Ocean did not keep all cacti from crossing (Rhipsalis made it over to Africa), but it kept the seeds of >99.9% of them away, so it is still a barrier even if not an absolute one.

Beyond that the argument of the section relies on citing five papers that "failed to corroborate predictions of founder effect speciation", of which one is missing from the reference list. I checked three of the remaining four papers, and in all cases they are experiments on fruit flies limited to time frames on the order of ten years and designed to test the very narrow question whether severe population bottlenecks will cause pre-mating isolation. Now I may completely have misunderstood the claim made by mainstream biogeographers regarding founder speciation, but I believe it was not "ten years after an organism has dispersed to an island it will have achieved biological pre-mating isolation". The way I understand it the claim is more on the lines of the large distance from the parental population producing geographic pre-mating isolation, which enables speciation to take place subsequently. The point is not the speed with which the new population evolves (although that is an exciting research question in itself) but rather that it has become geographically isolated.

The argument consequently seems to miss the point. If there is a problem for founder speciation then it would be whether a single pregnant female or a single seed can establish a viable population. Potential problems are inbreeding and, in plants that have such features, self-incompatibility systems that cause failure to set seed. But if a population establishes, helped perhaps by herbivore release and lack of competition, subsequent speciation is not an extraordinary claim. It really does not matter if isolation has been achieved by vicariance or by LDD, the subsequent process of divergence is the same except the latter will also cause a genetic bottleneck.

The section "critique of founder dispersal in biogeographic studies" points out that there is good evidence for similar vicariance patters in many taxa. I am unaware of anybody who denies that vicariance is an important process - but it does not logically follow that LDD is therefor implausible. I can agree that a lot of white swans exist without therefore having to believe that black ones cannot possibly exist.

This is followed by "founder dispersal and new ideas on rift tectonics", where the idea seems to be that seemingly young oceanic islands do not require LDD to be colonised because they kind of have always been there. It is not entirely clear to me if the claim is that the individual islands are all much older than the oldest still observable lava flows or if, as implied by the reference to "seamounts", the local species would have constantly hopped from one short-lived and now submerged island to the next. If the first, it seems rather ad-hoc; if the second, one wonders why species that can so easily jump ten times from one disappearing island to the next island in the chain cannot simply jump a single time from continent to island. What is the more parsimonious conclusion here?

Next, molecular clocks and time calibration of phylogenies are rejected. All inferences, be it from fossils but in particular from geological events such as the formation of the isthmus of Panama, are dismissed as unreliable, but apparently present distributions are reliable evidence of ancestral distributions. Unfortunately I remain anti-convinced.

To quote the following paragraph in full:

"In Ronquist's (1997) method of dispersal-vicariance analysis, inferences of dispersal events are minimized as they attract a 'cost'. Extinction also attracts a cost but vicariance does not. It was not explained why this approach was taken and it appears to be based on a confusion of the two different concepts of 'dispersal'. Ecological dispersal in the sense of ordinary movement should not attract any cost in any model; founder dispersal would attract no cost in a traditional dispersalist model, but, in a vicariance model of speciation or evolution, it is rejected a priori."

What Heads does here is reject a formal parsimony-based inference of ancestral ranges in favour of, to judge from the second half of the paper, an informal, intuitive, pencil-on-a-map deduction process. What does he not like about Dispersal-Vicariance Analysis (DIVA)? Apparently primarily that dispersal events have a parsimony cost. It may be that he did not contemplate how such an analysis would work or if it could even work at all, if the only process having a cost would be extinction - of course it would mean that dispersal would be much too 'cheap', and every single ancestral species would always be inferred to have occupied the union of the ranges of its two descendants.

The great irony here is that even with a dispersal cost DIVA is well known for mercilessly (and implausibly) favouring vicariance as a process. I ran that analysis on two or three data sets a few years ago, and unless one restricts the maximum range size of ancestral species to something biologically plausible one pretty much always ends up with the vicariance biogeographers' preferred conclusion: the ancestor of the study group was already everywhere where any of its descendants occur today.

The second part of the paper is taken up by a large number of case studies, taxa which have sometimes been suggested to have undergone LDD but for which Heads presents a vicariance explanation instead. Some of these I find more plausible than others, but I do not want to go into each of them in detail. Instead, it seems more efficient to discuss what I see as three problems running through the entire argumentation:

First, there seems to be a lot of ad-hoccery going on. Where necessary to arrive at the conclusion of vicariance, for example to explain the overlapping distributions of African Arctotideae, 'normal ecological' range expansion is invoked as common and easy. But where necessary to arrive at the conclusion of vicariance, for example when distantly related subclades of a taxon occur right next to each other in Tasmania or New Zealand (suggesting relatively recent LDD from elsewhere), they are assumed to have been sitting in these narrow localities for tens of millions of years, apparently unable to move at all, so that a very ancient vicariance event can have taken place between their present ranges. Is that not rather convenient?

Which brings me to the second point. The text presenting the case studies certainly uses words like "may" and "might" a lot. To be honest, I sometimes found myself reminded of Erich von Daniken, whose style was to the effect of "the traditional explanation is that the pyramids were build by the ancient Egyptians - but could it not have been extra-terrestrials?" Yes, in each of these cases vicariance (or extra-terrestrials) could be the explanation. But mere possibility is a low hurdle to clear; the real question is, is that the most plausible explanation?

Third, as always with vicariance- or panbiogeography the problem is that dispersal is still required. Somehow this taxon here must have reached this volcanic island, somehow that taxon there must have spread all over the world. How does the vicariance biogeographer arrive at contemporary ranges without invoking jumps across oceans? Partly by hiding the dispersal away before the start of the analysis. To quote the present paper, "assuming a worldwide ancestor..." Well, if we can just assume that at our leisure it becomes easy to conclude few dispersal events, long distance or otherwise.

Now quite apart from the question whether a single species occurring worldwide is biologically realistic for all groups of organisms (I'd say it isn't), the problem remains that we have a lot of nested groups that would all have to have been ancestrally cosmopolitan, requiring several global range expansions in between. The daisy family is an excellent example. With reference to them, Heads writes that "through the history of the family as a whole, only a small number of widespread ancestors may have existed (groups such as Senecioneae and Astereae each require their own global ancestor)." I think that is a wee bit of an underestimate.

To walk through just one example in order of containing taxon to subordinate taxon: The Asteraceae family is cosmopolitan. The Asteroideae subfamily is cosmopolitan. The Astereae tribe is cosmopolitan. And the genus Conyza is cosmopolitan. If vicariance is the explanation for all speciation events we still need at least four consecutive cases of spreading across all continents. The same applies to a large number of the other tribes in the family: yes, that includes the aforementioned Senecioneae, but also Gnaphalieae, Anthemideae, Heliantheae, Cichorieae, Cardueae, Inuleae, and Vernonieae. And several of these include genera occurring across several continents or even (as with Senecio) all of them except Antarctica.

There is certainly a lot of dispersal required to explain that even in a vicariance approach, and unless we assume that most speciation in these groups took place before the breakup of Pangaea 175 million years ago (meaning the early dinosaurs would have known many of the same daisies as we do now, tens of millions of years before the oldest estimates for the origin of the daisy family) we will have to assume that some of that dispersal was long-distance.

Why not simply accept that organisms can sometimes, rarely but often enough to matter, cross an ocean and establish on the other side, followed by speciation? What is is so extraordinary about that conclusion, really? What is so different about it compared to being separated by vicariance, followed by speciation? I am still puzzled.

Wednesday, February 20, 2019

Review of the Aachen Memorandum

I picked this book up at a book fair after having read that it was a satire on bureaucracy and 'political correctness'. Although I am not the kind of person who believes that not being able to use sexist and racist insults is the end of the world and thus unlikely to agree with the author politically I nonetheless thought I might still find this kind of book interesting. I can, for example, read the original Conan novels through to the end without believing myself, as their author did, that all civilisation is corrupt and deserves to be destroyed.

Unfortunately, Robert E. Howard was a master of wit and subtlety compared to Andrew Roberts, and I only made it halfway through the Aachen Memorandum before giving up. Roberts took everything he dislikes - immigration, high taxes on the rich, animal protection, weed, speed limits, feminism, anti-racism, grade inflation, concern for healthy nutrition, and so much more, stuffed it all into one pot and then scrawled 'Europe' onto it.

The results are, unfortunately, not even intellectually coherent. The book has all European nations dissolved into a Euro-superstate, but somehow France is still able to buy the Channel Islands off England. The dominant culture is depicted as a caricature of feminist prudery, while the protagonist is constantly lecherous and voyeuristic, but he also complains that advertisements are all using sex to sell products. Europe is a total dictatorship with complete surveillance of communications, no free press, and continental armies stationed in England to forcefully squash nationalist protests, but (what follows is the only minor spoiler here) somehow the entire edifice collapses the moment somebody finds evidence that a referendum a generation ago was manipulated. The ruling ideology is clearly supposed to be left-wing and cosmopolitan, but at the same time Adolf Hitler is venerated in the schools.

How does that any of that even start to make sense? It seems as if the author believed that everybody who is not part of his own political sect is interchangeable and in cahoots with each other.

Underneath the visceral hatred of everybody outside of Britain oozing from the pages it is just about possible to see the outline of a potentially amusing thriller, but the problem is that I cannot maintain willing suspension of disbelief. Yes, the reader will soon understand that the author despises the European Union in general and Germany and Polish taxi drivers in particular, so well done communicating that, but novels also need an at least somewhat plausible and logically coherent setting, otherwise they don't work. And that is before even mentioning how blatant a wish-fulfillment self-insert the protagonist is.

I assume there was, and still is, a very particular audience for this book in one particular country, but at least in my eyes everybody else would be better served by doing something more entertaining than reading it, such as watching paint dry or counting how many grains there are in one kg of sugar.

Friday, June 29, 2018

TreeBASE and Dryad

It is now generally expected that scientists, unless working on commercial or otherwise confidential projects, make the data underlying their scientific publications freely and publicly available, so that the studies can be replicated if necessary and so that others can use the data for further research.

Sometimes the data are submitted as supplementary material to be published on the journal website, together with the article itself. Some research organisations have their own data repositories. In many cases, however specialised databases are used. GenBank, for example, is a repository of DNA sequence data. Further down the analysis pipeline, I have in the past used TreeBASE to make available sequence alignment matrices and phylogenetic trees, and in one case I have reanalysed other people's data after obtaining them from there.

Recently I had reason to submit another such set of data matrices and phylogenetic trees to a database, and I thought I would go back to TreeBASE. Somehow it did not work out as well as it did a few years ago.

I was able to log in, I created a new submission, I submitted my files, and I described our analysis. The latter process is rather clunky, but okay, it works. Then it turned out that we needed to redo one of the phylogenetic analyses minus one sequence, so I had to delete one of the matrices and one of the trees and replace them with updated versions. That is when the fun started.

Although googling around a bit suggests that other people can do so, I find it impossible to delete anything in TreeBASE. There is no delete button next to anything except co-authors and submissions (i.e. the entire studies). Being unable to change data in a submission, I decided to delete the entire submission and start from scratch. That is surely not how it is meant to work, and it is a lot of extra effort, but what can I do?

As it turns out, not even that. When I ask that a submission be deleted, the web interface thinks for a bit an then throws a Java error at me. I now have three submissions under identical names and cannot delete the first two. Hurray.

At some point I thought I could maybe try out the alternative data repository Dryad.  Perhaps that would work more reliably? At least I have seen it used in several publications lately. I have now twice submitted my eMail address on their 'sign up for a new account' form, been told twice that a confirmation eMail has been sent, and days later neither I nor my spam folder have received any such message.

Perhaps the journal will accept our manuscript without us having the matrix and trees in a public repository? This process is becoming somewhat off-putting.

Update: After a mere four days I have now finally been sent a confirmation link by Dryad. Will see how that repository works.

Saturday, June 9, 2018

A particularly striking example of how paraphyletic taxa confuse our thinking about evolution

I recently reread Jason Rosenhouse's Among the Creationists and came across the following extended quote from Stephen Jay Gould, a widely admired and famous evolutionary biologist.
If mammals had arisen late and helped to drive dinosaurs to their doom, then we could legitimately propose a scenario of expected progress. But dinosaurs remained dominant and probably became extinct only as a quirky result of the most unpredictable of all events - a mass dying triggered by extraterrestrial impact. If dinosaurs had not died in this event, they would probably still dominate the domain of large-bodied vertebrates, as they had for so long with such conspicuous success, and mammals would still be small creatures in the interstices of their world. [...] Since dinosaurs were not moving toward markedly larger brains, and since such a prospect may lie outside the capabilities of reptilian design, we must assume that consciousness would not have evolved on our planet if a cosmic catastrophe had not claimed the dinosaurs as victims. (Gould 1989, 318)
The context is the controversy around convergence and contingency in evolution. Rosenhouse discusses convergence as one of the hopes of Christians trying to reconcile evolution and Christian teachings, citing various proponents of the idea that their god set up the universe in a way that human-like intelligence was guaranteed to arise, thus producing beings that can have a "relationship" with said god.

Convergence is, of course, not only an observation considered helpful by the proponents of one variant of theistic evolution. To what degree the organisms that evolved on our planet would again turn out to be kind of similar if we replayed the tape or if organisms on other planets can be expected to look very similar to those on ours are very interesting questions of broad interest. Even an atheist may ask if we can expect lots of other planets where life arose to produce land plants, something a bit like insects, and perhaps even sentient beings given enough time, or if the vast majority of them will, for example, remain populated only by bacteria, because even evolving as much as multicellularity was a rare fluke.

Rosenhouse cites Gould as a well-known proponent of the importance of contingency. Although I tend much more towards the opposite view, I understand Gould's position. I believe the strongest argument for the contingency side is that while there are many impressive cases of convergence there are also quite a few crucial events in the history of life on this planet that appear to have happened only once: complex Eukaryotic cells; colonisation of dry land by multi-cellular plants; vertebrates; and of course human-like intelligence.

If, for example, the independent evolution of wings by insects, pterosaurs, birds and bats is counted as evidence for the importance of convergence, should something happening only once not be counted as evidence for the importance of contingency? My response would be competition, or in other words the change in the adaptive landscape caused by the first organisms to settle on a new peak. Where there may have been a ridge connecting the niches "kelp" and "large land-living plant" when nobody had occupied the latter, the first lineage to do so quickly became so good at being large land-living plants that the ridge crumbled away and became a canyon. If all land plants were wiped out, however, I would expect the land to be colonised anew, this time perhaps by red or brown algae.

But that is not actually about the main argument Gould is quoted as making in the above excerpt, and not what I found interesting about the quote. To take it in smaller pieces:
If mammals had arisen late and helped to drive dinosaurs to their doom, then we could legitimately propose a scenario of expected progress.
"Expected progress" is a bit of an odd term here. I am not sure if that is what is meant, but it could be read as if any group of animals that does not evolve towards large brains and intelligence is a refutation of the possibility that one group on each planet might evolve towards larger brains. But I do not think that this works as a refutation. And few proponents of the importance of convergence would argue that it is all about one linear progression towards large brains anyway. There are also progressions, for example towards body shapes that work well for swimming, towards paternal care for the young, towards powered flight, etc., and all of these happen at the same time but only in those lineages for which they solve relevant problems or create new opportunities.

If I understand the argument correctly, it is like pointing at a hole in the ground and saying, "if I now throw a pebble into the air and it does not end up in this specific hole, gravity is refuted", whereas the argument for convergence is that, what with evolution throwing thousands of pebbles into the air every year, we are very likely to find a few of them at the bottom of this hole as opposed to half way up its wall.
But dinosaurs remained dominant and probably became extinct only as a quirky result of the most unpredictable of all events - a mass dying triggered by extraterrestrial impact. If dinosaurs had not died in this event, they would probably still dominate the domain of large-bodied vertebrates, as they had for so long with such conspicuous success, and mammals would still be small creatures in the interstices of their world.
Although this is not my field, and I understand that it is an active area of research, I believe it can already be said with some confidence that mass extinction is not random. There are generally some reasons for why an extinction event claims this lineage here but leaves that other one over there largely intact. If a mass extinction of marine life is caused, for example, by a massive drop in the oxygen content of the oceans, then we would expect lineages that can survive under low oxygen conditions to come out in relatively good shape, all things considered, while those with a high oxygen need would be hammered.

In the present case, if we hypothesise that the impact of a large meteorite would have caused massive shockwaves followed by a few years of something like nuclear winter, we could expect the following: Species of small animals may find it easier to survive because they need less food per number of individuals. Bonus points if you have a burrow to hide in when the devastation sweeps across your area (small mammals) or if you can move easily to other areas where a bit more food is left (flight-capable birds). Large animals that can go with little food for long times may also have a good chance, in other words being cold-blooded may help to survive several bad years (crocodiles). If, however, you are large and (!) at the same time you have a high rate of metabolism then you might be in trouble, as you constantly need lots of food per number of individuals. As far as I understand, that describes the non-avian dinosaurs: large and warm-blooded.

The point is, catastrophes do happen from time to time, and once one happened it would probably have decimated the largest animals, even if it had come ten million years later than it did. Their niches are filled up again by small animals evolving to be large (another good example of convergence). What killed off the pterosaur lineage, for example, may well have been that the birds had already out-competed all small pterosaurs, leaving only the very large species when the meteorite struck. But again, this is not my area of expertise really.
Since dinosaurs were not moving toward markedly larger brains, and since such a prospect may lie outside the capabilities of reptilian design, we must assume that consciousness would not have evolved on our planet if a cosmic catastrophe had not claimed the dinosaurs as victims.
And this last part is really what I find the most interesting, because it illustrates so nicely how paraphyletic taxa can confuse the thinking even of the smartest of us, even of experts in evolutionary biology. What is the problem with the argument here?

First, and most obviously, birds are dinosaurs. Second, corvids (crows and ravens) and parrots are highly intelligent. Not quite human-level intelligence, but in some experiments corvids have proved to be smarter even than chimpanzees, our closest relatives. It follows that  dinosaurs have actually "moved toward markedly larger brains", meaning here relative to the size of the body as a whole and, crucially, in terms of actual intelligence. Gould's premise is simply false, but his mistake is understandable, because at fault is really a misleading, i.e. non-phylogenetic, classification.

"Outside the capabilities of reptilian design" is, by the way, the same mistake at a deeper phylogenetic level. Mammals were not created fully formed, as mammals. Some of our ancestors were "reptiles", and here we are, having human-like intelligence by definition, what with us being humans and all that, so apparently there was a way of evolving human-like intelligence from a reptilian starting point. And from a fish starting point, and from a worm starting point, and from a bacterial starting point. All it took was lots of time and open niches waiting to be filled.

But I am not saying that anything here decisively refutes the idea that our sentience is a very rare fluke, unlikely to happen again should we go extinct. Maybe it is. The point is really how corrosive paraphyletic taxa are to reasoning about evolutionary processes.

Reference

Gould SJ, 1989. Wonderful Life: The Burgess Shale and the Nature of History. W.W. Norton, New York.

Wednesday, June 6, 2018

Manuscript submission then and now

When I started in science, back in the dark ages, submitting a manuscript to a journal was still quite simple, if perhaps a bit inefficient:
  1. Print the manuscript in triplicate.
  2. Write a cover letter and print it.
  3. Put everything into an envelope and send it off to the editor.
And that was that.

The first innovation was that you only had to send the manuscript to the editor as an eMail attachment, which was actually faster and saved a lot of paper. Unfortunately, however, things have changed again since then.

This is how it works today:
  1. Log into the editorial management software of the journal of my choice. If I do not have an account with that journal yet, create one first.
  2. Go to the author interface, click new submission.
  3. Select the type of article.
  4. Paste the title and abstract into an online form.
  5. Select key words or topics that supposedly help the journal to assign editors and/or reviewers. Click 'save and continue'.
  6. Upload main manuscript file, generally as an MS Word document.
  7. Upload all the figures as separate files, generally as TIF or EPS, although JPGs may be acceptable at the review stage. Paste figure legends and write 'link texts' into the form fields.
  8. Upload all the supplementary data files. If necessary, update the order of the files. Click 'save and continue'.
  9. Next, the authorship page. As the corresponding author with an account at that journal I am already in, but I may be asked to link my ORCID. (I have no idea if anybody actually uses it for anything - I only ever look people up with their ResearcherID or Google Scholar.)
  10. Search for my co-authors by name or eMail. I find the second co-author, great. The first and third co-authors aren't in the system, so I create entries for them. Click 'save and continue'.
  11. Error: No telephone number provided for second co-author. But he was in the system, so you accepted him before! Also, will any editor really ever want to use it? Argh. Let's look up his number. Okay, edited. Click 'save and continue'.
  12. Suggesting an editor for the manuscript. Oh dear, that's a long list. Hm. I know this guy hates one of the methods we used, he is out. This one is highly qualified but he will probably require us to add this other analysis that he likes. Ah well, worse things could happen. This one is also very qualified, but she works at a university I have a connection to - is that already a conflict of interest? Well, they can always choose somebody else, done.
  13. Okay, suggesting peer reviewers and providing their contact information. This guy is an obvious choice as he is the expert for one of the analyses we used, but darn, he is currently between institutions. Let's google his name. No, that's outdated. This one too. Ah, I'm lucky: he has an updated CV on this third page I found, complete with the new phone number and eMail address. Okay, now for reviewer suggestion number two. She is another obvious choice as one of the world experts on our study group. Easy to find her information on a staff page, so that's good. Who else? Maybe two more experts on the study group? Ah yes, she would be interested in this, and I have her contact details. And then this other guy from Europe. Google. Darn, nothing, despite the unique name. Perhaps there is contact info on recent papers. No, he is too senior, the corresponding authors are always others. Ah, wait, here? No, an eMail address from 2012 going "director@institution.org" sounds fishy, most likely somebody else is director now. More Google. Ah, finally, was able to click myself through to a staff website, well hidden and not in English. Ye gods. Four qualified reviewers, that should be enough to get going. Click 'save and continue'.
  14. Long, complicated page with miscellaneous information and declarations. First, write or upload cover letter. Done.
  15. Next, declare that we have not submitted this manuscript elsewhere. Okay.
  16. Is this a resubmission? No.
  17. Declare that we have followed protocol so-and-so on ethical collection practices. Yes.
  18. Declare that we have added a section on data availability. Wait, was that in the instructions to authors? Don't remember that. Argh. Save. Open manuscript file. Add data availability section. Back to file upload. Delete manuscript file. Re-upload manuscript file. Reorder files. Click 'save and continue'. Back to declaration. Yes, we now have a section on data availability.
  19. Declare no conflicts of interest. Okay. Click 'save and continue'.
  20. Large summary page. Check everything I entered so far. Down at the bottom: have to check PDF proofs before being allowed to submit. Click button, wait while the editorial manager bundles everything into a PDF.
  21. Open PDF. One of the EPS figures does not display. Argh. Argh. Argh. Back to file upload. Delete offending figure. Re-upload figure - as a TIF this time, that should be foolproof. Reorder files. Click 'save and continue'. Back to summary page.
  22. Re-check everything I entered so far. Click button, wait while the editorial manager generates a new PDF. Looks good this time.
  23. The big moment is there: click here to submit. "Are you certain? This will submit your manuscript." Yes!
Yay, progress?

Saturday, May 12, 2018

What are monotypic genera good for?

There are a lot of monotypic genera around. In the group I am currently working on the most, the daisy family Asteraceae in Australia, there are an awful lot of monotypic genera indeed. Why do we need so many of them?

I would argue that there are two different scenarios to be considered. First, however, we need to keep in mind that:
  1. We should classify organisms by their degree of relatedness, meaning that supraspecific taxa (including genera) should be monophyletic, and
  2. while this previous rule tells us how we should group it does not tell us how we should rank. There is no genusness to be discovered in nature. Whether it is here in the phylogeny where we call a clade a genus or four nodes deeper down the tree is ultimately an arbitrary human decision.
This may at first suggest that there is no good argument to be had against monotypic genera either. If ranking is arbitrary then a classification consisting entirely out of monotypic genera - each species in the tree of life gets its own genus - is just as valid as the current one, so why not?

It is true that this is one of many possible ranking solutions compatible with phylogenetic systematics, but to decide between those many possible ranking solutions we can bring other criteria to bear. And here I would argue that it would be useful to minimise the number of monotypic genera as far as possible. Why? Because I would consider the genus level 'wasted' in many of those cases.

The entire point of a classification is that each taxon provides a piece of information. That information is: The members of this taxon are more closely related to each other than they are to non-members of this taxon. If we have a species, the species-taxon provides this information for all the members of that species. If we now have that species classified in a monotypic genus, the genus-taxon provides... the exact same information over again. It doesn't add anything. It is wasted.

Consequently, I believe that the proper use of monotypic genera is for when they are actually required for phylogenetic classification, but that there is a good argument for sinking them into larger genera whenever things could be made monophyletic without them. Two examples may illustrate the argument.


The above presents a case where the monotypic genus in red is actually needed. There are two genera marked in blue and green, and so obviously the phylogenetically isolated lineage in red cannot be lumped into either of them without making them paraphyletic. It is 'left over' and needs its own genus.


A perfect example for this is the ginkgo tree, Ginkgo biloba, which is a phylogenetically isolated living fossil. It is here photographed as an alley tree in front of our apartment block in Zürich, back when I was a postdoc there.


In the above phylogeny, however, the monotypic genus in red is sister to another genus in blue, and that latter genus isn't very large either. Now I can understand why it might perhaps be desirable to recognise the two as different genera if their divergence happened many tens of millions of years ago and they are morphologically quite distinct. Unfortunately, however, the world is full of monotypic genera that are very young and look exactly like the slightly larger sister genus, but differ from it in a single morphological character.

In those cases, do we really need that kind of taxonomic inflation? What then is the use of the genus rank?


The species that occasioned these ruminations in me is the above Tasmanian daisy tree Centropappus brunonis, which is clearly just a Bedfordia without hairs on the leaves; otherwise the two genera are pretty much indistinguishable. And Bedfordia itself has a mere three species, so it is not as if it would get unmanageably large if they were united.

There are many, many similar cases.

Friday, April 20, 2018

Time-calibrated or at least ultrametric trees with the R package ape: an overview

I had reason today to look into time-calibrating phylogenetic trees again, specifically trees that are so large that Bayesian approaches are not computationally feasible. It turns out that there are more options in the R package APE than I had previously been aware of - but unfortunately they are not all equally useful in everyday phylogenetics.

In all cases we first need a phylogram that we want to time-calibrate or at least make ultrametric to use in downstream analyses that require ultrametricity. As we assume that our phylogeny is very large it may for example have been inferred by RAxML, and the branch lengths are proportional to the probability of character changes having happened along them. For present purposes I have used a smaller tree (actually a clade cut out of a larger tree I had floating around), so that I could do the calibrations quickly and so that the figures of this post look nice. My example phylogram has this shape:


We fire up R, load the ape package, and import our phylogeny with read.tree() or read.nexus(), depending on whether it is in Newick or Nexus format, e.g.
mytree <- read.tree("treefilename.tre")
Now to the various methods.

Penalised Likelihood

I have previously done a longer, dedicated post on this method. I did not, however, go into the various models and options then, so let's cover the basics here.

Penalised Likelihood (PL) is, I think, the most sophisticated approach available in APE, allowing the comparison of likelihood scores between different models. It is also the most flexible. It is possible to set multiple calibration points, as discussed in the linked earlier post, but here we simply set the root age to 50 million years:
mycalibration <- makeChronosCalib(mytree, node="root", age.max=50)
We have three different clock models at our disposal, correlated, discrete, and relaxed. Correlated means that adjacent parts of the phylogeny are not allowed to evolve at rates that are very different. Discrete models different parts of the tree as evolving at different rates. As I understand it, relaxed allows the rates to vary most freely. Another important factor that can be adjusted is the smoothing parameter lambda; I usually run all three clock models at lambdas of 1 and 10 and pick the one with the best likelihood score. For present purposes I will restrict myself to lambda = 1.

Let's start with correlated:
mytimetree <- chronos(mytree, lambda = 1, model = "correlated", calibration = mycalibration, control = chronos.control() )
When plotted, the chronogram looks as follows.


Next, discrete. The command is the same as above except for the text in the model parameter. The branch length distribution and likelihood score turned out to be very close to those for the correlated model:


Finally, relaxed. Very different branch length distribution and a by far worse likelihood score compared to the other two:


I have only considered testing a strict clock model with chronos for the first time today. It turns out that you get it by running it as a special case of the discrete model, which by default is set to assume ten rate categories. You simply set the number of categories to one:
mytimetree <- chronos(mytree, lambda = 1, model = "discrete", calibration = mycalibration, control = chronos.control(nb.rate.cat=1) )
In my example case this looks rather similar to the results from correlated model and discrete with ten categories:


The problem with PL is that is seems to be a bit touchy. Even today we had several cases of an inexplicable error message, and several cases of the analysis being unable to find a reasonable starting solution. We finally found that it helped to vastly increase the root age (we had played around with 15, assuming that it doesn't matter, and it worked when we set it to a more realistic three digit number). It is possible that our true problem was short terminal branches.

PL is also the slowest of the methods presented here. I would use it for trees that are too large for Bayesian time calibration but where I need an actual chronogram with a meaningful time axis and want to do model comparison. If I just want an ultrametric tree the following three methods would be faster and simpler alternatives. That being said, so far I had no use case for them.

A superseded but fast alternative: chronopl()

This really came as a surprise as I believed that the function chronopl() had been removed from ape. I thought I had tried to find it in vain a few years ago, but I saw it in the ape documentation today (albeit with the comment "the new function chronos replaces the present one which is no more maintained") and was then able to use it in my current R installation. I must have confused it with a different function.

chronopl() does not provide a likelihood score as far as I can see, but it seems to be very fast. I quickly ran it with default parameters and lambda = 1, again setting root age to 50:
mytimetree <- chronopl(mytree, lambda = 1, age.min = 50, age.max = NULL, node = "root")
The result looks very similar to what chronos() produced with the (low likelihood) relaxed model:


Various parameters can be changed, but as implied above, if I want to do careful model comparison I would use chronos() anyway.

Mean Path Lengths

The chronoMPL() method time-calibrates the phylogeny with what is called a mean path lengths method. The documentation makes clear that multiple calibration points cannot be used; the idea is to make an ultrametric tree, pick one lineage split for which one has a credible date, and then scale the whole tree so that the split has the right age. Command is simply:
mytimetree <- chronoMPL(mytree)
The problem is, the resulting chronogram often looks like this:


Most of the branch length distribution fits the results for the favoured model in the analysis with chronos(), see above. That's actually great, because chronoMPL() is so much faster! But you will notice some wonky lines in particular in the top right and bottom right corners of this tree graph. Those are negative branch lengths. Did somebody throw the ancestral species into a time machine and set them free a bit before they actually evolved?

Some googling suggests that this happens if the phylogram is very unclocklike, which, unfortunately, is often the case in real life. That limits rather sharply what mean path lengths can be used for.

The compute.brtime() function

Another function that I have now tried out is compute.brtime(). It can do two rather different things.

The first is to transform a tree according to what I understand has be a full set of branching times for all splits in the tree. The use case for that seems to be if you have a tree figure and a table of divergence times in a published paper and want to copy that chronogram for a follow-up analysis, but the authors cannot or won't send it to you. So you manually type out the tree, manually type out a vector of divergence times (knowing which node number is which in the R phylo format!), and then you use this function to get the right branch length distribution. May happen, but presumably not a daily occurrence. What we usually have is a tree for which we want the analysis to infer biologically realistic divergence times that we don't know yet.

The second thing the function can do is to infer an ultrametric tree without any calibration points at all but under the coalescent model. The command is then as follows.
mytimetree <- compute.brtime(mytree, method="coalescent", force.positive=TRUE)
It seems that the problem of ending up with negative branch lengths was, in this case, recognised and solved simply by giving the user the option to tell the function PLEASE DON'T. I assume they are collapsed to zero length (?). My result looked like this:


Note that this is more on the lines of "one possible solution under the coalescent model" instead of "the optimal solution under this here clock model", so that every run will produce a slightly different ultrametric tree. I ran it a few times, and one aspect that did not change was the clustering of nearly all splits close to the present, which I (and PL, see above) would consider biologically unrealistic. Still, we have an ultrametric tree in case we need one in a hurry.

It is well possible that I have still missed other options in APE, but these are the ones I have tried out so far.

Something completely different: non-ultrametric chronograms

Finally, I should mention that there are methods to produce very different time-calibrated trees in palaeontology. The chronograms discussed in this post are all inferred under the assumption that we are dealing with extant lineages, so all branches on the chronogram end flush in the present, and consequently a chronogram is an ultrametric tree. And usually the data that went into inferring the topology was DNA sequence data or similar.

Palaeontologists, however, deal with chronograms where many or all branches end in the past because a lineage went extinct, making their chronograms non-ultrametric and look like phylograms. And usually the data that went into inferring the tree topology was morphological. This is a whole different world for me, and I can only refer to posts like this one and this one which discuss an R package called paleotree.

There also seems to be a function in newer APE versions called node.date() which is introduced with the following justification:
Our software, node.dating, uses a maximum likelihood approach to perform divergence-time analysis. node.dating is written in R v3.30 and is a recent addition to the R package ape v4.0 (Paradis et al., 2004). Previously, ape had the capability to estimate the dates of internal nodes via the chronos function; however, chronos requires ultrametric trees and is thus unable to incorporate information from tips that are sampled at different points in time.
This suggests that the point is the same, to allow chronograms with extinct lineages, but in this case aimed more at molecular data. Their example case are virus sequence data.