Friday, April 20, 2018

Time-calibrated or at least ultrametric trees with the R package ape: an overview

I had reason today to look into time-calibrating phylogenetic trees again, specifically trees that are so large that Bayesian approaches are not computationally feasible. It turns out that there are more options in the R package APE than I had previously been aware of - but unfortunately they are not all equally useful in everyday phylogenetics.

In all cases we first need a phylogram that we want to time-calibrate or at least make ultrametric to use in downstream analyses that require ultrametricity. As we assume that our phylogeny is very large it may for example have been inferred by RAxML, and the branch lengths are proportional to the inferred amount of character change along them (expected substitutions per site). For present purposes I have used a smaller tree (actually a clade cut out of a larger tree I had floating around), so that I could do the calibrations quickly and so that the figures of this post look nice. My example phylogram has this shape:


We fire up R, load the ape package, and import our phylogeny with read.tree() or read.nexus(), depending on whether it is in Newick or Nexus format, e.g.
library(ape)
mytree <- read.tree("treefilename.tre")
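Before calibrating, a quick sanity check of the imported tree may save some trouble later. The following is a minimal sketch; because the tree file used in this post is not available, it substitutes a random 20-tip tree from rtree(), which is purely an assumption for illustration.

```r
library(ape)

# Stand-in for read.tree("treefilename.tre"): a random 20-tip phylogram.
set.seed(1)
mytree <- rtree(20)

is.rooted(mytree)            # the dating functions expect a rooted tree
is.binary(mytree)            # TRUE if the tree is fully dichotomous
is.ultrametric(mytree)       # a phylogram will normally give FALSE here
summary(mytree$edge.length)  # overview of the branch length distribution
```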
Now to the various methods.

Penalised Likelihood

I have previously done a longer, dedicated post on this method. I did not, however, go into the various models and options then, so let's cover the basics here.

Penalised Likelihood (PL) is, I think, the most sophisticated approach available in APE, allowing the comparison of likelihood scores between different models. It is also the most flexible. It is possible to set multiple calibration points, as discussed in the linked earlier post, but here we simply set the root age to 50 million years:
mycalibration <- makeChronosCalib(mytree, node = "root", age.min = 50, age.max = 50)
We have three different clock models at our disposal: correlated, discrete, and relaxed. Correlated means that adjacent parts of the phylogeny are not allowed to evolve at rates that are very different. Discrete models different parts of the tree as evolving at different rates, with each branch assigned to one of a fixed number of rate categories. As I understand it, relaxed allows the rates to vary most freely. Another important factor that can be adjusted is the smoothing parameter lambda; I usually run all three clock models at lambdas of 1 and 10 and pick the one with the best likelihood score. For present purposes I will restrict myself to lambda = 1.
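The routine of running all three clock models at two lambdas can be sketched as a loop. This is only an illustration under several assumptions: the tree is a random stand-in from rtree(), the root is fixed at 50, and the score is read from the "ploglik" attribute that chronos() attaches to its result as far as I can tell. Failed runs are caught with try() and left as NA.

```r
library(ape)

set.seed(42)
mytree <- rtree(15)  # stand-in for the real phylogram (assumption)
mycalibration <- makeChronosCalib(mytree, node = "root",
                                  age.min = 50, age.max = 50)

# Every combination of clock model and smoothing parameter.
results <- expand.grid(model = c("correlated", "discrete", "relaxed"),
                       lambda = c(1, 10), stringsAsFactors = FALSE)
results$ploglik <- NA_real_

for (i in seq_len(nrow(results))) {
  fit <- try(chronos(mytree, lambda = results$lambda[i],
                     model = results$model[i],
                     calibration = mycalibration, quiet = TRUE),
             silent = TRUE)
  if (!inherits(fit, "try-error")) {
    pl <- attr(fit, "ploglik")  # penalised log-likelihood, if present
    if (!is.null(pl)) results$ploglik[i] <- pl
  }
}

# Best-scoring combination first; failed runs (NA) end up last.
results[order(-results$ploglik), ]
```

Note that the penalty term depends on lambda, so scores obtained at different lambdas should be compared with some caution.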

Let's start with correlated:
mytimetree <- chronos(mytree, lambda = 1, model = "correlated", calibration = mycalibration, control = chronos.control() )
When plotted, the chronogram looks as follows.


Next, discrete. The command is the same as above except for the text in the model parameter. The branch length distribution and likelihood score turned out to be very close to those for the correlated model:


Finally, relaxed. Very different branch length distribution and a by far worse likelihood score compared to the other two:


Today was the first time I considered testing a strict clock model with chronos. It turns out that you get it by running it as a special case of the discrete model, which by default is set to assume ten rate categories. You simply set the number of categories to one:
mytimetree <- chronos(mytree, lambda = 1, model = "discrete", calibration = mycalibration, control = chronos.control(nb.rate.cat=1) )
In my example case this looks rather similar to the results from correlated model and discrete with ten categories:


The problem with PL is that it seems to be a bit touchy. Even today we had several cases of an inexplicable error message, and several cases of the analysis being unable to find a reasonable starting solution. We finally found that it helped to vastly increase the root age (we had played around with 15, assuming that it doesn't matter, and it worked when we set it to a more realistic three-digit number). It is possible that our true problem was short terminal branches.
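If short terminal branches are the suspect, they are easy to inspect before running chronos(). A minimal sketch, again using a random stand-in tree; the threshold of 1e-6 is an arbitrary value chosen for illustration, not a recommendation from the ape documentation.

```r
library(ape)

set.seed(7)
mytree <- rtree(20)  # stand-in for the real phylogram (assumption)

# In a phylo object, tips are numbered 1..Ntip, so edges whose child
# node falls in that range are the terminal branches.
terminal <- mytree$edge[, 2] <= Ntip(mytree)

summary(mytree$edge.length[terminal])

# Count terminal branches below an arbitrary, hypothetical threshold.
n_short <- sum(mytree$edge.length[terminal] < 1e-6)
n_short
```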

PL is also the slowest of the methods presented here. I would use it for trees that are too large for Bayesian time calibration but where I need an actual chronogram with a meaningful time axis and want to do model comparison. If I just want an ultrametric tree the following three methods would be faster and simpler alternatives. That being said, so far I have had no use case for them.

A superseded but fast alternative: chronopl()

This really came as a surprise as I believed that the function chronopl() had been removed from ape. I thought I had tried to find it in vain a few years ago, but I saw it in the ape documentation today (albeit with the comment "the new function chronos replaces the present one which is no more maintained") and was then able to use it in my current R installation. I must have confused it with a different function.

chronopl() does not provide a likelihood score as far as I can see, but it seems to be very fast. I quickly ran it with default parameters and lambda = 1, again setting root age to 50:
mytimetree <- chronopl(mytree, lambda = 1, age.min = 50, age.max = NULL, node = "root")
The result looks very similar to what chronos() produced with the (low likelihood) relaxed model:


Various parameters can be changed, but as implied above, if I want to do careful model comparison I would use chronos() anyway.

Mean Path Lengths

The chronoMPL() method time-calibrates the phylogeny with what is called a mean path lengths method. The documentation makes clear that multiple calibration points cannot be used; the idea is to make an ultrametric tree, pick one lineage split for which one has a credible date, and then scale the whole tree so that the split has the right age. The command is simply:
mytimetree <- chronoMPL(mytree)
The problem is, the resulting chronogram often looks like this:


Most of the branch length distribution fits the results for the favoured model in the analysis with chronos(), see above. That's actually great, because chronoMPL() is so much faster! But you will notice some wonky lines in particular in the top right and bottom right corners of this tree graph. Those are negative branch lengths. Did somebody throw the ancestral species into a time machine and set them free a bit before they actually evolved?

Some googling suggests that this happens if the phylogram is very unclocklike, which, unfortunately, is often the case in real life. That limits rather sharply what mean path lengths can be used for.
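Both points, counting negative branch lengths and scaling the tree to a known date, can be sketched in a few lines. The example assumes a random stand-in tree and, for simplicity, that the dated split is the root with an age of 50. Any other node could be used by picking its entry from branching.times().

```r
library(ape)

set.seed(3)
mytree <- rtree(20)  # stand-in for the real phylogram (assumption)

mpltree <- chronoMPL(mytree)

# How many branches came out negative?
n_negative <- sum(mpltree$edge.length < 0)
n_negative

# Rescale all branches so that the root gets the assumed age of 50.
root_node   <- Ntip(mpltree) + 1
current_age <- branching.times(mpltree)[as.character(root_node)]
scaled      <- mpltree
scaled$edge.length <- mpltree$edge.length * (50 / current_age)

branching.times(scaled)[as.character(root_node)]
```

Rescaling multiplies every branch by the same factor, so negative branches stay negative; it does not repair them.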

The compute.brtime() function

Another function that I have now tried out is compute.brtime(). It can do two rather different things.

The first is to transform a tree according to what I understand has to be a full set of branching times for all splits in the tree. The use case for that seems to be if you have a tree figure and a table of divergence times in a published paper and want to copy that chronogram for a follow-up analysis, but the authors cannot or won't send it to you. So you manually type out the tree, manually type out a vector of divergence times (knowing which node number is which in the R phylo format!), and then you use this function to get the right branch length distribution. May happen, but presumably not a daily occurrence. What we usually have is a tree for which we want the analysis to infer biologically realistic divergence times that we don't know yet.

The second thing the function can do is to infer an ultrametric tree without any calibration points at all but under the coalescent model. The command is then as follows.
mytimetree <- compute.brtime(mytree, method="coalescent", force.positive=TRUE)
It seems that the problem of ending up with negative branch lengths was, in this case, recognised and solved simply by giving the user the option to tell the function PLEASE DON'T. I assume they are collapsed to zero length (?). My result looked like this:


Note that this is more along the lines of "one possible solution under the coalescent model" than "the optimal solution under this here clock model", so that every run will produce a slightly different ultrametric tree. I ran it a few times, and one aspect that did not change was the clustering of nearly all splits close to the present, which I (and PL, see above) would consider biologically unrealistic. Still, we have an ultrametric tree in case we need one in a hurry.
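Because the coalescent times are drawn at random, fixing the random seed is one way to make a particular run repeatable. A minimal sketch on a random stand-in tree; the seed values are arbitrary.

```r
library(ape)

set.seed(11)
mytree <- rtree(20)  # stand-in for the real phylogram (assumption)

# Same seed, same simulated branching times.
set.seed(2018)
run1 <- compute.brtime(mytree, method = "coalescent", force.positive = TRUE)
set.seed(2018)
run2 <- compute.brtime(mytree, method = "coalescent", force.positive = TRUE)

# No seed reset: a fresh draw that will generally differ from run1.
run3 <- compute.brtime(mytree, method = "coalescent", force.positive = TRUE)

same_with_seed <- isTRUE(all.equal(run1$edge.length, run2$edge.length))
same_with_seed
is.ultrametric(run1)
min(run1$edge.length)  # check whether any branch is negative
```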

It is well possible that I have still missed other options in APE, but these are the ones I have tried out so far.

Something completely different: non-ultrametric chronograms

Finally, I should mention that there are methods to produce very different time-calibrated trees in palaeontology. The chronograms discussed in this post are all inferred under the assumption that we are dealing with extant lineages, so all branches on the chronogram end flush in the present, and consequently a chronogram is an ultrametric tree. And usually the data that went into inferring the topology was DNA sequence data or similar.

Palaeontologists, however, deal with chronograms where many or all branches end in the past because a lineage went extinct, making their chronograms non-ultrametric and look like phylograms. And usually the data that went into inferring the tree topology was morphological. This is a whole different world for me, and I can only refer to posts like this one and this one which discuss an R package called paleotree.

There also seems to be a function in newer APE versions called node.dating() which is introduced with the following justification:
Our software, node.dating, uses a maximum likelihood approach to perform divergence-time analysis. node.dating is written in R v3.30 and is a recent addition to the R package ape v4.0 (Paradis et al., 2004). Previously, ape had the capability to estimate the dates of internal nodes via the chronos function; however, chronos requires ultrametric trees and is thus unable to incorporate information from tips that are sampled at different points in time.
This suggests that the point is the same, to allow chronograms with extinct lineages, but in this case aimed more at molecular data. Their example case is virus sequence data.

Friday, April 13, 2018

Monophyletic species, kind of

A paper by bryologist Brent Mishler and philosopher of biology John Wilkins has just come out, with the title The Hunting of the SNaRC: A Snarky Solution to the Species Problem. It is open access in the journal Philosophy, Theory, and Practice in Biology, so anybody with internet access can check it out.

Many bloggers have issues that they return to again and again even if they are not necessarily the nominal topics of their blogs - for example, Jerry Coyne frequently posts about Free Will and about students trying to shut down talks by speakers they don't like, and Larry Moran regularly takes apart papers claiming that junk DNA has been disproved. This much less widely known blogger can reliably be coaxed out from behind the oven by at least two such recurring issues: bad arguments for the acceptance of paraphyletic taxa, and the in my eyes incoherent concept of "monophyletic species".

As the title indicates, Mishler & Wilkins present a solution for the species problem, i.e. the perennial question in biology of what 'a species' even is. Especially as the paper is freely accessible it would serve no purpose to summarise its introduction, so I will move immediately to what I find most interesting: their views on how to view species and some pointers on how to do classification at the lowest levels in practice.

Note that I say "their views", plural, deliberately, because this is one aspect of the paper that I have not quite understood yet:

Wilkins has argued in the past that the popular approach of developing a theoretical species concept and then applying it to a potentially recalcitrant reality is a dead end. What biologists should do is the opposite, i.e. consider species as empirical phenomena in need of individual explanations. And here in this paper, Wilkins' argument is reiterated concisely in section 3, A Way Forward: Species Are at Least Initially Phenomena.

What I like about this flip in perspective is that it allows much more flexibility; obviously the empirical phenomena that we generally identify as species, be it popularly or as biologists - generally gaps in morphological or genetic variation - need a different scientific explanation for example in asexual than in sexual species, making one-size-fits-all species concepts difficult to apply.

Mishler, in turn, has argued in the past that species are not a special biological category different from e.g. monophyletic genera and families. The species category is arbitrary, and we should just classify all organisms into nested monophyletic groups, AKA clades, all the way down to the individual specimens. And here in this paper, Mishler's argument is reiterated in sections 4, Rankless Taxonomy, 5, Capturing the SNaRC, and 6, Using SNaRCs in Systematic, Evolutionary, and Ecological Studies.

The thing is, while there is perhaps technically no direct contradiction between those two arguments to the degree that there is a contradiction between "all taxa should be monophyletic" and "taxa should be allowed to be paraphyletic", they appear to be two rather different prescriptions. If I understand correctly, the first says,
  • We should treat species as empirical phenomena in need of explanation instead of indiscriminately applying a given theoretical concept to them.
The second says,
  • It makes no sense to even talk of species, we should stop doing so, and here is a single theoretical concept (everything is clades) that we should indiscriminately apply to all specimens.
In fact I am currently unable to see how sections 4-6 and the conclusions of this paper would have to change if section 3 were to be deleted in its entirety. What am I missing?

What I found most useful about this paper was that it has some thoughts on how to do classification into nested clades all the way down to the individual specimens in practice, because that was completely unclear to me in all past instances when this approach was suggested. There are some apparent problems with it, particularly that we need items forming a tree structure to even have clades. It is sometimes difficult to illustrate the issue, but it can perhaps be presented as follows:
  1. The prescription is, as mentioned above, that a classification should be clades (= monophyletic groups) all the way down to individual specimens.
  2. A clade is a complete branch in a tree structure, and usually understood to be specifically a complete branch of a species phylogeny.
  3. In other words, the way the term clade is defined, it applies only in a tree-structure but is inapplicable in a net-like structure.
  4. Sexually reproducing species are systems consisting of individual specimens that have net-like relationships with each other, because they share numerous ancestors instead of one ancestor in each sufficiently earlier generation.
  5. It follows necessarily from the previous two points that the term clade cannot be applied to describe the relationship between specimens if what we are looking at includes multiple specimens from the same sexually reproducing species.
  6. It follows then that it is logically impossible to classify into clades all the way down to these specimens, unless the meaning of the word clade is changed to a degree that the whole purpose of having that word is defeated.
To my understanding this is why Hennig spent so much time discussing the different ways that specimens (or snapshots of them, which he called semaphoronts) can be related to each other. The relationship between four (non-hybridogenic) species is tree-like, so they can, and should, be classified into clades. But relationships between individuals within a sexually reproducing species are net-like, so they cannot possibly be classified into clades, as the word does not even have a meaning in that structure.

The point at which approaches to classification change is approximately at the species level. Phylogenetic systematics applies only above it, and it uses species as the units that it groups into clades, because if it used any smaller units there would not be clades. This is also why in my opinion one cannot coherently reject the reality of species and be a phylogenetic systematist and, conversely, coherently accept the reality of species and promote paraphyletic taxa, because clades are species that have diversified. Many others, of course, disagree.

Now, what is the practical approach suggested by the present paper? It argues that the terminal units of classification should be "the finest-scale clades that can be convincingly demonstrated with current data", here called Smallest Named and Registered Clades (SNaRCs). Obviously such a 'clade' cannot be based on information from a single gene, as it may show a different history than other genes, for example because of introgression or incomplete lineage sorting. The solution is to use as evidence for monophyly "the preponderance of gene lineages making up a clade", or in other words "congruence among the majority of gene trees and other types of phylogenetic characters available".

On the plus side, this is a very empirical and testable prescription. But consider two thought experiments. First, take three samples A, B and C, look at, say, 100 gene trees, and if 51 of them show ((A,B),C) then A and B form a 'clade', even if all three of them are members of the same sexually reproducing species. Again, that is doable, empirical and testable, and we get a clear answer.

Nonetheless this approach does not convince me at the moment, nor will it even if we assume a scenario of 100 gene trees supporting (A,B), simply because no matter what the gene trees say, in reality there is no tree-structure inside the species. Yes, we can easily sequence for example the DNA of three siblings and run an analysis that will produce a phylogenetic tree for each gene, but in reality these three people just don't have a tree-relationship with each other, so it does not make sense to me to use terminology or a classification that implies there is one.

For the second thought experiment, take three samples D, E, and F, and if 33 gene trees say ((D,E),F), 33 say (D,(E,F)), and 34 say (E,(D,F)), we are inside a SNaRC and should not delimit any more narrowly, even if D is a specimen from an arid zone ephemeral, E from an alpine perennial, and F from a narrow endemic of the northwestern Blue Mountains that only occurs on ironstone-sandstone outcrops, and all three of them are geographically isolated from each other.
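The two thought experiments can be put into a small tally. This is only my reading of "the majority of gene trees" as more than half of them; the paper does not give an algorithm, so the function below is a hypothetical illustration. In the first experiment only the count of 51 comes from the text, and the split of the remaining 49 trees is made up.

```r
# Counts of gene trees per topology.
experiment1 <- c("((A,B),C)" = 51, "(A,(B,C))" = 25, "(B,(A,C))" = 24)
experiment2 <- c("((D,E),F)" = 33, "(D,(E,F))" = 33, "(E,(D,F))" = 34)

# Hypothetical rule: a topology wins only if more than half of all
# gene trees show it; otherwise there is no supported grouping (NA).
majority_topology <- function(counts) {
  share <- counts / sum(counts)
  if (max(share) > 0.5) names(counts)[which.max(share)] else NA_character_
}

majority_topology(experiment1)  # A and B would form a 'clade'
majority_topology(experiment2)  # no majority: we stay inside one SNaRC
```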

This hypothetical case has three very distinct entities that show a lot of gene tree discordance for the genes we used for our analysis. This is a much weaker problem than the previous one because Mishler & Wilkins argue that SNaRCs are, like all scientific hypotheses, tentative and await revision after the examination of more data. Maybe the next 100 gene trees will clinch it for (D,(E,F)), and then at least we could separate out D; more realistically, sampling more individuals of all three species will presumably resolve the three species as three SNaRCs, even if we cannot figure out the relationship of those three SNaRCs with each other (they may even form a true polytomy, and that's fine).

Still it bothers me that in a situation where we unfortunately have only one sample per species available for analysis the approach promoted in the present paper might lead to the tentative lumping of clearly distinct entities. And unless something is added to the approach, or unless I am missing something, it would have to, because it does not seem to include a way of recognising single-specimen SNaRCs except in the case of one being left alone as sister to another SNaRC that, in turn, would still consist of two potentially vastly different specimens. But maybe I am taking this too literally.

On top of that there is perhaps another methodological issue, or again maybe just something I don't understand. It seems to me as if "majority vote of the gene trees" is not actually how multi-locus phylogenetic analyses generally work. To the best of my understanding they reconcile gene trees in rather more complex ways, even in the case of such a simple approach as Gene Tree Parsimony, let alone the multi-gene coalescent model. Many of these approaches actually presuppose the existence of species or populations, and for the same reason as I argued above: what happens within a sexually reproducing lineage is rather different from what happens between such lineages.

More than anything what I find uncomfortable about the approach presented here is that it seems to care not so much about the actual patterns of common descent of what it classifies as about character or gene tree distribution. The difference may come across as subtle, admittedly. What I am trying to say is that I believe phylogenetic systematics should be about classifying organisms by relatedness, by exclusivity of common descent.

I do not, for example, care very much about the fact that most of the ancestral chloroplast genome has been moved over into the nucleus of the host cell, because the chloroplasts are directly descended in an unbroken line from the first cyanobacterium that colonised a plant cell, and the plant species we have today are descended in an unbroken line from that plant cell. To me chloroplasts are a subclade of cyanobacteria and plants are a subclade of eucaryotes, all regardless of what happened to the individual genes.

To use an example from within a species, I have mentioned in the past that it is possible, although statistically unlikely, that I have inherited no genetic material whatsoever from my maternal grandfather, if it just so happened that all the chromosomes my mother gave me were those she got from her mother (the Y chromosome is of course always from the paternal grandfather, by necessity). But even if that were the case we would nonetheless consider it to be an important piece of information that I descended from my maternal grandfather, and I would nonetheless not exist without his involvement. So yes, we use the genes to infer common descent, but the point is really the common descent itself, and the genes are just a data source that can potentially mislead us. Sometimes the right answer may be (A,(B,C)) even if most genes say ((A,B),C).
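How unlikely is "statistically unlikely" here? A back-of-the-envelope sketch, assuming that each of the 23 chromosomes received from the mother is passed on whole and independently, with an even chance of coming from either maternal grandparent. Recombination is ignored, which makes the real probability smaller still.

```r
n_chromosomes <- 23  # chromosomes inherited from the mother

# Probability that every one of them traces back to the maternal
# grandmother, i.e. none to the maternal grandfather.
p_none <- 0.5^n_chromosomes
p_none

# The same figure expressed as "one in ...".
one_in <- 1 / p_none
one_in
```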

The "majority vote of the gene trees" approach, however, feels as if its practical concern starts and ends at the pattern shown by the genes, regardless of what the patterns of descent are. To me that feels the wrong way around.

Another way of looking at the issue may be this: If we truly accept the argument made in section 3, that we should look at natural phenomena, consider them to be explananda, and find the most appropriate scientific explanation for each of them, would the logical result not be Hennig's original approach? The phenomenon that a beetle specimen shares more traits with a bee specimen than either shares with a slug specimen has an explanation, and that is that the former two share a much more recent common ancestor from which they inherited the shared traits. We express that reality by grouping the former two into a taxon called 'insects' while leaving the slug out.

The fact that I may easily in some cases share more genetic similarity with somebody born in Italy than with another northern German, however, would most likely be due to the stochastic nature of allele inheritance inside our sexually reproducing species. There is no clade wherein two specimens of humanity - the hypothetical Italian and I - share one and only one most recent common ancestor. Instead, beyond some point in the past we share thousands of ancestral 'specimens' in each generation. Because this is a different biological phenomenon than ((beetle,bee),slug), we need a different approach to classification at that level.

Wednesday, April 11, 2018

Botany picture #257: Gentianella aspera


Has it been that long since I posted the last botany picture? With my mind still on the mountains, here is a European gentian, Gentianella aspera (Gentianaceae), European Alps, 2004. Although sometimes split off into their own genus, the Australian gentians are phylogenetically also Gentianella.

One thing that I found strange about the Australian ones, by the way, is that they are generally white, because the European gentians are rather famously blue, violet, or very rarely yellow. There is even an obnoxious German Schlager song making that point, with the first line of the chorus translating as "blue, blue, blue blooms the gentian".

WARNING: follow that link at your own risk.

Tuesday, April 10, 2018

Sam Harris and Ezra Klein on intelligence and race

Recently atheist activist Sam Harris and journalist Ezra Klein had a discussion about intelligence and race. The background is that Harris had Charles Murray, the author of The Bell Curve, as a guest on his podcast, Klein's Vox site published an article critical of that interview, and Harris felt that that article was unfair.

Having read through the transcript of Harris' and Klein's conversation, I must say that it went reasonably well, considering the topic. Harris' discussion with Noam Chomsky, for example, was much worse, as Chomsky's first argument went completely over Harris' head, and they just went in circles from that moment on.

The frustrating thing is that at the bottom of what Harris is trying to argue there are quite a few ideas that are valid. Yes, scientific results should be accepted for what they are instead of being pushed aside for fear of being politically incorrect. But his otherwise reasonable points are completely overshadowed by his tendency to make it all about how mean his critics are to him for calling him biased. He seems unable to see that doing so, while bracketing out how this discussion fits into its historical and political context in the United States, is his own unacknowledged bias at work.

What is in my eyes particularly ironic, however, is that while Harris makes it all about how unfair his critics are, he argues at the same time that the science should be the focus. So I tried to have an eye on how the scientific evidence was discussed, and as far as I can tell it seemed to go as follows:

Klein sometimes brings up evidence that shows that intelligence (as measured by IQ or similar tests, which is another whole can of worms) is strongly influenced by the environmental conditions under which somebody grows up, e.g. when children from disadvantaged backgrounds are adopted by affluent families, and cites, by name, relevant scientists who argue that at the very least there is at this moment no evidence yet for any significant genetically determined IQ difference between groups. (And I have no idea where such evidence could even potentially come from, unless there is behind this the usual misunderstanding of what heritability means.) Harris never addresses those arguments, as far as I can tell. His counter-arguments appear to be:

(1) "genes are involved for basically every[thing]". But that is so trivially true as to be meaningless. Genes are involved in the development of fingers, yet there are no differences in the number of fingers between different populations. And even if we are talking about traits that vary, it gets us nowhere, because it doesn't necessarily follow that the genes determine more than, say, 5% of the variation. And even if intelligence is strongly heritable it says nothing about significant differences between groups either, as he readily admits that variation within is much stronger than between.

(2) Then there is Harris' sports example, where he says that West Africans dominate certain running sports. He argues "if you have populations that have their means slightly different genetically, 80 percent of a standard deviation difference, you’re going to see massive difference in the tail ends of the distribution, where you could have 100-fold difference in the numbers of individuals who excel at the 99.99 percent level". Now I get that this might be a valid argument to explain the underrepresentation of a group with a hypothetically slightly lower mean at excelling at the >99.9% level under the Utopian assumption of complete equality of opportunity, but then we would be talking about Fields Medal winners or Nobel laureates. As an explanation for lower societal achievement on average, i.e. why members of a group are vastly overrepresented in prisons and have vastly lower household wealth than the majority, it is a non-starter and thus irrelevant to the discussion from the get-go.
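The purely statistical part of the quoted claim, that a shift in means is amplified in the tails, can be checked numerically. The sketch assumes two normal distributions with equal variance whose means differ by 0.8 standard deviations, and it takes "the 99.99 percent level" to mean the 99.99th percentile of the distribution with the higher mean. Both are my assumptions, not something stated in the conversation.

```r
shift  <- 0.8            # difference in means, in standard deviations
cutoff <- qnorm(0.9999)  # 99.99th percentile of the reference distribution

# Share of each distribution lying above the cutoff.
tail_reference <- pnorm(cutoff, mean = 0,      lower.tail = FALSE)
tail_shifted   <- pnorm(cutoff, mean = -shift, lower.tail = FALSE)

# Fold difference in the share of individuals above the cutoff.
fold_difference <- tail_reference / tail_shifted
fold_difference
```

The calculation says nothing about where a difference in means would come from; it only shows how tail proportions behave under these assumptions.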

(3) Harris cites unnamed scientists who, he says, do not want to have their names published because of fear of being called racist, but who are said to agree with him. Not knowing who they are one is, of course, unable to confirm what they said or meant as well as to assess their qualifications, their potential agendas and biases, and if they are even from a relevant field of research. (Note that according to Wikipedia Charles Murray, with whom that whole discussion started, is a "political scientist, author, and columnist" working for a conservative think tank. That is, he is not an expert in the areas of population genetics, human cognitive development, comparative assessments, or any other field of relevance.)

I find that a bit disappointing. For all Harris' claims that the science is clearly on Charles Murray's side, it rather looks to me as if his argumentation runs simply as follows: There are differences in IQ between groups, and these differences must obviously have a genetic component, because everything has a genetic component. And that's it, at least as far as one can tell from the conversation with Klein.

Monday, April 2, 2018

How problematic is the jump dispersal parameter in ancestral area inference?

I recently read an article in the Journal of Biogeography titled "Conceptual and statistical problems with the DEC + J model of founder-event speciation and its comparison with DEC via model selection". Its authors are Richard Ree, the developer of the original DEC model, and Isabel Sanmartin.

The main problem with discussing the paper here is that it would probably take 5,000 words to properly explain what it is even about. I will try to provide the most superficial introduction to the topic and otherwise assume that of the few people who will read this blog most are at least somewhat familiar with it.

The area of research this is about is the estimation or inference of ancestral areas and biogeographic events. Say we have a number of related species, the phylogeny showing how they are related, a number of geographic areas in which each species is either present or absent, and at least one model of biogeographic history. For the purposes of what I will subsequently call ancestral area inference (AAI) we assume that we know the species are well-defined and that the phylogeny is as close to true as we can infer at the time, so that they will simply be accepted as given. How to objectively define biogeographic areas for the study group is another big question, but again we take it as given that that has been done.

The idea of AAI is to take these pieces of information and infer what distribution ranges the ancestral species at each node of the phylogeny had, and what biogeographic events took place along the phylogeny to lead to the present patterns of distribution. What model of biogeographic events we accept matters a lot, of course. Imagine the following simple scenario of three species and three areas, with sister species occurring in areas A and B, respectively, and their more distant relative occurring in both areas B and C:



Assuming, for example, that our model of biogeographic history favours vicariant speciation and range expansions, we may consider the scenario on the left to be a very probable explanation of how we ended up with those patterns of distribution. First the ancestral species of the whole clade occurred in all areas, and vicariant speciation split it into a species in area A and one in areas B and C. The former expanded to occur in both A and B and then underwent another vicariant speciation event, done.

If we have reason to assume that this is unlikely, for example because area A is an oceanic island, we may favour a different model. In the right-hand scenario we see the ancestral species occurring in areas B and C and producing one of its daughter species via subset sympatry in area B. At least one seed or pregnant female of that new lineage is then dispersed to island A. An event such as this last one, where dispersal leads to instant genetic isolation and consequent speciation, is in this context often called 'jump dispersal' or, as in the title of the paper, 'founder-event speciation', to differentiate it from the much slower process of gradual range expansion followed by vicariant or sympatric speciation*.

I am not saying that either of these scenarios is the best one to explain how the hypothetical three species evolved and dispersed. In fact I would say that three species are too small a dataset to estimate biogeographic history with any degree of confidence, but it provides an idea of what ancestral area inference is about.

Perhaps the best established approaches to AAI are Dispersal and Vicariance Analysis (DIVA) and the Dispersal, Extinction and Cladogenesis model (DEC). The former was originally implemented as a parsimony analysis in software of the same name, and it has a tendency to favour vicariance, as the name suggests. Likelihood analysis under the DEC model became popular in its implementation in the software Lagrange, and in my limited experience and to the best of my understanding it is designed to have daughter species inherit part of the range of the ancestor, often leading to subset sympatry. And there are other approaches, of course.

As the result of his PhD project, Nick Matzke introduced the following two big innovations in AAI: First, the addition of a parameter j, for jump dispersal, to existing models. This allows the kind of instantaneous speciation after dispersal to a new area that I described above, and which can be assumed to be particularly important in island systems. Second, the idea that the most appropriate model for a study group should be chosen through statistical model selection, as in other areas of evolutionary biology. He created the R package BioGeoBEARS to allow such model selection. It originally implemented likelihood versions of DIVA, DEC and a third model called BayArea, all assuming the operation of slightly different sets of biogeographic processes. Each of them can be tested with and without the j parameter and, after another update, with or without an x parameter for distance-dependent dispersal.
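To make the idea of statistical model selection a bit more concrete, here is a minimal sketch of how two fitted models could be ranked by the Akaike Information Criterion, one common way of doing such comparisons. It is written in Python purely for illustration and is not BioGeoBEARS code; the log-likelihoods are invented, and the parameter counts (two free parameters for DEC, three for DEC + j) are my assumption for this example.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def akaike_weights(aic_values):
    """Convert a dict of AIC scores into relative model weights that sum to 1."""
    best = min(aic_values.values())
    relative = {name: math.exp(-0.5 * (score - best))
                for name, score in aic_values.items()}
    total = sum(relative.values())
    return {name: value / total for name, value in relative.items()}

# Hypothetical fits: (log-likelihood, number of free parameters).
# These numbers are made up for illustration, not results from any dataset.
fits = {
    "DEC": (-50.0, 2),
    "DEC+j": (-45.0, 3),
}

aic_values = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
weights = akaike_weights(aic_values)
best_model = min(aic_values, key=aic_values.get)

for name in fits:
    print(name, "AIC =", aic_values[name], "weight =", round(weights[name], 3))
print("Preferred model:", best_model)
```

The extra parameter is penalised, so the + j variant is only preferred if it improves the likelihood by more than that penalty.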

Now I come finally (!) to Ree & Sanmartin. Their eight-page paper, as the title implies, is a criticism of these two innovations. What do they argue? I hope I am summarising this faithfully, but in my eyes their three core points are as follows:
  • A biogeographic model in which events happen at the nodes of the tree rather than along the branches, as is the case with jump dispersal, is not a proper evolutionary model because such events are then "not modeled as time-dependent". In other words, only events that have a per-time-unit probability of occurring along a branch are appropriate.
  • Under certain conditions the most probable explanation provided by a model including the j parameter is that all biogeographic events were jump dispersals. The j parameter gets maximised and explains everything by itself. They call this scenario "degenerate", because the "true" model must "surely" include time-dependent processes.
  • DEC and DEC + j (and, I assume, by extension any other model and its + j variant) cannot be compared in the sense of model selection.

I must, of course, admit that model development is not my area. Consequently I am happy to defer regarding points one and three to others who have more expertise, and who will certainly have something to say about this at some point. I can only at this moment state that these claims do not immediately convince me. Surely models with very different parameters are often compared statistically with each other?

Is it not possible that the best model to explain an evolutionary process may sometimes indeed have a parameter that is not time-dependent but dependent on lineage splits? In the present case, if it is a fact that jump dispersal caused a lineage split, then both events quite simply happened instantaneously (at the relevant time scale of millions of years); in a sense, they were the same event, as the dispersal itself interrupted gene flow.

Perhaps more importantly, however, I am not at all convinced by the second point. Generally I am more interested in practical and pragmatic considerations than theory of statistics and philosophy. In phylogenetics, for example, I am less impressed by the claim that parsimony is supposedly not statistically consistent than by a comparison of the results produced by parsimony and likelihood analysis of DNA sequence datasets. Do they make sense? What can mislead an analysis? What software is available? How computationally feasible is what would otherwise be the best approach, and can it deal with missing data?

So in the present case I would also like to consider the practical side. Is the problem of j being maximised, so that everything is explained by jump dispersal, at all likely to occur in empirical datasets? In the paper Ree and Sanmartin illustrate it with a two-species, two-area example. That is clearly not a realistic empirical dataset, as it is much too small for proper analysis. But if we understand to some degree how the various model parameters work we can deduce under what circumstances j is likely to be maximised.

Unless I am mistaken, the circumstances appear to be as follows: We need a dataset in which all species are local endemics, i.e. all are restricted to a single area, and in which sister species never share part of their ranges. This is because other patterns cannot be explained by jump dispersal. If a species occupies two or more areas, it would have had to expand its range, so the analysis cannot reduce the d parameter for range expansion to zero. If sister species share part of their ranges, likewise; if they share the same single area, they must have diverged sympatrically, which again is not speciation through jump dispersal.
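The two conditions can be written down as a simple check on a dataset. The following is only a sketch of my reasoning above, in Python for illustration: the nested-tuple tree format, the function names and the example species are all hypothetical, and the check looks only at tip ranges and sister tips, it does not fit any biogeographic model.

```python
def tips_and_sister_pairs(tree):
    """Return (tip names, pairs of sister tips) for a tree given as nested tuples.

    A tip is a string; an internal node is a tuple of its children.
    """
    if isinstance(tree, str):
        return [tree], []
    tips, pairs = [], []
    for child in tree:
        child_tips, child_pairs = tips_and_sister_pairs(child)
        tips.extend(child_tips)
        pairs.extend(child_pairs)
    # Two tips are sister species if they are the only children of this node.
    if len(tree) == 2 and all(isinstance(child, str) for child in tree):
        pairs.append((tree[0], tree[1]))
    return tips, pairs

def all_jump_conceivable(tree, ranges):
    """True if every species is a single-area endemic and no sister pair shares an area."""
    tips, pairs = tips_and_sister_pairs(tree)
    all_endemic = all(len(ranges[tip]) == 1 for tip in tips)
    no_shared_area = all(not (ranges[a] & ranges[b]) for a, b in pairs)
    return all_endemic and no_shared_area

# Hypothetical three-species tree: sp1 and sp2 are sisters, sp3 is their relative.
tree = (("sp1", "sp2"), "sp3")

# The example from earlier in this post: one species is widespread (B and C).
mainland = {"sp1": {"A"}, "sp2": {"B"}, "sp3": {"B", "C"}}
# An island-like dataset: every species restricted to its own area.
islands = {"sp1": {"A"}, "sp2": {"B"}, "sp3": {"C"}}
# Sister species sharing the same single area.
sympatric = {"sp1": {"A"}, "sp2": {"A"}, "sp3": {"B"}}

print(all_jump_conceivable(tree, mainland))   # False: sp3 occupies two areas
print(all_jump_conceivable(tree, islands))    # True
print(all_jump_conceivable(tree, sympatric))  # False: sisters share area A
```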

This raises the question, how likely are we to find datasets in which these two conditions apply? In my admittedly limited experience such datasets do not appear to be very common. If we are dealing, for example, with a small to medium-sized genus on one continent, we will generally find partly overlapping ranges, and often at least one very widespread species. The j parameter will not be maximised. If we are doing a global analysis of a large clade, we will need rather large areas (because if you use too many small areas the problem becomes computationally intractable). This means, among other things, that entire subclades will share the same single-area range, and j will not be maximised.
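Why too many areas become intractable is easy to see by counting the ranges a species could in principle occupy. The sketch below, in Python for illustration, assumes that every non-empty combination of areas is an allowed range; actual software may also count an empty range or cap the maximum range size, so the exact numbers are an assumption.

```python
from itertools import combinations

def possible_ranges(areas):
    """Enumerate every non-empty combination of the given areas."""
    return [frozenset(combo)
            for size in range(1, len(areas) + 1)
            for combo in combinations(areas, size)]

def n_ranges(n_areas):
    """Number of non-empty ranges for n areas: 2^n - 1."""
    return 2 ** n_areas - 1

# The three areas from the example earlier in this post.
for r in possible_ranges(["A", "B", "C"]):
    print("".join(sorted(r)))

# The number of ranges doubles (roughly) with every area that is added.
for n in (3, 5, 10, 20):
    print(n, "areas:", n_ranges(n), "possible ranges")
```

Because a likelihood analysis has to consider transitions between every pair of ranges, the work grows even faster than the number of ranges itself.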

In other words, the problem of 'all-jump dispersal' solutions appears to be rather theoretical. But what if we actually do have such a dataset? Is it not a problem then? To me the next question is under what circumstances such a situation would arise. Again, we have all species restricted to single areas, meaning that they apparently find it hard to expand their ranges across two areas. Why? Perhaps geographic separation to the degree that they rarely disperse? Geographic separation to the degree that when they disperse gene flow is interrupted, leading to immediate speciation? Again, we never have sister species sharing an area. Why? A good explanation would be that each area is too small for sympatric speciation to be possible.

Now what does that dataset sound like? To me it sounds like an archipelago of small islands, or perhaps a metaphorical island system such as isolated mountain top habitats. The exact scenario, in other words, in which all-jump dispersal seems like a very probable explanation. Because your ancestral island is too small for speciation, the only way to speciate is to jump to another island, and if you jump to another island you are immediately so isolated from your ancestral population that you speciate.

Again, I am not a modeler, and I have not run careful simulation experiments before writing this, but based on this thought experiment it seems to me as if the + j models would work just as they should: j would not be maximised under circumstances where the other processes are needed to explain present ranges, but it would be maximised under precisely those extremely rare circumstances where 'all jump dispersal' is the only realistic explanation.

Footnote

*) Sympatric meaning here at the scale of the areas defined for the analysis. If one of the areas in the analysis is all of North America, for example, it is likely that the 'sympatric' events inside that area would in truth mostly have been allopatric, parapatric or peripatric at a smaller spatial scale.

Sunday, April 1, 2018

Weekend in the mountains

We just came back from a nice weekend in the Australian 'Alps', making use of what may have been the last period of nicely warm weather. It was still rather cold camping during the night; it is definitely not summer anymore.


Turns out we may finally have to buy a new tent. On the plus side, the belt of the plant press served well to keep the tent pole in shape; not perfectly, but sufficiently to give us just enough structural integrity for two nights.


The main attraction this time was Yarrangobilly Caves. The last time we passed by it was too late in the day, so we were unable to visit them. This time we bought passes for two of the caves.


Although probably weird looking enough, this photo does not do reality justice - the entrance area to South Glory Cave is massive and awe-inspiring.


We camped at our favourite spot in the area, Three Mile Dam. I have posted photos of the lake before, but here is one showing the moon reflected in the water during the night.


Morning mist above the camp site penetrated by the first rays of the sun.


And to conclude, something botanical: Golden everlasting daisies (Xerochrysum subundulatum, Asteraceae) fruiting on the Old Kiandra Gold Fields.