Monday, April 2, 2018

How problematic is the jump dispersal parameter in ancestral area inference?

I recently read an article in the Journal of Biogeography titled "Conceptual and statistical problems with the DEC + J model of founder-event speciation and its comparison with DEC via model selection". Its authors are Richard Ree, the developer of the original DEC model, and Isabel Sanmartin.

The main problem with discussing the paper here is that it would probably take 5,000 words to properly explain what it is even about. I will try to provide the most superficial introduction to the topic and otherwise assume that of the few people who will read this blog most are at least somewhat familiar with it.

The area of research this is about is the estimation or inference of ancestral areas and biogeographic events. Say we have a number of related species, the phylogeny showing how they are related, a number of geographic areas in which each species is either present or absent, and at least one model of biogeographic history. For the purposes of what I will subsequently call ancestral area inference (AAI) we assume that we know the species are well-defined and that the phylogeny is as close to true as we can infer at the time, so that they will simply be accepted as given. How to objectively define biogeographic areas for the study group is another big question, but again we take it as given that that has been done.

The idea of AAI is to take these pieces of information and infer what distribution ranges the ancestral species at each node of the phylogeny had, and what biogeographic events took place along the phylogeny to lead to the present patterns of distribution. What model of biogeographic events we accept matters a lot, of course. Imagine the following simple scenario of three species and three areas, with sister species occurring in areas A and B, respectively, and their more distant relative occurring in both areas B and C:



Assuming, for example, that our model of biogeographic history favours vicariant speciation and range expansions, we may consider the scenario on the left to be a very probable explanation of how we ended up with those patterns of distribution. First the ancestral species of the whole clade occurred in all areas, and vicariant speciation split it into a species in area A and one in areas B and C. The former then expanded to occur in both A and B and underwent another vicariant speciation event, done.

If we have reason to assume that this is unlikely, for example because area A is an oceanic island, we may favour a different model. In the right-hand scenario we see the ancestral species occurring in areas B and C and producing one of its daughter species via subset sympatry in area B. At least one seed or pregnant female of that new lineage is then dispersed to island A. An event such as this last one, where dispersal leads to instant genetic isolation and consequent speciation, is in this context often called 'jump dispersal' or, as in the title of the paper, 'founder-event speciation', to differentiate it from the much slower process of gradual range expansion followed by vicariant or sympatric speciation*.

I am not saying that either of these scenarios is the best one to explain how the hypothetical three species evolved and dispersed. In fact I would say that three species are too small a dataset to estimate biogeographic history with any degree of confidence, but it provides an idea of what ancestral area inference is about.

Perhaps the best-established approaches to AAI are Dispersal and Vicariance Analysis (DIVA) and the Dispersal, Extinction and Cladogenesis model (DEC). The former was originally implemented as a parsimony analysis in the software of the same name, and it has a tendency to favour vicariance, as the name suggests. Likelihood analysis under the DEC model became popular through its implementation in the software Lagrange, and in my limited experience and to the best of my understanding it is designed to have daughter species inherit part of the range of the ancestor, often leading to subset sympatry. And there are other approaches, of course.

As the result of his PhD project, Nick Matzke introduced the following two big innovations in AAI: First, the addition of a parameter j, for jump dispersal, to existing models. This allows the kind of instantaneous speciation after dispersal to a new area that I described above, and which can be assumed to be particularly important in island systems. Second, the idea that the most appropriate model for a study group should be chosen through statistical model selection, as in other areas of evolutionary biology. He created the R package BioGeoBEARS to allow such model selection. It originally implemented likelihood versions of DIVA, DEC and a third model called BayArea, all assuming the operation of slightly different sets of biogeographic processes. Each of them can be tested with and without the j parameter and, after another update, with or without an x parameter for distance-dependent dispersal.
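To make the idea of model selection a bit more concrete, here is a minimal Python sketch of how two fitted models could be ranked by the Akaike information criterion (AIC). This is not BioGeoBEARS code; the function names are my own, and the log-likelihood values are invented purely for illustration. The only thing taken as given is that DEC has two free parameters (d and e) and DEC + j has three (d, e and j).

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion, 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def akaike_weights(aic_values):
    """Turn a dict of AIC scores into relative model weights summing to 1."""
    best = min(aic_values.values())
    rel = {name: math.exp(-0.5 * (v - best)) for name, v in aic_values.items()}
    total = sum(rel.values())
    return {name: r / total for name, r in rel.items()}


# Hypothetical log-likelihoods, invented for illustration only.
fits = {
    "DEC": {"lnL": -40.0, "k": 2},    # free parameters: d, e
    "DEC+J": {"lnL": -35.0, "k": 3},  # free parameters: d, e, j
}

scores = {name: aic(f["lnL"], f["k"]) for name, f in fits.items()}
weights = akaike_weights(scores)

for name in fits:
    print(name, "AIC =", scores[name], "weight =", round(weights[name], 3))
```

With these made-up numbers the extra parameter is "paid for" by the better likelihood, so DEC + j would be preferred. Whether such a comparison is statistically legitimate in the first place is exactly one of the things the paper disputes.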

Now I come finally (!) to Ree & Sanmartin. Their eight-page paper, as the title implies, is a criticism of these two innovations. What do they argue? I hope I am summarising this faithfully, but in my eyes their three core points are as follows:
  • A biogeographic model in which events happen at the nodes of the tree rather than along the branches, as is the case with jump dispersal, is not a proper evolutionary model because such events are then "not modeled as time-dependent". In other words, only events that have a per-time-unit probability of occurring along a branch are appropriate.
  • Under certain conditions the most probable explanation provided by a model including the j parameter is that all biogeographic events were jump dispersals. The j parameter gets maximised and explains everything by itself. They call this scenario "degenerate", because the "true" model must "surely" include time-dependent processes.
  • DEC and DEC + j (and, I assume, by extension any other model and its + j variant) cannot be compared in the sense of model selection.

I must, of course, admit that model development is not my area. Consequently I am happy to defer regarding points one and three to others who have more expertise, and who will certainly have something to say about this at some point. I can only at this moment state that these claims do not immediately convince me. Surely it is often the case that models with very different parameters are statistically compared with each other?

Is it not possible that the best model to explain an evolutionary process may sometimes indeed have a parameter that is not time-dependent but dependent on lineage splits? In the present case, if it is a fact that jump dispersal caused a lineage split, then both events quite simply happened instantaneously (at the relevant time scale of millions of years); in a sense, they were the same event, as the dispersal itself interrupted gene flow.

Perhaps more importantly, however, I am not at all convinced by the second point. Generally I am more interested in practical and pragmatic considerations than theory of statistics and philosophy. In phylogenetics, for example, I am less impressed by the claim that parsimony is supposedly not statistically consistent than by a comparison of the results produced by parsimony and likelihood analysis of DNA sequence datasets. Do they make sense? What can mislead an analysis? What software is available? How computationally feasible is what would otherwise be the best approach, and can it deal with missing data?

So in the present case I would also like to consider the practical side. Is the problem of j being maximised so that everything is explained by jump dispersal at all likely to occur in empirical datasets? In the paper Ree and Sanmartin illustrate a two-species, two-area example. That is clearly not a realistic empirical dataset, as it is much too small for proper analysis. But if we understand to some degree how the various model parameters work we can deduce under what circumstances j is likely to be maximised.

Unless I am mistaken, the circumstances appear to be as follows: We need a dataset in which all species are local endemics, i.e. all are restricted to a single area, and in which sister species never share part of their ranges. This is because other patterns cannot be explained by jump dispersal. If a species occupies two or more areas, it would have had to expand its range, so the analysis cannot reduce the d parameter for range expansion to zero. If sister species share part of their ranges, likewise; if they share the same single area, they must have diverged sympatrically, which again is not speciation through jump dispersal.
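The two conditions can be written down as a simple check. The following Python sketch is only an illustration of my reasoning, not part of any AAI software: the tree is represented as nested pairs of tip names, the ranges as sets of area labels, and all function and species names are made up. It only looks at sister species in the strict sense (pairs of tips), as in the description above.

```python
def cherries(tree):
    """Yield pairs of tips that are each other's sister species.

    A tree is either a tip name (str) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return
    left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        yield left, right
    else:
        yield from cherries(left)
        yield from cherries(right)


def meets_all_jump_conditions(tree, ranges):
    """Check the two conditions: single-area endemics, disjoint sister ranges."""
    all_endemic = all(len(areas) == 1 for areas in ranges.values())
    sisters_disjoint = all(
        ranges[a].isdisjoint(ranges[b]) for a, b in cherries(tree)
    )
    return all_endemic and sisters_disjoint


tree = (("sp1", "sp2"), "sp3")

# The three-species example from earlier: sp3 occupies two areas.
mainland = {"sp1": {"A"}, "sp2": {"B"}, "sp3": {"B", "C"}}
# A hypothetical archipelago: every species on its own island.
islands = {"sp1": {"A"}, "sp2": {"B"}, "sp3": {"C"}}
# Sister species sharing the same single area.
sympatric = {"sp1": {"A"}, "sp2": {"A"}, "sp3": {"C"}}

print(meets_all_jump_conditions(tree, mainland))   # False
print(meets_all_jump_conditions(tree, islands))    # True
print(meets_all_jump_conditions(tree, sympatric))  # False
```

A dataset that fails this check needs range expansion or sympatric speciation somewhere in its history, so by the argument above j alone cannot explain it.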

This raises the question, how likely are we to find datasets in which these two conditions apply? In my admittedly limited experience such datasets do not appear to be very common. If we are dealing, for example, with a small to medium sized genus on one continent, we will generally find partly overlapping ranges, and often at least one very widespread species. The j parameter will not be maximised. If we are doing a global analysis of a large clade, we will need rather large areas (because if you use too many small areas the problem becomes computationally intractable). This means, among other things, that entire subclades will share the same single-area range, and j will not be maximised.
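To see why many small areas quickly become intractable, consider how many different ranges a species could have. The short Python sketch below counts them, assuming that every non-empty combination of areas is a possible range and hence a state of the model; whether an empty range is also counted depends on the implementation, and the optional cap on range size is a hypothetical parameter of my own function.

```python
from math import comb


def n_ranges(n_areas, max_range_size=None):
    """Count the non-empty sets of areas a species could occupy."""
    if max_range_size is None:
        max_range_size = n_areas
    return sum(comb(n_areas, k) for k in range(1, max_range_size + 1))


for n in (3, 5, 10, 15):
    print(n, "areas:", n_ranges(n), "possible ranges")

# Limiting species to at most two areas shrinks the state space a lot.
print(n_ranges(10, max_range_size=2))
```

Without a cap the count is 2^n - 1, so it roughly doubles with every area added, and any matrix with one row and one column per range grows with the square of that number.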

In other words, the problem of 'all-jump dispersal' solutions appears to be rather theoretical. But what if we actually do have such a dataset? Is it not a problem then? To me the next question is under what circumstances such a situation would arise. Again, we have all species restricted to single areas, meaning that they apparently find it hard to expand their ranges across two areas. Why? Perhaps geographic separation to the degree that they rarely disperse? Geographic separation to the degree that when they disperse gene flow is interrupted, leading to immediate speciation? Again, we never have sister species sharing an area. Why? A good explanation would be that each area is too small for sympatric speciation to be possible.

Now what does that dataset sound like? To me it sounds like an archipelago of small islands, or perhaps a metaphorical island system such as isolated mountain top habitats. The exact scenario, in other words, in which all-jump dispersal seems like a very probable explanation. Because your ancestral island is too small for speciation, the only way to speciate is to jump to another island, and if you jump to another island you are immediately so isolated from your ancestral population that you speciate.

Again, I am not a modeler, and I have not run careful simulation experiments before writing this, but based on this thought experiment it seems to me as if the + j models would work just as they should: j would not be maximised under circumstances where the other processes are needed to explain present ranges, but it would be maximised under precisely those extremely rare circumstances where 'all jump dispersal' is the only realistic explanation.

Footnote

*) Sympatric meaning here at the scale of the areas defined for the analysis. If one of the areas in the analysis is all of North America, for example, it is likely that the 'sympatric' events inside that area would in truth mostly have been allopatric, parapatric or peripatric at a smaller spatial scale.
