Sunday, August 20, 2017

I still don't get area cladistics, and 'geographic paralogy' in particular

Since I started looking into panbiogeography and area cladistics, I have been curious about the concept of geographic paralogy. The word is used by area cladists (in the widest sense), and I have so far been doubtful about whether the analogy to gene paralogy fits.

To recap, area cladistics attempts to infer biogeographic area relationships from the patterns that species' areas of distribution show on a phylogenetic tree. If, for example, several plant or animal groups show distributions on a phylogeny that are (Africa, (South America, Australia)), i.e. sister lineages are endemic to South America and Australia, and more distantly related lineages are endemic to Africa, then an area cladist would conclude that South America and Australia are "more closely related" biogeographically than either is to Africa, or even that they form a "monophyletic biogeographic area".
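To make that substitution step concrete, here is a minimal Python sketch of how a taxon cladogram becomes an area cladogram. The taxon names, the nested-tuple tree representation and the function names are all assumptions made for illustration; this is not taken from any area cladistics software.

```python
# Taxon cladogram as nested tuples; the taxon names are made up for illustration.
taxon_tree = ("taxon_a", ("taxon_b", "taxon_c"))

# Hypothetical distribution data: each taxon is endemic to exactly one area.
distribution = {
    "taxon_a": "Africa",
    "taxon_b": "South America",
    "taxon_c": "Australia",
}


def to_area_cladogram(tree, distribution):
    """Replace every terminal taxon with the area it occurs in."""
    if isinstance(tree, tuple):
        return tuple(to_area_cladogram(child, distribution) for child in tree)
    return distribution[tree]


def sister_area_pairs(tree):
    """Collect the area pairs whose terminals are direct sisters at a node."""
    if not isinstance(tree, tuple):
        return []
    pairs = []
    if len(tree) == 2 and not any(isinstance(child, tuple) for child in tree):
        pairs.append(frozenset(tree))
    for child in tree:
        pairs.extend(sister_area_pairs(child))
    return pairs


area_cladogram = to_area_cladogram(taxon_tree, distribution)
print(area_cladogram)  # ('Africa', ('South America', 'Australia'))
print(sister_area_pairs(area_cladogram))
```

The only "sister areas" this toy example reports are South America and Australia, which is the statement an area cladist would read off the tree.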

Whatever that is supposed to mean, given that the word monophyletic only applies if we presuppose tree-like relationships. But I am getting ahead of myself.

The problem now is that phylogenies do not necessarily show such a simple pattern. Some species may be widespread and occur in several of the areas in the analysis, and of course the same area may occur repeatedly in different parts of the phylogeny. This is what area cladists call 'geographic paralogy', and they 'solve' the problem it poses for their analyses by selecting 'paralogy-free' subtrees from a phylogeny.
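The two complications named here, widespread species and areas that recur across the tree, are easy to state as a check on an area cladogram. The following is a rough Python sketch under my own assumptions: terminals are frozensets of area names, internal nodes are tuples, and the helper names and example trees are hypothetical.

```python
from collections import Counter


def areas(*names):
    """Build one terminal: the set of areas a single taxon occupies."""
    return frozenset(names)


def terminals(tree):
    """Yield the terminals of a nested-tuple cladogram."""
    if isinstance(tree, tuple):
        for child in tree:
            yield from terminals(child)
    else:
        yield tree


def paralogy_signals(area_tree):
    """Return (areas occurring at more than one terminal, widespread terminals)."""
    leaves = list(terminals(area_tree))
    counts = Counter(area for leaf in leaves for area in leaf)
    repeated = {area for area, n in counts.items() if n > 1}
    widespread = [leaf for leaf in leaves if len(leaf) > 1]
    return repeated, widespread


# The simple pattern: every area occurs once, no widespread taxa.
simple = (areas("Africa"), (areas("South America"), areas("Australia")))

# A made-up messier pattern with a repeated area and a widespread taxon.
messy = (
    (areas("Africa"), areas("South America")),
    (areas("South America", "Australia"), areas("Australia")),
)

print(paralogy_signals(simple))
print(paralogy_signals(messy))
```

Note that the check only reports that a pattern is present. It says nothing about what process produced it.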

Again, two questions: Does it make sense to call this geographic paralogy, in analogy to gene paralogy? And does it make sense to do area cladistics by cherry-picking 'paralogy-free' subtrees, effectively ignoring these patterns?

I started a conversation with a colleague at the IBC, and he recommended I read Ladiges (1998, "Biogeography after Burbidge", Australian Systematic Botany 11: 231-242) as an introduction to the relevant concepts and approaches. So this I have now done. Unfortunately, the paper did not really solve my conceptual problems. I will start with a few quotes:
"In cladistic biogeography, nodes of a cladogram for organisms (1,2 and 3) are potentially informative about the geographic areas (A, B and C) in which they occur: node 2 in Fig. 3 shows that areas B and C are related more closely to each other than to area A.

"Such statements of relationship, the nodes of the cladogram, are explained by a variety of historical theories. One is dispersal from a restricted ancestral area, for example from area A to areas B and C, a pattern that may match fossil ages and distribution. An alternative explanation is vicariance of a widespread ancestral species coincident with physical breakup or climatic differentiation of the general area. A vicariance explanation is favoured by evidence of biogeographic congruence: finding the same pattern for other groups of organisms."
So far so good, although I do wonder whether the concept of area relationships makes sense if dispersal is the right answer. It seems to me that even calling it relationships only makes sense if there is no frequent floristic or faunal exchange, if near-everything is due to vicariance. And as I have mentioned before, there are good alternative explanations for congruence that do not imply vicariance, in particular prevailing directions of wind or ocean currents, common routes of migratory birds, etc.

Now come the complications:
"Data for any one group of organisms are rarely as simple as the example shown (...). Some taxa are widespread, and some areas have more than one taxon. When combining data for different groups of organisms, not all areas are represented in each taxonomic group. Such complications are obstacles to development of analytical methods for determining area cladograms and general area cladograms."
Well yes, either that or they show that the concept of an area cladogram is as incoherent as a 'species-level phylogeny' with only human populations as the terminals, and that the research program of area cladistics is a non-starter. Two pages on, the term at the centre of this post is introduced.
I offer two conclusions: (1) that evidence of historical geographic relationship is associated with nodes (not the distribution per se of terminal taxa) and (2) that some nodes of cladograms of organisms are paralogous. (...)

What is geographic paralogy? It is evidenced by duplication or overlap in geographic distribution of taxa related at a node (references). The term has its origin in molecular biology, geographic paralogy being analogous to gene duplication, with each gene copy subsequently tracking a separate evolutionary history.

(...) There is duplication of biogeographic regions across the clades (e.g. South America is in three), which is evidence of geographic paralogy. In other words, the major lineages shown in the cladogram existed prior to the breakup of Gondwana and each potentially reflects that geological history.
Consider what is claimed here. First, as we have seen earlier, simple area relationships that are congruent across lineages are claimed as evidence for vicariance. Now the fact that the same area shows up in several parts of a phylogeny is seen as evidence for paralogy; and this paralogy is also seen as evidence for vicariance and against dispersal. I cannot say that this makes a lot of sense to me.

Having gone through these quotes, I now want to carefully examine the analogy between gene paralogy and geographic paralogy. Let's start with the former. It works like this:

In this and the following figures, we see a grey species tree with species 1, 2 and 3. Within it we see the gene trees, as genes evolve inside the species. Here an originally single gene lineage (blue) was duplicated in the common ancestor of all three species, creating a red gene and a black gene. We now call the gene copies A and Y paralogues of each other, because while they are distantly related they are not really the same gene anymore. In contrast, A and B are orthologues of each other. They are really the same gene, only in two different species.

The above figure now shows the problem that gene paralogy can cause in phylogeny reconstruction. If in this case Z is wrongly assumed to be an orthologue of A and B, we will infer the wrong species relationships, i.e. ((1,2),3) instead of the true (1,(2,3)). However, there are also other reasons why we may get conflicting or complicated patterns.
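The effect can be reproduced with a toy gene tree. In this hedged Python sketch I assume that the red gene has copies A, B and C in species 1, 2 and 3, and the black gene has copies X, Y and Z in the same species; C and X are my additions to complete the picture, and the tree encoding and function names are hypothetical.

```python
def prune(tree, keep):
    """Restrict a nested-tuple tree to the leaves in `keep`, collapsing unary nodes."""
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kept = [sub for sub in (prune(child, keep) for child in tree) if sub is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def relabel(tree, mapping):
    """Replace each gene copy with the species it was sampled from."""
    if isinstance(tree, tuple):
        return tuple(relabel(child, mapping) for child in tree)
    return mapping[tree]


# After the duplication, each copy tracks the species tree (1,(2,3)) on its own.
gene_tree = (("A", ("B", "C")), ("X", ("Y", "Z")))
species_of = {"A": 1, "B": 2, "C": 3, "X": 1, "Y": 2, "Z": 3}

# Sampling true orthologues recovers the species tree.
orthologues_only = relabel(prune(gene_tree, {"A", "B", "C"}), species_of)
print(orthologues_only)  # (1, (2, 3))

# Mistaking Z for the orthologue of A and B gives the wrong tree.
with_paralogue = relabel(prune(gene_tree, {"A", "B", "Z"}), species_of)
print(with_paralogue)  # ((1, 2), 3)
```

Nothing in the second result flags it as wrong. That information has to come from outside the tree, which is what orthology inference is for.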

In the above case we have the gene tree contradicting the species tree, but nonetheless there is no paralogy because there is only one gene involved. What has happened here is that two versions of the gene arose in an ancestral population, and that subsequent populations were large enough and/or speciation events followed each other so quickly that both versions were carried through to the ancestor of 2 and 3. We call this incomplete lineage sorting (ILS) or ancestral polymorphism. We could also still find all gene variants in all three species. The point is, this is not paralogy.

Something different has happened in the above scenario. We get the same pattern of a gene tree showing ((1,2),3) despite the species phylogeny of (1,(2,3)), but this time because of a hybridisation or introgression event between 1 and 2. Of course, we could also still find the original gene variant in species 2 along with the introgressed one. Again, this is not paralogy.

Now the same for biogeography. Above is the scenario in which I think the analogy works: There are two clades that arose before continental breakup, and they both independently trace the 'area relationships'. In this case it makes sense to use the two clades or subtrees as independent data points for inference in area cladistics.

Here is the same problem for area cladistics as for phylogenetic inference. If we do not realise that we are treating paralogues as orthologues, we may get species phylogenies and, by analogy, area relationships wrong. So in the case of phylogenetics, people have developed methods for inferring orthologues and for excluding paralogues from the data.

What I don't really see is how area cladists do the same. They claim they pick 'paralogy-free subtrees', but that merely means that they search for a statement like ((1,2),3) and remove statements like (1&2&3,(1&2,2&3)). It does not mean that they actually have any way of recognising that ((1,2),3) is an instance of paralogy while (1,(2,3)) isn't. They can merely hope that it comes out in the wash because the true relationship will be more frequent than the wrong ones.
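To spell out the worry, here is a small Python sketch of the filter as I have described it in the previous paragraph. It is my own caricature, not the published subtree algorithm: area statements are nested tuples, a terminal like "1&2" means a taxon occurring in areas 1 and 2, and the function names are hypothetical.

```python
def terminal_areas(tree):
    """Yield every area mentioned at the terminals, splitting widespread ones."""
    if isinstance(tree, tuple):
        for child in tree:
            yield from terminal_areas(child)
    else:
        yield from tree.split("&")


def passes_paralogy_filter(tree):
    """True if no area occurs more than once among the terminals."""
    found = list(terminal_areas(tree))
    return len(found) == len(set(found))


true_statement = ("1", ("2", "3"))
misleading_statement = (("1", "2"), "3")  # e.g. the product of hidden paralogy
overlapping_statement = ("1&2&3", ("1&2", "2&3"))

for statement in (true_statement, misleading_statement, overlapping_statement):
    print(statement, passes_paralogy_filter(statement))
```

The overlapping statement is rejected, but the true and the misleading statement both pass. A filter of this kind looks only at repetition within a statement, so it cannot tell the two apart.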

This appears to be rather problematic, unless I am missing something equivalent to orthology inference in phylogenetics. But on top of that we have the other scenarios, those where there really is no paralogy.

The above is the biogeographic equivalent of incomplete lineage sorting. We could imagine here that species C stayed endemic to a part of South America while its sister species was more widespread. If we now also had some species occurring in two areas, area cladists would speak of paralogous nodes, but again, there does not appear to be any paralogy involved.

But really crucial is the biogeographic parallel to gene introgression: dispersal. The above scenario shows what area cladists call paralogy and, as we saw in the quotes above, consider evidence of vicariance, but what reason is there to exclude dispersal as a possible explanation? This is, of course, precisely the pattern that dispersal would produce!

And it is clearly not in any way comparable to gene paralogy anyway, because there are no paralogues involved. It makes no sense to use a term that assumes the existence of two genes independently tracing the species phylogeny (and, by analogy, two species-lineages independently tracing 'area relationships') to refer to any difficult pattern, even where there are no such two deep species-lineages.

In summary, I am still not exactly convinced that area cladistics makes sense. The assumption that pretty much any pattern - congruence as well as the contradictory data from paralogy! - is evidence of vicariance seems particularly hard to swallow.
