Tuesday, February 9, 2016

Patrocladistics 1: How does it work? And a contrived example

As the approach is often mentioned in pro-paraphyly publications as an objective method of delimiting paraphyletic taxa, I thought I should look into patrocladistics again and examine it in a blog post or three. In the following I will approach patrocladistics from three different angles:

1. What is patrocladistics and how does it work? This is very straightforward.

2. How does the patrocladistic approach perform when ancestors are added?

It is often easy enough to explain how something works in the abstract, but it is perhaps more enlightening to throw different problems at a method and see under what conditions it is more or less useful or may be misled. For example, explaining how BEAST does its phylogenetic inferences does not necessarily by itself tell us how it will perform when faced with, say, 25% missing data. I often criticise the pro-paraphyly movement for what I see as their reliance on the fortuitous absence of intermediate fossils to separate out paraphyletic groups. Conversely, members of that movement have a tendency to criticise cladists for supposedly ignoring ancestors. So in the case of patrocladistics, I wanted to see what happens if the method is provided not only with extant taxa but also with ancestors.

3. What is the rationale behind patrocladistics?

In other words, if somebody who is agnostic about the whole phylogenetic versus 'evolutionary' systematics issue were to ask why they should do a patrocladistic analysis, or what the biological or philosophical justification for such an analysis is, what would the answer be?

This post will cover the first point.

What is patrocladistics?

Patrocladistics is an approach suggested by Stuessy & König in a paper published in the plant systematics journal TAXON in 2008. It is sometimes cited by proponents of the formal recognition of paraphyletic taxa as a way to delimit such taxa in an objective, formalised way.

This is done especially as a rebuttal of the cladist argument that it is impossible to objectively delimit paraphyletic groups given the gradual nature of evolution: justifying the recognition of a new subgroup is one thing, but how do you justify that some ancestor in the past was in one insect order but its immediate descendant in another if you could hardly have distinguished the two species? Or more importantly, why in this case and not in all the others where there were the same relatedness and the same degree of difference? Patrocladistics is presented as a way out of this dilemma.

How does patrocladistics work?

A patrocladistic analysis takes a phylogenetic tree, that is a tree of evolutionary relationships between species or other tree terminals, and (re-)clusters those terminals by their phylogenetic distance on the tree.

So first, take a phylogenetic tree with branch lengths proportional to character changes, i.e. a phylogram. From this construct a distance matrix of each terminal against each terminal, using the number of tree nodes separating each pair of terminals as the distance between them. These distances are called cladistic distances.

Now construct a second distance matrix of each terminal against each terminal, using the number of character changes along the tree branches (branch lengths) between any two terminals as the distance between them. These distances are in this context called patristic distances.

Construct the final distance matrix by adding up cladistic and patristic distances. So two sister species sitting on a branch of length one and a branch of length three would have a distance of five on this final matrix, four for the patristic distance and one for the node separating them. The summed distances are called patrocladistic distances.
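The arithmetic of this recipe can be checked with a minimal Python sketch. The tree encoding (a dictionary of edges with lengths), the function names and the species labels sp_a and sp_b are assumptions made here for illustration; none of them come from the Stuessy & König paper.

```python
from itertools import combinations

def build_adjacency(edges):
    """Turn {(node_a, node_b): length} into an undirected adjacency map."""
    adj = {}
    for (a, b), length in edges.items():
        adj.setdefault(a, {})[b] = length
        adj.setdefault(b, {})[a] = length
    return adj

def path_between(adj, start, goal):
    """Return the unique path of nodes from start to goal in a tree."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adj[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")

def patrocladistic(edges, terminals):
    """Return {(t1, t2): (cladistic, patristic, patrocladistic)}."""
    adj = build_adjacency(edges)
    result = {}
    for t1, t2 in combinations(terminals, 2):
        path = path_between(adj, t1, t2)
        # Patristic distance: summed branch lengths along the path.
        patristic = sum(adj[a][b] for a, b in zip(path, path[1:]))
        # Cladistic distance: internal nodes passed between the two terminals.
        cladistic = len(path) - 2
        result[(t1, t2)] = (cladistic, patristic, cladistic + patristic)
    return result

# Two sister species on branches of length one and three (hypothetical names).
edges = {("node", "sp_a"): 1, ("node", "sp_b"): 3}
print(patrocladistic(edges, ["sp_a", "sp_b"]))
# {('sp_a', 'sp_b'): (1, 4, 5)}
```

The printed tuple matches the worked case in the text: one node, a patristic distance of four, and a patrocladistic distance of five.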

The matrix of patrocladistic distances is used for a clustering analysis. The paper in which the approach was published is somewhat vague about what clustering method should be used. It mentions UPGMA and single-linkage, expressing a personal preference for the latter because "it more quickly connects groups and also more distinctly reveals dendrogram structure".

The concern with computation speed is somewhat strange given that any available clustering algorithm would have taken only a fraction of a second even for medium-size datasets on a year 2008 desktop computer. In addition, I did not understand what is meant by "more distinctly reveals dendrogram structure", so I consulted that repository of knowledge, Wikipedia, and found the following explanations (accessed 7 Feb 2016):
"It is based on grouping clusters in bottom-up fashion (agglomerative clustering)..."
This means the clusters will be rooted automatically. I can only assume that bottom-up methods like single-linkage and UPGMA were proposed quite consciously to address the problem of how to objectively root the resulting clusters. Strangely, however, the paper does not appear to explicitly discuss the issue at all; searching the paper for "root" didn't bring anything up. A potential user may thus decide to try out a different clustering method and only later notice a very interesting problem. (That being said, with 'evolutionary' systematics being a minority position, very few people appear to be using patrocladistics in the first place.)
"...at each step combining two clusters that contain the closest pair of elements not yet belonging to the same cluster as each other."

"A drawback of this method is that it tends to produce long thin clusters in which nearby elements of the same cluster have small distances, but elements at opposite ends of a cluster may be much farther from each other than to elements of other clusters. This may lead to difficulties in defining classes that could usefully subdivide the data."
I found this rather interesting given the aforementioned problem that the 'evolutionary' approach to classification would have to place in two different phyla or classes an ancestor-descendant pair of species that are so similar to each other that if presented with them in isolation one would be hard pressed to even justify their placement in different subgenera. It seems the clustering approach for patrocladistics was wisely chosen to produce such solutions.

But note how whoever wrote the above section of the Wikipedia article characterises this behaviour as undesirable, although the topic isn't even biological classification! The context is clustering in the abstract, not systems that should reflect the reality of evolutionary processes, and even there the people dealing with the performance of clustering algorithms consider such situations problematic.
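To make the quoted behaviour concrete, here is a naive single-linkage sketch in Python. It is only an illustration under assumed toy data (six made-up elements placed on a line), not the implementation used by R or by the patrocladistics paper.

```python
def single_linkage(labels, dist):
    """Naive single-linkage clustering.

    dist(a, b) returns the distance between two labels. The result is a
    list of merges (members_1, members_2, height) in the order performed.
    """
    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance of the closest pair of elements.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Toy data: a chain a..e with steps of 2, and x sitting 3 away from e.
positions = {"a": 0, "b": 2, "c": 4, "d": 6, "e": 8, "x": 11}
merges = single_linkage(list(positions), lambda p, q: abs(positions[p] - positions[q]))
for left, right, height in merges:
    print(left, right, height)
# Merge heights in order: 2, 2, 2, 2, 3
```

The chain a to e is assembled entirely at height 2 although a and e are 8 apart, while x joins last at height 3 despite being only 3 away from e. That is the "long thin cluster" effect the quoted passage describes.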

Anyway, using a clustering algorithm we get clusters of terminals that may not necessarily have formed a clade in the original phylogenetic tree, generally because they will now lack nested members that are very divergent in whatever set of characters underlies the tree. And now these new clusters are used as an argument to recognise them as paraphyletic taxa in 'evolutionary' classifications.

A contrived example

The example case I am using is a contrived one so that nobody can claim any emotional attachment to any particular classification that they learned as a student.

We have five ingroup species: primitiva is the sister to the rest of the ingroup, which consists of two pairs of sister species. One pair, communis and vulgaris, has changed very little relative to their common ancestor with primitiva and thus sits on short branches of length one. The other pair, aberrans and anomalica, is the end product of some rapid evolutionary changes, and they are together at the end of a long branch of length five. The ingroup is separated by another branch of length five from the outgroup, two species imaginatively called outgroupica and outgroupopsis. This is the phylogram:

Phylogenetic systematists (cladists) classify by relatedness and would thus have to place aberrans and anomalica into whatever group primitiva, communis and vulgaris are in, because communis and vulgaris are actually more closely related to aberrans and anomalica than they are to primitiva. Of course it might make sense to recognise the divergence of aberrans and anomalica by giving that subclade a name, but it has to be a subgroup, it cannot be a new group at the same level as that containing the other three ingroup species.

'Evolutionary' systematists do not consistently classify by relatedness but would in this case most likely be impressed by the long branch between aberrans / anomalica and the other species. They would say, "but they look so different!", and thus prefer to place primitiva, communis and vulgaris in one group and aberrans and anomalica in another group at the same level. The point of patrocladistics is to produce a clustering solution that will support such a classification: we want a cluster of aberrans and anomalica outside of the cluster of primitiva, communis and vulgaris.

First, calculate the cladistic distances by counting the nodes (T-crossings) between any two species:

Next, calculate the patristic distances by adding up branch lengths between any two species. Here the red numbers above the branches are helpful:

I hope I got that all right. There is also probably a function in some R package for pairwise phylogenetic distance, but with as few taxa as in my case I didn't bother to search.

Add up the two to produce patrocladistic distances:
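For readers who want to redo the three tables by machine, the following Python sketch computes them for one possible encoding of the contrived tree. The two branch lengths of five are taken from the description above; every other branch length is assumed to be one, the root is not modelled as a separate node, and the internal node names (og, ing, rest, cv, aa) are invented here, so the numbers may differ from those in the figures.

```python
from itertools import combinations

# Hypothetical encoding of the contrived tree. Lengths of five come from the
# text; all other branch lengths are ASSUMED to be one.
EDGES = {
    ("og", "outgroupica"): 1, ("og", "outgroupopsis"): 1,
    ("og", "ing"): 5,        # branch separating ingroup and outgroup
    ("ing", "primitiva"): 1, ("ing", "rest"): 1,
    ("rest", "cv"): 1, ("cv", "communis"): 1, ("cv", "vulgaris"): 1,
    ("rest", "aa"): 5,       # long branch leading to aberrans and anomalica
    ("aa", "aberrans"): 1, ("aa", "anomalica"): 1,
}
TAXA = ["outgroupica", "outgroupopsis", "primitiva",
        "communis", "vulgaris", "aberrans", "anomalica"]

def adjacency(edges):
    """Build an undirected adjacency map from the edge dictionary."""
    adj = {}
    for (a, b), length in edges.items():
        adj.setdefault(a, {})[b] = length
        adj.setdefault(b, {})[a] = length
    return adj

def tree_path(adj, start, goal):
    """Return the unique path of nodes between two nodes of a tree."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        stack.extend((n, path + [n]) for n in adj[node] if n not in path)
    raise ValueError("nodes are not connected")

def distance_tables(edges, taxa):
    """Return cladistic, patristic and patrocladistic distance dictionaries."""
    adj = adjacency(edges)
    clad, patr, both = {}, {}, {}
    for a, b in combinations(taxa, 2):
        nodes = tree_path(adj, a, b)
        key = frozenset((a, b))
        clad[key] = len(nodes) - 2  # internal nodes on the path
        patr[key] = sum(adj[u][v] for u, v in zip(nodes, nodes[1:]))
        both[key] = clad[key] + patr[key]
    return clad, patr, both

clad, patr, both = distance_tables(EDGES, TAXA)
for a, b in combinations(TAXA, 2):
    key = frozenset((a, b))
    print(f"{a}\t{b}\t{clad[key]}\t{patr[key]}\t{both[key]}")
```

Under these assumed lengths, primitiva and communis come out at cladistic 3, patristic 4 and patrocladistic 7, while communis and aberrans come out at 3, 8 and 11.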

Now that we have the distance matrix, we fire up R, load the library(stats) and import the matrix. For me the following worked: Make a tab-separated text file containing the complete matrix, including the all-zero diagonal and the other half (the above is only one half, for clarity), with taxon names both in the first row and the first column. Import it as a data frame with df <- read.csv("filename", row.names=1, sep="\t", header=TRUE). Cast it into a distance matrix using dm <- as.dist(df).
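If the half matrix exists only as a list of pairwise values, the complete tab-separated table can be generated rather than typed. The Python sketch below is one way to do that; the function name, the three species labels and their distances are hypothetical, and leaving the corner cell of the header row empty is an assumption about the file layout.

```python
import csv
import io

def write_full_matrix(handle, taxa, pairwise):
    """Write a complete symmetric distance matrix as tab-separated text.

    pairwise maps frozenset({a, b}) to a distance; the diagonal is zero.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([""] + taxa)  # header row: empty corner cell, then names
    for row_taxon in taxa:
        row = [0 if row_taxon == col else pairwise[frozenset((row_taxon, col))]
               for col in taxa]
        writer.writerow([row_taxon] + row)

# Hypothetical three-taxon example.
taxa = ["sp_a", "sp_b", "sp_c"]
pairwise = {
    frozenset(("sp_a", "sp_b")): 5,
    frozenset(("sp_a", "sp_c")): 9,
    frozenset(("sp_b", "sp_c")): 9,
}
buffer = io.StringIO()
write_full_matrix(buffer, taxa, pairwise)
text = buffer.getvalue()
print(text)
```

To write to disk instead of a string buffer, pass a handle such as open("filename", "w", newline="") to the same function.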

Clustering can then be done using the hclust function, one of whose methods is single-linkage: cl <- hclust(dm, method = "single", members = NULL). Draw the resulting dendrogram with plot(cl) and you get this:

Voilà, we have the desired result. Not only are aberrans and anomalica outside of the rest of the ingroup, they even ended up on the far side of the outgroup.
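The merge order can also be traced without R. The Python sketch below runs a naive single-linkage clustering on patrocladistic distances for the contrived example. The distances assume that every branch length not stated in the text is one, and the group codes (og, pr, cv, aa) are labels invented here, so heights may differ from the dendrogram above even if the pattern is the same.

```python
# Patrocladistic distances for the contrived example, ASSUMING all branch
# lengths not given in the text are one. Group codes are hypothetical labels.
GROUP = {
    "outgroupica": "og", "outgroupopsis": "og", "primitiva": "pr",
    "communis": "cv", "vulgaris": "cv", "aberrans": "aa", "anomalica": "aa",
}
WITHIN = 3  # sister pairs: patristic distance 2 plus one node
BETWEEN = {
    frozenset(("og", "pr")): 9, frozenset(("og", "cv")): 13,
    frozenset(("og", "aa")): 17, frozenset(("pr", "cv")): 7,
    frozenset(("pr", "aa")): 11, frozenset(("cv", "aa")): 11,
}

def dist(a, b):
    """Look up the assumed patrocladistic distance between two species."""
    ga, gb = GROUP[a], GROUP[b]
    return WITHIN if ga == gb else BETWEEN[frozenset((ga, gb))]

def single_linkage(labels, dist):
    """Naive single-linkage clustering; returns merges in order performed."""
    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Distance between clusters is that of their closest members.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = single_linkage(list(GROUP), dist)
for left, right, height in merges:
    print(height, left, "+", right)
# Merge heights in order: 3, 3, 3, 7, 9, 11
```

With these assumed distances the three sister pairs form first, primitiva joins communis and vulgaris at 7, the outgroup joins that cluster at 9, and aberrans plus anomalica join last at 11, which is the same qualitative pattern as described above.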

Next post: What happens when we include the intermediate species that have existed along the branches? Can a method developed by a school of classification that always criticises cladists for "ignoring ancestors" deal with ancestors? The results were not quite what I had expected.


  1. I don't understand why Stuessy prefers single-linkage. Everyone using patrocladistics uses average-linkage instead (i.e. UPGMA). Willner (2014) justifies this as follows: "We used average-linkage as a cluster algorithm because it also reflects the internal heterogeneity of a group and not only the size of the gap between groups as in the case of single-linkage." It seems reasonable to me.

    Your example is good for understanding the way the algorithm works. However, you should be aware it is not a statistically satisfying one, because there are only 5 ingroup species and the long branch also has a length of 5. So arguably, the two putative adaptive zones are connected by a bridge as wide as the zones themselves, i.e. there is only one adaptive zone. I guess this is what adding the ancestors will reveal, isn't it? Single-linkage leads to a completely unresolved patrocladogram while average-linkage should lead to the same result as the cladogram.

    In your second post, you should try with a tree where one could reasonably think there are indeed two adaptive zones, for example by increasing the number of species in both the basal paraphyletic group and the crown autophyletic one.

    1. Hey, no spoilers!

      I do not understand why assuming a longer branch - which, by the way, does not necessarily indicate a different adaptive zone but sometimes just a bunch of leaf shape and fruit shape characters - would change what happens when ancestors are added. I can try it, but logically you will still get absence of clusters in single-linkage and a garbled version of the phylogram in average. What is more, even if you add only one intermediate the results of clustering by branch length will usually change - not the recipe for a stable and universally useful classification.

      It is simply the case that patrocladistics cannot deal with ancestors, and I seem to remember that you did not doubt that yourself.

      Also: What you wrote above reads like a weird conflation of (a) species numbers, (b) branch lengths, and (c) 'width' of 'adaptive zones'. Apart from the difficulty of even just defining what the scare-quoted terms mean quantitatively, those three things are very different indeed. Surely having ten species in an adaptive zone doesn't make it 'wider' than having two species in it?

    2. You can still achieve stability by adding more and more species, bootstrapping and drawing a consensus tree.

      As far as I know, yes patrocladistics is not intended/has not been designed for paleontological data. But it does not mean it cannot deal with ancestors to a certain extent. I have done my own tests about this.

      The three items you cite are surely different things, but they are not independent. It makes sense to try to infer one from the other two, but undoubtedly there will be false positives and false negatives among the results.

      I can't understand what is so obscure for you about adaptive zones. Zones of high fitness in the multidimensional space of all possible phenotypes. Since species inside the same zone are competing with each other, yes the size of the zone is actually the number of species that can live within it at the same time. There are many evolutionary phenomena associated with this, for example the in-zone stabilizing selection is very high, so diversification rate and species turnover are low. Reaching another adaptive zone implies an initial high phenotypic rate change and a high species turnover. Etc. These phenomena can be seen on a phylogenetic tree, especially on phylograms and chronograms.

    3. Yeees... the stability of either not having any clusters or having the clusters approach the original clades, which defeats the purpose.

      I can see how a long branch is nothing but a large number of intermediate ancestors, right, but the adaptive zone, no. You treat the group as if it is living in a biological vacuum. Has it occurred to you that there might only be space for two of its species on a 'large' adaptive peak because five other clades are already occupying it with 200 other species?

      What is so obscure to me is what to call an adaptive zone, although note that I only see a non-sequitur between any of this and classification anyway. I am working with plants. So what is the adaptive zone of Sanchezia oblonga: Land-living photoautotrophic organism? Shrub? Rainforest plant? Ballistochory? Trochilophily? All of it?

      Towards the former the 'zone' is so huge and ill-defined that it is meaningless. All of it would be so specific as to be meaningless, because it is not clear how there would be competition for, say, the space taken up by the combination of ballistochory and trochilophily. At a minimum there is a lot of subjectivity going on.

    4. "Yeees... the stability of either not having any clusters or having the clusters approach the original clades, which defeats the purpose." Not always. You should do further tests.

      You are confusing adaptive zone with ecological niche. But yes, peaks can indeed be small and thus hard to detect through this method. But if you know the ages of ancestors for example, you can use another method, say phenotypic rate shift.

      "what is the adaptive zone of Sanchezia oblonga" You don't need to use ill-defined categories, just use the characters you can actually measure. Then yes, each of these is a dimension of that space. Adaptive zones are nested, peaks are within peaks. That's why for example all marsupials were replaced by placentals in South America when the continents became connected, it's a case of higher taxa selection within a broad zone.

      "I only see a non-sequitur between any of this and classification anyway" The purpose of evolutionary classification is to represent the kind of macroevolutionary dynamics I explained.

    5. Characters I can actually measure? You imply that some kind of data matrix containing columns such as presence of glandular hairs, leaf width, and flower colour orange or yellow defines an adaptive zone. If that is so then every species sits on its own adaptive peak and the concept is empty, with all it entails for the meaningfulness of 'evolutionary' classification. Nested, okay, but that just means we can move goalposts as we want.

      Don't get me wrong, I also visualise the fitness landscape in valleys and peaks. But that mental model doesn't mean that we can decide on what exactly constitutes the peak for the purposes of your classification, quite apart from the fact that the shape of the landscape is constantly changing.

      I am not sure "higher taxa selection" would be a meaningful concept to many evolutionary biologists.