PhyloBotanist: phylogenetic systematics

Saturday, October 19, 2019

Arguments for paraphyletic taxa, part 543,997 or so

Although I have largely moved on from blogging, I found myself writing another post on the most frequent topic of this blog: arguments for the acceptance of paraphyletic taxa and whether they make sense. A paper has recently appeared that describes a new species of flowering plant (Carnicero et al. 2019, Bot J Linn Soc: boz052). The first paragraph of its introduction argues for paraphyletic taxa as follows:

From a cladistic perspective, monophyly of taxa is desirable, but important evolutionary processes such as hybridization, anagenetic and anacladogenetic speciation (budding sensu Mayr & Bock, 2002) unavoidably result in non-dichotomous branching patterns (Hörandl, 2006; Hörandl & Stuessy, 2010).

I am afraid I already find this first bit confused in several details. First, from a cladistic perspective, monophyly is not merely desirable but required. That is the entire point of cladism.

Second, non-dichotomous branching patterns are polytomies, meaning the branch splits into more than two sub-branches. Polytomies are no problem for making supraspecific taxa monophyletic, so on the face of it, it is not clear what the argument is. But none of the mentioned processes necessarily produce polytomies anyway, and some of them do not even produce any branching at all.

Hybridisation - presumably the authors mean hybridogenic speciation, e.g. by allopolyploidy, and not actually hybridisation per se, which is usually a dead end - is not branching, it is the opposite. The problem for the argument here is that reticulation does not just mean there is no monophyly, it also means there is no paraphyly either, as there is no phyletic (tree-like) structure. It makes no sense to argue for paraphyly in a situation where there is no paraphyly. (More on that below.)

'Budding' speciation is dichotomous, just like any other lineage split, unless an ancestral species fractures into three or more descendant species at the exact same moment, just like could happen with a non-'budding' lineage split. It is no problem whatsoever for making supraspecific taxa monophyletic.

Third, anagenetic means that something happens along a lineage without a lineage split, so it is again odd to speak of a "non-dichotomous" branching pattern. If anagenesis is happening there is by definition no branching pattern, dichotomous or otherwise. Nor is there any problem for making supraspecific taxa monophyletic. So yes, the observation that there is no dichotomy is correct, but merely in the same trivial sense as the observation that a book isn't a car. You can go around saying that, but book authors or publishers will simply say, "we know, so what?" Cladists likewise when told that anagenesis happens.

Anacladogenesis is a case of peripatric speciation, in which a population or a group of populations from a species diverge, resulting in a derivative monophyletic species (Stuessy, Crawford & Marticorena, 1990). Unlike in cladogenetic processes, the ancestral species remains essentially unchanged and often becomes paraphyletic (Mayr & Bock, 2002; Crawford, 2010).

With this the two closely related misconceptions at the heart of the paper's argumentation become clear. The first is that the cladist approach requires making species monophyletic. It doesn't. The second is that it makes sense to call species monophyletic or paraphyletic in the first place. It doesn't. (Although this is a very, very common and widespread misconception.)

As already indicated above, the concepts ending in -phyly apply in tree-like structures, such as the tree of life. The individuals of sexually reproducing species, however, do not form a tree-like but instead a net-like structure. Consequently, -phyly does not apply inside sexually reproducing species. Another attempt at an analogy: I can be asleep, but the molecules I consist of do not sleep. The concept "asleep or awake" does not apply to individual molecules, just as monophyly does not apply to individuals of the same sexually reproducing species. Fallacy of division is the keyword here.

This is not a new idea that cladists came up with only as a rearguard action, as frequently claimed by paraphyletists. We can go back all the way to the inventor of cladism, Willi Hennig. The central and best known figure in his book illustrates the different relationships that species, individuals, and life stages have to each other. Phylogenetic systematics ('cladism') is the approach to take when classifying species into supraspecific taxa, but not when classifying individuals into species. The claim that a species is monophyletic or paraphyletic is a category error.

Over time, the ancestral species may converge to monophyly through gene flow and lineage sorting (Baum & Shaw, 1995).

Same as above, but in addition it is unclear what is meant by 'gene flow', as on the face of it such flow would work against lineage sorting. It is possible that the authors meant to say 'restriction of gene flow'.

This sentence also makes clear where the conceptual error is located that leads a surprising number of people to the idea that species can be something or other-phyletic. Lineage sorting happens to alleles, and yes, the alleles of a gene occurring inside a sexually reproducing species can be paraphyletic to the alleles occurring inside a different sexually reproducing species. But taxonomists do not classify alleles into species, they classify individuals into species, so this would be another category error.

Far from an exception, anacladogenetic speciation has been considered to be of main importance in plant evolution (Rieseberg & Brouillet, 1994; Anacker & Strauss, 2014). As integrative taxonomy advocates that taxa should reflect evolutionary processes (Stuessy, 2009; Schlick-Steiner et al., 2010), it may be necessary to recognize certain paraphyletic entities.

The argument that Integrative Taxonomy requires paraphyly was not familiar to me. My understanding has always been that Integrative Taxonomy is about combining diverse kinds of evidence to support taxonomic decisions in species delimitation, e.g. a combination of ecological niche, population genetics, and morphology. The seminal Schlick-Steiner paper, for example, was clearly about alpha taxonomy, i.e. species delimitation. Searching it for the snippet "paraph" brings up only one entry in its reference list. (Stuessy is a different story, as he is one of the two or three most vociferous botanists still arguing for paraphyletic taxa; but then again he is not to my understanding a founding figure of Integrative Taxonomy.)

Again the central problem is, however, not what Schlick-Steiner et al. may have thought about paraphyletic taxa, but that Integrative Taxonomy is about species delimitation, where paraphyly applies just as much as decibels apply to colours, and not about supraspecific taxa, where the concept properly applies.

The paragraph ends with something like an argumentum ad populum.

Indeed, examples of recognized paraphyletic taxa exist at various taxonomic levels (e.g. class Reptilia: Mayr & Bock, 2002; Pozoa coriacea Lag.: López et al., 2012; Helichrysum Mill.: Galbany-Casals et al., 2014; Plethodon wehrlei Fowler & Dunn: Kuchta, Brown & Highton, 2018; Columnea strigosa Benth.: Smith, Ooi & Clark, 2018).

The individual species used as examples are irrelevant for the reasons outlined above, because unless they are reproducing clonally, in which case they should have been circumscribed to be monophyletic, they are not paraphyletic but instead tokogenetic (net-like), and cladism does not apply inside tokogenetic structures. That leaves two supraspecific taxa that the taxonomic community has long recognised as ill-circumscribed due to their paraphyly: Reptilia and Helichrysum.

One might point out that Mayr, for example, remained opposed to phylogenetic classification even as he saw it being adopted by the scientific community around him, and that recognition of Reptilia as a paraphyletic taxon is not state of the art in zoology today. The vast majority of animal systematists today classify animals consistently by relatedness.
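To make the terminology concrete, here is a minimal Python sketch of what it means for a supraspecific group to be monophyletic or not. The four-tip tree, its tip names and the helper functions are illustrative assumptions, not taken from the paper under discussion; the topology is deliberately simplified to the uncontroversial point that crocodiles are more closely related to birds than to lizards.

```python
# Toy rooted tree as nested tuples; tip names are strings.
# The topology is a simplified assumption used only for illustration.
AMNIOTES = ("mammals", ("lizards", ("crocodiles", "birds")))

def tips(node):
    """Return the set of tip names at or below a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= tips(child)
    return result

def smallest_clade(node, group):
    """Return the tips of the smallest clade containing every member of group."""
    if not isinstance(node, str):
        for child in node:
            if group <= tips(child):
                return smallest_clade(child, group)
    return tips(node)

def is_monophyletic(tree, group):
    """A group is monophyletic if it equals the smallest clade that contains it."""
    group = set(group)
    return smallest_clade(tree, group) == group

# Crocodiles plus birds form a complete branch.
print(is_monophyletic(AMNIOTES, {"crocodiles", "birds"}))
# 'Reptiles' without birds leave out part of their own branch.
print(is_monophyletic(AMNIOTES, {"lizards", "crocodiles"}))
# Including birds repairs the group.
print(is_monophyletic(AMNIOTES, {"lizards", "crocodiles", "birds"}))
```

The check only makes sense because the input is a tree. Nothing comparable can be written for the individuals inside a sexually reproducing species, which is the point made above.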

But more importantly, there is no way to base the acceptance of paraphyletic Reptilia or Helichrysum on the argumentation presented in this paper, which argues entirely from the existence of hybridogenic and 'budding' speciation. This illustrates an extremely common pattern in papers arguing for paraphyletic taxa: an argument is made that applies inside a species (although even that only if we misconstrue the conceptual basis and actual practice of phylogenetic systematics), and then the entirely unwarranted jump is made to the conclusion that paraphyly should be accepted at a much higher level of classification, where the argument would not apply even if it were correct.

Saturday, June 9, 2018

A particularly striking example of how paraphyletic taxa confuse our thinking about evolution

I recently reread Jason Rosenhouse's Among the Creationists and came across the following extended quote from Stephen Jay Gould, a widely admired and famous evolutionary biologist.

If mammals had arisen late and helped to drive dinosaurs to their doom, then we could legitimately propose a scenario of expected progress. But dinosaurs remained dominant and probably became extinct only as a quirky result of the most unpredictable of all events - a mass dying triggered by extraterrestrial impact. If dinosaurs had not died in this event, they would probably still dominate the domain of large-bodied vertebrates, as they had for so long with such conspicuous success, and mammals would still be small creatures in the interstices of their world. [...] Since dinosaurs were not moving toward markedly larger brains, and since such a prospect may lie outside the capabilities of reptilian design, we must assume that consciousness would not have evolved on our planet if a cosmic catastrophe had not claimed the dinosaurs as victims. (Gould 1989, 318)

The context is the controversy around convergence and contingency in evolution. Rosenhouse discusses convergence as one of the hopes of Christians trying to reconcile evolution and Christian teachings, citing various proponents of the idea that their god set up the universe in a way that human-like intelligence was guaranteed to arise, thus producing beings that can have a "relationship" with said god.

Convergence is, of course, not only an observation considered helpful by the proponents of one variant of theistic evolution. To what degree the organisms that evolved on our planet would again turn out to be kind of similar if we replayed the tape or if organisms on other planets can be expected to look very similar to those on ours are very interesting questions of broad interest. Even an atheist may ask if we can expect lots of other planets where life arose to produce land plants, something a bit like insects, and perhaps even sentient beings given enough time, or if the vast majority of them will, for example, remain populated only by bacteria, because even evolving as much as multicellularity was a rare fluke.

Rosenhouse cites Gould as a well-known proponent of the importance of contingency. Although I tend much more towards the opposite view, I understand Gould's position. I believe the strongest argument for the contingency side is that while there are many impressive cases of convergence, there are also quite a few crucial events in the history of life on this planet that appear to have happened only once: complex eukaryotic cells; colonisation of dry land by multicellular plants; vertebrates; and of course human-like intelligence.

If, for example, the independent evolution of wings by insects, pterosaurs, birds and bats is counted as evidence for the importance of convergence, should something happening only once not be counted as evidence for the importance of contingency? My response would be competition, or in other words the change in the adaptive landscape caused by the first organisms to settle on a new peak. Where there may have been a ridge connecting the niches "kelp" and "large land-living plant" when nobody had occupied the latter, the first lineage to do so quickly became so good at being large land-living plants that the ridge crumbled away and became a canyon. If all land plants were wiped out, however, I would expect the land to be colonised anew, this time perhaps by red or brown algae.

But that is not actually about the main argument Gould is quoted as making in the above excerpt, and not what I found interesting about the quote. To take it in smaller pieces:

If mammals had arisen late and helped to drive dinosaurs to their doom, then we could legitimately propose a scenario of expected progress.

"Expected progress" is a bit of an odd term here. I am not sure if that is what is meant, but it could be read as if any group of animals that does not evolve towards large brains and intelligence is a refutation of the possibility that one group on each planet might evolve towards larger brains. But I do not think that this works as a refutation. And few proponents of the importance of convergence would argue that it is all about one linear progression towards large brains anyway. There are also progressions, for example towards body shapes that work well for swimming, towards paternal care for the young, towards powered flight, etc., and all of these happen at the same time but only in those lineages for which they solve relevant problems or create new opportunities.

If I understand the argument correctly, it is like pointing at a hole in the ground and saying, "if I now throw a pebble into the air and it does not end up in this specific hole, gravity is refuted", whereas the argument for convergence is that, what with evolution throwing thousands of pebbles into the air every year, we are very likely to find a few of them at the bottom of this hole as opposed to half way up its wall.

But dinosaurs remained dominant and probably became extinct only as a quirky result of the most unpredictable of all events - a mass dying triggered by extraterrestrial impact. If dinosaurs had not died in this event, they would probably still dominate the domain of large-bodied vertebrates, as they had for so long with such conspicuous success, and mammals would still be small creatures in the interstices of their world.

Although this is not my field, and I understand that it is an active area of research, I believe it can already be said with some confidence that mass extinction is not random. There are generally some reasons for why an extinction event claims this lineage here but leaves that other one over there largely intact. If a mass extinction of marine life is caused, for example, by a massive drop in the oxygen content of the oceans, then we would expect lineages that can survive under low oxygen conditions to come out in relatively good shape, all things considered, while those with a high oxygen need would be hammered.

In the present case, if we hypothesise that the impact of a large meteorite would have caused massive shockwaves followed by a few years of something like nuclear winter, we could expect the following: Species of small animals may find it easier to survive because they need less food per number of individuals. Bonus points if you have a burrow to hide in when the devastation sweeps across your area (small mammals) or if you can move easily to other areas where a bit more food is left (flight-capable birds). Large animals that can go with little food for long times may also have a good chance, in other words being cold-blooded may help to survive several bad years (crocodiles). If, however, you are large and (!) at the same time you have a high rate of metabolism then you might be in trouble, as you constantly need lots of food per number of individuals. As far as I understand, that describes the non-avian dinosaurs: large and warm-blooded.

The point is, catastrophes do happen from time to time, and once one happened it would probably have decimated the largest animals, even if it had come ten million years later than it did. Their niches are filled up again by small animals evolving to be large (another good example of convergence). What killed off the pterosaur lineage, for example, may well have been that the birds had already out-competed all small pterosaurs, leaving only the very large species when the meteorite struck. But again, this is not my area of expertise really.

Since dinosaurs were not moving toward markedly larger brains, and since such a prospect may lie outside the capabilities of reptilian design, we must assume that consciousness would not have evolved on our planet if a cosmic catastrophe had not claimed the dinosaurs as victims.

And this last part is really what I find the most interesting, because it illustrates so nicely how paraphyletic taxa can confuse the thinking even of the smartest of us, even of experts in evolutionary biology. What is the problem with the argument here?

First, and most obviously, birds are dinosaurs. Second, corvids (crows and ravens) and parrots are highly intelligent. Not quite human-level intelligence, but in some experiments corvids have proved to be smarter even than chimpanzees, our closest relatives. It follows that dinosaurs have actually "moved toward markedly larger brains", meaning here relative to the size of the body as a whole and, crucially, in terms of actual intelligence. Gould's premise is simply false, but his mistake is understandable, because at fault is really a misleading, i.e. non-phylogenetic, classification.

"Outside the capabilities of reptilian design" is, by the way, the same mistake at a deeper phylogenetic level. Mammals were not created fully formed, as mammals. Some of our ancestors were "reptiles", and here we are, having human-like intelligence by definition, what with us being humans and all that, so apparently there was a way of evolving human-like intelligence from a reptilian starting point. And from a fish starting point, and from a worm starting point, and from a bacterial starting point. All it took was lots of time and open niches waiting to be filled.

But I am not saying that anything here decisively refutes the idea that our sentience is a very rare fluke, unlikely to happen again should we go extinct. Maybe it is. The point is really how corrosive paraphyletic taxa are to reasoning about evolutionary processes.

Reference

Gould SJ, 1989. Wonderful Life: The Burgess Shale and the Nature of History. W.W. Norton, New York.

Saturday, May 12, 2018

What are monotypic genera good for?

There are a lot of monotypic genera around. In the group I am currently working on the most, the daisy family Asteraceae in Australia, there are an awful lot of monotypic genera indeed. Why do we need so many of them?

I would argue that there are two different scenarios to be considered. First, however, we need to keep in mind that:

We should classify organisms by their degree of relatedness, meaning that supraspecific taxa (including genera) should be monophyletic, and
while this rule tells us how we should group, it does not tell us how we should rank. There is no genusness to be discovered in nature. Whether it is here in the phylogeny where we call a clade a genus or four nodes deeper down the tree is ultimately an arbitrary human decision.

This may at first suggest that there is no good argument to be had against monotypic genera either. If ranking is arbitrary then a classification consisting entirely of monotypic genera - each species in the tree of life gets its own genus - is just as valid as the current one, so why not?

It is true that this is one of many possible ranking solutions compatible with phylogenetic systematics, but to decide between those many possible ranking solutions we can bring other criteria to bear. And here I would argue that it would be useful to minimise the number of monotypic genera as far as possible. Why? Because I would consider the genus level 'wasted' in many of those cases.

The entire point of a classification is that each taxon provides a piece of information. That information is: The members of this taxon are more closely related to each other than they are to non-members of this taxon. If we have a species, the species-taxon provides this information for all the members of that species. If we now have that species classified in a monotypic genus, the genus-taxon provides... the exact same information over again. It doesn't add anything. It is wasted.

Consequently, I believe that the proper use of monotypic genera is for when they are actually required for phylogenetic classification, but that there is a good argument for sinking them into larger genera whenever things could be made monophyletic without them. Two examples may illustrate the argument.

The above presents a case where the monotypic genus in red is actually needed. There are two genera marked in blue and green, and so obviously the phylogenetically isolated lineage in red cannot be lumped into either of them without making them paraphyletic. It is 'left over' and needs its own genus.

A perfect example for this is the ginkgo tree, Ginkgo biloba, which is a phylogenetically isolated living fossil. It is here photographed as an alley tree in front of our apartment block in Zürich, back when I was a postdoc there.

In the above phylogeny, however, the monotypic genus in red is sister to another genus in blue, and that latter genus isn't very large either. Now I can understand why it might perhaps be desirable to recognise the two as different genera if their divergence happened many tens of millions of years ago and they are morphologically quite distinct. Unfortunately, however, the world is full of monotypic genera that are very young and look exactly like the slightly larger sister genus, but differ from it in a single morphological character.

In those cases, do we really need that kind of taxonomic inflation? What then is the use of the genus rank?
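The two scenarios can be sketched in Python. The original figures are not reproduced here, so both toy topologies, the species and genus names, and the helper functions are assumptions made for illustration only. The question asked of each tree is whether lumping the monotypic genus into a neighbour would still leave a monophyletic genus.

```python
def tips(node):
    """Return the set of tip names at or below a node (tips are strings)."""
    if isinstance(node, str):
        return {node}
    return set().union(*(tips(child) for child in node))

def is_monophyletic(tree, group):
    """True if group equals the smallest clade of tree that contains it."""
    group = set(group)
    node = tree
    while not isinstance(node, str):
        inside = [child for child in node if group <= tips(child)]
        if not inside:
            break
        node = inside[0]
    return tips(node) == group

def can_merge(tree, genera, a, b):
    """True if lumping genera a and b would yield a monophyletic genus."""
    return is_monophyletic(tree, genera[a] | genera[b])

# Hypothetical genus circumscriptions shared by both scenarios.
GENERA = {
    "Red": {"red1"},
    "Blue": {"blue1", "blue2"},
    "Green": {"green1", "green2"},
}

# Scenario 1 (assumed topology): the red lineage is isolated.
ISOLATED = ("red1", (("blue1", "blue2"), ("green1", "green2")))
# Scenario 2 (assumed topology): the red lineage is sister to the blue genus.
SISTER = (("red1", ("blue1", "blue2")), ("green1", "green2"))

print(can_merge(ISOLATED, GENERA, "Red", "Blue"))   # monotypic genus is needed
print(can_merge(ISOLATED, GENERA, "Red", "Green"))  # monotypic genus is needed
print(can_merge(SISTER, GENERA, "Red", "Blue"))     # monotypic genus could be sunk
```

Note that the code only answers the grouping question. Whether a mergeable pair should actually be merged remains a ranking decision, which is the arbitrary part.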

The species that occasioned these ruminations in me is the above Tasmanian daisy tree Centropappus brunonis, which is clearly just a Bedfordia without hairs on the leaves; otherwise the two genera are pretty much indistinguishable. And Bedfordia itself has a mere three species, so it is not as if it would get unmanageably large if they were united.

There are many, many similar cases.

Friday, April 13, 2018

Monophyletic species, kind of

A paper by bryologist Brent Mishler and philosopher of biology John Wilkins has just come out, with the title The Hunting of the SNaRC: A Snarky Solution to the Species Problem. It is open access in the journal Philosophy, Theory, and Practice in Biology, so anybody with internet access can check it out.

Many bloggers have issues that they return to again and again even if they are not necessarily the nominal topics of their blogs - for example, Jerry Coyne frequently posts about Free Will and about students trying to shut down talks by speakers they don't like, and Larry Moran regularly takes apart papers claiming that junk DNA has been disproved. This much less widely known blogger can reliably be coaxed out from behind the oven by at least two such recurring issues: bad arguments for the acceptance of paraphyletic taxa, and the concept of "monophyletic species", which is incoherent in my eyes.

As the title indicates, Mishler & Wilkins present a solution for the species problem, i.e. the perennial question in biology of what 'a species' even is. Especially as the paper is freely accessible it would serve no purpose to summarise its introduction, so I will move immediately to what I find most interesting: their views on how to view species and some pointers on how to do classification at the lowest levels in practice.

Note that I say "their views", plural, deliberately, because this is one aspect of the paper that I have not quite understood yet:

Wilkins has argued in the past that the popular approach of developing a theoretical species concept and then applying it to a potentially recalcitrant reality is a dead end. What biologists should do is the opposite, i.e. consider species as empirical phenomena in need of individual explanations. And here in this paper, Wilkins' argument is reiterated concisely in section 3, A Way Forward: Species Are at Least Initially Phenomena.

What I like about this flip in perspective is that it allows much more flexibility; obviously the empirical phenomena that we generally identify as species, be it popularly or as biologists - generally gaps in morphological or genetic variation - need a different scientific explanation for example in asexual than in sexual species, making one-size-fits-all species concepts difficult to apply.

Mishler, in turn, has argued in the past that species are not a special biological category different from e.g. monophyletic genera and families. The species category is arbitrary, and we should just classify all organisms into nested monophyletic groups, AKA clades, all the way down to the individual specimens. And here in this paper, Mishler's argument is reiterated in sections 4, Rankless Taxonomy, 5, Capturing the SNaRC, and 6, Using SNaRCs in Systematic, Evolutionary, and Ecological Studies.

The thing is, while there is perhaps technically no direct contradiction between those two arguments to the degree that there is a contradiction between "all taxa should be monophyletic" and "taxa should be allowed to be paraphyletic", they appear to be two rather different prescriptions. If I understand correctly, the first says,

We should treat species as empirical phenomena in need of explanation instead of indiscriminately applying a given theoretical concept to them.

The second says,

It makes no sense to even talk of species, we should stop doing so, and here is a single theoretical concept (everything is clades) that we should indiscriminately apply to all specimens.

In fact I am currently unable to see how sections 4-6 and the conclusions of this paper would have to change if section 3 were to be deleted in its entirety. What am I missing?

What I found most useful about this paper was that it has some thoughts on how to do classification into nested clades all the way down to the individual specimens in practice, because that was completely unclear to me in all past instances when this approach was suggested. There are some apparent problems with it, particularly that we need items forming a tree structure to even have clades. It is sometimes difficult to illustrate the issue, but it can perhaps be presented as follows:

The prescription is, as mentioned above, that a classification should be clades (= monophyletic groups) all the way down to individual specimens.
A clade is a complete branch in a tree structure, and usually understood to be specifically a complete branch of a species phylogeny.
In other words, the way the term clade is defined, it applies only in a tree-structure but is inapplicable in a net-like structure.
Sexually reproducing species are systems consisting of individual specimens that have net-like relationships with each other, because they share numerous ancestors instead of one ancestor in each sufficiently earlier generation.
It follows necessarily from the previous two points that the term clade cannot be applied to describe the relationship between specimens if what we are looking at includes multiple specimens from the same sexually reproducing species.
It follows then that it is logically impossible to classify into clades all the way down to these specimens, unless the meaning of the word clade is changed to a degree that the whole purpose of having that word is defeated.
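The difference between the two kinds of structure in the points above can be shown with a minimal Python sketch. The lineage names, the three-generation pedigree and both helper functions are hypothetical; the only thing being illustrated is that a tree has exactly one ancestor per earlier generation, whereas a pedigree of sexually reproducing individuals does not.

```python
def ancestors_by_generation(parents, start, depth):
    """Count the distinct ancestors of start at each of depth generations back."""
    current = {start}
    counts = []
    for _ in range(depth):
        current = {p for node in current for p in parents.get(node, ())}
        counts.append(len(current))
    return counts

def is_tree_like(parents):
    """True if no unit has more than one direct ancestor."""
    return all(len(p) <= 1 for p in parents.values())

# Species as units: every lineage has a single ancestral lineage (assumed toy data).
SPECIES = {
    "sp_A": ["anc_AB"],
    "sp_B": ["anc_AB"],
    "anc_AB": ["anc_ABC"],
    "sp_C": ["anc_ABC"],
}

# Individuals as units: everybody has two parents (assumed toy data).
PEDIGREE = {
    "me": ["mother", "father"],
    "mother": ["maternal_grandmother", "maternal_grandfather"],
    "father": ["paternal_grandmother", "paternal_grandfather"],
}

print(is_tree_like(SPECIES), ancestors_by_generation(SPECIES, "sp_A", 2))
print(is_tree_like(PEDIGREE), ancestors_by_generation(PEDIGREE, "me", 2))
```

In the species data the ancestor count stays at one per generation, so "complete branch" has a meaning. In the pedigree it doubles, so there is no single branch on which an individual sits.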

To my understanding this is why Hennig spent so much time discussing the different ways that specimens (or snapshots of them, which he called semaphoronts) can be related to each other. The relationship between four (non-hybridogenic) species is tree-like, so they can, and should, be classified into clades. But relationships between individuals within a sexually reproducing species are net-like, so they cannot possibly be classified into clades, as the word does not even have a meaning in that structure.

The point at which approaches to classification change is approximately at the species level. Phylogenetic systematics applies only above it, and it uses species as the units that it groups into clades, because if it used any smaller units there would not be clades. This is also why in my opinion one cannot coherently reject the reality of species and be a phylogenetic systematist and, conversely, coherently accept the reality of species and promote paraphyletic taxa, because clades are species that have diversified. Many others, of course, disagree.

Now, what is the practical approach suggested by the present paper? It argues that the terminal units of classification should be "the finest-scale clades that can be convincingly demonstrated with current data", here called Smallest Named and Registered Clades (SNaRCs). Obviously such a 'clade' cannot be based on information from a single gene, as it may show a different history than other genes, for example because of introgression or incomplete lineage sorting. The solution is to use as evidence for monophyly "the preponderance of gene lineages making up a clade", or in other words "congruence among the majority of gene trees and other types of phylogenetic characters available".

On the plus side, this is a very empirical and testable prescription. But consider two thought experiments. First, take three samples A, B and C, look at, say, 100 gene trees, and if 51 of them show ((A,B),C) then A and B form a 'clade', even if all three of them are members of the same sexually reproducing species. Again, that is doable, empirical and testable, and we get a clear answer.

Nonetheless this approach does not convince me at the moment, nor will it even if we assume a scenario of 100 gene trees supporting (A,B), simply because no matter what the gene trees say, in reality there is no tree-structure inside the species. Yes, we can easily sequence for example the DNA of three siblings and run an analysis that will produce a phylogenetic tree for each gene, but in reality these three people just don't have a tree-relationship with each other, so it does not make sense to me to use terminology or a classification that implies there is one.

For the second thought experiment, take three samples D, E, and F, and if 33 gene trees say ((D,E),F), 33 say (D,(E,F)), and 34 say (E,(D,F)), we are inside a SNaRC and should not delimit any more narrowly, even if D is a specimen from an arid zone ephemeral, E from an alpine perennial, and F from a narrow endemic of the northwestern Blue Mountains that only occurs on ironstone-sandstone outcrops, and all three of them are geographically isolated from each other.
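Both thought experiments can be written down as a short Python sketch. It assumes the most literal reading of "majority of gene trees", namely that a pair counts as a 'clade' when more than half of the gene trees place it as sister; that threshold, the function name and the way the remaining 49 trees of the first experiment are split are my assumptions, not something the paper specifies.

```python
from collections import Counter

def majority_pair(gene_trees):
    """Return the sister pair found in more than half of the gene trees, else None.

    Each three-sample gene tree is summarised by the pair it resolves as sister,
    e.g. ("A", "B") stands for ((A,B),C).
    """
    if not gene_trees:
        return None
    counts = Counter(frozenset(pair) for pair in gene_trees)
    pair, n = counts.most_common(1)[0]
    return set(pair) if n > len(gene_trees) / 2 else None

# First thought experiment: 51 of 100 trees show ((A,B),C).
# The 25/24 split of the other trees is an arbitrary assumption.
first = [("A", "B")] * 51 + [("A", "C")] * 25 + [("B", "C")] * 24

# Second thought experiment: 33, 33 and 34 trees for the three resolutions.
second = [("D", "E")] * 33 + [("E", "F")] * 33 + [("D", "F")] * 34

print(majority_pair(first))   # A and B would be called a 'clade'
print(majority_pair(second))  # no pair qualifies, so D, E and F stay lumped
```

The function never asks whether the samples come from the same sexually reproducing species or from three ecologically distinct ones, which is exactly what makes both outcomes uncomfortable.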

This hypothetical case has three very distinct entities that show a lot of gene tree discordance for the genes we used for our analysis. This is a much weaker problem than the previous one because Mishler & Wilkins argue that SNaRCs are, as all scientific hypotheses, tentative and await revision after the examination of more data. Maybe the next 100 gene trees will clinch it for (D,(E,F)), and then at least we could separate out D; more realistically, sampling more individuals of all three species will presumably resolve the three species as three SNaRCs, even if we cannot figure out the relationship of those three SNaRCs with each other (they may even form a true polytomy, and that's fine).

Still it bothers me that in a situation where we unfortunately have only one sample per species available for analysis the approach promoted in the present paper might lead to the tentative lumping of clearly distinct entities. And unless something is added to the approach, or unless I am missing something, it would have to, because it does not seem to include a way of recognising single-specimen SNaRCs except in the case of one being left alone as sister to another SNaRC that, in turn, would still consist of two potentially vastly different specimens. But maybe I am taking this too literally.

On top of that there is perhaps another methodological issue, or again maybe just something I don't understand. It seems to me as if "majority vote of the gene trees" is not actually how multi-locus phylogenetic analyses generally work. To the best of my understanding they reconcile gene trees in rather more complex ways, even in the case of such a simple approach as Gene Tree Parsimony, let alone the multi-gene coalescent model. Many of these approaches actually presuppose the existence of species or populations, and for the same reason as I argued above: what happens within a sexually reproducing lineage is rather different from what happens between such lineages.

More than anything what I find uncomfortable about the approach presented here is that it seems to care not so much about the actual patterns of common descent of what it classifies as about character or gene tree distribution. The difference may come across as subtle, admittedly. What I am trying to say is that I believe phylogenetic systematics should be about classifying organisms by relatedness, by exclusivity of common descent.

I do not, for example, care very much about the fact that most of the ancestral chloroplast genome has been moved over into the nucleus of the host cell, because the chloroplasts are directly descended in an unbroken line from the first cyanobacterium that colonised a plant cell, and the plant species we have today are descended in an unbroken line from that plant cell. To me chloroplasts are a subclade of cyanobacteria and plants are a subclade of eucaryotes, all regardless of what happened to the individual genes.

To use an example from within a species, I have mentioned in the past that it is possible, although statistically unlikely, that I have inherited no genetic material whatsoever from my maternal grandfather, if it just so happened that all the chromosomes my mother gave me were those she got from her mother (the Y chromosome is of course always from the paternal grandfather, by necessity). But even if that were the case we would nonetheless consider it to be an important piece of information that I descended from my maternal grandfather, and I would nonetheless not exist without his involvement. So yes, we use the genes to infer common descent, but the point is really the common descent itself, and the genes are just a data source that can potentially mislead us. Sometimes the right answer may be (A,(B,C)) even if most genes say ((A,B),C).
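For a rough sense of how unlikely that is, here is a back-of-the-envelope sketch. It rests on a strong simplifying assumption that is mine, not something established above: each of the 23 chromosomes a mother passes on is treated as an intact copy from one of her parents, chosen independently with probability 1/2. Real meioses involve crossing over, which should make the true probability smaller still.

```python
# Assumption: each of the 23 chromosomes passed on by the mother (22 autosomes
# plus one X) is an intact grandmaternal or grandpaternal copy, chosen
# independently with probability 1/2. Crossing over is ignored.
N_MATERNAL_CHROMOSOMES = 23

p_none_from_grandfather = 0.5 ** N_MATERNAL_CHROMOSOMES
print(f"P(no maternal-grandfather chromosomes) = {p_none_from_grandfather:.3g}")
print(f"roughly 1 in {round(1 / p_none_from_grandfather):,}")
```

Under these assumptions the chance is about one in eight million: possible, but not something to expect in any given family.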

The "majority vote of the gene trees" approach, however, feels as if its practical concern starts and ends at the pattern shown by the genes, regardless of what the patterns of descent are. To me that feels the wrong way around.

Another way of looking at the issue may be this: If we truly accept the argument made in section 3, that we should look at natural phenomena, consider them to be explananda, and find the most appropriate scientific explanation for each of them, would the logical result not be Hennig's original approach? The phenomenon that a beetle specimen shares more traits with a bee specimen than either share with a slug specimen has an explanation, and that is that the former two share a much more recent common ancestor from which they inherited the shared traits. We express that reality by grouping the former two into a taxon called 'insects' while leaving the slug out.

The fact that I may easily in some cases share more genetic similarity with somebody born in Italy than with another northern German, however, would most likely be due to the stochastic nature of allele inheritance inside our sexually reproducing species. There is no clade wherein two specimens of humanity - the hypothetical Italian and I - share one and only one most recent common ancestor. Instead, beyond some point in the past we share thousands of ancestral 'specimens' in each generation. Because this is a different biological phenomenon than ((beetle,bee),slug), we need a different approach to classification at that level.

Thursday, August 3, 2017

Basal and transitional taxa

Shortly before I left for China I received an alert on an interesting paper:

Bronzati M, 2017. Should the terms 'basal taxon' and 'transitional taxon' be extinguished from cladistic studies with extinct organisms? Palaeontologia Electronica 20.2.3E: 1-12.

As can be expected from this title, Bronzati argues that the terms are misleading and confusing, and that they should not be used. I find myself tending to disagree, at least in part, and not only because of an allergic reaction to being told what words I am not supposed to use because it might confuse 'the public' (cf. free will debate). Before I go over the arguments, however, I would like to clarify where I agree:

First, there are clearly cases where it would be desirable not to use a concept or term because it is really wrong or incoherent, and in some cases even because it is misleading. At the recent conference I flinched at a speaker who said "this individual is paraphyletic". Although I understand what they meant (the utterly trivial and commonplace observation that an individual had two different alleles at a gene locus they had sequenced) such a sentence is Not Even Wrong and has to be based on confusion about, well, pretty much everything that matters in molecular and phylogenetic systematics beyond perhaps how to hold a pipette the right way and click "run analysis" in a few programs. But it is not necessarily the case that the terms basal and transitional suffer from the same problems.

Second, I obviously agree that supraspecific taxa should be monophyletic.

Third, I also agree that evolution is not teleological (with a caveat I will go into below) and that terms such as primitive or advanced are to be avoided, in particular when talking about organisms that live(d) in the same time-slice. And in fact there are very few people left who still think that e.g. mosses are primitive compared to seed plants. Both lineages as they exist today have evolved for precisely the same time. The mosses are certainly not more primitive as mosses than seed plants would be as mosses, they just went completely elsewhere in terms of morphospace and adaptive peaks.

Evolution is a story not of progress but of diversification, and it only looks to us as if there was progress from morphospace position A to position B because life necessarily had to start in some position, and even after a pure random walk some extant organism may still (or again) occupy that starting position or something close to it. A good analogy I once read is to imagine a bunch of people all starting in front of a wall and then milling about aimlessly. Although their movement is random the group will still expand in one direction, away from the wall, because they cannot go in the opposite direction; conversely then, the fact that they are now further away from the wall does not mean that they meant to move in that direction specifically.
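The wall analogy is easy to try out as a toy simulation. The sketch below is only an illustration under assumptions of my own choosing (unit steps, a wall at position 0 that simply blocks steps, 1000 walkers, a fixed random seed); it is not a model of any real evolutionary process.

```python
import random

def mean_distance_from_wall(n_walkers=1000, n_steps=100, seed=1):
    """Simulate walkers taking random unit steps next to a wall at position 0."""
    rng = random.Random(seed)
    positions = [0] * n_walkers
    for _ in range(n_steps):
        for i in range(n_walkers):
            step = rng.choice((-1, 1))
            # The wall blocks any step that would lead to a negative position.
            positions[i] = max(0, positions[i] + step)
    return sum(positions) / n_walkers

for steps in (10, 100, 1000):
    print(steps, round(mean_distance_from_wall(n_steps=steps), 1))
```

No walker has a preferred direction, yet the average distance from the wall should grow with the number of steps, simply because one direction is closed off.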

It is consequently important to keep in mind the "studies with extinct organisms" part of the paper title, because on the question of sorting extant organisms into a ladder of progress all competent evolutionary biologists are agreed anyway. Okay, but what now of extinct taxa, which had in their time not yet undergone the same amount of evolution as the taxa we have today? Are they basal or transitional to the latter?

Bronzati starts by examining whether basal taxa are those that are older than the non-basal ones and observes that fossil ages do not necessarily reflect the ages of the lineages they belong to. He suggests instead to use "'early' and 'late' in an explicit comparative framework". That is very clear, but I do not think that this is how the word basal is meant by most people anyway, and it is certainly not how I would use it. As Bronzati soon observes himself, "'basal' is a relative term regarding the base of the tree" and thus refers to the relative age of lineages, not to the age of fossils.

I am not quite sure I understand the next part, where he writes that "different people certainly have different assumptions of what a 'basal' taxon is" and discusses whether something outside of clade A can be a basal A or not. I'd say not, but again, I think basal is a relative term along a tree topology and not something that I would use in this way.

Now Bronzati turns to the question that I consider the most relevant: "Basal taxa are closer to the root" - precisely that is how I understand the word - "but how to measure it?" But this is also where I think the argumentation becomes a bit odd, because he argues against the use of the term by comparing apples and oranges, and then throwing incomplete sampling into the mix. This will now need an illustration. Consider the following phylogeny:

Bronzati argues against the use of 'basal' by looking at species A, which people would supposedly (?) consider to be basal because they read the tree like a ladder from left to right, and then observing that this species is actually more distant from the root in terms of internal tree nodes than species B. I hope the problem is immediately obvious: species A is not the unit we would be talking about when saying "more basal than B". Would anybody ever actually say that A is basal in the tree? It is clearly fairly nested. Instead, the only use of such terminology that makes sense would be to say that the entire genus Ales (the red box) is basal in the entire family also consisting of the other genera Beles, Celes and Deles (the other three boxes), or more basal than Beles.

And this is where I am willing to be convinced otherwise but at the moment happy to continue using the term basal: if and only if we are talking about the branching order along a phylogeny backbone, along a grade. I will be the first to agree that all supraspecific ranks are arbitrary, but we also have to appreciate that we are using them, or alternatively unranked clade names, nonetheless. This is not so much about evolutionary theory as about having at our disposal non-atrocious language to describe a tree topology. When talking about these genera, what is so problematic about saying "Ales is basal in its family" compared with "Ales is sister to the rest of its family"? At least in my eyes the two statements are equivalent and neither is more misleading than the other. Making it about species A feels like a red herring.

And this is then also all that needs to be said about sampling, because it is based on a similar argument. Bronzati describes a hypothetical tree of all dinosaurs with all of the huge bird clade represented only by the chicken and then jokes he "would hope that no one would suggest that the bird is a basal dinosaur ... based on the number of intervening nodes to the root". No, I don't think anybody would. But maybe we would say the birds (!) are. If, hypothetically, part of the topology were (birds, (dinos2, (dinos3, dinos4) ) ) then yes, I would not have any problem saying that the birds as a whole are more basal in the tree than that other named clade dinos2, for example, because the birds as a whole are quite simply branching off one more inclusive ancestor closer to the root than dinos2.
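The sense in which I would call the birds "more basal" here can be spelled out by counting nodes at the level of the named clades rather than of individual species. The sketch below uses the hypothetical topology from the paragraph above, written as nested tuples; the function name `nodes_to_root` is my own.

```python
def nodes_to_root(tree, depth=0, out=None):
    """Map each named clade to the number of internal nodes on the path
    from the root to that clade (the root itself counts as one node)."""
    if out is None:
        out = {}
    if isinstance(tree, str):
        out[tree] = depth
    else:
        for child in tree:
            nodes_to_root(child, depth + 1, out)
    return out

# Hypothetical backbone: (birds, (dinos2, (dinos3, dinos4)))
backbone = ("birds", ("dinos2", ("dinos3", "dinos4")))

depths = nodes_to_root(backbone)
for clade, depth in sorted(depths.items(), key=lambda item: item[1]):
    print(clade, depth)
```

The count only means something because each clade is treated as a single unit; replacing "birds" with one deeply nested chicken would change the number without changing where the bird lineage as a whole branches off.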

What is really puzzling to me is that Bronzati himself makes the same point two paragraphs later: "it is not terminal taxa (...) that can be 'more basal' in relation to other terminal taxa, but the nodes (i.e. hypothetical ancestors) of the tree in relation to other nodes".

Concluding his discussion of 'basal', Bronzati examines the question of whether basal taxa have more plesiomorphic traits and concludes that they do not, but again this is based on considering in isolation a very derived descendant of the entire clade I would call 'basal'.

He then turns to the term 'transitional'. Here he appears to make two main arguments against its use. First, that evolution has no goal, and second, that phylogenetic trees are branching diagrams instead of ladders.

I have already mentioned above that I agree completely that evolution is non-teleological, but with one caveat, which is this: lineages may discover, for the first time, a new peak in the adaptive landscape, and when that happens we can expect them to evolve up that peak, so that earlier forms would be more poorly adapted to the new situation than their later descendants. Bronzati himself mentions the colonisation of dry land, focusing of course on vertebrates, his specialty. Using the group that I am more familiar with as an example, it seems clear that the early vascular plants started out without roots, and that the lineages that descended from them evolved roots because having those was a pretty good idea on dry land. In fact there are none left that are primarily without roots, presumably because they were out-competed (although there are a few secondary losses under unusual circumstances, e.g. Cuscuta).

I would argue that this, and only this, and only along a time axis, is where we can perhaps meaningfully speak of primitive and advanced, but that is not even the point here, because the term we are dealing with is transitional. More important seems the second argument. Yes, phylogenetic trees are branching diagrams, but they do not merely consist of terminals, they also consist of hypothetical ancestors. It is a bit unclear to me where Bronzati stands on the question of those; on page 6 of his paper, as mentioned, he talks about hypothetical ancestors himself, but here he spends considerable time arguing in a way that suggests that he does not want to identify actual species or fossils as ancestral:

It is important to stress that the absence of autapomorphies in taxa [sic] B does not indicate that it is transitional between A and C-F. Firstly, this might be just a reflex [sic] of the lack of ability to translate different morphologies into phylogenetic characters. Furthermore, the study of living species shows us that even if there is no recognisable morphological difference between [sic], they can differ at the genetic level.

Of course they can, but it remains unclear to me what should keep us from tentatively concluding that some fossil may represent an ancestor until we get additional evidence that shows otherwise, just like pretty much every other conclusion in science is also tentative. And if we have a presumed ancestor we can say that it is transitional between an even earlier presumed ancestor and descendants further down the line. There is no teleology involved here, but the internal nodes of a phylogeny can indeed be read as a ladder of ancestor-descendant relationships.

I am sorry to say I just don't see the problem here either.

Bronzati ends with making four recommendations:

Tree topologies should be described with sister-group statements, avoiding terms like basal or early diverging. My concern is that this will lead to very ugly and repetitive language when describing anything but a very small phylogeny: "our results indicate that A is sister to the rest of the study group. B is then sister to the rest of that rest, and then C is sister to the rest of that rest we just mentioned; now D is sister to the rest of that last rest ...", and so on for another four clades. That is just not very aesthetic. So why not a much more concise "the earliest diverging lineage is A, followed by B, C, and D"?
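That the two ways of describing a grade carry the same information can be shown with a small sketch that produces both from one tree. The nested-pair representation, in which the first element of each pair is the lineage that splits off, and the function name `branching_order` are assumptions made for this illustration.

```python
def branching_order(tree):
    """Walk down the backbone of a ladder-like tree.

    The tree is a nested pair such as ("A", ("B", ("C", ("D", "rest")))),
    where the first element of each pair is the lineage splitting off.
    Returns the lineages in order of divergence and whatever remains.
    """
    order = []
    while isinstance(tree, tuple):
        lineage, tree = tree
        order.append(lineage)
    return order, tree

backbone = ("A", ("B", ("C", ("D", "rest of the study group"))))
order, remainder = branching_order(backbone)

sister_statements = [f"{name} is sister to the rest" for name in order]
concise = (
    f"the earliest diverging lineage is {order[0]}, "
    f"followed by {', '.join(order[1:-1])}, and {order[-1]}"
)
print("; then ".join(sister_statements))
print(concise)
```

Both sentences are generated from the same branching order, so neither says more about the topology than the other; the second is just shorter.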

Instead of calling a terminal taxon a basal member of clade A, we should say it is a non-A member of the next larger named clade around it, as in non-avian dinosaurs. That makes sense, but again, I would never have used basal for a deeply nested terminal anyway but only to discuss the relative position of several clades along a grade.

We should say "this taxon fills a gap in the fossil record" instead of "this taxon is transitional". As mentioned above, I don't see it, perhaps because I have a different approach to internal nodes and species without autapomorphies.

Finally, we should avoid teleological language. No disagreement from me on this one!

Friday, April 28, 2017

Arguments for paraphyletic taxa: orchid taxonomy edition

As usual, the following is my personal opinion and not necessarily the official stance of any person or institution that I am affiliated with or related to, and so on.

One of the recurrent topics of this blog is the controversy around the acceptance of paraphyletic taxa. Although I have become a bit jaded over the years, my original stance was, and to a certain degree still is, that I am trying to understand the reasoning offered by colleagues who think that paraphyletic taxa are acceptable or even unavoidable. Because, who knows?, there may be a novel argument that shows cladism to be misguided after all, and I want to keep an open mind.

Sadly, however, it is mostly the same few talking points that lost the discussion in the 1970s and 1980s, resurfacing again and again. It is rare, although not unheard of, that a new and truly interesting argument is presented.

Today's candidate paper freshly online is

Baranow et al. 2017. Brasolia, a new genus highlighted from Sobralia (Orchidaceae). Plant Systematics and Evolution. DOI 10.1007/s00606-017-1413-z

The authors present phylogenetic analyses and change the classification of the titular orchid genus. The only point of interest for present purposes is that they argue for the recognition of Sobralia section Sobralia at the genus level despite that group being paraphyletic, and in what follows I do not want to imply any criticism of any other part of the publication or of the hard work the authors have put into their study. It is only the theory of classification that I like to hash out.

The argumentation in favour of paraphyletic taxa runs across three paragraphs in the discussion section. Let's see if I can learn something new!

In the light of phylogenetic outcomes, the proposed taxon is paraphyletic, which means that its species have a common ancestor, but the taxon does not include all its descendants (e.g., Elleanthus).

Polyphyletic taxa also have a common ancestor, so by the reasoning implied here one could justify any classification whatsoever. I am consequently unsure what the point of this first sentence is.

Monophyly in its broader definition describes groups with a common ancestry, including both paraphyletic and monophyletic groups (sensu Hennig 1950); therefore, Hörandl and Stuessy (2010) advocate returning to this broader definition of monophyly and, adopting Ashlock's term, holophyly for monophyly s.str.

Again I am afraid I must be missing the point. The controversy is really about whether we should consistently classify by relatedness or not. I don't mean to be uncharitable, but this could potentially be taken to mean the authors hope that recognising non-monophyletic taxa would become more palatable to mainstream systematists if one could hoodwink them into forgetting what monophyletic means. It would then be equivalent to hoping that your child will accept a mountain hike instead of the promised trip to the beach if you just said "mountains are also a kind of beach" with enough conviction. Nice try, but there will still be no swimming in the ocean, and little Tommy sees right through it.

Paraphyly is a natural transition stage in the evolution of taxa (Hörandl and Stuessy 2010). According to Brummitt (2002), paraphyletic taxa are ''products of the evolutionary process, which is left behind as evolution moves on to a new level of organization.''

The logic of these quotations appears to be as follows: "We really, really want to recognise paraphyletic taxa. So we draw a paraphyletic taxon onto the phylogenetic tree. Look, cladist, there is a paraphyletic taxon in the evolutionary process! Why are you so unreasonable not to accept it?" Unfortunately, circular reasoning does not become more convincing just because it has been published somewhere and can now be cited.

To clarify, there are no paraphyletic taxa out there in nature; there is only a tree of life, and phylogenetic systematists consistently circumscribe taxa on that tree to be monophyletic, while 'evolutionary' taxonomists circumscribe some taxa on that tree to be paraphyletic.
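What "circumscribing a taxon on the tree" means can be made explicit with a small check. The topology and the terminal names below are invented purely for illustration and are not the phylogeny published by Baranow et al.; the function names are likewise my own.

```python
def leaves(tree):
    """Return the set of terminal names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def clades(tree):
    """Yield the leaf set of every clade in the tree, terminals included."""
    yield frozenset(leaves(tree))
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)

def is_monophyletic(tree, group):
    """A group is monophyletic if it is exactly the leaf set of some clade."""
    return frozenset(group) in set(clades(tree))

# Hypothetical topology, for illustration only.
tree = ("outgroup",
        ("Sobralia_1", ("Sobralia_2", ("Sobralia_3", "Elleanthus"))))

sobralia_only = {"Sobralia_1", "Sobralia_2", "Sobralia_3"}
print(is_monophyletic(tree, sobralia_only))                   # False
print(is_monophyletic(tree, sobralia_only | {"Elleanthus"}))  # True
```

The tree is the same in both calls; only the circumscription changes, which is the sense in which paraphyly is a property of how we draw the taxon rather than something found in nature.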

We realize that this is in conflict with commonly accepted phylogenetic methods which declare that monophyly s.str. should be the only criterion for grouping organisms.

A "phylogenetic method" is what produced the orchid phylogeny, so I assume what is meant here is "approach to classification". But whatever, that is not the point, so onwards.

However, a somewhat analogical situation has been recognized within Coelogyne (Gravendeel et al. 2001). In this case, the authors interpreted the morphology of the studied species as not corresponding to the cladograms, probably as a result of convergent evolution and they decided to maintain polyphyletic Coelogyne. Kolanowska and Szlachetko (2016) postulate to maintain paraphyletic Odontoglossum.

This appears to be an instance of the argumentum ad populum, and not even very much populum at that. Consider: is it a good idea to shoot a stapler into your own foot? Okay, so there will have been at least two people in the history of humanity who have done that, so you could now cite them for support. But does that make shooting a stapler into your foot any more sensible? Exactly; a better argument is needed here.

Also, as I only realised some time after first drafting this, the senior author of the present paper is the same as in one of those two references. So this is apparently also an instance of the rarely seen ipse dixit. (It is, of course, valid to cite one's own prior research results, but in this case we are dealing not with an empirical question but simply with the argument that an action is acceptable because it is not unprecedented.)

Recognition of distinctive characters which have evolved in a group is essential for an understanding its evolution (Brummitt 2006).

Quite the opposite, in my eyes: having an accurate classification is essential for understanding evolution, because paraphyletic taxa mislead us about relationships. In the present case, treating Elleanthus as a subgroup of Sobralia would (correctly) show that Elleanthus evolved out of Sobralia, whereas treating Sobralia and Elleanthus as separate genera implies (wrongly) that they are evolutionarily distinct units, side by side.

This point of view is shared by numerous other authors (Sosef 1997; Dias et al. 2005; Nordal and Stedje 2005) who state that traditional classification is the optimal tool for cataloging biodiversity and requires recognition of paraphyletic taxa.

This reads like more argumentum ad populum, and sadly it is left unmentioned why paraphyletic taxa are supposedly required.

We decided to follow the Darwinian (evolutionary) classification, which requires consideration of two criteria: similarity and common descent.

Leaving aside the obvious argument from name-checking here, which is exactly as relevant as using Newton to reject Einstein (and for the same reasons), the problem remains that trying to classify by two criteria at the same time will lead to a useless classification that is not reliably reflecting either.

Assume I have never heard of Sobralia before, and then it is mentioned to me for the first time. Given a phylogenetic classification, I know that it constitutes a natural group whose members are each other's closest relatives. Given a classification as argued for in the present paper, it could be a natural group... but it could also be a group defined by similarity that includes species more closely related to another genus than to any other species of Sobralia. I just won't know.

The approach will allow us to propose a classification based on the phylogenetic relationships, but at the same time it will be practical--with clearly defined and recognizable units.

No, sorry to say so, but it quite simply will not. First, it will not be based on phylogenetic relationships, because in one crucial instance phylogenetic relationships will be ignored. Second, and again, it will not be practical, because if two criteria are mixed the end user cannot know without going back to the original publications whether a given group was circumscribed based on relatedness or based on 'similarity', see above.

Now obviously I understand that this is not a theory paper arguing for a wholesale shift in our practice of classification. What is more, I know we cannot expect all solutions to be easy or all groups immediately to be circumscribed as monophyletic the moment somebody looks at them. I can happily accept a paper concluding "we know this group is probably paraphyletic, but for the moment we don't have a better solution, let's wait until more data are in", or "the group is clearly polyphyletic, but at this moment we do not want to make hasty taxonomic changes", or something along those lines.

But the three paragraphs quoted above were specifically meant to justify the ultimate recognition of paraphyletic genera, so one would expect to find a convincing justification. Sadly I, personally, have to admit to being anti-convinced by this paper, by which, as previously mentioned, I mean that an argument has had the effect of making me even more convinced of the idea it was meant to refute, in this case classification by relatedness.

Sunday, April 23, 2017

The unexpected dangers of rerooting phylogenies

A couple of days ago a colleague circulated the following recently published paper,

Czech L, Huerta-Cepas J, Stamatakis A, 2017. A critical review on the use of support values in tree viewers and bioinformatics toolkits. Molecular Biology and Evolution. DOI: 10.1093/molbev/msx055

The authors found something that, in retrospect, seems glaringly obvious. Phylogenetic trees are nearly always saved in the Newick format of nested brackets, for example as follows:

(A:2,(B:1,C:2)99:1);

In this case we are dealing with a rooted tree of only three taxa. A is sister to a clade of B and C. The numbers after the colons indicate branch lengths, and the 99 directly after the brackets is a support value, most likely bootstrap, for the sister group relationship (B,C).

The problem explored by Czech et al. is ultimately that under the Newick format branch support values or other branch annotations are not actually attached to branches; they are attached to nodes. In this case, for example, the 99 is attached to the node that is the hypothetical common ancestor of B and C. Logically, because the tree is rooted we can assume that the support value is meant for the branch leading down from the ancestor of B and C towards the root.
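To see that the format itself makes no distinction between a support value and a node name, here is a minimal hand-written parser for the example above. It is a sketch that only handles plain names, internal node labels and branch lengths (no quoting, comments or other extensions), and it is not how any of the programs discussed by Czech et al. are implemented.

```python
def parse_newick(text):
    """Parse a small subset of Newick into nested dicts.

    Each node becomes {"label": str, "length": float or None, "children": list}.
    The string must end with a semicolon.
    """
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            pos += 1  # skip ")"
        # Whatever follows, up to the next delimiter, is the node's label.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        label = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1  # skip ":"
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"label": label, "length": length, "children": children}

    return parse_node()

root = parse_newick("(A:2,(B:1,C:2)99:1);")
bc_node = root["children"][1]
print(repr(bc_node["label"]), [child["label"] for child in bc_node["children"]])
```

The "99" ends up in exactly the same field as "A", "B" and "C": as far as the syntax is concerned it is the name of the node above B and C, and reading it as the support of a branch is a convention layered on top.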

But what if we reroot, a posteriori, a tree whose node annotations are really meant to be branch annotations? (My post on the various options for rooting phylogenies can be found here.) Czech et al. found that the behaviour of these values is undefined. For some software they were able to demonstrate that the branch annotation ended up on the wrong branch after rerooting.
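One way to picture what goes wrong is to translate node labels into what they are meant to describe, namely the split of the taxa induced by the branch above the node, and to compare trees on that basis. The three trees below are written out by hand as hypothetical examples: an unrooted five-taxon tree saved with a pseudo-root, the same tree rerooted on A with the support moved along correctly, and a version where the label naively stays on its old node. This is my own illustration of the general idea and does not reproduce the behaviour of any particular program.

```python
def leaf_set(node):
    """Leaves below a node; a node is a taxon name or (support, [children])."""
    if isinstance(node, str):
        return frozenset([node])
    _support, children = node
    return frozenset().union(*(leaf_set(child) for child in children))

def split_supports(tree):
    """Map each labelled branch, expressed as a bipartition of all taxa, to its support."""
    all_taxa = leaf_set(tree)
    supports = {}

    def visit(node):
        if isinstance(node, str):
            return
        support, children = node
        below = leaf_set(node)
        if support is not None and below != all_taxa:
            # The branch above this node separates `below` from everything else.
            supports[frozenset([below, all_taxa - below])] = support
        for child in children:
            visit(child)

    visit(tree)
    return supports

# ((A,B)90,(C,D)80,E); saved with a trifurcating pseudo-root
original = (None, [(90, ["A", "B"]), (80, ["C", "D"]), "E"])
# Rerooted on A, with the 90 following the split AB|CDE
rerooted_correctly = (None, ["A", (None, ["B", (90, [(80, ["C", "D"]), "E"])])])
# Rerooted on A, with the 90 left on the node that used to join A and B
rerooted_naively = (None, ["A", (90, ["B", (None, [(80, ["C", "D"]), "E"])])])

print(split_supports(original) == split_supports(rerooted_correctly))  # True
print(split_supports(original) == split_supports(rerooted_naively))    # False
```

In the naive version the 90 now sits on the branch separating A from everything else, while the split AB|CDE that it was computed for carries no support at all.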

How serious an issue is that? I guess it depends on what one's practice is. The problem should be pretty much limited to analyses producing unrooted trees (e.g. in RAxML, PAUP or MrBayes) under the assumption of reversibility, where the user then uses outgroup rooting to polarise the tree a posteriori. Any analysis using a clock model would avoid it, as would asymmetric step-matrices or, crucially, those analyses specifying the outgroup before the start of the analysis.

In addition, it seems as if the problem would be limited to a few branches between the pseudo-root used to save unrooted trees and the new root after rerooting, so that most relationships should be fine. I may look at one or two of my published phylogenies to see if I ever had that problem, but I am not worried; in the most recent case where support values were a critical part of my argumentation, for example, they are fairly deep inside the tree, because we sampled widely around the ingroup, and I also used Templeton tests and suchlike to demonstrate the non-monophyly of certain taxa.

Apparently Czech et al. have already achieved some success at getting software providers to make changes that will help solve the confusion around where the branch annotations end up. But nonetheless my main take-home from this is to be less blasé about a posteriori rooting. In the future I will make sure to always define an outgroup already when I set up a PAUP or RAxML run, so that the need to reroot does not arise.

Wednesday, January 11, 2017

A new phylogenetic classification of ferns

Just a couple days ago a new, phylogenetic classification of ferns and lycophytes was published. In parallel with the well-established Angiosperm Phylogeny Group, the authors call themselves the Pteridophyte Phylogeny Group.

I was happy to see that publication for several reasons. The PPG includes several people I know or am friends with, so good for them. As mentioned, the classification is sensibly phylogenetic, so good for science. And personally I have a great (although non-research) interest in ferns, so I find it simply good to get this update.

So, what do we learn? What is the state of the art?

The lycophytes are well established as sister to all other vascular plants, and the main groups Lycopodiaceae, Isoetaceae and Selaginellaceae are not in any doubt whatsoever. What surprised me was the degree to which the Lycopodiaceae have been atomised into numerous mid-sized to monotypic genera. I have no idea what degree of divergence is recognised in this subdivision, but it seems a bit odd next to the Selaginellaceae with their single large genus.

The remaining groups together form what I have so far known as the monilophytes, but here they are called Polypodiopsida. They are sister to the extant seed plants.

A monilophyte group that I am particularly fond of are the horsetails (genus Equisetum), here ranked all the way up to subclass. Pity there are none native to Australia. Next are the morphologically reduced Psilotales with their traditional genera Tmesipteris and Psilotum, again nothing unexpected.

Sister to them are the odd Ophioglossales, which, however, present the very same surprise as the Lycopodiaceae. I thought I knew them as featuring four easily explained genera: Ophioglossum with usually undivided fertile and sterile parts of the frond, Botrychium with pinnately divided leaves, and the somewhat palmately divided and monotypic Helminthostachys and Mankyua, the latter only described in 2002 and endemic to one Korean island (!). But the two larger genera are here divided into several genera, a tradition that I apparently was completely unaware of.

The last eusporangiate group are the Marattiaceae, a clade of large tropical ferns that had its heyday in the Carboniferous. As far as I can tell the classification is not much changed from what I read years ago.

Finally, we have the large diversity of the leptosporangiate clade. Many groups I am insufficiently familiar with to appreciate any potential changes in classification, but the higher level structure is well known. There is a grade of smaller groups - king ferns, filmy ferns, Gleicheniaceae & relatives, Schizaeaceae & Lygodiaceae, water ferns and tree ferns - and at its end the speciose Polypodiales clade, where it really gets complicated.

Apart from a few Polypodiales families that have piqued my interest, like Aspleniaceae or Dennstaedtiaceae, I really have no even half-informed opinion here. The point is that I can now use this new publication as a tour guide to figuring out where in the system I am when I next run into one of those generic, large-fronded, rosette-growing round ferns.

Reference

The Pteridophyte Phylogeny Group 2016. A community-derived classification for extant lycophytes and ferns. Journal of Systematics and Evolution 54: 563-603.

Sunday, December 11, 2016

Cladistics textbook, part 2

Coming back to the textbook

Kitching IJ, Forey PL, Humphries CJ, Williams DM, 1998. Cladistics second edition - the theory and practice of parsimony analysis. The Systematics Association Publication No. 11. Oxford Science Publications.

..., in my previous post I mentioned that I also ran into a section that I find hard to agree with. The chapter on support values opens with the following:

Page 118: The study of phylogeny is an historical science, concerned with the discovery of historical singularities. Consequently, we do not consider phylogenetic inference per se to be fundamentally a statistical question, open to discoverable and objectively definable confidence limits. Hence, we are in diametric opposition to those who would include such a standard statistical framework as part of cladistic theory and practice.

I can only repeat in slightly different words what I wrote some time ago about the same question in the context of biogeographic studies. I find it hard to draw a line between historical science and non-historical science, not least because, to take just one example, any physical experiment, be it ever so reproducible, turns into a singular historical event a split second after it has been conducted.

To me there is really no big difference. We always infer what is most likely to have happened in individual instances in the past and then draw more general conclusions from those instances, no matter whether it is history or social science, archeology or engineering, paleobotany or (extant) plant taxonomy, evolutionary biology or population genetics.

I assume that a big part of the difference in perspective here is about what organismal characters people are thinking of. Reading through the cladistics textbook, the focus is pretty much always on morphology. Reading through works that introduce likelihood or Bayesian phylogenetics, in other words probabilistic and model-based evolutionary analysis, the focus is pretty much always on nucleotide sequence data, with protein sequence data coming a distant second.

It makes sense to me that somebody who thinks predominantly in terms of trait shifts like the evolution of bird feathers from scales or of angiosperm gynoecia from ovules sitting nakedly on a stalk would have reason to favour parsimony analysis. In fact I myself, despite frequently using likelihood and Bayesian phylogenetics for sequence data, would still have to be counted among those who are highly sceptical whether the Mk model works better with morphological traits than parsimony.

These kinds of characters have very low homoplasy, at least if scored correctly; and where they do show homoplasy, I would say that is due to a scoring error that can be rectified (e.g. if double fertilisation has evolved independently in angiosperms and gnetophytes then the two should be scored as separate character states). And it just so happens that parsimony analysis is a better tool for the data the less homoplasy there is. What is more, it seems a bit odd to try and apply the same model to all morphological characters, given how vastly different they are.

It also makes a lot of sense to me that somebody who thinks predominantly in terms of trait shifts like an A in the DNA sequence turning into T would see reason to favour analyses using models of sequence evolution. As Prof. Bromham pointed out during her talk I heard a few weeks ago, if that A has changed into a T in two parallel instances and then all the A-carrying individuals died out there is no way in which we can ever find evidence for that.

In other words, in the case of our four letter soup of DNA sequence characters homoplasy is not a scoring error to be discovered by looking closer but a hard fact of life that we cannot rid ourselves of (except to the degree that we can choose slower-evolving markers). And it just so happens that parsimony analysis is a worse tool for the data the more homoplasy there is, while the right model-based approach can deal with that. (Or at least somewhat better - obviously, once homoplasy is so rampant that all signal is lost no phylogenetic method will work, and likelihood analysis has also been shown to suffer from long branch attraction.) What is more, it seems logical to apply the same model to all DNA sequence characters, given that they are equivalent nucleotides along a chain.

So when I call myself a cladist, what I mean is not that I prefer parsimony analysis for all data, but that I acknowledge Willi Hennig's legacy, the idea that systematists should classify consistently by relatedness.

Tuesday, November 29, 2016

Cladistics textbook

In my office I have two 'proper' phylogenetics textbooks, that is counting only those that cover the principles and theory as opposed to offering only a practical how-to manual. One is Felsenstein's, who is strongly associated with likelihood phylogenetics, although his book covers all approaches. The second is:

Kitching IJ, Forey PL, Humphries CJ, Williams DM, 1998. Cladistics second edition - the theory and practice of parsimony analysis. The Systematics Association Publication No. 11. Oxford Science Publications.

As the title implies, it is entirely about parsimony phylogenetics.

Having recently looked into Kitching et al., I noticed two short sections that I found interesting enough to discuss here. I will start with the question of ancestors. Proponents of paraphyletic taxa often make claims on the lines of cladists "not accepting the existence of ancestral species", of "ignoring ancestors", or of "treating all species as sister taxa".

Here now we have a textbook written by cladists, in other words the official version, to the degree that an official version exists. It is, of course, not as easy as that because the only thing that unites cladists in the sense of what paraphylists argue against is that supraspecific taxa should be monophyletic. Many other details differ from cladist to cladist, and in the sense of what paraphylists argue against the concept of cladist includes those who use e.g. Bayesian phylogenetics.

I also do not want to give the impression that I, personally, take what Kitching et al. promote on this or that detailed question to necessarily be The Correct View. It is well possible that I, a cladist, find myself in disagreement with some chapter of that textbook. I am not even arguing here, in this instance, that making taxa monophyletic is the way to go (although of course I do believe that).

No, the point of the post is merely this: if Kitching et al. argue not-XYZ, then this demonstrates decisively that any claim of all cladists arguing XYZ is nonsense.

So, about ancestors, and turning to page 14 of the textbook:

In fact, to date, Archaeopteryx has no recognized autapomorphies. Indeed, if there were, Archaeopteryx would have to be placed as the sister-group to the rest of the birds.

It does not matter here whether more recent analyses have demonstrated Archaeopteryx to have autapomorphies and to actually have been a side branch relative to modern birds. We should here simply think of any species that looks exactly as the ancestral species of a later-existing clade is inferred to have looked.

It should be clear that the above section is correct. An ancestral species would not have any systematically useful characters relative to its descendants, because that descendant clade would have started out as that species. My view - and here other cladists may differ - is actually that the ancestral species and the clade are one and the same. The ancestral species has over time turned (diversified) into the clade.

In terms of unique characters, Archaeopteryx simply does not exist. This is absurd, for its remains have been excavated and studied. To circumvent this logical dilemma, cladists place likely ancestors on the cladogram as the sister-group to their putative descendants and accept that they must be nominal paraphyletic taxa (Fig. 1.9c). Ancestors, just like paraphyletic taxa in general, can only be recognized by a particular combination of characters that they have and characters that they do not have. The unique attribute of possible ancestors is the time at which they lived.

Here is the reason why paraphylists complain about ancestors being treated as sister to their descendants: they are treated like that, crucially, so that we can do the analysis. It is a practical, not a philosophical reason.

Note also that at least the cladists who wrote the textbook do not have any problem with paraphyletic species. Whether or not we think that this use of the word paraphyletic makes sense (I do not), it is discussions like this one which make me groan in frustration whenever I read a paraphylist claim that cladists only accepted paraphyletic species as a cop-out once they could no longer deny that they existed. No, cladism was founded on the principle that monophyly applies above the species level, so it never had to backpedal like that.

After a cladistic analysis has been completed the cladogram may be reinterpreted as a tree (see below)

What they mean here is that they see a cladogram as such (merely) as a different visualisation of the data from the data matrix, while the "tree" is the cladogram's interpretation in terms of evolutionary relationships, of actual genealogical relatedness of the terminals.

and at this stage some palaeontologists may choose to recognize these paraphyletic taxa as ancestors, particularly when they do not overlap in time with their putative descendants (see Smith 199a for a discussion).

And this is the main point. Here we have a group of senior cladists who wrote, to put it in the simplest possible terms, "we need to treat every species as a terminal to get a cladogram, but then if you wish you can interpret a terminal without autapomorphies as an ancestor".

It is as if the people who claim that cladists do not accept the existence of ancestors haven't even bothered to figure out what any cladists really think.

Next time I will look at a short section of the textbook that I definitely disagree with.

Friday, November 4, 2016

CBA seminar on molecular phylogenetics

Today I went to a Centre for Biodiversity Analysis seminar over at the Australian National University: Prof. Lindell Bromham on Reading the story in DNA - the core principles of molecular phylogenetic inference. This was very refreshing, as I have spent most of the year doing non-phylogenetic work such as cytology, programming, species delimitation, and building identification keys.

The seminar was packed, the audience was lively and from very diverse fields, and the speaker was clear and engaging. As can be expected, Prof. Bromham started with the very basics but had nearly two hours (!) to get to very complicated topics: sequence alignments, signal saturation, distance methods, parsimony analysis, likelihood phylogenetics, Bayesian phylogenetics, and finally various problems with the latter, including choice of priors or when results merely restate the priors.

The following is a slightly unsystematic run-down of what I found particularly interesting. Certainly other participants will have a different perspective.

Signal saturation or homoplasy at the DNA level erases the historical evidence. Not merely: makes the evidence harder to find. Erases. It is gone. That means that, strictly speaking, we cannot infer or even estimate phylogenies, even with a superb model; we can only ever build hypotheses.
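To put a rough number on saturation, here is a minimal sketch in Python. It assumes the Jukes-Cantor model, which is my choice for illustration and not something taken from the talk. Under that model the expected proportion of sites that differ between two sequences is p = 3/4 (1 - exp(-4d/3)), where d is the number of substitutions per site that have actually happened. The function name is made up for this example.

```python
import math

def expected_p_distance(d):
    """Expected proportion of differing sites under Jukes-Cantor
    after d substitutions per site have actually occurred."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Compare what happened (d) with what we can still see (p).
for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"{d:4.1f} substitutions/site -> {expected_p_distance(d):.3f} observed")
```

The observed difference flattens out just below 0.75 no matter how many more changes pile up, so beyond a certain point very different histories produce practically the same data.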

Phylogenetics is a social activity. The point is that fads and fashions, irrational likes and dislikes, groupthink, the age of a method, and quite simply the availability and user-friendliness of software determine the choice of analysis quite as much as the appropriateness of the analysis. Even if one were able to show that parsimony, for example, works well for a particular dataset one would still not be able to get the paper into any prestigious journal except Cladistics. And yes, she stressed that there is no method that is automatically inappropriate, even distance or parsimony. It depends on the data.

Any phylogenetic approach taken in a study can be characterised with three elements: a search strategy, an optimality criterion, and a model of how evolution works. For parsimony, for example, the search strategy is usually heuristic (not her words, see below), the optimality criterion is the minimal number of character changes, and the implicit model is that character changes are rare and homoplasy is absent.
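To make the parsimony optimality criterion a bit more concrete, the following is a small sketch of how the number of character changes can be counted for one unordered character on one given rooted binary tree. It uses Fitch's algorithm, which is my pick for illustration; the talk did not go into this. The tree encoding as nested pairs and all names are assumptions made up for the example.

```python
def fitch(tree, states):
    """Return (state set, number of changes) for a rooted binary tree.

    tree is either a terminal name (str) or a pair (left, right);
    states maps each terminal name to its observed character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra change needed.
        return common, left_cost + right_cost
    # No shared state: one change must have happened on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1

states = {"A": "T", "B": "T", "C": "A", "D": "A"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(fitch(tree_1, states)[1])  # 1 change
print(fitch(tree_2, states)[1])  # 2 changes
```

The first tree needs one change and the second needs two, so under the parsimony criterion the first is preferred for this character. The search strategy, i.e. which trees get scored like this in the first place, is a separate matter.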

The more sophisticated the method, the harder it gets to state its assumptions. Just saying out loud all the assumptions behind a BEAST run would take a lot of time. Of course that does not mean that the simpler methods do not make assumptions - they are merely implicit. (I guess if one were to spell them out, they would then often be "this factor can safely be ignored".)

Nominally Bayesian phylogeneticists often behave in very un-Bayesian ways. Examples are use of arbitrary Bayes factor cut-offs, not updating priors but treating every analysis as independent, and frowning upon informative topology priors.

Unfortunately, in Bayesian phylogenetics priors determine the posterior more often than most people realise. This brought me back to discussions with a very outspoken Bayesian seven years ago; his argument was "a wrong prior doesn't matter if you have strong data", which if true would kind of make me wonder what the point is of doing Bayesian analysis in the first place.
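A toy illustration of both sides of that argument, deliberately not phylogenetic: estimating a simple proportion with a Beta prior, where the posterior has a closed form. The priors, the counts and the function name are all made up for this sketch.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a proportion under a Beta(prior_a, prior_b) prior
    after observing `successes` out of `trials` (conjugate update)."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

strong_wrong_prior = (50, 5)   # prior belief centred near 0.91
flat_prior = (1, 1)            # uniform prior

# Same observed proportion (0.3), once with little and once with lots of data.
for successes, trials in ((3, 10), (3000, 10000)):
    informed = posterior_mean(*strong_wrong_prior, successes, trials)
    flat = posterior_mean(*flat_prior, successes, trials)
    print(trials, round(informed, 3), round(flat, 3))
```

With ten observations the wrong prior drags the estimate up to about 0.82, while with ten thousand it barely matters. Which of the two situations a given phylogenetic dataset resembles is, I suppose, exactly the question.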

However, Prof. Bromham also said a few things that I found a bit odd, or at least potentially in need of some clarification.

She implied that parsimony analysis generally used exhaustive searches. Although there was also a half-sentence to the effect of "at least originally", I would stress that search strategy and optimality criterion are two very different things. Nothing keeps a likelihood analysis from using an exhaustive search (except that it would not stop before the heat death of the universe), and conversely no TNT user today who has a large dataset would dream of doing anything but heuristic searches. Indeed the whole point of that program was to offer ways of cutting even more corners in the search.
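To show why exhaustive searches are off the table for anything but small datasets, whatever the optimality criterion, here is a short sketch that counts the possible unrooted, fully bifurcating trees for n terminals, which is (2n - 5)!! for n of three or more. The function name is made up for this example.

```python
def unrooted_binary_trees(n_taxa):
    """Number of distinct unrooted, fully bifurcating trees for n_taxa
    labelled terminals: (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5)."""
    if n_taxa < 3:
        return 1
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (5, 10, 20, 50):
    print(n, unrooted_binary_trees(n))
```

Ten terminals already give about two million trees, and the count grows faster than exponentially from there, so for realistic datasets every method ends up with some kind of heuristic search.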

Parsimony analysis is also a form of likelihood analysis. Well, I would certainly never claim, as some people do, that it comes without assumptions. I would say that parsimony has a model of evolution in the same sense as the word model is used across science, yes. I can also understand how and why people interpret parsimony as a model in the specific sense of likelihood phylogenetics and examine what that means for its behaviour and parameterisation compared to other models. But calling it a subset of likelihood analysis still leaves me a bit uncomfortable, because it does not use likelihood as a criterion but simply tree length. Maybe I am overlooking something, in fact most likely I am overlooking something, but to me the logic of the analysis seems to be rather different, for better or for worse.

One of the reasons why parsimony has fallen out of fashion is that "cladistics" is an emotional and controversial topic; this was illustrated with a caricature of Willi Hennig dressed up as a saint. I feel that this may conflate Hennig's phylogenetic systematics with parsimony analysis, in other words a principle of classification with an optimality criterion. Although the topic is indeed still hotly debated by a small minority, phylogenetic systematics is today state of the art, even as people have moved to using Bayesian methods to figure out whether a group is monophyletic or not.

The main reasons for the popularity of Bayesian methods are (a) that they allow more complex models and (b) that they are much faster than likelihood analyses. The second claim surprised me greatly because it does not at all reflect my personal experience. When I later discussed it with somebody at work, I realised that it depends greatly on what software we choose for comparison. I was thinking BEAST versus RAxML with fast bootstrapping, i.e. several days on a supercomputer versus less than an hour on my desktop. But if we compare MrBayes versus likelihood analysis in PAUP with thorough bootstrapping, well, suddenly I see where this comes from.

These days you can only get published if you use Bayesian methods. Again, that is not at all my experience. It seems to depend on the data, not least because huge genomic datasets can often not be processed with Bayesian approaches anyway. We can see likelihood trees of transcriptome data published in Nature, or ASTRAL trees in other prestigious journals. Definitely not Bayesian.

In summary, this was a great seminar to go to especially because I am planning some phylogenetics work over summer. It definitely got the old cogs turning again. Also, Prof. Bromham provided perhaps the clearest explanation I have ever heard of how Bayesian/MCMC analyses work, and that may become useful for when I have to discuss them with a student myself...

Sunday, August 21, 2016

Monophyletic species yet again: a recent example

Recently I received a publication alert for a soon to be published manuscript. It gave me reason once more to write about the issue of "monophyletic" species.

I do not want to give the impression I am deliberately picking on this particular paper. On the one hand, for all I know its data are completely awesome and its conclusions are valid; the issue I am writing about here is somewhat tangential to the paper anyway, its main focus being on genus level phylogeny. On the other hand, this same issue can be seen in many, many other papers in the field, as all too many systematics lectures at universities seem to leave it at "stuff must be monophyletic", without explaining the relevant background like the various relationships that OTUs can have to each other and, crucially, that different classification approaches apply to different relationships. So the occasion here is really only that the present paper showcases the issue in an extremely compact format, all condensed down to a mere three sentences in the discussion.

In full they run as follows:

However, while cladists debate whether higher level taxonomic groups should be monophyletic (e.g. Horandl and Stuessy, 2010; Schmidt-Lebuhn, 2012), it is conceivable that species need not be monophyletic, as different modes of speciation may have different phylogenetic outcomes (Rieseberg and Brouillet, 1994); non-monophyly is an expected intermediate state as taxa diverge (Avise and Ball, 1990). Indeed, in a morphology-based survey of 206 Australian plant species and subspecies (Proteaceae and Fabaceae), Crisp and Chandler (1996) estimated that 21% were paraphyletic. In addition, eucalypt taxonomists generally follow the ecological species concept that allows for hybridisation between taxa (Johnson, 1976), and such reticulation can cause non-monophyly and incongruence between morphological and genetic markers (e.g. Rutherford et al., 2016).

So what do I find problematic about these three sentences? Going through again in order...

However, while cladists debate whether higher level taxonomic groups should be monophyletic (e.g. Horandl and Stuessy, 2010; Schmidt-Lebuhn, 2012),

First, Hoerandl and Stuessy are not cladists. Second, of course cladists do not debate if supraspecific taxa should be monophyletic, because a cladist is defined as somebody who has already decided that supraspecific taxa should be monophyletic. If you still debate it you are by definition Not A Cladist. Compare "vegetarians debate whether they should stop eating meat". If they are still pondering that question they are Not Vegetarians.

it is conceivable that species need not be monophyletic,

This is where we get to the real issue: monophyly of species. Admittedly we first have to ask, are we talking about sexually reproducing species here? But I assume we are, because the article is about eucalypts, and they are mostly sexual. And the thing is, if that is the case then the concept of monophyly just simply does not apply. It is a word that describes a group of terminals on a rooted tree graph, like so:

A good analogy to describe what is going on is this. Imagine you have a real tree in front of you, and you are taking a group of twigs off using secateurs. To get a monophyletic group, you need to cut exactly once. To get a non-monophyletic group, you need to cut more than once. (If after cutting off a non-monophyletic group you have kept only one piece in your hand, it is paraphyletic; if you have kept several pieces but thrown out what used to connect them, it is polyphyletic. But that just as an aside.)
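The "cut exactly once" test from the analogy can be sketched in a few lines of Python. The nested-tuple encoding of the tree and the function names are assumptions made up for this illustration; the point is only that a group is monophyletic if some single node has exactly that group as its set of terminals.

```python
def leaves(tree):
    """Set of terminal names below a node; a tree is a terminal name (str)
    or a tuple of subtrees."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result

def is_monophyletic(tree, group):
    """True if one single cut separates exactly `group` from the rest,
    i.e. if some node has exactly `group` as its set of terminals."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)

tree = ((("A", "B"), "C"), ("D", "E"))
print(is_monophyletic(tree, {"A", "B", "C"}))  # True: one cut
print(is_monophyletic(tree, {"A", "B", "D"}))  # False: needs several cuts
print(is_monophyletic(tree, {"B", "C"}))       # False: A would come along
```

Note that the whole procedure presupposes a tree as input; there is no way to even ask the question if every terminal has more than one parent.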

Within a sexually reproducing species, AKA a breeding group, there is no tree structure but a network structure, as individuals have numerous ancestors in each generation as opposed to one. That looks like this:

How do you get a monophyletic group? Well, you cannot, it is impossible. You could argue that my secateurs analogy would work for paraphyly even in a network, but only by jumping over the "imagine you have a real tree in front of you" part. You don't have a tree, you have a fishnet.

So to me this fragment - and everything that follows - makes as much sense as "it is conceivable that songs need not be yellow with purple stripes". Of course they don't - Amy Winehouse's Rehab, for example, is not yellow with purple stripes and yet it is a perfectly acceptable song. But then again I have no idea how it could be yellow with purple stripes, even if one were to try and make it so.

as different modes of speciation may have different phylogenetic outcomes (Rieseberg and Brouillet, 1994); non-monophyly is an expected intermediate state as taxa diverge (Avise and Ball, 1990).

Opponents of Phylogenetic Systematics regularly make the argument that "non-monophyly is an expected intermediate state as taxa diverge" at all taxonomic levels. At the supraspecific level, this constitutes a rather clear example of circular reasoning. A cladist would argue that a subclade cannot diverge from the larger clade it is part of, ever, because it is by definition part of that clade. That is what the sub- part of subclade means.

At the species level, on the other hand, the above fragment makes sense if we think of incomplete lineage sorting. Barring recombination, the copies or alleles of an individual gene do indeed evolve in a tree-like fashion, and the alleles found in one species will at first generally be paraphyletic to the copies found in its sister species. Only over time will selection or even loss through purely stochastic processes (genetic drift) make the alleles from each species monophyletic on the gene tree, a process known as lineage sorting.

It is possible that this is what the authors are referring to. But to me it still does not mean that it makes sense to call a species paraphyletic, because the components of a species are not gene copies but individuals, and individuals of the same sexually reproducing species stand in a network-relationship to each other, so that the word paraphyletic does not apply.

Indeed, in a morphology-based survey of 206 Australian plant species and subspecies (Proteaceae and Fabaceae), Crisp and Chandler (1996) estimated that 21% were paraphyletic.

Although I would not use the terms as they did (see above), the conclusions of the Crisp & Chandler paper are completely in accord with what I am saying here: species are special, because they are the level at which and from which on downwards it does not make sense any more to try and make stuff monophyletic. That being said, however, the paper does show species as paraphyletic on several morphology-based trees. How did it arrive at that result? Or in other words, given what I wrote earlier, what is the difference in perspective?

First, the terminals on the trees in Crisp & Chandler, the OTUs, are not actually individuals but groups of individuals, such as populations or subspecies; second, the authors conducted phylogenetic analyses on these OTUs. What that means is that the OTUs are forced into a tree-relationship even if the true relationship is net-like, because that is what phylogenetic analyses do. But if we are really talking about structures within a breeding group, within a sexually reproducing species, then in my eyes that analysis was just not appropriate because yes, the true relationship is net-like instead of tree-like. (And if the OTUs are not in a net-like, reticulating relationship, but instead genetically isolated, separate evolutionary lineages, then why aren't they recognised as species?)

For example, I can jot down some morphological traits of a bunch of fellow humans, make a data matrix, and do a phylogenetic analysis. Because I do a phylogenetic analysis, the analysis will invariably return a tree. But does that mean that each of my OTUs - individual humans - had only a single parent, and only a single grandparent? Of course not, because we humans do not have a branching, tree-like relationship to each other either. The analysis simply made assumptions that do not hold up against reality.

In addition, eucalypt taxonomists generally follow the ecological species concept that allows for hybridisation between taxa (Johnson, 1976), and such reticulation can cause non-monophyly

Unfortunately it is left unclear what items are forming a non-monophyletic group in those situations. If it is alleles, see a few paragraphs further up; if it is individuals, see the immediately preceding section.

and incongruence between morphological and genetic markers (e.g. Rutherford et al., 2016).

That is true, but we could also mention several other processes, like the aforementioned incomplete lineage sorting, meaning that we can have such incongruence even in the complete absence of hybridisation.

To close this post I would like to present a little paragraph that shows how the three sentences discussed above read to me:

However, while opponents of Scottish independence debate whether Scotland should be independent, it is conceivable that citizens need not be independent nations as different ways of acquiring citizenship may have different political outcomes; not being a geographic entity is an expected intermediate state as nations become independent countries. Indeed, in a survey of 206 individual citizens, Doe & Average (2010) estimated that 21% of them were not independent nations. In addition, political scientists usually follow a concept of citizenship that allows dual citizenship, and such reticulation can cause nations not being independent from other nations and incongruence between native language and nationality.

Again, this could be part of a great article on the Scottish independence movement, just like the present paper presents interesting genomic data on its study genus. But does this read as if the author was just a tiny bit confused about the difference between nations and the citizens that nations consist of? Quite so.

A lot of unproductive controversy and confusion among systematists and evolutionary biologists could be avoided if it became a bit more widely known what even ur-cladist Willi Hennig himself had in mind when he came up with the idea of "making stuff monophyletic". He was only arguing that supraspecific taxa should be monophyletic groups of species; the concept of species being monophyletic groups of individuals would not have made any sense to him, as he was very clear on the difference between usually tree-like (phylogenetic) relationships between species and net-like relationships within them.

Reference

Crisp MD, Chandler GT, 1996. Paraphyletic species. Telopea 6(4): 813–844.