Wednesday, January 2, 2013

Simplistic molecular phylogenetics and incomplete lineage sorting

By now, a whole generation of botanists has been trained with molecular techniques. While the 1980s and early 1990s were the heyday of morphology-based cladistic analyses, most phylogenetic studies today are conducted with DNA sequence data.

However, many higher level relationships in plants are by now well understood. Not all, of course - the exact phylogenetic relationships of the major groups of seed plants, for example, are still problematic, especially with regard to the closest relatives of the flowering plants. But we now have a fairly good understanding of the phylogeny of the flowering plants, and can be confident that most major plant families have either been confirmed to be monophyletic (daisies, orchids, grasses, etc.) or else recircumscribed to make them so (mint family, heath family, etc.). Consequently, attention is increasingly turning to lower taxonomic levels, to infer relationships in groups of closely related species. This is important both to further improve our classification and as a prerequisite for studies in evolutionary biology, biogeography, and other areas.

In the light of this shift in focus, it is especially troubling to realize that a surprising number of colleagues have a very simplistic view of how to use molecular data for phylogenetics - a view that is fairly unproblematic at higher taxonomic levels but fatal at the lower ones. Let me explain.

The by now traditional approach - and of course I am using it myself often enough - is to grab one specimen of each species in our group of interest, extract DNA, and then generate sequence data for a limited number of regions. In plants that would usually be one very easily amplifiable region from the nuclear genome (ribosomal DNA, or rDNA) and a few chloroplast regions (cpDNA). In animals, a particularly popular marker is the cytochrome c oxidase subunit I (COI) from the mitochondrion.

Various methods can then be employed to infer the phylogenetic relationships of the sequence copies that have been found in the specimens. The methods are not the point of this post, so we will now just assume that they are reliable. The point I want to make is this: the phylogenetic history of any one gene region is not necessarily the same as the phylogenetic history of the study species; the analyses based on sequences obviously give us the former; what we really want to know, however, is the latter, the species phylogeny. It is utterly crucial to be clear about this distinction.

There are two reasons why gene trees and species trees can differ. The more obvious one is that there could be rare gene flow between two species, mediated by partially fertile hybrids; this process is known as introgression. The second possibility, which is the focus of this post, is that the study species still retain genetic diversity that they inherited from their common ancestral species.

Imagine we start with one species A that is, at first, genetically homogeneous, i.e. it has the same single "haplotype" for a specific gene region in all its individuals. Now let it happily exist for a million years, and random mutations will produce various different haplotypes for that gene region, which diverge from the original one in a tree-like fashion and "inhabit" this species as a group of diversifying species inhabits an island. Finally, with a good diversity of haplotypes available in it, we imagine the species diverging into two descendant species B and C, not through some serious bottleneck but simply through some barrier to gene flow splitting A into two isolated populations of equal size. All the various haplotypes that exist in A can be inherited by B, by C, or by both descendant species. In fact, it is entirely possible that, at the very beginning, both B and C contain all haplotypes that were found in their ancestor A.

Assume now that the process continues. B and C themselves diverge into two descendant species each, so that we now have four species D, E, F and G in our thought experiment, and all of them inherit some part of the ancestral haplotype diversity of A plus more diversity generated during the existence of B and C. The problem with the one-sample-per-species approach I described above should be immediately obvious, especially if we look at the figure accompanying this post: Although the true phylogeny is ((D,E),(F,G)), we can randomly grab samples from the four extant species that lead to extremely different gene phylogenies - like the ones I marked in red in that figure. Worse, if a species is sexually reproducing, it will freely recombine its chromosomes every generation, and so gene regions from different chromosomes will have very different gene phylogenies, any of which may or may not be congruent with the species phylogeny.

Species tree (grey) with embedded gene tree (black); this is from a real-life dataset although edited to remove a fifth species. The red haplotype lineage shows how examining too few samples can mislead: it suggests D and F as sister species although evidence from more samples indicates that D and E are most closely related.
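To put a rough number on how often a single sample per species misleads, here is a minimal sketch for the simplest possible case. It is not the dataset from the figure. It assumes three species with the true tree ((D,E),F), one sampled copy per species, no gene flow, and the standard neutral coalescent with a constant diploid effective population size. Under those assumptions the D and E copies fail to coalesce on the internal branch with probability exp(-T/(2Ne)), and if they fail, only one of the three equally likely pairings recovers the species tree. The names `p_concordant`, `simulate_concordance`, `t_generations` and `ne` are made up for this illustration.

```python
import math
import random


def p_concordant(t_generations, ne):
    """Probability that a one-sample-per-species gene tree matches ((D,E),F).

    t_generations: length of the internal branch (ancestor of D and E), in generations.
    ne: diploid effective population size on that branch (assumed constant).
    """
    return 1.0 - (2.0 / 3.0) * math.exp(-t_generations / (2.0 * ne))


def simulate_concordance(t_generations, ne, reps=100_000, seed=1):
    """Monte Carlo check of p_concordant under the same assumptions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Waiting time (in generations) until the D and E copies coalesce;
        # exponential with mean 2 * ne.
        wait = rng.expovariate(1.0 / (2.0 * ne))
        if wait < t_generations:
            hits += 1  # coalesced on the internal branch: gene tree matches
        elif rng.random() < 1.0 / 3.0:
            hits += 1  # three copies reach the root; D+E pair first by chance
    return hits / reps


if __name__ == "__main__":
    ne = 10_000
    for t in (1_000, 10_000, 40_000, 200_000):
        print(t, round(p_concordant(t, ne), 3), round(simulate_concordance(t, ne), 3))
```

The shorter the internal branch is relative to the population size, the closer the chance of getting the right answer from one gene region drops towards one in three, which is no better than guessing.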

Two questions arise: If that is so, why does the diversity of haplotypes in any and all species not increase ad infinitum? And under those circumstances, how do we ever figure out what the true species phylogeny is?

As for the first, there is of course only limited space for haplotypes to exist within a species. If we are talking about a nuclear single copy gene region, and the species consists of a million diploid individuals, then there will be two million copies of the gene region in existence; a chloroplast region in the same species has only one million copies (more really, but we can generally assume all copies in the same individual to be identical). While generation after generation lives and reproduces, the frequency of each haplotype will randomly go up or down, and some will randomly disappear from the species, a process known as genetic drift. If this happens for long enough, so many branches of the haplotype phylogeny will die out within a species that the haplotypes that are left in it have become monophyletic. This process is called lineage sorting, and the situation before monophyly of sequence copies is achieved is consequently called incomplete lineage sorting.
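The drift process described above can be sketched with a toy simulation. This is a bare-bones Wright-Fisher style model under loud assumptions: a fixed number of gene copies, non-overlapping generations, no mutation, no selection, and every copy in the next generation picking its parent copy at random. The function name `drift` and its parameters are hypothetical, and the population sizes are kept tiny so that it runs in an instant.

```python
import random


def drift(n_copies, n_haplotypes, generations, seed=0):
    """Return the number of surviving haplotypes in each generation.

    n_copies: number of gene copies in the species (e.g. 2N for a nuclear
              region in N diploid individuals, N for a chloroplast region).
    n_haplotypes: number of distinct haplotypes at the start, at equal frequency.
    """
    rng = random.Random(seed)
    pop = [i % n_haplotypes for i in range(n_copies)]
    history = [len(set(pop))]
    for _ in range(generations):
        # Each copy of the next generation descends from a random parent copy.
        pop = rng.choices(pop, k=n_copies)
        history.append(len(set(pop)))
    return history


if __name__ == "__main__":
    nuclear = drift(n_copies=200, n_haplotypes=10, generations=2000, seed=1)
    chloroplast = drift(n_copies=100, n_haplotypes=10, generations=2000, seed=1)
    print("nuclear haplotypes left every 200 generations:", nuclear[::200])
    print("chloroplast haplotypes left every 200 generations:", chloroplast[::200])
```

Because nothing in this sketch creates new haplotypes, the count can only go down. In a typical run the region with fewer copies tends to lose its diversity sooner, which is one reason why chloroplast markers are expected to sort faster than nuclear ones.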

Of course, even if lineage sorting has been achieved for all gene regions, the problem still remains that any individual gene phylogeny may not be representative of the species phylogeny, and that several gene phylogenies may tell different stories. So how do we infer a species phylogeny at all?

The relevant tools here are called species tree methods. The idea goes back at least to Maddison (1997), and a good overview has been provided by Knowles (2009). Basically, we need to examine several individuals per species instead of only one, and preferably several independent gene regions. The species tree methods then try to infer the most parsimonious or most likely explanation for how the present species would have ended up with their respective shares of the various gene trees. For example, under the assumption that lineage sorting gradually takes place over time, the MDC (minimizing deep coalescences) method searches for the species phylogeny where the inferred splits in the embedded gene trees are as recent as possible.
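To make the MDC idea a bit more concrete, here is a minimal sketch of its scoring step only. It assumes rooted trees written as nested tuples, every species being represented by at least one sample in every gene tree, and the usual "extra lineages" way of counting: for each branch of a candidate species tree, count the gene lineages that have to pass through its upper end and subtract one. It does not search tree space but merely compares candidates handed to it, and all function names (`leaves`, `lineages_leaving`, `deep_coalescences`, `best_species_tree`) are made up for this illustration rather than taken from any published software.

```python
def leaves(tree):
    """Set of leaf labels of a nested-tuple tree; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def lineages_leaving(gene_tree, species_set, species_of):
    """Number of maximal gene-tree clades whose samples all belong to species_set."""
    if all(species_of[sample] in species_set for sample in leaves(gene_tree)):
        return 1
    if isinstance(gene_tree, str):
        return 0
    return sum(lineages_leaving(c, species_set, species_of) for c in gene_tree)


def deep_coalescences(gene_tree, species_tree, species_of):
    """Sum of extra lineages over all non-root branches of the species tree."""
    total = 0
    stack = [(species_tree, True)]
    while stack:
        node, is_root = stack.pop()
        if not is_root:
            n = lineages_leaving(gene_tree, leaves(node), species_of)
            total += max(0, n - 1)
        if not isinstance(node, str):
            stack.extend((child, False) for child in node)
    return total


def best_species_tree(candidates, gene_trees, species_of):
    """Candidate species tree with the fewest deep coalescences overall."""
    return min(
        candidates,
        key=lambda st: sum(deep_coalescences(g, st, species_of) for g in gene_trees),
    )


if __name__ == "__main__":
    species_of = {"d": "D", "e": "E", "f": "F", "g": "G"}
    true_tree = (("D", "E"), ("F", "G"))
    misleading = (("d", "f"), ("e", "g"))   # like the red lineage in the figure
    print(deep_coalescences((("d", "e"), ("f", "g")), true_tree, species_of))
    print(deep_coalescences(misleading, true_tree, species_of))
```

With several samples per species, the sample labels would simply map many-to-one onto species in `species_of`, for example "d1" and "d2" both pointing to "D".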

Anyway, the details are not really important here and can be read up elsewhere. The take home messages are these:
  • It is okay to infer high level phylogenies with one sample per species and few gene regions because there was more than enough time for lineage sorting to take place and for really big differences to accumulate between the reciprocally isolated lineages.
  • But it is problematic to use the same approach at very low taxonomic levels, e.g. to examine relationships between very closely related species, because they will share some ancestral genetic diversity, because individual gene regions can be expected to tell you different stories, and because some of them can be expected to be incongruent with the real species phylogeny.
  • Just concatenating all your data and ignoring the problem is not an acceptable solution. At best, you will end up with the wrong answer, at worst with nothing but a big polytomy.
  • The solution is to examine several (>=10?) samples per species, preferably representing most of its geographic distribution, and to use species tree methods.
That being said, science never makes the claim that we can know anything with complete certainty; it just follows the best evidence available to where it leads. Phylogenetics and systematic biology are no different: using species tree methods and throwing a lot of data at the problem is the best we can do, but our inferences are always tentative.


Knowles LL, 2009. Estimating species trees: Methods of phylogenetic analysis when there is incongruence across genes. Systematic Biology 58: 463-467.
Maddison WP, 1997. Gene trees in species trees. Systematic Biology 46: 523-536.


  1. Nice post! I just have a few questions to clarify my understanding. Is there a difference between haplotypes and alleles? I'm a little confused what a haplotype is. If there is a difference, incomplete lineage sorting can occur with alleles, right?

    I've skimmed over this paper by A.V.Z. Brower called "Gene trees, species trees, and systematics: a cladistics perspective" and he seems to question the validity of the gene-species tree argument. I was wondering what your thoughts are on this.


  2. I would say they are kinda the same for a given value of "the same", that is they are the same sequence but seen from a different perspective.

    If one says allele one is looking at the sequence from the perspective of occupying a slot in an organism or in a population, in competition for that space with other such alleles. So one would most likely say "this individual is homozygous for the major allele" instead of haplotype.

    If one says haplotype one is looking at the sequence from the perspective of being one of several homologous sequences in one's dataset in a phylogeographic study, for example. So one would make a diagram showing how often each sequence was sampled from each population, and in those cases one would rarely call them alleles.

    And then there are of course genomes that are only ever haplotypic, such as those of chloroplasts or mitochondria.

    I do not have access to the Brower article until tomorrow, but note that the end of the abstract says that the issue "may be [less] severe" than thought by others. That does not mean non-existent, and it should also be noted that Brower is a zoologist. They do seem to have it a bit easier than botanists with respect to distinctness of species, perhaps partly because at least land animals consciously select their partners instead of waiting for some random pollen to arrive. At any rate, the existence of the problem cannot really be denied at this stage.

  3. Ah, thanks for the clarification. If I recall correctly, ILS tends to be more significant if the time between successive speciation events is short and if effective population sizes are large, right? After all, under a basic model of genetic drift it takes on average 4Ne generations for a newly arisen neutral allele/haplotype to become fixed in a diploid population.