However, many higher level relationships in plants are by now well understood. Not all, of course - the exact phylogenetic relationships of the major groups of seed plants, for example, are still problematic, especially with regard to the closest relatives of the flowering plants. But we now have a fairly good understanding of the phylogeny of the flowering plants, and can be confident that most major plant families have either been confirmed to be monophyletic (daisies, orchids, grasses, etc.) or else recircumscribed to make them so (mint family, heath family, etc.). Consequently, attention is increasingly turning to lower taxonomic levels, to infer relationships in groups of closely related species. This is important both to further improve our classification and as a prerequisite for studies in evolutionary biology, biogeography, and other areas.
In the light of this shift in focus, it is especially troubling to realize that a surprising number of colleagues have a very simplistic view of how to use molecular data for phylogenetics - a view that is fairly unproblematic at higher taxonomic levels but fatal at the lower ones. Lemme explain.
The by now traditional approach - and of course I am using it myself often enough - is to grab one specimen of each species in our group of interest, extract DNA, and then generate sequence data for a limited number of regions. In plants that would usually be one very easily amplifiable region from the nuclear genome (ribosomal DNA, or rDNA) and a few chloroplast regions (cpDNA). In animals, a particularly popular marker is the cytochrome c oxidase subunit I (COI) gene from the mitochondrion.
Various methods can then be employed to infer the phylogenetic relationships of the sequence copies that have been found in the specimens. The methods are not the point of this post, so we will now just assume that they are reliable. The point I want to make is this: the phylogenetic history of any one gene region is not necessarily the same as the phylogenetic history of the study species; the analyses based on sequences obviously give us the former; what we really want to know, however, is the latter, the species phylogeny. It is utterly crucial to be clear about this distinction.
There are two reasons why gene trees and species trees can differ. The more obvious one is that there could be rare gene flow between two species, mediated by partially fertile hybrids; this process is known as introgression. The second possibility, which is the focus of this post, is that the study species still retain genetic diversity that they inherited from their common ancestral species.
Imagine we start with one species A that is, at first, genetically homogeneous, i.e. it has the same single "haplotype" for a specific gene region in all its individuals. Now let it happily exist for a million years, and random mutations will produce various different haplotypes for that gene region, which diverge from the original one in a tree-like fashion and "inhabit" this species as a group of diversifying species inhabits an island. Finally, with a good diversity of haplotypes available in it, we imagine the species diverging into two descendant species B and C, not through some serious bottleneck but simply through some barrier to gene flow splitting A into two isolated populations of equal size. All the various haplotypes that exist in A can be inherited by B, by C, or by both descendant species. In fact, it is entirely possible that, at the very beginning, both B and C contain all haplotypes that were found in their ancestor A.
Assume now that the process continues. B and C themselves diverge into two descendant species each, so that we now have four species D, E, F and G in our thought experiment, and all of them inherit some part of the ancestral haplotype diversity of A plus more diversity generated during the existence of B and C. The problem with the one-sample-per-species approach I described above should be immediately obvious, especially if we look at the figure accompanying this post: Although the true phylogeny is ((D,E),(F,G)), we can randomly grab samples from the four extant species that lead to extremely different gene phylogenies - like the ones I marked in red in that figure. Worse, if a species is sexually reproducing, it will freely recombine its chromosomes every generation, and so gene regions from different chromosomes will have very different gene phylogenies, any of which may or may not be congruent with the species phylogeny.
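To put a rough number on how often one sample per species misleads us, here is a minimal sketch for the simplest possible case: three species with the true phylogeny ((D,E),F) and a single gene copy sampled from each. It assumes the standard neutral coalescent without gene flow, and it measures the time between the two speciation events in coalescent units (generations divided by twice the effective population size for a diploid nuclear region). The function names and the three-species setup are my own simplifications for illustration, not part of any particular software package.

```python
import math
import random


def concordance_probability(t):
    """Chance that the gene tree matches the species tree ((D,E),F).

    t is the time between the two speciation events in coalescent units
    (an assumption of this sketch: neutral coalescent, no gene flow).
    """
    return 1.0 - (2.0 / 3.0) * math.exp(-t)


def simulate_concordance(t, replicates=100_000, seed=1):
    """Estimate the same probability by simulating the coalescent directly."""
    rng = random.Random(seed)
    concordant = 0
    for _ in range(replicates):
        # Waiting time until the D and E copies find their common ancestor.
        if rng.expovariate(1.0) < t:
            # They coalesce in the ancestor of D+E, before F's copy joins.
            concordant += 1
        elif rng.randrange(3) == 0:
            # All three copies reach the root population unsorted; each of
            # the three possible pairs is equally likely to join first, and
            # only the (D,E) pair reproduces the species tree.
            concordant += 1
    return concordant / replicates


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 2.0, 5.0):
        print(t, round(concordance_probability(t), 3), round(simulate_concordance(t), 3))
```

When the two speciation events follow each other immediately (t = 0), the gene tree is right only one time in three, which is no better than guessing. With five coalescent units between them, it is right more than 99% of the time.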
Two questions arise: If that is so, why does the diversity of haplotypes in any and all species not increase ad infinitum? And under those circumstances, how do we ever figure out what the true species phylogeny is?
As for the first, there is of course only limited space for haplotypes to exist within a species. If we are talking about a nuclear single copy gene region, and the species consists of a million diploid individuals, then there will be two million copies of the gene region in existence; a chloroplast region in the same species has only one million copies (more really, but we can generally assume all copies in the same individual to be identical). While generation after generation lives and reproduces, the frequency of each haplotype will randomly go up or down, and some will randomly disappear from the species, a process known as genetic drift. If this happens for long enough, so many branches of the haplotype phylogeny will die out within a species that the haplotypes that are left in it have become monophyletic. This process is called lineage sorting, and the situation before monophyly of sequence copies is achieved is consequently called incomplete lineage sorting.
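Genetic drift is easy to simulate, so here is a small sketch of the process just described. It assumes a very simple model (constant population size, non-overlapping generations, no mutation, no selection) and a toy species of 50 diploid individuals instead of a million, so that it runs in a moment. It tracks how long it takes until only one of ten ancestral haplotypes is left, which is a crude stand-in for complete lineage sorting. All names and numbers are made up for illustration.

```python
import random


def generations_to_fixation(n_copies, n_haplotypes=10, seed=0,
                            max_generations=1_000_000):
    """Generations of pure drift until a single ancestral haplotype remains."""
    rng = random.Random(seed)
    # Start with the ancestral haplotypes at (nearly) equal frequencies.
    pool = [i % n_haplotypes for i in range(n_copies)]
    generation = 0
    while len(set(pool)) > 1 and generation < max_generations:
        # Every copy in the next generation descends from a randomly
        # chosen copy in the current one.
        pool = rng.choices(pool, k=n_copies)
        generation += 1
    return generation


def mean_generations_to_fixation(n_copies, replicates=100, n_haplotypes=10):
    """Average over independent replicates, one seed per replicate."""
    total = sum(generations_to_fixation(n_copies, n_haplotypes, seed=r)
                for r in range(replicates))
    return total / replicates


if __name__ == "__main__":
    individuals = 50
    print("nuclear (2 copies per individual):",
          mean_generations_to_fixation(2 * individuals))
    print("chloroplast (1 copy per individual):",
          mean_generations_to_fixation(individuals))
```

The chloroplast region, with half as many copies in the species, sorts out in roughly half the time. That is one reason why chloroplast and nuclear regions can disagree in closely related species.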
Of course, even if lineage sorting has been achieved for all gene regions, the problem still remains that any individual gene phylogeny may not be representative of the species phylogeny, and that several gene phylogenies may tell different stories. So how do we infer a species phylogeny at all?
The relevant tools here are called species tree methods. The idea goes back at least to Maddison (1997), and a good overview has been provided by Knowles (2009). Basically, we need to examine several individuals per species instead of only one, and preferably several independent gene regions. The species tree methods then try to infer the most parsimonious or most likely explanation for how the present species would have ended up with their respective shares of the various gene trees. For example, under the assumption that lineage sorting gradually takes place over time, the MDC (minimize deep coalescences) method searches for the species phylogeny where the inferred splits in the embedded gene trees are as recent as possible.
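To make the deep coalescence idea a bit more concrete, here is a hedged sketch of the scoring step only. Trees are written as nested Python tuples with species names at the tips, which is a representation I chose for convenience. For every branch of a candidate species tree, the code counts how many gene lineages are still separate at the top of that branch; each lineage beyond the first is one "extra" lineage that failed to sort. Real species tree programs also search through possible species trees, which this sketch does not attempt: it merely compares two candidates that I picked by hand. It also assumes that every species is represented in every gene tree.

```python
def species_below(tree):
    """Set of species names at the tips of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(species_below(child) for child in tree))


def lineages_leaving(gene_tree, cluster):
    """Gene lineages still separate at the top of a species-tree branch.

    Counts the largest gene-tree clades whose tips all belong to the set
    of species (cluster) below that branch.
    """
    if species_below(gene_tree) <= cluster:
        return 1
    if isinstance(gene_tree, str):
        return 0
    return sum(lineages_leaving(child, cluster) for child in gene_tree)


def deep_coalescences(species_tree, gene_tree, is_root=True):
    """Extra lineages summed over all branches of the species tree."""
    cost = 0
    if not is_root:
        leaving = lineages_leaving(gene_tree, species_below(species_tree))
        cost += max(0, leaving - 1)
    if not isinstance(species_tree, str):
        cost += sum(deep_coalescences(child, gene_tree, is_root=False)
                    for child in species_tree)
    return cost


if __name__ == "__main__":
    # Hypothetical gene trees: two agree with ((D,E),(F,G)), one does not.
    gene_trees = [
        (("D", "E"), ("F", "G")),
        (("D", "E"), ("F", "G")),
        (("D", "F"), ("E", "G")),
    ]
    candidates = [
        (("D", "E"), ("F", "G")),
        (("D", "F"), ("E", "G")),
    ]
    for candidate in candidates:
        total = sum(deep_coalescences(candidate, g) for g in gene_trees)
        print(candidate, total)
```

The candidate with the lowest total is the preferred species tree under this criterion. Several individuals of the same species can be included simply by repeating the species name at several tips of a gene tree.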
Anyway, the details are not really important here and can be read up elsewhere. The take home messages are these:
- It is okay to infer high level phylogenies with one sample per species and few gene regions because there was more than enough time for lineage sorting to take place and for really big differences to accumulate between the reciprocally isolated lineages.
- But it is problematic to use the same approach at very low taxonomic levels, e.g. to examine relationships between very closely related species, because they will share some ancestral genetic diversity, because individual gene regions can be expected to tell you different stories, and because some of them can be expected to be incongruent with the real species phylogeny.
- Just concatenating all your data and ignoring the problem is not an acceptable solution. At best, you will end up with nothing but a big polytomy; at worst, with a well-resolved but wrong answer.
- The solution is to examine several (>=10?) samples per species, preferably representing most of its geographic distribution, and to use species tree methods.
Knowles LL, 2009. Estimating species trees: Methods of phylogenetic analysis when there is incongruence across genes. Systematic Biology 58: 463-467.
Maddison WP, 1997. Gene trees in species trees. Systematic Biology 46: 523-536.