Saturday, March 16, 2013

So why is incomplete lineage sorting not an issue at higher taxonomic levels?

When I wrote about incomplete lineage sorting some time back, my main point was that it is an often neglected problem at lower taxonomic levels. One cannot assume that one will infer the true species phylogeny based on only one sample per species and/or only one gene region because many species may have inherited some ancestral polymorphism. Different genes and different individuals may tell conflicting stories. (A good example is the publication from a year ago that found a good portion of the gorilla genome to be more closely related to that of humans or chimpanzees although the overall evidence clearly shows the latter two to be sister species.)

I mentioned also that incomplete lineage sorting is not a problem at higher taxonomic levels, such as when we want to figure out whether a genus or subfamily of plants is monophyletic. The same was recently stated confidently when I was meeting with a few colleagues over lunch.

But why, actually? One might wonder how a problem that makes it harder to infer the true relationships of closely related species A, B, C and D, with some genes saying ((A,B),(C,D)) and others saying ((A,C),(B,D)), would suddenly disappear fifteen million years later. Surely if we want to infer the phylogenetic relationships of four clades A', B', C' and D' that have descended from those four species we will run into precisely the same problem?

Well, I guess so: if these four most closely related species all diversified into clades over those fifteen million years that are happily alive today, then the problem remains because all information that is available to infer the relationships of A'-D' is the information that we could have used to infer the relationships of A-D when they were still only four closely related species. A gene that was fixed in the four lineages so that it tells the story ((A,C),(B,D)) although the real species phylogeny is ((A,B),(C,D)) would mislead us equally in both situations.

No, the real difference between the two situations becomes clear when we think about the likelihood of four species from fifteen million years ago all surviving - it is vanishingly small. Most of everything goes extinct. Just as most seeds do not get to be mature plants and most eggs do not get to be mature animals, most species do not diversify into clades but instead go extinct, and most small clades do not diversify into large clades but instead go extinct.
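To put a rough number on "vanishingly small", here is a minimal sketch using the textbook constant-rate birth-death model. The model, the function name and the rates are my assumptions for illustration only, not estimates for any real group. It computes the probability that a single species still has at least one living descendant after a given time, and then the probability that four independent species all manage this.

```python
import math


def survival_probability(birth, death, time):
    """Probability that one lineage has at least one descendant after `time`.

    Constant-rate birth-death process; `birth` and `death` are events per
    lineage per unit time (hypothetical values, see below).
    """
    if math.isclose(birth, death):
        # Critical case: speciation and extinction rates are equal.
        return 1.0 / (1.0 + birth * time)
    r = birth - death
    decay = math.exp(-r * time)
    extinct = death * (1.0 - decay) / (birth - death * decay)
    return 1.0 - extinct


# Hypothetical rates: one speciation and one extinction per lineage per
# million years, over the fifteen million years used in the text.
p_one = survival_probability(1.0, 1.0, 15.0)
p_four = p_one ** 4  # assumes the four species' fates are independent

print(f"one species leaves descendants: {p_one:.4f}")
print(f"all four leave descendants:     {p_four:.2e}")
```

With equal rates the survival probability is simply 1/(1 + rate x time), so 1/16 for a single species under these made-up numbers, and the fourth power of that for all four. If speciation clearly outpaces extinction the numbers look less bleak, so the conclusion depends on how close the two rates really are.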

Because of that, it is quite unlikely that the crown groups of the deeper clades whose relationships to each other we want to infer today are derived from a group of very closely related species (their stem groups necessarily are, by definition, but that is beside the point because we cannot sample anything but the crown group). Instead of clades A'-D' in a relationship of
((A',B'),(C',D'))
what we will mostly find are clades A'-D' in a relationship of
(0,(0,((0,(A',((0,(((0,0),((0,(0,(0,0))),(0,(0,(0,0))))),(0,0))),((0,(0,(0,(0,(0,0))))),((0,(0,(0,(0,(0,(0,(0,(0,(0,(0,(0,(((0,(0,(0,(0,0)))),((0,(0,0)),((B',0),(0,0)))),(0,(0,0)))))))))))))),(0,(0,0))))))),(((0,(0,(0,0))),(0,0)),(0,((0,0),(((0,((0,0),((((0,0),(0,0)),(0,(0,(0,0)))),(0,(0,(0,(0,((C',(0,0)),(0,(0,0)))))))))),((0,(0,(0,((0,(0,0)),(0,0))))),((0,0),(0,(0,0))))),(((D',0),(0,(0,0))),(0,0)))))))))
with "0" representing all the related species that have gone extinct over those fifteen million years and thus will never turn up in a molecular analysis.
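What a molecular analysis actually gets to see can be sketched with a few lines of Python. The functions below are my own illustrative helpers, not part of any phylogenetics library, and the example tree is a much smaller hypothetical one than the tree above. They parse a label-only Newick string, drop every tip labelled "0" and collapse the nodes that are left with a single child.

```python
def parse_newick(text):
    """Parse a label-only Newick string like "((A,0),(B,C))" into nested tuples."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(node())
            pos += 1  # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    return node()


def prune_extinct(tree, extinct="0"):
    """Remove extinct tips and collapse nodes left with only one child."""
    if isinstance(tree, str):
        return None if tree == extinct else tree
    kept = [c for c in (prune_extinct(child, extinct) for child in tree)
            if c is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def to_newick(tree):
    """Write nested tuples back out as a Newick string."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


# Hypothetical example: each surviving clade sits among extinct relatives.
clade_a = "(0,(A',(0,0)))"
clade_b = "((B',0),(0,0))"
clade_c = "(0,(C',0))"
clade_d = "(D',(0,0))"
full = f"(0,(({clade_a},{clade_b}),({clade_c},{clade_d})))"

pruned = prune_extinct(parse_newick(full))
print(to_newick(pruned))  # ((A',B'),(C',D'))
```

The pruned tree has the same shape as the simple four-clade tree, which is the point: the extinct side lineages leave no tips in the data, only longer branches between the clades that survived.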

Clades (sections, genera, tribes, etc.) that are alive today are in most cases derived from ancestral species that were far enough apart on the phylogeny to have accumulated additional synapomorphies along the branches studded only with extinct side lineages. Ultimately, it is not the completion of lineage sorting over time, i.e. the extinction of gene lineages within species, but instead the extinction of entire species that ensures that incomplete lineage sorting is not a problem for inferring higher level relationships.
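The role of those long branches can be made a little more concrete with the standard three-taxon result from coalescent theory, which is not derived in this post: for a species tree ((A,B),C), the probability that a gene tree disagrees with the species tree through incomplete lineage sorting alone is (2/3) * exp(-T), where T is the length of the internal branch in coalescent units. The sketch below assumes diploid organisms (T = generations / (2 * Ne)) and no gene flow; the function name and all numbers are hypothetical.

```python
import math


def discordance_probability(internal_branch_generations, effective_population_size):
    """Chance that a gene tree conflicts with the species tree ((A,B),C).

    Assumes the three-taxon multispecies coalescent for diploids, with no
    gene flow. Only the internal branch length matters, not the time that
    has passed since the last split.
    """
    t = internal_branch_generations / (2.0 * effective_population_size)
    return (2.0 / 3.0) * math.exp(-t)


# Hypothetical effective population size and internal branch lengths.
ne = 10_000
for generations in (0, 10_000, 100_000, 1_000_000):
    p = discordance_probability(generations, ne)
    print(f"{generations:>9} generations: P(discordant gene tree) = {p:.3g}")
```

A very short internal branch gives discordance in up to two thirds of genes, while a branch of many coalescent units pushes it towards zero. Under this reading, extinction matters because it leaves only the long internal branches between the clades that we can still sample.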

4 comments:

  1. Have you seen this, about alfalfa phylogeny?
    http://nothinginbiology.org/2013/03/12/how-many-phylogenies-are-there-in-a-genome-lots/

  2. Thanks, I think I hadn't seen that although I remember discussing a similar paper in journal club once. The figure at the website you linked to is very nice indeed.

    What we should remember is that it appears to work in many cases nonetheless. In the gorilla case, most of the genome gives the answer that is congruent with what was inferred from fossils and morphological analysis. And I am just now conducting a phylogenetic analysis of certain paper daisies where the often maligned ribosomal DNA gives us groups that make a lot of sense from a morphological perspective.

  3. I think the principal components of macroevolution are speciation and extinction. I have never seen a reference to genusization or familyization. It is sometimes argued that higher categories are arbitrary and largely meaningless. I agree that they are human artifacts, but they reflect our best understanding of the groups involved. There is also the argument that any monophyletic group is just as real as species are.

    I wonder how much differentiation, both morphological and genetic, can occur in a speciation event. Is there any meaningful idea of average differentiation in speciation events?
    Is there any way to estimate how many speciation events hidden by extinction there are in a lineage?

    Anyway, your discussion is interesting, but how to test the hypothesis?

  4. It is clear that all supra-specific ranks are arbitrary. I merely wrote of genera etc because not everybody reading this can be expected to have the frame of mind to understand what I mean if I constantly write about older and younger clades.

    You raise a very good point with the testability. Admittedly, this is more on the level of "it stands to reason" - but some conclusions do follow from certain observations. I cannot interbreed with an oak tree or a rat, for example. That means that _somewhere_ the distance between two clades becomes too big for gene flow to happen. It is now an empirical matter whether the crown groups of extant clades whose stem group ages are ca xyz million years were usually far enough apart that they were already completely isolated from each other genetically.

    Personally, I think people exaggerate the problems. They see a few spectacular cases of really young, closely related species, not least because those are the most interesting to study, and extrapolate from that over all the plant kingdom. At least in the groups that I know well interbreeding just doesn't happen over large distances. Mentha section Mentha is a mess - but you cannot cross a true mint with thyme or oregano, probably not even section Mentha with the other sections. Prunella with its four species is a mess but you certainly cannot cross it with Lamium. And so on.
