Continuing my series on the uses of parsimony analysis in phylogenetics and biogeography, we come to the inference of species trees from gene trees.
I have written before about the problems with inferring species relationships directly from the relationships of genes that are sampled from these species. In short, healthy species contain genetic diversity with potentially several different alleles for any given locus (e.g. how human eyes can be different colours). The same was true for all ancestral species in evolutionary history, and at first the two descendant species of a speciation event may each have inherited a part of that genetic diversity.
Because there is limited space available for alleles in any given species, some of them will be lost in the descendant species, even if only through the random process of genetic drift. However, it takes some time for this loss to happen, and so it is possible that by the time of the next lineage split, which results in more descendant species, a gene may still have alleles that diverged in the previous ancestor. If that is the case, then the descendants may inherit a random selection of alleles that show different relationships to each other than the real species relationships, potentially misleading our phylogenetic inference.
For example, although we know from multiple lines of evidence (including most genetic data) that the chimpanzees are our closest living relatives, a minority of our genes is more closely related to those of the gorilla than to those of the chimpanzees. So if only one of those genes were sampled and all other evidence ignored, one might mistakenly infer that the gorillas are our sister species. And in some plant and animal groups mistakes like this can easily be made.
The solution is to use more than one sample per species, to use more than one gene for the molecular analysis, and to use species tree methods. As indicated in my earlier post on species tree software, there is a parsimony approach to this issue. In fact there are two different ways of doing species tree parsimony, depending on what kind of gene trees we are dealing with.
Minimising Deep Coalescences
The first approach, Minimising Deep Coalescences (MDC), assumes that we are dealing with one single gene in each case. You may have a single gene phylogeny in which one or more species are non-monophyletic, meaning that they must have inherited some ancestral allele diversity. Some of the alleles of species A are sister to the alleles of species B and some to those of species C, and you would now like to know what the real species relationships are. Is it (C,(A,B)) or (B,(A,C))? Alternatively, you may have several different gene phylogenies with only one allele from each species, but they contradict each other. Or perhaps you have a combination of both scenarios.
As always, parsimony analysis searches for the simplest explanation of what we see, and in this case it means that we try to minimise precisely the number of occurrences of what I explained above: the analysis searches for the phylogeny that requires the smallest number of cases in which a species lineage carried allelic diversity from its ancestor down to the next lineage split instead of losing it along the way. A "deep coalescence" is the same thing seen from the tips of the phylogeny. You imagine yourself looking back in time and seeing extant gene lineages coalescing in ancestral species, but some of them do so in older (deeper) ancestors than expected.
In the above example, there are two normal coalescences (green), that is looking back down the tree they happen as soon as they can, and two deep coalescences (red), which happen deeper in time than when their containing species meet. The parsimony score of this species tree from this gene tree would consequently be 2.
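To make the counting concrete, here is a minimal sketch in Python. It is not taken from any particular software package: the function names, the nested-tuple tree format and the small three-species example are all assumptions made for illustration. It uses one common formulation of the score, counting "extra lineages": for every branch of the species tree, the number of gene lineages that leave the branch towards the root, minus one.

```python
def leaf_set(tree):
    """Return the set of leaf names of a tree written as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def count_lineages(gene_tree, cluster, species_of):
    """Count gene lineages leaving the species branch above `cluster`."""
    species = {species_of[g] for g in leaf_set(gene_tree)}
    if species <= cluster:
        return 1  # this whole gene clade has coalesced within the branch
    if isinstance(gene_tree, str):
        return 0  # a single allele from a species outside the cluster
    return sum(count_lineages(c, cluster, species_of) for c in gene_tree)


def extra_lineages(species_tree, gene_tree, species_of):
    """Sum (lineages - 1) over all non-root branches of the species tree."""
    total = 0

    def visit(node, is_root):
        nonlocal total
        if not is_root:
            n = count_lineages(gene_tree, leaf_set(node), species_of)
            total += max(0, n - 1)
        if not isinstance(node, str):
            for child in node:
                visit(child, False)

    visit(species_tree, True)
    return total


# Hypothetical example: one allele per species, gene tree ((a,b),c).
species_of = {"a": "A", "b": "B", "c": "C"}
gene_tree = (("a", "b"), "c")
print(extra_lineages((("A", "B"), "C"), gene_tree, species_of))  # 0
print(extra_lineages((("A", "C"), "B"), gene_tree, species_of))  # 1
```

A full MDC analysis would compute this score for many candidate species trees, sum it across all gene trees, and keep the species tree with the lowest total. The sketch only scores a single species tree against a single gene tree.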
Minimising gene duplications and losses
The second approach was developed for entire gene families. Gene duplication and subsequent specialisation is a major source of evolutionary innovation. An ancestor may have had one gene for a certain function, in plants for example for making a defensive biochemical compound. If that gene accidentally gets doubled, there are now two copies. In many cases, one copy will simply get silenced and turn into junk DNA. In others, it may remain active but mutate slightly to produce a different product. The plant may now have two different defensive substances, perhaps giving it greater versatility against different types of herbivores.
This can happen repeatedly - but of course copies can also get lost, especially if environmental conditions change so that one of them is not needed any more. We might thus find a gene family sitting inside a group of organisms, and each species in the group has a different subset of all the possible members of the gene family.
Minimising gene duplications and losses is the obvious parsimony solution to inferring species relationships from such a gene family tree. It does just what it says on the tin: it searches for the species phylogeny into which the gene family can be accommodated with the smallest number of gene duplication events and gene loss events.
In the above example, I have used the very same hypothetical species tree and gene tree as before. However, under the criterion of counting duplications and losses, the scenario now shows two gene duplications and four losses, leading to a much higher parsimony score of 6. Of course, one would not compare it against the previous tree but against other possible arrangements of A-E under the same criterion. The point is merely that what is counted for the parsimony score is very different.
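As a rough illustration of how duplications and losses can be counted, the following Python sketch uses the widely used "last common ancestor" reconciliation: each gene tree node is mapped to the smallest species tree clade containing all of its species, and a node that maps to the same clade as one of its children is scored as a duplication. The function names, the nested-tuple tree format and the toy trees are assumptions for illustration. The sketch also assumes binary trees and that every species in the gene tree is present in the species tree.

```python
def leaf_set(tree):
    """Return the set of leaf names of a tree written as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def cluster_depths(species_tree, depth=0, out=None):
    """Map every clade of the species tree to its distance from the root."""
    out = {} if out is None else out
    out[leaf_set(species_tree)] = depth
    if not isinstance(species_tree, str):
        for child in species_tree:
            cluster_depths(child, depth + 1, out)
    return out


def lca(depths, species):
    """Smallest (deepest) species clade that contains all given species."""
    return max((c for c in depths if species <= c), key=depths.get)


def dup_loss(species_tree, gene_tree, species_of):
    """Return (duplications, losses) under LCA reconciliation."""
    depths = cluster_depths(species_tree)
    dups = losses = 0

    def visit(g):
        nonlocal dups, losses
        here = lca(depths, frozenset(species_of[x] for x in leaf_set(g)))
        if isinstance(g, str):
            return here
        kids = [visit(child) for child in g]
        is_dup = here in kids  # a child maps to the same clade
        dups += is_dup
        for k in kids:
            gap = depths[k] - depths[here]  # species branches crossed
            # One loss per skipped species node; after a speciation the
            # first step down is the speciation itself, not a loss.
            losses += gap if is_dup else gap - 1
        return here

    visit(gene_tree)
    return dups, losses


# Hypothetical example: gene tree ((a,b),c) on two candidate species trees.
species_of = {"a": "A", "b": "B", "c": "C"}
gene_tree = (("a", "b"), "c")
print(dup_loss((("A", "B"), "C"), gene_tree, species_of))  # (0, 0)
print(dup_loss((("A", "C"), "B"), gene_tree, species_of))  # (1, 3)
```

In this toy case the species tree that matches the gene tree needs no events at all, whereas the conflicting species tree needs one duplication and three losses, for a parsimony score of 4 if both kinds of event are weighted equally.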
The seminal article introducing the methods
Maddison WP, 1997. Gene trees in species trees. Systematic Biology 46: 523-536.