Monday, January 5, 2015

How to root a phylogenetic tree: outgroup, midpoint and other methods

Googling terms like "outgroup rooting" will, of course, provide several other places on the internet where people have explained how phylogenetic trees can be polarised, be it on university websites or on blogs of other phylogeneticists. Often, however, they seem to mention only the first two of the methods I will list below, and consequently it seems useful to add my take on it.

The problem is easily explained. All phylogenetic methods can produce a phylogenetic tree, that is, a tree-graph showing the evolutionary relationships of its terminals, but many of them are silent on the polarity of the tree. Thus for a study of the species A, B, C, D and E we may retrieve the following "unrooted" tree:


In this case, we know that A is very distant from the other four species, but we do not know in what direction evolution proceeded. It could be that A is the earliest diverging species, but it could also be C, for example, and the very long branch on which A is sitting is simply due to very fast change along that lineage.

As a different example closer to home, consider hominid evolution. Most people will know that the phylogeny of the great apes has been resolved as (orang-utan,(gorilla,(chimp,human))). But if we were ignorant of the root of the tree - here between orang-utan and the rest - it could just as well be (human,(chimp,(gorilla,orang-utan))).

So how do we know that the orang-utan is the earliest diverging great ape? How do we polarise, or "root", our ABCDE tree? There are several commonly used ways of doing this, and they each come with their own assumptions.

Outgroup rooting

The most widely known method today is outgroup rooting. In this case, we need to use external knowledge (or make a hopefully reasonable assumption) to identify at least one species which we know to be outside of the rest of the study group but not so far away that character homology becomes difficult to establish.

It can easily be imagined that this leaves us open to the charge of circularity: you can always go one step further down the tree of life to find an even more distant outgroup, but how do you ever know that it is really outside of the rest? There are ways of breaking the circularity, such as some of the other rooting methods below, but for the moment let us just accept that it is commonly used, and that phylogeneticists are well aware of the potential pitfalls. For example, peer reviewers often ask authors of phylogenetic studies how they can be sure that their outgroups are really outside of the study group.

So let us assume that we know A to be outside of the group BCDE, and use it for outgroup rooting. The result looks like this:
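
As a rough illustration of what outgroup rooting does computationally, here is a minimal Python sketch. The unrooted topology, the internal node names n1 to n3 and the function names are all made up for this example and are not taken from the figures in this post. The root is simply placed on the branch leading to the chosen outgroup, and the rest of the tree is read off from there in Newick notation.

```python
# Hypothetical unrooted topology for A-E; n1..n3 are unnamed internal nodes.
EDGES = [("A", "n1"), ("B", "n1"), ("n1", "n2"), ("C", "n2"),
         ("n2", "n3"), ("D", "n3"), ("E", "n3")]


def build_adjacency(edges):
    """Turn a list of branches into {node: set of neighbouring nodes}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def subtree(adj, node, parent):
    """Newick text for everything reached from `node` without going back to `parent`."""
    children = sorted(n for n in adj[node] if n != parent)
    if not children:  # a terminal
        return node
    return "(" + ",".join(subtree(adj, c, node) for c in children) + ")"


def root_on_outgroup(adj, outgroup):
    """Place the root on the branch that leads to the terminal `outgroup`."""
    (neighbour,) = adj[outgroup]  # a terminal has exactly one neighbour
    return "(" + outgroup + "," + subtree(adj, neighbour, outgroup) + ");"


adj = build_adjacency(EDGES)
print(root_on_outgroup(adj, "A"))  # (A,(B,(C,(D,E))));
print(root_on_outgroup(adj, "C"))  # (C,((A,B),(D,E)));
```

The same unrooted tree gives two quite different rooted trees depending on which terminal is declared the outgroup, which is why the choice of outgroup has to be justified from external knowledge.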


Midpoint rooting

Perhaps the most widely known alternative to outgroup rooting is midpoint rooting. In this case, the longest distance between two terminals on the tree is identified. In our example case, it is A to C:


The root is then placed precisely in the middle of that distance, leading to a rooted tree that looks as follows:


As we can see, A and C are sticking out the farthest of all species, and they are equidistant from the root (this was done with the tree viewer FigTree, which has a midpoint rooting option).
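
The procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the branch lengths, the internal node names n1 to n3 and the function names are invented for the example, and this is not how FigTree implements its option. The code finds the longest path between two terminals and then walks along it until half of its length has been covered.

```python
from itertools import combinations

# Hypothetical branch lengths; A sits on a long branch, n1..n3 are internal nodes.
EDGES = [("A", "n1", 6.0), ("B", "n1", 1.0), ("n1", "n2", 1.0), ("C", "n2", 3.0),
         ("n2", "n3", 1.0), ("D", "n3", 1.0), ("E", "n3", 1.0)]


def build_adjacency(edges):
    """Turn a list of branches into {node: {neighbour: branch length}}."""
    adj = {}
    for a, b, length in edges:
        adj.setdefault(a, {})[b] = length
        adj.setdefault(b, {})[a] = length
    return adj


def path_between(adj, start, goal):
    """Return the nodes on the unique path from `start` to `goal` in a tree."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adj[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")


def path_length(adj, path):
    """Sum of the branch lengths along a path."""
    return sum(adj[a][b] for a, b in zip(path, path[1:]))


def midpoint_root(adj):
    """Return (node_a, node_b, offset): the root lies on branch a-b, `offset` away from a."""
    leaves = [n for n in adj if len(adj[n]) == 1]
    longest = max((path_between(adj, a, b) for a, b in combinations(leaves, 2)),
                  key=lambda p: path_length(adj, p))
    half = path_length(adj, longest) / 2
    walked = 0.0
    for a, b in zip(longest, longest[1:]):
        if walked + adj[a][b] >= half:
            return a, b, half - walked
        walked += adj[a][b]


adj = build_adjacency(EDGES)
print(midpoint_root(adj))  # ('A', 'n1', 5.0)
```

With these made-up lengths the longest path runs from A to C and is 10 units long, so the root ends up on the branch leading to A, 5 units from both A and C.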

The assumption behind midpoint rooting is that character changes across the phylogenetic tree are approximately clock-like, that is, they happen at approximately the same speed in every lineage. Sometimes this is a reasonable assumption, as when one has many neutral characters and species with approximately similar characteristics. But there are many situations in which evolution is far from clock-like: if characters are under selection, if for some reason lineages evolve at very different speeds, or if there are appreciable amounts of missing data in the dataset, midpoint rooting will be misleading.

Ultrametric trees

At the beginning I wrote that many phylogenetic methods are silent on the root. We now come to those that do actually automatically root the tree they produce. The first group are all methods that produce ultrametric trees. Ultrametric means that - in contrast to the trees shown so far - all branches of the tree end flush in the present, and that all branch length differences along the lineages play out in the past.

One method that produces such trees is UPGMA, a distance-based clustering algorithm (and thus not really a phylogenetic method in the strictest sense). The other is any type of coalescent analysis such as conducted by the very popular software BEAST, whose author is on the record as saying that there are really no unrooted phylogenetic trees, precisely because all branches leading to extant species must logically terminate at the same level, in the present. (This is not true for phylogenetic studies of fossils, of course, but BEAST was primarily written for molecular data obtained from living organisms.)

At any rate, and admittedly simplifying a bit, both UPGMA and coalescent analyses have a similar logic to them: The analysis starts with the tips and works backwards, uniting lineages until they all run together (coalesce) at the starting point of the phylogeny. That point must then logically be the root, and there we are: the tree is automatically rooted. In our example, it would perhaps look like this:


The assumption behind ultrametric trees of any kind is, however, the same as that behind midpoint rooting: a mostly clock-like behaviour of the data. They are consequently just as vulnerable as midpoint rooting when that assumption is false.
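
The tips-first logic of UPGMA described above can be sketched in Python as follows. Everything specific here is an assumption for illustration: the function name, the nested-tuple output format and the pairwise distances, which are path lengths on a hypothetical unrooted tree in which A and B are neighbours but A sits on a long branch. The two closest clusters are merged repeatedly, and the last merge is the root.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster `names` by UPGMA.

    `dist` maps frozenset({a, b}) to the distance between terminals a and b.
    Returns nested tuples (left, right, height); terminals are plain strings.
    """
    size = {n: 1 for n in names}  # number of terminals in each current cluster
    d = dict(dist)                # working copy of the distance table
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b, d[frozenset((a, b))] / 2)  # node height is half the distance
        # Distance from the new cluster to the others: size-weighted average.
        for c in size:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    (root,) = size
    return root


names = ["A", "B", "C", "D", "E"]
raw = {("A", "B"): 7, ("A", "C"): 10, ("A", "D"): 9, ("A", "E"): 9,
       ("B", "C"): 5, ("B", "D"): 4, ("B", "E"): 4,
       ("C", "D"): 5, ("C", "E"): 5, ("D", "E"): 2}
dist = {frozenset(pair): value for pair, value in raw.items()}

root = upgma(names, dist)
print(root)  # topology (A,(C,(B,(D,E)))), each node followed by its height
```

The result is rooted automatically, with A splitting off first. Note, however, that B ends up grouped with D and E rather than next to A, although the distances were generated from a tree in which A and B are neighbours. The long branch leading to A violates the clock-like assumption, and the clustering is led astray accordingly.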

Asymmetric step-matrices

Another way of arriving at a rooted phylogeny is by using asymmetric step-matrices, because in that case the phylogenetic analysis must tentatively polarise the tree to know how to score it. I do not want to go into the details, not least because this is a very rarely used approach, so the following may go over the heads of some readers.

A parsimony analysis in the software PAUP, for example, proceeds by suggesting a tree, changing it somehow, comparing the score of old and new trees to see which one is better, and then picking the better of the two. This is repeated until the program is confident it will not find anything better.

Better in the context of a parsimony analysis means shorter, that is the tree with the lower number of character changes along the branches. That is what parsimony means: all else being equal, the simplest explanation is to be preferred.

Often when we have two character states, say 0 and 1, we consider changes between the two states equally probable and thus score a change in either direction as the same contribution to a tree score: evolving from a yellow flower colour to a white one is probably as easy as evolving from white to yellow. In that case, the root of the tree is irrelevant, and thus the analysis will by default return an unrooted tree (which the user may then root using an outgroup, for example).

However, sometimes it may make sense to consider one direction of change as more costly than the other. Let us say that state 1 is something complicated, possession of leaves perhaps, that may be easily lost (1->0) but is less likely to be gained (0->1) in parallel. There are different ways of penalising changes from 0 to 1, for example by using Dollo parsimony. But another one is to use an asymmetric step-matrix. This means that one would, for example, count a change from 1 to 0 as one step in tree score/length but a change from 0 to 1 as two steps in tree score/length.

Now the problem should be obvious: the parsimony analysis needs to know the root of the tree to even be able to compute the tree score, because the root position decides whether a change on a given branch goes 0->1 (expensive) or 1->0 (cheap). In practice, PAUP appears to use the root position that will produce the best tree score in each case; or so I assume.
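
To see the dependence on the root in numbers, here is a small Python sketch of step-matrix scoring by dynamic programming (often called the Sankoff algorithm). It is not a description of what PAUP does internally; the function names, the example topology and the character states are assumptions made for illustration. The costs follow the example above: 1->0 counts one step, 0->1 counts two.

```python
INF = float("inf")


def sankoff(tree, states, cost):
    """Return {state: cheapest cost of the subtree if its top node has that state}.

    `tree` is a terminal name or a tuple of subtrees; `cost[s][t]` is the
    price of a change from state s (ancestor) to state t (descendant).
    """
    if isinstance(tree, str):  # terminal: only the observed state is allowed
        return {s: (0 if s == states[tree] else INF) for s in (0, 1)}
    children = [sankoff(sub, states, cost) for sub in tree]
    return {
        s: sum(min(cost[s][t] + child[t] for t in (0, 1)) for child in children)
        for s in (0, 1)
    }


def tree_length(tree, states, cost):
    """Score of a rooted tree, letting the root take whichever state is cheapest."""
    return min(sankoff(tree, states, cost).values())


# One binary character: A and B have state 1, the others state 0.
states = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 0}

# Two rootings of the same (hypothetical) unrooted topology.
rooted_at_A = ("A", ("B", ("C", ("D", "E"))))
rooted_near_DE = ((("A", "B"), "C"), ("D", "E"))

symmetric = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
asymmetric = {0: {0: 0, 1: 2}, 1: {0: 1, 1: 0}}  # gaining state 1 costs double

print(tree_length(rooted_at_A, states, symmetric),
      tree_length(rooted_near_DE, states, symmetric))   # 1 1
print(tree_length(rooted_at_A, states, asymmetric),
      tree_length(rooted_near_DE, states, asymmetric))  # 1 2
```

With symmetric costs both rootings score the same, so the root is irrelevant. With the asymmetric matrix the first rooting explains the data with a single cheap loss, whereas the second needs either one expensive gain or two losses, so the score changes with the root position.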

Regardless of the details, using PAUP for a parsimony analysis with such an asymmetric step-matrix consequently returns a rooted tree. The question is whether the results make sense, and often they do not seem to do so. This is probably because the assumption underlying the step-matrix, that changes are asymmetric in the specified way, does not precisely reflect reality. But again, this approach is rarely used anyway.

Gene duplication events

Finally, there is a completely different but ingenious method of rooting that makes use of gene duplication events. It is well understood that one of the major ways in which organisms evolve new functionalities is by accidental duplication of genes followed by their specialisation into different functions. In some cases, genes have been duplicated very often, leading to entire "gene families".

Imagine that you sequence a little gene family for our study group ABCDE and you find this gene tree:


In this case, four of our species (BCDE) have two separate copies in that gene family, and thus each of those species appears twice in your tree. Species A, on the other hand, has only one copy. Probably the gene was duplicated after the divergence of A; also possible is that A has lost one of the two copies, but that explanation is less parsimonious because it postulates one additional unnecessary event to explain the same observation (the duplication must have happened anyway at some point).

I recently learned that it was the Australian plant systematist Peter Weston who first realised that such gene duplication events can be used to root phylogenies, thus providing one of the ways in which the circularity of outgroup rooting can be broken: The place at which the blue and red gene clades connect must be the root of the group of species that has both copies!

One way of depicting this way of rooting (which was used to great effect in a study of the evolution of flowering plants, for example), is by showing the gene trees of both copies opposite each other, connected by the duplication event:


In total, we have now seen five different ways of polarising trees. There are consequently considerably more options than most people are aware of - or that I myself was aware of a few years ago.
