Saturday, May 17, 2014

Matrix Representation Parsimony supertrees

Continuing, for the moment, my little series of posts on the use of parsimony methods in phylogenetics and biogeography, we come to the topic of supertrees.

Some phylogenetic studies deal with higher-level groups. For example, one might see an evolutionary tree of the land vertebrates or of the land plants. But in those cases the sampling of the individual groups is very restricted, so that a whole family of mammals or a whole order of plants might be represented by only one terminal.

Other studies deal with more fine-scale relationships. For example, there are publications on the phylogeny of just one medium-sized genus of daisies or one genus of birds. In this case the species within the genera in question are well sampled (hopefully completely or nearly so), but obviously everything outside the study group is represented by only a few close relatives.

At some point one might now want to put all of this information together to arrive at the complete tree of life or, perhaps less ambitiously, at a complete evolutionary tree of all birds or of all flowering plants. How can we take all these individual studies, all dealing with different species and often using very different types of data, and get one tree out of them?

There are two main approaches. It should be obvious that both necessarily require that there is some overlap between the various trees.

The first is to build a supermatrix. In this case, the raw data used in all the various studies are pooled and analysed together with whatever method is appropriate. This has one clear advantage - precisely that of going back to the underlying data - but also some disadvantages. For example, the supermatrix will often have large amounts of missing data, making it harder to find the best tree. Tree reconstruction can also require long computing times. Finally, and perhaps often most problematically, data that were successfully used for a study within one genus may be much too variable across larger phylogenetic distances, making homology assessment (for example in sequence alignment) very difficult.

The alternative approach is to build supertrees. In this case, the original data used for the various individual studies are ignored. Instead, the topological information from the phylogenetic trees resulting from those studies is used directly; one could say that the trees are glued together into a larger tree.

One of the oldest methods to do this is Matrix Representation Parsimony (MRP). It was developed back in 1992 and is actually quite simple. You build a data matrix in which each terminal (species) found in any of the trees you want to unite is represented by a row.

Now you take the first tree and go through all its internal branches one by one. For each internal branch, you add a column to your matrix, and you score a 0 for every terminal on one side of the branch and a 1 for every terminal on the other side of the branch (it is irrelevant which side gets which number). Then you fill the rest of the column, that is all the rows for terminals that do not occur in this tree, with '?' for 'missing data'. After you have done all the branches, do the same with the remaining trees, one by one.
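The coding procedure just described can be sketched in a few lines of Python. This is only a minimal illustration, not what any particular program does: the nested-tuple tree representation, the function names (leaves, clades, mrp_matrix) and the two toy trees are all my own assumptions. Each internal node below the root stands for one internal branch, and a branch is skipped if it defines the same split of terminals as one already coded for that tree.

```python
def leaves(tree):
    """Return the set of terminal names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the terminal set of every internal node below the root."""
    if isinstance(tree, str):
        return
    for child in tree:
        if not isinstance(child, str):
            yield frozenset(leaves(child))
            yield from clades(child)


def mrp_matrix(trees):
    """Build an MRP matrix: one row per terminal, one column per internal branch."""
    taxa = sorted(set().union(*(leaves(t) for t in trees)))
    rows = {name: [] for name in taxa}
    for tree in trees:
        present = leaves(tree)
        seen_splits = set()
        for clade in clades(tree):
            rest = frozenset(present - clade)
            split = frozenset([clade, rest])
            if split in seen_splits:
                continue  # same branch seen from the other side of the root
            seen_splits.add(split)
            for name in taxa:
                if name in clade:
                    rows[name].append("1")
                elif name in rest:
                    rows[name].append("0")
                else:
                    rows[name].append("?")  # terminal absent from this tree
    return {name: "".join(states) for name, states in rows.items()}


# Two hypothetical input trees with partly overlapping terminals.
tree1 = ((("A", "B"), "D"), ("E", "F"))
tree2 = (("A", "C"), ("G", ("H", "E")))
matrix = mrp_matrix([tree1, tree2])
for name, states in matrix.items():
    print(name, states)
```

Species C, G and H do not occur in the first toy tree, so they get '?' in the two columns that code for it.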

The above example illustrates the principle. The first three branches have already been scored; the numbers above the columns of the matrix refer to the numbers above the branches in the phylogeny. Although the species C, G and H are found in other trees used for the supertree analysis they are missing in this one. Consequently, they will have '?' in all columns coding for this particular tree.

The resulting matrix faithfully represents all the topological information available to you in the form of binary characters. All you need to do now is run a parsimony analysis in standard phylogenetics software such as PAUP or TNT and you will retrieve the supertree. Because the matrix contains a lot of missing data, you will not get any strong support values, but in this case we are happy with the one most parsimonious tree.
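To get such a matrix into phylogenetics software, it has to be written to a file in a format the program reads. As a hedged sketch, the following Python snippet writes a small binary matrix as a NEXUS DATA block, which is the format PAUP reads; other programs may need their own input format. The function name to_nexus and the tiny example matrix are made up for illustration, and the sketch assumes that terminal names contain no spaces or punctuation.

```python
def to_nexus(matrix):
    """Format a {terminal: '01?' string} dictionary as a NEXUS DATA block."""
    ntax = len(matrix)
    nchar = len(next(iter(matrix.values())))
    # All rows must have the same number of characters.
    if any(len(states) != nchar for states in matrix.values()):
        raise ValueError("rows differ in length")
    width = max(len(name) for name in matrix)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={ntax} NCHAR={nchar};",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;',
        "  MATRIX",
    ]
    for name, states in matrix.items():
        lines.append(f"    {name.ljust(width)}  {states}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Hypothetical three-terminal matrix, just to show the output layout.
example = {"A": "11", "B": "1?", "Outgroup": "00"}
nexus_text = to_nexus(example)
print(nexus_text)
```

The returned string can be saved to a text file and loaded into the analysis program from there.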

If the individual trees contradict each other, you will likely get multiple equally parsimonious trees, and in that case their strict consensus tree may contain large polytomies indicating unresolved relationships. As so often, throwing more data - in this case, more individual trees - at the problem might help, especially if several mutually consistent trees "outvote" one outlier. Often, however, a lack of additional reliable trees is the main problem in these kinds of analyses.
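Why conflict produces polytomies is easy to see if one thinks of the strict consensus as keeping only those groups that occur in every one of the equally parsimonious trees. Here is a minimal Python sketch of that idea, assuming rooted trees written as nested tuples over the same set of terminals; the function names and the two toy trees are hypothetical.

```python
def leaf_set(tree):
    """Return the terminals of a nested-tuple tree as a frozenset."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clade_set(tree):
    """Collect the terminal sets of all internal nodes below the root."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        for child in node:
            if not isinstance(child, str):
                found.add(leaf_set(child))
                walk(child)

    walk(tree)
    return found


def strict_consensus_clades(trees):
    """Keep only the clades that are present in every tree."""
    return set.intersection(*(clade_set(t) for t in trees))


# Two equally good trees that disagree about the closest relative of A.
t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "D")
shared = strict_consensus_clades([t1, t2])
print(sorted(sorted(clade) for clade in shared))
```

Only the group of A, B and C survives in this toy case, so the consensus shows these three species as an unresolved polytomy.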

Lately I have relied more on supermatrices, but a few years ago I published a paper in which I used the MRP method to infer a supertree for the Acanthaceae. So yes, this specific use of parsimony is still worthwhile sometimes.
