Tuesday, April 1, 2014

Phylogenetic analysis using the parsimony criterion

One of the simplest ways to reconstruct the phylogenetic relationships between different organisms is parsimony analysis. As explained in the previous post of this series, the principle as applied to tree inference is very straightforward: compare possible solutions by counting the number of events in each and accept the solution that needs the smallest such number.

Now what are the events, and how does that work in practice?

The analysis starts with a data matrix. This is a two-dimensional table that has one row for each of the terminals we want to have in our phylogenetic tree. These terminals could be at various levels - each of them could be as small as a single biological species or as big as a whole class of organisms (hopefully because we are already sure that they are reciprocally monophyletic even if we don't know their exact relationships to each other). Because we want to talk in the most general terms, we will simply call them OTUs, short for Operational Taxonomic Units.

The columns of the data matrix are the various characters that we will use to infer the relationships of the OTUs. Here it is most important to establish homology. For example, we cannot compare the wings of vertebrates, which have evolved out of complete front legs, with the wings of insects, whose evolutionary derivation is still unclear but that have definitely not evolved out of anything homologous to vertebrate front legs. But we do have to compare the wings of bats, the front legs of a horse, the arms of a human and the flippers of a whale, because these are all homologous organs.

Further it would be good to make sure that the characters in the matrix are all independent; we should avoid scoring what is pretty much the same character several times independently. Sometimes the decisions can become quite difficult. How to score non-discrete characters, such as length measurements? How to describe shapes? The easiest characters are usually simple absence / presence characters. Of course, we can also use DNA sequence characters, and that is what happens most of the time these days.

If we do not have information on a character for one of the OTUs, we can score it as missing. There can also be polymorphic characters; for example, some species in a plant family may always have glandular leaves, others always have non-glandular leaves, and one or two include individuals with either state.
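To make the idea of a data matrix a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the OTU names, the three characters, the use of "?" for missing data and of a Python set for a polymorphic score are my assumptions, not a standard file format. Treating a missing score as "could be any state seen in that column" is one common convention, not the only one.

```python
# Hypothetical data matrix: one row per OTU, one column per character.
# States are written as "0" (absent) and "1" (present).
MISSING = "?"

characters = ["leaves glandular", "petals fused", "fruit fleshy"]  # invented names

matrix = {
    "OTU_A": ["0", "1", "0"],
    "OTU_B": ["0", "1", {"0", "1"}],  # polymorphic for the third character
    "OTU_C": ["1", MISSING, "1"],     # second character not observed
    "OTU_D": ["1", "0", "1"],
}


def possible_states(matrix, otu, index):
    """Return the set of states an OTU may have for one character."""
    score = matrix[otu][index]
    if score == MISSING:
        # Assumption: missing data is compatible with any state
        # that was actually observed in this column.
        observed = set()
        for row in matrix.values():
            cell = row[index]
            if cell == MISSING:
                continue
            observed |= cell if isinstance(cell, set) else {cell}
        return observed
    # A polymorphic score is already a set; a plain score becomes a one-element set.
    return set(score) if isinstance(score, set) else {score}


for otu in matrix:
    print(otu, [sorted(possible_states(matrix, otu, i)) for i in range(len(characters))])
```

Turning every cell into a set of possible states is convenient because the counting of changes on a tree can then treat ordinary, polymorphic and missing scores in the same way.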

Once we have our data matrix, we start searching for the "best" tree. We start by suggesting some tree, perhaps randomly. We then map all the characters onto the tree, counting how many character changes would have been necessary to explain the evolutionary relationships shown by the tree. Next, we come up with a different tree, and we do the same counting of changes there. Finally, we compare the number of changes necessary in both cases. The tree that needs the larger number of character changes to work is discarded. Now we try yet another tree, count, compare, repeat until we are sure that we cannot come up with a better one.
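The propose, count, compare and discard loop described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: trees are nested tuples, the five-OTU matrix is invented, new trees are proposed purely at random, and changes are counted with Fitch's method for unordered characters, which is just one of several ways of counting. Real software proposes new trees far more cleverly, usually by rearranging the current best tree.

```python
import random

# Hypothetical matrix: five OTUs scored for three binary characters.
matrix = {"A": "000", "B": "000", "C": "100", "D": "111", "E": "011"}


def random_tree(otus, rng):
    """Build a random rooted binary tree (nested tuples) by joining random pairs."""
    nodes = list(otus)
    while len(nodes) > 1:
        a = nodes.pop(rng.randrange(len(nodes)))
        b = nodes.pop(rng.randrange(len(nodes)))
        nodes.append((a, b))
    return nodes[0]


def char_changes(node, matrix, i):
    """Return (possible states, number of changes) for character i below this node."""
    if isinstance(node, str):  # a terminal: its state is known, no change yet
        return {matrix[node][i]}, 0
    (s1, c1), (s2, c2) = (char_changes(child, matrix, i) for child in node)
    if s1 & s2:  # the two branches can agree on a state: no extra change
        return s1 & s2, c1 + c2
    return s1 | s2, c1 + c2 + 1  # they cannot agree: one more change


def tree_length(tree, matrix):
    """Total number of changes over all characters in the matrix."""
    n_chars = len(next(iter(matrix.values())))
    return sum(char_changes(tree, matrix, i)[1] for i in range(n_chars))


def search(matrix, tries=200, seed=0):
    """Keep proposing trees and discard whichever needs more changes."""
    rng = random.Random(seed)
    best = random_tree(matrix, rng)
    best_length = tree_length(best, matrix)
    for _ in range(tries):
        candidate = random_tree(matrix, rng)
        candidate_length = tree_length(candidate, matrix)
        if candidate_length < best_length:
            best, best_length = candidate, candidate_length
    return best, best_length


first_guess = ((("A", "B"), "C"), ("D", "E"))
print(tree_length(first_guess, matrix))  # 4 changes for this particular tree
best, length = search(matrix)
print(best, length)
```

Note that this sketch only ever keeps one best tree; in practice several different trees can be equally parsimonious, and an analysis would normally keep all of them.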

The above trees show a very simple example with only one character mapped onto the phylogenies. The tree on the left needs two independent changes from the black to the red state to explain the character states in its terminals. If we swap the positions of C and E, as in the tree on the right, only one change from black to red is needed. This is one change fewer, so the tree on the right is more parsimonious - for this one character, at least. In reality, we have to do the same for all the other characters and add up the numbers of changes. It may well be that the swap of C and E has worsened the parsimony score overall once the other characters are taken into account.
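The counting for a single character can also be written down as a short Python sketch. The two topologies and the colour of each terminal below are hypothetical stand-ins that I chose so that the counts come out as two and one; they are not necessarily the trees from the figure. The counting itself uses Fitch's method for unordered characters, which is an assumption on my part about how changes are counted here.

```python
# Hypothetical character states for the five terminals.
colour = {"A": "black", "B": "black", "C": "red", "D": "red", "E": "black"}

# Trees as nested tuples; the second is the first with C and E swapped.
left_tree = ((("A", "B"), "C"), ("D", "E"))
right_tree = ((("A", "B"), "E"), ("D", "C"))


def fitch_changes(tree, states):
    """Return the minimum number of changes one unordered character needs on a tree."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):  # terminal: its state is simply what we scored
            return {states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:  # both branches can have the same state: no change needed
            return shared
        changes += 1  # otherwise one change must have happened here
        return left | right

    visit(tree)
    return changes


print(fitch_changes(left_tree, colour))   # 2
print(fitch_changes(right_tree, colour))  # 1
```

For a full matrix one would run the same function once per character and add up the results for each tree before comparing them.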

For large numbers of OTUs, there are ridiculously large numbers of possible phylogenetic trees, making it impossible for a human to find the best solution by hand. In practice, parsimony analysis is therefore conducted with the help of computers. But even then, finding the very best tree for just fifty OTUs would take so long that phylogenetics software generally only conducts a "heuristic" search: it cuts some corners to increase speed while accepting the risk that it will only find a very good solution instead of the perfect one.
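To get a feeling for just how large these numbers are, here is a small Python sketch. It relies on the standard counting result for strictly bifurcating trees: with n OTUs there are (2n-5)!! unrooted and (2n-3)!! rooted topologies, where !! is the double factorial (the product of every second number down to 1). The function names are my own.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 or 2; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_otus):
    """Number of distinct unrooted, strictly bifurcating trees for n OTUs."""
    if n_otus < 3:
        return 1
    return double_factorial(2 * n_otus - 5)


def count_rooted_trees(n_otus):
    """Number of distinct rooted, strictly bifurcating trees for n OTUs."""
    if n_otus < 2:
        return 1
    return double_factorial(2 * n_otus - 3)


for n in (4, 5, 10, 20, 50):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Four OTUs allow only 3 unrooted trees and ten OTUs already about two million, while fifty OTUs give a number on the order of 10^74, which is why checking every tree is out of the question.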

I think I will leave it at this today and discuss the various different types of parsimony - that is, different ways of counting character changes - in the next post.
