
Thursday, September 26, 2019

Incongruence Length Difference test in TNT

Because I am fed up with figuring it out anew every time I need to use the Incongruence Length Difference (ILD) test (Farris et al., 1994) in TNT, I will post it once and for all here:

Download TNT and the script "ildtnt.run" from PhyloWiki. In the script, you may have to replace all instances of "numreps" with "num_reps" to make it functional. I at least get the error "numreps is a reserved expression", suggesting that the programmer should not have used that as a variable name.

Open TNT, increase the available memory, and set the data type to DNA, treating gaps as missing data. Then load your data matrix, which should of course be in TNT format:

mxram 200 ;
nstates DNA ;
nstates NOGAPS ;
proc (your_alignment_file_name) ;

Look up how many characters your first partition has, then run the test with:

run ildtnt.run (length_of_first_partition) (replicates) ;
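For example, if the first partition comprised 750 characters and you wanted 1,000 replicates (both numbers made up purely for illustration), the call would be:

run ildtnt.run 750 1000 ;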

There is an alternative script for doing the test called Ild.run, but I have so far failed to set the number of user variables high enough to accommodate my datasets; they seem to be capped at 1,000.

Perhaps this guide will also be useful to somebody besides me.

Reference

Farris JS, Källersjö M, Kluge AG, Bult C, 1994. Testing significance of incongruence. Cladistics 10: 315-319.

Friday, April 20, 2018

Time-calibrated or at least ultrametric trees with the R package ape: an overview

I had reason today to look into time-calibrating phylogenetic trees again, specifically trees that are so large that Bayesian approaches are not computationally feasible. It turns out that there are more options in the R package ape than I had previously been aware of - but unfortunately they are not all equally useful in everyday phylogenetics.

In all cases we first need a phylogram that we want to time-calibrate or at least make ultrametric for use in downstream analyses that require ultrametricity. As we assume that our phylogeny is very large, it may for example have been inferred by RAxML, with branch lengths proportional to the amount of character change inferred along them. For present purposes I have used a smaller tree (actually a clade cut out of a larger tree I had floating around), so that I could do the calibrations quickly and so that the figures in this post look nice. My example phylogram has this shape:


We fire up R, load the ape package, and import our phylogeny with read.tree() or read.nexus(), depending on whether it is in Newick or Nexus format, e.g.
mytree <- read.tree("treefilename.tre")
Now to the various methods.

Penalised Likelihood

I have previously done a longer, dedicated post on this method. I did not, however, go into the various models and options then, so let's cover the basics here.

Penalised Likelihood (PL) is, I think, the most sophisticated approach available in ape, allowing the comparison of likelihood scores between different models. It is also the most flexible. It is possible to set multiple calibration points, as discussed in the linked earlier post, but here we simply set the root age to 50 million years:
mycalibration <- makeChronosCalib(mytree, node="root", age.max=50)
We have three different clock models at our disposal: correlated, discrete, and relaxed. Correlated means that adjacent parts of the phylogeny are not allowed to evolve at rates that are very different. Discrete models different parts of the tree as evolving in a limited number of distinct rate categories. As I understand it, relaxed allows the rates to vary most freely. Another important factor that can be adjusted is the smoothing parameter lambda; I usually run all three clock models at lambdas of 1 and 10 and pick the one with the best likelihood score. For present purposes I will restrict myself to lambda = 1.
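As a minimal sketch of that model comparison, assuming the mytree and mycalibration objects created above (chronos() stores the penalised log-likelihood in the "ploglik" attribute of its output):

library(ape)
# run all three clock models at two smoothing values and report the fit
for (m in c("correlated", "discrete", "relaxed")) {
    for (l in c(1, 10)) {
        chr <- chronos(mytree, lambda = l, model = m, calibration = mycalibration)
        cat(m, "lambda =", l, "ploglik =", attr(chr, "ploglik"), "\n")
    }
}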

Let's start with correlated:
mytimetree <- chronos(mytree, lambda = 1, model = "correlated", calibration = mycalibration, control = chronos.control() )
When plotted, the chronogram looks as follows.


Next, discrete. The command is the same as above except for the text in the model parameter. The branch length distribution and likelihood score turned out to be very close to those for the correlated model:


Finally, relaxed. Very different branch length distribution and a by far worse likelihood score compared to the other two:


Today was the first time I considered testing a strict clock model with chronos. It turns out that you get it by running it as a special case of the discrete model, which by default assumes ten rate categories. You simply set the number of categories to one:
mytimetree <- chronos(mytree, lambda = 1, model = "discrete", calibration = mycalibration, control = chronos.control(nb.rate.cat=1) )
In my example case this looks rather similar to the results from the correlated model and from discrete with ten categories:


The problem with PL is that it seems to be a bit touchy. Even today we had several cases of an inexplicable error message, and several cases of the analysis being unable to find a reasonable starting solution. We finally found that it helped to vastly increase the root age (we had played around with 15, assuming that it wouldn't matter, and it worked when we set it to a more realistic three-digit number). It is possible that our true problem was short terminal branches.

PL is also the slowest of the methods presented here. I would use it for trees that are too large for Bayesian time calibration but where I need an actual chronogram with a meaningful time axis and want to do model comparison. If I just want an ultrametric tree, the following three methods are faster and simpler alternatives. That being said, so far I have had no use case for them.

A superseded but fast alternative: chronopl()

This came as a surprise, as I believed that the function chronopl() had been removed from ape. I thought I had tried to find it in vain a few years ago, but I saw it in the ape documentation today (albeit with the comment "the new function chronos replaces the present one which is no more maintained") and was then able to use it in my current R installation. I must have confused it with a different function.

chronopl() does not provide a likelihood score as far as I can see, but it seems to be very fast. I quickly ran it with default parameters and lambda = 1, again setting root age to 50:
mytimetree <- chronopl(mytree, lambda = 1, age.min = 50, age.max = NULL, node = "root")
The result looks very similar to what chronos() produced with the (low likelihood) relaxed model:


Various parameters can be changed, but as implied above, if I want to do careful model comparison I would use chronos() anyway.

Mean Path Lengths

The chronoMPL() method time-calibrates the phylogeny with what is called a mean path lengths method. The documentation makes clear that multiple calibration points cannot be used; the idea is to make an ultrametric tree, pick one lineage split for which one has a credible date, and then scale the whole tree so that the split has the right age. The command is simply:
mytimetree <- chronoMPL(mytree)
The problem is, the resulting chronogram often looks like this:


Most of the branch length distribution fits the results for the favoured model in the analysis with chronos(), see above. That's actually great, because chronoMPL() is so much faster! But you will notice some wonky lines in particular in the top right and bottom right corners of this tree graph. Those are negative branch lengths. Did somebody throw the ancestral species into a time machine and set them free a bit before they actually evolved?

Some googling suggests that this happens if the phylogram is very unclocklike, which, unfortunately, is often the case in real life. That limits rather sharply what mean path lengths can be used for.
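It is at least easy to check whether a given result is affected; a quick sketch, using the mytimetree object from above:

any(mytimetree$edge.length < 0)    # TRUE if the time machine was used
sum(mytimetree$edge.length < 0)    # number of negative branches
min(mytimetree$edge.length)        # how bad the worst one is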

The compute.brtime() function

Another function that I have now tried out is compute.brtime(). It can do two rather different things.

The first is to transform a tree according to what I understand has to be a full set of branching times for all splits in the tree. The use case for that seems to be if you have a tree figure and a table of divergence times in a published paper and want to copy that chronogram for a follow-up analysis, but the authors cannot or won't send it to you. So you manually type out the tree, manually type out a vector of divergence times (knowing which node number is which in the R phylo format!), and then you use this function to get the right branch length distribution, as sketched below. May happen, but presumably not a daily occurrence. What we usually have is a tree for which we want the analysis to infer biologically realistic divergence times that we don't know yet.
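A toy sketch of this first use, with a made-up tree and made-up branching times; in the phylo format the root is node number Ntip + 1, and the vector needs one age per internal node, in node number order:

library(ape)
toytree <- read.tree(text = "((a,b),(c,d));")   # nodes: 5 = root, 6 = (a,b), 7 = (c,d)
mydates <- c(10, 6, 3)                          # invented ages for nodes 5, 6 and 7
mytimetree <- compute.brtime(toytree, method = mydates)
branching.times(mytimetree)                     # returns the ages we just assigned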

The second thing the function can do is to infer an ultrametric tree without any calibration points at all but under the coalescent model. The command is then as follows.
mytimetree <- compute.brtime(mytree, method="coalescent", force.positive=TRUE)
It seems that the problem of ending up with negative branch lengths was, in this case, recognised and solved simply by giving the user the option to tell the function PLEASE DON'T. I assume they are collapsed to zero length (?). My result looked like this:


Note that this is more along the lines of "one possible solution under the coalescent model" than "the optimal solution under this here clock model", so every run will produce a slightly different ultrametric tree. I ran it a few times, and one aspect that did not change was the clustering of nearly all splits close to the present, which I (and PL, see above) would consider biologically unrealistic. Still, we have an ultrametric tree in case we need one in a hurry.

It is quite possible that I have missed other options in ape, but these are the ones I have tried out so far.

Something completely different: non-ultrametric chronograms

Finally, I should mention that there are methods to produce very different time-calibrated trees in palaeontology. The chronograms discussed in this post are all inferred under the assumption that we are dealing with extant lineages, so all branches on the chronogram end flush in the present, and consequently a chronogram is an ultrametric tree. And usually the data that went into inferring the topology was DNA sequence data or similar.

Palaeontologists, however, deal with chronograms where many or all branches end in the past because a lineage went extinct, making their chronograms non-ultrametric, so that they look like phylograms. And usually the data that went into inferring the tree topology was morphological. This is a whole different world for me, and I can only refer to posts like this one and this one which discuss an R package called paleotree.

There also seems to be a method in newer ape versions called node.dating, which is introduced with the following justification:
Our software, node.dating, uses a maximum likelihood approach to perform divergence-time analysis. node.dating is written in R v3.30 and is a recent addition to the R package ape v4.0 (Paradis et al., 2004). Previously, ape had the capability to estimate the dates of internal nodes via the chronos function; however, chronos requires ultrametric trees and is thus unable to incorporate information from tips that are sampled at different points in time.
This suggests that the point is the same, to allow chronograms with extinct lineages, but in this case aimed more at molecular data. Their example case is virus sequence data.

Monday, April 2, 2018

How problematic is the jump dispersal parameter in ancestral area inference?

I recently read an article in the Journal of Biogeography titled "Conceptual and statistical problems with the DEC + J model of founder-event speciation and its comparison with DEC via model selection". Its authors are Richard Ree, the developer of the original DEC model, and Isabel Sanmartin.

The main problem with discussing the paper here is that it would probably take 5,000 words to properly explain what it is even about. I will try to provide the most superficial introduction to the topic and otherwise assume that of the few people who will read this blog most are at least somewhat familiar with it.

The relevant area of research is the estimation or inference of ancestral areas and biogeographic events. Say we have a number of related species, the phylogeny showing how they are related, a number of geographic areas in which each species is either present or absent, and at least one model of biogeographic history. For the purposes of what I will subsequently call ancestral area inference (AAI) we assume that the species are well-defined and that the phylogeny is as close to true as we can infer at the time, so that both will simply be accepted as given. How to objectively define biogeographic areas for the study group is another big question, but again we take it as given that that has been done.

The idea of AAI is to take these pieces of information and infer what distribution ranges the ancestral species at each node of the phylogeny had, and what biogeographic events took place along the phylogeny to lead to the present patterns of distribution. What model of biogeographic events we accept matters a lot, of course. Imagine the following simple scenario of three species and three areas, with sister species occurring in areas A and B, respectively, and their more distant relative occurring in both areas B and C:



Assuming, for example, that our model of biogeographic history favours vicariant speciation and range expansions, we may consider the scenario on the left to be a very probable explanation of how we ended up with those patterns of distribution. First the ancestral species of the whole clade occurred in all areas, and vicariant speciation split it into a species in area A and one in areas B and C. The former expanded to occur in both A and B and then underwent another vicariant speciation event, done.

If we have reason to assume that this is unlikely, for example because area A is an oceanic island, we may favour a different model. In the right hand scenario we see the ancestral species occurring in areas B and C and producing one of its daughter species via subset sympatry in area B. At least one seed or pregnant female of that new lineage is then dispersed to island A. An event such as this last one, where dispersal leads to instant genetic isolation and consequent speciation, is in this context often called 'jump dispersal' or, as in the title of the paper, 'founder-event speciation', to differentiate it from the much slower process of gradual range expansion followed by vicariant or sympatric speciation*.

I am not saying that either of these scenarios is the best one to explain how the hypothetical three species evolved and dispersed. In fact I would say that three species are too small a dataset to estimate biogeographic history with any degree of confidence, but it provides an idea of what ancestral area inference is about.

Perhaps the best established approaches to AAI are Dispersal and Vicariance Analysis (DIVA) and the Dispersal, Extinction and Cladogenesis model (DEC). The former was originally implemented as parsimony analysis in software of the same name, and it has a tendency to favour vicariance, as the name suggests. Likelihood analysis under the DEC model became popular in its implementation in the software Lagrange, and in my limited experience and to the best of my understanding it is designed to have daughter species inherit part of the range of the ancestor, often leading to subset sympatry. And there are other approaches, of course.

As the result of his PhD project, Nick Matzke introduced the following two big innovations in AAI: First, the addition of a parameter j, for jump dispersal, to existing models. This allows the kind of instantaneous speciation after dispersal to a new area that I described above, and which can be assumed to be particularly important in island systems. Second, the idea that the most appropriate model for a study group should be chosen through statistical model selection, as in other areas of evolutionary biology. He created the R package BioGeoBEARS to allow such model selection. It originally implemented likelihood versions of DIVA, DEC and a third model called BayArea, all assuming the operation of slightly different sets of biogeographic processes. Each of them can be tested with and without the j parameter and, after another update, with or without an x parameter for distance-dependent dispersal.

Now I come finally (!) to Ree & Sanmartin. Their eight page paper, as the title implies, is a criticism of these two innovations. What do they argue? I hope I am summarising this faithfully, but in my eyes their three core points are as follows:
  • A biogeographic model with events happening at the nodes of the tree as opposed to along the branches, as is the case with jump dispersal, is not a proper evolutionary model because such events are then "not modeled as time-dependent". In other words, only events that have a per-time-unit probability of occurring along a branch are appropriate.
  • Under certain conditions the most probable explanation provided by a model including the j parameter is that all biogeographic events were jump dispersals. The j parameter gets maximised and explains everything by itself. They call this scenario "degenerate", because the "true" model must "surely" include time-dependent processes.
  • DEC and DEC + j (and, I assume, by extension any other model and its + j variant) cannot be compared in the sense of model selection.

I must, of course, admit that model development is not my area. Consequently I am happy to defer regarding points one and three to others who have more expertise, and who will certainly have something to say about this at some point. I can only at this moment state that these claims do not immediately convince me. Certainly it is often the case that models with very different parameters are statistically compared with each other?

Is it not possible that the best model to explain an evolutionary process may sometimes indeed have a parameter that is not time-dependent but dependent on lineage splits? In the present case, if it is a fact that jump dispersal caused a lineage split, then both events quite simply happened instantaneously (at the relevant time scale of millions of years); in a sense, they were the same event, as the dispersal itself interrupted gene flow.

Perhaps more importantly, however, I am not at all convinced by the second point. Generally I am more interested in practical and pragmatic considerations than theory of statistics and philosophy. In phylogenetics, for example, I am less impressed by the claim that parsimony is supposedly not statistically consistent than by a comparison of the results produced by parsimony and likelihood analysis of DNA sequence datasets. Do they make sense? What can mislead an analysis? What software is available? How computationally feasible is what would otherwise be the best approach, and can it deal with missing data?

So in the present case I would also like to consider the practical side. Is the problem of j being maximised so that everything is explained by jump dispersal at all likely to occur in empirical datasets? In the paper, Ree and Sanmartin illustrate the problem with a two species / two area example. That is clearly not a realistic empirical dataset, as it is much too small for proper analysis. But if we understand to some degree how the various model parameters work, we can deduce under what circumstances j is likely to be maximised.

Unless I am mistaken, the circumstances appear to be as follows: We need a dataset in which all species are local endemics, i.e. all are restricted to a single area, and in which sister species never share part of their ranges. This is because other patterns cannot be explained by jump dispersal. If a species occupies two or more areas, it would have had to expand its range, so the analysis cannot reduce the d parameter for range expansion to zero. If sister species share part of their ranges, likewise; if they share the same single area, they must have diverged sympatrically, which again is not speciation through jump dispersal.

This raises the question, how likely are we to find datasets in which these two conditions apply? In my admittedly limited experience such datasets do not appear to be very common. If we are dealing, for example, with a small to medium sized genus on one continent, we will generally find partly overlapping ranges, and often at least one very widespread species. The j parameter will not be maximised. If we are doing a global analysis of a large clade, we will need rather large areas (because if you use too many small areas the problem becomes computationally intractable). This means, among other things, that entire subclades will share the same single-area range, and j will not be maximised.

In other words, the problem of 'all-jump dispersal' solutions appears to be rather theoretical. But what if we actually do have such a dataset? Is it not a problem then? To me the next question is under what circumstances such a situation would arise. Again, we have all species restricted to single areas, meaning that they apparently find it hard to expand their ranges across two areas. Why? Perhaps geographic separation to the degree that they rarely disperse? Geographic separation to the degree that when they disperse gene flow is interrupted, leading to immediate speciation? Again, we never have sister species sharing an area. Why? A good explanation would be that each area is too small for sympatric speciation to be possible.

Now what does that dataset sound like? To me it sounds like an archipelago of small islands, or perhaps a metaphorical island system such as isolated mountain top habitats. The exact scenario, in other words, in which all-jump dispersal seems like a very probable explanation. Because your ancestral island is too small for speciation, the only way to speciate is to jump to another island, and if you jump to another island you are immediately so isolated from your ancestral population that you speciate.

Again, I am not a modeler, and I have not run careful simulation experiments before writing this, but based on this thought experiment it seems to me as if the + j models would work just as they should: j would not be maximised under circumstances where the other processes are needed to explain present ranges, but it would be maximised under precisely those extremely rare circumstances where 'all jump dispersal' is the only realistic explanation.

Footnote

*) Sympatric meaning here at the scale of the areas defined for the analysis. If one of the areas in the analysis is all of North America, for example, it is likely that the 'sympatric' events inside that area would in truth mostly have been allopatric, parapatric or peripatric at a smaller spatial scale.

Saturday, January 27, 2018

Bioregionalisation part 3: clustering with Biodiverse

Biodiverse is a software for spatial analysis of biodiversity, in particular for calculating diversity scores for regions and for bioregionalisation. As mentioned in previous posts, the latter is done with clustering. Biodiverse is freely available and extremely powerful, just about the only minor issues are that the terminology used can sometimes be a bit confusing, and it is not always easy to intuit where to find a given function. As so often, a post like this might also help me to remember some detail when getting back to a program after a few months or so...

The following is about how to do bioregionalisation analysis in Biodiverse. First, the way I usually enter my spatial data is as one line per sample. So if you have coordinates, the relevant comma separated value file could look something like this:
species,lat,long
Planta vulgaris,-26.45,145.29
Planta vulgaris,-27.08,144.88
...
To use equal area grid cells you may have reprojected the data so that lat and long values are in meters (a sketch of one way to do this follows the next example), but the format is of course the same. Alternatively, you may have only one column for the spatial information if your cells are not going to be coordinate-based but, for example, political units or bioregions:
species,state
Planta vulgaris,Western Australia
Planta vulgaris,Northern Territory
...
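As an aside, the reprojection mentioned above can be done in R, for example with the sf package; a hypothetical sketch, where the input file name and the choice of a cylindrical equal-area projection are my assumptions:

library(sf)
dat <- read.csv("records.csv")                               # columns: species, lat, long
pts <- st_as_sf(dat, coords = c("long", "lat"), crs = 4326)  # WGS84 degrees
pts_ea <- st_transform(pts, "ESRI:54034")                    # World Cylindrical Equal Area
dat[, c("long", "lat")] <- st_coordinates(pts_ea)            # coordinates now in metres
write.csv(dat, "records_projected.csv", row.names = FALSE)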
Just for the sake of completeness, different formats such as a tsv would also work. Now to the program itself. Start Biodiverse and choose 'Basedata -> Import' from the menus.


Navigate to your file and select it. Note where you can choose the format of the data file in the lower right corner. Then click 'next'.


The following dialogue can generally be ignored, click 'next' once more.


But the third dialogue box is crucial. Here you need to tell Biodiverse how to interpret the data. The species (or other taxa) need to be interpreted as 'label', which is Biodiversian for the things that are found in regions. The coordinates need to be interpreted as 'group', the Biodiversian term for information that defines regions. For the grouping information the software also needs to be told if it is dealing with degrees for example, and what the size of the cells is supposed to be. In this case we have degrees and want one degree squared cells, but we could just as well have meters and want 100,000 m x 100,000 m cells.


After this we find ourselves confronted with yet another dialogue box and learn that, despite being told which column is lat and which one is long, Biodiverse still doesn't understand that the stuff we just identified as long is meant to be on the x axis of a map. Arrange the two on the right so that long is above lat, and you are ready to click OK.


The result should be something like this: under a tab called 'outputs' we now have our input, i.e. our imported spatial data.


Double-clicking on the name of this dataset will produce another tab in which we can examine it. Clicking on a species name will mark its distribution on the map below. Clicking onto a cell on the map will show how similar other cells are to it in their species content. This will, of course, be much less clear if your cells are just region names, because in that case they will not be plotted in a nice two-dimensional map.


Now it is time to start our clustering analysis. Select 'Analyses -> cluster' from the menu. A third tab will open where you can select analysis parameters. Here I have chosen S2 dissimilarity as the metric. If there are ties during clustering it makes sense to break them by maximising endemism (because that is the whole point of the analysis anyway), so I set it to use Corrected Weighted Endemism first and then Weighted Endemism next if the former still does not resolve the situation. One could use random tie-breaks, but that would mean an analysis is not reproducible. All other settings were left as defaults.


After the analysis is completed, you can have the results displayed immediately. Alternatively, you can always go back to the first tab, where you will now find the analysis listed, and double-click it to get the display.

As we can see there is a dendrogram on the right and a map on the left. There are two ways of exploring nested clusters: Either change the number of clusters in the box at the bottom, or drag the thick blue line into a different position on the dendrogram; I find the former preferable. Note that if you increase the number too much Biodiverse will at a certain point run out of colours to display the clusters.


The results map is good, but you may want to use the cluster assignments of the cells for downstream analyses in different software, or simply to produce a better map somewhere else. How do you export the results? Not from the display interface. Instead, go back to the outputs tab, click the relevant analysis name, and then click 'export' on the right.

You now have an interface where you can name your output file, navigate to the desired folder, and select the number of clusters to be recognised under the 'number of groups' parameter on the left.


The reward should be a csv file like the following, where 'ELEMENT' is the name of each cell and 'NAME' is the column indicating what cluster each cell belongs to.


Again, very powerful; one just has to keep in mind that your bioregions, for example, are variously called clusters, groups, and NAME depending on what part of the program you are dealing with.
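If the downstream analysis happens in R, reading the export back in is then trivial (file name hypothetical):

clusters <- read.csv("biodiverse_export.csv")
head(clusters)            # ELEMENT = cell name, NAME = cluster assignment
table(clusters$NAME)      # number of cells per cluster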

Friday, April 7, 2017

Parsimony versus models for morphological data: a recent paper

I have written on this blog before about the use of likelihood or Bayesian phylogenetics for morphological data. In our journal club this week we discussed another of the small but growing number of recent papers arguing that parsimony should be dropped in favour of model-based analyses even for morphology:
Puttick et al., 2017. Uncertain-tree: discriminating among competing approaches to the phylogenetic analysis of phenotype data. Proceedings of the Royal Society B 284, doi: 10.1098/rspb.2016.2290
Puttick et al. constructed maximally balanced and unbalanced phylogenies, simulated sequence data for them under the HKY + G model of nucleotide substitution, turned the data matrices into binary and presumably unordered multistate integer characters, and then used equal weights parsimony, implied weights parsimony, and Bayesian and likelihood analyses under the Mk model to try and get the phylogenies back with an eye on accuracy (correctness) and tree resolution. In a second approach, they reanalysed previously published morphological datasets to see what happened to controversial taxon placement under the different approaches.

One of the problems with simulation studies is always that they can come out as kind of circular: if you simulate data under a model it is no surprise that the same model would perform best when trying to infer the input into the simulations. In this case Puttick et al. were admirably circumspect in that not only did they simulate their data under a different model (HKY + G) than that ultimately used in phylogenetic analysis (Mk), but they also repeated the analyses until they had achieved a distribution of homoplasy that mirrored the one found in empirical datasets. This is important because morphology datasets for parsimony analysis are scored to minimise homoplasy, while uncritically simulating matrices may lead to much higher levels of homoplasy, thus putting parsimony at a disadvantage.

Still, it should be observed that the HKY + G model is nonetheless unlikely to have produced data that are a realistic representation of morphological datasets, especially considering that the latter would at a minimum also include multistate characters with ordered states. Also, from a cladist's perspective homoplasy in a morphological dataset is a character scoring error waiting to be corrected in a subsequent analysis. But well, of course using zero homoplasy datasets would also have been unrealistic because real life datasets do have homoplasy in them. (And of course parsimony would "win" all the time if there was zero homoplasy, pretty much by definition.)

Now what are the results? To simplify, Bayesian was best at getting the tree topology right, followed by equal weights parsimony and implied weights parsimony, with likelihood coming in last. Likelihood always produces fully resolved trees, and Bayesian produces the least resolved ones. The authors argue, as Bayesians would, that this is exactly how it should be, as it simply tells us that the data aren't strong enough; the other approaches may give us false confidence. (Although of course parsimony and likelihood analyses can likewise involve several different ways of quantifying support or confidence.)

In conclusion, Puttick et al. make the following recommendations:

First, Bayesian inference should be the preferred approach.

Second, future morphological datasets should be scored with model-based approaches in mind. This means that the number of characters should be maximised by including homoplasious ones, because that will allow a better estimate of rates. As this is the exact opposite of the scoring strategy that parsimony analysis requires, it will be hard to change habits.

What is more, I have to smile at Puttick et al.'s expectations here: they simulated data matrices of 100, 350 and 1,000 characters. Maybe you can get 400 or so for some animals (if the fossils are well enough preserved), but for any plant group I have worked on I would struggle to get 30. And wouldn't you know it, the single empirical botanical dataset they re-analysed had only 48.

Third, researchers should lower their expectations and get used to living with unresolved relationships, as Bayesian analysis produces less resolved phylogenies.

Our discussion of the paper was wide-ranging. When I commented that one of the advantages of traditional parsimony software is that it easily allows the implementation of any step matrix that is needed (imagine a character where state 0 can change into states 1, 2 or 3, but 1-3 cannot change into each other) I was informed that that is in fact possible in BEAST. That is a pleasant surprise, as I had assumed that it was limited to setting a few simple models such as standard Mk for unordered states, nothing more. However, those who have written XML files for BEAST may want to consider if that is "easy" compared with writing a Nexus file for PAUP. Personally I find BEAST input files very hard to understand.
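For reference, the step matrix mentioned above could be specified in a Nexus ASSUMPTIONS block roughly as follows; this is an untested sketch from memory, with 'i' marking prohibited (infinite-cost) changes and the character number invented:

begin assumptions;
    usertype zero_to_any (stepmatrix) = 4
        0 1 2 3
        .  1  1  1
        1  .  i  i
        1  i  .  i
        1  i  i  .
    ;
    typeset * mytypes = zero_to_any: 5;
end;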

Another concern was that while nucleotide substitution models are based on a fairly good understanding of what can happen to DNA nucleotides which, after all, have a limited number of states and transitions between those states, it is considerably less clear what the most appropriate model for any given morphological character is.

What is more, somebody pointed out that there are essentially two options in a model based analysis: either the likelihood of state transitions is fixed, which is a difficult decision to make, or it is estimated during the analysis. But in the latter case the probability of, for example, changing the number of petals would be influenced by the probability of shifting between opposite and alternate leaf arrangement. And clearly that idea is immediately nonsensical.

In summary, the drumbeat of papers on the lines of "we are the Bayesians; you will be assimilated; resistance is futile" is not going to stop any time soon. I use Bayesian and likelihood analyses all the time for molecular data, no problem. But I am still not convinced that the Mk model would be my go-to approach the next time I have to deal with morphological data. It seems to me that it is much easier to justify one's model selection in the case of DNA than in the case of, say, flower colour or leaf length; that the idea of setting one model and estimating gamma across totally incomparable traits is odd; and that I would hardly ever have enough characters for Bayesian analysis to produce more than a large polytomy.

But I guess all that depends on the study group. I can imagine there would be morphometric data for some groups of organisms for which stochastic models work quite well.

Thursday, November 17, 2016

Clock rooting with strong rate shifts - or not

Today at our journal club we discussed Schmitt 2016, "Hennig, Ax, and present-day mainstream cladistics, on polarising characters", published in Peckiana 11: 35-42.

The point of the paper is that early phylogeneticists discussed various ways of figuring out character polarity (i.e. which character state is ancestral and which is derived) first and then using that inference to build a phylogeny, whereas today nearly everybody does the phylogeny building first and then uses outgroup rooting to polarise the resulting network and infer character polarity.

And... that's it, really. There does not appear to be any clear call to action, although one would have expected something along the lines of "and this is bad because...". The paper does end with an exhortation to use more morphological characters instead of only molecular data, and then there is language that may be meant to identify the author as a proponent of paraphyletic taxa without making it explicit (anagenesis!), but neither of those two conclusions appears to be to the point. There is no actual way forward regarding the question of how to polarise characters without outgroup rooting.

The approaches discussed in the paper are the following:

Palaeontological precedence. The character state appearing first in the fossil record is the ancestral one. The problem is, this only works if we assume that the fossil record is fairly complete.

Chorological progression. The character state found more frequently near the edges of a range is the derived one, whereas the ancestral state dominates at the centre of origin. The problem is that this is circular, because we first need to figure out where the centre of origin is. I am not too convinced of the principle either.

Ontogenetic precedence. Because organisms cannot completely retool their developmental pathways but only change through (1) neoteny or (2) attaching steps to the end of the process, the earlier states in ontogeny are the ancestral ones. The author mentions the problem of a scarcity of ontogenetic data; I might add that this shows a bit of a zoological bias, as it will rarely work in plants and presumably never in microorganisms.

Correlation of transformation series. I must admit I don't quite understand the logic here, and the author isn't very impressed by it either.

Comparison with the stem lineage of the study group. The state found in the ancestral lineage is ancestral. This is very obviously circular, because we would need to know the phylogeny first, and being able to infer that was the whole point of polarising the character.

Ingroup comparison. The state that is more frequent in the study group is ancestral. I see no reason to assume that this is always true, as there can be shifts in diversification rates.

Finally, outgroup comparison. The state that is found in the closest relative(s) of the study group is ancestral in the study group. It is perhaps not totally correct to call this circular, but it has something of turtles all the way down: to find out what the closest relative of your study group is you need to polarise the larger group around it, and then you have the same problem. Still, this is the most broadly useful of all these approaches.

Polarising a phylogeny and polarising characters are two sides of the same coin. I have written a thorough post on the former before, which regularly seems to be found by quite a few people doing Google searches. I hope it is still useful. One of the ways I mentioned there for giving the stack of turtles something to stand on is clock rooting, and I found it surprising that the present paper did not mention it at all. It was this, however, that our journal club discussion dwelt on for quite some time.

Admittedly said discussion was a bit meandering, but here are a few thoughts:

The big problem with clock rooting is that it will be thrown off if there are strong rate shifts. Imagine that the true phylogram consists of two sister groups, one with very long branches (short-lived organisms) and the other with very short branches (their long-lived relatives). If we apply a molecular clock model in the phylogenetic analysis, e.g. in MrBayes, it will try to root the tree so that the branches all end at about the same level, the present. The obvious way to do that is to root the tree within the long-branch group. Et voilà, it has rooted incorrectly.

What to do about this?

The first suggestion was to use an outgroup. In my admittedly limited experience that doesn't work so well. It seems that if the rate shift is strong enough the analysis will simply attach the outgroup to the ingroup in the wrong place.

The next idea was to use a very relaxed clock model, in particular the random local clock model available in BEAST (unfortunately not in MrBayes). That was called nice in theory but reportedly makes it hard to achieve stationarity of the MCMC run. I cannot say.

Nick Matzke suggested that a better model could be developed. The hope is that this would allow the analysis to figure out what is going on, recognise the rate shift in the right place, and then root correctly. It would have to be seen how that would work, but at the moment something like that does not appear to be available.

Finally, another colleague said that if the clock models don't work then simply don't use them. Well, but what if we need a time-calibrated phylogeny, a chronogram, to do our downstream analyses, as in biogeographic modelling, studies of diversification rates, or divergence time estimates?

I guess the only way I can think of at the moment is to infer a phylogram whose rooting we trust and then turn it into a chronogram while maintaining the topology, as with the software r8s. Maybe there are other ways around the rooting issue with clock models, but I am not aware of them.

Friday, November 4, 2016

CBA seminar on molecular phylogenetics

Today I went to a Centre for Biodiversity Analysis seminar over at the Australian National University: Prof. Lindell Bromham on Reading the story in DNA - the core principles of molecular phylogenetic inference. This was very refreshing, as I have spent most of the year doing non-phylogenetic work such as cytology, programming, species delimitation, and building identification keys.

The seminar was packed, the audience was lively and from very diverse fields, and the speaker was clear and engaging. As can be expected, Prof. Bromham started with the very basics but had nearly two hours (!) to get to very complicated topics: sequence alignments, signal saturation, distance methods, parsimony analysis, likelihood phylogenetics, Bayesian phylogenetics, and finally various problems with the latter, including choice of priors or when results merely restate the priors.

The following is a slightly unsystematic run-down of what I found particularly interesting. Certainly other participants will have a different perspective.

Signal saturation or homoplasy at the DNA level erases the historical evidence. Not merely: makes the evidence harder to find. Erases. It is gone. That means that strictly speaking we cannot infer or even estimate phylogenies, even with a superb model, we can only ever build hypotheses.

Phylogenetics is a social activity. The point is that fads and fashions, irrational likes and dislikes, groupthink, the age of a method, and quite simply the availability and user-friendliness of software determine the choice of analysis quite as much as the appropriateness of the analysis. Even if one were able to show that parsimony, for example, works well for a particular dataset one would still not be able to get the paper into any prestigious journal except Cladistics. And yes, she stressed that there is no method that is automatically inappropriate, even distance or parsimony. It depends on the data.

Any phylogenetic approach taken in a study can be characterised with three elements: a search strategy, an optimality criterion, and a model of how evolution works. For parsimony, for example, the search strategy is usually heuristic (not her words, see below), the optimality criterion is the minimal number of character changes, and the implicit model is that character changes are rare and homoplasy is absent.

The more sophisticated the method, the harder it gets to state its assumptions. Just saying out loud all the assumptions behind a BEAST run would take a lot of time. Of course that does not mean that the simpler methods do not make assumptions - they are merely implicit. (I guess if one were to spell them out, they would then often be "this factor can safely be ignored".)

Nominally Bayesian phylogeneticists often behave in very un-Bayesian ways. Examples are use of arbitrary Bayes factor cut-offs, not updating priors but treating every analysis as independent, and frowning upon informative topology priors.

Unfortunately, in Bayesian phylogenetics priors determine the posterior more often than most people realise. This brought me back to discussions with a very outspoken Bayesian seven years ago; his argument was "a wrong prior doesn't matter if you have strong data", which if true would kind of make me wonder what the point is of doing Bayesian analysis in the first place.

However, Prof. Bromham also said a few things that I found a bit odd, or at least potentially in need of some clarification.

She implied that parsimony analysis generally used exhaustive searches. Although there was also a half-sentence to the effect of "at least originally", I would stress that search strategy and optimality criterion are two very different things. Nothing keeps a likelihood analysis from using an exhaustive search (except that it would not stop before the heat death of the universe), and conversely no TNT user today who has a large dataset would dream of doing anything but heuristic searches. Indeed the whole point of that program was to offer ways of cutting even more corners in the search.

Parsimony analysis is also a form of likelihood analysis. Well, I would certainly never claim, as some people do, that it comes without assumptions. I would say that parsimony has a model of evolution in the same sense as the word model is used across science, yes. I can also understand how and why people interpret parsimony as a model in the specific sense of likelihood phylogenetics and examine what that means for its behaviour and parameterisation compared to other models. But calling it a subset of likelihood analysis still leaves me a bit uncomfortable, because it does not use likelihood as a criterion but simply tree length. Maybe I am overlooking something, in fact most likely I am overlooking something, but to me the logic of the analysis seems to be rather different, for better or for worse.

One of the reasons why parsimony has fallen out of fashion is that "cladistics" is an emotional and controversial topic; this was illustrated with a caricature of Willi Hennig dressed up as a saint. I feel that this may conflate Hennig's phylogenetic systematics with parsimony analysis, in other words a principle of classification with an optimality criterion. Although the topic is indeed still hotly debated by a small minority, phylogenetic systematics is today state of the art, even as people have moved to using Bayesian methods to figure out whether a group is monophyletic or not.

The main reasons for the popularity of Bayesian methods are (a) that they allow more complex models and (b) that they are much faster than likelihood analyses. The second claim surprised me greatly because it does not at all reflect my personal experience. When I later discussed it with somebody at work, I realised that it depends greatly on what software we choose for comparison. I was thinking BEAST versus RAxML with fast bootstrapping, i.e. several days on a supercomputer versus less than an hour on my desktop. But if we compare MrBayes versus likelihood analysis in PAUP with thorough bootstrapping, well, suddenly I see where this comes from.

These days you can only get published if you use Bayesian methods. Again, that is not at all my experience. It seems to depend on the data, not least because huge genomic datasets can often not be processed with Bayesian approaches anyway. We can see likelihood trees of transcriptome data published in Nature, or ASTRAL trees in other prestigious journals. Definitely not Bayesian.

In summary, this was a great seminar to go to especially because I am planning some phylogenetics work over summer. It definitely got the old cogs turning again. Also, Prof. Bromham provided perhaps the clearest explanation I have ever heard of how Bayesian/MCMC analyses work, and that may become useful for when I have to discuss them with a student myself...

Tuesday, July 19, 2016

I must be missing something

As I continue to contemplate Ebach and Michael's recent paper From Correlation to Causation: What Do We Need in the Historical Sciences?, I would first like to make clear that I like reading and appreciate papers like this one. If we want to get things right it is crucial to be pushed out of one's comfort zone from time to time, so asking a question like "have recent developments in the field gone totally in the wrong direction?" has its value.

That being said, however, it would appear to be a reasonable assumption that >90% of the experts are somewhat unlikely to all overlook a fundamental flaw in what they are doing. It could happen, yes, but especially as a non-expert one would need to see a rather good and clear argument before agreeing that they do.

With this in mind, I will now describe my thoughts about the core of the paper and my understanding of what the authors argue for and why. The first parts of the paper feature a lengthy discussion of the interpretation of ancestry and character evolution in phylogenetics and evolutionary biology, of unstructured and structured representations of the same data, and of the dangers of letting unwarranted assumptions distort the data. There might be quite a bit to be discussed here - for example the claim that "historical sciences, such as taxonomy and palaeontology ... are mostly descriptive and defy testing", which might be news to the discoverers of Tiktaalik, who were able to predict where to look for this "missing link" and then to test that prediction - but the meat of the paper as I understand it starts with the criteria the authors suggest for "comparing ... assumptions against a well-attested set of aspects of causation".

Reference is made to the Bradford Hill Criteria of the medical sciences, and they are then adapted to the presumed needs of historical science, which, as discussed in the earlier post, the authors perceive to be fundamentally different from experimental science. The new Historical Sciences Bradford Hill Criteria are presented under key words that sometimes describe what to look for and sometimes describe what to avoid:

Selection bias. This is an obvious problem in science, although I must say that I do not find the specific examples provided by the authors to be the most convincing.

Temporality. I am afraid I do not really understand what is meant here, so I will quote the relevant paragraph from the paper in full, excepting references. "Present day distributions are the result of past events. Therefore there is the possibility that different taxa alive today may be resultant in different events that occurred at different times. In using a single historical event (e.g., the Oligocene drowning of New Zealand ~30 Ma) to address a larger biogeographical question has resulted in several debates about the age of the New Zealand biota. Single event hypothesis are rare, however little to no scrutiny is typically taken to test their validity."

Especially considering that in the original, i.e. medical science, criteria temporality was apparently about dose-response and reversibility of an effect, I am unsure where the above comes from. It is even less clear to me what is wrong with using a single historical event to address a large question. If New Zealand was indeed completely under water then it quite simply follows that all endemic land-living organisms would at that moment have perished, and that the current land-living organisms would have had to disperse into New Zealand afterwards, when it rose up again. (I am not qualified to assess if it was indeed completely under water, but that is not the point.)

Evidence for a mechanism of action. This again makes more sense to me, although we may have to agree to disagree about how plausible any given assumption of a mechanism of action is. The authors, for example, appear to consider dispersal from one biogeographic region to another to be implausible, but I believe that as long as the probability of that happening is not zero it would have to be a matter of weighing it against the plausibility of alternative explanations (such as those requiring that a family of flowering plants arose before multi-cellular life).

Coherence. In effect the question whether a claim fits what else we are currently confident we know. Makes sense.

Replicability or, as I would call it, reproducibility. The authors argue that historical data are not replicable, so the equivalent is correlation between different datasets.

Similarity. Do two different datasets for the same study group arrive at the same result? It is not entirely clear to me how this is different from the previous, but at any rate the two seem sensible enough.

In summary, some of the above is not clear to me, but other aspects appear immediately reasonable. In fact there would be a trivial interpretation of what the authors want to say: examine your assumptions; remove indefensible assumptions; all else being equal, use the simpler model instead of an unnecessarily convoluted one.

But one would assume it cannot really be that easy, because there is hardly anybody in science who would disagree with that. Modellers are all aware of the danger of over-parameterisation; the problem is simply that all else is not always equal. Sometimes a more complex model is quite simply the better explanation. If you have a plot of dots forming a straight line as data, a very simple linear model with one parameter will do nicely. If you have a plot of dots forming an S-shaped pattern you will quite simply not be able to explain them with such a simple model, you need more parameters. I would suspect the same applies to biogeography; if there are data that defy the explanation of vicariance then our explanation needs to incorporate more processes than vicariance. I thus find it hard to accept the authors' judgement that "complex models are designed to extrapolate data under highly speculative assumptions" whereas "simple models, with plausible assumptions[,] are more likely to pass the [criteria]". It really depends.

Similarly, everybody in science will agree that we shouldn't base our conclusions on bad assumptions. Problem is, everybody will argue that their assumptions are the good ones. Perhaps now would be a good moment to turn to the example provided by the authors of the present paper, to see how they use their new criteria. This might also clear up those aspects that I did not understand when reading the criteria themselves.

Interestingly, the four methods compared by the authors under their criteria are at least partly apples and oranges. They are Brooks Parsimony Analysis (BPA), Ancestral Area Analysis (AAA), Ebach's own Area Cladistics (AC), and the Dispersal-Extinction-Cladogenesis model (DEC). To the best of my understanding the point of BPA is to reconstruct how biogeographic regions are related, as in "the rainforests of New Zealand are sister to the temperate rainforests of Australia, and sister to both of them are the temperate rainforests of Patagonia" (this is not a quote but a hypothetical). We might also call this reconstructing the evolutionary history of biogeographic areas. In contrast, my understanding is that the other three are concerned with reconstructing the inverse, the biogeographic history of an evolutionary lineage, either in its entirety or at least to infer where the common ancestor of the lineage was found (although admittedly I was unable to look deeply into AC as the relevant paper was behind a paywall).

Still, all four are biogeographic methods. I found it easiest to proceed once more criterion by criterion.

Selection bias. DEC is criticised for assuming that "areas" are the result of dispersal and extinction, while the criterion is said to be inapplicable to the other three methods because "the type of area is not specified" in any of them. Once more I can only say that I don't get it.

There are two possible interpretations of "area" in this context. The first is that we are talking about the cells or regions defined a priori as the units of the analysis. If this is the case, then all four methods face the exact same problem, because in all cases the user has to define areas a priori. But this doesn't make sense because the cells defined for a DEC analysis are quite simply not "considered as [sic] a result of dispersal and extinction", they are the units of which a potentially larger range considered to be the result of dispersal and extinction consists.

The second possibility is that we are talking about the results. If this is the case, then yes, obviously a Dispersal-Extinction-Cladogenesis model assumes that the present ranges of organisms are the result of dispersal and extinction (and cladogenesis). That's the point. But if this is what we are talking about then we cannot simply say "doesn't apply" for the other three. AC, for example, assumes that current ranges are the result of vicariance, so at the very least it would need a green marker for a plausible assumption, if indeed we find this assumption plausible; realistically, we would have to start discussing whether vicariance as the only process makes sense.

Temporality. As mentioned above I don't understand how the things considered under this name are any more temporal than the ones that aren't. The example does not really clarify the matter for me either. BPA and DEC are criticised as "speculative" because they use "incongruence" (between distribution patterns of different lineages? I believe that is not how DEC works...) to "explain" or "justify" "ad hoc events" or "processes". First, I think what is meant here is the other way around, i.e. that BPA and DEC explain certain patterns by invoking events that the authors consider to be ad hoc assumptions, apparently in practice meaning any biogeographic process except vicariance.

Second, and more importantly, it is, to say the least, not clear to me why vicariance is less ad hoc than dispersal, extinction and cladogenesis; this just goes back to my earlier point that everybody thinks their preferred explanation is the plausible one. Anyway, AAA is likewise criticised as speculative because "duplicated areas are considered to be part of an original ancestral area". AC, on the other hand, is given a green for entirely plausible assumptions because it only assumes that "geographical congruence is a result of common historical processes". Taken on its own that may sound reasonable, but what about the incongruences? Are they simply ignored? As mentioned above, if there are data that defy a one-parameter model, then more parameters would appear to be warranted.

There is little to say about evidence for a mechanism of action because none of the four methods is given a clean bill of health. I actually find it rather impressive that the authors call even their preferred method speculative on this aspect, for explaining every congruence with vicariance. I do not, however, understand what is meant by "tree topology determines all processes" in the case of DEC. Taken at face value it is plainly wrong, because not only the phylogeny but also the present distribution data go into the analysis. What is more, the same necessarily applies to the three other methods, except that some of them use "areagrams" instead of the phylogenetic tree of a group of organisms.

Finally, the treatment of coherence, replicability, and similarity seems even stranger to me. AC is lauded for comparing its results against other data, and, with the exception of BPA under similarity, the other three methods are criticised for not doing so. But how does the method determine what the end user does with it? What if the user of AC decides not to make any further comparison? What if the user of the DEC model goes on to apply the model to the next four lineages occurring across the same biogeographic areas? How would using DEC exclude such a possibility?

Maybe I am missing something, but it seems to me as if all four methods generally merit at best the same colour, or level of plausibility, on all criteria. If anything I would look somewhat askance at BPA, AAA and AC for simply assuming that the concept of "areagrams" makes sense in the first place, because if there is any significant degree of exchange between biogeographic regions it doesn't.

Either way, I am afraid I cannot claim to have understood how to apply these new criteria in an unbiased manner going forward.

Wednesday, July 13, 2016

Experimental versus historical science

Long time no blog; it seems to come to me in bursts.

Anyway, a colleague has drawn my attention to a paper that has recently appeared, From Correlation to Causation: What Do We Need in the Historical Sciences?, by Ebach and Michael. It argues that "the integrity of historical science is in peril due [to] the way speculative and often unexamined causal assumptions are being used", and further suggests six criteria to check these supposedly speculative assumptions against.

In effect, the issue appears to be the use of models in phylogenetics and especially in biogeography, and here, unless my reading between the lines is mistaken, the acceptance of any process except vicariance.

Before even delving into any other parts of the argumentation, it would be interesting to consider one of the underlying premises, which is clear already from the title of the paper: the assumption that historical sciences are fundamentally different from experimental sciences. As the authors write, "any evidence we adduce for some historical event needs must be contemporary evidence from which we make inferences on the basis of auxiliary hypotheses". But is it really any different in experimental sciences like medicine or physics? Do they not also have auxiliary hypotheses and assumptions at every step? Perhaps it is a failing on my part, but I at least cannot clearly see a marked difference.

Yes, of course we have easier access to evidence about things that happen around us every day, and it is much easier to gather more of it. But that is a question of quantity, not of a qualitative, let alone epistemological, difference. To illustrate the point, let us consider an extremely simple case, the textbook statistics example of die rolls.

First assume that I give you a die and ask you if you think it is loaded. You will then perhaps roll it twelve times, and get the results 2, 6, 5, 6, 6, 6, 6, 6, 3, 6, 6, and 6. Instinctively you might now conclude that it is, indeed, loaded. If you want to be scientific about it, you would do statistics to calculate how likely it is to roll these results with a fair die. It is, after all, possible for a fair die to produce nine sixes out of twelve rolls; in fact it could produce a hundred sixes out of a hundred rolls. The question is merely how unlikely that is.
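In R, for example, that calculation is a one-liner. A minimal sketch for nine or more sixes out of twelve rolls under the fair-die hypothesis:

binom.test(9, 12, p = 1/6, alternative = "greater")
# or directly:
pbinom(8, size = 12, prob = 1/6, lower.tail = FALSE)
# about 1.3e-05: a fair die is a very poor explanation of these rolls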

You have the die in your hand and you just did an experiment. Experimental science, right? Okay, now assume that after the twelve rolls described above, I snatch the die away from you and drop it in the Mariana Trench. It could be argued that from that moment on the research question "is the die loaded?" has turned into historical science. The twelve rolls are all the data we will ever get. From there we can take the next step and consider a scenario where we read about the twelve rolls in a book that is hundreds of years old. Surely now the question is squarely in the realm of historical science.

But has anything changed? I don't see it. The exact same statistical approach that applied before still applies afterwards. There is no difference in how we address the problem in either case.

And of course this situation is what we always face in science, in a certain sense. We don't literally have a die snatched away, but we do have time, money and other resource constraints. At some point we stop collecting data for any given study and analyse them. Consequently I fail to see where the philosophical difference is between being limited by the data that are available due to an accident of history and being limited by the data that are available due to, for example, our luck with DNA sequencing success before the project budget ran out.

The flip side of being limited by the dataset we have in any given situation is whether we can get more data in the future. Again, with experimental science we can get additional data more easily than in, say, palaeontology. But in real life historical research we are usually not reliant on a single die that has been destroyed either, as the most interesting questions are broader than that. So even with historical data we can usually go back and try to acquire more fossils or archaeological artefacts.

What we have considered so far was inferring what process operated in the past (a fair or a loaded die?) from data we have available (the results of twelve rolls). Thinking of biogeography this would be comparable to inferring whether long distance dispersal of plants and animals happened in the past from contemporary patterns of distribution. We can also flip that around now and consider the inference of past one-off events from processes we can still observe today. In biogeography, we can today observe spores, seeds, insects, birds, and ballooning spiders being blown across vast distances and arriving on remote shores. Did Rhipsalis, the only cactus genus naturally occurring outside of the Americas, arrive in Africa through chance dispersal across the ocean or is its current distribution the result of a much older vicariance event?

Of course this was a one-off event, and yes, we will never know the answer for certain. But again I fail to see the difference in principle. I cannot possibly know for certain that the sun will rise again tomorrow morning, but I can have a great deal of confidence in my admittedly tentative conclusion. Going back to the die example, if I give you a die and then ask you, "I rolled it once yesterday evening - what do you think the result was?", you cannot know it for certain either unless I tell you. But you can observe the process - you can roll it a thousand times - and then infer a probability distribution. If you find that it is severely loaded and produces a six 81% of the time, you may be willing to go so far as to suggest that my roll was a six.
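As a minimal R sketch of that last step (the 81% and the exact loading are of course made up for the example):

set.seed(1)
rolls <- sample(1:6, size = 1000, replace = TRUE,
                prob = c(rep(0.038, 5), 0.81))  # a severely loaded die
mean(rolls == 6)  # close to 0.81, so "a six" is by far the best guess for any single past roll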

In summary, I personally do not at this moment see the big difference between experimental and historical science; at least not a difference that could be used to argue that the latter cannot employ, for example, models of the same complexity as the former. Admittedly I am not a philosopher of science though.

Sunday, June 26, 2016

Implementation of substitution models in phylogenetic software

Concluding the little series of posts on nucleotide substitution models, below is a summary of my current understanding of how to set several of the models discussed in the previous post in PAUP and MrBayes. But first a few comments.

For PAUP there are three possible options for the basefreq parameter. My current understanding is that equal sets equal base frequencies as in the JC model (duh), empirical fixes them to the frequencies observed in the data set, and estimate has them estimated as part of the analysis. My understanding is further that while estimate is more rigorous, empirical is often 'close enough' and saves computation time. The point is, in any case: wherever one of the latter two options appears below, I believe one could also use the other without changing the model as such. I hope that is accurate.
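Side by side, and using only options that appear elsewhere in this post, the three settings look like this:

lset basefreq=equal;      [equal frequencies, as in JC or SYM]
lset basefreq=empirical;  [fixed to the frequencies observed in the matrix]
lset basefreq=estimate;   [estimated as part of the analysis]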

What I do not quite understand is how one would set models like Tamura-Nei in PAUP. At least one of the sources I consulted when researching this post (see below) suggests that one can set models in PAUP with, for example, variable transition rates but equal transversion rates, but the PAUP 4.0b10 manual states that the only options for the number of substitution rate categories are 1 (all equal), 2 (presumably to distinguish Tv and Ts), and 6 (all different). Would one not need nst = 3 or nst = 5 to set the relevant models? Perhaps the trick is to set nst = 6 but fix the substitution matrix? But that would mean one cannot estimate it during the analysis.

For the MrBayes commands, note that the applyto parameter needs to be specified with whatever part(s) of the partition the particular model should apply to, as in applyto = (2) for the second part. The MrBayes commands I found in my sources appear to be rather short; I assume that default settings do the rest. Note also that the old MrBayes versions that I am familiar with have now been superseded by RevBayes, but I have no experience with the latter program's idiosyncratic scripting language.
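To make the context of applyto concrete, here is a minimal sketch of a partitioned MrBayes block; the charset names and character ranges are made up for illustration:

begin mrbayes;
  charset gene1 = 1-800;
  charset gene2 = 801-1500;
  partition bygene = 2: gene1, gene2;
  set partition = bygene;
  lset applyto=(1) nst=2 rates=gamma;     [HKY+G for the first gene]
  lset applyto=(2) nst=6 rates=invgamma;  [GTR+I+G for the second]
end;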

In addition to PAUP and MrBayes, I mention whether a model can be selected in three other programs I am familiar with, RAxML, BEAST and PhyML. I have no personal experience with most other Likelihood phylogenetics packages. I tried to check what is available in MEGA, but it wouldn't install for me, and I am not sure if the list of models in the online manual shows only those that are actually available for Likelihood search or whether it includes some that are only available for Distance analysis. The relevant list is linked from a page on Likelihood, but its own text implies it is about Distance. Either way, MEGA appears to have lots of options but I didn't indicate them below.

General Time-Reversible (GTR)

PAUP: lset nst = 6 basefreq = empirical rmatrix = estimate;

MrBayes: lset applyto=() nst=6;

Available in RAxML, BEAST 2.1.3 and PhyML 3.0.

Tamura-Nei (TN93 / TrN)

Available in BEAST 2.1.3 and PhyML 3.0.

Symmetrical (SYM)

PAUP: lset nst = 6 basefreq = equal rmatrix = estimate;

MrBayes: lset applyto=() nst=6; prset applyto=() statefreqpr=fixed(equal);

Hasegawa-Kishino-Yano (HKY / HKY85)

PAUP: lset nst=2 basefreq=estimate variant=hky tratio=estimate;

MrBayes: lset applyto=() nst=2;

Available in RAxML, BEAST 2.1.3 and PhyML 3.0.

Felsenstein 84 (F84)

PAUP: lset nst=2 basefreq=estimate variant=F84 tratio=estimate;

Available in PhyML 3.0.

Felsenstein 81 (F81 / TN84)

PAUP: lset nst=1 basefreq=empirical;

MrBayes: lset applyto=() nst=1;

Available in PhyML 3.0.

Kimura 2-parameters (K80 / K2P)

PAUP: lset nst=2 basefreq=equal tratio=estimate;

MrBayes: lset applyto=() nst=2; prset applyto=() statefreqpr=fixed(equal);

Available in RAxML and PhyML 3.0.

Jukes Cantor (JC / JC69)

PAUP: lset nst=1 basefreq=equal;

MrBayes: lset applyto=() nst=1; prset applyto=() statefreqpr=fixed(equal);

Available in RAxML, BEAST 2.1.3 and PhyML 3.0.

Invariant sites (+ I)

For PAUP, add the following to the lset command: pinvar = estimate

For MrBayes, add the following to the lset command: rates = propinv

Gamma (+ G)

For PAUP, add the following to the lset command: pinvar = 0 rates = gamma ncat = 5 shape = estimate (or another number of categories of your choice for ncat).

For MrBayes, add the following to the lset command: rates = gamma

Invariant sites and Gamma (+ I + G)

For PAUP, add the following to the lset command: pinvar = estimate rates = gamma ncat = 5 shape = estimate (or another number of categories of your choice for ncat).

For MrBayes, add the following to the lset command: rates = invgamma
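Putting the pieces together, a GTR+I+G setup assembled purely from the commands above would then be:

PAUP: lset nst = 6 basefreq = empirical rmatrix = estimate pinvar = estimate rates = gamma ncat = 5 shape = estimate;

MrBayes: lset applyto=() nst=6 rates=invgamma;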

References

This post is partly based on the following very helpful sources, all accessed c. 20 June 2016.

Faircloth B. Substitution models in mrbayes. https://gist.github.com/brantfaircloth/895282

Lewis PO. PAUP* Lab. http://evolution.gs.washington.edu/sisg/2014/2014_SISG_12_7.pdf

Lewis PO. BIOL848 Phylogenetic Methods Lab 5. http://phylo.bio.ku.edu/slides/lab5ML/lab5ML.html

And a PDF posted on molecularevolution.org.

Friday, June 24, 2016

What number of schemes to test to get what substitution models in jModelTest

Another thing that occurred to me that might be useful to write about in the context of substitution models concerns jModelTest.

The background is here that before setting a model in a phylogenetic analysis, one would usually conduct model-testing. There are some programs that do their own testing, but others don't, and consequently people have written software that examines your dataset and suggests the best model. The classic program was simply called ModelTest. It was developed by David Posada, and it was primarily used through and for PAUP. It actually wrote out the PAUP commands of the best-fitting model, so that one could simply copy and paste them over into the Nexus file, and off we go.

Then people wanted to use the results of the model test for MrBayes. Problem was, MrBayes didn't do all the models that PAUP did, and it was annoying to see a model suggested that one couldn't implement. So Johan Nylander produced a modified version called MrModeltest, and it conveniently wrote out the MrBayes and PAUP commands for the best-fitting model.

These programs have now been superseded by jModelTest. On the plus side, this tool is highly portable and extremely user-friendly thanks to its GUI. The user also does not need to have PAUP; instead, jModelTest simply comes packaged with PhyML and hands the data over to that software. And it is parallelised. On the other hand, in contrast to its predecessors it does not appear to write out the PAUP or MrBayes code. But well, that is what my next post is going to deal with.

For the moment I want to address a different problem: when starting the Likelihood calculations for various models, the user can select whether 3, 5, 7, 11 or 203 (!) substitution schemes, and thus twice that number of models (each scheme comes in an equal and an unequal base frequency variant), are going to be tested. (And with or without invariant sites and gamma, if desired.) But the thing is, it is not clear from just looking at the interface which models, for example, the seven schemes are going to cover. If I select seven, will all the models I could implement in BEAST be included? Or will I just waste a lot of computing time on testing models that BEAST can't do anyway?
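(As an aside, jModelTest 2 can also be run from the command line, where the same choice is made explicitly. If I remember its documentation correctly, the call looks roughly like this, with the file name made up; -s sets the number of substitution schemes, -g and -i add gamma and invariant sites, -f the unequal base frequency variants, and -AIC the selection criterion:

java -jar jModelTest.jar -d mydata.fas -s 7 -g 4 -i -f -AIC

I have not verified these flags against the current release, so treat them as an aide-memoire rather than gospel.)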

So I just ran all the numbers across a very small dataset, and this is what models are being tested in each case:

3 substitution schemes: JC, F81, K80, HKY, SYM, GTR

Three schemes is the highest you need to test if deciding on a model for RAxML or MrBayes. Note that as of updating this post (30 Nov 2016) RAxML can only do JC, K80, HKY, and GTR, while MrBayes can do all six.

5 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TPM1, TPM1uf, SYM, GTR

I am writing them out as named in jModelTest, but note that TPM1 is synonymous with K81 or K3P, and TrN is called TN93 in some software. An "uf" clearly means a variant with unequal base frequencies, and as mentioned in the previous post "ef" is similarly a variant with equal base frequencies.

Five schemes is the highest you need to do if you are model-testing for a BEAST run, because all its models are covered. It also seems to me as if with the exception of F84 all models available in PhyML are covered, and that one doesn't appear to be tested under higher numbers of substitution schemes either. So the same consideration applies, unless jModelTest uses a different name for it (?).

7 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TIM1ef, TIM1, TVMef, TVM, TPM1, TPM1uf, SYM, GTR

11 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TPM1, TPM1uf, TPM2, TPM2uf, TPM3, TPM3uf, TIM1ef, TIM1, TIM2ef, TIM2, TIM3ef, TIM3, TVMef, TVM, SYM, GTR

At this stage it becomes clear that there are a lot of strange variants of the Three-Parameter and Transitional Models that I overlooked in my previous post. I don't think they are used very often though...

203 substitution schemes: This adds a gazillion models named with a scheme of six letters or six letters plus F. Not sure how many people ever use them, so I will just stop here. I have a strong tendency to find GTR or HKY suggested for my data anyway...

(Updated 30 Nov 2016 to include information for RAxML and MrBayes.)

Thursday, June 23, 2016

Overview of the substitution models

Continuing from the previous post, here is a run-down of the commonly used models, their abbreviations, alternative names and abbreviations, and parameters. They are sorted from most parameter-rich to simplest.

General Time-Reversible (GTR)
Base frequencies variable.
All six substitution rates variable.
Number of free parameters: 8

Transversional (TVM)
Base frequencies variable.
All four transversion rates variable, transitions equal.
Number of free parameters: 7

Transitional (TIM)
Base frequencies variable.
Two different transversion rates, transition rates variable.
Number of free parameters: 6

Symmetrical (SYM)
Base frequencies equal.
All six substitution rates variable.
Number of free parameters: 5

Kimura 3-parameters (K81, K3P, TPM1)
Base frequencies variable (strictly, this describes the unequal-frequency variant K81uf / TPM1uf; K81 proper assumes equal base frequencies).
Two different transversion rates, transition rates equal.
Number of free parameters: 5

Tamura-Nei (TN93, TrN)
Base frequencies variable.
Transversion rates equal, transition rates variable.
Number of free parameters: 5

Hasegawa-Kishino-Yano (HKY)
Base frequencies variable.
Transversion rates equal, transition rates equal.
Number of free parameters: 4
Differs from F84 in how the transversion rate and transition rate parameters are derived.

Felsenstein 84 (F84)
Base frequencies variable.
Transversion rates equal, transition rates equal.
Number of free parameters: 4
Differs from HKY in how the transversion rate and transition rate parameters are derived.

Felsenstein 81 (F81, TN84)
Base frequencies variable.
All substitution rates equal.
Number of free parameters: 3

Kimura 2-parameters (K80, K2P)
Base frequencies equal.
Transversion rates equal, transition rates equal.
Number of free parameters: 1

Jukes-Cantor (JC, JC69)
Base frequencies equal.
All substitution rates equal.
Number of free parameters: 0

And below an overview of the same information in graphical form. The numbers in brackets are the numbers of free parameters that the relevant axis is contributing.

[Figure: the substitution models arranged in a grid by base frequency and substitution rate assumptions, with the number of free parameters contributed by each axis in brackets.]
Apparently at least some people fill in the empty fields on the left by taking the name of the relevant model to its right and adding "ef" to it. So TVMef would, for example, be a model with all transversion rates variable, transition rates equal, and equal base frequencies. But there is probably a reason why those missing models don't have any well-known names; they seem to be rarely used, if ever.
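As an aside, several of these models are also implemented as distance corrections in the R package ape, which makes it easy to get a feel for how much the choice of model matters. A minimal sketch using the woodmouse example alignment that ships with ape:

library(ape)
data(woodmouse)                      # example DNA alignment included in ape
dist.dna(woodmouse, model = "JC69")  # Jukes-Cantor distances
dist.dna(woodmouse, model = "K80")   # Kimura 2-parameter
dist.dna(woodmouse, model = "TN93")  # Tamura-Nei

Comparing the resulting pairwise distances gives a quick impression of how much, or how little, the model correction changes for a given data set.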