Wednesday, April 24, 2013

Comparison of species tree methods

Update 10 June 2013: This post, originally from 24 April 2013, has been updated extensively because I have since tried out a set of new species tree methods, got STEM to run and gained a bit more experience with some of the others. I have also promoted the post to one of the "recommended phylogenetic systematics" posts on this site despite it not being about the theory of classification.

Update 27 March 2016: Added ASTRAL and iGTP, restructured the post to be more software-focused.

-----

I spent part of the last few days trying out different species tree methods, partly to help a colleague produce an example tree that he can use in a workshop he is planning and partly because I want to infer a species tree for one of my own projects in the next few weeks. This post was written for two reasons: as a note to myself for future reference and as a pointer for somebody who might want to infer a species tree and does not know which of the many programs to choose. A person like myself a few days ago, one could say; if they find this post via a search engine, it might save them some of the frustrations I experienced.

Note that this is not a post for a methods wonk or for somebody who wants to learn about the theoretical or methodological background. It is strictly from the end user perspective, directed at those who want to know what is available, how user friendly the tools are and where to get them.

If you don't know what this is about you might want to refer to my earlier post on the topic. To summarize: these days we mostly use molecular data, in particular the DNA sequences of genes or intergenic spacers, to infer the evolutionary relationships of species. However, any individual gene phylogeny may or may not be congruent with the species phylogeny or with other gene phylogenies because each species inherits a random subset of the pre-existing allele diversity of its ancestral species (incomplete lineage sorting). Alternatively, discrepancies between a gene tree and the species tree or between different gene trees may also arise from introgression, i.e. rare gene flow between distinct species.

To understand the curious limitations of some of the following species tree methods it is important to be aware that there are actually two problems, although they are of course closely related. In reality you will probably have one if you have the other, but depending on the data you have generated you may only see one of them and consequently you only have to solve one of them.

The first problem is when you find that the alleles of one gene are actually non-monophyletic in a species (which is sometimes but very incorrectly described as the species not being monophyletic). For example, the situation may be that you have twenty species, and you sequenced only one gene but you did it for ten samples of each species. You find that the alleles of several species are intermingled with the alleles of other species, i.e. their gene copies are non-monophyletic, but you still want to produce a phylogenetic hypothesis for the species relationships. In other words, you want a tree that has each species as only one terminal instead of the current ten.
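
To make the first problem concrete, here is a minimal Python sketch that checks, for one rooted gene tree in Newick format, whether the alleles of each species form a clade. It is only an illustration and not how any of the programs below work internally. The function names, the toy tree and the naming scheme (species name before the first underscore) are my assumptions, and the answer depends on how the gene tree is rooted.

```python
import re

def clades(newick):
    """Return every clade of a rooted Newick tree as a frozenset of leaf names."""
    # Split into brackets, commas and labels; whitespace is removed first.
    tokens = re.findall(r"[(),;]|[^(),;]+", "".join(newick.split()))
    stack, found, prev = [], set(), None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            found.add(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok not in (",", ";") and prev != ")":
            # A leaf label; anything after ":" is a branch length.
            stack[-1].add(tok.split(":")[0])
        # Labels directly after ")" are support values or branch lengths: ignored.
        prev = tok
    return found

def monophyletic_species(newick, species_of):
    """Map each species to True if its alleles form a clade in the gene tree."""
    all_clades = clades(newick)
    alleles = {}
    for leaf in max(all_clades, key=len):  # the root clade contains every leaf
        alleles.setdefault(species_of(leaf), set()).add(leaf)
    return {sp: len(a) == 1 or frozenset(a) in all_clades
            for sp, a in alleles.items()}

# Hypothetical gene tree: species A is monophyletic, B and C are intermingled.
gene_tree = "((A_1:0.1,A_2:0.1)0.98:0.2,((B_1,C_1),(B_2,C_2)));"
result = monophyletic_species(gene_tree, lambda name: name.split("_")[0])
```

In this toy case `result` flags A as monophyletic and B and C as not, which is exactly the situation where you need a species tree method that can handle multiple alleles per species.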

The other problem you could face is gene tree discordance. As an example, you may again have twenty species but this time you sequenced ten genes for only one sample of each species. The dataset is the same size (200 sequences) but you don't see whether the alleles of any gene are non-monophyletic in any given species. What you do see is that the ten gene phylogenies you can infer disagree in some details, that they are discordant. In contrast to the previous situation you already have phylogenies where each species is only one terminal but you actually have too many of them; you want to make one species tree out of the ten gene trees.
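
A quick way to see whether your gene trees are discordant at all is to compare the sets of clades they contain. The following self-contained Python sketch counts the clades found in one rooted tree but not in the other (a simple rooted Robinson-Foulds-style count). The function names and toy trees are assumptions for illustration, and it expects all trees to have the same leaf names, i.e. one sample per species.

```python
import re
from itertools import combinations

def clades(newick):
    """Return every clade of a rooted Newick tree as a frozenset of leaf names."""
    tokens = re.findall(r"[(),;]|[^(),;]+", "".join(newick.split()))
    stack, found, prev = [], set(), None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            found.add(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok not in (",", ";") and prev != ")":
            stack[-1].add(tok.split(":")[0])  # leaf label without branch length
        prev = tok
    return found

def clade_differences(tree_a, tree_b):
    """Number of clades present in exactly one of the two rooted trees."""
    return len(clades(tree_a) ^ clades(tree_b))

def discordant_pairs(gene_trees):
    """Return index pairs of gene trees whose topologies differ."""
    return [(i, j) for (i, a), (j, b) in combinations(enumerate(gene_trees), 2)
            if clade_differences(a, b) > 0]

# Three hypothetical gene trees for the same four species.
gene_trees = [
    "((A,B),(C,D));",
    "((B,A),(D,C));",   # same topology, only drawn differently
    "((A,C),(B,D));",   # discordant with the first two
]
pairs = discordant_pairs(gene_trees)
```

If `pairs` comes back empty, all gene trees agree and any of them can serve as the species tree hypothesis; otherwise you need one of the methods discussed below.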

Often the real situation will be somewhere in the middle. In the case I am working on, I am soon going to have three independent genes and some species with two or three samples but several others with only one. The point of the extreme hypothetical examples above was to illustrate that people can come at the species tree issue from different starting points.

Now the curious thing I have only just become aware of is that some of the different methods or programs that have been written have weird limitations, and those may have something to do with how their authors approached the topic. It appears as if the people who started from an incomplete lineage sorting mindset generally wrote universal solutions that work with all possible datasets: one sample per species or many, one gene or many, whatever you want.

But some other tools look as if they have been written very much from a gene discordance perspective, and for some reason the authors tended to make them much less universal in scope. Several of them do not appear to allow for analyses with only one gene. Most bizarrely, one of them does not allow you to have more than one sample per species. I actually could not believe it for the first 30 minutes or so after I noticed there did not seem to be an option for that. Surely I must have merely overlooked the instructions on how to do it? Surely everybody who works on the gene tree vs species tree issue must address the problem of incomplete lineage sorting? But no, there really is no option for that in this particular program.

Of course, one cannot complain. All these programs are written and made freely available by scientists for other scientists, and there is a surprising number of these tools. In the following, you will find several methods implemented in various different programs or even on a web server. If you got here via a search engine I hope you will find one that does what you need, and that my remarks will be helpful.

Note that the first few expect the user to already have inferred all individual gene trees with a method of the user's choice, be it with PAUP, TNT, RAxML, MrBayes or whatever. In the case of BUCKy it is the same but the gene trees have to come from MrBayes. The last two infer the gene trees and the species tree in parallel so all you need are your sequence data, priors and models of sequence evolution.

Mesquite

Website: http://mesquiteproject.wikispaces.com/

Methods offered
Minimising Deep Coalescences (MDC), minimising gene duplications and losses

Speed
Fast for small datasets but slow for larger ones (e.g. several hours even on my fast work computer for a dataset of > 40 species).

Procedure
Prepare a nexus file with a taxa block for the alleles and with one trees block for all gene trees, and open it in Mesquite. Create a new taxa block for the species. Use TAXA&TREES -> NEW ASSOCIATION to create an assignment of samples to species. Then select TAXA&TREES -> MAKE NEW TREES BLOCK FROM... -> TREE SEARCH -> MESQUITE HEURISTIC. Check your species as the taxa for the tree. Check DEEP COALESCENCE (GENE TREE) or DEEP COALESCENCE MULTIPLE LOCI as criterion, depending on whether you have one or several gene trees, or the duplications and losses option if you are dealing with a gene family. Check STORED TREE BLOCKS as the source for the gene trees. Finally, choose the correct tree block containing the gene trees.

Ease of use
Mesquite has a GUI but it is sometimes a bit counter-intuitive and makes it hard to find the right part of the menus for what you want to do - just look at my instructions above!

Limitations
None that I can see. It works with multiple samples per species and only one gene as well as with one sample per species and multiple genes.

Rating
MDC is simple and it works for every situation, and its only downside might be lack of sophistication compared to likelihood-based methods. I like parsimony methods in general because it is easy to understand what the computer is actually doing, i.e. the analysis is less of a black box to me. MDC is my recommendation if you want something flexible and straightforward without having to worry about priors and models of evolution, or if you need a solution for one gene only. After my recent experiences, I would also say that it is the most stable method for problematic datasets (lots of missing data etc.). Of the available programs, Mesquite is a bit clunky and hard to navigate. However, for MDC I would strongly recommend using Mesquite because at least PhyloNet does not appear to find the best species tree. In one case I tried, it consistently retrieved a solution that was ten deep coalescences worse than the one retrieved by Mesquite. Another advantage over iGTP and PhyloNet is that in Mesquite one can assign samples or alleles to species in a GUI environment. 7/10
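
To illustrate why I find the MDC criterion easy to understand, here is a small Python sketch that scores one candidate species tree against one rooted gene tree by counting extra gene lineages on each species tree branch. This is only the scoring step under my own simplifying assumptions (rooted trees, hypothetical function names, species deduced from the sample name); it is not Mesquite's code, and a real program additionally has to search over many candidate species trees.

```python
def parse(newick):
    """Parse a Newick string into nested lists; leaves are name strings."""
    s = "".join(newick.split()).rstrip(";")
    pos = 0

    def skip_label():
        # Skip support values and branch lengths up to the next delimiter.
        nonlocal pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1
            children = [node()]
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # step over ")"
            skip_label()
            return children
        start = pos
        while pos < len(s) and s[pos] not in ",():":
            pos += 1
        name = s[start:pos]
        skip_label()
        return name

    return node()

def leaves(n):
    return {n} if isinstance(n, str) else set().union(*(leaves(c) for c in n))

def deep_coalescences(species_tree, gene_tree, species_of=lambda name: name):
    """Sum of extra gene lineages over all branches of the species tree."""
    st, gt = parse(species_tree), parse(gene_tree)

    def lineages(n, clade):
        # Count maximal gene subtrees whose alleles all belong to `clade`;
        # each one is a single lineage leaving that species tree branch.
        if {species_of(a) for a in leaves(n)} <= clade:
            return 1
        return 0 if isinstance(n, str) else sum(lineages(c, clade) for c in n)

    def score(n):
        extra = max(lineages(gt, leaves(n)) - 1, 0)
        return extra if isinstance(n, str) else extra + sum(score(c) for c in n)

    return score(st)

congruent = deep_coalescences("((A,B),C);", "((A,B),C);")
discordant = deep_coalescences("((A,B),C);", "((A,C),B);")
```

For the congruent gene tree the score is zero; for the discordant one a single extra lineage has to pass through the branch leading to (A,B). The species tree with the lowest total over all gene trees is the MDC tree.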

PhyloNet

Website: http://bioinfo.cs.rice.edu/phylonet

Methods offered
Global LAteSt Split (GLASS), Minimising Deep Coalescence (MDC), and apparently a 'Species Network' taking into account Incomplete Lineage Sorting and hybridisation

Speed
Fast (MDC) to very fast (GLASS)

Procedure
Prepare nexus file containing gene trees and the desired commands as per the documentation of the software, then call Java to execute PhyloNet from the command prompt/console with the input file as a parameter. The command in the nexus file needs to contain a lengthy list of sample-species-assignments if you have more than one sample per species.

Ease of use
The commands are simple but PhyloNet is extremely touchy with regard to the input file. I spent a long time trying to run an analysis and got only error messages until I reduced all sample names to very few letters. It is possible that names are truncated after a few letters, and the program thought two samples had the same name because they only differed in the last letter ("Genus_species_2" and suchlike), but I did not find any indication of that in the otherwise helpful documentation. It also sometimes behaves a bit bizarrely, e.g. it would execute an MDC command but not a GLASS command when I started it in one specific way recently, and it does not provide a lot of helpful error messages to figure out what exactly went wrong.

Limitations
None that I can see. It works with multiple samples per species and only one gene as well as with one sample per species and multiple genes.

Rating
This program is super-fast and works for every situation. GLASS seems to be rarely used and is thus a bit of a black box for the user and potentially for reviewers of a manuscript if you use it. But the major downsides of PhyloNet are how finicky it is with input files and that its search strategy for MDC appears suboptimal. (See my comments on Mesquite, but note also that I last tried PhyloNet in 2013 - maybe it has got better in the meantime.) 5/10

iGTP - Gene Tree Parsimony

Website: http://genome.cs.iastate.edu/CBL/iGTP/

Methods offered
Minimising Deep Coalescences (MDC), minimising gene duplications and losses, minimising gene duplications

Speed
Have only used it with a small dataset, but it seems to be quite fast.

Procedure
Prepare a text file with all gene trees in Newick format. My current understanding is that they should be cladograms, i.e. with neither branch lengths nor support values. Multiple alleles per species should simply be indicated by the same species name, so there is no allele assignation table. However, that also has the downside that you will have to prepare the trees in an idiosyncratic format that virtually no tree viewer will open. (Instead they will complain that you have the same taxon more than once in the same tree and exit with an error.)
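
Preparing such trees by hand is tedious, so here is a minimal Python sketch of the conversion as I understand it: strip branch lengths and support values from a gene tree and replace every sample name with its species name. The function name and the sample naming scheme ("Genus_species_labNumber") are assumptions for illustration; check the result against the iGTP documentation before relying on it.

```python
import re

def to_species_cladogram(newick, species_of):
    """Turn a gene tree into a cladogram whose leaves are species names."""
    tree = re.sub(r":[0-9.eE+-]+", "", newick)   # drop branch lengths
    tree = re.sub(r"\)[^(),;]+", ")", tree)      # drop support values / node labels
    # Every remaining label is a leaf: replace it with its species name.
    return re.sub(r"[^(),;\s]+", lambda m: species_of(m.group(0)), tree)

def genus_species(sample_name):
    """Assumed naming scheme: Genus_species_labNumber -> Genus_species."""
    return "_".join(sample_name.split("_")[:2])

gene_tree = "((Aus_bus_1:0.1,Aus_bus_2:0.2)0.95:0.3,Aus_cus_7:0.4);"
cladogram = to_species_cladogram(gene_tree, genus_species)
```

The result for the toy tree is `((Aus_bus,Aus_bus),Aus_cus);`, which is precisely the kind of tree with repeated taxon names that ordinary tree viewers refuse to open.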

Ease of use
iGTP has a GUI and is very simple to use. Unfortunately for me it did not work on Ubuntu, only on Windows.

Limitations
Definitely accepts missing alleles and, obviously, multiple alleles per species. I have not tried, but given that it has minimising gene duplication options it should also accept a single gene tree.

Rating
See my comments on the MDC method above under Mesquite. As for the program, I am not impressed by its integrated tree viewer and its failure to work on my Linux machine. But I may prefer it to Mesquite the next time I have a big dataset to deal with, not least because the input file is much easier to set up. 7/10

STRAW - Species TRee Analysis Web server

These are three different distance/algorithm based species tree methods that I will treat together because they are all implemented on the same "Species TRee Analysis Web server" (STRAW) and have certain similarities.

Website: http://bioinformatics.publichealth.uga.edu/SpeciesTreeAnalysis/index.php

Methods offered
Three distance or simple algorithm based approaches called STAR, MP-EST and NJst. References to the individual methods can be found on STRAW.

Speed
Fast.

Procedure
The website offers three different tabs for the aforementioned methods. In each case, you can either upload files or simply paste your data into the text fields. You need to provide gene trees with branch lengths and, if you have more than one allele per species, a species/allele assignment table. One of the greatest services rendered by the website is that it will construct such a table for you if you simply paste your gene trees into the text field under the tab "SpeciesAlleleTableCreator" and tell the server how to recognize species affiliation. For example, if your samples all have names like "Genus_species_labNumber", you can tell it to use the second part of the sample name as divided by underscores to assign alleles to species. (If your sample names are chaotic you are out of luck but that is your own fault.) You can also simply load the resulting species/allele table into the species tree tab of your choice. Once you have entered gene trees and species/allele table into a method tab, simply click go and wait a few seconds.
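
The logic behind assigning alleles to species from systematic sample names is simple enough to sketch in a few lines of Python, in case you want to do the same thing offline. The function names are hypothetical, and the row layout (species, number of alleles, then the allele names) is my assumption of what such a table looks like; compare it with the server's own output before feeding it to any program.

```python
from collections import defaultdict

def allele_table(sample_names, field=1, sep="_"):
    """Group sample names by one separator-delimited part of the name."""
    table = defaultdict(list)
    for name in sample_names:
        table[name.split(sep)[field]].append(name)
    return dict(table)

def format_rows(table):
    """One row per species: species, allele count, allele names (assumed layout)."""
    return [f"{species} {len(alleles)} {' '.join(alleles)}"
            for species, alleles in sorted(table.items())]

# Hypothetical samples named Genus_species_labNumber; field 1 is the species part.
samples = ["Aus_bus_1", "Aus_bus_2", "Aus_cus_7"]
rows = format_rows(allele_table(samples))
```

With chaotic sample names there is no field to split on, which is why consistent naming from the start saves so much work later.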

Ease of use
Very simple and user-friendly.

Limitations
Very flexible - the methods work with multiple samples per species and only one gene as well as with one sample per species and multiple genes. However, NJst is the only one that accepts unrooted gene trees. As pointed out by a commenter below, the resulting species tree branch lengths are meaningless except in the case of MP-EST.

Rating
STRAW has much going for it: it is a great service to the community, very user-friendly, and fast. The SpeciesAlleleTableCreator is also helpful if you don't even want to use the methods available on STRAW itself. You can use it to make such a table from your gene trees and then reformat it with a bit of search & replace-fu to produce an assignment table for STEM, ASTRAL or PhyloNet to save you a lot of work. Finally, the methods offered by STRAW itself are very flexible with regard to the number of loci and the amount of missing data. However, performance seems to vary greatly depending on dataset and between the three methods available on the server. For one locus with multiple alleles per species I found that NJst gave a very reasonable species tree. In another case with six loci, NJst gave me a ridiculous tree although starBEAST and ASTRAL were able to produce meaningful results. For three loci, one of which had a lot of missing data, the results from all STRAW methods tended to get bizarre. I would thus not recommend these methods if missing data is a big issue in your dataset (use MDC instead) but they are perfect for getting quick results in a user-friendly environment if your dataset is unproblematic. 7/10

ASTRAL - Accurate Species TRee ALgorithm

Website: https://github.com/smirarab/ASTRAL/ and https://github.com/smirarab/ASTRAL/tree/multiind (multiple alleles per species version)

Methods offered
Only one, which is coalescent-based in its logic; but as it isn't likelihood-based or Bayesian I assume one might call it algorithmic (?).

Speed
Fast

Procedure
Prepare a text file with all gene trees in Newick format. Include branch lengths, but it is possible that support values need to be removed. If you have multiple alleles per species, prepare a text file with an allele assignation table using the same format as produced by STRAW (see above). Then open a terminal, change to the correct folder, and type:
    java -jar nameofcurrentASTRALversion -i genetrees_filename -o output_filename -a allele_table_filename

Ease of use
Command line use similar to RAxML and PhyloNet, thus presumably not to everybody's liking, but simple enough compared to other, more finicky programs.

Limitations
The publication of the method and the constant refrain of "we have a version for multiple alleles per species but it is still experimental" (paraphrase) make it sound as if the developers came purely from a gene tree discordance perspective. As such, it is highly probable that even the multiple alleles per species version of ASTRAL will not work with only one gene, but to be frank I haven't tried. Otherwise it seems to be very flexible, accepting missing alleles and unrooted gene trees.

Rating
Fairly easy to use as long as you are comfortable with the command line, easy to install and highly portable, no arcane priors, fast, with a robust coalescent approach ... As far as I can tell with only one use case so far there is a lot to like here. However, it is worrying that analyses with multiple alleles per species are treated like a weird exception when they are an everyday occurrence at lower taxonomic levels, and there appear to be no meaningful branch lengths (?). Thus at the moment only 7/10

STEM - Species Tree Estimation using Maximum likelihood

Website: http://www.stat.osu.edu/~lkubatko/software/STEM/

Reference
Kubatko LS, Carstens BC, Knowles LL, 2009. STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics 25: 971-973.

Methods offered
Only one, which is Likelihood-based

Speed
Average (e.g. 20-30 min on a fast work PC for >40 species and three loci).

Procedure
Place your individual gene trees in the right folder. Set up a "yaml" file containing instructions for the program as explained in the documentation. Start the program from the command prompt/console. You have to enter a reasonable estimate of how fast the various genes in your analysis are evolving relative to each other.

Ease of use
Simple but the authors warn that you need to be quite careful about formatting the input file correctly. I found that to be oh so true.

Limitations
Very flexible, accepts missing data and multiple alleles per species.

Rating
I finally got STEM-hy to do an analysis for me and was impressed how fast it was (for a likelihood method, that is!). The method is also very flexible and returns a species tree with branch lengths. Unfortunately, it is really not very user-friendly because the program is so touchy about the input files and the error messages are not very helpful. Turns out that sample names may not start with numbers, but the error message was a bit unclear on what went wrong. The other problem is that on the one occasion that I used STEM-hy the resulting species tree was absurd (polytomies, many zero length branches, unrealistic topology). This may have been due to something else that the authors warn of in the documentation, i.e. a poorly resolved gene tree in the dataset. The results may get better the more gene trees one has and the better resolved they are, but that simply means that the method is not as flexible as, for example, MDC. 5/10

BUCKy - Bayesian Concordance Analysis

Website: http://www.stat.wisc.edu/~ane/bucky/index.html

Methods offered
Only one, Bayesian Concordance Analysis

Speed
Slow

Procedure
Conduct individual phylogenetic analyses for each gene in MrBayes, summarize them with one tool of the BUCKy package, then conduct Bayesian species tree analysis with the other. Everything is done from the command prompt/console.

Ease of use
BUCKy itself is simple but of course you need to know how to use MrBayes first.

Limitations
Bizarrely, only one sample/allele is allowed per species, making BUCKy pretty pointless for cases where the problem is more one of incomplete lineage sorting than one of gene tree discordance. Obviously then, you also need to have at least two genes in your analysis. (Note that I have not checked if anything has changed since my first attempt to use BUCKy in 2013.)

Rating
BUCKy was completely useless for me because I had multiple samples per species in the dataset I wanted analysed. I understand that the software is very popular with people who are doing deep phylogenetics with many different loci, and it is definitely easy to use if you are already familiar with command line interfaces and MrBayes. Still, I consider its limitations to be quite crippling. 4/10

BEST - Bayesian Estimation of Species Trees

Website: http://www.stat.osu.edu/~dkp/BEST/introduction/

Methods offered
Only one, Bayesian Estimation of Species Trees

Speed
Slow

Procedure
BEST uses its own variant of MrBayes to conduct the analyses. You have to set up a nexus file as per instructions, with a few extra options that MrBayes would not understand, and run what looks much like a normal MrBayes run. After that, an additional sumt command will produce the species tree. Multiple samples are assigned to their species with taxset commands. Everything is done from the command prompt/console.

Ease of use
Fairly simple if you know how to use MrBayes already; steep learning curve ahead if you don't.

Limitations
Apparently does not work with only one gene, or at least that made the program crash when I tried it. But at least you can have several samples/alleles per species.

Rating
Less flexible than MDC and GLASS but more so than BUCKy, and fairly easy to use for somebody who is already using MrBayes. Unfortunately it gave me a species tree that was very odd while the exact same data produced a much more reasonable result with BEAST. In addition, BEAST is also considerably faster. 5/10

starBEAST - Bayesian inference of species trees from multilocus data

This is good old BEAST using an additional template.

Website: http://beast2.org/

Methods offered
Only one, Bayesian inference using the coalescent model, but note that there is a different BEAST template for species trees from SNP data called SNAPP.

Speed
Obviously slow compared to distance or algorithm based methods but actually very fast for a Bayesian method; it worked through the same dataset as BEST in a fraction of the time although I set the same chain length.

Procedure
Tutorial available here. The package includes a very helpful GUI-driven program called BEAUti that makes it easy to set up a BEAST input file with all the right settings if you already have your data in nexus files (one for each gene). After BEAST has done its job, the results should be examined with the separate program Tracer, which is not included in the package, and finally they need to be summarized with TreeAnnotator, which is.

Ease of use
Surprisingly easy due to the GUIs. Unfortunately the documentation does not really explain how a mapping file should be formatted but BEAUti has a clever function for deducing species names from the sample names. The main problem is getting models and priors right - more on that below.

Limitations
Apparently starBEAST should work with just one gene, and it certainly allows several samples/alleles per species. It does not officially allow missing data for a species but can be tricked into doing an analysis anyway by adding an all-missing data dummy allele. However, in that case the analysis and support values will obviously suffer.
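
The dummy allele trick amounts to adding a sequence consisting only of missing-data symbols to the alignment of the gene in question. A minimal Python sketch, assuming the alignment is held as a simple name-to-sequence dictionary and that "?" is the missing-data symbol used in your nexus files (the function and sample names are hypothetical):

```python
def add_dummy_alleles(alignment, missing_samples, missing_char="?"):
    """Return a copy of the alignment with all-missing sequences added."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences in an alignment must have the same length")
    length = next(iter(lengths))
    padded = dict(alignment)
    for name in missing_samples:
        # Never overwrite real data if the sample is already present.
        padded.setdefault(name, missing_char * length)
    return padded

# Hypothetical locus in which species Aus_cus could not be sequenced.
locus = {"Aus_bus_1": "ACGTAC", "Aus_bus_2": "ACGTAT"}
padded = add_dummy_alleles(locus, ["Aus_cus_7"])
```

The padded alignment can then be written back to a nexus file with whatever tool you normally use. Keep in mind that the dummy sequence carries no information, so the placement of that species rests entirely on the other loci.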

Rating
Extremely easy to use, surprisingly fast for a Bayesian method, and generally resulting in very reasonable looking species trees. Crucially and in marked contrast to nearly all other methods it produces meaningful branch lengths, which are needed for many downstream analyses in biogeography or studies of rates of evolution. The major downside is that of all Bayesian approaches, only more so: the need to pull a gazillion priors and models out of your nether regions that you cannot possibly justify. How, for example, should I know what a reasonable prior is for effective ancestral population sizes five million years ago? What is more, in a recent use case I struggled with getting the substitution models right, as jModelTest suggested models that turned out to be overparameterised after starBEAST failed to get decent ESS for prior and posterior of a six taxon tree (!) even after one billion (!) iterations. So if you need something simple and easy to understand, or if missing data are an issue, or if your dataset is huge, you may do best with parsimony-based software or a fast program like ASTRAL. But if you need meaningful branch lengths, starBEAST is pretty much the only choice. It also seems to be the most user-friendly of the Bayesian methods. 9/10

4 comments:

  1. Here is an interesting blog, which gets beyond my pay grade most of the time.
    http://phylonetworks.blogspot.com/

  2. In case somebody finds this and wonders: Yes, I know about SNAPP and iGTP now but do not have the time to try them out. I will update the post when I have.

  3. Thanks for the useful review. I'd add a couple of remarks:

    Another important limitation of STEM is that gene trees have to fit a molecular clock.

    Also, in STRAW, only the MP-EST method is supposed to estimate meaningful branch lengths. This could well be the cause of your odd results.

  4. Thanks for the input on STRAW, I was not aware of that. When there are branch lengths I tend to assume that they mean something...
