Anybody who happens upon my species tree methods post will see that, despite being more of a parsimony guy myself, I have a lot of praise for BEAST. It is fast (for a Bayesian method) and very user-friendly. "Normal" BEAST is for standard gene tree phylogenetics; *BEAST or starBEAST is the add-on for species tree analyses based on multiple independent genes; and the still fairly novel SNAPP is the add-on for species tree analyses based on Single Nucleotide Polymorphism (SNP) data.
With genomic sequencing, SNPs are only going to become more important for the study of closely related organisms. If you have species that are very recently derived, any individual gene sequence is probably going to be extremely similar between them. This means that the approach of inferring species trees from the reconciliation of multiple gene trees is unlikely to work: instead of gene trees you are likely to get gene "combs", simply unresolved relationships.
There will still be thousands of little individual mutations differentiating your study specimens, but they will be distributed all across their genomes. This is why they are called SNPs: Single Nucleotide Polymorphisms each surrounded by conserved sequence regions.
The idea of SNAPP is now to use the SNPs from multiple samples per species directly to infer the species phylogeny, without any intermediate steps like alignments or gene trees. For this, it uses the coalescent model and the usual Bayesian Markov Chain Monte Carlo approach. This sounds very attractive, especially after the good experiences with BEAST, and also very rigorous.
Unfortunately, so far my attempts at using SNAPP have been rather frustrating. There are three main issues:
- SNAPP appears to be rather capricious as to whether it will run at all or whether it will fall over. The only machine on which I can get it to run consistently is our family computer, a Linux machine. On the Windows machine at work it is also very consistent, in that it always throws error messages and crashes.
- BEAST in general is known to have a problem with missing data, although it can at least be tricked into accepting an allele missing for a species. Still, the same problem applies to SNAPP; a colleague had to throw out most of his data and samples to get missing data below ca. 5% before he could do an analysis. In my dataset that is just not possible, because I'd be left with too few SNPs.
- Finally, SNAPP is really. Really. Really. Slow. A hundred SNPs, no problem, I can run a decent analysis over a day. Five hundred SNPs? Forget it. Our high performance computing cluster at work did 1,000 generations over a few hours, and I need it to do at least ten million generations; do the math. At home I just tried a dataset reduced to 200 SNPs, and it seems as if it will finish in three months. All that sounds like First World Problems, but the thing is, the whole point of genome-wide SNPs is that you have thousands of them. A SNAPP analysis of my whole dataset is just not going to happen, even if I did not have the missing data issue on top of it.
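
To spell out the "do the math" bit from the last point, here is a rough back-of-the-envelope sketch in Python. It assumes that run time scales linearly with the number of generations. Because "a few hours" is not an exact figure, it plugs in three hours purely as a stand-in. The function name `estimated_days` is made up for illustration.

```python
def estimated_days(target_generations, benchmark_generations, benchmark_hours):
    """Linearly extrapolate wall-clock days from a short benchmark run."""
    hours_per_generation = benchmark_hours / benchmark_generations
    return target_generations * hours_per_generation / 24


# Assumption: "a few hours" is taken as 3 hours, for illustration only.
days = estimated_days(
    target_generations=10_000_000,
    benchmark_generations=1_000,
    benchmark_hours=3.0,
)
years = days / 365.25
print(f"{days:,.0f} days, or about {years:.1f} years")
```

Under that three-hour assumption the estimate comes out at roughly 1,250 days, i.e. well over three years on the cluster. The exact number shifts with whatever "a few hours" really was, but not by enough to make the analysis practical.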