I had hoped to write this up earlier, but there we are. On 9 February I went to a presentation by Robert Lanfear, the author of, among other things, PartitionFinder, software that assists with the selection of models of nucleotide evolution and, as the name implies, with dataset partitioning. His talk gave an overview of where he sees the field of (model-based) molecular phylogenetics standing, its problems, and potential solutions.
I will structure my notes on his talk and my own thoughts about it as a kind of numbered list, for easier cross-reference, with no claim to having written this up in a particularly beautiful way.
1. The problem
Lanfear started out with the observation that the current practice in molecular phylogenetics works well, but increasingly less so. What he means is that if a phylogenetic question has a clear and strongly supported answer, then even cutting a lot of corners and making some mistakes will still produce that correct answer.
Now, however, those "low-hanging fruit" have largely been harvested, and what is left are really hard-to-resolve relationships. In those cases small differences in how the analysis is done will lead to different answers (see point 2 below). An example he referred to at least twice during the talk was the relationship between crocodiles, birds and turtles; another was the relationships between the major clades of birds.
What I find interesting here is how people set their priorities. Apparently there are a lot of researchers who care very deeply about, for example, whether crocodiles are sister to birds or to turtles. Honestly I couldn't care less, and the same would be true for comparable cases in plant phylogenetics. What phylogenetics is about for me is to identify monophyletic groups for classification and to provide phylogenies for downstream analyses in biogeography and evolutionary biology. For the former, the most relevant observation is that turtles, crocs and birds form three reciprocally monophyletic groups, but if we don't know their relationships to each other we can simply classify them next to each other at the same rank, problem solved. For the latter, there are ways of taking uncertainty into consideration, problem solved.
In other words, where I see need for more work in the field is in the many clades of plants, insects, nematodes, mites, etc., that have so far not been well studied, as opposed to re-analysing over and over and over the same few charismatic but overstudied groups of vertebrates. Each to their own I guess, but the thing is that all the considerations that follow assume first that being unable to decisively resolve every single node in a phylogeny is at all important to anything or anybody. I am just not sure I see that.
2. How do we know that the current practice is working less well now?
Partly because people get very different results with high confidence. Lanfear called this the "new normal": large amounts of genomic data give strong statistical support for contradictory results.
This is a very good observation that will hopefully also be convincing to those who like to stress our inability to know the truth and insist that we can only ever build hypotheses.
3. The current best practice for genomic sequencing
Data cleaning of genomic data is crucial because everything is full of microbes. Even DNA extraction kits are contaminated, so never do genomic sequencing without a negative control.
I must admit that I have not always followed that advice, but with amplicon sequencing or target enrichment, for example, it may not be that relevant, given that non-targeted DNA is unlikely to amplify and you will know if a sequence comes totally out of left field. The example Lanfear used, however, was a de novo genome assembly in which contaminants were presented as evidence of horizontal gene transfer. That would have been embarrassing.
He also argued for inclusion of a positive control, as in adding a known genome to check for contamination percentage. That does of course assume that you always have a known genome in your study group, which is unlikely to be the case in most groups.
Finally, there should be biological and technical replicates, probably the sampling guideline that the largest number of people are aware of and follow.
4. The current best practice for assembling the data matrices
Remove parts of the alignment that cannot be trusted. Lanfear mentioned the software GBlocks, which I personally have never used. However, he cited a paper that argues it doesn't seem to help (Tan et al. 2015, Syst Biol 64: 778) and seemed to advise against using it. His own preference is to pragmatically make an automated alignment and then check by eye and delete non-homologous sites manually.
5. Examining the individual gene trees
Next comes paralog detection, if that is relevant to the data type. One of the most stunning observations Lanfear mentioned was that in multi-locus species tree analyses some loci may have massive leverage on the results. He cited a case in which two undetected paralogs made the difference between 100% support for one and 100% support for the other answer.
His suggested positive control here: be suspicious if a gene tree does not show a very well established clade. Keep that one in mind as it will come up again.
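That gene-tree sanity check is easy to automate. The sketch below uses nested tuples as a stand-in for a real tree object (a real pipeline would parse Newick with a library such as dendropy or ete3); the taxon names are hypothetical, echoing the turtle/croc/bird example from the talk.

```python
# Sanity check: does a gene tree recover a well-established clade as
# monophyletic? Minimal sketch; leaves are strings, internal nodes tuples.

def leaves(node):
    """Collect the leaf names under a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result

def is_monophyletic(tree, clade):
    """True if some node's leaf set equals the expected clade.
    On an unrooted tree, the complement of a node defines the same
    split, so it counts too."""
    all_leaves = leaves(tree)
    def walk(node):
        below = leaves(node)
        if below == clade or all_leaves - below == clade:
            return True
        if isinstance(node, str):
            return False
        return any(walk(child) for child in node)
    return walk(tree)

# Hypothetical gene tree with an outgroup:
gene_tree = ((("turtle", "croc"), ("bird1", "bird2")), "outgroup")
print(is_monophyletic(gene_tree, {"bird1", "bird2"}))   # expect True
print(is_monophyletic(gene_tree, {"turtle", "bird1"}))  # expect False
```

If the second check were the established clade, that gene would deserve a closer look, for example for undetected paralogy.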
6. Multi-locus analysis versus concatenation
We are talking phylogenomics here, so there are always multiple independent loci. A full Bayesian analysis of gene trees and species tree together in StarBEAST is best but limited to a maximum of about 50 species. I wasn't aware of that ballpark number, so this is good to know. Interestingly, the next best thing is concatenation, because according to Lanfear the short-cut methods that use previously inferred gene trees to infer the species tree in a second step (ASTRAL et al.) perform worst. Not sure how easy it will generally be to convince peer reviewers of this.
7. Model selection
Not many people are aware that we have to guess a topology to even do an alignment, and also to do model selection. Then we co-estimate all model parameters at the same time as the final topology is inferred.
We may need a separate model for each codon position and gene, or for stems vs. loops in rDNA; even for only three genes, the number of possible partitioning schemes is already huge. There is also a trade-off between having enough parameters to describe the data adequately and having so many that they can no longer all be estimated reliably. Here cometh PartitionFinder to help with that. However, as the author of that software, Lanfear himself stresses that thinking carefully about the data may be better than using the automated approach.
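To make the combinatorial explosion concrete: the number of ways to group n data blocks into subsets that share a model is the Bell number B(n), which can be computed with the Bell triangle recurrence.

```python
# Number of possible partitioning schemes for n data blocks = Bell number
# B(n), computed via the Bell triangle: each row starts with the last
# entry of the previous row, and each entry adds its left neighbour to
# the entry above it.

def bell(n):
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]

# Three protein-coding genes split by codon position give 9 data blocks:
print(bell(9))   # 21147 possible partitioning schemes
```

With nine blocks there are already over twenty thousand schemes, which is why an exhaustive search is hopeless and PartitionFinder uses heuristics instead.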
He was what I cannot help but call surprisingly cynical about how little we know about model selection and alignment.
8. Tree inference
Be aware that all software has bugs and limitations. Lanfear cited a few examples including a known but so far unresolvable branch length bug in RAxML (10x branch length inflation in 5 of 34 datasets tested). He also said that RAxML does not implement linked branch lengths across parts of the partition, and that few people were aware of that. Me neither.
At any rate he suggested to use more than one software and compare, as a "sanity check". His suggestions for likelihood were RAxML, PhyML, and IQ-tree; for Bayesian phylogenetics obviously MrBayes and BEAST.
Parsimony seemed to be The Method That May Not Be Named, although there is a long tradition in the area I am working in of running at least Bayesian and parsimony analyses and then perhaps also likelihood for comparison. Indeed, if I remember correctly, the word parsimony was mentioned only once, at the beginning of the talk, and in the context of something like "parsimony also makes assumptions". Hardly anybody would doubt that; the arguments of parsimony advocates appear to be mostly epistemological (I have discussed before why that doesn't convince me personally) and along the lines of modelling making too many and/or unjustified assumptions, whether that is true or not.
From my own perspective as a methods pragmatist who happily uses all of them as long as they are a good match for the data and computationally feasible, I was once more surprised that a likelihood phylogeneticist like Lanfear explicitly mentioned Neighbor Joining as perfectly fine, something that I had seen previously in that BMC Evolutionary Biology editorial. I am sorry to say that I don't really get it. It seems like saying that you shouldn't use your kitchen knife for emergency surgery because it wasn't properly sterilised, but the muddy shovel from the garden shed will do in a pinch.
9. Special considerations for Bayesian phylogenetics
Keep an eye on sampling and convergence using software such as Tracer and RWTY; the effective sample size (ESS) needs to be > 200 so that samples are independent enough. None of this should be news to anybody who is using Bayesian phylogenetics, one would hope, but I haven't tried RWTY so far.
Two things Lanfear mentioned were less familiar to me, unsurprisingly given that I am not really a Bayesian. First, in theory Markov chain Monte Carlo only works if run for infinite time, but it "works in practice". Second, apparently there is no good way yet of calculating the ESS for tree topology or convergence, but "RWTY helps".
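For readers unfamiliar with the ESS idea: N correlated MCMC samples count as roughly N / (1 + 2 × sum of autocorrelations) independent ones. Tracer and RWTY use more careful estimators; the sketch below, with made-up numbers, just illustrates why a poorly mixing chain has a much lower ESS than an independent sample of the same length.

```python
# Rough ESS estimate: divide the chain length by the integrated
# autocorrelation time, truncating the autocorrelation sum once it
# drops below zero (a crude cut-off; real tools do this more carefully).

import random

def ess(samples):
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    if var == 0:
        return float(n)
    acf_sum = 0.0
    for lag in range(1, n):
        acov = sum((samples[i] - mean) * (samples[i + lag] - mean)
                   for i in range(n - lag)) / n
        rho = acov / var
        if rho < 0:        # autocorrelation has died out into noise
            break
        acf_sum += rho
    return n / (1 + 2 * acf_sum)

random.seed(1)
iid = [random.gauss(0, 1) for _ in range(2000)]
# A "sticky" chain: each value is mostly the previous one plus noise
chain = [0.0]
for _ in range(1999):
    chain.append(0.95 * chain[-1] + random.gauss(0, 1))
print(ess(iid) > ess(chain))  # the sticky chain has a far lower ESS
```

This is also why "run it longer" is the standard cure: ESS scales with chain length once the chain is mixing at all.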
10. The way forward
Lanfear's hopes for improving molecular phylogenetics in the future are based on what he called "integrated analyses". They include trying to infer the model of evolution at the same time as tree topology.
Next there is the need for "better" models, e.g. non-reversible ones, which he mentioned as coming soon to IQ-tree and PartitionFinder, and different models for different parts of the tree, which, however, may be computationally too hard.
Stationarity of model parameters across evolutionary history, reversibility, homogeneity, and tree-likeness (no recombination) are model assumptions that are universal yet hardly ever tested. But tests are possible, and data that don't fit the model can then be removed. Most generally: instead of using all the big data, use only the data that can reliably be modelled. I found this really refreshing to hear, as many people seem to prefer throwing more data at a problem in the hope that it goes away.
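One such test that is genuinely easy to run is a chi-square check of compositional homogeneity across taxa (a check of this kind is implemented in, e.g., PAUP*). A minimal sketch on made-up sequences, not Lanfear's own procedure:

```python
# Chi-square statistic for homogeneity of base composition across taxa:
# rows = taxa, columns = A/C/G/T counts, expectations from row and
# column totals as in a standard contingency-table test.

def base_counts(seq):
    return [seq.count(b) for b in "ACGT"]

def composition_chi2(sequences):
    counts = [base_counts(s) for s in sequences]
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2   # compare against chi2 with (taxa - 1) * 3 d.f.

# Hypothetical alignment rows: two AT-rich taxa and one GC-rich taxon
seqs = ["ATATATATAATT", "ATTATAATATAT", "GCGCGCGCGGCC"]
print(round(composition_chi2(seqs), 1))   # 36.0 - clearly heterogeneous
```

A large statistic here flags exactly the kind of data that a stationary, homogeneous model cannot fit, and which one might then exclude.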
Finally, Lanfear suggested to conduct blinded analyses. He said that in many cases there was a hidden extra step after tree inference: is the tree the one we wanted? If yes, it gets published; if no, if it disagrees with preconceived notions, some people go back and tweak the data. Clearly this is problematic, but I was not the only one in the audience who thought back to what I have here written up as point number 5 and observed a bit of a self-contradiction.
I assume the answer is that there is a difference between being sceptical about a gene tree that contradicts really well-established knowledge and tweaking the results that your study really is about. To use a non-phylogenetic example, if you want to find out whether one brand of car can go faster than another, it is not okay to tweak the data after the results show that your favoured brand is the slower one; but it is okay to go back and check your data if they show one of them to have a speed of 50,000 km/h, because that just doesn't seem plausible.