As I have mentioned before, there are four main ways of inferring phylogenetic trees of evolutionary relationships:
- Distance/clustering analysis. This is not really a phylogenetic analysis in the strict sense, because it merely clusters terminals by their similarity, but on the plus side clustering is always extremely fast. There are several programs that can do it, including good old PAUP and MEGA.
- Likelihood analysis. Simplifying a bit, one could say it searches for the tree with the best log likelihood score given a model of sequence evolution and the data. Again there are several programs available to do this kind of analysis, including PAUP, MEGA and PHYLIP. Calculating likelihood values across large phylogenetic trees is computationally intensive, so these analyses can take quite some time for larger datasets. This is why somebody wrote the software RAxML, which is designed to do complex likelihood searches with seemingly ridiculous speed by cutting a few corners.
- Bayesian phylogenetics. This approach estimates the posterior probability of phylogenetic relationships with a Markov Chain Monte Carlo (MCMC) method. Standard software packages for this are MrBayes and BEAST. If you want a quick answer, you are out of luck though, because MCMC always takes time.
- Parsimony analysis. The logic here is to find the tree with the lowest number of character changes along the branches, under the assumption that, all else being equal, the simplest explanation is the best. It is often considered less sophisticated than the previous two approaches, but it comes with fewer assumptions; I like it that I know where the computer has its hands, so to speak. Once more PAUP, MEGA and PHYLIP implement parsimony searches, but they are fairly slow for larger datasets.
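To make "lowest number of character changes" a bit more concrete, here is a minimal Python sketch of how changes can be counted on one given tree, using the classic Fitch approach for unordered characters. It is only an illustration of the principle, not what any of the programs above actually do internally. The function names and the nested-tuple tree representation are my own assumptions, and gaps or ambiguity codes are simply treated as ordinary states.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes for ONE character on a rooted binary tree.

    tree   -- nested 2-tuples with terminal names (strings) at the tips
              (hypothetical representation chosen for this sketch)
    states -- dict mapping terminal name -> observed state for this character
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # a terminal: its state set is what we observed
            return {states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:                         # children agree on something: no change needed here
            return shared
        changes += 1                       # children disagree: one change on this part of the tree
        return left | right

    visit(tree)
    return changes


def tree_length(tree, matrix):
    """Sum the per-character change counts over all columns of an aligned matrix."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in matrix.items()})
        for i in range(n_chars)
    )
```

A parsimony search then amounts to trying many different trees and keeping those with the smallest tree_length, which is why the number of terminals matters so much for run time.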
Sadly, TNT, the parsimony program this post is about, has a few downsides. First, its input and output formats are rather idiosyncratic. Second, it has a GUI only on the Windows version but not on Mac or Linux, so that you will have to use command line and scripting on the latter two systems. Third, the documentation is unsystematic and unhelpful, making it very hard to figure out how to effectively use the command line and scripting. Actually, that is not quite true; documentation on scripting per se seems to be okay, it is rather the simple standard analyses that aren't explained anywhere.
This is why I am writing this post. I have just done a simple analysis, and I want to spare others the same investment in time and frustration, and I want to be able to look up my own post in the future, especially should some time pass before I use TNT again.
So, how do we find things out? A vaguely manual-like HTML file is distributed with the software, and you will also find something called TNT Scripts - General Documentation on its website. But again, while both of these tell you how to write loops and if-then clauses in complex scripts, they do not provide a systematic overview of the commands you need to run simple TNT analyses on Linux. Wouldn't it be nice if we could just have something like the PAUP command reference manual and text-search it for the explanation of the command that changes the maximum number of trees to be retained?
The next best thing to do is to fire up TNT and to look at the help function, so let us start there. All the following assumes Linux, by the way, but except for starting the program it should be the same on Mac. You turn the program on by entering ./tnt in its folder; if it doesn't work you may have to sudo it. The prompt should change to tnt*>. You can now simply enter TNT commands, once you know them that is.
But before we consider these commands, a general remark is in order: sometimes you will enter what you think is a perfectly sensible command only to be greeted by a new prompt named after the command in question, seemingly asking you for more parameters. In that case, you will most likely simply have to enter a semicolon (;), because that is the character that signifies the end of a command. Often, however, TNT seems happy without the semicolon at the end; I have no idea what makes the difference.
Anyway, type help ; to get a list of all the commands. Now that you know what they are called, you can get information on each of them individually by typing help commandname. This plainly unsatisfying approach is partly how I figured out what follows; in other cases, some googling was of assistance.
Okay, let's think about an analysis. First prepare a data matrix, and to be safe it should probably have the TNT format: the keyword xread on the first line, then a line with the number of characters followed by the number of terminals, then the matrix itself, and finally a semicolon on its own line. In other words, you can just take a PHYLIP file as you would use for RAxML, swap number of characters and number of terminals around, and add xread and the final semicolon. (Still, wouldn't it be nice if all phylogenetic programs used the same matrix format?)
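If you have many PHYLIP files, the swap can be done mechanically. What follows is a minimal Python sketch of the conversion just described, not an official tool. It assumes a relaxed, sequential PHYLIP file of the kind RAxML accepts, with one terminal per line and the name separated from the sequence by whitespace; the function name phylip_to_tnt is my own invention.

```python
def phylip_to_tnt(phylip_text):
    """Convert relaxed sequential PHYLIP text into a TNT xread block.

    Assumes one terminal per line: a name without spaces, whitespace, then the sequence.
    """
    lines = [line for line in phylip_text.splitlines() if line.strip()]
    # PHYLIP header: number of terminals first, number of characters second.
    n_taxa, n_chars = (int(value) for value in lines[0].split()[:2])

    rows = []
    for line in lines[1:]:
        parts = line.split()
        if len(parts) < 2:
            raise ValueError(f"no sequence found on line: {line!r}")
        name, sequence = parts[0], "".join(parts[1:])
        if len(sequence) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, found {len(sequence)}")
        rows.append(f"{name} {sequence}")

    if len(rows) != n_taxa:
        raise ValueError(f"header promises {n_taxa} terminals, found {len(rows)}")

    # TNT wants characters first, terminals second, wrapped in 'xread' ... ';'.
    return "\n".join(["xread", f"{n_chars} {n_taxa}", *rows, ";"]) + "\n"
```

Read your PHYLIP file into a string, pass it to the function, and write the result to a file such as yourdatamatrix.tnt. Interleaved PHYLIP files would need a different reader.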
Many people seem to use TNT for parsimony analyses of small morphological datasets, which seems a bit like using a tactical nuke to kill a fly. My scenario is generally that of a DNA sequence dataset with dozens to hundreds of terminals and thousands of characters. Before we can import the data file, we will therefore have to make a few preparations. First, TNT is set to a ridiculously low memory usage and will thus generally throw an error if you attempt to import a large matrix. Enter mxram 200 to set memory usage to 200 MB or whatever is realistic and necessary. Second, tell the program to expect DNA data by entering nstates DNA, then nstates NOGAPS to tell it to treat gaps as missing data as opposed to a fifth state.
Now read the data file with procedure filename. Only from now on are we allowed to set the number of equally parsimonious trees to be retained during search. The default is 100, but that is way too low. In fact in my experience this value has a much stronger influence on whether you will find the best trees than the number of search replicates, because a higher number of trees retained gives the program more to swap on. Rectify the situation with hold 1000 or whatever number you prefer.
From here I am mostly adapting the information helpfully provided a few years ago by another blogger, Matthew Vavrek (thanks!). Start logging the analysis output into a text file with log filename. To do a simple, preliminary search, type mult; if you want more replicates, use mult=replic 10 for example. The resulting trees that are now in memory will be alright but still not the best, so they will be used in a more thorough search with bbreak=tbr. This latter search takes time and will, unless there is really only a very low number of equally parsimonious trees, probably fill up the number of trees to be retained. Because you want the strict consensus of all those trees you have, you enter nelsen *.
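In case the term is unfamiliar: a strict consensus keeps only those groups that occur in every single one of the equally parsimonious trees. The following Python sketch illustrates that idea and nothing more; it is not how TNT computes it. It assumes rooted trees written as nested tuples of terminal names, and both function names are hypothetical.

```python
def clades(tree):
    """Return the set of clades (as frozensets of terminal names) in a rooted tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):                      # a terminal
            return frozenset([node])
        members = frozenset().union(*(visit(child) for child in node))
        found.add(members)                             # every internal node defines one clade
        return members

    visit(tree)
    return found


def strict_consensus_clades(trees):
    """Clades that are present in all of the given trees."""
    return set.intersection(*(clades(tree) for tree in trees))
```

With many conflicting trees in memory, the set that survives can get small, which is why a strict consensus is often poorly resolved.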
At this stage, we have got some real results and want to save them. Unfortunately, TNT is set to save taxa only as numbers, making it impossible to interpret the trees it saves. So first we set taxname=, which cryptic command tells it to save taxon names.
To obtain trees in the usual Newick format used by nearly all phylogenetic software on the planet, we have to export the results into Nexus format using export * filename; here the asterisk is crucial to get the actual trees, otherwise you will only export the data matrix.
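If another program wants bare Newick strings rather than a Nexus file, they can be pulled out of the tree statements afterwards. Here is a minimal Python sketch, assuming the exported file contains ordinary Nexus tree statements of the form "tree name = (...);" with one statement per line and no translate block. I have not checked this against every TNT version, so treat the pattern as an assumption; the function name is made up.

```python
import re

# One Nexus tree statement per line: 'tree NAME = [optional comments] NEWICK;'
TREE_STATEMENT = re.compile(
    r"^\s*tree\s+(\S+)\s*=\s*(?:\[[^\]]*\]\s*)*(.+?;)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def newick_trees(nexus_text):
    """Map each tree name in a Nexus trees block to its Newick string."""
    return {match.group(1): match.group(2) for match in TREE_STATEMENT.finditer(nexus_text)}
```

Bracketed comments such as a rooting flag in front of the tree are skipped; anything more exotic in the file will need a proper Nexus parser.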
Update 25 Jan 2017: To export the trees with branch lengths, use export > filename; thanks to Matt Buys for the information.
Finally, branch support. There are scripts out there to do Bremer support (Decay Indices), but for DNA sequence data bootstrapping or jackknifing is more common. The command here is simply resample replications 200 for two hundred bootstrap replicates.
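For those who have not met bootstrapping before: each replicate builds a pseudo-matrix of the same size as the original by drawing characters (alignment columns) at random with replacement, and the tree search is then repeated on that pseudo-matrix. The Python sketch below shows only that resampling step, as a general illustration and not as TNT's implementation; the function name and the dictionary layout are my own assumptions.

```python
import random


def bootstrap_matrix(matrix, rng):
    """Return one bootstrap pseudoreplicate of an aligned matrix.

    matrix -- dict mapping terminal name -> aligned sequence (all the same length)
    rng    -- a random.Random instance, passed in so that runs can be reproduced
    """
    n_chars = len(next(iter(matrix.values())))
    # Draw as many column indices as there are characters, with replacement.
    columns = [rng.randrange(n_chars) for _ in range(n_chars)]
    # The same columns are used for every terminal, so characters stay intact.
    return {name: "".join(seq[i] for i in columns) for name, seq in matrix.items()}
```

The support value for a group is then simply the percentage of replicates in whose trees that group turns up.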
Leave the program by typing quit. To script all the above commands, write them one after the other into a text file (with a space and a semicolon at the end of each line!) and run it from TNT with procedure runfilename.
mxram 200 ;
nstates DNA ;
nstates NOGAPS ;
procedure yourdatamatrix.tnt ;
log yourlogfilename.txt ;
hold 1000 ;
mult ;
bbreak=tbr ;
nelsen * ;
taxname= ;
export > yourfilename.nex ;
resample replications 200 ;
export - yourfilename_bs200.nex ;
quit ;
Cheers!
(Updated 6 May 2015 to add the taxname= command and 25 Jan 2017 to include information on branch length export.)