Sunday, April 10, 2016

User-friendliness, or lack thereof, in scientific software

Having now posted about all those phylogenetics programs and written negative-sounding things like "not the most user-friendly", one might wonder (and I even did so myself) what exactly are my criteria for user-friendly, not only with regard to phylogenetics software but for special purpose scientific analysis tools in general, be they for population genetics or whatever.

This is actually not as easy as it seems, because it soon becomes obvious that there is not just one type of user, even in the same narrow field of science. That means that different interfaces have different advantages and disadvantages depending on what use we are thinking of. Let's start with how a user interacts with a program, using phylogenetics software as example cases.

I think there are three main solutions that people implement most of the time, and then there is a fourth rarely used option:

Graphical User Interface

The interface that would be the most obvious choice for being labelled 'user friendly' is surely the GUI. The user can double-click the program icon, a window opens, and then they can use the familiar menus to select Open Data File, Set Likelihood Criterion, Heuristic Search, and Export Tree. All very comfortable.

That being said, if this is the only option offered by the program it is not possible to integrate it into a scripted pipeline. And this means that while users doing simple, one-off analyses will be happy, those who need to call the program to do numerous analyses with varying input files or parameters will be annoyed. Ideally then a program should have a GUI but also be able to execute input files when called through the terminal.

MEGA is probably popular to a large degree because it is GUI based. The same may partly go for BEAST, because you can set up analyses using its helper program BEAUTi. TNT has a GUI only on Windows but not on Mac or Linux, which is why my post on using the command line has become one of the most-read of this blog. Conversely, PAUP never had a GUI on Windows but only on Macs, and even that only on the older ones until the processors were changed. Quite a few colleagues of mine kept an ancient Mac sitting in some corner of the lab for many years just so that they would be able to run PAUP with a GUI.

Typing commands in manually

Quite a few phylogenetics programs were designed to primarily run in the terminal and to expect manually entered commands; others have this option in addition to a GUI, or on some operating systems. So you open a terminal, navigate to the right folder, enter the name of the program, and then find yourself still in the terminal only 'inside the program', with a different command prompt. At this moment you need to know the right commands from the manual and type them in to get the program to do what you want. Your interaction with the program through the terminal may look like this (from my post on TNT):

tnt> mxram 200 ;
tnt> nstates DNA ;
tnt> nstates NOGAPS ;
tnt> procedure yourdatamatrix.tnt ;
And so merrily on...

Obviously this is the exact inverse of the previous option: There are whole groups of end users who will hear that this is how a program has to be used and walk off smartly in the opposite direction (perhaps to cuddle their ancient Mac) because they have no interest in learning what amounts to a programming language just to do a simple heuristic search for the best phylogeny.

On the other hand, these programs always have the option of reading in and executing in one go a text file containing a whole list of commands. And that has the great advantage that you only have to understand the standard commands once, and from that moment on you can re-use the previous script changing only the data matrix, the name of output files, and perhaps the model. Or when you are first starting, get a file with reasonable settings from a colleague and then do the same.
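To illustrate the re-use idea, here is a minimal Python sketch that fills a different data matrix name into the same list of commands. The four commands are the TNT lines quoted above; the function name `make_script` and the file names are made up for the example, and a real analysis would of course need more commands than these.

```python
from string import Template

# The command list quoted above; $matrix is the only part that
# changes from one analysis to the next.
TNT_TEMPLATE = Template(
    "mxram 200 ;\n"
    "nstates DNA ;\n"
    "nstates NOGAPS ;\n"
    "procedure $matrix ;\n"
)


def make_script(matrix_file: str) -> str:
    """Return the command list with the data matrix name filled in."""
    return TNT_TEMPLATE.substitute(matrix=matrix_file)


# Hypothetical data files; each gets the same settings.
for matrix in ["genes_a.tnt", "genes_b.tnt"]:
    print(make_script(matrix))
```

Each resulting text could then be saved to a file and handed to the program, so the settings only have to be worked out once.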

This is how TNT has to be used on any system but Windows, PAUP on any system except the older Macs, and the MrBayes versions that I am familiar with on all systems.

Command line parameters

The third option is again through the command line, but there is no 'inside the program'. Instead you type in its name and, on the same line, provide it with all the parameters it needs to know. Among phylogenetics software, the best known example is perhaps RAxML. The whole interaction with the program might be reduced to typing this into the terminal (example from its manual):

raxmlHPC -f v -R RAxML_binaryModelParameters.PARAMS -t RAxML_result.PARAMS -s alg -m GTRCAT -n TEST2

Each of the letters preceded by a hyphen is a parameter code, e.g. -n for the name of the output file(s), and the next element is then the value this parameter should have, i.e. a file name, the choice of an option, or a number.
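The code-then-value pattern can be made explicit with a small Python sketch that splits the example line into pairs. This is only an illustration of how to read such a line, not how RAxML itself parses its arguments; `parse_parameters` is a name I made up, and the simple rule used here would misread a negative number as a parameter code.

```python
import shlex


def parse_parameters(command_line: str):
    """Split a command line into the program name and a dict of code -> value."""
    tokens = shlex.split(command_line)
    program, rest = tokens[0], tokens[1:]
    params = {}
    i = 0
    while i < len(rest):
        token = rest[i]
        if not token.startswith("-"):
            raise ValueError(f"expected a parameter code, got {token!r}")
        # A code followed by another code (or by nothing) is a plain switch.
        if i + 1 < len(rest) and not rest[i + 1].startswith("-"):
            params[token] = rest[i + 1]
            i += 2
        else:
            params[token] = True
            i += 1
    return program, params


program, params = parse_parameters(
    "raxmlHPC -f v -R RAxML_binaryModelParameters.PARAMS "
    "-t RAxML_result.PARAMS -s alg -m GTRCAT -n TEST2"
)
for code, value in params.items():
    print(code, "=", value)
```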

The advantage here is that this is ideally suited for scripting, as the program can easily be called by other programs to do its job with given parameters. The downside is that the commands are generally even more arcane than those of the previous group of programs; while "hsearch" is a reasonably intuitive and thus memorable command for heuristic searches in PAUP, something like RAxML's "-f a" is rather harder to remember.
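As a rough sketch of what such scripting might look like, the following Python snippet assembles one call per alignment file. It uses only the parameter codes that appear in the example above (-s, -m, -n); a real run will most likely need further parameters from the manual. The helper names and file names are hypothetical, and by default nothing is executed, so the commands can be inspected first.

```python
import subprocess


def build_command(alignment: str, run_name: str, model: str = "GTRCAT"):
    """Assemble one call as a list of arguments (codes taken from the example above)."""
    return ["raxmlHPC", "-s", alignment, "-m", model, "-n", run_name]


def run_all(alignments, dry_run=True):
    """Build one command per alignment; only execute them if dry_run is False."""
    commands = []
    for alignment in alignments:
        # Use the file name without its extension as the run name.
        run_name = alignment.rsplit(".", 1)[0]
        command = build_command(alignment, run_name)
        commands.append(command)
        if not dry_run:
            subprocess.run(command, check=True)
    return commands


# Hypothetical input files; this only prints the commands.
for command in run_all(["locus1.phy", "locus2.phy"]):
    print(" ".join(command))
```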

The customer service call centre menu

This final option is the odd one out, as I have only seen it in PHYLIP so far. From customer service call centres we are probably all familiar with the concept: you start the program up, and it says something like "if you are calling about billing and invoices, press 1, if you are calling about deliveries, press 2, ...".

This is what comes up if you run one of PHYLIP's executables and have given it an input file. I am told that apparently there are other ways of using its modules, perhaps through command line parameters, but I haven't tried it thoroughly; at any rate, this call centre style menu was what I meant when I wrote that I found the interface off-putting. Other people may have other preferences, but personally I'd prefer any of the other three interfaces.

My own preferences

Okay, so obviously GUIs and command line interaction both have their advantages depending on the needs of the user, but what is user-friendly in one doesn't necessarily apply to the other. Consequently, what I consider user-friendly (or not) cannot be summarised in one neat sentence.

1. If there is a GUI, then the menus should be well-organised and intuitive, so that it is easy to find what one wants to do. The old Mac version of PAUP (or PaupUp, for that matter) works well in this sense because there is a very clear structure to the menus: Search options here, tree display and export here, and so on. Mesquite, on the other hand, is much less intuitive and suffers from very overcrowded, convoluted, and circumstance-dependent menus, admittedly perhaps unavoidably so due to the vast number of things it can do.

2. Default settings of a program should fit realistic end user needs. TNT on Windows has a GUI with a logical arrangement of options, but the problem is that certain default settings don't make a lot of sense to me. So I find myself importing a datafile only to run into an error message saying that the memory isn't big enough. The low memory default is rather odd because the main reason to use TNT would be big datasets that take too long to analyse on other software. I reset memory size, import again, and only then remember that the default is to treat gaps as a fifth state, a setting that pretty much never makes any sense because it would score a single evolutionary event deleting nine bases from a sequence region as nine independent events.

3. I prefer very much to be able to set the names of input and output files instead of a program needing the former to be "infile" or suchlike and always writing the result into "outfile", in the process overwriting the previous analysis.

4. Commands that have to be typed or written into a data file should be intuitive. For example, "contree strict = yes" is easily understood to refer to a strict consensus tree and also easily remembered; "nelsen *" may be harder to figure out or remember.

5. In the same vein, command line parameters should be intuitive. "-threads 4 -resume" is clear; "-f T" much less so.

6. Programs should use standard formats. Once a data matrix format, tree file format, or scripting language is widely established, the next few programs coming out would ideally use the same formats. This is actually going reasonably well in phylogenetics, as RAxML uses the data matrix format of PHYLIP, many programs, including the popular PAUP, MrBayes and BEAST, happily accept Nexus data files, and very nearly all programs use the Newick tree format (or Nexus, which is the same only with "#nexus begin trees; tree mytree = " and "end;" around it). I can also understand why TNT continues the different data matrix format first established for Hennig86, because it is serving the same community, and admittedly it is easily derived from the PHYLIP one.

Still, didn't that above paragraph kind of mention two data formats more than necessary? Why aren't they all using the same, sparing us a lot of reformatting and exporting? And I do not really see any excuse to use non-standard tree formats, as all tree viewer software expects Newick. That being said, as far as I can tell the situation is much worse in population genetics.
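To show how thin the Nexus wrapper around a Newick tree is, here is a minimal Python sketch that adds and removes it. It only covers the simplest case described above, with one tree per line; real Nexus files can also contain translate tables, comments, and other blocks, which this ignores. The function names are my own.

```python
def newick_to_nexus(newick: str, tree_name: str = "mytree") -> str:
    """Wrap a single Newick string in a minimal Nexus trees block."""
    newick = newick.strip()
    if not newick.endswith(";"):
        newick += ";"
    return f"#NEXUS\nbegin trees;\n  tree {tree_name} = {newick}\nend;\n"


def nexus_to_newick(nexus: str):
    """Pull the Newick strings back out of a simple trees block."""
    trees = []
    for line in nexus.splitlines():
        stripped = line.strip()
        # Only lines of the form "tree <name> = <newick>" are of interest.
        if stripped.lower().startswith("tree ") and "=" in stripped:
            trees.append(stripped.split("=", 1)[1].strip())
    return trees


wrapped = newick_to_nexus("((A,B),C);")
print(wrapped)
print(nexus_to_newick(wrapped))
```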

Just my two cents, as the saying goes. Others may have totally different experiences, and as most of these programs are non-commercial the most appropriate stance towards the developers is gratitude. But I am working on a little niche thing myself (nothing phylogenetic though), so I should now take my own advice to heart, at least for points 2 to 5...
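For what it is worth, points 3 and 5 might look something like the following in Python's argparse module. This is a sketch for an imaginary program, here called `mytool`; all option names are assumptions for illustration, and argparse conventionally writes spelled-out options with two hyphens rather than one.

```python
import argparse
import os


def build_parser():
    """Command line interface with spelled-out options and user-chosen file names."""
    parser = argparse.ArgumentParser(
        prog="mytool", description="Hypothetical analysis tool."
    )
    parser.add_argument("--input", required=True, help="data file to analyse")
    parser.add_argument("--output", required=True, help="file to write results to")
    parser.add_argument("--threads", type=int, default=1, help="number of threads")
    parser.add_argument("--resume", action="store_true", help="continue an earlier run")
    parser.add_argument(
        "--overwrite", action="store_true", help="allow replacing an existing output file"
    )
    return parser


def check_output(path: str, overwrite: bool) -> None:
    """Refuse to silently overwrite the result of a previous analysis."""
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(f"{path} exists; pass --overwrite to replace it")


args = build_parser().parse_args(
    ["--input", "matrix.nex", "--output", "result.tre", "--threads", "4", "--resume"]
)
print(args.input, args.output, args.threads, args.resume)
```

The `check_output` helper would be called before writing any results, so that an earlier analysis is only replaced when the user explicitly asks for it.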


  1. Interesting post. As a programmer in bioinformatics, I always find a user's point of view on how software should work valuable. Moreover, such observations can be made for almost every field in bioinformatics (population genomics is another good example). I share a lot of your ideas, especially about file formats: why the hell are there so many formats representing the same data? However, maybe some additional points could be added. For example, as someone who works in bioinformatics, I can't imagine software without command line parameters. Not only because it is easier to use (if the thing is correctly coded and the documentation is good), but also because it allows you to integrate the software into a pipeline and therefore to easily launch it with a lot of different parameters/input files. In contrast, typing commands into the terminal (like TNT) is really an old-fashioned style which should not exist anymore. If you think that your software has too many options to be easy to use in one single command line, just create a configuration file where the user can supply the information (and please, use a standard configuration format such as JSON, and never, really never, create your own). Sadly, each of these programs does not target a lot of users, and I feel like there is not really a dialogue with the user community about how software in phylogenetics (or in another bioinformatics field) should work. And this is why posts like yours are important. Continue the good work!

    1. Thanks for the kind comment. As far as I know, the command line in terminal approach pretty much always means that the program can also read a script file that executes numerous commands automatically.

      The problem with just handing over parameters is that it will only work if the program has a very limited amount of things it can do, and even in the best case scenario it will only be able to do one or two things at a time. For example, one can call RAxML and tell it to search for the best tree plus to do a bootstrap analysis, great. But if you want PAUP to do a likelihood analysis, export the tree, then another three likelihood analyses with different constraints, export each tree, again read in all those trees, and then do an SH-Rell Test on them, there is just no way of planning a program with 200 possible functions to be that flexible after having been given merely a list of run parameters. It needs a structured list of commands in a given first-do-this-then-do-that order, or it would have to be called several times anyway.

      I would not want to produce a chart of references to different programs over the years by giving Excel a single JSON file either. Sometimes there is value in going into a program to do things bit by bit, and sometimes there is value to integrating a program into a pipeline.