Friday, August 25, 2017

Phylogenetic trees in XML format

I recently had the extremely frustrating experience of having to look into how phylogenetic trees are coded in XML format. To illustrate why this was frustrating, let us start by considering a small phylogenetic tree as an example. I got one sequence each for the genera Nassauvia, Erigeron, Xerochrysum, Matricaria, Lactuca, Senecio, Ursinia, Calycera, Kippistia, and Synedrella from GenBank, and produced a likelihood tree in PAUP. In graphical representation it looks as follows.

So how is it generally saved? The most concise way of storing a phylogenetic tree in plain text files is the venerable and widely accepted Newick standard. It consists of OTU names separated by commas and grouped into clades by round brackets. There may be numbers after colons, which are branch lengths, and if there is a number directly after a closing bracket it indicates some kind of support value, such as bootstrap or Bayesian posterior probability. The Newick representation of my little example tree is as follows.

Again, very concise. If we want just a tiny bit more bells and whistles we can use the Nexus format. In the context of phylogenetic trees it is just the Newick format plus "#nexus [line break] begin trees;" at the beginning and "end;" after the trees, and each Newick tree has "tree [name of tree] =" in front of it and is closed by a semicolon. The main advantage is that multiple trees in the same file can now have informative names, whereas in a Newick file they cannot.
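To make the wrapping concrete, here is a minimal Python sketch of it. The function name `to_nexus`, the tree names, and the two toy trees are made up for illustration and are not the example tree from this post; the sketch also assumes one terminating semicolon per tree statement.

```python
def to_nexus(named_trees):
    """Wrap (name, Newick string) pairs in a minimal Nexus trees block."""
    lines = ["#nexus", "begin trees;"]
    for name, newick in named_trees:
        newick = newick.strip()
        # Make sure each tree statement is closed by exactly one semicolon.
        if not newick.endswith(";"):
            newick += ";"
        lines.append(f"tree {name} = {newick}")
    lines.append("end;")
    return "\n".join(lines)


# Two hypothetical toy trees with branch lengths and one support value each.
example = to_nexus([
    ("ml_tree", "((A:0.1,B:0.2)95:0.05,C:0.3);"),
    ("alt_tree", "(A:0.1,(B:0.2,C:0.3)60:0.05)"),
])
print(example)
```

Everything the Nexus version adds is the five-odd words of scaffolding around the Newick strings, which is why the two formats are interchangeable for most practical purposes.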

If we want to find out how this would look in XML format, we can head over to the website, where we will find an online tool that can transform our boring old Newick or Nexus trees into shiny, exciting, newfangled NeXML trees (for Nexus-inspired XML I guess, although as we will soon see there isn't really any similarity at all). Of course for this post I have done that with the example tree.

So, what do we see? As the name XML implies, the format is similar to HTML in that it consists largely of nested sets of tags starting with is-smaller-than signs and ending with is-larger-than signs. But those are just the optics. What about functionality?

As a Newick file, my phylogenetic tree was 448 bytes in size. After transformation into NeXML, the new tree file is now 2645 bytes in size, an increase of 490%. This has several obvious benefits, in particular for the results of Bayesian analyses where thousands of trees have to be saved and may take up megabytes even in Newick format; for example, I can't think of any right now.
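For anyone who wants to check the arithmetic, a quick Python sketch follows. The two file sizes are the ones given above; the figure of 10,000 trees is a hypothetical posterior sample, and the extrapolation naively assumes that every tree costs as much as this one, which a multi-tree file that shares its taxon list would not quite do.

```python
newick_bytes = 448   # size of the example tree as Newick (from the text)
nexml_bytes = 2645   # size of the same tree as NeXML (from the text)

ratio = nexml_bytes / newick_bytes
increase_pct = (nexml_bytes - newick_bytes) / newick_bytes * 100
print(f"{ratio:.1f} times the size, an increase of {increase_pct:.0f}%")

# Hypothetical: a Bayesian run that saves 10,000 trees of this size.
n_trees = 10_000
newick_mb = newick_bytes * n_trees / 1e6
nexml_mb = nexml_bytes * n_trees / 1e6
print(f"{n_trees} trees: about {newick_mb:.2f} MB as Newick, {nexml_mb:.2f} MB as NeXML")
```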

And I am not even going to go into how NeXML codes data matrices beyond observing that it appears to require a tag assigning character type for every individual character. In other words, instead of saying something like "characters 1-9000 are anonymous genome-wide SNPs with the possible states 0, 1, 2 and ?", as in Nexus files, you would have 9000 lines of code (!) each saying "character 4306 is a SNP character" and then "character 4307 is a SNP character", and so on, wasting enormous amounts of disk space and/or bandwidth. Efficiency!

More generally, the structure of the tree coded as NeXML is extremely convoluted compared to what it looks like in Newick format. Newick is, as mentioned above, a set of nested brackets indicating clades; consequently it can be examined and read relatively easily, and even allows the user to copy subtrees in or out in manual editing (it helps if you have a text editor like SciTE that shows which brackets belong together). In fact I have often produced hypothetical example trees to illustrate a point on this blog by typing them out in Newick format and then opening them in a tree viewer. NeXML, however, has a list of nodes and edges that refer to each other via obscure identifiers, making it virtually impossible to read, type out and edit manually, especially for larger trees. But I am sure XML makes life easier for the end user because please insert reasons here.
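To show what the indirection amounts to, here is a small Python sketch that takes a node table and an edge table in the spirit of NeXML and reassembles the nested brackets. The identifiers (`n1` to `n5`), the toy taxa, and the function name `edges_to_newick` are all hypothetical, and the tables are plain Python structures rather than actual NeXML.

```python
# Hypothetical node table: identifier -> label (None for unlabelled internal nodes).
nodes = {"n1": None, "n2": None, "n3": "A", "n4": "B", "n5": "C"}

# Hypothetical edge table: (source id, target id, branch length).
edges = [
    ("n1", "n2", 0.05),
    ("n2", "n3", 0.1),
    ("n2", "n4", 0.2),
    ("n1", "n5", 0.3),
]


def edges_to_newick(nodes, edges):
    """Rebuild a Newick string from node and edge tables."""
    children = {}
    lengths = {}
    for source, target, length in edges:
        children.setdefault(source, []).append(target)
        lengths[target] = length

    # The root is the one node that is never the target of an edge.
    (root,) = [node_id for node_id in nodes if node_id not in lengths]

    def render(node_id):
        label = nodes[node_id] or ""
        kids = children.get(node_id, [])
        if kids:
            text = "(" + ",".join(render(k) for k in kids) + ")" + label
        else:
            text = label
        if node_id in lengths:
            text += f":{lengths[node_id]}"
        return text

    return render(root) + ";"


newick = edges_to_newick(nodes, edges)
print(newick)  # ((A:0.1,B:0.2):0.05,C:0.3);
```

The point is that the reader has to perform exactly this bookkeeping in their head to see the tree in a node-and-edge listing, whereas the 27 characters of the Newick string show it directly.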

Next, imagine writing a program that should be able to read a phylogeny. If you want it to read a Newick tree, you merely need to parse nested brackets, recognise taxon names, and deal with branch length and support value annotations; this is relatively straightforward. If you want it to be able to read NeXML trees, on the other hand, it needs to be able to handle a large number of possible tags in varying order, plus various parameters in each tag that can appear in varying order (<node id="ne16" otu="ou27" label="Senecio_vulgaris"/> could just as well be <node otu="ou27" label="Senecio_vulgaris" id="ne16"/>, for example). This makes life easier for programmers because I'm sorry I really have no idea. But I mean, the website says that this format is "more easily validated and processed", so that must be true, right? Otherwise they wouldn't claim so, would they?
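To give an idea of how little is needed on the Newick side, here is a minimal recursive parser in Python. It is a sketch under simplifying assumptions, not a full implementation: it ignores quoted labels and bracketed comments, it does not distinguish support values from internal node names, and the function names `parse_newick` and `leaf_names` as well as the toy tree are made up for illustration.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts."""
    text = text.strip()
    if text.endswith(";"):
        text = text[:-1]
    pos = 0

    def read_label():
        # Read up to the next structural character.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:":
            pos += 1
        return text[start:pos].strip()

    def read_node():
        nonlocal pos
        node = {"name": "", "length": None, "children": []}
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip the opening bracket
            while True:
                node["children"].append(read_node())
                ch = text[pos] if pos < len(text) else ""
                if ch == ",":
                    pos += 1
                elif ch == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected {ch!r} at position {pos}")
        # Taxon name on a tip; support value or clade name on an internal node.
        node["name"] = read_label()
        if pos < len(text) and text[pos] == ":":
            pos += 1
            node["length"] = float(read_label())
        return node

    root = read_node()
    if pos != len(text):
        raise ValueError(f"trailing text at position {pos}")
    return root


def leaf_names(node):
    """Collect the tip labels from left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2)95:0.05,C:0.3);")
print(leaf_names(tree))             # ['A', 'B', 'C']
print(tree["children"][0]["name"])  # 95
```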

While on the topic of phylogenetics software, to the best of my knowledge none of the programs that I currently use or have seriously used in the past can read or write phylogenies in XML format. BEAST, PAUP, and MrBayes produce Nexus files, TNT exports its own idiosyncratic format or Nexus, and RAxML produces Newick files. (BEAST uses famously convoluted XML input files, but even here the assumption is that most users import Nexus data matrices into the GUI BEAUTi. At any rate it does not save its output as NeXML.) Mesquite, which uses Nexus as its default format, is supposed to be able to export into NeXML format once we install a certain add-on library, but when I tried to do such a conversion I merely got an incomprehensible crash report.

Perhaps more to the point, if NeXML phylogenies produced by some obscure phylogenetics software that I never employ myself are supposed to be of use, they have to be displayed, so how are we doing for tree viewers? The very popular cross-platform software FigTree expects Nexus or Newick phylogenies, and as far as I know the same is true for TreeView. Dendroscope claims to read NeXML files but then only gave me an error message when I tried to import the simple example phylogeny after conversion by the official website. To quote from that same website, "the future data exchange standard is here!"

While on that topic, standardisation is one of the main benefits claimed by NeXML or by XML more generally. As Simon St. Laurent already wrote in 1998:
"XML allows developers to set standards defining the information that should appear in a document, and in what sequence. XML, in combination with other standards, makes it possible to define the content of a document separately from its formatting, making it easy to reuse that content in other applications or for other presentation environments. Most important, XML provides a basic syntax that can be used to share information between different kinds of computers, different applications, and different organizations without needing to pass through many layers of conversion."
I guess at this stage it should come as no surprise at all that there are already at least two different XML standards for phylogenetic trees, which is another way of saying that there is no XML standard for phylogenetic trees. In addition to NeXML, which I have discussed in detail above, there is phyloXML. Where NeXML describes trees using lists of nodes and edges, phyloXML uses nested clade tags, which I find more intuitive and useful because they allow easier parsing and easier manual editing. That approach is also more similar in spirit to Newick and Nexus, and phyloXML would thus be more deserving of a name like NeXML than NeXML. Otherwise, though, it appears to be just as inefficient and convoluted.
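To illustrate the nested-clade idea, here is a Python sketch that walks a phyloXML-style fragment and writes it back out as Newick. The fragment is a stripped-down toy, not the example tree from this post: it leaves out the namespace declaration that real phyloXML files carry (which would change the tag names the parser reports), and the function name `clade_to_newick` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment in the style of phyloXML (toy data).
fragment = """
<phylogeny rooted="true">
  <clade>
    <clade>
      <branch_length>0.05</branch_length>
      <clade><name>A</name><branch_length>0.1</branch_length></clade>
      <clade><name>B</name><branch_length>0.2</branch_length></clade>
    </clade>
    <clade><name>C</name><branch_length>0.3</branch_length></clade>
  </clade>
</phylogeny>
"""


def clade_to_newick(clade):
    """Turn one clade element and its nested clades into Newick text."""
    children = clade.findall("clade")           # direct child clades only
    name = clade.findtext("name", default="")
    length = clade.findtext("branch_length")
    if children:
        text = "(" + ",".join(clade_to_newick(c) for c in children) + ")" + name
    else:
        text = name
    if length is not None:
        text += ":" + length.strip()
    return text


root_clade = ET.fromstring(fragment.strip()).find("clade")
newick = clade_to_newick(root_clade) + ";"
print(newick)  # ((A:0.1,B:0.2):0.05,C:0.3);
```

Because the nesting of the tags mirrors the nesting of the brackets, the conversion is a single recursive walk with no lookup tables, which is what I mean by more similar in spirit. The size comparison speaks for itself, though: the fragment is several hundred characters, the resulting Newick string is 27.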

So concerning standardisation I guess the reality is that XML is flexible enough that anybody could come up with a new, XML-based standard. Just think of a few words, put is-smaller-than and is-larger-than signs around them, convince a handful of colleagues to adopt this standard, and off you go. Yes, if it is so easy to do then everybody will do it, and then we achieve the exact opposite of standardisation, but I guess that is where XML proponents can switch to touting its "flexibility". Heads XML wins, tails all other data standards lose.

As far as I can see, Newick and Nexus work just fine. Compared to XML phylogenies they are easier to parse, are already standardised, are accepted by virtually every phylogenetics program and tree viewer, and take up a fraction of the disk space. Why fix what isn't broken?
