PhyloBotanist: Trying to use fastStructure

Friday, August 1, 2014

Trying to use fastStructure

Updated 16 August 2014 and 27 Nov 2014

A well established software tool in ecology and systematics is Structure. Using population level molecular data such as Amplified Fragment Polymorphism (AFLP), microsats or Single Nucleotide Polymorphisms (SNPs), it tries to find the underlying population structure. In practice, you run the program with a range of values for number of populations (K) and then compare the resulting likelihood values for each.

Apart from the best K value, that is the number of populations or clusters that your samples are best divided into, Structure also outputs how much of the genome of each of your samples is derived from each population. This is then often depicted in a graph such as the one seen here. Each line in that figure represents the results of one Structure run, from K = 2 to K = 6. Each colour is one of the clusters or populations. Each pixel column is one sample, with its colour showing to what population it belongs. So you may get many samples that belong fully or nearly fully to one population but also some that are 'admixed' between two, perhaps representing hybrids.

This methodology is used to examine population structure below the species level but also, and this is what is more interesting to a systematist like myself, to study the delimitation of species. I have such a project going with a colleague from a different herbarium, and we just got our data. They are more than 9,000 SNPs for 91 samples, and unfortunately Structure is rather slow especially when analysing such large amounts of data.

So you can imagine I was very happy to find that a few months ago the same lab released a program called fastStructure specifically to deal with large numbers of SNPs and promised to be one or two orders of magnitude faster than Structure. In fact it is so new at this point that the paper announcing it has only been cited twice - once in the editorial of the same journal (which doesn't really count) and once in a minor review article. In a few months papers will start coming out by people who have actually used it, but at the moment there is little practical experience to build on except the comments and questions of people on the Structure Google Group.

After our high performance computing staff kindly installed the program on a supercomputer, I spent most of today trying fastStructure out. I learned a lot but so far the results have been mixed. I write this partly to spare other people some of the frustrations I experienced.

First, the installation instructions on the fastStructure site (last accessed 1 August 2014) are for Unix/Linux only, and they are really cryptic. It is a common observation that informaticians always assume that everybody is using Unix even as they know that most end-users run Windows, but okay, let's cut them some slack because I don't like Microsoft products that much either.

What is worse is that they don't even say that those are only the instructions for Unix. There was actually one question on the aforementioned Google Forum by a user who was genuinely puzzled why those instructions didn't work on Windows. You simply cannot assume that every biologist is familiar with operating systems. Their job description is being familiar with organisms, that's it.

Second, although the website mentions that you can use the traditional Structure format for your data (as opposed to the very convoluted multi-file 'plink bed' format of which I had never heard before), it may not be immediately clear why the program does not accept it. You have to set an option "--format=str" if your file is in Structure format, but of course that isn't mentioned where they explain how to start the analysis. Also, make sure that your file has the extension ".str" but don't mention that extension in the "--input=filename" option; the software will assume that if you say "filename" you mean "filename.str".

Third, neither the website nor the publication announcing the program really make clear that you cannot and don't need to set the number of iterations the program should run. As can be seen in the Google Forum on the software, this confuses the heck out of seasoned Structure users (it sure confused me), because in Structure you needed to run tens of thousands of iterations to get defensible results, and often you do an analysis only to find that you have to do it over and quintuple the number of iterations. In fastStructure, however, the program decides by itself when to stop, and it does so after a frighteningly low number of iterations, often in the range of only 50 to 150. I guess that is what makes it so fast.

Anyway, my current problem is that my first results are very meaningless. Although even a quick and dirty dendrogram analysis shows meaningful structure in our data, fastStructure infers an unrealistically high number of populations and then assigns all samples to one and the same population. In my last run the first seven population clusters accounted for 0.000003% of the genes of all samples in the analysis, and the eighth cluster accounted for >90% of their genes. Surely there is something wrong here.

Update: The issue here was squarely on my side. While I was preparing the data file through some rounds of copy pasting and transposing between LibreOffice Calc and a text editor, LibreOffice somehow saw fit to transform all the missing data into zeroes, which in this context means homozygous for the dominant allele. No wonder that the results did not make sense.

However, now that I get meaningful results with K=4 in one dataset and K=18 in another, larger one, there are still other problems. First, I have dendrograms for comparison and am surprised that there isn't more congruence between the dendrogram clusters and the fastStructure clusters. Second, and even more puzzling, fastStructure seems to assign very nearly all samples to clusters with >99%; in other words, there are virtually no admixed individuals. That is particularly odd given the organisms we are dealing with because we expected some hybrids in the dataset, and the patterns seen in the aforementioned dendrograms can only be explained by some degree of admixture.

You will notice that, apart from my specific problems with the results, there is a general theme in this post of the programmers providing too little information on how to use the software. That is not unusual (the people who developed BEAST and TNT, respectively, are rare and notable exceptions), but it is still something that I find hard to understand.

Don't get me wrong, it is clear that all of us can get lost in the nuts and bolts of our profession, and there is always the risk that we assume our partner in conversation will understand various technical terms and methodological assumptions when really they have no chance of knowing them. But the thing is, a programmer writing a program does not have that excuse. They are, for present purposes, not part of a whole community of people who talk the same language and share the same mode of thinking despite it being alien to all other people. Instead they are a community of one. If they don't provide clear instructions they are literally the only person on the planet who can know how their programs works, because nobody else can possibly know how to set that option or how to interpret that result. And they should know that.

Don't get me wrong, it is clear that this kind of software is made freely available and will earn the programmer no money. So obviously one cannot expect the same user-friendliness and support as one could expect for a software package that one just bought a $2,000 license for. But the programmer still does want the software to be used, after all. In science and academia, their reward is going to be the large number of people citing their program because they used it in their studies. Consequently one would naively assume that it might be in the programmer's best interest to write instructions that allow even, dunno, even an end-user who runs Windows and does not know Python at all to work with their program. Just saying.

So yes, fastStructure is really super-fast, and once I figure out how to get less ~~bizarre~~ odd results I will gladly use it. But it takes some frustration tolerance to figure out how to use it, and at least I had some informatics courses in university, know some basic programming and have experience with more than the most popular operating system. It is possible that colleagues who have less computer skills than myself will find the experience rather taxing. And that is a pity.

13 comments:

UnknownFebruary 23, 2016 at 3:18 PM
Did you ever get useful results from fastStructure? Struggling with the install, and wondering if it is worth it.
ReplyDelete
Replies
Francisco Pina-MartinsJune 29, 2016 at 8:04 PM
For those of you with issues installing fastSTRUCTURE, please try my wrapper:
https://github.com/StuntsPT/Structure_threader
It has some helper scripts to install both structure and faststructure, since it can wrap both programs so you can compare results.
I hope it helps, and I especially hope the documentation (http://structure-threader.readthedocs.io/en/latest/) is up to par!
ReplyDelete
Replies
UnknownAugust 21, 2017 at 3:35 AM
Hello Alex,
I am also a new fastSTRUCTURE (=fs for simplicity) user and hope you still review this part of your forum.
I have also faced with something strange in my fs results. I have run multiple fs runs (python faststructure.py) with different seed-sets. But, when I run chooseK.py to get an estimation number of subpopulations in my data set I get different outputs. This means that everything is identical in my faststructure.py script expect of seed-sets and I wonder why I get different K estimation.....!
for instance in one of my runs I get this
Running on 5 cores.
Starting fastSTRUCTURE at: Mon Aug 14 14:00:49 PDT 2017
Model complexity that maximizes marginal likelihood = 1
Model components used to explain structure in data = 10

and in the other run I get
Model complexity that maximizes marginal likelihood = 1
Model components used to explain structure in data = 7
ReplyDelete
Replies

Add comment