A well-established software tool in ecology and systematics is Structure. Using population-level molecular data such as Amplified Fragment Length Polymorphisms (AFLPs), microsatellites or Single Nucleotide Polymorphisms (SNPs), it tries to find the underlying population structure. In practice, you run the program with a range of values for the number of populations (K) and then compare the resulting likelihood values for each.
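To make the "compare the likelihood values" step concrete, here is a minimal sketch in Python. It assumes you have already collected one log likelihood per K by hand; the function name `best_k` is my own and not part of Structure, and simply taking the maximum is the crudest possible comparison.

```python
def best_k(log_likelihoods):
    """Return the K whose run reported the highest log likelihood.

    `log_likelihoods` maps each K (int) to the log likelihood of that run.
    This helper is hypothetical and only illustrates the comparison step.
    """
    if not log_likelihoods:
        raise ValueError("need at least one run to compare")
    # max() over the dictionary keys, ranked by their likelihood values
    return max(log_likelihoods, key=log_likelihoods.get)
```

With placeholder values that are not from any real analysis, `best_k({2: -1050.0, 3: -980.5, 4: -991.2})` picks K = 3.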
Apart from the best K value, that is the number of populations or clusters that your samples are best divided into, Structure also outputs how much of the genome of each of your samples is derived from each population. This is then often depicted in a graph such as the one seen here. Each row in that figure represents the results of one Structure run, from K = 2 to K = 6. Each colour is one of the clusters or populations. Each pixel column is one sample, with its colours showing how much of it belongs to each population. So you may get many samples that belong fully or nearly fully to one population but also some that are 'admixed' between two, perhaps representing hybrids.
This methodology is used to examine population structure below the species level but also, and this is what is more interesting to a systematist like myself, to study the delimitation of species. I have such a project going with a colleague from a different herbarium, and we just got our data. The dataset comprises more than 9,000 SNPs for 91 samples, and unfortunately Structure is rather slow, especially when analysing such large amounts of data.
So you can imagine I was very happy to find that a few months ago the same lab released a program called fastStructure, which is designed specifically to deal with large numbers of SNPs and promises to be one or two orders of magnitude faster than Structure. In fact it is so new at this point that the paper announcing it has only been cited twice: once in the editorial of the same journal (which doesn't really count) and once in a minor review article. In a few months papers will start coming out by people who have actually used it, but at the moment there is little practical experience to build on except the comments and questions of people on the Structure Google Group.
After our high performance computing staff kindly installed the program on a supercomputer, I spent most of today trying fastStructure out. I learned a lot but so far the results have been mixed. I write this partly to spare other people some of the frustrations I experienced.
First, the installation instructions on the fastStructure site (last accessed 1 August 2014) are for Unix/Linux only, and they are really cryptic. It is a common observation that informaticians always assume that everybody is using Unix even as they know that most end-users run Windows, but okay, let's cut them some slack because I don't like Microsoft products that much either.
What is worse is that they don't even say that those are only the instructions for Unix. There was actually one question on the aforementioned Google Group by a user who was genuinely puzzled why those instructions didn't work on Windows. You simply cannot assume that every biologist is familiar with operating systems. Their job description is being familiar with organisms, that's it.
Second, although the website mentions that you can use the traditional Structure format for your data (as opposed to the very convoluted multi-file 'plink bed' format of which I had never heard before), it may not be immediately clear why the program does not accept it. You have to set an option "--format=str" if your file is in Structure format, but of course that isn't mentioned where they explain how to start the analysis. Also, make sure that your file has the extension ".str" but don't mention that extension in the "--input=filename" option; the software will assume that if you say "filename" you mean "filename.str".
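For what it is worth, here is a small Python sketch that assembles the command line for a range of K values with the options mentioned above. The script name `structure.py`, the file name `mydata.str` and the output prefix are assumptions for illustration, and the helper `faststructure_command` is my own invention; it only builds and prints the commands, it does not run anything.

```python
import shlex


def faststructure_command(input_file, output_prefix, k, script="structure.py"):
    """Build the argument list for one fastStructure run in Structure format.

    The script name and all paths are assumptions; adjust them to your setup.
    """
    # The program appends ".str" itself, so pass the name without the extension.
    stem = input_file[:-4] if input_file.endswith(".str") else input_file
    return [
        "python", script,
        "-K", str(k),
        f"--input={stem}",
        f"--output={output_prefix}",
        "--format=str",
    ]


if __name__ == "__main__":
    # Print one command per K from 2 to 6 so they can be inspected or pasted.
    for k in range(2, 7):
        print(shlex.join(faststructure_command("mydata.str", "mydata_out", k)))
```

Keeping the command as a list rather than one long string also means it could be handed to `subprocess.run` unchanged if you want to launch the runs from Python.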
Third, neither the website nor the publication announcing the program really makes clear that you cannot and don't need to set the number of iterations the program should run. As can be seen in the Google Group on the software, this confuses the heck out of seasoned Structure users (it sure confused me), because in Structure you need to run tens of thousands of iterations to get defensible results, and often you do an analysis only to find that you have to do it over and quintuple the number of iterations. In fastStructure, however, the program decides by itself when to stop, and it does so after a frighteningly low number of iterations, often in the range of only 50 to 150. I guess that is what makes it so fast.
Update: The issue here was squarely on my side. While I was preparing the data file through some rounds of copy-pasting and transposing between LibreOffice Calc and a text editor, LibreOffice somehow saw fit to transform all the missing data into zeroes, which in this context means homozygous for the dominant allele. No wonder that the results did not make sense.
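A quick sanity check would have caught this. The Python sketch below tallies the genotype codes in a Structure-format file so you can see whether any missing values survived the spreadsheet round trip. It assumes that missing data are coded as -9 (the usual Structure convention) and that the genotypes start after six leading columns; both are assumptions to adapt to your own file, and the function names are mine.

```python
from collections import Counter


def genotype_code_counts(text, leading_columns=6):
    """Count how often each genotype code occurs in Structure-format text.

    `leading_columns` is the number of non-genotype columns at the start
    of every row (an assumption; change it to match your file).
    """
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip the leading label columns and tally the remaining codes.
        counts.update(fields[leading_columns:])
    return counts


def missing_data_lost(text, missing_code="-9", leading_columns=6):
    """Return True if the file contains no missing-data code at all."""
    return genotype_code_counts(text, leading_columns)[missing_code] == 0
```

If you know your raw data had gaps and `missing_data_lost(open("mydata.str").read())` comes back `True`, something along the way has rewritten them.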
However, now that I get meaningful results with K=4 in one dataset and K=18 in another, larger one, there are still other problems. First, I have dendrograms for comparison and am surprised that there isn't more congruence between the dendrogram clusters and the fastStructure clusters. Second, and even more puzzling, fastStructure seems to assign very nearly all samples to a single cluster with >99% membership; in other words, there are virtually no admixed individuals. That is particularly odd given the organisms we are dealing with because we expected some hybrids in the dataset, and the patterns seen in the aforementioned dendrograms can only be explained by some degree of admixture.
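To put a number on "virtually no admixed individuals", a sketch like the following can be used. It assumes the membership proportions are available as plain text with one row per sample and one whitespace-separated column per cluster, which is how I understand the program's meanQ output to be laid out; treat that layout, the 0.99 threshold and the function names as my assumptions.

```python
def parse_q_matrix(text):
    """Parse whitespace-separated membership proportions, one sample per row."""
    return [
        [float(value) for value in line.split()]
        for line in text.splitlines()
        if line.strip()  # ignore blank lines
    ]


def admixed_samples(q_rows, threshold=0.99):
    """Return the row indices of samples whose largest membership
    proportion is below `threshold`, i.e. candidates for admixture."""
    return [i for i, row in enumerate(q_rows) if max(row) < threshold]
```

Dividing `len(admixed_samples(rows))` by `len(rows)` then gives the fraction of samples that are not assigned almost entirely to one cluster.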
You will notice that, apart from my specific problems with the results, there is a general theme in this post of the programmers providing too little information on how to use the software. That is not unusual (the people who developed BEAST and TNT, respectively, are rare and notable exceptions), but it is still something that I find hard to understand.
Don't get me wrong, it is clear that all of us can get lost in the nuts and bolts of our profession, and there is always the risk that we assume our partner in conversation will understand various technical terms and methodological assumptions when really they have no chance of knowing them. But the thing is, a programmer writing a program does not have that excuse. They are, for present purposes, not part of a whole community of people who talk the same language and share the same mode of thinking despite it being alien to all other people. Instead they are a community of one. If they don't provide clear instructions they are literally the only person on the planet who can know how their program works, because nobody else can possibly know how to set that option or how to interpret that result. And they should know that.
I also realise that this kind of software is made freely available and will earn the programmer no money. So obviously one cannot expect the same user-friendliness and support as one could expect for a software package that one just bought a $2,000 license for. But the programmer still does want the software to be used, after all. In science and academia, their reward is going to be the large number of people citing their program because they used it in their studies. Consequently one would naively assume that it might be in the programmer's best interest to write instructions that allow even, dunno, an end-user who runs Windows and does not know Python at all to work with their program. Just saying.
So yes, fastStructure is really super-fast, and once I figure out how to get less