Friday, August 1, 2014

Trying to use fastStructure

Updated 16 August 2014 and 27 Nov 2014

A well-established software tool in ecology and systematics is Structure. Using population-level molecular data such as Amplified Fragment Length Polymorphisms (AFLPs), microsatellites or Single Nucleotide Polymorphisms (SNPs), it tries to find the underlying population structure. In practice, you run the program with a range of values for the number of populations (K) and then compare the resulting likelihood values for each.

Apart from the best K value, that is the number of populations or clusters that your samples are best divided into, Structure also outputs how much of the genome of each of your samples is derived from each population. This is then often depicted in a graph such as the one seen here. Each line in that figure represents the results of one Structure run, from K = 2 to K = 6. Each colour is one of the clusters or populations. Each pixel column is one sample, with its colour showing to what population it belongs. So you may get many samples that belong fully or nearly fully to one population but also some that are 'admixed' between two, perhaps representing hybrids.

This methodology is used to examine population structure below the species level but also, and this is what is more interesting to a systematist like myself, to study the delimitation of species. I have such a project going with a colleague from a different herbarium, and we just got our data. They are more than 9,000 SNPs for 91 samples, and unfortunately Structure is rather slow especially when analysing such large amounts of data.

So you can imagine I was very happy to find that a few months ago the lab behind Structure released a program called fastStructure, designed specifically to deal with large numbers of SNPs and promising to be one or two orders of magnitude faster than Structure. In fact it is so new at this point that the paper announcing it has only been cited twice - once in the editorial of the same journal (which doesn't really count) and once in a minor review article. In a few months papers will start coming out by people who have actually used it, but at the moment there is little practical experience to build on except the comments and questions of people on the Structure Google Group.

After our high performance computing staff kindly installed the program on a supercomputer, I spent most of today trying fastStructure out. I learned a lot but so far the results have been mixed. I write this partly to spare other people some of the frustrations I experienced.

First, the installation instructions on the fastStructure site (last accessed 1 August 2014) are for Unix/Linux only, and they are really cryptic. It is a common observation that informaticians always assume that everybody is using Unix even as they know that most end-users run Windows, but okay, let's cut them some slack because I don't like Microsoft products that much either.

What is worse is that they don't even say that those are only the instructions for Unix. There was actually one question on the aforementioned Google Group from a user who was genuinely puzzled why those instructions didn't work on Windows. You simply cannot assume that every biologist is familiar with operating systems. Their job description is being familiar with organisms, that's it.

Second, although the website mentions that you can use the traditional Structure format for your data (as opposed to the very convoluted multi-file 'plink bed' format of which I had never heard before), it may not be immediately clear why the program does not accept it. You have to set an option "--format=str" if your file is in Structure format, but of course that isn't mentioned where they explain how to start the analysis. Also, make sure that your file has the extension ".str" but don't mention that extension in the "--input=filename" option; the software will assume that if you say "filename" you mean "filename.str".
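To make the conventions above concrete, here is a small sketch of how one might assemble the call for a Structure-format file. The flags `--format=str` and the extension-stripping behaviour are as described above; the `-K` and `--output` flags, the script name `structure.py` and all paths are how I remember them from the version I used, so treat them as assumptions and check against your own installation.

```python
# Sketch: build the argument list for a fastStructure run on a .str file.
# The file on disk must be called something.str, but --input must NOT
# include the extension; --format=str tells the program the file is in
# Structure format rather than plink bed. Paths and K are placeholders.
import os

def faststructure_args(str_file, k, out_prefix):
    """Build the command-line argument list from a .str file path."""
    base, ext = os.path.splitext(str_file)
    assert ext == ".str", "fastStructure expects the file to end in .str"
    return ["python", "structure.py",
            "-K", str(k),
            "--format=str",
            "--input=" + base,        # extension deliberately stripped
            "--output=" + out_prefix]

print(faststructure_args("mydata.str", 4, "results/run1"))
```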

Third, neither the website nor the publication announcing the program really makes clear that you cannot, and don't need to, set the number of iterations the program should run. As can be seen in the Google Group on the software, this confuses the heck out of seasoned Structure users (it sure confused me), because in Structure you needed to run tens of thousands of iterations to get defensible results, and often you would do an analysis only to find that you had to do it over and quintuple the number of iterations. In fastStructure, however, the program decides by itself when to stop, and it does so after a frighteningly low number of iterations, often in the range of only 50 to 150. I guess that is what makes it so fast.

Anyway, my current problem is that my first results are meaningless. Although even a quick and dirty dendrogram analysis shows meaningful structure in our data, fastStructure infers an unrealistically high number of populations and then assigns all samples to one and the same population. In my last run the first seven population clusters accounted for 0.000003% of the genes of all samples in the analysis, and the eighth cluster accounted for >90% of their genes. Surely there is something wrong here.

Update: The issue here was squarely on my side. While I was preparing the data file through some rounds of copy pasting and transposing between LibreOffice Calc and a text editor, LibreOffice somehow saw fit to transform all the missing data into zeroes, which in this context means homozygous for the dominant allele. No wonder that the results did not make sense.
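A simple sanity check would have caught this before any run: tally the genotype codes in the prepared matrix and confirm that the missing-data code actually survived the spreadsheet round-trip. A minimal sketch, assuming the Structure default of -9 for missing data (your file may use a different code):

```python
# Sketch: count genotype codes in a Structure-style matrix of integers.
# If the missing-data code (-9 here, by assumption) never appears even
# though the raw data contained missing calls, something mangled the file
# on the way through the spreadsheet.
from collections import Counter

def genotype_tally(rows):
    """Return a Counter of all genotype codes in the matrix."""
    counts = Counter()
    for row in rows:
        counts.update(row)
    return counts

# Toy matrix: two rows of genotypes, one missing call coded as -9.
matrix = [[1, 0, -9, 1],
          [0, 0, 1, 1]]
tally = genotype_tally(matrix)
print(tally)
print(-9 in tally)   # missing data survived the round-trip
```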

However, now that I get meaningful results with K=4 in one dataset and K=18 in another, larger one, there are still other problems. First, I have dendrograms for comparison and am surprised that there isn't more congruence between the dendrogram clusters and the fastStructure clusters. Second, and even more puzzling, fastStructure seems to assign very nearly all samples to clusters with >99%; in other words, there are virtually no admixed individuals. That is particularly odd given the organisms we are dealing with because we expected some hybrids in the dataset, and the patterns seen in the aforementioned dendrograms can only be explained by some degree of admixture.

You will notice that, apart from my specific problems with the results, there is a general theme in this post of the programmers providing too little information on how to use the software. That is not unusual (the people who developed BEAST and TNT, respectively, are rare and notable exceptions), but it is still something that I find hard to understand.

Don't get me wrong, it is clear that all of us can get lost in the nuts and bolts of our profession, and there is always the risk that we assume our partner in conversation will understand various technical terms and methodological assumptions when really they have no chance of knowing them. But the thing is, a programmer writing a program does not have that excuse. They are, for present purposes, not part of a whole community of people who talk the same language and share the same mode of thinking despite it being alien to all other people. Instead they are a community of one. If they don't provide clear instructions they are literally the only person on the planet who can know how their program works, because nobody else can possibly know how to set that option or how to interpret that result. And they should know that.

Don't get me wrong, it is clear that this kind of software is made freely available and will earn the programmer no money. So obviously one cannot expect the same user-friendliness and support as one could expect for a software package that one just bought a $2,000 license for. But the programmer still does want the software to be used, after all. In science and academia, their reward is going to be the large number of people citing their program because they used it in their studies. Consequently one would naively assume that it might be in the programmer's best interest to write instructions that allow even, I dunno, an end-user who runs Windows and does not know Python at all to work with their program. Just saying.

So yes, fastStructure is really super-fast, and once I figure out how to get less bizarre results I will gladly use it. But it takes some frustration tolerance to figure out how to use it, and at least I had some informatics courses at university, know some basic programming and have experience with more than the most popular operating system. Colleagues with weaker computer skills than mine may well find the experience rather taxing. And that is a pity.

13 comments:

  1. Did you ever get useful results from fastStructure? Struggling with the install, and wondering if it is worth it.

    1. Yes, see follow-up posts: here, here and here.

Summary: excruciating to install, user-unfriendly, and it rarely ever infers any significant admixture even in known hybrids. On the plus side, it really is ridiculously fast, and apart from the admixture issue the results are comparable to what I got from weeks of running STRUCTURE.

Now that I know how to use it and have it installed, I will use it again, but in your situation I would decide based on how long you have to wait for STRUCTURE to do its thing. If it isn't weeks then perhaps just give fastSTRUCTURE a miss.

    2. Hi, Dr. Alex SL,

Thanks for your great posts.
I have already read your three further posts about fastSTRUCTURE.
So your conclusion is that fastSTRUCTURE is still not complete?

I tried it with my large data set, 180K soybean SNPs for 391 varieties, but I cannot get a meaningful result. Actually, I can't find the best K value. If you don't mind, could you explain how you decide on the proper K value?

What I read in the original fastSTRUCTURE paper is:
1. run fastSTRUCTURE with K=1 to 10
2. run chooseK.py to select the proper K
3. compare the data for each K in the range and make the decision.

      But, I can't figure out step 3.
Please give me a clue about this.

      Thanks in advance.

      LJS

      P.S.
I agree that the install procedure is painful!!!

    3. First note that I am not really an expert in this kind of analysis, I am more of a systematist than a population geneticist. But I will try.

      See my previous comment replying to Aaron Liston. It is useful if you want to do population- or species-delimitation, not so much when you need to study degrees of admixture, because it hardly ever infers any. Unless they changed it in a newer version I haven't tried.

      When using STRUCTURE, fast or otherwise, I tend not to rely on formulaic approaches like chooseK. Instead, my understanding is that one should move from low to high K and then take the solution with the highest likelihood score before it starts plateauing. In other words, plot the likelihood scores by K from 1 to whatever your maximum is, and you should usually see a rapid increase at the beginning followed by a plateau. Pick the K belonging to the first point on the plateau. That is how I learned it years ago.
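The plateau rule I describe above is easy to script. A minimal sketch of my own heuristic (not part of fastStructure or chooseK): take the first K whose gain over the previous K drops below some fraction of the largest single-step gain. The 5% cut-off is an arbitrary placeholder.

```python
# Sketch: pick the first K on the likelihood plateau.
def first_plateau_k(likelihoods, frac=0.05):
    """likelihoods: dict mapping K -> likelihood score.

    Returns the smallest K whose improvement over K-1 falls below
    `frac` of the largest single-step gain (i.e. the first point on
    the plateau); falls back to the largest K tested."""
    ks = sorted(likelihoods)
    gains = {k: likelihoods[k] - likelihoods[k - 1] for k in ks[1:]}
    biggest = max(gains.values())
    for k in ks[1:]:
        if gains[k] < frac * biggest:
            return k - 1   # the previous K started the plateau
    return ks[-1]

# Toy example: rapid increase up to K=4, then a flat plateau.
scores = {1: -900, 2: -700, 3: -600, 4: -560, 5: -559, 6: -558}
print(first_plateau_k(scores))   # 4
```

Plotting the scores and eyeballing the curve remains the sanity check; the function only automates the obvious case with a single clear plateau.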

That being said, even that is often too simplistic. There are cases where a lower one may make more sense (e.g. because the selected one has four clusters but only three real ones, with no sample belonging to the fourth one with any significant amount of its genome) or where a higher one makes at least equal sense. A colleague told me for example that they had a case of a maximum at K=2 followed by a second clear maximum at K=12 or something. It turned out there were two species, explaining the K=2, and the higher number were geographically structured populations within those two species. In cases like that one there is no simple answer.

      I don't know your data set of course, but another issue that I observed in mine is that STRUCTURE, again both fast and standard, will generally have trouble inferring a cluster if that species / population / cultivar has less than five individuals in the dataset. In that case it makes a bigger cluster out of it and similarly underrepresented relatives.

    4. Also I notice that you entered your comment three times. Sorry for that, but I had to turn on comment moderation after being inundated by spam, so your comments do not show until after I have approved them.

    5. Hi Alex SL,
I am deep into faststructure and currently struggling with two things. First, I cannot get the logistic prior calculation to complete. It simply bombs out. Second, I don't see how to take the output data from faststructure and make the popfile for distruct. Any ideas on these two things?

    6. Sorry, it has been a bit since I last used it. I don't even know if there has been a new version, which may have changed things, and I don't have fastStructure on this computer, so I cannot help right now. Will have a look later on my Linux machine, but I did not use distruct and wrote my own Python scripts to turn the output into something useful.

    7. Okay, distruct is weird. The obligatory command line parameters are --K, --input for the original str file, and --output for the names of the output files. The bizarre thing is that when trying to open the output files it appears to completely ignore the outputfilename it was given and looks instead for inputfilename.K.meanQ. That could easily be rectified in the Python script I guess, but why is it like that?

      Aside from that, I have no idea what the output of distruct is supposed to be good for because it does not show sample identifiers. How would any end user know what to make of population number 3 without knowing if it represents, say, all their samples numbered 24 to 38? Without that info it is just random colour bars, which is why I wrote my own scripts.
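For what it's worth, the core of what my replacement scripts do is trivial: pair each row of the `.meanQ` matrix with the sample identifier from the original `.str` file and report the dominant cluster per sample. A hedged sketch (the identifiers, the toy matrix and the 0.9 admixture threshold are all made up for illustration):

```python
# Sketch: attach sample identifiers to fastStructure's .meanQ output.
def assign_clusters(sample_ids, meanq_rows, admix_threshold=0.9):
    """Pair each sample with its dominant cluster.

    sample_ids: one identifier per sample, in .str file order.
    meanq_rows: rows of the .meanQ matrix, one row of K membership
    proportions per sample. A sample whose top proportion is below
    the threshold is flagged as admixed."""
    result = []
    for sid, row in zip(sample_ids, meanq_rows):
        top = max(range(len(row)), key=lambda i: row[i])
        result.append((sid, top, row[top] >= admix_threshold))
    return result

ids = ["sampleA", "sampleB", "sampleC"]
meanq = [[0.99, 0.01],
         [0.02, 0.98],
         [0.55, 0.45]]   # an admixed individual
for sid, cluster, pure in assign_clusters(ids, meanq):
    print(sid, cluster, "pure" if pure else "admixed")
```

With that table in hand, any plotting tool can produce the colour bars with meaningful labels attached.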

    8. Hi Alex,
After much digging and some Excel magic last night, I found out how to make the popfile.popq, which it uses to read the .meanQ output files from the different K runs in faststructure. I used the original .str file to create my popfile. This is how it should look:
      Palmyer_CanyonRd_Laramie_Plains_Pop2001
      Palmyer_CanyonRd_Laramie_Plains_Pop2001
      Palmyer_CanyonRd_Laramie_Plains_Pop2001
      Little_BrooklynLk_MedicineBows_Pop2002
      Little_BrooklynLk_MedicineBows_Pop2002
      Little_BrooklynLk_MedicineBows_Pop2002
      Little_BrooklynLk_MedicineBows_Pop2002
      SheepMtnRd1_BigHorns_Pop2004
      SheepMtnRd1_BigHorns_Pop2004
      SheepMtnRd2_BigHorns_Pop2005
      SheepMtnRd2_BigHorns_Pop2005
      SheepMtnRd2_BigHorns_Pop2005
.... continuing for every line in the original .str file of data, which for me was 656 lines. The trick is that .str files list every sample twice (diploid), so I needed to use Excel and the ISEVEN function to remove every other line. So don't follow the distruct pdf manual, as the example in there will not work. I will also paste below the bash script I used to run distruct, as that took some time to figure out and might be useful for others. In the end, the output is nice as it put my population name on the clusters and gave them one of 60 different colors automatically. And as far as faststructure goes, the simple prior finished this particular ddRADtag matrix of 656 samples and ~30k loci in about 10 minutes, compared to 5 days in traditional structure.
If you don't give it a popfile.popq, then the output is totally useless, as you pointed out: you have no idea where the samples actually fit in the different populations. Problem is, I can't really tell how it decided what order to put the populations in the bar graph.
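The every-other-line trick described above (two rows per diploid sample in the .str file) needs no spreadsheet; a minimal Python sketch, with made-up population labels in the first column:

```python
# Sketch: build popfile lines from a diploid .str file by keeping the
# first column of every second row. Diploid Structure files list each
# sample on two consecutive rows, so this yields one label per sample.
def popfile_lines(str_lines):
    return [line.split()[0] for line in str_lines[::2]]

# Toy .str content: two samples, two rows each (label + genotypes).
str_rows = ["Pop2001 1 0 1",
            "Pop2001 1 1 0",
            "Pop2002 0 0 1",
            "Pop2002 0 1 1"]
print(popfile_lines(str_rows))   # ['Pop2001', 'Pop2002']
```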

Lastly, I was still unsuccessful and finally gave up on getting faststructure to run with the logistic prior. After K=11 for my data, it seems to hit some maximum in the numpy library, which I cannot change. Moving on...
      Here is the script I used with our slurm scheduler (deleted some for space here):
      #!/bin/bash
      ## Command(s) to run:
      echo Loading modules ..

      module load gsl/1.16-intel
      module load python/2.7.8
      module load numpy/1.9.0
      module load matplotlib/1.4.0
      module load scipy/0.14.0

      # running faststructure distruct
      APP="/global/home/users/ijordonthaden/bin/fastStructure/"
      INPUT="/path/path/fast_oligo_50_simple"
      OUTPUT="/path/path/fast_oligo_50_simple"
      KMIN=2
      KMAX=34
      OPTIONS="--popfile=/path/path/popfile_oligo_names.popq"

      cd $APP
      for (( i=$KMIN; i<=$KMAX; i++ ))
      do
      echo
      echo K=$i
      echo "Running ---> python distruct.py -K $i --input=$INPUT --output=${OUTPUT}_K$i.eps $OPTIONS"
      echo

      #title="species_50 K=$i"

      python distruct.py -K $i --input=$INPUT --output=${OUTPUT}_K$i.eps $OPTIONS title="Olig 50 Simple K=$i"

      tail -n 3 $OUTPUT.$i.log
      done

      echo Unloading modules ...
      module unload gsl/1.16-intel
      module unload numpy/1.9.0
      module unload scipy/0.14.0
      module unload python/2.7.8
      module unload matplotlib/1.4.0

      exit 0

    9. Thanks, that may be useful to some others who find this post.

  2. For those of you with issues installing fastSTRUCTURE, please try my wrapper:
    https://github.com/StuntsPT/Structure_threader
    It has some helper scripts to install both structure and faststructure, since it can wrap both programs so you can compare results.
    I hope it helps, and I especially hope the documentation (http://structure-threader.readthedocs.io/en/latest/) is up to par!

  3. Hello Alex,
I am also a new fastSTRUCTURE (= fs for simplicity) user and hope you still review this part of your forum.
I have also run into something strange in my fs results. I have run multiple fs runs (python faststructure.py) with different seed sets. But when I run chooseK.py to get an estimate of the number of subpopulations in my data set, I get different outputs. That is, everything is identical in my faststructure.py script except for the seed sets, and I wonder why I get different K estimates.
    for instance in one of my runs I get this
    Running on 5 cores.
    Starting fastSTRUCTURE at: Mon Aug 14 14:00:49 PDT 2017
    Model complexity that maximizes marginal likelihood = 1
    Model components used to explain structure in data = 10


    and in the other run I get
    Model complexity that maximizes marginal likelihood = 1
    Model components used to explain structure in data = 7

I can only speculate at the moment, but if they haven't changed anything since I used it, chooseK is really quite simplistic, merely selecting the highest likelihood score. But that is not how one usually does it; one would expect a saturation curve and choose the K where it flattens, even if an insignificantly higher value is at a higher K. (And then there is the possibility that there are two maxima, if you have a nested structure such as two groups of populations or suchlike.)

Anyway, I can easily imagine that a slightly different seed will produce slightly different solutions, and if they are very nearly identical but differ in the exact likelihood values on the plateau, then chooseK as I observed it would pick whatever happens to be highest on that plateau. But again, speculation.
