Saturday, January 27, 2018

Bioregionalisation part 3: clustering with Biodiverse

Biodiverse is a software package for the spatial analysis of biodiversity, in particular for calculating diversity scores for regions and for bioregionalisation. As mentioned in previous posts, the latter is done with clustering. Biodiverse is freely available and extremely powerful. Just about the only minor issues are that the terminology can sometimes be a bit confusing, and that it is not always easy to intuit where to find a given function. As so often, a post like this might also help me remember some details when getting back to the program after a few months or so...

The following is about how to do bioregionalisation analysis in Biodiverse. First, the way I usually enter my spatial data is as one line per sample. So if you have coordinates, the relevant comma separated value file could look something like this:
Planta vulgaris,-26.45,145.29
Planta vulgaris,-27.08,144.88
To use equal area grid cells you may have reprojected the data so that the coordinates are in meters rather than degrees, but the format is of course the same. Alternatively, you may have only one column for the spatial information if your cells are not going to be coordinate-based but, for example, political units or bioregions:
Planta vulgaris,Western Australia
Planta vulgaris,Northern Territory
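If your records live in some other program first, a file in this layout is easy to generate. The following is only a minimal sketch using Python's standard csv module; the `records` list and the function name `to_csv_text` are made up for illustration, and the output simply mirrors the example lines above.

```python
import csv
import io

# Hypothetical sample records: (taxon, latitude, longitude), one per sample.
records = [
    ("Planta vulgaris", -26.45, 145.29),
    ("Planta vulgaris", -27.08, 144.88),
]


def to_csv_text(rows):
    """Return the rows as comma separated text, one sample per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(records)
print(csv_text, end="")
```

Passing `delimiter="\t"` to `csv.writer` would give the tab separated variant instead.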
Just for the sake of completeness, other formats such as tab separated values (tsv) would also work. Now to the program itself. With Biodiverse running, choose 'Basedata -> Import' from the menus.

Navigate to your file and select it. Note where you can choose the format of the data file in the lower right corner. Then click 'next'.

The following dialogue can generally be ignored; click 'next' once more.

But the third dialogue box is crucial. Here you need to tell Biodiverse how to interpret the data. The species (or other taxa) need to be interpreted as 'label', which is Biodiversian for the things that are found in regions. The coordinates need to be interpreted as 'group', the Biodiversian term for information that defines regions. For the grouping information the software also needs to be told what units it is dealing with, for example degrees, and what the size of the cells is supposed to be. In this case we have degrees and want cells of one degree by one degree, but we could just as well have meters and want 100,000 m x 100,000 m cells.
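To make the idea of a cell size concrete, here is a rough sketch of how point coordinates can be binned into square grid cells. It is not taken from Biodiverse itself: I am assuming a grid aligned to an origin at zero and cells identified by their centre coordinates, and the function names are my own.

```python
import math


def cell_centre(value, cell_size, origin=0.0):
    """Centre of the grid cell that contains `value` along one axis."""
    index = math.floor((value - origin) / cell_size)
    return origin + (index + 0.5) * cell_size


def assign_cell(lon, lat, cell_size=1.0):
    """Return the (x, y) centre of the square cell containing a point."""
    return (cell_centre(lon, cell_size), cell_centre(lat, cell_size))


# The two sample records from above, as (long, lat) in degrees.
samples = [(145.29, -26.45), (144.88, -27.08)]
cells = [assign_cell(lon, lat) for lon, lat in samples]
print(cells)  # [(145.5, -26.5), (144.5, -27.5)]

# The same function works for projected data, e.g. 100,000 m cells.
print(cell_centre(123456.0, 100000.0))  # 150000.0
```

Note that the two samples end up in different cells even though they are less than a degree apart, which is simply a consequence of where the cell boundaries fall.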

After this we find ourselves confronted with yet another dialogue box and learn that despite telling Biodiverse which column is lat and which one is long it still doesn't understand that the stuff we just identified as long is meant to be on the x axis of a map. Arrange the two on the right so that long is above lat, and you are ready to click OK.

The result should be something like this: under a tab called 'outputs' we now have our input, i.e. our imported spatial data.

Double-clicking on the name of this dataset will produce another tab in which we can examine it. Clicking on a species name will mark its distribution on the map below. Clicking onto a cell on the map will show how similar other cells are to it in their species content. This will, of course, be much less clear if your cells are just region names, because in that case they will not be plotted in a nice two-dimensional map.

Now it is time to start our clustering analysis. Select 'Analyses -> cluster' from the menu. A third tab will open where you can select analysis parameters. Here I have chosen S2 dissimilarity as the metric. If there are ties during clustering it makes sense to break them by maximising endemism (because that is the whole point of the analysis anyway), so I set it to use Corrected Weighted Endemism first and then Weighted Endemism next if the former still does not resolve the situation. One could use random tie-breaks, but that would mean an analysis is not reproducible. All other settings were left as defaults.
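For those wondering what the metric does: as I understand it, S2 is a Simpson-type dissimilarity, 1 - A / (A + min(B, C)), where A is the number of species shared by two cells and B and C are the numbers found in only one or the other. The sketch below works under that assumption and is not Biodiverse's own code; the species names other than Planta vulgaris are invented.

```python
def s2_dissimilarity(labels_a, labels_b):
    """Simpson-type dissimilarity: 1 - A / (A + min(B, C)).

    A = labels shared by both cells, B and C = labels unique to each cell.
    Assumed definition of S2, not copied from Biodiverse.
    """
    a, b = set(labels_a), set(labels_b)
    shared = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    denominator = shared + min(only_a, only_b)
    if denominator == 0:
        raise ValueError("at least one cell has no labels")
    return 1 - shared / denominator


# Hypothetical species lists for three cells.
cell_x = {"Planta vulgaris", "Planta rara", "Planta alba"}
cell_y = {"Planta vulgaris", "Planta rara", "Planta nigra", "Planta minor"}
cell_z = {"Planta vulgaris"}

print(round(s2_dissimilarity(cell_x, cell_y), 3))  # 0.333
print(s2_dissimilarity(cell_x, cell_z))            # 0.0
```

The second result shows why this kind of index is attractive for bioregionalisation: a species-poor cell whose species are all found in a richer cell counts as identical to it, so differences in richness alone do not drive the clustering.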

After the analysis is completed, you can have the results displayed immediately. Alternatively, you can always go back to the first tab, where you will now find the analysis listed, and double-click it to get the display.

As we can see, there is a dendrogram on the right and a map on the left. There are two ways of exploring nested clusters: either change the number of clusters in the box at the bottom, or drag the thick blue line into a different position on the dendrogram; I find the former preferable. Note that if you increase the number too much, Biodiverse will at a certain point run out of colours to display the clusters.

The results map is good, but you may want to use the cluster assignments of the cells for downstream analyses in different software, or simply to produce a better map somewhere else. How do you export the results? Not from the display interface. Instead, go back to the outputs tab, click the relevant analysis name, and then click 'export' on the right.

You now have an interface where you can name your output file, navigate to the desired folder, and select the number of clusters to be recognised under the 'number of groups' parameter on the left.

The reward should be a csv file like the following, where 'ELEMENT' is the name of each cell and 'NAME' is the column indicating what cluster each cell belongs to.
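For downstream work, such a file can be read back in with a few lines of Python. This is only a sketch: the column names 'ELEMENT' and 'NAME' are as described above, but the cell names and cluster names in the sample text are made up, and a real export may contain additional columns.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of an exported file; real cell and cluster names will differ.
export_text = """ELEMENT,NAME
145.5:-26.5,1
144.5:-27.5,1
120.5:-30.5,2
"""

# Collect the cells belonging to each cluster.
cells_by_cluster = defaultdict(list)
for row in csv.DictReader(io.StringIO(export_text)):
    cells_by_cluster[row["NAME"]].append(row["ELEMENT"])

for cluster, cells in sorted(cells_by_cluster.items()):
    print(cluster, len(cells), cells)
```

To read an actual file, replace `io.StringIO(export_text)` with an opened file handle.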

Again, very powerful; you only have to keep in mind that your bioregions, for example, are variously called clusters, groups, and NAME depending on which part of the program you are dealing with.
