Saturday, January 27, 2018

Bioregionalisation part 3: clustering with Biodiverse

Biodiverse is software for spatial analysis of biodiversity, in particular for calculating diversity scores for regions and for bioregionalisation. As mentioned in previous posts, the latter is done with clustering. Biodiverse is freely available and extremely powerful; just about the only minor issues are that the terminology can sometimes be a bit confusing, and that it is not always easy to intuit where to find a given function. As so often, a post like this might also help me to remember some detail when getting back to a program after a few months or so...

The following is about how to do bioregionalisation analysis in Biodiverse. First, the way I usually enter my spatial data is as one line per sample. So if you have coordinates, the relevant comma separated value file could look something like this:
Planta vulgaris,-26.45,145.29
Planta vulgaris,-27.08,144.88
To use equal area grid cells you may have reprojected the data so that lat and long values are in meters, but the format is of course the same. Alternatively, you may have only one column for the spatial information if your cells are not going to be coordinate-based but, for example, political units or bioregions:
Planta vulgaris,Western Australia
Planta vulgaris,Northern Territory
Just for the sake of completeness, different formats such as a tsv would also work. Now to the program itself. Start Biodiverse and choose 'Basedata -> Import' from the menus.

Navigate to your file and select it. Note where you can choose the format of the data file in the lower right corner. Then click 'next'.

The following dialogue can generally be ignored, click 'next' once more.

But the third dialogue box is crucial. Here you need to tell Biodiverse how to interpret the data. The species (or other taxa) need to be interpreted as 'label', which is Biodiversian for the things that are found in regions. The coordinates need to be interpreted as 'group', the Biodiversian term for information that defines regions. For the grouping information the software also needs to be told if it is dealing with degrees for example, and what the size of the cells is supposed to be. In this case we have degrees and want one degree squared cells, but we could just as well have meters and want 100,000 m x 100,000 m cells.
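To make the idea of a 'group' more concrete, here is a minimal sketch in R of how point records could be binned into one degree cells. It only illustrates the principle and is not Biodiverse's actual code; the helper name `to_cell` and the convention of naming each cell by its centre coordinates, with x before y, are assumptions made for this example.

```r
# Hypothetical helper: snap a coordinate to the centre of its grid cell.
to_cell <- function(coord, size = 1) {
  floor(coord / size) * size + size / 2
}

# The two example records from the csv snippet above (degrees).
records <- data.frame(
  species = c("Planta vulgaris", "Planta vulgaris"),
  lat     = c(-26.45, -27.08),
  long    = c(145.29, 144.88)
)

# Name each cell as "x:y", i.e. long before lat (an assumed convention).
records$cell <- paste(to_cell(records$long), to_cell(records$lat), sep = ":")
print(records)
```

For data reprojected to meters, the same helper would be called with `size = 100000`.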

After this we find ourselves confronted with yet another dialogue box and learn that despite telling Biodiverse which column is lat and which one is long it still doesn't understand that the stuff we just identified as long is meant to be on the x axis of a map. Arrange the two on the right so that long is above lat, and you are ready to click OK.

The result should be something like this: under a tab called 'outputs' we now have our input, i.e. our imported spatial data.

Double-clicking on the name of this dataset will produce another tab in which we can examine it. Clicking on a species name will mark its distribution on the map below. Clicking onto a cell on the map will show how similar other cells are to it in their species content. This will, of course, be much less clear if your cells are just region names, because in that case they will not be plotted in a nice two-dimensional map.

Now it is time to start our clustering analysis. Select 'Analyses -> cluster' from the menu. A third tab will open where you can select analysis parameters. Here I have chosen S2 dissimilarity as the metric. If there are ties during clustering it makes sense to break them by maximising endemism (because that is the whole point of the analysis anyway), so I set it to use Corrected Weighted Endemism first and then Weighted Endemism next if the former still does not resolve the situation. One could use random tie-breaks, but that would mean an analysis is not reproducible. All other settings were left as defaults.
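For readers unfamiliar with the two tie-breakers: weighted endemism is commonly defined as the sum, over the species in a cell, of the inverse of each species' range size (the number of cells it occupies), and corrected weighted endemism divides that sum by the cell's species richness. The R sketch below computes both for the toy presence/absence matrix from part 2 of this series. It illustrates the definitions only; how Biodiverse evaluates candidate merges internally during tie-breaking is not shown here, and the variable names are made up for the example.

```r
# Toy data: rows are cells, columns are species (1 = present, 0 = absent).
occurs <- rbind(A = c(1, 1, 0, 0, 0),
                B = c(1, 1, 1, 0, 0),
                C = c(0, 0, 1, 1, 0),
                D = c(0, 0, 1, 1, 0),
                E = c(0, 0, 0, 1, 1))
colnames(occurs) <- c("red", "brown", "blue", "orange", "lilac")

range_size <- colSums(occurs)              # number of cells each species occupies
we  <- drop(occurs %*% (1 / range_size))   # weighted endemism per cell
cwe <- we / rowSums(occurs)                # corrected by species richness

print(round(cbind(WE = we, CWE = cwe), 3))
```

Cell E scores highest on corrected weighted endemism here because it holds the only species restricted to a single cell.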

After the analysis is completed, you can have the results displayed immediately. Alternatively, you can always go back to the first tab, where you will now find the analysis listed, and double-click it to get the display.

As we can see there is a dendrogram on the right and a map on the left. There are two ways of exploring nested clusters: Either change the number of clusters in the box at the bottom, or drag the thick blue line into a different position on the dendrogram; I find the former preferable. Note that if you increase the number too much Biodiverse will at a certain point run out of colours to display the clusters.

The results map is good, but you may want to use the cluster assignments of the cells for downstream analyses in different software or simply to produce a better map somewhere else. How do you export the results? Not from the display interface. Instead, go back to the outputs tab, click the relevant analysis name, and then click 'export' on the right.

You now have an interface where you can name your output file, navigate to the desired folder, and select the number of clusters to be recognised under the 'number of groups' parameter on the left.

The reward should be a csv file like the following, where 'ELEMENT' is the name of each cell and 'NAME' is the column indicating what cluster each cell belongs to.
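Such a file can then be read into R for downstream work. The following is a minimal sketch under stated assumptions: the file content is made up, a real export may well contain additional columns, and the "x:y" format of the cell names in 'ELEMENT' is assumed rather than taken from an actual Biodiverse export.

```r
# Made-up stand-in for the exported file; replace with read.csv("your_file.csv").
export_text <- "ELEMENT,NAME
145.5:-26.5,1
144.5:-27.5,1
150.5:-33.5,2"

clusters <- read.csv(text = export_text, stringsAsFactors = FALSE)

# Split the assumed "x:y" cell names back into numeric coordinates.
xy <- do.call(rbind, strsplit(clusters$ELEMENT, ":", fixed = TRUE))
clusters$x <- as.numeric(xy[, 1])
clusters$y <- as.numeric(xy[, 2])

# Number of cells assigned to each cluster.
print(table(clusters$NAME))
```

With coordinates and cluster names in separate columns, the table can be handed to whatever mapping tool you prefer.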

Again, very powerful; one only has to keep in mind that your bioregions, for example, are variously called clusters, groups, and NAME depending on which part of the program you are dealing with.

Wednesday, January 24, 2018

Bioregionalisation part 2: clustering

Already I think I should change the way I was going to do this. It seems more straightforward to keep the two approaches in separate posts. So for today: bioregionalisation using clustering methods.

A small example

As the term clustering suggests the approach is very simple. Let's start by considering a landscape of five cells A-E with five species occurring in them as follows:

Another way of expressing this information is as a matrix where the cells are rows and the species are columns, and presence of a species is indicated with "1" while absence is indicated with "0":
     red  brown  blue  orange  lilac
A      1      1     0       0      0
B      1      1     1       0      0
C      0      0     1       1      0
D      0      0     1       1      0
E      0      0     0       1      1

We now simply calculate a distance matrix. There are several possible dissimilarity metrics we can use for this. For this post I will use the S2 dissimilarity, which is defined as
S2 dissimilarity = 1 - ( number of shared species / ( number of shared species + minimum( species unique to first cell , species unique to second cell ) ) )
The resulting S2 dissimilarity matrix for our small dataset is consequently as follows:
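As a worked example of the formula, here is a small R sketch that scores three pairs of cells. The helper name `s2` is made up for this illustration, and the species content of the cells is taken from the R code further down in this post.

```r
# Hypothetical helper: S2 dissimilarity between two presence/absence vectors.
s2 <- function(a, b) {
  shared   <- sum(a & b)     # species present in both cells
  unique_a <- sum(a & !b)    # species only in the first cell
  unique_b <- sum(b & !a)    # species only in the second cell
  1 - shared / (shared + min(unique_a, unique_b))
}

# Species order: red, brown, blue, orange, lilac.
cell_A <- c(1, 1, 0, 0, 0)
cell_B <- c(1, 1, 1, 0, 0)
cell_C <- c(0, 0, 1, 1, 0)

s2(cell_A, cell_B)  # 0: every species of A also occurs in B
s2(cell_B, cell_C)  # 0.5: one shared species, min(2, 1) = 1 unique species
s2(cell_A, cell_C)  # 1: nothing shared
```

Note that A and B get a dissimilarity of zero although B has an extra species; because of the minimum in the denominator, a cell whose species are a subset of another cell's is treated as identical to it.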

Now we use a hierarchical clustering algorithm to produce a dendrogram. I have used R's hclust, and the result is:

We can now recognise clusters as bioregions, and we are done. The main remaining problem with hierarchical clustering is that there is no objective answer for the number of bioregions we should recognise. We could still accept anywhere between one and five, but at least we know that there should not be a region of e.g. only the cells C and E to the exclusion of D.

(This is of course the same problem as in phylogenetic systematics, where we would now know that CE to the exclusion of D is not an acceptable taxon, but it remains a subjective decision whether to recognise CD and E as separate genera or whether to have one genus CDE, for example.)

In our present case, it seems sensible to accept fewer than five regions but more than one, otherwise we would not have needed the analysis, so let's go with the two clusters AB and CDE:

These regions now show a fairly high level of endemism, as four of the five species are endemic to one region; only the blue species occurs across both.

Some R code

Although the proper software for this kind of work is Biodiverse, this post would get too long if I tried to do everything in one go. What is more, a simple analysis can just as well be run in R, which is what I have done in this case. First build a matrix of cells and the species in them, e.g.
occurs <- as.matrix(rbind(c(1,1,0,0,0), c(1,1,1,0,0), c(0,0,1,1,0), c(0,0,1,1,0), c(0,0,0,1,1)))
rownames(occurs) <- c("A", "B", "C", "D", "E")
colnames(occurs) <- c("red","brown","blue","orange","lilac")
The following loops will then produce a matrix of S2 dissimilarity scores.
mydm <- matrix(0, 5, 5)    # create empty matrix; could make it more flexible for future analyses by using nrow(occurs) for the dimensions
rownames(mydm) <- c("A","B","C","D","E")    # same here, could use rownames(occurs)
colnames(mydm) <- c("A","B","C","D","E")     # and same here
for (i in 1:5) {
  for (j in i:5) {
    if (i==j) {
      mydm[i,j] <- 0    # a cell has zero dissimilarity to itself
    } else {
      shareds <- sum(occurs[i,] & occurs[j,])    # species present in both cells
      uniques_i <- sum(xor(occurs[i,], occurs[i,] & occurs[j,]))    # species only in cell i
      uniques_j <- sum(xor(occurs[j,], occurs[i,] & occurs[j,]))    # species only in cell j
      mydm[i,j] <- 1- (shareds / (shareds + min( uniques_i, uniques_j)))
      mydm[j,i] <- mydm[i,j]    # keep the matrix symmetric
    }
  }
}
Now finally do a cluster analysis and plot the resulting dendrogram:
mycl <- hclust(as.dist(mydm), method = "mcquitty")     # WPGMA
plot(mycl)     # draw the dendrogram
Done. For large numbers of cells we would want a decent visualisation, ideally as a map, and that is where Biodiverse works better. How to do the analysis in that software will be covered in the next post.
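To go from the dendrogram to the two regions AB and CDE discussed above, and to check the claim that four of the five species are endemic to one region, something like the following sketch could be used. It is self-contained and swaps the loops for a vectorised calculation of the same S2 scores; the variable names are made up for the example, and `cutree` and `rowsum` are standard R functions.

```r
occurs <- rbind(A = c(1, 1, 0, 0, 0),
                B = c(1, 1, 1, 0, 0),
                C = c(0, 0, 1, 1, 0),
                D = c(0, 0, 1, 1, 0),
                E = c(0, 0, 0, 1, 1))
colnames(occurs) <- c("red", "brown", "blue", "orange", "lilac")

# Vectorised S2: shared[i, j] is the number of species common to cells i and j.
shared <- occurs %*% t(occurs)
uniq_i <- diag(shared) - shared            # species in cell i that are not in cell j
s2 <- 1 - shared / (shared + pmin(uniq_i, t(uniq_i)))

cl <- hclust(as.dist(s2), method = "mcquitty")   # WPGMA, as above
regions <- cutree(cl, k = 2)                     # cut the tree into two clusters
print(regions)

# A species is endemic if it occurs in exactly one of the regions.
in_region <- rowsum(occurs, regions) > 0
endemic <- colSums(in_region) == 1
print(endemic)
```

Changing `k` corresponds to choosing a different number of bioregions from the same dendrogram.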

Saturday, January 20, 2018

Bioregionalisation part 1: what's the idea?

This is the start of a little series of posts on bioregionalisation. I intend to divide the topic up as follows:
  1. What I mean by bioregionalisation and what it is good for.
  2. Comparison of two different quantitative approaches to defining bioregions, clustering and network analysis.
  3. Practical how-to guide to inferring bioregions with clustering in the software Biodiverse.
  4. Practical how-to guide to inferring bioregions with network analysis in R.
  5. Beyond species presence and absence, i.e. using phylogenies for bioregionalisation.
Let's see if that works. So today:

What do I mean by bioregionalisation?

The idea is to divide a study region - perhaps a country, a continent or the whole world - into natural regions. There are obviously lots of different ways of doing so. A well-known one is climatic, where we would have arctic, temperate, subtropical, and tropical regions. Closer to what I am talking about are vegetation zones; in this case the general appearance of the natural vegetation and the life form of its constituent species are used to define zones such as tundra, boreal forest, mallee, or savanna.

But that still is not what this is going to be about. The bioregions I am going to discuss are defined by the taxa that occur in them. A very high-level classification is shown, for example, in the following map from

As we can see there are no 'tropics', but instead the American tropics are separated from the African and South Asian ones. Why might that be the case? As a botanist I can immediately think of two important plant families that are very characteristic of the Neotropics but are (with the exception of one rather odd, small genus) entirely missing from the Paleotropics: the cactus family Cactaceae and the pineapple family Bromeliaceae.

This, then, is what bioregions as I will subsequently discuss them are: they are regions defined by the presence of (plant, animal, ...) taxa they do not share with other regions. Another way of putting it is that bioregionalisation aims to maximise the endemism of its regions. And this immediately suggests the possibility of quantitative, objective analyses as long as we can somehow quantify endemism.

But these approaches are for other posts. More importantly now:

Why do we care? What are these bioregions good for?

I can think of at least two use cases. The first is quite simply that we like to classify things, and climate and vegetation form do not capture all there is to natural regions. Specifically, the presence e.g. of bromeliads, leaf cutter ants and hummingbirds in the New World and their absence in the Old World is an accident of history that is orthogonal to the shared climate and to the fact that 'tropical rainforest' kind of looks the same from a distance in all continents. But it still matters because these groups of organisms have evolved unique characteristics, like the hummingbirds' high metabolic rate, that have an ecological impact. A neotropical cloud forest 'works' a bit differently than a southeast Asian one.

The second use case is that of finding objectively defensible regions for biogeographic analysis, a problem that still does not have a single widely accepted solution. For example, we may be interested in conducting an inference of ancestral areas and biogeographic processes using the R package BioGeoBEARS, because we want to know if our study group started evolving in the temperate part of our continent and then spread into the tropics or vice versa. For this analysis we need (a) a time-calibrated phylogeny and (b) a data table of taxa-by-regions showing for each region what taxa are naturally occurring in it.

Taking one step back, it is obvious then that we first need to define regions. This may be easy if we can simply use the islands of an island group, but taking a big blob of land like Australia as an example, how do we cut that up? States? Clearly political units are kind of iffy for biogeography, because they are human inventions. Climate or vegetation zones are more natural, but are they meaningful for our specific study group? How meaningful would a region be for my purposes that happens to have one of my study taxa scored as present because it comes in from the side into 5% of that region's extent?

To me at least it seems as if the solution is bioregionalisation by taxon content: take small units like 100 x 100 km cells or similar and use an objective bioregionalisation approach to group them into meaningful larger regions. As mentioned above this maximises endemism, which is precisely what I would want for the inference of ancestral areas and biogeographic history.

Thursday, January 18, 2018

MCDA spam

Ye gods, I got an absolute gem of science spam yesterday.
We wish you a happy new year.
Well, at least no greetings of the day, so that is good.
It's so pleasant to communicate with eminent people like you through this email.
Is it brown-nose day already? I had not noticed.
I believe that your efforts will create the good reputation for my Journal. Our MCDA journal is in shortfall of one article to accomplish the issue. So, we request you to submit any type of article towards our journal. I would be highly obliged with your swift submission process. Hope you will support us.
And this is where it all falls over. Why does this spammer think that I should care about the reputation of their journal? If they were a car dealer, would they say, I believe that your purchase will increase my profits, I need another sale to make my target? As opposed to, say, stressing the price to performance ratio of the car. If they were a university recruiter, would they say, I believe that your enrolment would create the good reputation for my university? As opposed to, say, claiming that the already good reputation of their university would transfer onto the prospective student?

Well, maybe that is what they would do. But that is not how it works. I am not in sales, but even I know that if you are selling something you have to convince the prospective buyer that buying is in their interest, not only in yours.
Await your article submission.
Emma Wright | Modern Concepts & Developments in Agronomy (MCDA)
LLC, Third Avenue, 2nd floor, New York - 10016, USA
If this message is the English of an Emma Wright in New York I will not only eat my hat but a whole stack of them.

Also, just as an aside, the message did not even contain a link to the journal website!

It displays a level of incompetence so profound, so all-encompassingly fractal, that there is truly no hope for this spammer. How can anybody who is able to write an eMail without trying to eat the keyboard look at this spam message and think, yes, this is going to convince people that I am running a serious scientific journal?

Monday, January 8, 2018

Cell phone with macro lens

Happy new year, everybody! Time to get back to blogging.

Lately I have been playing around with a macro lens that I bought for my smartphone. The idea was to be able to take pictures of small structures, in particular fruits or seeds, even when I do not have my proper camera with me. So far the results are mixed.

Here we have the fruit of Hypochaeris radicata (Asteraceae), one of the larger propagules I have tried so far. Not too bad, all in all, but I do not care for the shadow, and obviously the depth of field is an issue.

The above is a mericarp of Malva neglecta (Malvaceae). The surface structure looks nice, but again light conditions and shadows are problematic. I will have to do something about light and the texture of the background.

It works reasonably well for flowers in sunlight, however. Here a tomato flower. As this will likely be the use case for most people I guess one cannot complain. It would be a fair deal given that the lens package cost me only $20 (including wide angle and fish-eye lenses, which I don't really use).