Friday, March 23, 2018

Science spammers constantly reaching for new lows

I received the following two messages on the same day, in fact they were sitting right next to each other in the inbox.

Surely this is sad. Is there no such thing as taking pride in one's work, even among spammers? Recently some people tried to defraud me, and of course that is annoying, but at least they put a lot of effort into it. I was impressed by how much information they had to accumulate to seem half-convincing. These guys, on the other hand, use such a simplistic bot to produce their mass-emails that they are immediately recognisable as such.

Really the only thing sadder than these messages is that my spam filter is apparently still unable to understand that the keyword "greetings!!!!" is a certain indicator of spamminess.

Wednesday, March 21, 2018

Bioregionalisation part 6: Modularity Analysis with the R package rnetcarto

Today's final post in the bioregionalisation series deals with how to do a network or Modularity Analysis in R. There are two main steps here. First, because we are going to assume, as in the previous post, that we have point distribution data in decimal coordinates, we will turn them into a bipartite network of species and grid cells.

We start by defining a cell size. Again, our data are decimal coordinates, so we will use one degree cells.
cellsize <- 1
Note that this may not be the ideal approach for publication. The width of one degree cells decreases towards the poles, and in spatial analyses equal area grid cells are often preferred because they are more comparable. If we want equal area cells we first need to project our data into meters and then use a cellsize in meters (e.g. 100,000 for 100 x 100 km). There are R functions for such spatial projection, but we will simply use one degree cells here.

We make a list of all species and a list of all cells that occur in our dataset, naming the cells after their centres in the format "126.5:-46.5". I assume here that we have the data frame called 'mydata' from the previous post, with the columns species, lat and long.
allspecies <- unique(mydata$species)

longrounded <- floor(mydata$long / cellsize) * cellsize + cellsize/2

latrounded <- floor(mydata$lat / cellsize) * cellsize + cellsize/2

cellcentre <- paste(longrounded,latrounded, sep=":")

allcells <- unique(cellcentre)
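As a quick sanity check of the rounding, here is a minimal sketch with two made-up coordinate pairs (the demo_ values are purely illustrative and not taken from the lycopod data). The point to notice is that floor() rounds towards minus infinity, so negative latitudes also end up in the correct cell.

```r
# Two hypothetical records, for illustration only
cellsize <- 1
demo_long <- c(126.2, 149.9)
demo_lat  <- c(-46.7, -35.3)

# Same arithmetic as above: lower cell edge plus half a cell width
demo_longrounded <- floor(demo_long / cellsize) * cellsize + cellsize/2
demo_latrounded  <- floor(demo_lat / cellsize) * cellsize + cellsize/2

# Cell names in the "longitude:latitude" format
demo_cellcentre <- paste(demo_longrounded, demo_latrounded, sep=":")
print(demo_cellcentre)
```

The first record should land in the cell "126.5:-46.5" and the second in "149.5:-35.5".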
We create a matrix of species and cells filled with all zeroes, which means that the species does not occur in the relevant cell. Then we loop through all records to set a species as present in a cell if the coordinates of at least one of its records indicate such presence.
mynetw <- matrix(0, length(allcells), length(allspecies))
for (i in 1:length(mydata[,1]))
  mynetw[ match(cellcentre[i],allcells) , match(mydata$species[i], allspecies) ] <- 1
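For large datasets the loop can become slow. A possible alternative, sketched here on made-up toy records (all toy_ names are hypothetical), is to hand R a two-column matrix of row and column indices so that all presences are set in one step. I have not benchmarked this against the loop, so take it as a suggestion rather than a recommendation.

```r
# Hypothetical toy records; species and cell names are made up
toy_species    <- c("sp_a", "sp_a", "sp_b", "sp_c", "sp_b")
toy_cellcentre <- c("126.5:-46.5", "127.5:-46.5", "126.5:-46.5",
                    "127.5:-46.5", "126.5:-46.5")

toy_allspecies <- unique(toy_species)
toy_allcells   <- unique(toy_cellcentre)

# Empty cell-by-species matrix, named straight away
toy_netw <- matrix(0, length(toy_allcells), length(toy_allspecies),
                   dimnames = list(toy_allcells, toy_allspecies))

# Each row of idx is one (cell row, species column) pair;
# duplicate pairs are harmless because the value is simply set to 1 again
idx <- cbind(match(toy_cellcentre, toy_allcells),
             match(toy_species, toy_allspecies))
toy_netw[idx] <- 1
print(toy_netw)
```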
It is also crucial to name the rows and columns of the network so that we can interpret the results of the Modularity Analysis.
rownames(mynetw) = allcells
colnames(mynetw) = allspecies
Now we come to the actual Modularity Analysis. We need to have the R library rnetcarto installed and load it.
library(rnetcarto)
The command to start the analysis is simply:
mymodules <- netcarto(mynetw, bipartite=TRUE)
This may take a bit of time, but after talking to colleagues who have got experience with other software it seems it is actually reasonably fast - for a Modularity Analysis.

Once the analysis is done, we may first wonder how many modules, which we will subsequently interpret as bioregions, the analysis has produced.
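Assuming that the first element of the returned list is a data frame with one row per cell and a column called module (which is how the plotting code in this post uses it), the number of modules could be counted along the following lines. The object mock_modules is a hand-made stand-in so that the snippet runs on its own; it is not real netcarto output.

```r
# Hand-made stand-in for the structure of the result (NOT real output)
mock_modules <- list(
  data.frame(name   = c("126.5:-46.5", "127.5:-46.5",
                        "128.5:-46.5", "129.5:-45.5"),
             module = c(1, 1, 2, 3))
)

# Number of distinct modules, and number of cells in each of them
n_modules <- length(unique(mock_modules[[1]]$module))
cells_per_module <- table(mock_modules[[1]]$module)
print(n_modules)
print(cells_per_module)
```

With the real result one would replace mock_modules with mymodules. The table of cells per module is also a quick way of spotting very small modules that may deserve a second look.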
For publication we obviously want a decent map, but that is beyond the scope of this post. What follows is merely a very quick and dirty way of plotting the results to see what they look like, but of course the resulting coordinates and module numbers can also be used for fancier plotting. We split the latitudes and longitudes back out of the cell names, define a vector of colours to use for mapping (here thirteen; if you have more modules you will of course need a longer vector), and then we simply plot the cells like some kind of scatter plot.
allcells2 <- strsplit( as.character( mymodules[[1]]$name ), ":" )
allcells_x <- as.numeric(unlist(allcells2)[c(1:(length(allcells)))*2-1])

allcells_y <- as.numeric(unlist(allcells2)[c(1:(length(allcells)))*2])

mycolors <- c("green", "red", "yellow", "blue", "orange", "cadetblue", "darkgoldenrod", "black", "darkolivegreen", "firebrick4", "darkorchid4", "darkslategray", "mistyrose")

plot(allcells_x, allcells_y, col = mycolors[ as.numeric(mymodules[[1]]$module) ], pch=15, cex=2)
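If the odd/even indexing used to separate longitudes from latitudes looks cryptic, the same split can be sketched by binding the pieces into a two-column matrix first. The cell names below are made up for illustration, and the demo_ variables are hypothetical.

```r
# Hypothetical cell names in the "longitude:latitude" format used above
demo_names <- c("126.5:-46.5", "149.5:-35.5", "111.5:-9.5")

# One row per cell: first column longitude, second column latitude
demo_xy <- do.call(rbind, strsplit(demo_names, ":", fixed = TRUE))
demo_x <- as.numeric(demo_xy[, 1])
demo_y <- as.numeric(demo_xy[, 2])
print(cbind(demo_x, demo_y))
```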
There we are. Modularity analysis with the R library rnetcarto is quite easy; the main problem was building the network.

As an example I have done an analysis with all Australian (and some New Guinean) lycopods, the dataset I mentioned in the previous post. It plots as follows.

There are, of course, a few issues here. The analysis produced six modules, but three of them, the green, orange and light blue ones, consist of only two, one and one cells, respectively, and they seem biologically unrealistic. They may be artifacts of not having cleaned the data as well as I would for an actual study, or represent some kind of edge effect. The remaining three modules are clearly more meaningful. Although they contain some outlier cells, we can start to interpret them as potentially representing tropical (red), temperate (yellow), and subalpine/alpine (dark blue) assemblies of species, respectively.

Despite the less than perfect results I hope the example shows how easy it is to do such a Modularity Analysis, and if due diligence is done to the spatial data, as we would do in an actual study, I would also expect the results to become cleaner.

Sunday, March 18, 2018

Botany picture #256: Solenostemon presumably

In spring we bought three types of Sempervivum (Crassulaceae) and planted them in a large bowl. Two little seedlings spontaneously came up in the succulent soil and, recognising them as members of my other favourite plant family Lamiaceae, I transferred them to a different pot where they would get more water.

I was curious to see what they would grow into - perhaps a useful aromatic herb? Well, they grew and grew and grew, but they did not flower until just now. Although it had become clear to me some time ago that they must be some kind of Solenostemon or relative and are presumably cultivated as ornamentals rather than as kitchen herbs I was hoping that they would at least have nice flowers. The reality, alas, is a bit of a let-down. Not terrible but not exactly stunning either. It is unlikely that they will survive winter anyway, as they are probably tropical plants.

In other news, Canberra was covered by dust blown in from western New South Wales today. The sky was of an otherworldly grey and only returned to its customary blue colour late in the afternoon.

Saturday, March 17, 2018

Bioregionalisation part 5: Cleaning point distribution data in R

I should finally complete my series on bioregionalisation. What is missing is a post on how to do a network (Modularity) analysis in R. But first I thought I would write a bit about how to efficiently do some cleaning of point distribution data in R. As often I write this because it may be useful to somebody who finds it via search engine, but also because I can then look it up myself if I need it after not having done it for months.

The assumption is that we start our spatial or biogeographic analyses by obtaining point distribution data by querying e.g. for the genus or family that we want to study on an online biodiversity database or aggregator such as GBIF or Atlas of Living Australia. We download the record list in CSV format and now presumably have a large file with many columns, most of them irrelevant to our interests.

One problem that we may find is that there are numerous cases of records occurring in implausible locations. They may represent geospatial data entry errors such as land plants supposedly occurring in the ocean, or vouchers collected from plants in botanic gardens where the databasers for some reason entered the garden's coordinates instead of those of the source location, or other outliers that we suspect to be misidentifications. What follows assumes that this kind of clean-up at least has been done already (and it is hard to automate anyway), but we can use R to help us with a few other problems.

We start up R and begin by reading in our data, in this case all lycopod records downloaded from ALA. (One of the advantages of that group is that very few of them are cultivated in botanic gardens, and I did not want to do that kind of data clean-up for a blog post.)
rawdata <- read.csv("Lycopodiales.csv", sep=",", na.strings = "", header=TRUE, row.names=NULL)
We now want to remove all records that lack any of the data we need for spatial and biogeographic analyses, i.e. identification to the species level, latitude and longitude. Other filtering may be desired, e.g. of records with too little geocode precision, but we will leave it at that for the moment. In my case the relevant columns are called genus, specificEpithet, decimalLatitude, and decimalLongitude, but that may of course be different in other data sources and require appropriate adjustment of the commands below.
rawdata <- rawdata[!(is.na(rawdata$decimalLatitude) | rawdata$decimalLatitude==""), ]
rawdata <- rawdata[!(is.na(rawdata$decimalLongitude) | rawdata$decimalLongitude==""), ]
rawdata <- rawdata[!(is.na(rawdata$genus) | rawdata$genus==""), ]
rawdata <- rawdata[!(is.na(rawdata$specificEpithet.1) | rawdata$specificEpithet.1==""), ]
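The same filter can also be written once for all required columns. What follows is only a sketch on a made-up toy data frame; the column names in 'needed' are assumptions and have to match whatever the downloaded file actually uses.

```r
# Made-up toy records; only the first one is complete
toy <- data.frame(genus            = c("Huperzia", NA, "Lycopodium", ""),
                  specificEpithet  = c("prima", "secunda", "", "tertia"),
                  decimalLatitude  = c(-35.1, -36.2, -30.0, -20.0),
                  decimalLongitude = c(149.0, 148.0, NA, 150.0),
                  stringsAsFactors = FALSE)

# Columns that must be filled in (assumed names, adjust as required)
needed <- c("genus", "specificEpithet", "decimalLatitude", "decimalLongitude")

# TRUE wherever a required value is NA or an empty string
bad <- sapply(toy[needed], function(x) is.na(x) | x == "")

# Keep only the rows without any such problem
toy_clean <- toy[rowSums(bad) == 0, ]
print(toy_clean)
```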
All the records missing those data should be gone now. Next we make a new data frame containing only the data we are actually interested in.
lat <- rawdata$decimalLatitude
long <- rawdata$decimalLongitude
species <- paste( as.character(rawdata$genus), as.character(rawdata$specificEpithet.1), sep=" " )
mydata <- data.frame(species, lat, long)
mydata$species <- as.character(mydata$species)
Unfortunately at this stage there are still records that we may not want for our analysis, but they can mostly be recognised by having more than the two usual name elements of genus name and specific epithet: hybrids (something like "Huperzia prima x secunda" or "Huperzia x tertia") and undescribed phrase name taxa that may or may not actually be distinct species ("Lycopodiella spec. Mount Farewell"). At the same time we may want to check the list of species in our data table with unique(mydata$species) to see if we recognise any other problems that actually have two name elements, such as "Lycopodium spec." or "Lycopodium Undesignated". If there are any of those, we place them into a vector:
kickout <- c("Lycopodium spec.", "Lycopodium Undesignated")
Then we loop through the data to get rid of all these problematic entries.
myflags <- rep(TRUE, length(mydata[,1]))
for (i in 1:length(myflags))
  if ( (length(strsplit(mydata$species[i], split=" ")[[1]]) != 2) || (mydata$species[i] %in% kickout) )
    myflags[i] <- FALSE
mydata <- mydata[myflags, ]
If there is no 'kickout' vector for undesirable records with two name elements, we do the same but adjust the if command accordingly to not expect its existence.
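For what it is worth, the loop can also be replaced with a vectorised version. The sketch below uses the example names from the text plus one made-up 'good' name ("Lycopodium quartum"); all toy_ objects are hypothetical.

```r
# Example names from the text plus one made-up valid binomial
toy_names <- c("Huperzia prima", "Huperzia prima x secunda",
               "Huperzia x tertia", "Lycopodiella spec. Mount Farewell",
               "Lycopodium spec.", "Lycopodium Undesignated",
               "Lycopodium quartum")
toy_kickout <- c("Lycopodium spec.", "Lycopodium Undesignated")

# Number of space-separated name elements for every record at once
n_elements <- lengths(strsplit(toy_names, split = " ", fixed = TRUE))

# Keep exactly two name elements that are not on the kickout list
toy_flags <- n_elements == 2 & !(toy_names %in% toy_kickout)
toy_clean <- toy_names[toy_flags]
print(toy_clean)
```

A side benefit of this form is that an empty kickout vector, i.e. character(0), needs no special treatment, because %in% then simply returns FALSE for every record.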

Check again unique(mydata$species) to see if the situation has improved. If there are instances of name variants or outdated taxonomy that need to be corrected, that is surprisingly easy with a command along the following lines:
mydata$species[mydata$species == "Outdatica fastigiata"] = "Valida fastigiata"
In that way we can efficiently harmonise the names so that one species does not get scored as two just because some specimens still have an outdated or misspelled name.
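If there are many names to correct, one line per name gets tedious. A possible alternative is a small lookup table, sketched here as a named vector with made-up entries (the misspelling "Valida fastigata" and all object names are invented for illustration).

```r
# Made-up records, two of which carry an outdated or misspelled name
toy_species <- c("Outdatica fastigiata", "Valida fastigiata",
                 "Valida fastigata", "Huperzia prima")

# Lookup table: the vector names are the unwanted variants,
# the values are the names we want to use instead
synonyms <- c("Outdatica fastigiata" = "Valida fastigiata",
              "Valida fastigata"     = "Valida fastigiata")

# Replace only those records whose name appears in the lookup table
needs_fix <- toy_species %in% names(synonyms)
toy_species[needs_fix] <- unname(synonyms[toy_species[needs_fix]])
print(unique(toy_species))
```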

Although we assume that we had checked for geographic outliers, we may now still want to limit our analysis to a specific area. In my case I want to get rid of non-Australian records, so I remove every record outside of a box of 9.5 to 44.5 degrees south and 111 to 154 degrees east around the continent. It turns out that this leaves parts of New Guinea in, but that is fine with me for present purposes; we don't want to over-complicate this now.
mydata <- mydata[mydata$long<154, ]
mydata <- mydata[mydata$long>111, ]
mydata <- mydata[mydata$lat>(-44.5), ]
mydata <- mydata[mydata$lat<(-9.5), ]
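The four conditions can also be combined into a single logical vector, which makes the bounding box easier to read and to change. This is a sketch on made-up toy records; it relies on records with missing coordinates having been removed beforehand, because NA values in the index would produce rows of NA.

```r
# Made-up toy records: two inside the box, two outside
toy <- data.frame(species = c("sp_a", "sp_b", "sp_c", "sp_d"),
                  lat  = c(-35.3,  -5.0, -42.0, -30.0),
                  long = c(149.1, 145.0, 147.0, 170.0),
                  stringsAsFactors = FALSE)

# Same box as above: 111 to 154 degrees east, 9.5 to 44.5 degrees south
inbox <- toy$long > 111 & toy$long < 154 &
         toy$lat > -44.5 & toy$lat < -9.5
toy <- toy[inbox, ]
print(toy)
```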
At this stage we may want to save the cleaned up data for future use, just in case.
write.table(mydata, file = "Lycopodiales_records_cleaned.csv", sep=",")
And now, finally, we can actually turn the point distribution data into grid cells and conduct a network analysis, but that will be the next (and final) post of the series.

Saturday, March 10, 2018

Reading The Varieties of Religious Experience: Lecture 2

In his second lecture, James defines what he would consider 'religion' to be for the purposes of the lecture series.

He stresses right at the beginning that religion is such a complex phenomenon that anybody who thinks they can come up with a clear and simple definition is fooling themselves. He then mentions two aspects, the organisational structure (the church with its office holders and buildings) and the personal beliefs and feelings of each believer, and he excludes the former from consideration to focus his efforts on the latter.

That is unsurprising, given his psychological approach, and fair enough. A historian would perhaps be most comfortable addressing religion as an organised body while excluding personal psychology from their considerations. What I find interesting to observe, however, is that one aspect of religion as I see it is not even mentioned. To me, schools of thought that make truth claims, be they ideologies, religions, or scientific, philosophical, scholarly, and engineering communities, have three main components:
  • The people who adhere to the school of thought; they are the focus of James' lectures,
  • The institutional framework (research institutions, churches, political parties, think tanks, journals, internet fora, conferences, etc.); this James mentioned but excluded from consideration, and
  • The actual body of knowledge or belief system; it appears to remain unexamined so far.
Because 90% of the lectures are still to follow I don't want to dwell on this too much, but I find it interesting even at this stage that James appears curiously incurious about the first question that would come to my mind when faced with a school of thought: are its beliefs true? I guess I will see if he will go there later or if he will remain completely disinterested in that question throughout.

After having settled on the personal relationship of an individual human to the divine as his focus, James clarifies that believing in an actual personal god is not a criterion for him. He mentions 'Emersonianism' and Buddhism as examples of systems that work to produce religious feelings without having personalised deities. I had never heard of Emersonianism, but it appears to be a variant of pantheism, seeing the whole universe as divine and (believe it or not) benign.

Finally, James spends an astonishingly large part of his second lecture on discussing what mindsets he considers truly religious and what mindsets he does not. Again and again he negatively contrasts the philosophical, Stoicist acceptance of the way the world is with the Christian ideal of a joyous embrace of whatever happens, no matter how terrible. Although he sometimes calls the ascetic or highly spiritual Christian 'extreme', the language he uses leaves no doubt that he considers mindless exultation in the face of, say, seeing a loved one dying terribly to be an admirable state of mind, as evidence that religion is a positive force for humanity.

Again I hesitate to immediately reject his argumentation given how little I have progressed into this book, but even here I cannot help wondering whether this view does not rely quite a bit on a conflation of many different injustices or tribulations to which, really, we would be justified to react in very different ways. We are not merely talking about "the universe is unfair, and a truly wise person will accept that they can only do their best and be happier for it". No, depending on what we are talking about and if we assume gods to exist we may reasonably take very different stances - and I would actually say that religious bliss is the appropriate stance in none of the various cases.

We cannot always get all we wanted. Some things are unachievable, and sometimes we have to compromise with other people. Accepting that is just a sign of maturity. (Embracing such compromises joyously would seem to be a bit twee, though.)

Then there are the evils we do to each other, such as theft, bullying, rape, murder, etc. Really one of the most frustrating facets of human existence is how much needless misery we cause each other, both deliberately and accidentally, given that we would have quite enough misery left to deal with even if we were all perfectly nice to each other (see next point). Point is, in this case the perpetrators generally have a moral responsibility to do better, and joyously accepting their bad deeds is both unreasonable and counterproductive, as it will set perverse incentives and reward bad actors.

What James must really be talking about, however, would have to be 'natural evils', harm to us that is no other human's fault, everything ranging from having to die of old age across natural disasters to people being born with a genetic disorder. Under the (atheist) assumption that there is no god behind these phenomena, that they just happen, James' preferred stance of a joyous embrace would be ridiculous. Stoicist acceptance of what cannot be undone while trying one's best to undo these evils is a more sensible approach.

But what if we assume that natural evils are caused or at least allowed to happen by an omnipotent god who could, with the snap of their metaphorical finger, deliver us from such needless suffering? Does it make sense, under this assumption, to write, "dear superior intelligence running the universe, please accept my heartfelt thanks for making me slowly die of an untreatable, incredibly painful disease; and while on that topic, thanks also for that landslide that crushed my best friend when we were twelve years old"?

I can't say that this would feel sane to me. I would have some very serious questions about the moral character and motivations of such gods, if I believed for a moment that they existed. But then again, James acknowledges himself that there are some people who are unable to have religious feelings as he defined them. I assume I am one of those people, for better or for worse.

And note also that there are presumably many people who would consider themselves religious but who do not feel what James considers to be the religious impulse at its most pure.

Thursday, March 8, 2018

Alpha diversity and beta diversity

At today's journal club meeting, we discussed Alexander Pyron's opinion piece We don't need to save endangered species - extinction is part of evolution. I mentioned it in passing before and still think that his core argument, which is also reflected in the title, is logically equivalent to saying that murder is okay because all humans are going to die of natural causes one day anyway. But reading his piece more thoroughly than before, I now notice a few other, um, problems. The highlights:
Species constantly go extinct, and every species that is alive today will one day follow suit. There is no such thing as an "endangered species," except for all species.
What weirds me out here is the lack of a phylogenetic perspective in a piece written by a systematist - species are discussed as individuals that pop out of thin air and then disappear again. Of course, in the very long run every species will one day go extinct when the sun expands and boils off the oceans. But until then, in the time frame that Pyron discussed, no, not every species will go extinct, quite a few of them will diversify and survive as numerous descendant species, as did the ancestor of all land vertebrates or the ancestor of all insects in the past. They thus become effectively immortal (until, once more, the sun explodes anyway, etc.).
Yet we are obsessed with reviving the status quo ante. The Paris Accords aim to hold the temperature to under two degrees Celsius above preindustrial levels, even though the temperature has been at least eight degrees Celsius warmer within the past 65 million years. Twenty-one thousand years ago, Boston was under an ice sheet a kilometer thick. We are near all-time lows for temperature and sea level; whatever effort we make to maintain the current climate will eventually be overrun by the inexorable forces of space and geology.
This is sadly a classic of climate change denialism. Yes, there was change in the past too, but there are some major differences. One is the rate of change - the impacts we are having are coming much faster than most natural changes (excepting e.g. meteorite strikes and similarly sudden events), so that animals and plants have less of a chance to migrate or to adapt than they had in past cycles of warm and ice ages. Second, they have even less of a chance to migrate because we have fragmented their available habitats by putting roads, towns, croplands and pastures in their way. Third, past changes did not affect a highly urbanised human population of more than seven billion people; the potential of global change producing catastrophic results even just for us is much greater now than when we were just a few million widely dispersed hunter-gatherers. So yes, it is true that we cannot freeze the status quo in place forever, but I think we would do well to slow the rate of change as far as possible.
Infectious diseases are most prevalent and virulent in the most diverse tropical areas. Nobody donates to campaigns to save HIV, Ebola, malaria, dengue and yellow fever, but these are key components of microbial biodiversity, as unique as pandas, elephants and orangutans, all of which are ostensibly endangered thanks to human interference.
I just don't even. What is the logic here? "Nobody cares about conserving diseases that horribly kill us humans, so we should not care about conserving harmless pandas either?" How does that follow?
And if biodiversity is the goal of extinction fearmongers, how do they regard South Florida, where about 140 new reptile species accidentally introduced by the wildlife trade are now breeding successfully? No extinctions of native species have been recorded, and, at least anecdotally, most natives are still thriving. The ones that are endangered, such as gopher tortoises and indigo snakes, are threatened mostly by habitat destruction. Even if all the native reptiles in the Everglades, about 50, went extinct, the region would still be gaining 90 new species -- a biodiversity bounty. If they can adapt and flourish there, then evolution is promoting their success. If they outcompete the natives, extinction is doing its job.
And this is perhaps what frustrates me most, because while this is not an uncommon argument against biosecurity measures one would expect a biologist to know about different types of biodiversity instead of confusing them. To explain more clearly what is going on, consider the following diagrams. First, we have three areas, roundland, squareland, and hexagonland, with two endemic species each.

Then humans recklessly move species between the areas, allowing them to invade each other's natural ranges. It turns out that three of the species are particularly competitive and prosper at the cost of the other three, driving them to extinction.

Now there are three types of diversity to consider. The first is alpha-diversity, which means simply the number of species in a given place. As we see it has gone up by 50% in all three areas, from two to three species. Yay, more diversity! This is what Pyron proudly points at in Florida.

What is lost, however, is beta-diversity or turnover, that is the heterogeneity you observe as you move between areas. It was very high originally, as every area had its unique species, but now it has been wiped out entirely. Beta-diversity in the second diagram is precisely zero. Under the first scenario a squarelander can go on a holiday trip to roundland and admire the unique flora of that part of the world; under the second scenario they will travel to roundland and merely see the same few weeds that they have growing in their own front yard back home. And the endemic plants of hexagonland have all gone extinct, a 100% loss of that area's irreplaceable evolutionary history.

(Note that beta-diversity would also be zero if all six species survived everywhere. But that is clearly not a realistic assumption, as it would require each area to have such a high carrying capacity that they should each have evolved more than two species to begin with. We would not expect that all the plant species of the world could survive next to each other in, say, Patagonia, even if they were all introduced there.)

Finally, in our example global diversity has of course also been reduced, by 50%. So yeah, great to have more alpha-diversity in Florida, but does that make up for a massive net loss in both beta-diversity and global diversity? The argument seems rather misguided.
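For those who like to see the numbers, the example can be sketched in a few lines of R. Everything here is an assumption made for illustration: the species are simply called A to F, I let A, B and C be the three survivors, and as a measure of beta-diversity I use Whittaker's global diversity divided by mean alpha-diversity, minus one, which is only one of many possible turnover measures.

```r
areas   <- c("roundland", "squareland", "hexagonland")
species <- LETTERS[1:6]   # hypothetical species A to F

# Before: two endemic species in each area
before <- matrix(0, 3, 6, dimnames = list(areas, species))
before["roundland",   c("A", "B")] <- 1
before["squareland",  c("C", "D")] <- 1
before["hexagonland", c("E", "F")] <- 1

# After: three survivors occur everywhere, the other three are extinct
after <- matrix(0, 3, 6, dimnames = list(areas, species))
after[, c("A", "B", "C")] <- 1

diversity <- function(m) {
  alpha <- mean(rowSums(m > 0))    # mean number of species per area
  gamma <- sum(colSums(m) > 0)     # number of species overall
  c(alpha = alpha, gamma = gamma, beta = gamma / alpha - 1)
}

div_before <- diversity(before)
div_after  <- diversity(after)
print(rbind(before = div_before, after = div_after))
```

Under these assumptions alpha-diversity rises from two to three, while global diversity drops from six to three and the turnover measure from two to zero.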

Sunday, March 4, 2018

Reading The Varieties of Religious Experience: Lecture 1

I have started reading William James' The Varieties of Religious Experience. Published first in 1902, this collection of twenty lectures is considered to be a classic of the study of religion. It approaches the subject with a psychological as opposed to theological, historical, or apologetic angle, but appears to remain rather charitable towards religious beliefs.

This becomes clear already in the first lecture, much of which is spent assuring the believing reader that they have no reason to be offended by a psychological examination of religious experience.

James calls 'medical materialism' the idea that religion originated as the hallucinations and ravings of 'psychopaths' and 'degenerates' and can therefore be dismissed. (His words; see e.g. the interpretation of Saint Paul's vision of Jesus as the result of an epileptic seizure.) He argues that the value of a phenomenon, here religious truth claims, cannot be deduced from its origins; as an argumentum ad absurdum he points out that a scientific insight would be judged on its own merits even if the scientist who gained it was suffering from some mental disorder. By their fruits ye shall know them, not by their roots.

Well, fair enough, one might say. But while I cannot tell what the state of the discussion was around the year 1900, it seems as if this argument would miss the point of 'medical materialism' as it is applied today. Taking the position of an atheist, it is not the case that they attempt to answer the question of what to think of religious truth claims by looking at how they originated. They would most likely argue that that particular question has already been answered by applying the same criteria as James would (or at least the empirical one, see further down). They already take it as given that religious claims are largely false, and true only by lucky accident:

There is no evidence that there is something to us that lives on after death, and indeed the study of brain damage suggests that all there is to our personality is an emergent property of the physical. There is no evidence that the universe was created by a higher intelligence, and indeed it looks very much as if it wasn't. There is no evidence that the universe was created for our benefit, and indeed it looks very much as if it wasn't. There is no evidence that prayer works; and so on. There is also the small matter that hundreds of religions made and continue to make contradictory claims, meaning that only such a small percentage of them could be true as to be too close to zero percent to matter.

So given that background, the atheist now asks not what to think of a religious claim, but instead: How and why would people come up with something as wrong as that? And here hallucinations are a decent explanation for divine visions. That is why I feel that James' central argument in the first lecture misses its mark. But then again, he seemed to be more interested in reassuring religious readers than in criticising atheist ones anyway.

In this context it is also fascinating to examine what 'fruit' criteria James accepts as valid for judging spiritual and theological claims, now that he has rejected the 'root' criterion. He names three: immediate luminousness, philosophical reasonableness, and moral helpfulness.

Immediate luminousness is also described as based on 'our immediate feeling' upon being exposed to the claim. This seems rather oddly subjective and emotional, and at least in my eyes falls flat as a useful criterion.

Philosophical reasonableness is to be understood as based on how the claim relates to 'the rest of what we hold as true'. This is the most sensible of the three criteria, because that is also how we do it in science. If, for example, somebody presents us with the theories underlying homeopathy, such as water memory, we may consider in comparison what we believe we already understand about physics and chemistry. We then find that either large bodies of scientific knowledge supported by numerous experiments and empirical observations must all be utterly, mind-bogglingly wrong, or that, alternatively, homeopathy must be nonsense. At this stage it should be easy to figure out which of the two options strains our credulity less.

Still, in the context of religious truth claims, this approach appears unsatisfactory. How, after all, are any religious truth claims justified? If they are justified based on fitting into our body of scientific knowledge they are simply more scientific truth claims. If not, as of course they are not, then each religion constitutes a network of beliefs that may (or may not) be internally consistent but that is completely unmoored from other such networks and from observable reality. The philosophical reasonableness criterion will have a Christian accept a vision of Jesus in heaven as true and reject a vision of the imminent death of the sun as false, and it will have a precolumbian Aztec reject the former as false and accept the latter as true, with exactly the same justification. How useful.

Finally, moral helpfulness suffers from exactly the same flaw as the previous does in a religious context. Unless the belief system is at some point anchored on empirical, observable reality, it is turtles all the way down.

Monday, February 5, 2018

Botany picture #255: Exocarpos nanus

Currently we are back in Kosciuszko National Park for field work, and for the first time I have consciously seen Exocarpos nanus (Santalaceae), although it is so tiny that I may have previously stepped on it without noticing. Like its larger congeners it is a hemiparasite.

Sunday, February 4, 2018

Bioregionalisation part 4: networks

Having examined a clustering approach to bioregionalisation, today I will try to illustrate the increasingly popular alternative of network analysis.

Consider again our hypothetical study area of five cells with five taxa, where we want to know how to delimit bioregions (or phytoregions, given that the taxa are plant species) in an objective way:

The first step in the analysis is to interpret these data as a network. Specifically, as we have two different types of elements, what we are dealing with is called a bipartite network. Each type of element is connected directly only to elements of the other type, and to elements of its own type only via the other. In this case, the plant species are connected to all cells they occur in, and cells are connected to all plant species occurring in them:

Once we have scored this kind of network structure in a way that the software of our choice understands (either a list of connections or a matrix with 0s and 1s), we can use an algorithm that divides the network into modules. This algorithm tries to maximise connections within a module and to minimise the connections between modules, which in bioregion terms again means to maximise endemism.
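To make the two formats a bit more concrete, here is a minimal sketch in R that turns a small matrix of 0s and 1s into a list of connections and back again. The cells, species and occurrences are made up and are not the hypothetical study area shown in the figures.

```r
# Made-up incidence matrix: rows are cells, columns are species
cells   <- paste0("cell", 1:3)
species <- paste0("sp", 1:3)
incidence <- matrix(c(1, 1, 0,
                      0, 1, 1,
                      0, 0, 1),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(cells, species))

# Matrix to list of connections: one row per species-in-cell occurrence
links <- which(incidence == 1, arr.ind = TRUE)
edgelist <- data.frame(cell    = rownames(incidence)[links[, "row"]],
                       species = colnames(incidence)[links[, "col"]],
                       stringsAsFactors = FALSE)
print(edgelist)

# List of connections back to matrix, indexing by row and column names
rebuilt <- matrix(0, length(cells), length(species),
                  dimnames = list(cells, species))
rebuilt[cbind(edgelist$cell, edgelist$species)] <- 1
```

Both objects carry exactly the same information; which of them is needed depends on the software.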

As indicated in the posts on clustering, network analysis has the great advantage that it does not only produce groups, it also provides a reproducible and objective answer to the question of the optimal number of groups, whereas in clustering analysis the user still has to make a subjective decision.

That being said, it is always possible to take a large module by itself and explore its internal structure, if so desired, although of course the answer may be that there are no meaningful subdivisions any more.

Either way, any such algorithm will return modules, and what we are mostly interested in is what cells belong to what module. Nonetheless we would also be able to infer what species belong to what module, and depending on the type of network analysis we may be able to get other statistics that may be of interest for the network and for each individual module or even each element.

There are two main approaches to network analysis that have been explored in bioregionalisation. The first is called the Map Equation, developed by Rosvall et al. (2009) and promoted with a sleek, eponymous website. It was first applied to bioregionalisation by Vilhena & Antonelli (2015). One of its advantages is that it is the faster of the two, which may be particularly attractive if one's dataset is large and complex.

The second is Modularity Analysis (Newman, 2006). This is the approach that I prefer personally, after colleagues at my institution conducted a study comparing the two and clustering against each other (Bloomfield et al., 2017). It is slower than the Map Equation, but it seems to be better at recognising the transitional nature of cells situated between two 'pure' modules, which the Map Equation appears to tend to group into distinct modules in their own right.
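To give an idea of what is being optimised, the following sketch computes the standard modularity score Q for one candidate division of the toy network. The example data (assumed to be the same as in the clustering post), the helper function modularity_score, and the chosen division are my own illustration; the software actually used for bioregionalisation may employ a variant adapted to bipartite networks, and it also searches for the best division instead of merely scoring a given one.

```r
# Toy presence/absence matrix, assumed identical to the clustering post
occurs <- rbind(
  A = c(1, 1, 0, 0, 0),
  B = c(1, 1, 1, 0, 0),
  C = c(0, 0, 1, 1, 0),
  D = c(0, 0, 1, 1, 0),
  E = c(0, 0, 0, 1, 1)
)
colnames(occurs) <- c("red", "brown", "blue", "orange", "lilac")
n_cells <- nrow(occurs)
n_species <- ncol(occurs)

# Adjacency matrix of the whole network: cells and species are both nodes,
# and links exist only between a cell and a species
adj <- rbind(
  cbind(matrix(0, n_cells, n_cells), occurs),
  cbind(t(occurs), matrix(0, n_species, n_species))
)
rownames(adj) <- colnames(adj) <- c(rownames(occurs), colnames(occurs))

# Hypothetical helper: Q = 1/(2m) * sum over node pairs in the same module
# of (observed link - link expected from the node degrees)
modularity_score <- function(adj, membership) {
  two_m <- sum(adj)                     # twice the number of links
  degree <- rowSums(adj)
  same_module <- outer(membership, membership, "==")
  sum((adj - outer(degree, degree) / two_m) * same_module) / two_m
}

# Candidate division: module 1 = A, B, red, brown; module 2 = everything else
membership <- c(A = 1, B = 1, C = 2, D = 2, E = 2,
                red = 1, brown = 1, blue = 2, orange = 2, lilac = 2)
q_two_modules <- modularity_score(adj, membership)
q_one_module <- modularity_score(adj, rep(1, nrow(adj)))
print(c(two = q_two_modules, one = q_one_module))
```

Lumping everything into a single module scores zero, whereas the two-module division scores about 0.39, so by this criterion the division is preferable.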

Next time, how to do modularity analysis in practice.


Bloomfield NJ, Knerr N, Encinas-Viso F, 2017. A comparison of network and clustering methods to detect biogeographical regions. Ecography 41: 1-10.

Newman MEJ, 2006. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, USA 103: 8577-8582.

Rosvall M, Axelsson D, Bergstrom CT, 2009. The map equation. arXiv: 0906.1405 [physics.soc-ph]

Vilhena DA, Antonelli A, 2015. A network approach for identifying and delimiting biogeographical regions. Nature Communications 6: 6848.

Saturday, January 27, 2018

Bioregionalisation part 3: clustering with Biodiverse

Biodiverse is a software package for spatial analysis of biodiversity, in particular for calculating diversity scores for regions and for bioregionalisation. As mentioned in previous posts, the latter is done with clustering. Biodiverse is freely available and extremely powerful. Just about the only minor issues are that the terminology used can sometimes be a bit confusing, and that it is not always easy to intuit where to find a given function. As so often, a post like this might also help me to remember some detail when getting back to a program after a few months or so...

The following is about how to do bioregionalisation analysis in Biodiverse. First, the way I usually enter my spatial data is as one line per sample. So if you have coordinates, the relevant comma separated value file could look something like this:
Planta vulgaris,-26.45,145.29
Planta vulgaris,-27.08,144.88
To use equal area grid cells you may have reprojected the data so that lat and long values are in meters, but the format is of course the same. Alternatively, you may have only one column for the spatial information if your cells are not going to be coordinate-based but, for example, political units or bioregions:
Planta vulgaris,Western Australia
Planta vulgaris,Northern Territory
Just for the sake of completeness, different formats such as a tsv would also work. Now to the program itself. You are running Biodiverse and choose 'Basedata -> Import' from the menus.

Navigate to your file and select it. Note where you can choose the format of the data file in the lower right corner. Then click 'next'.

The following dialogue can generally be ignored, click 'next' once more.

But the third dialogue box is crucial. Here you need to tell Biodiverse how to interpret the data. The species (or other taxa) need to be interpreted as 'label', which is Biodiversian for the things that are found in regions. The coordinates need to be interpreted as 'group', the Biodiversian term for information that defines regions. For the grouping information the software also needs to be told if it is dealing with degrees for example, and what the size of the cells is supposed to be. In this case we have degrees and want one degree squared cells, but we could just as well have meters and want 100,000 m x 100,000 m cells.
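Biodiverse does the assignment of samples to cells internally, but the idea can be sketched in a few lines of R. This is only an illustration using the two example records from above; the function name cell_centre and the naming of cells after their centres as "long:lat" are my assumptions and not necessarily how Biodiverse labels or aligns its cells.

```r
# The two example records from the csv file above
samples <- data.frame(
  species = c("Planta vulgaris", "Planta vulgaris"),
  lat = c(-26.45, -27.08),
  long = c(145.29, 144.88)
)

cellsize <- 1    # one degree; would be e.g. 100000 for data projected to meters

# Hypothetical helper: centre of the cell that a coordinate falls into
cell_centre <- function(x, size) floor(x / size) * size + size / 2

# Name each cell after its centre, long first because it is the x axis
samples$cell <- paste(cell_centre(samples$long, cellsize),
                      cell_centre(samples$lat, cellsize), sep = ":")
print(samples)
```

The two records end up in different one degree cells, "145.5:-26.5" and "144.5:-27.5".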

After this we find ourselves confronted with yet another dialogue box and learn that despite telling Biodiverse which column is lat and which one is long it still doesn't understand that the stuff we just identified as long is meant to be on the x axis of a map. Arrange the two on the right so that long is above lat, and you are ready to click OK.

The result should be something like this: under a tab called 'outputs' we now have our input, i.e. our imported spatial data.

Double-clicking on the name of this dataset will produce another tab in which we can examine it. Clicking on a species name will mark its distribution on the map below. Clicking onto a cell on the map will show how similar other cells are to it in their species content. This will, of course, be much less clear if your cells are just region names, because in that case they will not be plotted in a nice two-dimensional map.

Now it is time to start our clustering analysis. Select 'Analyses -> cluster' from the menu. A third tab will open where you can select analysis parameters. Here I have chosen S2 dissimilarity as the metric. If there are ties during clustering it makes sense to break them by maximising endemism (because that is the whole point of the analysis anyway), so I set it to use Corrected Weighted Endemism first and then Weighted Endemism next if the former still does not resolve the situation. One could use random tie-breaks, but that would mean an analysis is not reproducible. All other settings were left as defaults.

After the analysis is completed, you can have the results displayed immediately. Alternatively, you can always go back to the first tab, where you will now find the analysis listed, and double-click it to get the display.

As we can see there is a dendrogram on the right and a map on the left. There are two ways of exploring nested clusters: Either change the number of clusters in the box at the bottom, or drag the thick blue line into a different position on the dendrogram; I find the former preferable. Note that if you increase the number too much Biodiverse will at a certain point run out of colours to display the clusters.

The results map is good, but you may want to use the cluster assignments of the cells for downstream analyses in different software or simply to produce a better map somewhere else. How do you export the results? Not from the display interface. Instead, go back to the outputs tab, click the relevant analysis name, and then click 'export' on the right.

You now have an interface where you can name your output file, navigate to the desired folder, and select the number of clusters to be recognised under the 'number of groups' parameter on the left.

The reward should be a csv file like the following, where 'ELEMENT' is the name of each cell and 'NAME' is the column indicating what cluster each cell belongs to.
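For downstream work in R, such a file can be read back in as sketched below. The three rows are invented for illustration, only the two columns mentioned above are assumed (a real export may well contain more columns and different cluster names), and the "long:lat" format of the cell names is likewise an assumption.

```r
# Invented stand-in for the exported file; with a real file one would use
# read.csv("name_of_the_export.csv") instead
export_text <- "ELEMENT,NAME
145.5:-26.5,1
144.5:-27.5,1
126.5:-46.5,2"
clusters <- read.csv(text = export_text, stringsAsFactors = FALSE)

# Split the cell names back into coordinates (assumed order long:lat)
coords <- do.call(rbind, strsplit(clusters$ELEMENT, ":", fixed = TRUE))
clusters$long <- as.numeric(coords[, 1])
clusters$lat <- as.numeric(coords[, 2])

# Which cells belong to which cluster
cells_per_cluster <- split(clusters$ELEMENT, clusters$NAME)
print(cells_per_cluster)
```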

Again, this is very powerful; you only have to keep in mind that your bioregions, for example, are variously called clusters, groups, and NAME depending on what part of the program you are dealing with.

Wednesday, January 24, 2018

Bioregionalisation part 2: clustering

Already I think I should change the way I was going to do this. It seems more straightforward to keep the two approaches in separate posts. So for today: bioregionalisation using clustering methods.

A small example

As the term clustering suggests the approach is very simple. Let's start by considering a landscape of five cells A-E with five species occurring in them as follows:

Another way of expressing this information is as a matrix where the cells are rows and the species are columns, and presence of a species is indicated with "1" while absence is indicated with "0":

We now simply calculate a distance matrix. There are several possible dissimilarity metrics we can use for this. For this post I will use the S2 dissimilarity, which is defined as
S2 dissimilarity = 1 - ( number of shared species / ( number of shared species + minimum( species unique to first cell , species unique to second cell ) ) )
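As a worked example of the formula, the following sketch compares cells B and C of the small dataset, using the occurrences given in the R code further down this post. The function name s2 is only illustrative.

```r
# Species order: red, brown, blue, orange, lilac
cell_B <- c(1, 1, 1, 0, 0)
cell_C <- c(0, 0, 1, 1, 0)

# Hypothetical helper implementing the formula above for two cells
s2 <- function(a, b) {
  shared <- sum(a & b)          # species present in both cells
  unique_a <- sum(a & !b)       # species only in the first cell
  unique_b <- sum(b & !a)       # species only in the second cell
  1 - shared / (shared + min(unique_a, unique_b))
}

s2_BC <- s2(cell_B, cell_C)
print(s2_BC)
```

B and C share one species (blue), B has two species of its own and C has one, so the score is 1 - 1 / (1 + 1) = 0.5. Because only the smaller of the two counts of unique species enters the formula, a species-poor cell whose species all occur in a richer cell gets a dissimilarity of zero to it.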
The resulting S2 dissimilarity matrix for our small dataset is consequently as follows:

Now we use a hierarchical clustering algorithm to produce a dendrogram. I have used R's hclust, and the result is:

We can now recognise clusters as bioregions, and we are done. The main remaining problem with hierarchical clustering is that there is no objective answer for the number of bioregions we should recognise. We could still accept anywhere between one and five, but at least we know that there should not be a region of e.g. only the cells C and E to the exclusion of D.

(This is of course the same problem as in phylogenetic systematics, where we would now know that CE to the exclusion of D is not an acceptable taxon, but it remains a subjective decision whether to recognise CD and E as separate genera or whether to have one genus CDE, for example.)

In our present case, it seems sensible to accept fewer than five regions but more than one, otherwise we would not have needed the analysis, so let's go with the two clusters AB and CDE:

These regions now show a fairly high level of endemism, as four of the five species are endemic to one region; only the blue species occurs across both.
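The count of endemic species can be verified with a few lines of R. This is a sketch using the occurrence matrix from the code section below; the object names are only illustrative.

```r
occurs <- rbind(
  A = c(1, 1, 0, 0, 0),
  B = c(1, 1, 1, 0, 0),
  C = c(0, 0, 1, 1, 0),
  D = c(0, 0, 1, 1, 0),
  E = c(0, 0, 0, 1, 1)
)
colnames(occurs) <- c("red", "brown", "blue", "orange", "lilac")

# The accepted regions, one entry per cell
region <- c(A = "AB", B = "AB", C = "CDE", D = "CDE", E = "CDE")

# Sum the cell rows within each region, then reduce to presence/absence
present_in_region <- rowsum(occurs, region) > 0

# A species is endemic if it is present in exactly one region
endemic <- colSums(present_in_region) == 1
print(endemic)
n_endemic <- sum(endemic)
```

The result is TRUE for every species except blue, i.e. four endemics out of five.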

Some R code

Although the proper software for this kind of work is Biodiverse, this post would get too long if I tried to do everything in one go. What is more, a simple analysis can just as well be run in R, which is what I have done in this case. First build a matrix of cells and the species in them, e.g.
occurs <- as.matrix(rbind(c(1,1,0,0,0), c(1,1,1,0,0), c(0,0,1,1,0), c(0,0,1,1,0), c(0,0,0,1,1)))
rownames(occurs) <- c("A", "B", "C", "D", "E")
colnames(occurs) <- c("red","brown","blue","orange","lilac")
The following loops will then produce a matrix of S2 dissimilarity scores.
n <- nrow(occurs)    # number of cells; taken from the data so that the code also works for other datasets
mydm <- matrix(0, n, n, dimnames = list(rownames(occurs), rownames(occurs)))    # create empty matrix with the cell names as row and column names
for (i in 1:n) {
  for (j in i:n) {
    if (i == j) {
      mydm[i,j] <- 0    # a cell has no dissimilarity to itself
    } else {
      shareds <- sum(occurs[i,] & occurs[j,])    # species present in both cells
      uniques_i <- sum(xor(occurs[i,], occurs[i,] & occurs[j,]))    # species present only in cell i
      uniques_j <- sum(xor(occurs[j,], occurs[i,] & occurs[j,]))    # species present only in cell j
      mydm[i,j] <- 1 - (shareds / (shareds + min(uniques_i, uniques_j)))
      mydm[j,i] <- mydm[i,j]    # the matrix is symmetric
    }
  }
}
Now finally do a cluster analysis and plot the resulting dendrogram:
mycl <- hclust(as.dist(mydm), method = "mcquitty")     # WPGMA
plot(mycl)     # draw the dendrogram
Done. For large numbers of cells we would want a decent visualisation, ideally as a map, and that is where Biodiverse works better. How to do the analysis in that software will be covered in the next post.
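Before moving on, here is a sketch of how the two-cluster solution accepted above can be read off the dendrogram in R with cutree. To keep the snippet self-contained the S2 dissimilarities of the small dataset are typed in directly; in practice one would of course use the matrix computed by the loops.

```r
# S2 dissimilarities for the five cells of the small example
cells <- c("A", "B", "C", "D", "E")
mydm <- matrix(c(0,   0,   1,   1,   1,
                 0,   0,   0.5, 0.5, 1,
                 1,   0.5, 0,   0,   0.5,
                 1,   0.5, 0,   0,   0.5,
                 1,   1,   0.5, 0.5, 0),
               nrow = 5, byrow = TRUE, dimnames = list(cells, cells))

mycl <- hclust(as.dist(mydm), method = "mcquitty")    # WPGMA

# Cut the dendrogram so that exactly two clusters remain
regions <- cutree(mycl, k = 2)
print(regions)
```

Cells A and B receive one cluster number, and C, D and E the other. Changing k is the R equivalent of deciding how many bioregions to recognise.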

Saturday, January 20, 2018

Bioregionalisation part 1: what's the idea?

This is the start of a little series of posts on bioregionalisation. I intend to divide the topic up as follows:
  1. What I mean with bioregionalisation and what it is good for.
  2. Comparison of two different quantitative approaches to defining bioregions, clustering and network analysis.
  3. Practical how-to guide to inferring bioregions with clustering in the software Biodiverse.
  4. Practical how-to guide to inferring bioregions with network analysis in R.
  5. Beyond species presence and absence, i.e. using phylogenies for bioregionalisation.
Let's see if that works. So today:

What do I mean with bioregionalisation?

The idea is to divide a study region - perhaps a country, a continent or the whole world - into natural regions. There are obviously lots of different ways of doing so. A well-known one is climatic, where we would have arctic, temperate, subtropical, and tropical regions. Closer to what I am talking about are vegetation zones; in this case the general appearance of the natural vegetation and the life form of its constituent species are used to define zones such as tundra, boreal forest, mallee, or savanna.

But that still is not what this is going to be about. The bioregions I am going to discuss are defined by the taxa that occur in them. A very high-level classification is shown, for example, in the following map from

As we can see there are no 'tropics', but instead the American tropics are separated from the African and South Asian ones. Why might that be the case? As a botanist I can immediately think of two important plant families that are very characteristic of the Neotropics but are (with the exception of one rather odd, small genus) entirely missing from the Paleotropics: the cactus family Cactaceae and the pineapple family Bromeliaceae.

This, then, is what bioregions as I will subsequently discuss them are: they are regions defined by the presence of (plant, animal, ...) taxa they do not share with other regions. Another way of putting it is that bioregionalisation aims to maximise the endemism of its regions. And this immediately suggests the possibility of quantitative, objective analyses as long as we can somehow quantify endemism.

But these approaches are for other posts. More importantly now:

Why do we care? What are these bioregions good for?

I can think of at least two use cases. The first is quite simply that we like to classify things, and climate and vegetation form do not capture all there is to natural regions. Specifically, the presence e.g. of bromeliads, leaf cutter ants and hummingbirds in the New World and their absence in the Old World is an accident of history that is orthogonal to the shared climate and to the fact that 'tropical rainforest' kind of looks the same from a distance on all continents. But it still matters because these groups of organisms have evolved unique characteristics, like the hummingbirds' high metabolic rate, that have an ecological impact. A neotropical cloud forest 'works' a bit differently than a southeast Asian one.

The second use case is that of finding objectively defensible regions for biogeographic analysis, a problem that still does not have a single widely accepted solution. For example, we may be interested in conducting an inference of ancestral areas and biogeographic processes using the R package BioGeoBEARS, because we want to know if our study group started evolving in the temperate part of our continent and then spread into the tropics or vice versa. For this analysis we need (a) a time-calibrated phylogeny and (b) a data table of taxa-by-regions showing for each region what taxa naturally occur in it.

Taking one step back, it is obvious then that we first need to define regions. This may be easy if we can simply use the islands of an island group, but taking a big blob of land like Australia as an example, how do we cut that up? States? Clearly political units are kind of iffy for biogeography, because they are human inventions. Climate or vegetation zones are more natural, but are they meaningful for our specific study group? How meaningful would a region be for my purposes that happens to have one of my study taxa scored as present because it comes in from the side into 5% of that region's extent?

To me at least it seems as if the solution is bioregionalisation by taxon content: take small units like 100 x 100 km cells or similar and use an objective bioregionalisation approach to group them into meaningful larger regions. As mentioned above this maximises endemism, which is precisely what I would want for the inference of ancestral areas and biogeographic history.

Thursday, January 18, 2018

MCDA spam

Ye gods, I got an absolute gem of science spam yesterday.
We wish you a happy new year.
Well, at least no greetings of the day, so that is good.
It's so pleasant to communicate with eminent people like you through this email.
Is it brown-nose day already?  I had not noticed.
I believe that your efforts will create the good reputation for my Journal. Our MCDA journal is in shortfall of one article to accomplish the issue. So, we request you to submit any type of article towards our journal. I would be highly obliged with your swift submission process. Hope you will support us.
And this is where it all falls over. Why does this spammer think that I should care about the reputation of their journal? If they were a car dealer, would they say, I believe that your purchase will increase my profits, I need another sale to make my target? As opposed to, say, stressing the price to performance ratio of the car. If they were a university recruiter, would they say, I believe that your enrolment would create the good reputation for my university? As opposed to, say, claiming that the already good reputation of their university would transfer onto the prospective student?

Well, maybe that is what they would do. But that is not how it works. I am not in sales, but even I know that if you are selling something you have to convince the prospective buyer that buying is in their interest, not only in yours.
Await your article submission.
Emma Wright Š Modern Concepts & Developments in Agronomy (MCDA)
LLC, Third Avenue, 2nd floor, New York - 10016, USA
If this message is the English of an Emma Wright in New York I will not only eat my hat but a whole stack of them.

Also, just as an aside, the message did not even contain a link to the journal website!

It displays a level of incompetence so profound, so all-encompassingly fractal, that there is truly no hope for this spammer. How can anybody who is able to write an eMail without trying to eat the keyboard look at this spam message and think, yes, this is going to convince people that I am running a serious scientific journal?

Monday, January 8, 2018

Cell phone with macro lens

Happy new year, everybody! Time to get back to blogging.

Lately I have been playing around with a macro lens that I bought for my smartphone. The idea was to be able to take pictures of small structures, in particular fruits or seeds, even when I do not have my proper camera with me. So far the results are mixed.

Here we have the fruit of Hypochaeris radicata (Asteraceae), one of the larger propagules I have tried so far. Not too bad, all in all, but I do not care for the shadow, and obviously the depth of field is an issue.

The above is a mericarp of Malva neglecta (Malvaceae). The surface structure looks nice, but again light conditions and shadows are problematic. I will have to do something about light and the texture of the background.

It works reasonably well for flowers in sunlight, however. Here a tomato flower. As this will likely be the use case for most people I guess one cannot complain. It would be a fair deal given that the lens package cost me only $20 (including wide angle and fish-eye lenses, which I don't really use).