Friday, June 29, 2018

TreeBASE and Dryad

It is now generally expected that scientists, unless working on commercial or otherwise confidential projects, make the data underlying their scientific publications freely and publicly available, so that the studies can be replicated if necessary and so that others can use the data for further research.

Sometimes the data are submitted as supplementary material to be published on the journal website, together with the article itself. Some research organisations have their own data repositories. In many cases, however specialised databases are used. GenBank, for example, is a repository of DNA sequence data. Further down the analysis pipeline, I have in the past used TreeBASE to make available sequence alignment matrices and phylogenetic trees, and in one case I have reanalysed other people's data after obtaining them from there.

Recently I had reason to submit another such set of data matrices and phylogenetic trees to a database, and I thought I would go back to TreeBASE. Somehow it did not work out as well as it did a few years ago.

I was able to log in, I created a new submission, I submitted my files, and I described our analysis. The latter process is rather clunky, but okay, it works. Then it turned out that we needed to redo one of the phylogenetic analyses minus one sequence, so I had to delete one of the matrices and one of the trees and replace them with updated versions. That is when the fun started.

Although googling around a bit suggests that other people can do so, I find it impossible to delete anything in TreeBASE. There is no delete button next to anything except co-authors and submissions (i.e. the entire studies). Being unable to change data in a submission, I decided to delete the entire submission and start from scratch. That is surely not how it is meant to work, and it is a lot of extra effort, but what can I do?

As it turns out, not even that. When I ask that a submission be deleted, the web interface thinks for a bit an then throws a Java error at me. I now have three submissions under identical names and cannot delete the first two. Hurray.

At some point I thought I could maybe try out the alternative data repository Dryad.  Perhaps that would work more reliably? At least I have seen it used in several publications lately. I have now twice submitted my eMail address on their 'sign up for a new account' form, been told twice that a confirmation eMail has been sent, and days later neither I nor my spam folder have received any such message.

Perhaps the journal will accept our manuscript without us having the matrix and trees in a public repository? This process is becoming somewhat off-putting.

Update: After a mere four days I have now finally been sent a confirmation link by Dryad. Will see how that repository works.

Saturday, June 9, 2018

A particularly striking example of how paraphyletic taxa confuse our thinking about evolution

I recently reread Jason Rosenhouse's Among the Creationists and came across the following extended quote from Stephen Jay Gould, a widely admired and famous evolutionary biologist.
If mammals had arisen late and helped to drive dinosaurs to their doom, then we could legitimately propose a scenario of expected progress. But dinosaurs remained dominant and probably became extinct only as a quirky result of the most unpredictable of all events - a mass dying triggered by extraterrestrial impact. If dinosaurs had not died in this event, they would probably still dominate the domain of large-bodied vertebrates, as they had for so long with such conspicuous success, and mammals would still be small creatures in the interstices of their world. [...] Since dinosaurs were not moving toward markedly larger brains, and since such a prospect may lie outside the capabilities of reptilian design, we must assume that consciousness would not have evolved on our planet if a cosmic catastrophe had not claimed the dinosaurs as victims. (Gould 1989, 318)
The context is the controversy around convergence and contingency in evolution. Rosenhouse discusses convergence as one of the hopes of Christians trying to reconcile evolution and Christian teachings, citing various proponents of the idea that their god set up the universe in a way that human-like intelligence was guaranteed to arise, thus producing beings that can have a "relationship" with said god.

Convergence is, of course, not only an observation considered helpful by the proponents of one variant of theistic evolution. To what degree the organisms that evolved on our planet would again turn out to be kind of similar if we replayed the tape or if organisms on other planets can be expected to look very similar to those on ours are very interesting questions of broad interest. Even an atheist may ask if we can expect lots of other planets where life arose to produce land plants, something a bit like insects, and perhaps even sentient beings given enough time, or if the vast majority of them will, for example, remain populated only by bacteria, because even evolving as much as multicellularity was a rare fluke.

Rosenhouse cites Gould as a well-known proponent of the importance of contingency. Although I tend much more towards the opposite view, I understand Gould's position. I believe the strongest argument for the contingency side is that while there are many impressive cases of convergence there are also quite a few crucial events in the history of life on this planet that appear to have happened only once: complex Eukaryotic cells; colonisation of dry land by multi-cellular plants; vertebrates; and of course human-like intelligence.

If, for example, the independent evolution of wings by insects, pterosaurs, birds and bats is counted as evidence for the importance of convergence, should something happening only once not be counted as evidence for the importance of contingency? My response would be competition, or in other words the change in the adaptive landscape caused by the first organisms to settle on a new peak. Where there may have been a ridge connecting the niches "kelp" and "large land-living plant" when nobody had occupied the latter, the first lineage to do so quickly became so good at being large land-living plants that the ridge crumbled away and became a canyon. If all land plants were wiped out, however, I would expect the land to be colonised anew, this time perhaps by red or brown algae.

But that is not actually about the main argument Gould is quoted as making in the above excerpt, and not what I found interesting about the quote. To take it in smaller pieces:
If mammals had arisen late and helped to drive dinosaurs to their doom, then we could legitimately propose a scenario of expected progress.
"Expected progress" is a bit of an odd term here. I am not sure if that is what is meant, but it could be read as if any group of animals that does not evolve towards large brains and intelligence is a refutation of the possibility that one group on each planet might evolve towards larger brains. But I do not think that this works as a refutation. And few proponents of the importance of convergence would argue that it is all about one linear progression towards large brains anyway. There are also progressions, for example towards body shapes that work well for swimming, towards paternal care for the young, towards powered flight, etc., and all of these happen at the same time but only in those lineages for which they solve relevant problems or create new opportunities.

If I understand the argument correctly, it is like pointing at a hole in the ground and saying, "if I now throw a pebble into the air and it does not end up in this specific hole, gravity is refuted", whereas the argument for convergence is that, what with evolution throwing thousands of pebbles into the air every year, we are very likely to find a few of them at the bottom of this hole as opposed to half way up its wall.
But dinosaurs remained dominant and probably became extinct only as a quirky result of the most unpredictable of all events - a mass dying triggered by extraterrestrial impact. If dinosaurs had not died in this event, they would probably still dominate the domain of large-bodied vertebrates, as they had for so long with such conspicuous success, and mammals would still be small creatures in the interstices of their world.
Although this is not my field, and I understand that it is an active area of research, I believe it can already be said with some confidence that mass extinction is not random. There are generally some reasons for why an extinction event claims this lineage here but leaves that other one over there largely intact. If a mass extinction of marine life is caused, for example, by a massive drop in the oxygen content of the oceans, then we would expect lineages that can survive under low oxygen conditions to come out in relatively good shape, all things considered, while those with a high oxygen need would be hammered.

In the present case, if we hypothesise that the impact of a large meteorite would have caused massive shockwaves followed by a few years of something like nuclear winter, we could expect the following: Species of small animals may find it easier to survive because they need less food per number of individuals. Bonus points if you have a burrow to hide in when the devastation sweeps across your area (small mammals) or if you can move easily to other areas where a bit more food is left (flight-capable birds). Large animals that can go with little food for long times may also have a good chance, in other words being cold-blooded may help to survive several bad years (crocodiles). If, however, you are large and (!) at the same time you have a high rate of metabolism then you might be in trouble, as you constantly need lots of food per number of individuals. As far as I understand, that describes the non-avian dinosaurs: large and warm-blooded.

The point is, catastrophes do happen from time to time, and once one happened it would probably have decimated the largest animals, even if it had come ten million years later than it did. Their niches are filled up again by small animals evolving to be large (another good example of convergence). What killed off the pterosaur lineage, for example, may well have been that the birds had already out-competed all small pterosaurs, leaving only the very large species when the meteorite struck. But again, this is not my area of expertise really.
Since dinosaurs were not moving toward markedly larger brains, and since such a prospect may lie outside the capabilities of reptilian design, we must assume that consciousness would not have evolved on our planet if a cosmic catastrophe had not claimed the dinosaurs as victims.
And this last part is really what I find the most interesting, because it illustrates so nicely how paraphyletic taxa can confuse the thinking even of the smartest of us, even of experts in evolutionary biology. What is the problem with the argument here?

First, and most obviously, birds are dinosaurs. Second, corvids (crows and ravens) and parrots are highly intelligent. Not quite human-level intelligence, but in some experiments corvids have proved to be smarter even than chimpanzees, our closest relatives. It follows that  dinosaurs have actually "moved toward markedly larger brains", meaning here relative to the size of the body as a whole and, crucially, in terms of actual intelligence. Gould's premise is simply false, but his mistake is understandable, because at fault is really a misleading, i.e. non-phylogenetic, classification.

"Outside the capabilities of reptilian design" is, by the way, the same mistake at a deeper phylogenetic level. Mammals were not created fully formed, as mammals. Some of our ancestors were "reptiles", and here we are, having human-like intelligence by definition, what with us being humans and all that, so apparently there was a way of evolving human-like intelligence from a reptilian starting point. And from a fish starting point, and from a worm starting point, and from a bacterial starting point. All it took was lots of time and open niches waiting to be filled.

But I am not saying that anything here decisively refutes the idea that our sentience is a very rare fluke, unlikely to happen again should we go extinct. Maybe it is. The point is really how corrosive paraphyletic taxa are to reasoning about evolutionary processes.

Reference

Gould SJ, 1989. Wonderful Life: The Burgess Shale and the Nature of History. W.W. Norton, New York.

Wednesday, June 6, 2018

Manuscript submission then and now

When I started in science, back in the dark ages, submitting a manuscript to a journal was still quite simple, if perhaps a bit inefficient:
  1. Print the manuscript in triplicate.
  2. Write a cover letter and print it.
  3. Put everything into an envelope and send it off to the editor.
And that was that.

The first innovation was that you only had to send the manuscript to the editor as an eMail attachment, which was actually faster and saved a lot of paper. Unfortunately, however, things have changed again since then.

This is how it works today:
  1. Log into the editorial management software of the journal of my choice. If I do not have an account with that journal yet, create one first.
  2. Go to the author interface, click new submission.
  3. Select the type of article.
  4. Paste the title and abstract into an online form.
  5. Select key words or topics that supposedly help the journal to assign editors and/or reviewers. Click 'save and continue'.
  6. Upload main manuscript file, generally as an MS Word document.
  7. Upload all the figures as separate files, generally as TIF or EPS, although JPGs may be acceptable at the review stage. Paste figure legends and write 'link texts' into the form fields.
  8. Upload all the supplementary data files. If necessary, update the order of the files. Click 'save and continue'.
  9. Next, the authorship page. As the corresponding author with an account at that journal I am already in, but I may be asked to link my ORCID. (I have no idea if anybody actually uses it for anything - I only ever look people up with their ResearcherID or Google Scholar.)
  10. Search for my co-authors by name or eMail. I find the second co-author, great. The first and third co-authors aren't in the system, so I create entries for them. Click 'save and continue'.
  11. Error: No telephone number provided for second co-author. But he was in the system, so you accepted him before! Also, will any editor really ever want to use it? Argh. Let's look up his number. Okay, edited. Click 'save and continue'.
  12. Suggesting an editor for the manuscript. Oh dear, that's a long list. Hm. I know this guy hates one of the methods we used, he is out. This one is highly qualified but he will probably require us to add this other analysis that he likes. Ah well, worse things could happen. This one is also very qualified, but she works at a university I have a connection to - is that already a conflict of interest? Well, they can always choose somebody else, done.
  13. Okay, suggesting peer reviewers and providing their contact information. This guy is an obvious choice as he is the expert for one of the analyses we used, but darn, he is currently between institutions. Let's google his name. No, that's outdated. This one too. Ah, I'm lucky: he has an updated CV on this third page I found, complete with the new phone number and eMail address. Okay, now for reviewer suggestion number two. She is another obvious choice as one of the world experts on our study group. Easy to find her information on a staff page, so that's good. Who else? Maybe two more experts on the study group? Ah yes, she would be interested in this, and I have her contact details. And then this other guy from Europe. Google. Darn, nothing, despite the unique name. Perhaps there is contact info on recent papers. No, he is too senior, the corresponding authors are always others. Ah, wait, here? No, an eMail address from 2012 going "director@institution.org" sounds fishy, most likely somebody else is director now. More Google. Ah, finally, was able to click myself through to a staff website, well hidden and not in English. Ye gods. Four qualified reviewers, that should be enough to get going. Click 'save and continue'.
  14. Long, complicated page with miscellaneous information and declarations. First, write or upload cover letter. Done.
  15. Next, declare that we have not submitted this manuscript elsewhere. Okay.
  16. Is this a resubmission? No.
  17. Declare that we have followed protocol so-and-so on ethical collection practices. Yes.
  18. Declare that we have added a section on data availability. Wait, was that in the instructions to authors? Don't remember that. Argh. Save. Open manuscript file. Add data availability section. Back to file upload. Delete manuscript file. Re-upload manuscript file. Reorder files. Click 'save and continue'. Back to declaration. Yes, we now have a section on data availability.
  19. Declare no conflicts of interest. Okay. Click 'save and continue'.
  20. Large summary page. Check everything I entered so far. Down at the bottom: have to check PDF proofs before being allowed to submit. Click button, wait while the editorial manager bundles everything into a PDF.
  21. Open PDF. One of the EPS figures does not display. Argh. Argh. Argh. Back to file upload. Delete offending figure. Re-upload figure - as a TIF this time, that should be foolproof. Reorder files. Click 'save and continue'. Back to summary page.
  22. Re-check everything I entered so far. Click button, wait while the editorial manager generates a new PDF. Looks good this time.
  23. The big moment is there: click here to submit. "Are you certain? This will submit your manuscript." Yes!
Yay, progress?

Saturday, May 12, 2018

What are monotypic genera good for?

There are a lot of monotypic genera around. In the group I am currently working on the most, the daisy family Asteraceae in Australia, there are an awful lot of monotypic genera indeed. Why do we need so many of them?

I would argue that there are two different scenarios to be considered. First, however, we need to keep in mind that:
  1. We should classify organisms by their degree of relatedness, meaning that supraspecific taxa (including genera) should be monophyletic, and
  2. while this previous rule tells us how we should group it does not tell us how we should rank. There is no genusness to be discovered in nature. Whether it is here in the phylogeny where we call a clade a genus or four nodes deeper down the tree is ultimately an arbitrary human decision.
This may at first suggest that there is no good argument to be had against monotypic genera either. If ranking is arbitrary then a classification consisting entirely out of monotypic genera - each species in the tree of life gets its own genus - is just as valid as the current one, so why not?

It is true that this is one of many possible ranking solutions compatible with phylogenetic systematics, but to decide between those many possible ranking solutions we can bring other criteria to bear. And here I would argue that it would be useful to minimise the number of monotypic genera as far as possible. Why? Because I would consider the genus level 'wasted' in many of those cases.

The entire point of a classification is that each taxon provides a piece of information. That information is: The members of this taxon are more closely related to each other than they are to non-members of this taxon. If we have a species, the species-taxon provides this information for all the members of that species. If we now have that species classified in a monotypic genus, the genus-taxon provides... the exact same information over again. It doesn't add anything. It is wasted.

Consequently, I believe that the proper use of monotypic genera is for when they are actually required for phylogenetic classification, but that there is a good argument for sinking them into larger genera whenever things could be made monophyletic without them. Two examples may illustrate the argument.


The above presents a case where the monotypic genus in red is actually needed. There are two genera marked in blue and green, and so obviously the phylogenetically isolated lineage in red cannot be lumped into either of them without making them paraphyletic. It is 'left over' and needs its own genus.


A perfect example for this is the ginkgo tree, Ginkgo biloba, which is a phylogenetically isolated living fossil. It is here photographed as an alley tree in front of our apartment block in Zürich, back when I was a postdoc there.


In the above phylogeny, however, the monotypic genus in red is sister to another genus in blue, and that latter genus isn't very large either. Now I can understand why it might perhaps be desirable to recognise the two as different genera if their divergence happened many tens of millions of years ago and they are morphologically quite distinct. Unfortunately, however, the world is full of monotypic genera that are very young and look exactly like the slightly larger sister genus, but differ from it in a single morphological character.

In those cases, do we really need that kind of taxonomic inflation? What then is the use of the genus rank?


The species that occasioned these ruminations in me is the above Tasmanian daisy tree Centropappus brunonis, which is clearly just a Bedfordia without hairs on the leaves; otherwise the two genera are pretty much indistinguishable. And Bedfordia itself has a mere three species, so it is not as if it would get unmanageably large if they were united.

There are many, many similar cases.

Friday, April 20, 2018

Time-calibrated or at least ultrametric trees with the R package ape: an overview

I had reason today to look into time-calibrating phylogenetic trees again, specifically trees that are so large that Bayesian approaches are not computationally feasible. It turns out that there are more options in the R package APE than I had previously been aware of - but unfortunately they are not all equally useful in everyday phylogenetics.

In all cases we first need a phylogram that we want to time-calibrate or at least make ultrametric to use in downstream analyses that require ultrametricity. As we assume that our phylogeny is very large it may for example have been inferred by RAxML, and the branch lengths are proportional to the probability of character changes having happened along them. For present purposes I have used a smaller tree (actually a clade cut out of a larger tree I had floating around), so that I could do the calibrations quickly and so that the figures of this post look nice. My example phylogram has this shape:


We fire up R, load the ape package, and import our phylogeny with read.tree() or read.nexus(), depending on whether it is in Newick or Nexus format, e.g.
mytree <- read.tree("treefilename.tre")
Now to the various methods.

Penalised Likelihood

I have previously done a longer, dedicated post on this method. I did not, however, go into the various models and options then, so let's cover the basics here.

Penalised Likelihood (PL) is, I think, the most sophisticated approach available in APE, allowing the comparison of likelihood scores between different models. It is also the most flexible. It is possible to set multiple calibration points, as discussed in the linked earlier post, but here we simply set the root age to 50 million years:
mycalibration <- makeChronosCalib(mytree, node="root", age.max=50)
We have three different clock models at our disposal, correlated, discrete, and relaxed. Correlated means that adjacent parts of the phylogeny are not allowed to evolve at rates that are very different. Discrete models different parts of the tree as evolving at different rates. As I understand it, relaxed allows the rates to vary most freely. Another important factor that can be adjusted is the smoothing parameter lambda; I usually run all three clock models at lambdas of 1 and 10 and pick the one with the best likelihood score. For present purposes I will restrict myself to lambda = 1.

Let's start with correlated:
mytimetree <- chronos(mytree, lambda = 1, model = "correlated", calibration = mycalibration, control = chronos.control() )
When plotted, the chronogram looks as follows.


Next, discrete. The command is the same as above except for the text in the model parameter. The branch length distribution and likelihood score turned out to be very close to those for the correlated model:


Finally, relaxed. Very different branch length distribution and a by far worse likelihood score compared to the other two:


I have only considered testing a strict clock model with chronos for the first time today. It turns out that you get it by running it as a special case of the discrete model, which by default is set to assume ten rate categories. You simply set the number of categories to one:
mytimetree <- chronos(mytree, lambda = 1, model = "discrete", calibration = mycalibration, control = chronos.control(nb.rate.cat=1) )
In my example case this looks rather similar to the results from correlated model and discrete with ten categories:


The problem with PL is that is seems to be a bit touchy. Even today we had several cases of an inexplicable error message, and several cases of the analysis being unable to find a reasonable starting solution. We finally found that it helped to vastly increase the root age (we had played around with 15, assuming that it doesn't matter, and it worked when we set it to a more realistic three digit number). It is possible that our true problem was short terminal branches.

PL is also the slowest of the methods presented here. I would use it for trees that are too large for Bayesian time calibration but where I need an actual chronogram with a meaningful time axis and want to do model comparison. If I just want an ultrametric tree the following three methods would be faster and simpler alternatives. That being said, so far I had no use case for them.

A superseded but fast alternative: chronopl()

This really came as a surprise as I believed that the function chronopl() had been removed from ape. I thought I had tried to find it in vain a few years ago, but I saw it in the ape documentation today (albeit with the comment "the new function chronos replaces the present one which is no more maintained") and was then able to use it in my current R installation. I must have confused it with a different function.

chronopl() does not provide a likelihood score as far as I can see, but it seems to be very fast. I quickly ran it with default parameters and lambda = 1, again setting root age to 50:
mytimetree <- chronopl(mytree, lambda = 1, age.min = 50, age.max = NULL, node = "root")
The result looks very similar to what chronos() produced with the (low likelihood) relaxed model:


Various parameters can be changed, but as implied above, if I want to do careful model comparison I would use chronos() anyway.

Mean Path Lengths

The chronoMPL() method time-calibrates the phylogeny with what is called a mean path lengths method. The documentation makes clear that multiple calibration points cannot be used; the idea is to make an ultrametric tree, pick one lineage split for which one has a credible date, and then scale the whole tree so that the split has the right age. Command is simply:
mytimetree <- chronoMPL(mytree)
The problem is, the resulting chronogram often looks like this:


Most of the branch length distribution fits the results for the favoured model in the analysis with chronos(), see above. That's actually great, because chronoMPL() is so much faster! But you will notice some wonky lines in particular in the top right and bottom right corners of this tree graph. Those are negative branch lengths. Did somebody throw the ancestral species into a time machine and set them free a bit before they actually evolved?

Some googling suggests that this happens if the phylogram is very unclocklike, which, unfortunately, is often the case in real life. That limits rather sharply what mean path lengths can be used for.

The compute.brtime() function

Another function that I have now tried out is compute.brtime(). It can do two rather different things.

The first is to transform a tree according to what I understand has be a full set of branching times for all splits in the tree. The use case for that seems to be if you have a tree figure and a table of divergence times in a published paper and want to copy that chronogram for a follow-up analysis, but the authors cannot or won't send it to you. So you manually type out the tree, manually type out a vector of divergence times (knowing which node number is which in the R phylo format!), and then you use this function to get the right branch length distribution. May happen, but presumably not a daily occurrence. What we usually have is a tree for which we want the analysis to infer biologically realistic divergence times that we don't know yet.

The second thing the function can do is to infer an ultrametric tree without any calibration points at all but under the coalescent model. The command is then as follows.
mytimetree <- compute.brtime(mytree, method="coalescent", force.positive=TRUE)
It seems that the problem of ending up with negative branch lengths was, in this case, recognised and solved simply by giving the user the option to tell the function PLEASE DON'T. I assume they are collapsed to zero length (?). My result looked like this:


Note that this is more on the lines of "one possible solution under the coalescent model" instead of "the optimal solution under this here clock model", so that every run will produce a slightly different ultrametric tree. I ran it a few times, and one aspect that did not change was the clustering of nearly all splits close to the present, which I (and PL, see above) would consider biologically unrealistic. Still, we have an ultrametric tree in case we need one in a hurry.

It is well possible that I have still missed other options in APE, but these are the ones I have tried out so far.

Something completely different: non-ultrametric chronograms

Finally, I should mention that there are methods to produce very different time-calibrated trees in palaeontology. The chronograms discussed in this post are all inferred under the assumption that we are dealing with extant lineages, so all branches on the chronogram end flush in the present, and consequently a chronogram is an ultrametric tree. And usually the data that went into inferring the topology was DNA sequence data or similar.

Palaeontologists, however, deal with chronograms where many or all branches end in the past because a lineage went extinct, making their chronograms non-ultrametric and look like phylograms. And usually the data that went into inferring the tree topology was morphological. This is a whole different world for me, and I can only refer to posts like this one and this one which discuss an R package called paleotree.

There also seems to be a function in newer APE versions called node.date() which is introduced with the following justification:
Our software, node.dating, uses a maximum likelihood approach to perform divergence-time analysis. node.dating is written in R v3.30 and is a recent addition to the R package ape v4.0 (Paradis et al., 2004). Previously, ape had the capability to estimate the dates of internal nodes via the chronos function; however, chronos requires ultrametric trees and is thus unable to incorporate information from tips that are sampled at different points in time.
This suggests that the point is the same, to allow chronograms with extinct lineages, but in this case aimed more at molecular data. Their example case are virus sequence data.

Friday, April 13, 2018

Monophyletic species, kind of

A paper by bryologist Brent Mishler and philosopher of biology John Wilkins has just come out, with the title The Hunting of the SnaRC: A Snarky Solution to the Species Problem. It is open access in the journal Philosophy Theory and Practice in Biology, so anybody with internet access can check it out.

Many bloggers have issues that they return to again and again even if they are not necessarily the nominal topics of their blogs - for example, Jerry Coyne frequently posts about Free Will and about students trying to shut down talks by speakers they don't like, and Larry Moran regularly takes apart papers claiming that junk DNA has been disproved. This much less widely known blogger can reliably be coaxed out from behind the oven by at least two such recurring issues: bad arguments for the acceptance of paraphyletic taxa, and the in my eyes incoherent concept of "monophyletic species".

As the title indicates, Mishler & Wilkins present a solution for the species problem, i.e. the perennial question in biology of what 'a species' even is. Especially as the paper is freely accessible it would serve no purpose to summarise its introduction, so I will move immediately to what I find most interesting: their views on how to view species and some pointers on how to do classification at the lowest levels in practice.

Note that I say "their views", plural, deliberately, because this is one aspect of the paper that I have not quite understood yet:

Wilkins has argued in the past that the popular approach of developing a theoretical species concept and then applying it to a potentially recalcitrant reality is a dead end. What biologists should do is the opposite, i.e. consider species as empirical phenomena in need of individual explanations. And here in this paper, Wilkins' argument is reiterated concisely in section 3, A Way Forward: Species Are at Least Initially Phenomena.

What I like about this flip in perspective is that it allows much more flexibility; obviously the empirical phenomena that we generally identify as species, be it popularly or as biologists - generally gaps in morphological or genetic variation - need a different scientific explanation for example in asexual than in sexual species, making one-size-fits-all species concepts difficult to apply.

Mishler, in turn, has argued in the past that species are not a special biological category different from e.g. monophyletic genera and families. The species category is arbitrary, and we should just classify all organisms into nested monophyletic groups, AKA clades, all the way down to the individual specimens. And here in this paper, Mishler's argument is reiterated in sections 4, Rankless Taxonomy, 5, Capturing the SNaRC, and 6, Using SNaRCs in Systematic, Evolutionary, and Ecological Studies.

The thing is, while there is perhaps technically no direct contradiction between those two arguments to the degree that there is a contradiction between "all taxa should be monophyletic" and "taxa should be allowed to be paraphyletic", they appear to be two rather different prescriptions. If I understand correctly, the first says,
  • We should treat species as empirical phenomena in need of explanation instead of indiscriminately applying a given theoretical concept to them.
The second says,
  • It makes no sense to even talk of species, we should stop doing so, and here is a single theoretical concept (everything is clades) that we should indiscriminately apply to all specimens.
In fact I am currently unable to see how sections 4-6 and the conclusions of this paper would have to change if section 3 were to be deleted in its entirety. What am I missing?

What I found most useful about this paper was that it has some thoughts on how to do classification into nested clades all the way down to the individual specimens in practice, because that was completely unclear to me in all past instances when this approach was suggested. There are some apparent problems with it, particularly that we need items forming a tree structure to even have clades. It is sometimes difficult to illustrate the issue, but it can perhaps be presented as follows:
  1. The prescription is, as mentioned above, that a classification should be clades (= monophyletic groups) all the way down to individual specimens.
  2. A clade is a complete branch in a tree structure, and usually understood to be specifically a complete branch of a species phylogeny.
  3. In other words, the way the term clade is defined, it applies only in a tree-structure but is inapplicable in a net-like structure.
  4. Sexually reproducing species are systems consisting of individual specimens that have net-like relationships with each other, because they share numerous ancestors instead of one ancestor in each sufficiently earlier generation.
  5. It follows necessarily from the previous two points that the term clade cannot be applied to describe the relationship between specimens if what we are looking at includes multiple specimens from the same sexually reproducing species.
  6. If follows then that it is logically impossible to classify into clades all the way down to these specimens, unless the meaning of the word clade is changed to a degree that the whole purpose of having that word is defeated.
To my understanding this is why Hennig spent so much time discussing the different ways that specimens (or snapshots of them, which he called semaphoronts) can be related to each other. The relationship between four (non-hybridogenic) species is tree-like, so they can, and should, be classified into clades. But relationships between individuals within a sexually reproducing species are net-like, so they cannot possibly be classified into clades, as the word does not even have a meaning in that structure.

The point at which approaches to classification change is approximately at the species level. Phylogenetic systematics applies only above it, and it uses species as the units that it groups into clades, because if it used any smaller units there would not be clades. This is also why in my opinion one cannot coherently reject the reality of species and be a phylogenetic systematist and, conversely, coherently accept the reality of species and promote paraphyletic taxa, because clades are species that have diversified. Many others, of course, disagree.

Now, what is the practical approach suggested by the present paper? It argues that the terminal units of classification should be "the finest-scale clades that can be convincingly demonstrated with current data", here called Smallest Named and Registered Clades (SNaRCs). Obviously such a 'clade' cannot be based on information from a single gene, as it may show a different history than other genes, for example because of introgression or incomplete lineage sorting. The solution is to use as evidence for monophyly "the preponderance of gene lineages making up a clade", or in other words "congruence among the majority of gene trees and other types of phylogenetic characters available".

On the plus side, this is a very empirical and testable prescription. But consider two thought experiments. First, take three samples A, B and C, look at, say, 100 gene trees, and if 51 of them show ((A,B),C) then A and B form a 'clade', even if all three of them are members of the same sexually reproducing species. Again, that is doable, empirical and testable, and we get a clear answer.

Nonetheless this approach does not convince me at the moment, nor will it even if we assume a scenario of 100 gene trees supporting (A,B), simply because no matter what the gene trees say, in reality there is no tree-structure inside the species. Yes, we can easily sequence for example the DNA of three siblings and run an analysis that will produce a phylogenetic tree for each gene, but in reality these three people just don't have a tree-relationship with each other, so it does not make sense to me to use terminology or a classification that implies there is one.

For the second thought experiment, take three samples D, E, and F, and if 33 gene trees say ((D,E),F), 33 say (D,(E,F)), and 34 say (E,(D,F)), we are inside a SNaRC and should not delimit any more narrowly, even if D is a specimen from an arid zone ephemeral, E from an alpine perennial, and F from a narrow endemic of the northwestern Blue Mountains that only occurs on ironstone-sandstone outcrops, and all three of them are geographically isolated from each other.

This hypothetical case has three very distinct entities that show a lot of gene tree discordance for the genes we used for our analysis. This is a much weaker problem than the previous one because Mishler & Wilkins argue that SNaRCs are, as all scientific hypotheses, tentative and await revision after the examination of more data. Maybe the next 100 gene trees will clinch it for (A,(B,C)), and then at least we could separate out A; more realistically, sampling more individuals of all three species will presumably resolve the three species as three SNaRCs, even if we cannot figure out the relationship of those three SNaRCs with each other (they may even form a true polytomy, and that's fine).

Still it bothers me that in a situation where we unfortunately have only one sample per species available for analysis the approach promoted in the present paper might lead to the tentative lumping of clearly distinct entities. And unless something is added to the approach, or unless I am missing something, it would have to, because it does not seem to include a way of recognising single-specimen SNaRCs except in the case of one being left alone as sister to another SNaRCs, that, in turn, would still consist of two potentially vastly different specimens. But maybe I am taking this too literally.

On top of that there is perhaps another methodological issue, or again maybe just something I don't understand. It seems to me as if "majority vote of the gene trees" is not actually how multi-locus phylogenetic analyses generally work. To the best of my understanding they reconcile gene trees in rather more complex ways, even in the case of such a simple approach as Gene Tree Parsimony, let alone the multi-gene coalescent model. Many of these approaches actually presuppose the existence of species or populations, and for the same reason as I argued above: what happens within a sexually reproducing lineage is rather different from what happens between such lineages.

More than anything what I find uncomfortable about the approach presented here is that it seems to care not so much about the actual patterns of common descent of what it classifies as about character or gene tree distribution. The difference may come across as subtle, admittedly. What I am trying to say is that I believe phylogenetic systematics should be about classifying organisms by relatedness, by exclusivity of common descent.

I do not, for example, care very much about the fact that most of the ancestral chloroplast genome has been moved over into the nucleus of the host cell, because the chloroplasts are directly descended in an unbroken line from the first cyanobacterium that colonised a plant cell, and the plant species we have today are descended in an unbroken line from that plant cell. To me chloroplasts are a subclade of cyanobacteria and plants are a subclade of eucaryotes, all regardless of what happened to the individual genes.

To use an example from within a species, I have mentioned in the past that it is possible, although statistically unlikely, that I have inherited no genetic material whatsoever from my maternal grandfather, if it just so happened that all the chromosomes my mother gave me were those she got from her mother (the Y chromosome is of course always from the paternal grandfather, by necessity). But even if that were the case we would nonetheless consider it to be an important piece of information that I descended from my maternal grandfather, and I would nonetheless not exist without his involvement. So yes, we use the genes to infer common descent, but the point is really the common descent itself, and the genes are just a data source that can potentially mislead us. Sometimes the right answer may be (A,(B,C)) even if most genes say ((A,B),C).

The "majority vote of the gene trees" approach, however, feels as if its practical concern starts and ends at the pattern shown by the genes, regardless of what the patterns of descent are. To me that feels the wrong way around.

Another way of looking at the issue may be this: If we truly accept the argument made in section 3, that we should look at natural phenomena, consider them to be explananda, and find the most appropriate scientific explanation for each of them, would the logical result not be Hennig's original approach? The phenomenon that a beetle specimen shares more traits with a bee specimen than either share with a slug specimen has an explanation, and that is that the former two share a much more recent common ancestor from which they inherited the shared traits. We express that reality by grouping the former two into a taxon called 'insects' while leaving the slug out.

The fact that I may easily in some cases share more genetic similarity with somebody born in Italy than with another northern German, however, would most likely be due to the stochastic nature of allele inheritance inside our sexually reproducing species. There is no clade wherein two specimens of humanity - the hypothetical Italian and I - share one and only one most recent common ancestor. Instead, beyond some point in the past we share thousands of ancestral 'specimens' in each generation. Because this is a different biological phenomenon than ((beetle,bee),slug), we need a different approach to classification at that level.

Wednesday, April 11, 2018

Botany picture #257: Gentianella aspera


Has it been that long since I posted the last botany picture? With my mind still on the mountains, here is a European gentian, Gentianella aspera (Gentianaceae), European Alps, 2004. Although sometimes split off into their own genus, the Australian gentians are phylogenetically also Gentianella.

One thing that I found strange about the Australian ones, by the way, is that they are generally white, because the European gentians are rather famously blue, violet, or very rarely yellow. There is even an obnoxious German Schlager song making that point, with the first line of the chorus translating as "blue, blue, blue blooms the gentian".

WARNING: follow that link at your own risk.

Tuesday, April 10, 2018

Sam Harris and Ezra Klein on intelligence and race

Recently atheist activist Sam Harris and journalist Ezra Klein had a discussion about intelligence and race. The background is that Harris had Charles Murray, the author of The Bell Curve, as a guest on his podcast, Klein's Vox site published an article critical of that interview, and Harris felt that that article was unfair.

Having read through the transcript of Harris' and Klein's conversation, I must say that it went reasonably well, considering the topic. Harris' discussion with Noam Chomsky, for example, was much worse, as his first argument went completely over Harris' head, and they just went in circles from that moment on.

The frustrating thing is that at the bottom of what Harris is trying to argue there are quite a few ideas that are valid. Yes, scientific results should be accepted for what they are instead of being pushed aside for fear of being politically incorrect. But his otherwise reasonable points are completely overshadowed by his tendency to make it all about how mean his critics are to him for calling him biased and his inability to see that making it all about how his critics are mean to him while bracketing out how this discussion fits into its historical and political context in the United States is his own unacknowledged bias at work.

What is in my eyes particularly ironic, however, is that while Harris makes it all about how unfair his critics are, he argues at the same time that the science should be the focus. So I tried to have an eye on how the scientific evidence was discussed, and as far as I can tell it seemed to go as follows:

Klein sometimes brings up evidence that shows that intelligence (as measured by IQ or similar tests, which is another whole can of worms) is strongly influenced by the environmental conditions under which somebody grows up, e.g. when children from disadvantaged backgrounds are adopted by affluent families, and cites, by name, relevant scientists who argue that at the very least there is at this moment no evidence yet for any significant genetically determined IQ difference between groups. (And I have no idea where such evidence could even potentially come from, unless there is behind this the usual misunderstanding of what heritability means.) Harris never addresses those arguments, as far as I can tell. His counter-arguments appear to be:

(1) "genes are involved for basically every[thing]". But that is so trivially true as to be meaningless. Genes are involved for the development of fingers, still there are no differences in the number of fingers between different populations. And even if we are talking about traits that vary, it gets us nowhere, because it doesn't necessarily follow that the genes determine more than, say, 5% of the variation. And even if intelligence is strongly heritable it says nothing about significant differences between groups either, as he readily admits that variation within is much stronger than between.

(2) Then there is Harris' sports example, where he says that West Africans dominate certain running sports. He argues "if you have populations that have their means slightly different genetically, 80 percent of a standard deviation difference, you’re going to see massive difference in the tail ends of the distribution, where you could have 100-fold difference in the numbers of individuals who excel at the 99.99 percent level". Now I get that this might be a valid argument to explain the underrepresentation of a group with a hypothetically slightly lower mean at excelling at the >99.9% level under the Utopian assumption of complete equality of opportunity, but then we would be talking about Field Medal winners or Nobel laureates. As an explanation for lower societal achievement on average, i.e. why members of a group are vastly overrepresented in prisons and have vastly lower household wealth than the majority, it is a non-starter and thus irrelevant to the discussion from the get-go.

(3) Harris cites unnamed scientists who, he says, do not want to have their names published because of fear of being called racist, but who are said to agree with him. Not knowing who they are one is, of course, unable to confirm what they said or meant as well as to assess their qualifications, their potential agendas and biases, and if they are even from a relevant field of research. (Note that according to Wikipedia Charles Murray, with whom that whole discussion started, is a "political scientist, author, and columnist" working for a conservative think tank. That is, he is not an expert in the areas of population genetics, human cognitive development, comparative assessments, or any other field of relevance.)

I find that a bit disappointing. For all Harris' claims that the science is clearly on Charles Murray's side, it rather looks to me as if his argumentation runs simply as follows: There are differences in IQ between groups, and these differences must obviously have a genetic component, because everything has a genetic component. And that's it, at least as far as one can tell from the conversation with Klein.

Monday, April 2, 2018

How problematic is the jump dispersal parameter in ancestral area inference?

I recently read an article in the Journal of Biogeography titled "Conceptual and statistical problems with the DEC + J model of founder-event speciation and its comparison with DEC via model selection". Its authors are Richard Ree, the developer of the original DEC model, and Isabel Sanmartin.

The main problem with discussing the paper here is that it would probably take 5,000 words to properly explain what it is even about. I will try to provide the most superficial introduction to the topic and otherwise assume that of the few people who will read this blog most are at least somewhat familiar with it.

The area of research this is about is the estimation or inference of ancestral areas and biogeographic events. Say we have a number of related species, the phylogeny showing how they are related, a number of geographic areas in which each species is either present or absent, and at least one model of biogeographic history. For the purposes of what I will subsequently call ancestral area inference (AAI) we assume that we know the species are well-defined and that the phylogeny is as close to true as we can infer at the time, so that they will simply be accepted as given. How to objectively define biogeographic areas for the study group is another big question, but again we take it as given that that has been done.

The idea of AAI is to take these pieces of information and infer what distribution ranges the ancestral species at each node of the phylogeny had, and what biogeographic events took place along the phylogeny to lead to the present patterns of distribution. What model of biogeographic events we accept matters a lot, of course. Imagine the following simple scenario of three species and three areas, with sister species occurring in areas A and B, respectively, and their more distant relative occurring in both areas B and C:



Assuming, for example that our model of biogeographic history favours vicariant speciation and range expansions, we may consider the scenario on the left to be a very probable explanation of how we ended up with those patterns of distribution. First the ancestral species of the whole clade occurred in all areas, and vicariant speciation split it into a species in area A and one in areas B and C. The former expanded to occur in both A and B and then underwent another vicariant speciation event, done.

If we have reason to assume that this is unlikely, for example because area A is an oceanic island, we may favour a different model. In the right hand scenario we see the ancestral species occurring in areas B and C and producing one of its daughter species via subset sympatry in area B. At least one seed or pregnant female of that new lineage is then dispersed to island A. An event such as this last one, where dispersal leads to instant genetic isolation and consequent speciation, is in this context often called 'jump dispersal' or, as in the title of the paper, 'founder-event speciation', to differentiate it from the much slower process of gradual range expansion followed by vicariant or sympatric speciation*.

I am not saying that either of these scenarios is the best one to explain how the hypothetical three species evolved and dispersed. In fact I would say that three species are too small a dataset to estimate biogeographic history with any degree of confidence, but it provides an idea of what ancestral area inference is about.

Perhaps the best established approaches to AAI are Dispersal and Vicariance Analysis (DIVA) and the Dispersal, Extinction and Cladogenesis model (DEC). The former was originally implemented as parsimony analysis in a software with the same name, and it has a tendency to favour vicariance, as the name suggests. Likelihood analysis under the DEC model became popular in its implementation in the software Lagrange, and in my limited experience and to the best of my understanding it is designed to have daughter species inherit part of the range of the ancestor, often leading to subset sympatry. And there are other approaches, of course.

As the result of his PhD project, Nick Matzke introduced the following two big innovations in AAI: First, the addition of a parameter j, for jump dispersal, to existing models. This allows the kind of instantaneous speciation after dispersal to a new area that I described above, and which can be assumed to be particularly important in island systems. Second, the idea that the most appropriate model for a study group should be chosen through statistical model selection, as in other areas of evolutionary biology. He created the R package BioGeoBEARS to allow such model selection. It implemented originally likelihood versions of DIVA, DEC and a third model called BayArea, all assuming the operation of slightly different sets of biogeographic processes. Each of them can be tested with and without the j parameter and, after another update, with or without an x parameter for distance-dependent dispersal.

Now I come finally (!) to Ree & Sanmartin. Their eight page paper, as the title implies, is a criticism of these two innovations. What do they argue? I hope I am summarising this faithfully, but in my eyes their three core points are as follows:
  • A biogeographic model with events happening at the nodes of the tree as opposed to along the branches, as is the case with jump dispersal, is not a proper evolutionary model because such events are then "not modeled as time-dependent". In other words, only events that have a per-time-unit probability of occurring along a branch are appropriate.
  • Under certain conditions the most probable explanation provided by a model including the j parameter is that all biogeographic events were jump dispersals. The j parameter gets maximised and explains everything by itself. They call this scenario "degenerate", because the "true" model must "surely" include time-dependent processes.
  • DEC and DEC + j (and, I assume, by extension any other model and its + j variant) cannot be compared in the sense of model selection.

I must, of course, admit that model development is not my area. Consequently I am happy to defer regarding points one and three to others who have more expertise, and who will certainly have something to say about this at some point. I can only at this moment state that these claims do not immediately convince me. Certainly it is often the case that models with very different parameters are statistically compared with each other?

Is it not possible that the best model to explain an evolutionary process may sometimes indeed have a parameter that is not time-dependent but dependent on lineage splits? In the present case, if it is a fact that jump dispersal caused a lineage split, then both events quite simply happened instantaneously (at the relevant time scale of millions of years); in a sense, they were the same event, as the dispersal itself interrupted gene flow.

Perhaps more importantly, however, I am not at all convinced by the second point. Generally I am more interested in practical and pragmatic considerations than theory of statistics and philosophy. In phylogenetics, for example, I am less impressed by the claim that parsimony is supposedly not statistically consistent than by a comparison of the results produced by parsimony and likelihood analysis of DNA sequence datasets. Do they make sense? What can mislead an analysis? What software is available? How computationally feasible is what would otherwise be the best approach, and can it deal with missing data?

So in the present case I would also like to consider the practical side. Is the problem of j being maximised so that everything is explained by jump dispersal at all likely to occur in empirical datasets? In the paper Ree and Sanmartin illustrate a two species / two area example. That is clearly not a realistic empirical dataset, as it is much too small for proper analysis. But if we understand to some degree how the various model parameters work we can deduce under what circumstances j is likely to be maximised.

Unless I am mistaken, the circumstances appear to be as follows: We need a dataset in which all species are local endemics, i.e. all are restricted to a single area, and in which sister species never share part of their ranges. This is because other patterns cannot be explained by jump dispersal. If a species occupies two or more areas, it would have had to expand its range, so the analysis cannot reduce the d parameter for range expansion to zero. If sister species share part of their ranges, likewise; if they share the same single area, they must have diverged sympatrically, which again is not speciation through jump dispersal.

This raises the question, how likely are we to find datasets in which these two conditions apply? In my admittedly limited experience such datasets do not appear to be very common. If we are dealing, for example, with a small to medium sized genus on one continent, we will generally find partly overlapping ranges, and often at least one very widespread species. The j parameter will not be maximised. If we are doing a global analysis of a large clade, we will need rather large areas (because if you use too many small areas the problem becomes computationally intractable). This means, among other things, that entire subclades will share the same single-area range, and j will not be maximised.

In other words, the problem of 'all-jump dispersal' solutions appears to be rather theoretical. But what if we actually do have such a dataset? Is it not a problem then? To me the next question is under what circumstances such a situation would arise. Again, we have all species restricted to single areas, meaning that they apparently find it hard to expand their ranges across two areas. Why? Perhaps geographic separation to the degree that they rarely disperse? Geographic separation to the degree that when they disperse gene flow is interrupted, leading to immediate speciation? Again, we never have sister species sharing an area. Why? A good explanation would be that each area is too small for sympatric speciation to be possible.

Now what does that dataset sound like? To me it sounds like an archipelago of small islands, or perhaps a metaphorical island system such as isolated mountain top habitats. The exact scenario, in other words, in which all-jump dispersal seems like a very probable explanation. Because your ancestral island is too small for speciation, the only way to speciate is to jump to another island, and if you jump to another island you are immediately so isolated from your ancestral population that you speciate.

Again, I am not a modeler, and I have not run careful simulation experiments before writing this, but based on this thought experiment it seems to me as if the + j models would work just as they should: j would not be maximised under circumstances where the other processes are needed to explain present ranges, but it would be maximised under precisely those extremely rare circumstances where 'all jump dispersal' is the only realistic explanation.

Footnote

*) Sympatric meaning here at the scale of the areas defined for the analysis. If one the areas in the analysis is all of North America, for example, it is likely that the 'sympatric' events inside that area would in truth mostly have been allopatric, parapatric or peripatric at a smaller spatial scale.

Sunday, April 1, 2018

Weekend in the mountains

We just came back from a nice weekend in the Australian 'Alps', making use of what may have been the last period of nicely warm weather. Still rather cold camping during the night, it is definitely not summer anymore.


Turns out we may finally have to buy a new tent. On the plus side, the belt of the plant press served well to keep the tent pole in shape; not perfectly, but sufficiently to give us just enough structural integrity for two nights.


The main attraction this time was Yarrangobilly Caves. The last time we passed by it was too late in the day, so we were unable to visit them. This time we bought passes for two of the caves.


Although probably weird looking enough, this photo does not do reality justice - the entrance area to South Glory Cave is massive and awe-inspiring.


We camped at our favourite spot in the area, Three Mile Dam. I have posted photos of the lake before, but here is one showing the moon reflected in the water during the night.


Morning mist above the camp site penetrated by the first rays of the sun.


And to conclude, something botanical: Golden everlasting daisies (Xerochrysum subundulatum, Asteraceae) fruiting on the Old Kiandra Gold Fields.

Friday, March 23, 2018

Science spammers constantly reaching for new lows

I received the following two messages on the same day, in fact they were sitting right next to each other in the inbox.



Surely this is sad. Is there no such thing as taking pride in one's work, even among spammers? Recently some people tried to defraud me, and of course that is annoying, but at least they put a lot of effort into it. I was impressed by how much information they had to accumulate to seem half-convincing. These guys, on the other hand, use such a simplistic bot to produce their mass-emails that it they are immediately recognisable as such.

Really the only thing sadder than these messages is that my spam filter is apparently still unable to understand that the keyword "greetings!!!!" is a certain indicator of spaminess.

Wednesday, March 21, 2018

Bioregionalisation part 6: Modularity Analysis with the R package rnetcarto

Today's final post in the bioregionalisation series deals with how to do a network or Modularity Analysis in R. There are two main steps here. First, because we are going to assume, as in the previous post, that we have point distribution data in decimal coordinates, we will turn them into a bipartite network of species and grid cells.

We start by defining a cell size. Again, our data are decimal coordinates, and subsequently we will use one degree cells.
cellsize <- 1
Note that this may not be the ideal approach for publication. The width of one degree cells decreases towards the poles, and in spatial analyses equal area grid cells are often preferred because they are more comparable. If we want equal area cells we first need to project our data into meters and then use a cellsize in meters (e.g. 100,000 for 100 x 100 km). There are R functions for such spatial projection, but we will simply use one degree cells here.

We make a list of all species and a list of all cells that occur in our dataset, naming the cells after their centres in the format "126.5:-46.5". I assume here that we have the data matrix called 'mydata' from the previous post, with the columns species, lat and long.
allspecies <- unique(mydata$species)

longrounded <- floor(mydata$long / cellsize) * cellsize + cellsize/2

latrounded <- floor(mydata$lat / cellsize) * cellsize + cellsize/2

cellcentre <- paste(longrounded,latrounded, sep=":")

allcells <- unique(cellcentre)
We create a matrix of species and cells filled with all zeroes, which means that the species does not occur in the relevant cell. Then we loop through all records to set a species as present in a cell if the coordinates of at least one of its records indicate such presence.
mynetw <- matrix(0, length(allcells), length(allspecies))
for (i in 1:length(mydata[,1]))
{
  mynetw[ match(cellcentre[i],allcells) , match(mydata$species[i], allspecies) ] <- 1
}
It is also crucial to name the rows and columns of the network so that we can interpret the results of the Modularity Analysis.
rownames(mynetw) = allcells
colnames(mynetw) = allspecies
Now we come to the actual Modularity Analysis. We need to have the R library rnetcarto installed and load it.
library(rnetcarto)
The command to start the analysis is simply:
mymodules <- netcarto(mynetw, bipartite=TRUE)
This may take a bit of time, but after talking to colleagues who have got experience with other software it seems it is actually reasonably fast - for a Modularity Analysis.

Once the analysis is done, we may first wonder how many modules, which we will subsequently interpret as bioregions, the analysis has produced.
length(unique(mymodules[[1]]$module))
For publication we obviously want a decent map, but that is beyond the scope of this post. What follows is merely a very quick and dirty way of plotting the results to see what they look like, but of course the resulting coordinates and module numbers can also be used for fancier plotting. We split the latitudes and longitudes back out of the cell names, define a vector of colours to use for mapping (here thirteen; if you have more modules you will of course need a longer vector), and then we simply plot the cells like some kind of scatter plot.
allcells2 <- strsplit( as.character( mymodules[[1]]$name ), ":" )
allcells_x <- as.numeric(unlist(allcells2)[c(1:(length(allcells)))*2-1])
 

allcells_y <- as.numeric(unlist(allcells2)[c(1:(length(allcells)))*2])
 

mycolors <- c("green", "red", "yellow", "blue", "orange", "cadetblue", "darkgoldenrod", "black", "darkolivegreen", "firebrick4", "darkorchid4", "darkslategray", "mistyrose")

plot(allcells_x, allcells_y, col = mycolors[ as.numeric(mymodules[[1]]$module) ], pch=15, cex=2)
There we are. Modularity analysis with the R library rnetcarto is quite easy, the main problem was building the network.

As an example I have done an analysis with all Australian (and some New Guinean) lycopods, the dataset I mentioned in the previous post. It plots as follows.


There are, of course, a few issues here. The analysis produced six modules, but three of them, the green, orange and light blue ones, consist of only two, one and one cells, respectively, and they seem biologically unrealistic. They may be artifacts of not having cleaned the data as well as I would for an actual study, or represent some kind of edge effect. The remaining three modules are clearly more meaningful. Although they contain some outlier cells, we can start to interpret them as potentially representing tropical (red), temperate (yellow), and subalpine/alpine (dark blue) assemblies of species, respectively.

Despite the less than perfect results I hope the example shows how easy it is to do such a Modularity Analysis, and if due diligence is done to the spatial data, as we would do in an actual study, I would also expect the results to become cleaner.

Sunday, March 18, 2018

Botany picture #256: Solenostemon presumably


In spring we bought three types of Sempervivum (Crassulaceae) and planted them in a large bowl. Two little seedlings spontaneously came up in the succulent soil and, recognising them as members of my other favourite plant family Lamiaceae, I transferred them to a different pot where they would get more water.

I was curious to see what they would grow into - perhaps a useful aromatic herb? Well, they grew and grew and grew, but they did not flower until just now. Although it had become clear to me some time ago that they must be some kind of Solenostemon or relative and are presumably cultivated as ornamentals rather than as kitchen herbs I was hoping that they would at least have nice flowers. The reality, alas, is a bit of a let-down. Not terrible but not exactly stunning either. It is unlikely that they will survive winter anyway, as they are probably tropical plants.


In other news, Canberra was covered by dust blown in from western New South Wales today. The sky was of an otherworldly grey and only returned to its customary blue colour late in the afternoon.

Saturday, March 17, 2018

Bioregionalisation part 5: Cleaning point distribution data in R

I should finally complete my series on bioregionalisation. What is missing is a post on how to do a network (Modularity) analysis in R. But first I thought I would write a bit about how to efficiently do some cleaning of point distribution data in R. As often I write this because it may be useful to somebody who finds it via search engine, but also because I can then look it up myself if I need it after not having done it for months.

The assumption is that we start our spatial or biogeographic analyses by obtaining point distribution data by querying e.g. for the genus or family that we want to study on an online biodiversity database or aggregator such as GBIF or Atlas of Living Australia. We download the record list in CSV format and now presumably have a large file with many columns, most of them irrelevant to our interests.

One problem that we may find is that there are numerous cases of records occurring in implausible locations. They may represent geospatial data entry errors such as land plants supposedly occurring in the ocean, or vouchers collected from plants in botanic gardens where the databasers fo some reason entered the garden's coordinates instead of those of the source location , or other outliers that we suspect to be misidentifications. What follows assumes that this at least has been done already (and it is hard to automate anyway), but we can use R to help us with a few other problems.

We start up R and begin by reading in our data, in this case all lycopod records downloaded from ALA. (One of the advantages about that group is that very few of them are cultivated in botanic gardens, and I did not want to do that kind of data clean-up for a blog post.)
rawdata <- read.csv("Lycopodiales.csv", sep=",", na.strings = "", header=TRUE, row.names=NULL)
We now want to remove all records that lack any of the data we need for spatial and biogeographic analyses, i.e. identification to the species level, latitude and longitude. Other filtering may be desired, e.g. of records with too little geocode precision, but we will leave it at that for the moment. In my case the relevant columns are called genus, specificEpithet, decimalLatidue, and decimalLongitude, but that may of course be different in other data sources and require appropriate adjustment of the commands below.
rawdata <- rawdata[!(is.na(rawdata$decimalLatitude) | rawdata$decimalLatitude==""), ]
rawdata <- rawdata[!(is.na(rawdata$decimalLongitude) | rawdata$decimalLongitude==""), ]
rawdata <- rawdata[!(is.na(rawdata$genus) | rawdata$genus==""), ]
rawdata <- rawdata[!(is.na(rawdata$specificEpithet.1) | rawdata$specificEpithet.1==""), ]
All the records missing those data should be gone now. Next we make a new data frame containing only the data we are actually interested in.
lat <- rawdata$decimalLatitude
long <- rawdata$decimalLongitude
species <- paste( as.character(rawdata$genus), as.character(rawdata$specificEpithet.1, sep=" ") )
mydata <- data.frame(species, lat, long)
mydata$species <- as.character(mydata$species)
Unfortunately at this stage there are still records that we may not want for our analysis, but they can mostly be recognised by having more than the two usual name elements of genus name and specific epithet: hybrids (something like "Huperzia prima x secunda" or "Huperzia x tertia") and undescribed phrase name taxa that may or may not actually be distinct species ("Lycopodiella spec. Mount Farewell"). At the same time we may want to check the list of species in our data table with unique(mydata$species) to see if we recognise any other problems that actually have two name elements, such as "Lycopodium spec." or "Lycopodium Undesignated". If there are any of those, we place them into a vector:
kickout <- c("Lycopodium spec.", "Lycopodium Undesignated")
Then we loop through the data to get rid of all these problematic entries.
myflags <- rep(TRUE, length(mydata[,1]))
for (i in 1:length(myflags))
{
  if ( (length(strsplit(mydata$species[i], split=" ")[[1]]) != 2) || (mydata$species[i]) %in% kickout )
  {
    myflags[i] <- FALSE
  }
}
mydata <- mydata[myflags, ]
If there is no 'kickout' vector for undesirable records with two name elements, we do the same but adjust the if command accordingly to not expect its existence.

Check again unique(mydata$species) to see if the situation has improved. If there are instances of name variants or outdated taxonomy that need to be corrected, that is surprisingly easy with a command along the following lines:
mydata$species[mydata$species == "Outdatica fastigiata"] = "Valida fastigiata"
In that way we can efficiently harmonise the names so that one species does not get scored as two just because some specimens still have an outdated or misspelled name.

Although we assume that we had checked for geographic outliers, we may now still want to limit our analysis to a specific area. In my case I want to get rid of non-Australian records, so I remove every record outside of a box of 9.5 to 44.5 degrees south and 111 to 154 degrees east around the continent. Although it turns out that this left parts of New Guinea in that is fine with me for present purposes, we don't want to over-complicate this now.
mydata <- mydata[mydata$long<154, ]
mydata <- mydata[mydata$long>111, ]
mydata <- mydata[mydata$lat>(-44.5), ]
mydata <- mydata[mydata$lat<(-9.5), ]
At this stage we may want to save the cleaned up data for future use, just in case.
write.table(mydata, file = "Lycopodiales_records_cleaned.csv", sep=",")
And now, finally, we can actually turn the point distribution data into grid cells and conduct a network analysis, but that will be the next (and final) post of the series.