Tuesday, June 28, 2016

More adventures in science spammer land

Thus writeth the Open Journal of Plant Science, whose mass eMail is totally not spam, oh no sir:
Please Note: This is not a spam message and has been sent to you because of your eminence in the field.
Alright then.
Greetings for the day!
If you don't want to be taken for a spammer, perhaps try not to begin the eMail like every other science spammer.
Peertechz was launched this Journal to support the Open Access in the way of publishing manuscripts, new technics and methods in science.
I was decided any papers not to submit to journals whose editors not write coherent sentence.
Open Journal of Plant Science published articles are freely available online to the readers for life time.
They don't specify, but presumably they mean the life time of their website, right?
The journal encourages the authors to publish their manuscripts in a large Open Access network: Peertechz and its looking for the manuscripts from selective scientists like you who have enormously contributed to the scientific community.

It would really be grateful to you if you can send us energetic and enthusiastic submission to successfully release the upcoming issue. Send us any type of article to increase the visibility of the Open Journal of Plant Science.
Sadly I don't think that I have ever had a manuscript that I would have called 'energetic', be it as an author or as a reviewer. How do they differ from non-energetic ones?
If you are interested, please respond us on or before 48 hours and send your paper by July 15th, 2016.
I have seen that with quite a few journal spammers before. They have a very close deadline and then a more distant deadline in the very same sentence. So which is it? Why would an author have to respond in two days? I strongly suspect that they would also accept manuscripts on 15 July from people who didn't contact them tomorrow. In fact if we are talking a regularly appearing journal here then it should accept manuscripts on 16 July, so why claim that there are deadlines at all?
We are looking forward to have valuable submission from you soon.
You can wait a long time for that.

---

I also recently saw an example of what comes out at the other end of the process. One of my publication alerts notified me that I had been cited, and the paper was curious enough that I followed the link. It is
Gayathiri et al. 2016. A review: potential pharmacological uses of natural products from Laminaceae. International Journal of Pharma Research & Review 5: 21-34.
Yes, "Laminaceae". Seven authors have written an alleged review article on the mint family Lamiaceae and have consistently misspelled its name throughout the entire manuscript. Ah no, I lie; they spelled it "laiminace" in the keywords, so scratch consistently.

This alone is just... indescribable. Seriously, how can somebody write a review about something and not know how it is spelled? How can seven authors and presumably at least one editor miss that, even in that journal?

Also, my name is misspelled in the reference list, the species I am cited about is misspelled, and as for the text itself, let's just quote a single sentence from the abstract.
Although, medicinal plants continue provide a new drug leads in drug discovery, and numerous challenges are encountered in procurement and selection of plant materials, screening method and the scale up of active compound, hence this brief review work presents a study of the importance of natural products, especially those derived from higher plants and aims to the highlight the pharmacological significant of Laminaceae family in terms of drug development.
Yay for science!

---

On Jeffrey Beall's blog, which has already covered criteria for (1) what it calls predatory journals and (2) fake impact factors, a discussion has been started about criteria to similarly classify conferences as "predatory". Suggestions by contributor James McCrostie are, in short:
  1. Any kind of deceit, such as hiding that the conference is for-profit or claiming that the organisers are based in a different country than they really are;
  2. No or inadequate peer review, including review only by the conference organisers; suspiciously fast acceptance of submitted talks is also mentioned further down the original list;
  3. High conference fees;
  4. Overly broad scope;
  5. Connections to companies known for running "predatory" journals;
  6. "Regularly accepting papers by undergraduates";
  7. Advertising the conference like a holiday;
  8. Spamming, especially to random people outside of the relevant field of science;
  9. People are allowed to give multiple presentations at the same conference.
(Numbers mine, the original list is more extensive and not numbered.)

I have received quite a few spam eMails advertising obviously crappy conferences in broken but hyperbolic English, often from fields I have no relation to whatsoever, such as fertiliser research or medicine. This is clearly a problem, although as with the journals I would argue that no competent scientist should ever fall for them. People should know what the relevant conferences in their field are and be able to delete these spam eMails at a glance. I assume they mostly prey on the desperate.

Still, I can see how a list of dodgy journals, for example, is useful to people outside of a field or totally outside of science, to help gauge if somebody's CV is inflated. I am just not so sure that it is quite as easy to develop useful criteria for recognising dodgy conferences as it is for journals.

Points 1, 4, 5, 7, and 8 are clearly valid. Deceit, spamming and being run by a crappy journal company are obvious alarm bells, and overly broad scope makes a conference practically useless for networking and learning. But the others? Not so much.

Maybe it is different in other fields, but I do not think that I have ever participated in a professional conference that peer reviewed the abstract submissions except for whether they fit into the scope of the meeting. In other words, the organisers of a systematic botany conference will look over the abstract, and if it were on theology or law or perhaps obviously pseudoscience they would kick it out, but that's that. The closest I have ever seen is that the committee would demote some talk submissions that were deemed less important to posters, and even that only at the very most prestigious meetings.

High fees? Well that is in the eye of the beholder I guess, but some legitimate conferences can be terribly pricey.

"Regularly accepting papers by undergraduates"? What? Are there seriously fields of research where the quality of work doesn't matter but only one's status and age? Luckily I am not working in one of them.

Similarly, what is the problem with giving two talks at the same meeting? At most conferences in my field there are two or three people who do that. Nobody has ever had an issue with that as long as they don't do it every time, and as long as nobody gives five or something like that.

Perhaps practices are just too different between the various fields of science and scholarship to find easy agreement on this.

Sunday, June 26, 2016

Implementation of substitution models in phylogenetic software

Concluding the little series of posts on nucleotide substitution models, below is a summary of my current understanding of how to set several of the models discussed in the previous post in PAUP and MrBayes. But first a few comments.

For PAUP there are three possible options for the basefreq parameter. My current understanding is that equal sets equal base frequencies as in the JC model (duh), empirical fixes them to the frequencies observed in the data set, and estimate has them estimated across the tree. My understanding is further that while estimate is more rigorous, empirical is often 'close enough' and saves computation time. The point is that wherever one of the latter two options appears below, I believe one could also use the other without changing the model as such. I hope that is accurate.

What I do not quite understand is how one would set models like Tamura-Nei in PAUP. At least one of the sources I consulted when researching for this post (see below) suggests that one can set in PAUP models for example with variable transition rates but equal transversion rates, but the PAUP 4.0b10 manual states that the only options for the number of substitution rate categories are 1 (all equal), 2 (presumably to distinguish Tv and Ts), and 6 (all different). Would one not need nst = 3 or nst = 5 to set the relevant models? Perhaps the trick is to set nst = 6 but fix the substitution matrix? But that would mean one cannot estimate it during the analysis.

For the MrBayes commands note that the applyto parameter needs to be specified with whatever part(s) of the partition the particular model should apply to, as in applyto = (2) for the second part. The MrBayes commands I found in my sources appear to be rather short; I assume that default settings do the rest. Note also that the old MrBayes versions that I am familiar with have now been superseded by RevBayes, but I have no experience with the latter program's idiosyncratic scripting language.

In addition to PAUP and MrBayes, I mention whether a model can be selected in three other programs I am familiar with, RAxML, BEAST and PhyML. I have no personal experience with most other Likelihood phylogenetics packages. I tried to check what is available in MEGA, but it wouldn't install for me, and I am not sure if the list of models in the online manual shows only those that are actually available for Likelihood search or whether it includes some that are only available for Distance analysis. The relevant list is linked from a page on Likelihood, but its own text implies it is about Distance. Either way, MEGA appears to have lots of options but I didn't indicate them below.

General Time-Reversible (GTR)

PAUP: lset nst = 6 basefreq = empirical rmatrix = estimate;

MrBayes: lset applyto=() nst=6;

Available in RAxML, BEAST 2.1.3 and PhyML 3.0.

Tamura - Nei (TN93 / TrN)

Available in BEAST 2.1.3 and PhyML 3.0.

Symmetrical (SYM)

PAUP: lset nst = 6 basefreq = equal rmatrix = estimate;

MrBayes: lset applyto=() nst=6; prset applyto=() statefreqpr=fixed(equal);

Hasegawa-Kishino-Yano (HKY / HKY85)

PAUP: lset nst=2 basefreq=estimate variant=hky tratio=estimate;

MrBayes: lset applyto=() nst=2;

Available in RAxML, BEAST 2.1.3 and PhyML 3.0.

Felsenstein 84 (F84)

PAUP: lset nst=2 basefreq=estimate variant=F84 tratio=estimate;

Available in PhyML 3.0.

Felsenstein 81 (F81 / TN84)

PAUP: lset nst=1 basefreq=empirical;

MrBayes: lset applyto=() nst=1;

Available in PhyML 3.0.

Kimura 2-parameters (K80 / K2P)

PAUP: lset nst=2 basefreq=equal tratio=estimate;

MrBayes: lset applyto=() nst=2; prset applyto=() statefreqpr=fixed(equal);

Available in RAxML and PhyML 3.0.

Jukes Cantor (JC / JC69)

PAUP: lset nst=1 basefreq=equal;

MrBayes: lset applyto=() nst=1; prset applyto=() statefreqpr=fixed(equal);

Available in RAxML, BEAST 2.1.3 and PhyML 3.0.

Invariant sites (+ I)

For PAUP, add the following to the lset command: pinvar = estimate

For MrBayes, add the following to the lset command: rates = propinv

Gamma (+ G)

For PAUP, add the following to the lset command: pinvar = 0 rates = gamma ncat = 5 shape = estimate (or another number of categories of your choice for ncat).

For MrBayes, add the following to the lset command: rates = gamma

Invariant sites and Gamma (+ I + G)

For PAUP, add the following to the lset command: pinvar = estimate rates = gamma ncat = 5 shape = estimate (or another number of categories of your choice for ncat).

For MrBayes, add the following to the lset command: rates = invgamma
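For concreteness, this is roughly what the pieces above look like once combined into a complete MrBayes block, here for GTR + I + G on an unpartitioned data set. This is only a sketch: the mcmc settings are arbitrary placeholders, and everything should be checked against the manual of your MrBayes version.

```
begin mrbayes;
    lset applyto=(all) nst=6 rates=invgamma;  [GTR + I + G]
    mcmc ngen=1000000 samplefreq=1000;        [placeholder run settings]
end;
```

For a model with fixed equal base frequencies, such as SYM, one would add the corresponding prset line from above before the mcmc command.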

References

This post is partly based on the following very helpful sources, all accessed c. 20 June 2016.

Faircloth B. Substitution models in mrbayes. https://gist.github.com/brantfaircloth/895282

Lewis PO. PAUP* Lab. http://evolution.gs.washington.edu/sisg/2014/2014_SISG_12_7.pdf

Lewis PO. BIOL848 Phylogenetic Methods Lab 5. http://phylo.bio.ku.edu/slides/lab5ML/lab5ML.html

And a PDF posted on molecularevolution.org.

Friday, June 24, 2016

What number of schemes to test to get what substitution models in jModelTest

Another thing that occurred to me that might be useful to write about in the context of substitution models concerns jModelTest.

The background is here that before setting a model in a phylogenetic analysis, one would usually conduct model-testing. There are some programs that do their own testing, but others don't, and consequently people have written software that examines your dataset and suggests the best model. The classic program was simply called ModelTest. It was developed by David Posada, and it was primarily used through and for PAUP. It actually wrote out the PAUP commands of the best-fitting model, so that one could simply copy and paste them over into the Nexus file, and off we go.

Then people wanted to use the results of the model test for MrBayes. Problem was, MrBayes didn't do all the models that PAUP did, and it was annoying to see a model suggested that one couldn't implement. So Johan Nylander produced a modified version called MrModeltest, and it conveniently wrote out the MrBayes and PAUP commands for the best-fitting model.

These programs have now been superseded by jModelTest. On the plus side, this tool is highly portable and extremely user-friendly thanks to its GUI. Also, the user does not have to have PAUP, but instead jModelTest simply comes packaged with PhyML and hands the data over to that software. Also, it is parallelised. On the other hand, in contrast to its predecessors it does not appear to write out the PAUP or MrBayes code. But well, that is what my next post is going to deal with.

For the moment I want to address a different problem: when starting the Likelihood calculations for various models, the user can select whether 3, 5, 7, 11 or 203 (!) substitution schemes, and thus twice that number of models, are going to be tested. (And with or without invariant sites and gamma, if desired.) But the thing is, it is not clear from just looking at the interface which models, for example, the seven schemes are going to cover. If I select seven, will all the models I could implement in BEAST be included? Or will I just waste a lot of computing time on testing models that BEAST can't do anyway?

So I just ran all the numbers across a very small dataset, and this is what models are being tested in each case:

3 substitution schemes: JC, F81, K80, HKY, SYM, GTR

Three schemes is the highest you need to test if deciding on a model for RAxML or MrBayes. Note that as of updating this post (30 Nov 2016) RAxML can only do JC, K80, HKY, and GTR, while MrBayes can do all six.

5 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TPM1, TPM1uf, SYM, GTR

I am writing them out as named in jModelTest, but note that TPM1 is synonymous with K81 or K3P, and TrN is called TN93 in some software. An "uf" clearly means a variant with unequal base frequencies, and as mentioned in the previous post "ef" is similarly a variant with equal base frequencies.

Five schemes is the highest you need to do if you are model-testing for a BEAST run, because all its models are covered. It also seems to me as if with the exception of F84 all models available in PhyML are covered, and that one doesn't appear to be tested under higher numbers of substitution schemes either. So the same consideration applies, unless jModelTest uses a different name for it (?).
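The 'which scheme setting do I need for which program' logic can be written down as a small lookup. A sketch, using the model lists above and taking the BEAST model set (JC, HKY, TrN, GTR) from my post on implementation; the function name is of course made up:

```python
# Models tested by jModelTest for a given number of substitution schemes,
# as listed in this post (each scheme is tested with equal and unequal
# base frequencies, hence two models per scheme).
TESTED = {
    3: {"JC", "F81", "K80", "HKY", "SYM", "GTR"},
    5: {"JC", "F81", "K80", "HKY", "TrNef", "TrN",
        "TPM1", "TPM1uf", "SYM", "GTR"},
}

def minimal_scheme_setting(program_models):
    """Smallest number of schemes whose tested models cover the program's."""
    for n in sorted(TESTED):
        if program_models <= TESTED[n]:
            return n
    return None  # would need the 7, 11 or 203 scheme settings

# MrBayes covers exactly the six models of the 3-scheme run:
print(minimal_scheme_setting({"JC", "F81", "K80", "HKY", "SYM", "GTR"}))  # 3
# BEAST adds TrN, so it needs the 5-scheme run:
print(minimal_scheme_setting({"JC", "HKY", "TrN", "GTR"}))  # 5
```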

7 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TIM1ef, TIM1, TVMef, TVM, TPM1, TPM1uf, SYM, GTR

11 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TPM1, TPM1uf, TPM2, TPM2uf, TPM3, TPM3uf, TIM1ef, TIM1, TIM2ef, TIM2, TIM3ef, TIM3, TVMef, TVM, SYM, GTR

At this stage it becomes clear that there are a lot of strange variants of the Three-Parameter and Transitional Models that I overlooked in my previous post. I don't think they are used very often though...

203 substitution schemes: This adds a gazillion models named with a scheme of six letters or six letters plus F. Not sure how many people ever use them, so I will just stop here. I have a strong tendency to find GTR or HKY suggested for my data anyway...

(Updated 30 Nov 2016 to include information for RAxML and MrBayes.)

Thursday, June 23, 2016

Overview of the substitution models

Continuing from the previous post, here is a run-down of the commonly used models, their abbreviations, alternative names and abbreviations, and parameters. They are sorted from most parameter-rich to simplest.

General Time-Reversible (GTR)
Base frequencies variable.
All six substitution rates variable.
Number of free parameters: 8

Transversional (TVM)
Base frequencies variable.
All four transversion rates variable, transitions equal.
Number of free parameters: 7

Transitional (TIM)
Base frequencies variable.
Two different transversion rates, transition rates variable.
Number of free parameters: 6

Symmetrical (SYM)
Base frequencies equal.
All six substitution rates variable.
Number of free parameters: 5

Kimura 3-parameters with unequal base frequencies (K81uf, K3Puf, TPM1uf)
Base frequencies variable.
Two different transversion rates, transition rates equal.
Number of free parameters: 5
(The original K81 / K3P / TPM1 assumes equal base frequencies and accordingly has only 2 free parameters.)

Tamura-Nei (TN93, TrN)
Base frequencies variable.
Transversion rates equal, transition rates variable.
Number of free parameters: 5

Hasegawa-Kishino-Yano (HKY)
Base frequencies variable.
Transversion rates equal, transition rates equal.
Number of free parameters: 4
Differs from F84 in how the transversion rate and transition rate parameters are derived.

Felsenstein 84 (F84)
Base frequencies variable.
Transversion rates equal, transition rates equal.
Number of free parameters: 4
Differs from HKY in how the transversion rate and transition rate parameters are derived.

Felsenstein 81 (F81, TN84)
Base frequencies variable.
All substitution rates equal.
Number of free parameters: 3

Kimura 2-parameters (K80, K2P)
Base frequencies equal.
Transversion rates equal, transition rates equal.
Number of free parameters: 1

Jukes-Cantor (JC, JC69)
Base frequencies equal.
All substitution rates equal.
Number of free parameters: 0
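The counts above all follow one simple rule: (number of distinct substitution rate classes minus one) plus three if base frequencies are variable, plus zero if they are fixed equal. A quick sketch verifying this for most of the models listed (the function name is mine):

```python
# For each model: (base frequencies variable?, number of distinct rate classes)
MODELS = {
    "GTR": (True, 6),   # all six rates differ
    "TVM": (True, 5),   # four transversion rates + one transition rate
    "TIM": (True, 4),   # two transversion classes + two transition rates
    "SYM": (False, 6),
    "TrN": (True, 3),   # one transversion class + two transition rates
    "HKY": (True, 2),   # transversions vs transitions
    "F81": (True, 1),
    "K80": (False, 2),
    "JC":  (False, 1),
}

def free_parameters(freqs_variable, rate_classes):
    # Three free frequency parameters (the fourth follows from the others),
    # and one rate class serves as the reference rate.
    return (3 if freqs_variable else 0) + (rate_classes - 1)

for name, spec in MODELS.items():
    print(name, free_parameters(*spec))
```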

And below an overview of the same information in graphical form. The numbers in brackets are the numbers of free parameters that the relevant axis is contributing.


Apparently at least some people fill the empty fields on the left in by using the name of the relevant model to the right and adding "ef" to it. So TVMef would, for example, be a model with all transversion rates variable, transition rates equal, and equal base frequencies. But there is probably a reason why those missing models don't have any well known names; they seem to be rarely used, if ever.

Tuesday, June 21, 2016

Nucleotide substitution models

I recently had reason to consider alternative nucleotide substitution models and got a bit frustrated at the scarcity of clear, easily understood summaries of (1) their differences, (2) their availability in different phylogenetic programs, and (3) how to implement them in the various programs.

This is not to say that the information isn't out there. It is out there, and other people may be very satisfied with the form in which it can be found, e.g. in Wikipedia articles full of mathematical matrices. But because I had a hard time finding exactly what I would have wanted I am going to write a few posts over the next few days summarising it in a way that seems intuitive to me. And perhaps this will also be useful to others.

Some throat clearing will be necessary.

Models

First, what are models? I am certainly not a modeller but was originally trained in parsimony analysis. Consequently I am not the ideal person to introduce this, and I may make a few slips further down the line. But the principle is rather simple: models are mathematical descriptions or abstractions that allow us to describe a process and ideally to make predictions, at least stochastically.

Models may differ vastly in their complexity, in the number of their parameters. Imagine we want a model that allows us to describe how fast I am cycling under different circumstances, assuming that I never change gears. A simple model would perhaps have only one parameter, how much effort I invest. We can now put a value in that describes the energy I am using to pedal and get out a speed value.

We may compare this model against observations of me cycling, and we will find that while there is some fit it is far from ideal. Most likely adding another parameter, the force of headwind I have to work against, will improve the fit of the model to the data. Perhaps adding a third parameter, the incline on which I am cycling at any given moment, will further improve the model, and so on.

Overparameterisation

It is often the case that a model becomes better the more parameters are added. There are, however, most likely diminishing returns, as adding the tenth factor will improve the model less than adding the second. Worse, there is a concept called 'overparameterisation', meaning that a model can be too complicated for its own good. I found a very nice explanation of what that means here.

In case the link is dead when you read this, their example is predicting the seventh point on a graph from the first six points. A very simple model would be a straight line (the usual y = a * x + b), a more complex quadratic model would be a simple curve, and even more complex, higher-degree polynomial models would be wavy lines.

If the six points are not exactly on a straight line (perhaps just because there was some noise in the data), it is trivial to produce a very complex model that fits them better than the simple line, because it can get much closer to each of them. But such models would work much less well at predicting a seventh point outside of the range of the first six if the underlying pattern really was just a straight line, because they tend to shoot off to very high or low values at either end of the range to which they were fitted.
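This behaviour is easy to reproduce in a few lines: six points that lie on a straight line except for a little noise, a least-squares line, and the degree-five polynomial that passes exactly through all six points. The numbers here are made up purely for illustration.

```python
# Six observations: y = 2x + 1 plus a little noise.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.2, 2.7, 5.1, 7.4, 8.8, 10.9]

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def lagrange(xs, ys, x):
    """Value at x of the unique degree-(n-1) polynomial through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

a, b = fit_line(xs, ys)
# Predict the seventh point at x = 6; the true underlying value is 13.
line_pred = a + b * 6            # stays close to 13
poly_pred = lagrange(xs, ys, 6)  # shoots off (about 19.9 with these numbers)
```

The polynomial hits every observed point exactly, yet its extrapolation is far worse than that of the humble straight line.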

This is also related to the reason why I am currently looking into models. It has now happened for the second time this year that jModelTest suggested substitution models that turned out to be too parameter-rich when I tried to use them in BEAST, prompting me to consider which one I should use instead and why.

Nucleotide substitution models

As their name indicates, nucleotide substitution models are used in phylogenetic analysis of DNA sequence data. They differ in the number of parameters but also have several things in common. Most importantly, all commonly used models assume reversibility, meaning a given substitution is as likely as the reversal of that substitution.

Parameters of the nucleotide substitution models

Obviously, this cuts the number of possible parameters that can be used to describe substitutions in half, but it still leaves the following:


Each of the six double-headed arrows is a possible parameter describing the likelihood of substitution of one nucleotide in the DNA with another (and the reverse). They can be grouped into classes, most obviously transitions and transversions. Nucleotides fall into two chemically different groups, purines and pyrimidines. A substitution within a group is called a transition (Ts), a substitution between purine and pyrimidine is called a transversion (Tv). Accordingly there are parameter-poorer models that assign the same probability to all transversions and/or to all transitions, or something in-between.

Relevant for model complexity is, by the way, not the number of parameters but the number of free parameters. Even in a model that assigns a different rate to each of the six substitutions, only the relative rates matter, so one of them can serve as the reference against which the other five are expressed; consequently there is always one free parameter less than there are parameters.

In addition to these up to six substitution parameters, models may have the four base frequencies as additional parameters. Again the number of free parameters is one less, i.e. three, because obviously the fourth base frequency must be one minus the other three. Other models simply assume equal base frequencies; in these cases the number of free parameters from that part of the model is zero.

Any of the substitution models described by these sets of parameters can then be modified as '+ I', '+ G', and '+ I + G'. I is the invariant sites model, which involves an estimate of the proportion of invariant sites in the data. G is the gamma model, which allows sites to evolve at different rates, using a set number of gamma rate categories and a shape parameter alpha that describes how the rates are distributed across sites. I found this explanation rather helpful.

What this is going to be about

Finally, note that the following posts will not be about how the likelihood function is maximised, how phylogenetic software searches for the best tree given a model, or any theoretical or mathematical background. Honestly, at least to some degree the end user of phylogenetic software would be justified to treat much of this as a black box. The point is rather to summarise information on what parameters the different models have, how they are related, where and how they are implemented, and even such minor but useful to know things as which two or three names confusingly refer to the very same model.

Saturday, June 18, 2016

Taxonomic terminology again

I have often complained about idiotically unusable identification keys. Unclear descriptions of species are a related issue.

Yesterday I read the description of a daisy and encountered the term 'peracute'. Acute means pointy and is usually understood to mean that the apex of an organ shows an angle of less than 90 degrees (as opposed to obtuse for more than 90 degrees). The most frequently encountered problem is somebody writing 'subacute' to indicate... well, that is a good question. Pointy but not very pointy? And then of course we have issues with some taxonomists using related terms for pointy apices that show some kind of curvature, like apiculate or attenuate, in inconsistent ways.

But I have never before come across peracute. My first reaction was to try and translate it. Again, acute means pointy. Per means through, so through-acute? That does not make any sense whatsoever. A more modern meaning is 'for each', as in twenty kilometres per hour. For each acute? Yeah, not very sensible either.

But wait, I have that old glossary I wrote about in March. But no, sadly it doesn't contain that term either. So really, what is the point if a taxonomist uses a word that simply doesn't exist? Who will understand the description under those circumstances? Wouldn't it be nice if we could define four to six clear and universally agreed-on terms for, in this case, apex shapes, and use them, and only them, consistently?

Ah well. At least I had the opportunity to peruse the glossary again. And of course peracute is nothing compared to what is actually in there. This book is the mother lode of bad ideas.

Pampinus - n. Tendril.
Then why not write tendril? Ye gods.

Panduriform - a. Fiddle-shaped.
Then why not... oh, we had that already.

Parastomon - n. An abortive stamen, a staminodium.
You know, we already have a word for that. It is 'staminodium' or 'staminode'.

Phoranthium - n. The receptacle of the capitulum of Compositae.
Ah, I have heard of that structure, only everybody calls it a ... receptacle.

Argh. Argh. Argh. Argh. Headdesk. Are some people actually deliberately trying to make their taxonomic publications pointless, useless and maximally infuriating?

Friday, June 17, 2016

Botany picture #230: Jacobaea vulgaris


Jacobaea vulgaris (Asteraceae), also known as Senecio jacobaea, Germany, 2005. This species is a widely introduced weed, and among the places it has been introduced to is Australia.

I had known for some time that it and several relatives had been officially segregated out into Jacobaea. Having met it first as a student under a Senecio name, not seeing any obvious morphological differences from the species that remained in that genus, and, perhaps most importantly, without any pressing need to research the matter, I was unsure what to make of it.

Perhaps this was just some random splitting based on a whim? Or perhaps there had been a splitting option and a lumping option for making the group monophyletic, and the splitting option that creates two indistinguishable genera was preferred over a lumping option that would have created a heterogeneous genus? For all I knew this may well have been one of those scenarios that 'evolutionary' systematists always complain about.

Recently, however, a colleague discussed the matter with me, and that prompted me to properly study the literature. The phylogeny of the Senecioneae was thoroughly resolved in a series of papers published around ten years ago by Pieter Pelser and collaborators. Perhaps the best overview for present purposes is provided by figure 1 of Pelser et al. (2007, Taxon 56: 1077-1104). It is a single phylogeny broken up over ten pages. The part that concerns Jacobaea is figure 1F, and it can conveniently be summarised as follows:


So this is not just paraphyletic. Even 'evolutionary' systematists should be unable to accept a genus that includes Jacobaea and Senecio in the strict sense. But surely then two clades separated by so many other clades, among them illustrious genera like Crassocephalum and Dendrosenecio, must have some diagnostic characters? Sadly, no:
Although Jacobaea is distinguished from Senecio s.str. and other genera in Senecioneae by DNA sequence characters (Pelser & al. 2002), clear morphological synapomorphies for Jacobaea have not been identified to date.
That may be very unsatisfying, but life isn't always fair. There is no rule that says that we will always find diagnostic characters to tell all distantly related groups apart.