Sunday, July 24, 2016

Botany picture #231: Daphne


Daphne (Thymelaeaceae), perhaps D. mezereum, as seen at Cockington Green in Canberra the previous weekend. I have always liked the genus because it flowers so early in the season and has such an amazing, strong floral scent. Unfortunately both the branches and the fruits of these species are rather poisonous, so they are not ideal for a garden frequented by small children.

The genus is not native to Australia, which has the large genus Pimelea instead.

Tuesday, July 19, 2016

I must be missing something

As I continue to contemplate Ebach and Michael's recent paper From Correlation to Causation: What Do We Need in the Historical Sciences?, I would first like to make clear that I enjoy reading and appreciate papers like this one. If we want to get things right it is crucial to be pushed out of our comfort zone from time to time, so asking a question like "have recent developments in the field gone totally in the wrong direction?" has its value.

That being said, however, it would appear to be a reasonable assumption that the >90% of experts are unlikely to have all overlooked a fundamental flaw in what they are doing. It could happen, yes, but especially as a non-expert one would need to see a rather good and clear argument before agreeing that they have.

With this in mind, I will now describe my thoughts about the core of the paper and my understanding of what the authors argue for and why. The first parts of the paper feature a lengthy discussion of the interpretation of ancestry and character evolution in phylogenetics and evolutionary biology, of unstructured and structured representations of the same data, and of the dangers of letting unwarranted assumptions distort the data. There might be quite a bit to be discussed here - for example the claim that "historical sciences, such as taxonomy and palaeontology ... are mostly descriptive and defy testing", which might be news to the discoverers of Tiktaalik, who were able to predict and subsequently test their prediction of where to look for this "missing link" - but the meat of the paper as I understand it starts with the criteria the authors suggest for "comparing ... assumptions against a well-attested set of aspects of causation".

Reference is made to the Bradford Hill Criteria of the medical sciences, and they are then adapted to the presumed needs of historical science, which, as discussed in the earlier post, the authors perceive to be fundamentally different from experimental science. The new Historical Sciences Bradford Hill Criteria are presented under key words that sometimes describe what to look for and sometimes describe what to avoid:

Selection bias. This is an obvious problem in science, although I must say that I do not find the specific examples provided by the authors to be the most convincing.

Temporality. I am afraid I do not really understand what is meant here, so I will quote the relevant paragraph from the paper in full, excepting references. "Present day distributions are the result of past events. Therefore there is the possibility that different taxa alive today may be resultant in different events that occurred at different times. In using a single historical event (e.g., the Oligocene drowning of New Zealand ~30 Ma) to address a larger biogeographical question has resulted in several debates about the age of the New Zealand biota. Single event hypothesis are rare, however little to no scrutiny is typically taken to test their validity."

Especially considering that in the original, i.e. medical science, criteria temporality was apparently about dose-response and reversibility of an effect, I am unsure where the above comes from. It is even less clear to me what is wrong with using a single historical event to address a large question. If New Zealand was indeed completely under water then it quite simply follows that all endemic land-living organisms would at that moment have perished, and that the current land-living organisms would have had to disperse into New Zealand afterwards, when it rose up again. (I am not qualified to assess if it was indeed completely under water, but that is not the point.)

Evidence for a mechanism of action. This again makes more sense to me, although we may have to agree to disagree about how plausible any given assumption of a mechanism of action is. The authors, for example, appear to consider dispersal from one biogeographic region to another to be implausible, but I believe that as long as the probability of that happening is not zero it would have to be a matter of weighing it against the plausibility of alternative explanations (such as those requiring that a family of flowering plants arose before multi-cellular life).

Coherence. In effect, this is the question of whether a claim fits what else we are currently confident we know. Makes sense.

Replicability or, as I would call it, reproducibility. The authors argue that historical data are not replicable, so the equivalent is correlation between different datasets.

Similarity. Do two different datasets for the same study group arrive at the same result? It is not entirely clear to me how this is different from the previous, but at any rate the two seem sensible enough.

In summary, some of the above is not clear to me, but other aspects appear immediately reasonable. In fact there would be a trivial interpretation of what the authors want to say: Examine your assumptions; remove indefensible assumptions; all else being equal, use the simpler model instead of an unnecessarily convoluted one.

But one would assume it cannot really be that easy, because there is hardly anybody in science who would disagree with that. Modellers are all aware of the danger of over-parameterisation; the problem is simply that all else is not always equal. Sometimes a more complex model is quite simply the better explanation. If you have a plot of dots forming a straight line as data, a very simple linear model with one parameter will do nicely. If you have a plot of dots forming an S-shaped pattern you will quite simply not be able to explain them with such a simple model; you need more parameters. I would suspect the same applies to biogeography; if there are data that defy the explanation of vicariance then our explanation needs to incorporate more processes than vicariance. I thus find it hard to accept the authors' judgement that "complex models are designed to extrapolate data under highly speculative assumptions" whereas "simple models, with plausible assumptions[,] are more likely to pass the [criteria]". It really depends.
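To make the dots example concrete, here is a minimal sketch (invented data, numpy only) showing that a best-fit straight line captures linear data essentially perfectly but leaves clearly non-zero residuals on S-shaped data, no matter how the two line parameters are chosen:

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 50)
y_line = 2.0 * x                      # data that really do form a straight line
y_sigm = 1.0 / (1.0 + np.exp(-x))     # data forming an S-shaped pattern

def linear_sse(x, y):
    """Sum of squared residuals of the best-fit straight line."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return float(np.sum(resid ** 2))

sse_line = linear_sse(x, y_line)  # essentially zero: the simple model suffices
sse_sigm = linear_sse(x, y_sigm)  # clearly non-zero: the model is too simple
```

The residuals on the S-shaped data are not just non-zero but systematically structured, which is exactly the signal that a model with more parameters (or a different functional form) is warranted.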

Similarly, everybody in science will agree that we shouldn't base our conclusions on bad assumptions. Problem is, everybody will argue that their assumptions are the good ones. Perhaps now would be a good moment to turn to the example provided by the authors of the present paper, to see how they use their new criteria. This might also clear up those aspects that I did not understand when reading the criteria themselves.

Interestingly, the four methods compared by the authors under their criteria are at least partly apples and oranges. They are Brooks Parsimony Analysis (BPA), Ancestral Area Analysis (AAA), Ebach's own Area Cladistics (AC), and the Dispersal-Extinction-Cladogenesis model (DEC). To the best of my understanding the point of BPA is to reconstruct how biogeographic regions are related, as in "the rainforests of New Zealand are sister to the temperate rainforests of Australia, and sister to both of them are the temperate rainforests of Patagonia" (this is not a quote but a hypothetical). We might also call this reconstructing the evolutionary history of biogeographic areas. In contrast, my understanding is that the other three are concerned with reconstructing the inverse, the biogeographic history of an evolutionary lineage, either in its entirety or at least to infer where the common ancestor of the lineage was found (although admittedly I was unable to look deeply into AC as the relevant paper was behind a paywall).

Still, all four are biogeographic methods. I found it easiest to once more proceed criterion by criterion.

Selection bias. DEC is criticised for assuming that "areas" are the result of dispersal and extinction, while the criterion is said to be inapplicable to the other three methods because "the type of area is not specified" in any of them. Once more I can only say that I don't get it.

There are two possible interpretations of "area" in this context. The first is that we are talking about the cells or regions defined a priori as the units of the analysis. If this is the case, then all four methods face the exact same problem, because in all cases the user has to define areas a priori. But this doesn't make sense because the cells defined for a DEC analysis are quite simply not "considered as [sic] a result of dispersal and extinction", they are the units of which a potentially larger range considered to be the result of dispersal and extinction consists.

The second possibility is that we are talking about the results. If this is the case, then yes, obviously a Dispersal-Extinction-Cladogenesis model assumes that the present ranges of organisms are the result of dispersal and extinction (and cladogenesis). That's the point. But if this is what we are talking about then we cannot simply say "doesn't apply" for the other three. AC, for example, assumes that current ranges are the result of vicariance, so at the very least it would need a green marker for a plausible assumption, if indeed we find this assumption plausible; realistically, we would have to start discussing whether vicariance as the only process makes sense.

Temporality. As mentioned above I don't understand how the things considered under this name are any more temporal than the ones that aren't. The example does not really clarify the matter for me either. BPA and DEC are criticised as "speculative" because they use "incongruence" (between distribution patterns of different lineages? I believe that is not how DEC works...) to "explain" or "justify" "ad hoc events" or "processes". First, I think what is meant here is the other way around, i.e. that BPA and DEC explain certain patterns by invoking events that the authors consider to be ad hoc assumptions, apparently in practice meaning any biogeographic process except vicariance.

Second and more importantly it is, to say the least, not clear to me why vicariance is less ad hoc than dispersal, extinction and cladogenesis, which just goes back to my earlier point that everybody thinks their preferred explanation is the plausible one. Anyway, AAA is likewise criticised as speculative because "duplicated areas are considered to be part of an original ancestral area". AC, on the other hand, is given a green for entirely plausible assumptions because it only assumes that "geographical congruence is a result of common historical processes". Taken on its own that may sound reasonable, but what about the incongruences? Are they simply ignored? As mentioned above, if there are data that defy a one parameter model then more parameters would appear to be warranted.

There is little to say about evidence for a mechanism of action because none of the four methods is given a clean bill of health. I actually find it rather impressive that the authors call this aspect even of their preferred method speculative for explaining every congruence with vicariance. I do not, however, understand what is meant by "tree topology determines all processes" in the case of DEC. Taken at face value it is plainly wrong because not only the phylogeny but also the present distribution data go into the analysis. What is more, the same necessarily applies in the three other methods, only that some of them use "areagrams" instead of the phylogenetic tree of a group of organisms.

Finally, the treatment of coherence, replicability, and similarity seems even stranger to me. AC is lauded for comparing its results against other data, and with the exception of BPA for similarity the three other methods are criticised for not doing so. But how does the method determine what the end user does with it? What if the user of AC decides not to make any further comparison? What if the user of the DEC model goes on to apply the model to the next four lineages occurring across the same biogeographic areas? How would using DEC exclude such a possibility?

Maybe I am missing something, but it seems to me as if all four methods generally merit at best the same colour, or level of plausibility, on all criteria. If anything I would look somewhat askance at BPA, AAA and AC for simply assuming that the concept of "areagrams" makes sense in the first place, because if there is any significant degree of exchange between biogeographic regions it doesn't.

Either way I am afraid I cannot claim to have understood how to apply these new criteria in an unbiased manner going forward.

Friday, July 15, 2016

Freedom!

In the light of two more recent rounds of the perennial Free Will discussion elsewhere, I think I now finally understand the incompatibilist position. Let's see if I got this right.

Free Speech. The right to express one's opinion without being punished. Generally considered to find its limits in libel and incitement to violence. In the stricter sense limited to the understanding that the government should not be able to punish a person for expressing their political views; on the other hand it can be argued that free speech in this strict sense alone would be hollow, that expressing an unpopular opinion should not be grounds for losing one's job either. Either way, this concept does not imply anything magic, is perfectly compatible with a deterministic universe, and we are free to say it.

Academic Freedom. Same as free speech but in the context of university employees, particularly tenured professors. Sometimes misunderstood to mean that professors have the right not to do the job they are being paid for without facing any repercussions at all, e.g. when somebody uses what should have been a science course to promote their religious beliefs or political ideology. Most importantly, this concept does not imply anything magic, is perfectly compatible with a deterministic universe, and we are free to say it.

Degrees of Freedom (Statistics). The number of parameters that can vary, that are not determined by others. In many models or statistical tests this number is one less than the total number of parameters, as the value of the last parameter follows necessarily from the values of the others. This concept does not imply anything magic, is perfectly compatible with a deterministic universe, and we are free to say it.

Degrees of Freedom (Mechanics). The number of ways in which a machine can move, counting dimensions and rotations around dimensions. A locomotive for example would have one, a car three (two dimensions and rotation around the third), an aeroplane six. This concept does not imply anything magic, is perfectly compatible with a deterministic universe, and we are free to say it.

Freedom of Religion. The right to practice one's religious faith without being punished for it. Sometimes badly confused with the right to also force others to adhere to the rules of one's own religion or to discriminate against members of other religions. This concept does not imply anything magic, is perfectly compatible with a deterministic universe, and we are free to say it.

Freedom of Movement. Commonly understood to mean the right to move without restriction through one's own country, including choosing one's place of residence, and to leave the country and return to it. This concept does not imply anything magic, is perfectly compatible with a deterministic universe, and we are free to say it.

Free Lunch / Entry / Drinks / etc. Descriptive of receiving a service or item that usually has to be paid for, without having to pay for it in this instance, generally because somebody else pays for it. Funnily enough this concept does not imply anything magic, is perfectly compatible with a deterministic universe, and we are free to say it.

Free Press. The right of the news media to report what is going on without being punished for it. Generally understood to be reasonably limited by the right to privacy and national security concerns. Generally understood to be an important aspect of a functioning democracy, as only a well informed electorate can make well informed decisions. This concept does not imply anything magic, is perfectly compatible with a deterministic universe, and we are free to say it.

Free Range Chickens. Chickens that are, while still obviously fenced in so that they do not escape, given a healthy amount of room to move around, as opposed to "battery" hens. This concept does not imply anything magic, is perfectly compatible with a deterministic universe, and we are free to say it.

Free Fall. The situation in which the only significant force acting on a body is gravity, as opposed to being held up by the ground or being slowed down by a parachute. Even this concept does not, despite having the word "free" in it, imply anything magic, is perfectly compatible with a deterministic universe, and we are free to say it.

Free Style. Being allowed to conduct an activity without having to follow strict rules or being required to achieve a set goal. This concept does not imply anything magic, is perfectly compatible with a deterministic universe, and we are free to say it.

Free Will. The ability to contemplate different possible courses of action and then decide between them in the absence of external pressure or pathological compulsion, resulting in actions that match one's preferences. Despite its equivalence with most of the other terms above, acceptance of this concept (and only of this concept) implies a belief in magic and a rejection of deterministic rules of cause-and-effect. Although a compatibilist view like the one just described was already promoted by the determinist stoics of Greek Antiquity, this view is actually nothing but goal-post moving by unreasonable contemporary philosophers who don't want to accept that neurophysiology has shown determinism to be true. And of course until a few years ago nobody ever had that determinism idea. What stoics? No idea what you are talking about. While we are at it, please ignore all religious traditions that have promoted determinism for hundreds of years because their gods are omniscient; focus on the traditions that have promoted magical, non-determinist Free Will because they were troubled by the Problem of Evil. Using the term even under the non-magical, compatibilist definition given earlier aids them (somehow), so the term Free Will (and only this term, but none of the other equivalent concepts containing the word "free") should not be used any more. Because that is totally going to happen. And when we need to describe the difference between, say, a coldly calculating thief and a kleptomaniac we will come up with something. Perhaps just use "voluntary" and pretend it is not simply the Latin translation of Germanic "out of one's own free will". 
Or maybe we don't need a word to describe the difference after all, because due to determinism the former had as little choice as the latter; then again, we also believe that there is a difference after all because we would still lock the former up but give the latter treatment, so maybe a term would be useful; then again, due to determinism the former had as little choice as the latter... (Oops, I think I entered an infinite loop there.)

Is that about correct? Bit uncertain about the end, but well, I am not an incompatibilist. There might also simply be different perspectives in the incompatibilist camp.

Wednesday, July 13, 2016

Experimental versus historical science

Long time no blog; it seems to come to me in bursts.

Anyway, a colleague has drawn my attention to a paper that has recently appeared, From Correlation to Causation: What Do We Need in the Historical Sciences?, by Ebach and Michael. It argues that "the integrity of historical science is in peril due [to] the way speculative and often unexamined causal assumptions are being used", and further suggests six criteria to check these supposedly speculative assumptions against.

In effect, the issue appears to be the use of models in phylogenetics and, in particular, in biogeography, and here, in particular and unless my reading between the lines is mistaken, the acceptance of any process except vicariance.

Before even delving into any other parts of the argumentation, it would be interesting to consider one of the underlying premises, which is clear already from the title of the paper: the assumption that historical sciences are fundamentally different from experimental sciences. As the authors write, "any evidence we adduce for some historical event needs must be contemporary evidence from which we make inferences on the basis of auxiliary hypotheses". But is it really any different in experimental sciences like medicine or physics? Do they not also have auxiliary hypotheses and assumptions at every step? Perhaps it is a failing on my part, but I at least cannot clearly see a marked difference.

Yes, of course we have easier access to evidence about things that happen around us every day today, and it is much easier to gather more of it. But that is a question of quantity, not of a qualitative, let alone epistemological, difference. To illustrate the point, let us consider an extremely simple case, the textbook statistics example of die throws.

First assume that I give you a die and ask you if you think it is loaded. You will then perhaps roll it twelve times, and get the result 2, 6, 5, 6, 6, 6, 6, 6, 3, 6, 6, and 6. Instinctively you might now conclude that it is, indeed, loaded. If you want to be scientific about it you would do statistics to calculate what the likelihood is of rolling these results with a fair die. It is, after all, possible that a fair die produces nine sixes out of twelve rolls; in fact it could produce a hundred sixes out of a hundred rolls, the question is merely how unlikely that is.
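The likelihood calculation appealed to here is a straightforward binomial tail sum; the rolls listed above contain nine sixes out of twelve. A minimal Python sketch:

```python
from math import comb

def prob_at_least(k, n=12, p=1/6):
    """Probability of rolling at least k sixes in n throws of a fair die."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Nine or more sixes out of twelve rolls is extremely unlikely for a fair die:
p_value = prob_at_least(9)  # on the order of 1e-5
```

A fair die could still produce this result, of course; the calculation only tells us how surprised we should be if it did.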

You have the die in your hand and you just did an experiment. Experimental science, right? Okay, now assume that after the twelve rolls described above, I snatch the die away from you and drop it in the Mariana Trench. It could be argued that from that moment on the research question "is the die loaded?" has turned into historical science. The twelve rolls are all the data we will ever get. From there we can take the next step and consider a scenario where we read about the twelve rolls in a book that is hundreds of years old. Surely now the question is squarely in the realm of historical science.

But has anything changed? I don't see it. The exact same statistical approach that applied before still applies afterwards. There is no difference in how we address the problem in either case.

And of course this situation is what we always face in science, in a certain sense. We don't literally have a die snatched away, but we do have time, money and other resource constraints. At some point we stop collecting data for any given study and analyse them. Consequently I fail to see where the philosophical difference is between being limited by the data that are available due to an accident of history and being limited by the data that are available due to, for example, our luck with DNA sequencing success before the project budget ran out.

The flip side of being limited by the dataset we have in any given situation is whether we can get more data in the future. Again, with experimental science we can get additional data more easily than in, say, palaeontology. But in real life historical research we are usually not reliant on a single die that has been destroyed either, as the most interesting questions are broader than that. So even with historical data we can usually go back and try to acquire more fossils or archaeological artefacts.

What we have considered so far was inferring what process operated in the past (a fair or a loaded die?) from data we have available (the results of twelve rolls). Thinking of biogeography this would be comparable to inferring whether long distance dispersal of plants and animals happened in the past from contemporary patterns of distribution. We can also flip that around now and consider the inference of past one-off events from processes we can still observe today. In biogeography, we can today observe spores, seeds, insects, birds, and ballooning spiders being blown across vast distances and arriving on remote shores. Did Rhipsalis, the only cactus genus naturally occurring outside of the Americas, arrive in Africa through chance dispersal across the ocean or is its current distribution the result of a much older vicariance event?

Of course this was a one-off event, and yes, we will never know the answer for certain. But again I fail to see the difference in principle. I cannot possibly know for certain that the sun will rise again tomorrow morning, but I can have a great deal of confidence in my admittedly tentative conclusion. Going back to the die example, if I give you a die and then ask you, "I rolled it once yesterday evening - what do you think the result was?", you cannot know it for certain either unless I tell you. But you can observe the process - you can roll it a thousand times - and then infer a probability distribution. If you find that it is severely loaded and produces a six 81% of the time, you may be willing to go so far as to suggest that my roll was a six.
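As a toy illustration of inferring a one-off past event from an observable process, one might simulate rolling the loaded die many times and estimate the probability of a six (the 81% figure comes from the text above; the seed and sample size are arbitrary):

```python
import random

random.seed(1)

def roll_loaded():
    """A die loaded so that a six comes up 81% of the time (toy assumption)."""
    return 6 if random.random() < 0.81 else random.randint(1, 5)

# Observe the process a thousand times...
rolls = [roll_loaded() for _ in range(1000)]
p_six = rolls.count(6) / len(rolls)

# ...and infer the most probable outcome of yesterday's unobserved roll.
best_guess = 6 if p_six > 0.5 else None
```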

In summary, I personally do not at this moment see the big difference between experimental and historical science; at least not a difference that could be used to argue that the latter cannot employ, for example, models of the same complexity as the former. Admittedly I am not a philosopher of science though.

Sunday, July 3, 2016

The Markov k model for discrete morphological data

The most frequently used model of character evolution for morphological data is called the Markov k (Mk) model. It was suggested by Lewis (2001) and is implemented in a few Likelihood or Bayesian phylogenetics programs.

The idea here is that there are several discrete character states. For continuous traits like organ lengths one would divide the continuum into categories, e.g. character state 0 for smaller than 5 cm and state 1 for larger than 5 cm. But as that is also how most people build their datasets for parsimony analysis, it means that the same data can often be used for both analyses.

Some software allows the states of one character to be ordered, so that to change from state 0 to state 2 a lineage has to pass through state 1, counting as two mutation steps. Some also allow for a gamma parameter, so that the different characters can fall into categories with different rates of change (some faster-evolving and some slower-evolving).

Another important consideration with morphological data is the scoring approach. Datasets of sequence regions generally contain all the sequence data that were obtained, i.e. both the positions that are variable and the ones that are entirely constant across the study group. When scoring morphological data, however, people tend not to include characters that are constant. Imagine building a trait list for several species of frogs - would you add a column for "wings" only to have "no" as the only state across the entire group? Probably not. However, some datasets may contain constant characters, and they may or may not contain characters that differ for only one species. The analysis has to be told what to expect so that branch lengths in the resulting phylogeny are modelled well.

After my recent dive into nucleotide substitution models I also looked up how to properly set the Mk model in PAUP and MrBayes.

The Mk model in MrBayes

The Mk model is set automatically for matrices with datatype = standard. These data can have states 0-9, which should generally be enough.

Depending on the coverage, one can then use lset coding = all if the dataset includes constant characters. Alternative options are variable if there are no constant characters, and informative if there are neither constant characters nor characters that differ for only one species. The Mk model with only variable characters is also sometimes called the Mkv model.

If there are no constant characters, equal rates of change for all characters can be assumed with lset rates = equal, variable rates with lset rates = gamma. If constant characters are included, my understanding is that propinv and invgamma should be used instead.

The default is that all characters are unordered. They can be changed to ordered by using the ctype command, as in ctype ordered: 2 4 for characters number two and number four.
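Putting these commands together, a minimal MrBayes input block for a morphological matrix without constant characters might look roughly like this (taxon names, dimensions, and MCMC settings are placeholders, not recommendations):

```
#NEXUS
begin data;
  dimensions ntax=13 nchar=23;
  format datatype=standard gap=- missing=?;
  matrix
    Species_A  01001101010010110100101
    [... further taxa ...]
  ;
end;

begin mrbayes;
  lset coding=variable rates=gamma;  [Mkv: no constant characters scored]
  ctype ordered: 2 4;                [characters 2 and 4 treated as ordered]
  mcmc ngen=1000000 samplefreq=1000;
end;
```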

The Mk model in PAUP

I have tried setting the Mk model in one of the new test versions of PAUP, specifically 4a149. To set the model as such, use lset nst = Mkv. Unfortunately, beyond that the options are rather limited. The model always assumes equal rates, and as that little v at the end indicates, it also seems to assume that all constant characters have been excluded.

Mk model versus parsimony: my admittedly anecdotal experience

I have always made clear that I am not really that terribly interested in philosophical foundations or statistical theory when using a phylogenetic method. For me the real questions are pragmatic ones:
  1. Does the method produce sensible results with empirical data, i.e. results that fit information that we have from other data?
  2. Does the method produce the correct results with simulated data?
  3. Is the method computationally feasible? (What good is a robust Bayesian coalescent approach if it takes weeks on a supercomputer even for six species?)
  4. Can the method be misled in certain scenarios? But if so, are these scenarios likely to be frequent, or are there other ways of dealing with them than discarding the method? (E.g. different data or better taxon sampling to deal with Long Branch Attraction.)
For the Mk model, the problem is mostly the first point. Just for the giggles, I have in the past used it on a few morphological datasets from small genera, and the results were generally much less convincing than the ones from parsimony analysis. I have also used it in Mesquite for ancestral character reconstruction along trees obtained from e.g. Bayesian analysis of sequence data, and the results were rather nonsensical.

That being said, after the recent publication claiming that Bayesian phylogenetics outperforms parsimony on simulated data, I tried again with a little dataset I am generating, at that point comprising only 23 traits for 13 species. I am happy to report that the results of running those data through MrBayes were much more meaningful than what I had seen in the past. So I will definitely keep that in mind as an option.

Another interesting observation, however, is that Likelihood or Bayesian analysis of morphological data tends to produce fully resolved trees where parsimony shows uncertainty clearly as polytomies. This is rather ironic given that one of the main arguments of Bayesians is that their preferred approach better shows uncertainty in the data. Of course one could point at low Posterior Probabilities and say, see, there is your measure of uncertainty, but then again support values are always worse for morphological data than for sequences simply because there are much fewer characters. It is not rare to have a dataset with fifty taxa but only twenty characters; of course you will never see a lot of 100% bootstraps or 1.00 PPs under those circumstances, even in the best cases. Thus a fully resolved tree will look very suggestive even at 0.57 PP where a polytomy tells us that we really don't know.

A final reason why I will not soon drop parsimony analysis for morphological data (even as I will give the Mk model more attention) is that there are numerous well-established ways of tailoring parsimony to how a character can be expected to evolve. Assume, for example, that you have four states 0, 1, 2 and 3, and that states 1-3 can each arise from 0 but not from each other (meaning that to get from 1 to 2 you have to pass through 0). Or assume that you want to treat a character so that its derived state was gained precisely once and cannot be regained once lost.

It would be easy to set this up in parsimony. Maybe it is possible to do this in a model-based analysis, but if so it is at least not part of the standard implementations. More generally, the assumption behind a model that one general process operates across all the characters in the analysis makes a lot of sense for molecular data. A base pair is a base pair, and all sequence positions will be affected by the same polymerase errors. But does it make nearly as much sense for morphology? A fruit shape is not the same thing as the presence or absence of stipules, and a collar bone shape is not the same thing as the possession of a red patch on the throat.
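For illustration, the first example above could be written as a step matrix in a NEXUS assumptions block for PAUP roughly as follows. Moving between any two of the states 1, 2 and 3 costs two steps because the path leads through 0; the usertype name is arbitrary, and the syntax follows my reading of the NEXUS standard rather than a file I have tested:

begin assumptions;
  usertype viazero (stepmatrix) = 4
    0 1 2 3
    . 1 1 1
    1 . 2 2
    1 2 . 2
    1 2 2 .
  ;
end;

A character would then be assigned this type with something like ctype viazero: 5; (for, say, the fifth character), and the second example corresponds to PAUP's built-in Dollo character type.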

Again, I am happy to admit that the Mk model in MrBayes surpassed my expectations, and I will use it more often in the future. I am, however, still not ready to do without the option of parsimony, at least for the admittedly rare cases when I want to analyse morphological data.

References

Lewis PO, 2001. A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology 50: 913-925.

Tuesday, June 28, 2016

More adventures in science spammer land

Thus writeth the Open Journal of Plant Science, whose mass eMail is totally not spam, oh no sir:
Please Note: This is not a spam message and has been sent to you because of your eminence in the field.
Alright then.
Greetings for the day!
If you don't want to be taken for a spammer, perhaps try not to begin the eMail like every other science spammer.
Peertechz was launched this Journal to support the Open Access in the way of publishing manuscripts, new technics and methods in science.
I was decided any papers not to submit to journals whose editors not write coherent sentence.
Open Journal of Plant Science published articles are freely available online to the readers for life time.
They don't specify, but presumably they mean the life time of their website, right?
The journal encourages the authors to publish their manuscripts in a large Open Access network: Peertechz and its looking for the manuscripts from selective scientists like you who have enormously contributed to the scientific community.

It would really be grateful to you if you can send us energetic and enthusiastic submission to successfully release the upcoming issue. Send us any type of article to increase the visibility of the Open Journal of Plant Science.
Sadly I don't think that I have ever had a manuscript that I would have called 'energetic', be it as an author or as a reviewer. How do they differ from non-energetic ones?
If you are interested, please respond us on or before 48 hours and send your paper by July 15th, 2016.
I have seen that with quite a few journal spammers before. They have a very close deadline and then a more distant deadline in the very same sentence. So which is it? Why would an author have to respond in two days? I strongly suspect that they would also accept manuscripts on 15 July from people who didn't contact them tomorrow. In fact, if we are talking about a regularly appearing journal here then it should accept manuscripts on 16 July, so why claim that there are deadlines at all?
We are looking forward to have valuable submission from you soon.
You can wait a long time for that.

---

I also recently saw an example of what comes out at the other end of the process. One of my publication alerts notified me that I had been cited, and the paper was curious enough that I followed the link. It is
Gayathiri et al. 2016. A review: potential pharmacological uses of natural products from Laminaceae. International Journal of Pharma Research & Review 5: 21-34.
Yes, "Laminaceae". Seven authors have written an alleged review article on the mint family Lamiaceae and have consistently misspelled its name throughout the entire manuscript. Ah no, I lie; they spelled it "laiminace" in the keywords, so scratch consistently.

This alone is just... indescribable. Seriously, how can somebody write a review about something and not know how it is spelled? How can seven authors and presumably at least one editor miss that, even in that journal?

Also, my name is misspelled in the reference list, the species I am cited about is misspelled, and as for the text itself, let's just quote a single sentence from the abstract.
Although, medicinal plants continue provide a new drug leads in drug discovery, and numerous challenges are encountered in procurement and selection of plant materials, screening method and the scale up of active compound, hence this brief review work presents a study of the importance of natural products, especially those derived from higher plants and aims to the highlight the pharmacological significant of Laminaceae family in terms of drug development.
Yay for science!

---

On Jeffrey Beall's blog, which has already covered criteria for (1) what it calls predatory journals and (2) fake impact factors, a discussion has been started about criteria to similarly classify conferences as "predatory". Suggestions by contributor James McCrostie are, in short:
  1. Any kind of deceit, such as hiding that the conference is for-profit or claiming that the organisers are based in a different country than they really are;
  2. No or inadequate peer review, including review only by the conference organisers; suspiciously fast acceptance of submitted talks is also mentioned later;
  3. High conference fees;
  4. Overly broad scope;
  5. Connections to companies known for running "predatory" journals;
  6. "Regularly accepting papers by undergraduates";
  7. Advertising the conference like a holiday;
  8. Spamming, especially to random people outside of the relevant field of science;
  9. People are allowed to give multiple presentations at the same conference.
(Numbers mine, the original list is more extensive and not numbered.)

I have received quite a few spam eMails advertising obviously crappy conferences in broken but hyperbolic English, often from fields I have no relation to whatsoever, such as fertiliser research or medicine. This is clearly a problem, although as with the journals I would argue that no competent scientist should ever fall for them. People should know what the relevant conferences in their field are and be able to delete these spam eMails at a glance. I assume they mostly prey on the desperate.

Still, I can see how a list of dodgy journals, for example, is useful to people outside of a field or totally outside of science, to help gauge if somebody's CV is inflated. I am just not so sure that it is quite as easy to develop useful criteria for recognising dodgy conferences as it is for journals.

Points 1, 4, 5, 7, and 8 are clearly valid. Deceit, spamming and being run by a crappy journal company are obvious alarm bells, and overly broad scope makes a conference practically useless for networking and learning. But the others? Not so much.

Maybe it is different in other fields, but I do not think that I have ever participated in a professional conference that peer reviewed the abstract submissions except for whether they fit into the scope of the meeting. In other words, the organisers of a systematic botany conference will look over the abstract, and if it were on theology or law or perhaps obviously pseudoscience they would kick it out, but that's that. The closest I have ever seen is that the committee would demote some talk submissions that were deemed less important to posters, and even that only at the very most prestigious meetings.

High fees? Well that is in the eye of the beholder I guess, but some legitimate conferences can be terribly pricey.

"Regularly accepting papers by undergraduates"? What? Are there seriously fields of research where the quality of work doesn't matter but only one's status and age? Luckily I am not working in one of them.

Similarly, what is the problem with giving two talks at the same meeting? At most conferences in my field there are two or three people who do that. Nobody has ever had an issue with that as long as they don't do it every time, and as long as nobody gives five or something like that.

Perhaps practices are just too different between the various fields of science and scholarship to find easy agreement on this.

Sunday, June 26, 2016

Implementation of substitution models in phylogenetic software

Concluding the little series of posts on nucleotide substitution models, below is a summary of my current understanding of how to set several of the models discussed in the previous post in PAUP and MrBayes. But first a few comments.

For PAUP there are three possible options for the basefreq parameter. My current understanding is that equal sets equal base frequencies as in the JC model (duh), empirical fixes them to the frequencies observed in the data set, and estimate has them estimated across the tree. My understanding is further that while estimate is more rigorous, empirical is often 'close enough' and saves computation time. The point is that wherever one of the latter two options appears below, I believe one could substitute the other without changing the model as such. I hope that is accurate.
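To make the three options concrete, and assuming my reading of the manual is right (the square brackets are NEXUS comments added for annotation, not part of the commands):

PAUP: lset basefreq=equal; [fixed at 0.25 each, as under JC]
PAUP: lset basefreq=empirical; [fixed to the frequencies observed in the data]
PAUP: lset basefreq=estimate; [estimated by Likelihood during the search]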

What I do not quite understand is how one would set models like Tamura-Nei in PAUP. At least one of the sources I consulted when researching for this post (see below) suggests that one can set up models in PAUP with, for example, variable transition rates but equal transversion rates, but the PAUP 4.0b10 manual states that the only options for the number of substitution rate categories are 1 (all equal), 2 (presumably to distinguish Tv and Ts), and 6 (all different). Would one not need nst = 3 or nst = 5 to set the relevant models? Perhaps the trick is to set nst = 6 but fix the substitution matrix? But that would mean one cannot estimate it during the analysis.

For the MrBayes commands note that the applyto parameter needs to be specified with whatever part(s) of the partition the particular model should apply to, as in applyto = (2) for the second part. The MrBayes commands I found in my sources appear to be rather short; I assume that default settings do the rest. Note also that the old MrBayes versions that I am familiar with have now been superseded by RevBayes, but I have no experience with the latter program's idiosyncratic scripting language.

In addition to PAUP and MrBayes, I mention whether a model can be selected in three other programs I am familiar with, RAxML, BEAST and PhyML. I have no personal experience with most other Likelihood phylogenetics packages. I tried to check what is available in MEGA, but it wouldn't install for me, and I am not sure if the list of models in the online manual shows only those that are actually available for Likelihood search or whether it includes some that are only available for Distance analysis. The relevant list is linked from a page on Likelihood, but its own text implies it is about Distance. Either way, MEGA appears to have lots of options but I didn't indicate them below.

General Time-Reversible (GTR)

PAUP: lset nst=6 basefreq=empirical rmatrix=estimate;

MrBayes: lset applyto=() nst=6;

Available in RAxML, BEAST 2.1.3 and PhyML 3.0.

Tamura - Nei (TN93 / TrN)

Available in BEAST 2.1.3 and PhyML 3.0.

Symmetrical (SYM)

PAUP: lset nst=6 basefreq=equal rmatrix=estimate;

MrBayes: lset applyto=() nst=6; prset applyto=() statefreqpr=fixed(equal);

Hasegawa-Kishino-Yano (HKY / HKY85)

PAUP: lset nst=2 basefreq=estimate variant=hky tratio=estimate;

MrBayes: lset applyto=() nst=2;

Available in BEAST 2.1.3 and PhyML 3.0.

Felsenstein 84 (F84)

PAUP: lset nst=2 basefreq=estimate variant=F84 tratio=estimate;

Available in PhyML 3.0.

Felsenstein 81 (F81 / TN84)

PAUP: lset nst=1 basefreq=empirical;

MrBayes: lset applyto=() nst=1;

Available in PhyML 3.0.

Kimura 2-parameters (K80 / K2P)

PAUP: lset nst=2 basefreq=equal tratio=estimate;

MrBayes: lset applyto=() nst=2; prset applyto=() statefreqpr=fixed(equal);

Available in PhyML 3.0.

Jukes Cantor (JC / JC69)

PAUP: lset nst=1 basefreq=equal;

MrBayes: lset applyto=() nst=1; prset applyto=() statefreqpr=fixed(equal);

Available in BEAST 2.1.3 and PhyML 3.0.

Invariant sites (+ I)

For PAUP, add the following to the lset command: pinvar=estimate

For MrBayes, add the following to the lset command: rates = propinv

Gamma (+ G)

For PAUP, add the following to the lset command: pinvar=0 rates=gamma ncat=5 shape=estimate (or another number of categories of your choice for ncat).

For MrBayes, add the following to the lset command: rates = gamma

Invariant sites and Gamma (+ I + G)

For PAUP, add the following to the lset command: pinvar=estimate rates=gamma ncat=5 shape=estimate (or another number of categories of your choice for ncat).

For MrBayes, add the following to the lset command: rates = invgamma
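Putting the fragments together, the commonly used GTR + I + G combination would then be set as follows. This is simply the GTR commands from above with the rate variation parameters appended, as my understanding of the manuals suggests, not a separately verified recipe:

PAUP: lset nst=6 basefreq=empirical rmatrix=estimate pinvar=estimate rates=gamma ncat=5 shape=estimate;

MrBayes: lset applyto=() nst=6 rates=invgamma;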

References

This post is partly based on the following very helpful sources, all accessed c. 20 June 2016.

Faircloth B. Substitution models in mrbayes. https://gist.github.com/brantfaircloth/895282

Lewis PO. PAUP* Lab. http://evolution.gs.washington.edu/sisg/2014/2014_SISG_12_7.pdf

Lewis PO. BIOL848 Phylogenetic Methods Lab 5. http://phylo.bio.ku.edu/slides/lab5ML/lab5ML.html

And a PDF posted on molecularevolution.org.