Friday, April 5, 2013

Biodiversity Genomics Conference, last day

Today I participated in a workshop on phylogenomics. It was a somewhat mixed bag; on the one hand, there were a few really useful elements, on the other hand one of the presenters merely repeated, virtually one to one, a talk that he had already given on Wednesday. Ah well.

The conference was very rewarding and clearly a great success. Apparently significantly more people wanted to register than there were places. The participants hailed from various continents and represented many different fields of research, from American evolutionary biologists and German entomologists to Australian soil researchers and New Zealand conservation biologists. I have heard from many colleagues how much they enjoyed it and how much they got out of it, and several people have even suggested holding a conference like this every year, especially considering how fast the genomics field is evolving.

I cannot help, however, but end on a somewhat skeptical note about the advances in genomics and Next Generation Sequencing, seen from the perspective of my research interest in phylogenetics. One of the presentations today drove home just how mind-bogglingly, unmanageably and ridiculously huge the amounts of data being produced are. Again, the 1KITE project is sequencing the transcriptomes of 1,000 insect species, meaning all the genes that the insects were expressing at the moment they were sampled. And then they want to use these data to produce a better phylogeny of the insects. Admittedly, they can probably use the same data to do a couple dozen other things in addition, but from a phylogenetics perspective, and considering that many other people are doing the same, does sequencing entire whatever-omes really, when we come right down to it, make any sense?
  • Nobody appears to know even where to put all these data. That is true on the level of the individual researcher, who buys a cutting-edge external hard drive only to run out of space one project later, but also on the level of the scientific community as a whole. One participant actually asked that question on Wednesday: GenBank accepts annotated traditional Sanger sequences of individual genes and is already struggling to keep up, but where do I submit a terabyte of genomic DNA sequences to fulfill the requirement that they be publicly available to colleagues who want to reproduce my results? This is getting out of hand pretty quickly.
  • Nobody seems to know how to analyze them appropriately (for phylogenetic purposes). Data analysis was a major and controversial topic in today's workshop. I don't want to go into details, but for massive amounts of genomic data from numerous samples the only option currently seems to be to concatenate all of it and to use phylogenetic tools that trade sophistication for insane speed. At the same time, the people generating all these data are keenly aware that what they really want and need are complex models of DNA evolution, complicated partitioning of the data, and species tree methods, but the software that could do those analyses falls over if you try to run it on even just 10% of the available data.
  • And here is the kicker: For phylogenetic purposes, nobody actually needs that much data. Alan Lemmon was entirely correct when he said that 400-500 independent loci are more than enough for phylogenetic analysis. But even that might already be overkill, as Leaché & Rannala (2011) found that 10-100 loci are generally sufficient even in difficult cases, and as few as 10 in simple ones. In other words, do we really need to expend a lot of effort and money on sequencing entire genomes and producing reams and reams of data if using 0.5% of those data would already give us precisely the same result? Don't get me wrong, this all makes sense if you are interested in exploring signals of adaptation in the genome and suchlike, but at the moment it appears the phylogeneticist is presented with a shiny new nuclear bomb and told that this is a good way to kill the flies in their house. What we need is a labor- and cost-efficient way of capturing, say, 40-50 loci for a few hundred samples (preferably from low amounts of extracted DNA) instead of laborious ways of producing insane amounts of useless data for a few samples. The same goes for population geneticists, by the way.
