Evo-Karma: April 2012

Thursday, April 19, 2012

Crowdsourcing science project for phylogenies?

The idea behind crowdsourcing is that the answer to a question is often more likely to be correct if you average the answers from a large number of non-experts rather than a single expert in the field. The term "crowdsourcing" has also been used for projects that outsource repetitive or challenging work to a crowd via the internet.
I have been thinking of outsourcing the problem of converting phylogenies embedded in PDFs back to Newick/NEXUS format, and have been looking at various science projects that have used crowdsourcing.
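For anyone who has not met the target format, here is a rough sketch of what a digitised tree looks like in Newick, along with one way to count its leaves in Python. The example tree and the function name `count_leaves` are made up for illustration, and the regular expression assumes simple Newick: at least two leaves, no quoted labels, no bracketed comments and no unnamed leaves.

```python
import re


def count_leaves(newick: str) -> int:
    """Count leaf labels in a simple Newick string.

    Assumes no quoted labels, no [comments] and no unnamed leaves.
    """
    # A leaf label starts right after "(" or ",".
    # Labels that follow ")" name internal nodes, so they are skipped,
    # and branch lengths follow ":" so they are skipped too.
    return len(re.findall(r"(?<=[(,])\s*[^\s(),:;]+", newick))


# Illustrative tree only: three taxa with branch lengths.
tree = "((human:0.1,chimp:0.1):0.2,gorilla:0.3);"
print(count_leaves(tree))
```

Summing this over a folder of tree files would give a quick tally of how many leaves have been digitised so far. For anything beyond a quick check, a proper Newick parser would be the safer choice.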

The most impressive from my point of view is Galaxy Zoo, which has already resulted in a number of publications and impressive discoveries. Astrophysicists use the crowd to categorise thousands of galaxies and have expanded the crowd tasks to include matching images of galaxies with randomly simulated images.

Stardust@Home is another astrophysics project, which asks the crowd to look through images for dust particles brought back to Earth by a spacecraft in 2006.

Another cool project is the Open Dinosaur Project, which asks the crowd to aggregate measurements of dinosaur limb bones for many different taxa, both published in the literature and taken directly from specimens, to study the evolutionary transitions from bipedality to quadrupedality.

Foldit is a computer game enabling the crowd to contribute to our understanding of how proteins fold. Figuring out which of the many, many possible structures is the best one is regarded as one of the hardest problems in biology today, and current methods take a lot of money and time, even for computers. The idea of using people's spare time to get further insight is genius!

Another game that might not be directly relevant to science is Google Image Labeler, which I found rather addictive. Google gets users to label/tag images as a side effect of playing a game, and this is probably used to improve image searches on the web. I list it here because I came across a few images of animals that in some cases were labelled down to the Latin binomial.

UPDATE: An interesting new crowdsourcing project at http://www.oldweather.org/ aims to help gather information about past climates from handwritten nautical records.

Friday, April 13, 2012

Share your trees and reduce your carbon footprint

I recently attended the SPDG in Glasgow. This is a discussion group on phylogenetics which takes place in various universities across Scotland. The guest lecturer at the last meeting was Alexandros Stamatakis from Heidelberg. The main part of his talk was about the PaPaRa software, which can be used to align short reads to a phylogeny. This is really useful for the identification of next-gen reads from environmental samples. However, he also talked about reducing the carbon footprint of computational biology by writing better algorithms and code. The effect of heavy computation on the environment was not new to me, as I once sat through a video conference by Herve Philippe at the Entomological Society of America which was meant to be about "phylogenomics and the sister groups to Hexapoda" but ended up being about why he hadn't travelled to Reno, Nevada. He made valid points, which can be found here. At the SPDG, we also had a video conference from Erick Matsen and fortunately this time it was on topic. Erick is the organiser of phyloseminar, which is well worth having a look at and could definitely lower your carbon footprint.

More efficient algorithms and programming, videoconferencing! This all got me thinking about the three Rs: REDUCE, REUSE and RECYCLE in the context of phylogenetics. We can all do our bit to REDUCE our carbon footprint when doing phylogenetics. For starters, is the analysis I want to do really necessary? Does it have to run as long? Can we use a better, more efficient algorithm? Secondly, we can REUSE the trees that others have already built, but this means that we need to get much better at sharing our trees. TreeBASE and DataDryad are undoubtedly playing an important role in enabling us to share phylogenies and thus reduce our carbon footprint. However, as discussed in "Towards a taxonomically intelligent phylogenetic database" by Rod Page, the pace at which we are publishing phylogenies is not being matched by the submissions to TreeBASE. This leaves us with the last option: to RECYCLE our trees. This should only be a last resort but ends up happening most of the time. For this we need to get back to our raw materials, the sequences, which fortunately are more consistently shared in GenBank, and redo the analyses.

Hopefully, this time round the algorithm will produce less carbon and the data will be submitted to TreeBASE!

Tuesday, April 10, 2012

Phylogeny digitisation

I was hunting around for further research on phylogeny image digitisation to see whether any advances had been made since I last published on the topic and to keep my previous post up to date. The main reason behind all of this is to see whether there would be a faster way to digitise a bunch of images that are accumulating on my hard drive. I thought it would be cool to do something with the ripped phylogenies for the iEvoBio Challenge, but my current set of trees only has a total of 2,000 leaves and I need 10,000.
Anyway, I came across PHYLODIGM in my searches which looks promising. Thomas Laubach has also done some further work on TreeSnatcher Plus including using the benchmarking dataset from TreeRipper and a number of tree files found via Google searches. Additionally, he has released the source code under the GNU General Public License.
This all looks promising!!!