The Future of Protein Science will not be Supervised

But it may well be semi-supervised.

For some time now I have thought that building a latent representation of protein sequence space is a really good idea, both because we have far more sequences than any form of labelled data, and because, once built, such a representation can inform a broad range of downstream tasks. This is why I jumped at the opportunity last year when Surge Biswas, from the Church Lab, approached me about collaborating on exactly such a project. Last week we posted a preprint on bioRxiv describing this effort. It was led by Ethan Alley, Grigory Khimulya, and Surge. All I did was to enthusiastically cheer them on, and so the bulk of the credit goes to them and George Church for his mentorship.


AlphaFold @ CASP13: “What just happened?”

I just came back from CASP13, the biennial assessment of protein structure prediction methods (I previously blogged about CASP10). I participated in a panel on deep learning methods in protein structure prediction, and also took part as a predictor (more on that later). If you keep tabs on science news, you may have heard that DeepMind’s debut went rather well. So well, in fact, that not only did they take first place, but they also put a comfortable distance between themselves and the second-place predictor (the Zhang group) in the free modeling (FM) category, which focuses on modeling novel protein folds. Is the news real or overhyped? What is AlphaFold’s key methodological advance, and does it represent a fundamentally new approach? Is DeepMind forthcoming in sharing the details? And what was the community’s reaction? I will summarize my thoughts on these questions and more below. At the end I will also briefly discuss how RGNs, my end-to-end differentiable model for structure prediction, did on CASP13.


Protein Linguistics

For over a decade now I have been working, essentially off the grid, on protein folding. I started thinking about the problem during my undergraduate years and actively working on it from the very beginning of grad school. For about four years, during the late 2000s, I pursued a radically different approach (to what was current then and now) based on ideas from Bayesian nonparametrics. Despite spending a significant fraction of my Ph.D. time on the problem, I made no publishable progress, and ultimately abandoned the approach. When deep learning began to make noise in the machine learning community around 2010, I started thinking about reformulating the core hypothesis underlying my Bayesian nonparametrics approach in a manner that can be cast as end-to-end differentiable, to utilize the emerging machinery of deep learning. Today I am finally ready to start talking about this long journey, beginning with a preprint that went live on bioRxiv yesterday.


The Quantified Anatomy of a Paper

I previously blogged about my adventures in self-quantification (QS). In that post I wrote about the general system but did not delve into specific projects. Ultimately, however, the utility of self-quantification lies in the detailed insights it gives, and so I’m going to dive deeper into a project that passed a major milestone earlier today: publication of a paper. If you’re interested in the science behind this project, see my other post, A New Way to Read the Genome. Here I will focus on the application and utility of QS as applied to individual projects.

A New Way to Read the Genome

I am pleased to announce that earlier today the embargo was lifted on our most recent paper. This work represents the culmination of over two years of effort by my collaborators and me. You can find the official version on the Nature Genetics website here, and the freely available ReadCube version here. In this post, I will focus on making the science accessible to the lay reader. I have also written another post, The Quantified Anatomy of a Paper, which delves into the quantified-self analytics of this project.


Predictions Are Cheap in Biology

I just came back from ICSB 2013, the leading international conference on systems biology (short write-up here). During the conference Bernhard Palsson gave a great talk, which he ended by promoting a view that (I suspect) is widely held among computational and theoretical biologists but rarely voiced: most high-impact journals require that novel predictions be experimentally validated before they are deemed worthy of publication, by which point they cease to be novel predictions. Why not allow scientists to publish predictions by themselves?


ICSB 2013

I recently had the pleasure of attending the 14th International Conference on Systems Biology in Copenhagen. It was a five-day, multi-track bonanza, a strong sign of the field’s continued vibrancy. The keynotes were generally excellent, and while I cannot help but feel a little dismayed by the incrementalism that is inherent to scientific research and that is on display in conferences, the forest view was encouraging and hopeful. This is one of the most exciting fields of science today.


Is Terrestrial Life of Extraterrestrial Origin?

A few weeks ago a paper titled Life Before Earth was posted on the arXiv preprint repository. It came to my attention by way of this MIT Technology Review article and this blog post. The paper, using a rather simple extrapolation, argues that the apparent rate at which the complexity of terrestrial life increases suggests that its birth occurred approximately 9.7 billion years ago. Earth, in contrast, is around 4.5 billion years old. If their extrapolation is to be believed, then this discrepancy can only be resolved if terrestrial life is in fact of extraterrestrial origin. I will briefly summarize their argument, but I will not attempt to justify its validity. The original paper can be read here and is fairly accessible. The paper’s conclusions are consistent with a fact that has always puzzled me: the surprising complexity and maturity of what is known as the Last Universal Common Ancestor. It is this topic that I wish to focus on in this post.


Emergent Modularity

Last week I attended a talk at MIT by Michael Deem, a professor at Rice University who has done some very interesting work on the emergence of modularity in evolution. This is a topic that I have long thought about, as it seems that modularity is intrinsic to many biological phenomena, and it also seems that modular systems would by construction have internal structure that can be exploited in computational modeling. My thinking on the topic has been crude and qualitative, and so it was with some delight that I discovered Michael’s work in this area, as his group has placed this problem on a very firm quantitative footing. The talk was thought-provoking and left me with genuinely new insights, something that happens with only a small, small fraction of the talks I attend.
