CASP10, and the Future of Structure in Biology

I recently had the fortune of attending the 10th Critical Assessment of protein Structure Prediction, or CASP, as it is affectionately known. CASP is a competition of sorts, held once every two years, to ascertain the progress made in computationally predicting protein structure. It is a blind experiment: the structures to be predicted are unknown beforehand, so it serves as an unbiased test of the predictive power of current computational methods. It is in many ways a model that the rest of computational biology ought to follow (and is starting to).

I went to CASP in part because it is very relevant to my research: I develop computational methods that use the structures of molecules to predict their binding affinity to other molecules. But I also went to gain a better understanding of the state of structural biology, computational and otherwise. It is no secret that in the past decade, as the genomics revolution kicked into high gear, progress in (computational) structural biology has stalled. Science is very fashionable, and structural biology is currently out of fashion. There are many possible reasons for this. Two obvious ones are the explosion of available sequence data, which has made sequence-based analysis rife with low-hanging fruit, and the slow progress of computational structural biology itself. Our ability to predict structures has been stagnant, with most of the improvement coming from the availability of more structural data rather than from fundamental theoretical advances. This stagnation has in fact been a topic of discussion for at least the past two CASPs.

The expectations were somewhat different this year, however. A set of publications, coming primarily from two groups [2,3], demonstrated surprising accuracy in predicting protein structure. Furthermore, these methods attack the problem in an entirely different way from the so-called fragment-assembly methods that are the current standard in the field: they exploit the evolutionary signal in protein sequences. In particular, they search for residues that co-evolve and take that co-evolution as an indication that the residues may be in physical contact. The idea has been around for some time, but recent progress has come from employing more sophisticated mathematical machinery, and these advances appear to make a significant difference. My expectation, and I think that of a few others, was that these co-evolution methods were going to make a big splash at CASP10.
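To make the co-evolution idea concrete, here is a deliberately crude sketch: mutual information (MI) between columns of a toy alignment, used as a stand-in for a co-evolution score. The sequences are invented for illustration, and the published methods use far more sophisticated global statistical models, precisely because raw MI cannot distinguish direct couplings from indirect ones.

```python
from collections import Counter
from math import log

# Toy multiple sequence alignment (hypothetical sequences, for illustration).
msa = [
    "AKLVE",
    "AKIVE",
    "GRLVD",
    "GRIVD",
    "AKLAE",
]

def column(msa, i):
    return [seq[i] for seq in msa]

def mutual_information(msa, i, j):
    """Crude co-evolution score: MI between alignment columns i and j."""
    n = len(msa)
    ci, cj = Counter(column(msa, i)), Counter(column(msa, j))
    cij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), nab in cij.items():
        p_ab = nab / n
        mi += p_ab * log(p_ab / ((ci[a] / n) * (cj[b] / n)))
    return mi

L = len(msa[0])
scores = {(i, j): mutual_information(msa, i, j)
          for i in range(L) for j in range(i + 1, L)}
# Rank column pairs: high MI means the positions co-vary, which these
# methods interpret as evidence that the residues are in physical contact.
top = sorted(scores, key=scores.get, reverse=True)[:3]
```

In this toy alignment, positions 0, 1, and 4 vary in lockstep, so those pairs dominate the ranking.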

Unfortunately, they did not. That’s the bad news. The good news is that it is not because they made inaccurate predictions. Rather, these methods require a very large number of sequences to work, on the order of thousands to tens of thousands, and the targets in the relevant CASP category (so-called free modeling) simply did not have that many sequences available. This leaves open the possibility that at the next CASP we will see a significant breakthrough from co-evolution methods.

So that was a letdown. But what of the bigger question, the health and long-term outlook of the field? Coming as an outsider, I repeatedly probed the conference attendees on this question, and what I received in reply was somewhat interesting. Almost everyone I talked to readily admitted that the field has been stagnant, and that it is getting increasingly difficult to do the things that are the lifeblood of a scientific discipline: raise money, attract students, and publish in high-impact journals. Yet, somewhat inexplicably, there was also a budding sense of optimism that has been lacking for years. In part it may have been sparked by the recent developments in co-evolution methods, which, although not making the splash some had hoped for, still hold a lot of promise. But my sense is that it is broader than that.

The following is pure speculation on my part, but I am betting that structure will be making a comeback soon, possibly in a year or two, possibly a little after that. I think new computational methods will play a crucial role in this process, as we get better, possibly much better, at predicting structures and their interactions. But another potential driver is the need to interpret genomic data. While the explosion in ultra-high-throughput sequencing has been nothing short of revolutionary, that field is not without problems of its own. Chief among them is our inability to make sense of the growing mound of genomic data. Perhaps the highest-profile example of this is the repeated failure of genome-wide association studies (GWAS) to yield highly predictive loci for diseases or traits. It appears increasingly likely that the mere availability of enormous amounts of sequence data will not yield scientific insights, particularly if the attempted leap is directly from sequence to final organismal-level phenotype. The basic problem is that the mapping function is far too complex. To go from sequence to something like a person’s height involves a highly non-linear set of mappings, and exploring the full space of these mappings without utilizing prior knowledge is hopeless, no matter how many sequences we throw at the problem. The field of genetics has so far largely concerned itself with the tiny fraction of genetic traits that can be traced to one or a handful of loci. But the vast sea that is the human phenotypic landscape will involve much more complex mappings.

So how does structure fit into all this? Structure represents in many ways the first step forward. In going from sequence to phenotype, the first question we should ask is “what are the molecular consequences of these sequences?” Our ability to reliably move between sequence and structure, which is what the structure prediction problem is ultimately about, is central to answering this question, which in turn is central to the interpretation of genomic data. This realization has already started permeating genomics. As it continues to do so, and as potentially significant breakthroughs are made in protein structure prediction using co-evolution methods and possibly other, entirely new approaches, structural biology may suddenly find itself in the spotlight.


  1. Coming from a compressive sensing standpoint, I sense that these methods seem to follow a path similar to what has been going on in CS. For a while the central tenet was sparsity: i.e., if your signal is sparse, then it can be acquired in a CS fashion and reconstructed with solvers that impose that constraint. However, most signals are not sparse, and for CS to be really relevant it has to address the issue of compressible signals. The only way to tackle that has been to develop the concept of structured sparsity and the attendant solvers.
    But one could go beyond just making a parallel. Imagine an NGS machine that only takes “incoherent” measurements of some DNA; one would then need one of these structured-sparsity solvers (with the right kind of structure) to find out what the NGS machine has actually been measuring, etc.

    • I agree, and I think it’s a very interesting area in at least two ways. One is how to construct equipment that compressively measures sequences or other quantities of interest, and the other is determining the right sparsity structure to impose.

      • From what I understand there are roughly two different techniques. One splits the DNA into several pieces, and much of the challenge is putting those pieces back together; this technique comprises the bulk of current NGS systems. The other is based on the nanopore concept: it threads the DNA strand through a single pore that is monitored via some potential, and each element (G, T, A, C) produces a different voltage that can be recorded digitally. I somehow wondered whether this nanopore approach couldn’t be made to provide incoherent measurements.
        One reason it might work is the case where the DNA strand is too knotty to go through the pore. It would be knotty because of its folding/structure.


        • I think one issue with sequencing in general, including nanopores, is that the sequences would have to be sparse in some way, which is not generally the case. However, if for example you have a population of genomes which are all very similar, except for a few bases that differ, then that could be exploited. It’s hard to see how this could be done in the context of sequencers, but certainly the idea of a compressive microarray has been around for a while.
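The scenario just described, a genome that differs from a known reference at only a few bases, maps naturally onto standard sparse recovery. Here is a minimal sketch under invented assumptions: a random Gaussian matrix stands in for whatever “incoherent” measurements a hypothetical device might take, and orthogonal matching pursuit stands in for the solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a length-100 "difference from reference" signal
# that is 3-sparse (only a few bases differ).
n, m, k = 100, 40, 3
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)

# m incoherent (here: random Gaussian) measurements, far fewer than n.
A = rng.normal(size=(m, n)) / np.sqrt(m)
y = A @ x_true

def omp(A, y, k):
    """Orthogonal matching pursuit: greedy recovery of a k-sparse signal."""
    residual, support = y.copy(), []
    for _ in range(k):
        # Greedily pick the column most correlated with the residual.
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        # Least-squares refit on the current support, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat

x_hat = omp(A, y, k)
```

With 40 noise-free measurements of a 3-sparse length-100 signal, the greedy solver should recover the planted differences; structured-sparsity solvers generalize this step when plain sparsity is the wrong prior.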

  2. If sparsity cannot be the ruling principle, then I wonder how an analysis approach to learning the underlying norm (to eventually minimize) could be revealing of issues like structure. In imaging, here is an example, out of many, of learning what eventually looks like the TV norm (as expected).
    In that respect, I am curious how this would apply to the work you have done before.

    • The paper you linked to is quite interesting. Thanks for the link.

      I agree that there’s a lot more to structure than just sparsity, and things like TV from imaging are instructive. In terms of the energy-potentials work, I think broadly speaking one can look to the very rich literature from physics, quantum chemistry, etc., for inspiration for structural constraints that can be imposed. It’s something that I’m looking into now.

      But yes it’s even more interesting to try to discover that structure automatically using things like analysis operators. It’s something that I’m looking into as well.
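A small synthetic sketch of the analysis (cosparse) view being discussed here: a piecewise-constant signal is dense in the standard basis, but its image under a finite-difference analysis operator is sparse, and the l1 norm of that image is exactly the (anisotropic) TV norm.

```python
import numpy as np

# Synthetic piecewise-constant signal (values invented for illustration).
x = np.concatenate([np.full(40, 2.0), np.full(30, -1.0), np.full(30, 4.0)])

# Finite-difference analysis operator D, so that (D @ x)[i] = x[i+1] - x[i].
D = np.eye(len(x) - 1, len(x), k=1) - np.eye(len(x) - 1, len(x))
dx = D @ x

dense_support = np.count_nonzero(x)      # x itself is not sparse at all
analysis_support = np.count_nonzero(dx)  # only the two jumps survive
tv_norm = np.abs(dx).sum()               # l1 norm of Dx = the TV norm of x
```

The point of analysis-operator learning is to discover a D like this from examples instead of fixing it in advance.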

  3. Mohammed,

    You might be interested in looking up implementations for learning the analysis operator [1,2]. Since some of the wording is new, I tried to put in perspective what this analysis operator is in general [3]. In short, provided enough examples, one can begin to guess the structure of the field equations at hand. I would not be overly surprised if, given the right training, one could get an idea of the potentials and the structure of the DNA [5] at play. [4] is also of interest.

    1. Analysis K-SVD: A Dictionary-Learning Algorithm for the Analysis Sparse Model

    2. Noise Aware Analysis Operator Learning for Approximately Cosparse Signals (implementation)

    3. Sunday Morning Insight: The Linear Boltzmann Equation and Co-Sparsity

    4. A Comment on Learning Analysis Operators

    5. DNA’s Twisted Communication

    • Igor, I briefly skimmed through some of these links and they look great! Thanks so much for the distillation. I’ll have to spend some time familiarizing myself with the analysis perspective because I’ve had little exposure to it. This’ll give me a great start!

    • Igor,

      Yes, the issue of large-scale genome organization is certainly of great current interest! There was some rather interesting work a couple of years ago on the experimental side.

      That said, the “knots” that show up in nanopore sequencing are quite different from the large-scale knots that dictate genome-wide organization.

      • Hello Mohammed,

        My question is really whether people have noticed any cyclicality in genomic studies, given that the ~147 base pairs wrapped around each histone core are in fact physical neighbors.


        • I’d like to say yes, but I don’t remember a specific study off the top of my head. I’m pretty sure I have come across a study or even two that detected some signal, based purely on sequence, that corresponded to nucleosomes. You may want to check out Eran Segal’s work at the Weizmann Institute.
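As a toy illustration of how a purely sequence-based periodic signal can be detected: the numbers below are invented, and the periodicity reported for nucleosome-bound DNA is the roughly 10-bp dinucleotide spacing (one helical turn), not the 147-bp wrapping length itself. The sketch plants a weak 10-bp “AA” bias in a random sequence and recovers the period by autocorrelation.

```python
import random

random.seed(1)

# Random background sequence with a weak, probabilistic 10-bp "AA" bias
# (hypothetical parameters, chosen only to make the toy signal visible).
L, period = 5000, 10
seq = [random.choice("ACGT") for _ in range(L)]
for i in range(0, L - 1, period):
    if random.random() < 0.7:  # a bias, not a deterministic pattern
        seq[i], seq[i + 1] = "A", "A"

# 0/1 indicator of "AA" dinucleotide starts.
x = [1 if seq[i] == "A" and seq[i + 1] == "A" else 0 for i in range(L - 1)]
mean = sum(x) / len(x)

def autocorr(x, lag):
    """Mean-removed autocovariance of x at the given lag."""
    n = len(x) - lag
    return sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n)) / n

# Scan candidate periods from 2 to 15 bp; the planted period should win.
best = max(range(2, 16), key=lambda g: autocorr(x, g))
```

Real nucleosome-positioning models (e.g. Segal’s) are of course far richer than a single autocorrelation, but this is the flavor of signal they pick up.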
