The past week was a momentous occasion for protein structure prediction, structural biology at large, and in due time, may prove to be so for the whole of life sciences. CASP14, the conference for the biennial competition for the prediction of protein structure from sequence, took place virtually over multiple remote working platforms. DeepMind, Google’s premier AI research group, entered the competition as they did the previous time, when they upended expectations of what an industrial research lab can do. The outcome this time, however, was very, very different. At CASP13 DeepMind made an impressive showing with AlphaFold but was ultimately within the bounds of the usual expectations of academic progress, albeit at an accelerated rate. At CASP14 DeepMind produced an advance so thorough it compelled CASP organizers to declare the protein structure prediction problem for single protein chains to be solved. Judging by the mood of CASP14 attendees (virtual as the gathering was), I sense that this was the conclusion of the majority. It certainly is my conclusion as well.
In a twist of irony the community most directly affected by this development—in some ways negatively affected on a personal level as AlphaFold2 (AF2) essentially obsoletes at least parts of our research programs—has been the most unanimous in its agreement on the significance of AF2’s advance (although certainly not wholly unanimous). Judging by Twitter, communities further and further away from protein structure prediction have had more mixed reactions.
In this post I will try to distill my views on AF2 and CASP14. I struggled with whether I should write this blog post as it felt at times like an obligation rather than something I desired to do. Sequels are also never as good as the original and the weight of expectation from my CASP13 post fazed me. In the end however there were enough new things to say that I felt it worthwhile to write the post. I hope that it proves useful to others.
In “The Advance” I explain the magnitude of AF2’s leap in quantitative terms; “A Solution?” addresses the controversy and semantics of the word “solution”, and unpacks the myriad problems often called “protein folding” but that are not; in “The Method” I speculate on what AF2 does exactly, although details here are thin because we don’t have them; “Impact on …” details my views on how this impacts fields ranging from protein structure prediction to the whole of biology; “Why DeepMind?” delves, a little bit, into the “sociology” of why it was DeepMind that managed this and not anyone else, although this section is not nearly as long as the one I wrote for CASP13 for the simple reason that most of what I said then still holds true IMO; and finally, in “Academic Research in a New Age” I put on my academician’s hat and think out loud about how best to compete strategically in this new era—this section is inside baseball for academics working on biomolecular machine learning and can be safely ignored by most.
Before I start (sorry) one last bit of housekeeping: after publishing my CASP13 blog post two years ago I received an entirely undeserved amount of attention and found myself repeatedly the spokesperson for a field far too big and diverse for me to have any right to represent. There are many people at least as capable as me and often more so in this space; among the “new” generation I count Sergey Ovchinnikov, John Ingraham, Possu Huang and many others. Among the “established” generation there is of course David Baker, Debbie Marks, Jinbo Xu, Chris Sander, Yang Zhang, and many more. And then there is the AlphaFold team. I only list people I personally know and have interacted with, but a quick perusal of the CASP14 participants list can give you an idea of who is active in this space. All of them are just as and often far more qualified than I am to speak about this problem and so please ensure that you get a broad representation of views when seeking to form your own.
Table of Contents
- The Advance
- A Solution?
- The Method
- Impact on …
- Why DeepMind?
- Academic Research in a New Age
The Advance

To understand AF2’s significance and why so many people were yelling “OMG PROTEIN FOLDING IS SOLVED!!!!!” it makes sense to first take stock of the quantitative magnitude of AF2’s leap. I was privy to the results before they became widely known and my initial expectation when I heard that DeepMind would declare the problem solved was that they had achieved a median GDT_TS of around 80. You can intuitively think of GDT_TS (definition) as the fraction of the protein that is correctly predicted, i.e., 80 corresponds to ~80% of the protein being more or less right. Random predictions give around 20; getting the gross topology right gets one to ~50; accurate topology is usually around 70; and when all the little bits and pieces, including side-chain conformations, are correct, GDT_TS begins to climb above 90. I speculated in my CASP13 post that in ~4 years’ time we would get the topology correct, i.e., we’d have a median GDT_TS between 70 and 90 at CASP15. Although I never wrote it in the CASP13 post, my expectation (back then) for when we would nail all the details was another 10 years, i.e., not until the late 2020s would we see >90 GDT_TS for most targets.
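To make the metric concrete, here is a toy GDT_TS computation. This is a sketch only: the real score maximizes each threshold’s count over many trial superpositions, whereas this version assumes the predicted and reference structures are already optimally aligned.

```python
import numpy as np

def gdt_ts(pred_ca, ref_ca, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """Toy GDT_TS: mean percentage of Calpha atoms whose deviation falls
    under each of four distance thresholds (in Angstroms). The real
    metric searches over many superpositions to maximize each count;
    here we assume the structures are pre-aligned (an assumption)."""
    dev = np.linalg.norm(pred_ca - ref_ca, axis=-1)  # per-residue deviation
    return 100.0 * np.mean([(dev <= t).mean() for t in thresholds])

ref = np.random.rand(50, 3) * 30.0
print(gdt_ts(ref, ref))                          # → 100.0 (perfect prediction)
print(gdt_ts(ref + np.array([3.0, 0.0, 0.0]), ref))  # → 50.0 (every atom off by 3 A)
```

A uniform 3 Å error fails the 1 Å and 2 Å thresholds but passes 4 Å and 8 Å, hence the score of 50 — which matches the intuition above that ~50 corresponds to roughly getting the gross topology.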
So when I learned that DeepMind would declare the problem solved I assumed they had done what they did at CASP13 again, achieving in two years what I thought they would do in four, getting a median GDT_TS of ~80. I was impressed and eager to hear the details. Imagine my surprise then when I was informed a few days later of the final number, a median GDT_TS of 92.4. Never in my life had I expected to see a scientific advance so rapid.
Statements have been made that this is proteins’ “ImageNet moment” but that would be incorrect IMO. The ImageNet moment was the first time deep learning demonstrated it can outperform conventional approaches on image recognition and made the field of computer vision take notice. Relative to AF2’s advance this year, the 2012 ImageNet advance was incremental. The closest to an ImageNet moment this field has had is Jinbo Xu’s 2016 PLoS Comp Bio paper, which demonstrated the first real impact of deep learning on protein structure prediction. This on the other hand is something altogether different. It is more akin to having the ImageNet accuracies of 2020 in 2012! A seismic and unprecedented shift so profound it literally turns a field upside down overnight.
The table of Z-scores that CASP14 publishes has been making the rounds and so I won’t reproduce it here, in part because I think it’s a little hard to interpret other than saying “AF2 does much better than anyone else”. Instead, consider the graph below, which illustrates the delta between AF2 and the next best method this year.
Now recall what I said earlier about the rough meaning of different GDT_TS regimes and reexamine the plot. The improvement is nothing short of staggering and it is across the board. We see structures with a GDT_TS of 20, i.e., nonsense predicted by the next best method, getting to a GDT_TS of almost 90, i.e., with all the details!!! And then we see really good structures (mid 80s) predicted by the next best method going well north of 90 and 95 in some cases! Above 95 is within experimental accuracy.
It’s worth noting here that historically speaking it was very rare for one method to dominate others so thoroughly, certainly in recent memory. All the top groups, in particular the Baker and Zhang groups, often run neck and neck. The only real exception to this was the last CASP, when the first AlphaFold did best for 1/3 of targets. This time, AF2 did best for 88 out of 97 targets!
Below is the comparison between the 2nd and 3rd best methods to illustrate this point.
This addresses some of the concerns that I’ve seen on social media from people unfamiliar with CASP, e.g., that there may be an overfitting issue; for starters, CASP organizers go to extraordinary lengths to get really difficult protein targets, ones that are quite different from known structures. I think it’s fair to say that CASP free modeling (FM) targets are harder than most structures deposited in the PDB, and so in terms of real-world conditions, CASP proteins are actually harder than usual. But seeing the deltas above further allays these concerns, given how badly (relatively speaking) everyone else did with respect to AF2. I admit this was a concern of my own, that perhaps this year was an “easy” year, as there is always some variability in target difficulty from year to year. Thankfully the CASP14 organizers quantified the difficulty of this year’s targets and found them to be harder than those of the few previous CASPs, so this was a hard year!
I was also concerned that the impressive median numbers may be hiding some poor predictions at the bottom of the distribution. I knew about the median GDT_TS of 92.4 early on but it wasn’t until the weekend before CASP14 that I got access to the full distribution. It turned out that just a handful of structures, five to be exact, had a GDT_TS below 70 (out of 93 predictions made by AF2). This was remarkable: less than 10% of structures can be considered to not have the details right. Furthermore, when one delves into these five, two turn out to be NMR structures and three are part of oligomeric complexes. NMR structures can be floppy, reflecting the fact that these proteins don’t have well-defined structures. As for oligomeric complexes, AF2 only predicts the structures of individual protein chains and so cannot be expected to reflect their oligomeric state.
I hope this communicates how thoroughly shocking AF2’s accuracy is. When looking at RMSD, a metric more commonly used by the broader biology community, AF2 achieves for Cα atoms an accuracy of <1Å 25% of the time, <1.6Å 50% of the time, and <2.5Å 75% of the time. When considering all atoms including those of side chains, the numbers are <1.5Å 25% of the time, <2.1Å 50% of the time, and <3Å 75% of the time. All but 7 of its predictions (out of 93, so 7.5%) are less than 5Å over all side chain atoms, including the cases I mentioned previously. I repeat, it is <5Å 92.5% of the time over all side chain atoms. And it is worth noting here that the model has residue-level error metrics, and they appear to be robust enough to have alerted the AF2 team when they did poorly on a SARS-CoV2 protein, so it may be the case that AF2 can warn users in the remaining 7.5% of cases but we don’t know that for sure yet.
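For readers less familiar with the RMSD numbers above: the standard way to compute Cα RMSD is to first superpose the two structures optimally (the Kabsch algorithm) and then average the squared deviations. A minimal sketch:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """Calpha RMSD after optimal rigid superposition (Kabsch algorithm).
    P, Q: (N, 3) coordinate arrays for the same N atoms."""
    P = P - P.mean(axis=0)                  # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                             # cross-covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against improper reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T # optimal rotation
    return np.sqrt((((P @ R.T) - Q) ** 2).sum(axis=1).mean())
```

As a sanity check, a structure compared against a rigidly rotated and translated copy of itself should give an RMSD of essentially zero.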
On Twitter some quoted my CASP13 blog post to assert that I had predicted things correctly two years ago—I most certainly had not! In 2018 I expected us to get to a median of 4Å Cα RMSD by 2022, which if you look at where the next best group is, would be about right. Instead we got 1.6Å in 2020. This I did not expect to see until the decade’s end and I might have even said would not be achievable without physics-based approaches like molecular dynamics (MD).
A Solution?

Given all the above, does AF2 constitute a “solution” of one form or another? This will invariably be an exercise in semantics. Part of me wishes that DeepMind never used the term, that the CASP14 organizers didn’t, and that I didn’t, because all it has done is raise temperatures on all sides without advancing the discussion one bit. I used it because I thought it was justified, and still do, and so I will spend a bit of time here unpacking it but really there are better things to do with one’s time.
First, protein folding has often been used, especially in the lay media, as an umbrella term for different problems. I list a few that qualify:
- Prediction of the structure of a single protein domain from sequence
- Prediction of the structure of a single protein, possibly comprising multiple domains, from sequence
- Prediction of the structure of a multimeric complex
- Prediction of the major conformations of a protein
- Prediction of the dynamic folding pathway(s) of a protein
For each of the above, there is furthermore the question of whether a prediction is “pure”, i.e., made from only a single sequence, or whether additional information is used, most commonly homologous protein sequences but possibly homologous structures and even other forms of non-sequence-based experimental data. In addition, there is the question of whether a solution constitutes solving the bulk of the problem or all imaginable instantiations of it. For example, proteins can have metals and other co-factors that alter their structure. Does a solution need to address all of them? What about unnatural amino acids? Entirely de novo proteins that may fold to structures unseen in nature? Finally there’s the question of accuracy—how good is good enough? Less than 3Å? Less than 1Å? Less than 0.5Å? It gets complicated.
Here’s what I think AF2 can do: reliably (>90% of the time) predict to reasonable accuracy (<3-4Å) the lowest energy structure of vanilla (no co-factors, no obligate oligomerization) single protein chains using a list of homologous protein sequences, i.e., some version of the second bullet point above. It seems to deal with multi-domain proteins just fine but it hasn’t been thoroughly tested in this regard; this was surprising to me. Beyond that it can’t yet handle any of the corner cases and it’s not working from single sequences.
Does this constitute a solution of the static protein structure prediction problem? I think so but there are all these wrinkles. Honest, thoughtful people can disagree here and it comes down to one’s definition of what the word “solution” really means. Let me explain why I consider this a solution.
While the current list of caveats is a long one, making it seem that AF2 has a way to go before tackling all corner cases and elaborations, it is my expectation that this is not the case because the core intellectual problem has been solved. I believe that everything on the preceding bullet list, excepting the very last item (and possibly the second-to-last) is now an engineering rather than a scientific problem. That doesn’t mean it’s any less important or any less hard. I consider myself an engineer as much as I am a scientist, maybe even more so. But there is an important distinction between scientific and engineering problems that is pertinent to the discussion here: engineering problems can be exceedingly difficult, and require the marshaling of inordinate resources, but competent domain experts know the pieces that need to fall into place to solve them. Whether or not a problem can be solved is usually not a question in engineering tasks. It’s a question whether it can be done given the resources available. In scientific problems on the other hand, we don’t know how to go from A to Z. It may turn out that, with a little bit of careful thought (as appears to be the case for AF2), one discovers the solution and it’s easy to go from A to Z, easier than many engineering problems. But prior to that discovery, the path is unknown, and the problem may take a year, a decade, or a century. This is where we were pre-AF2, and why I made a prediction in terms of how fast we will progress that turned out to be wrong. I thought I had a pretty good handle on the problem but I could not estimate it correctly. In some ways it turned out that protein structure prediction is easier than we anticipated. But this is not the point. The point is that before AF2 we didn’t know, and now we do know, that a solution is possible.
My claim is that given this knowledge all the corner cases have become engineering problems. Some may prove to be harder, in terms of required effort as measured by human hours, than solving the core protein structure prediction problem. For example dealing with unnatural amino acids will probably be a long slog. In all likelihood we will never solve every corner case. In my view, requiring that we solve every corner case to declare something a “solution”, especially for a problem as difficult as protein structure prediction, which is not mathematics, means we have decided to never use that word for the problem. That’s a reasonable choice and people can make it. For me, the word still has relevance in the specific way I describe. It means that the bulk of the scientific problem is solved; what’s left now is execution. This may be harder than all that’s come before, but if we’re motivated enough, in terms of building the models and collecting the data, then we can solve those other problems. Protein structure prediction is at the 90% of the problem but 10% of the effort stage.
One item on the above bullet list does not fall into this category and ironically it is the one that gave this field its name: the actual dynamic process by which proteins fold. This is a completely different problem and I am almost tempted to say that AF2 has no bearing on it, but that would be far too strong I think—AF2, directly or indirectly, may well contribute to solving the protein folding problem. But that problem remains firmly in the realm of science. It may get solved in 5 years, or 10, or 100; we don’t yet know.
The Method

Now we finally get to something interesting: how AF2 actually works! Alas, I will be able to say a lot less than I had hoped for, and here I have to do something which I very much dislike doing but feel that I must—call out DeepMind for falling short of the standards of academic communication. What was presented on AF2 at CASP14 barely resembled a methods talk. It was exceedingly high-level, heavy on ideas and insinuations but almost entirely devoid of detail. This is a shame and contrasts markedly with DeepMind’s participation in CASP13, when they gave two talks that provided sufficient details for many groups to reproduce their results right away and participated in a poster session where they freely answered questions and built rapport with the community. While at CASP13 I and many others were surprised by DM’s entry and impressed by their results, we all walked away feeling like there’s a great new group of colleagues in the community. I’m afraid this time they left a different impression and I’m not sure it was at all necessary. DeepMind is in an exceedingly dominant position here—they will invariably get the cover of Nature or Science and may one day nab their first Nobel prize for AF2. Withholding details stands to poison the well of goodwill in the community. I hope their paper corrects this, and I furthermore hope they preprint their results to accelerate dissemination of their work.
Alright, enough with the rant! So what did they do? Insofar as I can tell, there were four major pieces to their scheme, which I will list in my perceived order of their importance. But I could be very wrong, on the order and on the details, as they really told us very little.
Out with the Potts models, in with raw MSAs
The current standard is to build a multiple sequence alignment (MSA) of homologous protein sequences then extract summary statistics out of this alignment, roughly speaking how strongly co-evolving every residue is with respect to every other residue. This summarized information is then fed into a neural network to predict a “distogram”, a matrix of the probabilities of pairwise distances between all Cβ atoms (sometimes other quantities are predicted but I’m simplifying). This was the approach of the first AlphaFold and multiple methods since then.
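A minimal sketch of the summary-statistics step in this standard pipeline is below. It is a caricature: real Potts/DCA models fit or invert a regularized covariance, with sequence reweighting and average-product correction, but the core idea of collapsing an MSA into an (L, L) matrix of pairwise co-evolution strengths is the same.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids + gap

def coevolution_scores(msa):
    """Toy pairwise co-evolution summary of the kind older pipelines fed
    to a distogram-predicting network. msa: list of equal-length aligned
    sequences. Returns an (L, L) matrix of coupling strengths taken from
    the one-hot covariance (real Potts/DCA methods instead invert a
    regularized covariance and apply corrections)."""
    idx = {a: i for i, a in enumerate(ALPHABET)}
    L, A = len(msa[0]), len(ALPHABET)
    X = np.zeros((len(msa), L * A))
    for n, seq in enumerate(msa):
        for i, a in enumerate(seq):
            X[n, i * A + idx[a]] = 1.0       # one-hot encode each column
    C = np.cov(X.T)                          # (L*A, L*A) covariance
    C = C.reshape(L, A, L, A)
    return np.sqrt((C ** 2).sum(axis=(1, 3)))  # Frobenius norm per residue pair
```

On a toy alignment where two columns always mutate together, the score for that pair stands out against a column that never varies.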
The new AF2 no longer summarizes the MSA. Instead, it keeps all raw sequences and iteratively “attends” to them. At step n, AF2 decides which sequences are worth looking at and which can be safely ignored and based on this predicts a distogram. At step n+1, AF2 uses the distogram to decide which sequences to attend to next, and based on them predicts a new distogram. It does this multiple times. How many times wasn’t made clear, but if one were to squint at DeepMind’s slides it appears to involve a few hundred iterations. In this way AF2 starts out building the local structure within individual protein domains before branching out to more global features, for example the relative orientation of two domains within a protein.
This approach is novel and has a number of potential advantages (simpler precedents do exist in the literature). First, AF2 can leverage deeper (as in having more sequences) MSAs for individual protein domains and shallower ones for whole proteins. It is quite common for individual domains to have a lot more sequences available and this can be used by the attention mechanism to resolve intra-domain details. Once done, AF2 can then use the relatively shallower full-length MSA to resolve the inter-domain details. I’m not proposing that this is pre-engineered; AF2 likely learns how to do this on its own.
There are other, perhaps more significant advantages to this approach. MSAs can often be noisy, containing sequences that are not evolutionarily related. By allowing the model to choose what to include at every step, AF2 can learn to filter on its own. Second, the summary statistics I previously described are all pairwise, extracting only how two residues co-evolve at a time. By accessing the full MSA, AF2 may be able to extract higher-order correlations. Third, some proteins, depending on the availability of data and their evolutionary age, can have very shallow MSAs. This has traditionally made it difficult for MSA-based methods, including the first AlphaFold, to predict their structures (and is what set me off in the first place to predict structures from individual sequences). This time around, AF2’s performance appears almost entirely decoupled from MSA depth. At least, it appears to be quite robust to proteins with very shallow MSAs. This may be due to the iterative attention scheme, as it is always operating on a self-chosen weighted mixture of individual sequences. When confronted with shallower MSAs it can learn to make do with what it has. Finally, the iterative approach is a good idea in general. We know from many machine learning tasks, especially in computer vision, that better results are achieved when a model is able to inspect its own output and generate refined outputs in response.
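Since DeepMind provided so few details, the best I can offer is a caricature of what such an attend-then-predict loop might look like. Every name, shape, and update rule below is my own invention for illustration, not AF2’s: each iteration scores the sequences against a running query, takes an attention-weighted mixture of them, derives a pairwise feature map (a stand-in for the distogram), and feeds that back into the query for the next round.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def iterative_msa_attention(msa_feats, n_iter=8, rng=None):
    """Caricature of the MSA <-> distogram ping-pong described above.
    msa_feats: (n_seq, L, d) per-residue embeddings of aligned sequences
    (all shapes and update rules here are illustrative, not AF2's)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_seq, L, d = msa_feats.shape
    query = msa_feats[0].copy()                    # start from the target sequence
    W = rng.standard_normal((d, d)) / np.sqrt(d)   # stand-in for learned weights
    for _ in range(n_iter):
        scores = np.einsum('ld,nld->n', query @ W, msa_feats) / (L * d)
        attn = softmax(scores)                     # which sequences to attend to
        mixed = np.einsum('n,nld->ld', attn, msa_feats)
        pair = mixed @ mixed.T                     # (L, L) distogram stand-in
        query = query + (pair @ mixed) / L         # feed pair features back
    return pair
```

The point of the sketch is the control flow, not the particulars: the pairwise prediction at step n shapes which sequences are attended to at step n+1.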
The second big change that DeepMind made is to reformulate the entire pipeline, from raw MSA to final predicted structure, to be end-to-end differentiable. On this point I feel justified in taking some credit for having developed the first end-to-end model for protein structure prediction (RGN). Mine was developed contemporaneously with another model (NEMO) by John Ingraham and colleagues (my preprint came out a few months before theirs but their final version was published a few months before mine). Neither of our models performed competitively with MSA-based approaches, as we relied on either single sequences or more limited forms of evolutionary information. That may have been a strategic mistake on our respective parts, but either way the details of AF2 are certainly very different.
End-to-end differentiability allows for model parameters to be tuned jointly, from beginning to end, to optimize for the final 3D structure instead of proximal quantities like inter-atomic distances. Second, and perhaps more importantly, it acts as a self-consistency constraint. Approaches that are not end-to-end generate outputs that can be inherently contradictory. For example, if the distances between all atoms are predicted simultaneously, they may not be embeddable in three-dimensional space, essentially giving nonsense. This is traditionally resolved by feeding the outputs through an optimization procedure, sometimes a physics-based one, which is what the first AlphaFold and all other methods did except for RGN and NEMO. With an end-to-end approach, the model, one way or another, must figure out how to be self-consistent. In the RGN case this was by construction because it reasoned in internal coordinates and iteratively built the protein structure one atom at a time so that it never made inconsistent predictions. AF2 doesn’t appear to do this, and it is unclear how it achieves self-consistency, as the fundamental object it operates on during the iterative attention portion is the distogram. At some point the distogram gets converted into 3D coordinates to be fed into the structure module (described next) but how that leap is made is left to our imagination. There are certainly known ways to do it and the answer may be vanilla but as of now we don’t know.
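To make the embeddability point concrete: one "vanilla" way to convert a predicted distance matrix into 3D coordinates is classical multidimensional scaling, and the negative or discarded eigenvalues it encounters directly measure how self-inconsistent the predicted distances were. A sketch (not claiming this is what AF2 does):

```python
import numpy as np

def mds_embed_3d(D):
    """Classical multidimensional scaling: a vanilla route from a
    predicted (L, L) distance matrix to 3D coordinates. If D is not
    perfectly embeddable in 3D (the self-consistency failure described
    above), the Gram matrix G has negative or extra eigenvalues, and
    clipping/truncating them quantifies the contradiction."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    G = -0.5 * J @ (D ** 2) @ J             # Gram matrix of coordinates
    w, V = np.linalg.eigh(G)                # eigenvalues, ascending
    w, V = w[::-1][:3], V[:, ::-1][:, :3]   # keep the top three
    return V * np.sqrt(np.clip(w, 0.0, None))
```

For a distance matrix that genuinely comes from 3D points, the recovered coordinates reproduce the distances exactly (up to rotation/reflection); for a contradictory matrix they are the best 3D compromise.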
Iterative refinement using SE(3)-equivariant transformers
At some point, either after a fixed or learned number of iterations in distogram-space, AF2 generates what is likely a 3D point cloud that is then fed into an SE(3)-equivariant transformer. I will explain what this means momentarily but for now suffice it to say that they use machinery that operates directly on atoms in 3D space. This is important because it captures higher-order coordination between atoms that cannot be captured by distograms, which are always about two atoms (or at most two contiguous stretches of atoms) at a time. In 3D, multiple atoms from distant protein regions can all come together, and they are presented to the model to operate on and refine. This is an approach that we have also been working on for the last two years and so it is a bit painful to get scooped here. The computational requirements for equivariant neural networks can be substantial, which slowed our progress.
Interestingly, the number of iterations that AF2 performs in 3D space seems to be on the order of 10. I say this is interesting because it is the geometric mean of what the RGN and NEMO models do. To elaborate: the original RGN did not iterate whatsoever, predicting a single structure in one shot. This made it very fast (milliseconds) but prevented it from performing any sort of refinement. NEMO went to the other extreme in a way, performing around 200 iterations using Langevin dynamics. It could slowly fold the structure but may have taken the physics too seriously by trying to emulate an energy landscape that can be traced down to its minimum. AF2 seems to do something in-between, both in the literal sense of taking 10 steps but also in formulating the refinement not as a physics-inspired process but as an iterative neural refinement process that “fixes” bad structures in likely non-physical ways (the fixing is non-physical, not the final structures). We don’t actually know the details and so I may be projecting here based on our own work but this is my best current guess of what AF2 is doing.
As for SE(3)-equivariant networks: proteins are molecules that exist in 3D space (in solution) without a preferred orientation or location (as individual abstract molecules). Much of the neural network machinery that’s been developed for images is (locally) translationally-invariant, i.e. does not care about location, but is not rotationally-invariant. The last few years have seen a mini-explosion of neural primitives that respect rotational invariance (equivariance just means one keeps track of the rotation/translation instead of ignoring it). It was started by a seminal paper by Nathaniel Thomas, Tess Smidt, and others and has now evolved into a vibrant subfield exemplified by the works of Taco Cohen, Max Welling, and Risi Kondor, among others. The unifying theme is the fact that standard convolutions have a group-theoretic structure, traditionally Z2 or Z3, that can be generalized to other groups including the Lie groups that respect the type of rotational symmetry desired here. Very recently a paper by Fabian Fuchs et al. ported this idea to transformers, which don’t rely on the locality assumptions made by convolutions. Given the timing of the paper, it is unlikely that the same exact approach was used by AF2 (the authors were not part of the AF2 team) but the idea is probably similar.
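To illustrate what equivariance means operationally, here is a toy point-cloud update in the spirit of (but far simpler than) these equivariant architectures; all names and weighting choices are illustrative. Because each point moves along relative displacement vectors weighted by invariant scalars (distances and per-point features), rotating and translating the input transforms the output identically.

```python
import numpy as np

def equivariant_update(x, feats):
    """Minimal equivariant point update: each point is displaced along
    relative vectors to the others, weighted by rotation/translation-
    invariant quantities. A toy, not any published architecture."""
    diff = x[:, None, :] - x[None, :, :]                     # relative vectors
    dist2 = (diff ** 2).sum(-1)                              # invariant distances
    w = np.exp(-dist2) * (feats[:, None] * feats[None, :])   # invariant weights
    np.fill_diagonal(w, 0.0)                                 # no self-interaction
    return x + (w[..., None] * diff).sum(axis=1) / len(x)

# Equivariance check: f(R x + t) == R f(x) + t
rng = np.random.default_rng(0)
x, feats = rng.standard_normal((8, 3)), rng.standard_normal(8)
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal matrix
t = rng.standard_normal(3)
lhs = equivariant_update(x @ R.T + t, feats)
rhs = equivariant_update(x, feats) @ R.T + t
print(np.allclose(lhs, rhs))  # → True
```

The real architectures (tensor field networks, SE(3)-transformers) generalize this idea to learned, attention-weighted interactions over higher-order geometric features, but the defining property tested above is the same.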
The final new piece in AF2 harkens back to the olden days of protein structure prediction, when nothing worked and the only option was to crib off homologous structures. It’s a bit of a “cheat” but can be effective. AF2 incorporates this technique by taking structural homologs directly as inputs along with the MSA. Interestingly, the second best performing team this year, from the Baker lab, does something similar. It is unclear how much this trick is contributing to AF2’s performance as it is able to do well even on proteins with no structural homologs (in the global sense, i.e., that cover most of the protein—essentially all protein structure prediction nowadays is some version of local structural homology as we have likely experimentally saturated the space of all possible protein fragments on a coarse-grained level.)
There is one other aspect of AF2 worth commenting on that doesn’t have to do with the approach per se but is interesting in its own right and is something of a mystery to me: the compute resources used. For training, AF2 consumed something like 128 TPUs for several weeks. That’s a lot of compute by academic standards but is not surprising for this type of problem. What is surprising however is the amount of compute needed for inference, i.e., making new predictions. According to Demis Hassabis, depending on the protein, they used between 5 and 40 GPUs for hours to days. Although they phrased this as “moderate” it is anything but for inference purposes. It is in fact an insane amount given that they are not doing any MD, and it has me perplexed. Nothing in their architecture as I understand it could warrant this much compute, unless they’re initializing from multiple random seeds for each prediction. The most computationally intensive part is likely the iterative MSA/distogram attention ping-pong, but even if that is run for hundreds or thousands of iterations, the inference compute seems too much. MSAs can be very large, that is true, but I doubt that they’re using them in their entirety as that seems overkill. At any rate I wonder if there is something to be learned from this mystery or if I am missing something obvious.
Impact on …
Alright so AF2 is amazing and all and maybe/maybe not solved protein structure prediction. What does this mean for …
Protein structure prediction?
The core field has been blown to pieces; there’s just no sugar-coating it. I can say this because it’s (one of) my own field(s). There are some intellectually interesting exercises left, for example predicting structure from a single sequence without structural templates or evolutionary information, and there are important engineering problems including addressing all the corner cases that AF2 still can’t. These are important and scientifically worthwhile but will be of limited interest beyond the core community of structure predictors. The “pure” problem of going from a single sequence to structure is the problem that’s been closest to my heart for over a decade, so it’s painful to say it, but it is the truth. It is similar to how the first mathematical proof of a result garners the most interest and accolades, even if subsequent complementary proofs are interesting in their own right. Everyone in the field now will be faced with tinkering around the edges, finding second proofs, or leaving to greener pastures. This was captured poignantly by a panelist at the very last session of the conference who remarked that CASP14 feels a bit like when one’s child leaves home for the very first time. It is good, in the sense that they have now matured, as has the field, and are ready to tackle bigger and better problems. But there is a bittersweetness to the experience that cannot be ignored.
To be clear I am referring only to the core protein structure prediction problem and not any of the proximal problems, including the ones on the bullet list in “A Solution?”. But, it’s going to be very risky for an academic group to swim in these waters. I won’t say more here as this is the topic of the last section of the blog post.
Experimental structural biology?
After the field of protein structure prediction itself, the fields that will most obviously be impacted by AF2 are those comprising the experimental determination of protein structure in various forms. In the most immediate term I suspect that even X-ray crystallography will actually benefit because of AF2’s value in molecular replacement, already discussed and demonstrated at CASP14. But beyond the short-term (>3-5 years) I expect AF2-like models will begin to undercut crystallography. There were comments on Twitter to the effect of “call me when this hits 0.2Å” but that’s missing the point in my opinion. There are many, many, many applications in biology that do not require sub-angstrom accuracy and that will benefit a great deal from having structures even at 3Å or 4Å accuracy (not to mention, many crystallographic structures are not at 0.2Å resolution—but it’s important to note that we’re comparing apples to oranges here as prediction RMSD and crystallographic structure resolution are two different things.) Many of these applications formerly relied on crystallography because there were not many alternatives. Now that there will be soon, demand for crystallography will invariably fall. Of course it’s unlikely to ever get completely obsoleted, as few things ever do; some people still listen to the radio and ride horses, but it’s fair to say that this is not where the action is. I say this with empathy as someone who is contending with these questions myself—at least most experimentalists can continue to do tomorrow what they were doing yesterday, and this will remain the case for the next several years. Structure predictors, on the other hand, now have to pivot instantly or face obsolescence.
To be sure there will initially be justified reluctance by the broader biology community to accept AF2 predictions as “truth”, but as we get to the point where the umpteenth crystal structure is produced and is in agreement within an angstrom or two of an AF2 prediction except for one tiny loop, attitudes will change. Maybe it will take more than a few years but the writing is on the wall. The applications that remain will require very high resolution, including most prominently drug discovery, and those will take longer to be affected. For some (very unscientific) support of this assessment I ran a poll on Twitter asking how excited biologists were by AF2’s specifications. Around 80% said they were pretty or very excited, while ~10% felt “meh” about the results. The latter is important as it suggests that for at least some subset of applications, AF2 remains underwhelming, and this is a safe area for experimental methods for now. On the other hand, the 80% who were excited suggest that many existing needs may be met by AF2.
On the subject of crystallography, there is another point to note here—the cytoplasm is not a crystal! Most crystallized protein structures are probably not particularly good representatives of the physiological state! We’ve all been engaged in a sort of make-believe exercise that crystal structures give us the “truth”, and they do in a sense, about proteins coerced into forming crystalline material. But insofar as we’re interested in biological function, if predicted structures are well within the lowest energy basin of a protein’s energy landscape, I would argue it’s unclear whether crystal structures or predicted ones will be more informative. I mean that—it’s a question to be settled in the years ahead, especially as AF2-like predictions are coupled with MD methods to get more complete pictures of low energy ensembles. Hence I find some of the hand-wringing about crystallography being the ultimate arbiter of truth a bit off the mark given its own limitations. Of course the one big wrinkle in all this, at least for the time being, is that AF2 itself is trained on crystallographic structures in the PDB. And so it’s more accurate to think of it as predicting crystal structures as opposed to predicting the lowest energy state of proteins. This is an important caveat and will only be addressed when other experimental techniques and physics-based computational methods are systematically integrated. This is an exciting frontier and one that may actually be a good space for academics to play in, as it won’t leverage DeepMind’s ML expertise as much as AF2 has.
Speaking of other experimental techniques, next up is single particle CryoEM. The outlook for CryoEM is in my opinion better in the short and medium terms, as CryoEM is increasingly focused on quaternary complexes and molecular machines. If anything AF2 will help CryoEM because it’s the details of the individual monomers that CryoEM struggles with and AF2 excels at. So a win-win. Still, DeepMind has made it clear that complexes are their next big target, and I do think that most of the intellectual heavy lifting is done, hence the “solved” part. Going from monomers → complexes will, I predict, be easier than going from pre-AF2 → monomers. The only real question is how much of a negative impact the relative paucity of experimentally determined structures of quaternary complexes will have on AF2-like models. This could turn out to be a serious issue but I wouldn’t bet on it, and at any rate structures of quaternary complexes are being generated at an accelerating rate.
The one area that I do think is truly the future of experimental structural biology, and which will remain safe and wholly complementary to AF2, is in situ structural biology. Getting the cellular context of structures is not something that DeepMind can meaningfully tackle anytime soon, but identifying structures within their cellular context is definitely something AF2 can help with. If anything AF2 may accelerate the breakneck pace of progress in CryoET and usher in the era of structural cell biology faster than even its proponents are expecting. I know a few people at EMBL who will be happy to hear this!
Biology as a whole?
This is the section I have been most looking forward to writing, for it is here that we can begin to imagine what can be.
While I love structural biology, and love staring at proteins, my interest in the field has always had a practical bent: structure not for its own sake but in service to biology. For this vision to become reality we need data, structural data, which has always been very hard to come by. AF2 is profoundly transformative because it may do for structure what DNA sequencing did for genomics: make it possible. Every question in biology, from the molecular to the cellular to the organismal to the evolutionary, can now be posed and framed in terms of structural hypotheses. We’ve done this with sequence for at least a couple of decades and it has come to define every facet of biological sciences. Now we get to do it all over again with structure. And while the structure → function dogma never fully rang true with me, it’s certainly the case that having structure > sequence for determining function.
So what does all this mean on a practical level? Truth be told, I don’t really know, and I suspect most of us don’t. We have yet to fully grok the consequences. I have been thinking about this question ever since CASP13 but have yet to truly internalize it. When televisions first appeared, networks simply emulated radio broadcasting by having talking heads fill the screen. We will do the same for a while because we don’t understand what structure at scale means. But in due time we will. I am biased of course, but I believe this is the most important question facing basic biology in a post-AF2 era.
Still, we can speculate. First is the question of function derived from structure. There have been numerous efforts to predict protein function and do so in nuanced ways that reflect their multifunctional reality. Thus far these efforts have largely relied on sequence—now all can be redone using structure. For some protein classes, particularly enzymes, this may substantially improve accuracy, especially if the structures are good enough to resolve catalytic sites. Even if they are not, coarse structures will help relate proteins in the “twilight zone”, i.e., ones far from anything we’ve functionally characterized, to ones we do know something about. Especially in prokaryotic biology, where vast swaths of bacterial proteomes are still entirely uncharacterized, this alone may transform our ability to understand and one day engineer them.
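As a cartoon of what structure-based function transfer might look like, here is a toy nearest-neighbor scheme that uses Cα contact-map overlap as its similarity measure. Everything in it is an illustrative assumption: real methods would use proper structure alignment scores (e.g., TM-score or DALI) and handle proteins of different lengths, which this sketch deliberately sidesteps by assuming equal-length chains:

```python
import numpy as np

def contact_map(ca_coords, cutoff=8.0):
    """Binary Cα–Cα contact map: True where residues lie within `cutoff` Å."""
    d = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    return d < cutoff

def contact_overlap(a, b):
    """Jaccard overlap between two equal-size binary contact maps."""
    return (a & b).sum() / max((a | b).sum(), 1)

def transfer_function(query_coords, annotated):
    """Label a query protein with the function of its most structurally
    similar neighbor among (coords, function_label) pairs."""
    q = contact_map(query_coords)
    coords, label = max(annotated,
                        key=lambda pair: contact_overlap(q, contact_map(pair[0])))
    return label
```

The point is only that once predicted structures exist at proteome scale, even crude structural similarity becomes a usable signal for annotating proteins in the twilight zone.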
It won’t happen overnight. None of what I’m saying here will. It will take years and maybe decades, but now that protein structure prediction has become an engineering exercise, we know that many of these ideas can be realized.
What else? Variant prediction. Interpreting what mutations do to proteins is both hard and important for human disease. So far we’ve been restricted almost exclusively to statistical approaches. We can’t predict what a mutation does to a protein on a molecular level, but we can discern whether a mutation is deleterious from looking at the genetics of healthy and sick populations. Because many diseases arise from interactions between and mutations in multiple genes, piecing their puzzles together requires extraordinary statistical power save for the simplest “Mendelian” diseases. Having structures for most proteins won’t, again, solve the problem overnight, but it will provide us with more powerful tools. This is especially true if we can predict how mutations alter not only individual proteins but their interactions as well. One major caveat here is how well AF2 captures the impact of small changes in sequence on structure, especially because it is an MSA-based method. A month ago I would have predicted not very well. Today I am not sure, but it will be a key test of its capabilities. Even if AF2 turns out to underperform on this problem however, I believe we now have the tools to build future versions that can address the problem.
What else? Possibly protein design. Like with variant prediction, we don’t have direct evidence that AF2 can do this, although Demis Hassabis did mention it as one of their goals. Two years ago I would have been somewhat pessimistic, both because AF2 is trained mostly on natural proteins and because it requires MSAs as input, which are not available for proteins that have undergone neither natural nor synthetic evolution. Now I am no longer so sure, primarily due to a series of three groundbreaking papers from the Baker and Ovchinnikov labs that have utilized trRosetta, a system very similar to the first AlphaFold, to successfully design proteins. This bodes well for AF2. Nonetheless, I do suspect that AF2 has learned something like the “natural manifold” of protein structure space and may struggle with structures that look nothing like natural proteins, which to be fair no de novo protein design tool can yet do.
What else? Comparative “structuromics” (?!) Take your protein of interest and all the organisms in which it resides and examine how it has changed across evolutionary time. And all the proteins it interacts with and how they changed across evolutionary time. Couple that with functional characteristics of the protein itself, of its cellular context, and of its organism. Can we understand how changes in e.g. the structure of metabolic enzymes altered the metabolism of their organisms? What about cytoskeletal proteins and the morphology of their organisms? Signaling proteins and the information processing machinery of their organisms? And on and on.
What else? Synthetic biology, a field with tremendous potential hamstrung by its focus on engineering transcriptional circuits. Engineering DNA-based circuitry has been a natural choice because DNA is a lot more predictable than proteins, enabling the mixing and matching of promoters and DNA binding motifs with relative ease. But a cell can only do so much with transcriptional regulation, which is (likely) why it evolved a rich repertoire of information processing machinery in the form of protein-based signal transduction pathways as well as the structural machinery that gives the cell its form, motility, and function. Little of this machinery has been engineerable. Pioneering work by people like Wendell Lim mixed and matched modular protein domains to rewire signaling pathways but such systems remain brittle and difficult to engineer, requiring trial and error that is not dissimilar from how we used to build bridges before the advent of civil engineering. AF2 stands to change this, not just in terms of de novo protein design but also in engineering multi-domain proteins with flexible linkers and programmable logic. Like in other applications, it remains to be seen whether AF2’s performance broadly generalizes to the multi-domain context, but if it does, it will open up new opportunities complementary to existing protein design efforts. Once engineering protein-based circuits is feasible, it is not hard to imagine graduating to the next level in complexity.
Which brings me to what I think is the most exciting opportunity of all: the prospect of building a structural systems biology. In almost all forms of systems biology practiced today, from the careful and quantitative modeling of the dynamics of a small cohort of proteins to the quasi-qualitative systems-wide models that rely on highly simplified representations, structure rarely plays a role. This is unfortunate because structure is the common currency through which everything in biology gets integrated, both in terms of macromolecular chemistries, i.e., proteins, nucleic acids, lipids, etc, but also in terms of the cell’s functional domains, i.e., its information processing circuitry, its morphology, and its motility. A structural systems biology would take this seriously, deriving the rate constants of enzymatic and metabolic reactions, protein-protein binding affinities, and protein-DNA interactions all from structural models. We don’t yet know how much easier, if at all, it will be to predict these types of quantities from structure than from sequence—we need to put the dogma of “structure determines function” to the test. Even if the dogma were to fail in some instances, which it almost certainly will, partial success will open up new avenues.
Systems biology has hitherto been surprisingly non-spatial. ODEs over PDEs as it were. There are many reasons for this beyond the issue of structure, but as in situ structural biology becomes increasingly more powerful and is combined with predicted protein structures, new forms of simulation become possible that do take space seriously, both on the microscale of molecular machines as well as on the mesoscale of cellular components. These are long-term visions likely to take a decade or decades but the groundwork for them has been laid with AF2. Up till now structure only existed in small isolated pieces, for fragments of proteins and fragments of proteomes. AF2, when it becomes widely available, will change all this.
I will end this section with the question that gets asked most often about protein structure prediction—will it change drug discovery? Truthfully, in the short term, the answer is most likely no. But it’s complicated.
One important thing to note is that, of the entire drug development pipeline, the early discovery stage is just that, an early stage. Even if crystallography were to become fast and routine, it would still not fundamentally alter the dynamics of drug discovery as it is practiced today, as most of the cost is in the later stages of drug development beyond medicinal chemistry and well into biology and physiology. Reliable protein structure prediction doesn’t change that.
In my CASP13 post I took pharmaceutical companies to task for not investing in protein structure prediction. This was not because it has immediate applications, certainly not back then. Instead, I thought that a problem of such fundamental biochemical importance ought to interest pharmaceutical companies if for no other reason than to develop a robust basic research program that attracts the world’s best talent, especially the world’s best machine learning talent. This is arguably the real value proposition of DeepMind for Google (and MSR for Microsoft, FAIR for Facebook, etc.) Not immediate translation, but an intellectual core that feeds into other parts of the company. Some pharmaceutical companies may be beginning to see the value of this but most of the exciting work remains in startups and companies backed by forward-thinking VCs like Flagship Pioneering and a16z.
Getting back to early-stage drug discovery, one part where AF2 can help is in determining structures of protein targets that can be modulated for therapeutic purposes. The first challenge here, for the very immediate future, is that AF2 is trained to predict apo (unbound) protein structures while most medicinal chemistry applications require complexes of the protein bound to a small molecule. The second is that sub-angstrom resolution is often necessary, which remains beyond what AF2 can achieve. A more fruitful direction for AF2 may lie in designing protein-based therapeutics, e.g., antibodies and peptides, where ultra-high resolution is less needed.
In the long run the true power of AF2 may come in providing a more robust platform for drug discovery, particularly within a systems pharmacology framework. We’re not there yet but we can imagine a future in which drugs are designed for their polypharmacology, i.e., to modulate multiple protein targets intentionally. This would very much be unlike conventional medicinal chemistry as practiced today where the emphasis is on minimizing off-targets and making highly selective small molecules. Drugs with designed polypharmacology may be able to modulate entire signaling pathways instead of acting on one protein at a time. There have been many fits and starts in this space and there is no reason to believe that a change is imminent, especially because the systems side of the equation remains formidable. Wide availability of structures may hasten progress however.
Why DeepMind?
I promised to write one piece of armchair sociology and so here it is. Why was it DeepMind, rather than an academic group, that built AF2?
First and foremost it has to do with the people who make up the AF2 team. One should not pretend that they are substitutable. Even within DeepMind, if it were a different set of people we would likely have had a different outcome. This may seem obvious but I repeatedly heard people treat the AF2 team as an amorphous blob. Let us not forget that the main reason they did so well is because of who they are, their talents, and their dedication. In this most important sense, it is not about DeepMind at all.
Resources also helped and this is not to be underestimated, but I would like to focus on organizational structure as I believe it is the key factor beyond the individual contributors themselves. DeepMind is organized very differently from academic groups. There are minimal administrative requirements, freeing up time to do research. This research is done by professionals working at the same job for years and who have achieved mastery of at least one discipline. Contrast this with academic labs where there is constant turnover of students and postdocs. This is as it should be, as their primary mission is the training of the next generation of scientists. Furthermore, at DeepMind everyone is rowing in the same direction. There is a reason that the AF2 abstract has 18 co-first authors and it is reflective of an incentive structure wholly foreign to academia. Research at universities is ultimately about individual effort and building a personal brand, irrespective of how collaborative one wants to be. This means the power of coordination that DeepMind can leverage is never available to academic groups. Taken together these factors result in a “fast and focused” research paradigm.
AF2’s success raises the question of what other problems exist that are ripe for a “fast and focused” attack. The will does exist on the part of funding agencies to dedicate significant resources to tackling so-called grand challenges. The Structural Genomics Initiative was one such effort and the structures it determined set the stage, in part, for DeepMind’s success today. But all these efforts tend to be distributed. Does it make sense to organize concerted efforts modeled on the DeepMind approach but focused on other pressing issues? I think so. One can imagine some problems in climate science falling in this category.
To be clear, the DeepMind approach is no silver bullet. The factors I mentioned above—experienced hands, high coordination, and focused research objectives—are great for answering questions but not for asking them, whereas in most of biology defining questions is the interesting part; protein structure prediction being one major counterexample. It would be short-sighted to turn the entire research enterprise into many mini DeepMinds.
There is another, more subtle drawback to the fast and focused model and that is its speed. Even for protein structure prediction, if DeepMind’s research had been carried out over a period of ten years instead of four, it is likely that their ideas, as well as other ideas they didn’t conceive of, would have slowly gestated and gotten published by multiple labs. Some of these ideas may or may not have ultimately contributed to the solution, but they would have formed an intellectual corpus that informs problems beyond protein structure prediction. The fast and focused model minimizes the percolation and exploration of ideas. Instead of a thousand flowers blooming, only one will, and it may prevent future bloomings by stripping them of perceived academic novelty. Worsening matters is that while DeepMind may have tried many approaches internally, we will only hear about a single distilled and beautified result.
None of this is DeepMind’s fault—it reflects the academic incentive structure, particularly in biology (and machine learning), that elevates bottom-line performance over the exploration of new ideas. This is what I mean by stripping them of perceived academic novelty. Once a problem is solved in one way, it becomes hard to justify solving it another way, especially from a publication standpoint.
To be sure I would far rather have DeepMind be in this space than not, and I would not trade AF2 for the thousand flowers I mentioned; better a bird in hand. But it does raise the question of whether it is possible to have one’s cake and eat it too. To have fast and focused efforts co-exist with the slow and steady progress of conventional research.
Academic Research in a New Age
Speaking of co-existence, what are the prospects for academic research on biomolecular machine learning in the post-AF2 era? I write this section with an eye toward helping prospective researchers, students, and postdocs chart their path through what has become intensely competitive territory. After CASP13 I described the AlphaFold group as a “world class research team … competitive with the very best existing teams.” It’s fair to say that the situation now looks markedly more lopsided: DeepMind has become so dominant it should give any prospective researcher pause before entering the field.
To think through what DeepMind’s presence means moving forward let’s place ourselves in DeepMind’s shoes and try to reverse-engineer their logic. My hope is that in so doing we begin to map out a research perimeter that is not under the constant threat of being crushed in <2 years’ time.
First some observations. DeepMind cares about making a splash. All their major efforts ranging from Go to StarCraft 2 to AlphaFold have been coupled with massive media blitzes. They do carry out more conventional and less glamorous research, but the projects with the big resources tend to be splashy ones.
Second, and this is a point that Demis himself made, DeepMind likes well-defined problems with clear objectives and metrics. Science is almost never this way but protein structure prediction actually fit the bill perfectly. There is literally a leaderboard every two years. Scientific problems with this feature are likely to attract DeepMind’s attention.
Third, DeepMind does have a core competency and it is machine learning. By the late 2010s protein structure prediction had turned into an almost exclusively machine learning problem. It required some domain expertise, and the AF2 team composition reflected that a bit, but by and large the hard problems were machine learning ones. This suggests that problems in which machine learning is not the core nut to crack are also less likely to attract DeepMind’s attention.
Fourth, given point three, any problem that machine learning can tackle must have a lot of data, and representative data that cover a large swath of the problem space.
As DeepMind begins to reckon with what comes next after AF2 they are likely to focus first on problems that look a lot like protein structure prediction. Based on the above observations let’s consider some of these outstanding problems. The first and most obvious is predicting the structure of protein complexes. It is the “next step” after protein structure and is sufficiently proximal that CASP already has a category for it called Assembly. Data here is not as abundant as in protein structure prediction, in fact there is probably around an order of magnitude less, but it is also the case that multi-domain proteins are somewhat informative of this problem as inter-domain packing is more similar to protein complex formation than protein folding. Based on this and DeepMind’s repeated assertions that they want to tackle the problem, it’s fair to say that this will be very competitive territory. My advice here, as it will be elsewhere, is not to throw one’s hands in the air and exit the space altogether. Instead my suggestions are twofold. First, be aware of the leaderboard clock if it exists in your subfield and plan to publish off-cycle so as not to get crushed by DeepMind’s media machine. In some ways trRosetta is a good example of this, landing right in between CASP13 and 14. Second, work on developing general purpose machinery that can be applied to many problems. If quaternary complexes happen to be an open problem at the moment, by all means tackle them, but avoid constructing overly specialized toolkits that take a lot of time to develop and generalize poorly to proximal problems. I think this is good advice in general, but all the more so in a hypercompetitive landscape.
What comes next? Protein-ligand, protein-DNA, etc. Here we begin to see some cracks. Protein-DNA interactions probably check off enough boxes that DeepMind may well tackle them. Protein-small molecule interactions are very tempting of course, because of drug discovery applications, but the scientific problem is much hairier. One because organic molecules occupy a larger and more topologically complex chemical space than proteins. Two because the data is much worse, distributed across multiple silos, many behind corporate IP walls, and inherently much less randomly sampled and representative. It doesn’t make the problem impenetrable to DeepMind but it almost certainly means they can’t crack it in its full generality. They can aim at bits and pieces of it, which lowers the splashiness of any solution. I suspect that because of commercial implications they will make a serious push, but it will have less of an impact on academic research, resulting in work that gets published in specialized journals or that is entirely locked up from view.
What comes after that? Perhaps protein function prediction. What is protein function? That is in itself a good question, but asking questions is not the sort of challenge DeepMind wants, so let’s look for predefined notions of protein structure-function relationships. There are certainly some, for example the classification of enzymes into EC classes. This problem is unlikely to generate many CNS papers however. And even if “solved”, I would argue what would be really interesting is a finer-grained delineation of enzymatic categories and a tighter understanding of the relationship between structure and enzymatic function, including allostery and dynamics. Once again we’re venturing into territories where problems are poorly defined and the most intellectually stimulating work is about defining problems rather than solving them. As we exit the realm of protein complexes and protein-X interactions, we quickly run out of problems with clear objectives and large datasets and begin to encounter problems of ever shrinking scope and dataset size. This doesn’t mean that DeepMind will not pursue some of them. But as they do, given that none are grand challenges worthy of the whole team, they will begin to splinter their human resources to focus on disparate projects, ones that require no insignificant degree of domain expertise. This will dilute their team and, in the long run, make DeepMind look increasingly like regular academic groups, but with better funding, better resources, backing of world-class engineering, and some degree of cooperation between the various subgroups. That is certainly not a bad place to be! I expect, however, that they will become less scary, resembling more the version I described at CASP13 than the terrifying bulldozer they have become at CASP14.
There is one other problem worth discussing here and that is the dynamic process of protein folding itself. It’s fun to speculate on whether DeepMind will attempt to tackle it. On the one hand, it plays well to some of their expertise, certainly reinforcement learning, and is in some sense a grand challenge. On the other hand, there are multiple strikes against it. First, there is virtually no experimental data to benchmark against, nor much in the way of clear-cut metrics. Success is hard to define here, other than in terms of proxy applications, e.g., can one design a better drug, which is what outfits like DESRES and Nimbus have long made a bet on. Speaking of which, there is also serious competition in this space from well-established industrial labs. Second, it is a hard problem, a really hard problem, much more so than protein structure prediction. Because of the lack of data, as I mentioned, but also because the fundamental object one is trying to infer is inherently more complicated than the single lowest energy state of protein structure prediction. If any group is up to the task it is DeepMind but it would mean expending substantially more resources on the problem than they already have, perhaps an order of magnitude more. Would they do that? Which brings me to my third point. The grand prize has already been won with AF2. DeepMind declared victory and said that the “protein folding” problem is solved. From a publicity or even Swedish committee standpoint, there is not much more to be gained, which speaks against allocating inordinate resources to double down on the problem. My guess is that the scientists will push for it, because it’s exciting, but management will push back, because it will seem pointless and overly expensive. Either way I believe it is safe for academics to operate in this space for the foreseeable future, as the problem really is quite hard and I don’t foresee DeepMind making AF2-like progress in <5 years.
Needless to say however I’ve been proven wrong by them before!
I will close with a broad comment about one’s motivation for research and how that interacts with being in a competitive field. Some researchers do research because they like solving puzzles for their own sake and couldn’t care less about publishing in glossy journals or even having practical impact. If you are in this category I say carry on, as DeepMind’s presence will have little effect and if anything may be a source of inspiration and new ideas. Some researchers like the competition and the race, the thrill of trying to outdo someone else. If you are of this sort, DeepMind’s entry may be the biggest boon you could ever hope for, presenting a bigger fish than anyone else in the space. Keep that in mind though, as getting repeatedly crushed may not be much fun. Finally, some researchers do research to make the world a better place, enhance our understanding of natural phenomena and in so doing empower human interests, including human health. For people in this category competing with DeepMind does not make much sense because they are a very capable team and are likely to crack the problems they set their sights on. Working on the same problems seems redundant to me and one might as well dedicate one’s energy to other, understudied, problems.
Speaking for myself, I am a mixture of all three but lean most heavily on the second and third archetypes. In some ways I entered academic science to temper my competitive tendencies, as I know they can get out of hand. If I wanted to really compete, I would build a startup that takes on DeepMind, but this seems a little pointless given that DeepMind is already there. Hence for so long as I am doing science, I will look for areas that are interesting, impactful, and understudied, as all things being equal, I would rather make an impact that betters the world than one that is neutral. Different people are different of course and you will have to chart your own path given your interests and temperament. Good luck!
Thanks to Randy Read for pointing out my inconsistency when using the terms “accuracy” and “resolution”. I have now made my usage more consistent, reserving “accuracy” for prediction RMSD and “resolution” for crystallographic resolution.