The Future of Protein Science will not be Supervised

But it may well be semi-supervised.

For some time now I have thought that building a latent representation of protein sequence space is a really good idea, both because we have far more sequences than any form of labelled data, and because, once built, such a representation can inform a broad range of downstream tasks. This is why I jumped at the opportunity last year when Surge Biswas, from the Church Lab, approached me about collaborating on exactly such a project. Last week we posted a preprint on bioRxiv describing this effort. It was led by Ethan Alley, Grigory Khimulya, and Surge. All I did was to enthusiastically cheer them on, and so the bulk of the credit goes to them and George Church for his mentorship.

Surge has already done a great job summarizing the paper in a Twitter thread and so I won’t spend much time explaining what we did. I will instead focus on what I found surprising during the course of this project and the implications of representation learning for protein informatics and, in the long run, the rest of protein science. I suspect that these types of representations will become foundational to how we understand proteins in the future, perhaps rivaling the impact they have had on natural language processing. I also think that, viewed as a conceptual playground, proteins will prove to be a more fertile ground for ideation in semi-supervised learning than other application domains. In short, I will make the case that if you’re an ML researcher working on semi-supervised learning, one of the best domains to be working in, if not the best, is proteins. Conversely, if you’re a protein informaticist, I would encourage you to pay close attention to semi-supervised learning!

But first, the paper itself. The basic idea of ‘UniRep’ is to train a neural network model to compress protein sequence space, in this instance by training a (multiplicative) LSTM to do next letter (amino acid) prediction, thereby mapping any arbitrary protein sequence to a fixed-length vector representation. The techniques are not necessarily new, although of course applying them to the protein context brings its own challenges. What is remarkable about this seemingly simple procedure is the power of the resulting representation. It does a surprisingly good job of predicting protein function across a diverse set of tasks, including ones structural in nature, like the induction of a single neuron that is able, with some degree of accuracy (ρ = 0.33), to distinguish between α helices and β strands. (I suspect the network as a whole is far more performant at this task than the single neuron we’ve identified, but we didn’t push this aspect of the analysis as the problem is well tackled using specialized approaches.) What this suggests, and what I still find remarkable, is that by simply having to be compressive in its representation of protein sequence space, an RNN is compelled to build an internal detector of secondary structure, presumably because knowing about secondary structure makes it a more efficient compressor. It is not dissimilar from some of the recent results on sentiment analysis, and models like GPT-2 are even more breathtaking, but the power of the compressive principle to be so damn effective at inducing human-interpretable structure still manages to surprise me. Perhaps this is even more so when the modality is not obviously sensory, but something abstract (albeit structured) like protein sequences. I recall not long ago one of the grand poohbahs of machine learning saying that non-sensory sequences, specifically biological sequences, are unlikely to benefit from RNNs precisely for this reason. I think it’s safe to say now that they’ve been proven wrong.
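To make the next-letter objective and the fixed-length representation described above concrete, here is a minimal sketch in plain Python. It is not the UniRep implementation. It swaps the multiplicative LSTM for a toy tanh RNN, uses random untrained weights, and assumes that averaging the hidden states over positions is an acceptable way to pool them into one vector. The function names and the hidden size of 8 are hypothetical choices for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def init_params(hidden_size, seed=0):
    """Random, untrained weights for a toy tanh RNN (illustrative sizes)."""
    rng = random.Random(seed)
    n = len(AMINO_ACIDS)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_in": mat(hidden_size, n),             # one-hot residue -> hidden
        "W_rec": mat(hidden_size, hidden_size),  # previous hidden -> hidden
        "W_out": mat(n, hidden_size),            # hidden -> next-residue logits
    }


def step(params, hidden, residue):
    """Advance the hidden state by one residue."""
    col = INDEX[residue]  # a one-hot input just selects one column of W_in
    return [
        math.tanh(w_in[col] + sum(w * h for w, h in zip(w_rec, hidden)))
        for w_in, w_rec in zip(params["W_in"], params["W_rec"])
    ]


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def encode(params, sequence):
    """Return (fixed-length representation, mean next-residue loss)."""
    hidden_size = len(params["W_rec"])
    hidden = [0.0] * hidden_size
    total = [0.0] * hidden_size
    loss = 0.0
    for i, residue in enumerate(sequence):
        hidden = step(params, hidden, residue)
        total = [t + h for t, h in zip(total, hidden)]
        if i + 1 < len(sequence):
            # Cross-entropy of the true next residue under the model.
            logits = [sum(w * h for w, h in zip(row, hidden))
                      for row in params["W_out"]]
            loss -= log_softmax(logits)[INDEX[sequence[i + 1]]]
    # Assumption: mean-pool hidden states into one vector per protein.
    representation = [t / len(sequence) for t in total]
    return representation, loss / max(len(sequence) - 1, 1)


params = init_params(hidden_size=8)
short_vec, short_loss = encode(params, "MKV")
long_vec, long_loss = encode(params, "MSKGEELFTGVVPILVELDGDVNGHKF")
print(len(short_vec), len(long_vec))  # prints: 8 8
```

Sequences of length 3 and length 27 both map to a vector of the same size, which is what lets one representation feed many downstream tasks. Training would consist of adjusting the weights to lower the next-residue loss over a large sequence database; that step is omitted here.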

Another aspect that I found particularly surprising is the power of the unsupervised signal. Across the set of tasks we tackled, UniRep was able to outperform the structurally-supervised RGN models I published last year. One would have thought that structural information, particularly for tasks like thermal stability, would be far more valuable than sequence information. And of course it’s likely the case that if one were to have as many structures as sequences, a structurally-supervised model would do better. But given the gap in available data, sequence trumps structure. This antagonism is superficial of course—the right thing to do would be to leverage the advantages of both, and it is an avenue we’re currently exploring.

A key feature of this type of representation learning is its induction of a global representation of protein sequence space. Much of the prior work in this space has focused on family-specific representations, for example VAE-based ones. While family-specific representations have proven to be very powerful, particularly for protein structure prediction and for predicting the effects of mutant variants, they have always left me feeling somewhat unsatisfied. I say this because from the perspective of data / sampling efficiency, they’re only able to exploit patterns observed within a single protein family. They fragment protein sequence space into local clusters and perform learning, unsupervised or otherwise, separately for each family. This sort of sample vs. model complexity tradeoff is emblematic of much of machine learning, and I’ve written about it before in the context of predicting SH2-mediated protein-protein interactions. On the one hand, if each protein family is learned separately, the complexity of the model is reduced. But the amount of data available is also reduced, fracturing the inherent universality of proteins into tiny phenomenological universes. On the other hand, a global model of protein sequence space is able to leverage all available data, but must learn something truly general, a much more demanding task that substantially increases model complexity. What is the right tradeoff—where is the sweet spot? My hunch has long been that a global model is likely to be more performant. And with UniRep, I believe we’re beginning to see this play out.

Beyond the question of performance, a global model of protein sequence space has the potential to be much more useful, by being more broadly applicable. The challenge we face in protein informatics, and protein science more broadly, is that our functional characterization of proteins is sparse and patchy. Certain protein families have had the benefit of deep functional characterization, both in terms of gross function (e.g. the plethora of mammalian signaling proteins) and in terms of detailed structural perturbation (e.g. as obtained from deep mutational scans). The GFP family of proteins studied in the UniRep paper is an example of one such family. This patchiness makes it difficult to say something truly useful about the vast majority of proteins (across Prokarya and Eukarya) because, to use an overextended metaphor, they exist in a dark region of sequence space. And the detailed characterization we have of a few families ends up being largely uninformative about this larger space, except in a vague conceptual way.

Global models of protein sequence space have the potential to change this because, if we can get them to work well, they can at a minimum help us see the connective tissue that underpins protein space, and thereby relate information about well-characterized protein families to poorly characterized ones. In essence, they provide a fancy version of k-nearest neighbors, by densely populating the empty space surrounding sparsely characterized proteins, enabling functional associations to be transferred from one protein to another much farther than previously possible. I believe something similar was occurring in my differentiable RGN models of protein structure, in particular when I moved from training models on individual protein sequences to PSSMs. What PSSMs provide is precisely this form of connective tissue, by forging links between seemingly far away proteins whose relationship would be undetectable by mere sequence similarity. By leveraging the evolutionary record, one is able to see how one protein relates to another, and I suspect this allowed RGNs to become much more performant with PSSMs. UniRep has the potential to do something similar without PSSMs.
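The "fancy k-nearest neighbors" picture above can be sketched in a few lines: given embeddings for a handful of characterized proteins, an uncharacterized protein inherits the annotation most common among its nearest neighbors in representation space. The embeddings, labels, and the function name transfer_annotation below are made up for illustration, and cosine similarity is an assumed choice of metric, not something taken from the paper.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def transfer_annotation(query, labelled, k=3):
    """Majority vote over the k labelled embeddings most similar to query."""
    ranked = sorted(
        labelled,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 3-dimensional embeddings and labels, for illustration only.
labelled = [
    ([0.9, 0.1, 0.0], "kinase"),
    ([0.8, 0.2, 0.1], "kinase"),
    ([0.1, 0.9, 0.2], "fluorescent"),
    ([0.0, 0.8, 0.3], "fluorescent"),
    ([0.2, 0.7, 0.1], "fluorescent"),
]
uncharacterized = [0.85, 0.15, 0.05]
print(transfer_annotation(uncharacterized, labelled, k=3))  # prints: kinase
```

The quality of such a transfer depends entirely on whether distance in the learned space tracks function, which is the property a good global representation is supposed to provide.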

Beyond this, global models of protein sequence hold the possibility of learning something truly general about proteins, that would move us beyond mere k-nearest neighbor matching to something more akin to a linguistics of proteins, decomposing protein sequences into their constituent functional and structural fragments. The fact that UniRep learned something about protein secondary structure is an indication that this is possible and already happening, without any supervision. This is important because unlike secondary structure, most of the principles of protein function (and perhaps structure) remain opaque to us, and so our ability to perform supervision will remain limited for the foreseeable future.

To drive home this point, below is a plot of the number of available protein sequences and structures over the past decade.

What should be obvious is that new protein sequences are being acquired at a much faster pace than structures, and the same holds true for pretty much any form of functional characterization of proteins. There is no assay that we can perform today that will close this exponentially increasing gap, and certainly nothing on the horizon.

On the one hand this may seem depressing, but on the other hand, I believe it presents a unique opportunity for unsupervised and semi-supervised machine learning. I am aware of no other problem in which the gap between labelled and unlabelled data is this large and continually increasing. If unsupervised learning can be made to work somewhere, it ought to be here. I emphasize this point because I have followed unsupervised learning with some interest over the last few years, and found most applications to be somewhat uncompelling, in the sense that the increase in performance gained from e.g. unsupervised initialization of an RNN always seems to be marginal. In many applications this is further exacerbated by the fact that acquiring labelled data is not all that expensive, rendering the extra effort that goes into semi-supervised learning even less worthwhile. However, the gap between labelled and unlabelled data in most applications is on the scale of 1 to 2 orders of magnitude, at most. What we see in protein sequence vs. e.g. structure is a gap of 5 orders of magnitude. This suggests, again, that if unsupervised learning were to work anywhere, it ought to be on proteins.

Another advantage of proteins is the wealth of prior knowledge that can be exploited to construct sophisticated loss functions for unsupervised learning, instead of simple next letter prediction. (Approaches like BERT have gone beyond this, but the unsupervised signals available in NLP seem to be much more limited than those available for proteins.) A recent paper from Bonnie Berger’s lab demonstrates this idea. (On the whole this paper deserves a lot more attention than it’s received IMO; it’s really a very good piece of work. Thanks to Tami Lieberman for the pointer.) Instead of simply learning to predict a missing amino acid, they augment their system with an auxiliary loss that predicts the structural distance between two proteins, based on their SCOP classification. It’s a simple idea, but it demonstrates the type of structured knowledge that exists for proteins. (To be sure, in this case the loss signal requires actual protein structures, and so it’s not entirely unsupervised.)
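As a rough illustration of how a SCOP-based auxiliary signal like the one described above could be derived, the sketch below counts how many levels of the SCOP hierarchy two proteins share and adds a penalty for mispredicting that count to a language-model loss. This is not the paper's actual formulation. It assumes classification strings of the form class.fold.superfamily.family, the squared-error penalty and the weight parameter are stand-ins chosen for simplicity, and both function names are hypothetical.

```python
def scop_similarity(sccs_a, sccs_b):
    """Count how many leading SCOP levels two classification strings share.

    Assumes identifiers of the form class.fold.superfamily.family,
    e.g. "a.1.1.2". Returns 0 (different class) through 4 (same family).
    """
    shared = 0
    for level_a, level_b in zip(sccs_a.split("."), sccs_b.split(".")):
        if level_a != level_b:
            break
        shared += 1
    return shared


def combined_loss(language_model_loss, predicted_similarity,
                  sccs_a, sccs_b, weight=1.0):
    """Next-residue loss plus an assumed squared-error auxiliary penalty."""
    target = scop_similarity(sccs_a, sccs_b)
    return language_model_loss + weight * (predicted_similarity - target) ** 2


print(scop_similarity("a.1.1.1", "a.1.1.2"))  # prints: 3
print(scop_similarity("a.1.1.1", "b.1.1.1"))  # prints: 0
```

In a real model, predicted_similarity would be computed from the embeddings of the two proteins, so that minimizing the combined loss pushes structurally related proteins toward each other in representation space.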

These are early days in our understanding and formulation of protein sequence space. It is an incredibly rich object, with much known and even more that is unknown. For a long time the only way we could relate one protein to another was through explicit pairwise or multiple sequence alignments, which assumed a direct evolutionary relationship and induced a residue-to-residue correspondence. What we’re beginning to see now is the emergence of something more general, a way to think about proteins that is less concerned with their evolutionary relationships and more concerned with their fundamental functional and structural constituents. If we push this approach to its limits, we may end up with a theoretical science of proteins, one which spans not only the space of extant proteins but generalizes to heretofore unseen ones. The challenge to this vision is the (lack of) human interpretability of such representations. If and how this challenge may be overcome, and whether it’s worth trying at all, is a subject for another post.

Acknowledgments: Thanks to Surge Biswas, Grigory Khimulya, and Ethan Alley for reading and providing feedback on an earlier version of this post.


  1. Hi Mohammed,

    Great post. I’m new to this field. A naive question: what’s the difference between the latent space and the hidden state in an HMM? To my understanding, this latent space sounds like the hidden states, except that the dimensions of the input data and the latent space could be different.


    • The concepts are certainly related, but with HMMs you have a Markovian assumption that limits the length of context that the latent state encodes, while in principle with RNNs you have infinite memory (of course in practice you don’t).
