# A New Way to Read the Genome

I am pleased to announce that earlier today the embargo was lifted on our most recent paper. This work represents the culmination of over two years of effort by my collaborators and me. You can find the official version on the Nature Genetics website here, and the freely available ReadCube version here. In this post, I will focus on making the science accessible to the lay reader. I have also written another post, The Quantified Anatomy of a Paper, which delves into the quantified-self analytics of this project.

We set out to address an easy-to-state but difficult-to-solve problem: predict, from genetic sequence alone, the consequences of mutations. This is a fundamental problem that lies at the heart of genomics, as our ability to obtain data continues to far outpace our ability to make sense of it. We still cannot, in the general case, understand what any given mutation does. But the work we published today takes a small step in this direction. For a subset of proteins involved in cellular signaling, we are now able to predict how any single mutation affects their ability to interact with their partner proteins. This means that for all diseases affected by mutations (we focus on cancer), we can examine how signaling pathways are rewired in the disease state. This can lead to a better understanding of the basic biology of signaling in healthy and diseased cells, and to the development of drugs that target previously unknown proteins in signaling pathways.

**Statistical Mechanics By Way Of Machine Learning**

The basic idea behind our model is to cast a statistical mechanical model into a machine learning formulation. It is a well-observed fact of machine learning that the model formulation (what the inputs and outputs are and how the data are represented) often matters far more than the specific choice of algorithm. I believe this was particularly true in our case, and it explains to some extent why it was possible to make substantial progress on this relatively well-studied problem.

The main issue with existing methods is that they largely fall into two camps, neither of which optimizes the trade-off between model complexity and statistical power very well. The first camp comprises general protein-protein interaction methods. Such methods can in principle predict the binding affinity of any two proteins, but in practice they are limited to qualitative predictions. General methods take on a significantly more challenging problem than the single-domain problem, because arbitrary proteins can bind other proteins in effectively arbitrary ways. This would not be catastrophic were it not for the fact that the additional data gained by pooling all existing data on all types of protein-protein interactions is far from sufficient to offset the increase in model complexity. Hence general protein-protein interaction methods sit at the “too little data for too much model complexity” end of the spectrum.

The other camp encompasses domain-specific methods. The definition of what constitutes a “domain” varies, but for these methods it typically refers to an individual protein domain, e.g. one of the 100+ SH2 domains in the human proteome. Each model is specific to a single domain. This results in low model complexity (relative to the general protein-protein interaction case), since a given domain is unlikely to vary all that much between its interaction partners, but also equally low statistical power, since only data specific to the individual domain are used. Consequently, while these approaches solve the model complexity problem, they lose the data richness of general protein-protein interaction methods.

The key to solving the SH2 domain problem, or any machine learning problem, is to optimize the trade-off between model complexity and statistical power. Unlike existing approaches, our model is not domain-specific but generally applicable to any SH2 domain. This had the benefit of making it possible to model mutations in SH2 domains, something that was not possible before. Furthermore, given that most SH2 domains vary little from one another, no substantial increase in complexity was incurred by generalizing our model to any SH2 domain sequence. At the same time, by pooling data from all SH2 domains, we effectively gained two orders of magnitude of additional data for what I suspect is a nominal increase in model complexity. Hitting this sweet spot of model complexity vs. statistical power is one of the key enabling aspects of our model.

The approach we took in tackling this problem synthesizes structural biology, the study of the shape and motion of individual biological molecules, with systems biology, the study of how multiple biological molecules assemble to form functioning systems that carry out cellular growth, division, motion, signaling, and myriad other tasks. To make a tractable first step, we focused on a single family of proteins, ones containing what are known as Src Homology 2 (SH2) domains. Loosely speaking, domains are parts of proteins that function independently and are reused repeatedly throughout evolution in many proteins. SH2 domains in particular are critical to cellular signaling. When the cell senses an external stimulus, information encoding that stimulus is propagated throughout the cell through the use, in part, of SH2 domains that interact with other proteins to form a chain of signaling events, passing information from one molecule to another.

We began by building a model of how individual SH2 domains interact with their protein partners. Such models had been built before, but they lacked the necessary precision to distinguish between two proteins that differed by only a single residue. Making progress in this instance required that we step back and think about the formulation of the problem, how it is typically stated and how it may be restated, before barging ahead with new algorithms. While it is often difficult to pinpoint the precise reason why a new model works where previous ones did not, I do suspect that this reformulation was the key ingredient. For more on the technical details see the box “Statistical Mechanics By Way Of Machine Learning”.

Once we had a working model of individual SH2 domains, we set out to test it computationally and experimentally. The model made a number of surprising predictions, most strikingly for a protein about which little was previously known: it quadrupled the number of that protein's interaction partners. When the initial experimental results came back positive, we were ecstatic, as the agreement was not only qualitative but quantitative. It seemed that it might be possible to model SH2 domains after all.

The next step proved particularly challenging, however. Thus far we had been modeling individual SH2 domains, but real proteins are composed of multiple domains, sometimes dozens. To make useful predictions about the effects of mutations, we needed to model proteins in their entirety. This was unexplored territory, as existing models focused on individual domains. We would start down a path, often arriving at what seemed like a solution only to discover some critical flaw. For me personally, this work had a distinctly theoretical flavor, very different from the usual machine learning to which I am accustomed. Eventually, after several false starts, we arrived at what we believe is the right conceptual model. It solves several problems at once, including the ability to quantify, in an interpretable way, the likelihood that a mutation will lead to a change in protein function that is consequential, i.e. detectable and biologically meaningful. See the box “Deriving an Interpretable Metric” for more.

Armed with this new model, we were now in a position to analyze the effects of mutations in a given disease on SH2 signaling in humans. We decided to focus on cancer because of the inherent relevance of the problem, as cancer is known to impact signaling pathways. Thanks to large-scale publicly funded efforts, thousands of tissue samples from cancer patients have already been taken and their genomes sequenced. This enabled us to analyze the effects of these cancer mutations on the SH2 signaling network. One of the first things to emerge was that cancer mutations seem to target connected subnetworks in the larger human SH2 network. Below I show a figure for kidney cancer which illustrates this.

Individual proteins are shown as nodes, with edges between nodes indicating affinity for interaction. Edges that are perturbed in kidney cancer are shown in green and orange. A priori, there is no reason for the perturbed edges to “cluster” the way they do, i.e. for them to form a connected subnetwork within the larger network. But they repeatedly seem to do this, with different subnetworks targeted in different types of tissue. These subnetworks are suggestive of signaling chains that play a role in cancer function. In any given patient, only one of these edges may be disrupted, and that disruption may be sufficient to dysregulate the entire chain. If one were to examine one patient at a time, these subnetworks would not emerge. By pooling mutations from multiple patients, however, one is able to observe the extent and connectivity of these potential signaling chains.
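To make the pooling argument concrete, here is a minimal Python sketch. It is not the analysis from the paper: the protein names, the per-patient edge lists, and the helper functions `pool_perturbed_edges` and `connected_subnetworks` are all hypothetical, chosen only to show how single perturbed edges from different patients can join into one connected subnetwork once they are pooled.

```python
from collections import defaultdict


def pool_perturbed_edges(per_patient_edges):
    """Merge each patient's perturbed edges into one set of undirected edges."""
    pooled = set()
    for edges in per_patient_edges.values():
        for a, b in edges:
            pooled.add(tuple(sorted((a, b))))  # (A, B) and (B, A) are the same edge
    return sorted(pooled)


def connected_subnetworks(edges):
    """Return the connected components (as sets of proteins) of an edge list."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


# Hypothetical data: each patient carries a single perturbed edge.
per_patient_edges = {
    "patient_1": [("A", "B")],
    "patient_2": [("C", "B")],
    "patient_3": [("C", "D")],
    "patient_4": [("X", "Y")],
}

pooled = pool_perturbed_edges(per_patient_edges)
components = connected_subnetworks(pooled)
largest = max(components, key=len)
print(sorted(largest))
```

Viewed one patient at a time, each network contains a single two-protein edge. Only after pooling does the four-protein chain appear, which is the effect described above.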

**Deriving an Interpretable Metric**

One of the more interesting, and challenging, aspects of the project was deriving a quantity to denote the importance of a mutation. While it’s easy to construct an ad hoc metric, we wanted a quantity that is interpretable, one that does not require mental gymnastics to understand. We also wanted this quantity to be biologically relevant, to correspond to changes that would be consequential and detectable experimentally.

With respect to interpretability, one challenge was that we had two probabilities to contend with, corresponding to the likelihood of an interaction before and after a mutation. A simple ratio (or difference) of probabilities, while an easy choice, does not have an obvious physical meaning. Furthermore, a ratio would mask important aspects of the change in binding affinity. For example, a strong interaction becoming a very strong interaction would register the same as a non-existent interaction becoming a strong interaction.

With respect to biological relevance, we needed to contend with the fact that mutations occur in domains residing in complex protein “contexts”, i.e. proteins composed of other domains and binding sites, all of which have the potential to interact with one another. From a biological standpoint, disruption of a single domain-domain interaction may not be consequential or even experimentally detectable. Furthermore, natural variation across instances of an interaction between proteins means that two proteins may sometimes register as interacting and sometimes as not, merely due to noise.

The quantity we ultimately derived, termed $P_{perturb}$, addresses all these issues and is, mathematically, a probability in the formal sense. Intuitively, $P_{perturb}$ is the probability that a given interaction between two proteins will be qualitatively altered, i.e. in an experimentally detectable way, in a given disease.

Technically, we first derive the probability of a hypothetical “double experiment” in which the state of a protein complex (bound vs. unbound) is simultaneously measured before and after a mutation. The set of all possible outcomes of these double experiments constitutes what is known as a canonical ensemble. We consider the subset of states in this ensemble in which the pre- and post-mutation states differ and in which the change is localized to the mutated site. We compute the probability of this subset of the ensemble, and then take the expectation of this probability over all possible mutations in a given disease, estimated empirically. This expectation is the value of $P_{perturb}$.

In addition to providing intuitively interpretable semantics, $P_{perturb}$ also proves to be very useful. As described in the paper, ranking genes using this quantity allows us to fish out proteins involved in cancer which may serve as targets for therapeutic interventions.
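As a rough illustration of the double-experiment idea, here is a minimal Python sketch. It is a deliberate simplification and not the derivation from the paper. It assumes that the before and after measurements are independent and that each mutation carries an empirical weight, and it ignores both the protein context and the requirement that the change be localized to the mutated site. The names `MutationEffect`, `p_state_change`, and `p_perturb_sketch`, and all the numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutationEffect:
    """Hypothetical record for one mutation observed in a disease cohort."""
    p_bound_before: float  # probability the complex is bound before the mutation
    p_bound_after: float   # probability the complex is bound after the mutation
    frequency: float       # empirical weight of this mutation (assumed given)


def p_state_change(p_before, p_after):
    """Probability that the two measurements disagree (bound vs. unbound).

    Assumes the before and after measurements are independent, which is a
    simplification made for this sketch.
    """
    return p_before * (1.0 - p_after) + (1.0 - p_before) * p_after


def p_perturb_sketch(effects):
    """Frequency-weighted expectation of the state-change probability."""
    total = sum(e.frequency for e in effects)
    if total <= 0:
        raise ValueError("mutation frequencies must sum to a positive value")
    weighted = sum(
        e.frequency * p_state_change(e.p_bound_before, e.p_bound_after)
        for e in effects
    )
    return weighted / total


# Strong -> very strong: the bound/unbound outcome rarely changes.
strong_to_stronger = p_state_change(0.90, 0.99)
# Nearly absent -> strong: the outcome usually changes.
absent_to_strong = p_state_change(0.05, 0.90)

effects = [
    MutationEffect(0.90, 0.99, frequency=3.0),
    MutationEffect(0.05, 0.90, frequency=1.0),
]
print(strong_to_stronger, absent_to_strong, p_perturb_sketch(effects))
```

In this toy setting the two cases the box contrasts come out very differently: the strong-to-very-strong change has a low probability of flipping the observed state, while the absent-to-strong change has a high one.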

On the “practical” side, the method also seems to do something rather useful: find cancer proteins. Such proteins have the potential to serve as targets for drugs. Although cancer genome data provide information on which genes are mutated, most such mutations are “passengers”, there for the ride but inconsequential for cellular function. Mutations predicted by our model to be consequential were overwhelmingly more likely than other cancer mutations to occur in proteins already known to be cancer-causing or cancer-suppressing.

The model also sheds light on a sort of “dark matter” of genetic mutations, ones that are very infrequent and possibly patient-specific and thus nearly impossible to detect using statistical analysis alone. By analyzing this dark matter of genetic mutations, we discovered that on average, they are as likely as recurring mutations to cause disruptions in signaling networks. This suggests that they are severely understudied and must be examined using models that directly predict the functional consequences of mutations.

Above and beyond the specific findings, this work is one of the opening rounds in what I suspect will be a process of synthesis for structural and systems biology. As I mentioned at the beginning, structural biology studies the very small, the building blocks of all biological systems, while systems biology aims to make sense of the interactions of these building blocks. Although it may seem natural to marry the two, historically this has been difficult, as structural biology is primarily applied to individual molecules, making it difficult to scale across whole systems. Systems biology models, on the other hand, incorporate many components, but do so in a coarse-grained fashion that ignores most quantitative details. Such an approach can often yield very useful insights, but in biology the devil is very much in the details, and a synthesis of the precision of structural biology with the breadth of systems biology may open entirely new vistas.

This work represents a structural approach to systems biology because it uses structural information to build a model of SH2 domains that is then applied systems-wide, making predictions for all proteins involved in this signaling system. It is also a systems approach to structural biology because it uses what is known about the local neighborhood of a protein, i.e. its interaction partners in the network, to determine whether a mutation will have consequential effects on signaling. The system provides the context in which mutations occur, and this information is exploited.

So what next? Soon after starting graduate school, I found myself asking the question: if I knew the interaction partners of every protein in the cell, what would I do with that information? I didn’t have a good answer then, and I don’t have a good answer now. Back then, however, the question was premature. Now I am no longer so sure.