Generating Proteins with Atomic-level Control
TL;DR
In this work, we describe proseLM, a method for steering protein language models for design tasks by providing explicit structural and functional information. ProseLM leverages compute-efficient adaptation of large language models to enable conditional generation of sequences given atomic coordinates of the protein backbone and surrounding molecules.
Through in silico validation, we show that proseLM effectively incorporates structural and functional context into its designs, yielding sequences that are confidently predicted to recapitulate the desired three-dimensional structure while recovering critical residues responsible for functional activity. In the lab, we use proseLM to generate optimized nucleases and base editors for genome editing applications, and find that our designs exceed the activity of the starting enzymes. We further validate proseLM for therapeutic antibody design, where we demonstrate affinity maturation and sequence diversification for clinically approved, highly optimized drugs.
To facilitate continued advances in functional protein design for the betterment of society, we release proseLM for non-commercial usage. Through continued development of proseLM and related technologies, we aim to generate solutions to pressing challenges across biomedicine, agriculture, and industrial domains.
Preliminaries
Proteins are the functional agents of cellular life, responsible for critical activities ranging from catalysis of chemical reactions to supporting cellular structure. Composed of linear chains of amino acids, proteins fold into diverse three-dimensional structures that enable these wide-ranging functions. Protein design aims to repurpose these versatile molecular agents to carry out useful behaviors in medicine and biotechnology.
Protein design is like finding a needle in a haystack: the challenge is to find a precise ordering of amino acids (with twenty choices at each position) that will fold up and perform a function of interest. For a small protein consisting of 100 amino acids, this amounts to identifying specific sequences among the 20^100 (roughly 10^130) possibilities. Existing approaches typically leverage either structural or evolutionary constraints to narrow this vast design space. For structure-based methods, design is often framed as a fixed-backbone sequence optimization problem, where a protein structure is given and the objective is to find a sequence that folds into the structure. By contrast, evolutionary approaches leverage the patterns in natural sequences evolved for similar functions to define a set of soft constraints on the design space.
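To make the scale of this search space concrete, the short sketch below works through the arithmetic for a 100-residue protein. The function names are ours, chosen only for illustration.

```python
import math

NUM_AMINO_ACIDS = 20  # the twenty standard amino acids


def design_space_size(length: int) -> int:
    """Number of distinct sequences of the given length (exact integer)."""
    return NUM_AMINO_ACIDS ** length


def log10_design_space(length: int) -> float:
    """Base-10 logarithm of the design space size, handy for very long proteins."""
    return length * math.log10(NUM_AMINO_ACIDS)


if __name__ == "__main__":
    size = design_space_size(100)
    # Python integers are arbitrary precision, so the exact value is available.
    print(f"20^100 has {len(str(size))} digits")
    print(f"log10(20^100) = {log10_design_space(100):.3f}")
```

Even at one sequence per nanosecond, enumerating a space of this size is out of reach, which is why design methods lean on structural or evolutionary constraints.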
Protein language models, which aim to learn the general rules of proteins from large databases of natural sequences, fit into the evolutionary category. These models can leverage this broad understanding to generate diverse protein sequences reflecting the natural distribution of all proteins, or be fine-tuned on a curated set of proteins to narrow their scope for design tasks. This is the approach we have taken for the design of CRISPR-Cas proteins, yielding OpenCRISPR-1.
Giving structure to language
While fine-tuning on natural examples is an effective way to design proteins with language models, the process introduces several constraints. First, the entire strategy is predicated on being able to assemble enough natural examples of the function of interest to fine-tune a model. Relatedly, fine-tuning works best when we want to design a protein that has a function nature has already explored, but is more challenging when we want to deviate from or go beyond nature. Finally, while the coarse constraints of evolution can be useful in some cases (such as for designing complex molecular machines like Cas9), in other cases it is necessary to provide atomic-level detail about what we want our proteins to do.
To tackle these problems, we have developed proseLM (protein structure-encoded language model). ProseLM unites the evolutionary understanding of protein function from language models with the atomistic detail of structure-based design. This is achieved by introducing structural information into a pre-trained language model through a set of added layers, called adapters. Importantly, these adapter layers have very few parameters (millions) compared to the language model (billions), making it efficient to train and run models like proseLM.
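As a rough illustration of why adapters are cheap relative to the base model, the sketch below compares parameter counts. Everything specific here is an assumption made for illustration and not a description of proseLM's actual architecture: we assume a standard transformer layer with about 12 * d^2 weights (attention plus a 4x-wide feed-forward block), a simple bottleneck adapter (down-projection, up-projection, biases) in every layer, and hypothetical sizes for the hidden dimension, bottleneck width, and layer count.

```python
def base_params(hidden_dim: int, num_layers: int) -> int:
    """Approximate weight count of a transformer stack.

    Uses the common 12 * d^2 per-layer estimate (4 * d^2 for attention
    projections, 8 * d^2 for a feed-forward block with 4x expansion).
    Embeddings, biases, and layer norms are ignored.
    """
    return 12 * hidden_dim * hidden_dim * num_layers


def adapter_params(hidden_dim: int, bottleneck_dim: int, num_layers: int) -> int:
    """Weight count for one hypothetical bottleneck adapter per layer."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim  # down-projection + bias
    up = bottleneck_dim * hidden_dim + hidden_dim        # up-projection + bias
    return (down + up) * num_layers


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the orders of magnitude.
    d, r, layers = 4096, 64, 32
    base = base_params(d, layers)
    extra = adapter_params(d, r, layers)
    print(f"base model: ~{base / 1e9:.1f}B parameters")
    print(f"adapters:   ~{extra / 1e6:.1f}M parameters ({100 * extra / base:.2f}% of base)")
```

Under these assumed sizes the adapters add only a fraction of a percent to the parameter count, consistent with the millions-versus-billions contrast described above.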
We encode the protein backbone structure as a graph, where each amino acid residue is a node that exchanges information with its nearest neighbors in three-dimensional space. Additionally, we incorporate functional information about other proteins, small-molecule ligands, nucleic acids, and ions that our designed protein should interact with by extending the graph to encode arbitrary sets of atoms.
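The nearest-neighbor graph described above can be sketched in a few lines. This is a minimal illustration, not the proseLM featurization: we assume each node is summarized by a single 3D coordinate (for example, one representative backbone atom per residue), and the neighbor count k is a free parameter we picked arbitrarily. Context atoms from ligands, nucleic acids, or ions could be appended to the same coordinate list as extra nodes.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def knn_graph(coords: Sequence[Coord], k: int) -> Dict[int, List[int]]:
    """Return, for each node index, the indices of its k nearest neighbors.

    Distances are Euclidean; a node is never its own neighbor. Ties are
    broken by the lower node index.
    """
    neighbors: Dict[int, List[int]] = {}
    for i, ci in enumerate(coords):
        by_distance = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i
        )
        neighbors[i] = [j for _, j in by_distance[:k]]
    return neighbors


if __name__ == "__main__":
    # Four toy nodes on a line; the last one is far from the rest.
    toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    for node, nbrs in knn_graph(toy_coords, k=2).items():
        print(node, "->", nbrs)
```

The brute-force search here is quadratic in the number of nodes, which is acceptable for a sketch; a spatial index would be the usual choice for large complexes.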
To test proseLM, we first measured how providing varying levels of functional context to the model affects the sequences it designs. As a generative model, proseLM is capable of producing many more sequences than we can analyze individually. Instead, we can look at the perplexity, which measures how likely a given sequence (in this case, the natural sequence) is according to the model; a lower perplexity means the model considers the sequence more likely. Encouragingly, as we provide increasing levels of information (first protein-protein interactions, then all interactions) to proseLM, the model assigns lower (better) perplexities to the natural sequences. We also found that larger proseLM models, which have better learned the evolutionary information from sequence-only pre-training, performed better than smaller models.
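For readers unfamiliar with perplexity, here is a minimal sketch of the standard definition: the exponential of the average negative log-probability the model assigns to each residue of the sequence. The input list of per-residue probabilities is a stand-in for model output, and the function name is ours.

```python
import math
from typing import Sequence


def perplexity(residue_probs: Sequence[float]) -> float:
    """Perplexity of a sequence given the probability the model assigned
    to the observed residue at each position.

    perplexity = exp( -(1/L) * sum_i log p_i )
    """
    if not residue_probs:
        raise ValueError("need at least one residue probability")
    mean_nll = -sum(math.log(p) for p in residue_probs) / len(residue_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that guesses uniformly over 20 amino acids at every position.
    print(perplexity([1 / 20] * 50))
    # A model that is more confident about the natural residues.
    print(perplexity([0.5] * 50))
```

A useful reference point: a model that guesses uniformly among the twenty amino acids has a perplexity of 20, and a model that always assigns probability 1 to the natural residue has a perplexity of 1.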
In a similar test, we next designed single sequences for a set of structures through a greedy decoding strategy, wherein we took the most likely amino acid at each position according to the model. When we measured what percentage of the residues match the natural sequence (recovery), we again found that providing functional context was beneficial and that larger models were more capable protein designers.
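Greedy decoding and sequence recovery can be sketched as follows. We assume an autoregressive interface in which a callable returns a probability for each amino acid given the residues chosen so far; `next_residue_probs` and the toy model below are hypothetical stand-ins, not the proseLM API.

```python
from typing import Callable, Dict


def greedy_decode(next_residue_probs: Callable[[str], Dict[str, float]], length: int) -> str:
    """Build a sequence by taking the most likely amino acid at each position.

    `next_residue_probs(prefix)` is a hypothetical model interface returning
    a mapping from amino acid letter to probability, given the prefix so far.
    """
    residues = []
    for _ in range(length):
        probs = next_residue_probs("".join(residues))
        residues.append(max(probs, key=probs.get))
    return "".join(residues)


def recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue matches the native one."""
    if len(designed) != len(native) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def toy_model(prefix: str) -> Dict[str, float]:
    """Toy stand-in: prefers A at even positions and G at odd positions."""
    if len(prefix) % 2 == 0:
        return {"A": 0.6, "G": 0.4}
    return {"A": 0.1, "G": 0.9}


if __name__ == "__main__":
    design = greedy_decode(toy_model, length=4)
    print(design, recovery(design, "AGAA"))
```

Greedy decoding yields a single deterministic design per structure, which makes recovery easy to compare across models; sampling would instead produce a diverse set of candidates.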
Designing functional proteins
The ultimate test of a protein design method is how well it produces sequences that work in the lab. Even a single misplaced amino acid can disrupt protein folding, and mutations at functionally important sites in the protein are particularly sensitive to errors. To evaluate proseLM, we chose two design areas that are broadly relevant across biomedical, agricultural, and research settings: genome editors and antibodies.
For genome editing, we started by exploring the local functional space near SpCas9, an RNA-guided endonuclease. We chose SpCas9 as a first test case due to the complexity of its functional behavior, which requires a series of highly coordinated steps that ultimately result in the double-stranded cleavage of DNA. By providing proseLM with the structures of two functional states, as well as additional evolutionary and mutational data, we were able to design sequences with ~60 mutations that exhibited changes in the editing profile, including significant increases in on-target editing for some targets.
We next turned to base editors, which are a fusion of a deaminase domain (for editing) to a catalytically deactivated Cas9 nuclease (for targeting). As a starting point, we used a low-activity deaminase that we previously generated using language models. We focused proseLM on redesigning either the deaminase active site (near the base to be edited) or peripheral positions (farther away from the base). Both strategies yielded designs with nearly 50% higher A-to-G editing efficiency than the starting deaminase sequence, approaching state-of-the-art editors from directed evolution with just a single round of optimization.
Antibodies are another class of broadly useful functional proteins, which excel at binding targets with high specificity and affinity. Therapeutic antibodies in particular bind their targets exceptionally tightly, honed through a process called affinity maturation that is performed naturally by the immune system or in the lab through rounds of evolution. To test proseLM on this class of proteins, we first considered nivolumab, which is a cancer immunotherapeutic that binds the PD-1 protein. Similar to our strategy for base editors, we focused design on residues near the binding site (CDR loops) or scaffolding positions (framework). With both approaches, we found designs that increased affinity by nearly 3-fold compared to nivolumab.
As a final validation of proseLM for antibody design, we considered secukinumab, which binds IL-17A through an extended loop that makes extensive contacts with the target. These numerous contacts are challenging because each presents an opportunity to break the binding interface, resulting in a non-functional antibody. However, when we allowed proseLM to redesign the entire sequence of secukinumab, we found two designs that retained binding despite having 18 and 31 mutations across the antibody, including many at the interface.
Conclusion
ProseLM is a method for functional protein design that unites the evolutionary understanding of language models with the atomistic control of structure-based approaches. Through in silico validation, we showed that proseLM effectively incorporates structural and functional information into its generated sequences, an ability that improves with increased model scale. We validated proseLM on a series of challenging functional protein design tasks, including genome editing and antibody design.