ProGen3 is a family of AI models for protein generation: billion-parameter language models trained on over 3.4 billion protein sequences. Our scientific preprint presents the first wet-lab evidence of scaling benefits for biological design. Our platform enabled the single-shot design of antibodies, termed OpenAntibodies, against 20 drug targets whose approved therapeutics have treated 7 million patients and generated $660B in historical sales, as well as the development of an ultra-compact gene editor. Partners can work with us by licensing our assets, collaborating on new proteins, or joining the early access program for our models.
Proteins are the functional unit of life at the molecular scale and are responsible for a wide variety of activities, from catalysis of biochemical reactions to recognition of foreign pathogens. A protein can be described by a sequence of amino acid building blocks — of which there are 20 possible options — that fold up into a three-dimensional structure and perform some biological activity. In this way, a protein’s sequence describes both its structure and the function that it will carry out.
The challenge of protein design is to devise new sequences of amino acids to perform functions that have not emerged through evolution, such as treatments for diseases or ultra-stable industrial enzymes. This design space is astronomically large — if we limit ourselves to just short proteins of 100 amino acids, we are left to sift through more possibilities than the number of atoms in the universe. Clearly, we need another strategy.
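To make that scale concrete: with 20 amino acids at each of 100 positions there are 20^100 (roughly 10^130) possible sequences, versus an estimated ~10^80 atoms in the observable universe. A quick back-of-the-envelope check:

```python
import math

AMINO_ACIDS = 20          # standard amino acid alphabet
LENGTH = 100              # a "short" protein, as in the example above

# Number of distinct sequences of this length
num_sequences = AMINO_ACIDS ** LENGTH

# Commonly cited estimate of the number of atoms in the observable universe
ATOMS_IN_UNIVERSE = 10 ** 80

print(f"Possible 100-residue proteins: ~10^{math.log10(num_sequences):.0f}")
print(f"Atoms in the observable universe: ~10^{math.log10(ATOMS_IN_UNIVERSE):.0f}")
print(f"Ratio: ~10^{math.log10(num_sequences / ATOMS_IN_UNIVERSE):.0f}")
```

Even exhaustively testing one sequence per atom in the universe would cover only a vanishing fraction of this space.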
At Profluent, we build language models that learn directly from the vast collection of sequences that evolution has produced in order to design new useful proteins. The behavior of these models is governed by a set of scaling laws that forecast how their capabilities improve with greater capacity and exposure to more data. Just as in natural language, where scaling allowed models to graduate from generating short strings of barely coherent text to rich, thoughtful narratives, we have shown the first real-world evidence that scaling laws exist for protein design.
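The post does not spell out the functional form of these scaling laws, but in the language-modeling literature they are typically expressed as a power law in model parameters and training tokens. The sketch below shows that general parameterization only; the coefficients would be fit to actual training runs, and nothing here represents ProGen3's specific fit:

```python
def scaling_law_loss(n_params: float, n_tokens: float,
                     A: float, B: float, E: float,
                     alpha: float, beta: float) -> float:
    """Chinchilla-style loss estimate: an irreducible term E plus terms
    that shrink as model size (n_params) and data size (n_tokens) grow.

    A, B, E, alpha, beta are fit to observed training runs (for example
    with scipy.optimize.curve_fit); no values are given in this post,
    so this illustrates only the shape of the relationship.
    """
    return E + A / n_params**alpha + B / n_tokens**beta
```

Once fit, a law of this shape lets you forecast how much better a larger model trained on more data should get, and how to trade off parameters against tokens for a fixed compute budget.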
This is not purely an academic endeavor. We ultimately evaluate our progress at Profluent based on our ability to create positive value for society. Scaling protein language models enabled a capability jump from generating model enzymes like lysozymes to designing complex, highly functional genome editors like OpenCRISPR. We are excited to introduce ProGen3 — the next step in this journey.
ProGen3 is a frontier suite of generative language models for protein design. It allows users to not only generate novel full-length proteins, but also redesign specific domains of an existing protein for improved function. It leverages a sparse architecture to achieve a 4x speedup without sacrificing modeling performance. To train ProGen3, we assembled the Profluent Protein Atlas v1 (PPA-1), a carefully curated resource of 3.4B full-length proteins and 1.1T amino acid tokens — the most expansive high-quality protein dataset compiled to date. We optimized PPA-1 for training language models, and we used it to optimally scale ProGen3 up to a 46B parameter model trained on 1.5T tokens (Fig. 1).
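The post does not specify the sparsity mechanism behind that speedup. A common way to build a sparse language model is a mixture-of-experts (MoE) feed-forward layer, where a router sends each token to only a few experts so most parameters stay idle for any given token. The NumPy sketch below is a generic illustration of that idea, not a description of ProGen3's internals:

```python
import numpy as np

def moe_ffn(x, router_w, experts, top_k=2):
    """Sparse mixture-of-experts feed-forward pass (illustrative only).

    x        : (n_tokens, d_model) token activations
    router_w : (d_model, n_experts) router weights
    experts  : list of (w_in, w_out) weight pairs, one per expert FFN
    top_k    : number of experts each token is routed to
    """
    logits = x @ router_w                                  # (n_tokens, n_experts)
    top_experts = np.argsort(-logits, axis=1)[:, :top_k]   # chosen experts per token

    # Softmax over only the selected experts' logits
    top_logits = np.take_along_axis(logits, top_experts, axis=1)
    gates = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)

    out = np.zeros_like(x)
    for e, (w_in, w_out) in enumerate(experts):
        for slot in range(top_k):
            mask = top_experts[:, slot] == e               # tokens routed to expert e
            if mask.any():
                h = np.maximum(x[mask] @ w_in, 0.0)        # expert FFN with ReLU
                out[mask] += gates[mask, slot][:, None] * (h @ w_out)
    return out

# Tiny usage example with random weights: only 2 of 8 experts run per token
rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 64, 256, 8
x = rng.normal(size=(10, d_model))
router_w = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
           for _ in range(n_experts)]
y = moe_ffn(x, router_w, experts)
```

The appeal of this kind of sparsity is that total parameter count (and therefore capacity) can grow much faster than the compute spent per token.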
Next, we asked how a ProGen3 model’s scale impacts its ability to generate real proteins. Larger models like ProGen3-46B routinely generated proteins from families that smaller models had no knowledge of, but not vice versa. We validated our findings in the wet lab and found that generated proteins from all models typically expressed at levels comparable to naturally occurring proteins from the same family (Fig. 2). However, ProGen3-46B generated 59% more unique proteins than ProGen3-3B and 198% more than ProGen3-339M, as measured by the number of generations that remain unique at 30% sequence identity. This suggests that as models grow larger, they more faithfully represent the biological principles underlying a much wider diversity of life.
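One way to read "unique at 30% identity" is as the number of cluster representatives left after greedily clustering the generated sequences at a 30% identity threshold. In practice this is done with dedicated tools such as MMseqs2 or CD-HIT; the naive sketch below assumes ungapped, position-wise identity purely to show the counting logic:

```python
def percent_identity(a: str, b: str) -> float:
    """Naive position-wise identity between two sequences.

    Real pipelines align sequences first (e.g., with MMseqs2 or BLAST);
    this simplification just compares overlapping positions.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a[:n], b[:n]))
    return 100.0 * matches / max(len(a), len(b))

def count_unique(sequences, id_threshold=30.0):
    """Greedy clustering: a sequence counts as 'unique' if it falls below
    the identity threshold against every representative kept so far."""
    representatives = []
    for seq in sequences:
        if all(percent_identity(seq, rep) < id_threshold for rep in representatives):
            representatives.append(seq)
    return len(representatives)

# e.g., compute count_unique(generated_seqs) per model size and compare counts
```

Counting diversity this way rewards models that generate genuinely distinct proteins rather than many near-duplicates of the same design.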
The more general representations learned by larger models also make them better tools for real-world design. Off the shelf, a pre-trained protein language model can generate useful proteins, but it may not be optimized for the exact properties that a scientist is designing for. However, we can use a limited amount of laboratory data to align ProGen3 with properties like activity, expression, stability, and binding affinity. While alignment can improve models of any scale, larger models reap the greatest benefits, with ProGen3-46B’s correlation with experimentally measured protein fitness improving from 33.1% to 67.3% (Fig. 3). We can thus leverage laboratory data to iteratively refine ProGen3’s ability to satisfy specific design goals.
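The "correlation with experimentally measured protein fitness" can be checked by scoring each assayed variant with the model and correlating those scores with the lab measurements. The post does not say which statistic or scoring function was used; Spearman rank correlation on per-sequence model scores is a common choice in the field, sketched here under that assumption:

```python
from scipy.stats import spearmanr

def fitness_correlation(model_scores, measured_fitness):
    """Rank correlation between model scores (e.g., per-sequence
    log-likelihoods) and experimentally measured fitness values.

    Both inputs are equal-length sequences, one entry per variant.
    Returns Spearman's rho in percent, matching the 33.1% -> 67.3%
    framing in the text (assuming that is how those numbers are reported).
    """
    rho, _p_value = spearmanr(model_scores, measured_fitness)
    return 100.0 * rho

# Usage: score the same held-out assay data with the pre-trained model and
# the aligned model, then compare the two correlations to quantify the gain.
```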
Over the past two decades, antibodies have become a crucial class of therapeutics for a wide range of diseases. However, discovering, engineering, and optimizing therapeutic antibodies is a time-intensive, costly process that typically involves animal immunization and/or multiple rounds of experimental screening. We wanted to test whether our protein design platform could generate antibodies in a single shot that rival approved therapeutics along multiple attributes. For this initiative, termed OpenAntibodies, we selected 20 distinct targets for which approved drugs have collectively treated 7 million patients and yielded $660 billion in sales (Fig. 4). For each target, we generated antibodies that are computationally predicted to bind precisely the same epitope as the approved therapeutic yet constitute a different composition of matter. The designs shared a median sequence identity of at most 80% with any known binder against the same target, and every design featured amino acid differences in every complementarity-determining region (CDR) loop. We are working toward releasing these antibodies, which will be made available for royalty-free or upfront-free licensing.
Going beyond computational evaluation, we experimentally tested our antibody designs against the CD38 and PKal targets across multiple attributes, ranging from binding to developability properties. Many of our designs not only matched the affinity of highly optimized therapeutics against the same epitopes but also showed considerably improved developability (Fig. 5). By contrast, traditional approaches often improve one attribute at the expense of another.
The designed antibodies are dissimilar from their therapeutic counterparts across the entire variable domain, including the CDRs. For context, even a single mutation in a CDR loop can altogether ablate binding. Because of this sensitivity, leading approaches are limited to non-CDR mutations and strive to stay within a couple of mutations of a parent sequence (>98% identity).
These results demonstrate the ability of our platform to design high-quality antibody candidates for a wide variety of potential drug targets. This previously unseen ability of our models to navigate sequence and fitness landscapes extends even to the highly sensitive interactions exemplified by antibody binding interfaces. With continued scaling, we anticipate emergent capabilities that will sweep the antibody field.
Genome editing technologies are poised to transform medicine and agriculture, largely through repurposing of natural defense systems like CRISPR. However, while the simplicity and robustness of these systems have led to wide adoption, several key challenges stand in the way of different applications. One such challenge is the size of these systems: for example, the Cas9 nuclease from S. pyogenes consists of 1,368 residues and requires a roughly 100-nucleotide guide RNA, together already approaching the packaging limit of a single AAV delivery vector.
Motivated by these shortcomings, we designed a multitude of programmable gene editors that are highly compact, with as few as 592 residues, and demonstrate functional performance in the wet laboratory (Fig. 6). Unlike legacy CRISPR-Cas systems, these compact proteins can be combined with other effectors and tissue-specific promoters in a single AAV to address previously unaddressable targets. We have line of sight on additional challenges in gene editing that our platform can unlock, from curing disease to practical problems across multiple industries.
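To put the size numbers in perspective, a protein-coding sequence needs three nucleotides per residue, and a single AAV vector packages roughly 4.7 kb. The back-of-the-envelope comparison below uses the residue counts from the text; the AAV limit, guide-RNA length, and regulatory-element size are ballpark assumptions, not design specs:

```python
AAV_CAPACITY_NT = 4700        # approximate single-AAV packaging limit (assumption)
NT_PER_RESIDUE = 3            # one codon per amino acid

def cassette_size(n_residues, guide_rna_nt=100, regulatory_nt=500):
    """Rough size of an editor expression cassette in nucleotides.

    regulatory_nt lumps together promoter, polyA, and guide-RNA promoter
    sequences; 500 nt is an illustrative ballpark only.
    """
    return n_residues * NT_PER_RESIDUE + guide_rna_nt + regulatory_nt

spcas9 = cassette_size(1368)      # S. pyogenes Cas9, per the numbers above
compact = cassette_size(592)      # compact editor, per the numbers above

print(f"SpCas9 cassette:  ~{spcas9} nt (headroom: {AAV_CAPACITY_NT - spcas9} nt)")
print(f"Compact cassette: ~{compact} nt (headroom: {AAV_CAPACITY_NT - compact} nt)")
```

With these ballpark figures, the SpCas9 cassette essentially fills the vector, while a 592-residue editor leaves over 2 kb free for additional effectors or tissue-specific promoters.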
Profluent is working with leaders in therapeutics, agriculture, and biomanufacturing on applications of its AI-designed proteins. Partners can access the company’s technology in the following ways:
1 — Molecules: Easy licensing of our proteins or a strategic partnership to build bespoke solutions. [Access Here]
2 — Models: Early access program for our API, which allows select partners to customize our best foundation models with their own data or for specific use cases. [Access Here]