Apr 22, 2024
By Aadyot Bhatnagar, Stephen Nayfach, Joe Gallagher, and Jeff Ruffolo
TL;DR
In this study, we demonstrate the world’s first precision gene editing using molecules designed from scratch with AI. Gene editors are complex systems requiring intricate spatial and temporal interactions between multi-domain proteins, DNA, and RNA. Designing a functionally differentiated gene editor using AI represents a major leap in the blossoming field of AI-driven biological design.
To address these challenges and design novel gene editors, we trained large language models (LLMs) on the most extensive dataset of diverse CRISPR-based gene editing systems gathered to date. We found that the proteins generated by these models expand the diversity of virtually all naturally occurring CRISPR-Cas families by 4.8-fold, and that we could continue rapidly increasing this diversity at will. Next, we focused on CRISPR-Cas9 systems due to their wide adoption, including spurring a Nobel Prize and recently receiving FDA approval as a novel therapeutic modality. In human cells, our computationally designed gene editors showed comparable or improved activity and specificity relative to SpCas9, an exemplar gene editor, while being more than 400 mutations distant.
We hereby publicly release OpenCRISPR-1, a highly performant AI-generated gene editor, to facilitate broad, ethical usage across research and commercial applications. We aim to advance innovation and development in the gene editing community, to bring new treatments to patients with major unmet needs.
Preliminaries
Until now, the protein engineering community has often relied on discovery-based approaches: copying a functional protein from nature, iteratively modifying one through a process called directed evolution, or both. Many of the transformative proteins in our society were found through serendipitous discovery; for example, insulin in dogs, Cas9 in a yogurt facility, and Botox® in a food poisoning incident. Large generative protein language models capture the underlying blueprint of what makes a natural protein functional. They promise a shortcut to bypass the random process of evolution and move us towards intentionally designing proteins for a specific purpose.
The core component of CRISPR-Cas9 gene editing systems is the Cas9 protein, which is an RNA-guided nuclease that can search through all 3 billion nucleotides in the human genome and cut it at just one specific site. This nuclease complexes together with a single guide RNA (sgRNA) that consists of a scaffold which interacts structurally with the protein, and a spacer sequence that can be programmed to target any site in the genome.
Given that most Cas9 proteins are over 1000 amino acids long, the overall design space contains at least 20^1000 possible sequences, which is many orders of magnitude more than the number of atoms in the observable universe. At the same time, because these proteins must orchestrate many interactions in a precise order to achieve accurate cutting, even a single misplaced mutation can completely abolish protein function. It would take many lifetimes to explore all possible sequence variations experimentally, yet in a matter of hours, AI systems can navigate this search space to discover functional gene editors.
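To make the size of this design space concrete, the short sketch below works out the order of magnitude of 20^1000. The function name is ours, and the figure of roughly 10^80 atoms in the observable universe is a commonly cited rough estimate that we assume here for comparison; it does not come from this study.

```python
import math

# Assumption: ~10^80 atoms in the observable universe (a commonly cited
# rough estimate, used here only for scale).
ATOMS_IN_UNIVERSE_LOG10 = 80


def design_space_log10(length: int, alphabet_size: int = 20) -> float:
    """Return log10 of the number of sequences of a given length.

    alphabet_size ** length is too large to handle as a float,
    so we work in log space: log10(a ** n) = n * log10(a).
    """
    return length * math.log10(alphabet_size)


cas9_log10 = design_space_log10(1000)
print(f"20^1000 is about 10^{cas9_log10:.0f}")
print(f"That is about 10^{cas9_log10 - ATOMS_IN_UNIVERSE_LOG10:.0f} "
      "times the assumed number of atoms in the universe")

# Python integers are arbitrary precision, so the exact digit count
# can be checked directly as well.
digits = len(str(20 ** 1000))
print(f"20^1000 has {digits} decimal digits")
```

Working in log space keeps the comparison readable: the design space for a 1000-residue protein has an exponent more than sixteen times larger than the assumed exponent for atoms in the universe.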
Language models generate diverse CRISPR-Cas proteins
Generative protein language models are typically pre-trained on large, diverse datasets of natural protein sequences that span a wide range of functions. They can generate realistic protein sequences that reflect the properties of natural proteins. However, for specific applications, such as the generation of novel gene editors, we need to steer generation towards particular protein families of interest.
To this end, we performed exhaustive data mining to construct, to our knowledge, the most extensive dataset of CRISPR systems curated to date. We refer to this resource as the CRISPR-Cas Atlas. All told, we uncovered 5.1 million CRISPR-Cas proteins, expanding the known natural diversity of these systems by 2.7-fold overall, and 4.1-fold for Cas9 specifically.
To generate novel CRISPR-Cas proteins, we then trained a protein language model on the CRISPR-Cas Atlas. We generated 4 million sequences from this model and used bioinformatic techniques to remove degenerate sequences and identify which CRISPR-Cas family each generated protein belongs to. This filtered set of generated sequences represents a 4.8-fold expansion of diversity compared to natural proteins found in the CRISPR-Cas Atlas. We fully expect that generating more sequences would expand this diversity even further.
Generated sequences greatly expand the diversity across CRISPR-associated protein families, as measured by the number of protein clusters. The heat-map indicates how often each protein family is found in different types of CRISPR-Cas systems (e.g. Cas9 is exclusively found in Type II CRISPR-Cas systems).
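As a rough illustration of what measuring diversity by the number of protein clusters can look like, here is a minimal sketch on toy data. Everything in it is an assumption made for illustration: the position-wise identity function, the 0.7 threshold, the greedy clustering strategy, the toy sequences, and the definition of fold expansion as combined clusters divided by natural clusters. It is not the pipeline used in this study; real analyses rely on alignment-based clustering tools.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions.

    Toy stand-in for alignment-based sequence identity; sequences of
    different lengths are treated as unrelated.
    """
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: list[str], threshold: float = 0.7) -> list[str]:
    """Return one representative per cluster.

    Each sequence joins the first existing representative it matches at
    or above the threshold; otherwise it founds a new cluster.
    """
    representatives: list[str] = []
    for seq in seqs:
        if not any(identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return representatives


# Hypothetical toy sequences, not real Cas proteins.
natural = ["MKRLAT", "MKRLAS", "GHTWPQ"]
generated = ["MKRLAT", "AVCDEF", "AVCDEY", "LLNNQQ"]

natural_clusters = greedy_cluster(natural)
combined_clusters = greedy_cluster(natural + generated)

# Assumed definition: clusters after adding generated sequences,
# relative to clusters among natural sequences alone.
fold_expansion = len(combined_clusters) / len(natural_clusters)
print(f"{len(natural_clusters)} natural clusters, "
      f"{len(combined_clusters)} combined clusters, "
      f"{fold_expansion:.1f}-fold expansion")
```

Counting clusters rather than raw sequences means that near-duplicates of existing proteins add nothing to the diversity measure, which is why a generated copy of a natural sequence does not change the count in this toy example.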
Generated gene editors are functional in human cells
We further narrowed our focus to CRISPR-Cas9 systems and trained a protein language model on the 238,917 Cas9 proteins in the CRISPR-Cas Atlas. Given the wide adoption and clinical success of SpCas9, we used our models to generate Cas9-like proteins that are interoperable with SpCas9. In other words, they bind to the same parts of the genome (the PAM) and are compatible with the same sgRNA; therefore, they can be used for the same applications.
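To illustrate what PAM compatibility means in practice, the sketch below scans a DNA string for candidate SpCas9 target sites. It assumes the standard SpCas9 conventions, which are not stated in this post: a 20-nucleotide protospacer immediately followed by an NGG PAM. The function name and example sequence are hypothetical, and only the forward strand is searched.

```python
import re


def find_ngg_sites(dna: str, spacer_len: int = 20) -> list[tuple[int, str, str]]:
    """Find candidate target sites on the forward strand.

    Returns (start, protospacer, pam) tuples, where the protospacer is
    spacer_len nucleotides long and is immediately followed by an NGG PAM.
    Assumes the standard SpCas9 PAM; the reverse strand is not searched.
    """
    dna = dna.upper()
    # A lookahead is used so that overlapping sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % spacer_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(dna)]


# Hypothetical example: a 20-nt protospacer followed by a TGG PAM.
example = "A" * 20 + "TGGCC"
for start, protospacer, pam in find_ngg_sites(example):
    print(start, protospacer, pam)
```

An editor that recognizes the same PAM and accepts the same sgRNA can reuse any target site found this way, which is the sense in which the generated proteins are interoperable with SpCas9.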
We selected 48 of these generated sequences for rigorous functional characterization in human cells. Our top hit OpenCRISPR-1 had comparable activity to SpCas9 at on-target sites (55.7% editing for OpenCRISPR-1 vs. 48.3% for SpCas9), but strikingly had a 95% reduction in editing at off-target sites (0.32% editing for OpenCRISPR-1 vs. 6.1% for SpCas9). Moreover, OpenCRISPR-1 is a highly novel protein: it is 403 mutations away from SpCas9 and 182 mutations away from any natural protein in the CRISPR-Cas Atlas.
Multiple generated nucleases (green), including OpenCRISPR-1 (dark green), have on-target activity comparable to or higher than that of SpCas9 (blue), but much lower off-target activity.
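The 95% figure follows directly from the reported off-target editing rates. The short sketch below reproduces the arithmetic from the numbers quoted above; the function name is ours.

```python
def relative_reduction(baseline: float, treated: float) -> float:
    """Fractional reduction of `treated` relative to `baseline`."""
    return (baseline - treated) / baseline


# Editing rates (percent) as reported above.
spcas9_on, opencrispr_on = 48.3, 55.7
spcas9_off, opencrispr_off = 6.1, 0.32

off_target_reduction = relative_reduction(spcas9_off, opencrispr_off)
on_target_ratio = opencrispr_on / spcas9_on

print(f"Off-target reduction: {off_target_reduction:.1%}")
print(f"On-target activity relative to SpCas9: {on_target_ratio:.2f}x")
```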
Next, we showed that, when paired with a deaminase, OpenCRISPR-1 and SpCas9 had similar activity and specificity in precisely editing a single base in a target genome. Moreover, we were able to maintain base editing activity while improving specificity by using deaminases generated by another Profluent-trained protein language model.
OpenCRISPR-1 functions very similarly to SpCas9 when used for base editing with ABE8.20, a highly active engineered deaminase, as well as our generated deaminases PF-DEAM-1 and PF-DEAM-2.
Finally, to further optimize the activity of our generated nucleases, we also trained a model to generate a compatible sgRNA for any given Cas9-like protein. Compared to SpCas9’s sgRNA, we found that these generated sgRNAs could improve the activity of generated nucleases for four of the five proteins tested.
For 4 of the 5 generated nucleases tested, using a model-generated sgRNA improved editing efficiency.
Discussion
We demonstrate the world’s first successful editing of the human genome using a gene editing system where every component is fully designed by AI. Our most performant AI-generated editor, OpenCRISPR-1, achieves activity similar to and specificity higher than SpCas9, an exemplar gene editor, while being highly dissimilar in sequence. Moreover, our platform is capable of generating many more gene editing systems at will; OpenCRISPR-1 is just the tip of the iceberg. We publicly release OpenCRISPR-1 to facilitate broad, ethical usage across research and commercial applications. In making this molecule available to the broader community, we hope to lower the costs and barriers to entry for therapeutic, agricultural, and scientific applications of CRISPR-based technologies.