Reprogramming CRISPR Systems for Customized Genome Editing
TL;DR
CRISPR-Cas systems have been adapted for a diverse array of gene editing applications. One crucial component of CRISPR-Cas systems is the Protospacer-Adjacent Motif, or PAM, a small DNA sequence that Cas proteins must recognize to target specific sites in the genome. While this feature is essential for accuracy, it limits CRISPR’s flexibility by restricting which DNA regions can be edited. But what if we could expand these systems’ potential by reprogramming them to recognize new PAMs?
In this study, we introduce Protein2PAM, a protein language model that predicts the PAM specificity of a unique Cas protein sequence. Protein2PAM is 4 times more sensitive and 500 times faster than existing state-of-the-art bioinformatic methods, which predict the PAM by aligning CRISPR spacers to large databases of viral genomes. As a proof of concept for PAM customization, we used Protein2PAM to introduce mutations that either broadened or altered the PAM specificity of Nme1Cas9, a compact CRISPR-Cas9 protein previously used in genome editing. Experiments in human cell lysate confirmed that our engineered proteins had altered PAMs and were more than 50 times more active than Nme1Cas9. These results highlight the model’s ability to customize PAM specificity without the need for lab-generated training data.
To promote advances in genome editing, we have made our models freely available for non-commercial use via the Protein2PAM web server: https://protein2pam.profluent.bio. Going forward, we aim to advance scalable approaches in personalized genomic medicine by leveraging Protein2PAM to generate PAM-customized Cas proteins tailored to specific therapeutic targets.
Preliminaries
The core component of CRISPR-Cas9 gene editing systems is the Cas9 protein, which is a nuclease that can search through all 3 billion nucleotides in the human genome and cut it at just one specific site. The first step of this process is recognizing and binding to the PAM, a short DNA sequence that each Cas9 specifically binds. Because PAM recognition is a prerequisite to binding and cutting DNA, a Cas9’s PAM specificity can greatly restrict the set of gene editing applications that it can be used for.
PAM recognition by CRISPR-Cas Proteins: PAM recognition is the first step of CRISPR-Cas mediated DNA cleavage (Figure adapted from Klein et al. 2018). Because of their PAMs, SpCas9, Nme2Cas9, St3Cas9, and Nme1Cas9 can respectively edit one out of every 16, 16, 64, and 256 positions in the human genome (on average).
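The "one out of every N positions" figures above follow from counting how many PAM positions are constrained. The sketch below reproduces that arithmetic under two assumptions that are not stated in the text: every base is equally likely at every position (real genomes are not uniform), and the PAMs for SpCas9, Nme2Cas9, and St3Cas9 are the commonly cited NGG, N4CC, and NGGNG. The function names are illustrative only.

```python
import re

# Number of bases each IUPAC nucleotide code allows (N = any base).
IUPAC_CHOICES = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}

def expand_pam(pam: str) -> str:
    """Expand compact PAM notation, e.g. 'N4GATT' -> 'NNNNGATT'."""
    # A letter followed by a count repeats that letter (N4 -> NNNN).
    return re.sub(r"([A-Z])(\d+)", lambda m: m.group(1) * int(m.group(2)), pam)

def pam_frequency(pam: str) -> float:
    """Expected fraction of positions that match the PAM,
    assuming uniform, independent bases (a simplification)."""
    frequency = 1.0
    for code in expand_pam(pam):
        frequency *= IUPAC_CHOICES[code] / 4
    return frequency

# PAMs other than Nme1Cas9's N4GATT are assumptions, not taken from the text.
examples = [
    ("SpCas9", "NGG"),
    ("Nme2Cas9", "N4CC"),
    ("St3Cas9", "NGGNG"),
    ("Nme1Cas9", "N4GATT"),
]
for name, pam in examples:
    print(f"{name}: {pam} matches about 1 in {round(1 / pam_frequency(pam))} positions")
```

Under the same assumption, a PAM with a single constrained nucleotide matches one in 4 positions, and one with two constrained nucleotides matches one in 16.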
To identify a Cas9 capable of editing the genome at a specific PAM, researchers generally rely on one of two strategies: screening natural Cas9 proteins to find one with the desired PAM (though most natural Cas9s are not active in mammalian cells) or using labor-intensive experimental workflows for protein engineering. Given these limitations, there is a clear need for a robust, easy-to-use method to design bespoke Cas proteins with customized PAMs for specific therapeutic targets and personalized genomic medicine. Protein language models can fill this gap because they learn the rules of how proteins work directly from large, bioinformatically mined databases of protein sequences, without the need for any structures. They promise a scalable way to predict and customize the PAM of any Cas protein.
Protein language models accurately predict the PAM specificity of CRISPR-Cas proteins
To model how Cas proteins interact with their PAMs, we performed exhaustive data mining to construct, to our knowledge, the most extensive dataset of CRISPR systems curated to date (Ruffolo et al. 2024). We leveraged a bioinformatics pipeline to identify PAMs for 45,816 unique proteins from this dataset, including 15,731 Cas9 proteins with 1,360 distinct PAM sequences. This dataset represents a 2.8x increase over the largest dataset of bioinformatically determined PAMs (Ciciani et al. 2022) and a ~200x increase over the largest dataset of experimentally determined PAMs (Gasiunas et al. 2020).
A comprehensive training dataset of Cas9 proteins and their PAMs. Phylogenetic tree of Cas9 proteins found in nature with their PAM preferences shown in outer rings. Sequences were sourced from the CRISPR-Cas Atlas dataset.
We used this dataset to train Protein2PAM, a protein language model that takes a Cas protein sequence as input and returns both a predicted PAM and a confidence score. Protein2PAM’s predictions align well with experimental data: for experimentally characterized Cas9 proteins on which the model makes a confident prediction, predicted PAMs show 88.3% agreement with the experimentally characterized PAMs.
Protein2PAM accurately predicts the PAM specificity of commonly used gene editors. Top: experimentally determined PAM preferences for four CRISPR-Cas9 proteins used in genome editing. Bottom: Protein2PAM model predictions for these same proteins. Model accuracy measured using the cosine similarity between PAM profiles.
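As a rough illustration of the accuracy metric named in the caption, the sketch below computes the cosine similarity between two PAM profiles. It assumes each profile is a list of per-position base frequencies and that the whole position-by-base matrix is flattened into one vector before comparison. The text does not specify the encoding, and the two profiles here are made-up toy values rather than measured or predicted data.

```python
import math

BASES = "ACGT"

def flatten(profile):
    """Flatten a position-by-base profile into a single vector."""
    return [column[base] for column in profile for base in BASES]

def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two PAM profiles of equal length."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same number of positions")
    u, v = flatten(profile_a), flatten(profile_b)
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0:
        raise ValueError("profiles must not be all zeros")
    return dot / norm

def one_hot(base):
    """Profile column for a position that strictly requires one base."""
    return {b: 1.0 if b == base else 0.0 for b in BASES}

UNIFORM = {b: 0.25 for b in BASES}  # position with no base preference

# Toy profiles (hypothetical): a strict GATT motif versus a prediction
# that recovers GAT but shows no preference at the last position.
toy_measured = [one_hot("G"), one_hot("A"), one_hot("T"), one_hot("T")]
toy_predicted = [one_hot("G"), one_hot("A"), one_hot("T"), UNIFORM]

print(f"cosine similarity: {cosine_similarity(toy_measured, toy_predicted):.3f}")
```

Identical profiles score 1, and profiles that prefer entirely different bases at every position score 0, so the metric rewards agreement across the whole motif rather than at a single position.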
Interestingly, despite never being trained on a protein structure, Protein2PAM identifies specific amino acids in Nme1Cas9 that are important for PAM binding because they have structural interactions with DNA. It also accurately predicts specific PAM-altering mutations to those amino acids that we verified experimentally.
Protein2PAM pinpoints known protein-PAM interactions. Left: Analysis of Nme1Cas9’s structure shows that amino acids Q981, N1029, and H1024 are essential in mediating PAM binding. Right: Protein2PAM predicts that mutations Q981A, N1029A, and H1024D change Nme1Cas9’s N4GATT PAM to N4GNAT, N4GNTA, and N4CNTT, respectively.
Compared to state-of-the-art bioinformatic methods, Protein2PAM confidently predicts PAMs for more than four times as many natural CRISPR-Cas systems and runs more than 500 times faster. Going forward, Protein2PAM will be an important tool to computationally characterize the PAMs of natural CRISPR-Cas systems that are poorly suited to existing methods.
Engineering gene editors with custom PAMs
Next, we harnessed Protein2PAM to engineer Cas9 proteins with custom PAMs. To do so, we computationally evolved Nme1Cas9 by iteratively introducing random mutations predicted by the model to move the protein’s PAM towards a target PAM that we specified. Our objective was to broaden Nme1Cas9’s highly specific four-nucleotide PAM (found at one in 256 positions in the genome) to a one- or two-nucleotide PAM (respectively found at one in 4 and one in 16 positions in the genome). These more flexible PAMs would allow us to edit a much wider range of positions in the human genome.
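The iterative procedure described above can be sketched as a simple greedy search: propose a random point mutation, score the mutant, and keep it only if the score improves. This is a minimal sketch under stated assumptions, not the pipeline used in this work. The acceptance rule, the number of rounds, and the `toy_score` function are all hypothetical; in practice the score would come from a model's estimate of how close the predicted PAM is to the target PAM.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def evolve(sequence, score_fn, n_rounds=200, seed=0):
    """Greedy random-mutation search: keep a mutation only if score_fn improves."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    best, best_score = sequence, score_fn(sequence)
    for _ in range(n_rounds):
        position = rng.randrange(len(best))
        new_residue = rng.choice(AMINO_ACIDS)
        if new_residue == best[position]:
            continue  # same residue drawn, so nothing changed
        candidate = best[:position] + new_residue + best[position + 1:]
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

def toy_score(sequence, reference="MKTAYIAKQR"):
    """Hypothetical stand-in for a model score: fraction of residues
    matching an arbitrary reference string. Purely illustrative."""
    return sum(a == b for a, b in zip(sequence, reference)) / len(reference)

start = "MKTAAAAAAA"  # toy starting sequence, not a real Cas9
final, final_score = evolve(start, toy_score)
n_mutations = sum(a != b for a, b in zip(start, final))
print(f"start score {toy_score(start):.2f}, final score {final_score:.2f}, "
      f"{n_mutations} mutations")
```

A greedy rule like this can only accept improvements, so it may stall at a local optimum; running many independent trajectories and ranking the results is one common way to generate a large pool of candidates.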
We used our computational evolution pipeline to generate 30,000 candidate proteins targeting three single-nucleotide PAMs (N4G, N4C, and N7A) and three di-nucleotide PAMs (N4CNNT, N6TT, and N6TA), and we selected the top 22 proteins for experimental characterization. These selected designs contained an average of 11.6 mutations from Nme1Cas9. Amongst the 22 designs we tested, 11 were active, and 6 were more active than the wild-type controls Nme1Cas9 and Nme2Cas9. We achieved high hit rates with shifted PAM predictions for the N4G, N4C, and N4CNNT PAMs. In particular, our top design for N4G was 56.4x more active than Nme1Cas9, and our top design for N4C was 9.6x more active than Nme2Cas9. However, designs for the other target PAMs were generally inactive or had less accurate PAM predictions.
Engineering PAM-customized CRISPR proteins with Protein2PAM. Left: Number of PAMs cleaved by wild-type and Protein2PAM designed proteins. Many designed proteins are active and broadened PAM compatibility compared to wild-type controls. Right: PAM preferences for selected enzymes. Designed proteins have shifted PAMs that are similar to their design targets.
Conclusion
Protein2PAM is a protein language model that efficiently predicts the PAM specificity of any Cas protein sequence. We demonstrate that Protein2PAM accurately predicts the PAMs of naturally occurring Cas9 proteins and can even pinpoint the effects of specific mutations on PAM specificity. We use Protein2PAM to engineer Nme1Cas9 to have both higher activity and a PAM of our choosing, and we validate our methods experimentally. By unlocking the ability to scalably customize the DNA targeting of a Cas9 protein, we hope to expand the range of diseases that can be treated by gene editing therapies.
References
Ciciani, Matteo, Michele Demozzi, Eleonora Pedrazzoli, Elisabetta Visentin, Laura Pezzè, Lorenzo Federico Signorini, Aitor Blanco-Miguez, et al. 2022. “Automated Identification of Sequence-Tailored Cas9 Proteins Using Massive Metagenomic Data.” Nature Communications 13 (1): 1–8.
Gasiunas, Giedrius, Joshua K. Young, Tautvydas Karvelis, Darius Kazlauskas, Tomas Urbaitis, Monika Jasnauskaite, Mantvyda M. Grusyte, et al. 2020. “A Catalogue of Biochemically Diverse CRISPR-Cas9 Orthologs.” Nature Communications 11 (1): 1–10.
Ruffolo, Jeffrey A., Stephen Nayfach, Joseph Gallagher, Aadyot Bhatnagar, Joel Beazer, Riffat Hussain, Jordan Russ, et al. 2024. “Design of Highly Functional Genome Editors by Modeling the Universe of CRISPR-Cas Sequences.” bioRxiv. https://doi.org/10.1101/2024.04.22.590591.