Introduction

CRISPR has been well established as a key genome-editing tool, with Cas9 as the most widely used CRISPR enzyme. Classical CRISPR-Cas9-based gene editing relies on the introduction of double-stranded DNA breaks (DSBs) guided by the small guide RNA (gRNA). Repair of the CRISPR-induced DSBs in cells leads to a variety of possible insertions or deletions (indels) in the DNA. This challenge has been addressed by engineering next-generation CRISPR editors such as base editors (BEs)1 and prime editors (PEs)2. A BE consists of a recombinant Cas9 nickase and an attached deaminase that enables the introduction of single-nucleotide mutations without double-stranded DNA cleavage3,4. Like standard Cas9, BEs make use of a gRNA whose 20-nucleotide (nt) spacer must be complementary to the DNA target site (protospacer). Here, we focus on adenine base editors (ABEs), converting A•T into G•C base pairs, and cytosine base editors (CBEs), converting C•G into T•A base pairs. These allow conversion of a single base pair, but “bystander” bases in an approximately 8nt window may be edited as well5, so a given gRNA can produce several different editing outcomes. Hence, for base editing to be widely usable, design tools are required to accurately predict both the frequencies of all outcomes and the gRNA efficiency. The prediction task thus extends the common task of predicting gRNA efficiency.

Within the last few years, tens of thousands of ABE and CBE gRNA efficiencies and corresponding outcomes have been evaluated in cells6,7,8,9,10,11,12,13. However, the availability of datasets for building machine learning prediction models is still limited, and the diverse properties of deaminases6,10,14,15 and of libraries generated by various sources pose a challenge for the use of current datasets. These datasets are further compromised by the inclusion of many low-efficiency gRNAs. Thus, generating more datasets is crucial for the development of better prediction models. Furthermore, since the datasets are not directly compatible, combining them is not straightforward, and all existing methods6,7,8,9,10,11,12,13 are each based on a single dataset, with the limitations that entails.

Here, we apply our previously established lentiviral gRNA-target pair library technology (SURRO-seq)16,17 to measure base editing efficiency on a massive scale for two base editors, ABE (ABE7.10) and CBE (BE4-Gam), in HEK293T cells. For each base editor, we assessed the editing efficiency and outcomes of approximately 11,500 gRNAs using SURRO-seq. Upon integration with published datasets covering ABE7.10, ABE8e and BE4, totaling 17,941 gRNAs for ABE and 19,010 gRNAs for CBE, we developed two integrated deep-learning models, CRISPRon-ABE and CRISPRon-CBE, for predicting gRNA editing efficiency and outcome frequency for ABE and CBE, respectively. The models are trained simultaneously on all data in a procedure that keeps track of the source of each data point. In this way, prediction for a specific base editor (dataset) benefits from the others, and predictions can be based on a weighted combination of the datasets used for training. Our models simultaneously predict gRNA efficiency and outcome frequency, and their performance is evaluated by two-dimensional Pearson and Spearman rank correlation coefficients (R218 and ρ2, see “Methods” for details). Benchmarking our models on independent test sets shows superior performance compared to existing tools. Additionally, feature analysis reveals that the predicted CRISPR-Cas9 efficiency plays an important role in base editor efficiency prediction. A webserver based on CRISPRon-ABE/CBE for predicting gRNA efficiency and outcome frequencies is available via https://rth.dk/resources/crispr/.

Results and discussion

Massive quantification of ABE and CBE editing efficiency and outcomes in cells

We first sought to generate more homogeneous data on ABE and CBE editing efficiency and outcomes in cells using the gRNA-target pair library technology developed previously16,17. We transduced HEK293T-ABE (ABE7.10) and HEK293T-CBE (BE4-Gam) cells with a 12 K gRNA-target pair library at a multiplicity of infection (MOI) of 0.3, with an estimated coverage of approximately 1000 (Fig. 1A). We harvested the transduced cells after eight days of growth in selective medium containing puromycin (selecting cells with an integrated gRNA-target pair construct) and doxycycline (inducing ABE and CBE expression). Deep amplicon sequencing was performed on surrogate target site-specific PCR products (coverage = ~2000 reads per gRNA) to capture the editing efficiencies and outcomes (Fig. 1A). After removing low-quality gRNAs (<100 reads), we obtained 11,484 gRNAs covered in the ABE set and 11,406 gRNAs covered in the CBE set (Fig. 1B and Supplementary Data 1). We also investigated the strictness and editing window (positions 3-10 in the protospacer) for ABE7.10 and BE4 using this large amount of in-cell data. As shown in Fig. 1C, ABE7.10 exhibited a highly strict adenine (A) to guanine (G) transition: 97% of all edited sites were A-to-G transitions. The BE4 cytosine (C) to thymine (T) transition was less strict (92%), and a low frequency (3.5%) of C-to-G transversions was observed for BE4. Although a Cas9 nickase was used for both ABE7.10 and BE4, a low indel frequency (2.06% for ABE7.10; 2.96% for BE4) was also observed. When analyzing the editing window, the narrower core editing window (positions 4-8 in the protospacer) exhibited high editing efficiency for both ABE7.10 and BE4 (Fig. 1D). We compared the ABE7.10 and BE4 editing efficiency with the indel frequency obtained for all these sites with Cas9 from Streptococcus pyogenes (SpCas9).
Base editing efficiency exhibited a positive correlation with SpCas9-induced indel frequency (R = 0.40 to 0.54 for ABE7.10; R = 0.29 to 0.49 for BE4) (Supplementary Fig. 1). We also performed a sequence motif analysis similar to that of Arbab et al.6, using the three nucleotides flanking the edited sites within the core editing window to explore how flanking nucleotides affect base editing efficiency. ABE7.10 and BE4 exhibit high editing activity at 5’TAC (A in bold is the edited site) and 5’TC (C in bold is the edited site), whereas ABE7.10 and BE4 fail to edit sites with 5’AAA and 5’GC, respectively (Fig. 1E). In conclusion, we generated experimental data for over 11,000 gRNAs in cells, comprising SpCas9 indel frequency, ABE (ABE7.10) base editing activity, and CBE (BE4) base editing activity.

Fig. 1: High throughput generation of ABE and CBE editing data in cells.
figure 1

A Illustration of the experimental setup. HEK293T-ABE7.10 and HEK293T-BE4 refer to doxycycline (Dox)-inducible ABE (ABE7.10) and CBE (BE4) human embryonic kidney 293T (HEK293T) cells, respectively. The lentiviral 12 K gRNA-target pair library contains 12,000 gRNAs, generated previously in our CRISPRon study17. B Histograms of base editing efficiency (% edited sites out of the total sites in the target site) for ABE7.10 (top) and BE4 (bottom). C Quantification of strictness for all editing events detected in the surrogate target sites for ABE7.10 and BE4. D Bar plot of substitution fractions for the edited site (A for ABE7.10, C for BE4) at each position of all target sites, including the flanking bases, protospacer, and PAM (protospacer adjacent motif). The distal base pair is defined as 1 (N1). E Sequence motifs for the three sites flanking the edited base for ABE7.10 (left) and BE4 (right). Logistic regression weights are shown as positively or negatively correlating with base editing efficiency. Source data are provided as a Source Data file.

Dataset aware deep learning model improves prediction of base editor efficiency and outcome frequency

In contrast to existing models, which predict only outcome frequency and infer the gRNA efficiency from it8, we develop deep learning models that predict both simultaneously from the 30-nucleotide (nt) input DNA target sequence (Supplementary Fig. 2 and Supplementary Note 1). As input features we included the 30nt DNA sequence (20nt protospacer + 3nt PAM + 4nt and 3nt flanking sequences at the 5’ and 3’ ends, respectively); the gRNA-DNA binding energy ∆GB19, previously adopted in CRISPRon; labels of the target-nucleotide editing positions for the different outcomes; and the predicted Cas9 efficiency computed by a CRISPRon version trained on gRNAs not overlapping the test set17 (Supplementary Note 2). We only consider editing outcomes within the 8nt editing window for both ABE and CBE in the construction of the prediction model (Methods). Since Cas9-induced indel frequency and base conversion efficiency are correlated (Supplementary Fig. 1), making use of the predicted Cas9 efficiency is beneficial for predicting base editing gRNA efficiency and outcome frequency. This is further supported by a SHAP analysis20 (Supplementary Note 3 and Supplementary Fig. 3) and by an ablation analysis in which features were sequentially removed before training the model (Supplementary Fig. 4).

Before constructing prediction models that can simultaneously make use of multiple datasets, we trained our model on individual datasets using this architecture (Supplementary Fig. 2). We collected datasets for two base editors (ABE7.10 and BE4) for training: the SURRO-seq data from this study, the Song Train and Test datasets (Song datasets)7 and the Arbab dataset6 for ABE and CBE, respectively. Moreover, we included two further ABE datasets (“SpCas9-ABEmax” and “SpCas9-ABE8e” from Kissling et al.12, here referred to as “Kissling ABE7.10” and “Kissling ABE8e”) (Methods). We trained our model using a 5-fold cross-validation strategy, with a 6th partition reserved for independent testing; see the detailed training process and dataset separation in the Methods and Supplementary Data 2. Since we have two outputs, gRNA efficiency and outcome frequency, we used the two-dimensional Pearson and Spearman rank correlation coefficients R218 and ρ2 introduced here (Supplementary Note 4) for the combined evaluation of gRNA editing efficiency and outcome frequency predictions. Evaluation on independent test sets illustrates that our deep learning architecture is effective for the base editor efficiency prediction problem (Supplementary Fig. 5). We also note that when training our model solely on the Song Train data, we obtained higher performance on the test set (Song Test) than their model DeepABE/CBE (Supplementary Fig. 5).
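The exact definition of the two-dimensional coefficients is given in Supplementary Note 4. As a rough, non-authoritative sketch, one plausible way to score both prediction tasks jointly is to pool the gRNA-efficiency value pairs and the outcome-frequency value pairs before computing a single correlation; the pooling strategy and the function name `two_dim_corr` are our assumptions, not taken from the paper:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def two_dim_corr(eff_true, eff_pred, out_true, out_pred):
    """Pool gRNA-efficiency and outcome-frequency (true, predicted)
    pairs and compute one Pearson and one Spearman coefficient.
    Illustrative assumption only; the published definition is in
    Supplementary Note 4 of the paper."""
    t = np.concatenate([np.asarray(eff_true), np.asarray(out_true)])
    p = np.concatenate([np.asarray(eff_pred), np.asarray(out_pred)])
    return pearsonr(t, p)[0], spearmanr(t, p)[0]
```

With perfect predictions, both pooled coefficients evaluate to 1, as expected of any sensible combined score.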

Upon inspecting the datasets (Fig. 2, Supplementary Fig. 6, Supplementary Note 5), we noticed that the Kissling ABE8e dataset predominantly exhibits high editing efficiencies between 60% and 80%. In contrast, the Kissling ABE7.10 dataset targets the same gRNA sequences but shows lower efficiencies from 30% to 50%. ABE8e was engineered from ABE7.10 by phage-assisted evolution; it contains 8 extra mutations in the TadA-7.10 domain and exhibits 3-fold to 11-fold increases in editing levels15.

Fig. 2: Variety of datasets from diverse sources.
figure 2

A, B Distributions of gRNA editing efficiency for ABE (A) and CBE (B); C Summary of diverse datasets from different studies. The column “#Bystander edits” shows the number of edited nucleotides within the editing window from a single outcome observed in experiments. Source data are provided as a Source Data file.

To train prediction models on multiple datasets simultaneously (Fig. 3A, Supplementary Fig. 7), we use the output values as provided by the respective datasets but label them individually by including a vector of size equal to the number of datasets being trained on. One position in the vector has the value 1 to indicate the dataset origin of a specific input-output example, while the other positions are 0. For the ABE models the order is: SURRO-seq, Song Train, Arbab, Kissling ABE7.10, and Kissling ABE8e. For the CBE models, the order is: SURRO-seq, Song Train, and Arbab. When using the models for prediction, we have the option of using the positions in the vector as weights (see Supplementary Table 1 for the detailed model tuning strategy). Users can assign full weight (100%) to a specific training set by setting a value of 1 at the corresponding data index, or allocate fractional weights across multiple datasets, with the total sum constrained to 1. For example, the encoding “10000” (for ABE) or “100” (for CBE) corresponds to models tuned exclusively on the SURRO-seq dataset. The encoding “22000” refers to an ABE model with equal weight (50% or 1/2) on the SURRO-seq and Song datasets, and no weight on the others. Note that a non-zero number at a given position of the labeling vector corresponds to the denominator of the fraction representing that dataset's weight. Evaluating the performance on several independent test sets reveals consistent results for fused models tuned on different datasets (Fig. 3B, C). Compared with established base editor efficiency prediction methods, our models demonstrate superior performance on independent test sets (Fig. 3B, C, Supplementary Figs. 5 and 8–12, and Supplementary Data 3). Notably, all methods yield higher performance on the Song Test set, showing that this set is for some reason “easier” than the other test sets.
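The digit encoding described above maps directly to a per-dataset weight vector: a non-zero digit d at position i assigns weight 1/d to dataset i. A minimal sketch of this decoding (the function name `decode_weights` is ours, not from the released software):

```python
def decode_weights(encoding: str) -> list:
    """Convert a dataset-labeling string such as '22000' into the
    per-dataset weight vector used at prediction time.  A non-zero
    digit d at position i assigns weight 1/d to dataset i; a zero
    assigns no weight.  The weights must sum to 1."""
    w = [0.0 if c == "0" else 1.0 / int(c) for c in encoding]
    assert abs(sum(w) - 1.0) < 1e-9, "weights must sum to 1"
    return w
```

For example, "10000" places all weight on SURRO-seq, "22000" splits the weight evenly between SURRO-seq and Song Train, and "333" spreads it evenly over the three CBE datasets.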

Fig. 3: Schematic representation of CRISPRon-ABE and CRISPRon-CBE prediction algorithm.
figure 3

A The inputs of the models are the one-hot encoded 30nt target DNA sequence, the labeling of the edited nucleotide positions within the editing window from the outcome sequence, the binding energy (ΔGB), the predicted Cas9 efficiency, and the labeling of the dataset source. For ABE models, the dataset labels follow the order: SURRO-seq, Song Train, Arbab, Kissling ABE7.10, and Kissling ABE8e. For CBE models, the order is: SURRO-seq, Song Train, and Arbab. B, C Performance comparison of CRISPRon-ABE and CRISPRon-CBE together with other existing models on independent test sets. The fused model makes use of a numerical encoding corresponding to the weight given to each dataset. Specifically, “1” represents giving 100% weight to the corresponding dataset (SURRO-seq, Song Train, Arbab, Kissling ABE7.10 or Kissling ABE8e), while “2”, “3”, and “4” represent giving 50% (1/2), 33% (1/3) or 25% (1/4) weight to the corresponding dataset. For example, the encodings “10000” and “100” refer to models tuned to SURRO-seq for ABE and CBE, respectively. The encoding “22000” refers to a model which gives equal weight (50%) to the SURRO-seq and Song datasets and no weight (0%) to the other ABE datasets. Similarly, the encoding “333” refers to equal weight (33%) on all three CBE datasets. N.a. means not available due to lack of explicit train-test separation. The two-sided Steiger’s test p-value was used to compare correlation coefficients between two models. * represents p-value < 0.05, ** represents p-value < 0.01, *** represents p-value < 0.001. The p-values for ABE7.10 evaluated by R2 (from top to bottom) are 0, 0, 0, and 1.73e-6, respectively. The p-values for ABE7.10 evaluated by ρ2 (from top to bottom) are 0, 0, 0, and 6.66e-16, respectively. The p-values for ABE8e evaluated by R2 and ρ2 are 0 and 1.06e-13, respectively. The p-values for BE4 evaluated by R2 (from top to bottom) are 0, 2.34e-6, and 0, respectively.
The p-values for BE4 evaluated by ρ2 (from top to bottom) are 0, 7.88e-3, and 0.42, respectively.

Due to the data heterogeneity and differences in base editor efficiency, our results suggest that the weighting decisions should depend on the base editor and on the platform used for editing. For instance, if the gRNA is designed for ABE8e, we suggest giving a weight of 100% to the Kissling ABE8e dataset using CRISPRon-ABE (00001). If the gRNA is designed for ABE7.10 or BE4 and the editing uses the SURRO-seq, Song, Arbab or Kissling platform, we suggest giving a weight of 100% to the corresponding dataset. If the gRNA is designed for ABE7.10 or BE4 but the platform for base editing is unclear, we suggest considering the scaffold sequence. If the scaffold is the same as the one used in the Arbab and Kissling ABE7.10 datasets, with an extended stem and a modified nucleotide21, we suggest giving weights of 50% to both the Arbab and Kissling ABE7.10 datasets for ABE (CRISPRon-ABE(00220)) or a weight of 100% to the Arbab dataset for CBE (CRISPRon-CBE(001)). Otherwise, we suggest giving weights of 50% to both the SURRO-seq and Song datasets (CRISPRon-ABE(22000) and CRISPRon-CBE(220)).

We observed that the model performance on the independent test sets declined by approximately 10% when dataset labeling was omitted during training (Supplementary Fig. 4). This observation is consistent with the datasets being diverse and supports the importance of dataset awareness when developing models using combined and diverse datasets.

The limited variety of base editors included in CRISPRon-ABE/CBE also limits the ability to construct predictors that apply broadly across the base editor field. So far, the models have been developed using the ABE7.10, ABE8e and BE4 base editors. Considering the distinct sequence motif preferences of different deaminases6,10,22, accurate prediction of editing outcomes for other base editors using CRISPRon-ABE/CBE remains a challenge. For example, newly developed base editors like ABE8.20m23 and CBE6b14 have not been incorporated into our deep-learning models, or even into the benchmarking analyses, due to the limited publicly available datasets. This emphasizes the need for, and the promise of, generating large-scale editing datasets using, e.g., SURRO-seq for these newly evolved base editing enzymes. In addition, our current model is trained only on data harvested in the HEK293T cell line. Systematic benchmarking indicates that CRISPRon-ABE/CBE achieves comparable performance across the cell lines in the Arbab dataset (Supplementary Fig. 12). Given the diverse properties across different data-generation platforms (Fig. 2 and Supplementary Fig. 6), the diversity of base editing across cell lines is not accounted for in our current model. This highlights the need for more data.

We conclude that our fused models (CRISPRon-ABE and CRISPRon-CBE) consistently demonstrate significantly improved prediction. In a design context, the dataset labeling used during training can be replaced by weights, allowing emphasis on a specific dataset, e.g., one that uses a BE closer to the one of interest to the user. Our analysis showed that both more data and data integration in the prediction model are important to further improve performance. In addition, our dataset integration strategy not only proves effective for combining datasets from multiple base editors but also holds potential for broader applications in other CRISPR technologies. For instance, integrating cell-line information when training models on prime editing datasets with the same edit types, such as short indels, but measured in different cell lines may enhance predictive accuracy. The CRISPRon-ABE/CBE models are available as stand-alone software and as a webserver via https://rth.dk/resources/crispr/.

Methods

Cell culture and doxycycline-inducible ABE/CBE stable cell lines generation

Wildtype HEK293T cells from ATCC (catalog number CRL-3216) were cultured in DMEM medium supplemented with 10% FBS and 1% penicillin-streptomycin (D10 medium) at 37 °C with 5% CO2. Routine mycoplasma testing using PCR (cat. no. PM008) yielded negative results for the cells used in this study. TRE-ABE7.10 and TRE-CBE (BE4) stable cell lines were generated using the PiggyBac transposon system. HEK293T cells were transfected with the pPB-ABE7.10-hygromycin (modified from Addgene #102919) or pPB-CBE-hygromycin (modified from Addgene #100806) vector and pCMV-hybase at a 9:1 ratio. HEK293T-ABE7.10 and HEK293T-BE4 cells were cultured in selection medium containing 50 μg/mL hygromycin.

Lentivirus transduction with the 12 K library

HEK293T-ABE7.10 and -BE4 cells were cultured in D10 medium supplemented with 50 μg/mL hygromycin throughout the experiment. The 12 K library was generated previously17. For the lentiviral 12 K CRISPR gRNA-target pair library transduction, the following steps were carried out: 1) Day −1: 2.5 × 10^6 cells per 10 cm dish were seeded in 12 dishes. One dish per group was allocated for cell number determination pre-transduction, one for drug-resistance (puromycin) control, and the remaining 10 dishes for lentivirus library transduction; 2) Day 0: Cells were counted to estimate the number per dish, determining the volume of lentivirus for transduction at an MOI of 0.3. The low MOI (0.3) ensured that most infected cells received only one copy of the lentivirus. Infected cells were cultured in D10 medium at 37 °C; 3) Day 1: Transduced cells were split into three dishes per original dish; 4) Day 2: Sub-group 1 (10 dishes) was harvested and labeled as Day 2 post-12 K library transduction, with cells pooled for genomic DNA extraction. Sub-group 2 (10 dishes) received fresh D10 medium with 50 μg/mL hygromycin + 1 μg/mL puromycin (Dox-free). Sub-group 3 (10 dishes) received D10 medium with 50 μg/mL hygromycin + 1 μg/mL puromycin + 1 μg/mL doxycycline (Dox induction of base editor expression). For the WT HEK293T cell (Group 3) screening, hygromycin but not puromycin was used in the culture medium; 5) Transduced cells were split every 2–3 days when confluence reached 90%; 6) The transduced cells were harvested for genomic DNA extraction and targeted deep sequencing after eight days of growth in the Dox selection medium.

Targeted deep sequencing

Genomic DNA from transduced cells was extracted using the phenol-chloroform method and subsequently purified. PCR amplification of the surrogate target sites was carried out using the TRAP-NGS-F1 and TRAP-NGS-R1 primers (sequences listed in Supplementary Table 2). Each PCR reaction contained 5 μg of genomic DNA as template, corresponding to approximately 7.6 × 10^5 copies of the gRNA-target pair construct and about 63-fold coverage of the 12 K library. To achieve comprehensive coverage, 32 parallel PCR reactions were conducted, resulting in approximately 2016-fold coverage of each construct. The PCR products were purified on a 1.5% gel and pooled in equal amounts. The amplicon libraries were then subjected to paired-end 150 bp deep sequencing on the MGISEQ-2000 platform (MGI of BGI, China).

Raw data processing

The raw data were sequenced using the MGISEQ-2000 platform (G400), generating PE150 paired-end reads. FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/, version: 0.11.3) and fastp (https://github.com/OpenGene/fastp, version: 0.19.6)24 were used for data quality control and filtering with the default parameters. The paired-end data were assembled using flash (http://www.cbcb.umd.edu/software/flash, version: 1.2.11)25. BWA-MEM (version: 0.7.17)26 was used to map the assembled data to the designed oligo sequences to preliminarily distinguish the data of each synthetic gRNA-target pair surrogate site. The oligo sequences have been reported in the previous CRISPRon study and are available in Supplementary Data 117.

Data filtering

The pysam module of Python 3.8 was used to split the aligned data according to the site number of the chip, yielding the reads for each site. We then filtered the data of each site using strictly controlled parameters. First, according to the structure of the chip, g + gRNA (20 bp) + scaffold (82 bp) + GTTT had to remain unchanged at the beginning and end of each site. Second, to remove chip synthesis errors, the pseudo-editing sequences found in WT HEK293T cells transduced with the 12 K SURRO-seq library were removed from both the ABE and CBE groups. These filtering steps were implemented in Julia 1.5.3.

Calculation of base editing efficiency

To improve suitability for machine/deep learning, the SURRO-seq dataset was further processed, and base editing efficiency and outcome editing efficiency were calculated separately. Base editing efficiency refers to the fraction of edited sites: for each site, the frequency of mutation (number of mutated reads/total number of reads) was calculated, including the frequency of indels, although these occurred at a lower percentage. To mitigate the impact of enrichment and depletion, we kept only the subset of the dataset with a total read count greater than or equal to 100, resulting in 11,484 gRNAs for ABE and 11,406 gRNAs for CBE. The outcomes for all edited results were calculated by masking indel loci as wild type. The frequency (number of mutated reads per outcome/total number of reads) and proportion (number of mutated reads per outcome/number of reads in which mutations occurred) were then calculated separately.
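The per-site quantities above can be sketched as follows; the read counts, outcome names, and function name are illustrative, not taken from the study:

```python
def editing_stats(total_reads, outcome_reads):
    """Per-target editing statistics as described above.
    total_reads   -- total reads covering the surrogate target site
    outcome_reads -- dict mapping each edited outcome to its read count
    Returns the overall efficiency, the per-outcome frequency (reads
    per outcome / total reads) and the per-outcome proportion (reads
    per outcome / edited reads)."""
    edited = sum(outcome_reads.values())
    efficiency = edited / total_reads
    frequency = {o: n / total_reads for o, n in outcome_reads.items()}
    proportion = {o: n / edited for o, n in outcome_reads.items()}
    return efficiency, frequency, proportion
```

For a site with 1000 reads of which 150 support one outcome and 50 another, the efficiency is 0.2, the first outcome's frequency is 0.15, and its proportion among edited reads is 0.75.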

Sequence context analysis for base editing efficiency

To explore the impact of proximal sites on the editing efficiency of base editors, we constructed sequence motifs for both ABE and CBE using our SURRO-seq dataset. We employed a logistic regression model to predict gRNA editing efficiency and generated sequence logos based on the obtained weights. Initially, we selected sequences containing the target nucleotides (“A” for ABE and “C” for CBE) along with their flanking nucleotides (±3nt). These sequences were encoded using a one-hot encoding scheme. The gRNA editing efficiency was coded as 0 or 1 to facilitate binary classification: if the editing efficiency exceeded the mean value at a given position, the target value was 1; otherwise, it was 0. We then fitted the logistic regression model with 5-fold cross-validation, keeping highly similar target sequences (≤4 mismatches for a 30nt target sequence) in a single fold. Model performance was assessed on the equally sized 6th partition reserved for independent testing, using the Pearson correlation coefficient between actual and predicted efficiency values. Additionally, we extracted the weights from the logistic regression model to visualize the effect of sequence context on base editor efficiency. Given the high efficiency observed at positions 4 to 8 within the spacer, we focused our motif analysis on target sequences with target nucleotides located within these positions.
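The motif-weight idea can be sketched as follows. This uses synthetic 7nt contexts and a fabricated label rule (a 5'T preceding the edited base) purely for illustration; the real analysis fits the same kind of model to binarized SURRO-seq efficiencies:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

BASES = "ACGT"

def one_hot(seq):
    """Flattened one-hot encoding of a nucleotide sequence."""
    return [1.0 if base == seq[pos] else 0.0
            for pos in range(len(seq)) for base in BASES]

# Synthetic 7nt contexts (edited base at the centre, +/-3nt flanks);
# illustration only -- the real analysis uses SURRO-seq data.
rng = np.random.default_rng(0)
contexts = ["".join(rng.choice(list(BASES), 7)) for _ in range(400)]
# Fabricated label rule emulating the observed 5'T preference:
# class 1 ("above the positional mean") when a T precedes the edit.
labels = [1 if c[2] == "T" else 0 for c in contexts]

X = np.array([one_hot(c) for c in contexts])
model = LogisticRegression(max_iter=1000).fit(X, labels)

# Reshape weights to (position, base) for a sequence-logo style view.
weights = model.coef_.reshape(7, 4)
```

Positive weights at a (position, base) cell then correspond to the positively correlating letters in the Fig. 1E logos, and negative weights to the disfavored ones.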

Data collection and pre-processing for machine learning

In addition to our SURRO-seq dataset, we collected datasets for both ABE and CBE: Song Train/Test from Song et al.7, the Arbab dataset from Arbab et al.6 (although there is no information about which part of the data was used for their training and testing) and Marquart Test from Marquart et al.8. We also collected two ABE datasets (Kissling ABE7.10 and ABE8e) from Kissling et al.12; again, no detailed information was provided regarding which data were used for training and testing in the original study. In total, three base editors are involved in this study: ABE7.10 (SURRO-seq, Song, Arbab, Marquart and Kissling ABE7.10 datasets), ABE8e (Kissling ABE8e dataset) and BE4 (SURRO-seq, Song, Arbab and Marquart datasets). In addition, we collected not only the data evaluated in HEK293T cells but also the Arbab datasets from other cell lines, mESC (Arbab mESC Test) and U2OS (Arbab U2OS Test), to explore model performance across cell lines. Further details about the datasets are given in Supplementary Table 3.

The datasets were processed as follows: 1) Removal of gRNAs without an “NGG” PAM at the DNA target site; 2) Removal of gRNAs for which the corresponding target sites are supported by fewer than 100 total reads; 3) Removal of gRNAs with over 100,000 total reads, as their read counts far exceed those of other gRNAs; 4) Removal of gRNAs not containing the nucleotide A (ABE) or C (CBE) within the editing window (positions 3 to 10); 5) Exclusive retention of outcomes with intended target nucleotide transitions (A > G for ABE, C > T for CBE) within the editing window, considering that only few reads supported transitions to unintended nucleotides, such as A > Y for ABE and C > R for CBE, and that the transitions occur mainly within the editing window (Fig. 1C, D). The outcomes from one target DNA sequence were further grouped according to the positions of the transitioned nucleotides in the editing window, and the frequencies within each group were summed; 6) Generation of outcomes at target sites with the potential to be edited but without any experimental read support, by setting their outcome frequency to 0; 7) Removal of gRNAs for which the corresponding target site is supported by fewer than 100 edited reads, as in previous studies6,7. As we observed in the Song datasets that the summed read counts of the outcomes edited by a specific gRNA can exceed the total read count reported for that gRNA, we recalculated the outcome frequencies accordingly. See Supplementary Table 4 for more details about dataset processing and Supplementary Table 5 for the general composition of these datasets.
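The grouping in step 5 can be sketched as follows; the sequences and the function name `group_outcomes` are illustrative, not from the released pipeline:

```python
from collections import defaultdict

def group_outcomes(target, outcomes, window=(3, 10)):
    """Group edited outcome sequences of one target by the set of
    transitioned positions inside the editing window and sum their
    frequencies (processing step 5 above).  target and the outcome
    sequences are protospacer strings; positions are 1-based."""
    start, end = window
    grouped = defaultdict(float)
    for seq, freq in outcomes:
        edited = tuple(i for i in range(start, end + 1)
                       if seq[i - 1] != target[i - 1])
        grouped[edited] += freq
    return dict(grouped)
```

Two outcome sequences that differ only outside the editing window thus fall into the same group and their frequencies are summed.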

Each target sequence is 30nt long, encompassing 4nt upstream, the 20nt target DNA protospacer (corresponding to the spacer), the 3nt PAM and 3nt downstream of the PAM. To develop our model, we combined our SURRO-seq dataset, Song Train, the Arbab dataset, and the Kissling ABE7.10 and ABE8e datasets (train and/or test set), while Song Test, Marquart Test, and the Arbab mESC and U2OS Tests were used only for model evaluation (test set). The combined dataset consists of 17,941 unique 30nt target DNA sequences for ABE and 19,010 unique sequences for CBE. The overlap of identical 30nt sequences between our SURRO-seq dataset and Song Train was modest (20 gRNAs for ABE and 22 for CBE); these were retained in both dataset sources. Since the gRNAs in the two Kissling datasets derive from the same library, 932 gRNAs remained in both datasets after the above filtering steps; all were retained. See Supplementary Fig. 13 for details.

Generation of target DNA and outcome features

We processed the 30nt target DNA sequences and their corresponding edited outcome sequences by the following steps: 1) the 30nt target DNA sequences were one-hot encoded; 2) outcome sequences in the editing window were one-hot encoded/labeled (edited or not) based on the positions of the target nucleotide transitions in the editing window. Therefore, each outcome sequence was represented by one vector with length equal to that of the editing window. The value in the vector was set to 1 if the nucleotide was edited, and 0 otherwise. 3) The spacer RNA-target DNA binding energy ΔGB was calculated using the CRISPRoff pipeline 1.1.219 supplied with RNAfold 2.5.127. 4) CRISPR/Cas9 indel frequencies were predicted by CRISPRon17 (see details in Supplementary Note 2). 5) The origin of each training set was encoded in a “labeling” vector of size 5 for ABE (SURRO-seq dataset, Song Train, Arbab dataset, Kissling ABE7.10 dataset, Kissling ABE8e dataset) and 3 for CBE (SURRO-seq dataset, Song Train, Arbab dataset), as detailed in Supplementary Table 1.
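Feature steps 1 and 2 can be sketched as follows; the example sequences and the function name `encode_features` are illustrative assumptions:

```python
import numpy as np

def encode_features(target30, outcome_window, window=(3, 10)):
    """One-hot encode a 30nt target and label edited positions in the
    8nt editing window (feature steps 1-2 above).  outcome_window is
    the outcome sequence restricted to the editing window; the other
    features (binding energy, predicted Cas9 efficiency, dataset
    label) are appended separately."""
    bases = "ACGT"
    onehot = np.zeros((30, 4))
    for i, b in enumerate(target30):
        onehot[i, bases.index(b)] = 1.0
    start, end = window
    # window positions are 1-based within the 20nt protospacer, which
    # begins after the 4nt 5' flank of the 30nt target sequence
    proto = target30[4:24]
    label = np.array([1.0 if outcome_window[i] != proto[start - 1 + i]
                      else 0.0 for i in range(end - start + 1)])
    return onehot, label
```

Each outcome is thus represented by an 8-element 0/1 vector over the editing window, alongside the 30x4 one-hot target matrix.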

Generation of dataset partitions

The dataset was split into six partitions of approximately equal size (within a count of one gRNA) for model development, as previously done for CRISPRon17, resulting in five partitions for cross-validation and a 6th partition for testing. The procedure took the sequence similarity between 30nt target DNA sequences into account by assigning highly similar ones to the same partition, while dissimilar ones were randomly assigned to different partitions. In addition, we considered that part of the targets were included in the training of CRISPRon17, which is applied for Cas9-gRNA efficiency prediction, and that none of the data points in the 6th partition (test set) should belong to the CRISPRon training set, to prevent overestimation (Supplementary Note 2). This was done as follows (see the detailed workflow in Supplementary Fig. 14): 1) The pairwise Hamming distance between all 30nt target DNA sequences was calculated using the scikit-learn pairwise_distances function28, and highly similar sequences (≤4 mismatches for a 30nt target sequence) were extracted, as for CRISPRon17; 2) The 30nt target DNA sequences employed in CRISPRon model development across the six validation sets were assigned to the same partitions as in CRISPRon17; 3) 30nt target sequences highly similar to those assigned in step 2 were put in the same partition as their similar counterparts; 4) The remaining similar DNA targets were grouped and each group assigned to one partition at random; 5) The remaining dissimilar targets from different sources were randomly distributed across partitions. The test dataset from Song et al.7, provided by the authors, was assigned to the 6th partition for subsequent benchmark analysis, while all the 30nt sequences from Song Train were divided among the five training partitions. Sequences from Song Train which were assigned to the test partition of CRISPRon were not allocated in step 2 but were randomly distributed among the five training partitions in step 5.
Additionally, DNA targets involving the same variant loci in the Kissling dataset were also grouped as highly similar, as suggested by the authors12, and were assigned to partitions in steps 3 and 4. After these assignment steps, the 30nt target DNA sequences and their potential edited outcomes from the diverse sources were distributed across the six partitions with no highly similar sequences shared between partitions, and all data not used for CRISPRon training were kept in the 6th partition, keeping the test set fully independent for further benchmark analysis (Supplementary Table 6).
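The similarity grouping in step 1 can be sketched as follows. This is a minimal illustration, not the actual pipeline code: the sequences are invented, and the integer encoding of bases is an assumption made so that the Hamming metric counts mismatched positions.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical 30nt target sequences (illustrative only, not study data).
targets = [
    "ACGTACGTACGTACGTACGTACGTACGTAC",
    "ACGTACGTACGTACGTACGTACGTACGTAG",  # 1 mismatch vs. the first
    "TTTTGGGGCCCCAAAATTTTGGGGCCCCAA",
]

# Integer-encode the bases so the Hamming metric compares positions.
enc = np.array([["ACGT".index(b) for b in s] for s in targets])

# pairwise_distances with metric="hamming" returns the *fraction* of
# differing positions; multiply by the sequence length (30) to obtain
# mismatch counts.
dist = pairwise_distances(enc, metric="hamming") * enc.shape[1]

# Targets within <=4 mismatches are treated as highly similar and must
# be assigned to the same partition.
similar = dist <= 4
```

In practice the similar pairs would then be merged into connected groups before the group-wise random assignment of step 4.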

Test sets for the evaluation and comparison of models

Test sets from the various dataset sources were collected: SURRO-seq Test (the 6th partition of the SURRO-seq dataset), Song Test, Arbab Test (the 6th partition of the Arbab dataset), Kissling ABE7.10 Test (the 6th partition of the Kissling ABE7.10 dataset, ABE only), Kissling ABE8e Test (the 6th partition of the Kissling ABE8e dataset, ABE only), Marquart Test, and Arbab mESC and U2OS Test (see Generation of dataset partitions). To create fully independent test sets, each test set was processed as follows: 1) The pairwise Hamming distance between the 20nt gRNAs of the test sets and the training sets used by any method was calculated with the cdist function in SciPy29; 2) gRNAs within a Hamming distance of ≤3 mismatches (over the 20nt gRNA) of any training set were removed, as was done for CRISPRon17; 3) The 30nt target DNA sequences used to train the final version of CRISPRon were removed. The test sets were thereby reduced to the following numbers of gRNAs: SURRO-seq Test: 868 (ABE), 1,297 (CBE); Song Test: 430 (ABE), 474 (CBE); Arbab Test: 1,096 (ABE), 1,180 (CBE); Marquart Test: 1,482 (ABE), 1,489 (CBE); Kissling ABE7.10 Test: 144 (ABE); Kissling ABE8e Test: 152 (ABE); Arbab mESC Test: 2,554 (ABE), 1,481 (CBE); Arbab U2OS Test: 361 (ABE), 289 (CBE). In addition, we extracted the outcomes with a frequency greater than 5% in each test set for benchmark analysis, as these outcomes are the most critical for practical application; these subsets are called “SURRO-seq Test*”, “Song Test*”, “Arbab Test*”, “Marquart Test*”, “Kissling ABE7.10 Test*” and “Kissling ABE8e Test*”, respectively. To evaluate performance on a dataset including the large number of low-efficiency gRNAs present in the unprocessed datasets, we built a test set containing the initially discarded gRNAs (edited reads <100), resulting in “SURRO-seq Test with low efficient gRNAs”: 2,326 (ABE), 2,165 (CBE).
Correspondingly, by removing low-efficiency gRNAs (<100 edited reads) from the Song Test, we obtained “Song Test without low efficient gRNAs”: 383 (ABE), 338 (CBE).
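Steps 1 and 2 of this filtering can be sketched as follows; the spacers are invented for illustration, and the integer encoding of bases is an assumption so that the Hamming metric compares positions.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical 20nt spacers (illustrative only).
test_grnas  = ["ACGTACGTACGTACGTACGT", "GGGGCCCCAAAATTTTGGGG"]
train_grnas = ["ACGTACGTACGTACGTACGA"]  # 1 mismatch vs. the first test gRNA

t  = np.array([["ACGT".index(b) for b in s] for s in test_grnas])
tr = np.array([["ACGT".index(b) for b in s] for s in train_grnas])

# cdist's Hamming metric is the fraction of differing positions;
# multiply by the spacer length (20) to obtain mismatch counts.
mismatches = cdist(t, tr, metric="hamming") * t.shape[1]

# Remove any test gRNA within 3 mismatches of any training gRNA.
keep = (mismatches > 3).all(axis=1)
filtered = [g for g, k in zip(test_grnas, keep) if k]
```

Here the first test gRNA is dropped (1 mismatch to a training gRNA) while the second is retained.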

Development of deep learning models designed for base editors

Deep learning models were developed for ABE and CBE using Keras/Tensorflow30 2.10.0 and Python 3.10. The models were designed to simultaneously predict gRNA editing efficiency and outcome frequency from a 30nt target DNA sequence and an edited outcome. The model requires four inputs: the 30nt target DNA sequence, the 8nt labeling of editable positions, and two molecular features: ΔGB, a powerful feature for Cas9-gRNA efficiency prediction17, and the predicted Cas9-gRNA efficiency, which is positively correlated with base editor efficiency7 (Supplementary Fig. 1). The one-hot encoded 30nt target DNA sequences are fed into three convolutional neural networks with filter sizes of 3, 5, and 7. The outputs from these three filters are flattened and then further passed through one fully connected layer. The one-hot encoded outcome vector representing the edits of target nucleotides in the editing window is concatenated with the molecular features and outputs from the three convolutions. Subsequently, this combined input is fed into two sequential fully connected layers before computing a 2D output: the gRNA editing efficiency and the outcome frequency (Supplementary Note 1 and Supplementary Fig. 2).
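The described architecture can be sketched with the Keras functional API. This is a minimal sketch under stated assumptions: the layer widths, the 8nt×4 outcome encoding size, and the activation functions are illustrative choices, not the published hyper-parameters (those are in Supplementary Data 2).

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Four inputs: 30nt target one-hot, outcome encoding for the 8nt window,
# and two molecular features (delta-G_B and predicted Cas9-gRNA efficiency).
seq_in  = keras.Input(shape=(30, 4), name="target_onehot")
out_in  = keras.Input(shape=(8 * 4,), name="outcome_onehot")   # assumed size
feat_in = keras.Input(shape=(2,), name="molecular_features")

# Three parallel convolutions with filter sizes 3, 5 and 7.
convs = [layers.Flatten()(layers.Conv1D(64, k, activation="relu")(seq_in))
         for k in (3, 5, 7)]

# Flattened conv outputs pass through one fully connected layer.
h = layers.Dense(128, activation="relu")(layers.Concatenate()(convs))
h = layers.Dropout(0.1)(h)

# Concatenate the outcome encoding and molecular features, then apply
# two sequential fully connected layers.
h = layers.Concatenate()([h, out_in, feat_in])
h = layers.Dense(64, activation="relu")(h)
h = layers.Dropout(0.1)(h)
h = layers.Dense(32, activation="relu")(h)

# 2D output: gRNA editing efficiency and outcome frequency.
out = layers.Dense(2, name="efficiency_and_frequency")(h)

model = keras.Model([seq_in, out_in, feat_in], out)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4), loss="mse")
```

A forward pass on dummy inputs of the expected shapes returns the two predicted values per outcome.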

To enhance generalization performance across diverse datasets, our SURRO-seq dataset was combined with the published Song Train, Arbab, Kissling ABE7.10 (ABE only) and Kissling ABE8e (ABE only) datasets for model development. A single fused model was trained on the combined dataset while keeping the original values of gRNA editing efficiency and outcome frequency. Because of variation between datasets (i.e., the same gRNA may exhibit different efficiencies across datasets), we encoded the dataset of origin as a feature (Supplementary Table 1) and introduced this feature alongside the others in the model. The complete model architecture is outlined in Fig. 3A, with details in Supplementary Fig. 7.

Models were trained using 5-fold cross-validation on the five partitioned datasets, with the 6th partition held out for testing. All models were trained and evaluated using the mean squared error (MSE). During training, the model ran for up to 2000 epochs, with weights updated in batches of 16 samples and a learning rate of 0.0001 using Adam optimization. Training was stopped when the MSE on the validation set did not improve for 100 consecutive epochs, and the best model by validation MSE was saved. To prevent overfitting, a dropout rate of 0.1 was applied in each layer. The hyper-parameters were tuned to optimize the deep learning models: inspired by the network of DeepBE10, the number of filters in the convolutional neural networks and the number of nodes in the fully connected layers were increased (see Supplementary Data 2 for detailed hyper-parameter optimization). During hyper-parameter optimization, each training run in the 5-fold cross-validation was repeated four times with different random seeds, and the best of the four models was chosen for each validation fold. The final MSE for model selection was calculated as the average MSE of the five best models, weighted by the number of outcomes in each validation fold. For the final model, each training run in the 5-fold cross-validation was repeated 10 times with different random seeds for robustness. The final output was the average of the outputs of the five best models, one chosen per fold.
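The model-selection rule (best-of-seeds per fold, then an outcome-count-weighted average across folds) can be illustrated with invented numbers; the MSE values and fold sizes below are purely hypothetical.

```python
import numpy as np

# Validation MSE for each seed repeat in each fold (invented values):
# rows = folds, columns = 4 seed repeats during hyper-parameter tuning.
fold_mses = [
    [0.12, 0.10, 0.11, 0.13],  # fold 1
    [0.09, 0.08, 0.10, 0.09],  # fold 2
    [0.11, 0.12, 0.10, 0.11],  # fold 3 (three folds shown for brevity)
]
# Number of outcomes in each validation fold (invented).
fold_sizes = np.array([1000, 1200, 900])

# Keep the best (lowest-MSE) model from the seed repeats of each fold.
best_per_fold = np.array([min(m) for m in fold_mses])

# Weight the per-fold best MSEs by fold size to get the selection MSE.
selection_mse = np.average(best_per_fold, weights=fold_sizes)
```

The same best-of-seeds logic, with 10 repeats, selects the five models whose averaged outputs form the final prediction.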

Benchmark analysis of base editor prediction methods

CRISPRon-ABE and CRISPRon-CBE were benchmarked against existing base editor methods, including DeepABE/CBE7, BE-HIVE6, BE-DICT8, BE_Endo11, BEDICT2.012 (ABE only) and DeepBE10 (ABE8e only), for the prediction of gRNA editing efficiency and outcome frequency on the independent test sets. For BE-DICT8 and BEDICT2.012, the outcome frequency with target nucleotide transitions in the editing window (positions 3 to 10 of the 20nt gRNA spacer) was predicted, as suggested by the authors. Since all test sets consist of genome-integrated target sequences, we only compared against the BE_Endo11 model trained using sequence information. All the above models were downloaded from their GitHub repositories.

Joint evaluation of gRNA editing efficiency and outcome frequency

To evaluate the performance of the base editor efficiency predictions, we calculated an extended K-dimensional Pearson correlation coefficient, RK (K = 2)18, and an extended K-dimensional Spearman rank correlation coefficient, ρK (K = 2), between the ground truth and the predictions. These metrics allow us to jointly evaluate the performance in predicting gRNA editing efficiency and outcome frequency. The formulas for RK and ρK are as follows (see Supplementary Note 4 for details):

We consider two tables X (e.g. ground truth) and Y (e.g. prediction) with N rows (number of outcomes) and K columns (here K = 2, gRNA editing efficiency and outcome frequency). The covariance between X and Y was defined in ref.18 as:

$$COV(\underline{\underline{X}},\underline{\underline{Y}})={\sum }_{k\,=1}^{K}{w}_{k}COV({\underline{\underline{X}}}_{k},{\underline{\underline{Y}}}_{k})=\frac{1}{K}{\sum }_{n\,=\,1}^{N}{\sum }_{k\,=\,1}^{K}({X}_{nk}-{\bar{X}}_{k})({Y}_{nk}-{\bar{Y}}_{k}),$$
(1)

where \({\bar{X}}_{k}=\frac{1}{N}\mathop{\sum }_{n=1}^{N}{X}_{nk}\), \({\bar{Y}}_{k}=\frac{1}{N}\mathop{\sum }_{n=1}^{N}{Y}_{nk}\) and \({w}_{k}=\frac{1}{K}=\frac{1}{2}\).

The correlation coefficient is defined as

$${R}_{K}=\frac{COV({\underline{\underline{X}}},\underline{\underline{Y}})}{\sqrt{COV({\underline{\underline{X}}},\underline{\underline{X}})COV({\underline{\underline{Y}}},{\underline{\underline{Y}}})}}$$
(2)

For R2 calculation, X and Y are the values of gRNA editing efficiency and outcome frequency. For ρ2 calculation, X and Y are the ranks of gRNA editing efficiency and outcome frequency. It can be shown that ρK takes a similar form as Spearman rank correlation when K = 1 (see Supplementary Note 4).
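Given the definitions above, RK and ρK can be computed directly; the constant factors in the covariance cancel in the ratio. A minimal numpy/SciPy sketch (the function names are ours, not from the published code):

```python
import numpy as np
from scipy.stats import rankdata

def cov_k(A, B):
    """COV of two N x K tables per Eq. (1): column-centered products,
    summed over all entries and divided by K."""
    Ac, Bc = A - A.mean(axis=0), B - B.mean(axis=0)
    return (Ac * Bc).sum() / A.shape[1]

def r_k(X, Y):
    """K-dimensional Pearson correlation R_K per Eq. (2)."""
    return cov_k(X, Y) / np.sqrt(cov_k(X, X) * cov_k(Y, Y))

def rho_k(X, Y):
    """K-dimensional Spearman rho_K: R_K computed on column-wise ranks."""
    rank = lambda A: np.apply_along_axis(rankdata, 0, A)
    return r_k(rank(X), rank(Y))
```

With K = 1 column, r_k reduces to the ordinary Pearson correlation, and rho_k is invariant under monotone transformations of either table, as expected of a rank correlation.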

Statistics and reproducibility

No statistical method was used to predetermine sample size. The detailed data processing steps are described in the “Methods” section. The datasets for model development are provided with the software package for reproducibility. All statistical analyses were performed using the SciPy library (version 1.10.1) in Python (version 3.10.6).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.