Introduction

CRISPR has been well established as a key genome-editing tool, with Cas9 as the most widely used CRISPR enzyme. Classical CRISPR-Cas9-based gene editing relies on the introduction of double-stranded DNA breaks (DSBs) guided by the small guide RNA (gRNA). Repair of the CRISPR-induced DSBs in cells leads to a variety of possible insertions or deletions (indels) in the DNA. This challenge has been addressed by engineering next-generation CRISPR editors such as base editors (BEs)1 and prime editors (PEs)2. A BE consists of a recombinant Cas9 nickase and an attached deaminase that enables the introduction of single-nucleotide mutations without double-stranded DNA cleavage3,4. Like standard Cas9, BEs make use of a gRNA whose 20-nucleotide (nt) spacer must be complementary to the DNA target site (protospacer). Here, we focus on adenine base editors (ABEs), converting A•T into G•C base pairs, and cytosine base editors (CBEs), converting C•G into T•A base pairs. These allow conversion of a single base pair, but “bystander” bases in an approximately 8nt window may be edited as well5, so a given gRNA can produce several different editing outcomes. Hence, for base editing to be widely usable, design tools are required to accurately predict both the frequencies of all outcomes and the gRNA efficiency. The prediction task thus extends the common task of predicting gRNA efficiency.

Within the last few years, tens of thousands of ABE and CBE gRNA efficiencies and corresponding outcomes have been evaluated in cells6,7,8,9,10,11,12,13. However, the availability of datasets for building machine learning prediction models is still limited, and the diverse properties of deaminases6,10,14,15 and of libraries generated by various sources pose a challenge for the use of current datasets. These datasets are further compromised by the inclusion of many low-efficiency gRNAs. Thus, generating more datasets is crucial for the development of better prediction models. Furthermore, since the datasets are not directly compatible, combining them is not straightforward, and all existing methods6,7,8,9,10,11,12,13 are each based on a single dataset, with the limitations that entails.

Here, we apply our previously established lentiviral gRNA-target pair library technology (SURRO-seq)16,17 to measure base editing efficiency on a massive scale for two base editors, ABE (ABE7.10) and CBE (BE4-Gam), in HEK293T cells. For each base editor, we assessed the editing efficiency and outcomes of approximately 11,500 gRNAs using SURRO-seq. Upon integration with published datasets covering ABE7.10, ABE8e and BE4, totaling 17,941 gRNAs for ABE and 19,010 gRNAs for CBE, we developed two integrated deep-learning models, CRISPRon-ABE and CRISPRon-CBE, for predicting gRNA editing efficiency and outcome frequency for ABE and CBE, respectively. The models are trained simultaneously on all data in a procedure that keeps track of the source of each data point. In this way, prediction for a specific base editor (dataset) benefits from the others, and predictions can be based on a weighted combination of the datasets used for training. Our models simultaneously predict gRNA efficiency and outcome frequency, and their performance is evaluated by two-dimensional Pearson and Spearman rank correlation coefficients (R218 and ρ2, see “Methods” for details). Benchmarking our models on independent test sets shows superior performance compared to existing tools. Additionally, feature analysis reveals that the predicted CRISPR-Cas9 efficiency plays an important role in base editor efficiency prediction. A webserver based on CRISPRon-ABE/CBE for predicting gRNA efficiency and outcome frequencies is available via https://rth.dk/resources/crispr/.

Results and discussion

Massive quantification of ABE and CBE editing efficiency and outcomes in cells

We first sought to generate more homogeneous data on ABE and CBE editing efficiency and outcomes in cells using the gRNA-target pair library technology developed previously16,17. We transduced HEK293T-ABE (ABE7.10) and HEK293T-CBE (BE4-Gam) cells with a 12 K gRNA-target pair library at a multiplicity of infection (MOI) of 0.3, with an estimated coverage of approximately 1000 (Fig. 1A). We harvested the transduced cells after eight days of growth in selective medium containing puromycin (selecting cells with an integrated gRNA-target pair construct) and doxycycline (inducing ABE and CBE expression). Deep amplicon sequencing was performed on surrogate target site-specific PCR products (coverage = ~2000 reads per gRNA) to capture the editing efficiencies and outcomes (Fig. 1A). After removing low-quality gRNAs (<100 reads), we obtained 11,484 gRNAs covered in the ABE set and 11,406 gRNAs covered in the CBE set (Fig. 1B and Supplementary Data 1). We also investigated the strictness and editing window (positions 3-10 in the protospacer) for ABE7.10 and BE4 using this large amount of in-cell data. As shown in Fig. 1C, ABE7.10 exhibited a highly strict adenine (A) to guanine (G) transition: 97% of all edited sites were A-to-G transitions. The BE4 cytosine (C) to thymine (T) transition was less strict (92%), and a low frequency (3.5%) of C-to-G transversions was observed for BE4. Although a Cas9 nickase was used for both ABE7.10 and BE4, a low indel frequency (2.06% for ABE7.10; 2.96% for BE4) was also observed. When analyzing the editing window, the narrower core editing window (positions 4-8 in the protospacer) exhibited high editing efficiency for both ABE7.10 and BE4 (Fig. 1D). We compared the ABE7.10 and BE4 editing efficiency with the indel frequency obtained for all these sites with Cas9 from Streptococcus pyogenes (SpCas9).
Base editing efficiency exhibited a positive correlation with SpCas9-induced indel frequency (R = 0.40 to 0.54 for ABE7.10; R = 0.29 to 0.49 for BE4) (Supplementary Fig. 1). We also performed a sequence motif analysis similar to that of Arbab et al.6, using the three nucleotides flanking the edited sites within the core editing window to explore how flanking nucleotides affect base editing efficiency. ABE7.10 and BE4 exhibit high editing activity at 5’TAC (A in bold is the edited site) and 5’TC (C in bold is the edited site), whereas ABE7.10 and BE4 fail to edit sites with 5’AAA and 5’GC, respectively (Fig. 1E). In conclusion, we generated experimental data for over 11,000 gRNAs in cells, comprising SpCas9 indel frequency, ABE (ABE7.10) base editing activity, and CBE (BE4) base editing activity.

Fig. 1: High throughput generation of ABE and CBE editing data in cells.
figure 1

A Illustration of the experimental setup. HEK293T-ABE7.10 and HEK293T-BE4 refer to doxycycline (Dox)-inducible ABE (ABE7.10) and CBE (BE4) human embryonic kidney 293T (HEK293T) cells, respectively. The lentiviral 12 K gRNA-target pair library contains 12,000 gRNAs, generated previously in our CRISPRon study17. B Histograms of base editing efficiency (% edited sites out of the total sites in the target site) for ABE7.10 (top) and BE4 (bottom). C Quantification of strictness for all editing events detected in the surrogate target sites for ABE7.10 and BE4. D Bar plot of substitution fractions for the edited site (A for ABE7.10, C for BE4) at each position of all target sites, including the flanking bases, protospacer, and PAM (protospacer adjacent motif). The distal base pair is defined as 1 (N1). E Sequence motifs for the three sites flanking the edited base for ABE7.10 (left) and BE4 (right). Logistic regression weights are shown as positively or negatively correlating with base editing efficiency. Source data are provided as a Source Data file.

Dataset aware deep learning model improves prediction of base editor efficiency and outcome frequency

In contrast to existing models, which predict only outcome frequency and infer the gRNA efficiency from it8, we develop deep learning models that predict both simultaneously from the 30-nucleotide (nt) input DNA target sequence (Supplementary Fig. 2 and Supplementary Note 1). As input features we included the 30nt DNA sequence (20nt protospacer + 3nt PAM + 4nt and 3nt flanking sequences at the 5’ and 3’ ends, respectively); the gRNA-DNA binding energy ∆GB19, previously adopted in CRISPRon; labels of the target-nucleotide editing positions for the different outcomes; and the predicted Cas9 efficiency computed by a CRISPRon version trained on gRNAs not overlapping the test set17 (Supplementary Note 2). We only consider editing outcomes within the 8nt editing window for both ABE and CBE in the construction of the prediction model (Methods). Since Cas9-induced indel frequency and base conversion efficiency are correlated (Supplementary Fig. 1), making use of the predicted Cas9 efficiency is beneficial for predicting base editing gRNA efficiency and outcome frequency. This is further supported by a SHAP analysis20 (Supplementary Note 3 and Supplementary Fig. 3) and by an ablation analysis in which features were sequentially removed before training the model (Supplementary Fig. 4).

Before constructing prediction models that can simultaneously make use of multiple datasets, we trained our model on individual datasets using this architecture (Supplementary Fig. 2). We collected datasets for two base editors (ABE7.10 and BE4) for training: the SURRO-seq data from this study, the Song Train and Test datasets (Song datasets)7 and the Arbab dataset6 for ABE and CBE, respectively. Moreover, we included two further ABE datasets (“SpCas9-ABEmax” and “SpCas9-ABE8e” from Kissling et al.12, here referred to as “Kissling ABE7.10” and “Kissling ABE8e”) (Methods). We trained our model using a 5-fold cross-validation strategy, with a 6th partition reserved for independent testing; see the detailed training process and dataset separation in the Methods and Supplementary Data 2. Since we have two outputs, gRNA efficiency and outcome frequency, we used the two-dimensional Pearson and Spearman rank correlation coefficients R218 and ρ2 introduced here (Supplementary Note 4) for the combined evaluation of gRNA editing efficiency and outcome frequency predictions. Evaluation on independent test sets illustrates that our deep learning architecture is effective for the base editor efficiency prediction problem (Supplementary Fig. 5). We also note that when training our model solely on the Song Train data, we obtained higher performance on the test set (Song Test) than their model DeepABE/CBE (Supplementary Fig. 5).
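The exact definition of the two-dimensional coefficients is given in Supplementary Note 4. As a rough, non-authoritative sketch, one plausible way to score both prediction tasks jointly is to pool the gRNA-efficiency value pairs and the outcome-frequency value pairs before computing a single correlation; the pooling strategy and the function name `two_dim_corr` are our assumptions, not taken from the paper:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def two_dim_corr(eff_true, eff_pred, out_true, out_pred):
    """Pool gRNA-efficiency and outcome-frequency (true, predicted)
    pairs and compute one Pearson and one Spearman coefficient.
    Illustrative assumption only; the published definition is in
    Supplementary Note 4 of the paper."""
    t = np.concatenate([np.asarray(eff_true), np.asarray(out_true)])
    p = np.concatenate([np.asarray(eff_pred), np.asarray(out_pred)])
    return pearsonr(t, p)[0], spearmanr(t, p)[0]
```

With perfect predictions, both pooled coefficients evaluate to 1, as expected of any sensible combined score.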

Upon inspecting the datasets (Fig. 2, Supplementary Fig. 6, Supplementary Note 5), we noticed that the Kissling ABE8e dataset predominantly exhibits high editing efficiencies between 60% and 80%. In contrast, the Kissling ABE7.10 dataset targets the same gRNA sequences but shows lower efficiencies from 30% to 50%. ABE8e was engineered from ABE7.10 by phage-assisted evolution; it contains 8 extra mutations in the TadA-7.10 domain and exhibits 3-fold to 11-fold increases in editing levels15.

Fig. 2: Variety of datasets from diverse sources.
figure 2

A, B Distributions of gRNA editing efficiency for ABE (A) and CBE (B); C Summary of diverse datasets from different studies. The column “#Bystander edits” shows the number of edited nucleotides within the editing window from a single outcome observed in experiments. Source data are provided as a Source Data file.

To train prediction models on multiple datasets simultaneously (Fig. 3A, Supplementary Fig. 7), we use the output values as provided by the respective datasets but label them individually by including a vector of size equal to the number of datasets being trained on. One position in the vector has the value 1 to indicate the dataset origin of a specific input-output example, while the other positions are 0. For the ABE models the order is: SURRO-seq, Song Train, Arbab, Kissling ABE7.10, and Kissling ABE8e. For the CBE models, the order is: SURRO-seq, Song Train, and Arbab. When using the models for prediction, we have the option of using the positions in the vector as weights (see Supplementary Table 1 for the detailed model tuning strategy). Users can assign full weight (100%) to a specific training set by setting a value of 1 at the corresponding data index, or allocate fractional weights across multiple datasets, with the total sum constrained to 1. For example, the encoding “10000” (for ABE) or “100” (for CBE) corresponds to models tuned exclusively on the SURRO-seq dataset. The encoding “22000” refers to an ABE model with equal weight (50% or 1/2) on the SURRO-seq and Song datasets, and no weight on the others. Note that a non-zero number at a given position of the labeling vector corresponds to the denominator of the fraction representing that dataset's weight. Evaluating the performance on several independent test sets reveals consistent results for fused models tuned on different datasets (Fig. 3B, C). Compared with established base editor efficiency prediction methods, our models demonstrate superior performance on independent test sets (Fig. 3B, C, Supplementary Figs. 5 and 8–12, and Supplementary Data 3). Notably, all methods yield higher performance on the Song Test set, showing that this set is for some reason “easier” than the other test sets.
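The digit encoding described above maps directly to a per-dataset weight vector: a non-zero digit d at position i assigns weight 1/d to dataset i. A minimal sketch of this decoding (the function name `decode_weights` is ours, not from the released software):

```python
def decode_weights(encoding: str) -> list:
    """Convert a dataset-labeling string such as '22000' into the
    per-dataset weight vector used at prediction time.  A non-zero
    digit d at position i assigns weight 1/d to dataset i; a zero
    assigns no weight.  The weights must sum to 1."""
    w = [0.0 if c == "0" else 1.0 / int(c) for c in encoding]
    assert abs(sum(w) - 1.0) < 1e-9, "weights must sum to 1"
    return w
```

For example, "10000" places all weight on SURRO-seq, "22000" splits the weight evenly between SURRO-seq and Song Train, and "333" spreads it evenly over the three CBE datasets.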

Fig. 3: Schematic representation of CRISPRon-ABE and CRISPRon-CBE prediction algorithm.
figure 3

A The inputs of the models are the one-hot encoded 30nt target DNA sequence, the labeling of the edited nucleotide positions within the editing window from the outcome sequence, the binding energy (ΔGB), the predicted Cas9 efficiency, and the labeling of the dataset source. For ABE models, the dataset labels follow the order: SURRO-seq, Song Train, Arbab, Kissling ABE7.10, and Kissling ABE8e. For CBE models, the order is: SURRO-seq, Song Train, and Arbab. B, C Performance comparison of CRISPRon-ABE and CRISPRon-CBE together with other existing models on independent test sets. The fused model makes use of a numerical encoding corresponding to the weight given to each dataset. Specifically, “1” represents giving 100% weight to the corresponding dataset (SURRO-seq, Song Train, Arbab, Kissling ABE7.10 or Kissling ABE8e), while “2”, “3”, and “4” represent giving 50% (1/2), 33% (1/3) or 25% (1/4) weight to the corresponding dataset. For example, the encodings “10000” and “100” refer to models tuned to SURRO-seq for ABE and CBE, respectively. The encoding “22000” refers to a model which gives equal weight (50%) to the SURRO-seq and Song datasets and no weight (0%) to the other ABE datasets. Similarly, the encoding “333” refers to equal weight (33%) on all three CBE datasets. N.a. means not available due to lack of explicit train-test separation. The two-sided Steiger’s test p-value was used to compare correlation coefficients between two models. * represents p-value < 0.05, ** represents p-value < 0.01, *** represents p-value < 0.001. The p-values for ABE7.10 evaluated by R2 (from top to bottom) are 0, 0, 0, and 1.73e-6, respectively. The p-values for ABE7.10 evaluated by ρ2 (from top to bottom) are 0, 0, 0, and 6.66e-16, respectively. The p-values for ABE8e evaluated by R2 and ρ2 are 0 and 1.06e-13, respectively. The p-values for BE4 evaluated by R2 (from top to bottom) are 0, 2.34e-6, and 0, respectively.
The p-values for BE4 evaluated by ρ2 (from top to bottom) are 0, 7.88e-3, and 0.42, respectively.

Due to the data heterogeneity and differences in base editor efficiency, our results suggest that the weighting decisions should depend on the base editor and on the platform used for editing. For instance, if the gRNA is designed for ABE8e, we suggest giving a weight of 100% to the Kissling ABE8e dataset using CRISPRon-ABE (00001). If the gRNA is designed for ABE7.10 or BE4 and the editing uses the SURRO-seq, Song, Arbab or Kissling platform, we suggest giving a weight of 100% to the corresponding dataset. If the gRNA is designed for ABE7.10 or BE4 but the platform for base editing is unclear, we suggest considering the scaffold sequence. If the scaffold is the same as the one used in the Arbab and Kissling ABE7.10 datasets, with an extended stem and a modified nucleotide21, we suggest giving weights of 50% to both the Arbab and Kissling ABE7.10 datasets for ABE (CRISPRon-ABE(00220)) or a weight of 100% to the Arbab dataset for CBE (CRISPRon-CBE(001)). Otherwise, we suggest giving weights of 50% to both the SURRO-seq and Song datasets (CRISPRon-ABE(22000) and CRISPRon-CBE(220)).

We observed that the model performance on the independent test sets declined by approximately 10% when dataset labeling was omitted during training (Supplementary Fig. 4). This observation is consistent with the datasets being diverse and supports the importance of dataset awareness when developing models using combined and diverse datasets.

The limited variety of base editors included in CRISPRon-ABE/CBE also limits the ability to construct predictors that apply broadly across the base editor field. So far, the models have been developed using the ABE7.10, ABE8e and BE4 base editors. Considering the distinct sequence motif preferences of different deaminases6,10,22, accurate prediction of editing outcomes for other base editors using CRISPRon-ABE/CBE remains a challenge. For example, newly developed base editors like ABE8.20m23 and CBE6b14 have not been incorporated into our deep-learning models, or even into the benchmarking analyses, due to the limited publicly available datasets. This emphasizes the need for, and the promise of, generating large-scale editing datasets using, e.g., SURRO-seq for these newly evolved base editing enzymes. In addition, our current model is trained only on data harvested in the HEK293T cell line. Systematic benchmarking indicates that CRISPRon-ABE/CBE achieves comparable performance across the cell lines in the Arbab dataset (Supplementary Fig. 12). Given the diverse properties across different data-generation platforms (Fig. 2 and Supplementary Fig. 6), the diversity of base editing across cell lines is not accounted for in our current model. This highlights the need for more data.

We conclude that our fused models (CRISPRon-ABE and CRISPRon-CBE) consistently demonstrate significantly improved prediction. In a design context, the dataset labeling used during training can be replaced by weights, allowing emphasis on a specific dataset, e.g., one that uses a BE closer to the one of interest to the user. Our analysis showed that both more data and data integration in the prediction model are important to further improve performance. In addition, our dataset integration strategy not only proves effective for combining datasets from multiple base editors but also holds potential for broader applications in other CRISPR technologies. For instance, integrating cell-line information when training models on prime editing datasets with the same edit types, such as short indels, but measured in different cell lines may enhance predictive accuracy. The CRISPRon-ABE/CBE models are available as stand-alone software and as a webserver via https://rth.dk/resources/crispr/.

Methods

Cell culture and doxycycline-inducible ABE/CBE stable cell lines generation

Wildtype HEK293T cells from ATCC (catalog number CRL-3216) were cultured in DMEM medium supplemented with 10% FBS and 1% penicillin-streptomycin (D10 medium) at 37 °C with 5% CO2. Routine mycoplasma testing using PCR (cat. no. PM008) yielded negative results for the cells used in this study. TRE-ABE7.10 and TRE-CBE (BE4) stable cell lines were generated using the PiggyBac transposon system. HEK293T cells were transfected with the pPB-ABE7.10-hygromycin (modified from Addgene #102919) or pPB-CBE-hygromycin (modified from Addgene #100806) vector and pCMV-hybase at a 9:1 ratio. HEK293T-ABE7.10 and HEK293T-BE4 cells were cultured in selection medium containing 50 μg/mL hygromycin.

Lentivirus transduction with the 12 K library

HEK293T-ABE7.10 and -BE4 cells were cultured in D10 medium supplemented with 50 μg/mL hygromycin throughout the experiment. The 12 K library was generated previously17. For the lentiviral 12 K CRISPR gRNA-target pair library transduction, the following steps were carried out: 1) Day −1: 2.5 × 10^6 cells per 10 cm dish were seeded in 12 dishes. One dish per group was allocated for cell number determination pre-transduction, one for drug-resistance (puromycin) control, and the remaining 10 dishes for lentivirus library transduction; 2) Day 0: Cells were counted to estimate the number per dish, determining the volume of lentivirus for transduction at an MOI of 0.3. The low MOI (0.3) ensured that most infected cells received only one copy of the lentivirus. Infected cells were cultured in D10 medium at 37 °C; 3) Day 1: Transduced cells were split into three dishes per original dish; 4) Day 2: Sub-group 1 (10 dishes) was harvested and labeled as Day 2 post-12 K library transduction, with cells pooled for genomic DNA extraction. Sub-group 2 (10 dishes) received fresh D10 medium with 50 μg/mL hygromycin + 1 μg/mL puromycin (Dox-free). Sub-group 3 (10 dishes) received D10 medium with 50 μg/mL hygromycin + 1 μg/mL puromycin + 1 μg/mL doxycycline (Dox induction of base editor expression). For the WT HEK293T cell (Group 3) screening, hygromycin but not puromycin was used in the culture medium; 5) Transduced cells were split every 2–3 days when confluence reached 90%; 6) The transduced cells were harvested for genomic DNA extraction and targeted deep sequencing after eight days of growth in the Dox selection medium.

Targeted deep sequencing

Genomic DNA from transduced cells was extracted using the phenol-chloroform method and subsequently purified. PCR amplification of the surrogate target sites was carried out using the TRAP-NGS-F1 and TRAP-NGS-R1 primers (sequences listed in Supplementary Table 2). Each PCR reaction contained 5 μg of genomic DNA as template, corresponding to approximately 7.6 × 10^5 copies of the gRNA-target pair construct and about 63-fold coverage of the 12 K library. To achieve comprehensive coverage, 32 parallel PCR reactions were conducted, resulting in approximately 2016-fold coverage of each construct. The PCR products were purified on a 1.5% gel and pooled in equal amounts. The amplicon libraries were then subjected to paired-end 150 bp deep sequencing on the MGISEQ-2000 platform (MGI of BGI, China).

Raw data processing

The raw data were sequenced using the MGISEQ-2000 platform (G400), generating PE150 paired-end reads. FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/, version: 0.11.3) and fastp (https://github.com/OpenGene/fastp, version: 0.19.6)24 were used for data quality control and filtering with the default parameters. The paired-end data were assembled using flash (http://www.cbcb.umd.edu/software/flash, version: 1.2.11)25. BWA-MEM (version: 0.7.17)26 was used to map the assembled data to the designed oligo sequences to preliminarily distinguish the data of each synthetic gRNA-target pair surrogate site. The oligo sequences have been reported in the previous CRISPRon study and are available in Supplementary Data 117.

Data filtering

The pysam module of Python 3.8 was used to split the aligned data according to the site number of the chip, yielding the reads for each site. We then filtered the data of each site using strictly controlled parameters. First, according to the structure of the chip, g + gRNA (20 bp) + scaffold (82 bp) + GTTT had to remain unchanged at the beginning and end of each site. Second, to remove chip synthesis errors, the pseudo-editing sequences found in WT HEK293T cells transduced with the 12 K SURRO-seq library were removed from both the ABE and CBE groups. These filtering steps were implemented in Julia 1.5.3.

Calculation of base editing efficiency

To improve suitability for machine/deep learning, the SURRO-seq dataset was further processed, and base editing efficiency and outcome editing efficiency were calculated separately. Base editing efficiency refers to the fraction of edited sites: for each site, the frequency of mutation (number of mutated reads/total number of reads) was calculated, including the frequency of indels, although these occurred at a lower percentage. To mitigate the impact of enrichment and depletion, we kept only the subset of the dataset with a total read count greater than or equal to 100, resulting in 11,484 gRNAs for ABE and 11,406 gRNAs for CBE. The outcomes for all edited results were calculated by masking indel loci as wild type. The frequency (number of mutated reads per outcome/total number of reads) and proportion (number of mutated reads per outcome/number of reads in which mutations occurred) were then calculated separately.
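The per-site quantities above can be sketched as follows; the read counts, outcome names, and function name are illustrative, not taken from the study:

```python
def editing_stats(total_reads, outcome_reads):
    """Per-target editing statistics as described above.
    total_reads   -- total reads covering the surrogate target site
    outcome_reads -- dict mapping each edited outcome to its read count
    Returns the overall efficiency, the per-outcome frequency (reads
    per outcome / total reads) and the per-outcome proportion (reads
    per outcome / edited reads)."""
    edited = sum(outcome_reads.values())
    efficiency = edited / total_reads
    frequency = {o: n / total_reads for o, n in outcome_reads.items()}
    proportion = {o: n / edited for o, n in outcome_reads.items()}
    return efficiency, frequency, proportion
```

For a site with 1000 reads of which 150 support one outcome and 50 another, the efficiency is 0.2, the first outcome's frequency is 0.15, and its proportion among edited reads is 0.75.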

Sequence context analysis for base editing efficiency

To explore the impact of proximal sites on the editing efficiency of base editors, we constructed sequence motifs for both ABE and CBE using our SURRO-seq dataset. We employed a logistic regression model to predict gRNA editing efficiency and generated sequence logos based on the obtained weights. Initially, we selected sequences containing the target nucleotides (“A” for ABE and “C” for CBE) along with their flanking nucleotides (±3nt). These sequences were encoded using a one-hot encoding scheme. The gRNA editing efficiency was coded as 0 or 1 to facilitate binary classification: if the editing efficiency exceeded the mean value at a given position, the target value was 1; otherwise, it was 0. We then fitted the logistic regression model with 5-fold cross-validation, keeping highly similar target sequences (≤4 mismatches for a 30nt target sequence) in a single fold. Model performance was assessed on the equally sized 6th partition reserved for independent testing, using the Pearson correlation coefficient between actual and predicted efficiency values. Additionally, we extracted the weights from the logistic regression model to visualize the effect of sequence context on base editor efficiency. Given the high efficiency observed at positions 4 to 8 within the spacer, we focused our motif analysis on target sequences with target nucleotides located within these positions.
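The motif-weight idea can be sketched as follows. This uses synthetic 7nt contexts and a fabricated label rule (a 5'T preceding the edited base) purely for illustration; the real analysis fits the same kind of model to binarized SURRO-seq efficiencies:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

BASES = "ACGT"

def one_hot(seq):
    """Flattened one-hot encoding of a nucleotide sequence."""
    return [1.0 if base == seq[pos] else 0.0
            for pos in range(len(seq)) for base in BASES]

# Synthetic 7nt contexts (edited base at the centre, +/-3nt flanks);
# illustration only -- the real analysis uses SURRO-seq data.
rng = np.random.default_rng(0)
contexts = ["".join(rng.choice(list(BASES), 7)) for _ in range(400)]
# Fabricated label rule emulating the observed 5'T preference:
# class 1 ("above the positional mean") when a T precedes the edit.
labels = [1 if c[2] == "T" else 0 for c in contexts]

X = np.array([one_hot(c) for c in contexts])
model = LogisticRegression(max_iter=1000).fit(X, labels)

# Reshape weights to (position, base) for a sequence-logo style view.
weights = model.coef_.reshape(7, 4)
```

Positive weights at a (position, base) cell then correspond to the positively correlating letters in the Fig. 1E logos, and negative weights to the disfavored ones.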

Data collection and pre-processing for machine learning

In addition to our SURRO-seq dataset, we collected datasets for both ABE and CBE: Song Train/Test from Song et al.7, the Arbab dataset from Arbab et al.6 (although there is no information about which part of the data was used for their training and testing) and Marquart Test from Marquart et al.8. We also collected two ABE datasets (Kissling ABE7.10 and ABE8e) from Kissling et al.12; again, no detailed information was provided regarding which data were used for training and testing in the original study. In total, three base editors are involved in this study: ABE7.10 (SURRO-seq, Song, Arbab, Marquart and Kissling ABE7.10 datasets), ABE8e (Kissling ABE8e dataset) and BE4 (SURRO-seq, Song, Arbab and Marquart datasets). In addition, we collected not only the data evaluated in HEK293T cells but also the Arbab datasets from other cell lines, mESC (Arbab mESC Test) and U2OS (Arbab U2OS Test), to explore model performance across cell lines. Further details about the datasets are given in Supplementary Table 3.

The datasets were processed as follows: 1) Removal of gRNAs without an “NGG” PAM at the DNA target site; 2) Removal of gRNAs for which the corresponding target sites are supported by fewer than 100 total reads; 3) Removal of gRNAs with over 100,000 total reads, as their read counts far exceed those of other gRNAs; 4) Removal of gRNAs not containing the nucleotide A (ABE) or C (CBE) within the editing window (positions 3 to 10); 5) Exclusive retention of outcomes with intended target nucleotide transitions (A > G for ABE, C > T for CBE) within the editing window, considering that only few reads supported transitions to unintended nucleotides, such as A > Y for ABE and C > R for CBE, and that the transitions occur mainly within the editing window (Fig. 1C, D). The outcomes from one target DNA sequence were further grouped according to the positions of the transitioned nucleotides in the editing window, and the frequencies within each group were summed; 6) Generation of outcomes at target sites with the potential to be edited but without any experimental read support, by setting their outcome frequency to 0; 7) Removal of gRNAs for which the corresponding target site is supported by fewer than 100 edited reads, as in previous studies6,7. As we observed in the Song datasets that the summed read counts of the outcomes edited by a specific gRNA can exceed the total read count reported for that gRNA, we recalculated the outcome frequencies accordingly. See Supplementary Table 4 for more details about dataset processing and Supplementary Table 5 for the general composition of these datasets.
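The grouping in step 5 can be sketched as follows; the sequences and the function name `group_outcomes` are illustrative, not from the released pipeline:

```python
from collections import defaultdict

def group_outcomes(target, outcomes, window=(3, 10)):
    """Group edited outcome sequences of one target by the set of
    transitioned positions inside the editing window and sum their
    frequencies (processing step 5 above).  target and the outcome
    sequences are protospacer strings; positions are 1-based."""
    start, end = window
    grouped = defaultdict(float)
    for seq, freq in outcomes:
        edited = tuple(i for i in range(start, end + 1)
                       if seq[i - 1] != target[i - 1])
        grouped[edited] += freq
    return dict(grouped)
```

Two outcome sequences that differ only outside the editing window thus fall into the same group and their frequencies are summed.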

Each target sequence is 30nt long, encompassing 4nt upstream, the 20nt target DNA protospacer (corresponding to the spacer), the 3nt PAM and 3nt downstream of the PAM. To develop our model, we combined our SURRO-seq dataset, Song Train, the Arbab dataset, and the Kissling ABE7.10 and ABE8e datasets (train and/or test set), while Song Test, Marquart Test, and the Arbab mESC and U2OS Tests were used only for model evaluation (test set). The combined dataset consists of 17,941 unique 30nt target DNA sequences for ABE and 19,010 unique sequences for CBE. The overlap of identical 30nt sequences between our SURRO-seq dataset and Song Train was modest (20 gRNAs for ABE and 22 for CBE); these were retained in both dataset sources. Since the gRNAs in the two Kissling datasets derive from the same library, 932 gRNAs remained in both datasets after the above filtering steps; all were retained. See Supplementary Fig. 13 for details.

Generation of target DNA and outcome features

We processed the 30nt target DNA sequences and their corresponding edited outcome sequences by the following steps: 1) the 30nt target DNA sequences were one-hot encoded; 2) outcome sequences in the editing window were one-hot encoded/labeled (edited or not) based on the positions of the target nucleotide transitions in the editing window. Therefore, each outcome sequence was represented by one vector with length equal to that of the editing window. The value in the vector was set to 1 if the nucleotide was edited, and 0 otherwise. 3) The spacer RNA-target DNA binding energy ΔGB was calculated using the CRISPRoff pipeline 1.1.219 supplied with RNAfold 2.5.127. 4) CRISPR/Cas9 indel frequencies were predicted by CRISPRon17 (see details in Supplementary Note 2). 5) The origin of each training set was encoded in a “labeling” vector of size 5 for ABE (SURRO-seq dataset, Song Train, Arbab dataset, Kissling ABE7.10 dataset, Kissling ABE8e dataset) and 3 for CBE (SURRO-seq dataset, Song Train, Arbab dataset), as detailed in Supplementary Table 1.
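Feature steps 1 and 2 can be sketched as follows; the example sequences and the function name `encode_features` are illustrative assumptions:

```python
import numpy as np

def encode_features(target30, outcome_window, window=(3, 10)):
    """One-hot encode a 30nt target and label edited positions in the
    8nt editing window (feature steps 1-2 above).  outcome_window is
    the outcome sequence restricted to the editing window; the other
    features (binding energy, predicted Cas9 efficiency, dataset
    label) are appended separately."""
    bases = "ACGT"
    onehot = np.zeros((30, 4))
    for i, b in enumerate(target30):
        onehot[i, bases.index(b)] = 1.0
    start, end = window
    # window positions are 1-based within the 20nt protospacer, which
    # begins after the 4nt 5' flank of the 30nt target sequence
    proto = target30[4:24]
    label = np.array([1.0 if outcome_window[i] != proto[start - 1 + i]
                      else 0.0 for i in range(end - start + 1)])
    return onehot, label
```

Each outcome is thus represented by an 8-element 0/1 vector over the editing window, alongside the 30x4 one-hot target matrix.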

Generation of dataset partitions

The dataset was split into six partitions of approximately equal size (within a count of one gRNA) for model development, as previously done for CRISPRon17, resulting in five partitions for cross-validation and a 6th partition for testing. The procedure took the sequence similarity between 30nt target DNA sequences into account by assigning highly similar ones to the same partition, while dissimilar ones were randomly assigned to different partitions. In addition, we considered that part of the targets were included in the training of CRISPRon17, which is applied for Cas9-gRNA efficiency prediction, and that none of the data points in the 6th partition (test set) should belong to the CRISPRon training set, to prevent overestimation (Supplementary Note 2). This was done as follows (see the detailed workflow in Supplementary Fig. 14): 1) The pairwise Hamming distance between all 30nt target DNA sequences was calculated using the scikit-learn pairwise_distances function28, and highly similar sequences (≤4 mismatches for a 30nt target sequence) were extracted, as for CRISPRon17; 2) The 30nt target DNA sequences employed in CRISPRon model development across the six validation sets were assigned to the same partitions as in CRISPRon17; 3) 30nt target sequences highly similar to those assigned in step 2 were put in the same partition as their similar counterparts; 4) The remaining similar DNA targets were grouped and each group assigned to one partition at random; 5) The remaining dissimilar targets from different sources were randomly distributed across partitions. The test dataset from Song et al.7, provided by the authors, was assigned to the 6th partition for subsequent benchmark analysis, while all the 30nt sequences from Song Train were divided among the five training partitions. Sequences from Song Train which were assigned to the test partition of CRISPRon were not allocated in step 2 but were randomly distributed among the five training partitions in step 5.
Additionally, DNA targets involving the same variant loci in the Kissling dataset were also grouped as highly similar, as suggested by the authors12, and were assigned to partitions in steps 3 and 4. After these assignment steps, the 30nt target DNA sequences and their potential edited outcomes from the diverse sources were distributed across the six partitions with no highly similar sequences shared between partitions, and all data not used for CRISPRon training were kept in the 6th partition, keeping the test set fully independent for further benchmark analysis (Supplementary Table 6).
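The similarity grouping in step 1 can be sketched as follows. This is a minimal illustration, not the actual pipeline code: the sequences are invented, and the integer encoding of bases is an assumption made so that the Hamming metric counts mismatched positions.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical 30nt target sequences (illustrative only, not study data).
targets = [
    "ACGTACGTACGTACGTACGTACGTACGTAC",
    "ACGTACGTACGTACGTACGTACGTACGTAG",  # 1 mismatch vs. the first
    "TTTTGGGGCCCCAAAATTTTGGGGCCCCAA",
]

# Integer-encode the bases so the Hamming metric compares positions.
enc = np.array([["ACGT".index(b) for b in s] for s in targets])

# pairwise_distances with metric="hamming" returns the *fraction* of
# differing positions; multiply by the sequence length (30) to obtain
# mismatch counts.
dist = pairwise_distances(enc, metric="hamming") * enc.shape[1]

# Targets within <=4 mismatches are treated as highly similar and must
# be assigned to the same partition.
similar = dist <= 4
```

In practice the similar pairs would then be merged into connected groups before the group-wise random assignment of step 4.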

Test sets for the evaluation and comparison of models

Test sets from the various dataset sources were collected: SURRO-seq Test (the 6th partition of the SURRO-seq dataset), Song Test, Arbab Test (the 6th partition of the Arbab dataset), Kissling ABE7.10 Test (the 6th partition of the Kissling ABE7.10 dataset, ABE only), Kissling ABE8e Test (the 6th partition of the Kissling ABE8e dataset, ABE only), Marquart Test, and Arbab mESC and U2OS Test (see Generation of dataset partitions). To create fully independent test sets, each test set was processed as follows: 1) The pairwise Hamming distance between the 20nt gRNAs of the test sets and the training sets used by any method was calculated with the cdist function in SciPy29; 2) gRNAs within a Hamming distance of ≤3 mismatches (over the 20nt gRNA) of any training set were removed, as was done for CRISPRon17; 3) The 30nt target DNA sequences used to train the final version of CRISPRon were removed. The test sets were thereby reduced to the following numbers of gRNAs: SURRO-seq Test: 868 (ABE), 1,297 (CBE); Song Test: 430 (ABE), 474 (CBE); Arbab Test: 1,096 (ABE), 1,180 (CBE); Marquart Test: 1,482 (ABE), 1,489 (CBE); Kissling ABE7.10 Test: 144 (ABE); Kissling ABE8e Test: 152 (ABE); Arbab mESC Test: 2,554 (ABE), 1,481 (CBE); Arbab U2OS Test: 361 (ABE), 289 (CBE). In addition, we extracted the outcomes with a frequency greater than 5% in each test set for benchmark analysis, as these outcomes are the most critical for practical application; these subsets are called “SURRO-seq Test*”, “Song Test*”, “Arbab Test*”, “Marquart Test*”, “Kissling ABE7.10 Test*” and “Kissling ABE8e Test*”, respectively. To evaluate performance on a dataset including the large number of low-efficiency gRNAs present in the unprocessed datasets, we built a test set containing the initially discarded gRNAs (edited reads <100), resulting in “SURRO-seq Test with low efficient gRNAs”: 2,326 (ABE), 2,165 (CBE).
Correspondingly, by removing low-efficiency gRNAs (<100 edited reads) from the Song Test, we obtained “Song Test without low efficient gRNAs”: 383 (ABE), 338 (CBE).
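Steps 1 and 2 of this filtering can be sketched as follows; the spacers are invented for illustration, and the integer encoding of bases is an assumption so that the Hamming metric compares positions.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical 20nt spacers (illustrative only).
test_grnas  = ["ACGTACGTACGTACGTACGT", "GGGGCCCCAAAATTTTGGGG"]
train_grnas = ["ACGTACGTACGTACGTACGA"]  # 1 mismatch vs. the first test gRNA

t  = np.array([["ACGT".index(b) for b in s] for s in test_grnas])
tr = np.array([["ACGT".index(b) for b in s] for s in train_grnas])

# cdist's Hamming metric is the fraction of differing positions;
# multiply by the spacer length (20) to obtain mismatch counts.
mismatches = cdist(t, tr, metric="hamming") * t.shape[1]

# Remove any test gRNA within 3 mismatches of any training gRNA.
keep = (mismatches > 3).all(axis=1)
filtered = [g for g, k in zip(test_grnas, keep) if k]
```

Here the first test gRNA is dropped (1 mismatch to a training gRNA) while the second is retained.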

Development of deep learning models designed for base editors

Deep learning models were developed for ABE and CBE using Keras/Tensorflow30 2.10.0 and Python 3.10. The models were designed to simultaneously predict gRNA editing efficiency and outcome frequency from a 30nt target DNA sequence and an edited outcome. The model requires four inputs: the 30nt target DNA sequence, the 8nt labeling of editable positions, and two molecular features: ΔGB, a powerful feature for Cas9-gRNA efficiency prediction17, and the predicted Cas9-gRNA efficiency, which is positively correlated with base editor efficiency7 (Supplementary Fig. 1). The one-hot encoded 30nt target DNA sequences are fed into three convolutional neural networks with filter sizes of 3, 5, and 7. The outputs from these three filters are flattened and then further passed through one fully connected layer. The one-hot encoded outcome vector representing the edits of target nucleotides in the editing window is concatenated with the molecular features and outputs from the three convolutions. Subsequently, this combined input is fed into two sequential fully connected layers before computing a 2D output: the gRNA editing efficiency and the outcome frequency (Supplementary Note 1 and Supplementary Fig. 2).
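The described architecture can be sketched with the Keras functional API. This is a minimal sketch under stated assumptions: the layer widths, the 8nt×4 outcome encoding size, and the activation functions are illustrative choices, not the published hyper-parameters (those are in Supplementary Data 2).

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Four inputs: 30nt target one-hot, outcome encoding for the 8nt window,
# and two molecular features (delta-G_B and predicted Cas9-gRNA efficiency).
seq_in  = keras.Input(shape=(30, 4), name="target_onehot")
out_in  = keras.Input(shape=(8 * 4,), name="outcome_onehot")   # assumed size
feat_in = keras.Input(shape=(2,), name="molecular_features")

# Three parallel convolutions with filter sizes 3, 5 and 7.
convs = [layers.Flatten()(layers.Conv1D(64, k, activation="relu")(seq_in))
         for k in (3, 5, 7)]

# Flattened conv outputs pass through one fully connected layer.
h = layers.Dense(128, activation="relu")(layers.Concatenate()(convs))
h = layers.Dropout(0.1)(h)

# Concatenate the outcome encoding and molecular features, then apply
# two sequential fully connected layers.
h = layers.Concatenate()([h, out_in, feat_in])
h = layers.Dense(64, activation="relu")(h)
h = layers.Dropout(0.1)(h)
h = layers.Dense(32, activation="relu")(h)

# 2D output: gRNA editing efficiency and outcome frequency.
out = layers.Dense(2, name="efficiency_and_frequency")(h)

model = keras.Model([seq_in, out_in, feat_in], out)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4), loss="mse")
```

A forward pass on dummy inputs of the expected shapes returns the two predicted values per outcome.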

To enhance generalization performance across diverse datasets, our SURRO-seq dataset was combined with the published Song Train, Arbab, Kissling ABE7.10 (ABE only) and Kissling ABE8e (ABE only) datasets for model development. A single fused model was trained on the combined dataset while keeping the original values of gRNA editing efficiency and outcome frequency. Because of variation between datasets (i.e., the same gRNA may exhibit different efficiencies across datasets), we encoded the dataset of origin as a feature (Supplementary Table 1) and introduced this feature alongside the others in the model. The complete model architecture is outlined in Fig. 3A, with details in Supplementary Fig. 7.

Models were trained using 5-fold cross-validation on the five partitioned datasets, with the 6th partition held out for testing. All models were trained and evaluated using the mean squared error (MSE). During training, the model ran for up to 2000 epochs, with weights updated in batches of 16 samples and a learning rate of 0.0001 using Adam optimization. Training was stopped when the MSE on the validation set did not improve for 100 consecutive epochs, and the best model by validation MSE was saved. To prevent overfitting, a dropout rate of 0.1 was applied in each layer. The hyper-parameters were tuned to optimize the deep learning models: inspired by the network of DeepBE10, the number of filters in the convolutional neural networks and the number of nodes in the fully connected layers were increased (see Supplementary Data 2 for detailed hyper-parameter optimization). During hyper-parameter optimization, each training run in the 5-fold cross-validation was repeated four times with different random seeds, and the best of the four models was chosen for each validation fold. The final MSE for model selection was calculated as the average MSE of the five best models, weighted by the number of outcomes in each validation fold. For the final model, each training run in the 5-fold cross-validation was repeated 10 times with different random seeds for robustness. The final output was the average of the outputs of the five best models, one chosen per fold.
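The model-selection rule (best-of-seeds per fold, then an outcome-count-weighted average across folds) can be illustrated with invented numbers; the MSE values and fold sizes below are purely hypothetical.

```python
import numpy as np

# Validation MSE for each seed repeat in each fold (invented values):
# rows = folds, columns = 4 seed repeats during hyper-parameter tuning.
fold_mses = [
    [0.12, 0.10, 0.11, 0.13],  # fold 1
    [0.09, 0.08, 0.10, 0.09],  # fold 2
    [0.11, 0.12, 0.10, 0.11],  # fold 3 (three folds shown for brevity)
]
# Number of outcomes in each validation fold (invented).
fold_sizes = np.array([1000, 1200, 900])

# Keep the best (lowest-MSE) model from the seed repeats of each fold.
best_per_fold = np.array([min(m) for m in fold_mses])

# Weight the per-fold best MSEs by fold size to get the selection MSE.
selection_mse = np.average(best_per_fold, weights=fold_sizes)
```

The same best-of-seeds logic, with 10 repeats, selects the five models whose averaged outputs form the final prediction.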

Benchmark analysis of base editor prediction methods

CRISPRon-ABE and CRISPRon-CBE were benchmarked against existing base editor methods, including DeepABE/CBE7, BE-HIVE6, BE-DICT8, BE_Endo11, BEDICT2.012 (ABE only) and DeepBE10 (ABE8e only), for the prediction of gRNA editing efficiency and outcome frequency on the independent test sets. For BE-DICT8 and BEDICT2.012, the outcome frequency with target nucleotide transitions in the editing window (positions 3 to 10 of the 20nt gRNA spacer) was predicted, as suggested by the authors. Since all test sets consist of genome-integrated target sequences, we only compared against the BE_Endo11 model trained using sequence information. All the above models were downloaded from their GitHub repositories.

Joint evaluation of gRNA editing efficiency and outcome frequency

To evaluate the performance of the base editor efficiency predictions, we calculated an extended K-dimensional Pearson correlation coefficient, RK (K = 2)18, and an extended K-dimensional Spearman rank correlation coefficient, ρK (K = 2), between the ground truth and the predictions. These metrics allow us to jointly evaluate the performance in predicting gRNA editing efficiency and outcome frequency. The formulas for RK and ρK are as follows (see Supplementary Note 4 for details):

We consider two tables X (e.g. ground truth) and Y (e.g. prediction) with N rows (number of outcomes) and K columns (here K = 2, gRNA editing efficiency and outcome frequency). The covariance between X and Y was defined in ref.18 as:

$$COV(\underline{\underline{X}},\underline{\underline{Y}})={\sum }_{k\,=1}^{K}{w}_{k}COV({\underline{\underline{X}}}_{k},{\underline{\underline{Y}}}_{k})=\frac{1}{K}{\sum }_{n\,=\,1}^{N}{\sum }_{k\,=\,1}^{K}({X}_{nk}-{\bar{X}}_{k})({Y}_{nk}-{\bar{Y}}_{k}),$$
(1)

where \({\bar{X}}_{k}=\frac{1}{N}\mathop{\sum }_{n=1}^{N}{X}_{nk}\), \({\bar{Y}}_{k}=\frac{1}{N}\mathop{\sum }_{n=1}^{N}{Y}_{nk}\) and \({w}_{k}=\frac{1}{K}=\frac{1}{2}\).

The correlation coefficient is defined as

$${R}_{K}=\frac{COV({\underline{\underline{X}}},\underline{\underline{Y}})}{\sqrt{COV({\underline{\underline{X}}},\underline{\underline{X}})COV({\underline{\underline{Y}}},{\underline{\underline{Y}}})}}$$
(2)

For R2 calculation, X and Y are the values of gRNA editing efficiency and outcome frequency. For ρ2 calculation, X and Y are the ranks of gRNA editing efficiency and outcome frequency. It can be shown that ρK takes a similar form as Spearman rank correlation when K = 1 (see Supplementary Note 4).
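Given the definitions above, RK and ρK can be computed directly; the constant factors in the covariance cancel in the ratio. A minimal numpy/SciPy sketch (the function names are ours, not from the published code):

```python
import numpy as np
from scipy.stats import rankdata

def cov_k(A, B):
    """COV of two N x K tables per Eq. (1): column-centered products,
    summed over all entries and divided by K."""
    Ac, Bc = A - A.mean(axis=0), B - B.mean(axis=0)
    return (Ac * Bc).sum() / A.shape[1]

def r_k(X, Y):
    """K-dimensional Pearson correlation R_K per Eq. (2)."""
    return cov_k(X, Y) / np.sqrt(cov_k(X, X) * cov_k(Y, Y))

def rho_k(X, Y):
    """K-dimensional Spearman rho_K: R_K computed on column-wise ranks."""
    rank = lambda A: np.apply_along_axis(rankdata, 0, A)
    return r_k(rank(X), rank(Y))
```

With K = 1 column, r_k reduces to the ordinary Pearson correlation, and rho_k is invariant under monotone transformations of either table, as expected of a rank correlation.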

Statistics and reproducibility

No statistical method was used to predetermine sample size. The detailed data processing steps are described in the “Methods” section. The datasets for model development are provided with the software package for reproducibility. All statistical analyses were performed using the SciPy library (version 1.10.1) in Python (version 3.10.6).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.