CN111613267A - A CRISPR/Cas9 off-target prediction method based on attention mechanism

Info

Publication number: CN111613267A
Application number: CN202010433848.9A
Authority: CN
Inventors: 曾甜; 戴宪华
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2020-05-21
Filing date: 2020-05-21
Publication date: 2020-09-01

Abstract

The invention relates to the field of gene editing and provides a CRISPR/Cas9 off-target prediction method based on an attention mechanism. The method comprises the following steps: constructing an off-target data set, encoding sgRNA-DNA sequence pairs, building an off-target prediction model, training the model weights, and using the trained model to predict whether an sgRNA-DNA sequence pair contains an off-target site. By analyzing how the composition and matching of an sgRNA-DNA sequence pair relate to the off-target problem, the invention provides a novel attention-based CRISPR/Cas9 off-target prediction model. The model uses a three-layer convolutional neural network as its base framework and introduces an attention mechanism to extract position-related features from the sgRNA-DNA sequence pair. The model can extract rich features related to sgRNA-DNA off-target activity, predicts CRISPR/Cas9 off-target effects more accurately and comprehensively than existing prediction algorithms, and generalizes better.

Description

A CRISPR/Cas9 off-target prediction method based on attention mechanism

Technical Field

The invention relates to the fields of gene editing and deep learning, and in particular to a CRISPR/Cas9 off-target prediction method based on an attention mechanism.

Background

Genome editing mediated by the CRISPR/Cas9 (clustered regularly interspaced short palindromic repeats/CRISPR-associated protein 9) system is the third generation of site-directed genome editing technology and can be used to modify specific genomic DNA targets. Compared with the two earlier generations, zinc-finger nucleases (ZFNs) and transcription activator-like effector nucleases (TALENs), CRISPR/Cas9 has a wide range of applications, low cost, and is easy to operate. Its applications mainly include targeted mutagenesis of model organisms, gene knockout or knock-in to verify gene function, epigenome editing, and delivery of base-editing enzymes to target sites. The technology also has great potential for in vivo gene therapy. However, because current in vivo delivery methods for CRISPR/Cas9 are not tissue-specific, CRISPR/Cas9 may edit genes in non-target tissues and cause on-target side effects, so gene therapy applications of CRISPR/Cas9 still face a series of safety challenges. Accurate prediction of CRISPR/Cas9 off-target effects is therefore of great significance for its application.

Many methods already exist for predicting CRISPR/Cas9 off-target effects. They fall mainly into sequence-alignment-based methods, methods based on empirical hypotheses, sequencing methods, and machine-learning-based algorithms; of these, the empirical methods and the machine-learning-based methods give the best predictive performance. In recent years deep learning has also gradually been applied to CRISPR/Cas9 off-target prediction with some success, but prediction accuracy remains low. Because deep learning can automatically extract features from sgRNA-DNA sequence pairs, is relatively simple to operate, and leaves considerable room for algorithmic improvement, methods that apply deep learning to improve CRISPR/Cas9 off-target prediction merit further study.

Summary of the Invention

The purpose of the invention is to overcome the shortcomings of existing CRISPR/Cas9 off-target prediction methods by proposing a CRISPR/Cas9 off-target prediction method based on an attention mechanism. Because the input of an off-target prediction method is an sgRNA-DNA sequence pair, the algorithm judges whether off-target activity will occur mainly by attending to how the bases of the sgRNA sequence and the DNA sequence match at each position. This judgment is closely tied to position information, and an attention mechanism can effectively promote the extraction of local position features within the sequence pair. The method therefore takes a convolutional neural network as its base framework and introduces an attention mechanism to strengthen the extraction of position features from sgRNA-DNA sequence pairs. This can effectively improve the accuracy and recall of off-target prediction and gives high generalization performance, which is of great significance for the safe application of CRISPR/Cas9.

To achieve the above purpose, the technical solution provided by the invention is as follows.

A CRISPR/Cas9 off-target prediction method based on an attention mechanism mainly includes the following steps:

1) Construct a data set for model training and testing that contains sgRNAs, DNA sequences, and their off-target labels, including the two published data sets HEK293T and K562t;

2) Encode the sgRNA-DNA sequence pairs in the sample data set with a specific encoding method so that they can be used as input to a neural network;

3) Build a convolutional neural network model based on an attention mechanism, comprising one-dimensional convolutional layers and an independent attention module;

4) Select an optimizer and a loss function, set the optimizer parameters, the learning rate, and the number of iterations, and train the model of step 3) using bootstrap sampling to obtain the weights of the trained model;

5) Input the sgRNA-DNA sequence pair to be tested into the model trained in step 4) to obtain the off-target prediction value for that sequence pair.

Further, the data sets in step 1) are composed as follows:

In HEK293T, the numbers of sgRNA-DNA sequence pairs with off-target sites (positive samples) and without off-target sites (negative samples) are 536 and 132,378, respectively; in the K562t data set there are 120 positive samples and 20,199 negative samples. The two data sets contain 153,233 samples in total, with a positive-to-negative ratio of approximately 1:233. Positive samples are labeled 1 and negative samples are labeled 0.
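The totals quoted above can be checked with a few lines of arithmetic. The snippet below is only a sanity check on the stated counts; the dictionary layout and variable names are illustrative and do not come from the patent.

```python
# Sample counts as stated in the text.
counts = {
    "HEK293T": {"positive": 536, "negative": 132_378},
    "K562t": {"positive": 120, "negative": 20_199},
}

positives = sum(c["positive"] for c in counts.values())
negatives = sum(c["negative"] for c in counts.values())
total = positives + negatives

# Negatives per positive, rounded to the nearest integer.
ratio = round(negatives / positives)

print(total)         # 153233
print(f"1:{ratio}")  # 1:233
```

The imbalance of roughly 233 negatives per positive is what motivates the balanced sampling used during training.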

Further, step 2) comprises the following steps:

21) Define a base-pair dictionary that maps each of the 16 base pairs formed by pairwise combination of the four bases A, T, C, and G to a numerical value, specifically: {AA:0, AT:1, AC:2, AG:3, TA:4, TT:5, TC:6, TG:7, CA:8, CT:9, CC:10, CG:11, GA:12, GT:13, GC:14, GG:15};

22) Look up in the dictionary of step 21) the value corresponding to the nucleotide pair at each position of the sgRNA-DNA sequence pair (23 bp), completing the numerical encoding of the sequence pair;

23) Apply word-vector encoding to the encoded sequence $x^{1\times 23}$ obtained in step 22) using the Embedding layer of the Keras framework. The Embedding layer is in effect a neural network with one hidden layer. Denoting the hidden-layer vector by $h^{1\times N}$, the encoding is:

$$x^{1\times 23} \cdot W^{23\times N} = h^{1\times N}$$

where $W^{23\times N}$ is the weight matrix between the input layer and the hidden layer and $N$ is a user-defined encoding size; the final word-vector encoding result is $W^{23\times N}$.
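Steps 21) to 23) can be sketched in plain Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the example sequences, and the embedding size of 8 are hypothetical, and the embedding is modeled as a per-position row lookup in a randomly initialized table, which is how an embedding layer such as Keras `Embedding` behaves before training.

```python
import random

BASES = "ATCG"  # ordering used by the dictionary in step 21)

# 16 base pairs -> integer codes 0..15 (AA:0, AT:1, ..., CG:11, ..., GG:15).
PAIR_TO_INDEX = {a + b: 4 * i + j
                 for i, a in enumerate(BASES)
                 for j, b in enumerate(BASES)}


def encode_pair(sgrna: str, dna: str) -> list[int]:
    """Step 22): map each aligned base pair to its dictionary value."""
    if len(sgrna) != len(dna):
        raise ValueError("sgRNA and DNA sequences must have equal length")
    return [PAIR_TO_INDEX[s + d] for s, d in zip(sgrna.upper(), dna.upper())]


def make_embedding(n_dims: int, seed: int = 0) -> list[list[float]]:
    """Hypothetical stand-in for a trainable embedding table: one row per code."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(n_dims)]
            for _ in range(len(PAIR_TO_INDEX))]


def embed(codes: list[int], table: list[list[float]]) -> list[list[float]]:
    """Step 23), sketched as a lookup: each code selects one row of the table."""
    return [table[c] for c in codes]


# Illustrative 23-nt sequences (not taken from the patent); one mismatch at index 6.
sgrna = "GAGTCCGAGCAGAAGAAGAAGGG"
dna = "GAGTCCTAGCAGAAGAAGAAGGG"

codes = encode_pair(sgrna, dna)
vectors = embed(codes, make_embedding(n_dims=8))
print(codes)
print(len(vectors), len(vectors[0]))
```

In this sketch a matched position such as G/G maps to 15, while the mismatched position G/T maps to 13, so the integer code carries both the bases and whether they agree.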

Further, step 3) comprises the following steps:

31) Apply a one-dimensional convolution with kernel size 5 and 20 kernels to the encoded sequence obtained in step 23) to obtain layer C1;

Normalize layer C1 to obtain layer B1;

Apply a one-dimensional convolution with kernel size 5 and 40 kernels to layer B1 to obtain layer C2;

Normalize layer C2 to obtain layer B2;

Apply a one-dimensional convolution with kernel size 5 and 80 kernels to layer B2 to obtain layer C3;

Normalize layer C3 to obtain layer B3;

Apply a linear scale transformation to layer B1 so that its size matches that of layer B3, obtaining layer C11;

32) Input layers C11 and C3 obtained in step 31) into the attention module to strengthen the extraction of local position features within the sgRNA-DNA sequence pair, obtaining the output of the attention module, layer A1;

33) Pass the feature map of layer A1 from step 32) through a fully connected layer of size 40 to obtain layer E1;

Apply dropout with a rate of 0.2 to layer E1 to obtain layer D1;

Pass layer D1 through a fully connected layer of size 20 to obtain layer E2;

Apply dropout with a rate of 0.2 to layer E2 to obtain layer D2;

34) The feature map of layer D2 in step 33) constitutes the features of the sgRNA-DNA sequence pair finally extracted by the model;

35) Pass layer D2 through a fully connected layer of size 2 with a softmax activation to obtain a one-dimensional vector of two elements, which represent the off-target probability and the probability of correct targeting, respectively.
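To make the layer sizes in steps 31) to 35) concrete, the sketch below traces the feature-map shapes through the three convolutions and shows how a softmax turns two output values into probabilities. Several details are assumptions, since the text does not state them: stride 1, the padding mode (both "valid" and "same" are supported here), and an embedding size of 16. The function names are hypothetical.

```python
import math


def conv1d_length(length: int, kernel: int, padding: str = "valid") -> int:
    """Output length of a stride-1 one-dimensional convolution."""
    if padding == "same":
        return length
    return length - kernel + 1


def trace_feature_shapes(seq_len: int = 23, embed_dim: int = 16,
                         padding: str = "valid") -> dict:
    """Map each layer name to its (positions, channels) shape."""
    shapes = {"Embedding": (seq_len, embed_dim)}
    length = seq_len
    for idx, filters in enumerate((20, 40, 80), start=1):
        length = conv1d_length(length, kernel=5, padding=padding)
        shapes[f"C{idx}"] = (length, filters)
        shapes[f"B{idx}"] = (length, filters)  # normalization keeps the shape
    shapes["C11"] = shapes["B3"]  # B1 rescaled to match B3, as in step 31)
    return shapes


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax, as used by the final size-2 layer."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


shapes = trace_feature_shapes()
for name, shape in shapes.items():
    print(name, shape)

# Two arbitrary output values standing in for the last layer's pre-activations.
probs = softmax([1.2, -0.3])
print(probs)
```

Under the "valid" padding assumption the sequence length shrinks 23, 19, 15, 11 across the three convolutions, which is why layer B1 needs a scale transformation before it can be combined with the third convolution's output in the attention module.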

Further, the bootstrap sampling in step 4) comprises the following steps:

41) Randomly divide the negative samples in the training set into N equal parts of 256 samples each, where N = (total number of negative samples)/256;

42) Randomly draw 256 samples with replacement from the positive samples in the training set and combine them with one of the negative-sample parts from step 41) to form a small training set;

43) Repeat step 42) N times to obtain N balanced small training sets;

44) In each iteration, train on one small training set at a time, so that one iteration consists of N training passes.
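The sampling scheme of steps 41) to 43) can be sketched as a generator of balanced subsets. This is an illustration under stated assumptions: the function name is hypothetical, negatives left over after integer division by 256 are simply dropped (the text does not say how a remainder is handled), and the toy data stands in for real sequence pairs.

```python
import random


def balanced_training_sets(positives, negatives, part_size=256, seed=0):
    """Yield balanced small training sets of (sample, label) tuples."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    # Step 41): N parts of part_size negatives (assumption: remainder dropped).
    n_parts = len(shuffled) // part_size
    for p in range(n_parts):
        neg_part = shuffled[p * part_size:(p + 1) * part_size]
        # Step 42): draw positives with replacement.
        pos_part = rng.choices(positives, k=part_size)
        subset = [(x, 1) for x in pos_part] + [(x, 0) for x in neg_part]
        rng.shuffle(subset)
        yield subset


# Toy data: 40 positives and 1000 negatives.
toy_positives = [f"pos{i}" for i in range(40)]
toy_negatives = [f"neg{i}" for i in range(1000)]

subsets = list(balanced_training_sets(toy_positives, toy_negatives))
print(len(subsets), len(subsets[0]))
```

Each negative sample is used at most once per iteration, while positives are reused across subsets, so every small training set has a 1:1 class ratio regardless of the overall imbalance.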

Further, step 32) comprises the following steps:

321) Calculate the attention score:

$$\mathrm{score}_i = U \cdot (W_1 q + W_2 k_i)$$

where $U$, $W_1$, and $W_2$ are randomly initialized matrices, $q$ is a column vector of the output matrix of layer C3, $k_i$ is a column vector of the output matrix of layer C11, and $\mathrm{score}_i$ is the attention score between $k_i$ and $q$, which characterizes the correlation between the two;

322) Calculate the output of the attention module:

where $q'$ is the output of the attention module obtained after processing by the attention algorithm.
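The score in step 321) can be sketched in plain Python as below. The score function follows the stated formula, with $U$ treated as a row vector so that each score is a scalar. The aggregation in `attend` is an assumption, because the formula for step 322) is not reproduced in this text: it uses the common choice of softmax-normalizing the scores and taking the weighted sum of the $k_i$. All function names and toy dimensions are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def attention_scores(q, keys, U, W1, W2):
    """score_i = U . (W1 q + W2 k_i), with U treated as a row vector."""
    wq = matvec(W1, q)
    scores = []
    for k in keys:
        wk = matvec(W2, k)
        scores.append(sum(u * (a + b) for u, a, b in zip(U, wq, wk)))
    return scores


def attend(q, keys, U, W1, W2):
    """Assumed aggregation: softmax over the scores, then a weighted sum of keys."""
    scores = attention_scores(q, keys, U, W1, W2)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(keys[0])
    q_prime = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
    return q_prime, weights


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, hidden, n_keys = 4, 3, 5  # toy sizes, not the model's real dimensions
q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
keys = random_matrix(rng, n_keys, dim)
W1 = random_matrix(rng, hidden, dim)
W2 = random_matrix(rng, hidden, dim)
U = [rng.gauss(0.0, 1.0) for _ in range(hidden)]

q_prime, weights = attend(q, keys, U, W1, W2)
print(weights)
print(q_prime)
```

In a trained model $U$, $W_1$, and $W_2$ would be learned parameters; random values are used here only to show the data flow.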

Compared with the prior art, the invention has the following advantages and beneficial technical effects:

1. The invention takes a convolutional neural network as its base framework and introduces an attention mechanism on top of it, which strengthens the extraction of position-related features within sgRNA-DNA sequence pairs and improves the precision and recall of CRISPR/Cas9 off-target prediction;

2. When predicting CRISPR/Cas9 off-target effects, the invention achieves a good balance between accuracy and recall;

3. The model structure is relatively simple and generalizes well;

4. Because the invention takes a convolutional neural network as its base framework, it has low time complexity and high practical utility.

Description of Drawings

Fig. 1 is a flow chart of an example implementation of the method of the invention.

Fig. 2 is a structural diagram of the model used in the method of the invention.

Detailed Description

To help those of ordinary skill in the art understand and implement the invention, it is further described below with reference to embodiments. It should be understood that the embodiments described here serve only to illustrate and explain the invention and do not limit it.

The invention uses the published HEK293T and K562t data sets. HEK293T contains 536 positive samples and 132,378 negative samples; K562t contains 120 positive samples and 20,199 negative samples. The two data sets contain 153,233 samples in total, with a positive-to-negative ratio of about 1:233.

The invention proposes a CRISPR/Cas9 off-target prediction method based on an attention mechanism. Before the neural network model is used to predict whether an unknown sgRNA-DNA sequence pair contains an off-target site, the constructed model must be trained; the trained model is then used for off-target prediction on sgRNA-DNA sequence pairs. The main workflow, shown in Fig. 1, includes the following steps:

1) Preprocess the data set labels: samples that contain an off-target site are marked as positive samples with label 1, and samples that do not are marked as negative samples with label 0.

2) Encode the sgRNA-DNA sequence pairs in the sample data set with a specific encoding method, as shown in Fig. 2; the specific steps are as follows:

$$x^{1\times 23} \cdot W^{23\times N} = h^{1\times N}$$

3) Split each data set into a training set and a test set at a ratio of 8:2; the training set is used to train the model weights.
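A minimal sketch of the 8:2 split is shown below. The text does not say whether the split is stratified by label or how it is randomized, so a plain seeded shuffle is assumed here; the function name is hypothetical, and integer indices stand in for the K562t samples.

```python
import random


def split_train_test(samples, train_fraction=0.8, seed=0):
    """Shuffle a data set and split it into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# K562t has 20,199 negative + 120 positive = 20,319 samples.
train_set, test_set = split_train_test(range(20_319))
print(len(train_set), len(test_set))
```

With so few positives, a stratified split that keeps the class ratio equal in both parts may be preferable in practice; that would be a design choice beyond what the text specifies.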

4) Input the encoded sgRNA-DNA sequence pairs into the off-target prediction model for feature extraction, as shown in Fig. 2, through the following steps:

41) Pass the output of the Embedding layer through a one-dimensional convolution with 20 kernels of size 5 to obtain the first feature map, layer C1;

42) Normalize the output of layer C1 to obtain layer B1;

43) Pass the output of layer B1 through a one-dimensional convolution with 40 kernels of size 5 to obtain the second feature map, layer C2;

44) Normalize the output of layer C2 to obtain layer B2;

45) Pass the output of layer B2 through a one-dimensional convolution with 80 kernels of size 5 to obtain the third feature map, layer C3;

46) Normalize the output of layer C3 to obtain layer B3;

47) Apply a linear scale transformation to the output of layer B1 so that its dimensions match those of layer B3, obtaining layer C11;

48) Apply the attention computation to the outputs of layers C11 and B3 to obtain the output of the attention module, layer A1;

49) Flatten the output of layer A1 into a one-dimensional vector;

50) Pass the feature vector from step 49) through two fully connected blocks in succession for feature integration. Each block consists of a Dense layer and a Dropout layer; the two Dense layers have sizes 40 and 20, and both Dropout layers have a rate of 0.2. The result is layer D2, the feature map extracted by the whole model.

5) Fit the off-target labels of the sgRNA-DNA sequence pairs using the extracted feature map to train the model.

A training model was built following the steps above, and experiments were carried out. During training, the optimizer was Adam, the initial learning rate was 0.003 with a decay rate of 0.2, the batch size was 512, and the number of iterations was 30. After training, the model was evaluated on the held-out test set, where the average performance of the method was 11% higher than that of the best existing prediction algorithm. This shows that the method of the invention improves the overall performance of CRISPR/Cas9 off-target prediction and has strong generalization ability.

In summary, the invention addresses the CRISPR/Cas9 off-target problem with an attention-based off-target prediction algorithm. The method uses a three-layer convolutional neural network as its base framework and introduces an attention mechanism to strengthen the extraction of position-related features from sgRNA-DNA sequence pairs. The method improves prediction performance on the CRISPR/Cas9 off-target problem, generalizes well, and is highly practical.

The above are only preferred embodiments of the present application, and the invention is not limited to them. Other improvements and variations that those skilled in the art derive directly or arrive at by association without departing from the spirit and concept of the invention shall be considered to fall within its scope of protection.

Claims

1. An attention-based CRISPR/Cas9 off-target prediction method, which is characterized by comprising the following steps:

1) constructing a data set containing sgRNA, DNA and off-target labels thereof for model training and testing, wherein the data set comprises the two published data sets HEK293T and K562t;

2) encoding the sgRNA-DNA sequence pairs in the sample data set by using a specific encoding method so that the sgRNA-DNA sequence pairs can be used as input of a neural network;

3) building a convolutional neural network model based on an attention mechanism, wherein the convolutional neural network model comprises a one-dimensional convolutional layer and an independent attention module;

4) selecting an optimizer and a loss function, setting the parameters of the optimizer, a learning rate and a number of iterations, and training the model in the step 3) by bootstrap sampling to obtain the weights of the trained model;

5) inputting the sgRNA-DNA sequence pairs to be detected into the trained model in the step 4) to obtain the off-target prediction value of the sgRNA-DNA sequence pairs.

2. The CRISPR/Cas9 off-target prediction method based on the attention mechanism as claimed in claim 1, wherein the coding method selected in the step 2) is word vector coding, comprising the following steps:

21) defining a base pair dictionary, wherein the dictionary comprises a mapping of all 16 different base pairs formed by pairwise combination of the four different bases A, T, C and G to corresponding numerical values, specifically: {AA:0, AT:1, AC:2, AG:3, TA:4, TT:5, TC:6, TG:7, CA:8, CT:9, CC:10, CG:11, GA:12, GT:13, GC:14, GG:15};

22) looking up in the dictionary of the step 21) the numerical value corresponding to the nucleotide pair at each position in the sgRNA-DNA sequence pair (23 bp) to complete the numerical coding of the sgRNA-DNA sequence pair;

23) for the coding sequence $x^{1\times 23}$ obtained in step 22), using the Embedding layer of the Keras framework to further carry out word vector coding, wherein the Embedding layer is in effect a neural network comprising one hidden layer and the hidden layer vector is $h^{1\times N}$, the specific encoding being as follows:

$$x^{1\times 23} \cdot W^{23\times N} = h^{1\times N}$$

wherein $W^{23\times N}$ is the weight matrix between the input layer and the hidden layer, $N$ is a user-defined coding size, and the finally obtained word vector coding result is $W^{23\times N}$.

3. The attention mechanism-based CRISPR/Cas9 off-target prediction method according to claim 1, wherein the model building in the step 3) comprises the following steps:

31) performing one-dimensional convolution on the coding sequence obtained in the step 23), wherein the kernel size is 5 and the kernel number is 20, so as to obtain a C1 layer; for the C1 layer, normalizing the C1 layer to obtain a B1 layer;

for the B1 layer, performing one-dimensional convolution with the kernel size of 5 and the kernel number of 40 to obtain a C2 layer;

for the C2 layer, carrying out normalization treatment on the C2 layer to obtain a B2 layer;

for the B2 layer, performing one-dimensional convolution with the kernel size of 5 and the kernel number of 80 to obtain a C3 layer;

for the C3 layer, carrying out normalization treatment on the C3 layer to obtain a B3 layer;

for the B1 layer, performing linear scale transformation to make the B1 layer consistent with the size of the B3 layer, and obtaining a C11 layer;

32) inputting the C11 and B3 layers obtained in the step 31) into an attention module, enhancing the extraction of the internal local position characteristics of the sgRNA-DNA sequence, and obtaining an output A1 layer of the attention module;

33) passing the feature map of the A1 layer in the step 32) through a fully connected layer with a size of 40 to obtain an E1 layer;

for the E1 layer, applying dropout with a rate of 0.2 to obtain a D1 layer;

for the D1 layer, passing it through a fully connected layer with a size of 20 to obtain an E2 layer;

for the E2 layer, applying dropout with a rate of 0.2 to obtain a D2 layer;

34) the feature map of the D2 layer in the step 33) being the features of the sgRNA-DNA sequence pair finally extracted by the model;

35) for the D2 layer, passing it through a fully connected layer with a size of 2 and a softmax activation function to obtain a one-dimensional vector containing two elements, which represent the off-target probability and the probability of correct targeting, respectively.

4. The attention mechanism-based CRISPR/Cas9 off-target prediction method according to claim 1, wherein training the model by bootstrap sampling in the step 4) comprises the following steps:

41) randomly dividing the negative samples in the training set into N equal parts of 256 samples each, wherein N is the total number of negative samples divided by 256;

42) randomly drawing 256 samples with replacement from the positive samples in the training set, and combining them with one of the negative-sample parts from the step 41) to form a small training set;

43) repeating the step 42) N times to obtain N balanced small training sets;

44) in one iteration, training on one small training set at a time, so that the training is carried out in N passes.

5. The attention mechanism-based CRISPR/Cas9 off-target prediction method according to claim 1, the attention module implementation in step 32) comprising the steps of:

321) calculating the attention score:

$$\mathrm{score}_i = U \cdot (W_1 q + W_2 k_i)$$

wherein $U$, $W_1$, and $W_2$ are all randomly initialized matrices, $q$ is a column vector of the output matrix of the B3 layer, $k_i$ is a column vector of the output matrix of the C11 layer, and $\mathrm{score}_i$ is the attention score between $k_i$ and $q$, characterizing the degree of correlation between the two;

322) compute attention module output:

wherein q' is the output of the attention module obtained after processing by the attention algorithm.