CN111613267A - A CRISPR/Cas9 off-target prediction method based on attention mechanism - Google Patents
- Publication number
- CN111613267A (application number CN202010433848.9A)
- Authority
- CN
- China
- Prior art keywords
- layer
- sgrna
- dna sequence
- cas9
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Biotechnology (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioethics (AREA)
- Public Health (AREA)
- Evolutionary Computation (AREA)
- Epidemiology (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Medicinal Chemistry (AREA)
- Pharmacology & Pharmacy (AREA)
- Crystallography & Structural Chemistry (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
Description
Technical Field
The invention relates to the fields of gene editing and deep learning, and in particular to a CRISPR/Cas9 off-target prediction method based on an attention mechanism.
Background
Genome editing mediated by the CRISPR/Cas9 (clustered regularly interspaced short palindromic repeat/CRISPR-associated protein 9) system is the third generation of site-directed genome editing technology and can be used to modify specific genomic DNA targets. Compared with the two earlier generations, zinc-finger nucleases (ZFNs) and transcription activator-like effector nucleases (TALENs), CRISPR/Cas9 offers a wide range of applications, low cost, and ease of operation. Its main applications include targeted mutagenesis of model organisms, gene knockout or knock-in to verify gene function, epigenome editing, and delivery of base-editing enzymes to target sites. The technology also has great potential for in vivo gene therapy. However, because current in vivo delivery methods for CRISPR/Cas9 are not tissue-specific, CRISPR/Cas9 may edit genes in non-target tissues and cause on-target side effects there, so its use in gene therapy still faces a series of safety challenges. Accurate prediction of CRISPR/Cas9 off-target effects is therefore of great importance for its application.
Many methods exist for predicting CRISPR/Cas9 off-target effects. They fall mainly into sequence-alignment-based methods, hypothesis-driven empirical methods, sequencing methods, and machine-learning-based algorithms; of these, the empirical and machine-learning-based methods give the best prediction performance. In recent years deep learning has also been applied to CRISPR/Cas9 off-target prediction with some success, but prediction accuracy remains limited. Because deep learning can extract the features of sgRNA-DNA sequence pairs automatically, is simple to operate, and leaves considerable room for improving algorithm performance, methods that use deep learning to improve CRISPR/Cas9 off-target prediction merit further study.
Summary of the Invention
The purpose of the invention is to overcome the shortcomings of existing CRISPR/Cas9 off-target prediction methods by proposing a CRISPR/Cas9 off-target prediction method based on an attention mechanism. The input to an off-target prediction method is an sgRNA-DNA sequence pair, and the algorithm decides whether an off-target event will occur mainly by examining how the bases of the sgRNA sequence and the DNA sequence match at each position. This decision depends strongly on positional information, and an attention mechanism can effectively promote the extraction of local positional features within the sequence pair. The method therefore uses a convolutional neural network as its base framework and adds an attention mechanism to strengthen the extraction of positional features from sgRNA-DNA sequence pairs. This can effectively improve the accuracy and recall of off-target prediction and gives high generalization performance, which is important for the safe application of CRISPR/Cas9.
To achieve the above purpose, the invention provides the following technical solution:
A CRISPR/Cas9 off-target prediction method based on an attention mechanism, mainly comprising the following steps:
1) Construct a dataset for model training and testing that contains sgRNA sequences, DNA sequences, and their off-target labels, drawing on the two published datasets HEK293T and K562t;
2) Encode the sgRNA-DNA sequence pairs in the sample dataset with a specific encoding method so that they can serve as input to a neural network;
3) Build a convolutional neural network model based on an attention mechanism, comprising one-dimensional convolutional layers and a separate attention module;
4) Select an optimizer and a loss function, set the optimizer parameters, learning rate, and number of iterations, and train the model from step 3) with bootstrap sampling to obtain the weights of the trained model;
5) Feed the sgRNA-DNA sequence pair to be tested into the model trained in step 4) to obtain the off-target prediction for that pair.
Further, the dataset in step 1) is composed as follows:
In HEK293T, there are 536 sgRNA-DNA sequence pairs with off-target sites (positive samples) and 132,378 sgRNA-DNA sequence pairs without off-target sites (negative samples); in K562t, there are 20,199 negative samples and 120 positive samples. The two datasets together contain 153,233 samples, with a positive-to-negative ratio of about 1:233. Positive samples are labeled 1 and negative samples are labeled 0.
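The totals and the class ratio quoted above follow directly from the four per-dataset counts. The short check below recomputes them; it is only an arithmetic illustration, and the names `DATASETS` and `summarize` are hypothetical.

```python
# Sample counts as stated in the text: (positives, negatives) per dataset.
DATASETS = {
    "HEK293T": (536, 132378),
    "K562t": (120, 20199),
}


def summarize(datasets):
    """Return total positives, negatives, overall size and negatives per positive."""
    positives = sum(p for p, _ in datasets.values())
    negatives = sum(n for _, n in datasets.values())
    return {
        "positives": positives,
        "negatives": negatives,
        "total": positives + negatives,
        "negatives_per_positive": negatives / positives,
    }


summary = summarize(DATASETS)
print(summary["total"])                          # 153233
print(round(summary["negatives_per_positive"]))  # 233
```

With only 656 positives among 153,233 samples, the imbalance is severe enough that a model predicting "negative" everywhere would still look accurate, which is the motivation for the balanced sampling used during training.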
Further, step 2) comprises the following steps:
21) Define a base-pair dictionary that maps each of the 16 ordered pairs formed from the four bases A, T, C, and G to a numerical value: {AA:0, AT:1, AC:2, AG:3, TA:4, TT:5, TC:6, TG:7, CA:8, CT:9, CC:10, CG:11, GA:12, GT:13, GC:14, GG:15};
22) Look up in the dictionary from step 21) the value of the nucleotide pair at each position of the sgRNA-DNA sequence pair (23 bp), which completes the numerical encoding of the pair;
23) Further encode the sequence $x_{1\times 23}$ obtained in step 22) as word vectors using the Embedding layer of the Keras framework. The Embedding layer is in effect a neural network with one hidden layer; let the hidden-layer vector be $h_{1\times N}$. The encoding is:
$$x_{1\times 23}\cdot W_{23\times N}=h_{1\times N}$$
where $W_{23\times N}$ is the weight matrix between the input layer and the hidden layer and $N$ is a user-defined encoding size; the resulting word-vector encoding is $W_{23\times N}$.
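The dictionary lookup of steps 21) and 22) can be sketched in a few lines. Enumerating ordered pairs over the base order A, T, C, G gives CG at index 11 and GG at index 15. This sketch assumes that the first letter of each pair is the sgRNA base and the second is the DNA base, which the text does not state; `PAIR_TO_INDEX`, `encode_pair`, and the two example sequences are hypothetical.

```python
from itertools import product

BASES = "ATCG"  # base order used by the dictionary in step 21)

# Enumerate the 16 ordered pairs: AA -> 0, AT -> 1, ..., GC -> 14, GG -> 15.
PAIR_TO_INDEX = {a + b: i for i, (a, b) in enumerate(product(BASES, repeat=2))}


def encode_pair(sgrna, dna):
    """Map an aligned sgRNA-DNA pair to one integer code per position."""
    if len(sgrna) != len(dna):
        raise ValueError("sequences must have the same length")
    return [PAIR_TO_INDEX[s + d] for s, d in zip(sgrna.upper(), dna.upper())]


# Hypothetical 23-nt example pair (assumption: sgRNA written with T, not U).
sgrna = "GAGTCCGAGCAGAAGAAGAAGGG"
dna = "GAGTCTAAGCAGAAGAAGAAGAG"
codes = encode_pair(sgrna, dna)
print(len(codes))  # 23
```

The resulting list of 23 integers is the sequence that step 23) passes to the Embedding layer.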
Further, step 3) comprises the following steps:
31) Apply a one-dimensional convolution with kernel size 5 and 20 kernels to the encoded sequence obtained in step 23) to obtain layer C1;
Normalize layer C1 to obtain layer B1;
Apply a one-dimensional convolution with kernel size 5 and 40 kernels to layer B1 to obtain layer C2;
Normalize layer C2 to obtain layer B2;
Apply a one-dimensional convolution with kernel size 5 and 80 kernels to layer B2 to obtain layer C3;
Normalize layer C3 to obtain layer B3;
Apply a linear scale transformation to layer B1 so that its dimensions match those of layer B3, obtaining layer C11;
32) Feed layers C11 and C3 obtained in step 31) into the attention module to strengthen the extraction of local positional features within the sgRNA-DNA sequence pair, obtaining the attention module output, layer A1;
33) Pass the feature map of layer A1 from step 32) through a fully connected layer of size 40 to obtain layer E1;
Apply dropout with a rate of 0.2 to layer E1 to obtain layer D1;
Pass layer D1 through a fully connected layer of size 20 to obtain layer E2;
Apply dropout with a rate of 0.2 to layer E2 to obtain layer D2;
34) The feature map of layer D2 in step 33) is the final feature representation of the sgRNA-DNA sequence pair extracted by the model;
35) Pass layer D2 through a fully connected layer of size 2 with softmax activation to obtain a one-dimensional vector of two elements, which represent the probability of an off-target event and the probability of correct targeting, respectively.
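To see why layer B1 needs a linear transformation before it can be compared with layer B3, it helps to trace the shapes through the three convolutions. The sketch below does this for a 23-position input. The text does not state the padding mode or stride, so the default here ('valid' padding, stride 1) is an assumption, and `conv1d_output_length` and `trace_conv_stack` are hypothetical helper names.

```python
def conv1d_output_length(length, kernel_size, padding="valid", stride=1):
    """Output length of a 1-D convolution for 'valid' or 'same' padding."""
    if padding == "same":
        return -(-length // stride)  # ceil(length / stride)
    if padding == "valid":
        return (length - kernel_size) // stride + 1
    raise ValueError("padding must be 'valid' or 'same'")


def trace_conv_stack(seq_len=23, kernel_size=5, filters=(20, 40, 80), padding="valid"):
    """Return (layer name, length, channels) for C1..C3.

    The normalization layers B1..B3 keep the shape of the layer they follow.
    """
    shapes = []
    length = seq_len
    for i, n_filters in enumerate(filters, start=1):
        length = conv1d_output_length(length, kernel_size, padding)
        shapes.append((f"C{i}", length, n_filters))
    return shapes


for name, length, channels in trace_conv_stack():
    print(name, length, channels)
# C1 19 20
# C2 15 40
# C3 11 80
```

Under these assumptions B1 has shape 19 x 20 while B3 has shape 11 x 80, so the two cannot be combined directly. With 'same' padding the lengths would all stay at 23 and only the channel counts (20 versus 80) would differ; either way a projection such as C11 is needed.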
Further, the bootstrap sampling in step 4) comprises the following steps:
41) Randomly divide the negative samples in the training set into N equal parts of 256 samples each, where N = (total number of negative samples)/256;
42) Randomly draw 256 samples with replacement from the positive samples in the training set and combine them with one part of negative samples from step 41) to form a small training set;
43) Repeat step 42) N times to obtain N balanced small training sets;
44) Within one iteration, train on one small training set at a time, which takes N rounds in total.
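Steps 41) to 43) can be sketched with the standard library alone. This is a minimal illustration rather than the authors' implementation: it assumes that negatives left over after dividing by 256 are dropped (the text gives N as a plain quotient), and the names `balanced_minisets`, `chunk`, and `seed` are hypothetical.

```python
import random


def balanced_minisets(positives, negatives, chunk=256, seed=0):
    """Build N balanced small training sets of 2 * chunk labeled samples each."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    n_sets = len(shuffled) // chunk  # N; remainder is dropped (assumption)
    minisets = []
    for i in range(n_sets):
        # Step 41): each negative sample is used in exactly one part.
        neg_chunk = shuffled[i * chunk:(i + 1) * chunk]
        # Step 42): positives are drawn with replacement.
        pos_chunk = rng.choices(positives, k=chunk)
        miniset = [(x, 1) for x in pos_chunk] + [(x, 0) for x in neg_chunk]
        rng.shuffle(miniset)
        minisets.append(miniset)
    return minisets


# Toy data standing in for encoded sgRNA-DNA pairs.
positives = [f"pos{i}" for i in range(40)]
negatives = [f"neg{i}" for i in range(1000)]
minisets = balanced_minisets(positives, negatives)
print(len(minisets), len(minisets[0]))  # 3 512
```

Because positives are resampled for every part, each small set is balanced 1:1 even though positives are far rarer than negatives in the full training set.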
Further, step 32) comprises the following steps:
321) Compute the attention score:
$$\mathrm{score}_i = U\cdot(W_1 q + W_2 k_i)$$
where $U$, $W_1$, and $W_2$ are randomly initialized matrices, $q$ is a column vector of the C3-layer output matrix, $k_i$ is a column vector of the C11-layer output matrix, and $\mathrm{score}_i$ is the attention score between $k_i$ and $q$, which characterizes the degree of correlation between them;
322) Compute the output of the attention module:
where $q'$ is the output of the attention module after processing by the attention algorithm.
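The score of step 321) can be evaluated with plain lists to make the shapes concrete. The sketch assumes that U acts as a vector, so that each score is a scalar, and it uses small fixed values in place of the randomly initialized, trained matrices; `matvec` and `attention_scores` are hypothetical names. It stops at the scores and does not attempt the output of step 322).

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def attention_scores(q, keys, U, W1, W2):
    """Return score_i = U . (W1 q + W2 k_i) for every key k_i."""
    projected_q = matvec(W1, q)
    scores = []
    for k in keys:
        projected_k = matvec(W2, k)
        combined = [a + b for a, b in zip(projected_q, projected_k)]
        scores.append(sum(u * c for u, c in zip(U, combined)))
    return scores


# Toy values (assumption): identity projections and an all-ones U.
W1 = [[1, 0], [0, 1]]
W2 = [[1, 0], [0, 1]]
U = [1, 1]
q = [1, 2]                          # stands in for one column of the C3 output
keys = [[0, 0], [1, -1], [2, 3]]    # stand in for columns of the C11 output
scores = attention_scores(q, keys, U, W1, W2)
print(scores)  # [3, 3, 8]
```

A larger score marks a position of C11 that is more strongly related to the query column, which is how the module emphasizes position-specific features.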
Compared with the prior art, the invention has the following advantages and beneficial technical effects:
1. The invention uses a convolutional neural network as its base framework and adds an attention mechanism, which strengthens the extraction of position-related features within sgRNA-DNA sequence pairs and improves the precision and recall of CRISPR/Cas9 off-target prediction;
2. When predicting CRISPR/Cas9 off-target effects, the invention achieves a good balance between accuracy and recall;
3. The model structure of the invention is relatively simple and generalizes well;
4. The invention uses a convolutional neural network as its base framework, which gives low time complexity and high practicality.
Brief Description of the Drawings
Fig. 1 is a flowchart of an embodiment of the method of the invention.
Fig. 2 is a diagram of the model structure of the method of the invention.
Detailed Description
To help those of ordinary skill in the art understand and implement the invention, it is further described below with reference to an embodiment. The embodiment described here serves only to illustrate and explain the invention and does not limit it.
The invention uses the published HEK293T and K562t datasets. HEK293T contains 536 positive samples and 132,378 negative samples; K562t contains 20,199 negative samples and 120 positive samples. The two datasets together contain 153,233 samples, with a positive-to-negative ratio of about 1:233.
The invention proposes a CRISPR/Cas9 off-target prediction method based on an attention mechanism. Before the neural network model is used to predict whether an unknown sgRNA-DNA sequence pair contains an off-target site, the constructed model must be trained; the trained model is then used for off-target prediction on sgRNA-DNA sequence pairs. The main flow is shown in Fig. 1 and comprises the following steps:
1) Preprocess the dataset labels: samples that contain an off-target site are marked as positive samples and labeled 1; samples that do not contain an off-target site are marked as negative samples and labeled 0.
2) Encode the sgRNA-DNA sequence pairs in the sample dataset with a specific encoding method (see Fig. 2), as follows:
21) Define a base-pair dictionary that maps each of the 16 ordered pairs formed from the four bases A, T, C, and G to a numerical value: {AA:0, AT:1, AC:2, AG:3, TA:4, TT:5, TC:6, TG:7, CA:8, CT:9, CC:10, CG:11, GA:12, GT:13, GC:14, GG:15};
22) Look up in the dictionary from step 21) the value of the nucleotide pair at each position of the sgRNA-DNA sequence pair (23 bp), which completes the numerical encoding of the pair;
23) Further encode the sequence $x_{1\times 23}$ obtained in step 22) as word vectors using the Embedding layer of the Keras framework. The Embedding layer is in effect a neural network with one hidden layer; let the hidden-layer vector be $h_{1\times N}$. The encoding is:
$$x_{1\times 23}\cdot W_{23\times N}=h_{1\times N}$$
where $W_{23\times N}$ is the weight matrix between the input layer and the hidden layer and $N$ is a user-defined encoding size; the resulting word-vector encoding is $W_{23\times N}$.
3) Split each dataset into a training set and a test set at a ratio of 8:2; the training set is used to train the model weights.
4) Feed the encoded sgRNA-DNA sequence pairs into the off-target prediction model for feature extraction (see Fig. 2), comprising the following steps:
41) Pass the output of the Embedding layer through a one-dimensional convolution with 20 kernels of size 5 to obtain the first feature map, layer C1;
42) Normalize the output of layer C1 to obtain layer B1;
43) Pass the output of layer B1 through a one-dimensional convolution with 40 kernels of size 5 to obtain the second feature map, layer C2;
44) Normalize the output of layer C2 to obtain layer B2;
45) Pass the output of layer B2 through a one-dimensional convolution with 80 kernels of size 5 to obtain the third feature map, layer C3;
46) Normalize the output of layer C3 to obtain layer B3;
47) Apply a linear scale transformation to the output of layer B1 so that its dimensions match those of layer B3, obtaining layer C11;
48) Apply the attention computation to the outputs of layers C11 and B3 to obtain the attention module output, layer A1;
49) Flatten the output of layer A1 into a one-dimensional vector;
50) Pass the feature vector from step 49) through two fully connected blocks in succession for feature integration. Each block consists of a Dense layer and a Dropout layer; the two Dense layers have sizes 40 and 20, and both Dropout layers use a rate of 0.2. The result is the feature map extracted by the whole model, layer D2.
5) Fit the off-target labels of the sgRNA-DNA sequence pairs using the extracted feature map, thereby training the model.
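The 8:2 split of step 3) can be sketched as a shuffled partition. This is a minimal illustration, assuming a plain random split; the text does not say whether the split is stratified by label, and `split_dataset`, `train_fraction`, and `seed` are hypothetical names.

```python
import random


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy data standing in for one dataset of encoded pairs.
samples = list(range(10))
train, test = split_dataset(samples)
print(len(train), len(test))  # 8 2
```

With so few positive samples, a stratified split that preserves the class ratio in both parts may be preferable in practice, but that choice is not specified in the text.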
Following the above steps, the model was built and experiments were carried out. During training, the optimizer was Adam, the initial learning rate was 0.003 with a decay rate of 0.2, the batch size was 512, and the number of iterations was 30. After training, the model was evaluated on the held-out test set, where its average performance was 11% higher than that of the best existing prediction algorithm. This shows that the method can improve the overall performance of CRISPR/Cas9 off-target prediction and generalizes well.
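The batch size of 512 matches one balanced small training set (256 positives plus 256 negatives), so the amount of training per iteration follows from the number of negative samples. The arithmetic below illustrates this for HEK293T. It assumes that the negatives are split in the same 8:2 proportion as the dataset as a whole and that each small set is one batch; `training_plan` is a hypothetical helper.

```python
def training_plan(train_negatives, chunk=256, iterations=30):
    """Count small training sets per iteration and over the whole run."""
    sets_per_iteration = train_negatives // chunk
    return {
        "sets_per_iteration": sets_per_iteration,
        "samples_per_set": 2 * chunk,
        "total_sets_seen": sets_per_iteration * iterations,
    }


# HEK293T has 132,378 negatives; 80% of them go to the training set (assumption).
hek_train_negatives = 132378 * 8 // 10
plan = training_plan(hek_train_negatives)
print(hek_train_negatives)  # 105902
print(plan)
# {'sets_per_iteration': 413, 'samples_per_set': 512, 'total_sets_seen': 12390}
```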
In summary, the invention addresses the CRISPR/Cas9 off-target problem with a prediction algorithm based on an attention mechanism. The method uses a three-layer convolutional neural network as its base framework and adds an attention mechanism to strengthen the extraction of position-related features from sgRNA-DNA sequence pairs. The method improves the prediction performance for CRISPR/Cas9 off-target effects, generalizes well, and is highly practical.
The above describes only preferred embodiments of the application, and the invention is not limited to these embodiments. Other improvements and variations that those skilled in the art directly derive or conceive of without departing from the spirit and concept of the invention shall be deemed to fall within its scope of protection.
Claims (5)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010433848.9A CN111613267A (en) | 2020-05-21 | 2020-05-21 | A CRISPR/Cas9 off-target prediction method based on attention mechanism |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010433848.9A CN111613267A (en) | 2020-05-21 | 2020-05-21 | A CRISPR/Cas9 off-target prediction method based on attention mechanism |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111613267A (en) | 2020-09-01 |
Family ID: 72202383
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010433848.9A Pending CN111613267A (en) | 2020-05-21 | 2020-05-21 | A CRISPR/Cas9 off-target prediction method based on attention mechanism |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111613267A (en) |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112750499A (en) * | 2021-01-19 | 2021-05-04 | 无锡市第五人民医院 | Method and system for improving safety of gene editing technology |
| CN113611367A (en) * | 2021-08-05 | 2021-11-05 | 湖南大学 | A CRISPR/Cas9 off-target prediction method enhanced by VAE data |
| CN114334007A (en) * | 2022-01-20 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Gene off-target prediction model training method, prediction method, device and electronic equipment |
| CN114496069A (en) * | 2022-02-17 | 2022-05-13 | 华东师范大学 | An off-target prediction method for CIRSRCas9 system based on Transformer architecture |
| CN115579058A (en) * | 2022-11-01 | 2023-01-06 | 阿里巴巴(中国)有限公司 | Lossless compression method for genome data, and method and apparatus for predicting genetic variation |
| WO2023164925A1 (en) * | 2022-03-04 | 2023-09-07 | 中国科学院脑科学与智能技术卓越创新中心 | Method for predicting gene editing activity by deep learning and use thereof |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106446600A (en) * | 2016-05-20 | 2017-02-22 | 同济大学 | CRISPR/Cas9-based sgRNA design method |
| CN109153980A (en) * | 2015-10-22 | 2019-01-04 | 布罗德研究所有限公司 | Type VI-B CRISPR enzymes and systems |
| CN109971842A (en) * | 2019-02-15 | 2019-07-05 | 成都美杰赛尔生物科技有限公司 | A method of detection CRISPR-Cas9 undershooting-effect |
| CN110070912A (en) * | 2019-04-15 | 2019-07-30 | 桂林电子科技大学 | A kind of prediction technique of CRISPR/Cas9 undershooting-effect |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109153980A (en) * | 2015-10-22 | 2019-01-04 | 布罗德研究所有限公司 | Type VI-B CRISPR enzymes and systems |
| CN106446600A (en) * | 2016-05-20 | 2017-02-22 | 同济大学 | CRISPR/Cas9-based sgRNA design method |
| CN109971842A (en) * | 2019-02-15 | 2019-07-05 | 成都美杰赛尔生物科技有限公司 | A method of detection CRISPR-Cas9 undershooting-effect |
| CN110343751A (en) * | 2019-02-15 | 2019-10-18 | 成都美杰赛尔生物科技有限公司 | A method of detection CRISPR-Cas9 undershooting-effect |
| CN110070912A (en) * | 2019-04-15 | 2019-07-30 | 桂林电子科技大学 | A kind of prediction technique of CRISPR/Cas9 undershooting-effect |
Non-Patent Citations (1)
| Title |
|---|
| GUISHAN ZHANG ET AL.: "C-RNNCrispr: Prediction of CRISPR/Cas9 sgRNA activity using convolutional and recurrent neural networks", 《COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL》 * |
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112750499A (en) * | 2021-01-19 | 2021-05-04 | 无锡市第五人民医院 | Method and system for improving safety of gene editing technology |
| CN113611367A (en) * | 2021-08-05 | 2021-11-05 | 湖南大学 | A CRISPR/Cas9 off-target prediction method enhanced by VAE data |
| CN114334007A (en) * | 2022-01-20 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Gene off-target prediction model training method, prediction method, device and electronic equipment |
| CN114334007B (en) * | 2022-01-20 | 2024-10-18 | 腾讯科技(深圳)有限公司 | Gene off-target prediction model training method, prediction device and electronic equipment |
| CN114496069A (en) * | 2022-02-17 | 2022-05-13 | 华东师范大学 | An off-target prediction method for CIRSRCas9 system based on Transformer architecture |
| WO2023164925A1 (en) * | 2022-03-04 | 2023-09-07 | 中国科学院脑科学与智能技术卓越创新中心 | Method for predicting gene editing activity by deep learning and use thereof |
| CN115579058A (en) * | 2022-11-01 | 2023-01-06 | 阿里巴巴(中国)有限公司 | Lossless compression method for genome data, and method and apparatus for predicting genetic variation |
| CN115579058B (en) * | 2022-11-01 | 2023-12-01 | 阿里巴巴(中国)有限公司 | Lossless compression method of genome data, prediction method and device of genetic variation |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200901 |