CN112215013B

CN112215013B - A deep learning-based clone code semantic detection method

Info

Publication number: CN112215013B
Application number: CN202011205774.XA
Authority: CN
Inventors: 成肖云; 王建荣; 王赞; 贾勇哲; 马国宁
Original assignee: Tianjin Thai Technology Co ltd; Tianjin University
Current assignee: Tianjin Thai Technology Co ltd; Tianjin University
Priority date: 2020-11-02
Filing date: 2020-11-02
Publication date: 2022-04-19
Anticipated expiration: 2040-11-02
Also published as: CN112215013A

Abstract

The invention discloses a semantic clone detection method based on deep learning, for a given code block pair, firstly preprocessing the code block into a sequence of TPE basic units, then performing word embedding operation on the code block and the TPE basic units, and using the code block and the TPE basic units to a BilSTM module with context characteristic combination; secondly, intensively extracting useful information related to the clone codes learned by the neural network by using a self-attention mechanism; each code segment is converted into a vector representation, the Euclidean distances between the vectors are calculated as the characteristics of classification, and the vectors are classified into two classes: if two code blocks are similar, the vectors they generate over the neural network should be similar, i.e. predictive cloned/uncloned. Compared with the prior art, the invention saves more time and can capture rich grammar and semantic information; TPE can also avoid the problem of insufficient vocabulary (OOV).

Description

A deep learning-based clone code semantic detection method

技术领域technical field

本发明涉及程序分析和机器学习领域，尤其是涉及一种源代码的表示和语义克隆检测方法。The present invention relates to the field of program analysis and machine learning, in particular to a source code representation and semantic clone detection method.

背景技术Background technique

克隆代码就是重复的代码、相似的代码。常见的是将代码克隆分为四种类型：1)一型克隆是除空格、格式和注释不同外是完全相同的两个代码片段；2)二型克隆是两个代码片段除标识符、常量、变量类型不同外是完全相同的两个代码片段；3)三型克隆是对复制过来代码片段进行少量语句修改，如改变、增加或删除少量语句的代码的两个代码片段；4)四型克隆主要与功能相似性有关。前三种类型主要与文本相似性有关。代码克隆的存在不必要地增加了程序大小，对代码段的更改也需要对其克隆进行修改，从而增加维护工作量，而且复制包含错误的代码片段会导致错误传播。检测代码克隆有助于降低软件维护成本并防止发生错误。Clone code is duplicate code, similar code. It is common to divide code clones into four types: 1) Type 1 clones are two code fragments that are identical except for spaces, formats and comments; 2) Type 2 clones are two code fragments except for identifiers, constants , two code fragments that are exactly the same except for different variable types; 3) Type 3 clone is two code fragments that modify a small number of statements in the copied code fragment, such as changing, adding or deleting a small number of statements; 4) Type 4 Cloning is mainly concerned with functional similarity. The first three types are mainly related to text similarity. The existence of code clones unnecessarily increases program size, changes to code segments also require modifications to their clones, increasing maintenance effort, and duplicating code segments that contain errors can cause errors to propagate. Detecting code clones can help reduce software maintenance costs and prevent bugs.

在各种检测代码克隆的方法中很少能够检测语义克隆，因为语义克隆是最难以检测的，它们包括在语法上不同但仍然执行相同功能的克隆，因此，提出一个有效检测代码语义克隆的方法是十分有必要的。Among the various methods for detecting code clones, few can detect semantic clones, because semantic clones are the most difficult to detect, they include clones that are syntactically different but still perform the same function, therefore, an efficient method to detect code semantic clones is proposed. is very necessary.

语义克隆检测的一个关键问题是如何有效地学习源代码的表示，从而有效地捕捉其语义。Token和抽象语法树(AST)通常用于检测代码克隆。但是Token无法很好的学习到代码结构中包含的语义信息，这对于语义克隆检测任务来说是不够的。最近的语义克隆检测工作多利用抽象语法树(AST)来结合语法信息来表示代码，这些方法被证明是有效的，但检测效率较低，因为代码的抽象语法树(AST)通常比文本的解析树更复杂。代码克隆检测不仅考虑准确度还要考虑效率。A key issue in semantic clone detection is how to efficiently learn the representation of source code to capture its semantics efficiently. Tokens and Abstract Syntax Trees (ASTs) are often used to detect code clones. However, Token cannot learn the semantic information contained in the code structure well, which is not enough for the task of semantic clone detection. Recent work on semantic clone detection mostly utilizes Abstract Syntax Trees (ASTs) to combine syntactic information to represent codes. These methods are proven to be effective, but the detection efficiency is lower, because the Abstract Syntax Trees (ASTs) of codes are usually more efficient than text parsing. Trees are more complex. Code clone detection considers not only accuracy but also efficiency.

在代码克隆检测任务中，使用Token作为代表程序源代码的基本分割方法非常常见。经过规范化的Token词汇量很小(通常不超过300个不同的规范化Token)，词汇量小导致学习到的词汇表(预训练获得的一种外部知识)容量有限，使得外部预训练对神经模型无效。因此语义克隆检测任务通常采用未规范化的令牌。但是Token仍然可能无法捕获丰富的语义信息，特别是在程序中使用无意义的变量。In the code clone detection task, it is very common to use Token as the basic segmentation method to represent the program source code. The normalized token vocabulary size is very small (usually no more than 300 different normalized tokens), and the small vocabulary size leads to the limited capacity of the learned vocabulary (a kind of external knowledge obtained by pre-training), making external pre-training invalid for neural models . Therefore semantic clone detection tasks usually employ unnormalized tokens. But tokens may still fail to capture rich semantic information, especially when meaningless variables are used in programs.

为了从预训练中提取更多的信息，一种直接的方法是扩大输入词汇量。语句级分割可能是一种自然选择，但是，由于语句的多样性，它的词汇量可以无限大。不可能训练一个包含所有可能语句表示的词汇表。因此，可能会遇到在词汇表中找不到向量表示的输入语句，这被称为OOV(out-of-vocabulary)问题。OOV问题严重限制了代码表示的有效性。To extract more information from pre-training, a straightforward approach is to expand the input vocabulary. Sentence-level segmentation may be a natural choice, however, due to the diversity of sentences, its vocabulary can be infinitely large. It is impossible to train a vocabulary containing all possible sentence representations. Therefore, it is possible to encounter input sentences for which the vector representation cannot be found in the vocabulary, which is called the OOV (out-of-vocabulary) problem. The OOV problem severely limits the usefulness of code representation.

发明内容SUMMARY OF THE INVENTION

本发明旨在解决程序中语义代码克隆的检测问题，提出一种基于深度学习的语义克隆检测方法，通过提出一种源代码表示单元TPE来表示程序源代码，再利用有监督的分类器实现侧重语义克隆检测的代码克隆对的检测。The invention aims to solve the problem of detection of semantic code clones in programs, and proposes a semantic clone detection method based on deep learning. A source code representation unit TPE is proposed to represent the program source code, and then a supervised classifier is used to realize the focus Detection of code clone pairs for semantic clone detection.

本发明的一种基于深度学习的克隆代码语义检测方法，该方法具体包括以下流程：A deep learning-based clone code semantic detection method of the present invention specifically includes the following processes:

步骤1、确定语义克隆检测任务中代码表示的TPE的基本单元，TPE的生成过程为：首先将输入语料库中的每个代码截断为Token序列，使用获得的Token初始化词汇表vocab包括对所有出现在语料库中的二元组Token进行合并，然后对当前语料库中的所有Token二元组进行统计，对二元组Token进行排序和标记，然后利用在语料库中迭代组合频率最高的Token来识别新的基本单元，并将新获得的单元添加到词汇表中，新生成的语料库由新添加的二元组Token组成，TPE将二元组Token视为新的Token，上述过程迭代地进行，通过不断地迭代查找频率更高的Token组合来更新词汇表；在获得最终的词汇表后，利用向后最大匹配方法，根据得到的词汇表将代码语句分割成TPE；Step 1. Determine the basic unit of the TPE represented by the code in the semantic clone detection task. The generation process of the TPE is as follows: first, each code in the input corpus is truncated into a Token sequence, and the obtained Token is used to initialize the vocabulary vocab, including the The two-tuple tokens in the corpus are merged, then all the token two-tuples in the current corpus are counted, the two-tuple tokens are sorted and marked, and then the token with the highest iteration combination frequency in the corpus is used to identify the new basic unit, and add the newly obtained unit to the vocabulary, the newly generated corpus consists of the newly added two-tuple Token, TPE regards the two-tuple Token as a new Token, the above process is performed iteratively, by continuously iterating Find the combination of tokens with higher frequency to update the vocabulary; after obtaining the final vocabulary, use the backward maximum matching method to divide the code statement into TPEs according to the obtained vocabulary;

根据选取的语料库利用TPE算法得到不同语言的TPE的基本单元；According to the selected corpus, the basic unit of TPE in different languages is obtained by using the TPE algorithm;

步骤2、建立适用于代码克隆检测的神经网络模型，利用Skip-Gram模型对步骤1获得的TPE的基本单元进行预训练，生成相应的TPE单元—词向量表示格式的词汇表；将词汇表中这一系列离散的序列，转换为连续的向量表示，实现并训练了一个标准的BiLSTM模型；将BiLSTM模型学习到的TPE的基本单元的向量表示放入一个矩阵中，乘以一个权重矩阵得到一个固定维度的向量，通过不断学习更新权重矩阵把握每个TPE的基本单元向量在整个句子中所占的权重，从而得到整个代码方法的向量表示；具体公式如下：Step 2, establish a neural network model suitable for code clone detection, use the Skip-Gram model to pre-train the basic unit of the TPE obtained in step 1, and generate a corresponding TPE unit-word vector representation format vocabulary; This series of discrete sequences is converted into a continuous vector representation, and a standard BiLSTM model is implemented and trained; the vector representation of the basic unit of TPE learned by the BiLSTM model is put into a matrix, and multiplied by a weight matrix to obtain a For a fixed-dimensional vector, by continuously learning and updating the weight matrix to grasp the weight of the basic unit vector of each TPE in the entire sentence, the vector representation of the entire code method is obtained; the specific formula is as follows:

其中，v表示矢量参数，T表示矩阵的转置，向量中的每个元素代表每个序列节点的重要性，

表示BiLSTM的隐层输出，s_t表示

的显著性，a_t表示注意力权重t位置元素在整个序列的权重参数，h^CODE表示最终的代码向量表示；j表示迭代参数，t表示当前位置；where v represents the vector parameter, T represents the transpose of the matrix, and each element in the vector represents the importance of each sequence node,

Represents the hidden layer output of BiLSTM, s _t represents

The significance of , a _t represents the weight parameter of the attention weight t position element in the whole sequence, h ^CODE represents the final code vector representation; j represents the iteration parameter, t represents the current position;

步骤3、设计Siamese架构进行克隆对/非克隆对的分类检测，具体包括：给出两个代码块转换的两个向量作为输入，通过采用Skip-Gram模型进行预训练，得到每一个代码块的向量表示，再利用向量间欧几里得距离的计算方式计算两个输出向量距离的差异作为分类的特征，向量间欧几里得距离相近的为克隆对，由此获得了最终的克隆/非克隆预测。Step 3. Design the Siamese architecture to classify and detect clone pairs/non-clone pairs, which specifically includes: giving two vectors converted from two code blocks as input, and pre-training by using the Skip-Gram model to obtain the value of each code block. Vector representation, and then use the calculation method of Euclidean distance between vectors to calculate the difference between the distances of the two output vectors as a feature of classification. The Euclidean distance between vectors is similar to the clone pair, thus obtaining the final clone/non-clone pair. Clone prediction.

与现有技术相比，本发明的有益效果如下：Compared with the prior art, the beneficial effects of the present invention are as follows:

(1)新的源代码表示基本单元TPE与基于AST的表示相比，更节省时间，同时可以捕获丰富的语法和语义信息；(1) Compared with the AST-based representation, the new source code representation basic unit TPE is more time-saving and can capture rich syntactic and semantic information at the same time;

(2)TPE还可以避免词汇量不足(OOV)的问题。(2) TPE can also avoid the problem of insufficient vocabulary (OOV).

附图说明Description of drawings

图1为本发明的一种基于深度学习的语义克隆检测方法整体流程图；1 is an overall flow chart of a deep learning-based semantic clone detection method according to the present invention;

图2为利用TPE生成TPE单元的示例图；Fig. 2 is the example diagram that utilizes TPE to generate TPE unit;

图3为BiLSTM网络结构图。Figure 3 is a diagram of the BiLSTM network structure.

具体实施方式Detailed ways

以下结合附图和实施例对本发明进行详细说明。The present invention will be described in detail below with reference to the accompanying drawings and embodiments.

如图1所示，为本发明的一种基于深度学习的语义克隆检测方法整体流程图。该流程具体包括以下处理：As shown in FIG. 1 , it is an overall flow chart of a deep learning-based semantic clone detection method of the present invention. The process specifically includes the following processing:

步骤1、确定克隆代码检测任务中代码表示的TPE基本单元，通过预训练的嵌入可以从大规模原始代码语料库中学习潜在的句法/语义信息，基于TPE生成的词汇表，可以预先训练表达能力更强的嵌入矩阵；Step 1. Determine the basic unit of TPE represented by the code in the clone code detection task. Through the pre-trained embedding, the potential syntactic/semantic information can be learned from the large-scale original code corpus. Based on the vocabulary generated by the TPE, the expression ability can be improved by pre-training. Strong embedding matrix;

TPE基本单元生成过程是：首先将输入语料库中的每个代码截断为Token序列，使用获得的Token初始化词汇表vocab包括对所有出现在语料库中的二元组Token进行合并，然后对当前语料库中的所有Token二元组进行统计，对二元组Token进行排序和标记，然后利用在语料库中迭代组合频率最高的Token来识别新的基本单元，并将新获得的单元添加到词汇表中，新生成的语料库由新添加的二元组Token组成，TPE将二元组Token视为新的Token，上述过程迭代地进行，通过不断地迭代查找频率更高的Token组合来更新词汇表。在获得最终的词汇表后，利用向后最大匹配方法，根据得到的词汇表将代码语句分割成TPE单元(例如一个代码语句为ABC，设置两个指针分别指向语句的第一个字符A和最后一个字符C，首先查找词汇表是否有ABC这个基本单元，如果有，就将整个程序语句ABC都看做一个TPE单元；如果没有，第一个指针后移，查找BC是否在词汇表里面。如果不存在就继续这个过程，最终将整个代码语句都表示为词汇表中的TPE单元。基于新的TPE词汇表，可以通过预训练得到拥有更多信息的词汇表，然后应用在本流程中；The basic unit generation process of TPE is: firstly truncate each code in the input corpus into a Token sequence, use the obtained Token to initialize the vocabulary vocab including merging all the two-tuple Tokens that appear in the corpus, and then use the obtained Token to initialize the vocabulary vocab. All Token binary groups are counted, the binary Tokens are sorted and marked, and then the Token with the highest frequency of iterative combination in the corpus is used to identify the new basic unit, and the newly obtained unit is added to the vocabulary, and the newly generated The corpus of is composed of newly added two-tuple Tokens. TPE regards the two-tuple Token as a new Token. The above process is performed iteratively, and the vocabulary is updated by iteratively searching for more frequent token combinations. After obtaining the final vocabulary, use the backward maximum matching method to divide the code statement into TPE units according to the obtained vocabulary (for example, a code statement is ABC, and two pointers are set to point to the first character A and the last character of the statement respectively. For a character C, first check whether the vocabulary has the basic unit ABC, if so, treat the entire program statement ABC as a TPE unit; if not, move the first pointer backward to find out whether BC is in the vocabulary. If Continue this process if it does not exist, and finally represent the entire code statement as a TPE unit in the vocabulary. Based on the new TPE vocabulary, a vocabulary with more information can be obtained through pre-training, and then applied in this process;

如图2所示，为利用TPE生成TPE单元的示例图，用于演示两个迭代的TPE算法。标记为④⑤的虚线箭头是对分段语料库中Token进行统计的过程。标记为⑥的虚线箭头表示得到统计数据后，即每个标记符号出现的频率，然后更新词汇表以合并新识别的Token组合。所示。标记为⑦的虚线箭头表示迭代的最后一步是用新获得的二元组Token单元更新语料库。V₀显示了从原始代码片段获得的词汇表。As shown in Figure 2, it is an example diagram of using TPE to generate a TPE unit, which is used to demonstrate the TPE algorithm of two iterations. The dashed arrows marked ④⑤ are the process of counting tokens in the segmented corpus. The dashed arrows marked ⑥ indicate that after the statistics are obtained, that is, the frequency of each token symbol, the vocabulary is then updated to incorporate the newly identified token combinations. shown. The dashed arrows marked ⑦ indicate that the last step of the iteration is to update the corpus with the newly obtained bigram token units. _V0 shows the vocabulary obtained from the original code snippet.

TPE(Token Pair Encoding)作为一种具有创新性的分割方法，它利用Token作为基本组件来构造新的代码表示。TPE携带丰富的代码信息，有助于更好地利用基于深度学习的克隆检测的优势。同时，TPE还可以避免词汇量不足(OOV)的问题。As an innovative segmentation method, TPE (Token Pair Encoding) uses Token as a basic component to construct a new code representation. TPE carries rich code information, which helps to better utilize the advantages of deep learning-based clone detection. At the same time, TPE can also avoid the problem of insufficient vocabulary (OOV).

步骤2、建立适用于代码克隆检测的神经网络模型，并进行训练：Step 2. Establish a neural network model suitable for code clone detection and train it:

在进行训练前，先采用Skip-Gram模型进行预训练：首先根据选取的语料库利用TPE算法得到不同语言的TPE基本单元；然后利用Skip-Gram模型对这些TPE单元进行预训练，得到一份TPE单元—词向量表示格式的词汇表。Before training, use the Skip-Gram model for pre-training: first, use the TPE algorithm to obtain TPE basic units in different languages according to the selected corpus; then use the Skip-Gram model to pre-train these TPE units to get a TPE unit — The vocabulary of the word vector representation format.

Skip-Gram模型的原理是给出中心词用模型来预测上下文，利用周围词的预测结果，调整中心词的词向量。在Skip-Gram当中，每个词都要受到周围的词的影响，每个词在作为中心词的时候都要再进行预测和调整，这种重复多次的调整和预测使得学习到的词向量更加准确。因此选用Skip-Gram模型作为TPE单元向量的训练模型。采用工具fastText实现Skip-Gram模型进行训练，生成代码基本单元-向量的词汇表。(补充：在使用fastText命令中设置参数学习率为0.025，训练的词向量维度dim选定为100，上下文窗口大小ws默认为5，epoch和最低出现的词频minCount都使用默认值5，负采样的个数neg定值为5，损失函数loss选定ns，可容纳子词的数量bucket设置为2000000，最大最小字符长度maxn、minn都采用默认值3、6，线程数量thread设置为4，学习率选取默认值100。fastText最终生成后缀为vec和bin的两个文件，bin是一个二进制文件，包含模型的参数以及所有超参数。vec文件就是最终需要的词汇表，格式就是每行一个词(TPE单元)对应一个100维的向量表示。The principle of the Skip-Gram model is to give the center word model to predict the context, and use the prediction results of the surrounding words to adjust the word vector of the center word. In Skip-Gram, each word is affected by the surrounding words, and each word needs to be predicted and adjusted when it is used as a central word. This repeated adjustment and prediction makes the learned word vector more precise. Therefore, the Skip-Gram model is selected as the training model of the TPE unit vector. Use the tool fastText to implement the Skip-Gram model for training, and generate a vocabulary of code basic units-vectors. (Supplement: In the fastText command, set the parameter learning rate to 0.025, the training word vector dimension dim is selected as 100, the context window size ws defaults to 5, epoch and the lowest word frequency minCount both use the default value of 5, negative sampling The number of neg is set to 5, the loss function loss is set to ns, the number of subwords that can be accommodated in bucket is set to 2,000,000, the maximum and minimum character lengths maxn and minn are set to default values of 3 and 6, the number of threads is set to 4, and the learning rate is set to 4. Select the default value of 100. FastText finally generates two files with the suffix vec and bin. bin is a binary file that contains model parameters and all hyperparameters. The vec file is the final vocabulary required, and the format is one word per line (TPE unit) corresponds to a 100-dimensional vector representation.

利用BPE算法的思想从选取的初始语料库中学习到不同语言的TPE基本单元，再采用Skip-Gram模型对这些TPE单元进行预训练，得到一份TPE单元—词向量表示格式的词汇表。结合词汇表将经过TPE单元表示后的代码序列转换为对应的向量表示。Using the idea of the BPE algorithm to learn the basic units of TPE in different languages from the selected initial corpus, and then use the Skip-Gram model to pre-train these TPE units to obtain a TPE unit-word vector representation format vocabulary. Combined with the vocabulary, the code sequence represented by the TPE unit is converted into the corresponding vector representation.

进行预处理：删除源代码的空行和注释，并根据向后最大匹配算法将源代码表示为TPE单元组合成的一个序列。经过预处理后的源代码要参考预训练得到的词汇表来获取每个TPE单元的具体向量表示。经过预处理步骤之后，源代码已经被转化成为了一系列TPE单元的序列。接下来需要将这一系列离散的序列，转换为连续的向量表示，实现并训练了一个标准的BiLSTM模型，增强了TPE嵌入的克隆检测功能。LSTM是一个被广泛利用的网络，用于对TPE单元的序列输入进行编码，BiLSTM是LSTM的双向扩展，具有从右到左的扩展。通过初步实验确定模型的超参数，隐藏层维度设置为100。在输入嵌入层上应用dropou，BiLSTM隐藏层的比例为0.33。采用Adam算法进行参数优化，初始学习率为5×10^-4，梯度剪切阈值为5，最小batch-size为32。通过双向长短记忆神经网络学习到序列中每个位置的元素的语义信息，同时正向和反向的序列信息也被记录在了学习到的向量中。在此基础上，通过一个全局池化层高度提取双向长短记忆神经网络生成的隐层向量，这里使用自注意力机制池化(self-attention pooling)来实现目标。用一层自注意力机制神经网络进行加权求和，将每个Java方法转换为可相互比较的向量。Do preprocessing: remove blank lines and comments from the source code, and represent the source code as a sequence of TPE units combined according to a backward maximum matching algorithm. The pre-processed source code should refer to the pre-trained vocabulary to obtain the specific vector representation of each TPE unit. After a preprocessing step, the source code has been transformed into a sequence of TPE units. Next, we need to convert this series of discrete sequences into continuous vector representations, implement and train a standard BiLSTM model, and enhance the clone detection function of TPE embedding. LSTM is a widely utilized network for encoding sequence inputs to TPE units, and BiLSTM is a bidirectional extension of LSTM with a right-to-left extension. The hyperparameters of the model are determined through preliminary experiments, and the hidden layer dimension is set to 100. Dropou is applied on the input embedding layer, and the scale of the BiLSTM hidden layer is 0.33. Adam algorithm is used for parameter optimization, the initial learning rate is 5×10 ^-4 , the gradient clipping threshold is 5, and the minimum batch-size is 32. The semantic information of each element in the sequence is learned through a bidirectional long-short-term memory neural network, and the forward and reverse sequence information is also recorded in the learned vector. On this basis, the hidden layer vector generated by the bidirectional long short-term memory neural network is extracted through a global pooling layer height, where self-attention pooling is used to achieve the goal. A weighted summation with a layer of self-attention neural network converts each Java method into a vector that is comparable to each other.

利用自注意力机制将BiLSTM学习到的TPE单元的向量表示放入一个矩阵中，乘以一个权重矩阵得到一个固定维度的向量，通过不断学习更新权重矩阵可以把握每个TPE单元向量在整个句子中所占的权重，从而得到整个代码方法的向量表示。具体计算公式如下：The self-attention mechanism is used to put the vector representation of the TPE unit learned by BiLSTM into a matrix, and multiply it by a weight matrix to obtain a fixed-dimensional vector. By continuously learning and updating the weight matrix, each TPE unit vector can be grasped in the entire sentence. The weight occupied by the method, thereby obtaining the vector representation of the entire code method. The specific calculation formula is as follows:

其中，v表示一个矢量参数，T表示矩阵的转置，矩阵中每个元素代表每个序列节点的重要性，通过数值向量的方式表现出来，

表示是BiLSTM的隐层输出，s_t表示

的显著性，代表着每个位置的序列结点向量在整个句子中所占的权重，注意力权重a_t根据h_t本身计算并且进行归一化之后得到，并且在模型不断训练、学习的过程中不断更新，h^CODE表示最终的代码向量表示。Among them, v represents a vector parameter, T represents the transpose of the matrix, and each element in the matrix represents the importance of each sequence node, which is expressed in the form of a numerical vector,

Indicates that it is the hidden layer output of BiLSTM, and s _t represents

The significance of , represents the weight of the sequence node vector at each position in the entire sentence. The attention weight at _{is calculated and normalized according to h t} _itself , and the model is continuously trained and learned. Constantly updated in , h ^CODE represents the final code vector representation.

表示BiLSTM的隐层输出，s_t表示

的显著性，代表着每个位置的序列结点向量在整个句子中所占的权重，a_t表示注意力权重t位置元素在整个序列的权重参数，根据h_t本身计算并且进行归一化之后得到，并且在模型不断训练、学习的过程中不断更新，h^CODE表示最终的代码向量表示；j表示迭代参数，t表示当前位置；where v represents the vector parameter, T represents the transpose of the matrix, and each element in the vector represents the importance of each sequence node,

Represents the hidden layer output of BiLSTM, s _t represents

The significance of , represents the weight of the sequence node vector at each position in the entire sentence, a _t represents the weight parameter of the attention weight t position element in the entire sequence, calculated and normalized according to h _t itself Obtained, and continuously updated in the process of continuous training and learning of the model, h ^CODE represents the final code vector representation; j represents the iteration parameter, and t represents the current position;

双向长短记忆神经网络学习到正向序列和反向序列的所有信息，能够更加有力地捕捉序列的语义和时序信息，比普通的长短记忆神经网络学习地更加充分，预测更加准确。The bidirectional long-short-term memory neural network learns all the information of the forward sequence and the reverse sequence, which can capture the semantic and timing information of the sequence more powerfully, and can learn more fully and predict more accurately than the ordinary long-short-term memory neural network.

如图3所示，为BiLSTM网络结构图。双向长短期记忆网络(BiLSTM)由两个普通的长短期记忆网络(LSTM)组成，正向的LSTM利用过去时刻的信息，逆向的LSTM利用未来时刻的信息。BiLSTM通过如下公式递归计算隐藏输出向量：As shown in Figure 3, it is the structure diagram of BiLSTM network. Bi-directional long short-term memory network (BiLSTM) consists of two ordinary long short-term memory networks (LSTM), the forward LSTM uses the information of the past moment, and the reverse LSTM uses the information of the future moment. BiLSTM recursively calculates the hidden output vector by the following formula:

f_t＝σ(W_f[h_t-1,x_t]+b_f)f _t =σ(W _f [h _t-1 ,x _t ]+b _f )

i_t＝σ(W_i[h_t-1,x_t]+b_i)i _t =σ(W _i [h _t-1 ,x _t ]+b _i )

o_t＝σ(W_o[h_t-1,x_t]+b_o)o _t =σ(W _o [h _t-1 ,x _t ]+b _o )

h_t＝o_t⊙tanh(c_t)h _t =o _t ⊙tanh(c _t )

其中，x₁,x₂,…,x_n为输入，h₁,h₂,…,h_n为输出，其他变量如W、b为模型参数。从右到左的方向只是以相反的方式预演了相同的计算。Among them, x ₁ , x ₂ ,…, x _n are inputs, h ₁ , h ₂ ,…, h _n are outputs, and other variables such as W and b are model parameters. The right-to-left direction just rehearsed the same calculation in reverse.

步骤3、设计一个Siamese架构用于克隆对/非克隆对的分类检测，具体包括：比较输入片段对的代码表示，然后计算它们的差异作为分类的特征，最后根据分类特征进行最终的克隆/非克隆预测；将代码克隆检测问题形式化为一个监督的二分类任务，即给出两个代码块转换的两个向量作为输入，再利用向量间欧几里得距离的计算方式计算两个输出向量的差异作为分类的特征，向量间欧几里得距离相近的为克隆对。如果它们是克隆对，设置其标签为1，如果不是克隆对，设置它们的标签为0；最终也得到了给定输入对为克隆和非克隆的概率。Step 3. Design a Siamese architecture for the classification detection of clone/non-clone pairs, including: comparing the code representations of the input fragment pairs, then calculating their differences as classification features, and finally performing the final cloning/non-cloning based on the classification features. Clone prediction; formalize the code clone detection problem as a supervised binary classification task, that is, given two vectors converted from two code blocks as input, and then use the Euclidean distance between the vectors to calculate the two output vectors The difference is used as the classification feature, and the Euclidean distance between vectors is close to the clone pair. If they are clone pairs, set their labels to 1, and if they are not clone pairs, set their labels to 0; also finally get the probability that a given input pair is clone and non-clone.

Siamese网络主要思想是通过一个函数将输入映射到目标空间，在目标空间使用简单的距离进行对比相似度。将代码克隆检测转换为一个二分类问题，提出了一个简单有效的BiLSTM模型，通过BiLSTM将离散的代码片段转换为低维度，连续的向量表示。设计了一个Siamese架构，两个BiLSTM子网络结构相同，且共享权值。The main idea of Siamese network is to map the input to the target space through a function, and use a simple distance to compare the similarity in the target space. Converting code clone detection into a binary classification problem, we propose a simple and effective BiLSTM model that converts discrete code fragments into low-dimensional, continuous vector representations via BiLSTM. A Siamese architecture is designed, where the two BiLSTM sub-networks have the same structure and share weights.

本发明采用Siamese架构和标准的BiLSTM分类模型进行码克隆检测，在保证码克隆检测有效性的同时提高了效率。通过TPE单元可以获取丰富的语法和语义信息，可以有效的进行语义克隆的检测。The present invention adopts the Siamese architecture and the standard BiLSTM classification model for code clone detection, which improves the efficiency while ensuring the effectiveness of the code clone detection. Through the TPE unit, rich syntactic and semantic information can be obtained, which can effectively detect semantic clones.

本发明实施例描述如下：The embodiments of the present invention are described as follows:

步骤1，数据选取和预处理：Step 1, data selection and preprocessing:

1-1、准备数据集，所述数据集包括：1-1. Prepare a data set, the data set includes:

(1)BigCloneBench是被广泛采用的检测克隆的评价基准之一，许多代码克隆检测工具都以此数据集为评价基准。BigCloneBench数据集针对Java语言中的克隆代码进行创建，在旧版本的BigCloneBench中仅包含600万个标记的真克隆对和26万个标记的假克隆对，涵盖了10个功能。新版本中包含超过800万对标记为克隆的Java代码对(其中大多数是三型和四型克隆)以及279,032个标记为非克隆对的代码。对于BigCloneBench数据集，从每个类型中选取20000对函数进行实验，不足20000的选取全部。(1) BigCloneBench is one of the widely used evaluation benchmarks for detecting clones, and many code clone detection tools use this dataset as the evaluation benchmark. The BigCloneBench dataset was created for the cloned code in the Java language. The older version of BigCloneBench only contains 6 million labeled true clone pairs and 260 thousand labeled fake clone pairs, covering 10 features. The new release includes more than 8 million pairs of Java code marked as clones (most of which are type III and type four clones) and 279,032 code pairs marked as non-clones. For the BigCloneBench dataset, 20,000 pairs of functions are selected from each type for experimentation, and all of those less than 20,000 are selected.

(2)另一个代码克隆评价基准OJClone是针对C语言程序创建的。OJClone基于在线编程开放性评判系统(Open Judge System)，常被称为OJ系统[8]，它的构建方法为，在OJ系统中选择104个编程题目，对于同一编程题目，将由不同人员提交的代码片段视为克隆对。OJClone数据集没有明确说明克隆的类型，但通常认为OJClone的大多数克隆对都是三型克隆或四型克隆]。从OJClone的前15个编程问题中选择500个程序。相同的编程问题可以形成124,750个克隆对。不同的编程问题可以组合成2800多万个非克隆对。随机选取5万个函数对，其中克隆对和非克隆的比例是1:14。(2) Another code clone evaluation benchmark OJClone is created for C language programs. OJClone is based on the Open Judge System of online programming, often referred to as the OJ system [8]. Its construction method is to select 104 programming topics in the OJ system. For the same programming topic, different people will submit Code snippets are considered clone pairs. The OJClone dataset does not explicitly state the type of clones, but it is generally believed that most clone pairs in OJClone are either tri- or tetra-clones]. Choose 500 programs from OJClone's top 15 programming questions. The same programming problem can form 124,750 clone pairs. Different programming problems can be combined into more than 28 million non-clone pairs. 50,000 function pairs were randomly selected, and the ratio of clone pairs to non-clone pairs was 1:14.

(3)Google Code Jam(GCJ)是谷歌每年举办的在线国际节目比赛。比赛的内容包括一系列算法问题，这些问题必须在规定的时间内解决。参加者可以使用自己选择的任何编程语言和开发环境来回答问题。针对同一问题的每个项目由不同的程序员实现，谷歌验证了其正确性。因此，同一问题的处理过程应该是函数相似的。从2016版本中选取了12个问题共1665个Java函数，分别组成了约27万对克隆对和100万的非克隆对，最终随机选取了5万的函数对，其中克隆对和非克隆的比例是1：4。(3) Google Code Jam (GCJ) is an online international programming competition held by Google every year. The competition consists of a series of algorithmic problems that must be solved within a specified time. Participants can answer questions using any programming language and development environment of their choice. Each project for the same problem was implemented by a different programmer, and Google verified its correctness. Therefore, the processing of the same problem should be functionally similar. A total of 1665 Java functions were selected from 12 questions from the 2016 version, which respectively formed about 270,000 clone pairs and 1 million non-clone pairs. Finally, 50,000 function pairs were randomly selected, among which the ratio of clone pairs to non-clone pairs is 1:4.

1-2、对于步骤1-1所选取数据，利用前面提到的TPE算法将函数利用TPE基本单元进行表示。针对Java和C语言训练了不同的TPE词汇表。1-2. For the data selected in step 1-1, use the TPE algorithm mentioned above to represent the function with the TPE basic unit. Different TPE vocabularies are trained for Java and C languages.

步骤2，训练集、测试集划分：对于每个数据集，随机将其分为三个部分，分别为60％、20％、20％用于训练、验证和测试。Step 2, training set, test set division: For each data set, it is randomly divided into three parts, 60%, 20%, and 20% are used for training, validation and testing.

步骤3，利用模型进行训练Step 3, use the model for training

将经过预处理后的TPE基本单元向量表示输入到第一层LSTM单元，获得此基本单元的前一个单元对它的影响的特征，再将其输入到第二层LSTM单元，获得字符的后一个字符对它的影响。再将第一层LSTM的输出和第二层LSTM的输出进行拼接组合。通过训练，输出的特征向量包含了这个代码单元上下文信息及其序列信息。Input the preprocessed TPE basic unit vector representation to the first layer LSTM unit to obtain the characteristics of the influence of the previous unit of this basic unit on it, and then input it to the second layer LSTM unit to obtain the next character of the character. The effect of characters on it. Then the output of the first layer LSTM and the output of the second layer LSTM are spliced and combined. After training, the output feature vector contains the context information of this code unit and its sequence information.

通过双向长短记忆神经网络学习到序列中每个位置的元素的语义信息，同时正向和反向的序列信息也被记录在了学习到的向量中。在此基础上，通过一个全局池化层高度提取双向长短记忆神经网络生成得隐层向量，这里使用自注意力机制池化(self-attention pooling)来实现目标。用一层自注意力机制神经网络进行加权求和，将每个Java方法转换为可相互比较的向量。至此，已经将原来的纯代码文本，转换为了数字化的向量，接着，利用向量间欧几里得距离的计算公式计算两个向量的差异。The semantic information of each element in the sequence is learned through a bidirectional long-short-term memory neural network, and the forward and reverse sequence information is also recorded in the learned vector. On this basis, the hidden layer vector generated by the bidirectional long-short-term memory neural network is extracted through a global pooling layer height. Here, the self-attention pooling mechanism is used to achieve the goal. A weighted summation with a layer of self-attention neural network converts each Java method into a vector that is comparable to each other. So far, the original plain code text has been converted into a digitized vector, and then the difference between the two vectors is calculated using the Euclidean distance between vectors.

将本发明与两种最先进的克隆检测方法TBCCD和ASTNN从精度(P)、召回率(R)、F1得分(F1)、数据处理时间(data-time)和测试时间(test-time)方面进行比较进行比较。。Compare the present invention with two state-of-the-art clone detection methods TBCCD and ASTNN in terms of precision (P), recall (R), F1 score (F1), data-time (data-time) and test-time (test-time) Compare to compare. .

如表1所示，现有技术的ASTNN，模型、TBCCD模型与本发明的TPE模型在BigCloneBench数据集上的检测结果。As shown in Table 1, the detection results of the prior art ASTNN model, the TBCCD model and the TPE model of the present invention on the BigCloneBench data set.

表1Table 1

可以看到在BigCloneBench这个数据集上，本发明的模型比最新检测语义克隆的其他两个工具F值都要高。尤其在检测效率上本发明的TPE基本单元表示代码的方法比另外两个利用AST表示代码的工具在数据处理(data-time)上提升了接近2.5倍。It can be seen that on the BigCloneBench dataset, the model of the present invention has a higher F value than the other two latest tools for detecting semantic clones. Especially in terms of detection efficiency, the TPE basic unit code representation method of the present invention is nearly 2.5 times higher in data-time than the other two tools using AST to represent codes.

在模型检测速度上面，本发明的标准BiLstm模型比两外两个模型效率提升了接近3倍。这是因为TBCCD使用基于树的卷积模型和max-pooling，ASTNN使用RvNN和RNN模型进行语句和代码编码，而只使用一个简单的BiLSTM模型，BiLSTM可以学习到正向序列和反向序列的所有信息，能够更加有力地捕捉序列的语义和时序信息。In terms of model detection speed, the standard BiLstm model of the present invention is nearly 3 times more efficient than the other two models. This is because TBCCD uses tree-based convolution model and max-pooling, ASTNN uses RvNN and RNN models for sentence and code encoding, and only uses a simple BiLSTM model, BiLSTM can learn all the forward and reverse sequences information, which can more powerfully capture the semantic and timing information of the sequence.

表2列出了本发明的模型和ASTNN、TBCCD在OJClone数据集上的检测结果。Table 2 lists the model of the present invention and the detection results of ASTNN and TBCCD on the OJClone dataset.

在OJClone数据集上本发明的检测效果依旧优于TBCCD，模型表现没有ASTNN好，这是因为ASTNN在表示代码时首先为每个代码片段构建一个AST，并将整个AST分解成小的语句树(由语句的AST节点组成的树，以语句节点为根)。然后采用多路语句树上的递归编码器捕获语句级词法和句法信息，获得语句向量。最后通过递归神经网络得到代码表示，这种方式捕获的代码信息比较全面，但是同时也耗费了很多时间。On the OJClone data set, the detection effect of the present invention is still better than that of TBCCD, and the model performance is not as good as that of ASTNN. This is because ASTNN first constructs an AST for each code fragment when representing the code, and decomposes the entire AST into small statement trees ( A tree consisting of the AST nodes of the statement, rooted at the statement node). Then, a recursive encoder on a multi-way sentence tree is used to capture sentence-level lexical and syntactic information to obtain sentence vectors. Finally, the code representation is obtained through the recurrent neural network. The code information captured in this way is more comprehensive, but it also consumes a lot of time.

如表3所示，为TPE模型、ASTNN模型、TBCCD模型在GCJ数据集上的检测结果。As shown in Table 3, the detection results of the TPE model, the ASTNN model, and the TBCCD model on the GCJ dataset are shown.

表3table 3

在GCJ数据集上，本发明的模型依旧比TBCCD表现好，可以达到和ASTNN接近的效果，同时在时间方面优势明显。无论是在哪一个数据集上，ASTNN和TBCCD的处理时间都比TPE长，这是因为ASTNN和TBCCD都是基于AST来表达代码的。On the GCJ data set, the model of the present invention still performs better than TBCCD, and can achieve an effect close to that of ASTNN, while having obvious advantages in time. Regardless of the dataset, the processing time of ASTNN and TBCCD is longer than that of TPE, because both ASTNN and TBCCD express code based on AST.

Claims

1. a semantic clone detection method based on deep learning, is characterized in that, this method specifically comprises following process:

Step 1. Determine the basic unit of the TPE represented by the code in the semantic clone detection task. The generation process of the TPE is as follows: first, each code in the input corpus is truncated into a Token sequence, and the obtained Token is used to initialize the vocabulary vocab, including the The two-tuple tokens in the corpus are merged, then all the token two-tuples in the current corpus are counted, the two-tuple tokens are sorted and marked, and then the token with the highest iteration combination frequency in the corpus is used to identify the new basic unit, and add the newly obtained unit to the vocabulary, the newly generated corpus consists of the newly added two-tuple Token, TPE regards the two-tuple Token as a new Token, the above process is performed iteratively, by continuously iterating Find the combination of tokens with higher frequency to update the vocabulary; after obtaining the final vocabulary, use the backward maximum matching method to divide the code statement into TPEs according to the obtained vocabulary;

According to the selected corpus, the basic unit of TPE in different languages is obtained by using the TPE algorithm;

Step 2, establish a neural network model suitable for code clone detection, use the Skip-Gram model to pre-train the basic unit of the TPE obtained in step 1, and generate a corresponding TPE unit-word vector representation format vocabulary; This series of discrete sequences is converted into a continuous vector representation, and a standard BiLSTM model is implemented and trained; the vector representation of the basic unit of TPE learned by the BiLSTM model is put into a matrix, and multiplied by a weight matrix to obtain a For a fixed-dimensional vector, by continuously learning and updating the weight matrix to grasp the weight of the basic unit vector of each TPE in the entire sentence, the vector representation of the entire code method is obtained; the specific formula is as follows:

where v represents the vector parameter, T represents the transpose of the matrix, and each element in the vector represents the importance of each sequence node,

Represents the hidden layer output of BiLSTM, s _t represents

Step 3. Design the Siamese architecture to classify and detect clone pairs/non-clone pairs, which specifically includes: giving two vectors converted from two code blocks as input, and pre-training by using the Skip-Gram model to obtain the value of each code block. Vector representation, and then use the calculation method of Euclidean distance between vectors to calculate the difference between the distances of the two output vectors as a feature of classification. The Euclidean distance between vectors is similar to the clone pair, thus obtaining the final clone/non-clone pair. Clone prediction.