CN110414238A

CN110414238A - Homologous binary code retrieval method and device

Info

Publication number: CN110414238A
Application number: CN201910526523.2A
Authority: CN
Inventors: 石志强; 马原; 张国栋; 杨寿国; 朱红松; 孙利民
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2019-06-18
Filing date: 2019-06-18
Publication date: 2019-11-05

Abstract

Embodiments of the present invention provide a method and device for retrieving homologous binary code. The method includes: determining all basic blocks of a function to be detected and the basic-block-level information of each basic block; generating a control flow graph of the function to be detected; determining the dependencies between all basic blocks according to the control flow graph; taking the basic-block-level information and dependencies of all basic blocks as the control call information of the function to be detected; inputting the control call information into a pre-trained neural network model, outputting the encoding vector of the function to be detected, and calculating the hash signature of the encoding vector as the hash signature to be retrieved; and searching a pre-built hash signature library for a hash signature identical to the hash signature to be retrieved and, if one exists, taking the binary code corresponding to the retrieved hash signature as the homologous binary code of the function to be detected. The embodiments make it possible to assess the risk of firmware that may contain vulnerabilities and provide an analysis reference for vulnerability security researchers.

Description

Homologous binary code retrieval method and device

Technical Field

The present invention relates to the technical field of vulnerability mining, and more particularly to a method and device for retrieving homologous binary code.

Background

Vulnerabilities are a major factor affecting the security of information systems and cyberspace. Firmware vulnerabilities disclosed in recent years, together with large-scale malicious attacks launched against Internet of Things (IoT) devices, show that IoT devices are becoming a focal target of malicious attacks and that firmware is the attackers' preferred target. Because firmware interacts directly with IoT devices, its security affects not only the security of information systems but also the security of physical devices.

During firmware development, shared low-level libraries and third-party SDKs are widely used, so homologous vulnerabilities are common across the firmware of different IoT devices. When a vulnerability is disclosed in one firmware image, other firmware containing the same homologous code is also at high risk. Homologous code association techniques can quickly retrieve, from massive amounts of firmware binary code, code fragments similar to a given vulnerable binary code and so narrow the scope of subsequent manual analysis. This makes it possible to assess the risk of firmware that may contain vulnerabilities and gives vulnerability security researchers a reference for analysis.

Research on homologous code association often uses the function structure information of binary code, such as function call graphs and intra-function control flow graphs, and applies graph isomorphism techniques for structural matching to obtain a similarity. In 2004, Rudinger K, Gamble J K, Bach E, et al., "Comparing algorithms for graph isomorphism using discrete- and continuous-time quantum random walks", proposed a graph-based comparison method built on instruction similarity. The basic idea is to start the analysis from the entry address of the executable file, represent each function as a function control flow graph whose nodes are at instruction level, compare instructions for similarity to obtain a similarity graph between nodes, and then simplify and merge the function control flow graphs based on the result to obtain the largest possible similarity graph. The drawbacks of this method are that it is easily affected by instruction reordering caused by compiler optimization, and that the algorithm cannot skip a small local block of mismatched code to find a new starting point for comparison.

In the same year, Halvar Flake, "Structural Comparison of Executable Objects", proposed treating the whole executable file as a call graph whose nodes are functions, assigning a structural signature to each function, and using the structural signatures to compare functions and identify the relationships between them. This method focuses on structural changes in executable files. Its drawbacks are that it cannot detect non-structural changes and that, because several functions may share the same structural signature, some functions cannot be compared.

In 2017, Chang Qing, Liu Zhongjin, Wang Mengtao, et al., "VDNS: A Cross-Platform Firmware Vulnerability Association Algorithm", proposed a cross-platform method for associating firmware vulnerability functions. The method takes functions as the objects of association and, through feature extraction and numerical processing, computes the similarity of vulnerability functions in order to detect vulnerabilities. Its drawback is that pairwise function comparison is only suitable for small data sets: when the data scale reaches tens of millions of functions, the time cost of pairwise comparison can reach tens of thousands of hours.

Summary of the Invention

Embodiments of the present invention provide a method and device for retrieving homologous binary code that overcome the above problems or at least partially solve them.

In a first aspect, an embodiment of the present invention provides a method for retrieving homologous binary code, including:

determining all basic blocks of a function to be detected and the basic-block-level information of each basic block, generating a control flow graph of the function to be detected, determining the dependencies between all basic blocks according to the control flow graph, and taking the basic-block-level information and dependencies of all basic blocks as the control call information of the function to be detected;

inputting the control call information into a pre-trained neural network model, outputting the encoding vector of the function to be detected, and calculating the hash signature of the encoding vector as the hash signature to be retrieved;

searching a pre-built hash signature library for a hash signature identical to the hash signature to be retrieved and, if one exists, taking the binary code corresponding to the retrieved hash signature as the homologous binary code of the function to be detected.

In a second aspect, an embodiment of the present invention provides a device for retrieving homologous binary code, including:

a control call information generation module, configured to determine all basic blocks of a function to be detected and the basic-block-level information of each basic block, generate a control flow graph of the function to be detected, determine the dependencies between all basic blocks according to the control flow graph, and take the basic-block-level information and dependencies of all basic blocks as the control call information of the function to be detected;

a hash signature generation module, configured to input the control call information into a pre-trained neural network model, output the encoding vector of the function to be detected, and calculate the hash signature of the encoding vector as the hash signature to be retrieved;

a retrieval module, configured to search a pre-built hash signature library for a hash signature identical to the hash signature to be retrieved and, if one exists, take the binary code corresponding to the retrieved hash signature as the homologous binary code of the function to be detected.

In a third aspect, an embodiment of the present invention provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method provided in the first aspect when executing the program.

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method provided in the first aspect.

The method and device for retrieving homologous binary code provided by the embodiments of the present invention use deep learning to encode the features of binary code into an encoding vector (feature vector). Locality-sensitive hashing can then be used to speed up online matching, so that code fragments similar to a given vulnerable binary code can be retrieved quickly from massive amounts of firmware binary code and the scope of subsequent manual analysis is narrowed. This makes it possible to assess the risk of firmware that may contain vulnerabilities and gives vulnerability security researchers a reference for analysis.

Description of the Drawings

To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a method for retrieving homologous binary code according to an embodiment of the present invention;

FIG. 2 is a dependency diagram for updating basic block node attributes during the iterative process according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a device for retrieving homologous binary code according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention.

Detailed Description

To make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

FIG. 1 is a schematic flowchart of a method for retrieving homologous binary code according to an embodiment of the present invention. As shown in FIG. 1, the retrieval method includes steps S101, S102 and S103. Specifically:

S101. Determine all basic blocks of the function to be detected and the basic-block-level information of each basic block, generate a control flow graph of the function to be detected, determine the dependencies between all basic blocks according to the control flow graph, and take the basic-block-level information and dependencies of all basic blocks as the control call information of the function to be detected.

It should be noted that, in embodiments of the present invention, a reverse-engineering tool can be used to split a binary program into functions, yielding multiple functions. Existing techniques are then used to generate, for each function, a control flow graph in the form of a directed graph. In the control flow graph, each node corresponds to one basic block of the function, and each directed edge between nodes corresponds to a jump between two basic blocks. A basic block is a sequence of assembly instructions that executes as a unit, usually with one entry point and one exit point: execution starts at the entry instruction and runs through the instructions in order until the exit instruction. By processing the control flow graph, the basic-block-level information within the function and the structural information between basic blocks can be obtained. The structural information between basic blocks is the jump relationship between them.
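As a rough illustration of how the jump relationships in a control flow graph give the dependencies between basic blocks, the sketch below stores the graph as a list of directed edges and derives, for each block, the blocks that jump to it. The block names and the `predecessors` helper are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def predecessors(cfg_edges):
    """Map each basic block to the blocks that have a jump edge into it."""
    deps = defaultdict(list)
    for src, dst in cfg_edges:
        deps[dst].append(src)
    return dict(deps)


# Hypothetical function with three basic blocks; v1 and v2 both jump to v3.
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v3")]
deps = predecessors(edges)
print(deps)
```

Blocks without incoming edges, such as the entry block, simply do not appear in the mapping.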

As an optional embodiment, the basic-block-level information includes, for each basic block, the number of string constants, the number of numeric constants, the number of transfer instructions, the number of call instructions, the number of arithmetic instructions, the number of assembly instructions, the number of child nodes, and the betweenness centrality.
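The counting attributes listed above could be gathered along the following lines. This is only a sketch under stated assumptions: the mnemonic groups, the `(mnemonic, operands)` instruction format and the `block_attributes` helper are all hypothetical, and betweenness centrality is left out because it depends on the whole graph rather than on a single block.

```python
# Hypothetical mnemonic groups for an x86-like instruction set.
TRANSFER = {"jmp", "je", "jne", "jg", "jl", "ret"}
CALL = {"call"}
ARITH = {"add", "sub", "mul", "imul", "div", "inc", "dec"}


def block_attributes(instructions, n_children):
    """Count seven per-block attributes (betweenness centrality omitted).

    instructions: list of (mnemonic, operands); an operand is an int
    (numeric constant), a quoted str (string constant), or a plain str.
    """
    operands = [op for _, ops in instructions for op in ops]
    mnemonics = [m for m, _ in instructions]
    return [
        sum(isinstance(op, str) and op.startswith('"') for op in operands),
        sum(isinstance(op, int) for op in operands),
        sum(m in TRANSFER for m in mnemonics),
        sum(m in CALL for m in mnemonics),
        sum(m in ARITH for m in mnemonics),
        len(mnemonics),
        n_children,
    ]


# Hypothetical basic block with two successors in the control flow graph.
block = [
    ("mov", ["eax", 4]),
    ("add", ["eax", 1]),
    ("push", ['"hello"']),
    ("call", ["printf"]),
    ("jne", ["loc_1"]),
]
attrs = block_attributes(block, n_children=2)
print(attrs)
```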

Combining the basic-block-level information with the dependencies yields the control call information of the function to be detected. As an optional embodiment, the control call information is an attributed control flow graph (ACFG), in which each node is a basic block carrying a set of attributes. An ACFG is essentially a directed graph of the basic blocks inside a function, and the cross-platform nature of these intra-function attributes has already been demonstrated.

S102. Input the control call information into a pre-trained neural network model, output the encoding vector of the function to be detected, and calculate the hash signature of the encoding vector as the hash signature to be retrieved.

It should be noted that the embodiment of the present invention uses the pre-trained neural network model to represent the control call information abstractly as an encoding vector. The encoding vector is then transformed with a predetermined locality-sensitive hash function to obtain its hash signature.
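The patent does not say which locality-sensitive hash function is used. One common choice for vectors compared by cosine similarity is random-hyperplane hashing, sketched below purely as an assumption: each signature bit records on which side of a random hyperplane the encoding vector falls. The helper names, the vector dimension and the signature length are illustrative only.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals; fixing the seed makes the
    hash function predetermined and repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, planes):
    """Pack the sign of each projection into one integer signature."""
    bits = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


planes = make_hyperplanes(dim=4, n_bits=16)
sig = lsh_signature([0.2, -1.0, 0.5, 0.3], planes)
# A vector pointing in the same direction gets the same signature.
sig_scaled = lsh_signature([0.4, -2.0, 1.0, 0.6], planes)
print(sig == sig_scaled)
```

Because only the sign of each projection is kept, vectors separated by a small angle tend to agree on most or all bits, which is what allows an exact-match lookup on signatures to stand in for a similarity search.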

S103. Search the pre-built hash signature library for a hash signature identical to the hash signature to be retrieved and, if one exists, take the binary code corresponding to the retrieved hash signature as the homologous binary code of the function to be detected.

It should be noted that the binary code in a system firmware library can be processed with the methods of steps S101 and S102 to obtain a large number of hash signatures, from which the hash signature library is built. The library can be built offline, so building it consumes no online search time. When an online search task is received, a hash signature is first generated for the binary function code to be queried, using the same method as in the offline stage. The hash signature is then looked up in the function hash database, which finally yields the information of all homologous binary code in the library that satisfies the condition.
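The offline and online stages described above can be sketched with an in-memory index that maps each hash signature to the functions that produced it. The `SignatureLibrary` class, the integer signatures and the function identifiers are hypothetical; a real deployment would presumably use a persistent database.

```python
from collections import defaultdict


class SignatureLibrary:
    """Exact-match index from hash signature to function identifiers."""

    def __init__(self):
        self._index = defaultdict(list)

    def add(self, signature, function_id):
        """Offline stage: register one function under its signature."""
        self._index[signature].append(function_id)

    def lookup(self, signature):
        """Online stage: return every function sharing the signature."""
        return list(self._index.get(signature, []))


library = SignatureLibrary()
# Hypothetical signatures computed offline for firmware functions.
library.add(0b1011, "firmware_a.bin:sub_4010")
library.add(0b1011, "firmware_b.bin:sub_8F20")
library.add(0b0110, "firmware_c.bin:sub_1200")

matches = library.lookup(0b1011)
print(matches)
```

Each lookup is a single dictionary access, so the online cost does not grow with the number of functions in the library, in contrast to pairwise comparison.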

It should be noted that the method for retrieving homologous binary code according to the embodiment of the present invention uses deep learning to encode the features of binary code into an encoding vector (feature vector). Locality-sensitive hashing can then be used to speed up online matching, so that code fragments similar to a given vulnerable binary code can be retrieved quickly from massive amounts of firmware binary code and the scope of subsequent manual analysis is narrowed. This makes it possible to assess the risk of firmware that may contain vulnerabilities and gives vulnerability security researchers a reference for analysis.

On the basis of the above embodiments, as an optional embodiment, the training method of the neural network model includes S201 and S202. Specifically:

S201. Prepare a pair of sample functions whose homology is known, and input the control call information of one sample function into each of two pre-built neural network models, where the two neural network models share parameters.

It should be noted that the neural network model of the embodiment is trained by combining two parameter-sharing models, which ensures that the cosine distance between encoding results can be used to measure the homology of the binary files they represent.

S202. Calculate the cosine between the encoding vectors output by the two neural network models, input the cosine and the homology label of the two sample functions into the loss function of the neural network model, and adjust the parameters of the neural network model according to the value of the loss function;

where the loss function is defined as:

W denotes all parameters of the neural network model Φ; b₁ and b₂ denote the control call information of the two sample functions; Φ(b₁) and Φ(b₂) denote the encoding vectors of the two sample functions output by the neural network model; and π(b₁, b₂) indicates whether the two sample functions are homologous, taking the value +1 for a homologous pair and -1 for a non-homologous pair. The output is the distance between the two function encoding vectors under the cosine measure. Using the homology label of the sample function pair, the cosine of the two function encoding vectors and the definition of the loss function, back-propagation is performed on this training architecture. As an optional embodiment, the neural network model may be a deep learning encoding model with a Siamese architecture. A Siamese network learns a similarity measure from data and uses the learned measure to compare and match samples of new, unseen classes.
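The loss formula itself does not survive in this text. Given the symbols defined above, one plausible form, and only an assumption here, is the squared difference between the cosine of the two encoding vectors and the label π(b₁, b₂), summed over the training pairs and minimised over W. The sketch below computes that assumed per-pair loss; `cosine` and `pair_loss` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero encoding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pair_loss(enc1, enc2, label):
    """Assumed loss: (cos(enc1, enc2) - label) ** 2, with label +1 for a
    homologous pair and -1 for a non-homologous pair."""
    return (cosine(enc1, enc2) - label) ** 2


same_direction = pair_loss([1.0, 0.0], [1.0, 0.0], +1)
wrongly_similar = pair_loss([1.0, 0.0], [1.0, 0.0], -1)
orthogonal = pair_loss([1.0, 0.0], [0.0, 1.0], +1)
print(same_direction, wrongly_similar, orthogonal)
```

Under this assumed form the loss is zero when a homologous pair is encoded in the same direction and largest when a non-homologous pair is, so gradient descent pushes homologous pairs together and non-homologous pairs apart.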

On the basis of the above embodiments, as an optional embodiment, inputting the control call information of the function into the pre-trained neural network model and outputting the encoding vector of the function is specifically:

inputting the control call information into the pre-trained neural network model for a preset number of iterations, where in each iteration, for any basic block to be operated on, the values after the previous iteration of all other basic blocks on which that basic block depends are summed and passed through a ReLU transformation to obtain a first reference value; the value of that basic block after the previous iteration is linearly transformed to obtain a second reference value; and the sum of the first reference value and the second reference value is taken as the value of that basic block after the current iteration;

fitting the results of the last iteration of all basic blocks according to a fuzzy histogram method, and then applying a linear network transformation with bias to output the encoding vector of the function;

where the initial value of each basic block is the basic-block-level information of that basic block.
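The iteration described above can be sketched in plain Python as follows. Two points are assumptions rather than the patent's method: a plain sum over all blocks stands in for the fuzzy-histogram fitting step, whose details are not given, and the weight matrices `w1`, `w2` and `bias` are supplied by the caller rather than trained. All helper names are hypothetical.

```python
def relu(vec):
    return [max(0.0, x) for x in vec]


def matvec(mat, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in mat]


def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]


def encode(attrs, deps, w1, w2, bias, n_iter=3):
    """attrs: block -> initial attribute vector.
    deps: block -> list of blocks it depends on (its predecessors)."""
    dim = len(next(iter(attrs.values())))
    values = {b: [float(x) for x in vec] for b, vec in attrs.items()}
    for _ in range(n_iter):
        new_values = {}
        for block, old in values.items():
            total = [0.0] * dim
            for dep in deps.get(block, []):
                total = vec_add(total, values[dep])
            first = relu(total)        # first reference value
            second = matvec(w1, old)   # second reference value
            new_values[block] = vec_add(first, second)
        values = new_values
    # Assumption: a plain sum replaces the fuzzy-histogram fitting step.
    pooled = [0.0] * dim
    for vec in values.values():
        pooled = vec_add(pooled, vec)
    return vec_add(matvec(w2, pooled), bias)


# Hypothetical two-attribute blocks; v1 and v2 both point to v3.
attrs = {"v1": [1, 0], "v2": [0, 2], "v3": [1, 1]}
deps = {"v3": ["v1", "v2"]}
w1 = [[0.5, 0.0], [0.0, 0.5]]
w2 = [[1.0, 0.0], [0.0, 1.0]]
encoding = encode(attrs, deps, w1, w2, bias=[0.0, 0.0], n_iter=1)
print(encoding)
```

All blocks are updated from the previous round's values, which is why the new values are collected in a separate dictionary before replacing the old ones.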

FIG. 2 is a dependency diagram for updating basic block node attributes during the iterative process according to an embodiment of the present invention. FIG. 2 uses two iterations as an example. Each node label in FIG. 2 denotes the value of node v_i after the j-th iteration, where i is 1, 2 or 3 and j is 0, 1 or 2. W₂X denotes a fully connected layer of the neural network, which can be understood as one linear transformation, that is, one matrix multiplication; the encoding vector of the function is finally obtained after this fully connected layer, and μ denotes the output of the model. Take node v₃ as an example: both v₁ and v₂ have edges pointing to v₃, so in each round of updating, the node attributes that v₃ depends on include, besides its own, the attribute values x₁ and x₂ at the end of the previous round. In each round, therefore, the new value of x₃ is the sum of two terms: x₁ and x₂ added together and passed through the ReLU transformation, and the previous round's value of x₃ transformed by W₁.
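Written as a formula, the update of v₃ described above might read as follows. The superscript notation, with x_i^{(j)} standing for the value of node v_i after the j-th iteration, is an assumption introduced here for illustration, since the original symbol is not reproduced in the text.

```latex
x_3^{(j)} = \mathrm{ReLU}\!\left(x_1^{(j-1)} + x_2^{(j-1)}\right) + W_1\, x_3^{(j-1)},
\qquad j = 1, 2
```

Here x_i^{(0)} is the initial attribute vector of node v_i, that is, its basic-block-level information.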

On the basis of the above embodiments, as an optional embodiment, the number of iterations is 3 to 5. In theory, more iterations give more accurate results, but practical verification for the embodiments showed that beyond 3 to 5 iterations the gain in accuracy is negligible. The number of iterations is therefore set to 3 to 5.

FIG. 3 is a schematic structural diagram of a device for retrieving homologous binary code according to an embodiment of the present invention. As shown in FIG. 3, the device includes a control call information generation module 301, a hash signature generation module 302 and a retrieval module 303, where:

the control call information generation module 301 is configured to determine all basic blocks of the function to be detected and the basic-block-level information of each basic block, generate a control flow graph of the function to be detected, determine the dependencies between all basic blocks according to the control flow graph, and take the basic-block-level information and dependencies of all basic blocks as the control call information of the function to be detected;

the hash signature generation module 302 is configured to input the control call information into a pre-trained neural network model, output the encoding vector of the function to be detected, and calculate the hash signature of the encoding vector as the hash signature to be retrieved;

the retrieval module 303 is configured to search a pre-built hash signature library for a hash signature identical to the hash signature to be retrieved and, if one exists, take the binary code corresponding to the retrieved hash signature as the homologous binary code of the function to be detected.

The device for retrieving homologous binary code provided by the embodiment of the present invention carries out the procedures of the method embodiments above; for details, refer to those embodiments, which are not repeated here. The device uses deep learning to encode the features of binary code into an encoding vector (feature vector). Locality-sensitive hashing can then be used to speed up online matching, so that code fragments similar to a given vulnerable binary code can be retrieved quickly from massive amounts of firmware binary code and the scope of subsequent manual analysis is narrowed. This makes it possible to assess the risk of firmware that may contain vulnerabilities and gives vulnerability security researchers a reference for analysis.

FIG. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in FIG. 4, the electronic device may include a processor 410, a communications interface 420, a memory 430 and a communication bus 440, where the processor 410, the communications interface 420 and the memory 430 communicate with one another through the communication bus 440. The processor 410 can call a computer program stored in the memory 430 and executable on the processor 410 to perform the method for retrieving homologous binary code provided by the above embodiments, for example including: determining all basic blocks of the function to be detected and the basic-block-level information of each basic block, generating a control flow graph of the function to be detected, determining the dependencies between all basic blocks according to the control flow graph, and taking the basic-block-level information and dependencies of all basic blocks as the control call information of the function to be detected; inputting the control call information into a pre-trained neural network model, outputting the encoding vector of the function to be detected, and calculating the hash signature of the encoding vector as the hash signature to be retrieved; and searching a pre-built hash signature library for a hash signature identical to the hash signature to be retrieved and, if one exists, taking the binary code corresponding to the retrieved hash signature as the homologous binary code of the function to be detected.

In addition, the logic instructions in the memory 430 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. On this understanding, the technical solution of the embodiments of the present invention, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Embodiments of the present invention further provide a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the method for retrieving homologous binary code provided by the above embodiments, the method including, for example: determining all basic blocks of a function to be detected and the basic block level information of each basic block, generating a control flow graph of the function to be detected, determining the dependencies among all basic blocks according to the control flow graph, and using the basic block level information and dependencies of all basic blocks as the control call information of the function to be detected; inputting the control call information into a pre-trained neural network model, outputting the encoding vector of the function to be detected, and calculating the hash signature of the encoding vector as the hash signature to be retrieved; and searching a pre-built hash signature library for a hash signature identical to the hash signature to be retrieved and, if one exists, using the binary code corresponding to the retrieved hash signature as the homologous binary code of the function to be detected.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by software plus a necessary general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the above technical solutions, in essence, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for retrieving homologous binary code, characterized by comprising:

determining all basic blocks of a function to be detected and the basic block level information of each basic block, generating a control flow graph of the function to be detected, determining the dependencies among all basic blocks according to the control flow graph, and using the basic block level information and dependencies of all basic blocks as the control call information of the function to be detected;

inputting the control call information into a pre-trained neural network model, outputting the encoding vector of the function to be detected, and calculating the hash signature of the encoding vector as the hash signature to be retrieved;

searching a pre-built hash signature library for a hash signature identical to the hash signature to be retrieved and, if one exists, using the binary code corresponding to the retrieved hash signature as the homologous binary code of the function to be detected.

2. The retrieval method according to claim 1, characterized in that the neural network model is trained as follows:

preparing a pair of sample functions whose homology is known, and inputting the control call information of one of the sample functions into each of two pre-built neural network models, the two neural network models sharing parameters;

calculating the cosine value between the encoding vectors output by the two neural network models, inputting the cosine value and the homology label of the two sample functions into the loss function of the neural network model, and adjusting the parameters of the neural network model according to the result of the loss function;

wherein the loss function is defined as:

W represents all parameters of the neural network model Φ; b₁ and b₂ respectively represent the control call information of the two sample functions; Φ(b₁) and Φ(b₂) respectively represent the encoding vectors of the two sample functions output by the neural network model; and π(b₁,b₂) indicates whether the two sample functions are homologous.
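The training signal described in claim 2 (the cosine value between two encoding vectors, compared against a homology label) can be sketched as follows. The formula of the loss function is not reproduced in this text, so the squared-error form and the label convention (+1 for homologous, -1 for non-homologous) used below are assumptions chosen for illustration, not the claimed definition; `cosine` and `pair_loss` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine value between two encoding vectors, Phi(b1) and Phi(b2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_loss(vec1, vec2, label):
    """Assumed loss for one sample pair: squared gap between cosine and label.

    `label` plays the role of pi(b1, b2); the +1 / -1 encoding is an assumption.
    """
    return (cosine(vec1, vec2) - label) ** 2
```

Under these assumptions the loss is zero when a homologous pair is encoded in the same direction, and it grows as the cosine value moves away from the label, which is the quantity a parameter update over W would try to reduce.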

3. The retrieval method according to claim 1 or 2, characterized in that inputting the control call information into the pre-trained neural network model and outputting the encoding vector of the function to be detected specifically comprises:

inputting the control call information into the pre-trained neural network model and performing a preset number of iterations, wherein in each iteration, for any basic block to be operated on, the values, after the previous iteration, of all other basic blocks on which the basic block to be operated on depends are summed and a ReLU transformation is then applied to obtain a first reference value; a linear transformation is applied to the value of the basic block to be operated on after the previous iteration to obtain a second reference value; and the sum of the first reference value and the second reference value is used as the value of the basic block to be operated on after the current iteration;

fitting the results of the last iteration for all basic blocks according to a fuzzy histogram method, and applying a linear network transformation with bias to output the encoding vector of the function to be detected;

wherein the initial value of each basic block is the basic block level information of that basic block.
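The per-iteration update in claim 3 can be written out as a short sketch. It covers only the iteration step (ReLU of the summed values of the blocks depended on, plus a linear transformation of the block's own previous value); the fuzzy histogram fitting and the final linear transformation with bias are omitted because their details are not given here. The function and parameter names (`propagate`, `depends_on`, `weight`) are hypothetical, and using a single square weight matrix for the linear transformation is an assumption.

```python
def relu(vec):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vec]


def matvec(matrix, vec):
    """Linear transformation of a vector by a matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def propagate(features, depends_on, weight, iterations=3):
    """Iterate the update of claim 3 over all basic blocks.

    features:   block id -> initial value (its basic block level information)
    depends_on: block id -> ids of the other blocks this block depends on
    weight:     square matrix for the linear transformation (assumed shape)
    """
    values = {block: list(vec) for block, vec in features.items()}
    dim = len(next(iter(values.values())))
    for _ in range(iterations):
        updated = {}
        for block, current in values.items():
            summed = [0.0] * dim
            for other in depends_on.get(block, []):
                summed = [s + x for s, x in zip(summed, values[other])]
            first_reference = relu(summed)              # ReLU of summed dependencies
            second_reference = matvec(weight, current)  # linear transform of own value
            updated[block] = [a + b for a, b in zip(first_reference, second_reference)]
        values = updated  # every block reads only values from the previous iteration
    return values
```

Building `updated` separately from `values` ensures that each block is computed from the previous iteration's results, as the claim requires, rather than from values already overwritten in the current pass.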

4. The method for retrieving homologous binary code according to claim 1, characterized in that the basic block level information includes, for a basic block, the number of string constants, the number of numeric constants, the number of transfer instructions, the number of call instructions, the number of arithmetic instructions, the number of assembly instructions, the number of child nodes, and the betweenness centrality.

5. The method for retrieving homologous binary code according to claim 1, characterized in that the control call information is an attributed control call graph (ACFG), and each node in the attributed control call graph ACFG is a basic block having an attribute set.

6. The method for retrieving homologous binary code according to claim 1, characterized in that determining all basic blocks of the function to be detected and the basic block level information of each basic block specifically comprises: performing a reverse-engineering and function-partitioning operation on a binary program to obtain several functions to be detected, all basic blocks of each function to be detected, and the basic block level information of each basic block.

7. The method for retrieving homologous binary code according to claim 3, characterized in that the number of iterations is 3 to 5.

8. A device for retrieving homologous binary code, characterized by comprising:

a control call information generation module, configured to determine all basic blocks of a function to be detected and the basic block level information of each basic block, generate a control flow graph of the function to be detected, determine the dependencies among all basic blocks according to the control flow graph, and use the basic block level information and dependencies of all basic blocks as the control call information of the function to be detected;

a hash signature generation module, configured to input the control call information into a pre-trained neural network model, output the encoding vector of the function to be detected, and calculate the hash signature of the encoding vector as the hash signature to be retrieved;

a retrieval module, configured to search a pre-built hash signature library for a hash signature identical to the hash signature to be retrieved and, if one exists, use the binary code corresponding to the retrieved hash signature as the homologous binary code of the function to be detected.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method for retrieving homologous binary code according to any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions that cause a computer to execute the method for retrieving homologous binary code according to any one of claims 1 to 7.