
CN117938715A - Abnormality detection method, device, electronic device and readable medium of RDMA system - Google Patents

Publication number
CN117938715A
Authority
CN
China
Prior art keywords
workload
rdma
rdma system
present disclosure
search space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410064360.1A
Other languages
Chinese (zh)
Inventor
郭雪芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Technology Innovation Center
China Telecom Corp Ltd
Original Assignee
China Telecom Technology Innovation Center
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Technology Innovation Center, China Telecom Corp Ltd filed Critical China Telecom Technology Innovation Center
Priority to CN202410064360.1A priority Critical patent/CN117938715A/en
Publication of CN117938715A publication Critical patent/CN117938715A/en
Priority to PCT/CN2024/118540 priority patent/WO2025152481A1/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00: Arrangements for monitoring or testing data switching networks
    • H04L43/08: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0823: Errors, e.g. transmission errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Environmental & Geological Engineering (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present disclosure provides an abnormality detection method, device, electronic device and readable medium for an RDMA system. The method includes: abstractly modeling the workspace of the RDMA system to generate a search space of the RDMA system; determining workloads according to the dimensions of the search space; determining the conditional features under which any workload triggers an anomaly in the RDMA system, to generate a minimum functional feature set of the anomaly; and, for any minimum functional feature set, calling counters of the RDMA system to determine an abnormality indicator of the workload. The embodiments of the present disclosure achieve active detection of performance anomalies in the RDMA system and improve its performance and stability in practical applications.

Description

Abnormality detection method, device, electronic device and readable medium of RDMA system

Technical Field

The present disclosure relates to the field of communication technology, and in particular to an abnormality detection method, device, electronic device and readable medium for an RDMA system.

Background

In the related art, RDMA hardware vendors conduct extensive testing, but because many parties are involved, an RDMA system may still exhibit various performance anomalies that seriously affect the performance and stability of the RDMA network.

However, existing RDMA performance tests mainly rely on simple benchmark tools or on tests of known applications. They cannot fully cover potential application workloads, which makes performance anomalies hard to discover.

It should be noted that the information disclosed in this Background section is only intended to aid understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.

Summary of the Invention

The purpose of the present disclosure is to provide an abnormality detection method, device, electronic device and readable medium for an RDMA system, which overcome, at least to some extent, the poor reliability of RDMA systems caused by the limitations and defects of the related art.

According to a first aspect of the embodiments of the present disclosure, an abnormality detection method for an RDMA system is provided, comprising: abstractly modeling the workspace of the RDMA system to generate a search space of the RDMA system; determining workloads according to the dimensions of the search space; determining the conditional features under which any of the workloads triggers an anomaly in the RDMA system, to generate a minimum functional feature set of the anomaly; and, for any of the minimum functional feature sets, calling a counter of the RDMA system to determine an abnormality indicator of the workload.

In an exemplary embodiment of the present disclosure, the method further includes:

adjusting the mode of the workload and/or the selection result of the workload based on the result of iteratively computing the abnormality indicator with a statistical algorithm.

In an exemplary embodiment of the present disclosure, adjusting the mode of the workload and/or the selection result of the workload based on the result of iteratively computing the abnormality indicator with a statistical algorithm includes:

calling an annealing algorithm among the statistical algorithms to iteratively calculate a count difference between the count value of the workload and the count value of a new workload;

determining a ratio value between the count difference and the count value of the workload;

adjusting the mode of the workload and/or the selection result of the workload according to the ratio value.

In an exemplary embodiment of the present disclosure, adjusting the mode of the workload and/or the selection result of the workload according to the ratio value includes:

if the ratio value is determined to be less than zero, switching from the workload to the new workload;

if the ratio value is determined to be greater than zero, evaluating an exponential function with the natural constant e as its base at the negative of the ratio value;

determining the result of the exponential function as the probability of switching from the workload to the new workload.

In an exemplary embodiment of the present disclosure, adjusting the mode of the workload and/or the selection result of the workload according to the ratio value includes:

adjusting the request mode of the RDMA system and/or the buffer allocation information of the RDMA system according to the ratio value.

In an exemplary embodiment of the present disclosure, abstractly modeling the workspace of the RDMA system to generate the search space of the RDMA system includes:

abstracting the workspace of the RDMA system to determine the memory region corresponding to the workspace, and registering the memory region;

creating a queue pair and setting the transmission type of the queue pair;

creating a work queue element according to the registered memory region and the queue pair with its transmission type set;

creating a completion queue according to the work queue element, to generate the search space of the RDMA system.

In an exemplary embodiment of the present disclosure, determining workloads according to the dimensions of the search space includes:

parsing the topology, host memory source, transmission mode and message mode included in the dimensions of the search space;

generating traffic between hosts of the RDMA system according to the topology, the host memory source, the transmission mode and the message mode, and determining the workload corresponding to the traffic.

According to a second aspect of the embodiments of the present disclosure, an abnormality detection device for an RDMA system is provided, comprising:

a modeling module, configured to abstractly model the workspace of the RDMA system to generate a search space of the RDMA system;

a determination module, configured to determine workloads according to the dimensions of the search space;

the determination module being further configured to determine the conditional features under which any of the workloads triggers an anomaly in the RDMA system, to generate a minimum functional feature set of the anomaly;

the determination module being further configured to, for any of the minimum functional feature sets, call a counter of the RDMA system to determine an abnormality indicator of the workload.

According to a third aspect of the present disclosure, an electronic device is provided, comprising: a memory; and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the method described in any one of the above.

According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided, on which a program is stored; when executed by a processor, the program implements the abnormality detection method of the RDMA system described in any one of the above.

In the embodiments of the present disclosure, the workspace of the RDMA system is abstractly modeled to generate a search space of the RDMA system; workloads are determined according to the dimensions of the search space; the conditional features under which any workload triggers an anomaly in the RDMA system are determined, to generate a minimum functional feature set of the anomaly; and finally, for any minimum functional feature set, counters of the RDMA system are called to determine the abnormality indicator of the workload. This achieves active detection of performance anomalies in the RDMA system and improves its performance and stability in practical applications.

It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.

Brief Description of the Drawings

The accompanying drawings are incorporated into and form part of this specification, illustrate embodiments consistent with the present disclosure, and together with the specification serve to explain the principles of the present disclosure. The drawings described below are clearly only some embodiments of the present disclosure; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of an exemplary system architecture to which the abnormality detection scheme for an RDMA system of an embodiment of the present invention can be applied;

FIG. 2 is a flowchart of an abnormality detection method for an RDMA system in an exemplary embodiment of the present disclosure;

FIG. 3 is a flowchart of another abnormality detection method for an RDMA system in an exemplary embodiment of the present disclosure;

FIG. 4 is a flowchart of another abnormality detection method for an RDMA system in an exemplary embodiment of the present disclosure;

FIG. 5 is a flowchart of another abnormality detection method for an RDMA system in an exemplary embodiment of the present disclosure;

FIG. 6 is a flowchart of another abnormality detection method for an RDMA system in an exemplary embodiment of the present disclosure;

FIG. 7 is a flowchart of another abnormality detection method for an RDMA system in an exemplary embodiment of the present disclosure;

FIG. 8 is a flowchart of another abnormality detection method for an RDMA system in an exemplary embodiment of the present disclosure;

FIG. 9 is a system architecture diagram of an abnormality detection scheme for an RDMA system in an exemplary embodiment of the present disclosure;

FIG. 10 is a schematic diagram of the search space (workspace) of an abnormality detection scheme for an RDMA system in an exemplary embodiment of the present disclosure;

FIG. 11 is a flowchart of an abnormality detection scheme for an RDMA system in an exemplary embodiment of the present disclosure;

FIG. 12 is a block diagram of an abnormality detection device for an RDMA system in an exemplary embodiment of the present disclosure;

FIG. 13 is a block diagram of an electronic device in an exemplary embodiment of the present disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described conditional features, structures or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a full understanding of the embodiments of the present disclosure. Those skilled in the art will appreciate, however, that the technical solutions of the present disclosure may be practiced without one or more of these specific details, or with other methods, components, devices, steps and so on. In other cases, well-known technical solutions are not shown or described in detail, so as not to obscure aspects of the present disclosure.

In addition, the accompanying drawings are only schematic illustrations of the present disclosure. The same reference numerals in the drawings denote the same or similar parts, so their repeated description is omitted. Some of the block diagrams shown in the drawings are functional entities that do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

FIG. 1 is a schematic diagram of an exemplary system architecture to which the abnormality detection scheme for an RDMA system of an embodiment of the present invention can be applied.

As shown in FIG. 1, the system architecture 100 may include one or more of terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or optical fiber cables.

It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers, as the implementation requires. For example, the server 105 may be a server cluster composed of multiple servers.

Users can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. The terminal devices 101, 102, 103 may be various electronic devices with display screens, including but not limited to smartphones, tablet computers, portable computers and desktop computers.

In some embodiments, the abnormality detection method for an RDMA system provided by the embodiments of the present invention is generally executed by the server 105; accordingly, the abnormality detection device for the RDMA system is generally provided in the terminal device 103 (or in the terminal device 101 or 102). In other embodiments, some terminals may have functions similar to those of the server and thus execute the method.

The technical terms involved in the embodiments of the present disclosure are explained below:

MFS: Minimum Feature Set, also referred to as the minimum functional feature set.

MR: Memory Region. An application must first register a memory region to make it RDMA-accessible memory. An MR is registered through a function call. An MR has a start address and a length, which define a contiguous block of memory. The RNIC can access a registered MR directly, without CPU involvement.

RNIC: RDMA Network Interface Card.

QP: Queue Pair. A QP is created with ibv_create_qp. A QP represents a "connection" between the application and the RNIC. It is an abstraction of point-to-point communication and contains a send queue and a receive queue. Each QP must be configured with a transmission mode, such as reliable connection (RC).

WQE: Work Queue Element, i.e. a work request. To send or receive a message, a WQE is constructed and posted to a QP. The WQE contains a scatter/gather list that specifies the set of MR memory buffers taking part in the transfer. A WQE for sending is posted to the send queue of the QP.

Here, scatter/gather describes operations that read from or write to a Channel.

Scatter: a read from a Channel writes the data it reads into multiple Buffers. The Channel thus scatters the data read from it across multiple Buffers.

Gather: a write to a Channel writes the data of multiple Buffers into the same Channel. The Channel thus gathers the data from multiple Buffers before sending it. Scatter/gather is often used when the transmitted data needs to be processed in separate parts. For example, when transmitting data that consists of a message header and a message body, the header and the body can be scattered into different buffers so that each can be processed conveniently.
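As an illustration of the scatter/gather idea described above, the following minimal sketch uses the operating system's vectored I/O calls on a pipe. This is an analogy chosen for illustration and is not the RDMA verbs interface; it assumes a Unix-like system where Python's `os.writev` and `os.readv` are available, and the header and body contents are hypothetical.

```python
import os

# Gather: a header and a body held in two separate buffers are written
# to the same channel (here, a pipe) with a single call.
header = b"HEAD"
body = b"hello"
read_fd, write_fd = os.pipe()
written = os.writev(write_fd, [header, body])

# Scatter: the bytes read from the channel are spread over two
# pre-sized buffers, so header and body can be handled separately.
header_buf = bytearray(4)
body_buf = bytearray(5)
received = os.readv(read_fd, [header_buf, body_buf])

os.close(read_fd)
os.close(write_fd)
```

In an RDMA WQE the scatter/gather list plays the same role, except that each entry refers to a buffer inside a registered MR.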

CQ: Completion Queue. A CQ is used to receive completion notifications and to determine whether a WQE has completed. A CQ is created and then queried; once the completion notification has been obtained, the memory buffers of the WQE can be reused.

RC: connection-oriented reliable service.

UC: connection-oriented unreliable service.

UD: datagram-oriented unreliable service.

RD: connectionless (similar to UDP) reliable service.

RDMA: Remote Direct Memory Access. RDMA was created to address the latency of server-side data processing in network transmission. It transfers data over the network directly into a computer's memory, moving data quickly from one system into the memory of a remote system without any impact on the operating system. This requires little of the computer's processing capacity, eliminates the overhead of external memory copies and context switches, and frees memory bandwidth and CPU cycles to improve application performance.

Example embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.

FIG. 2 is a flowchart of an abnormality detection method for an RDMA system in an exemplary embodiment of the present disclosure.

Referring to FIG. 2, the abnormality detection method for an RDMA system may include:

Step S202: abstractly model the workspace of the RDMA system to generate a search space of the RDMA system.

Step S204: determine workloads according to the dimensions of the search space.

In an exemplary embodiment of the present disclosure, the workload may be a computer device such as a server, a terminal or an intermediate node, but is not limited thereto.

Step S206: determine the conditional features under which any of the workloads triggers an anomaly in the RDMA system, to generate a minimum functional feature set of the anomaly.

In an exemplary embodiment of the present disclosure, a minimum feature set (MFS) algorithm is run to extract the necessary conditions for triggering the anomaly, and the extracted feature set is used to avoid repeatedly searching known anomalous regions, which speeds up the search. The MFS algorithm tests each feature present in the anomaly one by one to determine which features are necessary.

In an exemplary embodiment of the present disclosure, if an anomaly is triggered using an RC queue pair, the test checks whether the anomaly can also be reproduced with a UC queue pair. If it cannot, the RC queue pair is one of the MFS features of the anomaly. The MFS algorithm outputs the combination of necessary features that triggers the anomaly.
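The feature-by-feature test described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the algorithm of the disclosure: the function name `minimal_feature_set`, the dictionary representation of a workload, and the `reproduces` callback (which stands in for actually replaying a workload on the RDMA system) are all hypothetical.

```python
from typing import Callable, Dict, List

Workload = Dict[str, str]


def minimal_feature_set(
    anomalous: Workload,
    alternatives: Dict[str, List[str]],
    reproduces: Callable[[Workload], bool],
) -> Workload:
    """Return the features of `anomalous` that are necessary for the anomaly."""
    necessary: Workload = {}
    for feature, value in anomalous.items():
        # Swap this one feature for each alternative value while keeping
        # every other feature fixed, and check whether the anomaly remains.
        still_triggers = any(
            reproduces({**anomalous, feature: alt})
            for alt in alternatives.get(feature, [])
            if alt != value
        )
        # If no alternative reproduces the anomaly, the feature is necessary.
        # Assumption: a feature with no alternatives cannot be ruled out,
        # so it is conservatively kept as necessary.
        if not still_triggers:
            necessary[feature] = value
    return necessary


# Toy stand-in for a real replay: the anomaly appears only with an RC
# queue pair combined with large messages (hypothetical condition).
def toy_reproduces(w: Workload) -> bool:
    return w["qp_type"] == "RC" and w["msg_size"] == "large"


mfs = minimal_feature_set(
    {"qp_type": "RC", "msg_size": "large", "topology": "one_to_one"},
    {
        "qp_type": ["RC", "UC", "UD"],
        "msg_size": ["small", "large"],
        "topology": ["one_to_one", "many_to_one"],
    },
    toy_reproduces,
)
```

In the toy run, the topology is dropped because the anomaly still reproduces with a different topology, while the queue pair type and message size are retained.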

Step S208: for any of the minimum functional feature sets, call a counter of the RDMA system to determine an abnormality indicator of the workload.

In an exemplary embodiment of the present disclosure, the counters include performance counters and/or diagnostic counters. Performance counters mainly reflect the performance of the RDMA subsystem, such as throughput, latency and CPU usage. Diagnostic counters mainly reflect errors or abnormal events inside the RNIC, such as cache misses and internal congestion, but are not limited thereto.

In the embodiments of the present disclosure, the workspace of the RDMA system is abstractly modeled to generate a search space of the RDMA system; workloads are determined according to the dimensions of the search space; the conditional features under which any workload triggers an anomaly in the RDMA system are determined, to generate a minimum functional feature set of the anomaly; and finally, for any minimum functional feature set, counters of the RDMA system are called to determine the abnormality indicator of the workload. This achieves active detection of performance anomalies in the RDMA system and improves its performance and stability in practical applications.

Each step of the abnormality detection method for an RDMA system is described in detail below.

In an exemplary embodiment of the present disclosure, as shown in FIG. 3, the method further includes:

Step S302: adjust the mode of the workload and/or the selection result of the workload based on the result of iteratively computing the abnormality indicator with a statistical algorithm.

In an exemplary embodiment of the present disclosure, as shown in FIG. 4, adjusting the mode of the workload and/or the selection result of the workload based on the result of iteratively computing the abnormality indicator with a statistical algorithm includes:

Step S402: call an annealing algorithm among the statistical algorithms to iteratively calculate a count difference between the count value of the workload and the count value of a new workload.

Step S404: determine a ratio value between the count difference and the count value of the workload.

Step S406: adjust the mode of the workload and/or the selection result of the workload according to the ratio value.
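The ratio computed in steps S402 and S404 can be sketched as a relative change in a counter value. The source does not state the sign convention of the count difference, so the sketch below assumes it is the new workload's count minus the current workload's count; the function name `relative_count_change` is hypothetical.

```python
def relative_count_change(current_count: float, new_count: float) -> float:
    """Ratio of the count difference to the current workload's count value."""
    if current_count == 0:
        raise ValueError("current_count must be non-zero")
    # Assumed sign convention: difference = new count - current count.
    return (new_count - current_count) / current_count


# Hypothetical counter readings for the current and the new workload.
ratio_drop = relative_count_change(100.0, 80.0)
ratio_rise = relative_count_change(100.0, 150.0)
```

Under this assumption a negative ratio means the counter fell when moving to the new workload, and a positive ratio means it rose.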

In an exemplary embodiment of the present disclosure, the mode of the workload includes, but is not limited to, an access mode, a message mode and a request mode. Examples are a mode in which RDMA access requires no acknowledgment from the remote machine at all, and a mode in which RDMA access requires the participation of the remote machine's CPU. The mode of the workload determines the RDMA traffic that is generated.

In an exemplary embodiment of the present disclosure, the selection result of the workload includes whether to move from the workload that causes the anomaly to a new workload.

In an exemplary embodiment of the present disclosure, as shown in FIG. 5, adjusting the mode of the workload and/or the selection result of the workload according to the ratio value includes:

Step S502: if the ratio value is determined to be less than zero, switch from the workload to the new workload.

In an exemplary embodiment of the present disclosure, the above ratio value is denoted ΔE; if ΔE<0, the workload is switched to the new workload.

Step S504: if the ratio value is determined to be greater than zero, evaluate an exponential function with the natural constant e as its base at the negative of the ratio value.

Step S506: determine the result of the exponential function as the probability of switching from the workload to the new workload.

In an exemplary embodiment of the present disclosure, if ΔE>0, exp(-ΔE/T) is calculated as the probability of switching from the workload to the new workload.
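The acceptance rule of steps S502 to S506 has the shape of the standard acceptance test used in simulated annealing and can be sketched as follows. The source does not define T; the sketch assumes it is a positive annealing temperature. The function names are hypothetical, and the case ΔE = 0, which the source does not address, falls through to exp(0) = 1 here.

```python
import math
import random
from typing import Callable


def transfer_probability(delta_e: float, temperature: float) -> float:
    """Probability of moving from the current workload to the new one."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if delta_e < 0:
        # S502: a negative ratio always triggers the transfer.
        return 1.0
    # S504/S506: otherwise transfer with probability exp(-delta_e / T).
    return math.exp(-delta_e / temperature)


def should_transfer(
    delta_e: float,
    temperature: float,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Draw a uniform number in [0, 1) and compare it with the probability."""
    return rng() < transfer_probability(delta_e, temperature)


p_accept = transfer_probability(0.5, 1.0)
```

A larger T makes transfers with ΔE>0 more likely, which lets the search leave a region of the search space instead of settling on the first workload it finds.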

In an exemplary embodiment of the present disclosure, as shown in FIG. 6, adjusting the mode of the workload and/or the selection result of the workload according to the ratio value includes:

Step S602: adjust the request mode of the RDMA system and/or the buffer allocation information of the RDMA system according to the ratio value.

In an exemplary embodiment of the present disclosure, as shown in FIG. 7, abstractly modeling the workspace of the RDMA system to generate the search space of the RDMA system includes:

Step S702: abstract the workspace of the RDMA system to determine the memory region corresponding to the workspace, and register the memory region.

Step S704: create a queue pair and set the transmission type of the queue pair.

Step S706: create a work queue element according to the registered memory region and the queue pair with its transmission type set.

Step S708: create a completion queue according to the work queue element, to generate the search space of the RDMA system.

In an exemplary embodiment of the present disclosure, as shown in FIG. 8, determining the workload according to the dimensions of the search space includes:

Step S802: parse the topology, host memory source, transport mode, and message pattern included in the dimensions of the search space.

Step S804: generate traffic between hosts of the RDMA system according to the topology, the host memory source, the transport mode, and the message pattern, and determine the workload corresponding to the traffic.

In an exemplary embodiment of the present disclosure, the transport mode includes: the QP type, such as RC, UC, or UD; the number of QPs; the opcode type; and the WQE usage.

The anomaly detection scheme for the RDMA system of the present disclosure is described in detail below with reference to FIG. 9 to FIG. 11.

In an exemplary embodiment of the present disclosure, as shown in FIG. 9, a system architecture of the anomaly detection scheme for an RDMA system 900 is provided. The RDMA system 900 includes a workload engine, an anomaly monitor, a workload generator, and the like.

In an exemplary embodiment of the present disclosure, the workload engine is responsible for establishing RDMA traffic.

In an exemplary embodiment of the present disclosure, the anomaly monitor detects performance bottlenecks from the traffic and PFC pause frames and derives the minimal feature set (MFS), that is, the minimum set of features necessary to trigger an anomaly.

In an exemplary embodiment of the present disclosure, the workload generator obtains metrics such as hardware counters to decide the workload pattern to test.

In an exemplary embodiment of the present disclosure, the above steps are repeated for multiple rounds to finally obtain the feature sets of the RDMA performance anomalies, that is, the minimum functional feature sets.

In an exemplary embodiment of the present disclosure, as shown in FIG. 10, a search space 1000 of the anomaly detection scheme for the RDMA system is provided. Based on the dimension parameters set in the search space, corresponding network traffic is generated between hosts through RDMA network interface cards. The search space is modeled in the following four dimensions:

(1) Topology: how traffic flows into and out of the RNIC, to and from the other commodity hardware components of the server.

(2) Memory allocation settings: including how storage regions are configured and partitioned.

(3) Transport settings: (a) QP type (RC, UC, UD); (b) number of QPs; (c) opcode type, such as SEND or WRITE; (d) WQE usage.

(4) Message pattern: the message size can be set flexibly, and the order of a series of messages can also be defined.

Further, the workload engine in the search space 1000 can combine and test different RDMA operations in the search space, abstractly model the workspace, and simulate the RDMA operations of applications, so that the feature sets of RDMA performance anomalies can be found quickly.
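As an illustration of the four dimensions above, the following Python sketch represents one workload as a point in a small search space and samples it at random. All concrete values (topology labels, memory sources, QP counts, message sizes) are hypothetical placeholders chosen for the example, not values from the disclosure. The only validity rule modeled is that UD queue pairs support SEND but not WRITE; other real constraints, such as UD message size limits, are not modeled.

```python
import random
from dataclasses import dataclass
from typing import Tuple

# Hypothetical candidate values for each dimension of the search space.
SEARCH_SPACE = {
    "topology": ("one_to_one", "many_to_one"),
    "memory_source": ("local_numa", "remote_numa"),
    "qp_type": ("RC", "UC", "UD"),
    "num_qps": (1, 8, 64, 512),
    "opcode": ("SEND", "WRITE"),
    "wqe_batch": (1, 16, 128),
    "message_size": (64, 1024, 65536),
}


@dataclass(frozen=True)
class Workload:
    topology: str                   # dimension (1)
    memory_source: str              # dimension (2)
    qp_type: str                    # dimension (3a)
    num_qps: int                    # dimension (3b)
    opcode: str                     # dimension (3c)
    wqe_batch: int                  # dimension (3d), a stand-in for WQE usage
    message_sizes: Tuple[int, ...]  # dimension (4): ordered sequence of sizes


def random_workload(rng: random.Random, pattern_len: int = 3) -> Workload:
    """Sample one point of the search space."""
    qp_type = rng.choice(SEARCH_SPACE["qp_type"])
    # UD queue pairs do not support WRITE, so only SEND is offered for them.
    opcodes = ("SEND",) if qp_type == "UD" else SEARCH_SPACE["opcode"]
    return Workload(
        topology=rng.choice(SEARCH_SPACE["topology"]),
        memory_source=rng.choice(SEARCH_SPACE["memory_source"]),
        qp_type=qp_type,
        num_qps=rng.choice(SEARCH_SPACE["num_qps"]),
        opcode=rng.choice(opcodes),
        wqe_batch=rng.choice(SEARCH_SPACE["wqe_batch"]),
        message_sizes=tuple(
            rng.choice(SEARCH_SPACE["message_size"]) for _ in range(pattern_len)
        ),
    )
```

Making the workload an immutable, hashable record is a design choice of this sketch: it lets already-tested workloads be stored in a set and compared cheaply.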

In an exemplary embodiment of the present disclosure, as shown in FIG. 11, the anomaly detection scheme for the RDMA system includes the following steps:

Step S1102: the workload setter constructs an abstract RDMA search space, including:

(1) initializing S: the workspace settings;

(2) initializing temperature: the temperature parameter of the annealing algorithm;

(3) initializing N: the workload dimensions.

Step S1104: the workload engine generates RDMA traffic according to the workload pattern provided by the workload setter.

Step S1106: the anomaly monitor determines from the RDMA system metrics whether an anomaly has occurred.

Step S1108: an anomaly is triggered.

Step S1110: the anomaly monitor uses the MFS algorithm to generate the minimal anomaly feature set.

Step S1112: the workload setter obtains the minimal anomaly feature set.

Steps S1104 to S1112 above are iterated repeatedly, and the feature sets of all detected RDMA performance anomalies are finally output.

Step S1114: the workload setter obtains the RDMA system counters.

Step S1116: the workload setter uses the annealing algorithm to calculate the energy function.

Step S1118: the annealing algorithm run by the workload setter decides, according to the energy value and the annealing temperature, whether to accept the workload.

Step S1120: the workload setter determines, according to the number of iteration rounds, whether to update the annealing parameter temperature.

Step S1122: the workload setter decides the next workload pattern according to the counters of the RDMA system (subsystem or main system) and the current search space.

Based on steps S1102 to S1122 above, a comprehensive search space is constructed from the developer's perspective by analyzing the standard verbs library and the design decisions a developer can make, such as the request pattern and how RDMA buffers are allocated. The key abstractions of RDMA programming are modeled, including the memory region (MR), the queue pair (QP), the work queue element (WQE), and the completion queue (CQ).

Four key dimensions that can affect the performance of the RDMA subsystem are extracted: host topology, memory allocation, transport settings, and message pattern. Under different settings, these abstractions affect the performance of RDMA applications differently. Anomalies are discovered by simulating different workloads and reading performance counters.

The modeled abstractions are described in detail below:

(1) Memory region (MR): an application must first register a memory region to make it accessible to RDMA. The MR is registered with a registration function. An MR has a start address and a length and defines a contiguous block of memory. The RNIC can access a registered MR directly, without CPU involvement.

(2) Queue pair (QP): a QP is created with ibv_create_qp. A QP represents a "connection" between the application and the RNIC. It is an abstraction of point-to-point communication and contains a send queue and a receive queue. Each QP must be configured with a transport mode, such as reliable connection (RC).

(3) Work queue element (WQE): to send or receive a message, a WQE is constructed and posted to the QP. The WQE contains a scatter/gather list that specifies the set of MR memory buffers taking part in the transfer. For a send, the WQE is posted to the send queue of the QP.

(4) Completion queue (CQ): a CQ receives completion notifications, which indicate whether a WQE has completed. After the CQ is created, it is polled to obtain completion notifications. Once a notification has been obtained, the memory buffers of the WQE can be reused.
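The relationship between the four abstractions can be illustrated with a toy, pure-Python model. It does not call the verbs library; every class and method name below is hypothetical and only mirrors the roles described above. Passing the CQ to the QP at construction time is a modeling choice of this sketch, and `process` is a stand-in for the work the RNIC would do.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple


@dataclass(frozen=True)
class MemoryRegion:
    addr: int    # start address of the registered region
    length: int  # length in bytes of the contiguous region


@dataclass(frozen=True)
class WorkQueueElement:
    wr_id: int
    opcode: str
    # scatter/gather list: (memory region, offset, length) entries
    sg_list: Tuple[Tuple[MemoryRegion, int, int], ...]


@dataclass
class CompletionQueue:
    entries: Deque[int] = field(default_factory=deque)

    def poll(self, max_entries: int) -> List[int]:
        """Return up to max_entries completed work request ids."""
        done = []
        while self.entries and len(done) < max_entries:
            done.append(self.entries.popleft())
        return done


@dataclass
class QueuePair:
    qp_type: str  # e.g. "RC"
    cq: CompletionQueue
    send_queue: Deque[WorkQueueElement] = field(default_factory=deque)

    def post_send(self, wqe: WorkQueueElement) -> None:
        """Post a WQE; every buffer must lie inside its registered MR."""
        for mr, offset, length in wqe.sg_list:
            if offset < 0 or length < 0 or offset + length > mr.length:
                raise ValueError("scatter/gather entry outside registered MR")
        self.send_queue.append(wqe)

    def process(self) -> None:
        """Stand-in for the RNIC: complete all posted WQEs in order."""
        while self.send_queue:
            self.cq.entries.append(self.send_queue.popleft().wr_id)
```

In this model a buffer becomes reusable only after its work request id has been returned by `poll`, which matches the reuse rule stated for the CQ.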

Concrete workloads are generated according to the dimensions of the search space, which can be modeled in the following four dimensions:

(1) Topology: how traffic flows into and out of the RNIC, to and from the other commodity hardware components of the server.

(2) Memory allocation settings.

(3) Transport settings: (a) QP type (RC, UC, UD); (b) number of QPs; (c) opcode type, such as SEND or WRITE; (d) WQE usage.

(4) Message pattern: the message size can be set flexibly, and the order of a series of messages can also be defined.

The corresponding network traffic is then generated between hosts through the RDMA network interface cards.

After the search space has been constructed, the search procedure of the RDMA system uses a simulated annealing algorithm that drives the hardware counters toward extreme values in order to search efficiently for application workloads that may trigger performance anomalies.

The procedure for searching for RDMA performance anomalies with the annealing algorithm is as follows:

(1) The workload setter selects an initial random workload, that is, the initial state.

(2) The workload engine generates RDMA traffic according to the workload pattern provided by the workload setter and sends it to the RDMA system.

(3) The anomaly detector performs anomaly detection based on the anomaly metrics of the RDMA system and determines whether an anomaly has been triggered.

(4) If the anomaly detector detects a new anomaly, the anomaly monitor runs the minimal feature set (MFS) algorithm to extract the conditions necessary to trigger the anomaly. The extracted feature set is used to avoid searching known anomaly regions again, which speeds up the search. The MFS algorithm tests each feature present in the anomaly one by one to determine which features are necessary. For example, if an anomaly was triggered with an RC queue pair, the algorithm tests whether the anomaly can also be reproduced with UC. If it cannot, RC is one of the MFS features of the anomaly. The MFS algorithm outputs the combination of features necessary to trigger the anomaly.

(5) The workload setter obtains the minimal anomaly feature set.

(6) The workload setter obtains the RDMA system counters and examines the counter values at the workload point. The counters are obtained through the counter interface of the RNIC and include performance counters and diagnostic counters. These counter data do not depend on proprietary knowledge and do not require access to the internal implementation details of hardware such as the RNIC.

(7) The workload setter uses the annealing algorithm to calculate the energy function. In the present disclosure, the optimization goal is to make performance counters as small as possible and diagnostic counters as large as possible. The energy difference ΔE between the counter values at the current point and at the new point is calculated. For a performance counter (whose value is to be minimized), ΔE is defined as: ΔE = (new counter value - old counter value) / old counter value. For a diagnostic counter (whose value is to be maximized), ΔE is defined as: ΔE = (old counter value - new counter value) / new counter value.

Performance counters mainly reflect the performance of the RDMA subsystem, such as throughput, latency, and CPU usage. The goal of the present disclosure is to find workloads that cause performance anomalies (reduced throughput, increased latency, and so on). Therefore, if a workload makes a performance counter value smaller, performance has dropped and an anomaly has probably been triggered. For performance counters, a smaller value thus means a performance anomaly is more likely to be found.

Diagnostic counters mainly reflect errors or abnormal events inside the RNIC, such as cache misses and internal congestion. These counters usually rise only when something is wrong with the RNIC. Therefore, if a workload makes a diagnostic counter value larger, abnormal events inside the RNIC have increased and a problem is likely. For diagnostic counters, a larger value thus means a performance anomaly is more likely to be found.

(8) The workload setter uses the result of the energy function calculated by the annealing algorithm to decide whether to accept the new workload: if the energy change indicates that the counter value has moved in the desired direction, that is, ΔE < 0, the new point is accepted; if the energy change indicates that it has moved in the undesired direction, that is, ΔE > 0, the new workload is accepted with probability exp(-ΔE/T).

Further, as the number of iteration rounds increases, it is determined whether to change the temperature parameter of the annealing algorithm. As the number of search rounds increases, the acceptance probability is gradually lowered so that the search becomes more focused.

(9) The workload setter determines, according to the number of iteration rounds, whether to update the annealing temperature parameter.

(10) The workload setter decides the next workload pattern according to the RDMA subsystem counters and the current search space. The workload engine generates RDMA traffic according to the workload pattern provided by the workload setter and sends it to the RDMA system.

In summary, the above steps are iterated repeatedly, and the feature sets of all detected RDMA performance anomalies are finally output.
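The energy difference of item (7) and the temperature update of item (9) can be sketched in Python as follows. The two ΔE formulas are taken from the text. The cooling rule (multiply T by a factor every fixed number of rounds, with a lower bound) is an assumption made for illustration, since the text only states that the temperature is updated according to the number of iteration rounds. The names `energy_delta`, `update_temperature`, `every`, `alpha`, and `floor` are hypothetical.

```python
def energy_delta(old_value: float, new_value: float, kind: str) -> float:
    """Relative energy difference between the current and the new point.

    kind == "performance": counter should shrink, dE = (new - old) / old.
    kind == "diagnostic":  counter should grow,   dE = (old - new) / new.
    A negative result means the new workload moved in the desired direction.
    """
    if kind == "performance":
        if old_value == 0:
            raise ValueError("old performance counter value must be non-zero")
        return (new_value - old_value) / old_value
    if kind == "diagnostic":
        if new_value == 0:
            raise ValueError("new diagnostic counter value must be non-zero")
        return (old_value - new_value) / new_value
    raise ValueError(f"unknown counter kind: {kind!r}")


def update_temperature(temperature: float, iteration: int,
                       every: int = 10, alpha: float = 0.9,
                       floor: float = 1e-3) -> float:
    """Assumed geometric cooling: every `every` rounds, T <- max(alpha * T, floor)."""
    if iteration > 0 and iteration % every == 0:
        return max(temperature * alpha, floor)
    return temperature
```

For example, a throughput counter dropping from 100 to 80 gives ΔE = -0.2, and a diagnostic counter rising from 10 to 40 gives ΔE = -0.75; both are accepted unconditionally under the rule in item (8).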

In an exemplary embodiment of the present disclosure, a list of performance anomalies is maintained. Each anomaly is a minimal fault region (corresponding to an MFS), that is, a region of the search space that causes a performance anomaly. The search starts from a random workload in the search space, for which the algorithm measures the counter values. In each iteration of simulated annealing (SA), the workload is changed along the search dimensions. The anomaly monitor is used to test whether the new workload causes a performance anomaly. If it does, the MFS algorithm is run to determine the region of the whole search space that belongs to this anomaly, the new anomaly is added to the set, and the current workload is changed to a random workload. If the new workload does not trigger a performance anomaly, the point is evaluated by comparing counter values, and it is decided whether to move the current workload to the new workload. For an efficient search, workloads that belong to an existing performance anomaly are always skipped.
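The loop described above can be sketched in Python under several stated assumptions. Workloads are plain dictionaries, an anomaly region is represented by the dictionary of its necessary features, a single performance counter is used for the energy difference, and the temperature is kept constant. The callables `sample`, `mutate`, `measure`, `is_anomaly`, and `extract_mfs` are hypothetical hooks that a real system would have to supply; none of these names come from the disclosure.

```python
import math
import random
from typing import Callable, Dict, List

Workload = Dict[str, object]


def matches(workload: Workload, mfs: Workload) -> bool:
    """True if the workload lies in the region described by the feature set."""
    return all(workload.get(k) == v for k, v in mfs.items())


def search(sample: Callable, mutate: Callable, measure: Callable,
           is_anomaly: Callable, extract_mfs: Callable,
           rounds: int = 200, temperature: float = 1.0,
           rng: random.Random = None) -> List[Workload]:
    rng = rng or random.Random()
    anomalies: List[Workload] = []
    current, current_value = None, None
    for _ in range(rounds):
        # Start (or restart) from a random workload, otherwise perturb the current one.
        candidate = sample(rng) if current is None else mutate(current, rng)
        if any(matches(candidate, m) for m in anomalies):
            continue  # skip workloads inside a known anomaly region
        value = measure(candidate)  # assumed to return a positive counter value
        if is_anomaly(candidate, value):
            anomalies.append(extract_mfs(candidate))
            current = None  # change the current workload to a random one
            continue
        if current is None:
            current, current_value = candidate, value
            continue
        # Performance-counter form of the energy difference.
        delta = (value - current_value) / current_value
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            current, current_value = candidate, value
    return anomalies
```

Because known regions are skipped before any traffic would be generated, each anomaly region is reported at most once in this sketch.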

Further, the anomaly detection conditions of the present disclosure include the following: in each round of iteration, the anomaly detector determines whether the new workload has triggered an RDMA performance anomaly.

Anomaly detection condition 1: if pause frames are generated and the proportion of pause frames exceeds 0.1%, it is determined that an anomalous bottleneck has probably occurred.

Anomaly detection condition 2: each RNIC (RDMA network interface card) has a specified maximum number of bits per second and a specified maximum number of packets per second, which can be verified with simple benchmarks. If the throughput of a workload is below 20% of these upper limits, performance is probably limited by another bottleneck in the RDMA (remote direct memory access) subsystem.
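The two detection conditions can be expressed as a small Python check. Two points are assumptions of this sketch rather than statements of the disclosure: the proportion of pause frames is computed as pause frames divided by total frames, and condition 2 is taken to hold only when both the bit rate and the packet rate fall below 20% of their specified maxima. The function and parameter names are hypothetical.

```python
PAUSE_RATIO_THRESHOLD = 0.001  # 0.1 %
THROUGHPUT_FRACTION = 0.20     # 20 % of the specified upper limits


def is_anomalous(pause_frames: int, total_frames: int,
                 bits_per_s: float, packets_per_s: float,
                 spec_bits_per_s: float, spec_packets_per_s: float) -> bool:
    """Return True if either detection condition is met."""
    # Condition 1: pause frames exist and exceed 0.1 % of all frames (assumed ratio).
    pause_ratio = pause_frames / total_frames if total_frames > 0 else 0.0
    if pause_frames > 0 and pause_ratio > PAUSE_RATIO_THRESHOLD:
        return True
    # Condition 2: throughput is below 20 % of both specified limits (assumed "both").
    below_bits = bits_per_s < THROUGHPUT_FRACTION * spec_bits_per_s
    below_packets = packets_per_s < THROUGHPUT_FRACTION * spec_packets_per_s
    return below_bits and below_packets
```

Requiring both rates to be low avoids flagging small-message workloads, which legitimately reach a high packet rate at a low bit rate; a stricter reading of the text could use either rate alone.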

Further, the anomaly detector of the present disclosure extracts feature sets as follows:

If a new anomaly occurs, the anomaly monitor runs the minimal feature set (MFS) algorithm to extract the conditions necessary to trigger the anomaly. The extracted feature set is used to avoid searching known anomaly regions again, which speeds up the search. The MFS algorithm tests each feature present in the anomaly one by one to determine which features are necessary. For example, if an anomaly was triggered with an RC queue pair, the algorithm tests whether the anomaly can also be reproduced with UC. If it cannot, RC is one of the MFS features of the anomaly. The MFS algorithm outputs the combination of features necessary to trigger the anomaly. After multiple rounds of iteration, the feature sets of all detected RDMA performance anomalies are finally output.
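The one-feature-at-a-time test described above can be sketched in Python as follows. The sketch assumes that a workload is a dictionary of features, that `alternatives` lists the other values each feature may take, and that `triggers` is a callable that re-runs a workload and reports whether the anomaly reappears. These names are hypothetical. A feature is kept as necessary when at least one alternative value makes the anomaly disappear; value ranges (for example "64 or more QPs") are not inferred.

```python
from typing import Callable, Dict, Iterable


def minimal_feature_set(workload: Dict[str, object],
                        alternatives: Dict[str, Iterable[object]],
                        triggers: Callable[[Dict[str, object]], bool]
                        ) -> Dict[str, object]:
    """Return the features of `workload` that are necessary for the anomaly."""
    necessary = {}
    for feature, value in workload.items():
        for alt in alternatives.get(feature, ()):
            if alt == value:
                continue
            probe = {**workload, feature: alt}  # change exactly one feature
            if not triggers(probe):
                # The anomaly vanished, so the original value is necessary.
                necessary[feature] = value
                break
    return necessary
```

With the example from the text, an anomaly triggered with RC that does not reproduce with UC yields a feature set containing `qp_type = RC`.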

Corresponding to the above method embodiments, the present disclosure further provides an anomaly detection apparatus for an RDMA system, which can be used to perform the above method embodiments.

FIG. 12 is a block diagram of an anomaly detection apparatus for an RDMA system in an exemplary embodiment of the present disclosure.

Referring to FIG. 12, the anomaly detection apparatus 1200 for an RDMA system may include:

A modeling module 1202, configured to abstractly model the workspace of the RDMA system to generate the search space of the RDMA system.

A determination module 1204, configured to determine workloads according to the dimensions of the search space.

The determination module 1204 is configured to determine the condition features under which any of the workloads triggers an anomaly of the RDMA system, so as to generate the minimum functional feature set of the anomaly.

The determination module 1204 is configured to, for any of the minimum functional feature sets, invoke the counters of the RDMA system to determine the anomaly metrics of the workload.

In an exemplary embodiment of the present disclosure, the anomaly detection apparatus 1200 for an RDMA system is further configured to:

adjust the pattern of the workload and/or the selection result of the workload based on the result of iteratively calculating the anomaly metrics with a statistical algorithm.

In an exemplary embodiment of the present disclosure, the anomaly detection apparatus 1200 for an RDMA system is further configured to:

invoke the annealing algorithm among the statistical algorithms to iteratively calculate the count difference between the count value of the workload and the count value of a new workload;

determine the ratio value between the count difference and the count value of the workload;

adjust the pattern of the workload and/or the selection result of the workload according to the ratio value.

In an exemplary embodiment of the present disclosure, the anomaly detection apparatus 1200 for an RDMA system is further configured to:

if the ratio value is determined to be less than zero, transfer the workload to the new workload;

if the ratio value is determined to be greater than zero, evaluate the exponential function with the natural constant e as its base at the negative of the ratio value;

determine the result of the exponential function as the probability of transferring the workload to the new workload.

In an exemplary embodiment of the present disclosure, the anomaly detection apparatus 1200 for an RDMA system is further configured to:

adjust the request pattern of the RDMA system and/or the buffer allocation information of the RDMA system according to the ratio value.

In an exemplary embodiment of the present disclosure, the modeling module 1202 is further configured to:

abstract the workspace of the RDMA system to determine the memory region corresponding to the workspace, and register the memory region;

create the queue pair and set the transport type of the queue pair;

create a work queue element according to the registered memory region and the queue pair whose transport type has been set;

create the completion queue according to the work queue element, so as to generate the search space of the RDMA system.

In an exemplary embodiment of the present disclosure, the determination module 1204 is further configured to:

parse the topology, host memory source, transport mode, and message pattern included in the dimensions of the search space;

generate traffic between hosts of the RDMA system according to the topology, the host memory source, the transport mode, and the message pattern, and determine the workload corresponding to the traffic.

Since the functions of the apparatus 1200 have been described in detail in the corresponding method embodiments, they are not repeated here.

It should be noted that although several modules or units of the apparatus for performing actions are mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in a single module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied in multiple modules or units.

In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

Those skilled in the art will understand that the various aspects of the present invention may be implemented as a system, a method, or a program product. Accordingly, the various aspects of the present invention may take the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, and the like), or an embodiment combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module", or "system".

An electronic device 1300 according to this embodiment of the present invention is described below with reference to FIG. 13. The electronic device 1300 shown in FIG. 13 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.

As shown in FIG. 13, the electronic device 1300 takes the form of a general-purpose computing device. The components of the electronic device 1300 may include, but are not limited to: at least one processing unit 1310, at least one storage unit 1320, and a bus 1330 connecting the different system components (including the storage unit 1320 and the processing unit 1310).

The storage unit stores program code that can be executed by the processing unit 1310, so that the processing unit 1310 performs the steps according to the various exemplary embodiments of the present invention described in the "Exemplary Method" section of this specification. For example, the processing unit 1310 may perform the method shown in the embodiments of the present disclosure.

The storage unit 1320 may include a readable medium in the form of a volatile storage unit, such as a random access memory (RAM) unit 13201 and/or a cache storage unit 13202, and may further include a read-only memory (ROM) unit 13203.

The storage unit 1320 may also include a program/utility 13204 having a set of (at least one) program modules 13205. Such program modules 13205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination of them, may include an implementation of a network environment.

The bus 1330 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.

The electronic device 1300 may also communicate with one or more external devices 1340 (such as a keyboard, a pointing device, or a Bluetooth device), with one or more devices that enable a user to interact with the electronic device 1300, and/or with any device (such as a router or a modem) that enables the electronic device 1300 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 1350. The electronic device 1300 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 1360. As shown in the figure, the network adapter 1360 communicates with the other modules of the electronic device 1300 through the bus 1330. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 1300, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

From the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described here may be implemented in software, or in software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product. The software product may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a portable hard disk) or on a network, and includes several instructions that cause a computing device (such as a personal computer, a server, a terminal device, or a network device) to perform the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, on which a program product capable of implementing the above method of this specification is stored. In some possible embodiments, the various aspects of the present invention may also be implemented in the form of a program product that includes program code. When the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present invention described in the "Exemplary Method" section of this specification.

A program product for implementing the above method according to an embodiment of the present invention may take the form of a portable compact disc read-only memory (CD-ROM) that includes program code, and may run on a terminal device such as a personal computer. However, the program product of the present invention is not limited to this. In this document, a readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.

The program product may use any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example and without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of these. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of these.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of these. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.

The program code contained on the readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wired, optical cable, or RF, or any suitable combination of these.

Program code for performing the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C or similar languages. The program code may execute entirely on the user computing device, partly on the user computing device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, it may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computing device (for example, through the Internet using an Internet service provider).

In addition, the above drawings are only schematic illustrations of the processing included in the method according to exemplary embodiments of the present invention and are not intended to be limiting. It is easy to understand that the processing shown in the drawings does not indicate or limit the chronological order of these processes. It is also easy to understand that these processes may be performed synchronously or asynchronously, for example in multiple modules.

Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in the present disclosure. The specification and embodiments are to be regarded as exemplary only; the true scope and spirit of the present disclosure are indicated by the claims.

Claims (10)

1. An anomaly detection method for an RDMA system, comprising:
abstractly modeling a workspace of the RDMA system to generate a search space of the RDMA system;
determining workloads according to the dimensions of the search space;
determining, for any of the workloads, the condition features that trigger an anomaly of the RDMA system, to generate a minimum functional feature set of the anomaly; and
for any minimum functional feature set, invoking a counter of the RDMA system to determine an anomaly index of the workload.
2. The anomaly detection method for an RDMA system of claim 1, further comprising:
adjusting a pattern of the workload and/or a selection result of the workload based on a result of iteratively calculating the anomaly index with a statistical algorithm.
3. The anomaly detection method for an RDMA system of claim 2, wherein adjusting the pattern of the workload and/or the selection result of the workload based on the result of iteratively calculating the anomaly index with a statistical algorithm comprises:
invoking an annealing algorithm among the statistical algorithms to iteratively calculate a count difference between the count value of the workload and the count value of a new workload;
determining a ratio value between the count difference and the count value of the workload; and
adjusting the pattern of the workload and/or the selection result of the workload according to the ratio value.
4. The anomaly detection method for an RDMA system of claim 3, wherein adjusting the pattern of the workload and/or the selection result of the workload according to the ratio value comprises:
if the ratio value is determined to be less than zero, transferring the workload to the new workload;
if the ratio value is determined to be greater than zero, applying an exponential function with the natural constant e as its base to the negative of the ratio value; and
taking the result of the exponential function as the probability of transferring the workload to the new workload.
5. The anomaly detection method for an RDMA system of claim 3, wherein adjusting the pattern of the workload and/or the selection result of the workload according to the ratio value comprises:
adjusting a request pattern of the RDMA system and/or buffer allocation information of the RDMA system according to the ratio value.
6. The anomaly detection method for an RDMA system of any of claims 1-5, wherein abstractly modeling a workspace of the RDMA system to generate a search space of the RDMA system comprises:
abstracting the workspace of the RDMA system to determine a memory region corresponding to the workspace, and registering the memory region;
creating a queue pair and setting a transmission type of the queue pair;
creating a work queue element according to the registered memory region and the queue pair for which the transmission type has been set; and
creating a completion queue according to the work queue element, to generate the search space of the RDMA system.
7. The anomaly detection method for an RDMA system of any of claims 1-5, wherein determining workloads according to the dimensions of the search space comprises:
analyzing a topology, a host memory source, a transmission mode, and a message pattern included in the dimensions of the search space; and
generating traffic between hosts of the RDMA system according to the topology, the host memory source, the transmission mode, and the message pattern, and determining the workload corresponding to the traffic.
8. An anomaly detection apparatus for an RDMA system, comprising:
a modeling module configured to abstractly model a workspace of the RDMA system to generate a search space of the RDMA system; and
a determining module configured to determine workloads according to the dimensions of the search space;
the determining module being further configured to determine, for any of the workloads, the condition features that trigger an anomaly of the RDMA system, to generate a minimum functional feature set of the anomaly;
the determining module being further configured to invoke, for any minimum functional feature set, a counter of the RDMA system to determine an anomaly index of the workload.
9. An electronic device, comprising:
a memory; and
a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the anomaly detection method for an RDMA system of any of claims 1-7.
10. A computer readable storage medium having stored thereon a program which, when executed by a processor, implements the anomaly detection method of the RDMA system of any of claims 1-7.
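
The transfer rule recited in claims 3 and 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names `transfer_probability` and `anneal_step`, the `read_counter` callback, and the sign convention (count difference taken as current count minus new count) are assumptions made for illustration. The claims do not state what happens when the ratio value equals zero or when the current count is zero; the sketch treats both cases as an unconditional transfer and marks them as assumptions in the comments.

```python
import math
import random


def transfer_probability(current_count: float, new_count: float) -> float:
    """Probability of moving from the current workload to the new workload."""
    if current_count <= 0:
        # Assumption: the claims do not cover a zero baseline count,
        # so the sketch always moves to the new workload here.
        return 1.0
    # Assumed sign convention: difference = current count - new count,
    # so a larger counter value for the new workload gives a negative ratio.
    ratio = (current_count - new_count) / current_count
    if ratio < 0:
        # Claim 4: a ratio value below zero means an unconditional transfer.
        return 1.0
    # Claim 4: for a positive ratio value, the probability is e ** (-ratio).
    # Assumption: ratio == 0 is handled by the same formula (exp(0) == 1).
    return math.exp(-ratio)


def anneal_step(current, candidate, read_counter, rng=random):
    """Return the workload to keep after one annealing iteration.

    read_counter is a hypothetical callback that maps a workload to the
    counter value observed while that workload runs.
    """
    p = transfer_probability(read_counter(current), read_counter(candidate))
    return candidate if rng.random() < p else current
```

Under these assumptions, a candidate workload that drives the counter higher is always adopted, while one that drives it lower is adopted only with probability e^(-ratio), which lets the search escape local optima.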
CN202410064360.1A 2024-01-16 2024-01-16 Abnormality detection method, device, electronic device and readable medium of RDMA system Pending CN117938715A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202410064360.1A CN117938715A (en) 2024-01-16 2024-01-16 Abnormality detection method, device, electronic device and readable medium of RDMA system
PCT/CN2024/118540 WO2025152481A1 (en) 2024-01-16 2024-09-12 Anomaly detection method and apparatus for rdma system, electronic device, and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410064360.1A CN117938715A (en) 2024-01-16 2024-01-16 Abnormality detection method, device, electronic device and readable medium of RDMA system

Publications (1)

Publication Number Publication Date
CN117938715A true CN117938715A (en) 2024-04-26

Family

ID=90754979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410064360.1A Pending CN117938715A (en) 2024-01-16 2024-01-16 Abnormality detection method, device, electronic device and readable medium of RDMA system

Country Status (2)

Country Link
CN (1) CN117938715A (en)
WO (1) WO2025152481A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2025152481A1 (en) * 2024-01-16 2025-07-24 中国电信股份有限公司技术创新中心 Anomaly detection method and apparatus for rdma system, electronic device, and readable medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102430187B1 (en) * 2015-07-08 2022-08-05 삼성전자주식회사 METHOD FOR IMPLEMENTING RDMA NVMe DEVICE
CN112968811A (en) * 2021-02-20 2021-06-15 中国工商银行股份有限公司 PFC exception handling method and device for RDMA network
CN115174702B (en) * 2022-09-08 2022-11-22 深圳华锐分布式技术股份有限公司 RDMA (remote direct memory Access) protocol-based data transmission method, device, equipment and medium
CN115658592A (en) * 2022-09-09 2023-01-31 珠海星云智联科技有限公司 RDMA-based data transmission method and device
CN117938715A (en) * 2024-01-16 2024-04-26 中国电信股份有限公司技术创新中心 Abnormality detection method, device, electronic device and readable medium of RDMA system

Also Published As

Publication number Publication date
WO2025152481A9 (en) 2025-09-18
WO2025152481A1 (en) 2025-07-24

Similar Documents

Publication Publication Date Title
EP1441491A1 (en) System and method for testing portable communications devices
CN113326181B (en) Fuzz testing method, device and storage medium for stateful network protocol
WO2025152481A1 (en) Anomaly detection method and apparatus for rdma system, electronic device, and readable medium
CN111726258A (en) Network performance detection method and related device
CN105282244B (en) a kind of data processing method, device, server and controller
CN115987965A (en) File uploading method, device, equipment and storage medium
CN113791792B (en) Methods, devices and storage media for obtaining application call information
CN112463067A (en) Data protection method and equipment in NVMe-oF scene
CN112506798A (en) Performance test method, device, terminal and storage medium of block chain platform
CN112507265B (en) Method and device for abnormality detection based on tree structure and related products
CN114063606A (en) PLC protocol fuzzy testing method and device, electronic equipment, storage medium
WO2025138713A1 (en) Event statistics method and apparatus
CN113760589A (en) Service fusing method and device based on real-time stream processing framework
CN113825170A (en) Method and apparatus for determining a network channel
CN115022201B (en) A data processing function testing method, device, equipment and storage medium
CN117221189A (en) Network speed measurement method and device, electronic equipment and storage medium
CN112511522B (en) Method, device and equipment for reducing memory occupation in detection scanning
CN115098371A (en) Interface testing method, device, storage medium and electronic equipment
CN112364284B (en) Methods, devices and related products for context-based anomaly detection
CN113377660B (en) Test methods and equipment
CN116955105B (en) Cross-chip transmission performance analysis method and device and electronic equipment
CN110704222A (en) Dump file analysis method and device, storage medium, electronic device
CN117472787B (en) Test case generation method, device, medium and equipment for vehicle-mounted computer fuzzy test
CN114449095B (en) Cloud mobile phone screenshot method and device, electronic equipment and storage medium
CN115499402B (en) Instant messaging information processing method, terminal and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination