RDMA (Remote Direct Memory Access)-based key-value storage system transmission method
Technical Field
The invention belongs to the technical field of computer storage, and particularly relates to a transmission method of an RDMA-based key-value storage system.
Background
With the maturing of cloud computing and big-data processing technologies, the amount of data generated by internet applications grows exponentially. Meanwhile, with the rise of pictures and short videos, the data are characterized by varied formats, differing sizes, and lack of structure. To query, analyze, and permanently store this growing mass of data, higher-performance storage technology is required. Traditional relational databases, with their low concurrent-processing capability, poor scalability, and fixed storage structure, are ill-suited to the storage requirements of such novel unstructured data with scattered formats.
For this reason, non-relational (NoSQL) key-value storage systems have begun to be widely used in industry as mainstream storage and analysis tools. In-memory key-value storage systems in particular are widely deployed, owing to their high access speed and strong scalability, to accelerate various data-processing workloads, including online analysis workloads and offline data-intensive workloads. For example, they can serve as primary storage (e.g., Redis and RAMCloud) or as a front-end cache for a back-end database (e.g., Memcached) to speed up data reads and writes. They are also used in upper-layer applications (such as HBase) of the big-data analysis tools Hadoop and Spark to support unstructured data storage.
However, in the face of ever-increasing data volumes and the heavy computational load of processing large-scale data, conventional TCP/IP network protocols and hardware have gradually failed to keep pace with high-performance cores and high-speed links (100 Gb/s). Network I/O performance and strained computational resources are becoming the bottlenecks of key-value storage systems.
Thus, efficient network hardware and more advanced transport protocols are being introduced into conventional key-value storage systems. As the price of high-performance computing hardware falls, data centers have gradually begun to use technologies such as RDMA (Remote Direct Memory Access) to improve the transmission and computing performance of in-memory key-value storage systems. RDMA operations allow a machine to read from (or write to) a pre-registered memory region of another machine without involving the CPU on the remote side. Compared with traditional messaging, RDMA achieves minimal round-trip delay (in microseconds), the highest throughput, and the lowest CPU overhead. Combining RDMA with a key-value storage system can greatly increase online processing speed and lighten data-intensive workloads. Moreover, RDMA now supports the RoCE (RDMA over Converged Ethernet) protocol, an extension that allows RDMA hardware to run over an Ethernet link layer, so that RDMA hardware remains compatible with traditional Ethernet and adapts well when introduced into a conventional key-value storage system.
Several problems must be solved when an RDMA-based key-value storage system is used for data transfer. Tests show that data transmission between nodes takes 1-3 microseconds while a memory lookup takes only 60-120 nanoseconds, so transmission delay dominates; whether the transmission is efficient therefore directly determines the overall read/write performance of the key-value storage system. However, recent research on RDMA-based key-value storage systems mostly performs remote memory access in RDMA read mode, such as the transfer modes used in the Pilaf and FaRM systems. An RDMA read operation bypasses the kernel of the remote server, but it also prevents the remote side from performing complex addressing, so data transmission between client and server must be completed in multiple round trips. The delay of a multi-round-trip design is clearly longer than that of a single-round-trip design, and the user experience suffers accordingly. The multiple round trips required by the RDMA read mode therefore greatly reduce the overall performance of the key-value storage system.
In addition, while RDMA read-based operations make multiple round trips (bypassing the remote server kernel is precisely one of the reasons the remote side cannot perform complex addressing), those repeated transmissions cause interrupts and thread switches on the client CPU; since the client CPU is usually shared by more than one application, the experience of users at the application layer degrades sharply. Meanwhile, the server exists to provide services and its CPU is properly spent on that work; pursuing a perfect kernel bypass on the server side while burdening the client puts the cart before the horse.
Disclosure of Invention
In order to solve the problems in the prior art, the invention aims to provide an RDMA-based key-value storage system transmission method that effectively reduces the round-trip communication delay of an in-memory key-value storage system, improves throughput, and improves the upper-layer user experience of the client by using RDMA one-sided semantics to bypass the client kernel.
To achieve the above object, the invention adopts the following technical solutions:
In a first aspect, an RDMA-based key-value storage system transmission method includes the following steps:
the client and the server establish a connection with each other: the server registers an RDMA memory region to create a command queue, the client registers an RDMA memory region to receive returned data blocks, and the two sides exchange memory addresses and access keys;
after the connection is successfully established, the client sends a GET/PUT instruction to the server in one-sided write semantic form;
the server receives the parallel requests of multiple clients and stores them in the command queue, and parses and responds to the data in the command queue according to RDMA one-sided write semantics: for a GET instruction, it sends the value data to the client's memory in a manner that bypasses the client kernel; for a PUT instruction, it adds or updates the value store locally.
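By way of non-limiting illustration of the registration and key-exchange step above, the following C sketch uses the standard libibverbs API; the conn_info struct, the helper name, and the exact access flags are assumptions of this sketch rather than features of the claimed method.

```c
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

/* Connection metadata exchanged out of band (e.g., over a TCP
 * bootstrap channel) before one-sided RDMA traffic begins;
 * the layout is illustrative. */
struct conn_info {
    uint64_t addr;  /* virtual address of the registered region */
    uint32_t rkey;  /* remote access key for that region        */
};

/* Register a buffer so the peer may target it with one-sided
 * RDMA writes: the server does this for its command queue, the
 * client for its receive block. */
static struct ibv_mr *register_region(struct ibv_pd *pd,
                                      void *buf, size_t len)
{
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE);
}
```

After registration, each side fills a conn_info from mr->addr and mr->rkey and ships it to its peer over the bootstrap channel, which is the mutual exchange of memory address and access key described above.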
In a preferred embodiment, the client sends the GET/PUT command in one-sided write semantic form by calling an RDMA write function, the parameters of which include:
r_address, the virtual memory mapping transmitted from the server;
r_key, the access key transmitted from the server; and
data, the information of the request, which, beyond distinguishing the operation type, contains the information that operation type requires.
As a preferred implementation, for a GET request, the data includes:
a command for distinguishing request types;
key, the key of the requested target object in the key-value storage system, used to look up the value address space in the index at the remote end;
l_address, an address space in the client memory used to store the returned data; and
l_key, the access key of the client.
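A possible layout of the data block just enumerated is sketched below in C; the fixed key width and the numeric command tags are illustrative assumptions, since the method does not prescribe a binary format.

```c
#include <stdint.h>

enum { OP_GET = 0, OP_PUT = 1 };   /* illustrative command tags */

#define MAX_KEY_LEN 64             /* assumed fixed key width   */

/* GET request as written by the client into the server's
 * command queue with a single one-sided RDMA write. */
struct get_request {
    uint8_t  command;            /* distinguishes GET from PUT        */
    char     key[MAX_KEY_LEN];   /* key looked up in the server index */
    uint64_t l_address;          /* client buffer for returned value  */
    uint32_t l_key;              /* client access key for that buffer */
};
```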
As a preferred embodiment, for the GET request, parsing and responding to the data in the command queue according to RDMA one-sided write semantics includes:
the server, in the thread that processes the instruction queue, takes up the received data and parses the parameters in the data;
determining from command that the instruction is GET, and creating a response function RDMA-write(l_address, l_key, r_data);
accessing the hash table according to the key to obtain the address mapping of the storage block where the corresponding value is located, taking the value out of the storage block according to the mapped address, and packing it into the r_data of the response function;
filling the parsed l_address and l_key directly into the l_address and l_key of the response function;
and, after the l_key is successfully matched with the client, sending the data to the client memory specified by l_address while bypassing the client kernel, whereupon the client receives the GET result.
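These response steps can be pictured with the following hedged C sketch, which reuses the get_request layout assumed earlier; hash_lookup(), the queue pair qp, and the memory region value_mr covering the value storage blocks are hypothetical helpers, and error handling is pared to a minimum.

```c
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed index helper: returns the storage-block address for key,
 * or NULL if the key is absent; *len receives the value length. */
extern void *hash_lookup(const char *key, size_t *len);

/* Server-side GET response: look the value up, then push it into
 * the client's memory with one one-sided RDMA write (the client
 * kernel is bypassed; l_address/l_key came from the request). */
static int respond_get(struct ibv_qp *qp, struct ibv_mr *value_mr,
                       const struct get_request *req)
{
    size_t len;
    void *value = hash_lookup(req->key, &len);
    if (!value)
        return -1;                        /* key not found */

    struct ibv_sge sge = {
        .addr   = (uintptr_t)value,       /* r_data payload */
        .length = (uint32_t)len,
        .lkey   = value_mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad = NULL;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = req->l_address;   /* client memory */
    wr.wr.rdma.rkey        = req->l_key;       /* client key    */
    return ibv_post_send(qp, &wr, &bad);
}
```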
As a preferred implementation, for the PUT request, the data includes:
a command for distinguishing request types;
key, the key in the key-value storage system of the data block that the request needs to write;
value, the value in the key-value storage system of the data block that the request needs to write.
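Mirroring the GET layout assumed earlier, the PUT data block could be laid out as follows; the fixed value width is again an illustrative assumption (a production format would more likely carry explicit lengths).

```c
#define MAX_VAL_LEN 1024           /* assumed fixed value width */

/* PUT request: three parameters only, since nothing is returned
 * to the client and no l_address/l_key is needed. */
struct put_request {
    uint8_t command;               /* OP_PUT                    */
    char    key[MAX_KEY_LEN];      /* key of the block to write */
    char    value[MAX_VAL_LEN];    /* value of the block        */
};
```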
In a preferred embodiment, for a PUT request, parsing and responding to the data in the command queue according to RDMA one-sided write semantics includes:
the server, in the thread that processes the instruction queue, takes up the received data and parses the parameters in the data;
determining from command that the instruction is PUT, starting an index-access thread to execute the write operation, and creating a new <new_key, new_value> key-value pair in the hash table;
writing the parsed key into the new_key of the newly created key-value pair;
and, according to the parsed value, creating a new data storage block in the memory area, copying the value into the new block, and writing the access address of the block into the key-value pair's new_value.
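A minimal sketch of this local write path follows, assuming a hypothetical hash_insert() helper for installing the <new_key, new_value> pair; allocation is simplified to malloc, whereas a real server would carve blocks out of its registered memory area.

```c
#include <stdlib.h>
#include <string.h>

/* Assumed index helper: creates (or updates) the hash-table entry
 * mapping key -> block address, i.e., <new_key, new_value>. */
extern void hash_insert(const char *key, void *block, size_t len);

/* Server-side PUT: allocate a fresh storage block, copy the value
 * in, then publish the block's address through the index. */
static void apply_put(const struct put_request *req, size_t value_len)
{
    void *block = malloc(value_len);      /* new data storage block */
    if (!block)
        return;                           /* allocation failed      */
    memcpy(block, req->value, value_len); /* copy value into block  */
    hash_insert(req->key, block, value_len);
}
```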
As a preferred embodiment, when sending a request the client assigns a priority level according to task urgency and sends a priority flag bit together with the data block to the server; the server's command queue receives the flag bits and data blocks, which are then taken out serially and processed in priority order.
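One way to realize this priority-ordered take-out is sketched below, under the assumption that each command-queue slot carries a valid flag set by the client's RDMA write and a small-is-urgent priority byte; both layout choices are hypothetical.

```c
#include <stdint.h>

#define SLOT_PAYLOAD 256               /* assumed slot payload size */

/* One command-queue slot; `valid` is set by the client's RDMA
 * write, `prio` is the client-chosen urgency flag bit. */
struct cq_slot {
    volatile uint8_t valid;
    uint8_t          prio;             /* smaller = more urgent */
    uint8_t          payload[SLOT_PAYLOAD];
};

/* Serially pick the most urgent pending request, or NULL if idle. */
static struct cq_slot *next_request(struct cq_slot *q, int depth)
{
    struct cq_slot *best = NULL;
    for (int i = 0; i < depth; i++)
        if (q[i].valid && (!best || q[i].prio < best->prio))
            best = &q[i];
    return best;
}
```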
In a second aspect, a data processing apparatus, the apparatus comprising:
one or more processors;
a memory;
and one or more programs, wherein the one or more programs are stored in the memory and configured for execution by the one or more processors, and, when executed by the processors, implement the RDMA-based key-value storage system transmission method of the first aspect of the invention.
In a third aspect, a computer-readable storage medium stores computer instructions which, when executed by a processor, implement the RDMA-based key-value storage system transmission method of the first aspect of the invention.
Compared with the traditional TCP/IP communication protocol and with other RDMA semantic designs, the design method of the invention needs only one round-trip transmission, processes the command queue automatically, and completes data access while releasing the client's CPU overhead. The method is applicable to scenarios in which an in-memory key-value storage system serves as a database engine in an RDMA hardware environment.
Drawings
FIG. 1 is a schematic diagram of a method of transmission of an RDMA-based key-value storage system, according to an embodiment of the invention;
FIG. 2 is a schematic diagram illustrating a client and a server establishing a connection with each other according to an embodiment of the present invention;
FIG. 3 is a diagram of a command queue and a polling thread according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a request phase of a GET instruction according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the response and return phases of a GET instruction, according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a PUT instruction client sending phase according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the server-side response phase of a PUT instruction according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further explained with reference to the drawings and the embodiments.
FIG. 1 is a general schematic diagram of a transmission method of an RDMA-based key-value storage system according to an embodiment of the present invention. The method redesigns the GET and PUT instructions of the key-value storage system using the higher-performance RDMA write semantics, thereby avoiding multiple round trips, reducing transmission delay, and improving throughput. At the same time, the server parses the GET command and, after obtaining the operation the command requires, returns the data to the client with an RDMA write, turning the return path into one that bypasses the client kernel and releasing CPU overhead for the user. The following describes how the GET and PUT commands are redesigned to make full use of RDMA's high-performance semantics and thereby improve the transmission performance of the key-value storage system.
FIG. 2 is a schematic diagram of the client and the server establishing a connection with each other according to an embodiment of the present invention. First, the server starts an RDMA memory-registration thread to create a Command Queue that buffers the GET instructions sent by multiple clients using RDMA write, and sends the server memory address r_address and the access key r_key of this region to each client in advance to establish the connection. At the same time, each client starts its own RDMA memory-registration thread and creates a memory block for receiving the data block (Receive Data) returned after a GET instruction completes. The client's memory address l_address and access key l_key are likewise sent to the server in advance, establishing the connection beforehand, i.e., a connection-oriented data transmission protocol. After the connection is established, the remote kernel can be bypassed and the memory block accessed by means of the address and the key.
After the connection is established, the remote memory is virtually abstracted into the address space of the local network card; for the upper-layer application, accessing the remote memory store is equivalent to issuing commands and operations against local memory, with the implementation details completed jointly by the RDMA network-card protocol and the hardware.
Because the one-sided RDMA write semantics with their high transmission performance are adopted, the remote kernel is not notified when remote memory is accessed during data transmission, which yields higher transmission efficiency. However, since the kernel is not notified, for both the one-sided GET instruction and the one-sided PUT instruction above, the read-write contention that arises when multiple clients concurrently access the server's data storage area cannot be coordinated and resolved immediately by the server kernel, which has been bypassed. The design shown in FIG. 3 is therefore proposed.
FIG. 3 is a diagram of the command queue and the polling thread according to an embodiment of the invention. First, RDMA supports pre-opening a region of memory for caching the data that one-sided clients send directly; this region is defined as a message queue in the server memory for receiving the data of parallel clients. The client assigns a priority level according to task urgency, writes a priority flag bit into the work queue, and sends it to the server together with the data. After receiving the flag bits and data blocks, the server takes them out serially and processes them in priority order. Because the client's RDMA write semantics bypass the server kernel, this serial take-out step cannot be executed automatically by the kernel and must be assisted by a new polling thread; furthermore, by the design of the GET and PUT instructions of a key-value storage system, accessing the index structure requires a small amount of CPU processing. Therefore, for accessing the hash table, the invention additionally creates a polling auxiliary thread p that starts as the first step of the whole process. Thread p's main job is to poll the memory area of the server's RDMA receive-queue cache for new client requests, meeting the requirement to poll periodically and process client requests in priority order. The one-sided RDMA write semantics and the message-queue scheme are mutually compatible and together handle the distributed read-write contention of multiple clients under kernel-free reception. This queue is named the "command queue" and the thread the "polling thread". What follows is the direct-communication stage, namely the concrete design of the GET and PUT instructions in the in-memory key-value storage system.
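Before turning to the instruction designs themselves, the polling thread p just described might be organized as in the following sketch, which reuses the hypothetical cq_slot layout and next_request() scanner shown earlier; QUEUE_DEPTH, handle_get(), and handle_put() are likewise assumed names.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_DEPTH 1024            /* assumed command-queue depth */

extern void handle_get(const uint8_t *req);  /* response + return  */
extern void handle_put(const uint8_t *req);  /* local index update */

/* Polling thread p: because clients write into the command queue
 * with one-sided RDMA (no kernel notification), this thread is the
 * only component that notices new requests.  It scans in priority
 * order and dispatches on the command byte. */
static void *polling_thread(void *queue_mem)
{
    struct cq_slot *queue = queue_mem;
    while (true) {
        struct cq_slot *s = next_request(queue, QUEUE_DEPTH);
        if (!s)
            continue;               /* nothing pending; keep polling */
        if (s->payload[0] == OP_GET)
            handle_get(s->payload);
        else
            handle_put(s->payload);
        s->valid = 0;               /* release the slot for reuse    */
    }
    return NULL;
}
```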
FIG. 4 is a diagram of the request phase of a GET instruction according to an embodiment of the present invention. With the connection already established, the client actively initiates communication and enters the request phase of the GET instruction: the client's RNIC starts the RDMA write communication semantics and calls the request function write(r_address, r_key, data). The first parameter, r_address, is the virtual-memory mapping transmitted from the remote end when the connection was established; the remote memory can be accessed directly through this parameter. The second parameter, r_key, is an agreed key introduced for the security of the kernel bypass: once the key is matched and verified against the remote server, remote memory can be read and written without notifying the remote kernel. The third parameter, data, is deposited in the server's Command Queue, which the server RNIC registered in advance to receive the request commands that clients send. The parameter data mainly comprises four parts (a hedged sketch of the corresponding client-side call follows the list):
1) command: the specific GET instruction content, indicating the access nature of the operation.
2) key: the key in the key-value storage system is used for searching the value address space in the index.
3) l_key: the client access key for the client's memory. After obtaining the value, the server starts a thread to return the data; matching this key lets it bypass the client kernel and access the client memory without interrupting the current threads of the client application layer.
4) l_address: the client receiving address, the address space in which the client receives the returned value data. In the data-return stage this address is abstracted into a memory mapping at the server side, so the return of the data can be completed directly by executing a one-sided write, without notifying the client CPU.
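Putting the four parts together, the client-side request phase might look like the following libibverbs sketch; get_request is the layout assumed earlier, req_mr is the memory region covering the local request buffer, and r_address/r_key are the values exchanged at connection setup.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Client request phase: one one-sided RDMA write carries the whole
 * GET command (command, key, l_address, l_key) into the server's
 * command queue; no further round trip is needed until the value
 * lands in the client's receive block. */
static int send_get(struct ibv_qp *qp, struct ibv_mr *req_mr,
                    struct get_request *req,
                    uint64_t r_address, uint32_t r_key)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)req,          /* local request buffer */
        .length = sizeof(*req),
        .lkey   = req_mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad = NULL;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = r_address;    /* server command queue */
    wr.wr.rdma.rkey        = r_key;        /* server access key    */
    return ibv_post_send(qp, &wr, &bad);
}
```

A PUT request reuses the same send path with a put_request payload, which is why the client thread can terminate as soon as the work request is posted.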
The GET requests of the clients are all received in the command queue, after which the server polls each data_i in the queue for further processing. Server-assisted processing is required because RDMA alone cannot support pointer chasing and index queries. The server therefore creates the Command Queue and polls and parses the parameters of each data_i received in it.
FIG. 5 is a diagram of the response and return phases of a GET instruction according to an embodiment of the present invention. As shown in the figure, the server-side kernel intervenes: it takes the request work data_i out of the receive queue in priority order and parses it. The first parameter of data_i, command, identifies the instruction as GET or PUT; if it is GET, a response function RDMA-write(l_address, l_key, r_data) is created for the subsequent return of the value. The second parameter, key, addresses the index hash table, whose stored key-value pairs yield the value address mapping corresponding to the keyword key. Following this address mapping, the value is taken out of the storage block and packed into the parameter r_data of the response function created earlier. The third and fourth parameters of data_i are written directly into the l_key and l_address of the response function RDMA-write. As described above, the l_address parameter is the address of the memory space the client uses to receive data, and the l_key parameter is the matching key required to access the client memory while bypassing the client kernel. After the RDMA-write response function connects to the client, l_key matching is performed; once matching succeeds, the value data are transmitted directly into the client memory specified by l_address for storage, completing the GET process. At this point the GET request initiated by the client finally obtains the server's response, and the value is returned to local memory with the kernel bypassed.
The above is a description of the request phase and the response-and-return phase of the invention's GET instruction based on high-performance RDMA write semantics. The request phase of FIG. 4 corresponds to the request function initiated in the first step of FIG. 5, so all the steps of a complete GET instruction can be visualized together. Compared with other work on RDMA-based key-value storage systems, the invention introduces the RDMA write semantics with the lowest communication delay as the basic communication semantics of the whole process and optimizes the request and return phases, so the number of round trips drops from several to a single round trip for the entire GET instruction. Multiple transmissions cause interrupts and thread switches on the client CPU, which more than one application typically shares, so they sharply degrade the user experience at the application layer. Meanwhile, the invention's design bypasses the client kernel instead, freeing the client kernel that runs the most software applications and improving the experience the user actually perceives in front of the client. The server, by contrast, exists to provide services, and its CPU is rightly spent entirely on the work-instruction load of the storage system; this avoids the cart-before-the-horse outcome in which traditional RDMA read semantics pursue a perfect server-kernel bypass at the expense of overall effectiveness.
FIG. 6 is a schematic diagram of the client sending phase of a PUT instruction according to an embodiment of the present invention. A client uses a PUT instruction when it needs to write new data into a value storage block of the server's key-value storage system or to update existing data. The flow of the PUT instruction is much simpler than that of the GET instruction, which requires the three phases of request, response, and return. As at the beginning of the GET command, RDMA between client and server must still establish a connection-oriented communication mode in advance. To reduce the complexity of the PUT request function and save memory resources, the PUT and GET instructions share the index hash table, the value storage blocks, and the work-request receive queue (Command Queue) on the server. Because the instructions share this buffer area, the receive queue does not change according to whether the request work is a GET or a PUT operation, so the work requests in the queue are still named data_i. Unlike a GET operation, the work request data_i of a PUT operation has only three parameters: command, key, and value. The server polls the data_i received in the queue and parses the work request; for differentiation, the first byte of data_i, command, indicates whether it is a GET or a PUT operation. The memory registration of the command queue has already been elaborated in the description of the GET instruction and is not repeated here. After memory registration completes, the client knows the access address r_address and the remote memory access key r_key of the remote server, can access the server memory directly with the parameters r_address and r_key through RDMA write semantics, writes its data into the server command queue opened up in advance, and thereby generates a work request data_i in the queue. The PUT operation is designed on this same principle as a complement to the GET operation. Because the operation is one-sided, the client ends its thread as soon as the request function RDMA-write(r_address, r_key, data) has been sent.
FIG. 7 is a schematic diagram of the server-side response phase of a PUT instruction according to an embodiment of the present invention. After the server receives the work request data_i, the polling thread of the server kernel is started; the thread takes out and parses each work request data_i in turn according to its priority. The parsed first parameter, command, determines whether the work request is a GET or a PUT instruction, here a PUT instruction. An index-access thread is started to execute the write operation, and a new <new_key, new_value> key-value pair is created in the hash table. Parsing the second parameter of the work request data_i yields the primary key, which is written into the new_key of the new key-value pair. Parsing the third parameter of the work request data_i yields the data block value to be written: a new data storage block is created in the server storage area, the value parameter is copied into it, and the access address of the block is written into the key-value pair's new_value, completing the whole update and write of the key-value index structure and the storage block.
The client has thus successfully added (or updated) a key-value pair in the server's key-value storage system. Because the client closes the relevant thread once sending completes, the server does not subsequently notify the client about the newly added index structure or the expansion of the storage area; the client therefore still obtains a near kernel bypass, as with the GET instruction, greatly reducing CPU resource occupation and freeing computing resources to give clients with heavy application switching a better upper-layer experience. At the same time, with no further messages back and forth, transmission delay is minimized and the transmission efficiency of the whole workload improves; together with the GET operation designed above, the in-memory key-value storage system achieves high-performance transmission.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: although the present invention has been described in detail with reference to the above embodiments, the transmission method of the present invention is applicable to all such systems, and it should be understood by those skilled in the art that modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.