CN118332010B - Data batch insertion method for distributed database - Google Patents


Info

Publication number
CN118332010B
CN118332010B (application CN202410755952.8A)
Authority
CN
China
Prior art keywords
data
message
database
piece
node
Prior art date
Legal status
Active
Application number
CN202410755952.8A
Other languages
Chinese (zh)
Other versions
CN118332010A (en)
Inventor
易成龙
张益�
吴明远
侯晓凡
赵健
Current Assignee
Tianjin Nankai University General Data Technologies Co ltd
Original Assignee
Tianjin Nankai University General Data Technologies Co ltd
Priority date
Filing date
Publication date
Application filed by Tianjin Nankai University General Data Technologies Co ltd
Priority: CN202410755952.8A
Publication of CN118332010A
Application granted
Publication of CN118332010B
Legal status: Active

Classifications

    • G06F16/2453 Query optimisation
    • G06F16/2433 Query languages
    • G06F16/24552 Database cache management
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
    • G06F9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals


Abstract

The invention provides a data batch insertion method for a distributed database. The client constructs and sends a P message containing a parameterized SQL statement to the database; after receiving the P message, the database parses the SQL statement, generates a corresponding query tree, and caches the query tree locally. The client then sends a U message to the database; the database parses the data in the U message using the query tree and performs route calculation, determining the target data node of each row from the row's shard key field and a hash function. According to the hash values, the database packs the rows destined for the same data node into new U messages and sends them to the corresponding data nodes in batches for insertion. Finally, all data nodes determined in T2 are traversed to confirm that the batched U messages have been received and inserted, completing the data insertion process. The invention has the beneficial effects that the efficiency of batch data insertion and the system response speed are significantly improved.

Description

Data batch insertion method for distributed database
Technical Field
The invention belongs to the technical field of databases, and particularly relates to a data batch insertion method for a distributed database.
Background
In a database management system, batch data insertion is realized through the INSERT statement, which allows multiple rows to be inserted into a table at once to improve operational efficiency. The traditional INSERT INTO table_name (column_name_list) VALUES (data1), (data2), (data3) statement improves efficiency by reducing the number of database calls. Distributed database systems such as openGauss further optimize this process: by improving the PBE (Parse, Bind, Execute) flow and introducing a U message mechanism that sends multiple rows at once, they reduce network round trips and repetitive SQL parsing. However, this approach still shows limitations when processing large amounts of data or when fine-grained routing to specific data nodes is required, such as large parsing overhead, low data routing efficiency, and limited parallel processing potential.
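The call-count saving of a multi-row insert can be reproduced with any client library. Below is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the relational systems discussed (the table and column names are illustrative, and sqlite3 is of course not a distributed database):

```python
import sqlite3

# In-memory database as a stand-in; the principle holds for any RDBMS client.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")

rows = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]

# One executemany() call binds every row to a single parameterized statement,
# instead of issuing one database call per row.
conn.executemany("INSERT INTO t (a, b, c) VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 3
```

The same pattern, one parameterized statement plus a batch of bound rows, is exactly what the PBE improvements described below formalize at the protocol level.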
Disclosure of Invention
In view of the foregoing, the present invention aims to propose a data batch insertion method for a distributed database that optimizes the U message processing logic and significantly improves the efficiency of batch data insertion and the system response speed.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
A data batch insertion method for a distributed database.
Further, the method comprises the following steps:
T1, sending a P message: the client constructs and sends a P message containing a parameterized SQL statement to the database; after receiving the P message, the database parses the SQL statement, generates a corresponding query tree, and caches the query tree locally;
T2, sending a U message: the client sends a U message to the database; the database parses the data in the U message using the query tree, and determines the target data node of each row by route calculation over the row's shard key field and a hash function;
T3, batch data insertion: according to the hash values, the database packs the rows destined for the same data node into new U messages and sends them to the corresponding data nodes in batches for data insertion;
T4, finishing insertion: all data nodes determined in T2 are traversed to confirm that data insertion is complete, completing the data insertion process.
Further, in T2, sending the U message comprises the following sub-steps:
T21, the client sends a U message to the database, wherein the U message comprises the number of rows to be inserted and a data list;
T22, the database parses the actual data from the U message and processes the shard keys of rows of the same table with the same hash function;
T23, the hash values of the rows are calculated in batches, obtaining the data node number corresponding to each row.
Further, the U message comprises a P message and a B message, and the U message generating step comprises:
S1, the client sends a B message to the database, wherein the B message comprises an SQL statement name and one row to be inserted, and the database binds the B message to the P message of T1;
S2, the client sends an E message to the database, and the database executes the SQL statement of the B message of S1, completing the insertion of one row;
S3, the flow returns to S1 to insert the next row.
Further, in T1, the P message comprises the parameterized SQL statement to be executed and a name used to identify this SQL statement;
in T2, the U message also includes the name of the parameterized SQL statement from the P message.
Further, T3 comprises the following sub-steps:
T31, after receiving the U message, the central node of the database parses all rows and constructs a data set from them;
T32, the database selects an appropriate hash function according to the distribution-column type of the rows in the data set and calculates the hash value of each row, so that the rows are evenly distributed across the data nodes;
T33, a new U message is initialized for each target data node; the data set is traversed and each row is placed into the new U message of its target data node, completing data distribution;
T34, the central node sends each new U message to the corresponding data node, and the data node performs the insertion according to the new U message.
Further, an electronic device comprises a processor and a memory communicatively coupled to the processor and configured to store instructions executable by the processor, the processor being configured to perform the data batch insertion method for a distributed database described above.
Further, a server comprises at least one processor and a memory communicatively coupled to the processor, the memory storing instructions executable by the at least one processor to cause the at least one processor to perform the data batch insertion method for a distributed database.
Further, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the data batch insertion method for a distributed database described above.
Compared with the prior art, the data batch insertion method for a distributed database has the following beneficial effects:
(1) by processing data and messages in batches, the method reduces the overhead of single-row insertion and improves overall insertion efficiency; in particular, when a large amount of data is processed, the load on the database is significantly reduced;
(2) by using parameterized SQL statements and the query-tree cache, the method reduces the database's repeated work of parsing SQL, thereby lowering CPU and memory consumption and optimizing resource utilization;
(3) the processing capacity of the distributed database can be expanded by adding data nodes, and effective data sharding and routing distribute the data evenly across the nodes, improving system scalability and supporting larger data volumes and higher concurrency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a schematic flow chart of a data batch insertion method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a data batch insertion principle (prior art) according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a data batch insertion principle (the present application) according to an embodiment of the present application;
Fig. 4 is a flowchart illustrating a procedure for sending a U message according to an embodiment of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
The invention will be described in detail below with reference to the drawings in connection with embodiments.
As shown in fig. 2, the existing data insertion flow uses the PBE (Parse, Bind, Execute) mode to insert one row at a time. Specifically, an insertion is completed by one 'P' message and a sequence of 'B' and 'E' messages, where the P message carries the parameterized SQL statement, each B message carries one row of actual data to be inserted, and each E message triggers execution of the insertion. With this mode, inserting a large amount of data is inefficient because of the many message interactions; the present application therefore merges the multiple 'B' and 'E' messages into a single 'U' message and realizes batch insertion of a large number of rows through one U message, as shown in fig. 3.
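The reduction in round trips can be made concrete with a back-of-the-envelope count. The sketch below is hypothetical, assuming one P message up front, then one B plus one E message per row in the classic flow, versus a single U message in the merged flow:

```python
def pbe_message_count(n_rows: int) -> int:
    # Classic flow: one P message, then one B and one E message per row.
    return 1 + 2 * n_rows

def u_message_count(n_rows: int) -> int:
    # Merged flow: one P message plus a single U message carrying all rows,
    # regardless of the row count.
    return 2

for n in (1, 100, 10_000):
    print(n, pbe_message_count(n), u_message_count(n))
```

The classic flow grows linearly with the row count (201 messages for 100 rows), while the merged flow stays constant, which is the motivation for the U message.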
Specifically, a data batch insertion method for a distributed database, as shown in fig. 1, includes the following steps:
T1, sending a P message: first, the client constructs and sends a P message containing a parameterized SQL statement to the database, for example: insert into table_name (column_name_list) values ($1, $2, ..., $n); then, after receiving the P message, the database parses the SQL statement, generates a corresponding query tree, and caches the query tree locally;
T2, sending a U message: the client sends a U message to the database; the database parses the data in the U message through the query tree and performs route calculation over each row's shard key field and a hash function to determine the target data node of each row;
T3, batch data insertion: the database packs the rows destined for the same data node into new U messages according to their hash values, for example:
DN1 batch insert message: {(1,1,1), (2,2,2)};
DN2 batch insert message: {(3,3,3)};
then, the new U messages are sent to the corresponding data nodes in batches for data insertion;
T4, finishing insertion: all data nodes determined in T2 are traversed to confirm that data insertion is complete, completing the data insertion process.
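The query-tree caching of step T1 can be sketched as a name-keyed dictionary on the database side. Everything below is an illustrative stand-in: the parse function merely tokenizes, whereas a real database would build a genuine query tree.

```python
query_tree_cache: dict[str, tuple] = {}

def parse(sql: str) -> tuple:
    # Stand-in for real SQL parsing: the "query tree" here is just a token tuple.
    return tuple(sql.replace("(", " ( ").replace(")", " ) ").split())

def handle_p_message(name: str, sql: str) -> tuple:
    # Parse once per statement name and cache locally; later U messages
    # reuse the cached tree instead of re-parsing the SQL text.
    if name not in query_tree_cache:
        query_tree_cache[name] = parse(sql)
    return query_tree_cache[name]

handle_p_message("ins_t", "insert into t (a, b, c) values ($1, $2, $3)")
handle_p_message("ins_t", "insert into t (a, b, c) values ($1, $2, $3)")  # cache hit
print(len(query_tree_cache))  # 1
```

The point of the cache is that the parsing cost is paid once per statement name, however many U messages reference it afterwards.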
Specifically, the U message in T2 comprises a P message and a B message, and the U message generating step comprises:
S1, the client sends a B message to the database, wherein the B message comprises an SQL statement name and one row to be inserted, and the database binds the B message to the P message of T1;
S2, the client sends an E message to the database, and the database executes the SQL statement of the B message of S1, completing the insertion of one row;
S3, the flow returns to S1 to insert the next row.
For comparison, suppose the rows (1,1,1), (2,2,2), (3,3,3) are to be inserted. With the classic flow:
P{'sql_name', sql statement};
insert (1,1,1): B{'sql_name', {(1,1,1)}} + E;
insert (2,2,2): B{'sql_name', {(2,2,2)}} + E;
insert (3,3,3): B{'sql_name', {(3,3,3)}} + E.
With a U message, the multiple rows are inserted in batch through a single message between the client and the database:
U{'sql_name', 3, {(1,1,1), (2,2,2), (3,3,3)}}.
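Treating the U message as the triple (statement name, row count, row list), its construction is a one-liner; the tuple layout below is an illustrative stand-in for the actual wire format:

```python
def build_u_message(sql_name: str, rows: list) -> tuple:
    # (statement name, number of rows, row list) mirrors U{'sql_name', 3, {...}}.
    return (sql_name, len(rows), rows)

u = build_u_message("sql_name", [(1, 1, 1), (2, 2, 2), (3, 3, 3)])
print(u)  # ('sql_name', 3, [(1, 1, 1), (2, 2, 2), (3, 3, 3)])
```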
Further, as shown in fig. 4, in T2, sending the U message comprises the following sub-steps:
T21, the client sends a U message to the database, wherein the U message comprises the number of rows to be inserted and a data list, for example:
{
3,
{1,1,1};
{2,2,2};
{3,3,3};
};
T22, the database parses the actual data out of the U message. Since the rows are stored in the same table, their shard keys (one or more fields that determine on which data node a row is stored) have identical types, so the shard keys of rows of the same table can be processed with the same hash function. The advantage is that the hash function for the shard key is determined once, before the insert operation is processed, so the hash value of each row can be calculated quickly and accurately and the row routed to the correct data node according to that value;
T23, for each row to be inserted, the hash values are calculated in batches, obtaining the data node number corresponding to each row.
Specifically, in T22 the type of the shard key of the rows to be inserted is determined first, and then the hash values of all rows are calculated in batches according to that type.
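Steps T22 and T23 amount to one hash computation per row. A sketch, using Python's built-in hash on the first column as the shard key (a real system would pick a type-specific hash function from the shard key's column type, as T32 describes):

```python
def route_rows(rows, shard_key_index: int, n_nodes: int):
    # Map each row to a data-node number via the hash of its shard key.
    return [hash(row[shard_key_index]) % n_nodes for row in rows]

rows = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]
node_ids = route_rows(rows, 0, 2)
print(node_ids)  # [1, 0, 1] in CPython, where hash(n) == n for small ints
```

Because the hash function is fixed up front, the whole list of rows can be routed in one pass, with no per-row parsing or per-row decision about which function to apply.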
Further, T3 comprises the following sub-steps:
T31, after receiving the U message, the central node of the database parses all rows and constructs a data set from them;
T32, the database selects an appropriate hash function according to the distribution-column type of the rows in the data set and calculates the hash value of each row, so that the rows are evenly distributed across the data nodes;
T33, a new U message is initialized for each target data node; the data set is traversed and each row is placed into the new U message of its target data node, completing data distribution;
T34, the central node sends each new U message to the corresponding data node, and the data node performs the insertion according to the new U message.
For example, the original 'U' message:
U{'sql_name', 3, {(1,1,1), (2,2,2), (3,3,3)}}
is split into new 'U' messages:
DN1 'U' message: {'sql_name', 2, {(1,1,1), (2,2,2)}};
DN2 'U' message: {'sql_name', 1, {(3,3,3)}}.
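Steps T31 to T33 are a group-by over the routed rows. In the sketch below the node assignment is fixed by hand to reproduce the DN1/DN2 example above, rather than computed from a real hash function:

```python
from collections import defaultdict

def repack(sql_name, rows, node_of):
    # Group rows by target node and wrap each group in a new U message
    # of the form (statement name, row count, row list).
    groups = defaultdict(list)
    for row in rows:
        groups[node_of(row)].append(row)
    return {node: (sql_name, len(g), g) for node, g in groups.items()}

# Hand-picked routing: (1,1,1) and (2,2,2) go to DN1, (3,3,3) to DN2.
node_of = lambda row: "DN1" if row[0] < 3 else "DN2"
msgs = repack("sql_name", [(1, 1, 1), (2, 2, 2), (3, 3, 3)], node_of)
print(msgs["DN1"])  # ('sql_name', 2, [(1, 1, 1), (2, 2, 2)])
print(msgs["DN2"])  # ('sql_name', 1, [(3, 3, 3)])
```

Each resulting per-node message has the same shape as the original U message, so the data nodes can reuse the same batch-insert path (T34).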
Specifically, in T1, the P message comprises the parameterized SQL statement to be executed and a name used to identify this SQL statement; in T2, the U message also includes the name of the parameterized SQL statement from the P message.
Those of ordinary skill in the art will appreciate that the elements and method steps of each example described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the elements and steps of each example have been described generally in terms of functionality in the foregoing description to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and systems may be implemented in other ways. For example, the above-described division of units is merely a logical function division, and there may be another division manner when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not performed. The units may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present application.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention, and are intended to be included within the scope of the appended claims and description.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (5)

1. A data batch insertion method for a distributed database, characterized in that the method comprises the following steps:
T1, sending a P message: the client constructs and sends a P message containing a parameterized SQL statement to the database; after receiving the P message, the database parses the SQL statement, generates a corresponding query tree, and caches the query tree locally;
T2, sending a U message: the client sends a U message to the database; the database parses the data in the U message using the query tree, and determines the target data node of each row by route calculation over the row's shard key field and a hash function;
T3, batch data insertion: according to the hash values, the database packs the rows destined for the same data node into new U messages and sends them to the corresponding data nodes in batches for data insertion;
T4, finishing insertion: all data nodes determined in T2 are traversed to confirm that data insertion is complete, completing the data insertion process;
in T2, sending the U message further comprises the following sub-steps:
T21, the client sends a U message to the database, wherein the U message comprises the data to be inserted and a data list;
T22, the database parses the actual data from the U message and processes the shard keys of rows of the same table with the same hash function;
T23, the hash values of the rows are calculated in batches, obtaining the data node number corresponding to each row;
the U message comprises a P message and a B message, and the U message generating step comprises:
S1, the client sends a B message to the database, wherein the B message comprises an SQL statement name and one row to be inserted, and the database binds the B message to the P message of T1;
S2, the client sends an E message to the database, and the database executes the SQL statement of the B message of S1, completing the insertion of one row;
S3, the flow returns to S1 to insert the next row;
T3 comprises the following sub-steps:
T31, after receiving the U message, the central node of the database parses all rows and constructs a data set from them;
T32, the database selects an appropriate hash function according to the distribution-column type of the rows in the data set and calculates the hash value of each row, so that the rows are evenly distributed across the data nodes;
T33, a new U message is initialized for each target data node; the data set is traversed and each row is placed into the new U message of its target data node, completing data distribution;
T34, the central node sends each new U message to the corresponding data node, and the data node performs the insertion according to the new U message.
2. The data batch insertion method for a distributed database according to claim 1, characterized in that: in T1, the P message comprises the parameterized SQL statement to be executed and a name used to identify this SQL statement;
in T2, the U message also includes the name of the parameterized SQL statement from the P message.
3. An electronic device comprising a processor and a memory communicatively coupled to the processor for storing processor-executable instructions, characterized in that: the processor is configured to perform a data batch insertion method for a distributed database as claimed in any one of the preceding claims 1-2.
4. A server, characterized by: comprising at least one processor and a memory communicatively coupled to the processor, the memory storing instructions executable by the at least one processor to cause the at least one processor to perform a data bulk insertion method for a distributed database as claimed in any one of claims 1-2.
5. A computer-readable storage medium storing a computer program, characterized in that: the computer program, when executed by a processor, implements a data bulk insertion method for a distributed database as claimed in any one of claims 1-2.
CN202410755952.8A 2024-06-13 2024-06-13 Data batch insertion method for distributed database Active CN118332010B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410755952.8A CN118332010B (en) 2024-06-13 2024-06-13 Data batch insertion method for distributed database


Publications (2)

Publication Number Publication Date
CN118332010A CN118332010A (en) 2024-07-12
CN118332010B true CN118332010B (en) 2024-10-11

Family

ID=91769652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410755952.8A Active CN118332010B (en) 2024-06-13 2024-06-13 Data batch insertion method for distributed database

Country Status (1)

Country Link
CN (1) CN118332010B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015820A (en) * 2020-09-01 2020-12-01 杭州欧若数网科技有限公司 Method, system, electronic device and storage medium for implementing distributed graph database
CN112231752A (en) * 2020-10-20 2021-01-15 南开大学 Ciphertext insertion query deletion method without interactive frequency hiding
CN114297278A (en) * 2021-12-22 2022-04-08 苏州众言网络科技股份有限公司 Batch data fast writing method, system and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10019297B2 (en) * 2013-04-03 2018-07-10 Salesforce.Com, Inc. Systems and methods for implementing bulk handling in asynchronous processing
CN114428820B (en) * 2022-01-26 2025-07-15 普联技术有限公司 Distributed data real-time synchronization method, system and data synchronization device


Also Published As

Publication number Publication date
CN118332010A (en) 2024-07-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant