CN119691004B

CN119691004B - Equivalent query method based on grouping information and database system

Info

Publication number: CN119691004B
Application number: CN202510213393.2A
Authority: CN
Inventors: 周小华; 胡津铭
Original assignee: Zhejiang Zhiyu Technology Co ltd
Current assignee: Zhejiang Zhiyu Technology Co ltd
Priority date: 2025-02-26
Filing date: 2025-02-26
Publication date: 2025-05-09
Anticipated expiration: 2045-02-26
Also published as: CN119691004A

Abstract

The invention discloses an equal value query method based on grouping information and a database system, belonging to the technical field of databases. The database system adopts an equal value query method based on grouping information. The method firstly sorts and groups connection columns in a connection table. The sorting is to sort the data values in front and back. The grouping is to divide the repeated value area based on the cardinality sorted data, and recursively call the data sorting and grouping method for the data in the area to finally obtain the sorted data and grouping results. Then, the sorting and grouping results of the connection columns in at least two connection tables are obtained, the data values of the current group and the current connection column of the connection table are compared, and according to the comparison result of the data value size, the grouping is changed based on the direction of the numerical value sorting. If all the connection columns of the group are equal, the grouped data of the connection table are output as the query result.

Description

Equivalent query method based on grouping information and database system

Technical Field

The invention belongs to the technical field of databases, and particularly relates to an equivalent query method based on grouping information and a database system.

Background

In the multi-table query of the database system, the connection query comprises an inner connection, an outer connection and the like, wherein the inner connection is also called equivalent query Equal join, and each row of data in two tables is matched according to the standard that a plurality of columns are respectively Equal. For example, if each row of data of the left table attempts to match all rows of the right table, if a row of the left table matches a row of the right table, all columns of the rows of the left and right tables are combined into a row and placed in the result, the row without the match will not be output. Columns for which equivalent connections are used to match left and right table rows are called connection columns.

The current method for realizing equivalent query of the Equal join in the database mainly comprises the steps of connecting the sort-merge join based on the ordered table and connecting the hash join based on the hash table. Taking the sort-based table connection sort-merge join as an example, the number of comparisons to left and right table rows is large, and there is repeated comparison, whereas in the prior art, the time complexity is high based on the merge sort algorithm.

Disclosure of Invention

In order to solve the defects in the prior art and achieve the purposes of reducing the ordering complexity and the data comparison times under the condition of large data volume, the invention adopts the following technical scheme:

A method for inquiring equivalence based on grouping information includes such steps as sorting the connection columns in connection table, dividing the data based on base number, recursively calling said data sorting and grouping method to obtain sorted data and grouping result, comparing the data values of current group and current connection column, changing the group based on the sorting direction, and outputting the grouping data as inquiry result.

Further, the sorting and grouping are to sort the data of the current connection column, find out the repeated value area of the current connection column according to the sorted data to divide the current connection column into different groups, sort and group the data in the area if the current connection column is not the last connection column for the repeated value area, sort and group the next current connection column after the layer-by-layer sorting and grouping are completed, until the final sorting and grouping are completed, and combine the grouping results.

Further, setting a subscript for the current connection column, carrying out layer-by-layer sequencing and grouping on the data in the area of the current connection column based on the subscript, after completion, shifting the subscript backward, carrying out layer-by-layer sequencing and grouping on the data in the area of the next current connection column, and merging grouping results through the subscript.

Further, after the current connection column is ordered, the starting position and the length of all the repeated value areas of the current connection column are marked, and after all the grouping is completed, the starting position and the length of the areas are added in all the grouping results.

Further, in the left and right connection tables, the order of connection columns is sequentially increased from left to right and from top to bottom according to the data value of the connection columns;

constructing an outer layer cycle to circularly execute an inner layer cycle until no more groups exist in the current left table or the current right table, and outputting a query result set;

if the data value of the current connection column of the current group of the left connection table is smaller than the data value of the current connection column of the current group of the right connection table, the next group of the right connection table is used as the current group, the inner layer circulation is built, if the data value of the current connection column of the current group of the left connection table is larger than the data value of the current connection column of the current group of the right connection table, the next group of the left connection table is used as the current group, the inner layer circulation is stopped, and if the current groups of the left connection table and the right connection table are equal in all connection columns, the current left connection table and the right connection table are added to a query result set, and the next groups of the left connection table and the right connection table are used as the current groups.

Further, setting and maintaining subscripts of current groups of left and right connection tables, when the current connection column data value of the current group of the left connection table is larger than the current connection column data value of the current group of the right connection table, moving the subscripts of the current group of the left connection table backward when the data value of the current connection column of the current group of the left connection table is smaller than the data value of the current connection column of the current group of the right connection table, and moving the subscripts of the current groups of the left and right connection tables backward when the current groups of the left and right connection tables are equal in all connection columns.

Further, when all the connection columns of the current left and right connection tables are equal, each row in the current left and right connection tables is formed into a row of results, and the results are added to the query result set.

A database system supports the equivalent query method based on grouping information to perform data query.

The invention has the advantages that:

The equivalent query method and the database system based on the grouping information are characterized in that based on the repeated value area of the data, the data are ordered and grouped according to the connection columns, so that the complexity of the ordering is reduced, the sort-merge join algorithm is improved by utilizing the ordered grouping information, when the pointers of the left table and the right table are matched, the area with the same value of the connection columns is skipped, the repeated comparison of the data in each group of the left table and the data in each group of the right table is avoided, the comparison times are reduced, and the performance of the database system is improved.

Drawings

FIG. 1 is a flow chart of a method of the prior art sort-merge join.

FIG. 2 is a flow chart for merge ordering of data according to a connection column.

Fig. 3 is a flow chart of sorting and grouping data according to connection columns using the radix sorting concept in an embodiment of the present invention.

FIG. 4 is a flow chart of a method of improving a sort-merge join based on a grouping result in an embodiment of the present invention.

FIG. 5 is a time-consuming comparison of ranking methods in an embodiment of the invention.

Detailed Description

The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.

As shown in fig. 1, the sort-merge join existing algorithm is connected based on a sorted list, comprising the steps of:

step 1, sorting left table data according to connection columns, sequentially increasing from left to right and sequentially increasing from top to bottom;

step 2, sorting the right table data according to the connection columns, sequentially increasing from left to right and sequentially increasing from top to bottom;

Step 3, initializing a null data set;

initializing row pointers of a left table and a right table, and respectively pointing to a first piece of data after ordering the left table and the right table;

And 5, constructing and executing an outer layer loop, which comprises the following steps:

step 5.1, constructing and executing a first inner layer circulation, and comparing corresponding connection columns from left to right for the row pointed by the left table pointer and the row pointed by the right table pointer until all the connection columns are equal, namely, finding out the corresponding equal connection columns in the left table and the right table;

If a certain connection column of the left table is smaller than a corresponding connection column of the right table, judging that the left table pointer points to the last piece of data of the current row of the left table, and moving the left table pointer to the next row;

If a certain connection column of the left table is larger than a corresponding connection column of the right table, judging that the pointer of the right table points to the last piece of data of the current row of the right table, and moving the pointer of the right table to the next row;

Step 5.2, recording the current position of the pointer of the left table;

Step 5.3, constructing and executing a second inner layer circulation, and comparing corresponding connection columns from left to right for the row pointed by the left table pointer and the row pointed by the right table pointer until all the connection columns are equal or the left table has no more data;

Forming a row of results by pointing the left and right table pointers to the row, and adding the row of results to the result;

if the right table has more data, the pointer of the right table moves to the next row, otherwise, the outer layer circulation is stopped, and a result is output;

comparing the corresponding connection columns from left to right for the row of the recording position before the left table pointer and the row pointed by the right table pointer, if all the rows are equal, resetting the left table pointer to the position recorded before, if not all the rows are equal, judging again, if the left table has no more data, stopping the outer layer circulation, outputting a result, and if the left table has more data, continuing the outer layer circulation;

And 6, outputting a result.

The size comparison of the connection columns refers to comparing the values of the connection columns of the left and right tables in a certain row, for example, the values of the three connection columns of the left table in a certain row are a, b, c, and the values of the three connection columns of the right table in a certain row are d, e, f, and if a > d, the values of the connection columns of the left table and the right table are compared one by one, for example, if a=d, the connection column of the left table is larger than the connection column of the right table, if a=d, it is determined whether b is larger than e, if b > e, the connection column of the left table is larger, if b=e, c and f are continuously compared, and so on.

As shown in fig. 2, the data is ordered according to the connection columns, generally using merge ordering, comprising the steps of:

step 1, returning original data (output data) if the length of input data (table, sequencing column) is < =1, otherwise, separating the input data from the middle row;

Step 2, recursively calling the merging and sorting to the left and the right, for example, if one array is [ a, b, c, d, e, f, ], dividing the data into a left part and a right part, wherein the left part is [ a, b, c ], the right part is [ d, e, f ], and recursively calling the merging and sorting to the left and the right;

Step 3, initializing result data to be empty;

And 4, merging the sequencing results of the left side and the right side:

Step 4.1, maintaining the positions of the left and right currently considered rows;

step 4.2, build and execute the following loop until no more data is left or right:

Comparing corresponding connection columns from left to right for the left and right currently considered rows, if all are equal, or for the first unequal connection column, if left < right, the left current row is added to the end of the result data, and the position of the left currently considered row is shifted back by 1 bit, otherwise, the right current row is added to the end of the result data, and the position of the right currently considered row is shifted back by 1 bit;

Step 4.3, if left rows are not considered, adding all the left rows to the end of the result data;

step 4.4, if the left side has the unaccounted row, adding all the left rows to the end of the result data;

And 7, returning a merging result.

The existing sort-based table is connected with the sort-merge join, and when the data volume is large, the aim of improving the performance of the algorithm can be achieved by starting from the following two aspects:

1. Finding a ranking algorithm with lower complexity for large data volumes. The time complexity of the merging and sorting algorithm is O (k is n is log (n)), k is the number of connection columns, and n is the data length;

2. When the pointers of the left table and the right table are matched, the areas with the same values of the connecting columns are skipped, so that the comparison times are reduced. The above-mentioned sort-merge join algorithm compares the number of times O (k×m×n) for the left and right table rows, k is the number of connected columns, m is the left table length, and n is the right table length.

In order to improve the performance of a sort-merge join algorithm, the invention provides a method for sorting a table according to a plurality of columns, and improves the sorting performance under a large data volume by using the concept of radix sorting. The data are grouped while being ordered, and one group is defined as a block area with the same value of all connection columns in the ordered data.

First, the data are ordered and grouped according to the connection columns, as shown in fig. 3, and the specific flow is as follows:

Step 1, inputting data, all connection columns and currently considered connection column subscripts (0 when being called for the first time);

Step 2, sorting the data according to the currently considered connection columns;

step 3, marking the starting positions and the lengths of all repeated value areas of the current connection column for the sequenced data;

Step 4, initializing the grouping result to be null;

step 5, for each repeated value area of the current column, executing the following operations:

if the current column is not the last connection column, recursively calling the current data sequencing and grouping flow for the data in the region, and the subscript of the connection column considered is shifted backwards by 1;

Otherwise, the starting position and the length of the current area are added in all grouping results;

and 6, returning the sequenced data and the grouping result.

Assuming "sorting the data according to the currently considered connection columns" in the above step 2, when radix sorting is used, the complexity of the algorithm is O (k×w×n), k is the number of connection columns, w is the average length of each piece of data in the array, and n is the length of data. For data in a database, w is typically not too long, such as a column of data type bits of a 32-bit integer, and the maximum value of w is 32.

When the data length n is large, w < log (n) according to the above assumption. The algorithm complexity O (k×w×n) of multi-column sorting in the database system using the concept of radix sorting is < the algorithm complexity O (k×n×log (n)) of multi-column sorting using merge sorting, i.e. the performance of the database system is improved.

Then, based on the grouping result, the sort-merge join algorithm is improved to reduce the number of comparisons, as shown in fig. 4, and specifically includes the following steps:

Step 1, inputting results of the left and right tables ordered according to the connection columns and grouping results;

step 2, initializing a null data set;

Step 3, maintaining the subscripts of the currently considered groups of the left table and the right table;

step 4, the outer layer of the component is circulated, and the following steps are executed in a circulating way until no more groups of the current left table or the current right table can be considered:

step 4.1, constructing an inner layer loop to circularly execute all connection columns:

comparing the current group of the left and right tables, the value of the current connection column:

If the current connection column of the current group of the left table is greater than the current connection column of the current group of the right table, the subscript of the currently considered group of the right table is moved backwards by 1, and the inner layer circulation is stopped;

if the current connection column of the current group of the left table is smaller than the current connection column of the current group of the right table, the subscript of the currently considered group of the left table is moved backwards by 1, and the inner layer circulation is stopped;

if the current groups of the left and right tables are equal in all the connection columns, forming a result row by each row in the current left table group and each row in the current right table group, adding the result row to a result set, and shifting the subscript of the current considered group of the left and right tables back by 1;

and 5, outputting a result set.

In the modified sort-merge join algorithm, the subscripts of the currently considered group of the left and right tables are only moved backward and not moved forward. The total comparison number is O (k×m+n), k is the number of connection columns, m is the left table length, and n is the right table length. Compared with the comparison times O (k x m x n) of the general algorithm of the existing sort-merge join, the new algorithm obviously reduces the comparison times and improves the performance of the database system.

Performance comparison of merge ordering with radix ordering:

As shown in fig. 5, by comparing the time consumption of each sorting algorithm under different data amounts, it can be seen that the method (radix sort) of the present invention increases the time consumption of the merging sorting by a very significant extent with respect to the radix sorting, and the total consumption of the radix sorting is significantly advantageous with respect to the merging sorting when the data amount increases.

A database system adopts the equivalent query method based on grouping information to perform data query. This part of the embodiments are similar to the embodiments of the method embodiments described above, and will not be repeated here.

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the foregoing embodiments may be modified or equivalents may be substituted for some or all of the features thereof, and that the modifications or substitutions do not depart from the scope of the embodiments of the present invention.

Claims

1. An equivalent query method based on grouping information is characterized in that:

Sorting and grouping connection columns in a connection table, wherein the sorting is to sort the connection columns back and forth according to the size of data values, the grouping is to divide repeated value areas based on data of base number sorting, and one group is defined as an area with the same value of all connection columns in the sorted data, and the data sorting and grouping method is called for data recursion the area, so that the sorted data and grouping result are finally obtained;

Obtaining the sorting and grouping results of the connection columns in at least two connection tables, comparing the data value sizes of the current group and the current connection column of the connection tables, replacing the grouping based on the sorting direction of the numerical values according to the data value size comparison result, and outputting grouping data of the connection tables as a query result if all the connection columns of the grouping are equal;

In the left and right connection tables, the sequence of connection columns is increased from left to right and from top to bottom according to the data value;

2. The method for equivalent query based on grouping information of claim 1, wherein the sorting and grouping are performed on the data of the current connection column, the repeated value area of the current connection column is found out according to the sorted data to be divided into different groups, for the repeated value area, if the current connection column is not the last connection column, the data in the area is sorted and grouped, after the layer-by-layer sorting and grouping are completed, the next current connection column is sorted and grouped until the final sorting and grouping are completed, and grouping results are combined.

3. The method for equivalent query based on grouping information of claim 1, wherein a subscript is set for a current connection column, layer-by-layer ordering and grouping are performed on data in an area of the current connection column based on the subscript, after completion, the subscript is moved backward, layer-by-layer ordering and grouping are performed on the data in the area of the next current connection column, and grouping results are combined through the subscript.

4. The method for equivalent query based on grouping information as set forth in claim 1, wherein the starting positions and lengths of all repeated value areas of the current connection columns are marked after the current connection columns are ordered, and the starting positions and lengths of the areas are appended to all grouping results after all grouping is completed.

5. The method for querying equivalence based on grouping information according to claim 1, wherein subscripts of current groups of left and right connection tables are set and maintained, subscripts of current groups of right connection tables are moved backward when current connection column data values of the current groups of left and right connection tables are larger than current connection column data values of the current groups of right connection tables, subscripts of current groups of left and right connection tables are moved backward when data values of current connection columns of the current groups of left and right connection tables are smaller than data values of current connection columns of the current groups of right connection tables, and subscripts of current groups of left and right connection tables are moved backward when the current groups of left and right connection tables are equal in all connection columns.

6. The method of claim 1, wherein when all connection columns of the left and right connection tables are equal, each row in the left connection table and each row in the right connection table are combined into a row result, and the row result is added to the query result set.

7. A database system is characterized in that the database system supports the equivalent query method based on grouping information as set forth in claim 1 for data query.