Disclosure of Invention
In order to solve the defects in the prior art and achieve the purposes of reducing the ordering complexity and the data comparison times under the condition of large data volume, the invention adopts the following technical scheme:
A method for inquiring equivalence based on grouping information includes such steps as sorting the connection columns in connection table, dividing the data based on base number, recursively calling said data sorting and grouping method to obtain sorted data and grouping result, comparing the data values of current group and current connection column, changing the group based on the sorting direction, and outputting the grouping data as inquiry result.
Further, the sorting and grouping are to sort the data of the current connection column, find out the repeated value area of the current connection column according to the sorted data to divide the current connection column into different groups, sort and group the data in the area if the current connection column is not the last connection column for the repeated value area, sort and group the next current connection column after the layer-by-layer sorting and grouping are completed, until the final sorting and grouping are completed, and combine the grouping results.
Further, setting a subscript for the current connection column, carrying out layer-by-layer sequencing and grouping on the data in the area of the current connection column based on the subscript, after completion, shifting the subscript backward, carrying out layer-by-layer sequencing and grouping on the data in the area of the next current connection column, and merging grouping results through the subscript.
Further, after the current connection column is ordered, the starting position and the length of all the repeated value areas of the current connection column are marked, and after all the grouping is completed, the starting position and the length of the areas are added in all the grouping results.
Further, in the left and right connection tables, the order of connection columns is sequentially increased from left to right and from top to bottom according to the data value of the connection columns;
constructing an outer layer cycle to circularly execute an inner layer cycle until no more groups exist in the current left table or the current right table, and outputting a query result set;
if the data value of the current connection column of the current group of the left connection table is smaller than the data value of the current connection column of the current group of the right connection table, the next group of the right connection table is used as the current group, the inner layer circulation is built, if the data value of the current connection column of the current group of the left connection table is larger than the data value of the current connection column of the current group of the right connection table, the next group of the left connection table is used as the current group, the inner layer circulation is stopped, and if the current groups of the left connection table and the right connection table are equal in all connection columns, the current left connection table and the right connection table are added to a query result set, and the next groups of the left connection table and the right connection table are used as the current groups.
Further, setting and maintaining subscripts of current groups of left and right connection tables, when the current connection column data value of the current group of the left connection table is larger than the current connection column data value of the current group of the right connection table, moving the subscripts of the current group of the left connection table backward when the data value of the current connection column of the current group of the left connection table is smaller than the data value of the current connection column of the current group of the right connection table, and moving the subscripts of the current groups of the left and right connection tables backward when the current groups of the left and right connection tables are equal in all connection columns.
Further, when all the connection columns of the current left and right connection tables are equal, each row in the current left and right connection tables is formed into a row of results, and the results are added to the query result set.
A database system supports the equivalent query method based on grouping information to perform data query.
The invention has the advantages that:
The equivalent query method and the database system based on the grouping information are characterized in that based on the repeated value area of the data, the data are ordered and grouped according to the connection columns, so that the complexity of the ordering is reduced, the sort-merge join algorithm is improved by utilizing the ordered grouping information, when the pointers of the left table and the right table are matched, the area with the same value of the connection columns is skipped, the repeated comparison of the data in each group of the left table and the data in each group of the right table is avoided, the comparison times are reduced, and the performance of the database system is improved.
Detailed Description
The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
As shown in fig. 1, the sort-merge join existing algorithm is connected based on a sorted list, comprising the steps of:
step 1, sorting left table data according to connection columns, sequentially increasing from left to right and sequentially increasing from top to bottom;
step 2, sorting the right table data according to the connection columns, sequentially increasing from left to right and sequentially increasing from top to bottom;
Step 3, initializing a null data set;
initializing row pointers of a left table and a right table, and respectively pointing to a first piece of data after ordering the left table and the right table;
And 5, constructing and executing an outer layer loop, which comprises the following steps:
step 5.1, constructing and executing a first inner layer circulation, and comparing corresponding connection columns from left to right for the row pointed by the left table pointer and the row pointed by the right table pointer until all the connection columns are equal, namely, finding out the corresponding equal connection columns in the left table and the right table;
If a certain connection column of the left table is smaller than a corresponding connection column of the right table, judging that the left table pointer points to the last piece of data of the current row of the left table, and moving the left table pointer to the next row;
If a certain connection column of the left table is larger than a corresponding connection column of the right table, judging that the pointer of the right table points to the last piece of data of the current row of the right table, and moving the pointer of the right table to the next row;
Step 5.2, recording the current position of the pointer of the left table;
Step 5.3, constructing and executing a second inner layer circulation, and comparing corresponding connection columns from left to right for the row pointed by the left table pointer and the row pointed by the right table pointer until all the connection columns are equal or the left table has no more data;
Forming a row of results by pointing the left and right table pointers to the row, and adding the row of results to the result;
if the right table has more data, the pointer of the right table moves to the next row, otherwise, the outer layer circulation is stopped, and a result is output;
comparing the corresponding connection columns from left to right for the row of the recording position before the left table pointer and the row pointed by the right table pointer, if all the rows are equal, resetting the left table pointer to the position recorded before, if not all the rows are equal, judging again, if the left table has no more data, stopping the outer layer circulation, outputting a result, and if the left table has more data, continuing the outer layer circulation;
And 6, outputting a result.
The size comparison of the connection columns refers to comparing the values of the connection columns of the left and right tables in a certain row, for example, the values of the three connection columns of the left table in a certain row are a, b, c, and the values of the three connection columns of the right table in a certain row are d, e, f, and if a > d, the values of the connection columns of the left table and the right table are compared one by one, for example, if a=d, the connection column of the left table is larger than the connection column of the right table, if a=d, it is determined whether b is larger than e, if b > e, the connection column of the left table is larger, if b=e, c and f are continuously compared, and so on.
As shown in fig. 2, the data is ordered according to the connection columns, generally using merge ordering, comprising the steps of:
step 1, returning original data (output data) if the length of input data (table, sequencing column) is < =1, otherwise, separating the input data from the middle row;
Step 2, recursively calling the merging and sorting to the left and the right, for example, if one array is [ a, b, c, d, e, f, ], dividing the data into a left part and a right part, wherein the left part is [ a, b, c ], the right part is [ d, e, f ], and recursively calling the merging and sorting to the left and the right;
Step 3, initializing result data to be empty;
And 4, merging the sequencing results of the left side and the right side:
Step 4.1, maintaining the positions of the left and right currently considered rows;
step 4.2, build and execute the following loop until no more data is left or right:
Comparing corresponding connection columns from left to right for the left and right currently considered rows, if all are equal, or for the first unequal connection column, if left < right, the left current row is added to the end of the result data, and the position of the left currently considered row is shifted back by 1 bit, otherwise, the right current row is added to the end of the result data, and the position of the right currently considered row is shifted back by 1 bit;
Step 4.3, if left rows are not considered, adding all the left rows to the end of the result data;
step 4.4, if the left side has the unaccounted row, adding all the left rows to the end of the result data;
And 7, returning a merging result.
The existing sort-based table is connected with the sort-merge join, and when the data volume is large, the aim of improving the performance of the algorithm can be achieved by starting from the following two aspects:
1. Finding a ranking algorithm with lower complexity for large data volumes. The time complexity of the merging and sorting algorithm is O (k is n is log (n)), k is the number of connection columns, and n is the data length;
2. When the pointers of the left table and the right table are matched, the areas with the same values of the connecting columns are skipped, so that the comparison times are reduced. The above-mentioned sort-merge join algorithm compares the number of times O (k×m×n) for the left and right table rows, k is the number of connected columns, m is the left table length, and n is the right table length.
In order to improve the performance of a sort-merge join algorithm, the invention provides a method for sorting a table according to a plurality of columns, and improves the sorting performance under a large data volume by using the concept of radix sorting. The data are grouped while being ordered, and one group is defined as a block area with the same value of all connection columns in the ordered data.
First, the data are ordered and grouped according to the connection columns, as shown in fig. 3, and the specific flow is as follows:
Step 1, inputting data, all connection columns and currently considered connection column subscripts (0 when being called for the first time);
Step 2, sorting the data according to the currently considered connection columns;
step 3, marking the starting positions and the lengths of all repeated value areas of the current connection column for the sequenced data;
Step 4, initializing the grouping result to be null;
step 5, for each repeated value area of the current column, executing the following operations:
if the current column is not the last connection column, recursively calling the current data sequencing and grouping flow for the data in the region, and the subscript of the connection column considered is shifted backwards by 1;
Otherwise, the starting position and the length of the current area are added in all grouping results;
and 6, returning the sequenced data and the grouping result.
Assuming "sorting the data according to the currently considered connection columns" in the above step 2, when radix sorting is used, the complexity of the algorithm is O (k×w×n), k is the number of connection columns, w is the average length of each piece of data in the array, and n is the length of data. For data in a database, w is typically not too long, such as a column of data type bits of a 32-bit integer, and the maximum value of w is 32.
When the data length n is large, w < log (n) according to the above assumption. The algorithm complexity O (k×w×n) of multi-column sorting in the database system using the concept of radix sorting is < the algorithm complexity O (k×n×log (n)) of multi-column sorting using merge sorting, i.e. the performance of the database system is improved.
Then, based on the grouping result, the sort-merge join algorithm is improved to reduce the number of comparisons, as shown in fig. 4, and specifically includes the following steps:
Step 1, inputting results of the left and right tables ordered according to the connection columns and grouping results;
step 2, initializing a null data set;
Step 3, maintaining the subscripts of the currently considered groups of the left table and the right table;
step 4, the outer layer of the component is circulated, and the following steps are executed in a circulating way until no more groups of the current left table or the current right table can be considered:
step 4.1, constructing an inner layer loop to circularly execute all connection columns:
comparing the current group of the left and right tables, the value of the current connection column:
If the current connection column of the current group of the left table is greater than the current connection column of the current group of the right table, the subscript of the currently considered group of the right table is moved backwards by 1, and the inner layer circulation is stopped;
if the current connection column of the current group of the left table is smaller than the current connection column of the current group of the right table, the subscript of the currently considered group of the left table is moved backwards by 1, and the inner layer circulation is stopped;
if the current groups of the left and right tables are equal in all the connection columns, forming a result row by each row in the current left table group and each row in the current right table group, adding the result row to a result set, and shifting the subscript of the current considered group of the left and right tables back by 1;
and 5, outputting a result set.
In the modified sort-merge join algorithm, the subscripts of the currently considered group of the left and right tables are only moved backward and not moved forward. The total comparison number is O (k×m+n), k is the number of connection columns, m is the left table length, and n is the right table length. Compared with the comparison times O (k x m x n) of the general algorithm of the existing sort-merge join, the new algorithm obviously reduces the comparison times and improves the performance of the database system.
Performance comparison of merge ordering with radix ordering:
As shown in fig. 5, by comparing the time consumption of each sorting algorithm under different data amounts, it can be seen that the method (radix sort) of the present invention increases the time consumption of the merging sorting by a very significant extent with respect to the radix sorting, and the total consumption of the radix sorting is significantly advantageous with respect to the merging sorting when the data amount increases.
A database system adopts the equivalent query method based on grouping information to perform data query. This part of the embodiments are similar to the embodiments of the method embodiments described above, and will not be repeated here.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the foregoing embodiments may be modified or equivalents may be substituted for some or all of the features thereof, and that the modifications or substitutions do not depart from the scope of the embodiments of the present invention.