+

CN119691004B - Equivalent query method based on grouping information and database system - Google Patents

Equivalent query method based on grouping information and database system Download PDF

Info

Publication number
CN119691004B
CN119691004B CN202510213393.2A CN202510213393A CN119691004B CN 119691004 B CN119691004 B CN 119691004B CN 202510213393 A CN202510213393 A CN 202510213393A CN 119691004 B CN119691004 B CN 119691004B
Authority
CN
China
Prior art keywords
connection
current
data
grouping
columns
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510213393.2A
Other languages
Chinese (zh)
Other versions
CN119691004A (en
Inventor
周小华
胡津铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Zhiyu Technology Co ltd
Original Assignee
Zhejiang Zhiyu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Zhiyu Technology Co ltd filed Critical Zhejiang Zhiyu Technology Co ltd
Priority to CN202510213393.2A priority Critical patent/CN119691004B/en
Publication of CN119691004A publication Critical patent/CN119691004A/en
Application granted granted Critical
Publication of CN119691004B publication Critical patent/CN119691004B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开了一种基于分组信息的等值查询方法及数据库系统,属于数据库技术领域,数据库系统采用基于分组信息的等值查询方法,该方法首先,对连接表中的连接列进行排序和分组,排序是以数据值的大小进行前后排序,分组是基于基数排序的数据进行重复值区域的划分,并对所述区域内数据递归的调用所述数据排序和分组方法,最终得到排序后的数据和分组结果;然后,获取至少两个连接表中连接列的排序和分组结果,比较连接表当前组、当前连接列的数据值大小,根据数据值大小比较结果,基于数值大小排序的方向更换分组,对于分组的所有连接列均相等的,则将连接表分组数据作为查询结果输出。

The invention discloses an equal value query method based on grouping information and a database system, belonging to the technical field of databases. The database system adopts an equal value query method based on grouping information. The method firstly sorts and groups connection columns in a connection table. The sorting is to sort the data values in front and back. The grouping is to divide the repeated value area based on the cardinality sorted data, and recursively call the data sorting and grouping method for the data in the area to finally obtain the sorted data and grouping results. Then, the sorting and grouping results of the connection columns in at least two connection tables are obtained, the data values of the current group and the current connection column of the connection table are compared, and according to the comparison result of the data value size, the grouping is changed based on the direction of the numerical value sorting. If all the connection columns of the group are equal, the grouped data of the connection table are output as the query result.

Description

Equivalent query method based on grouping information and database system
Technical Field
The invention belongs to the technical field of databases, and particularly relates to an equivalent query method based on grouping information and a database system.
Background
In the multi-table query of the database system, the connection query comprises an inner connection, an outer connection and the like, wherein the inner connection is also called equivalent query Equal join, and each row of data in two tables is matched according to the standard that a plurality of columns are respectively Equal. For example, if each row of data of the left table attempts to match all rows of the right table, if a row of the left table matches a row of the right table, all columns of the rows of the left and right tables are combined into a row and placed in the result, the row without the match will not be output. Columns for which equivalent connections are used to match left and right table rows are called connection columns.
The current method for realizing equivalent query of the Equal join in the database mainly comprises the steps of connecting the sort-merge join based on the ordered table and connecting the hash join based on the hash table. Taking the sort-based table connection sort-merge join as an example, the number of comparisons to left and right table rows is large, and there is repeated comparison, whereas in the prior art, the time complexity is high based on the merge sort algorithm.
Disclosure of Invention
In order to solve the defects in the prior art and achieve the purposes of reducing the ordering complexity and the data comparison times under the condition of large data volume, the invention adopts the following technical scheme:
A method for inquiring equivalence based on grouping information includes such steps as sorting the connection columns in connection table, dividing the data based on base number, recursively calling said data sorting and grouping method to obtain sorted data and grouping result, comparing the data values of current group and current connection column, changing the group based on the sorting direction, and outputting the grouping data as inquiry result.
Further, the sorting and grouping are to sort the data of the current connection column, find out the repeated value area of the current connection column according to the sorted data to divide the current connection column into different groups, sort and group the data in the area if the current connection column is not the last connection column for the repeated value area, sort and group the next current connection column after the layer-by-layer sorting and grouping are completed, until the final sorting and grouping are completed, and combine the grouping results.
Further, setting a subscript for the current connection column, carrying out layer-by-layer sequencing and grouping on the data in the area of the current connection column based on the subscript, after completion, shifting the subscript backward, carrying out layer-by-layer sequencing and grouping on the data in the area of the next current connection column, and merging grouping results through the subscript.
Further, after the current connection column is ordered, the starting position and the length of all the repeated value areas of the current connection column are marked, and after all the grouping is completed, the starting position and the length of the areas are added in all the grouping results.
Further, in the left and right connection tables, the order of connection columns is sequentially increased from left to right and from top to bottom according to the data value of the connection columns;
constructing an outer layer cycle to circularly execute an inner layer cycle until no more groups exist in the current left table or the current right table, and outputting a query result set;
if the data value of the current connection column of the current group of the left connection table is smaller than the data value of the current connection column of the current group of the right connection table, the next group of the right connection table is used as the current group, the inner layer circulation is built, if the data value of the current connection column of the current group of the left connection table is larger than the data value of the current connection column of the current group of the right connection table, the next group of the left connection table is used as the current group, the inner layer circulation is stopped, and if the current groups of the left connection table and the right connection table are equal in all connection columns, the current left connection table and the right connection table are added to a query result set, and the next groups of the left connection table and the right connection table are used as the current groups.
Further, setting and maintaining subscripts of current groups of left and right connection tables, when the current connection column data value of the current group of the left connection table is larger than the current connection column data value of the current group of the right connection table, moving the subscripts of the current group of the left connection table backward when the data value of the current connection column of the current group of the left connection table is smaller than the data value of the current connection column of the current group of the right connection table, and moving the subscripts of the current groups of the left and right connection tables backward when the current groups of the left and right connection tables are equal in all connection columns.
Further, when all the connection columns of the current left and right connection tables are equal, each row in the current left and right connection tables is formed into a row of results, and the results are added to the query result set.
A database system supports the equivalent query method based on grouping information to perform data query.
The invention has the advantages that:
The equivalent query method and the database system based on the grouping information are characterized in that based on the repeated value area of the data, the data are ordered and grouped according to the connection columns, so that the complexity of the ordering is reduced, the sort-merge join algorithm is improved by utilizing the ordered grouping information, when the pointers of the left table and the right table are matched, the area with the same value of the connection columns is skipped, the repeated comparison of the data in each group of the left table and the data in each group of the right table is avoided, the comparison times are reduced, and the performance of the database system is improved.
Drawings
FIG. 1 is a flow chart of a method of the prior art sort-merge join.
FIG. 2 is a flow chart for merge ordering of data according to a connection column.
Fig. 3 is a flow chart of sorting and grouping data according to connection columns using the radix sorting concept in an embodiment of the present invention.
FIG. 4 is a flow chart of a method of improving a sort-merge join based on a grouping result in an embodiment of the present invention.
FIG. 5 is a time-consuming comparison of ranking methods in an embodiment of the invention.
Detailed Description
The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
As shown in fig. 1, the sort-merge join existing algorithm is connected based on a sorted list, comprising the steps of:
step 1, sorting left table data according to connection columns, sequentially increasing from left to right and sequentially increasing from top to bottom;
step 2, sorting the right table data according to the connection columns, sequentially increasing from left to right and sequentially increasing from top to bottom;
Step 3, initializing a null data set;
initializing row pointers of a left table and a right table, and respectively pointing to a first piece of data after ordering the left table and the right table;
And 5, constructing and executing an outer layer loop, which comprises the following steps:
step 5.1, constructing and executing a first inner layer circulation, and comparing corresponding connection columns from left to right for the row pointed by the left table pointer and the row pointed by the right table pointer until all the connection columns are equal, namely, finding out the corresponding equal connection columns in the left table and the right table;
If a certain connection column of the left table is smaller than a corresponding connection column of the right table, judging that the left table pointer points to the last piece of data of the current row of the left table, and moving the left table pointer to the next row;
If a certain connection column of the left table is larger than a corresponding connection column of the right table, judging that the pointer of the right table points to the last piece of data of the current row of the right table, and moving the pointer of the right table to the next row;
Step 5.2, recording the current position of the pointer of the left table;
Step 5.3, constructing and executing a second inner layer circulation, and comparing corresponding connection columns from left to right for the row pointed by the left table pointer and the row pointed by the right table pointer until all the connection columns are equal or the left table has no more data;
Forming a row of results by pointing the left and right table pointers to the row, and adding the row of results to the result;
if the right table has more data, the pointer of the right table moves to the next row, otherwise, the outer layer circulation is stopped, and a result is output;
comparing the corresponding connection columns from left to right for the row of the recording position before the left table pointer and the row pointed by the right table pointer, if all the rows are equal, resetting the left table pointer to the position recorded before, if not all the rows are equal, judging again, if the left table has no more data, stopping the outer layer circulation, outputting a result, and if the left table has more data, continuing the outer layer circulation;
And 6, outputting a result.
The size comparison of the connection columns refers to comparing the values of the connection columns of the left and right tables in a certain row, for example, the values of the three connection columns of the left table in a certain row are a, b, c, and the values of the three connection columns of the right table in a certain row are d, e, f, and if a > d, the values of the connection columns of the left table and the right table are compared one by one, for example, if a=d, the connection column of the left table is larger than the connection column of the right table, if a=d, it is determined whether b is larger than e, if b > e, the connection column of the left table is larger, if b=e, c and f are continuously compared, and so on.
As shown in fig. 2, the data is ordered according to the connection columns, generally using merge ordering, comprising the steps of:
step 1, returning original data (output data) if the length of input data (table, sequencing column) is < =1, otherwise, separating the input data from the middle row;
Step 2, recursively calling the merging and sorting to the left and the right, for example, if one array is [ a, b, c, d, e, f, ], dividing the data into a left part and a right part, wherein the left part is [ a, b, c ], the right part is [ d, e, f ], and recursively calling the merging and sorting to the left and the right;
Step 3, initializing result data to be empty;
And 4, merging the sequencing results of the left side and the right side:
Step 4.1, maintaining the positions of the left and right currently considered rows;
step 4.2, build and execute the following loop until no more data is left or right:
Comparing corresponding connection columns from left to right for the left and right currently considered rows, if all are equal, or for the first unequal connection column, if left < right, the left current row is added to the end of the result data, and the position of the left currently considered row is shifted back by 1 bit, otherwise, the right current row is added to the end of the result data, and the position of the right currently considered row is shifted back by 1 bit;
Step 4.3, if left rows are not considered, adding all the left rows to the end of the result data;
step 4.4, if the left side has the unaccounted row, adding all the left rows to the end of the result data;
And 7, returning a merging result.
The existing sort-based table is connected with the sort-merge join, and when the data volume is large, the aim of improving the performance of the algorithm can be achieved by starting from the following two aspects:
1. Finding a ranking algorithm with lower complexity for large data volumes. The time complexity of the merging and sorting algorithm is O (k is n is log (n)), k is the number of connection columns, and n is the data length;
2. When the pointers of the left table and the right table are matched, the areas with the same values of the connecting columns are skipped, so that the comparison times are reduced. The above-mentioned sort-merge join algorithm compares the number of times O (k×m×n) for the left and right table rows, k is the number of connected columns, m is the left table length, and n is the right table length.
In order to improve the performance of a sort-merge join algorithm, the invention provides a method for sorting a table according to a plurality of columns, and improves the sorting performance under a large data volume by using the concept of radix sorting. The data are grouped while being ordered, and one group is defined as a block area with the same value of all connection columns in the ordered data.
First, the data are ordered and grouped according to the connection columns, as shown in fig. 3, and the specific flow is as follows:
Step 1, inputting data, all connection columns and currently considered connection column subscripts (0 when being called for the first time);
Step 2, sorting the data according to the currently considered connection columns;
step 3, marking the starting positions and the lengths of all repeated value areas of the current connection column for the sequenced data;
Step 4, initializing the grouping result to be null;
step 5, for each repeated value area of the current column, executing the following operations:
if the current column is not the last connection column, recursively calling the current data sequencing and grouping flow for the data in the region, and the subscript of the connection column considered is shifted backwards by 1;
Otherwise, the starting position and the length of the current area are added in all grouping results;
and 6, returning the sequenced data and the grouping result.
Assuming "sorting the data according to the currently considered connection columns" in the above step 2, when radix sorting is used, the complexity of the algorithm is O (k×w×n), k is the number of connection columns, w is the average length of each piece of data in the array, and n is the length of data. For data in a database, w is typically not too long, such as a column of data type bits of a 32-bit integer, and the maximum value of w is 32.
When the data length n is large, w < log (n) according to the above assumption. The algorithm complexity O (k×w×n) of multi-column sorting in the database system using the concept of radix sorting is < the algorithm complexity O (k×n×log (n)) of multi-column sorting using merge sorting, i.e. the performance of the database system is improved.
Then, based on the grouping result, the sort-merge join algorithm is improved to reduce the number of comparisons, as shown in fig. 4, and specifically includes the following steps:
Step 1, inputting results of the left and right tables ordered according to the connection columns and grouping results;
step 2, initializing a null data set;
Step 3, maintaining the subscripts of the currently considered groups of the left table and the right table;
step 4, the outer layer of the component is circulated, and the following steps are executed in a circulating way until no more groups of the current left table or the current right table can be considered:
step 4.1, constructing an inner layer loop to circularly execute all connection columns:
comparing the current group of the left and right tables, the value of the current connection column:
If the current connection column of the current group of the left table is greater than the current connection column of the current group of the right table, the subscript of the currently considered group of the right table is moved backwards by 1, and the inner layer circulation is stopped;
if the current connection column of the current group of the left table is smaller than the current connection column of the current group of the right table, the subscript of the currently considered group of the left table is moved backwards by 1, and the inner layer circulation is stopped;
if the current groups of the left and right tables are equal in all the connection columns, forming a result row by each row in the current left table group and each row in the current right table group, adding the result row to a result set, and shifting the subscript of the current considered group of the left and right tables back by 1;
and 5, outputting a result set.
In the modified sort-merge join algorithm, the subscripts of the currently considered group of the left and right tables are only moved backward and not moved forward. The total comparison number is O (k×m+n), k is the number of connection columns, m is the left table length, and n is the right table length. Compared with the comparison times O (k x m x n) of the general algorithm of the existing sort-merge join, the new algorithm obviously reduces the comparison times and improves the performance of the database system.
Performance comparison of merge ordering with radix ordering:
As shown in fig. 5, by comparing the time consumption of each sorting algorithm under different data amounts, it can be seen that the method (radix sort) of the present invention increases the time consumption of the merging sorting by a very significant extent with respect to the radix sorting, and the total consumption of the radix sorting is significantly advantageous with respect to the merging sorting when the data amount increases.
A database system adopts the equivalent query method based on grouping information to perform data query. This part of the embodiments are similar to the embodiments of the method embodiments described above, and will not be repeated here.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the foregoing embodiments may be modified or equivalents may be substituted for some or all of the features thereof, and that the modifications or substitutions do not depart from the scope of the embodiments of the present invention.

Claims (7)

1. An equivalent query method based on grouping information is characterized in that:
Sorting and grouping connection columns in a connection table, wherein the sorting is to sort the connection columns back and forth according to the size of data values, the grouping is to divide repeated value areas based on data of base number sorting, and one group is defined as an area with the same value of all connection columns in the sorted data, and the data sorting and grouping method is called for data recursion the area, so that the sorted data and grouping result are finally obtained;
Obtaining the sorting and grouping results of the connection columns in at least two connection tables, comparing the data value sizes of the current group and the current connection column of the connection tables, replacing the grouping based on the sorting direction of the numerical values according to the data value size comparison result, and outputting grouping data of the connection tables as a query result if all the connection columns of the grouping are equal;
In the left and right connection tables, the sequence of connection columns is increased from left to right and from top to bottom according to the data value;
constructing an outer layer cycle to circularly execute an inner layer cycle until no more groups exist in the current left table or the current right table, and outputting a query result set;
if the data value of the current connection column of the current group of the left connection table is smaller than the data value of the current connection column of the current group of the right connection table, the next group of the right connection table is used as the current group, the inner layer circulation is built, if the data value of the current connection column of the current group of the left connection table is larger than the data value of the current connection column of the current group of the right connection table, the next group of the left connection table is used as the current group, the inner layer circulation is stopped, and if the current groups of the left connection table and the right connection table are equal in all connection columns, the current left connection table and the right connection table are added to a query result set, and the next groups of the left connection table and the right connection table are used as the current groups.
2. The method for equivalent query based on grouping information of claim 1, wherein the sorting and grouping are performed on the data of the current connection column, the repeated value area of the current connection column is found out according to the sorted data to be divided into different groups, for the repeated value area, if the current connection column is not the last connection column, the data in the area is sorted and grouped, after the layer-by-layer sorting and grouping are completed, the next current connection column is sorted and grouped until the final sorting and grouping are completed, and grouping results are combined.
3. The method for equivalent query based on grouping information of claim 1, wherein a subscript is set for a current connection column, layer-by-layer ordering and grouping are performed on data in an area of the current connection column based on the subscript, after completion, the subscript is moved backward, layer-by-layer ordering and grouping are performed on the data in the area of the next current connection column, and grouping results are combined through the subscript.
4. The method for equivalent query based on grouping information as set forth in claim 1, wherein the starting positions and lengths of all repeated value areas of the current connection columns are marked after the current connection columns are ordered, and the starting positions and lengths of the areas are appended to all grouping results after all grouping is completed.
5. The method for querying equivalence based on grouping information according to claim 1, wherein subscripts of current groups of left and right connection tables are set and maintained, subscripts of current groups of right connection tables are moved backward when current connection column data values of the current groups of left and right connection tables are larger than current connection column data values of the current groups of right connection tables, subscripts of current groups of left and right connection tables are moved backward when data values of current connection columns of the current groups of left and right connection tables are smaller than data values of current connection columns of the current groups of right connection tables, and subscripts of current groups of left and right connection tables are moved backward when the current groups of left and right connection tables are equal in all connection columns.
6. The method of claim 1, wherein when all connection columns of the left and right connection tables are equal, each row in the left connection table and each row in the right connection table are combined into a row result, and the row result is added to the query result set.
7. A database system is characterized in that the database system supports the equivalent query method based on grouping information as set forth in claim 1 for data query.
CN202510213393.2A 2025-02-26 2025-02-26 Equivalent query method based on grouping information and database system Active CN119691004B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510213393.2A CN119691004B (en) 2025-02-26 2025-02-26 Equivalent query method based on grouping information and database system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510213393.2A CN119691004B (en) 2025-02-26 2025-02-26 Equivalent query method based on grouping information and database system

Publications (2)

Publication Number Publication Date
CN119691004A CN119691004A (en) 2025-03-25
CN119691004B true CN119691004B (en) 2025-05-09

Family

ID=95043845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510213393.2A Active CN119691004B (en) 2025-02-26 2025-02-26 Equivalent query method based on grouping information and database system

Country Status (1)

Country Link
CN (1) CN119691004B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117827892A (en) * 2024-01-02 2024-04-05 西安电子科技大学 Multiparty database query method and system with privacy protection
CN118113740A (en) * 2022-11-30 2024-05-31 第四范式(北京)技术有限公司 Table data processing method, apparatus, device and medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110264993A1 (en) * 2010-04-23 2011-10-27 Microsoft Corporation Multi-Threaded Sort of Data Items in Spreadsheet Tables
US10635671B2 (en) * 2016-10-05 2020-04-28 Oracle International Corporation Sort-merge band join optimization
US11138202B2 (en) * 2019-07-29 2021-10-05 Salesforce.Com, Inc. Techniques for determining and presenting dataset join candidates
CN114238377B (en) * 2021-12-17 2025-01-14 广州海量数据库技术有限公司 Method of implementing percentile_cont analysis function in OpenGauss

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118113740A (en) * 2022-11-30 2024-05-31 第四范式(北京)技术有限公司 Table data processing method, apparatus, device and medium
CN117827892A (en) * 2024-01-02 2024-04-05 西安电子科技大学 Multiparty database query method and system with privacy protection

Also Published As

Publication number Publication date
CN119691004A (en) 2025-03-25

Similar Documents

Publication Publication Date Title
US8996544B2 (en) Pruning disk blocks of a clustered table in a relational database management system
US9430550B2 (en) Clustering a table in a relational database management system
US5832475A (en) Database system and method employing data cube operator for group-by operations
Antoshenkov Dynamic query optimization in Rdb/VMS
JPH05233721A (en) Method for optimizing processing of coupled operation
CN111552710B (en) A Query Optimization Method for Distributed Database
RU2004131666A (en) METHOD AND DEVICE FOR HANDLING A REQUEST FOR RELATIVE DATABASES
US20220005546A1 (en) Non-redundant gene set clustering method and system, and electronic device
US20180174681A1 (en) Leaping search algorithm for similar sub-sequences in character sequences and application thereof in searching in biological sequence database
CN112434085A (en) Roaring Bitmap-based user data statistical method
CN119691004B (en) Equivalent query method based on grouping information and database system
CN114791916A (en) Rapid comparison method of clinical test data
US7617189B2 (en) Parallel query processing techniques for minus and intersect operators
CN117216054A (en) Index tree creation method and terminal
CN117743386A (en) Optimization method, device and storage medium for distributed database query
US6421657B1 (en) Method and system for determining the lowest cost permutation for joining relational database tables
CN115455057A (en) Execution method of database connection operation, storage medium and computer device
CN111444180B (en) Double-layer structure index and query method thereof
Chen et al. An efficient indexing structure for multi-dimensional range query
CN116010448B (en) Reverse index optimization method and system for relational database
CN119557345B (en) Preference G-Skyline query method
CN110704579B (en) Full-text retrieval method and system based on branch definition
CN119066070A (en) A method for creating a database index
CN120196626A (en) Multimodal data storage management method, system, electronic device and storage medium
CN115640297A (en) Spark SQL-based LEFT JOIN connection calculation optimization method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载