+

CN102508924A - Method for realizing grace hash joint by using merge join - Google Patents

Method for realizing grace hash joint by using merge join Download PDF

Info

Publication number
CN102508924A
CN102508924A CN2011103751121A CN201110375112A CN102508924A CN 102508924 A CN102508924 A CN 102508924A CN 2011103751121 A CN2011103751121 A CN 2011103751121A CN 201110375112 A CN201110375112 A CN 201110375112A CN 102508924 A CN102508924 A CN 102508924A
Authority
CN
China
Prior art keywords
data
hash
output
brush
key assignments
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011103751121A
Other languages
Chinese (zh)
Inventor
汪龙重
宋鑫
陈福荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Dameng Database Co Ltd
Original Assignee
Shanghai Dameng Database Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Dameng Database Co Ltd filed Critical Shanghai Dameng Database Co Ltd
Priority to CN2011103751121A priority Critical patent/CN102508924A/en
Publication of CN102508924A publication Critical patent/CN102508924A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hash join algorithm applied to a database. According to the method adopted by the algorithm, grace hash connection is realized by using merge joint. According to the method, a slot number of a hash table is ingeniously utilized to be taken as a key value for sorting, so that partial sorting time required by the merge joint is saved; and the advantages of a hash idea and a merge sorting idea are combined. When the hash joint is over high in data volume and is not enough in memory for processing, the hash joint is converted into the merge joint for processing, and thus the problem that the hash algorithm is low in efficiency under the condition of high data volume can be effectively solved.

Description

A kind of method of using merger to connect the graceful Hash connection of realization
Technical field
The invention belongs to database technical field, what relate to is a kind of Hash join algorithm, and algorithm thought is to use merger to connect and realizes that graceful Hash connects.
Background technology
It is to handle in the application at database data that Hash connects, and the multilist data is carried out a kind of common method of connection processing.If the table that data volume is bigger connects, the row in the condition of contact are unordered or when not having index and being connected to equivalent the connection, it is optimal strategy that Hash connects.
The Hash join algorithm mainly contains simple hash connecting method (Simple hash join), graceful hash connecting method (GRACE hash join) and mixes Hash join algorithm (Hybrid hash join).
Simple Hash join algorithm on the left side data volume is big, when internal memory can not be with the whole buffer memory of left data, can only set up Hash to it in batches, and the data on the right of like this need repeatedly read, and number of times is the number of times of the Hash that need set up of left data.So cause the I/O reading times more, spend very big.Graceful Hash is connected when handling left data; Be to select a hash function (hash function); Be divided into the left side simultaneously in the different child partitions with right data, and the left side is consistent with the subregion number of right data, respectively the left side child partition corresponding with right data is connected then.Generally speaking, owing to avoided repeatedly scanning right data, the performance that graceful Hash connects is superior to simple Hash and connects.But when setting up the respective sub-partitions of the left side and right data, the time cost of cost is also relatively large.Mixing the Hash join algorithm is the combination of simple Hash connection and graceful Hash join algorithm; When deal with data, similar graceful Hash connects and is divided into the N sub-partition to data, just in being divided into the child partition process; Use for reference simple Hash and connected thought; When first child partition is set up Hash, all leave this part data in the internal memory in, and then each corresponding child partition carries out simple Hash connection processing to other; When most of data during, just reduced the I/O reading times of this part data like this at first child partition.
In sum, traditional algorithm still has big shortcoming.Simple Hash join algorithm is when handling big data quantity, and right data need scan repeatedly, and it is too high that I/O reads cost; Graceful Hash join algorithm is when setting up the child partition process, and time cost is bigger, if the data volume in certain subregion is still bigger; Also need proceed hashed, make this subregion can be divided into littler subregion, if carry out repeatedly after the Hash the data in this subregion; Subregion is still bigger; Then can only stop hashed, adopt additive method to handle, connect like nested loop; Combined the above two advantages though mix the Hash join algorithm, efficient still is connected quite with graceful Hash, does not improve a lot, and has only when data can the major part existence be put in the internal memory, and efficient just improves a lot.
Summary of the invention
The object of the present invention is to provide the hash connecting method that a kind of counting yield is high, be applicable to big data quantity.Though it is fundamental purpose of the present invention is to solve the inefficient problem of hash algorithm under the big data quantity situation, suitable too to the situation of small data.
Hash connecting method provided by the invention is to be converted into merger connection (Merge join) to the Hash connection to handle, rather than uses common graceful Hash join algorithm.Concrete steps are following:
(1) left side input data (left-handed watch data) are carried out the Hash evaluation according to its key assignments (KEY VALUE), be inserted into data in the Hash table according to the cryptographic hash that obtains, if any conflict, the identical data of cryptographic hash are then according to the key assignments insertion of sorting.When not holding all data in the internal memory is Hash table when full, need brush all data into disk.The method that brush is gone into be the cryptographic hash according to Hash table to be groove size order ground form the orderly scrubbing brush of new data to all data in each groove number with this groove number (nth_cell) goes into disk, brush all data of going into each time and all be used as a burst.Then the key assignments of raw data (key1, key2 ..., keyn) with groove number form new key assignments (key0 (being nth_cell), key1, key2 ..., keyn).When the next group data are come again, carry out the processing of same form, and brush is gone into another burst.And the like, until left data is disposed; Each fragment data all is orderly like this, sorts according to new all fragment datas of the key-value pair left side then.
In this step, have better effect in order to make invention, when carrying out temporary storaging data, raw data is transformed, every record has all increased by row, cryptographic hash (nth_cell) row, i.e. and groove number, and then brush is gone into disk.All data in the Hash table are brushed the process into disk; Be the whole Hash table of scanning, No. 0 groove data in the last groove successively all brush go into to disk, so just can guarantee to brush into the order of all data; This is because the data after the process hashed have been orderly in each groove; After increasing cryptographic hash, with this cryptographic hash as a new key assignments, new data based newly-generated key assignments (key0, key1 ..., keyn) be orderly; Compare with other sort methods, just saved sorting operation data.
(2) processing that data also are similar to above-mentioned left-handed watch data is shown on the right side.
(3) take out component group data minimum the fragment data separately respectively from both sides, left and right limit, carry out size comparison according to newly-generated key assignments.
In this step, has better effect in order to make invention, when reading of data from fragment data; Adopted the method for obtaining data in the process of merger data fragmentation; Rather than etc. the whole merger of pending data accomplish after just unified reading, saved twice read-write disk operating like this: once being to read fragment data to sort, once is to brush the data after the ordering once more into disk; Be equivalent to when carrying out the merger connection; Data have been partial orders, save time, and improve treatment effeciency.When left and right both sides data were compared, new key assignments had adopted original key assignments to add the array mode of groove number.Because of having groove number, make that the contrast between the data also has more high-level efficiency.
(4) if left data less than right data, then according to the Hash connection type, need to judge whether the output left data, if, then export left data, the right-hand component of output data is filled with null value.Otherwise, do not export.Continue to take out the integrated data of next group of the left side then.
If left data greater than right data, then according to the Hash connection type, need to judge whether the output right data, if, then export right data, the left-hand component of output data is filled with null value.Otherwise, do not export.Continue to take out the integrated data of next group of the right then.
If left data equals right data, then according to the Hash connection type, need to judge whether output data, if, the both sides data are carried out cartesian product calculate, and the data result after the calculating of output cartesian product.Take out next component group data on the left side and the right then simultaneously.
(5) execution in step (4) repeatedly, up to the left side or right data dispose;
(6) if left data disposes, then according to the Hash connection type, judge whether to need the output left data, if desired, the remaining left data of output then, and the right-hand component of putting output data is a null value;
If right data disposes, then according to the Hash connection type, judge whether to need the output right data, if desired, the remaining right data of output then, and the left-hand component of putting output data is null value;
(7) left side and right data all dispose, and then the Hash connection processing finishes.
Among the present invention, the processing that Hash is connected not is as common algorithm, to adopt graceful Hash join algorithm, but combines Hash and both thought of ordering, carries out merger to the data behind the Hash and connects.
Among the present invention, the first step is carried out data after the Hash, the position of every data in Hash table (be groove number), is stored in the individual data burst together with data.New key assignments has been formed with the key assignments of raw data in position in the Hash table at every data place, and brush is gone into disk in order then.The groove of Hash table number has changed the order of legacy data contrast as the key assignments key of first contrast.
Among the present invention, the 3rd step carry out key-value pair than the time, use newly-generated key assignments (comprise groove number).
Effect of the present invention is: after data are carried out Hash; Just confirmed the order of data; Thereby saved the section data sorting operation, when carrying out data the merger operation, all be from N fragment data, to obtain minimal data to handle then at every turn.Experiment shows that the hash connecting method that use merger connection realizes improves a lot than the speed of graceful hash connecting method, and along with the increase of data volume, advantage is more obvious.
Description of drawings
Fig. 1 the present invention uses merger to connect and realizes GRACE hash connecting method diagram.
Fig. 2 writes down the insertion Hash table.
Fig. 3 Hash table data are brushed into disk.
Embodiment
More common algorithm has simple hash connecting method (Simple hash join), graceful hash connecting method (GRACE hash join) and mixes Hash join algorithm (Hybrid hash join) in the Hash join algorithm.Although many researchers improve these algorithms, the final method of using still be unable to do without the main thought of these algorithms.The algorithm that the present invention proposes, the thought that is to use a kind of merger to connect the graceful Hash connection of realization connects Hash to be handled, and exists than big-difference with above-mentioned three kinds of methods.
Use merger of the present invention connects the method that realizes that graceful Hash connects, and embodiment is following:
(1) data fragmentation.The right and left data are carried out burst, and divide sheet mode the same.
Describe according to method, the data fragmentation key step relates to step (1), (2) and (3) of instructions.Be connected to the method that example is introduced data fragmentation in detail with two tables below.
Suppose that two tables are respectively T1 (C1 INT, C2 INT) and T2 (D1 INT, D2 INT); The condition that two tables connect is C1=D1; The Hash table size is N, and hash function is H (x).
If his-and-hers watches T1 carries out data fragmentation, then to describe according to method, its key assignments is (C1).According to according to the invention, the C1 train value of every record of T1 table is carried out the Hash evaluation, obtain the cryptographic hash of C1 train value; Be nth_cell=H (C1), be inserted into this record in the Hash table then, in the insertion process; Assurance is recorded in the Hash groove orderly, as shown in Figure 2.Data recording in the T1 table is carried out similar processing then, till Hash table is full always.
If Hash table is full, then brush all data in the Hash table into disk according to groove number (nth_cell value) order from small to large.Describe according to method, in the insertion process, increased a key assignments (KEY), and these records all be according to key assignments (KEY) value (nth_cell, C1) in order.After brushing disk to all data, formed a data burst.And then the data left of taking out table T1, so repeatedly, got up to data.All fragment datas have finally been formed.Like Fig. 1, shown in Figure 3.
T2 also handles according to similar step, has obtained all fragment datas, like Fig. 1, shown in Figure 3.
(2) two table merger connect.All fragment datas that sorted are carried out merger to be connected.
When carrying out merger and connect, the data fragmentation of two tables sort, through respectively the fragment data of two tables being carried out multiway merge, has obtained two and has shown final ordered datas.Carrying out two table merger then connects.
The merger of using in this method connects, owing to increased groove number (nth_cell) as key assignments (KEY VALUE), and when carrying out the merger connection; Increased a contrast key assignments more; And, reduced sorting operation, and in comparison process than common merger method of attachment through brushing the order of partition data; If the data of contrast are not in the groove of same Hash table number, the comparison that then directly just can skip same group of data.Carry out just having drawn the result who connects after merger connects.
When the most important thought of the present invention is that the left side and right data carried out fragment process, has increased the groove number of Hash table and, made each burst not need ordering, saved the most ordering time as key assignments.Because each burst is orderly,, and then connects according to the merger join algorithm and to get final product as long as sorted data are carried out merger.Therefore, in the merger process, both saved ordering time cost, saved the space cost of merge sort again every batch data.

Claims (1)

1. method of using merger to connect to realize graceful Hash to connect is characterized in that:
(1) at first data based its key assignments of left-handed watch is carried out the Hash evaluation, is inserted into data in the Hash table according to cryptographic hash, if any conflict, the data of identical cryptographic hash according to the insertion of sorting of key assignments size; If can not hold all data in the internal memory, promptly Hash table is full, need all brush disk to the data in the Hash table; The method that brush is gone into be the cryptographic hash according to Hash table be groove number brush from small to large into; Form new tuple to all data in each groove of Hash table with this groove number (nth_cell), in order scrubbing brush is gone into disk, and the brush data of going into are as a data burst each time; Then the key assignments of raw data: key1, key2 ..., keyn with groove number form new key assignments: nth_cell, key1, key2 ..., keyn; Successively the next group data are carried out the processing of same form, and be designated as next burst; Until the left-handed watch data processing is finished; At this moment, the data in each burst are all in good order, and then sort according to the data in new all bursts of the key-value pair left side;
(2) right side table data are carried out the processing same with the left-handed watch data;
(3) from the left side, all fragment datas of the right, take out minimum component group data, and carry out size relatively according to these two groups of data of newly-generated key-value pair;
(4) if left data less than right data, then according to the Hash connection type, need to judge whether the output left data; If, then export left data, the right-hand component of output data is filled with null value; Otherwise do not export; Continue to take out the integrated data of next group of the left side then; If left data greater than right data, according to the Hash connection type, need to judge whether the output right data, if desired, then export right data, the left-hand component of output data is filled with null value; Otherwise do not export; Continue to take out the integrated data of next group of the right then; If left data equals right data, then the both sides data are carried out cartesian product and calculate, and output result of calculation; Take out next group data on the left side and the right then simultaneously;
(5) execution in step (4) repeatedly, up to the left side or right data dispose;
(6) if left data disposes, then according to the Hash connection type, judge whether to need the output left data, if desired, the remaining left data of output then, and the right-hand component of putting output data is a null value;
If right data disposes, then according to the Hash connection type, judge whether to need the output right data, if desired, the remaining right data of output then, and the left-hand component of putting output data is null value;
(7) left side and right data all dispose, and then the Hash connection processing finishes.
CN2011103751121A 2011-11-22 2011-11-22 Method for realizing grace hash joint by using merge join Pending CN102508924A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103751121A CN102508924A (en) 2011-11-22 2011-11-22 Method for realizing grace hash joint by using merge join

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011103751121A CN102508924A (en) 2011-11-22 2011-11-22 Method for realizing grace hash joint by using merge join

Publications (1)

Publication Number Publication Date
CN102508924A true CN102508924A (en) 2012-06-20

Family

ID=46221010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103751121A Pending CN102508924A (en) 2011-11-22 2011-11-22 Method for realizing grace hash joint by using merge join

Country Status (1)

Country Link
CN (1) CN102508924A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345472A (en) * 2013-06-04 2013-10-09 北京航空航天大学 Redundancy removal file system based on limited binary tree bloom filter and construction method of redundancy removal file system
WO2015176315A1 (en) * 2014-05-23 2015-11-26 华为技术有限公司 Hash join method, device and database management system
JP2017532658A (en) * 2014-09-09 2017-11-02 日本電気株式会社 Method for efficient one-to-one coupling
WO2018006594A1 (en) * 2016-07-08 2018-01-11 华为技术有限公司 Method and apparatus for generating hash connection table
CN108549666A (en) * 2018-03-22 2018-09-18 上海达梦数据库有限公司 A kind of sort method of tables of data, device, equipment and storage medium
CN112765174A (en) * 2021-01-20 2021-05-07 上海达梦数据库有限公司 Detection method, device and equipment based on Hash connection and storage medium
CN113961728A (en) * 2021-09-14 2022-01-21 惠州市德赛西威智能交通技术研究院有限公司 Quick response method for vehicle-mounted storage equipment
CN116028506A (en) * 2023-02-23 2023-04-28 山东浪潮科学研究院有限公司 Hash connection method, device, equipment and medium

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345472B (en) * 2013-06-04 2016-08-10 北京航空航天大学 Deduplication File System Based on Finite Binary Tree Bloom Filter and Its Construction Method
CN103345472A (en) * 2013-06-04 2013-10-09 北京航空航天大学 Redundancy removal file system based on limited binary tree bloom filter and construction method of redundancy removal file system
WO2015176315A1 (en) * 2014-05-23 2015-11-26 华为技术有限公司 Hash join method, device and database management system
CN105359142A (en) * 2014-05-23 2016-02-24 华为技术有限公司 Hash join method, device and database management system
CN105359142B (en) * 2014-05-23 2019-04-05 华为技术有限公司 Hash connecting method and device
US10877973B2 (en) 2014-09-09 2020-12-29 Nec Corporation Method for efficient one-to-one join
JP2017532658A (en) * 2014-09-09 2017-11-02 日本電気株式会社 Method for efficient one-to-one coupling
WO2018006594A1 (en) * 2016-07-08 2018-01-11 华为技术有限公司 Method and apparatus for generating hash connection table
CN108549666A (en) * 2018-03-22 2018-09-18 上海达梦数据库有限公司 A kind of sort method of tables of data, device, equipment and storage medium
CN112765174A (en) * 2021-01-20 2021-05-07 上海达梦数据库有限公司 Detection method, device and equipment based on Hash connection and storage medium
CN112765174B (en) * 2021-01-20 2024-03-29 上海达梦数据库有限公司 Hash connection-based detection method, device, equipment and storage medium
CN113961728A (en) * 2021-09-14 2022-01-21 惠州市德赛西威智能交通技术研究院有限公司 Quick response method for vehicle-mounted storage equipment
CN116028506A (en) * 2023-02-23 2023-04-28 山东浪潮科学研究院有限公司 Hash connection method, device, equipment and medium
CN116028506B (en) * 2023-02-23 2025-08-22 山东浪潮科学研究院有限公司 Hash connection method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN102508924A (en) Method for realizing grace hash joint by using merge join
CN104361113B (en) A kind of OLAP query optimization method under internal memory flash memory mixing memory module
Ross et al. Fast computation of sparse datacubes
CN107688999A (en) A kind of parallel transaction based on block chain performs method
WO2013152543A1 (en) Multidimensional olap query processing method for column-oriented data warehouse
CN103309958A (en) OLAP star connection query optimizing method under CPU and GPU mixing framework
CN101840430B (en) Intelligent card database multi-list operation method and device
WO2013155751A1 (en) Concurrent-olap-oriented database query processing method
CN103440246A (en) Intermediate result data sequencing method and system for MapReduce
JP6418431B2 (en) Method for efficient one-to-one coupling
Patil et al. Categorical range maxima queries
Sun et al. Balance-aware distributed string similarity-based query processing system
CN105138607A (en) Hybrid granularity distributional memory grid index-based KNN query method
CN110083609B (en) Real-time query method for graph structure data in rail transit network passenger flow data
Gendrano et al. Parallel algorithms for computing temporal aggregates
CN110297858B (en) Optimization method and device for execution plan, computer equipment and storage medium
CN104572990A (en) Method and device for searching
CN116628136A (en) Collaborative query processing method, system and electronic equipment based on declarative reasoning
US20070073642A1 (en) Parallel query processing techniques for minus and intersect operators
Rao et al. Efficient Iceberg query evaluation using compressed bitmap index by deferring bitwise-XOR operations
CN111444180B (en) Double-layer structure index and query method thereof
CN117131069B (en) Database list grouping scanning method
Li et al. Better semijoins using tuple bit-vectors
CN105550284B (en) Method and device for mixed use of memory and temporary table space in Presto computing node
Chou et al. Computing complex iceberg cubes by multiway aggregation and bounding

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120620

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载