CN102609490B

CN102609490B - Column-storage-oriented B+ tree index method for DWMS (data warehouse management system)

Info

Publication number: CN102609490B
Application number: CN201210019935.5A
Authority: CN
Inventors: 夏小玲; 乐嘉锦; 王梅; 李晔锋
Original assignee: Donghua University
Current assignee: Donghua University
Priority date: 2012-01-20
Filing date: 2012-01-20
Publication date: 2014-07-02
Anticipated expiration: 2032-01-20
Also published as: CN102609490A

Abstract

The present invention relates to a B+ tree index method for column storage DWMS, characterized in that the steps are: step 1, column data generation; step 2, if the B+ tree key is row number, turn to step 4 to create, otherwise turn to Step 3 sorting; step 3, combining multi-way merge and heap sorting to sort column value data; step 4, B+ tree index initialization; step 5, creating leaf nodes; step 6, generating intermediate nodes from bottom to top . The present invention provides a B+ tree index applied to column storage DWMS, which has the following advantages: 1) the B tree level is guaranteed to be the shortest, reducing the number of searches; tree method. Using this method does not need to consider the split operation, which reduces a lot of overhead.

Description

A kind of B+ tree indexing means towards row storage DWMS

Technical field

The present invention relates to a kind of tree of the B+ towards row storage DWMS (Data Warehouse Management System) index technology.

Background technology

The high development of internet, applications, high-throughput and large buffer memory become the nowadays essential feature of database product, day by day urgent for issued transaction and the high performance requirement of query analysis.Traditional row stored data base can not be for business decision provides efficient query analysis as processing transactions application, the database schema of row storage is in recent years by re-examine, for the inquiry type work of reading to optimize in data warehouse or in analysis support application, row storage list reveals than row to be stored and has performance more significantly.Because relation table is in external performance, be still logical line, to be therefore connected with multilist be to be listed as the key factor that affects query performance in storing in tuple reconstruct.Index is one of important technology improving search efficiency.B+ tree index can keep data to store in order, and have advantages of allow to search, sequential access, insertion and deletion, make it in transaction environment, become the most popular index structure of Database Systems.

In traditional Database Systems, the variation of the B+ tree construction that data insertion and modification bring is frequently very large.In order to reduce the possibility that causes node split when data are inserted and revised, the node of B+ tree is not what fill up.But in data warehouse, almost do not have data to insert and retouching operation.Traditional B+tree is applied in the data warehouse of mass data storage and will causes the waste in space.Meanwhile, because node is not filled completely, data storage needs more node, and this will increase the height of B+ tree index, reduces the search efficiency of data.

Summary of the invention

The object of this invention is to provide a kind of B+ tree index that is applied to row storage DWMS, overcome the limitation of traditional B+tree index, improve the search efficiency of data.

In order to achieve the above object, technical scheme of the present invention has been to provide a kind of B+ tree indexing means towards row storage DWMS, it is characterized in that, step is:

Step 1, column data generate: import user data, the original data by row storage are divided vertically into single-row, for each item number of each row is according to according to the line number at its place, add the line number item of rebuilding for tuple, form two tuple (line numbers, train value), request for data section, is kept at the every train value data that newly produce in a data segment;

If step 2 B+ tree key word is line number data, turn to step 4 to create; If B+ tree key word is train value data, goes to step 3 and first train value is sorted;

Step 3, multiway merge and heapsort are used in combination train value data are sorted;

Step 4, the initialization of B+ tree index;

Step 5, establishment B+ leaf node: the leafy node of application B+ tree, data item is filled directly into leafy node and obtains data block, form the 0th layer, B+ tree;

The feature of step 5 is:

According to row storage characteristics, direct save data information in leaf piece, but not the pointer of key assignments and sensing data block.In search procedure, can directly obtain data by leafy node like this, reduce one time I/O; When node is filled, no longer consider sparse coefficient, the each node in tree fills up, and has adopted the principle of " key for searching number is consistent with pointer number ", room for promotion utilization factor.

Step 6, generation intermediate node: iteration is set up the middle layer node of B+ tree until whole B+ tree creates end from the bottom to top.

Preferably, described step 3 comprises:

Step 3.1, initialization: in internal memory, apply for a sequence district, its size is designated as K, and making K is the integral multiple of block size.If pending data is counted Blk_num > K according to piece, the method for multiway merge external sort will be adopted.The size of M while calculating multiway merge, sub-list number when M is merger, order

in each sub-list for the treatment of merger, piece is counted D=Blk_num/M.

Source data in step 3.2, the section of reading, the train value Data Division in data block two tuples out, packs dataitem array into, puts into sequence district.Use Heap algorithm using dataitem array as input parameter, dataitem array is sorted; If Blk_num is <=K, sequence finishes; Otherwise, sorted data item is reassembled into piece, write back in interim section.

Step 3.3, the interim section of M is carried out to merge sort.

Preferably, described step 6 comprises:

Step 6.1, generating indexes item, the tlv triple that first train value, the line number of correspondence and the piece of this piece number that index entry is served as reasons in lower one deck piece forms;

Step 6.2, judge that train value type is fixed-length data or elongated data, if fixed length train value, calculate the length of index entry, and then can obtain the index entry number=index block space size/index entry length in index block, the index entry number in the data block number/each node in index block number=0th in ground floor layer.Application index block space, is inserted into index entry in index block in batches; If elongated train value, index block of first to file, then inserts tlv triple successively, until just apply for new index block while failing to lay down;

After step 6.3, one deck have created, turn to step 5.1, create last layer intermediate node according to same process, finish until whole B+ tree creates.

The invention provides a kind of B+ tree index tool that is applied to row storage DWMS has the following advantages:

1) provide improved B+ tree index node, the direct memory row data of leafy node, intermediate node is no longer considered sparse coefficient, all fills up, and has increased space availability ratio.For the data of same number, this structure guarantees that B tree hierachy is the shortest, has reduced and has searched number of times;

2) traditional insertion has been abandoned in the foundation of B+ tree, uses the method for bottom-up structure B+ tree.Make not need to consider in this way splitting operation, reduced a large amount of expenses;

3) line number in row storage system and train value data are set up respectively to B+ tree index, keep data to store in order, be convenient to tuple reconstruct and be connected with multilist.

Accompanying drawing explanation

Fig. 1 is the structure that B+ sets inner node (index entry);

Fig. 2 is the structure of B+ leaf node (data page).

Embodiment

For the present invention is become apparent, be hereby described in detail as follows with a preferred embodiment.

The invention provides a kind of B+ tree indexing means towards row storage DWMS, the steps include:

Step 3, multiway merge and heapsort are used in combination train value data are carried out to key assignments sequence, this step comprises:

Step 3.1, initialization: in internal memory, apply for a sequence district, its size is designated as K, and making K is the integral multiple of block size.If pending data is counted Blk_num > K according to piece, the method for multiway merge external sort will be adopted.The size of M while calculating multiway merge, sub-list number when M is merger, order in each sub-list for the treatment of merger, piece is counted D=Blk_num/M.

Step 3.3, the interim section of M is carried out to merge sort.

Step 4, B+ tree initialization, specifically comprise the descriptor of B+ tree carried out to assignment, needs theing contents are as follows of assignment:

(1) type of B+ tree index key assignments, comprises fixed/elongated, value type whether.The key assignments of fixed length comprises: SMALLINT, INT, NUMBER, CHAR, DATE, TIME; Elongated key assignments has the VARCHAR of comprising.

(2) information of row, comprises row name, row type, row length, if row are elongated, specifies maximum length.

(3) be B+ tree root piece allocation space.

(4) the level value that B+ tree is set is 0.

Step 5, establishment leafy node: the leafy node of application B+ tree, data item is filled directly into leafy node and obtains data block, form the 0th layer, B+ tree; This step is according to row storage characteristics, direct save data information in leaf piece, but not the pointer of key assignments and sensing data block.In search procedure, can directly obtain data by leafy node like this, reduce one time I/O.In addition, when node is filled, no longer consider sparse coefficient, the each node in tree fills up, room for promotion utilization factor, as shown in Figures 1 and 2.KEY in figure _iby < line number, train value > two parts composition.

Step 6, generation intermediate node: iteration is set up the middle layer node of B+ tree until whole B+ tree creates end from the bottom to top, the steps include:

Step 6.1, generating indexes item, the tlv triple that index entry is made up of first train value in lower one deck piece, the line number of correspondence and the piece number of this piece, the length of computation index item, is designated as TL;

Step 6.2, application index block space: judge that according to initialization information in step 4 train value is fixed length or elongated, if fixed length train value calculates the length of index entry.Index entry number=each index block space size/index entry length T L in index block.Calculate the index entry number in the data block number/each index block in index block number=0th in ground floor layer.Apply for the index block space of this layer.Index entry is inserted in index block in batches; If elongated train value, index block of first to file, then inserts tlv triple successively, until just apply for new index block while failing to lay down.

After step 6.3, one deck have created, level=level+1; Turn to step 6.1, create last layer intermediate node according to same process, difference is the index entry number in index block number/each index block of the index block number=lower one deck that calculates current layer.In the time that the index block number in this layer is 1, iteration finishes, and this index block is the root node of B+ tree.

Claims

1. A B+ tree index method for column storage DWMS, characterized in that the steps are:

Step 1. Column data generation: import user data, divide the original data stored by row vertically into single columns, add row number items for tuple reconstruction to each item of data in each column according to the row number where it is located, and form Two-tuple (row number, column value), apply for a data segment, and save the newly generated data of each column value in a data segment;

Step 2. If the B+ tree key is row number data, go to step 4 to create it; if the B+ tree key is column value data, go to step 3 to sort the column values first;

Step 3. Combining multi-way merge and heap sorting to sort column value data, said step 3 includes:

Step 3.1. Initialization: Apply for a sorting area in the memory, and its size is recorded as K. Let K be an integer multiple of the size of the data blocks to be sorted. If the number of data blocks to be sorted Blk_num>K, multi-way merge and external sorting will be used method, the number of sublists when merging is M, let

The number of blocks in each sublist to be merged is D=Blk_num/M, read the source data in the data segment, split the column value data in the data block binary group, load it into the array, put it into the sorting area, and use the heap sorting algorithm Use the array as an input parameter to sort the array; if Blk_num<=K, the sorting ends; otherwise, reassemble the sorted data items into data blocks and write them back to the temporary segment;

Step 3.2, merging and sorting the N temporary segments;

Step 4, B+ tree index initialization;

Step 5. Create B+ tree leaf nodes: Apply for the leaf nodes of the B+ tree, fill the data items directly into the leaf nodes to obtain data blocks, and form the 0th layer of the B+ tree;

Step 6. Generate intermediate nodes: iteratively build the intermediate layer nodes of the B+ tree from bottom to top until the entire B+ tree is created. The steps are:

Step 6.1, generate an index item, the index item is a triplet composed of the first column value in the index block of the next layer, the corresponding row number and the block number of the index block;

Step 6.2. Determine whether the column value type is fixed-length data or variable-length data. If it is a fixed-length column value, calculate the length of the index item, and then obtain the number of index items in the index block = index block space size / index item length, The number of index blocks in the first layer = the number of data blocks in the 0th layer/the number of index items in each node, apply for index block space, and insert index items into index blocks in batches; if it is a variable-length column Value, apply for an index block first, then insert triples in sequence, and apply for a new index block until it cannot fit;

Step 6.3 After the creation of one layer is completed, turn to step 6.1 and follow the same process to create the upper layer of intermediate nodes until the creation of the entire B+ tree is completed.