
CN104809210B - A weighted top-k query method for massive data under a distributed computing framework - Google Patents

Info

Publication number
CN104809210B
CN104809210B (application CN201510209691.0A)
Authority
CN
China
Prior art keywords
data
area
attribute
query
areas
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510209691.0A
Other languages
Chinese (zh)
Other versions
CN104809210A (en)
Inventor
何洁月
罗浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201510209691.0A priority Critical patent/CN104809210B/en
Publication of CN104809210A publication Critical patent/CN104809210A/en
Application granted granted Critical
Publication of CN104809210B publication Critical patent/CN104809210B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 - Querying
    • G06F 16/245 - Query processing
    • G06F 16/2457 - Query processing with adaptation to user needs
    • G06F 16/24578 - Query processing with adaptation to user needs using ranking


Abstract

The invention discloses a top-k query optimization method for massive data under the Spark distributed computing framework. The massive data set is partitioned in advance using a grid-like data partitioning method: the original data set is divided into sub-data sets, and then, according to the weights the user assigns to the attributes of the data objects and the query parameter k, a small number of suitable sub-data sets are selected to be queried in place of the whole data set. Experimental results show that the proposed method is fast and scales well. Compared with the traditional top-k query method and with the angle- and distance-based data partitioning methods, the query speed is improved, and the information the user asks for can be returned within a short time.

Description

A weighted top-k query method for massive data under a distributed computing framework

Technical Field

The invention relates to a data query method, and in particular to a weighted top-k query method over massive data sets.

Background

The top-k query, also known as the rank-aware query, is one of the most basic operations in a database and an important tool for data analysis. In business analysis in particular, it is often only necessary to focus on the most useful data rather than on the entire data set.

The top-k query is defined as follows. Let D = {T_1, T_2, …, T_n} denote the set of all data objects, where T_i is the i-th data object; each data object has d dimensions and is a point in the data space. For a top-k query Q(f, k), f is the scoring function and k is the number of results that satisfy the query to be returned. Here f is a weighted-sum function: for a data object T(t_1, t_2, …, t_d), the user assigns a weight W(w_1, w_2, …, w_d) to its attributes, and the score of each data object is the weighted sum of its attribute values, i.e. the scoring function is f_W(T) = w_1·t_1 + w_2·t_2 + … + w_d·t_d.
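For illustration only, a minimal Python sketch of this weighted-sum scoring function and of the naive baseline that scores and sorts the whole data set (function and variable names are hypothetical and not taken from the patent):

```python
def score(obj, weights):
    """Weighted-sum scoring function f_W(T) = sum_i w_i * t_i."""
    return sum(w * t for w, t in zip(weights, obj))

def naive_top_k(dataset, weights, k):
    """Baseline: score every object and keep the k highest-scoring ones."""
    return sorted(dataset, key=lambda obj: score(obj, weights), reverse=True)[:k]

# Example: 3 objects with d = 2 attributes, equal weights, k = 2.
data = [(0.9, 0.2), (0.5, 0.6), (0.1, 0.8)]
print(naive_top_k(data, weights=(0.5, 0.5), k=2))
```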

In the end a top-k query only has to produce a result set of k elements, which can be obtained by sorting an amount of data far smaller than the input data set, without processing the data globally. In recent years, with the explosive growth of data volumes, massive data scales have brought great challenges to data storage, management and analysis. As a basic operation in data analysis, the top-k query needs to return its results quickly. For example, among Taobao's huge catalogue of products, a user assigns different weights to product attributes according to his or her own preferences, and the system then quickly returns the top k products that satisfy the user's needs.

However, top-k queries over massive data face two major challenges: first, when the data scale reaches the TB or PB level, traditional centralized data processing methods are no longer applicable; second, how to obtain query results quickly and accurately over a massive data set.

Top-k queries in traditional centralized data systems hit performance bottlenecks on massive data, so they are not suitable for processing massive data sets. In the traditional distributed environment, some studies improve query efficiency by caching query results, but this approach does not fundamentally solve the top-k query problem for massive data; others use the Skyline query method for data processing and propose DiTo, a complete top-k processing framework, but still only in a traditional distributed environment.

In recent years the most basic solution to the top-k problem in a cloud environment has been to sort all the data and return the top k results, but this approach has to process the original data set for every query, which causes redundant work and long query times, so it is not advisable. RanKloud computes, under the MapReduce framework, a threshold for early query termination from runtime statistics, but this method cannot guarantee k exact results. Other work uses a caching mechanism that compares a new query with the queries already in the cache; if the similarity is high, the query is not re-executed. Although this speeds up querying, the results are not exact. Top-k queries based on angle and distance data partitioning have also been proposed, but the angle-based partitioning scheme requires complex and time-consuming coordinate transformations, so it too is unsuitable for massive data sets.

Summary of the Invention

Purpose of the invention: in order to overcome the deficiencies of the prior art, the present invention provides a weighted top-k query method for massive data based on a distributed computing framework, which solves the technical problem that existing top-k queries cannot obtain query results quickly, accurately and simply when processing massive data.

Technical scheme: to achieve the above object, the technical scheme adopted by the present invention is as follows.

First, the following four reasonable assumptions are made:

(1) Every attribute value of every data object is non-negative; even negative values can be turned into non-negative values by normalizing the data.

(2) The data set is relatively fixed, or the rate at which the data is updated is negligible within a certain period of time relative to the whole data set. For example, although JD.com's product data is updated constantly, given the huge product base it can be considered to change little within a given period. The method of the present invention is therefore not applicable to stream data processing.

(3) The data is uniformly distributed in the space; for massive data sets this assumption holds in many scenarios.

(4) An input weight vector W satisfies w_1 + w_2 + … + w_d = 1; even if it does not, this can be achieved by normalization.

On the basis of the above four reasonable assumptions, a weighted top-k query method for massive data under a distributed computing framework is proposed, comprising the following steps executed in sequence:

Step 1: Build the data space

First, the attribute values of all data objects with d attributes are converted into non-negative values and normalized; a d-dimensional coordinate system is established whose axes correspond one-to-one to the attributes of the data objects, and all data objects are placed in this coordinate system to form the data space.
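A minimal Python sketch of this preprocessing step; min-max normalization to [0, 1] per attribute is assumed here as one common choice, since the patent only requires non-negative, normalized values:

```python
def normalize(dataset):
    """Min-max normalize each attribute of a list of d-dimensional tuples to [0, 1]."""
    d = len(dataset[0])
    lo = [min(obj[j] for obj in dataset) for j in range(d)]
    hi = [max(obj[j] for obj in dataset) for j in range(d)]
    return [
        tuple((obj[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
              for j in range(d))
        for obj in dataset
    ]
```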

Step 2: Data partitioning

Starting from the origin of the coordinate system, the whole coordinate system is divided from the inside out into m regions. The value of m must not be too large, otherwise the amount of computation grows unfavourably; for current data scales m is generally taken as 3 to 5, a range that reasonably balances data scale against computation. Of course, as data scales grow further, m can be increased appropriately so as to reduce the total amount of data in each partitioned region. The regions are numbered 1, 2, …, m from the outside in, and the boundary of region 1 together with the coordinate axes encloses all data objects. For any region, the maximum value of every attribute of that region is the same, and the coordinates of the outer boundary of each region satisfy the condition that at least one axis has a coordinate equal to that region's attribute maximum. Setting the attribute maximum of region 1 to a_1, the attribute maximum of the i-th region is a_i = a_1·((m - i + 1)/m)^(1/d), i = 1, 2, …, m, so that the m regions have equal volume. After the regions have been divided in this way, it follows from assumption (3) above that the amount of data in each region is equal.
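A small Python sketch of this partitioning rule, assuming the equal-volume boundaries a_i reconstructed above (helper names are illustrative):

```python
def region_boundaries(a1, m, d):
    """Attribute maxima a_1 > a_2 > ... > a_m of the m regions (outside to inside),
    chosen so that the m nested shells have equal volume."""
    return [a1 * ((m - i + 1) / m) ** (1.0 / d) for i in range(1, m + 1)]

def region_of(obj, bounds):
    """Region number (1 = outermost) of a data object: the shell whose outer
    boundary is the smallest a_i still covering max(obj)."""
    top = max(obj)
    for i in range(1, len(bounds)):
        if top > bounds[i]:          # bounds[i] holds a_{i+1}
            return i
    return len(bounds)

# Example: m = 3 regions over normalized 2-dimensional data.
b = region_boundaries(a1=1.0, m=3, d=2)   # approx. [1.0, 0.816, 0.577]
print(region_of((0.9, 0.1), b))           # -> 1 (outermost)
print(region_of((0.2, 0.3), b))           # -> 3 (innermost)
```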

Except for the outermost region, for each of the remaining regions, the point that belongs to that region and whose attribute on every axis equals the maximum value is taken as a base point, and the region of the coordinate system in which every attribute value is greater than or equal to the corresponding attribute value at that base point is marked off. These newly marked regions are numbered 1, 2, …, M from the outside in, where M = m - 1, and are called the judgment regions.

According to the Skyline principle, for any two points 1 and 2, if all attribute values of point 1 are smaller than those of point 2, then point 2 dominates point 1. Based on this, given two data objects T_1 and T_2, if for every i (1 ≤ i ≤ d) the attribute value of T_1 is greater than or equal to the corresponding attribute value of T_2, i.e. T_1.t_i ≥ T_2.t_i, where t_i denotes the value of the i-th attribute, then for any given input weight W(w_1, w_2, …, w_d) the score of T_1 is at least the score of T_2, i.e. f_W(T_1) ≥ f_W(T_2).

Based on the above analysis, the scores of the data objects in a judgment region are necessarily all greater than the scores of the data objects in every region lying inside the region to which that judgment region belongs. Since the result returned by the algorithm consists of the k highest-scoring objects, where k is the number of data objects in the returned result set, once the number of data objects in some judgment region is at least k, none of those k objects can come from the regions inside the region to which that judgment region belongs. Therefore, based on the above analysis, the judgment regions are examined as follows:

In ascending order of their numbers, check in turn whether N_i ≥ k holds, where N_i is the number of data objects in judgment region i and k is the number of data objects in the result set returned by the algorithm; as soon as judgment region i satisfies this inequality, stop checking and take the i regions, counted from the outside in, as the search regions.
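A minimal Python sketch of this judgment step; the per-region counts N_i are assumed to have been collected during preprocessing, and the names are illustrative:

```python
def choose_search_regions(judgment_counts, k):
    """judgment_counts[i-1] = N_i, the number of objects in judgment region i
    (numbered from the outside in). Returns how many outer regions to search."""
    for i, n_i in enumerate(judgment_counts, start=1):
        if n_i >= k:
            return i                       # search regions 1..i (outside to inside)
    return len(judgment_counts) + 1        # fall back to searching all m regions

# Example: with k = 100 and N_1 = 2500, only region 1 needs to be searched.
print(choose_search_regions([2500, 80000], k=100))  # -> 1
```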

Further, in the present invention, the innermost of the search regions, numbered i, is subdivided as follows:

The region is divided into d + 1 blocks, of which d blocks are candidate search regions; the remaining block, outside the candidate search regions, is the mandatory search region.

The candidate search regions are numbered n = 1, 2, …, d. Any data point T_n(t_{n1}, t_{n2}, …, t_{nd}) in the n-th candidate search region, where t_{nj} denotes the attribute value of T_n on axis j, satisfies the following two inequalities:

0 ≤ t_{nj} ≤ 2a_{i+1} - a_i, where 1 ≤ j ≤ d and j ≠ n    (1)

a_i - a_{i+1} ≤ t_{nj} ≤ a_i, where j = n    (2)

In the n-th candidate search region, the data point T_n(t_{n1}, t_{n2}, …, t_{nd}) whose attribute value on one axis equals a_i and whose attribute values on all other axes equal 2a_{i+1} - a_i is taken as the maximum boundary point of the n-th candidate search region.

For the maximum boundary point of each candidate search region, check whether the attribute weight w_j assigned by the user to the attribute whose value at that point equals a_i satisfies the following condition: w_j > 1/2.

If there is a maximum boundary point whose attribute weight w_j satisfies the above condition, the search area is narrowed to the i - 1 regions counted from the outside in, the mandatory search region of the i-th region, and the candidate search region to which that maximum boundary point belongs;

If no maximum boundary point has an attribute weight w_j satisfying the above condition, the search area is narrowed to the i - 1 regions counted from the outside in plus the mandatory search region of the i-th region.
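A small Python sketch of this subdivision decision, using the w_j > 1/2 criterion derived in the following paragraphs (the threshold is a reconstruction, since the formula images are not reproduced in this text):

```python
def candidate_block_to_search(weights):
    """Return the index j (1-based) of the single candidate block that must be
    searched in addition to the mandatory block, or None if none qualifies.
    At most one weight can exceed 1/2 because the weights sum to 1."""
    for j, w in enumerate(weights, start=1):
        if w > 0.5:
            return j
    return None

# Example with d = 4: a strongly skewed weight vector keeps one candidate block.
print(candidate_block_to_search((0.06, 0.06, 0.07, 0.80)))  # -> 4
print(candidate_block_to_search((0.25, 0.25, 0.25, 0.25)))  # -> None
```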

On the premise that the judgment condition N_i ≥ k holds, the subdivision method splits the innermost search region into a mandatory search region and candidate search regions, and then, according to the judgment rule, selects the appropriate part of the candidate search regions for retrieval, further narrowing the search range. By the argument above, the score at the maximum boundary point of a candidate search region is necessarily greater than or equal to the score of any other data point in that candidate search region; therefore, if the score at the maximum boundary point of a candidate search region is smaller than the score at the base point of judgment region N_i, that candidate search region need not be searched; otherwise it must be searched.

For convenience of illustration, pick one candidate search region, whose maximum boundary point has coordinates T(a_i, 2a_{i+1} - a_i, 2a_{i+1} - a_i, …, 2a_{i+1} - a_i); the base point of the corresponding judgment region N_i has coordinates T(a_{i+1}, a_{i+1}, …, a_{i+1}). Substituting these coordinates into the scoring function, the candidate search region must be searched if (a_i, 2a_{i+1} - a_i, …, 2a_{i+1} - a_i)·W > (a_{i+1}, a_{i+1}, …, a_{i+1})·W, where W = (w_1, w_2, …, w_d). Using w_1 + w_2 + … + w_d = 1, this inequality can be rearranged into 2w_1(a_i - a_{i+1}) > a_i - a_{i+1}, which gives w_1 > 1/2; in that case the candidate search region corresponding to this maximum boundary point must be searched. Judging all candidate search regions in the same way yields the unified expression w_j > 1/2. It further follows that if a candidate search region satisfies the condition, then the attribute weight w_j that makes the inequality hold must be the weight of the attribute whose value at that region's maximum boundary point is a_i; hence, when traversing, it suffices to substitute, for each candidate search region, the weight of the attribute whose value at the maximum boundary point equals a_i into w_j > 1/2, and to search that region if the inequality holds. Moreover, once one attribute weight is found that satisfies the inequality, the remaining candidate search regions need not be checked: because the attribute weights are normalized, it is impossible for two or more attribute weights to satisfy the inequality at the same time. Therefore, when selecting the candidate search regions, one can quickly check, from the attribute weights entered by the user, whether the largest attribute weight satisfies w_j > 1/2; if it does, it suffices to pick the candidate search region whose maximum boundary point has its j-th attribute equal to a_i.
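The algebra behind the unified condition can be written out as follows; this is a reconstruction of the omitted formulas under assumption (4), i.e. normalized weights:

```latex
\begin{aligned}
a_i w_j + (2a_{i+1}-a_i)\sum_{l\neq j} w_l &> a_{i+1}\sum_{l=1}^{d} w_l
  && \text{(max boundary point beats base point of } N_i\text{)}\\
a_i w_j + (2a_{i+1}-a_i)(1-w_j) &> a_{i+1}
  && \text{(using } \textstyle\sum_l w_l = 1\text{)}\\
2(a_i - a_{i+1})\, w_j &> a_i - a_{i+1} \\
w_j &> \tfrac{1}{2}
  && \text{(since } a_i > a_{i+1}\text{)}
\end{aligned}
```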

Beneficial Effects

The weighted top-k query method for massive data under a distributed computing framework provided by the present invention proposes a new, grid-like data partitioning scheme; by comparing the amount of data in a judgment region with the result-set size k, the search regions are determined preliminarily, which greatly narrows the search range; the innermost search region is then narrowed further, making the final search area smaller still and improving search efficiency and speed.

According to statistics, when a data space holding 1 billion records is divided into m = 3 regions, the amount of data in the outermost judgment region varies with the dimensionality d (the number of attributes of each data object) as shown in the following table:

Table 1

As the table shows, when the dimensionality d is less than 8 the outermost judgment region still contains 18 data objects. In practical applications the required size of the result set is usually small, e.g. 10 results are returned, so at most the data in the outermost region needs to be queried. The search range is therefore reduced by at least 2/3 and a large amount of irrelevant data is filtered out. The method of the present invention thus greatly improves top-k query performance on massive data and increases the top-k query speed on massive, higher-dimensional data sets.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the data partitioning method of the present invention;

Fig. 2 is a schematic diagram of the data subdivision method of the present invention;

Fig. 3 compares the query times of three different data partitioning schemes as the data dimensionality varies, where DistImprove is the method of the present invention, AngleDistTop_k is the angle- and distance-based data partitioning method, and BasicTop_k is the query time on the original, unpartitioned data set;

Fig. 4 compares query times for different user input weights at dimensionality 4;

Fig. 5 shows the speedup of the method of the present invention with different numbers of cluster nodes;

Fig. 6 is a diagram of the top-k query process in the specific embodiment of the present invention.

Detailed Description

The present invention is further described below in conjunction with the accompanying drawings.

The experiments were carried out on a 7-node Spark cluster; Spark was deployed on Hadoop, using Hadoop's YARN resource manager and the HDFS file storage system. Among the 7 nodes, the master node serves both as the Driver node and as a worker node, and the remaining 6 nodes are worker nodes. The basic configuration of the experimental environment is given in Table 2:

Table 2

A uniform data set was used in which each record has 8 attributes and each attribute is an integer in the range [0, 1000]; 1 billion records were generated in total, about 40 GB of data. Data sets with 4 and 6 dimensions were also generated; they are similar to the 8-dimensional one, randomly generated and containing 1 billion records each.
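For reference, a tiny Python sketch of how such a uniform synthetic data set could be produced; it is scaled down here, and the 1-billion-record generation used in the experiments would in practice be done in parallel:

```python
import csv
import random

def generate_uniform_dataset(path, n_records, d=8, lo=0, hi=1000, seed=42):
    """Write n_records rows of d uniformly distributed integer attributes in [lo, hi]."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(n_records):
            writer.writerow([rng.randint(lo, hi) for _ in range(d)])

generate_uniform_dataset("uniform_8d_sample.csv", n_records=100_000)
```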

Unless otherwise stated, the experiments below were run with equal weights and k = 100, and every reported query time is the average of 10 runs. Because the data preprocessing only has to be done once and no individual query needs to repeat it, the preprocessing time is not included in the query-time comparisons below. For the 8-dimensional data the preprocessing time of the method of the present invention is about 42 minutes.

As shown in Fig. 6, this embodiment is divided into two major steps:

Step 1: data preprocessing. The original data set is partitioned, according to the data partitioning method proposed here, into different blocks; each block is labelled and then stored on HDFS. In Spark, HDFS storage is mainly composed of the disks of the worker nodes. Queries are then answered according to the data partitioning scheme and the query-region judgment scheme set out in the claims. The partitioning is as follows:
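A minimal PySpark sketch of this coarse preprocessing step (paths, constants and helper names are illustrative, not the patent's reference implementation; the attribute range [0, 1000] from the experimental data set is assumed):

```python
from pyspark import SparkContext

sc = SparkContext(appName="topk-preprocess")

M, D, A1 = 3, 8, 1000.0
# Equal-volume region maxima a_1 > a_2 > a_3 (reconstructed from the description).
BOUNDS = [A1 * ((M - i + 1) / M) ** (1.0 / D) for i in range(1, M + 1)]

def region_of(obj):
    """Region number (1 = outermost shell) of one d-dimensional record."""
    top = max(obj)
    for i in range(1, M):
        if top > BOUNDS[i]:          # BOUNDS[i] holds a_{i+1}
            return i
    return M

records = (sc.textFile("hdfs:///topk/raw/uniform_8d.csv")
             .map(lambda line: tuple(float(v) for v in line.split(","))))

# Tag every object with its region label and write one labelled block per region.
tagged = records.map(lambda obj: (region_of(obj), obj))
for i in range(1, M + 1):
    (tagged.filter(lambda kv, i=i: kv[0] == i)
           .map(lambda kv: ",".join(str(v) for v in kv[1]))
           .saveAsTextFile(f"hdfs:///topk/blocks/region_{i}"))
```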

First step: overall partitioning from the inside out

According to the equal-area principle, the whole data space is divided from the inside out into m = 3 equal parts, i.e. 3 large partitions. As shown in Fig. 1, the two-dimensional case illustrates the partitioning intuitively: the horizontal (x) axis corresponds to attribute 1 and the vertical (y) axis to attribute 2, the solid lines in the figure divide the space into 3 equal parts, and the regions are numbered 1, 2, 3 from the outside in. Except for the outermost region, for each of the remaining regions the point belonging to that region whose attribute on every axis is the maximum is taken as a base point, and the region in which all attribute values are greater than or equal to the corresponding attribute values at the base point is marked off; these regions are numbered A, B from the outside in. It is then checked in turn whether n_A ≥ k or n_B ≥ k holds: if the data count n_A of square A satisfies n_A ≥ k, only the data set in region 1 needs to be queried; otherwise it is checked whether the data count n_B of square B satisfies n_B ≥ k. For massive-data queries the data count of square A is generally far larger than k, so the top-k result can be obtained by querying only the data in region 1. To reduce the amount of queried data further, each large partition is subdivided.

Second step: subdivision of each large partition

Each of partitions 1 and 2 is subdivided further. For example, large partition 1, as shown in Fig. 2, can be divided into three parts (ABC), D and E, where the areas of A, B and C are equal; A, B and C form the mandatory search region, while D and E are candidate search regions.

The following fact also holds: since the score of a data object T_1 is f_W(T_1) = w_1·t_{11} + w_2·t_{12} + … + w_d·t_{1d}, f_W(T_1) is proportional to the length of the projection of the spatial point T_1 onto the line through the origin with direction vector W = (w_1, w_2, …, w_d), so the projection length can be used to measure the scoring function.

Therefore, assuming that the number of data objects in square A is at least k, whether D and E need to be searched is determined simply by the following verification:

If the weights entered by the user satisfy w_1 + w_2 = 1 with w_1 = w_2 = 0.5, compare the distance from the origin to the projection, onto the line through the origin with direction W, of the maximum boundary point d of region D with the distance from the origin to the projection of the base point a of square A onto the same line; the two are found to be equal. Likewise, the distance from the origin to the projection of the maximum boundary point e of region E onto this line also equals the distance from the origin to the projection of the base point a of square A. Therefore regions D and E need not be queried: the top-k result can be obtained by querying only regions A, B and C, and by the same reasoning the top-k results found in regions A, B and C will necessarily lie in the part above and to the right of the dashed line in Fig. 2;

By similar reasoning, if w_1 > w_2 then region D need not be queried, and if w_1 < w_2 then region E need not be queried; the individual cases are not argued one by one here;

In summary, the following rule is obtained: a candidate search region needs to be queried only when the weight of the attribute in which it is extreme exceeds 1/2, i.e. exceeds the sum of the remaining weights.

Generalizing to a d-dimensional data space, let S_i, 1 ≤ i ≤ 3, denote one of the 3 large partitions divided from the inside out; let S_ij, 1 ≤ j ≤ d, denote the j-th sub-region of large partition S_i analogous to D or E; and let S_i(d+1) denote the sub-region of S_i analogous to A, B, C. When the weight of the j-th attribute of the data objects satisfies w_j > 1/2, only two sub-regions of large partition S_i need to be queried, namely S_i(d+1) and S_ij; otherwise only the single sub-region S_i(d+1) of S_i needs to be queried.

Step 2: query processing. For a user query f(W, k), part of the data set is selected on the Driver node, according to the query input, to be queried. In this embodiment k = 100 and all attribute weights are equal, so only the S_1(d+1) data region has to be queried.
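A minimal PySpark sketch of this query-processing step, assuming the blocks have additionally been written out per sub-region as S_i(d+1) and S_ij (the preprocessing sketch above stops at the coarse region level; paths and helper names are hypothetical):

```python
def blocks_to_query(weights, i=1, d=8):
    """Hypothetical HDFS paths for partition i: always the mandatory block
    S_i(d+1), plus the candidate block S_ij when some weight w_j > 1/2."""
    paths = [f"hdfs:///topk/blocks/S_{i}_{d + 1}"]
    for j, w in enumerate(weights, start=1):
        if w > 0.5:
            paths.append(f"hdfs:///topk/blocks/S_{i}_{j}")
    return paths

def top_k_query(sc, weights, k):
    """Score only the selected blocks and return the k highest-scoring objects."""
    rdd = sc.textFile(",".join(blocks_to_query(weights)))
    objs = rdd.map(lambda line: tuple(float(v) for v in line.split(",")))
    return objs.takeOrdered(k, key=lambda obj: -sum(w * t for w, t in zip(weights, obj)))

# Example: equal weights over 8 attributes, k = 100 (only S_1(d+1) is read).
# results = top_k_query(sc, weights=(0.125,) * 8, k=100)
```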

Fig. 3 compares the query times of the three data partitioning schemes as the data dimensionality varies, where DistImprove is the method of the present invention, AngleDistTop_k is the angle- and distance-based data partitioning method, and BasicTop_k is the query time on the original, unpartitioned data set. The experiments show that the method of the present invention is faster than the angle- and distance-based partitioning method, improving query speed by about 15%, and that the query time grows steadily, without large fluctuations, as the dimensionality increases.

Since the user's input weights W = (w_1, w_2, …, w_d) affect the size of the queried data region, Fig. 4 shows the query times for different weight vectors at dimensionality 4. The first class is extremely biased towards one attribute, comprising W_2 = (0.06, 0.06, 0.07, 0.8) and W_4 = (0.56, 0.14, 0.25, 0.03); the second class, W_1 = (0.25, 0.25, 0.25, 0.25), has equal weights; and the third class, W_3 = (0.16, 0.32, 0.34, 0.18), has no pronounced bias. In the figure the query times of W_1 and W_3 are roughly equal, as are those of W_2 and W_4, and the query times of W_1 and W_3 are shorter than those of W_2 and W_4. This is mainly because, on the low-dimensional data set, different weights require different data blocks to be queried: W_2 and W_4 are extremely biased towards one attribute, so additional data blocks have to be queried, which makes their query times longer than those of W_1 and W_3.

The scalability of the method of the present invention is shown in Fig. 5. From the speedup of the 8-dimensional data set on different numbers of nodes it can be seen that the speedup is close to the ideal: as the number of processors (workers) doubles, the execution speed also doubles, i.e. the data partitioning method of the present invention scales well.

The above is only a preferred embodiment of the present invention. It should be pointed out that a person of ordinary skill in the art can make several improvements and refinements without departing from the principle of the present invention, and such improvements and refinements should also be regarded as falling within the scope of protection of the present invention.

Claims (2)

1. A weighted top-k query method for massive data under a distributed computing framework, characterized by comprising the following steps executed in sequence:

Step 1: Build the data space

First, the attribute values of all data objects with d attributes are converted into non-negative values and normalized; a d-dimensional coordinate system is established whose axes correspond one-to-one to the attributes of the data objects, and all data objects are placed in this coordinate system to form the data space;

Step 2: Data partitioning

Starting from the origin of the coordinate system, the whole coordinate system is divided from the inside out into m regions, numbered 1, 2, …, m from the outside in; the boundary of region 1 together with the coordinate axes encloses all data objects; for any region, the maximum value of every attribute of that region is the same, and the coordinates of the outer boundary of each region satisfy the condition that at least one axis has a coordinate equal to that region's attribute maximum; setting the attribute maximum of region 1 to a_1, the attribute maximum of the i-th region is a_i = a_1·((m - i + 1)/m)^(1/d); except for the outermost region, for each of the remaining regions the point that belongs to that region and whose attribute on every axis equals the maximum value is taken as a base point, and the region of the coordinate system in which every attribute value is greater than or equal to the corresponding attribute value at that base point is marked off; these newly marked regions are numbered 1, 2, …, M from the outside in, where M = m - 1, and are used as judgment regions, which are examined as follows:

In ascending order of their numbers, check in turn whether N_i ≥ k holds, where N_i is the number of data objects in judgment region i and k is the number of data objects in the result set returned by the algorithm; as soon as judgment region i satisfies this inequality, stop checking and take the i regions, counted from the outside in, as the search regions.

2. The weighted top-k query method for massive data under a distributed computing framework according to claim 1, characterized in that the innermost of the search regions, numbered i, is subdivided as follows:

The region is divided into d + 1 blocks, of which d blocks are candidate search regions; the remaining block, outside the candidate search regions, is the mandatory search region;

The candidate search regions are numbered n = 1, 2, …, d; any data point T_n(t_{n1}, t_{n2}, …, t_{nd}) in the n-th candidate search region, where t_{nj} denotes the attribute value of T_n on axis j, satisfies the following two inequalities:

0 ≤ t_{nj} ≤ 2a_{i+1} - a_i, where 1 ≤ j ≤ d and j ≠ n    (1)

a_i - a_{i+1} ≤ t_{nj} ≤ a_i, where j = n    (2)

In the n-th candidate search region, the data point T_n(t_{n1}, t_{n2}, …, t_{nd}) whose attribute value on one axis equals a_i and whose attribute values on all other axes equal 2a_{i+1} - a_i is taken as the maximum boundary point of the n-th candidate search region;

For the maximum boundary point of each candidate search region, check whether the attribute weight w_j assigned by the user to the attribute whose value at that point equals a_i satisfies the following condition: w_j > 1/2;

If there is a maximum boundary point whose attribute weight w_j satisfies the above condition, the search area is narrowed to the i - 1 regions counted from the outside in, the mandatory search region of the i-th region, and the candidate search region to which that maximum boundary point belongs;

If no maximum boundary point has an attribute weight w_j satisfying the above condition, the search area is narrowed to the i - 1 regions counted from the outside in plus the mandatory search region of the i-th region.
CN201510209691.0A 2015-04-28 2015-04-28 A weighted top-k query method for massive data under a distributed computing framework Expired - Fee Related CN104809210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510209691.0A CN104809210B (en) 2015-04-28 2015-04-28 A weighted top-k query method for massive data under a distributed computing framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510209691.0A CN104809210B (en) 2015-04-28 2015-04-28 A weighted top-k query method for massive data under a distributed computing framework

Publications (2)

Publication Number Publication Date
CN104809210A CN104809210A (en) 2015-07-29
CN104809210B true CN104809210B (en) 2017-12-26

Family

ID=53694032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510209691.0A Expired - Fee Related CN104809210B (en) A weighted top-k query method for massive data under a distributed computing framework

Country Status (1)

Country Link
CN (1) CN104809210B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777095A (en) * 2016-12-14 2017-05-31 大连交通大学 The double filtering search methods of the Skyline based on many medical factors under mobile O2O environment
CN106777091A (en) * 2016-12-14 2017-05-31 大连大学 The double filtering searching systems of the Skyline based on many medical factors under mobile O2O environment
CN108491541A (en) * 2018-04-03 2018-09-04 哈工大大数据(哈尔滨)智能科技有限公司 One kind being applied to distributed multi-dimensional database conjunctive query method and system
CN110245022B (en) * 2019-06-21 2021-11-12 齐鲁工业大学 Parallel Skyline processing method and system under mass data
CN114116806B (en) * 2021-12-03 2024-12-20 北京天融信网络安全技术有限公司 A top-k ranking query and storage method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314521A (en) * 2011-10-25 2012-01-11 中国人民解放军国防科学技术大学 Distributed parallel Skyline inquiring method based on cloud computing environment
CN103177130A (en) * 2013-04-25 2013-06-26 苏州大学 Continuous query method and continuous query system for K-Skyband on distributed data stream

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314521A (en) * 2011-10-25 2012-01-11 中国人民解放军国防科学技术大学 Distributed parallel Skyline inquiring method based on cloud computing environment
CN103177130A (en) * 2013-04-25 2013-06-26 苏州大学 Continuous query method and continuous query system for K-Skyband on distributed data stream

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Multi-dimensional top-k dominating queries; Man Lung Yiu et al.; The VLDB Journal; 2009-06-30; Vol. 18, No. 3; full text *
Top-k Dominant Web Services Under Multi-Criteria Matching; Dimitrios Skoutas et al.; EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology; 2009-03-26; full text *
Top-k reverse Skyline query algorithms in metric spaces (度量空间中的Top-k反向Skyline查询算法); Zhang Bin et al.; Journal of Computer Research and Development (计算机研究与发展); 2014-03-15; Vol. 51, No. 3; full text *

Also Published As

Publication number Publication date
CN104809210A (en) 2015-07-29


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171226
