CN103902585A - Data loading method and system

Info

Publication number: CN103902585A
Application number: CN201210580016.5A
Authority: CN
Inventors: 秦平; 齐骥; 钱岭
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Priority date: 2012-12-27
Filing date: 2012-12-27
Publication date: 2014-07-02

Abstract

The invention discloses a data loading method and system. The method includes configuring the mapping relationship between data block identifiers and collection points; when the collection points are at fault, reconfiguring the data block identifiers mapped with the collection points at fault to other collection points in normal operation; dividing data to be loaded into data blocks according to preset rules, and giving an identifier to each data block; acquiring the mapping relationships between the data block identifiers and the collection points, and sending the data blocks to the collection points mapped with the data block identifiers according to the mapping relationships; writing the data blocks into a database through the collection points. By using the method, the problem that part of data cannot be loaded into the database due to collection points at fault can be solved.

Description

A data loading method and system

Technical Field

The present application relates to the technical field of data processing, and in particular to a data loading method and system.

Background

In fields such as the Internet and communications, large volumes of data often need to be loaded into a designated data warehouse. Figure 1 is a schematic diagram of the composition of a data loading system in common use today.

As shown in Figure 1, the current data loading system includes a master node 101, proxy service nodes 102, collection points 103 and a data warehouse 104. Each proxy service node 102 is bound to a specific collection point 103. In the example shown in Figure 1, proxy service nodes A and B are both bound to collection point A, and proxy service nodes C and D are both bound to collection point B.

The master node 101 is used to start or stop the proxy service nodes 102 and the collection points 103.

Each proxy service node 102 includes a storage module and a proxy service module. The storage module stores the data to be loaded. The proxy service module reads the data to be loaded from the storage module and sends it to the collection point 103 bound to that proxy service node 102.

The collection point 103 writes the received data into the data warehouse 104 through the interface provided by the data warehouse 104, thereby performing the data loading.

After reading the data to be loaded from the storage module, the proxy service module parses it according to a given format. Because the data to be loaded is generally in the form of files, and in order to ensure that no data is lost during loading, the proxy service module performs no further processing after parsing a data file: it simply renames the file as a hidden file and sends the hidden file directly to the collection point 103 bound to the proxy service node 102 on which the proxy service module resides.

In practice, abnormal situations sometimes occur in the data loading system; for example, one or more collection points 103 may fail.

At present, when a collection point 103 fails, a proxy service node 102 bound to it encounters an error when sending data to it, and thereby learns that the collection point 103 bound to itself has failed.

In the current data loading system, proxy service nodes 102 and collection points 103 are bound to each other: each proxy service node 102 is bound to a specific collection point 103 and can load data only through that collection point. Therefore, when some collection points 103 fail, the data to be loaded that is stored on the proxy service nodes 102 bound to them cannot be loaded into the data warehouse 104, resulting in data loss.

Summary of the Invention

In view of this, the present application provides a data loading method and system that can solve the problem that some data cannot be loaded into the data warehouse because a collection point has failed.

A data loading method, the method comprising:

configuring a mapping between data block identifiers (IDs) and collection points and, when a collection point fails, reconfiguring the data block IDs mapped to the failed collection point so that they map to other collection points that have not failed;

dividing the data to be loaded into data blocks according to a preset rule, assigning an ID to each data block, obtaining the mapping between data block IDs and collection points, and, according to that mapping, sending each data block to the collection point to which its ID is mapped;

writing, by the collection point, the data block into the data warehouse.

A data loading system, the system comprising a master node, a proxy service node and a collection point;

the master node is used to configure a mapping between data block identifiers (IDs) and collection points and, when a collection point fails, to reconfigure the data block IDs mapped to the failed collection point so that they map to other collection points that have not failed;

the proxy service node is used to divide the data to be loaded into data blocks according to a preset rule, assign an ID to each data block, obtain the mapping between data block IDs and collection points, and, according to that mapping, send each data block to the collection point to which its ID is mapped;

the collection point is used to write data blocks into the data warehouse.

It can be seen that, when loading data, the present invention does not bind each proxy service node to a collection point as the prior art does, where the data that each proxy service node is responsible for can be loaded into the data warehouse only through the specific collection point it is bound to. Instead, the data to be loaded is divided into finer-grained data blocks, each data block is assigned an ID, a mapping between data block IDs and collection points is established, and the data blocks that each collection point is responsible for loading are determined from that mapping. Once one or more collection points are found to have failed, the mapping is reconfigured: the data block IDs mapped to a failed collection point are remapped to other collection points that have not failed, so that the data blocks originally handled by the failed collection point can be written into the data warehouse through those other collection points. This solves the prior-art problem that some data cannot be loaded into the data warehouse because a collection point has failed.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the composition of a data loading system in common use today.

Figure 2 is a flow chart of the data loading method provided by the present invention.

Figure 3 is a schematic diagram of the composition of the data loading system provided by the present invention.

Detailed Description

As shown in Figure 2, the data loading method provided by the present invention includes the following steps:

Step 201: configure the mapping between data block identifiers (IDs) and collection points.

In this step, the mapping between data block IDs and collection points is generally configured by the master node.

The method for generating data block IDs can be preset, so that the range of data block IDs is known and the data block IDs corresponding to each collection point can be determined. For example, a hash of the data block can be computed and used as the ID of that data block, or a random number produced by a random number generator can be used as the ID, provided that the ID of every data block is unique.

To improve operating efficiency, the mapping can be stored in the memory of the master node. To increase the number of data blocks that can be mapped, when a mapping table is used to store the mapping, the table can store mappings between data block ID ranges and collection points.
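A mapping table keyed by ID ranges, as described for step 201, might be sketched as follows. This is only an illustration under stated assumptions: the class name `RangeMap`, the collection point names `cp1` to `cp3` and the use of integer IDs with inclusive, non-overlapping ranges are all hypothetical and are not specified by the text.

```python
from bisect import bisect_right


class RangeMap:
    """Hypothetical in-memory table mapping inclusive block-ID ranges to collection points."""

    def __init__(self, ranges):
        # ranges: iterable of (low, high, collection_point); assumed non-overlapping.
        self._ranges = sorted(ranges)
        self._lows = [low for low, _, _ in self._ranges]

    def lookup(self, block_id):
        # Locate the last range whose lower bound is <= block_id.
        i = bisect_right(self._lows, block_id) - 1
        if i >= 0:
            low, high, point = self._ranges[i]
            if low <= block_id <= high:
                return point
        raise KeyError(f"no collection point mapped for block ID {block_id}")


# Illustrative names and ranges only.
mapping = RangeMap([(1, 100, "cp1"), (101, 200, "cp2"), (201, 300, "cp3")])
print(mapping.lookup(150))  # cp2
```

Storing one entry per range rather than one per block keeps the table small regardless of how many blocks fall into each range.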

Step 202: when a collection point fails, reconfigure the data block IDs mapped to the failed collection point so that they map to other collection points that have not failed.

Various methods can be used to determine whether a collection point has failed. For example, when the master node configures the mapping between data block IDs and collection points, each collection point can periodically report its own status to the master node; if the master node does not receive the status information reported by a collection point within a specified time, it can determine that the collection point has failed.
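The periodic status report described above amounts to a heartbeat with a timeout. A minimal sketch, assuming a hypothetical `HeartbeatMonitor` class on the master node and an injectable clock (neither is named in the text):

```python
import time


class HeartbeatMonitor:
    """Hypothetical master-side tracker of the last status report per collection point."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock  # injectable so the logic can be exercised without waiting
        self._last_seen = {}

    def report(self, point):
        # Called whenever a collection point reports its status.
        self._last_seen[point] = self._clock()

    def failed_points(self):
        # A point is treated as failed if it has not reported within the timeout.
        now = self._clock()
        return sorted(p for p, seen in self._last_seen.items()
                      if now - seen > self._timeout)
```

The timeout value and the reporting interval are deployment choices; the text only requires that a missing report within the specified time is treated as a failure.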

Step 203: divide the data to be loaded into data blocks according to a preset rule, and assign an ID to each data block.

The size of a data block can be chosen according to actual needs. For example, a file can be divided so that every 1000 records form one data block; if fewer than 1000 records remain at the end of the file, the remaining records form the last data block of that file.
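The 1000-record rule can be sketched as below. The function names are hypothetical, and the sequential IDs are a simplifying assumption for illustration; the text itself mentions hash values or random numbers and only requires that each ID be unique.

```python
from itertools import islice


def split_into_blocks(records, block_size=1000):
    """Yield lists of up to block_size records; the final block may be shorter."""
    it = iter(records)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block


def assign_ids(blocks, first_id=1):
    """Pair each block with a unique ID (sequential here, purely as an assumption)."""
    for offset, block in enumerate(blocks):
        yield first_id + offset, block


# 2500 records produce blocks of 1000, 1000 and 500 records.
for block_id, block in assign_ids(split_into_blocks(range(2500))):
    print(block_id, len(block))
```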

Step 204: obtain the mapping between data block IDs and collection points and, according to that mapping, send each data block to the collection point to which its ID is mapped.

In this step, when an error occurs while sending a data block to a collection point, it can generally be concluded that the collection point has failed and that the mapping between data block IDs and collection points may already have been reconfigured. The mapping can therefore be proactively obtained again, the collection point mapped to the data block determined from the newly obtained mapping, and the data block sent to that collection point. Alternatively, the master node can push the latest mapping after each reconfiguration.
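The re-fetch on send failure could look roughly like this on the proxy service node. Everything here is an assumption made for illustration: `fetch_mapping` and `send` are hypothetical callables standing in for the master node query and the network send, the mapping is simplified to a dictionary from block ID to collection point, and a send failure is assumed to surface as `ConnectionError`.

```python
def send_block(block_id, block, fetch_mapping, send, max_attempts=3):
    """Send a block to its mapped collection point, refreshing the mapping on failure."""
    mapping = fetch_mapping()
    for _ in range(max_attempts):
        point = mapping[block_id]
        try:
            send(point, block_id, block)
            return point
        except ConnectionError:
            # The collection point has probably failed and the mapping may have been
            # reconfigured, so obtain it again before retrying.
            mapping = fetch_mapping()
    raise RuntimeError(f"block {block_id} could not be sent after {max_attempts} attempts")
```

The bounded retry count is a design choice of this sketch; the text does not say how often a proxy service node should retry.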

Provided no logical contradiction arises, the order in which steps 201 to 204 are executed can be adjusted, or the steps can be executed concurrently; for example, steps 201 and 203 can be executed at the same time.

Step 205: the collection point writes the data block into the data warehouse.

In the present invention, reconfiguring the data block IDs mapped to a failed collection point so that they map to other collection points that have not failed may specifically include:

according to the load status of the other collection points that have not failed, selecting a collection point whose load meets a predetermined condition, for example a collection point whose load is below a predetermined value or the collection point with the smallest load, and remapping the data block IDs of the failed collection point to the selected collection point; or dividing the data block IDs mapped to the failed collection point into two or more parts, evenly or according to a given ratio, and mapping each part to a different collection point that has not failed.

For example, suppose there are currently five collection points: collection point 1 is mapped to IDs 1 to 100, collection point 2 to IDs 101 to 200, collection point 3 to IDs 201 to 300, collection point 4 to IDs 301 to 400, and collection point 5 to IDs 401 to 500. When collection point 2 fails, if collection point 5 is found to be the least busy, collection point 5 is then mapped to data block IDs 101 to 200 as well as 401 to 500. If the load on the collection points is fairly balanced, the IDs 101 to 200 can instead be divided into two ranges, 101 to 150 and 151 to 200, and two other collection points, for example collection points 1 and 3, are selected and mapped to data block IDs 101 to 150 and 151 to 200 respectively.
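Both reassignment strategies from the example can be sketched as follows. The function names, the `(low, high, point)` tuple layout and the numeric `load` values are assumptions for illustration; the text does not define how load is measured.

```python
def reassign_to_least_loaded(ranges, failed, load):
    """Remap every range of the failed point to the surviving point with the smallest load."""
    survivors = {p for _, _, p in ranges if p != failed}
    if not survivors:
        raise ValueError("no surviving collection point")
    target = min(survivors, key=lambda p: (load[p], p))
    return [(lo, hi, target if p == failed else p) for lo, hi, p in ranges]


def reassign_split(ranges, failed, targets):
    """Split each range of the failed point evenly across the given surviving points."""
    result = []
    for lo, hi, p in ranges:
        if p != failed:
            result.append((lo, hi, p))
            continue
        total, n, start = hi - lo + 1, len(targets), lo
        for i, t in enumerate(targets):
            # Spread any remainder over the first few targets.
            size = total // n + (1 if i < total % n else 0)
            if size:
                result.append((start, start + size - 1, t))
                start += size
    return sorted(result)


ranges = [(1, 100, "cp1"), (101, 200, "cp2"), (201, 300, "cp3"),
          (301, 400, "cp4"), (401, 500, "cp5")]
# Collection point 5 is the least busy, so it takes over IDs 101 to 200.
print(reassign_to_least_loaded(ranges, "cp2", {"cp1": 5, "cp3": 4, "cp4": 6, "cp5": 1}))
# Loads are balanced, so IDs 101 to 200 are split between collection points 1 and 3.
print(reassign_split(ranges, "cp2", ["cp1", "cp3"]))
```

Splitting by a ratio other than even, which the text also allows, would replace the `size` computation with a weighted one.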

To avoid loading data more than once, the present invention further proposes maintaining loading state information for each data block received by a collection point. When writing a data block into the data warehouse, the collection point uses the loading state information to determine whether the received data block has already been loaded into the data warehouse; if so, the received data block is not written into the data warehouse; otherwise, it is written into the data warehouse.

To further improve the avoidance of repeated loading, so that not only data blocks already fully loaded into the data warehouse but also, as far as possible, data blocks already partially loaded into the data warehouse are not loaded again, the present invention further proposes the following:

The loading state information includes a loaded state, an unloaded state and a loading state. The collection point looks up the loading state of a data block by the received data block ID. If the loading state is the loaded state, the data block is discarded. If the loading state is the unloaded state, the data block is written into the data warehouse. If the loading state is the loading state, the collection point queries the data warehouse by data block ID for the content of that data block that has already been loaded, deletes that content, and then writes the data block into the data warehouse.

So that the collection point can quickly determine by data block ID whether the data warehouse already holds content for the corresponding data block, what the collection point sends to the data warehouse includes not only the content of the data block but also its ID. The data warehouse stores not only the content of the data block but also the correspondence between that content and the data block ID. A dedicated query module can also be provided in the data warehouse to return, for a given data block ID, the information stored in the data warehouse for that ID.

To ensure the accuracy of the maintained loading state information, the present invention proposes that maintaining the loading state information of data blocks may specifically include:

maintaining a data block loading state table that stores the correspondence between data block IDs and the loading state information of the corresponding data blocks. When the table is initialized, the loading state information for all data block IDs is initialized to the unloaded state. After a data block is received and the loading state for its ID is determined to be the unloaded state, the loading state information for that ID is changed to the loading state. After the data block has been loaded successfully in its entirety, the loading state information for that ID is changed to the loaded state. After a data block is received, the loading state for its ID is determined to be the loading state, and the content of that data block already loaded in the data warehouse has been deleted, the loading state for that ID is changed to the unloaded state.
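The state transitions above can be sketched as a small state machine. This is an illustration only: `InMemoryWarehouse` is a hypothetical stand-in for the data warehouse and its query module, the state table is a plain dictionary passed in from outside (where the table is actually kept is not specified by the text), and a missing entry is treated as the unloaded state instead of pre-initializing every ID.

```python
from enum import Enum


class LoadState(Enum):
    UNLOADED = "unloaded"
    LOADING = "loading"
    LOADED = "loaded"


class InMemoryWarehouse:
    """Hypothetical stand-in for the data warehouse; stores records keyed by block ID."""

    def __init__(self):
        self.rows = {}

    def append(self, block_id, record):
        self.rows.setdefault(block_id, []).append(record)

    def delete_block(self, block_id):
        self.rows.pop(block_id, None)


def handle_block(states, warehouse, block_id, records):
    """Write a received block unless already loaded; return True if it was written."""
    state = states.get(block_id, LoadState.UNLOADED)
    if state is LoadState.LOADED:
        return False  # already fully loaded: discard the duplicate
    if state is LoadState.LOADING:
        # A previous attempt was interrupted: remove its partial content first.
        warehouse.delete_block(block_id)
        states[block_id] = LoadState.UNLOADED
    states[block_id] = LoadState.LOADING
    for record in records:
        warehouse.append(block_id, record)
    states[block_id] = LoadState.LOADED
    return True
```

Because partial content is deleted before the block is rewritten, a block interrupted midway ends up in the warehouse exactly once.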

The present invention also provides a data loading system; see Figure 3 for details.

As shown in Figure 3, the system includes a master node 301, proxy service nodes 302 and collection points 303. According to the mapping between data block IDs and collection points configured by the master node 301, a proxy service node 302 sends data to the collection point mapped to the ID of the data block containing that data. In the example shown in Figure 3, the IDs of the data blocks containing the data on proxy service node A map to collection point A, and the IDs of the data blocks containing the data on proxy service nodes B, C and D map to collection point B.

Specifically, the functions of the components of the system shown in Figure 3 are as follows:

The master node 301 is used to configure the mapping between data block identifiers (IDs) and collection points and, when a collection point 303 fails, to reconfigure the data block IDs mapped to the failed collection point 303 so that they map to other collection points 303 that have not failed.

To save storage space on the master node 301, the mapping may consist of mappings between data block ID ranges and collection points 303.

The proxy service node 302 is used to divide the data to be loaded into data blocks according to a preset rule, assign an ID to each data block, obtain the mapping between data block IDs and collection points, and, according to that mapping, send each data block to the collection point to which its ID is mapped.

The collection point 303 is used to write data blocks into the data warehouse.

There are generally two or more proxy service nodes 302 and generally two or more collection points 303. The data to be loaded is generally stored in a designated storage space on a designated proxy service node 302, and each proxy service node 302 divides the data to be loaded that it stores into blocks.

A proxy service node 302 generally includes a proxy service module, which divides the data to be loaded into blocks, assigns each data block a unique ID, obtains the mapping between data block IDs and collection points, and sends each data block to the corresponding collection point 303 according to the mapping.

A collection point 303 can periodically report its own status information to the master node 301. When the master node 301 does not receive the status information reported by a collection point 303 within a specified time, the master node 301 determines that the collection point 303 that did not report has failed.

The master node 301 can also use other methods to determine whether a collection point 303 has failed. For example, when a proxy service node 302 encounters an error sending data to a collection point 303, it can report to the master node 301 that the collection point 303 has failed, thereby triggering the master node 301 to reconfigure the mapping between data block IDs and collection points 303.

When the master node 301 determines that a collection point 303 has failed, it can, according to the load status of the other collection points 303 that have not failed, select a collection point 303 whose load meets a predetermined condition and remap the data block IDs of the failed collection point 303 to the selected collection point 303. Alternatively, it can divide the data block IDs mapped to the failed collection point 303 into two or more parts, evenly or according to a given ratio, and map each part to a different collection point 303 that has not failed.

When an error occurs while sending a data block to a collection point 303, the proxy service node 302 can obtain the mapping between data block IDs and collection points from the master node 301 again, determine the collection point 303 mapped to the data block from the newly obtained mapping, and send the data block to that collection point 303.

Alternatively, the proxy service node 302 can directly receive the updated mapping table pushed by the master node 301 and determine the collection point 303 mapped to each data block from the updated mapping table.

To avoid loading data more than once, the collection point 303 may include a state maintenance module and a writing module.

The state maintenance module is used to maintain the loading state information of each data block received by the collection point.

The writing module is used to determine, from the loading state information of a data block, whether the received data block has already been loaded into the data warehouse; if so, the received data block is not written into the data warehouse; otherwise, it is written into the data warehouse.

The loading state information may specifically include a loaded state, an unloaded state and a loading state. The system may also include a data warehouse.

The writing module is used to send the content of a data block and the ID of that data block to the data warehouse.

The data warehouse is used to store the data block content sent by the collection point, and to store the correspondence between data block content and data block ID.

The writing module is used to look up the loading state of a data block by the received data block ID. If the loading state is the loaded state, it discards the data block. If the loading state is the unloaded state, it writes the data block into the data warehouse. If the loading state is the loading state, it queries the data warehouse by data block ID for the content of that data block that has already been loaded, deletes that content, and then writes the data block into the data warehouse.

The state maintenance module may specifically be used to maintain a data block loading state table that stores the correspondence between data block IDs and the loading state information of the corresponding data blocks. When the table is initialized, the loading state information for all data block IDs is initialized to the unloaded state. After a data block is received and the loading state for its ID is determined to be the unloaded state, the loading state information for that ID is changed to the loading state. After the data block has been loaded successfully in its entirety, the loading state information for that ID is changed to the loaded state. After a data block is received, the loading state for its ID is determined to be the loading state, and the content of that data block already loaded in the data warehouse has been deleted, the loading state for that ID is changed to the unloaded state.

It can be seen that, in the present invention, once one or more collection points are found to have failed, the mapping between data block IDs and collection points is reconfigured: the data block IDs mapped to a failed collection point are remapped to other collection points that have not failed, so that the data blocks originally handled by the failed collection point can be written into the data warehouse through those other collection points. This achieves uninterrupted, loss-free data loading and ensures the consistency and timeliness of data loading.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. a data load method, is characterized in that, the method comprises:

The mapping relations of configuration data block identification ID and bleeding point, in the time that bleeding point breaks down, are reconfigured for by the data block mark ID of this bleeding point mapping of breaking down the bleeding point not breaking down with other and shine upon mutually;

The data that needs are loaded are divided into data block according to preset rules, and give mark ID for each data block, obtain the mapping relations of data block mark ID and bleeding point, according to these mapping relations, data block are issued to the bleeding point of this data block mark ID mapping;

Data block is written to data warehouse by bleeding point.

2. method according to claim 1, is characterized in that, the method also comprises: the stress state information of safeguarding each data block that bleeding point receives;

Bleeding point is written to data warehouse by data block and comprises:

Bleeding point judges according to the stress state information of data block whether the data block receiving has been loaded in data warehouse, if so, the data block of this reception is not written to data warehouse, otherwise, the data block of this reception is written to data warehouse.

3. The method according to claim 2, characterized in that the loading state information comprises a loaded state, a not-loaded state, and a loading state;

the collection point queries the loading state of the data block according to the received data block ID; when the loading state of the data block is the loaded state, the collection point discards the data block; when the loading state of the data block is the not-loaded state, the collection point writes the data block into the data warehouse; when the loading state of the data block is the loading state, the collection point queries the data warehouse, according to the data block ID, for the content of the data block that has already been loaded, deletes that content, and then writes the data block into the data warehouse;

wherein the collection point sends the content of the data block and the ID of the data block to the data warehouse, and the data warehouse stores the data block content sent by the collection point together with the correspondence between the data block content and the data block ID.

4. The method according to claim 3, characterized in that maintaining the loading state information of each data block received by the collection point comprises:

maintaining a data block loading state table that stores the correspondence between data block IDs and the loading state information of the respective data blocks; when the data block loading state table is initialized, initializing the loading state information corresponding to all data block IDs to the not-loaded state; after a data block is received and the loading state corresponding to its data block ID is determined to be the not-loaded state, changing the loading state information corresponding to the data block ID to the loading state; after the data block has been loaded completely and successfully, changing the loading state information corresponding to the data block ID to the loaded state; and after a data block is received, the loading state corresponding to its data block ID is determined to be the loading state, and the content of the data block already loaded in the data warehouse has been deleted, changing the loading state corresponding to the data block ID to the not-loaded state.
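The state transitions of claims 2 to 4 can be sketched as a small state machine. The following Python sketch rests on several assumptions: the `Warehouse` class is an in-memory stand-in, the state table is a plain dictionary passed to the collection point (the claims do not say where the table is stored), and all class and method names are hypothetical.

```python
from enum import Enum
from typing import Dict, List


class LoadState(Enum):
    NOT_LOADED = "not loaded"
    LOADING = "loading"
    LOADED = "loaded"


class Warehouse:
    """In-memory stand-in that stores block content keyed by data block ID."""

    def __init__(self) -> None:
        self.rows: Dict[int, List[str]] = {}

    def append(self, block_id: int, row: str) -> None:
        self.rows.setdefault(block_id, []).append(row)

    def delete_block(self, block_id: int) -> None:
        self.rows.pop(block_id, None)


class CollectionPoint:
    def __init__(self, warehouse: Warehouse, table: Dict[int, LoadState]) -> None:
        self.warehouse = warehouse
        self.table = table  # data block loading state table

    def receive(self, block_id: int, content: List[str]) -> bool:
        """Return True if the block was written, False if it was discarded."""
        state = self.table[block_id]
        if state is LoadState.LOADED:
            return False  # already in the warehouse: discard the duplicate
        if state is LoadState.LOADING:
            # A previous load was interrupted: delete the partial content,
            # then fall back to the not-loaded state.
            self.warehouse.delete_block(block_id)
            self.table[block_id] = LoadState.NOT_LOADED
        self.table[block_id] = LoadState.LOADING
        for row in content:
            self.warehouse.append(block_id, row)
        self.table[block_id] = LoadState.LOADED
        return True


warehouse = Warehouse()
table = {7: LoadState.NOT_LOADED}

# Simulate a collection point that failed after writing one row of block 7.
warehouse.append(7, "row-0")
table[7] = LoadState.LOADING

replacement = CollectionPoint(warehouse, table)
first = replacement.receive(7, ["row-0", "row-1"])   # cleans up, then reloads
second = replacement.receive(7, ["row-0", "row-1"])  # duplicate is discarded
```

Because partial content is deleted before the reload and duplicates are discarded, block 7 ends up in the stand-in warehouse exactly once, which is the property the claims rely on for consistent loading.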

5. The method according to claim 1, characterized in that reconfiguring the data block IDs mapped to the failed collection point so that they are mapped to other collection points that have not failed comprises:

selecting, according to the load of the other collection points that have not failed, a collection point whose load satisfies a predetermined condition, and reconfiguring the data block IDs mapped to the failed collection point so that they are mapped to the collection point whose load satisfies the predetermined condition; or dividing the data block IDs mapped to the failed collection point into two or more parts, either evenly or in a given proportion, and mapping each part to a different collection point that has not failed.
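The two reassignment strategies of claim 5 might look as follows in Python. This is a sketch under assumptions: the "predetermined condition" is taken to be "lowest current load", the proportional split uses contiguous slices of the orphaned ID list, and the function names and load figures are hypothetical.

```python
from typing import Dict, List


def reassign_to_least_loaded(orphaned: List[int], load: Dict[str, float]) -> Dict[int, str]:
    """Strategy 1: map all orphaned block IDs to the least-loaded collection point.

    "Least loaded" stands in for the predetermined condition (assumption).
    """
    target = min(load, key=load.get)
    return {block_id: target for block_id in orphaned}


def reassign_by_proportion(orphaned: List[int], weights: Dict[str, int]) -> Dict[int, str]:
    """Strategy 2: split orphaned block IDs into parts sized in proportion to weights.

    Equal weights give an even split.
    """
    total = sum(weights.values())
    result: Dict[int, str] = {}
    start = 0
    cumulative = 0
    for cp, weight in weights.items():
        cumulative += weight
        # Cumulative rounding guarantees the last part ends at len(orphaned).
        end = round(len(orphaned) * cumulative / total)
        for block_id in orphaned[start:end]:
            result[block_id] = cp
        start = end
    return result


orphaned_ids = [0, 1, 2, 3, 4, 5]

single_target = reassign_to_least_loaded(orphaned_ids, {"cp-B": 0.7, "cp-C": 0.2})
even_split = reassign_by_proportion(orphaned_ids, {"cp-B": 1, "cp-C": 1})
weighted_split = reassign_by_proportion(orphaned_ids, {"cp-B": 1, "cp-C": 2})
```

The first strategy keeps the mapping simple but concentrates the extra load on one collection point; the second spreads the extra load across several.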

6. The method according to claim 1, characterized in that obtaining the mapping relationship between data block IDs and collection points, and sending each data block to the collection point mapped to its data block ID according to the mapping relationship, comprises:

when an error occurs while sending a data block to a collection point, obtaining the mapping relationship between data block IDs and collection points again, determining the collection point mapped to the data block according to the newly obtained mapping relationship, and sending the data block to that collection point.
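The resend behaviour of claim 6 can be sketched as a retry loop that refreshes the mapping after each failed send. The `SendError` exception, the `fetch_mapping` and `send` callables, and the `max_attempts` limit are all assumptions introduced for illustration; the claim itself does not specify a retry limit.

```python
from typing import Callable, Dict, List, Tuple


class SendError(Exception):
    """Raised by the (hypothetical) transport when a send fails."""


def send_with_refresh(
    block_id: int,
    content: List[str],
    fetch_mapping: Callable[[], Dict[int, str]],
    send: Callable[[str, int, List[str]], None],
    max_attempts: int = 3,
) -> str:
    """Send a block; on error, fetch the mapping again and retry on the new target."""
    mapping = fetch_mapping()
    for _ in range(max_attempts):
        target = mapping[block_id]
        try:
            send(target, block_id, content)
            return target
        except SendError:
            mapping = fetch_mapping()  # the mapping may have been reconfigured
    raise SendError(f"block {block_id} could not be delivered")


# Simulated mapping source: the first snapshot still points at the failed
# collection point; the second reflects the reconfigured mapping.
snapshots = iter([{5: "cp-A"}, {5: "cp-B"}])
attempts: List[Tuple[str, int]] = []


def fake_send(target: str, block_id: int, content: List[str]) -> None:
    attempts.append((target, block_id))
    if target == "cp-A":
        raise SendError("cp-A is unreachable")


delivered_to = send_with_refresh(5, ["row-0"], lambda: next(snapshots), fake_send)
```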

7. A data loading system, characterized in that the system comprises a master node, a proxy service node, and a collection point;

the master node is configured to set the mapping relationship between data block identifiers (IDs) and collection points, and, when a collection point fails, to reconfigure the data block IDs mapped to the failed collection point so that they are mapped to other collection points that have not failed;

the proxy service node is configured to divide the data to be loaded into data blocks according to a preset rule, assign an ID to each data block, obtain the mapping relationship between data block IDs and collection points, and send each data block to the collection point mapped to its data block ID according to the mapping relationship;

the collection point is configured to write the data block into a data warehouse.

8. The system according to claim 7, characterized in that the collection point comprises a state maintenance module and a writing module;

the state maintenance module is configured to maintain loading state information for each data block received by the collection point;

the writing module is configured to determine, according to the loading state information of the data block, whether the received data block has already been loaded into the data warehouse; if so, not to write the received data block into the data warehouse; otherwise, to write the received data block into the data warehouse.

9. The system according to claim 8, characterized in that the loading state information comprises a loaded state, a not-loaded state, and a loading state; the system further comprises a data warehouse;

the writing module is configured to send the content of the data block and the ID of the data block to the data warehouse;

the data warehouse is configured to store the data block content sent by the collection point, together with the correspondence between the data block content and the data block ID;

the writing module is configured to query the loading state of the data block according to the received data block ID; when the loading state of the data block is the loaded state, to discard the data block; when the loading state of the data block is the not-loaded state, to write the data block into the data warehouse; and when the loading state of the data block is the loading state, to query the data warehouse, according to the data block ID, for the content of the data block that has already been loaded, delete that content, and then write the data block into the data warehouse.

10. The system according to claim 9, characterized in that

the state maintenance module is configured to maintain a data block loading state table that stores the correspondence between data block IDs and the loading state information of the respective data blocks; when the data block loading state table is initialized, to initialize the loading state information corresponding to all data block IDs to the not-loaded state; after a data block is received and the loading state corresponding to its data block ID is determined to be the not-loaded state, to change the loading state information corresponding to the data block ID to the loading state; after the data block has been loaded completely and successfully, to change the loading state information corresponding to the data block ID to the loaded state; and after a data block is received, the loading state corresponding to its data block ID is determined to be the loading state, and the content of the data block already loaded in the data warehouse has been deleted, to change the loading state corresponding to the data block ID to the not-loaded state.

11. The system according to claim 7, characterized in that

the collection point periodically reports its own status information to the master node, and when the master node does not receive the status information reported by the collection point within a specified time, the master node determines that the collection point has failed.
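The failure detection of claim 11 amounts to a heartbeat timeout. A minimal Python sketch follows, assuming that the master node records the time of each status report and that timestamps are passed in explicitly (which keeps the example deterministic); the `HeartbeatMonitor` class and the 10-second timeout are hypothetical.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last status report of each collection point (master-node side)."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout = timeout_seconds
        self.last_seen: Dict[str, float] = {}

    def report(self, collection_point: str, now: float) -> None:
        """Record a status report received at time `now`."""
        self.last_seen[collection_point] = now

    def failed(self, now: float) -> List[str]:
        """Return the collection points that have been silent longer than the timeout."""
        return sorted(
            cp for cp, seen in self.last_seen.items() if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.report("cp-A", now=0)
monitor.report("cp-B", now=0)
monitor.report("cp-B", now=8)  # cp-B keeps reporting; cp-A goes silent

failed_points = monitor.failed(now=12)
```

In a full system the result of `failed` would trigger the reconfiguration of the data block IDs mapped to the silent collection point.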

12. The system according to claim 7, characterized in that

the master node is configured to select, according to the load of the other collection points that have not failed, a collection point whose load satisfies a predetermined condition, and to reconfigure the data block IDs mapped to the failed collection point so that they are mapped to the collection point whose load satisfies the predetermined condition; or to divide the data block IDs mapped to the failed collection point into two or more parts, either evenly or in a given proportion, and to map each part to a different collection point that has not failed.

13. The system according to claim 7, characterized in that

the proxy service node is configured, when an error occurs while sending a data block to a collection point, to obtain the mapping relationship between data block IDs and collection points from the master node again, determine the collection point mapped to the data block according to the newly obtained mapping relationship, and send the data block to that collection point.

14. The system according to claim 7, characterized in that the mapping relationship comprises a mapping relationship between data block ID intervals and collection points.
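An interval-based mapping as in claim 14 can be looked up with a binary search over the interval start values. The Python sketch below assumes closed, non-overlapping intervals given as `(low, high, collection_point)` tuples; the `IntervalMapping` class and its methods are hypothetical names, not taken from the patent.

```python
import bisect
from typing import List, Tuple


class IntervalMapping:
    """Maps closed, non-overlapping data block ID intervals to collection points."""

    def __init__(self, intervals: List[Tuple[int, int, str]]) -> None:
        self.intervals = sorted(intervals)
        self.lows = [low for low, _, _ in self.intervals]

    def lookup(self, block_id: int) -> str:
        """Return the collection point whose interval contains block_id."""
        idx = bisect.bisect_right(self.lows, block_id) - 1
        if idx >= 0:
            _, high, collection_point = self.intervals[idx]
            if block_id <= high:
                return collection_point
        raise KeyError(block_id)

    def reassign(self, failed: str, replacement: str) -> None:
        """Remap every interval owned by the failed collection point."""
        self.intervals = [
            (low, high, replacement if cp == failed else cp)
            for low, high, cp in self.intervals
        ]


interval_mapping = IntervalMapping([(0, 99, "cp-A"), (100, 199, "cp-B")])
before = interval_mapping.lookup(42)
interval_mapping.reassign(failed="cp-A", replacement="cp-B")
after = interval_mapping.lookup(42)
```

Storing intervals rather than individual IDs keeps the mapping small and lets a whole range be remapped in one step when a collection point fails.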