CN111352587A

CN111352587A - A data packing method and device

Info

Publication number: CN111352587A
Application number: CN202010110504.4A
Authority: CN
Inventors: 于凯
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2020-02-24
Filing date: 2020-02-24
Publication date: 2020-06-30

Abstract

The invention provides a data packing method and device. The method comprises the following steps: obtaining a source file to be packed; dividing the obtained source file into data blocks; computing an FP fingerprint for each data block with a hash algorithm, deleting duplicate data, and retaining one copy of the data blocks that share the same FP fingerprint; creating a logical file for the physical file corresponding to the data blocks and mapping the physical file to the logical file; and assembling the FP fingerprints into a metadata sequence and packing the source file into a data package of a set format. Deduplication can be performed at file level or at data block level, depending on the granularity at which duplicates are eliminated. Block-level deduplication works at a finer granularity and can therefore provide a higher deduplication rate, so this data packing technique deduplicates at the data block level. This can effectively improve the utilization of storage products and reduce energy consumption.

Description

A data packing method and device

Technical field

The invention relates to the technical field of storage data management, and in particular to a data packing method and device.

Background

As storage technology develops, demand for storage capacity keeps rising, and so does demand for efficient use of storage space. Effectively improving storage utilization can help a storage product quickly capture a leading market position. Many common data compression programs are available on the storage market today, such as Tar, winrar and winzip. Although these programs can effectively improve the utilization of storage products, that utilization still needs to be improved further.

Summary of the invention

To address the low utilization achieved by existing compression software, the present invention provides a data packing method and device.

The technical solution of the present invention is as follows:

In one aspect, the technical solution of the present invention provides a data packing method comprising the following steps:

obtaining a source file to be packed;

dividing the obtained source file into data blocks;

computing an FP fingerprint for each data block with a hash algorithm, deleting duplicate data, and retaining one copy of the data blocks that share the same FP fingerprint;

creating a logical file for the physical file corresponding to the data blocks, and mapping the physical file to the logical file;

assembling the FP fingerprints into a metadata sequence and packing the source file into a data package of a set format.
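The first three steps can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes fixed-size 8K blocks and SHA-1 fingerprints (both are mentioned in the embodiment), and the names `split_blocks`, `fingerprint` and `deduplicate` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # assumed block size; the embodiment uses 8K blocks


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def fingerprint(block: bytes) -> str:
    """FP fingerprint of a block (SHA-1 is an assumption; MD5 is also named)."""
    return hashlib.sha1(block).hexdigest()


def deduplicate(blocks: list[bytes]) -> tuple[dict[str, bytes], list[str]]:
    """Return (unique blocks keyed by fingerprint, fingerprint sequence)."""
    store: dict[str, bytes] = {}
    sequence: list[str] = []
    for block in blocks:
        fp = fingerprint(block)
        store.setdefault(fp, block)  # keep only the first copy of each block
        sequence.append(fp)          # the sequence still records every block
    return store, sequence


# Five blocks, of which only two are distinct.
data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE
store, sequence = deduplicate(split_blocks(data))
print(len(sequence), "blocks,", len(store), "unique")  # 5 blocks, 2 unique
```

The fingerprint sequence plays the role of the logical file: it is enough to rebuild the original data from the unique block store.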

Further, the data package of the set format consists of three parts: a file header, a set of unique data blocks, and logical file metadata.

Further, the file header is a structure that defines the data block size, the number of unique data blocks, the data block ID size, the number of files in the package, and the location of the metadata within the package.

All unique data blocks are stored immediately after the file header; after the data blocks, the logical representation metadata of the files in the package is stored.
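One possible encoding of such a file header is sketched below. The source lists the five fields but not their widths or byte order, so the `struct` format string, the field order and the helper name `pack_header` are assumptions made for illustration.

```python
import struct

# Hypothetical layout (little-endian, no padding):
#   block_size (uint32), unique_block_count (uint32), block_id_size (uint32),
#   file_count (uint32), metadata_offset (uint64)
HEADER = struct.Struct("<IIIIQ")


def pack_header(block_size: int, unique_blocks: int,
                block_id_size: int, file_count: int) -> bytes:
    # The unique blocks follow the header directly, so the metadata
    # starts right after the last unique block.
    metadata_offset = HEADER.size + block_size * unique_blocks
    return HEADER.pack(block_size, unique_blocks, block_id_size,
                       file_count, metadata_offset)


header = pack_header(block_size=8192, unique_blocks=3,
                     block_id_size=4, file_count=2)
fields = HEADER.unpack(header)
print(fields)  # (8192, 3, 4, 2, 24600)
```

Storing the metadata offset in the header lets a reader jump straight to the file entries without scanning the block area.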

Further, the logical representation metadata consists of multiple entities, each of which represents one file.

Further, the entity header of a logical file records the file name length, the number of data blocks, the data block ID size, and the size of the last data block.

Further, the file name data is stored after the entity header of the logical file, followed by a group of unique data block numbers, each of which corresponds one-to-one to a data block in the unique data block set. Deduplication can be performed at file level or at data block level, depending on the granularity at which duplicates are eliminated. Block-level deduplication works at a finer granularity and can therefore provide a higher deduplication rate, so this new data packing technique deduplicates at the data block level. This can effectively improve the utilization of storage products and reduce energy consumption.
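A single file entity laid out as described (entity header, then file name, then block numbers) might be encoded and decoded as in the sketch below. Field widths, byte order, UTF-8 file names and the function names `encode_entity` and `decode_entity` are all assumptions, since the source only names the fields.

```python
import struct

# Hypothetical entity header (little-endian, no padding):
#   name_length (uint16), block_count (uint32),
#   block_id_size (uint8), last_block_size (uint32)
ENTITY_HEADER = struct.Struct("<HIBI")


def encode_entity(name: str, block_ids: list[int],
                  last_block_size: int, block_id_size: int = 4) -> bytes:
    raw_name = name.encode("utf-8")
    head = ENTITY_HEADER.pack(len(raw_name), len(block_ids),
                              block_id_size, last_block_size)
    ids = b"".join(i.to_bytes(block_id_size, "little") for i in block_ids)
    return head + raw_name + ids


def decode_entity(buf: bytes, offset: int = 0):
    """Return (name, block_ids, last_block_size, offset after the entity)."""
    name_len, count, id_size, last_size = ENTITY_HEADER.unpack_from(buf, offset)
    pos = offset + ENTITY_HEADER.size
    name = buf[pos:pos + name_len].decode("utf-8")
    pos += name_len
    ids = [int.from_bytes(buf[pos + k * id_size:pos + (k + 1) * id_size], "little")
           for k in range(count)]
    return name, ids, last_size, pos + count * id_size


encoded = encode_entity("report.txt", [0, 2, 2, 5], last_block_size=1234)
print(decode_entity(encoded))  # ('report.txt', [0, 2, 2, 5], 1234, 37)
```

A block number may appear more than once in the list (here block 2), which is how repeated content inside one file is represented without storing it twice.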

Further, the method also includes:

when reading data, first reading the logical file, then retrieving the corresponding data blocks from the storage system according to the FP sequence, and restoring a copy of the physical file.
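The read path can be sketched as below, under the assumption that the logical file is a list of fingerprints and that the storage system behaves like a mapping from fingerprint to block. The function name `restore` and the use of SHA-1 are illustrative choices, not taken from the source.

```python
import hashlib


def restore(logical_file: list[str], block_store: dict[str, bytes]) -> bytes:
    """Rebuild a physical file copy from its fingerprint (FP) sequence."""
    missing = [fp for fp in logical_file if fp not in block_store]
    if missing:
        raise KeyError(f"{len(missing)} referenced block(s) not in the store")
    # Concatenate the blocks in the order given by the FP sequence.
    return b"".join(block_store[fp] for fp in logical_file)


# Three blocks, two of them identical, so the store holds only two entries.
blocks = [b"alpha", b"beta", b"alpha"]
block_store = {hashlib.sha1(b).hexdigest(): b for b in blocks}
logical_file = [hashlib.sha1(b).hexdigest() for b in blocks]

copy = restore(logical_file, block_store)
print(copy)  # b'alphabetaalpha'
```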

In another aspect, the technical solution of the present invention also provides a data packing device comprising a source file acquisition module, a segmentation module, a calculation processing module, a logical file creation module, and a data package generation module;

the source file acquisition module is used to obtain a source file to be packed;

the segmentation module is used to divide the obtained source file into data blocks;

the calculation processing module is used to compute an FP fingerprint for each data block with a hash algorithm, delete duplicate data, and retain one copy of the data blocks that share the same FP fingerprint;

the logical file creation module is used to create a logical file for the physical file corresponding to the data blocks and to map the physical file to the logical file;

the data package generation module is used to assemble the FP fingerprints into a metadata sequence and pack the source file into a data package of a set format.

Further, the device also includes a physical file restoration module, which is used to read the logical file, then retrieve the corresponding data blocks from the storage system according to the FP sequence, and restore a copy of the physical file.

It can be seen from the above technical solution that the present invention has the following advantages. The data packing method of the present application can greatly reduce the amount of data by identifying duplicate data and eliminating the redundant copies. This improves the efficiency of the storage system, saves cost, and reduces the network bandwidth consumed during data transmission. It is also a green storage technology that can effectively reduce energy consumption. Deduplication can be performed at file level or at data block level, depending on the granularity at which duplicates are eliminated; block-level deduplication works at a finer granularity, which can be in the range of 4 to 24 KB. Block-level deduplication can therefore provide a higher deduplication rate, effectively improve the utilization of storage products, and reduce energy consumption.

In addition, the design principle of the present invention is reliable and its structure is simple, so it has very broad application prospects.

It can be seen that, compared with the prior art, the present invention has outstanding substantive features and represents significant progress, and the beneficial effects of its implementation are also evident.

Description of the drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic flowchart of a method according to an embodiment of the present invention.

Detailed description

To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the protection scope of the present invention.

Embodiment 1

As shown in FIG. 1, the technical solution of the present invention provides a data packing method comprising the following steps:

S11: obtaining a source file to be packed;

S12: dividing the obtained source file into data blocks; in this embodiment, the source file is divided into data blocks of 8K each;

S13: computing an FP fingerprint for each data block with a hash algorithm, deleting duplicate data, and retaining one copy of the data blocks that share the same FP fingerprint; FP stands for FingerPrint;

Note that a hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed length, called the hash value. A hash value is a compact numerical representation of a piece of data that is, for practical purposes, unique to it. If a passage of plaintext is hashed and even a single letter of the passage is then changed, the new hash will be a different value. With a good hash algorithm it is computationally infeasible to find two different inputs that hash to the same value, so the hash value of the data can be used to verify its integrity. A hash algorithm is an algorithm that produces the hash value of a piece of data, such as a message or a session item. With a good hash algorithm, a change to the input data can change every bit of the resulting hash value. The file is divided into fixed-length or variable-length data blocks, each block is hashed with an algorithm such as MD5 or SHA1, duplicate data is deleted, and one copy of the data blocks that share the same FP fingerprint is retained;

S14: creating a logical file for the physical file corresponding to the data blocks, and mapping the physical file to the logical file;

S15: assembling the FP fingerprints into a metadata sequence and packing the source file into a data package of a set format. Note that the data package of the set format consists of three parts: a file header, a set of unique data blocks, and logical file metadata. The file header is a structure that defines the data block size, the number of unique data blocks, the data block ID size, the number of files in the package, and the location of the metadata within the package.

All unique data blocks are stored immediately after the file header; after the data blocks, the logical representation metadata of the files in the package is stored. The logical representation metadata consists of multiple entities, each of which represents one file. The entity header of a logical file records the file name length, the number of data blocks, the data block ID size, and the size of the last data block. The file name data is stored after the entity header, followed by a group of unique data block numbers, each of which corresponds one-to-one to a data block in the unique data block set. Deduplication can be performed at file level or at data block level, depending on the granularity at which duplicates are eliminated. Block-level deduplication works at a finer granularity and can therefore provide a higher deduplication rate, so this new data packing technique deduplicates at the data block level. This can effectively improve the utilization of storage products and reduce energy consumption. The method further includes: when reading data, first reading the logical file, then retrieving the corresponding data blocks from the storage system according to the FP sequence, and restoring a copy of the physical file.
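The difference between file-level and block-level deduplication can be made concrete with a small comparison. The sketch below is an illustration under assumptions (8K blocks as in step S12, SHA-1 fingerprints, hypothetical function names); it counts how many bytes would have to be stored for two files that differ only in their last block.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # 8K blocks, as in step S12


def stored_bytes_file_level(files: dict[str, bytes]) -> int:
    """Bytes kept when only whole identical files are deduplicated."""
    unique = {hashlib.sha1(data).digest(): len(data) for data in files.values()}
    return sum(unique.values())


def stored_bytes_block_level(files: dict[str, bytes],
                             block_size: int = BLOCK_SIZE) -> int:
    """Bytes kept when identical blocks are deduplicated across all files."""
    unique: dict[bytes, int] = {}
    for data in files.values():
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            unique[hashlib.sha1(block).digest()] = len(block)
    return sum(unique.values())


# Two four-block files that share their first three blocks.
blocks = [bytes([n]) * BLOCK_SIZE for n in range(4)]
files = {
    "a.bin": b"".join(blocks),
    "b.bin": b"".join(blocks[:3]) + bytes([9]) * BLOCK_SIZE,
}

print(stored_bytes_file_level(files))   # 65536: the files differ, nothing is saved
print(stored_bytes_block_level(files))  # 40960: only five unique blocks are kept
```

In this constructed case file-level deduplication stores all eight blocks, while block-level deduplication stores five; real savings depend on how much content the files actually share.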

Embodiment 2

The technical solution of the present invention also provides a data packing device comprising a source file acquisition module, a segmentation module, a calculation processing module, a logical file creation module, and a data package generation module;

The device also includes a physical file restoration module, which is used to read the logical file, then retrieve the corresponding data blocks from the storage system according to the FP sequence, and restore a copy of the physical file.

The generated data package consists of three parts: a file header, a set of unique data blocks, and logical file metadata. The file header is a structure that defines meta information such as the data block size, the number of unique data blocks, the data block ID size, the number of files in the package, and the location of the metadata within the package. All unique data blocks are stored immediately after the file header; their size and number are given by the meta information in the file header. After the data blocks comes the logical representation metadata of the files in the package, which consists of multiple entities, each representing one file. When unpacking, the data blocks are extracted one by one according to the file's metadata and the original physical file is restored. The entity header of a logical file records information such as the file name length, the number of data blocks, the data block ID size, and the size of the last data block. It is followed by the file name data, whose length is defined in the entity header. After the file name data comes a group of unique data block numbers, each of which corresponds one-to-one to a data block in the unique data block set. Finally, the last data block of the file is stored. Because this block is usually smaller than a normal data block and is very unlikely to be duplicated, it is stored separately.
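The whole package layout, including the separately stored last block, can be sketched as a pack and unpack round trip. This is an illustrative sketch only: field widths, byte order, SHA-1 fingerprints, UTF-8 names and the function names `pack` and `unpack` are assumptions, and a last block that happens to be full-sized is simply deduplicated like any other block.

```python
import hashlib
import io
import struct

HEADER = struct.Struct("<IIIIQ")  # block size, unique blocks, ID size, file count, metadata offset
ENTITY = struct.Struct("<HIBI")   # name length, block count, ID size, last block size
ID_SIZE = 4                       # assumed width of a block number in bytes


def pack(files: dict[str, bytes], block_size: int = 8192) -> bytes:
    ids: dict[bytes, int] = {}    # fingerprint -> block number
    blocks: list[bytes] = []      # unique full-size blocks, in number order
    entities = io.BytesIO()
    for name, data in files.items():
        full = len(data) // block_size * block_size
        numbers = []
        for i in range(0, full, block_size):
            block = data[i:i + block_size]
            fp = hashlib.sha1(block).digest()
            if fp not in ids:     # first time this content is seen
                ids[fp] = len(blocks)
                blocks.append(block)
            numbers.append(ids[fp])
        tail = data[full:]        # short last block, stored with the entity
        raw_name = name.encode("utf-8")
        entities.write(ENTITY.pack(len(raw_name), len(numbers), ID_SIZE, len(tail)))
        entities.write(raw_name)
        entities.write(b"".join(n.to_bytes(ID_SIZE, "little") for n in numbers))
        entities.write(tail)
    meta_offset = HEADER.size + block_size * len(blocks)
    header = HEADER.pack(block_size, len(blocks), ID_SIZE, len(files), meta_offset)
    return header + b"".join(blocks) + entities.getvalue()


def unpack(package: bytes) -> dict[str, bytes]:
    block_size, _, _, file_count, pos = HEADER.unpack_from(package, 0)
    files = {}
    for _ in range(file_count):
        name_len, count, id_size, tail_len = ENTITY.unpack_from(package, pos)
        pos += ENTITY.size
        name = package[pos:pos + name_len].decode("utf-8")
        pos += name_len
        parts = []
        for k in range(count):    # look up each numbered block after the header
            n = int.from_bytes(package[pos:pos + id_size], "little")
            start = HEADER.size + n * block_size
            parts.append(package[start:start + block_size])
            pos += id_size
        parts.append(package[pos:pos + tail_len])
        pos += tail_len
        files[name] = b"".join(parts)
    return files


files = {"a.bin": b"x" * 20000, "b.bin": b"x" * 8192 + b"tail"}
package = pack(files)
print(unpack(package) == files, len(package) < sum(map(len, files.values())))
```

Both files consist mostly of the same 8K block, so the package stores that block once and is smaller than the combined input while still restoring both files exactly.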

Although the present invention has been described in detail with reference to the accompanying drawings and in conjunction with preferred embodiments, the invention is not limited thereto. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art may make various equivalent modifications or substitutions to the embodiments, and any changes or substitutions readily conceivable by those skilled in the art within the technical scope disclosed herein shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims

1. A data packing method, comprising the steps of:

obtaining a source file to be packed;

dividing the obtained source file into data blocks;

computing an FP fingerprint for each data block with a hash algorithm, deleting duplicate data, and retaining one copy of the data blocks that share the same FP fingerprint;

creating a logical file for the physical file corresponding to the data blocks, and mapping the physical file to the logical file;

and assembling the FP fingerprints into a metadata sequence and packing the source file to be packed into a data package of a set format.

2. A data packing method according to claim 1, wherein the data package of the set format consists of three parts: a file header, a set of unique data blocks, and logical file metadata.

3. A data packing method according to claim 2, wherein the file header is a structure defining the data block size, the number of unique data blocks, the data block ID size, the number of files in the package, and the location of the metadata within the package;

all unique data blocks are stored immediately after the file header; after the data blocks, the logical representation metadata of the files in the package is stored.

4. A data packing method according to claim 3, wherein the logical representation metadata consists of a plurality of entities, each entity representing one file.

5. A data packing method according to claim 4, wherein the file name length, the number of data blocks, the data block ID size and the size of the last data block are recorded in the entity header of the logical file.

6. A data packing method according to claim 5, wherein the file name data is stored after the entity header of the logical file, and the file name data is followed by a group of unique data block numbers, the numbers corresponding one-to-one to the data blocks in the unique data block set.

7. A data packing method according to claim 1, further comprising:

when data is read, first reading the logical file, then retrieving the corresponding data blocks from the storage system according to the FP sequence, and restoring a copy of the physical file.

8. A data packing device, comprising a source file acquisition module, a segmentation module, a calculation processing module, a logical file creation module and a data package generation module;

the source file acquisition module is used for obtaining a source file to be packed;

the segmentation module is used for dividing the obtained source file into data blocks;

the calculation processing module is used for computing an FP fingerprint for each data block with a hash algorithm, deleting duplicate data, and retaining one copy of the data blocks that share the same FP fingerprint;

the logical file creation module is used for creating a logical file for the physical file corresponding to the data blocks and mapping the physical file to the logical file;

and the data package generation module is used for assembling the FP fingerprints into a metadata sequence and packing the source file to be packed into a data package of a set format.

9. The data packing device of claim 8, further comprising a physical file restoration module, wherein the physical file restoration module is configured to read the logical file, then retrieve the corresponding data blocks from the storage system according to the FP sequence, and restore a copy of the physical file.