CN106570113B

CN106570113B - Mass vector slice data cloud storage method and system

Info

Publication number: CN106570113B
Application number: CN201610939884.6A
Authority: CN
Inventors: 马潇; 王景朝; 费香泽; 王宪
Original assignee: China Electric Power Research Institute Co Ltd CEPRI; State Grid Anhui Electric Power Co Ltd; State Grid Corp of China SGCC
Current assignee: China Electric Power Research Institute Co Ltd CEPRI; State Grid Anhui Electric Power Co Ltd; State Grid Corp of China SGCC
Priority date: 2016-10-25
Filing date: 2016-10-25
Publication date: 2022-04-01
Anticipated expiration: 2036-10-25
Also published as: CN106570113A

Abstract

The invention discloses a cloud storage method for massive vector slice data. The method includes: establishing a directory tree file of a distributed file system; establishing all metadata nodes corresponding to the directory tree of the distributed file system; aggregating the massive vector slice data under the same-level directory in the distributed file system to generate massive vector slice data packets; storing the massive vector slice data packets in the metadata nodes; establishing an index for the massive vector slice data and associating the massive vector slice data through the index to form a mesh-structured data index table of the massive vector slice data, the index table recording the path of the massive vector slice data within the massive vector slice data packet; and providing a massive vector slice data indexing service through the massive vector slice data packet index table.

Description

Mass vector slice data cloud storage method and system

Technical Field

The invention relates to the field of mass data storage, in particular to a mass vector slice data cloud storage method and system.

Background

With the continuous development of science and technology, the era of mass data has arrived. Optimizing the load of the file system and improving load balancing have therefore become important requirements. When the size of a data set exceeds the storage capacity of a single physical computer, the data set must be partitioned and stored on several separate computers. International companies such as Google, Amazon, IBM and Microsoft have invested a great deal of research effort in this field and have produced various innovative mass data management technologies. Research work is currently focused on three levels: the storage layer, the computation layer and the interface layer. The Hadoop project in the prior art implements the Hadoop distributed file system, Hadoop DFS (HDFS for short), and the parallel programming framework Hadoop MapReduce. A distributed file system is built on a network and so introduces the complexity of network programming, which makes it more complex than an ordinary disk file system. The goal of a distributed file system is resource sharing, so that programs can store and access remote files in a manner similar to accessing local files; typical examples are the Google file system GFS, the Hadoop file system HDFS, Dynamo and TFS. Present distributed file systems typically keep nearly the same access interface and object model as local file systems, primarily to provide backward compatibility for users.

The prior art mainly uses distributed file systems to store and read very large data files (file sizes of hundreds of MB, GB or TB). However, a distributed file system handling a large number of small files cannot meet their storage requirements because of its low storage speed. At present, there is no technical scheme for storing and reading a large number of small files on a distributed file system.

Disclosure of Invention

In order to solve the speed problem when a large amount of small file data are stored based on a distributed file system, the invention provides a method, which comprises the following steps:

establishing all metadata nodes corresponding to a directory tree of the distributed file system;

aggregating massive vector slice data under the same-level directory in the distributed file system to generate a massive vector slice data packet;

storing the massive vector slice data packets in the metadata nodes;

establishing indexes for the massive vector slice data, and establishing association of the massive vector slice data through the indexes to form a data index table of the massive vector slice data with a mesh structure;

and providing the massive vector slice data index service through the massive vector slice data packet index table.

Preferably, in the method:

the massive vector slice data index comprises a massive vector slice data path, a name and an offset in the massive vector slice data packet;

the massive vector slice data path comprises a metadata node position, a massive vector slice data row position and a massive vector slice data column position.

Preferably, the method comprises:

presetting metadata nodes on each layer, and storing an index table into the preset metadata nodes on each layer;

and transmitting the massive vector slice data index table stored in the metadata to a client, and establishing a massive vector slice data index table persistent mapping table.

Preferably, the massive vector slice data packet comprises a file header and at least one record;

the file header comprises a file type, a version number, file keywords, a file name and a position corresponding to each record;

each record corresponds to a vector slice data, and each record comprises a length, a key, and a value of the vector slice data.

Preferably, the massive vector slice data packets are stored by a data file serialization method.

Preferably, the method further comprises: appending new data at the tail of the massive vector slice data packet.

Preferably, the method comprises: caching the massive vector slice data index table at a client, thereby reducing the number of accesses to the metadata node and improving the efficiency of accessing the massive vector slice data.

Preferably, the method further comprises reading the massive vector slice data by the following steps:

determining the shortest path to the metadata node corresponding to the massive vector slice data packet through the massive vector slice data index table;

and determining the position of the vector slice data from the file header of the data packet file in the determined metadata node.

Based on the implementation mode of the present invention, the present invention provides a cloud storage system for massive vector slice data, the system comprising:

the first generation unit is used for establishing a directory tree file of the distributed file system;

the second generation unit is used for establishing all metadata nodes corresponding to the directory tree of the distributed file system;

the aggregation unit is used for aggregating the massive vector slice data based on the same-level directory of the distributed file system to generate a massive vector slice data packet;

the storage unit is used for storing the massive vector slice data packets in the metadata nodes;

a third generating unit, configured to generate the massive vector slice data index table, establish a mesh structure of the massive vector slice data packet through the index table, and record a path of the massive vector slice data in the massive vector slice data packet;

and the indexing unit is used for providing the massive vector slice data indexing service through the massive vector slice data indexing.

The beneficial effects of the invention are as follows. The massive vector slice data under the same-level directory in the distributed file system are aggregated to generate a massive vector slice data packet, so that the massive vector slice data can be stored rapidly. At the same time, indexes are established for the massive vector slice data, and the massive vector slice data are associated through the indexes to form a mesh-structured data index table of the massive vector slice data. Through the mesh-structured data index table, the corresponding metadata node is found by the shortest path, which accelerates data access.

Drawings

A more complete understanding of exemplary embodiments of the present invention may be had by reference to the following drawings in which:

FIG. 1 is a system flow chart of a mass vector slice data cloud storage method according to an embodiment of the present invention; and

FIG. 2 is a system structure diagram of a cloud storage method for massive vector slice data according to an embodiment of the present invention.

Detailed Description

The exemplary embodiments of the present invention will now be described with reference to the accompanying drawings. However, the present invention may be embodied in many different forms and is not limited to the embodiments described herein, which are provided so that this disclosure is thorough and complete and fully conveys the scope of the present invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, the same units/elements are denoted by the same reference numerals.

Unless otherwise defined, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Further, it will be understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.

Fig. 1 is a system flow chart of a mass vector slice data cloud storage method according to an embodiment of the present invention. The invention provides a massive vector slice data storage method based on a distributed file system. The scheme builds on the directory tree structure of an existing distributed file system: the many pieces of massive vector slice data in a directory are packaged into massive vector slice data packets for storage, and each packaged packet is a large data file of more than one hundred MB. In addition, the technical scheme establishes an index for the massive vector slice data, records the path of the massive vector slice data within the massive vector slice data packet, and provides an interface through which the client accesses the massive vector slice data. The method makes full use of the high fault tolerance, scalability and distributed nature of a master-slave distributed file system, and achieves efficient storage of massive vector data on top of a distributed file system designed for files larger than one hundred MB. By storing the massive vector data in the distributed file system and establishing an index for them, the method addresses the current low speed of storing massive vector data and improves access speed.

Preferably, the method 100 starts from step 101: a directory tree file of the distributed file system is established. Constructing the directory tree structure file of the distributed file system allows the high fault tolerance, scalability and distributed nature of the distributed file system to be fully used.

Preferably, step 102: all metadata nodes corresponding to the directory tree of the distributed file system are established. The metadata nodes are used for storing data.

Preferably, step 103: the massive vector slice data under the same-level directory in the distributed file system are aggregated to generate a massive vector slice data packet. The file structure of the massive vector slice data packet is designed so that the packet comprises a file header and at least one record. The file header comprises a file type, a version number, a file keyword, a file name and the position corresponding to each record. Each record corresponds to one piece of vector slice data and includes the length, the key and the value of the vector slice data. New massive vector slice data are appended at the tail of the massive vector slice data packet, and the packets are stored by a data file serialization method. The implementation provided by the invention is based on a distributed system framework consisting of a metadata node and several levels of hierarchical data nodes under the metadata node. The embodiment of the invention stores all the massive vector slice data under the same-level directory into the data file under that directory; this data file, the massive vector slice data packet, is a file in the distributed file system. In the embodiment of the invention, the key to the aggregated storage technique lies in the design of the massive vector slice data packet file. The packet file is a distributed file system file with a binary key/value persistent data structure, consisting of a file header followed by one or more records. The first three bytes of the file header are the file type, SEQ, and the next byte represents the version number of the file data structure. The header also includes other fields, including the names of the types of the keys and of the corresponding values.

During storage, the massive vector slice data are appended directly at the tail of the massive vector slice data packet file. Each record represents one piece of vector slice data and is composed of four items: record length, key length, key and value. The key is the file name of the vector slice data and the value is the content of the vector slice data.
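The packet layout described in step 103 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: only the `SEQ` file type, the one-byte version number, appending at the tail and the record fields (record length, key length as listed in claim 1, key, value) come from the text. The 4-byte big-endian length fields, the version value, the slice names and the helper names `write_header`, `append_record` and `read_record` are hypothetical, and the remaining header fields are omitted.

```python
import io
import struct

MAGIC = b"SEQ"   # file type: first three bytes of the header, per the description
VERSION = 1      # assumed value; the text only says one version byte follows


def write_header(buf):
    """Write the 3-byte file type and the 1-byte version number."""
    buf.write(MAGIC + bytes([VERSION]))


def append_record(buf, key, value):
    """Append one record at the tail of the packet and return its offset.

    Assumed layout: record length (4 bytes), key length (4 bytes), key, value.
    The record length counts the key bytes plus the value bytes.
    """
    buf.seek(0, io.SEEK_END)
    offset = buf.tell()
    key_bytes = key.encode("utf-8")
    buf.write(struct.pack(">II", len(key_bytes) + len(value), len(key_bytes)))
    buf.write(key_bytes)
    buf.write(value)
    return offset


def read_record(buf, offset):
    """Read the record that starts at `offset` and return (key, value)."""
    buf.seek(offset)
    record_len, key_len = struct.unpack(">II", buf.read(8))
    key = buf.read(key_len).decode("utf-8")
    value = buf.read(record_len - key_len)
    return key, value


# An in-memory buffer stands in for the packet file on the distributed file system.
packet = io.BytesIO()
write_header(packet)
first = append_record(packet, "slice_0506", b"slice-bytes-A")
second = append_record(packet, "slice_0507", b"slice-bytes-B")
```

The offsets returned by `append_record` are the kind of value an index record would store so that a single slice can later be read without scanning the whole packet.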

Preferably, step 104: the massive vector slice data packets are stored in the metadata nodes. The massive vector slice data packet storage method is implemented on a distributed file system, and access to the massive vector slice data depends on that file system. New massive vector slice data are appended at the tail of the massive vector slice data packet, and the packets are stored by a data file serialization method. When a client writes vector slice data to a directory, the client performs a write operation on the data file of that directory, and the distributed file system records a lease on the data file, which can be regarded as a write lock on the file. If another client then needs to store its own vector slice data in the same directory, it also applies to write the massive vector slice data packet file in that directory. Because a write lock already exists on the packet file and the distributed file system does not maintain a queue of transaction requests, an operation failure is returned directly to the second client. From the user's point of view, creating different vector slice data under the same directory should not cause a conflict; at the back end, however, the same massive vector slice data packet file is being operated on, and because of the locking mechanism, several users writing different vector slice data under the same directory conflict with one another. The massive vector slice data packet files are implemented mainly by serializing and deserializing the data files. Serialization means converting a structured object into a byte stream for transmission over a network or for writing to disk for permanent storage. Deserialization is the reverse process of converting a byte stream back into a structured object.
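The write conflict described in step 104 can be illustrated with a small sketch. It assumes a simplified lease table in which a lease is held by one client at a time and a competing request fails immediately because no request queue exists. The class `LeaseTable`, the paths and the client names are hypothetical, and lease expiry and renewal are deliberately left out.

```python
class LeaseTable:
    """Tracks which client holds the write lease on each packet file."""

    def __init__(self):
        self._holders = {}  # packet file path -> client id

    def acquire(self, packet_path, client_id):
        """Grant the lease if it is free or already held by this client.

        There is no request queue, so a competing writer gets False at once.
        """
        holder = self._holders.get(packet_path)
        if holder is None or holder == client_id:
            self._holders[packet_path] = client_id
            return True
        return False

    def release(self, packet_path, client_id):
        """Release the lease, but only if this client actually holds it."""
        if self._holders.get(packet_path) == client_id:
            del self._holders[packet_path]


leases = LeaseTable()
packet_file = "/tiles/12/data.seq"  # hypothetical packet file of one directory

a_ok = leases.acquire(packet_file, "client-A")     # first writer gets the lease
b_ok = leases.acquire(packet_file, "client-B")     # second writer fails immediately
leases.release(packet_file, "client-A")
b_retry = leases.acquire(packet_file, "client-B")  # succeeds once the lease is free
```

Both clients believe they are writing different slices, yet they contend for the same packet file, which is why the second request fails until the first lease is released.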

Preferably, step 105: indexes are established for the massive vector slice data, and the massive vector slice data are associated through the indexes to form a mesh-structured data index table of the massive vector slice data; the index table records the path of the massive vector slice data within the massive vector slice data packet. The massive vector slice data index comprises the massive vector slice data path, the name and the offset in the massive vector slice data packet; the path comprises a metadata node position, a row position and a column position of the massive vector slice data. For example, one massive vector slice data path is <18, 0506>, where 18 is the metadata node position, 05 is the row position and 06 is the column position. When searching for the massive vector slice data, metadata node position 18 is located first, then the corresponding row 05, and then the corresponding column 06. All the massive vector slice data form a spatial mesh index structure according to the metadata node positions, row positions and column positions of the paths in the index table. In this way the embodiment of the invention can find massive vector slice data by the shortest path.

A metadata node for storing the data index table is preset for each layer of metadata nodes, and the massive vector slice data index table is stored in the corresponding metadata node. The massive vector slice data index table recorded in the metadata is transmitted to a directory file, and a persistent mapping table of the massive vector slice data index is established at the client.
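The <18, 0506> example can be traced with a short sketch. It assumes that the row and column each occupy half of the digit string and that the index can be modelled as nested dictionaries keyed by node, row and column. The names `SlicePath`, `parse_path` and `lookup` and the sample entry are hypothetical.

```python
from typing import NamedTuple


class SlicePath(NamedTuple):
    node: int  # metadata node position
    row: int   # slice row position
    col: int   # slice column position


def parse_path(node, rowcol):
    """Split a path such as (18, "0506") into node 18, row 5, column 6.

    Assumes the row and the column each take half of the digit string.
    """
    half = len(rowcol) // 2
    return SlicePath(int(node), int(rowcol[:half]), int(rowcol[half:]))


def lookup(index, path):
    """Walk node -> row -> column and return the index entry, or None."""
    return index.get(path.node, {}).get(path.row, {}).get(path.col)


# Hypothetical index content: one slice stored on node 18 at row 5, column 6.
index = {18: {5: {6: {"name": "slice_0506", "offset": 4}}}}

path = parse_path(18, "0506")
entry = lookup(index, path)
```

The lookup visits the node first, then the row, then the column, which mirrors the search order given in the description.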

The index of the vector slice data records the position of the vector slice data in the specific massive vector slice data packet file together with other attributes of the vector slice data; the index must be created for the massive vector slice data after the client stores the data. An index record comprises the name of the massive vector slice data, the file path of the massive vector slice data packet in which the data are located, and the offset within that packet file. The number of bits occupied by the file name of the massive vector slice data packet determines how many data files a directory can hold, and the number of bits occupied by the offset determines the maximum size of a data file, so the capacity for storing data in one directory is limited.
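The capacity limit can be made concrete with a small calculation. The bit widths below (16 bits for the packet file name, 32 bits for the offset) are illustrative assumptions only; the text does not specify them. The sketch also assumes that offsets are byte offsets.

```python
def directory_capacity(name_bits, offset_bits):
    """Return (max packet files, max bytes per packet, max bytes per directory)."""
    max_files = 2 ** name_bits           # distinct packet file names
    max_packet_bytes = 2 ** offset_bits  # largest addressable byte offset range
    return max_files, max_packet_bytes, max_files * max_packet_bytes


# Assumed widths: 16-bit packet file name, 32-bit byte offset.
files, packet_bytes, total = directory_capacity(name_bits=16, offset_bits=32)
```

Under these assumed widths a directory could hold 65,536 packet files of at most 4 GiB each, or 256 TiB in total; widening either field raises the limit at the cost of larger index records.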

Preferably, the massive vector slice data indexes are distributed to various data nodes for management. Although the index data of the massive vector slice data is huge, after the index data is distributed on the metadata nodes, the index data on a single metadata node is relatively small, and the capacity of the cluster for storing the massive vector slice data depends on the size of the cluster. The size of the cluster scale can not only determine the size of the storage capacity, but also reflect the size of the quantity of the massive vector slice data. The metadata node maintains an index of the vector slice data and provides an index service to the client. The index position of the vector slice data describes the metadata node that maintains the index of the vector slice data.

Preferably, the indexes of the massive vector slice data are classified according to the parent directory in which the data are located, so that the massive vector slice data indexes of the same directory are managed by metadata nodes of the same level. In view of this feature, embodiments of the present invention create an index location mapping table that records the mapping from directories to metadata nodes. The index location mapping table is managed by a metadata node. When a client queries a massive vector slice data index, it needs to know the position of the metadata node that maintains that index. The client transmits the path of the massive vector slice data to a metadata node, and the metadata node searches the index location mapping table according to the parent directory of the massive vector slice data path to find the position of the responsible metadata node. The invention designs an index location maintenance module on the metadata node, dedicated to allocating data nodes to directories and maintaining the index location mapping table.

Preferably, the index location maintenance module selects nodes from all the data nodes maintained by the metadata node and allocates them to the directory. The index location mapping table is persisted to the local disk, and whenever its data change, the copy on the local disk is updated. If the index location maintenance module cannot find enough metadata nodes when allocating metadata nodes to a directory, the module inserts the unallocated directory into a directory allocation waiting queue. The content of the queue is also persisted on disk, and the copy on disk must be updated whenever a directory is added to or removed from the queue. When the metadata node starts, the queue data on disk are read into memory. The purpose of the queue is to let the index location maintenance module reallocate the directories waiting in it when a new data node registers with the distributed file system.
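The lookup by parent directory can be sketched as follows. The mapping table is modelled as a plain dictionary. The directory names, node names and the function `locate_index_node` are hypothetical, and POSIX-style paths are assumed.

```python
import posixpath

# Hypothetical index location mapping table: parent directory -> metadata node.
index_location = {
    "/tiles/12": "node-18",
    "/tiles/13": "node-21",
}


def locate_index_node(slice_path, mapping):
    """Return the node that maintains the index for this slice, or None."""
    parent = posixpath.dirname(slice_path)  # index is grouped by parent directory
    return mapping.get(parent)


node = locate_index_node("/tiles/12/slice_0506", index_location)
unknown = locate_index_node("/tiles/99/slice_0001", index_location)
```

Because every slice in a directory shares one parent, a single table entry per directory is enough to route all index queries for that directory.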
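The allocation, waiting queue and persistence behaviour can be sketched as below. This is an illustration under assumptions rather than the module of the invention: the rule that each node serves at most `per_node_limit` directories, the JSON state file and the names `IndexLocationMaintainer`, `assign` and `register_node` are all hypothetical.

```python
import json
import os
import tempfile
from collections import deque


class IndexLocationMaintainer:
    """Assigns directories to nodes; unassigned directories wait in a queue."""

    def __init__(self, state_file, per_node_limit=1):
        self.state_file = state_file
        self.per_node_limit = per_node_limit  # assumed capacity rule
        self.nodes, self.mapping, self.waiting = [], {}, deque()
        if os.path.exists(state_file):        # reload state at startup
            with open(state_file) as f:
                state = json.load(f)
            self.nodes = state["nodes"]
            self.mapping = state["mapping"]
            self.waiting = deque(state["waiting"])

    def _persist(self):
        """Write the mapping table and the queue to disk after every change."""
        state = {"nodes": self.nodes, "mapping": self.mapping,
                 "waiting": list(self.waiting)}
        with open(self.state_file, "w") as f:
            json.dump(state, f)

    def _free_node(self):
        """Return the first node that still has room, or None."""
        for node in self.nodes:
            load = sum(1 for n in self.mapping.values() if n == node)
            if load < self.per_node_limit:
                return node
        return None

    def assign(self, directory):
        """Assign a node to the directory, or queue the directory if none is free."""
        node = self._free_node()
        if node is None:
            self.waiting.append(directory)
        else:
            self.mapping[directory] = node
        self._persist()
        return node

    def register_node(self, node):
        """Add a node and reallocate as many queued directories as possible."""
        self.nodes.append(node)
        while self.waiting:
            free = self._free_node()
            if free is None:
                break
            self.mapping[self.waiting.popleft()] = free
        self._persist()


state_path = os.path.join(tempfile.mkdtemp(), "index_location.json")
maintainer = IndexLocationMaintainer(state_path)
maintainer.register_node("node-18")
first = maintainer.assign("/tiles/12")   # node-18 has room
second = maintainer.assign("/tiles/13")  # no room left, so the directory is queued
maintainer.register_node("node-21")      # the queued directory is reallocated
restored = IndexLocationMaintainer(state_path)  # simulates a restart
```

Persisting after every change keeps the sketch simple; an implementation with many directories might batch the writes instead.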

The embodiment of the invention maintains and manages the index of the vector slice data by means of a vector slice data index module designed on the data node, and provides an index service for the client. The module maintains the index records in memory, the index files, and the log files corresponding to the index files. The metadata nodes sort the index records with a B-tree to speed up index lookups. An update to an index record first modifies the in-memory data structure, which is temporarily out of sync with the index file; the updated content is recorded in the log file corresponding to the index file. After the data node starts, the index file is read into memory as required and the index data structure is updated according to the log; the index records in memory are then written back to the data node to replace the old index file, and the log is emptied. This prevents index data held in memory from being lost when the data node suddenly loses power.

Preferably, the massive vector slice data index table is cached at the client, which reduces the number of accesses to the metadata node and improves the efficiency of accessing the massive vector slice data. According to the embodiment of the invention, the massive vector slice data indexes commonly used by the user are cached at the client, so that the client accesses the metadata node less often and the massive vector slice data are accessed more efficiently.

Preferably, step 106: a massive vector slice data index service is provided through the massive vector slice data packet index table. The shortest path to the metadata node corresponding to the massive vector slice data packet is determined through the massive vector slice data index table, and the position of the vector slice data is determined from the file header of the data packet file in the determined metadata node.
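The interplay of in-memory index, index file and log can be sketched as follows. Several simplifications are assumed: a sorted list searched with `bisect` stands in for the B-tree, JSON files stand in for the index and log formats, and the class `SliceIndex` with its file names is hypothetical.

```python
import bisect
import json
import os
import tempfile


class SliceIndex:
    """Sorted in-memory index backed by an index file plus an update log."""

    def __init__(self, directory):
        self.index_file = os.path.join(directory, "index.json")
        self.log_file = os.path.join(directory, "index.log")
        self.keys = []     # slice names, kept sorted for binary search
        self.offsets = {}  # slice name -> offset in the packet file
        self._recover()

    def _apply(self, name, offset):
        if name not in self.offsets:
            bisect.insort(self.keys, name)
        self.offsets[name] = offset

    def put(self, name, offset):
        """Update memory and record the change in the log; the index file lags."""
        self._apply(name, offset)
        with open(self.log_file, "a") as log:
            log.write(json.dumps([name, offset]) + "\n")

    def get(self, name):
        """Binary search over the sorted keys; return the offset or None."""
        i = bisect.bisect_left(self.keys, name)
        if i < len(self.keys) and self.keys[i] == name:
            return self.offsets[name]
        return None

    def _recover(self):
        """Load the index file, replay the log, rewrite the file, empty the log."""
        if os.path.exists(self.index_file):
            with open(self.index_file) as f:
                for name, offset in json.load(f).items():
                    self._apply(name, offset)
        if os.path.exists(self.log_file):
            with open(self.log_file) as log:
                for line in log:
                    self._apply(*json.loads(line))
        with open(self.index_file, "w") as f:
            json.dump(self.offsets, f)
        open(self.log_file, "w").close()


workdir = tempfile.mkdtemp()
idx = SliceIndex(workdir)
idx.put("slice_0506", 4)
idx.put("slice_0507", 35)
restarted = SliceIndex(workdir)  # simulates a restart: the log is replayed
```

After the restart the replayed entries are available again and the log is empty, matching the recovery sequence in the description.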
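Client-side caching of index entries can be sketched with a small cache in front of the metadata node. The least-recently-used eviction policy, the capacity and the names `CachedIndexClient` and `remote_index` are assumptions; the text states only that commonly used indexes are cached at the client.

```python
from collections import OrderedDict


class CachedIndexClient:
    """Keeps recently used index entries so repeat lookups skip the metadata node."""

    def __init__(self, fetch_from_node, capacity=2):
        self._fetch = fetch_from_node  # callable standing in for a remote query
        self._cache = OrderedDict()
        self._capacity = capacity
        self.remote_calls = 0          # how often the metadata node was contacted

    def lookup(self, name):
        if name in self._cache:
            self._cache.move_to_end(name)    # mark as most recently used
            return self._cache[name]
        self.remote_calls += 1
        entry = self._fetch(name)
        self._cache[name] = entry
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return entry


# A dictionary stands in for the index held on the metadata node.
remote_index = {"slice_0506": 4, "slice_0507": 35}
client = CachedIndexClient(remote_index.get)
results = [client.lookup("slice_0506") for _ in range(3)]
```

Three lookups of the same slice contact the metadata node only once; the other two are answered from the client cache.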

Fig. 2 is a system structure diagram of a cloud storage method for massive vector slice data according to an embodiment of the present invention. The system 200 includes:

a first generating unit 201, configured to establish a directory tree file of a distributed file system;

a second generating unit 202, configured to establish all metadata nodes corresponding to the directory tree of the distributed file system;

the aggregation unit 203 is configured to aggregate the massive vector slice data based on the same-level directory of the distributed file system to generate a massive vector slice data packet;

a storage unit 204, configured to store the massive vector slice data packets in the metadata node;

a third generating unit 205, configured to generate a massive vector slice data index table, establish a mesh structure of a massive vector slice data packet through the index table, and record a path of massive vector slice data in the massive vector slice data packet;

and the indexing unit 206 is configured to provide a massive vector slice data indexing service through massive vector slice data indexing.

The massive vector slice data cloud storage system 200 according to the embodiment of the present invention corresponds to the massive vector slice data cloud storage method 100 according to another embodiment of the present invention, and details are not repeated herein.

The invention has been described with reference to a few embodiments. However, other embodiments of the invention than the one disclosed above are equally possible within the scope of the invention, as would be apparent to a person skilled in the art from the appended patent claims.

Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the [ device, component, etc ]" are to be interpreted openly as referring to at least one instance of said device, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.

In addition, as will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims

1. A cloud storage method for massive vector slice data, the method comprising:

establishing all metadata nodes corresponding to the distributed file system directory tree;

aggregating massive vector slice data under the same level directory in the distributed file system to generate massive vector slice data packets; the massive vector slice data packets include a file header and at least one record;

the file header includes file type, version number, file keyword, file name, and the position corresponding to each of the records;

each of the records corresponds to a vector slice data, and each record includes the length, key length, key and value of the vector slice data;

storing the massive vector slice data packets in the metadata node, including storing the massive vector slice data packets by using a data file serialization method;

establishing an index for the massive vector slice data, and the massive vector slice data is associated through the index to form a data index table of the massive vector slice data of the mesh structure;

providing the massive vector slice data indexing service through the massive vector slice data packet index table;

wherein the method of reading the massive vector slice data comprises:

determining the shortest path to the metadata node corresponding to the massive vector slice data packet by using the massive vector slice data index table;

determining the position of the vector slice data from the file header of the data packet file in the determined metadata node.

2. The method of claim 1, comprising:

the massive vector slice data index includes the massive vector slice data path, name and offset in the massive vector slice data packet;

the massive vector slice data path includes a metadata node location, a massive vector slice data row location, and a massive vector slice data column location.

3. The method of claim 1, comprising:

each layer is preset with a metadata node, and the index table is stored in the preset metadata node of each layer;

the massive vector slice data index table stored in the metadata is transmitted to the client, and a persistent mapping table of the massive vector slice data index table is established.

4. The method according to claim 1, further comprising: performing additional storage at the tail of the massive vector slice data packet.

5. The method of claim 1, comprising: caching the massive vector slice data index table to a client.

6. A cloud storage system for massive vector slice data, the system comprising:

a first generating unit, used for establishing a distributed file system directory tree file;

a second generating unit, used for establishing all metadata nodes corresponding to the distributed file system directory tree;

an aggregation unit for aggregating massive vector slice data based on the same level directory of the distributed file system to generate massive vector slice data packets; the massive vector slice data packets include a file header and at least one record;

a storage unit, configured to store the massive vector slice data packets in the metadata node, including storing the massive vector slice data packets by using a data file serialization method;

a third generating unit, configured to generate the massive vector slice data index table, establish a mesh structure of the massive vector slice data packets through the index table, and record the path of the massive vector slice data in the massive vector slice data packets;

An indexing unit, configured to provide the massive vector slice data index service through the massive vector slice data index; the method for reading the massive vector slice data: