Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs.
In addition, the terms "first" and "second" etc. are used to distinguish different objects and are not used to describe a particular order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
For ease of understanding, some technical terms in the present application will be first described.
Ceph is an open-source distributed storage system that provides a software-defined, unified storage solution and offers large-scale scalability, high performance and no single point of failure. When an application accesses a Ceph cluster and performs a write operation, the data is stored in the form of objects in Ceph object storage devices (Object Storage Device, OSD for short). A Ceph Monitor is responsible for the health status of the entire cluster; a Monitor node may be deployed alone on a physical host, or a Monitor and a storage node may be deployed together on the same physical host. In a Ceph cluster, a plurality of Monitors are jointly responsible for managing, maintaining and publishing the state information of the cluster.
In Ceph storage, data is stored with the object as the basic unit; each object is 4 MB in size by default, several objects belong to one placement group (Placement Group, PG for short), several PGs are mapped to one OSD, and generally one OSD corresponds to one disk. Ceph adopts a hierarchical cluster structure (Cluster Map) that a user can customize, and the OSDs are the leaf nodes of this hierarchy.
On a Ceph cluster, a number of resource pools (Pool) may be created; when a Pool is created, the number of PGs for that Pool needs to be specified, a PG being a logical concept.
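As a hedged illustration of these concepts, the sketch below creates a replicated Pool with an explicit PG count and an RBD volume in it, using the standard ceph and rbd command-line tools; the pool name, image name and sizes are illustrative only and are not taken from the embodiments.

```python
# Minimal sketch, assuming a running Ceph cluster and an admin keyring;
# pool and image names below are illustrative, not from the embodiments.
import subprocess

def sh(*cmd):
    """Run a Ceph CLI command and raise on failure."""
    subprocess.run(cmd, check=True)

# Create a replicated pool with 128 PGs and 3 copies.
sh("ceph", "osd", "pool", "create", "demo_pool", "128", "128", "replicated")
sh("ceph", "osd", "pool", "set", "demo_pool", "size", "3")

# Create an RBD volume in the pool; RBD splits the volume into 4 MB objects by default.
sh("rbd", "create", "demo_pool/demo_vol", "--size", "10240")  # size in MB
```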
Fig. 1 is a schematic diagram of the cluster structure of a conventional dual-active storage system. The dual-active storage system shown in Fig. 1 provides storage services across two data centers in the same city based on a cross-site storage service cluster, and is deployed as follows:
1. Taking a storage cluster of 6 nodes as an example, node1, node2 and node3 are deployed at storage site A across two racks, node4, node5 and node6 are deployed at storage site B across two racks, and an arbitration server is placed at site C.
2. The storage gateways are deployed across sites to form a cluster, and any gateway can provide the storage cluster service;
3. The storage service cluster is deployed across sites. Taking a 4-copy deployment as an example, the storage fault domain is set to the rack and the number of copies is set to 4, so that 2 of the 4 copies of the data reside at site A and the other 2 copies reside at site B (a hedged configuration sketch of this layout follows the list).
4. A three-node cluster of storage control service Monitor (mon) nodes is deployed: mon1 on one server at site A, mon2 on one server at site B, and mon3 on the arbitration server at site C, so as to prevent a storage split-brain phenomenon from disabling the data storage function.
5. When a disaster occurs at site A or site B, the cross-site cluster disaster recovery of the storage system can achieve service failover with RPO = 0 and RTO = 0, so that no stored data is lost and the storage system continues to operate without the service being aware of the failure.
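The following is a minimal sketch of how the conventional 4-copy, rack-failure-domain layout of Fig. 1 could be expressed with the standard ceph CLI; the bucket, rule and pool names are illustrative, and the sketch is not a definitive reproduction of the deployment.

```python
# Hedged sketch of the conventional 4-copy, rack-failure-domain layout, assuming
# the standard ceph CLI; bucket and rule names are illustrative, not from Fig. 1.
import subprocess

def sh(*cmd):
    subprocess.run(cmd, check=True)

# Racks as CRUSH buckets (two per site in the example deployment).
for rack in ("rackA1", "rackA2", "rackB1", "rackB2"):
    sh("ceph", "osd", "crush", "add-bucket", rack, "rack")
    sh("ceph", "osd", "crush", "move", rack, "root=default")

# Replicated rule that spreads copies across racks, then a 4-copy pool using it.
sh("ceph", "osd", "crush", "rule", "create-replicated", "rack_rule", "default", "rack")
sh("ceph", "osd", "pool", "create", "legacy_pool", "256", "256", "replicated", "rack_rule")
sh("ceph", "osd", "pool", "set", "legacy_pool", "size", "4")
```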
Fig. 2 is a flow chart of a data write operation of the dual-active storage system in Fig. 1, which proceeds as follows:
1. The host side (client side) issues data to be written and metadata information to the storage side through the gateway, which receives and processes the data.
2. The gateway calculates, according to a certain index rule, the disks on which the data should land. As shown in Fig. 2, the target disks are assumed to be node3-HDD1, node2-HDD2, node4-HDD1 and node5-HDD2, and the relationship among the 4 copies to be written is that node3-HDD1 holds the primary copy while node2-HDD2, node4-HDD1 and node5-HDD2 hold the standby copies.
3. The gateway writes the data to the storage service process where the primary copy node3-HDD1 is located through the network.
4. The storage service process of node3-HDD1 writes the data to the storage service processes of node2-HDD2, node4-HDD1 and node5-HDD2 according to the data redundancy policy;
5. Each storage service process writes the data to its disk according to certain logical processing;
6. Only after node3-HDD1 confirms that all 4 copies have been written successfully is the write considered successful, and a write-success message is returned to the host side (a simplified sketch of this acknowledgement flow follows the list).
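A minimal, hedged sketch of the acknowledgement logic described above (the primary copy forwards the write to the standbys and only acknowledges the host after all copies complete); the class and method names are illustrative only, not actual Ceph internals.

```python
# Minimal sketch of the 4-copy synchronous write flow of Fig. 2; the classes and
# method names are illustrative only, not actual Ceph internals.
class StorageProcess:
    def __init__(self, name):
        self.name = name
        self.disk = {}

    def write_to_disk(self, key, data):
        self.disk[key] = data          # step 5: persist locally
        return True

class PrimaryCopy(StorageProcess):
    def __init__(self, name, standbys):
        super().__init__(name)
        self.standbys = standbys       # e.g. node2-HDD2, node4-HDD1, node5-HDD2

    def handle_write(self, key, data):
        acks = [self.write_to_disk(key, data)]                        # primary write
        acks += [s.write_to_disk(key, data) for s in self.standbys]   # step 4
        return all(acks)               # step 6: ack host only after all 4 succeed

standbys = [StorageProcess(n) for n in ("node2-HDD2", "node4-HDD1", "node5-HDD2")]
primary = PrimaryCopy("node3-HDD1", standbys)
assert primary.handle_write("obj-0001", b"payload")
```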
From the above data writing operation flow, it is known that the existing dual-active storage system has the following disadvantages:
First, host performance is reduced: because the number of copies is 4, 2 copies of the same data must be transmitted to site B and written to different disks respectively, which degrades host performance;
Second, the link construction cost is increased: transmitting the redundant data volume doubles the required link bandwidth, wasting link resources and increasing construction cost;
Third, reliability is reduced: a site failure makes two copies unavailable at once, increasing the risk of service downtime;
Fourth, only a cross-cluster copy scheme is supported, Erasure Coding (EC) redundancy is not supported, and space occupation is increased.
Of course, the above scheme can be simplified to 2 copies, but this reduces data reliability. For example, if the number of copies is set to 2, with one copy at site A and one at site B, then when site A fails and site B takes over the service, only one copy remains to provide the storage service and a single point of failure exists; if the hard disk or the server holding that copy also fails, the service has no data storage capability available and goes down.
In view of the foregoing, an embodiment of the present application provides a dual-active storage system and a data processing method thereof, which are described below with reference to the accompanying drawings.
Referring to fig. 3, a cluster structure schematic diagram of a dual-active storage system according to some embodiments of the present application is shown, where the dual-active storage system is based on a distributed storage mode, for example, a Ceph mode or other distributed storage modes, which is not limited in this disclosure.
As shown in Fig. 3, the dual-active storage system includes a first storage site 100 and a second storage site 200, the second storage site 200 being a standby storage site of the first storage site 100;
The first storage site 100 has a first resource pool 110 created therein, the first resource pool 110 being configured with a first redundancy policy; the second storage site 200 has a second resource pool 210 created therein, the second resource pool 210 being configured with a second redundancy policy;
For example, as shown in fig. 3, the first resource pool 110 includes storage nodes node1, node2, and node3, and the second resource pool 210 includes storage nodes node4, node5, and node6, where each storage node includes two disks, and each disk corresponds to an object storage device OSD.
Specifically, the first redundancy policy may be copy redundancy or erasure code redundancy, and the second redundancy policy may likewise be copy redundancy or erasure code redundancy. For example, three combinations are possible: both site 100 and site 200 configured with copy redundancy; one of site 100 and site 200 configured with copy redundancy and the other configured with erasure code redundancy; and both site 100 and site 200 configured with erasure code redundancy.
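A hedged sketch of one possible combination (copy redundancy at one site, erasure code redundancy at the other) using the standard ceph CLI follows; the pool, profile and rule names and the k/m values are illustrative assumptions, not mandated by the embodiments.

```python
# Hedged sketch of one policy combination (copy redundancy at one site, erasure
# code redundancy at the other), using the standard ceph CLI; the pool names,
# profile name and k/m values are illustrative only.
import subprocess

def sh(*cmd):
    subprocess.run(cmd, check=True)

# Site 100: replicated pool with 3 copies.
sh("ceph", "osd", "pool", "create", "siteA_pool", "128", "128", "replicated")
sh("ceph", "osd", "pool", "set", "siteA_pool", "size", "3")

# Site 200: erasure-coded pool with a k=4, m=2 profile.
sh("ceph", "osd", "erasure-code-profile", "set", "ec_4_2",
   "k=4", "m=2", "crush-failure-domain=host")
sh("ceph", "osd", "pool", "create", "siteB_pool", "128", "128", "erasure", "ec_4_2")
```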
The first resource pool 110 has a first logical volume created therein and the second resource pool 210 has a second logical volume created therein; the first logical volume and the second logical volume are configured as dual-active volumes for recording PG logs (PGlog) and providing data caching.
Specifically, the back-end storage of the dual-active volumes may be a distributed cache, for example CACHE TIER, or a distributed database, for example mondb.
For example, a logical dual-active pool PoolAB1 is created and its redundancy policy is configured as 2-copy, so as to implement a 2-copy mechanism and a strong data consistency mechanism using the existing Crush algorithm. The configured Crush rule may specify that the local copy is primary according to the layout of the data-issuing side. The Crush algorithm is the tool used to calculate on which OSDs an object is placed.
Specifically, a local resource pool PoolA1 is created at site 100 with its redundancy policy configured as copy redundancy and the number of copies set to 3, and a local resource pool PoolB1 is created at site 200 with its redundancy policy configured as copy redundancy and the number of copies set to 2;
A volume rbdA1 is created in 100/PoolA1 and a volume rbdB1 is created in 200/PoolB1, and a dual-active relationship is created: 100/PoolA1/rbdA1 and 200/PoolB1/rbdB1 are selected to form the dual-active relationship in the dual-active pool PoolAB1, a dual-active relationship object 100/PoolA1/rbdA1-200/PoolB1/rbdB1 is generated as a PG object in the dual-active pool, and this dual-active relationship object is used as the key for recording the PGlog of subsequent write operations (a hedged configuration sketch follows).
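The following is a minimal sketch of how the pools and volumes of this example might be created with the standard ceph/rbd CLI; the final step that binds the two volumes into the dual-active relationship is specific to the embodiments and is therefore shown only as a hypothetical placeholder function, not a stock Ceph command.

```python
# Hedged sketch of the example configuration (PoolAB1 / PoolA1 / PoolB1, rbdA1 / rbdB1);
# the ceph/rbd commands are standard, but create_dual_active_relation() below is a
# hypothetical placeholder for the dual-active relationship of the embodiments.
import subprocess

def sh(*cmd):
    subprocess.run(cmd, check=True)

# Logical dual-active pool with 2 copies (records PGlog, caches data, stores no real data).
sh("ceph", "osd", "pool", "create", "PoolAB1", "64", "64", "replicated")
sh("ceph", "osd", "pool", "set", "PoolAB1", "size", "2")

# Local resource pools: 3 copies at site 100, 2 copies at site 200.
sh("ceph", "osd", "pool", "create", "PoolA1", "128", "128", "replicated")
sh("ceph", "osd", "pool", "set", "PoolA1", "size", "3")
sh("ceph", "osd", "pool", "create", "PoolB1", "128", "128", "replicated")
sh("ceph", "osd", "pool", "set", "PoolB1", "size", "2")

# Member volumes of the dual-active relationship.
sh("rbd", "create", "PoolA1/rbdA1", "--size", "102400")
sh("rbd", "create", "PoolB1/rbdB1", "--size", "102400")

def create_dual_active_relation(pool_ab, vol_a, vol_b):
    """Hypothetical step: register the relationship object in pool_ab as a PG object
    and use it as the PGlog key for later writes (not a stock Ceph command)."""
    return f"{vol_a}-{vol_b}"   # e.g. "100/PoolA1/rbdA1-200/PoolB1/rbdB1"

relation_key = create_dual_active_relation("PoolAB1", "100/PoolA1/rbdA1", "200/PoolB1/rbdB1")
```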
For a write operation, the specific responsibility of the logical dual-active pool is to provide the PG mechanism for the dual-active relationship, so as to realize the copy redundancy mechanism and ensure strong data consistency; the dual-active pool does not store the data directly, and the data to be written is written into a designated resource space through an index relationship. Three types of data need to be stored:
The first type is the write to the local site resource pool: for an independent storage resource pool, the redundancy policy and redundancy level can be configured freely; the write operation is issued from the objecter, the object name is calculated from the volume name, the LBA (Logical Block Address) and the length (data length), the object is mapped to a primary OSD through the Crush algorithm, and redundancy-protected writes to the disks are then performed through the PG in which the object resides. The objecter provides a unified interface for the read and write requests of clients.
The second type is the write forwarded by the dual-active pool: the PG in the dual-active pool is responsible for sending the write operation to the target site for processing, and the PG at the target site indexes the data to the local resource 200/PoolB1/rbdB1 or 100/PoolA1/rbdA1 for the write operation; the write operation is issued from the objecter, the object name is calculated from the volume name, LBA and length, the object is mapped to the primary OSD through the Crush algorithm, and redundancy-protected writes to the disks are then performed through the PG in which the object resides. A more efficient distributed caching technology may be used for the back-end device to improve IO performance.
The third type is the dual-active pool PGlog, whose record keys are the volume name, LBA, length and write sequence number. Because the dual-active pool is a logical pool and does not truly store data, the location where the PGlog is written can be chosen as required, for example a distributed database, reuse of the storage pool holding the local resources of the dual-active member volumes, reuse of other storage pools, an independently built copy storage pool, a cache, a distributed cache, and the like. A hedged sketch of the PGlog record and of the object-name calculation mentioned above follows.
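The sketch below illustrates the PGlog record keyed by (volume name, LBA, length, write sequence number) and a simple volume-offset-to-object-name mapping; the field and function names are illustrative, and the 4 MB object size is Ceph's default rather than a requirement of the embodiments.

```python
# Hedged sketch of the bookkeeping described above: a PGlog record keyed by
# (volume, LBA, length, sequence number) and an illustrative object-name mapping.
from dataclasses import dataclass

OBJECT_SIZE = 4 * 1024 * 1024  # default Ceph/RBD object size

@dataclass(frozen=True)
class PGLogEntry:
    relation_key: str   # e.g. "100/PoolA1/rbdA1-200/PoolB1/rbdB1"
    volume: str
    lba: int            # logical block address of the write
    length: int         # data length in bytes
    seq: int            # write sequence number

def object_name(volume: str, lba: int) -> str:
    """Illustrative mapping of a (volume, LBA) write to an object name; the real
    objecter derives the name from the volume, LBA and length before Crush placement."""
    return f"{volume}.{lba // OBJECT_SIZE:016x}"

entry = PGLogEntry("100/PoolA1/rbdA1-200/PoolB1/rbdB1", "rbdA1",
                   lba=8 * 1024 * 1024, length=4096, seq=42)
print(object_name(entry.volume, entry.lba))   # rbdA1.0000000000000002
```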
In the dual-active storage system provided by the application, when a certain site fails, data is written on one side only; at this time the dual-active pool records the PGlog, and when the failed site recovers, data recovery is carried out according to the PGlog: the data is read from the end that holds the PGlog and synchronized to the other end.
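A hedged sketch of this PGlog-based resynchronization follows: entries recorded while one side was down are replayed to the recovered side in write-sequence order. The entry fields mirror the PGlog keys described above; the read/write helpers are illustrative placeholders, not Ceph APIs.

```python
# Hedged sketch of PGlog-based resynchronization after a failed site recovers;
# the callables are illustrative placeholders, not Ceph APIs.
from typing import Callable, Iterable, NamedTuple

class LogEntry(NamedTuple):
    volume: str
    lba: int
    length: int
    seq: int            # write sequence number

def resync(pglog: Iterable[LogEntry],
           read_surviving: Callable[[str, int, int], bytes],
           write_recovered: Callable[[str, int, bytes], None]) -> int:
    """Replay PGlog entries from the end that holds the PGlog to the recovered end."""
    replayed = 0
    for entry in sorted(pglog, key=lambda e: e.seq):        # preserve write order
        data = read_surviving(entry.volume, entry.lba, entry.length)
        write_recovered(entry.volume, entry.lba, data)
        replayed += 1
    return replayed
```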
In one possible implementation, the dual-active storage system further comprises an arbitration site 300. Specifically, a first monitor mon1 is deployed at the first storage site 100, a second monitor mon2 is deployed at the second storage site, and an arbitration monitor mon3 is deployed at the arbitration site to provide an arbitration service for the first monitor and the second monitor and prevent the split-brain phenomenon.
As shown in fig. 3, a first monitor mon1 is deployed on node2, a second monitor mon2 is deployed on node5, and an arbitration monitor mon3 is deployed on an arbitration server.
In one possible implementation, the dual-active storage system further includes a first storage gateway 120, configured to receive first data sent by the client, process the first data according to the Ceph index rule to obtain a processing result, and send the processing result and the first data to the first logical volume, where the processing result includes the PGlog and the object storage devices OSD on which the data is to be stored.
In one possible implementation, the dual-active storage system provided by the application further comprises a second storage gateway 220, configured to receive second data sent by the client, process the second data according to the Ceph index rule to obtain a processing result, and send the processing result and the second data to the second logical volume.
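A rough, hedged sketch of the kind of index-rule computation a gateway performs in Ceph terms follows (object name hashed to a PG id, PG id mapped to an ordered OSD set); the hash and the crush_map stand-in below are simplifications, not Ceph's actual rjenkins hash or Crush implementation.

```python
# Rough sketch of a Ceph-style index rule: object name -> PG id -> acting OSD set.
# The hash and the crush_map stand-in are simplifications, not Ceph internals.
import hashlib

PG_NUM = 128  # PG count of the target pool

def pg_of(object_name: str, pool_id: int) -> str:
    h = int.from_bytes(hashlib.sha1(object_name.encode()).digest()[:4], "little")
    return f"{pool_id}.{h % PG_NUM:x}"

def osds_of(pg_id: str, crush_map: dict) -> list:
    """Look up the acting OSD set for a PG; the first entry is the primary."""
    return crush_map.get(pg_id, [])

crush_map = {"3.2a": [7, 12, 19]}              # illustrative placement only
pg = pg_of("rbdA1.0000000000000002", pool_id=3)
# processing result handed to the logical volume: PGlog key data plus target OSDs
result = {"pg": pg, "osds": osds_of(pg, crush_map)}
```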
In one possible implementation manner, in the dual-active storage system provided by the present application, the first storage site 100 is one independent protection domain, and the second storage site 200 is another independent protection domain.
In the present application, the Ceph cluster is divided into two protection domains: site 100 is taken as one independent protection domain and site 200 as the other, so that the data redundancy of a local site is not distributed across sites. A protection domain is a logical concept introduced to improve cluster reliability; one piece of data (including its copies or fragments) exists only within one protection domain, and heartbeat detection is also performed within the protection domain.
Compared with the prior art, the dual-active storage system provided by the application has the following beneficial effects:
1. Two-copy dual-active volumes are deployed across clusters; these storage volumes do not store real data, only record metadata, and can provide a caching function.
2. The dual-active volumes use the strong consistency of Ceph's two copies to guarantee the cross-site data flow, realizing real-time cross-site data synchronization and automatic synchronization after fault recovery; when the cluster is healthy, the data of the primary and standby sites is consistent in real time, with RPO = 0.
3. Each single site stores the complete number of copies of the data, and this full redundancy ensures maximum data reliability.
4. In the dual-active storage system, the primary and standby sites are each configured with an independent data redundancy policy, which can be flexibly selected.
The above embodiments provide a dual-active storage system; correspondingly, two data processing methods based on the dual-active storage system are provided: one for the case where a write operation of the client is issued to the first storage gateway, and the other for the case where a write operation of the client is issued to the second storage gateway.
Specifically, after the write operation of the client is issued to the first storage gateway, as shown in fig. 4, the data processing method includes the following steps:
S101, the first logical volume receives first data sent by the first storage gateway and a processing result obtained by processing the first data according to the Ceph index rule;
S102, the first logical volume stores first data to the first storage site according to a processing result of the first data, and records a PG log in the processing result, and the first storage site performs redundancy protection on the first data according to a first redundancy strategy;
S103, the first logical volume sends the first data and the processing result thereof to the second logical volume, so that the second logical volume stores the first data to the second storage site according to the processing result of the first data, records the PG log in the processing result, and the second storage site performs redundancy protection on the first data according to a second redundancy strategy.
As shown in Fig. 3, the data processing flow is as follows (a hedged sketch of this flow is given after the list):
① The service client issues a write operation to the first storage gateway, and the first storage gateway processes the first data to obtain a processing result;
② The first storage gateway transmits the first data and its processing result to the primary copy (the first logical volume);
③ The primary copy locally caches the first data and records the PGlog according to the PG mechanism;
④ The primary copy sends the first data and the processing result to the standby copy (the second logical volume) for processing;
⑤ The standby copy locally caches the first data and records the PGlog according to the PG mechanism, realizing the dual-active data management logic;
⑥ The data is written into the second storage site and redundancy protection is performed; after the primary copy receives the write-completion message, a write-completion message is returned to the client.
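The sketch below illustrates the write path when the client writes through the first storage gateway (steps ① to ⑥ and S101 to S103); LogicalVolume and StorageSite are illustrative stand-ins for the dual-active volumes and the sites, not actual components of a Ceph cluster.

```python
# Hedged sketch of the write path through the first storage gateway (steps ①-⑥);
# the classes are illustrative stand-ins, not actual Ceph components.
class StorageSite:
    def __init__(self, name):
        self.name = name

    def write_with_redundancy(self, data, result):
        # apply this site's own redundancy policy (copies or erasure coding)
        return True

class LogicalVolume:
    def __init__(self, site, peer=None):
        self.site, self.peer = site, peer
        self.cache, self.pglog = {}, []

    def handle_primary_write(self, data, result):
        self.cache[result["pg"]] = data                              # ③ cache locally
        self.pglog.append(result["pglog"])                           # ③ record PGlog
        ok_local = self.site.write_with_redundancy(data, result)     # S102: write at site 100
        ok_peer = self.peer.handle_standby_write(data, result)       # ④ forward to standby
        return ok_local and ok_peer                                  # ⑥ ack client after both

    def handle_standby_write(self, data, result):
        self.cache[result["pg"]] = data                              # ⑤ cache locally
        self.pglog.append(result["pglog"])                           # ⑤ record PGlog
        return self.site.write_with_redundancy(data, result)         # ⑥ write at site 200

site_a, site_b = StorageSite("site100"), StorageSite("site200")
standby = LogicalVolume(site_b)
primary = LogicalVolume(site_a, peer=standby)
assert primary.handle_primary_write(b"payload", {"pg": "3.2a", "pglog": "seq=1"})
```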
Specifically, after the write operation of the client is issued to the second storage gateway, as shown in fig. 5, the data processing method includes the following steps:
S201, the second logical volume receives second data sent by a second storage gateway and a processing result after the second data is processed according to Ceph index rules;
S202, the second logical volume locally caches second data, and records PG logs in the processing results;
S203, the second logical volume sends second data and a processing result thereof to the first logical volume, so that the first logical volume stores the second data to the first storage site according to the processing result of the second data, and records a PG log in the processing result, and the first storage site performs redundancy protection on the second data according to a first redundancy strategy;
S204, after the second logical volume receives the control message sent by the first logical volume, the locally cached second data is stored to the second storage site, and the second storage site performs redundancy protection on the second data according to the second redundancy policy.
As shown in Fig. 6, the data processing flow is as follows (a hedged sketch of this flow is given after the list):
① The service client issues a write operation to the second storage gateway, and the second storage gateway processes the second data to obtain a processing result;
② The second storage gateway transmits the second data and its processing result to the standby copy;
③ The standby copy locally caches the second data and records the PGlog according to the PG mechanism;
④ The standby copy sends the second data and the processing result to the primary copy for processing;
⑤ The primary copy locally caches the second data and records the PGlog according to the PG mechanism, realizing the dual-active data management logic;
⑥ The data is written into the first storage site with redundancy protection, and a control message is sent to the standby copy instructing it to write the data;
⑦ The data is written into the second storage site with redundancy protection; after receiving all the write-completion messages, the primary copy returns a write-completion message to the standby copy, and the standby copy returns a write-completion message to the client.
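A companion sketch for the path through the second storage gateway (steps ① to ⑦ and S201 to S204) follows; the dictionaries stand in for the dual-active volumes and the two sites, and the control message of step ⑥ is modeled as a plain function call, all of which are illustrative assumptions.

```python
# Hedged, self-contained sketch of the write path through the second storage gateway
# (steps ①-⑦ / S201-S204); dictionaries stand in for volumes and sites.
def write_with_redundancy(site, data):
    site["objects"].append(data)        # apply that site's own redundancy policy
    return True

def standby_side_write(standby, primary, site_a, site_b, data, pglog_entry):
    standby["cache"].append(data)       # ③ / S202: standby caches and records PGlog
    standby["pglog"].append(pglog_entry)
    # ④ forward the data and processing result to the primary copy
    primary["cache"].append(data)       # ⑤: primary caches and records PGlog
    primary["pglog"].append(pglog_entry)
    ok_a = write_with_redundancy(site_a, data)   # ⑥ / S203: write at the first storage site
    # ⑥: control message back to the standby, which flushes its cached data
    ok_b = write_with_redundancy(site_b, data)   # S204: write at the second storage site
    # ⑦: primary acks the standby, standby acks the client
    return ok_a and ok_b

site_a, site_b = {"objects": []}, {"objects": []}
primary = {"cache": [], "pglog": []}
standby = {"cache": [], "pglog": []}
assert standby_side_write(standby, primary, site_a, site_b, b"payload", "seq=2")
```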
The data processing method of the dual-active storage system realizes cross-site virtual machine copy protection and single-site independent data redundancy protection, that is, a cross-site dual-active technology under distributed storage: based on the existing PG mechanism of Ceph, the cross-site dual-active function is realized through a data dual-active layer PG and a data storage layer PG. According to the dual-active requirement, the Crush algorithm of the dual-active pool is set so that the primary copy resides at the expected site, which improves data read/write performance; the data is copied across sites only once, and the dual-active metadata is stored extremely simply, without occupying extra network resources or storage space. The copy protection mechanism originally located between OSDs is abstracted to the client layer, which retains the PGlog mechanism but does not store real data, thereby realizing a strongly consistent cross-site double-write mechanism. The PG mechanism is modified so that, during copy-to-copy synchronization, either the real data is synchronized or a control message to commit the data to disk is sent according to the data identification, realizing efficient and strong data consistency between cross-site copies.
Finally, it is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The embodiments are only used to illustrate the technical scheme of the present application, but not to limit the technical scheme, and although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that the technical scheme described in the foregoing embodiments may be modified or some or all technical features may be equivalently replaced, and the modification or replacement does not make the essence of the corresponding technical scheme deviate from the scope of the technical scheme of the embodiments of the present application, and is included in the scope of the claims and the specification of the present application.