CN119493519A - Data management method and device
- Publication number
- CN119493519A (application CN202311047417.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- written
- address
- migrated
- fingerprint information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/064—Management of blocks
- G06F3/0641—De-duplication techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0891—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using clearing, invalidating or resetting means
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a data management method and a device thereof, belonging to the technical field of file systems. The method comprises: obtaining data to be written to a file in an external memory; when the data to be written is not duplicate data, allocating a virtual address and a physical address to the data to be written, recording the data to be written at the physical address, and recording the correspondence between the virtual address and the physical address and the correspondence between the virtual address and the logical address of the data to be written, the logical address indicating the position of the data to be written in the file; and when the data to be written is duplicate data, obtaining the virtual address of the duplicate of the data to be written and recording the correspondence between that virtual address and the logical address of the data to be written. The application thereby avoids direct mapping between logical addresses and physical addresses.
Description
Technical Field
The present application relates to the field of file system technologies, and in particular, to a data management method and apparatus thereof.
Background
Deduplication, also known as duplicate data deletion, reduces redundant data by keeping only one physical copy for logical data blocks with identical content and indexing accesses to those logical blocks onto the shared physical block. Deduplication not only reduces real-time redundant writes on the critical path and prolongs device lifetime, but also saves storage capacity and enables efficient reuse of storage space. At present, deduplication generally divides data into data blocks, compares the blocks by their fingerprint information, and directs logical files with identical content to a unique shared physical copy, thereby eliminating redundant data.
Currently, when a data block to be written is duplicate data, the file system layer directly stores the physical address of the existing copy in the parent node of the duplicate data and updates the reference count of the data block in metadata. When the data block to be written is not duplicate data, the file system layer allocates a new physical address and physical block for the data block and stores the block's fingerprint information for deduplication of subsequent blocks. The parent node stores the mapping between data and its physical address.
However, this deduplication approach results in a many-to-one mapping from logical addresses to physical addresses, which causes two problems. First, it is complex and difficult to organize the corresponding one-to-many mapping within a space-limited metadata region. Second, during garbage collection, if one physical block corresponds to multiple logical blocks, all parent nodes must be traversed and the address information in them modified, which inevitably slows down garbage collection.
Disclosure of Invention
The application provides a data management method and a device thereof. The application avoids direct mapping of logical addresses and physical addresses. The technical scheme provided by the application is as follows:
In a first aspect, the present application provides a data management method applied to a file system layer. The method comprises: obtaining data to be written to a file in an external memory; when the data to be written is not duplicate data, allocating a virtual address and a physical address to the data to be written, recording the data to be written at the physical address, and recording the correspondence between the virtual address and the physical address and the correspondence between the virtual address and the logical address of the data to be written, the logical address indicating the position of the data to be written in the file; and when the data to be written is duplicate data, obtaining the virtual address of the duplicate of the data to be written and recording the correspondence between that virtual address and the logical address of the data to be written.
In this data management method, the file system layer allocates virtual addresses for data and records the correspondence between virtual addresses and physical addresses as well as between virtual addresses and logical addresses, so the virtual address isolates the logical address from the physical address and direct mapping between them is avoided. In addition, because duplicate data shares the same virtual address, virtual addresses and physical addresses form a one-to-one mapping, which reduces the complexity of the file system's mapping relationships, speeds up garbage collection, and preserves the read-write performance of the file system.
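For illustration only, the following minimal sketch models the write path just described; the class and field names (DedupFS, va_to_pa, and so on) are assumptions made for this example, not structures defined by the application.

```python
import hashlib

class DedupFS:
    """Toy model of the three-layer LA -> VA -> PA mapping."""
    def __init__(self):
        self.fp_to_va = {}   # fingerprint -> VA (deduplication index)
        self.va_to_pa = {}   # virtual address table: VA -> PA
        self.la_to_va = {}   # kept in parent nodes: LA -> VA
        self.va_refs = {}    # total number of logical addresses per VA
        self.storage = {}    # PA -> data block (stands in for external memory)
        self.next_va = 0
        self.next_pa = 0

    def write(self, la, block: bytes):
        fp = hashlib.sha256(block).hexdigest()
        va = self.fp_to_va.get(fp)
        if va is None:                        # not duplicate data
            va, pa = self.next_va, self.next_pa
            self.next_va, self.next_pa = va + 1, pa + 1
            self.storage[pa] = block          # record the data at the PA
            self.va_to_pa[va] = pa            # VA <-> PA correspondence
            self.fp_to_va[fp] = va
            self.va_refs[va] = 1              # record the total: one LA so far
        else:                                 # duplicate data: reuse its VA
            self.va_refs[va] += 1             # update the total
        self.la_to_va[la] = va                # VA <-> LA correspondence

    def read(self, la) -> bytes:
        return self.storage[self.va_to_pa[self.la_to_va[la]]]
```

Writing the same block under two logical addresses leaves va_to_pa one-to-one: both logical addresses resolve to one virtual address, which resolves to one physical copy.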
Optionally, since the data written by the file system layer may or may not be duplicate data, the file system layer may record or update the total number of logical addresses corresponding to the virtual address each time data is written, in order to manage the consistency of written data. For example, the method further comprises: recording the total number of logical addresses corresponding to the virtual address when the data to be written is not duplicate data, and updating that total when the data to be written is duplicate data.
In one possible implementation, the method further comprises: acquiring a first physical address of data to be migrated in the external memory; acquiring the virtual address of the data to be migrated based on the first physical address; writing the data to be migrated to a second physical address; and updating the correspondence between the virtual address and the physical address of the data to be migrated based on the second physical address.
As can be seen, when data is migrated only the correspondence between the virtual address and the physical address needs to be modified; the correspondence between logical addresses and physical addresses need not be modified as in the related art. Moreover, because duplicate data shares the same virtual address, virtual addresses and physical addresses form a one-to-one mapping: whether the data to be migrated is unique or duplicated, modifying the virtual-to-physical correspondence changes only the virtual address entry, and the multiple logical addresses corresponding to duplicate data need not be modified as in the related art. This preserves migration speed and file system performance.
Optionally, before the data to be migrated is written to the second physical address, the total number of logical addresses corresponding to its virtual address may be obtained, and whether to write the data to the second physical address is decided from that total. If the total is 0, all logical references to the data became invalid during migration, so the data is now invalid and need not be migrated. If the total is not 0, the data is still valid and must be migrated. Therefore, the data to be migrated is written to the second physical address only when the total number of logical addresses corresponding to its virtual address is not 0; when that total is 0, the migration operation is stopped and the data to be migrated is discarded. This reduces the amount of data migrated, shortens migration time, and reduces the time consumed by garbage collection. For consistency management, after the data to be migrated is read into memory its reference count may be incremented by one; before the write to the second physical address is performed the reference count is decremented by one, and it is then determined whether the reference count is 0.
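Continuing the illustrative model above (still an assumption, not the claimed implementation), the migration step can be sketched as follows; the pin/unpin pair mirrors the increment-then-decrement of the reference count described in the preceding paragraph.

```python
def migrate(fs: "DedupFS", pa_old: int, va: int, pa_new: int) -> bool:
    """Move the block at pa_old to pa_new; va was obtained from pa_old
    via the reverse mapping (the SSA in this application)."""
    block = fs.storage[pa_old]   # read the data to be migrated into memory
    fs.va_refs[va] += 1          # consistency pin while the copy is in flight
    # ... concurrent deletes or updates may drop logical references here ...
    fs.va_refs[va] -= 1          # unpin just before performing the write
    if fs.va_refs[va] == 0:      # every logical reference became invalid
        del fs.storage[pa_old]   # invalid data: discard instead of migrating
        fs.va_to_pa.pop(va, None)
        return False
    fs.storage[pa_new] = block   # write to the second physical address
    del fs.storage[pa_old]
    fs.va_to_pa[va] = pa_new     # one update, even if many LAs share this VA
    return True
```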
In one implementation, the data to be migrated is valid data stored in a physical block to be reclaimed.
In one implementation, obtaining the virtual address of the duplicate of the data to be written comprises querying a correspondence between data and virtual addresses based on the data to be written, to obtain the virtual address of the duplicate.
Optionally, the method further comprises: obtaining fingerprint information of the data to be written; matching the fingerprint information of the data to be written against the fingerprint information of written data stored in memory; determining that the data to be written is duplicate data when the fingerprints match; and determining that the data to be written is not duplicate data when they do not match.
Optionally, the memory stores the fingerprint information of written data in descending order of the access heat of the written data. By managing fingerprint information according to access heat, recently written data can be found quickly in memory during writes, and long-unused entries can be evicted.
In a second aspect, the present application provides a data management apparatus applied to a file system layer. The apparatus comprises an acquisition module and a processing module. The acquisition module is configured to acquire data to be written to a file in an external memory. The processing module is configured to: when the data to be written is not duplicate data, allocate a virtual address and a physical address to the data to be written, record the data to be written at the physical address, and record the correspondence between the virtual address and the physical address and the correspondence between the virtual address and the logical address of the data to be written, the logical address indicating the position of the data to be written in the file; and when the data to be written is duplicate data, obtain the virtual address of the duplicate of the data to be written and record the correspondence between that virtual address and the logical address.
Optionally, the processing module is further configured to record the total number of logical addresses corresponding to the virtual address when the data to be written is not duplicate data, and to update that total when the data to be written is duplicate data.
Optionally, the acquisition module is further configured to acquire a first physical address of data to be migrated in the external memory, and to acquire the virtual address of the data to be migrated based on the first physical address; the processing module is further configured to write the data to be migrated to a second physical address, and to update the correspondence between the virtual address and the physical address of the data to be migrated based on the second physical address.
Optionally, the processing module is specifically configured to write the data to be migrated to the second physical address when the total number of logical addresses corresponding to the virtual address of the data to be migrated is not 0.
Optionally, the processing module is further configured to stop the migration operation of the data to be migrated when the total number of logical addresses corresponding to the virtual address of the data to be migrated is 0.
Optionally, the data to be migrated is valid data stored in a physical block to be reclaimed.
Optionally, the processing module is specifically configured to query a correspondence between data and virtual addresses based on the data to be written, to obtain the virtual address of the duplicate of the data to be written.
Optionally, the processing module is further configured to: obtain fingerprint information of the data to be written; match the fingerprint information of the data to be written against the fingerprint information of written data stored in memory; determine that the data to be written is duplicate data when the fingerprints match; and determine that the data to be written is not duplicate data when they do not match.
Optionally, the memory stores the fingerprint information of written data in descending order of the access heat of the written data.
In a third aspect, the application provides a computing device comprising a memory storing program instructions and a processor executing the program instructions to perform the method provided in the first aspect of the application and any one of its possible implementations.
In a fourth aspect, the present application provides a cluster of computing devices comprising a plurality of processors and a plurality of memories, the plurality of memories having stored therein program instructions that are executable by the plurality of processors to cause the cluster of computing devices to perform the method provided in the first aspect of the present application and any possible implementation thereof.
In a fifth aspect, the present application provides a computer readable storage medium, the computer readable storage medium being a non-volatile computer readable storage medium comprising program instructions which, when run on a computing device, cause the computing device to perform the method provided in the first aspect of the present application and any one of its possible implementations.
In a sixth aspect, the application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method provided in the first aspect of the application and any one of its possible implementations.
Drawings
FIG. 1 is a schematic diagram of an operating system I/O stack provided by an embodiment of the present application;
FIG. 2 is a process schematic diagram of a DMdedup deduplication implementation provided by an embodiment of the present application;
FIG. 3 is a flow chart of a deduplication process shown in FIG. 2, provided in accordance with an embodiment of the present application;
FIG. 4 is a process schematic diagram of a SmartDedup deduplication implementation provided by an embodiment of the present application;
FIG. 5 is a flow chart of a deduplication process shown in FIG. 4, provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of an implementation environment related to a data management method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an implementation environment involved in another data management method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a data writing process according to an embodiment of the present application;
FIG. 9 is a flow chart of a process shown in FIG. 8 provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of determining whether data to be written is duplicate data according to an embodiment of the present application;
FIG. 11 is a schematic diagram of memory management of fingerprint information according to an embodiment of the present application;
FIG. 12 is a flow chart of a garbage collection process provided by an embodiment of the present application;
FIG. 13 is a schematic diagram of a garbage collection process according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a layout of metadata provided by an embodiment of the present application;
FIG. 15 is a schematic diagram of a data management device according to an embodiment of the present application;
FIG. 16 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 17 is a schematic structural diagram of a computing device cluster according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
For ease of understanding, the techniques and background referred to in the embodiments of the present application are explained first.
A log-structured file system (LFS) is characterized by out-of-place updates and garbage collection, and offers better random-write and small-granularity transactional update performance than traditional journaling file systems (such as EXT and XFS) and copy-on-write file systems (such as BtrFS and ZFS).
The flash-friendly file system (F2FS) is a log-structured file system designed for flash memory devices and is widely applied in intelligent terminal scenarios such as mobile phones and in high-speed servers. F2FS organizes the storage area into a log structure and writes user data, file mappings, directory indexes, and the like in log form, converting random writes of data and file structure into sequential writes. This improves the overall write performance of the file system, reduces wear on lifetime-limited devices (such as flash memory), and provides an efficient file fragment reclamation mechanism. F2FS is widely used in intelligent terminal devices (such as mobile phones and smart televisions) and servers, where large amounts of duplicate data exist from application caches and backups, duplicated user files, system updates, and snapshots.
Logical, virtual, and physical addresses (logical address/virtual address/physical address, LA/VA/PA) are the three address layers used by the present application. A logical address is the offset of a data block relative to a file, i.e., the position of the data block in the file. A physical address is the offset of a data block within the entire file system address space, i.e., the actual location of the data in physical storage. The virtual address isolates the two, reducing the mapping complexity of the deduplication file system.
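As a concrete illustration (the values are invented for this example), two logical blocks in different files can share one physical copy through a single virtual address, so garbage collection rewrites only the VA-to-PA entry:

```python
# LA -> VA, as kept in the parent nodes of two files
la_to_va = {("fileA", 0): 7, ("fileB", 3): 7}
# VA -> PA, as kept in the virtual address table (VAT)
va_to_pa = {7: 0x2A10}
# Migrating the physical block rewrites only va_to_pa[7];
# neither file's logical mapping changes.
va_to_pa[7] = 0x3B20
```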
Garbage collection (garbage collection, GC) is a basic operation in F2FS. Since F2FS organizes the storage area into log structures, valid data in old space must be migrated and the space reclaimed when the remaining free log segments are insufficient. Migrating valid data requires garbage collection to modify the address information in the parent nodes of the migrated data blocks, so a reverse mapping from physical to logical addresses is required.
The node address table (node address table, NAT): in native F2FS, each node is assigned a node id, and the NAT provides translation from node ids to node physical addresses.
The virtual address table (virtual address table, VAT) is a metadata structure for translating virtual addresses into physical addresses and plays an important role in file reads, writes, and similar processes. A portion of its hot entries is cached in memory.
The segment summary area (segment summary area, SSA) is a metadata structure in F2FS that stores the reverse mapping from physical addresses to logical addresses (i.e., parent node information), and is therefore also called the reverse mapping table; it is mainly used in garbage collection. In the present application, because a virtual address layer is added, the SSA stores the mapping from physical addresses to virtual addresses, and is likewise mainly used during garbage collection.
The segment information table (segment information table, SIT) is an F2FS structure for centrally managing segment states, mainly used to maintain segment allocation information.
The reference count (reference count, RC) marks the number of times a copy is referenced in the deduplication system, i.e., by how many logical blocks a physical block is referenced.
Device Mapper is a framework provided by the Linux kernel for mapping physical block devices onto higher-level virtual block devices; it is the basis of the logical volume manager.
The page cache (page cache) is an area of memory created by the operating system for caching data pages from external memory (such as a disk or a solid-state drive).
Direct I/O refers to reading and writing external memory directly, without passing through the page cache; the upper-layer file system or application records only the file mapping or address index while accessing data content directly. Thus, for data written through Direct I/O, an additional read is required to obtain the data content.
An index node (inode) is a data structure used to describe objects such as files and directories.
A super block (superblock) stores metadata of the file system, such as its type, size, state, and usage, enabling the operating system to identify and manage the file system.
A checkpoint (checkpoint) represents a consistent state of the system. In a file system, the consistent state is saved to persistent storage so that the system can recover from it in the event of a failure.
Deduplication, also known as duplicate data deletion, reduces redundant data by keeping only one physical copy for logical data blocks with identical content and indexing accesses to those logical blocks onto the shared physical block. Deduplication not only reduces real-time redundant writes on the critical path and prolongs device lifetime, but also saves storage capacity and enables efficient reuse of storage space. At present, deduplication generally divides data into data blocks, compares the blocks by their fingerprint information, and directs logical files with identical content to a unique shared physical copy, thereby eliminating redundant data.
Deduplication is generally classified into online deduplication and offline deduplication according to when deduplication is triggered. Online deduplication is performed before data reaches storage, while offline deduplication retains the original data in external memory and is triggered under specific conditions. The data management method provided by the embodiments of the present application mainly concerns online deduplication. The advantage of online deduplication is that, because deduplication completes before data reaches storage, the written data volume is reduced and redundant writes are eliminated in real time. However, if fingerprint (FP) information is kept in memory, it is generally impossible to retain all fingerprints, so deduplication cannot be complete; if it is kept in external memory, the input/output (I/O) speed of the system is significantly affected, because external memory must be read and written repeatedly on the critical path of reads and writes. Online deduplication therefore suffers from high CPU overhead on the write path, long tail latency, and high memory occupancy. Offline deduplication writes data to the storage device first and eliminates duplicates when the system is idle, so deduplication benefits cannot be obtained immediately.
Deduplication may be done at different levels of the operating system I/O stack. FIG. 1 is a schematic diagram of an operating system I/O stack according to an embodiment of the present application. As shown in FIG. 1, the operating system I/O stack includes a virtual file system (virtual file system, VFS) layer, a file system (file system, FS) layer, a block layer, and a hard disk drive layer. In some implementations, the hard disk drive layer may be a small computer system interface (small computer system interface, SCSI) layer. The virtual file system is responsible for general file abstraction semantics; it is the upper-layer abstract unified operation interface and switches among different file systems on the I/O path. The file system is responsible for abstracting the concept of a "file" and maintaining the location mapping of "file" data to the block layer. The block layer provides a unified abstraction over the underlying hardware devices and is mainly responsible for implementing I/O scheduling policies; for example, the block layer gathers bulk I/O based on the scheduling policy, aggregates and issues requests, and merges I/O requests to reduce the number of I/O operations. The hard disk drive layer is responsible for interfacing with the hardware disk.
At present, deduplication is a relatively common approach at the file system level and at the block level.
DMdedup is representative of block-layer deduplication. Through the kernel's Device Mapper technology, DMdedup uses an a priori duplication rate to map two devices into one logical volume; after data is written to the logical volume, deduplication completes in the block layer, and the deduplicated data and the metadata generated by deduplication are stored on the two devices, called the data device and the metadata device respectively. When writing data, the metadata device is accessed according to the fingerprint information of the data block; the fingerprint is compared with the fingerprint information of stored data; and, according to the comparison result, duplicate data is deduplicated or non-duplicate data is stored on the data device, with the related information recorded by modifying the metadata. Data is read by accessing the metadata device to obtain address information and then reading from the data device.
FIG. 2 is a process schematic diagram of the DMdedup deduplication implementation, and FIG. 3 is the corresponding flowchart. Here, node 1 and node 2 both write a data block carrying the data "2A". As shown in FIGS. 2 and 3, the write process mainly includes the following five steps:
1. DMdedup creates a logical volume in the block layer using physical device 1 and physical device 2.
2. The file system layer perceives the logical volume and formats a file system on it; when a data write request from the upper layer is received, the file system passes the request downward directly, without deduplication.
3. After the logical volume in the block layer obtains the data block "2A" to be written, it compares the block's fingerprint information with the fingerprint information of written data and determines from the comparison whether data block "2A" is duplicate data.
4. If data block "2A" is duplicate data, the block layer maps it to the actual address of the already written data block "2A" on the data device and updates the metadata on the metadata device. If data block "2A" is not duplicate data, the block layer writes it normally to the data device and adds related metadata to the metadata device.
5. The block layer writes the unique copy to physical device 1 according to the data type, while the metadata produced by deduplication is stored separately on physical device 2.
However, DMdedup has performance and "fake storage space" problems. Owing to deduplication, the metadata is stored on a separate device in external memory. During reads and writes, the metadata device must be accessed on the critical path to obtain address information before the data device can be operated on, causing severe performance degradation. Meanwhile, because the logical volume is generated using an a priori duplication rate, its size is fixed and the file system cannot perceive deduplication semantics, producing the "fake storage space" problem: if the a priori duplication rate is higher than the duplication rate of the actual load, the file system believes there is still space to store data while the storage space in the actual device is already full; otherwise, the actual storage space in the device cannot be fully utilized.
SmartDedup is representative of file-system-layer deduplication. SmartDedup focuses on a lightweight combination of online and offline deduplication for resource-constrained mobile devices. In its mapping relationship, SmartDedup allows the same physical address to be stored in the parent nodes of multiple file data blocks; deduplication is accomplished by directly mapping logical addresses in the file system to physical addresses.
FIG. 4 is a process schematic diagram of the SmartDedup deduplication implementation, and FIG. 5 is the corresponding flowchart. Again, node 1 and node 2 both write a data block carrying the data "2A". As shown in FIGS. 4 and 5, the write process mainly includes the following five steps:
1. The file system layer acquires the data block to be written and computes the block's fingerprint information.
2. The file system layer performs a comparison according to the fingerprint information.
3. If the comparison indicates that the data block is duplicate data, the file system layer directly stores the physical address of the copy in the parent node of the duplicate data (for example, node 2 stores the physical address pointing to data block "2A" in node 1) and updates the reference count of that data block in the metadata. If the comparison indicates that the data block is not duplicate data, the file system layer allocates a new physical address and physical block for the data block and stores the block's fingerprint information for deduplication of subsequent blocks. The parent node stores the mapping between data and its physical address.
4. Non-duplicate data is written normally: the data block is written to a physical block, and the physical block is placed at the allocated physical address. Because duplicate data already has a copy on the storage device, the file system layer aborts the actual write of duplicate data.
5. The write completes.
However, SmartDedup assumes that the file system layer already holds complete mapping information; in file systems like F2FS that require active garbage collection this does not hold, because garbage collection requires parent node information. Due to deduplication, SmartDedup has a many-to-one mapping from logical addresses to physical addresses. To complete garbage collection, the one-to-many reverse mapping from physical to logical addresses must then be preserved. This causes two problems. First, it is complex and difficult to organize a one-to-many mapping within a space-limited metadata region. Second, during garbage collection, if one physical block corresponds to multiple logical blocks, i.e., multiple parent nodes, all parent nodes must be traversed and their address information modified, which undoubtedly slows down garbage collection.
Based on the above, an embodiment of the present application provides a data management method applied to a file system layer. In this method, the file system layer obtains data to be written to a file in external memory. When the data to be written is not duplicate data, a virtual address and a physical address are allocated to it, the data is recorded at the physical address, and the correspondence between the virtual address and the physical address and the correspondence between the virtual address and the logical address of the data are recorded. When the data to be written is duplicate data, the virtual address of its duplicate is obtained, and the correspondence between that virtual address and the logical address of the data is recorded. The logical address indicates the position of the data to be written in the file.
In this data management method, the file system layer allocates virtual addresses for data and records the correspondence between virtual addresses and physical addresses as well as between virtual addresses and logical addresses, so the virtual address isolates the logical address from the physical address and direct mapping between them is avoided. In addition, because duplicate data shares the same virtual address, virtual addresses and physical addresses form a one-to-one mapping, which reduces the complexity of the file system's mapping relationships, speeds up garbage collection, and preserves the read-write performance of the file system.
An implementation environment related to a data management method provided by an embodiment of the present application is described below.
Fig. 6 is a schematic diagram of an implementation environment related to a data management method according to an embodiment of the present application. As shown in FIG. 6, the implementation environment includes a computing device 10. The computing device 10 may obtain data to be written to a file in external memory and write the data to external memory. In the implementation environment shown in fig. 6, the file to be written may be a file generated by the computing device 10 itself or a file received from outside by the computing device 10.
In one implementation, the data management method provided by the embodiment of the present application may be implemented by the computing device 10 running an executable program. For example, the executable program of the data management method may be presented in the form of an application installation package that, upon installation in the computing device 10, is capable of implementing the data management method by running the executable program. At this point, computing device 10 may be a terminal. The terminal may be a computer, a personal computer, a portable mobile terminal, a multimedia player, an electronic book reader, or a wearable device, etc.
Fig. 7 is a schematic diagram of an implementation environment related to another data management method according to an embodiment of the present application. As shown in FIG. 7, the implementation environment may include a client 20 and a computing device 10. The computing device 10 is used for storing data and executing the data management method provided by the embodiment of the application. The client 20 is capable of establishing a communication connection with the computing device 10. For example, a communication connection may be established between client 20 and computing device 10 over a network. Alternatively, the network may be a local area network, or may be the internet, or may be another network, which is not limited by the embodiment of the present application.
The client 20 is for user interaction with the computing device 10. In one implementation, the client 20 is configured to send instructions to the computing device 10 as directed by a user. For example, the client 20 is configured to send a data writing instruction to the computing device 10 according to an instruction of a user to instruct writing of data to an external memory. For another example, the client 20 is configured to send a data reading instruction to the computing device 10 according to an instruction of a user to instruct reading of data from an external memory. The computing device 10 is configured to receive an instruction sent by the client 20 and perform an operation indicated by the instruction. For example, the computing device 10 is configured to receive a data writing instruction, and according to the data management method provided by the embodiment of the present application, store data to be written on an external memory based on the data writing instruction. For another example, the computing device 10 is configured to receive a data reading instruction, obtain target data based on the data reading instruction according to the data management method provided by the embodiment of the present application, and feed back the target data to the client 20.
In one implementation, the client 20 may be a desktop computer, a laptop computer, a mobile phone, a smart phone, a tablet computer, a multimedia player, a smart home appliance, an artificial intelligence device, an intelligent wearable device, an electronic reader, an intelligent vehicle device, or an internet of things device, etc. Computing device 10 may be a server (e.g., a cloud server). Typically, computing device 10 may be a server cluster of several servers, or implemented as a cloud computing service center. Among them, a large amount of basic resources owned by cloud service providers are deployed in a cloud computing service center. For example, a cloud computing service center is deployed with computing resources, storage resources, network resources, and the like. The cloud computing service center can utilize the large amount of basic resources to realize the data management method provided by the embodiment of the application.
When computing device 10 is implemented by a cloud computing service center, a user may access the cloud platform through the client 20 and use a storage service provided by external storage through the cloud platform. In this case, the functions implemented by the data management method provided by the embodiments of the present application can be abstracted by a cloud service provider into a storage cloud service on a cloud platform, and the cloud platform can provide the storage cloud service to users using resources in the cloud computing center. After a user purchases the storage cloud service through the cloud platform, the service can store the data the user needs to write, or provide stored data to the user. Optionally, the cloud platform may be a cloud platform of a central cloud, a cloud platform of an edge cloud, or a cloud platform including both a central cloud and an edge cloud, which is not specifically limited in the embodiments of the present application. It should be noted that, in the implementation environment shown in FIG. 7, the computing device 10 may also be implemented by a resource platform other than a cloud platform, which is likewise not limited by the embodiments of the present application; in that case, computing device 10 may be implemented by resources in another resource platform and provide related storage services to the user.
In one implementation, the data management method provided by the embodiments of the present application can be applied to a computing device deployed with F2FS. Because F2FS requires sequential writing, a segment in old space may contain both valid data and data invalidated by deletion or update. Garbage collection (GC) must first be performed on the old space, migrating valid data until all data in the old segments is invalid, after which the old space is reclaimed. The file system manages its data in units of blocks, segments, sections, zones, and the file system itself, whose granularity increases in turn, as illustrated below.
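A small sketch of this hierarchy follows; the 4 KB block and 512-block segment match common F2FS defaults, while the section and zone factors are assumptions chosen for the example.

```python
BLOCK_SIZE = 4096            # block: the smallest managed unit (4 KB)
BLOCKS_PER_SEGMENT = 512     # segment = 512 blocks = 2 MB
SEGMENTS_PER_SECTION = 1     # section: the unit garbage collection reclaims
SECTIONS_PER_ZONE = 1        # zone: groups sections

def segment_of(block_addr: int) -> int:
    """Map a physical block address to the segment GC would reclaim."""
    return block_addr // BLOCKS_PER_SEGMENT
```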
It should be understood that the foregoing is an exemplary description of the application scenarios of the data management method provided by the embodiments of the present application and does not limit those scenarios; those skilled in the art will recognize that, as service requirements change, the application scenario may be adjusted accordingly. For example, the data management method provided by the embodiments of the present application can be applied to computing devices deployed with F2FS, or to file systems with the same technical problems. In addition, when the data management method is applied to a computing device deployed with F2FS, the deduplication function is integrated into F2FS, and the resulting file system can be regarded as a flash-friendly deduplication file system (flash-friendly deduplication file system, F2DFS) modified on the basis of F2FS.
The implementation process of the data management method provided by the embodiments of the present application is described below, taking its application to the scenario shown in FIG. 7 as an example. In this method, the file system layer allocates a virtual address to data during the data writing process, and the effect of the virtual address on data management appears mainly in the data writing process and the data migration process. The data writing process is described first.
FIG. 8 is a schematic diagram of a data writing process according to an embodiment of the present application, and FIG. 9 is the corresponding flowchart. Again, node 1 and node 2 both write a data block carrying the data "2A". As shown in FIGS. 8 and 9, the data writing process includes the following steps:
Step 801, the file system layer obtains data to be written in a file to be written in the external memory.
The file to be written may be a file generated by the computing device itself or a file the computing device receives from outside. For example, the file to be written may be a file indicated by a write request sent by a user. In one implementation, the file system layer may process data in the file in units of data blocks, which may be obtained by partitioning the data in the file. The embodiments of the present application take processing the file's data in units of data blocks as an example.
Step 802, the file system layer determines whether the data to be written is duplicate data.
In one implementation, the file system layer may obtain fingerprint information of the data to be written, and determine whether the data to be written is duplicate data based on the fingerprint information of the data to be written. Fig. 10 is a schematic diagram of determining whether data to be written is duplicate data according to an embodiment of the present application.
As shown in fig. 10, the implementation process includes:
step 8021, obtaining fingerprint information of the data to be written.
The fingerprint information of the data to be written is a unique identification of the data to be written and may be obtained from a hash of the data. For example, the fingerprint information may be obtained by performing a collision-resistant hash computation on the data to be written; collision resistance means the probability that two different inputs produce the same hash value is low. For example, when a file is written to an external storage system, since the file's data may be cached in the page cache of memory, the hash computation may be performed in units of pages in the page cache, and the result of the hash computation is the fingerprint information of the data to be written. Alternatively, when hashing in units of pages, the data stored in a page may be further divided by a specified size (e.g., 4 kilobytes (KB)), and a corresponding hash value obtained for each piece of data so divided.
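A minimal sketch of this fingerprinting step follows; SHA-256 is an assumption made for the example, as the application only requires a collision-resistant hash.

```python
import hashlib

PAGE_SIZE = 4 * 1024  # hash in 4 KB units, as described above

def fingerprints(data: bytes):
    """Split data into 4 KB chunks and yield one fingerprint per chunk."""
    for off in range(0, len(data), PAGE_SIZE):
        chunk = data[off:off + PAGE_SIZE]
        yield hashlib.sha256(chunk).hexdigest()
```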
Step 8022, matching the fingerprint information of the data to be written with the fingerprint information of the written data stored in the memory.
After the fingerprint information of the data to be written is acquired, it may be matched against the fingerprint information of written data. Since fingerprint information uniquely identifies data, when the fingerprint of the data to be written matches that of written data, the data to be written can be regarded as identical to the written data and is determined to be duplicate data; when the fingerprints do not match, the data to be written can be regarded as different from the written data and is determined not to be duplicate data. Indexing data content by fingerprint is far more efficient than comparing data content byte by byte, which ensures the efficiency of judging whether a data block is a duplicate. Moreover, since the collision probability (false positives) is extremely low, even far below the hardware error probability, the accuracy of that judgment is also ensured.
In the embodiments of the present application, the fingerprint information of written data is stored in memory. After the fingerprint of the data to be written is acquired, memory can be searched for an identical fingerprint. When a fingerprint identical to that of the data to be written exists in memory, it is determined that the fingerprint of the data to be written matches the fingerprint information of written data. When no identical fingerprint exists in memory, it is determined that the fingerprints do not match; at this point, the fingerprint of the data to be written may also be written into memory.
In one implementation, memory may manage fingerprint information by fingerprint prefix. For example, memory maintains multiple memory sets (e.g., memory buckets) corresponding to multiple prefixes, and each set stores all fingerprint information having the corresponding prefix; the size of a memory set can be scaled according to device resources. After the fingerprint of the data to be written is acquired, it can be determined from its prefix whether memory contains a memory set corresponding to that prefix. When no such set exists, it is determined that the fingerprint of the data to be written does not match the fingerprint information of written data. When the set exists, it is searched for the fingerprint: if the fingerprint is absent, the fingerprints do not match; if the fingerprint is present, they match. For example, as shown in FIG. 11, memory maintains multiple memory buckets, one of which corresponds to the prefix B452 and stores all fingerprint information with that prefix. After the fingerprint B4522F02F35C5340F620D37AD66434D9D3CEB69C of the data to be written is acquired, its prefix B452 can be obtained. The prefix B452 is then used as a key to access memory and locate the pointer to the fingerprint bucket, and the memory bucket corresponding to B452 is found to exist. The bucket is then searched for the fingerprint B4522F02F35C5340F620D37AD66434D9D3CEB69C; since the fingerprint exists in the bucket, as shown in FIG. 11, it is determined that the fingerprint of the data to be written is identical to that of written data and the data to be written is duplicate data. The prefix of fingerprint information may be the first specified number of characters of the fingerprint, counted from left to right; the specified number may be set according to application requirements, for example, four.
It should be noted that memory may also manage fingerprint information according to the access heat of written data. For example, memory stores the fingerprint information of written data in descending order of access heat. Each time the fingerprint of data to be written matches a stored fingerprint, the ordering of fingerprints in memory may be adjusted accordingly: the matched fingerprint can be regarded as currently the hottest and moved to the head of its memory bucket, and when a bucket is full, the fingerprint at its tail may be deleted. Managing fingerprint information by access heat allows recently written data to be found quickly in memory during writes and long-unused entries to be evicted.
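A sketch of such a prefix-bucketed, heat-ordered fingerprint store follows; the four-character prefix matches the B452 example above, while the bucket capacity and the mapping of fingerprint to virtual address are assumptions made for the example.

```python
from collections import OrderedDict

class FingerprintStore:
    PREFIX_LEN = 4        # e.g. "B452" from "B4522F02F35C5340..."
    BUCKET_CAP = 1024     # scaled to device resources in practice

    def __init__(self):
        self.buckets = {}  # prefix -> OrderedDict(fingerprint -> VA)

    def lookup(self, fp: str):
        """Return the VA of a duplicate, or None if fp is unknown."""
        bucket = self.buckets.get(fp[:self.PREFIX_LEN])
        if bucket is None or fp not in bucket:
            return None                     # not duplicate data
        bucket.move_to_end(fp, last=False)  # hit: promote to head (hottest)
        return bucket[fp]

    def insert(self, fp: str, va: int):
        bucket = self.buckets.setdefault(fp[:self.PREFIX_LEN], OrderedDict())
        bucket[fp] = va
        bucket.move_to_end(fp, last=False)  # new entry starts as hottest
        if len(bucket) > self.BUCKET_CAP:   # full: evict the cold tail
            bucket.popitem(last=True)
```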
Step 8023, when the fingerprint information of the data to be written matches the fingerprint information of the written data, determining that the data to be written is duplicate data.
Step 8024, when the fingerprint information of the data to be written does not match the fingerprint information of the written data, determining that the data to be written is not duplicate data.
It should be noted that, before step 802 is executed, it may first be determined whether online deduplication is required for the data to be written; step 802 is executed only when online deduplication is required, and otherwise the data is written according to the write flow of step 803. In one implementation, whether online deduplication is required may be determined based on the current write mode and the load of the file system. For example, when the execution of a write operation in a write mode is incompatible with the execution of deduplication, online deduplication is not performed on the data to be written: a write mode that writes data by direct overwrite, such as bare input/output (Direct I/O), is incompatible with the deduplication flow, because the Direct I/O read-write path does not pass through the page cache but reads and writes external memory directly. For another example, when the overhead of performing online deduplication in a write mode does not match its benefit, online deduplication is not performed: if online deduplication in a write mode would introduce much additional overhead whose benefit is far smaller than that cost, online deduplication of the data to be written in that write mode is not required.
In step 803, when the data to be written is not the repeated data, the file system layer allocates a virtual address and a physical address to the data to be written, records the data to be written in the physical address, records the corresponding relationship between the virtual address and the physical address, and the corresponding relationship between the virtual address and the logical address of the data to be written, where the logical address is used to indicate the position of the data to be written in the file.
When the data to be written is not repeated data, the file system layer needs to write the data to be written into the external memory according to the normal writing process. The implementation comprises: the file system layer allocates a virtual address and a physical address for the data to be written, writes the data to be written into the physical block indicated by the physical address (also referred to as writing the data to be written into the physical address), and then completes the writing process according to the writing result. Completing the writing process mainly means recording information related to the writing event, for example the correspondence between the virtual address and the physical address, the correspondence between the virtual address and the logical address, and metadata updates. The virtual address serves as a mapping bridge between the physical address and the logical address; the physical address and the logical address have no direct mapping relationship, so the virtual address isolates the logical address from the physical address. Data and virtual addresses are in one-to-one correspondence, and the physical blocks holding the data are in one-to-one correspondence with the virtual addresses. Optionally, since the data written by the file system layer may or may not be repeated data, the file system layer may record or update the total number of logical addresses corresponding to the virtual address each time data is written, in order to perform consistency management on the written data. For example, when the data to be written is not repeated data, a record entry for the total number of logical addresses corresponding to the virtual address may be newly created, and the total initialized to 1. In the embodiment of the application, because virtual addresses and physical addresses are in one-to-one correspondence, the total number of logical addresses corresponding to a virtual address equals the total number of logical addresses corresponding to the physical address; this total is therefore also called the number of times the physical block indicated by the physical address is referenced by logical addresses, that is, the reference count of the data to be written.
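Illustratively, the non-repeated-data write path of step 803 can be sketched as follows. Every type and helper here (alloc_virtual_addr, vat_set, file_map_set, refcount_init, and so on) is hypothetical and stands in for the corresponding on-disk structure of an implementation.

```c
typedef unsigned long la_t;  /* logical address: position within the file */
typedef unsigned long va_t;  /* virtual address */
typedef unsigned long pa_t;  /* physical address */

extern va_t alloc_virtual_addr(void);
extern pa_t alloc_physical_block(void);
extern int  write_block(pa_t pa, const void *buf);
extern void vat_set(va_t va, pa_t pa);       /* record VA -> PA */
extern void file_map_set(la_t la, va_t va);  /* record LA -> VA */
extern void refcount_init(va_t va, unsigned n);

/* Step 803: write non-repeated data, then record the two mappings and
 * initialize the reference count of the new virtual address to 1. */
static int write_unique_data(la_t la, const void *buf)
{
    va_t va = alloc_virtual_addr();
    pa_t pa = alloc_physical_block();

    if (write_block(pa, buf) != 0)
        return -1;

    vat_set(va, pa);       /* the VA bridges the LA and the PA */
    file_map_set(la, va);  /* no direct LA -> PA mapping is kept */
    refcount_init(va, 1);  /* exactly one LA references this VA */
    return 0;
}
```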
In one implementation, the file system layer may assign a virtual address to the data to be written and then assign a physical address to that virtual address. When assigning the virtual address, one of the unused virtual addresses may be assigned to the data to be written. Alternatively, in a data-update scenario, the virtual address that needs to be updated (hereinafter referred to as the old VA) may be assigned to the data to be written, and the implementation details of assigning the old VA differ slightly depending on how the old VA is being used. For example, when the total number of logical addresses corresponding to the old VA is 1, the old VA would become free anyway, so it can be reused (multiplexed): the old VA is directly assigned to the data to be written without first reclaiming it, and the physical address corresponding to the old VA is assigned to the data to be written. After the data to be written is written at that physical address via the old VA, the total number of logical addresses corresponding to the old VA is already 1, so it does not need to be updated. By contrast, when the total number of logical addresses corresponding to the old VA is greater than 1, that total must be decremented by one to indicate that the reference count of the physical address corresponding to the old VA has dropped by one. Multiplexing the old VA can therefore further improve the writing efficiency of the data to be written. A data-update scenario refers to deleting stored data and rewriting new data at the same logical address. The above method of multiplexing the old VA applies when the rewritten data is not repeated data. When the rewritten data is repeated data, the logical address records the virtual address of the repeated data rather than a newly allocated virtual address; in this case, no matter how many references the old VA holds, its count must be decremented accordingly. If the old VA's original reference count is 1, it becomes 0 after one decrement, and the virtual address can be reclaimed for subsequent writes. If the old VA's original reference count is greater than 1, it is still nonzero after one decrement, meaning the virtual address is still referenced by at least one logical address and does not need to be reclaimed for the time being.
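Illustratively, the old-VA handling in the update scenario can be sketched as follows, reusing the hypothetical types and helpers of the previous sketch; all identifiers remain assumptions.

```c
typedef unsigned long la_t;
typedef unsigned long va_t;
typedef unsigned long pa_t;

extern pa_t     vat_lookup(va_t va);      /* VA -> PA */
extern int      write_block(pa_t pa, const void *buf);
extern void     file_map_set(la_t la, va_t va);
extern unsigned refcount_get(va_t va);
extern void     refcount_inc(va_t va);
extern unsigned refcount_dec(va_t va);    /* returns the new count */
extern void     va_reclaim(va_t va);
extern int      write_unique_data(la_t la, const void *buf);

/* Update at a logical address currently mapped to old_va, where the
 * new data is NOT repeated data. */
static int overwrite_unique(la_t la, va_t old_va, const void *buf)
{
    if (refcount_get(old_va) == 1) {
        /* Only this LA references old_va: multiplex the old VA and its
         * physical address directly; the count is already 1. */
        return write_block(vat_lookup(old_va), buf);
    }
    /* old_va is shared: drop one reference, then take the normal
     * allocation path of step 803. */
    refcount_dec(old_va);
    return write_unique_data(la, buf);
}

/* Update at a logical address where the new data IS repeated data:
 * remap the LA to the duplicate's VA; old_va always loses one
 * reference and is reclaimed once its count reaches 0. */
static void overwrite_duplicate(la_t la, va_t old_va, va_t dup_va)
{
    refcount_inc(dup_va);
    file_map_set(la, dup_va);
    if (refcount_dec(old_va) == 0)
        va_reclaim(old_va);
}
```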
Optionally, the memory may record not only the fingerprint information but also the virtual address and additional information of the data identified by the fingerprint information. For example, as shown in fig. 11, the memory may record a set of information for each datum, the set comprising the fingerprint information, the virtual address, and additional information (tag) of the data. The additional information may be metadata related to the fingerprint information, such as persist information, stable information, dupl information, and remove information. The persist information indicates whether the set of information has been persisted. The stable information indicates a unique key value (KV) pair for the data identified by the fingerprint information, where the key is the fingerprint information of the data and the value is the address of the physical block used to store the data. The dupl information indicates whether the data identified by the fingerprint information is recently accessed repeated data. The remove information indicates whether the data identified by the fingerprint information is valid data.
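Illustratively, the per-datum record of fig. 11 might be laid out as the following C structure; the field names mirror the described tag fields, but the exact layout and field widths are assumptions.

```c
#define FP_LEN 32                 /* assumed fingerprint width */
typedef unsigned long va_t;
typedef unsigned long pa_t;

struct fp_record {
    unsigned char fp[FP_LEN];     /* fingerprint of the data (KV key) */
    va_t          va;             /* virtual address of the data */
    pa_t          stable_pa;      /* stable: KV value, the address of
                                     the physical block storing the data */
    unsigned      persist : 1;    /* has this record been persisted? */
    unsigned      dupl    : 1;    /* recently accessed repeated data? */
    unsigned      remove  : 1;    /* validity flag for the data */
};
```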
After the virtual address allocated for the data to be written is acquired, that virtual address may also be recorded in the memory. When the memory manages fingerprint information according to the access heat of the written data, the fingerprint information and virtual address of the data to be written can be recorded at the head of the memory bucket after the data is written to the physical address. In addition, after the data to be written is written into the physical address, additional information of the data to be written can be recorded according to the writing result, so as to store the metadata related to the fingerprint information. Optionally, the deduplication-related metadata and the file system metadata of the present application are unified and maintained in the same device. In this way, during data reads and writes, the metadata can be accessed from that device to obtain the relevant address information, and the corresponding read and write operations are then performed on the same device; this avoids the problem in the related art of having to operate on two devices, and thus helps guarantee the performance of the file system.
After the file system layer completes the related processing in step 803, the processed memory page may be submitted as an IO request, for example to the block layer IO (bio) queue, so that the block layer can continue the subsequent writing process of the data to be written based on the allocated physical address.
Step 804, when the data to be written is the repeated data, the file system layer obtains the virtual address of the repeated data of the data to be written, and records the corresponding relationship between the virtual address and the logical address of the data to be written.
When the data to be written is repeated data, the operation of writing it into a physical block does not need to be executed, but related information still needs to be recorded to register the writing event. At this time, the virtual address of the repeated data may be obtained, and the correspondence between that virtual address and the logical address of the data to be written may be recorded. Because the virtual address corresponds to a physical address, this correspondence makes it possible to determine that the data stored at the physical address corresponding to the virtual address is also the data of that logical address. The file system layer may query the correspondence between fingerprint information and virtual addresses using the fingerprint information of the repeated data, so as to obtain the virtual address of the repeated data; this correspondence was recorded during the writing process of the repeated data.
In addition, in order to perform consistency management on the written data, the file system layer may record or update the total number of logical addresses corresponding to a virtual address each time data is written. For example, when the data to be written is repeated data, since the total number of logical addresses corresponding to the virtual address of the repeated data was recorded during the writing process of the repeated data, the total can be updated on the basis of the originally recorded value. Updating the total here simply means incrementing the original total by one.
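Illustratively, the repeated-data write path of step 804 can be sketched as follows; fp_to_va is an assumed helper that queries the fingerprint-to-virtual-address correspondence recorded when the repeated data was first written.

```c
typedef unsigned long la_t;
typedef unsigned long va_t;

extern int  fp_to_va(const unsigned char *fp, va_t *va_out);
extern void file_map_set(la_t la, va_t va);  /* record LA -> VA */
extern void refcount_inc(va_t va);

static int write_duplicate_data(la_t la, const unsigned char *fp)
{
    va_t va;

    if (fp_to_va(fp, &va) != 0)
        return -1;            /* no recorded fingerprint match */

    file_map_set(la, va);     /* record LA -> VA; no block is written */
    refcount_inc(va);         /* one more LA now references this VA */
    return 0;
}
```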
As noted above, the memory can record not only the fingerprint information but also the virtual address of the data identified by the fingerprint information. Therefore, when the memory manages fingerprint information according to the access heat of the written data, after the virtual address of the repeated data of the data to be written is obtained, the fingerprint information and virtual address of the data to be written can be moved to the head of the memory bucket. In addition, additional information of the data to be written can be recorded to store the metadata related to the fingerprint information.
It should be noted that, when the data to be written is repeated data, although the operation of writing it into a physical block does not need to be performed, the memory page holding the data to be written is still marked dirty. After the file system layer completes the related processing in step 804, the dirty mark of the memory page may be cleared and the write-back operation terminated, thereby completing the writing process of the data to be written.
According to the execution process of step 804, when the data to be written is repeated data, the file system can eliminate duplicates on the write path of the data, thereby reducing the amount of data written and the storage space occupied, and avoiding useless writes, which helps extend the service life and effective capacity of the storage device.
The data writing process of the file system is described above; the data migration process is described below. From the foregoing description, it is apparent that the role of virtual addresses in data management is mainly reflected in the data writing process and the data migration process. One typical application of data migration is garbage collection, in which valid data in physical blocks to be reclaimed is migrated. Therefore, the implementation of data migration is described below taking its application in the garbage collection process as an example.
Fig. 12 is a flowchart of a garbage collection process according to an embodiment of the present application. As shown in fig. 12, the garbage collection process includes the steps of:
step 1101, the file system layer obtains a first physical address of data to be migrated in the external memory.
The data to be migrated is data that needs to be migrated from its current physical block to another physical block. When data migration is applied in the garbage collection process, the data to be migrated is the valid data stored in a physical block to be reclaimed. For example, since F2FS organizes its storage area as a log structure, when the remaining storage area of F2FS is insufficient, the valid data in old space needs to be migrated and the old space reclaimed; the data to be migrated is then the valid data in that old space. In one implementation, the file system layer may determine the physical block to be reclaimed according to its garbage collection policy and determine the valid data in that physical block, this valid data being the data to be migrated. After the data to be migrated is determined, its first physical address can be determined; for example, the physical address of the physical block storing the data to be migrated is the first physical address. As shown in fig. 13, the physical block with physical address 5 is the physical block to be reclaimed, and the valid data stored in it is data 2A; the data to be migrated is therefore data 2A, and the first physical address is 5. The circled numbers in fig. 13 indicate the order of operations.
In step 1102, the file system layer obtains a virtual address of the data to be migrated based on the first physical address.
After the first physical address of the data to be migrated is obtained, a virtual address of the data to be migrated may be obtained based on the first physical address. In one implementation, the corresponding relationship between the physical address and the virtual address of the data to be migrated may be queried based on the first physical address, to obtain the virtual address of the data to be migrated. For example, as shown in fig. 13, by querying the correspondence between the physical address and the virtual address of the data to be migrated, the virtual address of the data to be migrated is 800.
In one implementation, the file system layer may use the SSA to maintain the mapping from physical addresses to virtual addresses. When step 1102 is performed, the SSA may be queried based on the first physical address to obtain the virtual address corresponding to it. In the present application, the SSA can be obtained by modifying the original SSA in F2FS. Fig. 14 is a schematic layout diagram of metadata according to an embodiment of the present application; as can be seen from fig. 14, the layout of the SSA in the metadata is unchanged.
In step 1103, the file system layer writes the data to be migrated to the second physical address.
After determining the data to be migrated, the file system layer may obtain the index of the data to be migrated, locate the data block containing the data to be migrated based on that index, read the data block into memory, and then write it from memory to a second physical address to complete the migration. The second physical address is a physical address newly allocated to the data to be migrated. Illustratively, the second physical address is 17, as shown in fig. 13.
Optionally, before the data to be migrated is written into the second physical address, the total number of logical addresses corresponding to the virtual address of the data to be migrated may be obtained, and whether to write the data to be migrated into the second physical address is determined according to that total. If the total is 0, all logical references to the data to be migrated were invalidated during migration, so the data to be migrated has become invalid and does not need to be migrated. If the total is not 0, the data to be migrated is still valid and must be migrated. Therefore, when the total number of logical addresses corresponding to the virtual address of the data to be migrated is not 0, the data to be migrated is written into the second physical address; when the total is 0, the migration operation is stopped and the data to be migrated is discarded. This reduces the amount of data migrated, shortens migration time, and thus reduces the time consumed by garbage collection. In order to perform consistency management on the data, after the data to be migrated is read into memory, its reference count may be increased by one; before the operation of writing the data into the second physical address is performed, the reference count is decreased by one, and it is then determined whether the reference count is 0.
In step 1104, the file system layer updates the corresponding relationship between the virtual address and the physical address of the data to be migrated based on the second physical address.
After the file system layer writes the data to be migrated into the second physical address, the corresponding relationship between the virtual address and the physical address of the data to be migrated can be updated according to the second physical address. For example, as shown in fig. 13, the physical address corresponding to the virtual address 800 is changed from 5 to 17.
In one implementation, the file system layer may maintain the mapping from virtual addresses to physical addresses using the VAT. When step 1104 is performed, the physical address corresponding to the virtual address may be changed to the second physical address in the VAT. In the present application, the VAT can be obtained by modifying the original NAT in F2FS. As can be seen from fig. 14, the metadata no longer contains a NAT; its place is taken by the VAT.
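Illustratively, the migration flow of steps 1101 to 1104 can be sketched as follows, assuming the SSA maps physical addresses to virtual addresses and the VAT maps virtual addresses to physical addresses. All helpers are hypothetical, and updating the SSA entry for the new physical block is omitted for brevity.

```c
typedef unsigned long va_t;
typedef unsigned long pa_t;

extern va_t     ssa_lookup(pa_t pa);          /* PA -> VA  (step 1102) */
extern pa_t     alloc_physical_block(void);
extern int      read_block(pa_t pa, void *buf);
extern int      write_block(pa_t pa, const void *buf);
extern void     vat_set(va_t va, pa_t pa);    /* VA -> PA  (step 1104) */
extern void     refcount_inc(va_t va);
extern unsigned refcount_dec(va_t va);        /* returns the new count */

static int migrate_block(pa_t first_pa, void *buf)
{
    va_t va = ssa_lookup(first_pa);           /* step 1102 */

    if (read_block(first_pa, buf) != 0)       /* step 1103: read in */
        return -1;
    refcount_inc(va);  /* hold a reference while the block is in memory */

    /* Before writing back, drop the hold; a count of 0 means every
     * logical reference was invalidated mid-migration, so discard. */
    if (refcount_dec(va) == 0)
        return 0;

    pa_t second_pa = alloc_physical_block();
    if (write_block(second_pa, buf) != 0)     /* step 1103: write out */
        return -1;

    vat_set(va, second_pa);                   /* step 1104 */
    return 0;
}
```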
After the data to be migrated in the old space has been migrated, the old space can be reclaimed. The data migration process can also be applied to defragmentation: by migrating data in storage regions with low data-storage density, the storage space can be defragmented. In some cases, however, deduplication may conflict with defragmentation to some extent. Defragmentation of a file is generally achieved by reading the whole file and writing it back; in a deduplication file system, the defragmentation effect may not actually be achieved, because deduplication is performed during the write-back. Furthermore, even if defragmentation proceeds normally and a given file becomes sequential again, logical blocks of other files may index the physical blocks corresponding to that file, and the sequentiality of those files may be destroyed. To this end, on top of the deduplication file system, the present application can support defragmentation by modifying the mapping from virtual addresses to physical addresses. Meanwhile, if the logical blocks mapped to the same physical block differ markedly in access heat, the sequentiality of the files with higher access heat can be preserved preferentially.
In summary, the file system layer allocates virtual addresses for data and records the correspondence between virtual addresses and physical addresses and the correspondence between virtual addresses and logical addresses, so that the virtual address isolates the logical address from the physical address and direct mapping between logical and physical addresses is avoided. In addition, because repeated data shares the same virtual address, virtual addresses and physical addresses can form a one-to-one mapping, which reduces the complexity of the file system's mapping relationships, speeds up garbage collection, and helps guarantee the read-write performance of the file system.
When data is migrated, only the correspondence between the virtual address and the physical address needs to be modified; unlike the related art, the correspondence between logical addresses and physical addresses does not need to be modified. Moreover, because repeated data shares the same virtual address, virtual addresses and physical addresses form a one-to-one mapping: when the correspondence between the virtual address and the physical address is modified, only that single virtual address entry is modified, regardless of whether the data to be migrated is unique data or repeated data, rather than modifying the multiple logical addresses corresponding to repeated data as in the related art. This guarantees the migration speed of the data and hence the performance of the file system. In addition, according to the above migration process, the data migration method provided by the embodiment of the application only needs to execute one read process and one write process on the data to be migrated, regardless of whether it is unique data or repeated data. In the related art, when the data to be migrated is unique data, the read and write processes must be performed on it multiple times; when the data to be migrated is repeated data, the operations must be repeated for each data block, requiring even more read and write processes. For example, in the garbage collection process of native F2FS, migrating unique data requires at least three read processes and two write processes. The data migration method provided by the embodiment of the application can therefore further guarantee the data migration speed.
In addition, the data management method provided by the application is executed by the file system layer and performs deduplication on data during the data writing process, so deduplication semantics can be perceived and the falsely reported storage space problem of the related art can be avoided. Moreover, the application unifies the deduplication index and the data address mapping into the file mapping structure of the file system for management, which avoids the space overhead of separately storing the deduplication index and reduces the impact on file system performance of the extra file operation flows introduced by deduplication.
Furthermore, compared with storing fingerprint information in the external memory, storing it in memory effectively reduces the impact of reading fingerprint information on the read-write performance of the file system. When the memory stores fingerprint information in a plurality of memory sets, and the fingerprint information of the data to be written is matched against the fingerprint information of written data in the memory, a preliminary retrieval can first be performed across the memory sets based on the fingerprint prefix, and the fingerprint information is then matched within the retrieved memory set. This effectively reduces the total number of fingerprints to be matched, improves the speed of fingerprint matching, and improves the performance of the file system.
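Illustratively, this two-stage matching can be sketched as follows; the number of memory sets, the entries per set, and the one-byte prefix width are assumptions for illustration.

```c
#include <string.h>

#define FP_LEN      32    /* assumed fingerprint width */
#define NSETS       256   /* number of memory sets (assumed) */
#define SET_ENTRIES 128   /* entries per memory set (assumed) */

struct fp_set {
    unsigned char fps[SET_ENTRIES][FP_LEN];
    int           used;
};

static struct fp_set sets[NSETS];

/* Returns 1 when the fingerprint matches written data (i.e. the data
 * to be written is repeated data), 0 otherwise. */
static int fp_match(const unsigned char *fp)
{
    /* Stage 1: the one-byte prefix selects a single memory set. */
    struct fp_set *s = &sets[fp[0]];

    /* Stage 2: full comparison, restricted to the selected set. */
    for (int i = 0; i < s->used; i++)
        if (memcmp(s->fps[i], fp, FP_LEN) == 0)
            return 1;
    return 0;
}
```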
It should be noted that, the sequence of the steps of the data management method provided by the embodiment of the application can be properly adjusted, and the steps can be correspondingly increased or decreased according to the situation. Any method that can be easily conceived by those skilled in the art within the technical scope of the present disclosure should be covered in the protection scope of the present application, and thus will not be repeated.
The data management method of the embodiment of the application has been introduced above; the embodiment of the application further provides a data management apparatus corresponding to the method. A schematic structure of the data management apparatus is shown in fig. 15. The apparatus is applied to a file system layer. It should be understood that the apparatus may include more components than illustrated, or omit some of the illustrated components; embodiments of the present application are not limited in this respect. As shown in fig. 15, the data management apparatus 150 includes:
An obtaining module 1501 is configured to obtain data to be written in a file to be written into an external memory.
The processing module 1502 is configured to allocate a virtual address and a physical address to the data to be written when the data to be written is not the repeated data, record the data to be written in the physical address, record a corresponding relationship between the virtual address and the physical address, and a corresponding relationship between the virtual address and a logical address of the data to be written, where the logical address is used to indicate a location of the data to be written in the file.
The processing module 1502 is further configured to, when the data to be written is repeated data, obtain a virtual address of the repeated data of the data to be written, and record a correspondence between the virtual address and a logical address of the data to be written.
Optionally, the processing module 1502 is further configured to record a total number of logical addresses corresponding to the virtual address when the data to be written is not repeated data, and update the total number of logical addresses corresponding to the virtual address when the data to be written is repeated data.
Optionally, the obtaining module 1501 is further configured to obtain a first physical address of the data to be migrated in the external memory, the obtaining module 1501 is further configured to obtain a virtual address of the data to be migrated based on the first physical address, the processing module 1502 is further configured to write the data to be migrated to the second physical address, and the processing module 1502 is further configured to update a correspondence between the virtual address and the physical address of the data to be migrated based on the second physical address.
Optionally, the processing module 1502 is specifically configured to write the data to be migrated to the second physical address when the total number of logical addresses corresponding to the virtual addresses of the data to be migrated is not 0.
Optionally, the processing module 1502 is further configured to stop the migration operation of the data to be migrated when the total number of logical addresses corresponding to the virtual addresses of the data to be migrated is 0.
Optionally, the data to be migrated is valid data stored in the physical block to be reclaimed.
Optionally, the processing module 1502 is specifically configured to query a correspondence between fingerprint information and virtual addresses based on the fingerprint information of the data to be written, so as to obtain the virtual address of the repeated data of the data to be written.
Optionally, the processing module 1502 is further configured to obtain fingerprint information of the data to be written, match the fingerprint information of the data to be written with the fingerprint information of written data stored in the memory, determine that the data to be written is repeated data when the fingerprint information of the data to be written matches the fingerprint information of the written data, and determine that the data to be written is not repeated data when the fingerprint information of the data to be written does not match the fingerprint information of the written data.
Optionally, the memory stores the fingerprint information of the written data in descending order of the access heat of the written data.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of each node described above may refer to the corresponding content in the foregoing method embodiment, which is not described herein again.
Embodiments of the present application provide a computing device. The computing device is used for realizing part or all of the functions in the data management method provided by the embodiment of the application. FIG. 16 is a schematic diagram of a computing device according to an embodiment of the present application. As shown in fig. 16, the computing device 1600 includes a processor 1601, a memory 1602, a communication interface 1603, and a bus 1604. The processor 1601, the memory 1602, and the communication interface 1603 are communicatively connected to each other via a bus 1604.
The processor 1601 may include a general-purpose processor and/or a dedicated hardware chip. The general-purpose processor may include a central processing unit (CPU), a microprocessor, or a graphics processing unit (GPU). The CPU may be, for example, a single-core processor (single-CPU) or a multi-core processor (multi-CPU). The dedicated hardware chip is a high-performance processing hardware module and includes at least one of a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a network processor (NP). The processor 1601 may also be an integrated circuit chip with signal-processing capability. In implementation, some or all of the functions of the data management method of the present application may be performed by integrated logic circuits of hardware in the processor 1601 or by instructions in the form of software.
The memory 1602 is used to store a computer program that includes an operating system 1602a and executable code (i.e., program instructions) 1602b. The memory 1602 is, for example, but not limited to, a read-only memory or other type of static storage device capable of storing static information and instructions, a random-access memory or other type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.) or other optical storage, a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store the desired executable code in the form of instructions or data structures and that can be accessed by a computer. For example, the memory 1602 may be used to store port queues and the like. The memory 1602 may be independent and coupled to the processor 1601 via the bus 1604, or the memory 1602 and the processor 1601 may be integrated. The memory 1602 may store executable code that, when executed by the processor 1601, causes the processor 1601 to perform some or all of the functions of the data management method provided by an embodiment of the present application; the process performed by the processor 1601 is described correspondingly in the foregoing embodiments. The memory 1602 may also include software modules and data necessary for other running processes, such as an operating system.
Communication interface 1603 enables communication with other devices or communication networks using a transceiver module such as, but not limited to, a transceiver. For example, communication interface 1603 may be any one or any combination of a network interface (e.g., ethernet interface), a wireless network card, etc. having network access functionality.
The bus 1604 is any type of communication bus, such as a system bus, used to interconnect the internal devices of the computing device (e.g., the memory 1602, the processor 1601, and the communication interface 1603). Although embodiments of the application are described with the internal devices of the computing device interconnected via the bus 1604, the devices within the computing device 1600 may alternatively be communicatively coupled to each other by other means; for example, they may be interconnected through internal logic interfaces.
The plurality of devices may be provided on separate chips, or may be provided at least partially or entirely on the same chip; whether individual devices are placed on different chips or integrated on one or more chips often depends on the needs of the product design, and the embodiment of the application does not limit the specific implementation form of the devices. The description of each process corresponding to the drawings has its own emphasis; for parts of a process that are not described in detail, reference may be made to the descriptions of the other processes.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions that, when loaded and executed on a computing device, implement all or part of the functionality of the data management method provided by embodiments of the present application.
Moreover, the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium stores the computer program instructions.
The embodiment of the application also provides a computing device cluster. The cluster of computing devices includes at least one computing device. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop, notebook, or smart phone.
Optionally, the structure of at least one computing device included in the computing device cluster may be referred to as computing device 1600 shown in fig. 16. The same instructions for performing the data management method may be stored in memory 1602 in one or more computing devices 1600 in the cluster of computing devices.
In some possible implementations, portions of the instructions for performing the data management method may also be stored separately in the memory 1602 of one or more computing devices 1600 in the cluster of computing devices. In other words, a combination of one or more computing devices 1600 may collectively execute instructions for performing a data management method.
It should be noted that, the memory 1602 in different computing devices 1600 in the computing device cluster may store different instructions for performing part of the functions of the data management apparatus. That is, the instructions stored in memory 1602 in the different computing devices 1600 may implement the functionality of one or more of the modules above.
In some possible implementations, one or more computing devices in a cluster of computing devices may be connected through a network. Wherein the network may be a wide area network or a local area network, etc. Fig. 17 shows one possible implementation. As shown in fig. 17, two computing devices 1700A and 1700B are connected via a network. Specifically, the connection to the network is made through a communication interface in each computing device. In this type of possible implementation, computing devices 1700A and 1700B include a bus 1702, a processor 1704, a memory 1706, and a communication interface 1708. Illustratively, computing device 1700A may perform the functions of the acquisition module in embodiments of the present application, and computing device 1700B may perform the functions of the processing module in embodiments of the present application.
It should be appreciated that the functionality of computing device 1700A shown in fig. 17 may also be performed by multiple computing devices 1700. Likewise, the functionality of computing device 1700B may also be performed by multiple computing devices 1700. And the deployment mode of the modules for realizing the data management method in the computing equipment can be adjusted according to the application requirements.
The embodiment of the application also provides a computer readable storage medium, which is a nonvolatile computer readable storage medium, and the computer readable storage medium comprises program instructions, when the program instructions run on a computing device, the computing device is caused to implement the data management method provided by the embodiment of the application.
The embodiment of the application also provides a computer program product containing instructions, which when run on a computer, cause the computer to realize the data management method provided by the embodiment of the application.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
It should be noted that, the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals related to the present application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant countries and regions. For example, the drawing information and executable code and the like involved in the present application are acquired with sufficient authorization.
In embodiments of the present application, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "at least one" means one or more, and the term "plurality" means two or more, unless expressly defined otherwise.
The term "and/or" in the present application is merely an association relation describing the association object, and indicates that three kinds of relations may exist, for example, a and/or B may indicate that a exists alone, while a and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
It is to be understood that the terminology used in the description of the various examples described herein is for the purpose of describing particular examples only and is not intended to be limiting. As used in the description of the various described examples and in the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that, in the embodiments of the present application, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of the present application in any way.
The foregoing description of the preferred embodiments of the present application is not intended to limit the application, but is intended to cover any modifications, equivalents, alternatives, and improvements within the spirit and principles of the application.
Claims (21)
1. A method of data management, the method being applied to a file system layer, the method comprising:
acquiring data to be written in a file to be written in an external memory;
When the data to be written is not repeated data, a virtual address and a physical address are allocated for the data to be written, the data to be written is recorded in the physical address, the corresponding relation between the virtual address and the physical address and the corresponding relation between the virtual address and the logical address of the data to be written are recorded, and the logical address is used for indicating the position of the data to be written in the file;
And when the data to be written is the repeated data, acquiring a virtual address of the repeated data of the data to be written, and recording the corresponding relation between the virtual address and the logic address of the data to be written.
2. The method of claim 1, wherein the method further comprises:
when the data to be written is not repeated data, recording the total number of the logical addresses corresponding to the virtual addresses;
and when the data to be written is repeated data, updating the total number of the logical addresses corresponding to the virtual addresses.
3. The method of claim 1 or 2, wherein the method further comprises:
acquiring a first physical address of data to be migrated in the external memory;
Based on the first physical address, acquiring a virtual address of the data to be migrated;
writing the data to be migrated into a second physical address;
and updating the corresponding relation between the virtual address and the physical address of the data to be migrated based on the second physical address.
4. A method according to claim 3, wherein said writing said data to be migrated to a second physical address comprises:
and when the total number of the logical addresses corresponding to the virtual addresses of the data to be migrated is not 0, writing the data to be migrated into the second physical address.
5. The method of claim 3, wherein after the obtaining the virtual address of the data to be migrated based on the first physical address, the method further comprises:
And stopping the migration operation of the data to be migrated when the total number of logical addresses corresponding to the virtual addresses of the data to be migrated is 0.
6. The method according to any one of claims 1 to 5, wherein the data to be migrated is valid data stored in a physical block to be reclaimed.
7. The method according to any one of claims 1 to 6, wherein the obtaining the virtual address of the repeated data of the data to be written comprises:
And obtaining the virtual address of the repeated data of the data to be written by querying, based on the fingerprint information of the data to be written, a correspondence between fingerprint information and virtual addresses.
8. The method of any one of claims 1 to 7, further comprising:
Acquiring fingerprint information of the data to be written;
Matching the fingerprint information of the data to be written with the fingerprint information of the written data stored in the memory;
When the fingerprint information of the data to be written is matched with the fingerprint information of the written data, determining that the data to be written is repeated data;
And when the fingerprint information of the data to be written is not matched with the fingerprint information of the written data, determining that the data to be written is not repeated data.
9. The method of claim 8, wherein the memory stores the fingerprint information of the written data in descending order of access heat of the written data.
10. A data management apparatus, the apparatus being applied to a file system layer, the apparatus comprising:
the acquisition module is used for acquiring data to be written in a file to be written in the external memory;
The processing module is used for distributing a virtual address and a physical address for the data to be written when the data to be written is not the repeated data, recording the data to be written in the physical address, recording the corresponding relation between the virtual address and the physical address and the corresponding relation between the virtual address and the logical address of the data to be written, wherein the logical address is used for indicating the position of the data to be written in the file;
and the processing module is further used for acquiring a virtual address of the repeated data of the data to be written when the data to be written is the repeated data, and recording the corresponding relation between the virtual address and the logical address of the data to be written.
11. The apparatus of claim 10, wherein the processing module is further to:
when the data to be written is not repeated data, recording the total number of the logical addresses corresponding to the virtual addresses;
and when the data to be written is repeated data, updating the total number of the logical addresses corresponding to the virtual addresses.
12. The apparatus of claim 10 or 11, wherein,
The acquisition module is further used for acquiring a first physical address of data to be migrated in the external memory;
the obtaining module is further configured to obtain a virtual address of the data to be migrated based on the first physical address;
The processing module is further used for writing the data to be migrated into a second physical address;
And the processing module is also used for updating the corresponding relation between the virtual address and the physical address of the data to be migrated based on the second physical address.
13. The apparatus of claim 12, wherein the processing module is specifically configured to:
and when the total number of the logical addresses corresponding to the virtual addresses of the data to be migrated is not 0, writing the data to be migrated into the second physical address.
14. The apparatus of claim 12, wherein the processing module is further to:
And stopping the migration operation of the data to be migrated when the total number of logical addresses corresponding to the virtual addresses of the data to be migrated is 0.
15. The apparatus according to any one of claims 10 to 14, wherein the data to be migrated is valid data stored in a physical block to be reclaimed.
16. The apparatus according to any one of claims 10 to 15, wherein the processing module is specifically configured to:
And obtain the virtual address of the repeated data of the data to be written by querying, based on the fingerprint information of the data to be written, a correspondence between fingerprint information and virtual addresses.
17. The apparatus of any of claims 10 to 16, wherein the processing module is further configured to:
Acquiring fingerprint information of the data to be written;
Matching the fingerprint information of the data to be written with the fingerprint information of the written data stored in the memory;
When the fingerprint information of the data to be written is matched with the fingerprint information of the written data, determining that the data to be written is repeated data;
And when the fingerprint information of the data to be written is not matched with the fingerprint information of the written data, determining that the data to be written is not repeated data.
18. The apparatus of claim 17, wherein the memory stores the fingerprint information of the written data in descending order of access heat of the written data.
19. A cluster of computing devices, comprising at least one computing device, each computing device comprising a processor and a memory, the processor of the at least one computing device to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the method of any of claims 1 to 9.
20. A computer program product containing instructions that, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method of any of claims 1 to 9.
21. A computer readable storage medium comprising computer program instructions which, when executed by a cluster of computing devices, perform the method of any of claims 1 to 9.