
CN120255816A - Method and system for intelligently accelerating data reading of a solid state disk - Google Patents

Method and system for intelligently accelerating data reading of a solid state disk

Info

Publication number: CN120255816A
Authority: CN (China)
Prior art keywords: data, information, fingerprint, hash, data block
Legal status: Pending
Application number: CN202510410399.9A
Other languages: Chinese (zh)
Inventors: 韩胜平, 周家豪
Current Assignee: Shenzhen Yansheng Technology Co., Ltd.
Original Assignee: Shenzhen Yansheng Technology Co., Ltd.
Application filed by Shenzhen Yansheng Technology Co., Ltd.
Priority to: CN202510410399.9A
Publication of: CN120255816A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for intelligently accelerating data reading of a solid state disk. When a data reading request is received, basic information of each data block in the solid state disk is read, and an information fingerprint of each data block is constructed based on the basic information; an information fingerprint of the request data is constructed based on the data reading request, and a bloom filter is used to exclude unmatched data blocks; the similarity between the information fingerprint of the request data and the information fingerprints of the remaining data blocks is computed with a locality-sensitive hashing algorithm, locating the data block most similar to the request data; and the request data is read from the solid state disk according to the location result. By excluding unmatched data blocks with a bloom filter before the similarity computation, the invention overcomes the long read latencies and wasted system resources of current data-reading approaches.

Description

Method and system for intelligently accelerating data reading of a solid state disk
Technical Field
The invention relates to the technical field of solid state disks, and in particular to a method and a system for intelligently accelerating data reading of a solid state disk.
Background
In this era of rapidly developing digital information, the solid state disk (SSD) is widely used in computer equipment, servers, mobile terminals, and other fields by virtue of its fast read/write speed, good shock resistance, and low power consumption, and has become a key device for data storage. However, with the explosive growth of data volumes and rising user demands for real-time data reading, the data reading techniques of traditional solid state disks have gradually exposed a number of shortcomings.
When a traditional solid state disk handles a data reading request, it must traverse the storage medium and compare data blocks one by one, which is extremely inefficient. When the stored data volume is large, the time spent searching for the target data block increases markedly, seriously affecting read speed and overall system performance; for example, when a server handles many concurrent read requests, an overlong search for the target block delays the server's responses and degrades the user experience. Moreover, in the data matching step, the prior art lacks an efficient data screening mechanism and cannot quickly exclude unmatched data blocks, and the many useless comparison operations both waste system resources and prolong processing time.
Disclosure of Invention
The invention mainly aims to provide a method and a system for intelligently accelerating data reading of a solid state disk, overcoming the long read latencies and wasted system resources of current data-reading approaches.
In order to achieve the above purpose, the invention provides a method for intelligently accelerating data reading of a solid state disk, comprising the following steps:
when a data reading request is received, reading basic information of each data block in the solid state disk, and constructing an information fingerprint of each data block based on the basic information;
constructing an information fingerprint of the request data based on the data reading request, and excluding unmatched data blocks using a bloom filter;
computing the similarity between the information fingerprint of the request data and the information fingerprints of the remaining data blocks using a locality-sensitive hashing algorithm, and locating the data block most similar to the request data;
and reading the request data from the solid state disk according to the location result.
Further, constructing an information fingerprint of each data block based on the basic information includes:
converting the access frequency value and the creation time of the data block into binary sequences, and interleaving them bit by bit with the binary representation of the data block size to form an interleaved data sequence;
Gray-coding the storage location to obtain Gray-coded storage location data;
splicing the interleaved data sequence with the Gray-coded storage location data, and computing a spliced hash value of the spliced data using the FNV hash algorithm;
selecting a fixed-length data segment from the middle of the data block, and generating a segment hash value using the CRC32 algorithm;
and XORing the spliced hash value with the segment hash value to obtain the information fingerprint of the data block.
Further, constructing an information fingerprint of each data block based on the basic information may instead include:
splitting the size information of the data block into a plurality of sub-segments byte by byte, and applying binary differential coding to the sub-segments to obtain a size differential coding sequence;
applying spiral-scan coding to the storage location of the data block to generate a storage location spiral code;
splicing the size differential coding sequence with the storage location spiral code to obtain spliced coded data, and applying a first hash to it with the SipHash algorithm to generate a first hash value;
extracting the year, month, day and hour of the data block's creation time and encrypting them with a Caesar cipher whose shift amount is determined by the block's access frequency, obtaining encrypted time information;
and splicing the first hash value with the encrypted time information, running the Fletcher checksum algorithm over the spliced data, and taking the result as the information fingerprint of the data block.
Further, excluding unmatched data blocks using a bloom filter includes:
determining the number of hash functions and the bit-array size of the bloom filter with an optimal-parameter calculation model, according to the total number of data blocks in the solid state disk and the required false positive rate, and initializing the bloom filter;
splitting the information fingerprint of each data block into bytes and applying a cyclic left shift to each byte to generate a transformed fingerprint byte sequence, the shift amount being determined by the byte's position within the information fingerprint;
hashing the transformed fingerprint byte sequence in parallel with a plurality of hash functions, taking each hash value modulo the length of the bloom filter's bit array to obtain the corresponding array index positions, and setting those positions in the bloom filter to 1;
obtaining the hash index positions of the request data;
and checking the values at the corresponding hash index positions in the bloom filter; if any position holds a 0, the corresponding data block is judged unmatched and excluded.
Further, computing the similarity between the information fingerprint of the request data and the information fingerprints of the remaining data blocks using a locality-sensitive hashing algorithm includes:
converting the information fingerprint of the request data and the information fingerprints of the remaining data blocks from the time domain to the frequency domain, forming corresponding frequency-domain fingerprint data;
determining the parameters of the hash function family in the locality-sensitive hashing algorithm, including the number of hash functions and the hash bucket size, with an adaptive parameter adjustment algorithm driven by the distribution characteristics of the frequency-domain fingerprint data, and optimizing the hash function family accordingly;
dividing the frequency-domain fingerprint data into a plurality of sub-fingerprint blocks, hashing each sub-fingerprint block with the optimized hash function family, and mapping the results into hash buckets;
comparing how the request data and each data block map into the hash buckets, and counting the number of sub-fingerprint blocks that land in the same bucket as a preliminary similarity index;
applying principal component analysis to the frequency-domain fingerprint data of the request data and of the data block to construct corresponding feature vectors, and taking the cosine similarity between the feature vectors as a supplementary similarity index;
and weighting and summing the preliminary similarity index and the supplementary similarity index to obtain the similarity between the request data and each remaining data block.
Further, after the request data is read from the solid state disk according to the location result, the method includes:
acquiring attribute information of the request data;
constructing an attribute-character undirected graph based on the characters in the attribute information, the characters serving as the nodes of the undirected graph and the nodes being connected by edges;
acquiring identification information of the solid state disk and adjusting the attribute-character undirected graph based on it, obtaining an adjusted undirected graph;
correcting a preset encoding table based on the adjusted undirected graph, obtaining a corrected encoding table;
and encoding the request data with the corrected encoding table before transmitting it to the data reading requester.
Further, correcting the preset encoding table based on the adjusted undirected graph includes:
performing key-feature analysis on the adjusted undirected graph and generating a plurality of curves from the analysis results;
translating the generated curves into the same layer, in which a plurality of character-addition slots are preset;
acquiring the preset encoding table, which comprises an original-data column and a coded-character column whose characters form a one-to-one mapping;
adding the characters of the coded-character column, one by one and in order, into the character-addition slots of the layer;
reordering the characters based on the curves placed in the layer to obtain a reordered character sequence;
and substituting the characters of the reordered sequence, one by one and in order, back into the coded-character column of the preset encoding table to obtain the corrected encoding table.
The invention also provides a system for intelligently accelerating data reading of a solid state disk, comprising:
a construction unit, configured to read basic information of each data block in the solid state disk when a data reading request is received, and to construct an information fingerprint of each data block based on the basic information;
an exclusion unit, configured to construct an information fingerprint of the request data based on the data reading request, and to exclude unmatched data blocks using a bloom filter;
a locating unit, configured to compute the similarity between the information fingerprint of the request data and the information fingerprints of the remaining data blocks using a locality-sensitive hashing algorithm, and to locate the data block most similar to the request data;
and a reading unit, configured to read the request data from the solid state disk according to the location result.
The invention also provides a computer device comprising a memory and a processor, the memory storing a computer program; when executing the computer program, the processor implements the steps of any of the methods described above.
The invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of any of the methods described above.
In the method and system for intelligently accelerating data reading of a solid state disk, when a data reading request is received, basic information of each data block in the solid state disk is read and an information fingerprint of each data block is constructed based on it; an information fingerprint of the request data is constructed based on the data reading request, and a bloom filter is used to exclude unmatched data blocks; the similarity between the information fingerprint of the request data and the information fingerprints of the remaining data blocks is computed with a locality-sensitive hashing algorithm, locating the data block most similar to the request data; and the request data is read from the solid state disk according to the location result. By excluding unmatched data blocks with a bloom filter before the similarity computation, the invention overcomes the long read latencies and wasted system resources of current data-reading approaches.
Drawings
FIG. 1 is a schematic diagram of the steps of a method for intelligently accelerating data reading of a solid state disk according to an embodiment of the present invention;
FIG. 2 is a block diagram of a system for intelligently accelerating data reading of a solid state disk according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a computer device according to an embodiment of the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
The present invention is described in further detail below with reference to the drawings and embodiments, so that its objects, technical solutions and advantages become clearer. It should be understood that the specific embodiments described herein are illustrative only and are not intended to limit the scope of the invention.
Referring to FIG. 1, an embodiment of the present invention provides a method for intelligently accelerating data reading of a solid state disk, comprising the following steps:
Step S1: when a data reading request is received, read basic information of each data block in the solid state disk, and construct an information fingerprint of each data block based on the basic information;
Step S2: construct an information fingerprint of the request data based on the data reading request, and exclude unmatched data blocks using a bloom filter;
Step S3: compute the similarity between the information fingerprint of the request data and the information fingerprints of the remaining data blocks using a locality-sensitive hashing algorithm, and locate the data block most similar to the request data;
Step S4: read the request data from the solid state disk according to the location result.
In this embodiment, the objective of step S1 is to generate a representative information fingerprint for each data block in the solid state disk, so that data blocks can be identified and screened efficiently and accurately. Specifically, after a data reading request is received, the basic information of each data block in the solid state disk is read. This basic information typically covers the size of the data block, its storage location, creation time, modification time, and so on; it reflects the characteristics of the data block across multiple dimensions and provides the basis for constructing an accurate information fingerprint. A preset algorithm is then applied to the basic information to construct the fingerprint; for example, the basic information may be hashed so that the different data features map to a fixed-length hash value, which serves as the information fingerprint. Fingerprints built this way are distinctive and representative: even a slight change in a data block produces a clearly different fingerprint, which makes the subsequent comparison and screening fast.
In step S2, an information fingerprint of the request data is constructed based on the data reading request, and a bloom filter is used to exclude unmatched data blocks. The aim is to quickly discard the data blocks that clearly cannot match, reducing the workload of the subsequent similarity calculation and improving the overall efficiency of data reading. Specifically, from the relevant information in the data reading request, such as the type of the requested data, its range and its identifier, the information fingerprint of the request data is constructed with the same or a similar algorithm as in step S1; this ensures that the fingerprints of the request data and of the data blocks are comparable. A bloom filter is a highly space-efficient probabilistic data structure for testing whether an element belongs to a set. The information fingerprints of all data blocks are inserted into the bloom filter, and the fingerprint of the request data is then checked against it. If the bloom filter indicates that a data block's fingerprint does not match the request data, that block is excluded and does not participate in the subsequent similarity calculation. A bloom filter has a certain false positive rate (some data blocks that do not actually match may pass the filter and be treated as possible matches), but it never discards a true match, and the false positive rate can be reduced by choosing its parameters appropriately while preserving screening efficiency.
In step S3, the similarity between the information fingerprint of the request data and the information fingerprints of the remaining data blocks is computed with a locality-sensitive hashing algorithm, and the data block most similar to the request data is located. The purpose is to find, after the preliminary bloom filter screening, the data block that best matches the request, so that the data finally read is the data the user actually needs.
A locality-sensitive hashing algorithm is a special hash algorithm characterized by the property that, in a high-dimensional space, similar elements are mapped to the same hash bucket with high probability. With this algorithm, information fingerprints can be mapped into different hash buckets. The fingerprints of the request data and of the remaining data blocks are each put through the locality-sensitive hash, and their distributions across the hash buckets are compared. The similarity between two fingerprints is evaluated from indexes such as their degree of overlap in the buckets and their distance; the higher the similarity, the better the data block matches the request data. After similarity has been computed for all remaining data blocks, the block with the highest similarity is found and located as the data block most similar to the request data.
In step S4, based on the screening and location results of the previous steps, the data required by the user is read precisely from the solid state disk, completing the processing of the data reading request. Specifically, the data block located in step S3 is read from the solid state disk according to information such as its storage location. The read follows the solid state disk's read/write protocol and operating rules, ensuring that the data is read and transferred correctly. Once reading is complete, the data is returned to the user or system that initiated the request, finishing the whole data reading flow.
In summary, by constructing information fingerprints and applying a bloom filter together with a locality-sensitive hashing algorithm, this technical scheme achieves intelligently accelerated reading of solid state disk data and improves both the efficiency and the accuracy of data reading.
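Read end to end, steps S1 to S4 reduce to a short pipeline. The sketch below is purely illustrative: build_fingerprint, bloom_excludes, lsh_similarity, and the ssd and request objects are hypothetical stand-ins for the constructions detailed in the embodiments that follow.

```python
def read_request(ssd, request):
    # S1: fingerprint every data block from its basic information.
    fps = {blk.id: build_fingerprint(blk.basic_info()) for blk in ssd.blocks()}
    # S2: fingerprint the request data, then exclude clear mismatches.
    req_fp = build_fingerprint(request.info())
    remaining = {i: fp for i, fp in fps.items() if not bloom_excludes(req_fp, fp)}
    # S3: locality-sensitive-hash similarity over the remaining blocks.
    best = max(remaining, key=lambda i: lsh_similarity(req_fp, remaining[i]))
    # S4: read the located block from the solid state disk.
    return ssd.read_block(best)
```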
In an embodiment, constructing an information fingerprint of each data block based on the basic information includes:
converting the access frequency value and the creation time of the data block into binary sequences, and interleaving them bit by bit with the binary representation of the data block size to form an interleaved data sequence;
Gray-coding the storage location to obtain Gray-coded storage location data;
splicing the interleaved data sequence with the Gray-coded storage location data, and computing a spliced hash value of the spliced data using the FNV hash algorithm;
selecting a fixed-length data segment from the middle of the data block, and generating a segment hash value using the CRC32 algorithm;
and XORing the spliced hash value with the segment hash value to obtain the information fingerprint of the data block.
In this embodiment, the access frequency value, creation time, and size of a data block are all important basic information describing its characteristics. They are converted into binary sequences so that a computer can process and store them easily. Bit interleaving fuses the three kinds of information into one new sequence that combines several attributes of the data block, so subsequent computation on this sequence reflects the block's characteristics more completely. For example, if the access frequency value is 101 in binary, the creation time is 110 and the data block size is 011, then taking one bit from each sequence in turn yields the interleaved data sequence 110011101.
Gray code is an encoding in which adjacent code words differ in exactly one bit (for example, binary 011 maps to Gray code 010, via g = b XOR (b >> 1)). Gray-coding the storage location has two main advantages. First, it reduces the risk of misinterpretation caused by coding errors: if a single-bit error occurs during transmission or storage, the Gray-coded value changes only slightly and is less likely to be mistaken for a very different code. Second, Gray code has advantages in certain computations, simplifying them and improving efficiency. Gray coding therefore converts the storage location into a form better suited to the subsequent processing.
Splicing the interleaved data sequence with the Gray-coded storage location data combines the attribute information and the location information of the data block. The FNV hash algorithm is then used to compute the hash value of the spliced data. A hash algorithm maps data of arbitrary length to a fixed-length hash value, and FNV is efficient and distributes its outputs uniformly: it spreads the spliced data evenly over the hash space, so that different data blocks get different hash values while the hash values of similar data blocks retain some similarity, which facilitates the subsequent comparison and screening.
Furthermore, a fixed-length data segment from the middle of the data block is selected for hashing because the middle of a block is relatively stable and less affected by the noise or incomplete data that can occur at its beginning or end. CRC32 is a widely used cyclic redundancy check algorithm that quickly computes a 32-bit checksum over data; the segment hash value it produces serves as a feature of the data block's content and further enriches the information fingerprint.
Finally, exclusive-or is a logical operation whose result bit is 1 where the corresponding bits of the two operands differ and 0 where they are the same. XORing the spliced hash value with the segment hash value fuses and compresses the block's attributes, storage location, and content features into the final information fingerprint. This fingerprint integrates all of this information, is highly distinctive and representative, and supports the subsequent data block matching and similarity calculations that intelligently accelerate data reading from the solid state disk.
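This first construction can be condensed into a short routine. The Python sketch below fixes details the text leaves open, and these are assumptions: 32-bit input fields, round-robin interleaving order, the 32-bit FNV-1a variant, and a 64-byte middle segment.

```python
import zlib

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5
    for byte in data:
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h

def interleave_bits(*fields: int, width: int = 32) -> int:
    """Round-robin bit interleaving of equal-width integer fields."""
    out = 0
    for i in range(width):
        for f in fields:
            out = (out << 1) | ((f >> (width - 1 - i)) & 1)
    return out

def gray_encode(n: int) -> int:
    """Gray code: adjacent values differ in exactly one bit."""
    return n ^ (n >> 1)

def fingerprint_v1(access_freq: int, create_time: int, size: int,
                   location: int, block: bytes, frag_len: int = 64) -> int:
    interleaved = interleave_bits(access_freq, create_time, size)   # 96 bits
    spliced = interleaved.to_bytes(12, "big") + \
              gray_encode(location).to_bytes(4, "big")
    splice_hash = fnv1a_32(spliced)                                 # spliced hash value
    mid = max(0, len(block) // 2 - frag_len // 2)                   # middle segment
    frag_hash = zlib.crc32(block[mid:mid + frag_len]) & 0xFFFFFFFF  # segment hash value
    return splice_hash ^ frag_hash                                  # XOR fusion
```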
In an embodiment, constructing an information fingerprint of each data block based on the basic information may instead include:
splitting the size information of the data block into a plurality of sub-segments byte by byte, and applying binary differential coding to the sub-segments to obtain a size differential coding sequence;
applying spiral-scan coding to the storage location of the data block to generate a storage location spiral code;
splicing the size differential coding sequence with the storage location spiral code to obtain spliced coded data, and applying a first hash to it with the SipHash algorithm to generate a first hash value;
extracting the year, month, day and hour of the data block's creation time and encrypting them with a Caesar cipher whose shift amount is determined by the block's access frequency, obtaining encrypted time information;
and splicing the first hash value with the encrypted time information, running the Fletcher checksum algorithm over the spliced data, and taking the result as the information fingerprint of the data block.
In this embodiment, the data block size is one of the key features describing a block. Splitting the size information into bytes and applying binary differential coding highlights how the size varies at the byte level; compared with using the raw data directly, differential coding reduces redundancy while preserving the size characteristics through later operations, improving the distinctiveness of the information fingerprint.
Taking the size information of one data block as an example: its byte sequence is split in byte order, and each sub-segment is differentially coded in binary, that is, the binary difference between the current sub-segment and the previous one is computed and recorded. In this way the size changes between adjacent bytes are extracted and encoded, forming the size differential coding sequence.
The storage location of a data block in the solid state disk is another important attribute. Conventional linear coding cannot fully reflect the spatial distribution of storage locations, whereas spiral-scan coding converts two- or multi-dimensional location information into a one-dimensional code sequence with spatial continuity. This helps the information fingerprint reflect the spatial characteristics of where the block is stored and improves the fingerprint's ability to express positioning features. Taking a common two-dimensional solid state disk storage layout as an example: starting from the center of the storage area, or from some fixed point, the locations are scanned outward in a spiral and encoded in scan order, producing the storage location spiral code.
Splicing fuses the block's size features with its storage location features, so that the fingerprint generated later carries both attribute and positioning characteristics. The splice is simple and direct and combines the two kinds of feature code without losing the original feature information, laying the groundwork for a comprehensive fingerprint. Specifically, the size differential coding sequence and the storage location spiral code are joined head to tail in a fixed order to form one longer code sequence, the spliced coded data.
The SipHash algorithm is a fast and secure hash algorithm, and is particularly suitable for hash calculation of short data. The hash operation is carried out on the spliced encoded data, so that a longer encoded sequence can be compressed into a hash value with a fixed length, the data dimension is reduced, and the uniqueness and the collision resistance of the data are ensured. The first hash value initially refines key information in spliced encoded data, and provides a basis for further generating information fingerprints. The spliced encoded data is used as the input of SipHash algorithm, and a hash value with a fixed length, namely a first hash value, is generated through a series of operations in the algorithm.
The creation time and access frequency of a data block are also important features. Extracting the key part of the creation time and encrypting it with a Caesar cipher keyed by the block's access frequency increases the security of the information fingerprint, making the information harder to steal or tamper with, while folding the access frequency into the encryption lets the fingerprint reflect how actively the block is used. Specifically, the year, month, day and hour are extracted from the creation time; the Caesar shift amount is derived from the access frequency; and each character of the extracted time is shifted by that amount, giving the encrypted time information.
And splicing the first hash value obtained by the primary hash with the encrypted time information, and fusing different characteristics of the data block again. The Fletcher checksum algorithm is a quick checksum algorithm, and is used for calculating spliced data, so that the data can be further compressed, and the integrity and accuracy of the data can be ensured. The finally generated operation result is used as the information fingerprint of the data block, integrates the various characteristics of the size, the storage position, the creation time, the access frequency and the like of the data block, has higher uniqueness and identification degree, and is suitable for subsequent data matching and retrieval operation. Specifically, the first hash value and the encrypted time information are spliced to form new input data. And taking the input data as the input of the Fletcher checksum algorithm, and obtaining a fixed-length result through algorithm operation, wherein the result is the information fingerprint of each data block.
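A sketch of this second construction follows. The pieces the text does not pin down are chosen for illustration and are assumptions: an 8-byte size field, a clockwise spiral over an n x n location grid scanned inward from the top-left corner (the text also allows starting from the center), BLAKE2s as a stand-in for SipHash (CPython does not expose keyed SipHash directly), a digit-only Caesar cipher over a YYYYMMDDHH string, and the Fletcher-16 checksum.

```python
import hashlib

def byte_diffs(value: int, nbytes: int = 8) -> bytes:
    """Byte-wise binary differential coding of the size field."""
    bs = value.to_bytes(nbytes, "big")
    return bytes([bs[0]] + [(bs[i] - bs[i - 1]) % 256 for i in range(1, nbytes)])

def spiral_index(x: int, y: int, n: int) -> int:
    """Position of cell (x, y) along a clockwise spiral over an n x n grid."""
    k = min(x, y, n - 1 - x, n - 1 - y)      # ring containing the cell
    before = n * n - (n - 2 * k) ** 2        # cells on the outer rings
    m = n - 2 * k                            # side length of this ring
    r, c = x - k, y - k
    if r == 0:
        pos = c
    elif c == m - 1:
        pos = (m - 1) + r
    elif r == m - 1:
        pos = 2 * (m - 1) + (m - 1 - c)
    else:
        pos = 3 * (m - 1) + (m - 1 - r)
    return before + pos

def caesar_digits(text: str, shift: int) -> str:
    """Caesar cipher restricted to decimal digits."""
    return "".join(chr((ord(ch) - 48 + shift) % 10 + 48) for ch in text)

def fletcher16(data: bytes) -> int:
    s1 = s2 = 0
    for b in data:
        s1 = (s1 + b) % 255
        s2 = (s2 + s1) % 255
    return (s2 << 8) | s1

def fingerprint_v2(size: int, x: int, y: int, grid: int,
                   create_time: str, access_freq: int) -> int:
    spliced = byte_diffs(size) + spiral_index(x, y, grid).to_bytes(4, "big")
    first_hash = hashlib.blake2s(spliced, digest_size=8,
                                 key=b"ssd-fp").digest()     # SipHash stand-in
    enc_time = caesar_digits(create_time, access_freq % 10)  # e.g. "2025041109"
    return fletcher16(first_hash + enc_time.encode())
```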
In one embodiment, excluding unmatched data blocks using a bloom filter includes:
determining the number of hash functions and the bit-array size of the bloom filter with an optimal-parameter calculation model, according to the total number of data blocks in the solid state disk and the required false positive rate, and initializing the bloom filter;
splitting the information fingerprint of each data block into bytes and applying a cyclic left shift to each byte to generate a transformed fingerprint byte sequence, the shift amount being determined by the byte's position within the fingerprint;
hashing the transformed fingerprint byte sequence in parallel with a plurality of hash functions, taking each hash value modulo the length of the bloom filter's bit array to obtain the corresponding array index positions, and setting those positions in the bloom filter to 1;
obtaining the hash index positions of the request data;
and checking the values at the corresponding hash index positions in the bloom filter; if any position holds a 0, the corresponding data block is judged unmatched and excluded.
In this embodiment, the performance of a bloom filter depends on the number of hash functions and the size of its bit array, and the total number of data blocks on the solid state disk together with the acceptable false positive rate are the key factors that determine these two parameters. The optimal-parameter calculation model finds the values that best balance space utilization and accuracy for a given block count n and false positive rate p; in the standard model these are a bit-array size of m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2 hash functions. The bloom filter is then initialized from the computed parameters, that is, a bit array of the specified size is created with all bits set to 0. This provides the infrastructure and parameter configuration for the subsequent data block screening, so the bloom filter can work accurately and efficiently.
After the information fingerprint of a data block is split into bytes, each byte undergoes a cyclic left shift whose amount is determined by the byte's position within the fingerprint. This transformation adds randomness and diversity to the fingerprint so that the subsequent hash calculations spread more uniformly across the bloom filter's bit array. The transformed fingerprint byte sequence suits the various hash functions better, improving the uniformity of the hash values, reducing hash collisions, and raising the filter's judgment accuracy.
The transformed fingerprint byte sequence is then hashed in parallel with several hash functions, each producing one hash value. Each hash value is taken modulo the length of the bloom filter's bit array to obtain an array index position. Because there are several hash functions, one fingerprint corresponds to several index positions, and all of them are set to 1, recording that the fingerprint occupies those positions in the filter. Mapping every data block's fingerprint into the bit array in this way lets the bloom filter record the characteristics of each block for fast existence checks later.
The information fingerprint of the request data is processed in the same way: it is split into bytes, cyclically left-shifted, hashed in parallel with the same hash functions, and taken modulo the bit-array length, yielding the request data's hash index positions in the bloom filter. These are the positions to compare against the stored data block information.
The values at the request data's hash index positions in the bloom filter are then checked. If any position holds a 0, the corresponding data block's fingerprint was never set at that position, so by the bloom filter principle the block cannot match the request data and is excluded. By the nature of a bloom filter, if all probed positions are 1 the block may match, though not with certainty, whereas a single 0 means it definitely does not match. This exclusion step quickly removes the clearly unmatched blocks, cutting out unnecessary similarity calculations and improving the efficiency and speed of data reading.
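A compact sketch of this screening stage is given below. The optimal-parameter formulas are the standard ones; the k hash functions are emulated with double hashing over a single SHA-256 digest, and the per-byte rotation amount is taken as the byte's position modulo 8; these are choices the text leaves open.

```python
import hashlib
import math

class BlockBloom:
    def __init__(self, n_blocks: int, fp_rate: float):
        # Optimal-parameter model: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2.
        self.m = max(8, int(-n_blocks * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_blocks * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)   # initialized to all zeros

    @staticmethod
    def _transform(fp: bytes) -> bytes:
        # Cyclic left shift of each byte; shift amount set by byte position.
        return bytes((((b << (i % 8)) | (b >> (8 - i % 8))) & 0xFF)
                     for i, b in enumerate(fp))

    def _positions(self, fp: bytes) -> list[int]:
        # Double hashing emulates k parallel hash functions, each taken
        # modulo the bit-array length.
        d = hashlib.sha256(self._transform(fp)).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, fp: bytes) -> None:
        for p in self._positions(fp):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_match(self, fp: bytes) -> bool:
        # A 0 at any probed position means "definitely not present".
        return all(self.bits[p >> 3] >> (p & 7) & 1 for p in self._positions(fp))
```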
In an embodiment, computing the similarity between the information fingerprint of the request data and the information fingerprints of the remaining data blocks with a locality-sensitive hashing algorithm includes:
converting the information fingerprint of the request data and the information fingerprints of the remaining data blocks from the time domain to the frequency domain, forming corresponding frequency-domain fingerprint data;
determining the parameters of the hash function family in the locality-sensitive hashing algorithm, including the number of hash functions and the hash bucket size, with an adaptive parameter adjustment algorithm driven by the distribution characteristics of the frequency-domain fingerprint data, and optimizing the hash function family accordingly;
dividing the frequency-domain fingerprint data into a plurality of sub-fingerprint blocks, hashing each sub-fingerprint block with the optimized hash function family, and mapping the results into hash buckets;
comparing how the request data and each data block map into the hash buckets, and counting the number of sub-fingerprint blocks that land in the same bucket as a preliminary similarity index;
applying principal component analysis to the frequency-domain fingerprint data of the request data and of the data block to construct corresponding feature vectors, and taking the cosine similarity between the feature vectors as a supplementary similarity index;
and weighting and summing the preliminary similarity index and the supplementary similarity index to obtain the similarity between the request data and each remaining data block.
In this embodiment, data often exhibits characteristics and regularities in the frequency domain that differ from those in the time domain. Converting the information fingerprints of the request data and of the remaining data blocks from the time domain to the frequency domain, for example with a Fourier transform, extracts the frequency characteristics of the data, which can reflect its periodicity, volatility, and similar properties and help describe it more completely. Converting fingerprints into the frequency domain thus gives the subsequent locality-sensitive hashing algorithm a richer feature representation, letting the hash calculation operate on more discriminative frequency-domain features and improving the accuracy of the similarity calculation.
The distribution characteristics of the frequency domain fingerprint data (such as the distribution range of the frequency, the density degree, etc.) can affect the performance of the locally sensitive hash algorithm. The adaptive parameter adjustment algorithm dynamically determines the parameters of the hash function family, such as the number of hash functions and the size of the hash bucket, based on these distribution characteristics. The greater the number of hash functions, the higher the accuracy of hash computation may be, but the computation complexity may also increase, and the size of the hash bucket may affect the mapping and clustering effects of the data. By optimizing the parameters, the hash function family can be better adapted to the characteristics of frequency domain fingerprint data, and the sensitivity and accuracy of the algorithm are improved. In this embodiment, it is ensured that the local sensitive hash algorithm can achieve the best performance when processing the frequency domain fingerprint data, so that similar data can be mapped into the same hash bucket more accurately, and dissimilar data is mapped into different hash buckets as much as possible.
Further, the frequency domain fingerprint data is divided into a plurality of sub fingerprint blocks in order to perform finer processing on the data. Each sub-fingerprint block is subjected to hash calculation through an optimized hash function family, the sub-fingerprint blocks are mapped into a hash value by a hash function, and then the sub-fingerprint blocks are mapped into corresponding hash buckets according to the hash value. In this way, similar sub-fingerprint blocks are mapped into the same hash bucket with a high probability, so that preliminary judgment on data similarity is realized. The frequency domain fingerprint data is converted into the distribution condition in the hash bucket through hash calculation and mapping, so that the similarity of the data can be conveniently evaluated through comparing the mapping condition in the hash bucket.
If sub-fingerprint blocks of the request data and of a data block are mapped into the same bucket across several hash buckets, they have similar characteristics under those hash functions; the more sub-fingerprint blocks land in the same bucket, the more similar the two are likely to be. The count of co-located sub-fingerprint blocks therefore serves as a preliminary similarity index for quickly assessing similarity between the data. This gives a simple, intuitive measure that reflects the similarity between the request data and the data block to a degree, and provides the basis for the comprehensive similarity calculation that follows.
Principal component analysis (PCA) is a dimensionality-reduction technique that converts high-dimensional frequency-domain fingerprint data into a set of low-dimensional feature vectors that preserve the main information of the original data. The cosine similarity between the feature vectors of the request data and of the data block measures the angle between the two vectors: the smaller the angle, the more similar the vectors, and hence the more similar the request data and data block are in their frequency-domain features. This evaluates the similarity from another angle, compensating for the limitations of judging similarity from the hash-bucket mapping alone and supplying more complete similarity information.
The preliminary similarity index and the supplementary similarity index are then weighted and summed; the weights can be adjusted to the situation, for example according to the importance or reliability of the two indexes in different scenarios. Weighted summation takes both indexes into account and yields a more accurate, comprehensive similarity between the request data and each remaining data block, giving a more reliable basis for the subsequent data processing and decisions.
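A sketch of the similarity stage using numpy follows. The random-projection bucketing, band count, bucket width, weights, and the use of the raw magnitude spectrum in place of PCA feature vectors (PCA needs a population of fingerprints, not a single pair) are all illustrative assumptions; fingerprints are taken as equal-length byte strings.

```python
import numpy as np

def lsh_similarity(fp_req: bytes, fp_blk: bytes, n_bands: int = 8,
                   bucket_width: float = 4.0, w1: float = 0.6,
                   w2: float = 0.4) -> float:
    a = np.frombuffer(fp_req, dtype=np.uint8).astype(float)
    b = np.frombuffer(fp_blk, dtype=np.uint8).astype(float)
    # Time domain -> frequency domain (magnitude spectrum).
    fa, fb = np.abs(np.fft.rfft(a)), np.abs(np.fft.rfft(b))
    # Split into sub-fingerprint blocks; hash each into a bucket via a shared
    # random projection quantized by bucket_width.
    rng = np.random.default_rng(0)
    same_bucket = 0
    for pa, pb in zip(np.array_split(fa, n_bands), np.array_split(fb, n_bands)):
        proj = rng.standard_normal(len(pa))
        same_bucket += int(np.floor(proj @ pa / bucket_width)
                           == np.floor(proj @ pb / bucket_width))
    prelim = same_bucket / n_bands          # preliminary similarity index
    # Supplementary index: cosine similarity of the spectra.
    cos = float(fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-12))
    return w1 * prelim + w2 * cos           # weighted summation
```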
In an embodiment, after the request data is read from the solid state disk according to the location result, the method includes:
acquiring attribute information of the request data;
constructing an attribute-character undirected graph based on the characters in the attribute information, the characters serving as the nodes of the undirected graph and the nodes being connected by edges;
acquiring identification information of the solid state disk and adjusting the attribute-character undirected graph based on it, obtaining an adjusted undirected graph;
correcting a preset encoding table based on the adjusted undirected graph, obtaining a corrected encoding table;
and encoding the request data with the corrected encoding table before transmitting it to the data reading requester.
In this embodiment, the attribute information of the request data covers key features of the data such as its type, creator, and creation time. This information describes the data along several dimensions and is the basis for constructing the attribute-character undirected graph and for the subsequent processing; acquiring it gives deeper insight into the intrinsic characteristics of the request data and enables targeted handling. The various attributes associated with the request data are extracted from the metadata area where it is stored, or through a specific interface, and organized into a convenient format such as a structured table or key-value pairs.
An undirected graph shows the relations between the characters of the attribute information intuitively: with the characters as nodes and their associations as edges, the attribute features of the data are turned into a graph structure. This structure helps uncover latent relations among the data attributes; compared with a traditional linear data structure, an undirected graph expresses the inherent logic of the data more completely and clearly, offering a new perspective for later analysis and processing. Specifically, the characters in the attribute information are traversed and a node is created for each. In an embodiment, relations between characters, such as adjacent occurrence in the same attribute field or a specific semantic association, may also be analyzed; edges are created between the corresponding nodes based on these relations, finally yielding the attribute-character undirected graph, as sketched below.
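A minimal sketch of the graph construction, assuming the adjacency relation is "characters occur next to each other in the same attribute field", one of the relations the text suggests; the attribute dictionary is hypothetical.

```python
from collections import defaultdict

def attribute_char_graph(attributes: dict) -> dict:
    """Undirected graph as adjacency sets over attribute characters."""
    adj = defaultdict(set)
    for field in attributes.values():
        for a, b in zip(field, field[1:]):   # adjacent character pairs
            if a != b:
                adj[a].add(b)
                adj[b].add(a)
    return dict(adj)

# Hypothetical attribute information for one piece of request data.
graph = attribute_char_graph({"type": "log", "creator": "svc01"})
```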
The identification information of the solid state disk covers its hardware characteristics, manufacturer, usage state, and more. This information reflects the disk's particular properties, and adjusting the attribute-character undirected graph with it adapts the graph to the disk's characteristics and optimizes the data processing flow. For example, solid state disks from different manufacturers may differ in how they store and transfer data, and adjusting the graph for those differences can improve processing efficiency and accuracy. The identification information is obtained from the disk's firmware, device description files, and the like; an adjustment strategy is formulated for each type of identification information, for instance adjusting the importance of nodes according to storage capacity, which yields the adjusted undirected graph.
The preset encoding table is a set of rules for encoding data, while the adjusted undirected graph reflects the characteristics of the request data combined with those of the solid state disk. Correcting the encoding table against the adjusted graph makes the table fit the actual data better and optimizes the encoding; the corrected table can improve encoding efficiency and reduce transmission and storage costs while preserving data accuracy. Concretely, features of the nodes and edges in the adjusted graph, such as node importance and edge weight, are analyzed, and the rules of the preset table are modified accordingly, for example by adjusting character code lengths or changing the code order, producing a corrected encoding table with stronger uniqueness, privacy, and security.
Finally, the request data is encoded with the corrected encoding table, converting it into a form better suited to transmission and storage: the encoded data occupies less bandwidth in transit, improving transfer efficiency, and saves space at rest. The encoded data is then sent to the data reading requester, completing the whole data reading and processing flow and meeting the user's needs. Encoding proceeds character by character or byte by byte according to the rules of the corrected table, and the result is sent to the requester over a network interface or another data transmission channel.
In an embodiment, correcting the preset encoding table based on the adjusted undirected graph includes:
performing key-feature analysis on the adjusted undirected graph and generating a plurality of curves from the analysis results;
translating the generated curves into the same layer, in which a plurality of character-addition slots are preset;
acquiring the preset encoding table, which comprises an original-data column and a coded-character column whose characters form a one-to-one mapping;
adding the characters of the coded-character column, one by one and in order, into the character-addition slots of the layer;
reordering the characters based on the curves placed in the layer to obtain a reordered character sequence;
and substituting the characters of the reordered sequence, one by one and in order, back into the coded-character column of the preset encoding table to obtain the corrected encoding table.
In this embodiment, the key-feature analysis includes the following; see the sketch after this list for an illustration of the centrality measures.
Node centrality analysis: in the adjusted undirected graph, node centrality measures how important a node is within the graph. Common centrality indexes are degree centrality, closeness centrality, and betweenness centrality. Degree centrality is the number of edges incident to a node; the more edges, the more active the node and the more important the attribute information it represents. Closeness centrality measures the average shortest-path length from a node to all other nodes; the higher it is, the more quickly the node can reach the others. Betweenness centrality is the frequency with which a node appears on shortest paths between other nodes; nodes with high betweenness often play a key role in how information propagates through the graph.
Connectivity analysis: the connected components of the graph, that is, the sets of mutually reachable nodes, are analyzed. Finding the largest connected component identifies the core structure of the graph. The lengths of the paths between nodes can also be examined to gauge how easily information propagates: if the shortest path between two nodes is long, information passing between them must traverse many intermediate nodes and is transferred less efficiently.
Cluster analysis: the graph is checked for node clusters, groups of nodes that are tightly connected internally but only sparsely connected to other groups. Cluster analysis reveals the distinct information groupings present in the graph, each cluster representing a particular attribute category or information topic.
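As an illustration of the three centrality measures, the sketch below uses the networkx library on a toy attribute-character graph; the graph itself is hypothetical.

```python
import networkx as nx

# Hypothetical attribute-character undirected graph.
G = nx.Graph()
G.add_edges_from([("l", "o"), ("o", "g"), ("s", "v"), ("v", "c"),
                  ("c", "0"), ("0", "1"), ("o", "v")])

degree      = nx.degree_centrality(G)       # fraction of nodes each node touches
closeness   = nx.closeness_centrality(G)    # inverse average shortest-path length
betweenness = nx.betweenness_centrality(G)  # share of shortest paths through a node

# Nodes with the highest betweenness sit on the most shortest paths and tend
# to drive the curve generation described next.
print(max(betweenness, key=betweenness.get))
```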
When generating the curves, the key features obtained from the analysis first determine which curve types to use for visualization: a line chart can show how node centrality varies with node number, for example, while a histogram can present the distribution of closeness centrality. The analyzed feature data is then mapped onto the curve's axes; for instance, a curve of node centrality is drawn with the node number on the horizontal axis and the centrality value on the vertical axis. Several curves may be generated for several key features, each curve representing the variation of one feature.
Furthermore, in order to integrate multiple curves into the same layer, a uniform coordinate system needs to be determined first. The coordinate system can be set according to the data range and the actual requirement of the curves, so that all the curves can be reasonably displayed in the coordinate system. And translating each curve according to the position of the curve in the original coordinate system and the coordinate system requirement of the new layer. The purpose of the translation is to enable all curves to be clearly shown in the new layer and to maintain the relative positional relationship. The distance of translation may be determined by calculating the difference between the starting point of the curve and the desired starting point in the new layer.
A plurality of character addition bits are preset in the layer. In general they may be distributed uniformly over the blank areas of the layer, or positioned according to the trend of the curves and their key feature points. Each character addition bit is given an identifier, such as a number or a coordinate, so that encoded characters can later be added accurately to the corresponding positions, and a management mechanism is established to keep character addition accurate and orderly.
The preset encoding table is typically stored in a database, a configuration file or memory. Database storage suits large-scale encoding-table management and makes insertion, deletion and lookup convenient; configuration files are simple and easy to use and suit small encoding tables; memory storage improves access speed and suits scenarios with higher performance requirements. The table is read with the method matching its storage: SQL query statements if it resides in a database, file-read functions if it resides in a configuration file, and direct access to the in-memory data structure otherwise.
After the encoding table is acquired, the completeness of the mapping between the original data column and the coded character column must be checked: every character in the original data column must have a corresponding coded character, and every character in the coded character column must correspond to a character in the original data column. The consistency of the mapping must also be ensured, i.e. the two columns correspond one to one, with no duplicate or conflicting mappings.
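A minimal sketch of acquiring and checking the table, assuming a JSON configuration file with the hypothetical layout {"original": [...], "coded": [...]} (a database or an in-memory table would be read analogously):

    import json

    def load_coding_table(path: str):
        # Read the encoding table from a configuration file.
        with open(path, encoding="utf-8") as f:
            table = json.load(f)
        return table["original"], table["coded"]

    def mapping_is_valid(original: list, coded: list) -> bool:
        # Completeness and consistency of the one-to-one mapping.
        return (len(original) == len(coded)              # every char has a code
                and len(set(original)) == len(original)  # no duplicate originals
                and len(set(coded)) == len(coded))       # no conflicting codes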
Further, characters are added one by one to the character addition bits in the layer, following the order of the coded character column in the encoding table; this keeps the addition order consistent with the table and eases subsequent operation and management. Each character must be placed accurately at its character addition bit, which is located through the bit's identification information. For the characters to display clearly in the layer, a suitable font and size are set: the font is chosen for readability and appearance, and the size is adjusted to the scale of the layer and the size of the addition bits. In some embodiments, different colors and styles may also be applied as needed, to distinguish character types or to highlight key characters; for example, important coded characters may be bolded, italicized or given a different color.
A character ordering rule is then formulated from the key features of the adjustment undirected graph reflected by the curves in the layer. For example, if a curve reflects node importance, characters associated with important nodes may be ranked first; if a curve reflects information-propagation efficiency, characters on high-efficiency propagation paths may be ranked first. When multiple curves are present, their features are considered together: different weights may be assigned to different curve features, and the characters are then ordered by the weighted composite feature.
A common sorting algorithm, such as bubble sort, quicksort or merge sort, may be chosen to order the characters. Because the algorithms differ in time and space complexity, the choice depends on the number of characters and the performance requirements. The sorting can also be optimized using properties of the curve features; for example, if the features are largely monotonic, an algorithm such as insertion sort, which performs well on nearly ordered data, may be used.
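The weighted composite ordering might look as follows; the feature names, per-character values and weights are illustrative assumptions. Python's built-in sort (Timsort, a stable merge sort that exploits existing runs of order) already suits the character counts involved:

    # Per-character scores derived from the curves (assumed values).
    features = {
        "a": {"centrality": 0.9, "propagation": 0.4},
        "b": {"centrality": 0.2, "propagation": 0.8},
        "c": {"centrality": 0.6, "propagation": 0.6},
    }
    weights = {"centrality": 0.7, "propagation": 0.3}

    def composite(ch: str) -> float:
        # Weighted sum of the curve features for one character.
        return sum(weights[k] * v for k, v in features[ch].items())

    reordered = sorted(features, key=composite, reverse=True)  # ['a', 'c', 'b']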
The characters are then substituted into the coded character column of the preset encoding table in the order of the rearranged character sequence: starting from the first position of the column, the first character of the sequence is substituted, followed in turn by the characters at the subsequent positions. The encoding-table data must be updated promptly during replacement: with SQL UPDATE statements if the table is stored in a database, by rewriting the configuration file if it is stored there, or by modifying the in-memory data structure directly otherwise.
After the replacement is complete, it must be verified again that the mapping between the original data column and the coded character column of the corrected encoding table remains complete and consistent: every original data character has a corresponding coded character and no mappings conflict. The corrected encoding table is also verified functionally by running encoding and decoding operations on test data, confirming that it works correctly in actual use.
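A sketch of the substitution and the subsequent round-trip check, assuming the table fits in memory as a Python dict:

    def rebuild_coding_table(original: list, reordered_codes: list) -> dict:
        # Substitute the rearranged sequence position by position into the
        # coded column, then verify that the corrected table round-trips.
        corrected = dict(zip(original, reordered_codes))
        decode = {v: k for k, v in corrected.items()}
        sample = "".join(original)                        # functional test data
        encoded = "".join(corrected[c] for c in sample)
        assert "".join(decode[c] for c in encoded) == sample
        return corrected

    table = rebuild_coding_table(["a", "b", "c"], ["#", "%", "&"])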
In an embodiment, the correcting the preset encoding table based on the adjustment undirected graph to obtain a corrected encoding table includes:
Analyzing the weight distribution of the edges in the adjustment undirected graph, calculating the average weight, the maximum weight and the minimum weight of the edges, determining the weight distribution range, and normalizing the weights of the edges;
mapping the adjustment undirected graph into a low-dimensional vector space using a graph embedding technique to obtain an embedding vector for each node, constructing a feature association matrix with the original-data-column characters of the preset encoding table as rows and the node embedding vectors of the adjustment undirected graph as columns, and computing the similarity between each original-data-column character and each node embedding vector by cosine similarity to fill the matrix elements, which represent the degree of association between the characters and the nodes of the adjustment undirected graph;
summing each row of the feature association matrix to obtain an association score for each original-data-column character, and sorting the characters by score with the highest-scoring characters first;
and adjusting the coded character column correspondingly, ensuring a one-to-one mapping between the original data column and the coded character column, to obtain the corrected encoding table.
In this embodiment, node attribute analysis is performed on the adjustment undirected graph; attribute values such as degree, centrality and clustering coefficient reflect the importance and structural features of each node. The degree is the number of edges connected to a node: the higher it is, the more other nodes it links to and the more important it is. Centrality measures how central a node is within the graph; the commonly used betweenness and closeness centralities help identify nodes that play a key role in information or data transmission. The clustering coefficient describes how tightly a node's neighbors are connected and reflects the local aggregation of the graph. The extracted attribute values are normalized so that different attributes share a common interval, eliminating the influence of their differing scales and making them comparable. Each node can then be represented by a quantized attribute vector, which facilitates subsequent analysis and computation.
The weight distribution of the edges in the adjustment undirected graph is analyzed: the average, maximum and minimum edge weights are computed and the weight distribution range is determined. These parameters characterize the edges as a whole; the average weight reflects the overall level of the edge weights, while the maximum and minimum define their upper and lower bounds, giving a comprehensive picture of the weight distribution in the graph. The edge weights are then normalized to a unified value range so that different edges can be compared and processed on the same scale; after normalization, the weights more accurately reflect each edge's relative importance within the whole graph structure.
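One plausible realization, assuming min-max normalization over a networkx graph whose edges carry a 'weight' attribute (the embodiment does not fix the normalization formula):

    import networkx as nx

    def analyze_and_normalize(G: nx.Graph):
        # Average / maximum / minimum edge weight, then min-max
        # normalization of every weight into [0, 1] in place.
        w = [d["weight"] for _, _, d in G.edges(data=True)]
        avg, w_min, w_max = sum(w) / len(w), min(w), max(w)
        span = (w_max - w_min) or 1.0      # guard: all weights identical
        for _, _, d in G.edges(data=True):
            d["weight"] = (d["weight"] - w_min) / span
        return avg, w_min, w_max           # the weight distribution range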
Graph embedding maps the adjustment undirected graph into a low-dimensional vector space, converting complex graph-structured data into vector representations that a computer can process and analyze readily. The resulting embedding vector of each node preserves the node's structural and semantic information from the original graph in the low-dimensional space, so that similar nodes lie close together and dissimilar nodes lie far apart, providing an effective data representation for the subsequent analysis of the association between nodes and data-column characters.
A feature association matrix is constructed with the original-data-column characters of the preset encoding table as rows and the node embedding vectors of the adjustment undirected graph as columns. The purpose of the matrix is to relate the original-data-column characters to the nodes of the adjustment undirected graph: with characters and node vectors as the rows and columns respectively, the matrix elements can be filled by computing the similarity between each character and each node embedding vector, comprehensively describing the degree of association between them.
The similarity between each original-data-column character and each node embedding vector is computed by cosine similarity, and the matrix elements are filled accordingly. Cosine similarity is a common measure of how similar two vectors are: it is the cosine of the angle between them, ranging from -1 to 1, with values closer to 1 indicating greater similarity. Filled this way, each element of the feature association matrix accurately represents the degree of association between an original-data-column character and a node of the adjustment undirected graph, providing a quantitative basis for the subsequent analysis and ordering of the characters.
Each row of the feature association matrix is summed to obtain an association score for each original-data-column character. The score aggregates the character's association with all nodes of the adjustment undirected graph: the higher it is, the more closely the character relates to the solid-state-disk state and data-processing requirements that the graph reflects. Sorting the characters by score, highest first, prioritizes them by relevance and provides the basis for optimizing the encoding table.
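A compact sketch of the matrix construction, row summation and ranking with numpy; how the original-data-column characters are themselves vectorized is not specified above, so the sketch assumes their embedding vectors are already available:

    import numpy as np

    def rank_characters(char_vecs: dict, node_vecs: np.ndarray) -> list:
        # char_vecs: character -> embedding; node_vecs: (num_nodes, dim).
        chars = list(char_vecs)
        C = np.stack([char_vecs[c] for c in chars])       # rows: characters
        C_n = C / np.linalg.norm(C, axis=1, keepdims=True)
        N_n = node_vecs / np.linalg.norm(node_vecs, axis=1, keepdims=True)
        M = C_n @ N_n.T                  # cosine-similarity association matrix
        scores = M.sum(axis=1)           # row sums: association scores
        # Highest-scoring characters first.
        return [c for _, c in sorted(zip(scores, chars), reverse=True)]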
The original data column of the preset encoding table is rearranged according to the sorted character order, and the coded character column is adjusted correspondingly so that the one-to-one mapping between the two columns is preserved. The adjusted encoding table better fits the current state of the solid state disk and the data-processing requirements: placing the characters most closely tied to the disk state at the front makes encoding, decoding and transmission more efficient, improving the data-processing efficiency and performance of the whole system. The resulting corrected encoding table is thus an optimization obtained by deep analysis of the adjustment undirected graph combined with the original table, and better serves the data-processing tasks of the solid state disk.
In an embodiment, obtaining the identification information of the solid state disk, adjusting the attribute character undirected graph based on the identification information to obtain an adjusted undirected graph, including:
Screening key information from the obtained solid state disk identification information, and quantizing the key information into a numerical value for mathematical analysis;
adding the attribute character undirected graph and the characteristic change curve into one layer, and moving the characteristic change curve, stopping the movement when the number of edges of the attribute character undirected graph that intersect the curve reaches its maximum;
and computing the distance relation between each node of the attribute character undirected graph and the characteristic change curve, obtaining the type features of the characters on the nodes, and exchanging the characters between nodes according to the distance relation and the type features to obtain the adjustment undirected graph.
In this embodiment, screening key information from the identification information of the solid state disk is the starting point of the whole process. The identification information may include numerous items such as model, serial number, capacity, read-write speed, temperature, usage duration and remaining life. The screened key information is quantized into numerical values so that it can be calculated and analyzed accurately later. Different kinds of key information are quantized differently: read-write speed may be expressed in MB/s (megabytes per second) or GB/s (gigabytes per second), temperature in degrees Celsius (°C), usage duration in hours (h) and remaining life in percent (%). Quantization converts the key information into concrete values that are convenient for mathematical operation and comparison. A corresponding characteristic change curve is then drawn with time on the horizontal axis and the quantized value on the vertical axis, showing intuitively how the key information varies over time.
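A minimal quantization sketch; the field names, record layout and sample values are illustrative assumptions:

    def quantize_key_info(info: dict) -> dict:
        # Map screened identification fields to plain numbers
        # (MB/s, degrees Celsius, hours, percent).
        return {
            "read_speed_mb_s": float(info["read_speed"]),
            "temperature_c": float(info["temperature"]),
            "usage_hours": float(info["usage_hours"]),
            "remaining_life_pct": float(info["remaining_life"].rstrip("%")),
        }

    history = [
        {"read_speed": "520", "temperature": "41",
         "usage_hours": "1310", "remaining_life": "93%"},
        {"read_speed": "505", "temperature": "47",
         "usage_hours": "1320", "remaining_life": "92%"},
    ]
    series = [quantize_key_info(rec) for rec in history]
    # Plotting each field of `series` against time yields the change curves.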
The attribute character undirected graph and the characteristic change curve are added to the same layer so that they can be manipulated and analyzed in one space. The graph is a structure of nodes and edges, where nodes represent attribute characters and edges represent the relations between them, while the curve is a visual representation of the key information varying over time. Placing them in the same layer makes their spatial relationship easy to observe.
The characteristic change curve is moved, and the movement stops when the number of edges of the attribute character undirected graph intersecting the curve reaches its maximum. The rationale is that when the curve crosses the most edges, the variation of the key information is most closely correlated with the attribute-character relations that the graph represents; finding this best-matching state provides the basis for the subsequent adjustment of the characters on the nodes.
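One way to realize this search, sketched under the assumptions that the curve is a polyline of (x, y) points, that graph edges are given by the positions of their endpoints, and that the curve is shifted vertically over a set of candidate offsets:

    def _ccw(a, b, c):
        return (c[1] - a[1]) * (b[0] - a[0]) > (b[1] - a[1]) * (c[0] - a[0])

    def _segments_cross(p1, p2, p3, p4):
        # Proper (strict) intersection test for segments p1p2 and p3p4.
        return (_ccw(p1, p3, p4) != _ccw(p2, p3, p4)
                and _ccw(p1, p2, p3) != _ccw(p1, p2, p4))

    def best_offset(curve, edges, offsets):
        # Keep the vertical offset at which the shifted polyline
        # crosses the largest number of graph edges.
        def crossings(dy):
            pts = [(x, y + dy) for x, y in curve]
            return sum(_segments_cross(a, b, u, v)
                       for a, b in zip(pts, pts[1:]) for u, v in edges)
        return max(offsets, key=crossings)

    # Example: the offset 0.5 makes the curve cross the single edge twice.
    best = best_offset(curve=[(0, 0), (1, 1), (2, 0)],
                       edges=[((0, 1), (2, 1))],
                       offsets=[0.0, 0.5, 1.0])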
The distance relation between each node of the attribute character undirected graph and the characteristic change curve is computed. The distance may be the Euclidean distance, the Manhattan distance, or another measure chosen to suit the actual situation. A node's distance to the curve reflects how strongly the attribute character it carries relates to the change in the key information: the closer the node, the stronger the relation; the farther, the weaker.
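A sketch of the distance computation, approximating the curve by its sample points (an exact point-to-segment projection could replace the nearest-sample approximation):

    import numpy as np

    def node_curve_distance(p, curve, metric="euclidean") -> float:
        # Minimum distance from node position p to the sampled curve.
        p = np.asarray(p, dtype=float)
        pts = np.asarray(curve, dtype=float)
        if metric == "manhattan":
            d = np.abs(pts - p).sum(axis=1)
        else:
            d = np.linalg.norm(pts - p, axis=1)
        return float(d.min())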
The type features of the characters on each node are obtained; these may include a character's semantics, the category it belongs to, and so on. The characters are then exchanged between nodes according to the computed distance relations and the type features. The goal of the exchange is to make the structure of the undirected graph more reasonable and better reflect the actual state and performance of the solid state disk. For example, if a node lies close to the characteristic change curve and its character's type features indicate a close relation to the key information, that character may be swapped with characters on other nodes to optimize the graph structure. The resulting adjustment undirected graph reflects the relation between the disk's identification information and the attribute characters more accurately, supporting subsequent data processing and analysis more effectively.
In an embodiment, obtaining the identification information of the solid state disk, adjusting the attribute character undirected graph based on the identification information to obtain an adjusted undirected graph, including:
Screening key information from the obtained solid state disk identification information, and quantizing the key information into a numerical value for mathematical analysis;
extracting the characteristics of the generated characteristic change curve;
Establishing an association rule between the attribute character undirected graph and the characteristic change curve, and distributing corresponding adjustment parameters for nodes in the undirected graph according to the characteristics of the characteristic change curve;
and modifying the weight of the nodes and the connection strength of the edges in the undirected graph according to the adjustment parameters determined by the characteristics of the characteristic change curve.
In this embodiment, screening key information from the identification information of the solid state disk is the foundation of the whole process. The identification information typically covers hardware specifications (such as flash memory type and cache size), running state (such as read-write speed and temperature) and usage history (such as power-on time and erase count). The screened key information is quantized into numerical values in preparation for the subsequent mathematical processing: read-write speed may be expressed in MB/s or GB/s, temperature in degrees Celsius (°C), power-on time in hours (h) and erase count as a plain number. Quantization lets this information be processed and compared in a unified mathematical form. A corresponding characteristic change curve is then drawn with time on the horizontal axis and the quantized value on the vertical axis, showing intuitively how each item of key information trends over time.
Features are extracted from the characteristic change curve to obtain more valuable information from it. Common extracted features include the slope, peak, valley, fluctuation range and period of the curve. The slope reflects the rate of change of the key information (a large slope means fast change, a small one slow change); the peak and valley are its maximum and minimum over a period of time; the fluctuation range shows its stability (a small range indicates relatively stable performance, a large range unstable performance); and the period may reflect regular use of the solid state disk or periodic changes in its performance. These features describe the disk's performance state more precisely and supply the key parameters for establishing the association rules that follow.
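A sketch of extracting these features from a sampled curve with numpy; the dominant-period estimate assumes uniform sampling:

    import numpy as np

    def curve_features(t, v) -> dict:
        t, v = np.asarray(t, float), np.asarray(v, float)
        slope = np.gradient(v, t)              # local rate of change
        peak, valley = v.max(), v.min()
        fluctuation = peak - valley            # range of variation
        # Dominant period from the strongest non-DC FFT component.
        spectrum = np.abs(np.fft.rfft(v - v.mean()))
        freqs = np.fft.rfftfreq(len(v), d=t[1] - t[0])
        if len(spectrum) > 1:
            k = spectrum[1:].argmax() + 1
            period = 1.0 / freqs[k]
        else:
            period = None                      # curve too short
        return {"mean_slope": float(slope.mean()), "peak": float(peak),
                "valley": float(valley), "fluctuation": float(fluctuation),
                "period": period}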
Association rules between the attribute character undirected graph and the characteristic change curve are established to link the disk's performance information to the structure and attributes of the graph. The graph consists of nodes, which represent attribute characters, and edges, which represent the relations between them; establishing the rules requires considering the meaning of the curve together with that of the nodes and edges. For example, if the curve reflects changes in read-write speed, its change features may be associated with the nodes and edges related to data transmission; if it reflects temperature changes, those features may be associated with the nodes and edges related to hardware stability.
Each node of the undirected graph is assigned a corresponding adjustment parameter, such as a weight or priority, according to the features of the characteristic change curve. For example, if the slope of the read-write-speed curve is large and positive, indicating rapidly improving read-write performance, the weights of nodes related to read-write operations may be increased appropriately; if the temperature curve reaches a peak, indicating the hard disk may be in a high-temperature state, the weights of nodes related to hardware stability may be lowered. Assigning adjustment parameters in this way lets the structure and attributes of the graph better reflect the actual performance state of the solid state disk.
The node weights and edge connection strengths of the undirected graph are then modified according to the adjustment parameters determined by the curve features. A node's weight reflects its importance within the graph, and an edge's connection strength expresses how tightly its endpoints are related; modifying them tunes the structure and attributes of the graph to the current performance state of the disk. For example, raising the weights of nodes related to read-write operations focuses the graph on those nodes during data processing, and strengthening the edges related to data transmission makes data flow through the graph more smoothly. The resulting adjustment undirected graph reflects the disk's identification information more accurately and supports subsequent data processing and analysis more effectively.
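A minimal sketch of applying the adjustment parameters; the multiplicative update and the attribute names 'weight' and 'strength' are assumptions, since the text does not fix an update formula:

    import networkx as nx

    def apply_adjustments(G: nx.Graph, node_adjust: dict, edge_adjust: dict):
        # Scale node weights and edge connection strengths by the
        # adjustment parameters derived from the curve features.
        for n, factor in node_adjust.items():
            G.nodes[n]["weight"] = G.nodes[n].get("weight", 1.0) * factor
        for (u, v), factor in edge_adjust.items():
            if G.has_edge(u, v):
                G[u][v]["strength"] = G[u][v].get("strength", 1.0) * factor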
Referring to fig. 2, in an embodiment of the present invention, a system for intelligently accelerating data reading by a solid state disk is further provided, including:
The construction unit is used for reading the basic information of each data block in the solid state disk when receiving a data reading request, and constructing the information fingerprint of each data block based on the basic information;
An elimination unit for constructing an information fingerprint of the request data based on the data reading request, and eliminating the unmatched data blocks by using a bloom filter;
the positioning unit is used for performing similarity calculation between the information fingerprint of the request data and the information fingerprints of the data blocks remaining after exclusion by using a locality-sensitive hashing algorithm, and locating the data block most similar to the request data;
And the reading unit is used for reading the request data from the solid state disk according to the positioning result.
In this embodiment, for specific implementation of each unit in the above embodiment of the apparatus, please refer to the description in the above embodiment of the method, and no further description is given here.
Referring to fig. 3, in an embodiment of the present invention there is further provided a computer device, which may be a server, whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a display screen, an input device, a network interface and a database connected by a system bus, wherein the processor is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, a computer program and a database, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores the data used in this embodiment, and the network interface communicates with external terminals over a network connection. The computer program, when executed by the processor, implements the above method.
It will be appreciated by those skilled in the art that the architecture shown in fig. 3 is merely a block diagram of a portion of the architecture in connection with the present inventive arrangements and is not intended to limit the computer devices to which the present inventive arrangements are applicable.
An embodiment of the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above method. It is understood that the computer readable storage medium in this embodiment may be a volatile readable storage medium or a nonvolatile readable storage medium.
In summary, the method and system for intelligently accelerating data reading of a solid state disk provided by the embodiments of the invention comprise: reading basic information of each data block in the solid state disk when a data reading request is received, and constructing an information fingerprint of each data block based on the basic information; constructing an information fingerprint of the requested data based on the data reading request, and excluding unmatched data blocks with a bloom filter; performing similarity calculation between the information fingerprint of the requested data and the information fingerprints of the data blocks remaining after exclusion using a locality-sensitive hashing algorithm, and locating the data block most similar to the requested data; and reading the requested data from the solid state disk according to the locating result. By first excluding unmatched data blocks with a bloom filter and then locating the most similar data block through locality-sensitive similarity calculation, the invention overcomes the long latency and wasted system resources of current data reading.
Those skilled in the art will appreciate that implementing all or part of the above methods may be accomplished by a computer program stored on a non-volatile computer-readable storage medium which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database or other medium provided by the present invention and used in the embodiments may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. The volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element preceded by the phrase "comprising a/an" does not exclude the presence of other like elements in the process, apparatus, article or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes using the descriptions and drawings of the present invention or direct or indirect application in other related technical fields are included in the scope of the present invention.

Claims (8)

1. The method for intelligently accelerating data reading of the solid state disk is characterized by comprising the following steps of:
When a data reading request is received, basic information of each data block in the solid state disk is read, and information fingerprints of the data blocks are constructed based on the basic information;
Constructing an information fingerprint of the request data based on the data reading request, and removing unmatched data blocks by adopting a bloom filter;
calculating the similarity between the information fingerprint of the request data and the information fingerprints of the data blocks remaining after exclusion by using a locality-sensitive hashing algorithm, and locating the data block most similar to the request data;
And reading the request data from the solid state disk according to the positioning result.
2. The method for intelligently accelerating data reading by using a solid state disk according to claim 1, wherein the constructing an information fingerprint of each data block based on the basic information comprises the following steps:
converting the access frequency value and the creation time of the data block into a binary sequence, and interleaving the binary sequence with the binary representation of the size of the data block by bits to form an interleaved data sequence;
gray coding is carried out on the storage position, and storage position data after Gray coding is obtained;
Splicing the interleaved data sequence with the Gray-coded storage position data to obtain spliced data, and calculating a spliced hash value of the spliced data by using the FNV hash algorithm;
selecting a data segment with a fixed length in the middle part of the data block, and generating a segment hash value by adopting a CRC32 hash algorithm;
And performing exclusive OR operation on the spliced hash value and the fragment hash value to obtain the information fingerprint of each data block.
3. The method for intelligently accelerating data reading by using a solid state disk according to claim 1, wherein the constructing an information fingerprint of each data block based on the basic information comprises the following steps:
Dividing the size information of the data block into a plurality of sub-segments according to the byte unit, and adopting binary differential coding to each sub-segment to obtain a size differential coding sequence;
performing spiral scanning coding on the storage position of the data block to generate storage position spiral coding;
splicing the size difference coding sequence and the storage position spiral coding to obtain spliced coding data;
performing primary hash on the spliced encoded data by using SipHash algorithm to generate a first hash value;
Extracting the year, month, day and hour information of the creation time of the data block and carrying out Caesar encryption to obtain encrypted time information, wherein the shift amount is determined by the access frequency of the data block;
And splicing the first hash value with the encrypted time information, operating the spliced data by adopting a Fletcher checksum algorithm, and taking an operation result as an information fingerprint of each data block.
4. The method for intelligently accelerating data reading by a solid state disk according to claim 1, wherein the step of removing the unmatched data blocks by using a bloom filter comprises the steps of:
According to the total number of data blocks of the solid state disk and the required false-positive rate, determining the number of hash functions and the size of the bit array of the bloom filter using an optimal-parameter calculation model, and initializing the bloom filter;
Splitting the information fingerprint of each data block according to bytes, and performing cyclic left shift operation on each byte to generate a transformed fingerprint byte sequence, wherein the displacement bit number is determined by the position of the byte in the information fingerprint;
Performing parallel hash calculations on the transformed fingerprint byte sequence with a plurality of hash functions, taking each calculated hash value modulo the length of the bloom filter's bit array to obtain the corresponding array index position, and setting that index position of the bloom filter to 1;
Acquiring a hash index position of request data;
Checking the value of the corresponding hash index position in the bloom filter, and if the position with the value of 0 exists, judging the corresponding data block as unmatched and excluding the data block.
5. The method for intelligently accelerating data reading by a solid state disk according to claim 1, wherein the similarity calculation between the information fingerprint of the requested data and the information fingerprints of the data blocks remaining after exclusion using a locality-sensitive hashing algorithm comprises the following steps:
respectively converting the information fingerprint of the requested data and the information fingerprints of the data blocks remaining after exclusion from the time domain to the frequency domain to form corresponding frequency-domain fingerprint data;
according to the distribution characteristics of the frequency-domain fingerprint data, determining the parameters of the hash function family in the locality-sensitive hashing algorithm with an adaptive parameter-adjustment algorithm and optimizing the family, wherein the parameters comprise the number of hash functions and the size of the hash buckets;
dividing frequency domain fingerprint data into a plurality of sub fingerprint blocks, performing hash calculation on each sub fingerprint block through an optimized hash function family, and mapping calculation results into a hash bucket;
counting the number of sub fingerprint blocks appearing in the same hash bucket by comparing the mapping conditions of the request data and the data blocks in each hash bucket, and taking the sub fingerprint blocks as a preliminary similarity index;
performing principal component analysis on the frequency-domain fingerprint data of the requested data and of the data block respectively to construct corresponding feature vectors, and calculating the similarity between the feature vectors as a supplementary similarity index;
and carrying out weighted summation on the preliminary similarity index and the supplementary similarity index to obtain the similarity of the request data and the excluded data block.
6. The method for intelligently accelerating data reading by using a solid state disk according to claim 1, wherein after the request data is read from the solid state disk according to the positioning result, the method comprises the following steps:
Acquiring attribute information of the request data;
constructing an attribute character undirected graph based on characters in the attribute information, wherein the characters in the attribute information are used as nodes of the undirected graph, and the nodes are connected through edges;
acquiring identification information of the solid state disk, and adjusting the attribute character undirected graph based on the identification information to obtain an adjusted undirected graph;
correcting a preset coding table based on the adjustment undirected graph to obtain a corrected coding table;
and encoding the request data based on the correction encoding table, and then transmitting the encoded request data to a data reading requester.
7. The method for intelligently accelerating data reading of a solid state disk according to claim 6, wherein the correcting the preset encoding table based on the adjustment undirected graph to obtain the corrected encoding table comprises:
performing key feature analysis on the adjustment undirected graph, and generating a plurality of curves based on analysis results;
translating the generated curves into the same layer, wherein a plurality of character adding bits are preset in the layer;
The method comprises the steps of obtaining a preset encoding table, wherein the preset encoding table comprises an original data column and an encoding character column, and characters in the original data column and characters in the encoding character column form one-to-one mapping;
Sequentially adding characters in the coded character column into character adding bits in the layer one by one;
Reordering each character based on a plurality of curves added in the layer to obtain a reordered character sequence;
and substituting the characters of the rearranged character sequence one by one, in sequence, into the coded character column of the preset encoding table to obtain the corrected encoding table.
8. The system for intelligently accelerating data reading of the solid state disk is characterized by comprising the following components:
The construction unit is used for reading the basic information of each data block in the solid state disk when receiving a data reading request, and constructing the information fingerprint of each data block based on the basic information;
An elimination unit for constructing an information fingerprint of the request data based on the data reading request, and eliminating the unmatched data blocks by using a bloom filter;
the positioning unit is used for performing similarity calculation between the information fingerprint of the request data and the information fingerprints of the data blocks remaining after exclusion by using a locality-sensitive hashing algorithm, and locating the data block most similar to the request data;
And the reading unit is used for reading the request data from the solid state disk according to the positioning result.
CN202510410399.9A 2025-04-02 2025-04-02 Method and system for intelligently accelerating data reading of solid state hard disk Pending CN120255816A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510410399.9A CN120255816A (en) 2025-04-02 2025-04-02 Method and system for intelligently accelerating data reading of solid state hard disk


Publications (1)

Publication Number Publication Date
CN120255816A true CN120255816A (en) 2025-07-04

Family

ID=96192523

Country Status (1)

Country Link
CN (1) CN120255816A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination