
CN118551414B - File management method and system based on big data - Google Patents


Info

Publication number
CN118551414B
Authority
CN
China
Prior art keywords
data
desensitization
archive
attribute
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411025089.7A
Other languages
Chinese (zh)
Other versions
CN118551414A (en)
Inventor
莫伊南
付赫然
陈延
封占江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Construction And Development Group Co ltd
Original Assignee
Tianjin Construction And Development Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Construction And Development Group Co ltd filed Critical Tianjin Construction And Development Group Co ltd
Priority to CN202411025089.7A priority Critical patent/CN118551414B/en
Publication of CN118551414A publication Critical patent/CN118551414A/en
Application granted granted Critical
Publication of CN118551414B publication Critical patent/CN118551414B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/64Protecting data integrity, e.g. using checksums, certificates or signatures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/098Distributed learning, e.g. federated learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Bioethics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a file management method and system based on big data, relating to the technical field of file management. The method comprises: acquiring archive data, and identifying sensitive attributes in the archive data by an automatic sensitive attribute extraction method; taking the obtained sensitive attributes, attribute value distribution characteristics and risk assessment matrix as input, performing privacy disclosure risk quantification scoring with a pre-trained risk assessment model to obtain a privacy disclosure risk value for each piece of archive data, automatically generating a corresponding data desensitization strategy according to the privacy disclosure risk value, and triggering a dynamic update of the data desensitization strategy when the privacy disclosure risk value changes, thereby obtaining initial desensitization data; and performing association consistency verification on the initial desensitization data, correcting the updated data desensitization strategy according to the verification result to obtain a corrected data desensitization strategy, and desensitizing the archive data again with the corrected data desensitization strategy, realizing secure management of the archive data.

Description

File management method and system based on big data
Technical Field
The present invention relates to a file management technology, and in particular, to a file management method and system based on big data.
Background
With the rapid development of information technology, the degree of archive digitalization keeps rising, and the collection, storage and utilization of archival information face growing privacy leakage risks. How to make full use of archive information while effectively preventing the leakage of sensitive information has become a key problem to be solved in archive management.
At present, archive information desensitization mainly relies on manual identification of sensitive data and manual formulation of desensitization strategies, which suffers from low efficiency, strong subjectivity and policy lag. By comparison, intelligent desensitization based on automation technology has better development prospects. Some prior art attempts to implement desensitization with simple techniques such as data masking and encryption, but these methods cannot dynamically adjust policies according to changes in data content and usage scenarios, and may break the semantic relevance of the data to some extent, affecting the quality of subsequent data use. Other methods consider data semantic protection, but suffer from a single evaluation dimension, low evaluation model precision and an inability to adapt to different application scenarios, so the overall desensitization effect still needs improvement.
Therefore, a new intelligent archive desensitization method is needed to meet the new requirements of archive management in a big data environment and to improve desensitization quality and data availability, thereby realizing privacy protection and secure management of archive information.
Disclosure of Invention
The embodiment of the invention provides a file management method and system based on big data, which can solve the problems in the prior art.
In a first aspect of an embodiment of the present invention,
Provided is an archive management method based on big data, comprising the following steps:
Acquiring file data, identifying sensitive attributes in the file data by adopting an automatic sensitive attribute extraction method, carrying out attribute value distribution analysis on the value range distribution of each sensitive attribute to obtain attribute value distribution characteristics reflecting the sparseness degree of the attribute values, and simultaneously carrying out data use environment assessment on the use environment of the file data to obtain a risk assessment matrix of the data use environment;
The obtained sensitive attribute, attribute value distribution characteristics and risk assessment matrix are used as input, a pre-trained risk assessment model is utilized to conduct privacy disclosure risk quantification scoring, a privacy disclosure risk value of each archive data is obtained, a corresponding data desensitization strategy is automatically generated according to the privacy disclosure risk value, in the data desensitization process, the change of the privacy disclosure risk value is continuously monitored, when the privacy disclosure risk value is changed, the dynamic update of the data desensitization strategy is triggered, and the archive data is desensitized by the updated data desensitization strategy, so that initial desensitization data are obtained;
And carrying out association consistency verification on the initial desensitization data, correcting the updated data desensitization strategy according to a verification result until the preset semantic association maintenance degree is met, obtaining a corrected data desensitization strategy, carrying out desensitization processing on the archive data again by utilizing the corrected data desensitization strategy, and realizing the safety management on the archive data.
In an alternative embodiment of the present invention,
Identifying sensitive attributes in the archive data by adopting an automatic sensitive attribute extraction method, carrying out attribute value distribution analysis on the value range distribution of each sensitive attribute, and obtaining attribute value distribution characteristics reflecting the sparseness of the attribute values comprises the following steps:
encrypting the sensitive attribute in the identified archive data to obtain encrypted attribute value data;
distributing the encrypted attribute value data to all the participants of the distributed data source by adopting a federal learning framework, locally training a local model by all the participants by using the encrypted attribute value data, aggregating the parameters of the local model by a safe aggregation algorithm, and repeating iteration until a preset termination condition is met to obtain a final attribute value distribution model;
Converting the obtained attribute value distribution model into a garbled circuit, providing encrypted attribute value data by each participant, reconstructing the secret sharing share of the encrypted attribute value data through a secure computing protocol, and completing attribute value distribution prediction cooperatively by each participant based on the garbled circuit and the reconstructed attribute value secret sharing share to obtain the final attribute value distribution characteristics.
In an alternative embodiment of the present invention,
The parameters of the local model are aggregated through a secure aggregation algorithm, and the global model is updated according to the following formula:

$$w_{t+1} = w_t + \gamma\left(\frac{1}{s}\sum_{j=1}^{s}\frac{1}{K}\sum_{k=1}^{K}\frac{\Delta w_k^{(j)}}{\max\left(1,\ \|\Delta w_k^{(j)}\|_2/\lambda\right)} + \mathcal{N}\left(0,\ \sigma^2 I\right)\right)$$

wherein $w_{t+1}$ represents the model parameters after the (t+1)-th iteration update, $w_t$ represents the model parameters after the t-th iteration update, $t$ represents the number of iterations, $\gamma$ represents the learning rate, $s$ represents the total number of secret shares, $j$ represents the secret share index, $K$ represents the total number of participants, $k$ indexes the k-th participant, $\Delta w_k^{(j)}$ represents the local model parameter update value, $\mathcal{N}(\cdot)$ represents Gaussian noise, $\sigma^2$ represents the noise variance, $I$ represents the identity matrix, $\lambda$ represents the gradient clipping threshold, and $\|\cdot\|_2$ represents the L2 norm of the parameter update value.
In an alternative embodiment of the present invention,
In the data desensitization process, continuously monitoring the change of the privacy disclosure risk value, triggering the dynamic update of the data desensitization strategy when the privacy disclosure risk value is changed, and desensitizing the archive data by utilizing the updated data desensitization strategy, wherein the obtaining of the initial desensitization data comprises the following steps:
acquiring an original data set and a desensitized data set, calculating the difference between the original data distribution and the desensitized data distribution, introducing attribute importance weight to carry out weighted aggregation on the information loss of an attribute level, and obtaining an information loss measurement;
Performing data mining tasks on the original data set and the desensitized data set respectively to obtain performance scores of the data mining tasks, and combining the performance scores of the data mining tasks to obtain utility loss scores of the desensitized data;
combining the information loss measurement with the utility loss score of the desensitized data to obtain comprehensive data utility loss;
Constructing a multi-objective optimization model aiming at minimizing privacy disclosure risk values and minimizing comprehensive data utility loss, iteratively updating desensitization parameters by adopting a gradient descent method to obtain optimal desensitization parameters, desensitizing archive data by utilizing the optimal desensitization parameters, continuously monitoring data environment changes and privacy disclosure risk values, and triggering desensitization parameter tuning when data distribution changes or the privacy disclosure risk values exceed a preset threshold value;
and acquiring the latest original data set and privacy disclosure risk value, re-executing the desensitization parameter optimization step, updating the optimal desensitization parameters, and desensitizing the archive data by utilizing the updated optimal desensitization parameters to obtain initial desensitization data.
In an alternative embodiment of the present invention,
The construction of the multi-objective optimization model targeting minimization of the privacy disclosure risk value and minimization of the comprehensive data utility loss comprises:
the objective function of the multi-objective optimization model is:

$$J(\theta) = R(D, \theta) + \sum_{a=1}^{m}\delta_a\sum_{x \in X} P(x)\log\frac{P(x)}{Q(x)} + \sum_{b=1}^{n}\alpha_b\left(\mathrm{Perf}(D, T_b) - \mathrm{Perf}(D', T_b)\right)$$

where $J$ represents the objective function, $R(\cdot)$ represents the privacy leakage risk value, $D$ represents the original data set, $\theta$ represents the desensitization parameters, $m$ represents the number of attributes, $\delta_a$ represents the importance weight of the a-th attribute, $X$ represents the set of all data points, $P(x)$ represents the probability distribution of data point $x$ in the original data set, $Q(x)$ represents the probability distribution of data point $x$ in the desensitized data set, $n$ represents the number of data mining tasks, $\alpha_b$ represents the importance weight of the b-th data mining task, $\mathrm{Perf}(\cdot)$ represents the performance score, $D'$ represents the desensitized data set, and $T_b$ represents the b-th data mining task.
In an alternative embodiment of the present invention,
Carrying out association consistency verification on the initial desensitization data, correcting the updated data desensitization strategy according to a verification result until a preset semantic association maintenance degree is met, and obtaining the corrected data desensitization strategy comprises the following steps:
Preprocessing file data, recognizing the preprocessed file data by adopting a named entity recognition method to obtain key entities, extracting semantic relations among the key entities, and constructing entity relation triples of the file data;
Designing an archive data domain ontology using an ontology construction method, mapping the entity relation triples of the archive data into the archive data domain ontology, and forming an archive data knowledge graph represented by resource description framework triples;
mapping the initial desensitization data into the archive data knowledge graph, generating a desensitized data subgraph, and constructing an association consistency metric by calculating the structural similarity and semantic similarity between the archive data knowledge graph and the desensitized data subgraph;
And identifying key factors causing the association distortion according to the association consistency measurement index, determining the association distortion type according to the key factors causing the association distortion, designing corresponding data desensitization strategy correction rules aiming at different association distortion types, correcting the updated data desensitization strategy according to the desensitization strategy correction rules until the preset semantic association maintenance degree is met, and obtaining the corrected data desensitization strategy.
In an alternative embodiment of the present invention,
The association consistency metric constructed by calculating the structural similarity and the semantic similarity between the archive data knowledge graph and the desensitized data subgraph is:

$$C(G, G') = \eta \cdot \min_{P}\sum_{(u,v)\in P} c(u,v) + \mu \cdot \frac{1}{h\,h'}\sum_{p=1}^{h}\sum_{q=1}^{h'}\cos\left(e_p, e'_q\right)$$

wherein $C(G, G')$ represents the association consistency metric, $G$ represents the archive data knowledge graph, $G'$ represents the desensitized data subgraph, $\eta$ represents the balance factor of the structural similarity, $P$ represents the set of node alignments, $c(u, v)$ represents the cost required to align node $u$ with node $v$, $\mu$ represents the balance factor of the semantic similarity, $h$ represents the number of nodes of the archive data knowledge graph, $h'$ represents the number of nodes of the desensitized data subgraph, $e_p$ represents the embedding vector of node $p$ in the archive data knowledge graph, and $e'_q$ represents the embedding vector of node $q$ in the desensitized data subgraph.
In a second aspect of an embodiment of the present invention,
Provided is a big data based archive management system including:
The first unit is used for acquiring the archive data, identifying sensitive attributes in the archive data by adopting a sensitive attribute automatic extraction method, carrying out attribute value distribution analysis on the value range distribution of each sensitive attribute to obtain attribute value distribution characteristics reflecting the sparseness degree of the attribute values, and simultaneously carrying out data use environment assessment on the use environment of the archive data to obtain a risk assessment matrix of the data use environment;
The second unit is used for taking the obtained sensitive attribute, attribute value distribution characteristics and risk assessment matrix as input, carrying out privacy disclosure risk quantification scoring by utilizing a pre-trained risk assessment model to obtain a privacy disclosure risk value of each archive data, automatically generating a corresponding data desensitization strategy according to the privacy disclosure risk value, continuously monitoring the change of the privacy disclosure risk value in the data desensitization process, triggering the dynamic update of the data desensitization strategy when the privacy disclosure risk value changes, and carrying out desensitization on the archive data by utilizing the updated data desensitization strategy to obtain initial desensitization data;
and the third unit is used for carrying out association consistency verification on the initial desensitization data, correcting the updated data desensitization strategy according to a verification result until the preset semantic association maintenance degree is met, obtaining the corrected data desensitization strategy, carrying out desensitization processing on the archive data again by utilizing the corrected data desensitization strategy, and realizing the safety management on the archive data.
In a third aspect of an embodiment of the present invention,
There is provided an electronic device including:
A processor;
A memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method described previously.
In a fourth aspect of an embodiment of the present invention,
There is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method as described above.
In this embodiment, sensitive attributes in the archive data can be automatically identified with the automatic sensitive attribute extraction method, avoiding the inefficiency and omissions of manual identification. By evaluating the distribution of sensitive attribute values and the data use environment, combined with a pre-trained risk assessment model, the privacy leakage risk value of each piece of archive data can be quantified, providing a basis for formulating the subsequent desensitization strategy. According to the estimated privacy disclosure risk value, the system automatically generates a corresponding data desensitization strategy, avoiding the subjectivity and inefficiency of manually formulated strategies. The method continuously monitors changes in the data environment; once the privacy disclosure risk value changes, a dynamic update of the desensitization strategy is triggered, ensuring its timeliness and effectiveness. Through association consistency verification of the desensitized data and correction of the desensitization strategy, the semantic associations of the data are preserved to the greatest extent, improving the availability of the desensitized data. Together, sensitive information identification, quantified privacy risk assessment, automatic desensitization strategy generation and dynamic optimization updating greatly improve the automation, intelligence and adaptability of archive data desensitization, as well as the efficiency and quality of archive data security management.
Drawings
FIG. 1 is a flow chart of a file management method based on big data according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a file management system based on big data according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The technical scheme of the invention is described in detail below by specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 1 is a flow chart of an archive management method based on big data according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
S101, acquiring archive data, identifying sensitive attributes in the archive data by adopting an automatic sensitive attribute extraction method, carrying out attribute value distribution analysis on the value range distribution of each sensitive attribute to obtain attribute value distribution characteristics reflecting the sparseness degree of the attribute values, and simultaneously carrying out data use environment assessment on the use environment of the archive data to obtain a risk assessment matrix of the data use environment.
Here, the automatic sensitive attribute extraction method is generally ontology-based. An ontology is a formal description of real-world concepts, including the concepts themselves, the relationships between them, and their attributes. An ontology library covering the common sensitive information types in the archive field (such as names, identification card numbers and telephone numbers) is constructed, the archive data to be processed is matched against the ontology library, and sensitive attribute instances in the data are identified and extracted.
The value domain distribution analysis counts the occurrence frequency distribution of each sensitive attribute value over the whole data set. Attribute value distribution characteristics can be described by various statistics, such as the maximum/minimum frequency, median, quartiles and Gini coefficient. These characteristics reflect the sparseness of the attribute values: for example, the higher the maximum frequency and the larger the Gini coefficient, the more concentrated the value distribution and the higher the risk of an individual being identified.
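For illustration, the following Python sketch (not part of the claimed method; all names are illustrative) computes distribution features of the kind described above — per-value frequencies and a normalized Gini coefficient as a sparseness indicator — for one sensitive attribute column:

```python
# A minimal sketch of value-domain distribution analysis for one
# sensitive attribute: frequency statistics plus a normalized Gini
# coefficient (0 = uniform, 1 = fully concentrated).
from collections import Counter

def distribution_features(values):
    freq = Counter(values)                 # occurrence frequency per value
    counts = sorted(freq.values())         # ascending, as Gini requires
    total = sum(counts)
    probs = [c / total for c in counts]
    n = len(probs)
    # normalized Gini over the frequency distribution
    gini = (sum((2 * i - n + 1) * p for i, p in enumerate(probs)) / (n - 1)
            if n > 1 else 0.0)
    return {"max_freq": max(counts) / total,
            "min_freq": min(counts) / total,
            "distinct": n,
            "gini": gini}

print(distribution_features(["teacher", "teacher", "teacher", "doctor", "nurse"]))
```

A high `max_freq` or `gini` flags a concentrated value distribution and hence a higher re-identification risk, matching the criterion stated above.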
In an alternative embodiment of the present invention,
Identifying sensitive attributes in the archive data by adopting an automatic sensitive attribute extraction method, carrying out attribute value distribution analysis on the value range distribution of each sensitive attribute, and obtaining attribute value distribution characteristics reflecting the sparseness of the attribute values comprises the following steps:
encrypting the sensitive attribute in the identified archive data to obtain encrypted attribute value data;
distributing the encrypted attribute value data to all the participants of the distributed data source by adopting a federal learning framework, locally training a local model by all the participants by using the encrypted attribute value data, aggregating the parameters of the local model by a safe aggregation algorithm, and repeating iteration until a preset termination condition is met to obtain a final attribute value distribution model;
Converting the obtained attribute value distribution model into a garbled circuit, providing encrypted attribute value data by each participant, reconstructing the secret sharing share of the encrypted attribute value data through a secure computing protocol, and completing attribute value distribution prediction cooperatively by each participant based on the garbled circuit and the reconstructed attribute value secret sharing share to obtain the final attribute value distribution characteristics.
Illustratively, the original attribute value data is first encrypted or perturbed, for example by homomorphic encryption or differential privacy. Homomorphic encryption allows computation to be performed directly on encrypted data, and the decrypted result is consistent with the result of the same computation on the original data. Common homomorphic encryption schemes include partially homomorphic and fully homomorphic encryption; in attribute value distribution modeling, homomorphic encryption can be used to encrypt sensitive data, ensuring privacy security of the data throughout the machine learning process. Differential privacy provides a quantifiable privacy protection guarantee by introducing random noise into the data: it ensures that the presence or absence of any individual's data has only a limited influence on the model output, thereby protecting individual privacy. Common differential privacy mechanisms include the Laplace mechanism and the Exponential mechanism. In attribute value distribution modeling, differential privacy can be used to perturb attribute values, protecting individual privacy while retaining the statistical characteristics of the data.
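As a concrete illustration of the Laplace mechanism mentioned above (the privacy parameters are illustrative assumptions, not values from the patent), a differentially private frequency count can be released as follows:

```python
# A minimal sketch of the Laplace mechanism: calibrated noise with scale
# sensitivity/epsilon bounds the influence of any single individual.
import numpy as np

def laplace_count(true_count, epsilon=1.0, sensitivity=1.0, rng=None):
    """Release a differentially private count under privacy budget epsilon."""
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(0.0, sensitivity / epsilon)

print(laplace_count(120))  # noisy frequency for one attribute value
```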
The encrypted attribute value data is distributed to the participants of the distributed data source under a federated learning framework. Each participant holds only part of the data and does not exchange local data. Each participant locally trains a local model using the encrypted attribute value data, with common attribute value distribution modeling methods including: histogram statistics, which bins the attribute values and counts the frequency in each bin to obtain an attribute value distribution histogram; kernel density estimation, which smoothly estimates the attribute value distribution with a kernel function to obtain a continuous probability density function; and a Gaussian mixture model, which fits the attribute value distribution with a weighted mixture of several Gaussian distributions. The parameters of the local models are aggregated through a secure aggregation algorithm (e.g., secure multi-party computation, differential privacy) on the premise of protecting each party's data privacy, and the global model is updated. These steps are repeated until a preset termination condition is met (e.g., a maximum number of iteration rounds or global model convergence), yielding the final attribute value distribution model.
The trained attribute value distribution model is converted into a garbled circuit (Garbled Circuit). A garbled circuit allows multiple parties to jointly compute a function without revealing their input data. For example, the attribute value distribution model may be expressed as a function that takes attribute values as input and outputs the corresponding distribution features. Specifically, the function is represented as a Boolean circuit composed of a series of logic gates (AND, OR, NOT, etc.), which can implement arbitrarily complex functions. The Boolean circuit is topologically sorted to determine the computation order of the logic gates, and the garbled circuit is generated from it: for each logic gate, a corresponding garbled table is generated by randomly permuting the rows of its truth table and encrypting them with randomly generated keys. The garbled tables ensure the privacy of the computation process. The garbled circuit is distributed to the parties as the basis for subsequent collaborative computation.
Each participant provides the encrypted attribute value data, and the secret sharing share of the attribute value data is reconstructed through a secure computing protocol. Secret sharing allows a secret value to be split into multiple shares, each held by a different party, from which a single party cannot recover the original secret value. Specifically, the participants perform secret sharing on the respective held encrypted attribute value data. Common secret sharing schemes include Shamir secret sharing, additive secret sharing, and the like. Secret sharing shares of attribute value data are reconstructed by interactive computation among the parties through secure computation protocols, such as secure multiparty computation, threshold secret sharing, and the like. Each party holds a secret share of attribute value data, which is subsequently used for collaborative computing.
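The following is a minimal sketch of additive secret sharing over a prime field, one of the schemes named above (the field modulus and function names are illustrative, not the patent's protocol):

```python
# Additive secret sharing: a value is split into shares so that no single
# participant learns it, while the modular sum of all shares reconstructs it.
import secrets

PRIME = 2**61 - 1  # illustrative field modulus

def share(secret, n_parties):
    """Split `secret` into n additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)  # last share fixes the sum
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

s = share(42, 3)
assert reconstruct(s) == 42
```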
Based on the garbled circuit and the reconstructed secret shares of the attribute values, the participants cooperatively complete the secure prediction of the attribute value distribution. First, each participant inputs its own secret share of the attribute values into the garbled circuit. The participants then cooperatively compute the output of each logic gate according to the computation order of the garbled circuit: for each gate, the participants look up the corresponding garbled table according to their own input shares to obtain an encrypted output, and decrypt it through a secure computation protocol such as oblivious transfer (Oblivious Transfer) to obtain the plaintext form of the gate output. These steps are repeated until the garbled circuit evaluation is complete, yielding secret shares of the final attribute value distribution prediction result.
And aggregating the secret sharing shares of the attribute value distribution prediction result obtained by cooperative calculation to obtain the final attribute value distribution characteristics. Alternatively, the aggregation process may be implemented by a secure computing protocol, such as a secure summation protocol, a threshold secret reconstruction, or the like. The aggregated attribute value distribution characteristics can be used for subsequent data desensitization decisions.
In an alternative embodiment of the present invention,
The parameters of the local model are aggregated through a secure aggregation algorithm, and the global model is updated according to the following formula:

$$w_{t+1} = w_t + \gamma\left(\frac{1}{s}\sum_{j=1}^{s}\frac{1}{K}\sum_{k=1}^{K}\frac{\Delta w_k^{(j)}}{\max\left(1,\ \|\Delta w_k^{(j)}\|_2/\lambda\right)} + \mathcal{N}\left(0,\ \sigma^2 I\right)\right)$$

wherein $w_{t+1}$ represents the model parameters after the (t+1)-th iteration update, $w_t$ represents the model parameters after the t-th iteration update, $t$ represents the number of iterations, $\gamma$ represents the learning rate, $s$ represents the total number of secret shares, $j$ represents the secret share index, $K$ represents the total number of participants, $k$ indexes the k-th participant, $\Delta w_k^{(j)}$ represents the local model parameter update value, $\mathcal{N}(\cdot)$ represents Gaussian noise, $\sigma^2$ represents the noise variance, $I$ represents the identity matrix, $\lambda$ represents the gradient clipping threshold, and $\|\cdot\|_2$ represents the L2 norm of the parameter update value.
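A plausible numpy rendering of this aggregation rule is sketched below (function and parameter names are illustrative; `updates` holds the per-share-group, per-participant local updates):

```python
# A minimal sketch of the secure aggregation update: per-participant updates
# are L2-clipped to threshold lam, averaged over K participants and s
# secret-share groups, perturbed with Gaussian noise N(0, sigma^2 I), and
# applied with learning rate gamma.
import numpy as np

def secure_aggregate(w_t, updates, gamma=0.1, lam=1.0, sigma=0.01, rng=None):
    """updates: list of s groups, each a list of K local update vectors."""
    rng = rng or np.random.default_rng(0)
    acc = np.zeros_like(w_t)
    s = len(updates)
    for group in updates:                          # secret-share group j
        K = len(group)
        for dw in group:                           # participant k
            norm = np.linalg.norm(dw)              # ||dw||_2
            acc += dw / max(1.0, norm / lam) / K   # gradient clipping + mean
    acc /= s
    noise = rng.normal(0.0, sigma, size=w_t.shape) # DP Gaussian noise
    return w_t + gamma * (acc + noise)
```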
In this embodiment, the original sensitive attribute values are encrypted and distributed to multiple participants, and a secure computing protocol is adopted, so that attribute value distribution feature extraction can be completed without revealing original data of any party, and data privacy is fundamentally protected. The attribute value distribution characteristics are extracted and modeled as a machine learning task, and a federal learning framework of iterative training is adopted, so that multi-source data can be fully utilized, and the generalization capability and the prediction precision of the model can be improved. The data and the model are stored in a distributed mode on a plurality of participating nodes, so that the risk of single-point data or model leakage is reduced. Each participant under the federal learning framework can train the local model in parallel, so that the overall training speed is increased; the secure multiparty computation also supports parallelization, and ciphertext computation can be efficiently completed. The privacy protection technology is combined with the attribute value distribution feature extraction, so that win-win of model precision and privacy protection is realized, and meanwhile, the privacy protection method has the advantages of high efficiency, expandability, risk reduction and the like, and is favorable for realizing self-adaptive management of file data privacy.
S102, taking the obtained sensitive attribute, attribute value distribution characteristics and risk assessment matrix as input, carrying out privacy disclosure risk quantification scoring by utilizing a pre-trained risk assessment model to obtain a privacy disclosure risk value of each archive data, automatically generating a corresponding data desensitization strategy according to the privacy disclosure risk value, continuously monitoring the change of the privacy disclosure risk value in the data desensitization process, triggering the dynamic update of the data desensitization strategy when the privacy disclosure risk value is changed, and desensitizing the archive data by utilizing the updated data desensitization strategy to obtain initial desensitization data.
Before performing privacy disclosure risk quantification scoring with the pre-trained risk assessment model, the model must first be trained. Specifically, a certain number of real archive data samples are collected and each is labeled with a privacy disclosure risk level; the labeled data serve as the training data set of the risk assessment model and should be collected from multiple data sources to cover as many different data distributions and privacy risk situations as possible. The raw data in the training set are preprocessed (e.g., missing value filling, outlier removal, feature encoding), and relevant features such as sensitive attributes and attribute value distribution characteristics are extracted as model input. A suitable machine learning model is then selected, such as logistic regression, a decision tree or a neural network, with the extracted features as input and the privacy risk level as output, and reasonable hyperparameters (e.g., regularization coefficients, number of layers) are set according to the model type. The preprocessed training data are fed into the constructed model; a loss function is computed from the model output and the true labels, model parameters are updated iteratively by an optimization algorithm such as gradient descent, and overfitting is prevented by means such as cross-validation. Model performance is evaluated on a held-out test set with indicators such as accuracy and F1 score; performance under different data distributions is visualized, the generalization ability of the model is analyzed, and the model and features are further adjusted and optimized according to the evaluation results.
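As an illustrative sketch of this training procedure (stand-in random data; scikit-learn is chosen only as an example library, not named by the patent), a multi-class risk classifier could be fit as follows:

```python
# A minimal sketch of training the privacy-leakage risk assessment model:
# features stand in for sensitive-attribute indicators, distribution
# statistics and environment-risk scores; labels are annotated risk levels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.random((500, 6))             # stand-in feature matrix
y = rng.integers(0, 3, 500)          # stand-in labels: 0 low / 1 medium / 2 high

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)
print("macro-F1:", f1_score(y_te, model.predict(X_te), average="macro"))
```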
And taking the sensitive attribute, the attribute value distribution characteristic and the risk assessment matrix as inputs, inputting the inputs into a risk assessment model, and generating a privacy leakage risk score for each archive data by comprehensively considering the factors.
The privacy disclosure risk scores are divided into several risk classes, such as low, medium and high risk, and a corresponding desensitization strategy template is designed in advance for each class; a suitable template is then automatically selected according to the privacy risk score of each piece of data. Changes in the factors that affect the privacy risk score, such as the data distribution and the external environment, are monitored periodically or in real time; once significant changes are found, the privacy risk score of each piece of data is recalculated. For data whose privacy risk score has changed, the risk level is updated, a suitable desensitization strategy template is selected according to the new level, and the data desensitization strategy is updated. The archive data are then desensitized according to the new strategy (e.g., masking, noise addition, generalization, replacement), and a new initial desensitization data set is obtained after desensitization is completed.
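A minimal sketch of this risk-level-to-template mapping follows (thresholds and template fields are illustrative assumptions, not values from the patent):

```python
# Map a quantified privacy risk score in [0, 1] to a pre-designed
# desensitization strategy template by thresholding into risk classes.
TEMPLATES = {
    "low":    {"method": "mask_partial", "granularity": "county"},
    "medium": {"method": "generalize",   "granularity": "city"},
    "high":   {"method": "encrypt",      "granularity": "record"},
}

def select_template(risk_score, low=0.3, high=0.7):
    if risk_score < low:
        return TEMPLATES["low"]
    if risk_score < high:
        return TEMPLATES["medium"]
    return TEMPLATES["high"]

print(select_template(0.82))  # -> {'method': 'encrypt', 'granularity': 'record'}
```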
In an alternative embodiment of the present invention,
In the data desensitization process, continuously monitoring the change of the privacy disclosure risk value, triggering the dynamic update of the data desensitization strategy when the privacy disclosure risk value is changed, and desensitizing the archive data by utilizing the updated data desensitization strategy, wherein the obtaining of the initial desensitization data comprises the following steps:
acquiring an original data set and a desensitized data set, calculating the difference between the original data distribution and the desensitized data distribution, introducing attribute importance weight to carry out weighted aggregation on the information loss of an attribute level, and obtaining an information loss measurement;
Performing data mining tasks on the original data set and the desensitized data set respectively to obtain performance scores of the data mining tasks, and combining the performance scores of the data mining tasks to obtain utility loss scores of the desensitized data;
combining the information loss measurement with the utility loss score of the desensitized data to obtain comprehensive data utility loss;
Constructing a multi-objective optimization model aiming at minimizing privacy disclosure risk values and minimizing comprehensive data utility loss, iteratively updating desensitization parameters by adopting a gradient descent method to obtain optimal desensitization parameters, desensitizing archive data by utilizing the optimal desensitization parameters, continuously monitoring data environment changes and privacy disclosure risk values, and triggering desensitization parameter tuning when data distribution changes or the privacy disclosure risk values exceed a preset threshold value;
and acquiring the latest original data set and privacy disclosure risk value, re-executing the desensitization parameter optimization step, updating the optimal desensitization parameters, and desensitizing the archive data by utilizing the updated optimal desensitization parameters to obtain initial desensitization data.
Illustratively, the original non-desensitized archive data set and the desensitized archive data set are first acquired. The data distributions of the two data sets are compared using KL divergence to compute their distribution difference. When computing the difference, the importance weight of each attribute is introduced, and the information losses at the different attribute levels are weighted and aggregated to obtain an overall information loss metric.
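The weighted attribute-level information loss described here can be sketched as follows (weights and distributions are illustrative stand-ins):

```python
# A minimal sketch: per-attribute KL divergence between original
# distribution P and desensitized distribution Q, weighted by the
# attribute importance weights delta_a and summed.
import math

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def information_loss(P_attrs, Q_attrs, weights):
    """P_attrs/Q_attrs: per-attribute probability vectors; weights: delta_a."""
    return sum(w * kl(p, q) for w, p, q in zip(weights, P_attrs, Q_attrs))

loss = information_loss(
    P_attrs=[[0.6, 0.4], [0.5, 0.3, 0.2]],
    Q_attrs=[[0.5, 0.5], [0.4, 0.4, 0.2]],
    weights=[0.7, 0.3],
)
print(round(loss, 4))
```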
To evaluate the utility of desensitized data in an actual data mining task, a data mining task validation mechanism is introduced. Typical data mining tasks related to the data application scene, such as classification, clustering, association rule mining and the like, are selected, mining algorithms are respectively executed on the original data and the desensitized data, and differences of mining results are compared.
Data mining tasks, such as pattern recognition, predictive analysis, etc., are performed on the raw data set and the desensitized data set, respectively, and performance scores of each task on both data sets are evaluated and given. Utility loss scores for the desensitized data sets relative to the original data sets are calculated and derived based on the performance scores of the data mining tasks. And combining the obtained information loss measurement with the utility loss score to calculate the comprehensive data utility loss value of the desensitization data set.
A multi-objective optimization model is constructed, the objective of which is to simultaneously minimize the privacy disclosure risk value and the comprehensive data utility loss value. And adopting an optimization algorithm such as a gradient descent method and the like to iteratively adjust and update the desensitization parameters until an optimal desensitization parameter combination capable of achieving two targets is found. And performing desensitization treatment on the archive data by using the obtained optimal desensitization parameters to obtain initial desensitization data. The change of the data environment and the change of the privacy disclosure risk value are continuously monitored. And triggering the readjustment of the desensitization parameters once the data distribution is found to have significant change or the privacy disclosure risk value exceeds a preset threshold.
And acquiring the latest original data set and the current privacy disclosure risk value, repeatedly executing the steps, and optimizing the desensitization parameters again. And (3) performing desensitization treatment on the archive data by using the optimized and updated optimal desensitization parameters to obtain a new initial desensitization data set. Optionally, the desensitization strategy is continuously monitored and dynamically adjusted, so that the desensitized data always keep low privacy disclosure risk and high data utility.
In an alternative embodiment of the present invention,
The construction of the multi-objective optimization model targeting minimization of the privacy disclosure risk value and minimization of the comprehensive data utility loss comprises:
the objective function of the multi-objective optimization model is:

$$J(\theta) = R(D, \theta) + \sum_{a=1}^{m}\delta_a\sum_{x \in X} P(x)\log\frac{P(x)}{Q(x)} + \sum_{b=1}^{n}\alpha_b\left(\mathrm{Perf}(D, T_b) - \mathrm{Perf}(D', T_b)\right)$$

where $J$ represents the objective function, $R(\cdot)$ represents the privacy leakage risk value, $D$ represents the original data set, $\theta$ represents the desensitization parameters, $m$ represents the number of attributes, $\delta_a$ represents the importance weight of the a-th attribute, $X$ represents the set of all data points, $P(x)$ represents the probability distribution of data point $x$ in the original data set, $Q(x)$ represents the probability distribution of data point $x$ in the desensitized data set, $n$ represents the number of data mining tasks, $\alpha_b$ represents the importance weight of the b-th data mining task, $\mathrm{Perf}(\cdot)$ represents the performance score, $D'$ represents the desensitized data set, and $T_b$ represents the b-th data mining task.
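The following sketch evaluates this objective for one candidate desensitization parameter setting (all inputs are stand-ins supplied by the surrounding pipeline; names are illustrative):

```python
# A minimal sketch of the objective J: privacy risk plus weighted
# information loss plus weighted utility loss over the data mining tasks.
def objective(risk, info_loss, perf_orig, perf_desens, task_weights):
    utility_loss = sum(a * (po - pd)
                       for a, po, pd in zip(task_weights, perf_orig, perf_desens))
    return risk + info_loss + utility_loss

J = objective(risk=0.21, info_loss=0.05,
              perf_orig=[0.91, 0.84],     # Perf(D, T_b)
              perf_desens=[0.88, 0.80],   # Perf(D', T_b)
              task_weights=[0.6, 0.4])    # alpha_b
print(round(J, 4))  # smaller is better; theta is tuned by gradient descent
```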
In this embodiment, by continuously monitoring the change of the data environment and the change of the privacy disclosure risk value and dynamically adjusting the desensitization policy, even under the condition that the data distribution, the external environment and the like change, appropriate privacy protection can be continuously provided for the data, and the robustness of the privacy protection is improved. By combining the privacy disclosure risk value and the data utility loss measurement, a multi-objective optimization model is constructed, and by searching the optimal desensitization parameter, the privacy is protected to the greatest extent, the utility value of the data is reserved as far as possible, and the balance between the privacy disclosure risk value and the data utility loss measurement is realized. By automatically evaluating the indexes such as information loss measurement, utility loss score and the like and automatically adjusting the desensitization parameters based on the optimization model, the requirement of manual participation is reduced, and the efficiency is improved. And the latest original data set and privacy disclosure risk value can be obtained in real time, and the desensitization parameters are re-optimized, so that the desensitization strategy can be automatically adjusted along with the change of the data, and the adaptability of the desensitization strategy is improved. By quantifying the privacy and utility loss into measurable indicators, a clear evaluation standard is provided for the optimization of the data desensitization strategy, which is beneficial to model optimization and strategy selection. Privacy protection and desensitization can be rapidly and efficiently performed on large-scale archive data, and data processing efficiency is improved.
S103, carrying out association consistency verification on the initial desensitization data, correcting the updated data desensitization strategy according to a verification result until a preset semantic association maintenance degree is met, obtaining a corrected data desensitization strategy, carrying out desensitization processing on the archive data again by using the corrected data desensitization strategy, and realizing safety management on the archive data.
Illustratively, a data mining algorithm such as frequent item set or association rule mining is first used to discover association rules among archive data attributes. For example, there may be an association rule "occupation = teacher ⇒ education = bachelor's degree or above", indicating that the teacher occupation is strongly correlated with holding a bachelor's degree or above.
When attribute values are desensitized, the association rules between attributes are given priority. For strongly associated attribute combinations, a coordinated desensitization strategy is adopted so that the desensitized attribute values still satisfy the original association rules. For example, the "occupation" and "education" attributes can be desensitized with the same generalization hierarchy, generalizing "teacher" to "education industry practitioner" and "bachelor's degree" to "higher education".
After data desensitization is completed, association consistency verification is performed on the desensitization result. Association rules are re-mined from the desensitized data and compared with those of the original data to evaluate the degree to which semantic associations are maintained. If the post-desensitization rules differ substantially from the original rules, the desensitization strategy needs further adjustment until an acceptable level of semantic association maintenance is reached.
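A minimal sketch of this rule-retention check follows (rules and the score interpretation are illustrative assumptions):

```python
# Compare rule sets mined before and after desensitization and compute a
# simple retention ratio as the semantic-association maintenance score.
def rule_retention(rules_original, rules_desensitized):
    """Rules are (antecedent, consequent) pairs; returns retained fraction."""
    kept = set(rules_original) & set(rules_desensitized)
    return len(kept) / len(rules_original) if rules_original else 1.0

orig = {("occupation=teacher", "education=bachelor_or_above"),
        ("dept=finance", "clearance=high")}
after = {("occupation=education_worker", "education=higher_education"),
         ("dept=finance", "clearance=high")}
print(rule_retention(orig, after))  # 0.5 -> below threshold, so correct strategy
```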
In an alternative embodiment of the present invention,
Carrying out association consistency verification on the initial desensitization data, correcting the updated data desensitization strategy according to a verification result until a preset semantic association maintenance degree is met, and obtaining the corrected data desensitization strategy comprises the following steps:
Preprocessing file data, recognizing the preprocessed file data by adopting a named entity recognition method to obtain key entities, extracting semantic relations among the key entities, and constructing entity relation triples of the file data;
Designing an archive data domain ontology using an ontology construction method, mapping the entity relation triples of the archive data into the archive data domain ontology, and forming an archive data knowledge graph represented by resource description framework triples;
mapping the initial desensitization data into the archive data knowledge graph, generating a desensitized data subgraph, and constructing an association consistency metric by calculating the structural similarity and semantic similarity between the archive data knowledge graph and the desensitized data subgraph;
And identifying key factors causing the association distortion according to the association consistency measurement index, determining the association distortion type according to the key factors causing the association distortion, designing corresponding data desensitization strategy correction rules aiming at different association distortion types, correcting the updated data desensitization strategy according to the desensitization strategy correction rules until the preset semantic association maintenance degree is met, and obtaining the corrected data desensitization strategy.
Illustratively, to characterize semantic associations between archive data, the archive data is first preprocessed and feature extracted using natural language processing techniques. And a named entity recognition method is adopted to recognize key entities in the text, such as person names, place names, organization names and the like. And then extracting semantic relations among the entities by utilizing dependency syntactic analysis and coreference resolution technology to construct entity relation triples.
On the basis of extracting the entity and the relation, the ontology construction method is utilized to design the ontology in the archive data field. The ontology is described in OWL (Web Ontology Language) language, defining the core concepts, attributes and relationships between concepts in the profile data. And constructing an ontology framework covering main characteristics of the archive data by combining manual definition and expert knowledge introduction.
The extracted entities and relations are mapped into the ontology to form the archive data knowledge graph. The archive knowledge graph is represented as RDF (Resource Description Framework) triples, each consisting of a subject, a predicate and an object, corresponding respectively to an entity, a relation and an entity. The entity relation triples are mapped into RDF triples, organizing the archive data into a semantically associated network that forms the archive knowledge graph.
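For illustration, the mapping into RDF triples could be sketched with the rdflib library as follows (the namespace and entities are invented examples, not data from the patent):

```python
# A minimal rdflib sketch: entity-relation triples become RDF triples
# (subject, predicate, object) in an archive knowledge graph.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/archive/")
g = Graph()
g.bind("ex", EX)

triples = [("ZhangSan", "worksFor", "FinanceDept"),
           ("FinanceDept", "partOf", "HeadOffice")]
for s, p, o in triples:
    g.add((EX[s], EX[p], EX[o]))

print(g.serialize(format="turtle"))
```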
Mapping the desensitized archive data to a knowledge graph to generate a desensitized data subgraph. And matching the entities and the relations in the desensitization data with the nodes and the edges in the knowledge graph to realize the mapping of the desensitization data to the knowledge graph.
The structural similarity between the original archive data knowledge graph and the desensitized data subgraph is calculated. Graph edit distance is used to measure the structural difference between the two graphs, computing the cost of the atomic operations (insertion, deletion, substitution) required to transform the archive data knowledge graph into the desensitized data subgraph.
The semantic similarity between the original archive data knowledge graph and the desensitized data subgraph is then calculated. Knowledge graph embedding techniques map the entities and relations of the knowledge graph into a low-dimensional continuous vector space, yielding distributed representations of entities and relations; the semantic similarity between the archive data knowledge graph and the desensitized data subgraph in the embedding space is then computed with measures such as cosine similarity. Specifically, a structural similarity weight factor and a semantic similarity weight factor are set, along with the other related parameters. For the archive data knowledge graph and the desensitized data subgraph, a node alignment set is found that minimizes the total alignment cost or distance over the aligned node pairs; summing the costs or distances of all node pairs gives the structural similarity measure. For each node of the archive data knowledge graph and each node of the desensitized data subgraph, the cosine similarity of their node embedding vectors is computed, and the cosine similarities of all node pairs are summed to give the semantic similarity measure.
The structural similarity and the semantic similarity are then combined in weighted form to construct the association consistency metric. Balance factors are introduced to control the respective weights of the structural and semantic similarity within the metric, so that the influence of the desensitization operation on the data associations can be evaluated quantitatively.
Key factors causing association distortion are identified by analyzing the correlation between the association consistency metric and the desensitization strategy parameters. Correlation analysis methods such as the Pearson correlation coefficient or the Spearman rank correlation coefficient are used to compute the correlation between the metric and factors such as desensitization granularity, desensitization algorithm type, and algorithm parameter settings; the main factors influencing association distortion are determined from the results.
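A sketch of this correlation screening with scipy.stats; the paired factor/metric observations in the usage example are hypothetical:

```python
# Screening desensitization parameters for correlation with the
# association consistency metric (Pearson and Spearman coefficients).
from scipy.stats import pearsonr, spearmanr

def screen_factor(factor_values, consistency_values):
    r_pearson, p_pearson = pearsonr(factor_values, consistency_values)
    r_spearman, p_spearman = spearmanr(factor_values, consistency_values)
    return {"pearson": (r_pearson, p_pearson),
            "spearman": (r_spearman, p_spearman)}

# e.g. granularity levels 1..5 vs. observed consistency metric values
print(screen_factor([1, 2, 3, 4, 5], [0.92, 0.88, 0.79, 0.71, 0.60]))
```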
The dominant association distortion type is determined from these key factors. Common types include: semantic loss due to excessive desensitization, where the desensitization granularity is too coarse or the desensitization algorithm too crude, reducing the semantic expressiveness of the desensitized data and losing associated semantics; privacy disclosure due to insufficient desensitization, where the granularity is too fine or the algorithm too weak, leaving residual sensitive information and a risk of privacy leakage; and association distortion due to desensitization inconsistency, where inconsistent desensitization strategies across attributes destroy the association relations among the attributes.
For each distortion type, a corresponding data desensitization strategy correction rule is designed. Semantic loss caused by excessive desensitization can be corrected by reducing the desensitization granularity or selecting a semantics-preserving desensitization algorithm, for example adjusting the granularity from the "city" level to the "county" level, or switching the algorithm from random substitution to homophonic substitution. Privacy leakage caused by insufficient desensitization can be corrected by increasing the desensitization granularity or strengthening the algorithm, for example adjusting the granularity from the "county" level to the "city" level, or switching from mask replacement to encryption. Association distortion caused by desensitization inconsistency can be corrected by unifying the desensitization granularity and algorithm across attributes, for example applying the same granularity and algorithm to the name, identification number, and telephone number attributes.
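These correction rules can be sketched as a simple lookup from distortion type to strategy adjustment; the strategy keys, distortion labels, and granularity levels below are illustrative assumptions:

```python
# Mapping association distortion types to desensitization strategy
# corrections; the strategy is a plain dict with hypothetical keys.
GRANULARITY = ["province", "city", "county"]  # coarse -> fine

def correct_strategy(strategy: dict, distortion: str) -> dict:
    s = dict(strategy)
    level = GRANULARITY.index(s["granularity"])
    if distortion == "semantic_loss":       # over-desensitization: go finer
        s["granularity"] = GRANULARITY[min(level + 1, len(GRANULARITY) - 1)]
        s["algorithm"] = "homophonic_substitution"
    elif distortion == "privacy_leakage":   # under-desensitization: go coarser
        s["granularity"] = GRANULARITY[max(level - 1, 0)]
        s["algorithm"] = "encryption"
    elif distortion == "inconsistency":     # unify across attributes
        s["unified_across_attributes"] = True
    return s
```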
The updated data desensitization strategy is corrected according to these rules to generate a corrected desensitization strategy. The archive data is desensitized again with the corrected strategy, and the association consistency metric is recomputed. Whether the corrected strategy meets the preset semantic association maintenance degree is then judged, i.e., whether the metric reaches a preset threshold. If so, the corrected strategy is taken as the final optimized strategy; if not, the previous step is repeated and the strategy is adjusted further until the semantic association maintenance degree is satisfied.
The desensitization strategy satisfying the semantic association maintenance degree is taken as the new data desensitization strategy, and the original strategy configuration is updated. The original archive data is desensitized again with the updated strategy, producing a desensitized data set with a higher degree of semantic association retention.
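The correct-and-recheck loop can be sketched as follows; desensitize, consistency, and diagnose are assumed helpers standing in for the desensitization engine, the association consistency metric, and the distortion-type analysis described above, and correct_strategy is the rule lookup from the earlier sketch:

```python
# Iterative strategy correction until the association consistency metric
# reaches the preset semantic association maintenance threshold.
def optimize_strategy(data, strategy, threshold, max_rounds=10):
    for _ in range(max_rounds):
        desensitized = desensitize(data, strategy)   # assumed helper
        c = consistency(data, desensitized)          # metric C(G, G')
        if c >= threshold:
            return strategy, desensitized
        distortion = diagnose(data, desensitized)    # assumed helper
        strategy = correct_strategy(strategy, distortion)
    return strategy, desensitized
```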
Optionally, reinforcement learning can be used, with the association consistency metric serving as environmental feedback, to adaptively learn and improve the data desensitization strategy through continual trial and optimization. The selection and parameterization of the desensitization strategy is modeled as a Markov decision process, and the optimal strategy is learned with a reinforcement learning algorithm such as Q-learning. The state encodes the current strategy selection and parameter settings; different state representations can be defined for specific problems, for example a vector representing the choices and parameters of the available strategies. The action set consists of the desensitization strategies and parameter settings that can be applied, each action corresponding to one strategy or parameter adjustment.
The association consistency metric serves as the feedback signal of the reward function. The reward can be designed from the change of the metric, for example a positive reward when the metric increases and a negative reward when it decreases. The data desensitization operation is executed according to the current state and the selected action to obtain the next state; the concrete state transition depends on the problem and on the parameter settings of the desensitization strategy.
A reinforcement learning algorithm (e.g., Q-learning) then learns the optimal desensitization strategy. At each time step, an action is selected based on the current state, the desensitization operation is performed, the environmental feedback (the association consistency metric) is observed, and the Q-value function is updated so that better actions are chosen in future decisions. By continual trial and optimization, the strategy and its parameters are iteratively updated until an optimal desensitization strategy is learned.
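A minimal tabular Q-learning sketch of this scheme; apply_action and reward_fn are assumed environment helpers (the reward being the change in the association consistency metric), states are hashable strategy configurations, and the hyperparameter values are illustrative:

```python
# Tabular Q-learning over desensitization strategy configurations.
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount, exploration

def q_learning(actions, apply_action, reward_fn, start_state, episodes=200):
    Q = defaultdict(lambda: {a: 0.0 for a in actions})
    for _ in range(episodes):
        state = start_state
        for _ in range(20):  # steps per episode
            if random.random() < EPSILON:
                action = random.choice(actions)           # explore
            else:
                action = max(Q[state], key=Q[state].get)  # exploit
            next_state = apply_action(state, action)      # run desensitization
            r = reward_fn(state, next_state)  # +/- change in consistency
            Q[state][action] += ALPHA * (
                r + GAMMA * max(Q[next_state].values()) - Q[state][action])
            state = next_state
    return Q
```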
In an alternative embodiment of the present invention,
By calculating the structural similarity and the semantic similarity between the archive data knowledge graph and the desensitization data subgraph, a calculation formula for constructing the association consistency measurement index is as follows:
$$C(G, G') = \eta \sum_{(u,v) \in P} c(u,v) + \mu \sum_{p=1}^{h} \sum_{q=1}^{h'} \cos(e_p, e'_q)$$

wherein C(G, G') represents the association consistency metric, G represents the archive data knowledge graph, G' represents the desensitized data subgraph, η represents the balance factor of the structural similarity, P represents the set of node alignments, c(u, v) represents the cost required to align node u with node v, μ represents the balance factor of the semantic similarity, h represents the number of nodes of the archive data knowledge graph, h' represents the number of nodes of the desensitized data subgraph, e_p represents the embedding vector of node p in the archive data knowledge graph, and e'_q represents the embedding vector of node q in the desensitized data subgraph.
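The metric can be sketched directly from this formula; the node alignment, the cost function, the embedding dictionaries, and the balance factors η and μ are assumed inputs:

```python
# Association consistency metric C(G, G') combining the alignment-cost
# term (weighted by eta) and the pairwise cosine term (weighted by mu).
import numpy as np

def consistency_metric(alignment, cost, emb_kg, emb_sub, eta, mu):
    structural = sum(cost(u, v) for u, v in alignment)
    semantic = sum(
        float(np.dot(e_p, e_q) / (np.linalg.norm(e_p) * np.linalg.norm(e_q)))
        for e_p in emb_kg.values() for e_q in emb_sub.values())
    return eta * structural + mu * semantic
```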
In this embodiment, the semantic relations in the text data are given a structured representation by constructing the archive data knowledge graph. By computing the structural and semantic similarity between the graphs before and after desensitization, the influence of desensitization on the consistency of the data's semantic associations can be evaluated quantitatively. The desensitization strategy is corrected in a targeted way according to the distortion type, ultimately ensuring that the desensitized data maintains its semantic association with the original data. Because desensitization is considered within the overall semantic association framework, the information loss caused by broken association relations is avoided, and the quality and usability of the desensitized data are improved. The method integrates knowledge graphs, semantic modeling, and association analysis to comprehensively evaluate and optimize the archive data desensitization strategy at the semantic level, preserving data value to the greatest extent while protecting privacy, with good quality assurance and interpretability.
FIG. 2 is a schematic structural diagram of a big-data-based archive management system according to an embodiment of the present invention. As shown in FIG. 2, the system includes:
The first unit is used for acquiring the archive data, identifying sensitive attributes in the archive data with an automatic sensitive attribute extraction method, performing attribute value distribution analysis on the value range distribution of each sensitive attribute to obtain attribute value distribution characteristics reflecting the sparsity of the attribute values, and simultaneously assessing the usage environment of the archive data to obtain a risk assessment matrix of the data usage environment;
The second unit is used for taking the obtained sensitive attributes, attribute value distribution characteristics, and risk assessment matrix as input, performing privacy disclosure risk quantification scoring with a pre-trained risk assessment model to obtain a privacy disclosure risk value for each piece of archive data, automatically generating a corresponding data desensitization strategy according to the privacy disclosure risk value, continuously monitoring changes of the privacy disclosure risk value during the data desensitization process, triggering a dynamic update of the data desensitization strategy when the privacy disclosure risk value changes, and desensitizing the archive data with the updated data desensitization strategy to obtain initial desensitized data;
and the third unit is used for performing association consistency verification on the initial desensitized data, correcting the updated data desensitization strategy according to the verification result until the preset semantic association maintenance degree is met to obtain the corrected data desensitization strategy, and desensitizing the archive data again with the corrected data desensitization strategy, thereby realizing secure management of the archive data.
In a third aspect of an embodiment of the present invention,
There is provided an electronic device including:
A processor;
A memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method described previously.
In a fourth aspect of an embodiment of the present invention,
There is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method as described above.
The present invention may be a method, apparatus, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing various aspects of the present invention.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solution described in the foregoing embodiments can still be modified, or some or all of its technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit of the invention.

Claims (7)

1. An archive management method based on big data, comprising:
acquiring archive data, identifying sensitive attributes in the archive data with an automatic sensitive attribute extraction method, performing attribute value distribution analysis on the value range distribution of each sensitive attribute to obtain attribute value distribution characteristics reflecting the sparsity of the attribute values, and simultaneously assessing the usage environment of the archive data to obtain a risk assessment matrix of the data usage environment;
taking the obtained sensitive attributes, attribute value distribution characteristics, and risk assessment matrix as input, performing privacy disclosure risk quantification scoring with a pre-trained risk assessment model to obtain a privacy disclosure risk value for each piece of archive data, automatically generating a corresponding data desensitization strategy according to the privacy disclosure risk value, continuously monitoring changes of the privacy disclosure risk value during the data desensitization process, triggering a dynamic update of the data desensitization strategy when the privacy disclosure risk value changes, and desensitizing the archive data with the updated data desensitization strategy to obtain initial desensitized data;
performing association consistency verification on the initial desensitized data, correcting the updated data desensitization strategy according to the verification result until a preset semantic association maintenance degree is met to obtain a corrected data desensitization strategy, and desensitizing the archive data again with the corrected data desensitization strategy, thereby realizing secure management of the archive data;
wherein identifying sensitive attributes in the archive data with the automatic sensitive attribute extraction method and performing attribute value distribution analysis on the value range distribution of each sensitive attribute to obtain attribute value distribution characteristics reflecting the sparsity of the attribute values comprises the following steps:
encrypting the identified sensitive attributes in the archive data to obtain encrypted attribute value data;
distributing the encrypted attribute value data to all participants of the distributed data source under a federated learning framework, each participant locally training a local model on the encrypted attribute value data, aggregating the local model parameters with a secure aggregation algorithm, and iterating until a preset termination condition is met to obtain a final attribute value distribution model;
converting the obtained attribute value distribution model into a garbled circuit, each participant providing its encrypted attribute value data and reconstructing secret-sharing shares of the encrypted attribute values through a secure computation protocol, and the participants cooperatively completing the attribute value distribution prediction on the basis of the garbled circuit and the reconstructed secret-sharing shares to obtain the final attribute value distribution characteristics;
wherein performing association consistency verification on the initial desensitized data and correcting the updated data desensitization strategy according to the verification result until the preset semantic association maintenance degree is met to obtain the corrected data desensitization strategy comprises the following steps:
preprocessing the archive data, recognizing the preprocessed archive data with a named entity recognition method to obtain key entities, extracting semantic relations among the key entities, and constructing entity-relation triples of the archive data;
designing an archive data domain ontology with an ontology construction method, and mapping the entity-relation triples of the archive data into the domain ontology to form an archive data knowledge graph represented by resource description framework triples;
mapping the initial desensitized data into the archive data knowledge graph to generate a desensitized data subgraph, and constructing an association consistency metric by calculating the structural similarity and semantic similarity between the archive data knowledge graph and the desensitized data subgraph;
identifying key factors causing association distortion according to the association consistency metric, determining the association distortion type according to those key factors, designing corresponding data desensitization strategy correction rules for the different association distortion types, and correcting the updated data desensitization strategy according to the correction rules until the preset semantic association maintenance degree is met, to obtain the corrected data desensitization strategy;
by calculating the structural similarity and the semantic similarity between the archive data knowledge graph and the desensitization data subgraph, a calculation formula for constructing the association consistency measurement index is as follows:
$$C(G, G') = \eta \sum_{(u,v) \in P} c(u,v) + \mu \sum_{p=1}^{h} \sum_{q=1}^{h'} \cos(e_p, e'_q)$$

wherein C(G, G') represents the association consistency metric, G represents the archive data knowledge graph, G' represents the desensitized data subgraph, η represents the balance factor of the structural similarity, P represents the set of node alignments, c(u, v) represents the cost required to align node u with node v, μ represents the balance factor of the semantic similarity, h represents the number of nodes of the archive data knowledge graph, h' represents the number of nodes of the desensitized data subgraph, e_p represents the embedding vector of node p in the archive data knowledge graph, and e'_q represents the embedding vector of node q in the desensitized data subgraph.
2. The method according to claim 1, wherein the parameters of the local model are aggregated by a secure aggregation algorithm, and the calculation formula for updating the global model is as follows:
$$w_{t+1} = w_t + \gamma \left[ \frac{1}{s} \sum_{j=1}^{s} \frac{1}{K} \sum_{k=1}^{K} \frac{\Delta w_k^{j}}{\max\left(1, \|\Delta w_k^{j}\|_2 / \lambda\right)} + \mathcal{N}(0, \sigma^2 I) \right]$$

wherein w_{t+1} represents the model parameters after the (t+1)-th iteration update, w_t represents the model parameters after the t-th iteration update, t represents the number of iterations, γ represents the learning rate, s represents the total number of secret shares, j represents the secret share index, K represents the total number of participants, k represents the k-th participant, Δw_k^j represents the local model parameter update value, N(·) represents Gaussian noise, σ² represents the variance, I represents the identity matrix, λ represents the gradient clipping threshold, and ‖·‖₂ represents the L2 norm of the model parameter update value.
3. The method of claim 1, wherein continuously monitoring changes in the privacy disclosure risk value during the data desensitization process, triggering dynamic updating of the data desensitization policy when the privacy disclosure risk value changes, desensitizing the archive data using the updated data desensitization policy, and obtaining initial desensitization data comprises:
acquiring the original data set and the desensitized data set, calculating the difference between the original and desensitized data distributions, and introducing attribute importance weights to aggregate the attribute-level information losses into an information loss measure;
Performing data mining tasks on the original data set and the desensitized data set respectively to obtain performance scores of the data mining tasks, and combining the performance scores of the data mining tasks to obtain utility loss scores of the desensitized data;
combining the information loss measurement with the utility loss score of the desensitized data to obtain comprehensive data utility loss;
Constructing a multi-objective optimization model aiming at minimizing privacy disclosure risk values and minimizing comprehensive data utility loss, iteratively updating desensitization parameters by adopting a gradient descent method to obtain optimal desensitization parameters, desensitizing archive data by utilizing the optimal desensitization parameters, continuously monitoring data environment changes and privacy disclosure risk values, and triggering desensitization parameter tuning when data distribution changes or the privacy disclosure risk values exceed a preset threshold value;
and acquiring the latest original data set and privacy disclosure risk value, re-executing the desensitization parameter optimization step, updating the optimal desensitization parameters, and desensitizing the archive data by utilizing the updated optimal desensitization parameters to obtain initial desensitization data.
4. The method of claim 3, wherein constructing a multi-objective optimization model targeting minimizing privacy-revealing risk values and minimizing comprehensive data utility loss comprises:
the calculation formula of the objective function of the multi-objective optimization model is as follows:
$$J(\theta) = R(D, \theta) + \sum_{a=1}^{m} \delta_a \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)} + \sum_{b=1}^{n} \alpha_b \left( \mathrm{Perf}(D, T_b) - \mathrm{Perf}(D', T_b) \right)$$

wherein J represents the objective function, R(·) represents the privacy disclosure risk value, D represents the original data set, θ represents the desensitization parameters, m represents the number of attributes, δ_a represents the importance weight of the a-th attribute, X represents the set of all data points, P(x) represents the probability distribution of data point x in the original data set, Q(x) represents the probability distribution of data point x in the desensitized data set, n represents the number of data mining tasks, α_b represents the importance weight of the b-th data mining task, Perf(·) represents the performance score, D' represents the desensitized data set, and T_b represents the b-th data mining task.
5. A big-data-based archive management system for implementing the method of any one of claims 1 to 4, comprising:
a first unit for acquiring the archive data, identifying sensitive attributes in the archive data with an automatic sensitive attribute extraction method, performing attribute value distribution analysis on the value range distribution of each sensitive attribute to obtain attribute value distribution characteristics reflecting the sparsity of the attribute values, and simultaneously assessing the usage environment of the archive data to obtain a risk assessment matrix of the data usage environment;
a second unit for taking the obtained sensitive attributes, attribute value distribution characteristics, and risk assessment matrix as input, performing privacy disclosure risk quantification scoring with a pre-trained risk assessment model to obtain a privacy disclosure risk value for each piece of archive data, automatically generating a corresponding data desensitization strategy according to the privacy disclosure risk value, continuously monitoring changes of the privacy disclosure risk value during the data desensitization process, triggering a dynamic update of the data desensitization strategy when the privacy disclosure risk value changes, and desensitizing the archive data with the updated data desensitization strategy to obtain initial desensitized data;
and a third unit for performing association consistency verification on the initial desensitized data, correcting the updated data desensitization strategy according to the verification result until the preset semantic association maintenance degree is met to obtain the corrected data desensitization strategy, and desensitizing the archive data again with the corrected data desensitization strategy, thereby realizing secure management of the archive data.
6. An electronic device, comprising:
A processor;
A memory for storing processor-executable instructions;
Wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of claims 1 to 4.
7. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 4.
CN202411025089.7A 2024-07-29 2024-07-29 File management method and system based on big data Active CN118551414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411025089.7A CN118551414B (en) 2024-07-29 2024-07-29 File management method and system based on big data

Publications (2)

Publication Number Publication Date
CN118551414A (en) 2024-08-27
CN118551414B (en) 2024-10-11

Family

ID=92444696


Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118734365B (en) * 2024-09-02 2025-02-11 杭州深普科技有限公司 Method and system for dynamic desensitization of middle-layer data stream
CN118885475B (en) * 2024-09-24 2024-12-13 中移动信息技术有限公司 Data processing method, apparatus, device, storage medium, and program product
CN119377995B (en) * 2024-10-29 2025-06-27 新疆国融信联大数据投资有限公司 Method for constructing data leakage prevention system
CN119128990B (en) * 2024-11-08 2025-09-19 西安电子科技大学 Dynamic data self-adaptive desensitization method and device based on artificial intelligence
CN119252456B (en) * 2024-12-06 2025-03-25 吉林大学第一医院 Nursing management system and method for multiple myeloma patients
CN119885281B (en) * 2025-03-25 2025-06-20 长沙数字集团有限公司 Data security supervision method for public data operation platform
CN120030577B (en) * 2025-04-24 2025-07-22 浙江理工大学 Selective data aggregation method and system based on conditional classification coding
CN120354458B (en) * 2025-06-23 2025-08-19 中南大学 A multi-strategy data desensitization method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284631A (en) * 2018-10-26 2019-01-29 中国电子科技网络信息安全有限公司 A system and method for document desensitization based on big data
CN117454426A (en) * 2023-11-20 2024-01-26 上海商保通健康科技有限公司 Method, device and system for desensitizing and collecting information of claim settlement data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024042350A1 (en) * 2022-08-24 2024-02-29 Evyd科技有限公司 Medical text data masking method and apparatus, and medium and electronic device
CN117972783A (en) * 2024-02-02 2024-05-03 厦门理工学院 Big data privacy protection method and system based on federated learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载