WO2018102861A1 - Secure text analytics - Google Patents
- Publication number: WO2018102861A1 (PCT/AU2017/051315)
- Authority: WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/22—Social work or social welfare, e.g. community support activities or counselling services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Definitions
- Encryption engine 113 may use the approach of [SHANNON49] to improve independent secret systems by concatenating them. In this case, encryption engine firstly encrypts the numeric features and labels (using a Paillier homomorphic encryption system [PAILLIER99, DAMGARD01]). This makes direct interpretation difficult. Second, encryption engine 113 passes the encrypted values to aggregation device 120 to address the data dependencies by applying irreversible aggregation to the numeric data so as to hide many of the implied dependencies between observations and to create aggregated encrypted numeric values. These aggregated values still represent the multiple text segments but in aggregated encrypted form.
- learning device 130 wraps the learning process in a secure learning approach, to further reduce the capacity of an inquisitive user to discover the underlying labels. This reflects the idea that data dependences can be accounted for in training, validation, and testing of machine learning methods in order to produce reliable performance estimates. This essentially extends the pipeline of Fig. 2 to include the encryption shown in Fig. 1. The pre-processing pipeline of Fig. 2 is encapsulated in the text processor 112.
- the Paillier encryption scheme (and later generalisations) is a public-private key encryption scheme.
- the following notation is used: an integer x is encrypted as ((x)), and the corresponding decryption operation recovers x from ((x)).
- these are public-key encryptions: users can encrypt data (and perform computations) using a common public key; however, only the user with the corresponding private key can extract the clear data.
- Aggregation device 120 can sum encrypted values in the encrypted domain (and subsequently decrypt the result), as in (1), and can multiply encrypted values by unencrypted scalars, as in (2). The result (2) does not apply when the scalar a is itself encrypted. For more advanced operations (such as multiplying two encrypted values) the devices can use secure computation.
- Fig. 5 illustrates a protocol for Secure Outer Product 500
- Fig. 6 illustrates a protocol for Secure Inner Product 600, both of which extend the protocol for Secure Elementwise Product 700 shown in Fig. 7 to form outer- and inner-products, building up a basic linear algebra toolkit. This can be extended to secure matrix multiplication.
- the values are encrypted by a single party B (encryption engine 113 on source device) and calculated by another party A (aggregation device).
- the source device 110 is remote from the aggregation device 120.
- the work of [NOCK15] and [PATRINI16] may be used to present relevant parts of the aggregation techniques. Firstly, the work of [NOCK15] presents learners on specially aggregated datasets, where the data set could be in a single location. This is extended to multiple locations below.
- the data set is a single (coherent) source, i.e. all data is held by a single organisation.
- Figs. 9a and 9b illustrate a data schematic, with first peer 901, second peer 902 and third peer 903. Some features (dotted, referenced as 911, 912 and 913, respectively) are common to each peer; these shared features are denoted as a common subset of the feature set. One of the common features is the label (or 'class'). Other (non-shared) features are split between all peers. The total sample is the bold outline 920.
- Fig. 9a shows the Vertical Partition case (VP) where all peers see different observations of the same examples, but do not know who is who in the data set.
- Fig. 9b shows the General case (G) where it is not known whether one example viewed by a peer exists in other peers' datasets. Missing data is marked as ⊥.
- One element is a basic block (BB) rado.
- the features over which the rados will operate are encrypted.
- One example is a two-party scenario as shown in Fig. 10.
- the encryption process occurs after securely processing the text documents at the private location of B (source device 110).
- encryption engine 113 of B holds the private key; encryption engine 113 encrypts the features, and these are then aggregated by aggregation device 120.
- the aggregation occurs blind to B, that is, remotely from source device 110, and may be performed by an honest-but-curious intermediary (aggregation device 120).
- the rados are generated privately at that intermediary.
- the rados can be used by other honest-but-curious external parties, such as learning device 130.
- secure mathematical operations are applied and the source device 110 and the aggregation device 120 are referred to as {B, X}, where B is the private key holder and X is an intermediary.
- X can "see" encrypted features ((x_i)) and encrypted labels ((y_i)), and knows the choices of rado vectors (i.e. X knows the values of σ). It would be possible to operate with σ also encrypted.
- Fig. 10 outlines the encryption steps for this section. It is possible to extend the rados to secure rado learners. We re-write (3) with the encrypted values made explicit. Corresponding secure mathematical operations are also shown. We use the notation ⊕ to denote a series of homomorphic addition operations in the encrypted domain.
- Equation (5) shows the formation of the (encrypted) rado.
- the additions and (scalar) multiplications must all be translated to the appropriate homomorphic addition and multiplication operations.
- the output is an encoded Rademacher observation, based on any numerical field. We will refer to this encrypted rado below. Fig. 11 shows a protocol for calculating an encrypted Rademacher.
- Learning device 130 may use the equivalent (exponential) learner for rados, in which a regularising term is included (see equation (24) below).
- encryption engine 113 encrypts the rados as described above and the learning outcomes are to be kept confidential. While this example is not concerned with the multi-source case, the following description provides mechanics to allow transfer to the scenario where multiple un-shared data sources exist.
- the regulariser in the above equation is an application of a protocol for Local Inner Product as shown in Fig. 17.
- learning device 130 does not evaluate the above equation but rather performs a gradient descent to solve Problem 1 as explained below.
- the mean square loss can be used over V, i.e. on the limited sample sets, by using a modified mean loss as given in Definition 4:
- θ* = (BB^T + dim_c(B)·λI)^(-1) B·1   (8), where B is the matrix of stacked (column-wise) rados and dim_c(B) is the number of columns of B.
- Fig. 12 illustrates a protocol for Secure Matrix Inverse, which uses an iterative algorithm to multiply matrices to form an inverse.
- the Secure Matrix Inverse protocol in Fig. 12 uses a significant number of secure and homomorphic operations per iteration.
- the secure multiplies use internal homomorphic multiplications and additions.
- each peer p can calculate e_p and v_p independently.
- while the mean b may be calculated centrally, it is preferable to use secure multiparty addition to achieve the same result. This reduces the scope for the coordinator to access (encrypted, aggregated) vectors, and (instead) only access noisy aggregates of the data.
- this may be implemented via securely accredited text processing software installed on source device 110, with the encryption software 113 running on the machine with access to the raw data, whilst the aggregation software operates on a separate server (aggregation device 120) with no access to the raw data.
- the solution of the optimisation to find θ is performed at a central coordinator (learning device 130), which is also responsible for generating the public-private keys.
- Fig. 14 shows an algorithm Classifier for Secure Text with Central Coordinator. This algorithm uses a number of intermediate steps, which are detailed in Figs. 5-8 and 11-13.
- the algorithm may incorporate the secure multi-party summation work of [CLIFTON02].
- Fig. 15 illustrates a communication architecture 1500 for multiple peers 1501 and 1502 and a common coordinator 1503.
- Fig. 15 outlines the protocol from Fig. 14.
- the coordinator 1503 sends a common dictionary and public key to all peers 1501 and 1502.
- Each peer has different data components, with some common elements ( 1510, 1511) - see also Figs. 9a and 9b.
- Each peer has encrypted and aggregated its local data (as indicated at 1512 and 1513).
- the servers 1512 and 1513 correspond to the "intermediary" in Fig. 10.
- the encryption key is generated by the coordinator 1503. Dashed arrows denote information transfers between participants, whilst solid arrows denote local transformations (at the respective participant). Not all transfers are shown.
- each peer only has a subset of the features of x.
- the label may be determined only by the sign of a scalar, and hence it is possible to break the inner product θ^T x into an inner product of local features and remote features:
- the local component of (16) may be calculated at peer p. If we denote the local classifier result at peer p accordingly, we may write
- the summation in (19) is the sum of all (local) calculated classifier results on the sub-components of the vector x .
- the result of (19) shows that the remote classification results may be treated as offsets for the local result— i.e. the remote inner products act as corrections to the local result.
- every peer shares linking information about the observation x .
- the local inner product can be calculated by keeping the encrypted classifier vector in its encrypted form, and treating the elements of x as unencrypted scalars.
- the local inner product is written explicitly in the protocol for Local Inner Product as show in Fig. 17.
- the inner product with the rado is performed at the coordinator C (learning device 130), using secure inner product.
- the summation may be achieved using multi-party secure addition, as outlined in [CLIFTON02] (a simple numeric sketch of such a summation is given at the end of this section).
- Fig. 16 shows a protocol for Local Classify for Secure Text.
- V_{n+1} = V_n(3I - A·V_n[3I - A·V_n])   (21) is the iteration used to form the matrix inverse in the Secure Matrix Inverse protocol of Fig. 12 (a plain-text numerical sketch of this iteration is given at the end of this section).
- ℓ(θ) = log Σ exp(-θ^T π) + Ω(θ)   (24), where the sum runs over the rados π and Ω(θ) is a regularising term.
- the learner can be developed using secure exponentiation.
- Fig. 18 illustrates a computer network 1800 for automatic text analysis including a computer system 1801 implementing coordinator 1503 from Fig. 15.
- the computer system 1801 comprises a processor 1802 connected to a program memory 1804, a data memory 1806 and a communication port 1808.
- the program memory 1804 is a non-transitory computer readable medium, such as a hard drive, a solid state disk or CD-ROM.
- Software, that is, an executable program stored on program memory 1804, causes the processor 1802 to perform the steps performed by learning device 130 and/or coordinator 1503.
- the processor 1802 may then store the learning result, such as a learned correlation between feature values and label values, on data store 1806, such as on RAM or a processor register.
- Processor 1802 may also send the determined correlation via communication port 1808 to a server, such as an automatic evaluation engine. For example, the learned correlations may assist doctors in automatically diagnosing patients.
- the processor 1802 may receive data, such as learning samples, from data memory 1806 as well as from the communications port 1808.
- the processor 1802 receives learning sample data from aggregation device 120 via communications port 1808, such as by using a Wi-Fi network according to IEEE 802.11.
- the Wi-Fi network may be a decentralised ad-hoc network, such that no dedicated management infrastructure, such as a router, is required or a centralised network with a router or access point managing the network.
- although communications port 1808 is shown as a distinct entity, it is to be understood that any kind of data port may be used to receive data, such as a network connection, a memory interface, a pin of the chip package of processor 1802, or logical ports, such as IP sockets or parameters of functions stored on program memory 1804 and executed by processor 1802. These parameters may be stored on data memory 1806 and may be handled by-value or by-reference, that is, as a pointer, in the source code.
- the processor 1802 may receive data through all these interfaces, which includes memory access of volatile memory, such as cache or RAM, or non-volatile memory, such as an optical disk drive, hard disk drive, storage server or cloud storage.
- volatile memory such as cache or RAM
- non-volatile memory such as an optical disk drive, hard disk drive, storage server or cloud storage.
- the computer system 1800 may further be implemented within a cloud computing environment, such as a managed group of interconnected servers hosting a dynamic number of virtual machines.
- nodes, edges, graphs, solutions, variables, samples, features, labels and the like refer to data structures, which are physically stored on data memory 1806 or processed by processor 1802. Further, for the sake of brevity when reference is made to particular variable names, such as "period of time” or “quantity of movement” this is to be understood to refer to values of variables stored as physical data in computer system 1800.
- the learning on the aggregated values may comprise determining learning parameter values indicative of a relationship between the features and labels of the learning samples.
- the system may then further comprise an output engine to automatically suggest actions based on the estimated label.
- the learning device 130 may distribute the learning parameters, such as machine learning model parameters including linear factors of regression models, back to the source device 110.
- source device 110 may be used by a doctor to enter patient data and diagnostic data for each patient. Learning device 130 learns the contribution of each observation to particular diseases in the form of regression parameters. This learning is based on learning samples not only from the one source device 110 but from a large number of source devices while maintaining confidentiality as described above. The learning parameters also do not contain patient-confidential information and can therefore be distributed freely back to source device 110.
- the doctor can enter the observations and the source device 110 automatically proposes a diagnosis or a number of possible diagnoses for the doctor to evaluate. In this sense, source device automatically suggests actions, such as providing information to the patient or applying a certain therapy.
- doctor's observations may also be accompanied by genomic data or meta-genetics including habits of the patient, such as smoking etc.
- the proposed systems and methods can equally be applied to this information.
- the text data is related to other applications, such as social media, email, SMS, credit ratings and others.
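The iterative matrix inverse referred to above (equation (21)) can be illustrated in the clear. The sketch below is a plain-text illustration only: it shows the third-order iteration V_{n+1} = V_n(3I - A·V_n[3I - A·V_n]) converging to the inverse of a small matrix, with an assumed starting point V_0 = A^T / (||A||_1 · ||A||_inf) that is not taken from this document; in the protocol of Fig. 12 each matrix product would instead be carried out as a Secure Matrix Multiply on encrypted values.

```python
# Plain-text sketch of the iterative matrix inverse of equation (21):
#   V_{n+1} = V_n (3I - A V_n (3I - A V_n))
# In the Secure Matrix Inverse protocol each product would be a Secure Matrix
# Multiply on encrypted values; here everything is computed in the clear.
import numpy as np

def iterative_inverse(A, iterations=30):
    n = A.shape[0]
    I = np.eye(n)
    # Assumed starting point (not from this document): scaling A^T so that the
    # spectral radius of (I - A V_0) is below 1, which guarantees convergence.
    V = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(iterations):
        AV = A @ V
        V = V @ (3 * I - AV @ (3 * I - AV))
    return V

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
print(np.round(iterative_inverse(A) @ A, 6))  # approximately the identity matrix
```

The multi-party secure addition mentioned above in connection with [CLIFTON02] can likewise be illustrated with a simple ring-based sum: the coordinator masks its contribution with a random offset, every peer adds its own value to the masked running total, and the coordinator removes the mask at the end, so no participant sees another participant's raw value. This is a generic sketch of that family of protocols rather than the exact protocol of the cited work, and the modulus and party roles are illustrative assumptions.

```python
# Ring-based secure sum sketch: the coordinator adds a random mask, each peer
# adds its private value to the running total, and the coordinator removes the
# mask at the end. Working modulo a large field size means the running total
# carries no information about the individual contributions.
import secrets

FIELD = 2**64  # illustrative modulus, chosen large enough to avoid wrap-around

def secure_sum(coordinator_value, peer_values):
    mask = secrets.randbelow(FIELD)
    running = (coordinator_value + mask) % FIELD   # coordinator starts the ring
    for value in peer_values:                      # each peer sees only a masked total
        running = (running + value) % FIELD
    return (running - mask) % FIELD                # coordinator removes its mask

print(secure_sum(5, [11, 7, 20]))  # 43
```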
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Health & Medical Sciences (AREA)
- Tourism & Hospitality (AREA)
- Data Mining & Analysis (AREA)
- Human Resources & Organizations (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Primary Health Care (AREA)
- Public Health (AREA)
- Marketing (AREA)
- Physics & Mathematics (AREA)
- Economics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Entrepreneurship & Innovation (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Child & Adolescent Psychology (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Operations Research (AREA)
- Pathology (AREA)
- Quality & Reliability (AREA)
- Epidemiology (AREA)
- Machine Translation (AREA)
Abstract
This disclosure relates to automatic text analysis. A source device comprises an input port to receive text data representing text that comprises multiple text segments. A text processor transforms the text data to numeric values that represent the multiple text segments. An encryption engine encrypts the numeric values into encrypted numeric values. An aggregation device aggregates the encrypted numeric values into aggregated encrypted numeric values. Finally, a learning device remote from the source device provides an analysis of the text data by performing a learning method on the aggregated encrypted numeric values using operations that maintain the aggregated encrypted numeric values and thereby the text data concealed from the learning device. The numeric data values allow encryption to preserve privacy while, at the same time, allowing learning on the numeric data on a remote device without disclosing the text data because the aggregated numeric values remain concealed from the learning device.
Description
"Secure text analytics"
Related applications
[1] The present application claims priority from Australian Provisional Patent Application No 2016905070 filed on 8 December 2016, the content of which is incorporated herein by reference. PCT/AU2016/050088 discloses a method for learning from distributed data and is incorporated herein by reference.
PCT/AU2015/050653 discloses a method for learning with transformed data and is incorporated herein by reference. PCT/AU2015/050661 discloses gradients over distributed datasets and is incorporated herein by reference.
Technical Field
[2] This disclosure relates to automatic text analysis.
Background
[3] The amount of unstructured data used by organisations is growing. However, the difficulty in processing text represents a significant barrier to extracting maximum value from this large and growing resource.
[4] Private text data with confidential or otherwise restricted content is difficult to "open" in the sense of open-data. Privacy requirements in text data are difficult to guarantee due to the inter-dependencies of text, and grammar.
[5] Components of the text document that might be subject to privacy concerns (names of persons, places, drug-names, disease-names) are likely to be the very components most interesting for text processing— and most damaging to algorithm performance if altered. More advanced privacy requirements, such as determining whether a part of a document is subject to "national security", make general autonomous text replacements infeasible.
[6] Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present disclosure as it existed before the priority date of each claim of this application.
[7] Throughout this specification the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.
Summary
[8] A system for automatic text analysis comprises:
a source device comprising
an input port to receive text data representing text that comprises multiple text segments,
a text processor to transform the text data to numeric values that represent the multiple text segments, and
an encryption engine to encrypt the numeric values into encrypted numeric values that represent the multiple text segments in encrypted form;
an aggregation device to aggregate the encrypted numeric values into aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form; and
a learning device remote from the source device to provide an analysis of the text data by performing a learning method on the aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form using operations that maintain the aggregated encrypted numeric values and thereby the text data concealed from the learning device.
[9] It is a technical advantage that the numeric data values allow encryption to preserve privacy while, at the same time, allowing learning on the numeric data on a
remote device without disclosing the text data because the aggregated numeric values remain concealed from the learning device.
[10] The text processor may be a language processor.
[11] Transforming the text to numeric values may be based on natural language processing.
[12] It is a technical advantage that a language processor can consider grammar in the text data. As a result, the transformed numeric values represent the text data more accurately.
[13] Transforming the text to numeric values may be based on a dictionary.
[14] It is a technical advantage that a dictionary is a highly optimised data structure that allows look-up and creation with small computational overhead. As a result, many operations on the dictionary can be performed in a short time.
[15] The dictionary may comprise multiple dictionary entries and each of the multiple dictionary entries is associated with a numeric value and transforming the text to numeric values comprises replacing a word from the text data with a corresponding numeric value from the dictionary.
[16] Each numeric value from the dictionary may be associated with a tag from natural language processing.
[17] The tag may represent the grammatical function of that phrase in the sentence.
[18] The tag may comprise multiple labels.
[19] The multiple labels may be one-hot encoded.
[20] It is a technical advantage that one-hot encoding allows binary classifiers, which leads to computationally more efficient calculations.
[21] The encryption engine may use homomorphic encryption.
[22] The homomorphic encryption may be Paillier encryption.
[23] The aggregation device may be remote from the source device and the learning device.
[24] It is a technical advantage that having the aggregation device remote from the source device adds a further layer of protection of the text data from disclosure.
[25] The numeric values may represent features and labels of learning samples, the encryption engine may encrypt the features and labels to create encrypted features and encrypted labels and aggregation may comprise combining the encrypted features based on the encrypted labels.
[26] Aggregation may comprise adding the features multiplied by the labels.
[27] Adding features multiplied by the labels may comprise adding
homomorphically encrypted features multiplied by homomorphically encrypted labels.
[28] Aggregation may comprise combining a random selection of samples.
[29] Aggregation may comprise calculating block rados.
[30] Learning on the aggregated values may comprise determining learning parameter values indicative of a relationship between the features and labels of the learning samples and the method may further comprise applying the learning parameter values to current features to determine an estimated label for the current features.
[31] The system may further comprise an output engine to automatically suggest actions based on the estimated label.
[32] It is a technical advantage that current features can be received and actions suggested automatically. In essence, this incorporates the correlations extracted from the learning samples into the current decision while keeping the learning samples confidential. This way, a significantly larger number of learning samples can be used, which makes the suggested actions more accurate.
[33] The learning device may be configured to calculate a rado loss to learn from the aggregated values.
[34] The operations that maintain the aggregated numeric values concealed from the learning device may comprise one or more of:
Secure Outer Product;
Secure Inner Product;
Secure Elementwise Product;
Secure Matrix Multiply;
Secure Matrix Inverse;
Secure Rado Solver;
Classifier for Secure Text with Central Coordinator;
Local Classify for Secure Text; and
Local Inner Product.
[35] A method for automatic text analysis comprises:
transforming text data representing text that comprises multiple text segments to numeric values that represent the multiple text segments;
encrypting the numeric values to create encrypted numeric values that represent the multiple text segments in encrypted form;
aggregating the encrypted numeric values to create aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form; and
providing an analysis of the text data by performing a learning method, remotely from the transforming, on the aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form using operations that maintain the aggregated encrypted numeric values and thereby the text data concealed from the learning device.
[36] Software, when installed on a computer, causes the computer to perform the above method.
[37] A method for encrypting text data representing text that comprises multiple text segments comprises:
transforming the text data to numeric values that represent the multiple text segments, the numeric values forming multiple samples, each sample comprising one or more features and a label; and
encrypting the numeric values forming the multiple samples to obtain encrypted samples that represent the multiple text segments in encrypted form.
[38] A method for aggregating data comprises:
receiving multiple encrypted samples that represent multiple text segments in encrypted form, each of the multiple encrypted samples comprising one or more encrypted numeric features and an encrypted numeric label;
creating aggregated samples by combining the encrypted numeric features based on the encrypted numeric label for each encrypted sample into aggregated encrypted samples that represent the multiple text segments in aggregated encrypted form.
[39] A method for machine learning on data comprises:
receiving aggregated samples that represent multiple text segments of text data in aggregated encrypted form, the aggregated samples comprising combined encrypted numeric features; and
learning based on the aggregated samples using secure mathematical operations to provide an analysis of the text data by performing a learning method on
the aggregated samples that represent the multiple text segments in aggregated encrypted form using operations that maintain the aggregated encrypted samples and thereby the text data concealed from a learning device performing the method.
[40] Optional features described of any aspect of method, computer readable medium or computer system, where appropriate, similarly apply to the other aspects also described here.
Brief Description of Drawings
[41] An example will now be described with reference to:
[42] Fig. 1 illustrates a system for automatic text analysis.
[43] Fig. 2 illustrates a text processing pipeline. It is noted that the pipeline transforms unstructured text (left) into structured numeric features (right).
[44] Fig. 3 shows the output of a text processing pipeline (using CoreNLP). With a full application of CoreNLP, parts of speech are labelled, entities are labelled, coreferences are labelled and then dependencies between text elements (a label and end-point) are formed.
[45] Fig. 4 illustrates a method for automatic text analysis.
[46] Fig. 5 illustrates a protocol for Secure Outer Product.
[47] Fig. 6 illustrates a protocol for Secure Inner Product.
[48] Fig. 7 illustrates a protocol for Secure Elementwise Product. This protocol extends [FRANZ11, Protocol 2, pp. 39] to elementwise products on vectors. A similar algorithm is given in [CLIFTON02].
[49] Fig. 8 illustrates a protocol for Secure Matrix Multiply.
[50] Fig. 9a illustrates a set of samples for the case of Vertical Partition of the samples.
[51] Fig. 9b illustrates a set of samples for the General case.
[52] Fig. 10 illustrates a two-party scenario.
[53] Fig. 11 illustrates a protocol for calculating an encrypted Rademacher.
[54] Fig. 12 illustrates a protocol for Secure Matrix Inverse.
[55] Fig. 13 illustrates a Secure Rado Solver protocol.
[56] Fig. 14 shows an algorithm Classifier for Secure Text with Central Coordinator.
[57] Fig. 15 illustrates a communication architecture for multiple peers and a common coordinator.
[58] Fig. 16 shows a protocol for Local Classify for Secure Text.
[59] Fig. 17 shows a protocol for Local Inner Product.
[60] Fig. 18 shows a computer network for automatic text analysis.
Description of Embodiments
[61] Confidential text corpora exist in many forms, but most do not allow arbitrary sharing. This disclosure provides systems and methods to use such private corpora using privacy preserving text analytics. For example, an important application of machine learning is the determination of correlations between doctors' observations (phenotypes) and diagnosed diseases. Based on such correlations, diagnosis of future
patients would be more accurate leading to improved healthcare outcomes. Since there is often a large number of variables (i.e. different phenotypes), a large number of samples (i.e. number of patients) should be used in order to have statistically significant correlations. This large number of patients is often greater than the number of patients within a single organisation, such as a hospital provider. However, confidentiality regimes make it difficult to share the patient data to build a larger dataset.
[62] The current disclosure provides a solution that allows the combined learning without jeopardising the confidentiality of the patient data. In particular, the disclosed solution transforms the text data representing text that comprises multiple text segments into numeric values that represent the multiple text segments and encrypts and aggregates these values to create learning samples. This makes it practically impossible to determine the patient data from the learning samples but allows the determination of correlations between phenotypes and diseases based on the encrypted, aggregated learning samples. Numeric in this context means that the values only comprise digits from 0-9 and can be interpreted as a number. For example, "456" is not interpreted as a string of '4', '5', '6' but as the number four hundred and fifty-six. In this sense, the property of being numeric allows calculations on the values, such as addition and multiplication of different numeric values, instead of string operations, such as concatenation, sub-strings and searching.
[63] A text processing application is disclosed using appropriate privacy preservation techniques (including homomorphic encryption, Rademacher operators and secure computation). This disclosure sets out the preliminary materials from Rademacher operators for binary classifiers, and then constructs basic text processing approaches to match those binary classifiers.
[64] This disclosure provides encryption techniques to text which allow learning without viewing the raw data, thereby applying machine learning without the need to share (or read) text data. This is applicable to private stores of text, where multiple participants wish to
1. Learn from private text corpora— without sharing the corpora— and/or
2. Classify private text— without allowing the classifier to have access to the plain text.
[65] In one example, the systems and methods described herein are applied to text in health. Although open sharing is generally accepted as a good principle in health, privacy concerns may overwhelm implied scientific benefit. There is a need to address privacy while supporting research-use (as well as non-research use) of health data. To support confidentiality, the British Medical Journal recommends not publishing verbatim responses or transcriptions of clinical discussions, which is a type of data that text mining systems can use.
Text analysis system
[66] Fig. 1 illustrates a system for automatic text analysis 100. System 100 will first be described in general terms while the mathematical details are described further below. System 100 comprises a source device 110 connected to an aggregation device 120, which in turn is connected to a learning device 130. Source device 110 comprises an input port 111 to receive text data, a text processor 112 to transform the text data to numeric values, and an encryption engine 113 to encrypt the numeric values. Input port 111 may receive text data from a graphical user interface, which may be displayed on a computer screen, such as a tablet computer. The user interface may be part of a medical record keeping system in a hospital, for example. The text processor 112 transforms the text data to numeric values. For example, the text processor 112 performs a bag of words approach and assigns a numeric value to each word in a dictionary. In other examples, text processor 112 is a language processor and uses language grammar to transform the text into numeric values. Encryption engine 113 encrypts the numeric values and may perform homomorphic encryption to encrypt the numerical values.
[67] Aggregation device 120 aggregates the encrypted data. In one example, the encrypted data comprises learning samples including for each sample multiple
encrypted features and a label. Aggregating the encrypted data may then comprise adding the features based on the label as described below.
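As a plain-number illustration of this kind of label-dependent aggregation, the sketch below forms a Rademacher observation (rado) in the clear, following the usual form from the rado literature referenced in this document: random signs σ select a subset of samples (those whose sign agrees with the label), and the selected feature vectors are summed into one aggregate. The toy feature vectors and labels are illustrative assumptions; in the system described here the same additions would be performed on homomorphically encrypted values.

```python
# Clear-text sketch of a Rademacher observation (rado):
#   rado = 1/2 * sum_i (sigma_i + y_i) * x_i
# A sample contributes only when its random sign sigma_i equals its label y_i,
# so the aggregate mixes a random selection of samples into a single vector.
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[1.0, 0.0, 1.0],   # toy feature vectors, e.g. 1-hot coded text features
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
y = np.array([1, -1, 1, -1])     # labels in {-1, +1}

sigma = rng.choice([-1, 1], size=len(y))             # random Rademacher signs
rado = 0.5 * ((sigma + y)[:, None] * X).sum(axis=0)  # aggregate of the selected samples
print(sigma, rado)
```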
[68] The learning device 130 is remote from the source device 110, which means that the learning device 130 has no access to the text data. For example, learning device may be on a separate server with separate administrator rights such that an
administrator of a first server hosting the source device 110 is not an administrator of a second server hosting the learning device. The learning device 130 learns on the aggregated values using secure mathematical operations. Secure mathematical operations in this sense are operations that use input data in a way that it is not possible to discern the underlying original data from the input data, from any intermediate results or calculations or from the output data. In other words, the secure mathematical operations maintain the aggregated numeric values concealed from the learning device 130. For example, the learning device 130 performs addition and multiplication of homomorphically encrypted numerical values.
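As an illustration of such operations on homomorphically encrypted values, the sketch below implements a toy Paillier scheme with small hard-coded primes (far too small to be secure, and purely illustrative): two ciphertexts can be added in the encrypted domain, and a ciphertext can be multiplied by an unencrypted scalar, which are the operations relied on for aggregation and learning.

```python
# Toy Paillier cryptosystem (insecure toy parameters, for illustration only).
# Homomorphic properties demonstrated below:
#   Enc(a) * Enc(b) mod n^2   decrypts to a + b   (addition of ciphertexts)
#   Enc(a) ** k    mod n^2    decrypts to k * a   (multiplication by a clear scalar)
import math
import secrets

p, q = 293, 433                  # toy primes; a real system uses primes of ~1024 bits
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)             # valid because g = n + 1

def encrypt(m):
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:   # r must be coprime to n
        r = secrets.randbelow(n - 1) + 1
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

def add(c1, c2):                 # homomorphic addition of two ciphertexts
    return (c1 * c2) % n2

def scalar_mul(c, k):            # homomorphic multiplication by an unencrypted scalar
    return pow(c, k, n2)

ca, cb = encrypt(40), encrypt(2)
print(decrypt(add(ca, cb)))        # 42
print(decrypt(scalar_mul(ca, 3)))  # 120
```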
[69] Fig. 4 illustrates a method 400 for automatic text analysis. The method commences by transforming 401 text data to numeric values. Next, the numeric values are encrypted 402 to create encrypted numeric values. The encrypted numeric values are then aggregated 403 to create aggregated values. Finally, remotely from the transforming, learning 404 can be performed on the aggregated values using secure mathematical operations, that is, the numeric values remain concealed from the learning device 130. It is noted here that the term 'learning' is used herein in the sense of machine learning. This means the calculation and storage of values that represent a relationship or correlation between features and labels (such as phenotypes and diseases). This is useful as the learning outcome, that is the relationship or correlation, can be applied to new data to provide an estimated label. In one example, the learning calculates linear weighting factors.
[70] In some examples disclosed herein, the transformation 401 and encryption 402 is performed on a first device, the aggregation 403 is performed on a second device and the learning 404 is performed on a third device. In those examples, the first, second
and third device are remote from each other in the sense that each device can only access data that is stored on that device or associated with that device. For example, the data may be stored on cloud storage but access is controlled exclusively by one of the first, second or third devices. In other examples, however, the steps 401, 402, 403 and 404 are distributed differently across multiple devices.
[71] Fig. 4 is to be understood as a blueprint for the software program and may be implemented step-by-step, such that each step in Fig. 4 is represented by a function in a programming language, such as C++ or Java. The resulting source code is then compiled and stored as computer executable instructions on program memory.
Background
[72] The following description provides relevant aspects from homomorphic encryption, obscured computation and irreversible aggregation. One example uses the assumption that all participants are "honest-but-curious" according to the following definition:
[73] "The honest-but-curious (HBC) adversary is a legitimate participant in a communication protocol who will not deviate from the defined protocol but will attempt to learn all possible information from legitimately received messages. "
[74] The data processing can be pushed to peers, which, in some examples, maintain control of private data. The need for "trusted" intermediaries is limited, and where such intermediaries are used, the information they may receive is restricted by aggregation and secure computing.
Text Processing
[75] Consider a text-processing pipeline 200 as shown in Fig. 2, which is implemented in text processor 112. In one example, pipeline 200 can be characterised as a linear so-called "1-best" pipeline as outlined in [JOHNSON14]. This could be extended to parallel, iterative, or network approaches. Typical natural language processing packages, such as MALLET [MCCALLUM02], Stanford CoreNLP [MANNING14], OpenNLP [MORTON05] and NLTK [BIRD09], produce human-readable output in the form of delimited text or XML files. However, the text files represent coded (dictionary) categorical numerical variables.
[76] In one example, a (large) number of categorical variables is presumed and 'text' features are coded via a lookup table, such as a dictionary. In this sense, each dictionary entry is associated with a numeric value and transformation comprises replacing a word from the text data with a corresponding numeric value from the dictionary.
[77] A particular example of text processing pipeline output is outlined in Fig. 3. One observation is that a text processing pipeline maps a document into a series of numeric-valued features, and that these numeric features are not independent.
[78] Here "documents" may be seen as words, sentences, paragraphs, text-segments, whole-documents or whatever other unit of measure is used in the text pipeline.
[79] In training, the text processor 112 implementing pipeline 200 also uses labelled text segments, which may be document-, sentence- or word-labels (e.g. for sentiment analysis) or some combination of text fragments (such as used by brat [STENETORP12]). In each case, text processor 112 may represent the features as numeric labels, where the "tags" are converted into a dictionary, and a series of numeric values. In this sense, each numeric value from the dictionary is associated with a tag from the natural language processing, and the tag represents the grammatical function in the sentence of that phrase, such as subject, predicate or object. In other words, the numeric values represent the multiple text segments in the text data.
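As an illustration, the following sketch assumes a small hypothetical word dictionary and tag dictionary; it only shows how tagged text segments can be reduced to pairs of numeric values, not the output format of any specific toolkit.

```python
# Illustrative shared dictionaries; both would be distributed to all participants.
word_dict = {"the": 0, "patient": 1, "has": 2, "fever": 3}
tag_dict = {"subject": 0, "predicate": 1, "object": 2, "other": 3}

# A hypothetical tagged sentence: each (word, tag) pair becomes a pair of numeric values.
tagged_sentence = [("the", "other"), ("patient", "subject"),
                   ("has", "predicate"), ("fever", "object")]
numeric = [(word_dict[w], tag_dict[t]) for w, t in tagged_sentence]
print(numeric)   # [(0, 3), (1, 0), (2, 1), (3, 2)]
```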
[80] For this work, processor 112 may use binary values, such as might result from a "1-hot" encoding. These become observations, and also labels if the aim is learning on the text.
Encryption
[81] At this point, encryption engine 113 can consider a text document as a series of numeric features and uses homomorphic encryption. Encryption engine 113 may directly apply homomorphic cryptography to the numeric (key-encoded text) data to create encrypted numeric values. The encrypted values still represent the multiple text segments but now in encrypted form. However, the numeric data (observations and labels) are often highly structured. For example, it is unlikely that there will be a sequence of noun-phrases, or named entities (without any other text) in a text block. Co-references and dependencies introduce Markov dependencies in the feature vectors. Labels and feature selections (such as "true iff previous word was noun") add additional structure to the observation vector. All these factors may make numerically encoded text susceptible to frequency attacks. Worse, since the dictionary of key-values is shared in clear text to all participants— to allow uniform encoding of text elements to common key-value pairs— the underlying structure of the document may be readily inferred by matching observations and labels to dictionary elements.
[82] Encryption engine 113 may use the approach of [SHANNON49] to improve independent secret systems by concatenating them. In this case, encryption engine firstly encrypts the numeric features and labels (using a Paillier homomorphic encryption system [PAILLIER99, DAMGARD01]). This makes direct interpretation difficult. Second, encryption engine 113 passes the encrypted values to aggregation device 120 to address the data dependencies by applying irreversible aggregation to the numeric data so as to hide many of the implied dependencies between observations and to create aggregated encrypted numeric values. These aggregated values still represent the multiple text segments but in aggregated encrypted form. Finally, learning device 130 wraps the learning process in a secure learning approach, to further reduce the capacity of an inquisitive user discovering the underlying labels. This reflects the idea
that data dependencies can be accounted for in training, validation, and testing of machine learning methods in order to produce reliable performance estimates. This basically extends Fig. 2 to include the encryption shown in Fig. 1. The pre-processing pipeline of Fig. 2 is encapsulated in the text processor 112.
Partial Homomorphic Encryption
[83] The Paillier encryption scheme (and its later generalisations) is a public-private key encryption scheme. In this disclosure, an integer $x$ encrypted under the scheme is written $((x))$, and the decrypt operation recovers $x$ from $((x))$. These are public key operations: users can encrypt data (and perform computations on ciphertexts) using a common public key; however, only the user with the corresponding private key can extract the clear data.
[84] The main operations for Paillier homomorphic encryption are the operators $\oplus$ and $\otimes$. For two integers $x_1, x_2 < n = pq$, where $n$ is a constant for the particular encryption and $p, q$ are two large primes,

$$((x_1)) \oplus ((x_2)) = ((x_1 + x_2)) \qquad (1)$$

and

$$a \otimes ((x_2)) = ((a \cdot x_2)) \qquad (2)$$
[85] Aggregation device 120 can sum encrypted values in the encrypted domain (and subsequently decrypt the result) using (1), and can multiply encrypted values by unencrypted scalars using (2). The relation (2) does not apply when $a$ is encrypted. For more advanced operations (such as multiplying two encrypted values) the devices can use secure computation.
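The following toy sketch illustrates the two homomorphic properties (1) and (2). It uses deliberately small primes and a simplified key generation for illustration only; it is not the encryption engine 113 of the disclosure, and a real deployment would use a vetted cryptographic library with 2048-bit or larger moduli.

```python
import random
from math import gcd

def generate_keys(p=293, q=433):
    n = p * q
    g = n + 1                                   # standard simplified generator
    lam = (p - 1) * (q - 1)
    mu = pow(lam, -1, n)                        # modular inverse of lambda mod n
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    n, g = pub
    n_sq = n * n
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m % n, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(priv, c):
    lam, mu, n = priv
    x = pow(c, lam, n * n)
    return (((x - 1) // n) * mu) % n

def h_add(pub, c1, c2):                         # equation (1): ((x1)) (+) ((x2))
    n, _ = pub
    return (c1 * c2) % (n * n)

def h_scale(pub, a, c):                         # equation (2): a (x) ((x2))
    n, _ = pub
    return pow(c, a % n, n * n)

pub, priv = generate_keys()
c1, c2 = encrypt(pub, 7), encrypt(pub, 15)
assert decrypt(priv, h_add(pub, c1, c2)) == 22  # 7 + 15 recovered after decryption
assert decrypt(priv, h_scale(pub, 3, c2)) == 45 # 3 * 15 recovered after decryption
```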
Secure Computations
[86] This disclosure provides several protocols for secure computation between two parties A and B. [CLIFTON02] provides mechanisms for multiple parties (i.e. more than two). It is assumed that A operates on encrypted data, and B has the private key (and can decrypt data). Neither party should be able to discern the numerical values. These protocols comprise three steps:
1. an obfuscation step by the public key holder A;
2. a transfer to the private key holder B, who decrypts, performs the calculation on clear data and returns an encrypted result to A; as the result is obfuscated, B learns nothing from this operation, even though it is performed on clear data; and
3. A then removes the original obfuscation, as sketched below.
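The following is a hedged sketch of this three-step pattern (obfuscate, transfer, de-obfuscate) applied to multiplying two encrypted values. It reuses the toy Paillier helpers from the previous sketch and is only one possible arrangement of the blinding, not the exact protocol of Figs. 5 to 8.

```python
import random

def secure_multiply(pub, priv, cx, cy):
    n, _ = pub
    # Step 1 (party A): blind both operands with random offsets rx and ry.
    rx, ry = random.randrange(n), random.randrange(n)
    bx = h_add(pub, cx, encrypt(pub, rx))               # ((x + rx))
    by = h_add(pub, cy, encrypt(pub, ry))               # ((y + ry))
    # Step 2 (party B): decrypt the blinded values, multiply in clear, re-encrypt.
    prod = (decrypt(priv, bx) * decrypt(priv, by)) % n
    c_prod = encrypt(pub, prod)                          # (((x + rx) * (y + ry)))
    # Step 3 (party A): remove the blinding homomorphically:
    # ((x*y)) = (((x+rx)(y+ry))) + ((-ry*x)) + ((-rx*y)) + ((-rx*ry))
    c = h_add(pub, c_prod, h_scale(pub, n - ry, cx))
    c = h_add(pub, c, h_scale(pub, n - rx, cy))
    return h_add(pub, c, encrypt(pub, -(rx * ry) % n))

cxy = secure_multiply(pub, priv, encrypt(pub, 12), encrypt(pub, 9))
assert decrypt(priv, cxy) == 108                         # 12 * 9, modulo n
```

Because B only ever sees the blinded sums, the decrypted intermediate values reveal nothing about x or y, which is the reason the calculation can be delegated to the private key holder.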
[87] Fig. 5 illustrates a protocol for Secure Outer Product 500 and Fig. 6 illustrates a protocol for Secure Inner Product 600, both of which extend the protocol for Secure Elementwise Product 700 shown in Fig. 7 to form outer- and inner-products, building up a basic linear algebra toolkit. This can be extended to secure matrix multiplication. We re-use the Secure Inner Product protocol 600 as the basis for Secure Matrix Multiply protocol 800 shown in Fig. 8. In one example, the values are encrypted by a single party B (encryption engine 113 on source device) and calculated by another party A (aggregation device). In other words, the source device 110 is remote from the aggregation device 120.
[88] It is noted that a number of protocols and algorithms are disclosed herein and these are to be understood as a description of steps performed by a processor of a computer system to calculate the respective outputs, such as software code, or by dedicated circuits, or by cloud computing instances.
Irreversible Aggregation
[89] The work of [NOCK15] and [PATRINI16] may be used to present relevant parts of the aggregation techniques. Firstly, the work of [NOCK15] presents learners on
specially aggregated datasets, where the data set could be in a single location. This is extended to multiple locations below.
Single (complete) data set
[90] In one example, the data set is a single (coherent) source, i.e. all data is held by a single organisation.
[91] Definition 1 ((numeric) Supervised Learning Space) Given a set of $m > 0$ examples $S = \{(x_i, y_i), i \in \{1, 2, \ldots, m\}\}$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^{1 \times d}$ are observations, $\mathcal{X}$ is the domain, and $y_i \in \{-1, 1\}$ are labels. We are concerned with a (binary) linear classifier $\theta \in \Theta$ for fixed $\Theta \subseteq \mathbb{R}^{1 \times d}$. The label of an observation $x$ is given by:

$$\mathrm{label}(x) = \mathrm{sign}\left(\theta^\top x\right) \in \{-1, 1\}$$
[92] The irreversible aggregation is based on rados, as defined below.

[93] Definition 2 (Rado) Let $\Sigma = \{-1, 1\}^m$. Then, given a set $S$ and any $\sigma \in \Sigma$ with $\sigma = [\sigma_1, \ldots, \sigma_m]^\top$, the Rademacher observation ("rado" herein) $\pi_\sigma$ with signature $\sigma$ is

$$\pi_\sigma = \frac{1}{2} \sum_{i=1}^{m} (\sigma_i + y_i)\, x_i \qquad (3)$$
[94] It is noted that the signature elements $\sigma_i$ are $-1$ or $+1$. As a result, adding the signature element to the label (which is also $-1$ or $+1$) effectively results in a random selection of samples, which are then combined in the sum of equation (3).
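The following sketch computes clear-text rados with NumPy to illustrate equation (3); the data are random placeholders.

```python
import numpy as np

def rado(X, y, sigma):
    """X: (m, d) features; y, sigma: (m,) vectors with entries in {-1, +1}."""
    weights = (sigma + y) / 2.0        # y_i where sigma_i == y_i, and 0 otherwise
    return weights @ X                 # equation (3): one d-dimensional aggregate

rng = np.random.default_rng(0)
m, d = 8, 4
X = rng.normal(size=(m, d))
y = rng.choice([-1, 1], size=m)
sigma = rng.choice([-1, 1], size=m)
print(rado(X, y, sigma))               # irreversible aggregate over a random subset
```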
[95] Further detail on the calculation of rados by the aggregation of samples, including the calculation of block rados, is disclosed in PCT/AU2016/050088, which is incorporated herein by reference.
Multiple data sets
[96] Figs. 9a and 9b illustrate a data schematic with a first peer 901, a second peer 902 and a third peer 903. Some features (dotted; referenced as 911, 912 and 913, respectively) are common to each peer; these are denoted $J \subseteq \mathcal{X}$. One of the common features is the label (or 'class'). Other (non-shared) features are split between the peers. The total sample is the bold outline 920. Fig. 9a shows the Vertical Partition case (VP), where all peers see different observations of the same examples but do not know who is who in the data set. Fig. 9b shows the General case (G), where it is not known whether one example viewed by a peer exists in other peers' datasets. Missing data is marked as $\perp$.
[97] In the following example, different peers have different sub-components of the full data set . In this example, entities are not linked: different text corpora are held by different parties, and no entity resolution is performed.
[98] One element is a basic block (BB) rado.
Definition 3 (BB-Rado) Consider $z' \in U \subseteq V$. Let $\mathrm{lift}(z')$ concatenate zeros to give $z = [z'\ 0]$ such that $z \in V$. For any $s \in \Sigma$, labels $y \in \{-1, 1\}$ and $\alpha \in \mathbb{R}$, the $\alpha$-basic block rado for $(s, y)$ is
Encrypted Rados
[99] It is noted that the features over which the rados operate are encrypted. One example is a two-party scenario as shown in Fig. 10. The encryption occurs after the text documents have been securely processed at the private location of B (source device 110). Encryption engine 113 of B then encrypts the features, and these are then aggregated by aggregation device 120. The aggregation occurs blind to B, that is, remotely from source device 110, and may be performed by an honest-but-curious intermediary $T_B$ (aggregation device 120). The rados $\pi_\sigma$ are generated privately at $T_B$ 120. Once generated, the rados can be used by other honest-but-curious external parties, such as learning device 130.
[100] In this section, secure mathematical operations are applied and the source device 110 and the aggregation device 120 are referred to as $\{B, X\}$, where B is the private key holder and X is an intermediary. X can "see" encrypted features $((x_i))$ and encrypted labels $((y_i))$, and knows the choices of rado vectors (i.e. X knows the values of $\sigma$). It would be possible to operate with $\sigma$ also encrypted.
[101] Fig. 10 outlines the encryption steps for this section. It is possible to extend the rados to secure rado learners. We re-write (3) with the encrypted values made explicit; corresponding secure mathematical operations are also shown. We use the notation $\bigoplus_{i=1}^{n} a_i$ to denote a series of homomorphic addition operations, i.e. $\bigoplus_{i=1}^{n} a_i = a_1 \oplus a_2 \oplus \cdots \oplus a_n$. We use $:=$ as an abuse of notation to denote "has the meaning of" rather than equality.

[102] Equation (5) shows the formation of the (encrypted) rado. The additions and (scalar) multiplications must all be translated to the appropriate homomorphic addition and multiplication operations. The output is an encoded Rademacher observation, based on any numerical field. We will refer to this rado as $((\pi_\sigma))$. Fig. 11 shows a protocol for calculating an encrypted Rademacher observation.
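As an illustrative assumption of how the aggregator could form $((\pi_\sigma))$ with a clear signature but encrypted features and labels, the following sketch reuses the toy Paillier helpers (encrypt, h_add, h_scale) and the secure_multiply sketch above; it is not the protocol of Fig. 11 itself. Negative quantities are represented modulo $n$, so decrypted results require a signed decode.

```python
def encrypted_rado(pub, priv, enc_X, enc_y, sigma):
    """enc_X: m x d list of feature ciphertexts; enc_y: m encrypted labels (+1 or n-1);
    sigma: m clear signature values in {-1, +1}."""
    n, _ = pub
    d = len(enc_X[0])
    acc = [encrypt(pub, 0) for _ in range(d)]               # ((0)) per dimension
    for enc_xi, enc_yi, s in zip(enc_X, enc_y, sigma):
        weight = h_add(pub, enc_yi, encrypt(pub, s % n))     # ((sigma_i + y_i))
        for j in range(d):
            term = secure_multiply(pub, priv, weight, enc_xi[j])
            acc[j] = h_add(pub, acc[j], term)                # running homomorphic sum
    inv2 = pow(2, -1, n)                                     # the 1/2 factor of (3)
    return [h_scale(pub, inv2, c) for c in acc]              # ((pi_sigma)), per dimension
```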
Multiparty case
[103] The case for multiple parties uses the lift(·) function. This function appends zeros onto a vector, and thus (in the encrypted domain) may be represented as appending an encrypted scalar (zero) to the encrypted vector. As above, (16) can be rewritten in the encrypted domain.
[104] In one example, the calculation of rados is based on a random selection of samples, that is, aggregation comprises combining a random selection of samples.
Learning using encrypted rados
Unencrypted single-party case
[105] The learner for rados (in the unencrypted case) is given by [PATRINI16]. Learning device 130 may use the equivalent (exponential) learner for rados.

[106] For any $\theta$ and $S$, and a set of rados generated from signatures $\Sigma^* \subseteq \Sigma$, minimising the exponential rado loss

$$F_{\exp}(\Sigma^*, \theta) = \log\left(\frac{1}{|\Sigma^*|}\sum_{\sigma \in \Sigma^*} \exp\left(-\theta^\top \pi_\sigma\right)\right)$$

is equivalent to minimising the standard logistic loss $F_{\log}(S, \theta)$.

[107] The supervised learning (optimisation) is written as

Problem 1 (Minimise Exponential Loss) The optimal classifier $\theta^*$ is given by solving

$$\theta^* = \arg\min_{\theta} J(\theta), \qquad J(\theta) = \log\left(\frac{1}{|\Sigma^*|}\sum_{\sigma \in \Sigma^*} \exp\left(-\theta^\top \pi_\sigma\right)\right) + \theta^\top \Gamma \theta,$$

and $\theta^\top \Gamma \theta$ is a regularising term.
[108] Further detail about learning on rados is also provided in
PCT/AU2015/050653, which is incorporated herein by reference.
Secure single-party case
[109] In one example, encryption engine 113 encrypts the rados as described above and the learning outcomes are to be kept confidential. While this example is not concerned with the multi-source case, the following description provides mechanics to allow transfer to the scenario where multiple un-shared data sources exist.
[110] The regulariser in the above equation is an application of a protocol for Local Inner Product as shown in Fig. 17. However, in some examples, learning device 130 does not evaluate the above equation but rather performs a gradient descent to solve Problem 1 as explained below.
Gradient descent
[111] With reference to Problem 1 above, it is noted that the gradient of $J(\theta)$ with respect to $\theta_j$ is, for symmetric $\Gamma$,

$$\frac{\partial J}{\partial \theta_j} = -\frac{\sum_{\sigma \in \Sigma^*} \pi_{\sigma,j}\,\exp\left(-\theta^\top \pi_\sigma\right)}{\sum_{\sigma \in \Sigma^*} \exp\left(-\theta^\top \pi_\sigma\right)} + 2\left(\Gamma\theta\right)_j$$

[112] It is further noted that each $\pi_\sigma \in V$. Using the above notation, the gradient can be evaluated with the homomorphic addition and multiplication operations introduced above.
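In the clear, the gradient descent could look like the following NumPy sketch; the learning rate, step count and placeholder data are illustrative assumptions, and the secure variant would replace these vector operations with their homomorphic counterparts.

```python
import numpy as np

def rado_loss_grad(theta, rados, gamma):
    """rados: (N, d) stacked rados; gamma: (d, d) symmetric positive definite."""
    z = rados @ theta                          # theta^T pi_sigma for each rado
    w = np.exp(-(z - z.min()))                 # stabilised exp(-theta^T pi_sigma)
    grad_log = -(w @ rados) / w.sum()          # gradient of the log-sum-exp term
    return grad_log + 2.0 * (gamma @ theta)    # plus gradient of theta^T Gamma theta

def fit(rados, gamma, lr=0.1, steps=500):
    theta = np.zeros(rados.shape[1])
    for _ in range(steps):
        theta -= lr * rado_loss_grad(theta, rados, gamma)
    return theta
```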
Unencrypted, multi-party case
[113] The mean square loss can be used over $V$, i.e. on the limited sample sets, by using a modified mean loss as given in Definition 4.
[114] Definition 4 (M-Rado loss) The M-loss for the classifier $\theta$ is given by equation (6), where the expectation $\mathbb{E}_V$ and the variance $\mathbb{V}_V$ are computed with respect to the uniform sampling of $\sigma$ in $V$, and $\Gamma$ is positive definite. The positive definite matrix $\Gamma$ can be set to a weighted diagonal matrix in which $\varepsilon < 1$ accounts for (lack of) confidence in certain columns of $\pi$.
[115] Minimizing the Ridge regularized square loss over examples is equivalent to minimizing a regularized version of the M-loss, over the complete set of all rados.
[116] The optimal classifier $\theta^*$ is given by the simple closed-form expression:

$$\theta^* = \left(BB^\top + \mathrm{dim}_c(B)\cdot\Gamma\right)^{-1} B\mathbf{1} \qquad (8)$$

where $B$ is the matrix of column-wise stacked rados, $\mathbf{1}$ is the all-ones vector and $\mathrm{dim}_c(B)$ is the number of columns of $B$.
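A clear-text sketch of the closed form (8) follows; $B$ and $\Gamma$ are illustrative placeholders, and a linear solve is used in place of an explicit inverse.

```python
import numpy as np

def closed_form_theta(B, gamma):
    """B: (d, N) rados stacked column-wise; gamma: (d, d) positive definite."""
    n_cols = B.shape[1]
    lhs = B @ B.T + n_cols * gamma        # BB^T + dim_c(B) * Gamma
    rhs = B @ np.ones(n_cols)             # B times the all-ones vector
    return np.linalg.solve(lhs, rhs)      # solve instead of forming the inverse
```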
[117] To find the inverse $\left(BB^\top + \mathrm{dim}_c(B)\cdot\Gamma\right)^{-1}$ in (8), [HALL11] recommends an iterative (Schur [GUO06]) approach. The proposed method may use around $\log_2(\mathrm{dim}(B))$ or 128 iterations. Faster approaches (with fewer iterations and thus fewer multiplications) may be achieved by higher-order algorithms. An outline and review are given in [RAJAGOPALAN96, SOLEYMANI12].
[118] Fig. 12 illustrates a protocol for Secure Matrix Inverse, which uses an iterative algorithm that multiplies matrices to form an inverse. The Secure Matrix Inverse protocol in Fig. 12 uses a significant number of secure and homomorphic operations per iteration:
• $O(n^2)$ secure multiplies, where each secure multiply uses internal homomorphic multiplications and additions; and
• $O(n^2)$ homomorphic multiplies and $O(n^2)$ homomorphic additions.
[119] Using Secure Matrix Inverse from Fig. 12 and Secure Matrix Multiply from Fig. 8, learning device 130 can solve (8) in the encrypted domain at the central coordinator. In this sense, learning device 130 can perform a Secure Rado Solver protocol as shown in Fig. 13.
Blind addition: De-risking the coordinator
[120] The proposed approach avoids a central coordinator with access to all vectors (and also substantially reduces the scope of the coordinator to reveal data) by returning to Definition 4. Equation (6) can be re-written as equation (9), with the corresponding expectation and variance written as the sums (10) and (11).

[121] The sums in (10) and (11) are over the appropriate rados. However, these rados may be calculated by their peers, so the sums may be broken into per-peer summations, where we consider disjoint sets $V_p$ such that $V_1 \cup V_2 \cup \cdots = V$.
[122] Definition 5 (BB-rado per peer) Consider $P$ peers with $p \in \{1, 2, \ldots, P\}$, where each peer $p$ has distinct rados drawn from $V_p$, and the rados are distinct in $V = V_1 \cup V_2 \cup \cdots \cup V_P$. For each peer $p$ we have an expectation $e_p$, given by equation (12), and a variance

$$v_p = \sum_{\pi_\sigma \in V_p} (\pi_\sigma - b)(\pi_\sigma - b)^\top \qquad (13)$$

Each peer $p$ can calculate $e_p$ and $v_p$ independently.
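A clear-text sketch of these per-peer statistics is given below; the form of the local aggregate $e_p$ (a sum of the peer's rados) and the availability of the global mean $b$ are assumptions made for illustration.

```python
import numpy as np

def peer_statistics(peer_rados, b):
    """peer_rados: (N_p, d) rados held locally by peer p; b: (d,) global mean."""
    e_p = peer_rados.sum(axis=0)          # local aggregate (assumed form of (12))
    centred = peer_rados - b
    v_p = centred.T @ centred             # sum of outer products, equation (13)
    return e_p, v_p
```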
[123] Although the mean $b$ may be calculated centrally, it is preferable to use secure multiparty addition to achieve the same result. This reduces the scope for the coordinator to access (encrypted, aggregated) vectors and instead gives it access only to noisy aggregates of the data.
Combining the elements
[124] The above description provides a toolkit to:
1. transform text to numeric values;
2. encrypt those values;
3. aggregate the encrypted values;
4. learn on the aggregated values using secure mathematical operations.
[125] In one example this may be implemented via securely accredited text processing software installed on source device 110, with the encryption software 113 running on the machine that has access to the raw data, whilst the aggregation software operates on a separate server (aggregation device 120) with no access to the raw data. Additionally, the solution of the optimisation to find $\theta$ is performed at a central coordinator (learning device 130), which is also responsible for generating the public-private keys. Fig. 14 shows an algorithm Classifier for Secure Text with Central Coordinator. This algorithm uses a number of intermediate steps, which are detailed in Figs. 5-8 and 11-13.
[126] The algorithm may incorporate the secure multi-party summation work of [CLIFTON02] to prevent the coordinator from obtaining the rados directly. This adds a third layer of obfuscation to the data (encryption, aggregation, blind addition), which means that at the coordinator (who can decrypt the data) the data remains protected by the aggregation to rados and the blind addition of the vectors.
[127] Fig. 15 illustrates a communication architecture 1500 for multiple peers 1501 and 1502 and a common coordinator 1503. Fig. 15 outlines the protocol from Fig. 14. The coordinator 1503 sends a common dictionary and public key to all peers 1501 and 1502. Each peer has different data components, with some common elements ( 1510, 1511) - see also Figs. 9a and 9b. Some examples may be unique to certain peers. Each peer has encrypted and aggregated its local data (as indicated at 1512 and 1513). The servers 1512 and 1513 correspond to the "intermediary" in Fig. 10. The encryption key is generated by the coordinator 1503. Dashed arrows denote information transfers between participants, whilst solid arrows denote local transformations (at the respective participant). Not all transfers are shown.
[128] Each peer $p$ (1501 and 1502) now classifies a particular observation vector $x$ and could calculate

$$\hat{y} = \mathrm{sign}\left(\theta^\top x\right) \qquad (14)$$

[129] However, each peer only has a subset of the features of $x$. The label is determined only by the sign of a scalar, and hence it is possible to break the inner product $\theta^\top x$ into an inner product of local features and remote features:

$$\hat{y} = \mathrm{sign}\left(\theta_{\mathrm{local}}^\top x_{\mathrm{local}} + \theta_{\mathrm{remote}}^\top x_{\mathrm{remote}}\right) \qquad (15)$$
[130] The local component of (16) may be calculated at peer $p$. If we denote the local classifier result as $a_p$, then we may write the overall classification, equation (19), as the sign of the sum of the local classifier results.

[131] The summation in (19) is the sum of all (local) calculated classifier results on the sub-components of the vector $x$. The result of (19) shows that the remote classification results may be treated as offsets for the local result, i.e. the remote inner products act as corrections to the local result. However, in this example every peer shares linking information about the observation $x$. To avoid this, we replace the summation in (19) with an equivalent rado:
$$\mathrm{sign}\left(a_p + \theta^\top \pi_\sigma\right) \qquad (20)$$
[132] In the homomorphic encrypted case, the local inner product can be calculated by keeping the encrypted classifier vector in its encrypted form and treating the elements of $x$ as unencrypted scalars. The local inner product is written explicitly in the protocol for Local Inner Product as shown in Fig. 17. The inner product with the rado is performed at the coordinator C (learning device 130), using the secure inner product. Finally, the summation may be achieved using multi-party secure addition, as outlined in [CLIFTON02].
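A sketch of this local inner product follows, reusing the toy Paillier helpers above; it assumes non-negative integer feature values small enough for the toy modulus.

```python
def local_inner_product(pub, enc_theta, x):
    """enc_theta: encrypted weights ((theta_j)); x: clear non-negative integers."""
    acc = encrypt(pub, 0)                                     # ((0))
    for c_theta_j, x_j in zip(enc_theta, x):
        acc = h_add(pub, acc, h_scale(pub, x_j, c_theta_j))   # add ((theta_j * x_j))
    return acc                                                # ((theta^T x))
```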
[133] Fig. 16 shows a protocol for Local Classify for Secure Text.
[134] Definition 6 (Matrix Inverse, using only products and sums) cf. [?, eqn. 2.2]. Given a matrix $A$ with inverse $V = A^{-1}$, we can iteratively approximate $V$ as

$$V_{n+1} = V_n\left(3I - A V_n\left[3I - A V_n\right]\right) \qquad (21)$$
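A clear-text NumPy sketch of iteration (21) follows; the scaled-transpose initialisation and the step count are common choices assumed here for illustration.

```python
import numpy as np

def iterative_inverse(A, steps=20):
    n = A.shape[0]
    I = np.eye(n)
    V = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))  # safe starting guess
    for _ in range(steps):
        AV = A @ V
        V = V @ (3 * I - AV @ (3 * I - AV))   # one step of iteration (21), products and sums only
    return V

A = np.array([[4.0, 1.0], [1.0, 3.0]])
print(np.round(A @ iterative_inverse(A), 6))  # approximately the identity matrix
```

Because each step uses only matrix products and sums, it maps directly onto the homomorphic operations available in the Secure Matrix Inverse protocol.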
Single cohort learner
[135] The equivalent (exponential) learner for rados can be used here.
[136] Lemma 2 (Rado Learning) For any $\theta$ and $S$, and a set of rados with signatures $\Sigma^* \subseteq \Sigma$, minimising the loss

$$F_{\exp}(\Sigma^*, \theta) = \log\left(\frac{1}{|\Sigma^*|}\sum_{\sigma \in \Sigma^*} \exp\left(-\theta^\top \pi_\sigma\right)\right) \qquad (22)$$

is equivalent to minimising the standard logistic loss $F_{\log}(S, \theta)$.

[137] The supervised learning (optimisation) is written as

[138] Problem 2 (Minimise Exponential Loss) The optimal classifier $\theta^*$ is given by solving

$$\theta^* = \arg\min_{\theta} J(\theta) \qquad (23)$$

where

$$J(\theta) = \log\left(\frac{1}{|\Sigma^*|}\sum_{\sigma \in \Sigma^*} \exp\left(-\theta^\top \pi_\sigma\right)\right) + \theta^\top\Gamma\theta \qquad (24)$$

and $\theta^\top\Gamma\theta$ is a regularising term.
[139] Note that the gradient of $J(\theta)$ with respect to $\theta_j$ takes the same form as the gradient given for Problem 1 above.
[140] The learner can be developed using secure exponentiation.
[141] Fig. 18 illustrates a computer network 1800 for automatic text analysis including a computer system 1801 implementing coordinator 1503 from Fig. 15. The computer system 1801 comprises a processor 1802 connected to a program memory 1804, a data memory 1806 and a communication port 1808. The program memory 1804 is a non-transitory computer readable medium, such as a hard drive, a solid state disk or CD-ROM. Software, that is, an executable program stored on program memory 1804, causes the processor 1802 to perform the steps performed by learning device 130 and/or coordinator 1503.
[142] The processor 1802 may then store the learning result, such as a learned correlation between feature values and label values, on data store 1806, such as on RAM or a processor register. Processor 1802 may also send the determined correlation via communication port 1808 to a server, such as an automatic evaluation engine. For example, the learned correlations may assist doctors in automatically diagnosing patients.
[143] The processor 1802 may receive data, such as learning samples, from data memory 1806 as well as from the communications port 1808. In one example, the processor 1802 receives learning sample data from aggregation device 120 via communications port 1808, such as by using a Wi-Fi network according to IEEE 802.11. The Wi-Fi network may be a decentralised ad-hoc network, such that no dedicated management infrastructure, such as a router, is required, or a centralised network with a router or access point managing the network.
[144] Although communications port 1808 is shown as a distinct entity, it is to be understood that any kind of data port may be used to receive data, such as a network connection, a memory interface, a pin of the chip package of processor 1802, or logical ports, such as IP sockets or parameters of functions stored on program memory 1804 and executed by processor 1802. These parameters may be stored on data memory 1806 and may be handled by-value or by-reference, that is, as a pointer, in the source code.
[145] The processor 1802 may receive data through all these interfaces, which includes memory access of volatile memory, such as cache or RAM, or non-volatile memory, such as an optical disk drive, hard disk drive, storage server or cloud storage. The computer system 1800 may further be implemented within a cloud computing environment, such as a managed group of interconnected servers hosting a dynamic number of virtual machines.
[146] It is to be understood that throughout this disclosure unless stated otherwise, nodes, edges, graphs, solutions, variables, samples, features, labels and the like refer to
data structures, which are physically stored on data memory 1806 or processed by processor 1802. Further, for the sake of brevity when reference is made to particular variable names, such as "period of time" or "quantity of movement" this is to be understood to refer to values of variables stored as physical data in computer system 1800.
[147] It is noted that the learning on the aggregated values may comprise
determining learning parameter values indicative of a relationship between the features and labels of the learning samples as described herein. These learning parameter values can then be applied to current features to determine an estimated label for the current features. The system may then further comprise an output engine to
automatically suggest actions based on the estimated label. For example, the learning device 130 may distribute the learning parameters, such as machine learning model parameters including linear factors of regression models, back to the source device 110. For example, source device 110 may be used by a doctor to enter patient data and diagnostic data for each patient. Learning device 130 learns the contribution of each observation to particular diseases in the form of regression parameters. This learning is based on learning samples not only from the one source device 110 but from a large number of source devices while maintaining confidentiality as described above. The learning parameters also do not contain patient-confidential information and can therefore be distributed freely back to source device 110. For the next patient, the doctor can enter the observations and the source device 110 automatically proposes a diagnosis or a number of possible diagnoses for the doctor to evaluate. In this sense, source device automatically suggests actions, such as providing information to the patient or applying a certain therapy.
[148] It is noted that the doctor's observations may also be accompanied by genomic data or meta-genetics including habits of the patient, such as smoking etc. The proposed systems and methods can equally be applied to this information. In other examples, the text data is related to other applications, such as social media, email, SMS, credit ratings and others.
[149] It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the above-described embodiments, without departing from the broad general scope of the present disclosure. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
[150] The following references are incorporated herein by reference:
[BIRD09] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. O'Reilly Media, 2009, available: http://www.nltk.org/book_1ed/.
[DAMGARD01] I. Damgard and M. Jurik, "A generalisation, a simplification and some applications of Paillier's probabilistic public-key system," in PKC '01
Proceedings of the 4th International Workshop on Practice and Theory in Public Key Cryptography: Public Key Cryptography, 2001, pp. 119-136.
[CLIFTON02] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu, "Tools for privacy preserving distributed data mining" SIGKDD Explorations, vol. 4, no. 2, 2002.
[FRANZ11] M. Franz, "Secure computations on non-integer values," Ph.D. dissertation, Technische Universitat Darmstadt, 2011.
[GUO06] C.-H. Guo and N. J. Higham, "A Schur-Newton method for the matrix p'th root and its inverse," Manchester Institute for Mathematical Sciences, Tech. Rep., Oct. 2006.
[HALL11] R. Hall, S. E. Fienberg, and Y. Nardi, "Secure multiple linear regression based on homomorphic encryption," Journal of Official Statistics, vol. 27, no. 4, pp. 669-691, 2011 (revised 2013).
[JOHNSON14] M. Johnson, "Beyond the 1-best pipeline," Presentation slides, at NICTA-NLP workshop 2014, Sep. 2014.
[MANNING14] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, "The Stanford CoreNLP natural language processing toolkit," in
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55-60.
[MCCALLUM02] A. K. McCallum, "MALLET: A machine learning for language toolkit," 2002, http://mallet.cs.umass.edu.
[MORTON05] T. Morton, J. Kottmann, J. Baldridge, and G. Bierner, "OpenNLP: A Java-based NLP toolkit," 2005, http://opennlp.sourceforge.net.
[NOCK15] R. Nock, G. Patrini, and A. Friedman, "Rademacher observations, private data, and boosting," Journal of Machine Learning Research, vol. 37, 2015.
[PAILLIER99] P. Paillier, "Public-key cryptosystems based on composite degree residuosity classes," in EUROCRYPT. Springer-Verlag, 1999, pp. 223-238.
[PATRINI16] G. Patrini, R. Nock, S. Hardy, and T. Caetano, "Fast learning from distributed datasets without entity matching," Data61, Tech. Rep., Mar. 2016.
[RAJAGOPALAN96] J. Rajagopalan, "An iterative algorithm for inversion of matrices," Ph.D. dissertation, Concordia University, Montreal, Canada, Sep. 1996.
[SHANNON49] C. E. Shannon, "Communication theory of secrecy systems," The Bell System Technical Journal, vol. 28, no. 4, pp. 656-715, May 1949.
[SOLEYMANI12] F. Soleymani, "A rapid numerical algorithm to compute matrix inversion," International Journal of Mathematics and Mathematical Sciences, vol. 2012, 2012.
[STENETORP12] P. Stenetorp, S. Pyysalo, G. Topic, T. Ohta, S. Ananiadou, and J. Tsujii, "brat: a web-based tool for NLP-assisted text annotation," in Proceedings of the Demonstrations Session at EACL, 2012.
Claims
1. A system for automatic text analysis, the system comprising:
a source device comprising
an input port to receive text data representing text that comprises multiple text segments,
a text processor to transform the text data to numeric values that represent the multiple text segments, and
an encryption engine to encrypt the numeric values into encrypted numeric values that represent the multiple text segments in encrypted form;
an aggregation device to aggregate the encrypted numeric values into aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form; and
a learning device remote from the source device to provide an analysis of the text data by performing a learning method on the aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form using operations that maintain the aggregated encrypted numeric values and thereby the text data concealed from the learning device.
2. The system of claim 1 wherein the text processor is a language processor.
3. The system of claim 2, wherein transforming the text to numeric values is based on natural language processing.
4. The system of any one of the preceding claims, wherein transforming the text to numeric values is based on a dictionary.
5. The system of claim 4, wherein the dictionary comprises multiple dictionary entries and each of the multiple dictionary entries is associated with a numeric value and transforming the text to numeric values comprises replacing a word from the text data with a corresponding numeric value from the dictionary.
6. The system of claim 5, wherein each numeric value from the dictionary is associated with a tag from natural language processing.
7. The system of claim 6, wherein the tag represents the grammatical function of that phrase in a sentence.
8. The system of claim 6 or 7, wherein the tag comprises multiple labels.
9. The system of claim 8, wherein the multiple labels are one-hot encoded.
10. The system of any one of the preceding claims, wherein the encryption engine uses homomorphic encryption.
11. The system of claim 10, wherein the homomorphic encryption is Paillier encryption.
12. The system of any one of the preceding claims, wherein the aggregation device is remote from the source device and the learning device.
13. The system of any one of the preceding claims, wherein the numeric values represent features and labels of learning samples, the encryption engine encrypts the features and labels to create encrypted features and encrypted labels and aggregation comprises combining the encrypted features based on the encrypted labels.
14. The system of claim 13, wherein aggregation comprises adding the features multiplied by the labels.
15. The system of claim 14, wherein adding features multiplied by the labels comprises adding homomorphically encrypted features multiplied by homomorphically encrypted labels.
16. The system of any one of claims 13 to 15, wherein aggregation comprises combining a random selection of samples.
17. The system of any one of claims 13 to 16, wherein aggregation comprises calculating block rados.
18. The system of any one of claims 13 to 17, wherein learning on the aggregated values comprises determining learning parameter values indicative of a relationship between the features and labels of the learning samples and the method further comprises:
applying the learning parameter values to current features to determine an estimated label for the current features.
19. The system of claim 18, further comprising an output engine to automatically suggest actions based on the estimated label.
20. The system of any one of the preceding claims, wherein the learning device is configured to calculate a rado loss to learn from the aggregated values.
21. The system of any one of the preceding claims, wherein the operations that maintain the aggregated numeric values concealed from the learning device comprise one or more of:
Secure Outer Product;
Secure Inner Product;
Secure Elementwise Product;
Secure Matrix Multiply;
Secure Matrix Inverse;
Secure Rado Solver;
Classifier for Secure Text with Central Coordinator;
Local Classify for Secure Text; and
Local Inner Product.
22. A method for automatic text analysis, the method comprising:
transforming text data representing text that comprises multiple text segments to numeric values that represent the multiple text segments;
encrypting the numeric values to create encrypted numeric values that represent the multiple text segments in encrypted form;
aggregating the encrypted numeric values to create aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form; and
providing an analysis of the text data by performing a learning method, remotely from the transforming, on the aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form using operations that maintain the aggregated encrypted numeric values and thereby the text data concealed from the learning device.
23. Software that, when installed on a computer, causes the computer to perform the method of claim 22.
24. A method for encrypting text data representing text that comprises multiple text segments, the method comprising:
transforming the text data to numeric values that represent the multiple text segments, the numeric values forming multiple samples, each sample comprising one or more features and a label; and
encrypting the numeric values forming the multiple samples to obtain encrypted samples that represent the multiple text segments in encrypted form.
25. A method for aggregating data, the method comprising:
receiving multiple encrypted samples that represent multiple text segments in encrypted form, each of the multiple encrypted samples comprising one or more encrypted numeric features and an encrypted numeric label;
creating aggregated samples by combining the encrypted numeric features based on the encrypted numeric label for each encrypted sample into aggregated encrypted samples that represent the multiple text segments in aggregated encrypted form.
26. A method for machine learning on data, the method comprising:
receiving aggregated samples that represent multiple text segments of text data in aggregated encrypted form, the aggregated samples comprising combined encrypted numeric features; and
learning based on the aggregated samples using secure mathematical operations to provide an analysis of the text data by performing a learning method on the aggregated samples that represent the multiple text segments in aggregated encrypted form using operations that maintain the aggregated encrypted samples and thereby the text data concealed from a learning device performing the method.