WO2018102861A1 - Secure text analytics - Google Patents
- Publication number: WO2018102861A1 (PCT/AU2017/051315)
- Authority: WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/22—Social work or social welfare, e.g. community support activities or counselling services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Definitions
- Encryption engine 113 may use the approach of [SHANNON49] to improve independent secret systems by concatenating them. In this case, encryption engine firstly encrypts the numeric features and labels (using a Paillier homomorphic encryption system [PAILLIER99, DAMGARD01]). This makes direct interpretation difficult. Second, encryption engine 113 passes the encrypted values to aggregation device 120 to address the data dependencies by applying irreversible aggregation to the numeric data so as to hide many of the implied dependencies between observations and to create aggregated encrypted numeric values. These aggregated values still represent the multiple text segments but in aggregated encrypted form.
- learning device 130 wraps the learning process in a secure learning approach, to further reduce the capacity of an inquisitive user to discover the underlying labels. This reflects the idea that data dependences can be accounted for in training, validation, and testing of machine learning methods in order to produce reliable performance estimates. This essentially extends the pipeline of Fig. 2 to include the encryption shown in Fig. 1. The pre-processing pipeline of Fig. 2 is encapsulated in the text processor 112.
- the Paillier encryption scheme (and later generalisations) is a public-private key encryption scheme.
- the following notation is used: an integer x is encrypted as ((x)), and the corresponding decryption operation recovers x from ((x)).
- these are public-key encryptions: users can encrypt data (and perform computations) using a common public key; however, only the user with the corresponding private key can extract the clear data.
- Aggregation device 120 can sum encrypted values in the encrypted domain (and subsequently decrypt the result), as in (1), and can multiply encrypted values by unencrypted scalars, as in (2). The result (2) does not apply when the scalar a is itself encrypted. For more advanced operations (such as multiplying two encrypted values) the devices can use secure computation.
- Fig. 5 illustrates a protocol for Secure Outer Product 500
- Fig. 6 illustrates a protocol for Secure Inner Product 600, both of which extend the protocol for Secure Elementwise Product 700 shown in Fig. 7 to form outer- and inner-products, building up a basic linear algebra toolkit. This can be extended to secure matrix multiplication.
- the values are encrypted by a single party B (encryption engine 113 on source device) and calculated by another party A (aggregation device).
- the source device 110 is remote from the aggregation device 120.
- the work of [NOCK15] and [PATRINI16] may be used to present relevant parts of the aggregation techniques. Firstly, the work of [NOCK15] presents learners on specially aggregated datasets, where the data set could be in a single location. This is extended to multiple locations below.
- the data set is a single (coherent) source, i.e. all data is held by a single organisation.
- Figs. 9a and 9b illustrate a data schematic, with first peer 901, second peer 902 and third peer 903. Some features (dotted, referenced as 911, 912 and 913, respectively) are common to each peer; these shared features are denoted as a common subset of the feature set. One of the common features is the label (or 'class'). Other (non-shared) features are split between all peers. The total sample is the bold outline 920.
- Fig. 9a shows the Vertical Partition case (VP) where all peers see different observations of the same examples, but do not know who is who in the data set.
- Fig. 9b shows the General case (G) where it is not known whether one example viewed by a peer exists in other peers' datasets. Missing data is marked as ⊥.
- One element is a basic block (BB) rado.
- the features over which the rados will operate are encrypted.
- One example is a two-party scenario as shown in Fig. 10.
- the encryption process occurs after securely processing the text documents at the private location of B (source device 110).
- encryption engine 113 of B holds the private key; encryption engine 113 encrypts the features, and these are then aggregated by aggregation device 120.
- the aggregation occurs blind to B, that is, remotely from source device 110, and may be performed by an honest-but-curious intermediary (aggregation device 120).
- the rados are generated privately at that intermediary.
- the rados can be used by other honest-but-curious external parties, such as learning device 130.
- secure mathematical operations are applied and the source device 110 and the aggregation device 120 are referred to as {B, X}, where B is the private key holder and X is an intermediary.
- X can "see" encrypted features ((x_i)) and encrypted labels ((y_i)), and knows the choices of rado vectors (i.e. X knows the values of σ). It would be possible to operate with σ also encrypted.
- Fig. 10 outlines the encryption steps for this section. It is possible to extend the rados to secure rado learners. We re-write (3) with the encrypted values made explicit. Corresponding secure mathematical operations are also shown. We use the notation ⊕ to denote a series of homomorphic addition operations in the encrypted domain.
- Equation (5) shows the formation of the (encrypted) rado.
- the additions and (scalar) multiplications must all be translated to the appropriate homomorphic addition and multiplication operations.
- the output is an encoded Rademacher observation, based on any numerical field. We will refer to this encrypted rado below. Fig. 11 shows a protocol for calculating an encrypted Rademacher.
- Learning device 130 may use the equivalent (exponential) learner for rados, in which a regularising term is included (see equation (24) below).
- encryption engine 113 encrypts the rados as described above and the learning outcomes are to be kept confidential. While this example is not concerned with the multi-source case, the following description provides mechanics to allow transfer to the scenario where multiple un-shared data sources exist.
- the regulariser in the above equation is an application of a protocol for Local Inner Product as shown in Fig. 17.
- learning device 130 does not evaluate the above equation but rather performs a gradient descent to solve Problem 1 as explained below.
- the mean square loss can be used over V, i.e. on the limited sample sets, by using a modified mean loss as given in Definition 4:
- θ* = (BB^T + dim_c(B)·λI)^(-1) B·1   (8), where B is the matrix of stacked (column-wise) rados and dim_c(B) is the number of columns of B.
- Fig. 12 illustrates a protocol for Secure Matrix Inverse, which uses an iterative algorithm to multiply matrices to form an inverse.
- the Secure Matrix Inverse protocol in Fig. 12 uses a significant number of secure and homomorphic operations per iteration.
- the secure multiplies use internal homomorphic multiplications and additions.
- each peer p can calculate e_p and v_p independently.
- while the mean b may be calculated centrally, it is preferable to use secure multiparty addition to achieve the same result. This reduces the scope for the coordinator to access (encrypted, aggregated) vectors, and (instead) only access noisy aggregates of the data.
- this may be implemented via securely accredited text processing software installed on source device 110, with the encryption software 113 running on the machine with access to the raw data, whilst the aggregation software operates on a separate server (aggregation device 120) with no access to the raw data.
- the solution of the optimisation to find θ is performed at a central coordinator (learning device 130), which is also responsible for generating the public-private keys.
- Fig. 14 shows an algorithm Classifier for Secure Text with Central Coordinator. This algorithm uses a number of intermediate steps, which are detailed in Figs. 5-8 and 11-13.
- the algorithm may incorporate the secure multi-party summation work of [CLIFTON02].
- Fig. 15 illustrates a communication architecture 1500 for multiple peers 1501 and 1502 and a common coordinator 1503.
- Fig. 15 outlines the protocol from Fig. 14.
- the coordinator 1503 sends a common dictionary and public key to all peers 1501 and 1502.
- Each peer has different data components, with some common elements ( 1510, 1511) - see also Figs. 9a and 9b.
- Each peer has encrypted and aggregated its local data (as indicated at 1512 and 1513).
- the servers 1512 and 1513 correspond to the "intermediary" in Fig. 10.
- the encryption key is generated by the coordinator 1503. Dashed arrows denote information transfers between participants, whilst solid arrows denote local transformations (at the respective participant). Not all transfers are shown.
- each peer only has a subset of the features of x.
- the label may be determined only by the sign of a scalar, and hence it is possible to break the inner product θ^T x into an inner product of local features and remote features:
- the local component of (16) may be calculated at peer p. If we denote the local classifier result at peer p accordingly, we may write
- the summation in (19) is the sum of all (local) calculated classifier results on the sub-components of the vector x .
- the result of (19) shows that the remote classification results may be treated as offsets for the local result— i.e. the remote inner products act as corrections to the local result.
- every peer shares linking information about the observation x .
- the local inner product can be calculated by keeping the encrypted classifier vector in its encrypted form, and treating the elements of x as unencrypted scalars.
- the local inner product is written explicitly in the protocol for Local Inner Product as show in Fig. 17.
- the inner product with the rado is performed at the coordinator C (learning device 130), using secure inner product.
- the summation may be achieved using multi-party secure addition, as outlined in [CLIFTON02] (a simple numeric sketch of such a summation is given at the end of this section).
- Fig. 16 shows a protocol for Local Classify for Secure Text.
- V_{n+1} = V_n(3I - A·V_n[3I - A·V_n])   (21) is the iteration used to form the matrix inverse in the Secure Matrix Inverse protocol of Fig. 12 (a plain-text numerical sketch of this iteration is given at the end of this section).
- ℓ(θ) = log Σ exp(-θ^T π) + Ω(θ)   (24), where the sum runs over the rados π and Ω(θ) is a regularising term.
- the learner can be developed using secure exponentiation.
- Fig. 18 illustrates a computer network 1800 for automatic text analysis including a computer system 1801 implementing coordinator 1503 from Fig. 15.
- the computer system 1801 comprises a processor 1802 connected to a program memory 1804, a data memory 1806 and a communication port 1808.
- the program memory 1804 is a non-transitory computer readable medium, such as a hard drive, a solid state disk or CD-ROM.
- Software, that is, an executable program stored on program memory 1804, causes the processor 1802 to perform the steps performed by learning device 130 and/or coordinator 1503.
- the processor 1802 may then store the learning result, such as a learned correlation between feature values and label values, on data store 1806, such as on RAM or a processor register.
- Processor 1802 may also send the determined correlation via communication port 1808 to a server, such as an automatic evaluation engine. For example, the learned correlations may assist doctors in automatically diagnosing patients.
- the processor 1802 may receive data, such as learning samples, from data memory 1806 as well as from the communications port 1808.
- the processor 1802 receives learning sample data from aggregation device 120 via communications port 1808, such as by using a Wi-Fi network according to IEEE 802.11.
- the Wi-Fi network may be a decentralised ad-hoc network, such that no dedicated management infrastructure, such as a router, is required or a centralised network with a router or access point managing the network.
- although communications port 1808 is shown as a distinct entity, it is to be understood that any kind of data port may be used to receive data, such as a network connection, a memory interface, a pin of the chip package of processor 1802, or logical ports, such as IP sockets or parameters of functions stored on program memory 1804 and executed by processor 1802. These parameters may be stored on data memory 1806 and may be handled by-value or by-reference, that is, as a pointer, in the source code.
- the processor 1802 may receive data through all these interfaces, which includes memory access of volatile memory, such as cache or RAM, or non-volatile memory, such as an optical disk drive, hard disk drive, storage server or cloud storage.
- volatile memory such as cache or RAM
- non-volatile memory such as an optical disk drive, hard disk drive, storage server or cloud storage.
- the computer system 1800 may further be implemented within a cloud computing environment, such as a managed group of interconnected servers hosting a dynamic number of virtual machines.
- nodes, edges, graphs, solutions, variables, samples, features, labels and the like refer to data structures, which are physically stored on data memory 1806 or processed by processor 1802. Further, for the sake of brevity when reference is made to particular variable names, such as "period of time” or “quantity of movement” this is to be understood to refer to values of variables stored as physical data in computer system 1800.
- the learning on the aggregated values may comprise determining learning parameter values indicative of a relationship between the features and labels of the learning samples.
- the system may then further comprise an output engine to automatically suggest actions based on the estimated label.
- the learning device 130 may distribute the learning parameters, such as machine learning model parameters including linear factors of regression models, back to the source device 110.
- source device 110 may be used by a doctor to enter patient data and diagnostic data for each patient. Learning device 130 learns the contribution of each observation to particular diseases in the form of regression parameters. This learning is based on learning samples not only from the one source device 110 but from a large number of source devices while maintaining confidentiality as described above. The learning parameters also do not contain patient-confidential information and can therefore be distributed freely back to source device 110.
- the doctor can enter the observations and the source device 110 automatically proposes a diagnosis or a number of possible diagnoses for the doctor to evaluate. In this sense, source device automatically suggests actions, such as providing information to the patient or applying a certain therapy.
- doctor's observations may also be accompanied by genomic data or meta-genetics including habits of the patient, such as smoking etc.
- the proposed systems and methods can equally be applied to this information.
- the text data is related to other applications, such as social media, email, SMS, credit ratings and others.
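The iterative matrix inverse referred to above (equation (21)) can be illustrated in the clear. The sketch below is a plain-text illustration only: it shows the third-order iteration V_{n+1} = V_n(3I - A·V_n[3I - A·V_n]) converging to the inverse of a small matrix, with an assumed starting point V_0 = A^T / (||A||_1 · ||A||_inf) that is not taken from this document; in the protocol of Fig. 12 each matrix product would instead be carried out as a Secure Matrix Multiply on encrypted values.

```python
# Plain-text sketch of the iterative matrix inverse of equation (21):
#   V_{n+1} = V_n (3I - A V_n (3I - A V_n))
# In the Secure Matrix Inverse protocol each product would be a Secure Matrix
# Multiply on encrypted values; here everything is computed in the clear.
import numpy as np

def iterative_inverse(A, iterations=30):
    n = A.shape[0]
    I = np.eye(n)
    # Assumed starting point (not from this document): scaling A^T so that the
    # spectral radius of (I - A V_0) is below 1, which guarantees convergence.
    V = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(iterations):
        AV = A @ V
        V = V @ (3 * I - AV @ (3 * I - AV))
    return V

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
print(np.round(iterative_inverse(A) @ A, 6))  # approximately the identity matrix
```

The multi-party secure addition mentioned above in connection with [CLIFTON02] can likewise be illustrated with a simple ring-based sum: the coordinator masks its contribution with a random offset, every peer adds its own value to the masked running total, and the coordinator removes the mask at the end, so no participant sees another participant's raw value. This is a generic sketch of that family of protocols rather than the exact protocol of the cited work, and the modulus and party roles are illustrative assumptions.

```python
# Ring-based secure sum sketch: the coordinator adds a random mask, each peer
# adds its private value to the running total, and the coordinator removes the
# mask at the end. Working modulo a large field size means the running total
# carries no information about the individual contributions.
import secrets

FIELD = 2**64  # illustrative modulus, chosen large enough to avoid wrap-around

def secure_sum(coordinator_value, peer_values):
    mask = secrets.randbelow(FIELD)
    running = (coordinator_value + mask) % FIELD   # coordinator starts the ring
    for value in peer_values:                      # each peer sees only a masked total
        running = (running + value) % FIELD
    return (running - mask) % FIELD                # coordinator removes its mask

print(secure_sum(5, [11, 7, 20]))  # 43
```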
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Health & Medical Sciences (AREA)
- Tourism & Hospitality (AREA)
- Data Mining & Analysis (AREA)
- Human Resources & Organizations (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Primary Health Care (AREA)
- Public Health (AREA)
- Marketing (AREA)
- Physics & Mathematics (AREA)
- Economics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Entrepreneurship & Innovation (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Child & Adolescent Psychology (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Operations Research (AREA)
- Pathology (AREA)
- Quality & Reliability (AREA)
- Epidemiology (AREA)
- Machine Translation (AREA)
Abstract
This disclosure relates to automatic text analysis. A source device comprises an input port to receive text data representing text that comprises multiple text segments. A text processor transforms the text data to numeric values that represent the multiple text segments. An encryption engine encrypts the numeric values into encrypted numeric values. An aggregation device aggregates the encrypted numeric values into aggregated encrypted numeric values. Finally, a learning device remote from the source device provides an analysis of the text data by performing a learning method on the aggregated encrypted numeric values using operations that maintain the aggregated encrypted numeric values and thereby the text data concealed from the learning device. The numeric data values allow encryption to preserve privacy while, at the same time, allowing learning on the numeric data on a remote device without disclosing the text data because the aggregated numeric values remain concealed from the learning device.
Description
"Secure text analytics"
Related applications
[1] The present application claims priority from Australian Provisional Patent Application No 2016905070 filed on 8 December 2016, the content of which is incorporated herein by reference. PCT/AU2016/050088 discloses a method for learning from distributed data and is incorporated herein by reference.
PCT/AU2015/050653 discloses a method for learning with transformed data and is incorporated herein by reference. PCT/AU2015/050661 discloses gradients over distributed datasets and is incorporated herein by reference.
Technical Field
[2] This disclosure relates to automatic text analysis.
Background
[3] The amount of unstructured data used by organisations is growing. However, the difficulty in processing text represents a significant barrier to extracting maximum value from this large and growing resource.
[4] Private text data with confidential or otherwise restricted content is difficult to "open" in the sense of open-data. Privacy requirements in text data are difficult to guarantee due to the inter-dependencies of text, and grammar.
[5] Components of the text document that might be subject to privacy concerns (names of persons, places, drug-names, disease-names) are likely to be the very components most interesting for text processing— and most damaging to algorithm performance if altered. More advanced privacy requirements, such as determining whether a part of a document is subject to "national security", make general autonomous text replacements infeasible.
[6] Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present disclosure as it existed before the priority date of each claim of this application.
[7] Throughout this specification the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.
Summary
[8] A system for automatic text analysis comprises:
a source device comprising
an input port to receive text data representing text that comprises multiple text segments,
a text processor to transform the text data to numeric values that represent the multiple text segments, and
an encryption engine to encrypt the numeric values into encrypted numeric values that represent the multiple text segments in encrypted form;
an aggregation device to aggregate the encrypted numeric values into aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form; and
a learning device remote from the source device to provide an analysis of the text data by performing a learning method on the aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form using operations that maintain the aggregated encrypted numeric values and thereby the text data concealed from the learning device.
[9] It is a technical advantage that the numeric data values allow encryption to preserve privacy while, at the same time, allowing learning on the numeric data on a
remote device without disclosing the text data because the aggregated numeric values remain concealed from the learning device.
[10] The text processor may be a language processor.
[11] Transforming the text to numeric values may be based on natural language processing.
[12] It is a technical advantage that a language processor can consider grammar in the text data. As a result, the transformed numeric values represent the text data more accurately.
[13] Transforming the text to numeric values may be based on a dictionary.
[14] It is a technical advantage that a dictionary is a highly optimised data structure that allows look-up and creation with small computational overhead. As a result, many operations on the dictionary can be performed in a short time.
[15] The dictionary may comprise multiple dictionary entries and each of the multiple dictionary entries is associated with a numeric value and transforming the text to numeric values comprises replacing a word from the text data with a corresponding numeric value from the dictionary.
[16] Each numeric value from the dictionary may be associated with a tag from natural language processing.
[17] The tag may represent the grammatical function of that phrase in the sentence.
[18] The tag may comprise multiple labels.
[19] The multiple labels may be one-hot encoded.
[20] It is a technical advantage that one-hot encoding allows binary classifiers, which leads to computationally more efficient calculations.
[21] The encryption engine may use homomorphic encryption.
[22] The homomorphic encryption may be Paillier encryption.
[23] The aggregation device may be remote from the source device and the learning device.
[24] It is a technical advantage that having the aggregation device remote from the source device adds a further layer of protection of the text data from disclosure.
[25] The numeric values may represent features and labels of learning samples, the encryption engine may encrypt the features and labels to create encrypted features and encrypted labels and aggregation may comprise combining the encrypted features based on the encrypted labels.
[26] Aggregation may comprise adding the features multiplied by the labels.
[27] Adding features multiplied by the labels may comprise adding
homomorphically encrypted features multiplied by homomorphically encrypted labels.
[28] Aggregation may comprise combining a random selection of samples.
[29] Aggregation may comprise calculating block rados.
[30] Learning on the aggregated values may comprise determining learning parameter values indicative of a relationship between the features and labels of the learning samples and the method may further comprise applying the learning parameter values to current features to determine an estimated label for the current features.
[31] The system may further comprise an output engine to automatically suggest actions based on the estimated label.
[32] It is a technical advantage that current features can be received and actions suggested automatically. In essence, this incorporates the correlations extracted from the learning samples into the current decision while keeping the learning samples confidential. This way, a significantly larger number of learning samples can be used, which makes the suggested actions more accurate.
[33] The learning device may be configured to calculate a rado loss to learn from the aggregated values.
[34] The operations that maintain the aggregated numeric values concealed from the learning device may comprise one or more of:
Secure Outer Product;
Secure Inner Product;
Secure Elementwise Product;
Secure Matrix Multiply;
Secure Matrix Inverse;
Secure Rado Solver;
Classifier for Secure Text with Central Coordinator;
Local Classify for Secure Text; and
Local Inner Product.
[35] A method for automatic text analysis comprises:
transforming text data representing text that comprises multiple text segments to numeric values that represent the multiple text segments;
encrypting the numeric values to create encrypted numeric values that represent the multiple text segments in encrypted form;
aggregating the encrypted numeric values to create aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form; and
providing an analysis of the text data by performing a learning method, remotely from the transforming, on the aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form using operations that maintain the aggregated encrypted numeric values and thereby the text data concealed from the learning device.
[36] Software, when installed on a computer, causes the computer to perform the above method.
[37] A method for encrypting text data representing text that comprises multiple text segments comprises:
transforming the text data to numeric values that represent the multiple text segments, the numeric values forming multiple samples, each sample comprising one or more features and a label; and
encrypting the numeric values forming the multiple samples to obtain encrypted samples that represent the multiple text segments in encrypted form.
[38] A method for aggregating data comprises:
receiving multiple encrypted samples that represent multiple text segments in encrypted form, each of the multiple encrypted samples comprising one or more encrypted numeric features and an encrypted numeric label;
creating aggregated samples by combining the encrypted numeric features based on the encrypted numeric label for each encrypted sample into aggregated encrypted samples that represent the multiple text segments in aggregated encrypted form.
[39] A method for machine learning on data comprises:
receiving aggregated samples that represent multiple text segments of text data in aggregated encrypted form, the aggregated samples comprising combined encrypted numeric features; and
learning based on the aggregated samples using secure mathematical operations to provide an analysis of the text data by performing a learning method on
the aggregated samples that represent the multiple text segments in aggregated encrypted form using operations that maintain the aggregated encrypted samples and thereby the text data concealed from a learning device performing the method.
[40] Optional features described of any aspect of method, computer readable medium or computer system, where appropriate, similarly apply to the other aspects also described here.
Brief Description of Drawings
[41] An example will now be described with reference to:
[42] Fig. 1 illustrates a system for automatic text analysis.
[43] Fig. 2 illustrates a text processing pipeline. It is noted that the pipeline transforms unstructured text (left) into structured numeric features (right).
[44] Fig. 3 shows the output of a text processing pipeline (using CoreNLP). With a full application of CoreNLP, parts of speech are labelled, entities are labelled, coreferences are labelled and then dependencies between text elements (a label and end-point) are formed.
[45] Fig. 4 illustrates a method for automatic text analysis.
[46] Fig. 5 illustrates a protocol for Secure Outer Product.
[47] Fig. 6 illustrates a protocol for Secure Inner Product.
[48] Fig. 7 illustrates a protocol for Secure Elementwise Product. This protocol extends [FRANZ11, Protocol 2, pp. 39] to elementwise products on vectors. A similar algorithm is given in [CLIFTON02].
[49] Fig. 8 illustrates a protocol for Secure Matrix Multiply.
[50] Fig. 9a illustrates a set of samples for the case of Vertical Partition of the samples.
[51] Fig. 9b illustrates a set of samples for the General case.
[52] Fig. 10 illustrates a two-party scenario.
[53] Fig. 11 illustrates a protocol for calculating an encrypted Rademacher.
[54] Fig. 12 illustrates a protocol for Secure Matrix Inverse.
[55] Fig. 13 illustrates a Secure Rado Solver protocol.
[56] Fig. 14 shows an algorithm Classifier for Secure Text with Central Coordinator.
[57] Fig. 15 illustrates a communication architecture for multiple peers and a common coordinator.
[58] Fig. 16 shows a protocol for Local Classify for Secure Text.
[59] Fig. 17 shows a protocol for Local Inner Product.
[60] Fig. 18 shows a computer network for automatic text analysis.
Description of Embodiments
[61] Confidential text corpora exist in many forms, but most do not allow arbitrary sharing. This disclosure provides systems and methods to use such private corpora using privacy preserving text analytics. For example, an important application of machine learning is the determination of correlations between doctors' observations (phenotypes) and diagnosed diseases. Based on such correlations, diagnosis of future
patients would be more accurate leading to improved healthcare outcomes. Since there is often a large number of variables (i.e. different phenotypes), a large number of samples (i.e. number of patients) should be used in order to have statistically significant correlations. This large number of patients is often greater than the number of patients within a single organisation, such as a hospital provider. However, confidentiality regimes make it difficult to share the patient data to build a larger dataset.
[62] The current disclosure provides a solution that allows the combined learning without jeopardising the confidentiality of the patient data. In particular, the disclosed solution transforms the text data representing text that comprises multiple text segments into numeric values that represent the multiple text segments and encrypts and aggregates these values to create learning samples. This makes it practically impossible to determine the patient data from the learning samples but allows the determination of correlations between phenotypes and diseases based on the encrypted, aggregated learning samples. Numeric in this context means that the values only comprise digits from 0-9 and can be interpreted as a number. For example, "456" is not interpreted as a string of '4', '5', '6' but as the number four hundred and fifty-six. In this sense, the property of being numeric allows calculations on the values, such as addition and multiplication of different numeric values, instead of string operations, such as concatenation, sub-strings and searching.
[63] A text processing application is disclosed using appropriate privacy preservation techniques (including homomorphic encryption, Rademacher operators and secure computation). This disclosure sets out the preliminary materials from Rademacher operators for binary classifiers, and then constructs basic text processing approaches to match those binary classifiers.
[64] This disclosure provides encryption techniques to text which allow learning without viewing the raw data, thereby applying machine learning without the need to share (or read) text data. This is applicable to private stores of text, where multiple participants wish to
1. Learn from private text corpora— without sharing the corpora— and/or
2. Classify private text— without allowing the classifier to have access to the plain text.
[65] In one example, the systems and methods described herein are applied to text in health. Although open sharing is generally accepted as a good principle in health, privacy concerns may overwhelm implied scientific benefit. There is a need to address privacy while supporting research-use (as well as non-research use) of health data. To support confidentiality, the British Medical Journal recommends not publishing verbatim responses or transcriptions of clinical discussions, which is a type of data that text mining systems can use.
Text analysis system
[66] Fig. 1 illustrates a system for automatic text analysis 100. System 100 will first be described in general terms while the mathematical details are described further below. System 100 comprises a source device 110 connected to an aggregation device 120, which in turn is connected to a learning device 130. Source device 110 comprises an input port 111 to receive text data, a text processor 112 to transform the text data to numeric values, and an encryption engine 113 to encrypt the numeric values. Input port 111 may receive text data from a graphical user interface, which may be displayed on a computer screen, such as a tablet computer. The user interface may be part of a medical record keeping system in a hospital, for example. The text processor 112 transforms the text data to numeric values. For example, the text processor 112 performs a bag of words approach and assigns a numeric value to each word in a dictionary. In other examples, text processor 112 is a language processor and uses language grammar to transform the text into numeric values. Encryption engine 113 encrypts the numeric values and may perform homomorphic encryption to encrypt the numerical values.
[67] Aggregation device 120 aggregates the encrypted data. In one example, the encrypted data comprises learning samples including for each sample multiple
encrypted features and a label. Aggregating the encrypted data may then comprise adding the features based on the label as described below.
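As a plain-number illustration of this kind of label-dependent aggregation, the sketch below forms a Rademacher observation (rado) in the clear, following the usual form from the rado literature referenced in this document: random signs σ select a subset of samples (those whose sign agrees with the label), and the selected feature vectors are summed into one aggregate. The toy feature vectors and labels are illustrative assumptions; in the system described here the same additions would be performed on homomorphically encrypted values.

```python
# Clear-text sketch of a Rademacher observation (rado):
#   rado = 1/2 * sum_i (sigma_i + y_i) * x_i
# A sample contributes only when its random sign sigma_i equals its label y_i,
# so the aggregate mixes a random selection of samples into a single vector.
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[1.0, 0.0, 1.0],   # toy feature vectors, e.g. 1-hot coded text features
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
y = np.array([1, -1, 1, -1])     # labels in {-1, +1}

sigma = rng.choice([-1, 1], size=len(y))             # random Rademacher signs
rado = 0.5 * ((sigma + y)[:, None] * X).sum(axis=0)  # aggregate of the selected samples
print(sigma, rado)
```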
[68] The learning device 130 is remote from the source device 110, which means that the learning device 130 has no access to the text data. For example, learning device may be on a separate server with separate administrator rights such that an
administrator of a first server hosting the source device 110 is not an administrator of a second server hosting the learning device. The learning device 130 learns on the aggregated values using secure mathematical operations. Secure mathematical operations in this sense are operations that use input data in a way that it is not possible to discern the underlying original data from the input data, from any intermediate results or calculations or from the output data. In other words, the secure mathematical operations maintain the aggregated numeric values concealed from the learning device 130. For example, the learning device 130 performs addition and multiplication of homomorphically encrypted numerical values.
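As an illustration of such operations on homomorphically encrypted values, the sketch below implements a toy Paillier scheme with small hard-coded primes (far too small to be secure, and purely illustrative): two ciphertexts can be added in the encrypted domain, and a ciphertext can be multiplied by an unencrypted scalar, which are the operations relied on for aggregation and learning.

```python
# Toy Paillier cryptosystem (insecure toy parameters, for illustration only).
# Homomorphic properties demonstrated below:
#   Enc(a) * Enc(b) mod n^2   decrypts to a + b   (addition of ciphertexts)
#   Enc(a) ** k    mod n^2    decrypts to k * a   (multiplication by a clear scalar)
import math
import secrets

p, q = 293, 433                  # toy primes; a real system uses primes of ~1024 bits
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)             # valid because g = n + 1

def encrypt(m):
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:   # r must be coprime to n
        r = secrets.randbelow(n - 1) + 1
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

def add(c1, c2):                 # homomorphic addition of two ciphertexts
    return (c1 * c2) % n2

def scalar_mul(c, k):            # homomorphic multiplication by an unencrypted scalar
    return pow(c, k, n2)

ca, cb = encrypt(40), encrypt(2)
print(decrypt(add(ca, cb)))        # 42
print(decrypt(scalar_mul(ca, 3)))  # 120
```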
[69] Fig. 4 illustrates a method 400 for automatic text analysis. The method commences by transforming 401 text data to numeric values. Next, the numeric values are encrypted 402 to create encrypted numeric values. The encrypted numeric values are then aggregated 403 to create aggregated values. Finally, remotely from the transforming, learning 404 can be performed on the aggregated values using secure mathematical operations, that is, the numeric values remain concealed from the learning device 130. It is noted here that the term 'learning' is used herein in the sense of machine learning. This means the calculation and storage of values that represent a relationship or correlation between features and labels (such as phenotypes and diseases). This is useful as the learning outcome, that is the relationship or correlation, can be applied to new data to provide an estimated label. In one example, the learning calculates linear weighting factors.
[70] In some examples disclosed herein, the transformation 401 and encryption 402 is performed on a first device, the aggregation 403 is performed on a second device and the learning 404 is performed on a third device. In those examples, the first, second
and third device are remote from each other in the sense that each device can only access data that is stored on that device or associated with that device. For example, the data may be stored on cloud storage but access is controlled exclusively by one of the first, second or third devices. In other examples, however, the steps 401, 402, 403 and 404 are distributed differently across multiple devices.
[71] Fig. 4 is to be understood as a blueprint for the software program and may be implemented step-by-step, such that each step in Fig. 4 is represented by a function in a programming language, such as C++ or Java. The resulting source code is then compiled and stored as computer executable instructions on program memory.
Background
[72] The following description provides relevant aspects from homomorphic encryption, obscured computation and irreversible aggregation. One example uses the assumption that all participants are "honest-but-curious" according to the following definition:
[73] "The honest-but-curious (HBC) adversary is a legitimate participant in a communication protocol who will not deviate from the defined protocol but will attempt to learn all possible information from legitimately received messages. "
[74] The data processing can be pushed to peers, which, in some examples, maintain control of private data. The need for "trusted" intermediaries is limited, and where such intermediaries are used, the information they may receive is restricted by aggregation and secure computing.
Text Processing
[75] Consider a text-processing pipeline 200 as shown in Fig. 2, which is implemented in text processor 112. In one example, pipeline 200 can be characterised as a linear so-called "1-best" pipeline as outlined in [JOHNSON14]. This could be extended to parallel, iterative, or network approaches. Typical natural language processing packages, such as MALLET [MCCALLUM02], Stanford CoreNLP [MANNING14], OpenNLP [MORTON05] and NLTK [BIRD09], produce human-readable output in the form of delimited text or XML files. However, the text files represent coded (dictionary) categorical numerical variables.
[76] In one example, a (large) number of categorical variables is presumed and 'text' features are coded via a lookup table, such as a dictionary. In this sense, each dictionary entry is associated with a numeric value and transformation comprises replacing a word from the text data with a corresponding numeric value from the dictionary.
[77] A particular example of text processing pipeline output is outlined in Fig. 3. One observation is that a text processing pipeline maps a document into a series of numeric-valued features, and that these numeric features are not independent.
[78] Here "documents" may be seen as words, sentences, paragraphs, text-segments, whole-documents or whatever other unit of measure is used in the text pipeline.
[79] In training, the text processor 112 implementing pipeline 200 also uses labelled text segments, which may be document-, sentence- or word-labels (e.g. for sentiment analysis) or some combination of text fragments (such as used by brat [STENETORP12]). In each case, text processor 112 may represent the features as numeric labels, where the "tags" are converted into a dictionary, and a series of numeric values. In this sense, each numeric value from the dictionary is associated with a tag from the natural language processing, and the tag represents the grammatical function in the sentence of that phrase, such as subject, predicate or object. In other words, the numeric values represent the multiple text segments in the text data.
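As an illustration, the following sketch assumes a small hypothetical word dictionary and tag dictionary; it only shows how tagged text segments can be reduced to pairs of numeric values, not the output format of any specific toolkit.

```python
# Illustrative shared dictionaries; both would be distributed to all participants.
word_dict = {"the": 0, "patient": 1, "has": 2, "fever": 3}
tag_dict = {"subject": 0, "predicate": 1, "object": 2, "other": 3}

# A hypothetical tagged sentence: each (word, tag) pair becomes a pair of numeric values.
tagged_sentence = [("the", "other"), ("patient", "subject"),
                   ("has", "predicate"), ("fever", "object")]
numeric = [(word_dict[w], tag_dict[t]) for w, t in tagged_sentence]
print(numeric)   # [(0, 3), (1, 0), (2, 1), (3, 2)]
```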
[80] For this work, processor 112 may use binary values, such as might result from a "1-hot" encoding. These become observations, and also labels if the aim is learning on the text.
Encryption
[81] At this point, encryption engine 113 can consider a text document as a series of numeric features and uses homomorphic encryption. Encryption engine 113 may directly apply homomorphic cryptography to the numeric (key-encoded text) data to create encrypted numeric values. The encrypted values still represent the multiple text segments but now in encrypted form. However, the numeric data (observations and labels) are often highly structured. For example, it is unlikely that there will be a sequence of noun-phrases, or named entities (without any other text) in a text block. Co-references and dependencies introduce Markov dependencies in the feature vectors. Labels and feature selections (such as "true iff previous word was noun") add additional structure to the observation vector. All these factors may make numerically encoded text susceptible to frequency attacks. Worse, since the dictionary of key-values is shared in clear text to all participants— to allow uniform encoding of text elements to common key-value pairs— the underlying structure of the document may be readily inferred by matching observations and labels to dictionary elements.
[82] Encryption engine 113 may use the approach of [SHANNON49] to improve independent secret systems by concatenating them. In this case, encryption engine firstly encrypts the numeric features and labels (using a Paillier homomorphic encryption system [PAILLIER99, DAMGARD01]). This makes direct interpretation difficult. Second, encryption engine 113 passes the encrypted values to aggregation device 120 to address the data dependencies by applying irreversible aggregation to the numeric data so as to hide many of the implied dependencies between observations and to create aggregated encrypted numeric values. These aggregated values still represent the multiple text segments but in aggregated encrypted form. Finally, learning device 130 wraps the learning process in a secure learning approach, to further reduce the capacity of an inquisitive user discovering the underlying labels. This reflects the idea
that data dependencies can be accounted for in training, validation, and testing of machine learning methods in order to produce reliable performance estimates. This basically extends Fig. 2 to include the encryption shown in Fig. 1. The pre-processing pipeline of Fig. 2 is encapsulated in the text processor 112.
Partial Homomorphic Encryption
[83] The Paillier encryption scheme (and its later generalisations) is a public-private key encryption scheme. In this disclosure, an integer $x$ encrypted under the scheme is written $((x))$, and the decrypt operation recovers $x$ from $((x))$. These are public key operations: users can encrypt data (and perform computations on ciphertexts) using a common public key; however, only the user with the corresponding private key can extract the clear data.
[84] The main operations for Paillier homomorphic encryption are the operators $\oplus$ and $\otimes$. For two integers $x_1, x_2 < n = pq$, where $n$ is a constant for the particular encryption and $p, q$ are two large primes,

$$((x_1)) \oplus ((x_2)) = ((x_1 + x_2)) \qquad (1)$$

and

$$a \otimes ((x_2)) = ((a \cdot x_2)) \qquad (2)$$
[85] Aggregation device 120 can sum encrypted values in the encrypted domain (and subsequently decrypt the result) using (1), and can multiply encrypted values by unencrypted scalars using (2). The relation (2) does not apply when $a$ is encrypted. For more advanced operations (such as multiplying two encrypted values) the devices can use secure computation.
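The following toy sketch illustrates the two homomorphic properties (1) and (2). It uses deliberately small primes and a simplified key generation for illustration only; it is not the encryption engine 113 of the disclosure, and a real deployment would use a vetted cryptographic library with 2048-bit or larger moduli.

```python
import random
from math import gcd

def generate_keys(p=293, q=433):
    n = p * q
    g = n + 1                                   # standard simplified generator
    lam = (p - 1) * (q - 1)
    mu = pow(lam, -1, n)                        # modular inverse of lambda mod n
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    n, g = pub
    n_sq = n * n
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m % n, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(priv, c):
    lam, mu, n = priv
    x = pow(c, lam, n * n)
    return (((x - 1) // n) * mu) % n

def h_add(pub, c1, c2):                         # equation (1): ((x1)) (+) ((x2))
    n, _ = pub
    return (c1 * c2) % (n * n)

def h_scale(pub, a, c):                         # equation (2): a (x) ((x2))
    n, _ = pub
    return pow(c, a % n, n * n)

pub, priv = generate_keys()
c1, c2 = encrypt(pub, 7), encrypt(pub, 15)
assert decrypt(priv, h_add(pub, c1, c2)) == 22  # 7 + 15 recovered after decryption
assert decrypt(priv, h_scale(pub, 3, c2)) == 45 # 3 * 15 recovered after decryption
```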
Secure Computations
[86] This disclosure provides several protocols for secure computation between two parties A and B. [CLIFTON02] provides mechanisms for multiple parties (i.e. more than two). It is assumed that A operates on encrypted data, and B has the private key (and can decrypt data). Neither party should be able to discern the numerical values. These protocols comprise three steps:
1. an obfuscation step by the public key holder A;
2. a transfer to the private key holder B, who decrypts, performs the calculation on clear data and returns an encrypted result to A; as the result is obfuscated, B learns nothing from this operation, even though it is performed on clear data; and
3. A then removes the original obfuscation, as sketched below.
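The following is a hedged sketch of this three-step pattern (obfuscate, transfer, de-obfuscate) applied to multiplying two encrypted values. It reuses the toy Paillier helpers from the previous sketch and is only one possible arrangement of the blinding, not the exact protocol of Figs. 5 to 8.

```python
import random

def secure_multiply(pub, priv, cx, cy):
    n, _ = pub
    # Step 1 (party A): blind both operands with random offsets rx and ry.
    rx, ry = random.randrange(n), random.randrange(n)
    bx = h_add(pub, cx, encrypt(pub, rx))               # ((x + rx))
    by = h_add(pub, cy, encrypt(pub, ry))               # ((y + ry))
    # Step 2 (party B): decrypt the blinded values, multiply in clear, re-encrypt.
    prod = (decrypt(priv, bx) * decrypt(priv, by)) % n
    c_prod = encrypt(pub, prod)                          # (((x + rx) * (y + ry)))
    # Step 3 (party A): remove the blinding homomorphically:
    # ((x*y)) = (((x+rx)(y+ry))) + ((-ry*x)) + ((-rx*y)) + ((-rx*ry))
    c = h_add(pub, c_prod, h_scale(pub, n - ry, cx))
    c = h_add(pub, c, h_scale(pub, n - rx, cy))
    return h_add(pub, c, encrypt(pub, -(rx * ry) % n))

cxy = secure_multiply(pub, priv, encrypt(pub, 12), encrypt(pub, 9))
assert decrypt(priv, cxy) == 108                         # 12 * 9, modulo n
```

Because B only ever sees the blinded sums, the decrypted intermediate values reveal nothing about x or y, which is the reason the calculation can be delegated to the private key holder.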
[87] Fig. 5 illustrates a protocol for Secure Outer Product 500 and Fig. 6 illustrates a protocol for Secure Inner Product 600, both of which extend the protocol for Secure Elementwise Product 700 shown in Fig. 7 to form outer- and inner-products, building up a basic linear algebra toolkit. This can be extended to secure matrix multiplication. We re-use the Secure Inner Product protocol 600 as the basis for Secure Matrix Multiply protocol 800 shown in Fig. 8. In one example, the values are encrypted by a single party B (encryption engine 113 on source device) and calculated by another party A (aggregation device). In other words, the source device 110 is remote from the aggregation device 120.
[88] It is noted that a number of protocols and algorithms are disclosed herein and these are to be understood as a description of steps performed by a processor of a computer system to calculate the respective outputs, such as software code, or by dedicated circuits, or by cloud computing instances.
Irreversible Aggregation
[89] The work of [NOCK15] and [PATRINI16] may be used to present relevant parts of the aggregation techniques. Firstly, the work of [NOCK15] presents learners on
specially aggregated datasets, where the data set could be in a single location. This is extended to multiple locations below.
Single (complete) data set
[90] In one example, the data set is a single (coherent) source, i.e. all data is held by a single organisation.
[91] Definition 1 ((numeric) Supervised Learning Space) Given a set of $m > 0$ examples $S = \{(x_i, y_i), i \in \{1, 2, \ldots, m\}\}$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^{1 \times d}$ are observations, $\mathcal{X}$ is the domain, and $y_i \in \{-1, 1\}$ are labels. We are concerned with a (binary) linear classifier $\theta \in \Theta$ for fixed $\Theta \subseteq \mathbb{R}^{1 \times d}$. The label of an observation $x$ is given by:

$$\mathrm{label}(x) = \mathrm{sign}\left(\theta^\top x\right) \in \{-1, 1\}$$
[92] The irreversible aggregation is based on rados, as defined below.

[93] Definition 2 (Rado) Let $\Sigma = \{-1, 1\}^m$. Then, given a set $S$ and any $\sigma \in \Sigma$ with $\sigma = [\sigma_1, \ldots, \sigma_m]^\top$, the Rademacher observation ("rado" herein) $\pi_\sigma$ with signature $\sigma$ is

$$\pi_\sigma = \frac{1}{2} \sum_{i=1}^{m} (\sigma_i + y_i)\, x_i \qquad (3)$$
[94] It is noted that the signature elements $\sigma_i$ are $-1$ or $+1$. As a result, adding the signature element to the label (which is also $-1$ or $+1$) effectively results in a random selection of samples, which are then combined in the sum of equation (3).
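The following sketch computes clear-text rados with NumPy to illustrate equation (3); the data are random placeholders.

```python
import numpy as np

def rado(X, y, sigma):
    """X: (m, d) features; y, sigma: (m,) vectors with entries in {-1, +1}."""
    weights = (sigma + y) / 2.0        # y_i where sigma_i == y_i, and 0 otherwise
    return weights @ X                 # equation (3): one d-dimensional aggregate

rng = np.random.default_rng(0)
m, d = 8, 4
X = rng.normal(size=(m, d))
y = rng.choice([-1, 1], size=m)
sigma = rng.choice([-1, 1], size=m)
print(rado(X, y, sigma))               # irreversible aggregate over a random subset
```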
[95] Further detail on the calculation of rados by the aggregation of samples, including the calculation of block rados, is disclosed in PCT/AU2016/050088, which is incorporated herein by reference.
Multiple data sets
[96] Figs. 9a and 9b illustrate a data schematic with a first peer 901, a second peer 902 and a third peer 903. Some features (dotted; referenced as 911, 912 and 913, respectively) are common to each peer; these are denoted $J \subseteq \mathcal{X}$. One of the common features is the label (or 'class'). Other (non-shared) features are split between the peers. The total sample is the bold outline 920. Fig. 9a shows the Vertical Partition case (VP), where all peers see different observations of the same examples but do not know who is who in the data set. Fig. 9b shows the General case (G), where it is not known whether one example viewed by a peer exists in other peers' datasets. Missing data is marked as $\perp$.
[97] In the following example, different peers have different sub-components of the full data set . In this example, entities are not linked: different text corpora are held by different parties, and no entity resolution is performed.
[98] One element is a basic block (BB) rado.
Definition 3 (BB-Rado) Consider $z' \in U \subseteq V$. Let $\mathrm{lift}(z')$ concatenate zeros to give $z = [z'\ 0]$ such that $z \in V$. For any $s \in \Sigma$, labels $y \in \{-1, 1\}$ and $\alpha \in \mathbb{R}$, the $\alpha$-basic block rado for $(s, y)$ is
Encrypted Rados
[99] It is noted that the features over which the rados operate are encrypted. One example is a two-party scenario as shown in Fig. 10. The encryption occurs after the text documents have been securely processed at the private location of B (source device 110). Encryption engine 113 of B then encrypts the features, and these are then aggregated by aggregation device 120. The aggregation occurs blind to B, that is, remotely from source device 110, and may be performed by an honest-but-curious intermediary $T_B$ (aggregation device 120). The rados $\pi_\sigma$ are generated privately at $T_B$ 120. Once generated, the rados can be used by other honest-but-curious external parties, such as learning device 130.
[100] In this section, secure mathematical operations are applied and the source device 110 and the aggregation device 120 are referred to as $\{B, X\}$, where B is the private key holder and X is an intermediary. X can "see" encrypted features $((x_i))$ and encrypted labels $((y_i))$, and knows the choices of rado vectors (i.e. X knows the values of $\sigma$). It would be possible to operate with $\sigma$ also encrypted.
[101] Fig. 10 outlines the encryption steps for this section. It is possible to extend the rados to secure rado learners. We re-write (3) with the encrypted values made explicit; corresponding secure mathematical operations are also shown. We use the notation $\bigoplus_{i=1}^{n} a_i$ to denote a series of homomorphic addition operations, i.e. $\bigoplus_{i=1}^{n} a_i = a_1 \oplus a_2 \oplus \cdots \oplus a_n$. We use $:=$ as an abuse of notation to denote "has the meaning of" rather than equality.

[102] Equation (5) shows the formation of the (encrypted) rado. The additions and (scalar) multiplications must all be translated to the appropriate homomorphic addition and multiplication operations. The output is an encoded Rademacher observation, based on any numerical field. We will refer to this rado as $((\pi_\sigma))$. Fig. 11 shows a protocol for calculating an encrypted Rademacher observation.
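As an illustrative assumption of how the aggregator could form $((\pi_\sigma))$ with a clear signature but encrypted features and labels, the following sketch reuses the toy Paillier helpers (encrypt, h_add, h_scale) and the secure_multiply sketch above; it is not the protocol of Fig. 11 itself. Negative quantities are represented modulo $n$, so decrypted results require a signed decode.

```python
def encrypted_rado(pub, priv, enc_X, enc_y, sigma):
    """enc_X: m x d list of feature ciphertexts; enc_y: m encrypted labels (+1 or n-1);
    sigma: m clear signature values in {-1, +1}."""
    n, _ = pub
    d = len(enc_X[0])
    acc = [encrypt(pub, 0) for _ in range(d)]               # ((0)) per dimension
    for enc_xi, enc_yi, s in zip(enc_X, enc_y, sigma):
        weight = h_add(pub, enc_yi, encrypt(pub, s % n))     # ((sigma_i + y_i))
        for j in range(d):
            term = secure_multiply(pub, priv, weight, enc_xi[j])
            acc[j] = h_add(pub, acc[j], term)                # running homomorphic sum
    inv2 = pow(2, -1, n)                                     # the 1/2 factor of (3)
    return [h_scale(pub, inv2, c) for c in acc]              # ((pi_sigma)), per dimension
```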
Multiparty case
[103] The case for multiple parties uses the lift(·) function. This function appends zeros onto a vector, and thus (in the encrypted domain) may be represented as appending an encrypted scalar (zero) to the encrypted vector. As above, (16) can be rewritten in the encrypted domain.
[104] In one example, the calculation of rados is based on a random selection of samples, that is, aggregation comprises combining a random selection of samples.
Learning using encrypted rados
Unencrypted single-party case
[105] The learner for rados (in the unencrypted case) is given by [PATRINI16]. Learning device 130 may use the equivalent (exponential) learner for rados.

[106] For any $\theta$ and $S$, and a set of rados generated from signatures $\Sigma^* \subseteq \Sigma$, minimising the exponential rado loss

$$F_{\exp}(\Sigma^*, \theta) = \log\left(\frac{1}{|\Sigma^*|}\sum_{\sigma \in \Sigma^*} \exp\left(-\theta^\top \pi_\sigma\right)\right)$$

is equivalent to minimising the standard logistic loss $F_{\log}(S, \theta)$.

[107] The supervised learning (optimisation) is written as

Problem 1 (Minimise Exponential Loss) The optimal classifier $\theta^*$ is given by solving

$$\theta^* = \arg\min_{\theta} J(\theta), \qquad J(\theta) = \log\left(\frac{1}{|\Sigma^*|}\sum_{\sigma \in \Sigma^*} \exp\left(-\theta^\top \pi_\sigma\right)\right) + \theta^\top \Gamma \theta,$$

and $\theta^\top \Gamma \theta$ is a regularising term.
[108] Further detail about learning on rados is also provided in
PCT/AU2015/050653, which is incorporated herein by reference.
Secure single-party case
[109] In one example, encryption engine 113 encrypts the rados as described above and the learning outcomes are to be kept confidential. While this example is not concerned with the multi-source case, the following description provides mechanics to allow transfer to the scenario where multiple un-shared data sources exist.
[110] The regulariser in the above equation is an application of a protocol for Local Inner Product as shown in Fig. 17. However, in some examples, learning device 130 does not evaluate the above equation but rather performs a gradient descent to solve Problem 1 as explained below.
Gradient descent
[111] With reference to Problem 1 above, it is noted that the gradient of $J(\theta)$ with respect to $\theta_j$ is, for symmetric $\Gamma$,

$$\frac{\partial J}{\partial \theta_j} = -\frac{\sum_{\sigma \in \Sigma^*} \pi_{\sigma,j}\,\exp\left(-\theta^\top \pi_\sigma\right)}{\sum_{\sigma \in \Sigma^*} \exp\left(-\theta^\top \pi_\sigma\right)} + 2\left(\Gamma\theta\right)_j$$

[112] It is further noted that each $\pi_\sigma \in V$. Using the above notation, the gradient can be evaluated with the homomorphic addition and multiplication operations introduced above.
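In the clear, the gradient descent could look like the following NumPy sketch; the learning rate, step count and placeholder data are illustrative assumptions, and the secure variant would replace these vector operations with their homomorphic counterparts.

```python
import numpy as np

def rado_loss_grad(theta, rados, gamma):
    """rados: (N, d) stacked rados; gamma: (d, d) symmetric positive definite."""
    z = rados @ theta                          # theta^T pi_sigma for each rado
    w = np.exp(-(z - z.min()))                 # stabilised exp(-theta^T pi_sigma)
    grad_log = -(w @ rados) / w.sum()          # gradient of the log-sum-exp term
    return grad_log + 2.0 * (gamma @ theta)    # plus gradient of theta^T Gamma theta

def fit(rados, gamma, lr=0.1, steps=500):
    theta = np.zeros(rados.shape[1])
    for _ in range(steps):
        theta -= lr * rado_loss_grad(theta, rados, gamma)
    return theta
```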
Unencrypted, multi-party case
[113] The mean square loss can be used over $V$, i.e. on the limited sample sets, by using a modified mean loss as given in Definition 4.
[114] Definition 4 (M-Rado loss) The M-loss for the classifier $\theta$ is given by equation (6), where the expectation $\mathbb{E}_V$ and the variance $\mathbb{V}_V$ are computed with respect to the uniform sampling of $\sigma$ in $V$, and $\Gamma$ is positive definite. The positive definite matrix $\Gamma$ can be set to a weighted diagonal matrix in which $\varepsilon < 1$ accounts for (lack of) confidence in certain columns of $\pi$.
[115] Minimizing the Ridge regularized square loss over examples is equivalent to minimizing a regularized version of the M-loss, over the complete set of all rados.
[116] The optimal classifier $\theta^*$ is given by the simple closed-form expression:

$$\theta^* = \left(BB^\top + \mathrm{dim}_c(B)\cdot\Gamma\right)^{-1} B\mathbf{1} \qquad (8)$$

where $B$ is the matrix of column-wise stacked rados, $\mathbf{1}$ is the all-ones vector and $\mathrm{dim}_c(B)$ is the number of columns of $B$.
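A clear-text sketch of the closed form (8) follows; $B$ and $\Gamma$ are illustrative placeholders, and a linear solve is used in place of an explicit inverse.

```python
import numpy as np

def closed_form_theta(B, gamma):
    """B: (d, N) rados stacked column-wise; gamma: (d, d) positive definite."""
    n_cols = B.shape[1]
    lhs = B @ B.T + n_cols * gamma        # BB^T + dim_c(B) * Gamma
    rhs = B @ np.ones(n_cols)             # B times the all-ones vector
    return np.linalg.solve(lhs, rhs)      # solve instead of forming the inverse
```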
[117] To find the inverse $\left(BB^\top + \mathrm{dim}_c(B)\cdot\Gamma\right)^{-1}$ in (8), [HALL11] recommends an iterative (Schur [GUO06]) approach. The proposed method may use around $\log_2(\mathrm{dim}(B))$ or 128 iterations. Faster approaches (with fewer iterations and thus fewer multiplications) may be achieved by higher-order algorithms. An outline and review are given in [RAJAGOPALAN96, SOLEYMANI12].
[118] Fig. 12 illustrates a protocol for Secure Matrix Inverse, which uses an iterative algorithm that multiplies matrices to form an inverse. The Secure Matrix Inverse protocol in Fig. 12 uses a significant number of secure and homomorphic operations per iteration:
• $O(n^2)$ secure multiplies, where each secure multiply uses internal homomorphic multiplications and additions; and
• $O(n^2)$ homomorphic multiplies and $O(n^2)$ homomorphic additions.
[119] Using Secure Matrix Inverse from Fig. 12 and Secure Matrix Multiply from Fig. 8, learning device 130 can solve (8) in the encrypted domain at the central coordinator. In this sense, learning device 130 can perform a Secure Rado Solver protocol as shown in Fig. 13.
Blind addition: De-risking the coordinator
[120] The proposed approach avoids a central coordinator with access to all vectors (and also substantially reduces the scope of the coordinator to reveal data) by returning to Definition 4. Equation (6) can be re-written as equation (9), with the corresponding expectation and variance written as the sums (10) and (11).

[121] The sums in (10) and (11) are over the appropriate rados. However, these rados may be calculated by their peers, so the sums may be broken into per-peer summations, where we consider disjoint sets $V_p$ such that $V_1 \cup V_2 \cup \cdots = V$.
[122] Definition 5 (BB-rado per peer) Consider $P$ peers with $p \in \{1, 2, \ldots, P\}$, where each peer $p$ has distinct rados drawn from $V_p$, and the rados are distinct in $V = V_1 \cup V_2 \cup \cdots \cup V_P$. For each peer $p$ we have an expectation $e_p$, given by equation (12), and a variance

$$v_p = \sum_{\pi_\sigma \in V_p} (\pi_\sigma - b)(\pi_\sigma - b)^\top \qquad (13)$$

Each peer $p$ can calculate $e_p$ and $v_p$ independently.
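A clear-text sketch of these per-peer statistics is given below; the form of the local aggregate $e_p$ (a sum of the peer's rados) and the availability of the global mean $b$ are assumptions made for illustration.

```python
import numpy as np

def peer_statistics(peer_rados, b):
    """peer_rados: (N_p, d) rados held locally by peer p; b: (d,) global mean."""
    e_p = peer_rados.sum(axis=0)          # local aggregate (assumed form of (12))
    centred = peer_rados - b
    v_p = centred.T @ centred             # sum of outer products, equation (13)
    return e_p, v_p
```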
[123] Although the mean $b$ may be calculated centrally, it is preferable to use secure multiparty addition to achieve the same result. This reduces the scope for the coordinator to access (encrypted, aggregated) vectors and instead gives it access only to noisy aggregates of the data.
Combining the elements
[124] The above description provides a toolkit to:
1. transform text to numeric values;
2. encrypt those values;
3. aggregate the encrypted values;
4. learn on the aggregated values using secure mathematical operations.
[125] In one example this may be implemented via securely accredited text processing software installed on source device 110, with the encryption software 113 running on the machine that has access to the raw data, whilst the aggregation software operates on a separate server (aggregation device 120) with no access to the raw data. Additionally, the solution of the optimisation to find $\theta$ is performed at a central coordinator (learning device 130), which is also responsible for generating the public-private keys. Fig. 14 shows an algorithm Classifier for Secure Text with Central Coordinator. This algorithm uses a number of intermediate steps, which are detailed in Figs. 5-8 and 11-13.
[126] The algorithm may incorporate the secure multi-party summation work of [CLIFTON02] to prevent the coordinator from obtaining the rados directly. This adds a third layer of obfuscation to the data (encryption, aggregation, blind addition), which means that at the coordinator (who can decrypt the data) the data remains protected by the aggregation to rados and the blind addition of the vectors.
[127] Fig. 15 illustrates a communication architecture 1500 for multiple peers 1501 and 1502 and a common coordinator 1503. Fig. 15 outlines the protocol from Fig. 14. The coordinator 1503 sends a common dictionary and public key to all peers 1501 and 1502. Each peer has different data components, with some common elements ( 1510, 1511) - see also Figs. 9a and 9b. Some examples may be unique to certain peers. Each peer has encrypted and aggregated its local data (as indicated at 1512 and 1513). The servers 1512 and 1513 correspond to the "intermediary" in Fig. 10. The encryption key is generated by the coordinator 1503. Dashed arrows denote information transfers between participants, whilst solid arrows denote local transformations (at the respective participant). Not all transfers are shown.
[128] Each peer $p$ (1501 and 1502) now classifies a particular observation vector $x$ and could calculate

$$\hat{y} = \mathrm{sign}\left(\theta^\top x\right) \qquad (14)$$

[129] However, each peer only has a subset of the features of $x$. The label is determined only by the sign of a scalar, and hence it is possible to break the inner product $\theta^\top x$ into an inner product of local features and remote features:

$$\hat{y} = \mathrm{sign}\left(\theta_{\mathrm{local}}^\top x_{\mathrm{local}} + \theta_{\mathrm{remote}}^\top x_{\mathrm{remote}}\right) \qquad (15)$$
[130] The local component of (16) may be calculated at peer $p$. If we denote the local classifier result as $a_p$, then we may write the overall classification, equation (19), as the sign of the sum of the local classifier results.

[131] The summation in (19) is the sum of all (local) calculated classifier results on the sub-components of the vector $x$. The result of (19) shows that the remote classification results may be treated as offsets for the local result, i.e. the remote inner products act as corrections to the local result. However, in this example every peer shares linking information about the observation $x$. To avoid this, we replace the summation in (19) with an equivalent rado:
$$\mathrm{sign}\left(a_p + \theta^\top \pi_\sigma\right) \qquad (20)$$
[132] In the homomorphic encrypted case, the local inner product can be calculated by keeping the encrypted classifier vector in its encrypted form and treating the elements of $x$ as unencrypted scalars. The local inner product is written explicitly in the protocol for Local Inner Product as shown in Fig. 17. The inner product with the rado is performed at the coordinator C (learning device 130), using the secure inner product. Finally, the summation may be achieved using multi-party secure addition, as outlined in [CLIFTON02].
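A sketch of this local inner product follows, reusing the toy Paillier helpers above; it assumes non-negative integer feature values small enough for the toy modulus.

```python
def local_inner_product(pub, enc_theta, x):
    """enc_theta: encrypted weights ((theta_j)); x: clear non-negative integers."""
    acc = encrypt(pub, 0)                                     # ((0))
    for c_theta_j, x_j in zip(enc_theta, x):
        acc = h_add(pub, acc, h_scale(pub, x_j, c_theta_j))   # add ((theta_j * x_j))
    return acc                                                # ((theta^T x))
```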
[133] Fig. 16 shows a protocol for Local Classify for Secure Text.
[134] Definition 6 (Matrix Inverse, using only products and sums) cf. [?, eqn. 2.2]. Given a matrix $A$ with inverse $V = A^{-1}$, we can iteratively approximate $V$ as

$$V_{n+1} = V_n\left(3I - A V_n\left[3I - A V_n\right]\right) \qquad (21)$$
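A clear-text NumPy sketch of iteration (21) follows; the scaled-transpose initialisation and the step count are common choices assumed here for illustration.

```python
import numpy as np

def iterative_inverse(A, steps=20):
    n = A.shape[0]
    I = np.eye(n)
    V = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))  # safe starting guess
    for _ in range(steps):
        AV = A @ V
        V = V @ (3 * I - AV @ (3 * I - AV))   # one step of iteration (21), products and sums only
    return V

A = np.array([[4.0, 1.0], [1.0, 3.0]])
print(np.round(A @ iterative_inverse(A), 6))  # approximately the identity matrix
```

Because each step uses only matrix products and sums, it maps directly onto the homomorphic operations available in the Secure Matrix Inverse protocol.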
Single cohort learner
[135] The equivalent (exponential) learner for rados can be used here.
[136] Lemma 2 (Rado Learning) For any $\theta$ and $S$, and a set of rados with signatures $\Sigma^* \subseteq \Sigma$, minimising the loss

$$F_{\exp}(\Sigma^*, \theta) = \log\left(\frac{1}{|\Sigma^*|}\sum_{\sigma \in \Sigma^*} \exp\left(-\theta^\top \pi_\sigma\right)\right) \qquad (22)$$

is equivalent to minimising the standard logistic loss $F_{\log}(S, \theta)$.

[137] The supervised learning (optimisation) is written as

[138] Problem 2 (Minimise Exponential Loss) The optimal classifier $\theta^*$ is given by solving

$$\theta^* = \arg\min_{\theta} J(\theta) \qquad (23)$$

where

$$J(\theta) = \log\left(\frac{1}{|\Sigma^*|}\sum_{\sigma \in \Sigma^*} \exp\left(-\theta^\top \pi_\sigma\right)\right) + \theta^\top\Gamma\theta \qquad (24)$$

and $\theta^\top\Gamma\theta$ is a regularising term.
[139] Note that the gradient of $J(\theta)$ with respect to $\theta_j$ takes the same form as the gradient given for Problem 1 above.
[140] The learner can be developed using secure exponentiation.
[141] Fig. 18 illustrates a computer network 1800 for automatic text analysis including a computer system 1801 implementing coordinator 1503 from Fig. 15. The computer system 1801 comprises a processor 1802 connected to a program memory 1804, a data memory 1806 and a communication port 1808. The program memory 1804 is a non-transitory computer readable medium, such as a hard drive, a solid state disk or CD-ROM. Software, that is, an executable program stored on program memory 1804, causes the processor 1802 to perform the steps performed by learning device 130 and/or coordinator 1503.
[142] The processor 1802 may then store the learning result, such as a learned correlation between feature values and label values, on data store 1806, such as on RAM or a processor register. Processor 1802 may also send the determined correlation via communication port 1808 to a server, such as an automatic evaluation engine. For example, the learned correlations may assist doctors in automatically diagnosing patients.
[143] The processor 1802 may receive data, such as learning samples, from data memory 1806 as well as from the communications port 1808. In one example, the processor 1802 receives learning sample data from aggregation device 120 via communications port 1808, such as by using a Wi-Fi network according to IEEE 802.11. The Wi-Fi network may be a decentralised ad-hoc network, such that no dedicated management infrastructure, such as a router, is required, or a centralised network with a router or access point managing the network.
[144] Although communications port 1808 is shown as a distinct entity, it is to be understood that any kind of data port may be used to receive data, such as a network connection, a memory interface, a pin of the chip package of processor 1802, or logical ports, such as IP sockets or parameters of functions stored on program memory 1804 and executed by processor 1802. These parameters may be stored on data memory 1806 and may be handled by-value or by-reference, that is, as a pointer, in the source code.
[145] The processor 1802 may receive data through all these interfaces, which includes memory access of volatile memory, such as cache or RAM, or non-volatile memory, such as an optical disk drive, hard disk drive, storage server or cloud storage. The computer system 1800 may further be implemented within a cloud computing environment, such as a managed group of interconnected servers hosting a dynamic number of virtual machines.
[146] It is to be understood that throughout this disclosure unless stated otherwise, nodes, edges, graphs, solutions, variables, samples, features, labels and the like refer to
data structures, which are physically stored on data memory 1806 or processed by processor 1802. Further, for the sake of brevity when reference is made to particular variable names, such as "period of time" or "quantity of movement" this is to be understood to refer to values of variables stored as physical data in computer system 1800.
[147] It is noted that the learning on the aggregated values may comprise
determining learning parameter values indicative of a relationship between the features and labels of the learning samples as described herein. These learning parameter values can then be applied to current features to determine an estimated label for the current features. The system may then further comprise an output engine to
automatically suggest actions based on the estimated label. For example, the learning device 130 may distribute the learning parameters, such as machine learning model parameters including linear factors of regression models, back to the source device 110. For example, source device 110 may be used by a doctor to enter patient data and diagnostic data for each patient. Learning device 130 learns the contribution of each observation to particular diseases in the form of regression parameters. This learning is based on learning samples not only from the one source device 110 but from a large number of source devices while maintaining confidentiality as described above. The learning parameters also do not contain patient-confidential information and can therefore be distributed freely back to source device 110. For the next patient, the doctor can enter the observations and the source device 110 automatically proposes a diagnosis or a number of possible diagnoses for the doctor to evaluate. In this sense, source device automatically suggests actions, such as providing information to the patient or applying a certain therapy.
[148] It is noted that the doctor's observations may also be accompanied by genomic data or meta-genetics including habits of the patient, such as smoking etc. The proposed systems and methods can equally be applied to this information. In other examples, the text data is related to other applications, such as social media, email, SMS, credit ratings and others.
[149] It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the above-described embodiments, without departing from the broad general scope of the present disclosure. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
[150] The following references are incorporated herein by reference:
[BIRD09] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. O'Reilly Media, 2009, available: http://www.nltk.org/book_1ed/.
[DAMGARD01] I. Damgard and M. Jurik, "A generalisation, a simplification and some applications of Paillier's probabilistic public-key system," in PKC '01
Proceedings of the 4th International Workshop on Practice and Theory in Public Key Cryptography: Public Key Cryptography, 2001, pp. 119-136.
[CLIFTON02] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu, "Tools for privacy preserving distributed data mining" SIGKDD Explorations, vol. 4, no. 2, 2002.
[FRANZ11] M. Franz, "Secure computations on non-integer values," Ph.D. dissertation, Technische Universitat Darmstadt, 2011.
[GUO06] C.-H. Guo and N. J. Higham, "A Schur-Newton method for the matrix p'th root and its inverse," Manchester Institute for Mathematical Sciences, Tech. Rep., Oct. 2006.
[HALL11] R. Hall, S. E. Fienberg, and Y. Nardi, "Secure multiple linear regression based on homomorphic encryption," Journal of Official Statistics, vol. 27, no. 4, pp. 669-691, 2011 (revised 2013).
[JOHNSON14] M. Johnson, "Beyond the 1-best pipeline," Presentation slides, at NICTA-NLP workshop 2014, Sep. 2014.
[MANNING14] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, "The Stanford CoreNLP natural language processing toolkit," in
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55-60.
[MCCALLUM02] A. K. McCallum, "MALLET: A machine learning for language toolkit," 2002, http://mallet.cs.umass.edu.
[MORTON05] T. Morton, J. Kottmann, J. Baldridge, and G. Bierner, "OpenNLP: A Java-based NLP toolkit," 2005, http://opennlp.sourceforge.net.
[NOCK15] R. Nock, G. Patrini, and A. Friedman, "Rademacher observations, private data, and boosting," Journal of Machine Learning Research, vol. 37, 2015.
[PAILLIER99] P. Paillier, "Public-key cryptosystems based on composite degree residuosity classes," in EUROCRYPT. Springer-Verlag, 1999, pp. 223-238.
[PATRINI16] G. Patrini, R. Nock, S. Hardy, and T. Caetano, "Fast learning from distributed datasets without entity matching," Data61, Tech. Rep., Mar. 2016.
[RAJAGOPALAN96] J. Rajagopalan, "An iterative algorithm for inversion of matrices," Ph.D. dissertation, Concordia University, Montreal, Canada, Sep. 1996.
[SHANNON49] C. E. Shannon, "Communication theory of secrecy systems," The Bell System Technical Journal, vol. 28, no. 4, pp. 656-715, May 1949.
[SOLEYMANI12] F. Soleymani, "A rapid numerical algorithm to compute matrix inversion," International Journal of Mathematics and Mathematical Sciences, vol. 2012, 2012.
[STENETORP12] P. Stenetorp, S. Pyysalo, G. Topic, T. Ohta, S. Ananiadou, and J. Tsujii, "brat: a web-based tool for NLP-assisted text annotation," in Proceedings of the Demonstrations Session at EACL, 2012.
Claims
1. A system for automatic text analysis, the system comprising:
a source device comprising
an input port to receive text data representing text that comprises multiple text segments,
a text processor to transform the text data to numeric values that represent the multiple text segments, and
an encryption engine to encrypt the numeric values into encrypted numeric values that represent the multiple text segments in encrypted form;
an aggregation device to aggregate the encrypted numeric values into aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form; and
a learning device remote from the source device to provide an analysis of the text data by performing a learning method on the aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form using operations that maintain the aggregated encrypted numeric values and thereby the text data concealed from the learning device.
2. The system of claim 1 wherein the text processor is a language processor.
3. The system of claim 2, wherein transforming the text to numeric values is based on natural language processing.
4. The system of any one of the preceding claims, wherein transforming the text to numeric values is based on a dictionary.
5. The system of claim 4, wherein the dictionary comprises multiple dictionary entries and each of the multiple dictionary entries is associated with a numeric value and transforming the text to numeric values comprises replacing a word from the text data with a corresponding numeric value from the dictionary.
6. The system of claim 5, wherein each numeric value from the dictionary is associated with a tag from natural language processing.
7. The system of claim 6, wherein the tag represents the grammatical function of that phrase in a sentence.
8. The system of claim 6 or 7, wherein the tag comprises multiple labels.
9. The system of claim 8, wherein the multiple labels are one-hot encoded.
10. The system of any one of the preceding claims, wherein the encryption engine uses homomorphic encryption.
11. The system of claim 10, wherein the homomorphic encryption is Paillier encryption.
12. The system of any one of the preceding claims, wherein the aggregation device is remote from the source device and the learning device.
13. The system of any one of the preceding claims, wherein the numeric values represent features and labels of learning samples, the encryption engine encrypts the features and labels to create encrypted features and encrypted labels and aggregation comprises combining the encrypted features based on the encrypted labels.
14. The system of claim 13, wherein aggregation comprises adding the features multiplied by the labels.
15. The system of claim 14, wherein adding features multiplied by the labels comprises adding homomorphically encrypted features multiplied by homomorphically encrypted labels.
16. The system of any one of claims 13 to 15, wherein aggregation comprises combining a random selection of samples.
17. The system of any one of claims 13 to 16, wherein aggregation comprises calculating block rados.
18. The system of any one of claims 13 to 17, wherein learning on the aggregated values comprises determining learning parameter values indicative of a relationship between the features and labels of the learning samples and the method further comprises:
applying the learning parameter values to current features to determine an estimated label for the current features.
19. The system of claim 18, further comprising an output engine to automatically suggest actions based on the estimated label.
20. The system of any one of the preceding claims, wherein the learning device is configured to calculate a rado loss to learn from the aggregated values.
21. The system of any one of the preceding claims, wherein the operations that maintain the aggregated numeric values concealed from the learning device comprise one or more of:
Secure Outer Product;
Secure Inner Product;
Secure Elementwise Product;
Secure Matrix Multiply;
Secure Matrix Inverse;
Secure Rado Solver;
Classifier for Secure Text with Central Coordinator;
Local Classify for Secure Text; and
Local Inner Product.
22. A method for automatic text analysis, the method comprising:
transforming text data representing text that comprises multiple text segments to numeric values that represent the multiple text segments;
encrypting the numeric values to create encrypted numeric values that represent the multiple text segments in encrypted form;
aggregating the encrypted numeric values to create aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form; and
providing an analysis of the text data by performing a learning method, remotely from the transforming, on the aggregated encrypted numeric values that represent the multiple text segments in aggregated encrypted form using operations that maintain the aggregated encrypted numeric values and thereby the text data concealed from the learning device.
23. Software that, when installed on a computer, causes the computer to perform the method of claim 22.
24. A method for encrypting text data representing text that comprises multiple text segments, the method comprising:
transforming the text data to numeric values that represent the multiple text segments, the numeric values forming multiple samples, each sample comprising one or more features and a label; and
encrypting the numeric values forming the multiple samples to obtain encrypted samples that represent the multiple text segments in encrypted form.
25. A method for aggregating data, the method comprising:
receiving multiple encrypted samples that represent multiple text segments in encrypted form, each of the multiple encrypted samples comprising one or more encrypted numeric features and an encrypted numeric label;
creating aggregated samples by combining the encrypted numeric features based on the encrypted numeric label for each encrypted sample into aggregated encrypted samples that represent the multiple text segments in aggregated encrypted form.
26. A method for machine learning on data, the method comprising:
receiving aggregated samples that represent multiple text segments of text data in aggregated encrypted form, the aggregated samples comprising combined encrypted numeric features; and
learning based on the aggregated samples using secure mathematical operations to provide an analysis of the text data by performing a learning method on the aggregated samples that represent the multiple text segments in aggregated encrypted form using operations that maintain the aggregated encrypted samples and thereby the text data concealed from a learning device performing the method.