175 changes: 173 additions & 2 deletions README.md
@@ -1,2 +1,173 @@
# floc-simhash
A fast python implementation of the SimHash algorithm.
# FLoC SimHash

This Python package provides hashing algorithms for computing cohort ids of users based on their browsing history.
As such, it may be used to compute cohort ids following Google's **Federated Learning of Cohorts** (FLoC) proposal.

The FLoC proposal is an important part of [The Privacy Sandbox](https://www.chromium.org/Home/chromium-privacy/privacy-sandbox), which is Google's replacement for third-party cookies.
FLoC will enable interest-based advertising, thus preserving an important source of monetization for today's web.

The main idea, as outlined in the [FLoC whitepaper](https://raw.githubusercontent.com/google/ads-privacy/master/proposals/FLoC/FLOC-Whitepaper-Google.pdf), is to replace user cookie ids, which enable user targeting across multiple sites, with _cohort ids_.
A cohort would consist of a set of users sharing similar browsing behaviour.
By targeting a given cohort, advertisers can ensure that relevant ads are shown while user privacy is preserved by a _hiding in the pack_ mechanism.

The FLoC whitepaper mentions several mechanisms to map users to cohorts, with varying amounts of centralized information.
The algorithms [currently](https://blog.google/products/ads-commerce/2021-01-privacy-sandbox/) being implemented in Google Chrome as a proof of concept are based on **SimHash**, a type of locality-sensitive hashing originally introduced for detecting near-duplicate documents.

## Contents

<!-- toc -->

- [FLoC SimHash](#floc-simhash)
- [Contents](#contents)
- [Installation](#installation)
- [Usage](#usage)
- [Individual document-based SimHash](#individual-document-based-simhash)
- [Providing your own tokenizer](#providing-your-own-tokenizer)
- [Using the SimHashTransformer in scikit-learn pipelines](#using-the-simhashtransformer-in-scikit-learn-pipelines)
- [Caveats](#caveats)
- [Development](#development)

<!-- tocstop -->

## Installation

The `floc-simhash` package is available on PyPI. Install it with `pip`:

```bash
pip install floc-simhash
```

The package requires `python>=3.7` and will install `scikit-learn` as a dependency.

## Usage

The package provides two main classes.

- `SimHash`, which applies the SimHash algorithm to the md5 hashes of the tokens in a given document.

- `SimHashTransformer`, which applies the SimHash algorithm to a document vectorization as part of a scikit-learn [pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline).

Finally, there is a third class available:

- `SortingSimHash`, which performs the SortingLSH algorithm by first applying `SimHash` and then clipping the resulting hashes to a given precision; a short example follows this list.
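
For instance, a minimal sketch of keeping the 16 leading bits of a 128-bit SimHash (the constructor arguments follow `floc_simhash/text/sorting.py`):

```python
from floc_simhash import SortingSimHash

document = "google.com|hybridtheory.com|youtube.com|reddit.com"
hasher = SortingSimHash(
    bits_to_keep=16, simhash_bits=128, tokenizer=lambda x: x.split("|")
)

print(hasher.hash(document))  # the clipped 16-bit hash, printed as a hexadecimal string
```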

### Individual document-based SimHash

The `SimHash` class provides a way to calculate the SimHash of any given document, without using any information coming from other documents.

In this case, the document hash is computed by looking at md5 hashes of individual tokens.
We use:

- The implementation of the md5 hashing algorithm available in the `hashlib` module in the [Python standard library](https://docs.python.org/3/library/hashlib.html).

- Bitwise arithmetic for fast computations of the document hash from the individual hashed tokens.

The program below, for example, will print the following hexadecimal string: `cf48b038108e698418650807001800c5`.

```python
from floc_simhash import SimHash

document = "Lorem ipsum dolor sit amet consectetur adipiscing elit"
hashed_document = SimHash(n_bits=128).hash(document)

print(hashed_document)
```

For an example closer to computing cohort ids, the following program computes the cohort id of a user by applying SimHash to the document formed by the pipe-separated list of domains in the user's browsing history.

```python
from floc_simhash import SimHash

document = "google.com|hybridtheory.com|youtube.com|reddit.com"
hasher = SimHash(n_bits=128, tokenizer=lambda x: x.split("|"))
hashed_document = hasher.hash(document)

print(hashed_document)
```

The code above will print the hexadecimal string: `14dd1064800880b40025764cd0014715`.

#### Providing your own tokenizer

By default, the `SimHash` constructor splits the given document on white space.
However, any callable that parses a string into a list of strings can be passed via the `tokenizer` parameter.
The example above passes `tokenizer=lambda x: x.split("|")`.

A more complex tokenization could use the [word tokenizer](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize) from NLTK, a natural choice when hashing full text documents, as sketched below.
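
A minimal sketch, assuming `nltk` is installed and its `punkt` tokenizer data has been downloaded:

```python
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt") beforehand

from floc_simhash import SimHash

text_hasher = SimHash(n_bits=128, tokenizer=word_tokenize)
print(text_hasher.hash("Lorem ipsum dolor sit amet, consectetur adipiscing elit."))
```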

### Using the SimHashTransformer in scikit-learn pipelines

The approach to SimHash outlined in the [FLoC Whitepaper](https://raw.githubusercontent.com/google/ads-privacy/master/proposals/FLoC/FLOC-Whitepaper-Google.pdf) consists of choosing random unit vectors and working on already vectorized data.

The choice of a random unit vector is equivalent to choosing a random hyperplane in feature space.
Choosing `p` random hyperplanes partitions the feature space into `2^p` regions.
Then, a `p`-bit SimHash of a vector encodes the region to which it belongs.
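
As an illustration only (the `SimHashTransformer` below performs the equivalent computation, including support for sparse matrices), the geometric construction can be sketched in a few lines of numpy:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
p = 8                                     # number of hyperplanes, i.e. hash bits
docs = rng.normal(size=(4, 10))           # toy vectorization: 4 documents, 10 features
planes = rng.normal(size=(10, p))         # one random vector per hyperplane
planes /= np.linalg.norm(planes, axis=0)  # normalize the columns to unit vectors
bits = (docs @ planes > 0).astype(int)    # side of each hyperplane a document falls on
hashes = bits @ (2 ** np.arange(p))       # pack the p bits into one integer per document
print(hashes)
```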

It is reasonable to expect _similar_ documents to have the same hash, provided the vectorization respects the given notion of similarity.

Two vectorizations are discussed in the aforementioned whitepaper: **one-hot** and **tf-idf**; they are available in [scikit-learn](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

The `SimHashTransformer` supplies a transformer (implementing the `fit` and `transform` methods) that can be used directly on the output of either of these vectorizers in order to obtain hashes.

For example, consider a 1d-array `X` of strings, each the `"|"`-separated concatenation of the domains visited by a given user. The following code stores in `y` the cohort id of each user, using one-hot encoding and a 32-bit SimHash.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

from floc_simhash import SimHashTransformer


X = [
    "google.com|hybridtheory.com|youtube.com|reddit.com",
    "google.com|youtube.com|reddit.com",
    "github.com",
    "google.com|github.com",
]

one_hot_simhash = Pipeline(
    [
        ("vect", CountVectorizer(tokenizer=lambda x: x.split("|"), binary=True)),
        ("simhash", SimHashTransformer(n_bits=32)),
    ]
)

y = one_hot_simhash.fit_transform(X)
```

After running this code, the value of `y` would look similar to the following (expect same lengths; actual hash values depend on the choice of random vectors during `fit`):

```python
['0xd98c7e93' '0xd10b79b3' '0x1085154d' '0x59cd150d']
```
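
The tf-idf variant is analogous. As a sketch, reusing the imports and the array `X` from the example above and swapping in `TfidfVectorizer`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_simhash = Pipeline(
    [
        ("vect", TfidfVectorizer(tokenizer=lambda x: x.split("|"))),
        ("simhash", SimHashTransformer(n_bits=32)),
    ]
)

y_tfidf = tfidf_simhash.fit_transform(X)
```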

#### Caveats

- The implementation works on the sparse matrices output by `CountVectorizer` and `TfidfTransformer`, in order to manage memory efficiently.

- At the moment, the integer precision of the numpy arrays results in overflow errors for `p >= 64`. While we are waiting for implementation details of the FLoC POCs, the first indications hint at choices around `p = 50`.

## Development

This project uses [poetry](https://python-poetry.org/) for managing dependencies.

In order to clone the repository and run the unit tests, execute the following steps in an environment with `python>=3.7`.

```bash
git clone https://github.com/hybridtheory/floc-simhash.git
cd floc-simhash
poetry install
pytest
```

The unit tests are property-based, using the [hypothesis](https://hypothesis.readthedocs.io/en/latest/) library.
This allows the algorithms to be verified against hundreds or thousands of randomly generated inputs.

Since running many examples lengthens the test suite runtime, we also use `pytest-xdist` to parallelize the tests.
For example, the following call runs up to 1000 examples for each test with parallelism 4.

```bash
pytest -n 4 --hypothesis-profile=ci
```
3 changes: 3 additions & 0 deletions floc_simhash/__init__.py
@@ -0,0 +1,3 @@
from .text.simhash import SimHash
from .text.sorting import SortingSimHash
from .transformer.geometric import SimHashTransformer
Empty file added floc_simhash/text/__init__.py
Empty file.
62 changes: 62 additions & 0 deletions floc_simhash/text/simhash.py
@@ -0,0 +1,62 @@
import hashlib
from typing import List, Callable


class SimHash:
    def __init__(
        self,
        n_bits: int,
        tokenizer: Callable[[str], List[str]] = lambda x: x.split(" "),
    ):
        """Return a class that performs SimHash.

        The algorithm will compute hashes of `n_bits` bits, splitting documents using `tokenizer`.

        :param n_bits: Number of bits of the returned hashes. This is currently limited to a
            maximum of 128 bits.
        :type n_bits: int
        :param tokenizer: rule to split a given document into a list of tokens, defaults to
            `lambda x: x.split(" ")`.
        :type tokenizer: Callable[[str], List[str]], optional
        :raises ValueError: if `n_bits` is greater than 128.
        """
        if n_bits > 128:
            raise ValueError("Only hashes up to 128 bits are currently supported")

        self.n_bits = n_bits
        self.tokenizer = tokenizer

    def _bitwise_compare(self, hashes: List[int], result: int, bit: int) -> int:
        # Build the SimHash one bit at a time: bit `bit` of the result is the majority vote
        # of bit `bit` across all token hashes.
        if bit >= self.n_bits:
            return result

        hash_bits = [h & 1 for h in hashes]
        next_bit = int(sum(hash_bits) / len(hashes) > 0.5)

        return self._bitwise_compare(
            [h >> 1 for h in hashes], result + next_bit * 2 ** bit, bit + 1
        )

    def inthash(self, document: str) -> int:
        """Compute the hash of `document` and return it as an integer.

        :type document: str
        :rtype: int
        """
        tokens: List[bytes] = [w.encode() for w in self.tokenizer(document)]
        md5_hashes: List = [hashlib.md5(token) for token in tokens]
        # Keep only the `n_bits` most significant bits of each token's md5 hash.
        token_clipped_hashes: List[int] = [
            int(h.hexdigest(), 16) >> (h.digest_size * 8 - self.n_bits)
            for h in md5_hashes
        ]

        return self._bitwise_compare(token_clipped_hashes, 0, 0)

    def hash(self, document: str) -> str:
        """Compute the hash of `document` as a hexadecimal string.

        :type document: str
        :rtype: str
        """
        return hex(self.inthash(document))[2:]
38 changes: 38 additions & 0 deletions floc_simhash/text/sorting.py
@@ -0,0 +1,38 @@
from typing import Callable, List

from .simhash import SimHash


class SortingSimHash(SimHash):
    def __init__(
        self,
        bits_to_keep: int,
        simhash_bits: int,
        tokenizer: Callable[[str], List[str]] = lambda x: x.split(" "),
    ):
        """Return a class that performs SortingLSH.

        First, a SimHash with `simhash_bits` is computed. Then, the first `bits_to_keep` bits
        are kept.

        :type bits_to_keep: int
        :type simhash_bits: int
        :param tokenizer: method to split a given document into a list of tokens, defaults to
            `lambda x: x.split(" ")`
        :type tokenizer: Callable[[str], List[str]], optional
        :raises ValueError: if `bits_to_keep > simhash_bits`.
        """
        if bits_to_keep > simhash_bits:
            raise ValueError("Cannot keep more bits than those given by Simhash")

        self.bits_to_keep = bits_to_keep
        super().__init__(simhash_bits, tokenizer)

    def inthash(self, document: str) -> int:
        """Compute the hash of `document` and return it as an integer.

        :type document: str
        :rtype: int
        """
        simhash = super().inthash(document)
        return simhash >> (self.n_bits - self.bits_to_keep)
61 changes: 61 additions & 0 deletions floc_simhash/transformer/geometric.py
@@ -0,0 +1,61 @@
from typing import Optional, Union, Any

import numpy as np
import scipy
from sklearn.base import TransformerMixin


class SimHashTransformer(TransformerMixin):
    def __init__(self, n_bits: int):
        """Initialize a transformer capable of computing the SimHash algorithm on vectorized data.

        :param n_bits: Precision to hash vectors to.
        :type n_bits: int
        """
        self.n_bits = n_bits
        self._vectors: Optional[np.array] = None

    def fit(
        self, X: Union[np.array, scipy.sparse.csr_matrix], y: Any = None
    ) -> "SimHashTransformer":
        """Fit the transformer to the given data.

        This amounts to choosing random unit vectors of dimension `n`, where `X` is of dimension
        `(number_of_documents, n)`.

        :param X: vectorized data to be hashed. Can either be a numpy array or a sparse matrix.
        :type X: Union[np.array, scipy.sparse.csr_matrix]
        :param y: not used, left for matching the scikit-learn transformer pattern. Defaults to
            None.
        :type y: Any, optional
        :return: the fitted instance itself.
        :rtype: "SimHashTransformer"
        """
        random_vectors = np.random.randn(X.shape[1], self.n_bits)
        # Normalize each column so that every hash bit corresponds to a random unit vector.
        self._vectors = random_vectors / np.linalg.norm(
            random_vectors, axis=0, keepdims=True
        )
        return self

    def transform(
        self, X: Union[np.array, scipy.sparse.csr_matrix], y: Any = None
    ) -> np.array:
        """Compute the SimHashes of the vectors in X as an array of hexadecimal strings.

        :param X: 2D array, where each row contains a vector to be hashed.
        :type X: Union[np.array, scipy.sparse.csr_matrix]
        :param y: not used, left for matching the scikit-learn transformer pattern. Defaults to
            None.
        :type y: Any, optional
        :return: an array of hexadecimal strings
        :rtype: np.array
        :raises ValueError: if the fit method has not been previously called.
        """
        if self._vectors is None:
            raise ValueError("The fit method has not been called")
        # Each bit records the side of the corresponding random hyperplane a vector falls on.
        mul = X.dot(self._vectors)
        bits = np.where(mul > 0, 1, 0)
        # Pack the bit columns into a single integer per row, then format it as hex.
        powers = 2 ** np.arange(self.n_bits)
        int_values = np.dot(bits, powers)
        hex_converter = np.vectorize(lambda x: str(hex(x)))
        return hex_converter(int_values)
10 changes: 10 additions & 0 deletions lefthook.yml
@@ -0,0 +1,10 @@
pre-commit:
  commands:
    black:
      tags: python
      glob: "*.py"
      run: black --check {staged_files}
    pylint:
      tags: python
      glob: "*.py"
      run: pylint-fail-under --fail_under 8.0 {staged_files}