175 changes: 173 additions & 2 deletions README.md
@@ -1,2 +1,173 @@
# floc-simhash
A fast python implementation of the SimHash algorithm.
# FLoC SimHash

This Python package provides hashing algorithms for computing cohort ids of users based on their browsing history.
As such, it may be used to compute cohort ids following Google's **Federated Learning of Cohorts** (FLoC) proposal.

The FLoC proposal is an important part of [The Privacy Sandbox](https://www.chromium.org/Home/chromium-privacy/privacy-sandbox), which is Google's replacement for third-party cookies.
FLoC will enable interest-based advertising, thus preserving an important source of monetization for today's web.

The main idea, as outlined in the [FLoC whitepaper](https://raw.githubusercontent.com/google/ads-privacy/master/proposals/FLoC/FLOC-Whitepaper-Google.pdf), is to replace user cookie ids, which enable user targeting across multiple sites, with _cohort ids_.
A cohort would consist of a set of users sharing similar browsing behaviour.
By targeting a given cohort, advertisers can ensure that relevant ads are shown while user privacy is preserved by a _hiding in the pack_ mechanism.

The FLoC whitepaper mentions several mechanisms to map users to cohorts, with varying amounts of centralized information.
The algorithms [currently](https://blog.google/products/ads-commerce/2021-01-privacy-sandbox/) being implemented in Google Chrome as a proof of concept are based on **SimHash**, a type of locality-sensitive hashing originally introduced for detecting near-duplicate documents.

## Contents

<!-- toc -->

- [FLoC SimHash](#floc-simhash)
- [Contents](#contents)
- [Installation](#installation)
- [Usage](#usage)
- [Individual document-based SimHash](#individual-document-based-simhash)
- [Providing your own tokenizer](#providing-your-own-tokenizer)
- [Using the SimHashTransformer in scikit-learn pipelines](#using-the-simhashtransformer-in-scikit-learn-pipelines)
- [Caveats](#caveats)
- [Development](#development)

<!-- tocstop -->

## Installation

The `floc-simhash` package is available on PyPI. Install it with `pip`:

```bash
pip install floc-simhash
```

The package requires `python>=3.7` and will install `scikit-learn` as a dependency.

## Usage

The package provides two main classes.

- `SimHash`, which applies the SimHash algorithm to the md5 hashes of the tokens in a given document.

- `SimHashTransformer`, which applies the SimHash algorithm to a document vectorization as part of a scikit-learn [pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline).

Finally, there is a third class available:

- `SortingSimHash`, which performs the SortingLSH algorithm by first applying `SimHash` and then clipping the resulting hashes to a given precision; a short example follows this list.
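
For instance, a minimal sketch of keeping the 16 leading bits of a 128-bit SimHash (the constructor arguments follow `floc_simhash/text/sorting.py`):

```python
from floc_simhash import SortingSimHash

document = "google.com|hybridtheory.com|youtube.com|reddit.com"
hasher = SortingSimHash(
    bits_to_keep=16, simhash_bits=128, tokenizer=lambda x: x.split("|")
)

print(hasher.hash(document))  # the clipped 16-bit hash, printed as a hexadecimal string
```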

### Individual document-based SimHash

The `SimHash` class provides a way to calculate the SimHash of any given document, without using any information coming from other documents.

In this case, the document hash is computed by looking at md5 hashes of individual tokens.
We use:

- The implementation of the md5 hashing algorithm available in the `hashlib` module in the [Python standard library](https://docs.python.org/3/library/hashlib.html).

- Bitwise arithmetic for fast computations of the document hash from the individual hashed tokens.

The program below, for example, will print the following hexadecimal string: `cf48b038108e698418650807001800c5`.

```python
from floc_simhash import SimHash

document = "Lorem ipsum dolor sit amet consectetur adipiscing elit"
hashed_document = SimHash(n_bits=128).hash(document)

print(hashed_document)
```

For an example closer to computing cohort ids, the following program computes the cohort id of a user by applying SimHash to the document formed by the pipe-separated list of domains in the user's browsing history.

```python
from floc_simhash import SimHash

document = "google.com|hybridtheory.com|youtube.com|reddit.com"
hasher = SimHash(n_bits=128, tokenizer=lambda x: x.split("|"))
hashed_document = hasher.hash(document)

print(hashed_document)
```

The code above will print the hexadecimal string: `14dd1064800880b40025764cd0014715`.

#### Providing your own tokenizer

By default, the `SimHash` constructor splits the given document on white space.
However, any callable that parses a string into a list of strings can be passed via the `tokenizer` parameter.
The example above passes `tokenizer=lambda x: x.split("|")`.

A more complex tokenization could use the [word tokenizer](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize) from NLTK, a natural choice when hashing full text documents, as sketched below.
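
A minimal sketch, assuming `nltk` is installed and its `punkt` tokenizer data has been downloaded:

```python
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt") beforehand

from floc_simhash import SimHash

text_hasher = SimHash(n_bits=128, tokenizer=word_tokenize)
print(text_hasher.hash("Lorem ipsum dolor sit amet, consectetur adipiscing elit."))
```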

### Using the SimHashTransformer in scikit-learn pipelines

The approach to SimHash outlined in the [FLoC Whitepaper](https://raw.githubusercontent.com/google/ads-privacy/master/proposals/FLoC/FLOC-Whitepaper-Google.pdf) consists of choosing random unit vectors and working on already vectorized data.

The choice of a random unit vector is equivalent to choosing a random hyperplane in feature space.
Choosing `p` random hyperplanes partitions the feature space into `2^p` regions.
Then, a `p`-bit SimHash of a vector encodes the region to which it belongs.
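
As an illustration only (the `SimHashTransformer` below performs the equivalent computation, including support for sparse matrices), the geometric construction can be sketched in a few lines of numpy:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
p = 8                                     # number of hyperplanes, i.e. hash bits
docs = rng.normal(size=(4, 10))           # toy vectorization: 4 documents, 10 features
planes = rng.normal(size=(10, p))         # one random vector per hyperplane
planes /= np.linalg.norm(planes, axis=0)  # normalize the columns to unit vectors
bits = (docs @ planes > 0).astype(int)    # side of each hyperplane a document falls on
hashes = bits @ (2 ** np.arange(p))       # pack the p bits into one integer per document
print(hashes)
```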

It is reasonable to expect _similar_ documents to have the same hash, provided the vectorization respects the given notion of similarity.

Two vectorizations are discussed in the aforementioned whitepaper: **one-hot** and **tf-idf**; they are available in [scikit-learn](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

The `SimHashTransformer` supplies a transformer (implementing the `fit` and `transform` methods) that can be used directly on the output of either of these vectorizers in order to obtain hashes.

For example, consider a 1d-array `X` of strings, each the `"|"`-separated concatenation of the domains visited by a given user. The following code stores in `y` the cohort id of each user, using one-hot encoding and a 32-bit SimHash.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

from floc_simhash import SimHashTransformer


X = [
    "google.com|hybridtheory.com|youtube.com|reddit.com",
    "google.com|youtube.com|reddit.com",
    "github.com",
    "google.com|github.com",
]

one_hot_simhash = Pipeline(
    [
        ("vect", CountVectorizer(tokenizer=lambda x: x.split("|"), binary=True)),
        ("simhash", SimHashTransformer(n_bits=32)),
    ]
)

y = one_hot_simhash.fit_transform(X)
```

After running this code, the value of `y` would look similar to the following (expect same lengths; actual hash values depend on the choice of random vectors during `fit`):

```python
['0xd98c7e93' '0xd10b79b3' '0x1085154d' '0x59cd150d']
```
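
The tf-idf variant is analogous. As a sketch, reusing the imports and the array `X` from the example above and swapping in `TfidfVectorizer`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_simhash = Pipeline(
    [
        ("vect", TfidfVectorizer(tokenizer=lambda x: x.split("|"))),
        ("simhash", SimHashTransformer(n_bits=32)),
    ]
)

y_tfidf = tfidf_simhash.fit_transform(X)
```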

#### Caveats

- The implementation works on the sparse matrices output by `CountVectorizer` and `TfidfTransformer`, in order to manage memory efficiently.

- At the moment, the integer precision of the numpy arrays results in overflow errors for `p >= 64`. While we are waiting for implementation details of the FLoC POCs, the first indications hint at choices around `p = 50`.

## Development

This project uses [poetry](https://python-poetry.org/) for managing dependencies.

In order to clone the repository and run the unit tests, execute the following steps in an environment with `python>=3.7`.

```bash
git clone https://github.com/hybridtheory/floc-simhash.git
cd floc-simhash
poetry install
pytest
```

The unit tests are property-based, using the [hypothesis](https://hypothesis.readthedocs.io/en/latest/) library.
This allows the algorithms to be verified against hundreds or thousands of randomly generated inputs.

Since running many examples lengthens the test suite runtime, we also use `pytest-xdist` to parallelize the tests.
For example, the following call runs up to 1000 examples for each test with parallelism 4.

```bash
pytest -n 4 --hypothesis-profile=ci
```
3 changes: 3 additions & 0 deletions floc_simhash/__init__.py
@@ -0,0 +1,3 @@
from .text.simhash import SimHash
from .text.sorting import SortingSimHash
from .transformer.geometric import SimHashTransformer
Empty file added floc_simhash/text/__init__.py
Empty file.
62 changes: 62 additions & 0 deletions floc_simhash/text/simhash.py
@@ -0,0 +1,62 @@
import hashlib
from typing import List, Callable


class SimHash:
    def __init__(
        self,
        n_bits: int,
        tokenizer: Callable[[str], List[str]] = lambda x: x.split(" "),
    ):
        """Return a class that performs SimHash.

        The algorithm will compute hashes of `n_bits` bits, splitting documents using `tokenizer`.

        :param n_bits: Number of bits of the returned hashes. This is currently limited to a
            maximum of 128 bits.
        :type n_bits: int
        :param tokenizer: rule to split a given document into a list of tokens, defaults to
            `lambda x: x.split(" ")`.
        :type tokenizer: Callable[[str], List[str]], optional
        :raises ValueError: if `n_bits` is greater than 128.
        """
        if n_bits > 128:
            raise ValueError("Only hashes up to 128 bits are currently supported")

        self.n_bits = n_bits
        self.tokenizer = tokenizer

    def _bitwise_compare(self, hashes: List[int], result: int, bit: int) -> int:
        # Build the SimHash one bit at a time: bit `bit` of the result is the majority vote
        # of bit `bit` across all token hashes.
        if bit >= self.n_bits:
            return result

        hash_bits = [h & 1 for h in hashes]
        next_bit = int(sum(hash_bits) / len(hashes) > 0.5)

        return self._bitwise_compare(
            [h >> 1 for h in hashes], result + next_bit * 2 ** bit, bit + 1
        )

    def inthash(self, document: str) -> int:
        """Compute the hash of `document` and return it as an integer.

        :type document: str
        :rtype: int
        """
        tokens: List[bytes] = [w.encode() for w in self.tokenizer(document)]
        md5_hashes: List = [hashlib.md5(token) for token in tokens]
        # Keep only the `n_bits` most significant bits of each token's md5 hash.
        token_clipped_hashes: List[int] = [
            int(h.hexdigest(), 16) >> (h.digest_size * 8 - self.n_bits)
            for h in md5_hashes
        ]

        return self._bitwise_compare(token_clipped_hashes, 0, 0)

    def hash(self, document: str) -> str:
        """Compute the hash of `document` as a hexadecimal string.

        :type document: str
        :rtype: str
        """
        return hex(self.inthash(document))[2:]
38 changes: 38 additions & 0 deletions floc_simhash/text/sorting.py
@@ -0,0 +1,38 @@
from typing import Callable, List

from .simhash import SimHash


class SortingSimHash(SimHash):
    def __init__(
        self,
        bits_to_keep: int,
        simhash_bits: int,
        tokenizer: Callable[[str], List[str]] = lambda x: x.split(" "),
    ):
        """Return a class that performs SortingLSH.

        First, a SimHash with `simhash_bits` is computed. Then, the first `bits_to_keep` bits
        are kept.

        :type bits_to_keep: int
        :type simhash_bits: int
        :param tokenizer: method to split a given document into a list of tokens, defaults to
            `lambda x: x.split(" ")`
        :type tokenizer: Callable[[str], List[str]], optional
        :raises ValueError: if `bits_to_keep > simhash_bits`.
        """
        if bits_to_keep > simhash_bits:
            raise ValueError("Cannot keep more bits than those given by Simhash")

        self.bits_to_keep = bits_to_keep
        super().__init__(simhash_bits, tokenizer)

    def inthash(self, document: str) -> int:
        """Compute the hash of `document` and return it as an integer.

        :type document: str
        :rtype: int
        """
        simhash = super().inthash(document)
        return simhash >> (self.n_bits - self.bits_to_keep)
61 changes: 61 additions & 0 deletions floc_simhash/transformer/geometric.py
@@ -0,0 +1,61 @@
from typing import Optional, Union, Any

import numpy as np
import scipy
from sklearn.base import TransformerMixin


class SimHashTransformer(TransformerMixin):
    def __init__(self, n_bits: int):
        """Initialize a transformer capable of computing the SimHash algorithm on vectorized data.

        :param n_bits: Precision to hash vectors to.
        :type n_bits: int
        """
        self.n_bits = n_bits
        self._vectors: Optional[np.array] = None

    def fit(
        self, X: Union[np.array, scipy.sparse.csr_matrix], y: Any = None
    ) -> "SimHashTransformer":
        """Fit the transformer to the given data.

        This amounts to choosing random unit vectors of dimension `n`, where `X` is of dimension
        `(number_of_documents, n)`.

        :param X: vectorized data to be hashed. Can either be a numpy array or a sparse matrix.
        :type X: Union[np.array, scipy.sparse.csr_matrix]
        :param y: not used, left for matching the scikit-learn transformer pattern. Defaults to
            None.
        :type y: Any, optional
        :return: the fitted instance itself.
        :rtype: "SimHashTransformer"
        """
        random_vectors = np.random.randn(X.shape[1], self.n_bits)
        # Normalize each column so that every hash bit corresponds to a random unit vector.
        self._vectors = random_vectors / np.linalg.norm(
            random_vectors, axis=0, keepdims=True
        )
        return self

    def transform(
        self, X: Union[np.array, scipy.sparse.csr_matrix], y: Any = None
    ) -> np.array:
        """Compute the SimHashes of the vectors in X as an array of hexadecimal strings.

        :param X: 2D array, where each row contains a vector to be hashed.
        :type X: Union[np.array, scipy.sparse.csr_matrix]
        :param y: not used, left for matching the scikit-learn transformer pattern. Defaults to
            None.
        :type y: Any, optional
        :return: an array of hexadecimal strings
        :rtype: np.array
        :raises ValueError: if the fit method has not been previously called.
        """
        if self._vectors is None:
            raise ValueError("The fit method has not been called")
        # Each bit records the side of the corresponding random hyperplane a vector falls on.
        mul = X.dot(self._vectors)
        bits = np.where(mul > 0, 1, 0)
        # Pack the bit columns into a single integer per row, then format it as hex.
        powers = 2 ** np.arange(self.n_bits)
        int_values = np.dot(bits, powers)
        hex_converter = np.vectorize(lambda x: str(hex(x)))
        return hex_converter(int_values)
10 changes: 10 additions & 0 deletions lefthook.yml
@@ -0,0 +1,10 @@
pre-commit:
  commands:
    black:
      tags: python
      glob: "*.py"
      run: black --check {staged_files}
    pylint:
      tags: python
      glob: "*.py"
      run: pylint-fail-under --fail_under 8.0 {staged_files}