This repository was archived by the owner on Jul 10, 2025. It is now read-only.

RFC: Tokenization API & Initial Implementations #98

Merged
merged 4 commits into tensorflow:master on Oct 11, 2019

Conversation

Member

@broken broken commented Apr 30, 2019

This review is open for public comment until 2019-05-15

Status        | Accepted
:------------ | :-----------------------------------
**Author(s)** | Robby Neale (Google)
**Sponsor**   | Mark Omernick (Google), Greg Billock (Google)
**Updated**   | 2019-04-18

Objective

Establish a base Tokenizer class for TensorFlow tokenizers to extend, and create an initial set of tokenization offerings which implement this API.

@googlebot

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

ℹ️ Googlers: Go here for more info.

Status        | Proposed
:------------ | :-----------------------------------
**Author(s)** | Robby Neale (Google)
**Sponsor**   | Mark Omernick (Google), Greg Billock (Google)
**Updated**   | 2019-04-18

**Objective**

Establish a set of principles to follow that will unify TensorFlow tokenization
offerings, and apply these conventions to APIs for a set of op-level tokenizers.
@googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

APIs of tokenization implementations following the above conventions.

```python
def xyz_tokenize(input, **kwargs):
```
Contributor


Why a function instead of an object?

I'd expect tokenizers to implement an interface like

class Tokenizer:
  def tokenize(input, encoding=None):

  def tokenize_with_offsets(input, encoding=None):

where any additional kwargs and tokenizer-specific configuration options are passed to the constructor, so it's easy to pass these objects around without all call sites having to also know what options to pass to tokenizers.

Member Author

Yeah; a valid question. Two main reasons:

  • Many tokenizers past what are proposed here require a trained model and are not fit to be surfaced as ops. The expectation is that these conventions would apply to more than just ops, but also TF Hub modules and TF Serving models as well.
  • While a majority of tokenizers could fit with the interface and having a constructor for initialization, I did not want to limit possible future methods. For example, a multilingual tokenizer might allow for a language_hint as an argument for tokenization, or you may want to pass different lookup tables to a wordpiece tokenizer depending on the language.
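
To make the second point concrete, here are a couple of purely hypothetical signatures of the kind being alluded to; these are editorial illustrations only, not part of the proposal.

```python
# Hypothetical signatures sketching the point above; neither `language_hint`
# nor `vocab_lookup_table` is part of the proposed API.

def multilingual_tokenize(input, language_hint=None):
  """Tokenizes `input`, optionally biasing segmentation toward one language."""
  ...


def wordpiece_tokenize(input, vocab_lookup_table):
  """Tokenizes `input` into wordpieces using the given lookup table."""
  ...
```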


I am just bothered that if you're using these tokenizers in eager mode you'll need to keep passing initialization arguments again and again on every iteration. You can always make a free function a method by using a wrapper empty class, but you cannot in general go the other way without wasting work.

I have the same concern even in graph mode, where one may want to tokenize string inputs in multiple ops using essentially the same tokenizer. If a tokenizer depends on expensive initialization or external file resources (e.g. a lookup table or a lexicon), we would end up initializing/loading resources/allocating memory O(# ops) times instead of O(1).
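
A minimal sketch of the class-based pattern this comment argues for, under hypothetical names (`VocabTokenizer` and its one-token-per-line vocab file are illustrative, not the RFC's API): the expensive lookup table is built once in the constructor and reused by every call.

```python
import tensorflow as tf

class VocabTokenizer:
  """Illustrative only: builds its lookup table once and reuses it per call."""

  def __init__(self, vocab_file):
    # Expensive initialization happens exactly once, in eager or graph mode.
    self._table = tf.lookup.StaticHashTable(
        tf.lookup.TextFileInitializer(
            vocab_file,
            key_dtype=tf.string,
            key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
            value_dtype=tf.int64,
            value_index=tf.lookup.TextFileIndex.LINE_NUMBER),
        default_value=-1)

  def tokenize(self, input):
    # `input` is a rank-1 batch of strings; the same table is reused per call.
    tokens = tf.strings.split(input)  # RaggedTensor of string tokens
    return tf.ragged.map_flat_values(self._table.lookup, tokens)
```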

Member Author

I've been discussing this subject offline as well and will update the RFC to include an interface.

@alextp
Contributor

alextp commented May 1, 2019 via email

scalar, in which case the output will be a vector Tensor.
1. Tokenizers should also provide a `_with_offsets` variant that includes start
and limit offsets with the output which will allow alignment of the tokens
to the original byte offsets of the strings they are tokenized from.

From the examples I can see that the output token is not required to be byte equal to the original input substring. Perhaps it is worth making this explicit here (since not all users of tokenizers may know about NFC normalization etc).

Member Author

Agreed. I'll include this with the interface update.
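
A tiny pure-Python sketch (an editorial illustration, not the proposed TF ops) of the point raised above: when case folding is applied before splitting, the emitted token need not be byte-equal to the input substring its offsets refer back to.

```python
def casefold_whitespace_tokenize_with_offsets(text):
  """Case-folds, splits on whitespace, and returns offsets into `text`.

  Offsets here are character offsets, which coincide with byte offsets for
  ASCII input; the RFC specifies byte offsets.
  """
  tokens, starts, limits = [], [], []
  start = None
  for i, ch in enumerate(text):
    if ch.isspace():
      if start is not None:
        tokens.append(text[start:i].casefold())
        starts.append(start)
        limits.append(i)
        start = None
    elif start is None:
      start = i
  if start is not None:
    tokens.append(text[start:].casefold())
    starts.append(start)
    limits.append(len(text))
  return tokens, starts, limits

# casefold_whitespace_tokenize_with_offsets("Hello World")
# -> (['hello', 'world'], [0, 6], [5, 11])
# Note that text[0:5] is "Hello", not the emitted token "hello".
```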

might implement following the same set of guidelines.

```python
def my_tokenize(input):
```


How serious are we with regard to "composability"? The current interface can be quite tricky and burdensome for implementing custom tokenizers using existing tokenizers, especially if offset tracking is required ("Tokenizers should also provide a _with_offsets variant" in "Conventions").

For example, consider the extra work needed for implementing my_tokenize_with_offsets().

  1. We need to know what input bytes each output code point maps to in every single string rewriting operation, i.e. case_fold_utf8(), regex_replace().
  2. For each string splitting operation, we need to translate offsets relative to the splitting input to those relative to the original input.
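
For the second step, the bookkeeping is essentially an offset shift; a rough sketch under hypothetical names follows (the first step, tracking offsets through string rewrites such as case folding, is the genuinely hard part and is not covered here).

```python
# Hypothetical helper illustrating step 2 above: token offsets produced
# relative to a split-out piece are shifted so they index the original input.

def translate_offsets(piece_start, token_starts, token_limits):
  """Shifts piece-relative offsets to offsets into the original string."""
  return ([piece_start + s for s in token_starts],
          [piece_start + l for l in token_limits])

# A piece starting at byte 12 of the original input, with a token spanning
# bytes [3, 7) of that piece, yields a token spanning bytes [15, 19):
# translate_offsets(12, [3], [7]) -> ([15], [19])
```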

Member Author

Yeah; I was trying to be careful with the use of "should" there, in that it is a best practice but it could be left out if it proved too tricky.

I am updating the RFC to include two interfaces, which will make this more explicit.

class Tokenizer:
  def tokenize(input):

class TokenizerWithOffsets(Tokenizer):
  def tokenize_with_offsets(input):

@broken
Member Author

broken commented May 7, 2019

Pushed a new version which moves from the function/conventions approach to an interface (base class) with concrete Tokenizer classes.
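
A minimal sketch of the interface shape described in the update; the abstract methods mirror the RFC, but `SimpleWhitespaceTokenizer` is illustrative only (the real implementations are backed by custom ops).

```python
import tensorflow as tf

class Tokenizer:
  def tokenize(self, input):
    raise NotImplementedError("Subclasses must implement tokenize().")


class TokenizerWithOffsets(Tokenizer):
  def tokenize_with_offsets(self, input):
    raise NotImplementedError(
        "Subclasses must implement tokenize_with_offsets().")


class SimpleWhitespaceTokenizer(Tokenizer):
  """Splits a rank-1 batch of UTF-8 strings on whitespace."""

  def tokenize(self, input):
    # Returns a RaggedTensor of tokens, one row per input string.
    return tf.strings.split(input)

# SimpleWhitespaceTokenizer().tokenize(["a b c", "d e"])
# -> <tf.RaggedTensor [[b'a', b'b', b'c'], [b'd', b'e']]>
```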

@broken
Member Author

broken commented May 17, 2019

Notes from design review:

Q: Were there any major last minute changes to the doc?
Previously we were proposing tokenizer methods; after comments on the RFC we changed to proposing Python classes instead, with Tokenizer and TokenizerWithOffsets.

Q: How does this integrate with tf.transform/hub?
Since the tokenization classes are backed by tf ops, they can be included as part of a tf.hub module or tf.transform graph as originally designed.

Q: How does this fit with other parts of the TF ecosystem, like Keras preprocessing layers and tf.data?
We originally planned on later adding a Keras tokenization layer that used the tokenizer, but we could create tokenization preprocessing layers that take a Tokenizer object and internally tokenize. This option requires adding two functions for the tokenizer object to take part in Keras’ serialization.
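
A hypothetical sketch of the preprocessing-layer option discussed in that answer: a Keras layer that wraps a Tokenizer object, where the `get_config`/`from_config` pair corresponds to the "two functions" needed for Keras serialization. The layer name and the assumption that the wrapped tokenizer is itself serializable are editorial assumptions, not part of the RFC.

```python
import tensorflow as tf

class TokenizationLayer(tf.keras.layers.Layer):
  """Illustrative layer that delegates tokenization to a wrapped Tokenizer."""

  def __init__(self, tokenizer, **kwargs):
    super().__init__(**kwargs)
    self._tokenizer = tokenizer

  def call(self, inputs):
    # Delegates to the wrapped Tokenizer; the output is typically ragged.
    return self._tokenizer.tokenize(inputs)

  def get_config(self):
    config = super().get_config()
    # Assumes the wrapped tokenizer provides its own get_config()/from_config().
    config["tokenizer"] = tf.keras.utils.serialize_keras_object(self._tokenizer)
    return config

  @classmethod
  def from_config(cls, config):
    tokenizer = tf.keras.utils.deserialize_keras_object(config.pop("tokenizer"))
    return cls(tokenizer, **config)
```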

Q: Initial tokenizers are basic, how configurable are they?
The initial tokenizer implementations are pretty basic and don't need much configuration; wordpiece allows some configuration.

General feedback:
Current API is good, simple, easy to use
Makes TF much better for text use cases.

We should think more on the implications and how this fits/plays with other components in the TF ecosystem, particularly preprocessing layers and tf.data, and where this would go in the TF ecosystem.

Next steps (AIs):

Sync up and think more about how Tokenizer objects can fit in Keras preprocessing layers and tf.data.
Discuss over email, fix up the RFC with the conclusion, and then it is good for approval.

@ewilderj
Contributor

If this document has been reviewed and we're moving forward, please change the status to "Accepted", and I will merge this. Thank you!

@broken broken changed the title RFC: Tokenization API Conventions & Initial Implementations RFC: Tokenization API & Initial Implementations May 30, 2019
@broken
Member Author

broken commented May 30, 2019

I updated the status. Thanks Edd.


Because of the simplified nature of the Keras tokenization and that the
tokenizer API described above is to be included in a TensorFlow library outside
of core, these tokenizers will not be used from within the Keras preprocessing
Contributor

Will the keras preprocessing layers take an optional tokenizer as a constructor argument, to allow for customization?

Member Author

No. I asked about that, but the response was that Keras wanted to keep their preprocessing layer as simple as possible, and any advanced tokenization should happen outside of the default layers.

@ewilderj ewilderj merged commit a6d1b63 into tensorflow:master Oct 11, 2019
@ewilderj ewilderj added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Oct 11, 2019