RFC: Tokenization API & Initial Implementations #98
Conversation
Status        | Proposed
:------------ | :-----------------------------------
**Author(s)** | Robby Neale (Google)
**Sponsor**   | Mark Omernick (Google), Greg Billock (Google)
**Updated**   | 2019-04-18

**Objective**

Establish a set of principles to follow that will unify TensorFlow tokenization offerings, and apply these conventions to APIs for a set of op-level tokenizers.
APIs of tokenization implementations following the above conventions.

```python
def xyz_tokenize(input, **kwargs):
```
Why a function instead of an object?
I'd expect tokenizers to implement an interface like
```python
class Tokenizer:
  def tokenize(input, encoding=None): ...
  def tokenize_with_offsets(input, encoding=None): ...
```
where any additional kwargs and tokenizer-specific configuration options are passed to the constructor, so it's easy to pass these objects around without all call sites having to also know what options to pass to tokenizers.
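For illustration, here is a minimal, hypothetical sketch (not part of the proposal) of that object form: tokenizer-specific options are captured by the constructor, so generic code only depends on the shared `tokenize()` call.

```python
class ToyLowercasingTokenizer:
  """Toy pure-Python stand-in for an op-backed tokenizer."""

  def __init__(self, lowercase=True):
    self._lowercase = lowercase  # configuration lives with the object

  def tokenize(self, input, encoding=None):
    text = input.lower() if self._lowercase else input
    return text.split()


def tokenize_corpus(texts, tokenizer):
  # Generic code that works with any object exposing .tokenize(); it never
  # needs to know which options the tokenizer was configured with.
  return [tokenizer.tokenize(t) for t in texts]


print(tokenize_corpus(["Hello World"], ToyLowercasingTokenizer()))
# [['hello', 'world']]
```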
Yeah; a valid question. Two main reasons:
- Many tokenizers beyond what is proposed here require a trained model and are not fit to be surfaced as ops. The expectation is that these conventions would apply not just to ops, but also to TF Hub modules and TF Serving models.
- While a majority of tokenizers could fit the interface, with a constructor for initialization, I did not want to limit possible future methods. For example, a multilingual tokenizer might allow a `language_hint` argument at tokenization time, or you may want to pass different lookup tables to a wordpiece tokenizer depending on the language (see the sketch of such signatures below).
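As a purely hypothetical illustration of that second point (these signatures are not in the RFC), op-level functions leave room for per-call arguments that a fixed base interface might not anticipate:

```python
def multilingual_tokenize(input, language_hint=None):
  """Hypothetical: tokenize `input`, optionally biased by `language_hint`."""
  ...


def wordpiece_tokenize(input, vocab_lookup_table, suffix_indicator="##"):
  """Hypothetical: tokenize `input` into wordpieces using a caller-supplied
  lookup table, which could differ per language or per model."""
  ...
```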
I am just bothered that if you're using these tokenizers in eager mode you'll need to keep passing initialization arguments again and again on every iteration. You can always make a free function a method by using a wrapper empty class, but you cannot in general go the other way without wasting work.
I have the same concern even in graph mode, where one may want to tokenize string inputs in multiple ops using essentially the same tokenizer. If a tokenizer depends on expensive initialization or external file resources (e.g. a lookup table or a lexicon), we would end up initializing, loading resources, and allocating memory O(# ops) times instead of O(1).
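A toy, self-contained sketch of that cost concern (all names here are hypothetical stand-ins, not proposed APIs):

```python
def load_vocab(path=None):
  # Stand-in for an expensive step such as reading a lexicon from disk.
  print("loading vocab...")
  return {"hello": 1, "world": 2}


def lookup_tokenize(text, vocab):
  # Free-function form: the caller must supply the vocab on every call.
  return [vocab.get(t, 0) for t in text.split()]


class LookupTokenizer:
  def __init__(self, path=None):
    self._vocab = load_vocab(path)  # expensive setup paid once: O(1)

  def tokenize(self, text):
    return [self._vocab.get(t, 0) for t in text.split()]


batches = ["hello world", "world hello"]

# Naive function-form loop: the vocab is reloaded O(#calls) times.
for b in batches:
  lookup_tokenize(b, load_vocab())

# Object form: the vocab is loaded once and reused for every batch.
tokenizer = LookupTokenizer()
for b in batches:
  tokenizer.tokenize(b)
```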
I've been discussing this subject offline as well and will update the RFC to include an interface.
scalar, in which case the output will be a vector Tensor.
1. Tokenizers should also provide a `_with_offsets` variant that includes start
   and limit offsets with the output which will allow alignment of the tokens
   to the original byte offsets of the strings they are tokenized from.
From the examples I can see that the output token is not required to be byte-equal to the original input substring. Perhaps it is worth making this explicit here (since not all users of tokenizers may know about NFC normalization, etc.).
Agreed. I'll include this with the interface update.
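To make the point concrete, here is a small sketch with made-up values: the token text is case-folded, so it is not byte-equal to the input, but the start/limit byte offsets still let callers recover the original spans.

```python
text = "Héllo World".encode("utf-8")

tokens = ["héllo", "world"]  # normalized (case-folded) token text
starts = [0, 7]              # byte offsets into `text`; "é" takes 2 bytes in UTF-8
limits = [6, 12]

original_spans = [text[s:l] for s, l in zip(starts, limits)]
print(original_spans)  # [b'H\xc3\xa9llo', b'World']
# b'H\xc3\xa9llo'.decode("utf-8") == "Héllo", which differs from the token "héllo".
```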
might implement following the same set of guidelines.

```python
def my_tokenize(input):
```
How serious are we with regard to "composability"? The current interface can be quite tricky and burdensome for implementing custom tokenizers using existing tokenizers, especially if offset tracking is required ("Tokenizers should also provide a `_with_offsets` variant" in "Conventions").
For example, consider the extra work needed for implementing `my_tokenize_with_offsets()`:
- We need to know what input bytes each output code point maps to in every single string rewriting operation, e.g. `case_fold_utf8()`, `regex_replace()`.
- For each string splitting operation, we need to translate offsets relative to the splitting input into offsets relative to the original input (a toy sketch of this bookkeeping follows below).
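For concreteness, a toy pure-Python sketch of that offset bookkeeping (the helper names are made up, not part of the proposal):

```python
def split_with_offsets(text, sep=";"):
  # Split `text` and record the start offset of each piece in `text`.
  pieces, piece_starts, pos = [], [], 0
  for piece in text.split(sep):
    start = text.index(piece, pos)
    pieces.append(piece)
    piece_starts.append(start)
    pos = start + len(piece)
  return pieces, piece_starts


def whitespace_tokenize_with_offsets(piece):
  # Tokenize one piece, with offsets relative to that piece.
  tokens, starts, limits, pos = [], [], [], 0
  for tok in piece.split():
    s = piece.index(tok, pos)
    tokens.append(tok)
    starts.append(s)
    limits.append(s + len(tok))
    pos = s + len(tok)
  return tokens, starts, limits


def my_tokenize_with_offsets(text):
  all_tokens, all_starts, all_limits = [], [], []
  pieces, piece_starts = split_with_offsets(text)
  for piece, base in zip(pieces, piece_starts):
    toks, starts, limits = whitespace_tokenize_with_offsets(piece)
    all_tokens += toks
    all_starts += [base + s for s in starts]    # shift into original coordinates
    all_limits += [base + l for l in limits]
  return all_tokens, all_starts, all_limits


print(my_tokenize_with_offsets("a b; c d"))
# (['a', 'b', 'c', 'd'], [0, 2, 5, 7], [1, 3, 6, 8])
```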
Yeah; I was trying to be careful with the use of "should", in that it is a best practice but could be left out if it proved too tricky.
I am updating the RFC to include two interfaces, which will make this more explicit:
```python
class Tokenizer:
  def tokenize(input): ...

class TokenizerWithOffsets(Tokenizer):
  def tokenize_with_offsets(input): ...
```
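A self-contained, hypothetical sketch of how those two interfaces and downstream code could interact (the class names mirror the comment above; the helper function is illustrative only):

```python
import abc


class Tokenizer(abc.ABC):
  @abc.abstractmethod
  def tokenize(self, input):
    raise NotImplementedError


class TokenizerWithOffsets(Tokenizer):
  @abc.abstractmethod
  def tokenize_with_offsets(self, input):
    raise NotImplementedError


def tokens_and_maybe_offsets(text, tokenizer):
  # Downstream code can branch on the narrower interface when it needs
  # alignment information, and fall back to plain tokens otherwise.
  if isinstance(tokenizer, TokenizerWithOffsets):
    return tokenizer.tokenize_with_offsets(text)
  return tokenizer.tokenize(text), None, None
```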
Pushed a new version which moves from the function/conventions approach to an interface (base class) with concrete `Tokenizer` classes.
Notes from design review:

Q: Were there any major last-minute changes to the doc?
Q: How does this integrate with tf.transform/hub?
Q: How does this fit with other parts of the TF ecosystem, like Keras preprocessing layers and tf.data?
Q: The initial tokenizers are basic; how configurable are they?

General feedback: We should think more about the implications and how this fits and plays with other components in the TF ecosystem, particularly preprocessing layers and tf.data, and where this would go in the TF ecosystem.

Next steps (AIs): Sync up and think more about how Tokenizer objects can fit in Keras preprocessing layers and tf.data.
If this document has been reviewed and we're moving forward, please change the status to "Accepted", and I will merge this. Thank you!
I updated the status. Thanks Edd.
Because of the simplified nature of the Keras tokenization and that the
tokenizer API described above is to be included in a TensorFlow library outside
of core, these tokenizers will not be used from within the Keras preprocessing
Will the Keras preprocessing layers take an optional tokenizer as a constructor argument, to allow for customization?
No. I asked about that, but the response was that Keras wanted to keep their preprocessing layers as simple as possible, and that any advanced tokenization should happen outside of the default layers.
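A sketch of that division of labor, using the standard tf.data pattern with `tf.strings.split` as a stand-in for one of the proposed tokenizers (this pipeline is illustrative, not something the RFC prescribes):

```python
import tensorflow as tf

texts = tf.constant(["hello world", "tokenize outside keras"])

# Tokenization happens in the input pipeline, before the (simple) Keras layers.
dataset = tf.data.Dataset.from_tensor_slices(texts)
dataset = dataset.map(tf.strings.split)  # a tokenizer.tokenize() call could go here

for tokens in dataset:
  print(tokens.numpy())
# [b'hello' b'world']
# [b'tokenize' b'outside' b'keras']
```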
This review is open for public comment until 2019-05-15
**Objective**
Establish a base `Tokenizer` class for TensorFlow tokenizers to extend, and create an initial set of tokenization offerings which implement this API.