RFC: Tokenization API & Initial Implementations #98
Conversation
Status        | Proposed
:------------ | :-----------------------------------
**Author(s)** | Robby Neale (Google)
**Sponsor**   | Mark Omernick (Google), Greg Billock (Google)
**Updated**   | 2019-04-18

**Objective**

Establish a set of principles to follow that will unify TensorFlow tokenization offerings, and apply these conventions to APIs for a set of op-level tokenizers.
APIs of tokenization implementations following the above conventions.

```python
def xyz_tokenize(input, **kwargs):
```
Why a function instead of an object?
I'd expect tokenizers to implement an interface like
```python
class Tokenizer:
  def tokenize(input, encoding=None): ...
  def tokenize_with_offsets(input, encoding=None): ...
```
where any additional kwargs and tokenizer-specific configuration options are passed to the constructor, so it's easy to pass these objects around without all call sites having to also know what options to pass to tokenizers.
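For illustration, here is a minimal, hypothetical sketch (not part of the proposal) of that object form: tokenizer-specific options are captured by the constructor, so generic code only depends on the shared `tokenize()` call.

```python
class ToyLowercasingTokenizer:
  """Toy pure-Python stand-in for an op-backed tokenizer."""

  def __init__(self, lowercase=True):
    self._lowercase = lowercase  # configuration lives with the object

  def tokenize(self, input, encoding=None):
    text = input.lower() if self._lowercase else input
    return text.split()


def tokenize_corpus(texts, tokenizer):
  # Generic code that works with any object exposing .tokenize(); it never
  # needs to know which options the tokenizer was configured with.
  return [tokenizer.tokenize(t) for t in texts]


print(tokenize_corpus(["Hello World"], ToyLowercasingTokenizer()))
# [['hello', 'world']]
```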
Yeah; a valid question. Two main reasons:
- Many tokenizers beyond what is proposed here require a trained model and are not fit to be surfaced as ops. The expectation is that these conventions would apply not just to ops, but also to TF Hub modules and TF Serving models.
- While a majority of tokenizers could fit the interface, with a constructor for initialization, I did not want to limit possible future methods. For example, a multilingual tokenizer might allow a `language_hint` argument at tokenization time, or you may want to pass different lookup tables to a wordpiece tokenizer depending on the language (see the sketch of such signatures below).
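As a purely hypothetical illustration of that second point (these signatures are not in the RFC), op-level functions leave room for per-call arguments that a fixed base interface might not anticipate:

```python
def multilingual_tokenize(input, language_hint=None):
  """Hypothetical: tokenize `input`, optionally biased by `language_hint`."""
  ...


def wordpiece_tokenize(input, vocab_lookup_table, suffix_indicator="##"):
  """Hypothetical: tokenize `input` into wordpieces using a caller-supplied
  lookup table, which could differ per language or per model."""
  ...
```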
I am just bothered that if you're using these tokenizers in eager mode you'll need to keep passing initialization arguments again and again on every iteration. You can always make a free function a method by using a wrapper empty class, but you cannot in general go the other way without wasting work.
I have the same concern even in graph mode, where one may want to tokenize string inputs in multiple ops using essentially the same tokenizer. If a tokenizer depends on expensive initialization or external file resources (e.g. a lookup table or a lexicon), we would end up initializing, loading resources, and allocating memory O(# ops) times instead of O(1).
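A toy, self-contained sketch of that cost concern (all names here are hypothetical stand-ins, not proposed APIs):

```python
def load_vocab(path=None):
  # Stand-in for an expensive step such as reading a lexicon from disk.
  print("loading vocab...")
  return {"hello": 1, "world": 2}


def lookup_tokenize(text, vocab):
  # Free-function form: the caller must supply the vocab on every call.
  return [vocab.get(t, 0) for t in text.split()]


class LookupTokenizer:
  def __init__(self, path=None):
    self._vocab = load_vocab(path)  # expensive setup paid once: O(1)

  def tokenize(self, text):
    return [self._vocab.get(t, 0) for t in text.split()]


batches = ["hello world", "world hello"]

# Naive function-form loop: the vocab is reloaded O(#calls) times.
for b in batches:
  lookup_tokenize(b, load_vocab())

# Object form: the vocab is loaded once and reused for every batch.
tokenizer = LookupTokenizer()
for b in batches:
  tokenizer.tokenize(b)
```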
I've been discussing this subject offline as well and will update the RFC to include an interface.
scalar, in which case the output will be a vector Tensor.
1. Tokenizers should also provide a `_with_offsets` variant that includes start
   and limit offsets with the output which will allow alignment of the tokens
   to the original byte offsets of the strings they are tokenized from.
From the examples I can see that the output token is not required to be byte-equal to the original input substring. Perhaps it is worth making this explicit here (since not all users of tokenizers may know about NFC normalization, etc.).
Agreed. I'll include this with the interface update.
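To make the point concrete, here is a small sketch with made-up values: the token text is case-folded, so it is not byte-equal to the input, but the start/limit byte offsets still let callers recover the original spans.

```python
text = "Héllo World".encode("utf-8")

tokens = ["héllo", "world"]  # normalized (case-folded) token text
starts = [0, 7]              # byte offsets into `text`; "é" takes 2 bytes in UTF-8
limits = [6, 12]

original_spans = [text[s:l] for s, l in zip(starts, limits)]
print(original_spans)  # [b'H\xc3\xa9llo', b'World']
# b'H\xc3\xa9llo'.decode("utf-8") == "Héllo", which differs from the token "héllo".
```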
might implement following the same set of guidelines.

```python
def my_tokenize(input):
```
How serious are we with regard to "composability"? The current interface can be quite tricky and burdensome for implementing custom tokenizers using existing tokenizers, especially if offset tracking is required ("Tokenizers should also provide a `_with_offsets` variant" in "Conventions").
For example, consider the extra work needed for implementing `my_tokenize_with_offsets()`:
- We need to know what input bytes each output code point maps to in every single string rewriting operation, e.g. `case_fold_utf8()`, `regex_replace()`.
- For each string splitting operation, we need to translate offsets relative to the splitting input into offsets relative to the original input (a toy sketch of this bookkeeping follows below).
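For concreteness, a toy pure-Python sketch of that offset bookkeeping (the helper names are made up, not part of the proposal):

```python
def split_with_offsets(text, sep=";"):
  # Split `text` and record the start offset of each piece in `text`.
  pieces, piece_starts, pos = [], [], 0
  for piece in text.split(sep):
    start = text.index(piece, pos)
    pieces.append(piece)
    piece_starts.append(start)
    pos = start + len(piece)
  return pieces, piece_starts


def whitespace_tokenize_with_offsets(piece):
  # Tokenize one piece, with offsets relative to that piece.
  tokens, starts, limits, pos = [], [], [], 0
  for tok in piece.split():
    s = piece.index(tok, pos)
    tokens.append(tok)
    starts.append(s)
    limits.append(s + len(tok))
    pos = s + len(tok)
  return tokens, starts, limits


def my_tokenize_with_offsets(text):
  all_tokens, all_starts, all_limits = [], [], []
  pieces, piece_starts = split_with_offsets(text)
  for piece, base in zip(pieces, piece_starts):
    toks, starts, limits = whitespace_tokenize_with_offsets(piece)
    all_tokens += toks
    all_starts += [base + s for s in starts]    # shift into original coordinates
    all_limits += [base + l for l in limits]
  return all_tokens, all_starts, all_limits


print(my_tokenize_with_offsets("a b; c d"))
# (['a', 'b', 'c', 'd'], [0, 2, 5, 7], [1, 3, 6, 8])
```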
Yeah; I was trying to be careful with the use of "should", in that it is a best practice but could be left out if it proved too tricky.
I am updating the RFC to include two interfaces, which will make this more explicit:
```python
class Tokenizer:
  def tokenize(input): ...

class TokenizerWithOffsets(Tokenizer):
  def tokenize_with_offsets(input): ...
```
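A self-contained, hypothetical sketch of how those two interfaces and downstream code could interact (the class names mirror the comment above; the helper function is illustrative only):

```python
import abc


class Tokenizer(abc.ABC):
  @abc.abstractmethod
  def tokenize(self, input):
    raise NotImplementedError


class TokenizerWithOffsets(Tokenizer):
  @abc.abstractmethod
  def tokenize_with_offsets(self, input):
    raise NotImplementedError


def tokens_and_maybe_offsets(text, tokenizer):
  # Downstream code can branch on the narrower interface when it needs
  # alignment information, and fall back to plain tokens otherwise.
  if isinstance(tokenizer, TokenizerWithOffsets):
    return tokenizer.tokenize_with_offsets(text)
  return tokenizer.tokenize(text), None, None
```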
Pushed a new version which moves from the function/conventions approach to an interface (base class) with concrete `Tokenizer` classes.
Notes from design review:

Q: Were there any major last-minute changes to the doc?
Q: How does this integrate with tf.transform/hub?
Q: How does this fit with other parts of the TF ecosystem, like Keras preprocessing layers and tf.data?
Q: The initial tokenizers are basic; how configurable are they?

General feedback: We should think more about the implications and how this fits and plays with other components in the TF ecosystem, particularly preprocessing layers and tf.data, and where this would go in the TF ecosystem.

Next steps (AIs): Sync up and think more about how Tokenizer objects can fit in Keras preprocessing layers and tf.data.
If this document has been reviewed and we're moving forward, please change the status to "Accepted", and I will merge this. Thank you!
I updated the status. Thanks Edd.
Because of the simplified nature of the Keras tokenization and that the
tokenizer API described above is to be included in a TensorFlow library outside
of core, these tokenizers will not be used from within the Keras preprocessing
Will the Keras preprocessing layers take an optional tokenizer as a constructor argument, to allow for customization?
No. I asked about that, but the response was that Keras wanted to keep their preprocessing layers as simple as possible, and that any advanced tokenization should happen outside of the default layers.
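A sketch of that division of labor, using the standard tf.data pattern with `tf.strings.split` as a stand-in for one of the proposed tokenizers (this pipeline is illustrative, not something the RFC prescribes):

```python
import tensorflow as tf

texts = tf.constant(["hello world", "tokenize outside keras"])

# Tokenization happens in the input pipeline, before the (simple) Keras layers.
dataset = tf.data.Dataset.from_tensor_slices(texts)
dataset = dataset.map(tf.strings.split)  # a tokenizer.tokenize() call could go here

for tokens in dataset:
  print(tokens.numpy())
# [b'hello' b'world']
# [b'tokenize' b'outside' b'keras']
```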
This review is open for public comment until 2019-05-15
**Objective**
Establish a base `Tokenizer` class for TensorFlow tokenizers to extend, and create an initial set of tokenization offerings which implement this API.