Move `from_preset` to base tokenizer classes #673

kanpuriyanawab · 2023-01-17T16:25:05Z

Closes #648

kanpuriyanawab · 2023-01-17T16:25:36Z

@jbischof Please review.

mattdangerw

Thank you!!

The actual runnable code changes look good, just one minor comment.

We will likely need to do something fancier for docstrings though. Will think through this a bit and post more here.

mattdangerw · 2023-01-17T20:10:39Z

keras_nlp/tokenizers/byte_pair_tokenizer.py

            tokenized_words, axis=1, separator=" "
        )
        self.cache.insert(tokens, tokenized_words)
+


Each of these classes should probably have a preset property defined now, one that is empty. E.g.

https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/backbone.py#L33-L35
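A minimal sketch of what an empty preset property on a base class could look like. The `classproperty` helper, the `BaseTokenizer` and `ToyModelTokenizer` classes, the property name `presets`, and the preset contents are all hypothetical stand-ins written for illustration; they are not copied from the linked file.

```python
class classproperty(property):
    """Stand-in helper: a read-only property evaluated on the class itself."""

    def __get__(self, instance, owner):
        # Call the getter with the class, so no instance is required.
        return self.fget(owner)


class BaseTokenizer:  # hypothetical stand-in for a tokenizer base class
    @classproperty
    def presets(cls):
        # The base class ships no presets; subclasses override this.
        return {}


class ToyModelTokenizer(BaseTokenizer):  # hypothetical model-specific subclass
    @classproperty
    def presets(cls):
        # Illustrative preset data only.
        return {"toy_base_en": {"vocabulary": ["[UNK]", "the", "quick"]}}
```

With this shape, `BaseTokenizer.presets` is an empty dict and each model-specific tokenizer only has to supply its own mapping.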

jbischof

Thanks! Mostly documentation changes.

jbischof · 2023-01-17T23:20:05Z

keras_nlp/tokenizers/byte_pair_tokenizer.py

+        tokenizer.detokenize([5, 6, 7, 8, 9])
+        ```
+        """
+


After adding the preset property, check whether it is empty, as we do for the backbone (link). Same for the other two tokenizers.
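The emptiness check could be sketched roughly as follows. This is an illustration under assumptions: the class names, the plain-dict `presets` attribute, the exception types, and the error messages are hypothetical and may differ from what the backbone code actually does.

```python
class BaseTokenizer:  # hypothetical stand-in for a tokenizer base class
    presets = {}  # plain class attribute, used here for brevity

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    @classmethod
    def from_preset(cls, preset, **kwargs):
        # Guard 1: the class defines no presets at all (e.g. the base class).
        if not cls.presets:
            raise NotImplementedError(
                "No presets have been created for this class."
            )
        # Guard 2: presets exist, but the requested name is unknown.
        if preset not in cls.presets:
            raise ValueError(
                f"`preset` must be one of {', '.join(cls.presets)}. "
                f"Received: {preset}."
            )
        # Preset config first, so caller kwargs can override it.
        return cls(**{**cls.presets[preset], **kwargs})


class ToyTokenizer(BaseTokenizer):  # hypothetical subclass with one preset
    presets = {"toy_base_en": {"vocabulary": ["[UNK]", "the", "quick"]}}
```

Calling `BaseTokenizer.from_preset(...)` then fails early with a clear message instead of a `KeyError`, while `ToyTokenizer.from_preset("toy_base_en")` builds a tokenizer.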

jbischof · 2023-01-17T23:39:34Z

keras_nlp/tokenizers/byte_pair_tokenizer.py

+        preset,
+        **kwargs,
+    ):
+        """Instantiate a GPT-2 tokenizer from preset vocabulary and merge rules.


Please remove the GPT-2 language. This should be a generic docstring for BPE. Same for the other two.


mattdangerw

Left some comments re how we can handle the docstrings here.

Also, we are a bit of a moving target here (lots of changes this week!), but you can also mirror these changes for the albert and f_net models. Thank you!

mattdangerw · 2023-01-19T00:10:31Z

keras_nlp/tokenizers/byte_pair_tokenizer.py

+        preset,
+        **kwargs,
+    ):
+        """Instantiate a GPT-2 tokenizer from preset vocabulary and merge rules.


We can actually switch this to a templatized version like this: https://github.com/keras-team/keras-nlp/blob/c9e5040bf7646da471bf9cec2177be2398162568/keras_nlp/models/backbone.py#L45-L65
Make sure not to copy that verbatim. We should keep the language from this docstring, but update it to use the format variables {{model_name}}, {{preset_names}}, and {{example_preset_name}}.

To get that working, you will also need to copy the __init_subclass__ method we use for our Backbone and Task classes, but you should be able to copy that almost exactly (just update Backbone -> BytePairTokenizer). https://github.com/keras-team/keras-nlp/blob/c9e5040bf7646da471bf9cec2177be2398162568/keras_nlp/models/backbone.py#L94-L114

We should make similar changes to the other tokenizer base classes.
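The templated docstring plus `__init_subclass__` idea can be sketched in a self-contained way. Everything here is an illustrative assumption rather than the code at the linked lines: `fill_template` is a hypothetical helper standing in for the library's docstring formatting, and `BaseTokenizer`, `ToyTokenizer`, and the preset data are made up.

```python
def fill_template(doc, **variables):
    """Hypothetical helper: replace each {{name}} marker with its value."""
    for name, value in variables.items():
        doc = doc.replace("{{" + name + "}}", value)
    return doc


class BaseTokenizer:  # hypothetical stand-in for a tokenizer base class
    presets = {}

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    @classmethod
    def from_preset(cls, preset, **kwargs):
        """Instantiate {{model_name}} tokenizer from preset vocabulary.

        Args:
            preset: string. Must be one of "{{preset_names}}".

        Example:
            tokenizer = {{model_name}}.from_preset("{{example_preset_name}}")
        """
        return cls(**{**cls.presets[preset], **kwargs})

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Give every subclass its own wrapper function, so that each one
        # can carry a distinct docstring.
        if "from_preset" not in cls.__dict__:

            def from_preset(calling_cls, *args, **inner_kwargs):
                return super(cls, calling_cls).from_preset(
                    *args, **inner_kwargs
                )

            cls.from_preset = classmethod(from_preset)

        # Fill in the template unless the subclass wrote its own docstring.
        if cls.from_preset.__doc__ is None:
            cls.from_preset.__func__.__doc__ = fill_template(
                BaseTokenizer.from_preset.__doc__,
                model_name=cls.__name__,
                preset_names='", "'.join(cls.presets),
                example_preset_name=next(iter(cls.presets), ""),
            )


class ToyTokenizer(BaseTokenizer):  # hypothetical model-specific tokenizer
    presets = {"toy_base_en": {"vocabulary": ["[UNK]", "the", "quick"]}}
```

In this sketch a subclass only declares `presets`; it inherits a working `from_preset` whose docstring names the subclass and its preset names, while the base class keeps the unformatted template.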

mattdangerw · 2023-01-19T00:11:43Z

keras_nlp/models/bert/bert_tokenizer.py


-    @classmethod
-    @format_docstring(names=PRESET_NAMES)
    def from_preset(


After following the changes below re docstrings and __init_subclass__, you should be able to remove the from_preset method here and elsewhere entirely!


mattdangerw · 2023-01-19T00:17:49Z

Looks like you also have some formatting issues on this PR. Check out the "Tests / Check the code format (pull_request)" check above!

jbischof

Good progress! Let us know if you get stuck. You need to run the format.sh script before every commit, so let us know if you're having trouble there.

jbischof · 2023-01-22T15:28:47Z

keras_nlp/models/bert/bert_tokenizer.py


-    @classmethod
-    @format_docstring(names=PRESET_NAMES)
    def from_preset(


@shivance, we don't need from_preset in subclasses anymore! See, for example, BertPreprocessor

jbischof · 2023-01-22T15:29:55Z

keras_nlp/models/xlm_roberta/xlm_roberta_tokenizer.py

-        )
-
-        return cls.from_config({**config, **kwargs})
+        return super().from_preset(cls, preset, **kwargs)


You need a newline at the end of each file. Are you still having issues with the format.sh script?

Hi @jbischof, I'm still addressing the comments, so work is pending.
I tend to run the formatting scripts after finishing the changes for each round of review.

Should I continue with this, or run it before every commit?

Thanks.

Sorry, didn't understand this was still WIP! Up to you on how you organize your commits 😄

kanpuriyanawab · 2023-01-22T18:10:15Z

@jbischof It's ready for review now 😄

mattdangerw

LGTM! Thank you!

Just found a few small nits that need fixing.

mattdangerw · 2023-01-23T20:55:23Z

keras_nlp/models/albert/albert_tokenizer.py

        self.sep_token_id = self.token_to_id(sep_token)
        self.pad_token_id = self.token_to_id(pad_token)

    @classproperty


We will need some changes to the class-level docstrings for our model-specific tokenizers: we should document from_preset usage front and center in the code examples above. I think that would best be done as a follow-up anyway, so I just opened #688.

mattdangerw · 2023-01-23T21:02:25Z

keras_nlp/tokenizers/byte_pair_tokenizer.py

+        """Instantiate {{model_name}} tokenizer from preset vocabulary.
+
+        Args:
+            preset: string. Must be one of {{preset_names}}.


We actually need {{preset_names}} surrounded by quotes for the docstring to render correctly. See https://github.com/keras-team/keras-nlp/blob/3cfdeb6bb1eeacd755a880f1674bf8b9d765aa43/keras_nlp/models/backbone.py#L58
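A small illustration of why the surrounding quotes matter, assuming the preset names are joined with `", "` before being substituted into the template (an assumption about the formatting step; the preset names below are made up):

```python
preset_names = ["toy_base_en", "toy_large_en"]  # illustrative names only

# Assumed joiner: a closing quote, a comma, and an opening quote.
joined = '", "'.join(preset_names)

unquoted = "Must be one of {{preset_names}}.".replace(
    "{{preset_names}}", joined
)
quoted = 'Must be one of "{{preset_names}}".'.replace(
    "{{preset_names}}", joined
)

print(unquoted)  # Must be one of toy_base_en", "toy_large_en.
print(quoted)    # Must be one of "toy_base_en", "toy_large_en".
```

Under that assumption, the template has to supply the outermost pair of quotes itself, otherwise the first and last names render with unbalanced quotes.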

mattdangerw · 2023-01-23T21:02:40Z

keras_nlp/tokenizers/sentence_piece_tokenizer.py

+        """Instantiate {{model_name}} tokenizer from preset vocabulary.
+
+        Args:
+            preset: string. Must be one of {{preset_names}}.


Surround with quotes.

mattdangerw · 2023-01-23T21:03:10Z

keras_nlp/tokenizers/word_piece_tokenizer.py

+        """Instantiate {{model_name}} tokenizer from preset vocabulary.
+
+        Args:
+            preset: string. Must be one of {{preset_names}}.


Surround with quotes.

jbischof

Looks good in general, but please follow @mattdangerw's suggestions and fix the formatting.

mattdangerw · 2023-01-24T21:50:18Z

Actually, since my comments add up to just a couple of lines of changes, I can just make these as I merge this. Thanks very much for the contribution!

kanpuriyanawab added 2 commits January 17, 2023 21:44

moving from_preset to base tokenizer classes

c06c006

formatting

fc38f4b

mattdangerw self-requested a review January 17, 2023 19:59

mattdangerw requested changes Jan 17, 2023

jbischof suggested changes Jan 17, 2023

mattdangerw requested changes Jan 19, 2023

Merge branch 'master' into issue648

c149950

jbischof reviewed Jan 22, 2023

kanpuriyanawab added 4 commits January 22, 2023 23:24

incorporating suggested changes

025da95

incoming + updated

2b2d907

minor edit

774da5a

formatting

1277c6f

kanpuriyanawab requested a review from jbischof January 22, 2023 18:10

mattdangerw approved these changes Jan 23, 2023

Format and docstring fixes

7a518a1

jbischof approved these changes Jan 24, 2023

mattdangerw merged commit 9d19bc5 into keras-team:master Jan 24, 2023

mattdangerw mentioned this pull request Jan 25, 2023

Add BartTokenizer and BART Presets #685

Merged

jbischof mentioned this pull request Jan 31, 2023

Base classes for architecture workhorses #530

Closed

kanpuriyanawab deleted the issue648 branch February 13, 2023 13:56
