
Conversation

@jonatanklosko
Contributor

@jonatanklosko jonatanklosko commented Jan 20, 2023

Adds the fast version of the Whisper tokenizer. The Whisper tokenizer is essentially the GPT-2 tokenizer with special tokens. The main differences are the additional normalizer (which I mirrored from the slow tokenizer) and the language/task-dependent prefix tokens.
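For intuition, the prefix-token behaviour can be sketched with the tokenizers library's TemplateProcessing on a toy vocabulary. The special-token strings mirror Whisper's, but the tiny WordLevel vocab and the ids below are made up for illustration:

```python
from tokenizers import Tokenizer, processors
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary; ids are arbitrary, only the special-token strings mirror Whisper's.
vocab = {
    "hello": 0, "world": 1, "[UNK]": 2,
    "<|startoftranscript|>": 3, "<|en|>": 4, "<|transcribe|>": 5, "<|endoftext|>": 6,
}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Language/task-dependent prefix tokens plus EOS, applied as a post-processor.
tok.post_processor = processors.TemplateProcessing(
    single="<|startoftranscript|> <|en|> <|transcribe|> $A <|endoftext|>",
    special_tokens=[
        ("<|startoftranscript|>", 3), ("<|en|>", 4),
        ("<|transcribe|>", 5), ("<|endoftext|>", 6),
    ],
)

print(tok.encode("hello world").ids)  # [3, 4, 5, 0, 1, 6]
```

In the real tokenizer the language and task tokens in the template change depending on the requested language and task.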

One of the tokenizer tests is failing because there is no tokenizer.json file in the openai/whisper-* repositories (specifically the tiny checkpoint). I added a converter, so it is now possible to load the fast tokenizer from existing checkpoints and export tokenizer.json.

@sgugger sgugger requested a review from ArthurZucker January 20, 2023 18:55
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jan 20, 2023

The documentation is not available anymore as the PR was closed or merged.

Contributor Author

@jonatanklosko jonatanklosko left a comment

The corresponding audio pipeline test used to be skipped because of the missing fast tokenizer. After adding the tokenizer it started to fail, but after a few changes it works fine now. A couple of related notes inline.

# We adjust the sampling rate, such that the featurizer returns features
# compatible with the model
feature_extractor = feature_extractor.__class__(
sampling_rate=tiny_config.max_source_positions * 2 * 160 // 30, hop_length=160, chunk_length=30
Contributor Author

Happy to learn if there's an easier way to get a featurizer compatible with the given model config!

Collaborator

Not really sure! This works for me
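For what it's worth, the arithmetic in that expression checks out: the featurizer yields chunk_length * sampling_rate / hop_length mel frames per chunk, Whisper's conv stem halves that, and the result must equal max_source_positions; solving for the sampling rate gives the line above. A quick stdlib sketch (the helper name is made up):

```python
def compatible_sampling_rate(max_source_positions, hop_length=160, chunk_length=30):
    # frames after the conv stem = (chunk_length * sampling_rate / hop_length) / 2
    # setting that equal to max_source_positions and solving for sampling_rate:
    return max_source_positions * 2 * hop_length // chunk_length

# With the real Whisper config (max_source_positions=1500) this recovers 16 kHz:
print(compatible_sampling_rate(1500))  # 16000
```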

Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks for working on this. There are a few files here and there that should not have been modified; otherwise very neat!

  bos_token_id = self.original_tokenizer.bos_token_id
  tokenizer.post_processor = processors.TemplateProcessing(
-     single=f"{bos}:0 $A:0",  # token_type_id is 2 for Funnel transformer
+     single=f"{bos}:0 $A:0",
Collaborator

Is there a reason why this was modified?

Contributor Author

@jonatanklosko jonatanklosko Jan 23, 2023

I think this comment is a leftover from copying the FunnelConverter: note that this template doesn't have the :2 token type id anywhere, unlike the Funnel one:

single=f"{cls}:2 $A:0 {sep}:0", # token_type_id is 2 for Funnel transformer

(I just noticed this while adding the WhisperConverter based on the GPT-2 one.)

jonatanklosko and others added 3 commits January 23, 2023 14:42
Comment on lines 259 to 262
# TODO: it looks like the '' token is not re-added when retraining
# the tokenizer in tests, and we fall into an infinite recursion
# trying to convert an unknown token to an id
if index is None and token != self.unk_token:
Contributor Author

@jonatanklosko jonatanklosko Jan 23, 2023

I think the issue comes down to this:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
tokenizer._tokenizer.add_special_tokens([''])
print(tokenizer._tokenizer.token_to_id(''))
#=> None

Adding any other token (e.g. <|test|>) works fine, but the empty string doesn't work as a token.

Using tokenizers directly:

from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-cased")
tokenizer.add_special_tokens(['', '<|test|>']) #=> 1
print(tokenizer.token_to_id('<|test|>')) #=> 28996
print(tokenizer.token_to_id('')) #=> None

Contributor Author

@jonatanklosko jonatanklosko Jan 23, 2023

Ok, the actual discrepancy is in how the slow and fast tokenizers handle adding the '' token:

Slow

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer.add_tokens([''])
print(tokenizer.convert_tokens_to_ids(''))
#=> 28996

Fast

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
tokenizer.add_tokens([''])
print(tokenizer.convert_tokens_to_ids(''))
#=> 100 (unknown)

Collaborator

Yep, exactly. Which is why in #21250 I set any unknown token to "". This was such a headache.
In the whisper-large version, I added "" to the list of words in the vocab and set unk_token = "", which gave the correct ID, but it is a bit confusing.

Contributor Author

How should we mirror this change in the fast version?

Contributor Author

The failure path is all_special_ids -> convert_tokens_to_ids(all_special_tokens) -> _convert_token_to_id_with_added_voc('') -> _tokenizer.token_to_id('') -> None

Contributor Author

If you want to reproduce locally, remove that check and run pytest tests/pipelines/test_pipelines_automatic_speech_recognition.py -k 'test_pt_WhisperConfig_WhisperForConditionalGeneration_WhisperTokenizer_WhisperFeatureExtractor' :)

Collaborator

Yep, I think we should remove the unk_token (the way it is defined) from the multilingual models. They should behave the same way as whisper-tiny.en

Contributor Author

Oh, that would be ideal, but won't this result in token ids shifting if '' is missing?

Collaborator

No it shouldn't, we will leave '' in the vocab but just set unk_token = <|endoftext|>

Contributor Author

Ah, it is in the vocab too, not just the special tokens, perfect!

@jonatanklosko
Contributor Author

@ArthurZucker thanks for the help! I think the next steps are to update the unknown token in the multilingual checkpoints and add tokenizer.json to the repos. Let me know if there's anything I can help with :)

@ArthurZucker
Collaborator

Feel free to open community PRs on the models (hub) linking to this PR (GitHub) 🚀

@jonatanklosko
Contributor Author

@ArthurZucker sure! I've just created https://huggingface.co/openai/whisper-tiny/discussions/5, let me know if it looks as expected and I will open a matching PR on the other checkpoints too.

FTR I generated the tokenizer.json with:

import sys
sys.path.reverse()
sys.path.append("/Users/jonatanklosko/git/transformers/src")
sys.path.reverse()

from transformers import WhisperTokenizerFast

tokenizer = WhisperTokenizerFast.from_pretrained("/Users/jonatanklosko/git/hf/whisper-tiny/")
tokenizer.save_pretrained("/Users/jonatanklosko/git/hf/whisper-tiny/")

I also updated the unknown token configuration manually.

@jonatanklosko
Contributor Author

Changing the unknown token in configuration leads to a weird behaviour when loading the slow tokenizer, see an example in the PR. Any ideas why that is?

@jonatanklosko
Contributor Author

So the issue is that the multilingual tokenizer doesn't have <|endoftext|> in the initial vocabulary, so it would need to be added from the special tokens map. However, when loading special tokens we have this check:

if (
token != self.unk_token

and since eos_token and unk_token are both <|endoftext|>, we end up not adding them to the vocabulary.

@jonatanklosko
Contributor Author

jonatanklosko commented Jan 23, 2023

To address this we would need to add "<|endoftext|>": 50257 to vocab.json and remove it from added_tokens.json. Note that this is the case in the English checkpoints (except with 50256).

The question is whether this hurts compatibility; when loading the slow tokenizer both of these files are used to load the vocabulary, so moving the entry from one to the other should be alright?
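A stdlib sketch of the move being proposed; the miniature dicts stand in for the real vocab.json and added_tokens.json (which have tens of thousands of entries), and only the <|endoftext|> entry and its id reflect the actual checkpoint:

```python
# Miniature stand-ins for the multilingual checkpoint's vocab.json and
# added_tokens.json; in practice these would be json.load-ed from the files.
vocab = {"!": 0, "a": 1}
added_tokens = {"<|endoftext|>": 50257, "<|startoftranscript|>": 50258}

# Move <|endoftext|> from added_tokens.json into vocab.json, keeping its id
# so no other token ids shift.
vocab["<|endoftext|>"] = added_tokens.pop("<|endoftext|>")

print(vocab["<|endoftext|>"])           # 50257
print("<|endoftext|>" in added_tokens)  # False
```

The slow tokenizer merges both files into one vocabulary on load, which is why moving the entry between them should be transparent.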

@ArthurZucker
Collaborator

Yep, I think the idea is to make the multilingual added tokens match the ones that we have for English. I forgot to mention but yes, we have to add "<|endoftext|>" to the vocabulary instead of ''. This should normally do the trick (along with modifying the content of the unknown token).

@jonatanklosko
Contributor Author

Ah, so we should actually swap them, so that <|endoftext|> gets the id that '' currently has, and we keep '' just to make sure the ids are not shifted at any point?

"<|endoftext|>": 50256,
"": 50257,

and not:

"": 50256,
"<|endoftext|>": 50257,

@jonatanklosko
Contributor Author

@ArthurZucker I updated the PR on the checkpoint. I tried the remaining failing tests locally, pointing the tokenizer to the updated revision, and they passed, so I think we are good on this side.

@jonatanklosko
Contributor Author

Note that the only difference is that originally EOS (<|endoftext|>) was 50257 and now it is 50256, not sure if that's something to worry about.

@jonatanklosko
Contributor Author

The EOS token id appears multiple times in config.json, so we need to adjust it too. Let me know if that's the way to go, or if we should swap them back :)

@ArthurZucker
Collaborator

ArthurZucker commented Jan 24, 2023

Note that the only difference is that originally EOS (<|endoftext|>) was 50257 and now it is 50256, not sure if that's something to worry about.

Ah, this can be an issue I think. We have to keep it at 50257! So let's leave '' in the vocab (it is also in the original repo) and we just need {"<|endoftext|>": 50257} to be in the added_special_tokens. See this repo, which contains most of what we need

@jonatanklosko
Contributor Author

jonatanklosko commented Jan 24, 2023

@ArthurZucker we need <|endoftext|> in the vocab rather than added_tokens as per #21222 (comment).

Note that this means unknown token changes from 50256 to 50257, but hopefully that's less invasive.

@ArthurZucker
Collaborator

Yeah! That's better

@jonatanklosko jonatanklosko force-pushed the jk-whisper-fast-tokenizer branch from 9257971 to 21f69cf Compare January 25, 2023 12:33
@jonatanklosko
Contributor Author

@ArthurZucker it looks like the new failures come from the GenerationConfig missing some attributes; also, looking at openai/whisper-tiny, the forced_decoder_ids have a null token and don't match what we have in config.json.

@ArthurZucker
Collaborator

ArthurZucker commented Jan 25, 2023

Hey, the null token is fine! I added that for the refactoring; it allows the model to automatically predict the language

@ArthurZucker
Collaborator

ArthurZucker commented Jan 25, 2023

Okay, the error comes from the tiny_random_testing where configuration files are created from the config, and thus don't have any of the parameters related to generation. return_timestamps is set to True, but it should not be if there is no generation config.
Feel free to skip these tests for now, unless @ydshieh you have an alternative solution

@jonatanklosko jonatanklosko force-pushed the jk-whisper-fast-tokenizer branch from f15e60f to 7f69f4a Compare January 25, 2023 14:10
@jonatanklosko jonatanklosko force-pushed the jk-whisper-fast-tokenizer branch from 7f69f4a to 4e636bf Compare January 25, 2023 14:32
Collaborator

@sgugger sgugger left a comment

Thanks for adding this. Unrelated to this PR, it looks like the whisper tokenizer has a requirement on sentencepiece that is not accurate @ArthurZucker

(
"whisper",
(
"WhisperTokenizer" if is_sentencepiece_available() else None,
Collaborator

Why is sentencepiece here @ArthurZucker? The Whisper tokenizer files do not depend on sentencepiece at all.

Contributor Author

I removed the dependency; if that's not desired, I will revert.

@ydshieh
Collaborator

ydshieh commented Jan 25, 2023

Okay, the error comes from the tiny_random_testing where configuration files are created from the config, and thus don't have any of the parameters related to generation. return_timestamps is set to True, but it should not be if there is no generation config. Feel free to skip these tests for now, unless @ydshieh you have an alternative solution

The CI is currently running and I can't see which test you are mentioning. I will check later once the CI results are available.

@jonatanklosko
Contributor Author

Hey @ydshieh, the aforementioned tests are not skipped, but you can see the previous CI failure here.

Collaborator

@ArthurZucker ArthurZucker left a comment

This looks good to me, I think we can merge. Just pinging @ydshieh to review the tiny config issues

@ydshieh
Collaborator

ydshieh commented Feb 3, 2023

Hi, @jonatanklosko could you rebase on main branch? You will need to resolve the conflicts. Let me know if you need help on this. Sorry for being late here.

@ydshieh
Collaborator

ydshieh commented Feb 3, 2023

@jonatanklosko Thank you. I will take a look on Monday if the pipeline testing is still failing!

@jonatanklosko
Contributor Author

@ydshieh perfect, thanks :)

@ArthurZucker
Collaborator

Hey @jonatanklosko, can you rebase on main or resolve the merge conflicts?

@jonatanklosko jonatanklosko force-pushed the jk-whisper-fast-tokenizer branch from 582eef5 to 9311ab0 Compare February 20, 2023 19:33
@jonatanklosko
Contributor Author

@ArthurZucker done and everything passes now :)

@ArthurZucker ArthurZucker merged commit deafc24 into huggingface:main Feb 21, 2023
@jonatanklosko jonatanklosko deleted the jk-whisper-fast-tokenizer branch February 21, 2023 08:31
@ydshieh ydshieh mentioned this pull request Mar 1, 2023