
Conversation

@jonatanklosko
Contributor

@jonatanklosko jonatanklosko commented Jan 20, 2023

Adds the fast version of the Whisper tokenizer. The Whisper tokenizer is essentially the GPT-2 tokenizer with special tokens. The main differences are the additional normalizer (which I mirrored from the slow tokenizer) and the language/task-dependent prefix tokens.
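For intuition, the prefix-token behaviour can be sketched with the tokenizers library's TemplateProcessing on a toy vocabulary. The special-token strings mirror Whisper's, but the tiny WordLevel vocab and the ids below are made up for illustration:

```python
from tokenizers import Tokenizer, processors
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary; ids are arbitrary, only the special-token strings mirror Whisper's.
vocab = {
    "hello": 0, "world": 1, "[UNK]": 2,
    "<|startoftranscript|>": 3, "<|en|>": 4, "<|transcribe|>": 5, "<|endoftext|>": 6,
}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Language/task-dependent prefix tokens plus EOS, applied as a post-processor.
tok.post_processor = processors.TemplateProcessing(
    single="<|startoftranscript|> <|en|> <|transcribe|> $A <|endoftext|>",
    special_tokens=[
        ("<|startoftranscript|>", 3), ("<|en|>", 4),
        ("<|transcribe|>", 5), ("<|endoftext|>", 6),
    ],
)

print(tok.encode("hello world").ids)  # [3, 4, 5, 0, 1, 6]
```

In the real tokenizer the language and task tokens in the template change depending on the requested language and task.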

One of the tokenizer tests is failing because there is no tokenizer.json file in the openai/whisper-* repositories (specifically the tiny checkpoint). I added a converter, so it is now possible to load the fast tokenizer from existing checkpoints and export tokenizer.json.

@sgugger sgugger requested a review from ArthurZucker January 20, 2023 18:55
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jan 20, 2023

The documentation is not available anymore as the PR was closed or merged.

Contributor Author

@jonatanklosko jonatanklosko left a comment

The corresponding audio pipeline test used to be skipped because of the missing fast tokenizer. After adding the tokenizer it started to fail, but after a few changes it works fine now. A couple of related notes inline.

# We adjust the sampling rate, such that the featurizer returns features
# compatible with the model
feature_extractor = feature_extractor.__class__(
sampling_rate=tiny_config.max_source_positions * 2 * 160 // 30, hop_length=160, chunk_length=30
Contributor Author

Happy to learn if there's an easier way to get a featurizer compatible with the given model config!

Collaborator

Not really sure! This works for me
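For what it's worth, the arithmetic in that expression checks out: the featurizer yields chunk_length * sampling_rate / hop_length mel frames per chunk, Whisper's conv stem halves that, and the result must equal max_source_positions; solving for the sampling rate gives the line above. A quick stdlib sketch (the helper name is made up):

```python
def compatible_sampling_rate(max_source_positions, hop_length=160, chunk_length=30):
    # frames after the conv stem = (chunk_length * sampling_rate / hop_length) / 2
    # setting that equal to max_source_positions and solving for sampling_rate:
    return max_source_positions * 2 * hop_length // chunk_length

# With the real Whisper config (max_source_positions=1500) this recovers 16 kHz:
print(compatible_sampling_rate(1500))  # 16000
```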

Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks for working on this. There are a few files here and there that should not have been modified; otherwise very neat!

  bos_token_id = self.original_tokenizer.bos_token_id
  tokenizer.post_processor = processors.TemplateProcessing(
-     single=f"{bos}:0 $A:0",  # token_type_id is 2 for Funnel transformer
+     single=f"{bos}:0 $A:0",
Collaborator

Is there a reason why this was modified?

Contributor Author

@jonatanklosko jonatanklosko Jan 23, 2023

I think this comment is a leftover from copying the FunnelConverter: note that this template doesn't have the :2 token type id anywhere, unlike the Funnel one:

single=f"{cls}:2 $A:0 {sep}:0", # token_type_id is 2 for Funnel transformer

(I just noticed this while adding the WhisperConverter based on the GPT-2 one.)

jonatanklosko and others added 3 commits January 23, 2023 14:42
Comment on lines 259 to 262
# TODO: it looks like the '' token is not re-added when retraining
# the tokenizer in tests, and we fall into an infinite recursion
# trying to convert an unknown token to an id
if index is None and token != self.unk_token:
Contributor Author

@jonatanklosko jonatanklosko Jan 23, 2023

I think the issue comes down to this:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
tokenizer._tokenizer.add_special_tokens([''])
print(tokenizer._tokenizer.token_to_id(''))
#=> None

Adding any other token (e.g. <|test|>) works fine, but the empty string doesn't work as a token.

Using tokenizers directly:

from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-cased")
tokenizer.add_special_tokens(['', '<|test|>']) #=> 1
print(tokenizer.token_to_id('<|test|>')) #=> 28996
print(tokenizer.token_to_id('')) #=> None

Contributor Author

@jonatanklosko jonatanklosko Jan 23, 2023

Ok, the actual discrepancy is in how the slow and fast tokenizers handle adding the '' token:

Slow

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer.add_tokens([''])
print(tokenizer.convert_tokens_to_ids(''))
#=> 28996

Fast

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
tokenizer.add_tokens([''])
print(tokenizer.convert_tokens_to_ids(''))
#=> 100 (unknown)

Collaborator

Yep, exactly. Which is why in #21250 I set any unknown token to "". This was such a headache.
In the whisper-large version, I added "" to the list of words in the vocab and set unk_token = "", which gave the correct ID, but it is a bit confusing.

Contributor Author

How should we mirror this change in the fast version?

Contributor Author

The failure path is all_special_ids -> convert_tokens_to_ids(all_special_tokens) -> _convert_token_to_id_with_added_voc('') -> _tokenizer.token_to_id('') -> None

Contributor Author

If you want to reproduce locally, remove that check and run pytest tests/pipelines/test_pipelines_automatic_speech_recognition.py -k 'test_pt_WhisperConfig_WhisperForConditionalGeneration_WhisperTokenizer_WhisperFeatureExtractor' :)

Collaborator

Yep, I think we should remove the unk_token (the way it is defined) from the multilingual models. They should behave the same way as whisper-tiny.en

Contributor Author

Oh, that would be ideal, but won't this result in token ids shifting if '' is missing?

Collaborator

No it shouldn't, we will leave '' in the vocab but just set unk_token = <|endoftext|>

Contributor Author

Ah, it is in the vocab too, not just the special tokens, perfect!

@jonatanklosko
Contributor Author

@ArthurZucker thanks for the help! I think the next steps are to update the unknown token in the multilingual checkpoints and add tokenizer.json to the repos. Let me know if there's anything I can help with :)

@ArthurZucker
Collaborator

Feel free to open community PRs on the models (hub) linking to this PR (GitHub) 🚀

@jonatanklosko
Contributor Author

@ArthurZucker sure! I've just created https://huggingface.co/openai/whisper-tiny/discussions/5, let me know if it looks as expected and I will open a matching PR on the other checkpoints too.

FTR I generated the tokenizer.json with:

import sys
sys.path.reverse()
sys.path.append("/Users/jonatanklosko/git/transformers/src")
sys.path.reverse()

from transformers import WhisperTokenizerFast

tokenizer = WhisperTokenizerFast.from_pretrained("/Users/jonatanklosko/git/hf/whisper-tiny/")
tokenizer.save_pretrained("/Users/jonatanklosko/git/hf/whisper-tiny/")

I also updated the unknown token configuration manually.

@jonatanklosko
Contributor Author

Changing the unknown token in configuration leads to a weird behaviour when loading the slow tokenizer, see an example in the PR. Any ideas why that is?

@jonatanklosko
Contributor Author

So the issue is that the multilingual tokenizer doesn't have <|endoftext|> in the initial vocabulary, so it would need to be added from the special tokens map. However, when loading special tokens we have this check:

if (
token != self.unk_token

and since eos_token and unk_token are both <|endoftext|>, we end up not adding them to the vocabulary.

@jonatanklosko
Contributor Author

jonatanklosko commented Jan 23, 2023

To address this we would need to add "<|endoftext|>": 50257 to vocab.json and remove it from added_tokens.json. Note that this is the case in the English checkpoints (except with 50256).

The question is whether this hurts compatibility; when loading the slow tokenizer both of these files are used to load the vocabulary, so moving the entry from one to the other should be alright?
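A stdlib sketch of the move being proposed; the miniature dicts stand in for the real vocab.json and added_tokens.json (which have tens of thousands of entries), and only the <|endoftext|> entry and its id reflect the actual checkpoint:

```python
# Miniature stand-ins for the multilingual checkpoint's vocab.json and
# added_tokens.json; in practice these would be json.load-ed from the files.
vocab = {"!": 0, "a": 1}
added_tokens = {"<|endoftext|>": 50257, "<|startoftranscript|>": 50258}

# Move <|endoftext|> from added_tokens.json into vocab.json, keeping its id
# so no other token ids shift.
vocab["<|endoftext|>"] = added_tokens.pop("<|endoftext|>")

print(vocab["<|endoftext|>"])           # 50257
print("<|endoftext|>" in added_tokens)  # False
```

The slow tokenizer merges both files into one vocabulary on load, which is why moving the entry between them should be transparent.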

@ArthurZucker
Collaborator

Yep, I think the idea is to make the multilingual added tokens match the ones that we have for English. I forgot to mention but yes, we have to add "<|endoftext|>" to the vocabulary instead of ''. This should normally do the trick (along with modifying the content of the unknown token).

@jonatanklosko
Contributor Author

Ah, so we should actually swap them, so that <|endoftext|> gets the id that '' currently has, and we keep '' just to make sure the ids are not shifted at any point?

"<|endoftext|>": 50256,
"": 50257,

and not:

"": 50256,
"<|endoftext|>": 50257,

@jonatanklosko
Contributor Author

@ArthurZucker I updated the PR on the checkpoint. I tried the remaining failing tests locally, pointing the tokenizer to the updated revision, and they passed, so I think we are good on this side.

@jonatanklosko
Contributor Author

Note that the only difference is that originally EOS (<|endoftext|>) was 50257 and now it is 50256, not sure if that's something to worry about.

@jonatanklosko
Contributor Author

The EOS token id appears multiple times in config.json, so we need to adjust it too. Let me know if that's the way to go, or if we should swap them back :)

@ArthurZucker
Collaborator

ArthurZucker commented Jan 24, 2023

Note that the only difference is that originally EOS (<|endoftext|>) was 50257 and now it is 50256, not sure if that's something to worry about.

Ah, this can be an issue I think. We have to keep it at 50257! So let's leave '' in the vocab (it is also in the original repo) and we just need {"<|endoftext|>": 50257} to be in the added_special_tokens. See this repo, which contains most of what we need

@jonatanklosko
Contributor Author

jonatanklosko commented Jan 24, 2023

@ArthurZucker we need <|endoftext|> in the vocab rather than added_tokens as per #21222 (comment).

Note that this means unknown token changes from 50256 to 50257, but hopefully that's less invasive.

@ArthurZucker
Collaborator

Yeah! That's better

@jonatanklosko jonatanklosko force-pushed the jk-whisper-fast-tokenizer branch from 9257971 to 21f69cf Compare January 25, 2023 12:33
@jonatanklosko
Contributor Author

@ArthurZucker it looks like the new failures come from the GenerationConfig missing some attributes; also, looking at openai/whisper-tiny, the forced_decoder_ids have a null token and don't match what we have in config.json.

@ArthurZucker
Collaborator

ArthurZucker commented Jan 25, 2023

Hey, the null token is fine! I added that for the refactoring; it allows the model to automatically predict the language

@ArthurZucker
Collaborator

ArthurZucker commented Jan 25, 2023

Okay, the error comes from the tiny_random_testing where configuration files are created from the config, and thus don't have any of the parameters related to generation. return_timestamps is set to True, but it should not be if there is no generation config.
Feel free to skip these tests for now, unless @ydshieh you have an alternative solution

@jonatanklosko jonatanklosko force-pushed the jk-whisper-fast-tokenizer branch from f15e60f to 7f69f4a Compare January 25, 2023 14:10
@jonatanklosko jonatanklosko force-pushed the jk-whisper-fast-tokenizer branch from 7f69f4a to 4e636bf Compare January 25, 2023 14:32
Collaborator

@sgugger sgugger left a comment

Thanks for adding this. Unrelated to this PR, it looks like the whisper tokenizer has a requirement on sentencepiece that is not accurate @ArthurZucker

(
"whisper",
(
"WhisperTokenizer" if is_sentencepiece_available() else None,
Collaborator

Why is sentencepiece here @ArthurZucker? The Whisper tokenizer files do not depend on sentencepiece at all.

Contributor Author

I removed the dependency; if that's not desired, I will revert.

@ydshieh
Collaborator

ydshieh commented Jan 25, 2023

Okay, the error comes from the tiny_random_testing where configuration files are created from the config, and thus don't have any of the parameters related to generation. return_timestamps is set to True, but it should not be if there is no generation config. Feel free to skip these tests for now, unless @ydshieh you have an alternative solution

The CI is currently running and I can't see which test you are mentioning. I will check later once the CI results are available.

@jonatanklosko
Contributor Author

Hey @ydshieh, the aforementioned tests are not skipped, but you can see the previous CI failure here.

Collaborator

@ArthurZucker ArthurZucker left a comment

This looks good to me, I think we can merge. Just pinging @ydshieh to review the tiny config issues

@ydshieh
Collaborator

ydshieh commented Feb 3, 2023

Hi, @jonatanklosko could you rebase on main branch? You will need to resolve the conflicts. Let me know if you need help on this. Sorry for being late here.

@ydshieh
Collaborator

ydshieh commented Feb 3, 2023

@jonatanklosko Thank you. I will take a look on Monday if the pipeline testing is still failing!

@jonatanklosko
Contributor Author

@ydshieh perfect, thanks :)

@ArthurZucker
Collaborator

Hey @jonatanklosko, can you rebase on main or resolve the merge conflicts?

@jonatanklosko jonatanklosko force-pushed the jk-whisper-fast-tokenizer branch from 582eef5 to 9311ab0 Compare February 20, 2023 19:33
@jonatanklosko
Contributor Author

@ArthurZucker done and everything passes now :)

@ArthurZucker ArthurZucker merged commit deafc24 into huggingface:main Feb 21, 2023
@jonatanklosko jonatanklosko deleted the jk-whisper-fast-tokenizer branch February 21, 2023 08:31
@ydshieh ydshieh mentioned this pull request Mar 1, 2023