Add WhisperTokenizerFast #21222
Conversation
The documentation is not available anymore as the PR was closed or merged.
jonatanklosko left a comment
The corresponding audio pipeline test used to be skipped because of the missing fast tokenizer. After adding the tokenizer it started to fail, but after a few changes it works fine now. A couple of related notes inline.
```python
# We adjust the sampling rate, such that the featurizer returns features
# compatible with the model
feature_extractor = feature_extractor.__class__(
    sampling_rate=tiny_config.max_source_positions * 2 * 160 // 30, hop_length=160, chunk_length=30
```
Happy to learn if there's an easier way to get a featurizer compatible with the given model config!
Not really sure! This works for me
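For anyone puzzling over the arithmetic in that snippet, here is a minimal sketch of why it yields a compatible featurizer, assuming the Whisper encoder halves the frame count with its stride-2 convolution; the concrete `max_source_positions` value below is only an illustrative example:

```python
# The feature extractor produces chunk_length * sampling_rate / hop_length frames,
# and the Whisper encoder halves that count with its stride-2 convolution, so we want
#   (chunk_length * sampling_rate / hop_length) / 2 == max_source_positions
# which rearranges to sampling_rate = max_source_positions * 2 * hop_length / chunk_length.
hop_length, chunk_length = 160, 30
max_source_positions = 1500  # illustrative value (matches the real openai/whisper-tiny config)

sampling_rate = max_source_positions * 2 * hop_length // chunk_length
assert (chunk_length * sampling_rate // hop_length) // 2 == max_source_positions
print(sampling_rate)  # 16000 with the values above
```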
ArthurZucker left a comment
Thanks for working on this. There are a few files that should not have been modified here and there, otherwise very neat!
```diff
  bos_token_id = self.original_tokenizer.bos_token_id
  tokenizer.post_processor = processors.TemplateProcessing(
-     single=f"{bos}:0 $A:0",  # token_type_id is 2 for Funnel transformer
+     single=f"{bos}:0 $A:0",
```
Not sure there's a reason why this is modified?
I think this comment is a leftover from copying the FunnelConverter: note that the template doesn't have a :2 token type id anywhere, unlike the actual Funnel converter here:
`single=f"{cls}:2 $A:0 {sep}:0",  # token_type_id is 2 for Funnel transformer`
(I just noticed that when adding the WhisperConverter based on the GPT2 one)
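For context, here is a small sketch contrasting the two post-processor templates with `tokenizers.processors.TemplateProcessing`; the token strings and ids are placeholders, not values taken from the PR:

```python
from tokenizers import processors

# Funnel-style template: the leading special token really does get token_type_id 2.
funnel_like = processors.TemplateProcessing(
    single="[CLS]:2 $A:0 [SEP]:0",
    special_tokens=[("[CLS]", 101), ("[SEP]", 102)],  # placeholder ids
)

# GPT-2/Whisper-style template: everything stays at token_type_id 0, so the
# "token_type_id is 2" comment no longer applies.
gpt2_like = processors.TemplateProcessing(
    single="<|bos|>:0 $A:0",
    special_tokens=[("<|bos|>", 0)],  # placeholder token and id
)
```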
```python
# TODO: it looks like the '' token is not re-added when retraining
# the tokenizer in tests, and we fall into an infinite recursion
# trying to convert unknown token to id
if index is None and token != self.unk_token:
```
I think the issue comes down to this:
```python
import string
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
tokenizer._tokenizer.add_special_tokens([''])
print(tokenizer._tokenizer.token_to_id(''))
#=> None
```

Adding any other token (e.g. `<|test|>`) works fine, but the empty string doesn't work as a token.

Using tokenizers directly:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
tokenizer.add_special_tokens(['', '<|test|>'])  #=> 1
print(tokenizer.token_to_id('<|test|>'))  #=> 28996
print(tokenizer.token_to_id(''))  #=> None
```
Ok, the actual discrepancy is between how slow and fast tokenizers handle adding the '' token:

Slow:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer.add_tokens([''])
print(tokenizer.convert_tokens_to_ids(''))
#=> 28996
```

Fast:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
tokenizer.add_tokens([''])
print(tokenizer.convert_tokens_to_ids(''))
#=> 100 (unknown)
```
Yep exactly, which is why in #21250 I set any unknown token to "". This was such a headache.
In the whisper-large version, I added "" to the list of words in the vocab and set unk_token = "", which gave the correct ID, but it is a bit confusing.
How should we mirror this change in the fast version?
The failure path is `all_special_ids` -> `convert_tokens_to_ids(all_special_tokens)` -> `_convert_token_to_id_with_added_voc('')` -> `_tokenizer.token_to_id('')` -> `None`.
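A minimal sketch of why that chain bottoms out at `None`, and why the `token != self.unk_token` guard shown earlier avoids infinite recursion; this is simplified stand-in code, not the actual transformers source, and `bert-base-cased` is used only as a convenient backend:

```python
from tokenizers import Tokenizer

backend = Tokenizer.from_pretrained("bert-base-cased")  # stand-in for the Whisper backend
print(backend.token_to_id(""))  # => None, the empty string never makes it into the vocab


def to_id(token, unk_token=""):
    """Simplified version of the token -> id fallback discussed above."""
    index = backend.token_to_id(token)
    if index is None and token != unk_token:  # without the second condition this recurses forever
        return to_id(unk_token)
    return index
```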
If you want to reproduce locally, remove that check and run `pytest tests/pipelines/test_pipelines_automatic_speech_recognition.py -k 'test_pt_WhisperConfig_WhisperForConditionalGeneration_WhisperTokenizer_WhisperFeatureExtractor'` :)
Yep, I think we should remove the unk_token (the way it is defined) from the multilingual models. They should behave the same way as whisper-tiny.en
Oh, that would be ideal, but won't this result in token ids shifting if '' is missing?
No it shouldn't, we will leave '' in the vocab but just set unk_token = <|endoftext|>
Ah, it is in the vocab too, not just the special tokens, perfect!
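A rough sketch of what the agreed change would look like from the Python side; the checkpoint name and attribute updates are assumptions, since the real change lives in the hub checkpoint files:

```python
from transformers import WhisperTokenizer

tok = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# '' stays in the vocabulary (so no ids shift), but unk now aliases <|endoftext|>.
tok.unk_token = "<|endoftext|>"
print(tok.unk_token_id == tok.eos_token_id)  # True once unk points at <|endoftext|>

tok.save_pretrained("whisper-tiny-updated")  # writes the updated tokenizer_config.json
```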
@ArthurZucker thanks for the help! I think now the steps are to update the unknown token in the multilingual checkpoints and add `tokenizer.json`.
Feel free to open community PRs on the models (hub) linking to this PR (GitHub) 🚀
@ArthurZucker sure! I've just created https://huggingface.co/openai/whisper-tiny/discussions/5, let me know if it looks as expected and I will open a matching PR on the other checkpoints too. FTR I generated the `tokenizer.json` with:

```python
import sys
sys.path.reverse()
sys.path.append("/Users/jonatanklosko/git/transformers/src")
sys.path.reverse()

from transformers import WhisperTokenizerFast

tokenizer = WhisperTokenizerFast.from_pretrained("/Users/jonatanklosko/git/hf/whisper-tiny/")
tokenizer.save_pretrained("/Users/jonatanklosko/git/hf/whisper-tiny/")
```

I also updated the unknown token configuration manually.
Changing the unknown token in the configuration leads to weird behaviour when loading the slow tokenizer, see an example in the PR. Any ideas why that is?
So the issue is that the multilingual tokenizer doesn't have … (see transformers/src/transformers/tokenization_utils.py, lines 419 to 420 at 7119bb0), and since …
To address this we would need to add … The question is whether this hurts compatibility; when loading the slow tokenizer both of these files would be used to load the vocabulary, so moving the entry from one to the other should be alright?
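To illustrate the compatibility point, a small sketch of the union the slow tokenizer effectively sees; the file names follow the usual slow-tokenizer layout and the exact loading order is an assumption:

```python
import json

# vocab.json holds the base vocabulary, added_tokens.json the extra entries.
with open("vocab.json") as f:
    vocab = json.load(f)
with open("added_tokens.json") as f:
    added_tokens = json.load(f)

# As long as a token keeps the same id, it resolves identically
# whichever of the two files it lives in.
combined = {**vocab, **added_tokens}
print(len(combined))
```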
Yep, I think the idea is to make the multilingual added tokens match the ones that we have for English. I forgot to mention, but yes, we have to add …
Ah, so we should actually replace it, so that … and not …?
@ArthurZucker I updated the PR on the checkpoint. I tried the remaining failing tests locally, pointing the tokenizer to the updated revision, and they passed, so I think we are good on this side.
Note that the only difference is that originally EOS (…) …
The EOS token id appears multiple times in the …
Ah, this can be an issue I think. We have to keep it at 50257! So let's leave …
@ArthurZucker we need … Note that this means the unknown token changes from 50256 to 50257, but hopefully that's less invasive.
Yeah! That's better
@ArthurZucker it looks like the new failures come from the GenerationConfig missing some attributes; also looking at …
Hey, …
Okay, the error comes from the …
sgugger left a comment
Thanks for adding this. Unrelated to this PR, it looks like the whisper tokenizer has a requirement on sentencepiece that is not accurate @ArthurZucker
```python
(
    "whisper",
    (
        "WhisperTokenizer" if is_sentencepiece_available() else None,
```
Why is sentencepiece here, @ArthurZucker? The Whisper tokenizer files do not depend on sentencepiece at all.
I removed the dependency; if that's not desired I will revert.
The CI is currently running and I can't see which test you are mentioning. I will check later once the CI results are available.
PRs for other checkpoints: …
ArthurZucker left a comment
This looks good to me, I think that we can merge. Just pinging @ydshieh for the tiny config issues to review
Hi @jonatanklosko, could you rebase on the main branch? You will need to resolve the conflicts. Let me know if you need help with this. Sorry for being late here.
@jonatanklosko Thank you. I will take a look on Monday if the pipeline testing is still failing!
@ydaigo perfect, thanks :)
Hey @jonatanklosko, can you rebase on main or resolve the merge conflicts?
@ArthurZucker done and everything passes now :)
Adds the fast version of the Whisper tokenizer. The Whisper tokenizer is essentially the GPT2 tokenizer with special tokens. The main differences are the additional normalizer (which I mirrored from the slow tokenizer) and the language/task-dependent prefix tokens.
One of the tokenizer tests is failing because there is no `tokenizer.json` file in the `openai/whisper-*` checkpoints (specifically the `tiny` checkpoint). I added a converter, so now it is possible to load the fast tokenizer from existing checkpoints and export `tokenizer.json`.
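A short usage sketch of the resulting fast tokenizer; the checkpoint name follows the hub checkpoints discussed above and the exact outputs are not asserted:

```python
from transformers import WhisperTokenizerFast

# Loading converts from the slow tokenizer files when no tokenizer.json is present.
tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

ids = tokenizer("hello world").input_ids
print(tokenizer.decode(ids, skip_special_tokens=True))

# Saving exports tokenizer.json alongside the other tokenizer files.
tokenizer.save_pretrained("whisper-tiny-fast")
```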