[Whisper] fix all issues with unk token #21250

ArthurZucker · 2023-01-23T10:40:22Z

What does this PR do?

Previously, all OOV tokens (and thus timestamp tokens) output by the model were decoded to <|endoftext|> by the xxx.en Whisper models. This did not happen with the multilingual model only because I added "" to the vocabulary, and the unk_token_id points to the same "". But this does not really make sense.
As the default behavior for Whisper is just to output "" for any OOV token, the _convert_id_to_token function no longer uses an unk_token.
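
To illustrate the change described above, here is a minimal sketch of the two fallback behaviors. It is not the actual transformers code: the toy `decoder` vocabulary, the id `50364`, and the function names `convert_id_to_token_old` / `convert_id_to_token_new` are all assumptions made up for the example.

```python
# Toy vocabulary standing in for an English-only tokenizer (hypothetical ids).
decoder = {0: "hello", 1: " world", 2: "<|endoftext|>"}
UNK_TOKEN = "<|endoftext|>"


def convert_id_to_token_old(index: int) -> str:
    """Old behavior: an id missing from the vocabulary falls back to the unk token."""
    return decoder.get(index, UNK_TOKEN)


def convert_id_to_token_new(index: int) -> str:
    """New behavior: an id missing from the vocabulary decodes to an empty string."""
    return decoder.get(index, "")


# 50364 is a hypothetical out-of-vocabulary id, e.g. a timestamp token.
ids = [0, 1, 50364]
old_text = "".join(convert_id_to_token_old(i) for i in ids)
new_text = "".join(convert_id_to_token_new(i) for i in ids)

print(repr(old_text))
print(repr(new_text))
```

With the old fallback, the out-of-vocabulary id leaks an `<|endoftext|>` into the decoded text; with the new one it simply contributes nothing.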

This will fix the inconsistency, and will help for the whisper refactoring.

HuggingFaceDocBuilderDev · 2023-01-23T10:53:49Z

The documentation is not available anymore as the PR was closed or merged.

sgugger

Thanks for the fix! Make sure to run make style on your branch to fix the quality issue.

sanchit-gandhi

SGTM aligning with the official implementation here 👍 thanks for the fix!

fix all issues with unk token

30515fb

ArthurZucker requested review from sanchit-gandhi and sgugger January 23, 2023 10:40

ArthurZucker mentioned this pull request Jan 23, 2023

Add WhisperTokenizerFast #21222

Merged

sgugger approved these changes Jan 23, 2023

fixup

17457a7

ArthurZucker merged commit d8415ba into huggingface:main Jan 23, 2023

sanchit-gandhi reviewed Jan 24, 2023

sanchit-gandhi mentioned this pull request Jan 26, 2023

[Whisper] Add rescaling function with do_normalize #21263

Merged
