Decoding Tokens added by the user for Whisper models #803

Open
aravindMahadevan opened this issue Jun 10, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@aravindMahadevan
Contributor

Feature request

Support decoding user-defined tokens that have been appended to the end of the tokenizer's vocabulary for Whisper-based models. Making this work requires changing the condition of the if statement in _decode_asr.

Motivation

The motivation is feature parity with tokenizers.decode, which can already decode user-added tokens.

Your contribution

To support this feature, we only need to change the condition of the if statement in _decode_asr from token >= timestamp_begin to token >= timestamp_begin && token <= timestamp_end, where timestamp_end = this.model.convert_tokens_to_ids(["<|30.00|>"])[0].
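As a rough sketch of the proposed condition (not the library's actual code): the helper name isTimestampToken and the ids below are hypothetical and only illustrate the bounded check. In the real change, both bounds would come from the tokenizer lookups described above.

```javascript
// Hypothetical helper: true only for ids inside the closed timestamp range
// [timestampBegin, timestampEnd]. Ids above timestampEnd fall through.
function isTimestampToken(token, timestampBegin, timestampEnd) {
  return token >= timestampBegin && token <= timestampEnd;
}

// Illustrative ids, assumed for this sketch rather than read from a tokenizer.
const timestampBegin = 50364; // stands in for the id of "<|0.00|>"
const timestampEnd = 51864;   // stands in for the id of "<|30.00|>"
const userAddedToken = 51865; // stands in for a token appended by the user

console.log(isTimestampToken(timestampBegin, timestampBegin, timestampEnd)); // true
console.log(isTimestampToken(userAddedToken, timestampBegin, timestampEnd)); // false
```

With only the lower bound, the second call would have returned true and the added token would have been handled as a timestamp.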

Why this should work:

When a user adds a new token to the tokenizer, it is placed at the end of the tokenizer's vocabulary. In whisper-tiny, whisper-tiny.en, whisper-small.en, whisper-small, whisper-base, whisper-base.en, whisper-large, and whisper-large-v2, the last 1501 vocab tokens are the timestamp tokens from "<|0.00|>" to "<|30.00|>", so any user-added token receives an id greater than that of "<|30.00|>". By tightening the condition from token >= timestamp_begin to token >= timestamp_begin && token <= timestamp_end, we ensure that user-added tokens are decoded as regular tokens: the condition evaluates to false for them and execution goes to the else block.
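The reasoning can be checked on a toy vocabulary. This is a minimal sketch under stated assumptions: the vocabulary is made up, the timestamp strings are generated at an assumed 0.02 s step, and "<|my_token|>" is a hypothetical user-added token. None of it is taken from the library.

```javascript
// Toy vocabulary: two regular tokens, then the timestamp tokens, then one
// user-added token appended at the very end.
const vocab = ["hello", "world"];
for (let i = 0; i <= 1500; i++) {
  // i / 50 seconds, i.e. a 0.02 s step from 0.00 to 30.00 (assumed step size)
  vocab.push(`<|${(i / 50).toFixed(2)}|>`);
}
vocab.push("<|my_token|>"); // hypothetical user-added token

const timestampBegin = vocab.indexOf("<|0.00|>");
const timestampEnd = vocab.indexOf("<|30.00|>");
const addedId = vocab.indexOf("<|my_token|>");

// Current condition: lower bound only.
const unbounded = (token) => token >= timestampBegin;
// Proposed condition: lower and upper bound.
const bounded = (token) => token >= timestampBegin && token <= timestampEnd;

console.log(unbounded(addedId)); // true  (added token treated as a timestamp)
console.log(bounded(addedId));   // false (added token decoded as regular text)
console.log(bounded(timestampEnd)); // true (real timestamps are unaffected)
```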

@aravindMahadevan aravindMahadevan added the enhancement New feature or request label Jun 10, 2024
@xenova
Collaborator

xenova commented Jun 10, 2024

Good spot! Feel free to submit a PR for this. Thanks! 🤗

@aravindMahadevan
Contributor Author

@xenova submitted the PR with a fix!
