Decoding Tokens added by the user for Whisper models #803

aravindMahadevan · 2024-06-10T19:55:17Z

Feature request

Support decoding user-defined tokens that are appended to the end of the tokenizer's vocabulary for Whisper-based models. This requires modifying the if statement in _decode_asr.

Motivation

The motivation for this proposal is feature parity with tokenizers.decode, which is already able to decode user-added tokens.

Your contribution

To support this feature, we only need to change the condition of the if statement in _decode_asr from token >= timestamp_begin to token >= timestamp_begin && token <= timestamp_end, where timestamp_end = this.model.convert_tokens_to_ids(["<|30.00|>"])[0].

Why this should work:

When a user adds a new token to the tokenizer, it gets placed at the end of the tokenizer's vocabulary. The last 1501 vocab tokens in whisper-tiny, whisper-tiny.en, whisper-small.en, whisper-small, whisper-base, whisper-base.en, whisper-large, and whisper-large-v2 are the timestamp tokens from "<|0.00|>" to "<|30.00|>" (one every 0.02 seconds), so any user-added token receives an ID greater than that of "<|30.00|>". By tightening the condition from token >= timestamp_begin to token >= timestamp_begin && token <= timestamp_end, we ensure that user-added tokens are decoded as regular tokens: the condition evaluates to false for them, and execution falls through to the else block.
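To illustrate the reasoning above, here is a minimal sketch on a toy vocabulary. It is not the actual _decode_asr code: the vocabulary contents, the token "<|my_custom_tag|>", and the helper names buildToyVocab, isTimestampOld, and isTimestampNew are all hypothetical, and the IDs are far smaller than in a real Whisper tokenizer. Only the two conditions are taken from the proposal.

```javascript
// Toy stand-in for a Whisper vocabulary (hypothetical; real IDs differ).
function buildToyVocab() {
  const vocab = ["hello", "world", "<|notimestamps|>"];
  // Timestamp tokens <|0.00|>, <|0.02|>, ..., <|30.00|> in 0.02 s steps.
  for (let i = 0; i <= 1500; i++) {
    vocab.push(`<|${(i / 50).toFixed(2)}|>`);
  }
  return vocab;
}

const vocab = buildToyVocab();
const timestampBegin = vocab.indexOf("<|0.00|>");
const timestampEnd = vocab.indexOf("<|30.00|>");

// A user-added token is appended after the timestamp tokens.
vocab.push("<|my_custom_tag|>");
const addedId = vocab.length - 1;

// Current condition: everything at or above timestamp_begin is a timestamp.
const isTimestampOld = (token) => token >= timestampBegin;

// Proposed condition: only IDs inside the timestamp range count.
const isTimestampNew = (token) =>
  token >= timestampBegin && token <= timestampEnd;

console.log(isTimestampOld(addedId)); // true: the added token is misread as a timestamp
console.log(isTimestampNew(addedId)); // false: the added token takes the regular-token path
console.log(isTimestampNew(timestampEnd)); // true: "<|30.00|>" is still a timestamp
```

With the upper bound in place, real timestamp tokens are classified exactly as before, while any ID past "<|30.00|>" is left to the regular decoding branch.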


xenova · 2024-06-10T21:40:17Z

Good spot! Feel free to submit a PR for this. Thanks! 🤗

aravindMahadevan · 2024-06-11T14:19:40Z

@xenova submitted the PR with a fix!

aravindMahadevan added the enhancement label Jun 10, 2024

aravindMahadevan mentioned this issue Jun 11, 2024

Fixes Issue #803 #804

Merged
