Feature request
Support decoding user-defined added tokens, which are appended to the end of the tokenizer's vocabulary, for Whisper-based models. This requires modifying the if statement in _decode_asr.
Motivation
The motivation for this proposal is to reach feature parity with tokenizers.decode, which is already able to decode user-added tokens.
Your contribution
To support this feature, we just need to change the if-statement condition in _decode_asr from token >= timestamp_begin to token >= timestamp_begin && token <= timestamp_end, where timestamp_end = this.model.convert_tokens_to_ids(["<|30.00|>"])[0].
Why this should work:
When a user adds a new token to the tokenizer, it gets placed at the end of the tokenizer's vocabulary. The last 1500 vocab tokens in whisper-tiny, whisper-tiny.en, whisper-small.en, whisper-small, whisper-base, whisper-base.en, whisper-large, and whisper-large-v2 correspond to the timestamp tokens from "<|0.00|>" to "<|30.00|>". By bounding the condition from token >= timestamp_begin to token >= timestamp_begin && token <= timestamp_end, user-added tokens are decoded as regular tokens: for them the condition evaluates to false, so decoding falls through to the else block. A sketch of the proposed check follows.
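Below is a minimal, illustrative sketch of the proposed bounded check in JavaScript. It is not the actual _decode_asr implementation; the classifyToken helper and the token ids in the example are assumptions made purely for demonstration.

```js
// Hypothetical helper illustrating the proposed bounded check
// (not the actual _decode_asr code).
function classifyToken(token, timestamp_begin, timestamp_end) {
    // Old condition: token >= timestamp_begin
    // New condition: also require token <= timestamp_end, so ids appended
    // after "<|30.00|>" (i.e. user-added tokens) fall through to the else branch.
    if (token >= timestamp_begin && token <= timestamp_end) {
        return "timestamp";
    }
    return "text";
}

// Example with made-up ids: suppose "<|0.00|>" = 1000, "<|30.00|>" = 2500,
// and a user-added token received id 2501 (placed at the end of the vocabulary).
console.log(classifyToken(999, 1000, 2500));  // "text"      (regular vocab token)
console.log(classifyToken(1000, 1000, 2500)); // "timestamp" ("<|0.00|>")
console.log(classifyToken(2500, 1000, 2500)); // "timestamp" ("<|30.00|>")
console.log(classifyToken(2501, 1000, 2500)); // "text"      (user-added token)
```

With the old, unbounded condition, the last call would be treated as a timestamp; with the proposed bound, the user-added token is decoded as regular text.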