-
Notifications
You must be signed in to change notification settings - Fork 785
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes Issue #803 #804
Fixes Issue #803 #804
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just one thing:
Co-authored-by: Joshua Lochner <[email protected]>
@xenova committed the fix, let me know if there is anything else needed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
Thanks! Will merge after tests pass. By the way, do you have an example of a whisper model which such tokens? Might be good to add a test. |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Hi @xenova, there are two tests failing due to a few test models not having the <30.00> token. Here's one such example: https://huggingface.co/Xenova/whisper-small/resolve/output_attentions/tokenizer.json. Do you have any suggestions on how to overcome this issue? This token does exist in all of the base whisper models. |
We could probably use the time precision (0.02) to calculate the offset: 30/0.02 + 1 = 1501 tokens (50364 -> 51864). Another fix is to simply update the tokenizer.json. Also, can you provide a model which does have added tokens after the final timestamps token? |
Hi @xenova , that was a good suggestion and I have updated the logic to the following: This logic should work on both English only Whisper and multilingual Whisper variants. The beginning timestamp of 0.00 is equal to token id 50363 and 50364 in the English only Whisper variants and Multilingual Whisper variants respectively. Similarly, the final timestamp token id of 30.00 is equal to 51863 and 51864 in the English only Whisper variants and Multilingual Whisper variants respectively as well. In both cases, the final timestamp token id occurs at an offset of 1500 which is what total_timestamp_tokens evaluates to. I do not have a model with added tokens that I can share publicly. I did find this model https://huggingface.co/oza75/whisper-bambara-asr-001 which has added tokens after the final timestamp but it's a special token. |
Hey @xenova, are there any more changes that should be made to this PR? |
@xenova Hi! I believe all test cases are passing, is there anything else needed before we can merge this change? Would you be able to take one last look? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the delay... This PR got lost in my notifs...
Thanks again! 🙏
Modify the _decode_asr method to support decoding user defined token in Whisper based models. Addresses issue in #803