Fix for tokenizer.apply_chat_template with continue_final_message=True #34214
Conversation
An alternative, possibly more robust, approach would be to only do the stripping if the full final message string cannot be found in the rendered chat as-is.
@schoennenbeck this is a really good fix, thank you! I think we don't need the more robust solution - because of how whitespace is tokenized, I think it will be quite hard to end your message in multiple spaces and still get a good continuation.
(We're having some CI issues, but hopefully we'll resolve them later today and then I can merge this)
@schoennenbeck merged! Thanks again for a clean and helpful PR!
Fix for tokenizer.apply_chat_template with continue_final_message=True (huggingface#34214)
* Strip final message
* Do full strip instead of rstrip
* Retrigger CI

Co-authored-by: Matt <[email protected]>
For a lot of tokenizers, `tokenizer.apply_chat_template` with `continue_final_message=True` gives a "ValueError: substring not found" if the final message starts or ends in some whitespace. Here is some example code that exhibits the issue:
This is due to the fact that the `apply_chat_template` method looks for the full final message in the rendered chat, but many modern chat templates (in particular the Llama-3.1 chat template) actually trim messages before rendering.

This PR strips the final message before looking it up in the rendered string, which fixes the issue. However, this means that the continuation can now happen at a slightly different spot than the user intended. I believe this is the best way to address the issue, but I would also be open to simply raising a more descriptive error when this happens, so that the user can strip the last message themselves.
Either way I think the current failure mode is not ideal.
If required I could also add a test for this behaviour.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.