[V5] Return a BatchEncoding dict from apply_chat_template by default again #42567
Conversation
Flip the default return type for `apply_chat_template` to match the underlying tokenizer.

[For maintainers] Suggested jobs to run (before merge): run-slow: blenderbot, bloom, cohere, gpt2, gpt_sw3

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
cc @LysandreJik - this was one of the V5 PRs before, do I need to do anything special with this one, or can we just merge it to |
zucchini-nlp left a comment:
great, it was already approved once so lgtm 😄
```python
if not tokenize:
    return_dict = False  # dicts are only returned by the tokenizer anyway
```
Makes me wonder: do we need to support the combination `tokenize=True, return_dict=False`, or can we deprecate/remove `return_dict` over time? I can't think of cases where users want a bare list of tokens as output.
Maybe we can get rid of it over time, but I think it's fine as a backward compatibility flag for now!
Sure, I meant after v5 + several more minor releases, and if users are fine with it.
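For context, here is a minimal sketch of the combinations being discussed. The checkpoint name is illustrative, not part of this PR; any tokenizer with a chat template behaves the same way.

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any chat-capable tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
chat = [{"role": "user", "content": "Hello!"}]

# tokenize=False: a formatted prompt string. return_dict is forced to
# False here, since dicts are only produced by the tokenizer anyway.
text = tokenizer.apply_chat_template(chat, tokenize=False)

# tokenize=True with the new v5 default: a BatchEncoding dict with
# input_ids, attention_mask, etc., matching the underlying tokenizer.
enc = tokenizer.apply_chat_template(chat, tokenize=True)

# tokenize=True, return_dict=False: the old default, a bare list of
# token ids, kept for now as a backward-compatibility flag.
ids = tokenizer.apply_chat_template(chat, tokenize=True, return_dict=False)
```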
[V5] Return a BatchEncoding dict from apply_chat_template by default again (huggingface#42567):

* Flip the default return type for `apply_chat_template` to match the underlying tokenizer
* Remove test_tokenization_for_chat tests, which no longer do anything useful
* Remove test_tokenization_for_chat tests, which no longer do anything useful
* Fix test_encode_message tests
* Fix test_encode_message tests
* nit fix
* Trigger tests
* Remove test_tokenization_for_chat
* make fixup
* Add a little test to make sure that doesn't happen again
* make fixup
This is basically PR #41626 again! Some of it got clobbered in the tokenizer refactor, but it's just as good the second time.
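To illustrate what "match the underlying tokenizer" means in practice, a small sketch (same illustrative checkpoint assumption as above): with the flipped default, chat-templated inputs come back in the same container type as a direct tokenizer call.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # illustrative
chat = [{"role": "user", "content": "Hello!"}]

# With the new default, both calls return the same container type, so
# chat-templated inputs can be fed to a model just like plain ones.
enc_chat = tokenizer.apply_chat_template(chat, tokenize=True)
enc_plain = tokenizer("Hello!")
assert type(enc_chat) is type(enc_plain)  # both are BatchEncoding
```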