Trainer - deprecate tokenizer for processing_class #32385
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Love this (always confused about writing `tokenizer=...` for things that aren't tokenizers).
src/transformers/trainer.py (outdated)
This file contains the most important logic changes for review
src/transformers/trainer_callback.py (outdated)
I directly just swapped the names here, as the docstring says the object is just an internal class
@molbap If you have capacity, do you think you could do a first review of this? Arthur and Zack are super busy, so it would be good to have someone who knows the processors well review in their stead.
molbap left a comment:
Checked out the rework, nice. Saw that now you can't pass both `tokenizer` and `processing_class`; it will rightfully raise an error, much better. LGTM!
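A quick sketch of the behaviour described above; `model`, `training_args`, and `tok` are placeholders, and the exact exception type is an assumption here:

```python
from transformers import Trainer

# Passing both the deprecated `tokenizer` argument and the new
# `processing_class` is rejected rather than silently resolved.
try:
    Trainer(model=model, args=training_args, tokenizer=tok, processing_class=tok)
except (TypeError, ValueError) as err:
    print(f"Refused as expected: {err}")
```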
Asking for review from @SunMarc for Trainer as Zach is off. As this is a very public API, I'd like to get a few eyes on it, especially from people who know it well, to make sure it's OK!
ArthurZucker left a comment:
Clean, nothing to say! Thanks for this!
SunMarc left a comment:
Thanks for the PR @amyeroberts! LGTM!
tests/trainer/test_trainer.py (outdated)
Thanks for adding these nice tests!
We use positional args to obtain the model and optimizer; however, these have the un-needed tokenizer argument between them. Due to recent changes, the tokenizer arg is now renamed to processing_class, see:
+ huggingface/trl#2162
+ huggingface/transformers#32385
leading to an unexpected breakdown of the scanner. The line relevant to us is here: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_callback.py#L523. Since we don't depend on this arg anyway, switch to using model and opt from the kwargs.
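A minimal sketch of that workaround, assuming a hypothetical callback class (this is not the referenced project's actual code): read the model and optimizer from the keyword arguments the `Trainer` forwards to callback events, instead of relying on their positional order.

```python
from transformers import TrainerCallback


class ScannerCallback(TrainerCallback):  # hypothetical name, for illustration only
    def on_train_begin(self, args, state, control, **kwargs):
        # The Trainer forwards the model, optimizer, etc. as keyword arguments,
        # so the tokenizer -> processing_class rename sitting between them in
        # the positional order no longer matters here.
        model = kwargs.get("model")
        optimizer = kwargs.get("optimizer")
        if model is not None and optimizer is not None:
            print(type(model).__name__, type(optimizer).__name__)
        return control
```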
`_pad_tensors_to_max_len` in `Seq2SeqTrainer` still uses `tokenizer` instead of `processing_class`.
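A rough sketch of the follow-up being requested, paraphrasing the method's structure rather than quoting the exact `Seq2SeqTrainer` source: the padding helper should look up the pad token on `self.processing_class` instead of the deprecated `self.tokenizer`.

```python
import torch


def _pad_tensors_to_max_len(self, tensor, max_length):
    # Prefer the renamed attribute; `self.tokenizer` is deprecated by this PR.
    if self.processing_class is not None and hasattr(self.processing_class, "pad_token_id"):
        pad_token_id = (
            self.processing_class.pad_token_id
            if self.processing_class.pad_token_id is not None
            else self.processing_class.eos_token_id
        )
    elif self.model.config.pad_token_id is not None:
        pad_token_id = self.model.config.pad_token_id
    else:
        raise ValueError("A pad token id must be set to pad tensors to max length")

    # Create a tensor filled with the pad token and copy the original values in.
    padded_tensor = pad_token_id * torch.ones(
        (tensor.shape[0], max_length), dtype=tensor.dtype, device=tensor.device
    )
    padded_tensor[:, : tensor.shape[-1]] = tensor
    return padded_tensor
```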
* Trainer - deprecate tokenizer for processing_class
* Extend change across Seq2Seq trainer and docs
* Add tests
* Update to FutureWarning and add deprecation version
For reviewers: most files touched here are just updates to the documentation. Unfortunately, a lot of the diff is just the stripping of unnecessary whitespace (which my editor does automatically).
The most important changes are in:
What does this PR do?
At the moment, if we wish for a processing class, e.g. an image processor, to be saved alongside the model, e.g. when pushing to the Hub, we have to do something like the following:
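(A sketch of that usage; `model` and `training_args` are placeholders, and the checkpoint name is only an example.)

```python
from transformers import AutoImageProcessor, Trainer

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# The image processor has to be passed under the `tokenizer` argument so that
# it is saved and pushed to the Hub alongside the model.
trainer = Trainer(model=model, args=training_args, tokenizer=image_processor)
```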
This causes a lot of confusion for users (an image processor is not a tokenizer).
Previous efforts were made to add individual classes to the trainer such that one could do:
`trainer = Trainer(model, args, image_processor=image_processor, ...)`. However, this has some drawbacks: for instance, it is ambiguous which signature to settle on. Should it be `Trainer(model, args, processor=processor)` or should it be `Trainer(model, args, tokenizer=tokenizer, image_processor=image_processor)`?

This PR therefore just deprecates `tokenizer` in favour of a more general `processing_class` argument, which can be any class with a `from_pretrained` method.
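With the deprecation in place, the earlier image-processor example would instead look roughly like this (again a sketch, with placeholder `model` and `training_args`):

```python
from transformers import AutoImageProcessor, Trainer

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Any object with a `from_pretrained` method can be passed here and will be
# saved alongside the model, e.g. when pushing to the Hub.
trainer = Trainer(model=model, args=training_args, processing_class=image_processor)

# The old argument still works for now, but emits a FutureWarning.
trainer = Trainer(model=model, args=training_args, tokenizer=image_processor)
```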
Relevant PRs and discussions:
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.