[Trainer] Rename tokenizer to processor, add deprecation #30102
NielsRogge wants to merge 9 commits into huggingface:main from ...
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
amyeroberts
left a comment
Thanks for updating - looks a lot better with processor!
- Other examples which have `tokenizer=tokenizer` will also need to be updated if they exist (see the sketch below for the kind of call-site change)
- Question over the deprecation cycle - v5 means it won't be removed for a long time - but as this is a fairly large change from a commonly used API I think it's OK
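A hedged sketch of that call-site change, assuming this PR's proposed `processor` keyword; `model`, `training_args` and `tokenizer` are placeholders for whatever an example script already defines, not names taken from this PR:

```python
# Illustration only, not taken from this PR's diff.

# before (current examples)
#   trainer = Trainer(model=model, args=training_args, tokenizer=tokenizer)

# after (with this PR)
#   trainer = Trainer(model=model, args=training_args, processor=tokenizer)
```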
```python
callbacks: Optional[List[TrainerCallback]] = None,
optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None),
preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None,
tokenizer: Optional[PreTrainedTokenizerBase] = None,
```
I think we should be careful when manipulating the init signature, see: #30126 - for new args we should put them at the end
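To illustrate the concern with a self-contained toy sketch (these classes are not the real Trainer; they only show why a new keyword argument should be appended rather than inserted mid-signature):

```python
class TrainerOld:
    # Existing signature: positional callers may rely on this exact order.
    def __init__(self, model, args=None, data_collator=None, tokenizer=None):
        self.tokenizer = tokenizer


class TrainerInsertedMidSignature:
    # New `processor` argument inserted *before* `tokenizer`: an existing
    # positional call now silently binds the tokenizer to `processor`.
    def __init__(self, model, args=None, data_collator=None, processor=None, tokenizer=None):
        self.processor = processor
        self.tokenizer = tokenizer


class TrainerAppended:
    # New argument appended at the end: old positional calls keep working.
    def __init__(self, model, args=None, data_collator=None, tokenizer=None, processor=None):
        self.tokenizer = tokenizer
        self.processor = processor


# An old call site that passed everything positionally:
assert TrainerAppended("model", "args", None, "my_tokenizer").tokenizer == "my_tokenizer"
assert TrainerInsertedMidSignature("model", "args", None, "my_tokenizer").tokenizer is None  # silently broken
```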
Thanks, I opened #30129 as a better quick fix because there are a huge number of files having `tokenizer=tokenizer`.
@NielsRogge Although #30129 proposes a solution for the default data collator, we should still add this update. Replacing ...
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Thanks for pointing me in the direction of this PR @amyeroberts! I personally don't think replacing ... I would be more in favour of a design where we allow NLP users to pass ...
@sanchit-gandhi I think what's being suggested is actually a flag more like ...
```python
):
    if tokenizer is not None:
        warnings.warn(
            "The `tokenizer` argument is deprecated and will be removed in v5 of Transformers. You can use `processor` "
```
I believe this PR is proposing deprecating `tokenizer` @amyeroberts! (as per this line and the title) I think `preprocessor` is limited in the sense that we don't just want to save the pre-processing class (e.g. image processor or feature extractor), but also the post-processing class (e.g. the tokenizer) => for this reason I think passing the `processor` is cleanest here (for multimodal models)
Ah, yes. Perhaps I wasn't clear. What I meant was we would introduce a new argument, which should be used in preference to `tokenizer` and which would accept all of the processing classes. However, we wouldn't fully deprecate `tokenizer` for a long time, and would still allow it to be used.
Great, thanks for the clarification - on the same page here! Happy to update PR #30864 accordingly, unless you want to see this one to completion @NielsRogge?
What does this PR do?
This PR updates the `tokenizer` argument which people can pass to the Trainer and TrainerCallback classes to `processor` instead. This allows people to pass tokenizers, image processors, feature extractors or multimodal processors, which the Trainer will then automatically save along with the model when training.

Follow-up of #29896
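As a usage sketch, assuming this PR is merged (the checkpoint name and output directory below are placeholders, not from the PR):

```python
from transformers import AutoModelForVision2Seq, AutoProcessor, Trainer, TrainingArguments

# A multimodal checkpoint, used purely as a placeholder.
checkpoint = "some-org/some-multimodal-checkpoint"
processor = AutoProcessor.from_pretrained(checkpoint)  # wraps tokenizer + image processor
model = AutoModelForVision2Seq.from_pretrained(checkpoint)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="outputs"),
    processor=processor,  # proposed argument (previously `tokenizer`)
)

# After training and saving, the processor is stored next to the model weights,
# so the full pipeline can be reloaded from the output directory, e.g.
# AutoProcessor.from_pretrained("outputs").
```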
To do:
- `preprocessor` instead of `processor` to avoid confusion with multimodal processors like `CLIPProcessor`?