[Trainer] Rename tokenizer to processor, add deprecation #30102

Closed
NielsRogge wants to merge 9 commits into huggingface:main from NielsRogge:patch_trainer

Conversation

@NielsRogge
Contributor

@NielsRogge NielsRogge commented Apr 7, 2024

What does this PR do?

This PR renames the tokenizer argument that people can pass to the Trainer and TrainerCallback classes to processor. This allows people to pass tokenizers, image processors, feature extractors or multimodal processors, which the Trainer will then automatically save alongside the model when training.
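
For illustration, the intended usage would look roughly like the sketch below (the `processor` argument is what this PR proposes, not the released API; the checkpoint name is only an example):

```python
# Sketch of the usage this PR proposes; `processor=` is the proposed argument
# name, and the checkpoint below is only an example.
from transformers import (
    AutoImageProcessor,
    AutoModelForImageClassification,
    Trainer,
    TrainingArguments,
)

checkpoint = "google/vit-base-patch16-224-in21k"
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(checkpoint)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vit-finetuned"),
    processor=image_processor,  # previously: tokenizer=image_processor
)
# trainer.save_model() would then also write the image processor config,
# instead of requiring a separate image_processor.save_pretrained() call.
```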

Follow-up of #29896

To do:

  • perhaps use preprocessor instead of processor to avoid confusion with multimodal processors like CLIPProcessor?
  • add deprecation message for tokenizer attribute

@NielsRogge NielsRogge requested a review from ArthurZucker April 7, 2024 21:25
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NielsRogge NielsRogge removed the request for review from ArthurZucker April 8, 2024 07:30
Contributor

@amyeroberts amyeroberts left a comment

Thanks for updating - looks a lot better with processor!

  • Other examples that use tokenizer=tokenizer will also need to be updated, if any exist
  • Question over the deprecation cycle - v5 means it won't be removed for a long time - but as this is a fairly large change to a commonly used API, I think it's OK

callbacks: Optional[List[TrainerCallback]] = None,
optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None),
preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None,
tokenizer: Optional[PreTrainedTokenizerBase] = None,
Contributor

I think we should be careful when manipulating the init signature, see: #30126 - for new args we should put them at the end
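
As a side note, here is a toy illustration (not Trainer code itself) of why inserting a new argument in the middle of a signature is risky for callers that pass arguments positionally:

```python
# Toy illustration (not Trainer code): reordering or inserting parameters
# mid-signature silently re-binds positional arguments.
def train_v1(model, args, data_collator=None, tokenizer=None):
    return {"data_collator": data_collator, "tokenizer": tokenizer}

def train_v2(model, args, tokenizer=None, data_collator=None):  # reordered
    return {"data_collator": data_collator, "tokenizer": tokenizer}

print(train_v1("model", "args", "my_collator"))
# {'data_collator': 'my_collator', 'tokenizer': None}
print(train_v2("model", "args", "my_collator"))
# {'data_collator': None, 'tokenizer': 'my_collator'}  <- silently wrong
```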

@NielsRogge
Contributor Author

NielsRogge commented Apr 8, 2024

Thanks, I opened #30129 as a better quick fix, because there are a huge number of files with tokenizer=tokenizer and I'm not sure it's worth the effort (would require some offline discussion).

@amyeroberts
Contributor

@NielsRogge Although #30129 proposes a solution for the default data collator, we should still add this update. tokenizer=image_processor is a confusing API (as highlighted by a few issues from users), and considering we have more and more multimodal, audio and vision models, it's increasingly out-of-date.

Replacing tokenizer=tokenizer with processor=tokenizer might touch a lot of files, but I'd expect it to be fairly straightforward with some grep and find/replace.

@github-actions
Contributor

github-actions bot commented May 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@sanchit-gandhi
Contributor

Thanks for pointing me in the direction of this PR @amyeroberts! I personally don't think replacing tokenizer with processor is a good idea in the context of NLP users: many NLP users won't be familiar with the notion of a processor, since they only use the tokenizer to pre- and post-process their data. To me, having them set processor=tokenizer is just as confusing as having multimodal users set tokenizer=processor.

I would be more in favour of a design where we allow NLP users to pass tokenizer=tokenizer, and multimodal users to pass processor=processor, as suggested in #30864 (comment) -> to me, this is the cleanest and most intuitive API. Yes, it adds some complexity under the hood in the Trainer, since we now have to handle both objects, but this is a valid burden to make the API as intuitive as possible for the user.
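
A rough sketch of that dual-argument design (names and logic are hypothetical, not the actual Trainer implementation):

```python
# Hypothetical sketch of accepting both arguments and resolving them
# internally; this is not the actual Trainer code.
class ToyTrainer:
    def __init__(self, model=None, tokenizer=None, processor=None):
        if tokenizer is not None and processor is not None:
            raise ValueError("Pass either `tokenizer` or `processor`, not both.")
        # Whichever object was passed is the one saved alongside the model.
        self.processing_object = processor if processor is not None else tokenizer

# NLP users keep the familiar argument:
nlp_trainer = ToyTrainer(tokenizer="my_tokenizer")
# Multimodal users pass their processor instead:
mm_trainer = ToyTrainer(processor="my_clip_processor")
```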

@amyeroberts
Contributor

@sanchit-gandhi I think what's being suggested is actually a flag more like preprocessor or processing_class, which would replace tokenizer. As Trainer is such a commonly used object, in reality we're unlikely to fully remove or deprecate tokenizer anytime soon, but I don't think this means we have to add a whole new argument for every processing class.

):
if tokenizer is not None:
warnings.warn(
"The `tokenizer` argument is deprecated and will be removed in v5 of Transformers. You can use `processor` "
Contributor

@sanchit-gandhi sanchit-gandhi May 17, 2024

I believe this PR is proposing to deprecate tokenizer, @amyeroberts (as per this line and the title)! I think preprocessor is limited in the sense that we don't just want to save the pre-processing class (e.g. the image processor or feature extractor), but also the post-processing class (e.g. the tokenizer) => for this reason I think passing the processor is cleanest here (for multimodal models)

Contributor

@amyeroberts amyeroberts May 17, 2024

Ah, yes. Perhaps I wasn't clear. What I meant was that we would introduce a new argument that accepts all of the processing classes and should be used in preference to tokenizer. However, we wouldn't fully deprecate tokenizer for a long time and would still allow it to be used.
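
A rough sketch of that compromise, assuming a catch-all name like `processing_class` (one of the candidates floated above, not a settled choice):

```python
# Hypothetical sketch, not the actual Trainer code: one new catch-all argument
# is preferred, while `tokenizer` keeps working and maps onto it.
import warnings

class ToyTrainer:
    def __init__(self, model=None, processing_class=None, tokenizer=None):
        if tokenizer is not None:
            warnings.warn(
                "`tokenizer` still works but `processing_class` is preferred.",
                FutureWarning,
            )
            if processing_class is None:
                processing_class = tokenizer
        # Can be a tokenizer, image processor, feature extractor or processor.
        self.processing_class = processing_class
```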

Contributor

Great, thanks for the clarification - we're on the same page here! Happy to update PR #30864 accordingly, unless you want to see this one through to completion @NielsRogge?
