[Trainer] Rename tokenizer to processor, add deprecation #30102

Closed
NielsRogge wants to merge 9 commits into huggingface:main from NielsRogge:patch_trainer

Conversation

@NielsRogge
Contributor

@NielsRogge NielsRogge commented Apr 7, 2024

What does this PR do?

This PR renames the tokenizer argument that people can pass to the Trainer and TrainerCallback classes to processor. This allows people to pass tokenizers, image processors, feature extractors or multimodal processors, which the Trainer will then automatically save alongside the model when training.
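
For illustration, the intended usage would look roughly like the sketch below (the `processor` argument is what this PR proposes, not the released API; the checkpoint name is only an example):

```python
# Sketch of the usage this PR proposes; `processor=` is the proposed argument
# name, and the checkpoint below is only an example.
from transformers import (
    AutoImageProcessor,
    AutoModelForImageClassification,
    Trainer,
    TrainingArguments,
)

checkpoint = "google/vit-base-patch16-224-in21k"
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(checkpoint)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vit-finetuned"),
    processor=image_processor,  # previously: tokenizer=image_processor
)
# trainer.save_model() would then also write the image processor config,
# instead of requiring a separate image_processor.save_pretrained() call.
```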

Follow-up of #29896

To do:

  • perhaps use preprocessor instead of processor to avoid confusion with multimodal processors like CLIPProcessor?
  • add deprecation message for tokenizer attribute

@NielsRogge NielsRogge requested a review from ArthurZucker April 7, 2024 21:25
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NielsRogge NielsRogge removed the request for review from ArthurZucker April 8, 2024 07:30
Contributor

@amyeroberts amyeroberts left a comment

Thanks for updating - looks a lot better with processor!

  • Other examples that use tokenizer=tokenizer will also need to be updated, if any exist
  • Question over the deprecation cycle - v5 means it won't be removed for a long time - but as this is a fairly large change to a commonly used API, I think it's OK

callbacks: Optional[List[TrainerCallback]] = None,
optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None),
preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None,
tokenizer: Optional[PreTrainedTokenizerBase] = None,
Contributor

I think we should be careful when manipulating the init signature, see: #30126 - for new args we should put them at the end
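
As a side note, here is a toy illustration (not Trainer code itself) of why inserting a new argument in the middle of a signature is risky for callers that pass arguments positionally:

```python
# Toy illustration (not Trainer code): reordering or inserting parameters
# mid-signature silently re-binds positional arguments.
def train_v1(model, args, data_collator=None, tokenizer=None):
    return {"data_collator": data_collator, "tokenizer": tokenizer}

def train_v2(model, args, tokenizer=None, data_collator=None):  # reordered
    return {"data_collator": data_collator, "tokenizer": tokenizer}

print(train_v1("model", "args", "my_collator"))
# {'data_collator': 'my_collator', 'tokenizer': None}
print(train_v2("model", "args", "my_collator"))
# {'data_collator': None, 'tokenizer': 'my_collator'}  <- silently wrong
```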

@NielsRogge
Contributor Author

NielsRogge commented Apr 8, 2024

Thanks, I opened #30129 as a better quick fix, because there are a huge number of files with tokenizer=tokenizer and I'm not sure it's worth the effort (would require some offline discussion).

@amyeroberts
Contributor

@NielsRogge Although #30129 proposes a solution for the default data collator, we should still add this update. tokenizer=image_processor is a confusing API (as highlighted by a few issues from users), and considering we have more and more multimodal, audio and vision models, it's increasingly out-of-date.

Replacing tokenizer=tokenizer with processor=tokenizer might touch a lot of files, but I'd expect it to be fairly straightforward with some grep and find/replace.

@github-actions
Contributor

github-actions bot commented May 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@sanchit-gandhi
Contributor

Thanks for pointing me in the direction of this PR @amyeroberts! I personally don't think replacing tokenizer with processor is a good idea in the context of NLP users: many NLP users won't be familiar with the notion of a processor, since they only use the tokenizer to pre- and post-process their data. To me, having them set processor=tokenizer is just as confusing as having multimodal users set tokenizer=processor.

I would be more in favour of a design where we allow NLP users to pass tokenizer=tokenizer, and multimodal users to pass processor=processor, as suggested in #30864 (comment) -> to me, this is the cleanest and most intuitive API. Yes, it adds some complexity under the hood in the Trainer, since we now have to handle both objects, but this is a valid burden to make the API as intuitive as possible for the user.
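
A rough sketch of that dual-argument design (names and logic are hypothetical, not the actual Trainer implementation):

```python
# Hypothetical sketch of accepting both arguments and resolving them
# internally; this is not the actual Trainer code.
class ToyTrainer:
    def __init__(self, model=None, tokenizer=None, processor=None):
        if tokenizer is not None and processor is not None:
            raise ValueError("Pass either `tokenizer` or `processor`, not both.")
        # Whichever object was passed is the one saved alongside the model.
        self.processing_object = processor if processor is not None else tokenizer

# NLP users keep the familiar argument:
nlp_trainer = ToyTrainer(tokenizer="my_tokenizer")
# Multimodal users pass their processor instead:
mm_trainer = ToyTrainer(processor="my_clip_processor")
```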

@amyeroberts
Contributor

@sanchit-gandhi I think what's being suggested is actually a flag more like preprocessor or processing_class, which would replace tokenizer. As Trainer is such a commonly used object, in reality we're unlikely to fully remove or deprecate tokenizer anytime soon, but I don't think this means we have to add a whole new argument for every processing class.

):
if tokenizer is not None:
warnings.warn(
"The `tokenizer` argument is deprecated and will be removed in v5 of Transformers. You can use `processor` "
Contributor

@sanchit-gandhi sanchit-gandhi May 17, 2024

I believe this PR is proposing to deprecate tokenizer, @amyeroberts (as per this line and the title)! I think preprocessor is limited in the sense that we don't just want to save the pre-processing class (e.g. the image processor or feature extractor), but also the post-processing class (e.g. the tokenizer) => for this reason I think passing the processor is cleanest here (for multimodal models)

Contributor

@amyeroberts amyeroberts May 17, 2024

Ah, yes. Perhaps I wasn't clear. What I meant was that we would introduce a new argument that accepts all of the processing classes and should be used in preference to tokenizer. However, we wouldn't fully deprecate tokenizer for a long time and would still allow it to be used.
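
A rough sketch of that compromise, assuming a catch-all name like `processing_class` (one of the candidates floated above, not a settled choice):

```python
# Hypothetical sketch, not the actual Trainer code: one new catch-all argument
# is preferred, while `tokenizer` keeps working and maps onto it.
import warnings

class ToyTrainer:
    def __init__(self, model=None, processing_class=None, tokenizer=None):
        if tokenizer is not None:
            warnings.warn(
                "`tokenizer` still works but `processing_class` is preferred.",
                FutureWarning,
            )
            if processing_class is None:
                processing_class = tokenizer
        # Can be a tokenizer, image processor, feature extractor or processor.
        self.processing_class = processing_class
```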

Contributor

Great, thanks for the clarification - we're on the same page here! Happy to update PR #30864 accordingly, unless you want to see this one through to completion @NielsRogge?
