[Zero-shot image classification pipeline] Remove tokenizer_kwargs by NielsRogge · Pull Request #33174 · huggingface/transformers

NielsRogge · 2024-08-28T16:29:55Z

What does this PR do?

This PR is a follow-up of #29261, namely the tokenizer_kwargs argument is unnecessary, one can just update the model_input_names attribute of the tokenizer.

HuggingFaceDocBuilderDev · 2024-08-28T16:49:37Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

amyeroberts

Thanks for opening this PR.

We can't nor shouldn't remove tokenizer_kwargs from the pipeline. It is both a breaking change in terms of the pipeline's functionality and will break for all existing Blip2 checkpoints saved with a bert tokenizer

amyeroberts · 2024-08-29T11:05:52Z

src/transformers/pipelines/zero_shot_image_classification.py

-            tokenizer_kwargs (`dict`, *optional*):
-                Additional dictionary of keyword arguments passed along to the tokenizer.
-


I don't think this should be removed. tokenizer_kwargs is a fairly standard input to _sanitize_parameters as a way to control tokenizer behaviour. It would also be breaking for anyone using this in their pipelines.

We might be able to remove in the tests but it should stay here

Hmm could you clarify? tokenizer_kwargs was added in #29261 which is not yet in a stable release. Hence removing this argument wouldn't break anything

Ah, OK, in this case we can remove!

amyeroberts · 2024-08-29T11:11:01Z

src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py

    else:
        tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")

+    tokenizer.model_input_names = ["input_ids", "attention_mask"]


This is pretty hacky. It should at the very least only be done in the BertTokenizer branch to make explicit why this is needed (including an explanatory comment)

I don't see what's hacky about this, model_input_names is a genuine attribute of PretrainedTokenizer that has been there since BERT/GPT-2 came out (it's not used a lot). As can be seen here, a tokenizer will return token_type_ids if it's present in the model_input_names.

A less "hacky" way would be to pass model_input_names directly in the from_pretrained method, which the Transformers library also supports.

What I'm trying to say is that one can avoid lines like these by appropriately setting the model_input_names attribute of the tokenizer, which is something we can still do for the Blip2ImageTextRetrieval models on the hub as they are just added and not yet in a stable release.

Ah, yes, I see. Although both modifying model_input_names and return_token_ids are somewhat hacky: they modify a class attribute which might be depended upon for other behaviours / assumed to have certain properties. Passing in model_input_names to the init is better, as this is more likely to correctly propagate any changes to any other dependant attributes / variables.

NielsRogge · 2024-08-29T12:04:49Z

It is both a breaking change in terms of the pipeline's functionality and will break for all existing Blip2 checkpoints saved with a bert tokenizer

The Blip2 checkpoints compatible with this pipeline are the following:

The pipeline was updated in a just-merged PR (which also pushed those 2 checkpoints). My point would be to just set the model_input_names of the tokenizer of those 2 repos appropriately which would ensure the tokenizer_kwargs isn't needed - I also don't see it present in any other pipeline

amyeroberts · 2024-08-29T15:51:13Z

OK, as tokenizer_kwargs were just added for a model on main then we can remove. Could you open PRs on the BLIP2 checkpoints to update to make sure they are compatible?

NielsRogge · 2024-08-29T16:38:27Z

It seems the model_input_names attribute doesn't get serialized in tokenizer_config.json (as it's a class attribute), cc @ArthurZucker. So I'm not sure it's possible to remove this.

amyeroberts · 2024-08-29T16:46:35Z

@NielsRogge In this case, we should just keep the tokenizer_kwargs in the pipeline, as they're a fairly standard input anyways. The issue regarding having to pass {return_token_ids: "False"} for BLIP will hopefully be something we can resolve once we can load processors in pipelines - #32514, as tokenizer.return_token_ids will be automatically set.

What should be done is document that we need to pass in {return_token_ids: "False"} to use these blip2 checkpoints in the pipeline

ArthurZucker · 2024-09-06T09:11:55Z

I think model_input_names should be serialized! (some model like mamba use GPTNeoXTokenizerFast, but don't use token_type_ids for example)

First draft

7d93bb5

NielsRogge requested a review from amyeroberts August 28, 2024 19:39

amyeroberts reviewed Aug 29, 2024

View reviewed changes

Pass attribute to from_pretrained

abdf5c5

		tokenizer_kwargs (`dict`, optional):
		Additional dictionary of keyword arguments passed along to the tokenizer.

Conversation

NielsRogge commented Aug 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Uh oh!

HuggingFaceDocBuilderDev commented Aug 28, 2024

Uh oh!

amyeroberts left a comment

Choose a reason for hiding this comment

Uh oh!

amyeroberts Aug 29, 2024

Choose a reason for hiding this comment

Uh oh!

NielsRogge Aug 29, 2024

Choose a reason for hiding this comment

Uh oh!

amyeroberts Aug 29, 2024

Choose a reason for hiding this comment

Uh oh!

amyeroberts Aug 29, 2024

Choose a reason for hiding this comment

Uh oh!

NielsRogge Aug 29, 2024

Choose a reason for hiding this comment

Uh oh!

NielsRogge Aug 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

amyeroberts Aug 29, 2024

Choose a reason for hiding this comment

Uh oh!

NielsRogge commented Aug 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amyeroberts commented Aug 29, 2024

Uh oh!

NielsRogge commented Aug 29, 2024

Uh oh!

amyeroberts commented Aug 29, 2024

Uh oh!

ArthurZucker commented Sep 6, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

NielsRogge commented Aug 28, 2024 •

edited

Loading

NielsRogge Aug 29, 2024 •

edited

Loading

NielsRogge commented Aug 29, 2024 •

edited

Loading