[Zero-shot image classification pipeline] Remove tokenizer_kwargs #33174
NielsRogge wants to merge 2 commits into huggingface:main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
amyeroberts left a comment:
Thanks for opening this PR.
Thanks for opening this PR. We can't, and shouldn't, remove `tokenizer_kwargs` from the pipeline. It is both a breaking change to the pipeline's functionality, and it will break all existing Blip2 checkpoints saved with a BERT tokenizer.
| tokenizer_kwargs (`dict`, *optional*):
|     Additional dictionary of keyword arguments passed along to the tokenizer.
I don't think this should be removed. `tokenizer_kwargs` is a fairly standard input to `_sanitize_parameters` as a way to control tokenizer behaviour. Removing it would also be breaking for anyone using it in their pipelines.
We might be able to remove it in the tests, but it should stay here.
Hmm, could you clarify? `tokenizer_kwargs` was added in #29261, which is not yet in a stable release, hence removing this argument wouldn't break anything.
Ah, OK, in this case we can remove!
| else:
|     tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
|
| tokenizer.model_input_names = ["input_ids", "attention_mask"]
This is pretty hacky. It should, at the very least, only be done in the BertTokenizer branch to make explicit why it's needed (including an explanatory comment).
I don't see what's hacky about this: `model_input_names` is a genuine attribute of `PreTrainedTokenizer` that has been there since BERT/GPT-2 came out (it's just not used a lot). As can be seen here, a tokenizer will only return `token_type_ids` if it's present in `model_input_names`.
A less "hacky" way would be to pass `model_input_names` directly to the `from_pretrained` method, which the Transformers library also supports.
What I'm trying to say is that one can avoid lines like these by appropriately setting the `model_input_names` attribute of the tokenizer, which is something we can still do for the Blip2ImageTextRetrieval models on the hub, as they were just added and are not yet in a stable release.
Ah, yes, I see. Although both modifying `model_input_names` and `return_token_type_ids` are somewhat hacky: they modify a class attribute which might be depended upon for other behaviours / assumed to have certain properties. Passing `model_input_names` to the init is better, as this is more likely to correctly propagate any changes to any other dependent attributes / variables.
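To illustrate the behaviour being discussed, here is a minimal sketch of how `model_input_names` gates which fields a tokenizer returns. `ToyTokenizer` is a hypothetical stand-in for `PreTrainedTokenizer`, not the real transformers class; the filtering logic mirrors the behaviour described above:

```python
# Toy stand-in for PreTrainedTokenizer showing how model_input_names
# controls which encoding fields (e.g. token_type_ids) are returned.
# Not the real transformers implementation.

class ToyTokenizer:
    def __init__(self, model_input_names=None):
        # Accepting model_input_names at init (as from_pretrained allows)
        # is the less hacky route: the attribute is set once, up front,
        # rather than mutated on an already-constructed instance.
        self.model_input_names = model_input_names or [
            "input_ids", "token_type_ids", "attention_mask"
        ]

    def __call__(self, text):
        full = {
            "input_ids": [ord(c) for c in text],
            "token_type_ids": [0] * len(text),
            "attention_mask": [1] * len(text),
        }
        # Only fields listed in model_input_names are returned,
        # so dropping "token_type_ids" here suppresses that output.
        return {k: v for k, v in full.items() if k in self.model_input_names}


tok = ToyTokenizer(model_input_names=["input_ids", "attention_mask"])
out = tok("hi")
print(sorted(out))  # prints ['attention_mask', 'input_ids']: no token_type_ids
```

Under this model, setting `model_input_names` on the checkpoint (or passing it at load time) removes the need for per-call `tokenizer_kwargs` in the pipeline.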
The Blip2 checkpoints compatible with this pipeline are the following:
The pipeline was updated in a just-merged PR (which also pushed those 2 checkpoints). My point would be to just set the `model_input_names` of those checkpoints on the hub.
OK, as
It seems the
@NielsRogge In this case, we should just keep the `tokenizer_kwargs` argument. What should be done is document that we need to pass in `model_input_names` when loading the tokenizer.
I think
What does this PR do?
This PR is a follow-up of #29261: the `tokenizer_kwargs` argument is unnecessary, as one can just update the `model_input_names` attribute of the tokenizer.