Add ImageTextToText pipeline #29572
Conversation
I just tested it, thanks a lot for the PR! |
The pipeline is returning weird output for KOSMOS-2 and Fuyu-8B (and raises an error for Llava), and the output doesn't seem to be postprocessed.

```python
from transformers import pipeline
import torch

model_id = "microsoft/kosmos-2-patch14-224"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = pipeline(task="image-text-to-text", model=model_id, device=device)
outputs = pipe(
    images="http://images.cocodataset.org/val2017/000000039769.jpg",
    text="<grounding> Explain what the cats in the image are doing.",
    max_new_tokens=40,
)
print(outputs)
# [{'generated_text': '<image>. the, to and of as in I that\' for is was- on' it with The as at bet he have from by are " you his " this said not has an ( but had we her they will my or were their): up about out who one all been she can more would It</image><grounding> Explain what the cats in the image are doing.<phrase> The two cats</phrase><object><patch_index_0049><patch_index_0799></delimiter_of_multi_objects/><patch_index_0096><patch_index_1007></object> are laying on a pink blanket, sleeping next to each other. One cat is on its back, while the other is laying on its side'}]
```

I can reproduce it with Fuyu-8B as well; I feel like something around postprocessing might be causing it.

```python
from transformers import pipeline
import torch

model_id = "adept/fuyu-8b"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = pipeline(task="image-text-to-text", model=model_id, device=device)
outputs = pipe(
    images="http://images.cocodataset.org/val2017/000000039769.jpg",
    text="<grounding> Explain what the cats in the image are doing.",
    max_new_tokens=40,
)
print(outputs)
# [{'generated_text': '|SPEAKER||SPEAKER||SPEAKER| ... |NEWLINE|<s> <grounding> Explain what the cats in the image are doing.\x04 In the image, two cats are lying on a pink blanket, seemingly sleeping or resting.\n'}]
# (hundreds of repeated |SPEAKER| and |NEWLINE| tokens elided above)
```

Llava-1.5B raises an error.
@merveenoyan yes, the outputs are as expected; they match what you would also get by running the models directly. The reason KOSMOS-2 outputs a set of seemingly random tokens first is that it also outputs tokens for the image patches, which can be discarded. Similarly, Fuyu-8B first outputs a large set of |SPEAKER| tokens before the actual answer. The reason you're getting an error for Llava-1.5B is that you're not including an `<image>` token in the prompt; it's recommended to leverage this prompt template:
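The exact template text was cut off in the extraction above. As a hedged sketch, the commonly documented Llava-1.5 format places the `<image>` token after `USER:`; the wording below is an assumption and should be checked against the model card:

```python
# Sketch of a Llava-1.5-style prompt containing the required <image> token.
# The USER:/ASSISTANT: wording is the commonly used Llava-1.5 format, not
# taken from this thread; verify it against the checkpoint's model card.
question = "Explain what the cats in the image are doing."
prompt = f"USER: <image>\n{question}\nASSISTANT:"
print(prompt)
```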
@amyeroberts I'd like to assign you for review, but the CI keeps failing with unrelated changes (cc @ydshieh): `TypeError: snapshot_download() got an unexpected keyword argument '_from_pipeline'`. I've rebased with main already several times. Other than that, the PR is ready for review.
Thanks for working on this! Just skimming the PR, there's code which shouldn't be pushed; please make sure to look at the diff in the PR to catch this before asking for review. Once it's ready for review, I'll give a more in-depth look. At a high level, I'm a bit concerned about the addition of the auto model, since from the preprocessing and processing steps there isn't complete unity in the inputs and outputs. As discussed a week or two ago, it would be good to have a method for all the models' processors which handles the postprocessing of generation outputs for these models. cc @molbap.
Removed the script, PR should be ready for review now |
amyeroberts left a comment:
Thanks for working on this!
My main comment from before still applies here: I'm concerned that bundling these models together under this auto class doesn't make sense, in particular including the vision-encoder-decoder model type, as it's clear the inputs and their preparation aren't consistent, resulting in a lot of additional code needed within the pipeline itself. The pipeline shouldn't need to know about the model it's loading.
```diff
 [
     {
-        "generated_text": "<image> \nUSER: What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud?\nASSISTANT: Lava"
+        "generated_text": "\nUSER: What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud?\nASSISTANT: Lava"
```
If passing a prompt is deprecated, this should be completely removed
```diff
-    {"feature-extraction": GitModel, "image-to-text": GitForCausalLM, "text-generation": GitForCausalLM}
-    if is_torch_available()
-    else {}
+    {"feature-extraction": GitModel, "image-to-text": GitForCausalLM} if is_torch_available() else {}
```
Why remove the mapping to text-generation?
Because GIT is inherently a multimodal model, nobody would use it for text generation tasks (although it is compatible with them).
```python
        """
        return super().__call__(images, **kwargs)

    def preprocess(self, image=None, text=None, timeout=None):
```
It doesn't make sense for image or text to be None for this pipeline
This is due to Idefics only requiring text.
```python
        return preprocess_params, forward_kwargs, {}

    def __call__(self, images: Union[str, List[str], "Image.Image", List["Image.Image"]] = None, **kwargs):
```
Only `images` is defined in the signature.
```python
        if model_type == "vision-encoder-decoder" and self.processor.__class__.__name__ == "DonutProcessor":
            model_inputs["decoder_input_ids"] = self.processor.tokenizer(
```
This is another sign that we shouldn't be bundling these models together, given that the inputs can't be consistently prepared.
```python
        self.assertTrue(list(outputs)[0][0]["generated_text"].startswith(text))
        self.assertTrue(list(outputs)[1][0]["generated_text"].startswith(text))

    @slow
```
I don't think we want to add an integration test for every model for this pipeline. The tests are heavy as they involve loading large checkpoints, and the output of the model should be well captured by its own modeling integration tests. Instead, each of the models should have their small model equivalents tested to make sure that they functionally work with the pipeline.
```python
            [{"generated_text": "hello world 陽ɔ 劇र ♯ɔง 藥 ਾ"}],
        )

        outputs = pipe([image, image], text=text)
```
And can we pass `[image, image], [text, text]`?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Is this PR still planned? It would be useful, see e.g. these internal threads (thread1, thread2) @NielsRogge
Yes, it's still planned, I just don't have the bandwidth to work on it now. I hope @molbap @zucchini-nlp have. It would require the refactor of all multimodal processors to land first, though.
What does this PR do?
This PR adds a deprecation warning when users pass a text prompt to the `image-to-text` pipeline, and recommends leveraging a new `image-text-to-text` pipeline instead.

This way, we keep `image-to-text` for only that task, i.e. image as input and text as output. Example tasks include image captioning and optical character recognition (OCR).

The new `image-text-to-text` pipeline assumes input = image + text, output = text.
Usage
Usage is as follows:
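The usage snippet did not survive extraction; based on the examples earlier in this thread, a call would look roughly like the sketch below (the model ID and prompt are simply the ones exercised above, not a canonical recommendation):

```python
from transformers import pipeline

# Sketch assuming the kosmos-2 checkpoint used earlier in this thread;
# any checkpoint supported by the pipeline should work the same way.
pipe = pipeline(task="image-text-to-text", model="microsoft/kosmos-2-patch14-224")
outputs = pipe(
    images="http://images.cocodataset.org/val2017/000000039769.jpg",
    text="<grounding> Explain what the cats in the image are doing.",
    max_new_tokens=40,
)
print(outputs)  # a list of dicts, each with a "generated_text" key
```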
Supported models
This has been tested on: