Add ImageTextToText pipeline #29572
Conversation
I just tested it, thanks a lot for the PR! |
The pipeline is returning weird output for KOSMOS-2 and Fuyu-8B (and raises an error for Llava), and the output doesn't seem to be postprocessed.

```python
from transformers import pipeline
import torch

model_id = "microsoft/kosmos-2-patch14-224"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = pipeline(task="image-text-to-text", model=model_id, device=device)
outputs = pipe(
    images="http://images.cocodataset.org/val2017/000000039769.jpg",
    text="<grounding> Explain what the cats in the image are doing.",
    max_new_tokens=40,
)
print(outputs)
# [{'generated_text': '<image>. the, to and of as in I that\' for is was- on' it with The as at bet he have from by are " you his " this said not has an ( but had we her they will my or were their): up about out who one all been she can more would It</image><grounding> Explain what the cats in the image are doing.<phrase> The two cats</phrase><object><patch_index_0049><patch_index_0799></delimiter_of_multi_objects/><patch_index_0096><patch_index_1007></object> are laying on a pink blanket, sleeping next to each other. One cat is on its back, while the other is laying on its side'}]
```

I can reproduce it with Fuyu-8B as well; I feel like something around postprocessing might be causing it.

```python
from transformers import pipeline
import torch

model_id = "adept/fuyu-8b"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = pipeline(task="image-text-to-text", model=model_id, device=device)
outputs = pipe(
    images="http://images.cocodataset.org/val2017/000000039769.jpg",
    text="<grounding> Explain what the cats in the image are doing.",
    max_new_tokens=40,
)
print(outputs)
# [{'generated_text': '|SPEAKER||SPEAKER||SPEAKER| ... |NEWLINE|<s> <grounding> Explain what the cats in the image are doing.\x04 In the image, two cats are lying on a pink blanket, seemingly sleeping or resting.\n'}]
# (hundreds of repeated |SPEAKER| and |NEWLINE| tokens elided above)
```

Llava-1.5B raises an error.
@merveenoyan yes, the outputs are as expected; they match what you would also get by running the models directly. The reason KOSMOS-2 outputs a set of seemingly random tokens first is that it also outputs tokens for the image patches, which can be discarded. Similarly, Fuyu-8B first outputs a large set of |SPEAKER| tokens before the actual answer. The reason you're getting an error for Llava-1.5B is that you're not including an `<image>` token in the prompt; it's recommended to leverage this prompt template:
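The exact template text was cut off in the extraction above. As a hedged sketch, the commonly documented Llava-1.5 format places the `<image>` token after `USER:`; the wording below is an assumption and should be checked against the model card:

```python
# Sketch of a Llava-1.5-style prompt containing the required <image> token.
# The USER:/ASSISTANT: wording is the commonly used Llava-1.5 format, not
# taken from this thread; verify it against the checkpoint's model card.
question = "Explain what the cats in the image are doing."
prompt = f"USER: <image>\n{question}\nASSISTANT:"
print(prompt)
```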
@amyeroberts I'd like to assign you for review, but the CI keeps failing with unrelated changes (cc @ydshieh): `TypeError: snapshot_download() got an unexpected keyword argument '_from_pipeline'`. I've rebased with main already several times. Other than that, the PR is ready for review.
Thanks for working on this! Just skimming the PR, there's code which shouldn't be pushed; please make sure to look at the diff in the PR to catch this before asking for review. Once it's ready for review, I'll give a more in-depth look. At a high level, I'm a bit concerned about the addition of the auto model, since from the preprocessing and processing steps there isn't complete unity in the inputs and outputs. As discussed a week or two ago, it would be good to have a method for all the models' processors which handles the postprocessing of generation outputs for these models. cc @molbap.
Removed the script, PR should be ready for review now |
amyeroberts left a comment:
Thanks for working on this!
My main comment from before still applies here: I'm concerned that bundling these models together under this auto class doesn't make sense, in particular including the vision-encoder-decoder model type, as it's clear the inputs and their preparation aren't consistent, resulting in a lot of additional code needed within the pipeline itself. The pipeline shouldn't need to know about the model it's loading.
```diff
 [
     {
-        "generated_text": "<image> \nUSER: What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud?\nASSISTANT: Lava"
+        "generated_text": "\nUSER: What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud?\nASSISTANT: Lava"
```
If passing a prompt is deprecated, this should be completely removed
```diff
-    {"feature-extraction": GitModel, "image-to-text": GitForCausalLM, "text-generation": GitForCausalLM}
-    if is_torch_available()
-    else {}
+    {"feature-extraction": GitModel, "image-to-text": GitForCausalLM} if is_torch_available() else {}
```
Why remove the mapping to text-generation?
Because GIT is inherently a multimodal model, nobody would use it for text generation tasks (although it is compatible with them).
```python
        """
        return super().__call__(images, **kwargs)

    def preprocess(self, image=None, text=None, timeout=None):
```
It doesn't make sense for image or text to be None for this pipeline
This is due to Idefics only requiring text.
```python
        return preprocess_params, forward_kwargs, {}

    def __call__(self, images: Union[str, List[str], "Image.Image", List["Image.Image"]] = None, **kwargs):
```
Only `images` is defined in the signature.
```python
        if model_type == "vision-encoder-decoder" and self.processor.__class__.__name__ == "DonutProcessor":
            model_inputs["decoder_input_ids"] = self.processor.tokenizer(
```
This is another sign that we shouldn't be bundling these models together, given that the inputs can't be consistently prepared.
```python
        self.assertTrue(list(outputs)[0][0]["generated_text"].startswith(text))
        self.assertTrue(list(outputs)[1][0]["generated_text"].startswith(text))

    @slow
```
I don't think we want to add an integration test for every model for this pipeline. The tests are heavy as they involve loading large checkpoints, and the output of the model should be well captured by its own modeling integration tests. Instead, each of the models should have their small model equivalents tested to make sure that they functionally work with the pipeline.
```python
            [{"generated_text": "hello world 陽ɔ 劇र ♯ɔง 藥 ਾ"}],
        )

        outputs = pipe([image, image], text=text)
```
And can we pass `[image, image], [text, text]`?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Is this PR still planned? It would be useful, see e.g. these internal threads (thread1, thread2) @NielsRogge
Yes, it's still planned, I just don't have the bandwidth to work on it now. I hope @molbap @zucchini-nlp have. It would require the refactor of all multimodal processors to land first, though.
What does this PR do?
This PR adds a deprecation warning when users pass a text prompt to the `image-to-text` pipeline, and recommends leveraging a new `image-text-to-text` pipeline instead.

This way, we keep `image-to-text` for only that task, i.e. image as input and text as output. Example tasks include image captioning and optical character recognition (OCR).

The new `image-text-to-text` pipeline assumes input = image + text, output = text.
Usage
Usage is as follows:
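The usage snippet did not survive extraction; based on the examples earlier in this thread, a call would look roughly like the sketch below (the model ID and prompt are simply the ones exercised above, not a canonical recommendation):

```python
from transformers import pipeline

# Sketch assuming the kosmos-2 checkpoint used earlier in this thread;
# any checkpoint supported by the pipeline should work the same way.
pipe = pipeline(task="image-text-to-text", model="microsoft/kosmos-2-patch14-224")
outputs = pipe(
    images="http://images.cocodataset.org/val2017/000000039769.jpg",
    text="<grounding> Explain what the cats in the image are doing.",
    max_new_tokens=40,
)
print(outputs)  # a list of dicts, each with a "generated_text" key
```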
Supported models
This has been tested on: