
Add ImageTextToText pipeline #29572

Closed
NielsRogge wants to merge 33 commits into huggingface:main from NielsRogge:feature/use_processor

Conversation

@NielsRogge (Contributor) commented Mar 10, 2024

What does this PR do?

This PR adds a deprecation warning when users pass a text prompt to the image-to-text pipeline, and recommends using the new image-text-to-text pipeline instead.

This way, image-to-text is kept for that task only: image as input, text as output. Example tasks include image captioning and optical character recognition (OCR).

The new image-text-to-text pipeline assumes input = image + text, output = text.
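For illustration, the deprecation path could look roughly like this (a minimal sketch, not the actual diff from this PR; the helper name and warning wording are assumptions):

import warnings

def _warn_on_prompt(prompt):
    # Sketch: if a text prompt is passed to the image-to-text pipeline,
    # warn and point users to the new image-text-to-text pipeline instead.
    if prompt is not None:
        warnings.warn(
            "Passing a text prompt to the image-to-text pipeline is deprecated; "
            "please use the image-text-to-text pipeline instead.",
            FutureWarning,
        )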

Usage

Usage is as follows:

from transformers import pipeline


# Models that work (uncomment one; only one model_id should be active at a time):
# model_id = "microsoft/git-base-coco"
# model_id = "Salesforce/blip-image-captioning-base"
# model_id = "Salesforce/blip2-opt-2.7b"  # ok, although it doesn't include the text prompt in the output
# model_id = "Salesforce/instructblip-flan-t5-xl"  # ok, although it doesn't include the text prompt in the output
model_id = "llava-hf/llava-1.5-7b-hf"  # matches the chat-style prompt below
# model_id = "adept/fuyu-8b"
# model_id = "google/pix2struct-textcaps-base"
# model_id = "microsoft/udop-large"
# model_id = "naver-clova-ix/donut-base-finetuned-docvqa"
# model_id = "microsoft/kosmos-2-patch14-224"

pipe = pipeline(task="image-text-to-text", model=model_id)

outputs = pipe(
    images="http://images.cocodataset.org/val2017/000000039769.jpg",
    text="USER: <image>\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT:",
    max_new_tokens=200,
)

print(outputs)

Supported models

This has been tested on:

  • GIT
  • BLIP
  • BLIP-2
  • IDEFICS
  • InstructBLIP
  • LLaVa
  • Fuyu
  • Pix2Struct
  • UDOP
  • Donut
  • KOSMOS-2

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@merveenoyan (Contributor)

I just tested it, thanks a lot for the PR!

@merveenoyan (Contributor)

The pipeline returns weird output for KOSMOS-2 and Fuyu-8B (and raises an error for LLaVa), and the output doesn't seem to be postprocessed.

from transformers import pipeline
import torch 

model_id = "microsoft/kosmos-2-patch14-224"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = pipeline(task="image-text-to-text", model=model_id, device=device)

outputs = pipe(
    images="http://images.cocodataset.org/val2017/000000039769.jpg",
    text="<grounding> Explain what the cats in the image are doing.",
    max_new_tokens=40
)

print(outputs)
# [{'generated_text': '<image>. the, to and of as in I that\' for is was- on' it with The as at bet he have from by are " you his " this said not has an ( but had we her they will my or were their): up about out who one all been she can more would It</image><grounding> Explain what the cats in the image are doing.<phrase> The two cats</phrase><object><patch_index_0049><patch_index_0799></delimiter_of_multi_objects/><patch_index_0096><patch_index_1007></object> are laying on a pink blanket, sleeping next to each other. One cat is on its back, while the other is laying on its side'}]

I can reproduce it with Fuyu-8B as well; I feel like something around postprocessing might be causing it.

from transformers import pipeline
import torch 

model_id = "adept/fuyu-8b"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = pipeline(task="image-text-to-text", model=model_id, device=device)

outputs = pipe(
    images="http://images.cocodataset.org/val2017/000000039769.jpg",
    text="<grounding> Explain what the cats in the image are doing.",
    max_new_tokens=40
)

print(outputs)

# [{'generated_text': '|SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||SPEAKER||NEWLINE|<s> <grounding> Explain what the cats in the image are doing.\x04 In the image, two cats are lying on a pink blanket, seemingly sleeping or resting.\n'}]

LLaVa-1.5 raises an error.

from transformers import pipeline
import torch 

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mm_pipeline = pipeline("image-text-to-text", model="llava-hf/llava-1.5-7b-hf", device=device)

mm_pipeline("https://huggingface.co/spaces/llava-hf/llava-4bit/resolve/main/examples/baklava.png", text="How to make this pastry?")

ValueError: The input provided to the model are wrong. The number of image tokens is 0 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.

NielsRogge force-pushed the feature/use_processor branch from e4541aa to cfc8a13 on March 19, 2024 at 21:30
@NielsRogge (Contributor, Author) commented Mar 20, 2024

@merveenoyan yes, the outputs are as expected; they match what you would also get by running processor.batch_decode(generated_ids) for these models.

The reason KOSMOS-2 outputs a set of random tokens first is that it also outputs tokens for the image patches, which can be discarded.

Similarly, Fuyu-8B first outputs a large set of |SPEAKER| tokens before the actual answer.
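For illustration, one generic way to discard those prefix tokens when calling the model directly is to decode only the newly generated ids (a minimal sketch using KOSMOS-2 and its processor; this is not the pipeline's own postprocessing):

from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import requests

model_id = "microsoft/kosmos-2-patch14-224"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<grounding> Explain what the cats in the image are doing."

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=40)

# Drop the echoed prompt and image-patch tokens by slicing off the input length,
# then decode only the newly generated part.
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])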

The reason you're getting an error for LLaVa-1.5 is that you're not including an <image> token in the prompt; it's recommended to use this prompt template: "USER: <image>\nWhat are these?\nASSISTANT:".
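For reference, here is the earlier LLaVa example adjusted to that template (a sketch based on the pipeline usage shown above; the max_new_tokens value is arbitrary):

from transformers import pipeline
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = pipeline("image-text-to-text", model="llava-hf/llava-1.5-7b-hf", device=device)

# The <image> placeholder tells the processor where the image tokens go.
outputs = pipe(
    images="https://huggingface.co/spaces/llava-hf/llava-4bit/resolve/main/examples/baklava.png",
    text="USER: <image>\nHow to make this pastry?\nASSISTANT:",
    max_new_tokens=200,
)
print(outputs)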

NielsRogge force-pushed the feature/use_processor branch from f6ba64d to 8acf164 on March 22, 2024 at 12:13
@NielsRogge (Contributor, Author)

@amyeroberts I'd like to assign you for review, but the CI keeps failing with an error unrelated to my changes (cc @ydshieh):

TypeError: snapshot_download() got an unexpected keyword argument '_from_pipeline'

I've already rebased on main several times. Other than that, the PR is ready for review.

NielsRogge requested a review from amyeroberts on March 26, 2024 at 11:26
@amyeroberts (Contributor)

Thanks for working on this!

Just skimming the PR, there's code that shouldn't be pushed - please make sure to look at the diff in the PR to catch this before asking for review.

Once it's ready for review, I'll take a more in-depth look. At a high level, I'm a bit concerned about adding the auto model when the preprocessing and processing steps show there isn't complete unity in the inputs and outputs, e.g. adding decoder_input_ids for Donut.

As discussed a week or two ago, it would be good to have a method on all of these models' processors that handles the postprocessing of generation outputs. cc @molbap.
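For illustration, such a shared hook might look roughly like this (a hypothetical sketch; the name post_process_generation and its signature are assumptions, not an existing unified API):

def post_process_generation(processor, generated_ids, input_ids=None, skip_special_tokens=True):
    """Hypothetical shared hook: strip the echoed prompt/image tokens (if any)
    and decode only the newly generated part, so the pipeline stays model-agnostic."""
    if input_ids is not None:
        generated_ids = generated_ids[:, input_ids.shape[1]:]
    return processor.batch_decode(generated_ids, skip_special_tokens=skip_special_tokens)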

@NielsRogge (Contributor, Author)

Removed the script; the PR should be ready for review now.

@amyeroberts (Contributor) left a comment

Thanks for working on this!

My main comment from before still applies here: I'm concerned that bundling these models together under this auto class doesn't make sense, in particular including the vision-encoder-decoder model type, as it's clear the inputs and their preparation aren't consistent, resulting in a lot of additional code needed within the pipeline itself. The pipeline shouldn't need to know about the model it's loading.

[
{
"generated_text": "<image> \nUSER: What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud?\nASSISTANT: Lava"
"generated_text": "\nUSER: What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud?\nASSISTANT: Lava"
Contributor

If passing a prompt is deprecated, this should be completely removed

{"feature-extraction": GitModel, "image-to-text": GitForCausalLM, "text-generation": GitForCausalLM}
if is_torch_available()
else {}
{"feature-extraction": GitModel, "image-to-text": GitForCausalLM} if is_torch_available() else {}
Contributor

Why remove the mapping to text-generation?

Contributor Author

Because GIT is inherently a multimodal model; nobody would use it for text-generation tasks (although it is compatible with them).

"""
return super().__call__(images, **kwargs)

def preprocess(self, image=None, text=None, timeout=None):
Contributor

It doesn't make sense for image or text to be None for this pipeline

Contributor Author

This is because IDEFICS only requires text.


return preprocess_params, forward_kwargs, {}

def __call__(self, images: Union[str, List[str], "Image.Image", List["Image.Image"]] = None, **kwargs):
Contributor

only images is defined in the signature

Comment on lines +146 to +147
if model_type == "vision-encoder-decoder" and self.processor.__class__.__name__ == "DonutProcessor":
    model_inputs["decoder_input_ids"] = self.processor.tokenizer(
Contributor

This is another flag that we shouldn't be bundling these models together if the inputs can't be consistently prepared.

self.assertTrue(list(outputs)[0][0]["generated_text"].startswith(text))
self.assertTrue(list(outputs)[1][0]["generated_text"].startswith(text))

@slow
Contributor

I don't think we want to add an integration test for every model for this pipeline. The tests are heavy, as they involve loading large checkpoints, and the output of each model should already be well captured by its own modeling integration tests. Instead, each model should have its small-model equivalent tested to make sure it functionally works with the pipeline.
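For illustration, a light-weight pipeline test along those lines could look like this (a sketch; the tiny-random checkpoint name is an assumption and no exact output is asserted):

import unittest
from transformers import pipeline

class ImageTextToTextPipelineFastTests(unittest.TestCase):
    def test_tiny_model(self):
        # A tiny random checkpoint keeps the test fast; the repo id here is an assumption.
        pipe = pipeline(
            "image-text-to-text",
            model="hf-internal-testing/tiny-random-LlavaForConditionalGeneration",
        )
        outputs = pipe(
            images="http://images.cocodataset.org/val2017/000000039769.jpg",
            text="USER: <image>\nWhat is this?\nASSISTANT:",
            max_new_tokens=5,
        )
        self.assertIsInstance(outputs[0]["generated_text"], str)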

[{"generated_text": "hello world 陽ɔ 劇र ♯ɔง 藥 ਾ"}],
)

outputs = pipe([image, image], text=text)
Contributor

And can we pass [image, image], [text, text]?
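For concreteness, the paired-list call being asked about would look something like this (a hypothetical sketch; whether the pipeline accepts one text per image is exactly the open question):

from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Salesforce/blip-image-captioning-base")

image = "http://images.cocodataset.org/val2017/000000039769.jpg"

# One prompt per image; support for this pairing is what is being asked above.
outputs = pipe(images=[image, image], text=["A picture of", "A photo of"], max_new_tokens=20)
print(outputs)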

NielsRogge requested a review from Narsil on April 15, 2024 at 10:01
@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this on May 19, 2024
@MoritzLaurer

Is this PR still planned? It would be useful; see e.g. these internal threads (thread1, thread2). @NielsRogge

@NielsRogge (Contributor, Author)

Yes, it's still planned; I just don't have the bandwidth to work on it right now. I hope @molbap and @zucchini-nlp do.

It would require the refactor of all multimodal processors to land first, though.
