Process inputs directly in apply_chat_template in image-text-to-text pipeline (#35616)
Conversation
zucchini-nlp
left a comment
Nice, thanks for updating the pipeline. Left a couple comments
This should be 1 to work correctly with different ViT backbones. Was it causing any test failures?
Without this change, I'm getting errors on pipeline tests that used to work with llava-interleave. For example:
```python
pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
text = "<image> What this is? Assistant: This is"
outputs = pipe(image, text=text)
self.assertEqual(
    outputs,
    [
        {
            "input_text": "<image> What this is? Assistant: This is",
            "generated_text": "<image> What this is? Assistant: This is a photo of two cats lying on a pink blanket. The cats are sleeping and appear to be comfortable",
        }
    ],
)
```

returns:

```
ValueError: Image features and image tokens do not match: tokens: 728, features 729
```
In the case of llava-interleave-qwen-0.5b-hf, I see a mismatch in vision_feature_select_strategy between the model config and the processor config. Will fix that on the Hub :)
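For intuition, here is a rough sketch (geometry and values assumed for illustration, not read from the actual configs) of why such a strategy mismatch produces exactly the off-by-one error above:

```python
# Assumed geometry: a 384x384 image with 14x14 patches, as in SigLIP-style
# backbones used by llava-interleave.
num_patches = (384 // 14) ** 2  # 27 * 27 = 729 features from the vision tower

def num_image_tokens(vision_feature_select_strategy: str) -> int:
    # "default" drops the first vision feature; "full" keeps all of them.
    if vision_feature_select_strategy == "default":
        return num_patches - 1
    return num_patches

# If the processor expands image placeholders with one strategy while the
# model selects features with the other, the counts disagree by exactly one:
tokens = num_image_tokens("default")  # placeholder tokens inserted in the text
features = num_image_tokens("full")   # features actually produced by the model
assert (tokens, features) == (728, 729)
```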
qubvel
left a comment
Thanks! A few comments on my side
Force-pushed from e3d95fd to 37bb6fc
ArthurZucker
left a comment
Missing doc / examples but nice otherwise!
Added some docs and updated the branch, this should be ready to merge @ArthurZucker
Cyrilvallez
left a comment
Nice, thanks! Just the function that should probably be renamed, IMO!
```diff
 if images is None:
     images = []
-elif not isinstance(images, Iterable):
+elif not isinstance(images, (Iterable)) or isinstance(images, str):
```

Suggested change:

```diff
-elif not isinstance(images, (Iterable)) or isinstance(images, str):
+elif not isinstance(images, (str, Iterable)):
```
I want this to be True when images is a string, and str is an Iterable, so the suggested `not isinstance(images, (str, Iterable))` check would never be True for a string.
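A standalone sketch of the behavior being defended here (`normalize_images` is a hypothetical helper, not the pipeline's actual code): a lone string must be wrapped in a list rather than iterated, precisely because str is itself Iterable.

```python
from collections.abc import Iterable

def normalize_images(images):
    # str is itself Iterable, so a lone path/URL must be special-cased,
    # otherwise it would be treated as an iterable of characters.
    if images is None:
        return []
    if not isinstance(images, Iterable) or isinstance(images, str):
        return [images]
    return list(images)

assert normalize_images(None) == []
assert normalize_images("cat.png") == ["cat.png"]  # the str branch fires
assert normalize_images(["a.png", "b.png"]) == ["a.png", "b.png"]
```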
`def retrieve_images_in_messages(`
I feel like the function name should be changed here as it's not really what it does anymore
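A hedged sketch of what the helper now seems to do, per the discussion in this PR (the name `fill_images_in_messages`, the error messages, and the exact logic are illustrative, not the PR's actual implementation): it no longer just retrieves images from the messages, it also pairs bare `{"type": "image"}` placeholders with entries of the separate `images` argument, which is why the old name no longer fits.

```python
def fill_images_in_messages(messages, images):
    """Pair bare {"type": "image"} placeholders with entries of `images`, in order."""
    images = list(images) if images is not None else []
    idx = 0
    for message in messages:
        for content in message.get("content", []):
            # A placeholder is an image entry that carries no image payload yet.
            if content.get("type") == "image" and "image" not in content:
                if idx >= len(images):
                    raise ValueError("Fewer images than placeholders in the chat")
                content["image"] = images[idx]
                idx += 1
    if idx < len(images):
        raise ValueError("More images than placeholders in the chat")
    return messages

msgs = [{"role": "user", "content": [{"type": "text", "text": "Describe:"}, {"type": "image"}]}]
filled = fill_images_in_messages(msgs, ["cat.png"])
assert filled[0]["content"][1] == {"type": "image", "image": "cat.png"}
```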
```diff
 if images is None:
     images = []
-elif not isinstance(images, Iterable):
+elif not isinstance(images, Iterable) or isinstance(images, str):
```
Still the same small nit haha
Suggested change:

```diff
-elif not isinstance(images, Iterable) or isinstance(images, str):
+elif not isinstance(images, (str, Iterable)):
```
…pipeline (huggingface#35616)

* tokenize inputs directly in apply_chat_template
* refactor processing
* revert changes processing llava
* Update docs
* fix issue with str being iterable
* add test chat text only
* change function name
What does this PR do?
Follows #34275
Process inputs directly in `apply_chat_template` instead of calling `apply_chat_template` then the processor. This also means that a small part of the pipeline logic needed to change, but I think it's better now :).
The pipeline also supports passing images with the `images` arg even when using a chat template, where the corresponding image is represented with a `{"type": "image"}` entry in the chat. In the previous behavior, when the input was:
The output would be:
```python
[
    {
        "input_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {"type": "image"},
                    {"type": "image"},
                ],
            }
        ],
        "generated_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {"type": "image"},
                    {"type": "image"},
                ],
            },
            {
                "role": "assistant",
                "content": "The first image shows a statue of Liberty in the foreground, while the second image shows a city skyline",
            },
        ],
    }
]
```

With no mention of the actual input images.
Now the output is:
```python
[
    {
        "input_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
                    },
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg",
                    },
                ],
            }
        ],
        "generated_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
                    },
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg",
                    },
                ],
            },
            {
                "role": "assistant",
                "content": "The first image shows a statue of Liberty in the foreground, while the second image shows a city skyline",
            },
        ],
    }
]
```

Who can review?
@zucchini-nlp @Rocketknight1