
Process inputs directly in apply_chat_template in image-text-to-text pipeline#35616

Merged
yonigozlan merged 15 commits into huggingface:main from yonigozlan:vectorize-input-chat-image-text-to-text-pipeline
Apr 23, 2025

Conversation

@yonigozlan
Member

What does this PR do?

Follows #34275
Process inputs directly in apply_chat_template instead of calling apply_chat_template and then the processor.
This also means that a small part of the pipeline logic needed to change, but I think it's better now :).
The pipeline also supports passing images with the `images` arg even when using a chat template, where each corresponding image is represented by a {"type": "image"} entry in the chat.

In the previous behavior, when the input was:

image_ny = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image_chicago = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s the difference between these two images?"},
            {"type": "image"},
            {"type": "image"},
        ],
    }
]
outputs = pipe([image_ny, image_chicago], text=messages)

The output would be:

[
    {
        "input_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {"type": "image"},
                    {"type": "image"},
                ],
            }
        ],
        "generated_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {"type": "image"},
                    {"type": "image"},
                ],
            },
            {
                "role": "assistant",
                "content": "The first image shows a statue of Liberty in the foreground, while the second image shows a city skyline",
            },
        ],
    }
]

With no mention of the actual input images.
Now the output is:

[
    {
        "input_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
                    },
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg",
                    },
                ],
            }
        ],
        "generated_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
                    },
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg",
                    },
                ],
            },
            {
                "role": "assistant",
                "content": "The first image shows a statue of Liberty in the foreground, while the second image shows a city skyline",
            },
        ],
    }
]
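For comparison, a hedged sketch (not from the PR diff) of the inline form that the resolved output above suggests: the same request with each image placed directly in the chat rather than passed through the separate `images` argument. The `"image"` key mirrors the resolved output shown above; the exact keys accepted may vary by version, so check the pipeline docs.

```python
# Hedged sketch: building the chat with the images inline. The "image" key
# here is taken from the resolved output shown above, not from the PR diff.
image_ny = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image_chicago = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's the difference between these two images?"},
            {"type": "image", "image": image_ny},
            {"type": "image", "image": image_chicago},
        ],
    }
]
# outputs = pipe(text=messages)  # `pipe` as constructed earlier; downloads a model
```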

Who can review?

@zucchini-nlp @Rocketknight1

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@zucchini-nlp zucchini-nlp left a comment


Nice, thanks for updating the pipeline. Left a couple comments

Member


This should be 1 to work correctly with different ViT backbones. Was it causing any test failures?

Member Author


Without this change, I'm getting errors on pipeline tests that used to work with llava-interleave. For example:

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
text = "<image> What this is? Assistant: This is"

outputs = pipe(image, text=text)
self.assertEqual(
    outputs,
    [
        {
            "input_text": "<image> What this is? Assistant: This is",
            "generated_text": "<image> What this is? Assistant: This is a photo of two cats lying on a pink blanket. The cats are sleeping and appear to be comfortable",
        }
    ],
)

returns:

ValueError: Image features and image tokens do not match: tokens: 728, features 729

Member


In the case of llava-interleave-qwen-0.5b-hf, I see a mismatch in vision_feature_select_strategy between the model config and the processor. Will fix that on the hub :)
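The off-by-one in the error is easy to reproduce arithmetically. A minimal sketch, assuming a 27×27 patch grid (chosen to match the 728-vs-729 numbers in the error message above): with the "full" strategy every vision feature is kept, while "default" drops the first (CLS) feature, so a processor and model that disagree on the strategy differ by exactly one token.

```python
# Hedged sketch: how vision_feature_select_strategy shifts the image-token
# count. The 27x27 grid is an assumption chosen to reproduce the
# "tokens: 728, features 729" mismatch from the error above.
patch_grid = 27
num_patch_features = patch_grid ** 2      # 729 features from the vision backbone
tokens_full = num_patch_features          # "full": keep every feature
tokens_default = num_patch_features - 1   # "default": drop the first (CLS) feature

print(tokens_full, tokens_default)  # 729 728
```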

Contributor

@qubvel qubvel left a comment


Thanks! A few comments on my side

@yonigozlan yonigozlan force-pushed the vectorize-input-chat-image-text-to-text-pipeline branch from e3d95fd to 37bb6fc on January 13, 2025 at 17:31
Collaborator

@ArthurZucker ArthurZucker left a comment


Missing doc / examples but nice otherwise!

@yonigozlan
Member Author

Added some docs and updated the branch, this should be ready to merge @ArthurZucker

@yonigozlan yonigozlan requested a review from Cyrilvallez April 23, 2025 15:00
Member

@Cyrilvallez Cyrilvallez left a comment


Nice thanks! Just the function that should probably be renamed IMO!

  if images is None:
      images = []
- elif not isinstance(images, Iterable):
+ elif not isinstance(images, (Iterable)) or isinstance(images, str):
Member


Suggested change:
- elif not isinstance(images, (Iterable)) or isinstance(images, str):
+ elif not isinstance(images, (str, Iterable)):

Member Author


I want this to be True when images is a string, and str is itself an Iterable.
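The point can be checked directly: a plain str is itself an Iterable, so an `isinstance(images, Iterable)` check alone would wrongly treat a single image path as a collection of images, iterating it character by character.

```python
# Demonstrates why a lone str must be special-cased: str registers as an
# Iterable, and iterating it yields individual characters, not images.
from collections.abc import Iterable

image = "cat.png"
print(isinstance(image, Iterable))  # True
print(list(image)[:3])              # ['c', 'a', 't']
```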

Comment on lines 65 to 66

def retrieve_images_in_messages(
Member


I feel like the function name should be changed here as it's not really what it does anymore

Member

@Cyrilvallez Cyrilvallez left a comment


LGTM, thanks! 🤗

  if images is None:
      images = []
- elif not isinstance(images, Iterable):
+ elif not isinstance(images, Iterable) or isinstance(images, str):
Member


Still the same small nit haha

Suggested change:
- elif not isinstance(images, Iterable) or isinstance(images, str):
+ elif not isinstance(images, (str, Iterable)):

@yonigozlan yonigozlan merged commit 5cd6b64 into huggingface:main Apr 23, 2025
20 checks passed
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
…pipeline (huggingface#35616)

* tokenize inputs directly in apply_chat_template

* refactor processing

* revert changes processing llava

* Update docs

* fix issue with str being iterable

* add test chat text only

* change function name
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
