Process inputs directly in apply_chat_template in image-text-to-text pipeline (#35616)
Conversation
zucchini-nlp
left a comment
Nice, thanks for updating the pipeline. Left a couple comments
This should be 1 to work correctly with different ViT backbones. Was it causing any test failures?
Without this change, I'm getting errors on pipeline tests that used to work with llava-interleave. For example:
```python
pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
text = "<image> What this is? Assistant: This is"
outputs = pipe(image, text=text)
self.assertEqual(
    outputs,
    [
        {
            "input_text": "<image> What this is? Assistant: This is",
            "generated_text": "<image> What this is? Assistant: This is a photo of two cats lying on a pink blanket. The cats are sleeping and appear to be comfortable",
        }
    ],
)
```

returns:

```
ValueError: Image features and image tokens do not match: tokens: 728, features 729
```
In the case of llava-interleave-qwen-0.5b-hf, I see a mismatch in vision_feature_select_strategy between the model config and the processor config. Will fix that on the Hub :)
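For intuition, here is a rough sketch (geometry and values assumed for illustration, not read from the actual configs) of why such a strategy mismatch produces exactly the off-by-one error above:

```python
# Assumed geometry: a 384x384 image with 14x14 patches, as in SigLIP-style
# backbones used by llava-interleave.
num_patches = (384 // 14) ** 2  # 27 * 27 = 729 features from the vision tower

def num_image_tokens(vision_feature_select_strategy: str) -> int:
    # "default" drops the first vision feature; "full" keeps all of them.
    if vision_feature_select_strategy == "default":
        return num_patches - 1
    return num_patches

# If the processor expands image placeholders with one strategy while the
# model selects features with the other, the counts disagree by exactly one:
tokens = num_image_tokens("default")  # placeholder tokens inserted in the text
features = num_image_tokens("full")   # features actually produced by the model
assert (tokens, features) == (728, 729)
```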
qubvel
left a comment
Thanks! A few comments on my side
Force-pushed from e3d95fd to 37bb6fc
ArthurZucker
left a comment
Missing doc / examples but nice otherwise!
Added some docs and updated the branch, this should be ready to merge @ArthurZucker
Cyrilvallez
left a comment
Nice, thanks! Just the function that should probably be renamed, IMO!
```diff
 if images is None:
     images = []
-elif not isinstance(images, Iterable):
+elif not isinstance(images, (Iterable)) or isinstance(images, str):
```

Suggested change:

```diff
-elif not isinstance(images, (Iterable)) or isinstance(images, str):
+elif not isinstance(images, (str, Iterable)):
```
I want this to be True when images is a string, and str is an Iterable, so the suggested `not isinstance(images, (str, Iterable))` check would never be True for a string.
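A standalone sketch of the behavior being defended here (`normalize_images` is a hypothetical helper, not the pipeline's actual code): a lone string must be wrapped in a list rather than iterated, precisely because str is itself Iterable.

```python
from collections.abc import Iterable

def normalize_images(images):
    # str is itself Iterable, so a lone path/URL must be special-cased,
    # otherwise it would be treated as an iterable of characters.
    if images is None:
        return []
    if not isinstance(images, Iterable) or isinstance(images, str):
        return [images]
    return list(images)

assert normalize_images(None) == []
assert normalize_images("cat.png") == ["cat.png"]  # the str branch fires
assert normalize_images(["a.png", "b.png"]) == ["a.png", "b.png"]
```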
`def retrieve_images_in_messages(`
I feel like the function name should be changed here as it's not really what it does anymore
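A hedged sketch of what the helper now seems to do, per the discussion in this PR (the name `fill_images_in_messages`, the error messages, and the exact logic are illustrative, not the PR's actual implementation): it no longer just retrieves images from the messages, it also pairs bare `{"type": "image"}` placeholders with entries of the separate `images` argument, which is why the old name no longer fits.

```python
def fill_images_in_messages(messages, images):
    """Pair bare {"type": "image"} placeholders with entries of `images`, in order."""
    images = list(images) if images is not None else []
    idx = 0
    for message in messages:
        for content in message.get("content", []):
            # A placeholder is an image entry that carries no image payload yet.
            if content.get("type") == "image" and "image" not in content:
                if idx >= len(images):
                    raise ValueError("Fewer images than placeholders in the chat")
                content["image"] = images[idx]
                idx += 1
    if idx < len(images):
        raise ValueError("More images than placeholders in the chat")
    return messages

msgs = [{"role": "user", "content": [{"type": "text", "text": "Describe:"}, {"type": "image"}]}]
filled = fill_images_in_messages(msgs, ["cat.png"])
assert filled[0]["content"][1] == {"type": "image", "image": "cat.png"}
```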
```diff
 if images is None:
     images = []
-elif not isinstance(images, Iterable):
+elif not isinstance(images, Iterable) or isinstance(images, str):
```
Still the same small nit haha
Suggested change:

```diff
-elif not isinstance(images, Iterable) or isinstance(images, str):
+elif not isinstance(images, (str, Iterable)):
```
…pipeline (huggingface#35616)

* tokenize inputs directly in apply_chat_template
* refactor processing
* revert changes processing llava
* Update docs
* fix issue with str being iterable
* add test chat text only
* change function name
What does this PR do?
Follows #34275
Process inputs directly in `apply_chat_template` instead of calling `apply_chat_template` then the processor. This also means that a small part of the pipeline logic needed to change, but I think it's better now :).
The pipeline also supports passing images with the `images` arg even when using a chat template, where the corresponding image is represented with a `{"type": "image"}` entry in the chat. In the previous behavior, when the input was:
The output would be:
```python
[
    {
        "input_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {"type": "image"},
                    {"type": "image"},
                ],
            }
        ],
        "generated_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {"type": "image"},
                    {"type": "image"},
                ],
            },
            {
                "role": "assistant",
                "content": "The first image shows a statue of Liberty in the foreground, while the second image shows a city skyline",
            },
        ],
    }
]
```

With no mention of the actual input images.
Now the output is:
```python
[
    {
        "input_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
                    },
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg",
                    },
                ],
            }
        ],
        "generated_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
                    },
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg",
                    },
                ],
            },
            {
                "role": "assistant",
                "content": "The first image shows a statue of Liberty in the foreground, while the second image shows a city skyline",
            },
        ],
    }
]
```

Who can review?
@zucchini-nlp @Rocketknight1