Fix Llava-NeXT / Llava-NeXT Video / Llava-OneVision's token unpadding mismatch#35779
zucchini-nlp merged 2 commits into huggingface:main from sheryc:fix-llava-onevision-padding
Conversation
---
@zucchini-nlp I got the same issue: `ValueError: Image features and image tokens do not match: tokens: 4589, features 4588` when using Llava-v1.6-vicuna-7b-hf (a llava-next model) on transformers 4.47.0. I followed PR #35779 and modified `processing_llava_next.py` as shown in the following code, but it doesn't work. Is there any difference between llava-next and llava-onevision? I thought they might share the same issue.

```python
original_aspect_ratio = width / height
```
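The `ValueError` above comes from a consistency check in the modeling code that compares the number of `<image>` placeholder tokens in the prompt against the number of vision features produced by the vision tower. A minimal sketch of that check (paraphrased from the Llava-NeXT modeling code; the function name and the plain-list input are simplifications, the real code operates on tensors):

```python
def check_image_tokens(input_ids, n_image_features, image_token_id):
    """Raise if the prompt's image placeholders don't match the vision features."""
    # Count how many <image> placeholder tokens the processor expanded into the prompt.
    n_image_tokens = sum(1 for token in input_ids if token == image_token_id)
    if n_image_tokens != n_image_features:
        raise ValueError(
            f"Image features and image tokens do not match: "
            f"tokens: {n_image_tokens}, features {n_image_features}"
        )

# An off-by-one between the processor's and the model's unpad math triggers it:
# check_image_tokens([32000] * 4589, 4588, 32000) raises the error quoted above.
```

The processor decides how many placeholder tokens to insert, while the model counts the features it actually computed; any rounding disagreement between the two shows up as exactly this kind of off-by-one error.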
zucchini-nlp left a comment
Great, thanks a lot for fixing!
Unpadding is having too many precision-related bugs 🥲 Can you propagate the changes to llava-next and llava-next-video, which use the same pad-unpad method? And then add `# Copied from transformers.models.llava_next.processing_llava_next._get_unpadded_features` in the other models, so it always stays the same, since we already have the model-side code copied and identical.
---
There have been some other fixes (e.g. #34779 for #34625) for this problem integrated in llava-next already, so the code is not an exact copy.
Right, it cannot be 100% copied from llava-next. Okay, you can then leave "Adapted from ..." so that anyone reading the code can refer to llava-next as the source of truth.
---
Is there anything else I need to do? @zucchini-nlp
---
Sorry, yeah, let's merge it. I don't think we need one more review for this.
---
Quick question: why do we have a 7 hardcoded? A comment would be nice, to explain why it's seven and not 12983.
---
Yeah, hardcoded; it comes from an old issue AFAIR, when the rounding happened at the wrong decimal and we needed more precision. So 7 was an arbitrarily large number to keep precision. The same rounding is present in the modeling code, and the idea is to keep model and processor code identical to prevent precision errors.
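The precision issue being discussed is easy to demonstrate: a product that is mathematically an integer can land just below it in floating point, and a bare `int()` then truncates it, shifting the token count by one. Rounding to 7 decimals first absorbs the error (the numbers below are an illustrative case, not taken from a real image size):

```python
height, scale = 100, 29 / 100  # scale plays the role of current_width / original_width

exact = height * scale          # mathematically 29, but floats give 28.999999999999996
naive = int(exact)              # truncates to 28 -> one row of tokens too few
stable = int(round(exact, 7))   # rounding to 7 decimals first recovers 29

print(naive, stable)  # 28 29
```

When the processor uses the `naive` form and the model uses the `stable` form (or vice versa), their feature counts diverge by exactly one row or column, which is the off-by-one mismatch reported in this thread.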
… mismatch (huggingface#35779)
* Fix Llava OneVision's token padding
* Fix Llava next and Llava next video's token unpadding for consistency
---
Hey @sheryc @zucchini-nlp, I'm still getting this error even after installing the latest version (4.49.0.dev0) by running
The mismatch between the number of features and tokens is always very small, which makes me think that this is still somehow related to these rounding issues. I'm trying to finetune either of the models I mentioned.

Not sure if this helps, but here is some pseudo-code that describes the prompt/image formatting and tokenization process that I'm performing. Images are PIL objects.

```python
messages = [
    {
        "role": "user",
        "content": [
            {"index": 0, "text": None, "type": "image"},
            {"index": 1, "text": None, "type": "image"},
            {"index": None, "text": instruction, "type": "text"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"index": None, "text": label, "type": "text"},
        ]
    }
]
prompt = self.processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)
inputs = self.processor(
    text=prompt,
    images=[image1, image2],
    padding=True,
    return_tensors="pt",
)
```
---
Hi @baraujo98, the models you are using are Llava models. This PR only fixes the problem for Llava-NeXT, Llava-NeXT Video, and Llava-OneVision, whose processing is a bit different from Llava's. I think it's worth a new issue. Also, could you add a minimal code snippet to reproduce the error?
---
@sheryc I believe that I will try to write a minimal example nonetheless.
Yes you're right, sorry
---
@baraujo98 right, I see the config on the hub wasn't updated with the
---
Thank you very much for the tip @zucchini-nlp. Unsloth doesn't allow me to load revisions of the model for some reason, but I was able to fix the issue for now by running

I opened a similar PR for the second model I mentioned: https://huggingface.co/unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit/discussions/1
---
I am using Qwen2-VL-2B-Instruct, and on inference I am getting this: `ValueError: Image features and image tokens do not match: tokens: 841, features 1682`. While you guys don't have a big gap between tokens and features, I have a difference by a factor of 2. I am using 800*800 resolution images and passing them in an array to the model.
---
@ambar3497 Hi Ambar, this PR only solves the problem for llava-next/onevision models; you may consider opening a new issue and providing a code snippet to reproduce the error.
What does this PR do?
Fixes #35775 and the further discussions in #34625.
This PR fixes the vision feature / vision token mismatch issue for LLaVA OneVision models. Specifically, the unpadded-token calculation in the Processor and in the model differs by a rounding function, so a precision issue can create a dimension mismatch. The PR adds the rounding function to the Processor to match LlavaOnevision's and LlavaNext's model-side behavior.
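For reference, a self-contained sketch of the unpadded-feature computation after this fix (simplified from the `_get_unpadded_features` helper in transformers' Llava-OneVision processing code; the argument names follow that helper, but treat the details as an approximation rather than the exact implementation):

```python
def get_unpadded_features(height, width, patches_height, patches_width,
                          scale_height, scale_width):
    """Count the image features left after unpadding the anyres feature grid."""
    current_height = patches_height * scale_height
    current_width = patches_width * scale_width

    original_aspect_ratio = width / height
    current_aspect_ratio = current_width / current_height
    if original_aspect_ratio > current_aspect_ratio:
        # Image is wider than the feature grid: rows were padded, remove them.
        # round(..., 7) keeps this in sync with the modeling code's arithmetic.
        new_height = int(round(height * (current_width / width), 7))
        padding = (current_height - new_height) // 2
        current_height -= padding * 2
    else:
        # Image is taller than the feature grid: columns were padded, remove them.
        new_width = int(round(width * (current_height / height), 7))
        padding = (current_width - new_width) // 2
        current_width -= padding * 2

    unpadded_features = current_height * current_width
    newline_features = current_height  # one newline token per remaining row
    return unpadded_features, newline_features
```

Without the `round(..., 7)`, the processor can count one fewer (or one more) row or column than the model for certain aspect ratios, producing exactly the off-by-one mismatches reported in this thread.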
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. (LLaVA-OneVision image features and image tokens mismatch #35775)
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings. (No new functions introduced.)
Who can review?
@zucchini-nlp
Edit: updated the title to reflect the latest changes in this PR.