Image_token mismatch when applying SigLIP to Llava & LlavaNext. #34447
jp1924 wants to merge 1 commit into huggingface:main
Conversation
This is my personal opinion, but I don't think my implementation is particularly good. Among the transformers vision models there are models like ViT where the CLS token is hardcoded to be inserted, and it is somewhat difficult to tell whether a given model includes a CLS token or not. That is what caused this bug, and I don't think a flag-based approach is a fundamental solution. Instead, I believe the best approach would be to add a value to the vision encoder's config that distinguishes models with CLS tokens from those without. Code could then add the extra token only when the vision encoder actually includes a CLS token, and skip it otherwise. That would be a more robust solution.
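The config-based alternative described above could be sketched as follows. This is only an illustration: the `prepends_cls_token` field and the `expected_seq_len` helper are hypothetical names, not real transformers APIs.

```python
# Sketch of the config-based alternative: the vision encoder's config
# declares whether it prepends a CLS token, and the processor side adds
# the extra token only when one is present.
# NOTE: `prepends_cls_token` is a hypothetical field, not a real
# transformers config attribute.
from dataclasses import dataclass


@dataclass
class VisionConfig:
    image_size: int
    patch_size: int
    prepends_cls_token: bool


def expected_seq_len(cfg: VisionConfig) -> int:
    patches = (cfg.image_size // cfg.patch_size) ** 2
    return patches + 1 if cfg.prepends_cls_token else patches


# ViT/CLIP-style tower: the CLS token is part of the output sequence.
vit_like = VisionConfig(image_size=336, patch_size=14, prepends_cls_token=True)
# SigLIP-style tower: patch embeddings only, no CLS token.
siglip_like = VisionConfig(image_size=384, patch_size=14, prepends_cls_token=False)

assert expected_seq_len(vit_like) == 577      # 24 * 24 + 1
assert expected_seq_len(siglip_like) == 729   # 27 * 27
```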
Hey! Thanks for reporting, it is a known bug that we discovered after releasing the new processing logic. The fix will be in #33424 in a few weeks :)
Oh! Thanks for the quick answer! But I have to run Llava with SigLIP, which is annoying because of this error.
Oh, I see it's not fixed yet. I'll work around it until the fix lands. Thanks!
Yes, feel free to install from that PR in the meantime. It took a bit longer to merge because we had to discuss a long-term solution that would work in most cases for all VLMs.
What does this PR do?
When using a vision encoder that doesn't insert a CLS token (like SigLIP) with Llava or Llava-Next models, an image-size mismatch error occurs.

Unlike ViT, SigLIP doesn't add a `cls` token in the vision embedding layer. However, in the Llava processor, a `+1` for that token is hardcoded. As a result, an error occurs even when `vision_feature_select_strategy` is set to `full`. Therefore, I propose replacing the hardcoded `+1` with a flag named `vision_feature_use_cls`.

Bug reproduction code
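For illustration only, here is a minimal sketch of how the proposed flag could replace the hardcoded `+1` when the processor computes the number of image placeholder tokens. The function name and signature are assumptions for this sketch, not the actual patch.

```python
def num_image_tokens(image_size: int, patch_size: int,
                     vision_feature_use_cls: bool) -> int:
    """Number of placeholder tokens the processor expands one image into.

    `vision_feature_use_cls` is the proposed flag: True for ViT/CLIP-style
    towers that prepend a CLS token, False for SigLIP-style towers.
    (Hypothetical helper, not the real Llava processor code.)
    """
    num_patches = (image_size // patch_size) ** 2
    # Add the extra token only when the vision tower actually prepends CLS.
    return num_patches + 1 if vision_feature_use_cls else num_patches


# CLIP ViT-L/14 at 336px: 24 * 24 patches plus the CLS token.
assert num_image_tokens(336, 14, vision_feature_use_cls=True) == 577
# SigLIP at 384px: 27 * 27 patches, no CLS -- the hardcoded +1 breaks here.
assert num_image_tokens(384, 14, vision_feature_use_cls=False) == 729
```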
`transformers` version: 4.46.0

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@amyeroberts, @qubvel, @zucchini-nlp