[Model] Siglip2 Model Support#27566
Conversation
Signed-off-by: piood <2477084691@qq.com>
|
Documentation preview: https://vllm--27566.org.readthedocs.build/en/27566/
Code Review
This pull request extends vLLM's capabilities to support SigLIP2 models, building upon the existing SigLIP architecture. The changes include updating the dummy token ID for tokenizer compatibility, adding SigLIP2 to the test suite and documentation, and enhancing the model configuration loading to automatically infer model architecture when it's not explicitly defined. My review focuses on the robustness of this new configuration logic. I've identified one high-severity issue where the check for a missing architecture field could be more robust to handle cases like an empty list, which would otherwise cause model loading to fail.
|
/gemini review
Code Review
This pull request extends the SigLIP model support to SigLIP2, including handling model configurations and updating the dummy token ID. The changes involve modifications to the documentation, test files, and model configuration files. The review focuses on ensuring the correctness of the dummy token ID and the robustness of the configuration handling.
```python
if not config.architectures:
    if config.model_type not in MODEL_MAPPING_NAMES:
        raise ValueError(f"Model type {config.model_type} not supported")
    model_type = MODEL_MAPPING_NAMES[config.model_type]
    config.update({"architectures": [model_type]})
```
The added logic to automatically determine the architecture for models without an explicit architectures field is good for robustness. However, raising a ValueError if the model_type is not found in MODEL_MAPPING_NAMES might be too strict. A more graceful fallback could involve attempting to load the model with trust_remote_code=True or issuing a warning and proceeding with a default architecture. This could prevent the system from failing completely when encountering a new or less common model type. Also, consider adding a log message to indicate when this automatic architecture detection is being used, which can help with debugging and understanding the system's behavior.
Suggested change:

```diff
 if not config.architectures:
     if config.model_type not in MODEL_MAPPING_NAMES:
-        raise ValueError(f"Model type {config.model_type} not supported")
-    model_type = MODEL_MAPPING_NAMES[config.model_type]
-    config.update({"architectures": [model_type]})
+        logger.warning(f"Model type {config.model_type} not found in MODEL_MAPPING_NAMES. Attempting to proceed without explicit architecture.")
+        # Optionally, try loading with trust_remote_code or assign a default architecture here
+        config.update({"architectures": ["AutoModel"]})  # Example: setting a default architecture
+    else:
+        model_type = MODEL_MAPPING_NAMES[config.model_type]
+        config.update({"architectures": [model_type]})
```
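For reference, the truthiness check in `if not config.architectures` already treats an empty list the same as a missing field. Below is a self-contained toy version of the inference logic; the `MODEL_MAPPING_NAMES` contents and the dict-based config are stand-ins for illustration, not vLLM's real objects:

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative subset; vLLM's real mapping comes from transformers.
MODEL_MAPPING_NAMES = {"siglip2": "Siglip2Model"}


def infer_architectures(config: dict) -> dict:
    """Fill in config["architectures"] from model_type when it is missing or empty."""
    if not config.get("architectures"):  # covers None, a missing key, and []
        model_type = config.get("model_type")
        if model_type not in MODEL_MAPPING_NAMES:
            raise ValueError(f"Model type {model_type} not supported")
        config["architectures"] = [MODEL_MAPPING_NAMES[model_type]]
    return config


print(infer_architectures({"model_type": "siglip2", "architectures": []}))
# {'model_type': 'siglip2', 'architectures': ['Siglip2Model']}
```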
|
FYI this is breaking
|
Curious why this was merged with CI failures?
|
Oh, I didn't think this PR was related to chat utils, so I thought the failure was unrelated
|
Really, we should never assume this, no matter how unlikely it seems, unless the same failure has been seen on main. This keeps happening (breakage due to an incorrect assumption, which then has a much wider blast radius).
|
I agree in principle; unfortunately the entrypoints tests have been quite flaky these past few weeks, so it's easy to accidentally miss these true positives...
|
I have pushed #28365 to fix the SigLIP batch text output error. @DarkLight1337 @sleepwalker2017
Signed-off-by: piood <2477084691@qq.com>
|
Hi @piood @DarkLight1337,
Any tips on how we can add these variants?
|
@apoorv-stackav As for the second and third problems, I couldn't reproduce them locally; in my own environment the dimensions are right.
|
The max_length parameter is implemented by the tokenizer. Is it easy to implement max_length in vLLM?
|
Indeed, the padding should be done by the tokenizer. You should be able to do this by passing the appropriate tokenization kwargs.
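As an aside, the padding-to-max_length behavior under discussion can be sketched without a real tokenizer. This is a minimal illustration of what a tokenizer's `padding="max_length"` mode does to token ids; the helper name, `pad_token_id=0`, and `max_length=64` are assumptions for the example, not necessarily SigLIP2's actual settings:

```python
def pad_to_max_length(token_ids, max_length, pad_token_id):
    """Right-pad (or truncate) a token-id sequence to exactly max_length,
    mimicking a tokenizer's padding="max_length" behavior."""
    ids = token_ids[:max_length]
    return ids + [pad_token_id] * (max_length - len(ids))


# Illustrative values only: pad_token_id and max_length must come from
# the model's actual tokenizer config in practice.
padded = pad_to_max_length([101, 2023, 102], max_length=64, pad_token_id=0)
print(len(padded))  # 64
```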
|
@apoorv-stackav, I believe the SigLIP model only uses the hidden state of the last token for the sequence embedding, so padding shouldn't affect the final result. Please correct me if I'm wrong.
|
@piood, thanks for the great work adding support for SigLIP2. I tested it with
|
Which version of transformers are you using? You might have to update it.
|
Also, this PR doesn't cover the NaViT variant
|
@zac-wang-nv siglip2navit is implemented in
|
Hi, @piood @DarkLight1337

Model config

So the text encoder is a 1152 → 1536 projection, while the vision encoder already operates at 1536.

Serving command

Request

Behavior
Question

Is there a configuration/task flag I’m missing to bypass the text prompt path for image-only requests, or is this indeed a bug in the merge logic for multimodal embeddings? I’d love guidance on whether I should pad the text embeddings to 1536 before merging or if there’s an existing vision-only embedding pathway I can use. Thanks!
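For what it's worth, the dimension mismatch described above can be reproduced with stand-in arrays. The 1152 and 1536 sizes come from the config in this comment; the random projection is purely illustrative, not the model's learned text head:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a text-encoder hidden state (1152-d, per the config above).
text_hidden = rng.normal(size=(1, 1152))
# Stand-in for the text head's 1152 -> 1536 projection (random, illustrative).
projection = rng.normal(size=(1152, 1536))
# Stand-in for a vision embedding that is already 1536-d.
vision_emb = rng.normal(size=(1, 1536))

text_emb = text_hidden @ projection
# Only after the projection do text and vision embeddings share a dimension,
# so merging raw 1152-d text states with 1536-d vision states must fail.
print(text_hidden.shape, text_emb.shape, vision_emb.shape)
# (1, 1152) (1, 1536) (1, 1536)
```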
Signed-off-by: piood <2477084691@qq.com>
|
Same problem here (v0.11.2):
|
@FlyingYanglu @jtruetsch There are some tricky issues here. Since fixing this might require many changes, maybe we can keep
|
Let me open a PR to facilitate that
|
Opened #29741
|
Thank you so much for the help! @piood @DarkLight1337
Hi team, @piood @DarkLight1337 — sorry for the trouble again. It looks like padding-to-max_length is the officially expected behavior for SigLIP2 text preprocessing (as noted in the HF docs), so I’m trying to keep my tokenizer/preprocessor aligned with that. However, I found a couple of issues:
Because of this, the only reliable way to match HF behavior right now is to pass in pre-tokenized IDs manually. It would be super helpful if
|
@noooop can you help with this?
|
Opened #29794 to support tokenization_kwargs override.
|
Hi @piood @DarkLight1337, first of all, thanks from my side as well for being so fast to fix the issues with SigLIP2 embedding! And sorry that I have to dig up this issue another time, but I couldn't figure out how to get SigLIP2 embedding working from the OpenAI API with the padding-to-max_length parameters correctly set, as described in the previous comment by FlyingYanglu. I can set the parameters using
|
Hmm, I think they are passed via
|
For online serving, you can edit
|
Hi @DarkLight1337
|
Sorry, I forgot that this model actually uses the Chat API, so
|
@DarkLight1337 thanks! I got it to work by modifying
Purpose
Extend SigLIP model support to SigLIP2, building upon the existing SigLIP embedding architecture.
This PR extends multimodal embedding capabilities to support SigLIP2 models, which are widely used for vision-language tasks.
Resolves [New Model]: Google SigLip 2 #13663 and [Feature]: Support serving of CLIP/SigLIP embeddings #25581. This PR builds upon [Model] Siglip Embedding Support #27324 to add SigLIP2 support.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.