[Model] Siglip2 Model Support#27566
Conversation
Signed-off-by: piood <2477084691@qq.com>
|
Documentation preview: https://vllm--27566.org.readthedocs.build/en/27566/
Code Review
This pull request extends vLLM's capabilities to support SigLIP2 models, building upon the existing SigLIP architecture. The changes include updating the dummy token ID for tokenizer compatibility, adding SigLIP2 to the test suite and documentation, and enhancing the model configuration loading to automatically infer model architecture when it's not explicitly defined. My review focuses on the robustness of this new configuration logic. I've identified one high-severity issue where the check for a missing architecture field could be more robust to handle cases like an empty list, which would otherwise cause model loading to fail.
|
/gemini review
Code Review
This pull request extends the SigLIP model support to SigLIP2, including handling model configurations and updating the dummy token ID. The changes involve modifications to the documentation, test files, and model configuration files. The review focuses on ensuring the correctness of the dummy token ID and the robustness of the configuration handling.
```python
if not config.architectures:
    if config.model_type not in MODEL_MAPPING_NAMES:
        raise ValueError(f"Model type {config.model_type} not supported")
    model_type = MODEL_MAPPING_NAMES[config.model_type]
    config.update({"architectures": [model_type]})
```
The added logic to automatically determine the architecture for models without an explicit architectures field is good for robustness. However, raising a ValueError if the model_type is not found in MODEL_MAPPING_NAMES might be too strict. A more graceful fallback could involve attempting to load the model with trust_remote_code=True or issuing a warning and proceeding with a default architecture. This could prevent the system from failing completely when encountering a new or less common model type. Also, consider adding a log message to indicate when this automatic architecture detection is being used, which can help with debugging and understanding the system's behavior.
Suggested change:

```diff
 if not config.architectures:
     if config.model_type not in MODEL_MAPPING_NAMES:
-        raise ValueError(f"Model type {config.model_type} not supported")
-    model_type = MODEL_MAPPING_NAMES[config.model_type]
-    config.update({"architectures": [model_type]})
+        logger.warning(f"Model type {config.model_type} not found in MODEL_MAPPING_NAMES. Attempting to proceed without explicit architecture.")
+        # Optionally, try loading with trust_remote_code or assign a default architecture here
+        config.update({"architectures": ["AutoModel"]})  # Example: setting a default architecture
+    else:
+        model_type = MODEL_MAPPING_NAMES[config.model_type]
+        config.update({"architectures": [model_type]})
```
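For reference, the truthiness check in `if not config.architectures` already treats an empty list the same as a missing field. Below is a self-contained toy version of the inference logic; the `MODEL_MAPPING_NAMES` contents and the dict-based config are stand-ins for illustration, not vLLM's real objects:

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative subset; vLLM's real mapping comes from transformers.
MODEL_MAPPING_NAMES = {"siglip2": "Siglip2Model"}


def infer_architectures(config: dict) -> dict:
    """Fill in config["architectures"] from model_type when it is missing or empty."""
    if not config.get("architectures"):  # covers None, a missing key, and []
        model_type = config.get("model_type")
        if model_type not in MODEL_MAPPING_NAMES:
            raise ValueError(f"Model type {model_type} not supported")
        config["architectures"] = [MODEL_MAPPING_NAMES[model_type]]
    return config


print(infer_architectures({"model_type": "siglip2", "architectures": []}))
# {'model_type': 'siglip2', 'architectures': ['Siglip2Model']}
```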
|
FYI this is breaking
|
Curious why this was merged with CI failures?
|
Oh, I didn't think this PR was related to chat utils, so I thought the failure was unrelated
|
Really, we should never assume this, no matter how unlikely it seems, unless the same failure has been seen on main. This keeps happening (breakage due to an incorrect assumption, which then has a much wider blast radius).
|
I agree in principle; unfortunately the entrypoints tests have been quite flaky these past few weeks, so it's easy to accidentally miss these true positives...
|
I have pushed #28365 to fix the SigLIP batch text output error. @DarkLight1337 @sleepwalker2017
Signed-off-by: piood <2477084691@qq.com>
|
Hi @piood @DarkLight1337,
Any tips on how we can add these variants?
|
@apoorv-stackav As for the second and third problems, I couldn't reproduce them locally; in my own environment the dimensions are right.
|
The max_length parameter is implemented by the tokenizer. Is it easy to implement max_length in vLLM?
|
Indeed, the padding should be done by the tokenizer. You should be able to do this by passing the appropriate tokenization kwargs.
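As an aside, the padding-to-max_length behavior under discussion can be sketched without a real tokenizer. This is a minimal illustration of what a tokenizer's `padding="max_length"` mode does to token ids; the helper name, `pad_token_id=0`, and `max_length=64` are assumptions for the example, not necessarily SigLIP2's actual settings:

```python
def pad_to_max_length(token_ids, max_length, pad_token_id):
    """Right-pad (or truncate) a token-id sequence to exactly max_length,
    mimicking a tokenizer's padding="max_length" behavior."""
    ids = token_ids[:max_length]
    return ids + [pad_token_id] * (max_length - len(ids))


# Illustrative values only: pad_token_id and max_length must come from
# the model's actual tokenizer config in practice.
padded = pad_to_max_length([101, 2023, 102], max_length=64, pad_token_id=0)
print(len(padded))  # 64
```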
|
@apoorv-stackav, I believe the SigLIP model only uses the hidden state of the last token for the sequence embedding, so padding shouldn't affect the final result. Please correct me if I'm wrong.
|
@piood, thanks for the great work adding support for SigLIP2. I tested it with
|
Which version of transformers are you using? You might have to update it.
|
Also, this PR doesn't cover the NaViT variant
|
@zac-wang-nv siglip2navit is implemented in
|
Hi, @piood @DarkLight1337

Model config

So the text encoder is a 1152 → 1536 projection, while the vision encoder already operates at 1536.

Serving command

Request

Behavior
Question

Is there a configuration/task flag I’m missing to bypass the text prompt path for image-only requests, or is this indeed a bug in the merge logic for multimodal embeddings? I’d love guidance on whether I should pad the text embeddings to 1536 before merging or if there’s an existing vision-only embedding pathway I can use. Thanks!
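For what it's worth, the dimension mismatch described above can be reproduced with stand-in arrays. The 1152 and 1536 sizes come from the config in this comment; the random projection is purely illustrative, not the model's learned text head:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a text-encoder hidden state (1152-d, per the config above).
text_hidden = rng.normal(size=(1, 1152))
# Stand-in for the text head's 1152 -> 1536 projection (random, illustrative).
projection = rng.normal(size=(1152, 1536))
# Stand-in for a vision embedding that is already 1536-d.
vision_emb = rng.normal(size=(1, 1536))

text_emb = text_hidden @ projection
# Only after the projection do text and vision embeddings share a dimension,
# so merging raw 1152-d text states with 1536-d vision states must fail.
print(text_hidden.shape, text_emb.shape, vision_emb.shape)
# (1, 1152) (1, 1536) (1, 1536)
```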
Signed-off-by: piood <2477084691@qq.com>
|
Same problem here (v0.11.2):
|
@FlyingYanglu @jtruetsch There are some tricky issues here. Since fixing this might require many changes, maybe we can keep
|
Let me open a PR to facilitate that
|
Opened #29741
|
Thank you so much for the help! @piood @DarkLight1337
Hi team, @piood @DarkLight1337 — sorry for the trouble again. It looks like padding-to-max_length is the officially expected behavior for SigLIP2 text preprocessing (as noted in the HF docs), so I’m trying to keep my tokenizer/preprocessor aligned with that. However, I found a couple of issues:
Because of this, the only reliable way to match HF behavior right now is to pass in pre-tokenized IDs manually. It would be super helpful if
|
@noooop can you help with this?
|
Opened #29794 to support tokenization_kwargs override.
|
Hi @piood @DarkLight1337, first of all, thanks from my side as well for being so fast to fix the issues with SigLIP2 embedding! And sorry that I have to dig up this issue another time, but I couldn't figure out how to get SigLIP2 embedding working from the OpenAI API with the padding-to-max_length parameters correctly set, as described in the previous comment by FlyingYanglu. I can set the parameters using
|
Hmm, I think they are passed via
|
For online serving, you can edit
|
Hi @DarkLight1337
|
Sorry, I forgot that this model actually uses the Chat API, so
|
@DarkLight1337 thanks! I got it to work by modifying
Purpose
Extend SigLIP model support to SigLIP2, building upon the existing SigLIP embedding architecture.
This PR extends multimodal embedding capabilities to support SigLIP2 models, which are widely used for vision-language tasks.
Resolves [New Model]: Google SigLip 2 #13663 and [Feature]: Support serving of CLIP/SigLIP embeddings #25581. This PR builds upon [Model] Siglip Embedding Support #27324 to add SigLIP2 support.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.