[bugfix] fix siglip batch text output error#28365
DarkLight1337 merged 4 commits into vllm-project:main
Conversation
Signed-off-by: piood <2477084691@qq.com>
Code Review
This pull request addresses a bug in SigLIP's batch text embedding. The fix involves two main changes: switching the text encoder to use EncoderOnlyAttention for correct batch handling, and implementing a new method _flip_sequences_by_position_ids to correctly flip individual sequences within a batch. The changes are logical and directly address the root causes described. I have one suggestion to improve the performance of the new flipping method by vectorizing its implementation.
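A vectorized per-sequence flip along the lines the review suggests could look like the sketch below. This is hypothetical illustration code, not the actual vLLM implementation: the function name, its signature, and the assumption that each flattened sequence's position_ids restart at 0 are all assumptions for the example.

```python
import torch

def flip_sequences_by_position_ids(hidden: torch.Tensor,
                                   position_ids: torch.Tensor) -> torch.Tensor:
    """Flip each concatenated sequence independently, without a Python loop."""
    # Sequence starts are the positions where position_ids reset to 0
    # (assumption: every flattened sequence begins at position 0).
    starts = (position_ids == 0).nonzero(as_tuple=True)[0]
    n = position_ids.numel()
    ends = torch.cat([starts[1:], torch.tensor([n])])
    lengths = ends - starts
    # Map every token to its own sequence, then mirror its offset:
    # flipped_index = start + (length - 1) - offset
    seq_id = torch.repeat_interleave(torch.arange(len(starts)), lengths)
    offsets = torch.arange(n) - starts[seq_id]
    flipped = starts[seq_id] + (lengths[seq_id] - 1) - offsets
    return hidden[flipped]
```

Building the gather index once and applying a single fancy-index avoids the per-sequence Python loop, which is what the vectorization suggestion is after.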
💡 Codex Review
Here are some automated review suggestions for this pull request.
@DarkLight1337 Can you review it?
Does the existing test pass locally?
Yes, all passed. |
Signed-off-by: piood <2477084691@qq.com>
/gemini review
Code Review
This pull request aims to fix a bug in SigLIP's batch text embedding process. The solution involves switching to EncoderOnlyAttention for proper batch handling and implementing a new method, _flip_sequences_by_position_ids, to correctly flip individual sequences. While the overall approach is sound, I've identified a potential critical issue in the boundary detection logic of the new flipping method which appears inconsistent with how SigLIP's reversed position IDs work.
@DarkLight1337 All checks have passed, please review again, thanks.
Signed-off-by: piood <2477084691@qq.com>
Purpose
Fix SigLIP text embedding batch processing error when batch size > 1, as reported in #27566.
Root Cause: Text embeddings flatten batch sequences into a single sequence and use `MultiHeadAttention`, which handles batched attention incorrectly. Additionally, the previous implementation used `flip(0)` on the entire batch tensor, mixing features across different sequences.
Solution:
- Use `EncoderOnlyAttention` for proper batch handling
- Add `_flip_sequences_by_position_ids()` to flip each sequence individually based on position_ids boundaries
Test Plan
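As a minimal illustration of the flip issue (a hypothetical sketch, not the actual vLLM implementation; it assumes sequences are concatenated along dim 0 and each sequence's position_ids restart at 0):

```python
import torch

def flip_per_sequence(hidden: torch.Tensor,
                      position_ids: torch.Tensor) -> torch.Tensor:
    # hidden.flip(0) would reverse the whole flattened batch, moving
    # tokens across sequence boundaries; instead, flip only within the
    # span of each sequence, detected where position_ids reset to 0.
    starts = (position_ids == 0).nonzero(as_tuple=True)[0].tolist()
    bounds = starts + [position_ids.numel()]
    out = hidden.clone()
    for s, e in zip(bounds[:-1], bounds[1:]):
        out[s:e] = hidden[s:e].flip(0)
    return out
```

With two concatenated sequences of lengths 3 and 4, a plain `flip(0)` would place tokens from the second sequence at the front of the first; the per-sequence flip keeps each sequence's tokens within its own span.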
Test Result
Before fix
After fix
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.