fix: Fix decode worker in vllm for qwen_vl models#5281
Conversation
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
WalkthroughThe changes implement Qwen VL multimodal decoding support by introducing a new utility function to construct decode-time multimodal data structures and integrating it into the worker handler with model-specific prompt token handling logic. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @components/src/dynamo/vllm/multimodal_utils/model.py:
- Around line 182-187: The signature for construct_qwen_decode_mm_data uses
embeddings_shape: Optional[Any>, which is too loose; tighten it to a more
specific sequence type such as Optional[Tuple[int, ...]] or
Optional[Sequence[int]] (and import Tuple/Sequence from typing) so callers must
pass a valid shape (e.g., (N, C, H, W)). Update the function annotation to
embeddings_shape: Optional[Tuple[int, ...]] (or Optional[Sequence[int]]) and
adjust any internal code or tests if they relied on Any, ensuring any runtime
checks that treat embeddings_shape as an indexable shape remain valid.
🧹 Nitpick comments (1)
components/src/dynamo/vllm/multimodal_utils/model.py (1)
199-206: Add validation for embeddings_shape and ndim before squeezing.The function doesn't validate that
embeddings_shapeis a valid shape (e.g., a sequence of integers), which could lead to cryptic PyTorch errors. Additionally, the squeeze operation assumes a 3D tensor but only checksndim == 3after creating the tensor—consider validating the shape length beforehand.♻️ Proposed validation improvements
if image_grid_thw is None or len(image_grid_thw) == 0: raise ValueError("No image grid provided for Qwen model.") if embeddings_shape is None: raise ValueError("embeddings_shape is required for Qwen decode mm data.") + if not isinstance(embeddings_shape, (tuple, list)) or not all(isinstance(x, int) and x > 0 for x in embeddings_shape): + raise ValueError(f"embeddings_shape must be a tuple or list of positive integers, got {embeddings_shape}") image_embeds = torch.zeros(embeddings_shape, dtype=dtype, device="cpu") - if image_embeds.ndim == 3: + if len(embeddings_shape) == 3: image_embeds = image_embeds.squeeze(0)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
components/src/dynamo/vllm/multimodal_handlers/worker_handler.pycomponents/src/dynamo/vllm/multimodal_utils/model.pyexamples/backends/vllm/launch/disagg_multimodal_epd.sh
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2026-01-04T06:45:28.414Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 5153
File: examples/backends/vllm/launch/lora/setup_minio.sh:99-109
Timestamp: 2026-01-04T06:45:28.414Z
Learning: For HuggingFace CLI version reporting (hf version and huggingface-cli version) in v0.34.6 and later, use direct argument syntax instead of the --version flag. Review shell-script changes and any scripts invoking the HuggingFace CLI to ensure they call the version output with a direct argument (e.g., 'hf version' or equivalent) rather than using '--version'. Apply to shell scripts and any related CLI invocations in the repository.
Applied to files:
examples/backends/vllm/launch/disagg_multimodal_epd.sh
📚 Learning: 2025-10-28T04:09:48.264Z
Learnt from: ayushag-nv
Repo: ai-dynamo/dynamo PR: 3634
File: components/src/dynamo/vllm/multimodal_handlers/processor_handler.py:66-72
Timestamp: 2025-10-28T04:09:48.264Z
Learning: In components/src/dynamo/vllm/multimodal_handlers/processor_handler.py, the AutoTokenizer.from_pretrained call with trust_remote_code=True is intentional and expected for the vLLM multimodal handler implementation.
Applied to files:
components/src/dynamo/vllm/multimodal_handlers/worker_handler.py
🪛 Ruff (0.14.10)
components/src/dynamo/vllm/multimodal_utils/model.py
200-200: Avoid specifying long messages outside the exception class
(TRY003)
202-202: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (3)
examples/backends/vllm/launch/disagg_multimodal_epd.sh (1)
96-96: LGTM! Decode worker now consistent with prefill configuration.Adding
--enable-mm-embedsto the decode worker aligns with the prefill worker configuration (line 91) and supports the multimodal decode functionality introduced in this PR.components/src/dynamo/vllm/multimodal_handlers/worker_handler.py (2)
67-85: Well-documented Qwen-specific multimodal handling.The approach of passing zero embeddings with
image_grid_thwfor mRoPE position calculation is clearly explained. The conditional logic ensures Qwen VL models receive the necessary multimodal data while maintaining backward compatibility. Themulti_modal_dataparameter is correctly typed asNotRequired[Optional[Any]](per the type override inmultimodal_utils/protocol.py), so passingNonefor non-Qwen models is safe.
273-283: Conditional prompt handling logic for disaggregated multimodal decode is correctly implemented.The code properly distinguishes between Qwen VL models and others:
- Qwen VL: Preserves original prompt for decode worker to expand with
multi_modal_data, matching prefill expansion and ensuring block count alignment- Non-Qwen: Uses expanded prompt from prefill response where vLLM won't expand further, matching the KV cache layout
The
kv_transfer_paramsare correctly propagated to maintain block synchronization between prefill and decode workers.
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com> Co-authored-by: Krishnan Prashanth <kprashanth@nvidia.com> Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
|
Auto-linked to DIS-1220 |
Overview:
Fix disaggregated multimodal decode for
Qwen2.5-VLmodels by passing the original unexpanded prompt to the decode worker, allowing vLLM to expand it identically to prefill and fix crash.Details:
Problem:
Qwen2.5-VLdisaggregated decode was failing with:IndexError: list index out of range when no multimodal data was passed (mRoPE needs
image_grid_thw)Solution:
multi_modal_datawith zero embeddings andimage_grid_thwfor mRoPE position calculationWhere should the reviewer start?
components/src/dynamo/vllm/multimodal_handlers/worker_handler.py- Main logic changes inMultimodalDecodeWorkerHandlerandMultimodalPDWorkerHandlercomponents/src/dynamo/vllm/multimodal_utils/model.py- Newconstruct_qwen_decode_mm_data()functionSummary by CodeRabbit
Release Notes
✏️ Tip: You can customize this high-level summary in your review settings.