[Bugfix] fix recompile in qwen3 vl #16785
Conversation
Could you help to verify with SGLANG_VIT_ENABLE_CUDA_GRAPH=1? The Qwen3-VL ViT CUDA Graph path involves injecting the deepstack output into the LLM layers, so I just want to make sure the full ViT+LLM CUDA Graph path is working correctly.
/rerun-failed-ci
Hi @yuan-luo Thank you for your review. |
Oasis-Git
left a comment
Previously, when I ran this model, I noticed the recompilation was caused by grad mode. Could you please share which part runs the forward pass with gradients enabled?
@narutolhy Did #16902 fix your problem?
Hi @yuan-luo I still got the error with the ViT CUDA Graph on your branch. I will try to debug it. Thank you.
Hi @Oasis-Git This is the part that runs the forward pass with gradients enabled. And my latest code only added
Hi @yuan-luo
Force-pushed from ca6a2c4 to 3c1b102 (Compare)
Will take a look at it soon.
Hi @yuan-luo, do you have time to take a look? Or can you take a look when you come back from your holiday?
Force-pushed from 3c1b102 to 171c9b9 (Compare)
Force-pushed from 171c9b9 to c5c286c (Compare)
/tag-and-rerun-ci |
- LLaVA/LLaVaVid: add a None guard for mm_inputs during PCG warmup, where ForwardBatch is constructed with mm_inputs=None
- Qwen3VL/Omni: stabilize the input_deepstack_embeds kwarg to prevent TorchDynamo recompilation during PCG capture; preallocate the deepstack buffer and always pass it so the function signature stays consistent

Based on sgl-project#16785.
Co-Authored-By: narutolhy <582909902@qq.com>
Motivation
Qwen3-VL automatically injects input_deepstack_embeds into the language model inputs when requests contain mm_inputs (multimodal inputs). The current implementation does not handle this injected tensor consistently, so the model execution path differs between multimodal and text-only requests. In mixed traffic, this leads to unnecessary TorchDynamo recompilations at runtime (e.g., repeatedly specializing on the presence/absence of deepstack inputs), increasing latency and reducing throughput stability.
This PR makes the deepstack injection path explicit and stable to avoid redundant recompilation for Qwen3-VL multimodal inference.
Modifications
Standardized the Qwen3-VL deepstack injection behavior so that the language model forward path remains stable when mm_inputs is present.
Ensured input_deepstack_embeds is handled in a consistent way across requests to prevent runtime recompilation churn (especially under mixed multimodal/text workloads).
Refactored the relevant code paths to make the deepstack injection explicit and easier to reason about.
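The core idea behind the stabilization can be sketched outside sglang. The following is a minimal illustration (hypothetical function and buffer names, not the actual sglang code) of why a kwarg that alternates between None and a tensor forces TorchDynamo to specialize, and how a preallocated buffer that is always passed keeps the traced path stable:

```python
import torch

def lm_forward(hidden, input_deepstack_embeds=None):
    # Dynamo guards on whether the kwarg is None, so this branch
    # specializes the graph; flipping None <-> tensor across requests
    # flips the branch and triggers a recompile.
    if input_deepstack_embeds is not None:
        hidden = hidden + input_deepstack_embeds
    return hidden

compiled = torch.compile(lm_forward, backend="eager")

# Stable path: always pass the same preallocated buffer. Text-only
# requests leave it zeroed; multimodal requests copy real deepstack
# embeds into it. The kwarg is never None, so one graph serves both.
deepstack_buffer = torch.zeros(4, 8)

# Text-only request: buffer stays zero, output equals the input.
text_out = compiled(torch.ones(4, 8), input_deepstack_embeds=deepstack_buffer)

# Multimodal request: reuse the buffer, no signature change.
deepstack_buffer.copy_(torch.full((4, 8), 0.5))
mm_out = compiled(torch.ones(4, 8), input_deepstack_embeds=deepstack_buffer)
```

The `backend="eager"` compile here only exercises the Dynamo tracing/guard layer; in the real PCG path the same principle applies under full graph capture.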
Accuracy Tests
The updated case in test/manual/nightly/test_vlms_vit_cuda_graph.py covers accuracy.
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci