[Bugfix] fix recompile in qwen3 vl #16785

Open
narutolhy wants to merge 1 commit into sgl-project:main from narutolhy:fix_qwen3_vl_recompile

Conversation

@narutolhy
Contributor

@narutolhy narutolhy commented Jan 9, 2026

Motivation

Qwen3-VL automatically injects input_deepstack_embeds into the language model inputs when requests contain mm_inputs (multimodal inputs). The current implementation does not handle this injected tensor consistently, so the model execution path differs between multimodal and text-only requests. In mixed traffic, this leads to unnecessary TorchDynamo recompilations at runtime (e.g., repeatedly specializing on the presence/absence of deepstack inputs), increasing latency and reducing throughput stability.

This PR makes the deepstack injection path explicit and stable to avoid redundant recompilation for Qwen3-VL multimodal inference.

Modifications

Standardized the Qwen3-VL deepstack injection behavior so that the language model forward path remains stable when mm_inputs is present.

Ensured input_deepstack_embeds is handled in a consistent way across requests to prevent runtime recompilation churn (especially under mixed multimodal/text workloads).

Refactored the relevant code paths to make the deepstack injection explicit and easier to reason about.
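As a loose illustration of the "stable signature" idea above, the text-only path can substitute a fixed-shape, zero-filled placeholder instead of passing None. This is a hypothetical, framework-free sketch: the names `normalize_deepstack` and `HIDDEN` are invented for illustration, and the real code operates on torch tensors rather than Python lists.

```python
# Hypothetical sketch of the "stable kwarg" pattern (names invented).
# Instead of passing None for text-only requests and a tensor for
# multimodal ones (which makes a tracing compiler specialize on each
# case), always pass a buffer of the same shape, zero-filled when absent.
HIDDEN = 8  # assumed hidden size for the illustration


def normalize_deepstack(embeds, num_tokens):
    """Give deepstack embeds a stable type and shape.

    Text-only requests receive a zero-filled placeholder instead of
    None, so the language-model forward always sees the same kwarg
    structure and never branches on its absence.
    """
    if embeds is None:
        return [[0.0] * HIDDEN for _ in range(num_tokens)]
    return embeds


# Mixed traffic: one multimodal request, one text-only request.
mm_path = normalize_deepstack([[1.0] * HIDDEN] * 4, num_tokens=4)
text_path = normalize_deepstack(None, num_tokens=4)

# Both paths now produce the same structure -- no branch on None.
assert type(mm_path) is type(text_path)
assert len(text_path) == 4 and len(text_path[0]) == HIDDEN
```

The same shape-stabilizing idea applies whether the placeholder is preallocated once or built per batch; the point is that the forward signature no longer varies with the request mix.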

Accuracy Tests

The updated test case in test/manual/nightly/test_vlms_vit_cuda_graph.py covers accuracy.

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.


@narutolhy narutolhy changed the title [WIP] fix recompile in qwen3 vl [Bugfix] fix recompile in qwen3 vl Jan 12, 2026
@github-actions github-actions bot added the Multi-modal multi-modal language model label Jan 13, 2026
@yuan-luo yuan-luo self-requested a review January 13, 2026 08:38
@yuan-luo
Collaborator

yuan-luo commented Jan 14, 2026

Could you help to verify with SGLANG_VIT_ENABLE_CUDA_GRAPH=1? The Qwen3-VL ViT CUDA Graph work considered injecting deepstack output into the LLM layers; I just want to make sure the full ViT+LLM CUDA Graph path is working correctly.
The issue also exists when SGLANG_VIT_ENABLE_CUDA_GRAPH is set to 0.
#15320

@yuan-luo
Collaborator

/rerun-failed-ci

@narutolhy
Contributor Author

> Could you help to verify with SGLANG_VIT_ENABLE_CUDA_GRAPH=1? The Qwen3-VL ViT CUDA Graph work considered injecting deepstack output into the LLM layers; I just want to make sure the full ViT+LLM CUDA Graph path is working correctly.
> The issue also exists when SGLANG_VIT_ENABLE_CUDA_GRAPH is set to 0.

Hi @yuan-luo, thank you for your review.
I tested with the ViT CUDA Graph before; it hit some errors. This is the issue: #16883
I think this is due to the latest optimizations, but after I removed them, it resulted in a CUDA illegal address error during graph replay. I think it's because different graphs are being used for the same seq_len, and your planned refactoring should solve this problem: #16902
I'll try testing it with your new PR. Thank you.

Collaborator

@Oasis-Git Oasis-Git left a comment


Previously, when I ran this model, I noticed the recompilation was caused by grad mode. Could you please share which part introduces a forward pass with gradients?

@yuan-luo
Collaborator

yuan-luo commented Jan 16, 2026

> > Could you help to verify with SGLANG_VIT_ENABLE_CUDA_GRAPH=1? The Qwen3-VL ViT CUDA Graph work considered injecting deepstack output into the LLM layers; I just want to make sure the full ViT+LLM CUDA Graph path is working correctly.
> > The issue also exists when SGLANG_VIT_ENABLE_CUDA_GRAPH is set to 0.
>
> Hi @yuan-luo, thank you for your review. I tested with the ViT CUDA Graph before; it hit some errors. This is the issue: #16883. I think this is due to the latest optimizations, but after I removed them, it resulted in a CUDA illegal address error during graph replay. I think it's because different graphs are being used for the same seq_len, and your planned refactoring should solve this problem: #16902. I'll try testing it with your new PR. Thank you.

@narutolhy Did #16902 fix your problem?

@narutolhy
Contributor Author

> > > Could you help to verify with SGLANG_VIT_ENABLE_CUDA_GRAPH=1? The Qwen3-VL ViT CUDA Graph work considered injecting deepstack output into the LLM layers; I just want to make sure the full ViT+LLM CUDA Graph path is working correctly.
> > > The issue also exists when SGLANG_VIT_ENABLE_CUDA_GRAPH is set to 0.
> >
> > Hi @yuan-luo, thank you for your review. I tested with the ViT CUDA Graph before; it hit some errors. This is the issue: #16883. I think this is due to the latest optimizations, but after I removed them, it resulted in a CUDA illegal address error during graph replay. I think it's because different graphs are being used for the same seq_len, and your planned refactoring should solve this problem: #16902. I'll try testing it with your new PR. Thank you.
>
> @narutolhy Did #16902 fix your problem?

Hi @yuan-luo, I still got the error with the ViT CUDA Graph on your branch. I will try to debug it. Thank you.

```
File "/opt/tiger/search-llm-inference/qwen/sglang/python/sglang/srt/managers/mm_utils.py", line 1091, in general_mm_embed_routine
  input_embeds, other_info = embed_mm_inputs(
                             ^^^^^^^^^^^^^^^^
File "/opt/tiger/search-llm-inference/qwen/sglang/python/sglang/srt/managers/mm_utils.py", line 983, in embed_mm_inputs
  embedding, mask, input_ids = get_embedding_and_mask(
                               ^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/search-llm-inference/qwen/sglang/python/sglang/srt/managers/mm_utils.py", line 903, in get_embedding_and_mask
  special_multimodal_mask = _get_multimodal_mask(input_ids, placeholder_tensor)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/search-llm-inference/qwen/sglang/python/sglang/srt/managers/mm_utils.py", line 820, in _get_multimodal_mask
  return torch.isin(input_ids, placeholder_tensor).unsqueeze(-1)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
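As the error message itself suggests, rerunning with synchronous kernel launches makes the traceback point at the actual failing kernel. A minimal sketch of that debugging step (server flags omitted; combine with the launch command as needed):

```shell
# Sketch: with CUDA_LAUNCH_BLOCKING=1, CUDA errors surface at the
# launching call site, so the Python traceback becomes trustworthy.
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=${CUDA_LAUNCH_BLOCKING}"
```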

@narutolhy
Contributor Author

> Previously, when I ran this model, I noticed the recompilation was caused by grad mode. Could you please share which part introduces a forward pass with gradients?

Hi @Oasis-Git, this part ran forward with gradients enabled, and my latest code only adds `@torch.no_grad()` around it:

```python
hidden_states = general_mm_embed_routine(
    input_ids=input_ids,
    forward_batch=forward_batch,
    language_model=self.model,
    multimodal_model=self,
    positions=positions,
    use_deepstack=self.use_deepstack,
    pp_proxy_tensors=pp_proxy_tensors,
)
```
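A framework-free sketch of why this matters (all names here are invented for illustration, not TorchDynamo internals): a tracing compiler caches one specialization per guard set, and grad mode is part of the guards, so mixed grad/no-grad traffic doubles the cached variants while a forced no-grad path keeps exactly one.

```python
# Toy model of guard-based specialization. A real JIT keys its cache on
# many guards; here we keep only the one that matters for this thread:
# whether gradients were enabled when the function was traced.
compiled_variants = {}


def fake_compile(fn, grad_enabled):
    key = (fn.__name__, grad_enabled)  # grad mode acts as a guard
    if key not in compiled_variants:
        # Cache miss == a recompilation in the real system.
        compiled_variants[key] = f"compiled {key}"
    return compiled_variants[key]


def forward(x):
    return x * 2


# Multimodal requests ran with gradients enabled, text-only without:
fake_compile(forward, grad_enabled=True)
fake_compile(forward, grad_enabled=False)
assert len(compiled_variants) == 2  # two specializations -> churn

# After wrapping the path in no_grad, every request sees one mode:
compiled_variants.clear()
for _ in range(10):
    fake_compile(forward, grad_enabled=False)
assert len(compiled_variants) == 1  # stable, no recompilation
```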

@narutolhy
Contributor Author

Hi @yuan-luo
A new discovery: when I use --disable-custom-all-reduce along with your pull request, the ViT CUDA graph works correctly. I'm trying to understand the logic of custom-all-reduce. Thank you.
```shell
SGLANG_MM_FEATURE_CACHE_MB=4096
SGLANG_USE_CUDA_IPC_TRANSPORT=1
SGLANG_VLM_CACHE_SIZE_MB=0
SGLANG_VIT_ENABLE_CUDA_GRAPH=1
python3 -m sglang.launch_server --host 127.0.0.1 --mem-fraction-static 0.7 \
  --port 30000 --max-running-requests 64 --chunked-prefill-size 8192 \
  --attention-backend fa3 --mm-attention-backend fa3 --enable-multimodal \
  --model Qwen/Qwen3-VL-8B-Instruct --disable-radix-cache \
  --piecewise-cuda-graph-max-tokens 4096 --enable-piecewise-cuda-graph \
  --piecewise-cuda-graph-compiler eager --tp-size 4 \
  --disable-custom-all-reduce --enable-torch-symm-mem
```

@narutolhy
Contributor Author

Hi @yuan-luo, I found the custom-all-reduce-related failure and submitted a fix in #17255. Please review it, thank you.

@narutolhy narutolhy force-pushed the fix_qwen3_vl_recompile branch from ca6a2c4 to 3c1b102 January 17, 2026 22:03
Collaborator

@Oasis-Git Oasis-Git left a comment


LGTM from the PCG side. @yuan-luo Could you take a further look, please? Thanks.

@yuan-luo
Collaborator

yuan-luo commented Feb 3, 2026

Will take a look at it soon.

@narutolhy
Contributor Author

Hi @yuan-luo, do you have time to take a look? Or could you take a look when you're back from your holiday?
Thank you.

@narutolhy narutolhy force-pushed the fix_qwen3_vl_recompile branch from 3c1b102 to 171c9b9 February 25, 2026 06:50
@narutolhy narutolhy force-pushed the fix_qwen3_vl_recompile branch from 171c9b9 to c5c286c March 2, 2026 06:11
@narutolhy
Contributor Author

/tag-and-rerun-ci

edwingao28 added a commit to edwingao28/sglang that referenced this pull request Mar 18, 2026
- LLaVA/LLaVaVid: add None guard for mm_inputs during PCG warmup,
  where ForwardBatch is constructed with mm_inputs=None
- Qwen3VL/Omni: stabilize input_deepstack_embeds kwarg to prevent
  TorchDynamo recompilation during PCG capture. Preallocate deepstack
  buffer and always pass it for consistent function signatures.

Based on sgl-project#16785.

Co-Authored-By: narutolhy <582909902@qq.com>