[Bugfix] fix recompile in qwen3 vl #16785

Open
narutolhy wants to merge 1 commit into sgl-project:main from narutolhy:fix_qwen3_vl_recompile

Conversation

@narutolhy
Contributor

@narutolhy narutolhy commented Jan 9, 2026

Motivation

Qwen3-VL automatically injects input_deepstack_embeds into the language model inputs when requests contain mm_inputs (multimodal inputs). The current implementation does not handle this injected tensor consistently, so the model execution path differs between multimodal and text-only requests. In mixed traffic, this leads to unnecessary TorchDynamo recompilations at runtime (e.g., repeatedly specializing on the presence/absence of deepstack inputs), increasing latency and reducing throughput stability.

This PR makes the deepstack injection path explicit and stable to avoid redundant recompilation for Qwen3-VL multimodal inference.

Modifications

Standardized the Qwen3-VL deepstack injection behavior so that the language model forward path remains stable when mm_inputs is present.

Ensured input_deepstack_embeds is handled in a consistent way across requests to prevent runtime recompilation churn (especially under mixed multimodal/text workloads).

Refactored the relevant code paths to make the deepstack injection explicit and easier to reason about.
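As a loose illustration of the "stable signature" idea above, the text-only path can substitute a fixed-shape, zero-filled placeholder instead of passing None. This is a hypothetical, framework-free sketch: the names `normalize_deepstack` and `HIDDEN` are invented for illustration, and the real code operates on torch tensors rather than Python lists.

```python
# Hypothetical sketch of the "stable kwarg" pattern (names invented).
# Instead of passing None for text-only requests and a tensor for
# multimodal ones (which makes a tracing compiler specialize on each
# case), always pass a buffer of the same shape, zero-filled when absent.
HIDDEN = 8  # assumed hidden size for the illustration


def normalize_deepstack(embeds, num_tokens):
    """Give deepstack embeds a stable type and shape.

    Text-only requests receive a zero-filled placeholder instead of
    None, so the language-model forward always sees the same kwarg
    structure and never branches on its absence.
    """
    if embeds is None:
        return [[0.0] * HIDDEN for _ in range(num_tokens)]
    return embeds


# Mixed traffic: one multimodal request, one text-only request.
mm_path = normalize_deepstack([[1.0] * HIDDEN] * 4, num_tokens=4)
text_path = normalize_deepstack(None, num_tokens=4)

# Both paths now produce the same structure -- no branch on None.
assert type(mm_path) is type(text_path)
assert len(text_path) == 4 and len(text_path[0]) == HIDDEN
```

The same shape-stabilizing idea applies whether the placeholder is preallocated once or built per batch; the point is that the forward signature no longer varies with the request mix.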

Accuracy Tests

The updated test case in test/manual/nightly/test_vlms_vit_cuda_graph.py covers accuracy.

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.


@narutolhy narutolhy changed the title [WIP] fix recompile in qwen3 vl [Bugfix] fix recompile in qwen3 vl Jan 12, 2026
@github-actions github-actions bot added the Multi-modal multi-modal language model label Jan 13, 2026
@yuan-luo yuan-luo self-requested a review January 13, 2026 08:38
@yuan-luo
Collaborator

yuan-luo commented Jan 14, 2026

Could you help to verify with SGLANG_VIT_ENABLE_CUDA_GRAPH=1? The Qwen3-VL ViT CUDA Graph work considered injecting deepstack output into the LLM layers; I just want to make sure the full ViT+LLM CUDA Graph path is working correctly.
The issue also exists when SGLANG_VIT_ENABLE_CUDA_GRAPH is set to 0.
#15320

@yuan-luo
Collaborator

/rerun-failed-ci

@narutolhy
Contributor Author

> Could you help to verify with SGLANG_VIT_ENABLE_CUDA_GRAPH=1? The Qwen3-VL ViT CUDA Graph work considered injecting deepstack output into the LLM layers; I just want to make sure the full ViT+LLM CUDA Graph path is working correctly.
> The issue also exists when SGLANG_VIT_ENABLE_CUDA_GRAPH is set to 0.

Hi @yuan-luo, thank you for your review.
I tested with the ViT CUDA Graph before; it hit some errors. This is the issue: #16883
I think this is due to the latest optimizations, but after I removed them, it resulted in a CUDA illegal address error during graph replay. I think it's because different graphs are being used for the same seq_len, and your planned refactoring should solve this problem: #16902
I'll try testing it with your new PR. Thank you.

Collaborator

@Oasis-Git Oasis-Git left a comment


Previously, when I ran this model, I noticed the recompilation was caused by grad mode. Could you please share which part introduces a forward pass with gradients?

@yuan-luo
Collaborator

yuan-luo commented Jan 16, 2026

> > Could you help to verify with SGLANG_VIT_ENABLE_CUDA_GRAPH=1? The Qwen3-VL ViT CUDA Graph work considered injecting deepstack output into the LLM layers; I just want to make sure the full ViT+LLM CUDA Graph path is working correctly.
> > The issue also exists when SGLANG_VIT_ENABLE_CUDA_GRAPH is set to 0.
>
> Hi @yuan-luo, thank you for your review. I tested with the ViT CUDA Graph before; it hit some errors. This is the issue: #16883. I think this is due to the latest optimizations, but after I removed them, it resulted in a CUDA illegal address error during graph replay. I think it's because different graphs are being used for the same seq_len, and your planned refactoring should solve this problem: #16902. I'll try testing it with your new PR. Thank you.

@narutolhy Did #16902 fix your problem?

@narutolhy
Contributor Author

> > > Could you help to verify with SGLANG_VIT_ENABLE_CUDA_GRAPH=1? The Qwen3-VL ViT CUDA Graph work considered injecting deepstack output into the LLM layers; I just want to make sure the full ViT+LLM CUDA Graph path is working correctly.
> > > The issue also exists when SGLANG_VIT_ENABLE_CUDA_GRAPH is set to 0.
> >
> > Hi @yuan-luo, thank you for your review. I tested with the ViT CUDA Graph before; it hit some errors. This is the issue: #16883. I think this is due to the latest optimizations, but after I removed them, it resulted in a CUDA illegal address error during graph replay. I think it's because different graphs are being used for the same seq_len, and your planned refactoring should solve this problem: #16902. I'll try testing it with your new PR. Thank you.
>
> @narutolhy Did #16902 fix your problem?

Hi @yuan-luo, I still got the error with the ViT CUDA Graph on your branch. I will try to debug it. Thank you.

```
File "/opt/tiger/search-llm-inference/qwen/sglang/python/sglang/srt/managers/mm_utils.py", line 1091, in general_mm_embed_routine
  input_embeds, other_info = embed_mm_inputs(
                             ^^^^^^^^^^^^^^^^
File "/opt/tiger/search-llm-inference/qwen/sglang/python/sglang/srt/managers/mm_utils.py", line 983, in embed_mm_inputs
  embedding, mask, input_ids = get_embedding_and_mask(
                               ^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/search-llm-inference/qwen/sglang/python/sglang/srt/managers/mm_utils.py", line 903, in get_embedding_and_mask
  special_multimodal_mask = _get_multimodal_mask(input_ids, placeholder_tensor)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/search-llm-inference/qwen/sglang/python/sglang/srt/managers/mm_utils.py", line 820, in _get_multimodal_mask
  return torch.isin(input_ids, placeholder_tensor).unsqueeze(-1)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
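As the error message itself suggests, rerunning with synchronous kernel launches makes the traceback point at the actual failing kernel. A minimal sketch of that debugging step (server flags omitted; combine with the launch command as needed):

```shell
# Sketch: with CUDA_LAUNCH_BLOCKING=1, CUDA errors surface at the
# launching call site, so the Python traceback becomes trustworthy.
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=${CUDA_LAUNCH_BLOCKING}"
```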

@narutolhy
Contributor Author

> Previously, when I ran this model, I noticed the recompilation was caused by grad mode. Could you please share which part introduces a forward pass with gradients?

Hi @Oasis-Git, this part ran forward with gradients enabled, and my latest code only adds `@torch.no_grad()` around it:

```python
hidden_states = general_mm_embed_routine(
    input_ids=input_ids,
    forward_batch=forward_batch,
    language_model=self.model,
    multimodal_model=self,
    positions=positions,
    use_deepstack=self.use_deepstack,
    pp_proxy_tensors=pp_proxy_tensors,
)
```
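A framework-free sketch of why this matters (all names here are invented for illustration, not TorchDynamo internals): a tracing compiler caches one specialization per guard set, and grad mode is part of the guards, so mixed grad/no-grad traffic doubles the cached variants while a forced no-grad path keeps exactly one.

```python
# Toy model of guard-based specialization. A real JIT keys its cache on
# many guards; here we keep only the one that matters for this thread:
# whether gradients were enabled when the function was traced.
compiled_variants = {}


def fake_compile(fn, grad_enabled):
    key = (fn.__name__, grad_enabled)  # grad mode acts as a guard
    if key not in compiled_variants:
        # Cache miss == a recompilation in the real system.
        compiled_variants[key] = f"compiled {key}"
    return compiled_variants[key]


def forward(x):
    return x * 2


# Multimodal requests ran with gradients enabled, text-only without:
fake_compile(forward, grad_enabled=True)
fake_compile(forward, grad_enabled=False)
assert len(compiled_variants) == 2  # two specializations -> churn

# After wrapping the path in no_grad, every request sees one mode:
compiled_variants.clear()
for _ in range(10):
    fake_compile(forward, grad_enabled=False)
assert len(compiled_variants) == 1  # stable, no recompilation
```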

@narutolhy
Contributor Author

Hi @yuan-luo
A new discovery: when I use --disable-custom-all-reduce along with your pull request, the ViT CUDA graph works correctly. I'm trying to understand the logic of custom-all-reduce. Thank you.
```shell
SGLANG_MM_FEATURE_CACHE_MB=4096
SGLANG_USE_CUDA_IPC_TRANSPORT=1
SGLANG_VLM_CACHE_SIZE_MB=0
SGLANG_VIT_ENABLE_CUDA_GRAPH=1
python3 -m sglang.launch_server --host 127.0.0.1 --mem-fraction-static 0.7 \
  --port 30000 --max-running-requests 64 --chunked-prefill-size 8192 \
  --attention-backend fa3 --mm-attention-backend fa3 --enable-multimodal \
  --model Qwen/Qwen3-VL-8B-Instruct --disable-radix-cache \
  --piecewise-cuda-graph-max-tokens 4096 --enable-piecewise-cuda-graph \
  --piecewise-cuda-graph-compiler eager --tp-size 4 \
  --disable-custom-all-reduce --enable-torch-symm-mem
```

@narutolhy
Contributor Author

Hi @yuan-luo, I found the custom-all-reduce-related failure and submitted a fix in #17255. Please review it, thank you.

@narutolhy narutolhy force-pushed the fix_qwen3_vl_recompile branch from ca6a2c4 to 3c1b102 January 17, 2026 22:03
Collaborator

@Oasis-Git Oasis-Git left a comment


LGTM from the PCG side. @yuan-luo Could you take a further look, please? Thanks.

@yuan-luo
Collaborator

yuan-luo commented Feb 3, 2026

Will take a look at it soon.

@narutolhy
Contributor Author

Hi @yuan-luo, do you have time to take a look? Or could you take a look when you're back from your holiday?
Thank you.

@narutolhy narutolhy force-pushed the fix_qwen3_vl_recompile branch from 3c1b102 to 171c9b9 February 25, 2026 06:50
@narutolhy narutolhy force-pushed the fix_qwen3_vl_recompile branch from 171c9b9 to c5c286c March 2, 2026 06:11
@narutolhy
Contributor Author

/tag-and-rerun-ci

edwingao28 added a commit to edwingao28/sglang that referenced this pull request Mar 18, 2026
- LLaVA/LLaVaVid: add None guard for mm_inputs during PCG warmup,
  where ForwardBatch is constructed with mm_inputs=None
- Qwen3VL/Omni: stabilize input_deepstack_embeds kwarg to prevent
  TorchDynamo recompilation during PCG capture. Preallocate deepstack
  buffer and always pass it for consistent function signatures.

Based on sgl-project#16785.

Co-Authored-By: narutolhy <582909902@qq.com>