[Bugfix] Fix Triton stream capture error on A100 in GDN attention with MTP speculative decoding#39483
Open
jacob-crux wants to merge 1 commit into
Conversation
…h MTP speculative decoding Signed-off-by: jacob-crux <jacob.crux@kakaocorp.com>
Contributor
Code Review
This pull request updates the _warmup_and_capture method in vllm/v1/worker/gpu_model_runner.py to pass the is_graph_capturing argument, set to the value of force_attention, during the model warmup phase. I have no feedback to provide as there are no review comments to evaluate.
any update?
Purpose
Fix a Triton "operation not permitted when stream is capturing" error during FULL CUDA graph capture for models using GDN (Gated Delta Net) attention with MTP speculative decoding.
Reproduced with Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 (Qwen3.5 MoE architecture, GPTQ Int4 quantized) on NVIDIA A100-SXM4-80GB.
When `--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'` is enabled, the engine fails to start during CUDA graph capture with:

The spec-decode Triton kernels (`causal_conv1d_update` with `IS_SPEC_DECODING=True`, `fused_sigmoid_gating_delta_rule_update`, `fused_gdn_gating`, `l2norm_fwd_kernel2`, etc.) attempt to JIT-compile during the actual CUDA graph capture, calling Triton's `load_binary` while a CUDA stream capture is active, which CUDA forbids.

Root cause: During `_warmup_and_capture`, the warmup `_dummy_run` calls do not pass `is_graph_capturing=True`. As a result, attention metadata builders use `build()` instead of `build_for_cudagraph_capture()`. For GDN attention, `build()` receives a dummy `num_decode_draft_tokens=0`, so `spec_sequence_masks=None` and the non-spec decode path is taken during warmup. The spec-decode Triton kernels are therefore never JIT-compiled during warmup. When the actual capture then runs (which always uses `is_graph_capturing=True`), `build_for_cudagraph_capture()` auto-generates spec-decode metadata from `query_start_loc` and triggers the spec-decode code path, at which point Triton tries to JIT-compile the kernels for the first time inside an active stream capture and crashes.

Fix: Pass
`is_graph_capturing=force_attention` to the warmup `_dummy_run` so that FULL-mode warmup exercises the same `build_for_cudagraph_capture()` path as the actual capture, ensuring all required Triton kernels are pre-compiled before capture begins. This only affects FULL cudagraph warmup; PIECEWISE mode is unchanged because GDN attention runs as a splitting op outside the captured graph there.
Test Plan
Serve Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 with `qwen3_next_mtp` speculative decoding enabled:
Test Result
Before fix — Engine fails to start during CUDA graph capture:
After fix — Engine starts successfully and serves requests normally with MTP speculative decoding enabled.
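The failure mode described above can be reproduced in miniature with a self-contained sketch (plain Python, not vLLM code; all names here are illustrative stand-ins). `jit_compile` plays the role of Triton's `load_binary`, which may not run while a stream capture is active, and the `is_graph_capturing` flag selects between the non-spec and spec-decode paths, mirroring `build()` vs `build_for_cudagraph_capture()`:

```python
# Self-contained sketch of the bug and the fix (illustrative, not vLLM code).

compiled_kernels = set()

def jit_compile(name, capturing):
    """Stand-in for Triton's load_binary: illegal during stream capture."""
    if name in compiled_kernels:
        return  # already compiled: safe to launch during capture
    if capturing:
        raise RuntimeError("operation not permitted when stream is capturing")
    compiled_kernels.add(name)

def run_model(is_graph_capturing, capturing_now):
    # Mirrors the divergence in the PR description: the capture-time
    # metadata builder always takes the spec-decode path, while the
    # default warmup path (dummy num_decode_draft_tokens=0) does not.
    if is_graph_capturing:
        jit_compile("spec_decode_kernel", capturing_now)
    else:
        jit_compile("non_spec_decode_kernel", capturing_now)

def warmup_and_capture(warmup_flag):
    compiled_kernels.clear()
    # Warmup: no stream capture is active yet.
    run_model(is_graph_capturing=warmup_flag, capturing_now=False)
    # Actual capture: always runs with is_graph_capturing=True.
    run_model(is_graph_capturing=True, capturing_now=True)

# Before the fix: warmup compiles only the non-spec kernel, so the
# spec-decode kernel is first compiled inside the capture and raises.
try:
    warmup_and_capture(warmup_flag=False)
    before = "ok"
except RuntimeError as e:
    before = str(e)

# After the fix: warmup mirrors the capture path, pre-compiling the kernel.
warmup_and_capture(warmup_flag=True)
after = "ok"
print(before, "/", after)
```

The sketch prints the simulated capture-time error for the pre-fix flow and "ok" for the post-fix flow, matching the behavior change this PR describes.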
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.