[Bugfix] Fix GDN FLA kernel crashes with NULL_BLOCK_ID=0 CUDA graph padding #39064
…flags FLA SSM kernels checked `state_idx < 0` for PAD_SLOT_ID (-1), but PR vllm-project#35431 changed CUDA graph padding to NULL_BLOCK_ID (0). Changed to `state_idx <= 0` in fused_recurrent.py and fused_sigmoid_gating.py to handle both values. Matches upstream PR vllm-project#39064.

Also adds `--enable-auto-tool-choice --tool-call-parser hermes` to run scripts and benchmark result patterns to .gitignore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
**Investigation update — the fix was incomplete, now complete**

After some back-and-forth trying to reproduce: the original fix (only …)

**The bug**

Commit …

**Why it was hard to reproduce**

**Test results**

The latest push adds the missing …
Please let me know when it's ready for a test. Happy to try practicing running feature branches on GPUs.

It's ready @Gregory-Pereira, please test once.

Ping @Gregory-Pereira @vadiklyutiy @ZJY0516 — this fix is needed for Qwen3.5 to stop crashing when using tensor parallelism.
vadiklyutiy left a comment:
It is unclear to me why index 0 doesn't appear as a real index.
Maybe the issue is that padding should fill with -1 instead of 0?
Alberto-Codes left a comment:
Independent verification — straggler sweep
Dug into this while it was blocking TurboQuant validation on Qwen3.5-35B-A3B downstream (the crash reported against #38479 traces back here). Posting the evidence for anyone wanting to verify the scope before approving.
1. This fix covers the complete straggler set. Grepped upstream/main for all guards that might have missed the NULL_BLOCK_ID=0 convention change from #35431:
```
$ git grep -nE 'state_idx\s*(<=?|>=?)\s*0|final_state_idx\s*(<=?|>=?)\s*0' \
    -- 'vllm/model_executor/layers/**' 'csrc/**' 'vllm/v1/**'
vllm/model_executor/layers/fla/ops/fused_recurrent.py:114: if state_idx < 0:
vllm/model_executor/layers/fla/ops/fused_recurrent.py:158: if final_state_idx >= 0:
vllm/model_executor/layers/fla/ops/fused_recurrent.py:295: if state_idx < 0:
vllm/model_executor/layers/fla/ops/fused_sigmoid_gating.py:114: if state_idx < 0:
vllm/model_executor/layers/fla/ops/fused_sigmoid_gating.py:163: if final_state_idx >= 0:
```

Exactly the 5 guards this PR flips. No other kernel or call site silently accepts 0 as a valid state index.
2. NULL_BLOCK_ID producers (what actually fills padded slots with 0):

- `vllm/v1/attention/backends/gdn_attn.py:372, 418` — GDN state indices
- `vllm/v1/attention/backends/mamba_attn.py:507` — mamba state indices
- `vllm/v1/worker/gpu_model_runner.py:2152` — block table tensor
3. Correct consumers (using the new convention):

- `vllm/model_executor/layers/mamba/ops/mamba_ssm.py` — `state_batch_idx != null_block_id` (L206), `token_dst_idx != null_block_id` (L260); the kernel takes `null_block_id` as a param
- `vllm/model_executor/layers/mamba/ops/causal_conv1d.py` — accepts both `pad_slot_id` and `null_block_id` params and handles each in the right code path
4. Verified not-a-straggler: `vllm/model_executor/layers/lightning_attn.py:622` uses `slot_id == pad_slot_id`, but it consumes a separate `slot_idx` tensor that traces back to `slot_mappings.fill_(PAD_SLOT_ID)` in `vllm/v1/worker/gpu/block_table.py:164`. Different tensor, different padding convention, still -1. Not affected by #35431.
5. Why the `<= 0` guard is correct, in response to @vadiklyutiy's concern: `NULL_BLOCK_ID = 0` is a reserved sentinel by definition — real sequences never get assigned slot 0 after #35431. The strongest argument is that `mamba_ssm.py` consumes the same `ssm_state_indices` tensor and already treats 0 as padding. Two kernels reading the same tensor with disagreeing sentinel semantics is a bug regardless of the specific values.
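To make the disagreement concrete, here is a toy Python sketch (hypothetical names and values, not vLLM code) of two consumers reading the same index tensor with different sentinel semantics:

```python
# Toy illustration: two consumers of the same state-index tensor.
# Names and values are hypothetical; this is not vLLM code.
NULL_BLOCK_ID = 0
indices = [0, 3, 0, 5]  # 0 = CUDA graph padding after #35431

# mamba_ssm-style consumer: treats 0 as padding (correct convention).
real_a = [i for i in indices if i != NULL_BLOCK_ID]

# Pre-fix FLA-style consumer: treats only negatives as padding (buggy).
real_b = [i for i in indices if not (i < 0)]

assert real_a == [3, 5]        # padding skipped
assert real_b == [0, 3, 0, 5]  # padded entries processed as slot 0
```

The buggy consumer ends up touching slot 0 for every padded entry, which is exactly the `ssm_state[0]` corruption described in this PR.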
6. Minor follow-up (not blocking): stale comment at `vllm/v1/worker/gpu_model_runner.py:3738` references `blk_table_tensor` on a line that operates on `slot_mapping`, and mentions "mamba PAD_SLOT_ID" when mamba now uses NULL_BLOCK_ID for state indices. Code is correct, comment is misleading. Janitor PR material.
Longer-term thought: the three Triton kernel families in vLLM (`mamba/ops/`, `fla/ops/`, and `lightning_attn.py`) each invented their own padding-guard pattern. Parameterizing the FLA kernels with `null_block_id: tl.constexpr` (matching mamba_ssm's pattern) instead of `<= 0` would eliminate the straggler class permanently. Happy to open a follow-up if maintainers want it — not a blocker for this PR.
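As a sketch of what that parameterization could look like, here is a plain-Python illustration (not Triton, and all names hypothetical) of passing the sentinel in as a parameter rather than hard-coding a comparison in each kernel:

```python
# Illustrative sketch only: plain Python standing in for a Triton kernel,
# with the padding sentinel passed in (mirroring mamba_ssm's
# `null_block_id` parameter). All names here are hypothetical.
def load_initial_state(state_idx, ssm_state, null_block_id=0):
    # Explicit sentinel comparison: if the padding convention changes
    # again, only the call site changes, not every kernel guard.
    # Negative indices still cover the legacy PAD_SLOT_ID=-1 case.
    if state_idx == null_block_id or state_idx < 0:
        return None  # padded entry: skip the state load
    return ssm_state[state_idx]

states = ["slot-0", "seq-a", "seq-b"]
assert load_initial_state(0, states) is None    # NULL_BLOCK_ID padding
assert load_initial_state(-1, states) is None   # legacy PAD_SLOT_ID
assert load_initial_state(2, states) == "seq-b" # real sequence
```

The design point is that the guard names the convention instead of encoding it, so a future sentinel change cannot silently miss a kernel.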
cc @LucasWilkinson @MatthewBonanni since you co-authored #35431 and would know if I'm missing any producer/consumer site.
@vadiklyutiy Your question is the right one to ask, and I think it's answerable without reproducing — the answer is in the parent commit that introduced the regression.
The strongest correctness argument is actually this: I grepped upstream/main to confirm the scope and posted the full sweep evidence in a separate comment above. The 5 guards in this PR are the complete straggler set. Reverting to …
Please fix the DCO.

Sorry, I had hoped to reproduce this, but since it is Blackwell-specific I cannot; I'm stuck with H100s and H200s. Hopefully I'll get more in the future.

@Gregory-Pereira I have this issue happening with H100 NVL GPUs.
…A graphs

The FLA SSM kernel (`fused_sigmoid_gating_delta_rule_update_kernel`) used `state_idx < 0` to skip padded entries, but CUDA graph block table padding uses NULL_BLOCK_ID=0. Since 0 is not < 0, the kernel processed padded entries as real requests, reading/writing `ssm_state[0]` and corrupting a live cache slot. This caused cascading memory corruption manifesting as illegal memory access in downstream MoE GEMM kernels.

Fix: change the guard to `state_idx <= 0` (initial state load) and `final_state_idx > 0` (final state store), making it consistent with the causal_conv1d kernel, which already guards on `== NULL_BLOCK_ID`.

Only affects GDN hybrid models (e.g. Qwen3.5-35B-A3B) under CUDA graphs with TP>1. Pure transformer models are unaffected as they have no GDN layers.

Fixes: vllm-project#39025

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vibhav Agarwal <vibhavagarwal5@gmail.com>
Signed-off-by: vibhavagarwal5 <vibhavagarwal5@gmail.com>
The fused_recurrent kernel has the identical bug as fused_sigmoid_gating: `state_idx < 0` guards only catch PAD_SLOT_ID=-1 but miss NULL_BLOCK_ID=0, which is used for CUDA graph block table padding. This causes `ssm_state[0]` corruption for padded entries.

Affects 3 locations in fused_recurrent.py:
- Multi-token initial state load guard (line 114)
- Multi-token final state store guard (line 158)
- Packed decode kernel guard (line 295)

Without this fix, the fused_sigmoid_gating fix alone still crashes under higher concurrency (200 concurrent requests) because fused_recurrent is used in the packed decode fast path.

Tested: 10,000 requests (50 rounds x 200 concurrency, mixed short/medium/long prompts, ~2.8M tokens) with zero failures on 4x Blackwell RTX PRO 6000, TP=2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vibhav Agarwal <vibhavagarwal5@gmail.com>
Signed-off-by: vibhavagarwal5 <vibhavagarwal5@gmail.com>
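The guard change itself is small enough to illustrate in plain Python. This is a sketch of the predicate logic only, not the actual Triton kernels; the sentinel values are those described in the commit messages above:

```python
# Sketch of the padding-guard predicates, not the real Triton kernels.
# Sentinel values are as described in the PR/commit messages.
PAD_SLOT_ID = -1   # legacy padding sentinel
NULL_BLOCK_ID = 0  # CUDA graph block table padding after #35431

def is_padding_old(state_idx):
    # Pre-fix guard: only catches PAD_SLOT_ID=-1.
    return state_idx < 0

def is_padding_new(state_idx):
    # Post-fix guard (`<= 0`): catches both sentinels.
    return state_idx <= 0

# The old guard lets NULL_BLOCK_ID through, so a padded entry would
# read/write ssm_state[0] and corrupt a live cache slot.
assert is_padding_old(PAD_SLOT_ID) and not is_padding_old(NULL_BLOCK_ID)
assert is_padding_new(PAD_SLOT_ID) and is_padding_new(NULL_BLOCK_ID)
```

The mirrored store-side guard flips `final_state_idx >= 0` to `final_state_idx > 0` for the same reason.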
@vadiklyutiy DCO is fixed.
MatthewBonanni left a comment:
LGTM, thanks for the fix!
Starting testing.
Per review feedback from @MatthewBonanni — we should only be using NULL_BLOCK_ID for padding block table entries now, so the comments no longer reference PAD_SLOT_ID. Signed-off-by: vibhavagarwal5 <vibhavagarwal5@gmail.com>
@MatthewBonanni fixed the comments, please approve again.

@vadiklyutiy Padding with 0 is correct for the block table, see #35431.

Validated on H200 as best I could, @gaby. Started with the relevant Python tests shared in the PR. After this, the server starts up fine; I submitted 10k requests at 200 concurrency and it looks fine.

Ran again with 500 concurrency and got 9/10 rounds through, but then it hung on a shared memory broadcast and crashed. Seems unrelated though; I'm not seeing anything related to IMA.
…adding (vllm-project#39064)

Signed-off-by: Vibhav Agarwal <vibhavagarwal5@gmail.com>
Co-authored-by: vibhav-agarwal <vibhav.agarwal@glance.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Fixes #39025 — GDN (Gated Delta Network) hybrid models crash with IMA (Illegal Memory Access) errors under CUDA graphs + TP>1 on Blackwell GPUs.
Root cause: Commit `bcc6f6744` (Mar 30) changed CUDA graph block table padding from `fill_(-1)` (PAD_SLOT_ID) to `fill_(NULL_BLOCK_ID=0)`. The FLA SSM kernels guard padded entries with `state_idx < 0`, which catches `-1` but not `0`. This causes padded CUDA graph entries to read/write `ssm_state[0]`, corrupting the state of real sequence 0.

Fix: Change all 5 guard comparisons across both FLA kernels to use `<= 0` / `> 0`:

| File | Before | After |
| --- | --- | --- |
| `fused_sigmoid_gating.py` | `state_idx < 0` | `state_idx <= 0` |
| `fused_sigmoid_gating.py` | `final_state_idx >= 0` | `final_state_idx > 0` |
| `fused_recurrent.py` | `state_idx < 0` | `state_idx <= 0` |
| `fused_recurrent.py` | `final_state_idx >= 0` | `final_state_idx > 0` |
| `fused_recurrent.py` | `state_idx < 0` | `state_idx <= 0` |

Note: `causal_conv1d.py` and `mamba_ssm.py` already use `== null_block_id` and are not affected.

Test plan
- `--enforce-eager` prevents the crash (CUDA graph padding is the trigger)
- The `fused_sigmoid_gating.py` fix alone passes at 100 concurrency but crashes at 200
- `pytest tests/models/test_gdn.py -v` passes

🤖 Generated with Claude Code
Co-authored-by: Claude <noreply@anthropic.com>