[Bugfix] Remove nested torch.compile in GDN rearrange_mixed_qkv causing CUDA graph capture failure #42070
Conversation
Remove the nested `@torch.compile(fullgraph=True)` decorator that triggered Triton autotuning (`torch.cuda.synchronize`) during CUDA graph capture. The method is already compiled by the outer AOT compilation pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Code Review
This pull request removes the `@torch.compile(fullgraph=True)` decorator from the `rearrange_mixed_qkv` method in `vllm/model_executor/layers/mamba/gdn_linear_attn.py`. I have no feedback to provide as there were no review comments to evaluate.
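To make the failure mode concrete, here is a minimal, torch-free sketch of why the nested decorator breaks capture: device synchronization is illegal while a CUDA graph is being captured, and a nested compile that autotunes on first call synchronizes at exactly that moment. All names below (`CaptureState`, `autotuned`, `rearrange`) are hypothetical stand-ins for illustration, not vLLM or PyTorch APIs.

```python
class CaptureState:
    """Stand-in for the CUDA-graph capture flag on the current stream."""
    capturing = False

def synchronize():
    # Real torch.cuda.synchronize() errors out during graph capture.
    if CaptureState.capturing:
        raise RuntimeError("operation not permitted during CUDA graph capture")

def autotuned(fn):
    """Stand-in for a nested torch.compile: autotunes (and therefore
    synchronizes) on the first call, then reuses the tuned kernel."""
    tuned = {"done": False}
    def wrapper(*args):
        if not tuned["done"]:
            synchronize()        # Triton autotuning benchmarks kernels here
            tuned["done"] = True
        return fn(*args)
    return wrapper

@autotuned
def rearrange(x):
    return list(reversed(x))

# First call happens inside capture -> autotune -> synchronize -> failure.
CaptureState.capturing = True
try:
    rearrange([1, 2, 3])
except RuntimeError as err:
    print("capture failed:", err)

# Warming up (autotuning) outside capture first makes later captures safe,
# which is the idea behind the warmup-path fix referenced below.
CaptureState.capturing = False
rearrange([1, 2, 3])
CaptureState.capturing = True
print(rearrange([4, 5, 6]))      # [6, 5, 4]
```

Removing the inner decorator sidesteps the problem entirely: the outer AOT compilation pass already compiles the method, so the nested compile only added a second, capture-unsafe tuning step.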
@tdoublep, a question: the PR that introduced this seems to be able to run Qwen3.5 on B200 (https://buildkite.com/vllm/ci/builds/65043/canvas?jid=019e05df-d38b-4432-b813-5f66b42e419a&tab=output). If I understand correctly, the
Sorry about this, and you have my spiritual approval. When this was introduced, there were some noticeable additional fusions from torch.compile (with an older vLLM, though, and before some other GDN changes), so there might be some lost perf, but of course this should be done to fix the breakage.
Reproduced. And it only happens with spec decoding.
I think the best way is to change the kernel to accept non-contiguous input.
That's why the CI also didn't catch this issue.
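The suggestion above is about kernel input layout: a transposed or sliced tensor is a view with non-unit strides, so a kernel that indexes the underlying buffer linearly either needs a `.contiguous()` copy first or must take strides as arguments. A minimal sketch in NumPy (the same stride semantics apply to torch tensors):

```python
import numpy as np

# A 2x3 row-major array; its transpose is a strided view, not a copy.
a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = a.T                              # shape (3, 2), strides swapped

print(t.flags["C_CONTIGUOUS"])       # False: linear indexing would misread it

# The easy fix (what callers often do today): copy into row-major layout.
c = np.ascontiguousarray(t)
print(c.flags["C_CONTIGUOUS"])       # True
print(np.array_equal(t, c))          # True: same values, new memory layout
```

Teaching the kernel to accept strides directly avoids that extra copy on the hot path, which is the trade-off being proposed.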
@tjtanaa No test group; I was just deploying the model with MTP, similar to the above example. Surprised it is not caught by tests, though.
There is no Qwen3.5 test on CI, it seems. There is only one test group, lm-eval Qwen3.5 (B200), which I had to trigger manually.
Should we rebase this on latest main and rerun the relevant Buildkite jobs? Thanks. |
Could a maintainer please rerun the two failing Buildkite jobs? The PR change is limited to removing one nested `@torch.compile(fullgraph=True)` decorator. Thanks.
[Bugfix] Remove nested torch.compile in GDN rearrange_mixed_qkv causing CUDA graph capture failure (vllm-project#42070)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Summary
- Removes `@torch.compile(fullgraph=True)` from `rearrange_mixed_qkv`, which triggers Triton autotuning (`torch.cuda.synchronize()`) during CUDA graph capture

Reproduction
`vllm serve Qwen/Qwen3.5-35B-A3B --language-model-only --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'`

Fails with:
Related
- Pre-compiling `causal_conv1d` Triton kernels via warmup path

Test plan
- `Qwen/Qwen3.5-35B-A3B` with MTP spec decode starts successfully on GB200

🤖 Generated with Claude Code