[CI][Bugfix] Fix CI Failure Step "Basic Models Tests (Extra Initialization) 1 & 2"#42154
haosdent wants to merge 1 commit into vllm-project:main
Conversation
The `@torch.compile(fullgraph=True)` decorators added in vllm-project#40711 on `prepare_gdn_attention_core_inputs` and `rearrange_mixed_qkv` crash CUDA-graph capture for Qwen3.5 MTP / MoeMTP: Inductor's first-call Triton autotune runs `torch.cuda.synchronize()`, which is illegal during stream capture. The non-spec path is autotuned during eager warmup; the spec path's `mixed_qkv_spec` is `None` during warmup and only becomes a tensor during capture, so autotune fires inside `torch.cuda.graph(...)` and the engine core dies with `cudaErrorStreamCaptureInvalidated`. Removing the decorators fixes the crash. The cat-then-slice bodies were pessimizations without compile fusion, so simplify them to plain `split`/`contiguous`/`view` (and drop the fused round-trip in `prepare_gdn_attention_core_inputs`).

Signed-off-by: haosdent <haosdent@hotmail.com>
Signed-off-by: haosdent <haosdent@gmail.com>
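The eager-mode simplification described above can be sketched roughly as follows. This is a hypothetical stand-in, not vLLM's actual function: the shapes, head counts, and the name `rearrange_mixed_qkv_sketch` are illustrative assumptions.

```python
import torch

def rearrange_mixed_qkv_sketch(mixed_qkv: torch.Tensor,
                               num_heads: int = 4,
                               head_dim: int = 8):
    """Hypothetical sketch of the simplified body: a plain
    split/contiguous/view instead of the old cat-then-slice logic,
    which only paid off under torch.compile fusion."""
    # Split the packed q|k|v tensor along the last dim into three
    # equal chunks of num_heads * head_dim each.
    q, k, v = torch.split(mixed_qkv, num_heads * head_dim, dim=-1)
    # Each split result is a strided view; make it dense before
    # viewing into (tokens, heads, head_dim) for downstream kernels.
    shape = (-1, num_heads, head_dim)
    return (q.contiguous().view(shape),
            k.contiguous().view(shape),
            v.contiguous().view(shape))
```

In eager mode this avoids the extra concatenation round-trip entirely; without a compiler to fuse the cat and the slices, the old body simply did more memory traffic.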
Code Review
This pull request simplifies the `prepare_gdn_attention_core_inputs` and `rearrange_mixed_qkv` methods by removing the `@torch.compile` decorators and replacing complex concatenation-based contiguity logic with more straightforward `reshape` and `contiguous` calls. Feedback suggests explicitly adding `.contiguous()` to the reshaped outputs in `prepare_gdn_attention_core_inputs` to ensure memory layout compatibility with downstream kernels that strictly require contiguous memory.
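The review feedback above is worth illustrating in miniature. Slicing one component out of a packed tensor yields a strided view, not dense memory, so kernels that assume contiguity need an explicit `.contiguous()` copy. The shapes below are hypothetical, chosen only to make the stride effect visible:

```python
import torch

# 2 tokens, packed q|k|v of 8 channels each (hypothetical shapes).
mixed_qkv = torch.randn(2, 24)

# Slicing the middle component keeps the parent's row stride of 24,
# so the view is NOT contiguous (a dense (2, 8) tensor needs stride
# (8, 1)).
k_view = mixed_qkv[:, 8:16]
assert not k_view.is_contiguous()

# .contiguous() materializes a dense copy that strict kernels (e.g.
# Triton kernels indexing raw memory) can consume safely.
k_safe = k_view.contiguous()
assert k_safe.is_contiguous()
assert torch.equal(k_safe, k_view)  # same values, different layout
```

This is why adding `.contiguous()` after the reshapes is cheap insurance: it is a no-op when the tensor is already dense and a copy only when it is not.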
@tpopp @ChuanLi1101 could you help review this? It tries to fix the CI failure "Basic Models Tests (Extra Initialization) x" related to #40711
@vadiklyutiy @tjtanaa can you help with this fix for the CI failure? Completely removing compile doesn't seem great
fixed in #42070
agreed, but it's acceptable to do so to unblock CI; otherwise we will have to revert the PR that caused this
Let me take a look today.
Thanks all, I didn't notice that PR before; let me close mine

yup, this test group is running fine on main now after the other PR's bugfix. https://buildkite.com/vllm/ci/builds/65423/canvas?sid=019e107a-23ca-47ea-bc60-22d1590d15f2
Purpose
Fixes Buildkite #65314 — Basic Models Tests (Extra Initialization).
The `@torch.compile(fullgraph=True)` decorators added by #40711 on `prepare_gdn_attention_core_inputs` and `rearrange_mixed_qkv` crash CUDA-graph capture for Qwen3.5 MTP / MoeMTP: Inductor's first-call Triton autotune calls `torch.cuda.synchronize()`, which is illegal inside stream capture. The non-spec path is autotuned during eager warmup, but `mixed_qkv_spec` is `None` then and only becomes a tensor during capture — so the spec-path autotune fires inside `torch.cuda.graph(...)` → `cudaErrorStreamCaptureInvalidated` → engine core dies.

Drop the two decorators. The cat-then-slice bodies were designed for compile fusion and pessimize eager mode, so simplify to plain `split`/`contiguous`/`view`.

Test Plan
```
pytest tests/models/test_initialization.py::test_can_initialize_large_subset \
  -k 'Qwen3_5MTP or Qwen3_5MoeMTP' -v
```

Test Result
Before: `FAILED ... cudaErrorStreamCaptureInvalidated`
After: `1 passed`

(GB10 / SM12.1, dense `Qwen3_5MTP` only — `Qwen3_5MoeMTP` shares the exact same code path through `qwen3_next.py:503` forward → `gdn_attention_core` → `rearrange_mixed_qkv`.)
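For context, the ordering constraint at the heart of this failure can be sketched: any path that may trigger autotuning must run eagerly, with all of its real inputs, before CUDA-graph capture begins. The gap here was exactly that the spec path never ran during warmup. The helper below (`warm_up_then_capture` is a hypothetical name, not a vLLM or PyTorch API) is a sketch of the general pattern, not the fix this PR takes (which removes the decorators entirely); it falls back to plain eager execution when no GPU is present.

```python
import torch

def warm_up_then_capture(fn, example_inputs):
    """Hypothetical sketch: run fn eagerly first so any lazy
    initialization or Triton autotune (which may call
    torch.cuda.synchronize()) happens OUTSIDE stream capture,
    where synchronize() is illegal."""
    # Eager warmup. Crucially, example_inputs must exercise the same
    # code path (same non-None arguments) as the captured call —
    # the bug fixed here was a spec-path input that was None at
    # warmup time and a real tensor only during capture.
    fn(*example_inputs)

    if not torch.cuda.is_available():
        # No GPU: nothing to capture, just return the eager result.
        return fn(*example_inputs)

    torch.cuda.synchronize()
    graph = torch.cuda.CUDAGraph()
    # Capture: fn must not synchronize or autotune in here.
    with torch.cuda.graph(graph):
        out = fn(*example_inputs)
    graph.replay()
    return out
```

The design point is simply that capture replays a fixed kernel launch sequence, so any first-call side effects must be flushed out beforehand.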