[CI][Bugfix] Fix CI Failure Step "Basic Models Tests (Extra Initialization) 1 & 2"#42154
haosdent wants to merge 1 commit into vllm-project:main
Conversation
The `@torch.compile(fullgraph=True)` decorators added in vllm-project#40711 on `prepare_gdn_attention_core_inputs` and `rearrange_mixed_qkv` crash CUDA-graph capture for Qwen3.5 MTP / MoeMTP: Inductor's first-call Triton autotune runs `torch.cuda.synchronize()`, which is illegal during stream capture. The non-spec path is autotuned during eager warmup; the spec path's `mixed_qkv_spec` is `None` during warmup and only becomes a tensor during capture, so autotune fires inside `torch.cuda.graph(...)` and the engine core dies with `cudaErrorStreamCaptureInvalidated`. Removing the decorators fixes the crash. The cat-then-slice bodies were pessimizations without compile fusion, so simplify them to plain `split`/`contiguous`/`view` (and drop the fused round-trip in `prepare_gdn_attention_core_inputs`).

Signed-off-by: haosdent <haosdent@hotmail.com>
Signed-off-by: haosdent <haosdent@gmail.com>
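The eager-mode simplification described above can be sketched roughly as follows. This is a hypothetical stand-in, not vLLM's actual function: the shapes, head counts, and the name `rearrange_mixed_qkv_sketch` are illustrative assumptions.

```python
import torch

def rearrange_mixed_qkv_sketch(mixed_qkv: torch.Tensor,
                               num_heads: int = 4,
                               head_dim: int = 8):
    """Hypothetical sketch of the simplified body: a plain
    split/contiguous/view instead of the old cat-then-slice logic,
    which only paid off under torch.compile fusion."""
    # Split the packed q|k|v tensor along the last dim into three
    # equal chunks of num_heads * head_dim each.
    q, k, v = torch.split(mixed_qkv, num_heads * head_dim, dim=-1)
    # Each split result is a strided view; make it dense before
    # viewing into (tokens, heads, head_dim) for downstream kernels.
    shape = (-1, num_heads, head_dim)
    return (q.contiguous().view(shape),
            k.contiguous().view(shape),
            v.contiguous().view(shape))
```

In eager mode this avoids the extra concatenation round-trip entirely; without a compiler to fuse the cat and the slices, the old body simply did more memory traffic.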
Code Review
This pull request simplifies the `prepare_gdn_attention_core_inputs` and `rearrange_mixed_qkv` methods by removing the `@torch.compile` decorators and replacing complex concatenation-based contiguity logic with more straightforward `reshape` and `contiguous` calls. Feedback suggests explicitly adding `.contiguous()` to the reshaped outputs in `prepare_gdn_attention_core_inputs` to ensure memory layout compatibility with downstream kernels that strictly require contiguous memory.
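The review feedback above is worth illustrating in miniature. Slicing one component out of a packed tensor yields a strided view, not dense memory, so kernels that assume contiguity need an explicit `.contiguous()` copy. The shapes below are hypothetical, chosen only to make the stride effect visible:

```python
import torch

# 2 tokens, packed q|k|v of 8 channels each (hypothetical shapes).
mixed_qkv = torch.randn(2, 24)

# Slicing the middle component keeps the parent's row stride of 24,
# so the view is NOT contiguous (a dense (2, 8) tensor needs stride
# (8, 1)).
k_view = mixed_qkv[:, 8:16]
assert not k_view.is_contiguous()

# .contiguous() materializes a dense copy that strict kernels (e.g.
# Triton kernels indexing raw memory) can consume safely.
k_safe = k_view.contiguous()
assert k_safe.is_contiguous()
assert torch.equal(k_safe, k_view)  # same values, different layout
```

This is why adding `.contiguous()` after the reshapes is cheap insurance: it is a no-op when the tensor is already dense and a copy only when it is not.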
@tpopp @ChuanLi1101 could you help review this? It tries to fix the CI failure "Basic Models Tests (Extra Initialization) x" related to #40711
@vadiklyutiy @tjtanaa can you help with this fix for the CI failure? Completely removing compile doesn't seem great
fixed in #42070
agreed, but it's acceptable to do so to unblock CI; otherwise we will have to revert the PR that caused this
Let me take a look today.
Thanks all, I didn't notice that PR before; let me close mine

yup, this test group is running fine on main now after the other PR's bugfix. https://buildkite.com/vllm/ci/builds/65423/canvas?sid=019e107a-23ca-47ea-bc60-22d1590d15f2
Purpose
Fixes Buildkite #65314 — Basic Models Tests (Extra Initialization).
The `@torch.compile(fullgraph=True)` decorators added by #40711 on `prepare_gdn_attention_core_inputs` and `rearrange_mixed_qkv` crash CUDA-graph capture for Qwen3.5 MTP / MoeMTP: Inductor's first-call Triton autotune calls `torch.cuda.synchronize()`, which is illegal inside stream capture. The non-spec path is autotuned during eager warmup, but `mixed_qkv_spec` is `None` then and only becomes a tensor during capture — so the spec-path autotune fires inside `torch.cuda.graph(...)` → `cudaErrorStreamCaptureInvalidated` → engine core dies.

Drop the two decorators. The cat-then-slice bodies were designed for compile fusion and pessimize eager mode, so simplify to plain `split`/`contiguous`/`view`.

Test Plan
```
pytest tests/models/test_initialization.py::test_can_initialize_large_subset \
  -k 'Qwen3_5MTP or Qwen3_5MoeMTP' -v
```

Test Result
Before: `FAILED ... cudaErrorStreamCaptureInvalidated`
After: `1 passed`

(GB10 / SM12.1, dense `Qwen3_5MTP` only — `Qwen3_5MoeMTP` shares the exact same code path through `qwen3_next.py:503` forward → `gdn_attention_core` → `rearrange_mixed_qkv`.)
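For context, the ordering constraint at the heart of this failure can be sketched: any path that may trigger autotuning must run eagerly, with all of its real inputs, before CUDA-graph capture begins. The gap here was exactly that the spec path never ran during warmup. The helper below (`warm_up_then_capture` is a hypothetical name, not a vLLM or PyTorch API) is a sketch of the general pattern, not the fix this PR takes (which removes the decorators entirely); it falls back to plain eager execution when no GPU is present.

```python
import torch

def warm_up_then_capture(fn, example_inputs):
    """Hypothetical sketch: run fn eagerly first so any lazy
    initialization or Triton autotune (which may call
    torch.cuda.synchronize()) happens OUTSIDE stream capture,
    where synchronize() is illegal."""
    # Eager warmup. Crucially, example_inputs must exercise the same
    # code path (same non-None arguments) as the captured call —
    # the bug fixed here was a spec-path input that was None at
    # warmup time and a real tensor only during capture.
    fn(*example_inputs)

    if not torch.cuda.is_available():
        # No GPU: nothing to capture, just return the eager result.
        return fn(*example_inputs)

    torch.cuda.synchronize()
    graph = torch.cuda.CUDAGraph()
    # Capture: fn must not synchronize or autotune in here.
    with torch.cuda.graph(graph):
        out = fn(*example_inputs)
    graph.replay()
    return out
```

The design point is simply that capture replays a fixed kernel launch sequence, so any first-call side effects must be flushed out beforehand.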