[CI][Bugfix] Fix failure CI step "PyTorch Fullgraph Smoke Test" #41953

vllm-bot merged 3 commits into vllm-project:main from
Conversation
…child

`spawn_new_process_for_each_test` uses cloudpickle to send the test function to the child subprocess. Cloudpickle cannot resolve the inner `f` by reference because the decorator hides it behind the wrapper (`module.test_foo` is the wrapper, not `f`), so cloudpickle falls back to by-value pickling. By-value pickling of a function pickles its `__globals__` dict, which serializes module-level singletons like `vllm.compilation.counter.compilation_counter` as `NEWOBJ + state`, producing fresh clones in the child. As a result, `VllmBackend` in the child increments the real singleton, but the test's reference is the cloned copy that never sees the increment, so every `compilation_counter.expect(num_graphs_seen=N)` block fails with "before 0, after 0".

The fix sends `f.__module__` + `f.__qualname__` instead. The child re-imports the module (running normal vLLM init, building singletons once) and walks the qualname via `getattr`. The wrapper short-circuits via the `VLLM_TEST_SPAWN_CHILD` env var so the child's resolution doesn't re-spawn. Args/kwargs are still cloudpickled; pytest fixtures (`MonkeyPatch`, `tmp_path`, parametrize values) don't reference singletons, so by-value pickling there is harmless.

Verified locally: `compile/fullgraph/test_toy_llama.py` (3/3), `test_multiple_graphs.py` (4/4), `test_simple.py` (4/5; the remaining `[False-inductor]` failure is a separate AOT codegen issue, unrelated to this regression).

Signed-off-by: haosdent <haosdent@hotmail.com>
Signed-off-by: haosdent <haosdent@gmail.com>
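A minimal sketch of the resolution scheme described above. The helper names here are illustrative, not the PR's actual code; the idea is that the child re-imports the module (so module-level singletons are built exactly once, shared with everything else in that process) and then walks the qualified name with `getattr` instead of unpickling a by-value copy of the function:

```python
import importlib
import os

def resolve_by_qualname(module_name: str, qualname: str):
    """Re-import the module, then walk the qualified name via getattr.

    Importing runs the module's normal top-level init, so singletons like
    compilation_counter are created once and shared, never cloned.
    """
    obj = importlib.import_module(module_name)
    for part in qualname.split("."):
        obj = getattr(obj, part)
    return obj

def run_in_child(module_name: str, qualname: str, args, kwargs):
    # Guard so that resolving the wrapped test in the child does not
    # trigger another spawn (env-var name taken from the PR text).
    os.environ["VLLM_TEST_SPAWN_CHILD"] = "1"
    fn = resolve_by_qualname(module_name, qualname)
    return fn(*args, **kwargs)
```

The `getattr` walk is what lets this handle nested names (e.g. a test method on a class), which a plain `getattr(module, name)` would not.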
Code Review
This pull request updates the `spawn_new_process_for_each_test` decorator in `tests/utils.py` to resolve test functions via module import and qualified-name lookup rather than direct pickling. This change prevents module-level singletons from being cloned as stale values in the child process. Additionally, it introduces a guard variable to prevent recursive process spawning and refines the error handling for capturing tracebacks from failed subprocesses. I have no feedback to provide.
@dzhengAP Please take a look when you are available, thank you in advance!

Hi @haosdent, 👋 Friendly tip: for active projects like vLLM, it's highly recommended to open an issue first to provide context. Also, to speed up the review, make sure you've reproduced the issue and fully tested your code locally before pinging a reviewer.

Thanks @SoluMilken, I will create an issue next time. Previously, I thought a CI failure didn't need a ticket.
…d cross-variant collision
`test_simple_piecewise_compile` parametrizes over `intermediate_unbacked`
(`True`/`False`), which gates a control-flow branch in
`SillyModel.forward`:
```python
if self.intermediate_unbacked:
    u0 = foo(x)
    ones = x.new_ones(x.shape[0], u0).sum(-1) / 3
    x = x * ones
```
The AOT-compile cache key (`_model_hash_key` in
`vllm/compilation/decorators.py`) only hashes
`vllm.__version__ + fn.__qualname__ + fn.__code__.co_firstlineno` — none
of which capture per-instance attributes. Both variants therefore share
the same cache slot, but Dynamo traces them differently. Whichever runs
first persists its compiled Triton kernel; the other loads that artifact
and crashes with `CUDA illegal memory access` (CI surfaced this as
`WARP_ILLEGAL_ADDRESS` in `triton_poi_fused_add_0`).
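The collision is easy to demonstrate. The sketch below is an illustrative reconstruction (not vLLM's actual `_model_hash_key` implementation) built from the three inputs the text lists; because the key only sees the function object, two instances whose `forward` takes different branches still hash identically:

```python
import hashlib

def model_hash_key(fn, version="0.0.0"):
    # Mirrors the inputs described above: version + qualname + first line.
    raw = version + fn.__qualname__ + str(fn.__code__.co_firstlineno)
    return hashlib.sha256(raw.encode()).hexdigest()

class SillyModel:
    def __init__(self, intermediate_unbacked: bool):
        self.intermediate_unbacked = intermediate_unbacked

    def forward(self, x):
        if self.intermediate_unbacked:
            x = x * 2  # stand-in for the unbacked-symint branch
        return x

a = SillyModel(True)
b = SillyModel(False)
# Different traces, same cache slot:
assert model_hash_key(a.forward) == model_hash_key(b.forward)
```

Whichever variant compiles first therefore owns the cached artifact, and the other loads a kernel traced for the wrong branch.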
The sibling tests (`test_toy_llama`, `test_multi_graph_piecewise_compile`,
`test_simple_inductor_graph_partition`) already
`monkeypatch.setenv("VLLM_DISABLE_COMPILE_CACHE", "1")` for the same
reason; this brings `test_simple_piecewise_compile` in line.
The underlying vLLM AOT-cache hash is too coarse for any model whose
forward branches on an `__init__` arg, but tightening it is non-trivial
(which attrs count?) and risks breaking legitimate cache hits in
production. Tracking that as a separate item.
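For illustration only, one hypothetical direction for the deferred tightening would be to fold an explicit, caller-chosen set of per-instance attributes into the key. This is not the PR's fix (the PR disables the cache in the test), and choosing which attributes "count" is exactly the open question noted above:

```python
import hashlib

def tighter_hash_key(model, fn, attrs, version="0.0.0"):
    # Hypothetical: extend the coarse key (version, qualname, first line)
    # with named instance attributes that gate control flow in forward().
    parts = [version, fn.__qualname__, str(fn.__code__.co_firstlineno)]
    parts += [f"{name}={getattr(model, name)!r}" for name in sorted(attrs)]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```

The risk the text mentions is the flip side of this sketch: any attribute added to the key splits cache entries, so an over-broad attribute list would turn legitimate production cache hits into misses.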
Verified: all 5 `test_simple.py` cases pass in a single pytest run; the
full 8-test set from CI build #64888 is green.
Signed-off-by: haosdent <haosdent@hotmail.com>
Signed-off-by: haosdent <haosdent@gmail.com>
@ProExpertProg PTAL, thank you

Whoever merges extra commits from main, please make sure the cudagraph, v1-others-cpu, pytorch, and v1-sample-plus-logits tests all run.

@haosdent the PyTorch failure on the previous commit seems related?

@ProExpertProg sorry, which failure do you mean? Currently there are multiple PyTorch failures on the main branch.

Yeah, it was the other one I think.
…-project#41953)

Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Libin Tang <libin.tang@intel.com>
Purpose

Fixes #41960 — 8 fullgraph failures introduced by #41423 (CI build #64888):

Commit 1 — 7 counter-assertion failures.
`spawn_new_process_for_each_test` cloudpickles `f`; the decorator hides `f` behind the wrapper, so cloudpickle falls back to by-value pickling, serializing `f.__globals__` and turning singletons like `compilation_counter` into stale clones. Send `f.__module__` + `f.__qualname__` instead; the child re-imports and resolves via `getattr`. The `VLLM_TEST_SPAWN_CHILD` env var prevents re-spawning. Args/kwargs are still cloudpickled.

Commit 2 — 1 CUDA crash.
`test_simple_piecewise_compile` parametrizes over `intermediate_unbacked` (gates a branch in `SillyModel.forward`), but the AOT cache key doesn't include per-instance attrs, so both variants share one cache slot and clobber each other. Add `VLLM_DISABLE_COMPILE_CACHE=1`, matching its sibling tests.

Independent of #41895 (XPU/ROCm `mp.set_start_method`).

Test Plan

Test Result

8/8 originally-failing tests pass locally (NVIDIA GB10). Full `test_simple.py` and `utils_/test_spawn_decorator.py` also green.