
[CI][Bugfix] Fix failure CI step "PyTorch Fullgraph Smoke Test"#41953

Merged
vllm-bot merged 3 commits into vllm-project:main from haosdent:ci-d145ce3a
May 8, 2026

Conversation

@haosdent
Contributor

@haosdent haosdent commented May 7, 2026

Purpose

Fixes #41960 — 8 fullgraph failures introduced by #41423
(CI build #64888):

AssertionError: num_graphs_seen not as expected, before it is 0, after it is 0, ...
CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS in triton_poi_fused_add_0

Commit 1 — 7 counter-assertion failures. spawn_new_process_for_each_test cloudpickles f; the decorator hides f behind the wrapper so cloudpickle falls back to by-value pickling, serializing f.__globals__ and turning singletons like compilation_counter into stale clones. Send f.__module__ + f.__qualname__ instead; child re-imports and resolves via getattr. VLLM_TEST_SPAWN_CHILD env var prevents re-spawning. Args/kwargs are still cloudpickled.
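The spawn-and-guard flow described above can be sketched as follows (an illustrative shape only, not the code in tests/utils.py; the real decorator also forwards cloudpickled args/kwargs to the child):

```python
import functools
import os
import subprocess
import sys

SPAWN_CHILD_ENV = "VLLM_TEST_SPAWN_CHILD"  # env-var name from this PR

def spawn_new_process_for_each_test(f):
    """Illustrative sketch: the parent re-runs the test in a fresh
    interpreter identified only by (f.__module__, f.__qualname__);
    the env var stops the child from spawning yet another process."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        if os.environ.get(SPAWN_CHILD_ENV) == "1":
            # Child process: run the real test body directly.
            return f(*args, **kwargs)
        # Parent: send only the names, never the function object itself.
        child_code = (
            "import functools, importlib;"
            f"m = importlib.import_module({f.__module__!r});"
            f"fn = functools.reduce(getattr, {f.__qualname__!r}.split('.'), m);"
            "fn()"
        )
        env = {**os.environ, SPAWN_CHILD_ENV: "1"}
        subprocess.run([sys.executable, "-c", child_code], env=env, check=True)
    return wrapper
```

Setting `VLLM_TEST_SPAWN_CHILD=1` before calling the wrapper exercises the child branch directly, which is how the child's own resolution of the decorated name avoids re-spawning.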

Commit 2 — 1 CUDA crash. test_simple_piecewise_compile parametrizes over intermediate_unbacked (gates a branch in SillyModel.forward), but the AOT cache key doesn't include per-instance attrs, so both variants share one cache slot and clobber each other. Add VLLM_DISABLE_COMPILE_CACHE=1, matching its sibling tests.

Independent of #41895 (XPU/ROCm mp.set_start_method).

Test Plan

Test Result

8/8 originally-failing tests pass locally (NVIDIA GB10). Full test_simple.py and utils_/test_spawn_decorator.py also green.

…child

`spawn_new_process_for_each_test` uses cloudpickle to send the test
function to the child subprocess. Cloudpickle cannot resolve the inner
`f` by reference because the decorator hides it behind the wrapper
(`module.test_foo` is the wrapper, not `f`), so cloudpickle falls back
to by-value pickling. By-value pickling of a function pickles its
`__globals__` dict, which serializes module-level singletons like
`vllm.compilation.counter.compilation_counter` as `NEWOBJ + state`,
producing fresh clones in the child.
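The trigger for the by-value fallback can be reproduced with the stdlib alone: plain pickle refuses to serialize a function whose module-level name now points at the wrapper, which is exactly the situation that pushes cloudpickle into by-value mode (a minimal sketch; the decorator and test names are illustrative):

```python
import functools
import pickle

def spawn_decorator(f):
    """Toy stand-in for spawn_new_process_for_each_test."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

@spawn_decorator
def test_foo():
    return "ok"

inner = test_foo.__wrapped__  # the hidden original f

# Looking up inner.__qualname__ in its module finds the *wrapper*, not
# inner itself, so reference-based pickling fails pickle's identity check.
try:
    pickle.dumps(inner)
    pickled_by_reference = True
except pickle.PicklingError:
    pickled_by_reference = False
```

Where plain pickle raises `PicklingError` here, cloudpickle instead serializes the function by value, dragging `__globals__` (and any singletons it references) along with it.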

Result: VllmBackend in the child increments the real singleton, but
the test's reference is the cloned copy that never sees the increment.
Every `compilation_counter.expect(num_graphs_seen=N)` block fails with
"before 0, after 0".

Send `f.__module__` + `f.__qualname__` instead. The child re-imports
the module (running normal vllm init, building singletons once) and
walks qualname via `getattr`. The wrapper short-circuits via the
`VLLM_TEST_SPAWN_CHILD` env var so the child's resolution doesn't
re-spawn. Args/kwargs are still cloudpickled — pytest fixtures
(`MonkeyPatch`, `tmp_path`, parametrize values) don't reference
singletons, so by-value pickling there is harmless.
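The name-based round trip can be sketched with the stdlib (`resolve_by_name` is a hypothetical helper; the real logic lives in tests/utils.py):

```python
import importlib

def resolve_by_name(module_name: str, qualname: str):
    """Re-import the module (running its normal init exactly once) and
    walk the dotted qualname via getattr, as described above."""
    obj = importlib.import_module(module_name)
    for part in qualname.split("."):
        obj = getattr(obj, part)
    return obj

# Round-trip a stdlib callable through its (__module__, __qualname__) pair:
fn = resolve_by_name("collections", "OrderedDict.fromkeys")
od = fn("ab")  # an OrderedDict with keys "a" and "b"
```

Because the child resolves the object from a freshly imported module, every module-level singleton it touches is the one and only copy in that process.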

Verified locally: `compile/fullgraph/test_toy_llama.py` (3/3),
`test_multiple_graphs.py` (4/4), `test_simple.py` (4/5; the remaining
`[False-inductor]` failure is a separate AOT codegen issue, unrelated
to this regression).

Signed-off-by: haosdent <haosdent@hotmail.com>

Signed-off-by: haosdent <haosdent@gmail.com>
@mergify mergify Bot added the bug Something isn't working label May 7, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the spawn_new_process_for_each_test decorator in tests/utils.py to resolve test functions via module import and qualified name lookup rather than direct pickling. This change prevents module-level singletons from being cloned as stale values in the child process. Additionally, it introduces a guard variable to prevent recursive process spawning and refines the error handling for capturing tracebacks from failed subprocesses. I have no feedback to provide.

@haosdent haosdent changed the title [WIP][Bugfix] spawn_new_process_for_each_test: resolve test fn by name in child [Bugfix] spawn_new_process_for_each_test: resolve test fn by name in child May 7, 2026
@haosdent haosdent changed the title [Bugfix] spawn_new_process_for_each_test: resolve test fn by name in child [CI][Bugfix] spawn_new_process_for_each_test: resolve test fn by name in child May 7, 2026
@haosdent haosdent marked this pull request as ready for review May 7, 2026 12:31

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@haosdent haosdent changed the title [CI][Bugfix] spawn_new_process_for_each_test: resolve test fn by name in child [CI][Bugfix] Fix failure CI step "PyTorch Fullgraph Smoke Test" May 7, 2026
@haosdent
Contributor Author

haosdent commented May 7, 2026

@dzhengAP Please take a look when you are available, thank you in advance!

@SoluMilken
Contributor

Hi @haosdent,

👋 Friendly tip: For active projects like vLLM, it's highly recommended to open an issue first to provide context. Also, to speed up the review, make sure you've reproduced the issue and fully tested your code locally before pinging a reviewer.

@haosdent
Contributor Author

haosdent commented May 7, 2026

Thanks @SoluMilken, I will create an issue next time. Previously, I thought a CI failure didn't need a ticket.
For the test part, I verified locally before, but to keep the PR description short I didn't include it; let me add it now.

…d cross-variant collision

`test_simple_piecewise_compile` parametrizes over `intermediate_unbacked`
(`True`/`False`), which gates a control-flow branch in
`SillyModel.forward`:

    if self.intermediate_unbacked:
        u0 = foo(x)
        ones = x.new_ones(x.shape[0], u0).sum(-1) / 3
        x = x * ones

The AOT-compile cache key (`_model_hash_key` in
`vllm/compilation/decorators.py`) only hashes
`vllm.__version__ + fn.__qualname__ + fn.__code__.co_firstlineno` — none
of which capture per-instance attributes. Both variants therefore share
the same cache slot, but Dynamo traces them differently. Whichever runs
first persists its compiled Triton kernel; the other loads that artifact
and crashes with `CUDA illegal memory access` (CI surfaced this as
`WARP_ILLEGAL_ADDRESS` in `triton_poi_fused_add_0`).
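The collision is easy to see in a stripped-down model of the key (a sketch of the shape described above, with `vllm.__version__` stubbed out; the real `_model_hash_key` lives in `vllm/compilation/decorators.py`):

```python
import hashlib

VLLM_VERSION = "0.0.0-stub"  # stand-in for vllm.__version__

def model_hash_key(fn) -> str:
    """Simplified shape of _model_hash_key: version + qualname + first
    line number -- nothing per-instance ever enters the key."""
    payload = f"{VLLM_VERSION}-{fn.__qualname__}-{fn.__code__.co_firstlineno}"
    return hashlib.sha256(payload.encode()).hexdigest()

class SillyModel:
    def __init__(self, intermediate_unbacked: bool):
        # Gates a branch in forward(), but is invisible to the cache key.
        self.intermediate_unbacked = intermediate_unbacked

    def forward(self, x):
        if self.intermediate_unbacked:
            x = x * 2
        return x + 1

a = SillyModel(True)
b = SillyModel(False)
# Both parametrizations land in the same cache slot, even though their
# traced forward() behavior differs:
same_slot = model_hash_key(a.forward) == model_hash_key(b.forward)
```

With one shared slot, whichever variant compiles first wins the cache, and the other loads an artifact compiled for the wrong branch.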

The sibling tests (`test_toy_llama`, `test_multi_graph_piecewise_compile`,
`test_simple_inductor_graph_partition`) already
`monkeypatch.setenv("VLLM_DISABLE_COMPILE_CACHE", "1")` for the same
reason; this brings `test_simple_piecewise_compile` in line.
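What `monkeypatch.setenv` buys those tests can be illustrated with a stdlib-only equivalent (a sketch; pytest's fixture additionally scopes the restore to the test's lifetime automatically):

```python
import os
from contextlib import contextmanager

@contextmanager
def scoped_env(name: str, value: str):
    """Stdlib illustration of monkeypatch.setenv: set an env var for the
    duration of the block, then restore the previous state."""
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if old is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = old

with scoped_env("VLLM_DISABLE_COMPILE_CACHE", "1"):
    in_scope = os.environ["VLLM_DISABLE_COMPILE_CACHE"]
```

Scoping the variable per test keeps the cache-disable from leaking into other tests in the same session.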

The underlying vLLM AOT-cache hash is too coarse for any model whose
forward branches on an `__init__` arg, but tightening it is non-trivial
(which attrs count?) and risks breaking legitimate cache hits in
production. Tracking that as a separate item.

Verified: all 5 `test_simple.py` cases pass in a single pytest run; the
full 8-test set from CI build #64888 is green.

Signed-off-by: haosdent <haosdent@hotmail.com>

Signed-off-by: haosdent <haosdent@gmail.com>
@jeejeelee
Collaborator

@ProExpertProg PTAL, thank you

@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label May 7, 2026
@ProExpertProg
Collaborator

Whoever merges extra commits from main, please make sure the cudagraph, v1-others-cpu, pytorch, and v1-sample-plus-logits tests all run.

@ProExpertProg ProExpertProg enabled auto-merge (squash) May 7, 2026 16:35
@ProExpertProg ProExpertProg disabled auto-merge May 7, 2026 20:36
@ProExpertProg
Collaborator

@haosdent the PyTorch failure on previous commit seems related?

@vllm-bot vllm-bot merged commit 5f6a028 into vllm-project:main May 8, 2026
12 of 15 checks passed
@haosdent
Contributor Author

haosdent commented May 8, 2026

the PyTorch failure on previous commit seems related?

@ProExpertProg sorry, which failure do you mean? There are currently multiple PyTorch failures on the main branch.

@ProExpertProg
Collaborator

Yeah it was the other one I think


Labels

bug: Something isn't working
ready: ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI Failure]: PyTorch Fullgraph Smoke Test — counter assertions and AOT cache collision in tests/compile/fullgraph/

6 participants