[CI][Bugfix] Fix failure CI step "PyTorch Fullgraph Smoke Test" #41953

vllm-bot merged 3 commits into vllm-project:main from
Conversation
…child

`spawn_new_process_for_each_test` uses cloudpickle to send the test function to the child subprocess. Cloudpickle cannot resolve the inner `f` by reference because the decorator hides it behind the wrapper (`module.test_foo` is the wrapper, not `f`), so cloudpickle falls back to by-value pickling. By-value pickling of a function pickles its `__globals__` dict, which serializes module-level singletons like `vllm.compilation.counter.compilation_counter` as `NEWOBJ + state`, producing fresh clones in the child. As a result, `VllmBackend` in the child increments the real singleton, but the test's reference is the cloned copy that never sees the increment, so every `compilation_counter.expect(num_graphs_seen=N)` block fails with "before 0, after 0".

The fix sends `f.__module__` + `f.__qualname__` instead. The child re-imports the module (running normal vLLM init, building singletons once) and walks the qualname via `getattr`. The wrapper short-circuits via the `VLLM_TEST_SPAWN_CHILD` env var so the child's resolution doesn't re-spawn. Args/kwargs are still cloudpickled; pytest fixtures (`MonkeyPatch`, `tmp_path`, parametrize values) don't reference singletons, so by-value pickling there is harmless.

Verified locally: `compile/fullgraph/test_toy_llama.py` (3/3), `test_multiple_graphs.py` (4/4), `test_simple.py` (4/5; the remaining `[False-inductor]` failure is a separate AOT codegen issue, unrelated to this regression).

Signed-off-by: haosdent <haosdent@hotmail.com>
Signed-off-by: haosdent <haosdent@gmail.com>
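A minimal sketch of the resolution scheme described above. The helper names here are illustrative, not the PR's actual code; the idea is that the child re-imports the module (so module-level singletons are built exactly once, shared with everything else in that process) and then walks the qualified name with `getattr` instead of unpickling a by-value copy of the function:

```python
import importlib
import os

def resolve_by_qualname(module_name: str, qualname: str):
    """Re-import the module, then walk the qualified name via getattr.

    Importing runs the module's normal top-level init, so singletons like
    compilation_counter are created once and shared, never cloned.
    """
    obj = importlib.import_module(module_name)
    for part in qualname.split("."):
        obj = getattr(obj, part)
    return obj

def run_in_child(module_name: str, qualname: str, args, kwargs):
    # Guard so that resolving the wrapped test in the child does not
    # trigger another spawn (env-var name taken from the PR text).
    os.environ["VLLM_TEST_SPAWN_CHILD"] = "1"
    fn = resolve_by_qualname(module_name, qualname)
    return fn(*args, **kwargs)
```

The `getattr` walk is what lets this handle nested names (e.g. a test method on a class), which a plain `getattr(module, name)` would not.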
Code Review
This pull request updates the `spawn_new_process_for_each_test` decorator in `tests/utils.py` to resolve test functions via module import and qualified-name lookup rather than direct pickling. This change prevents module-level singletons from being cloned as stale values in the child process. Additionally, it introduces a guard variable to prevent recursive process spawning and refines the error handling for capturing tracebacks from failed subprocesses. I have no feedback to provide.
@dzhengAP Please take a look when you are available, thank you in advance!

Hi @haosdent, 👋 Friendly tip: for active projects like vLLM, it's highly recommended to open an issue first to provide context. Also, to speed up the review, make sure you've reproduced the issue and fully tested your code locally before pinging a reviewer.

Thanks @SoluMilken, I will create an issue next time. Previously, I thought a CI failure didn't need a ticket.
…d cross-variant collision
`test_simple_piecewise_compile` parametrizes over `intermediate_unbacked`
(`True`/`False`), which gates a control-flow branch in
`SillyModel.forward`:
```python
if self.intermediate_unbacked:
    u0 = foo(x)
    ones = x.new_ones(x.shape[0], u0).sum(-1) / 3
    x = x * ones
```
The AOT-compile cache key (`_model_hash_key` in
`vllm/compilation/decorators.py`) only hashes
`vllm.__version__ + fn.__qualname__ + fn.__code__.co_firstlineno` — none
of which capture per-instance attributes. Both variants therefore share
the same cache slot, but Dynamo traces them differently. Whichever runs
first persists its compiled Triton kernel; the other loads that artifact
and crashes with `CUDA illegal memory access` (CI surfaced this as
`WARP_ILLEGAL_ADDRESS` in `triton_poi_fused_add_0`).
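The collision is easy to demonstrate. The sketch below is an illustrative reconstruction (not vLLM's actual `_model_hash_key` implementation) built from the three inputs the text lists; because the key only sees the function object, two instances whose `forward` takes different branches still hash identically:

```python
import hashlib

def model_hash_key(fn, version="0.0.0"):
    # Mirrors the inputs described above: version + qualname + first line.
    raw = version + fn.__qualname__ + str(fn.__code__.co_firstlineno)
    return hashlib.sha256(raw.encode()).hexdigest()

class SillyModel:
    def __init__(self, intermediate_unbacked: bool):
        self.intermediate_unbacked = intermediate_unbacked

    def forward(self, x):
        if self.intermediate_unbacked:
            x = x * 2  # stand-in for the unbacked-symint branch
        return x

a = SillyModel(True)
b = SillyModel(False)
# Different traces, same cache slot:
assert model_hash_key(a.forward) == model_hash_key(b.forward)
```

Whichever variant compiles first therefore owns the cached artifact, and the other loads a kernel traced for the wrong branch.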
The sibling tests (`test_toy_llama`, `test_multi_graph_piecewise_compile`,
`test_simple_inductor_graph_partition`) already
`monkeypatch.setenv("VLLM_DISABLE_COMPILE_CACHE", "1")` for the same
reason; this brings `test_simple_piecewise_compile` in line.
The underlying vLLM AOT-cache hash is too coarse for any model whose
forward branches on an `__init__` arg, but tightening it is non-trivial
(which attrs count?) and risks breaking legitimate cache hits in
production. Tracking that as a separate item.
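For illustration only, one hypothetical direction for the deferred tightening would be to fold an explicit, caller-chosen set of per-instance attributes into the key. This is not the PR's fix (the PR disables the cache in the test), and choosing which attributes "count" is exactly the open question noted above:

```python
import hashlib

def tighter_hash_key(model, fn, attrs, version="0.0.0"):
    # Hypothetical: extend the coarse key (version, qualname, first line)
    # with named instance attributes that gate control flow in forward().
    parts = [version, fn.__qualname__, str(fn.__code__.co_firstlineno)]
    parts += [f"{name}={getattr(model, name)!r}" for name in sorted(attrs)]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```

The risk the text mentions is the flip side of this sketch: any attribute added to the key splits cache entries, so an over-broad attribute list would turn legitimate production cache hits into misses.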
Verified: all 5 `test_simple.py` cases pass in a single pytest run; the
full 8-test set from CI build #64888 is green.
Signed-off-by: haosdent <haosdent@hotmail.com>
Signed-off-by: haosdent <haosdent@gmail.com>
@ProExpertProg PTAL, thank you

Whoever merges extra commits from main, please make sure the cudagraph, v1-others-cpu, pytorch, and v1-sample-plus-logits tests all run.

@haosdent the PyTorch failure on the previous commit seems related?

@ProExpertProg sorry, which failure do you mean? Currently there are multiple PyTorch failures on the main branch.

Yeah, it was the other one I think.
…-project#41953)

Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Libin Tang <libin.tang@intel.com>
Purpose

Fixes #41960 — 8 fullgraph failures introduced by #41423 (CI build #64888):

Commit 1 — 7 counter-assertion failures.
`spawn_new_process_for_each_test` cloudpickles `f`; the decorator hides `f` behind the wrapper, so cloudpickle falls back to by-value pickling, serializing `f.__globals__` and turning singletons like `compilation_counter` into stale clones. Send `f.__module__` + `f.__qualname__` instead; the child re-imports and resolves via `getattr`. The `VLLM_TEST_SPAWN_CHILD` env var prevents re-spawning. Args/kwargs are still cloudpickled.

Commit 2 — 1 CUDA crash.
`test_simple_piecewise_compile` parametrizes over `intermediate_unbacked` (gates a branch in `SillyModel.forward`), but the AOT cache key doesn't include per-instance attrs, so both variants share one cache slot and clobber each other. Add `VLLM_DISABLE_COMPILE_CACHE=1`, matching its sibling tests.

Independent of #41895 (XPU/ROCm `mp.set_start_method`).

Test Plan

Test Result

8/8 originally-failing tests pass locally (NVIDIA GB10). Full `test_simple.py` and `utils_/test_spawn_decorator.py` also green.