[CI][Bugfix] Make test_gpt2_cache_hit observable across V1 EngineCore #42037
ZJY0516 merged 1 commit into vllm-project:main
Conversation
Code Review
This pull request refactors the test_gpt2_cache_hit test to use the compilation_counter utility for verifying AOT compilation and cache hits, replacing the previous manual symbol-counting approach. It also disables V1 multiprocessing for this test to allow the singleton counter to function correctly in-process and includes explicit cleanup of the activation registry and distributed environment to prevent state leakage. The review feedback highlights that the fresh_vllm_cache fixture is unused in the test body and notes that clearing the private _ACTIVATION_REGISTRY._dict is a fragile operation that relies on implementation details.
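For context, a minimal sketch of the in-process counter check this summary describes, assuming `compilation_counter` lives in `vllm.compilation.counter` and that `num_backend_compilations` is one of its integer fields (the exact field name is an assumption, not confirmed by this PR):

```python
from vllm import LLM
from vllm.compilation.counter import compilation_counter

# With V1 multiprocessing disabled (VLLM_ENABLE_V1_MULTIPROCESSING=0), the
# compile runs in this process, so the singleton counter is observable here.
before = compilation_counter.num_backend_compilations  # field name assumed
llm = LLM(model="gpt2")  # triggers AOT compilation on a cold cache
after = compilation_counter.num_backend_compilations
assert after > before  # phase 1: a fresh AOT compile happened
```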
```diff
 @pytest.mark.skipif(not is_torch_equal_or_newer("2.10.0"), reason="requires torch 2.10")
 @create_new_process_for_each_test("spawn")
-def test_gpt2_cache_hit(monkeypatch: pytest.MonkeyPatch):
+def test_gpt2_cache_hit(monkeypatch: pytest.MonkeyPatch, fresh_vllm_cache: str):
```
The fresh_vllm_cache fixture is added to the test signature but is not explicitly used within the test body. While it likely performs necessary setup (like setting VLLM_CACHE_ROOT and clearing the directory) via its side effects, it is generally better practice to use the fixture's value if it provides the path, or at least document that it's relied upon for environment setup. If fresh_vllm_cache is intended to replace the manual VLLM_CACHE_ROOT setting seen in the previous version, ensure that the fixture indeed sets this environment variable globally for the test process.
This is taken from test_aot_counters_on_save_and_load; I would keep it.
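For reference, the rough shape such a fixture usually takes; this is a guess at its behavior based on the review comment above, not the actual definition in the repo:

```python
import pytest


@pytest.fixture
def fresh_vllm_cache(tmp_path, monkeypatch) -> str:
    # Point vLLM at an empty per-test cache directory so phase 1 always
    # starts cold; the test relies purely on this side effect.
    cache_root = tmp_path / "vllm_cache"
    monkeypatch.setenv("VLLM_CACHE_ROOT", str(cache_root))
    return str(cache_root)
```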
```python
# otherwise mutates compilation_config.disabled_custom_ops between
# phase 1 and phase 2 and changes the AOT cache hash.
del llm_model
vllm.model_executor.layers.activation._ACTIVATION_REGISTRY._dict.clear()
```
Accessing and clearing _ACTIVATION_REGISTRY._dict is a fragile operation as it relies on private implementation details of the LazyDict class and the activation module. While this is a test and the PR description explains why it's necessary (to prevent state leakage affecting the AOT cache hash), this suggests that the CompilationConfig or the AOT hashing logic might be overly sensitive to global state. Consider if there's a more robust way to ensure a clean state between the two LLM instantiations, or at least add a comment explaining that this is a workaround for global state leakage in the activation registry.
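One way to make that cleanup less fragile, sketched under the assumption that `_ACTIVATION_REGISTRY._dict` remains a plain dict (the same private details the test already touches, so still a workaround rather than a fix):

```python
import contextlib

import vllm.model_executor.layers.activation as activation


@contextlib.contextmanager
def preserved_activation_registry():
    # Snapshot the registry before the first LLM instantiation and restore
    # it afterwards, so phase 2 hashes against the same state as phase 1.
    saved = dict(activation._ACTIVATION_REGISTRY._dict)
    try:
        yield
    finally:
        activation._ACTIVATION_REGISTRY._dict.clear()
        activation._ACTIVATION_REGISTRY._dict.update(saved)
```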
force-pushed from 507c094 to 1cc211d
```
Compiling gpt2 twice must produce a fresh AOT compile the first time
and a cache load (zero compiles) the second time.

Forces VLLM_ENABLE_V1_MULTIPROCESSING=0 so the EngineCore runs in this
```
Are you sure we need to set VLLM_ENABLE_V1_MULTIPROCESSING=0 here? It's been running fine under multiproc for a while before this.
It looks related to my previous fix #41943.
Before #41943, the module was lazily loaded and the monkey-patch ran before CUDA initialization, so it could observe the counter increments.
Old path (cloudpickle):

```
child starts → deserialize test func (lazy) → monkey-patch → LLM(...)
→ CUDA init happens here → _maybe_force_spawn: no prior CUDA → in-process
→ compile visible to patch → counter increments ✓
```
After #41943, the module is imported eagerly, which triggers CUDA init first; the monkey-patch only runs after that, so the failure happens.
New path (importlib):

```
child starts → import_module("tests.compile.test_aot_compile")
→ top-level import chain triggers CUDA init → monkey-patch → LLM(...)
→ _maybe_force_spawn: CUDA already init'd → spawns EngineCore subprocess
→ compile in subprocess, patch in parent → counter stays 0 ✗
```
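The fork-vs-spawn decision above hinges on whether CUDA is already initialized in the parent, which a test can observe directly; a small debugging aid, not part of the PR:

```python
import torch

# True once the eager import chain has touched CUDA; at that point
# _maybe_force_spawn will put the EngineCore in its own subprocess.
print(torch.cuda.is_initialized())
```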
I have updated the description, thanks @ZJY0516.
I think we want to make sure these tests run with multiproc enabled, which is aligned with real-world usage.
I see, let me check whether I can track the counter without disabling multiprocessing.
Changed to keep the multiprocess path. @ZJY0516, could you take another look? Thank you in advance!
The test patches `torch.fx.experimental.symbolic_shapes.make_symbol` in the parent process and counts via a `multiprocessing.Value`. In V1 the actual compile runs inside an `EngineCore` subprocess that vLLM spawns whenever CUDA is initialized in the parent (via `_maybe_force_spawn`), so the parent-process patch never reaches the compile path and the counter stays at 0.

Replace the brittle torch-internal patch with `LLM.collective_rpc` to snapshot `compilation_counter` from the EngineCore subprocess itself. This is process-model agnostic: it works under default V1 multiprocessing without sharing memory between the test and the engine.

Each phase creates a fresh `LLM(...)` and tears it down via `cleanup_dist_env_and_memory()`, so per-phase counters start at zero without needing the previous activation-registry workaround. Phase 2 also sets `VLLM_FORCE_AOT_LOAD=1` as a fail-loud guard (raises FileNotFoundError on cache miss) on top of the counter assertion.

`collective_rpc(callable)` requires pickle-based serialization, so the test sets `VLLM_ALLOW_INSECURE_SERIALIZATION=1` (the same pattern other collective_rpc-using tests follow, e.g. `tests/v1/e2e/general/test_pooling_chunked_prefill.py`).

Signed-off-by: haosdent <haosdent@gmail.com>
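A minimal sketch of that two-phase flow, treating `_snap`, `compile_and_snapshot`, the counter field name, and the snapshot format as illustrative rather than the PR's literal code:

```python
import os

# collective_rpc with a callable needs pickle-based serialization.
os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"

from vllm import LLM
from vllm.distributed import cleanup_dist_env_and_memory


def _snap(self) -> dict:
    # Executes inside the EngineCore worker process, where the compiles
    # actually happen, so the singleton reflects real activity. Assumes the
    # counter is a plain dataclass-style object with public int fields.
    from vllm.compilation.counter import compilation_counter
    return dict(vars(compilation_counter))


def compile_and_snapshot() -> dict:
    llm = LLM(model="gpt2")
    counters = llm.collective_rpc(_snap)[0]  # one entry per worker
    del llm
    cleanup_dist_env_and_memory()  # fresh state for the next phase
    return counters


# Phase 1: cold cache, expect at least one fresh compile (field name assumed).
phase1 = compile_and_snapshot()
assert phase1["num_backend_compilations"] > 0

# Phase 2: warm cache; fail loudly if the AOT artifact is missing.
os.environ["VLLM_FORCE_AOT_LOAD"] = "1"
phase2 = compile_and_snapshot()
assert phase2["num_backend_compilations"] == 0
```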
merge this to unblock CI since @ProExpertProg has approved
Purpose

Fix `tests/compile/test_aot_compile.py::test_gpt2_cache_hit`. The test patched `make_symbol` in the parent process, but in V1 the compile runs inside an `EngineCore` subprocess, so the patch never reached the counter. Related to #41953, which made the spawn wrapper eagerly import the test module and indirectly initialize CUDA in the parent, flipping EngineCore from fork to spawn.

Replace the parent-process patch with `LLM.collective_rpc(_snap)` to read `compilation_counter` from the EngineCore subprocess. Process-model agnostic, runs under default V1 multiprocessing. `collective_rpc(callable)` requires `VLLM_ALLOW_INSECURE_SERIALIZATION=1`.

Test Plan

```bash
rm -rf ~/.cache/vllm/torch_compile_cache
pytest tests/compile/test_aot_compile.py::test_gpt2_cache_hit -s -v
```

Test Result