[Bugfix] Fix XPU/ROCm compatibility in spawn_new_process_for_each_test#41895
Conversation
|
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Code Review
This pull request refactors the spawn_new_process_for_each_test decorator to use subprocess.run combined with cloudpickle for serializing test functions and arguments. This change ensures that exceptions occurring in the child process are captured and propagated back to the parent, preventing silent test failures. The PR also adds comprehensive unit tests for the decorator and updates existing tests in test_cudagraph_dispatch.py and test_custom_online.py to align with the new implementation. Feedback suggests that the child_script should explicitly set the multiprocessing start method to 'spawn' to avoid potential RuntimeError issues in environments where 'fork' is the default.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1eba41a30a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Please check @create_new_process_for_each_test() uses spawn on XPU, which cannot pickle the server fixture (it contains a multiprocessing.Process with an unpicklable AuthenticationString).
|
Yes, exactly. The RemoteOpenAIServerCustom fixture uses multiprocessing.Process internally which holds an AuthenticationString that can't be pickled across a spawn boundary. I would think of the fix to not wrap that specific test with spawn_new_process_for_each_test. Will implement this tmr. @chaojun-zhang |
|
@dzhengAP Please fix pre-commit so we can begin evaluations of this PR. |
5ad28d8 to
c2623ff
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
07c609c to
4eb8f08
Compare
@AndreasKaratzas Hi Andreas are you able to add a ready label? |
|
@dzhengAP pre-commit is still failing. |
866372a to
576ea71
Compare
|
…ures The previous implementation ran 'python -m <module_name>' in the child process, which re-executed the module's __main__ block instead of actually calling the test function. As a result, check_returncode() always saw exit 0 and any exception in the test was silently swallowed. Fix: serialize the test function with cloudpickle and pass it to a minimal child script via stdin, matching the robustness of fork_new_process_for_each_test. The child writes its traceback to a temp file on failure; the parent reads it and raises RuntimeError with the full traceback included. Fixes vllm-project#41415 Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
…erCustom unpicklable Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
576ea71 to
4a04fb6
Compare
c4c3413 to
4a04fb6
Compare
Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
|
@jikunshang can you help add the ready label to check CI, since focus now mainly on Intel CI tests. cc: @ProExpertProg @AndreasKaratzas |
So this does not fix rocm? So we should probably move on with reverting the other PR then. |
@AndreasKaratzas what new ROCM failure did you see? can you help paste here? AMD CI tests were passed after the fix in this PR. — the mp.set_start_method("spawn") guard applies to both current_platform.is_rocm() or current_platform.is_xpu(). The VLLM_WORKER_MULTIPROC_METHOD=spawn is also set in the child env for both platforms. Please make sure share the specific ROCm failure you're seeing. I want to make sure we address it before any revert decision. |
|
Hi @ProExpertProg @jikunshang I went through the CI log and put a summary below, which seems unrelated to #41895 1.
|
|
for intel/ci,
What confuse me is: |
Hi @jikunshang. The new spawn_new_process_for_each_test uses subprocess.run([sys.executable, ...]) which launches a completely fresh interpreter — this is actually more isolated than mp.spawn since there's no shared state at all. The child process sets VLLM_WORKER_MULTIPROC_METHOD=spawn and calls mp.set_start_method('spawn') before running the test, so any mp.Process calls inside the test still use spawn semantics. The behavior change is in how the subprocess is launched, not in what the subprocess does. |
|
@ProExpertProg @jikunshang @chaojun-zhang @AndreasKaratzas |
|
As for the first one: Same as before — 1 failed, 33 passed — only test_custom_logitsprocs failing with the XPU fork error (Cannot re-initialize XPU in forked subprocess). This is because the test explicitly sets VLLM_WORKER_MULTIPROC_METHOD=fork internally. In summay, both are pre-existing XPU limitations, not caused by this PR. The decorator is working correctly in both cases — it's properly surfacing failures that were previously hidden. |
|
Intel CI doens't gate PR merge currently. thanks @dzhengAP for great work. |
|
Hi @jikunshang and @AndreasKaratzas, I am trying to propose a fix for in #42040 |
vllm-project#41895) Signed-off-by: dqzhengAP <dqzheng1996@gmail.com> Signed-off-by: Libin Tang <libin.tang@intel.com>

Purpose
Follow-up to #41423.
mp.set_start_method("spawn")was inadvertently removed when the decorator was rewritten to usesubprocess.run. While the decorator itself doesn't need it (sincesubprocess.runlaunches a fresh interpreter regardless), other parts of the test session — specifically XPU/ROCm engine workers that usemp.Processdirectly — depend on this global mp start method being set to spawn.Without it, those workers default back to
forkon Linux, causing:RuntimeError: Cannot re-initialize XPU in forked subprocessThis restore has no effect on the failure-propagation fix from #41423.
cc @jikunshang @ProExpertProg @AndreasKaratzas