
[Bugfix] Fix XPU/ROCm compatibility in spawn_new_process_for_each_test #41895

Merged
ProExpertProg merged 4 commits into vllm-project:main from dzhengAP:bugfix/fix-spawn-decorator-exitcode
May 8, 2026

Conversation


@dzhengAP dzhengAP commented May 7, 2026

Purpose

Follow-up to #41423.

mp.set_start_method("spawn") was inadvertently removed when the decorator was rewritten to use subprocess.run. While the decorator itself doesn't need it (since subprocess.run launches a fresh interpreter regardless), other parts of the test session — specifically XPU/ROCm engine workers that use mp.Process directly — depend on this global mp start method being set to spawn.

Without it, those workers default back to fork on Linux, causing:
RuntimeError: Cannot re-initialize XPU in forked subprocess

This restore has no effect on the failure-propagation fix from #41423.
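The restored behavior amounts to a one-time, session-level setting. A minimal sketch of the idea (the guard condition and placement are illustrative, not the exact vllm test-utility code):

```python
import multiprocessing as mp

# Set the global start method once, early in the test session, so any
# mp.Process created later (e.g. XPU/ROCm engine workers) spawns a fresh
# interpreter instead of forking. Guarded so a second call cannot raise.
if mp.get_start_method(allow_none=True) is None:
    mp.set_start_method("spawn")
```

Everything launched via mp.Process afterwards inherits spawn semantics, which is exactly what the XPU/ROCm workers rely on.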

cc @jikunshang @ProExpertProg @AndreasKaratzas


@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the nvidia, rocm (Related to AMD ROCm), intel-gpu (Related to Intel GPU), v1, and bug (Something isn't working) labels May 7, 2026

mergify Bot commented May 7, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @dzhengAP.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the spawn_new_process_for_each_test decorator to use subprocess.run combined with cloudpickle for serializing test functions and arguments. This change ensures that exceptions occurring in the child process are captured and propagated back to the parent, preventing silent test failures. The PR also adds comprehensive unit tests for the decorator and updates existing tests in test_cudagraph_dispatch.py and test_custom_online.py to align with the new implementation. Feedback suggests that the child_script should explicitly set the multiprocessing start method to 'spawn' to avoid potential RuntimeError issues in environments where 'fork' is the default.

@luobosibing2

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1eba41a30a


@chaojun-zhang

Please check @create_new_process_for_each_test() uses spawn on XPU, which cannot pickle the server fixture (it contains a multiprocessing.Process with an unpicklable AuthenticationString).

FAILED tests/v1/logits_processors/test_custom_online.py::test_custom_logitsprocs[server1-facebook/opt-125m] - TypeError: Pickling an AuthenticationString object is disallowed for security reasons


dzhengAP commented May 7, 2026

Please check @create_new_process_for_each_test() uses spawn on XPU, which cannot pickle the server fixture (it contains a multiprocessing.Process with an unpicklable AuthenticationString).

FAILED tests/v1/logits_processors/test_custom_online.py::test_custom_logitsprocs[server1-facebook/opt-125m] - TypeError: Pickling an AuthenticationString object is disallowed for security reasons

Yes, exactly. The RemoteOpenAIServerCustom fixture uses multiprocessing.Process internally, which holds an AuthenticationString that can't be pickled across a spawn boundary. I think the fix is to not wrap that specific test with spawn_new_process_for_each_test. Will implement this tomorrow. @chaojun-zhang
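The failure mode is easy to reproduce. This stdlib-only sketch (not vllm code) shows why any fixture holding a multiprocessing.Process cannot cross a pickling boundary:

```python
import multiprocessing as mp
import pickle

# Every mp.Process carries an AuthenticationString authkey, and
# AuthenticationString.__reduce__ deliberately refuses pickling, so
# serializing the Process (or any object that holds one) raises TypeError.
proc = mp.Process(target=print)
try:
    pickle.dumps(proc)
except TypeError as exc:
    print(type(exc).__name__, exc)
```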

@AndreasKaratzas

@dzhengAP Please fix pre-commit so we can begin evaluations of this PR.

@dzhengAP dzhengAP force-pushed the bugfix/fix-spawn-decorator-exitcode branch from 5ad28d8 to c2623ff on May 7, 2026 17:46

mergify Bot commented May 7, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @dzhengAP.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 7, 2026
@dzhengAP dzhengAP force-pushed the bugfix/fix-spawn-decorator-exitcode branch from 07c609c to 4eb8f08 on May 7, 2026 17:53
@mergify mergify Bot removed the needs-rebase label May 7, 2026

dzhengAP commented May 7, 2026

@dzhengAP Please fix pre-commit so we can begin evaluations of this PR.

@AndreasKaratzas Hi Andreas are you able to add a ready label?

@AndreasKaratzas

@dzhengAP pre-commit is still failing.

@dzhengAP dzhengAP force-pushed the bugfix/fix-spawn-decorator-exitcode branch from 866372a to 576ea71 on May 7, 2026 21:03

dzhengAP commented May 7, 2026

Why is GH showing 0 files changed? Has this been fixed in a different PR?
@ProExpertProg The previous push accidentally resolved the conflicts in a way that made the branch identical to main. I've now rebased cleanly on top of the latest main; the 2 new commits are showing correctly and the changed files should be visible now.

dzhengAP added 2 commits May 7, 2026 14:05
…ures

The previous implementation ran 'python -m <module_name>' in the child
process, which re-executed the module's __main__ block instead of
actually calling the test function. As a result, check_returncode()
always saw exit 0 and any exception in the test was silently swallowed.

Fix: serialize the test function with cloudpickle and pass it to a
minimal child script via stdin, matching the robustness of
fork_new_process_for_each_test. The child writes its traceback to a
temp file on failure; the parent reads it and raises RuntimeError with
the full traceback included.

Fixes vllm-project#41415

Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
…erCustom unpicklable

Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
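The failure-propagation mechanism described in the first commit message can be sketched as follows (helper names are illustrative, not the actual tests/utils.py code):

```python
import os
import subprocess
import sys
import tempfile

# Child: run the work, and on failure write the traceback to a temp file
# before exiting non-zero. A raised ValueError stands in for a failing test.
CHILD = """
import sys, traceback
trace_path = sys.argv[1]
try:
    raise ValueError("boom inside child")
except BaseException:
    with open(trace_path, "w") as f:
        f.write(traceback.format_exc())
    sys.exit(1)
"""

def run_child_and_reraise():
    # Parent: launch the child, and if it exits non-zero, re-raise with the
    # child's full traceback so the failure is never silently swallowed.
    fd, trace_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        result = subprocess.run([sys.executable, "-c", CHILD, trace_path])
        if result.returncode != 0:
            with open(trace_path) as f:
                raise RuntimeError("child process failed:\n" + f.read())
    finally:
        os.remove(trace_path)
```

With the old `python -m <module_name>` approach, check_returncode() always saw exit 0; here the parent surfaces exactly what went wrong in the child.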
@dzhengAP dzhengAP force-pushed the bugfix/fix-spawn-decorator-exitcode branch from 576ea71 to 4a04fb6 on May 7, 2026 21:06
@github-project-automation github-project-automation Bot moved this to In review in NVIDIA May 7, 2026
@AndreasKaratzas AndreasKaratzas removed the ready ONLY add when PR is ready to merge/full CI is needed label May 7, 2026
@dzhengAP dzhengAP force-pushed the bugfix/fix-spawn-decorator-exitcode branch from c4c3413 to 4a04fb6 on May 7, 2026 21:13
Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>

dzhengAP commented May 7, 2026

@jikunshang can you help add the ready label to trigger CI? The focus now is mainly on the Intel CI tests. cc: @ProExpertProg @AndreasKaratzas

@AndreasKaratzas

@jikunshang can you help add the ready label to trigger CI? The focus now is mainly on the Intel CI tests. cc: @ProExpertProg @AndreasKaratzas

So this does not fix ROCm? Then we should probably move forward with reverting the other PR.


dzhengAP commented May 7, 2026

@jikunshang can you help add the ready label to trigger CI? The focus now is mainly on the Intel CI tests. cc: @ProExpertProg @AndreasKaratzas

So this does not fix ROCm? Then we should probably move forward with reverting the other PR.

@AndreasKaratzas which new ROCm failure did you see? Can you paste it here? AMD CI tests passed after the fix in this PR: the mp.set_start_method("spawn") guard applies to both current_platform.is_rocm() and current_platform.is_xpu(), and VLLM_WORKER_MULTIPROC_METHOD=spawn is also set in the child env for both platforms. Please share the specific ROCm failure you're seeing; I want to make sure we address it before any revert decision.

@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label May 7, 2026

dzhengAP commented May 8, 2026

Hi @ProExpertProg @jikunshang, I went through the CI log and put a summary below; both failures appear unrelated to this PR (#41895).

1. test_custom_logitsprocs (v1/logits_processors)

Error: RuntimeError: Cannot re-initialize XPU in forked subprocess
Root cause: The test intentionally sets VLLM_WORKER_MULTIPROC_METHOD=fork so the entry_points monkey-patch is visible to workers. This is fundamentally incompatible with XPU which requires spawn.
Status: Pre-existing issue

2. test_qwen35_text_lora (lora/)

Error: RuntimeError: Engine core initialization failed (XPU fork error in EngineCoreProc)
Root cause: The vLLM engine core is forking despite VLLM_WORKER_MULTIPROC_METHOD=spawn being set. XPU executor path is not respecting the spawn context.
Status: Pre-existing issue


Fix Plan

test_custom_logitsprocs may need to replace the fork-based entrypoint injection with a spawn-compatible approach (a config file or env var instead of monkey-patching across the fork boundary)
test_qwen35_text_lora needs investigation into why EngineCoreProc forks on XPU despite the spawn env var; the XPU executor path should explicitly use mp.get_context("spawn")
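The second fix-plan item can be sketched in a few lines (illustrative, not the actual XPU executor code):

```python
import multiprocessing as mp

# Request an explicit spawn context instead of relying on the
# process-global default, which set_start_method can only set once per
# process and which env vars cannot reliably override.
ctx = mp.get_context("spawn")
assert ctx.get_start_method() == "spawn"

# Workers would then be created as ctx.Process(...) rather than
# mp.Process(...), so a fork default elsewhere in the session cannot
# leak into the executor path.
WorkerProcess = ctx.Process
```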


I can submit follow-up PRs for them or fix the bugs here, but that may make the current PR too heavy. What do you think? cc @jikunshang @ProExpertProg


jikunshang commented May 8, 2026

For Intel CI:

  1. test_custom_logitsprocs (v1/logits_processors): I didn't notice it forced fork before. I will investigate how to properly handle this case on XPU; maybe I should just skip it given that limitation.
  2. I agree test_qwen35_text_lora (lora/) is not related; we can ignore it. test_qwen35_text_lora is a case with the create_new_process_for_each_test decorator, and it still uses fork instead of spawn.

What confuses me is:
#41423 changed the spawn_new_process_for_each_test behavior so it no longer uses spawn (see #41423 (comment)); this may affect other ROCm/XPU test cases with the create_new_process_for_each_test decorator.

@dzhengAP

dzhengAP commented May 8, 2026

What confuses me is:
#41423 changed the spawn_new_process_for_each_test behavior so it no longer uses spawn (see #41423 (comment)); this may affect other ROCm/XPU test cases with the create_new_process_for_each_test decorator.

Hi @jikunshang. The new spawn_new_process_for_each_test uses subprocess.run([sys.executable, ...]), which launches a completely fresh interpreter; this is actually more isolated than mp.spawn since there is no shared state at all. The child process sets VLLM_WORKER_MULTIPROC_METHOD=spawn and calls mp.set_start_method('spawn') before running the test, so any mp.Process calls inside the test still use spawn semantics. The behavior change is in how the subprocess is launched, not in what the subprocess does.
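That launch pattern can be sketched in a few lines. This sketch uses stdlib pickle so it stays self-contained; the actual decorator uses cloudpickle, which can also serialize closures and locally defined test functions. Names here are illustrative:

```python
import os
import pickle
import subprocess
import sys

# Script executed by the fresh interpreter: force spawn semantics, then
# unpickle and run the callable received on stdin.
CHILD_SCRIPT = """
import multiprocessing as mp
import pickle
import sys

mp.set_start_method("spawn", force=True)
fn, args, kwargs = pickle.loads(sys.stdin.buffer.read())
fn(*args, **kwargs)
"""

def run_in_fresh_interpreter(fn, *args, **kwargs):
    # Child env mirrors the decorator: worker subprocesses must also spawn.
    env = dict(os.environ, VLLM_WORKER_MULTIPROC_METHOD="spawn")
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        input=pickle.dumps((fn, args, kwargs)),
        env=env,
        capture_output=True,
    )
    result.check_returncode()  # a non-zero child exit raises in the parent
    return result
```

For example, run_in_fresh_interpreter(print, "hello") executes print in a brand-new interpreter that shares no state with the parent process.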


dzhengAP commented May 8, 2026

@ProExpertProg @jikunshang @chaojun-zhang @AndreasKaratzas
After checking the Intel CI log for test_qwen35_text_lora: the actual error is AssertionError: lora isn't supported on XPU, not a fork/spawn issue. The test is running via subprocess.run correctly (you can see python3 -c "import sys, cloudpickle, traceback..." in the output), which means the new decorator is working correctly and is properly surfacing this pre-existing failure. The test should be skipped on XPU with pytest.mark.skip or a platform check; the fork/spawn concern doesn't apply here.
[screenshot: Intel CI log]
Aligned with @jikunshang's assessment above.


dzhengAP commented May 8, 2026

As for the first one: same as before (1 failed, 33 passed), only test_custom_logitsprocs is failing with the XPU fork error (Cannot re-initialize XPU in forked subprocess), because the test explicitly sets VLLM_WORKER_MULTIPROC_METHOD=fork internally.

In summary, both are pre-existing XPU limitations, not caused by this PR. The decorator is working correctly in both cases; it is properly surfacing failures that were previously hidden.

@ProExpertProg @jikunshang @chaojun-zhang @AndreasKaratzas

@jikunshang

Intel CI doesn't gate PR merges currently. Thanks @dzhengAP for the great work.
Intel folks will temporarily disable these failing cases in another PR to keep CI green, and will provide a solid fix for them afterwards.

@github-project-automation github-project-automation Bot moved this from In review to Ready in NVIDIA May 8, 2026
@ProExpertProg ProExpertProg merged commit 1acd67a into vllm-project:main May 8, 2026
16 of 18 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD May 8, 2026
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA May 8, 2026

dzhengAP commented May 8, 2026

Hi @jikunshang and @AndreasKaratzas, I am proposing a fix in #42040.
That PR fixes the LOGITPROC_SOURCE_ENTRYPOINT fork-compatibility issue on XPU/ROCm that was previously failing Intel and AMD CI, as we discussed here (failure 1). The monkey-patch approach required VLLM_WORKER_MULTIPROC_METHOD=fork, which is incompatible with XPU/ROCm. The fix registers a real dist-info package on disk via PYTHONPATH, so any spawned subprocess can discover the entrypoint without needing fork. Please take a look when you get a chance!
cc: @ProExpertProg

libinta pushed a commit to libinta/vllm that referenced this pull request May 8, 2026
vllm-project#41895)

Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
Signed-off-by: Libin Tang <libin.tang@intel.com>

Labels

bug (Something isn't working), intel-gpu (Related to Intel GPU), nvidia, ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm), v1

Projects

Status: Done
Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants