
[Bugfix] Fix XPU/ROCm compatibility in spawn_new_process_for_each_test #41895

Merged
ProExpertProg merged 4 commits into vllm-project:main from dzhengAP:bugfix/fix-spawn-decorator-exitcode
May 8, 2026

Conversation


@dzhengAP dzhengAP commented May 7, 2026

Purpose

Follow-up to #41423.

mp.set_start_method("spawn") was inadvertently removed when the decorator was rewritten to use subprocess.run. While the decorator itself doesn't need it (since subprocess.run launches a fresh interpreter regardless), other parts of the test session — specifically XPU/ROCm engine workers that use mp.Process directly — depend on this global mp start method being set to spawn.

Without it, those workers default back to fork on Linux, causing:
RuntimeError: Cannot re-initialize XPU in forked subprocess

This restore has no effect on the failure-propagation fix from #41423.
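The restored behavior amounts to a one-time, session-level setting. A minimal sketch of the idea (the guard condition and placement are illustrative, not the exact vllm test-utility code):

```python
import multiprocessing as mp

# Set the global start method once, early in the test session, so any
# mp.Process created later (e.g. XPU/ROCm engine workers) spawns a fresh
# interpreter instead of forking. Guarded so a second call cannot raise.
if mp.get_start_method(allow_none=True) is None:
    mp.set_start_method("spawn")
```

Everything launched via mp.Process afterwards inherits spawn semantics, which is exactly what the XPU/ROCm workers rely on.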

cc @jikunshang @ProExpertProg @AndreasKaratzas


@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the nvidia, rocm (Related to AMD ROCm), intel-gpu (Related to Intel GPU), v1, and bug (Something isn't working) labels May 7, 2026

mergify Bot commented May 7, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @dzhengAP.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the spawn_new_process_for_each_test decorator to use subprocess.run combined with cloudpickle for serializing test functions and arguments. This change ensures that exceptions occurring in the child process are captured and propagated back to the parent, preventing silent test failures. The PR also adds comprehensive unit tests for the decorator and updates existing tests in test_cudagraph_dispatch.py and test_custom_online.py to align with the new implementation. Feedback suggests that the child_script should explicitly set the multiprocessing start method to 'spawn' to avoid potential RuntimeError issues in environments where 'fork' is the default.

@luobosibing2

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1eba41a30a


@chaojun-zhang

Please check @create_new_process_for_each_test() uses spawn on XPU, which cannot pickle the server fixture (it contains a multiprocessing.Process with an unpicklable AuthenticationString).

FAILED tests/v1/logits_processors/test_custom_online.py::test_custom_logitsprocs[server1-facebook/opt-125m] - TypeError: Pickling an AuthenticationString object is disallowed for security reasons


dzhengAP commented May 7, 2026

Please check @create_new_process_for_each_test() uses spawn on XPU, which cannot pickle the server fixture (it contains a multiprocessing.Process with an unpicklable AuthenticationString).

FAILED tests/v1/logits_processors/test_custom_online.py::test_custom_logitsprocs[server1-facebook/opt-125m] - TypeError: Pickling an AuthenticationString object is disallowed for security reasons

Yes, exactly. The RemoteOpenAIServerCustom fixture uses multiprocessing.Process internally, which holds an AuthenticationString that can't be pickled across a spawn boundary. I think the fix is to not wrap that specific test with spawn_new_process_for_each_test. Will implement this tomorrow. @chaojun-zhang
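The failure mode is easy to reproduce. This stdlib-only sketch (not vllm code) shows why any fixture holding a multiprocessing.Process cannot cross a pickling boundary:

```python
import multiprocessing as mp
import pickle

# Every mp.Process carries an AuthenticationString authkey, and
# AuthenticationString.__reduce__ deliberately refuses pickling, so
# serializing the Process (or any object that holds one) raises TypeError.
proc = mp.Process(target=print)
try:
    pickle.dumps(proc)
except TypeError as exc:
    print(type(exc).__name__, exc)
```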

@AndreasKaratzas

@dzhengAP Please fix pre-commit so we can begin evaluations of this PR.

@dzhengAP dzhengAP force-pushed the bugfix/fix-spawn-decorator-exitcode branch from 5ad28d8 to c2623ff on May 7, 2026 17:46

mergify Bot commented May 7, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @dzhengAP.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 7, 2026
@dzhengAP dzhengAP force-pushed the bugfix/fix-spawn-decorator-exitcode branch from 07c609c to 4eb8f08 on May 7, 2026 17:53
@mergify mergify Bot removed the needs-rebase label May 7, 2026

dzhengAP commented May 7, 2026

@dzhengAP Please fix pre-commit so we can begin evaluations of this PR.

@AndreasKaratzas Hi Andreas are you able to add a ready label?

@AndreasKaratzas

@dzhengAP pre-commit is still failing.

@dzhengAP dzhengAP force-pushed the bugfix/fix-spawn-decorator-exitcode branch from 866372a to 576ea71 on May 7, 2026 21:03

dzhengAP commented May 7, 2026

Why is GH showing 0 files changed? Has this been fixed in a different PR?
@ProExpertProg The previous push accidentally resolved the conflicts in a way that made the branch identical to main. I've now rebased cleanly on top of the latest main; the 2 new commits are showing correctly and the changed files should be visible now.

dzhengAP added 2 commits May 7, 2026 14:05
…ures

The previous implementation ran 'python -m <module_name>' in the child
process, which re-executed the module's __main__ block instead of
actually calling the test function. As a result, check_returncode()
always saw exit 0 and any exception in the test was silently swallowed.

Fix: serialize the test function with cloudpickle and pass it to a
minimal child script via stdin, matching the robustness of
fork_new_process_for_each_test. The child writes its traceback to a
temp file on failure; the parent reads it and raises RuntimeError with
the full traceback included.

Fixes vllm-project#41415

Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
…erCustom unpicklable

Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
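The failure-propagation mechanism described in the first commit message can be sketched as follows (helper names are illustrative, not the actual tests/utils.py code):

```python
import os
import subprocess
import sys
import tempfile

# Child: run the work, and on failure write the traceback to a temp file
# before exiting non-zero. A raised ValueError stands in for a failing test.
CHILD = """
import sys, traceback
trace_path = sys.argv[1]
try:
    raise ValueError("boom inside child")
except BaseException:
    with open(trace_path, "w") as f:
        f.write(traceback.format_exc())
    sys.exit(1)
"""

def run_child_and_reraise():
    # Parent: launch the child, and if it exits non-zero, re-raise with the
    # child's full traceback so the failure is never silently swallowed.
    fd, trace_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        result = subprocess.run([sys.executable, "-c", CHILD, trace_path])
        if result.returncode != 0:
            with open(trace_path) as f:
                raise RuntimeError("child process failed:\n" + f.read())
    finally:
        os.remove(trace_path)
```

With the old `python -m <module_name>` approach, check_returncode() always saw exit 0; here the parent surfaces exactly what went wrong in the child.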
@dzhengAP dzhengAP force-pushed the bugfix/fix-spawn-decorator-exitcode branch from 576ea71 to 4a04fb6 on May 7, 2026 21:06
@github-project-automation github-project-automation Bot moved this to In review in NVIDIA May 7, 2026
@AndreasKaratzas AndreasKaratzas removed the ready ONLY add when PR is ready to merge/full CI is needed label May 7, 2026
@dzhengAP dzhengAP force-pushed the bugfix/fix-spawn-decorator-exitcode branch from c4c3413 to 4a04fb6 on May 7, 2026 21:13
Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>

dzhengAP commented May 7, 2026

@jikunshang can you help add the ready label to trigger CI? The focus now is mainly on the Intel CI tests. cc: @ProExpertProg @AndreasKaratzas

@AndreasKaratzas

@jikunshang can you help add the ready label to trigger CI? The focus now is mainly on the Intel CI tests. cc: @ProExpertProg @AndreasKaratzas

So this does not fix ROCm? Then we should probably move forward with reverting the other PR.


dzhengAP commented May 7, 2026

@jikunshang can you help add the ready label to trigger CI? The focus now is mainly on the Intel CI tests. cc: @ProExpertProg @AndreasKaratzas

So this does not fix ROCm? Then we should probably move forward with reverting the other PR.

@AndreasKaratzas which new ROCm failure did you see? Can you paste it here? AMD CI tests passed after the fix in this PR: the mp.set_start_method("spawn") guard applies to both current_platform.is_rocm() and current_platform.is_xpu(), and VLLM_WORKER_MULTIPROC_METHOD=spawn is also set in the child env for both platforms. Please share the specific ROCm failure you're seeing; I want to make sure we address it before any revert decision.

@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label May 7, 2026

dzhengAP commented May 8, 2026

Hi @ProExpertProg @jikunshang, I went through the CI log and put a summary below; both failures appear unrelated to this PR (#41895).

1. test_custom_logitsprocs (v1/logits_processors)

Error: RuntimeError: Cannot re-initialize XPU in forked subprocess
Root cause: The test intentionally sets VLLM_WORKER_MULTIPROC_METHOD=fork so the entry_points monkey-patch is visible to workers. This is fundamentally incompatible with XPU which requires spawn.
Status: Pre-existing issue

2. test_qwen35_text_lora (lora/)

Error: RuntimeError: Engine core initialization failed (XPU fork error in EngineCoreProc)
Root cause: The vLLM engine core is forking despite VLLM_WORKER_MULTIPROC_METHOD=spawn being set. XPU executor path is not respecting the spawn context.
Status: Pre-existing issue


Fix Plan

test_custom_logitsprocs may need to replace the fork-based entrypoint injection with a spawn-compatible approach (a config file or env var instead of monkey-patching across the fork boundary)
test_qwen35_text_lora needs investigation into why EngineCoreProc forks on XPU despite the spawn env var; the XPU executor path should explicitly use mp.get_context("spawn")
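The second fix-plan item can be sketched in a few lines (illustrative, not the actual XPU executor code):

```python
import multiprocessing as mp

# Request an explicit spawn context instead of relying on the
# process-global default, which set_start_method can only set once per
# process and which env vars cannot reliably override.
ctx = mp.get_context("spawn")
assert ctx.get_start_method() == "spawn"

# Workers would then be created as ctx.Process(...) rather than
# mp.Process(...), so a fork default elsewhere in the session cannot
# leak into the executor path.
WorkerProcess = ctx.Process
```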


I can submit follow-up PRs for them or fix the bugs here, but that may make the current PR too heavy. What do you think? cc @jikunshang @ProExpertProg


jikunshang commented May 8, 2026

For Intel CI:

  1. test_custom_logitsprocs (v1/logits_processors): I didn't notice it forced fork before. I will investigate how to properly handle this case on XPU; maybe I should just skip it given that limitation.
  2. I agree test_qwen35_text_lora (lora/) is not related; we can ignore it. test_qwen35_text_lora is a case with the create_new_process_for_each_test decorator, and it still uses fork instead of spawn.

What confuses me is:
#41423 changed the spawn_new_process_for_each_test behavior so it no longer uses spawn (see #41423 (comment)); this may affect other ROCm/XPU test cases with the create_new_process_for_each_test decorator.

@dzhengAP

dzhengAP commented May 8, 2026

What confuses me is:
#41423 changed the spawn_new_process_for_each_test behavior so it no longer uses spawn (see #41423 (comment)); this may affect other ROCm/XPU test cases with the create_new_process_for_each_test decorator.

Hi @jikunshang. The new spawn_new_process_for_each_test uses subprocess.run([sys.executable, ...]), which launches a completely fresh interpreter; this is actually more isolated than mp.spawn since there is no shared state at all. The child process sets VLLM_WORKER_MULTIPROC_METHOD=spawn and calls mp.set_start_method('spawn') before running the test, so any mp.Process calls inside the test still use spawn semantics. The behavior change is in how the subprocess is launched, not in what the subprocess does.
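That launch pattern can be sketched in a few lines. This sketch uses stdlib pickle so it stays self-contained; the actual decorator uses cloudpickle, which can also serialize closures and locally defined test functions. Names here are illustrative:

```python
import os
import pickle
import subprocess
import sys

# Script executed by the fresh interpreter: force spawn semantics, then
# unpickle and run the callable received on stdin.
CHILD_SCRIPT = """
import multiprocessing as mp
import pickle
import sys

mp.set_start_method("spawn", force=True)
fn, args, kwargs = pickle.loads(sys.stdin.buffer.read())
fn(*args, **kwargs)
"""

def run_in_fresh_interpreter(fn, *args, **kwargs):
    # Child env mirrors the decorator: worker subprocesses must also spawn.
    env = dict(os.environ, VLLM_WORKER_MULTIPROC_METHOD="spawn")
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        input=pickle.dumps((fn, args, kwargs)),
        env=env,
        capture_output=True,
    )
    result.check_returncode()  # a non-zero child exit raises in the parent
    return result
```

For example, run_in_fresh_interpreter(print, "hello") executes print in a brand-new interpreter that shares no state with the parent process.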


dzhengAP commented May 8, 2026

@ProExpertProg @jikunshang @chaojun-zhang @AndreasKaratzas
After checking the Intel CI log for test_qwen35_text_lora: the actual error is AssertionError: lora isn't supported on XPU, not a fork/spawn issue. The test is running via subprocess.run correctly (you can see python3 -c "import sys, cloudpickle, traceback..." in the output), which means the new decorator is working correctly and is properly surfacing this pre-existing failure. The test should be skipped on XPU with pytest.mark.skip or a platform check; the fork/spawn concern doesn't apply here.
[screenshot: Intel CI log]
Aligned with @jikunshang's assessment above.


dzhengAP commented May 8, 2026

As for the first one: same as before (1 failed, 33 passed), only test_custom_logitsprocs is failing with the XPU fork error (Cannot re-initialize XPU in forked subprocess), because the test explicitly sets VLLM_WORKER_MULTIPROC_METHOD=fork internally.

In summary, both are pre-existing XPU limitations, not caused by this PR. The decorator is working correctly in both cases; it is properly surfacing failures that were previously hidden.

@ProExpertProg @jikunshang @chaojun-zhang @AndreasKaratzas

@jikunshang

Intel CI doesn't gate PR merges currently. Thanks @dzhengAP for the great work.
Intel folks will temporarily disable these failing cases in another PR to keep CI green, and will provide a solid fix for them afterwards.

@github-project-automation github-project-automation Bot moved this from In review to Ready in NVIDIA May 8, 2026
@ProExpertProg ProExpertProg merged commit 1acd67a into vllm-project:main May 8, 2026
16 of 18 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD May 8, 2026
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA May 8, 2026

dzhengAP commented May 8, 2026

Hi @jikunshang and @AndreasKaratzas, I am proposing a fix in #42040.
That PR fixes the LOGITPROC_SOURCE_ENTRYPOINT fork-compatibility issue on XPU/ROCm that was previously failing Intel and AMD CI, as we discussed here (failure 1). The monkey-patch approach required VLLM_WORKER_MULTIPROC_METHOD=fork, which is incompatible with XPU/ROCm. The fix registers a real dist-info package on disk via PYTHONPATH, so any spawned subprocess can discover the entrypoint without needing fork. Please take a look when you get a chance!
cc: @ProExpertProg

libinta pushed a commit to libinta/vllm that referenced this pull request May 8, 2026
vllm-project#41895)

Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
Signed-off-by: Libin Tang <libin.tang@intel.com>

Labels

bug (Something isn't working), intel-gpu (Related to Intel GPU), nvidia, ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm), v1

Projects

Status: Done
Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants