
[CI][Bugfix] Surface subprocess output in spawn_new_process_for_each_test#41943

Merged
NickLucche merged 1 commit into vllm-project:main from haosdent:ci/spawn-test-surface-subprocess-output
May 8, 2026

Conversation

@haosdent
Contributor

@haosdent haosdent commented May 7, 2026

Purpose

spawn_new_process_for_each_test calls subprocess.run(..., capture_output=True) and then drops result.stderr/stdout whenever the child writes a Python traceback — i.e. on every failure. Engine-core, NCCL and CUDA output is hidden, which is where the actual cause of crashes lives (e.g. the whisper TP=2 failures in #64813 / #64851 / #64859 show Failed core proc(s): {} with zero engine-subprocess lines).

Drop capture_output=True so the child inherits the parent's stdout/stderr, mirroring fork_new_process_for_each_test. tb_file still re-raises the Python traceback in the parent. A small helper renders signal exits as killed by SIGSEGV (-11).
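The flow described above can be sketched as follows. This is a minimal illustration of the approach, not the actual `tests/utils.py` code; the function name `run_test_in_child` and the `TB_FILE_PATH` environment variable are assumptions for the example.

```python
import os
import subprocess
import sys  # sys.executable is used in the usage example below
import tempfile


def run_test_in_child(cmd):
    """Minimal sketch of the post-PR flow: no capture_output, so the child's
    stdout/stderr stream straight into the runner's log; a temp file carries
    the Python traceback back to the parent for re-raising."""
    tb_file = tempfile.NamedTemporaryFile(delete=False)
    try:
        env = dict(os.environ, TB_FILE_PATH=tb_file.name)
        # No capture_output=True: the child inherits our stdout/stderr.
        result = subprocess.run(cmd, env=env)
        if result.returncode != 0:
            with open(tb_file.name) as f:
                tb = f.read() or "<no traceback written by child>"
            raise RuntimeError(
                f"Test subprocess failed (exit code {result.returncode}):\n{tb}"
            )
    finally:
        os.remove(tb_file.name)
```

Usage (hypothetical): `run_test_in_child([sys.executable, "-c", "pass"])` returns silently, while a failing child raises a `RuntimeError` carrying whatever traceback the child wrote to the temp file.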

Related to #41423

Test Plan

Test Result

@haosdent haosdent changed the title [WIP][CI][Bugfix] Surface subprocess output in spawn_new_process_for_each_test [CI][Bugfix] Surface subprocess output in spawn_new_process_for_each_test May 7, 2026
@mergify mergify Bot added the bug Something isn't working label May 7, 2026
@haosdent haosdent marked this pull request as ready for review May 7, 2026 10:44

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the spawn_new_process_for_each_test decorator in tests/utils.py to enable live output from test subprocesses by removing capture_output=True. It also introduces a _format_subprocess_exit utility to improve error reporting by identifying signals for non-zero exit codes and simplifies the traceback retrieval logic. I have no feedback to provide as there were no review comments.

@ZJY0516
Member

ZJY0516 commented May 7, 2026

@claude review


@claude claude Bot left a comment


LGTM, straightforward CI diagnostics fix in test utilities — drops capture_output so child stdout/stderr stream live to the runner, while tb_file still propagates the Python traceback.


Overview

The PR modifies a single function (spawn_new_process_for_each_test) in tests/utils.py. It removes capture_output=True from the subprocess.run call so the child process inherits the parent's stdout/stderr (matching fork_new_process_for_each_test's behavior), simplifies the traceback-read path to always read tb_file (with a fallback message if empty), and adds a small _format_subprocess_exit helper that renders signal exits like killed by SIGSEGV (-11).

Security risks

None. This is purely a test-runner diagnostics change; no production code, no auth/crypto/permissions, no user input handling.

Level of scrutiny

Low. The file is test infrastructure used only by the CI test harness. The behavioral change is narrow: failures that previously hid engine-core/NCCL/CUDA output will now surface that output in the CI log, which is the explicit motivation. The simplified open(tb_file) path is safe because tb_file is unconditionally created via NamedTemporaryFile(delete=False) earlier in the same try block, and the finally still removes it.

Other factors

  • Small diff (~15 lines of meaningful change), single file, clearly tied to a real CI debuggability problem (referenced Buildkite failures).
  • No bugs flagged by the bug hunting system.
  • The new helper uses signal.Signals(-returncode).name with a ValueError fallback, which correctly handles platform-specific signal numbers.
  • Worst case if this regresses is a less-informative CI log, not a correctness or security issue.
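A helper of this shape can be sketched as below. The function name and exact message wording are assumptions for illustration; the PR's actual `_format_subprocess_exit` in `tests/utils.py` may differ in details, but the `signal.Signals(-returncode).name` lookup with a `ValueError` fallback is the mechanism described above.

```python
import signal


def format_subprocess_exit(returncode: int) -> str:
    """Render a subprocess return code for humans.

    subprocess.run reports a signal-killed child as a negative return
    code; map it to the symbolic signal name when the platform has one.
    """
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number with no symbolic name on this platform.
            return f"killed by signal {-returncode} ({returncode})"
        return f"killed by {name} ({returncode})"
    return f"exit code {returncode}"
```

On Linux, signal 11 is SIGSEGV, so `format_subprocess_exit(-11)` yields `killed by SIGSEGV (-11)` rather than the cryptic `exit code -11`.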

@ZJY0516
Member

ZJY0516 commented May 7, 2026

If I understand correctly, this pr tries to change Failed core proc(s): {} to Failed core proc(s): {killed by SIGSEGV (-11)} right?

@haosdent
Contributor Author

haosdent commented May 7, 2026

If I understand correctly, this pr tries to change Failed core proc(s): {} to Failed core proc(s): {killed by SIGSEGV (-11)} right?

No. What this PR changes is to surface result.stdout/result.stderr from the subprocess.

Before this PR

result.stdout/result.stderr were swallowed: the code captured them but never printed them.

models/multimodal/generation/test_whisper.py::test_models_distributed[5-200-half-ray-openai/whisper-large-v3-turbo] Fork a new process to run a test 44035
Fork a new process to run a test 0
FAILED

[~46s of silence — no engine output captured]

____ test_models_distributed[5-200-half-ray-openai/whisper-large-v3-turbo] _____
...
E   RuntimeError: Test subprocess 'test_models_distributed' failed (exit code 1):
E   Traceback (most recent call last):
E     File ".../vllm/v1/engine/utils.py", line 1183, in wait_for_engine_startup
E       raise RuntimeError(
E   RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

After this PR

The stdout/stderr can now be observed in the log:

models/multimodal/generation/test_whisper.py::test_models_distributed[5-200-half-ray-openai/whisper-large-v3-turbo] Fork a new process to run a test 44035
Fork a new process to run a test 0
INFO 05-07 00:30:02 [utils.py:240] non-default args: {'dtype': 'half', 'max_model_len': 448, 'distributed_executor_backend': 'ray', 'tensor_parallel_size': 2, ...}
INFO 05-07 00:30:05 [model.py:563] Resolved architecture: WhisperForConditionalGeneration
(EngineCore pid=44120) INFO 05-07 00:30:08 [core.py:109] Initializing a V1 LLM engine ...
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210) INFO ... Starting to load model openai/whisper-large-v3-turbo...
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210) INFO ... Model loading took 1.55 GiB memory and 4.7 seconds
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210) INFO ... Capturing CUDA graphs (decode-only, FULL_DECODE_ONLY): 0% 0/8
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210) ERROR ... torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 624.00 MiB. GPU 0 has a total capacity of 22.06 GiB of which 412.50 MiB is free.
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210) Traceback (most recent call last):
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210)   File ".../vllm/v1/worker/gpu_model_runner.py", line 6210, in capture_model
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210)     self._capture_cudagraphs(...)
(EngineCore pid=44120) ERROR 05-07 00:30:43 [core.py:807] EngineCore failed to start.
(EngineCore pid=44120) Process EngineCore_DP0:
(EngineCore pid=44120) Traceback (most recent call last):
(EngineCore pid=44120)   File ".../vllm/v1/engine/core.py", line 800, in run_engine_core
(EngineCore pid=44120)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore pid=44120)   File ".../vllm/v1/engine/core.py", line 524, in __init__
(EngineCore pid=44120)     super().__init__(...)
(EngineCore pid=44120) torch.OutOfMemoryError: CUDA out of memory. ...
FAILED

____ test_models_distributed[5-200-half-ray-openai/whisper-large-v3-turbo] _____
...
E   RuntimeError: Engine core initialization failed. ... Failed core proc(s): {}

The Failed core proc(s): {} line stays the same.
The killed by SIGSEGV (-11) formatting applies to the outer test wrapper's RuntimeError("Test subprocess ... failed (exit code N)") and only appears if the test-wrapper process itself dies from a signal. I could remove it if you think it is unnecessary.

Member

@ZJY0516 ZJY0516 left a comment


LGTM

@ZJY0516 ZJY0516 added the ready ONLY add when PR is ready to merge/full CI is needed label May 7, 2026
…test

The decorator captured the child's stdout/stderr via
``subprocess.run(..., capture_output=True)`` and then dropped them whenever
the child wrote a Python traceback to ``tb_file`` — i.e. on every test
failure. Engine-core, NCCL and CUDA output (the actual root cause of
crashes like the whisper TP=2 distributed-test failures in #64813 / #64851
/ #64859) ended up nowhere visible.

Inherit stdout/stderr instead, mirroring ``fork_new_process_for_each_test``
so subprocess output flows into the test runner / CI log live. Pytest's
fd-level capture still suppresses the noise on passing runs. ``tb_file``
remains the structured channel for re-raising the Python traceback in the
parent. A small helper renders signal-induced exits as
``killed by SIGSEGV (-11)`` instead of the cryptic ``exit code -11``.

Signed-off-by: haosdent huang <haosdent@gmail.com>

Signed-off-by: haosdent <haosdent@gmail.com>
@haosdent haosdent force-pushed the ci/spawn-test-surface-subprocess-output branch from 83775d2 to 24ffdb5 May 8, 2026 03:55
@NickLucche NickLucche merged commit 160858c into vllm-project:main May 8, 2026
12 checks passed
libinta pushed a commit to libinta/vllm that referenced this pull request May 8, 2026
…test (vllm-project#41943)

Signed-off-by: haosdent <haosdent@gmail.com>
Signed-off-by: Libin Tang <libin.tang@intel.com>

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed


3 participants