
[CI][Bugfix] Surface subprocess output in spawn_new_process_for_each_test#41943

Merged
NickLucche merged 1 commit into vllm-project:main from haosdent:ci/spawn-test-surface-subprocess-output
May 8, 2026

Conversation

@haosdent
Contributor

@haosdent haosdent commented May 7, 2026

Purpose

spawn_new_process_for_each_test calls subprocess.run(..., capture_output=True) and then drops result.stderr/stdout whenever the child writes a Python traceback — i.e. on every failure. Engine-core, NCCL and CUDA output is hidden, which is where the actual cause of crashes lives (e.g. the whisper TP=2 failures in #64813 / #64851 / #64859 show Failed core proc(s): {} with zero engine-subprocess lines).

Drop capture_output=True so the child inherits the parent's stdout/stderr, mirroring fork_new_process_for_each_test. tb_file still re-raises the Python traceback in the parent. A small helper renders signal exits as killed by SIGSEGV (-11).
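The flow described above can be sketched as follows. This is a minimal illustration of the approach, not the actual `tests/utils.py` code; the function name `run_test_in_child` and the `TB_FILE_PATH` environment variable are assumptions for the example.

```python
import os
import subprocess
import sys  # sys.executable is used in the usage example below
import tempfile


def run_test_in_child(cmd):
    """Minimal sketch of the post-PR flow: no capture_output, so the child's
    stdout/stderr stream straight into the runner's log; a temp file carries
    the Python traceback back to the parent for re-raising."""
    tb_file = tempfile.NamedTemporaryFile(delete=False)
    try:
        env = dict(os.environ, TB_FILE_PATH=tb_file.name)
        # No capture_output=True: the child inherits our stdout/stderr.
        result = subprocess.run(cmd, env=env)
        if result.returncode != 0:
            with open(tb_file.name) as f:
                tb = f.read() or "<no traceback written by child>"
            raise RuntimeError(
                f"Test subprocess failed (exit code {result.returncode}):\n{tb}"
            )
    finally:
        os.remove(tb_file.name)
```

Usage (hypothetical): `run_test_in_child([sys.executable, "-c", "pass"])` returns silently, while a failing child raises a `RuntimeError` carrying whatever traceback the child wrote to the temp file.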

Related to #41423

Test Plan

Test Result

@haosdent haosdent changed the title [WIP][CI][Bugfix] Surface subprocess output in spawn_new_process_for_each_test [CI][Bugfix] Surface subprocess output in spawn_new_process_for_each_test May 7, 2026
@mergify mergify Bot added the bug Something isn't working label May 7, 2026
@haosdent haosdent marked this pull request as ready for review May 7, 2026 10:44

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the spawn_new_process_for_each_test decorator in tests/utils.py to enable live output from test subprocesses by removing capture_output=True. It also introduces a _format_subprocess_exit utility to improve error reporting by identifying signals for non-zero exit codes and simplifies the traceback retrieval logic. I have no feedback to provide as there were no review comments.

@ZJY0516
Member

ZJY0516 commented May 7, 2026

@claude review


@claude claude Bot left a comment


LGTM, straightforward CI diagnostics fix in test utilities — drops capture_output so child stdout/stderr stream live to the runner, while tb_file still propagates the Python traceback.


Overview

The PR modifies a single function (spawn_new_process_for_each_test) in tests/utils.py. It removes capture_output=True from the subprocess.run call so the child process inherits the parent's stdout/stderr (matching fork_new_process_for_each_test's behavior), simplifies the traceback-read path to always read tb_file (with a fallback message if empty), and adds a small _format_subprocess_exit helper that renders signal exits like killed by SIGSEGV (-11).

Security risks

None. This is purely a test-runner diagnostics change; no production code, no auth/crypto/permissions, no user input handling.

Level of scrutiny

Low. The file is test infrastructure used only by the CI test harness. The behavioral change is narrow: failures that previously hid engine-core/NCCL/CUDA output will now surface that output in the CI log, which is the explicit motivation. The simplified open(tb_file) path is safe because tb_file is unconditionally created via NamedTemporaryFile(delete=False) earlier in the same try block, and the finally still removes it.

Other factors

  • Small diff (~15 lines of meaningful change), single file, clearly tied to a real CI debuggability problem (referenced Buildkite failures).
  • No bugs flagged by the bug hunting system.
  • The new helper uses signal.Signals(-returncode).name with a ValueError fallback, which correctly handles platform-specific signal numbers.
  • Worst case if this regresses is a less-informative CI log, not a correctness or security issue.
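A helper of this shape can be sketched as below. The function name and exact message wording are assumptions for illustration; the PR's actual `_format_subprocess_exit` in `tests/utils.py` may differ in details, but the `signal.Signals(-returncode).name` lookup with a `ValueError` fallback is the mechanism described above.

```python
import signal


def format_subprocess_exit(returncode: int) -> str:
    """Render a subprocess return code for humans.

    subprocess.run reports a signal-killed child as a negative return
    code; map it to the symbolic signal name when the platform has one.
    """
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number with no symbolic name on this platform.
            return f"killed by signal {-returncode} ({returncode})"
        return f"killed by {name} ({returncode})"
    return f"exit code {returncode}"
```

On Linux, signal 11 is SIGSEGV, so `format_subprocess_exit(-11)` yields `killed by SIGSEGV (-11)` rather than the cryptic `exit code -11`.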

@ZJY0516
Member

ZJY0516 commented May 7, 2026

If I understand correctly, this pr tries to change Failed core proc(s): {} to Failed core proc(s): {killed by SIGSEGV (-11)} right?

@haosdent
Contributor Author

haosdent commented May 7, 2026

If I understand correctly, this pr tries to change Failed core proc(s): {} to Failed core proc(s): {killed by SIGSEGV (-11)} right?

No. What this PR changes is to surface result.stdout/result.stderr from the subprocess.

Before this PR

result.stdout/result.stderr were swallowed: the code captured them but never printed them.

models/multimodal/generation/test_whisper.py::test_models_distributed[5-200-half-ray-openai/whisper-large-v3-turbo] Fork a new process to run a test 44035
Fork a new process to run a test 0
FAILED

[~46s of silence — no engine output captured]

____ test_models_distributed[5-200-half-ray-openai/whisper-large-v3-turbo] _____
...
E   RuntimeError: Test subprocess 'test_models_distributed' failed (exit code 1):
E   Traceback (most recent call last):
E     File ".../vllm/v1/engine/utils.py", line 1183, in wait_for_engine_startup
E       raise RuntimeError(
E   RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

After this PR

The stdout/stderr can now be observed in the log:

models/multimodal/generation/test_whisper.py::test_models_distributed[5-200-half-ray-openai/whisper-large-v3-turbo] Fork a new process to run a test 44035
Fork a new process to run a test 0
INFO 05-07 00:30:02 [utils.py:240] non-default args: {'dtype': 'half', 'max_model_len': 448, 'distributed_executor_backend': 'ray', 'tensor_parallel_size': 2, ...}
INFO 05-07 00:30:05 [model.py:563] Resolved architecture: WhisperForConditionalGeneration
(EngineCore pid=44120) INFO 05-07 00:30:08 [core.py:109] Initializing a V1 LLM engine ...
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210) INFO ... Starting to load model openai/whisper-large-v3-turbo...
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210) INFO ... Model loading took 1.55 GiB memory and 4.7 seconds
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210) INFO ... Capturing CUDA graphs (decode-only, FULL_DECODE_ONLY): 0% 0/8
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210) ERROR ... torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 624.00 MiB. GPU 0 has a total capacity of 22.06 GiB of which 412.50 MiB is free.
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210) Traceback (most recent call last):
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210)   File ".../vllm/v1/worker/gpu_model_runner.py", line 6210, in capture_model
(EngineCore pid=44120) (RayWorkerProc pid=44210) (Worker_TP0 pid=44210)     self._capture_cudagraphs(...)
(EngineCore pid=44120) ERROR 05-07 00:30:43 [core.py:807] EngineCore failed to start.
(EngineCore pid=44120) Process EngineCore_DP0:
(EngineCore pid=44120) Traceback (most recent call last):
(EngineCore pid=44120)   File ".../vllm/v1/engine/core.py", line 800, in run_engine_core
(EngineCore pid=44120)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore pid=44120)   File ".../vllm/v1/engine/core.py", line 524, in __init__
(EngineCore pid=44120)     super().__init__(...)
(EngineCore pid=44120) torch.OutOfMemoryError: CUDA out of memory. ...
FAILED

____ test_models_distributed[5-200-half-ray-openai/whisper-large-v3-turbo] _____
...
E   RuntimeError: Engine core initialization failed. ... Failed core proc(s): {}

The Failed core proc(s): {} line stays the same.
The killed by SIGSEGV (-11) formatting applies to the outer test wrapper's RuntimeError("Test subprocess ... failed (exit code N)") and only appears if the test-wrapper process itself dies from a signal. I could remove it if you think it is unnecessary.

Member

@ZJY0516 ZJY0516 left a comment


LGTM

@ZJY0516 ZJY0516 added the ready ONLY add when PR is ready to merge/full CI is needed label May 7, 2026
…test

The decorator captured the child's stdout/stderr via
``subprocess.run(..., capture_output=True)`` and then dropped them whenever
the child wrote a Python traceback to ``tb_file`` — i.e. on every test
failure. Engine-core, NCCL and CUDA output (the actual root cause of
crashes like the whisper TP=2 distributed-test failures in #64813 / #64851
/ #64859) ended up nowhere visible.

Inherit stdout/stderr instead, mirroring ``fork_new_process_for_each_test``
so subprocess output flows into the test runner / CI log live. Pytest's
fd-level capture still suppresses the noise on passing runs. ``tb_file``
remains the structured channel for re-raising the Python traceback in the
parent. A small helper renders signal-induced exits as
``killed by SIGSEGV (-11)`` instead of the cryptic ``exit code -11``.

Signed-off-by: haosdent huang <haosdent@gmail.com>

Signed-off-by: haosdent <haosdent@gmail.com>
@haosdent haosdent force-pushed the ci/spawn-test-surface-subprocess-output branch from 83775d2 to 24ffdb5 May 8, 2026 03:55
@NickLucche NickLucche merged commit 160858c into vllm-project:main May 8, 2026
12 checks passed
libinta pushed a commit to libinta/vllm that referenced this pull request May 8, 2026
…test (vllm-project#41943)

Signed-off-by: haosdent <haosdent@gmail.com>
Signed-off-by: Libin Tang <libin.tang@intel.com>

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed


3 participants