[CI][Bugfix] Surface subprocess output in spawn_new_process_for_each_test#41943
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the spawn_new_process_for_each_test decorator in tests/utils.py to enable live output from test subprocesses by removing capture_output=True. It also introduces a _format_subprocess_exit utility to improve error reporting by identifying signals for non-zero exit codes and simplifies the traceback retrieval logic. I have no feedback to provide as there were no review comments.
|
@claude review |
There was a problem hiding this comment.
LGTM, straightforward CI diagnostics fix in test utilities — drops capture_output so child stdout/stderr stream live to the runner, while tb_file still propagates the Python traceback.
Extended reasoning...
Overview
The PR modifies a single function (spawn_new_process_for_each_test) in tests/utils.py. It removes capture_output=True from the subprocess.run call so the child process inherits the parent's stdout/stderr (matching fork_new_process_for_each_test's behavior), simplifies the traceback-read path to always read tb_file (with a fallback message if empty), and adds a small _format_subprocess_exit helper that renders signal exits like killed by SIGSEGV (-11).
Security risks
None. This is purely a test-runner diagnostics change; no production code, no auth/crypto/permissions, no user input handling.
Level of scrutiny
Low. The file is test infrastructure used only by the CI test harness. The behavioral change is narrow: failures that previously hid engine-core/NCCL/CUDA output will now surface that output in the CI log, which is the explicit motivation. The simplified open(tb_file) path is safe because tb_file is unconditionally created via NamedTemporaryFile(delete=False) earlier in the same try block, and the finally still removes it.
Other factors
- Small diff (~15 lines of meaningful change), single file, clearly tied to a real CI debuggability problem (referenced Buildkite failures).
- No bugs flagged by the bug hunting system.
- The new helper uses
signal.Signals(-returncode).namewith aValueErrorfallback, which correctly handles platform-specific signal numbers. - Worst case if this regresses is a less-informative CI log, not a correctness or security issue.
|
If I understand correctly, this pr tries to change |
No, what this PR change is to show Before this PRresult.stdout/result.stderr are swallowed since the current code captured it but didn't show After this PRCould observe the stdout/stderr in the log The |
…test The decorator captured the child's stdout/stderr via ``subprocess.run(..., capture_output=True)`` and then dropped them whenever the child wrote a Python traceback to ``tb_file`` — i.e. on every test failure. Engine-core, NCCL and CUDA output (the actual root cause of crashes like the whisper TP=2 distributed-test failures in #64813 / #64851 / #64859) ended up nowhere visible. Inherit stdout/stderr instead, mirroring ``fork_new_process_for_each_test`` so subprocess output flows into the test runner / CI log live. Pytest's fd-level capture still suppresses the noise on passing runs. ``tb_file`` remains the structured channel for re-raising the Python traceback in the parent. A small helper renders signal-induced exits as ``killed by SIGSEGV (-11)`` instead of the cryptic ``exit code -11``. Signed-off-by: haosdent huang <haosdent@gmail.com> Signed-off-by: haosdent <haosdent@gmail.com>
83775d2 to
24ffdb5
Compare
…test (vllm-project#41943) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: Libin Tang <libin.tang@intel.com>
Purpose
spawn_new_process_for_each_testcallssubprocess.run(..., capture_output=True)and then dropsresult.stderr/stdoutwhenever the child writes a Python traceback — i.e. on every failure. Engine-core, NCCL and CUDA output is hidden, which is where the actual cause of crashes lives (e.g. the whisper TP=2 failures in #64813 / #64851 / #64859 showFailed core proc(s): {}with zero engine-subprocess lines).Drop
capture_output=Trueso the child inherits the parent's stdout/stderr, mirroringfork_new_process_for_each_test.tb_filestill re-raises the Python traceback in the parent. A small helper renders signal exits askilled by SIGSEGV (-11).Related to #41423
Test Plan
Test Result