[CI] Fail subprocess tests with root-cause error by njhill · Pull Request #23795 · vllm-project/vllm

njhill · 2025-08-28T04:32:45Z

QOL improvement

We use a create_new_process_for_each_test decorator on many tests to avoid cleanup issues. However, when such a test fails, the error pytest reports comes from the subprocess failure rather than from the actual test failure. This change propagates the test's actual exception so that pytest reports it as if the test had run inline.
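The idea described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the code in tests/utils.py: the names `run_in_forked_process` and `ChildTraceback` are hypothetical, the child's traceback is carried across as plain text, and the sketch assumes a POSIX system with Python 3.9+ (for `os.fork` and `os.waitstatus_to_exitcode`).

```python
import functools
import os
import pickle
import tempfile
import traceback


class ChildTraceback(Exception):
    """Hypothetical helper: carries the child's formatted traceback text."""


def run_in_forked_process(func):
    # Hypothetical decorator name; POSIX-only because it relies on os.fork().
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Temp file through which the child hands its exception to the parent.
        fd, exc_path = tempfile.mkstemp(suffix=".exc")
        os.close(fd)
        pid = os.fork()
        if pid == 0:
            # Child: run the test body and record any failure.
            exit_code = 0
            try:
                func(*args, **kwargs)
            except BaseException as exc:
                exit_code = 1
                try:
                    payload = pickle.dumps((exc, traceback.format_exc()))
                    with open(exc_path, "wb") as f:
                        f.write(payload)
                except Exception:
                    pass  # Unpicklable exception: parent falls back to the exit code.
            # Exit immediately, skipping cleanup handlers inherited from the parent.
            os._exit(exit_code)

        # Parent: wait for the child, then surface its exception if one was saved.
        try:
            _, status = os.waitpid(pid, 0)
            exit_code = os.waitstatus_to_exitcode(status)
            if exit_code == 0:
                return None
            with open(exc_path, "rb") as f:
                data = f.read()
            if data:
                exc, child_tb = pickle.loads(data)
                # Re-raise the original exception type, chaining the child's traceback text.
                raise exc from ChildTraceback(child_tb)
            raise AssertionError(f"child process exited with code {exit_code}")
        finally:
            # Only the parent removes the temp file; the child has already exited.
            os.unlink(exc_path)

    return wrapper
```

With a wrapper like this, a decorated test that raises `ValueError("boom")` in the child surfaces in the parent as a `ValueError`, so pytest would report the root cause rather than a generic non-zero exit status. If the exception cannot be pickled, the sketch falls back to an `AssertionError` that carries the exit code.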


gemini-code-assist

Code Review

This pull request is a great quality-of-life improvement for debugging tests that run in forked processes. By serializing and re-raising exceptions from the child process, it ensures that the root-cause error is not lost, which will significantly speed up debugging.

I've found one critical bug related to operator precedence that prevents the exception from being correctly re-raised, and one high-severity issue concerning a resource leak where temporary files are not being cleaned up. Addressing these issues will make this a solid and robust enhancement.
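The flagged line is not quoted in this conversation, so the following is only a generic illustration of how an operator-precedence slip can silently change what gets returned or re-raised in Python. It is not the code from tests/utils.py, and the function names are hypothetical. The assignment expression operator `:=` binds more loosely than `and`, so an unparenthesized walrus captures the whole boolean expression.

```python
def load_buggy(blob):
    # Parsed as: data := (blob and len(blob) > 4), so data is a bool, not the bytes.
    if data := blob and len(blob) > 4:
        return data
    return None


def load_fixed(blob):
    # Parentheses make the walrus capture the bytes before the length check.
    if (data := blob) and len(data) > 4:
        return data
    return None
```

Under this sketch, `load_buggy(b"hello world")` returns `True` while `load_fixed(b"hello world")` returns the original bytes. Code that expected an exception object or payload in `data` would receive a boolean instead.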

tests/utils.py

Signed-off-by: Nick Hill <nhill@redhat.com>

tests/utils.py

DarkLight1337 · 2025-08-29T05:04:34Z

Retrying async tests to see if the timeout is just flaky or caused by this PR.

njhill · 2025-08-29T05:21:15Z

> Retrying async tests to see if the timeout is just flaky or caused by this PR.

Thanks @DarkLight1337. Looks like it hung again at the same place after ~10 min. I'll cancel it so that it doesn't time out after 3 hrs again, and will investigate tomorrow.


mergify bot added the ci/build label Aug 28, 2025

gemini-code-assist bot reviewed Aug 28, 2025

tests/utils.py (outdated, resolved)

[CI] Fail subprocess tests with root-cause error

00f167a

Signed-off-by: Nick Hill <nhill@redhat.com>

njhill force-pushed the root-cause-failures branch from 5bc6669 to 00f167a Compare August 28, 2025 06:59

vllm-project deleted a comment from gemini-code-assist bot Aug 28, 2025

njhill added 3 commits August 28, 2025 07:43

only delete temp file in parent process

5557740

Signed-off-by: Nick Hill <nhill@redhat.com>

Merge remote-tracking branch 'origin/main' into root-cause-failures

7923e66

fix pre-commit

545294f

Signed-off-by: Nick Hill <nhill@redhat.com>

njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) Aug 28, 2025

njhill linked an issue Aug 28, 2025 that may be closed by this pull request

[RFC]: Refactor CI/CD #22992

Closed

1 task

njhill removed a link to an issue Aug 28, 2025

[RFC]: Refactor CI/CD #22992

Closed

1 task

njhill added 3 commits August 28, 2025 15:42

Merge remote-tracking branch 'origin/main' into root-cause-failures

cb95464

minor

26133f0

Signed-off-by: Nick Hill <nhill@redhat.com>

Merge remote-tracking branch 'origin/main' into root-cause-failures

1127de0

DarkLight1337 reviewed Aug 29, 2025

tests/utils.py (resolved)

njhill added 6 commits August 29, 2025 13:01

Merge remote-tracking branch 'refs/remotes/origin/main' into root-cau…

e5476c9

…se-failures

fix ray distributed executor destructor error

c8e1e32

Signed-off-by: Nick Hill <nhill@redhat.com>

Merge remote-tracking branch 'origin/main' into root-cause-failures

1d0e395

Merge remote-tracking branch 'origin/main' into root-cause-failures

57be627

Merge remote-tracking branch 'origin/main' into root-cause-failures

97833c2

# Conflicts: # tests/conftest.py

Merge remote-tracking branch 'origin/main' into root-cause-failures

2a37c8d

njhill requested review from robertgshaw2-redhat and simon-mo as code owners September 3, 2025 23:21

njhill force-pushed the root-cause-failures branch from 11214c2 to 1510ff0 Compare September 4, 2025 01:47

njhill added 3 commits September 5, 2025 14:47

Merge remote-tracking branch 'refs/remotes/origin/main' into root-cau…

4f5cde4

…se-failures

add timeout to hanging test

970465f

Signed-off-by: Nick Hill <nhill@redhat.com>

add env var for nccl debug

a5b79e2

Signed-off-by: Nick Hill <nhill@redhat.com>

njhill force-pushed the root-cause-failures branch from 1510ff0 to a5b79e2 Compare September 5, 2025 21:50

njhill added 6 commits September 8, 2025 19:08

Merge remote-tracking branch 'origin/main' into root-cause-failures

c12fce9

try some things

ef248cf

Signed-off-by: Nick Hill <nhill@redhat.com>

revert debug changes

d14cbac

Signed-off-by: Nick Hill <nhill@redhat.com>

Merge remote-tracking branch 'refs/remotes/origin/main' into root-cau…

605d205

…se-failures

Merge remote-tracking branch 'origin/main' into root-cause-failures

6255133

Merge remote-tracking branch 'origin/main' into root-cause-failures

587b0a2

simon-mo merged commit 4db4426 into vllm-project:main Sep 10, 2025
69 of 72 checks passed

njhill deleted the root-cause-failures branch September 10, 2025 21:43

faaany mentioned this pull request Sep 11, 2025

[XPU] add missing dependency tblib for XPU CI #24639

Merged

skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025

[CI] Fail subprocess tests with root-cause error (vllm-project#23795)

c1a565a

Signed-off-by: Nick Hill <nhill@redhat.com>

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025

[CI] Fail subprocess tests with root-cause error (vllm-project#23795)

39dd7db

Signed-off-by: Nick Hill <nhill@redhat.com>

njhill mentioned this pull request Sep 27, 2025

[RFC][Core] propagate the error message up to the frontend process #25722

Closed

markmc mentioned this pull request Nov 21, 2025

[Test] Fix pytest termination with @create_new_process_for_each_test("fork") #29130

Open

pi314ever mentioned this pull request Feb 23, 2026

Buildkite hardware ci xpu test vllm-project/vllm-omni#1340

Merged

5 tasks

simon-mo merged 22 commits into vllm-project:main from njhill:root-cause-failures


3 participants
