[Bugfix] Fix test_whisper distributed test process handling #42038
Conversation
---
Hi @dzhengAP, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

---
Code Review
This pull request removes the create_new_process_for_each_test utility import and its corresponding decorator from the test_models_distributed function within the Whisper model tests. I have no feedback to provide as no review comments were included in the request.
---
Oh yeah the failure is not fixed yet.
---
@ProExpertProg and @DarkLight1337
Investigated build #65117 — PR #42038 (the double-decorator fix) is correctly applied.
The Whisper test still fails, but for a different reason: Failed core proc(s): {} (an empty dict), which means the worker crashes before registering, likely during torch.compile/AOT cache initialization with enforce_eager=False. This is a pre-existing flaky infrastructure issue unrelated to the decorator changes. I will fix it in a follow-up PR (#42092), since it is a separate issue with a different root cause, either by setting enforce_eager=True in test_models_distributed or by marking the test as flaky with reruns; a rough sketch of both options follows below.
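A minimal sketch of the two options, assuming vLLM's LLM entry point and the pytest-rerunfailures plugin; the model ID and test body are illustrative, not the actual diff:

```python
import pytest
from vllm import LLM

# Option 2 (sketch): tolerate the flake by rerunning the test
# (the flaky marker comes from the pytest-rerunfailures plugin).
@pytest.mark.flaky(reruns=2)
def test_models_distributed():
    # Option 1 (sketch): run in eager mode so torch.compile/AOT cache
    # initialization never happens inside the distributed worker.
    llm = LLM(
        model="openai/whisper-large-v3",  # illustrative model ID
        tensor_parallel_size=2,
        enforce_eager=True,
    )
    # ... generate and compare outputs across ranks as the real test does ...
```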
---
Hi @dzhengAP @ProExpertProg @DarkLight1337, I noticed that in CI build #65117 the Whisper distributed test seems to start with leftover GPU memory already in use. The failure is:
So even before the test really runs, vLLM fails the startup memory check. I can think of two possible ways to handle this:
Which direction do reviewers think is better here? Thanks.
…y_utilization and enforce_eager
Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>

---
Hi @SoluMilken, this is a good point. The memory part was also covered when I fixed the torch.compile flakiness issue, and Gemini also suggested some other memory-related settings; please see #42092. I will also mention there that you caught the same issue. @ProExpertProg @DarkLight1337, would you check the fix? Thanks.

---
@dzhengAP @ProExpertProg @DarkLight1337 I'm not entirely sure, but should we just combine this PR and #42092? That way we can run the CI together and see if it actually passes. Thanks.
… test
Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
Force-pushed from 061c236 to 38d129e
---
Changes
Fixes failures in CI builds #64792 and #65117. Closes #42092 (combined here).

---
Hi @dzhengAP, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

---
Checking the new CI failures now. Instead of passing
Validation: Result: passed.
Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
---
Hi @dzhengAP, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

---
Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
---
Failure: the 2-GPU distributed Whisper tests were the only ones failing, and they failed during vLLM startup because gpu_memory_utilization=0.7 requested ~15.43 GiB while CI had only ~15.41 GiB free. Fix: lower this test's gpu_memory_utilization to 0.65 to give startup memory headroom; a back-of-the-envelope check is sketched below.
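A back-of-the-envelope check of those numbers (total device memory is inferred as requested/utilization, which is an assumption, not a logged value):

```python
# Numbers from the CI log; total device memory inferred from requested/utilization.
free_gib = 15.41               # free GPU memory reported at startup
total_gib = 15.43 / 0.70       # ~22.04 GiB inferred total

for util in (0.70, 0.65):
    requested = total_gib * util
    verdict = "OK" if requested <= free_gib else "fails startup check"
    print(f"gpu_memory_utilization={util:.2f} -> requests {requested:.2f} GiB ({verdict})")
# 0.70 requests ~15.43 GiB > 15.41 GiB free -> fails
# 0.65 requests ~14.33 GiB <= 15.41 GiB free -> ~1.1 GiB of headroom
```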
---
May I know why you say "enforce_eager=True — avoids torch.compile/AOT cache flakiness in the distributed correctness test"? The torch.compile cache is empty when the container launches.

---
Hi @jikunshang, what I meant was not that the torch.compile cache already exists when the container launches; usually it is empty, as you mentioned.
The concern is different: torch.compile/AOT can introduce extra nondeterminism or failure modes during distributed tests. The docs describe several such cases, especially when:
1. each rank compiles independently;
2. graph capture differs slightly across ranks;
3. first-run compilation happens during the correctness test;
4. compiled kernels / guards / dynamic shapes create rank-specific behavior;
5. the cache directory, permissions, or warmup timing differs in CI/container runs.
So enforce_eager=True is mainly there to make the distributed correctness test focus on NCCL synchronization correctness, not on whether TorchInductor/AOT compilation behaves consistently; a toy illustration of point 3 is sketched below.
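A toy, single-process illustration of point 3 (nothing vLLM-specific here): the first call through torch.compile pays a compilation cost that eager execution never does, and in a multi-rank test that cost lands in the middle of the correctness check.

```python
import time
import torch

def f(x):
    return torch.sin(x) * 2.0

x = torch.randn(1024)

t0 = time.perf_counter()
f(x)                        # eager: runs immediately
t_eager = time.perf_counter() - t0

compiled = torch.compile(f)
t0 = time.perf_counter()
compiled(x)                 # first call triggers TorchInductor compilation
t_first = time.perf_counter() - t0

print(f"eager first call:    {t_eager * 1e3:.2f} ms")
print(f"compiled first call: {t_first * 1e3:.2f} ms (includes compilation)")
```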
---
@DarkLight1337 @ProExpertProg all passed on the distributed test, can we merge?
Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
Force-pushed from cf102df to 5b7447a
---
@ProExpertProg Hi Luka, all passed with eager mode off. |
---
thanks for fixing! merged |
Follow-up to #41423
cc @ProExpertProg