[Bugfix] Fix test_whisper distributed test process handling #42038
Conversation
---
Hi @dzhengAP, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

---
Code Review
This pull request removes the create_new_process_for_each_test utility import and its corresponding decorator from the test_models_distributed function within the Whisper model tests. I have no feedback to provide as no review comments were included in the request.
---
Oh yeah the failure is not fixed yet.
---
@ProExpertProg and @DarkLight1337
Investigated build #65117 — PR #42038 (the double-decorator fix) is correctly applied.
The Whisper test still fails, but for a different reason: Failed core proc(s): {} (an empty dict), which means the worker crashes before registering, likely during torch.compile/AOT cache initialization with enforce_eager=False. This is a pre-existing flaky infrastructure issue unrelated to the decorator changes. I will fix it in a follow-up PR (#42092), since it is a separate issue with a different root cause, either by setting enforce_eager=True in test_models_distributed or by marking the test as flaky with reruns; a rough sketch of both options follows below.
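A minimal sketch of the two options, assuming vLLM's LLM entry point and the pytest-rerunfailures plugin; the model ID and test body are illustrative, not the actual diff:

```python
import pytest
from vllm import LLM

# Option 2 (sketch): tolerate the flake by rerunning the test
# (the flaky marker comes from the pytest-rerunfailures plugin).
@pytest.mark.flaky(reruns=2)
def test_models_distributed():
    # Option 1 (sketch): run in eager mode so torch.compile/AOT cache
    # initialization never happens inside the distributed worker.
    llm = LLM(
        model="openai/whisper-large-v3",  # illustrative model ID
        tensor_parallel_size=2,
        enforce_eager=True,
    )
    # ... generate and compare outputs across ranks as the real test does ...
```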
---
Hi @dzhengAP @ProExpertProg @DarkLight1337, I noticed that in CI build #65117 the Whisper distributed test seems to start with leftover GPU memory already in use. The failure is:
So even before the test really runs, vLLM fails the startup memory check. I can think of two possible ways to handle this:
Which direction do reviewers think is better here? Thanks.
…y_utilization and enforce_eager
Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>

---
Hi @SoluMilken, this is a good point. The memory part was also covered when I fixed the torch.compile flakiness issue, and Gemini also suggested some other memory-related settings; please see #42092. I will also mention there that you caught the same issue. @ProExpertProg @DarkLight1337, would you check the fix? Thanks.

---
@dzhengAP @ProExpertProg @DarkLight1337 I'm not entirely sure, but should we just combine this PR and #42092? That way we can run the CI together and see if it actually passes. Thanks.
… test
Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
Force-pushed from 061c236 to 38d129e
---
Changes
Fixes failures in CI builds #64792 and #65117. Closes #42092 (combined here).

---
Hi @dzhengAP, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

---
Checking the new CI failures now. Instead of passing
Validation: Result: passed.
Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
---
Hi @dzhengAP, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

---
Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
---
Failure: the 2-GPU distributed Whisper tests were the only ones failing, and they failed during vLLM startup because gpu_memory_utilization=0.7 requested ~15.43 GiB while CI had only ~15.41 GiB free. Fix: lower this test's gpu_memory_utilization to 0.65 to give startup memory headroom; a back-of-the-envelope check is sketched below.
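A back-of-the-envelope check of those numbers (total device memory is inferred as requested/utilization, which is an assumption, not a logged value):

```python
# Numbers from the CI log; total device memory inferred from requested/utilization.
free_gib = 15.41               # free GPU memory reported at startup
total_gib = 15.43 / 0.70       # ~22.04 GiB inferred total

for util in (0.70, 0.65):
    requested = total_gib * util
    verdict = "OK" if requested <= free_gib else "fails startup check"
    print(f"gpu_memory_utilization={util:.2f} -> requests {requested:.2f} GiB ({verdict})")
# 0.70 requests ~15.43 GiB > 15.41 GiB free -> fails
# 0.65 requests ~14.33 GiB <= 15.41 GiB free -> ~1.1 GiB of headroom
```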
---
May I know why you say "enforce_eager=True — avoids torch.compile/AOT cache flakiness in the distributed correctness test"? The torch.compile cache is empty when the container launches.

---
Hi @jikunshang, what I meant was not that the torch.compile cache already exists when the container launches; usually it is empty, as you mentioned.
The concern is different: torch.compile/AOT can introduce extra nondeterminism or failure modes during distributed tests. The docs describe several such cases, especially when:
1. each rank compiles independently;
2. graph capture differs slightly across ranks;
3. first-run compilation happens during the correctness test;
4. compiled kernels / guards / dynamic shapes create rank-specific behavior;
5. the cache directory, permissions, or warmup timing differs in CI/container runs.
So enforce_eager=True is mainly there to make the distributed correctness test focus on NCCL synchronization correctness, not on whether TorchInductor/AOT compilation behaves consistently; a toy illustration of point 3 is sketched below.
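A toy, single-process illustration of point 3 (nothing vLLM-specific here): the first call through torch.compile pays a compilation cost that eager execution never does, and in a multi-rank test that cost lands in the middle of the correctness check.

```python
import time
import torch

def f(x):
    return torch.sin(x) * 2.0

x = torch.randn(1024)

t0 = time.perf_counter()
f(x)                        # eager: runs immediately
t_eager = time.perf_counter() - t0

compiled = torch.compile(f)
t0 = time.perf_counter()
compiled(x)                 # first call triggers TorchInductor compilation
t_first = time.perf_counter() - t0

print(f"eager first call:    {t_eager * 1e3:.2f} ms")
print(f"compiled first call: {t_first * 1e3:.2f} ms (includes compilation)")
```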
---
@DarkLight1337 @ProExpertProg all passed on the distributed test, can we merge?
Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
Force-pushed from cf102df to 5b7447a
---
@ProExpertProg Hi Luka, all passed with eager mode off. |
---
thanks for fixing! merged |
Follow-up to #41423
cc @ProExpertProg