
Conversation

@zhewenl (Collaborator) commented on Nov 7, 2025:

Purpose

This PR provides a temporary fix for the broken LM Eval Small Models CI job, where the test fails the GSM8K evals on Qwen1.5-MoE-W4A16-CT-tp1:


```
[2025-11-05T06:16:56Z]             # Verify accuracy is within tolerance
[2025-11-05T06:16:56Z] >           assert measured_accuracy >= expected_accuracy - RTOL, (
[2025-11-05T06:16:56Z]                 f"Accuracy too low: {measured_accuracy:.3f} < "
[2025-11-05T06:16:56Z]                 f"{expected_accuracy:.3f} - {RTOL:.3f}"
[2025-11-05T06:16:56Z]             )
[2025-11-05T06:16:56Z] E           AssertionError: Accuracy too low: 0.000 < 0.450 - 0.080
[2025-11-05T06:16:56Z] E           assert 0.0 >= (0.45 - 0.08)
[2025-11-05T06:16:56Z] evals/gsm8k/test_gsm8k_correctness.py:86: AssertionError
```

After bisecting, we found that #26440 likely contributes to the failing test: disabling the feature with VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 makes the test pass consistently.

Note: this is only easy to reproduce on NVIDIA L4 (what CI uses); it is very difficult to reproduce on H100/MI300:

```
pytest -s -v 'tests/evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness_param[Qwen1.5-MoE-W4A16-CT-tp1]'
```

Some hypotheses on why this is failing (see the sketch after this list):

  • Quantization incompatibility? The test uses W4A16 quantization, so the hidden_states.clone() introduced by the previous PR might break tensor properties?
  • It might be breaking the memory layout in the Qwen2 MoE model's forward pass?
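
To make the hypothesis concrete, here is a minimal sketch of the dual-stream shared-experts pattern as I understand it from #26440 (the callables are placeholders, not vLLM's actual implementation):

```python
import torch

def forward_with_shared_expert_stream(hidden_states, routed_experts, shared_expert):
    # Hypothetical sketch: overlap the shared expert (side stream) with the
    # routed experts (current stream). Not vLLM's actual code.
    side_stream = torch.cuda.Stream()
    # The side stream must wait for the work that produced hidden_states.
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        # The clone gives the side stream its own copy of the activations; the
        # suspicion above is that the cloned tensor's properties, or a missed
        # sync around it, interact badly with the W4A16 quantized kernels.
        shared_out = shared_expert(hidden_states.clone())
    routed_out = routed_experts(hidden_states)
    # Join before combining, otherwise the add can race with the side stream.
    torch.cuda.current_stream().wait_stream(side_stream)
    return routed_out + shared_out
```

If any synchronization point in a pattern like this is wrong on a given GPU, the corruption would be timing-dependent, which could also explain why L4 reproduces it while H100/MI300 rarely do.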

Test Plan

With VLLM_DISABLE_SHARED_EXPERTS_STREAM=1, tested 10 times (10/10 passing): https://gist.github.com/zhewenl/a23ead4262f500b764ecd56c8c4c213d

Without disabling the env var, 3/3 runs fail: https://gist.github.com/zhewenl/01a70feea4de15f6258dbecd76415a22

@zhewenl (Collaborator, Author) commented on Nov 7, 2025:

On a side note, since the root cause and the impact are still unknown, shall we temporarily disable the shared-experts stream feature by default? It is currently enabled globally:

```
VLLM_DISABLE_SHARED_EXPERTS_STREAM: bool = False
```
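
For illustration, a sketch of how such a boolean flag is typically read from the environment (not the exact vllm/envs.py code):

```python
import os

# Sketch of a vLLM-style boolean env flag. Changing the fallback "0" to "1"
# here would disable the shared-experts stream globally by default, rather
# than only for the LM Eval Small Models job.
VLLM_DISABLE_SHARED_EXPERTS_STREAM: bool = (
    os.getenv("VLLM_DISABLE_SHARED_EXPERTS_STREAM", "0") == "1"
)
```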

zhewenl marked this pull request as ready for review on November 7, 2025 at 21:10.
@vadiklyutiy (Collaborator) commented:
Do I understand correctly that we add VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 for all tests?

@zhewenl (Collaborator, Author) commented on Nov 7, 2025:

> Do I understand correctly that we add VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 for all tests?

For all the tests under LM Eval Small Models.

@yeqcharlotte (Collaborator) commented:
OK, does this mean the failure is specific to the L4 GPU? I think the dual-stream path still needs some CI coverage.

Can we reorganize these testing pipelines so we can separate tests that must run on H100/B200 from those that run on L4? @hl475 @rzabarazesh please consider this in the test refactoring!

zhewenl requested review from houseroad and zou3519 on November 8, 2025 at 01:05.
@hl475 (Contributor) commented on Nov 8, 2025:

Thanks!

Not sure if it's feasible, but can we add something like env: VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 to https://github.com/vllm-project/vllm/blob/b158df28139d134c2a43680104418eaa0d58e91c/tests/evals/gsm8k/configs/Qwen1.5-MoE-W4A16-CT.yaml, and then propagate it to pytest, which can use monkeypatch for the environment variables? That way, the impact is limited to this test.
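
A rough sketch of what the pytest side of that could look like (the env: key and the apply_config_env helper are hypothetical, not the actual test code):

```python
import pytest

# Suppose the eval config YAML gained an optional block like:
#   env:
#     VLLM_DISABLE_SHARED_EXPERTS_STREAM: "1"
# The test could then apply it per test via monkeypatch, so the variable is
# restored afterwards and other tests are unaffected.
def apply_config_env(monkeypatch: pytest.MonkeyPatch, eval_config: dict) -> None:
    for name, value in eval_config.get("env", {}).items():
        monkeypatch.setenv(name, str(value))
```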

@mgoin (Member) left a comment:

Can you please add it to the Blackwell lm eval job too?

@vadiklyutiy (Collaborator) commented:
I think we should add it at the pytest level. Otherwise, running tests in the CI/Buildkite environment will differ from running pytest locally.

@vadiklyutiy (Collaborator) commented:
I can confirm that adding VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 fixes the issue (I did a lot of runs).

@zhewenl (Collaborator, Author) commented on Nov 9, 2025:

@mgoin / @hl475 / @vadiklyutiy thanks for your suggestions! I updated the logic to take env vars from the config, so both L4 and Blackwell can use the same config without setting it in different test suites.

mgoin added the `ready` label on Nov 9, 2025.
mgoin added the `ci-failure` label on Nov 9, 2025.
khluu enabled auto-merge (squash) on November 9, 2025 at 20:41.
khluu merged commit a65a934 into vllm-project:main on Nov 9, 2025; 21 checks passed.
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Nov 13, 2025

Labels

ci/build, ci-failure, ready

Projects

Status: Done
