
Conversation

@zhewenl (Collaborator) commented on Nov 7, 2025:

Purpose

This PR provides a temporary fix for the broken LM Eval Small Models CI job, where the test fails the GSM8K evals on Qwen1.5-MoE-W4A16-CT-tp1:


```
[2025-11-05T06:16:56Z]             # Verify accuracy is within tolerance
[2025-11-05T06:16:56Z] >           assert measured_accuracy >= expected_accuracy - RTOL, (
[2025-11-05T06:16:56Z]                 f"Accuracy too low: {measured_accuracy:.3f} < "
[2025-11-05T06:16:56Z]                 f"{expected_accuracy:.3f} - {RTOL:.3f}"
[2025-11-05T06:16:56Z]             )
[2025-11-05T06:16:56Z] E           AssertionError: Accuracy too low: 0.000 < 0.450 - 0.080
[2025-11-05T06:16:56Z] E           assert 0.0 >= (0.45 - 0.08)
[2025-11-05T06:16:56Z] evals/gsm8k/test_gsm8k_correctness.py:86: AssertionError
```

After bisecting, we found that #26440 likely contributes to the failing test: disabling the feature with VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 makes the test pass consistently.

Note: this is only easy to reproduce on NVIDIA L4 (what CI uses); it is very difficult to reproduce on H100/MI300:

```
pytest -s -v 'tests/evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness_param[Qwen1.5-MoE-W4A16-CT-tp1]'
```

Some hypotheses on why this is failing (see the sketch after this list):

  • Quantization incompatibility? The test uses W4A16 quantization, so the hidden_states.clone() introduced by the previous PR might break tensor properties?
  • It might be breaking the memory layout in the Qwen2 MoE model's forward pass?
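
To make the hypothesis concrete, here is a minimal sketch of the dual-stream shared-experts pattern as I understand it from #26440 (the callables are placeholders, not vLLM's actual implementation):

```python
import torch

def forward_with_shared_expert_stream(hidden_states, routed_experts, shared_expert):
    # Hypothetical sketch: overlap the shared expert (side stream) with the
    # routed experts (current stream). Not vLLM's actual code.
    side_stream = torch.cuda.Stream()
    # The side stream must wait for the work that produced hidden_states.
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        # The clone gives the side stream its own copy of the activations; the
        # suspicion above is that the cloned tensor's properties, or a missed
        # sync around it, interact badly with the W4A16 quantized kernels.
        shared_out = shared_expert(hidden_states.clone())
    routed_out = routed_experts(hidden_states)
    # Join before combining, otherwise the add can race with the side stream.
    torch.cuda.current_stream().wait_stream(side_stream)
    return routed_out + shared_out
```

If any synchronization point in a pattern like this is wrong on a given GPU, the corruption would be timing-dependent, which could also explain why L4 reproduces it while H100/MI300 rarely do.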

Test Plan

With VLLM_DISABLE_SHARED_EXPERTS_STREAM=1, tested 10 times (10/10 passing): https://gist.github.com/zhewenl/a23ead4262f500b764ecd56c8c4c213d

Without disabling the env var, 3/3 runs fail: https://gist.github.com/zhewenl/01a70feea4de15f6258dbecd76415a22

@zhewenl (Collaborator, Author) commented on Nov 7, 2025:

On a side note, since the root cause and the impact are still unknown, shall we temporarily disable the shared-experts stream feature by default? It is currently enabled globally:

```
VLLM_DISABLE_SHARED_EXPERTS_STREAM: bool = False
```
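
For illustration, a sketch of how such a boolean flag is typically read from the environment (not the exact vllm/envs.py code):

```python
import os

# Sketch of a vLLM-style boolean env flag. Changing the fallback "0" to "1"
# here would disable the shared-experts stream globally by default, rather
# than only for the LM Eval Small Models job.
VLLM_DISABLE_SHARED_EXPERTS_STREAM: bool = (
    os.getenv("VLLM_DISABLE_SHARED_EXPERTS_STREAM", "0") == "1"
)
```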

zhewenl marked this pull request as ready for review on November 7, 2025 at 21:10.
@vadiklyutiy (Collaborator) commented:
Do I understand correctly that we add VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 for all tests?

@zhewenl (Collaborator, Author) commented on Nov 7, 2025:

> Do I understand correctly that we add VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 for all tests?

For all the tests under LM Eval Small Models.

@yeqcharlotte (Collaborator) commented:
OK, does this mean the failure is specific to the L4 GPU? I think the dual-stream path still needs some CI coverage.

Can we reorganize these testing pipelines so we can separate tests that must run on H100/B200 from those that run on L4? @hl475 @rzabarazesh please consider this in the test refactoring!

zhewenl requested review from houseroad and zou3519 on November 8, 2025 at 01:05.
@hl475 (Contributor) commented on Nov 8, 2025:

Thanks!

Not sure if it's feasible, but can we add something like env: VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 to https://github.com/vllm-project/vllm/blob/b158df28139d134c2a43680104418eaa0d58e91c/tests/evals/gsm8k/configs/Qwen1.5-MoE-W4A16-CT.yaml, and then propagate it to pytest, which can use monkeypatch for the environment variables? That way, the impact is limited to this test.
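
A rough sketch of what the pytest side of that could look like (the env: key and the apply_config_env helper are hypothetical, not the actual test code):

```python
import pytest

# Suppose the eval config YAML gained an optional block like:
#   env:
#     VLLM_DISABLE_SHARED_EXPERTS_STREAM: "1"
# The test could then apply it per test via monkeypatch, so the variable is
# restored afterwards and other tests are unaffected.
def apply_config_env(monkeypatch: pytest.MonkeyPatch, eval_config: dict) -> None:
    for name, value in eval_config.get("env", {}).items():
        monkeypatch.setenv(name, str(value))
```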

@mgoin (Member) left a comment:

Can you please add it to the Blackwell lm eval job too?

@vadiklyutiy (Collaborator) commented:
I think we should add it at the pytest level. Otherwise, running tests in the CI/Buildkite environment will differ from running pytest locally.

@vadiklyutiy (Collaborator) commented:
I can confirm that adding VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 fixes the issue (I did a lot of runs).

@zhewenl (Collaborator, Author) commented on Nov 9, 2025:

@mgoin / @hl475 / @vadiklyutiy thanks for your suggestions! I updated the logic to take env vars from the config, so both L4 and Blackwell can use the same config without setting it in different test suites.

mgoin added the `ready` label on Nov 9, 2025.
mgoin added the `ci-failure` label on Nov 9, 2025.
khluu enabled auto-merge (squash) on November 9, 2025 at 20:41.
khluu merged commit a65a934 into vllm-project:main on Nov 9, 2025; 21 checks passed.
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Nov 13, 2025

Labels

ci/build, ci-failure, ready

Projects

Status: Done
