[CI/Build] Temporary fix to LM Eval Small Models #28324
Conversation
Signed-off-by: zhewenli <[email protected]>
On a side note, since the root cause and the impact are still unknown, shall we temporarily disable Line 221 in da786e3?
Do I understand correctly that we add `VLLM_DISABLE_SHARED_EXPERTS_STREAM=1` for all the tests under LM Eval Small Models?
ok, does this mean the failure is specific to the L4 GPU? i think the dual stream still needs some CI coverage. can we reorganize these testing pipelines so we can separate tests that must run on H100/B200 vs. L4? @hl475 @rzabarazesh please consider this in the test refactoring!
thanks! not sure if feasible, but can we add something like
mgoin left a comment:
Can you please add it to the Blackwell lm eval job too?
I think we should add it at the pytest level. Otherwise, running tests in the CI/Buildkite environment will differ from running pytest locally.
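For illustration, a minimal sketch of what a pytest-level switch could look like, e.g. an autouse fixture in the suite's `conftest.py`. The fixture name and placement are assumptions, not the PR's actual change:

```python
# Hypothetical conftest.py fixture: force the workaround for every test in
# this suite, so local pytest runs behave the same as CI.
import pytest


@pytest.fixture(autouse=True)
def disable_shared_experts_stream(monkeypatch):
    # VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 turns off the shared-experts
    # side stream that the bisect pointed at (#26440). Setting it via
    # monkeypatch keeps it scoped to the test and restores it afterwards.
    monkeypatch.setenv("VLLM_DISABLE_SHARED_EXPERTS_STREAM", "1")
```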
I can confirm that adding `VLLM_DISABLE_SHARED_EXPERTS_STREAM=1` makes the test pass.
Signed-off-by: zhewenli <[email protected]>
@mgoin / @hl475 / @vadiklyutiy thanks for your suggestions! Updated the logic to take in env vars so both L4/Blackwell can use the same config without setting it separately in each test suite.
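For context, a sketch of the kind of env-var pass-through described here; the helper name and structure are illustrative assumptions, assuming the test forwards its environment to the server process rather than hardcoding the flag per suite:

```python
# Hypothetical sketch: the CI pipeline (L4 or Blackwell) exports
# VLLM_DISABLE_SHARED_EXPERTS_STREAM only where it is needed, and the test
# simply inherits and forwards the environment, so both jobs can share one
# test config.
import os
import subprocess


def launch_eval_server(model: str) -> subprocess.Popen:
    # Picks up the flag only if the pipeline set it; nothing is hardcoded
    # per test suite.
    env = os.environ.copy()
    return subprocess.Popen(["vllm", "serve", model], env=env)
```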
Signed-off-by: zhewenli <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Purpose
This PR provides a temporary fix to the broken CI job LM Eval Small Models, where the test is failing GSM8K evals on `Qwen1.5-MoE-W4A16-CT-tp1`.

After bisecting, we found this PR likely contributes to the failing tests: #26440. Disabling the feature with `VLLM_DISABLE_SHARED_EXPERTS_STREAM=1` makes the test pass consistently.

Note: this is only easy to reproduce on NVIDIA L4 (what CI is using); it is very difficult to reproduce on H100/MI300:

`pytest -s -v 'tests/evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness_param[Qwen1.5-MoE-W4A16-CT-tp1]'`

Some hypotheses on why this is failing: an issue around `hidden_states.clone()`? (See the sketch below.)
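To make the hypothesis concrete, here is a toy PyTorch sketch of the hazard a missing `hidden_states.clone()` could create when work is split across CUDA streams. This is my illustration of the suspected failure mode, not vLLM's actual kernel code:

```python
import torch

main = torch.cuda.current_stream()
side = torch.cuda.Stream()

hidden_states = torch.randn(8, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")

# Snapshot on the main stream: later in-place updates to hidden_states
# then cannot race with the side stream's read.
snapshot = hidden_states.clone()

side.wait_stream(main)          # side stream waits for the clone to finish
with torch.cuda.stream(side):
    shared_out = snapshot @ w   # shared-experts work reads the snapshot

# Without the clone, an in-place update like this could overwrite the
# buffer while the side stream is still reading it, producing wrong,
# nondeterministic activations, exactly the kind of eval-accuracy flake
# seen here.
hidden_states.add_(1.0)

main.wait_stream(side)          # rejoin before consuming shared_out
torch.cuda.synchronize()
```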
Test Plan

With `VLLM_DISABLE_SHARED_EXPERTS_STREAM=1`, tested 10 times (10/10 passing): https://gist.github.com/zhewenl/a23ead4262f500b764ecd56c8c4c213d

Without the env var (feature enabled), 3/3 runs fail: https://gist.github.com/zhewenl/01a70feea4de15f6258dbecd76415a22