[Test] Add nightly MoE eval tests #39956

Open
bnellnm wants to merge 11 commits into vllm-project:main from neuralmagic:nightly-eval-tests

@bnellnm
Collaborator

@bnellnm bnellnm commented Apr 16, 2026

Purpose

Add eval tests for important models that can be run nightly. Models listed in the -small.txt file run on at most 2 GPUs, while models listed in the -large.txt file need more than 2.
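
A minimal sketch of the small/large split described above, assuming the only criterion is the number of GPUs a config needs. The helper names and the `gpus_needed` mapping are hypothetical and are not part of this PR.

```python
# Hypothetical helpers illustrating the small/large split; not code from this PR.
SMALL_MAX_GPUS = 2  # per the description: the small suite runs with <= 2 GPUs


def suite_for(num_gpus: int) -> str:
    """Return which nightly list a config belongs in, given the GPUs it needs."""
    if num_gpus < 1:
        raise ValueError("a config needs at least one GPU")
    return "small" if num_gpus <= SMALL_MAX_GPUS else "large"


def split_by_suite(gpus_needed: dict[str, int]) -> dict[str, list[str]]:
    """Group config names into 'small' and 'large', sorted for stable output."""
    suites: dict[str, list[str]] = {"small": [], "large": []}
    for config, num_gpus in sorted(gpus_needed.items()):
        suites[suite_for(num_gpus)].append(config)
    return suites
```

A check like this could flag a config that lands in the wrong list, provided the GPU count per config is known.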

Test Plan

Ran it locally

Test Result

| Model | Baseline |
|---|---|
| arcee-ai/Trinity-Mini | 0.8408 |
| deepseek-ai/DeepSeek-R1 | 0.9492 |
| google/gemma-4-26B-A4B-it | 0.3017 |
| zai-org/GLM-4.7-Flash | 0.8241 |
| openai/gpt-oss-20b | 0.3154 |
| ibm-granite/granite-4.0-h-small | 0.8400 |
| ai21labs/AI21-Jamba2-Mini | 0.7665 |
| LiquidAI/LFM2.5-350M | 0.2092 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | TIMEOUT |
| MiniMaxAI/MiniMax-M2.7 | 0.9249 |
| mistralai/Mixtral-8x7B-v0.1 | 0.5512 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | 0.9295 |
| allenai/OLMoE-1B-7B-0125-Instruct | 0.6770 |
| microsoft/Phi-tiny-MoE-instruct | 0.7020 |
| sarvamai/sarvam-30b | 0.6588 |
| stepfun-ai/Step-3.5-Flash | FAIL |

Issues will be created for the failing tests.
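
For illustration, here is one way a nightly run might compare a measured accuracy against the baselines listed above. This is a sketch under stated assumptions: the tolerance value and the `check_accuracy` helper are hypothetical, and the PR does not say how the comparison is actually done.

```python
# Baselines copied from the results above (subset); TIMEOUT/FAIL rows have no number.
BASELINES = {
    "arcee-ai/Trinity-Mini": 0.8408,
    "deepseek-ai/DeepSeek-R1": 0.9492,
    "mistralai/Mixtral-8x7B-v0.1": 0.5512,
}

# Assumed tolerance; the PR does not state one.
TOLERANCE = 0.05


def check_accuracy(model: str, measured: float, tolerance: float = TOLERANCE) -> bool:
    """Return True if `measured` is no more than `tolerance` below the baseline."""
    baseline = BASELINES[model]
    return measured >= baseline - tolerance
```

A one-sided check like this only flags regressions; an unexpectedly large improvement would pass silently, which may or may not be desirable for a nightly job.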


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

bnellnm added 2 commits April 15, 2026 22:58
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds several new model configuration files for GSM8K evaluations and updates the nightly test lists. Several referenced configuration files are missing from the PR, including DeepSeek-R1-TP.yaml and those in the moe-refactor directory, which will cause CI failures. Additionally, the 120B parameter Nemotron model is incorrectly categorized in the small evaluation suite and should be moved to the large suite.

Comment thread tests/evals/gsm8k/configs/models-nightly-large.txt
Comment thread tests/evals/gsm8k/configs/models-nightly-small.txt Outdated
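
The missing-file problem raised in the review could be caught before CI with a small check along these lines. This is a sketch, assuming each list file holds one config filename per line, resolved relative to the list file's own directory; `missing_configs` is a hypothetical helper, not something in this PR.

```python
from pathlib import Path


def missing_configs(list_file: Path) -> list[str]:
    """Return config names listed in `list_file` that do not exist next to it.

    Assumes one config filename per line, relative to the list file's
    directory; blank lines and lines starting with '#' are skipped.
    """
    config_dir = list_file.parent
    missing = []
    for line in list_file.read_text().splitlines():
        name = line.strip()
        if not name or name.startswith("#"):
            continue
        if not (config_dir / name).is_file():
            missing.append(name)
    return missing
```

Running it over both nightly list files and failing on a non-empty result would surface a dangling reference before the nightly job does.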
bnellnm added 2 commits April 16, 2026 02:32
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
@bnellnm bnellnm marked this pull request as ready for review April 16, 2026 13:03
Signed-off-by: Bill Nell <bnell@redhat.com>
@bnellnm bnellnm changed the title from "Nightly eval tests" to "[Test] Add nightly eval tests" Apr 16, 2026
bnellnm and others added 2 commits April 17, 2026 18:20
Signed-off-by: Bill Nell <bnell@redhat.com>
@bnellnm bnellnm changed the title from "[Test] Add nightly eval tests" to "[Test] Add nightly MoE eval tests" Apr 20, 2026
@robertgshaw2-redhat
Collaborator

which job runs these?

@bnellnm
Collaborator Author

bnellnm commented Apr 21, 2026

> which job runs these?

Good question. Is there a spot to add nightly tests?

@vadiklyutiy
Collaborator

Just wondering on what GPU arch we are going to run it?

@bnellnm
Collaborator Author

bnellnm commented Apr 21, 2026

> Just wondering on what GPU arch we are going to run it?

I was planning on H100 but we could run on other arches too.

bnellnm added 3 commits April 21, 2026 17:17
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>