[CI] Lower GSM8K baselines for B200 nightly after eval unification#22136
Kangyan-Zhou wants to merge 2 commits into `sgl-project:main`
Conversation
Code Review
This pull request lowers the GSM8K accuracy baselines for FlashInfer TRT-LLM MoE and DeepSeek-R1 FP4 performance tests to reflect changes from evaluation unification and data leakage fixes. Feedback suggests verifying if other FP8-based models require similar baseline adjustments and recommends explicitly defining evaluation parameters like the number of examples and threads for consistency across tests.
```diff
  metrics = run_eval(args)
  print(f"{metrics=}")
- self.assertGreater(metrics["score"], 0.93)
+ self.assertGreater(metrics["score"], 0.89)
```
The PR description mentions lowering the FP8 MoE backend baseline. Note that `FlashinferTrtllmGenMoeBackendMXFP8Base` (line 157) also uses an FP8-based quantization (mxfp8) and currently keeps a GSM8K baseline of 0.93. If that model was also affected by the evaluation unification and the data-leakage fix, its baseline should likely be updated as well to avoid potential CI failures.
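One way to avoid a sibling baseline being missed is to keep the FP8-family GSM8K baselines in a single table that every backend test reads from. This is purely a hypothetical sketch; the class names and the table are illustrative, not the actual identifiers in `test_flashinfer_trtllm_gen_moe_backend.py`:

```python
# Hypothetical shared baseline table (illustrative names, not the repo's
# actual test identifiers). Adjusting one entry next to the other makes it
# harder to forget a sibling FP8 variant after an eval change.
GSM8K_BASELINES = {
    "FlashinferTrtllmGenMoeBackendFP8": 0.89,    # was 0.93 pre-unification
    "FlashinferTrtllmGenMoeBackendMXFP8": 0.89,  # was 0.93 pre-unification
}


def check_score(backend_name: str, score: float) -> bool:
    """Return True if the observed GSM8K score clears the backend's baseline."""
    return score > GSM8K_BASELINES[backend_name]
```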
```diff
-accuracy_params=AccuracyTestParams(
-    dataset="gsm8k", baseline_accuracy=0.935
-),
+accuracy_params=AccuracyTestParams(dataset="gsm8k", baseline_accuracy=0.83),
```
For consistency with other GSM8K tests in the repository (e.g., test_flashinfer_trtllm_gen_moe_backend.py) and to align with the 200-example evaluation range mentioned in the PR description, it is recommended to explicitly set num_examples=200. Additionally, consider setting num_threads=128 to match the configuration used in other backend tests.
Suggested change:

```diff
-accuracy_params=AccuracyTestParams(dataset="gsm8k", baseline_accuracy=0.83),
+accuracy_params=AccuracyTestParams(dataset="gsm8k", baseline_accuracy=0.83, num_examples=200, num_threads=128),
```
PR sgl-project#21667 unified the GSM8K eval path from `few_shot_gsm8k` to `GSM8KEval`. Two issues caused test failures:

1. The old eval included few-shot examples in the evaluation set (`lines[0:200]`), inflating scores by ~4% (8/200 trivially correct). The new eval properly excludes them (`lines[5:205]`).
2. For DeepSeek-R1-FP4, `num_examples` was not set in AccuracyTestParams. The old path defaulted to 200 questions, but the new GSM8KEval evaluates ALL 1314 remaining questions when `num_examples=None`, making the test significantly harder and ~7x slower.

Fixes:
- FP8 MOE backend: lower threshold 0.93 → 0.89 (observed: 0.905-0.925)
- FP4 DeepSeek-R1: add `num_examples=200`, lower threshold 0.935 → 0.89

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
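The ~4% inflation in the commit message is simple slice arithmetic: the old eval drew its 8 few-shot exemplars from the same `lines[0:200]` slice it scored, so 8 of the 200 graded questions were trivially correct. A minimal sketch of the overlap (indices stand in for dataset lines; the counts come from the commit message):

```python
# Old eval: 8-shot exemplars drawn from the same slice that gets scored.
old_eval_set = set(range(0, 200))   # lines[0:200]
old_exemplars = set(range(0, 8))    # first 8 lines, reused as few-shot examples
leaked = old_exemplars & old_eval_set
inflation = len(leaked) / len(old_eval_set)  # 8 / 200 -> 0.04, i.e. ~4%

# New eval: 5-shot exemplars excluded from the scored slice.
new_eval_set = set(range(5, 205))   # lines[5:205]
new_exemplars = set(range(0, 5))    # lines[0:5]
overlap_after_fix = new_exemplars & new_eval_set  # empty: no leakage
```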
Force-pushed from `bd24509` to `a1c1d44`.
…=200

Bisecting on B200 revealed the real root cause for the DeepSeek-R1-FP4 accuracy drop: the eval unification (sgl-project#21667) removed the `_run_few_shot_eval` path, which used the Completion API, and replaced it with `_run_simple_eval`, which defaults to the Chat API. DeepSeek-R1 scores much lower on GSM8K through the Chat API (0.86) vs the Completion API (0.975).

Additionally, `num_examples` was not set, causing GSM8KEval to evaluate all 1314 questions instead of the intended 200.

Bisect results (nvidia/DeepSeek-R1-0528-NVFP4-v2, TP4, B200):
- Old eval (8-shot, 200q, completion API): 0.975
- New eval (5-shot, 200q, completion API): 0.985
- New eval (5-shot, 1314q, completion API): 0.954

Fixes:
- Add `api` field to AccuracyTestParams and propagate through `_run_simple_eval`
- Set `api="completion"` and `num_examples=200` in dpsk R1 FP4 test
- Restore original `baseline_accuracy=0.935`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
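The Chat-vs-Completion gap comes down to how the prompt reaches the model: a completion request sends one raw string, while a chat request is wrapped in the model's chat template server-side. A rough sketch of the two OpenAI-style request shapes (the function and field values are illustrative, not the harness's actual code):

```python
def build_request(question: str, few_shot: str, api: str = "chat") -> dict:
    """Build an OpenAI-style request body for either API mode (sketch)."""
    prompt = few_shot + "\nQuestion: " + question + "\nAnswer:"
    if api == "completion":
        # Raw text completion: few-shot examples and question go out as one
        # string; no chat template is applied by the server.
        return {"prompt": prompt, "max_tokens": 512}
    # Chat mode: the server applies the model's chat template, which for
    # DeepSeek-R1 shifted GSM8K scores markedly in this PR's bisect.
    return {"messages": [{"role": "user", "content": prompt}], "max_tokens": 512}
```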
Summary
- Set `api="completion"` and `num_examples=200`, restore baseline 0.935
- Add `api` field to `AccuracyTestParams` for controlling eval API mode

Root Cause (Bisected on B200)
PR #21667 unified the GSM8K eval path, introducing two separate issues:
Issue 1: FP8 MOE Backend (Qwen3-Next-80B-A3B-Instruct-FP8)
The new `GSM8KEval` properly excludes few-shot examples from the evaluation set (`lines[5:205]` instead of `lines[0:200]`). The old eval inflated scores by ~4% (8/200 trivially correct examples). FP8 scores dropped from ~0.93-0.95 to ~0.905-0.925.

Fix: Lower threshold from 0.93 to 0.89.
Issue 2: FP4 DeepSeek-R1 (nvidia/DeepSeek-R1-0528-NVFP4-v2)
Two compounding problems:
- API change: the old `_run_few_shot_eval` used the Completion API. The new `_run_simple_eval` defaults to the Chat API. DeepSeek-R1 scores dramatically lower through the Chat API (0.86 vs 0.975 on the Completion API).
- Missing `num_examples`: without `num_examples=200`, `GSM8KEval` evaluates all 1314 questions instead of the intended 200.

Bisect results (nvidia/DeepSeek-R1-0528-NVFP4-v2, TP4, B200, Completion API):
- Old eval (8-shot, 200q): 0.975
- New eval (5-shot, 200q): 0.985
- New eval (5-shot, 1314q): 0.954
The model has no accuracy regression — it's purely an eval methodology issue.
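A minimal sketch of what the new `api` field on `AccuracyTestParams` might look like, with `num_examples=None` meaning "evaluate the full set" as described above. This is hypothetical; the actual dataclass in sglang's test harness may have different fields and defaults:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccuracyTestParams:
    dataset: str
    baseline_accuracy: float
    num_examples: Optional[int] = None  # None -> evaluate the full set
    api: str = "chat"                   # "chat" or "completion"


def select_num_examples(params: AccuracyTestParams, dataset_size: int) -> int:
    """How many questions the eval will actually run (sketch of the None case)."""
    return dataset_size if params.num_examples is None else params.num_examples


# The fixed DeepSeek-R1 FP4 configuration from this PR, expressed against
# the hypothetical dataclass above:
r1_fp4_params = AccuracyTestParams(
    dataset="gsm8k",
    baseline_accuracy=0.935,
    num_examples=200,
    api="completion",
)
```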
Fix: Add `api="completion"` + `num_examples=200` to restore the original behavior, and keep the baseline at 0.935.

Test plan
- `nightly-test-perf-4-gpu-b200` passes on next nightly run

🤖 Generated with Claude Code