
[CI] Lower GSM8K baselines for B200 nightly after eval unification #22136

Draft
Kangyan-Zhou wants to merge 2 commits into sgl-project:main from Kangyan-Zhou:fix_nightly_b200_accuracy

Conversation

Collaborator

@Kangyan-Zhou Kangyan-Zhou commented Apr 5, 2026

Summary

  • Fix FP8 MOE backend GSM8K threshold: 0.93 → 0.89 (observed scores: 0.905-0.925)
  • Fix FP4 DeepSeek-R1 GSM8K: add api="completion", num_examples=200, restore baseline 0.935
  • Add api field to AccuracyTestParams for controlling eval API mode
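The third summary item can be sketched as a small params object. This is a minimal illustration, not the actual SGLang code; everything except the field names mentioned in this PR (dataset, baseline_accuracy, num_examples, num_threads, api) is an assumption:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the extended params object; defaults are assumptions.
@dataclass
class AccuracyTestParams:
    dataset: str
    baseline_accuracy: float
    num_examples: Optional[int] = None  # None => evaluate all remaining questions
    num_threads: int = 128
    api: str = "chat"  # new field: "chat" (default) or "completion"

# The DeepSeek-R1 FP4 fix from the summary, expressed with this sketch:
params = AccuracyTestParams(
    dataset="gsm8k",
    baseline_accuracy=0.935,
    num_examples=200,
    api="completion",
)
print(params.api, params.num_examples)
```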

Root Cause (Bisected on B200)

PR #21667 unified the GSM8K eval path, introducing two separate issues:

Issue 1: FP8 MOE Backend (Qwen3-Next-80B-A3B-Instruct-FP8)

The new GSM8KEval properly excludes few-shot examples from the evaluation set (lines[5:205] instead of lines[0:200]). The old eval inflated scores by ~4% (8/200 trivially correct examples). FP8 scores dropped from ~0.93-0.95 to ~0.905-0.925.
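The leakage described above can be reproduced with plain list slicing. This is an illustration of the indexing, not the actual GSM8KEval code; the 1319-question test-set size is the standard GSM8K test split:

```python
lines = [f"q{i}" for i in range(1319)]  # GSM8K test-set questions

old_few_shot = lines[:8]      # old path: 8-shot prompt
old_eval_set = lines[0:200]   # ...scored over the same leading lines
new_few_shot = lines[:5]      # new path: 5-shot prompt
new_eval_set = lines[5:205]   # few-shot lines excluded from scoring

print(len(set(old_few_shot) & set(old_eval_set)))  # 8 leaked questions
print(len(set(new_few_shot) & set(new_eval_set)))  # 0
print(8 / 200)                                     # 0.04, i.e. ~4% inflation
```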

Fix: Lower threshold from 0.93 to 0.89.

Issue 2: FP4 DeepSeek-R1 (nvidia/DeepSeek-R1-0528-NVFP4-v2)

Two compounding problems:

  1. Wrong API mode: The old _run_few_shot_eval used the Completion API. The new _run_simple_eval defaults to the Chat API. DeepSeek-R1 scores dramatically lower through Chat API (0.86 vs 0.975 on Completion API).
  2. Missing num_examples: Without num_examples=200, GSM8KEval evaluates all 1314 questions instead of the intended 200.
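The arithmetic behind the 1314 figure is simple to check. This is a sketch of the assumed selection logic, not the actual GSM8KEval implementation:

```python
# GSM8K has 1319 test questions; the new eval consumes 5 as few-shot examples.
TOTAL = 1319
FEW_SHOT = 5

def num_evaluated(num_examples=None):
    """Assumed behavior: None means 'all remaining questions'."""
    remaining = TOTAL - FEW_SHOT
    return remaining if num_examples is None else min(num_examples, remaining)

print(num_evaluated())     # 1314: every remaining question
print(num_evaluated(200))  # 200: the intended subset
```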

Bisect results (nvidia/DeepSeek-R1-0528-NVFP4-v2, TP4, B200, Completion API):

| Eval Variant | Questions | Score |
|---|---|---|
| Old eval (8-shot, lines[0:200], with leakage) | 200 | 0.975 |
| New eval (5-shot, lines[5:205], no leakage) | 200 | 0.985 |
| New eval (5-shot, all remaining) | 1314 | 0.954 |

The model has no accuracy regression — it's purely an eval methodology issue.

Fix: Add api="completion" + num_examples=200 to restore original behavior, keep baseline at 0.935.

Test plan

  • Verify nightly-test-perf-4-gpu-b200 passes on next nightly run

🤖 Generated with Claude Code

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request lowers the GSM8K accuracy baselines for FlashInfer TRT-LLM MoE and DeepSeek-R1 FP4 performance tests to reflect changes from evaluation unification and data leakage fixes. Feedback suggests verifying if other FP8-based models require similar baseline adjustments and recommends explicitly defining evaluation parameters like the number of examples and threads for consistency across tests.

```diff
 metrics = run_eval(args)
 print(f"{metrics=}")
-self.assertGreater(metrics["score"], 0.93)
+self.assertGreater(metrics["score"], 0.89)
```
Contributor


medium

The PR description mentions lowering the FP8 MOE backend baseline. Note that FlashinferTrtllmGenMoeBackendMXFP8Base (line 157) also uses an FP8-based quantization (mxfp8) and currently maintains a GSM8K baseline of 0.93. If this model was also affected by the evaluation unification and data leakage fix, its baseline should likely be updated to avoid potential CI failures.

```python
accuracy_params=AccuracyTestParams(
    dataset="gsm8k", baseline_accuracy=0.935
),
accuracy_params=AccuracyTestParams(dataset="gsm8k", baseline_accuracy=0.83),
```
Contributor


medium

For consistency with other GSM8K tests in the repository (e.g., test_flashinfer_trtllm_gen_moe_backend.py) and to align with the 200-example evaluation range mentioned in the PR description, it is recommended to explicitly set num_examples=200. Additionally, consider setting num_threads=128 to match the configuration used in other backend tests.

Suggested change:

```diff
-accuracy_params=AccuracyTestParams(dataset="gsm8k", baseline_accuracy=0.83),
+accuracy_params=AccuracyTestParams(dataset="gsm8k", baseline_accuracy=0.83, num_examples=200, num_threads=128),
```

PR sgl-project#21667 unified the GSM8K eval path from `few_shot_gsm8k` to
`GSM8KEval`. Two issues caused test failures:

1. The old eval included few-shot examples in the evaluation set
   (lines[0:200]), inflating scores by ~4% (8/200 trivially correct).
   The new eval properly excludes them (lines[5:205]).

2. For DeepSeek-R1-FP4, `num_examples` was not set in AccuracyTestParams.
   The old path defaulted to 200 questions, but the new GSM8KEval evaluates
   ALL 1314 remaining questions when num_examples=None, making the test
   significantly harder and ~7x slower.

Fixes:
- FP8 MOE backend: lower threshold 0.93 → 0.89 (observed: 0.905-0.925)
- FP4 DeepSeek-R1: add num_examples=200, lower threshold 0.935 → 0.89

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Kangyan-Zhou Kangyan-Zhou force-pushed the fix_nightly_b200_accuracy branch from bd24509 to a1c1d44 Compare April 5, 2026 06:41
@Kangyan-Zhou Kangyan-Zhou marked this pull request as draft April 5, 2026 07:05
…=200

Bisecting on B200 revealed the real root cause for the DeepSeek-R1-FP4
accuracy drop: the eval unification (sgl-project#21667) removed the `_run_few_shot_eval`
path which used the Completion API, and replaced it with `_run_simple_eval`
which defaults to the Chat API. DeepSeek-R1 scores much lower on GSM8K
through the Chat API (0.86) vs Completion API (0.975).

Additionally, `num_examples` was not set, causing GSM8KEval to evaluate
all 1314 questions instead of the intended 200.

Bisect results (nvidia/DeepSeek-R1-0528-NVFP4-v2, TP4, B200):
  Old eval (8-shot, 200q, completion API):  0.975
  New eval (5-shot, 200q, completion API):  0.985
  New eval (5-shot, 1314q, completion API): 0.954

Fixes:
- Add `api` field to AccuracyTestParams and propagate through _run_simple_eval
- Set api="completion" and num_examples=200 in dpsk R1 FP4 test
- Restore original baseline_accuracy=0.935

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@zianglih
Contributor

zianglih commented Apr 6, 2026

Hi, FlashinferTrtllmGenMoeBackendMXFP8Base also regressed, similar to FlashinferTrtllmGenMoeBackendFP8Base. We can also lower the threshold from 0.93 to 0.89 there.
