[CI] Lower GSM8K baselines for B200 nightly after eval unification#22136
Kangyan-Zhou wants to merge 2 commits into `sgl-project:main`
Conversation
Code Review
This pull request lowers the GSM8K accuracy baselines for FlashInfer TRT-LLM MoE and DeepSeek-R1 FP4 performance tests to reflect changes from evaluation unification and data leakage fixes. Feedback suggests verifying if other FP8-based models require similar baseline adjustments and recommends explicitly defining evaluation parameters like the number of examples and threads for consistency across tests.
```diff
  metrics = run_eval(args)
  print(f"{metrics=}")
- self.assertGreater(metrics["score"], 0.93)
+ self.assertGreater(metrics["score"], 0.89)
```
The PR description mentions lowering the FP8 MoE backend baseline. Note that `FlashinferTrtllmGenMoeBackendMXFP8Base` (line 157) also uses an FP8-based quantization (mxfp8) and currently keeps a GSM8K baseline of 0.93. If that model was also affected by the evaluation unification and the data-leakage fix, its baseline should likely be updated as well to avoid potential CI failures.
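One way to avoid a sibling baseline being missed is to keep the FP8-family GSM8K baselines in a single table that every backend test reads from. This is purely a hypothetical sketch; the class names and the table are illustrative, not the actual identifiers in `test_flashinfer_trtllm_gen_moe_backend.py`:

```python
# Hypothetical shared baseline table (illustrative names, not the repo's
# actual test identifiers). Adjusting one entry next to the other makes it
# harder to forget a sibling FP8 variant after an eval change.
GSM8K_BASELINES = {
    "FlashinferTrtllmGenMoeBackendFP8": 0.89,    # was 0.93 pre-unification
    "FlashinferTrtllmGenMoeBackendMXFP8": 0.89,  # was 0.93 pre-unification
}


def check_score(backend_name: str, score: float) -> bool:
    """Return True if the observed GSM8K score clears the backend's baseline."""
    return score > GSM8K_BASELINES[backend_name]
```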
```diff
-accuracy_params=AccuracyTestParams(
-    dataset="gsm8k", baseline_accuracy=0.935
-),
+accuracy_params=AccuracyTestParams(dataset="gsm8k", baseline_accuracy=0.83),
```
For consistency with other GSM8K tests in the repository (e.g., test_flashinfer_trtllm_gen_moe_backend.py) and to align with the 200-example evaluation range mentioned in the PR description, it is recommended to explicitly set num_examples=200. Additionally, consider setting num_threads=128 to match the configuration used in other backend tests.
Suggested change:

```diff
-accuracy_params=AccuracyTestParams(dataset="gsm8k", baseline_accuracy=0.83),
+accuracy_params=AccuracyTestParams(dataset="gsm8k", baseline_accuracy=0.83, num_examples=200, num_threads=128),
```
PR sgl-project#21667 unified the GSM8K eval path from `few_shot_gsm8k` to `GSM8KEval`. Two issues caused test failures:

1. The old eval included few-shot examples in the evaluation set (`lines[0:200]`), inflating scores by ~4% (8/200 trivially correct). The new eval properly excludes them (`lines[5:205]`).
2. For DeepSeek-R1-FP4, `num_examples` was not set in AccuracyTestParams. The old path defaulted to 200 questions, but the new GSM8KEval evaluates ALL 1314 remaining questions when `num_examples=None`, making the test significantly harder and ~7x slower.

Fixes:
- FP8 MOE backend: lower threshold 0.93 → 0.89 (observed: 0.905-0.925)
- FP4 DeepSeek-R1: add `num_examples=200`, lower threshold 0.935 → 0.89

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
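The ~4% inflation in the commit message is simple slice arithmetic: the old eval drew its 8 few-shot exemplars from the same `lines[0:200]` slice it scored, so 8 of the 200 graded questions were trivially correct. A minimal sketch of the overlap (indices stand in for dataset lines; the counts come from the commit message):

```python
# Old eval: 8-shot exemplars drawn from the same slice that gets scored.
old_eval_set = set(range(0, 200))   # lines[0:200]
old_exemplars = set(range(0, 8))    # first 8 lines, reused as few-shot examples
leaked = old_exemplars & old_eval_set
inflation = len(leaked) / len(old_eval_set)  # 8 / 200 -> 0.04, i.e. ~4%

# New eval: 5-shot exemplars excluded from the scored slice.
new_eval_set = set(range(5, 205))   # lines[5:205]
new_exemplars = set(range(0, 5))    # lines[0:5]
overlap_after_fix = new_exemplars & new_eval_set  # empty: no leakage
```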
Force-pushed from `bd24509` to `a1c1d44`.
…=200

Bisecting on B200 revealed the real root cause for the DeepSeek-R1-FP4 accuracy drop: the eval unification (sgl-project#21667) removed the `_run_few_shot_eval` path, which used the Completion API, and replaced it with `_run_simple_eval`, which defaults to the Chat API. DeepSeek-R1 scores much lower on GSM8K through the Chat API (0.86) vs the Completion API (0.975).

Additionally, `num_examples` was not set, causing GSM8KEval to evaluate all 1314 questions instead of the intended 200.

Bisect results (nvidia/DeepSeek-R1-0528-NVFP4-v2, TP4, B200):
- Old eval (8-shot, 200q, completion API): 0.975
- New eval (5-shot, 200q, completion API): 0.985
- New eval (5-shot, 1314q, completion API): 0.954

Fixes:
- Add `api` field to AccuracyTestParams and propagate through `_run_simple_eval`
- Set `api="completion"` and `num_examples=200` in dpsk R1 FP4 test
- Restore original `baseline_accuracy=0.935`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
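The Chat-vs-Completion gap comes down to how the prompt reaches the model: a completion request sends one raw string, while a chat request is wrapped in the model's chat template server-side. A rough sketch of the two OpenAI-style request shapes (the function and field values are illustrative, not the harness's actual code):

```python
def build_request(question: str, few_shot: str, api: str = "chat") -> dict:
    """Build an OpenAI-style request body for either API mode (sketch)."""
    prompt = few_shot + "\nQuestion: " + question + "\nAnswer:"
    if api == "completion":
        # Raw text completion: few-shot examples and question go out as one
        # string; no chat template is applied by the server.
        return {"prompt": prompt, "max_tokens": 512}
    # Chat mode: the server applies the model's chat template, which for
    # DeepSeek-R1 shifted GSM8K scores markedly in this PR's bisect.
    return {"messages": [{"role": "user", "content": prompt}], "max_tokens": 512}
```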
Summary
- Set `api="completion"` and `num_examples=200`, restore baseline 0.935
- Add `api` field to `AccuracyTestParams` for controlling eval API mode

Root Cause (Bisected on B200)
PR #21667 unified the GSM8K eval path, introducing two separate issues:
Issue 1: FP8 MOE Backend (Qwen3-Next-80B-A3B-Instruct-FP8)
The new `GSM8KEval` properly excludes few-shot examples from the evaluation set (`lines[5:205]` instead of `lines[0:200]`). The old eval inflated scores by ~4% (8/200 trivially correct examples). FP8 scores dropped from ~0.93-0.95 to ~0.905-0.925.

Fix: Lower threshold from 0.93 to 0.89.
Issue 2: FP4 DeepSeek-R1 (nvidia/DeepSeek-R1-0528-NVFP4-v2)
Two compounding problems:
- API change: the old `_run_few_shot_eval` used the Completion API. The new `_run_simple_eval` defaults to the Chat API. DeepSeek-R1 scores dramatically lower through the Chat API (0.86 vs 0.975 on the Completion API).
- Missing `num_examples`: without `num_examples=200`, `GSM8KEval` evaluates all 1314 questions instead of the intended 200.

Bisect results (nvidia/DeepSeek-R1-0528-NVFP4-v2, TP4, B200, Completion API):
- Old eval (8-shot, 200q): 0.975
- New eval (5-shot, 200q): 0.985
- New eval (5-shot, 1314q): 0.954
The model has no accuracy regression — it's purely an eval methodology issue.
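A minimal sketch of what the new `api` field on `AccuracyTestParams` might look like, with `num_examples=None` meaning "evaluate the full set" as described above. This is hypothetical; the actual dataclass in sglang's test harness may have different fields and defaults:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccuracyTestParams:
    dataset: str
    baseline_accuracy: float
    num_examples: Optional[int] = None  # None -> evaluate the full set
    api: str = "chat"                   # "chat" or "completion"


def select_num_examples(params: AccuracyTestParams, dataset_size: int) -> int:
    """How many questions the eval will actually run (sketch of the None case)."""
    return dataset_size if params.num_examples is None else params.num_examples


# The fixed DeepSeek-R1 FP4 configuration from this PR, expressed against
# the hypothetical dataclass above:
r1_fp4_params = AccuracyTestParams(
    dataset="gsm8k",
    baseline_accuracy=0.935,
    num_examples=200,
    api="completion",
)
```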
Fix: Add `api="completion"` + `num_examples=200` to restore the original behavior, and keep the baseline at 0.935.

Test plan
- `nightly-test-perf-4-gpu-b200` passes on next nightly run

🤖 Generated with Claude Code