
Update MMLU/GPQA/LongBench accuracy benchmarks and Add document #63

Merged
ishandhanani merged 5 commits into main from baizhou/acc
Dec 19, 2025

Conversation

@Fridge003 (Collaborator) commented Dec 19, 2025

Summary by CodeRabbit

  • New Features

    • Added support for additional benchmark types (MMLU, GPQA, LongBenchV2) with configurable accuracy parameters (examples, tokens, repeats, threads, context length, categories).
  • Chores

    • Updated default benchmark settings: MMLU defaults adjusted (examples and token limits); GPQA defaults changed (higher token limit, reduced thread count) and thinking-mode removed.
  • Documentation

    • Added Accuracy Benchmark docs with configuration examples and expected outputs.


coderabbitai bot (Contributor) commented Dec 19, 2025

Walkthrough

Adds accuracy-benchmark fields to BenchmarkConfig, extends SGLang SLURM script generation to handle bench types mmlu, gpqa, and longbenchv2 by reading those fields into parsable configs, updates default parameters in MMLU and GPQA benchmark scripts, and adds accuracy benchmark documentation.

Changes

Cohort / File(s) / Summary:

  • Schema Extensions (src/srtctl/core/schema.py): Added optional fields to BenchmarkConfig: num_examples, max_tokens, repeat, num_threads, max_context_length, and categories.
  • SGLang backend (src/srtctl/backends/sglang.py): Extended generate_slurm_script with branches for bench_type = mmlu, gpqa, and longbenchv2; each branch reads benchmark-specific parameters (with defaults) and builds parsable_config. Existing branches are preserved.
  • MMLU benchmark script (scripts/benchmarks/mmlu/bench.sh): Updated default parameters: num_examples from 198 to 200, max_tokens from 512 to 2048.
  • GPQA benchmark script (scripts/benchmarks/gpqa/bench.sh): Updated defaults: max_tokens from 512 to 32768, num_threads from 512 to 128; removed thinking_mode argument usage and prints.
  • Documentation (docs/accuracy.md): Added accuracy benchmarking documentation with YAML/config examples and usage notes for MMLU, GPQA, and a placeholder for LongBench-V2.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review src/srtctl/core/schema.py for types and optional defaults.
  • Verify src/srtctl/backends/sglang.py branches, parameter names, and parsable_config formatting.
  • Run and validate scripts/benchmarks/mmlu/bench.sh and scripts/benchmarks/gpqa/bench.sh argument parsing and outputs.
  • Check gpqa/bench.sh for any leftover thinking_mode references.
  • Confirm docs/accuracy.md aligns with CLI/config parameter names.

Poem

🐰 I nudged the fields beneath the sod,
Defaults hopped up, precise and odd,
Scripts now speak a clearer tune,
Parsable carrots under moon,
🥕 small hops, big benching — laud!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main changes: updates to accuracy benchmarks (MMLU/GPQA/LongBench) and addition of documentation.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/benchmarks/mmlu/bench.sh (1)

5-5: Fix misleading comment.

The comment states "GPQA evaluation script" but the script actually runs MMLU evaluation (line 47: --eval-name mmlu). Update the comment to reflect the correct benchmark type.

🔎 Proposed fix
-# GPQA evaluation script using sglang.test.run_eval with mmlu
+# MMLU evaluation script using sglang.test.run_eval
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3fce3dd and 09313f8.

📒 Files selected for processing (3)
  • scripts/benchmarks/mmlu/bench.sh (1 hunks)
  • src/srtctl/backends/sglang.py (1 hunks)
  • src/srtctl/core/schema.py (1 hunks)
🔇 Additional comments (3)
scripts/benchmarks/mmlu/bench.sh (1)

16-17: LGTM!

The updated default values for num_examples (200) and max_tokens (2048) align with the backend defaults in src/srtctl/backends/sglang.py and provide sensible defaults for MMLU benchmarking.

src/srtctl/backends/sglang.py (1)

265-270: LGTM!

The MMLU benchmark parsing logic correctly extracts configuration parameters with sensible defaults that align with the bash script expectations. The implementation follows the established pattern from the SA-bench branch above.
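The branching pattern described above can be sketched as follows. The function name, the dict-based config, and the GPQA/LongBenchV2 key sets are illustrative assumptions; only the bench types and the defaults for MMLU (200 examples, 2048 tokens) and GPQA (32768 tokens, 128 threads) are taken from this review:

```python
def build_accuracy_bench_config(bench_type: str, benchmark: dict) -> dict:
    """Sketch of per-bench-type parameter extraction with defaults.

    Hypothetical helper, not the actual srtctl implementation.
    """
    if bench_type == "mmlu":
        return {
            "num_examples": benchmark.get("num_examples", 200),
            "max_tokens": benchmark.get("max_tokens", 2048),
        }
    if bench_type == "gpqa":
        return {
            "num_examples": benchmark.get("num_examples", 198),
            "max_tokens": benchmark.get("max_tokens", 32768),
            "num_threads": benchmark.get("num_threads", 128),
        }
    if bench_type == "longbenchv2":
        return {
            "max_context_length": benchmark.get("max_context_length"),
            "categories": benchmark.get("categories"),
        }
    raise ValueError(f"unknown bench_type: {bench_type}")

print(build_accuracy_bench_config("mmlu", {}))
```

Each branch falls back to a benchmark-appropriate default when a field is unset, which is why the bash-script defaults and the backend defaults must stay in sync.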

src/srtctl/core/schema.py (1)

168-172: LGTM!

The new accuracy benchmark fields are well-defined with appropriate types and clear descriptions. The optional nature correctly reflects that these fields only apply to accuracy benchmarks (MMLU, GPQA, LongBench).

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09313f8 and d1abf66.

📒 Files selected for processing (3)
  • scripts/benchmarks/gpqa/bench.sh (1 hunks)
  • src/srtctl/backends/sglang.py (1 hunks)
  • src/srtctl/core/schema.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/srtctl/core/schema.py
  • src/srtctl/backends/sglang.py
🔇 Additional comments (1)
scripts/benchmarks/gpqa/bench.sh (1)

17-19: Verify the significant parameter default changes.

The default values have changed substantially:

  • max_tokens: 512 → 32768 (64× increase)
  • num_threads: 512 → 128 (4× decrease)

The 64× increase in max_tokens will significantly increase memory consumption, while the 4× decrease in num_threads will reduce parallelism. Please confirm these changes align with the new GPQA benchmark requirements and available resources.

coderabbitai bot (Contributor) left a comment
Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/accuracy.md (1)

26-38: Add language specification to code block showing expected output.

The fenced code block displaying expected benchmark output lacks a language identifier. While this won't break functionality, it's a documentation best practice.

🔎 Proposed fix
-After finishing benchmarking, the `benchmark.out` will contain the results of accuracy:
-```
+After finishing benchmarking, the `benchmark.out` will contain the results of accuracy:
+```text
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d1abf66 and 5a096ea.

📒 Files selected for processing (1)
  • docs/accuracy.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/accuracy.md

26-26: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (2)
docs/accuracy.md (2)

3-5: Clarify supported benchmark types and add note about undocumented parameters.

Line 3 lists supported benchmarks but the AI summary indicates that max_context_length and categories parameters are also configurable. These parameters are not shown in the MMLU or GPQA configuration examples (lines 14–17 and 46–49). If these are user-configurable, they should be documented with examples and explanations of their purpose.

Can you confirm whether max_context_length and categories are configurable parameters that should be documented in the configuration examples, or if they are internal/derived values?


31-33: Verify accuracy of output format.

The JSON output at line 32 contains np.float64(...) notation, which is a Python repr format rather than standard JSON. This is unusual for a documented "expected output" as users would typically see either standard JSON or formatted numbers. Clarify whether this reflects the actual output users will see or if this should be formatted differently for the documentation.
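The np.float64(...) notation typically appears when a metrics dict is printed directly: under NumPy 2.x, the repr of a float64 scalar is np.float64(0.735), and Python's dict printing uses repr for its values. A minimal sketch of the cleanup (the metric names here are hypothetical, not taken from the actual benchmark output):

```python
import json

# Hypothetical accuracy metrics; with numpy installed these values would be
# np.float64 scalars, and printing the dict directly would show
# {'score': np.float64(0.735), ...} under NumPy >= 2.0.
metrics = {"score": 0.735, "score_std": 0.012}

# Coercing to built-in float and serializing with json.dumps yields clean,
# standard JSON suitable for documentation:
clean = json.dumps({k: float(v) for k, v in metrics.items()})
print(clean)  # {"score": 0.735, "score_std": 0.012}
```

If the documented output should mirror what users actually see in benchmark.out, keeping the np.float64(...) form is defensible; otherwise converting as above gives portable JSON.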

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5a096ea and 9c085be.

📒 Files selected for processing (1)
  • scripts/benchmarks/gpqa/bench.sh (2 hunks)
🔇 Additional comments (2)
scripts/benchmarks/gpqa/bench.sh (2)

20-20: Good documentation of the thinking-mode removal.

The comment clearly explains why --thinking-mode was removed, which helps future maintainers understand the limitation with the dynamo frontend.


22-22: Successfully resolved undefined thinking_mode variable issues.

The script no longer references ${thinking_mode} in the echo statement (line 22) or passes --thinking-mode to the Python invocation (line 52). This fully addresses the past review comments about the undefined variable.

Also applies to: 52-52

 decode_gpus=$4
 num_examples=${5:-198} # Default: 198
-max_tokens=${6:-512} # Default: 512
+max_tokens=${6:-32768} # Default: 32768
coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n scripts/benchmarks/gpqa/bench.sh | head -60

Repository: ishandhanani/srt-slurm

Length of output: 2469


🏁 Script executed:

cat scripts/benchmarks/gpqa/bench.sh

Repository: ishandhanani/srt-slurm

Length of output: 2145


Add comments explaining the rationale for default parameter values.

The --thinking-mode removal is properly documented, but the script lacks explanatory comments for the new defaults:

  • max_tokens=32768 — clarify why this value was chosen for DeepSeek-R1 reasoning on GPQA
  • num_threads=128 — clarify the impact on sglang request parallelization

These values are reasonable for GPQA evaluation with a reasoning model, but documenting the rationale helps future maintainers understand the performance/quality trade-offs.

🤖 Prompt for AI Agents
In scripts/benchmarks/gpqa/bench.sh around line 17 (and where num_threads is
set, e.g., the thread-init line), add short inline comments explaining the
rationale for the new defaults: document that max_tokens=32768 was chosen to
allow DeepSeek-R1 to perform extended chain-of-thought reasoning on GPQA without
truncation, and that num_threads=128 was chosen to increase sglang request
parallelization for throughput during batch evaluation while noting potential
CPU/memory trade-offs; place each comment immediately above or on the same line
as the variable assignment and keep them concise (one sentence each) mentioning
the intended trade-off (quality vs. resource use).

@ishandhanani ishandhanani merged commit 1e863d1 into main Dec 19, 2025
3 checks passed
@Fridge003 Fridge003 deleted the baizhou/acc branch January 6, 2026 14:19
@coderabbitai coderabbitai bot mentioned this pull request Jan 8, 2026
