
Update MMLU/GPQA/LongBench accuracy benchmarks and Add document #63

Merged
ishandhanani merged 5 commits into main from baizhou/acc
Dec 19, 2025

Conversation

@Fridge003 (Collaborator) commented Dec 19, 2025

Summary by CodeRabbit

  • New Features

    • Added support for additional benchmark types (MMLU, GPQA, LongBenchV2) with configurable accuracy parameters (examples, tokens, repeats, threads, context length, categories).
  • Chores

    • Updated default benchmark settings: MMLU defaults adjusted (examples and token limits); GPQA defaults changed (higher token limit, reduced thread count) and thinking-mode removed.
  • Documentation

    • Added Accuracy Benchmark docs with configuration examples and expected outputs.


coderabbitai bot (Contributor) commented Dec 19, 2025

Walkthrough

Adds accuracy-benchmark fields to BenchmarkConfig, extends SGLang SLURM script generation to handle bench types mmlu, gpqa, and longbenchv2 by reading those fields into parsable configs, updates default parameters in MMLU and GPQA benchmark scripts, and adds accuracy benchmark documentation.

Changes

Cohort / File(s) / Summary:

  • Schema Extensions (src/srtctl/core/schema.py): Added optional fields to BenchmarkConfig: num_examples, max_tokens, repeat, num_threads, max_context_length, and categories.
  • SGLang backend (src/srtctl/backends/sglang.py): Extended generate_slurm_script with branches for bench_type = mmlu, gpqa, and longbenchv2; each branch reads benchmark-specific parameters (with defaults) and builds parsable_config. Existing branches are preserved.
  • MMLU benchmark script (scripts/benchmarks/mmlu/bench.sh): Updated default parameters: num_examples from 198 to 200, max_tokens from 512 to 2048.
  • GPQA benchmark script (scripts/benchmarks/gpqa/bench.sh): Updated defaults: max_tokens from 512 to 32768, num_threads from 512 to 128; removed thinking_mode argument usage and prints.
  • Documentation (docs/accuracy.md): Added accuracy benchmarking documentation with YAML/config examples and usage notes for MMLU, GPQA, and a placeholder for LongBench-V2.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review src/srtctl/core/schema.py for types and optional defaults.
  • Verify src/srtctl/backends/sglang.py branches, parameter names, and parsable_config formatting.
  • Run and validate scripts/benchmarks/mmlu/bench.sh and scripts/benchmarks/gpqa/bench.sh argument parsing and outputs.
  • Check gpqa/bench.sh for any leftover thinking_mode references.
  • Confirm docs/accuracy.md aligns with CLI/config parameter names.

Poem

🐰 I nudged the fields beneath the sod,
Defaults hopped up, precise and odd,
Scripts now speak a clearer tune,
Parsable carrots under moon,
🥕 small hops, big benching — laud!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main changes: updates to accuracy benchmarks (MMLU/GPQA/LongBench) and addition of documentation.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/benchmarks/mmlu/bench.sh (1)

5-5: Fix misleading comment.

The comment states "GPQA evaluation script" but the script actually runs MMLU evaluation (line 47: --eval-name mmlu). Update the comment to reflect the correct benchmark type.

🔎 Proposed fix
-# GPQA evaluation script using sglang.test.run_eval with mmlu
+# MMLU evaluation script using sglang.test.run_eval
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3fce3dd and 09313f8.

📒 Files selected for processing (3)
  • scripts/benchmarks/mmlu/bench.sh (1 hunks)
  • src/srtctl/backends/sglang.py (1 hunks)
  • src/srtctl/core/schema.py (1 hunks)
🔇 Additional comments (3)
scripts/benchmarks/mmlu/bench.sh (1)

16-17: LGTM!

The updated default values for num_examples (200) and max_tokens (2048) align with the backend defaults in src/srtctl/backends/sglang.py and provide sensible defaults for MMLU benchmarking.

src/srtctl/backends/sglang.py (1)

265-270: LGTM!

The MMLU benchmark parsing logic correctly extracts configuration parameters with sensible defaults that align with the bash script expectations. The implementation follows the established pattern from the SA-bench branch above.
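The branching pattern described above can be sketched as follows. The function name, the dict-based config, and the GPQA/LongBenchV2 key sets are illustrative assumptions; only the bench types and the defaults for MMLU (200 examples, 2048 tokens) and GPQA (32768 tokens, 128 threads) are taken from this review:

```python
def build_accuracy_bench_config(bench_type: str, benchmark: dict) -> dict:
    """Sketch of per-bench-type parameter extraction with defaults.

    Hypothetical helper, not the actual srtctl implementation.
    """
    if bench_type == "mmlu":
        return {
            "num_examples": benchmark.get("num_examples", 200),
            "max_tokens": benchmark.get("max_tokens", 2048),
        }
    if bench_type == "gpqa":
        return {
            "num_examples": benchmark.get("num_examples", 198),
            "max_tokens": benchmark.get("max_tokens", 32768),
            "num_threads": benchmark.get("num_threads", 128),
        }
    if bench_type == "longbenchv2":
        return {
            "max_context_length": benchmark.get("max_context_length"),
            "categories": benchmark.get("categories"),
        }
    raise ValueError(f"unknown bench_type: {bench_type}")

print(build_accuracy_bench_config("mmlu", {}))
```

Each branch falls back to a benchmark-appropriate default when a field is unset, which is why the bash-script defaults and the backend defaults must stay in sync.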

src/srtctl/core/schema.py (1)

168-172: LGTM!

The new accuracy benchmark fields are well-defined with appropriate types and clear descriptions. The optional nature correctly reflects that these fields only apply to accuracy benchmarks (MMLU, GPQA, LongBench).

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09313f8 and d1abf66.

📒 Files selected for processing (3)
  • scripts/benchmarks/gpqa/bench.sh (1 hunks)
  • src/srtctl/backends/sglang.py (1 hunks)
  • src/srtctl/core/schema.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/srtctl/core/schema.py
  • src/srtctl/backends/sglang.py
🔇 Additional comments (1)
scripts/benchmarks/gpqa/bench.sh (1)

17-19: Verify the significant parameter default changes.

The default values have changed substantially:

  • max_tokens: 512 → 32768 (64× increase)
  • num_threads: 512 → 128 (4× decrease)

The 64× increase in max_tokens will significantly increase memory consumption, while the 4× decrease in num_threads will reduce parallelism. Please confirm these changes align with the new GPQA benchmark requirements and available resources.

coderabbitai bot (Contributor) left a comment
Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/accuracy.md (1)

26-38: Add language specification to code block showing expected output.

The fenced code block displaying expected benchmark output lacks a language identifier. While this won't break functionality, it's a documentation best practice.

🔎 Proposed fix
-After finishing benchmarking, the `benchmark.out` will contain the results of accuracy:
-```
+After finishing benchmarking, the `benchmark.out` will contain the results of accuracy:
+```text
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d1abf66 and 5a096ea.

📒 Files selected for processing (1)
  • docs/accuracy.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/accuracy.md

26-26: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (2)
docs/accuracy.md (2)

3-5: Clarify supported benchmark types and add note about undocumented parameters.

Line 3 lists supported benchmarks but the AI summary indicates that max_context_length and categories parameters are also configurable. These parameters are not shown in the MMLU or GPQA configuration examples (lines 14–17 and 46–49). If these are user-configurable, they should be documented with examples and explanations of their purpose.

Can you confirm whether max_context_length and categories are configurable parameters that should be documented in the configuration examples, or if they are internal/derived values?


31-33: Verify accuracy of output format.

The JSON output at line 32 contains np.float64(...) notation, which is a Python repr format rather than standard JSON. This is unusual for a documented "expected output" as users would typically see either standard JSON or formatted numbers. Clarify whether this reflects the actual output users will see or if this should be formatted differently for the documentation.
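The np.float64(...) notation typically appears when a metrics dict is printed directly: under NumPy 2.x, the repr of a float64 scalar is np.float64(0.735), and Python's dict printing uses repr for its values. A minimal sketch of the cleanup (the metric names here are hypothetical, not taken from the actual benchmark output):

```python
import json

# Hypothetical accuracy metrics; with numpy installed these values would be
# np.float64 scalars, and printing the dict directly would show
# {'score': np.float64(0.735), ...} under NumPy >= 2.0.
metrics = {"score": 0.735, "score_std": 0.012}

# Coercing to built-in float and serializing with json.dumps yields clean,
# standard JSON suitable for documentation:
clean = json.dumps({k: float(v) for k, v in metrics.items()})
print(clean)  # {"score": 0.735, "score_std": 0.012}
```

If the documented output should mirror what users actually see in benchmark.out, keeping the np.float64(...) form is defensible; otherwise converting as above gives portable JSON.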

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5a096ea and 9c085be.

📒 Files selected for processing (1)
  • scripts/benchmarks/gpqa/bench.sh (2 hunks)
🔇 Additional comments (2)
scripts/benchmarks/gpqa/bench.sh (2)

20-20: Good documentation of the thinking-mode removal.

The comment clearly explains why --thinking-mode was removed, which helps future maintainers understand the limitation with the dynamo frontend.


22-22: Successfully resolved undefined thinking_mode variable issues.

The script no longer references ${thinking_mode} in the echo statement (line 22) or passes --thinking-mode to the Python invocation (line 52). This fully addresses the past review comments about the undefined variable.

Also applies to: 52-52

 decode_gpus=$4
 num_examples=${5:-198} # Default: 198
-max_tokens=${6:-512} # Default: 512
+max_tokens=${6:-32768} # Default: 32768
coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n scripts/benchmarks/gpqa/bench.sh | head -60

Repository: ishandhanani/srt-slurm

Length of output: 2469


🏁 Script executed:

cat scripts/benchmarks/gpqa/bench.sh

Repository: ishandhanani/srt-slurm

Length of output: 2145


Add comments explaining the rationale for default parameter values.

The --thinking-mode removal is properly documented, but the script lacks explanatory comments for the new defaults:

  • max_tokens=32768 — clarify why this value was chosen for DeepSeek-R1 reasoning on GPQA
  • num_threads=128 — clarify the impact on sglang request parallelization

These values are reasonable for GPQA evaluation with a reasoning model, but documenting the rationale helps future maintainers understand the performance/quality trade-offs.

🤖 Prompt for AI Agents
In scripts/benchmarks/gpqa/bench.sh around line 17 (and where num_threads is
set, e.g., the thread-init line), add short inline comments explaining the
rationale for the new defaults: document that max_tokens=32768 was chosen to
allow DeepSeek-R1 to perform extended chain-of-thought reasoning on GPQA without
truncation, and that num_threads=128 was chosen to increase sglang request
parallelization for throughput during batch evaluation while noting potential
CPU/memory trade-offs; place each comment immediately above or on the same line
as the variable assignment and keep them concise (one sentence each) mentioning
the intended trade-off (quality vs. resource use).

@ishandhanani ishandhanani merged commit 1e863d1 into main Dec 19, 2025
3 checks passed
@Fridge003 Fridge003 deleted the baizhou/acc branch January 6, 2026 14:19
@coderabbitai coderabbitai bot mentioned this pull request Jan 8, 2026
