Update MMLU/GPQA/LongBench accuracy benchmarks and Add document #63
ishandhanani merged 5 commits into main from
Conversation
Walkthrough: Adds accuracy-benchmark fields to BenchmarkConfig and extends SGLang SLURM script generation to handle the new bench types.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/benchmarks/mmlu/bench.sh (1)
5-5: Fix misleading comment. The comment states "GPQA evaluation script" but the script actually runs MMLU evaluation (line 47: --eval-name mmlu). Update the comment to reflect the correct benchmark type.
🔎 Proposed fix
-# GPQA evaluation script using sglang.test.run_eval with mmlu
+# MMLU evaluation script using sglang.test.run_eval
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- scripts/benchmarks/mmlu/bench.sh (1 hunks)
- src/srtctl/backends/sglang.py (1 hunks)
- src/srtctl/core/schema.py (1 hunks)
🔇 Additional comments (3)
scripts/benchmarks/mmlu/bench.sh (1)
16-17: LGTM! The updated default values for num_examples (200) and max_tokens (2048) align with the backend defaults in src/srtctl/backends/sglang.py and provide sensible defaults for MMLU benchmarking.
src/srtctl/backends/sglang.py (1)
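As a quick illustration, the positional-parameter default pattern these bench.sh lines rely on can be sketched as follows. The argument positions and placeholder values here are assumptions for the sketch, not confirmed details of the script:

```shell
# Sketch of shell positional-parameter defaults, as used in bench.sh.
# Only four arguments are supplied, so positions 5 and 6 are unset and
# fall back to the MMLU defaults that match the sglang backend.
set -- deepseek-v3 4 4 8        # hypothetical: model, prefill_gpus, decode_gpus, extra arg

num_examples=${5:-200}          # Default: 200 examples
max_tokens=${6:-2048}           # Default: 2048 tokens per completion
echo "num_examples=${num_examples} max_tokens=${max_tokens}"
```

Passing a fifth or sixth argument overrides the corresponding default, which is how callers can tune the benchmark without editing the script.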
265-270: LGTM! The MMLU benchmark parsing logic correctly extracts configuration parameters with sensible defaults that align with the bash script expectations. The implementation follows the established pattern from the SA-bench branch above.
src/srtctl/core/schema.py (1)
168-172: LGTM! The new accuracy benchmark fields are well-defined with appropriate types and clear descriptions. The optional nature correctly reflects that these fields only apply to accuracy benchmarks (MMLU, GPQA, LongBench).
Actionable comments posted: 2
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- scripts/benchmarks/gpqa/bench.sh (1 hunks)
- src/srtctl/backends/sglang.py (1 hunks)
- src/srtctl/core/schema.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/srtctl/core/schema.py
- src/srtctl/backends/sglang.py
🔇 Additional comments (1)
scripts/benchmarks/gpqa/bench.sh (1)
17-19: Verify the significant parameter default changes. The default values have changed substantially:
- max_tokens: 512 → 32768 (64× increase)
- num_threads: 512 → 128 (4× decrease)

The 64× increase in max_tokens will significantly increase memory consumption, while the 4× decrease in num_threads will reduce parallelism. Please confirm these changes align with the new GPQA benchmark requirements and available resources.
Actionable comments posted: 0
🧹 Nitpick comments (1)
docs/accuracy.md (1)
26-38: Add language specification to code block showing expected output. The fenced code block displaying expected benchmark output lacks a language identifier. While this won't break functionality, it's a documentation best practice.
🔎 Proposed fix
-After finishing benchmarking, the `benchmark.out` will contain the results of accuracy:
-```
+After finishing benchmarking, the `benchmark.out` will contain the results of accuracy:
+```text
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- docs/accuracy.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/accuracy.md
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (2)
docs/accuracy.md (2)
3-5: Clarify supported benchmark types and add note about undocumented parameters. Line 3 lists supported benchmarks but the AI summary indicates that max_context_length and categories parameters are also configurable. These parameters are not shown in the MMLU or GPQA configuration examples (lines 14–17 and 46–49). If these are user-configurable, they should be documented with examples and explanations of their purpose.
Can you confirm whether max_context_length and categories are configurable parameters that should be documented in the configuration examples, or if they are internal/derived values?
31-33: Verify accuracy of output format. The JSON output at line 32 contains np.float64(...) notation, which is a Python repr format rather than standard JSON. This is unusual for a documented "expected output" as users would typically see either standard JSON or formatted numbers. Clarify whether this reflects the actual output users will see or if this should be formatted differently for the documentation.
Actionable comments posted: 1
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/benchmarks/gpqa/bench.sh(2 hunks)
🔇 Additional comments (2)
scripts/benchmarks/gpqa/bench.sh (2)
20-20: Good documentation of the thinking-mode removal. The comment clearly explains why --thinking-mode was removed, which helps future maintainers understand the limitation with the dynamo frontend.
22-22: Successfully resolved undefined thinking_mode variable issues. The script no longer references ${thinking_mode} in the echo statement (line 22) or passes --thinking-mode to the Python invocation (line 52). This fully addresses the past review comments about the undefined variable.
Also applies to: 52-52
 decode_gpus=$4
 num_examples=${5:-198}   # Default: 198
-max_tokens=${6:-512}     # Default: 512
+max_tokens=${6:-32768}   # Default: 32768
🧩 Analysis chain
🏁 Script executed:
cat -n scripts/benchmarks/gpqa/bench.sh | head -60
Repository: ishandhanani/srt-slurm
Length of output: 2469
🏁 Script executed:
cat scripts/benchmarks/gpqa/bench.sh
Repository: ishandhanani/srt-slurm
Length of output: 2145
Add comments explaining the rationale for default parameter values.
The --thinking-mode removal is properly documented, but the script lacks explanatory comments for the new defaults:
- max_tokens=32768 — clarify why this value was chosen for DeepSeek-R1 reasoning on GPQA
- num_threads=128 — clarify the impact on sglang request parallelization
These values are reasonable for GPQA evaluation with a reasoning model, but documenting the rationale helps future maintainers understand the performance/quality trade-offs.
🤖 Prompt for AI Agents
In scripts/benchmarks/gpqa/bench.sh around line 17 (and where num_threads is
set, e.g., the thread-init line), add short inline comments explaining the
rationale for the new defaults: document that max_tokens=32768 was chosen to
allow DeepSeek-R1 to perform extended chain-of-thought reasoning on GPQA without
truncation, and that num_threads=128 was chosen to increase sglang request
parallelization for throughput during batch evaluation while noting potential
CPU/memory trade-offs; place each comment immediately above or on the same line
as the variable assignment and keep them concise (one sentence each) mentioning
the intended trade-off (quality vs. resource use).
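A minimal sketch of what the requested inline comments could look like. The positional indices mirror the script's existing assignments, but the rationale wording is illustrative only, not confirmed by the PR:

```shell
# Hypothetical commented defaults for scripts/benchmarks/gpqa/bench.sh.

# 32768 tokens: leave headroom for DeepSeek-R1's extended chain-of-thought
# on GPQA without truncation (quality over memory use).
max_tokens=${6:-32768}

# 128 threads: cap concurrent sglang requests during batch evaluation
# to keep memory bounded (throughput vs. resource use).
num_threads=${7:-128}

echo "max_tokens=${max_tokens} num_threads=${num_threads}"
```

One concise sentence per assignment, as the prompt suggests, is usually enough; the key is naming the trade-off each value makes.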
Summary by CodeRabbit: New Features, Chores, Documentation