Restore functionality for accuracy benchmarks by Fridge003 · Pull Request #66 · ishandhanani/srt-slurm

Fridge003 · 2025-12-21T00:47:51Z

Summary by CodeRabbit

New Features
- Added support for three new benchmark evaluation types: MMLU, GPQA, and LongBenchV2. Users can now configure and run these benchmarks with adjustable parameters: number of examples, maximum token count, context window size, thread count, and category filters.


coderabbitai · 2025-12-21T00:47:59Z

Caution

Review failed

The pull request is closed.

Walkthrough

Added support for three new benchmark types (mmlu, gpqa, longbenchv2) in the generate_slurm_script function within src/srtctl/core/backend.py. Each benchmark type branch extracts specific configuration parameters from the benchmark_config and constructs a parsable_config dictionary with benchmark-appropriate fields.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Benchmark type handling extension `src/srtctl/core/backend.py` | Added three new conditional branches in `generate_slurm_script` to handle mmlu, gpqa, and longbenchv2 benchmark types. Each branch extracts type-specific parameters (num_examples, max_tokens, repeat, num_threads, and max_context_length/categories where applicable) from benchmark_config into parsable_config. |
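
The branch logic summarized above can be sketched roughly as follows. This is an illustration, not the code from `src/srtctl/core/backend.py`: the helper name `build_parsable_config`, the `type` key, and the choice to give `max_context_length` and `categories` only to longbenchv2 are assumptions, since the summary says only "where applicable".

```python
from typing import Any

# Parameters the summary lists for every new benchmark type.
COMMON_FIELDS = ("num_examples", "max_tokens", "repeat", "num_threads")


def build_parsable_config(benchmark_type: str, benchmark_config: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: copy type-specific fields into a flat dict."""
    if benchmark_type in ("mmlu", "gpqa"):
        fields = COMMON_FIELDS
    elif benchmark_type == "longbenchv2":
        # Assumption: the long-context benchmark is the one that takes
        # the context-length and category options.
        fields = COMMON_FIELDS + ("max_context_length", "categories")
    else:
        raise ValueError(f"unsupported benchmark type: {benchmark_type!r}")

    parsable_config: dict[str, Any] = {"type": benchmark_type}
    for name in fields:
        # Copy only keys the user actually set, so unset options stay absent.
        if name in benchmark_config:
            parsable_config[name] = benchmark_config[name]
    return parsable_config
```

Calling `build_parsable_config("longbenchv2", {"num_examples": 10, "max_context_length": 128000})` would return a dict containing only the type and those two fields. Keys outside the selected field list are ignored.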

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

- All changes follow the existing pattern established by the sa-bench branch, making the additions straightforward and consistent.
- No modifications to public API or error handling logic.
- Single file affected, with homogeneous, repetitive configuration extraction logic.

Possibly related PRs

- Update MMLU/GPQA/LongBench accuracy benchmarks and Add document #63: Adds support for the same new benchmark types (mmlu, gpqa, longbenchv2) by extending SLURM script generation with identical configuration field handling.
- [bench] Add longbenchv2 support for longctx verification #57: Introduces the longbenchv2 benchmark infrastructure, including the BenchmarkType enum and the bench.sh script that pairs with this PR's generate_slurm_script handling.

Poem

🐰 Three new benches hop into view,
mmlu, gpqa, longbench too!
Configs parse with gentle care,
More benchmarks than before to compare.
SRT now tests with greater flair! ✨


📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0147c76 and 2510fbe.

📒 Files selected for processing (1)

src/srtctl/core/backend.py (1 hunks)


Restore functionality for accuracy benchmarks

2510fbe

Fridge003 merged commit 525e771 into main Dec 21, 2025
1 check was pending

coderabbitai bot mentioned this pull request Dec 22, 2025

tiny update acc doc #67

Merged

Fridge003 deleted the acc branch January 6, 2026 14:19

Fridge003 merged 1 commit into main from acc
