Restore functionality for accuracy benchmarks #66

Merged
Fridge003 merged 1 commit into main from acc
Dec 21, 2025
@Fridge003 (Collaborator) commented Dec 21, 2025

Summary by CodeRabbit

  • New Features
    • Added support for three new benchmark evaluation types: MMLU, GPQA, and LongBenchV2. Users can now configure and execute these benchmarks with adjustable parameters including number of examples, maximum token counts, context window size, thread allocation, and category filters, enabling comprehensive model performance evaluation across multiple industry-standard benchmarks.

@coderabbitai (bot, Contributor) commented Dec 21, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Added support for three new benchmark types (mmlu, gpqa, longbenchv2) in the generate_slurm_script function within src/srtctl/core/backend.py. Each benchmark type branch extracts specific configuration parameters from the benchmark_config and constructs a parsable_config dictionary with benchmark-appropriate fields.
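The branch structure described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `build_parsable_config` is a hypothetical standalone helper (the real logic sits inline in `generate_slurm_script`), the assignment of `max_context_length` and `categories` to `longbenchv2` is an assumption, and so are the `None` defaults and the `ValueError` for unknown types.

```python
from typing import Any


def build_parsable_config(benchmark_type: str, benchmark_config: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper mirroring the per-benchmark branches in the walkthrough."""
    # Fields the review summary lists for the accuracy benchmarks.
    common = ("num_examples", "max_tokens", "repeat", "num_threads")

    if benchmark_type == "mmlu":
        keys = common
    elif benchmark_type == "gpqa":
        keys = common
    elif benchmark_type == "longbenchv2":
        # Assumption: the extra fields apply to longbenchv2 only.
        keys = common + ("max_context_length", "categories")
    else:
        # Assumption: the PR does not show how unknown types are handled.
        raise ValueError(f"unsupported benchmark type: {benchmark_type}")

    # Keys missing from benchmark_config become None; real defaults are not shown in the PR.
    return {key: benchmark_config.get(key) for key in keys}


parsable_config = build_parsable_config(
    "longbenchv2", {"num_examples": 50, "max_context_length": 65536}
)
```

In this sketch, `parsable_config` holds all six `longbenchv2` keys, with the unset ones left as `None`.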

Changes

| Cohort / File(s) | Summary |
|---|---|
| Benchmark type handling extension: src/srtctl/core/backend.py | Added three new conditional branches in generate_slurm_script to handle mmlu, gpqa, and longbenchv2 benchmark types. Each branch extracts type-specific parameters (num_examples, max_tokens, repeat, num_threads, and max_context_length/categories where applicable) from benchmark_config into parsable_config. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • All changes follow the existing pattern established by the sa-bench branch, making the additions straightforward and consistent
  • No modifications to public API or error handling logic
  • Single file affected with homogeneous, repetitive configuration extraction logic
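Because the extraction logic is described as homogeneous and repetitive, one way to picture it is as a lookup table from benchmark type to field names. The sketch below is only an illustration of that observation, not what the PR does: `BENCHMARK_FIELDS` and `extract_fields` are hypothetical names, and which benchmark takes `max_context_length` or `categories` is an assumption.

```python
from typing import Any, Mapping

COMMON_FIELDS = ("num_examples", "max_tokens", "repeat", "num_threads")

# Hypothetical table; the per-benchmark extras are assumed, not taken from the PR.
BENCHMARK_FIELDS: dict[str, tuple[str, ...]] = {
    "mmlu": COMMON_FIELDS,
    "gpqa": COMMON_FIELDS,
    "longbenchv2": COMMON_FIELDS + ("max_context_length", "categories"),
}


def extract_fields(benchmark_type: str, benchmark_config: Mapping[str, Any]) -> dict[str, Any]:
    """Copy only the fields that are both listed for the type and present in the config."""
    fields = BENCHMARK_FIELDS[benchmark_type]  # raises KeyError for unknown types
    return {name: benchmark_config[name] for name in fields if name in benchmark_config}
```

Unlike explicit `if`/`elif` branches, this variant drops fields that the config does not set, so whether it fits depends on how downstream code treats missing keys.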

Possibly related PRs

Poem

🐰 Three new benches hop into view,
mmlu, gpqa, longbench too!
Configs parse with gentle care,
More benchmarks than before to compare.
SRT now tests with greater flair! ✨

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0147c76 and 2510fbe.

📒 Files selected for processing (1)
  • src/srtctl/core/backend.py (1 hunks)

@Fridge003 Fridge003 merged commit 525e771 into main Dec 21, 2025
1 check was pending
@coderabbitai coderabbitai bot mentioned this pull request Dec 22, 2025
@Fridge003 Fridge003 deleted the acc branch January 6, 2026 14:19