Added support for three new benchmark evaluation types: MMLU, GPQA, and LongBenchV2. Users can now configure and execute these benchmarks with adjustable parameters including number of examples, maximum token counts, context window size, thread allocation, and category filters, enabling comprehensive model performance evaluation across multiple industry-standard benchmarks.
Added support for three new benchmark types (mmlu, gpqa, longbenchv2) in the generate_slurm_script function within src/srtctl/core/backend.py. Each benchmark type branch extracts specific configuration parameters from the benchmark_config and constructs a parsable_config dictionary with benchmark-appropriate fields.
Changes

| Cohort / File(s) | Summary |
|---|---|
| Benchmark type handling extension: src/srtctl/core/backend.py | Added three new conditional branches in generate_slurm_script to handle mmlu, gpqa, and longbenchv2 benchmark types. Each branch extracts type-specific parameters (num_examples, max_tokens, repeat, num_threads, and max_context_length/categories where applicable) from benchmark_config into parsable_config. |
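
For readers without the diff open, the branch pattern described above might look roughly like the sketch below. This is not the code from src/srtctl/core/backend.py: the helper name `build_parsable_config`, the `"type"` key, and the choice to attach `max_context_length` and `categories` only to longbenchv2 are all assumptions made for illustration. The summary says only that those two fields are extracted "where applicable".

```python
from __future__ import annotations

from typing import Any

# Field names taken from the review summary; shared by all three branches.
COMMON_FIELDS = ("num_examples", "max_tokens", "repeat", "num_threads")


def build_parsable_config(
    benchmark_type: str, benchmark_config: dict[str, Any]
) -> dict[str, Any]:
    """Hypothetical stand-in for the per-benchmark branches in generate_slurm_script."""
    # Copy only the common fields the user actually set.
    common = {k: benchmark_config[k] for k in COMMON_FIELDS if k in benchmark_config}

    if benchmark_type == "mmlu":
        return {"type": "mmlu", **common}
    elif benchmark_type == "gpqa":
        return {"type": "gpqa", **common}
    elif benchmark_type == "longbenchv2":
        # ASSUMPTION: the extra fields belong to this branch only.
        extras = {
            k: benchmark_config[k]
            for k in ("max_context_length", "categories")
            if k in benchmark_config
        }
        return {"type": "longbenchv2", **common, **extras}

    raise ValueError(f"unsupported benchmark type: {benchmark_type!r}")
```

Because the three branches differ only in which keys they copy, a lookup table keyed by benchmark type would be an alternative way to express the same logic. The explicit branches are kept here to mirror the structure the summary describes.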
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~10 minutes
All changes follow the existing pattern established by the sa-bench branch, making the additions straightforward and consistent
No modifications to public API or error handling logic
Single file affected with homogeneous, repetitive configuration extraction logic
🐰 Three new benches hop into view,
mmlu, gpqa, longbench too!
Configs parse with gentle care,
More benchmarks than before to compare.
SRT now tests with greater flair! ✨
📜 Recent review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 0147c76 and 2510fbe.
📒 Files selected for processing (1)
src/srtctl/core/backend.py (1 hunks)
Summary by CodeRabbit