Fix benchmark bugs: GPQA model detection, UTF-8 decode, random_range_ratio by YAMY1234 · Pull Request #194 · ishandhanani/srt-slurm

YAMY1234 · 2026-02-27T08:35:08Z

Summary

- gpqa/bench.sh: Auto-detect model name from /v1/models API endpoint instead of hardcoding deepseek-ai/DeepSeek-R1. Fixes 404 errors when serving models with different names.
- postprocess_stage.py: Use read_text(errors="replace") to handle non-UTF-8 bytes in benchmark output (ANSI escape codes, etc.) that caused UnicodeDecodeError crashes.
- sa_bench.py: Pass random_range_ratio parameter to bench script for configurable input/output length variance.
- schema.py: Add random_range_ratio field to BenchmarkConfig.
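
The decode change in postprocess_stage.py can be illustrated with a minimal sketch. Only the `read_text(errors="replace")` call comes from the description above; the helper name, the file name, the byte content, and the explicit `encoding="utf-8"` argument are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def read_benchmark_output(path: Path) -> str:
    """Read a log as UTF-8, substituting U+FFFD for any undecodable byte."""
    return path.read_text(encoding="utf-8", errors="replace")


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "benchmark.out"  # hypothetical file name
    # 0xFF and 0xFE can never appear in valid UTF-8, so strict decoding fails.
    log.write_bytes(b"throughput: 123.4 tok/s\n\xff\xfe trailing bytes\n")

    strict_failed = False
    try:
        log.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        strict_failed = True

    # The tolerant read keeps the readable text and marks the bad bytes.
    text = read_benchmark_output(log)
```

The trade-off is that the invalid bytes are lost and replaced by U+FFFD, which is usually acceptable when the goal is to extract metrics from otherwise readable output.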

Test plan

- Run GPQA benchmark with a non-default model name and verify model is correctly detected
- Run a benchmark that produces ANSI escape codes in output and verify postprocess does not crash
- Run sa-bench with random_range_ratio set in config

Summary by CodeRabbit

Release Notes

New Features
- Introduced random_range_ratio parameter to customize SA-Bench behavior with sensible defaults
- Added automatic model detection for GPQA benchmark from configured endpoint; gracefully falls back to default if unavailable
Improvements
- Enhanced benchmark output processing with more robust error handling

coderabbitai · 2026-02-27T08:35:24Z

📝 Walkthrough

This PR extends benchmark configuration and model detection capabilities. It adds an optional random_range_ratio parameter to the BenchmarkConfig schema and integrates it into SA-Bench's command builder with a default fallback value. The GPQA benchmark script now automatically detects model names from the endpoint with graceful fallback. File decoding error handling is also improved for robustness.
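
The optional field with a default fallback described in the walkthrough might look roughly like the sketch below. The `BenchmarkConfig` class here is a hypothetical stand-in rather than the real schema, and the field type, the script name, and the argument position are all assumptions; only the field name and the "0.8" default come from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkConfig:
    """Hypothetical stand-in for the real schema; only the new field is shown."""

    random_range_ratio: Optional[float] = None  # type is an assumption


def build_command(config: BenchmarkConfig) -> list[str]:
    """Build an illustrative argument list, using "0.8" when the field is unset."""
    if config.random_range_ratio is None:
        ratio = "0.8"
    else:
        ratio = str(config.random_range_ratio)
    # Intermediate variable, so further arguments can be appended before returning.
    cmd = ["bash", "bench.sh", ratio]  # script name and position are assumptions
    return cmd
```

Checking for `None` explicitly, rather than relying on truthiness, keeps a configured value of `0` from being silently replaced by the default.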

Changes

Cohort / File(s)	Summary
SA-Bench Configuration `src/srtctl/core/schema.py`, `src/srtctl/benchmarks/sa_bench.py`	Added optional `random_range_ratio` field to BenchmarkConfig and integrated into SA-Bench build_command with configurable value or "0.8" default. Refactored command construction to use intermediate variable.
GPQA Benchmark Script `src/srtctl/benchmarks/scripts/gpqa/bench.sh`	Detects the model name automatically by querying the endpoint's `/v1/models` route and extracting the name with Python; on failure, falls back to the default deepseek-ai/DeepSeek-R1 and emits a warning.
Post-processing Error Handling `src/srtctl/cli/mixins/postprocess_stage.py`	Updated benchmark result extraction to decode raw output with `errors="replace"` for handling undecodable bytes gracefully.
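
The model detection in the GPQA row can be sketched in Python as follows. This is not the code from bench.sh: the function names are hypothetical, and the sketch assumes the server returns the OpenAI-style list shape (`{"data": [{"id": ...}]}`), which the PR does not spell out.

```python
import json
import sys
import urllib.request

DEFAULT_MODEL = "deepseek-ai/DeepSeek-R1"  # fallback named in this PR


def extract_model_name(payload: str, default: str = DEFAULT_MODEL) -> str:
    """Return the first model id in a /v1/models response body, or the default."""
    try:
        models = json.loads(payload)["data"]
        return str(models[0]["id"])
    except (ValueError, KeyError, IndexError, TypeError):
        # Malformed JSON, missing keys, or an empty list all end up here.
        print(f"warning: could not detect model, using {default}", file=sys.stderr)
        return default


def detect_model(endpoint: str, timeout: float = 5.0) -> str:
    """Query <endpoint>/v1/models; fall back to the default if the request fails."""
    try:
        with urllib.request.urlopen(f"{endpoint}/v1/models", timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        body = ""  # unreachable server or bad URL: let the parser fall back
    return extract_model_name(body)
```

Separating the parsing from the HTTP request keeps the fallback logic testable without a running server.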

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

- Making random-range-ratio configurable #186: Introduces RANDOM_RANGE_RATIO parameter in sa-bench bench.sh script; directly parallels this PR's SA-Bench configuration enhancement.
- Update MMLU/GPQA/LongBench accuracy benchmarks and Add document #63: Modifies BenchmarkConfig dataclass with additional fields; overlaps with schema extensions introduced here.

Poem

🐰 A config field hops into place,
Models detected with style and grace,
Random ratios now configurable bright,
Error handling smooth—what a delight!
Benchmarks now flexible, fluffy, and right! 🎯

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes all three main bug fixes addressed in the pull request: GPQA model detection, UTF-8 decode error handling, and random_range_ratio parameter support.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/srtctl/cli/mixins/postprocess_stage.py (2)

274-274: ⚠️ Potential issue | 🟡 Minor

Type mismatch: container_mounts expects dict[Path, Path], not dict[str, str].

The pipeline failure indicates start_srun_process requires dict[Path, Path] | None for container_mounts, but this passes string keys/values.

Proposed fix

             proc = start_srun_process(
                 command=["bash", "-c", script],
                 nodelist=[self.runtime.nodes.head],
                 output=str(self.runtime.log_dir / "postprocess.log"),
                 container_image="python:3.11",
-                container_mounts={str(self.runtime.log_dir): "/logs"},
+                container_mounts={self.runtime.log_dir: Path("/logs")},
                 env_to_set=env,
             )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/srtctl/cli/mixins/postprocess_stage.py` at line 274, The call is passing
strings for container_mounts but start_srun_process expects dict[Path, Path] |
None; change the mapping to use pathlib.Path objects (e.g.
Path(self.runtime.log_dir) as the key and Path("/logs") as the value) before
passing it to start_srun_process, and add an import for pathlib.Path if missing;
ensure the resulting container_mounts variable is typed/constructed as
dict[Path, Path] (or None when appropriate) so the signature matches.
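
The type mismatch flagged above can be reproduced in isolation with a small sketch. The `start_srun_process` defined here is a hypothetical stub that keeps only the `container_mounts` parameter; the real function's other parameters and behavior are not shown, and the log directory path is invented for the example.

```python
from pathlib import Path
from typing import Optional


def start_srun_process(
    container_mounts: Optional[dict[Path, Path]] = None,
) -> list[str]:
    """Hypothetical stub: render each mount as a host:container string."""
    mounts = container_mounts or {}
    return [f"{host}:{target}" for host, target in mounts.items()]


log_dir = Path("/tmp/run/logs")  # stands in for self.runtime.log_dir

# Path keys and values match the annotated signature; the earlier
# {str(log_dir): "/logs"} form is what a static type checker would reject.
rendered = start_srun_process(container_mounts={log_dir: Path("/logs")})
```

Python does not enforce the annotation at runtime, so the string-keyed version would still run; the mismatch only surfaces under a type checker, which matches the pipeline failure the review mentions.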

416-416: ⚠️ Potential issue | 🟡 Minor

Type mismatch: Same issue as line 274.

Proposed fix

             proc = start_srun_process(
                 command=["bash", "-c", script],
                 nodelist=[self.runtime.nodes.head],
                 output=str(analysis_log),
                 container_image="python:3.11",
-                container_mounts={str(self.runtime.log_dir): "/logs"},
+                container_mounts={self.runtime.log_dir: Path("/logs")},
                 env_to_set=env_to_set,
             )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/srtctl/cli/mixins/postprocess_stage.py` at line 416, The container_mounts
entry currently uses str(self.runtime.log_dir) which causes a typing mismatch;
update the mapping to use the expected path type by passing self.runtime.log_dir
(not str()) so the container_mounts parameter and the runtime.log_dir attribute
types align (same fix as applied at the earlier occurrence around line 274);
adjust any related annotations if needed to ensure container_mounts accepts the
runtime path type.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between acc6dd3 and 2199509.

📒 Files selected for processing (4)

src/srtctl/benchmarks/sa_bench.py
src/srtctl/benchmarks/scripts/gpqa/bench.sh
src/srtctl/cli/mixins/postprocess_stage.py
src/srtctl/core/schema.py

Commit 2199509: Fix benchmark bugs: GPQA model detection, UTF-8 decode, random_range_ratio

coderabbitai bot reviewed Feb 27, 2026

YAMY1234 merged commit 6c7f853 into ishandhanani:main Mar 5, 2026
5 checks passed

YAMY1234 merged 1 commit into ishandhanani:main from YAMY1234:fix/benchmark-bugfixes
