Fix benchmark bugs: GPQA model detection, UTF-8 decode, random_range_ratio #194

Merged
YAMY1234 merged 1 commit into ishandhanani:main from YAMY1234:fix/benchmark-bugfixes
Mar 5, 2026

Conversation

@YAMY1234
Collaborator

@YAMY1234 YAMY1234 commented Feb 27, 2026

Summary

  • gpqa/bench.sh: Auto-detect model name from /v1/models API endpoint instead of hardcoding deepseek-ai/DeepSeek-R1. Fixes 404 errors when serving models with different names.
  • postprocess_stage.py: Use read_text(errors="replace") to handle non-UTF-8 bytes in benchmark output (ANSI escape codes, etc.) that caused UnicodeDecodeError crashes.
  • sa_bench.py: Pass random_range_ratio parameter to bench script for configurable input/output length variance.
  • schema.py: Add random_range_ratio field to BenchmarkConfig.
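
The model auto-detection described in the first bullet can be sketched in Python. This is a minimal illustration, not the code in bench.sh: the function name `extract_model_name` is hypothetical, and the response shape (a `data` list whose entries carry an `id`) is assumed to follow the usual OpenAI-compatible `/v1/models` format. Only the fallback value comes from the PR.

```python
import json

# Fallback model name mentioned in the PR description.
DEFAULT_MODEL = "deepseek-ai/DeepSeek-R1"


def extract_model_name(models_json: str, default: str = DEFAULT_MODEL) -> str:
    """Return the first model id from a /v1/models response body.

    Hypothetical helper: falls back to `default` when the body is not
    valid JSON or does not have the assumed {"data": [{"id": ...}]} shape.
    """
    try:
        payload = json.loads(models_json)
        return payload["data"][0]["id"]
    except (ValueError, KeyError, IndexError, TypeError):
        # Malformed JSON, missing keys, or an empty model list.
        return default


# A well-formed response yields the served model name.
print(extract_model_name('{"data": [{"id": "my-org/my-model"}]}'))
# An unusable response yields the default instead of raising.
print(extract_model_name("not json"))
```

Catching the parse and lookup errors in one place keeps the fallback path explicit, which matches the "falls back to default if unavailable" behavior the PR describes.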

Test plan

  • Run GPQA benchmark with a non-default model name and verify model is correctly detected
  • Run a benchmark that produces ANSI escape codes in output and verify postprocess does not crash
  • Run sa-bench with random_range_ratio set in config

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced random_range_ratio parameter to customize SA-Bench behavior with sensible defaults
    • Added automatic model detection for GPQA benchmark from configured endpoint; gracefully falls back to default if unavailable
  • Improvements

    • Enhanced benchmark output processing with more robust error handling

@coderabbitai
Contributor

coderabbitai bot commented Feb 27, 2026

📝 Walkthrough

This PR extends benchmark configuration and model detection capabilities. It adds an optional random_range_ratio parameter to the BenchmarkConfig schema and integrates it into SA-Bench's command builder with a default fallback value. The GPQA benchmark script now automatically detects model names from the endpoint with graceful fallback. File decoding error handling is also improved for robustness.
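
A rough sketch of how an optional field with a default fallback might be wired into a command builder is shown here. The class `BenchmarkConfigSketch`, the script name `bench.sh`, and the `--random-range-ratio` flag are placeholders for illustration and are not taken from the PR; only the field name `random_range_ratio` and the "0.8" default appear in the source.

```python
from dataclasses import dataclass
from typing import Optional

# Default cited in the walkthrough; kept as a string because it ends up in argv.
DEFAULT_RANDOM_RANGE_RATIO = "0.8"


@dataclass
class BenchmarkConfigSketch:
    """Hypothetical stand-in for BenchmarkConfig; only the new field is shown."""

    random_range_ratio: Optional[float] = None


def build_command(config: BenchmarkConfigSketch) -> list[str]:
    """Build an argv list, using the configured ratio or the default."""
    if config.random_range_ratio is not None:
        ratio = str(config.random_range_ratio)
    else:
        ratio = DEFAULT_RANDOM_RANGE_RATIO
    # Script name and flag are placeholders, not the real sa-bench interface.
    cmd = ["bash", "bench.sh", "--random-range-ratio", ratio]
    return cmd


print(build_command(BenchmarkConfigSketch()))
print(build_command(BenchmarkConfigSketch(random_range_ratio=0.5)))
```

Checking `is not None` rather than truthiness means an explicit ratio of 0.0 is passed through instead of being silently replaced by the default.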

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| SA-Bench Configuration | src/srtctl/core/schema.py, src/srtctl/benchmarks/sa_bench.py | Added optional random_range_ratio field to BenchmarkConfig and integrated it into SA-Bench build_command, using the configured value or a "0.8" default. Refactored command construction to use an intermediate variable. |
| GPQA Benchmark Script | src/srtctl/benchmarks/scripts/gpqa/bench.sh | Implements automatic model name detection by querying the endpoint's /v1/models route and extracting the name with Python; falls back to the deepseek-ai/DeepSeek-R1 default, with a warning, on failure. |
| Post-processing Error Handling | src/srtctl/cli/mixins/postprocess_stage.py | Updated benchmark result extraction to decode raw output with errors="replace" so that undecodable bytes are handled gracefully. |
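
The effect of `errors="replace"` can be seen in a small standalone example. The file name and sample bytes below are invented for illustration, and `encoding="utf-8"` is passed explicitly here so the result does not depend on the locale; the PR itself only mentions `read_text(errors="replace")`.

```python
import tempfile
from pathlib import Path

# Sample output: ANSI color codes plus two bytes (0xFF, 0xFE) that are
# never valid in UTF-8. The invalid bytes are what break strict decoding.
raw = b"\x1b[32mthroughput: 123.4\x1b[0m \xff\xfe done\n"

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "benchmark.out"  # hypothetical file name
    log_path.write_bytes(raw)

    # Strict decoding (the default error mode) raises on the invalid bytes.
    try:
        log_path.read_text(encoding="utf-8")
        strict_ok = True
    except UnicodeDecodeError:
        strict_ok = False

    # errors="replace" substitutes U+FFFD for each undecodable byte.
    text = log_path.read_text(encoding="utf-8", errors="replace")

print("strict decode succeeded:", strict_ok)
print("replacement character present:", "\ufffd" in text)
print("metric line still readable:", "throughput: 123.4" in text)
```

The trade-off is that the replaced bytes are lost, which is usually acceptable when the goal is to extract metrics from otherwise readable log text.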

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A config field hops into place,
Models detected with style and grace,
Random ratios now configurable bright,
Error handling smooth—what a delight!
Benchmarks now flexible, fluffy, and right! 🎯

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes all three main bug fixes addressed in the pull request: GPQA model detection, UTF-8 decode error handling, and random_range_ratio parameter support. |


Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/srtctl/cli/mixins/postprocess_stage.py (2)

274-274: ⚠️ Potential issue | 🟡 Minor

Type mismatch: container_mounts expects dict[Path, Path], not dict[str, str].

The pipeline failure indicates start_srun_process requires dict[Path, Path] | None for container_mounts, but this passes string keys/values.

Proposed fix
             proc = start_srun_process(
                 command=["bash", "-c", script],
                 nodelist=[self.runtime.nodes.head],
                 output=str(self.runtime.log_dir / "postprocess.log"),
                 container_image="python:3.11",
-                container_mounts={str(self.runtime.log_dir): "/logs"},
+                container_mounts={self.runtime.log_dir: Path("/logs")},
                 env_to_set=env,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/srtctl/cli/mixins/postprocess_stage.py` at line 274, The call is passing
strings for container_mounts but start_srun_process expects dict[Path, Path] |
None; change the mapping to use pathlib.Path objects (e.g.
Path(self.runtime.log_dir) as the key and Path("/logs") as the value) before
passing it to start_srun_process, and add an import for pathlib.Path if missing;
ensure the resulting container_mounts variable is typed/constructed as
dict[Path, Path] (or None when appropriate) so the signature matches.

416-416: ⚠️ Potential issue | 🟡 Minor

Type mismatch: Same issue as line 274.

Proposed fix
             proc = start_srun_process(
                 command=["bash", "-c", script],
                 nodelist=[self.runtime.nodes.head],
                 output=str(analysis_log),
                 container_image="python:3.11",
-                container_mounts={str(self.runtime.log_dir): "/logs"},
+                container_mounts={self.runtime.log_dir: Path("/logs")},
                 env_to_set=env_to_set,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/srtctl/cli/mixins/postprocess_stage.py` at line 416, The container_mounts
entry currently uses str(self.runtime.log_dir) which causes a typing mismatch;
update the mapping to use the expected path type by passing self.runtime.log_dir
(not str()) so the container_mounts parameter and the runtime.log_dir attribute
types align (same fix as applied at the earlier occurrence around line 274);
adjust any related annotations if needed to ensure container_mounts accepts the
runtime path type.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between acc6dd3 and 2199509.

📒 Files selected for processing (4)
  • src/srtctl/benchmarks/sa_bench.py
  • src/srtctl/benchmarks/scripts/gpqa/bench.sh
  • src/srtctl/cli/mixins/postprocess_stage.py
  • src/srtctl/core/schema.py

@YAMY1234 YAMY1234 merged commit 6c7f853 into ishandhanani:main Mar 5, 2026
5 checks passed