fix(ci): #537 — warn + catastrophic-floor for LLM-driven eval gates#539
Conversation
Three ubuntu-only hard gates in test-mcp-regression.yml are driven by a non-deterministic caller LLM (claude-haiku-4-5-20251001) and flake-block unrelated PRs: M2 grounding-recall, M_skill_preflight Step-0/Step-1. Proof: M2 recall 0.783 on docs-only #536 vs 0.957 on #534/#535, identical main base. Per the gating-as-observability doctrine, flip the three steps to --gate-mode warn (quality thresholds advisory; metrics still emit to the run summary) and add a catastrophic floor that hard-fails any mode only on a genuine collapse: - tests/eval/_gate.py: shared gate_exit_code() two-tier policy + is_inconclusive() abstention helper. - Each eval gains --catastrophic-recall (M2 0.50 / step1 0.40 / step0 0.25). - Grounding eval abstains from the floor when >=50% of cases error (is_inconclusive): the #536 re-run scored recall 0.000 because every case errored on a missing API key — inconclusive, not a grounding collapse. Preflight evals already skip no-key cases, so only grounding needed the guard. - tests/test_eval_gate_two_tier.py: 11 behavioral tests for both helpers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR implements a two-tier CI gating policy that decouples deterministic catastrophic-failure detection from probabilistic quality-threshold enforcement. Hard-gated eval steps now warn on quality-breach conditions while reserving hard failure for genuine evaluation collapse, reducing flakiness from LLM caller variance. ChangesTwo-tier gating policy and eval runner integration
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
.github/workflows/test-mcp-regression.yml (2)
336-354: 💤 Low valueStep changes are consistent; same maintainability note as M2.
The step name updates and
--gate-mode warnswitches are correct for both Step-1 and Step-0. Both steps now rely on their respective runner defaults (--catastrophic-recall 0.40for Step-1,0.25for Step-0) rather than passing the values explicitly. This is the same pattern as M2 above—functionally correct but creates an implicit contract between the workflow comments (line 334) and the runner script defaults.If explicit values are preferred for maintainability (as suggested for M2), the same pattern applies here:
♻️ Optional refactor to make catastrophic thresholds explicit
Step-1:
run: > python tests/eval_preflight_skill_step1.py --gate-mode warn + --catastrophic-recall 0.40 -o test-results/skill-step1.jsonStep-0:
run: > python tests/eval_preflight_skill_invocation.py --gate-mode warn + --catastrophic-recall 0.25 -o test-results/skill-step0.json🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/test-mcp-regression.yml around lines 336 - 354, The workflow steps "M_skill_preflight Step-1 (warn + catastrophic floor)" and "M_skill_preflight Step-0 (warn + catastrophic floor)" rely on runner defaults for catastrophic thresholds; make the thresholds explicit by adding the corresponding --catastrophic-recall flags to each run command (use the Step-1 flag value 0.40 and the Step-0 flag value 0.25) so the commands invoking tests/eval_preflight_skill_step1.py and tests/eval_preflight_skill_invocation.py include --catastrophic-recall 0.40 and --catastrophic-recall 0.25 respectively, removing the implicit dependency on runner defaults and keeping the step names/gate-mode unchanged.
254-262: 💤 Low valueConsider passing
--catastrophic-recallexplicitly for maintainability.The step name change and
--gate-mode warnswitch are correct. However, the workflow relies on the default--catastrophic-recall 0.50ineval_grounding_recall.pyrather than passing it explicitly. While functional, this creates an implicit contract: if someone changes the runner's default, the comment on line 252 becomes stale without the workflow itself breaking.Passing the value explicitly would make the configuration self-documenting:
♻️ Optional refactor to make catastrophic threshold explicit
run: > python tests/eval_grounding_recall.py --gate-mode warn + --catastrophic-recall 0.50 -o test-results/m2-grounding-recall.jsonThis is a style/maintainability tradeoff—defaults keep the workflow concise, but explicit values prevent drift between comments and code.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/test-mcp-regression.yml around lines 254 - 262, The workflow step invoking tests/eval_grounding_recall.py relies on the script's default catastrophic threshold; update the run command that calls eval_grounding_recall.py (the line containing "--gate-mode warn") to pass the explicit flag "--catastrophic-recall 0.50" so the CI config is self-documenting and won't break if the script default changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/eval_grounding_recall.py`:
- Around line 375-381: After parsing CLI flags, validate the thresholds so
--catastrophic-recall (parsed as catastrophic_recall) is within [0.0,1.0],
--min-recall (parsed as min_recall) is within [0.0,1.0], and catastrophic_recall
<= min_recall; if any check fails, print a clear error and exit non‑zero. Locate
the argument definitions added via p.add_argument("--catastrophic-recall", ...)
and the corresponding "--min-recall" arg, perform this validation immediately
after parser.parse_args() (or in the function that handles args) and reject
invalid combinations up front to prevent the two-tier contract from being
violated.
In `@tests/eval_preflight_skill_invocation.py`:
- Around line 89-95: Reject invalid threshold combinations by validating parsed
args: after argument parsing, check that 0 <= args.catastrophic_recall <=
args.min_recall <= 1 and that 0 <= args.max_fp_rate <= 1, and raise/exit with a
clear error if violated; update the validation logic around where thresholds are
used (e.g., before breach computation and in any function that reads
catastrophic_recall, min_recall, max_fp_rate) so downstream code (breach
computation) never runs with inverted or out-of-range thresholds.
In `@tests/eval_preflight_skill_step1.py`:
- Around line 80-86: The CLI currently accepts --catastrophic-recall without
bounds or ordering checks; after parsing args (using parser and the parsed names
catastrophic_recall and min_recall), validate that catastrophic_recall is within
[0.0, 1.0] and that catastrophic_recall <= min_recall; if a check fails call
parser.error with a clear message so arg parsing exits cleanly. Add these checks
immediately after args = parser.parse_args() (or equivalent) to enforce the
range and the ordering between --catastrophic-recall and --min-recall.
---
Nitpick comments:
In @.github/workflows/test-mcp-regression.yml:
- Around line 336-354: The workflow steps "M_skill_preflight Step-1 (warn +
catastrophic floor)" and "M_skill_preflight Step-0 (warn + catastrophic floor)"
rely on runner defaults for catastrophic thresholds; make the thresholds
explicit by adding the corresponding --catastrophic-recall flags to each run
command (use the Step-1 flag value 0.40 and the Step-0 flag value 0.25) so the
commands invoking tests/eval_preflight_skill_step1.py and
tests/eval_preflight_skill_invocation.py include --catastrophic-recall 0.40 and
--catastrophic-recall 0.25 respectively, removing the implicit dependency on
runner defaults and keeping the step names/gate-mode unchanged.
- Around line 254-262: The workflow step invoking tests/eval_grounding_recall.py
relies on the script's default catastrophic threshold; update the run command
that calls eval_grounding_recall.py (the line containing "--gate-mode warn") to
pass the explicit flag "--catastrophic-recall 0.50" so the CI config is
self-documenting and won't break if the script default changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1732d192-67c1-40a3-8478-113b6e5739f7
📒 Files selected for processing (6)
.github/workflows/test-mcp-regression.ymltests/eval/_gate.pytests/eval_grounding_recall.pytests/eval_preflight_skill_invocation.pytests/eval_preflight_skill_step1.pytests/test_eval_gate_two_tier.py
| p.add_argument( | ||
| "--catastrophic-recall", | ||
| type=float, | ||
| default=0.50, | ||
| help="Hard floor (#537): recall below this hard-fails CI in any gate mode " | ||
| "(a collapsed grounding path, not LLM variance). Default 0.50.", | ||
| ) |
There was a problem hiding this comment.
Validate catastrophic-floor ordering at parse time.
--catastrophic-recall is unconstrained here, so it can be set above --min-recall. That breaks the two-tier contract by making warn mode hard-fail scores that the quality gate treats as passing, and in this runner can even print ✓ all gates pass before the catastrophic failure line. Reject invalid threshold combinations (and out-of-range percentages) up front.
Suggested guard
args = p.parse_args()
+ for flag in ("min_recall", "min_precision", "max_abort_rate", "catastrophic_recall"):
+ value = getattr(args, flag)
+ if not 0.0 <= value <= 1.0:
+ p.error(f"--{flag.replace('_', '-')} must be between 0 and 1")
+ if args.catastrophic_recall > args.min_recall:
+ p.error("--catastrophic-recall must be less than or equal to --min-recall")
_, exit_code = asyncio.run(run(args))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/eval_grounding_recall.py` around lines 375 - 381, After parsing CLI
flags, validate the thresholds so --catastrophic-recall (parsed as
catastrophic_recall) is within [0.0,1.0], --min-recall (parsed as min_recall) is
within [0.0,1.0], and catastrophic_recall <= min_recall; if any check fails,
print a clear error and exit non‑zero. Locate the argument definitions added via
p.add_argument("--catastrophic-recall", ...) and the corresponding
"--min-recall" arg, perform this validation immediately after
parser.parse_args() (or in the function that handles args) and reject invalid
combinations up front to prevent the two-tier contract from being violated.
| parser.add_argument( | ||
| "--catastrophic-recall", | ||
| type=float, | ||
| default=0.25, | ||
| help="Hard floor (#537): should-invoke recall below this hard-fails CI in " | ||
| "any gate mode (a collapsed invocation path, not LLM variance). Default 0.25.", | ||
| ) |
There was a problem hiding this comment.
Reject threshold combinations that invert the warn/hard policy.
This flag is not validated against --min-recall, so a caller can set a catastrophic floor above the advisory recall gate and turn warn mode back into a hard gate for non-catastrophic results. Please enforce 0 <= catastrophic_recall <= min_recall <= 1 (and keep max_fp_rate in [0,1]) before computing breaches.
Suggested guard
args = parser.parse_args()
+ for flag in ("min_recall", "max_fp_rate", "catastrophic_recall"):
+ value = getattr(args, flag)
+ if not 0.0 <= value <= 1.0:
+ parser.error(f"--{flag.replace('_', '-')} must be between 0 and 1")
+ if args.catastrophic_recall > args.min_recall:
+ parser.error("--catastrophic-recall must be less than or equal to --min-recall")
rows = [json.loads(line) for line in DATASET.read_text().splitlines() if line.strip()]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/eval_preflight_skill_invocation.py` around lines 89 - 95, Reject
invalid threshold combinations by validating parsed args: after argument
parsing, check that 0 <= args.catastrophic_recall <= args.min_recall <= 1 and
that 0 <= args.max_fp_rate <= 1, and raise/exit with a clear error if violated;
update the validation logic around where thresholds are used (e.g., before
breach computation and in any function that reads catastrophic_recall,
min_recall, max_fp_rate) so downstream code (breach computation) never runs with
inverted or out-of-range thresholds.
| parser.add_argument( | ||
| "--catastrophic-recall", | ||
| type=float, | ||
| default=0.40, | ||
| help="Hard floor (#537): overall recall below this hard-fails CI in any " | ||
| "gate mode (a collapsed preflight path, not LLM variance). Default 0.40.", | ||
| ) |
There was a problem hiding this comment.
Enforce a sane catastrophic-floor range.
--catastrophic-recall is accepted without any bounds or ordering check. If someone tunes it above --min-recall, this runner will hard-fail warn-mode runs that only cleared the advisory gate, which defeats the two-tier policy. Validate both the [0,1] range and catastrophic_recall <= min_recall when parsing args.
Suggested guard
args = parser.parse_args()
+ for flag in ("min_recall", "catastrophic_recall"):
+ value = getattr(args, flag)
+ if not 0.0 <= value <= 1.0:
+ parser.error(f"--{flag.replace('_', '-')} must be between 0 and 1")
+ if args.catastrophic_recall > args.min_recall:
+ parser.error("--catastrophic-recall must be less than or equal to --min-recall")
rows = [json.loads(line) for line in DATASET.read_text().splitlines() if line.strip()]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/eval_preflight_skill_step1.py` around lines 80 - 86, The CLI currently
accepts --catastrophic-recall without bounds or ordering checks; after parsing
args (using parser and the parsed names catastrophic_recall and min_recall),
validate that catastrophic_recall is within [0.0, 1.0] and that
catastrophic_recall <= min_recall; if a check fails call parser.error with a
clear message so arg parsing exits cleanly. Add these checks immediately after
args = parser.parse_args() (or equivalent) to enforce the range and the ordering
between --catastrophic-recall and --min-recall.
What
Make the three LLM-driven hard CI gates robust to caller-LLM nondeterminism. Closes #537.
M2 grounding-recall,M_skill_preflight Step-0,M_skill_preflight Step-1are ubuntu-only gates driven byclaude-haiku-4-5-20251001. Their thresholds sit inside the LLM variance band, so they flake-block unrelated PRs (windows skips them → only ubuntu reds).Proof of flake (identical
mainbase):How (Option C — warn + catastrophic floor)
tests/eval/_gate.py(new):gate_exit_code(quality_breaches, catastrophic_breaches, gate_mode)two-tier policy +is_inconclusive(error_count, total)abstention helper.--catastrophic-recall(M20.50/ step10.40/ step00.25). Quality thresholds become warn-only (workflow--gate-mode hard→warn); metrics still emit to the run summary unconditionally. The catastrophic floor hard-fails any mode on a genuine collapse.is_inconclusive): the docs(research): #148 — capturing implicit agent-authored decisions #536 re-run scored recall0.000because every case errored on a missing API key — inconclusive, not a grounding collapse. Preflight evals already skip no-key cases (recallNone→ no breach), so only grounding needed the guard.tests/test_eval_gate_two_tier.py(new): 11 behavioral tests.Verification (local)
pytest tests/test_eval_gate_two_tier.py→ 11 passed ·ruff checkclean ·ruff format --checkclean · all edited filespy_compileclean. Full LLM evals run in CI.Notes
agent-needs-human. Threshold values (catastrophic floors, the 0.5 inconclusive ceiling) are judgment calls worth a human eye.🤖 Generated with Claude Code
Summary by CodeRabbit
Chores
Tests