fix(ci): #537 — warn + catastrophic-floor for LLM-driven eval gates by Knapp-Kevin · Pull Request #539 · BicameralAI/bicameral-mcp

Knapp-Kevin · 2026-06-03T19:51:01Z

What

Make the three LLM-driven hard CI gates robust to caller-LLM nondeterminism. Closes #537.

M2 grounding-recall, M_skill_preflight Step-0, and M_skill_preflight Step-1 are ubuntu-only gates driven by claude-haiku-4-5-20251001. Their thresholds sit inside the LLM variance band, so they flake-block unrelated PRs (Windows skips them, so only the ubuntu jobs go red).

Proof of flake (identical main base):

PR	diff	M2 recall	gate ≥0.8
#536	one markdown file	0.783	❌
#535	—	0.957	✅
#534	—	0.957	✅
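
To see why a 0.8 gate is fragile here, the arithmetic below may help. It is only a sketch: the PR does not state the dataset size, so `TOTAL = 23` is a hypothetical value, chosen because 18/23 and 22/23 round to the reported 0.783 and 0.957. Under that assumption, one extra missed case is enough to move a run across the gate.

```python
from fractions import Fraction
from typing import Tuple

GATE = Fraction(4, 5)  # the M2 quality gate from the table: recall >= 0.8
TOTAL = 23  # hypothetical case count; the PR does not state the dataset size


def recall_and_verdict(hits: int, total: int = TOTAL) -> Tuple[float, bool]:
    """Return (recall rounded to 3 places, whether it clears the gate)."""
    recall = Fraction(hits, total)
    return round(float(recall), 3), recall >= GATE


if __name__ == "__main__":
    # 18 hits is consistent with #536; 22 hits is consistent with #534/#535.
    for hits in (18, 19, 22):
        print(hits, recall_and_verdict(hits))
```

With these assumed counts, 18 hits fails the gate and 19 hits passes it, so the pass/fail boundary sits a single case away from the failing run.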

How (Option C — warn + catastrophic floor)

tests/eval/_gate.py (new): gate_exit_code(quality_breaches, catastrophic_breaches, gate_mode) two-tier policy + is_inconclusive(error_count, total) abstention helper.
Each eval gains --catastrophic-recall (M2 0.50 / step1 0.40 / step0 0.25). Quality thresholds become warn-only (workflow --gate-mode hard→warn); metrics still emit to the run summary unconditionally. The catastrophic floor hard-fails any mode on a genuine collapse.
The grounding eval abstains from the floor when ≥50% of cases error (is_inconclusive): the #536 re-run scored recall 0.000 because every case errored on a missing API key, which is inconclusive, not a grounding collapse. Preflight evals already skip no-key cases (recall None → no breach), so only grounding needed the guard.
tests/test_eval_gate_two_tier.py (new): 11 behavioral tests.
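
A minimal sketch of how the two helpers described above could behave. This is not the code in tests/eval/_gate.py: the function names and the first arguments come from this description, while the `max_error_rate` parameter, the breach-list element type, and the exact boundary handling are assumptions made for illustration.

```python
from typing import Sequence


def gate_exit_code(
    quality_breaches: Sequence[str],
    catastrophic_breaches: Sequence[str],
    gate_mode: str,
) -> int:
    """Return a process exit code under the two-tier policy (sketch)."""
    # Tier 1: a catastrophic breach fails the run in every gate mode.
    if catastrophic_breaches:
        return 1
    # Tier 2: quality breaches fail only when the gate is explicitly hard.
    if quality_breaches and gate_mode == "hard":
        return 1
    return 0


def is_inconclusive(error_count: int, total: int, max_error_rate: float = 0.5) -> bool:
    """True when too few cases ran cleanly to trust the metric (sketch).

    max_error_rate is a hypothetical parameter standing in for the
    0.5 inconclusive ceiling mentioned in the notes.
    """
    # No cases ran at all: nothing can be concluded.
    if total <= 0:
        return True
    # Assumed boundary: an error rate at or above the ceiling is inconclusive.
    return error_count / total >= max_error_rate
```

Under this sketch, a quality breach in warn mode exits 0, the same breach in hard mode exits 1, and any catastrophic breach exits 1 regardless of mode.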

Verification (local)

pytest tests/test_eval_gate_two_tier.py → 11 passed · ruff check clean · ruff format --check clean · all edited files py_compile clean. Full LLM evals run in CI.

Notes

Do not merge — agent-needs-human. Threshold values (catastrophic floors, the 0.5 inconclusive ceiling) are judgment calls worth a human eye.
After merge, re-run #536 to pick up the warn gates.

🤖 Generated with Claude Code

Summary by CodeRabbit

Chores
- Updated CI evaluation workflows with improved gating logic to reduce false failures while maintaining critical safety checks
- Enhanced infrastructure failure detection to distinguish from genuine evaluation issues
Tests
- Added comprehensive test suite for evaluation gate policy

Three ubuntu-only hard gates in test-mcp-regression.yml are driven by a non-deterministic caller LLM (claude-haiku-4-5-20251001) and flake-block unrelated PRs: M2 grounding-recall, M_skill_preflight Step-0/Step-1. Proof: M2 recall 0.783 on docs-only #536 vs 0.957 on #534/#535, identical main base.

Per the gating-as-observability doctrine, flip the three steps to --gate-mode warn (quality thresholds advisory; metrics still emit to the run summary) and add a catastrophic floor that hard-fails any mode only on a genuine collapse:

- tests/eval/_gate.py: shared gate_exit_code() two-tier policy + is_inconclusive() abstention helper.
- Each eval gains --catastrophic-recall (M2 0.50 / step1 0.40 / step0 0.25).
- Grounding eval abstains from the floor when >=50% of cases error (is_inconclusive): the #536 re-run scored recall 0.000 because every case errored on a missing API key, which is inconclusive, not a grounding collapse. Preflight evals already skip no-key cases, so only grounding needed the guard.
- tests/test_eval_gate_two_tier.py: 11 behavioral tests for both helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-03T19:51:14Z

Walkthrough

This PR implements a two-tier CI gating policy that decouples deterministic catastrophic-failure detection from probabilistic quality-threshold enforcement. Hard-gated eval steps now warn on quality-breach conditions while reserving hard failure for genuine evaluation collapse, reducing flakiness from LLM caller variance.

Changes

Two-tier gating policy and eval runner integration

Layer / File(s)	Summary
Two-tier gate policy helpers `tests/eval/_gate.py`	New `gate_exit_code()` function computes exit code from two breach tiers: catastrophic breaches always exit 1, quality breaches exit 1 only in hard mode. New `is_inconclusive()` function detects runs with no executed cases or excessive error rate to prevent misclassification of infrastructure failures as grounding collapse.
Gate policy behavioral tests `tests/test_eval_gate_two_tier.py`	Test suite validates `gate_exit_code()` results across warn/hard modes for clean, quality-breach, and catastrophic-breach scenarios, and validates `is_inconclusive()` boundary conditions and all-errored/no-cases-run detection.
Grounding-recall runner integration `tests/eval_grounding_recall.py`	Imports gate helpers; computes per-run `eval_error_rate` and uses `is_inconclusive()` to skip catastrophic classification when cases failed to execute; adds `--catastrophic-recall` threshold (default 0.50); reports gate breaches with mode notes; delegates exit code to centralized `gate_exit_code()`.
Skill invocation (Step-0) runner integration `tests/eval_preflight_skill_invocation.py`	Imports gate helpers; adds `--catastrophic-recall` threshold (default 0.25); computes and reports separate catastrophic-breaches list in JSON payload; delegates exit code to `gate_exit_code()` with both quality and catastrophic breach lists.
Skill step1 (Step-1) runner integration `tests/eval_preflight_skill_step1.py`	Imports gate helpers; adds `--catastrophic-recall` threshold (default 0.40) based on overall recall; computes catastrophic breaches separately; includes both breach types in JSON payload; delegates exit code to `gate_exit_code()` instead of inline hard-fail logic.
Workflow gate-mode adjustments `.github/workflows/test-mcp-regression.yml`	M2 grounding-recall and M_skill_preflight steps (Step-0 and Step-1) switch from `--gate-mode hard` to `--gate-mode warn`; step names and notes updated to reflect "warn + catastrophic floor" semantics; relying on newly introduced catastrophic thresholds for deterministic hard-failure behavior.
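
The runner integration summarized in the table could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in tests/eval_grounding_recall.py: `classify_recall`, `exit_code`, and `INCONCLUSIVE_ERROR_RATE` are hypothetical names, the 23-case total is made up for the demo, and recording a quality breach on an inconclusive run is a choice made here, not something the PR confirms.

```python
from typing import List, Optional, Tuple

INCONCLUSIVE_ERROR_RATE = 0.5  # stands in for the 0.5 inconclusive ceiling


def classify_recall(
    recall: Optional[float],
    error_count: int,
    total: int,
    min_recall: float = 0.8,
    catastrophic_recall: float = 0.50,
) -> Tuple[List[str], List[str]]:
    """Split one run's recall into (quality_breaches, catastrophic_breaches)."""
    quality: List[str] = []
    catastrophic: List[str] = []
    if recall is None:  # nothing was measured, so there is no breach
        return quality, catastrophic
    if recall < min_recall:
        quality.append(f"recall {recall:.3f} < {min_recall:.2f}")
    # Abstain from the floor when the run itself is untrustworthy.
    inconclusive = total <= 0 or error_count / total >= INCONCLUSIVE_ERROR_RATE
    if recall < catastrophic_recall and not inconclusive:
        catastrophic.append(f"recall {recall:.3f} < floor {catastrophic_recall:.2f}")
    return quality, catastrophic


def exit_code(quality: List[str], catastrophic: List[str], gate_mode: str) -> int:
    """Compact stand-in for the shared two-tier policy helper."""
    return 1 if catastrophic or (quality and gate_mode == "hard") else 0


if __name__ == "__main__":
    scenarios = [
        ("variance dip", 0.783, 0),      # below the gate, above the floor
        ("missing API key", 0.0, 23),    # every case errored
        ("genuine collapse", 0.30, 0),   # clean run, recall under the floor
    ]
    for label, recall, errors in scenarios:
        q, c = classify_recall(recall, errors, total=23)
        print(label, "->", exit_code(q, c, "warn"))
```

In warn mode only the third scenario exits non-zero in this sketch: the variance dip is advisory, and the all-errored run is treated as inconclusive instead of as a collapse.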

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

jinhongkuan

Poem

🐰 Two tiers now guard the gates of fate,
Where catastrophe stands ever straight,
But quality learns to wait and warn—
No more shall random noise be worn!
LLM variance bows to the floor. 🚪✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly describes the main change: converting LLM-driven CI gates from hard failure to warn mode with catastrophic-floor fallback, directly addressing issue `#537`.
Linked Issues check	✅ Passed	The PR fully implements the immediate mitigation strategy from `#537`: converting quality thresholds to warn-only while retaining catastrophic floors for genuine collapse detection, with inconclusive-run abstention to prevent false hard failures.
Out of Scope Changes check	✅ Passed	All changes are scoped to the immediate mitigation for `#537`: new two-tier gate logic, CLI catastrophic-recall thresholds, workflow mode adjustments, and behavioral tests—no unrelated modifications detected.


coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

.github/workflows/test-mcp-regression.yml (2)
336-354: 💤 Low value

Step changes are consistent; same maintainability note as M2.

The step name updates and --gate-mode warn switches are correct for both Step-1 and Step-0. Both steps now rely on their respective runner defaults (--catastrophic-recall 0.40 for Step-1, 0.25 for Step-0) rather than passing the values explicitly. This is the same pattern as M2 above—functionally correct but creates an implicit contract between the workflow comments (line 334) and the runner script defaults.

If explicit values are preferred for maintainability (as suggested for M2), the same pattern applies here:
♻️ Optional refactor to make catastrophic thresholds explicit

Step-1:
         run: >
           python tests/eval_preflight_skill_step1.py
           --gate-mode warn
+          --catastrophic-recall 0.40
           -o test-results/skill-step1.json
Step-0:
         run: >
           python tests/eval_preflight_skill_invocation.py
           --gate-mode warn
+          --catastrophic-recall 0.25
           -o test-results/skill-step0.json
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/test-mcp-regression.yml around lines 336 - 354, The
workflow steps "M_skill_preflight Step-1 (warn + catastrophic floor)" and
"M_skill_preflight Step-0 (warn + catastrophic floor)" rely on runner defaults
for catastrophic thresholds; make the thresholds explicit by adding the
corresponding --catastrophic-recall flags to each run command (use the Step-1
flag value 0.40 and the Step-0 flag value 0.25) so the commands invoking
tests/eval_preflight_skill_step1.py and tests/eval_preflight_skill_invocation.py
include --catastrophic-recall 0.40 and --catastrophic-recall 0.25 respectively,
removing the implicit dependency on runner defaults and keeping the step
names/gate-mode unchanged.
254-262: 💤 Low value

Consider passing --catastrophic-recall explicitly for maintainability.

The step name change and --gate-mode warn switch are correct. However, the workflow relies on the default --catastrophic-recall 0.50 in eval_grounding_recall.py rather than passing it explicitly. While functional, this creates an implicit contract: if someone changes the runner's default, the comment on line 252 becomes stale without the workflow itself breaking.

Passing the value explicitly would make the configuration self-documenting:
♻️ Optional refactor to make catastrophic threshold explicit
         run: >
           python tests/eval_grounding_recall.py
           --gate-mode warn
+          --catastrophic-recall 0.50
           -o test-results/m2-grounding-recall.json
This is a style/maintainability tradeoff—defaults keep the workflow concise, but explicit values prevent drift between comments and code.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/test-mcp-regression.yml around lines 254 - 262, The
workflow step invoking tests/eval_grounding_recall.py relies on the script's
default catastrophic threshold; update the run command that calls
eval_grounding_recall.py (the line containing "--gate-mode warn") to pass the
explicit flag "--catastrophic-recall 0.50" so the CI config is self-documenting
and won't break if the script default changes.



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1732d192-67c1-40a3-8478-113b6e5739f7

📥 Commits

Reviewing files that changed from the base of the PR and between 88bdf6a and 6505423.

📒 Files selected for processing (6)

.github/workflows/test-mcp-regression.yml
tests/eval/_gate.py
tests/eval_grounding_recall.py
tests/eval_preflight_skill_invocation.py
tests/eval_preflight_skill_step1.py
tests/test_eval_gate_two_tier.py

coderabbitai · 2026-06-03T19:57:08Z

+    p.add_argument(
+        "--catastrophic-recall",
+        type=float,
+        default=0.50,
+        help="Hard floor (#537): recall below this hard-fails CI in any gate mode "
+        "(a collapsed grounding path, not LLM variance). Default 0.50.",
+    )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate catastrophic-floor ordering at parse time.

--catastrophic-recall is unconstrained here, so it can be set above --min-recall. That breaks the two-tier contract by making warn mode hard-fail scores that the quality gate treats as passing, and in this runner can even print ✓ all gates pass before the catastrophic failure line. Reject invalid threshold combinations (and out-of-range percentages) up front.

Suggested guard

     args = p.parse_args()
+    for flag in ("min_recall", "min_precision", "max_abort_rate", "catastrophic_recall"):
+        value = getattr(args, flag)
+        if not 0.0 <= value <= 1.0:
+            p.error(f"--{flag.replace('_', '-')} must be between 0 and 1")
+    if args.catastrophic_recall > args.min_recall:
+        p.error("--catastrophic-recall must be less than or equal to --min-recall")
     _, exit_code = asyncio.run(run(args))
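
A self-contained version of this kind of guard can be tried in isolation. The sketch below assumes only two flags and a `--min-recall` default of 0.8 (taken from the M2 gate in the PR description); `build_parser` and `parse_thresholds` are hypothetical names and the real runner's parser has more options.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser with only the two recall thresholds."""
    p = argparse.ArgumentParser(description="threshold validation sketch")
    p.add_argument("--min-recall", type=float, default=0.8)
    p.add_argument("--catastrophic-recall", type=float, default=0.50)
    return p


def parse_thresholds(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse args, rejecting out-of-range or inverted thresholds up front."""
    p = build_parser()
    args = p.parse_args(argv)
    for flag in ("min_recall", "catastrophic_recall"):
        value = getattr(args, flag)
        # NaN also fails this chained comparison, so it is rejected too.
        if not 0.0 <= value <= 1.0:
            p.error(f"--{flag.replace('_', '-')} must be between 0 and 1")
    # The floor must not sit above the advisory gate.
    if args.catastrophic_recall > args.min_recall:
        p.error("--catastrophic-recall must be less than or equal to --min-recall")
    return args


if __name__ == "__main__":
    print(parse_thresholds(["--catastrophic-recall", "0.5"]))
```

`ArgumentParser.error` prints a usage message to stderr and exits with status 2, so an inverted pair such as `--catastrophic-recall 0.9` with the default `--min-recall` stops the run before any case executes.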

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/eval_grounding_recall.py` around lines 375 - 381, After parsing CLI flags, validate the thresholds so --catastrophic-recall (parsed as catastrophic_recall) is within [0.0,1.0], --min-recall (parsed as min_recall) is within [0.0,1.0], and catastrophic_recall <= min_recall; if any check fails, print a clear error and exit non‑zero. Locate the argument definitions added via p.add_argument("--catastrophic-recall", ...) and the corresponding "--min-recall" arg, perform this validation immediately after parser.parse_args() (or in the function that handles args) and reject invalid combinations up front to prevent the two-tier contract from being violated.

coderabbitai · 2026-06-03T19:57:08Z

+    parser.add_argument(
+        "--catastrophic-recall",
+        type=float,
+        default=0.25,
+        help="Hard floor (#537): should-invoke recall below this hard-fails CI in "
+        "any gate mode (a collapsed invocation path, not LLM variance). Default 0.25.",
+    )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject threshold combinations that invert the warn/hard policy.

This flag is not validated against --min-recall, so a caller can set a catastrophic floor above the advisory recall gate and turn warn mode back into a hard gate for non-catastrophic results. Please enforce 0 <= catastrophic_recall <= min_recall <= 1 (and keep max_fp_rate in [0,1]) before computing breaches.

Suggested guard

     args = parser.parse_args()
+    for flag in ("min_recall", "max_fp_rate", "catastrophic_recall"):
+        value = getattr(args, flag)
+        if not 0.0 <= value <= 1.0:
+            parser.error(f"--{flag.replace('_', '-')} must be between 0 and 1")
+    if args.catastrophic_recall > args.min_recall:
+        parser.error("--catastrophic-recall must be less than or equal to --min-recall")
     rows = [json.loads(line) for line in DATASET.read_text().splitlines() if line.strip()]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/eval_preflight_skill_invocation.py` around lines 89 - 95, Reject invalid threshold combinations by validating parsed args: after argument parsing, check that 0 <= args.catastrophic_recall <= args.min_recall <= 1 and that 0 <= args.max_fp_rate <= 1, and raise/exit with a clear error if violated; update the validation logic around where thresholds are used (e.g., before breach computation and in any function that reads catastrophic_recall, min_recall, max_fp_rate) so downstream code (breach computation) never runs with inverted or out-of-range thresholds.

coderabbitai · 2026-06-03T19:57:08Z

+    parser.add_argument(
+        "--catastrophic-recall",
+        type=float,
+        default=0.40,
+        help="Hard floor (#537): overall recall below this hard-fails CI in any "
+        "gate mode (a collapsed preflight path, not LLM variance). Default 0.40.",
+    )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce a sane catastrophic-floor range.

--catastrophic-recall is accepted without any bounds or ordering check. If someone tunes it above --min-recall, this runner will hard-fail warn-mode runs that only cleared the advisory gate, which defeats the two-tier policy. Validate both the [0,1] range and catastrophic_recall <= min_recall when parsing args.

Suggested guard

     args = parser.parse_args()
+    for flag in ("min_recall", "catastrophic_recall"):
+        value = getattr(args, flag)
+        if not 0.0 <= value <= 1.0:
+            parser.error(f"--{flag.replace('_', '-')} must be between 0 and 1")
+    if args.catastrophic_recall > args.min_recall:
+        parser.error("--catastrophic-recall must be less than or equal to --min-recall")
     rows = [json.loads(line) for line in DATASET.read_text().splitlines() if line.strip()]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/eval_preflight_skill_step1.py` around lines 80 - 86, The CLI currently accepts --catastrophic-recall without bounds or ordering checks; after parsing args (using parser and the parsed names catastrophic_recall and min_recall), validate that catastrophic_recall is within [0.0, 1.0] and that catastrophic_recall <= min_recall; if a check fails call parser.error with a clear message so arg parsing exits cleanly. Add these checks immediately after args = parser.parse_args() (or equivalent) to enforce the range and the ordering between --catastrophic-recall and --min-recall.

Knapp-Kevin added the agent-needs-human label (Agent completed work; requires human review/decision before merge) Jun 3, 2026

Knapp-Kevin temporarily deployed to ci-test June 3, 2026 19:51 — with GitHub Actions Inactive

coderabbitai Bot reviewed Jun 3, 2026


Knapp-Kevin requested a review from jinhongkuan June 3, 2026 20:12

This was referenced Jun 3, 2026

CI: dependabot PRs hard-fail auth-gated checks (no env secrets) — degrade gracefully #540

Open

fix(ci): #540 — skip key-dependent steps on dependabot PRs #541

Closed

docs(compliance): consolidated SOC2/OWASP/NIST/GDPR/EU-AI-Act remediation audit #542

Merged

silongtan mentioned this pull request Jun 5, 2026

CI: LLM-driven hard gates (M2 grounding-recall, M_skill_preflight Step-0/1) flake on unrelated PRs #537

Closed

jinhongkuan merged commit b214e4e into main Jun 7, 2026
12 checks passed

jinhongkuan deleted the fix/537-llm-gate-flake branch June 7, 2026 03:55
