fix(ci): #537 — warn + catastrophic-floor for LLM-driven eval gates #539

Merged
jinhongkuan merged 1 commit into main from fix/537-llm-gate-flake
Jun 7, 2026

Conversation

@Knapp-Kevin Knapp-Kevin commented Jun 3, 2026

Collaborator

What

Make the three LLM-driven hard CI gates robust to caller-LLM nondeterminism. Closes #537.

M2 grounding-recall, M_skill_preflight Step-0, and M_skill_preflight Step-1 are ubuntu-only gates driven by claude-haiku-4-5-20251001. Their thresholds sit inside the LLM variance band, so they flake-block unrelated PRs (Windows skips them, so only the ubuntu jobs go red).

Proof of flake (identical main base):

| PR | diff | M2 recall (gate ≥0.8) |
|---|---|---|
| #536 | one markdown file | 0.783 |
| #535 | | 0.957 |
| #534 | | 0.957 |

How (Option C — warn + catastrophic floor)

  • tests/eval/_gate.py (new): gate_exit_code(quality_breaches, catastrophic_breaches, gate_mode) two-tier policy + is_inconclusive(error_count, total) abstention helper.
  • Each eval gains --catastrophic-recall (M2 0.50 / step1 0.40 / step0 0.25). Quality thresholds become warn-only (the workflow switches --gate-mode from hard to warn); metrics still emit to the run summary unconditionally. The catastrophic floor hard-fails in any mode on a genuine collapse.
  • The grounding eval abstains from the floor when ≥50% of cases error (is_inconclusive): the #536 re-run scored recall 0.000 because every case errored on a missing API key, which is inconclusive rather than a grounding collapse. Preflight evals already skip no-key cases (recall None → no breach), so only grounding needed the guard.
  • tests/test_eval_gate_two_tier.py (new): 11 behavioral tests.
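
The two helpers described in the list above could look roughly like the sketch below. The function names and the (quality_breaches, catastrophic_breaches, gate_mode) and (error_count, total) arguments come from this description; the function bodies, the type hints, and the max_error_rate parameter are assumptions for illustration, and the real tests/eval/_gate.py may differ.

```python
from typing import Sequence


def gate_exit_code(
    quality_breaches: Sequence[str],
    catastrophic_breaches: Sequence[str],
    gate_mode: str,
) -> int:
    """Return a process exit code for a two-tier gate (illustrative sketch)."""
    # Tier 1: a catastrophic breach fails the run in every gate mode.
    if catastrophic_breaches:
        return 1
    # Tier 2: quality breaches only fail the run when the gate is hard.
    if quality_breaches and gate_mode == "hard":
        return 1
    # Warn mode with quality breaches, or a clean run: exit 0.
    return 0


def is_inconclusive(
    error_count: int,
    total: int,
    max_error_rate: float = 0.5,  # assumed name; the description only says ">=50%"
) -> bool:
    """True when too few cases ran cleanly to trust the metric."""
    # No cases executed at all: nothing can be concluded.
    if total <= 0:
        return True
    # At or above the error-rate ceiling the run is treated as inconclusive.
    return error_count / total >= max_error_rate
```

Under this sketch, a warn-mode run with only quality breaches exits 0, while any catastrophic breach exits 1 regardless of mode.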

Verification (local)

pytest tests/test_eval_gate_two_tier.py → 11 passed · ruff check clean · ruff format --check clean · all edited files py_compile clean. Full LLM evals run in CI.

Notes

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores

    • Updated CI evaluation workflows with improved gating logic to reduce false failures while maintaining critical safety checks
    • Enhanced infrastructure failure detection to distinguish from genuine evaluation issues
  • Tests

    • Added comprehensive test suite for evaluation gate policy

Three ubuntu-only hard gates in test-mcp-regression.yml are driven by a
non-deterministic caller LLM (claude-haiku-4-5-20251001) and flake-block
unrelated PRs: M2 grounding-recall, M_skill_preflight Step-0/Step-1. Proof:
M2 recall 0.783 on docs-only #536 vs 0.957 on #534/#535, identical main base.

Per the gating-as-observability doctrine, flip the three steps to
--gate-mode warn (quality thresholds advisory; metrics still emit to the run
summary) and add a catastrophic floor that hard-fails any mode only on a
genuine collapse:
- tests/eval/_gate.py: shared gate_exit_code() two-tier policy +
  is_inconclusive() abstention helper.
- Each eval gains --catastrophic-recall (M2 0.50 / step1 0.40 / step0 0.25).
- Grounding eval abstains from the floor when >=50% of cases error
  (is_inconclusive): the #536 re-run scored recall 0.000 because every case
  errored on a missing API key — inconclusive, not a grounding collapse.
  Preflight evals already skip no-key cases, so only grounding needed the guard.
- tests/test_eval_gate_two_tier.py: 11 behavioral tests for both helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Knapp-Kevin Knapp-Kevin added the agent-needs-human Agent completed work; requires human review/decision before merge label Jun 3, 2026
coderabbitai Bot commented Jun 3, 2026

Walkthrough

This PR implements a two-tier CI gating policy that decouples deterministic catastrophic-failure detection from probabilistic quality-threshold enforcement. Hard-gated eval steps now warn on quality-breach conditions while reserving hard failure for genuine evaluation collapse, reducing flakiness from LLM caller variance.

Changes

Two-tier gating policy and eval runner integration

| Layer / File(s) | Summary |
|---|---|
| Two-tier gate policy helpers · tests/eval/_gate.py | New gate_exit_code() function computes exit code from two breach tiers: catastrophic breaches always exit 1, quality breaches exit 1 only in hard mode. New is_inconclusive() function detects runs with no executed cases or excessive error rate to prevent misclassification of infrastructure failures as grounding collapse. |
| Gate policy behavioral tests · tests/test_eval_gate_two_tier.py | Test suite validates gate_exit_code() results across warn/hard modes for clean, quality-breach, and catastrophic-breach scenarios, and validates is_inconclusive() boundary conditions and all-errored/no-cases-run detection. |
| Grounding-recall runner integration · tests/eval_grounding_recall.py | Imports gate helpers; computes per-run eval_error_rate and uses is_inconclusive() to skip catastrophic classification when cases failed to execute; adds --catastrophic-recall threshold (default 0.50); reports gate breaches with mode notes; delegates exit code to centralized gate_exit_code(). |
| Skill invocation (Step-0) runner integration · tests/eval_preflight_skill_invocation.py | Imports gate helpers; adds --catastrophic-recall threshold (default 0.25); computes and reports separate catastrophic-breaches list in JSON payload; delegates exit code to gate_exit_code() with both quality and catastrophic breach lists. |
| Skill step1 (Step-1) runner integration · tests/eval_preflight_skill_step1.py | Imports gate helpers; adds --catastrophic-recall threshold (default 0.40) based on overall recall; computes catastrophic breaches separately; includes both breach types in JSON payload; delegates exit code to gate_exit_code() instead of inline hard-fail logic. |
| Workflow gate-mode adjustments · .github/workflows/test-mcp-regression.yml | M2 grounding-recall and M_skill_preflight steps (Step-0 and Step-1) switch from --gate-mode hard to --gate-mode warn; step names and notes updated to reflect "warn + catastrophic floor" semantics; relying on newly introduced catastrophic thresholds for deterministic hard-failure behavior. |
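
To make the grounding-recall row concrete, here is a minimal sketch of how a runner might sort a single recall score into the two breach tiers. The 0.80 quality gate and 0.50 floor are the M2 values quoted in this PR; classify_recall, its return shape, the breach message strings, and the case count of 20 in the usage lines are hypothetical and not taken from tests/eval_grounding_recall.py.

```python
from typing import List, Tuple


def classify_recall(
    recall: float,
    error_count: int,
    total: int,
    min_recall: float = 0.80,           # advisory quality gate (M2 value from this PR)
    catastrophic_recall: float = 0.50,  # hard floor (M2 value from this PR)
) -> Tuple[List[str], List[str]]:
    """Return (quality_breaches, catastrophic_breaches) for one run (hypothetical helper)."""
    quality: List[str] = []
    catastrophic: List[str] = []

    # Quality tier: recorded on every shortfall, enforced only in hard mode.
    if recall < min_recall:
        quality.append(f"recall {recall:.3f} < min {min_recall:.2f}")

    # Abstain from the floor when no case ran or at least half of them errored.
    inconclusive = total <= 0 or error_count / total >= 0.5

    # Catastrophic tier: only a conclusive run can count as a collapse.
    if recall < catastrophic_recall and not inconclusive:
        catastrophic.append(f"recall {recall:.3f} < floor {catastrophic_recall:.2f}")

    return quality, catastrophic


# The flaky score from #536 is a quality breach only (case count is hypothetical).
print(classify_recall(0.783, error_count=0, total=20))
# A 0.000 score where every case errored is inconclusive, so the floor abstains.
print(classify_recall(0.000, error_count=20, total=20))
```

In warn mode the first run would pass and the second would not hard-fail on the floor, which is the behaviour the PR description asks for; whether an inconclusive run should still record a quality breach is a choice this sketch makes, not something the source states.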

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • jinhongkuan

Poem

🐰 Two tiers now guard the gates of fate,
Where catastrophe stands ever straight,
But quality learns to wait and warn—
No more shall random noise be worn!
LLM variance bows to the floor. 🚪✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: converting LLM-driven CI gates from hard failure to warn mode with catastrophic-floor fallback, directly addressing issue #537. |
| Linked Issues check | ✅ Passed | The PR fully implements the immediate mitigation strategy from #537: converting quality thresholds to warn-only while retaining catastrophic floors for genuine collapse detection, with inconclusive-run abstention to prevent false hard failures. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the immediate mitigation for #537: new two-tier gate logic, CLI catastrophic-recall thresholds, workflow mode adjustments, and behavioral tests; no unrelated modifications detected. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
.github/workflows/test-mcp-regression.yml (2)

336-354: 💤 Low value

Step changes are consistent; same maintainability note as M2.

The step name updates and --gate-mode warn switches are correct for both Step-1 and Step-0. Both steps now rely on their respective runner defaults (--catastrophic-recall 0.40 for Step-1, 0.25 for Step-0) rather than passing the values explicitly. This is the same pattern as M2 above—functionally correct but creates an implicit contract between the workflow comments (line 334) and the runner script defaults.

If explicit values are preferred for maintainability (as suggested for M2), the same pattern applies here:

♻️ Optional refactor to make catastrophic thresholds explicit

Step-1:

         run: >
           python tests/eval_preflight_skill_step1.py
           --gate-mode warn
+          --catastrophic-recall 0.40
           -o test-results/skill-step1.json

Step-0:

         run: >
           python tests/eval_preflight_skill_invocation.py
           --gate-mode warn
+          --catastrophic-recall 0.25
           -o test-results/skill-step0.json
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/test-mcp-regression.yml around lines 336 - 354, The
workflow steps "M_skill_preflight Step-1 (warn + catastrophic floor)" and
"M_skill_preflight Step-0 (warn + catastrophic floor)" rely on runner defaults
for catastrophic thresholds; make the thresholds explicit by adding the
corresponding --catastrophic-recall flags to each run command (use the Step-1
flag value 0.40 and the Step-0 flag value 0.25) so the commands invoking
tests/eval_preflight_skill_step1.py and tests/eval_preflight_skill_invocation.py
include --catastrophic-recall 0.40 and --catastrophic-recall 0.25 respectively,
removing the implicit dependency on runner defaults and keeping the step
names/gate-mode unchanged.

254-262: 💤 Low value

Consider passing --catastrophic-recall explicitly for maintainability.

The step name change and --gate-mode warn switch are correct. However, the workflow relies on the default --catastrophic-recall 0.50 in eval_grounding_recall.py rather than passing it explicitly. While functional, this creates an implicit contract: if someone changes the runner's default, the comment on line 252 becomes stale without the workflow itself breaking.

Passing the value explicitly would make the configuration self-documenting:

♻️ Optional refactor to make catastrophic threshold explicit
         run: >
           python tests/eval_grounding_recall.py
           --gate-mode warn
+          --catastrophic-recall 0.50
           -o test-results/m2-grounding-recall.json

This is a style/maintainability tradeoff—defaults keep the workflow concise, but explicit values prevent drift between comments and code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/test-mcp-regression.yml around lines 254 - 262, The
workflow step invoking tests/eval_grounding_recall.py relies on the script's
default catastrophic threshold; update the run command that calls
eval_grounding_recall.py (the line containing "--gate-mode warn") to pass the
explicit flag "--catastrophic-recall 0.50" so the CI config is self-documenting
and won't break if the script default changes.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1732d192-67c1-40a3-8478-113b6e5739f7

📥 Commits

Reviewing files that changed from the base of the PR and between 88bdf6a and 6505423.

📒 Files selected for processing (6)
  • .github/workflows/test-mcp-regression.yml
  • tests/eval/_gate.py
  • tests/eval_grounding_recall.py
  • tests/eval_preflight_skill_invocation.py
  • tests/eval_preflight_skill_step1.py
  • tests/test_eval_gate_two_tier.py

Comment on lines +375 to +381
p.add_argument(
    "--catastrophic-recall",
    type=float,
    default=0.50,
    help="Hard floor (#537): recall below this hard-fails CI in any gate mode "
    "(a collapsed grounding path, not LLM variance). Default 0.50.",
)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate catastrophic-floor ordering at parse time.

--catastrophic-recall is unconstrained here, so it can be set above --min-recall. That breaks the two-tier contract by making warn mode hard-fail scores that the quality gate treats as passing, and in this runner can even print ✓ all gates pass before the catastrophic failure line. Reject invalid threshold combinations (and out-of-range percentages) up front.

Suggested guard
     args = p.parse_args()
+    for flag in ("min_recall", "min_precision", "max_abort_rate", "catastrophic_recall"):
+        value = getattr(args, flag)
+        if not 0.0 <= value <= 1.0:
+            p.error(f"--{flag.replace('_', '-')} must be between 0 and 1")
+    if args.catastrophic_recall > args.min_recall:
+        p.error("--catastrophic-recall must be less than or equal to --min-recall")
 
     _, exit_code = asyncio.run(run(args))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/eval_grounding_recall.py` around lines 375 - 381, After parsing CLI
flags, validate the thresholds so --catastrophic-recall (parsed as
catastrophic_recall) is within [0.0,1.0], --min-recall (parsed as min_recall) is
within [0.0,1.0], and catastrophic_recall <= min_recall; if any check fails,
print a clear error and exit non‑zero. Locate the argument definitions added via
p.add_argument("--catastrophic-recall", ...) and the corresponding
"--min-recall" arg, perform this validation immediately after
parser.parse_args() (or in the function that handles args) and reject invalid
combinations up front to prevent the two-tier contract from being violated.
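
As a self-contained illustration of the guard suggested above, the sketch below builds a tiny parser and rejects out-of-range or inverted thresholds at parse time. The flag names and the 0.80 / 0.50 defaults follow this review; parse_thresholds is a hypothetical wrapper, and the real runner defines more flags than shown here.

```python
import argparse
from typing import List, Optional


def parse_thresholds(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse and validate the two recall thresholds (hypothetical wrapper)."""
    parser = argparse.ArgumentParser(description="threshold validation sketch")
    parser.add_argument("--min-recall", type=float, default=0.80)
    parser.add_argument("--catastrophic-recall", type=float, default=0.50)
    args = parser.parse_args(argv)

    # Range check: both thresholds are fractions in [0, 1].
    for name in ("min_recall", "catastrophic_recall"):
        value = getattr(args, name)
        if not 0.0 <= value <= 1.0:
            # parser.error prints usage plus the message and exits with status 2.
            parser.error(f"--{name.replace('_', '-')} must be between 0 and 1")

    # Ordering check: the hard floor must not sit above the advisory gate.
    if args.catastrophic_recall > args.min_recall:
        parser.error("--catastrophic-recall must be less than or equal to --min-recall")

    return args


if __name__ == "__main__":
    print(parse_thresholds())
```

Using parser.error keeps the failure in argparse's usual shape (usage text, message, exit status 2), so a misconfigured workflow step fails before any LLM call is made.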

Comment on lines +89 to +95
parser.add_argument(
    "--catastrophic-recall",
    type=float,
    default=0.25,
    help="Hard floor (#537): should-invoke recall below this hard-fails CI in "
    "any gate mode (a collapsed invocation path, not LLM variance). Default 0.25.",
)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject threshold combinations that invert the warn/hard policy.

This flag is not validated against --min-recall, so a caller can set a catastrophic floor above the advisory recall gate and turn warn mode back into a hard gate for non-catastrophic results. Please enforce 0 <= catastrophic_recall <= min_recall <= 1 (and keep max_fp_rate in [0,1]) before computing breaches.

Suggested guard
     args = parser.parse_args()
+    for flag in ("min_recall", "max_fp_rate", "catastrophic_recall"):
+        value = getattr(args, flag)
+        if not 0.0 <= value <= 1.0:
+            parser.error(f"--{flag.replace('_', '-')} must be between 0 and 1")
+    if args.catastrophic_recall > args.min_recall:
+        parser.error("--catastrophic-recall must be less than or equal to --min-recall")
 
     rows = [json.loads(line) for line in DATASET.read_text().splitlines() if line.strip()]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/eval_preflight_skill_invocation.py` around lines 89 - 95, Reject
invalid threshold combinations by validating parsed args: after argument
parsing, check that 0 <= args.catastrophic_recall <= args.min_recall <= 1 and
that 0 <= args.max_fp_rate <= 1, and raise/exit with a clear error if violated;
update the validation logic around where thresholds are used (e.g., before
breach computation and in any function that reads catastrophic_recall,
min_recall, max_fp_rate) so downstream code (breach computation) never runs with
inverted or out-of-range thresholds.

Comment on lines +80 to +86
parser.add_argument(
    "--catastrophic-recall",
    type=float,
    default=0.40,
    help="Hard floor (#537): overall recall below this hard-fails CI in any "
    "gate mode (a collapsed preflight path, not LLM variance). Default 0.40.",
)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce a sane catastrophic-floor range.

--catastrophic-recall is accepted without any bounds or ordering check. If someone tunes it above --min-recall, this runner will hard-fail warn-mode runs that only cleared the advisory gate, which defeats the two-tier policy. Validate both the [0,1] range and catastrophic_recall <= min_recall when parsing args.

Suggested guard
     args = parser.parse_args()
+    for flag in ("min_recall", "catastrophic_recall"):
+        value = getattr(args, flag)
+        if not 0.0 <= value <= 1.0:
+            parser.error(f"--{flag.replace('_', '-')} must be between 0 and 1")
+    if args.catastrophic_recall > args.min_recall:
+        parser.error("--catastrophic-recall must be less than or equal to --min-recall")
 
     rows = [json.loads(line) for line in DATASET.read_text().splitlines() if line.strip()]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/eval_preflight_skill_step1.py` around lines 80 - 86, The CLI currently
accepts --catastrophic-recall without bounds or ordering checks; after parsing
args (using parser and the parsed names catastrophic_recall and min_recall),
validate that catastrophic_recall is within [0.0, 1.0] and that
catastrophic_recall <= min_recall; if a check fails call parser.error with a
clear message so arg parsing exits cleanly. Add these checks immediately after
args = parser.parse_args() (or equivalent) to enforce the range and the ordering
between --catastrophic-recall and --min-recall.

Labels

agent-needs-human Agent completed work; requires human review/decision before merge

Development

Successfully merging this pull request may close these issues.

CI: LLM-driven hard gates (M2 grounding-recall, M_skill_preflight Step-0/1) flake on unrelated PRs

2 participants