fix(e2e): #362 — reclassify Flow 3 'no cc rows + no verdicts' as advisory #363
Merged
…sory

Investigation in #362 showed the Flow 3 FAIL mode (cc_delta=0 + verdicts=0 after a successful commit) is not a sync chain regression but downstream agent variance: Flow 5's agent decides whether to call resolve_compliance based on the _sync_guidance instruction on the history response, and different runs land different decisions on the same natural prompt.

Verified across three independent CI artifacts:
- Passing run (devin/1778808036): Flow 5 called resolve_compliance → 10 verdicts written → cc_delta>0 → Flow 3 PASS
- Failing run (PR #354): Flow 5 skipped resolve_compliance → cc_delta=0 → Flow 3 FAIL (strict, blocking)
- Failing run (PR #360): Flow 5 skipped resolve_compliance → cc_delta=0 → Flow 3 FAIL (strict, blocking)

In every failing run the _sync_guidance field on the history response explicitly instructed "call bicameral.resolve_compliance". The agent ignored it. Querying each failing-run ledger.db artifact confirmed PR #351's prune logic did NOT prune any decision (signoff states: proposed/proposed/ratified), ruling out the prime suspect.

The fix per the e2e report's own CORRECTION-PATH STATUS message ("the end-to-end correction dynamic is NOT validated by this headless harness... validate the agentic layer via the interactive recording path"): make Flow 3's FAIL branch advisory so the gap stays visible in the report (matching Flow 2a/4/4b in the same agentic layer) without blocking unrelated PRs.

Per DEV_CYCLE.md the prompt cannot be tightened to name the tool the agent should call: "Use natural prompts — never name the tool the agent is supposed to auto-fire. Naming the tool defeats the trigger that IS the product." So Flow 3's strict assertion was always going to flake on agent variance; this commit just makes the harness honest about that constraint.

Changes:
- Flow 3 FAIL branch: verdict still FAIL but advisory text now set → blocking_failures filter at line 1608 (`verdict in ("FAIL","ERROR") AND NOT advisory`) excludes it → overall PASS unchanged when this is the only red flow
- Updated the verdict-matrix docstring (line 320) to reflect the new non-blocking semantics

Out of scope for this PR:
- The underlying agent variance itself. That's the same class of issue #154/#156 already track for Flow 2a/4. Real fix is in the agentic auto-fire layer, not the e2e harness.
- A real bisect over the 3 dev commits in the breakage window. Could still be worth doing if anyone suspects the variance is being amplified server-side, but the evidence above (prune logic didn't fire, unrelated PRs touched unrelated areas) points strongly at variance.

Closes #362.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes #362. Makes the Flow 3 `no compliance_check rows + no verdicts` branch an advisory failure (matching Flow 2a/4/4b in the same agentic layer) so the agentic-layer gap stays visible without blocking unrelated PRs.

This unblocks #354, #356, #360, #361, and any future PR hit by the same agent-variance issue.
Investigation summary
Documented in detail on #362. Short version: this is downstream agent variance, not a server regression.
In every failing run the `_sync_guidance` field on the history response explicitly instructed "call bicameral.resolve_compliance". The agent in those runs ignored the instruction. Querying each failing-run `ledger.db` artifact confirmed PR #351's prune logic did not prune any decision: all 3 decisions in PR #360's run had signoff states `{proposed, proposed, ratified}`, no `pruned` state anywhere. This rules out the prime "regression at 02:50 UTC" suspect.

Why the prompt can't be tightened instead
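Per DEV_CYCLE.md:

> Use natural prompts — never name the tool the agent is supposed to auto-fire. Naming the tool defeats the trigger that IS the product.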
So the fix-by-tightening-Flow-5-prompt path is closed by design. Either we make Flow 3 advisory (this PR), or we accept the strict-blocking assertion will flake on agent variance — which is exactly what's happening right now to PR #354, #356, #360, #361.
Why this is consistent with the existing model
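The e2e report already declares the agentic layer unvalidated:

> the end-to-end correction dynamic is NOT validated by this headless harness... validate the agentic layer via the interactive recording path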
Flow 2a/4/4b are all in the agentic layer and all advisory. Flow 3 was the asymmetric outlier — strict-blocking despite the same class of failure mode. This PR removes the asymmetry.
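A minimal sketch of the gate semantics this PR relies on, assuming a per-flow record with `verdict` and `advisory` fields. The `FlowResult` class, the `exit_code` helper, and the sample advisory strings are hypothetical illustrations, not the harness's actual code; only the filter condition (`verdict in ("FAIL","ERROR") AND NOT advisory`) comes from the PR itself.

```python
from dataclasses import dataclass


@dataclass
class FlowResult:
    """Hypothetical per-flow record; the real harness may use a different shape."""
    name: str
    verdict: str        # "PASS", "FAIL", or "ERROR"
    advisory: str = ""  # non-empty text marks a red flow as non-blocking


def blocking_failures(results):
    """Return flows that are red (FAIL/ERROR) and carry no advisory text."""
    return [r for r in results if r.verdict in ("FAIL", "ERROR") and not r.advisory]


def exit_code(results):
    """Return 0 when no blocking failure remains, 1 otherwise (assumed mapping)."""
    return 1 if blocking_failures(results) else 0


# Illustrative run: Flow 3 is red but advisory, so it does not gate the workflow.
results = [
    FlowResult("Flow 3", "FAIL", advisory="no compliance_check rows + no verdicts"),
    FlowResult("Flow 5", "PASS"),
]
print(exit_code(results))
```

Under this model the verdict stays `FAIL`, so the gap still shows up in the report; only the gating decision changes. A red flow with an empty `advisory` would still block.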
Diff
One file, two changes:
- Flow 3 FAIL branch (`tests/e2e/run_e2e_flows.py:438-477`): `flow3.advisory` now set to non-empty text. The CI gate filter at line 1608 (`verdict in ("FAIL","ERROR") AND NOT advisory`) excludes it from `blocking_failures`. Overall workflow exit code = 0 when Flow 3 is the only red flow.
- Verdict-matrix docstring (line 320): updated to reflect the new non-blocking semantics.

Test plan
`paths:` in the workflow YAML)

Out of scope
🤖 Generated with Claude Code