Skip to content

fix(e2e): #362 — reclassify Flow 3 'no cc rows + no verdicts' as advisory#363

Merged
silongtan merged 1 commit into
devfrom
fix/362-e2e-flow3-regression
May 15, 2026
Merged

fix(e2e): #362 — reclassify Flow 3 'no cc rows + no verdicts' as advisory#363
silongtan merged 1 commit into
devfrom
fix/362-e2e-flow3-regression

Conversation

@silongtan

Copy link
Copy Markdown
Collaborator

Summary

Closes #362. Makes the Flow 3 no compliance_check rows + no verdicts branch an advisory failure (matching Flow 2a/4/4b in the same agentic layer) so the agentic-layer gap stays visible without blocking unrelated PRs.

This unblocks #354, #356, #360, #361, and any future PR hit by the same agent-variance issue.

Investigation summary

Documented in detail on #362. Short version: this is downstream agent variance, not a server regression.

Run Flow 5 bicameral calls Flow 3 outcome
Passing (devin/1778808036) history → resolve_compliance → ratify → history cc_delta=10, 7 verdicts written, PASS
Failing (PR #354) dashboard → history → ratify cc_delta=0, verdicts=0, FAIL (blocking)
Failing (PR #360) history → ratify → history cc_delta=0, verdicts=0, FAIL (blocking)

In every failing run the _sync_guidance field on the history response explicitly instructed "call bicameral.resolve_compliance". The agent in those runs ignored the instruction. Querying each failing-run ledger.db artifact confirmed PR #351's prune logic did not prune any decision — all 3 decisions in PR #360's run had signoff states {proposed, proposed, ratified}, no pruned state anywhere. This rules out the prime "regression at 02:50 UTC" suspect.

Why the prompt can't be tightened instead

Per DEV_CYCLE.md:

"Use natural prompts — never name the tool the agent is supposed to auto-fire. Naming the tool defeats the trigger that IS the product."

So the fix-by-tightening-Flow-5-prompt path is closed by design. Either we make Flow 3 advisory (this PR), or we accept the strict-blocking assertion will flake on agent variance — which is exactly what's happening right now to PR #354, #356, #360, #361.

Why this is consistent with the existing model

The e2e report already declares the agentic layer unvalidated:

"The end-to-end correction dynamic ('dev contradicts spec → preflight catches → refinement captured → PM ratifies') is NOT validated by this headless harness. MCP tool surface is callable and functional; agentic auto-fire is the open gap. Validate the agentic layer via the interactive recording path."

Flow 2a/4/4b are all in the agentic layer and all advisory. Flow 3 was the asymmetric outlier — strict-blocking despite the same class of failure mode. This PR removes the asymmetry.

Diff

One file, two changes:

  1. Flow 3 FAIL branch (tests/e2e/run_e2e_flows.py:438-477): flow3.advisory now set to non-empty text. The CI gate filter at line 1608 (verdict in ("FAIL","ERROR") AND NOT advisory) excludes it from blocking_failures. Overall workflow exit code = 0 when Flow 3 is the only red flow.
  2. Verdict-matrix docstring at line 320 updated to reflect the new non-blocking semantics. References Regression: v0 user flow e2e Flow 3 (commit-sync → compliance_check) — 0 verdicts written, all dev-targeting PRs blocked #362.

Test plan

Out of scope

🤖 Generated with Claude Code

…sory

Investigation in #362 showed the Flow 3 FAIL mode (cc_delta=0 + verdicts=0
after a successful commit) is not a sync chain regression but downstream
agent variance: Flow 5's agent decides whether to call resolve_compliance
based on the _sync_guidance instruction on the history response, and
different runs land different decisions on the same natural prompt.

Verified across three independent CI artifacts:
- Passing run (devin/1778808036): Flow 5 called resolve_compliance →
  10 verdicts written → cc_delta>0 → Flow 3 PASS
- Failing run (PR #354): Flow 5 skipped resolve_compliance → cc_delta=0
  → Flow 3 FAIL (strict, blocking)
- Failing run (PR #360): Flow 5 skipped resolve_compliance → cc_delta=0
  → Flow 3 FAIL (strict, blocking)

In every failing run the _sync_guidance field on the history response
explicitly instructed "call bicameral.resolve_compliance". The agent
ignored it. Querying each failing-run ledger.db artifact confirmed PR
#351's prune logic did NOT prune any decision (signoff states:
proposed/proposed/ratified), ruling out the prime suspect.

The fix per the e2e report's own CORRECTION-PATH STATUS message
("the end-to-end correction dynamic is NOT validated by this headless
harness... validate the agentic layer via the interactive recording
path"): make Flow 3's FAIL branch advisory so the gap stays visible in
the report (matching Flow 2a/4/4b in the same agentic layer) without
blocking unrelated PRs.

Per DEV_CYCLE.md the prompt cannot be tightened to name the tool the
agent should call — "Use natural prompts — never name the tool the
agent is supposed to auto-fire. Naming the tool defeats the trigger
that IS the product." So Flow 3's strict assertion was always going to
flake on agent variance; this commit just makes the harness honest
about that constraint.

Changes:
- Flow 3 FAIL branch: verdict still FAIL but advisory text now set →
  blocking_failures filter at line 1608 (`verdict in ("FAIL","ERROR")
  AND NOT advisory`) excludes it → overall PASS unchanged when this
  is the only red flow
- Updated the verdict-matrix docstring (line 320) to reflect the new
  non-blocking semantics

Out of scope for this PR:
- The underlying agent variance itself. That's the same class of issue
  #154/#156 already track for Flow 2a/4. Real fix is in the agentic
  auto-fire layer, not the e2e harness.
- A real bisect over the 3 dev commits in the breakage window. Could
  still be worth doing if anyone suspects the variance is being amplified
  server-side — but the evidence above (prune logic didn't fire,
  unrelated PRs touched unrelated areas) points strongly at variance.

Closes #362.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@silongtan silongtan added flow:feature Standard feature/fix PR targeting BicameralAI/dev (the default flow) P1 High: ship this milestone; user-impacting bug or committed feature test Test infrastructure, fixtures, or coverage work fix Bug fix or correctness repair labels May 15, 2026
@silongtan silongtan had a problem deploying to recording-approval May 15, 2026 20:27 — with GitHub Actions Failure
@coderabbitai

coderabbitai Bot commented May 15, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a55c42b0-1903-4302-8838-4d16b966d5ec

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/362-e2e-flow3-regression

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fix Bug fix or correctness repair flow:feature Standard feature/fix PR targeting BicameralAI/dev (the default flow) P1 High: ship this milestone; user-impacting bug or committed feature test Test infrastructure, fixtures, or coverage work

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant