fix(e2e): #362 — reclassify Flow 3 'no cc rows + no verdicts' as advisory by silongtan · Pull Request #363 · BicameralAI/bicameral-mcp

silongtan · 2026-05-15T20:27:53Z

Summary

Closes #362. Makes the Flow 3 'no compliance_check rows + no verdicts' branch an advisory failure (matching Flow 2a/4/4b in the same agentic layer) so the agentic-layer gap stays visible without blocking unrelated PRs.

This unblocks #354, #356, #360, #361, and any future PR hit by the same agent-variance issue.

Investigation summary

Documented in detail on #362. Short version: this is downstream agent variance, not a server regression.

Run	Flow 5 bicameral calls	Flow 3 outcome
Passing (devin/1778808036)	history → resolve_compliance → ratify → history	cc_delta=10, 7 verdicts written, PASS
Failing (PR #354)	dashboard → history → ratify	cc_delta=0, verdicts=0, FAIL (blocking)
Failing (PR #360)	history → ratify → history	cc_delta=0, verdicts=0, FAIL (blocking)

In every failing run the _sync_guidance field on the history response explicitly instructed "call bicameral.resolve_compliance". The agent in those runs ignored the instruction. Querying each failing-run ledger.db artifact confirmed PR #351's prune logic did not prune any decision — all 3 decisions in PR #360's run had signoff states {proposed, proposed, ratified}, no pruned state anywhere. This rules out the prime "regression at 02:50 UTC" suspect.
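
The distinction drawn in the table above can be sketched as a check on the Flow 5 tool-call trace. This is a minimal illustration, not harness code: the helper name `called_resolve_compliance` and the list-of-strings trace format are assumptions made for the example; only the tool names and call orders are taken from the table.

```python
from typing import Dict, List, Sequence


def called_resolve_compliance(calls: Sequence[str]) -> bool:
    """Hypothetical helper: True if the trace includes the resolve_compliance step."""
    return "resolve_compliance" in calls


# Call sequences copied from the table above (tool names without the bicameral. prefix).
runs: Dict[str, List[str]] = {
    "passing (devin/1778808036)": ["history", "resolve_compliance", "ratify", "history"],
    "failing (PR #354)": ["dashboard", "history", "ratify"],
    "failing (PR #360)": ["history", "ratify", "history"],
}

for name, calls in runs.items():
    status = "called" if called_resolve_compliance(calls) else "skipped"
    print(f"{name}: resolve_compliance {status}")
```

Only the passing trace contains the `resolve_compliance` step, which matches the cc_delta and verdict counts reported for each run.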

Why the prompt can't be tightened instead

Per DEV_CYCLE.md:

"Use natural prompts — never name the tool the agent is supposed to auto-fire. Naming the tool defeats the trigger that IS the product."

So the fix-by-tightening-Flow-5-prompt path is closed by design. Either we make Flow 3 advisory (this PR), or we accept the strict-blocking assertion will flake on agent variance — which is exactly what's happening right now to PR #354, #356, #360, #361.

Why this is consistent with the existing model

The e2e report already declares the agentic layer unvalidated:

"The end-to-end correction dynamic ('dev contradicts spec → preflight catches → refinement captured → PM ratifies') is NOT validated by this headless harness. MCP tool surface is callable and functional; agentic auto-fire is the open gap. Validate the agentic layer via the interactive recording path."

Flow 2a/4/4b are all in the agentic layer and all advisory. Flow 3 was the asymmetric outlier — strict-blocking despite the same class of failure mode. This PR removes the asymmetry.

Diff

One file, two changes:

Flow 3 FAIL branch (tests/e2e/run_e2e_flows.py:438-477): flow3.advisory now set to non-empty text. The CI gate filter at line 1608 (verdict in ("FAIL","ERROR") AND NOT advisory) excludes it from blocking_failures. Overall workflow exit code = 0 when Flow 3 is the only red flow.
Verdict-matrix docstring at line 320 updated to reflect the new non-blocking semantics. References #362 (Regression: v0 user flow e2e Flow 3 (commit-sync → compliance_check), 0 verdicts written, all dev-targeting PRs blocked).
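
The gate behaviour described in the two changes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in run_e2e_flows.py: the `FlowResult` dataclass, its field names, and the `exit_code` helper are hypothetical; only the filter condition (verdict in ("FAIL","ERROR") and not advisory) and the name `blocking_failures` come from the PR description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlowResult:
    """Hypothetical per-flow record; the real harness may store this differently."""
    name: str
    verdict: str        # "PASS", "FAIL", or "ERROR"
    advisory: str = ""  # non-empty text marks a failure as non-blocking


def blocking_failures(results: List[FlowResult]) -> List[FlowResult]:
    # A flow blocks CI only if it failed or errored AND carries no advisory text.
    return [r for r in results if r.verdict in ("FAIL", "ERROR") and not r.advisory]


def exit_code(results: List[FlowResult]) -> int:
    # Non-zero only when at least one blocking failure remains.
    return 1 if blocking_failures(results) else 0


# After this PR: Flow 3 still reports FAIL, but with advisory text set.
after = [
    FlowResult("Flow 1", "PASS"),
    FlowResult("Flow 3", "FAIL", advisory="no cc rows + no verdicts (agent variance)"),
]
# Before this PR: the same FAIL with no advisory text was blocking.
before = [
    FlowResult("Flow 1", "PASS"),
    FlowResult("Flow 3", "FAIL"),
]

print(exit_code(after), exit_code(before))
```

In this sketch the verdict itself stays FAIL, so the gap remains visible in the report; only the advisory text changes whether the flow counts toward the exit code.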

Test plan

CI green on this PR (this PR DOESN'T modify any path the v0-user-flow-e2e workflow watches, so the workflow won't even fire — see paths: in the workflow YAML)
After merge: PRs #360 (ci(perf): #357 sub-task 2, file-backed SurrealKV perf gate) and #361 (infra(pre-commit): #357 sub-task 3, local ruff enforcement at commit time) unblock (re-run their e2e and confirm Flow 3 FAIL → advisory, overall PASS)
Validate Flow 2a/4/4b remain advisory (unchanged)
Validate the prior Flow 3 PASS paths (full verdicts, headless-terminus cc rows) remain PASS (unchanged code path)

Out of scope

The underlying agent variance. Same class of issue as #154 ([P0] Preflight skill does not instruct agent to capture refinements when user prompt contradicts surfaced decisions) and #156 ([P1] SessionEnd capture-corrections hook is silently broken; design pivot to next-session surfacing). The real fix is in the agentic auto-fire layer (skill instructions, system reminders, or hook design), not the e2e harness.
A real git-bisect over the 3 commits in the breakage window. Still worth doing if anyone has reason to think the agent variance is being amplified server-side, but the evidence above (prune logic didn't fire, no other PR touched the relevant pipeline) points to plain agent non-determinism.

🤖 Generated with Claude Code

…sory

Investigation in #362 showed the Flow 3 FAIL mode (cc_delta=0 + verdicts=0 after a successful commit) is not a sync chain regression but downstream agent variance: Flow 5's agent decides whether to call resolve_compliance based on the _sync_guidance instruction on the history response, and different runs land different decisions on the same natural prompt.

Verified across three independent CI artifacts:
- Passing run (devin/1778808036): Flow 5 called resolve_compliance → 10 verdicts written → cc_delta>0 → Flow 3 PASS
- Failing run (PR #354): Flow 5 skipped resolve_compliance → cc_delta=0 → Flow 3 FAIL (strict, blocking)
- Failing run (PR #360): Flow 5 skipped resolve_compliance → cc_delta=0 → Flow 3 FAIL (strict, blocking)

In every failing run the _sync_guidance field on the history response explicitly instructed "call bicameral.resolve_compliance". The agent ignored it. Querying each failing-run ledger.db artifact confirmed PR #351's prune logic did NOT prune any decision (signoff states: proposed/proposed/ratified), ruling out the prime suspect.

The fix per the e2e report's own CORRECTION-PATH STATUS message ("the end-to-end correction dynamic is NOT validated by this headless harness... validate the agentic layer via the interactive recording path"): make Flow 3's FAIL branch advisory so the gap stays visible in the report (matching Flow 2a/4/4b in the same agentic layer) without blocking unrelated PRs.

Per DEV_CYCLE.md the prompt cannot be tightened to name the tool the agent should call: "Use natural prompts — never name the tool the agent is supposed to auto-fire. Naming the tool defeats the trigger that IS the product." So Flow 3's strict assertion was always going to flake on agent variance; this commit just makes the harness honest about that constraint.

Changes:
- Flow 3 FAIL branch: verdict still FAIL but advisory text now set → blocking_failures filter at line 1608 (`verdict in ("FAIL","ERROR") AND NOT advisory`) excludes it → overall PASS unchanged when this is the only red flow
- Updated the verdict-matrix docstring (line 320) to reflect the new non-blocking semantics

Out of scope for this PR:
- The underlying agent variance itself. That's the same class of issue #154/#156 already track for Flow 2a/4. Real fix is in the agentic auto-fire layer, not the e2e harness.
- A real bisect over the 3 dev commits in the breakage window. Could still be worth doing if anyone suspects the variance is being amplified server-side, but the evidence above (prune logic didn't fire, unrelated PRs touched unrelated areas) points strongly at variance.

Closes #362.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-15T20:28:00Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a55c42b0-1903-4302-8838-4d16b966d5ec


silongtan added labels May 15, 2026: flow:feature (Standard feature/fix PR targeting BicameralAI/dev, the default flow); P1 (High: ship this milestone; user-impacting bug or committed feature); test (Test infrastructure, fixtures, or coverage work); fix (Bug fix or correctness repair)

silongtan temporarily deployed to ci-test May 15, 2026 20:27 — with GitHub Actions Inactive

silongtan had a problem deploying to recording-approval May 15, 2026 20:27 — with GitHub Actions Failure

silongtan temporarily deployed to production May 15, 2026 20:27 — with GitHub Actions Inactive

silongtan merged commit 1019083 into dev May 15, 2026
9 of 10 checks passed

silongtan deleted the fix/362-e2e-flow3-regression branch May 15, 2026 20:44

silongtan mentioned this pull request May 15, 2026

infra(symlinks): #357 sub-task 4 — Windows symlink materialization gate #364

Merged

3 tasks

jinhongkuan mentioned this pull request May 16, 2026

release: v0.15.0 — PII archive, hard-delete remove_decision, schema v17→v24 chain #388

Merged

9 tasks

Knapp-Kevin mentioned this pull request May 21, 2026

fix(e2e): Flow 3 ledger-sync regression — no compliance_check rows written despite successful agent run (post-Group A) #355

Closed

silongtan merged 1 commit into dev from fix/362-e2e-flow3-regression
