ci(#108): v0 user flow e2e — Claude Code CLI sessions vs desktop/desktop#142
Drives a real Claude Code CLI session per spec flow with bicameral-mcp
registered as the only MCP server, and asserts on the stream-json
transcript that the right MCP tools were called with the right shapes.
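As a rough illustration of what asserting on the stream-json transcript can look like, here is a minimal Python sketch. It assumes each transcript line is one JSON object, that assistant events carry a `message.content` list, and that tool calls appear there as blocks with `type: "tool_use"`, a `name`, and an `input`. The helper name `iter_tool_calls` and the sample tool name are hypothetical; the real orchestrator may parse the transcript differently.

```python
import json
from collections.abc import Iterator


def iter_tool_calls(transcript: str) -> Iterator[tuple[str, dict]]:
    """Yield (tool_name, tool_input) for each tool_use block in a JSONL transcript."""
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise such as stray log lines
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        content = (event.get("message") or {}).get("content") or []
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                yield block.get("name", ""), block.get("input") or {}


# Hypothetical two-event transcript; the tool name format is an assumption.
sample = "\n".join([
    json.dumps({"type": "system", "subtype": "init"}),
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "text", "text": "Checking prior decisions."},
        {"type": "tool_use", "name": "mcp__bicameral__preflight",
         "input": {"file_paths": ["app/src/lib/git/cherry-pick.ts"]}},
    ]}}),
])
calls = list(iter_tool_calls(sample))
```

Keeping extraction separate from the per-flow checks means each asserter only has to reason about a flat list of `(name, input)` pairs.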
Why a separate test from scripts/sim_issue_108_flows.py:
The handler-replay sim imports handler functions directly. It validates
server invariants (status projection, signoff transitions, ephemeral
detection) but bypasses three layers a real user exercises:
- MCP protocol marshalling (JSON-RPC over stdio)
- Skill files (.claude/skills/bicameral-*/SKILL.md trigger matching,
auto-chains: preflight → capture-corrections → context-sentry →
ingest → judge_gaps)
- Caller LLM tool sequencing from natural language
This e2e covers all three. The two tests are complementary: handler-replay
for fast local dev iteration on handler logic, e2e for the user-experience
contract.
Test fixture:
Pinned commit of github.com/desktop/desktop (e6c50fb…). Real-world ingest
content from docs/process/roadmap.md; bind target is the CherryPickResult
enum in app/src/lib/git/cherry-pick.ts (a stable, slow-changing public
type that genuinely corresponds to the cherry-pick roadmap item).
CI shape:
- environment: production (provides CLAUDE_CODE_OAUTH_TOKEN)
- Triggers on PRs touching tests/e2e/**, handlers/**, ledger/**,
contracts.py, skills/bicameral-**, server.py, pyproject.toml, or the
workflow itself
- Installs Claude Code CLI (npm) + bicameral-mcp (pip -e .)
- Clones desktop/desktop at the pinned commit, stamps a real 'main' branch
so feature-branch tests work (clone is otherwise detached HEAD)
- Probes CLAUDE_CODE_OAUTH_TOKEN visibility without leaking it
- Runs all five flows in a single Python orchestrator
- Uploads stream-json transcripts (30-day retention) for failure forensics
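The token probe step in the list above could be sketched roughly as follows. This is an assumption about the approach, not the workflow's actual step: it reports only whether the variable is set and how long it is, and never prints the value. The function name `describe_secret` is hypothetical.

```python
import os


def describe_secret(name: str, env=None) -> str:
    """Report whether a secret is visible, without echoing its value."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value:
        return f"{name}: NOT visible to this step"
    return f"{name}: visible (length {len(value)})"


# Fake environment for illustration; a CI step would call it without `env`.
report = describe_secret("CLAUDE_CODE_OAUTH_TOKEN", {"CLAUDE_CODE_OAUTH_TOKEN": "abc123"})
```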
Per-flow contract:
Flow 1 — bicameral.ingest called with mappings (≥1)
Flow 2 — bicameral.preflight called with file_paths containing
cherry-pick.ts
Flow 3 — bicameral.link_commit + bicameral.resolve_compliance both
called; resolve_compliance carries verdicts
Flow 4 — bicameral.ingest called with source='agent_session' (top-level
or per-mapping span.source_type)
Flow 5 — bicameral.history called, with seed ingest + ratify pre-conditions
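To make the contract above concrete, here is a hedged sketch of two checks, for Flow 2 and Flow 3. It assumes tool calls have already been extracted as `(name, input)` pairs and that `file_paths` and `verdicts` sit at the top level of the tool input, which may not match the real payload shape. The names `tool_matches`, `check_flow_2`, and `check_flow_3`, and the sample verdict value, are hypothetical.

```python
def tool_matches(name: str, tool: str) -> bool:
    """True if a recorded tool name refers to `tool`, ignoring any server prefix."""
    return name == tool or name.endswith(("." + tool, "__" + tool))


def check_flow_2(calls):
    """Flow 2: preflight called with file_paths containing cherry-pick.ts."""
    for name, tool_input in calls:
        if not tool_matches(name, "preflight"):
            continue
        paths = tool_input.get("file_paths") or []
        if any(str(p).endswith("cherry-pick.ts") for p in paths):
            return True, "preflight saw cherry-pick.ts"
    return False, "no preflight call with cherry-pick.ts in file_paths"


def check_flow_3(calls):
    """Flow 3: link_commit and resolve_compliance both called, the latter with verdicts."""
    linked = any(tool_matches(name, "link_commit") for name, _ in calls)
    with_verdicts = any(
        tool_matches(name, "resolve_compliance") and bool(tool_input.get("verdicts"))
        for name, tool_input in calls
    )
    if linked and with_verdicts:
        return True, "link_commit and resolve_compliance(verdicts) both seen"
    return False, f"link_commit={linked}, resolve_compliance_with_verdicts={with_verdicts}"


# Placeholder calls; the verdict entry is illustrative only.
calls = [
    ("bicameral.preflight", {"file_paths": ["app/src/lib/git/cherry-pick.ts"]}),
    ("bicameral.link_commit", {}),
    ("bicameral.resolve_compliance", {"verdicts": [{"verdict": "compliant"}]}),
]
flow_2 = check_flow_2(calls)
flow_3 = check_flow_3(calls)
```

Returning a `(passed, detail)` pair rather than raising lets a single orchestrator run report every flow's result before deciding the overall outcome.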
Cost: ~$0.50–$2.00 per CI run (each flow capped at --max-budget-usd 2.0).
Refs #108. Complementary to the handler-replay sim shipped in PR #139.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…TH for MCP
First CI run (workflow_dispatch on 25195231091) surfaced three issues — all
in the test infrastructure, not the implementation. Fixing them.
1. assert_flow_1, assert_flow_4 — the bicameral.ingest tool wraps its input
in a 'payload' key (matching the IngestPayload contract), and the
skill-side spelling for the items array is 'decisions', not 'mappings'.
The asserters were looking at input.mappings and input.source — both
absent. Now they look at input.payload.{decisions|mappings} and
input.payload.source. Verified against transcripts:
Flow 1 — payload.decisions=[…] → was reported as "no mappings"
Flow 4 — payload.source='agent_session' → was reported as "top_source=''"
Also extended the resolve_compliance asserter (Flow 3) for the same
payload-wrapping pattern.
2. bicameral.mcp.json — the env block lacked REPO_PATH, so the spawned
bicameral-mcp server fell back to '.' (claude CLI's cwd, which is
pilot/mcp/, not the desktop/desktop clone). bind couldn't find
app/src/lib/git/cherry-pick.ts and Flow 3's chain aborted at bind
instead of progressing to link_commit + resolve_compliance.
Fix: template the config with ${DESKTOP_REPO_PATH}, materialize at
orchestrator runtime by substituting the env-var value, write a
runtime copy under test-results/e2e/. Works locally + in CI without
committing a CI-specific path. (MCP env-merge vs env-replace
behaviour is implementation-defined across Claude Code versions, so
passing REPO_PATH explicitly via the config is more robust than
relying on parent-process env propagation.)
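Both fixes above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the harness's actual code: `ingest_fields` and `materialize_config` are hypothetical names, the `mcpServers` layout and the `bicameral-mcp` command in the sample template are assumed, and the runtime file name is invented for the example. Only the `payload` wrapping, the `decisions`/`mappings` spelling, the `${DESKTOP_REPO_PATH}` placeholder, and the `test-results/e2e/` output directory come from the description above.

```python
import json
import string
import tempfile
from pathlib import Path


def ingest_fields(tool_input: dict) -> tuple[list, str]:
    """Return (items, source) from an ingest call, unwrapping 'payload' if present."""
    body = tool_input.get("payload")
    if not isinstance(body, dict):
        body = tool_input  # fall back to the unwrapped shape
    items = body.get("decisions") or body.get("mappings") or []
    return items, body.get("source", "")


def materialize_config(template_text: str, repo_path: str, out_dir: Path) -> Path:
    """Substitute ${DESKTOP_REPO_PATH} and write a runtime copy of the config."""
    rendered = string.Template(template_text).substitute(DESKTOP_REPO_PATH=repo_path)
    json.loads(rendered)  # fail fast if substitution produced invalid JSON
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "bicameral.mcp.runtime.json"  # hypothetical file name
    out_path.write_text(rendered, encoding="utf-8")
    return out_path


# Assumed config layout; only the REPO_PATH env entry is taken from the text.
template = json.dumps({"mcpServers": {"bicameral": {
    "command": "bicameral-mcp",
    "env": {"REPO_PATH": "${DESKTOP_REPO_PATH}"},
}}})
with tempfile.TemporaryDirectory() as tmp:
    path = materialize_config(template, "/tmp/desktop-clone", Path(tmp) / "test-results" / "e2e")
    config = json.loads(path.read_text(encoding="utf-8"))

items, source = ingest_fields({"payload": {"decisions": [{"id": 1}], "source": "agent_session"}})
```

Using `substitute` rather than `safe_substitute` makes a missing `DESKTOP_REPO_PATH` an immediate error instead of a literal placeholder reaching the spawned server.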
Refs #108. First-iteration validation of PR #142's e2e harness.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ruff lint complained:
UP035 Import from `collections.abc` instead: `Callable`
Trivial fix: move ``Callable`` import from ``typing`` to ``collections.abc``
(PEP 585 modernization). Also re-format triggered by the import-order shift.
Verified locally:
python3 -m ruff check tests/e2e/run_e2e_flows.py → All checks passed!
python3 -m ruff format --check tests/e2e/run_e2e_flows.py → 1 file already formatted
Unblocks PR #142 merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
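For readers unfamiliar with UP035, the change amounts to the import shown below. The type alias and helper are hypothetical and exist only to show the import in use; they are not taken from `run_e2e_flows.py`.

```python
# Before (flagged by UP035): from typing import Callable
from collections.abc import Callable

# Hypothetical alias: a check takes recorded calls and returns (passed, detail).
Asserter = Callable[[list[dict]], tuple[bool, str]]


def run_asserter(check: Asserter, calls: list[dict]) -> str:
    """Run one check and format its outcome as a single report line."""
    ok, detail = check(calls)
    return f"{'PASS' if ok else 'FAIL'}: {detail}"


line = run_asserter(lambda calls: (bool(calls), f"{len(calls)} call(s)"), [{"name": "x"}])
```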
…on slide
Adds an opt-in workflow_dispatch path that records a single split-screen
demo session, then post-splits it into pm.mp4 (PM persona, two chapters
joined by an ffmpeg-generated transition slide) and dev.mp4 (Dev persona).
Plan: thoughts/shared/plans/2026-04-30-v0-userflow-demo-recording.md
(Predecessor: PR #142, the assertion-only e2e on the same workflow file.)
Why one continuous claude session instead of two persona sessions:
the e2e config uses SURREAL_URL=memory://, so each MCP process is a fresh
ledger. A single session is what makes Scene 3 (PM post-impl) show the SSE
events from Scene 2 (Dev) authentically: same dashboard, same state, no
re-hydration.
Scene boundaries are detected from the stream-json tool-call timeline (no
LLM-emitted sentinels):
Scene 1 → 2 = first bicameral.preflight call
Scene 2 → 3 = first bicameral.history call after any link_commit
Recording step is `continue-on-error: true`; the assertion-only step remains
the sole authority on workflow conclusion.
MP4s are .gitignored and excluded from the wheel; they live in the
v0-user-flow-e2e-demos GitHub artifact (90-day retention).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
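The scene-boundary rule described in this commit can be sketched as a single pass over the tool-call timeline. This is an illustration only: the `ToolCall` record, the idea that each call carries a timestamp in seconds, and the function names are assumptions, not the recorder's actual data model.

```python
from typing import NamedTuple, Optional


class ToolCall(NamedTuple):
    t: float    # seconds since recording start (assumed to be tracked by the recorder)
    name: str   # recorded tool name, with or without a server prefix


def tool_matches(name: str, tool: str) -> bool:
    """True if a recorded tool name refers to `tool`, ignoring any server prefix."""
    return name == tool or name.endswith(("." + tool, "__" + tool))


def scene_boundaries(timeline: list[ToolCall]) -> tuple[Optional[float], Optional[float]]:
    """Return (scene 1->2, scene 2->3) timestamps, or None where a boundary is missing."""
    first_preflight = None
    seen_link_commit = False
    first_history_after_link = None
    for call in timeline:
        if first_preflight is None and tool_matches(call.name, "preflight"):
            first_preflight = call.t
        if tool_matches(call.name, "link_commit"):
            seen_link_commit = True
        elif (seen_link_commit and first_history_after_link is None
              and tool_matches(call.name, "history")):
            first_history_after_link = call.t
    return first_preflight, first_history_after_link


# Illustrative timeline; the early history call must not count as a boundary.
timeline = [
    ToolCall(1.0, "bicameral.ingest"),
    ToolCall(2.0, "bicameral.history"),
    ToolCall(5.0, "bicameral.preflight"),
    ToolCall(9.0, "bicameral.link_commit"),
    ToolCall(12.0, "bicameral.history"),
    ToolCall(15.0, "bicameral.history"),
]
boundaries = scene_boundaries(timeline)
```

Deriving the cut points from tool calls keeps the split deterministic for a given transcript and avoids depending on the model to emit marker text.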
Summary
Adds `v0 user flow e2e` — a CI workflow that drives a real Claude Code CLI session per spec flow with `bicameral-mcp` registered as the only MCP server, and asserts on the stream-json transcript that the right MCP tools were called with the right shapes.
This is the canonical user-experience test for `BicameralAI/bicameral#108`. It complements the handler-replay simulation that landed in #139 (`scripts/sim_issue_108_flows.py`).
Why two tests?
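The handler-replay sim imports handler functions directly and calls them. Fast, no API spend, useful for iterating on handler logic. But it bypasses three layers a real user exercises:
- MCP protocol marshalling (JSON-RPC over stdio)
- Skill files (`.claude/skills/bicameral-*/SKILL.md` trigger matching and auto-chains)
- Caller LLM tool sequencing from natural language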
This e2e suite exercises all three. The two tests together form the spec's two-level validation: handler invariants (replay sim) + user-experience contract (this directory).
Test fixture: github.com/desktop/desktop
Pinned commit `e6c50fb028171e9cec03594273c8116bb135847e`. Real-world ingest content from `docs/process/roadmap.md`; bind target is the `CherryPickResult` enum in `app/src/lib/git/cherry-pick.ts` (a stable, slow-changing public type that genuinely corresponds to the cherry-pick roadmap item — bind is meaningful, not arbitrary).
What's in the PR
Per-flow contract (asserted by the orchestrator)
CI shape
```yaml
environment: production # CLAUDE_CODE_OAUTH_TOKEN
on:
pull_request:
paths: [tests/e2e/**, handlers/**, ledger/**, contracts.py, skills/bicameral-**, ...]
workflow_dispatch: # manual debug trigger
```
Triggers on PRs touching paths whose behaviour the e2e validates. `workflow_dispatch` lets maintainers re-run on demand without a code change.
Test plan
```bash
# Locally (after npm i -g @anthropic-ai/claude-code + claude auth):
cd pilot/mcp
git clone --depth=1 https://github.com/desktop/desktop /tmp/desktop-clone
(cd /tmp/desktop-clone && git checkout -b main)
DESKTOP_REPO_PATH=/tmp/desktop-clone python tests/e2e/run_e2e_flows.py
```
Expected: `Overall: PASS` after ~3-5 minutes (5 Claude sessions). Cost ~$0.50-$2.00 per run.
In CI: this PR will trigger the workflow on itself once the workflow file lands on `dev`. (GitHub workflows execute the version on the base branch, so the very first run is on the next qualifying PR after merge.)
Risk
L2 — new CI workflow, real API spend per run. Mitigations:
Out of scope (deferred)
Refs #108. Follow-on to PRs #138 (#135 dashboard tooltip) and #139 (#108 handler-replay sim + skill correction).
🤖 Generated with Claude Code