ci(#108): v0 user flow e2e — Claude Code CLI sessions vs desktop/desktop by jinhongkuan · Pull Request #142 · BicameralAI/bicameral-mcp

jinhongkuan · 2026-04-30T23:52:57Z

Summary

Adds `v0 user flow e2e` — a CI workflow that drives a real Claude Code CLI session per spec flow with `bicameral-mcp` registered as the only MCP server, and asserts on the stream-json transcript that the right MCP tools were called with the right shapes.

This is the canonical user-experience test for `BicameralAI/bicameral#108`. It complements the handler-replay simulation that landed in #139 (`scripts/sim_issue_108_flows.py`).

Why two tests?

The handler-replay sim imports handler functions directly and calls them. Fast, no API spend, useful for iterating on handler logic. But it bypasses three layers a real user exercises:

- MCP protocol: JSON-RPC over stdio, tool schema marshalling
- Skill files: `.claude/skills/bicameral-*/SKILL.md` trigger matching, auto-chains
- Caller LLM: natural-language → tool-call sequencing

This e2e suite exercises all three. The two tests together form the spec's two-level validation: handler invariants (replay sim) + user-experience contract (this directory).

Test fixture: github.com/desktop/desktop

Pinned commit `e6c50fb028171e9cec03594273c8116bb135847e`. Real-world ingest content from `docs/process/roadmap.md`; bind target is the `CherryPickResult` enum in `app/src/lib/git/cherry-pick.ts` (a stable, slow-changing public type that genuinely corresponds to the cherry-pick roadmap item — bind is meaningful, not arbitrary).

What's in the PR

| File | Notes |
| --- | --- |
| `tests/e2e/run_e2e_flows.py` | Orchestrator: per flow, invokes `claude -p` with the prompt + bicameral MCP config, captures stream-json, asserts on tool-use blocks |
| `tests/e2e/bicameral.mcp.json` | MCP config: registers `bicameral-mcp` with `SURREAL_URL=memory://` so each flow gets a fresh ledger |
| `tests/e2e/prompts/flow-{1..5}-*.md` | Five natural-language user prompts, one per flow |
| `tests/e2e/README.md` | Local-run instructions + per-flow contract + design rationale |
| `.github/workflows/v0-user-flow-e2e.yml` | CI workflow: `production` env (for `CLAUDE_CODE_OAUTH_TOKEN`), pinned desktop/desktop commit, transcript artifact upload |

Per-flow contract (asserted by the orchestrator)

| Flow | What's asserted |
| --- | --- |
| 1: record decisions | `bicameral.ingest` called with `mappings` (≥1) |
| 2: preflight | `bicameral.preflight` called with `file_paths` containing `cherry-pick.ts` |
| 3: commit→reflected | `bicameral.link_commit` + `bicameral.resolve_compliance` both called; resolve_compliance carries `verdicts` |
| 4: session-end | `bicameral.ingest` called with `source='agent_session'` (top-level or per-mapping `span.source_type`) |
| 5: history | `bicameral.history` called; seed pre-conditions (`ingest` + `ratify`) present |
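As an illustration of how such a contract can be checked, here is a minimal sketch of the Flow 2 assertion. It is not the orchestrator's actual code. The event shape (one JSON object per line, `assistant` events carrying `tool_use` content blocks), the `mcp__bicameral__` tool-name prefix, and the helper names `iter_tool_uses` and `assert_flow_2_sketch` are assumptions made for the example.

```python
import json

# Hypothetical transcript; the event shape and the mcp__ prefix are assumptions.
SAMPLE_TRANSCRIPT = "\n".join([
    json.dumps({"type": "system", "subtype": "init"}),
    json.dumps({
        "type": "assistant",
        "message": {"content": [
            {"type": "text", "text": "Checking prior decisions first."},
            {"type": "tool_use", "id": "toolu_1",
             "name": "mcp__bicameral__bicameral.preflight",
             "input": {"file_paths": ["app/src/lib/git/cherry-pick.ts"]}},
        ]},
    }),
    json.dumps({"type": "result", "subtype": "success"}),
])


def iter_tool_uses(transcript: str):
    """Yield (tool_name, tool_input) for every tool_use block in the transcript."""
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise in the captured stream
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        message = event.get("message")
        content = message.get("content", []) if isinstance(message, dict) else []
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                yield block.get("name", ""), block.get("input") or {}


def assert_flow_2_sketch(transcript: str) -> None:
    """Flow 2: preflight must be called with a file path ending in cherry-pick.ts."""
    calls = [inp for name, inp in iter_tool_uses(transcript)
             if name.endswith("bicameral.preflight")]
    assert calls, "bicameral.preflight was never called"
    assert any(
        any(str(p).endswith("cherry-pick.ts") for p in inp.get("file_paths", []))
        for inp in calls
    ), "no preflight call had cherry-pick.ts in file_paths"


assert_flow_2_sketch(SAMPLE_TRANSCRIPT)
```

Matching on the tool-name suffix keeps the check independent of whatever server prefix the CLI adds to MCP tool names.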

CI shape

```yaml
environment: production  # CLAUDE_CODE_OAUTH_TOKEN
on:
  pull_request:
    paths: [tests/e2e/**, handlers/**, ledger/**, contracts.py, skills/bicameral-**, ...]
  workflow_dispatch:  # manual debug trigger
```

Triggers on PRs touching paths whose behaviour the e2e validates. `workflow_dispatch` lets maintainers re-run on demand without a code change.

Test plan

```bash
# Locally (after npm i -g @anthropic-ai/claude-code + claude auth):
cd pilot/mcp
git clone --depth=1 https://github.com/desktop/desktop /tmp/desktop-clone
(cd /tmp/desktop-clone && git checkout -b main)
DESKTOP_REPO_PATH=/tmp/desktop-clone python tests/e2e/run_e2e_flows.py
```

Expected: `Overall: PASS` after ~3-5 minutes (5 Claude sessions). Cost ~$0.50-$2.00 per run.

In CI: this PR will trigger the workflow on itself once the workflow file lands on `dev`. (GitHub workflows execute the version on the base branch, so the very first run is on the next qualifying PR after merge.)

Risk

L2 — new CI workflow, real API spend per run. Mitigations:

- `--max-budget-usd 2.0` per flow caps cost (worst case ~$10/run, typical ~$1)
- `--strict-mcp-config` ensures only bicameral-mcp is loaded (no other MCP servers leak in from runner)
- Bash tool intentionally NOT in `--allowed-tools` (only `mcp__bicameral,Read,Grep`); bicameral skills shouldn't need shell
- Transcripts uploaded as artifacts on every run (30-day retention) for forensics
- Pinned desktop/desktop commit: fixture won't drift mid-CI
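To show how the flag-based mitigations combine, here is a hedged sketch of building the per-flow command line. The flags `--max-budget-usd`, `--strict-mcp-config`, and `--allowed-tools` come from the list above. `--mcp-config`, `--output-format stream-json`, and `--verbose` are assumptions about how the orchestrator wires the session, so check them against `claude --help`. The function name `build_claude_cmd` is hypothetical.

```python
from pathlib import Path

FLOW_COUNT = 5          # five flows, one Claude session each
BUDGET_PER_FLOW = 2.0   # USD cap passed to --max-budget-usd
ALLOWED_TOOLS = "mcp__bicameral,Read,Grep"  # Bash deliberately absent


def build_claude_cmd(prompt: str, mcp_config: Path,
                     budget_usd: float = BUDGET_PER_FLOW) -> list[str]:
    """Assemble argv for one flow; the command is built here, not executed."""
    return [
        "claude", "-p", prompt,
        "--mcp-config", str(mcp_config),       # assumed flag, see lead-in
        "--strict-mcp-config",                 # load only the servers in that config
        "--allowed-tools", ALLOWED_TOOLS,
        "--max-budget-usd", str(budget_usd),
        "--output-format", "stream-json",      # assumed flag, see lead-in
        "--verbose",                           # assumed flag, see lead-in
    ]


cmd = build_claude_cmd("Record the roadmap decisions.",
                       Path("tests/e2e/bicameral.mcp.json"))
worst_case_usd = FLOW_COUNT * BUDGET_PER_FLOW  # 5 flows x $2.00 cap = $10.00
```

Passing the list to `subprocess.run` without `shell=True` would keep the prompt text from being interpreted by a shell.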

Out of scope (deferred)

- Caching `@anthropic-ai/claude-code` install across runs (small win; npm i is ~10s)
- Splitting flows into separate jobs for parallelism (would 5x cost but cut wall time; defer until we have data on flake rate)
- Cross-flow state (each flow uses fresh `memory://` ledger; flows that need history seed it inline in their prompt)

Refs #108. Follow-on to PRs #138 (#135 dashboard tooltip) and #139 (#108 handler-replay sim + skill correction).

🤖 Generated with Claude Code

Drives a real Claude Code CLI session per spec flow with bicameral-mcp registered as the only MCP server, and asserts on the stream-json transcript that the right MCP tools were called with the right shapes.

Why a separate test from scripts/sim_issue_108_flows.py:

The handler-replay sim imports handler functions directly. It validates server invariants (status projection, signoff transitions, ephemeral detection) but bypasses three layers a real user exercises:

- MCP protocol marshalling (JSON-RPC over stdio)
- Skill files (.claude/skills/bicameral-*/SKILL.md trigger matching, auto-chains: preflight → capture-corrections → context-sentry → ingest → judge_gaps)
- Caller LLM tool sequencing from natural language

This e2e covers all three. The two tests are complementary: handler-replay for fast local dev iteration on handler logic, e2e for the user-experience contract.

Test fixture: Pinned commit of github.com/desktop/desktop (e6c50fb…). Real-world ingest content from docs/process/roadmap.md; bind target is the CherryPickResult enum in app/src/lib/git/cherry-pick.ts (a stable, slow-changing public type that genuinely corresponds to the cherry-pick roadmap item).

CI shape:

- environment: production (provides CLAUDE_CODE_OAUTH_TOKEN)
- Triggers on PRs touching tests/e2e/**, handlers/**, ledger/**, contracts.py, skills/bicameral-**, server.py, pyproject.toml, or the workflow itself
- Installs Claude Code CLI (npm) + bicameral-mcp (pip -e .)
- Clones desktop/desktop at the pinned commit, stamps a real 'main' branch so feature-branch tests work (clone is otherwise detached HEAD)
- Probes CLAUDE_CODE_OAUTH_TOKEN visibility without leaking it
- Runs all five flows in a single Python orchestrator
- Uploads stream-json transcripts (30-day retention) for failure forensics

Per-flow contract:

- Flow 1: bicameral.ingest called with mappings (≥1)
- Flow 2: bicameral.preflight called with file_paths containing cherry-pick.ts
- Flow 3: bicameral.link_commit + bicameral.resolve_compliance both called; resolve_compliance carries verdicts
- Flow 4: bicameral.ingest called with source='agent_session' (top-level or per-mapping span.source_type)
- Flow 5: bicameral.history called, with seed ingest + ratify pre-conditions

Cost: ~$0.50–$2.00 per CI run (each flow capped at --max-budget-usd 2.0).

Refs #108. Complementary to the handler-replay sim shipped in PR #139.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-30T23:53:05Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

github-actions · 2026-04-30T23:53:23Z

Bicameral drift report — skipped

No bicameral/decisions.yaml found in repo root. Drift report is skipped for this PR.

To enable: add a bicameral/decisions.yaml manifest. See setup guide (link to be added when manifest spec ships).

…TH for MCP

First CI run (workflow_dispatch on 25195231091) surfaced three issues, all in the test infrastructure, not the implementation. Fixing them.

1. assert_flow_1, assert_flow_4: the bicameral.ingest tool wraps its input in a 'payload' key (matching the IngestPayload contract), and the skill-side spelling for the items array is 'decisions', not 'mappings'. The asserters were looking at input.mappings and input.source, both absent. Now they look at input.payload.{decisions|mappings} and input.payload.source. Verified against transcripts:

   - Flow 1: payload.decisions=[…] → was reported as "no mappings"
   - Flow 4: payload.source='agent_session' → was reported as "top_source=''"

   Also extended the resolve_compliance asserter (Flow 3) for the same payload-wrapping pattern.

2. bicameral.mcp.json: the env block lacked REPO_PATH, so the spawned bicameral-mcp server fell back to '.' (claude CLI's cwd, which is pilot/mcp/, not the desktop/desktop clone). bind couldn't find app/src/lib/git/cherry-pick.ts and Flow 3's chain aborted at bind instead of progressing to link_commit + resolve_compliance.

   Fix: template the config with ${DESKTOP_REPO_PATH}, materialize at orchestrator runtime by substituting the env-var value, write a runtime copy under test-results/e2e/. Works locally + in CI without committing a CI-specific path.

   (MCP env-merge vs env-replace behaviour is implementation-defined across Claude Code versions, so passing REPO_PATH explicitly via the config is more robust than relying on parent-process env propagation.)

Refs #108. First-iteration validation of PR #142's e2e harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
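The two fixes in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code that landed. The helper names, the runtime file name `bicameral.mcp.runtime.json`, and the `command` value in the sample template are hypothetical; the `mcpServers` layout is assumed to be the shape of the committed config.

```python
import json
import string
import tempfile
from pathlib import Path


def unwrap_payload(tool_input: dict) -> dict:
    """Return the inner 'payload' dict when present, else the input itself."""
    payload = tool_input.get("payload")
    return payload if isinstance(payload, dict) else tool_input


def ingest_items(tool_input: dict) -> list:
    """Items array under either spelling: 'decisions' or 'mappings'."""
    body = unwrap_payload(tool_input)
    return body.get("decisions") or body.get("mappings") or []


def materialize_config(template_text: str, repo_path: str, out_dir: Path) -> Path:
    """Substitute ${DESKTOP_REPO_PATH} and write a runtime copy of the config."""
    rendered = string.Template(template_text).substitute(DESKTOP_REPO_PATH=repo_path)
    json.loads(rendered)  # fail fast if the substitution broke the JSON
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "bicameral.mcp.runtime.json"  # hypothetical file name
    out_path.write_text(rendered, encoding="utf-8")
    return out_path


# Hypothetical template mirroring the described env block.
TEMPLATE = json.dumps({
    "mcpServers": {"bicameral": {
        "command": "bicameral-mcp",
        "env": {"SURREAL_URL": "memory://", "REPO_PATH": "${DESKTOP_REPO_PATH}"},
    }}
})

with tempfile.TemporaryDirectory() as tmp:
    runtime = materialize_config(TEMPLATE, "/tmp/desktop-clone",
                                 Path(tmp) / "test-results" / "e2e")
    config = json.loads(runtime.read_text(encoding="utf-8"))
```

Plain string substitution would produce invalid JSON for paths containing backslashes or quotes, which is why the sketch re-parses the rendered text before writing it.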

ruff lint complained:

UP035 Import from `collections.abc` instead: `Callable`

Trivial fix: move `Callable` import from `typing` to `collections.abc` (PEP 585 modernization). Also re-format triggered by the import-order shift.

Verified locally:

- python3 -m ruff check tests/e2e/run_e2e_flows.py → All checks passed!
- python3 -m ruff format --check tests/e2e/run_e2e_flows.py → 1 file already formatted

Unblocks PR #142 merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…on slide

Adds an opt-in workflow_dispatch path that records a single split-screen demo session, then post-splits it into pm.mp4 (PM persona, two chapters joined by an ffmpeg-generated transition slide) and dev.mp4 (Dev persona).

Plan: thoughts/shared/plans/2026-04-30-v0-userflow-demo-recording.md

(Predecessor: PR #142, the assertion-only e2e on the same workflow file.)

Why one continuous claude session instead of two persona sessions: the e2e config uses SURREAL_URL=memory://, so each MCP process is a fresh ledger. A single session is what makes Scene 3 (PM post-impl) show the SSE events from Scene 2 (Dev) authentically: same dashboard, same state, no re-hydration.

Scene boundaries are detected from the stream-json tool-call timeline (no LLM-emitted sentinels):

- Scene 1 → 2 = first bicameral.preflight call
- Scene 2 → 3 = first bicameral.history call after any link_commit

Recording step is `continue-on-error: true`, so the assertion-only step remains the sole authority on workflow conclusion.

MP4s are .gitignored and excluded from the wheel; they live in the v0-user-flow-e2e-demos GitHub artifact (90-day retention).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
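The scene-boundary rule described in this commit message can be sketched as a single pass over the tool-call timeline. This is a minimal illustration, assuming the recorder can attach a timestamp (seconds from session start) to each tool call. The function name `scene_boundaries` and the `(timestamp, tool_name)` tuple format are hypothetical.

```python
from typing import Optional


def scene_boundaries(
    timeline: list[tuple[float, str]],
) -> tuple[Optional[float], Optional[float]]:
    """Return (scene 1→2, scene 2→3) cut points, or None where not found.

    timeline holds (seconds_from_start, tool_name) pairs in call order.
    """
    scene_1_to_2: Optional[float] = None
    scene_2_to_3: Optional[float] = None
    seen_link_commit = False
    for ts, name in timeline:
        # Scene 1 → 2: the first preflight call.
        if scene_1_to_2 is None and name.endswith("bicameral.preflight"):
            scene_1_to_2 = ts
        if name.endswith("bicameral.link_commit"):
            seen_link_commit = True
        # Scene 2 → 3: the first history call after any link_commit.
        elif (seen_link_commit and scene_2_to_3 is None
              and name.endswith("bicameral.history")):
            scene_2_to_3 = ts
    return scene_1_to_2, scene_2_to_3


example = [
    (1.0, "bicameral.ingest"),
    (5.0, "bicameral.history"),      # before any link_commit: ignored
    (10.0, "bicameral.preflight"),
    (12.0, "bicameral.preflight"),
    (20.0, "bicameral.link_commit"),
    (30.0, "bicameral.history"),
    (40.0, "bicameral.history"),
]
cuts = scene_boundaries(example)
```

Returning `None` for a missing boundary lets the caller skip the split instead of cutting at a guessed position.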

jinhongkuan temporarily deployed to ci-test April 30, 2026 23:53 — with GitHub Actions Inactive

jinhongkuan had a problem deploying to production April 30, 2026 23:53 — with GitHub Actions Failure

jinhongkuan had a problem deploying to production April 30, 2026 23:54 — with GitHub Actions Failure

jinhongkuan temporarily deployed to production May 1, 2026 00:04 — with GitHub Actions Inactive

jinhongkuan temporarily deployed to ci-test May 1, 2026 00:04 — with GitHub Actions Inactive

jinhongkuan temporarily deployed to production May 1, 2026 00:04 — with GitHub Actions Inactive

jinhongkuan had a problem deploying to production May 1, 2026 00:18 — with GitHub Actions Failure

jinhongkuan temporarily deployed to ci-test May 1, 2026 00:18 — with GitHub Actions Inactive

jinhongkuan merged commit 489c0fd into dev May 1, 2026
6 of 7 checks passed

jinhongkuan mentioned this pull request May 1, 2026

ci(#108): demo recording fast-follow — pm.mp4 + dev.mp4 with transition slide #144

Merged

3 tasks

jinhongkuan merged 3 commits into `dev` from `feat/108-v0-userflow-e2e-ci`
