
ci(#108): v0 user flow e2e — Claude Code CLI sessions vs desktop/desktop#142

Merged
jinhongkuan merged 3 commits into dev from feat/108-v0-userflow-e2e-ci
May 1, 2026

Conversation

@jinhongkuan

Contributor

Summary

Adds `v0 user flow e2e` — a CI workflow that drives a real Claude Code CLI session per spec flow with `bicameral-mcp` registered as the only MCP server, and asserts on the stream-json transcript that the right MCP tools were called with the right shapes.

This is the canonical user-experience test for `BicameralAI/bicameral#108`. It complements the handler-replay simulation that landed in #139 (`scripts/sim_issue_108_flows.py`).
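A minimal sketch of how tool calls might be pulled out of a stream-json transcript, assuming each line is one JSON event and that assistant events carry `tool_use` content blocks under `message.content`. The function name `extract_tool_uses` and the sample tool name `mcp__bicameral__ingest` are hypothetical and not taken from `run_e2e_flows.py`.

```python
import json
from typing import Any


def extract_tool_uses(transcript: str) -> list[dict[str, Any]]:
    """Collect every tool_use content block from a stream-json transcript."""
    calls: list[dict[str, Any]] = []
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise lines in the capture
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        message = event.get("message") or {}
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                calls.append(
                    {"name": block.get("name", ""), "input": block.get("input") or {}}
                )
    return calls


# Hypothetical two-event transcript: one init event, one assistant turn.
sample = "\n".join(
    [
        json.dumps({"type": "system", "subtype": "init"}),
        json.dumps(
            {
                "type": "assistant",
                "message": {
                    "content": [
                        {"type": "text", "text": "Recording decisions."},
                        {
                            "type": "tool_use",
                            "id": "t1",
                            "name": "mcp__bicameral__ingest",  # assumed naming
                            "input": {"payload": {"decisions": [{"title": "x"}]}},
                        },
                    ]
                },
            }
        ),
    ]
)
calls = extract_tool_uses(sample)
print([c["name"] for c in calls])
```

Matching on the extracted `name` and `input` keeps the per-flow assertions independent of how the CLI frames the rest of the stream.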

Why two tests?

The handler-replay sim imports handler functions directly and calls them. Fast, no API spend, useful for iterating on handler logic. But it bypasses three layers a real user exercises:

  • MCP protocol — JSON-RPC over stdio, tool schema marshalling
  • Skill files — `.claude/skills/bicameral-*/SKILL.md` trigger matching, auto-chains
  • Caller LLM — natural-language → tool-call sequencing

This e2e suite exercises all three. The two tests together form the spec's two-level validation: handler invariants (replay sim) + user-experience contract (this directory).

Test fixture: github.com/desktop/desktop

Pinned commit `e6c50fb028171e9cec03594273c8116bb135847e`. Real-world ingest content from `docs/process/roadmap.md`; bind target is the `CherryPickResult` enum in `app/src/lib/git/cherry-pick.ts` (a stable, slow-changing public type that genuinely corresponds to the cherry-pick roadmap item — bind is meaningful, not arbitrary).

What's in the PR

| File | Notes |
| --- | --- |
| `tests/e2e/run_e2e_flows.py` | Orchestrator: per flow, invokes `claude -p` with the prompt + bicameral MCP config, captures stream-json, asserts on tool-use blocks |
| `tests/e2e/bicameral.mcp.json` | MCP config: registers `bicameral-mcp` with `SURREAL_URL=memory://` so each flow gets a fresh ledger |
| `tests/e2e/prompts/flow-{1..5}-*.md` | Five natural-language user prompts, one per flow |
| `tests/e2e/README.md` | Local-run instructions + per-flow contract + design rationale |
| `.github/workflows/v0-user-flow-e2e.yml` | CI workflow: `production` env (for `CLAUDE_CODE_OAUTH_TOKEN`), pinned desktop/desktop commit, transcript artifact upload |

Per-flow contract (asserted by the orchestrator)

| Flow | What's asserted |
| --- | --- |
| 1: record decisions | `bicameral.ingest` called with `mappings` (≥1) |
| 2: preflight | `bicameral.preflight` called with `file_paths` containing `cherry-pick.ts` |
| 3: commit→reflected | `bicameral.link_commit` + `bicameral.resolve_compliance` both called; resolve_compliance carries `verdicts` |
| 4: session-end | `bicameral.ingest` called with `source='agent_session'` (top-level or per-mapping `span.source_type`) |
| 5: history | `bicameral.history` called; seed pre-conditions (`ingest` + `ratify`) present |
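A rough sketch of what two of these checks (Flows 2 and 3) could look like, assuming tool calls have already been reduced to dicts with `name` and `input` keys. The helper names `unwrap`, `calls_to`, `check_flow_2`, and `check_flow_3` are hypothetical, as is matching tools by name suffix. The `unwrap` helper tolerates the `payload` wrapping that a follow-up commit in this thread reports for tool inputs.

```python
from typing import Any

ToolUse = dict[str, Any]


def unwrap(tool_input: dict[str, Any]) -> dict[str, Any]:
    """Return the nested 'payload' dict when present, else the input itself."""
    payload = tool_input.get("payload")
    return payload if isinstance(payload, dict) else tool_input


def calls_to(tool_uses: list[ToolUse], tool: str) -> list[ToolUse]:
    """Select calls whose tool name ends with the given suffix."""
    return [c for c in tool_uses if c["name"].endswith(tool)]


def check_flow_2(tool_uses: list[ToolUse]) -> tuple[bool, str]:
    """Flow 2: preflight must be called with cherry-pick.ts among file_paths."""
    for call in calls_to(tool_uses, "preflight"):
        paths = unwrap(call["input"]).get("file_paths") or []
        if any("cherry-pick.ts" in str(p) for p in paths):
            return True, "ok"
    return False, "no preflight call with cherry-pick.ts in file_paths"


def check_flow_3(tool_uses: list[ToolUse]) -> tuple[bool, str]:
    """Flow 3: link_commit and resolve_compliance both called, with verdicts."""
    if not calls_to(tool_uses, "link_commit"):
        return False, "link_commit not called"
    for call in calls_to(tool_uses, "resolve_compliance"):
        if unwrap(call["input"]).get("verdicts"):
            return True, "ok"
    return False, "resolve_compliance not called with verdicts"


# Hypothetical tool-call list; the tool names and inputs are illustrative only.
example_calls: list[ToolUse] = [
    {
        "name": "mcp__bicameral__preflight",
        "input": {"file_paths": ["app/src/lib/git/cherry-pick.ts"]},
    },
    {"name": "mcp__bicameral__link_commit", "input": {}},
]
print(check_flow_2(example_calls))
print(check_flow_3(example_calls))
```

Returning a `(passed, reason)` pair rather than raising lets an orchestrator run all five flows and report every failure in one pass.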

CI shape

```yaml
environment: production   # CLAUDE_CODE_OAUTH_TOKEN
on:
  pull_request:
    paths: [tests/e2e/**, handlers/**, ledger/**, contracts.py, skills/bicameral-**, ...]
  workflow_dispatch:      # manual debug trigger
```

Triggers on PRs touching paths whose behaviour the e2e validates. `workflow_dispatch` lets maintainers re-run on demand without a code change.

Test plan

```bash
# Locally (after npm i -g @anthropic-ai/claude-code + claude auth):
cd pilot/mcp
git clone --depth=1 https://github.com/desktop/desktop /tmp/desktop-clone
(cd /tmp/desktop-clone && git checkout -b main)
DESKTOP_REPO_PATH=/tmp/desktop-clone python tests/e2e/run_e2e_flows.py
```

Expected: `Overall: PASS` after ~3-5 minutes (5 Claude sessions). Cost ~$0.50-$2.00 per run.

In CI: this PR will trigger the workflow on itself once the workflow file lands on `dev`. (GitHub workflows execute the version on the base branch, so the very first run is on the next qualifying PR after merge.)

Risk

L2 — new CI workflow, real API spend per run. Mitigations:

  • `--max-budget-usd 2.0` per flow caps cost (worst case ~$10/run, typical ~$1)
  • `--strict-mcp-config` ensures only bicameral-mcp is loaded (no other MCP servers leak in from runner)
  • Bash tool intentionally NOT in `--allowed-tools` (only `mcp__bicameral,Read,Grep`) — bicameral skills shouldn't need shell
  • Transcripts uploaded as artifacts on every run (30-day retention) for forensics
  • Pinned desktop/desktop commit — fixture won't drift mid-CI
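The flag-based mitigations above could be assembled into a command line roughly as follows. This is a sketch under assumptions: `build_claude_argv` is a hypothetical helper, and `--output-format stream-json`, `--verbose`, and `--mcp-config` are not named in this PR description, so check them against the installed Claude Code CLI before relying on them.

```python
from pathlib import Path


def build_claude_argv(prompt: str, mcp_config: Path, budget_usd: float = 2.0) -> list[str]:
    """Build the argv for one flow; nothing is executed here."""
    return [
        "claude",
        "-p", prompt,                        # non-interactive print mode
        "--output-format", "stream-json",    # assumed flag for the transcript format
        "--verbose",                         # assumed to be needed alongside stream-json
        "--mcp-config", str(mcp_config),     # assumed flag for the MCP config path
        "--strict-mcp-config",               # only servers from the config are loaded
        "--allowed-tools", "mcp__bicameral,Read,Grep",  # Bash deliberately absent
        "--max-budget-usd", str(budget_usd),            # per-flow cost cap
    ]


argv = build_claude_argv("Record these decisions.", Path("test-results/e2e/bicameral.mcp.json"))
print(" ".join(argv))
```

Passing the result to `subprocess.run` as a list (rather than a shell string) avoids quoting problems with multi-line prompts.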

Out of scope (deferred)

  • Caching `@anthropic-ai/claude-code` install across runs (small win; npm i is ~10s)
  • Splitting flows into separate jobs for parallelism (would 5x cost but cut wall time; defer until we have data on flake rate)
  • Cross-flow state (each flow uses fresh `memory://` ledger; flows that need history seed it inline in their prompt)

Refs #108. Follow-on to PRs #138 (#135 dashboard tooltip) and #139 (#108 handler-replay sim + skill correction).

🤖 Generated with Claude Code

Drives a real Claude Code CLI session per spec flow with bicameral-mcp
registered as the only MCP server, and asserts on the stream-json
transcript that the right MCP tools were called with the right shapes.

Why a separate test from scripts/sim_issue_108_flows.py:
  The handler-replay sim imports handler functions directly. It validates
  server invariants (status projection, signoff transitions, ephemeral
  detection) but bypasses three layers a real user exercises:
    - MCP protocol marshalling (JSON-RPC over stdio)
    - Skill files (.claude/skills/bicameral-*/SKILL.md trigger matching,
      auto-chains: preflight → capture-corrections → context-sentry →
      ingest → judge_gaps)
    - Caller LLM tool sequencing from natural language

This e2e covers all three. The two tests are complementary: handler-replay
for fast local dev iteration on handler logic, e2e for the user-experience
contract.

Test fixture:
  Pinned commit of github.com/desktop/desktop (e6c50fb…). Real-world ingest
  content from docs/process/roadmap.md; bind target is the CherryPickResult
  enum in app/src/lib/git/cherry-pick.ts (a stable, slow-changing public
  type that genuinely corresponds to the cherry-pick roadmap item).

CI shape:
  - environment: production (provides CLAUDE_CODE_OAUTH_TOKEN)
  - Triggers on PRs touching tests/e2e/**, handlers/**, ledger/**,
    contracts.py, skills/bicameral-**, server.py, pyproject.toml, or the
    workflow itself
  - Installs Claude Code CLI (npm) + bicameral-mcp (pip -e .)
  - Clones desktop/desktop at the pinned commit, stamps a real 'main' branch
    so feature-branch tests work (clone is otherwise detached HEAD)
  - Probes CLAUDE_CODE_OAUTH_TOKEN visibility without leaking it
  - Runs all five flows in a single Python orchestrator
  - Uploads stream-json transcripts (30-day retention) for failure forensics

Per-flow contract:
  Flow 1 — bicameral.ingest called with mappings (≥1)
  Flow 2 — bicameral.preflight called with file_paths containing
           cherry-pick.ts
  Flow 3 — bicameral.link_commit + bicameral.resolve_compliance both
           called; resolve_compliance carries verdicts
  Flow 4 — bicameral.ingest called with source='agent_session' (top-level
           or per-mapping span.source_type)
  Flow 5 — bicameral.history called, with seed ingest + ratify pre-conditions

Cost: ~$0.50–$2.00 per CI run (each flow capped at --max-budget-usd 2.0).

Refs #108. Complementary to the handler-replay sim shipped in PR #139.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 30, 2026


Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5d7fd3c5-1dee-497a-b3d6-75053a4f98a6

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@github-actions

github-actions Bot commented Apr 30, 2026


Bicameral drift report — skipped

No bicameral/decisions.yaml found in repo root. Drift report is skipped for this PR.

To enable: add a bicameral/decisions.yaml manifest. See setup guide (link to be added when manifest spec ships).

…TH for MCP

First CI run (workflow_dispatch on 25195231091) surfaced three issues — all
in the test infrastructure, not the implementation. Fixing them.

1. assert_flow_1, assert_flow_4 — the bicameral.ingest tool wraps its input
   in a 'payload' key (matching the IngestPayload contract), and the
   skill-side spelling for the items array is 'decisions', not 'mappings'.
   The asserters were looking at input.mappings and input.source — both
   absent. Now they look at input.payload.{decisions|mappings} and
   input.payload.source. Verified against transcripts:
     Flow 1 — payload.decisions=[…]  → was reported as "no mappings"
     Flow 4 — payload.source='agent_session' → was reported as "top_source=''"
   Also extended the resolve_compliance asserter (Flow 3) for the same
   payload-wrapping pattern.

2. bicameral.mcp.json — the env block lacked REPO_PATH, so the spawned
   bicameral-mcp server fell back to '.' (claude CLI's cwd, which is
   pilot/mcp/, not the desktop/desktop clone). bind couldn't find
   app/src/lib/git/cherry-pick.ts and Flow 3's chain aborted at bind
   instead of progressing to link_commit + resolve_compliance.

   Fix: template the config with ${DESKTOP_REPO_PATH}, materialize at
   orchestrator runtime by substituting the env-var value, write a
   runtime copy under test-results/e2e/. Works locally + in CI without
   committing a CI-specific path. (MCP env-merge vs env-replace
   behaviour is implementation-defined across Claude Code versions, so
   passing REPO_PATH explicitly via the config is more robust than
   relying on parent-process env propagation.)
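   The templating fix might look roughly like the sketch below. It is an
   illustration under assumptions, not the orchestrator's actual code: the
   helper name materialize_mcp_config, the "mcpServers"/"bicameral" layout of
   the sample config, and the use of string.Template are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from string import Template


def materialize_mcp_config(template_path: Path, out_dir: Path, env: dict[str, str]) -> Path:
    """Substitute ${DESKTOP_REPO_PATH} and write a runtime copy of the config."""
    repo_path = env.get("DESKTOP_REPO_PATH")
    if not repo_path:
        raise RuntimeError("DESKTOP_REPO_PATH must be set")
    template = Template(template_path.read_text(encoding="utf-8"))
    # safe_substitute leaves any other $-placeholders untouched.
    rendered = template.safe_substitute(DESKTOP_REPO_PATH=repo_path)
    json.loads(rendered)  # fail fast if substitution produced invalid JSON
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / template_path.name
    out_path.write_text(rendered, encoding="utf-8")
    return out_path


# Demo against a throwaway directory; the config layout is an assumption.
with tempfile.TemporaryDirectory() as tmp:
    template_file = Path(tmp) / "bicameral.mcp.json"
    template_file.write_text(
        json.dumps(
            {
                "mcpServers": {
                    "bicameral": {
                        "command": "bicameral-mcp",
                        "env": {
                            "SURREAL_URL": "memory://",
                            "REPO_PATH": "${DESKTOP_REPO_PATH}",
                        },
                    }
                }
            }
        ),
        encoding="utf-8",
    )
    out = materialize_mcp_config(
        template_file,
        Path(tmp) / "test-results" / "e2e",
        {"DESKTOP_REPO_PATH": "/tmp/desktop-clone"},
    )
    config = json.loads(out.read_text(encoding="utf-8"))

print(config["mcpServers"]["bicameral"]["env"]["REPO_PATH"])
```

   Note that a path containing backslashes or quotes would need JSON escaping
   before substitution; the sketch only validates the result.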

Refs #108. First-iteration validation of PR #142's e2e harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ruff lint complained:
  UP035 Import from `collections.abc` instead: `Callable`

Trivial fix: move ``Callable`` import from ``typing`` to ``collections.abc``
(PEP 585 modernization). Also re-format triggered by the import-order shift.
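
For illustration, the shape of the change is sketched below. The alias name
Asserter and the registry are hypothetical and only show the import in use.

```python
from collections.abc import Callable  # instead of: from typing import Callable

# Hypothetical alias for a per-flow asserter: takes tool calls, returns (passed, reason).
Asserter = Callable[[list[dict]], tuple[bool, str]]


def always_pass(tool_uses: list[dict]) -> tuple[bool, str]:
    """Placeholder asserter used only to show the annotation."""
    return True, "ok"


ASSERTERS: dict[str, Asserter] = {"flow-1": always_pass}
print(ASSERTERS["flow-1"]([]))
```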

Verified locally:
  python3 -m ruff check tests/e2e/run_e2e_flows.py        → All checks passed!
  python3 -m ruff format --check tests/e2e/run_e2e_flows.py → 1 file already formatted

Unblocks PR #142 merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jinhongkuan jinhongkuan merged commit 489c0fd into dev May 1, 2026
6 of 7 checks passed
jinhongkuan added a commit that referenced this pull request May 1, 2026
jinhongkuan added a commit that referenced this pull request May 1, 2026
…on slide

Adds an opt-in workflow_dispatch path that records a single split-screen
demo session, then post-splits it into pm.mp4 (PM persona, two chapters
joined by an ffmpeg-generated transition slide) and dev.mp4 (Dev persona).

Plan: thoughts/shared/plans/2026-04-30-v0-userflow-demo-recording.md
(Predecessor: PR #142, the assertion-only e2e on the same workflow file.)

Why one continuous claude session instead of two persona sessions: the
e2e config uses SURREAL_URL=memory://, so each MCP process is a fresh
ledger. A single session is what makes Scene 3 (PM post-impl) show the
SSE events from Scene 2 (Dev) authentically — same dashboard, same
state, no re-hydration.

Scene boundaries are detected from the stream-json tool-call timeline
(no LLM-emitted sentinels):
  Scene 1 → 2 = first bicameral.preflight call
  Scene 2 → 3 = first bicameral.history call after any link_commit
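
The boundary rule above can be sketched as a single pass over the ordered
tool names. This is an illustration only: scene_boundaries is a hypothetical
name, tools are matched by name suffix, and positions in the list stand in
for the timestamps a real timeline would carry.

```python
from typing import Optional


def scene_boundaries(tool_names: list[str]) -> tuple[Optional[int], Optional[int]]:
    """Return (index of first preflight, index of first history after a link_commit)."""
    first_preflight: Optional[int] = None
    history_after_link: Optional[int] = None
    seen_link_commit = False
    for i, name in enumerate(tool_names):
        if first_preflight is None and name.endswith("preflight"):
            first_preflight = i
        if name.endswith("link_commit"):
            seen_link_commit = True
        elif seen_link_commit and history_after_link is None and name.endswith("history"):
            history_after_link = i
    return first_preflight, history_after_link


# Hypothetical ordering of calls within one recorded session.
timeline = [
    "bicameral.ingest",
    "bicameral.history",      # before any link_commit, so not a boundary
    "bicameral.preflight",
    "bicameral.link_commit",
    "bicameral.resolve_compliance",
    "bicameral.history",
]
print(scene_boundaries(timeline))  # (2, 5)
```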

Recording step is `continue-on-error: true` — assertion-only step
remains the sole authority on workflow conclusion.

MP4s are .gitignored and excluded from the wheel; they live in the
v0-user-flow-e2e-demos GitHub artifact (90-day retention).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>