BicameralAI · jinhongkuan · May 1, 2026 · Apr 30, 2026 · May 1, 2026 · May 1, 2026
@@ -0,0 +1,110 @@
+name: v0 user flow e2e
+
+# End-to-end validation of BicameralAI/bicameral#108's six canonical user
+# flows via real Claude Code CLI sessions with bicameral-mcp registered.
+# See tests/e2e/README.md for the design.
+#
+# Note: pull_request runs use the workflow definition from the PR's merge
+# commit, so this file can run on the PR that adds it. The
+# workflow_dispatch trigger only appears once the file is on the default
+# branch, and PRs from forks do not receive the environment's secrets.
+
+on:
+  pull_request:
+    branches: [main, dev]
+    paths:
+      - 'tests/e2e/**'
+      - 'handlers/**'
+      - 'ledger/**'
+      - 'contracts.py'
+      - 'skills/bicameral-**'
+      - 'server.py'
+      - 'pyproject.toml'
+      - '.github/workflows/v0-user-flow-e2e.yml'
+  workflow_dispatch:  # allow manual trigger for debugging
+
+env:
+  PYTHON_VERSION: '3.11'
+  NODE_VERSION: '20'
+  # Pinned commit of github.com/desktop/desktop. Bump when the roadmap.md
+  # shape drifts in ways that break prompts, or when bind targets change.
+  DESKTOP_PINNED_COMMIT: 'e6c50fb028171e9cec03594273c8116bb135847e'
+
+jobs:
+  v0-user-flow-e2e:
+    name: v0 User Flow E2E (Claude Code CLI session)
+    runs-on: ubuntu-latest
+    # production environment provides CLAUDE_CODE_OAUTH_TOKEN for the
+    # Claude Code CLI sessions.
+    environment: production
+    timeout-minutes: 25
+    env:
+      DESKTOP_REPO_PATH: /tmp/desktop-clone
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Setup Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ env.PYTHON_VERSION }}
+
+      - name: Setup Node.js (for Claude Code CLI)
+        uses: actions/setup-node@v4
+        with:
+          node-version: ${{ env.NODE_VERSION }}
+
+      - name: Install bicameral-mcp + test deps
+        run: pip install -e ".[test]"
+
+      - name: Install Claude Code CLI
+        run: npm install -g @anthropic-ai/claude-code
+
+      - name: Verify CLI tooling on PATH
+        run: |
+          which claude && claude --version
+          which bicameral-mcp
+
+      # ── Test fixture: github.com/desktop/desktop at a pinned commit ─
+      - name: Clone desktop/desktop at pinned commit
+        run: |
+          mkdir -p "${DESKTOP_REPO_PATH}"
+          cd "${DESKTOP_REPO_PATH}"
+          git init -q
+          git remote add origin https://github.com/desktop/desktop
+          git fetch --depth 1 origin "${DESKTOP_PINNED_COMMIT}"
+          git checkout FETCH_HEAD
+          # Stamp a real 'main' branch so flows that branch off it work
+          git checkout -b main
+          git config user.email ci@bicameral.test
+          git config user.name CI
+          # Sanity: required files present
+          test -f docs/process/roadmap.md
+          test -f app/src/lib/git/cherry-pick.ts
+
+      # ── Diagnostic probe: confirm OAuth token is non-empty without leaking it ─
+      - name: Claude Code OAuth token visibility probe
+        run: |
+          set +e
+          if [ -n "${CLAUDE_CODE_OAUTH_TOKEN}" ]; then
+            echo "CLAUDE_CODE_OAUTH_TOKEN: present (length=${#CLAUDE_CODE_OAUTH_TOKEN})"
+          else
+            echo "CLAUDE_CODE_OAUTH_TOKEN: EMPTY or UNSET"
+            echo "  secret expression non-empty: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN != '' }}"
+            exit 1
+          fi
+        env:
+          CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
+
+      # ── Drive the five flows through Claude Code CLI sessions ─
+      - name: Run v0 user flow e2e
+        env:
+          CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
+        run: python tests/e2e/run_e2e_flows.py
+
+      # ── Forensics: keep transcripts even on failure ─
+      - name: Upload e2e transcripts
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: v0-user-flow-e2e-transcripts
+          path: test-results/e2e/
+          retention-days: 30
@@ -0,0 +1,104 @@
+# v0 user flow e2e
+
+End-to-end validation of `BicameralAI/bicameral#108`'s six canonical user
+flows, driven by **real Claude Code CLI sessions** with `bicameral-mcp`
+registered as an MCP server. Test fixture: a pinned commit of
+`github.com/desktop/desktop`, with `docs/process/roadmap.md` as ingest
+content.
+
+This is the canonical CI test for the spec. The handler-replay simulation
+at `scripts/sim_issue_108_flows.py` complements it for fast local iteration
+on handler logic without burning Claude API calls.
+
+## What it tests
+
+Each flow corresponds to a section of the [bicameral#108 spec](https://github.com/BicameralAI/bicameral/issues/108):
+
+| Flow | Spec section | Asserts |
+|---|---|---|
+| 1 | Record decisions from a meeting | `bicameral.ingest` called with mappings |
+| 2 | Begin to write code (preflight) | `bicameral.preflight` called with `file_paths` |
+| 3 | Commit code → reflected | `bicameral.link_commit` + `bicameral.resolve_compliance` (with verdicts) |
+| 4 | End coding session | `bicameral.ingest` called with `source="agent_session"` |
+| 5 | Review what's been tracked | `bicameral.history` called (with seed ingest + ratify) |
+
+Each flow is a separate `claude -p` invocation with a fresh `memory://`
+ledger. Within a session, prompts may chain multiple tool calls — the
+asserter walks the entire stream-json transcript.
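
The transcript walk described above can be sketched as follows. This is a minimal illustration, not the code in `run_e2e_flows.py`: the helper names `tool_calls` and `assert_tool_called` are hypothetical, the tool name `mcp__bicameral__preflight` is an assumed MCP-prefixed form of `bicameral.preflight`, and the sample events only approximate the shape of a stream-json transcript (one JSON object per line, with `tool_use` blocks inside assistant messages).

```python
import json


def tool_calls(ndjson_text):
    """Yield (name, input) for every tool_use block in a stream-json transcript."""
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Only assistant events carry tool_use blocks in this sketch.
        if event.get("type") != "assistant":
            continue
        for block in event.get("message", {}).get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_use":
                yield block.get("name", ""), block.get("input", {})


def assert_tool_called(ndjson_text, suffix, required_key=None):
    """Assert that some tool whose name ends in `suffix` was called.

    If `required_key` is given, at least one matching call must include
    that key in its input. Returns the inputs of the matching calls.
    """
    matches = [inp for name, inp in tool_calls(ndjson_text) if name.endswith(suffix)]
    assert matches, f"no tool call ending in {suffix!r}"
    if required_key is not None:
        assert any(required_key in inp for inp in matches), (
            f"{suffix!r} was called, but never with {required_key!r}"
        )
    return matches


# Hypothetical transcript: the tool name and event fields are assumptions.
SAMPLE = "\n".join([
    json.dumps({"type": "system", "subtype": "init"}),
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "text", "text": "Running preflight."},
        {"type": "tool_use", "id": "t1", "name": "mcp__bicameral__preflight",
         "input": {"file_paths": ["app/src/lib/git/cherry-pick.ts"]}},
    ]}}),
    json.dumps({"type": "result", "subtype": "success"}),
])

if __name__ == "__main__":
    calls = assert_tool_called(SAMPLE, "preflight", required_key="file_paths")
    print(len(calls), "matching call(s)")
```

Matching on a name suffix keeps the check independent of how the MCP server prefix is rendered, which is one way to make per-flow assertions tolerant of naming changes.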
+
+## How it works
+
+```
+prompts/flow-N-*.md  →  claude -p  →  stream-json transcript  →  assert
+                          │
+                          ├─ --mcp-config bicameral.mcp.json  (registers bicameral-mcp)
+                          ├─ --strict-mcp-config              (no other MCP servers loaded)
+                          ├─ --allowed-tools mcp__bicameral Read Grep
+                          ├─ --add-dir <desktop_clone>        (skill Read access)
+                          └─ --output-format stream-json --verbose
+```
+
+`run_e2e_flows.py` orchestrates all five flows, captures transcripts to
+`test-results/e2e/flow-N.ndjson`, and asserts on the tool-use blocks.
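
As a rough sketch of the invocation in the diagram above, the argument list for one flow might be assembled like this. The flags are the ones the README lists, plus `--max-budget-usd` from the cost note. The function name `build_claude_argv`, the flag ordering, and the example prompt are assumptions rather than the runner's actual code.

```python
import shlex


def build_claude_argv(prompt, mcp_config, desktop_clone, max_budget_usd=2.0):
    """Assemble the argv for one `claude -p` flow (hypothetical helper)."""
    return [
        "claude", "-p", prompt,
        "--mcp-config", mcp_config,           # registers bicameral-mcp
        "--strict-mcp-config",                # no other MCP servers loaded
        "--allowed-tools", "mcp__bicameral", "Read", "Grep",
        "--add-dir", desktop_clone,           # skill Read access
        "--output-format", "stream-json", "--verbose",
        "--max-budget-usd", str(max_budget_usd),
    ]


if __name__ == "__main__":
    argv = build_claude_argv(
        prompt="Please run a preflight check against app/src/lib/git/cherry-pick.ts",
        mcp_config="bicameral.mcp.json",
        desktop_clone="/tmp/desktop-clone",
    )
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(argv))
```

Building a list rather than a shell string avoids quoting problems when the prompt contains backticks or quotes, as the flow prompts do.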
+
+## Running locally
+
+```bash
+# 1. Install bicameral-mcp + Claude Code CLI
+cd pilot/mcp
+pip install -e ".[test]"
+npm install -g @anthropic-ai/claude-code
+
+# 2. Authenticate Claude Code CLI (interactive — once)
+claude auth
+
+# 3. Clone the test fixture
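+# (same commit that CI pins via DESKTOP_PINNED_COMMIT)
+git init -q /tmp/desktop-clone && cd /tmp/desktop-clone
+git remote add origin https://github.com/desktop/desktop
+git fetch --depth 1 origin e6c50fb028171e9cec03594273c8116bb135847e
+git checkout -q FETCH_HEAD && git checkout -b main && cd -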
+
+# 4. Run all five flows
+DESKTOP_REPO_PATH=/tmp/desktop-clone python tests/e2e/run_e2e_flows.py
+```
+
+Cost per run: ~$0.50–$2.00 across all five flows, depending on how many
+tool calls the model makes in each session. Each flow is capped by
+`--max-budget-usd 2.0`, so the worst case for a full run is about $10.00.
+
+## CI
+
+GitHub Actions workflow: `.github/workflows/v0-user-flow-e2e.yml`.
+
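+- Triggers on PRs targeting `main` or `dev` that touch `tests/e2e/**`,
+  `handlers/**`, `ledger/**`, `contracts.py`, `skills/bicameral-**`,
+  `server.py`, `pyproject.toml`, or the workflow itself; it can also be
+  started manually via `workflow_dispatch`.
+- Runs in the `production` GitHub environment, which supplies
+  `CLAUDE_CODE_OAUTH_TOKEN`.
+- Pins the `desktop/desktop` commit through `DESKTOP_PINNED_COMMIT` in the
+  workflow file (edit that env var to bump it).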
+- Uploads `test-results/e2e/*.ndjson` as job artifacts (30-day retention)
+  for failure forensics.
+
+## Updating
+
+When the spec changes, update both:
+
+1. The relevant `prompts/flow-N-*.md` (natural-language user prompt)
+2. The matching `assert_flow_N` in `run_e2e_flows.py`
+
+When `desktop/desktop`'s `roadmap.md` or `cherry-pick.ts` shape drifts in
+ways that break the prompts or bind targets, bump the pinned commit in
+the workflow and adjust the prompts (for example, the line range that
+flow 3 binds to).
+
+## Why not handler-replay only?
+
+The handler-replay sim (`scripts/sim_issue_108_flows.py`) directly imports
+handler functions and calls them. It's fast and useful for iterating on
+handler logic, but it bypasses three layers we need to validate:
+
+- **MCP protocol** — JSON-RPC over stdio, tool schema marshalling
+- **Skill files** — `.claude/skills/bicameral-*/SKILL.md` parsing, trigger
+  matching, prompt construction
+- **Caller LLM** — natural-language → tool-call sequencing, auto-chains
+  (preflight → capture-corrections → context-sentry → ingest → judge_gaps)
+
+This e2e suite covers all three. Together, the replay sim and this suite
+form the spec's two-level validation: handler invariants (replay sim) +
+user-experience contract (this directory).
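
To make the "MCP protocol" layer above concrete, here is a small sketch of what a single tool call looks like on the wire. It assumes the standard MCP stdio framing (one JSON-RPC 2.0 message per line) and the `tools/call` method. The helper name `frame_tool_call` is hypothetical, and the tool name and arguments are taken from the flow table rather than from the server's actual schema.

```python
import json


def frame_tool_call(request_id, tool_name, arguments):
    """Serialize one MCP tools/call request as a newline-delimited JSON-RPC frame."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # Compact separators keep the frame on a single line; json.dumps escapes
    # any newlines inside string values, so the only raw "\n" is the delimiter.
    return json.dumps(message, separators=(",", ":")) + "\n"


if __name__ == "__main__":
    frame = frame_tool_call(
        1,
        "bicameral.preflight",  # name as written in the README table
        {"file_paths": ["app/src/lib/git/cherry-pick.ts"]},
    )
    print(frame, end="")
```

The handler-replay sim never produces or parses frames like this, which is why schema marshalling errors can only surface in the e2e suite.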
@@ -0,0 +1,12 @@
+{
+  "mcpServers": {
+    "bicameral": {
+      "command": "bicameral-mcp",
+      "args": [],
+      "env": {
+        "SURREAL_URL": "memory://",
+        "REPO_PATH": "${DESKTOP_REPO_PATH}"
+      }
+    }
+  }
+}
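
The config above depends on `${DESKTOP_REPO_PATH}` being expanded from the environment before `bicameral-mcp` starts. The sketch below shows one way such placeholders could be resolved, useful for checking locally what `REPO_PATH` the server would see. It does not reproduce Claude Code's own loader: the helper names are hypothetical, and support for the `${VAR:-default}` form is an assumption of this sketch.

```python
import json
import re

# Matches ${NAME} and ${NAME:-default}.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand(value, environ):
    """Replace ${NAME} placeholders in `value` using the `environ` mapping."""
    def repl(match):
        name, default = match.group(1), match.group(2)
        if name in environ:
            return environ[name]
        if default is not None:
            return default
        raise KeyError(f"environment variable {name} is not set")
    return _VAR.sub(repl, value)


def resolve_server_env(config_text, server, environ):
    """Return the expanded `env` block for one server in an MCP config."""
    config = json.loads(config_text)
    env = config["mcpServers"][server].get("env", {})
    return {key: expand(val, environ) for key, val in env.items()}


CONFIG = """
{
  "mcpServers": {
    "bicameral": {
      "command": "bicameral-mcp",
      "args": [],
      "env": {
        "SURREAL_URL": "memory://",
        "REPO_PATH": "${DESKTOP_REPO_PATH}"
      }
    }
  }
}
"""

if __name__ == "__main__":
    print(resolve_server_env(CONFIG, "bicameral",
                             {"DESKTOP_REPO_PATH": "/tmp/desktop-clone"}))
```

Raising on an unset variable, instead of silently substituting an empty string, surfaces a missing `DESKTOP_REPO_PATH` before a flow spends any budget.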
@@ -0,0 +1,13 @@
+I just reviewed the GitHub Desktop roadmap and want to capture some of their recent feature decisions in bicameral so we can track them.
+
+Here are three roadmap items:
+
+1. **High signal notifications (2.9.10 and 3.0.0)** — Receive a notification when checks fail. Receive a notification when your pull request is reviewed.
+
+2. **Improved commit history (2.9.0)** — Reorder commits via drag/drop. Squash commits via drag/drop. Amend last commit. Create a branch from a previous commit.
+
+3. **Cherry-picking commits from one branch to another (2.7.1)** — Cherry-pick commits with a context menu and interactively.
+
+Please ingest these as decisions into the bicameral ledger. The source is `desktop/desktop:docs/process/roadmap.md`.
+
+After ingesting, briefly confirm what was captured (decision IDs and signoff state) so I know they landed.
@@ -0,0 +1,5 @@
+Before I refactor the cherry-pick logic in GitHub Desktop, I want to make sure I'm aware of any prior decisions or context that touch this code path.
+
+I'm specifically going to be modifying `app/src/lib/git/cherry-pick.ts`.
+
+Please run a preflight check against this file path and tell me what comes back — any bound decisions, unresolved collisions, or context-pending items I should know about before I start writing code.
@@ -0,0 +1,8 @@
+I just made a commit that touched `app/src/lib/git/cherry-pick.ts`. Please sync the bicameral ledger to reflect the new HEAD and resolve any pending compliance checks that surface for that file.
+
+Specifically:
+1. Call link_commit on HEAD to detect drift against any decisions bound to that file.
+2. For each pending compliance check that comes back, evaluate whether the current code semantically matches the decision and emit a verdict (compliant / drifted / not_relevant) via resolve_compliance. Use the file content as evidence.
+3. After resolving, summarize: how many decisions transitioned to reflected vs drifted vs stayed pending.
+
+Before you start, you'll need to set up a bound decision against `app/src/lib/git/cherry-pick.ts` so there's something to sync. Use this decision text: "Cherry-pick commits with a context menu and interactively (GitHub Desktop roadmap, version 2.7.1)". Bind it to the `CherryPickResult` enum at the top of that file (lines 31–60).
@@ -0,0 +1,7 @@
+We're wrapping up our coding session. Earlier in our conversation I mentioned a constraint that we never wrote down explicitly:
+
+> "The cherry-pick implementation should never require interactive prompts during conflict resolution — conflicts must always be resolvable through the visual conflict UI, not via stdin."
+
+That's a real constraint that affects implementation. Please capture it as a session-end correction and ingest it into the bicameral ledger using the `agent_session` source so we know it came from this conversation rather than a transcript or doc.
+
+After ingesting, confirm the decision_id and the signoff state.
@@ -0,0 +1,11 @@
+Show me the full decision history for this repo. Group decisions by feature area and for each one, surface BOTH axes:
+
+- **status** — code-compliance side: reflected | drifted | pending | ungrounded
+- **signoff.state** — human-approval side: proposed | ratified | rejected | superseded | collision_pending | context_pending
+
+Before you call history, ingest two seed decisions so the response isn't empty:
+
+1. "Reorder commits via drag/drop" (feature_group: Improved commit history) — leave at default proposed/ungrounded.
+2. "Native support for Apple silicon machines" (feature_group: Apple silicon) — ingest, then ratify it so it shows ratified × ungrounded in the readout.
+
+After history returns, render a brief table showing each decision's two axes so I can scan it.