110 changes: 110 additions & 0 deletions .github/workflows/v0-user-flow-e2e.yml
@@ -0,0 +1,110 @@
name: v0 user flow e2e

# End-to-end validation of BicameralAI/bicameral#108's six canonical user
# flows via real Claude Code CLI sessions with bicameral-mcp registered.
# See tests/e2e/README.md for the design.
#
# Note: pull_request workflows run the workflow file from the PR's merge
# commit, not the base branch, so this workflow can run on the PR that
# adds it. Only pull_request_target uses the base-branch version.

on:
  pull_request:
    branches: [main, dev]
    paths:
      - 'tests/e2e/**'
      - 'handlers/**'
      - 'ledger/**'
      - 'contracts.py'
      - 'skills/bicameral-**'
      - 'server.py'
      - 'pyproject.toml'
      - '.github/workflows/v0-user-flow-e2e.yml'
  workflow_dispatch:  # allow manual trigger for debugging

env:
  PYTHON_VERSION: '3.11'
  NODE_VERSION: '20'
  # Pinned commit of github.com/desktop/desktop. Bump when the roadmap.md
  # shape drifts in ways that break prompts, or when bind targets change.
  DESKTOP_PINNED_COMMIT: 'e6c50fb028171e9cec03594273c8116bb135847e'

jobs:
  v0-user-flow-e2e:
    name: v0 User Flow E2E (Claude Code CLI session)
    runs-on: ubuntu-latest
    # production environment provides CLAUDE_CODE_OAUTH_TOKEN for the
    # Claude Code CLI sessions.
    environment: production
    timeout-minutes: 25
    env:
      DESKTOP_REPO_PATH: /tmp/desktop-clone
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Setup Node.js (for Claude Code CLI)
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}

      - name: Install bicameral-mcp + test deps
        run: pip install -e ".[test]"

      - name: Install Claude Code CLI
        run: npm install -g @anthropic-ai/claude-code

      - name: Verify CLI tooling on PATH
        run: |
          which claude && claude --version
          which bicameral-mcp

      # ── Test fixture: github.com/desktop/desktop at a pinned commit ─
      - name: Clone desktop/desktop at pinned commit
        run: |
          mkdir -p ${{ env.DESKTOP_REPO_PATH }}
          cd ${{ env.DESKTOP_REPO_PATH }}
          git init -q
          git remote add origin https://github.com/desktop/desktop
          git fetch --depth 1 origin "${DESKTOP_PINNED_COMMIT}"
          git checkout FETCH_HEAD
          # Stamp a real 'main' branch so flows that branch off it work
          git checkout -b main
          git config user.email ci@bicameral.test
          git config user.name CI
          # Sanity: required files present
          test -f docs/process/roadmap.md
          test -f app/src/lib/git/cherry-pick.ts

      # ── Diagnostic probe: confirm OAuth token is non-empty without leaking it ─
      - name: Claude Code OAuth token visibility probe
        run: |
          set +e
          if [ -n "${CLAUDE_CODE_OAUTH_TOKEN}" ]; then
            echo "CLAUDE_CODE_OAUTH_TOKEN: present (length=${#CLAUDE_CODE_OAUTH_TOKEN})"
          else
            echo "CLAUDE_CODE_OAUTH_TOKEN: EMPTY or UNSET"
            echo "  secret expression non-empty: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN != '' }}"
            exit 1
          fi
        env:
          CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}

      # ── Drive the five flows through Claude Code CLI sessions ─
      - name: Run v0 user flow e2e
        env:
          CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
        run: python tests/e2e/run_e2e_flows.py

      # ── Forensics: keep transcripts even on failure ─
      - name: Upload e2e transcripts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: v0-user-flow-e2e-transcripts
          path: test-results/e2e/
          retention-days: 30
104 changes: 104 additions & 0 deletions tests/e2e/README.md
@@ -0,0 +1,104 @@
# v0 user flow e2e

End-to-end validation of `BicameralAI/bicameral#108`'s six canonical user
flows, driven by **real Claude Code CLI sessions** with `bicameral-mcp`
registered as an MCP server. Test fixture: a pinned commit of
`github.com/desktop/desktop`, with `docs/process/roadmap.md` as ingest
content.

This is the canonical CI test for the spec. The handler-replay simulation
at `scripts/sim_issue_108_flows.py` complements it for fast local iteration
on handler logic without burning Claude API calls.

## What it tests

Each flow corresponds to a section of the [bicameral#108 spec](https://github.com/BicameralAI/bicameral/issues/108):

| Flow | Spec section | Asserts |
|---|---|---|
| 1 | Record decisions from a meeting | `bicameral.ingest` called with mappings |
| 2 | Begin to write code (preflight) | `bicameral.preflight` called with `file_paths` |
| 3 | Commit code → reflected | `bicameral.link_commit` + `bicameral.resolve_compliance` (with verdicts) |
| 4 | End coding session | `bicameral.ingest` called with `source="agent_session"` |
| 5 | Review what's been tracked | `bicameral.history` called (with seed ingest + ratify) |

Each flow is a separate `claude -p` invocation with a fresh `memory://`
ledger. Within a session, prompts may chain multiple tool calls — the
asserter walks the entire stream-json transcript.
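
As a rough illustration of what walking the transcript involves, the sketch below pulls every tool-use block out of an NDJSON transcript. It is not the code in `run_e2e_flows.py`: the helper name `iter_tool_calls`, the sample events, and the tool name `mcp__bicameral__ingest` are hypothetical, and the event shape (assistant events carrying a `message.content` list of typed blocks) is an assumption about the stream-json output.

```python
import json
from typing import Iterator


def iter_tool_calls(ndjson_text: str) -> Iterator[tuple[str, dict]]:
    """Yield (tool_name, tool_input) for each tool_use block in a transcript.

    Assumes one JSON event per line, with tool calls appearing as
    {"type": "tool_use", ...} blocks inside assistant messages.
    """
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        if event.get("type") != "assistant":
            continue  # system, user, and result events carry no tool_use blocks
        for block in event.get("message", {}).get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_use":
                yield block.get("name", ""), block.get("input", {})


# Hypothetical two-event transcript, for illustration only.
sample = "\n".join([
    json.dumps({"type": "system", "subtype": "init"}),
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "text", "text": "Ingesting now."},
        {"type": "tool_use", "name": "mcp__bicameral__ingest",
         "input": {"source": "agent_session"}},
    ]}}),
])
calls = list(iter_tool_calls(sample))
print(calls)
```

Walking every assistant event, rather than only the first, is what lets a single assertion cover prompts that chain several tool calls.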

## How it works

```
prompts/flow-N-*.md → claude -p → stream-json transcript → assert
                      ├─ --mcp-config bicameral.mcp.json   (registers bicameral-mcp)
                      ├─ --strict-mcp-config               (no other MCP servers loaded)
                      ├─ --allowed-tools mcp__bicameral Read Grep
                      ├─ --add-dir <desktop_clone>         (skill Read access)
                      └─ --output-format stream-json --verbose
```

`run_e2e_flows.py` orchestrates all five flows, captures transcripts to
`test-results/e2e/flow-N.ndjson`, and asserts on the tool-use blocks.
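
For orientation, here is a minimal sketch of how one flow's command line could be assembled from the flags in the diagram above. The function name `build_claude_argv` and its parameters are hypothetical, and passing the prompt as a positional argument to `-p` is an assumption; the real orchestrator may feed the prompt differently.

```python
from pathlib import Path


def build_claude_argv(
    prompt: str,
    mcp_config: Path,
    desktop_clone: Path,
    max_budget_usd: float = 2.0,
) -> list[str]:
    """Build the argv for one `claude -p` session (hypothetical helper)."""
    return [
        "claude", "-p", prompt,
        "--mcp-config", str(mcp_config),      # registers bicameral-mcp
        "--strict-mcp-config",                # load no other MCP servers
        "--allowed-tools", "mcp__bicameral", "Read", "Grep",
        "--add-dir", str(desktop_clone),      # Read access to the fixture clone
        "--output-format", "stream-json", "--verbose",
        "--max-budget-usd", str(max_budget_usd),  # per-flow spend cap
    ]


argv = build_claude_argv(
    prompt="Please run a preflight check against this file path.",
    mcp_config=Path("tests/e2e/bicameral.mcp.json"),
    desktop_clone=Path("/tmp/desktop-clone"),
)
print(" ".join(argv))
```

The resulting list can be handed to `subprocess.run(argv, capture_output=True, text=True)`, with stdout written to `test-results/e2e/flow-N.ndjson`.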

## Running locally

```bash
# 1. Install bicameral-mcp + Claude Code CLI
cd pilot/mcp
pip install -e ".[test]"
npm install -g @anthropic-ai/claude-code

# 2. Authenticate Claude Code CLI (interactive — once)
claude auth

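# 3. Clone the test fixture
#    Note: this takes the default-branch HEAD; CI instead checks out the
#    commit pinned as DESKTOP_PINNED_COMMIT in the workflow file.
git clone --depth=1 https://github.com/desktop/desktop /tmp/desktop-clone
cd /tmp/desktop-clone && git checkout -b main && cd -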

# 4. Run all five flows
DESKTOP_REPO_PATH=/tmp/desktop-clone python tests/e2e/run_e2e_flows.py
```

Cost per run: ~$0.50–$2.00 across all five flows, depending on how much work
the LLM does in each session. Each flow is capped by `--max-budget-usd 2.0`,
so the worst case for a full run is $10.

## CI

GitHub Actions workflow: `.github/workflows/v0-user-flow-e2e.yml`.

- Triggers on PRs targeting `main` or `dev` that touch `tests/e2e/**`,
  `handlers/**`, `ledger/**`, `contracts.py`, `skills/bicameral-**`,
  `server.py`, `pyproject.toml`, or the workflow file itself. It can also be
  started manually via `workflow_dispatch`.
- Runs in the `production` GitHub environment, which provides the
  `CLAUDE_CODE_OAUTH_TOKEN` secret.
- The `desktop/desktop` commit is pinned in the workflow file (update it by
  editing the `DESKTOP_PINNED_COMMIT` env var).
- Uploads `test-results/e2e/` as a job artifact (30-day retention), even when
  the run fails, for failure forensics.

## Updating

When the spec changes, update both:

1. The relevant `prompts/flow-N-*.md` (natural-language user prompt)
2. The matching `assert_flow_N` in `run_e2e_flows.py`
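
A minimal sketch of what one such asserter might look like, using flow 4 from the table above. This is an assumption about shape, not the actual code in `run_e2e_flows.py`: it presumes the tool calls have already been extracted as `(name, input)` pairs, that the ingest tool's name ends in `ingest`, and that `source` is a top-level argument of the call.

```python
def assert_flow_4(tool_calls: list[tuple[str, dict]]) -> None:
    """Flow 4 passes if some ingest call carries source="agent_session"."""
    # Keep only the inputs of calls to the ingest tool (name suffix is assumed).
    ingests = [inp for name, inp in tool_calls if name.endswith("ingest")]
    assert ingests, "flow 4: no bicameral.ingest call found in transcript"

    sources = [inp.get("source") for inp in ingests]
    assert "agent_session" in sources, (
        f"flow 4: expected source='agent_session', got {sources}"
    )


# Hypothetical extracted calls, for illustration only.
assert_flow_4([
    ("mcp__bicameral__history", {}),
    ("mcp__bicameral__ingest", {"source": "agent_session"}),
])
print("flow 4 assertions passed")
```

Keeping each asserter a pure function over extracted tool calls means a spec change only touches the prompt file and this one function.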

When `desktop/desktop`'s `roadmap.md` or `cherry-pick.ts` shape drifts in
ways that break the prompts or bind targets, bump the pinned commit in
the workflow and adjust the prompts (flow 3, for instance, hardcodes a line
range in `cherry-pick.ts`).

## Why not handler-replay only?

The handler-replay sim (`scripts/sim_issue_108_flows.py`) directly imports
handler functions and calls them. It's fast and useful for iterating on
handler logic, but it bypasses three layers we need to validate:

- **MCP protocol** — JSON-RPC over stdio, tool schema marshalling
- **Skill files** — `.claude/skills/bicameral-*/SKILL.md` parsing, trigger
matching, prompt construction
- **Caller LLM** — natural-language → tool-call sequencing, auto-chains
(preflight → capture-corrections → context-sentry → ingest → judge_gaps)

This e2e suite covers all three. Together, the replay sim and this suite form
the spec's two-level validation: handler invariants (replay sim) and the
user-experience contract (this directory).
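
To make the first of those bypassed layers concrete, the sketch below frames a single MCP `tools/call` request the way the stdio transport carries it: one JSON-RPC 2.0 message per line. The helper name `frame_tool_call` is hypothetical, and the tool name and arguments are taken from the flow table purely for illustration; the handler-replay sim never produces or parses messages like this, which is why the e2e suite is needed.

```python
import json


def frame_tool_call(request_id: int, tool: str, arguments: dict) -> bytes:
    """Encode one MCP tools/call request as a newline-delimited JSON-RPC message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps escapes newlines inside strings, so the frame stays on one line.
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


framed = frame_tool_call(
    1,
    "bicameral.preflight",
    {"file_paths": ["app/src/lib/git/cherry-pick.ts"]},
)
print(framed.decode("utf-8"), end="")
```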
12 changes: 12 additions & 0 deletions tests/e2e/bicameral.mcp.json
@@ -0,0 +1,12 @@
{
  "mcpServers": {
    "bicameral": {
      "command": "bicameral-mcp",
      "args": [],
      "env": {
        "SURREAL_URL": "memory://",
        "REPO_PATH": "${DESKTOP_REPO_PATH}"
      }
    }
  }
}
13 changes: 13 additions & 0 deletions tests/e2e/prompts/flow-1-ingest.md
@@ -0,0 +1,13 @@
I just reviewed the GitHub Desktop roadmap and want to capture some of their recent feature decisions in bicameral so we can track them.

Here are three roadmap items:

1. **High signal notifications (2.9.10 and 3.0.0)** — Receive a notification when checks fail. Receive a notification when your pull request is reviewed.

2. **Improved commit history (2.9.0)** — Reorder commits via drag/drop. Squash commits via drag/drop. Amend last commit. Create a branch from a previous commit.

3. **Cherry-picking commits from one branch to another (2.7.1)** — Cherry-pick commits with a context menu and interactively.

Please ingest these as decisions into the bicameral ledger. The source is `desktop/desktop:docs/process/roadmap.md`.

After ingesting, briefly confirm what was captured (decision IDs and signoff state) so I know they landed.
5 changes: 5 additions & 0 deletions tests/e2e/prompts/flow-2-preflight.md
@@ -0,0 +1,5 @@
Before I refactor the cherry-pick logic in GitHub Desktop, I want to make sure I'm aware of any prior decisions or context that touch this code path.

I'm specifically going to be modifying `app/src/lib/git/cherry-pick.ts`.

Please run a preflight check against this file path and tell me what comes back — any bound decisions, unresolved collisions, or context-pending items I should know about before I start writing code.
8 changes: 8 additions & 0 deletions tests/e2e/prompts/flow-3-commit-sync.md
@@ -0,0 +1,8 @@
I just made a commit that touched `app/src/lib/git/cherry-pick.ts`. Please sync the bicameral ledger to reflect the new HEAD and resolve any pending compliance checks that surface for that file.

Specifically:
1. Call link_commit on HEAD to detect drift against any decisions bound to that file.
2. For each pending compliance check that comes back, evaluate whether the current code semantically matches the decision and emit a verdict (compliant / drifted / not_relevant) via resolve_compliance. Use the file content as evidence.
3. After resolving, summarize: how many decisions transitioned to reflected vs drifted vs stayed pending.

Before you start, you'll need to set up a bound decision against `app/src/lib/git/cherry-pick.ts` so there's something to sync. Use this decision text: "Cherry-pick commits with a context menu and interactively (GitHub Desktop roadmap, version 2.7.1)". Bind it to the `CherryPickResult` enum at the top of that file (lines 31–60).
7 changes: 7 additions & 0 deletions tests/e2e/prompts/flow-4-session-end.md
@@ -0,0 +1,7 @@
We're wrapping up our coding session. Earlier in our conversation I mentioned a constraint that we never wrote down explicitly:

> "The cherry-pick implementation should never require interactive prompts during conflict resolution — conflicts must always be resolvable through the visual conflict UI, not via stdin."

That's a real constraint that affects implementation. Please capture it as a session-end correction and ingest it into the bicameral ledger using the `agent_session` source so we know it came from this conversation rather than a transcript or doc.

After ingesting, confirm the decision_id and the signoff state.
11 changes: 11 additions & 0 deletions tests/e2e/prompts/flow-5-history.md
@@ -0,0 +1,11 @@
Show me the full decision history for this repo. Group decisions by feature area and for each one, surface BOTH axes:

- **status** — code-compliance side: reflected | drifted | pending | ungrounded
- **signoff.state** — human-approval side: proposed | ratified | rejected | superseded | collision_pending | context_pending

Before you call history, ingest two seed decisions so the response isn't empty:

1. "Reorder commits via drag/drop" (feature_group: Improved commit history) — leave at default proposed/ungrounded.
2. "Native support for Apple silicon machines" (feature_group: Apple silicon) — ingest, then ratify it so it shows ratified × ungrounded in the readout.

After history returns, render a brief table showing each decision's two axes so I can scan it.