ci(#108): demo recording fast-follow - pm.mp4 + dev.mp4 with transition slide #144
Merged
…on slide
Adds an opt-in workflow_dispatch path that records a single split-screen
demo session, then post-splits it into pm.mp4 (PM persona, two chapters
joined by an ffmpeg-generated transition slide) and dev.mp4 (Dev
persona).
Plan: thoughts/shared/plans/2026-04-30-v0-userflow-demo-recording.md
(Predecessor: PR #142, the assertion-only e2e on the same workflow file.)
Why one continuous claude session instead of two persona sessions: the
e2e config uses SURREAL_URL=memory://, so each MCP process is a fresh
ledger. A single session is what makes Scene 3 (PM post-impl) show the
SSE events from Scene 2 (Dev) authentically: same dashboard, same state,
no re-hydration.
Scene boundaries are detected from the stream-json tool-call timeline
(no LLM-emitted sentinels):
- Scene 1 → 2 = first bicameral.preflight call
- Scene 2 → 3 = first bicameral.history call after any link_commit
Recording step is `continue-on-error: true`, so the assertion-only step
remains the sole authority on workflow conclusion.
MP4s are .gitignored and excluded from the wheel; they live in the
v0-user-flow-e2e-demos GitHub artifact (90-day retention).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
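The scene-boundary rule in this commit message can be sketched as below. This is a minimal illustration, not the script's actual code: the `ToolCall` shape and the exact tool-name strings (in particular `bicameral.link_commit`) are assumptions, since the commit message only names the calls informally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    """One tool call pulled from the stream-json timeline (hypothetical shape)."""
    t: float   # seconds since recording start
    name: str  # assumed naming, e.g. "bicameral.preflight"


def scene_boundaries(calls: list[ToolCall]) -> tuple[Optional[float], Optional[float]]:
    """Return (scene 1->2, scene 2->3) timestamps, or None where not found."""
    s12: Optional[float] = None
    s23: Optional[float] = None
    seen_link_commit = False
    for call in sorted(calls, key=lambda c: c.t):
        # Scene 1 -> 2: the first preflight call.
        if s12 is None and call.name == "bicameral.preflight":
            s12 = call.t
        # Scene 2 -> 3: the first history call that follows any link_commit.
        if call.name == "bicameral.link_commit":
            seen_link_commit = True
        elif call.name == "bicameral.history" and seen_link_commit and s23 is None:
            s23 = call.t
    return s12, s23
```

A history call that happens before any link_commit is ignored, which is what keeps an early PM-side history lookup from ending Scene 2 prematurely.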
…ser snap
The `chromium-browser` apt package on Ubuntu 22.04+ is a snap-store
installer wrapper; GitHub Actions runners can't reach the snap store
and the install retries for 30 minutes before failing. Symptom from
run 25198673582:
===> Unable to contact the store, trying every minute for the next 30 minutes
Fix:
- Drop chromium-browser from the apt-get install step.
- Auto-detect the browser binary in record_demo.sh — prefers
google-chrome-stable (pre-installed on ubuntu-latest), then
google-chrome / chromium / chromium-browser as fallbacks for
desktop developers running locally.
- Add a sanity check at the end of the install step that fails
fast if no chromium-compatible browser is on PATH.
All four browser variants accept identical Chromium-style flags
(--no-sandbox, --window-size, --window-position, etc.) so the
recording layout is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
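The browser auto-detection described in this commit lives in a shell script; the same preference order can be sketched in Python for illustration. The function name `detect_browser` and the injectable `which` parameter are assumptions made so the logic is easy to exercise without a real browser installed.

```python
import shutil
from typing import Callable, Optional

# Preference order taken from the commit message above.
BROWSER_CANDIDATES = (
    "google-chrome-stable",
    "google-chrome",
    "chromium",
    "chromium-browser",
)


def detect_browser(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return the path of the first candidate found on PATH, or fail fast."""
    for name in BROWSER_CANDIDATES:
        path = which(name)
        if path:
            return path
    # Mirrors the sanity check: fail immediately instead of timing out later.
    raise RuntimeError("no Chromium-compatible browser found on PATH")
```

Calling `detect_browser()` with no arguments consults the real PATH via `shutil.which`.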
Why: the headless ``claude -p`` harness validates MCP tool callability
but cannot reliably exercise the agentic auto-fire layer (preflight,
capture-corrections) that the demo punchline depends on. This change
makes the gap explicit in the report so reviewers can see at a glance
which flows validate the tool surface and which still need the
interactive recording path to come through.
Changes:
- Switch e2e ledger from ``memory://`` to ``surrealkv://`` so state
persists across the 5 sequential flows (flow-1 seeds → flow-5
ratifies). Wiped at start of each run so tests stay reproducible.
- Refactor FLOW_PLAN to a FlowSpec dataclass with category (mcp_layer |
agentic_layer) + per-flow advisory. Report now renders a sharable
summary banner with MCP-vs-agentic breakdown and an ADVISORIES section
explaining failed/compromised flows.
- Rewrite flow-2 to natural dev voice (refactor reorder.ts, no skill
names); auto-fire trigger preserved. Currently fails by design;
advisory documents (a) auto-fire losing the priority race vs the
agent's "verify the premise first" instinct, and (b) CodeGenome
semantic grounding being wired into link_commit + bind but NOT
preflight, so file-path lookup against reorder.ts returns no matches
even though "Reorder via drag/drop" is semantically dead-on.
- Rewrite flow-4 to drop the fabricated "earlier in our conversation"
framing: each ``claude -p`` is a fresh session, so the agent correctly
refused to put false provenance in the ledger. Now states the
constraint honestly. Advisory marks this a compromised pass: it
succeeds only because the prompt names ``agent_session`` source
explicitly, so capture-corrections skill itself isn't auto-fired.
- Rewrite flow-5 as PM Friday ratification against the persisted
ledger; no in-session seed needed any more.
- Add tests/e2e/record_demo_interactive.sh, a tmux+send-keys sketch for
the interactive recording path where auto-fire can actually be
observed in footage.
Layered on the recording infra in
thoughts/shared/plans/2026-04-30-v0-userflow-demo-recording.md.
Validates: 3/3 MCP-layer flows pass cleanly. Agentic layer: 1/2 (flow-4
compromised pass, flow-2 expected fail with advisory). Overall harness
verdict FAIL, which is an honest signal; see report ADVISORIES.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
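For readers who have not opened the harness, the FlowSpec refactor could look roughly like the sketch below. Only the name `FlowSpec`, the two category values, and the per-flow advisory come from the commit message; the field names, `summary_banner`, and the banner wording are assumptions.

```python
from dataclasses import dataclass
from typing import Literal, Optional

Category = Literal["mcp_layer", "agentic_layer"]


@dataclass(frozen=True)
class FlowSpec:
    flow_id: str                    # hypothetical field name
    category: Category
    advisory: Optional[str] = None  # shown when the flow fails or is compromised


def summary_banner(results: dict[str, bool], plan: list[FlowSpec]) -> str:
    """Render one pass-count line per category (MCP vs agentic breakdown)."""
    lines = []
    for category in ("mcp_layer", "agentic_layer"):
        flows = [f for f in plan if f.category == category]
        passed = sum(1 for f in flows if results.get(f.flow_id, False))
        lines.append(f"{category}: {passed}/{len(flows)} passed")
    return "\n".join(lines)
```

Keeping the category on the spec rather than in the report code means a reviewer can tell from the plan alone which failures point at the tool surface and which at auto-fire behaviour.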
Why: video recording is expensive (~30-45 min wall + claude API spend)
and was riding the same auto-trigger path as the cheap assertion step.
Splitting into two jobs lets the assertion path flow through
automatically on every PR while the recording path requires explicit
human approval before each run.
Changes:
- Job 1 ``assertions``: same scope as before (PR + dispatch). Uses
``environment: production`` for OAuth token; that env has no required
reviewers, so PR triggers run without manual gating.
- Job 2 ``recording``: manual dispatch only, gated by
``environment: recording-approval``. Required reviewers on that env
(set in repo settings) become the approval gate. ``needs: assertions``
combined with ``if: always()`` keeps the two stages ordered without
blocking recording on the assertion harness's expected advisory
failures.
Repo-settings prerequisite (one-time, before next manual dispatch):
- Create environment ``recording-approval`` in repo settings
- Add the required reviewers list under "Deployment protection rules"
- Copy ``CLAUDE_CODE_OAUTH_TOKEN`` secret to the env, or move it to
repo-level so both jobs see it
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: a video demo run must not wait on the assertion harness. The two
stages have independent value (assertions validate MCP tool surface,
recording validates the agentic layer visually), and the harness's
expected advisory failures (auto-fire gap,
codegenome-not-wired-through-preflight gap) shouldn't gate access to
demo footage.
Drop ``needs: assertions`` and the corresponding ``always()`` guard.
Recording is now triggered solely by ``workflow_dispatch`` +
``record_demo=true``, gated only by the ``recording-approval``
environment's required reviewers, and runs in parallel to assertions
when both fire on the same dispatch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ``if: record_demo == true`` predicate was double-gating: an unset
input on a PR or dispatch caused the recording job to be SKIPPED rather
than queued for review, so reviewers never saw the approval prompt.
Drop the predicate (and the now-dead ``record_demo`` workflow_dispatch
input). The ``recording-approval`` environment's required-reviewers
rule becomes the sole gate. The recording job now always queues on PR +
on dispatch and sits in "Waiting" until an authorized reviewer approves
it in the Actions UI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cene
Implements thoughts/shared/plans/2026-05-01-interactive-recording-spec.md.
Replaces the headless `claude -p` + demo_renderer.py path with five real
interactive sessions (one per flow), driven via tmux bracketed paste, so
recordings show the actual Claude Code TUI instead of a custom renderer.
State persists across scenes via the shared surrealkv ledger. Continuous
ffmpeg captures the arc; per-scene timestamps drive the post-recording
trim/concat into full-int.mp4 + scene-1..5.mp4 + pm.mp4 (scene-1 +
transition + scene-5) + dev.mp4 (scene-2 + 3 + 4).
Workflow swaps `record_demo.sh` → `record_demo_interactive.sh`; legacy
script retained as fallback (continue-on-error already covers flakes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
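The timestamp-driven trim/concat step can be sketched as follows. This only builds ffmpeg argument lists and a concat-demuxer playlist; it does not run anything. The raw capture name `raw.mp4`, the clip name `transition.mp4`, and all function names are placeholders, and the real script may cut or re-encode differently.

```python
# Chapter composition as stated in the commit message. "transition.mp4" is a
# placeholder name for the transition slide clip.
PLAYLISTS = {
    "pm.mp4": ["scene-1.mp4", "transition.mp4", "scene-5.mp4"],
    "dev.mp4": ["scene-2.mp4", "scene-3.mp4", "scene-4.mp4"],
}


def trim_command(src: str, start: float, end: float, out: str) -> list[str]:
    """ffmpeg argv that copies the span from start to end of src into out."""
    # With "-c copy" no re-encode happens, so cuts snap to nearby keyframes.
    return ["ffmpeg", "-y", "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
            "-i", src, "-c", "copy", out]


def scene_commands(src: str, marks: list[float]) -> list[list[str]]:
    """marks holds each scene's start time followed by the final end time."""
    return [
        trim_command(src, marks[i], marks[i + 1], f"scene-{i + 1}.mp4")
        for i in range(len(marks) - 1)
    ]


def concat_list(parts: list[str]) -> str:
    """Playlist text in the format ffmpeg's concat demuxer reads."""
    return "".join(f"file '{p}'\n" for p in parts)
```

The playlist text would be written to a file and passed to `ffmpeg -f concat -i <list> -c copy <out>`; stream-copy concatenation assumes all clips share codec parameters.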
Reproduced the CI symptom locally: interactive `claude` reads but does
NOT honour `CLAUDE_CODE_OAUTH_TOKEN` (matches GH issue #32463); the
"Select login method" picker fires regardless. Switching to
`ANTHROPIC_API_KEY` works: claude detects the env var and shows a
dismissable "Detected a custom API key in your environment" picker
instead.
Recording-job changes:
- Workflow env: `CLAUDE_CODE_OAUTH_TOKEN` → `ANTHROPIC_API_KEY`. Mirror
visibility-probe step from the assertions job for diagnosis parity.
- Script: drop the `~/.claude/.credentials.json` write (was for OAuth,
doesn't help). `wait_for_claude_ready` now walks the full first-run
dialog stack (theme, API-key approval, security notes, trust folder,
new MCP server, bypass-permissions warning), sending Enter or '1'/'2'
per the dialog's preselected default.
- READY_TIMEOUT raised 30→90s (each dismissal costs ~2s, plus initial
TUI render).
Assertions job stays on `CLAUDE_CODE_OAUTH_TOKEN`; its `claude -p` path
honours that env var fine.
Verified end-to-end against fresh `~/.claude` locally: ready ✓ at t≈10s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
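The dialog-walking loop in `wait_for_claude_ready` is a shell function; its general shape might resemble the Python sketch below. Everything here is illustrative: only the API-key marker text is quoted in the commit message, while the second marker, the key sent for each dialog, and the `capture`/`send` callables (standing in for a tmux pane capture and key send) are assumptions.

```python
import time
from typing import Callable

# Marker text -> key to send. Only the first marker is quoted in the commit
# message; the second marker and both key choices are placeholders.
DIALOG_KEYS = {
    "Detected a custom API key in your environment": "1",
    "trust the files in this folder": "Enter",
}
READY_MARKER = "❯"  # the input prompt the script waits for


def wait_for_ready(
    capture: Callable[[], str],
    send: Callable[[str], None],
    timeout: float = 90.0,
    poll: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Dismiss known dialogs until the input prompt shows or the timeout hits."""
    deadline = clock() + timeout
    while clock() < deadline:
        screen = capture()
        dismissed = False
        # Check dialogs first: a picker may draw its own cursor glyph.
        for marker, key in DIALOG_KEYS.items():
            if marker in screen:
                send(key)
                dismissed = True
                break
        if not dismissed and READY_MARKER in screen:
            return True
        sleep(poll)
    return False
```

With a 2-second poll, six dialogs cost roughly 12 seconds before the prompt can be detected, which is consistent with raising the timeout well above 30 seconds.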
Two related issues with the e2e flow shape:
1. Flow 1 only ingested, never ratified. The seed decisions sat in
`proposed`, contaminating flow 5's "what's queued for adoption" view.
2. Flow 3 had a "Before you start, you'll need to set up a bound
decision against cherry-pick.ts" preamble that effectively re-ingested
+ re-bound the cherry-pick decision flow 1 had already created.
Subsequent flows should USE the ledger flow 1 establishes, not rebuild
it.
Fixes:
- flow-1-ingest.md: after ingest, bind cherry-pick decision to
app/src/lib/git/cherry-pick.ts (CherryPickResult enum), then ratify all
three. This is the clean baseline subsequent flows depend on.
- flow-3-commit-sync.md: drop the setup preamble. The prompt now trusts
the binding from flow 1 and just calls link_commit +
resolve_compliance.
- run_e2e_flows.py:assert_flow_1: assert ingest + bind(cherry-pick.ts)
+ ratify all fire. Bind target is checked against the bindings list
shape the bind handler accepts (top-level `bindings: list[dict]` with
`file_path` per entry).
Other flows unchanged: flow 2's agent_session ingest is a refinement
(naturally produced by preflight collision), flow 4's ingest is the
session-end correction itself; neither is an "at the start"
setup-ingest.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
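The bind-target check against the `bindings: list[dict]` shape can be illustrated with a small sketch. The helper names are hypothetical and the real `assert_flow_1` may structure this differently; only the top-level `bindings` list with a `file_path` per entry comes from the commit message.

```python
def bind_targets(tool_input: dict) -> list[str]:
    """Collect file_path values from a bind call's top-level `bindings` list."""
    bindings = tool_input.get("bindings")
    if not isinstance(bindings, list):
        return []
    return [
        entry["file_path"]
        for entry in bindings
        if isinstance(entry, dict) and isinstance(entry.get("file_path"), str)
    ]


def binds_file(tool_input: dict, suffix: str) -> bool:
    """True when any binding targets a path ending in suffix."""
    return any(path.endswith(suffix) for path in bind_targets(tool_input))
```

Matching on a path suffix rather than the full string tolerates the agent passing either a repo-relative or an absolute path.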
Prompts (all five rewritten as natural PM/dev language: no "ingest via
bicameral.ingest", no tool names; each indirectly auto-fires the right
skill via the trigger phrases the skill files document):
- Flow 1: PM after a roadmap review. Lists three items with code
anchors (cherry-pick.ts, reorder.ts) and "we're aligned, sign these
off". Auto -> ingest + bind both files + ratify. Establishes the clean
baseline the rest of the flows USE rather than rebuild.
- Flow 2: dev wants to refactor reorder.ts. Auto -> preflight on bound
reorder.ts -> ingest agent_session refinement -> resolve_collision.
- Flow 3: dev asks for a small edit + commit on cherry-pick.ts. Real
edit via Edit + real `git add`/`git commit` via Bash. The PostToolUse
hook surfaces "bicameral: new commit detected" and bicameral-sync
auto-fires link_commit. (Auto resolve_compliance is deferred until that
feature lands; assertion only checks link_commit.)
- Flow 4: PM mid-conversation constraint about cherry-pick conflict
resolution. Auto -> ingest agent_session + resolve_collision wiring it
as context_for to the existing cherry-pick decision (the gap the PR
#144 footage exposed where the constraint orphaned as a parallel
decision is now a hard test failure, not a compromised pass).
- Flow 5: PM Friday review. Auto -> history + ratify the most-ready
proposed decision.
Assertion changes:
- assert_flow_1: ingest + bind(cherry-pick.ts) + bind(reorder.ts) +
ratify.
- assert_flow_2: preflight target = reorder.ts (was cherry-pick.ts; the
prompt is about reorder).
- assert_flow_3: only link_commit; resolve_compliance dropped.
- assert_flow_4: now strictly requires resolve_collision after ingest.
Harness setup (run_e2e_flows.py + record_demo_interactive.sh):
- `--allowed-tools` widened to mcp__bicameral,Read,Grep,Edit,Bash so
flow 3 can actually edit + commit.
- PostToolUse hook command imported from
`setup_wizard._BICAMERAL_POST_COMMIT_COMMAND` and written to a per-run
settings.json passed via `--settings`. Single source of truth: the e2e
exercises the exact hook string a freshly-onboarded user would have.
- desktop-clone reset to FETCH_HEAD/HEAD before each run since flow 3
now leaves a real commit behind.
Recording typing animation:
- New `type_prompt` in record_demo_interactive.sh types each char with
a ~3s total budget per prompt (replacing the instant paste). Embedded
newlines use M-Enter (Alt+Return), verified locally as the only escape
that preserves newlines in claude TUI's input box without submitting.
Final Enter submits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
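The typing animation above amounts to turning a prompt into a sequence of `tmux send-keys` invocations with a fixed time budget. A rough Python sketch of that planning step follows; `typing_plan` is a hypothetical name, and the semicolon escape is an added precaution (tmux treats a bare `;` argument as a command separator) rather than something the commit message mentions.

```python
def typing_plan(prompt: str, budget_s: float = 3.0) -> tuple[list[list[str]], float]:
    """Return (send-keys argument lists, delay in seconds between keystrokes)."""
    steps: list[list[str]] = []
    for ch in prompt:
        if ch == "\n":
            # Alt+Return inserts a newline in the input box without submitting.
            steps.append(["M-Enter"])
        elif ch == ";":
            # Escape so tmux does not parse it as a command separator.
            steps.append(["-l", "\\;"])
        else:
            # -l sends the character literally instead of as a key name.
            steps.append(["-l", ch])
    steps.append(["Enter"])  # the final Enter submits the prompt
    delay = budget_s / max(len(prompt), 1)
    return steps, delay
```

Each step would be appended to something like `tmux send-keys -t <pane>` and followed by a sleep of `delay` seconds, so a long prompt types faster per character than a short one while both take about the same total time.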
…, scaffolding decoupling
The v0 user-flow e2e harness now reflects the spec's intent end-to-end so
that any failure points at a real product gap, not test design.
What changed in tests/e2e/run_e2e_flows.py:
- cwd=DESKTOP_REPO_PATH per claude session — agent treats the repo under
test as the primary codebase. Previously cwd=pilot/mcp made the agent
look for `app/src/lib/git/reorder.ts` in the Python MCP server tree
and refuse to act ("can't find this file in the current repo").
- Flows 2/3/4 share a chained dev_session via --session-id + --resume
so capture-corrections has real transcript history and the SessionEnd
hook fires on an authentic multi-turn session.
- SessionEnd hook installed (sourced from setup_wizard) so the
production hook path is exercised.
- Scaffolding turn injects an explicit preflight call after Flow 2 if
auto-fire failed — keeps Flow 3/4 from cascade-failing on Flow 2's
agentic auto-fire issue (#146). Flow 2's verdict still measures
auto-fire honestly; the scaffolding is session-state recovery only.
- Flow 1 asserter walks ingest.mappings[].code_regions[].file_path
(canonical modern path) AND accepts a follow-up bicameral.bind call
(legacy path) — both are valid binding shapes per the skill.
- Flow 3 verdict is now ledger-based (not commit-happened): asserts the
V1 lifecycle outcome (reflected/drifted decisions emerge during the
run, validating ingest → bind → link_commit → resolve_compliance →
verdict). The stream-json commit check is informational. Per
bicameral-mcp#135, post-commit hook is sync-only — the chain
completes via Flow 5's natural workflow.
- Flow 5 asserter conditional-ratifies: PASS when there are no
proposals to ratify (matches issue #108 Flow 5 spec which says
ratify is silent if queue is empty). Stops cascade-failing Flow 5
on upstream Flow 2 issues.
- Ledger query uses raw LedgerClient (bypasses
init_schema/migrate which crashes on the evidence_refs schema
bug — to be filed separately).
- Before/after ledger snapshot around dev_session lets the assertion
measure verdicts written instead of relying on a coincidental
pending count.
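As a rough illustration of the before/after snapshot idea in the last bullet, the sketch below diffs two snapshots and counts decisions that moved into a verdict state. The snapshot shape (decision id mapped to status string) and the function name are assumptions; the `reflected` and `drifted` states are the ones named for Flow 3 above.

```python
from collections import Counter


def verdicts_written(
    before: dict[str, str],
    after: dict[str, str],
    verdicts: tuple[str, ...] = ("reflected", "drifted"),
) -> Counter:
    """Count decisions whose status moved into a verdict state during the run."""
    written: Counter = Counter()
    for decision_id, status in after.items():
        # Only count a verdict that was not already present before the session.
        if status in verdicts and before.get(decision_id) != status:
            written[status] += 1
    return written
```

Diffing snapshots ties the assertion to what this session actually wrote, so leftover state from an earlier flow cannot make the count look right by accident.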
Per-flow prompts:
- flow-2: explicit "I know we said X but actually Y" framing makes the
collision against Flow 1's drag-and-drop reorder decision unambiguous.
- flow-3: minimal "edit + commit on cherry-pick.ts" — no bicameral
verbs, no status checks; just trip the post-commit hook.
- flow-4: drops "I want that locked in" tracking verbs (which were
routing to ingest), adds correction markers (`wait`, `shouldn't`,
`wrong`) that capture-corrections Step A pre-filter recognises, plus
a "continue refactor" code-work request that should trigger preflight
step 3.5 → in-session capture-corrections.
DEV_CYCLE.md §0 — Workflow Feature Release Cycle:
- New section before §1 documenting the meta-process for shipping new
agentic workflow features: friction → candidate workflow → test
harness → functional solution → telemetry → optimized solution.
- Codifies the lesson from this iteration cycle and from #146/#147:
put the harness in front of the implementation, not behind it. The
harness should fail on day one — that's the point.
Iteration result with the patches: 3/5 PASS (Flow 1, 3, 5), 2/5 FAIL
(Flow 2, 4 — both documented at bicameral-mcp#146 and bicameral-mcp#147
as real auto-fire reliability gaps in headless `claude -p`). Both #146
and #147 were updated in-place with iteration findings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
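The conditional-ratify rule for the Flow 5 asserter in the commit message above can be written down as a tiny predicate. This is a sketch under assumptions: the function name, the idea of passing in the count of proposed decisions, the exact tool-name strings, and the requirement that `history` must always fire are not confirmed by the source.

```python
def flow_5_verdict(proposed_before: int, tool_calls: list[str]) -> bool:
    """PASS/FAIL for Flow 5 given the proposal queue size and observed calls."""
    called_history = "bicameral.history" in tool_calls
    called_ratify = "bicameral.ratify" in tool_calls
    if not called_history:
        return False  # assumed: the review itself must always happen
    if proposed_before == 0:
        return True   # empty queue: staying silent on ratify is expected
    return called_ratify
```

The empty-queue branch is what stops Flow 5 from cascade-failing when an upstream flow produced no proposals.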
jinhongkuan added a commit that referenced this pull request on May 1, 2026
Summary
Fulfills BicameralAI/bicameral#108 — the v0 user-flow demo + e2e validation arc.
Two artifacts off one workflow:
- … `desktop/desktop` at a pinned commit.
- … `pm.mp4` + `dev.mp4` + `full-int.mp4` with a transition slide between PM chapters.
State persists across all five flows via a shared surrealkv ledger; flow 1 sets the baseline, flows 2–5 USE it.
Flows under test
Each flow uses natural PM/dev language (no tool names mentioned in prompts) and indirectly auto-fires the relevant skill.
- Flow 1. Prompt: "… `cherry-pick.ts`, the reorder one to `reorder.ts`, sign these off." Auto-fires: `bicameral.ingest` (3 decisions) → `bicameral.bind` on both files → `bicameral.ratify` on all three. Establishes the baseline.
- Flow 2. Prompt: "… `reorder.ts` - pulling out `reorder()` entirely." Auto-fires: `bicameral.preflight` on bound `reorder.ts` → ingest of `agent_session` refinement → `bicameral.resolve_collision` wires it to the existing reorder decision.
- Flow 3. Prompt: "… `CherryPickResult`, then commit it." Auto-fires: `Edit` + `git commit` via `Bash`. PostToolUse hook (imported from `setup_wizard._BICAMERAL_POST_COMMIT_COMMAND`) surfaces "new commit detected" → `bicameral-sync` skill auto-fires `link_commit`. (Auto `resolve_compliance` is deferred until that lands.)
- Flow 4. Auto-fires: `bicameral.ingest` (`agent_session`) + `bicameral.resolve_collision` linking the constraint as `context_for` the existing cherry-pick decision. (Earlier the constraint orphaned as a parallel decision; now strictly required.)
- Flow 5. Auto-fires: `bicameral.history` → `bicameral.ratify` on the most-ready proposed decision.
How auth and recording work
- … (`production` env): `claude -p` honours `CLAUDE_CODE_OAUTH_TOKEN` directly.
- … (`recording-approval` env): interactive `claude` ignores `CLAUDE_CODE_OAUTH_TOKEN` (verified locally; matches GH issue #32463), so this path uses `ANTHROPIC_API_KEY`. The script walks the first-run picker stack (theme → API-key approval → security notes → trust folder → MCP server → bypass-permissions warning) to land on the `❯` input prompt before each scene types its prompt at ~3s/prompt human pace via `tmux send-keys -l`.
Test plan
- … `resolve_collision` doesn't fire).
- … `full-int.mp4`, `scene-{1..5}.mp4`, `pm.mp4` (scene-1 + transition + scene-5), `dev.mp4` (scene-2 + scene-3 + scene-4) with the real Claude TUI rendered, prompts visibly typed, dashboard updating per scene.
🤖 Generated with Claude Code