
ci(#108): demo recording fast-follow — pm.mp4 + dev.mp4 with transition slide #144

Merged
jinhongkuan merged 15 commits into dev from feat/108-demo-recording-fastfollow
May 1, 2026

jinhongkuan (Contributor) commented May 1, 2026

Summary

Fulfills BicameralAI/bicameral#108 — the v0 user-flow demo + e2e validation arc.

Two artifacts off one workflow:

  1. assertion harness (always runs): five real Claude Code sessions exercising the agentic auto-fire surface against desktop/desktop at a pinned commit.
  2. split-screen recording (manual approval): tmux-driven interactive Claude TUI for pm.mp4 + dev.mp4 + full-int.mp4 with a transition slide between PM chapters.

State persists across all five flows via a shared surrealkv ledger; flow 1 sets the baseline, flows 2–5 USE it.

Flows under test

Each flow uses natural PM/dev language (no tool names mentioned in prompts) and indirectly auto-fires the relevant skill.

| # | Persona | Prompt shape | What it tests |
|---|---------|--------------|---------------|
| 1 | PM | "Just got out of our roadmap review. Here are three items, the cherry-pick one anchors to cherry-pick.ts, the reorder one to reorder.ts, sign these off." | bicameral.ingest (3 decisions) → bicameral.bind on both files → bicameral.ratify on all three. Establishes the baseline. |
| 2 | Dev | "I'm about to refactor reorder.ts — pulling out reorder() entirely." | Auto-fired bicameral.preflight on bound reorder.ts → ingest of agent_session refinement → bicameral.resolve_collision wires it to the existing reorder decision. |
| 3 | Dev | "Add a one-line comment near CherryPickResult, then commit it." | Real Edit + git commit via Bash. PostToolUse hook (imported from setup_wizard._BICAMERAL_POST_COMMIT_COMMAND) surfaces "new commit detected" → bicameral-sync skill auto-fires link_commit. (Auto resolve_compliance is deferred until that feature lands.) |
| 4 | PM | "We need cherry-pick conflict resolution to never block on stdin — visual UI only. Worth tracking alongside the cherry-pick work." | Auto-fired bicameral.ingest (agent_session) + bicameral.resolve_collision linking the constraint as context_for the existing cherry-pick decision. (Earlier the constraint was orphaned as a parallel decision; the resolve_collision link is now strictly required.) |
| 5 | PM | "Friday review. Walk me through what's tracked. Ratify whichever proposed one looks most ready." | Auto-fired bicameral.history → bicameral.ratify on the most-ready proposed decision. |

How auth and recording work

  • Assertions job (production env): claude -p honours CLAUDE_CODE_OAUTH_TOKEN directly.
  • Recording job (recording-approval env): interactive claude ignores CLAUDE_CODE_OAUTH_TOKEN (verified locally; matches GH issue #32463), so this path uses ANTHROPIC_API_KEY. The script first walks the first-run picker stack (theme → API-key approval → security notes → trust folder → MCP server → bypass-permissions warning) until it lands on the input prompt. Each scene then types its prompt at a human pace of ~3s per prompt via tmux send-keys -l.

Test plan

  • Assertions job: 5/5 PASS with the new strict shape (flow 4 fails honestly if resolve_collision doesn't fire).
  • Recording job (manual dispatch): produces full-int.mp4, scene-{1..5}.mp4, pm.mp4 (scene-1 + transition + scene-5), dev.mp4 (scene-2 + scene-3 + scene-4) with the real Claude TUI rendered, prompts visibly typed, dashboard updating per scene.
  • No re-ingestion of flow 1's decisions in subsequent flows. Dashboard shows ~5 decisions total at the end (3 seed + 1 reorder refinement + 1 cherry-pick constraint), all linked appropriately.

🤖 Generated with Claude Code

…on slide

Adds an opt-in workflow_dispatch path that records a single split-screen
demo session, then post-splits it into pm.mp4 (PM persona, two chapters
joined by an ffmpeg-generated transition slide) and dev.mp4 (Dev persona).

Plan: thoughts/shared/plans/2026-04-30-v0-userflow-demo-recording.md
(Predecessor: PR #142, the assertion-only e2e on the same workflow file.)

Why one continuous claude session instead of two persona sessions: the
e2e config uses SURREAL_URL=memory://, so each MCP process is a fresh
ledger. A single session is what makes Scene 3 (PM post-impl) show the
SSE events from Scene 2 (Dev) authentically — same dashboard, same
state, no re-hydration.

Scene boundaries are detected from the stream-json tool-call timeline
(no LLM-emitted sentinels):
  Scene 1 → 2 = first bicameral.preflight call
  Scene 2 → 3 = first bicameral.history call after any link_commit
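
The boundary rule above can be sketched as follows. This is a minimal illustration, assuming the stream-json output has already been reduced to an ordered list of timestamped tool names; `ToolCall` and `scene_boundaries` are hypothetical names, not the script's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation from the stream-json timeline (assumed shape)."""
    t: float   # seconds since recording start
    name: str  # e.g. "bicameral.preflight"


def scene_boundaries(calls: list[ToolCall]) -> tuple[Optional[float], Optional[float]]:
    """Return (scene 1->2 time, scene 2->3 time); None where a boundary never occurs."""
    first_preflight: Optional[float] = None
    seen_link_commit = False
    first_history_after_link: Optional[float] = None
    for call in calls:
        if first_preflight is None and call.name.endswith("preflight"):
            first_preflight = call.t
        if call.name.endswith("link_commit"):
            seen_link_commit = True
        elif (
            seen_link_commit
            and first_history_after_link is None
            and call.name.endswith("history")
        ):
            # Only a history call that follows some link_commit opens Scene 3.
            first_history_after_link = call.t
    return first_preflight, first_history_after_link
```

A history call that happens before any link_commit (for example during the PM's first chapter) is deliberately ignored, which is what keeps the second boundary from firing early.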

Recording step is `continue-on-error: true` — assertion-only step
remains the sole authority on workflow conclusion.

MP4s are .gitignored and excluded from the wheel; they live in the
v0-user-flow-e2e-demos GitHub artifact (90-day retention).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coderabbitai Bot commented May 1, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 092045b0-5bcc-40e1-b09a-55ce522b4186


…ser snap

The `chromium-browser` apt package on Ubuntu 22.04+ is a snap-store
installer wrapper; GitHub Actions runners can't reach the snap store
and the install retries for 30 minutes before failing. Symptom from
run 25198673582:

  ===> Unable to contact the store, trying every minute for the next 30 minutes

Fix:
  - Drop chromium-browser from the apt-get install step.
  - Auto-detect the browser binary in record_demo.sh — prefers
    google-chrome-stable (pre-installed on ubuntu-latest), then
    google-chrome / chromium / chromium-browser as fallbacks for
    desktop developers running locally.
  - Add a sanity check at the end of the install step that fails
    fast if no chromium-compatible browser is on PATH.

All four browser variants accept identical Chromium-style flags
(--no-sandbox, --window-size, --window-position, etc.) so the
recording layout is unchanged.
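
The lookup order described in the fix can be sketched like this. The real detection lives in the shell script record_demo.sh; the Python below only illustrates the same preference order, and `detect_browser` is a hypothetical helper name.

```python
import shutil
from typing import Callable, Optional

# Preference order taken from the commit message above.
BROWSER_CANDIDATES = (
    "google-chrome-stable",
    "google-chrome",
    "chromium",
    "chromium-browser",
)


def detect_browser(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return the path of the first candidate found on PATH, or raise.

    `which` is injectable so the lookup can be exercised without a real browser.
    """
    for name in BROWSER_CANDIDATES:
        path = which(name)
        if path:
            return path
    raise RuntimeError("no Chromium-compatible browser found on PATH")
```

Raising immediately when nothing is found mirrors the fail-fast sanity check added to the install step, instead of letting the recording start and fail later.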

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: the headless ``claude -p`` harness validates MCP tool callability
but cannot reliably exercise the agentic auto-fire layer (preflight,
capture-corrections) that the demo punchline depends on. This change
makes the gap explicit in the report so reviewers can see at a glance
which flows validate the tool surface vs which still need the
interactive recording path to come through.

Changes:
- Switch e2e ledger from ``memory://`` to ``surrealkv://`` so state
  persists across the 5 sequential flows (flow-1 seeds → flow-5
  ratifies). Wiped at start of each run so tests stay reproducible.
- Refactor FLOW_PLAN to a FlowSpec dataclass with category
  (mcp_layer | agentic_layer) + per-flow advisory. Report now renders
  a sharable summary banner with MCP-vs-agentic breakdown and an
  ADVISORIES section explaining failed/compromised flows.
- Rewrite flow-2 to natural dev voice (refactor reorder.ts, no skill
  names) — auto-fire trigger preserved. Currently fails by design;
  advisory documents (a) auto-fire losing the priority race vs the
  agent's "verify the premise first" instinct, and (b) CodeGenome
  semantic grounding being wired into link_commit + bind but NOT
  preflight, so file-path lookup against reorder.ts returns no
  matches even though "Reorder via drag/drop" is semantically dead-on.
- Rewrite flow-4 to drop the fabricated "earlier in our conversation"
  framing — each ``claude -p`` is a fresh session so the agent
  correctly refused to put false provenance in the ledger. Now states
  the constraint honestly. Advisory marks this a compromised pass:
  it succeeds only because the prompt names ``agent_session`` source
  explicitly, so capture-corrections skill itself isn't auto-fired.
- Rewrite flow-5 as PM Friday ratification against the persisted
  ledger — no in-session seed needed any more.
- Add tests/e2e/record_demo_interactive.sh — tmux+send-keys sketch
  for the interactive recording path where auto-fire can actually
  be observed in footage. Layered on the recording infra in
  thoughts/shared/plans/2026-04-30-v0-userflow-demo-recording.md.
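
As an illustration of the FlowSpec refactor and the summary banner, here is a minimal sketch. Only `category` and the per-flow advisory come from the description above; `flow_id`, `summary_banner`, and the banner format are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Literal

Category = Literal["mcp_layer", "agentic_layer"]


@dataclass(frozen=True)
class FlowSpec:
    """Per-flow metadata (field names other than category/advisory are assumed)."""
    flow_id: str
    category: Category
    advisory: str = ""  # explanation shown when the flow fails or is compromised


def summary_banner(specs: list[FlowSpec], passed: set[str]) -> str:
    """Render pass counts per category, e.g. 'mcp_layer 3/3 | agentic_layer 1/2'."""
    total = Counter(spec.category for spec in specs)
    ok = Counter(spec.category for spec in specs if spec.flow_id in passed)
    return " | ".join(
        f"{cat} {ok[cat]}/{total[cat]}" for cat in ("mcp_layer", "agentic_layer")
    )
```

Keeping the advisory on the spec itself means the report can explain a failing flow without a separate lookup table.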

Validates: 3/3 MCP-layer flows pass cleanly. Agentic layer: 1/2
(flow-4 compromised pass, flow-2 expected fail with advisory).
Overall harness verdict FAIL — honest signal, see report ADVISORIES.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: video recording is expensive (~30-45 min wall + claude API spend)
and was riding the same auto-trigger path as the cheap assertion step.
Splitting into two jobs lets the assertion path flow through
automatically on every PR while the recording path requires explicit
human approval before each run.

Changes:
- Job 1 ``assertions`` — same scope as before (PR + dispatch). Uses
  ``environment: production`` for OAuth token; that env has no required
  reviewers, so PR triggers run without manual gating.
- Job 2 ``recording`` — manual dispatch only, gated by
  ``environment: recording-approval``. Required reviewers on that env
  (set in repo settings) become the approval gate. ``needs: assertions``
  + ``if: always()`` keeps the two stages ordered without blocking
  recording on the assertion harness's expected advisory failures.

Repo-settings prerequisite (one-time, before next manual dispatch):
- Create environment ``recording-approval`` in repo settings
- Add the required reviewers list under "Deployment protection rules"
- Copy ``CLAUDE_CODE_OAUTH_TOKEN`` secret to the env, or move it to
  repo-level so both jobs see it

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why: a video demo run must not wait on the assertion harness. The two
stages have independent value — assertions validate MCP tool surface,
recording validates the agentic layer visually — and the harness's
expected advisory failures (auto-fire gap, codegenome-not-wired-through-
preflight gap) shouldn't gate access to demo footage.

Drop ``needs: assertions`` and the corresponding ``always()`` guard.
Recording is now triggered solely by ``workflow_dispatch`` +
``record_demo=true``, gated only by the ``recording-approval``
environment's required reviewers, and runs in parallel to assertions
when both fire on the same dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ``if: record_demo == true`` predicate was double-gating: an unset
input on a PR or dispatch caused the recording job to be SKIPPED rather
than queued for review, so reviewers never saw the approval prompt.

Drop the predicate (and the now-dead ``record_demo`` workflow_dispatch
input). The ``recording-approval`` environment's required-reviewers rule
becomes the sole gate. The recording job now always queues on PR + on
dispatch and sits in "Waiting" until an authorized reviewer approves
it in the Actions UI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cene

Implements thoughts/shared/plans/2026-05-01-interactive-recording-spec.md.
Replaces the headless `claude -p` + demo_renderer.py path with five real
interactive sessions (one per flow), driven via tmux bracketed paste, so
recordings show the actual Claude Code TUI instead of a custom renderer.

State persists across scenes via the shared surrealkv ledger. Continuous
ffmpeg captures the arc; per-scene timestamps drive the post-recording
trim/concat into full-int.mp4 + scene-1..5.mp4 + pm.mp4 (scene-1 +
transition + scene-5) + dev.mp4 (scene-2 + 3 + 4).
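
The trim/concat step can be sketched as a pair of helpers that build ffmpeg argument lists from the per-scene timestamps. This is an assumption-laden sketch: the helper names, the `transition.mp4` file name, and the use of stream copy with the concat demuxer are illustrative choices, not necessarily what the script does.

```python
def trim_cmd(src: str, start: float, end: float, dst: str) -> list[str]:
    """ffmpeg argv that cuts [start, end] seconds out of src without re-encoding."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.2f}", "-to", f"{end:.2f}",
        "-i", src,
        "-c", "copy",
        dst,
    ]


def concat_list(parts: list[str]) -> str:
    """Body of a list file for ffmpeg's concat demuxer, one `file` line per clip."""
    return "".join(f"file '{part}'\n" for part in parts)


def concat_cmd(list_path: str, dst: str) -> list[str]:
    """ffmpeg argv that joins the clips named in list_path into dst."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", dst]


# Clip order per persona, as described above (transition.mp4 is a placeholder name).
PM_PARTS = ["scene-1.mp4", "transition.mp4", "scene-5.mp4"]
DEV_PARTS = ["scene-2.mp4", "scene-3.mp4", "scene-4.mp4"]
```

Stream copy is fast but cuts on keyframes and requires all joined clips to share codec parameters, so a generated transition slide would need to be encoded to match the captured footage, or the concat step would need to re-encode.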

Workflow swaps `record_demo.sh` → `record_demo_interactive.sh`; legacy
script retained as fallback (continue-on-error already covers flakes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reproduced the CI symptom locally: interactive `claude` reads but does
NOT honour `CLAUDE_CODE_OAUTH_TOKEN` (matches GH issue #32463) — the
"Select login method" picker fires regardless. Switching to
`ANTHROPIC_API_KEY` works: claude detects the env var and shows a
dismissable "Detected a custom API key in your environment" picker
instead.

Recording-job changes:
- Workflow env: `CLAUDE_CODE_OAUTH_TOKEN` → `ANTHROPIC_API_KEY`. Mirror
  visibility-probe step from the assertions job for diagnosis parity.
- Script: drop the `~/.claude/.credentials.json` write (was for OAuth,
  doesn't help). `wait_for_claude_ready` now walks the full first-run
  dialog stack — theme, API-key approval, security notes, trust folder,
  new MCP server, bypass-permissions warning — sending Enter or '1'/'2'
  per the dialog's preselected default.
- READY_TIMEOUT raised 30→90s (each dismissal costs ~2s, plus initial
  TUI render).
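
The dialog-walking loop can be sketched as below. The real `wait_for_claude_ready` is a shell function driving tmux; this Python version is only a model of the control flow, and the function name, parameters, and the idea of matching dialogs by a marker substring are assumptions. The caller supplies the marker strings and keys, since the exact dialog texts are not listed here.

```python
import time
from typing import Callable, Sequence

READY_TIMEOUT_S = 90.0  # matches the raised timeout described above


def wait_until_ready(
    capture: Callable[[], str],
    send_key: Callable[[str], None],
    dialogs: Sequence[tuple[str, str]],
    ready_marker: str,
    timeout_s: float = READY_TIMEOUT_S,
    poll_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Dismiss known dialogs until ready_marker is on screen; False on timeout.

    capture() returns the current pane text, send_key() presses one key, and
    dialogs is a list of (marker substring, key to send) pairs.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        screen = capture()
        if ready_marker in screen:
            return True
        for marker, key in dialogs:
            if marker in screen:
                send_key(key)  # dismiss at most one dialog per poll
                break
        sleep(poll_s)
    return False
```

Dismissing one dialog per poll and then re-capturing keeps the loop robust to dialogs appearing in a different order than expected.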

Assertions job stays on `CLAUDE_CODE_OAUTH_TOKEN` — its `claude -p` path
honours that env var fine.

Verified end-to-end against fresh `~/.claude` locally: ready ✓ at t≈10s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related issues with the e2e flow shape:

1. Flow 1 only ingested — never ratified. The seed decisions sat in
   `proposed`, contaminating flow 5's "what's queued for adoption" view.
2. Flow 3 had a "Before you start, you'll need to set up a bound decision
   against cherry-pick.ts" preamble that effectively re-ingested + re-bound
   the cherry-pick decision flow 1 had already created. Subsequent flows
   should USE the ledger flow 1 establishes, not rebuild it.

Fixes:
- flow-1-ingest.md: after ingest, bind cherry-pick decision to
  app/src/lib/git/cherry-pick.ts (CherryPickResult enum), then ratify all
  three. This is the clean baseline subsequent flows depend on.
- flow-3-commit-sync.md: drop the setup preamble. The prompt now trusts
  the binding from flow 1 and just calls link_commit + resolve_compliance.
- run_e2e_flows.py:assert_flow_1: assert ingest + bind(cherry-pick.ts) +
  ratify all fire. Bind target is checked against the bindings list shape
  the bind handler accepts (top-level `bindings: list[dict]` with
  `file_path` per entry).
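
The bind-target check can be illustrated with a small predicate over the tool input. The `bindings: list[dict]` shape with a `file_path` per entry is taken from the description above; the helper name `binds_file` and matching by path suffix are assumptions.

```python
from typing import Any


def binds_file(tool_input: dict[str, Any], suffix: str) -> bool:
    """True if any entry in the top-level `bindings` list targets a path ending in suffix."""
    bindings = tool_input.get("bindings")
    if not isinstance(bindings, list):
        return False
    return any(
        isinstance(entry, dict) and str(entry.get("file_path", "")).endswith(suffix)
        for entry in bindings
    )
```

For example, `binds_file({"bindings": [{"file_path": "app/src/lib/git/cherry-pick.ts"}]}, "cherry-pick.ts")` is true, while an input that puts `file_path` at the top level instead of inside `bindings` is rejected.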

Other flows unchanged — flow 2's agent_session ingest is a refinement
(naturally produced by preflight collision), flow 4's ingest is the
session-end correction itself; neither is an "at the start" setup-ingest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prompts (all five rewritten as natural PM/dev language — no "ingest via
bicameral.ingest", no tool names; each indirectly auto-fires the right
skill via the trigger phrases the skill files document):

- Flow 1: PM after a roadmap review. Lists three items with code anchors
  (cherry-pick.ts, reorder.ts) and "we're aligned, sign these off". Auto
  -> ingest + bind both files + ratify. Establishes the clean baseline
  the rest of the flows USE rather than rebuild.
- Flow 2: dev wants to refactor reorder.ts. Auto -> preflight on bound
  reorder.ts -> ingest agent_session refinement -> resolve_collision.
- Flow 3: dev asks for a small edit + commit on cherry-pick.ts. Real
  edit via Edit + real `git add`/`git commit` via Bash. The PostToolUse
  hook surfaces "bicameral: new commit detected" and bicameral-sync
  auto-fires link_commit. (Auto resolve_compliance is deferred until
  that feature lands; assertion only checks link_commit.)
- Flow 4: PM mid-conversation constraint about cherry-pick conflict
  resolution. Auto -> ingest agent_session + resolve_collision wiring
  it as context_for to the existing cherry-pick decision (the gap the
  PR #144 footage exposed where the constraint orphaned as a parallel
  decision is now a hard test failure, not a compromised pass).
- Flow 5: PM Friday review. Auto -> history + ratify the most-ready
  proposed decision.

Assertion changes:
- assert_flow_1: ingest + bind(cherry-pick.ts) + bind(reorder.ts) + ratify.
- assert_flow_2: preflight target = reorder.ts (was cherry-pick.ts —
  the prompt is about reorder).
- assert_flow_3: only link_commit; resolve_compliance dropped.
- assert_flow_4: now strictly requires resolve_collision after ingest.
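
An ordering assertion such as "resolve_collision after ingest" can be expressed as an ordered-subsequence check over the tool names seen in a session. This is a sketch; `fired_in_order` is a hypothetical helper and the tool-name strings are written as they appear in this PR, which may differ from the names in the raw stream.

```python
def fired_in_order(tool_names: list[str], expected: list[str]) -> bool:
    """True if `expected` occurs as an ordered (not necessarily contiguous) subsequence."""
    remaining = iter(tool_names)
    # Each `any` consumes the shared iterator up to the match, so later
    # expected names can only match calls that came afterwards.
    return all(any(name == want for name in remaining) for want in expected)
```

Unrelated calls between the two steps (Read, Grep, and so on) do not affect the result, but a resolve_collision that fires before ingest does not count.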

Harness setup (run_e2e_flows.py + record_demo_interactive.sh):
- `--allowed-tools` widened to mcp__bicameral,Read,Grep,Edit,Bash so
  flow 3 can actually edit + commit.
- PostToolUse hook command imported from
  `setup_wizard._BICAMERAL_POST_COMMIT_COMMAND` and written to a per-run
  settings.json passed via `--settings`. Single source of truth: the e2e
  exercises the exact hook string a freshly-onboarded user would have.
- desktop-clone reset to FETCH_HEAD/HEAD before each run since flow 3
  now leaves a real commit behind.
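
Writing the per-run settings file could look roughly like this. The nested `hooks` / `PostToolUse` / `matcher` layout follows the Claude Code settings format; the `"Bash"` matcher, the helper name, and the way the command string is passed in are assumptions (the harness imports the real string from `setup_wizard`).

```python
import json
from pathlib import Path


def write_run_settings(post_commit_command: str, run_dir: Path) -> Path:
    """Write a settings.json that registers one PostToolUse command hook."""
    settings = {
        "hooks": {
            "PostToolUse": [
                {
                    # Assumption: the hook only needs to observe Bash tool calls,
                    # since that is where `git commit` runs.
                    "matcher": "Bash",
                    "hooks": [{"type": "command", "command": post_commit_command}],
                }
            ]
        }
    }
    path = run_dir / "settings.json"
    path.write_text(json.dumps(settings, indent=2), encoding="utf-8")
    return path
```

The returned path is what would be handed to `--settings` for that run, so each run gets an isolated file and the user's own settings are never touched.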

Recording typing animation:
- New `type_prompt` in record_demo_interactive.sh types each char with
  a ~3s total budget per prompt (replacing the instant paste). Embedded
  newlines use M-Enter (Alt+Return) — verified locally as the only
  escape that preserves newlines in claude TUI's input box without
  submitting. Final Enter submits.
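
The typing animation can be modelled as a plan of tmux invocations plus a per-keystroke delay. The real `type_prompt` is a shell function; `typing_plan` below is a hypothetical Python rendering that only builds the argument lists and does not run tmux.

```python
def typing_plan(
    prompt: str, target: str, budget_s: float = 3.0
) -> tuple[list[list[str]], float]:
    """Return (one tmux argv per keystroke, delay in seconds between keystrokes)."""
    cmds: list[list[str]] = []
    for ch in prompt:
        if ch == "\n":
            # Alt+Return inserts a newline in the input box without submitting.
            cmds.append(["tmux", "send-keys", "-t", target, "M-Enter"])
        else:
            # tmux treats an argument ending in ';' as a command separator,
            # so a literal semicolon is escaped (worth verifying on your tmux).
            literal = "\\;" if ch == ";" else ch
            cmds.append(["tmux", "send-keys", "-t", target, "-l", literal])
    cmds.append(["tmux", "send-keys", "-t", target, "Enter"])  # final submit
    delay = budget_s / max(len(prompt), 1)  # spread the budget over the prompt
    return cmds, delay
```

Dividing a fixed budget by the prompt length keeps every prompt at roughly the same on-screen duration, so long prompts type faster than short ones.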

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, scaffolding decoupling

The v0 user-flow e2e harness now reflects the spec's intent end-to-end so
that any failure points at a real product gap, not test design.

What changed in tests/e2e/run_e2e_flows.py:

- cwd=DESKTOP_REPO_PATH per claude session — agent treats the repo under
  test as the primary codebase. Previously cwd=pilot/mcp made the agent
  look for `app/src/lib/git/reorder.ts` in the Python MCP server tree
  and refuse to act ("can't find this file in the current repo").
- Flows 2/3/4 share a chained dev_session via --session-id + --resume
  so capture-corrections has real transcript history and the SessionEnd
  hook fires on an authentic multi-turn session.
- SessionEnd hook installed (sourced from setup_wizard) so the
  production hook path is exercised.
- Scaffolding turn injects an explicit preflight call after Flow 2 if
  auto-fire failed — keeps Flow 3/4 from cascade-failing on Flow 2's
  agentic auto-fire issue (#146). Flow 2's verdict still measures
  auto-fire honestly; the scaffolding is session-state recovery only.
- Flow 1 asserter walks ingest.mappings[].code_regions[].file_path
  (canonical modern path) AND accepts a follow-up bicameral.bind call
  (legacy path) — both are valid binding shapes per the skill.
- Flow 3 verdict is now ledger-based (not commit-happened): asserts the
  V1 lifecycle outcome (reflected/drifted decisions emerge during the
  run, validating ingest → bind → link_commit → resolve_compliance →
  verdict). The stream-json commit check is informational. Per
  bicameral-mcp#135, post-commit hook is sync-only — the chain
  completes via Flow 5's natural workflow.
- Flow 5 asserter conditional-ratifies: PASS when there are no
  proposals to ratify (matches issue #108 Flow 5 spec which says
  ratify is silent if queue is empty). Stops cascade-failing Flow 5
  on upstream Flow 2 issues.
- Ledger query uses raw LedgerClient (bypasses
  init_schema/migrate which crashes on the evidence_refs schema
  bug — to be filed separately).
- Before/after ledger snapshot around dev_session lets the assertion
  measure verdicts written instead of relying on a coincidental
  pending count.
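
The before/after snapshot idea can be sketched as a diff between two ledger snapshots. The snapshot shape (decision id mapped to status) and the helper name are assumptions; `reflected` and `drifted` are the verdict states named above.

```python
VERDICT_STATES = frozenset({"reflected", "drifted"})


def verdicts_written(
    before: dict[str, str],
    after: dict[str, str],
    verdict_states: frozenset[str] = VERDICT_STATES,
) -> dict[str, str]:
    """Decisions whose status entered a verdict state between the two snapshots."""
    return {
        decision_id: status
        for decision_id, status in after.items()
        if status in verdict_states and before.get(decision_id) != status
    }
```

Because only changes between the snapshots are counted, a decision that was already `drifted` before the dev session starts does not make the flow pass by accident.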

Per-flow prompts:
- flow-2: explicit "I know we said X but actually Y" framing makes the
  collision against Flow 1's drag-and-drop reorder decision unambiguous.
- flow-3: minimal "edit + commit on cherry-pick.ts" — no bicameral
  verbs, no status checks; just trip the post-commit hook.
- flow-4: drops "I want that locked in" tracking verbs (which were
  routing to ingest), adds correction markers (`wait`, `shouldn't`,
  `wrong`) that capture-corrections Step A pre-filter recognises, plus
  a "continue refactor" code-work request that should trigger preflight
  step 3.5 → in-session capture-corrections.
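
A marker pre-filter of the kind flow-4 relies on could look like the sketch below. It uses only the three markers named above; the actual Step A list and matching rules in capture-corrections are not shown in this PR, so the pattern and the helper name are assumptions.

```python
import re

# Only the markers named in the commit message; the real list may be longer.
CORRECTION_MARKERS = ("wait", "shouldn't", "wrong")

_MARKER_RE = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(m) for m in CORRECTION_MARKERS) + r")(?!\w)",
    re.IGNORECASE,
)


def has_correction_marker(text: str) -> bool:
    """Cheap pre-filter: does the message contain a whole-word correction marker?"""
    # Normalise the typographic apostrophe so both spellings of "shouldn't" match.
    return bool(_MARKER_RE.search(text.replace("\u2019", "'")))
```

Whole-word matching avoids false positives such as "waiter" or "awaiting", which matters when the filter decides whether a more expensive skill should run at all.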

DEV_CYCLE.md §0 — Workflow Feature Release Cycle:
- New section before §1 documenting the meta-process for shipping new
  agentic workflow features: friction → candidate workflow → test
  harness → functional solution → telemetry → optimized solution.
- Codifies the lesson from this iteration cycle and from #146/#147:
  put the harness in front of the implementation, not behind it. The
  harness should fail on day one — that's the point.

Iteration result with the patches: 3/5 PASS (Flow 1, 3, 5), 2/5 FAIL
(Flow 2, 4 — both documented at bicameral-mcp#146 and bicameral-mcp#147
as real auto-fire reliability gaps in headless `claude -p`). Both #146
and #147 were updated in-place with iteration findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jinhongkuan had a problem deploying to recording-approval with GitHub Actions (Failure), May 1, 2026 13:23
jinhongkuan merged commit 26497aa into dev May 1, 2026
6 of 8 checks passed
jinhongkuan added a commit that referenced this pull request May 1, 2026