fix(skill): preflight auto-fire on natural refactor prompts (replaces #151) by jinhongkuan · Pull Request #155 · BicameralAI/bicameral-mcp

jinhongkuan · 2026-05-02T07:48:33Z

Summary

Clean cherry-pick of the auto-fire fix from PR #151 onto a fresh dev base, without the bundled governance/Merkle-ledger commit (3f856af) that was producing rebase conflicts on docs/META_LEDGER.md, docs/SYSTEM_STATE.md, and .gitignore. The governance cleanup is conceptually independent and should land as its own PR where the qor-logic Merkle chain can be resolved deliberately.

Closes #146 (preflight does not auto-fire on natural refactor prompts).

What's in this PR

Seven commits, all scoped to the auto-fire mechanism:

fix(skill): resolve preflight auto-fire failure on natural refactor prompts (#146) — adds scripts/hooks/preflight_intent.py (verb-list classifier) + scripts/hooks/preflight_reminder.py (UserPromptSubmit hook entry point), wires .claude/settings.json, and adds a ### Hook reinforcement subsection to skills/bicameral-preflight/SKILL.md.
fix(setup): install preflight UserPromptSubmit hook for end users — adds the bicameral-mcp-preflight-reminder console script in pyproject.toml and wires it into setup_wizard.py so fresh installs get the hook.
style: ruff format scripts/hooks/preflight_intent.py
fix(e2e): materialize UserPromptSubmit hook into test target settings — e2e harness materializes the same hook config a real install would have.
fix(hook): emit hookSpecificOutput envelope so additionalContext reaches model — Claude Code 2.x silently drops the legacy top-level {additionalContext: ...} shape; the hook now emits {hookSpecificOutput: {hookEventName: \"UserPromptSubmit\", additionalContext: ...}}.
test(e2e): split Flow 2 into auto-fire (Flow 2) + correction-capture loop (Flow 2a) — narrows Flow 2 to the auto-fire scope (precedes write op), adds Flow 2a as advisory for the full correction-capture loop tracked in [P0] Preflight skill does not instruct agent to capture refinements when user prompt contradicts surfaced decisions #154, gates CI exit code on non-advisory failures only.
style: ruff format tests/e2e/run_e2e_flows.py

What was DROPPED (compared to #151)

3f856af chore(governance): v0 process cleanup — entire commit excluded. Re-open as its own PR.
e769eec Merge branch 'dev' into claude/peaceful-bell-12b5e8 — merge commit, redundant on a fresh-from-dev branch.
docs/META_LEDGER.md edits from f4de501 — Merkle-chain audit trail, conflicted with dev's parallel cleanup. Should land via the governance PR.
docs/SYSTEM_STATE.md edits from f4de501 and 13312d4 — same reason.
plan-preflight-autofire-hook.md — qor-logic planning artifact; should land via the governance PR.

What was MERGED carefully

skills/bicameral-preflight/SKILL.md — dev had added a ## Telemetry section in the same region where f4de501 added ### Hook reinforcement. Both kept; ordered as Hook reinforcement → Telemetry (continuation of trigger discussion before the instrumentation interlude before Steps).

Validation

ruff format --check . clean (210 files)
ruff check . clean
tests/test_preflight_hook.py: 5/5 PASS
E2E asserter dry-run against the most recent CI transcript (commit 92525fa, run 25246398064): Flow 2 PASS, Flow 2a FAIL (advisory → non-blocking), Flow 4 FAIL (advisory → non-blocking). CI exit code: 0.

Test plan

CI: ruff + mypy passes
CI: e2e assertions (auto) passes (advisory failures from Flow 2a / Flow 4 do not red-light CI per the new gate logic)
CI: MCP Regression Suite (ubuntu + windows) passes
Verify Flow 2 transcript shows bicameral_preflight preceding any Edit

Closes fix(skill): preflight does not auto-fire on natural refactor prompts in headless Claude Code sessions #146 (auto-fire)
Tracks [P0] Preflight skill does not instruct agent to capture refinements when user prompt contradicts surfaced decisions #154 (P0 — skill-layer gap: preflight surfaces decisions but doesn't instruct the agent to capture refinements when the user prompt contradicts a surfaced decision)
Replaces v0 process cleanup + preflight auto-fire hook (#146) #151 (closing in favor of this clean-base PR)

🤖 Generated with Claude Code

…rompts (#146) Closes #146 — Flow 2 in tests/e2e/run_e2e_flows.py fails because bicameral.preflight does not auto-fire in headless `claude -p` even when the user prompt explicitly contradicts a prior decision. The existing SKILL.md auto-fire description has plateaued; the agent's default tool-selection priority puts Bash/Glob ahead of preflight. Solution: deterministic UserPromptSubmit hook that detects code-implementation intent via shared verb list and injects an authoritative <system-reminder> elevating preflight above file-inspection tools. Architecture (Hickey razor): - Verb list lives once in scripts/hooks/preflight_intent.py as data (frozenset). Future UI configurability is a one-edit change. - should_fire_preflight(): pure function, 11 lines, depth 2, no network, no LLM, sub-millisecond regex scan. - preflight_reminder.py: 9-line UserPromptSubmit hook entry point; fail-permissive (exit 0 + empty response on errors); never blocks the user. - v0 verb-list duplication between SKILL.md description (frontmatter) and the Python module is documented honestly in the SKILL.md addendum per audit Advisory #1, not papered over with a false SSOT claim. Tests: 11 functionality tests (TDD-light invariant — every test invokes the unit and asserts on output, no presence-only patterns): - 6 classifier tests covering all 30 verbs, 3 skip patterns, indirect intent, data shape, the literal Flow 2 contradiction prompt - 5 hook subprocess tests covering match/no-match/malformed-stdin/ idempotent invocations + Flow 2 fixture Authoritative integration test: tests/e2e/run_e2e_flows.py::test_flow_2 on dev branch (preflight tool_use.id must precede first non-bicameral discovery tool in the stream-json transcript). QorLogic SDLC artifacts: plan-preflight-autofire-hook.md, META_LEDGER Entries #11-#14 (PLAN, GATE PASS, IMPLEMENT, SUBSTANTIATE seal). Merkle seal: 33007d2a72fe3db237935216e063327750896d595faa15001757761e43a8e83c Risk grade: L2 (blast radius: every user prompt; individual-action risk: small + bounded + reversible) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The preflight auto-fire fix in f4de501 added a UserPromptSubmit hook to the bicameral repo's own .claude/settings.json so the e2e flow passes when dogfooding bicameral on bicameral. But setup_wizard's _install_claude_hooks was not extended, so users running `bicameral-mcp setup` on their own repos got the old PostToolUse + SessionEnd hooks and no preflight reinforcement — leaving the bug the PR claims to close (#146) open in production. Changes: - pyproject.toml: add `bicameral-mcp-preflight-reminder` console script entrypoint (`scripts.hooks.preflight_reminder:main`) so the hook resolves on PATH from any pip-installed environment, mirroring the existing `bicameral-mcp` and `bicameral-mcp-classify` pattern. - setup_wizard.py: extend `_install_claude_hooks` with a third `UserPromptSubmit` block that writes the same idempotent merge pattern used for PostToolUse/Bash and SessionEnd. Stale entries matching `bicameral` or `preflight_reminder` in the command string are stripped before re-write. - docs/SYSTEM_STATE.md: document the two new modified files under the preflight-hook session block. Verification: - 11/11 preflight tests pass (tests/test_preflight_intent.py + tests/test_preflight_hook.py). - Smoke test: `_install_claude_hooks` on a fresh tempdir writes all three hook events and the resulting settings.json is byte-stable across repeated invocations. Note: the bicameral repo's own .claude/settings.json continues to invoke `python3 scripts/hooks/preflight_reminder.py` (the source file directly) so devs working on the repo without a `pip install -e .` still get the hook firing — the divergence between dogfood and user install paths is intentional.

Pre-existing format violation in the f4de501 commit caught by CI. Verb frozenset reformatted to one-element-per-line per ruff defaults. No semantic change; 11/11 preflight tests still pass.

The e2e harness writes a project-style settings.json to the test target (cwd=/tmp/desktop-clone) so Claude headless picks up the bicameral hooks. Pre-fix: only PostToolUse/Bash and SessionEnd were materialized — UserPromptSubmit (added in f4de501 + propagated to setup_wizard in 13312d4) was missing. Result: Flow 2 (preflight auto-fire on natural refactor request) and Flow 4 (in-session capture-corrections via preflight step 3.5) both fail with `expected preflight (auto-fired); saw: []` because the agent's default tool priority puts Bash/Glob ahead of preflight and nothing reorders it. Fix: import `_BICAMERAL_PREFLIGHT_REMINDER_COMMAND` alongside the other two hook constants and add a UserPromptSubmit entry to the materialized settings dict. The console-script command resolves on PATH from the workflow's `pip install -e ".[test]"` step. Single source of truth preserved — both real users (via setup_wizard) and the harness pull from the same constants.

…hes model Claude Code 2.x silently drops the legacy top-level {"additionalContext": ...} shape — the hook process runs and exits 0, but the system-reminder never reaches the LLM. Wrap the payload in {"hookSpecificOutput": {"hookEventName": "UserPromptSubmit", "additionalContext": ...}} per the current CLI contract. Tests previously asserted against the broken shape (testing the hook against itself rather than the CLI it must integrate with), which is why this slipped through. They now assert the envelope shape, so a regression to the legacy shape would fail loudly. Verified live with `claude -p` + a real hook: agent now reads and acknowledges the preflight system-reminder, where before it ignored it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…loop (Flow 2a) The previous Flow 2 assertion required preflight + agent_session ingest + resolve_collision in a single test. After the auto-fire fix (a few commits back) preflight now genuinely fires, but the agent doesn't walk the preflight skill's Step 3.5 to invoke capture-corrections — so the refinement isn't captured and resolve_collision never runs. Two independent contracts were tangled into one verdict. Split: - Flow 2 (mcp_layer) — auto-fire scope only: preflight fires on reorder.ts, precedes the first write op (Edit / Write / git commit). Reads are allowed in parallel (the agent legitimately fetches in parallel with preflight to keep latency reasonable). This is exactly what #146 promised. - Flow 2a (agentic_layer, advisory) — full correction-capture loop: same claude session (reuses Flow 2's transcript via new `reuses_flow` field on FlowSpec, so no duplicate API call) but a different asserter, checking for agent_session ingest + resolve_collision. Currently FAILs because no skill instructs the agent to capture refinements when the user's prompt contradicts a surfaced decision. Tracked as P0 in #154. - Flow 4 — same root cause as Flow 2a (skill-walking gap on Step 3.5). Tagged with advisory pointing at #154. Was already FAILing. CI gate change: blocking_failures = FAIL/ERROR with no advisory text. Flows with an `advisory` field that fail surface loudly in the report (banner + ADVISORIES section) but do not red-light CI. This lets us keep running the gap assertions on every PR (so a silent close becomes visible) without making every PR also pay for the open gap. Verified locally by replaying the asserter against the most recent CI transcript (commit 92525fa, run 25246398064): Flow 2 PASS, Flow 2a FAIL (advisory), Flow 4 FAIL (advisory). Lint + py_compile clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Whitespace-only — formatter collapses three fits-on-one-line list comprehensions and two short return tuples that were unnecessarily wrapped. No behavioural change. Local check: pip install -e ".[test]" inside venv → both `ruff format --check .` (210 files already formatted) and `ruff check .` (all checks passed) clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-02T07:48:40Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2d799e69-079c-461b-8910-889982e335f4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/preflight-auto-fire-clean

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Cherry-picked from 1f54f1a, scope-narrowed to the surgical contribution. The original commit was authored against an older base where the e2e harness scaffold did not yet exist; this rebased version adds only the new logic on top of dev's existing harness. What this commit adds: - `tests/e2e/_ledger_helpers.py` — pure helper `count_agent_session_decisions(snapshot)`, extracted so unit tests can import without triggering the harness's top-level env-var / CLI guards. - `tests/e2e/run_e2e_flows.py`: - `_count_agent_session_decisions(snapshot)` — thin wrapper around the helper that hides the import inside the harness. - `_validate_flow4_via_ledger()` — path-X-(b) post-hoc ledger query. Snapshots the ledger after the harness completes and counts decisions with `source_type='agent_session'`. Asserter FAIL + ledger has agent_session → UPGRADE to PASS with explicit annotation. Ledger error → INCONCLUSIVE (verdict unchanged). All five behavior-matrix cases documented in the docstring. - Invocation site: called once after `_validate_flow3_via_ledger` in `main()`, only when `dev_session` ran. - `tests/test_flow4_ledger_validation.py` — five unit tests against the helper covering: zero rows, error snapshot (None), agent_session presence, mixed source types, and empty decisions list. Why this is decoupled from agent caprice: in-stream Flow 4 evidence requires the agent to invoke `bicameral.preflight` and walk Step 3.5 to trigger capture-corrections. Path-X-(b) validates the *product outcome* (decisions written with the canonical source_type) rather than the *mechanism* (which tool the agent chose). This means a SessionEnd subprocess effect that lands in the ledger after the parent stream-json closes still upgrades the verdict, even when the in-stream signal is absent. Closes research-brief recommendation P0 #2. Note: this commit replaces the original 1f54f1a SHA on the branch via rebase. Governance/META_LEDGER edits and the planning artifacts that were bundled with the original have been dropped here and will land via a separate governance PR. The auto-fire UserPromptSubmit hook (#146 fix) that was also bundled is shipping via #155. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cherry-picked from 1f54f1a, scope-narrowed to the surgical contribution. The original commit was authored against an older base where the e2e harness scaffold did not yet exist; this rebased version adds only the new logic on top of dev's existing harness. What this commit adds: - `tests/e2e/_ledger_helpers.py` — pure helper `count_agent_session_decisions(snapshot)`, extracted so unit tests can import without triggering the harness's top-level env-var / CLI guards. - `tests/e2e/run_e2e_flows.py`: - `_count_agent_session_decisions(snapshot)` — thin wrapper around the helper that hides the import inside the harness. - `_validate_flow4_via_ledger()` — path-X-(b) post-hoc ledger query. Snapshots the ledger after the harness completes and counts decisions with `source_type='agent_session'`. Asserter FAIL + ledger has agent_session → UPGRADE to PASS with explicit annotation. Ledger error → INCONCLUSIVE (verdict unchanged). All five behavior-matrix cases documented in the docstring. - Invocation site: called once after `_validate_flow3_via_ledger` in `main()`, only when `dev_session` ran. - `tests/test_flow4_ledger_validation.py` — five unit tests against the helper covering: zero rows, error snapshot (None), agent_session presence, mixed source types, and empty decisions list. Why this is decoupled from agent caprice: in-stream Flow 4 evidence requires the agent to invoke `bicameral.preflight` and walk Step 3.5 to trigger capture-corrections. Path-X-(b) validates the *product outcome* (decisions written with the canonical source_type) rather than the *mechanism* (which tool the agent chose). This means a SessionEnd subprocess effect that lands in the ledger after the parent stream-json closes still upgrades the verdict, even when the in-stream signal is absent. Closes research-brief recommendation P0 #2. Note: this commit replaces the original 1f54f1a SHA on the branch via rebase. Governance/META_LEDGER edits and the planning artifacts that were bundled with the original have been dropped here and will land via a separate governance PR. The auto-fire UserPromptSubmit hook (#146 fix) that was also bundled is shipping via #155. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> (cherry picked from commit 8af60f3)

The UserPromptSubmit hook installed by BicameralAI#146/BicameralAI#155 told the agent to call bicameral.preflight "Before invoking any file-inspection tool (Read, Grep, Bash, Glob)". That short-circuited the caller-LLM discovery the rest of the contract depends on: - bicameral.preflight uses `file_paths` for region-anchored binds_to lookup (the precision channel). Empty file_paths drops to fuzzy text-similarity over decision descriptions. - The user often names a *feature* ("the reorder feature") rather than a *file* (`reorder.ts`). The caller LLM has to do that mapping — it's the semantic half of "selection before generation." - But to do the mapping it needs Read / Grep / Glob, which the old reminder forbade. Symptom on PR BicameralAI#168 / BicameralAI#165 e2e: agent fired preflight with empty file_paths because it had no chance to inspect the codebase first. Server returned weak / no surfaced decisions. Flow 2 asserter failed (file_paths=[]); Flow 2a cascaded (no surfaced decisions to capture from). Reconcile with BicameralAI#146 by gating on the right line: - Read / Grep / Glob FIRST (discovery — caller LLM resolves the user's request to concrete file paths). - bicameral.preflight(topic, file_paths) — fed by step 1. - Write ops (Edit / Write / NotebookEdit / mutating Bash) — preflight must precede the first one. This is the contract assert_flow_2 has *already* been gating; only the hook reminder was misaligned. Files: - scripts/hooks/preflight_reminder.py — REMINDER_TEXT rewrite + docstring documenting the reconciliation with BicameralAI#146 - skills/bicameral-preflight/SKILL.md — Step 2 strengthened: "Discover first, then preflight"; file_paths is the precision channel, omit only for genuinely abstract queries - tests/test_preflight_hook.py — new test_reminder_gates_writes_not_discovery asserts the new posture (positive: "Read-only discovery FIRST", "BEFORE any write op"; negative: must NOT contain the old "before any file-inspection tool" phrasing) The Flow 2 asserter is unchanged — it has always gated writes, not reads (see lines 763-766: "Read is deliberately allowed before/in- parallel-with preflight"). This PR aligns the hook reminder with what the asserter already requires.

Knapp-Kevin and others added 7 commits May 2, 2026 00:46

style: ruff format scripts/hooks/preflight_intent.py

0bd8b6a

Pre-existing format violation in the f4de501 commit caught by CI. Verb frozenset reformatted to one-element-per-line per ruff defaults. No semantic change; 11/11 preflight tests still pass.

jinhongkuan temporarily deployed to ci-test May 2, 2026 07:48 — with GitHub Actions Inactive

jinhongkuan temporarily deployed to production May 2, 2026 07:48 — with GitHub Actions Inactive

jinhongkuan had a problem deploying to recording-approval May 2, 2026 07:48 — with GitHub Actions Failure

This was referenced May 2, 2026

v0 process cleanup + preflight auto-fire hook (#146) #151

Closed

fix(skill): preflight does not auto-fire on natural refactor prompts in headless Claude Code sessions #146

Closed

jinhongkuan requested a review from Knapp-Kevin May 2, 2026 07:51

jinhongkuan merged commit 87b996b into dev May 2, 2026
9 of 10 checks passed

jinhongkuan mentioned this pull request May 2, 2026

Flow 4 path-X-(b) ledger validation + SessionEnd hook drift fix (#147) #152

Merged

5 tasks

This was referenced May 2, 2026

[P1] SessionEnd capture-corrections hook is silently broken — design pivot to next-session surfacing #156

Closed

[P0] Preflight skill does not instruct agent to capture refinements when user prompt contradicts surfaced decisions #154

Closed

jinhongkuan mentioned this pull request May 2, 2026

refactor(e2e): single source of truth for harness + recording setup #158

Merged

9 tasks

jinhongkuan mentioned this pull request May 4, 2026

fix(skill): preflight reminder allows discovery first, gates only writes #172

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(skill): preflight auto-fire on natural refactor prompts (replaces #151)#155

fix(skill): preflight auto-fire on natural refactor prompts (replaces #151)#155
jinhongkuan merged 7 commits into
devfrom
fix/preflight-auto-fire-clean

jinhongkuan commented May 2, 2026

Uh oh!

coderabbitai Bot commented May 2, 2026

Review skipped

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jinhongkuan commented May 2, 2026

Summary

What's in this PR

What was DROPPED (compared to #151)

What was MERGED carefully

Validation

Test plan

Related

Uh oh!

coderabbitai Bot commented May 2, 2026

Review skipped

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants