Skip to content

fix(skill): #402 — preflight auto-fires on /qor-plan + sibling slash-commands#524

Closed
Knapp-Kevin wants to merge 2 commits into
devfrom
fix/402-preflight-slash-command-classifier
Closed

fix(skill): #402 — preflight auto-fires on /qor-plan + sibling slash-commands#524
Knapp-Kevin wants to merge 2 commits into
devfrom
fix/402-preflight-slash-command-classifier

Conversation

@Knapp-Kevin

Copy link
Copy Markdown
Collaborator

Summary

Closes #402.

bicameral-preflight's Tier-1 (UserPromptSubmit hook) classifier was silently skipping slash-command prompts whose implementation-intent verb was encoded in the command name rather than the prompt body. The literal failing invocation from the issue — /qor-plan https://github.com/BicameralAI/bicameral-daemon/issues/1 — now fires the gate.

Root cause (Hypothesis 1 confirmed)

scripts/hooks/preflight_intent.py:IMPLEMENTATION_VERBS did not contain "plan", and the regex had no slash-command awareness. should_fire_preflight() returned False for any /qor-plan ... prompt; the hook injected no system-reminder; the agent ran planning blind to the ledger.

Sibling commands worked by coincidence (/qor-implementimplement verb match, /qor-refactorrefactor verb match). /qor-plan, /qor-debug, /qor-auto-dev-1 did not.

Hypothesis 3 (skill-chain ordering) is wrong: UserPromptSubmit fires on raw user text before slash-command resolution. The hook was being called — the regex inside it was the gate that returned False.

What changed

  • New IMPL_INTENT_SLASH_COMMANDS frozenset in preflight_intent.py: qor-plan, qor-implement, qor-refactor, qor-debug, qor-remediate, qor-organize, qor-auto-dev-1, qor-auto-dev. Slash-commands in this set short-circuit to fire=True regardless of argument shape (URL, text, or empty).
  • New classify_prompt() API returning a (fire, prompt_surface_form, slash_command) NamedTuple. should_fire_preflight() preserved as a backward-compat wrapper — existing tests + callers untouched.
  • prompt_surface_form field added per fix(skill): bicameral-preflight does not auto-fire on /qor-plan with a GitHub issue URL #402 acceptance: slash_command_with_url / slash_command_with_text / slash_command_bare / free_text / empty. Hook appends a preflight.trigger_evaluated row to ~/.bicameral/preflight_trigger_evaluated.jsonl carrying this field. BICAMERAL_TELEMETRY=0 suppresses the log without affecting the gate.
  • skills/bicameral-preflight/SKILL.md description frontmatter + body updated so the Tier-2 caller-LLM gate sees the same slash-command surface as the deterministic Tier-1 hook (per the existing "must be edited together" contract).

Regression coverage

Per #402 acceptance:

  • tests/test_preflight_intent.py — 13 new tests covering the literal /qor-plan <issue-url> repro, every command in IMPL_INTENT_SLASH_COMMANDS, read-only commands that must NOT fire (/qor-status, /qor-help, /qor-audit, /qor-validate), the unknown-slash-command fallthrough to verb-check, lowercasing, and parity between classify_prompt() and should_fire_preflight().
  • tests/test_preflight_hook.py — 3 sociable subprocess tests (real hook + real classifier + real JSONL write, no mocks, per CLAUDE.md sociable-testing rule). The test_hook_fires_on_qor_plan_with_issue_url test runs the exact failing prompt from the issue and would have failed on dev HEAD before this PR.
  • tests/e2e/prompts/flow-6-slash-command-preflight.md — prompt fixture for the failing case, ready to wire into run_e2e_flows.py once the agentic-layer slot is available.

Local gates run before push:

  • pytest tests/test_preflight_intent.py tests/test_preflight_hook.py → 28 passed (15 new)
  • ruff check + ruff format --check (touched files) → clean

Acceptance check

  • Reproduction harness — fails on dev HEAD, passes after (the sociable hook test uses the literal failing prompt as a real subprocess)
  • Root cause documented in fix PR (this section)
  • Regression coverage for /qor-plan <plain English> and /qor-implement <prompt>
  • Telemetry carries prompt_surface_formpartial: local JSONL emission wired; PostHog uplink deferred (see below)

Backward compatibility

  • Data migration: N/A — no ledger schema or business-logic change.
  • Wire format: N/A — no MCP tool changes; classifier is hook-side.
  • Skill contract: skills/bicameral-preflight/SKILL.md description + body updated in the same commit (per CLAUDE.md "Tool Changes Require Skill Changes").

Deferred (NOT in this PR)

  1. PostHog uplink for preflight.trigger_evaluated. The hook is a fast subprocess; synchronous network I/O blocks the user prompt up to 3 s and a daemon-thread fire-and-forget is unreliable on fast exit. The local JSONL captures the exact field set the dashboard would query — a follow-up should drain it from the long-lived MCP server via the existing relay path in telemetry.py. Recommend filing as a new issue tagged observability, linked to fix(skill): bicameral-preflight does not auto-fire on /qor-plan with a GitHub issue URL #402.
  2. Wiring flow-6-slash-command-preflight.md into run_e2e_flows.py as an agentic-layer FlowSpec (needs DESKTOP_REPO_PATH + claude CLI). The sociable hook test already covers the deterministic gate.

Out of scope (per issue body)

…commands

Root cause: the Tier-1 hook classifier at scripts/hooks/preflight_intent.py
looked only at free-text English verbs. IMPLEMENTATION_VERBS did not
contain "plan", and the regex had no slash-command awareness. So a prompt
like `/qor-plan https://github.com/.../issues/1` returned False from
should_fire_preflight() and the UserPromptSubmit hook injected nothing —
preflight never got its turn, the agent went straight to planning, and
the ledger's prior decisions stayed invisible. Sibling commands worked
by accident (`/qor-implement` matched the `implement` verb, etc.);
`/qor-plan`, `/qor-debug`, `/qor-auto-dev-1` did not.

Confirms Hypothesis 1 from the issue body. Hypothesis 3 (skill-chain
ordering) is wrong: UserPromptSubmit fires on the raw user text BEFORE
slash-command resolution, so the hook was getting called — the regex
inside it was the gate that failed.

Changes
- preflight_intent.py: new IMPL_INTENT_SLASH_COMMANDS frozenset
  (qor-plan, qor-implement, qor-refactor, qor-debug, qor-remediate,
  qor-organize, qor-auto-dev-1, qor-auto-dev); new classify_prompt()
  returning (fire, prompt_surface_form, slash_command) NamedTuple;
  should_fire_preflight() preserved as a compat wrapper.
- preflight_reminder.py: hook uses classify_prompt() and appends a
  `preflight.trigger_evaluated` row to ~/.bicameral/preflight_trigger_evaluated.jsonl
  carrying prompt_surface_form (slash_command_with_url / _with_text /
  _bare / free_text / empty). BICAMERAL_TELEMETRY=0 suppresses the log
  without affecting the gate decision.
- SKILL.md: description frontmatter + body updated so the Tier-2
  caller-LLM gate sees the same slash-command surface as the
  deterministic Tier-1 hook (per existing "must be edited together"
  contract).
- tests/test_preflight_intent.py: 13 new tests including the literal
  #402 repro, all IMPL_INTENT slash-commands, read-only commands, the
  unknown-slash-command fallthrough, and backward-compat parity with
  should_fire_preflight().
- tests/test_preflight_hook.py: 3 sociable subprocess tests — real
  hook + real classifier + real JSONL write, no mocks (per CLAUDE.md
  sociable-testing rule).
- tests/e2e/prompts/flow-6-slash-command-preflight.md: prompt fixture
  for the failing case, future-compatible with run_e2e_flows.py.

Deferred
- PostHog uplink for the trigger_evaluated stream. The hook is a fast
  subprocess; synchronous network I/O blocks the user prompt and
  daemon-thread fire-and-forget is unreliable on fast exit. The local
  JSONL captures the exact field set the dashboard would query — a
  follow-up should drain it from the long-lived MCP server via the
  existing relay path in telemetry.py.
- Wiring flow-6 into run_e2e_flows.py as a FlowSpec entry (needs
  DESKTOP_REPO_PATH + claude CLI). The sociable hook test already
  exercises the deterministic gate; agentic-layer e2e is a follow-up.

Closes #402

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Knapp-Kevin Knapp-Kevin added bug Something isn't working flow:feature Standard feature/fix PR targeting BicameralAI/dev (the default flow) P1 High: ship this milestone; user-impacting bug or committed feature preflight fix Bug fix or correctness repair skill Skill instructions or workflow guidance surface labels May 26, 2026
@Knapp-Kevin Knapp-Kevin requested a deployment to recording-approval May 26, 2026 21:22 — with GitHub Actions Waiting
@coderabbitai

coderabbitai Bot commented May 26, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ad01731a-8d21-4385-b16f-28786da23741

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/402-preflight-slash-command-classifier

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Knapp-Kevin

Copy link
Copy Markdown
Collaborator Author

e2e CI failure — diagnosed as substrate, not regression

The v0 user flow e2e workflow's e2e assertions (auto) job failed on this PR, but the failure mode is infra-level — not anything in this PR's change set.

Evidence

Signal Observation
Every flow's claude subprocess claude CLI exit=1, stderr (last 500 chars): (empty)
Flow 5 (pure MCP-layer bicameral.history — never goes near preflight) Fails identically: exit=1, 0 tool calls
Per-flow runtime ~4 s each vs the 900 s budget — fast-fail at the subprocess boundary, before any MCP traffic
Last passing auto-assertion on this workflow May 25 02:47Z (feat/357-sociable-gate-surrealql @ c1041f3, run 26380551118)
Today's other failed run Dependabot sigstore PR (run 26447359314) failed the SAME workflow today for a different reason: CLAUDE_CODE_OAUTH_TOKEN: EMPTY
Claude Code CLI version in the runner 2.1.150 (Claude Code) — fresh install per run

Why this can't be from this PR

The entire change set is a Tier-1 deterministic hook (scripts/hooks/preflight_intent.py) that runs inside a Claude Code session, after claude -p successfully starts. Here claude -p is dying before opening the session — so my code is unreachable in this failure mode. Flow 5 doubles down: it never touches the preflight hook (it just calls bicameral.history) and fails the same way.

Likely root causes (CI-side)

  1. @anthropic-ai/claude-code 2.1.150 regression in non-interactive (-p) mode that fails silently
  2. CLAUDE_CODE_OAUTH_TOKEN rotated/expired between May 25 and now (token present in env, length 108, but claude may not surface auth errors to stderr)
  3. One of the 8 commits that landed on dev between c1041f3 and 24f773a — most-suspect is aafdd42 (dogfood hooks source-relative), but that wouldn't cause silent claude exits

Status

  • All other gates green (ruff+mypy, MCP regression ubuntu+windows, Perf gate, Python floor, Schema persistence, TruffleHog, CodeRabbit, Preflight Failure-Mode Eval).
  • This PR's deterministic hook test suite (sociable subprocess tests, real classifier, real JSONL write) is the local regression coverage for fix(skill): bicameral-preflight does not auto-fire on /qor-plan with a GitHub issue URL #402 and passes 28/28.
  • Leaving as draft while substrate is restored. No code changes proposed here.

@jinhongkuan

Copy link
Copy Markdown
Contributor

Closing during PR hygiene because this branch is stale/conflicting even though the underlying work remains important. Preflight auto-fire for /qor-plan and sibling slash commands is high-value if those commands are part of the active workflow: it prevents plans from being written blind to the ledger. Please re-cut from current dev against issue #402, keep deterministic hook/skill coverage, and defer optional analytics/e2e wiring unless needed for the core behavior.

@jinhongkuan jinhongkuan closed this Jun 7, 2026
jinhongkuan pushed a commit that referenced this pull request Jun 9, 2026
…-intent slash-commands

Revives the proven fix from PR #524 (closed only because the v0 e2e CI failed on
stale-auth substrate — not the diff; that e2e workflow is now shelved to
dispatch-only via #556). Rebased onto current main (all 6 files applied cleanly —
nobody touched these since #524 diverged) and the bug re-reproduced on main before
fixing: `should_fire_preflight("/qor-plan https://…/issues/1")` returned False.

Root cause (Hypothesis 1, confirmed): the UserPromptSubmit classifier in
scripts/hooks/preflight_intent.py matched only free-text IMPLEMENTATION_VERBS and
had no slash-command awareness. `plan` is not a verb, so `/qor-plan <url>` never
fired preflight — planning ran blind to the ledger. Sibling commands worked by
coincidence (`/qor-implement`→implement, `/qor-refactor`→refactor). Hypothesis 3
(skill-chain ordering) refuted: UserPromptSubmit fires on raw text before
slash-command resolution; the hook ran — its regex was the gate that returned False.

Fix:
- New IMPL_INTENT_SLASH_COMMANDS frozenset (qor-plan, qor-implement, qor-refactor,
  qor-debug, qor-remediate, qor-organize, qor-auto-dev-1, qor-auto-dev) — these
  short-circuit to fire regardless of argument (URL / text / empty).
- New layered classify_prompt() → ClassifyResult(fire, prompt_surface_form,
  slash_command); preflight_reminder.py records prompt_surface_form to telemetry
  so a future trigger-surface regression is observable. should_fire_preflight()
  preserved (delegates) for backward compat.
- SKILL.md description (Tier-2 caller-LLM gate) updated in lockstep with the
  Tier-1 hook set, and correctly keeps read-only commands (/qor-status, /qor-help,
  /qor-audit, /qor-validate) on the SKIP list.

Regression guard: test_preflight_intent.py + test_preflight_hook.py now run in
test-mcp-regression.yml (they were dormant — in no CI workflow). flow-6 e2e prompt
added for reference (the e2e harness is dispatch-only per #556; the unit tests are
the durable gate).

Closes #402. Supersedes the closed #524.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working fix Bug fix or correctness repair flow:feature Standard feature/fix PR targeting BicameralAI/dev (the default flow) P1 High: ship this milestone; user-impacting bug or committed feature preflight skill Skill instructions or workflow guidance surface

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants