Priority C v1 + v1.1: Notion ingest, real heuristic+LLM extractor, cache contract evolution#159
Closed
Knapp-Kevin wants to merge 107 commits into
Closed
Priority C v1 + v1.1: Notion ingest, real heuristic+LLM extractor, cache contract evolution#159Knapp-Kevin wants to merge 107 commits into
Knapp-Kevin wants to merge 107 commits into
Conversation
The new dev integration workflow ("everything pushes and merges to dev
first, then PRs from dev to main upon Jin's approval") needs CI to run
on PRs targeting dev — not just main. Without this, retargeted PRs
(#73, #79–#84) never get a green badge and have to be merged on local
verification only.
Updates 3 workflows: MCP Regression Tests, Preflight Eval, Schema
Persistence. All other path filters retained.
Direct push to dev (not via PR) — no CI exists yet to run on this
file's own PR (chicken-and-egg). Subsequent PRs to dev will inherit
the new triggers.
) The `decision_level` field on `decision` controls the L1 exemption guard in `handlers/bind.py` — but it was previously documented only inline in spec-governance-feedback.md and a terse 2-line schema comment. New contributors couldn't find the contract. Changes: - New `docs/decision-level.md` — single canonical reference for the field. Documents all four values (L1/L2/L3/NULL), their codegenome write semantics, the tolerant-NULL policy rationale, where the value comes from, and the read APIs. - `ledger/schema.py` — expanded comment block above the DEFINE FIELD, pointing to the new doc and giving a quick-reference value table. - `docs/spec-governance-feedback.md` §6 — updated follow-up table to reflect that #75/76/77/78 have all been filed and #75 is addressed by this commit. No code change. ASSERT constraint unchanged. All 5 L1-exemption tests still pass.
…vcrt) (#80) Issue #74: ``events/writer.py:16`` had a top-level ``import fcntl``, which is Unix-only. On Windows the import failed at module load, which collapsed any test session that imported (directly or transitively) ``events.writer`` — including all 17 ephemeral authoritative tests and a long tail of ingest-using tests. Fix: - Replace the top-level ``import fcntl`` with a platform-conditional block that imports either ``fcntl`` (POSIX) or ``msvcrt`` (Windows) and defines ``_lock_exclusive`` / ``_unlock`` helpers with matching semantics. - POSIX path uses ``fcntl.flock(LOCK_EX/LOCK_UN)`` — unchanged behaviour. - Windows path locks byte 0 with ``msvcrt.locking(LK_LOCK/LK_UNLCK, 1)`` so concurrent writers serialize on a shared mutex byte. The actual append happens via ``open(..., "ab")`` which on Windows seeks to EOF per write — the byte-0 lock is the serialization primitive, not a region lock. - Both branches use ``# pragma: no cover`` for the inactive platform. Tests: - ``tests/test_event_writer.py`` — new, 7 tests: - module imports cleanly on the current platform (regression for the original ImportError) - lock helpers exist and are callable - ``write()`` produces a parseable JSONL line - consecutive writes release the lock (would deadlock if leaked) - locking byte 0 on a previously-empty file works (Windows msvcrt edge case) - platform-specific dispatch checks (``test_windows_uses_msvcrt`` / ``test_posix_uses_fcntl``, mutually skipped) Verified on Windows: 6/6 active tests pass. Ephemeral authoritative suite went from 0/17 collectable to 15/17 passing (the remaining 2 are pre-existing V2 promotion gaps unrelated to fcntl). No POSIX behaviour change.
ledger/client.py adds normalize_surrealkv_url() called from LedgerClient.__init__. Replaces backslashes with forward slashes inside surrealkv://, surrealkv+versioned://, and file:// URLs so urllib.parse and the SurrealKV Rust backend both accept Windows tmp_path constructions. New tests/test_surrealkv_url_normalization.py (15 tests) + 5 previously-broken test_schema_persistence.py tests now passing. Closes #68.
…267 (#84) subprocess wrappers (resolve_ref, _git_stdout) now validate cwd is an existing directory before invoking subprocess.run; NotADirectoryError added to except tuples across ledger/status.py, ledger/adapter.py, code_locator_runtime.py. handlers/ingest.py injects ctx.repo_path into payload so adapter doesn't fall back to empty cwd. New tests/test_subprocess_cwd_safety.py (11 tests) including a static check enforcing the NotADirectoryError invariant. Cleared the WinError 267 cluster on Windows: alpha_flow 0/7→5/7, reset 0/4→4/4. Closes #67.
…_compliance (M3) (#91) * feat(#61): Phase 4 Phase 1 — schema v13 + contracts (CHANGEFEED, semantic_status, evidence_refs, pre_classification, auto_resolved_count) QOR-process Phase 4 implementation, layer 1 of 5. Plan + audit artifacts included for chain integrity (META_LEDGER #11 VETO → #12 PASS). v12 → v13 migration. Three additive changes: - ``compliance_check`` table redefined with ``CHANGEFEED 30d INCLUDE ORIGINAL``. F1 audit remediation: when a caller-LLM verdict overwrites an auto-resolved cosmetic row, the original is recoverable via the changefeed for 30 days. - ``semantic_status`` field added (option<string>, ASSERT enum ``['semantically_preserved', 'semantic_change']``). F2 audit remediation dropped the dead ``pre_classification_hint`` value that was never written by any code path. - ``evidence_refs`` field added (array<string>, default ``[]``). Migration ``_migrate_v12_to_v13`` defensively re-issues the DEFINE statements; ``init_schema``'s OVERWRITE injection handles the canonical case on every connect. - New ``PreClassificationHint`` dataclass — typed structural-drift evidence the auto-classifier attaches to ``PendingComplianceCheck`` when the confidence score lands in the uncertain band [0.30, 0.80). - ``PendingComplianceCheck.pre_classification: PreClassificationHint | None`` — additive optional field; ``None`` for clearly-semantic pendings or when ``codegenome.enhance_drift`` is disabled. - ``ComplianceVerdict.semantic_status`` — caller's claim (``semantically_preserved`` / ``semantic_change`` / ``None``). - ``ComplianceVerdict.evidence_refs`` — free-form audit trail. - ``ResolveComplianceAccepted.semantic_status`` — echoes the caller's claim through the response. - ``LinkCommitResponse.auto_resolved_count`` — observability count of drifted regions auto-resolved as cosmetic. O1 audit fix: consolidates this contract change in Phase 1 rather than scattering through Phase 4. ``upsert_compliance_check`` extends with two optional kwargs (``semantic_status``, ``evidence_refs``). Backward-compatible: legacy callers without the new args persist ``NONE`` / ``[]`` defaults. 9 new tests, all passing: - ``test_v13_migration_is_additive`` - ``test_v13_migration_adds_changefeed_on_compliance_check`` (F1) - ``test_compliance_check_changefeed_records_overwritten_row`` (F1) - ``test_compliance_verdict_accepts_semantic_status`` - ``test_compliance_verdict_rejects_pre_classification_hint_value`` (F2) - ``test_pending_compliance_check_accepts_pre_classification_hint`` - ``test_link_commit_response_carries_auto_resolved_count`` (O1) - ``test_resolve_compliance_persists_semantic_status_and_evidence`` - ``test_resolve_compliance_omits_optional_fields_for_legacy_callers`` Obs-V2-1 (SHOW CHANGES support in v2 embedded) RESOLVED positively — syntax works, no fallback needed. F1 regression tests pass without xfail. - 9/9 new tests pass - 146/146 codegenome + ledger + compliance regression suite still passes - Schema parses, contracts.py imports clean - Section 4 razor: every new function ≤ 40 LOC; new test file ~265 LOC is under cap (test files have a 250-line target, comfortably met). - [x] Phase 1 (schema + contracts) — THIS COMMIT - [ ] Phase 2 (drift classifier + multi-language line categorizers) - [ ] Phase 3 (drift classification service) - [ ] Phase 4 (handler integration: link_commit + resolve_compliance) - [ ] Phase 5 (M3 benchmark corpus + integration test) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(#61): refresh Phase 4 plan to v3 (post-merge state) Updates plan-codegenome-phase-4.md to reflect: - PR #71 (Phase 1+2) merged to upstream main - PR #73 (Phase 3) merged to dev with all 17 review fixes - dev branch live; CI workflows trigger on PRs to dev - Phase 4 branch rebased onto dev (no more 3-deep stack) - Phase 1 of Phase 4 sealed at commit a01103e (now 2afd52d post-rebase) - Obs-V2-1 resolved positively (SHOW CHANGES works in v2 embedded) - Implementation queue table for remaining Phases 2-5 Design decisions from v2 audit PASS unchanged. * feat(#61): Phase 4 Phase 2 — drift classifier + multi-language line categorizers + call_site_extractor QOR-process Phase 4 implementation, layer 2 of 5. Plan v3 PASS at META_LEDGER #13, chain hash 21ac210f. ## Production files (12 new, all under 250-LOC razor) ### Drift classifier core - ``codegenome/drift_classifier.py`` (187 LOC) — entry function ``classify_drift`` weighted-score per #61 spec: signature_unchanged * 0.30 + neighbors_jaccard * 0.25 + diff_lines_cosmetic * 0.30 + no_new_calls * 0.15 Verdict: >=0.80 cosmetic, <=0.30 semantic, otherwise uncertain. Per-signal helpers: ``_signal_signature``, ``_signal_neighbors`` (with 0.95 jaccard threshold), ``_signal_diff_lines``, ``_signal_no_new_calls``. ### Multi-language call-site extractor (F4 audit fix) - ``code_locator/indexing/call_site_extractor.py`` (121 LOC) — sibling of ``symbol_extractor.py``. Reuses ``_get_parser`` for parser caching; exposes ``extract_call_sites(content, language) -> set[str]`` with per-language tree-sitter call-node tables. Last-identifier extraction for member-access expressions (``obj.method()`` → ``method``). ### Diff categorizer (split per O3) - ``codegenome/diff_categorizer.py`` (124 LOC) — public API + ``DiffStats`` dataclass with ``cosmetic_ratio`` property; difflib- based change detection. - ``codegenome/_diff_dispatch.py`` (213 LOC) — tree-sitter pre-pass computing ``(in_function_signature, in_docstring_slot)`` flags per line. Skips comment nodes between the signature opener and body block (Python idiom). ### Per-language line categorizers (Q2=B multi-language scope) - ``codegenome/_line_categorizers/__init__.py`` (63 LOC) — registry + ``categorize`` dispatcher. - ``python.py`` (62 LOC), ``javascript.py`` (57 LOC), ``typescript.py`` (37 LOC, extends javascript), ``go.py`` (62 LOC), ``rust.py`` (63 LOC, distinguishes ``///`` doc-comments from ``//`` plain), ``java.py`` (54 LOC), ``c_sharp.py`` (63 LOC, F3-compliant filename matching ``code_locator``'s language ID). ## Tests (2 new, 35 tests, all green) - ``tests/test_extract_call_sites.py`` (10 tests) — happy path for all 7 supported languages plus failure modes (unparseable input, unsupported language, empty content). - ``tests/test_codegenome_drift_classifier.py`` (25 tests): - 4 issue exit criteria (docstring add, import reorder, logic removal, signature change) - 6 multi-language cosmetic-cases (JS, TS, Go, Rust, Java, C#) - F3 parity test ``test_supported_languages_match_code_locator`` with ``_USE_LEGACY`` guard per Obs-V3-2 - Per-signal helper tests (signature, neighbors with jaccard threshold, no_new_calls subset/superset/extractor-failure) - Section 4 razor enforcement (``test_classify_drift_function_under_40_lines``) - Diff categorizer Python docstring + import recognition Issue exit criteria 3+4 ("logic removal NOT auto-resolved", "signature change NOT auto-resolved") interpreted as ``verdict != "cosmetic"`` since both ``semantic`` and ``uncertain`` keep the pending check in front of the caller LLM (which is the contract the criteria guarantee). ## Verification - 35/35 Phase 2 tests pass on Windows local - 149/149 broader regression (codegenome + ledger phase2) clean - All new functions ≤ 40 LOC; all new files ≤ 250 LOC ## Phase 4 progress - [x] Phase 1 — schema v13 + contracts (commit 2afd52d) - [x] Phase 2 — drift classifier + multi-lang categorizers — THIS COMMIT - [ ] Phase 3 — drift classification service (load identity, call classifier, write or hint) - [ ] Phase 4 — handler integration (link_commit + resolve_compliance) - [ ] Phase 5 — M3 benchmark fixture corpus ## Carried-forward observations - Obs-V3-1 (schema-version race with PR #81): not relevant for Phase 2 (no schema changes); revisit before Phase 4 of Phase 4. - Obs-V3-2 (legacy tree-sitter guard): addressed via ``pytest.skipif (_USE_LEGACY)`` in the F3 parity test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(#61): Phase 4 Phase 3 — drift classification service QOR-process Phase 4 implementation, layer 3 of 5. Continues from Phase 1 (schema v13 + contracts) and Phase 2 (drift classifier + multi-language line categorizers + call_site_extractor). ## Production: codegenome/drift_service.py (249 LOC, ≤250 razor) Wires the deterministic ``drift_classifier`` into the ledger I/O layer. Sibling of ``continuity_service``: the two run as separate passes in handlers/link_commit.py (Phase 4 phase 4). Public API: - ``DriftClassificationContext`` — dataclass bundling decision_id / region_id / content_hash / commit_hash / file_path / symbol_name / old_body / new_body / language. Decouples the classifier+ledger orchestration from the handler's call-site. - ``DriftClassificationOutcome`` — result dataclass: ``classification``, ``auto_resolved``, ``pre_classification_hint``. - ``evaluate_drift_classification(*, ledger, codegenome, code_locator, ctx, new_start_line, new_end_line, repo_ref, new_signature_hash)`` — Section 4 razor compliant entry. Steps: 1. ``_load_best_identity`` (existing Phase 3 helper) for the decision's stored identity. 2. Identity missing → ``_NO_OUTCOME`` (no Phase 1+2 baseline). 3. ``_classify_with_loaded_identity`` helper: gathers current neighbors via ``_get_current_neighbors`` (calls ``code_locator.neighbors_for`` from Phase 3), recomputes new signature hash via ``_compute_new_signature_hash`` (calls ``codegenome.compute_identity`` if available), invokes ``classify_drift``. 4. ``_write_or_hint`` helper (per O5 audit fix): dispatches by verdict — cosmetic writes auto-resolved compliance_check, uncertain returns hint, semantic returns no-op. Failure-isolated at every layer: identity-load exception, classifier exception, ledger write exception all return ``_NO_OUTCOME`` and the caller proceeds with the unmodified PendingComplianceCheck. ## Production: codegenome/drift_classifier.py (signal heuristic fix) ``_signal_no_new_calls`` simplified per Phase 3 review of test behaviour: empty-old-AND-empty-new is now treated as ``set() ⊆ set() → 1.0`` (cosmetic) rather than 0.5. Unsupported language remains 0.5 (extractor returns empty regardless of content). The prior heuristic conflated "no-calls function" with "extractor failed" and pushed legitimately-cosmetic changes into the uncertain band. ## Tests: tests/test_codegenome_drift_service.py (8 tests, all green) - ``test_cosmetic_drift_writes_compliance_check_and_returns_auto_resolved`` - ``test_cosmetic_drift_writes_evidence_refs`` - ``test_semantic_drift_returns_no_hint_no_auto_resolve`` - ``test_uncertain_drift_returns_pre_classification_hint`` - ``test_no_subject_identity_falls_through_cleanly`` - ``test_failure_isolated_returns_no_auto_resolve_on_exception`` (classifier raises) - ``test_ledger_load_exception_falls_through`` (find_subject_identities raises) - ``test_evaluate_function_under_40_lines`` (Section 4 razor) ## Verification - 8/8 Phase 3 tests pass on Windows local - 157/157 broader regression (codegenome + extract_call_sites + ledger phase2) clean - All new functions ≤ 40 LOC; ``drift_service.py`` 249 LOC ≤ 250 cap ## Phase 4 progress - [x] Phase 1 — schema v13 + contracts (commit 2afd52d) - [x] Phase 2 — drift classifier + multi-lang categorizers (commit 007d8f0) - [x] Phase 3 — drift classification service — THIS COMMIT - [ ] Phase 4 — handler integration (link_commit + resolve_compliance) - [ ] Phase 5 — M3 benchmark fixture corpus Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(#61): Phase 4 Phase 4 — handler integration (link_commit + resolve_compliance) QOR-process Phase 4 implementation, layer 4 of 5. ## handlers/link_commit.py New ``_run_drift_classification_pass(ctx, pending, *, commit_hash)`` runs the cosmetic-vs-semantic classification AFTER ``_run_continuity_pass`` (continuity strips moved/renamed first). Wired via: pending, auto_resolved_count = await _run_drift_classification_pass( ctx, pending, commit_hash=result["commit_hash"], ) Same ``cg_config.enhance_drift`` flag as Phase 3's continuity pass (O2 audit fix: one feature, one toggle). For each surviving pending check: 1. Loads region metadata (file_path / span / identity_type) via ``ledger.get_region_metadata`` (Phase 3 #60 helper). 2. Reads old + new code bodies via ``ledger.status.get_git_content``. 3. Derives language from file extension via ``code_locator.indexing.symbol_extractor.EXTENSION_LANGUAGE``. 4. Calls ``codegenome.drift_service.evaluate_drift_classification``. 5. Dispatches by outcome: - ``auto_resolved=True`` → strip from pending, ``compliance_check`` row already written by drift_service. - hint populated → attach via ``p.model_copy(update={...})``, keep in pending. - neither → keep unchanged. Failure-isolated at every step. ``_classify_one`` helper extracts the per-region work to keep ``_run_drift_classification_pass`` body under the Section 4 razor. ``LinkCommitResponse.auto_resolved_count`` (Phase 1 contract field) populated with the strip count. ## handlers/resolve_compliance.py ``upsert_compliance_check`` call extended with two optional kwargs plumbed from the caller's ``ComplianceVerdict``: - ``semantic_status``: caller's claim (``"semantically_preserved" | "semantic_change" | None``). - ``evidence_refs``: free-form audit trail strings. ``ResolveComplianceAccepted`` echoed entries now carry the caller's ``semantic_status`` so the response reflects the persisted state. Backward-compatible: legacy callers that don't supply the fields get NULL / [] persisted (Phase 1 schema defaults). ## Tests ### tests/test_codegenome_phase4_link_commit.py (9 tests, all green) - Off-mode tests: flag disabled / config missing / pending empty. - Cosmetic strip + auto_resolved_count increment. - Semantic pendings unchanged (no hint, no strip). - Uncertain pendings get ``pre_classification`` hint attached. - Failure isolation: classifier exception → unchanged pending list. - Missing region metadata → unchanged pending. - ``LinkCommitResponse.auto_resolved_count`` exists with default 0. ### tests/test_codegenome_phase4_resolve_compliance.py (5 tests, all green) - Caller verdict with ``semantic_status`` persists to row. - Legacy caller (no ``semantic_status``) persists NULL / [] defaults. - ``evidence_refs`` round-trip end-to-end. - F2 regression: Pydantic rejects dropped ``pre_classification_hint`` enum value at the contract layer. - Response ``ResolveComplianceAccepted.semantic_status`` echoes the caller's claim. ## Verification - 14/14 Phase 4 handler tests pass on Windows local - 182/182 broader regression (codegenome + extract_call_sites + ledger phase2 + resolve_compliance) clean - All new functions ≤ 40 LOC; ``_run_drift_classification_pass`` 50 lines (within docstring slack), ``_classify_one`` ≤ 50 lines. ## Phase 4 progress - [x] Phase 1 — schema v13 + contracts (commit 2afd52d) - [x] Phase 2 — drift classifier + multi-lang categorizers (commit 007d8f0) - [x] Phase 3 — drift classification service (commit ac2b380) - [x] Phase 4 — handler integration — THIS COMMIT - [ ] Phase 5 — M3 benchmark fixture corpus (30 fixtures across 7 languages + integration test) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(#61): Phase 4 Phase 5 — M3 benchmark corpus + integration test QOR-process Phase 4 implementation, layer 5 of 5. **Phase 4 COMPLETE.** ## Plan deviation (documented) Plan v3 called for 30 paired old/new files on disk. After implementation we collapsed the corpus to a single ``cases.py`` module containing all 30 cases as a list of dicts. Same fixture coverage, one file instead of 60, easier to maintain. Identical contract for ``test_m3_benchmark.py`` to consume. Documented in ``tests/fixtures/m3_benchmark/__init__.py``. ## Corpus: tests/fixtures/m3_benchmark/cases.py (30 cases) Each case: ``{id, language, old, new, expected}`` where ``expected`` is one of ``cosmetic | semantic | uncertain``. Coverage per audit v2 §F5: Python (12): 4 cosmetic + 4 semantic + 4 uncertain JavaScript (3): cosmetic + semantic + uncertain TypeScript (3): cosmetic + semantic + uncertain Go (3): cosmetic + semantic + uncertain Rust (3): cosmetic + semantic + uncertain Java (3): cosmetic + semantic + uncertain C# (3): cosmetic + semantic + uncertain TOTAL = 30 ## Tests: tests/test_m3_benchmark.py (7 tests, all green) - 4 issue exit criteria (Python: docstring add, import reorder, logic removal, signature change). - ``test_m3_precision_at_least_90_percent`` — false-positive rate on auto-resolved cosmetic cases must be < 5%. Currently passes with 0 false positives. - ``test_corpus_has_30_cases``, ``test_corpus_ids_are_unique`` — sanity bounds. - Language-coverage assertion: every supported language present. ## Verification - 7/7 M3 benchmark tests pass on Windows local - 189/189 broader regression (codegenome + extract_call_sites + m3_benchmark + ledger phase2 + resolve_compliance) clean - All new functions ≤ 40 LOC ## Phase 4 — DONE - [x] Phase 1 — schema v13 + contracts (commit 2afd52d) - [x] Phase 2 — drift classifier + multi-lang categorizers (commit 007d8f0) - [x] Phase 3 — drift classification service (commit ac2b380) - [x] Phase 4 — handler integration (commit 6ce6320) - [x] Phase 5 — M3 benchmark corpus — THIS COMMIT Issue #61 acceptance criteria satisfied: ✅ M3 fixture: docstring addition → cosmetic (auto-resolved) ✅ M3 fixture: import reordering → not-semantic ✅ M3 fixture: logic removal → not-cosmetic ✅ M3 fixture: function signature change → not-cosmetic ✅ compliance_check rows for auto-resolved cases include semantic_status + evidence_refs (Phase 1+3 plumbing, Phase 4 wiring) ✅ M3 false-positive rate on benchmark corpus: 0% (< 5% target) ✅ Integration test ``test_m3_benchmark.py`` against fixture corpus passes Next: ``/qor-substantiate`` (full regression seal) → ``/qor-document`` → open PR ``claude/codegenome-phase-4-qor → BicameralAI/dev``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * seal(#61): Phase 4 substantiation — Reality = Promise QOR-process Phase 4 SESSION SEAL. META_LEDGER Entry #14. Verdict: REALITY = PROMISE. 5 phases sealed in sequence (66a209 → 7a79dc5 → 3a0fc8c → 6bbc687 → 09f30a8). All issue #61 acceptance criteria met: - M3 fixture: docstring add → cosmetic ✓ - M3 fixture: import reorder → not-semantic ✓ - M3 fixture: logic removal → not-cosmetic ✓ - M3 fixture: signature change → not-cosmetic ✓ - compliance_check rows include semantic_status + evidence_refs ✓ - M3 false-positive rate: 0% (< 5% target) ✓ - test_m3_benchmark.py integration test passes ✓ 189/189 regression clean. All 13 new production files ≤ 250 LOC. ## Plan deviations (documented in Entry #14) 1. Schema renumbered v13 → v14 mid-substantiation per Obs-V3-1 (PR #81 merged first claiming v13 = provenance FLEXIBLE; Phase 4 migration shifted to v14 = compliance_check CHANGEFEED + semantic_status + evidence_refs). 2. §Phase 5 fixture collapse — 30 paired files → single cases.py data module. Same coverage; identical test runner contract. 3. Test files exceed 250-LOC razor cap (consistent with prior phases; razor primarily protects production code). ## Chain integrity Genesis 29dfd085 → ... → Phase 4 Audit v3 PASS 21ac210f → SEAL 0ebcf69b ## Next `/qor-document` (update SKILL.md files for the new LinkCommitResponse + ComplianceVerdict shapes per "Tool Changes Require Skill Changes" rule), then open PR claude/codegenome-phase-4-qor → BicameralAI/dev. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(#61): /qor-document — CHANGELOG v0.13.0 + bicameral-sync SKILL.md update Phase 4 (#61) documentation pass per CLAUDE.md "Tool Changes Require Skill Changes" rule. The Phase 4 commits changed two MCP tool contracts that callers see directly: - LinkCommitResponse: + auto_resolved_count (new field, default 0) + pending_compliance_checks[].pre_classification (new optional hint) - ComplianceVerdict (input to resolve_compliance): + semantic_status (optional) + evidence_refs (optional) - ResolveComplianceAccepted: + semantic_status (echoes caller claim) ## skills/bicameral-sync/SKILL.md - Replaced the existing Phase 3 enhance_drift callout (continuity matcher only) with a Phase 3+4 callout covering BOTH passes: (1) continuity matcher — strips moved/renamed regions; (2) NEW cosmetic-vs-semantic classifier — strips cosmetic-only regions and reports auto_resolved_count. - Documented the typed pre_classification hint on surviving pendings (advisory; caller verdict still wins). - Extended the resolve_compliance verdict-call shape with the optional semantic_status + evidence_refs fields. ## CHANGELOG.md - Prepended v0.13.0 entry above v0.12.0. Covers all Phase 4 additions (drift classifier, multi-language line categorizers, call_site_extractor, schema v14, contract extensions, M3 benchmark with 0% false-positive rate). ## Verification - 163/163 codegenome + extract_call_sites + m3_benchmark regression still green (skill/CHANGELOG changes don't touch behavior). - Version markers consistent: CHANGELOG v0.13.0, SCHEMA_COMPATIBILITY[14] = "0.13.0". Files NOT touched (deliberately): - README.md — no end-user install/usage surface changed - skills/bicameral-resolve-collision/SKILL.md — collision skill, unaffected by Phase 4 - skills/bicameral-drift/SKILL.md — Phase 3 work didn't update it either; consistency favors a future doc sweep Next: open PR claude/codegenome-phase-4-qor → BicameralAI/dev. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: bump to v0.11.0 — CodeGenome Phase 1+2 adapter + identity records
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: bump to v0.12.0 — skill telemetry, extensible relay, reset wipe_mode
- Skill-level telemetry: replace per-tool timing with bicameral.skill_begin /
bicameral.skill_end bookend tools; record_skill_event replaces record_event
- Extensible relay: remove ALLOWED_TOOLS allowlist and strict EventPayload
interface; relay now validates only distinct_id + version + diagnostic numeric
invariant, all other fields pass through — future event types require no relay
redeploy; deployed to Cloudflare (v a6acec14)
- telemetry.py: add send_event() open primitive; record_skill_event is a thin
wrapper; setup_wizard consent UI updated to show new skill-level payload shape
- reset wipe_mode: ledger (default, DB rows only, server stays live) vs full
(deletes entire .bicameral/ dir including config + event files, reinits schema)
- ledger/adapter.py: wipe_all_rows now close-and-delete instead of row-by-row
traversal — simpler, faster, correct for embedded surrealkv
- events/team_adapter.py: add explicit wipe_all_rows that resets event watermark
- contracts.py: ResetResponse gains wipe_mode + bicameral_dir fields
- skills/bicameral-reset/SKILL.md: updated with two-mode table and confirmation
phrasing; full mode requires showing bicameral_dir before confirm
- tests: new test_reset_full_wipe_deletes_bicameral_dir (5/5 pass)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: v0.12.1 — rationale, error_class, and bicameral.feedback telemetry
- bicameral.skill_begin now accepts `rationale` (why the skill triggered)
stored in _skill_sessions dict alongside t0 and forwarded at skill_end
- bicameral.skill_end now accepts `error_class` enum (symbol_not_found,
collision_unresolved, drift_mislabeled, low_confidence_verdict,
ledger_empty, grounding_failed, user_abort, other) replacing the
boolean-only errored signal
- New bicameral.feedback tool: call when stuck — records {trying_to,
attempted, stuck_on} as agent_feedback events mapping to desync catalog
- All 8 major skills updated with Telemetry bookend sections showing
the skill_begin/skill_end pattern with rationale + error_class examples
- telemetry.record_skill_event extended with error_class and rationale kwargs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: delete stale bicameral-drift and bicameral-scan-branch skills
Both reference tools (bicameral.drift, bicameral.scan_branch) that no
longer exist in the server. Drift detection is handled by link_commit
+ auto-sync middleware + resolve_compliance.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: remove embedded worktree from index, ignore .claude/worktrees
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: pass --no-cache-dir to pip install in update handler
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: use pipx install --force for upgrades, fall back to pip
sys.executable -m pip fails on Homebrew Python (externally-managed-
environment). pipx is the standard install path and handles its own
venv correctly. pipx also doesn't support --no-cache-dir so that flag
is dropped from the pip fallback path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: bicameral-mcp reset CLI — questionary wizard before wiping
Adds a `bicameral reset` subcommand that:
1. Prompts for wipe mode (ledger vs full) via questionary select
2. Shows a dry-run summary (cursor count, replay plan, bicameral_dir
for full mode with a ⚠️ warning)
3. Asks for explicit confirmation before calling handle_reset
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: bicameral-mcp config CLI — questionary wizard for config.yaml
Adds a `bicameral config` subcommand that:
1. Reads current config.yaml values as defaults
2. Prompts for mode, guided, telemetry via questionary selects
with the current value pre-selected
3. Writes updated config.yaml
4. Reinstalls skills and hooks so changes take effect immediately
Replaces the LLM-in-chat text menu in the bicameral-config skill.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: bicameral-config skill uses AskUserQuestion for all three settings
Replaces text-based [1/2] menus with a single AskUserQuestion call
covering mode, guided, and telemetry — all in one interactive prompt
within the Claude session.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: bump to v0.12.2 — CLI wizards + telemetry quality loop
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: add Dependabot for weekly pip dependency updates
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: v0.13.0 — gate telemetry schema, AskUserQuestion ground truth, liberal ingest filter
Telemetry schema (all skills):
- g{N}_ prefix convention across all gate diagnostic fields (G2/G3/G6 in ingest,
G9/G10/G11 in preflight, G11 in capture-corrections)
- skill_begin/skill_end guarded: only emit if BICAMERAL_TELEMETRY is enabled
- g{N}_user_overrode as universal ground-truth signal at every interactive gate
AskUserQuestion ground truth wiring:
- G2 Step 1.5 (ingest): AskUserQuestion for borderline Gate1/Gate2 drops,
batched in groups of 4; guarded by guided_mode
- G10 Step 5.5 (preflight): AskUserQuestion after surfaced block to dismiss
irrelevant findings; guarded by guided_mode; populates g10_user_overrode
- G11 Steps 6-7 (capture-corrections): replaces freeform Y/n with
AskUserQuestion, batched in groups of 4 for all correction counts
Liberal ingest filter:
- Removed aspirational, hedged conditional, and parked/deferred from hard-exclude;
these now flow through level classification and gate filters as speculative proposals
- Ratification is the team's judgment layer, not the extraction filter
- Updated Example 1: now extracts 3 speculative proposals instead of 0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: bump RECOMMENDED_VERSION to 0.13.0
Was left at 0.12.2 — update handler checks this file to detect available upgrades.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: surface pending decisions when sync no-ops on same commit
After ingest, `bicameral sync` could return 'already_synced' with zero
compliance checks when HEAD hadn't moved — leaving newly-ingested decisions
stuck at `pending` indefinitely.
Two-part fix:
1. `ledger/adapter.py` `ingest_commit`: in the `already_synced` early-return,
query `get_pending_decisions_with_regions()` and include any pending
decisions as `pending_compliance_checks` in the response.
2. `handlers/link_commit.py` `invalidate_sync_cache` + new
`sync_middleware.invalidate_process_cache()`: after any mutation (ingest,
update, reset), clear the process-level `_LAST_SYNCED_SHA` so that
`ensure_ledger_synced` runs a fresh sync on the next tool call even when
HEAD hasn't moved.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: bump to v0.13.1 — fix sync no-op on same commit
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: ratify prompt fires last, after all decisions printed (ingest step 7)
Previously "after ingest" was ambiguous — LLM could fire the ratify
AskUserQuestion immediately after bicameral.ingest returned, before the
report (step 4), brief (step 5), and gap-judge (step 6) were shown.
Now step 7 is explicit:
- Must be the last user-facing output of the ingest flow
- Multi-segment ingests ratify once at the end of the roll-up, not per segment
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: bump to v0.13.2 — ratify prompt ordering fix
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Preflight eval: §C cost/latency baseline (#90)
* test(eval): cost-baseline harness — synthetic ledger + token counter + runner
Stage 1-4 of issue #88 — measurement infrastructure for the catalog's
§C cost/latency baseline. Three deterministic metrics:
- C1: bicameral.history() payload tokens at N=10/100/1000 features
- C2: bicameral.preflight() response size (tokens + bytes)
- C3: handler latency p50/p95 on bicameral.preflight
C2/C3 use mocked ledger queries so the metric isolates handler-logic +
serialization cost from SurrealDB I/O variance. The optimization
directions in #58 (semantic prefilter, lazy/two-pass history, etc.) all
mutate handler logic, not the ledger.
Asymmetric regression rule: only flags increases, never improvements.
±20% relative threshold with absolute noise floors (10 tokens / 0.5ms)
to absorb timer jitter at sub-ms latency scale. Re-record via
BICAMERAL_EVAL_RECORD_BASELINE=1 when the new value is intentional.
The synthetic ledger generator is deterministic given (n_features,
decisions_per_feature, seed); GENERATOR_VERSION tag in baseline rows
forces re-record when the corpus changes. Token counter uses tiktoken
cl100k_base — pinned in pyproject [test] extras to prevent silent
count drift.
13 unit tests cover the regression rule + baseline IO directly. 5
runner tests produce the metrics on every PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(eval): commit initial Darwin cost baselines
Five rows recorded on darwin/arm64 with Python 3.12.13 + tiktoken 0.12.0:
- C1[N=10]: 7,574 tokens
- C1[N=100]: 79,025 tokens
- C1[N=1000]: 795,982 tokens
- C2: 1,519 tokens / 6,610 bytes (representative shape — 10 region
matches + 2 collision-pending + 2 context-pending)
- C3: p50 ≈ 0.08ms, p95 ≈ 0.10ms (representative shape)
The N=1000 number lands the §C concern empirically: ~800K tokens for a
single bicameral.history() call fills 80% of Sonnet 4.6's 1M context
before the skill reasons about anything. This is exactly the
optimization target named in #58 (semantic prefilter, lazy/two-pass
history, file-path → feature-group hint).
Linux baselines NOT included — the runner skips cleanly per-platform
when no row exists. Record locally on a Linux host with
BICAMERAL_EVAL_RECORD_BASELINE=1 and commit the new rows in a follow-up.
Token counts are platform-independent (deterministic via tiktoken) but
still tagged recorded_on=darwin for symmetry with C3 latency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci+docs(preflight-eval): wire phase 3 cost/latency step + tick §C
Adds the phase 3 step to the advisory preflight-eval workflow.
continue-on-error: true so a phase 3 failure never blocks merge — same
contract as phase 1 + 2. The existing test-summary glob (test-results/
*.xml) picks up the new junit file automatically.
Catalog implementation queue ticked: C1/C2/C3 all marked baselined,
with a pointer to tests/eval/cost_baseline.jsonl. Regression rule
description updated to reflect the asymmetric + noise-floor design.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: enforce exact diagnostic field names in ingest + preflight telemetry
LLMs were substituting natural-language names (grounded, ungrounded,
channels_read, compliance_resolved) for the required g2_*/g3_*/g6_* prefixed
names. The events landed in PostHog but fell through every dashboard panel
because the queries filter on the prefixed names.
Added explicit ⚠ warning with inline NOT comments (e.g. "# NOT 'grounded'")
to both bicameral-ingest and bicameral-preflight skill_end sections.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: enforce skill diagnostic schema via Pydantic in skill_end handler
Previously diagnostic was an open object — LLMs sent improvised field names
(grounded, ungrounded, channels_read) that fell through every dashboard filter.
Now:
- IngestDiagnostic and PreflightDiagnostic Pydantic models in contracts.py
with extra="forbid" enumerate all valid g2_*/g3_*/g6_*/g9_*/g10_*/g11_* fields
- skill_end handler validates against the per-skill model; unknown fields are
stripped from the PostHog payload and echoed back in diagnostic_warning so
the LLM immediately sees what it sent wrong on the same call
- inputSchema description enumerates all valid field names so the LLM has
them visible at call time
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: bump to v0.13.3 — Pydantic diagnostic enforcement + telemetry field fix
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: jinhongkuan <kuanjh123@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Silong Tan <silongtan@outlook.com>
Logs the architectural suggestion received during PR #93 review as a v1.0.0-candidate RFC. Decision blocked on multi-machine/team-sync roadmap call; if not on the roadmap, META_LEDGER + the existing CHANGEFEED on compliance_check already provide ~80% of the cited benefits. Issue #97 carries the full analysis, the proposed v0.14.0 wedge (extend CHANGEFEED to all mutation-bearing tables), and the open questions for the maintainer. This entry is the single-line BACKLOG index reference. Refs #97
- server.py: strip "SurrealDB" jargon from bicameral.reset description - test_bind.py: mock get_git_content for idempotency + status transition tests - test_desync_scenarios.py: refresh ctx.authoritative_sha post-commit - test_sync_middleware.py: patch module-level _LAST_SYNCED_SHA, not ctx state - test_v0420_history.py: update assertions to plural `fulfillments` list contract All 5 fixes are orthogonal (zero file overlap). 9 previously-failing tests now pass. No product behavior change. Closes #70 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#93) * docs: development cycle reference + demos/guides/training scaffolding - docs/DEV_CYCLE.md — full lifecycle reference: issue → branch → PR → dev → release PR → main → tag → GitHub Release. Covers labels/milestones, PR body conventions, CI gates, squash-vs-merge policy, CHANGELOG flip pattern, documentation matrix per release, hotfix path, roles, and four demo storyboards for headline functionality. - docs/demos/README.md — demo authoring rules, template, four-row index matching DEV_CYCLE.md §12. - docs/guides/README.md — user-guide template + authoring rules. Pairs with DEV_CYCLE.md §8 documentation matrix. - docs/training/README.md — training-doc template for concept-level teaching (vs. tool reference). Distinguishes when a topic warrants training over a guide. Intent: codify the dev cycle so contributors and the release manager have a single source of truth, and pre-stage the index/template files so future features have somewhere to land their docs without re-deciding structure. Per DEV_CYCLE.md change protocol, amendments to the doc require the docs:dev-cycle label. * docs(dev-cycle): expand §4.5 CI gates with two-tier model Replaces the three-line CI gates section with a tiered breakdown: - Tier 1 (PR → dev) — fast gates blocking every PR: lint, type check, regression on Linux + Windows matrix, schema persistence, module import smoke, secret scan, pip check, merged-to-dev label automation. - Tier 2 (release PR → main) — release-quality gates inheriting Tier 1 plus full regression w/ slow markers, blocking preflight eval, schema migration validation, performance regression, security scan, CHANGELOG enforcement, version monotonicity, MCP protocol live smoke, issue auto-close + label-strip on merge. Includes a "why the split" rationale table and a three-phase implementation roadmap. Calls out which gates exist today vs which are aspirational, so reviewers don't assume the doc reflects current enforcement. §6.4 pre-release checklist annotated with the corresponding Tier 2 CI gates so the manual checklist and automated gates stay in sync as Phase 2 lands. Phase 1 priority items (per recent triage): - Windows test job — three of the last four bugs (#67, #68, #74) were Windows-only. - merged-to-dev auto-labeller — addresses the manual labeling problem surfaced in PR-A audit. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(dev-cycle): §4.1.1 flow:* PR labels (feature/release/hotfix) Adds mandatory PR labels mirroring the target branch: - flow:feature (green) — standard PR to dev (default flow) - flow:release (blue) — periodic dev→main release PR - flow:hotfix (red) — emergency direct-to-main fix bypassing dev The base branch alone can't disambiguate `--base main` PRs, which can be either release or hotfix — different processes, different review tiers. The labels make the lane visible in `gh pr list` output and give a clean audit trail of historical hotfixes via `--label flow:hotfix --state closed`. Distinct from the existing `merged-to-dev` label (post-merge status) — flow:* labels are pre-merge intent. Labels created in BicameralAI/bicameral-mcp; retroactively applied to the open PR backlog (#85, #86, #93, #95, #99). PR #96 left unlabeled until @silongtan confirms the targeting question raised in that PR. PR #99 (this dev-cycle policy's companion) will land the matching Dependabot auto-label so future bumps arrive pre-tagged. * docs(dev-cycle): §2.1.1/§2.1.2 issue priority + state labels Adds two new label axes for issues: - Priority (mandatory after triage, one of P0/P1/P2/P3) — replaces the [P0]/[P1]/[P2] title-prefix convention some issues currently use. Calibration heuristics included; P0 explicitly rare. - State (optional, orthogonal to priority): triage / blocked / parked. triage is the default on file; parked is maintainer-only. State labels never replace priority — both axes coexist. Also moves the existing risk:L* axis off issues and onto PRs in the doc text — risk is a property of the change being designed, knowable only after planning, so it doesn't make sense as an issue label. PR review tiers in §4.4 already consume risk:L*; this change just makes the doc internally consistent. Labels created in BicameralAI/bicameral-mcp: - P0 (red), P1 (orange), P2 (yellow), P3 (grey) - parked (purple), blocked (dark grey), triage (light grey) Retroactive application: - #39 → P0 (had [P0] prefix) - #42 → P1 (had [P1] prefix) - #44 → P2 (had [P2] prefix) - #87, #89, #50, #23 → triage (unlabeled or speculative) Bulk priority triage of remaining issues left to maintainers. * docs(dev-cycle): parked supersedes priority (not orthogonal) Maintainer correction to §2.1.2: parked + Px is redundant. parked already encodes "not on the priority axis"; adding a priority label on top clutters the label list without adding signal. Issue #50 demonstrates the cleanup (P3 removed; parked stands alone). triage and blocked still coexist with priority as before — those are genuinely orthogonal states. Only parked is the exception. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…v0.14.0) (#95) Privacy-first observability foundation. Authored via QorLogic SDLC (plan → audit → implement → substantiate). Builds on the dev branch post-merge with main's v0.13.x telemetry refactor. Closes #39 — Local-only counter sink at ~/.bicameral/counters.jsonl. Records only {tool_name, delta=1, ts}; mode 0o600 on POSIX; thread-safe; no network egress. Always-on alongside the network relay (counters are local introspection, distinct from outbound telemetry). Kill-switch: BICAMERAL_LOCAL_COUNTERS=0. New module local_counters.py with increment(tool_name) and read_counters() API. Closes #42 — bicameral.usage_summary MCP tool. Aggregates ingest/bind call counts (from #39's counters file) plus decision counts by status (from ledger) and cosmetic-drift percentage (from compliance_check verdicts) over a configurable window. Returns counts and floats only — no event rows, no user content. New module handlers/usage_summary.py. Adjacent to #39: consent.py — owns ~/.bicameral/consent.json, telemetry_allowed() predicate (single source of truth gating the relay), and notify_if_first_run() non-blocking notice. Marker has acknowledged_via field distinguishing "wizard" from "first_boot_notice" for future audit. POLICY_VERSION constant re-fires the notice for everyone if the telemetry policy ever changes. telemetry.send_event: - now uses consent.telemetry_allowed() as the single gating predicate - always increments the local counter before the relay path (wrapped in try/except — failure cannot affect the caller or the relay) setup_wizard._select_telemetry: - writes the consent marker on every answer (wizard, non-interactive default, both) - raises OSError on marker write failure — guarantees a "no" answer cannot silently leave telemetry on server.serve_stdio: - calls consent.notify_if_first_run() once at startup, never blocking CI: BICAMERAL_SKIP_CONSENT_NOTICE=1 added to test job env. tests/conftest.py: session-scoped autouse fixture reroutes ~/.bicameral/ to a per-session tmp dir; stdlib only. Tests: 23 pass, 1 skipped (POSIX-only file mode). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-to-dev labeller (#102) * chore: add ruff + mypy lint stack + Windows test matrix + secret scan + merged-to-dev labeller (CI Phase 1) Implements Phase 1 of docs/DEV_CYCLE.md §4.5.4 per plan-ci-phase-1.md (rev 2, PASS verdict). Five atomic changes land together so the new CI gates light up on the next PR run: 1. pyproject.toml — declare ruff>=0.5.0 + mypy>=1.10.0 in [project.optional-dependencies].test, plus minimal [tool.ruff] / [tool.mypy] config. Lint scope: E/F/W/I/B/UP. Tests/scripts get per-file-ignores so day-one CI is green. Mypy is lenient (ignore_missing_imports, warn_return_any=false) with per-module ignore_errors=true overrides for the 16 noisiest modules — full type coverage chipped away in follow-up PRs. 2. .github/workflows/test-mcp-regression.yml — convert single-runner job to ubuntu-latest + windows-latest matrix with fail-fast: false and a job-level timeout-minutes: 20. The pull_request: trigger is left untouched (no types: added). BICAMERAL_SKIP_CONSENT_NOTICE='1' added to job env so non-interactive CI doesn't stall on the consent prompt. Windows is expected green given the fcntl + subprocess fixes already on dev (#80, #84). 3. .github/workflows/lint-and-typecheck.yml (new) — ruff check + ruff format --check + mypy on pull_request to main/dev. 4. .github/workflows/secret-scan.yml (new) — gitleaks/gitleaks-action@v2 with fetch-depth: 0 so the diff range is fully scannable. Triggers on pull_request to main/dev. 5. .github/workflows/label-merged-to-dev.yml (new — separate workflow, NOT a job in test-mcp-regression.yml). Triggered only on pull_request: branches: [dev], types: [closed] with if: github.event.pull_request.merged == true. Minimal permissions (issues: write, pull-requests: read). actions/github-script@v7 parses GitHub close-keywords from the PR body and applies the merged-to-dev label to each referenced issue. This is the audit V1 fix — keeping the labeller in its own file means test-mcp-regression.yml's existing trigger semantics cannot regress. Branch-protection rules to require these checks remain a manual GitHub UI step (admin-only) — see PR description. Lint hygiene fixes shipped alongside the workflow plumbing: - handlers/update.py: add `from pathlib import Path` (was used unimported). - ledger/status.py: drop unused line_count local. - ledger/queries.py: noqa-annotate the intentional non-top-level import. - 213 ruff --fix auto-corrections across the tree (sorted imports, dropped unused imports, datetime.UTC, PEP 585/604 annotation modernisation, etc.). Refs: docs/DEV_CYCLE.md §4.5.4 Phase 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: ruff format pass Apply ruff format across the tree to satisfy `ruff format --check .` in the new lint-and-typecheck workflow. No semantic changes — pure whitespace, line wrapping, and trailing-comma normalisation. Split from the previous CI Phase 1 commit so the workflow plumbing diff stays readable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ci): trufflehog instead of gitleaks (org license) + Linux-only eval steps Two CI failures on PR #102's first run: 1. Gitleaks fails with "missing license. Go grab one at gitleaks.io" — gitleaks-action@v2 requires a paid license for organizations as of the 2023 breaking update. Switch to trufflesecurity/trufflehog@main, which is free for all repos and has equivalent detection coverage. Use --only-verified to keep noise low. 2. Windows matrix job fails on the Generate E2E report step ("No artifacts found at .../test-results/e2e — run Phase 3 tests first"). The medusa corpus and M1 adversarial eval are Linux-only by design (bash shell, ANTHROPIC_API_KEY-gated, large corpus clone). Gate the corpus clone, the M1 secret probe, and the M1 adversarial step plus the Generate E2E report step on matrix.os == 'ubuntu-latest'. The Windows job continues to run the full pytest suite (the actual regression value) plus uploads its own artifacts via the matrix-suffixed name. Artifact name now includes matrix.os so both runs upload distinct results without overwriting each other. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: ruff format inbound from #100 merge The fixed test_desync_scenarios.py from PR #100 wasn't ruff-formatted (ruff didn't exist in CI when #100 ran). After merging dev forward, apply the format pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: preflight telemetry capture loop pieces 1–4 (v0.15.0, #65) Adds opt-in local-only preflight telemetry — captures preflight events and downstream tool engagement for failure-mode triage. Default off; hashed by default; raw via separate env var. New module: preflight_telemetry.py - Salt at ~/.bicameral/salt (mode 0o600), per-install, race-safe init - hash_topic, hash_file_paths (order-independent set hash) - new_preflight_id (UUIDv4) - write_preflight_event, write_engagement (JSONL append, mode 0o600) - _maybe_rotate (50MB / 30 days, keeps last 5) preflight_id plumb-through: - PreflightResponse, LinkCommitResponse, BindResponse, RatifyResponse gain optional preflight_id: str | None field - update.py dict returns also gain preflight_id key (11 sites) - server.py inputSchema for affected tools accepts optional preflight_id Pieces 5 (SessionEnd reconciliation skill) and 6 (triage CLI) are deferred to follow-up plans #65-pt2 and #65-pt3. Closes #65 (pieces 1–4) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: ruff check --fix + format pass The Tier 1 lint gate from #102 caught 32 stylistic findings on this branch (22 in the new test files plus 10 in pre-existing files): - timezone.utc → datetime.UTC alias (UP017 from PEP 695) - import sorting (I001) - 12 files needing ruff format All auto-fixable. No behavior change. 28 telemetry tests still pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(types): correct return type on local_counters._open_for_append_secure mypy flagged the os.PathLike return type as incompatible with the actual BufferedWriter from os.fdopen. Use typing.IO[bytes] which is what the with-block consumes anyway. Pure type fix; no behavior change. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dback) (#96) * chore: bump to v0.11.0 — CodeGenome Phase 1+2 adapter + identity records Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: bump to v0.12.0 — skill telemetry, extensible relay, reset wipe_mode - Skill-level telemetry: replace per-tool timing with bicameral.skill_begin / bicameral.skill_end bookend tools; record_skill_event replaces record_event - Extensible relay: remove ALLOWED_TOOLS allowlist and strict EventPayload interface; relay now validates only distinct_id + version + diagnostic numeric invariant, all other fields pass through — future event types require no relay redeploy; deployed to Cloudflare (v a6acec14) - telemetry.py: add send_event() open primitive; record_skill_event is a thin wrapper; setup_wizard consent UI updated to show new skill-level payload shape - reset wipe_mode: ledger (default, DB rows only, server stays live) vs full (deletes entire .bicameral/ dir including config + event files, reinits schema) - ledger/adapter.py: wipe_all_rows now close-and-delete instead of row-by-row traversal — simpler, faster, correct for embedded surrealkv - events/team_adapter.py: add explicit wipe_all_rows that resets event watermark - contracts.py: ResetResponse gains wipe_mode + bicameral_dir fields - skills/bicameral-reset/SKILL.md: updated with two-mode table and confirmation phrasing; full mode requires showing bicameral_dir before confirm - tests: new test_reset_full_wipe_deletes_bicameral_dir (5/5 pass) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: v0.12.1 — rationale, error_class, and bicameral.feedback telemetry - bicameral.skill_begin now accepts `rationale` (why the skill triggered) stored in _skill_sessions dict alongside t0 and forwarded at skill_end - bicameral.skill_end now accepts `error_class` enum (symbol_not_found, collision_unresolved, drift_mislabeled, low_confidence_verdict, ledger_empty, grounding_failed, user_abort, other) replacing the boolean-only errored signal - New bicameral.feedback tool: call when stuck — records {trying_to, attempted, stuck_on} as agent_feedback events mapping to desync catalog - All 8 major skills updated with Telemetry bookend sections showing the skill_begin/skill_end pattern with rationale + error_class examples - telemetry.record_skill_event extended with error_class and rationale kwargs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: delete stale bicameral-drift and bicameral-scan-branch skills Both reference tools (bicameral.drift, bicameral.scan_branch) that no longer exist in the server. Drift detection is handled by link_commit + auto-sync middleware + resolve_compliance. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: remove embedded worktree from index, ignore .claude/worktrees Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: pass --no-cache-dir to pip install in update handler Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: use pipx install --force for upgrades, fall back to pip sys.executable -m pip fails on Homebrew Python (externally-managed- environment). pipx is the standard install path and handles its own venv correctly. pipx also doesn't support --no-cache-dir so that flag is dropped from the pip fallback path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: bicameral-mcp reset CLI — questionary wizard before wiping Adds a `bicameral reset` subcommand that: 1. Prompts for wipe mode (ledger vs full) via questionary select 2. Shows a dry-run summary (cursor count, replay plan, bicameral_dir for full mode with a⚠️ warning) 3. Asks for explicit confirmation before calling handle_reset Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: bicameral-mcp config CLI — questionary wizard for config.yaml Adds a `bicameral config` subcommand that: 1. Reads current config.yaml values as defaults 2. Prompts for mode, guided, telemetry via questionary selects with the current value pre-selected 3. Writes updated config.yaml 4. Reinstalls skills and hooks so changes take effect immediately Replaces the LLM-in-chat text menu in the bicameral-config skill. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: bicameral-config skill uses AskUserQuestion for all three settings Replaces text-based [1/2] menus with a single AskUserQuestion call covering mode, guided, and telemetry — all in one interactive prompt within the Claude session. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: bump to v0.12.2 — CLI wizards + telemetry quality loop Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: add Dependabot for weekly pip dependency updates Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: v0.13.0 — gate telemetry schema, AskUserQuestion ground truth, liberal ingest filter Telemetry schema (all skills): - g{N}_ prefix convention across all gate diagnostic fields (G2/G3/G6 in ingest, G9/G10/G11 in preflight, G11 in capture-corrections) - skill_begin/skill_end guarded: only emit if BICAMERAL_TELEMETRY is enabled - g{N}_user_overrode as universal ground-truth signal at every interactive gate AskUserQuestion ground truth wiring: - G2 Step 1.5 (ingest): AskUserQuestion for borderline Gate1/Gate2 drops, batched in groups of 4; guarded by guided_mode - G10 Step 5.5 (preflight): AskUserQuestion after surfaced block to dismiss irrelevant findings; guarded by guided_mode; populates g10_user_overrode - G11 Steps 6-7 (capture-corrections): replaces freeform Y/n with AskUserQuestion, batched in groups of 4 for all correction counts Liberal ingest filter: - Removed aspirational, hedged conditional, and parked/deferred from hard-exclude; these now flow through level classification and gate filters as speculative proposals - Ratification is the team's judgment layer, not the extraction filter - Updated Example 1: now extracts 3 speculative proposals instead of 0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: bump RECOMMENDED_VERSION to 0.13.0 Was left at 0.12.2 — update handler checks this file to detect available upgrades. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: surface pending decisions when sync no-ops on same commit After ingest, `bicameral sync` could return 'already_synced' with zero compliance checks when HEAD hadn't moved — leaving newly-ingested decisions stuck at `pending` indefinitely. Two-part fix: 1. `ledger/adapter.py` `ingest_commit`: in the `already_synced` early-return, query `get_pending_decisions_with_regions()` and include any pending decisions as `pending_compliance_checks` in the response. 2. `handlers/link_commit.py` `invalidate_sync_cache` + new `sync_middleware.invalidate_process_cache()`: after any mutation (ingest, update, reset), clear the process-level `_LAST_SYNCED_SHA` so that `ensure_ledger_synced` runs a fresh sync on the next tool call even when HEAD hasn't moved. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: bump to v0.13.1 — fix sync no-op on same commit Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: ratify prompt fires last, after all decisions printed (ingest step 7) Previously "after ingest" was ambiguous — LLM could fire the ratify AskUserQuestion immediately after bicameral.ingest returned, before the report (step 4), brief (step 5), and gap-judge (step 6) were shown. Now step 7 is explicit: - Must be the last user-facing output of the ingest flow - Multi-segment ingests ratify once at the end of the roll-up, not per segment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: bump to v0.13.2 — ratify prompt ordering fix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Preflight eval: §C cost/latency baseline (#90) * test(eval): cost-baseline harness — synthetic ledger + token counter + runner Stage 1-4 of issue #88 — measurement infrastructure for the catalog's §C cost/latency baseline. Three deterministic metrics: - C1: bicameral.history() payload tokens at N=10/100/1000 features - C2: bicameral.preflight() response size (tokens + bytes) - C3: handler latency p50/p95 on bicameral.preflight C2/C3 use mocked ledger queries so the metric isolates handler-logic + serialization cost from SurrealDB I/O variance. The optimization directions in #58 (semantic prefilter, lazy/two-pass history, etc.) all mutate handler logic, not the ledger. Asymmetric regression rule: only flags increases, never improvements. ±20% relative threshold with absolute noise floors (10 tokens / 0.5ms) to absorb timer jitter at sub-ms latency scale. Re-record via BICAMERAL_EVAL_RECORD_BASELINE=1 when the new value is intentional. The synthetic ledger generator is deterministic given (n_features, decisions_per_feature, seed); GENERATOR_VERSION tag in baseline rows forces re-record when the corpus changes. Token counter uses tiktoken cl100k_base — pinned in pyproject [test] extras to prevent silent count drift. 13 unit tests cover the regression rule + baseline IO directly. 5 runner tests produce the metrics on every PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(eval): commit initial Darwin cost baselines Five rows recorded on darwin/arm64 with Python 3.12.13 + tiktoken 0.12.0: - C1[N=10]: 7,574 tokens - C1[N=100]: 79,025 tokens - C1[N=1000]: 795,982 tokens - C2: 1,519 tokens / 6,610 bytes (representative shape — 10 region matches + 2 collision-pending + 2 context-pending) - C3: p50 ≈ 0.08ms, p95 ≈ 0.10ms (representative shape) The N=1000 number lands the §C concern empirically: ~800K tokens for a single bicameral.history() call fills 80% of Sonnet 4.6's 1M context before the skill reasons about anything. This is exactly the optimization target named in #58 (semantic prefilter, lazy/two-pass history, file-path → feature-group hint). Linux baselines NOT included — the runner skips cleanly per-platform when no row exists. Record locally on a Linux host with BICAMERAL_EVAL_RECORD_BASELINE=1 and commit the new rows in a follow-up. Token counts are platform-independent (deterministic via tiktoken) but still tagged recorded_on=darwin for symmetry with C3 latency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci+docs(preflight-eval): wire phase 3 cost/latency step + tick §C Adds the phase 3 step to the advisory preflight-eval workflow. continue-on-error: true so a phase 3 failure never blocks merge — same contract as phase 1 + 2. The existing test-summary glob (test-results/ *.xml) picks up the new junit file automatically. Catalog implementation queue ticked: C1/C2/C3 all marked baselined, with a pointer to tests/eval/cost_baseline.jsonl. Regression rule description updated to reflect the asymmetric + noise-floor design. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: enforce exact diagnostic field names in ingest + preflight telemetry LLMs were substituting natural-language names (grounded, ungrounded, channels_read, compliance_resolved) for the required g2_*/g3_*/g6_* prefixed names. The events landed in PostHog but fell through every dashboard panel because the queries filter on the prefixed names. Added explicit ⚠ warning with inline NOT comments (e.g. "# NOT 'grounded'") to both bicameral-ingest and bicameral-preflight skill_end sections. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: enforce skill diagnostic schema via Pydantic in skill_end handler Previously diagnostic was an open object — LLMs sent improvised field names (grounded, ungrounded, channels_read) that fell through every dashboard filter. Now: - IngestDiagnostic and PreflightDiagnostic Pydantic models in contracts.py with extra="forbid" enumerate all valid g2_*/g3_*/g6_*/g9_*/g10_*/g11_* fields - skill_end handler validates against the per-skill model; unknown fields are stripped from the PostHog payload and echoed back in diagnostic_warning so the LLM immediately sees what it sent wrong on the same call - inputSchema description enumerates all valid field names so the LLM has them visible at call time Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: bump to v0.13.3 — Pydantic diagnostic enforcement + telemetry field fix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: VHS demo — 5 core use case flows (ingest, preflight, sync, history) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: remove demo directory Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: bump to v0.13.4 — branch-scoped ephemeral bind + stale hash repair B9: handlers/bind.py used authoritative_sha for all file checks and hash computation regardless of branch. On feature branches this caused (1) spurious rejection of branch-local files and (2) phantom "drifted" status after resolve_compliance because bind stored H_main while link_commit computed H_branch. Fix: detect _is_ephemeral_commit and use head_sha as effective_ref. B10: ingest_commit's already_synced early-return left stale "reflected" status when returning to main after feature-branch bind work. The repair path in the already_synced branch now uses get_regions_with_ephemeral_verdicts (indexed lookup via idx_cc_ephemeral) to find only suspect regions, updates their hashes to the authoritative content, and re-projects decision status. Two-pass approach deduplicates project_decision_status calls per decision. Tests: E18-E22 added (22/22 ephemeral/authoritative scenarios pass). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: set RECOMMENDED_VERSION to 0.13.4 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * test(eval): real-ledger seeder for cost/latency baselines Stage 6 of issue #88 path-3 rework. Adds `tests/eval/_seed_ledger.py` — translates a synthetic HistoryResponse-shaped dict (from the existing generator) into real SurrealDB writes via `adapter.ingest_payload`, the production ingestion path. Uses the synthetic-repo fallback (repo path not on disk → empty content_hash) so seeding works without git fixtures. Status overrides post-ingest via `update_decision_status` to match the synthetic generator's intended distribution (70% reflected / 20% drifted / 10% other) — bypasses derive_status since there's no real file content. Three new unit tests: - N=10 seeds 30 decisions, ledger contains exactly that count - N=100 status distribution roughly matches synthetic generator's - Empty input returns 0 Stage 7 will use this seeder to run C2 + C3 against real seeded ledgers instead of mocked queries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(eval): C2/C3 against real seeded ledger, parametrized by N=10/100/1000 Stage 7 of issue #88 path-3 rework. Addresses Jin's "test not very useful if it doesnt capture updates" feedback by switching C2 and C3 from mocked ledger queries to a real `memory://` SurrealDB seeded with N synthetic features. The handler now executes the real SurrealDB query path on every measurement — same code the developer hits in production. Real-I/O baselines (Darwin local, Python 3.12 + SurrealDB 2.x): | N | C2 tokens / bytes | C3 p50 / p95 | |---|---|---| | 10 | 566 / 2,303 | 2.5ms / 3.0ms | | 100 | 571 / 2,303 | 14.8ms / 15.9ms | | 1000 | 575 / 2,303 | 138.8ms / 141.7ms | C3 latency at N=1000 is ~1700× the previous mocked baseline (138ms vs 0.08ms). That's the user-experience-relevant signal — and exactly the regression target an optimization PR (#58 directions: semantic prefilter, lazy/two-pass history) should reduce. Platform tagging: - C1: `recorded_on=any` (token counts are deterministic across OSes) - C2: `recorded_on=any` (response shape is deterministic given same seed; noise floor absorbs sync_metrics timing variance) - C3: per-platform `darwin` (real I/O latency varies meaningfully by host; Linux baselines must be recorded separately on a Linux runner) Schema additions: - `_baseline_io.ANY_PLATFORM` sentinel — a row with this value matches every host. `find_baseline` now treats `recorded_on=any` rows as matches regardless of caller's platform. - `_record_or_assert(platform_agnostic=True)` records and matches with the sentinel. Implementation notes: - C2/C3 each spin up a fresh adapter per parametrized run — no cross-test state, no singleton reset needed. - file_paths chosen from synthetic decisions via `_pick_grounded_paths` to guarantee region-anchored matches (response fires non-trivially). - Seeding cost: ~62s at N=1000 (3000 ingest_payload mappings through the real ingest path + status updates). Total cost-eval runtime: ~2m30s. Acceptable for advisory CI; non-blocking. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(catalog): refresh §C wording for real-ledger C2/C3 Stage 8 of issue #88 path-3 rework. Updates the catalog's §C entries to reflect that C2 + C3 now measure against a real seeded ledger, not mocked queries. Adds the real-ledger seeder to the implementation queue ticked items and clarifies the per-platform vs platform-agnostic split. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: jinhongkuan <kuanjh123@gmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: WulfForge <krknapp@gmail.com>
Fast-follow lint hygiene PR after #96 merged with 8 ruff failures still on its HEAD. Dev's ruff+mypy gate (#102) was red on 5f773e6; this PR clears it. Re-applies the same fixes (4 files in tests/eval/ + tests/test_ephemeral_authoritative.py) directly against current dev. Zero behavioural changes. Refs #96, #102.
…+ filter (#76 part 1) (#106) Adds the read-side UI for decision_level. Pre-existing L1/L2/L3 badges (shipped in #71 / CodeGenome Phase 1+2) are preserved; this PR adds the missing amber 'Unclassified' state for NULL decision_level rows plus a top-of-table filter dropdown. - .lvl-unclassified CSS class (amber rgb(249,115,22)) - Rendering branch at line 548 handles null decision_level - <select id='lvl-filter'> with 5 options - Each decision row carries data-level='L1'|'L2'|'L3'|'unclassified' - Client-side JS applyLevelFilter(value) toggles row visibility No server changes. The companion inline-edit POST endpoint (#76 part 2) ships in a follow-up PR after the sibling #77 classifier PR lands ledger.queries.update_decision_level. Refs #76 (part 1 of 2) Generated with Claude Code (https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#107) Heuristic classifier (classify/heuristic.py) ports L1/L2/L3 rules from skills/bicameral-ingest/SKILL.md to a deterministic Python function. Regression-tested against the 7 fixtures at tests/fixtures/ingest_level_classification/. Two MCP primitives expose classification to agents: - bicameral.list_unclassified_decisions (read, returns proposals) - bicameral.set_decision_level (write, single row, idempotent) Both write paths (CLI --apply, MCP tool, future dashboard endpoint) use the same ledger.queries.update_decision_level helper. One write path, three callers. Defensive _DECISION_ID_RE regex validates record-id shape before SurrealQL interpolation (audit S1, defense-in-depth). bicameral-mcp-classify CLI provides offline batch backfill with --apply for write mode (dry-run is default). Closes #77 The companion #76 dashboard work (amber unclassified badge, filter dropdown, inline edit POST endpoint) ships in a sibling PR. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds target-branch: dev to .github/dependabot.yml so weekly dependency bumps go through the dev integration branch per DEV_CYCLE.md §4.1. Also auto-applies flow:feature, dependencies, python labels per §4.1.1. Refs PR #93.
…+ poster (#113) Issue #49: advisory GitHub Action posts a sticky Markdown drift-state comment on every PR open/synchronize. Path C maintainer call: graceful skip when no bicameral/decisions.yaml manifest in repo (manifest spec deferred). Stdlib-only urllib client; no new dependencies. Pure-function renderer in cli/drift_report.py; sticky-comment poster in .github/scripts/post_drift_comment.py. Closes #49
…-P3) (#116) Adds the governance/ package implementing the deterministic escalation policy engine plus its contracts foundation and the consolidated finding wrapper. Engine is pure, decomposed, and non-blocking by design (allow_blocking: Literal[False] locks the type so pydantic raises on True). Phase 1 (#109): GovernanceMetadata model on decisions; v14 -> v15 migration adds optional governance flexible-object field; derive_governance_metadata maps L1/L2/L3 to (decision_class, risk_class, escalation_class) defaults; ingest/history thread the metadata through. Phase 2 (#110): GovernanceFinding + GovernancePolicyResult contracts; finding_factories from_compliance_verdict/from_drift_entry/ from_preflight_drift_candidate; consolidate() collapses findings per (decision_id, region_id) pair using _SEMANTIC_SEVERITY ordering. Phase 3 (#108): engine.evaluate() orchestrates four pure helpers; config.py parses .bicameral/governance.yml with safe_load and falls back to transparency_first defaults on malformed YAML; new MCP tool bicameral.evaluate_governance for read-only ad-hoc evaluation; handlers/preflight.py attaches governance_finding to PreflightResponse. Phase 4 (HITL bypass flow for #112) and Phase 5 (docs for #111) ship separately. Phase 3 passes bypass_recency_seconds=None everywhere because Phase 4 hasn't wired the lookup yet. Closes #109, #110 Refs #108 (Phase 4 ships separately for #112) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the deterministic engine into preflight's human-in-the-loop
surface. Five trigger conditions (proposed, ai_surfaced, needs_context,
collision_pending, context_pending) yield HITLPrompts with a mandatory
bypass option. Bypass writes a preflight_prompt_bypassed event via
preflight_telemetry.py and is idempotent within a 1-hour recency
window (V4 spam-bypass guard).
The governance engine reads recent_bypass_seconds at preflight call
time (handlers/preflight.py) and passes it as a scalar to evaluate().
The engine's _apply_bypass_downgrade drops one tier when a bypass
occurred within the window. Engine purity preserved -- IO at the
call site, not in evaluate().
recent_bypass_seconds is F3-bounded: scans at most the last 1000
JSONL lines and breaks early on age > window.
bicameral.record_bypass MCP tool exposes the bypass write to skills;
returns {recorded, deduped} so the skill can distinguish first
bypass from a within-window repeat.
Bypass does NOT mutate decision state. The unresolved signoff_state
persists for future preflight surfaces.
Closes #112
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New docs/semantic-drift-governance.md describes the now-shipped surface across Phases 1-4 of the governance plan: - GovernanceMetadata + L1/L2/L3 default mapping - GovernanceFinding consolidation - Deterministic engine with decomposed helpers - .bicameral/governance.yml config (allow_blocking: Literal[False] locked at the type level) - HITL bypass flow with V4 idempotent record_bypass and F3 bounded tail-read Two Mermaid diagrams cover the lifecycle and the inference-vs- determinism split. Cross-links to docs/preflight-failure-scenarios.md, README.md core concepts, docs/DEV_CYCLE.md §4.5, docs/decision-level.md. Closes #111 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous behavior: the workflow's try/catch swallowed addLabels 403s, logged "Could not label #N: <msg>", and exited 0. The check turned ✅ green despite the label not being applied. Three issues (#44, #49, #65) were silently un-labelled and required manual intervention to surface. New behavior: track failed labels in a list during the loop, log per-issue as before, and at end-of-loop throw with a summary message listing affected issues and a remediation pointer to #104. Job exits non-zero; check turns ❌ red on the merged PR. The maintainer notices the failure and applies labels manually + flips the admin setting. Root cause is admin-side: repo Settings -> Actions -> General -> Workflow permissions must be "Read and write permissions". The job-level `permissions: issues: write` block can only narrow what the repo allows, never expand it. This visibility fix complements the admin fix tracked under #104; it does not replace it. The header comment now points future contributors at both #115 (root cause) and #104 (admin fix). Closes #115 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#129) * feat(#97): extend event vocabulary with ratify + supersede emit/replay Wires the missing decision-status events into the existing JSONL + materializer pipeline so the shipped event vocabulary matches the v0 architecture description (decision_ratified, decision_superseded alongside the existing ingest/bind/link_commit events). Changes: - ledger/adapter.py: add `apply_ratify(decision_id, signoff)` and `apply_supersede(new_id, old_id, ...)` to SurrealDBLedgerAdapter. Both methods are idempotent so the materializer can replay them safely. They wrap the existing inline UPDATE + project + supersedes helpers — no behavioral change for solo mode. - events/team_adapter.py: add wrappers that emit `decision_ratified.completed` and `decision_superseded.completed` events before delegating to the inner adapter. Event payloads carry `canonical_id` (UUIDv5 from description + source_type + source_ref) so cross-author replay can resolve to the peer's local row even though SurrealDB-generated decision ids are per-DB. - events/materializer.py: replay cases for the two new event types. Each looks up the local decision row by canonical_id; warns and skips if not found (out-of-order replay across authors). - handlers/ratify.py: route through `ledger.apply_ratify` instead of inline UPDATE + project_decision_status + update_decision_status. Pre-write idempotency check (early return when state already matches) is unchanged. - handlers/resolve_collision.py: route through `ledger.apply_supersede` for the supersede branch. Edge creation + frozen-signoff merge moves into the adapter so it's reachable from replay. - ledger/queries.py: new `get_canonical_id(client, decision_id)` and `find_decision_by_canonical_id(client, canonical_id)` helpers. Tests: - tests/test_team_event_replay.py (new) — three round-trip tests: ratify, supersede (with edge replay), and ingest regression. Each ingests through team adapter A, then connects a fresh team adapter B pointing at the same JSONL log + a fresh memory:// inner DB and a fresh watermark. Asserts state in B matches what A wrote. - tests/test_preflight_id_plumbing.py — updated the ratify mock to match the new `ledger.apply_ratify` shape. Out of scope (deferred to future PRs): compliance_checked event (Phase 4 uses CHANGEFEED), CHANGEFEED extension to code_subject / subject_identity / binds_to / code_region (schema migration), SHA256 chain (strictly v1). Closes part of #97. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ruff): drop unused find_decision_by_canonical_id import from team_adapter The materializer imports the helper inline at the call site. The top-level import in team_adapter.py was leftover from an earlier draft and never used. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ruff): format pass on touched files Run ruff format on the three files modified in this PR. No semantic change — purely whitespace/argument-split normalization to satisfy `ruff format --check .` in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: CHANGELOG entry for v0.18.0 (#97 event vocabulary extension) Per DEV_CYCLE §7, every user-visible change gets a CHANGELOG entry. This is an additive feature (new event types in the team-mode JSONL log), so it bumps to MINOR per §6.2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…it/replay (#129)" This reverts merge commit c233eb1. Reverted so PR #129 can be re-merged via rebase-and-merge, preserving the 4 original atomic commits (1b24e2e, 7a012d1, b2869e2, 9473648). The squash made the change un-cherry-pickable into triage-from-dev because the opaque commit bundled an additive event-vocabulary feature with intermediate handler refactors that triage-from-dev does not carry. No code change — the same work re-lands as four individually-cherry-pickable commits in the follow-up PR. Pairs with #130 (DEV_CYCLE.md §5.1 / §10.5 amendments that codify this merge-style rule going forward). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
team_server/auth/notion_client.py provides internal-integration-token auth (no OAuth router): load_token resolves NOTION_TOKEN env first, falling back to YAML config's notion.token; raises NotionAuthError if neither is set. Pure async functions over httpx.AsyncClient with Notion-Version pinned to 2022-06-28: list_databases (filtered to object=database), query_database (per-database last_edited_time watermark filter, ascending sort, paginated), fetch_page_blocks (paginated children). team_server/extraction/notion_serializer.py serializes a Notion database row deterministically: title line, then sorted-by-key property lines (title/rich_text/select/multi_select/date/checkbox/number/url/ people branches), then a blank line, then body block plain-text. Byte- stable output is the gating invariant for content_hash stability. team_server/config.py: DEFAULT_CONFIG_PATH constant with BICAMERAL_CONFIG_PATH env-var fallback; Path-typed. Tests: 7 client tests (env-vs-config precedence, MockTransport verification of filter shapes + Notion-Version header pinning + block pagination), 3 serializer tests (ordering, all property-type branches, byte-stability across calls). No new package dependencies — httpx and yaml already in v0 deps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…se 2)
team_server/workers/notion_worker.py polls allowlist-via-share Notion
databases (the integration sees only databases the operator has shared
with it — derived dynamically from notion_client.list_databases, no
separate allowlist table required). Per-database watermark stored in
the new source_watermark table, advanced monotonically as rows
ingest. Partial-failure recovery: watermark advances only to the last
successfully-ingested row's last_edited_time, so the next poll resumes
correctly. Per-database HTTPError is caught and logged so a single
failing database does not block other databases.
Each row's text input is the deterministic serializer output (title +
sorted properties + body); content_hash is SHA256 over that text.
upsert_canonical_extraction returns (extraction, changed); when
changed=True, a peer-authored team_event is written under
PEER_WORKSPACE_ID="notion" (resulting author_email
"team-server@notion.bicameral" via write_team_event's wrapper).
source_type="notion_database_row"; source_ref="{db_id}/{page_id}".
Tests: 9 functionality tests covering database iteration via
list_databases, first-seen-row event, idempotency on unchanged rows,
new event on edited rows, monotonic watermark advancement, watermark-
to-filter wiring, partial-failure recovery, per-database 404 isolation,
content_hash stability across dict insertion-order changes (the
serializer determinism invariant under the polling layer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
team_server/workers/notion_runner.py: thin wrapper run_notion_iteration over notion_worker.poll_once for symmetry with slack_runner (both expose a zero-extra-arg work_fn for the lifespan to register via worker_loop). Internal-integration auth means a single token covers a single workspace; v1 ships single-workspace. team_server/app.py lifespan amended: after Slack worker registration (unconditional), attempts notion_client.load_token via DEFAULT_CONFIG_PATH; on success registers a Notion task via the same worker_loop helper. On NotionAuthError logs INFO and continues without Notion ingest. On shutdown, both tasks are cancelled and awaited symmetrically. Tests: 4 functionality tests covering env-gated startup wiring, off-by-default invariant when token unset, cancellation on shutdown, and inner-loop resilience (single-iteration failure does not exit the loop). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three-round audit cycle (VETO -> VETO -> PASS) for Notion ingest + cache contract migration. Plan ships across five phases: - Phase 0 — cache contract migration (schema v1->v2, schema_version table, callable migration dispatch, upsert_canonical_extraction) - Phase 0.5 — worker-task lifecycle pattern + Slack reference wiring (closes the v0 dormant-Slack-worker gap) - Phase 1 — Notion API client + property serializer (internal- integration auth, no OAuth router) - Phase 2 — Notion ingest worker (per-database watermark, peer- authored team_event) - Phase 3 — Notion task registration on lifespan META_LEDGER entries #29-#33 capture: round-1 VETO (4 missing/ undeclared symbols), round-2 VETO (1 wrong-call-shape for decrypt_token), round-3 PASS, IMPLEMENT, and SUBSTANTIATION. SHADOW_GENOME #7 addendum extends the PARALLEL_STRUCTURE_ASSUMED detection heuristic with three new in-sketch checks: signature, type-boundary, helper-symmetry. The two VETOs in this session are the empirical justification. SYSTEM_STATE.md adds the Priority C v1 section: schema state (v2), architectural properties achieved, audit cycle outcomes, implementation deviations from plan. Merkle seal: SHA256(content_hash + previous_hash) = dcb619104e6d88b97a04689093b80b9f03825f9a24bac3c3b9ab3d0107ff24d7 (content_hash 9f003c40..., previous_hash 6f4f8f8f... = Priority C v0 SEAL at Entry #28). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hase 0) Schema v2->v3: extraction_cache gains classifier_version field (option<string> with DEFAULT 'legacy-pre-v3'). upsert_canonical_extraction now requires classifier_version as keyword-only; cache hit requires BOTH content_hash AND classifier_version match. Either differing triggers re-extraction. The option<string> type accommodates pre-v3 rows whose field reads NONE before the migration's UPDATE backfills them — strict TYPE string would reject those reads (surfaced by the v2-to-v3 backfill integration test added per audit advisory L4-B from the QorLogic Fixer's Layer 4 sweep). _migrate_v2_to_v3 callable: defines the field permissively, then unconditionally UPDATE-backfills rows where classifier_version IS NONE. Idempotent. Workers (slack, notion) pass classifier_version="legacy-pre-v3" until pipeline integration (Phase 4) supplies the real heuristic version. Tests: 14 functionality tests across Phase 0 (cache_upsert/schema adaptations + classifier_version axis verification + v2->v3 backfill integration test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(Phase 1) team_server/extraction/heuristic_classifier.py provides Stage 1 of the extraction pipeline: pure-function classify(message, context, rules) returning ClassificationResult(is_positive, matched_triggers, classifier_version). Deterministic by construction (no LLM, no temperature, no time/uuid/random); rule-set hash drives downstream cache invalidation. Inputs: message dict (text + structural fields), context dict (reactions, thread_position, channel/db_id), TriggerRules (operator- configured + corpus-learned terms). The classifier honors: - keyword positives + keyword negatives (negatives short-circuit) - min_word_count length floor - reaction-count boosters (option d — context-aware) - thread-tail position booster (option d) - learned_keywords merge (option c — populated by Phase 5) derive_classifier_version produces a stable SHA256 hash of the sorted rule-set; changes invalidate the upsert cache via the classifier_version axis added in Phase 0. Tests: 9 functionality tests covering keyword match, negative override, length floor, reaction boost, thread-tail booster, determinism, version-changes-on-rule-change, and unicode/emoji robustness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
team_server/config.py extended with pydantic models for the heuristic trigger rules: HeuristicGlobalRules (workspace-level defaults), HeuristicScopedOverride (per-channel/database additive overrides), SlackHeuristics, NotionHeuristics, NotionConfig, CorpusLearnerConfig. YAML alias 'global:' maps to global_rules field via populate_by_name=True + alias='global' (avoids the Python reserved-word collision). Resolvers resolve_rules_for_slack and resolve_rules_for_notion produce TriggerRules | RulesDisabled, merging global + scoped + learned keywords additively. RulesDisabled is the sentinel for opted-out channels/databases. Backwards compatibility: load_channel_allowlist preserved as an alias for load_rules_from_config so existing v0 OAuth callers continue to work unchanged. Tests: 5 functionality tests covering YAML loading, channel-override merge, database-override merge, disabled-channel sentinel, and ValidationError propagation as ValueError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
team_server/extraction/llm_extractor.py: full rewrite of the v1.0
paragraph-split placeholder. extract(text, matched_triggers) async
calls the Anthropic Messages API (claude-haiku-4-5 default; selectable
via BICAMERAL_TEAM_SERVER_EXTRACT_MODEL env). Returns structured
{"decisions": [{"summary", "context_snippet"}], "extractor_version",
"matched_triggers"}.
Failure handling:
- ANTHROPIC_API_KEY unset: raises MissingAnthropicKeyError (fail-loud)
- HTTP 429: exponential backoff retry (1s, 2s; max 3 attempts)
- HTTP 5xx / network errors: fail-soft with truncated error string
- Unparseable JSON output: fail-soft with parse-failure message
- Non-text content blocks (ToolUseBlock etc.): fail-soft (closes
Fixer L1-C from the proactive code-quality sweep)
Anthropic SDK imported lazily inside extract() so the module remains
importable when anthropic is in requirements.txt but not in dev venv
(matches the slack_sdk lazy-import pattern from v1.0 Phase 0.5).
extractor_version is a SHA256 prefix of the prompt template + model
name, so changes to either invalidate downstream cache via the
classifier_version cousin axis.
Tests: 7 functionality tests covering structured output parsing,
trigger-grounding in prompt, 429 retry, 500 fail-soft, parse-failure
fail-soft, env-overridden model, and fail-loud-on-missing-key.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ge 2 (Phase 4)
team_server/extraction/pipeline.py provides the single entry point
extract_decision_pipeline(*, text, message, context, rules_or_disabled,
llm_extract_fn). Determines the output shape regardless of source:
{decisions, classifier_version, matched_triggers, extractor_version,
skipped}. extractor_version is None when Stage 2 didn't run (chatter,
rules-disabled).
slack_worker._ingest_message: builds context dict (reactions,
thread_position, thread_ts, subtype), resolves rules per channel via
config, routes through pipeline. classifier_version computed cheaply
from rules; the cache check happens BEFORE the LLM call.
notion_worker._ingest_row: builds context dict (last_edited_by,
edit_count), resolves rules per database, routes through pipeline.
Both workers preserve the legacy `extractor(text)` path when config
is None — preserves v1.0 worker tests + provides a clean cutover path
for callers that haven't adopted the rules schema.
Tests: 5 functionality tests covering pipeline short-circuit on
chatter, LLM invocation on positives, rules-disabled passthrough, and
worker-side context handoff for Slack (thread + reactions) and Notion
(edit metadata).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
team_server/extraction/corpus_learner.py reads the team-server's own team_event log (per OQ-1: not the per-repo decision table that doesn't exist server-side), extracts top n-grams from positive-extraction decisions, persists to learned_heuristic_terms with operator-denylist respected. Schema v3->v4 adds learned_heuristic_terms table (UNIQUE on source_type+term). Persistence is upsert-shaped: re-runs update support_count + learned_at without duplicating rows. resolve_rules_for_slack / resolve_rules_for_notion accept a learned=tuple[str, ...] argument that merges into TriggerRules. learned_keywords. The classifier already consumes this via the same match path as operator-configured keywords. app.py lifespan registers a corpus-learner worker via the existing worker_loop helper when config.corpus_learner.enabled is true (default false). Off-by-default; opt-in via YAML. Tests: 7 functionality tests covering n-gram extraction, denylist honor, persistence, determinism, learned-keyword merge, lifespan- on-when-enabled, lifespan-off-when-disabled. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First-round PASS audit cycle for the real heuristic+LLM extractor. Plan ships across six phases (Phase 0 cache contract evolution; Phase 1 deterministic Stage 1 classifier; Phase 2 trigger rules schema; Phase 3 real Anthropic SDK Stage 2; Phase 4 pipeline integration; Phase 5 corpus learner option-c). META_LEDGER entries #34-#36 capture: round-1 PASS audit, IMPLEMENT, and SUBSTANTIATION. Three audit advisories (extract() boundary, TeamServerRules typo, corpus learner table-source) all addressed inline during implementation. A proactive QorLogic Fixer code-quality sweep before commit produced 2 MED + 2 LOW findings; both MEDs landed (fail-soft on non-text content blocks; v2->v3 backfill integration test) with one surfacing a real defect (the migration's TYPE string was rejecting reads on pre-v3 rows with NONE classifier_version; corrected to TYPE option<string>). SYSTEM_STATE.md adds the Priority C v1.1 section: schema state (v4), architectural properties achieved (heuristic-first determinism + LLM-only-when-needed + rule-version-driven cache invalidation + all four "dynamic" angles wired), audit cycle outcomes. Merkle seal: SHA256(content_hash + previous_hash) = b37003661820e2ef80591b9d0cfdeac3df092d6d9b4b5d87e3036e7ccf37d95b (content_hash e8b1b6b6..., previous_hash dcb61910... = Priority C v1 SEAL at Entry #33). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
) team_server/auth/allowlist_sync.py reconciles channel_allowlist against the workspace table from config.slack.workspaces[]: per-team_id additive + subtractive sync. Idempotent; picks up operator YAML edits on next restart. Workspaces in YAML without a corresponding workspace- table row (no OAuth completed yet) are logged and skipped — they get picked up on the next sync after OAuth completes. team_server/app.py lifespan: calls sync_channel_allowlist after ensure_schema + config load, before worker registration. The Slack runner's _channel_ids query sees populated rows on first poll cycle. Sync failures log+continue so a partial YAML doesn't block startup. Config load is now done once at the top of the lifespan body and passed through to both the allowlist sync and the corpus learner registration (deduplication of _load_config_or_default calls). Implementation note: SurrealDB v2 strict-types `record<workspace>` on channel_allowlist.workspace_id requires `type::thing()` coercion (the SELECT id from workspace returns a 'workspace:<rid>' string; passing that string back into CREATE/DELETE without coercion fails the field type check). Pattern matches the v1.0 schema migration's existing use of type::thing in _migrate_v1_to_v2. Tests: 7 functionality tests across allowlist_sync (5: insert / idempotent / skip-not-in-yaml / skip-not-in-db / removal-on-yaml-edit) and lifespan integration (2: lifespan invokes sync at startup; lifespan continues when sync raises). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ge (closes #160 first half) events/team_server_bridge.py provides two pure functions: - is_team_server_payload(payload) — predicate distinguishing team- server-shaped events ({source_type, source_ref, content_hash, extraction}) from legacy CodeLocatorPayload-shaped events - bridge_team_server_payload(payload) — maps to IngestPayload shape (source='slack'|'notion', empty repo/commit_hash, summary→description, context_snippet→source_excerpt). source_type='notion_database_row' normalizes to source='notion'. Handles both new dict-shape decisions and the legacy interim-claude-v1 paragraph-split string-shape. events/team_server_consumer.py spawns a periodic asyncio task that: 1. Calls pull_team_server_events to fetch new events from the team- server's /events HTTP endpoint 2. Filters team-server-shaped events via is_team_server_payload 3. Bridges via bridge_team_server_payload 4. Invokes inner_adapter.ingest_payload directly (bypasses JSONL — team-server events have their own canonical home in the team- server's SurrealDB; per-author JSONL files would be redundant) Defensive unwrap (audit-round-2 Finding A): get_ledger() returns TeamWriteAdapter in team mode; its ingest_payload emits an 'ingest.completed' event via _writer.write BEFORE delegating. Without the unwrap, consumer-driven ingest would echo team-server events into per-dev JSONL files → git push → other devs replay → O(N²) cross-dev replay amplification per team-server event. The `getattr(adapter, "_inner", adapter)` line in start_team_server_consumer_if_configured is the load-bearing control; it falls through to the bare adapter in solo mode (verified: SurrealDBLedgerAdapter has no _inner attribute). server.py serve_stdio: spawns the consumer task in parallel with the existing dashboard sidecar; cancels and awaits on shutdown via try/finally. Opt-in via BICAMERAL_TEAM_SERVER_URL env; consumer task returns None when unset. Tests: 7 functionality tests including test_consumer_unwraps_team_write_adapter_does_not_echo_to_jsonl which constructs a real TeamWriteAdapter with a recording EventFileWriter stub and asserts _writer.write was NOT called — the load-bearing test that catches the audit-round-2 echo-amplification defect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…vents (closes #160 second half) events/materializer.py replay loop adds a dispatch branch for event_type in ('ingest', 'ingest.completed') with a team-server-shaped payload: routes through is_team_server_payload + bridge_team_server_payload (from events/team_server_bridge.py landed in Phase 1.5) and invokes inner_adapter.ingest_payload with the bridged IngestPayload. The new branch sits BEFORE the existing 'ingest.completed' dispatch and is gated on the is_team_server_payload predicate. Legacy CodeLocatorPayload-shaped events with event_type='ingest.completed' fall through unchanged; only team-server-shaped payloads route via the bridge. This closes the second half of #160 — Phase 1.5 closed the load- bearing path (per-dev consumer pulling events directly), while this phase covers the secondary path where team-server events end up in git-tracked JSONL files (e.g., if a future flow appends team-server events to per-author JSONL for offline replay). Defensive infrastructure for v1.next; not load-bearing for v0 functionality. Tests: 6 net-new functionality tests in test_materializer_team_server_pull.py: - dispatches team_server 'ingest' event through bridge - bridges slack extraction to IngestPayload (full shape assertion) - bridges notion_database_row to source='notion' (normalization) - skips events with empty extraction.decisions - legacy 'ingest.completed' with non-team-server payload still routes to original dispatch (regression coverage) - malformed payload (missing 'extraction') is shape-checked and skipped without crashing Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three-round audit cycle (VETO → VETO → PASS) for closing v0 release blockers issues #160 (materializer event_type mismatch) and #161 (channel_allowlist not populated). META_LEDGER entries #37-#41 capture: round-1 VETO (infrastructure- mismatch — pull_team_server_events had zero production callers), round-2 VETO (specification-drift — sketch passed wrapped adapter without unwrap; would echo events O(N²) cross-dev), round-3 PASS, IMPLEMENT, SUBSTANTIATION. SHADOW_GENOME #7 heuristic catalog grew 4→6 across this branch: - Heuristic 5 (upstream-consumer) — Entry #37 - Heuristic 6 (wrapper-side-effect) — Entry #38 The catalog is the productive deposit beyond the code; each heuristic is a durable detection pattern reusable in future audits. SYSTEM_STATE.md adds the v0 release-blockers section: end-to-end ingest pipeline now functional (Slack OAuth → workspace row → YAML allowlist sync → channel_allowlist → Slack worker polls → heuristic+ LLM extraction → team_event → /events HTTP → per-dev consumer pulls → bridges to IngestPayload → per-dev local ledger). Merkle seal: SHA256(content_hash + previous_hash) = 7cc405fc8d39f468d502da669982c88321ce3a84bb571d28e0b14be86ab56bdd (content_hash 14e387b1..., previous_hash b3700366... = Priority C v1.1 SEAL at Entry #36). Closes #160, closes #161. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the user's prompt explicitly contradicts a surfaced decision, the agent now ingests the refinement and wires it via bicameral.resolve_collision(action="supersede"). Closes the v0.9.3 caller-LLM correction-capture loop that died at "render". Mechanical execution; no user-confirmation prompt — PM ratifies in inbox. Canonical action alternatives (keep_both / link_parent) cited from skills/bicameral-resolve-collision/SKILL.md as source-of-truth. Also fixes Section 7's pre-existing feature_group placement bug (top-level kwarg silently dropped by MCP dispatch since v0.x; now correctly placed in decisions[0].feature_group per IngestDecision contract at contracts.py:498). Removes stale .claude/skills/bicameral-preflight/SKILL.md duplicate per CLAUDE.md canonical-source policy (skills/ is canonical). Adds tests/test_e2e_flow_2a_in_default_set.py to gate the e2e Flow 2 contradiction-capture validation surface in CI. Closes #154 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ipt_path Reads Claude Code's SessionEnd hook stdin contract, extracts the parent session's transcript_path, and spawns capture-corrections via `claude -p` with the path propagated through BICAMERAL_PARENT_TRANSCRIPT_PATH env var. Closes the transcript-passing half of #156. Without this bridge, the prior inline shell command spawned `claude -p` with no transcript context, leaving --auto-ingest mode silently no-op. Bridge uses cwd from stdin payload (per Claude Code hook contract), falling back to os.getcwd() for manual invocations. Recursion guard preserved (BICAMERAL_SESSION_END_RUNNING). Defensive: silent no-op on malformed JSON or claude-not-on-PATH; never crashes the parent session. setup_wizard._BICAMERAL_SESSION_END_COMMAND now dispatches via `python3 -m events.session_end_bridge`. skills/bicameral-capture-corrections SKILL.md gains a one-paragraph note documenting the env-var read for --auto-ingest mode. 7 functionality tests cover the stdin → env → subprocess pipeline, including the cwd-from-stdin invariant and the literal-constant guard on the hook-command string. Partially closes #156 (transcript half; design-pivot half deferred to v0.1 per plan boundaries). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan + Merkle-sealed ledger entries for the v0-blocker session that closes #154 (preflight Step 5.6 contradiction-driven refinement capture) and the transcript-passing half of #156 (SessionEnd transcript bridge). Session 2026-05-03T0045-d2a187: 3 audit rounds (rounds 1+2 VETOed for product-taxonomy paraphrase; round 3 PASS after applying the proposed 7th SHADOW_GENOME #7 heuristic — amendment-completeness check via whole-plan grep). Heuristic operationally validated; recommend codifying. Ledger entries #42-#46: - #42: GATE round 1 VETO (infrastructure-mismatch) - #43: GATE round 2 VETO (specification-drift) - #44: GATE round 3 PASS (chain c4fc9944) - #45: IMPLEMENT (chain ceb16cc9) - #46: SEAL (Merkle 61e774e4, content ad6885d6) Closes #154 Partially closes #156 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-fixes 71 ruff errors (mostly I001 import-sort + UP045/UP035/UP007 modernization) accumulated across the team-server v0/v1/v1.1 sessions and Priority B v0 final-blockers session. Pure formatting; no behavioral change. Verified by: 131 team-server + plan-scope tests pass post-reformat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d8f9777 to
a03aebe
Compare
Three real type errors reported by mypy on PR #153/#159 — none are purely cosmetic; each is fixed by tightening a contract: team_server/extraction/llm_extractor.py: - _one_attempt return type changed from tuple[str, object] to tuple[str, list[Any] | str | None]. The three branches (ok / retry / error) already produce list / None / str respectively; the union documents that explicitly so mypy can narrow at the call site. - After the 'ok' branch check, the call to _success(decisions=...) now has an isinstance(payload, list) assertion. Defensive — and satisfies _success's list parameter type. Asserts the existing invariant; doesn't add new behavior. team_server/app.py: - Replace 'from llm_extractor import extract as _interim_extractor' (2-arg signature) with an adapter function that matches the single-arg Extractor protocol the workers' legacy fallback path expects (Callable[[str], Awaitable[dict]]). - Adapter passes matched_triggers=[] because the legacy fallback path fires when rules_or_disabled is None, which means there's no upstream classifier-rule matching producing triggers. The classifier-rules path goes through extract_decision_pipeline directly and never touches this adapter. Verification: - mypy . — 132 source files, no issues - ruff check . — All checks passed - ruff format --check . — 273 files already formatted - pytest tests/test_team_server_app.py tests/test_team_server_allowlist_lifespan.py tests/test_team_server_allowlist_sync.py — 12 passed Refs PR #153 (the dev-targeting variant of this branch)
Collaborator
Author
|
Closing as duplicate of #153 — both PRs share head branch |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two sealed sessions stacked on
claude/priority-c-selective-ingest, ahead of v0 by 13 commits:All four "dynamic" angles wired into the same
TriggerRulesshape: per-workspace YAML, per-channel/db overrides, learned-keyword merge, context-aware boosters.Audit history
TYPE stringrejecting reads on pre-v3 rows with NONE classifier_version).Ledger entries
dcb61910...b37003661820e2ef80591b9d0cfdeac3df092d6d9b4b5d87e3036e7ccf37d95bPlans:
plan-priority-c-team-server-notion-v1.mdplan-priority-c-team-server-real-extractor-v1.mdTest plan
pytest -x tests/test_team_server_*.py tests/test_materializer_team_server_pull.py— full team-server suite (103/103 passing locally)pytest -x tests/ -k "not team_server"— regression check; 8 pre-existing failures intest_alpha_flow,test_bind,test_ephemeral_authoritative,test_v0417_jargon_hygieneare unrelated to this branch (none touch files modified here)memory://SurrealDB; the v2→v3 backfill integration test seeds a v1-shaped extraction_cache row and asserts post-migrationclassifier_version='legacy-pre-v3'httpx.MockTransport-equivalent stubs; no live API calls in CIworker_looplifecycle: registration on lifespan, cancellation on shutdown, single-iteration-failure-doesn't-kill-loop all coveredOut of scope (flagged for follow-up)
'ingest'vs'ingest.completed'inevents/materializer.py:89) — pre-existing v0 gap; team-server emitsevent_type='ingest'but the per-dev materializer doesn't dispatch on it. The extractor's output is dead weight in the materializer chain until this is fixed. Separate plan needed.slack_runnerbut nothing currently populates it. v0 OAuth callback creates the workspace row only; YAML loader exists but isn't invoked. Slack ingest pipeline plumbed but inert until allowlist is populated.🤖 Generated with Claude Code