chore: bump to v0.4.4 - grounding reuse + coverage loop #9
Merged
Bumps version 0.4.3 → 0.4.4 to release the grounding-pipeline improvements that landed since v0.4.3.

What's in v0.4.4 (vs v0.4.3):

- **PR #5** (silong/code-locator-fix-drift): decision grounding reuse + 3-tier coverage loop
  - Before BM25, handle_ingest checks the ledger for similar previously-grounded intents via search_grounded_intents() and reuses their code_regions after live-symbol validation
  - ground_mappings retries with progressively relaxed thresholds (strict 0.5/80 → relaxed 0.3/70 → broad 0.1/60) before giving up
  - New cache_hits field on IngestStats; grounding_tier stamped on maps_to edge provenance for observability
  - 26 new unit tests (10 vocab cache + 16 coverage loop)
  - Fully deterministic, no LLM added to the grounding path
- **PR #8**: small test fix: case-insensitive INCLUDE/EXCLUDE check + bumped excerpt size ceiling for the v0.4.3 few-shot SKILL.md structure.

M1 adversarial regression (local, before/after the v0.4.4 changes):

- v0.4.3: P=0.81 R=0.87 F1=0.84 (TP=13 FP=3 FN=2)
- v0.4.4: P=0.81 R=0.87 F1=0.84 (TP=13 FP=3 FN=2)

The two rows are identical: extraction quality is unchanged.

The grounding-pipeline changes are conceptually independent of M1 extraction quality. Cache hits and tier-relaxation only affect grounded_pct, which was already 100% on the adversarial corpus.

Offline test suite: 71/71 pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
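The tier-relaxation retry described for ground_mappings can be sketched roughly as below. This is a minimal illustration, not the code from PR #5: the description does not say what the two numbers in each pair mean, so the sketch assumes a score floor and a fuzzy-match floor, and `Tier`, `ground_with_tiers`, and the `search` callback are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    min_score: float  # assumed meaning of the first number in each pair
    min_fuzzy: int    # assumed meaning of the second number in each pair


# Thresholds copied from the PR description: strict -> relaxed -> broad.
TIERS = (
    Tier("strict", 0.5, 80),
    Tier("relaxed", 0.3, 70),
    Tier("broad", 0.1, 60),
)

# Hypothetical search callback: (intent, min_score, min_fuzzy) -> code regions.
SearchFn = Callable[[str, float, int], List[str]]


def ground_with_tiers(intent: str, search: SearchFn) -> Tuple[List[str], Optional[str]]:
    """Try each tier in order and stop at the first one that yields regions.

    Returns the regions plus the tier name, so a caller could stamp it on
    edge provenance. Returns ([], None) when every tier comes back empty.
    """
    for tier in TIERS:
        regions = search(intent, tier.min_score, tier.min_fuzzy)
        if regions:
            return regions, tier.name
    return [], None


def fake_search(intent: str, min_score: float, min_fuzzy: int) -> List[str]:
    # Toy backend: the only candidate scores 0.35, so the strict tier misses it.
    return ["billing.py:charge"] if min_score <= 0.35 else []


regions, tier_name = ground_with_tiers("charge the customer", fake_search)
```

Because the loop is a fixed sequence of deterministic lookups, it stays consistent with the "no LLM added to the grounding path" claim.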
silongtan
added a commit
that referenced
this pull request
Apr 25, 2026
All four findings verified against current code; only the actionable ones applied. 81 passed + 1 xfailed in 9.02s.

#1: skills/bicameral-preflight/SKILL.md sync_metrics note

The .claude/skills copy got the sync_metrics observability note back when V1 A3 shipped, but the canonical skills/ copy never did. Mirror the wording verbatim near step 2 so the rendering guidance and response-field documentation stay in sync.

#2: handlers/detect_drift.py per-entry alignment

The cosmetic-hint enrichment was slicing both head_full and wt_full using entry.lines (the baseline anchor). HEAD and the working tree can shift the symbol independently, so a single index range can't align both sides. The narrow consequence: a drifted entry with shifted lines could yield a misleading cosmetic_hint=true on bytes that aren't the bound region.

Fix: re-resolve the symbol against each ref via resolve_symbol_lines(file_path, entry.symbol, repo, ref="HEAD") and ref="working_tree" separately, and slice each ref using its own resolved range. Resolution failure on either side → safe default of cosmetic_hint=False (matches the V1 contract: "False is cheap, True must be earned"). Empty symbol → skip (new fail-safe path).

Test refactor: test_invalid_lines_skipped renamed to test_unresolvable_symbol_skipped. The old test asserted that lines=(0,0) was the fail-safe trigger, but entry.lines is no longer the alignment input. The new test exercises the resolve_symbol_lines-returns-None path via a nonexistent symbol name, which is the real fail-safe gate now.

#3: V2 guide TOC anchor for §9

GitHub auto-generates fragment IDs from heading text by lowercasing, replacing spaces with hyphens, and dropping punctuation. "## 9. Acceptance criteria for V2" maps to #9-acceptance-criteria-for-v2, but the TOC pointed at #9-acceptance-criteria (truncated), so the link was broken. Updated to the correct fragment.

#4: V2 guide unlabeled fenced code blocks (markdownlint MD040)

Six fenced opens used a bare ``` instead of a labeled fence. Tagged each with ```text: the contents are commit listings, ASCII DAG diagrams, pseudocode protocols, and tuple notation, none of which fit a real language tag. The other fenced blocks in the guide (already tagged ```sql / ```python) are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
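The anchor rule stated in finding #3 (lowercase, spaces to hyphens, punctuation dropped) can be checked with a small sketch. This is a simplified approximation of GitHub's behaviour under exactly those three rules; the function name `github_style_anchor` is hypothetical, and details such as de-duplicating repeated headings are deliberately left out.

```python
import re


def github_style_anchor(heading: str) -> str:
    """Approximate the fragment ID for a Markdown heading.

    Applies only the three rules named in the commit message: lowercase,
    drop punctuation, replace spaces with hyphens.
    """
    text = heading.lstrip("#").strip().lower()
    # Keep word characters, hyphens and spaces; drop everything else.
    text = re.sub(r"[^\w\- ]", "", text)
    return text.replace(" ", "-")


anchor = github_style_anchor("## 9. Acceptance criteria for V2")
```

Running a check like this over a table of contents would flag truncated fragments such as `#9-acceptance-criteria` before they ship.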
9 tasks
Knapp-Kevin
added a commit
that referenced
this pull request
May 6, 2026
… to research brief (#205)

Addresses Codex first-pass review notes #1, #2, #3, #7, #8, #9 from the brief's review block. Tier C items + the subsequent Kilo / Gemini / Codex-2nd-pass review layers are tracked as follow-ups (will be surfaced in the PR thread for direction).

Changes:

- § 1.4 ingest pipeline: adds explicit "Risk amplification (durable-feedback-loop)" paragraph framing ingest as the durable write-surface that propagates poisoned content through preflight back into the agent's reasoning context. Strengthens LLM-01 + LLM-04 P0 defensibility (Codex #2).
- § 1.8 skills surface: adds worked before/after example contrasting instruction-only `bicameral-report-bug` keys-only commitment vs the deterministic `_resolve_signer_email` gate that replaced it in #204. Makes the doctrine concrete for non-agent-systems readers (Codex #3).
- § 1.9 team-server: rewrites the dangling "TEAM-NN gaps in § 4" promise to "intentionally not enumerated; activation PR authors TEAM-NN IDs against actual activated topology" (Codex #8).
- § 2.6 EU AI Act: removes unilateral "limited risk" claim. Now describes bicameral-mcp as an AI-adjacent developer-tool component whose risk-tier classification properly attaches to the integrated system + deployment context, requiring counsel review for any specific tier claim (Codex #7).
- § 5 gap synthesis: adds Deployment trigger column (`all` / `local-OK` / `team/hosted` / `pre-team` / `hosted`) so severity is defensible per deployment shape. SOC2-01 reclassified as pre-team/hosted P0 with local-only boundary statement; GDPR-05 reclassified as team/hosted P1 with local single-user P2; OWASP-03 reclassified as hosted P1 with local P2 (uv/pipx provides install-time lock); OWASP-02 trigger narrowed to team/hosted (Codex #1).
- Appendix method notes: softens "every claim should be verifiable by re-reading the cited file at the cited line range" to acknowledge that most findings cite components rather than path:line, and defers a line-level evidence appendix as a follow-up improvement (Codex #9).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Knapp-Kevin
added a commit
that referenced
this pull request
May 6, 2026
…+Gemini+Codex-2 (#205)

Authors a single Reviewer Disposition Pass table at the top of the brief reconciling all 32 review points across four review layers (Codex first-pass, Kilo, Gemini CLI, Codex second-pass) into one post-review consensus before downstream P1 issue-filing, per the explicit Codex-2 #1 directive.

Decisions: 21 applied this commit, 6 already applied in 1d82658, 3 deferred to follow-up, 2 note-only.

Net new gap IDs added per disposition: GDPR-08 (ephemeral data), GDPR-09 (consent versioning + revocation), LLM-11 (cross-tool config-file modification surface), MCP-01 (host UX as external dependency), CFG-01 (config precedence + fail-closed model).

Reclassification: LLM-06 P0/M → P1/M with scope narrowed to future remote-skill-loading channel (per Kilo #2).

Major content additions to the brief:

- § 1.1: MCP host UX is external dependency, not security gate (new gap MCP-01): a host that auto-approves tool calls bypasses any "operator will see this" assumption.
- § 1.2: SurrealDB version pinning supply-chain callout (Kilo #11).
- § 1.7: cross-tool config-file modification surface (new gap LLM-11) distinct from skill-content surface: `setup_wizard` writes shell commands into `.claude/settings.json` that run host-side at hook fire.
- § 1.11 (new): Configuration precedence + fail-closed model: single uniform precedence rule across all knobs (env > config.yaml > hardcoded defaults), fail-closed semantics on missing/malformed/contradictory config (Codex-2 #5).
- § 2.4 (a): LLM02 mapping note clarifying it folds into LLM-07 + OWASP-04 (Kilo #13).
- § 2.4 (b): explicit `confirm=True` is agent-supplied not HITL (Kilo #3): security context cannot rely on agent-filled params.
- § 2.4 (c) LLM-01 + LLM-04: extensible classifier (Gemini #2) + guardrail-not-classifier framing (Codex-1 #6) + control-acceptance template (Codex-2 #4): quarantine, override, test fixtures, measurement counters.
- § 2.4 (c) LLM-03: timeouts as `.bicameral/config.yaml` knobs (Gemini #3).
- § 2.4 (c) LLM-05 + LLM-09: out-of-band operator confirmation, not agent-supplied confirm parameters (Kilo #3).
- § 2.4 (c) LLM-06: scope-narrowed to future remote-skill-loading; in current install model the wheel-trust covers it (Kilo #2).
- § 2.4 (c) LLM-11 (new): cross-tool config-file gate (signed hooks-manifest.json) distinct from skill manifest.
- § 2.1 (c) GDPR-01: three remediation candidates: tombstone-and-rebuild with signed manifest (Kilo #12), crypto-shredding (Gemini #1), or scope-out via PII detect-and-refuse.
- § 2.1 (c) GDPR-02: data-subject-access search must cover full identifier surface (description, source_ref, topic, file paths), not just signer email (Codex-1 #5).
- § 2.1 (c) GDPR-08 (new): ephemeral data surfaces (tempfiles, swap, WAL, crash dumps) (Kilo #7).
- § 2.1 (c) GDPR-09 (new): consent versioning + revocation semantics (Kilo #8 + Codex-2 #3).
- § 5: gap table updated with new rows + LLM-06 reclassification; gap counts post-disposition (5 P0 / 19 P1 / 16 P2 / 5 P3 = 45 total, up from 41).
- § 6.1 (new): epic grouping for deferred P1 batch (Codex-1 #10): ingest boundary guardrails, per-tool authority gradation, supply-chain signing, telemetry & consent.
- § 6.2 (new): six-section control-acceptance template for every DG gap (Codex-2 #4): positive / negative / bypass / fail-closed / telemetry / docs.

Filed-issue updates:

- Issue #214 (LLM-06): relabeled P0 → P1, retitled to reflect scope narrowing, full disposition comment added.
- Issue #212 (LLM-01) + #213 (LLM-04): disposition comments added capturing the guardrail framing, classifier extensibility, and control-acceptance template applicable to both.

Deferred for follow-up: Codex-1 #4 (controller/processor restructure of standards table), Codex-1 #9 (full evidence appendix beyond the methodology softening), Codex-2 #2 (full 3-column deployment-profile matrix beyond the single-column trigger).

Brief now 706 lines (up from 606); +124 line diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
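The precedence rule named in § 1.11 (env > config.yaml > hardcoded defaults, fail-closed on malformed values) might look like the sketch below. It is only an illustration of the rule as worded in the commit message: the knob name `ingest_timeout_s`, the `BICAMERAL_` env prefix as a general convention, and the `resolve_knob` / `ConfigError` names are all assumptions, and the YAML file is represented by an already-parsed mapping.

```python
import os
from typing import Any, Mapping, Optional

# Hypothetical hardcoded defaults; the real knob names are not given in the brief.
DEFAULTS = {"ingest_timeout_s": 30}


class ConfigError(ValueError):
    """Raised instead of silently falling back: the fail-closed behaviour."""


def resolve_knob(
    name: str,
    file_cfg: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
    env_prefix: str = "BICAMERAL_",  # assumed prefix
) -> int:
    """Resolve one integer knob with env > config file > default precedence."""
    env = os.environ if env is None else env
    env_key = env_prefix + name.upper()

    if env_key in env:
        raw: Any = env[env_key]
    elif name in file_cfg:
        raw = file_cfg[name]
    elif name in DEFAULTS:
        return DEFAULTS[name]
    else:
        raise ConfigError(f"unknown knob {name!r}")

    # Fail closed: a malformed value is an error, not a reason to use the default.
    try:
        value = int(raw)
    except (TypeError, ValueError) as exc:
        raise ConfigError(f"malformed value for {name!r}: {raw!r}") from exc
    if value <= 0:
        raise ConfigError(f"{name!r} must be positive, got {value}")
    return value


from_env = resolve_knob(
    "ingest_timeout_s",
    {"ingest_timeout_s": 10},
    env={"BICAMERAL_INGEST_TIMEOUT_S": "5"},
)
```

The design point is that a bad value at a higher-precedence layer raises rather than falling through to a lower layer, which is one reading of "fail-closed" for contradictory or malformed config.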
Merged
3 tasks
Knapp-Kevin
added a commit
to Knapp-Kevin/bicameral-mcp
that referenced
this pull request
May 21, 2026
…lization
**Phase B-1 of N. Does NOT close GDPR-01.** This cycle ships the
load-bearing schema-level deterministic gate (#205 doctrine,
gate_kind: schema) that segregates verbatim transcript text into the
operator-erasable PiiArchive from Phase A. Speakers/source_ref
pseudonymization (Phase B-2), cross-author replay sanitizer (Phase
B-3), and erase-subject CLI + legacy backfill (Phase C) remain.
Plan: plan-221-phase-b-1-ingest-cutover.md (qor-judge PASS at L2
round 3; two prior VETOes captured F-B1-{1,2,3} + F-B2-{1,2,3} as
Shadow Genome Entries BicameralAI#8 and BicameralAI#9 with new heuristics BicameralAI#7-9).
What ships:
- Schema v21→v22 migration: relaxes input_span.text ASSERT to
"$value != '' OR $this.archive_key != ''". DB-engine-enforced;
refactor-resistant. Legacy UNIQUE-on-(source_type, source_ref, text)
index preserved for backward-compat.
- ledger/queries.py::_resolve_span_text(archive, row) — sync helper,
single point of truth for input_span.text reads. Returns archive
content when archive_key is set, "[ERASED]" sentinel post-erasure,
legacy row.text as fallback.
- _ERASED_SENTINEL constant hoisted (load-bearing in helper return
AND real_spans filter exclusion).
- 7 read sites refactored to route through helper:
* 4 graph projections in queries.py (get_all_decisions,
search_by_bm25, get_decisions_for_file, get_decisions_for_files)
* handlers/history.py:217 enriched-fetch site
* handlers/remove_source.py audit-telemetry consumer of
get_input_span_row (post-erasure audit captures sentinel, not
stale plaintext)
- upsert_input_span gains archive_key parameter — when set, writes
with text='' and dedup keyed on archive_key; legacy text-only path
preserved.
- SurrealDBLedgerAdapter.ingest_payload writes verbatim to archive
before input_span CREATE; falls back to inline-text on archive
write failure.
- PiiArchive plumbed onto adapter via adapters/ledger.py::get_ledger();
path from BICAMERAL_PII_ARCHIVE_PATH env or
~/.bicameral/pii-archive.db default.
- governance-gates.yaml entry: gate_kind: schema pointing at
input_span.text ASSERT (strongest deterministic-gate variant).
Tests (24 new sociable, all passing):
- 6 schema migration tests (deterministic-gate ASSERT, legacy-shape
acceptance, archive-key-only acceptance, both-empty rejection)
- 8 _resolve_span_text unit tests (sentinel constant, archive path,
legacy fallback, erasure → sentinel, broken-archive grace, both-
set archive-wins, idempotency)
- 4 load-bearing erasure propagation tests (audit-required):
* test_resolve_returns_erased_sentinel_after_archive_erase
* test_get_all_decisions_filters_erased_sentinel_from_source_excerpt
* test_legacy_row_with_no_archive_key_still_renders_normally
* test_ingest_writes_text_to_archive_and_empty_to_input_span
- Regression: 15 test_phase2_ledger + 19 Phase A tests still pass.
Honest non-closure framing per audit:
- Phase B-1 segregates input_span.text only. decision.speakers and
decision.source_ref remain raw PII surfaces — Phase B-2.
- Cross-author replay sanitizer (events/materializer.py) — Phase B-3.
- erase-subject CLI + legacy-row backfill — Phase C.
- Schema-level UNIQUE-on-archive_key NOT added (legacy rows have
archive_key='' which would violate UNIQUE on empty values).
Python-side dedup via get_input_span_id is the gate; partial UNIQUE
index lands post-backfill.
ruff format + ruff check clean. Roadmap doc updated; Shadow Genome
Entry BicameralAI#9 captures the round-2 VETO lessons (cross-section signature
consistency + sentinel downstream-consumer audit as heuristics BicameralAI#8-9).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
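The read-path contract described for `_resolve_span_text` (archive content when archive_key is set, the "[ERASED]" sentinel after erasure, legacy row.text otherwise) can be illustrated with a small stand-in. This is a sketch under assumptions, not the helper from ledger/queries.py: the archive is modelled as any object with a `get(key)` method that returns `None` once the entry is erased, and rows are plain mappings.

```python
from typing import Mapping, Optional, Protocol

ERASED_SENTINEL = "[ERASED]"  # sentinel text taken from the commit message


class Archive(Protocol):
    """Assumed archive interface: get() returns None for erased or missing keys."""

    def get(self, key: str) -> Optional[str]: ...


def resolve_span_text(archive: Archive, row: Mapping[str, str]) -> str:
    """Single read point for span text.

    archive_key set     -> archive content, or the sentinel if it was erased
    archive_key not set -> legacy inline row text
    """
    archive_key = row.get("archive_key") or ""
    if archive_key:
        content = archive.get(archive_key)
        return content if content is not None else ERASED_SENTINEL
    return row.get("text") or ""


# A plain dict satisfies the assumed interface for demonstration purposes.
archive = {"k1": "verbatim transcript text"}
new_row = {"archive_key": "k1", "text": ""}
legacy_row = {"archive_key": "", "text": "inline legacy text"}

before_erase = resolve_span_text(archive, new_row)
del archive["k1"]  # simulate operator erasure
after_erase = resolve_span_text(archive, new_row)
legacy = resolve_span_text(archive, legacy_row)

# Downstream consumers can exclude erased spans by comparing against the sentinel.
real_spans = [t for t in (after_erase, legacy) if t != ERASED_SENTINEL]
```

Routing every read through one helper is what lets an erase in the archive propagate to all read sites without touching each call site again.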
Bump version 0.4.3 → 0.4.4 to release the grounding-pipeline improvements that landed in PR #5 + PR #8 since v0.4.3.
What's in v0.4.4 vs v0.4.3
`cache_hits` field on `IngestStats`, `grounding_tier` stamped on `maps_to` edge provenance, 26 new unit tests. Fully deterministic, no LLM added to grounding path.
M1 adversarial regression (local)
Bit-for-bit identical. Expected: grounding-pipeline changes don't affect M1 extraction P/R/F1 (which measures extraction quality against the Opus fixture). Offline suite: 71/71 pass.
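For reference, the P/R/F1 figures follow from the confusion counts reported in the description (TP=13, FP=3, FN=2). The sketch below uses exact fractions so no rounding choice is baked in; the function name `prf1` is just illustrative.

```python
from fractions import Fraction
from typing import Tuple


def prf1(tp: int, fp: int, fn: int) -> Tuple[Fraction, Fraction, Fraction]:
    """Precision, recall and F1 as exact fractions."""
    precision = Fraction(tp, tp + fp)
    recall = Fraction(tp, tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = prf1(tp=13, fp=3, fn=2)
# p  = 13/16 = 0.8125
# r  = 13/15 ≈ 0.8667
# f1 = 26/31 ≈ 0.8387
```

These agree with the reported 0.81 / 0.87 / 0.84 at two decimals; precision is exactly 0.8125, which the report shows as 0.81.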