chore: bump to v0.4.4 — grounding reuse + coverage loop #9

Merged
jinhongkuan merged 1 commit into main from chore/bump-v0.4.4 on Apr 14, 2026

@jinhongkuan jinhongkuan commented Apr 14, 2026

Contributor

Bump version 0.4.3 → 0.4.4 to release the grounding-pipeline improvements that landed in PR #5 + PR #8 since v0.4.3.

What's in v0.4.4 vs v0.4.3

M1 adversarial regression (local)

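| Version | P | R | F1 | TP | FP | FN |
|---------|------|------|------|----|----|----|
| v0.4.3 | 0.81 | 0.87 | 0.84 | 13 | 3 | 2 |
| v0.4.4 | 0.81 | 0.87 | 0.84 | 13 | 3 | 2 |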

Bit-for-bit identical. Expected — grounding-pipeline changes don't affect M1 extraction P/R/F1 (which measures extraction quality against the Opus fixture). Offline suite: 71/71 pass.
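
As a quick arithmetic check, the reported P/R/F1 values follow from the TP/FP/FN counts using the standard definitions. A minimal sketch (the function name `prf1` is illustrative and not taken from the repo):

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw confusion counts."""
    precision = tp / (tp + fp)   # 13 / 16
    recall = tp / (tp + fn)      # 13 / 15
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = prf1(tp=13, fp=3, fn=2)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.81 R=0.87 F1=0.84
```

Since both versions report the same counts, the rounded metrics are necessarily identical.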

Summary by CodeRabbit

  • Chores
    • Version bumped to 0.4.4

Bumps version 0.4.3 → 0.4.4 to release the grounding-pipeline
improvements that landed since v0.4.3.

What's in v0.4.4 (vs v0.4.3):

  - **PR #5** (silong/code-locator-fix-drift): decision grounding
    reuse + 3-tier coverage loop
    • Before BM25, handle_ingest checks the ledger for similar
      previously-grounded intents via search_grounded_intents()
      and reuses their code_regions after live-symbol validation
    • ground_mappings retries with progressively relaxed thresholds
      (strict 0.5/80 → relaxed 0.3/70 → broad 0.1/60) before giving up
    • New cache_hits field on IngestStats; grounding_tier stamped on
      maps_to edge provenance for observability
    • 26 new unit tests (10 vocab cache + 16 coverage loop)
    • Fully deterministic, no LLM added to the grounding path

  - **PR #8**: small test fix — case-insensitive INCLUDE/EXCLUDE
    check + bumped excerpt size ceiling for the v0.4.3 few-shot
    SKILL.md structure.
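
The tier relaxation described under PR #5 can be pictured as a simple retry loop. The sketch below is illustrative only: `GroundingTier`, `ground_with_tiers`, and the `search` callable are hypothetical names, and the commit message does not say what the two numbers in each tier measure, so they are treated here as a score floor and a fuzzy-match cutoff purely as an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GroundingTier:
    name: str
    min_score: float   # assumed meaning of 0.5 / 0.3 / 0.1
    fuzzy_cutoff: int  # assumed meaning of 80 / 70 / 60


# Thresholds copied from the commit message; field meanings are assumptions.
TIERS = (
    GroundingTier("strict", 0.5, 80),
    GroundingTier("relaxed", 0.3, 70),
    GroundingTier("broad", 0.1, 60),
)

# Hypothetical search signature: (intent, min_score, fuzzy_cutoff) -> code regions.
SearchFn = Callable[[str, float, int], list[str]]


def ground_with_tiers(intent: str, search: SearchFn) -> tuple[list[str], Optional[str]]:
    """Try each tier in order; return the regions and the tier that produced them."""
    for tier in TIERS:
        regions = search(intent, tier.min_score, tier.fuzzy_cutoff)
        if regions:
            return regions, tier.name  # tier name could be stamped as provenance
    return [], None  # every tier came back empty: give up


def toy_search(intent: str, min_score: float, fuzzy_cutoff: int) -> list[str]:
    # Toy stand-in: a single candidate that scores 0.35 with a fuzzy match of 72.
    if 0.35 >= min_score and 72 >= fuzzy_cutoff:
        return ["billing.py:charge"]
    return []


regions, tier = ground_with_tiers("retry failed charges", toy_search)
print(regions, tier)
```

In this toy run the strict tier rejects the candidate and the relaxed tier accepts it, so the returned tier name is `relaxed`. The loop is deterministic for a deterministic `search`, in line with the "no LLM added to the grounding path" note.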

M1 adversarial regression (local, before/after the v0.4.4 changes):

  v0.4.3:        P=0.81 R=0.87 F1=0.84  (TP=13 FP=3 FN=2)
  v0.4.4:        P=0.81 R=0.87 F1=0.84  (TP=13 FP=3 FN=2)
                 ^^^^^^^^^^^^^^ identical — extraction quality unchanged

The grounding-pipeline changes are conceptually independent of M1
extraction quality. Cache hits and tier-relaxation only affect
grounded_pct, which was already 100% on the adversarial corpus.

Offline test suite: 71/71 pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jinhongkuan jinhongkuan merged commit aae8809 into main Apr 14, 2026
1 check was pending

coderabbitai Bot commented Apr 14, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0146d4ab-4b89-4f44-9e2f-34dd35ef6bcd

📥 Commits

Reviewing files that changed from the base of the PR and between 0914e6d and 278b755.

📒 Files selected for processing (2)
  • RECOMMENDED_VERSION
  • pyproject.toml

📝 Walkthrough

Version constants were incremented from 0.4.3 to 0.4.4 across two project files to reflect a new release version.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Version Bump: RECOMMENDED_VERSION (constant file), pyproject.toml | Updated version identifiers from 0.4.3 to 0.4.4 in project metadata and version constant definitions. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

🐰 From point-four-three to point-four-four,
A tiny hop, but worth it more!
The versions bump, the tags align,
Release time magic, pure and fine! ✨

silongtan added a commit that referenced this pull request Apr 25, 2026
All four findings verified against current code; only the actionable
ones applied. 81 passed + 1 xfailed in 9.02s.

#1 — skills/bicameral-preflight/SKILL.md sync_metrics note
  The .claude/skills copy got the sync_metrics observability note
  back when V1 A3 shipped, but the canonical skills/ copy never
  did. Mirror the wording verbatim near step 2 so the rendering
  guidance and response-field documentation stay in sync.

#2 — handlers/detect_drift.py per-entry alignment
  The cosmetic-hint enrichment was slicing both head_full and
  wt_full using entry.lines (the baseline anchor). HEAD and the
  working tree can shift the symbol independently, so a single
  index range can't align both sides. The narrow consequence: a
  drifted entry with shifted lines could yield a misleading
  cosmetic_hint=true on bytes that aren't the bound region.

  Fix: re-resolve the symbol against each ref via
  resolve_symbol_lines(file_path, entry.symbol, repo, ref="HEAD")
  and ref="working_tree" separately, slice each ref using its own
  resolved range. Resolution failure on either side → safe default
  of cosmetic_hint=False (matches the V1 contract: "False is cheap,
  True must be earned"). Empty symbol → skip (new fail-safe path).

  Test refactor: test_invalid_lines_skipped renamed to
  test_unresolvable_symbol_skipped — the old test asserted that
  lines=(0,0) was the fail-safe trigger, but entry.lines is no
  longer the alignment input. New test exercises the
  resolve_symbol_lines-returns-None path via a nonexistent symbol
  name, which is the real fail-safe gate now.
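
The per-ref alignment in #2 can be sketched as follows. This is not the code in handlers/detect_drift.py: the `resolve` callable stands in for resolve_symbol_lines with a simplified, hypothetical signature, line ranges are assumed to be 1-based and inclusive, and "cosmetic" is assumed to mean "differs only in whitespace".

```python
from typing import Callable, Optional

# Hypothetical resolver shape: (symbol, ref) -> 1-based inclusive (start, end), or None.
Resolver = Callable[[str, str], Optional[tuple[int, int]]]


def _slice(full_text: str, line_range: tuple[int, int]) -> str:
    start, end = line_range
    return "\n".join(full_text.splitlines()[start - 1:end])


def _normalize(text: str) -> str:
    # Assumption: a cosmetic change is one that only touches whitespace.
    return "".join(text.split())


def cosmetic_hint(symbol: str, head_full: str, wt_full: str, resolve: Resolver) -> bool:
    if not symbol:
        return False  # empty symbol: nothing to resolve, stay on the safe default
    head_range = resolve(symbol, "HEAD")
    wt_range = resolve(symbol, "working_tree")
    if head_range is None or wt_range is None:
        return False  # "False is cheap, True must be earned"
    head_body = _slice(head_full, head_range)  # each side uses its own range
    wt_body = _slice(wt_full, wt_range)
    return head_body != wt_body and _normalize(head_body) == _normalize(wt_body)


head = "def f(x):\n    return x+1\n"
worktree = "# new comment\n\ndef f(x):\n    return x + 1\n"  # symbol shifted down by two lines
ranges = {"HEAD": (1, 2), "working_tree": (3, 4)}

print(cosmetic_hint("f", head, worktree, lambda sym, ref: ranges[ref]))  # True
print(cosmetic_hint("f", head, worktree, lambda sym, ref: None))         # False
```

Slicing both sides with the baseline range (1, 2) would compare `def f(x): ...` against the new comment line, which is the misalignment the fix removes.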

#3 — V2 guide TOC anchor for §9
  GitHub auto-generates fragment IDs from heading text by
  lowercasing, replacing spaces with hyphens, and dropping
  punctuation. "## 9. Acceptance criteria for V2" maps to
  #9-acceptance-criteria-for-v2, but the TOC pointed at
  #9-acceptance-criteria (truncated). Link broken.
  Updated to the correct fragment.
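
The anchor rule described in #3 can be approximated in a few lines. This is a sketch of the rule as stated here (lowercase, drop punctuation, spaces to hyphens), not GitHub's actual implementation; it ignores details such as de-duplicating repeated headings, and `github_fragment` is an illustrative name.

```python
import re


def github_fragment(heading: str) -> str:
    """Approximate the fragment ID generated for a Markdown heading."""
    text = heading.lstrip("#").strip().lower()
    text = re.sub(r"[^\w\- ]", "", text)  # drop punctuation, keep word chars, hyphens, spaces
    return text.replace(" ", "-")


print(github_fragment("## 9. Acceptance criteria for V2"))  # 9-acceptance-criteria-for-v2
```

A TOC checker could compare each link target against the set of fragments computed this way and flag targets such as `9-acceptance-criteria` that match no heading.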

#4 — V2 guide unlabeled fenced code blocks (markdownlint MD040)
  Six fenced opens used bare ``` instead of a labeled fence.
  Tagged each with ```text — the contents are commit listings,
  ASCII DAG diagrams, pseudocode protocols, and tuple notation,
  none of which fit a real language tag. The other fenced blocks
  in the guide (already tagged ```sql / ```python) are unchanged.
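
For reference, the MD040 fix in #4 amounts to labeling every bare opening fence while leaving closing fences and already-labeled fences alone. A minimal sketch, assuming fences are always three backticks and never nested; `label_bare_fences` is a hypothetical helper, not a script from this repo.

```python
FENCE = "`" * 3  # built indirectly so the example can itself sit inside a fenced block


def label_bare_fences(markdown: str, label: str = "text") -> str:
    """Add a language label to opening fences that have none."""
    out = []
    in_block = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if not in_block and stripped.startswith(FENCE):
            if stripped == FENCE:  # bare opening fence: label it
                line = line.replace(FENCE, FENCE + label, 1)
            in_block = True
        elif in_block and stripped == FENCE:
            in_block = False  # closing fence stays bare
        out.append(line)
    return "\n".join(out)


doc = "\n".join(["intro", FENCE, "a -> b", FENCE, FENCE + "python", "x = 1", FENCE])
fixed = label_bare_fences(doc).splitlines()
print(fixed[1], fixed[4])
```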

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
silongtan added a commit that referenced this pull request Apr 26, 2026
silongtan added a commit that referenced this pull request Apr 26, 2026
Knapp-Kevin added a commit that referenced this pull request May 6, 2026
… to research brief (#205)

Addresses Codex first-pass review notes #1, #2, #3, #7, #8, #9 from the
brief's review block. Tier C items + the subsequent Kilo / Gemini /
Codex-2nd-pass review layers are tracked as follow-ups (will be
surfaced in the PR thread for direction).

Changes:

- § 1.4 ingest pipeline: adds explicit "Risk amplification
  (durable-feedback-loop)" paragraph framing ingest as the durable
  write-surface that propagates poisoned content through preflight
  back into the agent's reasoning context. Strengthens LLM-01 + LLM-04
  P0 defensibility (Codex #2).
- § 1.8 skills surface: adds worked before/after example contrasting
  instruction-only `bicameral-report-bug` keys-only commitment vs the
  deterministic `_resolve_signer_email` gate that replaced it in #204.
  Makes the doctrine concrete for non-agent-systems readers (Codex #3).
- § 1.9 team-server: rewrites the dangling "TEAM-NN gaps in § 4"
  promise to "intentionally not enumerated; activation PR authors
  TEAM-NN IDs against actual activated topology" (Codex #8).
- § 2.6 EU AI Act: removes unilateral "limited risk" claim. Now
  describes bicameral-mcp as an AI-adjacent developer-tool component
  whose risk-tier classification properly attaches to the integrated
  system + deployment context, requiring counsel review for any
  specific tier claim (Codex #7).
- § 5 gap synthesis: adds Deployment trigger column (`all` /
  `local-OK` / `team/hosted` / `pre-team` / `hosted`) so severity is
  defensible per deployment shape. SOC2-01 reclassified as
  pre-team/hosted P0 with local-only boundary statement; GDPR-05
  reclassified as team/hosted P1 with local single-user P2; OWASP-03
  reclassified as hosted P1 with local P2 (uv/pipx provides
  install-time lock); OWASP-02 trigger narrowed to team/hosted (Codex #1).
- Appendix method notes: softens "every claim should be verifiable by
  re-reading the cited file at the cited line range" to acknowledge
  that most findings cite components rather than path:line, and
  defers a line-level evidence appendix as a follow-up improvement
  (Codex #9).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Knapp-Kevin added a commit that referenced this pull request May 6, 2026
…+Gemini+Codex-2 (#205)

Authors a single Reviewer Disposition Pass table at the top of the
brief reconciling all 32 review points across four review layers
(Codex first-pass, Kilo, Gemini CLI, Codex second-pass) into one
post-review consensus before downstream P1 issue-filing — per the
explicit Codex-2 #1 directive.

Decisions: 21 applied this commit, 6 already applied in 1d82658,
3 deferred to follow-up, 2 note-only. Net new gap IDs added per
disposition: GDPR-08 (ephemeral data), GDPR-09 (consent versioning
+ revocation), LLM-11 (cross-tool config-file modification surface),
MCP-01 (host UX as external dependency), CFG-01 (config precedence
+ fail-closed model). Reclassification: LLM-06 P0/M → P1/M with
scope narrowed to future remote-skill-loading channel (per Kilo #2).

Major content additions to the brief:

- § 1.1: MCP host UX is external dependency, not security gate (new
  gap MCP-01) — host that auto-approves tool calls bypasses any
  "operator will see this" assumption.
- § 1.2: SurrealDB version pinning supply-chain callout (Kilo #11).
- § 1.7: cross-tool config-file modification surface (new gap LLM-11)
  distinct from skill-content surface — `setup_wizard` writes shell
  commands into `.claude/settings.json` that run host-side at hook fire.
- § 1.11 (new): Configuration precedence + fail-closed model — single
  uniform precedence rule across all knobs (env > config.yaml >
  hardcoded defaults), fail-closed semantics on missing/malformed/
  contradictory config (Codex-2 #5).
- § 2.4 (a): LLM02 mapping note clarifying it folds into LLM-07 +
  OWASP-04 (Kilo #13).
- § 2.4 (b): explicit `confirm=True` is agent-supplied not HITL
  (Kilo #3) — security context cannot rely on agent-filled params.
- § 2.4 (c) LLM-01 + LLM-04: extensible classifier (Gemini #2) +
  guardrail-not-classifier framing (Codex-1 #6) + control-acceptance
  template (Codex-2 #4) — quarantine, override, test fixtures,
  measurement counters.
- § 2.4 (c) LLM-03: timeouts as `.bicameral/config.yaml` knobs (Gemini #3).
- § 2.4 (c) LLM-05 + LLM-09: out-of-band operator confirmation, not
  agent-supplied confirm parameters (Kilo #3).
- § 2.4 (c) LLM-06: scope-narrowed to future remote-skill-loading; in
  current install model the wheel-trust covers it (Kilo #2).
- § 2.4 (c) LLM-11 (new): cross-tool config-file gate (signed
  hooks-manifest.json) distinct from skill manifest.
- § 2.1 (c) GDPR-01: three remediation candidates — tombstone-and-
  rebuild with signed manifest (Kilo #12), crypto-shredding (Gemini
  #1), or scope-out via PII detect-and-refuse.
- § 2.1 (c) GDPR-02: data-subject-access search must cover full
  identifier surface (description, source_ref, topic, file paths) not
  just signer email (Codex-1 #5).
- § 2.1 (c) GDPR-08 (new): ephemeral data surfaces (tempfiles, swap,
  WAL, crash dumps) (Kilo #7).
- § 2.1 (c) GDPR-09 (new): consent versioning + revocation semantics
  (Kilo #8 + Codex-2 #3).
- § 5: gap table updated with new rows + LLM-06 reclassification;
  gap counts post-disposition (5 P0 / 19 P1 / 16 P2 / 5 P3 = 45 total,
  up from 41).
- § 6.1 (new): epic grouping for deferred P1 batch (Codex-1 #10) —
  ingest boundary guardrails, per-tool authority gradation, supply-
  chain signing, telemetry & consent.
- § 6.2 (new): six-section control-acceptance template for every DG
  gap (Codex-2 #4) — positive / negative / bypass / fail-closed /
  telemetry / docs.
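
The § 1.11 precedence rule listed above (env > config.yaml > hardcoded defaults, fail-closed on bad input) could look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: the knob name `ingest_timeout_s`, its default, the `BICAMERAL_` environment-variable naming, and the choice to fail closed by raising rather than by some other refusal mechanism.

```python
from typing import Mapping

DEFAULTS = {"ingest_timeout_s": 30}  # hypothetical knob and default value


class ConfigError(ValueError):
    """Raised instead of silently falling back to a lower-precedence source."""


def resolve_int_knob(
    name: str,
    env: Mapping[str, str],
    file_cfg: Mapping[str, object],
    defaults: Mapping[str, int] = DEFAULTS,
) -> int:
    env_key = "BICAMERAL_" + name.upper()  # hypothetical naming convention
    if env_key in env:
        raw, source = env[env_key], "env"
    elif name in file_cfg:
        raw, source = file_cfg[name], "config.yaml"
    elif name in defaults:
        raw, source = defaults[name], "default"
    else:
        raise ConfigError(f"unknown knob {name!r}")
    try:
        value = int(raw)
    except (TypeError, ValueError) as exc:
        # Fail closed: a malformed high-precedence value is an error,
        # not a reason to fall through to the next source.
        raise ConfigError(f"{name} from {source} is not an integer: {raw!r}") from exc
    if value <= 0:
        raise ConfigError(f"{name} from {source} must be positive, got {value}")
    return value


print(resolve_int_knob("ingest_timeout_s", {"BICAMERAL_INGEST_TIMEOUT_S": "5"}, {"ingest_timeout_s": 10}))
print(resolve_int_knob("ingest_timeout_s", {}, {"ingest_timeout_s": 10}))
print(resolve_int_knob("ingest_timeout_s", {}, {}))
```

The three calls print 5, 10, and 30 in that order, one per precedence level. In real use `env` would be `os.environ` and `file_cfg` the parsed `.bicameral/config.yaml`.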

Filed-issue updates:
- Issue #214 (LLM-06): relabeled P0 → P1, retitled to reflect scope
  narrowing, full disposition comment added.
- Issue #212 (LLM-01) + #213 (LLM-04): disposition comments added
  capturing the guardrail framing, classifier extensibility, and
  control-acceptance template applicable to both.

Deferred for follow-up: Codex-1 #4 (controller/processor
restructure of standards table), Codex-1 #9 (full evidence appendix
beyond the methodology softening), Codex-2 #2 (full 3-column
deployment-profile matrix beyond the single-column trigger).

Brief now 706 lines (up from 606); +124 line diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Knapp-Kevin added a commit to Knapp-Kevin/bicameral-mcp that referenced this pull request May 21, 2026
…lization

**Phase B-1 of N. Does NOT close GDPR-01.** This cycle ships the
load-bearing schema-level deterministic gate (#205 doctrine,
gate_kind: schema) that segregates verbatim transcript text into the
operator-erasable PiiArchive from Phase A. Speakers/source_ref
pseudonymization (Phase B-2), cross-author replay sanitizer (Phase
B-3), and erase-subject CLI + legacy backfill (Phase C) remain.

Plan: plan-221-phase-b-1-ingest-cutover.md (qor-judge PASS at L2
round 3; two prior VETOes captured F-B1-{1,2,3} + F-B2-{1,2,3} as
Shadow Genome Entries #8 and #9 with new heuristics #7-9).

What ships:

- Schema v21→v22 migration: relaxes input_span.text ASSERT to
  "$value != '' OR $this.archive_key != ''". DB-engine-enforced;
  refactor-resistant. Legacy UNIQUE-on-(source_type, source_ref, text)
  index preserved for backward-compat.
- ledger/queries.py::_resolve_span_text(archive, row) — sync helper,
  single point of truth for input_span.text reads. Returns archive
  content when archive_key is set, "[ERASED]" sentinel post-erasure,
  legacy row.text as fallback.
- _ERASED_SENTINEL constant hoisted (load-bearing in helper return
  AND real_spans filter exclusion).
- 7 read sites refactored to route through helper:
    * 4 graph projections in queries.py (get_all_decisions,
      search_by_bm25, get_decisions_for_file, get_decisions_for_files)
    * handlers/history.py:217 enriched-fetch site
    * handlers/remove_source.py audit-telemetry consumer of
      get_input_span_row (post-erasure audit captures sentinel, not
      stale plaintext)
- upsert_input_span gains archive_key parameter — when set, writes
  with text='' and dedup keyed on archive_key; legacy text-only path
  preserved.
- SurrealDBLedgerAdapter.ingest_payload writes verbatim to archive
  before input_span CREATE; falls back to inline-text on archive
  write failure.
- PiiArchive plumbed onto adapter via adapters/ledger.py::get_ledger();
  path from BICAMERAL_PII_ARCHIVE_PATH env or
  ~/.bicameral/pii-archive.db default.
- governance-gates.yaml entry: gate_kind: schema pointing at
  input_span.text ASSERT (strongest deterministic-gate variant).
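
The read-side contract described for _resolve_span_text, together with the relaxed ASSERT, can be sketched as below. This is not the code in ledger/queries.py: the archive is modeled as a plain mapping in which an erased key reads back as `None`, rows are plain dicts, and both function names are illustrative stand-ins.

```python
from typing import Mapping, Optional

ERASED_SENTINEL = "[ERASED]"


def resolve_span_text(archive: Mapping[str, Optional[str]], row: Mapping[str, str]) -> str:
    """Single read path for span text: archive first, legacy inline text as fallback."""
    archive_key = row.get("archive_key", "")
    if archive_key:
        content = archive.get(archive_key)  # assumption: erased entries read back as None
        return content if content is not None else ERASED_SENTINEL
    return row.get("text", "")  # legacy row with no archive_key


def satisfies_text_assert(row: Mapping[str, str]) -> bool:
    # Python mirror of the schema rule: text != '' OR archive_key != ''
    return row.get("text", "") != "" or row.get("archive_key", "") != ""


archive = {"k1": "verbatim transcript line"}
new_row = {"text": "", "archive_key": "k1"}
legacy_row = {"text": "inline legacy text", "archive_key": ""}

print(resolve_span_text(archive, new_row))     # verbatim transcript line
print(resolve_span_text(archive, legacy_row))  # inline legacy text
del archive["k1"]                              # simulate operator erasure
print(resolve_span_text(archive, new_row))     # [ERASED]

# Downstream excerpts would then drop the sentinel rather than display it.
real_spans = [t for t in (resolve_span_text(archive, r) for r in (new_row, legacy_row))
              if t != ERASED_SENTINEL]
```

Routing every read through one helper is what lets an erasure in the archive propagate to all seven read sites without touching the input_span rows themselves.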

Tests (24 new sociable, all passing):
- 6 schema migration tests (deterministic-gate ASSERT, legacy-shape
  acceptance, archive-key-only acceptance, both-empty rejection)
- 8 _resolve_span_text unit tests (sentinel constant, archive path,
  legacy fallback, erasure → sentinel, broken-archive grace, both-
  set archive-wins, idempotency)
- 4 load-bearing erasure propagation tests (audit-required):
    * test_resolve_returns_erased_sentinel_after_archive_erase
    * test_get_all_decisions_filters_erased_sentinel_from_source_excerpt
    * test_legacy_row_with_no_archive_key_still_renders_normally
    * test_ingest_writes_text_to_archive_and_empty_to_input_span
- Regression: 15 test_phase2_ledger + 19 Phase A tests still pass.

Honest non-closure framing per audit:
- Phase B-1 segregates input_span.text only. decision.speakers and
  decision.source_ref remain raw PII surfaces — Phase B-2.
- Cross-author replay sanitizer (events/materializer.py) — Phase B-3.
- erase-subject CLI + legacy-row backfill — Phase C.
- Schema-level UNIQUE-on-archive_key NOT added (legacy rows have
  archive_key='' which would violate UNIQUE on empty values).
  Python-side dedup via get_input_span_id is the gate; partial UNIQUE
  index lands post-backfill.

ruff format + ruff check clean. Roadmap doc updated; Shadow Genome
Entry #9 captures the round-2 VETO lessons (cross-section signature
consistency + sentinel downstream-consumer audit as heuristics #8-9).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>