
feat(pii-archive): #221 Phase B-1 — ingest cutover + read-path centralization (NOT closure)#356

Merged: jinhongkuan merged 1 commit into dev from feat/221-phase-b-ingest-cutover on May 15, 2026

@Knapp-Kevin

Collaborator

Status

Phase B-1 of N. GDPR-01 audit gap remains OPEN. This PR ships the load-bearing schema-level deterministic gate (#205 doctrine, gate_kind: schema) that segregates verbatim transcript text into the operator-erasable PiiArchive (Phase A primitive). Speakers/source_ref pseudonymization (Phase B-2), cross-author replay sanitizer (Phase B-3), and erase-subject CLI + legacy backfill (Phase C) remain.

Plan + audit history

Plan: plan-221-phase-b-1-ingest-cutover.md — qor-judge PASS at L2 in round 3.

Two prior VETOes (findings F-B1-{1,2,3} and F-B2-{1,2,3}) are captured as Shadow Genome Entries #8 and #9, with new heuristics #7-9.

What ships

  • Schema v21→v22: relaxes input_span.text ASSERT to $value != '' OR $this.archive_key != ''. DB-engine-enforced; refactor-resistant.
  • _resolve_span_text(archive, row) — sync helper at ledger/queries.py; single point of truth for input_span.text reads. Returns archive content / [ERASED] sentinel / legacy row.text.
  • _ERASED_SENTINEL constant hoisted; load-bearing in helper return AND real_spans filter exclusion.
  • 7 read sites refactored to route through helper:
    • 4 graph projections in queries.py (get_all_decisions, search_by_bm25, get_decisions_for_file, get_decisions_for_files)
    • handlers/history.py:217 enriched-fetch site
    • handlers/remove_source.py audit-telemetry consumer of get_input_span_row
  • upsert_input_span gains archive_key parameter — when set, writes with text='' and dedups on archive_key.
  • SurrealDBLedgerAdapter.ingest_payload writes verbatim to archive before input_span CREATE; falls back to inline-text on archive failure.
  • PiiArchive plumbed onto adapter via adapters/ledger.py::get_ledger().
  • governance-gates.yaml entry: gate_kind: schema pointing at the ASSERT (strongest deterministic-gate variant).
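
The three return paths of the read helper described above can be sketched as follows. This is a minimal illustration, not the code in ledger/queries.py: the `DictArchive` class and its `put`/`get`/`erase` methods are hypothetical stand-ins for PiiArchive, rows are assumed to be plain mappings, and the broken-archive behaviour (fall back to inline text) is an assumption.

```python
from typing import Any, Mapping, Optional

_ERASED_SENTINEL = "[ERASED]"


class DictArchive:
    """In-memory stand-in for PiiArchive (hypothetical API, illustration only)."""

    def __init__(self):
        self._rows = {}

    def put(self, key: str, text: str) -> None:
        self._rows[key] = text

    def get(self, key: str) -> Optional[str]:
        # Returns None once the key has been erased (or never existed).
        return self._rows.get(key)

    def erase(self, key: str) -> None:
        self._rows.pop(key, None)


def resolve_span_text(archive: Optional[DictArchive], row: Mapping[str, Any]) -> str:
    """Resolve the readable text for one input_span row."""
    archive_key = row.get("archive_key") or ""
    if archive is not None and archive_key:
        try:
            content = archive.get(archive_key)
        except Exception:
            # Assumption: a broken archive degrades to whatever inline text exists.
            return row.get("text") or ""
        # Archive wins when both fields are set; a missing entry means erased.
        return content if content is not None else _ERASED_SENTINEL
    # Legacy rows carry their text inline and have no archive_key.
    return row.get("text") or ""
```

Calling the sketch twice on the same row yields the same result, and a row with both `text` and `archive_key` set resolves to the archive content, which matches the "both-set archive-wins" and "idempotency" cases named in the test plan.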

Honest non-closure framing

  • Phase B-1 segregates input_span.text only. decision.speakers and decision.source_ref remain raw PII surfaces — Phase B-2.
  • Cross-author replay sanitizer (events/materializer.py) — Phase B-3.
  • erase-subject CLI + legacy-row backfill — Phase C.
  • Schema-level UNIQUE-on-archive_key NOT added (legacy rows have archive_key='' which would violate UNIQUE on empty values). Python-side dedup via get_input_span_id is the gate; partial UNIQUE index lands post-backfill.
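
A rough sketch of what Python-side dedup in front of the write can look like while no UNIQUE index on archive_key exists. The `InMemorySpans` class is a toy stand-in for the input_span table, and the method signatures are assumptions chosen for illustration; only the names `get_input_span_id` and `upsert_input_span` come from this PR.

```python
class InMemorySpans:
    """Toy stand-in for the input_span table; not the real ledger client."""

    def __init__(self):
        self.rows = []

    def get_input_span_id(self, source_type, source_ref, text="", archive_key=""):
        """Return the index of a matching row, or None if there is none."""
        for idx, row in enumerate(self.rows):
            if archive_key:
                # Archive path: the archive key alone identifies the span.
                if row["archive_key"] == archive_key:
                    return idx
            elif (row["archive_key"] == ""
                  and (row["source_type"], row["source_ref"], row["text"])
                  == (source_type, source_ref, text)):
                # Legacy path: mirrors the (source_type, source_ref, text) index.
                return idx
        return None

    def upsert_input_span(self, source_type, source_ref, text="", archive_key=""):
        """Insert a span unless an equivalent one exists; return its id."""
        # When an archive key is supplied, the verbatim text is never stored inline.
        stored_text = "" if archive_key else text
        existing = self.get_input_span_id(
            source_type, source_ref, stored_text, archive_key
        )
        if existing is not None:
            return existing
        self.rows.append({
            "source_type": source_type,
            "source_ref": source_ref,
            "text": stored_text,
            "archive_key": archive_key,
        })
        return len(self.rows) - 1
```

Because the check and the insert are separate steps, a gate like this is only as strong as its callers: two concurrent writers could both miss the lookup. That is presumably why the partial UNIQUE index is still planned once the backfill removes the empty keys.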

Test plan

  • 24 new sociable tests, all passing:
    • 6 schema migration tests (deterministic-gate ASSERT acceptance + rejection paths)
    • 8 _resolve_span_text unit tests (sentinel constant, archive path, legacy fallback, erasure → sentinel, broken-archive grace, both-set archive-wins, idempotency)
    • 4 load-bearing erasure propagation tests (audit-required):
      • test_resolve_returns_erased_sentinel_after_archive_erase
      • test_get_all_decisions_filters_erased_sentinel_from_source_excerpt
      • test_legacy_row_with_no_archive_key_still_renders_normally
      • test_ingest_writes_text_to_archive_and_empty_to_input_span
  • Regression: test_phase2_ledger (15) + Phase A tests (19) still pass.
  • ruff format + ruff check clean.
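
For readers unfamiliar with the erasure propagation tests, the sketch below shows the shape of the property they guard: once a span resolves to the sentinel, it must not surface in a source excerpt. The `source_excerpt` function and its `max_chars` parameter are hypothetical; the actual projection code in queries.py is not shown in this PR description.

```python
_ERASED_SENTINEL = "[ERASED]"


def source_excerpt(resolved_texts, max_chars=200):
    """Pick the first real span text, skipping empty and erased spans."""
    real_spans = [t for t in resolved_texts if t and t != _ERASED_SENTINEL]
    return real_spans[0][:max_chars] if real_spans else ""


def test_erased_span_never_reaches_excerpt():
    # An erased span is skipped in favour of the next real one.
    mixed = [_ERASED_SENTINEL, "We agreed to ship on Friday."]
    assert source_excerpt(mixed) == "We agreed to ship on Friday."
    # If every span is erased, the excerpt is empty, not the sentinel.
    assert source_excerpt([_ERASED_SENTINEL]) == ""
```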

Plan revisions vs. originally audited

Files touched (12)

| File | Change |
| --- | --- |
| ledger/schema.py | v22 ASSERT + migration function + registry entry |
| ledger/queries.py | helper, sentinel constant, 4 read-site refactors, upsert_input_span archive_key path |
| ledger/adapter.py | ingest_payload writes to archive before input_span CREATE |
| adapters/ledger.py | PiiArchive plumbed onto adapter |
| handlers/history.py | enriched-fetch site refactored |
| handlers/remove_source.py | audit-telemetry routed through helper |
| governance-gates.yaml | gate_kind: schema entry |
| docs/policies/gdpr-art-17-erasure-roadmap.md | Phase B-1 marked shipped; B-2/B-3/C carved out |
| docs/SHADOW_GENOME.md | Entries #8 + #9 with heuristics #7-9 |
| tests/test_pii_archive_schema_migration_b1.py | new |
| tests/test_resolve_span_text_unit.py | new |
| tests/test_history_erasure_propagation.py | new |

🤖 Generated with Claude Code

…lization

**Phase B-1 of N. Does NOT close GDPR-01.** This cycle ships the
load-bearing schema-level deterministic gate (#205 doctrine,
gate_kind: schema) that segregates verbatim transcript text into the
operator-erasable PiiArchive from Phase A. Speakers/source_ref
pseudonymization (Phase B-2), cross-author replay sanitizer (Phase
B-3), and erase-subject CLI + legacy backfill (Phase C) remain.

Plan: plan-221-phase-b-1-ingest-cutover.md (qor-judge PASS at L2
round 3; two prior VETOes captured F-B1-{1,2,3} + F-B2-{1,2,3} as
Shadow Genome Entries #8 and #9 with new heuristics #7-9).

What ships:

- Schema v21→v22 migration: relaxes input_span.text ASSERT to
  "$value != '' OR $this.archive_key != ''". DB-engine-enforced;
  refactor-resistant. Legacy UNIQUE-on-(source_type, source_ref, text)
  index preserved for backward-compat.
- ledger/queries.py::_resolve_span_text(archive, row) — sync helper,
  single point of truth for input_span.text reads. Returns archive
  content when archive_key is set, "[ERASED]" sentinel post-erasure,
  legacy row.text as fallback.
- _ERASED_SENTINEL constant hoisted (load-bearing in helper return
  AND real_spans filter exclusion).
- 7 read sites refactored to route through helper:
    * 4 graph projections in queries.py (get_all_decisions,
      search_by_bm25, get_decisions_for_file, get_decisions_for_files)
    * handlers/history.py:217 enriched-fetch site
    * handlers/remove_source.py audit-telemetry consumer of
      get_input_span_row (post-erasure audit captures sentinel, not
      stale plaintext)
- upsert_input_span gains archive_key parameter — when set, writes
  with text='' and dedup keyed on archive_key; legacy text-only path
  preserved.
- SurrealDBLedgerAdapter.ingest_payload writes verbatim to archive
  before input_span CREATE; falls back to inline-text on archive
  write failure.
- PiiArchive plumbed onto adapter via adapters/ledger.py::get_ledger();
  path from BICAMERAL_PII_ARCHIVE_PATH env or
  ~/.bicameral/pii-archive.db default.
- governance-gates.yaml entry: gate_kind: schema pointing at
  input_span.text ASSERT (strongest deterministic-gate variant).
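
The archive-first write with inline fallback can be sketched as below. This is an illustration under stated assumptions, not the body of ingest_payload: `FlakyArchive` is a hypothetical test double, and deriving the archive key from a SHA-256 content hash is an assumption made only so the example is self-contained.

```python
import hashlib


class FlakyArchive:
    """Hypothetical archive double that can be told to fail on write."""

    def __init__(self, fail=False):
        self.fail = fail
        self.rows = {}

    def put(self, key, text):
        if self.fail:
            raise OSError("archive unavailable")
        self.rows[key] = text


def build_span_row(archive, text):
    """Return the fields that would be written to input_span for this text."""
    # Assumption: the key is a content hash; the real key scheme may differ.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    try:
        archive.put(key, text)
    except Exception:
        # Archive write failed: keep the legacy inline-text shape.
        return {"text": text, "archive_key": ""}
    # Archive write succeeded: the row carries only the key.
    return {"text": "", "archive_key": key}
```

Either branch produces a row with at least one of the two fields non-empty, so both shapes pass the relaxed ASSERT. The trade-off is that a row written through the fallback still holds verbatim text inline and is not covered by archive erasure.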

Tests (24 new sociable, all passing):
- 6 schema migration tests (deterministic-gate ASSERT, legacy-shape
  acceptance, archive-key-only acceptance, both-empty rejection)
- 8 _resolve_span_text unit tests (sentinel constant, archive path,
  legacy fallback, erasure → sentinel, broken-archive grace, both-
  set archive-wins, idempotency)
- 4 load-bearing erasure propagation tests (audit-required):
    * test_resolve_returns_erased_sentinel_after_archive_erase
    * test_get_all_decisions_filters_erased_sentinel_from_source_excerpt
    * test_legacy_row_with_no_archive_key_still_renders_normally
    * test_ingest_writes_text_to_archive_and_empty_to_input_span
- Regression: 15 test_phase2_ledger + 19 Phase A tests still pass.
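
The acceptance and rejection shapes covered by the schema migration tests follow directly from the ASSERT predicate. A Python mirror of that predicate (the function name is hypothetical; the real gate is enforced by the database engine, not by Python) makes the three cases explicit:

```python
def span_row_passes_assert(text: str, archive_key: str) -> bool:
    """Python mirror of: $value != '' OR $this.archive_key != ''."""
    return text != "" or archive_key != ""


# Legacy shape: inline text, no archive key.
legacy_ok = span_row_passes_assert("verbatim transcript", "")
# Archive-key-only shape: empty text, key set.
archive_only_ok = span_row_passes_assert("", "k1")
# Both empty: the only shape the gate rejects.
both_empty_ok = span_row_passes_assert("", "")
```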

Honest non-closure framing per audit:
- Phase B-1 segregates input_span.text only. decision.speakers and
  decision.source_ref remain raw PII surfaces — Phase B-2.
- Cross-author replay sanitizer (events/materializer.py) — Phase B-3.
- erase-subject CLI + legacy-row backfill — Phase C.
- Schema-level UNIQUE-on-archive_key NOT added (legacy rows have
  archive_key='' which would violate UNIQUE on empty values).
  Python-side dedup via get_input_span_id is the gate; partial UNIQUE
  index lands post-backfill.
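
The UNIQUE problem and the planned partial index can be demonstrated with SQLite's in-memory engine. This swaps in SQLite purely for illustration; the project's database is SurrealDB, whose index semantics and syntax may differ, and the table layout here is a simplified assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE input_span ("
    "id INTEGER PRIMARY KEY, "
    "text TEXT NOT NULL, "
    "archive_key TEXT NOT NULL DEFAULT '')"
)
# Two legacy rows share archive_key = '', plus one archived row.
conn.executemany(
    "INSERT INTO input_span (text, archive_key) VALUES (?, ?)",
    [("legacy one", ""), ("legacy two", ""), ("", "k1")],
)


def try_execute(sql, params=()):
    """Run a statement; report False instead of raising on a database error."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.DatabaseError:
        return False


# A plain UNIQUE index fails because the empty key already appears twice.
plain_unique_ok = try_execute(
    "CREATE UNIQUE INDEX uq_plain ON input_span (archive_key)"
)
# A partial UNIQUE index ignores the empty keys and can be created.
partial_unique_ok = try_execute(
    "CREATE UNIQUE INDEX uq_partial ON input_span (archive_key) "
    "WHERE archive_key != ''"
)
insert_sql = "INSERT INTO input_span (text, archive_key) VALUES (?, ?)"
# With the partial index in place, duplicate real keys are rejected,
# while further legacy rows with an empty key are still accepted.
duplicate_key_ok = try_execute(insert_sql, ("", "k1"))
legacy_insert_ok = try_execute(insert_sql, ("legacy three", ""))
```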

ruff format + ruff check clean. Roadmap doc updated; Shadow Genome
Entry #9 captures the round-2 VETO lessons (cross-section signature
consistency + sentinel downstream-consumer audit as heuristics #8-9).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 15, 2026


Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@Knapp-Kevin added the feat, P1, governance, security, compliance, and ledger labels on May 15, 2026
@jinhongkuan jinhongkuan merged commit 2d47ca3 into dev May 15, 2026
9 of 11 checks passed