feat(pii-archive): #221 Phase B-1 — ingest cutover + read-path centralization (NOT closure) #356
Merged
**Phase B-1 of N. Does NOT close GDPR-01.** This cycle ships the
load-bearing schema-level deterministic gate (#205 doctrine,
gate_kind: schema) that segregates verbatim transcript text into the
operator-erasable PiiArchive from Phase A. Speakers/source_ref
pseudonymization (Phase B-2), cross-author replay sanitizer (Phase
B-3), and erase-subject CLI + legacy backfill (Phase C) remain.
Plan: plan-221-phase-b-1-ingest-cutover.md (qor-judge PASS at L2
round 3; two prior VETOes captured F-B1-{1,2,3} + F-B2-{1,2,3} as
Shadow Genome Entries #8 and #9 with new heuristics #7-9).
What ships:
- Schema v21→v22 migration: relaxes input_span.text ASSERT to
"$value != '' OR $this.archive_key != ''". DB-engine-enforced;
refactor-resistant. Legacy UNIQUE-on-(source_type, source_ref, text)
index preserved for backward-compat.
- ledger/queries.py::_resolve_span_text(archive, row) — sync helper,
single point of truth for input_span.text reads. Returns archive
content when archive_key is set, "[ERASED]" sentinel post-erasure,
legacy row.text as fallback.
- _ERASED_SENTINEL constant hoisted (load-bearing in helper return
AND real_spans filter exclusion).
- 7 read sites refactored to route through helper:
* 4 graph projections in queries.py (get_all_decisions,
search_by_bm25, get_decisions_for_file, get_decisions_for_files)
* handlers/history.py:217 enriched-fetch site
* handlers/remove_source.py audit-telemetry consumer of
get_input_span_row (post-erasure audit captures sentinel, not
stale plaintext)
- upsert_input_span gains archive_key parameter — when set, writes
with text='' and dedup keyed on archive_key; legacy text-only path
preserved.
- SurrealDBLedgerAdapter.ingest_payload writes verbatim to archive
before input_span CREATE; falls back to inline-text on archive
write failure.
- PiiArchive plumbed onto adapter via adapters/ledger.py::get_ledger();
path from BICAMERAL_PII_ARCHIVE_PATH env or
~/.bicameral/pii-archive.db default.
- governance-gates.yaml entry: gate_kind: schema pointing at
input_span.text ASSERT (strongest deterministic-gate variant).
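The write and read paths described above can be sketched end to end. This is a minimal illustration, not the code shipped in this PR: InMemoryArchive and its put/get/erase methods, the SHA-256 key scheme, the dict-shaped row, and the ingest_span name are all assumptions. Only the resolution order (archive content, then the [ERASED] sentinel after erasure, then legacy row.text) and the inline-text fallback on archive write failure come from the description.

```python
import hashlib
from typing import Optional

ERASED_SENTINEL = "[ERASED]"


class InMemoryArchive:
    """Hypothetical stand-in for PiiArchive; the real API is not shown in this PR."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def put(self, text: str) -> str:
        # Assumption: keys are content hashes. The real key scheme may differ.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._store[key] = text
        return key

    def get(self, key: str) -> Optional[str]:
        # Assumption: an erased (or unknown) key reads back as None.
        return self._store.get(key)

    def erase(self, key: str) -> None:
        self._store.pop(key, None)


def resolve_span_text(archive: InMemoryArchive, row: dict) -> str:
    """Single read path: archive content, else sentinel, else legacy inline text."""
    key = row.get("archive_key", "")
    if key:
        text = archive.get(key)
        return text if text is not None else ERASED_SENTINEL
    return row.get("text", "")


def ingest_span(archive: InMemoryArchive, verbatim: str) -> dict:
    """Write verbatim text to the archive first; fall back to inline text on failure."""
    try:
        key = archive.put(verbatim)
    except Exception:
        # Fallback path: the row keeps the legacy shape (inline text, no key).
        return {"text": verbatim, "archive_key": ""}
    return {"text": "", "archive_key": key}


if __name__ == "__main__":
    archive = InMemoryArchive()
    row = ingest_span(archive, "Alice: ship it on Friday")
    print(resolve_span_text(archive, row))
    archive.erase(row["archive_key"])
    print(resolve_span_text(archive, row))
```

Because every reader goes through one function, erasing the archive entry is enough to change what all callers see; no ledger row has to be rewritten. Broken-archive handling, which the unit tests cover, is not modelled here.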
Tests (24 new sociable, all passing):
- 6 schema migration tests (deterministic-gate ASSERT, legacy-shape
acceptance, archive-key-only acceptance, both-empty rejection)
- 8 _resolve_span_text unit tests (sentinel constant, archive path,
legacy fallback, erasure → sentinel, broken-archive grace,
both-set archive-wins, idempotency)
- 4 load-bearing erasure propagation tests (audit-required):
* test_resolve_returns_erased_sentinel_after_archive_erase
* test_get_all_decisions_filters_erased_sentinel_from_source_excerpt
* test_legacy_row_with_no_archive_key_still_renders_normally
* test_ingest_writes_text_to_archive_and_empty_to_input_span
- Regression: 15 test_phase2_ledger + 19 Phase A tests still pass.
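The acceptance cases in the schema migration tests follow directly from the relaxed ASSERT. As a hedged illustration, the predicate can be mirrored in plain Python. span_passes_assert and the sample values are hypothetical; the real check runs inside the database engine, not in application code.

```python
def span_passes_assert(text: str, archive_key: str) -> bool:
    """Python mirror of: $value != '' OR $this.archive_key != ''."""
    return text != "" or archive_key != ""


# Row shapes as (text, archive_key) pairs. "both-set" appears in the helper
# unit tests rather than the migration tests, but the ASSERT accepts it too.
CASES = {
    "legacy-shape": ("verbatim transcript text", ""),
    "archive-key-only": ("", "some-archive-key"),
    "both-set": ("verbatim transcript text", "some-archive-key"),
    "both-empty": ("", ""),
}

if __name__ == "__main__":
    for name, (text, key) in CASES.items():
        verdict = "accepted" if span_passes_assert(text, key) else "rejected"
        print(f"{name}: {verdict}")
```

Only the both-empty shape fails, which is what keeps a row from existing with neither inline text nor a pointer into the archive.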
Honest non-closure framing per audit:
- Phase B-1 segregates input_span.text only. decision.speakers and
decision.source_ref remain raw PII surfaces — Phase B-2.
- Cross-author replay sanitizer (events/materializer.py) — Phase B-3.
- erase-subject CLI + legacy-row backfill — Phase C.
- Schema-level UNIQUE-on-archive_key NOT added (legacy rows have
archive_key='' which would violate UNIQUE on empty values).
Python-side dedup via get_input_span_id is the gate; partial UNIQUE
index lands post-backfill.
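The reason a plain UNIQUE index on archive_key cannot land yet, and what Python-side dedup looks like in the meantime, can be sketched with an in-memory store. This is an assumption-heavy illustration: SpanStore, the row layout, the ID format, and the exact signatures of get_input_span_id and upsert_input_span are hypothetical. Only the behaviour (dedup on archive_key when set, text='' on archived rows, legacy text-keyed path preserved) is taken from the description.

```python
from typing import Optional


class SpanStore:
    """Hypothetical in-memory stand-in for the input_span table."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}
        self._next_id = 1

    def get_input_span_id(
        self, source_type: str, source_ref: str, text: str = "", archive_key: str = ""
    ) -> Optional[str]:
        for span_id, row in self.rows.items():
            if archive_key:
                # Archived rows dedup on the archive key alone.
                if row["archive_key"] == archive_key:
                    return span_id
            elif row["archive_key"] == "" and (
                row["source_type"], row["source_ref"], row["text"]
            ) == (source_type, source_ref, text):
                # Legacy rows dedup on (source_type, source_ref, text).
                return span_id
        return None

    def upsert_input_span(
        self, source_type: str, source_ref: str, text: str = "", archive_key: str = ""
    ) -> str:
        existing = self.get_input_span_id(source_type, source_ref, text, archive_key)
        if existing is not None:
            return existing
        span_id = f"input_span:{self._next_id}"
        self._next_id += 1
        self.rows[span_id] = {
            "source_type": source_type,
            "source_ref": source_ref,
            # Archived rows never carry inline text.
            "text": "" if archive_key else text,
            "archive_key": archive_key,
        }
        return span_id


if __name__ == "__main__":
    store = SpanStore()
    store.upsert_input_span("transcript", "call-2", text="hello")
    store.upsert_input_span("transcript", "call-2", text="world")
    # Both legacy rows hold archive_key == '', so a plain UNIQUE index
    # on archive_key would reject the second one.
    print([row["archive_key"] for row in store.rows.values()])
```

Once legacy rows are backfilled with real keys, the empty-string collision disappears, which is why the partial UNIQUE index is deferred rather than dropped.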
ruff format + ruff check clean. Roadmap doc updated; Shadow Genome
Entry #9 captures the round-2 VETO lessons (cross-section signature
consistency + sentinel downstream-consumer audit as heuristics #8-9).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 15, 2026
Status
Phase B-1 of N. GDPR-01 audit gap remains OPEN. This PR ships the load-bearing schema-level deterministic gate (#205 doctrine, gate_kind: schema) that segregates verbatim transcript text into the operator-erasable PiiArchive (Phase A primitive). Speakers/source_ref pseudonymization (Phase B-2), cross-author replay sanitizer (Phase B-3), and erase-subject CLI + legacy backfill (Phase C) remain.
Plan + audit history
Plan: plan-221-phase-b-1-ingest-cutover.md, qor-judge PASS at L2 in round 3.
Two prior VETOes captured as Shadow Genome Entries:
- Entry #8: … ledger/queries.py read sites; missed handlers/history.py:217 and handlers/remove_source.py. Heuristic #7 added: codebase-wide-grep check.
- Entry #9: … real_spans filter under [ERASED] sentinel. Heuristics #8 (cross-section signature consistency) + #9 (sentinel downstream-consumer audit) added.
What ships
- Schema v21→v22 migration: relaxes input_span.text ASSERT to $value != '' OR $this.archive_key != ''. DB-engine-enforced; refactor-resistant.
- _resolve_span_text(archive, row): sync helper at ledger/queries.py; single point of truth for input_span.text reads. Returns archive content / [ERASED] sentinel / legacy row.text.
- _ERASED_SENTINEL constant hoisted; load-bearing in helper return AND real_spans filter exclusion.
- 7 read sites refactored to route through helper:
  * 4 graph projections in queries.py (get_all_decisions, search_by_bm25, get_decisions_for_file, get_decisions_for_files)
  * handlers/history.py:217 enriched-fetch site
  * handlers/remove_source.py audit-telemetry consumer of get_input_span_row
- upsert_input_span gains archive_key parameter: when set, writes with text='' and dedups on archive_key.
- SurrealDBLedgerAdapter.ingest_payload writes verbatim to archive before input_span CREATE; falls back to inline-text on archive failure.
- PiiArchive plumbed onto adapter via adapters/ledger.py::get_ledger().
- governance-gates.yaml entry: gate_kind: schema pointing at the ASSERT (strongest deterministic-gate variant).
Honest non-closure framing
- Phase B-1 segregates input_span.text only. decision.speakers and decision.source_ref remain raw PII surfaces: Phase B-2.
- Cross-author replay sanitizer (events/materializer.py): Phase B-3.
- erase-subject CLI + legacy-row backfill: Phase C.
- Schema-level UNIQUE-on-archive_key NOT added (legacy rows have archive_key='' which would violate UNIQUE on empty values). Python-side dedup via get_input_span_id is the gate; partial UNIQUE index lands post-backfill.
Test plan
- 8 _resolve_span_text unit tests (sentinel constant, archive path, legacy fallback, erasure → sentinel, broken-archive grace, both-set archive-wins, idempotency)
- Load-bearing erasure propagation tests:
  * test_resolve_returns_erased_sentinel_after_archive_erase
  * test_get_all_decisions_filters_erased_sentinel_from_source_excerpt
  * test_legacy_row_with_no_archive_key_still_renders_normally
  * test_ingest_writes_text_to_archive_and_empty_to_input_span
- Regression: test_phase2_ledger (15) + Phase A tests (19) still pass.
- ruff format + ruff check clean.
Plan revisions vs. originally audited
- … get_input_span_id is the new gate. Documented in schema comment + roadmap doc.
- … test_history_erasure_propagation.py: same coverage, fewer files.
Files touched (12)
- ledger/schema.py
- ledger/queries.py: upsert_input_span archive_key path
- ledger/adapter.py: ingest_payload writes to archive before input_span CREATE
- adapters/ledger.py
- handlers/history.py
- handlers/remove_source.py
- governance-gates.yaml: gate_kind: schema entry
- docs/policies/gdpr-art-17-erasure-roadmap.md
- docs/SHADOW_GENOME.md
- tests/test_pii_archive_schema_migration_b1.py
- tests/test_resolve_span_text_unit.py
- tests/test_history_erasure_propagation.py
🤖 Generated with Claude Code