M1/decision relevance ruler#6
Conversation
Adds the instrumentation and eval runner for the "Decision Relevance" metric from bicameral-quality-validation-roadmap.html (M1). Lets any branch that edits .claude/skills/bicameral-ingest/SKILL.md produce a directly-comparable grounded_pct in CI artifacts. Changes: - handlers/ingest.py: log "[ingest] complete: N/M grounded (pct%)" and populate stats.grounded / stats.grounded_pct on IngestResponse. - contracts.py: extend IngestStats (additive, no breaking change). - ledger/queries.py: get_grounding_breakdown() groups by source_ref. - tests/fixtures/expected/decisions.py: TRANSCRIPT_SOURCES map so the runner can discover which transcript targets which repo. Adding a new transcript/repo is a one-line fixture edit. - tests/eval_decision_relevance.py: new fixture-driven runner with two modes — "none" (ingest fixture decisions directly, no LLM) and "from-skill-md" (headless extraction from current SKILL.md, then grounding). Mirrors eval_code_locator.py structure. - tests/_extract_headless.py: runs Step 1 of SKILL.md via the Anthropic Messages API using Authorization: Bearer CLAUDE_CODE_OAUTH_TOKEN. Caches responses keyed on (skill_md_sha, transcript_sha, model) — editing SKILL.md on a branch invalidates the cache automatically. - tests/test_extract_headless.py: offline unit tests for Step 1 excerpt parsing, fenced-JSON tolerance, and cache hit path. - tests/test_phase2_ledger.py: 2 new tests for grounded-field population + get_grounding_breakdown. Also repairs test_ungrounded_intent_has_correct_status to use gibberish text (it was silently relying on the broken grounding path below). Incidental fix: - code_locator/tools/validate_symbols.py: store self._db so adapters/code_locator.py:190 can reach db.lookup_by_file() during auto-grounding. Previously the whole ground_mappings() path was a silent no-op, which masked test_phase3_integration failures and produced 100%-ungrounded results for any real run. Regression: 30 tests pass across phase1 + phase2 + extract_headless. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The driver was resolving SKILL.md relative to the PARENT repo root (.claude/skills/bicameral-ingest/SKILL.md), but that path is gitignored at the parent level — only the submodule copy is tracked (pilot/mcp/.claude/skills/bicameral-ingest/SKILL.md). In CI the file was absent and the extraction+grounding step failed at _load_skill_md before touching auth. Resolve to MCP_ROOT (parents[1]) instead. This makes the submodule the single source of truth for the skill and keeps Phase 5 A/B branches inside the submodule — consistent with every other MCP asset. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CI run bit us with `httpx.LocalProtocolError: Illegal header value b'Bearer '` — the env var resolved to an empty string and we concatenated it into the Authorization header, producing malformed output instead of a clear error. Switch to os.environ.get(..., "") + .strip() check so a missing or empty-string token surfaces with a self-explaining RuntimeError before we build any HTTP request.
The CI run came back 401 from the public Messages API with only the status code visible — not enough to tell whether the Claude Code OAuth token is (a) lacking the right scope, (b) the wrong token type for this endpoint, or (c) rejected for some other reason. Capture the response body (first 500 chars) and emit it with the model name, token prefix (first 12 chars, enough to distinguish sk-ant-api03 from sk-ant-oat01 without leaking material), and token length. This is diagnostic-only; the cached success path is untouched.
Anthropic's public Messages API rejects Claude Code OAuth tokens with 401 "OAuth authentication is currently not supported". Switch to standard API-key auth: - ANTHROPIC_API_KEY env var instead of CLAUDE_CODE_OAUTH_TOKEN - x-api-key header instead of Authorization: Bearer - Function parameter: api_key (was oauth_token) - Error messages point at the ci-test environment secret path Cached success path is untouched; tests still pass offline with no network credential needed.
Haiku 4.5 was returning JSON with unescaped inner double quotes inside description strings (e.g., "or \`[\"*\"]\` for global access"), breaking json.loads mid-parse. Also the default 4096 max_tokens was near the ceiling for transcripts with 10+ decisions. Switch to Anthropic tool use: - Define an `submit_extraction` tool with a JSON schema for decisions + action_items - Force the model to call it via tool_choice - Read tool_use.input directly — it's a pre-parsed Python dict delivered by the API, so no JSON string parsing, no unescape failures, no markdown fence drift Also: - Bump MAX_OUTPUT_TOKENS 4096 → 8192 (cheap headroom) - Raise clean errors on stop_reason="max_tokens" and missing tool_use blocks - Keep _parse_response_json as a defensive utility (not on the hot path) so the existing offline tests for fence tolerance still run Offline tests: 9/9 pass. Next CI run is the real end-to-end validation against medusa/saleor/vendure via Haiku.
Reshapes the M1 eval so CI doesn't depend on Anthropic as a
"ground-truth oracle" on every run. New architecture:
simulation (every CI run, Haiku 4.5 via live API):
- Runs current SKILL.md Step-1 extraction on each transcript
- Ingests via the grounding pipeline
- Reports grounded_pct as before
ground truth (pregenerated offline, Opus 4.6, committed to git):
- tests/fixtures/extraction/<source_ref>.json — one per transcript
- Generated by tests/regen_extraction_fixtures.py --all
- Hand-editable: the committed JSON is the oracle, Opus is just the
bootstrap tool. audit provenance fields track model + timestamp +
skill_md_sha so staleness is visible in git diff.
metric (computed by the runner at M1 eval time):
- For each transcript, fuzzy-match extracted decisions against the
ground-truth fixture via rapidfuzz.fuzz.token_set_ratio (threshold 70)
- 1:1 matching (no double-counting)
- Per-transcript precision / recall / F1 + counts
- Aggregate across all scored transcripts
- Skipped gracefully when no fixture file exists, so CI stays green
until fixtures are bootstrapped
Phase 5 A/B workflow (once fixtures exist):
1. Branch from main, edit .claude/skills/bicameral-ingest/SKILL.md
2. git push → CI runs Haiku extraction via new skill → reports
precision/recall delta against the FIXED Opus fixture
3. Compare across experiment branches. Winner merges.
New files:
tests/_extraction_metrics.py — rapidfuzz matching + aggregate math
tests/test_extraction_metrics.py — 9 unit tests (1:1 match, P/R math)
tests/regen_extraction_fixtures.py — Opus-backed bootstrap script
tests/fixtures/extraction/README.md — format + workflow docs
Also:
- Incidental fix in _build_payload_from_skill_md: coerce
action_item.owner=None → "" to satisfy IngestPayload contract
(the in-flight CI tool-use run exposed this via a pydantic error)
Regression: 30 tests pass (9 extract_headless + 9 extraction_metrics
+ 12 phase-2). Runner smoke-test in --skill-variant none mode passes
end-to-end with the new plumbing and cleanly reports "extraction
skipped (no fixtures)" when fixtures dir is empty.
Next step (outside this commit): bootstrap fixtures locally via
ANTHROPIC_API_KEY=... .venv/bin/python tests/regen_extraction_fixtures.py --all
commit the resulting JSONs, push. CI will then produce real P/R
numbers alongside grounded_pct.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Generates the M1 ground-truth extraction fixtures in-session instead of paying Anthropic API tokens, using Claude Opus 4.6 (this session) with the cloned medusa/saleor/vendure trees available for spot-verification of claimed symbols and files. Produced by reading each transcript with the code alongside it, applying SKILL.md Step-1 rules (when in doubt, exclude), and writing one JSON per source_ref. Coverage (67 decisions + 33 action_items across 9 transcripts): medusa-payment-timeout 4 decisions, 5 action_items medusa-plugin-migration 8 decisions, 5 action_items medusa-webhook-notifications 12 decisions, 3 action_items saleor-checkout-extensibility 10 decisions, 3 action_items saleor-graphql-permissions 8 decisions, 3 action_items saleor-order-workflows 7 decisions, 4 action_items vendure-channel-pricing 7 decisions, 2 action_items vendure-custom-fields 7 decisions, 5 action_items vendure-search-reindexing 4 decisions, 3 action_items Provenance fields on each fixture: generated_by: claude-opus-4-6[1m]-in-session skill_md_sha: bb8f81be7f20 (current SKILL.md) generated_at: 2026-04-13 These are hand-editable going forward. Re-run tests/regen_extraction_fixtures.py (with ANTHROPIC_API_KEY) to regenerate via the API path if/when the transcripts change or the Phase 5 skill variants warrant a fresh reference. Also incidental: - eval_decision_relevance.py: stop hardcoding "no fixtures" in the extraction-skipped console message; just print the aggregate's own reason string. The hardcoded text was misleading when the skip reason was actually "variant=none, metric not applicable". Regression: 33 tests pass across test_extract_headless, test_extraction_metrics, test_phase2_ledger. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 49 minutes and 21 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughValidateSymbolsTool now retains the DB instance; ingest handler computes and logs grounded counts and percentage and populates new IngestStats fields; ledger gains grounding breakdown query; a headless extraction driver, LLM/rapidfuzz matchers, metrics, fixtures, regeneration tooling, and evaluation runner with tests were added. Changes
Sequence Diagram(s)sequenceDiagram
participant Handler as Ingest Handler
participant Tool as ValidateSymbolsTool
participant DB as Database
participant Ledger as Ledger Queries
Handler->>+Handler: handle_ingest(payload)
Handler->>+Tool: validate_symbols(...)
Tool->>+DB: lookup_by_file() via self._db
DB-->>-Tool: symbol results
Tool-->>-Handler: validation results
Handler->>Handler: compute intents_created, ungrounded
Handler->>Handler: grounded = max(0, intents_created - ungrounded)
Handler->>Handler: grounded_pct = grounded/intents_created or 0.0
Handler->>+Ledger: persist/return IngestStats (including grounded, grounded_pct)
Ledger-->>-Handler: stats returned
Handler-->>-Handler: return result
sequenceDiagram
participant Eval as Eval Runner
participant Extract as Headless Extractor
participant API as Anthropic API
participant Ingest as Ingest Handler
participant Metrics as Metrics Computer
Eval->>+Eval: load transcripts & plan
loop per transcript
Eval->>+Extract: extract_from_current_skill(transcript)
Extract->>Extract: check cache
alt cache hit
Extract-->>Eval: return cached decisions
else cache miss
Extract->>+API: send tool-based extraction request
API-->>-Extract: tool_use response
Extract->>Extract: parse & cache
Extract-->>Eval: return extracted decisions
end
Eval->>+Ingest: handle_ingest(payload)
Ingest-->>-Eval: return grounding stats
Eval->>+Metrics: compute_extraction_metrics(extracted, fixture)
Metrics-->>-Eval: return precision/recall/F1
end
Eval->>Eval: aggregate per-repo & overall metrics
Eval->>Eval: enforce regression gates -> exit code
Estimated code review effort🎯 4 (Complex) | ⏱️ ~65 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
tests/_extract_headless.py (1)
166-169: Chain the exception for better traceability.When re-raising as a new exception type, use
from excto preserve the original traceback chain. This helps with debugging when JSON parsing fails.Proposed fix
try: parsed = json.loads(text) except json.JSONDecodeError as exc: - raise ValueError(f"extractor did not return valid JSON: {exc}\nbody={body[:500]}") + raise ValueError(f"extractor did not return valid JSON: {exc}\nbody={body[:500]}") from exc🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/_extract_headless.py` around lines 166 - 169, The except block around json.loads(text) currently raises a new ValueError without chaining the original json.JSONDecodeError; update the except handler for json.JSONDecodeError (the block catching exc) to re-raise the ValueError using "from exc" so the original traceback is preserved (i.e., change the raise in the try/except that follows json.loads(text) to use exception chaining).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ledger/queries.py`:
- Around line 571-580: The code currently treats any non-"ungrounded" status
(including empty or missing status) as grounded; update the logic in the block
that obtains status and updates the bucket (the variables status,
buckets.setdefault, and keys "grounded"/"ungrounded"/"total") so that only an
explicit "grounded" value increments b["grounded"], and all other cases
(including empty string or missing status) increment b["ungrounded"] (or are
handled as ungrounded) before updating totals and grounded_pct.
In `@tests/_extraction_metrics.py`:
- Around line 88-103: The current greedy loop matching items from actual to
unmatched_expected (variables actual, unmatched_expected, matched) using
fuzz.token_set_ratio and popping the best per iteration causes order-dependent
results; replace this with an order-independent max-cardinality / max-weight
bipartite matching: build a weight matrix of scores between each actual and each
expected (using fuzz.token_set_ratio and MATCH_THRESHOLD), run a max-weight
matching (e.g., Hungarian algorithm or
networkx.algorithms.matching.max_weight_matching) to select globally optimal
pairs, then populate matched, unmatched_actual, and unmatched_expected from that
matching instead of popping inside the greedy loop. Ensure ties below
MATCH_THRESHOLD are treated as non-matches.
In `@tests/eval_decision_relevance.py`:
- Around line 244-248: Wrap the json.loads(args.multi_repo) call in a try/except
to catch json.JSONDecodeError so a malformed --multi-repo value produces a clean
CLI error instead of a traceback; specifically, when parsing into repo_map (the
repo_map: dict[str, str] = json.loads(args.multi_repo) line), catch
json.JSONDecodeError, print a descriptive message to stderr (e.g. "error:
invalid --multi-repo JSON") and return the same error tuple ({}, 2) as the empty
case.
In `@tests/fixtures/extraction/saleor-checkout-extensibility.json`:
- Around line 50-51: Replace the ambiguous empty string owner value with a
proper null to represent an unassigned owner: find the JSON object containing
the "text" value "Deliver base implementation for review by end of sprint;
check-in Thursday on the plugin changes (critical path)" and set its "owner"
property from "" to null so consumers don't need extra normalization logic.
In `@tests/fixtures/extraction/vendure-custom-fields.json`:
- Around line 49-50: Change the JSON fixture entry where the object has "text":
"All PRs reviewed by Wednesday" and "owner": "" so that the owner field is
normalized to null; specifically replace the empty-string value for the "owner"
property with a JSON null to ensure schema consistency across fixtures.
---
Nitpick comments:
In `@tests/_extract_headless.py`:
- Around line 166-169: The except block around json.loads(text) currently raises
a new ValueError without chaining the original json.JSONDecodeError; update the
except handler for json.JSONDecodeError (the block catching exc) to re-raise the
ValueError using "from exc" so the original traceback is preserved (i.e., change
the raise in the try/except that follows json.loads(text) to use exception
chaining).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 443dbd1e-4916-4f3c-be13-116bede51987
📒 Files selected for processing (22)
code_locator/tools/validate_symbols.pycontracts.pyhandlers/ingest.pyledger/queries.pytests/_extract_headless.pytests/_extraction_metrics.pytests/eval_decision_relevance.pytests/fixtures/expected/decisions.pytests/fixtures/extraction/README.mdtests/fixtures/extraction/medusa-payment-timeout.jsontests/fixtures/extraction/medusa-plugin-migration.jsontests/fixtures/extraction/medusa-webhook-notifications.jsontests/fixtures/extraction/saleor-checkout-extensibility.jsontests/fixtures/extraction/saleor-graphql-permissions.jsontests/fixtures/extraction/saleor-order-workflows.jsontests/fixtures/extraction/vendure-channel-pricing.jsontests/fixtures/extraction/vendure-custom-fields.jsontests/fixtures/extraction/vendure-search-reindexing.jsontests/regen_extraction_fixtures.pytests/test_extract_headless.pytests/test_extraction_metrics.pytests/test_phase2_ledger.py
| status = str(row.get("status") or "") | ||
| b = buckets.setdefault( | ||
| ref, | ||
| {"source_ref": ref, "grounded": 0, "ungrounded": 0, "total": 0, "grounded_pct": 0.0}, | ||
| ) | ||
| b["total"] += 1 | ||
| if status == "ungrounded": | ||
| b["ungrounded"] += 1 | ||
| else: | ||
| b["grounded"] += 1 |
There was a problem hiding this comment.
Do not treat empty/missing status as grounded.
At Line 571-580, status == "" (or missing) currently increments grounded, which can overstate grounded_pct and skew eval outcomes.
Suggested fix
- status = str(row.get("status") or "")
+ status = str(row.get("status") or "")
b = buckets.setdefault(
ref,
{"source_ref": ref, "grounded": 0, "ungrounded": 0, "total": 0, "grounded_pct": 0.0},
)
b["total"] += 1
- if status == "ungrounded":
+ if not status or status == "ungrounded":
b["ungrounded"] += 1
else:
b["grounded"] += 1🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ledger/queries.py` around lines 571 - 580, The code currently treats any
non-"ungrounded" status (including empty or missing status) as grounded; update
the logic in the block that obtains status and updates the bucket (the variables
status, buckets.setdefault, and keys "grounded"/"ungrounded"/"total") so that
only an explicit "grounded" value increments b["grounded"], and all other cases
(including empty string or missing status) increment b["ungrounded"] (or are
handled as ungrounded) before updating totals and grounded_pct.
| for a in actual: | ||
| if not unmatched_expected: | ||
| unmatched_actual.append(a) | ||
| continue | ||
| # Find best match in remaining expected pool | ||
| best_i = -1 | ||
| best_score = -1.0 | ||
| for i, e in enumerate(unmatched_expected): | ||
| score = fuzz.token_set_ratio(a, e) | ||
| if score > best_score: | ||
| best_score = score | ||
| best_i = i | ||
| if best_score >= MATCH_THRESHOLD: | ||
| matched.append((a, unmatched_expected[best_i], best_score)) | ||
| unmatched_expected.pop(best_i) | ||
| else: |
There was a problem hiding this comment.
Order-dependent greedy matching can mis-score TP/FP/FN.
At Line 88, matching extracted decisions sequentially and popping matches at Line 102 makes outcomes dependent on input order. For a regression gate, this can undercount true positives on equivalent outputs.
Proposed fix (order-independent max-cardinality matching)
def compute_extraction_metrics(
@@
- unmatched_expected = list(expected)
- matched: list[tuple[str, str, float]] = []
- unmatched_actual: list[str] = []
-
- for a in actual:
- if not unmatched_expected:
- unmatched_actual.append(a)
- continue
- # Find best match in remaining expected pool
- best_i = -1
- best_score = -1.0
- for i, e in enumerate(unmatched_expected):
- score = fuzz.token_set_ratio(a, e)
- if score > best_score:
- best_score = score
- best_i = i
- if best_score >= MATCH_THRESHOLD:
- matched.append((a, unmatched_expected[best_i], best_score))
- unmatched_expected.pop(best_i)
- else:
- unmatched_actual.append(a)
+ # Build thresholded bipartite graph (actual -> expected), neighbors sorted by score desc.
+ adj: list[list[int]] = []
+ score_map: dict[tuple[int, int], float] = {}
+ for ai, a in enumerate(actual):
+ candidates: list[tuple[int, float]] = []
+ for ei, e in enumerate(expected):
+ s = fuzz.token_set_ratio(a, e)
+ if s >= MATCH_THRESHOLD:
+ candidates.append((ei, s))
+ score_map[(ai, ei)] = s
+ candidates.sort(key=lambda t: t[1], reverse=True)
+ adj.append([ei for ei, _ in candidates])
+
+ # Maximum-cardinality bipartite matching (Kuhn DFS).
+ match_e_to_a = [-1] * len(expected)
+
+ def _dfs(ai: int, seen: set[int]) -> bool:
+ for ei in adj[ai]:
+ if ei in seen:
+ continue
+ seen.add(ei)
+ prev_ai = match_e_to_a[ei]
+ if prev_ai == -1 or _dfs(prev_ai, seen):
+ match_e_to_a[ei] = ai
+ return True
+ return False
+
+ for ai in range(len(actual)):
+ _dfs(ai, set())
+
+ matched_actual = {ai for ai in match_e_to_a if ai != -1}
+ matched_expected_idx = {ei for ei, ai in enumerate(match_e_to_a) if ai != -1}
+
+ matched: list[tuple[str, str, float]] = []
+ for ei, ai in enumerate(match_e_to_a):
+ if ai != -1:
+ matched.append((actual[ai], expected[ei], score_map[(ai, ei)]))
+
+ unmatched_actual = [a for i, a in enumerate(actual) if i not in matched_actual]
+ unmatched_expected = [e for i, e in enumerate(expected) if i not in matched_expected_idx]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/_extraction_metrics.py` around lines 88 - 103, The current greedy loop
matching items from actual to unmatched_expected (variables actual,
unmatched_expected, matched) using fuzz.token_set_ratio and popping the best per
iteration causes order-dependent results; replace this with an order-independent
max-cardinality / max-weight bipartite matching: build a weight matrix of scores
between each actual and each expected (using fuzz.token_set_ratio and
MATCH_THRESHOLD), run a max-weight matching (e.g., Hungarian algorithm or
networkx.algorithms.matching.max_weight_matching) to select globally optimal
pairs, then populate matched, unmatched_actual, and unmatched_expected from that
matching instead of popping inside the greedy loop. Ensure ties below
MATCH_THRESHOLD are treated as non-matches.
| repo_map: dict[str, str] = json.loads(args.multi_repo) if args.multi_repo else {} | ||
| if not repo_map: | ||
| print("error: --multi-repo is required", file=sys.stderr) | ||
| return {}, 2 | ||
|
|
There was a problem hiding this comment.
Handle malformed --multi-repo JSON with a clean CLI error.
Line 244 can throw json.JSONDecodeError and produce a traceback instead of a controlled exit path.
Proposed fix
- repo_map: dict[str, str] = json.loads(args.multi_repo) if args.multi_repo else {}
+ try:
+ repo_map: dict[str, str] = json.loads(args.multi_repo) if args.multi_repo else {}
+ except json.JSONDecodeError as exc:
+ print(f"error: --multi-repo must be valid JSON: {exc}", file=sys.stderr)
+ return {}, 2🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/eval_decision_relevance.py` around lines 244 - 248, Wrap the
json.loads(args.multi_repo) call in a try/except to catch json.JSONDecodeError
so a malformed --multi-repo value produces a clean CLI error instead of a
traceback; specifically, when parsing into repo_map (the repo_map: dict[str,
str] = json.loads(args.multi_repo) line), catch json.JSONDecodeError, print a
descriptive message to stderr (e.g. "error: invalid --multi-repo JSON") and
return the same error tuple ({}, 2) as the empty case.
| "text": "Deliver base implementation for review by end of sprint; check-in Thursday on the plugin changes (critical path)", | ||
| "owner": "" |
There was a problem hiding this comment.
Use null instead of empty string for unassigned owner.
At Line 50-51, owner: "" is ambiguous and forces extra normalization logic. Use null for missing owner values.
Suggested fix
- "owner": ""
+ "owner": null📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "text": "Deliver base implementation for review by end of sprint; check-in Thursday on the plugin changes (critical path)", | |
| "owner": "" | |
| "text": "Deliver base implementation for review by end of sprint; check-in Thursday on the plugin changes (critical path)", | |
| "owner": null |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/fixtures/extraction/saleor-checkout-extensibility.json` around lines 50
- 51, Replace the ambiguous empty string owner value with a proper null to
represent an unassigned owner: find the JSON object containing the "text" value
"Deliver base implementation for review by end of sprint; check-in Thursday on
the plugin changes (critical path)" and set its "owner" property from "" to null
so consumers don't need extra normalization logic.
| "text": "All PRs reviewed by Wednesday", | ||
| "owner": "" |
There was a problem hiding this comment.
Normalize missing owner as null.
At Line 49-50, prefer owner: null over owner: "" for schema consistency across fixtures.
Suggested fix
- "owner": ""
+ "owner": null📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "text": "All PRs reviewed by Wednesday", | |
| "owner": "" | |
| "text": "All PRs reviewed by Wednesday", | |
| "owner": null |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/fixtures/extraction/vendure-custom-fields.json` around lines 49 - 50,
Change the JSON fixture entry where the object has "text": "All PRs reviewed by
Wednesday" and "owner": "" so that the owner field is normalized to null;
specifically replace the empty-string value for the "owner" property with a JSON
null to ensure schema consistency across fixtures.
Replaces the rapidfuzz token_set_ratio matcher with an Anthropic
Haiku 4.5 tool-use call that judges semantic equivalence between
extracted and ground-truth decisions. Rapidfuzz stays as an offline
fallback for unit tests.
Why:
The first end-to-end CI run (24370616582) reported P=0.51, R=0.45,
F1=0.48, but inspection of the per-transcript unmatched lists showed
most "false positives" and "false negatives" were the same decision
phrased slightly differently. Four concrete cases from the run:
medusa-payment-timeout:
FP: "Implement a 12-second timeout ceiling on payment provider
authorize calls; if exceeded, return requires_more status"
FN: "Wrap the authorize call in PaymentProviderService with a
12-second timeout ceiling and return requires_more status"
(same decision, scored <70 on token_set_ratio)
medusa-webhook-notifications:
FP: "Per-endpoint rate limiting will use token bucket algorithm
with a capacity of 10 requests per second, with overflow queued"
FN: "Apply a per-endpoint token bucket rate limiter at 10 requests
per second; overflow requests are queued rather than dropped"
saleor-graphql-permissions:
FP: "Gate authorization checks for mutations with side effects..."
FN: "Gate the checkoutComplete mutation's permission check before
any side effects..."
vendure-custom-fields:
FP: "Schema synchronization will use TypeORM schema:synchronize..."
FN: "Run schema:synchronize in staging first as a CI validation..."
These numbers were measuring matcher brittleness, not skill quality.
Design:
- New module tests/_extraction_matcher.py with llm_match() that does
ONE Haiku call per transcript (not N*M per pair), with the full
actual + expected lists in the user prompt and tool_use forcing a
structured matches array in return.
- Strict matching rules in the system prompt: "same constraint, same
scope, same behavior", 1:1 pairing, "when in doubt leave unmatched".
- Prompt-cached on the system prompt so re-runs within a branch only
pay for user-message tokens.
- Response cached to tests/.match-cache/ keyed on
(model, sha(actual_list), sha(expected_list)) so repeat CI runs on
the same branch pay $0 to the API. Any SKILL.md edit that changes
what Haiku extracts changes the actual_sha and invalidates the
cache automatically.
- Tool response is already a parsed dict — no JSON string parsing.
- Fails fast with a diagnostic error body on HTTP errors, same
pattern as _extract_headless.
Dispatcher:
- compute_extraction_metrics(..., matcher="auto") resolves to "llm"
when ANTHROPIC_API_KEY is set (CI path) and "rapidfuzz" otherwise
(offline unit tests). Explicit matcher="llm" or matcher="rapidfuzz"
forces the path. New "matcher" field in the metric dict records
which matcher was used for auditability.
- Runner prints "extraction (llm):" vs "extraction (rapidfuzz):" in
the aggregate summary so CI logs show which matcher ran.
Tests:
- Existing rapidfuzz-mode tests retained, explicitly forced via
matcher="rapidfuzz" (9 passing)
- New unit tests for _pick_matcher (env-based auto-resolve), LLM
cache hit path, missing-key error, _parse_matches 1:1 enforcement
with invalid indices / duplicate expecteds, and the compute_*
dispatch path via a stubbed llm_match (7 passing)
- All offline — no network required
- Total: 16 metric tests + 9 extractor tests = 25 pass
Cost per CI run:
~9 Haiku calls per run × ~2k input tokens × $0.80/Mtok = ~$0.015
plus prompt-cache hits on the system prompt amortized across calls.
Effectively free, and cached zero on repeats.
Next CI run on this branch will produce the first real extraction
metric with paraphrase-robust matching. Expected: the P=0.51 R=0.45
baseline will move upward substantially (informal eyeball suggests
~0.75+) without any skill change, because the previous numbers were
almost entirely matcher brittleness.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
tests/_extraction_metrics.py (1)
70-98:⚠️ Potential issue | 🟠 MajorRapidfuzz fallback is still order-dependent.
This loop still greedily pops the best remaining expected for each actual, so runs that fall back to
rapidfuzzcan undercount TP / overcount FP based only on list order. That keeps the offline metric unstable; this path still needs a global 1:1 bipartite matching instead of greedy removal.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/_extraction_metrics.py` around lines 70 - 98, The _rapidfuzz_match function currently does greedy per-actual selection causing order-dependent matches; replace the greedy removal with a global 1:1 bipartite matching: build a score matrix using fuzz.token_set_ratio between every actual (rows) and expected (columns), run a maximum-weight bipartite matching (e.g., Hungarian algorithm / scipy.optimize.linear_sum_assignment or a max-weight matching routine) to get globally optimal actual->expected pairs, then convert matched pairs to the (actual_idx, expected_idx) list but only accept matches whose score >= MATCH_THRESHOLD (unmatched or below-threshold become (actual_idx, None)); keep the function signature and return format unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/_extraction_matcher.py`:
- Around line 217-223: The RuntimeError raised after the HTTP POST is echoing
API key-derived data (api_key, key_prefix, key_len); remove any key-derived
fields from the exception text. Update the error construction that references
resp.status_code, resp.text, model, key_prefix, and key_len so it only includes
non-secret diagnostics (e.g., resp.status_code, resp.text[:500], and model) and
does not reference api_key, key_prefix, key_len, or ANTROP... constants.
- Around line 127-136: The cache key generation is ambiguous and missing
prompt/tool versioning: change _list_sha to produce an unambiguous hash of the
list (e.g., include item lengths or a separator encoding so ["a","b\nc"] !=
["a\nb","c"]) and update _cache_path to incorporate SYSTEM_PROMPT,
USER_PROMPT_TEMPLATE and MATCHER_TOOL (or their current version/sha) into the
key string used before calling _sha so llm_match() won't return stale hits when
prompts or tools change; update references to _list_sha and _cache_path
accordingly.
In `@tests/_extraction_metrics.py`:
- Line 51: The comment on the MATCH_THRESHOLD constant contains a Unicode
en-dash; replace the en-dash (–) with a plain ASCII hyphen (-) in the inline
comment after MATCH_THRESHOLD so the comment reads "rapidfuzz token_set_ratio,
0-100 scale" and resolves the RUF003 lint warning.
---
Duplicate comments:
In `@tests/_extraction_metrics.py`:
- Around line 70-98: The _rapidfuzz_match function currently does greedy
per-actual selection causing order-dependent matches; replace the greedy removal
with a global 1:1 bipartite matching: build a score matrix using
fuzz.token_set_ratio between every actual (rows) and expected (columns), run a
maximum-weight bipartite matching (e.g., Hungarian algorithm /
scipy.optimize.linear_sum_assignment or a max-weight matching routine) to get
globally optimal actual->expected pairs, then convert matched pairs to the
(actual_idx, expected_idx) list but only accept matches whose score >=
MATCH_THRESHOLD (unmatched or below-threshold become (actual_idx, None)); keep
the function signature and return format unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5c255b50-0c6e-45de-b5a9-b1e762779ec6
📒 Files selected for processing (4)
tests/_extraction_matcher.pytests/_extraction_metrics.pytests/eval_decision_relevance.pytests/test_extraction_metrics.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/test_extraction_metrics.py
- tests/eval_decision_relevance.py
| def _list_sha(items: list[str]) -> str: | ||
| # Order matters for cache key — same-content-different-order lists | ||
| # should map to different cache entries so we don't silently reuse a | ||
| # stale matching when the runner reshuffles. | ||
| return _sha("\n".join(items)) | ||
|
|
||
|
|
||
| def _cache_path(model: str, actual_sha: str, expected_sha: str) -> Path: | ||
| key = f"{model}|{actual_sha}|{expected_sha}" | ||
| return MATCH_CACHE_DIR / f"{_sha(key)}.json" |
There was a problem hiding this comment.
Make the cache key unambiguous and prompt-versioned.
Line 131 hashes "\n".join(items), so inputs like ["a", "b\nc"] and ["a\nb", "c"] collide. Lines 135-136 also ignore SYSTEM_PROMPT, USER_PROMPT_TEMPLATE, and MATCHER_TOOL, so prompt/tool edits will silently reuse stale judge outputs. Because llm_match() returns cache hits before calling the API, this can produce wrong metrics.
🔧 Proposed fix
def _sha(text: str) -> str:
return hashlib.sha256(text.encode("utf-8")).hexdigest()
+MATCHER_CACHE_VERSION = _sha(
+ json.dumps(
+ {
+ "system": SYSTEM_PROMPT,
+ "user_template": USER_PROMPT_TEMPLATE,
+ "tool": MATCHER_TOOL,
+ },
+ sort_keys=True,
+ ensure_ascii=False,
+ separators=(",", ":"),
+ )
+)
+
+
def _list_sha(items: list[str]) -> str:
# Order matters for cache key — same-content-different-order lists
# should map to different cache entries so we don't silently reuse a
# stale matching when the runner reshuffles.
- return _sha("\n".join(items))
+ return _sha(json.dumps(items, ensure_ascii=False, separators=(",", ":")))
def _cache_path(model: str, actual_sha: str, expected_sha: str) -> Path:
- key = f"{model}|{actual_sha}|{expected_sha}"
+ key = f"{MATCHER_CACHE_VERSION}|{model}|{actual_sha}|{expected_sha}"
return MATCH_CACHE_DIR / f"{_sha(key)}.json"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/_extraction_matcher.py` around lines 127 - 136, The cache key
generation is ambiguous and missing prompt/tool versioning: change _list_sha to
produce an unambiguous hash of the list (e.g., include item lengths or a
separator encoding so ["a","b\nc"] != ["a\nb","c"]) and update _cache_path to
incorporate SYSTEM_PROMPT, USER_PROMPT_TEMPLATE and MATCHER_TOOL (or their
current version/sha) into the key string used before calling _sha so llm_match()
won't return stale hits when prompts or tools change; update references to
_list_sha and _cache_path accordingly.
| key_prefix = api_key[:12] if api_key else "(empty)" | ||
| with httpx.Client(timeout=REQUEST_TIMEOUT_S) as client: | ||
| resp = client.post(ANTHROPIC_API_URL, headers=headers, json=payload) | ||
| if resp.status_code >= 400: | ||
| raise RuntimeError( | ||
| f"Matcher API error {resp.status_code}: {resp.text[:500]} " | ||
| f"(model={model}, key_prefix={key_prefix}..., key_len={len(api_key)})" |
There was a problem hiding this comment.
Do not echo API key material into the exception text.
Line 223 includes part of ANTHROPIC_API_KEY in every 4xx failure. Even partial credentials should not land in CI logs or artifacts; keep the model/status diagnostics and drop the key-derived fields.
🔒 Proposed fix
- key_prefix = api_key[:12] if api_key else "(empty)"
with httpx.Client(timeout=REQUEST_TIMEOUT_S) as client:
resp = client.post(ANTHROPIC_API_URL, headers=headers, json=payload)
if resp.status_code >= 400:
raise RuntimeError(
f"Matcher API error {resp.status_code}: {resp.text[:500]} "
- f"(model={model}, key_prefix={key_prefix}..., key_len={len(api_key)})"
+ f"(model={model})"
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| key_prefix = api_key[:12] if api_key else "(empty)" | |
| with httpx.Client(timeout=REQUEST_TIMEOUT_S) as client: | |
| resp = client.post(ANTHROPIC_API_URL, headers=headers, json=payload) | |
| if resp.status_code >= 400: | |
| raise RuntimeError( | |
| f"Matcher API error {resp.status_code}: {resp.text[:500]} " | |
| f"(model={model}, key_prefix={key_prefix}..., key_len={len(api_key)})" | |
| with httpx.Client(timeout=REQUEST_TIMEOUT_S) as client: | |
| resp = client.post(ANTHROPIC_API_URL, headers=headers, json=payload) | |
| if resp.status_code >= 400: | |
| raise RuntimeError( | |
| f"Matcher API error {resp.status_code}: {resp.text[:500]} " | |
| f"(model={model})" | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/_extraction_matcher.py` around lines 217 - 223, The RuntimeError raised
after the HTTP POST is echoing API key-derived data (api_key, key_prefix,
key_len); remove any key-derived fields from the exception text. Update the
error construction that references resp.status_code, resp.text, model,
key_prefix, and key_len so it only includes non-secret diagnostics (e.g.,
resp.status_code, resp.text[:500], and model) and does not reference api_key,
key_prefix, key_len, or ANTROP... constants.
| from rapidfuzz import fuzz | ||
|
|
||
| FIXTURES_DIR = Path(__file__).resolve().parent / "fixtures" / "extraction" | ||
| MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0–100 scale |
There was a problem hiding this comment.
Replace the Unicode dash in this comment.
Line 51 uses –, which Ruff flags as RUF003. Swap it for a plain - so the new file stays lint-clean.
🧹 Proposed fix
-MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0–100 scale
+MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0-100 scale📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0–100 scale | |
| MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0-100 scale |
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 51-51: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/_extraction_metrics.py` at line 51, The comment on the MATCH_THRESHOLD
constant contains a Unicode en-dash; replace the en-dash (–) with a plain ASCII
hyphen (-) in the inline comment after MATCH_THRESHOLD so the comment reads
"rapidfuzz token_set_ratio, 0-100 scale" and resolves the RUF003 lint warning.
Bootstraps the M1 adversarial corpus ground truth in-session (Claude Opus 4.6, with the cloned medusa/saleor/vendure trees available for symbol cross-checks). One JSON per adversarial transcript, matching the shape of the existing friendly-corpus fixtures. Distribution by category (15 decisions + 6 action_items total): adv-strat-fake STRAT-FAKE 0 dec 0 act (strict-exclude probe) adv-vocab-collide VOCAB-COLLIDE 1 dec 2 act (one buried needle) adv-density-extreme DENSITY-EXTREME 9 dec 2 act (compound + diffuse) adv-offtopic-mix OFFTOPIC-MIX 1 dec 1 act (buried in Q2 OR) adv-reversal REVERSAL 4 dec 1 act (final state only) Each fixture has a "notes" field explaining the category-specific extraction policy so future hand-edits stay aligned with intent. The strat-fake empty case is intentional: any positive extraction is a false positive. These transcripts are not yet wired into TRANSCRIPT_SOURCES — that edit waits on the fixture-location migration (separate cleanup ticket). Once registered, CI will produce the first adversarial baseline alongside the existing friendly-corpus numbers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Registers adv-strat-fake, adv-vocab-collide, adv-density-extreme, adv-offtopic-mix, and adv-reversal under repo_key="adversarial" so the runner reports them as their own group in CI output. Ground truth for each was committed earlier at tests/fixtures/extraction/adv-*.json. The "adversarial" repo_key is matched against --multi-repo by the parent workflow, where it aliases to the medusa clone (any indexed codebase works — adversarial transcripts test extraction quality, not grounding quality against a specific codebase). Bland-noun grounding noise on these transcripts is exactly what the VOCAB-COLLIDE category is designed to expose, and feeds the M2 (grounding precision) ticket. Runner total: 9 friendly + 5 adversarial = 14 transcripts. Local smoke test: --skill-variant none with adversarial in the repo map handles the empty case cleanly (adversarial transcripts have no entries in ALL_DECISIONS, so none-mode produces 0/0 for the adversarial group without crashing). Live extraction in CI's from-skill-md mode will be the first real measurement.
Removes the 9 friendly e-commerce fixtures (medusa-*, saleor-*, vendure-*) from tests/fixtures/extraction/ and the matching TRANSCRIPT_SOURCES entries from decisions.py. The friendly corpus was too easy to be informative about real-world reliability — F1=0.87 on sprint-planning meetings says nothing about whether bicameral handles hedged language, generic vocabulary, compound sentences, off-topic meetings, or reversed decisions. After this commit: TRANSCRIPT_SOURCES → 5 entries, all repo_key="adversarial" fixtures/extraction/ → README.md + 5 adv-*.json files The 9 friendly transcript .md files in pilot/ml/data/transcripts/ remain in place — they're not bicameral-mcp test fixtures, they're data assets used elsewhere in pilot/ml/. Only the test-fixture copies in pilot/mcp/tests/fixtures/extraction/ are removed. ALL_DECISIONS in decisions.py is unchanged — it's still consumed by tests/eval_code_locator.py for the M2 RAG eval, which targets the friendly OSS repos and is conceptually independent of M1. Offline test suite: 40/40 pass after cleanup (15 phase-2 + 16 metrics + 9 extract_headless).
Resolves four conflicts: - .gitignore: keeps both qor:seed block and .claude/worktrees ignore. - docs/META_LEDGER.md: takes dev's chain (#1-#26) as canonical. This branch's parallel #7-#14 entries (sealed against the obsolete Entry #6 base before dev added #7-#26) are not folded in here — they need to be re-authored against dev's #26 seal via /qor-meta-log-decision in a follow-up. Their session content is preserved in commit messages (f4de501, 3f856af) and in docs/SYSTEM_STATE.md. - docs/SYSTEM_STATE.md: keeps both the v0 process cleanup / preflight hook session blocks from this branch and dev's #124 implementation block. - skills/bicameral-preflight/SKILL.md: keeps both the new "Hook reinforcement" subsection (this branch) and the Telemetry section (dev). The earlier removal of Telemetry in this branch's f4de501 was authored against a stale base; restoring it here.
Priority C v0 — Phase 1 of the team-server vertical-slice for multi-dev decision continuity at organizational scale (per the Sales Enablement & Positioning Playbook + research-brief-priority-c-selective-ingest-2026-05-02.md). The team-server is a self-managing, customer-self-hosted Python service. Per CONCEPT.md anti-goals under literal-keyword parsing (SHADOW_GENOME Failure Entry #6 addendum): "no managed backend" forbids vendor SaaS and human-ops-tax architectures, NOT self-managing customer-deployable backends. Sentry self-hosted, Supabase OSS, and the embedded-SurrealDB philosophy already in repo are precedents. Scaffold delivered: - team_server/app.py (47 LOC): FastAPI app factory; lifespan migrates schema on startup, closes DB on teardown; /health endpoint - team_server/schema.py (80 LOC): v0 schema for workspace, channel_allowlist, extraction_cache, team_event tables. Idempotent ensure_schema(); migration dispatch table for future versions. FLEXIBLE TYPE object on canonical_extraction + payload (per #72 lesson + audit Advisory #3). - team_server/db.py (41 LOC): TeamServerDB factory wrapping ledger.client.LedgerClient with team-server's own ns/db pair. - deploy/team-server.docker-compose.yml + Dockerfile.team-server: single-service compose; volume for persistent data; healthcheck on /health; runs as non-root. Tests (6 functionality tests, all green): - tests/test_team_server_app.py: app starts + serves health, schema migrates from empty + idempotent, lifespan teardown closes DB, /health returns well-formed JSON. - tests/test_team_server_deploy.py: docker-compose config validates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of plan-priority-c-team-server-slack-v0.md — adds the polling worker, the canonical-extraction cache that closes the multi-dev extraction-divergence gap, and the peer-author event writer. Multi-dev convergence mechanism: any dev's session that touches a Slack message produces the SAME canonical extraction across the team because get_or_compute is keyed on (source_type, source_ref, content_hash) and the cache row is shared via the team-server's append-only event log. This is what closes Playbook Pillar #1 (Decision Continuity) at multi-dev/multi-agent scale that the v1 brief incorrectly framed as "build curation only" — see SHADOW_GENOME Failure Entry #6. The interim canonical extraction uses Anthropic Claude with the model_version='interim-claude-v1' tombstone so a future Phase 5 (CocoIndex #136 integration) can identify and rebuild interim entries when memoized deterministic transforms become available. Files added: - team_server/extraction/canonical_cache.py (45 LOC): get_or_compute() returns cached extraction OR invokes compute_fn + persists. - team_server/extraction/llm_extractor.py: interim Anthropic-backed extractor; production-default until CocoIndex lands. - team_server/sync/peer_writer.py (42 LOC): write_team_event() — appends to team_event with author_email='team-server@<team_id>.bicameral' identity (single-bot per workspace; multi-instance HA is v1). - team_server/workers/slack_worker.py (100 LOC): poll_once() — conversations_history per allowlisted channel, content_hash dedup, cache-keyed extraction, peer event write per new message. Tests (6 functionality tests, all green): - tests/test_team_server_canonical_cache.py: cache hit returns existing without compute_fn, cache miss persists + subsequent call returns cached, content_hash variation produces new rows. - tests/test_team_server_slack_worker.py: worker polls only allowlisted channels, writes one team_event per message with peer-author identity, dedups via message ts (idempotent re-poll). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 of plan-priority-c-team-server-slack-v0.md — closes the multi-dev convergence loop by exposing team_event over HTTP and extending EventMaterializer with a failure-isolated team-server pull. The materializer pull is OUTSIDE the deterministic core (per CONCEPT.md literal-keyword parsing of "no network calls in the deterministic core" — SHADOW_GENOME Failure Entry #6 addendum). Failure-isolation contract: team-server outage NEVER cascades into per-dev preflight failures — events/team_server_pull.py swallows transport errors, returns empty events, leaves the watermark unchanged. Files added: - team_server/api/events.py: GET /events?since=N&limit=M endpoint; reads team_event ordered by sequence ascending; pagination via since cursor. - events/team_server_pull.py (57 LOC): pull_team_server_events() queries the team-server's /events endpoint, persists watermark per call, swallows transport errors gracefully. - team_server/app.py (modified): include events router. Tests (6 functionality tests, all green): - tests/test_team_server_events_api.py: /events returns rows in sequence order, paginates via since cursor, returns empty (not error) when no new events. - tests/test_materializer_team_server_pull.py: materializer pulls from team-server URL, watermark advances + persists separately, 503 /transport-error degrades gracefully (no exception, watermark unchanged) — failure-isolation contract verified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
QorLogic SDLC governance trail for the Priority C v0 implementation that landed in commits 1-4 of this PR. Includes: - docs/research-brief-priority-c-selective-ingest-2026-05-02.md (v3) — research substrate. v1 was rejected for INVARIANT_FROM_IMPLEMENTATION (treating v0 agent-fetches-only code state as product principle); v2 added playbook substrate; v3 narrowed to Slack-first + team-server + CocoIndex-conditional after operator dialogue clarified "no managed backend" = "no human-ops-tax architecture," not "no backend." - plan-priority-c-team-server-slack-v0.md (437 LOC) — the L3 plan with five phases (Phase 5 deferred per "if we can manage it" feasibility caveat). - docs/SHADOW_GENOME.md Failure Entry #6 + addendum — captures the framing-error pattern AND the "anti-goals must be parsed by their load-bearing keyword" lesson; symmetric to v0-code-as-principle. - docs/META_LEDGER.md Entries #27 (IMPLEMENT) + #28 (SEAL). Predecessor: efd0304b (#135-triage seal on dev). Implement chain: 211ffb9e. Substantiation seal: 6f4f8f8f1d63ad82b952a3c6aff270d30584e08b0572077ff685e84ce453f6c2 - docs/SYSTEM_STATE.md — Priority C v0 section appended; documents schema additions, architectural properties achieved, audit advisory disposition, Phase 5 deferred state, and the qor-logic-internal steps skipped for downstream-project rationale. - .agent/staging/AUDIT_REPORT.md — PASS verdict, three non-blocking advisories all addressed at implement-time. Verdict: REALITY = PROMISE for Phases 1-4. Phase 5 (CocoIndex #136) explicitly deferred per plan slip-independence design. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+Gemini+Codex-2 (#205) Authors a single Reviewer Disposition Pass table at the top of the brief reconciling all 32 review points across four review layers (Codex first-pass, Kilo, Gemini CLI, Codex second-pass) into one post-review consensus before downstream P1 issue-filing — per the explicit Codex-2 #1 directive. Decisions: 21 applied this commit, 6 already applied in 1d82658, 3 deferred to follow-up, 2 note-only. Net new gap IDs added per disposition: GDPR-08 (ephemeral data), GDPR-09 (consent versioning + revocation), LLM-11 (cross-tool config-file modification surface), MCP-01 (host UX as external dependency), CFG-01 (config precedence + fail-closed model). Reclassification: LLM-06 P0/M → P1/M with scope narrowed to future remote-skill-loading channel (per Kilo #2). Major content additions to the brief: - § 1.1: MCP host UX is external dependency, not security gate (new gap MCP-01) — host that auto-approves tool calls bypasses any "operator will see this" assumption. - § 1.2: SurrealDB version pinning supply-chain callout (Kilo #11). - § 1.7: cross-tool config-file modification surface (new gap LLM-11) distinct from skill-content surface — `setup_wizard` writes shell commands into `.claude/settings.json` that run host-side at hook fire. - § 1.11 (new): Configuration precedence + fail-closed model — single uniform precedence rule across all knobs (env > config.yaml > hardcoded defaults), fail-closed semantics on missing/malformed/ contradictory config (Codex-2 #5). - § 2.4 (a): LLM02 mapping note clarifying it folds into LLM-07 + OWASP-04 (Kilo #13). - § 2.4 (b): explicit `confirm=True` is agent-supplied not HITL (Kilo #3) — security context cannot rely on agent-filled params. - § 2.4 (c) LLM-01 + LLM-04: extensible classifier (Gemini #2) + guardrail-not-classifier framing (Codex-1 #6) + control-acceptance template (Codex-2 #4) — quarantine, override, test fixtures, measurement counters. - § 2.4 (c) LLM-03: timeouts as `.bicameral/config.yaml` knobs (Gemini #3). - § 2.4 (c) LLM-05 + LLM-09: out-of-band operator confirmation, not agent-supplied confirm parameters (Kilo #3). - § 2.4 (c) LLM-06: scope-narrowed to future remote-skill-loading; in current install model the wheel-trust covers it (Kilo #2). - § 2.4 (c) LLM-11 (new): cross-tool config-file gate (signed hooks-manifest.json) distinct from skill manifest. - § 2.1 (c) GDPR-01: three remediation candidates — tombstone-and- rebuild with signed manifest (Kilo #12), crypto-shredding (Gemini #1), or scope-out via PII detect-and-refuse. - § 2.1 (c) GDPR-02: data-subject-access search must cover full identifier surface (description, source_ref, topic, file paths) not just signer email (Codex-1 #5). - § 2.1 (c) GDPR-08 (new): ephemeral data surfaces (tempfiles, swap, WAL, crash dumps) (Kilo #7). - § 2.1 (c) GDPR-09 (new): consent versioning + revocation semantics (Kilo #8 + Codex-2 #3). - § 5: gap table updated with new rows + LLM-06 reclassification; gap counts post-disposition (5 P0 / 19 P1 / 16 P2 / 5 P3 = 45 total, up from 41). - § 6.1 (new): epic grouping for deferred P1 batch (Codex-1 #10) — ingest boundary guardrails, per-tool authority gradation, supply- chain signing, telemetry & consent. - § 6.2 (new): six-section control-acceptance template for every DG gap (Codex-2 #4) — positive / negative / bypass / fail-closed / telemetry / docs. Filed-issue updates: - Issue #214 (LLM-06): relabeled P0 → P1, retitled to reflect scope narrowing, full disposition comment added. - Issue #212 (LLM-01) + #213 (LLM-04): disposition comments added capturing the guardrail framing, classifier extensibility, and control-acceptance template applicable to both. Deferred for follow-up: Codex-1 #4 (controller/processor restructure of standards table), Codex-1 #9 (full evidence appendix beyond the methodology softening), Codex-2 #2 (full 3-column deployment-profile matrix beyond the single-column trigger). Brief now 706 lines (up from 606); +124 line diff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary by CodeRabbit
New Features
Documentation
Tests
Chores