M1/decision relevance ruler by jinhongkuan · Pull Request #6 · BicameralAI/bicameral-mcp

jinhongkuan · 2026-04-13T22:49:47Z

Summary by CodeRabbit

New Features
- Ingest now reports grounding metrics (grounded count and grounded percentage) and supports per-source grounding breakdowns
- End-to-end decision-relevance evaluation and a headless extraction workflow for measuring extraction + grounding
Documentation
- Added README and schema guidance for ground-truth extraction fixtures
Tests
- Large suite of tests and fixtures for extraction, matching, metrics, and grounding
Chores
- Fixture regeneration utility for extraction fixtures

Adds the instrumentation and eval runner for the "Decision Relevance" metric from bicameral-quality-validation-roadmap.html (M1). Lets any branch that edits .claude/skills/bicameral-ingest/SKILL.md produce a directly-comparable grounded_pct in CI artifacts. Changes: - handlers/ingest.py: log "[ingest] complete: N/M grounded (pct%)" and populate stats.grounded / stats.grounded_pct on IngestResponse. - contracts.py: extend IngestStats (additive, no breaking change). - ledger/queries.py: get_grounding_breakdown() groups by source_ref. - tests/fixtures/expected/decisions.py: TRANSCRIPT_SOURCES map so the runner can discover which transcript targets which repo. Adding a new transcript/repo is a one-line fixture edit. - tests/eval_decision_relevance.py: new fixture-driven runner with two modes — "none" (ingest fixture decisions directly, no LLM) and "from-skill-md" (headless extraction from current SKILL.md, then grounding). Mirrors eval_code_locator.py structure. - tests/_extract_headless.py: runs Step 1 of SKILL.md via the Anthropic Messages API using Authorization: Bearer CLAUDE_CODE_OAUTH_TOKEN. Caches responses keyed on (skill_md_sha, transcript_sha, model) — editing SKILL.md on a branch invalidates the cache automatically. - tests/test_extract_headless.py: offline unit tests for Step 1 excerpt parsing, fenced-JSON tolerance, and cache hit path. - tests/test_phase2_ledger.py: 2 new tests for grounded-field population + get_grounding_breakdown. Also repairs test_ungrounded_intent_has_correct_status to use gibberish text (it was silently relying on the broken grounding path below). Incidental fix: - code_locator/tools/validate_symbols.py: store self._db so adapters/code_locator.py:190 can reach db.lookup_by_file() during auto-grounding. Previously the whole ground_mappings() path was a silent no-op, which masked test_phase3_integration failures and produced 100%-ungrounded results for any real run. Regression: 30 tests pass across phase1 + phase2 + extract_headless. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The driver was resolving SKILL.md relative to the PARENT repo root (.claude/skills/bicameral-ingest/SKILL.md), but that path is gitignored at the parent level — only the submodule copy is tracked (pilot/mcp/.claude/skills/bicameral-ingest/SKILL.md). In CI the file was absent and the extraction+grounding step failed at _load_skill_md before touching auth. Resolve to MCP_ROOT (parents[1]) instead. This makes the submodule the single source of truth for the skill and keeps Phase 5 A/B branches inside the submodule — consistent with every other MCP asset. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The CI run bit us with `httpx.LocalProtocolError: Illegal header value b'Bearer '` — the env var resolved to an empty string and we concatenated it into the Authorization header, producing malformed output instead of a clear error. Switch to os.environ.get(..., "") + .strip() check so a missing or empty-string token surfaces with a self-explaining RuntimeError before we build any HTTP request.

The CI run came back 401 from the public Messages API with only the status code visible — not enough to tell whether the Claude Code OAuth token is (a) lacking the right scope, (b) the wrong token type for this endpoint, or (c) rejected for some other reason. Capture the response body (first 500 chars) and emit it with the model name, token prefix (first 12 chars, enough to distinguish sk-ant-api03 from sk-ant-oat01 without leaking material), and token length. This is diagnostic-only; the cached success path is untouched.

Anthropic's public Messages API rejects Claude Code OAuth tokens with 401 "OAuth authentication is currently not supported". Switch to standard API-key auth: - ANTHROPIC_API_KEY env var instead of CLAUDE_CODE_OAUTH_TOKEN - x-api-key header instead of Authorization: Bearer - Function parameter: api_key (was oauth_token) - Error messages point at the ci-test environment secret path Cached success path is untouched; tests still pass offline with no network credential needed.

Haiku 4.5 was returning JSON with unescaped inner double quotes inside description strings (e.g., "or \`[\"*\"]\` for global access"), breaking json.loads mid-parse. Also the default 4096 max_tokens was near the ceiling for transcripts with 10+ decisions. Switch to Anthropic tool use: - Define an `submit_extraction` tool with a JSON schema for decisions + action_items - Force the model to call it via tool_choice - Read tool_use.input directly — it's a pre-parsed Python dict delivered by the API, so no JSON string parsing, no unescape failures, no markdown fence drift Also: - Bump MAX_OUTPUT_TOKENS 4096 → 8192 (cheap headroom) - Raise clean errors on stop_reason="max_tokens" and missing tool_use blocks - Keep _parse_response_json as a defensive utility (not on the hot path) so the existing offline tests for fence tolerance still run Offline tests: 9/9 pass. Next CI run is the real end-to-end validation against medusa/saleor/vendure via Haiku.

Reshapes the M1 eval so CI doesn't depend on Anthropic as a "ground-truth oracle" on every run. New architecture: simulation (every CI run, Haiku 4.5 via live API): - Runs current SKILL.md Step-1 extraction on each transcript - Ingests via the grounding pipeline - Reports grounded_pct as before ground truth (pregenerated offline, Opus 4.6, committed to git): - tests/fixtures/extraction/<source_ref>.json — one per transcript - Generated by tests/regen_extraction_fixtures.py --all - Hand-editable: the committed JSON is the oracle, Opus is just the bootstrap tool. audit provenance fields track model + timestamp + skill_md_sha so staleness is visible in git diff. metric (computed by the runner at M1 eval time): - For each transcript, fuzzy-match extracted decisions against the ground-truth fixture via rapidfuzz.fuzz.token_set_ratio (threshold 70) - 1:1 matching (no double-counting) - Per-transcript precision / recall / F1 + counts - Aggregate across all scored transcripts - Skipped gracefully when no fixture file exists, so CI stays green until fixtures are bootstrapped Phase 5 A/B workflow (once fixtures exist): 1. Branch from main, edit .claude/skills/bicameral-ingest/SKILL.md 2. git push → CI runs Haiku extraction via new skill → reports precision/recall delta against the FIXED Opus fixture 3. Compare across experiment branches. Winner merges. New files: tests/_extraction_metrics.py — rapidfuzz matching + aggregate math tests/test_extraction_metrics.py — 9 unit tests (1:1 match, P/R math) tests/regen_extraction_fixtures.py — Opus-backed bootstrap script tests/fixtures/extraction/README.md — format + workflow docs Also: - Incidental fix in _build_payload_from_skill_md: coerce action_item.owner=None → "" to satisfy IngestPayload contract (the in-flight CI tool-use run exposed this via a pydantic error) Regression: 30 tests pass (9 extract_headless + 9 extraction_metrics + 12 phase-2). Runner smoke-test in --skill-variant none mode passes end-to-end with the new plumbing and cleanly reports "extraction skipped (no fixtures)" when fixtures dir is empty. Next step (outside this commit): bootstrap fixtures locally via ANTHROPIC_API_KEY=... .venv/bin/python tests/regen_extraction_fixtures.py --all commit the resulting JSONs, push. CI will then produce real P/R numbers alongside grounded_pct. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Generates the M1 ground-truth extraction fixtures in-session instead of paying Anthropic API tokens, using Claude Opus 4.6 (this session) with the cloned medusa/saleor/vendure trees available for spot-verification of claimed symbols and files. Produced by reading each transcript with the code alongside it, applying SKILL.md Step-1 rules (when in doubt, exclude), and writing one JSON per source_ref. Coverage (67 decisions + 33 action_items across 9 transcripts): medusa-payment-timeout 4 decisions, 5 action_items medusa-plugin-migration 8 decisions, 5 action_items medusa-webhook-notifications 12 decisions, 3 action_items saleor-checkout-extensibility 10 decisions, 3 action_items saleor-graphql-permissions 8 decisions, 3 action_items saleor-order-workflows 7 decisions, 4 action_items vendure-channel-pricing 7 decisions, 2 action_items vendure-custom-fields 7 decisions, 5 action_items vendure-search-reindexing 4 decisions, 3 action_items Provenance fields on each fixture: generated_by: claude-opus-4-6[1m]-in-session skill_md_sha: bb8f81be7f20 (current SKILL.md) generated_at: 2026-04-13 These are hand-editable going forward. Re-run tests/regen_extraction_fixtures.py (with ANTHROPIC_API_KEY) to regenerate via the API path if/when the transcripts change or the Phase 5 skill variants warrant a fresh reference. Also incidental: - eval_decision_relevance.py: stop hardcoding "no fixtures" in the extraction-skipped console message; just print the aggregate's own reason string. The hardcoded text was misleading when the skip reason was actually "variant=none, metric not applicable". Regression: 33 tests pass across test_extract_headless, test_extraction_metrics, test_phase2_ledger. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-13T22:50:00Z

Warning

Rate limit exceeded

@jinhongkuan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 49 minutes and 21 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 49 minutes and 21 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 036641bd-e291-42db-81d9-fdeafc104ec4

📥 Commits

Reviewing files that changed from the base of the PR and between 0c1dac9 and cfbf490.

📒 Files selected for processing (1)

tests/fixtures/expected/decisions.py

📝 Walkthrough

Walkthrough

ValidateSymbolsTool now retains the DB instance; ingest handler computes and logs grounded counts and percentage and populates new IngestStats fields; ledger gains grounding breakdown query; a headless extraction driver, LLM/rapidfuzz matchers, metrics, fixtures, regeneration tooling, and evaluation runner with tests were added.

Changes

Cohort / File(s)	Summary
Validate & Ingest `code_locator/tools/validate_symbols.py`, `contracts.py`, `handlers/ingest.py`	`ValidateSymbolsTool` now stores the provided `db` on `self._db`. `IngestStats` gains `grounded` and `grounded_pct`. `handle_ingest` computes `grounded`/`grounded_pct`, clamps values, logs `[ingest] complete`, and populates the new stats fields.
Ledger Query `ledger/queries.py`	Added `get_grounding_breakdown(...)` async helper that aggregates intents by `source_ref` (grounded/ungrounded/total) and computes `grounded_pct`; supports optional `source_type`/`source_scope` filters.
Headless Extraction Driver `tests/_extract_headless.py`, `tests/test_extract_headless.py`	New headless extractor that extracts Step 1 from `SKILL.md`, forces Anthropic tool-based extraction, caches results, and includes parsing, cache, and error-handling tests.
Extraction Matching & Metrics `tests/_extraction_matcher.py`, `tests/_extraction_metrics.py`, `tests/test_extraction_metrics.py`	Added LLM-backed 1:1 matcher with cache, rapidfuzz fallback, deterministic parsing, `compute_extraction_metrics` and `aggregate_extraction_metrics`, plus unit tests covering matchers and metrics.
Evaluation Runner & Regression Tests `tests/eval_decision_relevance.py`, `tests/test_phase2_ledger.py`	New CLI evaluation runner orchestrating extract→ingest→metrics across repos/transcripts, aggregates per-repo/overall grounding and extraction metrics, enforces regression gates; tests assert grounding stats and ledger query behavior.
Fixture Registry & Fixtures `tests/fixtures/expected/decisions.py`, `tests/fixtures/extraction/README.md`, `tests/fixtures/extraction/*.json`	Added `TRANSCRIPT_SOURCES` mapping, documentation, and numerous M1 extraction JSON fixtures with metadata and decisions/action_items.
Fixture Regeneration Tooling `tests/regen_extraction_fixtures.py`	Added script to regenerate extraction fixtures via the headless extractor with CLI options (`--all`, `--source-ref`, `--model`, `--force`, `--dry-run`).
Test Harness Additions `tests/test_extract_headless.py`, `tests/test_extraction_metrics.py`, `tests/test_phase2_ledger.py`	Extensive tests for extractor parsing, cache behavior, matcher selection, metric aggregation, and grounding stats assertions.

Sequence Diagram(s)

sequenceDiagram
    participant Handler as Ingest Handler
    participant Tool as ValidateSymbolsTool
    participant DB as Database
    participant Ledger as Ledger Queries

    Handler->>+Handler: handle_ingest(payload)
    Handler->>+Tool: validate_symbols(...)
    Tool->>+DB: lookup_by_file() via self._db
    DB-->>-Tool: symbol results
    Tool-->>-Handler: validation results

    Handler->>Handler: compute intents_created, ungrounded
    Handler->>Handler: grounded = max(0, intents_created - ungrounded)
    Handler->>Handler: grounded_pct = grounded/intents_created or 0.0
    Handler->>+Ledger: persist/return IngestStats (including grounded, grounded_pct)
    Ledger-->>-Handler: stats returned
    Handler-->>-Handler: return result

sequenceDiagram
    participant Eval as Eval Runner
    participant Extract as Headless Extractor
    participant API as Anthropic API
    participant Ingest as Ingest Handler
    participant Metrics as Metrics Computer

    Eval->>+Eval: load transcripts & plan
    loop per transcript
        Eval->>+Extract: extract_from_current_skill(transcript)
        Extract->>Extract: check cache
        alt cache hit
            Extract-->>Eval: return cached decisions
        else cache miss
            Extract->>+API: send tool-based extraction request
            API-->>-Extract: tool_use response
            Extract->>Extract: parse & cache
            Extract-->>Eval: return extracted decisions
        end

        Eval->>+Ingest: handle_ingest(payload)
        Ingest-->>-Eval: return grounding stats

        Eval->>+Metrics: compute_extraction_metrics(extracted, fixture)
        Metrics-->>-Eval: return precision/recall/F1
    end
    Eval->>Eval: aggregate per-repo & overall metrics
    Eval->>Eval: enforce regression gates -> exit code

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related PRs

refactor: BicameralContext — request-scoped snapshot isolation #4 — Related refactor touching grounding and code-locator adapter flows; aligns with retaining ValidateSymbolsTool._db to enable DB lookups during auto-grounding.

Poem

🐰 I nibbled the fixtures, found carrots of truth,

Hopped through MD steps and cached each sweet tooth,
Grounded the intents, watched percentages bloom,
Metrics and tests guard the warren and room,
A rabbit applauds: tidy, robust, and sleuth.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 51.02% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'M1/decision relevance ruler' is vague and does not clearly summarize the main changes in the pull request.	Consider a more descriptive title that captures the primary objective, such as 'Add M1 decision-relevance evaluation framework with grounding metrics and extraction validation' or 'Implement extraction-based decision grounding evaluation system'.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch m1/decision-relevance-ruler

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

tests/_extract_headless.py (1)

166-169: Chain the exception for better traceability.

When re-raising as a new exception type, use from exc to preserve the original traceback chain. This helps with debugging when JSON parsing fails.

Proposed fix

     try:
         parsed = json.loads(text)
     except json.JSONDecodeError as exc:
-        raise ValueError(f"extractor did not return valid JSON: {exc}\nbody={body[:500]}")
+        raise ValueError(f"extractor did not return valid JSON: {exc}\nbody={body[:500]}") from exc

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/_extract_headless.py` around lines 166 - 169, The except block around
json.loads(text) currently raises a new ValueError without chaining the original
json.JSONDecodeError; update the except handler for json.JSONDecodeError (the
block catching exc) to re-raise the ValueError using "from exc" so the original
traceback is preserved (i.e., change the raise in the try/except that follows
json.loads(text) to use exception chaining).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ledger/queries.py`:
- Around line 571-580: The code currently treats any non-"ungrounded" status
(including empty or missing status) as grounded; update the logic in the block
that obtains status and updates the bucket (the variables status,
buckets.setdefault, and keys "grounded"/"ungrounded"/"total") so that only an
explicit "grounded" value increments b["grounded"], and all other cases
(including empty string or missing status) increment b["ungrounded"] (or are
handled as ungrounded) before updating totals and grounded_pct.

In `@tests/_extraction_metrics.py`:
- Around line 88-103: The current greedy loop matching items from actual to
unmatched_expected (variables actual, unmatched_expected, matched) using
fuzz.token_set_ratio and popping the best per iteration causes order-dependent
results; replace this with an order-independent max-cardinality / max-weight
bipartite matching: build a weight matrix of scores between each actual and each
expected (using fuzz.token_set_ratio and MATCH_THRESHOLD), run a max-weight
matching (e.g., Hungarian algorithm or
networkx.algorithms.matching.max_weight_matching) to select globally optimal
pairs, then populate matched, unmatched_actual, and unmatched_expected from that
matching instead of popping inside the greedy loop. Ensure ties below
MATCH_THRESHOLD are treated as non-matches.

In `@tests/eval_decision_relevance.py`:
- Around line 244-248: Wrap the json.loads(args.multi_repo) call in a try/except
to catch json.JSONDecodeError so a malformed --multi-repo value produces a clean
CLI error instead of a traceback; specifically, when parsing into repo_map (the
repo_map: dict[str, str] = json.loads(args.multi_repo) line), catch
json.JSONDecodeError, print a descriptive message to stderr (e.g. "error:
invalid --multi-repo JSON") and return the same error tuple ({}, 2) as the empty
case.

In `@tests/fixtures/extraction/saleor-checkout-extensibility.json`:
- Around line 50-51: Replace the ambiguous empty string owner value with a
proper null to represent an unassigned owner: find the JSON object containing
the "text" value "Deliver base implementation for review by end of sprint;
check-in Thursday on the plugin changes (critical path)" and set its "owner"
property from "" to null so consumers don't need extra normalization logic.

In `@tests/fixtures/extraction/vendure-custom-fields.json`:
- Around line 49-50: Change the JSON fixture entry where the object has "text":
"All PRs reviewed by Wednesday" and "owner": "" so that the owner field is
normalized to null; specifically replace the empty-string value for the "owner"
property with a JSON null to ensure schema consistency across fixtures.

---

Nitpick comments:
In `@tests/_extract_headless.py`:
- Around line 166-169: The except block around json.loads(text) currently raises
a new ValueError without chaining the original json.JSONDecodeError; update the
except handler for json.JSONDecodeError (the block catching exc) to re-raise the
ValueError using "from exc" so the original traceback is preserved (i.e., change
the raise in the try/except that follows json.loads(text) to use exception
chaining).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 443dbd1e-4916-4f3c-be13-116bede51987

📥 Commits

Reviewing files that changed from the base of the PR and between 4d46607 and a23dba3.

📒 Files selected for processing (22)

code_locator/tools/validate_symbols.py
contracts.py
handlers/ingest.py
ledger/queries.py
tests/_extract_headless.py
tests/_extraction_metrics.py
tests/eval_decision_relevance.py
tests/fixtures/expected/decisions.py
tests/fixtures/extraction/README.md
tests/fixtures/extraction/medusa-payment-timeout.json
tests/fixtures/extraction/medusa-plugin-migration.json
tests/fixtures/extraction/medusa-webhook-notifications.json
tests/fixtures/extraction/saleor-checkout-extensibility.json
tests/fixtures/extraction/saleor-graphql-permissions.json
tests/fixtures/extraction/saleor-order-workflows.json
tests/fixtures/extraction/vendure-channel-pricing.json
tests/fixtures/extraction/vendure-custom-fields.json
tests/fixtures/extraction/vendure-search-reindexing.json
tests/regen_extraction_fixtures.py
tests/test_extract_headless.py
tests/test_extraction_metrics.py
tests/test_phase2_ledger.py

coderabbitai · 2026-04-13T22:56:21Z

+        status = str(row.get("status") or "")
+        b = buckets.setdefault(
+            ref,
+            {"source_ref": ref, "grounded": 0, "ungrounded": 0, "total": 0, "grounded_pct": 0.0},
+        )
+        b["total"] += 1
+        if status == "ungrounded":
+            b["ungrounded"] += 1
+        else:
+            b["grounded"] += 1


⚠️ Potential issue | 🟠 Major

Do not treat empty/missing status as grounded.

At Line 571-580, status == "" (or missing) currently increments grounded, which can overstate grounded_pct and skew eval outcomes.

Suggested fix

- status = str(row.get("status") or "") + status = str(row.get("status") or "") b = buckets.setdefault( ref, {"source_ref": ref, "grounded": 0, "ungrounded": 0, "total": 0, "grounded_pct": 0.0}, ) b["total"] += 1 - if status == "ungrounded": + if not status or status == "ungrounded": b["ungrounded"] += 1 else: b["grounded"] += 1

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ledger/queries.py` around lines 571 - 580, The code currently treats any non-"ungrounded" status (including empty or missing status) as grounded; update the logic in the block that obtains status and updates the bucket (the variables status, buckets.setdefault, and keys "grounded"/"ungrounded"/"total") so that only an explicit "grounded" value increments b["grounded"], and all other cases (including empty string or missing status) increment b["ungrounded"] (or are handled as ungrounded) before updating totals and grounded_pct.

coderabbitai · 2026-04-13T22:56:21Z

+    for a in actual:
+        if not unmatched_expected:
+            unmatched_actual.append(a)
+            continue
+        # Find best match in remaining expected pool
+        best_i = -1
+        best_score = -1.0
+        for i, e in enumerate(unmatched_expected):
+            score = fuzz.token_set_ratio(a, e)
+            if score > best_score:
+                best_score = score
+                best_i = i
+        if best_score >= MATCH_THRESHOLD:
+            matched.append((a, unmatched_expected[best_i], best_score))
+            unmatched_expected.pop(best_i)
+        else:


⚠️ Potential issue | 🟠 Major

Order-dependent greedy matching can mis-score TP/FP/FN.

At Line 88, matching extracted decisions sequentially and popping matches at Line 102 makes outcomes dependent on input order. For a regression gate, this can undercount true positives on equivalent outputs.

Proposed fix (order-independent max-cardinality matching)

def compute_extraction_metrics( @@ - unmatched_expected = list(expected) - matched: list[tuple[str, str, float]] = [] - unmatched_actual: list[str] = [] - - for a in actual: - if not unmatched_expected: - unmatched_actual.append(a) - continue - # Find best match in remaining expected pool - best_i = -1 - best_score = -1.0 - for i, e in enumerate(unmatched_expected): - score = fuzz.token_set_ratio(a, e) - if score > best_score: - best_score = score - best_i = i - if best_score >= MATCH_THRESHOLD: - matched.append((a, unmatched_expected[best_i], best_score)) - unmatched_expected.pop(best_i) - else: - unmatched_actual.append(a) + # Build thresholded bipartite graph (actual -> expected), neighbors sorted by score desc. + adj: list[list[int]] = [] + score_map: dict[tuple[int, int], float] = {} + for ai, a in enumerate(actual): + candidates: list[tuple[int, float]] = [] + for ei, e in enumerate(expected): + s = fuzz.token_set_ratio(a, e) + if s >= MATCH_THRESHOLD: + candidates.append((ei, s)) + score_map[(ai, ei)] = s + candidates.sort(key=lambda t: t[1], reverse=True) + adj.append([ei for ei, _ in candidates]) + + # Maximum-cardinality bipartite matching (Kuhn DFS). + match_e_to_a = [-1] * len(expected) + + def _dfs(ai: int, seen: set[int]) -> bool: + for ei in adj[ai]: + if ei in seen: + continue + seen.add(ei) + prev_ai = match_e_to_a[ei] + if prev_ai == -1 or _dfs(prev_ai, seen): + match_e_to_a[ei] = ai + return True + return False + + for ai in range(len(actual)): + _dfs(ai, set()) + + matched_actual = {ai for ai in match_e_to_a if ai != -1} + matched_expected_idx = {ei for ei, ai in enumerate(match_e_to_a) if ai != -1} + + matched: list[tuple[str, str, float]] = [] + for ei, ai in enumerate(match_e_to_a): + if ai != -1: + matched.append((actual[ai], expected[ei], score_map[(ai, ei)])) + + unmatched_actual = [a for i, a in enumerate(actual) if i not in matched_actual] + unmatched_expected = [e for i, e in enumerate(expected) if i not in matched_expected_idx]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/_extraction_metrics.py` around lines 88 - 103, The current greedy loop matching items from actual to unmatched_expected (variables actual, unmatched_expected, matched) using fuzz.token_set_ratio and popping the best per iteration causes order-dependent results; replace this with an order-independent max-cardinality / max-weight bipartite matching: build a weight matrix of scores between each actual and each expected (using fuzz.token_set_ratio and MATCH_THRESHOLD), run a max-weight matching (e.g., Hungarian algorithm or networkx.algorithms.matching.max_weight_matching) to select globally optimal pairs, then populate matched, unmatched_actual, and unmatched_expected from that matching instead of popping inside the greedy loop. Ensure ties below MATCH_THRESHOLD are treated as non-matches.

coderabbitai · 2026-04-13T22:56:21Z

+    repo_map: dict[str, str] = json.loads(args.multi_repo) if args.multi_repo else {}
+    if not repo_map:
+        print("error: --multi-repo is required", file=sys.stderr)
+        return {}, 2
+


⚠️ Potential issue | 🟡 Minor

Handle malformed --multi-repo JSON with a clean CLI error.

Line 244 can throw json.JSONDecodeError and produce a traceback instead of a controlled exit path.

Proposed fix

- repo_map: dict[str, str] = json.loads(args.multi_repo) if args.multi_repo else {} + try: + repo_map: dict[str, str] = json.loads(args.multi_repo) if args.multi_repo else {} + except json.JSONDecodeError as exc: + print(f"error: --multi-repo must be valid JSON: {exc}", file=sys.stderr) + return {}, 2

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/eval_decision_relevance.py` around lines 244 - 248, Wrap the json.loads(args.multi_repo) call in a try/except to catch json.JSONDecodeError so a malformed --multi-repo value produces a clean CLI error instead of a traceback; specifically, when parsing into repo_map (the repo_map: dict[str, str] = json.loads(args.multi_repo) line), catch json.JSONDecodeError, print a descriptive message to stderr (e.g. "error: invalid --multi-repo JSON") and return the same error tuple ({}, 2) as the empty case.

coderabbitai · 2026-04-13T22:56:21Z

+      "text": "Deliver base implementation for review by end of sprint; check-in Thursday on the plugin changes (critical path)",
+      "owner": ""


⚠️ Potential issue | 🟡 Minor

Use null instead of empty string for unassigned owner.

At Line 50-51, owner: "" is ambiguous and forces extra normalization logic. Use null for missing owner values.

Suggested fix

- "owner": "" + "owner": null

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

"text": "Deliver base implementation for review by end of sprint; check-in Thursday on the plugin changes (critical path)",

"owner": ""

"text": "Deliver base implementation for review by end of sprint; check-in Thursday on the plugin changes (critical path)",

"owner": null

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/fixtures/extraction/saleor-checkout-extensibility.json` around lines 50 - 51, Replace the ambiguous empty string owner value with a proper null to represent an unassigned owner: find the JSON object containing the "text" value "Deliver base implementation for review by end of sprint; check-in Thursday on the plugin changes (critical path)" and set its "owner" property from "" to null so consumers don't need extra normalization logic.

coderabbitai · 2026-04-13T22:56:21Z

+      "text": "All PRs reviewed by Wednesday",
+      "owner": ""


⚠️ Potential issue | 🟡 Minor

Normalize missing owner as null.

At Line 49-50, prefer owner: null over owner: "" for schema consistency across fixtures.

Suggested fix

- "owner": "" + "owner": null

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

"text": "All PRs reviewed by Wednesday",

"owner": ""

"text": "All PRs reviewed by Wednesday",

"owner": null

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/fixtures/extraction/vendure-custom-fields.json` around lines 49 - 50, Change the JSON fixture entry where the object has "text": "All PRs reviewed by Wednesday" and "owner": "" so that the owner field is normalized to null; specifically replace the empty-string value for the "owner" property with a JSON null to ensure schema consistency across fixtures.

Replaces the rapidfuzz token_set_ratio matcher with an Anthropic Haiku 4.5 tool-use call that judges semantic equivalence between extracted and ground-truth decisions. Rapidfuzz stays as an offline fallback for unit tests. Why: The first end-to-end CI run (24370616582) reported P=0.51, R=0.45, F1=0.48, but inspection of the per-transcript unmatched lists showed most "false positives" and "false negatives" were the same decision phrased slightly differently. Four concrete cases from the run: medusa-payment-timeout: FP: "Implement a 12-second timeout ceiling on payment provider authorize calls; if exceeded, return requires_more status" FN: "Wrap the authorize call in PaymentProviderService with a 12-second timeout ceiling and return requires_more status" (same decision, scored <70 on token_set_ratio) medusa-webhook-notifications: FP: "Per-endpoint rate limiting will use token bucket algorithm with a capacity of 10 requests per second, with overflow queued" FN: "Apply a per-endpoint token bucket rate limiter at 10 requests per second; overflow requests are queued rather than dropped" saleor-graphql-permissions: FP: "Gate authorization checks for mutations with side effects..." FN: "Gate the checkoutComplete mutation's permission check before any side effects..." vendure-custom-fields: FP: "Schema synchronization will use TypeORM schema:synchronize..." FN: "Run schema:synchronize in staging first as a CI validation..." These numbers were measuring matcher brittleness, not skill quality. Design: - New module tests/_extraction_matcher.py with llm_match() that does ONE Haiku call per transcript (not N*M per pair), with the full actual + expected lists in the user prompt and tool_use forcing a structured matches array in return. - Strict matching rules in the system prompt: "same constraint, same scope, same behavior", 1:1 pairing, "when in doubt leave unmatched". - Prompt-cached on the system prompt so re-runs within a branch only pay for user-message tokens. - Response cached to tests/.match-cache/ keyed on (model, sha(actual_list), sha(expected_list)) so repeat CI runs on the same branch pay $0 to the API. Any SKILL.md edit that changes what Haiku extracts changes the actual_sha and invalidates the cache automatically. - Tool response is already a parsed dict — no JSON string parsing. - Fails fast with a diagnostic error body on HTTP errors, same pattern as _extract_headless. Dispatcher: - compute_extraction_metrics(..., matcher="auto") resolves to "llm" when ANTHROPIC_API_KEY is set (CI path) and "rapidfuzz" otherwise (offline unit tests). Explicit matcher="llm" or matcher="rapidfuzz" forces the path. New "matcher" field in the metric dict records which matcher was used for auditability. - Runner prints "extraction (llm):" vs "extraction (rapidfuzz):" in the aggregate summary so CI logs show which matcher ran. Tests: - Existing rapidfuzz-mode tests retained, explicitly forced via matcher="rapidfuzz" (9 passing) - New unit tests for _pick_matcher (env-based auto-resolve), LLM cache hit path, missing-key error, _parse_matches 1:1 enforcement with invalid indices / duplicate expecteds, and the compute_* dispatch path via a stubbed llm_match (7 passing) - All offline — no network required - Total: 16 metric tests + 9 extractor tests = 25 pass Cost per CI run: ~9 Haiku calls per run × ~2k input tokens × $0.80/Mtok = ~$0.015 plus prompt-cache hits on the system prompt amortized across calls. Effectively free, and cached zero on repeats. Next CI run on this branch will produce the first real extraction metric with paraphrase-robust matching. Expected: the P=0.51 R=0.45 baseline will move upward substantially (informal eyeball suggests ~0.75+) without any skill change, because the previous numbers were almost entirely matcher brittleness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

tests/_extraction_metrics.py (1)
70-98: ⚠️ Potential issue | 🟠 Major

Rapidfuzz fallback is still order-dependent.

This loop still greedily pops the best remaining expected for each actual, so runs that fall back to rapidfuzz can undercount TP / overcount FP based only on list order. That keeps the offline metric unstable; this path still needs a global 1:1 bipartite matching instead of greedy removal.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/_extraction_metrics.py` around lines 70 - 98, The _rapidfuzz_match
function currently does greedy per-actual selection causing order-dependent
matches; replace the greedy removal with a global 1:1 bipartite matching: build
a score matrix using fuzz.token_set_ratio between every actual (rows) and
expected (columns), run a maximum-weight bipartite matching (e.g., Hungarian
algorithm / scipy.optimize.linear_sum_assignment or a max-weight matching
routine) to get globally optimal actual->expected pairs, then convert matched
pairs to the (actual_idx, expected_idx) list but only accept matches whose score
>= MATCH_THRESHOLD (unmatched or below-threshold become (actual_idx, None));
keep the function signature and return format unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/_extraction_matcher.py`:
- Around line 217-223: The RuntimeError raised after the HTTP POST is echoing
API key-derived data (api_key, key_prefix, key_len); remove any key-derived
fields from the exception text. Update the error construction that references
resp.status_code, resp.text, model, key_prefix, and key_len so it only includes
non-secret diagnostics (e.g., resp.status_code, resp.text[:500], and model) and
does not reference api_key, key_prefix, key_len, or ANTROP... constants.
- Around line 127-136: The cache key generation is ambiguous and missing
prompt/tool versioning: change _list_sha to produce an unambiguous hash of the
list (e.g., include item lengths or a separator encoding so ["a","b\nc"] !=
["a\nb","c"]) and update _cache_path to incorporate SYSTEM_PROMPT,
USER_PROMPT_TEMPLATE and MATCHER_TOOL (or their current version/sha) into the
key string used before calling _sha so llm_match() won't return stale hits when
prompts or tools change; update references to _list_sha and _cache_path
accordingly.

In `@tests/_extraction_metrics.py`:
- Line 51: The comment on the MATCH_THRESHOLD constant contains a Unicode
en-dash; replace the en-dash (–) with a plain ASCII hyphen (-) in the inline
comment after MATCH_THRESHOLD so the comment reads "rapidfuzz token_set_ratio,
0-100 scale" and resolves the RUF003 lint warning.

---

Duplicate comments:
In `@tests/_extraction_metrics.py`:
- Around line 70-98: The _rapidfuzz_match function currently does greedy
per-actual selection causing order-dependent matches; replace the greedy removal
with a global 1:1 bipartite matching: build a score matrix using
fuzz.token_set_ratio between every actual (rows) and expected (columns), run a
maximum-weight bipartite matching (e.g., Hungarian algorithm /
scipy.optimize.linear_sum_assignment or a max-weight matching routine) to get
globally optimal actual->expected pairs, then convert matched pairs to the
(actual_idx, expected_idx) list but only accept matches whose score >=
MATCH_THRESHOLD (unmatched or below-threshold become (actual_idx, None)); keep
the function signature and return format unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5c255b50-0c6e-45de-b5a9-b1e762779ec6

📥 Commits

Reviewing files that changed from the base of the PR and between a23dba3 and 42c3a07.

📒 Files selected for processing (4)

tests/_extraction_matcher.py
tests/_extraction_metrics.py
tests/eval_decision_relevance.py
tests/test_extraction_metrics.py

🚧 Files skipped from review as they are similar to previous changes (2)

tests/test_extraction_metrics.py
tests/eval_decision_relevance.py

coderabbitai · 2026-04-14T00:59:30Z

+def _list_sha(items: list[str]) -> str:
+    # Order matters for cache key — same-content-different-order lists
+    # should map to different cache entries so we don't silently reuse a
+    # stale matching when the runner reshuffles.
+    return _sha("\n".join(items))
+
+
+def _cache_path(model: str, actual_sha: str, expected_sha: str) -> Path:
+    key = f"{model}|{actual_sha}|{expected_sha}"
+    return MATCH_CACHE_DIR / f"{_sha(key)}.json"


⚠️ Potential issue | 🟠 Major

Make the cache key unambiguous and prompt-versioned.

Line 131 hashes "\n".join(items), so inputs like ["a", "b\nc"] and ["a\nb", "c"] collide. Lines 135-136 also ignore SYSTEM_PROMPT, USER_PROMPT_TEMPLATE, and MATCHER_TOOL, so prompt/tool edits will silently reuse stale judge outputs. Because llm_match() returns cache hits before calling the API, this can produce wrong metrics.

🔧 Proposed fix

def _sha(text: str) -> str: return hashlib.sha256(text.encode("utf-8")).hexdigest() +MATCHER_CACHE_VERSION = _sha( + json.dumps( + { + "system": SYSTEM_PROMPT, + "user_template": USER_PROMPT_TEMPLATE, + "tool": MATCHER_TOOL, + }, + sort_keys=True, + ensure_ascii=False, + separators=(",", ":"), + ) +) + + def _list_sha(items: list[str]) -> str: # Order matters for cache key — same-content-different-order lists # should map to different cache entries so we don't silently reuse a # stale matching when the runner reshuffles. - return _sha("\n".join(items)) + return _sha(json.dumps(items, ensure_ascii=False, separators=(",", ":"))) def _cache_path(model: str, actual_sha: str, expected_sha: str) -> Path: - key = f"{model}|{actual_sha}|{expected_sha}" + key = f"{MATCHER_CACHE_VERSION}|{model}|{actual_sha}|{expected_sha}" return MATCH_CACHE_DIR / f"{_sha(key)}.json"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/_extraction_matcher.py` around lines 127 - 136, The cache key generation is ambiguous and missing prompt/tool versioning: change _list_sha to produce an unambiguous hash of the list (e.g., include item lengths or a separator encoding so ["a","b\nc"] != ["a\nb","c"]) and update _cache_path to incorporate SYSTEM_PROMPT, USER_PROMPT_TEMPLATE and MATCHER_TOOL (or their current version/sha) into the key string used before calling _sha so llm_match() won't return stale hits when prompts or tools change; update references to _list_sha and _cache_path accordingly.

coderabbitai · 2026-04-14T00:59:30Z

+    key_prefix = api_key[:12] if api_key else "(empty)"
+    with httpx.Client(timeout=REQUEST_TIMEOUT_S) as client:
+        resp = client.post(ANTHROPIC_API_URL, headers=headers, json=payload)
+        if resp.status_code >= 400:
+            raise RuntimeError(
+                f"Matcher API error {resp.status_code}: {resp.text[:500]} "
+                f"(model={model}, key_prefix={key_prefix}..., key_len={len(api_key)})"


⚠️ Potential issue | 🟠 Major

Do not echo API key material into the exception text.

Line 223 includes part of ANTHROPIC_API_KEY in every 4xx failure. Even partial credentials should not land in CI logs or artifacts; keep the model/status diagnostics and drop the key-derived fields.

🔒 Proposed fix

- key_prefix = api_key[:12] if api_key else "(empty)" with httpx.Client(timeout=REQUEST_TIMEOUT_S) as client: resp = client.post(ANTHROPIC_API_URL, headers=headers, json=payload) if resp.status_code >= 400: raise RuntimeError( f"Matcher API error {resp.status_code}: {resp.text[:500]} " - f"(model={model}, key_prefix={key_prefix}..., key_len={len(api_key)})" + f"(model={model})" )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

key_prefix = api_key[:12] if api_key else "(empty)"

with httpx.Client(timeout=REQUEST_TIMEOUT_S) as client:

resp = client.post(ANTHROPIC_API_URL, headers=headers, json=payload)

if resp.status_code >= 400:

raise RuntimeError(

f"Matcher API error {resp.status_code}: {resp.text[:500]} "

f"(model={model}, key_prefix={key_prefix}..., key_len={len(api_key)})"

with httpx.Client(timeout=REQUEST_TIMEOUT_S) as client:

resp = client.post(ANTHROPIC_API_URL, headers=headers, json=payload)

if resp.status_code >= 400:

raise RuntimeError(

f"Matcher API error {resp.status_code}: {resp.text[:500]} "

f"(model={model})"

)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/_extraction_matcher.py` around lines 217 - 223, The RuntimeError raised after the HTTP POST is echoing API key-derived data (api_key, key_prefix, key_len); remove any key-derived fields from the exception text. Update the error construction that references resp.status_code, resp.text, model, key_prefix, and key_len so it only includes non-secret diagnostics (e.g., resp.status_code, resp.text[:500], and model) and does not reference api_key, key_prefix, key_len, or ANTROP... constants.

coderabbitai · 2026-04-14T00:59:30Z

+from rapidfuzz import fuzz
+
+FIXTURES_DIR = Path(__file__).resolve().parent / "fixtures" / "extraction"
+MATCH_THRESHOLD = 70  # rapidfuzz token_set_ratio, 0–100 scale


⚠️ Potential issue | 🟡 Minor

Replace the Unicode dash in this comment.

Line 51 uses –, which Ruff flags as RUF003. Swap it for a plain - so the new file stays lint-clean.

🧹 Proposed fix

-MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0–100 scale +MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0-100 scale

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0–100 scale

MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0-100 scale

🧰 Tools

🪛 Ruff (0.15.9)

[warning] 51-51: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/_extraction_metrics.py` at line 51, The comment on the MATCH_THRESHOLD constant contains a Unicode en-dash; replace the en-dash (–) with a plain ASCII hyphen (-) in the inline comment after MATCH_THRESHOLD so the comment reads "rapidfuzz token_set_ratio, 0-100 scale" and resolves the RUF003 lint warning.

Bootstraps the M1 adversarial corpus ground truth in-session (Claude Opus 4.6, with the cloned medusa/saleor/vendure trees available for symbol cross-checks). One JSON per adversarial transcript, matching the shape of the existing friendly-corpus fixtures. Distribution by category (15 decisions + 6 action_items total): adv-strat-fake STRAT-FAKE 0 dec 0 act (strict-exclude probe) adv-vocab-collide VOCAB-COLLIDE 1 dec 2 act (one buried needle) adv-density-extreme DENSITY-EXTREME 9 dec 2 act (compound + diffuse) adv-offtopic-mix OFFTOPIC-MIX 1 dec 1 act (buried in Q2 OR) adv-reversal REVERSAL 4 dec 1 act (final state only) Each fixture has a "notes" field explaining the category-specific extraction policy so future hand-edits stay aligned with intent. The strat-fake empty case is intentional: any positive extraction is a false positive. These transcripts are not yet wired into TRANSCRIPT_SOURCES — that edit waits on the fixture-location migration (separate cleanup ticket). Once registered, CI will produce the first adversarial baseline alongside the existing friendly-corpus numbers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Registers adv-strat-fake, adv-vocab-collide, adv-density-extreme, adv-offtopic-mix, and adv-reversal under repo_key="adversarial" so the runner reports them as their own group in CI output. Ground truth for each was committed earlier at tests/fixtures/extraction/adv-*.json. The "adversarial" repo_key is matched against --multi-repo by the parent workflow, where it aliases to the medusa clone (any indexed codebase works — adversarial transcripts test extraction quality, not grounding quality against a specific codebase). Bland-noun grounding noise on these transcripts is exactly what the VOCAB-COLLIDE category is designed to expose, and feeds the M2 (grounding precision) ticket. Runner total: 9 friendly + 5 adversarial = 14 transcripts. Local smoke test: --skill-variant none with adversarial in the repo map handles the empty case cleanly (adversarial transcripts have no entries in ALL_DECISIONS, so none-mode produces 0/0 for the adversarial group without crashing). Live extraction in CI's from-skill-md mode will be the first real measurement.

Removes the 9 friendly e-commerce fixtures (medusa-*, saleor-*, vendure-*) from tests/fixtures/extraction/ and the matching TRANSCRIPT_SOURCES entries from decisions.py. The friendly corpus was too easy to be informative about real-world reliability — F1=0.87 on sprint-planning meetings says nothing about whether bicameral handles hedged language, generic vocabulary, compound sentences, off-topic meetings, or reversed decisions. After this commit: TRANSCRIPT_SOURCES → 5 entries, all repo_key="adversarial" fixtures/extraction/ → README.md + 5 adv-*.json files The 9 friendly transcript .md files in pilot/ml/data/transcripts/ remain in place — they're not bicameral-mcp test fixtures, they're data assets used elsewhere in pilot/ml/. Only the test-fixture copies in pilot/mcp/tests/fixtures/extraction/ are removed. ALL_DECISIONS in decisions.py is unchanged — it's still consumed by tests/eval_code_locator.py for the M2 RAG eval, which targets the friendly OSS repos and is conceptually independent of M1. Offline test suite: 40/40 pass after cleanup (15 phase-2 + 16 metrics + 9 extract_headless).

Resolves four conflicts: - .gitignore: keeps both qor:seed block and .claude/worktrees ignore. - docs/META_LEDGER.md: takes dev's chain (#1-#26) as canonical. This branch's parallel #7-#14 entries (sealed against the obsolete Entry #6 base before dev added #7-#26) are not folded in here — they need to be re-authored against dev's #26 seal via /qor-meta-log-decision in a follow-up. Their session content is preserved in commit messages (f4de501, 3f856af) and in docs/SYSTEM_STATE.md. - docs/SYSTEM_STATE.md: keeps both the v0 process cleanup / preflight hook session blocks from this branch and dev's #124 implementation block. - skills/bicameral-preflight/SKILL.md: keeps both the new "Hook reinforcement" subsection (this branch) and the Telemetry section (dev). The earlier removal of Telemetry in this branch's f4de501 was authored against a stale base; restoring it here.

Priority C v0 — Phase 1 of the team-server vertical-slice for multi-dev decision continuity at organizational scale (per the Sales Enablement & Positioning Playbook + research-brief-priority-c-selective-ingest-2026-05-02.md). The team-server is a self-managing, customer-self-hosted Python service. Per CONCEPT.md anti-goals under literal-keyword parsing (SHADOW_GENOME Failure Entry #6 addendum): "no managed backend" forbids vendor SaaS and human-ops-tax architectures, NOT self-managing customer-deployable backends. Sentry self-hosted, Supabase OSS, and the embedded-SurrealDB philosophy already in repo are precedents. Scaffold delivered: - team_server/app.py (47 LOC): FastAPI app factory; lifespan migrates schema on startup, closes DB on teardown; /health endpoint - team_server/schema.py (80 LOC): v0 schema for workspace, channel_allowlist, extraction_cache, team_event tables. Idempotent ensure_schema(); migration dispatch table for future versions. FLEXIBLE TYPE object on canonical_extraction + payload (per #72 lesson + audit Advisory #3). - team_server/db.py (41 LOC): TeamServerDB factory wrapping ledger.client.LedgerClient with team-server's own ns/db pair. - deploy/team-server.docker-compose.yml + Dockerfile.team-server: single-service compose; volume for persistent data; healthcheck on /health; runs as non-root. Tests (6 functionality tests, all green): - tests/test_team_server_app.py: app starts + serves health, schema migrates from empty + idempotent, lifespan teardown closes DB, /health returns well-formed JSON. - tests/test_team_server_deploy.py: docker-compose config validates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 3 of plan-priority-c-team-server-slack-v0.md — adds the polling worker, the canonical-extraction cache that closes the multi-dev extraction-divergence gap, and the peer-author event writer. Multi-dev convergence mechanism: any dev's session that touches a Slack message produces the SAME canonical extraction across the team because get_or_compute is keyed on (source_type, source_ref, content_hash) and the cache row is shared via the team-server's append-only event log. This is what closes Playbook Pillar #1 (Decision Continuity) at multi-dev/multi-agent scale that the v1 brief incorrectly framed as "build curation only" — see SHADOW_GENOME Failure Entry #6. The interim canonical extraction uses Anthropic Claude with the model_version='interim-claude-v1' tombstone so a future Phase 5 (CocoIndex #136 integration) can identify and rebuild interim entries when memoized deterministic transforms become available. Files added: - team_server/extraction/canonical_cache.py (45 LOC): get_or_compute() returns cached extraction OR invokes compute_fn + persists. - team_server/extraction/llm_extractor.py: interim Anthropic-backed extractor; production-default until CocoIndex lands. - team_server/sync/peer_writer.py (42 LOC): write_team_event() — appends to team_event with author_email='team-server@<team_id>.bicameral' identity (single-bot per workspace; multi-instance HA is v1). - team_server/workers/slack_worker.py (100 LOC): poll_once() — conversations_history per allowlisted channel, content_hash dedup, cache-keyed extraction, peer event write per new message. Tests (6 functionality tests, all green): - tests/test_team_server_canonical_cache.py: cache hit returns existing without compute_fn, cache miss persists + subsequent call returns cached, content_hash variation produces new rows. - tests/test_team_server_slack_worker.py: worker polls only allowlisted channels, writes one team_event per message with peer-author identity, dedups via message ts (idempotent re-poll). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 4 of plan-priority-c-team-server-slack-v0.md — closes the multi-dev convergence loop by exposing team_event over HTTP and extending EventMaterializer with a failure-isolated team-server pull. The materializer pull is OUTSIDE the deterministic core (per CONCEPT.md literal-keyword parsing of "no network calls in the deterministic core" — SHADOW_GENOME Failure Entry #6 addendum). Failure-isolation contract: team-server outage NEVER cascades into per-dev preflight failures — events/team_server_pull.py swallows transport errors, returns empty events, leaves the watermark unchanged. Files added: - team_server/api/events.py: GET /events?since=N&limit=M endpoint; reads team_event ordered by sequence ascending; pagination via since cursor. - events/team_server_pull.py (57 LOC): pull_team_server_events() queries the team-server's /events endpoint, persists watermark per call, swallows transport errors gracefully. - team_server/app.py (modified): include events router. Tests (6 functionality tests, all green): - tests/test_team_server_events_api.py: /events returns rows in sequence order, paginates via since cursor, returns empty (not error) when no new events. - tests/test_materializer_team_server_pull.py: materializer pulls from team-server URL, watermark advances + persists separately, 503 /transport-error degrades gracefully (no exception, watermark unchanged) — failure-isolation contract verified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

QorLogic SDLC governance trail for the Priority C v0 implementation that landed in commits 1-4 of this PR. Includes: - docs/research-brief-priority-c-selective-ingest-2026-05-02.md (v3) — research substrate. v1 was rejected for INVARIANT_FROM_IMPLEMENTATION (treating v0 agent-fetches-only code state as product principle); v2 added playbook substrate; v3 narrowed to Slack-first + team-server + CocoIndex-conditional after operator dialogue clarified "no managed backend" = "no human-ops-tax architecture," not "no backend." - plan-priority-c-team-server-slack-v0.md (437 LOC) — the L3 plan with five phases (Phase 5 deferred per "if we can manage it" feasibility caveat). - docs/SHADOW_GENOME.md Failure Entry #6 + addendum — captures the framing-error pattern AND the "anti-goals must be parsed by their load-bearing keyword" lesson; symmetric to v0-code-as-principle. - docs/META_LEDGER.md Entries #27 (IMPLEMENT) + #28 (SEAL). Predecessor: efd0304b (#135-triage seal on dev). Implement chain: 211ffb9e. Substantiation seal: 6f4f8f8f1d63ad82b952a3c6aff270d30584e08b0572077ff685e84ce453f6c2 - docs/SYSTEM_STATE.md — Priority C v0 section appended; documents schema additions, architectural properties achieved, audit advisory disposition, Phase 5 deferred state, and the qor-logic-internal steps skipped for downstream-project rationale. - .agent/staging/AUDIT_REPORT.md — PASS verdict, three non-blocking advisories all addressed at implement-time. Verdict: REALITY = PROMISE for Phases 1-4. Phase 5 (CocoIndex #136) explicitly deferred per plan slip-independence design. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…+Gemini+Codex-2 (#205) Authors a single Reviewer Disposition Pass table at the top of the brief reconciling all 32 review points across four review layers (Codex first-pass, Kilo, Gemini CLI, Codex second-pass) into one post-review consensus before downstream P1 issue-filing — per the explicit Codex-2 #1 directive. Decisions: 21 applied this commit, 6 already applied in 1d82658, 3 deferred to follow-up, 2 note-only. Net new gap IDs added per disposition: GDPR-08 (ephemeral data), GDPR-09 (consent versioning + revocation), LLM-11 (cross-tool config-file modification surface), MCP-01 (host UX as external dependency), CFG-01 (config precedence + fail-closed model). Reclassification: LLM-06 P0/M → P1/M with scope narrowed to future remote-skill-loading channel (per Kilo #2). Major content additions to the brief: - § 1.1: MCP host UX is external dependency, not security gate (new gap MCP-01) — host that auto-approves tool calls bypasses any "operator will see this" assumption. - § 1.2: SurrealDB version pinning supply-chain callout (Kilo #11). - § 1.7: cross-tool config-file modification surface (new gap LLM-11) distinct from skill-content surface — `setup_wizard` writes shell commands into `.claude/settings.json` that run host-side at hook fire. - § 1.11 (new): Configuration precedence + fail-closed model — single uniform precedence rule across all knobs (env > config.yaml > hardcoded defaults), fail-closed semantics on missing/malformed/ contradictory config (Codex-2 #5). - § 2.4 (a): LLM02 mapping note clarifying it folds into LLM-07 + OWASP-04 (Kilo #13). - § 2.4 (b): explicit `confirm=True` is agent-supplied not HITL (Kilo #3) — security context cannot rely on agent-filled params. - § 2.4 (c) LLM-01 + LLM-04: extensible classifier (Gemini #2) + guardrail-not-classifier framing (Codex-1 #6) + control-acceptance template (Codex-2 #4) — quarantine, override, test fixtures, measurement counters. - § 2.4 (c) LLM-03: timeouts as `.bicameral/config.yaml` knobs (Gemini #3). - § 2.4 (c) LLM-05 + LLM-09: out-of-band operator confirmation, not agent-supplied confirm parameters (Kilo #3). - § 2.4 (c) LLM-06: scope-narrowed to future remote-skill-loading; in current install model the wheel-trust covers it (Kilo #2). - § 2.4 (c) LLM-11 (new): cross-tool config-file gate (signed hooks-manifest.json) distinct from skill manifest. - § 2.1 (c) GDPR-01: three remediation candidates — tombstone-and- rebuild with signed manifest (Kilo #12), crypto-shredding (Gemini #1), or scope-out via PII detect-and-refuse. - § 2.1 (c) GDPR-02: data-subject-access search must cover full identifier surface (description, source_ref, topic, file paths) not just signer email (Codex-1 #5). - § 2.1 (c) GDPR-08 (new): ephemeral data surfaces (tempfiles, swap, WAL, crash dumps) (Kilo #7). - § 2.1 (c) GDPR-09 (new): consent versioning + revocation semantics (Kilo #8 + Codex-2 #3). - § 5: gap table updated with new rows + LLM-06 reclassification; gap counts post-disposition (5 P0 / 19 P1 / 16 P2 / 5 P3 = 45 total, up from 41). - § 6.1 (new): epic grouping for deferred P1 batch (Codex-1 #10) — ingest boundary guardrails, per-tool authority gradation, supply- chain signing, telemetry & consent. - § 6.2 (new): six-section control-acceptance template for every DG gap (Codex-2 #4) — positive / negative / bypass / fail-closed / telemetry / docs. Filed-issue updates: - Issue #214 (LLM-06): relabeled P0 → P1, retitled to reflect scope narrowing, full disposition comment added. - Issue #212 (LLM-01) + #213 (LLM-04): disposition comments added capturing the guardrail framing, classifier extensibility, and control-acceptance template applicable to both. Deferred for follow-up: Codex-1 #4 (controller/processor restructure of standards table), Codex-1 #9 (full evidence appendix beyond the methodology softening), Codex-2 #2 (full 3-column deployment-profile matrix beyond the single-column trigger). Brief now 706 lines (up from 606); +124 line diff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jinhongkuan and others added 8 commits April 13, 2026 16:34

coderabbitai Bot reviewed Apr 13, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 14, 2026

View reviewed changes

jinhongkuan and others added 3 commits April 13, 2026 21:54

jinhongkuan merged commit 79765a6 into main Apr 14, 2026
1 check passed

This was referenced Apr 20, 2026

Self-contained tests + MCP-native CI workflow #29

Merged

feat: v0.5.3 — gap attribution, auto link_commit sync, skill aggression + README overhaul #37

Merged

Knapp-Kevin mentioned this pull request May 2, 2026

Priority C v0 — self-managing team-server, Slack-first ingest #153

Closed

8 tasks

This was referenced May 6, 2026

[compliance:OWASP-LLM] Prompt-injection canary scan on bicameral.ingest content (gap LLM-01) #212

Closed

[compliance:OWASP-LLM,HIPAA,PCI] PII/secret/PHI/PAN detect-and-refuse on bicameral.ingest (gap LLM-04 + HIPAA-01 + PCI-01) #213

Closed

Knapp-Kevin mentioned this pull request May 14, 2026

[#279 Phase 1] Pull-based meeting ingestion — sync-and-brief CLI + Granola adapter + SessionStart hook #318

Merged

5 tasks

jinhongkuan mentioned this pull request May 23, 2026

feat(diagnose): #252 Layer 5 — opt-in remediation recipe (rebased from #274) #506

Merged

Knapp-Kevin mentioned this pull request Jun 3, 2026

P1: Human-oversight enforcement vs claim (NIST MANAGE-1.1 / EU AI Act Art.14) #548

Open

		"text": "Deliver base implementation for review by end of sprint; check-in Thursday on the plugin changes (critical path)",
		"owner": ""

	MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0–100 scale
	MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0-100 scale

Conversation

jinhongkuan commented Apr 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jinhongkuan commented Apr 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 13, 2026 •

edited

Loading