Skip to content

M1/decision relevance ruler#6

Merged
jinhongkuan merged 12 commits into
mainfrom
m1/decision-relevance-ruler
Apr 14, 2026
Merged

M1/decision relevance ruler#6
jinhongkuan merged 12 commits into
mainfrom
m1/decision-relevance-ruler

Conversation

@jinhongkuan

@jinhongkuan jinhongkuan commented Apr 13, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

  • New Features

    • Ingest now reports grounding metrics (grounded count and grounded percentage) and supports per-source grounding breakdowns
    • End-to-end decision-relevance evaluation and a headless extraction workflow for measuring extraction + grounding
  • Documentation

    • Added README and schema guidance for ground-truth extraction fixtures
  • Tests

    • Large suite of tests and fixtures for extraction, matching, metrics, and grounding
  • Chores

    • Fixture regeneration utility for extraction fixtures

jinhongkuan and others added 8 commits April 13, 2026 16:34
Adds the instrumentation and eval runner for the "Decision Relevance"
metric from bicameral-quality-validation-roadmap.html (M1). Lets any
branch that edits .claude/skills/bicameral-ingest/SKILL.md produce a
directly-comparable grounded_pct in CI artifacts.

Changes:
- handlers/ingest.py: log "[ingest] complete: N/M grounded (pct%)" and
  populate stats.grounded / stats.grounded_pct on IngestResponse.
- contracts.py: extend IngestStats (additive, no breaking change).
- ledger/queries.py: get_grounding_breakdown() groups by source_ref.
- tests/fixtures/expected/decisions.py: TRANSCRIPT_SOURCES map so the
  runner can discover which transcript targets which repo. Adding a
  new transcript/repo is a one-line fixture edit.
- tests/eval_decision_relevance.py: new fixture-driven runner with
  two modes — "none" (ingest fixture decisions directly, no LLM) and
  "from-skill-md" (headless extraction from current SKILL.md, then
  grounding). Mirrors eval_code_locator.py structure.
- tests/_extract_headless.py: runs Step 1 of SKILL.md via the
  Anthropic Messages API using Authorization: Bearer
  CLAUDE_CODE_OAUTH_TOKEN. Caches responses keyed on
  (skill_md_sha, transcript_sha, model) — editing SKILL.md on a
  branch invalidates the cache automatically.
- tests/test_extract_headless.py: offline unit tests for Step 1
  excerpt parsing, fenced-JSON tolerance, and cache hit path.
- tests/test_phase2_ledger.py: 2 new tests for grounded-field
  population + get_grounding_breakdown. Also repairs
  test_ungrounded_intent_has_correct_status to use gibberish text
  (it was silently relying on the broken grounding path below).

Incidental fix:
- code_locator/tools/validate_symbols.py: store self._db so
  adapters/code_locator.py:190 can reach db.lookup_by_file()
  during auto-grounding. Previously the whole ground_mappings()
  path was a silent no-op, which masked test_phase3_integration
  failures and produced 100%-ungrounded results for any real run.

Regression: 30 tests pass across phase1 + phase2 + extract_headless.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The driver was resolving SKILL.md relative to the PARENT repo root
(.claude/skills/bicameral-ingest/SKILL.md), but that path is
gitignored at the parent level — only the submodule copy is tracked
(pilot/mcp/.claude/skills/bicameral-ingest/SKILL.md). In CI the file
was absent and the extraction+grounding step failed at _load_skill_md
before touching auth.

Resolve to MCP_ROOT (parents[1]) instead. This makes the submodule
the single source of truth for the skill and keeps Phase 5 A/B
branches inside the submodule — consistent with every other MCP asset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CI run bit us with `httpx.LocalProtocolError: Illegal header
value b'Bearer '` — the env var resolved to an empty string and we
concatenated it into the Authorization header, producing malformed
output instead of a clear error.

Switch to os.environ.get(..., "") + .strip() check so a missing
or empty-string token surfaces with a self-explaining RuntimeError
before we build any HTTP request.
The CI run came back 401 from the public Messages API with only the
status code visible — not enough to tell whether the Claude Code
OAuth token is (a) lacking the right scope, (b) the wrong token type
for this endpoint, or (c) rejected for some other reason.

Capture the response body (first 500 chars) and emit it with the
model name, token prefix (first 12 chars, enough to distinguish
sk-ant-api03 from sk-ant-oat01 without leaking material), and token
length. This is diagnostic-only; the cached success path is
untouched.
Anthropic's public Messages API rejects Claude Code OAuth tokens
with 401 "OAuth authentication is currently not supported". Switch
to standard API-key auth:

- ANTHROPIC_API_KEY env var instead of CLAUDE_CODE_OAUTH_TOKEN
- x-api-key header instead of Authorization: Bearer
- Function parameter: api_key (was oauth_token)
- Error messages point at the ci-test environment secret path

Cached success path is untouched; tests still pass offline with no
network credential needed.
Haiku 4.5 was returning JSON with unescaped inner double quotes
inside description strings (e.g., "or \`[\"*\"]\` for global access"),
breaking json.loads mid-parse. Also the default 4096 max_tokens
was near the ceiling for transcripts with 10+ decisions.

Switch to Anthropic tool use:
- Define an `submit_extraction` tool with a JSON schema for
  decisions + action_items
- Force the model to call it via tool_choice
- Read tool_use.input directly — it's a pre-parsed Python dict
  delivered by the API, so no JSON string parsing, no unescape
  failures, no markdown fence drift

Also:
- Bump MAX_OUTPUT_TOKENS 4096 → 8192 (cheap headroom)
- Raise clean errors on stop_reason="max_tokens" and missing
  tool_use blocks
- Keep _parse_response_json as a defensive utility (not on the
  hot path) so the existing offline tests for fence tolerance
  still run

Offline tests: 9/9 pass. Next CI run is the real end-to-end
validation against medusa/saleor/vendure via Haiku.
Reshapes the M1 eval so CI doesn't depend on Anthropic as a
"ground-truth oracle" on every run. New architecture:

  simulation (every CI run, Haiku 4.5 via live API):
    - Runs current SKILL.md Step-1 extraction on each transcript
    - Ingests via the grounding pipeline
    - Reports grounded_pct as before

  ground truth (pregenerated offline, Opus 4.6, committed to git):
    - tests/fixtures/extraction/<source_ref>.json — one per transcript
    - Generated by tests/regen_extraction_fixtures.py --all
    - Hand-editable: the committed JSON is the oracle, Opus is just the
      bootstrap tool. audit provenance fields track model + timestamp +
      skill_md_sha so staleness is visible in git diff.

  metric (computed by the runner at M1 eval time):
    - For each transcript, fuzzy-match extracted decisions against the
      ground-truth fixture via rapidfuzz.fuzz.token_set_ratio (threshold 70)
    - 1:1 matching (no double-counting)
    - Per-transcript precision / recall / F1 + counts
    - Aggregate across all scored transcripts
    - Skipped gracefully when no fixture file exists, so CI stays green
      until fixtures are bootstrapped

Phase 5 A/B workflow (once fixtures exist):
  1. Branch from main, edit .claude/skills/bicameral-ingest/SKILL.md
  2. git push → CI runs Haiku extraction via new skill → reports
     precision/recall delta against the FIXED Opus fixture
  3. Compare across experiment branches. Winner merges.

New files:
  tests/_extraction_metrics.py       — rapidfuzz matching + aggregate math
  tests/test_extraction_metrics.py   — 9 unit tests (1:1 match, P/R math)
  tests/regen_extraction_fixtures.py — Opus-backed bootstrap script
  tests/fixtures/extraction/README.md — format + workflow docs

Also:
  - Incidental fix in _build_payload_from_skill_md: coerce
    action_item.owner=None → "" to satisfy IngestPayload contract
    (the in-flight CI tool-use run exposed this via a pydantic error)

Regression: 30 tests pass (9 extract_headless + 9 extraction_metrics
+ 12 phase-2). Runner smoke-test in --skill-variant none mode passes
end-to-end with the new plumbing and cleanly reports "extraction
skipped (no fixtures)" when fixtures dir is empty.

Next step (outside this commit): bootstrap fixtures locally via
  ANTHROPIC_API_KEY=... .venv/bin/python tests/regen_extraction_fixtures.py --all
commit the resulting JSONs, push. CI will then produce real P/R
numbers alongside grounded_pct.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Generates the M1 ground-truth extraction fixtures in-session instead
of paying Anthropic API tokens, using Claude Opus 4.6 (this session)
with the cloned medusa/saleor/vendure trees available for
spot-verification of claimed symbols and files. Produced by reading
each transcript with the code alongside it, applying SKILL.md
Step-1 rules (when in doubt, exclude), and writing one JSON per
source_ref.

Coverage (67 decisions + 33 action_items across 9 transcripts):

  medusa-payment-timeout          4 decisions, 5 action_items
  medusa-plugin-migration         8 decisions, 5 action_items
  medusa-webhook-notifications   12 decisions, 3 action_items
  saleor-checkout-extensibility  10 decisions, 3 action_items
  saleor-graphql-permissions      8 decisions, 3 action_items
  saleor-order-workflows          7 decisions, 4 action_items
  vendure-channel-pricing         7 decisions, 2 action_items
  vendure-custom-fields           7 decisions, 5 action_items
  vendure-search-reindexing       4 decisions, 3 action_items

Provenance fields on each fixture:
  generated_by: claude-opus-4-6[1m]-in-session
  skill_md_sha: bb8f81be7f20 (current SKILL.md)
  generated_at: 2026-04-13

These are hand-editable going forward. Re-run
tests/regen_extraction_fixtures.py (with ANTHROPIC_API_KEY) to
regenerate via the API path if/when the transcripts change or the
Phase 5 skill variants warrant a fresh reference.

Also incidental:
- eval_decision_relevance.py: stop hardcoding "no fixtures" in the
  extraction-skipped console message; just print the aggregate's
  own reason string. The hardcoded text was misleading when the
  skip reason was actually "variant=none, metric not applicable".

Regression: 33 tests pass across test_extract_headless,
test_extraction_metrics, test_phase2_ledger.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 13, 2026

Copy link
Copy Markdown

Warning

Rate limit exceeded

@jinhongkuan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 49 minutes and 21 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 49 minutes and 21 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 036641bd-e291-42db-81d9-fdeafc104ec4

📥 Commits

Reviewing files that changed from the base of the PR and between 0c1dac9 and cfbf490.

📒 Files selected for processing (1)
  • tests/fixtures/expected/decisions.py
📝 Walkthrough

Walkthrough

ValidateSymbolsTool now retains the DB instance; ingest handler computes and logs grounded counts and percentage and populates new IngestStats fields; ledger gains grounding breakdown query; a headless extraction driver, LLM/rapidfuzz matchers, metrics, fixtures, regeneration tooling, and evaluation runner with tests were added.

Changes

Cohort / File(s) Summary
Validate & Ingest
code_locator/tools/validate_symbols.py, contracts.py, handlers/ingest.py
ValidateSymbolsTool now stores the provided db on self._db. IngestStats gains grounded and grounded_pct. handle_ingest computes grounded/grounded_pct, clamps values, logs [ingest] complete, and populates the new stats fields.
Ledger Query
ledger/queries.py
Added get_grounding_breakdown(...) async helper that aggregates intents by source_ref (grounded/ungrounded/total) and computes grounded_pct; supports optional source_type/source_scope filters.
Headless Extraction Driver
tests/_extract_headless.py, tests/test_extract_headless.py
New headless extractor that extracts Step 1 from SKILL.md, forces Anthropic tool-based extraction, caches results, and includes parsing, cache, and error-handling tests.
Extraction Matching & Metrics
tests/_extraction_matcher.py, tests/_extraction_metrics.py, tests/test_extraction_metrics.py
Added LLM-backed 1:1 matcher with cache, rapidfuzz fallback, deterministic parsing, compute_extraction_metrics and aggregate_extraction_metrics, plus unit tests covering matchers and metrics.
Evaluation Runner & Regression Tests
tests/eval_decision_relevance.py, tests/test_phase2_ledger.py
New CLI evaluation runner orchestrating extract→ingest→metrics across repos/transcripts, aggregates per-repo/overall grounding and extraction metrics, enforces regression gates; tests assert grounding stats and ledger query behavior.
Fixture Registry & Fixtures
tests/fixtures/expected/decisions.py, tests/fixtures/extraction/README.md, tests/fixtures/extraction/*.json
Added TRANSCRIPT_SOURCES mapping, documentation, and numerous M1 extraction JSON fixtures with metadata and decisions/action_items.
Fixture Regeneration Tooling
tests/regen_extraction_fixtures.py
Added script to regenerate extraction fixtures via the headless extractor with CLI options (--all, --source-ref, --model, --force, --dry-run).
Test Harness Additions
tests/test_extract_headless.py, tests/test_extraction_metrics.py, tests/test_phase2_ledger.py
Extensive tests for extractor parsing, cache behavior, matcher selection, metric aggregation, and grounding stats assertions.

Sequence Diagram(s)

sequenceDiagram
    participant Handler as Ingest Handler
    participant Tool as ValidateSymbolsTool
    participant DB as Database
    participant Ledger as Ledger Queries

    Handler->>+Handler: handle_ingest(payload)
    Handler->>+Tool: validate_symbols(...)
    Tool->>+DB: lookup_by_file() via self._db
    DB-->>-Tool: symbol results
    Tool-->>-Handler: validation results

    Handler->>Handler: compute intents_created, ungrounded
    Handler->>Handler: grounded = max(0, intents_created - ungrounded)
    Handler->>Handler: grounded_pct = grounded/intents_created or 0.0
    Handler->>+Ledger: persist/return IngestStats (including grounded, grounded_pct)
    Ledger-->>-Handler: stats returned
    Handler-->>-Handler: return result
Loading
sequenceDiagram
    participant Eval as Eval Runner
    participant Extract as Headless Extractor
    participant API as Anthropic API
    participant Ingest as Ingest Handler
    participant Metrics as Metrics Computer

    Eval->>+Eval: load transcripts & plan
    loop per transcript
        Eval->>+Extract: extract_from_current_skill(transcript)
        Extract->>Extract: check cache
        alt cache hit
            Extract-->>Eval: return cached decisions
        else cache miss
            Extract->>+API: send tool-based extraction request
            API-->>-Extract: tool_use response
            Extract->>Extract: parse & cache
            Extract-->>Eval: return extracted decisions
        end

        Eval->>+Ingest: handle_ingest(payload)
        Ingest-->>-Eval: return grounding stats

        Eval->>+Metrics: compute_extraction_metrics(extracted, fixture)
        Metrics-->>-Eval: return precision/recall/F1
    end
    Eval->>Eval: aggregate per-repo & overall metrics
    Eval->>Eval: enforce regression gates -> exit code
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related PRs

Poem

🐰 I nibbled the fixtures, found carrots of truth,

Hopped through MD steps and cached each sweet tooth,
Grounded the intents, watched percentages bloom,
Metrics and tests guard the warren and room,
A rabbit applauds: tidy, robust, and sleuth.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 51.02% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'M1/decision relevance ruler' is vague and does not clearly summarize the main changes in the pull request. Consider a more descriptive title that captures the primary objective, such as 'Add M1 decision-relevance evaluation framework with grounding metrics and extraction validation' or 'Implement extraction-based decision grounding evaluation system'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch m1/decision-relevance-ruler

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
tests/_extract_headless.py (1)

166-169: Chain the exception for better traceability.

When re-raising as a new exception type, use from exc to preserve the original traceback chain. This helps with debugging when JSON parsing fails.

Proposed fix
     try:
         parsed = json.loads(text)
     except json.JSONDecodeError as exc:
-        raise ValueError(f"extractor did not return valid JSON: {exc}\nbody={body[:500]}")
+        raise ValueError(f"extractor did not return valid JSON: {exc}\nbody={body[:500]}") from exc
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/_extract_headless.py` around lines 166 - 169, The except block around
json.loads(text) currently raises a new ValueError without chaining the original
json.JSONDecodeError; update the except handler for json.JSONDecodeError (the
block catching exc) to re-raise the ValueError using "from exc" so the original
traceback is preserved (i.e., change the raise in the try/except that follows
json.loads(text) to use exception chaining).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ledger/queries.py`:
- Around line 571-580: The code currently treats any non-"ungrounded" status
(including empty or missing status) as grounded; update the logic in the block
that obtains status and updates the bucket (the variables status,
buckets.setdefault, and keys "grounded"/"ungrounded"/"total") so that only an
explicit "grounded" value increments b["grounded"], and all other cases
(including empty string or missing status) increment b["ungrounded"] (or are
handled as ungrounded) before updating totals and grounded_pct.

In `@tests/_extraction_metrics.py`:
- Around line 88-103: The current greedy loop matching items from actual to
unmatched_expected (variables actual, unmatched_expected, matched) using
fuzz.token_set_ratio and popping the best per iteration causes order-dependent
results; replace this with an order-independent max-cardinality / max-weight
bipartite matching: build a weight matrix of scores between each actual and each
expected (using fuzz.token_set_ratio and MATCH_THRESHOLD), run a max-weight
matching (e.g., Hungarian algorithm or
networkx.algorithms.matching.max_weight_matching) to select globally optimal
pairs, then populate matched, unmatched_actual, and unmatched_expected from that
matching instead of popping inside the greedy loop. Ensure ties below
MATCH_THRESHOLD are treated as non-matches.

In `@tests/eval_decision_relevance.py`:
- Around line 244-248: Wrap the json.loads(args.multi_repo) call in a try/except
to catch json.JSONDecodeError so a malformed --multi-repo value produces a clean
CLI error instead of a traceback; specifically, when parsing into repo_map (the
repo_map: dict[str, str] = json.loads(args.multi_repo) line), catch
json.JSONDecodeError, print a descriptive message to stderr (e.g. "error:
invalid --multi-repo JSON") and return the same error tuple ({}, 2) as the empty
case.

In `@tests/fixtures/extraction/saleor-checkout-extensibility.json`:
- Around line 50-51: Replace the ambiguous empty string owner value with a
proper null to represent an unassigned owner: find the JSON object containing
the "text" value "Deliver base implementation for review by end of sprint;
check-in Thursday on the plugin changes (critical path)" and set its "owner"
property from "" to null so consumers don't need extra normalization logic.

In `@tests/fixtures/extraction/vendure-custom-fields.json`:
- Around line 49-50: Change the JSON fixture entry where the object has "text":
"All PRs reviewed by Wednesday" and "owner": "" so that the owner field is
normalized to null; specifically replace the empty-string value for the "owner"
property with a JSON null to ensure schema consistency across fixtures.

---

Nitpick comments:
In `@tests/_extract_headless.py`:
- Around line 166-169: The except block around json.loads(text) currently raises
a new ValueError without chaining the original json.JSONDecodeError; update the
except handler for json.JSONDecodeError (the block catching exc) to re-raise the
ValueError using "from exc" so the original traceback is preserved (i.e., change
the raise in the try/except that follows json.loads(text) to use exception
chaining).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 443dbd1e-4916-4f3c-be13-116bede51987

📥 Commits

Reviewing files that changed from the base of the PR and between 4d46607 and a23dba3.

📒 Files selected for processing (22)
  • code_locator/tools/validate_symbols.py
  • contracts.py
  • handlers/ingest.py
  • ledger/queries.py
  • tests/_extract_headless.py
  • tests/_extraction_metrics.py
  • tests/eval_decision_relevance.py
  • tests/fixtures/expected/decisions.py
  • tests/fixtures/extraction/README.md
  • tests/fixtures/extraction/medusa-payment-timeout.json
  • tests/fixtures/extraction/medusa-plugin-migration.json
  • tests/fixtures/extraction/medusa-webhook-notifications.json
  • tests/fixtures/extraction/saleor-checkout-extensibility.json
  • tests/fixtures/extraction/saleor-graphql-permissions.json
  • tests/fixtures/extraction/saleor-order-workflows.json
  • tests/fixtures/extraction/vendure-channel-pricing.json
  • tests/fixtures/extraction/vendure-custom-fields.json
  • tests/fixtures/extraction/vendure-search-reindexing.json
  • tests/regen_extraction_fixtures.py
  • tests/test_extract_headless.py
  • tests/test_extraction_metrics.py
  • tests/test_phase2_ledger.py

Comment thread ledger/queries.py
Comment on lines +571 to +580
status = str(row.get("status") or "")
b = buckets.setdefault(
ref,
{"source_ref": ref, "grounded": 0, "ungrounded": 0, "total": 0, "grounded_pct": 0.0},
)
b["total"] += 1
if status == "ungrounded":
b["ungrounded"] += 1
else:
b["grounded"] += 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Do not treat empty/missing status as grounded.

At Line 571-580, status == "" (or missing) currently increments grounded, which can overstate grounded_pct and skew eval outcomes.

Suggested fix
-        status = str(row.get("status") or "")
+        status = str(row.get("status") or "")
         b = buckets.setdefault(
             ref,
             {"source_ref": ref, "grounded": 0, "ungrounded": 0, "total": 0, "grounded_pct": 0.0},
         )
         b["total"] += 1
-        if status == "ungrounded":
+        if not status or status == "ungrounded":
             b["ungrounded"] += 1
         else:
             b["grounded"] += 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ledger/queries.py` around lines 571 - 580, The code currently treats any
non-"ungrounded" status (including empty or missing status) as grounded; update
the logic in the block that obtains status and updates the bucket (the variables
status, buckets.setdefault, and keys "grounded"/"ungrounded"/"total") so that
only an explicit "grounded" value increments b["grounded"], and all other cases
(including empty string or missing status) increment b["ungrounded"] (or are
handled as ungrounded) before updating totals and grounded_pct.

Comment thread tests/_extraction_metrics.py Outdated
Comment on lines +88 to +103
for a in actual:
if not unmatched_expected:
unmatched_actual.append(a)
continue
# Find best match in remaining expected pool
best_i = -1
best_score = -1.0
for i, e in enumerate(unmatched_expected):
score = fuzz.token_set_ratio(a, e)
if score > best_score:
best_score = score
best_i = i
if best_score >= MATCH_THRESHOLD:
matched.append((a, unmatched_expected[best_i], best_score))
unmatched_expected.pop(best_i)
else:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Order-dependent greedy matching can mis-score TP/FP/FN.

At Line 88, matching extracted decisions sequentially and popping matches at Line 102 makes outcomes dependent on input order. For a regression gate, this can undercount true positives on equivalent outputs.

Proposed fix (order-independent max-cardinality matching)
 def compute_extraction_metrics(
@@
-    unmatched_expected = list(expected)
-    matched: list[tuple[str, str, float]] = []
-    unmatched_actual: list[str] = []
-
-    for a in actual:
-        if not unmatched_expected:
-            unmatched_actual.append(a)
-            continue
-        # Find best match in remaining expected pool
-        best_i = -1
-        best_score = -1.0
-        for i, e in enumerate(unmatched_expected):
-            score = fuzz.token_set_ratio(a, e)
-            if score > best_score:
-                best_score = score
-                best_i = i
-        if best_score >= MATCH_THRESHOLD:
-            matched.append((a, unmatched_expected[best_i], best_score))
-            unmatched_expected.pop(best_i)
-        else:
-            unmatched_actual.append(a)
+    # Build thresholded bipartite graph (actual -> expected), neighbors sorted by score desc.
+    adj: list[list[int]] = []
+    score_map: dict[tuple[int, int], float] = {}
+    for ai, a in enumerate(actual):
+        candidates: list[tuple[int, float]] = []
+        for ei, e in enumerate(expected):
+            s = fuzz.token_set_ratio(a, e)
+            if s >= MATCH_THRESHOLD:
+                candidates.append((ei, s))
+                score_map[(ai, ei)] = s
+        candidates.sort(key=lambda t: t[1], reverse=True)
+        adj.append([ei for ei, _ in candidates])
+
+    # Maximum-cardinality bipartite matching (Kuhn DFS).
+    match_e_to_a = [-1] * len(expected)
+
+    def _dfs(ai: int, seen: set[int]) -> bool:
+        for ei in adj[ai]:
+            if ei in seen:
+                continue
+            seen.add(ei)
+            prev_ai = match_e_to_a[ei]
+            if prev_ai == -1 or _dfs(prev_ai, seen):
+                match_e_to_a[ei] = ai
+                return True
+        return False
+
+    for ai in range(len(actual)):
+        _dfs(ai, set())
+
+    matched_actual = {ai for ai in match_e_to_a if ai != -1}
+    matched_expected_idx = {ei for ei, ai in enumerate(match_e_to_a) if ai != -1}
+
+    matched: list[tuple[str, str, float]] = []
+    for ei, ai in enumerate(match_e_to_a):
+        if ai != -1:
+            matched.append((actual[ai], expected[ei], score_map[(ai, ei)]))
+
+    unmatched_actual = [a for i, a in enumerate(actual) if i not in matched_actual]
+    unmatched_expected = [e for i, e in enumerate(expected) if i not in matched_expected_idx]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/_extraction_metrics.py` around lines 88 - 103, The current greedy loop
matching items from actual to unmatched_expected (variables actual,
unmatched_expected, matched) using fuzz.token_set_ratio and popping the best per
iteration causes order-dependent results; replace this with an order-independent
max-cardinality / max-weight bipartite matching: build a weight matrix of scores
between each actual and each expected (using fuzz.token_set_ratio and
MATCH_THRESHOLD), run a max-weight matching (e.g., Hungarian algorithm or
networkx.algorithms.matching.max_weight_matching) to select globally optimal
pairs, then populate matched, unmatched_actual, and unmatched_expected from that
matching instead of popping inside the greedy loop. Ensure ties below
MATCH_THRESHOLD are treated as non-matches.

Comment on lines +244 to +248
repo_map: dict[str, str] = json.loads(args.multi_repo) if args.multi_repo else {}
if not repo_map:
print("error: --multi-repo is required", file=sys.stderr)
return {}, 2

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Handle malformed --multi-repo JSON with a clean CLI error.

Line 244 can throw json.JSONDecodeError and produce a traceback instead of a controlled exit path.

Proposed fix
-    repo_map: dict[str, str] = json.loads(args.multi_repo) if args.multi_repo else {}
+    try:
+        repo_map: dict[str, str] = json.loads(args.multi_repo) if args.multi_repo else {}
+    except json.JSONDecodeError as exc:
+        print(f"error: --multi-repo must be valid JSON: {exc}", file=sys.stderr)
+        return {}, 2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval_decision_relevance.py` around lines 244 - 248, Wrap the
json.loads(args.multi_repo) call in a try/except to catch json.JSONDecodeError
so a malformed --multi-repo value produces a clean CLI error instead of a
traceback; specifically, when parsing into repo_map (the repo_map: dict[str,
str] = json.loads(args.multi_repo) line), catch json.JSONDecodeError, print a
descriptive message to stderr (e.g. "error: invalid --multi-repo JSON") and
return the same error tuple ({}, 2) as the empty case.

Comment on lines +50 to +51
"text": "Deliver base implementation for review by end of sprint; check-in Thursday on the plugin changes (critical path)",
"owner": ""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use null instead of empty string for unassigned owner.

At Line 50-51, owner: "" is ambiguous and forces extra normalization logic. Use null for missing owner values.

Suggested fix
-      "owner": ""
+      "owner": null
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"text": "Deliver base implementation for review by end of sprint; check-in Thursday on the plugin changes (critical path)",
"owner": ""
"text": "Deliver base implementation for review by end of sprint; check-in Thursday on the plugin changes (critical path)",
"owner": null
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fixtures/extraction/saleor-checkout-extensibility.json` around lines 50
- 51, Replace the ambiguous empty string owner value with a proper null to
represent an unassigned owner: find the JSON object containing the "text" value
"Deliver base implementation for review by end of sprint; check-in Thursday on
the plugin changes (critical path)" and set its "owner" property from "" to null
so consumers don't need extra normalization logic.

Comment on lines +49 to +50
"text": "All PRs reviewed by Wednesday",
"owner": ""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Normalize missing owner as null.

At Line 49-50, prefer owner: null over owner: "" for schema consistency across fixtures.

Suggested fix
-      "owner": ""
+      "owner": null
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"text": "All PRs reviewed by Wednesday",
"owner": ""
"text": "All PRs reviewed by Wednesday",
"owner": null
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fixtures/extraction/vendure-custom-fields.json` around lines 49 - 50,
Change the JSON fixture entry where the object has "text": "All PRs reviewed by
Wednesday" and "owner": "" so that the owner field is normalized to null;
specifically replace the empty-string value for the "owner" property with a JSON
null to ensure schema consistency across fixtures.

Replaces the rapidfuzz token_set_ratio matcher with an Anthropic
Haiku 4.5 tool-use call that judges semantic equivalence between
extracted and ground-truth decisions. Rapidfuzz stays as an offline
fallback for unit tests.

Why:
The first end-to-end CI run (24370616582) reported P=0.51, R=0.45,
F1=0.48, but inspection of the per-transcript unmatched lists showed
most "false positives" and "false negatives" were the same decision
phrased slightly differently. Four concrete cases from the run:

  medusa-payment-timeout:
    FP: "Implement a 12-second timeout ceiling on payment provider
         authorize calls; if exceeded, return requires_more status"
    FN: "Wrap the authorize call in PaymentProviderService with a
         12-second timeout ceiling and return requires_more status"
    (same decision, scored <70 on token_set_ratio)

  medusa-webhook-notifications:
    FP: "Per-endpoint rate limiting will use token bucket algorithm
         with a capacity of 10 requests per second, with overflow queued"
    FN: "Apply a per-endpoint token bucket rate limiter at 10 requests
         per second; overflow requests are queued rather than dropped"

  saleor-graphql-permissions:
    FP: "Gate authorization checks for mutations with side effects..."
    FN: "Gate the checkoutComplete mutation's permission check before
         any side effects..."

  vendure-custom-fields:
    FP: "Schema synchronization will use TypeORM schema:synchronize..."
    FN: "Run schema:synchronize in staging first as a CI validation..."

These numbers were measuring matcher brittleness, not skill quality.

Design:
- New module tests/_extraction_matcher.py with llm_match() that does
  ONE Haiku call per transcript (not N*M per pair), with the full
  actual + expected lists in the user prompt and tool_use forcing a
  structured matches array in return.
- Strict matching rules in the system prompt: "same constraint, same
  scope, same behavior", 1:1 pairing, "when in doubt leave unmatched".
- Prompt-cached on the system prompt so re-runs within a branch only
  pay for user-message tokens.
- Response cached to tests/.match-cache/ keyed on
  (model, sha(actual_list), sha(expected_list)) so repeat CI runs on
  the same branch pay $0 to the API. Any SKILL.md edit that changes
  what Haiku extracts changes the actual_sha and invalidates the
  cache automatically.
- Tool response is already a parsed dict — no JSON string parsing.
- Fails fast with a diagnostic error body on HTTP errors, same
  pattern as _extract_headless.

Dispatcher:
- compute_extraction_metrics(..., matcher="auto") resolves to "llm"
  when ANTHROPIC_API_KEY is set (CI path) and "rapidfuzz" otherwise
  (offline unit tests). Explicit matcher="llm" or matcher="rapidfuzz"
  forces the path. New "matcher" field in the metric dict records
  which matcher was used for auditability.
- Runner prints "extraction (llm):" vs "extraction (rapidfuzz):" in
  the aggregate summary so CI logs show which matcher ran.

Tests:
- Existing rapidfuzz-mode tests retained, explicitly forced via
  matcher="rapidfuzz" (9 passing)
- New unit tests for _pick_matcher (env-based auto-resolve), LLM
  cache hit path, missing-key error, _parse_matches 1:1 enforcement
  with invalid indices / duplicate expecteds, and the compute_*
  dispatch path via a stubbed llm_match (7 passing)
- All offline — no network required
- Total: 16 metric tests + 9 extractor tests = 25 pass

Cost per CI run:
~9 Haiku calls per run × ~2k input tokens × $0.80/Mtok = ~$0.015
plus prompt-cache hits on the system prompt amortized across calls.
Effectively free, and cached zero on repeats.

Next CI run on this branch will produce the first real extraction
metric with paraphrase-robust matching. Expected: the P=0.51 R=0.45
baseline will move upward substantially (informal eyeball suggests
~0.75+) without any skill change, because the previous numbers were
almost entirely matcher brittleness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
tests/_extraction_metrics.py (1)

70-98: ⚠️ Potential issue | 🟠 Major

Rapidfuzz fallback is still order-dependent.

This loop still greedily pops the best remaining expected for each actual, so runs that fall back to rapidfuzz can undercount TP / overcount FP based only on list order. That keeps the offline metric unstable; this path still needs a global 1:1 bipartite matching instead of greedy removal.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/_extraction_metrics.py` around lines 70 - 98, The _rapidfuzz_match
function currently does greedy per-actual selection causing order-dependent
matches; replace the greedy removal with a global 1:1 bipartite matching: build
a score matrix using fuzz.token_set_ratio between every actual (rows) and
expected (columns), run a maximum-weight bipartite matching (e.g., Hungarian
algorithm / scipy.optimize.linear_sum_assignment or a max-weight matching
routine) to get globally optimal actual->expected pairs, then convert matched
pairs to the (actual_idx, expected_idx) list but only accept matches whose score
>= MATCH_THRESHOLD (unmatched or below-threshold become (actual_idx, None));
keep the function signature and return format unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/_extraction_matcher.py`:
- Around line 217-223: The RuntimeError raised after the HTTP POST is echoing
API key-derived data (api_key, key_prefix, key_len); remove any key-derived
fields from the exception text. Update the error construction that references
resp.status_code, resp.text, model, key_prefix, and key_len so it only includes
non-secret diagnostics (e.g., resp.status_code, resp.text[:500], and model) and
does not reference api_key, key_prefix, key_len, or ANTROP... constants.
- Around line 127-136: The cache key generation is ambiguous and missing
prompt/tool versioning: change _list_sha to produce an unambiguous hash of the
list (e.g., include item lengths or a separator encoding so ["a","b\nc"] !=
["a\nb","c"]) and update _cache_path to incorporate SYSTEM_PROMPT,
USER_PROMPT_TEMPLATE and MATCHER_TOOL (or their current version/sha) into the
key string used before calling _sha so llm_match() won't return stale hits when
prompts or tools change; update references to _list_sha and _cache_path
accordingly.

In `@tests/_extraction_metrics.py`:
- Line 51: The comment on the MATCH_THRESHOLD constant contains a Unicode
en-dash; replace the en-dash (–) with a plain ASCII hyphen (-) in the inline
comment after MATCH_THRESHOLD so the comment reads "rapidfuzz token_set_ratio,
0-100 scale" and resolves the RUF003 lint warning.

---

Duplicate comments:
In `@tests/_extraction_metrics.py`:
- Around line 70-98: The _rapidfuzz_match function currently does greedy
per-actual selection causing order-dependent matches; replace the greedy removal
with a global 1:1 bipartite matching: build a score matrix using
fuzz.token_set_ratio between every actual (rows) and expected (columns), run a
maximum-weight bipartite matching (e.g., Hungarian algorithm /
scipy.optimize.linear_sum_assignment or a max-weight matching routine) to get
globally optimal actual->expected pairs, then convert matched pairs to the
(actual_idx, expected_idx) list but only accept matches whose score >=
MATCH_THRESHOLD (unmatched or below-threshold become (actual_idx, None)); keep
the function signature and return format unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5c255b50-0c6e-45de-b5a9-b1e762779ec6

📥 Commits

Reviewing files that changed from the base of the PR and between a23dba3 and 42c3a07.

📒 Files selected for processing (4)
  • tests/_extraction_matcher.py
  • tests/_extraction_metrics.py
  • tests/eval_decision_relevance.py
  • tests/test_extraction_metrics.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/test_extraction_metrics.py
  • tests/eval_decision_relevance.py

Comment on lines +127 to +136
def _list_sha(items: list[str]) -> str:
# Order matters for cache key — same-content-different-order lists
# should map to different cache entries so we don't silently reuse a
# stale matching when the runner reshuffles.
return _sha("\n".join(items))


def _cache_path(model: str, actual_sha: str, expected_sha: str) -> Path:
key = f"{model}|{actual_sha}|{expected_sha}"
return MATCH_CACHE_DIR / f"{_sha(key)}.json"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Make the cache key unambiguous and prompt-versioned.

Line 131 hashes "\n".join(items), so inputs like ["a", "b\nc"] and ["a\nb", "c"] collide. Lines 135-136 also ignore SYSTEM_PROMPT, USER_PROMPT_TEMPLATE, and MATCHER_TOOL, so prompt/tool edits will silently reuse stale judge outputs. Because llm_match() returns cache hits before calling the API, this can produce wrong metrics.

🔧 Proposed fix
 def _sha(text: str) -> str:
     return hashlib.sha256(text.encode("utf-8")).hexdigest()
 
 
+MATCHER_CACHE_VERSION = _sha(
+    json.dumps(
+        {
+            "system": SYSTEM_PROMPT,
+            "user_template": USER_PROMPT_TEMPLATE,
+            "tool": MATCHER_TOOL,
+        },
+        sort_keys=True,
+        ensure_ascii=False,
+        separators=(",", ":"),
+    )
+)
+
+
 def _list_sha(items: list[str]) -> str:
     # Order matters for cache key — same-content-different-order lists
     # should map to different cache entries so we don't silently reuse a
     # stale matching when the runner reshuffles.
-    return _sha("\n".join(items))
+    return _sha(json.dumps(items, ensure_ascii=False, separators=(",", ":")))
 
 
 def _cache_path(model: str, actual_sha: str, expected_sha: str) -> Path:
-    key = f"{model}|{actual_sha}|{expected_sha}"
+    key = f"{MATCHER_CACHE_VERSION}|{model}|{actual_sha}|{expected_sha}"
     return MATCH_CACHE_DIR / f"{_sha(key)}.json"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/_extraction_matcher.py` around lines 127 - 136, The cache key
generation is ambiguous and missing prompt/tool versioning: change _list_sha to
produce an unambiguous hash of the list (e.g., include item lengths or a
separator encoding so ["a","b\nc"] != ["a\nb","c"]) and update _cache_path to
incorporate SYSTEM_PROMPT, USER_PROMPT_TEMPLATE and MATCHER_TOOL (or their
current version/sha) into the key string used before calling _sha so llm_match()
won't return stale hits when prompts or tools change; update references to
_list_sha and _cache_path accordingly.

Comment on lines +217 to +223
key_prefix = api_key[:12] if api_key else "(empty)"
with httpx.Client(timeout=REQUEST_TIMEOUT_S) as client:
resp = client.post(ANTHROPIC_API_URL, headers=headers, json=payload)
if resp.status_code >= 400:
raise RuntimeError(
f"Matcher API error {resp.status_code}: {resp.text[:500]} "
f"(model={model}, key_prefix={key_prefix}..., key_len={len(api_key)})"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Do not echo API key material into the exception text.

Line 223 includes part of ANTHROPIC_API_KEY in every 4xx failure. Even partial credentials should not land in CI logs or artifacts; keep the model/status diagnostics and drop the key-derived fields.

🔒 Proposed fix
-    key_prefix = api_key[:12] if api_key else "(empty)"
     with httpx.Client(timeout=REQUEST_TIMEOUT_S) as client:
         resp = client.post(ANTHROPIC_API_URL, headers=headers, json=payload)
         if resp.status_code >= 400:
             raise RuntimeError(
                 f"Matcher API error {resp.status_code}: {resp.text[:500]} "
-                f"(model={model}, key_prefix={key_prefix}..., key_len={len(api_key)})"
+                f"(model={model})"
             )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
key_prefix = api_key[:12] if api_key else "(empty)"
with httpx.Client(timeout=REQUEST_TIMEOUT_S) as client:
resp = client.post(ANTHROPIC_API_URL, headers=headers, json=payload)
if resp.status_code >= 400:
raise RuntimeError(
f"Matcher API error {resp.status_code}: {resp.text[:500]} "
f"(model={model}, key_prefix={key_prefix}..., key_len={len(api_key)})"
with httpx.Client(timeout=REQUEST_TIMEOUT_S) as client:
resp = client.post(ANTHROPIC_API_URL, headers=headers, json=payload)
if resp.status_code >= 400:
raise RuntimeError(
f"Matcher API error {resp.status_code}: {resp.text[:500]} "
f"(model={model})"
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/_extraction_matcher.py` around lines 217 - 223, The RuntimeError raised
after the HTTP POST is echoing API key-derived data (api_key, key_prefix,
key_len); remove any key-derived fields from the exception text. Update the
error construction that references resp.status_code, resp.text, model,
key_prefix, and key_len so it only includes non-secret diagnostics (e.g.,
resp.status_code, resp.text[:500], and model) and does not reference api_key,
key_prefix, key_len, or ANTROP... constants.

from rapidfuzz import fuzz

FIXTURES_DIR = Path(__file__).resolve().parent / "fixtures" / "extraction"
MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0–100 scale

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Replace the Unicode dash in this comment.

Line 51 uses , which Ruff flags as RUF003. Swap it for a plain - so the new file stays lint-clean.

🧹 Proposed fix
-MATCH_THRESHOLD = 70  # rapidfuzz token_set_ratio, 0–100 scale
+MATCH_THRESHOLD = 70  # rapidfuzz token_set_ratio, 0-100 scale
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0100 scale
MATCH_THRESHOLD = 70 # rapidfuzz token_set_ratio, 0-100 scale
🧰 Tools
🪛 Ruff (0.15.9)

[warning] 51-51: Comment contains ambiguous (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/_extraction_metrics.py` at line 51, The comment on the MATCH_THRESHOLD
constant contains a Unicode en-dash; replace the en-dash (–) with a plain ASCII
hyphen (-) in the inline comment after MATCH_THRESHOLD so the comment reads
"rapidfuzz token_set_ratio, 0-100 scale" and resolves the RUF003 lint warning.

jinhongkuan and others added 3 commits April 13, 2026 21:54
Bootstraps the M1 adversarial corpus ground truth in-session
(Claude Opus 4.6, with the cloned medusa/saleor/vendure trees
available for symbol cross-checks). One JSON per adversarial
transcript, matching the shape of the existing friendly-corpus
fixtures.

Distribution by category (15 decisions + 6 action_items total):

  adv-strat-fake       STRAT-FAKE       0 dec  0 act   (strict-exclude probe)
  adv-vocab-collide    VOCAB-COLLIDE    1 dec  2 act   (one buried needle)
  adv-density-extreme  DENSITY-EXTREME  9 dec  2 act   (compound + diffuse)
  adv-offtopic-mix     OFFTOPIC-MIX     1 dec  1 act   (buried in Q2 OR)
  adv-reversal         REVERSAL         4 dec  1 act   (final state only)

Each fixture has a "notes" field explaining the category-specific
extraction policy so future hand-edits stay aligned with intent.
The strat-fake empty case is intentional: any positive extraction
is a false positive.

These transcripts are not yet wired into TRANSCRIPT_SOURCES — that
edit waits on the fixture-location migration (separate cleanup
ticket). Once registered, CI will produce the first adversarial
baseline alongside the existing friendly-corpus numbers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Registers adv-strat-fake, adv-vocab-collide, adv-density-extreme,
adv-offtopic-mix, and adv-reversal under repo_key="adversarial" so
the runner reports them as their own group in CI output. Ground
truth for each was committed earlier at
tests/fixtures/extraction/adv-*.json.

The "adversarial" repo_key is matched against --multi-repo by the
parent workflow, where it aliases to the medusa clone (any indexed
codebase works — adversarial transcripts test extraction quality,
not grounding quality against a specific codebase). Bland-noun
grounding noise on these transcripts is exactly what the
VOCAB-COLLIDE category is designed to expose, and feeds the M2
(grounding precision) ticket.

Runner total: 9 friendly + 5 adversarial = 14 transcripts.

Local smoke test: --skill-variant none with adversarial in the
repo map handles the empty case cleanly (adversarial transcripts
have no entries in ALL_DECISIONS, so none-mode produces 0/0 for
the adversarial group without crashing). Live extraction in CI's
from-skill-md mode will be the first real measurement.
Removes the 9 friendly e-commerce fixtures (medusa-*, saleor-*,
vendure-*) from tests/fixtures/extraction/ and the matching
TRANSCRIPT_SOURCES entries from decisions.py. The friendly corpus
was too easy to be informative about real-world reliability —
F1=0.87 on sprint-planning meetings says nothing about whether
bicameral handles hedged language, generic vocabulary, compound
sentences, off-topic meetings, or reversed decisions.

After this commit:

  TRANSCRIPT_SOURCES   →  5 entries, all repo_key="adversarial"
  fixtures/extraction/ →  README.md + 5 adv-*.json files

The 9 friendly transcript .md files in pilot/ml/data/transcripts/
remain in place — they're not bicameral-mcp test fixtures, they're
data assets used elsewhere in pilot/ml/. Only the test-fixture
copies in pilot/mcp/tests/fixtures/extraction/ are removed.

ALL_DECISIONS in decisions.py is unchanged — it's still consumed
by tests/eval_code_locator.py for the M2 RAG eval, which targets
the friendly OSS repos and is conceptually independent of M1.

Offline test suite: 40/40 pass after cleanup
(15 phase-2 + 16 metrics + 9 extract_headless).
@jinhongkuan jinhongkuan merged commit 79765a6 into main Apr 14, 2026
1 check passed
jinhongkuan added a commit that referenced this pull request May 2, 2026
Resolves four conflicts:
- .gitignore: keeps both qor:seed block and .claude/worktrees ignore.
- docs/META_LEDGER.md: takes dev's chain (#1-#26) as canonical. This
  branch's parallel #7-#14 entries (sealed against the obsolete Entry #6
  base before dev added #7-#26) are not folded in here — they need to be
  re-authored against dev's #26 seal via /qor-meta-log-decision in a
  follow-up. Their session content is preserved in commit messages
  (f4de501, 3f856af) and in docs/SYSTEM_STATE.md.
- docs/SYSTEM_STATE.md: keeps both the v0 process cleanup / preflight
  hook session blocks from this branch and dev's #124 implementation
  block.
- skills/bicameral-preflight/SKILL.md: keeps both the new "Hook
  reinforcement" subsection (this branch) and the Telemetry section
  (dev). The earlier removal of Telemetry in this branch's f4de501 was
  authored against a stale base; restoring it here.
Knapp-Kevin added a commit that referenced this pull request May 3, 2026
Priority C v0 — Phase 1 of the team-server vertical-slice for multi-dev
decision continuity at organizational scale (per the Sales Enablement &
Positioning Playbook + research-brief-priority-c-selective-ingest-2026-05-02.md).

The team-server is a self-managing, customer-self-hosted Python service.
Per CONCEPT.md anti-goals under literal-keyword parsing (SHADOW_GENOME
Failure Entry #6 addendum): "no managed backend" forbids vendor SaaS
and human-ops-tax architectures, NOT self-managing customer-deployable
backends. Sentry self-hosted, Supabase OSS, and the embedded-SurrealDB
philosophy already in repo are precedents.

Scaffold delivered:
- team_server/app.py (47 LOC): FastAPI app factory; lifespan migrates
  schema on startup, closes DB on teardown; /health endpoint
- team_server/schema.py (80 LOC): v0 schema for workspace,
  channel_allowlist, extraction_cache, team_event tables. Idempotent
  ensure_schema(); migration dispatch table for future versions.
  FLEXIBLE TYPE object on canonical_extraction + payload (per #72
  lesson + audit Advisory #3).
- team_server/db.py (41 LOC): TeamServerDB factory wrapping
  ledger.client.LedgerClient with team-server's own ns/db pair.
- deploy/team-server.docker-compose.yml + Dockerfile.team-server:
  single-service compose; volume for persistent data; healthcheck on
  /health; runs as non-root.

Tests (6 functionality tests, all green):
- tests/test_team_server_app.py: app starts + serves health, schema
  migrates from empty + idempotent, lifespan teardown closes DB,
  /health returns well-formed JSON.
- tests/test_team_server_deploy.py: docker-compose config validates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Knapp-Kevin added a commit that referenced this pull request May 3, 2026
Phase 3 of plan-priority-c-team-server-slack-v0.md — adds the polling
worker, the canonical-extraction cache that closes the multi-dev
extraction-divergence gap, and the peer-author event writer.

Multi-dev convergence mechanism: any dev's session that touches a Slack
message produces the SAME canonical extraction across the team because
get_or_compute is keyed on (source_type, source_ref, content_hash) and
the cache row is shared via the team-server's append-only event log.
This is what closes Playbook Pillar #1 (Decision Continuity) at
multi-dev/multi-agent scale that the v1 brief incorrectly framed as
"build curation only" — see SHADOW_GENOME Failure Entry #6.

The interim canonical extraction uses Anthropic Claude with the
model_version='interim-claude-v1' tombstone so a future Phase 5 (CocoIndex
#136 integration) can identify and rebuild interim entries when memoized
deterministic transforms become available.

Files added:
- team_server/extraction/canonical_cache.py (45 LOC): get_or_compute()
  returns cached extraction OR invokes compute_fn + persists.
- team_server/extraction/llm_extractor.py: interim Anthropic-backed
  extractor; production-default until CocoIndex lands.
- team_server/sync/peer_writer.py (42 LOC): write_team_event() — appends
  to team_event with author_email='team-server@<team_id>.bicameral'
  identity (single-bot per workspace; multi-instance HA is v1).
- team_server/workers/slack_worker.py (100 LOC): poll_once() —
  conversations_history per allowlisted channel, content_hash dedup,
  cache-keyed extraction, peer event write per new message.

Tests (6 functionality tests, all green):
- tests/test_team_server_canonical_cache.py: cache hit returns existing
  without compute_fn, cache miss persists + subsequent call returns
  cached, content_hash variation produces new rows.
- tests/test_team_server_slack_worker.py: worker polls only allowlisted
  channels, writes one team_event per message with peer-author identity,
  dedups via message ts (idempotent re-poll).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Knapp-Kevin added a commit that referenced this pull request May 3, 2026
Phase 4 of plan-priority-c-team-server-slack-v0.md — closes the
multi-dev convergence loop by exposing team_event over HTTP and
extending EventMaterializer with a failure-isolated team-server pull.

The materializer pull is OUTSIDE the deterministic core (per CONCEPT.md
literal-keyword parsing of "no network calls in the deterministic core"
— SHADOW_GENOME Failure Entry #6 addendum). Failure-isolation contract:
team-server outage NEVER cascades into per-dev preflight failures —
events/team_server_pull.py swallows transport errors, returns empty
events, leaves the watermark unchanged.

Files added:
- team_server/api/events.py: GET /events?since=N&limit=M endpoint;
  reads team_event ordered by sequence ascending; pagination via
  since cursor.
- events/team_server_pull.py (57 LOC): pull_team_server_events()
  queries the team-server's /events endpoint, persists watermark per
  call, swallows transport errors gracefully.
- team_server/app.py (modified): include events router.

Tests (6 functionality tests, all green):
- tests/test_team_server_events_api.py: /events returns rows in
  sequence order, paginates via since cursor, returns empty (not
  error) when no new events.
- tests/test_materializer_team_server_pull.py: materializer pulls from
  team-server URL, watermark advances + persists separately, 503
  /transport-error degrades gracefully (no exception, watermark
  unchanged) — failure-isolation contract verified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Knapp-Kevin added a commit that referenced this pull request May 3, 2026
QorLogic SDLC governance trail for the Priority C v0 implementation
that landed in commits 1-4 of this PR.

Includes:

- docs/research-brief-priority-c-selective-ingest-2026-05-02.md (v3) —
  research substrate. v1 was rejected for INVARIANT_FROM_IMPLEMENTATION
  (treating v0 agent-fetches-only code state as product principle); v2
  added playbook substrate; v3 narrowed to Slack-first + team-server
  + CocoIndex-conditional after operator dialogue clarified "no managed
  backend" = "no human-ops-tax architecture," not "no backend."
- plan-priority-c-team-server-slack-v0.md (437 LOC) — the L3 plan
  with five phases (Phase 5 deferred per "if we can manage it"
  feasibility caveat).
- docs/SHADOW_GENOME.md Failure Entry #6 + addendum — captures the
  framing-error pattern AND the "anti-goals must be parsed by their
  load-bearing keyword" lesson; symmetric to v0-code-as-principle.
- docs/META_LEDGER.md Entries #27 (IMPLEMENT) + #28 (SEAL).
  Predecessor: efd0304b (#135-triage seal on dev).
  Implement chain: 211ffb9e.
  Substantiation seal: 6f4f8f8f1d63ad82b952a3c6aff270d30584e08b0572077ff685e84ce453f6c2
- docs/SYSTEM_STATE.md — Priority C v0 section appended; documents
  schema additions, architectural properties achieved, audit advisory
  disposition, Phase 5 deferred state, and the qor-logic-internal
  steps skipped for downstream-project rationale.
- .agent/staging/AUDIT_REPORT.md — PASS verdict, three non-blocking
  advisories all addressed at implement-time.

Verdict: REALITY = PROMISE for Phases 1-4. Phase 5 (CocoIndex #136)
explicitly deferred per plan slip-independence design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Knapp-Kevin added a commit that referenced this pull request May 6, 2026
…+Gemini+Codex-2 (#205)

Authors a single Reviewer Disposition Pass table at the top of the
brief reconciling all 32 review points across four review layers
(Codex first-pass, Kilo, Gemini CLI, Codex second-pass) into one
post-review consensus before downstream P1 issue-filing — per the
explicit Codex-2 #1 directive.

Decisions: 21 applied this commit, 6 already applied in 1d82658,
3 deferred to follow-up, 2 note-only. Net new gap IDs added per
disposition: GDPR-08 (ephemeral data), GDPR-09 (consent versioning
+ revocation), LLM-11 (cross-tool config-file modification surface),
MCP-01 (host UX as external dependency), CFG-01 (config precedence
+ fail-closed model). Reclassification: LLM-06 P0/M → P1/M with
scope narrowed to future remote-skill-loading channel (per Kilo #2).

Major content additions to the brief:

- § 1.1: MCP host UX is external dependency, not security gate (new
  gap MCP-01) — host that auto-approves tool calls bypasses any
  "operator will see this" assumption.
- § 1.2: SurrealDB version pinning supply-chain callout (Kilo #11).
- § 1.7: cross-tool config-file modification surface (new gap LLM-11)
  distinct from skill-content surface — `setup_wizard` writes shell
  commands into `.claude/settings.json` that run host-side at hook fire.
- § 1.11 (new): Configuration precedence + fail-closed model — single
  uniform precedence rule across all knobs (env > config.yaml >
  hardcoded defaults), fail-closed semantics on missing/malformed/
  contradictory config (Codex-2 #5).
- § 2.4 (a): LLM02 mapping note clarifying it folds into LLM-07 +
  OWASP-04 (Kilo #13).
- § 2.4 (b): explicit `confirm=True` is agent-supplied not HITL
  (Kilo #3) — security context cannot rely on agent-filled params.
- § 2.4 (c) LLM-01 + LLM-04: extensible classifier (Gemini #2) +
  guardrail-not-classifier framing (Codex-1 #6) + control-acceptance
  template (Codex-2 #4) — quarantine, override, test fixtures,
  measurement counters.
- § 2.4 (c) LLM-03: timeouts as `.bicameral/config.yaml` knobs (Gemini #3).
- § 2.4 (c) LLM-05 + LLM-09: out-of-band operator confirmation, not
  agent-supplied confirm parameters (Kilo #3).
- § 2.4 (c) LLM-06: scope-narrowed to future remote-skill-loading; in
  current install model the wheel-trust covers it (Kilo #2).
- § 2.4 (c) LLM-11 (new): cross-tool config-file gate (signed
  hooks-manifest.json) distinct from skill manifest.
- § 2.1 (c) GDPR-01: three remediation candidates — tombstone-and-
  rebuild with signed manifest (Kilo #12), crypto-shredding (Gemini
  #1), or scope-out via PII detect-and-refuse.
- § 2.1 (c) GDPR-02: data-subject-access search must cover full
  identifier surface (description, source_ref, topic, file paths) not
  just signer email (Codex-1 #5).
- § 2.1 (c) GDPR-08 (new): ephemeral data surfaces (tempfiles, swap,
  WAL, crash dumps) (Kilo #7).
- § 2.1 (c) GDPR-09 (new): consent versioning + revocation semantics
  (Kilo #8 + Codex-2 #3).
- § 5: gap table updated with new rows + LLM-06 reclassification;
  gap counts post-disposition (5 P0 / 19 P1 / 16 P2 / 5 P3 = 45 total,
  up from 41).
- § 6.1 (new): epic grouping for deferred P1 batch (Codex-1 #10) —
  ingest boundary guardrails, per-tool authority gradation, supply-
  chain signing, telemetry & consent.
- § 6.2 (new): six-section control-acceptance template for every DG
  gap (Codex-2 #4) — positive / negative / bypass / fail-closed /
  telemetry / docs.

Filed-issue updates:
- Issue #214 (LLM-06): relabeled P0 → P1, retitled to reflect scope
  narrowing, full disposition comment added.
- Issue #212 (LLM-01) + #213 (LLM-04): disposition comments added
  capturing the guardrail framing, classifier extensibility, and
  control-acceptance template applicable to both.

Deferred for follow-up: Codex-1 #4 (controller/processor
restructure of standards table), Codex-1 #9 (full evidence appendix
beyond the methodology softening), Codex-2 #2 (full 3-column
deployment-profile matrix beyond the single-column trigger).

Brief now 706 lines (up from 606); +124 line diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant