diff --git a/docs/research/calibration-harness-stage2-design.md b/docs/research/calibration-harness-stage2-design.md new file mode 100644 index 00000000..ca0c1011 --- /dev/null +++ b/docs/research/calibration-harness-stage2-design.md @@ -0,0 +1,503 @@ +# Calibration Harness Stage-2 Design — CoordinationRisk null-models, seed replay, artifact layout + +**Status:** research-grade proposal (pre-v1). Origin: Amara +18th courier ferry, Part 1 §B ("Statistical Calibration Plan"), +§F PR #2 ("CoordinationRisk calibration harness"), and Part 2 +corrections #2 (Wilson intervals), #7 (MAD=0 fallback), and #9 +(explicit artifact output). This doc specifies the Stage-2 +rung of the corrected promotion ladder. Author: architect +review. Scope: design-only; no code, no tests, no workflow +changes. + +## 1. Why this doc + +The 18th-ferry corrected promotion ladder places the +**calibration harness at Stage 2**, between the toy detector +(Stage 1, PR #323) and the advisory engine (Stage 4, future +`src/Core/NetworkIntegrity/`). Stage 2's deliverable is a +runnable harness that: + +1. generates synthetic null-model workloads (no cartel), +2. generates synthetic attack workloads (cartel present), +3. computes metrics across many seeds, +4. emits artifacts that downstream calibration + ROC/PR + tooling reads, +5. produces Wilson-interval confidence bounds on detection + + FPR rates rather than handwave ±5%. + +Without a pre-committed design, the first Stage-2 graduation +invents its own conventions; every subsequent graduation +either follows them or diverges. Writing the design first is +cheaper than untangling divergence later. + +This doc is **research-grade** in `docs/research/` until an +ADR promotes it. Code landings consuming this design cite +it by path; any divergence from the design requires a design +amendment (PR editing this doc) rather than silent drift. + +## 2. Placement + +```text +src/Experimental/CartelLab/ + CalibrationHarness.fs ← core runner + NullModels.fs ← 6 null-model generators + AttackInjectors.fs ← 8 scenario injectors (Stage 3) + MetricVector.fs ← Z-score vector type + WilsonInterval.fs ← Wilson score CL helper +tests/Tests.FSharp/CartelLab/ + CalibrationHarness.Tests.fs ← seeded smoke tests +artifacts/coordination-risk/ ← .gitignored; output of runs + calibration-summary.json + seed-results.csv + roc-pr.json + failing-seeds.txt + metric-distributions.csv + run-manifest.json +``` + +The `src/Experimental/` namespace is introduced here for the +first time. It is the appropriate home for Stage-2 / Stage-3 +work per Amara's corrected promotion ladder — not `src/Core/`, +which is reserved for Stage-4 advisory engine. A +`src/Experimental/README.md` documenting the +"Experimental-vs-Core" distinction should accompany the first +landing. + +## 3. Core types + +### 3.1 Metric vector + +```fsharp +type MetricVector = { + DeltaLambda1 : double option // Δλ₁ (symmetrized adj) + DeltaQ : double option // ΔQ (modularity, Louvain) + StakeAccel : double option // A_S (2nd diff covariance) + PLVMagnitude : double option // PLV |r| + PLVOffset : double option // arg(r), radians + Exclusivity : double option // w(S,S) / w(S,V) + Influence : double option // ∂²Out/∂i∂j for i,j∈S +} +``` + +Notes: + +- Every metric is `option` because each has defined + "undefined" cases (empty inputs, zero-magnitude PLV, + degenerate Graph). +- PLV is two fields, not one — per 18th-ferry correction + #6, magnitude and offset are distinct dimensions. + Downstream scoring uses both. +- `Exclusivity` replaces raw `Conductance` from the + draft deep-research per correction #4. The positive- + valence primitive is what the score consumes directly; + `Conductance` stays a diagnostic only. + +### 3.2 Scenario + +```fsharp +type Scenario = + | NullModel of name: string + | Attack of name: string +``` + +Stage-2 populates these with the six null-model names: +`ErdosRenyi`, `Configuration`, `StakeShuffle`, +`TemporalShuffle`, `ClusteredHonest`, `Camouflage`. +Stage-3 adds eight attack names: +`ObviousClique`, `StealthSlow`, `SynchronizedVoting`, +`DenseHonestCluster`, `LowWeightCartel`, `CamouflageNoise`, +`RotatingCartel`, `CrossCoalition`. + +### 3.3 Run record + +```fsharp +type RunRecord = { + Scenario : Scenario + Seed : int + Parameters : Map // scenario params + Metrics : MetricVector + ZScores : MetricVector // robust-z normalized + Detected : bool // per-threshold verdict + Elapsed : System.TimeSpan +} +``` + +The `ZScores` field is a second `MetricVector` holding the +per-metric robust z-scores computed against the null-model +baseline. Using the same type twice is deliberate — the +values are in different units but the structural shape is +identical, which keeps code symmetric. + +## 4. Null-model generator interface + +```fsharp +type INullModelGenerator = + abstract Name : string + abstract Preserves : string seq + abstract Avoids : string seq + abstract Generate : seed: int -> parameters: Map -> Graph +``` + +The `Preserves` / `Avoids` fields are Amara 18th-ferry +Part 1 §B null-model table columns — they make the +invariant target of each generator machine-readable for the +harness report. Example: + +```fsharp +type ErdosRenyiGenerator() = + interface INullModelGenerator with + member _.Name = "ErdosRenyi" + member _.Preserves = seq { "node-count"; "average-degree" } + member _.Avoids = seq { "any-structure" } + member _.Generate seed parameters = + let n = parameters |> Map.find "n" |> int + let m = parameters |> Map.find "m" |> int + let rng = System.Random(seed) + // ... build Graph with n nodes, m random edges + Graph.empty +``` + +Six generators ship in Stage-2 `NullModels.fs`. Attack +injectors (Stage 3) implement a parallel interface +`IAttackInjector` with the same shape plus a "baseline graph" +parameter onto which the attack is overlaid. + +## 5. Wilson intervals + +Per Amara 18th-ferry correction #2, every +detection-rate / false-positive-rate claim ships with its +Wilson 95% interval. The claim "90 of 100 seeds detected" is +not "basically 90%"; its Wilson 95% lower bound is ~0.826. + +### 5.1 API + +```fsharp +/// Wilson score confidence interval for a binomial +/// proportion. Parameters: +/// successes : k, count of successes +/// trials : n, total trials +/// z : normal quantile (1.96 for 95%, 2.576 for 99%) +/// +/// Returns (lower, upper) in [0, 1]. Handles k=0 and k=n +/// without NaN. Returns (nan, nan) when trials < 1. +val wilson : successes: int -> trials: int -> z: double -> struct (double * double) + +/// Convenience wrapper at 95% (z = 1.959964). +val wilson95 : successes: int -> trials: int -> struct (double * double) +``` + +### 5.2 Reporting contract + +Every statistical claim in an artifact carries the four +fields: `{ successes, trials, lowerBound, upperBound }`. +Examples: + +```json +{ + "detection": { "successes": 90, "trials": 100, + "lowerBound": 0.826, "upperBound": 0.945 }, + "falsePositive": { "successes": 20, "trials": 100, + "lowerBound": 0.135, "upperBound": 0.289 } +} +``` + +Promotion discipline (Amara corrected ladder): Stage 4 +advisory-engine claims require `lowerBound ≥ 0.90` (for +detection) and `upperBound ≤ 0.10` (for FPR). Stage 2 can +report whatever the measurement is; Stage 4 sets the bar. + +## 6. Robust z-score with MAD=0 fallback + +Per 18th-ferry correction #7, `RobustStats.robustZScore` +needs a fallback when `MAD = 0` (null-model baseline is +constant). Current implementation (PR #333) uses an +epsilon-floor `MadFloor`. The correction asks for an +additional mode: + +```fsharp +type RobustZScoreMode = + | EpsilonFloor of epsilon: double + | PercentileRank + | Hybrid of epsilon: double // percentile-rank when MAD < epsilon + +val robustZScore : + mode: RobustZScoreMode + -> baseline: double seq + -> measurement: double + -> double option +``` + +The `Hybrid` mode is the recommended default for the +calibration harness: use the fast 1.4826·MAD path when +MAD is non-trivial, fall back to percentile-rank when the +baseline is effectively constant. Pure `EpsilonFloor` is +retained for callers that want strict O(1)-per-call +semantics. + +## 7. Artifact layout + +Per 18th-ferry correction #9, the output directory layout +is pre-specified so downstream tooling (notebooks, PR +comments, nightly-sweep dashboards) knows exactly what to +read. + +### 7.1 `run-manifest.json` + +One file per harness run, containing the invocation +metadata: + +```json +{ + "run_id": "2026-04-24-1430-otto-162", + "commit_sha": "<40-hex>", + "zeta_version": "", + "started_at": "2026-04-24T14:30:00Z", + "completed_at": "2026-04-24T14:42:13Z", + "null_models": ["ErdosRenyi", "Configuration", ...], + "attack_injectors": [], + "seeds_per_scenario": 100, + "scenario_parameters": { "n": 50, "m": 200 }, + "wilson_z": 1.959964, + "robust_mode": "Hybrid", + "robust_epsilon": 1.0e-12 +} +``` + +### 7.2 `seed-results.csv` + +One row per `(scenario, seed)` pair. Schema: + +```text +scenario,seed,detected,delta_lambda1,delta_q,stake_accel, +plv_mag,plv_offset,exclusivity,influence, +z_delta_lambda1,z_delta_q,z_stake_accel,z_plv_mag, +z_plv_offset,z_exclusivity,z_influence,elapsed_ms +``` + +Row is a flat CSV; no nested values. Undefined metrics +encode as empty string (not "nan", not "NULL") to stay +friendly to pandas / DuckDB / Excel. + +### 7.3 `calibration-summary.json` + +Aggregate per-scenario stats: + +```json +{ + "by_scenario": { + "ErdosRenyi": { + "trials": 100, + "detected": 18, + "detection_rate": { + "point": 0.18, + "lowerBound": 0.119, + "upperBound": 0.263 + }, + "metric_medians": { ... }, + "metric_mads": { ... } + }, + "ObviousClique": { + "trials": 100, + "detected": 90, + "detection_rate": { + "point": 0.90, + "lowerBound": 0.826, + "upperBound": 0.945 + } + } + }, + "overall_fpr": { + "trials": 600, + "false_positives": 80, + "rate": { + "point": 0.133, + "lowerBound": 0.108, + "upperBound": 0.163 + } + } +} +``` + +Null-model rows aggregate into `overall_fpr`; attack rows +stay per-scenario so different attacks can be assessed +independently. + +### 7.4 `roc-pr.json` + +ROC and PR curve samples for threshold sweeps. Per Amara +correction #8, PR curves are more informative than ROC for +low-prevalence detection: + +```json +{ + "roc": [ { "threshold": 0.5, "tpr": 0.95, "fpr": 0.30 }, ... ], + "pr": [ { "threshold": 0.5, "precision": 0.72, "recall": 0.95 }, ... ], + "operating_point": { + "recommended_threshold": 3.0, + "rationale": "lowest threshold satisfying Wilson FPR UB <= 0.10" + } +} +``` + +### 7.5 `metric-distributions.csv` + +Per-metric, per-scenario histograms for visualization. Schema: + +```text +scenario,metric,bin_lower,bin_upper,count +``` + +### 7.6 `failing-seeds.txt` + +Seeds on which a regression test would have failed (e.g. +the advertised detection bar was missed). One seed per +line. Intended downstream consumer: a regression suite +that re-runs these specific seeds on every PR to catch +re-introductions of the failure. + +## 8. Invocation contract + +```fsharp +type HarnessConfig = { + NullModels : INullModelGenerator list + AttackInjectors : IAttackInjector list + SeedsPerScenario : int // default 100 + ScenarioParameters : Map + WilsonZ : double // default 1.959964 + RobustMode : RobustZScoreMode // default Hybrid(1e-12) + OutputDirectory : string // default "artifacts/coordination-risk" + DetectionThreshold : double // per-scenario flag threshold +} + +val run : config: HarnessConfig -> Async +``` + +The runner emits all five artifact files on completion. +Failure to write any file aborts the run (fail-closed); a +partial artifact set is worse than a clear failure. + +## 9. CI hook (optional) + +Nightly workflow at `.github/workflows/cartel-calibration- +sweep.yml` (not landing with this design; mentioned for +orientation). Runs the harness on `schedule: cron: '0 4 * * *'`. +The workflow: + +- 100 seeds per scenario (small budget) +- uploads `artifacts/coordination-risk/` as workflow + artifacts +- opens an issue when `detection_rate.lowerBound` for any + attack scenario falls below 0.80 for two consecutive + runs + +The issue-opening rule is a regression alarm, not a merge +gate — statistical smoke tests do not block PRs per +`docs/research/test-classification.md`. + +## 10. Stage-3 scenario suite interface (forward-looking) + +Same shape as `INullModelGenerator` but parameterized over +a baseline graph: + +```fsharp +type IAttackInjector = + abstract Name : string + abstract ExpectedSignals : MetricName seq + abstract Inject : baseline: Graph -> seed: int -> parameters: Map -> Graph +``` + +`ExpectedSignals` documents which MetricVector fields the +attack should produce a signal on (e.g. `ObviousClique` → +`[DeltaLambda1; DeltaQ; Exclusivity]`; `SynchronizedVoting` +→ `[PLVMagnitude]` only). The harness's scenario-behavior +check asserts the expected-signal fields are above-baseline +z-score without asserting exact values. + +## 11. What this doc does NOT do + +- Does **not** ship any code. Pure design; Stage-2 + implementation lands as a separate graduation. +- Does **not** implement the 8-row attack-scenario suite. + That is Stage 3, building on this interface. +- Does **not** wire the nightly workflow. Workflow design + is a follow-up once Stage-2 implementation exists. +- Does **not** place the harness in `src/Core/`. Amara's + corrected promotion ladder reserves `src/Core/ + NetworkIntegrity/` for Stage 4; Stage 2 lives in + `src/Experimental/`. +- Does **not** constrain thresholds. The `DetectionThreshold` + field is input, not output; operating points come from + the ROC/PR sweep. +- Does **not** widen any existing test threshold. Per + correction #10, changes to existing stochastic tests + require measured-variance evidence, not new-doc + authority. + +## 12. Composition with other in-flight work + +- **Amara 18th ferry absorb** (PR #337) — this doc + operationalizes §B / §E / §F + corrections #2 / #7 / #9. +- **`docs/research/test-classification.md`** (PR #339) — + the harness produces category-3 statistical smoke tests; + this doc's §9 nightly-workflow design aligns with the + classification's nightly-sweep cadence. +- **`docs/definitions/KSK.md`** (PR #336) — KSK's Oracle + layer consumes the harness's per-run Wilson-bounded + detection rate. Oracle trust posture depends on the + interval width, not just the point estimate. +- **`src/Core/TemporalCoordinationDetection.fs`** — PLV + + new `meanPhaseOffset` (PR #340) populate the + `PLVMagnitude` + `PLVOffset` fields of `MetricVector`. +- **`src/Core/RobustStats.fs`** — existing `robustZScore` + + proposed `Hybrid` mode (PR #333 + this doc §6) + populate every `ZScores.*` field. +- **`src/Core/Graph.fs`** — `largestEigenvalue`, + `modularityScore`, `labelPropagation`, `exclusivity`, + `internalDensity`, `conductance` (PRs #321, #324, #326, + #331) are the primitive surface the harness consumes. +- **Otto-160 parser-tech** (#338 merged) — not directly + relevant to this doc, but the FParsec-first discipline + informs how scenario parameters might be parsed from + scenario-description DSL files later. + +## 13. Promotion path + +- Stage 0 (now): this research doc. +- Stage 1: ADR defining Stage-2 as the next + CartelLab graduation + sign-off on the artifact layout. +- Stage 2.a: `src/Experimental/CartelLab/` directory + created; `CalibrationHarness.fs` skeleton + 1 null-model + generator (`ErdosRenyi`) + `WilsonInterval.fs` + smoke + test; empty attack-injector list; artifact output files + written with an empty scenario set to validate the + layout end-to-end. +- Stage 2.b: remaining 5 null-model generators + `Hybrid` + mode in `RobustStats.robustZScore` + first attack + injector (`ObviousClique`) as a cross-check. +- Stage 3: remaining 7 attack injectors. +- Stage 4: promotion of the composite scoring + Wilson- + interval reporting into `src/Core/NetworkIntegrity/` + once FP bar cleared at Wilson-upper-bound ≤ 0.10 and + detection Wilson-lower-bound ≥ 0.90. +- Stage 5: Aurora / KSK policy-layer integration per + `docs/definitions/KSK.md` advisory-only flow. +- Stage 6: explicit enforcement only with due-process + policy + red-team review (not this doc's concern). + +Each stage is a small graduation on the Otto-105 cadence. + +## 14. Cross-references + +- Amara 18th ferry — `docs/aurora/2026-04-24-amara- + calibration-ci-hardening-deep-research-plus-5-5- + corrections-18th-ferry.md`. +- Test classification — `docs/research/test- + classification.md`. +- KSK definition — `docs/definitions/KSK.md`. +- PR #323 toy cartel detector — Stage 1, input to + this Stage-2 design. +- PR #327 sharder flake BACKLOG — parallel CI-discipline + work. +- Otto-105 graduation cadence memory. +- GOVERNANCE §4 skills-through-skill-creator (not + relevant here but referenced for context).