
docs: calibration-harness Stage-2 design — Amara 18th-ferry §B/§F + corrections #2/#7/#9 #342

Merged
AceHack merged 1 commit into main from docs/calibration-harness-stage2-design
Apr 24, 2026

@AceHack AceHack commented Apr 24, 2026

Summary

Research-grade design doc for Stage-2 of Amara's corrected promotion ladder. Specifies the next-rung deliverable (the calibration harness) so that its conventions are already committed when implementation starts.

Key design decisions

Scope

18th-ferry operationalization status

| # | Correction | Status |
|---|---|---|
| 1, 10 | Test classification policy | Shipped (#339) |
| 2 | Wilson intervals | Design specified (this doc); impl waits Stage 2.a |
| 4 | Exclusivity primitive | Shipped (#331) |
| 5 | Modularity relational | Shipped (#324) |
| 6 | PLV phase-offset | Shipped (#340) |
| 7 | MAD=0 fallback | Design specified (this doc); impl waits Stage 2.a |
| 9 | Artifact layout | Design specified (this doc) |
| 3 | CoordinationRiskScore rename | Already canonical in code |
| 8 | Stronger sources | Reporting discipline |

7 of 10 18th-ferry corrections now have either shipped code or committed design.

Test plan

  • Markdownlint clean locally.
  • Single new file; no surface impact.

🤖 Generated with Claude Code

…-ferry §B + §F + corrections #2 #7 #9

Research-grade design doc for the Stage-2 rung of Amara's
corrected promotion ladder. Specifies: (a) placement under
src/Experimental/CartelLab/ (not src/Core/ — that's Stage 4);
(b) MetricVector type with PLV magnitude AND offset split
(correction #6); (c) INullModelGenerator interface +
Preserves/Avoids table columns; (d) IAttackInjector
forward-looking interface (Stage 3); (e) Wilson-interval
reporting contract with {successes, trials, lowerBound,
upperBound} schema (correction #2 — no more "~95% CI ±5%"
handwave); (f) RobustZScoreMode with Hybrid fallback
(correction #7 — percentile-rank when MAD < epsilon);
(g) explicit artifact-output layout under artifacts/
coordination-risk/ with five files + run-manifest.json
(correction #9).
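
Item (b) above splits PLV into a magnitude and an offset. The following is a minimal Python sketch of why the two are reported separately, assuming the standard phase-locking-value definition (the modulus and argument of the mean unit phasor of the phase differences). The function name `plv_magnitude_and_offset` is hypothetical and is not the F# `MetricVector` type from the design doc.

```python
import cmath


def plv_magnitude_and_offset(phases_a, phases_b):
    """Return (magnitude, offset) for two phase series given in radians.

    magnitude: |mean(exp(i * (a - b)))|, in [0, 1]; 1 means a constant phase difference.
    offset:    arg(mean(exp(i * (a - b)))), in [-pi, pi]; the typical lag of a behind b.
    """
    if len(phases_a) != len(phases_b) or not phases_a:
        raise ValueError("need two equal-length, non-empty phase series")
    # Average the unit phasors of the pairwise phase differences.
    mean_phasor = sum(
        cmath.exp(1j * (a - b)) for a, b in zip(phases_a, phases_b)
    ) / len(phases_a)
    return abs(mean_phasor), cmath.phase(mean_phasor)
```

Two series locked in phase and two series locked a quarter cycle apart both yield a magnitude of 1; only the offset tells them apart, which is the information a magnitude-only metric would discard.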

6-stage promotion path (0 doc / 1 ADR / 2.a skeleton /
2.b full null-models + first attack / 3 attack suite /
4 Core/NetworkIntegrity / 5 Aurora-KSK) matches Amara's
corrected ladder and Otto-105 cadence.

Doc-only change; no code, no tests, no workflow, no
BACKLOG tail touch (avoids positional-conflict pattern
that cost #334#341 re-file this session).

This is the 7th of 10 18th-ferry operationalizations:
- #1/#10 test-classification (#339)
- #2 Wilson-interval design specified (this doc)
- #6 PLV phase-offset shipped (#340)
- #7 MAD=0 Hybrid mode specified (this doc)
- #9 artifact layout specified (this doc)
- #4 exclusivity already shipped (#331)
- #5 modularity relational already shipped (#324)

Remaining: Wilson-interval IMPLEMENTATION (waits on #323 +
Stage 2.a), MAD=0 Hybrid IMPLEMENTATION (waits on #333 +
Stage 2.a), conductance-sign doc (waits on #331), Stage-2.a
skeleton itself.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 09:07
@AceHack AceHack enabled auto-merge (squash) April 24, 2026 09:07

@AceHack AceHack merged commit 96f9a74 into main Apr 24, 2026
11 of 12 checks passed
@AceHack AceHack deleted the docs/calibration-harness-stage2-design branch April 24, 2026 09:08

Copilot AI left a comment


Pull request overview

Adds a research-grade design document specifying the planned Stage-2 “calibration harness” for coordination-risk / cartel detection work, with pre-committed conventions for metrics, confidence-interval reporting, robust z-score fallback, and artifact outputs.

Changes:

  • Introduces a Stage-2 harness design covering module placement (src/Experimental/CartelLab/), core types/interfaces, and invocation contract.
  • Specifies statistical reporting discipline (Wilson intervals) and robust z-score modes (including MAD=0 fallback via percentile rank / hybrid).
  • Defines a fixed artifact output schema under artifacts/coordination-risk/ for downstream calibration/ROC/PR tooling.
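
The MAD=0 fallback mentioned in the list above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the harness's F# `RobustZScoreMode`: the function name `hybrid_robust_score`, the default `epsilon`, the returned `(mode, value)` tuple, and the use of the 1.4826 normal-consistency constant are all choices made for this example.

```python
import statistics

# Scales MAD to a standard-deviation estimate for normally distributed data.
MAD_TO_SIGMA = 1.4826


def hybrid_robust_score(x, baseline, epsilon=1e-12):
    """Score x against a baseline sample.

    Returns ("robust_z", z) when the baseline MAD is usable, otherwise
    ("percentile_rank", r) where r is the fraction of baseline values <= x.
    """
    med = statistics.median(baseline)
    mad = statistics.median([abs(v - med) for v in baseline])
    if mad >= epsilon:
        return ("robust_z", (x - med) / (MAD_TO_SIGMA * mad))
    # Degenerate baseline: more than half the values equal the median, so a
    # MAD-based z-score would divide by (near) zero. Fall back to rank.
    rank = sum(1 for v in baseline if v <= x) / len(baseline)
    return ("percentile_rank", rank)
```

Returning the mode alongside the value keeps the two scales from being mixed silently: a percentile rank lives in [0, 1] while a robust z-score is unbounded, so a consumer needs to know which one it received.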

Comment on lines +444 to +447
- **`docs/definitions/KSK.md`** (PR #336) — KSK's Oracle
layer consumes the harness's per-run Wilson-bounded
detection rate. Oracle trust posture depends on the
interval width, not just the point estimate.
Comment on lines +491 to +493
- Amara 18th ferry — `docs/aurora/2026-04-24-amara-
calibration-ci-hardening-deep-research-plus-5-5-
corrections-18th-ferry.md`.
Comment on lines +3 to +9
**Status:** research-grade proposal (pre-v1). Origin: Amara
18th courier ferry, Part 1 §B ("Statistical Calibration Plan"),
§F PR #2 ("CoordinationRisk calibration harness"), and Part 2
corrections #2 (Wilson intervals), #7 (MAD=0 fallback), and #9
(explicit artifact output). This doc specifies the Stage-2
rung of the corrected promotion ladder. Author: architect
review. Scope: design-only; no code, no tests, no workflow
WilsonInterval.fs ← Wilson score CL helper
tests/Tests.FSharp/CartelLab/
CalibrationHarness.Tests.fs ← seeded smoke tests
artifacts/coordination-risk/ ← .gitignored; output of runs
```
val run : config: HarnessConfig -> Async<unit>
```

The runner emits all five artifact files on completion.
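
One way a runner could emit its artifacts together with the `run-manifest.json` mentioned in the commit message is sketched below in Python. Everything except the `run-manifest.json` file name is an assumption made for illustration: the function name `write_run`, the manifest fields (`seed`, `artifacts`, `file`, `bytes`, `sha256`), and the idea of recording a checksum per file. The names of the five artifact files are not given in this excerpt, so the caller supplies them.

```python
import hashlib
import json
from pathlib import Path


def write_run(out_dir, seed, artifacts):
    """Write each artifact, then a manifest describing what was written.

    artifacts: dict mapping a file name to its bytes payload.
    Returns the manifest as a dict.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    entries = []
    # Sort by name so the manifest is identical across runs with equal inputs.
    for name, payload in sorted(artifacts.items()):
        (out / name).write_bytes(payload)
        entries.append(
            {
                "file": name,
                "bytes": len(payload),
                "sha256": hashlib.sha256(payload).hexdigest(),
            }
        )
    manifest = {"seed": seed, "artifacts": entries}
    # The manifest is written last, so its presence signals a completed run.
    (out / "run-manifest.json").write_text(
        json.dumps(manifest, indent=2, sort_keys=True)
    )
    return manifest
```

Writing the manifest after the artifacts means downstream tooling can treat a directory without `run-manifest.json` as an incomplete run.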
AceHack added a commit that referenced this pull request Apr 24, 2026
…rections (#344)

Dedicated absorb of Amara's 19th courier ferry per CC-002
close-on-existing discipline. Scheduled Otto-164 → executed
Otto-165, following 7-ferry precedent (PRs #196 / #211 /
#219 / #221 / #235 / #245 / #259 / #330 / #337).

Two-part ferry: Part 1 deep-research DST audit (12
sections: rulebook, 12-row entropy scan, dependency audit,
7-row simulation-surface coverage, retry audit, CI
determinism, seed discipline, Cartel-Lab DST readiness,
KSK/Aurora DST readiness, state-of-the-art comparison,
10-row PR roadmap, what-not-to-claim caveats; Mermaid CI
diagram + Gantt timeline). Part 2 Amara's own 5.5-Thinking
correction pass (7 required corrections, per-area grade
table with B- overall, revised 6-PR roadmap with titles
locked, DST-held + FoundationDB-grade acceptance criteria,
copy-paste Kenji summary).

Key findings:
- DST grade: B- (strong architecture, partial impl)
- Blockers: DiskBackingStore bypasses simulation (D-grade
  filesystem simulation), no ISimulationDriver, Task.Run
  ambient ThreadPool risk, no seed artifacts / no swarm
  harness
- 4 of 12 Part-1 sections already align with shipped
  substrate:
  - §6 test classification → PR #339
  - §7 artifact layout → PR #342 design
  - §8 Cartel-Lab stage discipline → PRs #330/#337/#342
  - §9 KSK advisory-only → PR #336 + Otto-140..145 memory

6-PR revised roadmap queued as graduation candidates:
1. DST scanner + accepted-boundary registry (new tool +
   policy docs + workflow)
2. Seed protocol + CI artifacts
3. Sharder reproduction (NOT widen) — reinforces 18th #10
4. ISimulationDriver + VTS promotion to core
5. Simulated filesystem (DiskBackingStore rewrite)
6. Cartel-Lab DST calibration (aligns with #342 design)

Plus: push-with-retry.sh retry-audit finding; DST-held +
FDB-grade criteria lock.

GOVERNANCE §33 four-field header (Scope / Attribution /
Operational status / Non-fusion disclaimer). Amara verdict
preserved: "strong draft / not canonical yet."

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request Apr 24, 2026
…mara 19th-ferry correction #6) (#346)

Research-grade criteria doc locking two acceptance bars:

1. DST-held — minimum: 6 items (seeds committed, failing
   tests emit seed+params, bit-for-bit local-vs-CI
   reproducibility, broad sweeps nightly-not-gating,
   zero unreviewed entropy hits in main-path, boundaries
   either simulated or explicitly accepted).
2. FoundationDB-grade DST candidate — aspirational: 8
   surfaces (simulated FS, simulated network,
   deterministic task scheduler, fault injection/buggify,
   swarm runner, replay artifact storage, failure
   minimization/shrinking, end-to-end scenario from one
   seed).
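
The "failing tests emit seed+params" item in the DST-held list can be illustrated with a small Python sketch. It assumes a test is a function that takes a seeded random generator plus keyword parameters; the helper name `run_seeded` and the message format are hypothetical and do not describe the project's actual test infrastructure.

```python
import random


def run_seeded(check, seed, params):
    """Run check(rng, **params) with a generator derived only from seed.

    If the check raises AssertionError, re-raise it with the seed and
    parameters appended so the failure can be replayed exactly.
    """
    rng = random.Random(seed)  # private generator; no ambient global state
    try:
        return check(rng, **params)
    except AssertionError as exc:
        raise AssertionError(
            f"{exc} [replay with seed={seed} params={params!r}]"
        ) from exc
```

Because the check receives its own `random.Random(seed)` instead of the module-level generator, two runs with the same seed and parameters draw the same sequence, which is the precondition for reproducing a failure locally from a CI log.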

Maps 19th-ferry revised-roadmap PRs to which criteria
items each addresses. Captures Amara's per-area grade
table (overall B-) as "Amara's assessment, not factory-
certified."

Explicit promotion path: doc stays research-grade until
PR 1 of the 19th-ferry revised roadmap lands an ADR
promoting the DST-held bar to factory discipline; at
that point criteria migrate to docs/DST-COMPLIANCE.md
top-level.

No graduation claims DST-held today; graduations reference
this doc as target without self-certification.

Composes with test-classification.md (PR #339; supports
items 1+2+4), calibration-harness-stage2-design.md (PR
#342; artifact schema supports item 2), Amara 19th ferry
(PR #344 absorb; source of criteria).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>