139 changes: 139 additions & 0 deletions docs/BACKLOG.md
@@ -708,6 +708,145 @@ within each priority tier.
side (Window.fs wiring pending). Target: measured numbers in
`docs/BENCHMARKS.md` by end of round 20.

- [ ] **Itron-lineage signal-processing → factory-observability
mapping**. Second-wave Itron disclosure
(auto-loop-34, captured in
`memory/user_aaron_itron_pki_supply_chain_secure_boot_background.md`)
named a specific portfolio of published signal-processing /
anomaly-detection techniques the maintainer worked on at
director-level IoT engineering scope. Each technique maps
concretely onto unfinished factory surfaces. The ARC3-DORA
PNNL-HITL composition that landed in
`docs/research/arc3-dora-benchmark.md` (auto-loop-35) is the
first executed mapping; the remainder are research-doc
candidates. **Scope: produce a one-page research doc per
mapping pair below; do NOT implement yet — first occurrence
per pair is prior-art cite + composition sketch only.**

Mapping pairs (each a candidate research doc
under `docs/research/itron-lineage/`):

1. **PNNL HITL expert-derived confidence →
agent-output-under-uncertainty measurement substrate (the layer
between agent output and DORA grade, NOT DORA itself;
DORA stays objective devops-delivery metrics).** LANDED
auto-loop-35 in `docs/research/arc3-dora-benchmark.md`
§Prior-art lineage. Occurrence-3 of wink-validation.
2. **Disaggregation discipline → ZSet retraction-native
operator algebra.** ZSet preserves per-element multiplicity;
aggregation loses it. Industrial-scale disaggregation
(DriveNets network-disaggregation) validates the
architectural direction the Escro maintain-every-dep +
microkernel-endpoint directive already committed to.
Composition sketch: aggregate-view operators as
derivations, disaggregated-view as primitive.
3. **PRIDES (Power Rising and Descending Signature,
low-overhead binary) → per-commit alignment-clause
signature.** Every commit produces a binary
rising/falling pattern against the 20 ALIGNMENT clauses
(HC-1..HC-7 / SD-1..SD-8 / DIR-1..DIR-5). PRIDES-style
compact signature is IoT-memory-compatible — usable by a
resource-constrained alignment-observability sidecar.
4. **Wavelet-GAT (Graph Attention Network over wavelet
decomposition) → clause-graph anomaly detection.**
Clause-commit graph attends to suspicious edges; wavelet
decomposes low+high-freq components of the compliance
time-series. 99% published accuracy target in grid
literature; portable signal.
5. **GESL (Grid Event Signature Library, 900+ types) →
factory-event signature library.** Curate a library of
named alignment-anomaly types (clause drift, scope creep,
retraction-not-restored, operator-misuse) matchable
against commit-stream. Complements `docs/WONT-DO.md` +
`docs/TECH-DEBT.md` as positive/negative
anomaly-signature catalog.
6. **Context-Agnostic Learning (SCADA) → universal operator
algebra calibration.** SCADA's universal context-agnostic
values that work across network locations map to Zeta's
design goal that retraction-native operators compose at
any point in the pipeline. Composition sketch: anomaly
signals normalised against operator-algebra axioms rather
than per-module conventions.
7. **Physics-Informed Generators → operator-algebra-informed
code generators.** Physics priors constrain ML-generator
output; Zeta's operator-algebra axioms can constrain
Copilot / Codex / Claude generators. This IS the factory's
well-defined-Occam's discipline (Rodney's Razor: prefer
the simplest generator output that still satisfies the
operator-algebra invariants — a constraint-narrowing
prior over generator hypothesis space) at the
code-generation layer.
8. **MUSIC spectral (SINR under noise) → clause-compliance
spectral decomposition.** Commit-cadence, round-close
cadence, tick-cadence make alignment time-series noisy;
MUSIC extracts dominant frequencies (ambient drift vs.
directed work).
9. **FFT foundation → time-series instruments across the
factory.** Any series we hold (commit-cadence,
clause-compliance, tick-duration, compoundings-per-tick)
has an FFT view. Cheapest, most portable. Likely first
instrument to land if instrumentation work starts.
10. **Micro-Doppler (µD) / VWCD → commit-vibration signature
extraction.** Which files vibrate together under which
work-session rhythm. Adjacent to existing pipeline-churn
analysis.
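
The pair-#2 composition sketch is small enough to show directly. A
minimal Python sketch (hypothetical names, not Zeta's actual API),
assuming signed multiplicities in the DBSP style: the disaggregated
ZSet is the primitive, retraction is a negative-weight insert, and
the aggregate view is a derivation that discards exactly the
per-element information aggregation loses.

```python
from collections import Counter

class ZSet:
    """Multiset with signed multiplicities: +k inserts, -k retractions."""
    def __init__(self, items=None):
        self.weights = Counter(items or {})

    def insert(self, item, w=1):
        self.weights[item] += w
        if self.weights[item] == 0:      # keep the representation canonical
            del self.weights[item]

    def retract(self, item, w=1):
        self.insert(item, -w)            # retraction is just a negative insert

    def __add__(self, other):
        out = ZSet()
        for k in self.weights.keys() | other.weights.keys():
            w = self.weights[k] + other.weights[k]
            if w:
                out.weights[k] = w
        return out

def count_view(z):
    """Aggregate view, derived from the primitive disaggregated ZSet."""
    return sum(z.weights.values())
```

Retracting one "a" below leaves the other visible in the primitive;
only the derived count collapses the distinction.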
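Pair #3's memory claim can be checked arithmetically. A hedged
sketch (clause ordering and score shape are invented for
illustration): a PRIDES-style rising/falling signature over the 20
ALIGNMENT clauses packs into two 20-bit masks, five bytes of state,
which is comfortably within a resource-constrained sidecar's budget.

```python
def clause_signature(prev_scores, cur_scores):
    """Pack per-commit clause movement into two 20-bit masks:
    bit i of `rising` is set when clause i improved since the
    previous commit, bit i of `falling` when it regressed."""
    assert len(prev_scores) == len(cur_scores) == 20
    rising = falling = 0
    for i, (p, c) in enumerate(zip(prev_scores, cur_scores)):
        if c > p:
            rising |= 1 << i
        elif c < p:
            falling |= 1 << i
    return rising, falling

prev = [0.8] * 20
cur = list(prev)
cur[0] = 0.9    # first clause (HC-1, say) improves
cur[7] = 0.6    # eighth clause (SD-1, say) regresses
```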
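Pair #9 is cheap enough to demonstrate end-to-end. A sketch under
invented data (the commit-cadence series and its 8-tick rhythm are
synthetic; real series would come from tick-history): the FFT view
recovers a dominant work-session period from a noisy cadence signal.

```python
import numpy as np

def dominant_period(series, dt=1.0):
    """Dominant period (in ticks) of a detrended factory time-series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                       # remove DC so bin 0 cannot win
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=dt)
    k = 1 + int(np.argmax(spectrum[1:]))   # skip the zero-frequency bin
    return 1.0 / freqs[k]

# synthetic commit-cadence: baseline 5 commits/tick, 8-tick rhythm, noise
rng = np.random.default_rng(0)
ticks = np.arange(256)
cadence = 5 + 2 * np.sin(2 * np.pi * ticks / 8) \
            + 0.3 * rng.standard_normal(256)
```

`dominant_period(cadence)` recovers the planted 8-tick rhythm; the
same view applies unchanged to clause-compliance, tick-duration, or
compoundings-per-tick series.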

**Why this is one row, not 10.** Per the factory's
occurrence-1 discipline, each pair is research-doc-level
first-pass only; no implementation commits without
occurrence-2+ calibration. One row tracks the portfolio so
the promotion threshold is visible and the composition
pattern is explicit. The research docs land as a family
under `docs/research/itron-lineage/` once started.

**Why research-project-tier.** These are measurement /
observability instruments, not shipped library surface.
They unlock ALIGNMENT measurability (Zeta's primary research
focus per `docs/ALIGNMENT.md`) by giving specific, published,
validated signal-extraction techniques. They do NOT need to
be implemented to be valuable — naming + citing is an
occurrence-1 contribution.

**Effort.** Research doc per pair: 1-2 ticks of speculative
work each. Pair #1 already LANDED. Pairs #2, #3, #5, #9
likely strongest next candidates (highest composition-value
with existing factory surfaces). Pairs #6, #7 are architectural
claims that compose with well-defined-Occam's; can land as
short citation sections in existing docs rather than new
research docs. Pairs #4, #8, #10 require more background
before they're tractable.

**Composes with.**
- `docs/ALIGNMENT.md` — measurable alignment primary research
focus; every pair above is a measurement instrument for
a clause-compliance signal.
- `docs/research/arc3-dora-benchmark.md` — cognition-layer
measurement substrate where pair #1 landed.
- `memory/user_aaron_itron_pki_supply_chain_secure_boot_background.md`
— verbatim maintainer disclosure + calibration context.
- `memory/feedback_external_signal_confirms_internal_insight_second_occurrence_discipline_2026_04_22.md`
— the occurrence-discipline that gates promotion from
research-doc to ADR / BP-NN / shipped instrument.
- `docs/TECH-RADAR.md` — candidate destination for pair #9
(FFT) once a first instrument lands.
- Escro maintain-every-dep → microkernel-endpoint directive
— disaggregation (pair #2) is the industrial-scale pattern
Aaron lived through that the directive already follows.

**What this is NOT.**
- NOT a commitment to implement any pair (occurrence-1
discipline: cite prior-art + compose sketch, wait for
occurrence-2 to promote).
- NOT a cap on pairs — additional pairs may emerge; this row
tracks the portfolio without closing it.
- NOT a reframe of ALIGNMENT.md clauses — the clauses stay
stable; these are instruments for measuring them.
- NOT signal-to-noise-ratio-chasing (MUSIC's SINR utility is
literal, not the measurement philosophy).
- NOT an Itron-specific dependency — all named techniques
are publicly published; maintainer's prior art accelerates
composition understanding but does not constrain adoption.

## P1 — SQL frontend + query surface (round-33 vision, v1 scope)

- [ ] **Shared query IR that compiles to the DBSP operator
157 changes: 157 additions & 0 deletions docs/research/arc3-dora-benchmark.md
@@ -255,6 +255,156 @@ not a metaphor.
tier where new research-level moves originate; stepdown
measures how much of that work survives at lower capacity.

## Prior-art lineage — PNNL HITL / Itron signal processing

**Added 2026-04-22 auto-loop-35.** The maintainer named the
connection explicitly: PNNL's "expert-derived confidence"
scoring framework (Grid Event Signature Library, ~900
signature types, human-in-the-loop confidence-weighting
layered on ML output) is a published analog of the factory's multi-substrate
triangulation + reviewer-roster + maintainer-echo pattern that
this benchmark presumes as the measurement substrate sitting
*between the agent output and the DORA grade* — distinct from
the DORA metrics themselves.

**Separation of concerns.** DORA (deploy frequency, lead time
for changes, change failure rate, mean time to restore service)
is a DevOps-delivery benchmark family from the Google/Accelerate
research line; metrics are objectively measurable from CI/CD
and incident-tracking data. ARC-3 is Chollet's cognition /
abstraction-and-reasoning benchmark. This factory's benchmark
is **DORA (the objective)** framed as the maintainer's personal
ARC-3-equivalent (the class-of-benchmark framing: frontier
reasoning under compounding tests with no instructions). The
document filename retains `arc3-dora` for continuity, but the
layering is:

- **DORA metrics**: objective delivery measurements.
Not HITL-modulated. Deployment frequency counts deployments
to production; change failure rate is the ratio of failed
deployments over total deployments; no confidence weighting
applies. (Per the canonical Google/Accelerate DORA
definitions — distinct from commit / raw-incident counts,
which would skew cross-run comparison under different batch
sizes.)
- **Agent-output-under-uncertainty layer**: the noisy ML / agent
output that is being graded against DORA. *This* is where
HITL expert-derived confidence applies — calibrating which
agent outputs are trustworthy enough to ship, exactly as
PNNL HITL calibrates ML classifier output on PMU/FDR
waveforms before triggering grid alarms.
- **ARC-3 framing**: the class-of-benchmark description — no
instructions, every lesson compounds, forgotten lessons =
regression. This framing informs how the benchmark is
*interpreted* (a frontier-capability test) but does not add
a separate measurement.
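
The layering above is checkable in a few lines. A minimal sketch of
the objective DORA layer (record shape and field names are invented;
a real pipeline reads these from CI/CD and incident tooling):
deployment frequency and change failure rate fall straight out of
deployment records, with no confidence weighting anywhere in the
computation.

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    day: int       # day the change reached production
    failed: bool   # caused a degradation that needed remediation

def dora_delivery_metrics(deploys, window_days):
    """Deployment frequency and change failure rate per the canonical
    Google/Accelerate definitions: a count of production deployments,
    and the ratio of failed deployments over total deployments."""
    n = len(deploys)
    frequency = n / window_days                          # deploys per day
    failure_rate = sum(d.failed for d in deploys) / n if n else 0.0
    return frequency, failure_rate

week = [Deployment(1, False), Deployment(2, True),
        Deployment(2, False), Deployment(5, False)]
```

The HITL confidence layer sits upstream of this function: it decides
which agent outputs become deployments at all; it never reweights
the ratio.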

**Why DORA-in-production qualifies as the maintainer's
personal-ARC3-equivalent.** Maintainer mid-tick clarification
(auto-loop-35): *"jsut cause i said that's my ARC3"* +
*"yeah casue running a production pipeline is hard as fuck"*.
The framing is not hyperbole — running a production pipeline
under real constraints (incident response with real users
affected, lead time measured when consequences are real,
change-failure-rate counted against real SLOs, MTTR under
live pressure) is genuinely a compounding-under-real-stakes
test in the ARC-3 class shape. The benchmark remains DORA;
the ARC-3 label is the maintainer's way of saying "this is
my frontier-test," not a second measurement axis.

**Operational definition of ARC-3-class (maintainer, auto-loop-35):**
*"ARC3 = hard problem that is [trying to be made] continuously
testable even though there is 0 formal definition"*. Three
criteria — all three must hold:

1. **Hard** — frontier-capability test, compounding, not
solvable by instruction-following alone.
2. **Continuously testable** — produces a stream of
observations (telemetry, benchmark runs, per-commit
signals) rather than a one-shot pass/fail.
3. **No formal definition** — operationally-grounded
(benchmark, telemetry, empirical) rather than
theoretically-specified. The absence of a formal
definition is a *feature* of the class: the problem
resists formalisation, but the measurement pipeline
still produces defensible signal.

By this test, DORA-in-production qualifies cleanly — deploy
frequency / lead time / CFR / MTTR are operationally
well-defined *as measurements*, but "running a production
pipeline well" has no closed-form theoretical definition.

**Other Zeta factory surfaces that meet the ARC-3-class test**
(flagged here; not yet treated as cartridges):

- **Factory autonomy under autonomous-loop substrate** —
hard (tick-must-never-stop under genuine work-queue
selection); continuously testable (tick-history,
round-history, per-commit alignment signals); no formal
definition of "autonomous factory operating at target
capability."
- **ALIGNMENT.md measurable primary-research-focus** — hard
(alignment has no closed-form specification); continuously
testable (per-commit HC-1..HC-7 / SD-1..SD-8 / DIR-1..DIR-5
signals, time-series); no formal definition of "aligned
AI."
- **Zero-to-production in 3-4 hours on ServiceTitan demo** —
hard (full-stack capability compounded under time
pressure); continuously testable (rounds of attempts,
per-domain DORA); no formal definition of
"production-ready demo."

Each matches the three-criteria ARC-3-class shape. Treating
them all as ARC-3-class gives the factory a consistent lens
for frontier-test work and reuses the same measurement
substrate (HITL expert-derived confidence over agent output,
graded against the operational metric for the specific
domain).

The shape is the same across both:

| PNNL HITL (grid) | Zeta ARC3-DORA (factory) |
| ----------------------------------------- | -------------------------------------------- |
| ML classifier on noisy PMU/FDR waveform | Agent output under uncertainty (code / spec) |
| Grid Signature Library (GESL, 900+ types) | Alignment-clause + operator-algebra library |
| Expert score layered on ML confidence | Maintainer echo + reviewer roster confidence |
| Improves accuracy beyond ML-alone | Triangulation beats single-substrate depth |

**Occurrence classification.** This is occurrence-3 of the
*external-signal-confirms-internal-insight* recurrence tracked
in `memory/feedback_external_signal_confirms_internal_insight_second_occurrence_discipline_2026_04_22.md`:

1. Muratori 5-pattern → Zeta operator algebra (YouTube wink,
auto-loop-24).
2. Three-substrate triangulation (Claude + Codex + Gemini)
+ Aaron exact-phrasing echo "now you see what i see"
(auto-loop-25/26).
3. PNNL HITL expert-derived confidence → factory's
multi-reviewer + maintainer-echo calibration
(auto-loop-34/35, disclosed in Itron second-wave cascade).

Per the external-signal discipline, occurrence-3+ is
Architect-level promotion material. The promotion surface
for this specific pattern is ARC3-DORA: the benchmark's
cognition-layer measurement substrate inherits the PNNL HITL
shape, not as a derivation but as cited prior-art confirming
the substrate is well-formed.

**What this changes in the benchmark spec.** Nothing about the
shape changes; the composition-with-HITL language makes the
measurement substrate *citable* rather than internally-coined.
ARC3-DORA's DORA-side delivery metrics remain carrier-channel;
the cognition-side capability signature remains
stepdown-under-capability-reduction; the multi-substrate / maintainer-echo /
reviewer-roster calibration layer now has a published sibling.

**Bounded promotion.** HITL-citation applies to the calibration
substrate, not to ARC3-DORA's task-completion criterion. The
falsifier (humans-in-production-environments beat agents on
DORA) stays task-completion-measured, not confidence-weighted.
Confidence-weighting is a measurement instrument; it does not
lower the task bar.

## Reference patterns

- Auto-memory ARC3 entry — full prose derivation of this shape
@@ -276,3 +426,10 @@ not a metaphor.
- `docs/AUTONOMOUS-LOOP.md` — never-be-idle ladder; Level-3
generative improvements are the anti-livelock brace referenced
in component 2
- `memory/user_aaron_itron_pki_supply_chain_secure_boot_background.md`
— second-wave disclosure cascade naming PNNL HITL
"expert-derived confidence" as published prior art for the
cognition-layer measurement substrate cited above
- `memory/feedback_external_signal_confirms_internal_insight_second_occurrence_discipline_2026_04_22.md`
— the occurrence-discipline used to classify the HITL
connection as occurrence-3 of the wink-validation recurrence