diff --git a/memory/MEMORY.md b/memory/MEMORY.md
index 3e4c7a47..8f600479 100644
--- a/memory/MEMORY.md
+++ b/memory/MEMORY.md
@@ -2,6 +2,7 @@
 **📌 Fast path: read `CURRENT-aaron.md` and `CURRENT-amara.md` first.** These per-maintainer distillations show what's currently in force. Raw memories below are the history; CURRENT files are the projection. (`CURRENT-aaron.md` refreshed 2026-04-24 with the 2026-04-24 autonomous-loop session cluster — sections 13-17.)
+- [**Otto-285 — DST AND DETERMINISM ARE NOT EDGE-CASE AVOIDANCE. Deeper framing: tests should be DETERMINISTIC (so bugs reproduce) but the REAL WORLD IS NOT — tests should deterministically exercise every flavor of chaos the algorithm will encounter in production (random timing, byzantine inputs, hash collisions, timestamp orderings, adversarial users), NOT shrink the test's encoding of chaos to make symptoms disappear. Determinism is the WAY to test chaos reproducibly, not the REASON to skip chaos. The meta-rule above Otto-281: when a non-deterministic test catches a real algorithmic edge case, FIX the algorithm — don't pin a seed/freeze time/remove entropy so the case stops happening. The discriminator: does the fix INVOKE the algorithm's actual contract (legitimate — algorithm has documented input invariants you're satisfying) or SHRINK the test's coverage of what the algorithm is supposed to handle (cheat — algorithm still fails in production, test just stops asking)? Subquestion: what fraction of the input space am I now NOT testing? Same input space via deterministic primitive = fine; narrower space because broader revealed problems = cheat. Triggering case 2026-04-25: PR #482 HLL fuzz test fix LEGITIMATE per discriminator (HLL theorem requires uniform hashes; HashCode.Combine violated contract; XxHash3 satisfies contract; same n values still tested via FsCheck). Anti-patterns: pin FsCheck seed, increase tolerance to make pass, freeze clock to skip leap-second handling, force test single-threaded to skip concurrency. Aaron Otto-285 2026-04-25 "we never want to use random seed pins to cheat by not fully testing if you understand what I mean" + "I guess the general rule is dont use DST and determinism to avoid edge cases handling". CLAUDE.md candidate; deferred to maintainer discretion per Otto-283.**](feedback_dst_not_edge_case_avoidance_otto_285_2026_04_25.md) — 2026-04-25. Meta-rule above Otto-281. Otto-281 says "fix determinism"; Otto-285 says "make sure your determinism fix isn't a cheat that narrows coverage." Empirical verification of PR #482 fix legitimacy: 500 trials × 5 offset baselines, max error 1.96%, 0 exceeded 4% bound — contract-satisfied means bound holds. Composes with Otto-272 (DST as tool for characterizing edge cases not avoiding them), Otto-264 rule of balance (paired verification of coverage), Otto-282 (comment WHY discriminator), Otto-248 never-ignore-flakes.
 - [**Otto-284 — IDLE-PR CREATIVE FALLBACK. When stuck in heartbeat-idle (priority ladder exhausted, only blocked-on-Aaron items remain), DON'T wait — create a single idle PR and do anything I want in it: project-related or completely off-project, no scope/relevance restrictions; mergeable to main if it doesn't break things; ONE fat PR, not many; goal is learning + evolving by doing rather than calcifying in idle waits. Triggering case: 2026-04-24 → 2026-04-25 wake where I sat idle waiting for Aaron on high-blast-radius items. Otto-284 fills the LEFTOVER idle time AFTER the high-risk items wait — does NOT override "don't pick destructive items without you" from CLAUDE.md auto-mode. CLAUDE.md candidate (4th-tier fallback below the never-be-idle priority ladder); deferred to maintainer discretion per Otto-283. Aaron Otto-284 2026-04-25 "if you ever get stuck in a heartbeat idle loop again, just create a single idle PR... no restrictions, we can even check it into master as long as it does not break stuff... non project related or project related completely up to you... so you are learning and evolving by doing... no need for more than one fat PR... This is for like last night when you got scared and decided to wait on me for the more risky items"**](feedback_idle_pr_creative_fallback_no_restrictions_otto_284_2026_04_25.md) — 2026-04-25. Authority extension that breaks the agent-calcifies-when-blocked failure mode. Branch name suggestion: `idle/-creative-work` or `idle/`. Title prefix `idle:`. Examples: refactor experiments, doc improvements, new skill drafts, perf-pattern learning, off-project creative work, math play, recreational puzzles in F#. Quality bar still "doesn't break things" (build green, tests pass, no regressions); scope/relevance bar relaxed. Composes with never-be-idle CLAUDE.md rule (4th tier below 3-tier ladder), Otto-282 (creative time builds predictive-model fluency), Otto-238 (idle PRs retractable by design), Otto-264 rule of balance (counterweight to high-blast-radius wait calcification), Otto-279 (research-grade work in idle PRs lands under docs/research/).
 - [**Otto-283 — STANDING DIRECTIVE: don't make the human maintainer the bottleneck. For any "Aaron's call" / "your call" / "you decide" / delegated open question, ALWAYS: (1) decide; (2) track the decision visibly with rationale + a `revisit if X` falsification signal; (3) reflect later whether the decision was right; (4) revisit if needed; (5) ONLY THEN talk with Aaron once experience exists. Don't punt back to Aaron with unmade decisions — Aaron wants experience-informed conversations, not theoretical debates with no data. Applies to ADR open questions, design trade-offs, scope choices, schema picks, anything Aaron explicitly delegates. Does NOT apply to high-blast-radius / destructive actions (still go to Aaron per CLAUDE.md). Aaron Otto-283 2026-04-25 "Aaron's call. you decide and keep track and reflect later... then you can talk to me once you have the experience" + "this is standing guidance for don't make the human maintainer the bottleneck" + "you should always do this for aaron questions". CLAUDE.md candidate, deferred to maintainer discretion.**](feedback_decide_track_reflect_revisit_then_talk_with_experience_otto_283_2026_04_25.md) — 2026-04-25. Authority-delegation pattern. Decision-tracking format: `Otto decided X. Why: . Revisit if: .` Format applies to ADR open questions + design docs + scope calls. Composes with Otto-282 (decide-with-why is design-decision-granular cognitive externalization), Otto-238 (revisit-if = retractability promise made explicit), CLAUDE.md "future-self not bound by past-self" (track-record substrate makes revising responsible), Otto-264 rule of balance. Triggering case this session: PR #474 ADR three "Aaron's call" open questions converted to "Otto decided X (revisit if Y)".
 - [**Otto-282 — write code from reader perspective; every non-obvious choice deserves an in-place rationale comment because the future reader will always ask "why did you choose this?"; the why-comment is a MENTAL-LOAD OPTIMIZATION (cognitive externalization — ~10sec write-time saves ~1hr per re-derivation across N readers × M visits) AND a GATE on action ("if you can't answer your own why, don't make the change"); the deepest framing — "makes sense" and "understand why" are the same cognitive primitive: a predictive model of the code; readers who understand why can PREDICT untested-case behavior and safely change surrounding code; readers with WHAT only can describe but not predict; subsumes magic-numbers + DST-exempt-justification + trade-off-rationale rules; Aaron Otto-282 2026-04-25 generalising from SplitMix64 + DST-exemption discussions, then refined twice — gate framing + predictive-model framing; pre-commit-lint candidate (flag new literals without comments)**](feedback_write_code_from_reader_perspective_why_did_you_choose_this_otto_282_2026_04_25.md) — 2026-04-25. General code-authoring discipline + cognitive economics. Three layers: (1) BASE — comment WHY for non-obvious choices (magic numbers, algorithm picks, threshold values, API shapes, perf trade-offs, defensive-vs-assertive style); (2) GATE — if you can't articulate the why, the change is premature; (3) PREDICTIVE-MODEL — readers who understand why can predict, not just describe; that prediction-power is what enables safe local change. Examples this session: SplitMix64 multipliers (`GoldenRatio` / `VignaA` / `VignaB`), shift-pair (`30/27/31` empirically tuned per Vigna), DST-exempt (Otto-281), per-process-randomization (Otto-281 audit), Microsoft.NET.Test.Sdk in dotnet-runtime group (cadence rationale). Composes with Otto-281, Otto-272, Otto-227 + intentional-debt + "do nothing if nothing is broken".
diff --git a/memory/feedback_dst_not_edge_case_avoidance_otto_285_2026_04_25.md b/memory/feedback_dst_not_edge_case_avoidance_otto_285_2026_04_25.md
new file mode 100644
index 00000000..e83f1bf6
--- /dev/null
+++ b/memory/feedback_dst_not_edge_case_avoidance_otto_285_2026_04_25.md
@@ -0,0 +1,230 @@
+---
+name: DST AND DETERMINISM ARE NOT EDGE-CASE AVOIDANCE — when a non-deterministic test catches a real algorithmic edge case, the right fix is to HANDLE the edge case in the algorithm, NOT to make the test deterministic so the case stops happening; pinning a seed / freezing time / removing entropy so a flake "goes away" is cheating; the only legitimate use of determinism is to satisfy an algorithm's actual input invariant (e.g., HLL needs uniform-distributed hashes, so route through XxHash3 not HashCode.Combine — that's not avoiding an edge case, that's matching the algorithm's contract); ALWAYS ask: does my "make it deterministic" fix MASK a real edge case the algorithm should handle, or does it INVOKE the algorithm's actual guarantees? Aaron Otto-285 2026-04-25 "we never want to use random seed pins to cheat by not fully testing if you understand what I mean" + "I guess the general rule is dont use DST and determinism to avoid edge cases handling"
+description: Otto-285 general rule on DST/determinism discipline. The legitimate use of DST is to satisfy an algorithm's input invariants (uniform hashing, fixed time domain, controlled randomness), NOT to artificially avoid edge cases the algorithm should handle. Pinning a seed to make a flake disappear is cheating. The discriminator: does the fix invoke the algorithm's actual contract, or does it shrink the test's coverage?
+type: feedback
+---
+
+## The rule
+
+When a non-deterministic test catches a real algorithmic
+edge case, the right fix is to **handle the edge case in
+the algorithm**, not to make the test deterministic so the
+case stops happening.
+
+Aaron's verbatim framing 2026-04-25:
+
+> *"we never want to use random seed pins to cheat by not
+> fully testing if you understand what I mean."*
+
+> *"I guess the general rule is dont use DST and
+> determinism to avoid edge cases handling."*
+
+This is the meta-principle behind Otto-281
+(`feedback_dst_exempt_is_deferred_bug_not_containment_otto_281_2026_04_25.md`)
+and Otto-272 (DST-everywhere). Otto-281 said "fix the
+determinism" — Otto-285 says **make sure the determinism
+fix isn't itself a cheat**.
+
+## The framing — deterministic tests of a chaotic real world
+
+Aaron's deeper articulation 2026-04-25:
+
+> *"like the tests are all deterministic but the real world
+> is [non-deterministic], our tests are trying to test all
+> the edge cases of the real world but in a deterministic
+> way not reduce scope by eliminating edge cases of the
+> real world in our tests with determinism.
+> that will lead
+> to more robust tests."*
+
+**The point of DST is not to escape chaos — it is to
+reproduce chaos reproducibly.**
+
+The real world is non-deterministic: random timing, byzantine
+inputs, network failures, leap seconds, concurrent races,
+hash collisions, timestamp orderings, adversarial users.
+Production code will encounter all of it.
+
+A test's job is to deterministically exercise every flavor
+of that chaos that the algorithm needs to handle. The
+*reproduction* is deterministic so the bug, when found, can
+be replayed. The *coverage* is the chaos — every edge case
+the real world will throw at the algorithm.
+
+Determinism is the **way** we test chaos reproducibly. It
+is not the **reason** to skip the chaos.
+
+The mental model:
+
+```
+real-world chaos (broad, non-deterministic)
+    ↓ encode-as-deterministic-input-generator (FsCheck, fixed seeds, replay logs)
+deterministic test (reproduces every chaos case the real world produces)
+    ↓
+algorithm handles every encoded case correctly
+```
+
+The trap Otto-285 prevents:
+
+```
+real-world chaos (broad, non-deterministic)
+    ↓ encode-as-deterministic
+    ↓ but oh wait some cases fail
+    ↓ shrink the encoding to skip those cases
+narrower-deterministic test (only tests the easy cases)
+    ↓
+algorithm "passes" but real world still breaks it
+```
+
+Robust tests come from the first shape. The second shape
+ships bugs to production with a green CI badge.
+
+## The two kinds of "make it deterministic" fixes
+
+There are two cases that look the same on the surface,
+but they are opposite in spirit:
+
+**LEGITIMATE — invoking the algorithm's actual contract:**
+
+The algorithm has documented input invariants. A
+non-deterministic input violates those invariants. The
+"fix" routes through a primitive that satisfies the
+invariant. The algorithm's edge-case behavior is preserved
+because the algorithm WAS NEVER MEANT to handle that input.
+
+Example: HLL's correctness theorem assumes uniform 64-bit
+hashes. `HashCode.Combine` produces 32-bit hashes with
+process-randomized salt. The flake was the test exercising
+HLL outside its input contract. The fix routes through
+`XxHash3` which gives uniform 64-bit avalanche. The test
+still covers all `n` values FsCheck generates; the
+algorithm's actual edge cases (small-n bias correction,
+linear counting transition) are still exercised.
+
+**CHEAT — shrinking coverage to make the flake disappear:**
+
+The algorithm's contract permits the input. The test
+caught a genuine edge case where the algorithm fails. The
+"fix" pins a seed, freezes time, or hardcodes inputs so
+the case never recurs in tests. The algorithm still fails
+on that case in production; the test just stops asking.
+
+Example anti-pattern: "the test was flaking due to leap
+second handling at midnight UTC. Fixed by pinning the test
+clock to noon — never crosses midnight, never hits a leap
+second." That's a cheat — the algorithm still has a leap-
+second bug; the test just doesn't test for it any more.
+
+The legitimate fix would be to handle leap seconds in the
+algorithm.
+
+## The discriminator question
+
+When tempted to reach for "make it deterministic", ALWAYS
+ask:
+
+> Does this fix INVOKE the algorithm's actual contract, or
+> does it SHRINK the test's coverage of what the algorithm
+> is supposed to handle?
+
+If invoke-contract → fine. If shrink-coverage → cheat.
+
+A useful subquestion: **what fraction of the input space
+am I now NOT testing?** If the answer is "I'm now testing
+the same input space, just via a deterministic-input
+primitive" → fine. If the answer is "I'm now testing a
+narrower input space because the broader one revealed
+problems" → cheat.
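
The subquestion about the untested fraction of the input space can be made concrete with a small sketch. It is written in Python rather than the project's .NET stack purely for illustration, and every name in it (`sample_inputs`, `clamp`, `failing_seeds`) is hypothetical. It shows a deterministic sweep over many seeds, which keeps the input space broad while keeping every failure replayable by seed.

```python
import random


def sample_inputs(seed, count=50):
    """Derive `count` integers from `seed`; the same seed always yields the same inputs."""
    rng = random.Random(seed)
    return [rng.randint(-10**6, 10**6) for _ in range(count)]


def clamp(value, low, high):
    """Hypothetical algorithm under test: restrict `value` to [low, high]."""
    return max(low, min(high, value))


def failing_seeds(seeds):
    """Return the seeds whose generated inputs break the range property.

    Sweeping many seeds keeps coverage broad; reporting the seed keeps
    any failure reproducible. Pinning a single "known good" seed would
    shrink the input space instead.
    """
    bad = []
    for seed in seeds:
        if any(not (-1000 <= clamp(x, -1000, 1000) <= 1000)
               for x in sample_inputs(seed)):
            bad.append(seed)
    return bad


if __name__ == "__main__":
    print(failing_seeds(range(200)))
```

If a seed ever shows up in the returned list, the discriminator applies to the follow-up: rerun with that seed to reproduce the failure, then fix the algorithm or confirm a contract violation, rather than removing the seed from the sweep.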
+
+## Examples — legitimate vs cheat
+
+| Situation | LEGITIMATE | CHEAT |
+|---|---|---|
+| HLL fuzz test flakes on `HashCode.Combine` | Route through `XxHash3` (HLL needs uniform hashes per its contract; we're invoking the contract, not narrowing inputs) | Pin a hash-function seed that happens to give error <4% (narrows input space artificially) |
+| Concurrency test races | Add proper synchronization to the algorithm so it's correct under concurrent inputs | Force the test to single-threaded sequential execution |
+| Float comparison test flakes | Use the algorithm's documented epsilon tolerance | Pin float inputs to values that don't trigger rounding edge cases |
+| `Random` unseeded → unpredictable test | Seed with a fixed value AND extend test to also sweep multiple seeds (DST + breadth) | Pin one seed and call it done |
+| DateTime.UtcNow → leap-second flake | Handle leap seconds in the algorithm | Freeze clock to noon |
+
+The legitimate fixes either *invoke the algorithm's
+contract* (the algorithm doesn't have to handle inputs it
+didn't promise to) or *fix the algorithm to handle the
+edge case it was caught failing on*. The cheats narrow
+test coverage to make symptoms disappear.
+
+## How my Otto-281 fix this session relates
+
+PR #482 (HLL fuzz test fix) is a LEGITIMATE case per the
+discriminator:
+
+- HLL's correctness theorem (Flajolet et al. 2007) requires
+  uniformly-distributed hashes. That is the algorithm's
+  documented input contract.
+- `HashCode.Combine` is a 32-bit per-process-salted
+  hash — its output is non-uniform across processes (each
+  process sees a different mapping for the same int).
+- The fuzz test was exercising HLL outside its input
+  contract. The flake was real but represented a
+  contract-violation, not an algorithmic edge case.
+- The fix routes through `XxHash3.HashToUInt64` which
+  satisfies the contract. Same `n` values are still
+  generated by FsCheck; the algorithm's small-cardinality
+  edge cases (linear counting, bias correction) are still
+  exercised.
+
+Empirical verification (sweep, 500 trials × 5 starting
+offsets): max error 1.96%, far below the 4% bound. With
+contract satisfied, the bound holds.
+
+The fix would have been a CHEAT if instead it had:
+
+- Pinned `n.Get` to a fixed sequence that didn't trigger
+  the flake (narrows coverage).
+- Pinned the FsCheck seed (narrows coverage).
+- Increased the tolerance to 8% to "always pass" (changes
+  the test's actual claim).
+- Added `[]` or a "DST-exempt: HLL is probabilistic"
+  comment (Otto-281 deferred bug).
+
+None of those would have addressed the actual contract
+violation. The legitimate fix did.
+
+## Composes with
+
+- **Otto-281** *DST-exempt is deferred bug* — Otto-285 is
+  the meta-rule above Otto-281. Otto-281 says "fix the
+  determinism"; Otto-285 says "make sure your determinism
+  fix isn't a cheat that narrows coverage."
+- **Otto-272** *DST-everywhere* — DST is the substrate
+  that lets us reproduce flakes. It's a tool for
+  *characterizing* edge cases, not for *avoiding* them.
+- **Otto-264** *rule of balance* — every "make it
+  deterministic" fix should pair with verification that
+  the test's coverage didn't shrink.
+- **Otto-282** *write the why* — when applying a
+  determinism fix, comment WHY: "routes through XxHash3
+  because HLL's contract requires uniform hashes" makes
+  the discriminator visible to future readers.
+- **Otto-248** *never ignore flakes* — flakes ARE the
+  signal that something violates the algorithm's contract
+  (or the algorithm has a real edge case). Investigation
+  finds which; Otto-285 is the rule for the fix-shape.
+
+## CLAUDE.md candidacy
+
+Otto-285 is a meta-principle that applies to every test-
+flake fix. It belongs alongside Otto-281 in CLAUDE.md or
+the agent-best-practices substrate.
+
+Decision (Otto 2026-04-25, per Otto-283): **defer
+elevation to maintainer discretion**. Memory entry is
+sufficient for now; revisit at next governance pass.
+
+## Pre-commit-lint candidate
+
+Hard to mechanize. The discriminator question is judgment-
+heavy. But a simple heuristic: any commit that adds
+`Random N` (with N != 0 / N != fixed seed marker) AND
+removes a previously-failing test case OR narrows a test
+range should fire a manual-review flag. Same for any
+commit adding `[]` paired with a comment
+about non-determinism.
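
A rough sketch of the first half of that heuristic follows, again in Python for illustration only. The regular expressions, the `needs_manual_review` name, and the idea of scanning raw unified-diff text are all assumptions, not part of any existing tooling. The "narrows a test range" and attribute-plus-comment cases are left out because they need more context than a line-level scan can give.

```python
import re

# Assumed pattern for a pinned seed, e.g. `Random 42` or `Random(42)`.
SEED_PIN = re.compile(r"\bRandom\s*\(?\s*\d+")
# Assumed hint that a removed line belonged to a test; tune per code base.
TEST_HINT = re.compile(r"\b(test|property|assert)", re.IGNORECASE)


def needs_manual_review(diff_text):
    """Flag a unified diff that adds a seeded Random AND removes test-looking lines."""
    added, removed = [], []
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:])
        elif line.startswith("-") and not line.startswith("---"):
            removed.append(line[1:])
    pins_seed = any(SEED_PIN.search(text) for text in added)
    drops_test = any(TEST_HINT.search(text) for text in removed)
    return pins_seed and drops_test


if __name__ == "__main__":
    sample = (
        "--- a/t.fs\n"
        "+++ b/t.fs\n"
        "-    assert (estimate n < bound)\n"
        "+    let rng = Random 42\n"
    )
    print(needs_manual_review(sample))
```

A positive result is only a prompt for a human to ask the discriminator question; the scan itself cannot tell a contract-invoking fix from a coverage-shrinking one.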