diff --git a/docs/BACKLOG.md b/docs/BACKLOG.md index 5ad0879e27..836286b21c 100644 --- a/docs/BACKLOG.md +++ b/docs/BACKLOG.md @@ -680,6 +680,7 @@ are closed (status: closed in frontmatter)._ - [ ] **[B-0715](backlog/P2/B-0715-soraya-round52-istimeinvariant-axiom-registry-gap-dbsp-chain-rule-2026-05-23.md)** Soraya round-52 hand-off — register `IsTimeInvariant` axiom in verification-registry (Class 1/2 statement+paper-drift on a load-bearing axiom that BOTH registered DBSP theorems depend on) - [ ] **[B-0717](backlog/P2/B-0717-soraya-round57-lsm-spine-registry-and-bp16-cross-check-pair-2026-05-24.md)** Soraya round-57 hand-off — LSM Spine cluster registry-rows + BP-16 cross-check pair (SpineAsyncProtocol candidate-P0 TLA+/code-drift gap) - [ ] **[B-0721](backlog/P2/B-0721-backlog-md-generated-index-on-schedule-not-per-pr-2026-05-24.md)** Move docs/BACKLOG.md generated-index drift check off per-PR gate onto scheduled cadence +- [ ] **[B-0722](backlog/P2/B-0722-ci-ephemeral-cluster-smoke-via-k3d-on-runner-evolve-to-vcluster-2026-05-25.md)** CI ephemeral cluster smoke — k3d-on-runner for every AI-cluster PR; evolve to vcluster-on-shared-host when persistent dev cluster exists - [ ] **[B-0724](backlog/P2/B-0724-ts-hat-operator-polyglot-k8s-operator-pattern-for-max-2026-05-25.md)** TS hat-system operator — second polyglot implementation alongside the Go scaffold; proves the polyglot-operator pattern for the cluster - [ ] **[B-0726](backlog/P2/B-0726-reticulum-throughout-cluster-and-edge-composing-substrate-alongside-k8s-2026-05-25.md)** Reticulum throughout — cluster nodes AND edge devices on the same mesh; K8s and Reticulum compose as layers rather than partition by network tier - [ ] **[B-0728](backlog/P2/B-0728-destructive-tool-authoring-contract-rails-plus-permission-grants-invocation-plus-runtime-acceptance-gate-2026-05-25.md)** Destructive-tool authoring contract — safety rails + permission-grants-INVOCATION-not-absolution + runtime-acceptance gate with nonce; 
canonical pattern landed in flash-usb.ts diff --git a/docs/backlog/P2/B-0722-ci-ephemeral-cluster-smoke-via-k3d-on-runner-evolve-to-vcluster-2026-05-25.md b/docs/backlog/P2/B-0722-ci-ephemeral-cluster-smoke-via-k3d-on-runner-evolve-to-vcluster-2026-05-25.md new file mode 100644 index 0000000000..9dff42e9a1 --- /dev/null +++ b/docs/backlog/P2/B-0722-ci-ephemeral-cluster-smoke-via-k3d-on-runner-evolve-to-vcluster-2026-05-25.md @@ -0,0 +1,109 @@ +--- +id: B-0722 +priority: P2 +status: open +title: "CI ephemeral cluster smoke — k3d-on-runner for every AI-cluster PR; evolve to vcluster-on-shared-host when persistent dev cluster exists" +created: 2026-05-25 +last_updated: 2026-05-25 +classification: buildable-now +decomposition: atomic +type: ci-substrate +discovered_by: aaron +owners: [aaron, maintainer] +composes_with: + - full-ai-cluster/dev-cluster/ + - full-ai-cluster/dev-cluster/SYNC-WAVES.md + - full-ai-cluster/dev-cluster/README.md + - full-ai-cluster/k8s/applications/argocd/Application.yaml +--- + +# B-0722 — CI ephemeral cluster smoke (k3d-on-runner now, vcluster-on-shared-host later) + +## Carved blade + +> Every PR that touches the AI cluster's Application graph should spin up an ephemeral cluster, reconcile the root App-of-Apps, and assert sync waves resolve — BEFORE the change hits prod. k3d-on-runner is sufficient for v1; vcluster evolves the cycle from ~5 min to ~30 sec when the persistent dev cluster exists. + +## Origin + +Aaron 2026-05-25, during the dev-cluster scaffolding session (PR #4953): *"also tests should be able to use kind/k3d to do ephemeral clusters on prs"*. Then: *"we will do k8s in k8s later k8s in docker if fine for ci now"*. + +The dev-cluster substrate landed in PR #4953 is CI-ready by design — `up.sh` accepts a git-ref argument today; the `--config ` flag is part of Phase 1's small refactor (planned in this row, not yet implemented). CI just needs to call it with a single-node profile and run sync-wave assertions. 
+ +## What lands (when this row is picked up) + +### Phase 1 — k3d-on-runner (the v1 ask) + +1. **`full-ai-cluster/dev-cluster/profiles/ci.k3d-config.yaml`** — minimal single-node k3d profile sized for GitHub-hosted runners (2 CPU / 7 GB). No agents, no local registry, same Cilium-takeover K3S flags. + +2. **`full-ai-cluster/tools/ci/cluster-smoke.sh`** — wraps `full-ai-cluster/dev-cluster/up.sh --config full-ai-cluster/dev-cluster/profiles/ci.k3d-config.yaml`, then: + - Builds the sync-wave plan by parsing every `Application.yaml`'s `argocd.argoproj.io/sync-wave` annotation + - Polls each app per wave, asserting Healthy/Synced OR Healthy/OutOfSync (acceptable for placeholder Deployments at `replicas: 0`) + - Captures `argocd-applications.json`, `nodes.txt`, `pods.txt`, `recent-events.txt` to `ARTIFACT_DIR` + - Tears down on EXIT trap (skip with `SKIP_TEARDOWN=1`) + - Exit codes: 0 = pass; 1 = converge timeout; 2 = pre-flight fail + +3. **`.github/workflows/ai-cluster-smoke.yml`** — triggers on `pull_request` with path filter (`full-ai-cluster/k8s/applications/**`, `full-ai-cluster/dev-cluster/**`, `full-ai-cluster/tools/ci/**`, this workflow file). Concurrency group cancels in-flight runs on new commits. Installs k3d + kubectl + helm + jq, runs `cluster-smoke.sh`, uploads artifacts, posts PR comment on failure with sync-wave plan + recent events. + - **Security**: every github-context value (head SHA, etc.) reaches `run:` via `env:` block — never inlined — per [`docs/security/GITHUB-ACTIONS-SAFE-PATTERNS.md`](../../security/GITHUB-ACTIONS-SAFE-PATTERNS.md). Use `${{ github.event.pull_request.head.sha }}` only inside `env:`, then reference as `$GIT_REF` in `run:`. + +4. **Small `up.sh` refactor** — add `--config ` flag; read `metadata.name` from the chosen config so `down.sh` + smoke script stay in sync regardless of cluster name. (Default behavior preserved: no flag = current `k3d-config.yaml`.) 
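The "builds the sync-wave plan" step in item 2 above could be sketched roughly as follows. This is a minimal illustration, not the row's actual `cluster-smoke.sh`: the function name `build_wave_plan` is hypothetical, the app name is assumed to be the directory that holds each `Application.yaml`, and a missing annotation is assumed to fall back to wave `0` (Argo CD's default). A stricter script might instead fail on a missing annotation, since the acceptance list expects that case to be caught.

```shell
# Hypothetical helper (not part of the repo): print "<wave><TAB><app>" for every
# Application.yaml under the directory given as $1, sorted by wave, then app name.
# Assumption: the app name is the name of the directory containing the file.
# Assumption: a missing sync-wave annotation is treated as wave 0.
build_wave_plan() {
  find "$1" -type f -name 'Application.yaml' | while IFS= read -r plan_file; do
    # Take the first sync-wave annotation value and strip quotes and spaces.
    plan_wave="$(awk -F': *' '/argocd\.argoproj\.io\/sync-wave:/ { gsub(/["'\'' ]/, "", $2); print $2; exit }' "$plan_file")"
    plan_app="$(basename "$(dirname "$plan_file")")"
    printf '%s\t%s\n' "${plan_wave:-0}" "$plan_app"
  done | sort -t "$(printf '\t')" -k1,1n -k2,2
}
```

Called as `build_wave_plan full-ai-cluster/k8s/applications`, this would emit one line per app in reconcile order, which a polling loop could then walk wave by wave.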
+ +### Phase 2 — vcluster-on-shared-host (the "later" path) + +When the bare-metal cluster comes up and is reachable from CI: + +- Replace the k3d-on-runner spin-up with `vcluster create pr-${{ github.event.pull_request.number }}` on a long-running host cluster +- Each PR gets its own isolated vcluster on shared infrastructure +- Spin-up drops from ~3-5 min (k3d full cluster) to ~30 sec (vcluster pod-on-existing-cluster) +- Tear-down is `vcluster delete pr-`; one command, instant +- Same `cluster-smoke.sh` runs against vcluster's kubeconfig — no other code changes + +References for the Phase 2 design: + +- **vcluster (Loft)** — https://www.vcluster.com/ — virtual K8s clusters as pods +- **Cluster API (CAPI)** — https://cluster-api.sigs.k8s.io/ — declarative cluster management via CRDs +- **Kamaji** / **k0smotron** — managed control planes inside a host cluster (lighter alternatives to CAPI) + +## Why P2 not P1 + +The dev-cluster substrate (PR #4953) already lets a maintainer manually run `./up.sh feat/my-branch` to test a PR locally. Automating that in CI is a clear win but not blocking — substrate exists for manual dev-test today, and the prod cluster doesn't exist yet so there's no urgent "block bad changes from reaching prod" pressure. 
+ +Becomes P1 when: + +- Prod cluster bootstrap completes (bare-metal install finished) +- Multiple maintainers / agents are landing AI-cluster PRs in parallel (manual dev-test stops scaling) + +## Acceptance + +- [ ] `full-ai-cluster/dev-cluster/up.sh --config full-ai-cluster/dev-cluster/profiles/ci.k3d-config.yaml` works locally and brings up a single-node cluster +- [ ] `full-ai-cluster/tools/ci/cluster-smoke.sh` runs end-to-end against a fresh checkout and exits 0 on a clean main +- [ ] `.github/workflows/ai-cluster-smoke.yml` triggers on a PR touching `full-ai-cluster/k8s/applications/**`, completes within 45 min, posts artifacts +- [ ] A deliberately broken PR (e.g., sync-wave annotation missing on a new app, or chart-values typo) is caught by the workflow before merge +- [ ] Workflow concurrency cancels in-flight runs on new commits to the same PR +- [ ] Every github-context value reaches `run:` via `env:` (no inline interpolation — per [`docs/security/GITHUB-ACTIONS-SAFE-PATTERNS.md`](../../security/GITHUB-ACTIONS-SAFE-PATTERNS.md) workflow-injection guidance) + +## Estimated scope + +- Phase 1: ~1 day of dedicated work; ~500 lines (1 yaml profile, 1 shell script, 1 workflow, small up.sh refactor) +- Phase 2: separate row, depends on Phase 1 + persistent shared cluster existing + +## References + +- PR #4953 — dev-cluster substrate this row builds on (k3d, ArgoCD bootstrap, sync-wave annotations across 34 Applications, SYNC-WAVES.md) +- PR #4950 — disko cookie-cutter (bare-metal install path; complements but doesn't block this) +- PR #4951 — NFD + lstopo + zeta-install (compose with smoke test for hardware-feature assertions) +- PR #4930 — hat-system operator (one of the apps the smoke test must reconcile) +- `full-ai-cluster/dev-cluster/SYNC-WAVES.md` — dependency graph the smoke test asserts against +- `full-ai-cluster/dev-cluster/DOCKER-DESKTOP.md` — resource sizing context relevant to CI runner constraints + +## Composes with substrate + +- The dev/prod 
parity model from PR #4953 (same workloads from same git ref via ArgoCD) +- The sync-wave annotations on all 34 Applications (PR #4953) — smoke test asserts the graph reconciles in order +- `dev-cluster/README.md` "Multi-cluster ArgoCD pattern (future)" section — Phase 2 evolution path + +## Not in scope + +- GPU-dependent workloads (ollama / vllm / deepseek-coder / qwen-coder) — these stay excluded from CI per the dev-cluster root App-of-Apps `exclude:` glob +- Longhorn — single-node CI has nothing to replicate; local-path-provisioner handles PVCs +- Real model serving — no GPUs on GitHub-hosted runners +- Production cluster smoke — separate row; production reconciliation runs continuously via ArgoCD on the bare-metal cluster, not via CI diff --git a/docs/hygiene-history/ticks/2026/05/25/2208Z.md b/docs/hygiene-history/ticks/2026/05/25/2208Z.md new file mode 100644 index 0000000000..c967a22c39 --- /dev/null +++ b/docs/hygiene-history/ticks/2026/05/25/2208Z.md @@ -0,0 +1,85 @@ +| 2026-05-25T22:08Z | opus-4-7 / autonomous-loop | 07a747f3 | Otto-CLI fresh cold-boot under explanatory-output-style; CronList empty (catch-43 fired) → sentinel re-armed; PR #4954 (B-0722) had 2 failing required checks (markdownlint MD012 at `docs/BACKLOG.md:695` + BACKLOG.md generated-index drift from main-merge); both resolved via single `BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts` run; commit `8282de945` pushed; auto-merge stayed armed; counter reset via substantive engagement | this shard | named-dep-driven-fix-not-brief-ack | + +# Tick 2208Z — 2026-05-25 Otto-CLI; PR #4954 markdownlint + BACKLOG.md drift cleared via single regeneration + +**Surface:** Otto-CLI fresh cold-boot (autonomous-loop fired by scheduled task) +**Branch (root checkout):** `backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` (PR #4954 branch; my Otto-CLI lane; clean working tree of tracked modifications — only untracked Lior sidetick dirs) +**Tier (rate-limit):** Normal (GraphQL 4702/5000, 
48min reset; REST core 4832/5000) +**Tier (dotgit):** Recovered (0 stuck `git pack-objects`/`git maintenance`/`git repack` procs) +**Tier (peer-saturation):** 54 peer claude/gemini/kiro/codex/alexa procs (active but not corrupting `.git/`) +**Sentinel:** `07a747f3` armed at session-start per [`.claude/rules/tick-must-never-stop.md`](../../../../../../.claude/rules/tick-must-never-stop.md) — `CronList` returned empty (catch-43 confirmed; session-exit-non-persistence cadence per the in-repo rule body) +**Build gate:** not run this tick (substrate is `.md` regeneration only; no F#/dotnet touched) + +## Refresh-before-decide findings (step 1) + +- `git fetch origin main` → clean; origin/main HEAD `5b6ea6a0c` (B-0742 k8s local stack reference) +- Current branch `backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` 3 commits ahead of main (including the post-merge `74aac8346`) +- `gh pr list` → PR #4954 OPEN, MERGEABLE, auto-merge armed by Aaron at 16:46Z, status BLOCKED +- 2 FAILURE required checks (named-dependency per [`holding-without-named-dependency-is-standing-by-failure.md`](../../../../../../.claude/rules/holding-without-named-dependency-is-standing-by-failure.md) condition #3 — concrete artifact, bounded scope, in own lane): + 1. `lint (markdownlint)` — `docs/BACKLOG.md:695 MD012/no-multiple-blanks` (2 blank lines before `## P3 — convenience / deferred` heading; expected 1) + 2. `check docs/BACKLOG.md generated-index drift` — 5 rows out of canonical generator order after main-merge picked up B-0506/B-0499/B-0514/B-0515/B-0517/B-0519 reorderings + +## Disposition (steps 2-3) + +NOT brief-ack. NAMED DEPENDENCY surfaced concrete bounded work in own Otto-CLI lane. Reset counter via substantive engagement per [`holding-without-named-dependency-is-standing-by-failure.md`](../../../../../../.claude/rules/holding-without-named-dependency-is-standing-by-failure.md). 
+ +## Investigation + fix + +Both failures share a single root cause: the post-main-merge commit `74aac8346` carried main's row state into the working tree but did NOT re-run `tools/backlog/generate-index.ts`, so: + +- The blank-line normalization (Phase-2 generator emits exactly one blank line before each priority heading) was lost +- The post-merge ordering / status-bullet state didn't match canonical generator output + +Resolution path (single command, fixes both): + +```bash +BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts +# wrote /docs/BACKLOG.md +``` + +Diff: `1 file changed, 1 deletion(-)` — net effect is the duplicate blank line at line 695 removed; all other row state stayed canonical (the `--check` complaints were the generator's normalization vs the human-merged file). + +### Verification + +- `bun tools/backlog/generate-index.ts --check` → `ok: docs/BACKLOG.md matches generator output` +- `npx markdownlint-cli2 docs/BACKLOG.md` → exit 0 (no output) +- `git branch --show-current` immediately before commit = `backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` (branch guard OK per [`.claude/rules/zeta-expected-branch.md`](../../../../../../.claude/rules/zeta-expected-branch.md)) + +## Landings (step 4) + +- Commit `8282de945` `fix(B-0722): regenerate BACKLOG.md to clear MD012 + index-drift` +- Pushed to `origin/backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` (`74aac8346..8282de945`) +- PR #4954 head SHA now `8282de9457ff595d6abb837ebcbcf7b782c1db5c` +- Auto-merge stayed armed (squash, enabledBy AceHack at 16:46:19Z); will fire when checks pass on new SHA + +`★ Insight ─────────────────────────────────────` +• **Post-merge BACKLOG drift is mechanical** — when a feature branch merges `main` AND the index file landed since the branch was last regenerated, the generator output diverges from the file. Always re-run `BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts` after any main-merge that touches `docs/backlog/**`. 
+• **Phase-1a guard requires opt-in** — `tools/backlog/generate-index.ts` refuses to overwrite the existing BACKLOG.md without `BACKLOG_WRITE_FORCE=1`. This is intentional substrate-protection: the generator is Phase-1a and the guard prevents accidental clobber during Phase-2 migration work. The opt-in env var is the single-line authorization. +• **Two named-dep checks, one root cause, one commit** — `lint (markdownlint)` and `check docs/BACKLOG.md generated-index drift` both surfaced the same underlying drift. Diagnosing both failures pointed at the same fix, so a single regeneration commit cleared both. This composes with [`blocked-green-ci-investigate-threads.md`](../../../../../../.claude/rules/blocked-green-ci-investigate-threads.md) — investigate the failures, then act with the smallest possible fix. +`─────────────────────────────────────────────────` + +## Step 5 — this shard + +Written at `docs/hygiene-history/ticks/2026/05/25/2208Z.md` (canonical write surface per [`.claude/rules/tick-must-never-stop.md`](../../../../../../.claude/rules/tick-must-never-stop.md)). + +## Step 6 — CronList check + +`CronList` returned empty at session-start (catch-43 confirmed). Sentinel `07a747f3` (`* * * * *`, `<>`, recurring, session-only — auto-expires in 7d) armed immediately as FIRST tool call. This continues the 5+ catch-43-firing cold-boot cadence documented across today's user-scope memory anchors (`feedback_cold_boot_cascade_continues_independent_of_dotgit_clearance_5th_today_dotgit_recovered_named_dep_pr_4937_wait_ci_otto_cli_2026_05_25.md`). + +## Step 7 — Visibility signal + +**Concrete artifacts this tick:** + +- Commit `8282de945` on `backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` (pushed to origin) +- Tick shard at `docs/hygiene-history/ticks/2026/05/25/2208Z.md` (THIS file) +- Sentinel re-armed (`07a747f3`) + +PR #4954 progression: BLOCKED with 2 failed required checks → BLOCKED waiting for re-runs on new SHA → auto-merge fires when green. 
No additional action needed from Otto until next named-dep surfaces. + +## Composes with + +- [`.claude/rules/holding-without-named-dependency-is-standing-by-failure.md`](../../../../../../.claude/rules/holding-without-named-dependency-is-standing-by-failure.md) — named-dependency #3 (concrete bounded artifact, own lane); reset counter via substantive engagement +- [`.claude/rules/blocked-green-ci-investigate-threads.md`](../../../../../../.claude/rules/blocked-green-ci-investigate-threads.md) — BLOCKED-with-green-CI investigation discipline at the FAILED-CI scope +- [`.claude/rules/zeta-expected-branch.md`](../../../../../../.claude/rules/zeta-expected-branch.md) — `git branch --show-current` guard immediately before commit (env-var ZETA_EXPECTED_BRANCH not set; direct guard sufficient because no peer Otto on this same branch) +- [`.claude/rules/tick-must-never-stop.md`](../../../../../../.claude/rules/tick-must-never-stop.md) — sentinel re-arm on `CronList` empty +- [`.claude/rules/refresh-before-decide.md`](../../../../../../.claude/rules/refresh-before-decide.md) — fetched origin/main + queried PR status + rate-limit tier before acting