1 change: 1 addition & 0 deletions docs/BACKLOG.md
@@ -680,6 +680,7 @@ are closed (status: closed in frontmatter)._
- [ ] **[B-0715](backlog/P2/B-0715-soraya-round52-istimeinvariant-axiom-registry-gap-dbsp-chain-rule-2026-05-23.md)** Soraya round-52 hand-off - register `IsTimeInvariant` axiom in verification-registry (Class 1/2 statement+paper-drift on a load-bearing axiom that BOTH registered DBSP theorems depend on)
- [ ] **[B-0717](backlog/P2/B-0717-soraya-round57-lsm-spine-registry-and-bp16-cross-check-pair-2026-05-24.md)** Soraya round-57 hand-off - LSM Spine cluster registry-rows + BP-16 cross-check pair (SpineAsyncProtocol candidate-P0 TLA+/code-drift gap)
- [ ] **[B-0721](backlog/P2/B-0721-backlog-md-generated-index-on-schedule-not-per-pr-2026-05-24.md)** Move docs/BACKLOG.md generated-index drift check off per-PR gate onto scheduled cadence
- [ ] **[B-0722](backlog/P2/B-0722-ci-ephemeral-cluster-smoke-via-k3d-on-runner-evolve-to-vcluster-2026-05-25.md)** CI ephemeral cluster smoke - k3d-on-runner for every AI-cluster PR; evolve to vcluster-on-shared-host when persistent dev cluster exists
- [ ] **[B-0724](backlog/P2/B-0724-ts-hat-operator-polyglot-k8s-operator-pattern-for-max-2026-05-25.md)** TS hat-system operator - second polyglot implementation alongside the Go scaffold; proves the polyglot-operator pattern for the cluster
- [ ] **[B-0726](backlog/P2/B-0726-reticulum-throughout-cluster-and-edge-composing-substrate-alongside-k8s-2026-05-25.md)** Reticulum throughout - cluster nodes AND edge devices on the same mesh; K8s and Reticulum compose as layers rather than partition by network tier
- [ ] **[B-0728](backlog/P2/B-0728-destructive-tool-authoring-contract-rails-plus-permission-grants-invocation-plus-runtime-acceptance-gate-2026-05-25.md)** Destructive-tool authoring contract - safety rails + permission-grants-INVOCATION-not-absolution + runtime-acceptance gate with nonce; canonical pattern landed in flash-usb.ts
@@ -0,0 +1,109 @@
---
id: B-0722
priority: P2
status: open
title: "CI ephemeral cluster smoke - k3d-on-runner for every AI-cluster PR; evolve to vcluster-on-shared-host when persistent dev cluster exists"
created: 2026-05-25
last_updated: 2026-05-25
classification: buildable-now
decomposition: atomic
type: ci-substrate
discovered_by: aaron
owners: [aaron, maintainer]
composes_with:
- full-ai-cluster/dev-cluster/
- full-ai-cluster/dev-cluster/SYNC-WAVES.md
- full-ai-cluster/dev-cluster/README.md
- full-ai-cluster/k8s/applications/argocd/Application.yaml
---

# B-0722 - CI ephemeral cluster smoke (k3d-on-runner now, vcluster-on-shared-host later)

## Carved blade

> Every PR that touches the AI cluster's Application graph should spin up an ephemeral cluster, reconcile the root App-of-Apps, and assert sync waves resolve - BEFORE the change hits prod. k3d-on-runner is sufficient for v1; vcluster evolves the cycle from ~5 min to ~30 sec when the persistent dev cluster exists.

## Origin

Aaron 2026-05-25, during the dev-cluster scaffolding session (PR #4953): *"also tests should be able to use kind/k3d to do ephemeral clusters on prs"*. Then: *"we will do k8s in k8s later k8s in docker if fine for ci now"*.

The dev-cluster substrate that landed in PR #4953 is CI-ready by design: `up.sh` accepts a git-ref argument today, and the `--config <profile>` flag is part of Phase 1's small refactor (planned in this row, not yet implemented). CI just needs to call it with a single-node profile and run sync-wave assertions.

## What lands (when this row is picked up)

### Phase 1 - k3d-on-runner (the v1 ask)

1. **`full-ai-cluster/dev-cluster/profiles/ci.k3d-config.yaml`** - minimal single-node k3d profile sized for GitHub-hosted runners (2 CPU / 7 GB). No agents, no local registry, same Cilium-takeover K3S flags.

2. **`full-ai-cluster/tools/ci/cluster-smoke.sh`** - wraps `full-ai-cluster/dev-cluster/up.sh --config full-ai-cluster/dev-cluster/profiles/ci.k3d-config.yaml`, then:
   - Builds the sync-wave plan by parsing every `Application.yaml`'s `argocd.argoproj.io/sync-wave` annotation
   - Polls each app per wave, asserting Healthy/Synced OR Healthy/OutOfSync (acceptable for placeholder Deployments at `replicas: 0`)
   - Captures `argocd-applications.json`, `nodes.txt`, `pods.txt`, `recent-events.txt` to `ARTIFACT_DIR`
   - Tears down on EXIT trap (skip with `SKIP_TEARDOWN=1`)
   - Exit codes: 0 = pass; 1 = converge timeout; 2 = pre-flight fail

3. **`.github/workflows/ai-cluster-smoke.yml`** - triggers on `pull_request` with path filter (`full-ai-cluster/k8s/applications/**`, `full-ai-cluster/dev-cluster/**`, `full-ai-cluster/tools/ci/**`, this workflow file). Concurrency group cancels in-flight runs on new commits. Installs k3d + kubectl + helm + jq, runs `cluster-smoke.sh`, uploads artifacts, posts PR comment on failure with sync-wave plan + recent events.

Review comment (P2): Include bootstrap root app in smoke workflow path filter

Expand the proposed pull_request paths list to include the root App-of-Apps manifest (e.g. full-ai-cluster/k8s/bootstrap/root-application.yaml). As written, a PR that changes the root application graph entrypoint would not trigger this smoke workflow, which contradicts the row's goal of validating graph-affecting changes before merge and leaves a real blind spot for bootstrap-level regressions.

   - **Security**: every github-context value (head SHA, etc.) reaches `run:` via an `env:` block, never inlined, per [`docs/security/GITHUB-ACTIONS-SAFE-PATTERNS.md`](../../security/GITHUB-ACTIONS-SAFE-PATTERNS.md). Use `${{ github.event.pull_request.head.sha }}` only inside `env:`, then reference as `$GIT_REF` in `run:`.

4. **Small `up.sh` refactor** - add `--config <path>` flag; read `metadata.name` from the chosen config so `down.sh` + smoke script stay in sync regardless of cluster name. (Default behavior preserved: no flag = current `k3d-config.yaml`.)
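
The "build the sync-wave plan" step in item 2 could be sketched roughly as below. This is only an illustration, not the planned `cluster-smoke.sh`: the function name `build_wave_plan` is hypothetical, the app name is assumed to be the directory that holds each `Application.yaml`, the annotation is assumed to sit on a single line, and apps without the annotation are treated as wave 0 (Argo CD's default).

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print "<wave> <app>" for every Application.yaml under $1, ordered by wave.
# Assumptions (not confirmed by the source): one annotation line per file,
# e.g.  argocd.argoproj.io/sync-wave: "2", and app name == parent directory.
build_wave_plan() {
  local root="$1" file wave
  find "$root" -type f -name 'Application.yaml' | sort | while IFS= read -r file; do
    # Take the first sync-wave annotation and strip quotes and spaces.
    wave="$(awk -F': *' '/argocd\.argoproj\.io\/sync-wave:/ { gsub(/["'\'' ]/, "", $2); print $2; exit }' "$file")"
    # A missing annotation falls back to wave 0.
    printf '%s %s\n' "${wave:-0}" "$(basename "$(dirname "$file")")"
  done | sort -k1,1n -k2,2
}
```

Calling `build_wave_plan full-ai-cluster/k8s/applications` would then give the smoke script one line per app, lowest wave first, which it can group before polling.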

### Phase 2 - vcluster-on-shared-host (the "later" path)

When the bare-metal cluster comes up and is reachable from CI:

- Replace the k3d-on-runner spin-up with `vcluster create pr-${{ github.event.pull_request.number }}` on a long-running host cluster
- Each PR gets its own isolated vcluster on shared infrastructure
- Spin-up drops from ~3-5 min (k3d full cluster) to ~30 sec (vcluster pod-on-existing-cluster)
- Tear-down is `vcluster delete pr-<num>`; one command, instant
- Same `cluster-smoke.sh` runs against vcluster's kubeconfig - no other code changes
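
As a rough sketch of the per-PR lifecycle described above, the helper below only prints the commands it would run, so it can be read without a host cluster. The function names are hypothetical, and the `--connect=false` and `--print` flags are assumptions to verify against the installed vcluster CLI version before use.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Derive the per-PR vcluster name; reject anything that is not a plain number
# so a crafted value cannot smuggle extra shell words into later commands.
vcluster_name() {
  local pr="$1"
  [[ "$pr" =~ ^[0-9]+$ ]] || { echo "invalid PR number: $pr" >&2; return 2; }
  printf 'pr-%s\n' "$pr"
}

# Print (do not run) the create / connect / delete commands for one PR.
vcluster_plan() {
  local name
  name="$(vcluster_name "$1")" || return
  printf 'vcluster create %s --connect=false\n' "$name"
  printf 'vcluster connect %s --print > kubeconfig-%s.yaml\n' "$name" "$name"
  printf 'vcluster delete %s\n' "$name"
}
```

In a workflow, the PR number would arrive through an `env:` variable rather than inline interpolation, in line with the Phase 1 security note.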

References for the Phase 2 design:

- **vcluster (Loft)** - https://www.vcluster.com/ - virtual K8s clusters as pods
- **Cluster API (CAPI)** - https://cluster-api.sigs.k8s.io/ - declarative cluster management via CRDs
- **Kamaji** / **k0smotron** - managed control planes inside a host cluster (lighter alternatives to CAPI)

## Why P2 not P1

The dev-cluster substrate (PR #4953) already lets a maintainer manually run `./up.sh feat/my-branch` to test a PR locally. Automating that in CI is a clear win but not blocking: the substrate exists for manual dev-test today, and the prod cluster doesn't exist yet, so there's no urgent "block bad changes from reaching prod" pressure.

Becomes P1 when:

- Prod cluster bootstrap completes (bare-metal install finished)
- Multiple maintainers / agents are landing AI-cluster PRs in parallel (manual dev-test stops scaling)

## Acceptance

- [ ] `full-ai-cluster/dev-cluster/up.sh --config full-ai-cluster/dev-cluster/profiles/ci.k3d-config.yaml` works locally and brings up a single-node cluster
- [ ] `full-ai-cluster/tools/ci/cluster-smoke.sh` runs end-to-end against a fresh checkout and exits 0 on a clean main
- [ ] `.github/workflows/ai-cluster-smoke.yml` triggers on a PR touching `full-ai-cluster/k8s/applications/**`, completes within 45 min, posts artifacts
- [ ] A deliberately broken PR (e.g., sync-wave annotation missing on a new app, or chart-values typo) is caught by the workflow before merge
- [ ] Workflow concurrency cancels in-flight runs on new commits to the same PR
- [ ] Every github-context value reaches `run:` via `env:` (no inline interpolation, per [`docs/security/GITHUB-ACTIONS-SAFE-PATTERNS.md`](../../security/GITHUB-ACTIONS-SAFE-PATTERNS.md) workflow-injection guidance)
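
The last acceptance item can be paired with a defensive check inside the `run:` script itself. The sketch below assumes the workflow exports the head SHA as `GIT_REF` (as the Phase 1 security note suggests) and that CI only ever passes a full commit SHA. The function names are hypothetical, and the return code 2 mirrors the "pre-flight fail" exit code listed for `cluster-smoke.sh`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Accept only a full 40-character lowercase hex commit SHA.
is_commit_sha() {
  [[ "$1" =~ ^[0-9a-f]{40}$ ]]
}

# Read the ref from the environment (populated by the workflow's env: block)
# and refuse to continue when it is missing or not SHA-shaped.
require_git_ref() {
  local ref="${GIT_REF:-}"
  if ! is_commit_sha "$ref"; then
    echo "GIT_REF is missing or not a 40-char SHA" >&2
    return 2   # pre-flight fail
  fi
  printf '%s\n' "$ref"
}
```

Local runs that pass a branch name such as `feat/my-branch` to `up.sh` would skip this check; it is meant only for the CI path, where the value originates from the pull-request event.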

## Estimated scope

- Phase 1: ~1 day of dedicated work; ~500 lines (1 yaml profile, 1 shell script, 1 workflow, small up.sh refactor)
- Phase 2: separate row, depends on Phase 1 + persistent shared cluster existing

## References

- PR #4953 - dev-cluster substrate this row builds on (k3d, ArgoCD bootstrap, sync-wave annotations across 34 Applications, SYNC-WAVES.md)
- PR #4950 - disko cookie-cutter (bare-metal install path; complements but doesn't block this)
- PR #4951 - NFD + lstopo + zeta-install (compose with smoke test for hardware-feature assertions)
- PR #4930 - hat-system operator (one of the apps the smoke test must reconcile)
- `full-ai-cluster/dev-cluster/SYNC-WAVES.md` - dependency graph the smoke test asserts against
- `full-ai-cluster/dev-cluster/DOCKER-DESKTOP.md` - resource sizing context relevant to CI runner constraints

## Composes with substrate

- The dev/prod parity model from PR #4953 (same workloads from same git ref via ArgoCD)
- The sync-wave annotations on all 34 Applications (PR #4953) - smoke test asserts the graph reconciles in order
- `dev-cluster/README.md` "Multi-cluster ArgoCD pattern (future)" section - Phase 2 evolution path
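
One way the "reconciles in order" assertion might look is sketched below: a predicate for the two accepted states, plus a per-wave wait loop that reports a converge timeout. `get_status` is a hypothetical hook that must print `<health>/<sync>` for an app. One plausible implementation is `kubectl -n argocd get application "$app" -o jsonpath='{.status.health.status}/{.status.sync.status}'`, where the `argocd` namespace is an assumption.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Return 0 when a "<health>/<sync>" pair counts as converged for the smoke test.
# Healthy/OutOfSync is tolerated for placeholder Deployments at replicas: 0.
status_ok() {
  case "$1" in
    Healthy/Synced|Healthy/OutOfSync) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll every app in one wave until all pass status_ok or the deadline passes.
# $1 = timeout in seconds, $2 = poll interval in seconds, rest = app names.
# Relies on the caller defining get_status (hypothetical hook, see lead-in).
wait_for_wave() {
  local timeout="$1" interval="$2"; shift 2
  local deadline=$(( SECONDS + timeout )) app pending
  while :; do
    pending=0
    for app in "$@"; do
      status_ok "$(get_status "$app")" || pending=$(( pending + 1 ))
    done
    (( pending == 0 )) && return 0
    (( SECONDS >= deadline )) && return 1   # converge timeout
    sleep "$interval"
  done
}
```

Running `wait_for_wave` once per wave, lowest wave first, and stopping at the first non-zero return would give the ordered assertion; a return of 1 maps onto the script's "converge timeout" exit code.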

## Not in scope

- GPU-dependent workloads (ollama / vllm / deepseek-coder / qwen-coder) - these stay excluded from CI per the dev-cluster root App-of-Apps `exclude:` glob
- Longhorn - single-node CI has nothing to replicate; local-path-provisioner handles PVCs
- Real model serving - no GPUs on GitHub-hosted runners
- Production cluster smoke - separate row; production reconciliation runs continuously via ArgoCD on the bare-metal cluster, not via CI
85 changes: 85 additions & 0 deletions docs/hygiene-history/ticks/2026/05/25/2208Z.md
@@ -0,0 +1,85 @@
| 2026-05-25T22:08Z | opus-4-7 / autonomous-loop | 07a747f3 | Otto-CLI fresh cold-boot under explanatory-output-style; CronList empty (catch-43 fired) → sentinel re-armed; PR #4954 (B-0722) had 2 failing required checks (markdownlint MD012 at `docs/BACKLOG.md:695` + BACKLOG.md generated-index drift from main-merge); both resolved via single `BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts` run; commit `8282de945` pushed; auto-merge stayed armed; counter reset via substantive engagement | this shard | named-dep-driven-fix-not-brief-ack |

# Tick 2208Z - 2026-05-25 Otto-CLI; PR #4954 markdownlint + BACKLOG.md drift cleared via single regeneration

**Surface:** Otto-CLI fresh cold-boot (autonomous-loop fired by scheduled task)
**Branch (root checkout):** `backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` (PR #4954 branch; my Otto-CLI lane; clean working tree of tracked modifications - only untracked Lior sidetick dirs)
**Tier (rate-limit):** Normal (GraphQL 4702/5000, 48min reset; REST core 4832/5000)
**Tier (dotgit):** Recovered (0 stuck `git pack-objects`/`git maintenance`/`git repack` procs)
**Tier (peer-saturation):** 54 peer claude/gemini/kiro/codex/alexa procs (active but not corrupting `.git/`)
**Sentinel:** `07a747f3` armed at session-start per [`.claude/rules/tick-must-never-stop.md`](../../../../../../.claude/rules/tick-must-never-stop.md) - `CronList` returned empty (catch-43 confirmed; session-exit-non-persistence cadence per the in-repo rule body)
**Build gate:** not run this tick (substrate is `.md` regeneration only; no F#/dotnet touched)

## Refresh-before-decide findings (step 1)

- `git fetch origin main` → clean; origin/main HEAD `5b6ea6a0c` (B-0742 k8s local stack reference)
- Current branch `backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` 3 commits ahead of main (including the post-merge `74aac8346`)
- `gh pr list` → PR #4954 OPEN, MERGEABLE, auto-merge armed by Aaron at 16:46Z, status BLOCKED
- 2 FAILURE required checks (named-dependency per [`holding-without-named-dependency-is-standing-by-failure.md`](../../../../../../.claude/rules/holding-without-named-dependency-is-standing-by-failure.md) condition #3 - concrete artifact, bounded scope, in own lane):
  1. `lint (markdownlint)` - `docs/BACKLOG.md:695 MD012/no-multiple-blanks` (2 blank lines before `## P3 - convenience / deferred` heading; expected 1)
  2. `check docs/BACKLOG.md generated-index drift` - 5 rows out of canonical generator order after main-merge picked up B-0506/B-0499/B-0514/B-0515/B-0517/B-0519 reorderings
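
For readers unfamiliar with MD012, the condition it flags can be approximated with a few lines of awk. This is a simplified stand-in for local triage, not a replacement for markdownlint: it does not special-case fenced code blocks (which the real rule ignores), and the function name is hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the line number of every blank line that directly follows another
# blank line, i.e. the second and later lines of each run of blank lines.
# Simplification: fenced code blocks are not treated specially here.
find_extra_blank_lines() {
  awk '
    /^[[:space:]]*$/ { if (prev_blank) print NR; prev_blank = 1; next }
    { prev_blank = 0 }
  ' "$1"
}
```

Running `find_extra_blank_lines docs/BACKLOG.md` before pushing would surface this class of duplicate blank line without waiting for the CI lint job.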

## Disposition (steps 2-3)

NOT brief-ack. NAMED DEPENDENCY surfaced concrete bounded work in own Otto-CLI lane. Reset counter via substantive engagement per [`holding-without-named-dependency-is-standing-by-failure.md`](../../../../../../.claude/rules/holding-without-named-dependency-is-standing-by-failure.md).

## Investigation + fix

Both failures share a single root cause: the post-main-merge commit `74aac8346` carried main's row state into the working tree but did NOT re-run `tools/backlog/generate-index.ts`, so:

- The blank-line normalization (Phase-2 generator emits exactly one blank line before each priority heading) was lost
- The post-merge ordering / status-bullet state didn't match canonical generator output

Resolution path (single command, fixes both):

```bash
BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts
# wrote <repo>/docs/BACKLOG.md
```

Diff: `1 file changed, 1 deletion(-)` - net effect is the duplicate blank line at line 695 removed; all other row state stayed canonical (the `--check` complaints were the generator's normalization vs the human-merged file).

### Verification

- `bun tools/backlog/generate-index.ts --check` → `ok: docs/BACKLOG.md matches generator output`
- `npx markdownlint-cli2 docs/BACKLOG.md` → exit 0 (no output)
- `git branch --show-current` immediately before commit = `backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` (branch guard OK per [`.claude/rules/zeta-expected-branch.md`](../../../../../../.claude/rules/zeta-expected-branch.md))
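
The check, regenerate, re-check sequence used in this tick could be wrapped as below. This is a sketch under assumptions: the wrapper name `regen_if_drifted` and the `GENERATOR` override variable are hypothetical, while the `--check` flag and the `BACKLOG_WRITE_FORCE=1` opt-in are taken from the commands recorded above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Command used to invoke the generator; overridable so the flow can be
# exercised without bun (hypothetical knob, not part of the real tool).
GENERATOR="${GENERATOR:-bun tools/backlog/generate-index.ts}"

# Run --check first; only regenerate (with the explicit opt-in) on drift,
# then re-check so a failed regeneration is not silently accepted.
regen_if_drifted() {
  if $GENERATOR --check; then
    echo "index already matches generator output"
    return 0
  fi
  echo "drift detected; regenerating"
  BACKLOG_WRITE_FORCE=1 $GENERATOR
  $GENERATOR --check
}
```

Running this after every merge of `main` into a branch that touches `docs/backlog/**` keeps the regeneration a deliberate, single-command step rather than something discovered through a failed required check.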

## Landings (step 4)

- Commit `8282de945` `fix(B-0722): regenerate BACKLOG.md to clear MD012 + index-drift`
- Pushed to `origin/backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` (`74aac8346..8282de945`)
- PR #4954 head SHA now `8282de9457ff595d6abb837ebcbcf7b782c1db5c`
- Auto-merge stayed armed (squash, enabledBy AceHack at 16:46:19Z); will fire when checks pass on new SHA

`★ Insight ─────────────────────────────────────`
• **Post-merge BACKLOG drift is mechanical** - when a feature branch merges `main` AND the index file landed since the branch was last regenerated, the generator output diverges from the file. Always re-run `BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts` after any main-merge that touches `docs/backlog/**`.
• **Phase-1a guard requires opt-in** - `tools/backlog/generate-index.ts` refuses to overwrite the existing BACKLOG.md without `BACKLOG_WRITE_FORCE=1`. This is intentional substrate-protection: the generator is Phase-1a and the guard prevents accidental clobber during Phase-2 migration work. The opt-in env var is the single-line authorization.
• **Two named-dep checks, one root cause, one commit** - `lint (markdownlint)` and `check docs/BACKLOG.md generated-index drift` both surfaced the same underlying drift. Diagnosing both failures pointed at the same fix, so a single regeneration commit cleared both. This composes with [`blocked-green-ci-investigate-threads.md`](../../../../../../.claude/rules/blocked-green-ci-investigate-threads.md) - investigate the failures, then act with the smallest possible fix.
`─────────────────────────────────────────────────`

## Step 5 - this shard

Written at `docs/hygiene-history/ticks/2026/05/25/2208Z.md` (canonical write surface per [`.claude/rules/tick-must-never-stop.md`](../../../../../../.claude/rules/tick-must-never-stop.md)).

## Step 6 - CronList check

`CronList` returned empty at session-start (catch-43 confirmed). Sentinel `07a747f3` (`* * * * *`, `<<autonomous-loop>>`, recurring, session-only - auto-expires in 7d) armed immediately as FIRST tool call. This continues the 5+ catch-43-firing cold-boot cadence documented across today's user-scope memory anchors (`feedback_cold_boot_cascade_continues_independent_of_dotgit_clearance_5th_today_dotgit_recovered_named_dep_pr_4937_wait_ci_otto_cli_2026_05_25.md`).

## Step 7 - Visibility signal

**Concrete artifacts this tick:**

- Commit `8282de945` on `backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` (pushed to origin)
- Tick shard at `docs/hygiene-history/ticks/2026/05/25/2208Z.md` (THIS file)
- Sentinel re-armed (`07a747f3`)

PR #4954 progression: BLOCKED with 2 failed required checks → BLOCKED waiting for re-runs on new SHA → auto-merge fires when green. No additional action needed from Otto until next named-dep surfaces.

## Composes with

- [`.claude/rules/holding-without-named-dependency-is-standing-by-failure.md`](../../../../../../.claude/rules/holding-without-named-dependency-is-standing-by-failure.md) - named-dependency #3 (concrete bounded artifact, own lane); reset counter via substantive engagement
- [`.claude/rules/blocked-green-ci-investigate-threads.md`](../../../../../../.claude/rules/blocked-green-ci-investigate-threads.md) - BLOCKED-with-green-CI investigation discipline at the FAILED-CI scope
- [`.claude/rules/zeta-expected-branch.md`](../../../../../../.claude/rules/zeta-expected-branch.md) - `git branch --show-current` guard immediately before commit (env-var ZETA_EXPECTED_BRANCH not set; direct guard sufficient because no peer Otto on this same branch)
- [`.claude/rules/tick-must-never-stop.md`](../../../../../../.claude/rules/tick-must-never-stop.md) - sentinel re-arm on `CronList` empty
- [`.claude/rules/refresh-before-decide.md`](../../../../../../.claude/rules/refresh-before-decide.md) - fetched origin/main + queried PR status + rate-limit tier before acting