backlog(B-0722): CI ephemeral cluster smoke via k3d-on-runner; evolve to vcluster #4954
Changes from all commits
fea52af
3db4bb1
74aac83
8282de9
706b7f4
f4c203d
508efd9
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,109 @@ | ||
| --- | ||
| id: B-0722 | ||
| priority: P2 | ||
| status: open | ||
| title: "CI ephemeral cluster smoke - k3d-on-runner for every AI-cluster PR; evolve to vcluster-on-shared-host when persistent dev cluster exists" | ||
| created: 2026-05-25 | ||
| last_updated: 2026-05-25 | ||
| classification: buildable-now | ||
| decomposition: atomic | ||
| type: ci-substrate | ||
| discovered_by: aaron | ||
| owners: [aaron, maintainer] | ||
| composes_with: | ||
| - full-ai-cluster/dev-cluster/ | ||
| - full-ai-cluster/dev-cluster/SYNC-WAVES.md | ||
| - full-ai-cluster/dev-cluster/README.md | ||
| - full-ai-cluster/k8s/applications/argocd/Application.yaml | ||
|
|
||
| --- | ||
|
|
||
| # B-0722 - CI ephemeral cluster smoke (k3d-on-runner now, vcluster-on-shared-host later) | ||
|
|
||
| ## Carved blade | ||
|
|
||
| > Every PR that touches the AI cluster's Application graph should spin up an ephemeral cluster, reconcile the root App-of-Apps, and assert sync waves resolve - BEFORE the change hits prod. k3d-on-runner is sufficient for v1; vcluster cuts the cycle from ~5 min to ~30 sec once the persistent dev cluster exists. | ||
|
|
||
| ## Origin | ||
|
|
||
| Aaron 2026-05-25, during the dev-cluster scaffolding session (PR #4953): *"also tests should be able to use kind/k3d to do ephemeral clusters on prs"*. Then: *"we will do k8s in k8s later k8s in docker if fine for ci now"*. | ||
|
|
||
| The dev-cluster substrate that landed in PR #4953 is CI-ready by design: `up.sh` accepts a git-ref argument today; the `--config <profile>` flag is part of Phase 1's small refactor (planned in this row, not yet implemented). CI just needs to call it with a single-node profile and run sync-wave assertions. | ||
|
|
||
| ## What lands (when this row is picked up) | ||
|
|
||
| ### Phase 1 - k3d-on-runner (the v1 ask) | ||
|
|
||
| 1. **`full-ai-cluster/dev-cluster/profiles/ci.k3d-config.yaml`** - minimal single-node k3d profile sized for GitHub-hosted runners (2 CPU / 7 GB). No agents, no local registry, same Cilium-takeover K3S flags. | ||
|
|
||
| 2. **`full-ai-cluster/tools/ci/cluster-smoke.sh`** - wraps `full-ai-cluster/dev-cluster/up.sh --config full-ai-cluster/dev-cluster/profiles/ci.k3d-config.yaml`, then: | ||
| - Builds the sync-wave plan by parsing every `Application.yaml`'s `argocd.argoproj.io/sync-wave` annotation | ||
| - Polls each app per wave, asserting Healthy/Synced OR Healthy/OutOfSync (acceptable for placeholder Deployments at `replicas: 0`) | ||
| - Captures `argocd-applications.json`, `nodes.txt`, `pods.txt`, `recent-events.txt` to `ARTIFACT_DIR` | ||
| - Tears down on EXIT trap (skip with `SKIP_TEARDOWN=1`) | ||
| - Exit codes: 0 = pass; 1 = converge timeout; 2 = pre-flight fail | ||
|
|
||
| 3. **`.github/workflows/ai-cluster-smoke.yml`** - triggers on `pull_request` with path filter (`full-ai-cluster/k8s/applications/**`, `full-ai-cluster/dev-cluster/**`, `full-ai-cluster/tools/ci/**`, this workflow file). Concurrency group cancels in-flight runs on new commits. Installs k3d + kubectl + helm + jq, runs `cluster-smoke.sh`, uploads artifacts, and posts a PR comment on failure with the sync-wave plan + recent events. | ||
|
|
||
| - **Security**: every github-context value (head SHA, etc.) reaches `run:` via an `env:` block, never inlined, per [`docs/security/GITHUB-ACTIONS-SAFE-PATTERNS.md`](../../security/GITHUB-ACTIONS-SAFE-PATTERNS.md). Use `${{ github.event.pull_request.head.sha }}` only inside `env:`, then reference it as `$GIT_REF` in `run:`. | ||
|
|
||
| 4. **Small `up.sh` refactor** - add `--config <path>` flag; read `metadata.name` from the chosen config so `down.sh` + smoke script stay in sync regardless of cluster name. (Default behavior preserved: no flag = current `k3d-config.yaml`.) | ||
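The sync-wave plan step in item 2 could be sketched roughly as follows. This is a minimal illustration, not the planned `cluster-smoke.sh`: the function name `build_wave_plan` is hypothetical, and it assumes each `argocd.argoproj.io/sync-wave` annotation sits on a single line of its `Application.yaml`.

```shell
# Sketch only: derive a sync-wave plan from Application.yaml annotations.
# Assumption: each annotation sits on one line, for example
#   argocd.argoproj.io/sync-wave: "3"

# Print "<wave> <path>" for every Application.yaml under $1, lowest wave first.
# A missing or non-integer wave is reported on stderr and makes the function
# return non-zero, so a PR that forgets the annotation fails before any polling.
build_wave_plan() (
  root="$1"
  status=0
  plan=""
  files="$(find "$root" -type f -name 'Application.yaml' | sort)"
  while IFS= read -r file; do
    [ -n "$file" ] || continue
    # Strip the key, then drop quotes, spaces, and carriage returns.
    wave="$(sed -n 's|^[[:space:]]*argocd\.argoproj\.io/sync-wave:[[:space:]]*||p' "$file" \
      | head -n 1 | tr -d "\"' \r")"
    case "$wave" in
      ''|-|*[!0-9-]*|?*-*)
        printf 'missing or invalid sync-wave: %s\n' "$file" >&2
        status=1
        ;;
      *)
        plan="${plan}${wave} ${file}
"
        ;;
    esac
  done <<EOF
$files
EOF
  # Numeric sort on the wave, path as tie-breaker for stable output.
  printf '%s' "$plan" | sort -k1,1n -k2
  [ "$status" -eq 0 ]
)
```

Called as `build_wave_plan full-ai-cluster/k8s/applications`, the output is one line per app in reconcile order, which the polling loop could then walk wave by wave. Treating a missing annotation as a hard failure is one possible policy; the row itself only says such a PR must be caught.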
|
|
||
| ### Phase 2 - vcluster-on-shared-host (the "later" path) | ||
|
|
||
| When the bare-metal cluster comes up and is reachable from CI: | ||
|
|
||
| - Replace the k3d-on-runner spin-up with `vcluster create pr-${{ github.event.pull_request.number }}` on a long-running host cluster | ||
| - Each PR gets its own isolated vcluster on shared infrastructure | ||
| - Spin-up drops from ~3-5 min (k3d full cluster) to ~30 sec (vcluster pod-on-existing-cluster) | ||
| - Tear-down is `vcluster delete pr-<num>`; one command, instant | ||
| - Same `cluster-smoke.sh` runs against vcluster's kubeconfig; no other code changes | ||
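A dry-run sketch of the per-PR naming described above. The helper names `vcluster_name` and `vcluster_cmd` are hypothetical, the PR number is assumed to arrive through an environment variable rather than inline interpolation, and the commands are only printed, so nothing here depends on a host cluster existing yet.

```shell
# Sketch only: build the per-PR vcluster name and print lifecycle commands.
# Assumption: the caller passes the PR number from an env var such as
# PR_NUMBER (hypothetical name), never from inline workflow interpolation.

# Emit "pr-<num>", rejecting anything that is not a plain integer.
vcluster_name() (
  pr="$1"
  case "$pr" in
    ''|*[!0-9]*)
      printf 'invalid PR number: %s\n' "$pr" >&2
      exit 2
      ;;
  esac
  printf 'pr-%s\n' "$pr"
)

# Print (do not run) the create or delete command for a PR.
vcluster_cmd() (
  action="$1"
  name="$(vcluster_name "$2")" || exit 2
  case "$action" in
    create|delete)
      printf 'vcluster %s %s\n' "$action" "$name"
      ;;
    *)
      printf 'unsupported action: %s\n' "$action" >&2
      exit 2
      ;;
  esac
)
```

For example, `vcluster_cmd create "$PR_NUMBER"` prints the create command for review or piping into `sh`. Exiting with 2 on bad input mirrors the pre-flight-fail code listed for `cluster-smoke.sh`, which is a choice made for this sketch.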
|
|
||
| References for the Phase 2 design: | ||
|
|
||
| - **vcluster (Loft)** - https://www.vcluster.com/ - virtual K8s clusters as pods | ||
| - **Cluster API (CAPI)** - https://cluster-api.sigs.k8s.io/ - declarative cluster management via CRDs | ||
| - **Kamaji** / **k0smotron** - managed control planes inside a host cluster (lighter alternatives to CAPI) | ||
|
|
||
| ## Why P2 not P1 | ||
|
|
||
| The dev-cluster substrate (PR #4953) already lets a maintainer manually run `./up.sh feat/my-branch` to test a PR locally. Automating that in CI is a clear win but not blocking: the substrate exists for manual dev-test today, and the prod cluster doesn't exist yet, so there's no urgent "block bad changes from reaching prod" pressure. | ||
|
|
||
| Becomes P1 when: | ||
|
|
||
| - Prod cluster bootstrap completes (bare-metal install finished) | ||
| - Multiple maintainers / agents are landing AI-cluster PRs in parallel (manual dev-test stops scaling) | ||
|
|
||
| ## Acceptance | ||
|
|
||
| - [ ] `full-ai-cluster/dev-cluster/up.sh --config full-ai-cluster/dev-cluster/profiles/ci.k3d-config.yaml` works locally and brings up a single-node cluster | ||
| - [ ] `full-ai-cluster/tools/ci/cluster-smoke.sh` runs end-to-end against a fresh checkout and exits 0 on a clean main | ||
| - [ ] `.github/workflows/ai-cluster-smoke.yml` triggers on a PR touching `full-ai-cluster/k8s/applications/**`, completes within 45 min, posts artifacts | ||
| - [ ] A deliberately broken PR (e.g., sync-wave annotation missing on a new app, or chart-values typo) is caught by the workflow before merge | ||
| - [ ] Workflow concurrency cancels in-flight runs on new commits to the same PR | ||
| - [ ] Every github-context value reaches `run:` via `env:` (no inline interpolation, per [`docs/security/GITHUB-ACTIONS-SAFE-PATTERNS.md`](../../security/GITHUB-ACTIONS-SAFE-PATTERNS.md) workflow-injection guidance) | ||
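For the last acceptance item, the shell side of the `env:` pattern might look like the sketch below. The validation step is an extra hardening idea, not something the row requires, and `is_full_sha` is a hypothetical helper. It assumes CI passes the PR head SHA as `GIT_REF` and that the repository uses 40-character SHA-1 object names.

```shell
# Sketch only: consume GIT_REF from the environment inside a run: step.
# Assumption: CI sets GIT_REF to the PR head SHA (40 lowercase hex chars).

# Succeed only for a full lowercase hex SHA-1.
is_full_sha() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
  [ "${#1}" -eq 40 ]
}

# Typical use in the workflow step (always quote the variable):
#   if is_full_sha "${GIT_REF:-}"; then
#     full-ai-cluster/dev-cluster/up.sh "$GIT_REF"
#   else
#     echo "GIT_REF is not a full commit SHA" >&2
#     exit 2
#   fi
```

Because the value only ever appears as a quoted shell variable, a hostile branch name or PR title cannot change the script text that the runner executes.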
|
|
||
| ## Estimated scope | ||
|
|
||
| - Phase 1: ~1 day of dedicated work; ~500 lines (1 yaml profile, 1 shell script, 1 workflow, small up.sh refactor) | ||
| - Phase 2: separate row, depends on Phase 1 + persistent shared cluster existing | ||
|
|
||
| ## References | ||
|
|
||
| - PR #4953 - dev-cluster substrate this row builds on (k3d, ArgoCD bootstrap, sync-wave annotations across 34 Applications, SYNC-WAVES.md) | ||
|
|
||
| - PR #4950 - disko cookie-cutter (bare-metal install path; complements but doesn't block this) | ||
| - PR #4951 - NFD + lstopo + zeta-install (compose with smoke test for hardware-feature assertions) | ||
| - PR #4930 - hat-system operator (one of the apps the smoke test must reconcile) | ||
| - `full-ai-cluster/dev-cluster/SYNC-WAVES.md` - dependency graph the smoke test asserts against | ||
| - `full-ai-cluster/dev-cluster/DOCKER-DESKTOP.md` - resource sizing context relevant to CI runner constraints | ||
|
|
||
|
|
||
| ## Composes with substrate | ||
|
|
||
| - The dev/prod parity model from PR #4953 (same workloads from same git ref via ArgoCD) | ||
| - The sync-wave annotations on all 34 Applications (PR #4953); the smoke test asserts the graph reconciles in order | ||
| - `dev-cluster/README.md` "Multi-cluster ArgoCD pattern (future)" section: the Phase 2 evolution path | ||
|
|
||
| ## Not in scope | ||
|
|
||
| - GPU-dependent workloads (ollama / vllm / deepseek-coder / qwen-coder): these stay excluded from CI per the dev-cluster root App-of-Apps `exclude:` glob | ||
| - Longhorn: single-node CI has nothing to replicate; local-path-provisioner handles PVCs | ||
| - Real model serving: no GPUs on GitHub-hosted runners | ||
| - Production cluster smoke: separate row; production reconciliation runs continuously via ArgoCD on the bare-metal cluster, not via CI | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,85 @@ | ||
| | 2026-05-25T22:08Z | opus-4-7 / autonomous-loop | 07a747f3 | Otto-CLI fresh cold-boot under explanatory-output-style; CronList empty (catch-43 fired) → sentinel re-armed; PR #4954 (B-0722) had 2 failing required checks (markdownlint MD012 at `docs/BACKLOG.md:695` + BACKLOG.md generated-index drift from main-merge); both resolved via single `BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts` run; commit `8282de945` pushed; auto-merge stayed armed; counter reset via substantive engagement | this shard | named-dep-driven-fix-not-brief-ack | | ||
|
|
||
| # Tick 2208Z - 2026-05-25 Otto-CLI; PR #4954 markdownlint + BACKLOG.md drift cleared via single regeneration | ||
|
|
||
| **Surface:** Otto-CLI fresh cold-boot (autonomous-loop fired by scheduled task) | ||
| **Branch (root checkout):** `backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` (PR #4954 branch; my Otto-CLI lane; clean working tree of tracked modifications; only untracked Lior sidetick dirs) | ||
| **Tier (rate-limit):** Normal (GraphQL 4702/5000, 48min reset; REST core 4832/5000) | ||
| **Tier (dotgit):** Recovered (0 stuck `git pack-objects`/`git maintenance`/`git repack` procs) | ||
| **Tier (peer-saturation):** 54 peer claude/gemini/kiro/codex/alexa procs (active but not corrupting `.git/`) | ||
| **Sentinel:** `07a747f3` armed at session-start per [`.claude/rules/tick-must-never-stop.md`](../../../../../../.claude/rules/tick-must-never-stop.md); `CronList` returned empty (catch-43 confirmed; session-exit-non-persistence cadence per the in-repo rule body) | ||
| **Build gate:** not run this tick (substrate is `.md` regeneration only; no F#/dotnet touched) | ||
|
|
||
| ## Refresh-before-decide findings (step 1) | ||
|
|
||
| - `git fetch origin main` → clean; origin/main HEAD `5b6ea6a0c` (B-0742 k8s local stack reference) | ||
| - Current branch `backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` 3 commits ahead of main (including the post-merge `74aac8346`) | ||
| - `gh pr list` → PR #4954 OPEN, MERGEABLE, auto-merge armed by Aaron at 16:46Z, status BLOCKED | ||
| - 2 FAILURE required checks (named-dependency per [`holding-without-named-dependency-is-standing-by-failure.md`](../../../../../../.claude/rules/holding-without-named-dependency-is-standing-by-failure.md) condition #3: concrete artifact, bounded scope, in own lane): | ||
| 1. `lint (markdownlint)`: `docs/BACKLOG.md:695 MD012/no-multiple-blanks` (2 blank lines before `## P3 - convenience / deferred` heading; expected 1) | ||
| 2. `check docs/BACKLOG.md generated-index drift`: 5 rows out of canonical generator order after main-merge picked up B-0506/B-0499/B-0514/B-0515/B-0517/B-0519 reorderings | ||
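The MD012 finding can be reproduced locally without the linter using a small awk pass. This is a simplified sketch: `find_multiple_blanks` is a hypothetical helper, and unlike markdownlint it makes no attempt to treat fenced code blocks specially.

```shell
# Sketch only: report line numbers where a blank line directly follows
# another blank line (the condition MD012/no-multiple-blanks flags).
# Simplification: fenced code blocks are not handled specially.
find_multiple_blanks() {
  awk '
    /^[[:space:]]*$/ {
      if (prev_blank) print NR   # second (or later) blank line in a run
      prev_blank = 1
      next
    }
    { prev_blank = 0 }
  ' "$1"
}
```

Running `find_multiple_blanks docs/BACKLOG.md` before pushing gives a quick signal about whether a merge left stray blank lines, with empty output meaning no consecutive blanks were found.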
|
|
||
| ## Disposition (steps 2-3) | ||
|
|
||
| NOT brief-ack. NAMED DEPENDENCY surfaced concrete bounded work in own Otto-CLI lane. Reset counter via substantive engagement per [`holding-without-named-dependency-is-standing-by-failure.md`](../../../../../../.claude/rules/holding-without-named-dependency-is-standing-by-failure.md). | ||
|
|
||
| ## Investigation + fix | ||
|
|
||
| Both failures share a single root cause: the post-main-merge commit `74aac8346` carried main's row state into the working tree but did NOT re-run `tools/backlog/generate-index.ts`, so: | ||
|
|
||
| - The blank-line normalization (Phase-2 generator emits exactly one blank line before each priority heading) was lost | ||
| - The post-merge ordering / status-bullet state didn't match canonical generator output | ||
|
|
||
| Resolution path (single command, fixes both): | ||
|
|
||
| ```bash | ||
| BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts | ||
| # wrote <repo>/docs/BACKLOG.md | ||
| ``` | ||
|
|
||
| Diff: `1 file changed, 1 deletion(-)`; the net effect is that the duplicate blank line at line 695 was removed; all other row state stayed canonical (the `--check` complaints were the generator's normalization vs the human-merged file). | ||
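The check-then-regenerate sequence from this tick could be wrapped as a post-merge helper along these lines. It is a sketch under assumptions: `regenerate_if_drifted` and the `GENERATOR` override are hypothetical, and it relies only on the two invocations quoted in this shard (`--check`, and a write guarded by `BACKLOG_WRITE_FORCE=1`).

```shell
# Sketch only: regenerate the backlog index when --check reports drift.
# Assumption: GENERATOR may be overridden; the default is the command
# quoted in this shard.
GENERATOR="${GENERATOR:-bun tools/backlog/generate-index.ts}"

regenerate_if_drifted() {
  # $GENERATOR is intentionally unquoted so "bun <script>" splits into words.
  if $GENERATOR --check >/dev/null 2>&1; then
    echo "index already matches generator output"
    return 0
  fi
  # The generator refuses to overwrite without this explicit opt-in.
  BACKLOG_WRITE_FORCE=1 $GENERATOR >/dev/null || return 1
  # Confirm the rewrite actually cleared the drift.
  $GENERATOR --check >/dev/null 2>&1
}
```

Run from the repo root after merging `main`; a non-zero return means the regeneration did not bring the file back in line and needs a manual look.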
|
|
||
| ### Verification | ||
|
|
||
| - `bun tools/backlog/generate-index.ts --check` → `ok: docs/BACKLOG.md matches generator output` | ||
| - `npx markdownlint-cli2 docs/BACKLOG.md` → exit 0 (no output) | ||
| - `git branch --show-current` immediately before commit = `backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` (branch guard OK per [`.claude/rules/zeta-expected-branch.md`](../../../../../../.claude/rules/zeta-expected-branch.md)) | ||
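The branch guard in the last bullet amounts to a compare-before-commit check, roughly as sketched here. `branch_guard` is a hypothetical name, and how the real rule consumes `ZETA_EXPECTED_BRANCH` is not shown in this shard, so that wiring is an assumption.

```shell
# Sketch only: refuse to proceed unless the checkout is on the expected branch.
# $1 = expected branch; $2 = current branch (optional; defaults to asking git).
branch_guard() (
  expected="$1"
  current="${2:-$(git branch --show-current)}"
  if [ "$current" != "$expected" ]; then
    printf 'refusing to commit: on %s, expected %s\n' "$current" "$expected" >&2
    exit 1
  fi
)

# Possible use immediately before committing (variable wiring is assumed):
#   branch_guard "$ZETA_EXPECTED_BRANCH" && git commit -m "..."
```

Running the guard in the same command line as the commit keeps the window small in which a peer process could switch branches between the check and the write.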
|
|
||
| ## Landings (step 4) | ||
|
|
||
| - Commit `8282de945` `fix(B-0722): regenerate BACKLOG.md to clear MD012 + index-drift` | ||
| - Pushed to `origin/backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` (`74aac8346..8282de945`) | ||
| - PR #4954 head SHA now `8282de9457ff595d6abb837ebcbcf7b782c1db5c` | ||
| - Auto-merge stayed armed (squash, enabledBy AceHack at 16:46:19Z); will fire when checks pass on new SHA | ||
|
|
||
| `★ Insight ─────────────────────────────────────` | ||
| • **Post-merge BACKLOG drift is mechanical**: when a feature branch merges `main` AND the index file landed since the branch was last regenerated, the generator output diverges from the file. Always re-run `BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts` after any main-merge that touches `docs/backlog/**`. | ||
| • **Phase-1a guard requires opt-in**: `tools/backlog/generate-index.ts` refuses to overwrite the existing BACKLOG.md without `BACKLOG_WRITE_FORCE=1`. This is intentional substrate-protection: the generator is Phase-1a and the guard prevents accidental clobber during Phase-2 migration work. The opt-in env var is the single-line authorization. | ||
| • **Two named-dep checks, one root cause, one commit**: `lint (markdownlint)` and `check docs/BACKLOG.md generated-index drift` both surfaced the same underlying drift. Diagnosing both failures pointed at the same fix, so a single regeneration commit cleared both. This composes with [`blocked-green-ci-investigate-threads.md`](../../../../../../.claude/rules/blocked-green-ci-investigate-threads.md): investigate the failures, then act with the smallest possible fix. | ||
| `─────────────────────────────────────────────────` | ||
|
|
||
| ## Step 5 β this shard | ||
|
|
||
| Written at `docs/hygiene-history/ticks/2026/05/25/2208Z.md` (canonical write surface per [`.claude/rules/tick-must-never-stop.md`](../../../../../../.claude/rules/tick-must-never-stop.md)). | ||
|
|
||
| ## Step 6 β CronList check | ||
|
|
||
| `CronList` returned empty at session-start (catch-43 confirmed). Sentinel `07a747f3` (`* * * * *`, `<<autonomous-loop>>`, recurring, session-only β auto-expires in 7d) armed immediately as FIRST tool call. This continues the 5+ catch-43-firing cold-boot cadence documented across today's user-scope memory anchors (`feedback_cold_boot_cascade_continues_independent_of_dotgit_clearance_5th_today_dotgit_recovered_named_dep_pr_4937_wait_ci_otto_cli_2026_05_25.md`). | ||
|
|
||
| ## Step 7 β Visibility signal | ||
|
|
||
| **Concrete artifacts this tick:** | ||
|
|
||
| - Commit `8282de945` on `backlog/b0722-ci-ephemeral-cluster-smoke-2026-05-25-c2` (pushed to origin) | ||
| - Tick shard at `docs/hygiene-history/ticks/2026/05/25/2208Z.md` (THIS file) | ||
| - Sentinel re-armed (`07a747f3`) | ||
|
|
||
| PR #4954 progression: BLOCKED with 2 failed required checks → BLOCKED waiting for re-runs on new SHA → auto-merge fires when green. No additional action needed from Otto until next named-dep surfaces. | ||
|
|
||
| ## Composes with | ||
|
|
||
| - [`.claude/rules/holding-without-named-dependency-is-standing-by-failure.md`](../../../../../../.claude/rules/holding-without-named-dependency-is-standing-by-failure.md): named-dependency #3 (concrete bounded artifact, own lane); reset counter via substantive engagement | ||
| - [`.claude/rules/blocked-green-ci-investigate-threads.md`](../../../../../../.claude/rules/blocked-green-ci-investigate-threads.md): BLOCKED-with-green-CI investigation discipline at the FAILED-CI scope | ||
| - [`.claude/rules/zeta-expected-branch.md`](../../../../../../.claude/rules/zeta-expected-branch.md): `git branch --show-current` guard immediately before commit (env-var ZETA_EXPECTED_BRANCH not set; direct guard sufficient because no peer Otto on this same branch) | ||
| - [`.claude/rules/tick-must-never-stop.md`](../../../../../../.claude/rules/tick-must-never-stop.md): sentinel re-arm on `CronList` empty | ||
| - [`.claude/rules/refresh-before-decide.md`](../../../../../../.claude/rules/refresh-before-decide.md): fetched origin/main + queried PR status + rate-limit tier before acting |