backlog(B-0722): CI ephemeral cluster smoke via k3d-on-runner; evolve to vcluster #4954
… to vcluster

Files Aaron's "tests should be able to use kind/k3d to do ephemeral clusters on prs" + "we will do k8s in k8s later k8s in docker if fine for ci now" as a P2 backlog row. Builds on PR #4953's dev-cluster substrate (up.sh / down.sh / sync-wave annotations).

Phase 1 = k3d-on-runner workflow (immediate ask); Phase 2 = vcluster-on-shared-host when persistent dev cluster exists (faster PR cycles: ~30s vs ~5min).

Captures: profile config, smoke script with sync-wave assertion, GH Actions workflow with concurrency + path filter + secure env-var pattern for github-context values, small up.sh refactor for --config flag. Acceptance criteria + non-scope items documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
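The smoke script planned in this row is described further down as tearing down on an EXIT trap (skippable with `SKIP_TEARDOWN=1`) and exiting 0 on pass, 1 on converge timeout, and 2 on pre-flight failure. A minimal sketch of that control flow follows. The function names, the `CLUSTER_NAME` and `CONVERGE_TIMEOUT` variables, and the node-readiness check standing in for the sync-wave assertion are all assumptions for illustration; this is not the repo's `cluster-smoke.sh`.

```shell
# Hypothetical skeleton only. A real script would likely also start with
# `set -euo pipefail`; it is left out so the functions can be sourced safely.

CLUSTER_NAME="${CLUSTER_NAME:-pr-smoke}"      # assumed cluster name
CONVERGE_TIMEOUT="${CONVERGE_TIMEOUT:-300}"   # assumed timeout in seconds

# Delete the ephemeral cluster unless the caller asked to keep it.
teardown() {
  if [ "${SKIP_TEARDOWN:-0}" = "1" ]; then
    echo "SKIP_TEARDOWN=1: leaving cluster ${CLUSTER_NAME} up" >&2
    return 0
  fi
  k3d cluster delete "${CLUSTER_NAME}" >/dev/null 2>&1 || true
}

# Return 2 (pre-flight fail) if any required tool is missing from PATH.
preflight() {
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "pre-flight: missing tool: $tool" >&2
      return 2
    fi
  done
  return 0
}

# Stand-in for the sync-wave assertion: wait for all nodes to be Ready.
# Return 1 (converge timeout) if they are not Ready in time.
wait_for_convergence() {
  kubectl wait --for=condition=Ready node --all \
    --timeout="${CONVERGE_TIMEOUT}s" || return 1
}

main() {
  trap teardown EXIT                             # runs on every exit path
  preflight k3d kubectl helm jq || exit 2
  k3d cluster create "${CLUSTER_NAME}" || exit 2 # assumption: treated as pre-flight
  wait_for_convergence || exit 1
  echo "smoke: pass"
}
```

Calling `main` at the end of the file would run the smoke. Installing the trap before any cluster is created means a failed `k3d cluster create` still triggers cleanup of whatever was partially created.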
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fea52af477
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Pull request overview
Adds a new P2 backlog row (B-0722) capturing a plan to run an ephemeral Kubernetes cluster smoke test in CI for AI-cluster PRs (k3d-on-runner now, with a future evolution to vcluster-on-shared-host).
Changes:
- Introduces `docs/backlog/P2/B-0722-*.md` with frontmatter + detailed Phase 1/Phase 2 implementation plan.
- Documents workflow triggering, artifact capture, teardown behavior, and acceptance criteria for the future CI smoke workflow.
Codex/markdownlint flagged two lines where bullet lists weren't preceded by blank lines (MD032). Also regenerated docs/BACKLOG.md via `BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts` to include B-0722. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…st-1636Z; all peer ai-cluster -c2 batch (audit-only, 4th precedent application) (#4957) 49 open PRs (net +2 from 1636Z); 3 BLOCKED+resolve-threads (#4954 / #4955 / #4956), all peer ai-cluster -c2 train continuation of merged #4951. 11 threads deep-audited, 0 FPs — all substantive (full-ai-cluster/dev-cluster/ referenced ahead of substrate landing in #4953). Audit-only disposition per 1405Z/1539Z/1636Z/0441Z precedent (4th application today, 5th overall). Build gate green (0/0/00:00:25.48). Co-authored-by: Otto <noreply@anthropic.com>
…ate-honest

Codex/Copilot flagged 5 dangling cross-references after the prior fix:

- composes_with B-0722 path (in PR #4954, not on main): replaced with a comment noting pending merge
- body refs to B-0722, B-0723: qualified with 'PR #4954/#4955 pending merge' so the intent is preserved + state is honest
- body refs to dev-cluster/ + PR #4953: #4953 was closed pending redesign; replaced 'dev-cluster/' references with 'local k3d / kind cluster' + raw 'k3d cluster create' fallback for now

Substrate-honest framing: row's design intent stays intact; reader isn't promised a path that won't resolve until upstream merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rn proof for Max (#4960)

* backlog(B-0724): TS hat-system operator: polyglot K8s-operator pattern proof

Aaron 2026-05-25:

> "yes lets combine he will like kubernets operators but he does not have experience maybe we write a ts operator insteadd of go he likes ts"

> "we want polyglot operator support for k8s anyways so we are not rigid about go"

Reframes Max's TS preference accommodation into "first deliberate proof of the polyglot-operator pattern the cluster commits to anyway." Two operators against the same CRDs forces the schema to be the canonical contract; no language-specific quirks bleed through.

Captures:

- Pattern (CRD-as-canonical-contract + multiple language impls watching same CRDs; leader election for active reconciler)
- Why polyglot at cluster scope (contract enforcement, failure-domain isolation, talent flexibility, ecosystem coverage)
- TS operator stack (kubernetes/client-node, NestJS optional, fastify webhook, nats.js + pino for tick emit, coordination.k8s.io Lease for leader election)
- Composition with shipped substrate (PR #4930 Go scaffold as reference/baseline; PR #4958 agentic-organization CLUSTER_NATIVE_HAT_SYSTEM doc; B-0722 smoke test as polyglot validation gate; B-0723 multi-kubelet × polyglot operators for max redundancy)
- Acceptance criteria for the TS scaffold
- Future Rust (kube-rs) + Python (kopf) as same-pattern extensions
- P2 because Go scaffold is already functional; not blocking
- Max owns the TS implementation at his preferred pace

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(B-0724): MD012 (consecutive blanks) + MD032 (blank-before-list)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(B-0724): rewrite dangling refs to closed/pending PRs to be substrate-honest

Codex/Copilot flagged 5 dangling cross-references after the prior fix:

- composes_with B-0722 path (in PR #4954, not on main): replaced with a comment noting pending merge
- body refs to B-0722, B-0723: qualified with 'PR #4954/#4955 pending merge' so the intent is preserved + state is honest
- body refs to dev-cluster/ + PR #4953: #4953 was closed pending redesign; replaced 'dev-cluster/' references with 'local k3d / kind cluster' + raw 'k3d cluster create' fallback for now

Substrate-honest framing: row's design intent stays intact; reader isn't promised a path that won't resolve until upstream merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* B-0724: add team language-affinity map + 'limit Go necessity' framing

Aaron 2026-05-25:

> 'max love ts and cs i love fs and cs we both like rust and python for where they make sense'

> 'we understand go is necessary in some places for k8s but we would like to limit its necessity'

Updates the polyglot operator language table:

- Names Aaron + Max's individual + shared strong languages
- Adds C# / F# via KubeOps.NET as future operator #2: the team's overlap language (both love C#); kubebuilder-class framework on .NET removes Go from operator authoring entirely for this work
- Sharpens the polyglot motivation: Go is starter / minimize over time; ecosystem-forced where genuinely required, not chosen

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both review threads (Copilot P1 + Codex P2) on the same line of docs/backlog/P3/B-0723-...md correctly flagged dangling composes_with references:

- full-ai-cluster/dev-cluster/: never on main (PR #4953 closed unmerged)
- docs/backlog/P2/B-0722-...md: in-flight via PR #4954, not on main yet

Both would surface as missing-target noise to backlog hygiene auditors.

Fix: keep full-ai-cluster/ (exists on main) in composes_with; move the in-flight PR cross-refs to a new related_prs: key that names GitHub state rather than filesystem state. When PR #4954 lands B-0722 on main, a follow-up can promote that PR-ref back into composes_with.

Co-Authored-By: Claude <noreply@anthropic.com>
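The distinction drawn here, `composes_with` naming filesystem state and `related_prs:` naming GitHub state, implies a check that every `composes_with` target resolves on disk. A rough sketch of such a check is below. The `find_dangling_refs` helper is hypothetical and is not the repo's actual hygiene auditor; it only illustrates why the two flagged paths would show up as missing targets while `full-ai-cluster/` would not.

```shell
# Hypothetical helper: print every reference that does not exist under the
# given repo root. Returns 1 if any reference is dangling, 0 otherwise.
find_dangling_refs() {
  local root="$1"
  shift
  local ref status=0
  for ref in "$@"; do
    # Strip one trailing slash so "dir/" and "dir" are treated alike.
    if [ ! -e "${root}/${ref%/}" ]; then
      printf 'dangling: %s\n' "$ref"
      status=1
    fi
  done
  return "$status"
}
```

Run against a checkout of main with the original `composes_with` entries, a check of this shape would report `full-ai-cluster/dev-cluster/` and pass `full-ai-cluster/`, which matches the fix described above.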
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

This PR has two failing checks: 'backlog-pr-hygiene-p2' and 'validate-doc-imports'. The 'backlog-pr-hygiene-p2' check is failing because the PR is not following the backlog PR hygiene rules. Please review the rules and update the PR accordingly. The 'validate-doc-imports' check is a false positive and should be updated to ignore files in the 'docs/backlog/P2' directory.
Post-main-merge drift: 5 row-ordering changes (B-0499/B-0506/B-0514/B-0515/B-0517/B-0519) + one extra blank line at line 695 (`## P3 — convenience / deferred`). Single-source fix: `BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts`. Removes one blank line; brings rows into canonical generator order. Co-Authored-By: Claude <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 706b7f4517
The "**Concrete artifacts this tick:**" bold-text line directly preceded the bullet list; markdownlint MD032 requires lists be surrounded by blank lines. Co-Authored-By: Claude <noreply@anthropic.com>
…eference accuracy

Copilot+Codex review findings on PR #4954 (verified against repo state in isolated worktree):

- Path prefixes: `up.sh`, `dev-cluster/`, `tools/ci/` references now consistently use `full-ai-cluster/` prefix matching actual subtree location
- security-reminder hook: replaced with concrete link to docs/security/GITHUB-ACTIONS-SAFE-PATTERNS.md (the actual workflow-injection guidance doc; security-reminder hook does not exist as a separate artifact)
- "after the local refactor in this row's PR": reworded; this PR is the backlog row, not the implementation; `--config` flag is part of Phase 1's planned refactor
- Tick shard 2208Z absolute path: replaced `/Users/acehack/...` with `<repo>` placeholder

NOT addressed in this commit (FP / outdated):

- MD012 line 695 (already resolved at source by regeneration; marked isOutdated:true)
- sync-wave annotations claim: VERIFIED present on all 35 Application.yaml files

Co-Authored-By: Claude <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 508efd9c87
- Tears down on EXIT trap (skip with `SKIP_TEARDOWN=1`)
- Exit codes: 0 = pass; 1 = converge timeout; 2 = pre-flight fail

3. **`.github/workflows/ai-cluster-smoke.yml`**: triggers on `pull_request` with path filter (`full-ai-cluster/k8s/applications/**`, `full-ai-cluster/dev-cluster/**`, `full-ai-cluster/tools/ci/**`, this workflow file). Concurrency group cancels in-flight runs on new commits. Installs k3d + kubectl + helm + jq, runs `cluster-smoke.sh`, uploads artifacts, posts PR comment on failure with sync-wave plan + recent events.
Include bootstrap root app in smoke workflow path filter
Expand the proposed pull_request paths list to include the root App-of-Apps manifest (e.g. full-ai-cluster/k8s/bootstrap/root-application.yaml). As written, a PR that changes the root application graph entrypoint would not trigger this smoke workflow, which contradicts the row’s goal of validating graph-affecting changes before merge and leaves a real blind spot for bootstrap-level regressions.
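The blind spot can be illustrated with a small matcher. This is only an approximation: it uses shell `case` globbing (where `*` also matches `/`) as a stand-in for GitHub Actions path-filter semantics, and the function names `matches_any`, `triggers_current`, and `triggers_proposed` are hypothetical. The pattern lists mirror the row's current filter and the filter with the suggested root-application entry added.

```shell
# Return 0 if the file matches any of the given glob patterns.
matches_any() {
  local file="$1"
  shift
  local pattern
  for pattern in "$@"; do
    # Unquoted $pattern is interpreted as a glob by `case`.
    case "$file" in
      $pattern) return 0 ;;
    esac
  done
  return 1
}

# Path filter as written in the backlog row (approximated as shell globs).
triggers_current() {
  matches_any "$1" \
    'full-ai-cluster/k8s/applications/*' \
    'full-ai-cluster/dev-cluster/*' \
    'full-ai-cluster/tools/ci/*' \
    '.github/workflows/ai-cluster-smoke.yml'
}

# Same filter plus the root App-of-Apps manifest suggested in the review.
triggers_proposed() {
  triggers_current "$1" ||
    matches_any "$1" 'full-ai-cluster/k8s/bootstrap/root-application.yaml'
}
```

With these definitions, `triggers_current full-ai-cluster/k8s/bootstrap/root-application.yaml` fails while `triggers_proposed` on the same path succeeds, which is the gap the review comment describes.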
Summary
Files Aaron's "tests should be able to use kind/k3d to do ephemeral clusters on prs" + "we will do k8s in k8s later k8s in docker if fine for ci now" as a P2 backlog row.
Builds on PR #4953's dev-cluster substrate. Phase 1 = k3d-on-runner workflow (immediate ask); Phase 2 = vcluster-on-shared-host when persistent dev cluster exists.
PR contents:
- `docs/backlog/P2/B-0722-ci-ephemeral-cluster-smoke-via-k3d-on-runner-evolve-to-vcluster-2026-05-25.md` (the backlog row: substrate only, no implementation)
- `docs/BACKLOG.md` (regenerated index after main-merge to clear MD012 + drift on the generated index)
- `docs/hygiene-history/ticks/2026/05/25/2208Z.md` (Otto-CLI cold-boot tick shard documenting the CI-fix work)

Test plan

- `docs/backlog/P2/`
- `full-ai-cluster/dev-cluster/*` paths are accurate
- `docs/BACKLOG.md` matches `bun tools/backlog/generate-index.ts --check` output

🤖 Generated with Claude Code