
refactor: split restate worker [stacked on 4782]#4798

Merged
chronark merged 44 commits into main from chronark/split-restate-worker
Jan 23, 2026

Conversation


@chronark chronark commented Jan 20, 2026

Summary

Splits the ctrl service into two independently deployable components:

  • ctrl-api: ConnectRPC server handling synchronous requests (deployments, builds, cluster coordination)
  • ctrl-worker: Restate workflow executor handling async workflows (certificates, deployments, routing, versioning)

Why

The monolithic ctrl service combined request handling with long-running workflows, making it difficult to scale and deploy independently. Separating them allows:

  • Independent scaling of API vs workflow processing
  • Cleaner failure isolation between sync and async paths
  • Simpler deployment and rollback

Changes

Architecture

  • New cmd/ctrl CLI with api and worker subcommands
  • svc/ctrl/api/ - API server configuration and startup
  • svc/ctrl/worker/ - Restate worker with sub-packages:
    • certificate/ - ACME certificate management workflows
    • deploy/ - Deployment, promotion, rollback workflows
    • routing/ - Frontline route assignment
    • versioning/ - Per-region version tracking
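
The split above hinges on one binary with two entrypoints. A minimal, dependency-free sketch of that shape (the api/worker subcommand names come from the PR; dispatch, the flag names, and the defaults are illustrative, not the real cmd/ctrl code):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// dispatch routes to the api or worker subcommand, each with its own
// flag set. Split out from main so the routing logic is testable.
func dispatch(args []string) (string, error) {
	if len(args) < 1 {
		return "", fmt.Errorf("usage: ctrl <api|worker> [flags]")
	}
	switch args[0] {
	case "api":
		fs := flag.NewFlagSet("api", flag.ContinueOnError)
		addr := fs.String("addr", ":8080", "listen address")
		if err := fs.Parse(args[1:]); err != nil {
			return "", err
		}
		return "api listening on " + *addr, nil
	case "worker":
		fs := flag.NewFlagSet("worker", flag.ContinueOnError)
		restate := fs.String("restate", ":9080", "restate ingress address")
		if err := fs.Parse(args[1:]); err != nil {
			return "", err
		}
		return "worker serving on " + *restate, nil
	default:
		return "", fmt.Errorf("unknown subcommand %q", args[0])
	}
}

func main() {
	out, err := dispatch([]string{"api", "-addr", ":8080"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println(out) // api listening on :8080
}
```

Because each subcommand owns its flag set, ctrl-api and ctrl-worker can diverge in configuration without sharing a global flag namespace.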

New Packages

  • svc/ctrl/pkg/build/ - Depot.dev container build integration
  • svc/ctrl/pkg/s3/ - S3-compatible storage with Docker networking support
  • svc/ctrl/pkg/hash/ - Content hashing utilities

API Changes

  • svc/api/routes/v2_deploy_generate_upload_url/ - New endpoint for deployment uploads
  • svc/ctrl/services/deployment/ - Deployment service with S3 upload URL generation

Infrastructure

  • Separate k8s manifests: ctrl-api.yaml, ctrl-worker.yaml
  • Updated Tiltfile for local development
  • New proto definitions for build and deployment services

Documentation

Comprehensive doc.go and symbol documentation added to all new packages following the documentation guidelines.

Testing

Existing integration tests pass. New endpoint has test coverage for 200/400/401/403/404 responses.
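
The 200/400/401/403/404 coverage mentioned above is the classic table-driven shape. A self-contained sketch with a stub handler (the route path, header checks, and error conditions are stand-ins, not the real v2_deploy_generate_upload_url implementation):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// handler is a toy endpoint that exercises each status code the PR
// says the new route covers.
func handler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		w.WriteHeader(http.StatusBadRequest)
		return
	}
	switch r.Header.Get("Authorization") {
	case "Bearer no-perms":
		w.WriteHeader(http.StatusForbidden)
	case "Bearer ok":
		if r.URL.Query().Get("deployment") == "missing" {
			w.WriteHeader(http.StatusNotFound)
			return
		}
		w.WriteHeader(http.StatusOK)
	default:
		w.WriteHeader(http.StatusUnauthorized)
	}
}

// status drives one request through the handler via httptest.
func status(method, target, auth string) int {
	req := httptest.NewRequest(method, target, nil)
	if auth != "" {
		req.Header.Set("Authorization", auth)
	}
	rec := httptest.NewRecorder()
	handler(rec, req)
	return rec.Code
}

func main() {
	for _, tc := range []struct {
		method, target, auth string
		want                 int
	}{
		{"POST", "/v2/deploy.generateUploadUrl", "Bearer ok", 200},
		{"GET", "/v2/deploy.generateUploadUrl", "Bearer ok", 400},
		{"POST", "/v2/deploy.generateUploadUrl", "", 401},
		{"POST", "/v2/deploy.generateUploadUrl", "Bearer no-perms", 403},
		{"POST", "/v2/deploy.generateUploadUrl?deployment=missing", "Bearer ok", 404},
	} {
		fmt.Println(tc.want, status(tc.method, tc.target, tc.auth))
	}
}
```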

chronark and others added 12 commits January 18, 2026 19:21
This is a merge commit of the virtual branches in your workspace.

Due to GitButler managing multiple virtual branches, you cannot switch back and
forth between git branches and virtual branches easily. 

If you switch to another branch, GitButler will need to be reinitialized.
If you commit on this branch, GitButler will throw it away.

Here are the branches that are currently applied:
 - decouple-control-workers (refs/gitbutler/decouple-control-workers)
   branch head: 015bad2
For more information about what we're doing here, check out our docs:
https://docs.gitbutler.com/features/branch-management/integration-branch

This allows us to use Restate's operator to deploy it independently of our control plane.
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ chronark
❌ gitbutler-client

@coderabbitai
Contributor

coderabbitai bot commented Jan 20, 2026

📝 Walkthrough


Split the control plane into separate API and Worker services; replace legacy build backends with S3/Hydra-based build flows; introduce Restate-backed durable workflows and region-scoped List+Watch RPCs; add S3 presigned upload endpoints; update DB schema, dev manifests, Bazel targets, tests, and many service signatures and wiring.

Changes

Cohort / File(s) Summary
CLI & Commands
cmd/run/main.go, cmd/ctrl/main.go, cmd/ctrl/api.go, cmd/ctrl/worker.go, cmd/ctrl/doc.go, cmd/ctrl/BUILD.bazel
Add api and worker subcommands, move flags into subcommands, add api/worker actions and docs, update Bazel entries and command wiring.
Control Plane — API package
svc/ctrl/api/*, svc/ctrl/api/BUILD.bazel, svc/ctrl/api/config.go, svc/ctrl/api/run.go
New api package with Config types and Run() that wires DB, Restate ingress, caches, ACME, services, TLS/HTTP server and exposes API server.
Control Plane — Worker package
svc/ctrl/worker/*, svc/ctrl/worker/BUILD.bazel, svc/ctrl/worker/run.go, svc/ctrl/worker/config.go
New worker package and Run() wiring Restate worker, ACME, deploy/routing/versioning/certificate services, Vault, S3 build storage, lifecycle and bootstrap logic.
Build & Backends (proto & pkg)
svc/ctrl/proto/hydra/v1/*, svc/ctrl/pkg/build/*, removed svc/ctrl/proto/ctrl/v1/build.proto, removed svc/ctrl/services/build/*
Remove legacy BuildService proto/backends; add Hydra BuildDockerImage proto; new build package (Depot/Hydra flow) and refactored build logic.
S3 Storage package
svc/ctrl/pkg/s3/*, svc/ctrl/pkg/s3/BUILD.bazel
New s3.Storage interface and implementation with GenerateUploadURL/GenerateDownloadURL and Bazel target.
Deployment service
svc/ctrl/services/deployment/*, svc/ctrl/services/deployment/BUILD.bazel
Replace BuildService client with s3.Storage buildStorage, add CreateS3UploadURL handler, map DB statuses to proto, integrate Restate workflows for durable operations.
Cluster sync / RPCs
svc/ctrl/services/cluster/*
Add region-scoped streaming RPCs WatchDeployments/WatchSentinels, GetDesired* point queries, Report* status RPCs; refine auth/region header handling and docs.
Worker internals (deploy/routing/certificate/versioning)
svc/ctrl/worker/deploy/*, svc/ctrl/worker/routing/*, svc/ctrl/worker/certificate/*, svc/ctrl/worker/versioning/*
New/updated worker subpackages: deploy uses S3+Hydra build path, routing and certificate services added, versioning provides per-region counters and Restate handlers.
Cache & Hash removals
svc/ctrl/internal/caches/*, svc/ctrl/pkg/hash/*
Removed internal caches package and hash utilities (sources and BUILD entries deleted).
DB / Schema & state tracking
pkg/db/schema.sql, web/internal/db/src/schema/*
Make sentinels version unique per region (region+version), drop some region_version indexes, add new state_changes table/schema and re-export, update TS schema files.
Krane controller
svc/krane/internal/deployment/*
Rename Reconciler→Controller, add New/Start/Stop lifecycle, integrate ReportDeploymentStatus, add helper env building and backoff/resync doc updates.
API routes & services wiring
svc/api/routes/*, svc/api/run.go
Switch handlers to use DeploymentServiceClient, remove Build service client, register v2 deploy routes unconditionally, make GetDeployment DB-centric; update services struct.
Tests & test harness
svc/api/internal/testutil/*, svc/api/internal/testutil/seed/*, route tests
New test harness and rich seeding utilities, MockDeploymentClient added; tests updated to use mocks/seed data instead of previous ctrl build client.
Dev / infra manifests & env
dev/docker-compose.yaml, dev/Tiltfile, dev/k8s/manifests/*, Makefile, .github/workflows/job_bazel.yaml, dev/.env.depot.example
Rename ctrl to ctrl-api, add ctrl-worker and worker services, switch env vars to UNKEY_* names, update compose/Tilt/K8s/Makefile/workflow and health/port changes.
Bazel / Buildfile edits
many BUILD.bazel files across svc/ctrl/*, svc/api/*
Add/remove/rename many Bazel targets to reflect api/worker split, new pkg/s3 and pkg/build targets, remove legacy build backends and adjust deps/srcs.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant API
  participant DB
  participant Restate
  participant Worker
  participant S3
  participant Hydra

  Client->>API: CreateDeployment / generateUploadUrl
  API->>DB: Persist deployment (pending)
  API->>Restate: Start durable workflow (project-scoped)
  Restate->>Worker: Schedule workflow task
  Worker->>S3: GenerateUploadURL / presigned URL
  Client->>S3: Upload build context (PUT to presigned URL)
  Worker->>S3: Fetch build context
  Worker->>Hydra: BuildDockerImage (hydra)
  Hydra-->>Worker: image name / build id
  Worker->>DB: Update deployment status/topology
  Worker->>API: Report status (via DB or ReportDeploymentStatus)
  Worker-->>Restate: Workflow complete
  Restate-->>API: Workflow result
  API-->>Client: Return deployment status / hostnames

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 40.82%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title accurately summarizes the main objective of the PR: splitting the monolithic restate worker into two independently deployable components (ctrl-api and ctrl-worker).
  • Description check: ✅ Passed. The PR description provides comprehensive coverage of changes including architecture overview, why the split is needed, detailed change breakdown, new packages, API changes, infrastructure updates, and testing notes. All required sections are present and well-documented.


@chronark chronark changed the title chronark/split restate worker refactor: split restate worker [stacked on 4782] Jan 20, 2026
@chronark chronark marked this pull request as draft January 20, 2026 21:38
@vercel vercel bot temporarily deployed to Preview – engineering January 20, 2026 21:39 Inactive
@vercel

vercel bot commented Jan 20, 2026

The latest updates on your projects.

Project | Deployment | Review | Updated (UTC)
engineering | Ready | Preview, Comment | Jan 23, 2026 3:21pm

1 Skipped Deployment

Project | Deployment | Review | Updated (UTC)
dashboard | Ignored | Preview | Jan 23, 2026 3:21pm


@vercel vercel bot temporarily deployed to Preview – dashboard January 20, 2026 21:42 Inactive
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
svc/krane/internal/deployment/delete.go (1)

35-39: Unreachable or redundant code at line 39.

At line 39, err will always be nil because:

  1. If reportDeploymentStatus at line 28 fails, line 35-37 returns the error.
  2. If it succeeds, err is nil.

So client.IgnoreNotFound(err) effectively returns nil. This appears to be dead code or a copy-paste artifact.

🐛 Suggested fix
 	err = c.reportDeploymentStatus(ctx, &ctrlv1.ReportDeploymentStatusRequest{
 		Change: &ctrlv1.ReportDeploymentStatusRequest_Delete_{
 			Delete: &ctrlv1.ReportDeploymentStatusRequest_Delete{
 				K8SName: req.GetK8SName(),
 			},
 		},
 	})
 	if err != nil {
 		return err
 	}
 
-	return client.IgnoreNotFound(err)
+	return nil
 }
🤖 Fix all issues with AI agents
In `@dev/k8s/manifests/worker.yaml`:
- Around line 20-24: The manifest mounts the host Docker socket via the volume
named "docker-socket" (path "/var/run/docker.sock"), which grants host-level
privileges; remove the volume and any corresponding volumeMount referencing
"docker-socket" from the Deployment/Pod spec or replace it with a safer
alternative (e.g., use a sidecar DinD image, a remote build service, or a
privileged Pod specifically in a dev-only kustomize/overlay). Ensure the
dev-only nature is explicit by moving the socket mount into a non-production
overlay or feature flag so production manifests never include "docker-socket".

In `@svc/ctrl/integration/seed/seed.go`:
- Around line 412-445: Seeder.CreateRatelimit currently may do nothing if both
req.IdentityID and req.KeyID are nil and can overwrite an earlier error if both
inserts run; add input validation to require at least one of req.IdentityID or
req.KeyID and fail early if neither is set, then call
db.Query.InsertIdentityRatelimit and db.Query.InsertKeyRatelimit independently
(if their corresponding IDs are non-nil), capture each insert's error separately
and assert require.NoError(s.t, err) for each insert (or aggregate and fail if
any error occurred) instead of reusing a single err variable that can be
overwritten; reference the Seeder.CreateRatelimit function and the
db.Query.InsertIdentityRatelimit / InsertKeyRatelimit calls when making these
changes.

In `@svc/krane/internal/deployment/actual_state_report.go`:
- Around line 20-75: The runActualStateReportLoop goroutine needs lifecycle and
reconnection handling: wrap the watch creation (w := c.clientSet...Watch) in a
loop that exits on ctx.Done() or c.done, call w.Stop() when context cancels, and
when w.ResultChan() closes or a watch.Error is received, log and break the inner
event loop then retry creating a new watch (with a small backoff) so the watch
is re-established on expiration; ensure existing logic that casts to
*appsv1.ReplicaSet and calls c.buildDeploymentStatus / c.reportDeploymentStatus
is kept inside the event loop but that the outer loop recreates w on closure or
error to maintain continuous reporting.

In `@svc/krane/internal/deployment/apply.go`:
- Line 104: The Resources field for the container is left empty despite
validating ApplyDeployment.CpuMillicores and MemoryMib; update the container's
Resources (in svc/krane/internal/deployment/apply.go where Resources:
corev1.ResourceRequirements{} is set) to populate Requests and Limits using the
proto fields CpuMillicores and MemoryMib (set CPU as millicores and Memory as
MiB). Use the Kubernetes resource quantities (e.g., resource.NewMilliQuantity
for CPU and resource.NewQuantity with binary/decimal Mi units for memory) to
create a corev1.ResourceList for corev1.ResourceCPU and corev1.ResourceMemory
and assign them to both Requests and Limits so the validated values are actually
applied to the container.
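
The real fix would build the ResourceList with k8s.io/apimachinery's resource.NewMilliQuantity and resource.NewQuantity; as a dependency-free illustration, this shows how the validated proto fields map onto the quantity strings Kubernetes expects (the field values are examples):

```go
package main

import "fmt"

// cpuQuantity renders millicores in Kubernetes "m" notation,
// e.g. 250 -> "250m".
func cpuQuantity(millicores uint32) string { return fmt.Sprintf("%dm", millicores) }

// memQuantity renders mebibytes in binary "Mi" notation,
// e.g. 512 -> "512Mi".
func memQuantity(mib uint32) string { return fmt.Sprintf("%dMi", mib) }

func main() {
	// e.g. ApplyDeployment.CpuMillicores=250, MemoryMib=512
	resources := map[string]map[string]string{
		"requests": {"cpu": cpuQuantity(250), "memory": memQuantity(512)},
		"limits":   {"cpu": cpuQuantity(250), "memory": memQuantity(512)},
	}
	fmt.Println(resources["limits"]["cpu"], resources["limits"]["memory"]) // 250m 512Mi
}
```

Setting Requests and Limits to the same values, as suggested, gives the pod a Guaranteed-style resource profile.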

In `@svc/krane/internal/deployment/desired_state_apply.go`:
- Around line 18-30: The runDesiredStateApplyLoop currently blocks with
time.Sleep and will keep reconnecting after ctx cancellation; change the loop in
Controller.runDesiredStateApplyLoop to use a context-aware wait (e.g., select on
ctx.Done() and a timer channel computed from interval) before calling
streamDesiredStateOnce, and break/return immediately when ctx is cancelled so
the goroutine exits; ensure any sleep/wait uses ctx (select on ctx.Done()) and
avoid calling c.streamDesiredStateOnce when ctx is done to prevent log spam
during shutdown.
- Around line 51-60: The switch over state.GetState() in desired_state_apply.go
currently ignores nil/unknown variants and still advances the version; add a
default branch that returns a hard error (use fault.Internal) when
state.GetState() is nil or an unrecognized oneof so the agent fails fast instead
of acknowledging unsupported state; update the switch that matches
*ctrlv1.DeploymentState_Apply and *ctrlv1.DeploymentState_Delete to include a
default case that returns fault.Internalf (or fault.Internal with context)
mentioning the unexpected state, ensuring callers of this function see the error
rather than silently progressing the version.
- Around line 66-70: After the receive loop, call stream.Err() and handle any
non-nil error before invoking stream.Close(); if stream.Err() returns an error,
log it via c.logger.Error (include a descriptive message and the error) and
return that error instead of proceeding straight to Close(), so wire/receive
errors from the Receive() loop are not swallowed; ensure you still attempt
Close() or handle Close() errors appropriately (log/return) after checking
stream.Err().

In `@svc/krane/internal/deployment/state.go`:
- Around line 48-51: The code incorrectly references pod.Spec.Resources; replace
that block to read resources from the pod's containers instead: iterate
pod.Spec.Containers (or use the primary container at index 0) and read
container.Resources.Limits.Cpu().MilliValue() and
container.Resources.Limits.Memory().Value() / (1024*1024), assigning the results
to instance.CpuMillicores and instance.MemoryMib respectively; ensure you check
for nil/empty containers and for nil/absent Limits before calling Cpu() or
Memory() to avoid panics.

In `@svc/krane/internal/sentinel/desired_state_apply.go`:
- Around line 18-31: The reconnection loop in
Controller.runDesiredStateApplyLoop never observes context cancellation; replace
the unconditional sleep+work loop with a select that waits on either a timer
(time.After with the computed interval) or ctx.Done(), and break/return when ctx
is cancelled so the loop exits gracefully; also check ctx.Done() before calling
Controller.streamDesiredStateOnce to avoid starting work after cancellation and
propagate/ignore errors appropriately when ctx is closed.
- Around line 47-72: After the receive loop, check stream.Err() and if non-nil
return that error (so a server-side streaming error is propagated), otherwise
continue as normal; inside the switch handling state.GetState() add a default
case that logs an unexpected/unknown state type (using c.logger.Error and the
raw state info) and continues rather than silently ignoring it. Update the code
around the Receive loop, the switch on state.GetState(), and the
cleanup/stream.Close() handling to use stream.Err(), c.logger, ApplySentinel,
DeleteSentinel, and c.versionLastSeen as needed.

In `@svc/worker/deploy/deploy_handler.go`:
- Around line 205-208: The call to restate.RunVoid wrapping
db.BulkQuery.InsertDeploymentTopologies can fail but its returned error (err) is
not checked, allowing the workflow to continue despite insert failures; update
the surrounding code that invokes restate.RunVoid (the block calling db.Tx,
w.db.RW(), and db.BulkQuery.InsertDeploymentTopologies) to capture and handle
the returned error from restate.RunVoid (and propagate/return it from the
enclosing function or log and return a failure) so the deployment stops instead
of blocking; specifically ensure you check the err variable after
restate.RunVoid and do not ignore it.

In
`@web/apps/engineering/content/docs/architecture/services/ctrl/pull-based-infra.mdx`:
- Line 186: The copy incorrectly claims "exactly-once delivery"; update the
sentence referencing Krane's watermark and the phrase "exactly-once delivery" to
clarify that the version-based approach ensures "no missed updates" and that any
replays across polling/reconnects are handled via idempotent reprocessing (or
describe it as effectively at-least-once with idempotent handling), e.g.,
replace "exactly-once delivery" with wording that mentions "no missed updates"
and idempotent reprocessing while keeping the reference to Krane's watermark.

In `@web/apps/engineering/content/docs/architecture/services/krane/index.mdx`:
- Around line 10-18: The intro incorrectly says Krane "polls" for changes;
update the opening paragraph to describe the streaming, server-streaming model
instead—mention that Krane connects to ctrl's ClusterService using
WatchDeployments and WatchSentinels RPCs to receive streamed state changes,
processes each event, applies them to local Kubernetes resources, and resumes
from its last-seen version on reconnect; keep references to Krane, ctrl,
ClusterService, WatchDeployments, and WatchSentinels so readers see the
consistent streaming architecture.

In
`@web/apps/engineering/content/docs/architecture/services/krane/sync-engine.mdx`:
- Around line 107-112: The doc claims soft deletes eliminate a separate
changelog table but the PR introduces state_changes; update the "Soft Deletes"
section to reconcile this by noting that resource tables (rows with version and
fields like desired_replicas and desired_state='archived') are the ongoing
source-of-truth for incremental sync, while the new state_changes table is used
only for bootstrapping/streaming/replication scenarios (or short-lived change
streaming) rather than replacing row-level versioning; mention the purpose of
state_changes, how it complements soft deletes, and remove the blanket
"eliminates the need for a separate changelog table" phrasing or qualify it
accordingly.
🧹 Nitpick comments (33)
pkg/assert/greater_or_equal.go (1)

19-19: LGTM! Consider adding int8 and int16 for consistency.

The addition of unsigned integer types is correct and non-breaking. However, the constraint now covers all unsigned sizes (uint8 through uint64) but still omits int8 and int16 from the signed side. Consider adding them for completeness.

🔧 Optional: Add missing signed types
-func GreaterOrEqual[T ~int | ~int32 | ~int64 | ~float32 | ~float64 | ~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64](a, b T, message ...string) error {
+func GreaterOrEqual[T ~int | ~int8 | ~int16 | ~int32 | ~int64 | ~float32 | ~float64 | ~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64](a, b T, message ...string) error {
svc/krane/secrets/token/k8s_validator.go (3)

3-13: Reorder imports per coding guidelines.

External packages (k8s.io/...) should be grouped before internal packages (github.com/unkeyed/unkey/...). As per coding guidelines, the order should be: (1) Standard library, (2) External/third-party, (3) Internal packages.

Suggested import order
 import (
 	"context"
 	"fmt"
 	"strings"
 
-	"github.com/unkeyed/unkey/svc/krane/pkg/labels"
-
 	authv1 "k8s.io/api/authentication/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/client-go/kubernetes"
+
+	"github.com/unkeyed/unkey/svc/krane/pkg/labels"
 )

42-44: Consider validating clientset is not nil.

If cfg.Clientset is nil, the validator will panic on first use. A defensive check could provide a clearer error at construction time rather than at validation time.

Optional: Add nil check
 func NewK8sValidator(cfg K8sValidatorConfig) *K8sValidator {
+	if cfg.Clientset == nil {
+		panic("K8sValidatorConfig.Clientset must not be nil")
+	}
 	return &K8sValidator{clientset: cfg.Clientset}
 }

59-123: Consider using fault package for structured error handling.

The coding guidelines specify using the fault package for structured error handling with fault.Wrap(), fault.Code(), etc. Currently all errors use fmt.Errorf. This would provide better error categorization and potentially safer public-facing error messages.

Example refactor for one error case
+import "github.com/unkeyed/unkey/internal/fault"
+
 // In Validate method:
-return nil, fmt.Errorf("failed to validate token via TokenReview: %w", err)
+return nil, fault.Wrap(err, fault.Internal("failed to validate token via TokenReview"))

Additionally, on line 71, if result.Status.Error is empty, the error message will be less informative. Consider handling this case:

 if !result.Status.Authenticated {
-	return nil, fmt.Errorf("token not authenticated: %s", result.Status.Error)
+	errMsg := result.Status.Error
+	if errMsg == "" {
+		errMsg = "unknown reason"
+	}
+	return nil, fmt.Errorf("token not authenticated: %s", errMsg)
 }
svc/krane/pkg/controlplane/client.go (1)

3-14: Consider organizing imports into groups per coding guidelines.

The imports could be separated with blank lines between groups: (1) standard library, (2) external packages, (6) generated code.

♻️ Suggested import organization
 import (
 	"context"
 	"crypto/tls"
 	"net"
 	"net/http"
 	"strings"
 	"time"
-
+
 	"connectrpc.com/connect"
-	"github.com/unkeyed/unkey/gen/proto/ctrl/v1/ctrlv1connect"
 	"golang.org/x/net/http2"
+
+	"github.com/unkeyed/unkey/gen/proto/ctrl/v1/ctrlv1connect"
 )

Based on coding guidelines, imports should be grouped with blank lines: standard library, external/third-party, then generated code. Running make fmt may handle this automatically.

svc/krane/internal/deployment/desired_state_apply.go (1)

39-45: Use fault for structured error propagation.

Consider wrapping these returned errors with fault.Wrap / fault.Internal to add context and align with project error-handling conventions. As per coding guidelines, ...

Also applies to: 53-59, 67-69

svc/worker/deploy/deploy_handler.go (1)

185-203: Verify NextVersion() calls are replay-safe in workflows.

These per‑region NextVersion() calls are made directly in the workflow loop. If Restate only dedupes calls wrapped in restate.Run/durable call recording, retries could allocate extra versions. Please confirm the client call is workflow‑durable or wrap it in a restate.Run step (same applies to the sentinel versioning below).

svc/ctrl/run.go (1)

225-240: Consider simplifying the http2.Server initialization.

All fields are explicitly set to their zero values, which is equivalent to using an empty struct literal. This could be simplified for clarity.

♻️ Suggested simplification
-	h2cHandler := h2c.NewHandler(mux, &http2.Server{
-		MaxHandlers:                  0,
-		MaxConcurrentStreams:         0,
-		MaxDecoderHeaderTableSize:    0,
-		MaxEncoderHeaderTableSize:    0,
-		MaxReadFrameSize:             0,
-		PermitProhibitedCipherSuites: false,
-		IdleTimeout:                  0,
-		ReadIdleTimeout:              0,
-		PingTimeout:                  0,
-		WriteByteTimeout:             0,
-		MaxUploadBufferPerConnection: 0,
-		MaxUploadBufferPerStream:     0,
-		NewWriteScheduler:            nil,
-		CountError:                   nil,
-	})
+	h2cHandler := h2c.NewHandler(mux, &http2.Server{})
svc/ctrl/integration/sync_test.go (1)

1156-1168: Consider clarifying the sequence validation logic.

The bootstrapSequence variable is set but only used once for initialization. The current logic sets it from the first message then checks all non-bookmark messages have sequence > 0. If the intent is also to verify bootstrap messages have consistent sequences, consider adding that assertion explicitly.

♻️ Suggested clarification
 	// All messages during bootstrap should have the same sequence (the max sequence at bootstrap time)
-	var bootstrapSequence uint64
 	for _, msg := range messages {
 		seq := msg.GetSequence()
-		if bootstrapSequence == 0 {
-			bootstrapSequence = seq
-		}
-		// Bootstrap messages all have the same sequence
-		if msg.GetBookmark() == nil {
-			// Non-bookmark messages should have a valid sequence
-			require.Greater(t, seq, uint64(0), "all messages should have sequence > 0")
-		}
+		// All messages should have a valid sequence > 0
+		require.Greater(t, seq, uint64(0), "all messages should have sequence > 0")
 	}
svc/krane/internal/deployment/namespace.go (1)

33-40: Consider fault.Wrap for structured errors.
Lines 39 and 102 still use fmt.Errorf; repo guidelines ask for structured fault errors for consistency and codes.

Also applies to: 100-103

svc/worker/config.go (1)

325-327: Consider returning the parse error from GetBuildPlatform.
Errors are currently ignored; returning (BuildPlatform, error) would guard against misuse outside Validate().

♻️ Safer signature option
-func (c Config) GetBuildPlatform() BuildPlatform {
-	parsed, _ := parseBuildPlatform(c.BuildPlatform)
-	return parsed
-}
+func (c Config) GetBuildPlatform() (BuildPlatform, error) {
+	return parseBuildPlatform(c.BuildPlatform)
+}
svc/ctrl/services/cluster/service.go (1)

3-7: Import ordering doesn't follow coding guidelines.

Per coding guidelines, Go imports should be organized as: (1) Standard library, (2) External/third-party, (3) Internal packages, (4) Package-level, (5) Service-level, (6) Generated code. Currently, generated code (ctrlv1connect) is listed first.

Suggested import ordering
 import (
+	"github.com/unkeyed/unkey/pkg/db"
+	"github.com/unkeyed/unkey/pkg/otel/logging"
+
 	"github.com/unkeyed/unkey/gen/proto/ctrl/v1/ctrlv1connect"
-	"github.com/unkeyed/unkey/pkg/db"
-	"github.com/unkeyed/unkey/pkg/otel/logging"
 )
dev/k8s/manifests/worker.yaml (1)

25-138: Consider adding resource limits and health probes.

The worker container lacks resource requests/limits and liveness/readiness probes. Even for development, adding these helps catch resource issues early and ensures the pod lifecycle is properly managed.

♻️ Suggested additions for the container spec
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 7092
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 7092
            initialDelaySeconds: 5
            periodSeconds: 5
svc/krane/internal/sentinel/resync.go (2)

23-73: Context cancellation not checked during pagination.

The pagination loop doesn't check for context cancellation between iterations or API calls. If the controller is stopping, this could delay shutdown or cause unnecessary work.

♻️ Add context cancellation check
 		for {
+			select {
+			case <-ctx.Done():
+				c.logger.Info("resync loop cancelled")
+				return
+			default:
+			}
+
 			deployments, err := c.clientSet.AppsV1().Deployments(NamespaceSentinel).List(ctx, metav1.ListOptions{

56-65: Consider adding a default case for unknown state types.

If the proto definition adds new state types in the future, this switch would silently ignore them. A default case with logging would help surface unexpected states during debugging.

♻️ Add default case
 			switch res.Msg.GetState().(type) {
 			case *ctrlv1.SentinelState_Apply:
 				if err := c.ApplySentinel(ctx, res.Msg.GetApply()); err != nil {
 					c.logger.Error("unable to apply sentinel", "error", err.Error(), "sentinel_id", sentinelID)
 				}
 			case *ctrlv1.SentinelState_Delete:
 				if err := c.DeleteSentinel(ctx, res.Msg.GetDelete()); err != nil {
 					c.logger.Error("unable to delete sentinel", "error", err.Error(), "sentinel_id", sentinelID)
 				}
+			default:
+				c.logger.Warn("unknown sentinel state type", "sentinel_id", sentinelID)
 			}
pkg/db/queries/sentinel_list_desired.sql (1)

6-10: Prefer a named parameter for LIMIT to keep sqlc params explicit.

Using positional LIMIT ? alongside named args can make parameter ordering less clear in generated code. Consider LIMIT sqlc.arg(limit) for consistency.

Suggested change
-ORDER BY id ASC
-LIMIT ?;
+ORDER BY id ASC
+LIMIT sqlc.arg(limit);
web/internal/db/src/schema/sentinels.ts (2)

46-50: Validate bigint precision for version.

With mode: "number", values beyond 2^53‑1 lose precision and can break ordering/comparisons. If versions might grow large, consider mode: "bigint" (BigInt) or enforce an upper bound.

Possible change
-    version: bigint("version", { mode: "number", unsigned: true }).notNull().unique(),
+    version: bigint("version", { mode: "bigint", unsigned: true }).notNull().unique(),

55-57: Consider an index aligned with desired_state + region + id pagination.

The bootstrap query filters on region and desired_state and paginates by id. If this path is hot, a composite index (e.g., (region, desired_state, id)) can avoid scans.

Possible index addition
   (table) => [
     index("idx_environment_id").on(table.environmentId),
+    index("region_desired_state_id_idx").on(table.region, table.desiredState, table.id),
     index("region_version_idx").on(table.region, table.version),
     uniqueIndex("one_env_per_region").on(table.environmentId, table.region),
   ],
svc/krane/internal/deployment/resync.go (1)

23-73: Add context cancellation handling and a default case in the state switch.

The review identifies two separate concerns:

  1. Context cancellation: Unkey's repeat.Every does not accept a context; it returns a stop function that must be called explicitly to halt the loop. The callback should check ctx.Done() to break early, and, more importantly, the caller must capture the stop function returned by repeat.Every() and invoke it when the parent context is cancelled. Without that teardown, graceful shutdown may be delayed.

  2. Missing default case: The switch statement (lines 56–65) lacks a default case. If the control plane returns an unexpected state type, it will be silently ignored, potentially masking future API changes or bugs.

Suggested improvements
 		switch res.Msg.GetState().(type) {
 		case *ctrlv1.DeploymentState_Apply:
 			if err := c.ApplyDeployment(ctx, res.Msg.GetApply()); err != nil {
 				c.logger.Error("unable to apply deployment", "error", err.Error(), "deployment_id", deploymentID)
 			}
 		case *ctrlv1.DeploymentState_Delete:
 			if err := c.DeleteDeployment(ctx, res.Msg.GetDelete()); err != nil {
 				c.logger.Error("unable to delete deployment", "error", err.Error(), "deployment_id", deploymentID)
 			}
+		default:
+			c.logger.Warn("unknown deployment state type", "deployment_id", deploymentID)
 		}

For context cancellation, ensure the stop function returned by repeat.Every() is properly managed in the caller's lifecycle (e.g., called when the controller shuts down or context is cancelled).

svc/krane/internal/deployment/state.go (2)

3-12: Import order should follow coding guidelines.

Per the Go coding guidelines, imports should be grouped as: (1) Standard library, (2) External/third-party packages, (3) Internal packages. The internal ctrlv1 import should come after the k8s packages.

Suggested import reordering
 import (
 	"context"
 	"fmt"
 	"strings"

-	ctrlv1 "github.com/unkeyed/unkey/gen/proto/ctrl/v1"
 	appsv1 "k8s.io/api/apps/v1"
 	corev1 "k8s.io/api/core/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+
+	ctrlv1 "github.com/unkeyed/unkey/gen/proto/ctrl/v1"
 )

53-73: Status mapping logic for running pods may misclassify temporarily unready pods as failed.

When a pod is Running but ContainersReady is not true (e.g., during liveness probe initialization), setting status to STATUS_FAILED is potentially misleading. A pod that is starting up or temporarily unready is not necessarily failed.

Consider using STATUS_PENDING for running-but-not-ready pods, or introduce a distinct status for degraded/unhealthy state.

Alternative mapping
 		if allReady {
 			instance.Status = ctrlv1.ReportDeploymentStatusRequest_Update_Instance_STATUS_RUNNING
 		} else {
-			instance.Status = ctrlv1.ReportDeploymentStatusRequest_Update_Instance_STATUS_FAILED
+			instance.Status = ctrlv1.ReportDeploymentStatusRequest_Update_Instance_STATUS_PENDING
 		}
svc/ctrl/services/cluster/rpc_report_sentinel_status.go (1)

35-45: Consider adding a default case for the health switch.

If req.Msg.GetHealth() returns an unexpected or future enum value not covered by the explicit cases, health will remain at its zero value. While HEALTH_UNSPECIFIED maps to Unknown, an explicit default case would make the behavior clearer and catch any unexpected values.

Add explicit default case
 	switch req.Msg.GetHealth() {
 	case ctrlv1.Health_HEALTH_HEALTHY:
 		health = db.SentinelsHealthHealthy
 	case ctrlv1.Health_HEALTH_UNHEALTHY:
 		health = db.SentinelsHealthUnhealthy
 	case ctrlv1.Health_HEALTH_PAUSED:
 		health = db.SentinelsHealthPaused
 	case ctrlv1.Health_HEALTH_UNSPECIFIED:
 		health = db.SentinelsHealthUnknown
+	default:
+		health = db.SentinelsHealthUnknown
 	}
svc/krane/internal/deployment/apply.go (1)

3-18: Import order should follow coding guidelines.

Per the Go coding guidelines, imports should be grouped as: (1) Standard library, (2) External/third-party packages, (3) Internal packages. The k8s packages should come before internal github.com/unkeyed/unkey packages.

Suggested import reordering
 import (
 	"context"
 	"encoding/base64"
 	"encoding/json"
 	"fmt"
 	"strconv"

-	ctrlv1 "github.com/unkeyed/unkey/gen/proto/ctrl/v1"
-	"github.com/unkeyed/unkey/pkg/assert"
-	"github.com/unkeyed/unkey/pkg/ptr"
-	"github.com/unkeyed/unkey/svc/krane/pkg/labels"
 	appsv1 "k8s.io/api/apps/v1"
 	corev1 "k8s.io/api/core/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/types"
+
+	ctrlv1 "github.com/unkeyed/unkey/gen/proto/ctrl/v1"
+	"github.com/unkeyed/unkey/pkg/assert"
+	"github.com/unkeyed/unkey/pkg/ptr"
+	"github.com/unkeyed/unkey/svc/krane/pkg/labels"
 )
svc/krane/internal/sentinel/controller.go (1)

76-80: Consider waiting for goroutines to complete on Stop.

The Stop method closes the done channel but doesn't wait for the background goroutines to finish. This could lead to incomplete shutdown where goroutines are still running after Stop returns.

Consider using a sync.WaitGroup to track goroutines and wait for their completion:

♻️ Suggested improvement
 type Controller struct {
 	clientSet       kubernetes.Interface
 	logger          logging.Logger
 	cluster         ctrlv1connect.ClusterServiceClient
 	cb              circuitbreaker.CircuitBreaker[any]
 	done            chan struct{}
+	wg              sync.WaitGroup
 	region          string
 	versionLastSeen uint64
 }

 func (c *Controller) Stop() error {
 	close(c.done)
+	c.wg.Wait()
 	return nil
 }
svc/ctrl/services/cluster/rpc_watch_sentinels.go (1)

36-61: Consider making the polling interval configurable.

The 1-second sleep at line 59 is hardcoded. For production systems, this should likely be configurable to tune the trade-off between responsiveness and database load.

Additionally, consider adding a backoff mechanism if the stream repeatedly returns empty results, which could indicate low activity periods.

svc/krane/internal/deployment/controller.go (1)

86-95: Use fault.* for structured errors.

fmt.Errorf here bypasses the standardized fault wrapping used elsewhere. Please wrap errors with fault.Wrap/fault.Internal (and codes where applicable). As per coding guidelines, ...

svc/ctrl/integration/harness.go (1)

173-186: Potential inconsistency in returned Topology timestamp.

The CreateDeployment method manually constructs the returned DeploymentTopology with CreatedAt: h.Now() (line 183), but this timestamp may differ from what was actually inserted into the database (line 166). For test assertions that compare timestamps, this could cause flaky behavior.

Consider fetching the topology from the database instead of constructing it manually, or at minimum, capture the timestamp before the insert and reuse it.

💡 Suggested improvement
+	createdAt := h.Now()
 	h.versionCounter++
 	err = db.Query.InsertDeploymentTopology(ctx, h.DB.RW(), db.InsertDeploymentTopologyParams{
 		WorkspaceID:     workspaceID,
 		DeploymentID:    deploymentID,
 		Region:          req.Region,
 		DesiredReplicas: 1,
 		DesiredStatus:   db.DeploymentTopologyDesiredStatusStarted,
 		Version:         h.versionCounter,
-		CreatedAt:       h.Now(),
+		CreatedAt:       createdAt,
 	})
 	require.NoError(h.t, err)
 
 	deployment, err := db.Query.FindDeploymentById(ctx, h.DB.RO(), deploymentID)
 	require.NoError(h.t, err)
 
 	return CreateDeploymentResult{
 		Deployment: deployment,
 		Topology: db.DeploymentTopology{
 			Pk:              0,
 			WorkspaceID:     workspaceID,
 			DeploymentID:    deploymentID,
 			Region:          req.Region,
 			DesiredReplicas: 1,
 			DesiredStatus:   db.DeploymentTopologyDesiredStatusStarted,
 			Version:         h.versionCounter,
-			CreatedAt:       h.Now(),
+			CreatedAt:       createdAt,
 			UpdatedAt:       sql.NullInt64{Valid: false},
 		},
 	}
svc/krane/internal/sentinel/apply.go (1)

206-220: Resource requirements are commented out.

The CPU, memory, and ephemeral storage resource limits and requests are commented out. This may be intentional for development flexibility, but in production, running without resource limits can lead to noisy neighbor problems and make capacity planning difficult.

Consider adding a TODO comment explaining when these should be enabled, or making them configurable.

svc/krane/internal/testutil/mock_cluster_client.go (1)

31-43: Default nil returns for Watch methods may need documentation.

WatchDeployments and WatchSentinels return nil, nil when no hook is set. This could cause nil pointer dereferences in tests that don't explicitly set the hook but still call these methods. Consider either:

  1. Adding a comment warning about this behavior
  2. Returning an error by default to make the failure explicit

This is acceptable for test utilities where callers are expected to configure what they need.

💡 Optional: Add warning comment
 func (m *MockClusterClient) WatchDeployments(ctx context.Context, req *connect.Request[ctrlv1.WatchDeploymentsRequest]) (*connect.ServerStreamForClient[ctrlv1.DeploymentState], error) {
 	if m.WatchDeploymentsFunc != nil {
 		return m.WatchDeploymentsFunc(ctx, req)
 	}
+	// Returns nil - tests using this method must set WatchDeploymentsFunc
 	return nil, nil
 }
svc/ctrl/services/cluster/rpc_report_deployment_status.go (2)

3-11: Re-group imports to match repo ordering

Line 3–11 mixes generated and pkg imports in the same group. Please split into standard / third‑party / pkg / gen groups.

♻️ Suggested import grouping
 import (
 	"context"
 
 	"connectrpc.com/connect"
-	ctrlv1 "github.com/unkeyed/unkey/gen/proto/ctrl/v1"
-	"github.com/unkeyed/unkey/pkg/assert"
-	"github.com/unkeyed/unkey/pkg/db"
-	"github.com/unkeyed/unkey/pkg/uid"
+
+	"github.com/unkeyed/unkey/pkg/assert"
+	"github.com/unkeyed/unkey/pkg/db"
+	"github.com/unkeyed/unkey/pkg/uid"
+
+	ctrlv1 "github.com/unkeyed/unkey/gen/proto/ctrl/v1"
 )
As per coding guidelines, please keep Go imports grouped by category.

38-107: Use txCtx inside the TxRetry block

Line 38+ uses ctx for DB calls inside TxRetry. Prefer txCtx so cancellation/timeouts are bound to the transactional context.

✅ Suggested change
-				deployment, err := db.Query.FindDeploymentByK8sName(ctx, tx, msg.Update.GetK8SName())
+				deployment, err := db.Query.FindDeploymentByK8sName(txCtx, tx, msg.Update.GetK8SName())
 				if err != nil {
 					return err
 				}
 
-				staleInstances, err := db.Query.FindInstancesByDeploymentIdAndRegion(ctx, tx, db.FindInstancesByDeploymentIdAndRegionParams{
+				staleInstances, err := db.Query.FindInstancesByDeploymentIdAndRegion(txCtx, tx, db.FindInstancesByDeploymentIdAndRegionParams{
 					Deploymentid: deployment.ID,
 					Region:       region,
 				})
 				if err != nil {
 					return err
 				}
@@
-						err = db.Query.DeleteInstance(ctx, tx, db.DeleteInstanceParams{
+						err = db.Query.DeleteInstance(txCtx, tx, db.DeleteInstanceParams{
 							K8sName: staleInstance.K8sName,
 							Region:  region,
 						})
 						if err != nil {
 							return err
@@
-					err = db.Query.UpsertInstance(ctx, tx, db.UpsertInstanceParams{
+					err = db.Query.UpsertInstance(txCtx, tx, db.UpsertInstanceParams{
 						ID:            uid.New(uid.InstancePrefix),
 						DeploymentID:  deployment.ID,
@@
-				deployment, err := db.Query.FindDeploymentByK8sName(ctx, tx, msg.Delete.GetK8SName())
+				deployment, err := db.Query.FindDeploymentByK8sName(txCtx, tx, msg.Delete.GetK8SName())
 				if err != nil {
 					return err
 				}
 
-				err = db.Query.DeleteDeploymentInstances(ctx, tx, db.DeleteDeploymentInstancesParams{
+				err = db.Query.DeleteDeploymentInstances(txCtx, tx, db.DeleteDeploymentInstancesParams{
 					DeploymentID: deployment.ID,
 					Region:       region,
 				})
svc/ctrl/integration/seed/seed.go (2)

3-20: Re-group imports to match repo conventions

Line 3–20 mixes gen/pkg/third‑party imports. Please split into standard / third‑party / pkg / gen groups.

♻️ Suggested import grouping
 import (
 	"context"
 	"database/sql"
 	"errors"
 	"testing"
 	"time"
 
 	"github.com/go-sql-driver/mysql"
 	"github.com/stretchr/testify/require"
-	vaultv1 "github.com/unkeyed/unkey/gen/proto/vault/v1"
 	"github.com/unkeyed/unkey/pkg/assert"
 	"github.com/unkeyed/unkey/pkg/db"
 	dbtype "github.com/unkeyed/unkey/pkg/db/types"
 	"github.com/unkeyed/unkey/pkg/hash"
 	"github.com/unkeyed/unkey/pkg/ptr"
 	"github.com/unkeyed/unkey/pkg/uid"
 	"github.com/unkeyed/unkey/pkg/vault"
+
+	vaultv1 "github.com/unkeyed/unkey/gen/proto/vault/v1"
 )
As per coding guidelines, please keep Go imports grouped by category.

134-166: Use a consistent receiver name

Line 134 uses h *Seeder while the rest of the file uses s *Seeder. Aligning the receiver name improves scannability.

🔧 Suggested tweak
-func (h *Seeder) CreateProject(ctx context.Context, req CreateProjectRequest) db.Project {
-	err := db.Query.InsertProject(ctx, h.DB.RW(), db.InsertProjectParams{
+func (s *Seeder) CreateProject(ctx context.Context, req CreateProjectRequest) db.Project {
+	err := db.Query.InsertProject(ctx, s.DB.RW(), db.InsertProjectParams{
 		ID:               req.ID,
 		WorkspaceID:      req.WorkspaceID,
@@
-	require.NoError(h.t, err)
+	require.NoError(s.t, err)
 
-	project, err := db.Query.FindProjectById(ctx, h.DB.RO(), req.ID)
-	require.NoError(h.t, err)
+	project, err := db.Query.FindProjectById(ctx, s.DB.RO(), req.ID)
+	require.NoError(s.t, err)

@vercel vercel bot temporarily deployed to Preview – dashboard January 22, 2026 07:08 Inactive
chronark and others added 7 commits January 23, 2026 12:17
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Contributor

@ogzhanolguncu ogzhanolguncu left a comment

Looks good you might wanna go over coderabbit comments tho

@vercel vercel bot temporarily deployed to Preview – dashboard January 23, 2026 11:37 Inactive
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@cmd/ctrl/api.go`:
- Around line 164-171: The RestateConfig block assigns Restate.AdminURL,
HttpPort, and RegisterAs but those CLI flags are not defined on apiCmd.Flags,
causing zero values; either add the missing flags (define "restate-admin-url",
"restate-http-port", and "restate-register-as" on apiCmd.Flags alongside the
existing "restate-url" and "restate-api-key") or remove/simplify those fields
from the Restate: ctrlapi.RestateConfig assignment if they are unused; update
the code in cmd/ctrl/api.go around the RestateConfig and where apiCmd.Flags are
declared so the flag names match the RestateConfig fields (symbols:
RestateConfig, Restate, apiCmd.Flags).

In `@cmd/ctrl/worker.go`:
- Line 169: The code calls cmd.String("restate-url") but the flag "restate-url"
is never defined, so Restate.URL (the Restate.URL field) will always be empty;
add a new string flag declaration for "restate-url" (e.g., alongside other
cmd.String(...) flag definitions) or change the call to use the existing flag
name if one already provides the Restate URL, and ensure the Restate.URL
assignment uses that defined flag variable so the value is populated at runtime.
- Line 131: Add a "build-backend" flag to the worker command so the referenced
flag isn't always empty: define a string flag on workerCmd.Flags() (e.g. create
the flag with workerCmd.Flags().String("build-backend", "", "path/name of build
backend") or equivalent StringVar) before using cmd.String("build-backend"),
ensuring the value passed into BuildBackend (the BuildBackend field in the
struct and the call worker.BuildBackend(...)) receives the actual flag value.

In `@svc/ctrl/worker/run.go`:
- Around line 227-233: The anonymous goroutine that starts the Restate server
currently only logs startErr from restateSrv.Start and allows the worker to
continue; change this to propagate the error to the main startup flow so the
process fails fast: create or use an existing startup error channel (e.g., a
chan error) or context cancellation mechanism in the surrounding function and
send startErr (from restateSrv.Start) to it instead of just calling
logger.Error, then ensure the caller waits for that channel and returns/shuts
down on any non-nil error so that failures in restateSrv.Start (referenced here
as restateSrv.Start and cfg.Restate.HttpPort) cause the worker to stop rather
than continue silently without Restate.
♻️ Duplicate comments (5)
cmd/ctrl/api.go (2)

131-131: image flag not defined but referenced.

cmd.String("image") references a flag that doesn't exist in apiCmd.Flags. This will always return an empty string. Either add the flag or remove this assignment if Image is unused.


74-75: ACME flags defined but not wired to config.

acme-enabled and acme-email-domain are defined but never passed to ctrlapi.Config. Either remove these flags or add an Acme field to the config struct and wire them.

svc/ctrl/worker/run.go (3)

289-297: Renewal cron shouldn’t depend on DNS provider.
HTTP‑01 renewals skipped when Route53 off.


398-403: Don’t return on existing wildcard domain.
Skips ACME challenge bootstrap.


413-414: Reuse InfraWorkspaceID constant.
Avoid drift.

🧹 Nitpick comments (1)
cmd/ctrl/worker.go (1)

79-81: Cloudflare ACME flags defined but not wired to config.

acme-cloudflare-enabled and acme-cloudflare-api-token flags are defined but worker.AcmeConfig only contains Route53Config—no Cloudflare field exists. Remove unused flags or add CloudflareConfig to the struct.

Remove unused flags
-		// Cloudflare DNS provider
-		cli.Bool("acme-cloudflare-enabled", "Enable Cloudflare for wildcard certificates", cli.EnvVar("UNKEY_ACME_CLOUDFLARE_ENABLED")),
-		cli.String("acme-cloudflare-api-token", "Cloudflare API token for Let's Encrypt", cli.EnvVar("UNKEY_ACME_CLOUDFLARE_API_TOKEN")),
-

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@cmd/ctrl/api.go`:
- Around line 165-170: The RestateConfig instantiation references flags that
aren't defined; add corresponding command-line flags for "restate-admin-url",
"restate-http-port", and "restate-register-as" so the values passed into
RestateConfig.URL/AdminURL/HttpPort/RegisterAs come from defined flags. Locate
where other flags are declared and add definitions for those three (matching
types: string for admin URL and register-as, int for http port) so
RestateConfig{ URL, AdminURL, HttpPort, RegisterAs, APIKey } can read them
reliably.

In `@cmd/ctrl/worker.go`:
- Around line 130-132: The BuildBackend value is using
cmd.String("build-backend") but no CLI flag is defined, so BuildBackend will
always be empty; add a CLI flag definition for "build-backend" on the same
cobra/kingpin command object that defines other flags (where cmd is created) —
e.g., register cmd.Flag("build-backend", "build backend to use").String() (or
the equivalent in your CLI lib) with an appropriate default and help text so
worker.BuildBackend receives the intended value when referenced in the Build
configuration.
- Around line 79-82: The two Cloudflare ACME CLI flags
("acme-cloudflare-enabled" and "acme-cloudflare-api-token") are declared but
never used; either remove them or wire them into the worker.Config so they take
effect. To fix, add fields to the worker.Config struct (e.g.,
AcmeCloudflareEnabled bool and AcmeCloudflareAPIToken string) and populate them
when constructing the config in the CLI handling code (read from the cli.Context
using the exact flag names "acme-cloudflare-enabled" and
"acme-cloudflare-api-token"), then pass those fields to any ACME/cloudflare
setup logic that expects them; alternatively, if Cloudflare support is not
needed, remove the two cli.Bool/cli.String flag declarations to avoid dead
flags.
- Around line 168-170: The Restate URL flag is missing so RestateConfig.URL is
empty; add a new CLI flag named "restate-url" (using the same cmd variable where
other flags are defined) with an appropriate default/empty value and
description, e.g., cmd.String("restate-url", "", "URL for Restate API"), and
ensure RestateConfig{ URL: cmd.String("restate-url"), AdminURL:
cmd.String("restate-admin-url") } uses that flag; also verify that
"restate-admin-url" is present and documented similarly.

In `@svc/ctrl/worker/run.go`:
- Around line 82-90: The vault client is created with http.DefaultClient which
has no timeout and can hang; replace http.DefaultClient with a dedicated
*http.Client that sets a sensible Timeout (e.g., 5–30s) and use that when
calling vaultv1connect.NewVaultServiceClient(cfg.VaultURL, ...), ensuring you
import time if needed and keep the existing interceptors and Authorization
header (references: vaultClient, vaultv1connect.NewVaultServiceClient,
cfg.VaultURL, cfg.VaultToken).
♻️ Duplicate comments (5)
cmd/ctrl/api.go (2)

131-131: image flag still undefined.
Line 131 uses cmd.String("image") but no flag exists. Remove or add flag.

✅ Remove unused field assignment
-		Image:          cmd.String("image"),

74-75: ACME flags unused in api config.
Flags defined but not wired into ctrlapi.Config. Remove or plumb.

svc/ctrl/worker/run.go (3)

294-297: Renewal cron blocked for HTTP-01.
Guard requires dnsProvider != nil; HTTP-01 certs won’t renew.

✅ Minimal fix
-			if cfg.Acme.Enabled && dnsProvider != nil {
+			if cfg.Acme.Enabled {

398-403: Existing domain skips challenge bootstrap.
Early return prevents ensuring ACME challenge.


413-414: Use InfraWorkspaceID constant.
Avoid hardcoded "unkey_internal".

✅ Use constant
-	workspaceID := "unkey_internal"
+	workspaceID := certificate.InfraWorkspaceID

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@cmd/ctrl/api.go`:
- Around line 165-171: The RestateConfig fields (URL, AdminURL, HttpPort,
RegisterAs, APIKey) are being populated from cmd.String/cmd.Int values that are
not defined on apiCmd.Flags, causing silent zero/empty values; either remove the
unused fields from the RestateConfig construction in the api command or add the
corresponding flag definitions to apiCmd.Flags (e.g., define "restate-admin-url"
and "restate-register-as" as String flags and "restate-http-port" as an Int
flag) so that RestateConfig.AdminURL, RestateConfig.HttpPort, and
RestateConfig.RegisterAs receive real values; locate the config construction
referencing RestateConfig and the apiCmd.Flags setup to make the change.

In `@cmd/ctrl/worker.go`:
- Line 169: The config struct is reading a "restate-url" via
cmd.String("restate-url") but that flag is never defined on workerCmd.Flags, so
the value will always be empty; add a flag definition to the worker command
(e.g., call workerCmd.Flags().String("restate-url", "", "ReState ingress URL for
workflow invocations") or the equivalent StringVar on the same Flags() used
elsewhere) so that the symbol cmd.String("restate-url") resolves to the provided
value when parsed.
- Line 131: The code references cmd.String("build-backend") when constructing
BuildBackend (in the worker.BuildBackend field) but the "build-backend" flag is
not registered on workerCmd.Flags, so it always yields an empty string; fix this
by adding a string flag registration for "build-backend" on workerCmd.Flags
(e.g., in the init or command setup where workerCmd is configured) so the value
returned by cmd.String("build-backend") is populated and the BuildBackend
selection works correctly—ensure the flag name exactly matches "build-backend"
and provide a sensible default and usage text when calling Flags().String(...)
(or the equivalent flag registration method used in this codebase).
♻️ Duplicate comments (5)
cmd/ctrl/api.go (2)

74-78: Unused ACME flags in API command.

ACME flags are defined but never wired to ctrlapi.Config (which lacks an Acme field). ACME is only used by the worker service. Remove these flags or add proper wiring if needed.


129-131: Remove unused Image field assignment.

cmd.String("image") references a flag not defined in apiCmd.Flags. This will always return empty string. The Image field appears unused in the API service.

 	config := ctrlapi.Config{
 		// Basic configuration
-		Image:          cmd.String("image"),
 		HttpPort:       cmd.Int("http-port"),
svc/ctrl/worker/run.go (3)

413-414: Use constant for infrastructure workspace ID.

The hardcoded "unkey_internal" appears to duplicate certificate.InfraWorkspaceID. Reuse the constant to prevent drift.

#!/bin/bash
# Verify if InfraWorkspaceID constant exists in certificate package
rg -n "InfraWorkspaceID" svc/ctrl/worker/certificate/

294-310: Renewal cron gated on DNS provider only blocks HTTP-01 renewals.

The condition cfg.Acme.Enabled && dnsProvider != nil prevents certificate renewals when only HTTP-01 provider is configured. The renewal cron should start whenever ACME is enabled.

 			// Start the certificate renewal cron job if ACME is enabled
 			// Use Send with idempotency key so multiple restarts don't create duplicate crons
-			if cfg.Acme.Enabled && dnsProvider != nil {
+			if cfg.Acme.Enabled {

398-407: Early return may skip ACME challenge creation for existing domain.

If a domain record exists but its ACME challenge is missing or invalid, this early return skips challenge bootstrap. Consider checking challenge existence too or making the upsert idempotent.

🧹 Nitpick comments (2)
svc/ctrl/worker/run.go (1)

82-93: Consider using a configured HTTP client for vault.

http.DefaultClient lacks timeout configuration. While Connect clients may handle timeouts internally, using a dedicated client with timeout is safer for production.

cmd/ctrl/worker.go (1)

79-81: Unused Cloudflare ACME flags.

acme-cloudflare-enabled and acme-cloudflare-api-token flags are defined but never wired to worker.Config. Either remove these flags or add Cloudflare config support.

-		// Cloudflare DNS provider
-		cli.Bool("acme-cloudflare-enabled", "Enable Cloudflare for wildcard certificates", cli.EnvVar("UNKEY_ACME_CLOUDFLARE_ENABLED")),
-		cli.String("acme-cloudflare-api-token", "Cloudflare API token for Let's Encrypt", cli.EnvVar("UNKEY_ACME_CLOUDFLARE_API_TOKEN")),
-

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@cmd/ctrl/api.go`:
- Around line 164-171: The Restate config is being populated with AdminURL,
HttpPort and RegisterAs using flags that don't exist, so those fields will be
empty/zero; fix by either (A) adding the missing flags (define flags for
"restate-admin-url", "restate-http-port", "restate-register-as" alongside the
existing "restate-url" and "restate-api-key") so ctrlapi.RestateConfig receives
real values, or (B) if the API process doesn't need those fields, stop
populating them and only set RestateConfig.URL and APIKey (or simplify
ctrlapi.RestateConfig usage for API-only paths); locate the RestateConfig
construction in cmd/ctrl/api.go (the Restate: ctrlapi.RestateConfig{...} block)
and apply one of these two fixes.
- Around line 80-84: The configuration is missing flag definitions for
restate-admin-url, restate-http-port, and restate-register-as referenced later;
add corresponding CLI flags in cmd/ctrl/api.go alongside the existing
restate-url and restate-api-key entries: define a
cli.String("restate-admin-url", ...) with a suitable default and EnvVar (e.g.,
UNKEY_RESTATE_ADMIN_URL), define a cli.Int("restate-http-port", ...) with a
default port and EnvVar (e.g., UNKEY_RESTATE_HTTP_PORT), and define a
cli.String("restate-register-as", ...) with a default and EnvVar (e.g.,
UNKEY_RESTATE_REGISTER_AS) so the code that reads restate-admin-url,
restate-http-port, and restate-register-as gets non-empty values.
♻️ Duplicate comments (5)
cmd/ctrl/api.go (2)

74-78: Remove unused ACME and domain flags.

These flags are defined but never wired to ctrlapi.Config (which has no Acme field). Either remove them or add corresponding config fields if needed for future use.


131-131: image flag not defined but referenced.

cmd.String("image") will return empty string since no --image flag exists in apiCmd.Flags. Either add the flag or remove this line if Image field is unused.

svc/ctrl/worker/run.go (3)

289-307: Cron gated on dnsProvider blocks HTTP‑01 renewals.
Start cron whenever ACME enabled.


398-403: Existing wildcard domain path skips challenge ensure.
Do not return; ensure challenge record.


413-415: Avoid hardcoded workspace ID.
Use certificate.InfraWorkspaceID.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@svc/ctrl/api/config.go`:
- Around line 52-94: Remove the unused config type definitions Route53Config,
AcmeConfig, DepotConfig, and RegistryConfig from the file and delete any imports
only used by them; ensure you also remove any stray comments or docblocks tied
to those types and verify there are no remaining references to these types
(e.g., in any Config struct or functions) so the build remains clean.
♻️ Duplicate comments (5)
svc/api/routes/v2_deploy_get_deployment/handler.go (1)

88-107: Steps and ErrorMessage remain unpopulated.

Previous review flagged that these fields should be populated from deployment_steps records. If intentionally deferred, consider adding a TODO comment to track this work.

svc/ctrl/api/config.go (2)

8-23: Remove unused BuildBackend type and constants.

BuildBackend is defined but the Config struct has no BuildBackend field. This duplicates the definition in svc/ctrl/worker/config.go. Remove to eliminate dead code.


240-247: Implement validation for required fields in Validate().

The method returns nil, allowing invalid configs to pass. Add validation for critical fields:

  • DatabasePrimary (non-empty)
  • HttpPort (> 0)
  • AuthToken (non-empty for secure deployments)
cmd/ctrl/api.go (2)

163-170: Missing Restate flag definitions cause silent zero values.

restate-admin-url, restate-http-port, and restate-register-as are referenced but not defined in apiCmd.Flags. These will be empty/zero at runtime.

If the API service doesn't need these fields, simplify the config:

 		Restate: ctrlapi.RestateConfig{
 			URL:        cmd.String("restate-url"),
-			AdminURL:   cmd.String("restate-admin-url"),
-			HttpPort:   cmd.Int("restate-http-port"),
-			RegisterAs: cmd.String("restate-register-as"),
 			APIKey:     cmd.String("restate-api-key"),
 		},

Or add the missing flags if needed.


74-86: Remove unused CLI flags or wire them to config.

These flags are defined but never passed to ctrlapi.Config:

  • acme-enabled, acme-email-domain (lines 74-75)
  • default-domain, regional-apex-domain (lines 77-78)
  • clickhouse-url (lines 85-86)

Either remove them or add corresponding fields to ctrlapi.Config and wire them in apiAction.

♻️ Remove unused flags
-		cli.Bool("acme-enabled", "Enable Let's Encrypt for acme challenges", cli.EnvVar("UNKEY_ACME_ENABLED")),
-		cli.String("acme-email-domain", "Domain for ACME registration emails (workspace_id@domain)", cli.Default("unkey.com"), cli.EnvVar("UNKEY_ACME_EMAIL_DOMAIN")),
-
-		cli.String("default-domain", "Default domain for auto-generated hostnames", cli.Default("unkey.app"), cli.EnvVar("UNKEY_DEFAULT_DOMAIN")),
-		cli.String("regional-apex-domain", "Apex domain for cross-region frontline communication (e.g., unkey.cloud). Certs are provisioned for *.{region}.{regional-apex-domain}", cli.EnvVar("UNKEY_REGIONAL_APEX_DOMAIN")),
-
 		// Restate Configuration
 		cli.String("restate-url", "URL of the Restate ingress endpoint for invoking workflows. Example: http://restate:8080",
 			cli.Default("http://restate:8080"), cli.EnvVar("UNKEY_RESTATE_INGRESS_URL")),
 		cli.String("restate-api-key", "API key for Restate ingress requests",
 			cli.EnvVar("UNKEY_RESTATE_API_KEY")),
-		cli.String("clickhouse-url", "ClickHouse connection string for analytics. Recommended for production. Example: clickhouse://user:pass@host:9000/unkey",
-			cli.EnvVar("UNKEY_CLICKHOUSE_URL")),

Contributor

@ogzhanolguncu ogzhanolguncu left a comment


LFG

@chronark chronark merged commit cabc294 into main Jan 23, 2026
18 of 22 checks passed
@chronark chronark deleted the chronark/split-restate-worker branch January 23, 2026 15:27
@coderabbitai coderabbitai bot mentioned this pull request Mar 8, 2026