refactor: split restate worker [stacked on 4782]#4798
Conversation
This allows us to use Restate's operator to deploy it independently of our control plane.
📝 Walkthrough

Split the control plane into separate API and Worker services; replace legacy build backends with S3/Hydra-based build flows; introduce Restate-backed durable workflows and region-scoped List+Watch RPCs; add S3 presigned upload endpoints; update DB schema, dev manifests, Bazel targets, tests, and many service signatures and wiring.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API
    participant DB
    participant Restate
    participant Worker
    participant S3
    participant Hydra
    Client->>API: CreateDeployment / generateUploadUrl
    API->>DB: Persist deployment (pending)
    API->>Restate: Start durable workflow (project-scoped)
    Restate->>Worker: Schedule workflow task
    Worker->>S3: GenerateUploadURL / presigned URL
    Client->>S3: Upload build context (PUT to presigned URL)
    Worker->>S3: Fetch build context
    Worker->>Hydra: BuildDockerImage (hydra)
    Hydra-->>Worker: image name / build id
    Worker->>DB: Update deployment status/topology
    Worker->>API: Report status (via DB or ReportDeploymentStatus)
    Worker-->>Restate: Workflow complete
    Restate-->>API: Workflow result
    API-->>Client: Return deployment status / hostnames
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
svc/krane/internal/deployment/delete.go (1)
35-39: Unreachable or redundant code at line 39.

At line 39, `err` will always be `nil` because:

- If `reportDeploymentStatus` at line 28 fails, lines 35-37 return the error.
- If it succeeds, `err` is `nil`.

So `client.IgnoreNotFound(err)` effectively returns `nil`. This appears to be dead code or a copy-paste artifact.

🐛 Suggested fix

```diff
 	err = c.reportDeploymentStatus(ctx, &ctrlv1.ReportDeploymentStatusRequest{
 		Change: &ctrlv1.ReportDeploymentStatusRequest_Delete_{
 			Delete: &ctrlv1.ReportDeploymentStatusRequest_Delete{
 				K8SName: req.GetK8SName(),
 			},
 		},
 	})
 	if err != nil {
 		return err
 	}
-	return client.IgnoreNotFound(err)
+	return nil
 }
```
🤖 Fix all issues with AI agents
In `@dev/k8s/manifests/worker.yaml`:
- Around line 20-24: The manifest mounts the host Docker socket via the volume
named "docker-socket" (path "/var/run/docker.sock"), which grants host-level
privileges; remove the volume and any corresponding volumeMount referencing
"docker-socket" from the Deployment/Pod spec or replace it with a safer
alternative (e.g., use a sidecar DinD image, a remote build service, or a
privileged Pod specifically in a dev-only kustomize/overlay). Ensure the
dev-only nature is explicit by moving the socket mount into a non-production
overlay or feature flag so production manifests never include "docker-socket".
In `@svc/ctrl/integration/seed/seed.go`:
- Around line 412-445: Seeder.CreateRatelimit currently may do nothing if both
req.IdentityID and req.KeyID are nil and can overwrite an earlier error if both
inserts run; add input validation to require at least one of req.IdentityID or
req.KeyID and fail early if neither is set, then call
db.Query.InsertIdentityRatelimit and db.Query.InsertKeyRatelimit independently
(if their corresponding IDs are non-nil), capture each insert's error separately
and assert require.NoError(s.t, err) for each insert (or aggregate and fail if
any error occurred) instead of reusing a single err variable that can be
overwritten; reference the Seeder.CreateRatelimit function and the
db.Query.InsertIdentityRatelimit / InsertKeyRatelimit calls when making these
changes.
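The validation-plus-independent-error-handling pattern the prompt describes can be sketched as follows. The types and insert helpers here are stand-ins for `Seeder.CreateRatelimit` and the `db.Query` calls, which live in the repo; only the control flow is the point.

```go
package main

import (
	"errors"
	"fmt"
)

// ratelimitReq models the relevant fields of the real request type.
type ratelimitReq struct {
	IdentityID *string
	KeyID      *string
}

// Hypothetical stand-ins for db.Query.InsertIdentityRatelimit / InsertKeyRatelimit.
func insertIdentityRatelimit(id string) error { return nil }
func insertKeyRatelimit(id string) error      { return nil }

// createRatelimit fails early when neither target ID is set and checks each
// insert's error independently instead of reusing a single err variable that
// a second insert could overwrite.
func createRatelimit(req ratelimitReq) error {
	if req.IdentityID == nil && req.KeyID == nil {
		return errors.New("either IdentityID or KeyID must be set")
	}
	if req.IdentityID != nil {
		if err := insertIdentityRatelimit(*req.IdentityID); err != nil {
			return fmt.Errorf("insert identity ratelimit: %w", err)
		}
	}
	if req.KeyID != nil {
		if err := insertKeyRatelimit(*req.KeyID); err != nil {
			return fmt.Errorf("insert key ratelimit: %w", err)
		}
	}
	return nil
}

func main() {
	err := createRatelimit(ratelimitReq{}) // neither ID set
	fmt.Println(err != nil)                // validation rejects the empty request
}
```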
In `@svc/krane/internal/deployment/actual_state_report.go`:
- Around line 20-75: The runActualStateReportLoop goroutine needs lifecycle and
reconnection handling: wrap the watch creation (w := c.clientSet...Watch) in a
loop that exits on ctx.Done() or c.done, call w.Stop() when context cancels, and
when w.ResultChan() closes or a watch.Error is received, log and break the inner
event loop then retry creating a new watch (with a small backoff) so the watch
is re-established on expiration; ensure existing logic that casts to
*appsv1.ReplicaSet and calls c.buildDeploymentStatus / c.reportDeploymentStatus
is kept inside the event loop but that the outer loop recreates w on closure or
error to maintain continuous reporting.
In `@svc/krane/internal/deployment/apply.go`:
- Line 104: The Resources field for the container is left empty despite
validating ApplyDeployment.CpuMillicores and MemoryMib; update the container's
Resources (in svc/krane/internal/deployment/apply.go where Resources:
corev1.ResourceRequirements{} is set) to populate Requests and Limits using the
proto fields CpuMillicores and MemoryMib (set CPU as millicores and Memory as
MiB). Use the Kubernetes resource quantities (e.g., resource.NewMilliQuantity
for CPU and resource.NewQuantity with binary/decimal Mi units for memory) to
create a corev1.ResourceList for corev1.ResourceCPU and corev1.ResourceMemory
and assign them to both Requests and Limits so the validated values are actually
applied to the container.
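In the real fix you would build a `corev1.ResourceList` with `resource.NewMilliQuantity(cpuMillicores, resource.DecimalSI)` and `resource.NewQuantity(memoryMib*1024*1024, resource.BinarySI)` and assign it to both `Requests` and `Limits`. This stdlib-only sketch just shows the unit conversions those constructors encode; the types are local stand-ins, not client-go.

```go
package main

import "fmt"

// resourceList mimics the shape of a Kubernetes resource list using quantity
// strings, e.g. "250m" for CPU millicores and "512Mi" for binary mebibytes.
type resourceList struct {
	CPU    string
	Memory string
}

// fromProto converts the validated proto fields (CpuMillicores, MemoryMib)
// into Kubernetes-style quantity strings.
func fromProto(cpuMillicores, memoryMib int64) resourceList {
	return resourceList{
		CPU:    fmt.Sprintf("%dm", cpuMillicores), // millicores -> "Nm"
		Memory: fmt.Sprintf("%dMi", memoryMib),    // MiB -> "NMi" (binary suffix)
	}
}

func main() {
	rl := fromProto(250, 512)
	fmt.Println(rl.CPU, rl.Memory) // 250m 512Mi
}
```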
In `@svc/krane/internal/deployment/desired_state_apply.go`:
- Around line 18-30: The runDesiredStateApplyLoop currently blocks with
time.Sleep and will keep reconnecting after ctx cancellation; change the loop in
Controller.runDesiredStateApplyLoop to use a context-aware wait (e.g., select on
ctx.Done() and a timer channel computed from interval) before calling
streamDesiredStateOnce, and break/return immediately when ctx is cancelled so
the goroutine exits; ensure any sleep/wait uses ctx (select on ctx.Done()) and
avoid calling c.streamDesiredStateOnce when ctx is done to prevent log spam
during shutdown.
- Around line 51-60: The switch over state.GetState() in desired_state_apply.go
currently ignores nil/unknown variants and still advances the version; add a
default branch that returns a hard error (use fault.Internal) when
state.GetState() is nil or an unrecognized oneof so the agent fails fast instead
of acknowledging unsupported state; update the switch that matches
*ctrlv1.DeploymentState_Apply and *ctrlv1.DeploymentState_Delete to include a
default case that returns fault.Internalf (or fault.Internal with context)
mentioning the unexpected state, ensuring callers of this function see the error
rather than silently progressing the version.
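The fail-fast default branch can be sketched with stand-in types; `fault.Internal` is modeled here with a plain wrapped error since this sketch lives outside the repo, and the oneof variants are local placeholders for `*ctrlv1.DeploymentState_Apply` / `_Delete`.

```go
package main

import (
	"errors"
	"fmt"
)

// deploymentState stands in for the ctrlv1 oneof interface.
type deploymentState interface{ isState() }
type stateApply struct{}
type stateDelete struct{}

func (stateApply) isState()  {}
func (stateDelete) isState() {}

// handleState returns a hard error on nil or unrecognized variants instead of
// silently acknowledging them and advancing the version.
func handleState(s deploymentState) error {
	switch s.(type) {
	case stateApply:
		return nil // apply path
	case stateDelete:
		return nil // delete path
	default:
		return fmt.Errorf("unexpected deployment state %T: %w", s, errors.New("internal"))
	}
}

func main() {
	fmt.Println(handleState(nil) != nil) // nil state is rejected, not acknowledged
}
```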
- Around line 66-70: After the receive loop, call stream.Err() and handle any
non-nil error before invoking stream.Close(); if stream.Err() returns an error,
log it via c.logger.Error (include a descriptive message and the error) and
return that error instead of proceeding straight to Close(), so wire/receive
errors from the Receive() loop are not swallowed; ensure you still attempt
Close() or handle Close() errors appropriately (log/return) after checking
stream.Err().
In `@svc/krane/internal/deployment/state.go`:
- Around line 48-51: The code incorrectly references pod.Spec.Resources; replace
that block to read resources from the pod's containers instead: iterate
pod.Spec.Containers (or use the primary container at index 0) and read
container.Resources.Limits.Cpu().MilliValue() and
container.Resources.Limits.Memory().Value() / (1024*1024), assigning the results
to instance.CpuMillicores and instance.MemoryMib respectively; ensure you check
for nil/empty containers and for nil/absent Limits before calling Cpu() or
Memory() to avoid panics.
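The nil-safe container-limit read can be sketched with local stand-ins for the corev1 types; in the real fix you would call `container.Resources.Limits.Cpu().MilliValue()` and `container.Resources.Limits.Memory().Value()/(1024*1024)` on `pod.Spec.Containers`, with the same guards.

```go
package main

import "fmt"

// limits models per-container resource limits; nil means the limit is unset.
type limits struct {
	cpuMilli *int64
	memBytes *int64
}
type container struct{ limits limits }
type podSpec struct{ containers []container }

// primaryLimits reads the first container's limits, returning ok=false when
// containers or limits are absent so callers never dereference nil.
func primaryLimits(p podSpec) (cpuMilli, memMib int64, ok bool) {
	if len(p.containers) == 0 {
		return 0, 0, false
	}
	l := p.containers[0].limits
	if l.cpuMilli == nil || l.memBytes == nil {
		return 0, 0, false
	}
	return *l.cpuMilli, *l.memBytes / (1024 * 1024), true
}

func main() {
	cpu, mem := int64(500), int64(512*1024*1024)
	c, m, ok := primaryLimits(podSpec{containers: []container{{limits{&cpu, &mem}}}})
	fmt.Println(c, m, ok) // 500 512 true
}
```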
In `@svc/krane/internal/sentinel/desired_state_apply.go`:
- Around line 18-31: The reconnection loop in
Controller.runDesiredStateApplyLoop never observes context cancellation; replace
the unconditional sleep+work loop with a select that waits on either a timer
(time.After with the computed interval) or ctx.Done(), and break/return when ctx
is cancelled so the loop exits gracefully; also check ctx.Done() before calling
Controller.streamDesiredStateOnce to avoid starting work after cancellation and
propagate/ignore errors appropriately when ctx is closed.
- Around line 47-72: After the receive loop, check stream.Err() and if non-nil
return that error (so a server-side streaming error is propagated), otherwise
continue as normal; inside the switch handling state.GetState() add a default
case that logs an unexpected/unknown state type (using c.logger.Error and the
raw state info) and continues rather than silently ignoring it. Update the code
around the Receive loop, the switch on state.GetState(), and the
cleanup/stream.Close() handling to use stream.Err(), c.logger, ApplySentinel,
DeleteSentinel, and c.versionLastSeen as needed.
In `@svc/worker/deploy/deploy_handler.go`:
- Around line 205-208: The call to restate.RunVoid wrapping
db.BulkQuery.InsertDeploymentTopologies can fail but its returned error (err) is
not checked, allowing the workflow to continue despite insert failures; update
the surrounding code that invokes restate.RunVoid (the block calling db.Tx,
w.db.RW(), and db.BulkQuery.InsertDeploymentTopologies) to capture and handle
the returned error from restate.RunVoid (and propagate/return it from the
enclosing function or log and return a failure) so the deployment stops instead
of blocking; specifically ensure you check the err variable after
restate.RunVoid and do not ignore it.
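The error-propagation fix can be sketched like this. `runVoid` stands in for `restate.RunVoid` as named in the review (the actual Restate Go SDK signature may differ), and `insertTopologies` is a hypothetical stand-in for the `db.BulkQuery.InsertDeploymentTopologies` transaction; the point is only that the returned error is checked and returned.

```go
package main

import (
	"errors"
	"fmt"
)

// runVoid models a durable-step wrapper that returns the step's error.
func runVoid(step func() error) error { return step() }

// insertTopologies simulates a failing bulk insert.
func insertTopologies() error { return errors.New("duplicate key") }

// deployStep propagates the step error so the workflow fails fast instead of
// continuing (or blocking) past a failed topology insert.
func deployStep() error {
	if err := runVoid(insertTopologies); err != nil {
		return fmt.Errorf("insert deployment topologies: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(deployStep() != nil) // the insert failure is surfaced, not ignored
}
```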
In
`@web/apps/engineering/content/docs/architecture/services/ctrl/pull-based-infra.mdx`:
- Line 186: The copy incorrectly claims "exactly-once delivery"; update the
sentence referencing Krane's watermark and the phrase "exactly-once delivery" to
clarify that the version-based approach ensures "no missed updates" and that any
replays across polling/reconnects are handled via idempotent reprocessing (or
describe it as effectively at-least-once with idempotent handling), e.g.,
replace "exactly-once delivery" with wording that mentions "no missed updates"
and idempotent reprocessing while keeping the reference to Krane's watermark.
In `@web/apps/engineering/content/docs/architecture/services/krane/index.mdx`:
- Around line 10-18: The intro incorrectly says Krane "polls" for changes;
update the opening paragraph to describe the streaming, server-streaming model
instead—mention that Krane connects to ctrl's ClusterService using
WatchDeployments and WatchSentinels RPCs to receive streamed state changes,
processes each event, applies them to local Kubernetes resources, and resumes
from its last-seen version on reconnect; keep references to Krane, ctrl,
ClusterService, WatchDeployments, and WatchSentinels so readers see the
consistent streaming architecture.
In
`@web/apps/engineering/content/docs/architecture/services/krane/sync-engine.mdx`:
- Around line 107-112: The doc claims soft deletes eliminate a separate
changelog table but the PR introduces state_changes; update the "Soft Deletes"
section to reconcile this by noting that resource tables (rows with version and
fields like desired_replicas and desired_state='archived') are the ongoing
source-of-truth for incremental sync, while the new state_changes table is used
only for bootstrapping/streaming/replication scenarios (or short-lived change
streaming) rather than replacing row-level versioning; mention the purpose of
state_changes, how it complements soft deletes, and remove the blanket
"eliminates the need for a separate changelog table" phrasing or qualify it
accordingly.
🧹 Nitpick comments (33)
pkg/assert/greater_or_equal.go (1)
19-19: LGTM! Consider adding `int8` and `int16` for consistency.

The addition of unsigned integer types is correct and non-breaking. However, the constraint now covers all unsigned sizes (`uint8` through `uint64`) but still omits `int8` and `int16` from the signed side. Consider adding them for completeness.

🔧 Optional: Add missing signed types

```diff
-func GreaterOrEqual[T ~int | ~int32 | ~int64 | ~float32 | ~float64 | ~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64](a, b T, message ...string) error {
+func GreaterOrEqual[T ~int | ~int8 | ~int16 | ~int32 | ~int64 | ~float32 | ~float64 | ~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64](a, b T, message ...string) error {
```

svc/krane/secrets/token/k8s_validator.go (3)
3-13: Reorder imports per coding guidelines.

External packages (`k8s.io/...`) should be grouped before internal packages (`github.com/unkeyed/unkey/...`). As per coding guidelines, the order should be: (1) Standard library, (2) External/third-party, (3) Internal packages.

Suggested import order

```diff
 import (
 	"context"
 	"fmt"
 	"strings"

-	"github.com/unkeyed/unkey/svc/krane/pkg/labels"
-
 	authv1 "k8s.io/api/authentication/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/client-go/kubernetes"
+
+	"github.com/unkeyed/unkey/svc/krane/pkg/labels"
 )
```
42-44: Consider validating clientset is not nil.

If `cfg.Clientset` is nil, the validator will panic on first use. A defensive check could provide a clearer error at construction time rather than at validation time.

Optional: Add nil check

```diff
 func NewK8sValidator(cfg K8sValidatorConfig) *K8sValidator {
+	if cfg.Clientset == nil {
+		panic("K8sValidatorConfig.Clientset must not be nil")
+	}
 	return &K8sValidator{clientset: cfg.Clientset}
 }
```
59-123: Consider using the `fault` package for structured error handling.

The coding guidelines specify using the `fault` package for structured error handling with `fault.Wrap()`, `fault.Code()`, etc. Currently all errors use `fmt.Errorf`. This would provide better error categorization and potentially safer public-facing error messages.

Example refactor for one error case

```diff
+import "github.com/unkeyed/unkey/internal/fault"
+
 // In Validate method:
-return nil, fmt.Errorf("failed to validate token via TokenReview: %w", err)
+return nil, fault.Wrap(err, fault.Internal("failed to validate token via TokenReview"))
```

Additionally, on line 71, if `result.Status.Error` is empty, the error message will be less informative. Consider handling this case:

```diff
 if !result.Status.Authenticated {
-	return nil, fmt.Errorf("token not authenticated: %s", result.Status.Error)
+	errMsg := result.Status.Error
+	if errMsg == "" {
+		errMsg = "unknown reason"
+	}
+	return nil, fmt.Errorf("token not authenticated: %s", errMsg)
 }
```

svc/krane/pkg/controlplane/client.go (1)
3-14: Consider organizing imports into groups per coding guidelines.

The imports could be separated with blank lines between groups: (1) standard library, (2) external packages, (6) generated code.

♻️ Suggested import organization

```diff
 import (
 	"context"
 	"crypto/tls"
 	"net"
 	"net/http"
 	"strings"
 	"time"

 	"connectrpc.com/connect"
-	"github.com/unkeyed/unkey/gen/proto/ctrl/v1/ctrlv1connect"
 	"golang.org/x/net/http2"
+
+	"github.com/unkeyed/unkey/gen/proto/ctrl/v1/ctrlv1connect"
 )
```

Based on coding guidelines, imports should be grouped with blank lines: standard library, external/third-party, then generated code. Running `make fmt` may handle this automatically.

svc/krane/internal/deployment/desired_state_apply.go (1)
39-45: Use `fault` for structured error propagation.

Consider wrapping these returned errors with `fault.Wrap`/`fault.Internal` to add context and align with project error-handling conventions. As per coding guidelines, ...

Also applies to: 53-59, 67-69
svc/worker/deploy/deploy_handler.go (1)
185-203: Verify `NextVersion()` calls are replay-safe in workflows.

These per-region `NextVersion()` calls are made directly in the workflow loop. If Restate only dedupes calls wrapped in `restate.Run`/durable call recording, retries could allocate extra versions. Please confirm the client call is workflow-durable or wrap it in a `restate.Run` step (same applies to the sentinel versioning below).

svc/ctrl/run.go (1)
225-240: Consider simplifying the http2.Server initialization.

All fields are explicitly set to their zero values, which is equivalent to using an empty struct literal. This could be simplified for clarity.

♻️ Suggested simplification

```diff
-	h2cHandler := h2c.NewHandler(mux, &http2.Server{
-		MaxHandlers:                  0,
-		MaxConcurrentStreams:         0,
-		MaxDecoderHeaderTableSize:    0,
-		MaxEncoderHeaderTableSize:    0,
-		MaxReadFrameSize:             0,
-		PermitProhibitedCipherSuites: false,
-		IdleTimeout:                  0,
-		ReadIdleTimeout:              0,
-		PingTimeout:                  0,
-		WriteByteTimeout:             0,
-		MaxUploadBufferPerConnection: 0,
-		MaxUploadBufferPerStream:     0,
-		NewWriteScheduler:            nil,
-		CountError:                   nil,
-	})
+	h2cHandler := h2c.NewHandler(mux, &http2.Server{})
```

svc/ctrl/integration/sync_test.go (1)
1156-1168: Consider clarifying the sequence validation logic.

The `bootstrapSequence` variable is set but only used once for initialization. The current logic sets it from the first message then checks all non-bookmark messages have `sequence > 0`. If the intent is also to verify bootstrap messages have consistent sequences, consider adding that assertion explicitly.

♻️ Suggested clarification

```diff
 // All messages during bootstrap should have the same sequence (the max sequence at bootstrap time)
-var bootstrapSequence uint64
 for _, msg := range messages {
 	seq := msg.GetSequence()
-	if bootstrapSequence == 0 {
-		bootstrapSequence = seq
-	}
-	// Bootstrap messages all have the same sequence
-	if msg.GetBookmark() == nil {
-		// Non-bookmark messages should have a valid sequence
-		require.Greater(t, seq, uint64(0), "all messages should have sequence > 0")
-	}
+	// All messages should have a valid sequence > 0
+	require.Greater(t, seq, uint64(0), "all messages should have sequence > 0")
 }
```

svc/krane/internal/deployment/namespace.go (1)
33-40: Consider `fault.Wrap` for structured errors.

Lines 39 and 102 still use `fmt.Errorf`; repo guidelines ask for structured `fault` errors for consistency and codes.

Also applies to: 100-103
svc/worker/config.go (1)
325-327: Consider returning the parse error from `GetBuildPlatform`.

Errors are currently ignored; returning `(BuildPlatform, error)` would guard against misuse outside `Validate()`.

♻️ Safer signature option

```diff
-func (c Config) GetBuildPlatform() BuildPlatform {
-	parsed, _ := parseBuildPlatform(c.BuildPlatform)
-	return parsed
-}
+func (c Config) GetBuildPlatform() (BuildPlatform, error) {
+	return parseBuildPlatform(c.BuildPlatform)
+}
```

svc/ctrl/services/cluster/service.go (1)
3-7: Import ordering doesn't follow coding guidelines.

Per coding guidelines, Go imports should be organized as: (1) Standard library, (2) External/third-party, (3) Internal packages, (4) Package-level, (5) Service-level, (6) Generated code. Currently, generated code (`ctrlv1connect`) is listed first.

Suggested import ordering

```diff
 import (
+	"github.com/unkeyed/unkey/pkg/db"
+	"github.com/unkeyed/unkey/pkg/otel/logging"
+
 	"github.com/unkeyed/unkey/gen/proto/ctrl/v1/ctrlv1connect"
-	"github.com/unkeyed/unkey/pkg/db"
-	"github.com/unkeyed/unkey/pkg/otel/logging"
 )
```

dev/k8s/manifests/worker.yaml (1)
25-138: Consider adding resource limits and health probes.

The worker container lacks resource requests/limits and liveness/readiness probes. Even for development, adding these helps catch resource issues early and ensures the pod lifecycle is properly managed.

♻️ Suggested additions for the container spec

```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"
livenessProbe:
  httpGet:
    path: /health
    port: 7092
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 7092
  initialDelaySeconds: 5
  periodSeconds: 5
```

svc/krane/internal/sentinel/resync.go (2)
23-73: Context cancellation not checked during pagination.

The pagination loop doesn't check for context cancellation between iterations or API calls. If the controller is stopping, this could delay shutdown or cause unnecessary work.

♻️ Add context cancellation check

```diff
 for {
+	select {
+	case <-ctx.Done():
+		c.logger.Info("resync loop cancelled")
+		return
+	default:
+	}
+
 	deployments, err := c.clientSet.AppsV1().Deployments(NamespaceSentinel).List(ctx, metav1.ListOptions{
```
56-65: Consider adding a default case for unknown state types.

If the proto definition adds new state types in the future, this switch would silently ignore them. A default case with logging would help surface unexpected states during debugging.

♻️ Add default case

```diff
 switch res.Msg.GetState().(type) {
 case *ctrlv1.SentinelState_Apply:
 	if err := c.ApplySentinel(ctx, res.Msg.GetApply()); err != nil {
 		c.logger.Error("unable to apply sentinel", "error", err.Error(), "sentinel_id", sentinelID)
 	}
 case *ctrlv1.SentinelState_Delete:
 	if err := c.DeleteSentinel(ctx, res.Msg.GetDelete()); err != nil {
 		c.logger.Error("unable to delete sentinel", "error", err.Error(), "sentinel_id", sentinelID)
 	}
+default:
+	c.logger.Warn("unknown sentinel state type", "sentinel_id", sentinelID)
 }
```

pkg/db/queries/sentinel_list_desired.sql (1)
6-10: Prefer a named parameter for LIMIT to keep sqlc params explicit.

Using positional `LIMIT ?` alongside named args can make parameter ordering less clear in generated code. Consider `LIMIT sqlc.arg(limit)` for consistency.

Suggested change

```diff
-ORDER BY id ASC
-LIMIT ?;
+ORDER BY id ASC
+LIMIT sqlc.arg(limit);
```

web/internal/db/src/schema/sentinels.ts (2)
46-50: Validate bigint precision for `version`.

With `mode: "number"`, values beyond 2^53-1 lose precision and can break ordering/comparisons. If versions might grow large, consider `mode: "bigint"` (BigInt) or enforce an upper bound.

Possible change

```diff
-	version: bigint("version", { mode: "number", unsigned: true }).notNull().unique(),
+	version: bigint("version", { mode: "bigint", unsigned: true }).notNull().unique(),
```
55-57: Consider an index aligned with desired_state + region + id pagination.

The bootstrap query filters on `region` and `desired_state` and paginates by `id`. If this path is hot, a composite index (e.g., `(region, desired_state, id)`) can avoid scans.

Possible index addition

```diff
 (table) => [
 	index("idx_environment_id").on(table.environmentId),
+	index("region_desired_state_id_idx").on(table.region, table.desiredState, table.id),
 	index("region_version_idx").on(table.region, table.version),
 	uniqueIndex("one_env_per_region").on(table.environmentId, table.region),
 ],
```

svc/krane/internal/deployment/resync.go (1)
23-73: Add context cancellation handling and a default case in the state switch.

The review identifies two separate concerns:

1. Context cancellation: Unkey's `repeat.Every` returns a stop function that must be called explicitly to halt the loop; it doesn't accept context directly. The callback should check `ctx.Done()` to break early if the context is cancelled, but more importantly, the stop function returned by `repeat.Every()` must be captured and called when the parent context is cancelled. Without proper teardown, graceful shutdown may be delayed.

2. Missing default case: The switch statement (lines 56-65) lacks a `default` case. If the control plane returns an unexpected state type, it will be silently ignored, potentially masking future API changes or bugs.

Suggested improvements

```diff
 switch res.Msg.GetState().(type) {
 case *ctrlv1.DeploymentState_Apply:
 	if err := c.ApplyDeployment(ctx, res.Msg.GetApply()); err != nil {
 		c.logger.Error("unable to apply deployment", "error", err.Error(), "deployment_id", deploymentID)
 	}
 case *ctrlv1.DeploymentState_Delete:
 	if err := c.DeleteDeployment(ctx, res.Msg.GetDelete()); err != nil {
 		c.logger.Error("unable to delete deployment", "error", err.Error(), "deployment_id", deploymentID)
 	}
+default:
+	c.logger.Warn("unknown deployment state type", "deployment_id", deploymentID)
 }
```

For context cancellation, ensure the stop function returned by `repeat.Every()` is properly managed in the caller's lifecycle (e.g., called when the controller shuts down or context is cancelled).

svc/krane/internal/deployment/state.go (2)
3-12: Import order should follow coding guidelines.

Per the Go coding guidelines, imports should be grouped as: (1) Standard library, (2) External/third-party packages, (3) Internal packages. The internal `ctrlv1` import should come after the k8s packages.

Suggested import reordering

```diff
 import (
 	"context"
 	"fmt"
 	"strings"

-	ctrlv1 "github.com/unkeyed/unkey/gen/proto/ctrl/v1"
 	appsv1 "k8s.io/api/apps/v1"
 	corev1 "k8s.io/api/core/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+
+	ctrlv1 "github.com/unkeyed/unkey/gen/proto/ctrl/v1"
 )
```
53-73: Status mapping logic for running pods may misclassify temporarily unready pods as failed.

When a pod is `Running` but `ContainersReady` is not true (e.g., during liveness probe initialization), setting status to `STATUS_FAILED` is potentially misleading. A pod that is starting up or temporarily unready is not necessarily failed.

Consider using `STATUS_PENDING` for running-but-not-ready pods, or introduce a distinct status for degraded/unhealthy state.

Alternative mapping

```diff
 if allReady {
 	instance.Status = ctrlv1.ReportDeploymentStatusRequest_Update_Instance_STATUS_RUNNING
 } else {
-	instance.Status = ctrlv1.ReportDeploymentStatusRequest_Update_Instance_STATUS_FAILED
+	instance.Status = ctrlv1.ReportDeploymentStatusRequest_Update_Instance_STATUS_PENDING
 }
```

svc/ctrl/services/cluster/rpc_report_sentinel_status.go (1)
35-45: Consider adding a default case for the health switch.

If `req.Msg.GetHealth()` returns an unexpected or future enum value not covered by the explicit cases, `health` will remain at its zero value. While `HEALTH_UNSPECIFIED` maps to `Unknown`, an explicit `default` case would make the behavior clearer and catch any unexpected values.

Add explicit default case

```diff
 switch req.Msg.GetHealth() {
 case ctrlv1.Health_HEALTH_HEALTHY:
 	health = db.SentinelsHealthHealthy
 case ctrlv1.Health_HEALTH_UNHEALTHY:
 	health = db.SentinelsHealthUnhealthy
 case ctrlv1.Health_HEALTH_PAUSED:
 	health = db.SentinelsHealthPaused
 case ctrlv1.Health_HEALTH_UNSPECIFIED:
 	health = db.SentinelsHealthUnknown
+default:
+	health = db.SentinelsHealthUnknown
 }
```

svc/krane/internal/deployment/apply.go (1)
3-18: Import order should follow coding guidelines.

Per the Go coding guidelines, imports should be grouped as: (1) Standard library, (2) External/third-party packages, (3) Internal packages. The k8s packages should come before internal `github.com/unkeyed/unkey` packages.

Suggested import reordering

```diff
 import (
 	"context"
 	"encoding/base64"
 	"encoding/json"
 	"fmt"
 	"strconv"

-	ctrlv1 "github.com/unkeyed/unkey/gen/proto/ctrl/v1"
-	"github.com/unkeyed/unkey/pkg/assert"
-	"github.com/unkeyed/unkey/pkg/ptr"
-	"github.com/unkeyed/unkey/svc/krane/pkg/labels"
 	appsv1 "k8s.io/api/apps/v1"
 	corev1 "k8s.io/api/core/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/types"
+
+	ctrlv1 "github.com/unkeyed/unkey/gen/proto/ctrl/v1"
+	"github.com/unkeyed/unkey/pkg/assert"
+	"github.com/unkeyed/unkey/pkg/ptr"
+	"github.com/unkeyed/unkey/svc/krane/pkg/labels"
 )
```

svc/krane/internal/sentinel/controller.go (1)
76-80: Consider waiting for goroutines to complete on Stop.

The `Stop` method closes the `done` channel but doesn't wait for the background goroutines to finish. This could lead to incomplete shutdown where goroutines are still running after `Stop` returns.

Consider using a `sync.WaitGroup` to track goroutines and wait for their completion:

♻️ Suggested improvement

```diff
 type Controller struct {
 	clientSet       kubernetes.Interface
 	logger          logging.Logger
 	cluster         ctrlv1connect.ClusterServiceClient
 	cb              circuitbreaker.CircuitBreaker[any]
 	done            chan struct{}
+	wg              sync.WaitGroup
 	region          string
 	versionLastSeen uint64
 }

 func (c *Controller) Stop() error {
 	close(c.done)
+	c.wg.Wait()
 	return nil
 }
```

svc/ctrl/services/cluster/rpc_watch_sentinels.go (1)
36-61: Consider making the polling interval configurable.

The 1-second sleep at line 59 is hardcoded. For production systems, this should likely be configurable to tune the trade-off between responsiveness and database load.
Additionally, consider adding a backoff mechanism if the stream repeatedly returns empty results, which could indicate low activity periods.
svc/krane/internal/deployment/controller.go (1)
86-95: Use `fault.*` for structured errors.

`fmt.Errorf` here bypasses the standardized fault wrapping used elsewhere. Please wrap errors with `fault.Wrap`/`fault.Internal` (and codes where applicable). As per coding guidelines, ...

svc/ctrl/integration/harness.go (1)
173-186: Potential inconsistency in returned Topology timestamp.

The `CreateDeployment` method manually constructs the returned `DeploymentTopology` with `CreatedAt: h.Now()` (line 183), but this timestamp may differ from what was actually inserted into the database (line 166). For test assertions that compare timestamps, this could cause flaky behavior.

Consider fetching the topology from the database instead of constructing it manually, or at minimum, capture the timestamp before the insert and reuse it.

💡 Suggested improvement

```diff
+	createdAt := h.Now()
 	h.versionCounter++
 	err = db.Query.InsertDeploymentTopology(ctx, h.DB.RW(), db.InsertDeploymentTopologyParams{
 		WorkspaceID:     workspaceID,
 		DeploymentID:    deploymentID,
 		Region:          req.Region,
 		DesiredReplicas: 1,
 		DesiredStatus:   db.DeploymentTopologyDesiredStatusStarted,
 		Version:         h.versionCounter,
-		CreatedAt:       h.Now(),
+		CreatedAt:       createdAt,
 	})
 	require.NoError(h.t, err)

 	deployment, err := db.Query.FindDeploymentById(ctx, h.DB.RO(), deploymentID)
 	require.NoError(h.t, err)

 	return CreateDeploymentResult{
 		Deployment: deployment,
 		Topology: db.DeploymentTopology{
 			Pk:              0,
 			WorkspaceID:     workspaceID,
 			DeploymentID:    deploymentID,
 			Region:          req.Region,
 			DesiredReplicas: 1,
 			DesiredStatus:   db.DeploymentTopologyDesiredStatusStarted,
 			Version:         h.versionCounter,
-			CreatedAt:       h.Now(),
+			CreatedAt:       createdAt,
 			UpdatedAt:       sql.NullInt64{Valid: false},
 		},
 	}
```

svc/krane/internal/sentinel/apply.go (1)
206-220: Resource requirements are commented out.

The CPU, memory, and ephemeral storage resource limits and requests are commented out. This may be intentional for development flexibility, but in production, running without resource limits can lead to noisy neighbor problems and make capacity planning difficult.
Consider adding a TODO comment explaining when these should be enabled, or making them configurable.
svc/krane/internal/testutil/mock_cluster_client.go (1)
31-43: Default nil returns for Watch methods may need documentation.

`WatchDeployments` and `WatchSentinels` return `nil, nil` when no hook is set. This could cause nil pointer dereferences in tests that don't explicitly set the hook but still call these methods. Consider either:

- Adding a comment warning about this behavior
- Returning an error by default to make the failure explicit

This is acceptable for test utilities where callers are expected to configure what they need.

💡 Optional: Add warning comment

```diff
 func (m *MockClusterClient) WatchDeployments(ctx context.Context, req *connect.Request[ctrlv1.WatchDeploymentsRequest]) (*connect.ServerStreamForClient[ctrlv1.DeploymentState], error) {
 	if m.WatchDeploymentsFunc != nil {
 		return m.WatchDeploymentsFunc(ctx, req)
 	}
+	// Returns nil - tests using this method must set WatchDeploymentsFunc
 	return nil, nil
 }
```

svc/ctrl/services/cluster/rpc_report_deployment_status.go (2)
3-11: Re-group imports to match repo ordering
Line 3–11 mixes generated and pkg imports in the same group. Please split into standard / third-party / pkg / gen groups.
As per coding guidelines, please keep Go imports grouped by category.
♻️ Suggested import grouping
```diff
 import (
 	"context"

 	"connectrpc.com/connect"
-	ctrlv1 "github.com/unkeyed/unkey/gen/proto/ctrl/v1"
-	"github.com/unkeyed/unkey/pkg/assert"
-	"github.com/unkeyed/unkey/pkg/db"
-	"github.com/unkeyed/unkey/pkg/uid"
+
+	"github.com/unkeyed/unkey/pkg/assert"
+	"github.com/unkeyed/unkey/pkg/db"
+	"github.com/unkeyed/unkey/pkg/uid"
+
+	ctrlv1 "github.com/unkeyed/unkey/gen/proto/ctrl/v1"
 )
```
38-107: Use txCtx inside the TxRetry block
Line 38+ uses `ctx` for DB calls inside `TxRetry`. Prefer `txCtx` so cancellation/timeouts are bound to the transactional context.
✅ Suggested change
```diff
-	deployment, err := db.Query.FindDeploymentByK8sName(ctx, tx, msg.Update.GetK8SName())
+	deployment, err := db.Query.FindDeploymentByK8sName(txCtx, tx, msg.Update.GetK8SName())
 	if err != nil {
 		return err
 	}
-	staleInstances, err := db.Query.FindInstancesByDeploymentIdAndRegion(ctx, tx, db.FindInstancesByDeploymentIdAndRegionParams{
+	staleInstances, err := db.Query.FindInstancesByDeploymentIdAndRegion(txCtx, tx, db.FindInstancesByDeploymentIdAndRegionParams{
 		Deploymentid: deployment.ID,
 		Region:       region,
 	})
 	if err != nil {
 		return err
 	}
@@
-	err = db.Query.DeleteInstance(ctx, tx, db.DeleteInstanceParams{
+	err = db.Query.DeleteInstance(txCtx, tx, db.DeleteInstanceParams{
 		K8sName: staleInstance.K8sName,
 		Region:  region,
 	})
 	if err != nil {
 		return err
@@
-	err = db.Query.UpsertInstance(ctx, tx, db.UpsertInstanceParams{
+	err = db.Query.UpsertInstance(txCtx, tx, db.UpsertInstanceParams{
 		ID:           uid.New(uid.InstancePrefix),
 		DeploymentID: deployment.ID,
@@
-	deployment, err := db.Query.FindDeploymentByK8sName(ctx, tx, msg.Delete.GetK8SName())
+	deployment, err := db.Query.FindDeploymentByK8sName(txCtx, tx, msg.Delete.GetK8SName())
 	if err != nil {
 		return err
 	}
-	err = db.Query.DeleteDeploymentInstances(ctx, tx, db.DeleteDeploymentInstancesParams{
+	err = db.Query.DeleteDeploymentInstances(txCtx, tx, db.DeleteDeploymentInstancesParams{
 		DeploymentID: deployment.ID,
 		Region:       region,
 	})
```

svc/ctrl/integration/seed/seed.go (2)
3-20: Re-group imports to match repo conventions
Line 3–20 mixes gen/pkg/third-party imports. Please split into standard / third-party / pkg / gen groups.
As per coding guidelines, please keep Go imports grouped by category.
♻️ Suggested import grouping
```diff
 import (
 	"context"
 	"database/sql"
 	"errors"
 	"testing"
 	"time"

 	"github.com/go-sql-driver/mysql"
 	"github.com/stretchr/testify/require"
-	vaultv1 "github.com/unkeyed/unkey/gen/proto/vault/v1"
 	"github.com/unkeyed/unkey/pkg/assert"
 	"github.com/unkeyed/unkey/pkg/db"
 	dbtype "github.com/unkeyed/unkey/pkg/db/types"
 	"github.com/unkeyed/unkey/pkg/hash"
 	"github.com/unkeyed/unkey/pkg/ptr"
 	"github.com/unkeyed/unkey/pkg/uid"
 	"github.com/unkeyed/unkey/pkg/vault"
+
+	vaultv1 "github.com/unkeyed/unkey/gen/proto/vault/v1"
 )
```
134-166: Use a consistent receiver name
Line 134 uses `h *Seeder` while the rest of the file uses `s *Seeder`. Aligning the receiver name improves scannability.
🔧 Suggested tweak
```diff
-func (h *Seeder) CreateProject(ctx context.Context, req CreateProjectRequest) db.Project {
-	err := db.Query.InsertProject(ctx, h.DB.RW(), db.InsertProjectParams{
+func (s *Seeder) CreateProject(ctx context.Context, req CreateProjectRequest) db.Project {
+	err := db.Query.InsertProject(ctx, s.DB.RW(), db.InsertProjectParams{
 		ID:          req.ID,
 		WorkspaceID: req.WorkspaceID,
@@
-	require.NoError(h.t, err)
+	require.NoError(s.t, err)

-	project, err := db.Query.FindProjectById(ctx, h.DB.RO(), req.ID)
-	require.NoError(h.t, err)
+	project, err := db.Query.FindProjectById(ctx, s.DB.RO(), req.ID)
+	require.NoError(s.t, err)
```
web/apps/engineering/content/docs/architecture/services/ctrl/pull-based-infra.mdx
web/apps/engineering/content/docs/architecture/services/krane/sync-engine.mdx
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…key into chronark/split-restate-worker
ogzhanolguncu
left a comment
Looks good you might wanna go over coderabbit comments tho
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@cmd/ctrl/api.go`:
- Around line 164-171: The RestateConfig block assigns Restate.AdminURL,
HttpPort, and RegisterAs but those CLI flags are not defined on apiCmd.Flags,
causing zero values; either add the missing flags (define "restate-admin-url",
"restate-http-port", and "restate-register-as" on apiCmd.Flags alongside the
existing "restate-url" and "restate-api-key") or remove/simplify those fields
from the Restate: ctrlapi.RestateConfig assignment if they are unused; update
the code in cmd/ctrl/api.go around the RestateConfig and where apiCmd.Flags are
declared so the flag names match the RestateConfig fields (symbols:
RestateConfig, Restate, apiCmd.Flags).
In `@cmd/ctrl/worker.go`:
- Line 169: The code calls cmd.String("restate-url") but the flag "restate-url"
is never defined, so Restate.URL (the Restate.URL field) will always be empty;
add a new string flag declaration for "restate-url" (e.g., alongside other
cmd.String(...) flag definitions) or change the call to use the existing flag
name if one already provides the Restate URL, and ensure the Restate.URL
assignment uses that defined flag variable so the value is populated at runtime.
- Line 131: Add a "build-backend" flag to the worker command so the referenced
flag isn't always empty: define a string flag on workerCmd.Flags() (e.g. create
the flag with workerCmd.Flags().String("build-backend", "", "path/name of build
backend") or equivalent StringVar) before using cmd.String("build-backend"),
ensuring the value passed into BuildBackend (the BuildBackend field in the
struct and the call worker.BuildBackend(...)) receives the actual flag value.
In `@svc/ctrl/worker/run.go`:
- Around line 227-233: The anonymous goroutine that starts the Restate server
currently only logs startErr from restateSrv.Start and allows the worker to
continue; change this to propagate the error to the main startup flow so the
process fails fast: create or use an existing startup error channel (e.g., a
chan error) or context cancellation mechanism in the surrounding function and
send startErr (from restateSrv.Start) to it instead of just calling
logger.Error, then ensure the caller waits for that channel and returns/shuts
down on any non-nil error so that failures in restateSrv.Start (referenced here
as restateSrv.Start and cfg.Restate.HttpPort) cause the worker to stop rather
than continue silently without Restate.
♻️ Duplicate comments (5)
cmd/ctrl/api.go (2)
131-131: `image` flag not defined but referenced.
`cmd.String("image")` references a flag that doesn't exist in `apiCmd.Flags`. This will always return an empty string. Either add the flag or remove this assignment if `Image` is unused.
74-75: ACME flags defined but not wired to config.
`acme-enabled` and `acme-email-domain` are defined but never passed to `ctrlapi.Config`. Either remove these flags or add an `Acme` field to the config struct and wire them.

svc/ctrl/worker/run.go (3)
289-297: Renewal cron shouldn’t depend on DNS provider.
HTTP-01 renewals are skipped when Route53 is off.
398-403: Don’t return on existing wildcard domain.
Skips ACME challenge bootstrap.
413-414: Reuse InfraWorkspaceID constant.
Avoid drift.
🧹 Nitpick comments (1)
cmd/ctrl/worker.go (1)
79-81: Cloudflare ACME flags defined but not wired to config.
`acme-cloudflare-enabled` and `acme-cloudflare-api-token` flags are defined but `worker.AcmeConfig` only contains `Route53Config` — no `Cloudflare` field exists. Remove the unused flags or add a `CloudflareConfig` to the struct.
Remove unused flags
```diff
-	// Cloudflare DNS provider
-	cli.Bool("acme-cloudflare-enabled", "Enable Cloudflare for wildcard certificates", cli.EnvVar("UNKEY_ACME_CLOUDFLARE_ENABLED")),
-	cli.String("acme-cloudflare-api-token", "Cloudflare API token for Let's Encrypt", cli.EnvVar("UNKEY_ACME_CLOUDFLARE_API_TOKEN")),
-
```
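The recurring theme in these comments is flags that are defined but never wired into a config struct (or read but never defined). As a generic illustration of keeping definition and wiring in one place — using the stdlib `flag` package rather than the repo's own `cli` helpers, with hypothetical field names:

```go
package main

import (
	"flag"
	"fmt"
)

// Config mirrors the idea of worker.Config: every flag that is
// defined has a field it is wired into, and vice versa.
type Config struct {
	CloudflareEnabled  bool
	CloudflareAPIToken string
}

// parseConfig defines the flags and populates Config from args in a
// single function, so a defined-but-unwired (or read-but-undefined)
// flag cannot silently slip through.
func parseConfig(args []string) (Config, error) {
	fs := flag.NewFlagSet("worker", flag.ContinueOnError)
	enabled := fs.Bool("acme-cloudflare-enabled", false, "Enable Cloudflare for wildcard certificates")
	token := fs.String("acme-cloudflare-api-token", "", "Cloudflare API token")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	return Config{
		CloudflareEnabled:  *enabled,
		CloudflareAPIToken: *token,
	}, nil
}

func main() {
	cfg, err := parseConfig([]string{"-acme-cloudflare-enabled", "-acme-cloudflare-api-token", "tok_123"})
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.CloudflareEnabled, cfg.CloudflareAPIToken)
}
```

Co-locating flag definition and struct assignment makes the "flag defined but never wired" class of bug visible at a glance.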
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@cmd/ctrl/api.go`:
- Around line 165-170: The RestateConfig instantiation references flags that
aren't defined; add corresponding command-line flags for "restate-admin-url",
"restate-http-port", and "restate-register-as" so the values passed into
RestateConfig.URL/AdminURL/HttpPort/RegisterAs come from defined flags. Locate
where other flags are declared and add definitions for those three (matching
types: string for admin URL and register-as, int for http port) so
RestateConfig{ URL, AdminURL, HttpPort, RegisterAs, APIKey } can read them
reliably.
In `@cmd/ctrl/worker.go`:
- Around line 130-132: The BuildBackend value is using
cmd.String("build-backend") but no CLI flag is defined, so BuildBackend will
always be empty; add a CLI flag definition for "build-backend" on the same
cobra/kingpin command object that defines other flags (where cmd is created) —
e.g., register cmd.Flag("build-backend", "build backend to use").String() (or
the equivalent in your CLI lib) with an appropriate default and help text so
worker.BuildBackend receives the intended value when referenced in the Build
configuration.
- Around line 79-82: The two Cloudflare ACME CLI flags
("acme-cloudflare-enabled" and "acme-cloudflare-api-token") are declared but
never used; either remove them or wire them into the worker.Config so they take
effect. To fix, add fields to the worker.Config struct (e.g.,
AcmeCloudflareEnabled bool and AcmeCloudflareAPIToken string) and populate them
when constructing the config in the CLI handling code (read from the cli.Context
using the exact flag names "acme-cloudflare-enabled" and
"acme-cloudflare-api-token"), then pass those fields to any ACME/cloudflare
setup logic that expects them; alternatively, if Cloudflare support is not
needed, remove the two cli.Bool/cli.String flag declarations to avoid dead
flags.
- Around line 168-170: The Restate URL flag is missing so RestateConfig.URL is
empty; add a new CLI flag named "restate-url" (using the same cmd variable where
other flags are defined) with an appropriate default/empty value and
description, e.g., cmd.String("restate-url", "", "URL for Restate API"), and
ensure RestateConfig{ URL: cmd.String("restate-url"), AdminURL:
cmd.String("restate-admin-url") } uses that flag; also verify that
"restate-admin-url" is present and documented similarly.
In `@svc/ctrl/worker/run.go`:
- Around line 82-90: The vault client is created with http.DefaultClient which
has no timeout and can hang; replace http.DefaultClient with a dedicated
*http.Client that sets a sensible Timeout (e.g., 5–30s) and use that when
calling vaultv1connect.NewVaultServiceClient(cfg.VaultURL, ...), ensuring you
import time if needed and keep the existing interceptors and Authorization
header (references: vaultClient, vaultv1connect.NewVaultServiceClient,
cfg.VaultURL, cfg.VaultToken).
♻️ Duplicate comments (5)
cmd/ctrl/api.go (2)
131-131: `image` flag still undefined.
Line 131 uses `cmd.String("image")` but no flag exists. Remove or add the flag.
✅ Remove unused field assignment

```diff
-	Image: cmd.String("image"),
```
74-75: ACME flags unused in api config.
Flags defined but not wired into `ctrlapi.Config`. Remove or plumb.

svc/ctrl/worker/run.go (3)
294-297: Renewal cron blocked for HTTP-01.
Guard requires `dnsProvider != nil`; HTTP-01 certs won't renew.
✅ Minimal fix

```diff
-	if cfg.Acme.Enabled && dnsProvider != nil {
+	if cfg.Acme.Enabled {
```
398-403: Existing domain skips challenge bootstrap.
Early return prevents ensuring ACME challenge.
413-414: Use InfraWorkspaceID constant.
Avoid hardcoded `"unkey_internal"`.
✅ Use constant

```diff
-	workspaceID := "unkey_internal"
+	workspaceID := certificate.InfraWorkspaceID
```
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@cmd/ctrl/api.go`:
- Around line 165-171: The RestateConfig fields (URL, AdminURL, HttpPort,
RegisterAs, APIKey) are being populated from cmd.String/cmd.Int values that are
not defined on apiCmd.Flags, causing silent zero/empty values; either remove the
unused fields from the RestateConfig construction in the api command or add the
corresponding flag definitions to apiCmd.Flags (e.g., define "restate-admin-url"
and "restate-register-as" as String flags and "restate-http-port" as an Int
flag) so that RestateConfig.AdminURL, RestateConfig.HttpPort, and
RestateConfig.RegisterAs receive real values; locate the config construction
referencing RestateConfig and the apiCmd.Flags setup to make the change.
In `@cmd/ctrl/worker.go`:
- Line 169: The config struct is reading a "restate-url" via
cmd.String("restate-url") but that flag is never defined on workerCmd.Flags, so
the value will always be empty; add a flag definition to the worker command
(e.g., call workerCmd.Flags().String("restate-url", "", "ReState ingress URL for
workflow invocations") or the equivalent StringVar on the same Flags() used
elsewhere) so that the symbol cmd.String("restate-url") resolves to the provided
value when parsed.
- Line 131: The code references cmd.String("build-backend") when constructing
BuildBackend (in the worker.BuildBackend field) but the "build-backend" flag is
not registered on workerCmd.Flags, so it always yields an empty string; fix this
by adding a string flag registration for "build-backend" on workerCmd.Flags
(e.g., in the init or command setup where workerCmd is configured) so the value
returned by cmd.String("build-backend") is populated and the BuildBackend
selection works correctly—ensure the flag name exactly matches "build-backend"
and provide a sensible default and usage text when calling Flags().String(...)
(or the equivalent flag registration method used in this codebase).
♻️ Duplicate comments (5)
cmd/ctrl/api.go (2)
74-78: Unused ACME flags in API command.
ACME flags are defined but never wired to `ctrlapi.Config` (which lacks an `Acme` field). ACME is only used by the worker service. Remove these flags or add proper wiring if needed.
129-131: Remove unused `Image` field assignment.
`cmd.String("image")` references a flag not defined in `apiCmd.Flags`. This will always return an empty string. The `Image` field appears unused in the API service.

```diff
 config := ctrlapi.Config{
 	// Basic configuration
-	Image:    cmd.String("image"),
 	HttpPort: cmd.Int("http-port"),
```

svc/ctrl/worker/run.go (3)
413-414: Use constant for infrastructure workspace ID.
The hardcoded `"unkey_internal"` appears to duplicate `certificate.InfraWorkspaceID`. Reuse the constant to prevent drift.

```bash
#!/bin/bash
# Verify if InfraWorkspaceID constant exists in certificate package
rg -n "InfraWorkspaceID" svc/ctrl/worker/certificate/
```
294-310: Renewal cron gated on DNS provider only blocks HTTP-01 renewals.
The condition `cfg.Acme.Enabled && dnsProvider != nil` prevents certificate renewals when only an HTTP-01 provider is configured. The renewal cron should start whenever ACME is enabled.

```diff
 // Start the certificate renewal cron job if ACME is enabled
 // Use Send with idempotency key so multiple restarts don't create duplicate crons
-if cfg.Acme.Enabled && dnsProvider != nil {
+if cfg.Acme.Enabled {
```
398-407: Early return may skip ACME challenge creation for existing domain.If a domain record exists but its ACME challenge is missing or invalid, this early return skips challenge bootstrap. Consider checking challenge existence too or making the upsert idempotent.
🧹 Nitpick comments (2)
svc/ctrl/worker/run.go (1)
82-93: Consider using a configured HTTP client for vault.
`http.DefaultClient` lacks timeout configuration. While Connect clients may handle timeouts internally, using a dedicated client with a timeout is safer for production.

cmd/ctrl/worker.go (1)
79-81: Unused Cloudflare ACME flags.
`acme-cloudflare-enabled` and `acme-cloudflare-api-token` flags are defined but never wired to `worker.Config`. Either remove these flags or add Cloudflare config support.

```diff
-	// Cloudflare DNS provider
-	cli.Bool("acme-cloudflare-enabled", "Enable Cloudflare for wildcard certificates", cli.EnvVar("UNKEY_ACME_CLOUDFLARE_ENABLED")),
-	cli.String("acme-cloudflare-api-token", "Cloudflare API token for Let's Encrypt", cli.EnvVar("UNKEY_ACME_CLOUDFLARE_API_TOKEN")),
-
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@cmd/ctrl/api.go`:
- Around line 164-171: The Restate config is being populated with AdminURL,
HttpPort and RegisterAs using flags that don't exist, so those fields will be
empty/zero; fix by either (A) adding the missing flags (define flags for
"restate-admin-url", "restate-http-port", "restate-register-as" alongside the
existing "restate-url" and "restate-api-key") so ctrlapi.RestateConfig receives
real values, or (B) if the API process doesn't need those fields, stop
populating them and only set RestateConfig.URL and APIKey (or simplify
ctrlapi.RestateConfig usage for API-only paths); locate the RestateConfig
construction in cmd/ctrl/api.go (the Restate: ctrlapi.RestateConfig{...} block)
and apply one of these two fixes.
- Around line 80-84: The configuration is missing flag definitions for
restate-admin-url, restate-http-port, and restate-register-as referenced later;
add corresponding CLI flags in cmd/ctrl/api.go alongside the existing
restate-url and restate-api-key entries: define a
cli.String("restate-admin-url", ...) with a suitable default and EnvVar (e.g.,
UNKEY_RESTATE_ADMIN_URL), define a cli.Int("restate-http-port", ...) with a
default port and EnvVar (e.g., UNKEY_RESTATE_HTTP_PORT), and define a
cli.String("restate-register-as", ...) with a default and EnvVar (e.g.,
UNKEY_RESTATE_REGISTER_AS) so the code that reads restate-admin-url,
restate-http-port, and restate-register-as gets non-empty values.
♻️ Duplicate comments (5)
cmd/ctrl/api.go (2)
74-78: Remove unused ACME and domain flags.
These flags are defined but never wired to `ctrlapi.Config` (which has no `Acme` field). Either remove them or add corresponding config fields if needed for future use.
131-131: `image` flag not defined but referenced.
`cmd.String("image")` will return an empty string since no `--image` flag exists in `apiCmd.Flags`. Either add the flag or remove this line if the `Image` field is unused.

svc/ctrl/worker/run.go (3)
289-307: Cron gated on dnsProvider blocks HTTP‑01 renewals.
Start cron whenever ACME enabled.
398-403: Existing wildcard domain path skips challenge ensure.
Do not return; ensure challenge record.
413-415: Avoid hardcoded workspace ID.
Use certificate.InfraWorkspaceID.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@svc/ctrl/api/config.go`:
- Around line 52-94: Remove the unused config type definitions Route53Config,
AcmeConfig, DepotConfig, and RegistryConfig from the file and delete any imports
only used by them; ensure you also remove any stray comments or docblocks tied
to those types and verify there are no remaining references to these types
(e.g., in any Config struct or functions) so the build remains clean.
♻️ Duplicate comments (5)
svc/api/routes/v2_deploy_get_deployment/handler.go (1)
88-107: `Steps` and `ErrorMessage` remain unpopulated.
Previous review flagged that these fields should be populated from `deployment_steps` records. If intentionally deferred, consider adding a TODO comment to track this work.

svc/ctrl/api/config.go (2)
8-23: Remove unused `BuildBackend` type and constants.
`BuildBackend` is defined but the `Config` struct has no `BuildBackend` field. This duplicates the definition in `svc/ctrl/worker/config.go`. Remove to eliminate dead code.
240-247: Implement validation for required fields in `Validate()`.
The method returns nil, allowing invalid configs to pass. Add validation for critical fields:
- `DatabasePrimary` (non-empty)
- `HttpPort` (> 0)
- `AuthToken` (non-empty for secure deployments)

cmd/ctrl/api.go (2)
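A sketch of what such a `Validate` could look like, reduced to just the three fields named above (the real `Config` has more, and the error wording is illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// Config holds the subset of fields the review calls critical.
type Config struct {
	DatabasePrimary string
	HttpPort        int
	AuthToken       string
}

// Validate rejects zero values instead of returning nil unconditionally,
// so a misconfigured process fails at startup rather than at first use.
func (c Config) Validate() error {
	if c.DatabasePrimary == "" {
		return errors.New("config: DatabasePrimary must not be empty")
	}
	if c.HttpPort <= 0 {
		return fmt.Errorf("config: HttpPort must be > 0, got %d", c.HttpPort)
	}
	if c.AuthToken == "" {
		return errors.New("config: AuthToken must not be empty")
	}
	return nil
}

func main() {
	bad := Config{HttpPort: 8080}
	good := Config{DatabasePrimary: "mysql://primary", HttpPort: 8080, AuthToken: "secret"}
	fmt.Println(bad.Validate() != nil, good.Validate() == nil)
}
```

Returning on the first failure keeps the error actionable; an alternative is collecting all failures with `errors.Join`.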
163-170: Missing Restate flag definitions cause silent zero values.
`restate-admin-url`, `restate-http-port`, and `restate-register-as` are referenced but not defined in `apiCmd.Flags`. These will be empty/zero at runtime.
If the API service doesn't need these fields, simplify the config:
```diff
 Restate: ctrlapi.RestateConfig{
 	URL:    cmd.String("restate-url"),
-	AdminURL:   cmd.String("restate-admin-url"),
-	HttpPort:   cmd.Int("restate-http-port"),
-	RegisterAs: cmd.String("restate-register-as"),
 	APIKey: cmd.String("restate-api-key"),
 },
```

Or add the missing flags if needed.
74-86: Remove unused CLI flags or wire them to config.
These flags are defined but never passed to `ctrlapi.Config`:
- `acme-enabled`, `acme-email-domain` (lines 74-75)
- `default-domain`, `regional-apex-domain` (lines 77-78)
- `clickhouse-url` (lines 85-86)

Either remove them or add corresponding fields to `ctrlapi.Config` and wire them in `apiAction`.
♻️ Remove unused flags
```diff
-	cli.Bool("acme-enabled", "Enable Let's Encrypt for acme challenges", cli.EnvVar("UNKEY_ACME_ENABLED")),
-	cli.String("acme-email-domain", "Domain for ACME registration emails (workspace_id@domain)", cli.Default("unkey.com"), cli.EnvVar("UNKEY_ACME_EMAIL_DOMAIN")),
-
-	cli.String("default-domain", "Default domain for auto-generated hostnames", cli.Default("unkey.app"), cli.EnvVar("UNKEY_DEFAULT_DOMAIN")),
-	cli.String("regional-apex-domain", "Apex domain for cross-region frontline communication (e.g., unkey.cloud). Certs are provisioned for *.{region}.{regional-apex-domain}", cli.EnvVar("UNKEY_REGIONAL_APEX_DOMAIN")),
-
 	// Restate Configuration
 	cli.String("restate-url", "URL of the Restate ingress endpoint for invoking workflows. Example: http://restate:8080", cli.Default("http://restate:8080"), cli.EnvVar("UNKEY_RESTATE_INGRESS_URL")),
 	cli.String("restate-api-key", "API key for Restate ingress requests", cli.EnvVar("UNKEY_RESTATE_API_KEY")),
-
-	cli.String("clickhouse-url", "ClickHouse connection string for analytics. Recommended for production. Example: clickhouse://user:pass@host:9000/unkey",
-		cli.EnvVar("UNKEY_CLICKHOUSE_URL")),
```
Summary
Splits the `ctrl` service into two independently deployable components: an `api` server and a Restate `worker`.
Why
The monolithic ctrl service combined request handling with long-running workflows, making it difficult to scale and deploy independently. Separating them allows each component to be deployed and scaled on its own; in particular, the worker can be deployed via Restate's operator, independent of the control plane.
Changes
Architecture
- `cmd/ctrl` CLI with `api` and `worker` subcommands
- `svc/ctrl/api/` - API server configuration and startup
- `svc/ctrl/worker/` - Restate worker with sub-packages:
  - `certificate/` - ACME certificate management workflows
  - `deploy/` - Deployment, promotion, rollback workflows
  - `routing/` - Frontline route assignment
  - `versioning/` - Per-region version tracking

New Packages
- `svc/ctrl/pkg/build/` - Depot.dev container build integration
- `svc/ctrl/pkg/s3/` - S3-compatible storage with Docker networking support
- `svc/ctrl/pkg/hash/` - Content hashing utilities

API Changes
- `svc/api/routes/v2_deploy_generate_upload_url/` - New endpoint for deployment uploads
- `svc/ctrl/services/deployment/` - Deployment service with S3 upload URL generation

Infrastructure
- `ctrl-api.yaml` and `ctrl-worker.yaml` manifests

Documentation
Comprehensive doc.go and symbol documentation added to all new packages following the documentation guidelines.
Testing
Existing integration tests pass. New endpoint has test coverage for 200/400/401/403/404 responses.