feat: spin down prod deployments after 30min #4974
📝 Walkthrough

Renamed DeploymentService to DeployService across protos, clients, handlers, and docs; added a new hydra.v1 DeployService proto; introduced a Restate-backed Deployment virtual-object package with nonce-based schedule/change/clear RPCs; updated workflows to call ScheduleDesiredStateChange for idle-preview standby transitions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant DeployWF as DeployService (Workflow)
    participant Routing as RoutingService
    participant DeploymentVO as DeploymentService (VirtualObject)
    participant DB
    Caller->>DeployWF: Deploy(request)
    activate DeployWF
    DeployWF->>DeployWF: Build / provision / health checks
    DeployWF->>Routing: Update frontline routes
    DeployWF->>DB: Update live_deployment_id
    alt Previous live exists
        DeployWF->>DeploymentVO: ScheduleDesiredStateChange(previous_id, delay=30m, STANDBY)
        activate DeploymentVO
        DeploymentVO->>DeploymentVO: Store transition + nonce (restate)
        DeploymentVO->>DeploymentVO: Enqueue delayed ChangeDesiredState
        deactivate DeploymentVO
    end
    DeployWF->>Caller: DeployResponse
    deactivate DeployWF
    par Delayed transition (after 30m)
        DeploymentVO->>DeploymentVO: ChangeDesiredState(nonce)
        activate DeploymentVO
        DeploymentVO->>DB: Update desired_state=STANDBY, version++
        DeploymentVO->>DeploymentVO: Clear stored transition
        deactivate DeploymentVO
    end
```

```mermaid
sequenceDiagram
    participant Caller
    participant DeployWF as DeployService (Workflow)
    participant DeploymentVO as DeploymentService (VirtualObject)
    participant Routing as RoutingService
    participant DB
    Caller->>DeployWF: Promote(target_deployment_id)
    activate DeployWF
    DeployWF->>DeploymentVO: ClearScheduledStateChanges(target_deployment_id)
    activate DeploymentVO
    DeploymentVO->>DeploymentVO: Clear pending transition
    deactivate DeploymentVO
    DeployWF->>Routing: Update frontline routes -> target_deployment_id
    DeployWF->>DB: Update live_deployment_id
    DeployWF->>DeploymentVO: ScheduleDesiredStateChange(previous_id, delay=30m, STANDBY)
    activate DeploymentVO
    DeploymentVO->>DeploymentVO: Store transition + nonce
    DeploymentVO->>DeploymentVO: Enqueue delayed ChangeDesiredState
    deactivate DeploymentVO
    DeployWF->>Caller: PromoteResponse
    deactivate DeployWF
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 6
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
web/apps/engineering/content/docs/architecture/workflows/creating-services.mdx (1)

114-129: ⚠️ Potential issue | 🟡 Minor — Import path in example may be stale.

The import on line 118 references `svc/ctrl/workflows/yourservice`, but the documented service location on line 63 is `svc/ctrl/worker/yourservice/`. Consider updating the example import to match:

```go
import (
    "github.com/unkeyed/unkey/svc/ctrl/worker/yourservice"
)
```

web/apps/engineering/content/docs/architecture/workflows/github-deployments.mdx (2)

103-127: ⚠️ Potential issue | 🟡 Minor — Deploy workflow diagram omits standby scheduling.

The current flow schedules the previous live deployment to standby after UpdateLive; add that step for accuracy.

321-330: ⚠️ Potential issue | 🟡 Minor — Proto location for DeployRequest likely outdated.

If the workflow RPCs moved to deploy.proto after the split, update this location block.
🤖 Fix all issues with AI agents
In `@svc/ctrl/worker/deploy/deploy_handler.go`:
- Around line 421-433: The ScheduleDesiredStateChange call that moves
previousLiveDeploymentID to standby is unguarded and runs for preview deploys;
wrap this block with the same production/rolled-back condition used when
updating the live pointer so only real production live updates trigger
scheduling. Concretely, before calling
hydrav1.NewDeploymentServiceClient(...).ScheduleDesiredStateChange(), check the
same production/rolled-back predicate used in the live-pointer update logic
(reuse that predicate or helper), and only run the ScheduleDesiredStateChange
request (and restate.WithIdempotencyKey(deployment.ID)) when that predicate
indicates a production non-rolled-back live update.
In `@svc/ctrl/worker/deployment/clear_scheduled.go`:
- Around line 8-12: The comment on ClearScheduledStateChanges incorrectly states
that a delayed ChangeDesiredState call will "return a terminal error" when the
transition is nil; in reality ChangeDesiredState (see deployment_state.go - the
branch that checks t == nil) returns an empty ChangeDesiredStateResponse and nil
error (a no-op success). Update the ClearScheduledStateChanges comment to
describe the actual behavior: that any delayed ChangeDesiredState will see a nil
transition and be treated as a no-op (successful no-op) rather than producing a
terminal error, and optionally mention that callers should treat it as a
harmless no-op.
In `@svc/ctrl/worker/run.go`:
- Around line 171-173: The call to deployment.New passes a Config with the wrong
field name `DB`, causing a compile error; update the struct literal in the
restateSrv.Bind invocation (where
hydrav1.NewDeploymentServiceServer(deployment.New(...)) is called) to use the
correct field name `Database: database` matching deployment.Config.Database, so
deployment.New(deployment.Config{Database: database}) compiles.
In `@web/apps/engineering/content/docs/architecture/services/ctrl/index.mdx`:
- Around line 22-24: The docs list only DeployService which is ambiguous after
the new split; update the service list text to also mention the new
DeploymentService virtual object and briefly state its role (handling
desired-state mutations) alongside DeployService (which manages deployment
creation/management workflows); locate and edit the paragraph that currently
enumerates `ClusterService`, `BuildService`, `DeployService`, `AcmeService`,
`OpenApiService`, and `CtrlService` to insert `DeploymentService` with a short
parenthetical or clause clarifying it is the virtual object for desired-state
mutations while `DeployService` continues to orchestrate deployments.
In `@web/apps/engineering/content/docs/architecture/workflows/index.mdx`:
- Around line 17-20: The example text incorrectly states the key/service:
replace the reference to DeployService keyed by project_id with the correct
mapping between the workflow and the virtual object — mention that DeployService
is a workflow keyed by deployment/workflow ID and the serialized virtual object
is DeploymentService keyed by deployment_id (not project_id); update the
sentence in the Virtual Objects section to use DeploymentService and
deployment_id so the example accurately reflects that only one deployment per
deployment_id runs at a time and remove the incorrect project_id implication.
- Around line 31-36: Update the DeployService metadata block so the Proto points
to the post-split deploy proto and the Key reflects the workflow/deployment
identifier: change the Proto value to svc/ctrl/proto/hydra/v1/deploy.proto and
set Key to workflow_id (the workflow/deployment ID, not project_id) in the
DeployService section for DeployService (Location: svc/ctrl/worker/deploy/;
Operations remain Deploy, Rollback, Promote, ScaleDownIdlePreviewDeployments).
🧹 Nitpick comments (5)
svc/ctrl/worker/deploy/rollback_handler.go (1)
91-95: Wrap error for consistency with other service calls.

Other service/query errors in this handler include context (e.g., line 133: "failed to switch frontlineRoutes"). Wrapping here aids debugging.

Suggested change:

```diff
 // ensure the rolled back deployment does not get spun down from existing scheduled actions
 _, err = hydrav1.NewDeploymentServiceClient(ctx, targetDeployment.ID).ClearScheduledStateChanges().Request(&hydrav1.ClearScheduledStateChangesRequest{})
 if err != nil {
-	return nil, err
+	return nil, fmt.Errorf("failed to clear scheduled state changes: %w", err)
 }
```

svc/ctrl/proto/hydra/v1/deploy.proto (1)
78-79: Consider removing the unused `key_auth_id` field.

The TODO indicates this field is unused. Keeping deprecated proto fields can lead to confusion. If removal is deferred, consider marking it `reserved` or using the `deprecated` option.

svc/ctrl/proto/hydra/v1/deployment.proto (1)
38-43: Consider adding a note about internal-only invocation in the proto documentation.

The comment states `ChangeDesiredState` is internal, but it's a public RPC. While Restate's delayed calls require the RPC to be public, you might want to document that external callers should not invoke it directly (e.g., via a convention like an `_internal` suffix or explicit documentation in the request message).

svc/ctrl/worker/deploy/promote_handler.go (1)
128-132: Error from `Send()` is silently ignored.

The `ScheduleDesiredStateChange().Send()` call returns an error that isn't checked. If scheduling fails, the old deployment won't spin down, and there's no indication of failure. Consider handling the error or at minimum logging it:

♻️ Suggested improvement:

```diff
 // schedule old deployment to be spun down
-hydrav1.NewDeploymentServiceClient(ctx, project.LiveDeploymentID.String).ScheduleDesiredStateChange().Send(&hydrav1.ScheduleDesiredStateChangeRequest{
+_, err = hydrav1.NewDeploymentServiceClient(ctx, project.LiveDeploymentID.String).ScheduleDesiredStateChange().Send(&hydrav1.ScheduleDesiredStateChangeRequest{
 	State:       hydrav1.DeploymentDesiredState_DEPLOYMENT_DESIRED_STATE_STANDBY,
 	DelayMillis: (30 * time.Minute).Milliseconds(),
 })
+if err != nil {
+	logger.Warn("failed to schedule standby for previous live deployment",
+		"deployment_id", project.LiveDeploymentID.String,
+		"error", err)
+	// Non-fatal: promotion succeeded, just won't auto-spindown
+}
```

svc/ctrl/worker/deploy/deploy_handler.go (1)
327-335: TODO: source type hardcoded.
Routing metadata will always show CLI_UPLOAD even for Git/Docker. Want me to draft a source-type mapping?
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@svc/ctrl/worker/deployment/deployment_state.go`:
- Around line 28-37: ScheduleDesiredStateChange currently persists the incoming
req.GetState() without validation; before calling restate.Set (and before
creating transition/nonce), validate that req.GetState() maps to a legal domain
state (e.g., parse/convert the raw enum into your domain ADT or check against
the allowed enum values in hydrav1), return an appropriate error if invalid, and
only construct the transition struct and call restate.Set with transitionKey
once the state is validated/converted. Target symbols:
ScheduleDesiredStateChange, req.GetState(), transition (Nonce/To), restate.Set,
transitionKey.
🧹 Nitpick comments (1)
svc/ctrl/worker/deploy/deploy_handler.go (1)
334-336: Consider tracking the hardcoded source type as technical debt.

Static analysis flagged this TODO. The source type should reflect the actual source (`GitSource` vs `DockerImage`). Would you like me to open an issue to track fixing this?
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@svc/ctrl/worker/deploy/scale_down_idle_preview_deployments.go`:
- Around line 17-22: The comment for ScaleDownIdlePreviewDeployments is
inconsistent: it says idle deployments are transitioned to "archived" but the
implementation sets the deployment status to STANDBY; update the documentation
to reflect the actual behavior or change the code to archive if that was
intended. Locate the ScaleDownIdlePreviewDeployments declaration and the logic
that sets the status to STANDBY and either (a) change the wording in the
function comment to say "transitions idle deployments to STANDBY" or (b) if the
desired state is archived, replace the STANDBY assignment with the
ARCHIVED/archived state symbol used elsewhere; ensure the comment and the code
use the same unique symbol (STANDBY or ARCHIVED) consistently.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@svc/ctrl/proto/hydra/v1/deploy.proto`:
- Around line 17-20: The top-of-file comment incorrectly references
SetDeploymentDesiredState (which is not defined in this service); update the
comment to refer to the DeploymentService desired-state RPCs instead and keep
the description of ScaleDownIdlePreviewDeployments intact. Specifically, in the
comment block that mentions ScaleDownIdlePreviewDeployments and
SetDeploymentDesiredState, remove or replace the SetDeploymentDesiredState
reference with a phrase like "DeploymentService desired-state RPCs" so readers
are directed to the correct RPC surface (e.g., DeploymentService) while
preserving the cron/idle/standby explanation.
In `@svc/ctrl/worker/deploy/deploy_handler.go`:
- Around line 364-365: buildDomains currently hardcodes the source type to
ctrlv1.SourceType_SOURCE_TYPE_CLI_UPLOAD which mislabels DockerImage/Git
deployments; update buildDomains to accept and use the actual source type from
the incoming request or the deployment object (e.g., use req.SourceType or
deployment.SourceType) instead of the hardcoded
ctrlv1.SourceType_SOURCE_TYPE_CLI_UPLOAD, and propagate that value into the call
site so DockerImage/Git paths are labeled correctly.
In `@svc/ctrl/worker/run.go`:
- Around line 181-183: The DeploymentServiceServer bind call currently exposes
state-mutating endpoints without ingress restrictions; update the
restateSrv.Bind invocation that wraps
hydrav1.NewDeploymentServiceServer(deployment.New(deployment.Config{DB:
database})) to include the same ingress restriction used by other services
(e.g., restate.WithIngressPrivate(true)), so the DeploymentServiceServer is
bound with restate.WithIngressPrivate(true) to prevent public access to mutation
methods.
🧹 Nitpick comments (1)
svc/ctrl/worker/deploy/deploy_handler.go (1)
41-44: Doc note: the replicas default isn’t always 1.

`RegionConfig` overrides the replica count. Update the comment to avoid a mismatch.

Proposed minimal edit:

```diff
-// The workflow creates deployment topologies for all configured regions, each with
-// a version obtained from VersioningService and 1 desired replica. Sentinel
+// The workflow creates deployment topologies for all configured regions, each with
+// a version obtained from VersioningService and desired replicas from runtime
+// settings (default 1). Sentinel
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@svc/ctrl/integration/harness/harness.go`:
- Around line 183-185: The config field passed to deployment.New uses the wrong
key name; update the struct literal for deployment.Config in harness.go to use
the expected field name Database (matching deployment.Config's Database
db.Database) instead of DB so it compiles; ensure the value you pass (database)
matches the type expected by deployment.Config's Database field.
🧹 Nitpick comments (3)
svc/ctrl/api/deployment_integration_test.go (1)
28-33: Unused `requests` field in `mockDeploymentService`.

The `DeploymentService` RPCs (`ScheduleDesiredStateChange`, `ChangeDesiredState`, `ClearScheduledStateChanges`) don't use `DeployRequest`. This field appears to be copy-pasted from `mockDeployService` and is never used.

♻️ Proposed cleanup:

```diff
 type mockDeploymentService struct {
 	hydrav1.UnimplementedDeploymentServiceServer
-	requests chan *hydrav1.DeployRequest
 }
```

svc/ctrl/worker/deploy/deploy_handler.go (1)
364-365: Track TODO for source type handling.

The hardcoded `SOURCE_TYPE_CLI_UPLOAD` doesn't reflect the actual source (`GitSource` vs `DockerImage`). This affects domain generation accuracy. Would you like me to open an issue to track deriving the correct `SourceType` from `req.GetSource()`?

svc/ctrl/proto/hydra/v1/deploy.proto (1)
77-78: Track removal of unused field.

The `key_auth_id` field is marked as unused. Consider removing it in a follow-up to avoid confusion.
```go
deploymentSvc := deployment.New(deployment.Config{
	DB: database,
})
```
Config field name mismatch causes compilation error.
Per the relevant snippet from `svc/ctrl/worker/deployment/service.go` lines 19-22, `deployment.Config` expects `Database`, not `DB`:

```go
Config struct {
	Database db.Database
	...
}
```

🐛 Proposed fix:

```diff
 deploymentSvc := deployment.New(deployment.Config{
-	DB: database,
+	Database: database,
 })
```

🤖 Prompt for AI Agents
In `@svc/ctrl/integration/harness/harness.go` around lines 183 - 185, The config
field passed to deployment.New uses the wrong key name; update the struct
literal for deployment.Config in harness.go to use the expected field name
Database (matching deployment.Config's Database db.Database) instead of DB so it
compiles; ensure the value you pass (database) matches the type expected by
deployment.Config's Database field.
Automatically set production deployments to standby 30 minutes after they are superseded by a newer deploy, reducing idle resource consumption. Previously, old deployments kept running indefinitely.
Architecture
This PR introduces a new Restate virtual object (`DeploymentService`) and splits the existing deployment proto into two services:

- `DeployService` (Restate Workflow) — orchestrates the full deployment lifecycle: build, provision, health-check, route, promote, rollback. Keyed by a caller-supplied workflow ID. This is the renamed former `DeploymentService`.
- `DeploymentService` (Restate Virtual Object) — serializes all desired-state mutations for a single deployment, keyed by deployment ID. Handles scheduling standby/archive transitions with a nonce-based last-writer-wins mechanism.

The split exists because multiple callers need to mutate a deployment's desired state (deploy workflow, preview cron, promote, rollback). The virtual object key guarantees sequential processing per deployment without external locks.
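The post-split service surface can be sketched as follows. RPC names are the ones this PR introduces, but the message shapes are illustrative (fields elided), not the actual proto files:

```proto
syntax = "proto3";

package hydra.v1;

// Restate Workflow: one run per caller-supplied workflow ID.
service DeployService {
  rpc Deploy(DeployRequest) returns (DeployResponse);
  rpc Rollback(RollbackRequest) returns (RollbackResponse);
  rpc Promote(PromoteRequest) returns (PromoteResponse);
  rpc ScaleDownIdlePreviewDeployments(ScaleDownIdlePreviewDeploymentsRequest)
      returns (ScaleDownIdlePreviewDeploymentsResponse);
}

// Restate Virtual Object: keyed by deployment ID. Calls for the same key
// run sequentially, which serializes all desired-state mutations.
service DeploymentService {
  rpc ScheduleDesiredStateChange(ScheduleDesiredStateChangeRequest)
      returns (ScheduleDesiredStateChangeResponse);
  // Invoked by the enqueued delayed call; external callers should not use it.
  rpc ChangeDesiredState(ChangeDesiredStateRequest)
      returns (ChangeDesiredStateResponse);
  rpc ClearScheduledStateChanges(ClearScheduledStateChangesRequest)
      returns (ClearScheduledStateChangesResponse);
}
```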
How Production Spin-Down Works
After a successful production deploy, the `DeployService` workflow calls `ScheduleDesiredStateChange(standby, 30min)` on the previous live deployment's virtual object. The flow:

1. `ScheduleDesiredStateChange` generates a unique nonce, stores `{nonce, target_state}` in Restate state, and enqueues a delayed `ChangeDesiredState` call.
2. After the delay, `ChangeDesiredState` fires, verifies the nonce still matches the stored transition, and writes `desired_state = standby` to the database.

Last-Writer-Wins via Nonce
Restate delayed calls cannot be cancelled, so the nonce provides conflict resolution:

- A newer `ScheduleDesiredStateChange` overwrites the stored nonce, so an earlier delayed call finds a mismatched nonce and no-ops (last writer wins).
- `ClearScheduledStateChanges` deletes the stored transition. The delayed call still fires but finds nothing to match against and no-ops.

Key Changes
New:
`svc/ctrl/worker/deployment/`

A new Restate virtual object with three RPCs:

- `ScheduleDesiredStateChange` — records a delayed state transition with nonce-based last-writer-wins.
- `ChangeDesiredState` — internal handler that fires after the delay; no-ops on nonce mismatch, otherwise writes the new `desired_state` to the DB.
- `ClearScheduledStateChanges` — cancels pending transitions by deleting the stored transition record.

Modified:
`svc/ctrl/worker/deploy/`

Renamed `ScaleDownIdleDeployments` to `ScaleDownIdlePreviewDeployments`; now transitions idle previews to `archived` via the virtual object instead of directly.

Proto Split
`deployment.proto` → `deploy.proto` (workflow RPCs: Deploy, Rollback, Promote, ScaleDownIdlePreviewDeployments) + `deployment.proto` (virtual object RPCs: ScheduleDesiredStateChange, ChangeDesiredState, ClearScheduledStateChanges).
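The nonce-based last-writer-wins mechanism described above can be sketched in plain Go. This is a minimal in-memory simulation, not the real implementation: the actual code stores the transition in Restate virtual-object state and the delayed call is enqueued by Restate, while here a struct field and explicit method calls stand in for both.

```go
package main

import "fmt"

// transition mirrors the {nonce, target_state} record stored per deployment.
type transition struct {
	Nonce int
	To    string // target desired state, e.g. "standby"
}

// deploymentVO simulates one virtual-object instance (one deployment ID).
type deploymentVO struct {
	pending      *transition // stored transition; nil once cleared or applied
	nextNonce    int
	desiredState string
}

// Schedule records a new transition. A later Schedule overwrites the stored
// nonce, which invalidates any earlier delayed call (last writer wins).
func (d *deploymentVO) Schedule(to string) int {
	d.nextNonce++
	d.pending = &transition{Nonce: d.nextNonce, To: to}
	return d.nextNonce
}

// Clear deletes the stored transition; pending delayed calls then no-op.
func (d *deploymentVO) Clear() { d.pending = nil }

// ChangeDesiredState is what the delayed call invokes after the delay:
// apply only if the nonce still matches the stored transition.
func (d *deploymentVO) ChangeDesiredState(nonce int) bool {
	if d.pending == nil || d.pending.Nonce != nonce {
		return false // superseded or cleared: harmless no-op
	}
	d.desiredState = d.pending.To
	d.pending = nil
	return true
}

func main() {
	vo := &deploymentVO{desiredState: "running"}
	first := vo.Schedule("standby")
	second := vo.Schedule("standby") // supersedes the first schedule
	fmt.Println(vo.ChangeDesiredState(first))  // false: nonce superseded
	fmt.Println(vo.ChangeDesiredState(second)) // true: applied
	fmt.Println(vo.desiredState)               // standby
}
```

The same shape explains `ClearScheduledStateChanges` during promote/rollback: `Clear()` before the delayed call fires makes `ChangeDesiredState` a successful no-op rather than an error.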