add auto-stop for idle workspaces#181
Conversation
📝 WalkthroughWalkthroughAdds per-workspace and global auto-pause configuration, workspace activity tracking, CLI and API surface changes, and controller logic that probes running workspace pods (via pods/exec) to update activity timestamps and auto-pause idle workspaces. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Controller
participant K8sAPI as K8s API
participant Pod
User->>K8sAPI: Create/Update Workspace (autoPauseTimeoutMinutes)
K8sAPI-->>Controller: Workspace event
Controller->>K8sAPI: Get Workspace resource
K8sAPI-->>Controller: WorkspaceSpec + Status
rect rgba(100,150,200,0.5)
Note over Controller: Auto-pause check (only when Running)
Controller->>Controller: Compute effective timeout
Controller->>Controller: Calculate idle duration (now - lastActivityTime)
end
alt Idle duration < timeout
Controller->>K8sAPI: Exec request (pods/exec) -> probe activity
K8sAPI->>Pod: Exec probe
Pod-->>K8sAPI: Probe result
K8sAPI-->>Controller: Activity detected
Controller->>K8sAPI: Patch Status.LastActivityTime
Controller->>Controller: Requeue for next check
else Idle duration >= timeout
Controller->>K8sAPI: Patch Spec.Stopped = true
K8sAPI-->>K8sAPI: Workspace transitions to Stopped
Controller->>K8sAPI: Update Status message (auto-paused)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
api/v1alpha1/zz_generated.deepcopy.go (1)
1-28:⚠️ Potential issue | 🟡 MinorGenerated file should not be committed.
Based on learnings:
api/v1alpha1/zz_generated.deepcopy.gois explicitly listed in.gitignoreand regenerated at deploy/build time viamake generate manifests. This file should be removed from the PR to follow repository conventions.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/v1alpha1/zz_generated.deepcopy.go` around lines 1 - 28, Remove the generated file zz_generated.deepcopy.go from this PR/branch (do not commit generated artifacts); delete the file (so it no longer appears in the commit) and rely on the repo’s generation step (make generate manifests) to recreate it during build/deploy; confirm package v1alpha1 files are untouched aside from removing this generated file and do not alter .gitignore which already lists zz_generated.deepcopy.go.
🧹 Nitpick comments (1)
internal/controller/workspace/controller.go (1)
443-450: Two-patch approach may cause inconsistent state on partial failure.The auto-pause logic patches
Spec.Stopped = truefirst, then separately callssetStatusto update the status. If the spec patch succeeds butsetStatusfails, the workspace is stopped but the status message/phase is stale. The next reconcile will fix this, but consider combining into an atomic operation or handling the error more explicitly.Additionally, the
setStatuscall uses the in-memorywsobject which still has the oldresourceVersionfrom before the spec patch. This could fail with a conflict error if another actor (liketouchWorkspaceActivity) modified the resource.Consider re-fetching after spec patch
specPatch := client.MergeFrom(ws.DeepCopy()) ws.Spec.Stopped = true if err := r.Patch(ctx, ws, specPatch); err != nil { return ctrl.Result{}, fmt.Errorf("failed to auto-pause workspace: %w", err) } + // Re-fetch to get updated resourceVersion for status patch + if err := r.Get(ctx, client.ObjectKeyFromObject(ws), ws); err != nil { + return ctrl.Result{}, err + } + msg := fmt.Sprintf("Auto-paused after %s of inactivity", idleDuration.Truncate(time.Minute)) return ctrl.Result{}, r.setStatus(ctx, ws, "Stopped", msg)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/workspace/controller.go` around lines 443 - 450, The current two-step auto-pause can leave Spec updated but Status stale or conflict because ws still holds the old resourceVersion; after successfully calling r.Patch(ctx, ws, specPatch) (the specPatch/ r.Patch/ ws code path), re-fetch the latest workspace into ws (e.g., r.Get(ctx, namespacedName, ws) or load a fresh copy) to refresh resourceVersion and then call r.setStatus (or r.Status().Patch) using a new patch built from that fresh copy; also handle setStatus conflicts by retrying (or requeueing) and consider reverting the spec change or requeuing on persistent failures so state stays consistent (references: specPatch, r.Patch, setStatus, touchWorkspaceActivity, resourceVersion).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/v1alpha1/operatorconfig_types.go`:
- Around line 418-425: The accessor GetAutoPauseTimeoutMinutes currently treats
any non-positive AutoPauseTimeoutMinutes as missing and returns
DefaultAutoPauseTimeoutMinutes, which prevents an explicit 0 from disabling
auto-pause; update GetAutoPauseTimeoutMinutes (in WorkspacesConfig) to first
check for a non-nil receiver and return c.AutoPauseTimeoutMinutes when it is
explicitly set (including 0 to indicate disabled), and only return
DefaultAutoPauseTimeoutMinutes when the config is nil or the field is unset
(e.g., negative/absent) so that 0 properly disables auto-pause cluster-wide.
In `@internal/buildapi/workspace.go`:
- Around line 466-479: touchWorkspaceActivity currently patches
ws.Status.LastActivityTime using the original ws which can race with the
auto-pause controller; change the function to re-fetch the latest Workspace from
the API (using k8sClient.Get) immediately before creating the MergeFrom patch,
check that the fresh object's Spec.Stopped is false (abort if true), update the
fresh object's Status.LastActivityTime and then call k8sClient.Status().Patch
with the MergeFrom of that fresh copy, and handle/ignore conflict errors as
best-effort (do not blindly patch the stale ws or overwrite stopped state).
---
Outside diff comments:
In `@api/v1alpha1/zz_generated.deepcopy.go`:
- Around line 1-28: Remove the generated file zz_generated.deepcopy.go from this
PR/branch (do not commit generated artifacts); delete the file (so it no longer
appears in the commit) and rely on the repo’s generation step (make generate
manifests) to recreate it during build/deploy; confirm package v1alpha1 files
are untouched aside from removing this generated file and do not alter
.gitignore which already lists zz_generated.deepcopy.go.
---
Nitpick comments:
In `@internal/controller/workspace/controller.go`:
- Around line 443-450: The current two-step auto-pause can leave Spec updated
but Status stale or conflict because ws still holds the old resourceVersion;
after successfully calling r.Patch(ctx, ws, specPatch) (the specPatch/ r.Patch/
ws code path), re-fetch the latest workspace into ws (e.g., r.Get(ctx,
namespacedName, ws) or load a fresh copy) to refresh resourceVersion and then
call r.setStatus (or r.Status().Patch) using a new patch built from that fresh
copy; also handle setStatus conflicts by retrying (or requeueing) and consider
reverting the spec change or requeuing on persistent failures so state stays
consistent (references: specPatch, r.Patch, setStatus, touchWorkspaceActivity,
resourceVersion).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 263a22a6-4145-49d0-a7e6-b4dcace49fd6
⛔ Files ignored due to path filters (2)
config/crd/bases/automotive.sdv.cloud.redhat.com_operatorconfigs.yamlis excluded by!config/crd/bases/**config/crd/bases/automotive.sdv.cloud.redhat.com_workspaces.yamlis excluded by!config/crd/bases/**
📒 Files selected for processing (8)
api/v1alpha1/operatorconfig_types.goapi/v1alpha1/workspace_types.goapi/v1alpha1/zz_generated.deepcopy.gocmd/caib/workspace/workspace.gocmd/main.gointernal/buildapi/workspace.gointernal/controller/workspace/controller.gointernal/controller/workspace/controller_test.go
| // GetAutoPauseTimeoutMinutes returns the global auto-pause timeout in minutes. | ||
| // Returns DefaultAutoPauseTimeoutMinutes (30) when not configured. | ||
| func (c *WorkspacesConfig) GetAutoPauseTimeoutMinutes() int32 { | ||
| if c != nil && c.AutoPauseTimeoutMinutes > 0 { | ||
| return c.AutoPauseTimeoutMinutes | ||
| } | ||
| return DefaultAutoPauseTimeoutMinutes | ||
| } |
There was a problem hiding this comment.
Potential semantic inconsistency: 0 should disable auto-pause, not fall back to default.
The workspace spec documents that 0 = disable auto-pause, but this accessor returns DefaultAutoPauseTimeoutMinutes when the value is <= 0. This means setting WorkspacesConfig.AutoPauseTimeoutMinutes = 0 at the cluster level won't actually disable auto-pause globally.
If 0 should disable auto-pause cluster-wide (consistent with per-workspace behavior), consider:
Suggested fix
func (c *WorkspacesConfig) GetAutoPauseTimeoutMinutes() int32 {
- if c != nil && c.AutoPauseTimeoutMinutes > 0 {
+ if c != nil && c.AutoPauseTimeoutMinutes != 0 {
return c.AutoPauseTimeoutMinutes
}
return DefaultAutoPauseTimeoutMinutes
}Alternatively, if 0 at the cluster level should not disable but only per-workspace 0 should disable, document this distinction clearly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/v1alpha1/operatorconfig_types.go` around lines 418 - 425, The accessor
GetAutoPauseTimeoutMinutes currently treats any non-positive
AutoPauseTimeoutMinutes as missing and returns DefaultAutoPauseTimeoutMinutes,
which prevents an explicit 0 from disabling auto-pause; update
GetAutoPauseTimeoutMinutes (in WorkspacesConfig) to first check for a non-nil
receiver and return c.AutoPauseTimeoutMinutes when it is explicitly set
(including 0 to indicate disabled), and only return
DefaultAutoPauseTimeoutMinutes when the config is nil or the field is unset
(e.g., negative/absent) so that 0 properly disables auto-pause cluster-wide.
| // touchWorkspaceActivity updates LastActivityTime on the workspace status. | ||
| // Called from handlers that represent actual workspace usage (exec, shell, sync, deploy) | ||
| // so the auto-pause controller knows the workspace is in use. | ||
| func (a *APIServer) touchWorkspaceActivity(c *gin.Context, ws *automotivev1alpha1.Workspace) { | ||
| k8sClient, err := getClientFromRequest(c) | ||
| if err != nil { | ||
| return // best-effort, don't fail the operation | ||
| } | ||
|
|
||
| now := metav1.Now() | ||
| patch := client.MergeFrom(ws.DeepCopy()) | ||
| ws.Status.LastActivityTime = &now | ||
| _ = k8sClient.Status().Patch(c.Request.Context(), ws, patch) | ||
| } |
There was a problem hiding this comment.
Race condition between activity touch and auto-pause controller.
This function patches LastActivityTime using a stale ws object (fetched at handler entry). If called after the controller decides to auto-pause but before setStatus clears LastActivityTime, this patch may:
- Fail with a conflict error (silently ignored)
- Succeed and set activity time on a workspace that's already being stopped
While the best-effort approach is reasonable for activity tracking, consider re-fetching the workspace before patching, or at minimum checking that ws.Spec.Stopped is still false:
Suggested improvement
func (a *APIServer) touchWorkspaceActivity(c *gin.Context, ws *automotivev1alpha1.Workspace) {
+ // Don't touch activity on workspaces being stopped
+ if ws.Spec.Stopped {
+ return
+ }
k8sClient, err := getClientFromRequest(c)
if err != nil {
return // best-effort, don't fail the operation
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/buildapi/workspace.go` around lines 466 - 479,
touchWorkspaceActivity currently patches ws.Status.LastActivityTime using the
original ws which can race with the auto-pause controller; change the function
to re-fetch the latest Workspace from the API (using k8sClient.Get) immediately
before creating the MergeFrom patch, check that the fresh object's Spec.Stopped
is false (abort if true), update the fresh object's Status.LastActivityTime and
then call k8sClient.Status().Patch with the MergeFrom of that fresh copy, and
handle/ignore conflict errors as best-effort (do not blindly patch the stale ws
or overwrite stopped state).
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/controller/workspace/controller.go (1)
366-375:⚠️ Potential issue | 🟡 MinorDon't let the no-op fast path skip the new activity reset.
LastActivityTimeis not part of the equality guard. If it gets re-populated after a workspace is alreadyStopped,Pending, orCreating, latersetStatuscalls return early and never clear it.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/workspace/controller.go` around lines 366 - 375, The early-return equality check (comparing ws.Status.Phase, ws.Status.Message and ws.Status.PodName) can skip clearing ws.Status.LastActivityTime; move the LastActivityTime reset logic so it executes before the no-op fast path or include LastActivityTime in the equality comparison in the setStatus block (the code touching ws.Status.Phase, ws.Status.Message, ws.Status.PodName and ws.Status.LastActivityTime). Concretely, ensure that when phase is "Stopped"|"Pending"|"Creating" you set ws.Status.LastActivityTime = nil prior to returning early in the no-op path (or update the equality guard to also compare LastActivityTime) so subsequent calls cannot leave LastActivityTime incorrectly populated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/buildapi/workspace.go`:
- Around line 943-950: The code currently prints the compiled
DefaultAutoPauseTimeoutMinutes whenever ws.Spec.AutoPauseTimeoutMinutes is nil,
but nil means "inherit from OperatorConfig"; update the nil branch to read the
operator-level setting instead: fetch the OperatorConfig's
AutoPauseTimeoutMinutes (e.g. via your reconciler/helper that exposes
operatorConfig or a getOperatorConfig call), then if that operator value is nil
fall back to automotivev1alpha1.DefaultAutoPauseTimeoutMinutes, if the operator
value is 0 print "disabled", otherwise print the operator value with "%dm". Use
the same semantics for 0/ptr vs nil as with ws.Spec.AutoPauseTimeoutMinutes
(reference symbols: ws.Spec.AutoPauseTimeoutMinutes,
automotivev1alpha1.DefaultAutoPauseTimeoutMinutes, and
OperatorConfig.AutoPauseTimeoutMinutes or your reconciler method that returns
it).
In `@internal/controller/workspace/controller.go`:
- Around line 390-397: The current lookup of OperatorConfig (r.Get(..., oc))
treats any error as “not configured” and returns DefaultAutoPauseTimeoutMinutes;
change the logic to only fall back to the default when errors.IsNotFound(err) is
true, and for any other error propagate the error back to the caller so the
reconcile loop can retry; specifically, update the block that calls r.Get(ctx,
client.ObjectKey{Name: "config", Namespace: ws.Namespace}, oc) to check
errors.IsNotFound(err) and return the default only in that case, and otherwise
return/forward the Get error instead of silently using
DefaultAutoPauseTimeoutMinutes (preserve use of
oc.Spec.Workspaces.GetAutoPauseTimeoutMinutes() when oc is found).
- Around line 38-45: The clientset initialization error is currently stored in a
local initErr inside getClientset(), so if the once initializer fails subsequent
calls return (nil, nil) and cause panics; add a new Reconciler field (e.g.,
clientsetInitErr error), set that field from within the sync.Once initializer in
getClientset(), and change getClientset() to return (r.clientset,
r.clientsetInitErr); update any callers (like isWorkspaceActive()) to handle the
returned error. This ensures the initialization error is persisted across calls
and prevents nil clientset usage.
---
Outside diff comments:
In `@internal/controller/workspace/controller.go`:
- Around line 366-375: The early-return equality check (comparing
ws.Status.Phase, ws.Status.Message and ws.Status.PodName) can skip clearing
ws.Status.LastActivityTime; move the LastActivityTime reset logic so it executes
before the no-op fast path or include LastActivityTime in the equality
comparison in the setStatus block (the code touching ws.Status.Phase,
ws.Status.Message, ws.Status.PodName and ws.Status.LastActivityTime).
Concretely, ensure that when phase is "Stopped"|"Pending"|"Creating" you set
ws.Status.LastActivityTime = nil prior to returning early in the no-op path (or
update the equality guard to also compare LastActivityTime) so subsequent calls
cannot leave LastActivityTime incorrectly populated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3a9369c7-7adc-4f89-97d6-d753eea2369e
⛔ Files ignored due to path filters (2)
config/crd/bases/automotive.sdv.cloud.redhat.com_operatorconfigs.yamlis excluded by!config/crd/bases/**config/crd/bases/automotive.sdv.cloud.redhat.com_workspaces.yamlis excluded by!config/crd/bases/**
📒 Files selected for processing (8)
api/v1alpha1/operatorconfig_types.goapi/v1alpha1/workspace_types.goapi/v1alpha1/zz_generated.deepcopy.gocmd/caib/workspace/workspace.gocmd/main.gointernal/buildapi/workspace.gointernal/controller/workspace/controller.gointernal/controller/workspace/controller_test.go
✅ Files skipped from review due to trivial changes (3)
- api/v1alpha1/workspace_types.go
- internal/controller/workspace/controller_test.go
- api/v1alpha1/zz_generated.deepcopy.go
🚧 Files skipped from review as they are similar to previous changes (3)
- cmd/caib/workspace/workspace.go
- cmd/main.go
- api/v1alpha1/operatorconfig_types.go
| var autoPauseTimeout string | ||
| switch { | ||
| case ws.Spec.AutoPauseTimeoutMinutes == nil: | ||
| autoPauseTimeout = fmt.Sprintf("default (%dm)", automotivev1alpha1.DefaultAutoPauseTimeoutMinutes) | ||
| case *ws.Spec.AutoPauseTimeoutMinutes == 0: | ||
| autoPauseTimeout = "disabled" | ||
| default: | ||
| autoPauseTimeout = fmt.Sprintf("%dm", *ws.Spec.AutoPauseTimeoutMinutes) |
There was a problem hiding this comment.
Don't render the compiled-in default as the effective timeout.
nil here means "inherit from OperatorConfig", but this branch always prints DefaultAutoPauseTimeoutMinutes. Any namespace with a custom global timeout — or a global disable — will return the wrong autoPauseTimeout in list/show responses.
Minimal safe fix
case ws.Spec.AutoPauseTimeoutMinutes == nil:
- autoPauseTimeout = fmt.Sprintf("default (%dm)", automotivev1alpha1.DefaultAutoPauseTimeoutMinutes)
+ autoPauseTimeout = "default"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| var autoPauseTimeout string | |
| switch { | |
| case ws.Spec.AutoPauseTimeoutMinutes == nil: | |
| autoPauseTimeout = fmt.Sprintf("default (%dm)", automotivev1alpha1.DefaultAutoPauseTimeoutMinutes) | |
| case *ws.Spec.AutoPauseTimeoutMinutes == 0: | |
| autoPauseTimeout = "disabled" | |
| default: | |
| autoPauseTimeout = fmt.Sprintf("%dm", *ws.Spec.AutoPauseTimeoutMinutes) | |
| var autoPauseTimeout string | |
| switch { | |
| case ws.Spec.AutoPauseTimeoutMinutes == nil: | |
| autoPauseTimeout = "default" | |
| case *ws.Spec.AutoPauseTimeoutMinutes == 0: | |
| autoPauseTimeout = "disabled" | |
| default: | |
| autoPauseTimeout = fmt.Sprintf("%dm", *ws.Spec.AutoPauseTimeoutMinutes) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/buildapi/workspace.go` around lines 943 - 950, The code currently
prints the compiled DefaultAutoPauseTimeoutMinutes whenever
ws.Spec.AutoPauseTimeoutMinutes is nil, but nil means "inherit from
OperatorConfig"; update the nil branch to read the operator-level setting
instead: fetch the OperatorConfig's AutoPauseTimeoutMinutes (e.g. via your
reconciler/helper that exposes operatorConfig or a getOperatorConfig call), then
if that operator value is nil fall back to
automotivev1alpha1.DefaultAutoPauseTimeoutMinutes, if the operator value is 0
print "disabled", otherwise print the operator value with "%dm". Use the same
semantics for 0/ptr vs nil as with ws.Spec.AutoPauseTimeoutMinutes (reference
symbols: ws.Spec.AutoPauseTimeoutMinutes,
automotivev1alpha1.DefaultAutoPauseTimeoutMinutes, and
OperatorConfig.AutoPauseTimeoutMinutes or your reconciler method that returns
it).
| oc := &automotivev1alpha1.OperatorConfig{} | ||
| if err := r.Get(ctx, client.ObjectKey{Name: "config", Namespace: ws.Namespace}, oc); err == nil { | ||
| if oc.Spec.Workspaces != nil { | ||
| return time.Duration(oc.Spec.Workspaces.GetAutoPauseTimeoutMinutes()) * time.Minute | ||
| } | ||
| } | ||
|
|
||
| return time.Duration(automotivev1alpha1.DefaultAutoPauseTimeoutMinutes) * time.Minute |
There was a problem hiding this comment.
Only fall back to the built-in timeout on NotFound.
This helper currently turns any r.Get error into DefaultAutoPauseTimeoutMinutes. A transient cache/API failure can therefore change the effective idle policy and auto-pause with the wrong timeout instead of retrying.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/controller/workspace/controller.go` around lines 390 - 397, The
current lookup of OperatorConfig (r.Get(..., oc)) treats any error as “not
configured” and returns DefaultAutoPauseTimeoutMinutes; change the logic to only
fall back to the default when errors.IsNotFound(err) is true, and for any other
error propagate the error back to the caller so the reconcile loop can retry;
specifically, update the block that calls r.Get(ctx, client.ObjectKey{Name:
"config", Namespace: ws.Namespace}, oc) to check errors.IsNotFound(err) and
return the default only in that case, and otherwise return/forward the Get error
instead of silently using DefaultAutoPauseTimeoutMinutes (preserve use of
oc.Spec.Workspaces.GetAutoPauseTimeoutMinutes() when oc is found).
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (2)
internal/buildapi/workspace.go (1)
469-481:⚠️ Potential issue | 🟡 MinorStill patching activity from a stale
Workspacesnapshot.
wswas loaded before the handler began its main operation, so thisMergeFrom(ws.DeepCopy())can still race with auto-pause/stop. A late best-effort patch can conflict or repopulateLastActivityTimeafter the workspace has already transitioned towardStopped.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/buildapi/workspace.go` around lines 469 - 481, touchWorkspaceActivity is creating a MergeFrom patch from an earlier ws snapshot which can race with auto-pause/stop; instead re-read the current Workspace before patching and build the patch from that fresh copy: after getClientFromRequest succeeds, call k8sClient.Get to fetch the latest Workspace into a new variable (e.g., latestWS), check latestWS.Spec.Stopped again, set latestWS.Status.LastActivityTime = &now and then call k8sClient.Status().Patch using client.MergeFrom(latestWS.DeepCopy()) (or bail if the workspace is now stopped) so you never patch from a stale snapshot.internal/controller/workspace/controller.go (1)
391-398:⚠️ Potential issue | 🟠 MajorDon't turn
OperatorConfigread failures into the built-in timeout.Any transient
r.Get()error here currently changes the effective idle policy instead of retrying, so the controller can auto-pause with the wrong timeout. Only fall back onIsNotFound; bubble other errors back tocheckAutoPause().🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/workspace/controller.go` around lines 391 - 398, The current logic around r.Get(ctx, client.ObjectKey{Name: "config", Namespace: ws.Namespace}, oc) swallows all errors and falls back to the built-in timeout; change this so only a NotFound error falls back to automotivev1alpha1.DefaultAutoPauseTimeoutMinutes, while any other r.Get error is returned to the caller (so checkAutoPause() can retry/handle it). Concretely, update the method using OperatorConfig/oc to inspect the error with apierrors.IsNotFound(err) and return the default timeout only in that case; for any other non-nil err return it (adjust the function signature to return (time.Duration, error) if needed) and keep the existing path that uses oc.Spec.Workspaces.GetAutoPauseTimeoutMinutes() when err == nil.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/v1alpha1/workspace_types.go`:
- Around line 68-73: The AutoPauseTimeoutMinutes field currently allows negative
values; add a kubebuilder validation marker to reject negatives by annotating
the AutoPauseTimeoutMinutes field (the pointer int32 field named
AutoPauseTimeoutMinutes) with "+kubebuilder:validation:Minimum=0" so the CRD
schema enforces values >= 0 (nil and 0 still allowed), then regenerate CRDs
(e.g., run controller-gen / repo's generate task) so the CRD rejects negative
numbers at the API boundary.
In `@cmd/caib/workspace/workspace.go`:
- Around line 297-300: The code currently treats any negative autoPauseTimeout
like "-1" by only setting req.AutoPauseTimeoutMinutes when autoPauseTimeout >=
0; update the validation logic where autoPauseTimeout is processed (the block
that sets req.AutoPauseTimeoutMinutes) to explicitly reject values less than -1
by returning an error (or exiting with a clear message) if autoPauseTimeout <
-1, and keep the existing behavior for -1 (leave req.AutoPauseTimeoutMinutes
nil) and for >= 0 (convert to int32 and set req.AutoPauseTimeoutMinutes = &v).
In `@internal/controller/workspace/controller.go`:
- Around line 403-405: The current early return when timeout :=
r.getAutoPauseTimeout(ctx, ws) yields zero prevents workspaces from observing
later global (OperatorConfig) changes; detect whether the zero came from an
inherited OperatorConfig value (add a second return or helper like
getAutoPauseTimeoutInherited or isAutoPauseInherited) and, if timeout==0 AND
inherited==true, return a slow requeue (e.g., ctrl.Result{RequeueAfter:
<reasonable duration>}, nil) so global enables/changes propagate; alternatively
wire a watch on OperatorConfig to trigger reconciles, but implement one of these
in the reconciliation code path around r.getAutoPauseTimeout.
- Around line 374-376: The early-return in setStatus() only compares phase and
can skip updates when ws.Status.LastActivityTime should be changed; update the
change-detection logic in setStatus (or the function doing the no-op check) to
also compare the previous ws.Status.LastActivityTime with the new computed
LastActivityTime and treat them as a difference if they are not equal, so the
code that sets ws.Status.LastActivityTime for phases "Stopped", "Pending", or
"Creating" runs even when phase is unchanged; ensure comparisons handle nil vs
non-nil correctly and use the same time equality semantics used elsewhere.
---
Duplicate comments:
In `@internal/buildapi/workspace.go`:
- Around line 469-481: touchWorkspaceActivity is creating a MergeFrom patch from
an earlier ws snapshot which can race with auto-pause/stop; instead re-read the
current Workspace before patching and build the patch from that fresh copy:
after getClientFromRequest succeeds, call k8sClient.Get to fetch the latest
Workspace into a new variable (e.g., latestWS), check latestWS.Spec.Stopped
again, set latestWS.Status.LastActivityTime = &now and then call
k8sClient.Status().Patch using client.MergeFrom(latestWS.DeepCopy()) (or bail if
the workspace is now stopped) so you never patch from a stale snapshot.
In `@internal/controller/workspace/controller.go`:
- Around line 391-398: The current logic around r.Get(ctx,
client.ObjectKey{Name: "config", Namespace: ws.Namespace}, oc) swallows all
errors and falls back to the built-in timeout; change this so only a NotFound
error falls back to automotivev1alpha1.DefaultAutoPauseTimeoutMinutes, while any
other r.Get error is returned to the caller (so checkAutoPause() can
retry/handle it). Concretely, update the method using OperatorConfig/oc to
inspect the error with apierrors.IsNotFound(err) and return the default timeout
only in that case; for any other non-nil err return it (adjust the function
signature to return (time.Duration, error) if needed) and keep the existing path
that uses oc.Spec.Workspaces.GetAutoPauseTimeoutMinutes() when err == nil.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4080a934-0722-45bb-8be5-b932c8b5cfc0
⛔ Files ignored due to path filters (2)
config/crd/bases/automotive.sdv.cloud.redhat.com_operatorconfigs.yamlis excluded by!config/crd/bases/**config/crd/bases/automotive.sdv.cloud.redhat.com_workspaces.yamlis excluded by!config/crd/bases/**
📒 Files selected for processing (8)
api/v1alpha1/operatorconfig_types.goapi/v1alpha1/workspace_types.goapi/v1alpha1/zz_generated.deepcopy.gocmd/caib/workspace/workspace.gocmd/main.gointernal/buildapi/workspace.gointernal/controller/workspace/controller.gointernal/controller/workspace/controller_test.go
✅ Files skipped from review due to trivial changes (2)
- internal/controller/workspace/controller_test.go
- api/v1alpha1/zz_generated.deepcopy.go
🚧 Files skipped from review as they are similar to previous changes (2)
- cmd/main.go
- api/v1alpha1/operatorconfig_types.go
Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com> Assisted-by: claude-opus-4.6
There was a problem hiding this comment.
♻️ Duplicate comments (3)
internal/controller/workspace/controller.go (3)
391-396:⚠️ Potential issue | 🟠 MajorOnly fall back to the default timeout on
NotFounderrors.This lookup treats any
r.Geterror (including transient API failures or timeouts) the same as "not configured", falling through to the built-in default. A transient failure could cause the workspace to use an incorrect timeout. OnlyNotFoundshould trigger the default fallback; other errors should be propagated.Suggested fix
oc := &automotivev1alpha1.OperatorConfig{} - if err := r.Get(ctx, client.ObjectKey{Name: "config", Namespace: ws.Namespace}, oc); err == nil { + if err := r.Get(ctx, client.ObjectKey{Name: "config", Namespace: ws.Namespace}, oc); err != nil { + if !k8serrors.IsNotFound(err) { + // Log transient error but don't fail — use default + // Consider returning error to let reconcile retry + } + } else { if oc.Spec.Workspaces != nil { return time.Duration(oc.Spec.Workspaces.GetAutoPauseTimeoutMinutes()) * time.Minute } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/workspace/controller.go` around lines 391 - 396, The current r.Get call in the workspace timeout lookup treats any error as "not configured" and falls back to the default; change it to only fall back when the error is a NotFound. After calling r.Get(ctx, client.ObjectKey{Name: "config", Namespace: ws.Namespace}, oc), check apierrors.IsNotFound(err) and only then use the built-in default timeout; for any other non-nil err return/propagate the error from the surrounding function. Keep the existing logic that returns oc.Spec.Workspaces.GetAutoPauseTimeoutMinutes() * time.Minute when oc is present.
367-376:⚠️ Potential issue | 🟡 MinorEarly return doesn't account for
LastActivityTimeneeding to be cleared.If the phase/message/podName are unchanged but
LastActivityTimeis non-nil when transitioning to Stopped/Pending/Creating (e.g., due to a race with the build API), the early return would skip clearing it. This is a minor edge case but could leave stale activity data.Suggested fix
+ shouldClearActivity := phase == "Stopped" || phase == "Pending" || phase == "Creating" + activityNeedsClear := shouldClearActivity && ws.Status.LastActivityTime != nil - if ws.Status.Phase == phase && ws.Status.Message == message && ws.Status.PodName == podName { + if ws.Status.Phase == phase && ws.Status.Message == message && ws.Status.PodName == podName && !activityNeedsClear { return nil // no change } patch := client.MergeFrom(ws.DeepCopy()) ws.Status.Phase = phase ws.Status.Message = message ws.Status.PodName = podName - if phase == "Stopped" || phase == "Pending" || phase == "Creating" { + if shouldClearActivity { ws.Status.LastActivityTime = nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/workspace/controller.go` around lines 367 - 376, The early return only compares ws.Status.Phase, Message and PodName and can skip clearing ws.Status.LastActivityTime when transitioning to "Stopped"/"Pending"/"Creating"; update the guard so it also checks whether LastActivityTime must be cleared: before returning ensure that if phase is "Stopped" or "Pending" or "Creating" and ws.Status.LastActivityTime != nil you proceed to clear it (set to nil) and perform the status patch (use the existing client.MergeFrom(ws.DeepCopy()) flow), otherwise return early as before; reference ws.Status.LastActivityTime, ws.Status.Phase, ws.Status.Message, ws.Status.PodName, client.MergeFrom and ws.DeepCopy in your change.
403-406:⚠️ Potential issue | 🟡 MinorInherited global disables won't observe later OperatorConfig changes.
When
timeout == 0comes from the inheritedOperatorConfig(rather than an explicit workspace override), returning without a requeue means the workspace won't re-check if the operator later enables auto-pause. Consider either:
- Returning a slow requeue (e.g., 30 minutes) for inherited-zero cases
- Watching
OperatorConfigchanges to trigger workspace reconcilesSuggested improvement
timeout := r.getAutoPauseTimeout(ctx, ws) if timeout == 0 { - return ctrl.Result{}, nil + // Still requeue slowly to observe OperatorConfig changes + // when auto-pause is inherited (not explicitly disabled per-workspace) + if ws.Spec.AutoPauseTimeoutMinutes == nil { + return ctrl.Result{RequeueAfter: 30 * time.Minute}, nil + } + return ctrl.Result{}, nil }
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/controller/workspace/controller.go`:
- Around line 391-396: The current r.Get call in the workspace timeout lookup
treats any error as "not configured" and falls back to the default; change it to
only fall back when the error is a NotFound. After calling r.Get(ctx,
client.ObjectKey{Name: "config", Namespace: ws.Namespace}, oc), check
apierrors.IsNotFound(err) and only then use the built-in default timeout; for
any other non-nil err return/propagate the error from the surrounding function.
Keep the existing logic that returns
oc.Spec.Workspaces.GetAutoPauseTimeoutMinutes() * time.Minute when oc is
present.
- Around line 367-376: The early return only compares ws.Status.Phase, Message
and PodName and can skip clearing ws.Status.LastActivityTime when transitioning
to "Stopped"/"Pending"/"Creating"; update the guard so it also checks whether
LastActivityTime must be cleared: before returning ensure that if phase is
"Stopped" or "Pending" or "Creating" and ws.Status.LastActivityTime != nil you
proceed to clear it (set to nil) and perform the status patch (use the existing
client.MergeFrom(ws.DeepCopy()) flow), otherwise return early as before;
reference ws.Status.LastActivityTime, ws.Status.Phase, ws.Status.Message,
ws.Status.PodName, client.MergeFrom and ws.DeepCopy in your change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 887a78fc-5661-47d5-a59a-99daf50778c0
⛔ Files ignored due to path filters (2)
config/crd/bases/automotive.sdv.cloud.redhat.com_operatorconfigs.yamlis excluded by!config/crd/bases/**config/crd/bases/automotive.sdv.cloud.redhat.com_workspaces.yamlis excluded by!config/crd/bases/**
📒 Files selected for processing (8)
api/v1alpha1/operatorconfig_types.goapi/v1alpha1/workspace_types.goapi/v1alpha1/zz_generated.deepcopy.gocmd/caib/workspace/workspace.gocmd/main.gointernal/buildapi/workspace.gointernal/controller/workspace/controller.gointernal/controller/workspace/controller_test.go
✅ Files skipped from review due to trivial changes (1)
- internal/controller/workspace/controller_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- api/v1alpha1/workspace_types.go
- api/v1alpha1/zz_generated.deepcopy.go
- api/v1alpha1/operatorconfig_types.go
Summary by CodeRabbit
New Features
Bug Fixes