fix: setup planner serviceaccount in cluster-wide deployment #3520
Conversation
Signed-off-by: Julien Mancuso <[email protected]>
@coderabbitai full review

✅ Actions performed: Full review triggered.
Walkthrough

Adds conditional Helm and operator RBAC handling: templates now choose namespaced or cluster-scoped planner RBAC; the operator gains an RBAC manager, a planner-cluster-role flag, wiring into the reconciler, and runtime logic to ensure the planner ServiceAccount/RoleBinding in cluster-wide mode.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Helm
    participant K8s
    participant Operator
    participant Reconciler
    participant RBACMgr
    rect rgba(230,240,255,0.5)
        note right of Helm: Helm install/upgrade
        User->>Helm: helm install/upgrade
        alt namespaceRestriction.enabled == true
            Helm->>K8s: Create SA, Role, RoleBinding (namespaced)
        else
            Helm->>K8s: Create ClusterRole "...-planner"
            note right of Helm: Inject manager arg --planner-cluster-role-name
        end
    end
    rect rgba(240,255,240,0.5)
        note right of Operator: Runtime reconcile flow
        Operator->>Reconciler: Start reconcile(DynamoGraphDeployment)
        alt Cluster-wide mode (namespaceRestriction disabled)
            Reconciler->>RBACMgr: EnsureServiceAccountWithRBAC(ns, planner-SA, clusterRoleName)
            RBACMgr->>K8s: Get/Create ServiceAccount
            RBACMgr->>K8s: Get/Create/Update RoleBinding -> ClusterRole
            RBACMgr-->>Reconciler: success / error
            opt error
                Reconciler-->>Operator: return wrapped error "failed to ensure planner RBAC"
            end
        else Namespaced mode
            note over Reconciler: Skip RBAC ensure (Helm provided namespaced RBAC)
        end
        Reconciler->>K8s: Reconcile existing PVCs (unchanged flow)
    end
```
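The branch between the two modes in the reconcile flow reduces to a single check on the restricted-namespace setting. A minimal Go sketch of that decision (identifiers here are illustrative, not the operator's actual names):

```go
package main

import "fmt"

// rbacMode mirrors the two deployment modes from the diagram above.
type rbacMode int

const (
	modeNamespaced  rbacMode = iota // Helm-provided SA/Role/RoleBinding
	modeClusterWide                 // operator ensures SA + RoleBinding per namespace
)

// selectRBACMode decides whether the reconciler must ensure planner RBAC
// itself: only in cluster-wide mode, i.e. when no restricted namespace is set.
func selectRBACMode(restrictedNamespace string) rbacMode {
	if restrictedNamespace != "" {
		return modeNamespaced
	}
	return modeClusterWide
}

func main() {
	fmt.Println(selectRBACMode("") == modeClusterWide)      // true
	fmt.Println(selectRBACMode("dynamo") == modeNamespaced) // true
}
```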
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ❌ Failed checks (2 warnings) | ✅ Passed checks (1 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deploy/cloud/operator/cmd/main.go (1)
187-236: Fail fast when the planner cluster role name is missing in cluster-wide mode.

When we run without `--restrictedNamespace`, `plannerClusterRoleName` defaults to `""`. The reconciler then calls `EnsureServiceAccountWithRBAC` with an empty cluster role name, and the RoleBinding creation fails (`spec.roleRef.name: Required value`). Every reconciliation in cluster-wide mode will keep erroring, effectively breaking planner provisioning for default installs that have not yet set the new flag. Please validate the flag after parsing and exit with a clear message before we start the manager.

Apply this diff after `flag.Parse()`:

```diff
 flag.Parse()
+
+if restrictedNamespace == "" && plannerClusterRoleName == "" {
+	setupLog.Error(nil, "planner-cluster-role-name is required when running cluster-wide")
+	os.Exit(1)
+}
+
 // Validate modelExpressURL if provided
 if modelExpressURL != "" {
```
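The fail-fast rule behind that diff can be exercised in isolation. A small sketch with illustrative names (the real operator logs via setupLog and calls os.Exit instead of returning an error):

```go
package main

import (
	"fmt"
	"strings"
)

// validatePlannerFlags mirrors the suggested check: in cluster-wide mode
// (empty restrictedNamespace) the planner cluster role name must be set,
// or every reconcile would fail on an empty RoleRef. Function name and
// error text are illustrative, not the operator's actual code.
func validatePlannerFlags(restrictedNamespace, plannerClusterRoleName string) error {
	if strings.TrimSpace(restrictedNamespace) == "" &&
		strings.TrimSpace(plannerClusterRoleName) == "" {
		return fmt.Errorf("planner-cluster-role-name is required when running cluster-wide")
	}
	return nil
}

func main() {
	fmt.Println(validatePlannerFlags("", ""))               // non-nil error
	fmt.Println(validatePlannerFlags("", "dynamo-planner")) // <nil>
	fmt.Println(validatePlannerFlags("dynamo", ""))         // <nil>
}
```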
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)

- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/planner.yaml (3 hunks)
- deploy/cloud/operator/cmd/main.go (5 hunks)
- deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (3 hunks)
- deploy/cloud/operator/internal/controller_common/predicate.go (1 hunks)
- deploy/cloud/operator/internal/rbac/manager.go (1 hunks)
- deploy/cloud/operator/internal/rbac/manager_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
Walkthrough

Adds namespace-restricted vs cluster-wide planner RBAC modes. Helm templates conditionally render SA/Role/RoleBinding or ClusterRole and pass planner-cluster-role-name to the operator. The operator introduces an RBAC manager, wires it into the DynamoGraphDeploymentReconciler, and ensures the planner ServiceAccount and RoleBinding to a ClusterRole during reconcile. Includes tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Helm as Helm Charts
    participant K8s as Kubernetes
    participant Oper as Operator (Manager)
    participant Recon as DynamoGraphDeploymentReconciler
    participant RBAC as RBAC Manager
    rect rgb(235, 245, 255)
        note over Helm,K8s: Install/Upgrade
        Helm->>K8s: Render templates
        alt namespaceRestriction.enabled
            Helm-->>K8s: SA/Role/RoleBinding (namespace-scoped)
        else cluster-wide
            Helm-->>K8s: ClusterRole (planner)
            Helm-->>K8s: Operator Deployment arg --planner-cluster-role-name
        end
    end
    rect rgb(240, 255, 240)
        note over Oper,RBAC: Operator startup
        K8s-->>Oper: Start operator pod
        Oper->>Recon: Initialize with Config.RBAC.PlannerClusterRoleName<br/>and RBACMgr
    end
    rect rgb(255, 250, 235)
        note over Recon,RBAC: Reconcile flow (cluster-wide mode)
        Recon->>Recon: Check RestrictedNamespace
        alt not restricted
            Recon->>RBAC: EnsureServiceAccountWithRBAC(ns, plannerSA, clusterRoleName)
            RBAC->>K8s: Get/Create ServiceAccount
            RBAC->>K8s: Get/Create/Update RoleBinding -> ClusterRole
            RBAC-->>Recon: Result (error/success)
            opt on error
                Recon-->>Oper: Requeue/Error
            end
        else restricted
            Recon->>Recon: Skip RBAC ensure
        end
        Recon->>K8s: Continue with resource reconciliation (e.g., PVC)
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (2 warnings) | ✅ Passed checks (1 passed)
⚠️ Warning: Review ran into problems. 🔥 Problems — Git: Failed to clone repository.
Actionable comments posted: 4
🧹 Nitpick comments (4)
deploy/cloud/helm/platform/components/operator/templates/planner.yaml (2)
39-42: Add watch (and possibly delete) verbs for robustness. Planner likely needs watch for informer/list-watch patterns; delete may be needed for cleanup. Align both Role and ClusterRole.

Apply this diff to both blocks:

```diff
- verbs: ["get", "list", "create", "update", "patch"]
+ verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```

Also applies to: 68-70
26-29: Parity gap: imagePullSecrets only in namespaced mode. Cluster-wide mode creates SAs via the RBAC manager without imagePullSecrets, which can break private image pulls. Align behavior by either:
- Adding optional imagePullSecrets handling in the RBAC manager (preferred), or
- Dropping imagePullSecrets from this SA and ensuring pods specify them directly.
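For the first option, one way the RBAC manager could merge required pull-secret names into an existing ServiceAccount's list without clobbering user-added entries. This is a sketch under assumed semantics; `mergePullSecrets` is a hypothetical helper operating on plain name lists rather than the Kubernetes API types:

```go
package main

import "fmt"

// mergePullSecrets appends the required secret names to the existing list,
// preserving order and skipping names already present, so user-managed
// entries on a pre-existing ServiceAccount are never dropped.
func mergePullSecrets(existing, required []string) []string {
	seen := make(map[string]bool, len(existing))
	out := append([]string(nil), existing...)
	for _, s := range existing {
		seen[s] = true
	}
	for _, s := range required {
		if !seen[s] {
			out = append(out, s)
			seen[s] = true
		}
	}
	return out
}

func main() {
	fmt.Println(mergePullSecrets(
		[]string{"user-secret"},
		[]string{"registry-secret", "user-secret"},
	)) // [user-secret registry-secret]
}
```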
deploy/cloud/operator/internal/rbac/manager_test.go (2)
419-452: Strengthen test for RoleRef immutability handling. After changing the ClusterRole, assert the RoleBinding is replaced (delete+create), not updated in place.

Example adjustment:

```go
// Second call with a different ClusterRole
err = manager.EnsureServiceAccountWithRBAC(ctx, "test-namespace", "test-sa", "cluster-role-2")
if err != nil {
	t.Fatalf("Second call failed: %v", err)
}

// Verify RoleBinding now points to cluster-role-2
rb := &rbacv1.RoleBinding{}
_ = fakeClient.Get(ctx, client.ObjectKey{Namespace: "test-namespace", Name: "test-sa-binding"}, rb)
if rb.RoleRef.Name != "cluster-role-2" {
	t.Errorf("Expected RoleRef name cluster-role-2, got %s", rb.RoleRef.Name)
}
```
454-483: Add negative test: empty namespace should error. Once input validation is added in manager.go, assert EnsureServiceAccountWithRBAC returns an error for an empty namespace.

Example:

```go
if err == nil {
	t.Fatalf("expected error on empty namespace, got nil")
}
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)

- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/planner.yaml (3 hunks)
- deploy/cloud/operator/cmd/main.go (5 hunks)
- deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (3 hunks)
- deploy/cloud/operator/internal/controller_common/predicate.go (1 hunks)
- deploy/cloud/operator/internal/rbac/manager.go (1 hunks)
- deploy/cloud/operator/internal/rbac/manager_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
🔇 Additional comments (2)
deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1)
127-129: LGTM: planner ClusterRole flag is correctly gated and named. Matches the ClusterRole name rendered by Helm.
deploy/cloud/operator/internal/controller_common/predicate.go (1)
69-76: Config surface extension looks good. RBACConfig and PlannerClusterRoleName integrate cleanly.
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go
Signed-off-by: Julien Mancuso <[email protected]>
@coderabbitai help

Signed-off-by: Julien Mancuso <[email protected]>

@CodeRabbit full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 1
♻️ Duplicate comments (3)
deploy/cloud/operator/internal/rbac/manager.go (1)
8-18: Harden input validation: trim whitespace for names. Guard against whitespace-only values; import strings and use TrimSpace.

```diff
 import (
 	"context"
 	"fmt"
+	"strings"
@@
-	if targetNamespace == "" {
+	if strings.TrimSpace(targetNamespace) == "" {
 		return fmt.Errorf("target namespace is required")
 	}
-	if serviceAccountName == "" {
+	if strings.TrimSpace(serviceAccountName) == "" {
 		return fmt.Errorf("service account name is required")
 	}
-	if clusterRoleName == "" {
+	if strings.TrimSpace(clusterRoleName) == "" {
 		return fmt.Errorf("cluster role name is required")
 	}
```

Also applies to: 59-67
deploy/cloud/operator/cmd/main.go (1)
188-191: LGTM - Validation correctly implemented. Based on past review comments, this validation was already reviewed and confirmed to be correct. It ensures that `planner-cluster-role-name` is required when running in cluster-wide mode (when `restrictedNamespace == ""`), preventing RoleBindings from being created with an empty RoleRef.

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)
167-184: LGTM - Cluster-wide RBAC logic correctly implemented. Based on past review comments, this logic was already reviewed and confirmed to be correct. It includes:

- ✅ Check for cluster-wide mode (`RestrictedNamespace == ""`)
- ✅ Nil check for `RBACMgr` with a clear error message
- ✅ Empty check for `PlannerClusterRoleName` with validation
- ✅ Call to `EnsureServiceAccountWithRBAC` with proper error handling

The guards prevent panics and invalid bindings in cluster-wide mode.
🧹 Nitpick comments (1)
deploy/cloud/operator/internal/rbac/manager.go (1)
93-97: Reconcile required labels on existing SA/RB. Ensure labels converge when objects pre-exist; update only required keys to avoid clobbering user labels.

```diff
@@
-	logger.V(1).Info("ServiceAccount already exists",
-		"serviceAccount", serviceAccountName,
-		"namespace", targetNamespace)
+	// Ensure required labels are present on existing SA
+	updated := false
+	if sa.Labels == nil {
+		sa.Labels = map[string]string{}
+	}
+	for k, v := range map[string]string{
+		"app.kubernetes.io/managed-by": "dynamo-operator",
+		"app.kubernetes.io/component":  "rbac",
+		"app.kubernetes.io/name":       serviceAccountName,
+	} {
+		if sa.Labels[k] != v {
+			sa.Labels[k] = v
+			updated = true
+		}
+	}
+	if updated {
+		if err := m.client.Update(ctx, sa); err != nil {
+			return fmt.Errorf("failed to update service account labels: %w", err)
+		}
+		logger.V(1).Info("ServiceAccount labels updated",
+			"serviceAccount", serviceAccountName,
+			"namespace", targetNamespace)
+	} else {
+		logger.V(1).Info("ServiceAccount already exists",
+			"serviceAccount", serviceAccountName,
+			"namespace", targetNamespace)
+	}
@@
 } else if needsUpdate {
 	// Only Subjects changed, can update in-place
 	existingRB.Subjects = rb.Subjects
+	// Keep labels in sync too
+	if existingRB.Labels == nil {
+		existingRB.Labels = map[string]string{}
+	}
+	for k, v := range rb.Labels {
+		existingRB.Labels[k] = v
+	}
 	if err := m.client.Update(ctx, existingRB); err != nil {
 		return fmt.Errorf("failed to update role binding: %w", err)
 	}
 	logger.V(1).Info("RoleBinding subjects updated",
 		"roleBinding", roleBindingName,
 		"namespace", targetNamespace)
 } else {
+	// Update labels if they drifted
+	labelsChanged := false
+	if existingRB.Labels == nil {
+		existingRB.Labels = map[string]string{}
+	}
+	for k, v := range rb.Labels {
+		if existingRB.Labels[k] != v {
+			existingRB.Labels[k] = v
+			labelsChanged = true
+		}
+	}
+	if labelsChanged {
+		if err := m.client.Update(ctx, existingRB); err != nil {
+			return fmt.Errorf("failed to update role binding labels: %w", err)
+		}
+		logger.V(1).Info("RoleBinding labels updated",
+			"roleBinding", roleBindingName,
+			"clusterRole", clusterRoleName,
+			"namespace", targetNamespace)
+		return nil
+	}
 	logger.V(1).Info("RoleBinding already up-to-date",
 		"roleBinding", roleBindingName,
 		"clusterRole", clusterRoleName,
 		"namespace", targetNamespace)
 }
```

Also applies to: 175-189
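Kubernetes treats a RoleBinding's RoleRef as immutable, so a changed ClusterRole forces delete+recreate while changed Subjects can be updated in place. A minimal, dependency-free sketch of that decision (type and function names are hypothetical, not the manager's actual identifiers):

```go
package main

import "fmt"

// roleBindingAction classifies what a reconciler must do with an existing
// RoleBinding given the desired state.
type roleBindingAction string

const (
	actionNone     roleBindingAction = "none"
	actionUpdate   roleBindingAction = "update"
	actionRecreate roleBindingAction = "recreate"
)

// classify returns recreate when the referenced ClusterRole differs
// (RoleRef is immutable), update when only Subjects differ, none otherwise.
func classify(existingRole, desiredRole string, subjectsEqual bool) roleBindingAction {
	if existingRole != desiredRole {
		return actionRecreate // RoleRef change: must delete and re-create
	}
	if !subjectsEqual {
		return actionUpdate // Subjects are mutable, patch in place
	}
	return actionNone
}

func main() {
	fmt.Println(classify("cluster-role-1", "cluster-role-2", true))  // recreate
	fmt.Println(classify("cluster-role-1", "cluster-role-1", false)) // update
	fmt.Println(classify("cluster-role-1", "cluster-role-1", true))  // none
}
```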
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)

- `deploy/cloud/operator/go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (10)

- deploy/cloud/helm/platform/Chart.yaml (1 hunks)
- deploy/cloud/helm/platform/README.md (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/planner.yaml (3 hunks)
- deploy/cloud/operator/cmd/main.go (5 hunks)
- deploy/cloud/operator/go.mod (1 hunks)
- deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (3 hunks)
- deploy/cloud/operator/internal/controller_common/predicate.go (1 hunks)
- deploy/cloud/operator/internal/rbac/manager.go (1 hunks)
- deploy/cloud/operator/internal/rbac/manager_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
📚 Learning: 2025-09-17T22:35:40.674Z
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#3100
File: deploy/cloud/operator/cmd/main.go:186-190
Timestamp: 2025-09-17T22:35:40.674Z
Learning: The mpiRunSecretName validation in deploy/cloud/operator/cmd/main.go is safe for Helm-based upgrades because the chart automatically provides default values (secretName: "mpi-run-ssh-secret", sshKeygen.enabled: true) and the deployment template conditionally injects the --mpi-run-ssh-secret-name flag, ensuring existing installations get the required configuration during upgrades.
Applied to files:
deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
🪛 YAMLlint (1.37.1)
deploy/cloud/helm/platform/components/operator/templates/planner.yaml
[error] 16-16: syntax error: expected the node content, but found '-'
(syntax)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: sglang
- GitHub Check: trtllm (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (17)
deploy/cloud/operator/go.mod (1)
9-9: Dep bump looks good; run tidy/tests. Upgrade to grove/operator/api v0.1.0-alpha.2 LGTM. Please run `go mod tidy` and the operator tests to catch any API drift.
deploy/cloud/helm/platform/Chart.yaml (1)
38-38: LGTM on chart dependency bumps. Versions updated consistently.
Also applies to: 43-43
deploy/cloud/helm/platform/README.md (1)
92-93: Docs sync looks correctDependency versions align with Chart.yaml.
deploy/cloud/operator/internal/controller_common/predicate.go (1)
69-77: RBACConfig addition looks good. Public config extension is straightforward.
Please confirm PlannerClusterRoleName is wired from flags/values and used by the reconciler (search for usages and CLI wiring).
deploy/cloud/helm/platform/components/operator/templates/planner.yaml (5)
16-17: LGTM - False positive from static analysis. The YAML linter error is a false positive; `{{- if .Values.namespaceRestriction.enabled }}` is valid Helm template syntax for conditional rendering.
23-25: LGTM - Correct namespace reference. The change from `.Values.namespace` to `.Release.Namespace` ensures the ServiceAccount is created in the operator's deployment namespace, which is the correct behavior for namespace-restricted mode.
35-37: LGTM - Consistent changes. The namespace and labels changes are consistent with the ServiceAccount updates and correctly implement namespace-restricted mode.
47-53: LGTM - Correct RoleBinding configuration. The RoleBinding correctly references both its own namespace and the subject's namespace as `.Release.Namespace`, ensuring the binding is scoped to the operator's deployment namespace.
58-71: LGTM - Cluster-wide mode correctly implemented. The ClusterRole is properly defined for cluster-wide mode. Note that in cluster-wide mode, the ServiceAccount and RoleBinding are dynamically created in target namespaces by the RBAC manager (as seen in `dynamographdeployment_controller.go`), while this ClusterRole provides the cluster-wide permissions template.

deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1)

127-129: LGTM - Correct CLI argument injection. The `--planner-cluster-role-name` argument is correctly injected only in cluster-wide mode (when `namespaceRestriction.enabled` is false) and uses the same naming convention as the ClusterRole defined in `planner.yaml`.

deploy/cloud/operator/internal/rbac/manager_test.go (1)
1-765: LGTM - Comprehensive test coverage. The test suite provides excellent coverage of the RBAC manager's functionality:
- ✅ Creation scenarios
- ✅ Idempotency verification
- ✅ Update and reconciliation logic
- ✅ Multi-namespace handling
- ✅ Edge cases and validation
The tests properly document edge cases where fake client behavior differs from real Kubernetes (e.g., RoleRef immutability, empty namespace handling).
deploy/cloud/operator/cmd/main.go (4)
63-63: LGTM - Required import added. The RBAC package import is correctly added to support the new RBAC manager functionality.
141-141: LGTM - Flag properly defined. The `plannerClusterRoleName` flag is correctly defined with appropriate help text indicating it's for cluster-wide mode only.

Also applies to: 180-181
237-239: LGTM - RBAC config properly wired. The `PlannerClusterRoleName` is correctly added to the `RBACConfig` and will be available to the reconciler for cluster-wide RBAC provisioning.
436-437: LGTM - RBAC manager initialized and wired. The RBAC manager is properly initialized with the Kubernetes client and passed to the `DynamoGraphDeploymentReconciler`, enabling dynamic ServiceAccount and RoleBinding creation in cluster-wide mode.

Also applies to: 446-446
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (2)
66-69: LGTM - Interface properly defined. The `rbacManager` interface is well-defined with a single, focused method for ensuring a ServiceAccount with RBAC. This abstraction enables testability and follows good design practices.
79-79: LGTM - Field added to reconciler. The `RBACMgr` field is correctly added to the `DynamoGraphDeploymentReconciler` struct and will be initialized from `main.go`.
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go
Signed-off-by: Julien Mancuso <[email protected]>
Overview:
setup planner serviceaccount in cluster-wide deployment
Summary by CodeRabbit
New Features
Improvements
Tests
Chores