-
Notifications
You must be signed in to change notification settings - Fork 688
feat: add initial Grove support #2012
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis update introduces Grove support and resource reconciliation abstractions to the Dynamo operator. It adds new schema fields for stop signals and autoscaling tolerance, updates RBAC to support Grove's PodGangSet, centralizes resource generation logic, and refactors controller flows for modularity. Dependency versions are updated, and new tests validate Grove integration. Changes
Sequence Diagram(s)Grove-enabled Reconciliation FlowsequenceDiagram
participant User
participant Controller
participant GroveAPI
participant K8sAPI
participant IstioAPI
User->>Controller: Submit DynamoGraphDeployment
Controller->>Controller: Check EnableGrove flag
alt Grove enabled
Controller->>GroveAPI: Create/Sync PodGangSet
Controller->>K8sAPI: Create/Sync Service
Controller->>K8sAPI: Create/Sync Ingress
Controller->>IstioAPI: Create/Sync VirtualService
else Grove disabled
Controller->>K8sAPI: Create/Sync DynamoComponentDeployments
end
Controller->>Controller: Check readiness of all resources
Controller->>User: Update status (Ready/Pending)
Estimated code review effort3 (~45 minutes) Possibly related PRs
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 11
♻️ Duplicate comments (1)
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)
124-129: Same concern as scaleDownReplicate the guard logic for
scaleUp.toleranceor removal before applying the HPA.
🧹 Nitpick comments (5)
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)
1859-1861: DuplicatestopSignalfield – remember to handle for both list & main container pathsDon’t forget to update whatever common unmarshalling / defaulting code you have so that the semantics are identical for
.spec.containers[*].lifecycle.stopSignaland.spec.mainContainer.lifecycle.stopSignal.deploy/cloud/operator/go.mod (1)
81-93: Minor: audit newly added indirects
golang.org/x/time&sigs.k8s.io/randfillare fine, but double-check that they’re pulled only transitively; if not used directly, considergo mod tidy -compat=1.21to keep the file lean.deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (2)
1233-1235: Consider validatingstopSignalvalues
stopSignalis free-form; a typo will surface only at runtime (or be ignored).
Adding either an enum of common signals ("SIGTERM","SIGINT", …) or at least a regex like^SIG[A-Z0-9]+$would give users early feedback.- stopSignal: - type: string + stopSignal: + type: string + pattern: ^SIG[A-Z0-9]+$
1915-1916: Duplicate definition OK, but keep the two specs in sync
stopSignalappears for bothcontainers[*]andmainContainer.
Whenever you refine validation (see above) or defaulting, remember to update both locations to prevent silent drift.deploy/cloud/operator/internal/controller_common/resource.go (1)
409-442: Note the asymmetry in GPU resource handlingThe function parses GPU resources only for Limits (lines 388-397) but not for Requests. While this is a common pattern in Kubernetes where GPU requests typically match limits, consider adding a comment to document this intentional design choice or implement GPU parsing for Requests for consistency.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
deploy/cloud/operator/go.sumis excluded by!**/*.sum
📒 Files selected for processing (20)
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml(4 hunks)deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml(4 hunks)deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml(1 hunks)deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go(2 hunks)deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go(0 hunks)deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go(1 hunks)deploy/cloud/operator/cmd/main.go(3 hunks)deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml(4 hunks)deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml(4 hunks)deploy/cloud/operator/config/rbac/role.yaml(1 hunks)deploy/cloud/operator/go.mod(5 hunks)deploy/cloud/operator/internal/consts/consts.go(1 hunks)deploy/cloud/operator/internal/controller/common.go(0 hunks)deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go(5 hunks)deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go(4 hunks)deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go(7 hunks)deploy/cloud/operator/internal/controller_common/predicate.go(1 hunks)deploy/cloud/operator/internal/controller_common/resource.go(2 hunks)deploy/cloud/operator/internal/dynamo/graph.go(4 hunks)deploy/cloud/operator/internal/dynamo/graph_test.go(3 hunks)
💤 Files with no reviewable changes (2)
- deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go
- deploy/cloud/operator/internal/controller/common.go
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (1)
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.
deploy/cloud/operator/internal/dynamo/graph.go (1)
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.
🧬 Code Graph Analysis (3)
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
IngressSpec(100-111)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (1)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
IngressSpec(100-111)
deploy/cloud/operator/internal/controller_common/resource.go (3)
deploy/cloud/operator/internal/dynamo/graph.go (1)
Resources(52-57)deploy/cloud/operator/internal/consts/consts.go (1)
KubeResourceGPUNvidia(29-29)deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)
Resource(156-159)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - vllm
🔇 Additional comments (26)
deploy/cloud/operator/internal/consts/consts.go (1)
37-37: LGTM! Good refactoring for ingress hostname construction.The addition of
DefaultIngressSuffixconstant centralizes ingress hostname construction logic and provides a sensible default value for local development environments.deploy/cloud/operator/internal/controller_common/predicate.go (1)
36-36: LGTM! Clean addition of Grove feature flag.The
EnableGroveboolean field provides a clean way to conditionally enable Grove functionality in the operator controllers.deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)
119-130: LGTM! Appropriate RBAC permissions for Grove integration.The new RBAC rule correctly grants necessary permissions for managing Grove PodGangSet resources. The permissions are appropriately scoped to the specific resource type and API group.
deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1)
253-257: LGTM! Proper handling of pointer field in deepcopy logic.The generated deepcopy code correctly handles the
Ingressfield as a pointer type with proper nil checking to prevent runtime panics. This aligns with the type change fromIngressSpecto*IngressSpec.deploy/cloud/operator/config/rbac/role.yaml (1)
101-112: LGTM! Consistent RBAC permissions for Grove integration.The new RBAC rule for Grove PodGangSet resources is correctly configured and consistent with the Helm template changes. The permissions are appropriate for managing Grove resources.
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (4)
281-281: LGTM: Correctly updated test to use pointer type for Ingress fieldThis change correctly aligns with the API type change where
Ingressfield was changed fromIngressSpecto*IngressSpecin theDynamoComponentDeploymentSharedSpec.
340-340: LGTM: Consistent pointer type usage across test casesThis change maintains consistency with the API type change across multiple test cases in the same test function.
403-403: LGTM: Correctly updated virtual service test to use pointer typeThis change properly updates the virtual service test to align with the pointer type change in the API definition.
435-435: LGTM: Final test case properly updated to use pointer typeThis completes the consistent update of all test cases to use the pointer type for the
Ingressfield, maintaining alignment with the API type change.deploy/cloud/operator/cmd/main.go (3)
92-92: LGTM: Properly declared Grove feature flag variableThe
enableGrovevariable declaration follows the established pattern for other feature flags in the codebase.
120-121: LGTM: Grove feature flag properly implementedThe flag implementation follows Go best practices with:
- Clear flag name (
enable-grove)- Safe default value (
false)- Descriptive help text
- Consistent with existing feature flags
134-134: LGTM: Grove flag correctly wired to controller configurationThe flag value is properly assigned to the controller configuration, enabling runtime control of Grove functionality.
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (3)
75-75: LGTM: Ingress field changed to pointer type for better nil handlingChanging the
Ingressfield fromIngressSpecto*IngressSpecallows proper representation of absent ingress configuration as nil, which is more semantically correct than an empty struct.
152-154: LGTM: IsReady() method correctly delegates to statusThe
IsReady()method properly delegates to the embeddedStatus.IsReady()method, providing a consistent interface for readiness checks as required by the generic Resource interface.
156-158: LGTM: GetName() method provides clean access to object nameThe
GetName()method correctly returns the object's metadata name, supporting the generic Resource interface requirements.deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (2)
1178-1180: ValidatestopSignalcompatibility with Kubernetes versionsThe
stopSignalfield underlifecycleis only supported in Kubernetes v1.33+ behind the alphaContainerStopSignalsfeature gate. On earlier versions or clusters where the gate is disabled, any Pod template containinglifecycle.stopSignalwill be rejected by the API server.Please ensure one of the following before merging:
• Your operator only targets Kubernetes v1.33+ withContainerStopSignalsenabled.
• You rename this field to an operator-scoped name (e.g.dynamoStopSignal) or strip it before creating Pods.Additionally, enhance the CRD schema by adding a
descriptionand anenumconstraint to document and restrict allowed values.Points to address:
- File: deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
Lines: 1178–1180Suggested schema snippet:
stopSignal: type: string description: > Signal sent to the container on termination (requires Kubernetes v1.33+ with ContainerStopSignals feature gate). enum: - SIGTERM - SIGINT - SIGHUP - SIGUSR1 - SIGUSR2
92-98: No need to striptolerance—operator never forwards unknown fieldsThe
generateHPAfunction indynamocomponentdeployment_controller.gobuilds the HPA spec by pulling only the known fields fromSpec.Autoscaling(MinReplicas, MaxReplicas, Behavior, Metrics) and never reflects or passes through any additional properties. Since the Go API type forAutoscalingdoesn’t include aTolerancefield, anytolerance:entry in the CR is ignored by the controller and won’t end up in the generatedHorizontalPodAutoscaler.• generateHPA manually sets
– MinReplicas / MaxReplicas
– ScaleTargetRef
– Behavior
– Metrics• Unknown JSON fields (like
tolerance) are dropped by the JSON unmarshallerIf you intended to make
tolerancean operator-side knob, you’ll need to:
- Add a
Tolerancefield to theAutoscalingGo type inapi/v1alpha1/common.go- Read it in
generateHPA(or elsewhere) and apply your custom logic- Continue to omit it when constructing the
HorizontalPodAutoscalerLikely an incorrect or invalid review comment.
deploy/cloud/operator/go.mod (2)
9-9: New Grove API dependency looks goodThe direct import of
github.com/NVIDIA/grove/operator/apimatches the Grove support introduced in this PR.
49-49: Transitive bump ofgnostic-modelsis harmlessNothing to flag here.
deploy/cloud/operator/internal/dynamo/graph_test.go (4)
27-38: LGTM - Import changes support Grove integration.The new imports for Grove v1alpha1 API, resource parsing, and intstr utilities are properly organized and necessary for the new PodGangSet test functionality.
389-392: LGTM - Correct pointer usage for Ingress field.The fix properly uses a pointer to
IngressSpecinstead of a direct struct assignment, which aligns with the field type definition.
1119-1443: Excellent comprehensive test for Grove PodGangSet integration.This test function provides thorough coverage of the
GenerateGrovePodGangSetfunctionality with:
- Realistic test data: Two services with different configurations including resources, environment variables, probes, and volumes
- Complete validation: Tests the conversion from DynamoGraphDeployment to Grove PodGangSet structure
- Proper test structure: Clear setup, execution, and verification with detailed diff comparison
- Deterministic comparison: Sorting of cliques ensures consistent test results
- Edge case coverage: Tests environment variable merging, resource mapping, and volume mount configurations
The test validates critical Grove integration functionality and follows Go testing best practices.
1119-1443: Test integration follows established patterns.The new Grove test function integrates seamlessly with the existing test suite, following the same error handling patterns, test structure conventions, and verification approaches used throughout the file.
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (2)
151-157: Double-check controller support for the newtoleranceknobThe schema now permits a
tolerancevalue underscaleDown.
This field does not exist in the upstream HPA API, so the controller code must explicitly read and act on it; otherwise users will set it with no effect and validation will silently succeed.Please verify that:
- The reconcilers parse
services[*].autoscaling.behavior.scaleDown.tolerance.- A default is applied when the field is omitted.
- Unit / e2e tests exercise non-trivial values (both int & string).
183-189: Mirror the above verification forscaleUp.toleranceSame concern as for
scaleDown: ensure the controller actually consults this field during scale-up calculations.deploy/cloud/operator/internal/dynamo/graph.go (1)
300-303: Good improvement: Sorted environment variablesAdding alphabetical sorting to the merged environment variables ensures consistent ordering, which improves debugging and reduces unnecessary updates due to ordering differences.
Overview:
add initial Grove support
Summary by CodeRabbit
New Features
Enhancements
Bug Fixes
Dependency Updates
Tests
Documentation