-
Notifications
You must be signed in to change notification settings - Fork 689
fix: handle groveTerminationDelay and auto-detect grove installation #2190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis change removes the static Changes
Sequence Diagram(s)sequenceDiagram
participant Helm
participant Operator Main
participant Controller Common
participant K8s API
participant Controller
Helm->>Operator Main: Passes groveTerminationDelay config
Operator Main->>Controller Common: DetectGroveAvailability(ctx, mgr)
Controller Common->>K8s API: Query for "grove.io" API group
K8s API-->>Controller Common: Return API group presence status
Controller Common-->>Operator Main: Returns Grove enabled/disabled
Operator Main->>Controller: Passes Grove config (Enabled, TerminationDelay)
Controller->>Controller: Uses Grove features if Enabled, applies TerminationDelay
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.2.2)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1)
103-105: Quote the duration to avoid YAML edge-casesQuoting protects against values that contain special characters or leading zeros.
- - --grove-termination-delay={{ .Values.dynamo.groveTerminationDelay }} + - --grove-termination-delay="{{ .Values.dynamo.groveTerminationDelay }}"
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
deploy/cloud/helm/deploy.sh(1 hunks)deploy/cloud/helm/dynamo-platform-values.yaml(0 hunks)deploy/cloud/helm/platform/components/operator/templates/deployment.yaml(1 hunks)deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml(0 hunks)deploy/cloud/helm/platform/components/operator/values.yaml(1 hunks)deploy/cloud/helm/platform/values.yaml(1 hunks)deploy/cloud/operator/cmd/main.go(5 hunks)deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go(2 hunks)deploy/cloud/operator/internal/controller_common/predicate.go(2 hunks)deploy/cloud/operator/internal/dynamo/graph.go(1 hunks)deploy/cloud/operator/internal/dynamo/graph_test.go(3 hunks)
💤 Files with no reviewable changes (2)
- deploy/cloud/helm/dynamo-platform-values.yaml
- deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
deploy/cloud/helm/platform/values.yaml (3)
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The stopSignal field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
PR: #1337
File: deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml:0-0
Timestamp: 2025-06-03T15:26:55.732Z
Learning: The image-builder ServiceAccount in deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml does not need imagePullSecrets, unlike the component ServiceAccount.
deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (5)
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The stopSignal field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
PR: #1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The DYN_DEPLOYMENT_CONFIG environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.
Learnt from: julienmancuso
PR: #1337
File: deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml:0-0
Timestamp: 2025-06-03T15:26:55.732Z
Learning: The image-builder ServiceAccount in deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml does not need imagePullSecrets, unlike the component ServiceAccount.
deploy/cloud/operator/internal/dynamo/graph.go (2)
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The stopSignal field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
deploy/cloud/helm/deploy.sh (1)
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (2)
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The stopSignal field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
deploy/cloud/helm/platform/components/operator/values.yaml (2)
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The stopSignal field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
deploy/cloud/operator/internal/dynamo/graph_test.go (2)
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The stopSignal field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
deploy/cloud/operator/cmd/main.go (1)
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
🧬 Code Graph Analysis (2)
deploy/cloud/operator/internal/dynamo/graph_test.go (2)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
GroveConfig(33-38)deploy/cloud/operator/api/dynamo/schemas/schemas.go (1)
Duration(38-38)
deploy/cloud/operator/cmd/main.go (1)
deploy/cloud/operator/internal/controller_common/predicate.go (3)
Config(40-48)GroveConfig(33-38)DetectGroveAvailability(63-96)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - vllm
🔇 Additional comments (14)
deploy/cloud/helm/deploy.sh (1)
167-167: No remaining${ENABLE_GROVE}placeholders in Helm templates
A ripgrep search across all.yaml,.yml, and.tplfiles underdeploy/cloud/helmfound no occurrences of${ENABLE_GROVE}, so it’s safe to remove it from theenvsubstwhitelist.deploy/cloud/operator/internal/dynamo/graph_test.go (1)
26-27: Tests correctly cover the new termination-delay logicThe import of
timeand the assertions aroundTerminationDelayensure the new flag is exercised end-to-end.Also applies to: 1140-1143, 1279-1280
deploy/cloud/helm/platform/values.yaml (1)
37-37: Default delay LGTM
15mis a validtime.ParseDurationstring; no action required.deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (2)
147-147: LGTM! Clean refactor to nested Grove configuration.The change from
r.Config.EnableGrovetor.Config.Grove.Enabledis consistent with the refactoring to support dynamic Grove detection while maintaining identical control flow.
311-311: LGTM! Consistent Grove configuration update.The change maintains the same conditional logic for setting up Grove PodGangSet ownership, now using the nested Grove configuration structure.
deploy/cloud/helm/platform/components/operator/values.yaml (1)
85-85: LGTM! Addition of configurable Grove termination delay.The new
groveTerminationDelay: 15mconfiguration replaces the previous booleanenableGroveflag as part of the refactoring to support dynamic Grove detection with configurable termination behavior. The 15-minute default is reasonable.deploy/cloud/operator/internal/controller_common/predicate.go (4)
23-23: LGTM! Necessary imports for Grove detection functionality.The added imports support the new Grove configuration structure (
timefor Duration) and dynamic detection capability (discoveryandctrlfor API group detection).Also applies to: 26-27
33-38: LGTM! Well-designed Grove configuration structure.The
GroveConfigstruct cleanly separates the dynamic enablement detection from the configurable termination delay, with clear documentation explaining the automatic detection behavior.
44-44: LGTM! Clean migration to nested Grove configuration.Replacing the boolean
EnableGrovefield with the structuredGrove GroveConfigfield enables more sophisticated Grove configuration while maintaining clear semantics.
61-96: LGTM! Robust Grove availability detection.The
DetectGroveAvailabilityfunction uses the appropriate Kubernetes discovery client approach to detect Grove API group presence. The implementation includes proper error handling, informative logging, and fail-safe behavior (returning false on errors).deploy/cloud/operator/cmd/main.go (4)
33-33: LGTM! Necessary scheme imports for API extensions and Istio.The added imports and scheme registrations for
apiextensionsv1andistioclientsetschemeare consistent with the existing pattern and support the operator's interaction with these resource types.Also applies to: 78-80
100-100: LGTM! Clean migration from static Grove flag to configurable termination delay.The change from boolean
enable-groveflag to durationgrove-termination-delayflag aligns with the dynamic Grove detection approach. The 15-minute default is reasonable and consistent with Helm values.Also applies to: 128-129
139-143: LGTM! Proper Grove configuration initialization.The configuration correctly initializes the new
GroveConfigstructure withEnabled: false(to be set after detection) andTerminationDelayfrom the command-line flag. The inline comment clarifies the detection timing.
210-214: LGTM! Well-placed Grove availability detection.The Grove detection logic is appropriately positioned after manager creation (required for discovery client) but before controller setup (controllers need Grove status). The logging provides good visibility into the detection process.
Overview:
handle groveTerminationDelay and auto-detect grove installation
Summary by CodeRabbit
New Features
Configuration Changes
groveTerminationDelayduration parameter in configuration files.Bug Fixes