Skip to content

Conversation

@julienmancuso
Copy link
Contributor

@julienmancuso julienmancuso commented Jul 30, 2025

Overview:

handle groveTerminationDelay and auto-detect grove installation

Summary by CodeRabbit

  • New Features

    • Added support for configuring Grove termination delay via a new duration-based setting.
    • Grove functionality is now automatically enabled if the required CRD is detected in the cluster.
  • Configuration Changes

    • Replaced the previous boolean flag for enabling Grove with a new groveTerminationDelay duration parameter in configuration files.
    • The enablement of Grove permissions in RBAC and deployment is now unconditional.
  • Bug Fixes

    • Ensured Grove termination delay is correctly set in generated PodGangSet resources.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jul 30, 2025

Walkthrough

This change removes the static enableGrove configuration flag and replaces it with a dynamic runtime detection of Grove via CRD presence, along with a new groveTerminationDelay duration setting. The Helm charts, operator configuration, and controller logic are updated to use the new delay parameter and to enable Grove features only if the relevant CRD is detected at runtime.

Changes

Cohort / File(s) Change Summary
Helm Values and Templates Update
deploy/cloud/helm/platform/components/operator/templates/deployment.yaml, deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml, deploy/cloud/helm/platform/components/operator/values.yaml, deploy/cloud/helm/platform/values.yaml, deploy/cloud/helm/dynamo-platform-values.yaml, deploy/cloud/helm/deploy.sh
Removed the enableGrove flag and all related conditionals; introduced groveTerminationDelay as a duration parameter; RBAC rules for Grove are now always included unconditionally; deployment template now injects the new delay parameter if set; environment variable ENABLE_GROVE removed from deployment script and templates.
Operator Main Entrypoint
deploy/cloud/operator/cmd/main.go
Replaced the enableGrove flag with groveTerminationDelay; updated controller config to use a nested Grove struct; Grove enablement is now detected at runtime via CRD discovery after manager creation; configuration struct and flag registration updated accordingly.
Controller Common Logic
deploy/cloud/operator/internal/controller_common/predicate.go
Updated Config struct to use a nested Grove config; added DetectGroveAvailability function to check for Grove CRD presence and set enablement dynamically; introduced new struct for Grove configuration and imported Kubernetes discovery client.
Dynamo Controller Logic
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go
Updated code to check r.Config.Grove.Enabled instead of the old boolean for Grove-related logic; no change in logic, only in field accessed.
Dynamo Graph Generation
deploy/cloud/operator/internal/dynamo/graph.go
Sets TerminationDelay in generated Grove PodGangSet templates using the new configuration parameter.
Dynamo Graph Tests
deploy/cloud/operator/internal/dynamo/graph_test.go
Test updated to include and verify the new TerminationDelay field in the generated PodGangSet, matching the configuration.

Sequence Diagram(s)

sequenceDiagram
    participant Helm
    participant Operator Main
    participant Controller Common
    participant K8s API
    participant Controller

    Helm->>Operator Main: Passes groveTerminationDelay config
    Operator Main->>Controller Common: DetectGroveAvailability(ctx, mgr)
    Controller Common->>K8s API: Query for "grove.io" API group
    K8s API-->>Controller Common: Return API group presence status
    Controller Common-->>Operator Main: Returns Grove enabled/disabled
    Operator Main->>Controller: Passes Grove config (Enabled, TerminationDelay)
    Controller->>Controller: Uses Grove features if Enabled, applies TerminationDelay
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇
Goodbye to static flags, Grove now finds its way,
By peeking for CRDs in clusters every day.
A gentle delay is set for pods to say goodbye,
With templates and configs now reaching for the sky.
Dynamic and nimble, this code hops ahead—
The future is runtime, not just what’s in your YAML bed!

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.2.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@julienmancuso
Copy link
Contributor Author

@coderabbitai full review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jul 31, 2025

✅ Actions performed

Full review triggered.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1)

103-105: Quote the duration to avoid YAML edge-cases

Quoting protects against values that contain special characters or leading zeros.

-  - --grove-termination-delay={{ .Values.dynamo.groveTerminationDelay }}
+  - --grove-termination-delay="{{ .Values.dynamo.groveTerminationDelay }}"
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dabfea3 and a9eb58c.

📒 Files selected for processing (11)
  • deploy/cloud/helm/deploy.sh (1 hunks)
  • deploy/cloud/helm/dynamo-platform-values.yaml (0 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (0 hunks)
  • deploy/cloud/helm/platform/components/operator/values.yaml (1 hunks)
  • deploy/cloud/helm/platform/values.yaml (1 hunks)
  • deploy/cloud/operator/cmd/main.go (5 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (2 hunks)
  • deploy/cloud/operator/internal/controller_common/predicate.go (2 hunks)
  • deploy/cloud/operator/internal/dynamo/graph.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/graph_test.go (3 hunks)
💤 Files with no reviewable changes (2)
  • deploy/cloud/helm/dynamo-platform-values.yaml
  • deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
deploy/cloud/helm/platform/values.yaml (3)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The stopSignal field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

Learnt from: julienmancuso
PR: #1337
File: deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml:0-0
Timestamp: 2025-06-03T15:26:55.732Z
Learning: The image-builder ServiceAccount in deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml does not need imagePullSecrets, unlike the component ServiceAccount.

deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (5)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The stopSignal field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

Learnt from: julienmancuso
PR: #1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The DYN_DEPLOYMENT_CONFIG environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.

Learnt from: julienmancuso
PR: #1337
File: deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml:0-0
Timestamp: 2025-06-03T15:26:55.732Z
Learning: The image-builder ServiceAccount in deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml does not need imagePullSecrets, unlike the component ServiceAccount.

deploy/cloud/operator/internal/dynamo/graph.go (2)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The stopSignal field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

deploy/cloud/helm/deploy.sh (1)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (2)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The stopSignal field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

deploy/cloud/helm/platform/components/operator/values.yaml (2)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The stopSignal field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

deploy/cloud/operator/internal/dynamo/graph_test.go (2)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The stopSignal field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

deploy/cloud/operator/cmd/main.go (1)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

🧬 Code Graph Analysis (2)
deploy/cloud/operator/internal/dynamo/graph_test.go (2)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
  • GroveConfig (33-38)
deploy/cloud/operator/api/dynamo/schemas/schemas.go (1)
  • Duration (38-38)
deploy/cloud/operator/cmd/main.go (1)
deploy/cloud/operator/internal/controller_common/predicate.go (3)
  • Config (40-48)
  • GroveConfig (33-38)
  • DetectGroveAvailability (63-96)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (14)
deploy/cloud/helm/deploy.sh (1)

167-167: No remaining ${ENABLE_GROVE} placeholders in Helm templates
A ripgrep search across all .yaml, .yml, and .tpl files under deploy/cloud/helm found no occurrences of ${ENABLE_GROVE}, so it’s safe to remove it from the envsubst whitelist.

deploy/cloud/operator/internal/dynamo/graph_test.go (1)

26-27: Tests correctly cover the new termination-delay logic

The import of time and the assertions around TerminationDelay ensure the new flag is exercised end-to-end.

Also applies to: 1140-1143, 1279-1280

deploy/cloud/helm/platform/values.yaml (1)

37-37: Default delay LGTM

15m is a valid time.ParseDuration string; no action required.

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (2)

147-147: LGTM! Clean refactor to nested Grove configuration.

The change from r.Config.EnableGrove to r.Config.Grove.Enabled is consistent with the refactoring to support dynamic Grove detection while maintaining identical control flow.


311-311: LGTM! Consistent Grove configuration update.

The change maintains the same conditional logic for setting up Grove PodGangSet ownership, now using the nested Grove configuration structure.

deploy/cloud/helm/platform/components/operator/values.yaml (1)

85-85: LGTM! Addition of configurable Grove termination delay.

The new groveTerminationDelay: 15m configuration replaces the previous boolean enableGrove flag as part of the refactoring to support dynamic Grove detection with configurable termination behavior. The 15-minute default is reasonable.

deploy/cloud/operator/internal/controller_common/predicate.go (4)

23-23: LGTM! Necessary imports for Grove detection functionality.

The added imports support the new Grove configuration structure (time for Duration) and dynamic detection capability (discovery and ctrl for API group detection).

Also applies to: 26-27


33-38: LGTM! Well-designed Grove configuration structure.

The GroveConfig struct cleanly separates the dynamic enablement detection from the configurable termination delay, with clear documentation explaining the automatic detection behavior.


44-44: LGTM! Clean migration to nested Grove configuration.

Replacing the boolean EnableGrove field with the structured Grove GroveConfig field enables more sophisticated Grove configuration while maintaining clear semantics.


61-96: LGTM! Robust Grove availability detection.

The DetectGroveAvailability function uses the appropriate Kubernetes discovery client approach to detect Grove API group presence. The implementation includes proper error handling, informative logging, and fail-safe behavior (returning false on errors).

deploy/cloud/operator/cmd/main.go (4)

33-33: LGTM! Necessary scheme imports for API extensions and Istio.

The added imports and scheme registrations for apiextensionsv1 and istioclientsetscheme are consistent with the existing pattern and support the operator's interaction with these resource types.

Also applies to: 78-80


100-100: LGTM! Clean migration from static Grove flag to configurable termination delay.

The change from boolean enable-grove flag to duration grove-termination-delay flag aligns with the dynamic Grove detection approach. The 15-minute default is reasonable and consistent with Helm values.

Also applies to: 128-129


139-143: LGTM! Proper Grove configuration initialization.

The configuration correctly initializes the new GroveConfig structure with Enabled: false (to be set after detection) and TerminationDelay from the command-line flag. The inline comment clarifies the detection timing.


210-214: LGTM! Well-placed Grove availability detection.

The Grove detection logic is appropriately positioned after manager creation (required for discovery client) but before controller setup (controllers need Grove status). The logging provides good visibility into the detection process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants