Skip to content

Conversation

@julienmancuso
Copy link
Contributor

@julienmancuso julienmancuso commented Jun 11, 2025

Overview:

couple of fixes after Hippocratic feedback deploying in GCP

Details:

  • buildctl doesn't take into account GCP workload identity : adding an init container to fetch creds using workload identity
  • removing privileged permission for the image builder
  • make sure we only reconcile virtual service if enabled

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features
    • Improved support for Google Container Registry: Automatically generates Docker authentication for Google registries when no Docker config secret is provided.
  • Refactor
    • Unified Docker config volume handling and streamlined pod security context for image builder pods.
    • Simplified error handling and return values for Ingress and VirtualService resource management.
  • Documentation
    • Added detailed guide for Google Kubernetes Engine (GKE) setup, including Workload Identity and Artifact Registry configuration.
    • Included cloud provider-specific deployment instructions for Dynamo Cloud with a focus on GKE.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jun 11, 2025

"""

Walkthrough

This change introduces new constants for Docker configuration, adds logic to detect Google container registries, refactors image builder pod template generation to standardize Docker config volume handling, adds GCP-specific init container logic, simplifies security context handling, refactors error handling in ingress resource management for improved clarity, and adds GKE-specific deployment documentation.

Changes

File(s) Change Summary
deploy/cloud/operator/internal/consts/consts.go Added constants: DockerConfigVolumeName and DockerConfigVolumeMountPath for Docker config volume and mount path.
deploy/cloud/operator/internal/controller/common.go Added isGoogleRegistry(host string) bool function to identify Google container registry hosts.
deploy/cloud/operator/internal/controller/dynamocomponent_controller.go Refactored image builder pod template: unified Docker config volume handling, added GCP init container, simplified security context.
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go Refactored createOrUpdateOrDeleteIngress for explicit error handling and return values.
docs/guides/dynamo_deploy/dynamo_cloud.md Added "Cloud provider specific deployment" section with pointer to GKE deployment guide.
docs/guides/dynamo_deploy/gke_setup.md Added detailed guide for GKE Workload Identity setup and Artifact Registry access configuration.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Controller
    participant PodTemplateGen
    participant GCPInitContainer

    User->>Controller: Request image builder pod template
    Controller->>PodTemplateGen: Generate pod template
    PodTemplateGen->>PodTemplateGen: Add Docker config volume/mount (always)
    PodTemplateGen->>PodTemplateGen: Set DOCKER_CONFIG env var
    PodTemplateGen->>PodTemplateGen: Check for Google registry
    alt Google registry and no Docker config secret
        PodTemplateGen->>GCPInitContainer: Add GCP SDK init container to generate Docker config
    end
    PodTemplateGen->>Controller: Return pod template
    Controller->>User: Respond with pod template
Loading

Possibly related PRs

Poem

In the cloud where pods take flight,
Docker configs now set just right.
Google registries, we know your face—
An init container joins the race!
Security’s simpler, errors clear,
This bunny hops with extra cheer.
🐇✨
"""


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3da44a3 and 4f33e04.

📒 Files selected for processing (1)
  • docs/guides/dynamo_deploy/dynamo_cloud.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/guides/dynamo_deploy/dynamo_cloud.md
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
deploy/cloud/operator/internal/controller/common.go (1)

71-76: The .google.com suffix check is too broad.

The function checks for hosts ending with .google.com, which could match non-registry Google domains and lead to false positives. Google container registries typically use gcr.io, *.gcr.io (e.g., us.gcr.io, eu.gcr.io), and *.pkg.dev domains.

Consider removing the .google.com check or making it more specific:

 func isGoogleRegistry(host string) bool {
 	return host == "gcr.io" ||
 		strings.HasSuffix(host, ".gcr.io") ||
-		strings.HasSuffix(host, ".pkg.dev") ||
-		strings.HasSuffix(host, ".google.com")
+		strings.HasSuffix(host, ".pkg.dev")
 }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e32fe67 and 3184488.

📒 Files selected for processing (4)
  • deploy/cloud/operator/internal/consts/consts.go (1 hunks)
  • deploy/cloud/operator/internal/controller/common.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponent_controller.go (5 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)
deploy/cloud/operator/internal/controller_common/resource.go (1)
  • SyncResource (56-183)
deploy/cloud/operator/internal/controller/dynamocomponent_controller.go (2)
deploy/cloud/operator/internal/consts/consts.go (2)
  • DockerConfigVolumeName (73-73)
  • DockerConfigVolumeMountPath (74-74)
deploy/sdk/src/dynamo/sdk/core/protocol/deployment.py (1)
  • Resources (25-75)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (6)
deploy/cloud/operator/internal/consts/consts.go (1)

73-74: LGTM!

The new constants follow the existing naming convention and provide clear, descriptive names for Docker configuration volume handling.

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)

1003-1020: Good refactoring to improve error handling clarity.

The change from naked returns to explicit returns makes the error handling flow clearer and the combined modification flag logic more obvious.

deploy/cloud/operator/internal/controller/dynamocomponent_controller.go (4)

855-859: Good unification of Docker config volume handling.

The changes ensure that Docker config volume is always present, either backed by a secret or an emptyDir, with the DOCKER_CONFIG environment variable consistently pointing to the mount path.

Also applies to: 862-883, 1146-1149


1007-1038: Consider OAuth2 token expiration for long-running builds.

The GCP OAuth2 access token obtained via gcloud auth print-access-token typically expires after 1 hour. For image builds that take longer than the token lifetime, the push operation might fail with authentication errors.

Consider implementing token refresh logic or using a credential helper that handles token refresh automatically. Additionally, the script should include error handling:

 Command: []string{
 	"/bin/bash",
 	"-c",
 	fmt.Sprintf(`
+set -e
 TOKEN=$(gcloud auth print-access-token)
+if [ -z "$TOKEN" ]; then
+  echo "Failed to obtain GCP access token" >&2
+  exit 1
+fi
 cat > %s/config.json <<EOL
 {
 	"auths": {
 		"%s": {
-			"auth": "$(echo -n "oauth2accesstoken:${TOKEN}" | base64 -w 0)"
+			"auth": "$(echo -n "oauth2accesstoken:${TOKEN}" | base64 | tr -d '\n')"
 		}
 	}
 }
 EOL
 echo 'Docker config.json created successfully'
 `, consts.DockerConfigVolumeMountPath, dockerRegistry),
 },

Note: Using tr -d '\n' is more portable than base64 -w 0.


1295-1299:

❓ Verification inconclusive

Verify UID 1000 compatibility with all container images.

The pod-level security context runs all containers (downloader, GCP init container, and builder) as UID/GID 1000. Please ensure all container images support running as this non-root user.


🏁 Script executed:

#!/bin/bash
# Check if the container images used are documented to support non-root execution
# Note: This is a verification request for the images' compatibility

echo "Container images to verify for UID 1000 compatibility:"
echo "1. Downloader: Check 'rapidfort/curl:latest' (from InternalImagesDynamoComponentsDownloaderDefault)"
echo "2. GCP Init: google/cloud-sdk:slim"
echo "3. Kaniko: gcr.io/kaniko-project/executor:debug" 
echo "4. Buildkit: moby/buildkit:v0.20.2"
echo "5. Buildkit-rootless: moby/buildkit:v0.20.2-rootless"

Length of output: 711


Verify non-root UID/GID 1000 support for all images

Please confirm that every container image used by this Pod supports running as UID/GID 1000 and honors FSGroup 1000 (e.g., files owned by root will be writable by GID 1000). In particular, verify the following images:

  • Downloader: rapidfort/curl:latest
  • GCP init: google/cloud-sdk:slim
  • Kaniko executor: gcr.io/kaniko-project/executor:debug
  • BuildKit: moby/buildkit:v0.20.2
  • BuildKit (rootless): moby/buildkit:v0.20.2-rootless

You can check each image’s Dockerfile or official docs for a USER declaration or tests validating non-root execution. If any image does not support running as UID 1000, either adjust its permissions (e.g., via an init container) or choose an appropriate user.


1009-1010: 🛠️ Refactor suggestion

Add validation for image name parsing.

The code assumes the image name contains at least one "/" character. This could cause a panic for image names without a registry prefix.

-		dockerRegistry := strings.Split(imageName, "/")[0]
+		parts := strings.SplitN(imageName, "/", 2)
+		if len(parts) < 2 {
+			// Handle case where image name doesn't include registry
+			return nil, fmt.Errorf("invalid image name format: %s", imageName)
+		}
+		dockerRegistry := parts[0]

Likely an incorrect or invalid review comment.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🔭 Outside diff range comments (1)
deploy/cloud/operator/internal/controller/dynamocomponent_controller.go (1)

861-883: 🛠️ Refactor suggestion

Secret/emptyDir branch: tighten file permissions

For the secret branch we still rely on the default file mode (0666 → 0644 after umask) which leaves config.json world-readable inside the pod. The token generated by the GCP init-container may grant push rights to your registry and shouldn’t be readable by every container/user.

-Secret: &corev1.SecretVolumeSource{
-  SecretName: dockerConfigJSONSecretName,
-  Items: []corev1.KeyToPath{
-    {Key: ".dockerconfigjson", Path: "config.json"},
-  },
-},
+Secret: &corev1.SecretVolumeSource{
+  SecretName: dockerConfigJSONSecretName,
+  DefaultMode: ptr.To(int32(0o440)),
+  Items: []corev1.KeyToPath{
+    {Key: ".dockerconfigjson", Path: "config.json"},
+  },
+},

Same applies to the EmptyDir fall-back after the GCP init container populates it.
[security]

🧹 Nitpick comments (1)
deploy/cloud/operator/internal/controller/dynamocomponent_controller.go (1)

1146-1149: Explicit DOCKER_CONFIG env – check downloader + init containers

Builder now has the variable, but dynamocomponent-downloader & gcp-init-docker-config also refer to the config (e.g. for pre-authenticated gsutil). Consider appending the same env var to volumeMounts-consuming init containers for consistency.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3184488 and fbc0046.

📒 Files selected for processing (1)
  • deploy/cloud/operator/internal/controller/dynamocomponent_controller.go (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (1)
deploy/cloud/operator/internal/controller/dynamocomponent_controller.go (1)

855-859: 👍 Consistent docker-config mount

Using the shared constants for the volume name and mount path removes a lot of duplication in downstream code and makes the mount immediately discoverable.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
docs/guides/dynamo_deploy/gke_setup.md (2)

23-27: Clarify Workload Identity terminology
GKE uses workloadIdentityConfig.workloadPool (a pool), not a “namespace.” Rename the heading and description accordingly to avoid confusion.

Example:

- # Retrieve the Workload Identifier Namespace associated with your cluster:
+ # Retrieve the Workload Identity pool associated with your cluster:

96-96: Add missing article and improve phrasing
Insert “the” before “Artifact Registry” and switch to “ensure that” for clarity.

Suggested rewrite:

- This is needed to make sure pods can pull images from Artifact Registry without needing to specify an imagePullSecret
+ This is needed to ensure that pods can pull images from the Artifact Registry without needing to specify an imagePullSecret.
🧰 Tools
🪛 LanguageTool

[uncategorized] ~96-~96: You might be missing the article “the” here.
Context: ... to make sure pods can pull images from Artifact Registry without needing to specify an ...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30a3b29 and ee7d2f4.

📒 Files selected for processing (2)
  • docs/guides/dynamo_deploy/dynamo_cloud.md (1 hunks)
  • docs/guides/dynamo_deploy/gke_setup.md (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/guides/dynamo_deploy/dynamo_cloud.md
🧰 Additional context used
🪛 LanguageTool
docs/guides/dynamo_deploy/gke_setup.md

[uncategorized] ~96-~96: You might be missing the article “the” here.
Context: ... to make sure pods can pull images from Artifact Registry without needing to specify an ...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm

@julienmancuso julienmancuso requested a review from whoisj June 12, 2025 16:22
@julienmancuso julienmancuso enabled auto-merge (squash) June 13, 2025 19:48
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
docs/guides/dynamo_deploy/gke_setup.md (2)

41-43: 🛠️ Refactor suggestion

Ensure a space before backslashes for CLI line continuation
Shell examples must have a space before the trailing backslash (\) to work correctly.

Apply this diff:

- gcloud iam service-accounts create workload-identity-sa\
+ gcloud iam service-accounts create workload-identity-sa \

29-31: 🛠️ Refactor suggestion

Use standard Markdown admonition syntax
The fenced {important} block isn’t supported by GitHub-flavored Markdown. Replace it with a proper admonition (e.g., !!! important) so the note renders correctly.

Apply this diff:

- ```{important}
- Make sure Workload Identity is enabled in your cluster!
- ```
+ !!! important
+ Make sure Workload Identity is enabled in your cluster!
🧹 Nitpick comments (3)
docs/guides/dynamo_deploy/gke_setup.md (3)

23-23: Clarify Workload Identity terminology
“Workload Identifier Namespace” is non-standard. Consider using “Workload Identity pool” to align with GCP docs.

Example:

- # Retrieve the Workload Identifier Namespace associated with your cluster:
+ # Retrieve the Workload Identity pool associated with your cluster:

95-95: Consider adding “the” for clarity
To improve readability, you could say “to make sure the pods can pull images from the Artifact Registry.”

- This is needed to make sure pods can pull images from Artifact Registry without needing to specify an imagePullSecret
+ This is needed to make sure the pods can pull images from the Artifact Registry without needing to specify an imagePullSecret

131-131: Use consistent placeholders in code snippets
The YAML examples mix ... and ...., which can confuse readers or tooling. Use a single convention (e.g., ...) or switch to comments (# …).

-   ...
+   # … (or use `...` consistently)

Also applies to: 137-137, 152-152

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e25ba28 and 3da44a3.

📒 Files selected for processing (2)
  • docs/guides/dynamo_deploy/dynamo_cloud.md (1 hunks)
  • docs/guides/dynamo_deploy/gke_setup.md (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/guides/dynamo_deploy/dynamo_cloud.md
🧰 Additional context used
🪛 LanguageTool
docs/guides/dynamo_deploy/gke_setup.md

[uncategorized] ~96-~96: You might be missing the article “the” here.
Context: ... to make sure pods can pull images from Artifact Registry without needing to specify an ...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: Build and Test - vllm

@julienmancuso julienmancuso enabled auto-merge (squash) June 13, 2025 23:00
@julienmancuso julienmancuso merged commit 648740e into main Jun 13, 2025
9 of 11 checks passed
@julienmancuso julienmancuso deleted the jsm/gcp branch June 13, 2025 23:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants