-
Notifications
You must be signed in to change notification settings - Fork 690
fix: enable GCP deployments #1474
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
""" WalkthroughThis change introduces new constants for Docker configuration, adds logic to detect Google container registries, refactors image builder pod template generation to standardize Docker config volume handling, adds GCP-specific init container logic, simplifies security context handling, refactors error handling in ingress resource management for improved clarity, and adds GKE-specific deployment documentation. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Controller
participant PodTemplateGen
participant GCPInitContainer
User->>Controller: Request image builder pod template
Controller->>PodTemplateGen: Generate pod template
PodTemplateGen->>PodTemplateGen: Add Docker config volume/mount (always)
PodTemplateGen->>PodTemplateGen: Set DOCKER_CONFIG env var
PodTemplateGen->>PodTemplateGen: Check for Google registry
alt Google registry and no Docker config secret
PodTemplateGen->>GCPInitContainer: Add GCP SDK init container to generate Docker config
end
PodTemplateGen->>Controller: Return pod template
Controller->>User: Respond with pod template
Possibly related PRs
Poem
📜 Recent review detailsConfiguration used: .coderabbit.yaml 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
deploy/cloud/operator/internal/controller/common.go (1)
71-76: The.google.comsuffix check is too broad.The function checks for hosts ending with
.google.com, which could match non-registry Google domains and lead to false positives. Google container registries typically usegcr.io,*.gcr.io(e.g.,us.gcr.io,eu.gcr.io), and*.pkg.devdomains.Consider removing the
.google.comcheck or making it more specific:func isGoogleRegistry(host string) bool { return host == "gcr.io" || strings.HasSuffix(host, ".gcr.io") || - strings.HasSuffix(host, ".pkg.dev") || - strings.HasSuffix(host, ".google.com") + strings.HasSuffix(host, ".pkg.dev") }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
deploy/cloud/operator/internal/consts/consts.go(1 hunks)deploy/cloud/operator/internal/controller/common.go(1 hunks)deploy/cloud/operator/internal/controller/dynamocomponent_controller.go(5 hunks)deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)
deploy/cloud/operator/internal/controller_common/resource.go (1)
SyncResource(56-183)
deploy/cloud/operator/internal/controller/dynamocomponent_controller.go (2)
deploy/cloud/operator/internal/consts/consts.go (2)
DockerConfigVolumeName(73-73)DockerConfigVolumeMountPath(74-74)deploy/sdk/src/dynamo/sdk/core/protocol/deployment.py (1)
Resources(25-75)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - vllm
🔇 Additional comments (6)
deploy/cloud/operator/internal/consts/consts.go (1)
73-74: LGTM!The new constants follow the existing naming convention and provide clear, descriptive names for Docker configuration volume handling.
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)
1003-1020: Good refactoring to improve error handling clarity.The change from naked returns to explicit returns makes the error handling flow clearer and the combined modification flag logic more obvious.
deploy/cloud/operator/internal/controller/dynamocomponent_controller.go (4)
855-859: Good unification of Docker config volume handling.The changes ensure that Docker config volume is always present, either backed by a secret or an emptyDir, with the DOCKER_CONFIG environment variable consistently pointing to the mount path.
Also applies to: 862-883, 1146-1149
1007-1038: Consider OAuth2 token expiration for long-running builds.The GCP OAuth2 access token obtained via
gcloud auth print-access-tokentypically expires after 1 hour. For image builds that take longer than the token lifetime, the push operation might fail with authentication errors.Consider implementing token refresh logic or using a credential helper that handles token refresh automatically. Additionally, the script should include error handling:
Command: []string{ "/bin/bash", "-c", fmt.Sprintf(` +set -e TOKEN=$(gcloud auth print-access-token) +if [ -z "$TOKEN" ]; then + echo "Failed to obtain GCP access token" >&2 + exit 1 +fi cat > %s/config.json <<EOL { "auths": { "%s": { - "auth": "$(echo -n "oauth2accesstoken:${TOKEN}" | base64 -w 0)" + "auth": "$(echo -n "oauth2accesstoken:${TOKEN}" | base64 | tr -d '\n')" } } } EOL echo 'Docker config.json created successfully' `, consts.DockerConfigVolumeMountPath, dockerRegistry), },Note: Using
tr -d '\n'is more portable thanbase64 -w 0.
1295-1299:❓ Verification inconclusive
Verify UID 1000 compatibility with all container images.
The pod-level security context runs all containers (downloader, GCP init container, and builder) as UID/GID 1000. Please ensure all container images support running as this non-root user.
🏁 Script executed:
#!/bin/bash # Check if the container images used are documented to support non-root execution # Note: This is a verification request for the images' compatibility echo "Container images to verify for UID 1000 compatibility:" echo "1. Downloader: Check 'rapidfort/curl:latest' (from InternalImagesDynamoComponentsDownloaderDefault)" echo "2. GCP Init: google/cloud-sdk:slim" echo "3. Kaniko: gcr.io/kaniko-project/executor:debug" echo "4. Buildkit: moby/buildkit:v0.20.2" echo "5. Buildkit-rootless: moby/buildkit:v0.20.2-rootless"Length of output: 711
Verify non-root UID/GID 1000 support for all images
Please confirm that every container image used by this Pod supports running as UID/GID 1000 and honors FSGroup 1000 (e.g., files owned by root will be writable by GID 1000). In particular, verify the following images:
- Downloader: rapidfort/curl:latest
- GCP init: google/cloud-sdk:slim
- Kaniko executor: gcr.io/kaniko-project/executor:debug
- BuildKit: moby/buildkit:v0.20.2
- BuildKit (rootless): moby/buildkit:v0.20.2-rootless
You can check each image’s Dockerfile or official docs for a
USERdeclaration or tests validating non-root execution. If any image does not support running as UID 1000, either adjust its permissions (e.g., via an init container) or choose an appropriate user.
1009-1010: 🛠️ Refactor suggestionAdd validation for image name parsing.
The code assumes the image name contains at least one "/" character. This could cause a panic for image names without a registry prefix.
- dockerRegistry := strings.Split(imageName, "/")[0] + parts := strings.SplitN(imageName, "/", 2) + if len(parts) < 2 { + // Handle case where image name doesn't include registry + return nil, fmt.Errorf("invalid image name format: %s", imageName) + } + dockerRegistry := parts[0]Likely an incorrect or invalid review comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
deploy/cloud/operator/internal/controller/dynamocomponent_controller.go (1)
861-883: 🛠️ Refactor suggestionSecret/emptyDir branch: tighten file permissions
For the secret branch we still rely on the default file mode (0666 → 0644 after umask) which leaves
config.jsonworld-readable inside the pod. The token generated by the GCP init-container may grant push rights to your registry and shouldn’t be readable by every container/user.-Secret: &corev1.SecretVolumeSource{ - SecretName: dockerConfigJSONSecretName, - Items: []corev1.KeyToPath{ - {Key: ".dockerconfigjson", Path: "config.json"}, - }, -}, +Secret: &corev1.SecretVolumeSource{ + SecretName: dockerConfigJSONSecretName, + DefaultMode: ptr.To(int32(0o440)), + Items: []corev1.KeyToPath{ + {Key: ".dockerconfigjson", Path: "config.json"}, + }, +},Same applies to the
EmptyDirfall-back after the GCP init container populates it.
[security]
🧹 Nitpick comments (1)
deploy/cloud/operator/internal/controller/dynamocomponent_controller.go (1)
1146-1149: Explicit DOCKER_CONFIG env – check downloader + init containersBuilder now has the variable, but
dynamocomponent-downloader&gcp-init-docker-configalso refer to the config (e.g. for pre-authenticatedgsutil). Consider appending the same env var tovolumeMounts-consuming init containers for consistency.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deploy/cloud/operator/internal/controller/dynamocomponent_controller.go(5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - vllm
🔇 Additional comments (1)
deploy/cloud/operator/internal/controller/dynamocomponent_controller.go (1)
855-859: 👍 Consistent docker-config mountUsing the shared constants for the volume name and mount path removes a lot of duplication in downstream code and makes the mount immediately discoverable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (2)
docs/guides/dynamo_deploy/gke_setup.md (2)
23-27: Clarify Workload Identity terminology
GKE usesworkloadIdentityConfig.workloadPool(a pool), not a “namespace.” Rename the heading and description accordingly to avoid confusion.Example:
- # Retrieve the Workload Identifier Namespace associated with your cluster: + # Retrieve the Workload Identity pool associated with your cluster:
96-96: Add missing article and improve phrasing
Insert “the” before “Artifact Registry” and switch to “ensure that” for clarity.Suggested rewrite:
- This is needed to make sure pods can pull images from Artifact Registry without needing to specify an imagePullSecret + This is needed to ensure that pods can pull images from the Artifact Registry without needing to specify an imagePullSecret.🧰 Tools
🪛 LanguageTool
[uncategorized] ~96-~96: You might be missing the article “the” here.
Context: ... to make sure pods can pull images from Artifact Registry without needing to specify an ...(AI_EN_LECTOR_MISSING_DETERMINER_THE)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
docs/guides/dynamo_deploy/dynamo_cloud.md(1 hunks)docs/guides/dynamo_deploy/gke_setup.md(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- docs/guides/dynamo_deploy/dynamo_cloud.md
🧰 Additional context used
🪛 LanguageTool
docs/guides/dynamo_deploy/gke_setup.md
[uncategorized] ~96-~96: You might be missing the article “the” here.
Context: ... to make sure pods can pull images from Artifact Registry without needing to specify an ...
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
docs/guides/dynamo_deploy/gke_setup.md (2)
41-43: 🛠️ Refactor suggestionEnsure a space before backslashes for CLI line continuation
Shell examples must have a space before the trailing backslash (\) to work correctly.Apply this diff:
- gcloud iam service-accounts create workload-identity-sa\ + gcloud iam service-accounts create workload-identity-sa \
29-31: 🛠️ Refactor suggestionUse standard Markdown admonition syntax
The fenced{important}block isn’t supported by GitHub-flavored Markdown. Replace it with a proper admonition (e.g.,!!! important) so the note renders correctly.Apply this diff:
- ```{important} - Make sure Workload Identity is enabled in your cluster! - ``` + !!! important + Make sure Workload Identity is enabled in your cluster!
🧹 Nitpick comments (3)
docs/guides/dynamo_deploy/gke_setup.md (3)
23-23: Clarify Workload Identity terminology
“Workload Identifier Namespace” is non-standard. Consider using “Workload Identity pool” to align with GCP docs.Example:
- # Retrieve the Workload Identifier Namespace associated with your cluster: + # Retrieve the Workload Identity pool associated with your cluster:
95-95: Consider adding “the” for clarity
To improve readability, you could say “to make sure the pods can pull images from the Artifact Registry.”- This is needed to make sure pods can pull images from Artifact Registry without needing to specify an imagePullSecret + This is needed to make sure the pods can pull images from the Artifact Registry without needing to specify an imagePullSecret
131-131: Use consistent placeholders in code snippets
The YAML examples mix...and...., which can confuse readers or tooling. Use a single convention (e.g.,...) or switch to comments (# …).- ... + # … (or use `...` consistently)Also applies to: 137-137, 152-152
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
docs/guides/dynamo_deploy/dynamo_cloud.md(1 hunks)docs/guides/dynamo_deploy/gke_setup.md(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- docs/guides/dynamo_deploy/dynamo_cloud.md
🧰 Additional context used
🪛 LanguageTool
docs/guides/dynamo_deploy/gke_setup.md
[uncategorized] ~96-~96: You might be missing the article “the” here.
Context: ... to make sure pods can pull images from Artifact Registry without needing to specify an ...
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - vllm
Overview:
couple of fixes after Hippocratic feedback deploying in GCP
Details:
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit