Conversation

@zxue2 zxue2 commented Oct 10, 2025

Overview:

Add a GPU type to the dynamo operator and Helm CRDs to support heterogeneous devices for k8s deployment.

Details:

  • Update the cloud Helm CRDs by adding gpu_type
  • Update the cloud dynamo operator by adding GPUType and Intel GPU resources
  • Update the Helm chart template by adding GPU resources based on gpu_type
  • Add a vLLM deployment sample, agg.xe.yaml

Where should the reviewer start?

agg.xe.yaml reflects how it will work.
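
For quick orientation, here is a minimal sketch of the shape that manifest takes. The apiVersion and the exact service field names are assumptions for illustration; agg.xe.yaml in the diff is authoritative:

apiVersion: nvidia.com/v1alpha1   # assumed group/version for the Dynamo CRDs
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg
spec:
  services:
    Frontend:
      replicas: 1
    VllmDecodeWorker:
      replicas: 1
      resources:
        limits:
          gpu: "1"         # GPU count, in the dynamo CR convention
          gpu_type: "xe"   # new field; the operator maps it to gpu.intel.com/xe

The actual sample additionally carries image overrides and runs the Qwen/Qwen3-0.6B model on the decode worker.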

Related Issues

Resolves the k8s-relevant part of #3303

Summary by CodeRabbit

  • New Features
    • Added GPU type selection in deployments. You can now specify gpu_type (xe, i915, or default Nvidia) to target the correct GPU vendor, ensuring accurate scheduling and resource requests across deployments and charts.
    • Enhanced compatibility for Intel GPUs, enabling seamless deployment without manual resource key changes. Autoscaling and limits respect the chosen GPU type.
    • Introduced a vllm-agg deployment preset with frontend and decode worker services, preconfigured to run a lightweight Qwen model.

@zxue2 zxue2 requested a review from a team as a code owner October 10, 2025 14:00

copy-pr-bot bot commented Oct 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

👋 Hi zxue2! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added feat external-contribution Pull request is from an external contributor labels Oct 10, 2025

coderabbitai bot commented Oct 10, 2025

Walkthrough

Introduces GPU type awareness across CRDs, operator types/logic, and Helm templates to map GPU requests to vendor-specific resource keys (Intel xe/i915 or Nvidia). Adds Intel GPU resource constants. Includes a new vllm-agg DynamoGraphDeployment manifest specifying gpu_type "xe" and a decode worker model configuration.

Changes

| Cohort / File(s) | Change summary |
| --- | --- |
| **CRD schemas**<br>deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml, deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml | Add gpu_type (string) alongside gpu under spec.resources.{limits,requests}, allowing a GPU vendor/type to be specified with the count (see the schema sketch after this table). |
| **Operator API types**<br>deploy/cloud/operator/api/dynamo/common/common.go, deploy/cloud/operator/api/dynamo/common/zz_generated.deepcopy.go, deploy/cloud/operator/internal/dynamo/graph.go | Add GPUType fields (string or *string) to the resource structs and HPA conf; generate deepcopy for the pointer field. |
| **Operator resource mapping**<br>deploy/cloud/operator/internal/controller_common/resource.go, deploy/cloud/operator/internal/consts/consts.go | Introduce Intel GPU resource constants (gpu.intel.com/xe, gpu.intel.com/i915) and map gpu_type to the corresponding K8s resource key; default to Nvidia when unspecified or unrecognized. |
| **Helm deployment templating**<br>deploy/helm/chart/templates/deployment.yaml | Conditionally render GPU resource keys in requests/limits based on .serviceSpec.resources.gpu_type (xe → gpu.intel.com/xe, i915 → gpu.intel.com/i915, else nvidia.com/gpu). |
| **New deployment manifest**<br>components/backends/vllm/deploy/agg.xe.yaml | Add a vllm-agg DynamoGraphDeployment with frontend and VllmDecodeWorker services; sets gpu_type: xe, one GPU, image overrides, and model Qwen/Qwen3-0.6B. |
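
For the CRD rows above, a rough sketch of what the schema addition looks like in OpenAPI terms (nesting inferred from the change summary; the CRD files themselves are authoritative):

resources:
  properties:
    limits:
      properties:
        gpu:
          type: string
        gpu_type:
          type: string   # "xe" or "i915"; unset falls back to Nvidia
    requests:
      properties:
        gpu:
          type: string
        gpu_type:
          type: string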

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User as User
  participant CRD as Dynamo* CRDs
  participant Operator as Operator Reconciler
  participant Helm as Helm Template
  participant K8s as Kubernetes API
  participant Scheduler as K8s Scheduler
  Note over User,CRD: Define deployment with resources.gpu_type and gpu count

  User->>CRD: Apply DynamoGraphDeployment (gpu_type=xe, gpu=1)
  CRD-->>Operator: Reconcile event
  Operator->>Operator: Map gpu_type → resource key\nxe/i915 → gpu.intel.com/*\nelse → nvidia.com/gpu
  Operator->>Helm: Provide values (resource key, gpu count)
  Helm->>K8s: Render Deployment/Pod spec with requests/limits\nfor mapped GPU resource
  K8s-->>Scheduler: Pod admitted
  Scheduler->>K8s: Place pod on node matching GPU resource
  Note right of Scheduler: Default path used if gpu_type absent/other
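
Concretely, the mapping step in the diagram turns a gpu: "1" request into one of the following vendor-specific resource keys in the rendered pod spec (illustrative output):

# gpu_type: "xe"
resources:
  limits:
    gpu.intel.com/xe: "1"

# gpu_type: "i915"
resources:
  limits:
    gpu.intel.com/i915: "1"

# gpu_type unset or unrecognized (default)
resources:
  limits:
    nvidia.com/gpu: "1"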

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I twitch my ears at maps of GPU might,
From xe to i915, we set the keys just right.
Helm hums a chart, the operator nods,
Pods hop to nodes like nimble little gods.
One carrot for Nvidia, two for Intel’s crew—
Deploy and nibble: vllm hops anew! 🥕🐇

Pre-merge checks

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title succinctly summarizes the primary change of adding GPU type support and enabling Intel GPU resources, accurately reflecting the main additions in the PR without extraneous detail. |
| Description Check | ✅ Passed | The PR description follows the repository template (Overview, Details, Where should the reviewer start, Related Issues), describes the main changes (CRD, operator, Helm templates, and sample deployment) succinctly, and points reviewers to the agg.xe.yaml example and the related GitHub issue. Optional details such as file-by-file reviewer guidance, testing steps, or migration notes are absent but non-critical for an initial review. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (5)
deploy/cloud/operator/internal/controller_common/resource.go (2)

426-432: Consider using constants for GPU type strings.

The hardcoded strings "xe" and "i915" should be defined as constants for maintainability and to prevent typos. This would also make it easier to support additional GPU types in the future.

Consider defining constants in consts.go:

const (
    GPUTypeXe     = "xe"
    GPUTypei915   = "i915"
    GPUTypeNvidia = "" // default/empty
)

Then use them in the conditional:

-			if resources.Limits.GPUType == "xe" {
+			if resources.Limits.GPUType == consts.GPUTypeXe {
 				currentResources.Limits[corev1.ResourceName(consts.KubeResourceGPUXeIntel)] = q
-			} else if resources.Limits.GPUType == "i915" {
+			} else if resources.Limits.GPUType == consts.GPUTypei915 {
 				currentResources.Limits[corev1.ResourceName(consts.KubeResourceGPUi915Intel)] = q
 			} else {
 				currentResources.Limits[corev1.ResourceName(consts.KubeResourceGPUNvidia)] = q
 			}

426-432: Consider validating GPUType values.

Invalid GPUType values silently default to Nvidia GPUs, which could mask configuration errors. Consider adding validation or logging for unrecognized GPU types to help users catch typos or unsupported values.

Example validation approach:

switch resources.Limits.GPUType {
case "", "xe", "i915": // empty string means Nvidia (default)
    // recognized value; proceed
default:
    return nil, fmt.Errorf("unsupported GPU type: %s (supported: xe, i915)", resources.Limits.GPUType)
}
components/backends/vllm/deploy/agg.xe.yaml (1)

22-25: Consider: Specify gpu_type in both requests and limits.

The gpu_type is specified only under limits. While the Helm template should handle this, consider adding it to requests as well for consistency and clarity, especially if the template logic changes in the future.

Apply this diff to add gpu_type to requests:

       resources:
+        requests:
+          gpu: "1"
+          gpu_type: "xe"
         limits:
           gpu: "1"
           gpu_type: "xe"
deploy/helm/chart/templates/deployment.yaml (2)

78-90: Refactor: Extract GPU mapping logic to reduce duplication.

The GPU resource mapping logic is duplicated between the requests (lines 78-90) and limits (lines 98-110) sections. Consider extracting this into a Helm named template to improve maintainability.

Example named template approach:

Add to _helpers.tpl:

{{- define "dynamo.gpuResource" -}}
{{- $gpu := .gpu -}}
{{- $gpuType := .gpuType | default "" -}}
{{- if eq $gpuType "xe" -}}
gpu.intel.com/xe: "{{ $gpu }}"
{{- else if eq $gpuType "i915" -}}
gpu.intel.com/i915: "{{ $gpu }}"
{{- else -}}
nvidia.com/gpu: "{{ $gpu }}"
{{- end -}}
{{- end -}}

Then use it in deployment.yaml:

{{- if $serviceSpec.resources.gpu }}
{{ include "dynamo.gpuResource" (dict "gpu" $serviceSpec.resources.gpu "gpuType" $serviceSpec.resources.gpu_type) | indent 12 }}
{{- end }}

84-86: Consider: Fail fast on unsupported GPU types.

The template silently falls back to nvidia.com/gpu for unsupported gpu_type values. Consider using fail to provide early feedback for typos or invalid configurations.

Apply this diff to fail on unsupported types:

-                {{- else }}
+                {{- else if not $serviceSpec.resources.gpu_type }}
                 nvidia.com/gpu: "{{ $serviceSpec.resources.gpu }}"
+                {{- else }}
+                {{- fail (printf "Unsupported gpu_type '%s' for service '%s'. Supported values: xe, i915" $serviceSpec.resources.gpu_type $serviceName) }}
                 {{- end }}

Also applies to: 104-106

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a2a820 and fb0bea3.

📒 Files selected for processing (9)
  • components/backends/vllm/deploy/agg.xe.yaml (1 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (2 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (2 hunks)
  • deploy/cloud/operator/api/dynamo/common/common.go (2 hunks)
  • deploy/cloud/operator/api/dynamo/common/zz_generated.deepcopy.go (1 hunks)
  • deploy/cloud/operator/internal/consts/consts.go (1 hunks)
  • deploy/cloud/operator/internal/controller_common/resource.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/graph.go (1 hunks)
  • deploy/helm/chart/templates/deployment.yaml (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.

Applied to files:

  • deploy/cloud/operator/internal/consts/consts.go
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
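
Put differently, the learning above means dynamo CRs abstract the vendor resource key, and this PR extends that abstraction with gpu_type; an illustrative contrast:

# dynamo CR (what users write)
resources:
  limits:
    gpu: "1"
    gpu_type: "xe"   # optional; new in this PR

# plain Kubernetes pod spec (what the operator/chart renders)
resources:
  limits:
    gpu.intel.com/xe: "1"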
🧬 Code graph analysis (1)
deploy/cloud/operator/internal/controller_common/resource.go (1)
deploy/cloud/operator/internal/consts/consts.go (3)
  • KubeResourceGPUXeIntel (48-48)
  • KubeResourceGPUi915Intel (49-49)
  • KubeResourceGPUNvidia (47-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)

10328-10331: Schema update looks solid.

The new gpu_type property slots cleanly into both requests and limits and keeps the GPU schema in sync with the operator changes. No issues spotted.

Also applies to: 10348-10351

deploy/cloud/operator/api/dynamo/common/zz_generated.deepcopy.go (1)

57-61: LGTM!

The deep copy logic for the GPUType field correctly follows the established pattern for pointer fields, with proper nil checking and allocation.

deploy/cloud/operator/internal/dynamo/graph.go (1)

86-86: LGTM!

The GPUType field addition follows the established pattern for optional resource fields with proper JSON/YAML tagging.

deploy/cloud/operator/internal/consts/consts.go (1)

48-49: LGTM!

The Intel GPU resource constants are correctly defined and match the standard resource names used by Intel's GPU device plugin.
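
For context, a node running Intel's GPU device plugin advertises these names as extended resources, so a matching node's status includes entries along these lines (values illustrative):

status:
  allocatable:
    cpu: "32"
    memory: 64Gi
    gpu.intel.com/i915: "1"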

deploy/cloud/operator/api/dynamo/common/common.go (1)

43-43: Note: Type consistency consideration.

DeploymentTargetHPAConf.GPUType is a pointer (*string) while ResourceItem.GPUType (line 31) is a non-pointer string. This inconsistency is likely intentional since HPA configuration fields are optional, but verify this aligns with the usage patterns for these structs.

@zxue2 zxue2 force-pushed the feature/enable-heterogeneous-device-by-adding-gputype branch from fb0bea3 to 96f6182 Compare October 11, 2025 04:11
@pull-request-size pull-request-size bot added size/L and removed size/M labels Oct 11, 2025

zxue2 commented Oct 11, 2025

@coderabbitai generate docstrings


coderabbitai bot commented Oct 11, 2025

Warning

Docstrings generation is disabled for your repository or organization.


coderabbitai bot commented Oct 11, 2025

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the modified Go files to check current documentation state
echo "=== common.go ==="
cat deploy/cloud/operator/api/dynamo/common/common.go

echo -e "\n=== consts.go ==="
cat deploy/cloud/operator/internal/consts/consts.go

echo -e "\n=== resource.go ==="
cat deploy/cloud/operator/internal/controller_common/resource.go

echo -e "\n=== graph.go ==="
cat deploy/cloud/operator/internal/dynamo/graph.go

Length of output: 70000


@zxue2: I'll generate docstrings for the new GPU type fields and constants added in this PR.

common.go

type ResourceItem struct {
    CPU    string `json:"cpu,omitempty"`
    Memory string `json:"memory,omitempty"`
    // GPU indicates the number of GPUs to request.
    // The total number of GPUs is NumberOfNodes * GPU in case of a multinode deployment.
    GPU string `json:"gpu,omitempty"`
    // GPUType specifies the type of GPU resource to request.
    // Valid values are "xe" for Intel Xe GPUs or "i915" for Intel i915 GPUs.
    // If not specified, defaults to NVIDIA GPUs.
    // +kubebuilder:validation:Enum=xe;i915
    GPUType string            `json:"gpu_type,omitempty"`
    Custom  map[string]string `json:"custom,omitempty"`
}

type DeploymentTargetHPAConf struct {
    CPU *int32 `json:"cpu,omitempty"`
    GPU *int32 `json:"gpu,omitempty"`
    // GPUType specifies the type of GPU resource for horizontal pod autoscaling.
    // Valid values are "xe" for Intel Xe GPUs or "i915" for Intel i915 GPUs.
    // If not specified, defaults to NVIDIA GPUs.
    // +kubebuilder:validation:Enum=xe;i915
    GPUType     *string `json:"gpu_type,omitempty"`
    Memory      *string `json:"memory,omitempty"`
    QPS         *int64  `json:"qps,omitempty"`
    MinReplicas *int32  `json:"min_replicas,omitempty"`
    MaxReplicas *int32  `json:"max_replicas,omitempty"`
}

consts.go

const (
	// ... existing constants ...

	KubeLabelDynamoComponentPod = "nvidia.com/dynamo-component-pod"

	// KubeResourceGPUNvidia is the Kubernetes resource name for NVIDIA GPUs.
	KubeResourceGPUNvidia = "nvidia.com/gpu"
	// KubeResourceGPUXeIntel is the Kubernetes resource name for Intel Xe GPUs.
	KubeResourceGPUXeIntel = "gpu.intel.com/xe"
	// KubeResourceGPUi915Intel is the Kubernetes resource name for Intel i915 GPUs.
	KubeResourceGPUi915Intel = "gpu.intel.com/i915"

	// ... rest of constants ...
)

graph.go

type Resources struct {
    CPU    *string `yaml:"cpu,omitempty" json:"cpu,omitempty"`
    Memory *string `yaml:"memory,omitempty" json:"memory,omitempty"`
    GPU    *string `yaml:"gpu,omitempty" json:"gpu,omitempty"`
    // GPUType specifies the type of GPU resource to request.
    // Valid values are "xe" for Intel Xe GPUs or "i915" for Intel i915 GPUs.
    // If not specified, defaults to NVIDIA GPUs.
    // +kubebuilder:validation:Enum=xe;i915
    GPUType *string           `yaml:"gpu_type,omitempty" json:"gpu_type,omitempty"`
    Custom  map[string]string `yaml:"custom,omitempty" json:"custom,omitempty"`
}

These docstrings:

  • Follow Go conventions by starting with the field/constant name
  • Explain the purpose of each field
  • Document valid values for the GPUType field
  • Clarify the default behavior when not specified
  • Are concise and clear

@zxue2 zxue2 force-pushed the feature/enable-heterogeneous-device-by-adding-gputype branch from 96f6182 to b668553 Compare October 11, 2025 07:23
@zxue2 zxue2 force-pushed the feature/enable-heterogeneous-device-by-adding-gputype branch 2 times, most recently from e36147f to 069c25c Compare October 18, 2025 04:18
Signed-off-by: Zhan Xue <[email protected]>
Co-authored-by: Eero Tamminen <[email protected]>
Co-authored-by: Tuomas Katila <[email protected]>