feat: add GPU type and enable intel GPU resources #3548
Conversation
👋 Hi zxue2! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough

Introduces GPU type awareness across CRDs, operator types/logic, and Helm templates to map GPU requests to vendor-specific resource keys (Intel xe/i915 or Nvidia). Adds Intel GPU resource constants. Includes a new vllm-agg DynamoGraphDeployment manifest specifying gpu_type "xe" and a decode worker model configuration.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User as User
    participant CRD as Dynamo* CRDs
    participant Operator as Operator Reconciler
    participant Helm as Helm Template
    participant K8s as Kubernetes API
    participant Scheduler as K8s Scheduler
    Note over User,CRD: Define deployment with resources.gpu_type and gpu count
    User->>CRD: Apply DynamoGraphDeployment (gpu_type=xe, gpu=1)
    CRD-->>Operator: Reconcile event
    Operator->>Operator: Map gpu_type → resource key\nxe/i915 → gpu.intel.com/*\nelse → nvidia.com/gpu
    Operator->>Helm: Provide values (resource key, gpu count)
    Helm->>K8s: Render Deployment/Pod spec with requests/limits\nfor mapped GPU resource
    K8s-->>Scheduler: Pod admitted
    Scheduler->>K8s: Place pod on node matching GPU resource
    Note right of Scheduler: Default path used if gpu_type absent/other
```
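To make the mapping concrete, here is a minimal sketch of how a service's resources block is expected to translate into container resources after rendering; the CR-side fields (gpu, gpu_type) come from this PR, while the rendered output shown is an illustration of the mapping rule rather than the literal template output:

```yaml
# In the DynamoGraphDeployment service spec (dynamo CR convention: gpu: "1")
resources:
  limits:
    gpu: "1"
    gpu_type: "xe"

# Illustrative rendered container resources after the gpu_type mapping:
#   "xe"        -> gpu.intel.com/xe
#   "i915"      -> gpu.intel.com/i915
#   unset/other -> nvidia.com/gpu (default path)
resources:
  limits:
    gpu.intel.com/xe: "1"
```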
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (5)
deploy/cloud/operator/internal/controller_common/resource.go (2)
426-432: Consider using constants for GPU type strings.

The hardcoded strings "xe" and "i915" should be defined as constants for maintainability and to prevent typos. This would also make it easier to support additional GPU types in the future.

Consider defining constants in consts.go:

```go
const (
	GPUTypeXe     = "xe"
	GPUTypei915   = "i915"
	GPUTypeNvidia = "" // default/empty
)
```

Then use them in the conditional:

```diff
-	if resources.Limits.GPUType == "xe" {
+	if resources.Limits.GPUType == consts.GPUTypeXe {
 		currentResources.Limits[corev1.ResourceName(consts.KubeResourceGPUXeIntel)] = q
-	} else if resources.Limits.GPUType == "i915" {
+	} else if resources.Limits.GPUType == consts.GPUTypei915 {
 		currentResources.Limits[corev1.ResourceName(consts.KubeResourceGPUi915Intel)] = q
 	} else {
 		currentResources.Limits[corev1.ResourceName(consts.KubeResourceGPUNvidia)] = q
 	}
```
426-432: Consider validating GPUType values.

Invalid GPUType values silently default to Nvidia GPUs, which could mask configuration errors. Consider adding validation or logging for unrecognized GPU types to help users catch typos or unsupported values.

Example validation approach:

```go
// An empty string means Nvidia (the default); anything else must be a known type.
if resources.Limits.GPUType != "" && resources.Limits.GPUType != "xe" && resources.Limits.GPUType != "i915" {
	return nil, fmt.Errorf("unsupported GPU type: %s (supported: xe, i915)", resources.Limits.GPUType)
}
```

components/backends/vllm/deploy/agg.xe.yaml (1)
22-25: Consider: Specify gpu_type in both requests and limits.

The gpu_type is specified only under limits. While the Helm template should handle this, consider adding it to requests as well for consistency and clarity, especially if the template logic changes in the future.

Apply this diff to add gpu_type to requests:

```diff
   resources:
+    requests:
+      gpu: "1"
+      gpu_type: "xe"
     limits:
       gpu: "1"
       gpu_type: "xe"
```

deploy/helm/chart/templates/deployment.yaml (2)
78-90: Refactor: Extract GPU mapping logic to reduce duplication.

The GPU resource mapping logic is duplicated between the requests (lines 78-90) and limits (lines 98-110) sections. Consider extracting this into a Helm named template to improve maintainability.

Example named template approach:

Add to _helpers.tpl:

```yaml
{{- define "dynamo.gpuResource" -}}
{{- $gpu := .gpu -}}
{{- $gpuType := .gpuType -}}
{{- if $gpuType | eq "xe" -}}
gpu.intel.com/xe: "{{ $gpu }}"
{{- else if $gpuType | eq "i915" -}}
gpu.intel.com/i915: "{{ $gpu }}"
{{- else -}}
nvidia.com/gpu: "{{ $gpu }}"
{{- end -}}
{{- end -}}
```

Then use it in deployment.yaml:

```yaml
{{- if $serviceSpec.resources.gpu }}
{{ include "dynamo.gpuResource" (dict "gpu" $serviceSpec.resources.gpu "gpuType" $serviceSpec.resources.gpu_type) | indent 12 }}
{{- end }}
```
84-86: Consider: Fail fast on unsupported GPU types.

The template silently falls back to nvidia.com/gpu for unsupported gpu_type values. Consider using fail to provide early feedback for typos or invalid configurations.

Apply this diff to fail on unsupported types:

```diff
- {{- else }}
+ {{- else if not $serviceSpec.resources.gpu_type }}
  nvidia.com/gpu: "{{ $serviceSpec.resources.gpu }}"
+ {{- else }}
+ {{- fail (printf "Unsupported gpu_type '%s' for service '%s'. Supported values: xe, i915" $serviceSpec.resources.gpu_type $serviceName) }}
  {{- end }}
```

Also applies to: 104-106
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- components/backends/vllm/deploy/agg.xe.yaml (1 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (2 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (2 hunks)
- deploy/cloud/operator/api/dynamo/common/common.go (2 hunks)
- deploy/cloud/operator/api/dynamo/common/zz_generated.deepcopy.go (1 hunks)
- deploy/cloud/operator/internal/consts/consts.go (1 hunks)
- deploy/cloud/operator/internal/controller_common/resource.go (1 hunks)
- deploy/cloud/operator/internal/dynamo/graph.go (1 hunks)
- deploy/helm/chart/templates/deployment.yaml (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
Applied to files:
deploy/cloud/operator/internal/consts/consts.go
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
🧬 Code graph analysis (1)
deploy/cloud/operator/internal/controller_common/resource.go (1)
deploy/cloud/operator/internal/consts/consts.go (3)
- KubeResourceGPUXeIntel (48-48)
- KubeResourceGPUi915Intel (49-49)
- KubeResourceGPUNvidia (47-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)
10328-10331: Schema update looks solid.

The new gpu_type property slots cleanly into both requests and limits and keeps the GPU schema in sync with the operator changes. No issues spotted.

Also applies to: 10348-10351
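For readers who have not opened the CRD, a rough sketch of the shape such a property addition takes in the generated OpenAPI schema (illustrative only; the actual generated YAML at the cited lines may differ in layout and validation):

```yaml
resources:
  properties:
    requests:
      properties:
        gpu:
          type: string
        gpu_type:
          type: string
    limits:
      properties:
        gpu:
          type: string
        gpu_type:
          type: string
```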
deploy/cloud/operator/api/dynamo/common/zz_generated.deepcopy.go (1)
57-61: LGTM!

The deep copy logic for the GPUType field correctly follows the established pattern for pointer fields, with proper nil checking and allocation.
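For context, the nil-check-and-allocate pattern referred to here typically looks like the following in controller-gen output for a *string field (a sketch, not the literal generated lines):

```go
// Inside DeepCopyInto: copy the optional GPUType pointer by allocating
// a new string and copying the value, only when the source is non-nil.
if in.GPUType != nil {
	in, out := &in.GPUType, &out.GPUType
	*out = new(string)
	**out = **in
}
```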
deploy/cloud/operator/internal/dynamo/graph.go (1)
86-86: LGTM!

The GPUType field addition follows the established pattern for optional resource fields with proper JSON/YAML tagging.
deploy/cloud/operator/internal/consts/consts.go (1)
48-49: LGTM!

The Intel GPU resource constants are correctly defined and match the standard resource names used by Intel's GPU device plugin.
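For comparison, these are the same resource names a workload would request directly when Intel's GPU device plugin is installed on a cluster; a hypothetical pod spec for illustration (name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: intel-gpu-demo                         # placeholder name
spec:
  containers:
    - name: worker
      image: registry.example/worker:latest    # placeholder image
      resources:
        limits:
          gpu.intel.com/i915: "1"              # or gpu.intel.com/xe, matching the new constants
```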
deploy/cloud/operator/api/dynamo/common/common.go (1)
43-43: Note: Type consistency consideration.

DeploymentTargetHPAConf.GPUType is a pointer (*string) while ResourceItem.GPUType (line 31) is a non-pointer string. This inconsistency is likely intentional since HPA configuration fields are optional, but verify this aligns with the usage patterns for these structs.
Force-pushed from fb0bea3 to 96f6182.
@coderabbitai generate docstrings
Warning: Docstrings generation is disabled for your repository or organization.
Tip: For best results, initiate chat on the files or code changes.

@zxue2: I'll generate docstrings for the new GPU type fields and constants added in this PR.

common.go

```go
type ResourceItem struct {
	CPU    string `json:"cpu,omitempty"`
	Memory string `json:"memory,omitempty"`
	// Indicates the number of GPUs to request.
	// total number of GPUs is NumberOfNodes * GPU in case of multinode deployment.
	GPU string `json:"gpu,omitempty"`
	// GPUType specifies the type of GPU resource to request.
	// Valid values are "xe" for Intel Xe GPUs or "i915" for Intel i915 GPUs.
	// If not specified, defaults to NVIDIA GPUs.
	GPUType string            `json:"gpu_type,omitempty" kubebuilder:"validation:Enum=xe;i915"`
	Custom  map[string]string `json:"custom,omitempty"`
}

type DeploymentTargetHPAConf struct {
	CPU *int32 `json:"cpu,omitempty"`
	GPU *int32 `json:"gpu,omitempty"`
	// GPUType specifies the type of GPU resource for horizontal pod autoscaling.
	// Valid values are "xe" for Intel Xe GPUs or "i915" for Intel i915 GPUs.
	// If not specified, defaults to NVIDIA GPUs.
	GPUType     *string `json:"gpu_type,omitempty" kubebuilder:"validation:Enum=xe;i915"`
	Memory      *string `json:"memory,omitempty"`
	QPS         *int64  `json:"qps,omitempty"`
	MinReplicas *int32  `json:"min_replicas,omitempty"`
	MaxReplicas *int32  `json:"max_replicas,omitempty"`
}
```

consts.go

```go
const (
	// ... existing constants ...
	KubeLabelDynamoComponentPod = "nvidia.com/dynamo-component-pod"

	// KubeResourceGPUNvidia is the Kubernetes resource name for NVIDIA GPUs.
	KubeResourceGPUNvidia = "nvidia.com/gpu"
	// KubeResourceGPUXeIntel is the Kubernetes resource name for Intel Xe GPUs.
	KubeResourceGPUXeIntel = "gpu.intel.com/xe"
	// KubeResourceGPUi915Intel is the Kubernetes resource name for Intel i915 GPUs.
	KubeResourceGPUi915Intel = "gpu.intel.com/i915"
	// ... rest of constants ...
)
```

graph.go

```go
type Resources struct {
	CPU    *string `yaml:"cpu,omitempty" json:"cpu,omitempty"`
	Memory *string `yaml:"memory,omitempty" json:"memory,omitempty"`
	GPU    *string `yaml:"gpu,omitempty" json:"gpu,omitempty"`
	// GPUType specifies the type of GPU resource to request.
	// Valid values are "xe" for Intel Xe GPUs or "i915" for Intel i915 GPUs.
	// If not specified, defaults to NVIDIA GPUs.
	GPUType *string           `yaml:"gpu_type,omitempty" json:"gpu_type,omitempty" kubebuilder:"validation:Enum=xe;i915"`
	Custom  map[string]string `yaml:"custom,omitempty" json:"custom,omitempty"`
}
```

These docstrings describe the accepted values ("xe", "i915") and the NVIDIA default for each new GPU type field, plus the purpose of the Intel resource-name constants.
Force-pushed from 96f6182 to b668553.
Force-pushed from e36147f to 069c25c.
Signed-off-by: Zhan Xue <[email protected]>
Co-authored-by: Eero Tamminen <[email protected]>
Co-authored-by: Tuomas Katila <[email protected]>
Overview:
Add a GPU type field to the dynamo operator and Helm CRD to support heterogeneous devices in Kubernetes deployments.
Details:
Where should the reviewer start?
agg.xe.yaml shows how the new gpu_type field is used in practice; see the trimmed sketch below.
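For orientation before opening the file, the relevant piece is the per-service resources block; a trimmed sketch of the shape used there (service key abbreviated and illustrative, see agg.xe.yaml for the full manifest):

```yaml
services:
  VllmDecodeWorker:          # decode worker service; exact key as defined in agg.xe.yaml
    resources:
      limits:
        gpu: "1"
        gpu_type: "xe"
```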
Related Issues
Resolves the Kubernetes-relevant part of #3303.