feat: add GPU type and enable intel GPU resources #3548
Conversation
👋 Hi zxue2! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough

Introduces GPU type awareness across CRDs, operator types/logic, and Helm templates to map GPU requests to vendor-specific resource keys (Intel xe/i915 or Nvidia). Adds Intel GPU resource constants. Includes a new vllm-agg DynamoGraphDeployment manifest specifying gpu_type "xe" and a decode worker model configuration.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User as User
    participant CRD as Dynamo* CRDs
    participant Operator as Operator Reconciler
    participant Helm as Helm Template
    participant K8s as Kubernetes API
    participant Scheduler as K8s Scheduler
    Note over User,CRD: Define deployment with resources.gpu_type and gpu count
    User->>CRD: Apply DynamoGraphDeployment (gpu_type=xe, gpu=1)
    CRD-->>Operator: Reconcile event
    Operator->>Operator: Map gpu_type → resource key\nxe/i915 → gpu.intel.com/*\nelse → nvidia.com/gpu
    Operator->>Helm: Provide values (resource key, gpu count)
    Helm->>K8s: Render Deployment/Pod spec with requests/limits\nfor mapped GPU resource
    K8s-->>Scheduler: Pod admitted
    Scheduler->>K8s: Place pod on node matching GPU resource
    Note right of Scheduler: Default path used if gpu_type absent/other
```
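To make the mapping concrete, here is a minimal sketch of how a service's resources block is expected to translate into container resources after rendering; the CR-side fields (gpu, gpu_type) come from this PR, while the rendered output shown is an illustration of the mapping rule rather than the literal template output:

```yaml
# In the DynamoGraphDeployment service spec (dynamo CR convention: gpu: "1")
resources:
  limits:
    gpu: "1"
    gpu_type: "xe"

# Illustrative rendered container resources after the gpu_type mapping:
#   "xe"        -> gpu.intel.com/xe
#   "i915"      -> gpu.intel.com/i915
#   unset/other -> nvidia.com/gpu (default path)
resources:
  limits:
    gpu.intel.com/xe: "1"
```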
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (5)
deploy/cloud/operator/internal/controller_common/resource.go (2)
426-432: Consider using constants for GPU type strings.

The hardcoded strings "xe" and "i915" should be defined as constants for maintainability and to prevent typos. This would also make it easier to support additional GPU types in the future.

Consider defining constants in consts.go:

```go
const (
	GPUTypeXe     = "xe"
	GPUTypei915   = "i915"
	GPUTypeNvidia = "" // default/empty
)
```

Then use them in the conditional:

```diff
-	if resources.Limits.GPUType == "xe" {
+	if resources.Limits.GPUType == consts.GPUTypeXe {
 		currentResources.Limits[corev1.ResourceName(consts.KubeResourceGPUXeIntel)] = q
-	} else if resources.Limits.GPUType == "i915" {
+	} else if resources.Limits.GPUType == consts.GPUTypei915 {
 		currentResources.Limits[corev1.ResourceName(consts.KubeResourceGPUi915Intel)] = q
 	} else {
 		currentResources.Limits[corev1.ResourceName(consts.KubeResourceGPUNvidia)] = q
 	}
```
426-432: Consider validating GPUType values.

Invalid GPUType values silently default to Nvidia GPUs, which could mask configuration errors. Consider adding validation or logging for unrecognized GPU types to help users catch typos or unsupported values.

Example validation approach:

```go
// An empty string means Nvidia (the default); anything else must be a known type.
if resources.Limits.GPUType != "" && resources.Limits.GPUType != "xe" && resources.Limits.GPUType != "i915" {
	return nil, fmt.Errorf("unsupported GPU type: %s (supported: xe, i915)", resources.Limits.GPUType)
}
```

components/backends/vllm/deploy/agg.xe.yaml (1)
22-25: Consider: Specify gpu_type in both requests and limits.

The gpu_type is specified only under limits. While the Helm template should handle this, consider adding it to requests as well for consistency and clarity, especially if the template logic changes in the future.

Apply this diff to add gpu_type to requests:

```diff
   resources:
+    requests:
+      gpu: "1"
+      gpu_type: "xe"
     limits:
       gpu: "1"
       gpu_type: "xe"
```

deploy/helm/chart/templates/deployment.yaml (2)
78-90: Refactor: Extract GPU mapping logic to reduce duplication.

The GPU resource mapping logic is duplicated between the requests (lines 78-90) and limits (lines 98-110) sections. Consider extracting this into a Helm named template to improve maintainability.

Example named template approach:

Add to _helpers.tpl:

```yaml
{{- define "dynamo.gpuResource" -}}
{{- $gpu := .gpu -}}
{{- $gpuType := .gpuType -}}
{{- if $gpuType | eq "xe" -}}
gpu.intel.com/xe: "{{ $gpu }}"
{{- else if $gpuType | eq "i915" -}}
gpu.intel.com/i915: "{{ $gpu }}"
{{- else -}}
nvidia.com/gpu: "{{ $gpu }}"
{{- end -}}
{{- end -}}
```

Then use it in deployment.yaml:

```yaml
{{- if $serviceSpec.resources.gpu }}
{{ include "dynamo.gpuResource" (dict "gpu" $serviceSpec.resources.gpu "gpuType" $serviceSpec.resources.gpu_type) | indent 12 }}
{{- end }}
```
84-86: Consider: Fail fast on unsupported GPU types.

The template silently falls back to nvidia.com/gpu for unsupported gpu_type values. Consider using fail to provide early feedback for typos or invalid configurations.

Apply this diff to fail on unsupported types:

```diff
- {{- else }}
+ {{- else if not $serviceSpec.resources.gpu_type }}
  nvidia.com/gpu: "{{ $serviceSpec.resources.gpu }}"
+ {{- else }}
+ {{- fail (printf "Unsupported gpu_type '%s' for service '%s'. Supported values: xe, i915" $serviceSpec.resources.gpu_type $serviceName) }}
  {{- end }}
```

Also applies to: 104-106
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- components/backends/vllm/deploy/agg.xe.yaml (1 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (2 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (2 hunks)
- deploy/cloud/operator/api/dynamo/common/common.go (2 hunks)
- deploy/cloud/operator/api/dynamo/common/zz_generated.deepcopy.go (1 hunks)
- deploy/cloud/operator/internal/consts/consts.go (1 hunks)
- deploy/cloud/operator/internal/controller_common/resource.go (1 hunks)
- deploy/cloud/operator/internal/dynamo/graph.go (1 hunks)
- deploy/helm/chart/templates/deployment.yaml (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
Applied to files:
deploy/cloud/operator/internal/consts/consts.go
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
🧬 Code graph analysis (1)
deploy/cloud/operator/internal/controller_common/resource.go (1)
deploy/cloud/operator/internal/consts/consts.go (3)
- KubeResourceGPUXeIntel (48-48)
- KubeResourceGPUi915Intel (49-49)
- KubeResourceGPUNvidia (47-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)
10328-10331: Schema update looks solid.

The new gpu_type property slots cleanly into both requests and limits and keeps the GPU schema in sync with the operator changes. No issues spotted.

Also applies to: 10348-10351
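For readers who have not opened the CRD, a rough sketch of the shape such a property addition takes in the generated OpenAPI schema (illustrative only; the actual generated YAML at the cited lines may differ in layout and validation):

```yaml
resources:
  properties:
    requests:
      properties:
        gpu:
          type: string
        gpu_type:
          type: string
    limits:
      properties:
        gpu:
          type: string
        gpu_type:
          type: string
```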
deploy/cloud/operator/api/dynamo/common/zz_generated.deepcopy.go (1)
57-61: LGTM!

The deep copy logic for the GPUType field correctly follows the established pattern for pointer fields, with proper nil checking and allocation.
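For context, the nil-check-and-allocate pattern referred to here typically looks like the following in controller-gen output for a *string field (a sketch, not the literal generated lines):

```go
// Inside DeepCopyInto: copy the optional GPUType pointer by allocating
// a new string and copying the value, only when the source is non-nil.
if in.GPUType != nil {
	in, out := &in.GPUType, &out.GPUType
	*out = new(string)
	**out = **in
}
```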
deploy/cloud/operator/internal/dynamo/graph.go (1)
86-86: LGTM!

The GPUType field addition follows the established pattern for optional resource fields with proper JSON/YAML tagging.
deploy/cloud/operator/internal/consts/consts.go (1)
48-49: LGTM!

The Intel GPU resource constants are correctly defined and match the standard resource names used by Intel's GPU device plugin.
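For comparison, these are the same resource names a workload would request directly when Intel's GPU device plugin is installed on a cluster; a hypothetical pod spec for illustration (name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: intel-gpu-demo                         # placeholder name
spec:
  containers:
    - name: worker
      image: registry.example/worker:latest    # placeholder image
      resources:
        limits:
          gpu.intel.com/i915: "1"              # or gpu.intel.com/xe, matching the new constants
```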
deploy/cloud/operator/api/dynamo/common/common.go (1)
43-43: Note: Type consistency consideration.

DeploymentTargetHPAConf.GPUType is a pointer (*string) while ResourceItem.GPUType (line 31) is a non-pointer string. This inconsistency is likely intentional since HPA configuration fields are optional, but verify this aligns with the usage patterns for these structs.
Force-pushed from fb0bea3 to 96f6182.
@coderabbitai generate docstrings
Warning: Docstrings generation is disabled for your repository or organization.
Tip: For best results, initiate chat on the files or code changes.

@zxue2: I'll generate docstrings for the new GPU type fields and constants added in this PR.

common.go

```go
type ResourceItem struct {
	CPU    string `json:"cpu,omitempty"`
	Memory string `json:"memory,omitempty"`
	// Indicates the number of GPUs to request.
	// total number of GPUs is NumberOfNodes * GPU in case of multinode deployment.
	GPU string `json:"gpu,omitempty"`
	// GPUType specifies the type of GPU resource to request.
	// Valid values are "xe" for Intel Xe GPUs or "i915" for Intel i915 GPUs.
	// If not specified, defaults to NVIDIA GPUs.
	GPUType string            `json:"gpu_type,omitempty" kubebuilder:"validation:Enum=xe;i915"`
	Custom  map[string]string `json:"custom,omitempty"`
}

type DeploymentTargetHPAConf struct {
	CPU *int32 `json:"cpu,omitempty"`
	GPU *int32 `json:"gpu,omitempty"`
	// GPUType specifies the type of GPU resource for horizontal pod autoscaling.
	// Valid values are "xe" for Intel Xe GPUs or "i915" for Intel i915 GPUs.
	// If not specified, defaults to NVIDIA GPUs.
	GPUType     *string `json:"gpu_type,omitempty" kubebuilder:"validation:Enum=xe;i915"`
	Memory      *string `json:"memory,omitempty"`
	QPS         *int64  `json:"qps,omitempty"`
	MinReplicas *int32  `json:"min_replicas,omitempty"`
	MaxReplicas *int32  `json:"max_replicas,omitempty"`
}
```

consts.go

```go
const (
	// ... existing constants ...
	KubeLabelDynamoComponentPod = "nvidia.com/dynamo-component-pod"

	// KubeResourceGPUNvidia is the Kubernetes resource name for NVIDIA GPUs.
	KubeResourceGPUNvidia = "nvidia.com/gpu"
	// KubeResourceGPUXeIntel is the Kubernetes resource name for Intel Xe GPUs.
	KubeResourceGPUXeIntel = "gpu.intel.com/xe"
	// KubeResourceGPUi915Intel is the Kubernetes resource name for Intel i915 GPUs.
	KubeResourceGPUi915Intel = "gpu.intel.com/i915"
	// ... rest of constants ...
)
```

graph.go

```go
type Resources struct {
	CPU    *string `yaml:"cpu,omitempty" json:"cpu,omitempty"`
	Memory *string `yaml:"memory,omitempty" json:"memory,omitempty"`
	GPU    *string `yaml:"gpu,omitempty" json:"gpu,omitempty"`
	// GPUType specifies the type of GPU resource to request.
	// Valid values are "xe" for Intel Xe GPUs or "i915" for Intel i915 GPUs.
	// If not specified, defaults to NVIDIA GPUs.
	GPUType *string           `yaml:"gpu_type,omitempty" json:"gpu_type,omitempty" kubebuilder:"validation:Enum=xe;i915"`
	Custom  map[string]string `yaml:"custom,omitempty" json:"custom,omitempty"`
}
```

These docstrings describe the accepted values ("xe", "i915") and the NVIDIA default for each new GPU type field, plus the purpose of the Intel resource-name constants.
Force-pushed from 96f6182 to b668553.
Force-pushed from e36147f to 069c25c.
Signed-off-by: Zhan Xue <[email protected]>
Co-authored-by: Eero Tamminen <[email protected]>
Co-authored-by: Tuomas Katila <[email protected]>
Overview:
Add a GPU type field to the dynamo operator and Helm CRD to support heterogeneous devices in Kubernetes deployments.
Details:
Where should the reviewer start?
agg.xe.yaml shows how the new gpu_type field is used in practice; see the trimmed sketch below.
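For orientation before opening the file, the relevant piece is the per-service resources block; a trimmed sketch of the shape used there (service key abbreviated and illustrative, see agg.xe.yaml for the full manifest):

```yaml
services:
  VllmDecodeWorker:          # decode worker service; exact key as defined in agg.xe.yaml
    resources:
      limits:
        gpu: "1"
        gpu_type: "xe"
```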
Related Issues
Resolves the Kubernetes-relevant part of #3303.