diff --git a/docs/proposals/3015-workload-aware-scheduling/README.md b/docs/proposals/3015-workload-aware-scheduling/README.md new file mode 100644 index 0000000000..5edba06616 --- /dev/null +++ b/docs/proposals/3015-workload-aware-scheduling/README.md @@ -0,0 +1,729 @@ +# KEP-3015: Workload Aware Scheduling for TrainJob + +## Summary + +This document proposes integrating the Kubernetes Workload API into Kubeflow Trainer to enable +native workload aware scheduling for TrainJobs. The Workload API, introduced in Kubernetes v1.35 (alpha) +and targeting v1.37 (beta), provides multiple features to enhance AI workload scheduling orchestration +including gang-scheduling: [KEP-4671](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4671-gang-scheduling), +topology-aware scheduling: [KEP-5732](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/5732-topology-aware-workload-scheduling), +DRA: [KEP-5729](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/5729-resourceclaim-support-for-workloads), +and other features through the Workload and PodGroup resources. + +This integration will be implemented as a new `PodGroupPolicy` plugin following the existing Trainer +Pipeline Framework pattern. + +## Motivation + +In the initial implementation, we plan to support gang-scheduling to ensure that all Pods of a +TrainJob are scheduled together, preventing resource deadlock and idle GPU time. Currently, +Kubeflow Trainer supports gang scheduling through external solutions: + +- **Coscheduling plugin**: Requires installing the Kubernetes scheduler-plugins project +- **Volcano scheduler**: Requires deploying the Volcano scheduling system + +The Kubernetes community has recognized gang scheduling as a fundamental requirement and introduced +native support through the Workload API. This API is becoming the standard way to express gang +scheduling in Kubernetes, with integration planned for: + +- **Job controller**: [KEP-4671](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4671-gang-scheduling): Automatic Workload/PodGroup creation for parallel jobs +- **JobSet**: [KEP-969](https://github.com/kubernetes-sigs/jobset/pull/1068): Gang scheduling support for job groups + +### Goals + +1. Implement a Workload plugin following the Trainer Pipeline Framework pattern to support gang scheduling via the native Kubernetes Workload API +1. Support automatic creation of Workload and PodGroup objects with proper lifecycle management tied to TrainJob +1. Ensure gang-scheduling works with MPI plugin +1. Support for TrainJob initializers +1. Maintain backward compatibility with existing Coscheduling and Volcano plugins + +### Non-Goals + +1. Replace existing Coscheduling or Volcano plugins - this is an additional scheduling option +1. Support all Workload API features immediately - focus on core gang scheduling first +1. Support Kubernetes versions < 1.35 - Workload API requires v1.35+ +1. Delegate Workload creation to JobSet - the TrainJob controller (as the highest-level controller) + must create the Workload object + +## Proposal + +### User Stories + +#### Story 1 + +As a platform engineer, I want to configure ClusterTrainingRuntime to use native Kubernetes +gang-scheduling without installing external scheduler plugins. 
+ +The ClusterTrainingRuntime and TrainJob may look as follows: + +```yaml +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: ClusterTrainingRuntime +metadata: + name: torch-distributed +spec: + mlPolicy: + numNodes: 1 + torch: {} + podGroupPolicy: + workload: {} + template: + spec: + replicatedJobs: + - name: node + template: + spec: + template: + metadata: + labels: + trainer.kubeflow.org/trainjob-ancestor-step: trainer + spec: + containers: + - name: node + image: pytorch/pytorch:2.9.1-cuda12.8-cudnn9-runtime +--- +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: TrainJob +metadata: + name: my-job +spec: + runtimeRef: + name: torch-distributed + trainer: + image: docker.io/torch-run + numNodes: 100 + resourcesPerNode: + requests: + nvidia.com/gpu: 4 +``` + +When this plugin is enabled, the TrainJob controller will create the following resources: + +```yaml +apiVersion: scheduling.k8s.io/v1alpha2 +kind: Workload +metadata: + name: my-job + ownerReferences: + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job + controller: true +spec: + controllerRef: + apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job + podGroupTemplates: + - name: trainer + schedulingPolicy: + gang: + minCount: 100 # Equal to trainJob.spec.trainer.numNodes or mlPolicy.numNodes +``` + +The PodGroup will be created automatically by the TrainJob controller + +```yaml +apiVersion: scheduling.k8s.io/v1alpha1 +kind: PodGroup +metadata: + name: my-job-trainer- + ownerReferences: + - apiVersion: scheduling.k8s.io/v1alpha2 + kind: Workload + name: + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job + controller: true +spec: + podGroupTemplateRef: + workloadName: my-job + podGroupTemplateName: trainer + schedulingPolicy: + gang: + minCount: 100 +``` + +And the Pod specs will be updated with the scheduling group: + +```yaml +spec: + schedulingGroup: + podGroupName: my-job-trainer- +``` + +#### Story 2 + +As a platform engineer, I want to configure MPI-based distributed training with gang scheduling using +the Workload API to ensure all MPI nodes are scheduled together. + +The ClusterTrainingRuntime and TrainJob may look as follows: + +```yaml +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: ClusterTrainingRuntime +metadata: + name: deepspeed-distributed + labels: + trainer.kubeflow.org/framework: deepspeed +spec: + mlPolicy: + numNodes: 1 + mpi: + numProcPerNode: 4 + mpiImplementation: OpenMPI + template: + spec: + network: + publishNotReadyAddresses: true + successPolicy: + operator: All + targetReplicatedJobs: + - launcher + replicatedJobs: + - name: launcher + template: + metadata: + labels: + trainer.kubeflow.org/trainjob-ancestor-step: trainer + spec: + template: + spec: + containers: + - name: node + image: ghcr.io/kubeflow/trainer/deepspeed-runtime + securityContext: + runAsUser: 1000 + - name: node + template: + spec: + template: + spec: + containers: + - name: node + image: ghcr.io/kubeflow/trainer/deepspeed-runtime + securityContext: + runAsUser: 1000 + command: + - /usr/sbin/sshd + args: + - -De + - -f + - /home/mpiuser/.sshd_config + readinessProbe: + tcpSocket: + port: 2222 + initialDelaySeconds: 5 +--- +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: TrainJob +metadata: + name: my-job +spec: + runtimeRef: + name: deepspeed-distributed + trainer: + numNodes: 50 + resourcesPerNode: + requests: + nvidia.com/gpu: 4 +``` + +When this plugin is enabled, the TrainJob controller will create the Workload with one PodGroup +for launcher+node. 
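Both the launcher and node Pod templates are stamped with the same `schedulingGroup` reference, so the launcher and all workers are treated as a single gang. A minimal sketch of that patch is shown here; it assumes the `schedulingGroup` PodSpec field described in this KEP, and `schedulingGroupPatch` is a hypothetical helper rather than existing Trainer code.

```go
// Sketch only: schedulingGroupPatch is a hypothetical helper; the schedulingGroup
// PodSpec field is the new API described in this KEP (Kubernetes v1.35+).
// The launcher and node Pod templates both receive the same PodGroup name,
// so the scheduler places them in a single gang.
func schedulingGroupPatch(podGroupName string) map[string]any {
	return map[string]any{
		"spec": map[string]any{
			"schedulingGroup": map[string]any{
				// Generated name suffix omitted, as in the examples below.
				"podGroupName": podGroupName,
			},
		},
	}
}
```

The Workload created by the TrainJob controller looks as follows: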
+ +```yaml +apiVersion: scheduling.k8s.io/v1alpha2 +kind: Workload +metadata: + name: my-job + ownerReferences: + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job + controller: true +spec: + controllerRef: + apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job + podGroupTemplates: + - name: trainer + schedulingPolicy: + gang: + minCount: 50 # Equal to trainJob.spec.trainer.numNodes +``` + +The corresponding PodGroups will be created: + +```yaml +apiVersion: scheduling.k8s.io/v1alpha1 +kind: PodGroup +metadata: + name: my-job-trainer- + ownerReferences: + - apiVersion: scheduling.k8s.io/v1alpha2 + kind: Workload + name: my-job + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job + controller: true +spec: + podGroupTemplateRef: + workloadName: my-job + podGroupTemplateName: trainer + schedulingPolicy: + gang: + minCount: 50 +``` + +And the Pod specs will be updated with the scheduling group: + +```yaml +spec: + schedulingGroup: + podGroupName: my-job-trainer- +``` + +#### Story 3 + +As a platform engineer, I want to configure LLM fine-tuning with dataset/model initializers and +gang scheduling. The initializers and trainer should have separate PodGroups to ensure proper +scheduling coordination. + +The ClusterTrainingRuntime with Workload API enabled and TrainJob may look as follows: + +```yaml +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: ClusterTrainingRuntime +metadata: + name: torchtune-qwen2.5-1.5b + labels: + trainer.kubeflow.org/framework: torchtune +spec: + mlPolicy: + numNodes: 1 + torch: {} + podGroupPolicy: + workload: {} + template: + spec: + volumeClaimPolicies: + - templates: + - metadata: + name: initializer + spec: + accessModes: ["ReadWriteOnce"] + resources: + requests: + storage: 20Gi + replicatedJobs: + - name: dataset-initializer + template: + metadata: + labels: + trainer.kubeflow.org/trainjob-ancestor-step: dataset-initializer + spec: + template: + spec: + containers: + - name: dataset-initializer + image: ghcr.io/kubeflow/trainer/dataset-initializer + env: + - name: STORAGE_URI + value: hf://tatsu-lab/alpaca + volumeMounts: + - mountPath: /workspace + name: initializer + - name: model-initializer + template: + metadata: + labels: + trainer.kubeflow.org/trainjob-ancestor-step: model-initializer + spec: + template: + spec: + containers: + - name: model-initializer + image: ghcr.io/kubeflow/trainer/model-initializer + env: + - name: STORAGE_URI + value: hf://Qwen/Qwen2.5-1.5B-Instruct + volumeMounts: + - name: initializer + mountPath: /workspace + - name: node + dependsOn: + - name: dataset-initializer + status: Complete + - name: model-initializer + status: Complete + template: + metadata: + labels: + trainer.kubeflow.org/trainjob-ancestor-step: trainer + spec: + template: + spec: + containers: + - name: node + image: ghcr.io/kubeflow/trainer/torchtune-trainer + command: + - tune + - run + - full_finetune_distributed + - --config + - qwen2_5/1.5B_full + - dataset.source=parquet + - dataset.data_dir=/workspace/dataset/data + - output_dir=/workspace/output + - tokenizer.path=/workspace/model/vocab.json + - tokenizer.merges_file=/workspace/model/merges.txt + - checkpointer.checkpoint_dir=/workspace/model + resources: + limits: + nvidia.com/gpu: 2 + volumeMounts: + - mountPath: /workspace + name: initializer +--- +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: TrainJob +metadata: + name: my-job +spec: + runtimeRef: + name: torchtune-qwen2.5-1.5b +``` + +When this plugin is enabled, the TrainJob 
controller will create the Workload with two PodGroup +templates - one for initializer and one for the trainer. The initializer PodGroup does not require +gang-scheduling to enable lazy-loading: + +```yaml +apiVersion: scheduling.k8s.io/v1alpha2 +kind: Workload +metadata: + name: my-job + ownerReferences: + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job + controller: true +spec: + controllerRef: + apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job + podGroupTemplates: + - name: initializer + schedulingPolicy: + basic: {} + - name: trainer + schedulingPolicy: + gang: + minCount: 8 # Equal to trainJob.spec.trainer.numNodes +``` + +The corresponding PodGroups will be created: + +```yaml +apiVersion: scheduling.k8s.io/v1alpha1 +kind: PodGroup +metadata: + name: my-job-initializer- + ownerReferences: + - apiVersion: scheduling.k8s.io/v1alpha2 + kind: Workload + name: my-job + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job + controller: true +spec: + podGroupTemplateRef: + workloadName: my-job + podGroupTemplateName: initializer + schedulingPolicy: + basic: {} +--- +apiVersion: scheduling.k8s.io/v1alpha1 +kind: PodGroup +metadata: + name: my-job-trainer- + ownerReferences: + - apiVersion: scheduling.k8s.io/v1alpha2 + kind: Workload + name: my-job + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job + controller: true +spec: + podGroupTemplateRef: + workloadName: my-job + podGroupTemplateName: trainer + schedulingPolicy: + gang: + minCount: 8 +``` + +Each Pod will be associated with its respective PodGroup: + +```yaml +# Initializer Pods (dataset-initializer, model-initializer) +spec: + schedulingGroup: + podGroupName: my-job-initializer- +--- +# Trainer Pod +spec: + schedulingGroup: + podGroupName: my-job-trainer- +``` + +## Design Details + +### Kubernetes Workload API Overview + +The Workload API introduces two new resource types: + +- **Workload**: A static template defining scheduling policies and `PodGroupTemplates`. +- **PodGroup**: Runtime instances representing actual pod groups with status tracking. + +The key design principle is that **the highest-level controller creates the Workload object**. +Since TrainJob is the top-level resource in Kubeflow Trainer, the TrainJob controller must create +the Workload object, not JobSet. + +### Workload Scheduling API + +We extend `PodGroupPolicySource` to include a Workload option: + +```go +// PodGroupPolicySource configures the gang-scheduling plugin. +// Only one of its members may be specified. +type PodGroupPolicySource struct { + // Coscheduling plugin from the Kubernetes scheduler-plugins for gang-scheduling. + Coscheduling *CoschedulingPodGroupPolicySource `json:"coscheduling,omitempty"` + + // Volcano plugin from the Volcano scheduler for gang-scheduling. + Volcano *VolcanoPodGroupPolicySource `json:"volcano,omitempty"` + + // Workload plugin using native Kubernetes Workload API for gang-scheduling + // Requires Kubernetes v1.35+ with GenericWorkload feature gates enabled. + Workload *WorkloadPodGroupPolicySource `json:"workload,omitempty"` +} + +// WorkloadPodGroupPolicySource configures scheduling behavior using Kubernetes Workload API. +type WorkloadPodGroupPolicySource struct {} +``` + +The `WorkloadPodGroupPolicySource` struct is intentionally minimal for the initial implementation. +The `minCount` for gang scheduling is automatically derived from `mlPolicy.numNodes` or +`trainJob.spec.trainer.numNodes`. 
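To make that derivation concrete, here is a minimal sketch; `deriveMinCount` is an illustrative helper name, not an existing Trainer function. The TrainJob-level value takes precedence over the runtime default, which matches Story 1 where the runtime's `numNodes: 1` is overridden by the TrainJob's `numNodes: 100`.

```go
// Sketch only: deriveMinCount is illustrative, not part of the Trainer codebase.
// The TrainJob-level numNodes overrides the runtime's MLPolicy default.
func deriveMinCount(mlPolicyNumNodes, trainerNumNodes *int32) int32 {
	if trainerNumNodes != nil {
		return *trainerNumNodes
	}
	if mlPolicyNumNodes != nil {
		return *mlPolicyNumNodes
	}
	// Neither field is set: fall back to a single-node gang.
	return 1
}
```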
+ +### Workload Runtime Plugin + +Similar to the Coscheduling and Volcano plugins, we implement a Workload plugin in +`pkg/runtime/framework/plugins/workload/workload.go`. This plugin implements the following interfaces +from the Pipeline Framework: + +#### Build Phase + +The plugin implements the `ComponentBuilder` interface to build the Workload object: + +```go +func (w *Workload) Build(ctx context.Context, info *runtime.Info, trainJob *trainv1alpha1.TrainJob) ([]runtime.ApplyConfiguration, error) { + // 1. Extract numNodes from runtime info + // 2. Build Workload object with: + // - controllerRef pointing to TrainJob + // - podGroupTemplate with gang scheduling policy + // - minCount equal to numNodes + // 3. Build PodGroup objects. + // 4. Return a slice of apply configurations for the Workload and PodGroup objects +} +``` + +#### EnforcePodGroupPolicy Phase + +The plugin implements the `EnforcePodGroupPolicy` interface to configure the `schedulingGroup` field in Pod specs: + +```go +func (w *Workload) EnforcePodGroupPolicy(info *runtime.Info, trainJob *trainv1alpha1.TrainJob) error { + // 1. Set schedulingGroup.podGroupName in all Pod templates +} +``` + +#### WatchExtension Phase + +The plugin implements `WatchExtension` to watch Workload resources and trigger TrainJob reconciliation: + +```go +func (w *Workload) ReconcilerBuilders() []runtime.ReconcilerBuilder { + // 1. Watch Workload and PodGroup resources owned by TrainJob + // 2. Trigger reconciliation on PodGroup status changes +} +``` + +### OwnerReferences Relationship + +The ownerReferences relationship between `TrainJob`, `Workload`, `PodGroup`, and `Pod` is as follows: + +```mermaid +flowchart BT + Pod[Pod] + PodGroup[PodGroup] + Workload[Workload] + TrainJob[TrainJob] + + Workload -->|ownerRef| TrainJob + PodGroup -->|ownerRef| TrainJob + Pod -->|ownerRef| TrainJob + + PodGroup -->|ownerRef| Workload + Pod -->|ownerRef| PodGroup +``` + +- The `Workload` object has an ownerReference to the `TrainJob` object with `controller: true` in case it was created by the TrainJob controller +- The `PodGroup` object has an ownerReference to the `TrainJob` object with `controller: true` in case it was created by the TrainJob controller and another ownerReference to the `Workload` object +- The `Pod` object has an ownerReference to the `TrainJob` object with `controller: true` and another ownerReference to the `PodGroup` object + +By this ownerReferences relationship, garbage collection will remove objects accordingly that avoids orphaned Pods with a stale PodGroup reference. + +### Resource Lifecycle + +1. **Creation**: When a TrainJob is created with `podGroupPolicy.workload` configured, the Workload + plugin creates the Workload and PodGroup objects with `ownerReferences` pointing to the TrainJob. + +1. **Pod Association**: The plugin injects `schedulingGroup.podGroupName` into Pod specs, linking Pods to their PodGroup. + +1. **Scheduling**: The kube-scheduler uses the Workload Scheduling Cycle to process entire PodGroups + atomically, ensuring all Pods in a gang are scheduled together. + +1. **Suspension**: When the TrainJob is suspended, Workload and PodGroup resources are preserved. + In the future, we can re-create to ensure elastic TrainJob support. + +1. **Deletion**: When the TrainJob is deleted, Kubernetes garbage collection automatically cleans up the Workload object (and subsequently the PodGroup). 
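To make the Build phase and ownership rules above concrete, here is a minimal sketch of assembling the Workload object. It uses `unstructured` since the v1alpha2 typed clients may not be vendored yet, and `buildWorkload` is an illustrative helper; the real plugin would return apply configurations as described in the Build phase.

```go
// Sketch only: buildWorkload is illustrative. Field paths mirror the Workload
// examples in this KEP; the actual plugin would emit apply configurations.
package workload

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/utils/ptr"
)

func buildWorkload(trainJobName string, trainJobUID types.UID, minCount int64) *unstructured.Unstructured {
	w := &unstructured.Unstructured{}
	w.SetAPIVersion("scheduling.k8s.io/v1alpha2")
	w.SetKind("Workload")
	w.SetName(trainJobName)
	// The TrainJob is the controller owner, so deleting the TrainJob garbage-collects the Workload.
	w.SetOwnerReferences([]metav1.OwnerReference{{
		APIVersion: "trainer.kubeflow.org/v1alpha1",
		Kind:       "TrainJob",
		Name:       trainJobName,
		UID:        trainJobUID,
		Controller: ptr.To(true),
	}})
	// controllerRef points back to the TrainJob as the highest-level controller.
	_ = unstructured.SetNestedMap(w.Object, map[string]interface{}{
		"apiVersion": "trainer.kubeflow.org/v1alpha1",
		"kind":       "TrainJob",
		"name":       trainJobName,
	}, "spec", "controllerRef")
	// A single gang PodGroup template for the trainer step.
	_ = unstructured.SetNestedSlice(w.Object, []interface{}{
		map[string]interface{}{
			"name": "trainer",
			"schedulingPolicy": map[string]interface{}{
				"gang": map[string]interface{}{"minCount": minCount},
			},
		},
	}, "spec", "podGroupTemplates")
	return w
}
```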
+ +The TrainJob controller requires permissions to manage Workload resources: + +```go +// +kubebuilder:rbac:groups=scheduling.k8s.io,resources=workloads,verbs=get;list;watch;create;update;patch +// +kubebuilder:rbac:groups=scheduling.k8s.io,resources=workloads/status,verbs=get +// +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups,verbs=get;list;watch;create;update;patch +// +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups/status,verbs=get +``` + +### Feature Gate Dependencies + +The Workload API requires the following Kubernetes feature gates to be enabled: + +- `GenericWorkload`: Enables the Workload and PodGroup APIs + +The same feature gate is required to be enabled in TrainJob config: + +``` +GenericWorkload +``` + +## Defaulting/Validation + +- Workload API should not be configured in JobSet or Job templates +- Only one of `coscheduling`, `volcano`, or `workload` may be specified (CEL) +- `GenericWorkload` feature gate must be enabled (webhook) +- `spec.schedulingGroup` must not be manually set in Pod templates + +## Test Plan + +- [x] I/we understand the owners of the involved components may require updates to + existing tests to make this code solid enough prior to committing the changes necessary + to implement this enhancement. + +### Unit Tests + +- Workload and PodGroup creation based on the Runtime spec +- Correct calculation of `minCount` +- Proper `controllerRef` and `ownerReferences` configuration +- Pod template injection of `schedulingGroup` field + +### Integration Tests + +- Workload and PodGroup creation by TrainJob controller +- Workload creation for TrainJob with MPI plugin +- Workload creation for TrainJob with initializer +- TrainJob deletion triggers Workload cleanup +- Suspended TrainJob behavior with Workload + +### E2E Tests + +Verify the Workload and PodGroup orchestration as part of Trainer E2E tests. + +## Future Plans + +We plan to integrate other features like DRA, Topology-Aware Scheduling, Preemption as part of +the Workload API in the future. + +For example, we can introduce TopologyConstraint under Workload API to allow users setting topology: + +```yaml +podGroupPolicy: + workload: + topologyConstraint: + level: topology.kubernetes.io/rack +``` + +## Implementation History + +- 2026-02-16: KEP Creation for gang-scheduling support + +## Alternatives + +### Set Workload Spec in the TrainJob API + +We can integrate the Workload spec in the TrainJob API directly. That might introduce inconsistency +with other PodGroupPolicy plugins. Alternatively, we can allow to set Workload parameters in the +Runtime and TrainJob spec, similarly to the Trainer/Initializer. + +```yaml +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: TrainJob +metadata: + name: my-job +spec: + runtimeRef: + name: torch-distributed + workloadSpec: {} +``` + +### Custom Setting for PodGroupTemplates + +We could allow users to override the default Workload API behavior to enable custom orchestration for +TrainJob. To support this, we would need to extend the `.workload` API accordingly. +Alternatively, we could rely on [the JobSet integration](https://github.com/kubernetes-sigs/jobset/pull/1068) +to provide users with more fine-grained control over PodGroup configuration. 
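As a rough sketch (type and field names are illustrative and not settled), the `workload` policy source could carry a list of PodGroup templates that map replicated Jobs to groups and scheduling policies; the runtime example below shows the intended shape.

```go
// Illustrative only: a possible API shape for user-defined PodGroup templates.
// Names and fields are not settled and would track the upstream Workload API.
type WorkloadPodGroupPolicySource struct {
	// PodGroupTemplates overrides the default single-PodGroup behavior.
	PodGroupTemplates []WorkloadPodGroupTemplate `json:"podGroupTemplates,omitempty"`
}

type WorkloadPodGroupTemplate struct {
	// Name of the PodGroup template, for example "trainer" or "evaluator".
	Name string `json:"name"`
	// TargetJobs lists the replicated Jobs whose Pods join this PodGroup.
	TargetJobs []WorkloadTargetJob `json:"targetJobs"`
	// SchedulingPolicy selects gang or basic scheduling for the group.
	SchedulingPolicy WorkloadSchedulingPolicy `json:"schedulingPolicy"`
}

type WorkloadTargetJob struct {
	Name string `json:"name"`
}

type WorkloadSchedulingPolicy struct {
	Gang *WorkloadGangPolicy `json:"gang,omitempty"`
}

type WorkloadGangPolicy struct {
	MinCount int32 `json:"minCount"`
}
```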
+ +```yaml +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: ClusterTrainingRuntime +metadata: + name: custom-workload +spec: + podGroupPolicy: + workload: + podGroupTemplates: + - name: trainer + targetJobs: + - name: launcher + - name: node + schedulingPolicy: + gang: + minCount: 10 + - name: evaluator + targetJobs: + - name: evaluator + schedulingPolicy: + gang: + minCount: 2 +```