Support Gang Scheduling for PytorchJob on Kueue #2796

Open
FWCoder opened this issue Aug 7, 2024 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@FWCoder

FWCoder commented Aug 7, 2024

What happened:
When I submitted a PyTorchJob that requires 8 GPUs on the Master and 8 GPUs on the Worker, it was admitted even though there are only 8 GPUs available in the ClusterQueue. Both the Master and Worker pods were created, but only the Master pod could move to the Init and Running states. The Worker pod was stuck in Pending until the Master pod moved to the Completed state, at which point the Worker pod got stuck in the Init state because it was waiting for the Master pod to come up (a deadlock scenario).

This happens with "waitForPodsReady" enabled.
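For context, the relevant waitForPodsReady block of the Kueue manager configuration looks roughly like the sketch below (the timeout and blockAdmission values are illustrative, not copied from the actual config):

apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
# ... other manager settings omitted ...
waitForPodsReady:
  enable: true          # the "waitForPodsReady" setting referred to above
  timeout: 5m           # illustrative value
  blockAdmission: true  # illustrative; adjust to match the real config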

What you expected to happen:
The Kueue controller manager should evaluate the total requested resources across both the Master and Worker replicas and block the job from being admitted until there are enough resources in the ClusterQueue.

How to reproduce it (as minimally and precisely as possible):

Job Details:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: <LOCAL_QUEUE_NAME>
  name: hello-world-kueue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - command:
                - "sleep"
                - "60"
              image: <PYTORCH_IMAGE>
              imagePullPolicy: Always
              name: pytorch
              resources:
                limits:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
                requests:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
          securityContext:
            runAsUser: 1000
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - command:
                - "sleep"
                - "10"
              image: <PYTORCH_IMAGE>
              imagePullPolicy: Always
              name: pytorch
              resources:
                limits:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
                requests:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
          securityContext:
            runAsUser: 1000
  runPolicy:
    ttlSecondsAfterFinished: 604800

Create Job:

kubectl create -f hello-world-kueue.yaml
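To see how Kueue evaluated the job, the Workload object it creates for the PyTorchJob can be inspected right after submission (the namespace and workload name below are placeholders):

kubectl get workloads -n <NAMESPACE>
kubectl describe workload <WORKLOAD_NAME> -n <NAMESPACE>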

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.28
  • Kueue version (use git describe --tags --dirty --always): 0.6.1
  • Cloud provider or hardware configuration: AWS

@FWCoder FWCoder added the kind/bug Categorizes issue or PR as related to a bug. label Aug 7, 2024
@FWCoder FWCoder changed the title Support Gang Scheduling for Kueue Support Gang Scheduling for PytorchJob on Kueue Aug 7, 2024
@mszadkow
Contributor

/assign

@mszadkow
Contributor

mszadkow commented Sep 12, 2024

@FWCoder
Can you share more on:

  • the ClusterQueue configuration
  • the available nodes, e.g. a list of them with their resources
  • the node status after the job gets admitted

@mszadkow
Contributor

Let me explain why this is important.
The total amount of resources available to the ClusterQueue, which is what the admission decision is based on, may match the request.
However, the node configuration may let the Master "fit" onto one node while the Worker has no equivalent node at its disposal.
In other words, if there are 10 CPUs available in the cluster, the Master needs 5 and the Worker needs 5, but the nodes provide 5, 3 and 2 CPUs, the workload will still be admitted.
That is why we need both the ClusterQueue configuration and the node setup to check whether this is the case.
Then we could look deeper, but so far I couldn't reproduce the issue.
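For example, something along these lines should capture the needed details (the ClusterQueue name is a placeholder):

kubectl get clusterqueue <CLUSTER_QUEUE_NAME> -o yaml
kubectl get resourceflavors -o yaml
kubectl get nodes
kubectl describe nodes | grep -A 10 "Allocated resources"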

@alculquicondor
Contributor

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Sep 13, 2024