Support Gang Scheduling for PytorchJob on Kueue #2796

Open
FWCoder opened this issue Aug 7, 2024 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@FWCoder

FWCoder commented Aug 7, 2024

What happened:
When I submitted a PyTorchJob that requires 8 GPUs on the Master and 8 GPUs on the Worker, it was admitted even though there are only 8 GPUs available in the ClusterQueue. Both the Master and Worker pods were created, but only the Master pod could move to the Init and Running states. The Worker pod was stuck in Pending until the Master pod moved to the Completed state, at which point the Worker pod got stuck in the Init state because it was waiting for the Master pod to come up (a deadlock scenario).

This happens with "waitForPodsReady" enabled.
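For context, the relevant waitForPodsReady block of the Kueue manager configuration looks roughly like the sketch below (the timeout and blockAdmission values are illustrative, not copied from the actual config):

apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
# ... other manager settings omitted ...
waitForPodsReady:
  enable: true          # the "waitForPodsReady" setting referred to above
  timeout: 5m           # illustrative value
  blockAdmission: true  # illustrative; adjust to match the real config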

What you expected to happen:
The Kueue controller manager should evaluate the total requested resources across both the Master and Worker replicas and block the job from being admitted until there are enough resources in the ClusterQueue.

How to reproduce it (as minimally and precisely as possible):

Job Details:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: <LOCAL_QUEUE_NAME>
  name: hello-world-kueue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - command:
                - "sleep"
                - "60"
              image: <PYTORCH_IMAGE>
              imagePullPolicy: Always
              name: pytorch
              resources:
                limits:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
                requests:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
          securityContext:
            runAsUser: 1000
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - command:
                - "sleep"
                - "10"
              image: <PYTORCH_IMAGE>
              imagePullPolicy: Always
              name: pytorch
              resources:
                limits:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
                requests:
                  cpu: "86"
                  memory: 1037Gi
                  nvidia.com/gpu: "8"
          securityContext:
            runAsUser: 1000
  runPolicy:
    ttlSecondsAfterFinished: 604800

Create Job:

kubectl create -f hello-world-kueue.yaml
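To see how Kueue evaluated the job, the Workload object it creates for the PyTorchJob can be inspected right after submission (the namespace and workload name below are placeholders):

kubectl get workloads -n <NAMESPACE>
kubectl describe workload <WORKLOAD_NAME> -n <NAMESPACE>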

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.28
  • Kueue version (use git describe --tags --dirty --always): 0.6.1
  • Cloud provider or hardware configuration: AWS

@FWCoder FWCoder added the kind/bug Categorizes issue or PR as related to a bug. label Aug 7, 2024
@FWCoder FWCoder changed the title Support Gang Scheduling for Kueue Support Gang Scheduling for PytorchJob on Kueue Aug 7, 2024
@mszadkow
Contributor

/assign

@mszadkow
Contributor

mszadkow commented Sep 12, 2024

@FWCoder
Can you share more on:

  • the ClusterQueue configuration
  • the available nodes, e.g. a list of them with their resources
  • the node status after the job gets admitted

@mszadkow
Contributor

Let me explain why this is important.
The total amount of resources available to the ClusterQueue, which is what the admission decision is based on, may match the request.
However, the node configuration may let the Master "fit" onto one node while the Worker has no equivalent node at its disposal.
In other words, if there are 10 CPUs available in the cluster, the Master needs 5 and the Worker needs 5, but the nodes provide 5, 3 and 2 CPUs, the workload will still be admitted.
That is why we need both the ClusterQueue configuration and the node setup to check whether this is the case.
Then we could look deeper, but so far I couldn't reproduce the issue.
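For example, something along these lines should capture the needed details (the ClusterQueue name is a placeholder):

kubectl get clusterqueue <CLUSTER_QUEUE_NAME> -o yaml
kubectl get resourceflavors -o yaml
kubectl get nodes
kubectl describe nodes | grep -A 10 "Allocated resources"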

@alculquicondor
Contributor

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Sep 13, 2024