Support Gang Scheduling for PytorchJob on Kueue #2796
Labels
kind/bug
Categorizes issue or PR as related to a bug.
triage/needs-information
Indicates an issue needs more information in order to work on it.
What happened:
When I submitted a PytorchJob that requires 8 GPUs on the Master and 8 GPUs on the Worker, it was admitted even though only 8 GPUs are available in the Cluster Queue. Both the Master and Worker pods were created, but only the Master pod can move to the `Init` and `Running` states. The Worker pod is stuck in `Pending` until the Master pod moves to the `Completed` state. At that point, the Worker pod gets stuck in the `Init` state, since it is waiting for the Master pod to come up (a deadlock scenario). This happens with "waitForPodsReady" enabled.
What you expected to happen:
The Kueue Controller Manager should evaluate the total resources requested by the Master and Workers together, and block the job from being admitted until there are enough resources in the Cluster Queue.
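To make the expected behavior concrete, here is a minimal sketch of an all-or-nothing admission check. It is not Kueue's implementation: the `PodSet` type and the `total_gpus` and `can_admit` helpers are hypothetical names for illustration, and one replica each for Master and Worker is an assumption, since the report only gives the GPU counts.

```python
from dataclasses import dataclass


@dataclass
class PodSet:
    """One replica group of the job (hypothetical type, not Kueue's API)."""
    name: str
    count: int          # number of pods in the group (assumed 1 here)
    gpus_per_pod: int   # GPUs requested by each pod


def total_gpus(pod_sets):
    """Sum the GPU requests of every pod in every replica group."""
    return sum(ps.count * ps.gpus_per_pod for ps in pod_sets)


def can_admit(pod_sets, available_gpus):
    """All-or-nothing check: admit only if the whole job fits at once."""
    return total_gpus(pod_sets) <= available_gpus


# The job from this report: 8 GPUs on the Master and 8 GPUs on the Worker.
job = [PodSet("master", 1, 8), PodSet("worker", 1, 8)]

print(total_gpus(job))     # 16
print(can_admit(job, 8))   # False: only 8 GPUs in the queue, so the job should wait
print(can_admit(job, 16))  # True
```

With this kind of check, the job would stay queued while only 8 GPUs are available, instead of being admitted with just the Master able to run.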
How to reproduce it (as minimally and precisely as possible):
Job Details:
Create Job:
Anything else we need to know?:
Environment:
Kubernetes version (use `kubectl version`): 1.28
Kueue version (use `git describe --tags --dirty --always`): 0.6.1