Add an option to exit after the submitit launcher has scheduled the run on slurm #2479
Comments
Hi @abhiskk, I've looked into disabling waiting for the jobs to return, and I've run into a problem: Hydra's sweeper plugins expect the launcher to return information about each launched job (including its return value). For example, you'll see here that Hydra's BasicSweeper collects the returned values from each job in the sweep. An exception is raised if a job does not return a value. The same is true of Hydra's other sweeper plugins. I think making this work in a clean way would require changes to Hydra's sweeper API.
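To make the constraint concrete, here is a minimal toy sketch of the pattern described above. It does not use Hydra's actual classes: `ToyJobReturn`, `toy_launch`, and `toy_sweep` are hypothetical names invented for illustration, and the real sweeper and launcher APIs differ in detail. It only shows why a launcher that returns before its jobs finish breaks a sweeper that reads every job's return value.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence


@dataclass
class ToyJobReturn:
    """Hypothetical stand-in for the per-job record a launcher hands back."""

    overrides: Sequence[str]
    _return_value: Optional[Any] = None
    finished: bool = False

    @property
    def return_value(self) -> Any:
        # Reading the value of an unfinished job is an error.
        if not self.finished:
            raise RuntimeError(f"job {list(self.overrides)} has no return value")
        return self._return_value


def toy_launch(
    task: Callable[[Sequence[str]], Any],
    batch: Sequence[Sequence[str]],
    wait: bool = True,
) -> List[ToyJobReturn]:
    """Hypothetical launcher: runs each job, or only 'schedules' it if wait=False."""
    results = []
    for overrides in batch:
        if wait:
            results.append(ToyJobReturn(overrides, task(overrides), finished=True))
        else:
            # Scheduled but not awaited: nothing to report yet.
            results.append(ToyJobReturn(overrides))
    return results


def toy_sweep(
    task: Callable[[Sequence[str]], Any],
    batch: Sequence[Sequence[str]],
    wait: bool = True,
) -> List[Any]:
    """Hypothetical sweeper: collects the return value of every launched job."""
    returns = toy_launch(task, batch, wait=wait)
    return [r.return_value for r in returns]
```

With `wait=True` the toy sweep collects one value per job. With `wait=False` it raises as soon as it touches the first job's `return_value`, which is the situation a non-waiting launcher would create for a sweeper written this way.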
Maybe there's another way we can work around this limitation. Could you please share an example of how you're running the for loop? Is it a loop in Python or in bash? Is there a reason you're using a for loop instead of using a Hydra sweep?
If there's no other workaround, we can add an option to have the submitit launcher use a dummy return value (e.g.
Any update on this? It would be very useful.
+1. This would be very useful. I am using an HPC and need to launch experiments on a login node via
A simple workaround is to launch your command in the background with
That is not a good idea in general: if you are submitting 400 jobs, it will exceed the thread limit of some servers. If you only submit 10 jobs, it is fine.
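One way to reconcile the background workaround with the thread-limit concern is to cap how many launcher processes are alive at once. The sketch below is an assumption about how that could look, not something proposed in this thread: `launch_in_background` and `max_running` are hypothetical names, and the commands passed in are whatever you would otherwise run in your loop. Note that the driver script itself still waits for the launcher processes, so this limits load on the login node rather than removing the wait.

```python
import subprocess
from typing import List, Sequence


def launch_in_background(
    commands: Sequence[Sequence[str]], max_running: int = 10
) -> List[int]:
    """Start commands without blocking on each one, keeping at most
    `max_running` child processes alive. Returns exit codes in launch order."""
    running: List[subprocess.Popen] = []
    exit_codes: List[int] = []
    for cmd in commands:
        if len(running) >= max_running:
            # At the cap: wait for the oldest child before starting another.
            exit_codes.append(running.pop(0).wait())
        running.append(subprocess.Popen(list(cmd)))
    # Drain whatever is still running, oldest first.
    exit_codes.extend(proc.wait() for proc in running)
    return exit_codes
```

Each entry in `commands` would be the argument list of one launch command. A small `max_running` keeps the number of simultaneously polling processes bounded even when the total number of submissions is in the hundreds.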
Any progress on this?
🚀 Feature Request
Currently, when we execute the launcher command, the job is scheduled, but the training process keeps checking the job status instead of exiting. One limitation is that we cannot execute several scheduling commands in a loop, because each command blocks until its job finishes. This behavior comes from a wait command running here. It would be very helpful if this were configurable, so that the launcher can exit once the run is scheduled instead of continuously waiting for the results.