[Feature Request] Resuming sweep #1407
Comments
Thanks for the feature request @hobogalaxy! For now, Hydra saves all the configs for each job in that job's output directory.
Hi @hobogalaxy. I think individual sweepers can support it, but this is outside the scope of Hydra itself. You are welcome to create your own sweeper plugin that supports this. Regarding resume functionality in HPO sweepers: I am closing this as it's currently out of scope.
@omry Just confirming: if you want to continue from a failed multi-run, there is no way to do this? You need to start from the beginning even with the built-in sweeper? This is pretty standard in libraries like Ray.
Hi @KaleabTessera, yes, this is currently unsupported by Hydra.
It seems that Ray uses checkpointing to enable resuming execution after a failed run. Checkpointing is not implemented by Hydra.
I am only ever using the Optuna sweeper. When using it, […]. However, the sweeper will reset the […]. To me, that seems at least confusing. I would advise replacing this line with the following two lines to 'sync' Hydra with Optuna:
I think this also solves the unsolved part of the issue in #1679. Shall I create a separate issue for this or propose a PR? Although I suppose it should be refactored for other sweepers as well.
This is Optuna-specific, so implementing it for other sweepers would be out of scope. If this resolves #1679 then you can directly propose a PR, while if it solves a different problem it may be best to first create a new specific issue.
Okay, thanks, it's in #2647. |
🚀 Feature Request
Hi,
is there any way to resume a failed Hydra sweep?
For example, I run 10 jobs, but job number 6 crashes the whole multirun. Can I somehow resume exactly from job 6?
Describe the solution you'd like
Maybe a parameter could simply be added that lets the user choose which job the sweep should start from?
E.g. I run
python train.py --multirun batch_size=32,64,128,256,512 --start_from 2
and it starts from job number 2, which would be batch_size 128. I know in this case we could simply run the sweep for batch_size=128,256,512 instead, but things get a little tricky when there are multiple different parameters that we sweep over.

This solution, however, wouldn't be very helpful for resuming sweeps of plugin sweepers like Optuna, since in most cases Optuna needs the history of runs executed so far to decide which parameters to choose next. Is there any chance some other kind of resuming mechanism could be implemented for plugin sweepers?
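The proposed `--start_from` semantics can be sketched outside Hydra: enumerate the cartesian product of the swept values in the same order the basic sweeper would, then drop the first N combinations. This is a hypothetical illustration of the idea, not part of Hydra's API; `jobs_to_run` and `sweep_space` are made-up names for this sketch.

```python
from itertools import product

def jobs_to_run(sweep_space, start_from=0):
    """Yield (job_index, overrides) pairs in cartesian-product order,
    skipping the first `start_from` jobs (hypothetical --start_from)."""
    keys = list(sweep_space)
    combos = product(*(sweep_space[k] for k in keys))
    for idx, values in enumerate(combos):
        if idx >= start_from:
            yield idx, dict(zip(keys, values))

# Example from the request: one swept parameter, resuming at job 2
space = {"batch_size": [32, 64, 128, 256, 512]}
remaining = list(jobs_to_run(space, start_from=2))
# remaining[0] == (2, {"batch_size": 128})
```

With several swept parameters this avoids working out by hand which value combinations were already completed, which is exactly where re-listing the remaining values gets tricky.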