[tune] Cluster Fault Tolerance#3309
Test FAILed.
ericl
left a comment
Works great now! I also patched rllib's train.py to only resume on --resume.
It might be a bit aggressive to enable this by default, but we can always default it to false later, and I think this will help a lot for visibility.
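To illustrate the "only resume on --resume" behavior mentioned above, here is a minimal sketch using `argparse`. It is not the actual train.py patch: `build_parser`, `should_resume`, and the `checkpoint_exists` argument are hypothetical names chosen for illustration.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical training entry point."""
    parser = argparse.ArgumentParser(description="Hypothetical training script.")
    # Resuming is opt-in: without the flag, a fresh run is always started.
    parser.add_argument(
        "--resume",
        action="store_true",
        help="Resume from previously saved experiment state if it exists.",
    )
    return parser


def should_resume(args, checkpoint_exists):
    """Resume only when the user asked for it AND saved state is present."""
    return bool(args.resume and checkpoint_exists)
```

With this shape, an existing checkpoint on disk is ignored unless the caller passes `--resume` explicitly, which avoids silently picking up stale state.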
        except Exception:
            logger.exception("Error checkpointing trial metadata.")

    def get_checkpoints(self):
@old-bear any comments on the changes to the trial_executor interface?
Also nit: "This will ignore any new changes to specification" isn't grammatically correct.
Test FAILed.
Co-Authored-By: richardliaw <rliaw@berkeley.edu>
Test FAILed.
Test PASSed.
Test FAILed.
jenkins retest this please
Test FAILed.
jenkins retest this please
Test FAILed.
Merging this since:
@ericl, thanks for the multiple rounds of extensive review!
Nice, this helps a lot! Thanks guys!
What do these changes do?
A redo of #3165 with extraneous cleanup changes removed.
This currently does not use the same restoring code path as #3238, but that can change later once component fault tolerance (FT) is implemented. In particular, this does not notify components when some trials go RUNNING -> PENDING.
This adds the following functionality:
Example:
TODO:
NOTE: this should be a lot easier to review after #3414 is merged.
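For readers unfamiliar with the idea, the following is a rough, self-contained sketch of checkpointing trial metadata and restoring it so that trials that were RUNNING come back as PENDING. It is not the code in this PR: the function names, the `experiment_state.json` file name, and the plain-dict trial representation are all assumptions made for illustration.

```python
import json
import logging
import os
import tempfile

logger = logging.getLogger(__name__)

PENDING, RUNNING, TERMINATED = "PENDING", "RUNNING", "TERMINATED"
STATE_FILE = "experiment_state.json"  # hypothetical file name


def save_trial_metadata(trials, metadata_dir):
    """Write trial metadata to disk; log and continue if anything fails."""
    try:
        os.makedirs(metadata_dir, exist_ok=True)
        # Write to a temp file first, then rename, so a crash mid-write
        # never leaves a truncated state file behind.
        fd, tmp_path = tempfile.mkstemp(dir=metadata_dir, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(trials, f)
        os.replace(tmp_path, os.path.join(metadata_dir, STATE_FILE))
    except Exception:
        logger.exception("Error checkpointing trial metadata.")


def restore_trial_metadata(metadata_dir):
    """Load saved trials, requeueing any that were mid-run when saved."""
    with open(os.path.join(metadata_dir, STATE_FILE)) as f:
        trials = json.load(f)
    for trial in trials:
        # A trial that was RUNNING at save time has lost its process,
        # so it must be scheduled again from its last checkpoint.
        if trial["status"] == RUNNING:
            trial["status"] = PENDING
    return trials
```

Swallowing the exception in `save_trial_metadata` mirrors the reviewed snippet above: a failed metadata write is logged but does not abort the running experiment.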