-
Notifications
You must be signed in to change notification settings - Fork 7k
[train] Raise ValueError when checkpoint_upload_fn does not return a checkpoint #58863
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[train] Raise ValueError when checkpoint_upload_fn does not return a checkpoint #58863
Conversation
Signed-off-by: Timothy Seah <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a valuable validation check to ensure that a user-provided checkpoint_upload_fn returns a checkpoint object. By raising a ValueError when the function returns a falsy value, it helps users fail fast and avoid subtle bugs like stalled training runs. The implementation is straightforward and effective. The accompanying unit test is well-written and correctly verifies the new behavior in a distributed setting. Overall, this is a great improvement for user experience and robustness.
python/ray/train/v2/tests/test_async_checkpointing_validation.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Timothy Seah <[email protected]>
python/ray/train/v2/tests/test_async_checkpointing_validation.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Timothy Seah <[email protected]>
justinvyu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
…checkpoint (ray-project#58863) Fail fast if the users forgets to return a checkpoint in their `checkpoint_upload_fn`. This also causes unexpected issues like `get_all_reported_checkpoints` stalling indefinitely because the counter is misaligned, which I can also fix in a separate PR. --------- Signed-off-by: Timothy Seah <[email protected]> Signed-off-by: YK <[email protected]>
…checkpoint (ray-project#58863) Fail fast if the users forgets to return a checkpoint in their `checkpoint_upload_fn`. This also causes unexpected issues like `get_all_reported_checkpoints` stalling indefinitely because the counter is misaligned, which I can also fix in a separate PR. --------- Signed-off-by: Timothy Seah <[email protected]>
Summary
Fail fast if the users forgets to return a checkpoint in their
checkpoint_upload_fn. This also causes unexpected issues likeget_all_reported_checkpointsstalling indefinitely because the counter is misaligned, which I can also fix in a separate PR.Testing
Unit tests