@TimothySeah
Contributor

Summary

Fail fast if the user forgets to return a checkpoint in their checkpoint_upload_fn. A missing return value also causes unexpected issues, such as get_all_reported_checkpoints stalling indefinitely because the counter is misaligned; I can fix that in a separate PR.
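
For illustration, here is a minimal sketch of the kind of fail-fast check described above. It is not the code from this PR: `FakeCheckpoint` and `run_upload_fn` are hypothetical stand-ins, the one-argument signature for the upload function is an assumption made for brevity, and the check is assumed to be a plain truthiness test.

```python
from typing import Callable, Optional


class FakeCheckpoint:
    """Hypothetical stand-in for a checkpoint object (not Ray's Checkpoint)."""

    def __init__(self, path: str) -> None:
        self.path = path


def run_upload_fn(
    upload_fn: Callable[[FakeCheckpoint], Optional[FakeCheckpoint]],
    local_checkpoint: FakeCheckpoint,
) -> FakeCheckpoint:
    """Hypothetical helper: call the user's upload function and fail fast
    if it returns nothing, instead of letting the run stall later."""
    uploaded = upload_fn(local_checkpoint)
    if not uploaded:  # assumed truthiness check; None is the common case
        raise ValueError(
            f"checkpoint_upload_fn must return a checkpoint, got {uploaded!r}"
        )
    return uploaded


def buggy_upload_fn(checkpoint: FakeCheckpoint):
    # Simulates an upload but forgets the return statement, so it returns None.
    FakeCheckpoint("remote://" + checkpoint.path)


def fixed_upload_fn(checkpoint: FakeCheckpoint) -> FakeCheckpoint:
    # Same simulated upload, with the uploaded checkpoint returned.
    return FakeCheckpoint("remote://" + checkpoint.path)
```

With this sketch, `run_upload_fn(buggy_upload_fn, FakeCheckpoint("ckpt_0"))` raises a ValueError immediately, while the same call with `fixed_upload_fn` returns the uploaded checkpoint.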

Testing

Unit tests

@TimothySeah TimothySeah requested a review from a team as a code owner November 20, 2025 23:52
@TimothySeah TimothySeah changed the title [train] Raise ValueError for buggy checkpoint_upload_fn [train] Raise ValueError when checkpoint_upload_fn does not return a checkpoint Nov 20, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable validation check to ensure that a user-provided checkpoint_upload_fn returns a checkpoint object. By raising a ValueError when the function returns a falsy value, it helps users fail fast and avoid subtle bugs like stalled training runs. The implementation is straightforward and effective. The accompanying unit test is well-written and correctly verifies the new behavior in a distributed setting. Overall, this is a great improvement for user experience and robustness.

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Nov 21, 2025
Signed-off-by: Timothy Seah <[email protected]>
@TimothySeah TimothySeah added the go add ONLY when ready to merge, run all tests label Nov 22, 2025
Signed-off-by: Timothy Seah <[email protected]>
Contributor

@justinvyu justinvyu left a comment

Thanks!

@justinvyu justinvyu enabled auto-merge (squash) November 24, 2025 19:54
@justinvyu justinvyu merged commit 7c4d4e3 into ray-project:master Nov 24, 2025
7 checks passed
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
…checkpoint (ray-project#58863)

Fail fast if the user forgets to return a checkpoint in their
`checkpoint_upload_fn`. A missing return value also causes unexpected
issues, such as `get_all_reported_checkpoints` stalling indefinitely
because the counter is misaligned; I can fix that in a separate PR.

---------

Signed-off-by: Timothy Seah <[email protected]>
Signed-off-by: YK <[email protected]>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…checkpoint (ray-project#58863)

Signed-off-by: Timothy Seah <[email protected]>