@JasonLi1909 JasonLi1909 commented Oct 21, 2025

This PR adds persistent epoch data to the checkpointing logic in the FSDP2 Template.

This PR includes:

  • New logic for saving the epoch into a distributed checkpoint
  • New logic for resuming training from the saved epoch in a loaded checkpoint
  • Updates to the OSS FSDP2 example to include the new logic
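
The save-and-resume flow listed above can be sketched roughly as follows. This is not the template's code: a JSON file stands in for the distributed checkpoint, and `save_checkpoint`, `load_start_epoch`, and `train` are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile
from typing import List, Optional


def save_checkpoint(ckpt_dir: str, epoch: int, weights: List[float]) -> None:
    """Persist the epoch alongside the (stand-in) model state."""
    with open(os.path.join(ckpt_dir, "state.json"), "w") as f:
        json.dump({"epoch": epoch, "weights": weights}, f)


def load_start_epoch(ckpt_dir: str) -> int:
    """Return the epoch to resume from, or 0 when no checkpoint exists."""
    path = os.path.join(ckpt_dir, "state.json")
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        # The saved epoch has already completed, so resume at the next one.
        return json.load(f)["epoch"] + 1


def train(ckpt_dir: str, num_epochs: int, stop_after: Optional[int] = None) -> List[int]:
    """Run epochs [start, num_epochs); optionally stop early to simulate a failure."""
    ran = []
    for epoch in range(load_start_epoch(ckpt_dir), num_epochs):
        ran.append(epoch)  # a real loop would train here
        save_checkpoint(ckpt_dir, epoch, weights=[0.0])
        if stop_after is not None and epoch == stop_after:
            break
    return ran
```

Calling `train` twice on the same directory, with the first call interrupted, should make the second call continue from the epoch after the last one saved instead of restarting at epoch 0.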

Passing release test: https://buildkite.com/ray-project/release/builds/64867#019a08e3-1a3e-4fc5-9633-b8e3a0b0f34f

@JasonLi1909 JasonLi1909 requested review from a team as code owners October 21, 2025 01:35

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to add functionality for resuming training from a specific epoch by saving and loading the epoch number in checkpoints. The implementation correctly modifies AppState to store the epoch and updates the saving logic. However, I've identified a critical bug in the checkpoint loading logic (load_fsdp_checkpoint) where a new AppState object is created for loading, separate from the one used to retrieve the epoch afterwards. This prevents the epoch from being correctly restored, causing training to always resume from the beginning. This issue is present in both the Jupyter notebook and Markdown versions of the example.
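
The pattern the review describes can be illustrated with a minimal sketch. The actual `load_fsdp_checkpoint` code is not shown in this conversation, so everything below is an assumption: `AppState` is a simplified stand-in, `fake_load` is a hypothetical substitute for an in-place checkpoint load, and `SAVED` is made-up checkpoint content.

```python
class AppState:
    """Simplified, hypothetical stand-in for the example's AppState wrapper."""

    def __init__(self, epoch=None):
        self.epoch = epoch

    def state_dict(self):
        return {"epoch": self.epoch}

    def load_state_dict(self, state):
        self.epoch = state["epoch"]


def fake_load(state, saved):
    """Hypothetical in-place load: restores every object passed in `state`."""
    for key, obj in state.items():
        obj.load_state_dict(saved[key])


SAVED = {"app": {"epoch": 3}}  # made-up checkpoint contents


def resume_buggy():
    fake_load({"app": AppState()}, SAVED)  # the restored object is discarded
    app_state = AppState()                 # a second object whose epoch is never set
    return 0 if app_state.epoch is None else app_state.epoch + 1


def resume_fixed():
    app_state = AppState()
    fake_load({"app": app_state}, SAVED)   # load into the object that is read afterwards
    return 0 if app_state.epoch is None else app_state.epoch + 1
```

In this sketch `resume_buggy()` always starts from epoch 0, while `resume_fixed()` picks up after the saved epoch, because the same `AppState` instance is both loaded into and read from.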

cursor[bot]

This comment was marked as outdated.

Signed-off-by: JasonLi1909 <[email protected]>
@ray-gardener ray-gardener bot added the docs (An issue or change related to documentation) and train (Ray Train Related Issue) labels Oct 21, 2025
Signed-off-by: JasonLi1909 <[email protected]>
Signed-off-by: JasonLi1909 <[email protected]>
@matthewdeng matthewdeng enabled auto-merge (squash) October 22, 2025 18:19
@github-actions github-actions bot added the go (add ONLY when ready to merge, run all tests) label Oct 22, 2025
@matthewdeng matthewdeng merged commit 7e11431 into ray-project:master Oct 22, 2025
8 checks passed
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 27, 2025
…ray-project#57938)

This PR adds persistent epoch data to the checkpointing logic in the
[FSDP2
Template](https://docs.ray.io/en/master/train/examples/pytorch/pytorch-fsdp/README.html).

This PR includes:
- New logic for saving the epoch into a distributed checkpoint
- New logic for resuming training from the saved epoch in a loaded
checkpoint
- Updates the [OSS FSDP2
example](https://docs.ray.io/en/master/train/examples/pytorch/pytorch-fsdp/README.html)
to include the new logic

Passing release test:
https://buildkite.com/ray-project/release/builds/64867#019a08e3-1a3e-4fc5-9633-b8e3a0b0f34f

---------

Signed-off-by: JasonLi1909 <[email protected]>
Signed-off-by: Jason Li <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Signed-off-by: xgui <[email protected]>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ray-project#57938)

Signed-off-by: JasonLi1909 <[email protected]>
Signed-off-by: Jason Li <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ray-project#57938)

Signed-off-by: JasonLi1909 <[email protected]>
Signed-off-by: Jason Li <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ray-project#57938)

Signed-off-by: JasonLi1909 <[email protected]>
Signed-off-by: Jason Li <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>