[train] FSDP2 Template: Resume from previous epoch when checkpointing #57938
Conversation
Code Review
This pull request adds support for resuming training from a specific epoch by saving and loading the epoch number in checkpoints. The implementation correctly modifies AppState to store the epoch and updates the saving logic. However, there is a critical bug in the checkpoint-loading logic (load_fsdp_checkpoint): a new AppState object is created for loading, separate from the one the epoch is read from afterwards. As a result, the restored epoch is never observed and training always resumes from the beginning. The issue is present in both the Jupyter notebook and the Markdown version of the example.
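The failure mode is an aliasing bug: `dcp.load` restores state in place into the objects passed via `state_dict`, so reading the epoch from a different, freshly constructed object always yields the default. Below is a minimal sketch of the pattern, assuming an `AppState` `Stateful` wrapper roughly like the template's; the names, constructor signature, and shape of `load_fsdp_checkpoint` are illustrative, not the template's exact code (a fuller `AppState` sketch follows the PR description below):

```python
import torch.distributed.checkpoint as dcp

# Buggy shape described in the review: dcp.load populates the AppState
# passed in state_dict, but the epoch is then read from a second,
# never-populated AppState, so training always restarts from epoch 0.
def load_fsdp_checkpoint_buggy(model, optimizer, checkpoint_dir):
    dcp.load(state_dict={"app": AppState(model, optimizer)}, checkpoint_id=checkpoint_dir)
    return AppState(model, optimizer).epoch  # fresh object; epoch was never restored

# Fix: load into, and read the epoch from, the same AppState instance.
def load_fsdp_checkpoint(model, optimizer, checkpoint_dir):
    app_state = AppState(model, optimizer)
    dcp.load(state_dict={"app": app_state}, checkpoint_id=checkpoint_dir)
    return app_state.epoch  # restored by AppState.load_state_dict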
This PR adds persistent epoch data to the checkpointing logic in the [FSDP2 Template](https://docs.ray.io/en/master/train/examples/pytorch/pytorch-fsdp/README.html).
This PR includes:
- New logic for saving the epoch into a distributed checkpoint
- New logic for resuming training from the saved epoch in a loaded checkpoint
- An update to the [OSS FSDP2 example](https://docs.ray.io/en/master/train/examples/pytorch/pytorch-fsdp/README.html) to include the new logic

A minimal sketch of this save/restore pattern appears below.
Passing release test: https://buildkite.com/ray-project/release/builds/64867#019a08e3-1a3e-4fc5-9633-b8e3a0b0f34f
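For reference, here is a minimal sketch of the save-side pattern: folding an epoch counter into a `Stateful` wrapper so that `dcp.save`/`dcp.load` round-trip it together with the sharded model and optimizer state. It uses the public `torch.distributed.checkpoint` API; the `AppState` class, its field names, and `save_fsdp_checkpoint` are illustrative, not necessarily the template's exact code:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.checkpoint.stateful import Stateful


class AppState(Stateful):
    """Bundles model/optimizer state with the current epoch so all three
    are saved to and loaded from the distributed checkpoint together."""

    def __init__(self, model, optimizer, epoch=0):
        self.model = model
        self.optimizer = optimizer
        self.epoch = epoch

    def state_dict(self):
        # Gathers model and optimizer state in a DCP-friendly format.
        model_sd, optim_sd = get_state_dict(self.model, self.optimizer)
        return {"model": model_sd, "optim": optim_sd, "epoch": self.epoch}

    def load_state_dict(self, state_dict):
        # Restores model/optimizer state in place, then the epoch counter.
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"],
        )
        self.epoch = state_dict["epoch"]


def save_fsdp_checkpoint(model, optimizer, epoch, checkpoint_dir):
    # The epoch rides along inside the same distributed checkpoint.
    dcp.save(state_dict={"app": AppState(model, optimizer, epoch)}, checkpoint_id=checkpoint_dir)
```

After loading, the training loop can pick up where it left off, e.g. `for epoch in range(start_epoch, num_epochs):` with `start_epoch` derived from the loaded `app_state.epoch`.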