[train] FSDP2 Template: Resume from previous epoch when checkpointing #57938
Conversation
Code Review
This pull request adds support for resuming training from a specific epoch by saving and loading the epoch number in checkpoints. The implementation correctly modifies AppState to store the epoch and updates the saving logic. However, there is a critical bug in the checkpoint-loading logic (load_fsdp_checkpoint): a new AppState object is created for loading, separate from the one the epoch is read from afterwards. As a result, the restored epoch is never observed and training always resumes from the beginning. The issue is present in both the Jupyter notebook and the Markdown version of the example.
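The failure mode is an aliasing bug: `dcp.load` restores state in place into the objects passed via `state_dict`, so reading the epoch from a different, freshly constructed object always yields the default. Below is a minimal sketch of the pattern, assuming an `AppState` `Stateful` wrapper roughly like the template's; the names, constructor signature, and shape of `load_fsdp_checkpoint` are illustrative, not the template's exact code (a fuller `AppState` sketch follows the PR description below):

```python
import torch.distributed.checkpoint as dcp

# Buggy shape described in the review: dcp.load populates the AppState
# passed in state_dict, but the epoch is then read from a second,
# never-populated AppState, so training always restarts from epoch 0.
def load_fsdp_checkpoint_buggy(model, optimizer, checkpoint_dir):
    dcp.load(state_dict={"app": AppState(model, optimizer)}, checkpoint_id=checkpoint_dir)
    return AppState(model, optimizer).epoch  # fresh object; epoch was never restored

# Fix: load into, and read the epoch from, the same AppState instance.
def load_fsdp_checkpoint(model, optimizer, checkpoint_dir):
    app_state = AppState(model, optimizer)
    dcp.load(state_dict={"app": app_state}, checkpoint_id=checkpoint_dir)
    return app_state.epoch  # restored by AppState.load_state_dict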
This PR adds persistent epoch data to the checkpointing logic in the [FSDP2 Template](https://docs.ray.io/en/master/train/examples/pytorch/pytorch-fsdp/README.html).
This PR includes:
- New logic for saving the epoch into a distributed checkpoint
- New logic for resuming training from the saved epoch in a loaded checkpoint
- An update to the [OSS FSDP2 example](https://docs.ray.io/en/master/train/examples/pytorch/pytorch-fsdp/README.html) to include the new logic

A minimal sketch of this save/restore pattern appears below.
Passing release test: https://buildkite.com/ray-project/release/builds/64867#019a08e3-1a3e-4fc5-9633-b8e3a0b0f34f
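For reference, here is a minimal sketch of the save-side pattern: folding an epoch counter into a `Stateful` wrapper so that `dcp.save`/`dcp.load` round-trip it together with the sharded model and optimizer state. It uses the public `torch.distributed.checkpoint` API; the `AppState` class, its field names, and `save_fsdp_checkpoint` are illustrative, not necessarily the template's exact code:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.checkpoint.stateful import Stateful


class AppState(Stateful):
    """Bundles model/optimizer state with the current epoch so all three
    are saved to and loaded from the distributed checkpoint together."""

    def __init__(self, model, optimizer, epoch=0):
        self.model = model
        self.optimizer = optimizer
        self.epoch = epoch

    def state_dict(self):
        # Gathers model and optimizer state in a DCP-friendly format.
        model_sd, optim_sd = get_state_dict(self.model, self.optimizer)
        return {"model": model_sd, "optim": optim_sd, "epoch": self.epoch}

    def load_state_dict(self, state_dict):
        # Restores model/optimizer state in place, then the epoch counter.
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"],
        )
        self.epoch = state_dict["epoch"]


def save_fsdp_checkpoint(model, optimizer, epoch, checkpoint_dir):
    # The epoch rides along inside the same distributed checkpoint.
    dcp.save(state_dict={"app": AppState(model, optimizer, epoch)}, checkpoint_id=checkpoint_dir)
```

After loading, the training loop can pick up where it left off, e.g. `for epoch in range(start_epoch, num_epochs):` with `start_epoch` derived from the loaded `app_state.epoch`.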