Checkpointing fixes #9

Merged
SahilJain314 merged 1 commit into main from ashors/ckpt-fixes
Mar 21, 2025
@ashors1 (Contributor) commented Mar 21, 2025

What does this PR do?

  • Converts relative paths to absolute paths before checkpoint saving and loading. This works around Ray workers running from different working directories, where the same relative path would resolve to a different location on each worker.
  • Adds the LR scheduler state to the checkpoint.
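
The first change can be illustrated with a minimal sketch. This is not the code from the PR: `to_absolute` is a hypothetical helper, and both directory names are made up. It only shows why a relative checkpoint path is ambiguous across processes with different working directories, and why resolving it once up front removes the ambiguity.

```python
import os
from typing import Optional


def to_absolute(path: str, base_dir: Optional[str] = None) -> str:
    """Resolve `path` against `base_dir` (default: the current working directory).

    Hypothetical helper for illustration. An already-absolute `path` is
    returned unchanged (apart from normalization), because os.path.join
    discards `base_dir` when its second argument is absolute.
    """
    base = base_dir if base_dir is not None else os.getcwd()
    return os.path.abspath(os.path.join(base, os.path.expanduser(path)))


# Made-up working directories standing in for a driver and a worker process.
driver_cwd = os.path.join(os.sep, "home", "user", "project")
worker_cwd = os.path.join(os.sep, "tmp", "worker_session")

relative = os.path.join("results", "step_100")

# The same relative string points at two different locations.
seen_by_driver = to_absolute(relative, driver_cwd)
seen_by_worker = to_absolute(relative, worker_cwd)

# Resolving once on the driver gives a path that every process reads the same way.
shared = to_absolute(relative, driver_cwd)
seen_by_worker_after_fix = to_absolute(shared, worker_cwd)
```

Here `seen_by_driver` and `seen_by_worker` differ, while `seen_by_worker_after_fix` equals `shared`, which is the property the checkpoint save and load paths need.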

Changelog

  • Please update the CHANGELOG.md under next version with high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
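
As a rough, framework-free illustration of why LR scheduler state belongs in a checkpoint, consider the sketch below. It is not taken from this PR. `StepDecay`, `save_checkpoint`, `load_checkpoint`, and the `state.json` file name are all hypothetical; the `state_dict()` / `load_state_dict()` method names simply mirror the convention used by PyTorch schedulers.

```python
import json
import os
from typing import List


class StepDecay:
    """Hypothetical stand-in for a framework LR scheduler."""

    def __init__(self, base_lr: float, gamma: float, step_size: int) -> None:
        self.base_lr = base_lr
        self.gamma = gamma
        self.step_size = step_size
        self.step_count = 0

    def step(self) -> None:
        self.step_count += 1

    def get_lr(self) -> float:
        # Multiply the LR by `gamma` once every `step_size` steps.
        return self.base_lr * self.gamma ** (self.step_count // self.step_size)

    def state_dict(self) -> dict:
        return {"step_count": self.step_count}

    def load_state_dict(self, state: dict) -> None:
        self.step_count = state["step_count"]


def save_checkpoint(ckpt_dir: str, weights: List[float], scheduler: StepDecay) -> str:
    """Write weights and scheduler state together; returns the absolute directory."""
    ckpt_dir = os.path.abspath(ckpt_dir)  # resolve before touching the filesystem
    os.makedirs(ckpt_dir, exist_ok=True)
    with open(os.path.join(ckpt_dir, "state.json"), "w") as f:
        json.dump({"weights": weights, "scheduler": scheduler.state_dict()}, f)
    return ckpt_dir


def load_checkpoint(ckpt_dir: str, scheduler: StepDecay) -> List[float]:
    """Restore scheduler state in place and return the saved weights."""
    with open(os.path.join(os.path.abspath(ckpt_dir), "state.json")) as f:
        state = json.load(f)
    scheduler.load_state_dict(state["scheduler"])
    return state["weights"]
```

Without the `"scheduler"` entry, a resumed run would rebuild the scheduler at step 0 and restart from the base learning rate instead of continuing the decay schedule.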

Before your PR is "Ready for review"

Pre checks:

Checklist when contributing

  • TBD

Additional Information

  • Related to # (issue)

@SahilJain314 merged commit c98ab97 into main Mar 21, 2025
3 of 5 checks passed
@SahilJain314 deleted the ashors/ckpt-fixes branch March 21, 2025 03:31
KiddoZhu pushed a commit that referenced this pull request May 6, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>