Added ability to try loading the latest checkpoint from save folders #717

Merged Sep 3, 2024 (4 commits)

@2015aroras (Collaborator) commented Aug 30, 2024

Issue: Our training runs never finish in a single run of train.py. Currently there is no convenient way to continue training with the same config; we have to manually set load_path to the latest checkpoint before each restart.

Fix: Add an option that tries loading the latest checkpoint from the local and remote save folders (assuming load_path is not set). If there are no checkpoints in either folder, then the model initializes from scratch as usual. If this option (--try_load_latest_save) is set to True for both the initial and subsequent runs, then the first run will initialize and save an initial checkpoint while subsequent runs will resume from the latest checkpoint.

UPDATE: Changed try_load_latest_save to override load_path. This enables using the same config for first and subsequent runs when starting a run using a checkpoint and saving to a different location.
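
For readers who want the resolution order spelled out, here is a minimal sketch of the behavior described above. It is not the code from this PR. The function names, the `step<N>` directory naming, and the tie-breaking rule (prefer the local folder when both folders hold the same step) are assumptions made for illustration, and the remote folder is modeled as an ordinary directory rather than an object-store URL.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Assumption: each checkpoint is a subdirectory named "step<N>".
_STEP_DIR = re.compile(r"^step(\d+)$")


def find_latest_checkpoint(folder: Optional[Path]) -> Optional[Tuple[int, Path]]:
    """Return (step, path) for the highest-step checkpoint in `folder`, or None."""
    if folder is None or not folder.is_dir():
        return None
    best: Optional[Tuple[int, Path]] = None
    for child in folder.iterdir():
        match = _STEP_DIR.match(child.name)
        if match and child.is_dir():
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, child)
    return best


def resolve_load_path(
    load_path: Optional[Path],
    save_folder: Optional[Path],
    remote_save_folder: Optional[Path],
    try_load_latest_save: bool,
) -> Optional[Path]:
    """Pick the checkpoint to load; None means initialize from scratch."""
    if not try_load_latest_save:
        return load_path
    # Local is listed first so that max() keeps it when steps are equal.
    found = [
        c
        for c in (
            find_latest_checkpoint(save_folder),
            find_latest_checkpoint(remote_save_folder),
        )
        if c is not None
    ]
    if not found:
        # Nothing saved yet: fall back to the configured load_path, if any.
        return load_path
    # A saved checkpoint overrides load_path, per the UPDATE above.
    return max(found, key=lambda c: c[0])[1]
```

With this shape, a first run that sets both `load_path` and `try_load_latest_save` starts from `load_path` because the save folders are empty, and every later run with the same config resumes from whatever was saved most recently.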

@2015aroras (Collaborator, Author) commented:

Tested the 3 main scenarios: no existing checkpoint, only remote checkpoints, local + remote checkpoints.

@2015aroras 2015aroras marked this pull request as ready for review August 30, 2024 01:27
@2015aroras 2015aroras requested review from epwalsh and dirkgr August 30, 2024 01:27
@epwalsh (Member) left a comment

LGTM, this will be super useful

@2015aroras 2015aroras merged commit ca81901 into main Sep 3, 2024
11 of 12 checks passed
@2015aroras 2015aroras deleted the shanea/try-load-latest-save-2 branch September 3, 2024 13:49