Added ability to try loading the latest checkpoint from save folders #717

2015aroras · 2024-08-30T00:34:20Z

Issue: Our training runs never finish in a single run of train.py. Currently there is no convenient way to continue training with the same config; we have to manually set load_path to the latest checkpoint each time.

Fix: Add an option that tries loading the latest checkpoint from the local and remote save folders (assuming load_path is not set). If there are no checkpoints in either folder, then the model initializes from scratch as usual. If this option (--try_load_latest_save) is set to True for both the initial and subsequent runs, then the first run will initialize and save an initial checkpoint while subsequent runs will resume from the latest checkpoint.
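As a rough illustration of the "find the latest checkpoint in a save folder" step, here is a minimal sketch for the local case only. It is not the code from this PR: the function name `find_latest_checkpoint` is hypothetical, and it assumes checkpoints are stored as subdirectories whose names start with `step<N>`, which may not match the real layout.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoint directories are named like "step1000" or "step1000-unsharded".
_STEP_RE = re.compile(r"^step(\d+)")


def find_latest_checkpoint(save_folder: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step number, or None."""
    folder = Path(save_folder)
    if not folder.is_dir():
        # A missing folder is treated the same as "no checkpoints yet".
        return None

    latest: Optional[Path] = None
    latest_step = -1
    for child in folder.iterdir():
        match = _STEP_RE.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > latest_step:
            latest, latest_step = child, int(match.group(1))
    return latest
```

Returning `None` when nothing is found lets the caller fall back to initializing from scratch, which matches the behavior described above for empty save folders.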

UPDATE: Changed try_load_latest_save to override load_path. This allows the same config to be used for the first and subsequent runs when a run starts from an existing checkpoint (via load_path) and saves to a different location.
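The resulting precedence can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `resolve_load_path`, the `save_folder` and `remote_save_folder` parameter names, and the `find_latest` callback are all hypothetical, and checking the local folder before the remote one is an assumed order.

```python
from typing import Callable, Optional


def resolve_load_path(
    try_load_latest_save: bool,
    load_path: Optional[str],
    save_folder: str,
    remote_save_folder: Optional[str],
    find_latest: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Pick the checkpoint to load; None means initialize from scratch."""
    if try_load_latest_save:
        # Assumed order: local save folder first, then the remote one.
        for folder in (save_folder, remote_save_folder):
            if folder is None:
                continue
            latest = find_latest(folder)
            if latest is not None:
                # A saved checkpoint overrides any configured load_path.
                return latest
    # No saved checkpoint found (or option disabled): use load_path as given.
    return load_path
```

With this shape, a first run that sets both options starts from `load_path` because the save folders are still empty, while later runs with the identical config resume from whatever was saved most recently.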

2015aroras · 2024-08-30T00:36:17Z

Tested the 3 main scenarios: no existing checkpoint, only remote checkpoints, local + remote checkpoints.

epwalsh

LGTM, this will be super useful

2015aroras added 3 commits August 29, 2024 17:24

Add config option for trying to load latest saved checkpoint

d1a83a1

Implement logic for trying to load latest saved checkpoint

b39cc7b

Update CHANGELOG

212cf47

Make try_load_latest_save override load_path

e18ed7f

2015aroras marked this pull request as ready for review August 30, 2024 01:27

2015aroras requested review from epwalsh and dirkgr August 30, 2024 01:27

epwalsh approved these changes Aug 30, 2024


2015aroras merged commit ca81901 into main Sep 3, 2024
11 of 12 checks passed

2015aroras deleted the shanea/try-load-latest-save-2 branch September 3, 2024 13:49
