Added ability to try loading the latest checkpoint from save folders #717
**Issue:** Our training runs never finish in a single run of `train.py`. Currently we don't have a nice way to continue training using the same config; we have to set `load_path` to the latest checkpoint.

**Fix:** Add an option that tries loading the latest checkpoint from the local and remote save folders (assuming `load_path` is not set). If there are no checkpoints in either folder, then the model initializes from scratch as usual. If this option (`--try_load_latest_save`) is set to `True` for both the initial and subsequent runs, then the first run initializes and saves an initial checkpoint, while subsequent runs resume from the latest checkpoint.

**UPDATE:** Changed `try_load_latest_save` to override `load_path`. This makes it possible to use the same config for the first and subsequent runs when a run starts from a checkpoint and saves to a different location.
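A minimal sketch of how this resume logic could look, reflecting the override behavior described in the UPDATE. The helper and config field names here (`find_latest_checkpoint`, `resolve_load_path`, `save_folder`, `remote_save_folder`) are assumptions for illustration, not the actual implementation:

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(save_folder: Path) -> Optional[Path]:
    """Return the most recent checkpoint in ``save_folder``, if any.

    Hypothetical helper: assumes checkpoints are saved as ``step<N>``
    directories. A remote folder would need a storage-specific listing;
    a local glob stands in for that in this sketch.
    """
    checkpoints = [p for p in save_folder.glob("step*") if p.is_dir()]
    if not checkpoints:
        return None
    return max(checkpoints, key=lambda p: int(p.name.removeprefix("step")))


def resolve_load_path(cfg) -> Optional[Path]:
    """Decide which checkpoint (if any) to resume from.

    ``cfg`` is a hypothetical config object with ``try_load_latest_save``,
    ``load_path``, ``save_folder``, and ``remote_save_folder`` fields.
    """
    if cfg.try_load_latest_save:
        # Per the UPDATE: this option overrides load_path, so the same
        # config works for the first run and for every resumed run.
        for folder in (cfg.save_folder, cfg.remote_save_folder):
            if folder is not None:
                latest = find_latest_checkpoint(Path(folder))
                if latest is not None:
                    return latest
        # No checkpoint in either folder: fall through to load_path,
        # or to initializing from scratch if that is unset too.
    return cfg.load_path
```

Returning `None` here corresponds to initializing from scratch, matching the behavior described above when neither save folder contains a checkpoint.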