Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Train a few steps after time limit reached #362

Merged
merged 7 commits into from
Jan 4, 2024
Merged

Conversation

epwalsh
Copy link
Member

@epwalsh epwalsh commented Nov 6, 2023

This expands on the cancellation logic so that when a run is canceled due to reaching the time limit, it will train for 10 more steps after the cancellation goes into effect and after saving the final checkpoint. That way when we restart the run from the latest checkpoint we'll have some overlap in metrics on W&B, which is good for verifying that the restart worked properly.

olmo/train.py Outdated
)

if stop_at is not None and self.global_step >= stop_at:
canceled = hard_stop = True

# Maybe save sharded checkpoint.
if canceled or (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will save a checkpoint for all the extra steps. Consider making this and some later code hard_stop instead

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, you could have canceled represent a hard stop and cancel_initiated represent the beginning of a cancellation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

olmo/train.py Outdated
if get_global_rank() == 0:
if self.cfg.time_limit is not None and time.time() - self._start_time >= self.cfg.time_limit:
# First check if we've reached the training time limit.
should_cancel = True
cancel_reason = "time limit reached"
extra_steps = 10 # train for 10 extra steps so we get an overlap in metrics when we restart
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making this a config setting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -849,7 +875,7 @@ def on_trace_ready(p):
speed_monitor.reset()

# Maybe run evaluations.
if not canceled and self.global_step % self.cfg.eval_interval == 0:
if not cancel_initiated and self.global_step % self.cfg.eval_interval == 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, you don't want eval metrics if they happen in those extra steps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right... though it's debatable. I think when we cancel we want to stop ASAP, and the eval loop adds time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, no eval loops. This is a sanity check.

olmo/util.py Outdated Show resolved Hide resolved
olmo/util.py Outdated Show resolved Hide resolved
Co-authored-by: Dirk Groeneveld <[email protected]>
@epwalsh epwalsh requested a review from dirkgr November 9, 2023 01:26
@dirkgr
Copy link
Member

dirkgr commented Jan 4, 2024

Do we still want this? Can we get it merged?

@epwalsh
Copy link
Member Author

epwalsh commented Jan 4, 2024

@dirkgr yes, do you want to give a final review? Otherwise I think we're good to go with this.

Copy link
Member

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not review again. I was fine with it last time, except those spelling errors.

@epwalsh epwalsh merged commit 23eb949 into main Jan 4, 2024
10 checks passed
@epwalsh epwalsh deleted the epwalsh/train-after-cancel branch January 4, 2024 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants