Train a few steps after time limit reached #362
Conversation
olmo/train.py (Outdated)
    )

    if stop_at is not None and self.global_step >= stop_at:
        canceled = hard_stop = True

    # Maybe save sharded checkpoint.
    if canceled or (
This will save a checkpoint for every one of the extra steps. Consider making this condition (and some of the later code) check hard_stop instead.
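As a rough illustration of that suggestion (the helper name maybe_save and the save_interval parameter are hypothetical stand-ins, not the Trainer's actual API):

```python
# Sketch only: key the mid-run checkpoint gate off hard_stop rather than
# canceled, so the extra post-cancellation steps don't each write a checkpoint.
def maybe_save(global_step: int, hard_stop: bool, save_interval: int) -> bool:
    """Return True when a sharded checkpoint should be written this step."""
    return hard_stop or (save_interval > 0 and global_step % save_interval == 0)

# During the extra steps hard_stop stays False, so nothing is saved until the
# final step flips it to True.
assert maybe_save(global_step=105, hard_stop=False, save_interval=100) is False
assert maybe_save(global_step=110, hard_stop=True, save_interval=100) is True
```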
Alternatively, you could have canceled represent a hard stop and cancel_initiated represent the beginning of a cancellation.
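A minimal sketch of that two-flag shape, keeping the canceled / cancel_initiated names from the suggestion; the loop bounds and the time-limit stand-in are made up:

```python
# Illustrative only: cancel_initiated marks the start of a cancellation,
# canceled marks the hard stop once the extra steps have run.
cancel_initiated = False
canceled = False
stop_at = None
extra_steps = 10

for global_step in range(1, 1_000):
    time_limit_reached = global_step >= 120  # stand-in for the real wall-clock check

    if time_limit_reached and not cancel_initiated:
        cancel_initiated = True
        stop_at = global_step + extra_steps  # keep training a little longer

    if stop_at is not None and global_step >= stop_at:
        canceled = True  # hard stop: save the final state and exit
        break
```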
Ah good catch!
olmo/train.py (Outdated)
    if get_global_rank() == 0:
        # First check if we've reached the training time limit.
        if self.cfg.time_limit is not None and time.time() - self._start_time >= self.cfg.time_limit:
            should_cancel = True
            cancel_reason = "time limit reached"
            extra_steps = 10  # train for 10 extra steps so we get an overlap in metrics when we restart
Consider making this a config setting
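For instance, something along these lines, where the field name extra_steps_after_cancel is a made-up placeholder rather than the project's real setting:

```python
# Sketch of lifting the hard-coded 10 into the training config.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainConfig:
    time_limit: Optional[float] = None
    extra_steps_after_cancel: int = 10  # replaces the literal 10 in the trainer loop

cfg = TrainConfig(time_limit=47 * 60 * 60)
extra_steps = cfg.extra_steps_after_cancel
```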
@@ -849,7 +875,7 @@ def on_trace_ready(p):
     speed_monitor.reset()

     # Maybe run evaluations.
-    if not canceled and self.global_step % self.cfg.eval_interval == 0:
+    if not cancel_initiated and self.global_step % self.cfg.eval_interval == 0:
To be clear, you don't want eval metrics if they happen in those extra steps?
Right... though it's debatable. I think when we cancel we want to stop ASAP, and the eval loop adds time.
Yeah, no eval loops. This is a sanity check.
Co-authored-by: Dirk Groeneveld <[email protected]>
Do we still want this? Can we get it merged?
@dirkgr yes, do you want to give a final review? Otherwise I think we're good to go with this.
I did not review again. I was fine with it last time, except those spelling errors.
This expands on the cancellation logic so that when a run is canceled due to reaching the time limit, it will train for 10 more steps after the cancellation goes into effect and after saving the final checkpoint. That way when we restart the run from the latest checkpoint we'll have some overlap in metrics on W&B, which is good for verifying that the restart worked properly.
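For reference, a condensed, schematic version of that flow (variable names follow the diffs above; everything else, including the eval interval, is illustrative rather than the actual Trainer code):

```python
import time

def train_loop(start_time: float, time_limit: float, extra_steps: int = 10) -> None:
    """Schematic of the cancellation flow: hit the time limit, save a checkpoint,
    then train a few extra steps so metrics overlap after a restart."""
    cancel_initiated = False
    stop_at = None
    global_step = 0

    while True:
        global_step += 1
        # ... run one training step and log metrics to W&B ...

        if not cancel_initiated and time.time() - start_time >= time_limit:
            cancel_initiated = True
            stop_at = global_step + extra_steps
            # Save the final checkpoint here; the next `extra_steps` steps will be
            # repeated after a restart, giving the metric overlap described above.

        if not cancel_initiated and global_step % 1000 == 0:
            pass  # evaluations are skipped once cancellation has been initiated

        if stop_at is not None and global_step >= stop_at:
            break  # hard stop after the overlap window
```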