allow specifying LR schedule in terms of tokens #411
Conversation
```python
    return int(float(self.cfg.max_duration[:-1].strip()))
elif self.cfg.max_duration.endswith("ep"):
    max_epochs = int(self.cfg.max_duration[:-2].strip())
    return max_epochs * self.batches_per_epoch * self.tokens_per_batch
```
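For context, here is a self-contained sketch of the parsing this hunk belongs to. The function name and the suffix conventions (`"…T"` for a token count, `"…ep"` for epochs) are inferred from the diff, not confirmed against the rest of the repo:

```python
def parse_max_tokens(max_duration: str, batches_per_epoch: int, tokens_per_batch: int) -> int:
    """Hypothetical standalone version of the duration parsing above."""
    if max_duration.endswith("T"):
        # e.g. "2e12T" -> 2 trillion tokens
        return int(float(max_duration[:-1].strip()))
    elif max_duration.endswith("ep"):
        # e.g. "1ep" -> one pass over the data, converted to tokens
        max_epochs = int(max_duration[:-2].strip())
        return max_epochs * batches_per_epoch * tokens_per_batch
    else:
        raise ValueError(f"Unrecognized max_duration: {max_duration!r}")
```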
Correct me if I'm wrong, but (maybe barring some weird edge cases) it seems that the conversion from steps to tokens (and from `self.global_step` to `self.global_train_tokens_seen`) is multiplying by `self.tokens_per_batch`. If that is the case, then it may be more readable if you make `self.max_tokens` return `self.max_steps * self.tokens_per_batch`, or some equivalent code.
I don't think that works if batch size has changed at some point.
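A quick numeric illustration of this objection (all numbers hypothetical): once the batch size changes mid-run, the step count alone can no longer recover the number of tokens actually consumed.

```python
# Phase 1: 100 steps at 1,024 tokens per batch.
# Phase 2 (after a restart with 2x batch size): 100 steps at 2,048 tokens per batch.
tokens_seen = 100 * 1024 + 100 * 2048   # 307,200 -- what an explicit counter records

# Reconstructing from steps with the current tokens_per_batch over-counts phase 1:
naive = (100 + 100) * 2048              # 409,600

assert tokens_seen != naive
```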
Looks good. Did you test this in any way?
@dirkgr, no, but I will before I merge.
Confirmed it's working as expected after a restart with 2x batch size. https://wandb.ai/ai2-llm/olmo-small-test?workspace=user-epwalsh
This PR allows us to specify the LR schedule in terms of tokens instead of steps. For example, you change the step-based scheduler settings in your config to the equivalent token-based settings, as sketched below.
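The exact config snippets from the description aren't reproduced here, so the following YAML is only an illustration; the field names (`units`, `t_warmup`) and the numbers are assumptions, not the PR's actual schema:

```yaml
# Before: warmup measured in optimizer steps.
scheduler:
  name: cosine_with_warmup
  t_warmup: 5000        # steps

# After: the same warmup measured in tokens
# (5,000 steps x 1,024 tokens/batch, purely illustrative).
scheduler:
  name: cosine_with_warmup
  units: tokens
  t_warmup: 5120000
```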
These two forms are equivalent as long as the batch size stays constant across restarts, but the token-based form lets us continue the same LR schedule after changing the batch size, without any additional config changes.
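To make that concrete, here is a hedged sketch (hypothetical numbers, illustrative arithmetic) of where a run lands on a linear warmup after a restart with a doubled batch size:

```python
# Warmup of 5,000 steps at 1,024 tokens/batch, i.e. 5,120,000 tokens.
t_warmup_steps = 5_000
tokens_per_batch = 1_024
t_warmup_tokens = t_warmup_steps * tokens_per_batch

# Run 2,000 steps, restart with 2x batch size, run 1,000 more steps.
tokens_seen = 2_000 * tokens_per_batch + 1_000 * (2 * tokens_per_batch)

# Token-based schedule: position reflects the data actually consumed.
frac_tokens = tokens_seen / t_warmup_tokens     # 0.8

# Step-based schedule: position depends only on the step count, so the
# same training lands at a different point on the LR curve.
frac_steps = (2_000 + 1_000) / t_warmup_steps   # 0.6
```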
This is backwards compatible, so you can make this change to your config and still restart from an older checkpoint.
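One plausible way that backwards compatibility could work, sketched here as an assumption rather than a reading of the actual diff: when restoring an older checkpoint that has no token counter, initialize it from the step counter, which is exactly the steps-to-tokens conversion discussed in the review above.

```python
def restore_tokens_seen(checkpoint: dict, tokens_per_batch: int) -> int:
    """Hypothetical resume logic for checkpoints that predate token tracking.
    Only valid if the batch size was constant before the checkpoint was saved."""
    if "global_train_tokens_seen" in checkpoint:
        return checkpoint["global_train_tokens_seen"]
    return checkpoint["global_step"] * tokens_per_batch
```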