allow specifying LR schedule in terms of tokens #411

Merged Jan 18, 2024 (9 commits)

@epwalsh (Member) commented Jan 17, 2024

This PR allows us to specify the LR schedule in terms of tokens instead of steps. For example, just change this:

scheduler:
  name: linear_with_warmup
  t_warmup: 5000
  t_max: 476837
  alpha_f: 0.1
  grad_clip_warmup_steps: 1000
  grad_clip_warmup_factor: 10.0

To this:

 scheduler:
   name: linear_with_warmup
+  units: tokens
-  t_warmup: 5000
-  t_max: 476837
+  t_warmup: 2e10
+  t_max: 2e12
   alpha_f: 0.1
-  grad_clip_warmup_steps: 1000
+  grad_clip_warmup_steps: 4e9
   grad_clip_warmup_factor: 10.0

The above two configurations are approximately equivalent (the token values are rounded) as long as the batch size stays constant across restarts, but the latter allows us to continue the same LR schedule after changing the batch size without any additional config changes.
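As a rough check of that conversion, here is a minimal sketch. The batch size is not stated in this PR, so `TOKENS_PER_BATCH` below is an assumption (2048 sequences of 2048 tokens), chosen because it lands close to the round token values in the config above. `steps_to_tokens` is a hypothetical helper, not part of the codebase.

```python
# Assumption: the PR does not state the batch size. 2048 * 2048 tokens per
# step is a hypothetical value that roughly reproduces the numbers above.
TOKENS_PER_BATCH = 2048 * 2048  # 4,194,304 tokens per optimizer step


def steps_to_tokens(steps: int, tokens_per_batch: int = TOKENS_PER_BATCH) -> int:
    """Convert a duration in steps to tokens, valid only for a constant batch size."""
    return steps * tokens_per_batch


# Step-based values from the original scheduler config.
step_config = {"t_warmup": 5000, "t_max": 476837, "grad_clip_warmup_steps": 1000}

# Token-based equivalents under the assumed batch size.
token_config = {key: steps_to_tokens(value) for key, value in step_config.items()}

for key, value in token_config.items():
    print(f"{key}: {value:.3e} tokens")
```

Under this assumption the results come out near, but not exactly at, 2e10, 2e12, and 4e9, which suggests the token values in the new config were rounded.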

This is backwards compatible, so you can make this change to your config and still restart from an older checkpoint.

@epwalsh requested a review from dirkgr on January 17, 2024 19:24
@epwalsh requested a review from 2015aroras on January 17, 2024 19:27

    return int(float(self.cfg.max_duration[:-1].strip()))
elif self.cfg.max_duration.endswith("ep"):
    max_epochs = int(self.cfg.max_duration[:-2].strip())
    return max_epochs * self.batches_per_epoch * self.tokens_per_batch

@2015aroras (Collaborator) commented Jan 17, 2024

Correct me if I'm wrong, but (maybe barring some weird edge cases) it seems that the conversion from steps to tokens (and from self.global_step to self.global_train_tokens_seen) is multiplying by self.tokens_per_batch. If that is the case, then it may be more readable if you make self.max_tokens return self.max_steps * self.tokens_per_batch, or some equivalent code.

@epwalsh (Member, Author) replied

I don't think that works if batch size has changed at some point.
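To illustrate the objection, here is a small sketch under assumed numbers. `Segment`, `tokens_seen`, and `naive_tokens_seen` are hypothetical names for illustration only; they stand in for the relationship between `self.global_step`, `self.tokens_per_batch`, and `self.global_train_tokens_seen` discussed above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A stretch of training run at one fixed batch size (hypothetical helper)."""

    steps: int
    tokens_per_batch: int


def tokens_seen(segments: list[Segment]) -> int:
    """Actual tokens consumed: each segment contributes steps * its own batch size."""
    return sum(s.steps * s.tokens_per_batch for s in segments)


def naive_tokens_seen(segments: list[Segment]) -> int:
    """Total steps multiplied by the *current* batch size, as in the suggestion."""
    global_step = sum(s.steps for s in segments)
    return global_step * segments[-1].tokens_per_batch


# Assumed numbers: 1000 steps at 4M tokens per batch, then a restart
# with 2x batch size for another 500 steps.
base = 4_000_000
history = [Segment(steps=1000, tokens_per_batch=base),
           Segment(steps=500, tokens_per_batch=2 * base)]

actual = tokens_seen(history)
naive = naive_tokens_seen(history)
print(f"actual tokens seen: {actual:,}")
print(f"steps * current batch size: {naive:,}")
```

With a single segment the two functions agree, which is the case the suggestion covers. Once the batch size changes mid-run they diverge, so the token count has to be tracked cumulatively rather than derived from the step count.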

@dirkgr (Member) left a comment

Looks good. Did you test this in any way?

@epwalsh (Member, Author) commented Jan 17, 2024

> Looks good. Did you test this in any way?

@dirkgr, no, but I will before I merge.

@epwalsh (Member, Author) commented Jan 18, 2024

Confirmed it's working as expected after a restart with 2x batch size. https://wandb.ai/ai2-llm/olmo-small-test?workspace=user-epwalsh

@epwalsh merged commit dcae8e8 into main on Jan 18, 2024
9 of 10 checks passed
@epwalsh deleted the epwalsh/lr-schedule-tokens branch on January 18, 2024 19:40