🐛 Describe the bug
I think there are two problems with multi-epoch training:
1. Training finishes if setting e.g. duration: 2e12T and 1 epoch < 2e12 tokens. It currently requires setting duration: 2ep, but I think it should also work with T (also mentioned here: Break at 1 epoch "Training epoch complete", can't pretraining beyond 1 epoch? #554). The log below shows the run stopping at the end of epoch 1, well short of the 2e12-token budget; a sketch of the behaviour I'd expect follows it.

olmo.train:816 INFO [step=817847/1430511]
train/CrossEntropyLoss=2.341
train/Perplexity=10.39
throughput/total_tokens=1,715,149,471,744
throughput/device/tokens_per_second=18,573
throughput/device/batches_per_second=0.5668
olmo.train:1172 INFO Training epoch complete
olmo.train:1194 INFO Saving final checkpoint...
train:238 INFO Training complete
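
To make the expected behaviour concrete, here is a minimal, self-contained sketch of a token-budget stopping rule. It is not OLMo's actual trainer loop and all names (tokens_per_epoch, max_tokens, batch_tokens) are hypothetical; the point is only that a duration given in tokens should be allowed to span epoch boundaries.

```python
# Minimal sketch, not OLMo's trainer; all names are hypothetical.
# A duration given in tokens (e.g. duration: 2e12T) should span epochs.

def train(tokens_per_epoch: int, max_tokens: int, batch_tokens: int) -> None:
    tokens_seen = 0
    epoch = 0
    while tokens_seen < max_tokens:            # stop on the token budget ...
        epoch += 1
        for _ in range(tokens_per_epoch // batch_tokens):
            # ... run one optimizer step here ...
            tokens_seen += batch_tokens
            if tokens_seen >= max_tokens:      # the budget can end mid-epoch
                break
        # ... not at "Training epoch complete": if the budget is not exhausted,
        # fall through and start the next epoch with a new data order.
    print(f"done after {epoch} epoch(s) and {tokens_seen:,} tokens")


# Toy numbers: one epoch holds 1.5M tokens, the budget is 2M -> runs into epoch 2.
train(tokens_per_epoch=1_500_000, max_tokens=2_000_000, batch_tokens=1_000)
```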
2. Afaict, when resuming a run into a >1 epoch state, it currently requires newly setting epoch: num_epochs in the config to ensure that the data is in a different order:

OLMo/olmo/data/__init__.py
Line 103 in e6430a0
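
For context on why the epoch value matters: afaict the global data order is derived from the seed and the epoch the dataset is constructed with, roughly in the spirit of the simplified, hypothetical sketch below (not the actual implementation). A stale epoch therefore reproduces exactly the first epoch's order.

```python
import numpy as np

# Simplified illustration (not the actual OLMo code): if the global order is a
# function of (seed, epoch), a stale epoch on resume repeats epoch 1's order.

def global_order(num_examples: int, seed: int, epoch: int) -> np.ndarray:
    rng = np.random.Generator(np.random.PCG64(seed + epoch))
    order = np.arange(num_examples)
    rng.shuffle(order)
    return order


# Different epoch -> different order; same epoch -> identical order.
assert not np.array_equal(global_order(10, seed=42, epoch=0),
                          global_order(10, seed=42, epoch=1))
assert np.array_equal(global_order(10, seed=42, epoch=1),
                      global_order(10, seed=42, epoch=1))
```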
I think we should just load this from the trainer state dict. However, afaict this is currently not happening, because the checkpoint is only loaded after the IterableDataset has already been created. I.e. the data loader is built first:
OLMo/scripts/train.py
Line 116 in e6430a0
Then the checkpoint with the epoch value is loaded:
OLMo/scripts/train.py
Line 238 in e6430a0
and the data loader remains unchanged.
Without knowing this, people will train the 2nd epoch with the same data order as the 1st.
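
To spell out the fix, here is a self-contained toy sketch of the resume flow; all names are hypothetical stand-ins rather than the real objects in scripts/train.py, and the checkpoint contents are faked. The idea is simply to propagate the epoch from the restored trainer state into the already-built dataset (or rebuild the data loader with it) before training resumes.

```python
# Self-contained toy sketch of the proposed fix; names are stand-ins,
# not OLMo's real API, and the checkpoint contents are faked for illustration.

class ToyIterableDataset:
    """Stand-in for the IterableDataset: its data order is derived from `epoch`."""

    def __init__(self, epoch: int):
        self.epoch = epoch


def load_trainer_state(path: str) -> dict:
    # Stand-in for restoring the trainer checkpoint; pretend it recorded epoch 1.
    return {"epoch": 1, "global_step": 817847}


# Current flow in scripts/train.py (simplified): the dataset / data loader is
# built from the config first ...
dataset = ToyIterableDataset(epoch=0)                # stale epoch from the config

# ... and only afterwards is the trainer checkpoint restored.
trainer_state = load_trainer_state("path/to/checkpoint")

# Proposed: push the restored epoch back into the dataset (or rebuild the data
# loader with it) before training resumes, so nobody has to remember to set
# epoch: num_epochs in the config by hand.
dataset.epoch = trainer_state["epoch"]
assert dataset.epoch == 1
```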
Versions
latest main