Fix distributed dataloaders & deduplicate eval #276
Conversation
Awesome! Just left a tiny request, but feel free to leave it for another PR.

Aside (while on the topic of dataloaders): do you know what the `rollout_loader` in the lines below is for? It seems like it was used to satisfy `prepare` with deepspeed before that call became a no-op:
trlx/trlx/trainer/accelerate_ppo_trainer.py (lines 36 to 40 in 070c58f):

```python
rollout_loader = self.store.create_loader(self.config.train.batch_size, shuffle=True)
self.model, self.opt, self.scheduler, rollout_loader = self.accelerator.prepare(
    self.model, self.opt, self.scheduler, rollout_loader
)
```
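(Not an authoritative answer, just a guess in code form: if `rollout_loader` exists only so DeepSpeed can infer a micro-batch size during `prepare`, and the batch size were instead specified explicitly in the DeepSpeed config, the call could plausibly shrink to the lines below. Treat this as an untested sketch against the quoted snippet, not a proposed change.)

```python
# Untested sketch: drop the throwaway rollout_loader, assuming the DeepSpeed
# config already pins the micro-batch size, so prepare() no longer needs a
# dataloader to infer it from.
self.model, self.opt, self.scheduler = self.accelerator.prepare(
    self.model, self.opt, self.scheduler
)
```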
```python
samples = self.accelerator.gather_for_metrics(torch.vstack(all_samples))
prompts = self.accelerator.gather_for_metrics(torch.vstack(all_prompts))
prompt_sizes = self.accelerator.gather_for_metrics(torch.hstack(prompt_sizes))
```
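(For context, a hedged sketch rather than code from this PR: `gather_for_metrics` gathers a tensor from all processes and, when iterating a dataloader prepared by the same `Accelerator`, drops the samples that were duplicated to even out the last batch, so metrics are computed over exactly the original number of samples. The dataset and sizes below are hypothetical.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Hypothetical eval set whose size need not divide evenly across processes;
# the prepared loader pads the last batches with duplicated samples.
dataset = TensorDataset(torch.arange(1000))
eval_loader = accelerator.prepare(DataLoader(dataset, batch_size=16))

gathered = []
for (batch,) in eval_loader:
    # Gathers from all processes and strips the padding duplicates,
    # so the concatenated result has exactly len(dataset) elements.
    gathered.append(accelerator.gather_for_metrics(batch))

total = torch.cat(gathered)
assert total.numel() == len(dataset)
```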
Maybe we should bump the `accelerate` version in our dependency list to `accelerate>=0.16.0` to pick up the fix?

Line 14 in 070c58f:

```
accelerate>=0.12.0
```

(This doesn't affect training, so maybe it's not worth the constraint now?)
This constraint would be benign anyway, since nothing else depends on accelerate. However, we'd have to check that nothing is broken in the new release to be deliberate about the version here, so let's probably add it later.
Just to clarify

Ah gotcha - I misunderstood this discussion. Thanks for clarifying. Merging!
- `eval_pipeline` is now sharded explicitly (`accelerate.prepare(dataloader)` is a noop with deepspeed) to split evaluation samples among processes (a sketch follows below)
- sharding in `PPOOrchestrator` now also covers the case of deepspeed (previously it was applied only for ddp)

https://wandb.ai/sorry/trlx/reports/Fix-distributed-dataloaders-deduplicate-eval-276---VmlldzozNDgzNDU3
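For illustration only (this is a sketch, not the PR's implementation; the prompt list, batch size, and variable names are hypothetical): explicit sharding amounts to each process slicing out its own portion of the evaluation prompts using the accelerator's process index.

```python
from accelerate import Accelerator

accelerator = Accelerator()

# Hypothetical stand-ins: 1024 prompts and a per-process batch size of 16.
eval_prompts = [f"prompt {i}" for i in range(1024)]
batch_size = 16

rank = accelerator.process_index        # this process's index
world_size = accelerator.num_processes  # total number of processes

# Each process keeps a contiguous slice of the eval prompts, since
# accelerate.prepare(dataloader) would not shard them under deepspeed.
shard_size = len(eval_prompts) // world_size
shard = eval_prompts[rank * shard_size : (rank + 1) * shard_size]

# e.g. 1024 prompts / 4 processes = 256 prompts per process,
# i.e. 256 / 16 = 16 eval batches per process at batch_size=16
num_eval_batches = len(shard) // batch_size
```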
These changes influenced training with both deepspeed & ddp by no more than correcting the effective `batch_size` and `num_rollouts`. The number of evaluation samples is now the same as the number of passed `eval_prompts` (this is a fix of #247).

`ppo_sentiments.py` @ num_processes=4, batch_size=16, num_rollouts=128, len(eval_prompts)=1024:
| setup | len(train_dataloader) | len(eval_dataloader) |
| --- | --- | --- |
| main & ddp | 2 | 16 |
| fix & ddp | 8 | 16 |
| main & deepspeed | 8 | 64 |
| fix & deepspeed | 8 | 16 |
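As a sanity check on the fixed numbers, a quick back-of-the-envelope calculation (assuming, as the settings above suggest, that `batch_size` and `num_rollouts` are per process):

```python
num_processes = 4
batch_size = 16          # per-process batch size (assumption)
num_rollouts = 128       # rollouts collected per process (assumption)
num_eval_prompts = 1024

# After the fix, each process trains on all of its own rollouts...
len_train_dataloader = num_rollouts // batch_size                      # 128 // 16 = 8
# ...and the eval prompts are split exactly once across processes.
len_eval_dataloader = num_eval_prompts // num_processes // batch_size  # 1024 // 4 // 16 = 16

print(len_train_dataloader, len_eval_dataloader)  # -> 8 16
```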