Bug in accelerate_sft_trainer.py: Incorrect calculation of total_steps #426

rockmagma02 · 2023-04-08T20:00:06Z

🐛 Describe the bug

I would like to report a bug in accelerate_sft_trainer.py, in the prepare_learning() function. In the current implementation, self.total_steps is calculated by multiplying the number of epochs (self.config.train.epochs) by the length of the local train_dataloader. However, that train_dataloader is created before accelerator preparation, so its length does not reflect the actual number of training steps each process takes when multiple GPUs are used.

As a result, total_steps ends up larger than the true number of training steps, so training ends before total_steps is reached.
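To illustrate the mismatch, here is a rough arithmetic sketch. It does not call trlX or Accelerate; `steps_per_epoch` is a hypothetical helper, and it assumes batches are divided evenly across processes with the remainder rounded up, which may not match every Accelerate configuration. The sample count, batch size, and process count are made-up values.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int, num_processes: int = 1) -> int:
    """Approximate number of batches one process sees per epoch.

    Assumption: batches are sharded evenly across processes, rounding up.
    """
    num_batches = math.ceil(num_samples / batch_size)
    return math.ceil(num_batches / num_processes)


# Illustrative values only
epochs = 3
num_samples, batch_size, num_processes = 1000, 8, 4

# Length taken before preparation: every batch is counted, as on a single process
unsharded = epochs * steps_per_epoch(num_samples, batch_size)

# Length taken after preparation: only the batches this process receives
sharded = epochs * steps_per_epoch(num_samples, batch_size, num_processes)

print(unsharded, sharded)
```

Under these assumptions the pre-preparation count is roughly num_processes times larger than the number of steps a single process runs.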

To fix this issue, I recommend modifying the prepare_learning() function to calculate self.total_steps using the self.train_dataloader variable instead, which correctly reflects the number of training steps after accelerator preparation. Here is the suggested modification:

    def prepare_learning(self):
        train_dataloader = self.store.create_loader(self.config.train.batch_size)
        eval_dataloader = self.eval_pipeline.create_loader(self.config.train.batch_size)

        (
            self.model,
            self.opt,
            self.train_dataloader,
            self.eval_dataloader,
        ) = self.accelerator.prepare(self.model, self.opt, train_dataloader, eval_dataloader)

        self.n_updates_per_batch = 1
        # Use the prepared dataloader so the length reflects the state after accelerator.prepare()
        self.total_steps = self.config.train.epochs * len(self.train_dataloader)
        self.total_steps = min(self.total_steps, self.config.train.total_steps)

I hope this helps! Let me know if you have any questions or if there's anything else I can assist you with.

Which trlX version are you using?

newest

Additional system and package information

No response

jon-tow · 2023-04-13T00:11:11Z

Thanks for reporting! Fixed with @reciprocated's #432 patch.

rockmagma02 added the bug label Apr 8, 2023

maxreciprocate mentioned this issue Apr 12, 2023

fix(sft_trainer): total_steps calculation when running distributed #432

Merged

jon-tow closed this as completed Apr 13, 2023
