
Avoid gradient synchronization when accumulating #396

Merged · 3 commits · Apr 5, 2023

Conversation

eluzhnica (Contributor)

I'll run some tests, but I only have 2 (slow) GPUs, so it might take some time.

eluzhnica mentioned this pull request on Mar 27, 2023
Dahoas (Collaborator) commented on Apr 3, 2023

@eluzhnica Can you explain a bit for my sake how contextlib manages accelerate's grad synchronization? Otherwise looks good to me.

Edit: This seems to explain it https://huggingface.co/docs/accelerate/concept_guides/gradient_synchronization
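
In short, the pattern the guide describes is to run the backward pass under `no_sync` for all but the last micro-batch of each accumulation window, so DDP skips the gradient all-reduce, and to synchronize only on the step that actually calls the optimizer. A minimal sketch of that pattern, assuming `model`, `optimizer`, `dataloader`, and `gradient_accumulation_steps` are already defined (illustrative names, not trlX's actual trainer code):

```python
import contextlib

from accelerate import Accelerator

accelerator = Accelerator()
# `model`, `optimizer`, and `dataloader` are assumed to exist already; this
# mirrors the pattern from the Accelerate guide, not this PR's exact code.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for step, batch in enumerate(dataloader):
    # Synchronize (all-reduce) gradients only on the last micro-batch of each
    # accumulation window; `accelerator.no_sync(model)` suppresses DDP's
    # gradient all-reduce for the intermediate micro-batches.
    should_sync = (step + 1) % gradient_accumulation_steps == 0
    ctx = contextlib.nullcontext if should_sync else accelerator.no_sync
    with ctx(model):
        loss = model(**batch).loss / gradient_accumulation_steps
        accelerator.backward(loss)
    if should_sync:
        optimizer.step()
        optimizer.zero_grad()
```

With DDP, every backward pass outside `no_sync` triggers an all-reduce, so skipping it on the intermediate micro-batches is where the communication savings come from.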

Dahoas (Collaborator) commented on Apr 3, 2023

@eluzhnica Were there additional tests you wanted to include? If not, I can merge this now.

eluzhnica (Contributor, Author)

@Dahoas Yes, that link does a great job explaining it. One additional pointer: they also check for the end of the dataloader in their code here: https://github.com/huggingface/accelerate/blob/main/src/accelerate/accelerator.py
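
Roughly, that end-of-dataloader check amounts to the condition sketched below (illustrative names, not accelerate's actual internals): even if the accumulation window is not full, the final batch still has to synchronize, otherwise its gradients would never be all-reduced before the last optimizer step.

```python
# Illustrative sketch, not accelerate's actual code: force a sync either when
# the accumulation window is full or when this is the last batch of the dataloader.
should_sync = (
    (step + 1) % gradient_accumulation_steps == 0
    or (step + 1) == len(dataloader)
)
```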

For tests, I ran PPO on pythia 125M on 2 single-node GPUs with an accumulation of 8 and saw a small improvement of about 5%. They report an improvement of ~20% on a single node, but on a different GPU (T4, whereas mine is an RTX8000) and a different model.

The benefit would show its true colors on bigger models and more GPUs (especially across multiple nodes). Unfortunately, I don't have the GPU resources to run that.

So, if someone is interested, the experiment could be PPO on pythia 6B across multiple GPUs (ideally multi-node) to see whether the average batch time improves (i.e., how long it takes to run 1000 batches):

  • Run with gradient_accumulation_steps = 1, an mb_size of 8, and a batch size of 64.
  • Run with the same mb_size and batch size, but with gradient_accumulation_steps of, say, 4. We'd probably have to increase the learning rate about 2x in this case, since the effective batch size is now 4x, if we want comparable results.

However, this shouldn't be blocking.

I'll add some unit tests as a sanity check.

Dahoas merged commit 833d049 into CarperAI:minibatch-impl on Apr 5, 2023
cat-state pushed a commit that referenced this pull request on Apr 6, 2023
This PR adds gradient accumulation and minibatching to the accelerate trainers.

* fixes half exp not implemented error

* added minibatching

* fix num_mb name

* fix minibatch indexing

* fixing style

* fixing style

* fixing style

* Minibatch iterator (#403)

* Add minibatch iterator

* Add tests

* Avoid gradient synchronization when accumulating (#396)

* Avoid gradient synchronization when accumulating

* Fix accumulation to account for dataloader

* Add some tests

---------

Co-authored-by: Enxhell <[email protected]>