
Avoid gradient synchronization when accumulating #396

Merged · 3 commits · Apr 5, 2023

Conversation

eluzhnica (Contributor)

I'll run some tests, but I only have 2 (slow) GPUs, so it might take some time.

eluzhnica mentioned this pull request on Mar 27, 2023
Dahoas (Collaborator) commented on Apr 3, 2023

@eluzhnica Can you explain a bit for my sake how contextlib manages accelerate's grad synchronization? Otherwise looks good to me.

Edit: This seems to explain it https://huggingface.co/docs/accelerate/concept_guides/gradient_synchronization
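
In short, the pattern the guide describes is to run the backward pass under `no_sync` for all but the last micro-batch of each accumulation window, so DDP skips the gradient all-reduce, and to synchronize only on the step that actually calls the optimizer. A minimal sketch of that pattern, assuming `model`, `optimizer`, `dataloader`, and `gradient_accumulation_steps` are already defined (illustrative names, not trlX's actual trainer code):

```python
import contextlib

from accelerate import Accelerator

accelerator = Accelerator()
# `model`, `optimizer`, and `dataloader` are assumed to exist already; this
# mirrors the pattern from the Accelerate guide, not this PR's exact code.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for step, batch in enumerate(dataloader):
    # Synchronize (all-reduce) gradients only on the last micro-batch of each
    # accumulation window; `accelerator.no_sync(model)` suppresses DDP's
    # gradient all-reduce for the intermediate micro-batches.
    should_sync = (step + 1) % gradient_accumulation_steps == 0
    ctx = contextlib.nullcontext if should_sync else accelerator.no_sync
    with ctx(model):
        loss = model(**batch).loss / gradient_accumulation_steps
        accelerator.backward(loss)
    if should_sync:
        optimizer.step()
        optimizer.zero_grad()
```

With DDP, every backward pass outside `no_sync` triggers an all-reduce, so skipping it on the intermediate micro-batches is where the communication savings come from.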

Dahoas (Collaborator) commented on Apr 3, 2023

@eluzhnica Were there additional tests you wanted to include? If not, I can merge this now.

eluzhnica (Contributor, Author)

@Dahoas Yes, that link does a great job explaining it. One additional pointer: they also check for the end of the dataloader in their code here: https://github.com/huggingface/accelerate/blob/main/src/accelerate/accelerator.py
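
Roughly, that end-of-dataloader check amounts to the condition sketched below (illustrative names, not accelerate's actual internals): even if the accumulation window is not full, the final batch still has to synchronize, otherwise its gradients would never be all-reduced before the last optimizer step.

```python
# Illustrative sketch, not accelerate's actual code: force a sync either when
# the accumulation window is full or when this is the last batch of the dataloader.
should_sync = (
    (step + 1) % gradient_accumulation_steps == 0
    or (step + 1) == len(dataloader)
)
```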

For tests, I ran PPO on pythia 125M on 2 single-node GPUs with an accumulation of 8 and saw a small improvement of about 5%. They report an improvement of ~20% on a single node, but on a different GPU (T4, whereas mine is an RTX8000) and a different model.

The benefit would show its true colors on bigger models and more GPUs (especially across multiple nodes). Unfortunately, I don't have the GPU resources to run that.

So, if someone is interested, the experiment could be PPO on pythia 6B across multiple GPUs (ideally multi-node) to see whether the average batch time improves (i.e., how long it takes to run 1000 batches):

  • Run with gradient_accumulation_steps = 1, an mb_size of 8, and a batch size of 64.
  • Run with the same mb_size and batch size, but with gradient_accumulation_steps of, say, 4. We'd probably have to increase the learning rate about 2x in this case, since the effective batch size is now 4x, if we want comparable results.

However, this shouldn't be blocking.

I'll add some unit tests as a sanity check.

Dahoas merged commit 833d049 into CarperAI:minibatch-impl on Apr 5, 2023
cat-state pushed a commit that referenced this pull request on Apr 6, 2023
This PR adds gradient accumulation and minibatching to the accelerate trainers.

* fixes half exp not implemented error

* added minibatching

* fix num_mb name

* fix minibatch indexing

* fixing style

* fixing style

* fixing style

* Minibatch iterator (#403)

* Add minibatch iterator

* Add tests

* Avoid gradient synchronization when accumulating (#396)

* Avoid gradient synchronization when accumulating

* Fix accumulation to account for dataloader

* Add some tests

---------

Co-authored-by: Enxhell <[email protected]>