Dynamic bucket selection rng sync #1341
Conversation
@lifeiteng care to try this one?

I tried this with the config below, and training on LibriSpeech took the same time (21 minutes/epoch).
Interesting. How many GPUs? Can you also try increasing the buffer size to 50k? Otherwise, maybe the batch duration is too low to notice a difference. I observed a 10% speedup on a 2-GPU setup but need to investigate further.

8 A100 GPUs.
Aside from that, your max duration seems low for A100. Try adding `quadratic_duration=15` to the sampler and you'll probably be able to increase max duration by 100-200 (though I'd expect you to be able to set it to at least 500-600 in the first place; are you using bf16/fp16?).
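To illustrate the idea behind a quadratic duration penalty, here is a minimal sketch. The formula below (duration plus a quadratic term scaled by `quadratic_duration`) is an assumption made for illustration, not necessarily Lhotse's exact implementation:

```python
def effective_duration(duration: float, quadratic_duration: float) -> float:
    """Penalize long cuts: add a quadratic term so a few long utterances
    consume the batch's max_duration budget faster than many short ones.
    This lets you raise max_duration without OOM-ing on long-cut batches."""
    return duration + duration ** 2 / quadratic_duration

# With quadratic_duration=15, a 30 s cut counts as 30 + 900/15 = 90 s
# toward max_duration, while a 5 s cut counts as only ~6.7 s.
```

In other words, the penalty grows quadratically with cut length, which is why it mainly caps batches made of long utterances while barely affecting short ones.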
FP32 for now.

It took 31 minutes per epoch with this config.
Hmm, it seems you are using WebDataset or Lhotse Shar, so when the batch size or buffer size grows, the initialization of the dataloader (on the first step of iteration) takes longer, as it has to read more data into memory. Try precomputing the duration bins by running `estimate_duration_bins` on your data and providing the output to the sampler's `duration_bins` argument. You can then revert the buffer size to the original setting and compare with and without synced buffers again.
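For intuition, duration-bin estimation amounts to choosing bucket boundaries from the empirical duration distribution. The quantile-based sketch below is a self-contained stand-in for that idea, not Lhotse's actual `estimate_duration_bins`:

```python
import math

def estimate_bins_sketch(durations: list, num_buckets: int) -> list:
    """Sketch: choose bucket boundaries at evenly spaced quantiles of the
    observed cut durations, so each bucket receives roughly the same
    number of cuts. Returns num_buckets - 1 boundary values."""
    xs = sorted(durations)
    n = len(xs)
    return [
        xs[min(n - 1, math.ceil(i * n / num_buckets))]
        for i in range(1, num_buckets)
    ]

# Boundaries estimated on the real training data match its duration
# distribution, so no bucket ends up starved or overloaded.
```

The output of such an estimate would then be passed to the sampler's `duration_bins` argument, avoiding the cost of re-estimating bins from the stream at startup.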
…arly bucket depletion cases.
Just pushed a version that is better tested and supports both map-style and iterable-style datasets.
Follow-up to #863 and #1309.
This version seems to work as intended: it consistently picks the same buckets on each DDP rank. It depends on a good `duration_bins` initialization (i.e., it has to be estimated on the actual training data to fit the duration distribution well) and a large enough `buffer_size`, so that all buckets are filled enough to yield at least one mini-batch most of the time. If it hits a non-ready bucket, it tries again with its neighbors.

I'm determining what kind of speedup can be expected from this; I also need to add proper tests, and if I find it's good enough, I'll probably make it the default.
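The mechanism described above (an RNG synced across ranks, with a neighbor fallback for non-ready buckets) can be sketched as follows. This is an illustrative toy, with names and details assumed rather than taken from the PR:

```python
import random

def pick_bucket(num_buckets: int, ready: list, step: int, seed: int = 0):
    """Sketch of RNG-synced bucket selection: every DDP rank seeds an
    identical RNG with (seed + step), so all ranks draw the same bucket
    index at each sampling step without any communication. If the drawn
    bucket cannot yield a mini-batch yet, probe its neighbors in a
    deterministic order so the ranks still agree on the final choice."""
    rng = random.Random(seed + step)   # identical on every rank
    idx = rng.randrange(num_buckets)
    for offset in range(num_buckets):  # neighbor fallback, widening outward
        for cand in (idx - offset, idx + offset):
            if 0 <= cand < num_buckets and ready[cand]:
                return cand
    return None  # no bucket has enough data buffered yet
```

Because both the draw and the fallback order are deterministic, every rank resolves to the same bucket even when some buckets are temporarily depleted, which is what keeps DDP ranks producing similarly sized batches.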