Last mini-batch redistribution in distributed samplers #1277

Merged
merged 3 commits into master from feature/redistribute-last-batch on Jan 30, 2024

Conversation

@pzelasko (Collaborator) commented Jan 30, 2024

This change is intended to prevent training/validation loops from hanging in multi-GPU setups when samplers are configured with drop_last=False. The alternative, drop_last=True, discards up to world_size - 1 mini-batches when the number of GPUs (world_size) is large, which is not acceptable for validation data since it is typically small.

The default (drop_last=False) will now redistribute the data across the mini-batches intended for each rank (since each rank has access to all ranks' mini-batches in the current step anyway), in a way that is consistent across ranks, so that every rank yields a partial mini-batch. When the number of available examples is less than world_size, we duplicate examples to cover the difference, ensuring each rank gets a 1-element mini-batch. Note: this duplication is consistent with PyTorch's DistributedSampler behavior; we missed it when creating Lhotse samplers.
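To illustrate the idea, here is a minimal sketch of the redistribution logic; the helper name `redistribute_last_batch` and its signature are assumptions for illustration, not Lhotse's actual implementation:

```python
def redistribute_last_batch(examples, world_size):
    # Hypothetical helper sketching the redistribution described above;
    # not the actual Lhotse code.
    assert world_size > 0 and len(examples) > 0
    # If there are fewer examples than ranks, duplicate them (wrapping around)
    # so every rank still receives a 1-element mini-batch. This mirrors
    # PyTorch's DistributedSampler padding behavior.
    if len(examples) < world_size:
        examples = [examples[i % len(examples)] for i in range(world_size)]
    # Deterministic round-robin split: every rank computes the same assignment,
    # since each rank sees all ranks' examples for the current step anyway.
    return [examples[rank::world_size] for rank in range(world_size)]

# Example: 5 leftover examples on 4 GPUs -> each rank gets a non-empty partial batch.
print(redistribute_last_batch(["a", "b", "c", "d", "e"], world_size=4))
# [['a', 'e'], ['b'], ['c'], ['d']]
```

Because the split is a deterministic function of the shared example list and world_size, every rank derives the same partition without any extra communication, which is what keeps the collective ops from hanging on the last step.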

@pzelasko added this to the v1.20.0 milestone on Jan 30, 2024
@pzelasko merged commit 59a4b05 into master on Jan 30, 2024
10 checks passed
@pzelasko deleted the feature/redistribute-last-batch branch on January 30, 2024 21:01