Last mini-batch redistribution in distributed samplers #1277
This change is intended to prevent training/validation loops from hanging in multi-GPU setups when samplers are configured with `drop_last=False`. When `drop_last=True` and the number of GPUs (`world_size`) is large, up to `world_size - 1` mini-batches may be discarded, which is not acceptable for validation data as it is typically small.

With the default (`drop_last=False`), the data is now redistributed across the mini-batches intended for each rank (since each rank already has access to all ranks' mini-batches in the current step anyway) in a way that is consistent across ranks, yielding a partial mini-batch on every rank. When the number of available examples is less than `world_size`, we duplicate examples to cover the difference, ensuring each rank gets at least a 1-element mini-batch. Note: this duplication is consistent with PyTorch's `DistributedSampler` behavior; we missed it when creating Lhotse samplers.
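For illustration, here is a minimal sketch of the redistribution idea (not the actual Lhotse code; `redistribute_partial_batch` is a hypothetical helper): every rank sees the same global list of leftover examples and deterministically derives its own, possibly partial or duplicated, mini-batch from it.

```python
# Hypothetical sketch of the last-batch redistribution idea.
# Each rank derives its own slice of the shared leftover examples,
# so all ranks agree on the result without any communication.

def redistribute_partial_batch(examples, world_size, rank):
    """Split the leftover `examples` across `world_size` ranks consistently.

    If there are fewer examples than ranks, duplicate examples (wrapping
    around) so that every rank receives at least one element, mirroring
    PyTorch's DistributedSampler padding behavior.
    """
    if len(examples) < world_size:
        # Duplicate by wrapping around so each rank gets exactly one example.
        return [examples[rank % len(examples)]]
    # Otherwise, deal examples round-robin; some ranks may get one more
    # example than others, but every rank gets a non-empty mini-batch.
    return examples[rank::world_size]


# 5 leftover examples on 4 GPUs -> batch sizes [2, 1, 1, 1];
# 2 leftover examples on 4 GPUs -> every rank gets 1 (ranks 2 and 3 reuse examples 0 and 1).
for world_size, n in [(4, 5), (4, 2)]:
    batches = [redistribute_partial_batch(list(range(n)), world_size, r) for r in range(world_size)]
    print(f"world_size={world_size}, leftovers={n}: {batches}")
```

Because every rank computes the same deterministic split, no rank runs out of mini-batches before the others, which is what prevents the collective-communication hang.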