Last mini-batch redistribution in distributed samplers #1277

Merged
merged 3 commits into master from feature/redistribute-last-batch on Jan 30, 2024

Conversation

@pzelasko (Collaborator) commented Jan 30, 2024

This change is intended to prevent training/validation loops from hanging in multi-GPU setups when samplers are configured with drop_last=False. The alternative, drop_last=True, discards up to world_size - 1 mini-batches when the number of GPUs (world_size) is large, which is not acceptable for validation data since it is typically small.

The default (drop_last=False) will now redistribute the data across the mini-batches intended for each rank (since each rank has access to all ranks' mini-batches in the current step anyway), in a way that is consistent across ranks, so that every rank yields a partial mini-batch. When the number of available examples is less than world_size, we duplicate examples to cover the difference, ensuring each rank gets a 1-element mini-batch. Note: this duplication is consistent with PyTorch's DistributedSampler behavior; we missed it when creating Lhotse samplers.
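To illustrate the idea, here is a minimal sketch of the redistribution logic; the helper name `redistribute_last_batch` and its signature are assumptions for illustration, not Lhotse's actual implementation:

```python
def redistribute_last_batch(examples, world_size):
    # Hypothetical helper sketching the redistribution described above;
    # not the actual Lhotse code.
    assert world_size > 0 and len(examples) > 0
    # If there are fewer examples than ranks, duplicate them (wrapping around)
    # so every rank still receives a 1-element mini-batch. This mirrors
    # PyTorch's DistributedSampler padding behavior.
    if len(examples) < world_size:
        examples = [examples[i % len(examples)] for i in range(world_size)]
    # Deterministic round-robin split: every rank computes the same assignment,
    # since each rank sees all ranks' examples for the current step anyway.
    return [examples[rank::world_size] for rank in range(world_size)]

# Example: 5 leftover examples on 4 GPUs -> each rank gets a non-empty partial batch.
print(redistribute_last_batch(["a", "b", "c", "d", "e"], world_size=4))
# [['a', 'e'], ['b'], ['c'], ['d']]
```

Because the split is a deterministic function of the shared example list and world_size, every rank derives the same partition without any extra communication, which is what keeps the collective ops from hanging on the last step.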

@pzelasko added this to the v1.20.0 milestone on Jan 30, 2024
@pzelasko merged commit 59a4b05 into master on Jan 30, 2024
10 checks passed
@pzelasko deleted the feature/redistribute-last-batch branch on January 30, 2024 21:01