
Optimizing distributed Adam when running with one work queue #1551

Merged: 9 commits merged into NVIDIA:master on Jan 20, 2023

Conversation

@timmoon10 (Contributor) commented on Dec 7, 2022

When @erhoo82 ran NeMo-Megatron with CUDA_DEVICE_MAX_CONNECTIONS=1, he observed poor overlap between the model's backward compute and the distributed Adam optimizer's gradient reduce-scatters. In particular, if multiple reductions are launched in a row, only the last one can overlap with compute. This PR makes several changes to optimize performance:

  • When dist Adam launches multiple grad reduce-scatters at the same time, it coalesces them with NCCL group calls (see the sketch after this list).
  • Support initializing multiple params together, so their grad reductions are launched together.
  • Support variable-sized buckets. This is not currently used (large buckets tend to increase memory overheads), but it may be a useful feature in the future.
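To illustrate the coalescing idea: with CUDA_DEVICE_MAX_CONNECTIONS=1, work from all CUDA streams is funneled through a single hardware queue, so a string of individually launched reduce-scatters can sit in front of the backward compute instead of overlapping with it; issuing them inside one NCCL group lets them go out together. The sketch below is not the apex implementation — the function and argument names are placeholders, and `torch.distributed._coalescing_manager` is a private PyTorch API whose signature has changed across releases (the very change flagged in review below).

```python
import torch
import torch.distributed as dist


def reduce_scatter_buckets(bucket_grads, process_group):
    """Reduce-scatter several gradient buckets inside one NCCL group.

    ``bucket_grads`` is a list of flattened gradient tensors whose sizes
    are divisible by the process-group size. Placeholder names, not the
    actual apex API.
    """
    world_size = dist.get_world_size(process_group)
    outputs = []
    # _coalescing_manager is a private PyTorch API and its signature has
    # changed between releases, so treat this as illustrative only.
    with dist._coalescing_manager(group=process_group):
        for grad in bucket_grads:
            out = torch.empty(
                grad.numel() // world_size, dtype=grad.dtype, device=grad.device
            )
            # Each call is recorded and issued as part of a single
            # ncclGroupStart/ncclGroupEnd region when the context exits.
            dist.reduce_scatter_tensor(out, grad, group=process_group)
            outputs.append(out)
    return outputs
```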

@crcrpar (Collaborator) left a comment


Excuse me for the delay, but there seems to be a quite recent change in PyTorch with which this pull request fails.

apex/contrib/optimizers/distributed_fused_adam.py: two review threads (outdated, resolved)
@crcrpar modified the milestones: 23.01, 23.02 on Jan 19, 2023
@crcrpar merged commit 75f401e into NVIDIA:master on Jan 20, 2023
timmoon10 added a commit to timmoon10/apex that referenced this pull request on Mar 2, 2023
Handles checkpoints generated before NVIDIA#1551.
yuanzhedong pushed a commit to yuanzhedong/apex that referenced this pull request on Jul 14, 2023
…1551)

* Coalesce reduce-scatters in distributed Adam

* Support variable-size param buckets in dist Adam optimizer

* Support contiguous grad buffer with variable-size param buckets

* Add dist Adam unit test with contiguous grad buffers

* Optimize compute/communication overlap in dist Adam optim step

* Restore Dist Adam default of splitting params across default-sized buckets

* Support initializing multiple dist Adam param buckets together

The buckets perform communication together, so they are effectively a large bucket.

* Handle recent change in PyTorch API for coalescing NCCL calls
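The final commit above handles a change to PyTorch's private coalescing API. A hypothetical compatibility shim for that kind of change might look like the following; the helper name, the keyword-only call, and the un-coalesced fallback are assumptions for illustration, not what apex actually does.

```python
import contextlib

import torch.distributed as dist


@contextlib.contextmanager
def nccl_coalescing(group):
    """Best-effort wrapper around torch.distributed._coalescing_manager.

    The private API's signature has changed between PyTorch releases, so
    fall back to running the collectives un-coalesced if it cannot be
    constructed. Hypothetical helper, not the actual apex code.
    """
    manager = getattr(dist, "_coalescing_manager", None)
    if manager is None:
        cm = contextlib.nullcontext()  # very old PyTorch: no coalescing
    else:
        try:
            cm = manager(group=group)
        except TypeError:
            # Signature mismatch with this PyTorch version.
            cm = contextlib.nullcontext()
    with cm:
        yield


# Hypothetical usage:
#   with nccl_coalescing(process_group):
#       for grad, out in zip(grad_buckets, out_buckets):
#           dist.reduce_scatter_tensor(out, grad, group=process_group)
```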