
(1) Add SHARP interface to M-CORE, (2) use send/recv to send train loss to the first rank instead of b-cast #7793

Merged: erhoo82 merged 2 commits into main from slym/mlperf_merge on Jan 2, 2024

Conversation

@erhoo82 (Collaborator) commented on Oct 24, 2023

What does this PR do?

(1) Add the SHARP interface to M-CORE in the communicator initialization.
(2) Use send/recv to send the train loss to the first rank instead of a broadcast. This mitigates the communication overhead.
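For illustration, here is a minimal sketch of the send/recv pattern described in (2). The function name and the choice of source rank are assumptions for the example, not taken from this PR's diff:

```python
import torch
import torch.distributed as dist

def report_train_loss(loss_mean: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: forward the averaged train loss to rank 0 with a
    point-to-point send/recv instead of broadcasting it to every rank."""
    src_rank = dist.get_world_size() - 1  # assumption: the loss lives on the last rank
    if src_rank == 0:
        return loss_mean  # single-rank job: nothing to forward
    rank = dist.get_rank()
    if rank == src_rank:
        dist.send(loss_mean, dst=0)         # only the source rank sends
    elif rank == 0:
        dist.recv(loss_mean, src=src_rank)  # only rank 0 receives; other ranks skip the exchange
    return loss_mean
```

Compared with a broadcast over the whole group, only two ranks participate, which is what mitigates the communication overhead mentioned above.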

Changelog

  • Add a sharp option (default False) to the training strategy and pass it through the communicator initialization so SHARP can be used for the data-parallel process group.
  • Replace the broadcast of the mean train loss with a point-to-point send/recv to the first rank.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
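A hypothetical usage sketch (not from the PR itself): the only key taken from this PR's diff is model.sharp; the OmegaConf calls just illustrate how a config would carry the flag.

```python
from omegaconf import OmegaConf

# Illustrative config fragment: enable SHARP collectives via the model config.
# Only `model.sharp` comes from this PR; the other key is an example.
cfg = OmegaConf.create(
    """
    model:
      sharp: true                    # flag added by this PR (defaults to false)
      gradient_as_bucket_view: true
    """
)

# The training strategy reads the flag with a defaulted lookup:
print(cfg.model.get("sharp", False))  # -> True
```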

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@ericharper (Collaborator) previously approved these changes on Oct 24, 2023 and left a comment:

LGTM. Thanks!

@athitten (Collaborator) previously approved these changes on Oct 24, 2023 and left a comment:

LGTM, thank you!

@athitten (Collaborator) commented:

@erhoo82 just remembered we want to get rid of the torch broadcast in on_validation_epoch_end as well, right? Here: torch.distributed.broadcast

github-actions bot (Contributor) commented on Nov 8, 2023

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

github-actions bot added the stale label on Nov 8, 2023
github-actions bot (Contributor) commented:

This PR was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this on Nov 15, 2023
erhoo82 reopened this on Nov 30, 2023
@erhoo82 (Collaborator, Author) commented on Nov 30, 2023

Somehow this was closed. Re-opened.

@erhoo82 (Collaborator, Author) commented on Nov 30, 2023

@athitten
We need to keep the torch.distributed.broadcast in eval because this is the training termination condition for MLPerf.
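A minimal sketch of why that broadcast stays, using a hypothetical helper (none of these names come from NeMo or the MLPerf harness): every rank has to observe the same termination decision, so a broadcast from rank 0, rather than a point-to-point send, is the right collective here.

```python
import torch
import torch.distributed as dist

def sync_early_stop(val_accuracy: float, target_accuracy: float) -> bool:
    """Hypothetical: rank 0 decides whether the target metric is reached and
    broadcasts that decision so every rank stops training together."""
    flag = torch.zeros(1, dtype=torch.int32, device="cuda")
    if dist.get_rank() == 0:
        flag[0] = int(val_accuracy >= target_accuracy)
    dist.broadcast(flag, src=0)  # unlike the train-loss path, all ranks need this value
    return bool(flag.item())
```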

@erhoo82 (Collaborator, Author) commented on Nov 30, 2023

Added some changes.

github-actions bot removed the stale label on Dec 1, 2023
@erhoo82 (Collaborator, Author) commented on Dec 5, 2023

jenkins

@erhoo82 (Collaborator, Author) commented on Dec 8, 2023

@ericharper @athitten
I think there is no issue with this PR, right? Can we merge?

@@ -53,6 +53,7 @@ def _training_strategy(self) -> NLPDDPStrategy:
no_ddp_communication_hook=True,
gradient_as_bucket_view=self.cfg.model.gradient_as_bucket_view,
find_unused_parameters=False,
sharp=cfg.model.get('sharp', False),
Review comment (Collaborator):

@erhoo82 probably a typo. It should be self.cfg.model.get, right?

Review comment (Contributor):

This breaks the running of this PR.
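For reference, a minimal sketch of the corrected hunk, assuming only what the diff above shows (the enclosing class and other constructor arguments are omitted):

```python
def _training_strategy(self) -> NLPDDPStrategy:
    # Corrected per the review comments: read the flag from self.cfg, not cfg.
    return NLPDDPStrategy(
        no_ddp_communication_hook=True,
        gradient_as_bucket_view=self.cfg.model.gradient_as_bucket_view,
        find_unused_parameters=False,
        sharp=self.cfg.model.get('sharp', False),
    )
```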

@athitten (Collaborator) commented on Dec 8, 2023

> @ericharper @athitten I think there is no issue with this PR, right? Can we merge?

Yes, it should be okay to merge once the CI passes. Also, there are some conflicts with the base branch.

erhoo82 force-pushed the slym/mlperf_merge branch 3 times, most recently from 70343d8 to 2065563 on December 9, 2023 at 02:16
@erhoo82 (Collaborator, Author) commented on Dec 9, 2023

Thanks!
Fixed the typo and resolved the conflict.

@ericharper (Collaborator) commented:

jenkins

@ericharper (Collaborator) previously approved these changes on Dec 10, 2023 and left a comment:

LGTM. Thanks!

@ericharper (Collaborator) commented:

jenkins

github-actions bot (Contributor) commented:

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

github-actions bot added the stale label on Dec 30, 2023
…st rank

Signed-off-by: Sangkug Lym <[email protected]>

Add a default SHARP setting to arg list

Signed-off-by: Sangkug Lym <[email protected]>

cleanup

Signed-off-by: Sangkug Lym <[email protected]>
erhoo82 force-pushed the slym/mlperf_merge branch 3 times, most recently from 0ef3363 to f4b5515 on January 2, 2024 at 01:18
Signed-off-by: Sangkug Lym <[email protected]>
@erhoo82 (Collaborator, Author) commented on Jan 2, 2024

jenkins

erhoo82 merged commit 7e1bf36 into main on Jan 2, 2024
15 checks passed
erhoo82 deleted the slym/mlperf_merge branch on January 2, 2024 at 15:02
pzelasko pushed a commit to pzelasko/NeMo that referenced this pull request on Jan 3, 2024
…ss to the first rank instead of b-cast (NVIDIA#7793)

* (1) SHARP for DP proc group, (2) Use send/recv loss_mean logging at 1st rank

Signed-off-by: Sangkug Lym <[email protected]>

Add a default SHARP setting to arg list

Signed-off-by: Sangkug Lym <[email protected]>

cleanup

Signed-off-by: Sangkug Lym <[email protected]>

* cleanup

Signed-off-by: Sangkug Lym <[email protected]>

---------

Signed-off-by: Sangkug Lym <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
janEbert mentioned this pull request on Jan 13, 2024
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request on Feb 15, 2024 (same commit message as above, additionally signed off by Sasha Meister <[email protected]>)
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request on Jun 25, 2024 (same commit message as above)