
[data] switch pretrain from broadcast to replicated load#491

Merged
ananthsub merged 1 commit into NVIDIA-NeMo:main from ananthsub:tp-replicated-load
Aug 26, 2025

Conversation

@ananthsub (Contributor) commented Aug 26, 2025

Don't load the batch on one TP rank and broadcast it; instead, initialize the dataloader on all ranks and unconditionally call `get_batch_from_iterator`.

Broadcasting uses less file I/O and PCIe bandwidth, but costs GPU memcpy and NVLink bandwidth.

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
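The control-flow difference can be sketched as follows. This is a minimal, illustrative model only: the function names and the in-process "ranks" are hypothetical, not the actual NeMo or torch.distributed APIs, and the real trade-off (file I/O and PCIe vs. GPU memcpy and NVLink) is only indicated in comments.

```python
# Hypothetical sketch of broadcast vs. replicated batch loading.
# "TP ranks" are simulated in-process; no real collectives are issued.

def make_iterator(dataset):
    # Stand-in for constructing a dataloader iterator on a rank.
    return iter(dataset)

def broadcast_load(dataset, tp_ranks):
    # Before this PR: only TP rank 0 reads the batch (file I/O + PCIe),
    # then broadcasts it to the other TP ranks (GPU memcpy + NVLink).
    batch = next(make_iterator(dataset))          # read on rank 0 only
    return [batch for _ in range(tp_ranks)]       # simulated broadcast

def replicated_load(dataset, tp_ranks):
    # After this PR: every rank owns an iterator and reads the batch
    # itself, so no broadcast is needed. Ranks must iterate identically
    # (same dataset, same order) for the batches to match.
    iterators = [make_iterator(dataset) for _ in range(tp_ranks)]
    return [next(it) for it in iterators]

dataset = [{"tokens": [1, 2, 3]}]
# Both strategies hand every TP rank the same batch.
assert broadcast_load(dataset, tp_ranks=4) == replicated_load(dataset, tp_ranks=4)
```

The key invariant, as in the real change, is that replicated loading only works because every TP rank sees an identical data stream; the PR relies on that to drop the broadcast entirely.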
@copy-pr-bot bot commented Aug 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ananthsub (Contributor, Author) commented:

/ok to test fe02f57

@sanandaraj5597 (Contributor) left a comment


LGTM.

@ananthsub ananthsub marked this pull request as ready for review August 26, 2025 21:01
@ananthsub ananthsub enabled auto-merge (squash) August 26, 2025 21:13
@ananthsub ananthsub merged commit 7612573 into NVIDIA-NeMo:main Aug 26, 2025
33 checks passed
ko3n1g pushed a commit that referenced this pull request Aug 26, 2025
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
ananthsub added a commit that referenced this pull request Aug 26, 2025
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
@ananthsub ananthsub deleted the tp-replicated-load branch February 17, 2026 07:16
