Explicitly pass `expert_tensor_parallel_size` to `initialize_model_parallel` #537
Conversation
Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>
/ok to test 91fba7d
Thanks for your contribution @nathan-az - could you share the full stack trace of the error you ran into?
@ananthsub created an issue with a repro and stacktrace. Let me know if I've overlooked anything. I'm working out the kinks of using
Also CC @yaoyu-33 since (AFAICT) some EP cases are currently broken, and this fixes them - I'm not sure how often the NeMo-RL submodule branch
Thank you @nathan-az for your contribution! I've verified this fixes an issue I was seeing with expert_tensor_parallel_size not being set correctly when testing NeMo RL with DeepSeek models. I've included this change in this PR for NeMo RL: NVIDIA-NeMo/RL#1059 to make sure it gets included into NeMo RL.
…rallel` (#537)
* Pass expert_tensor_parallel_size to pstate init
Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>
* check correct key
Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>
Due to the default handling in MCore, if this isn't passed, `expert_tensor_parallel_size` is forced to `tensor_model_parallel_size`. With this PR, we first check whether the parameter is passed, and if not, fall back to the default behaviour with `None`. Not extensively tested, but I came across the issue and the fix seems straightforward.
(Hopefully) fixes #546
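The change described above can be sketched as follows. This is a minimal illustration of the pattern, not the actual diff: `build_parallel_kwargs` and the config dict are hypothetical stand-ins for the real NeMo code that assembles arguments for `initialize_model_parallel`.

```python
def build_parallel_kwargs(cfg: dict) -> dict:
    """Assemble kwargs for initialize_model_parallel (illustrative sketch)."""
    kwargs = {
        "tensor_model_parallel_size": cfg.get("tensor_model_parallel_size", 1),
    }
    # Forward expert_tensor_parallel_size only when it was explicitly
    # configured. Omitting the key preserves MCore's default handling,
    # under which the value is derived from tensor_model_parallel_size.
    if cfg.get("expert_tensor_parallel_size") is not None:
        kwargs["expert_tensor_parallel_size"] = cfg["expert_tensor_parallel_size"]
    return kwargs
```

The key point is that an explicitly configured value (e.g. ETP=1 with TP=2 for a MoE model) now reaches MCore instead of being silently overwritten by the TP size.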