
Explicitly pass expert_tensor_parallel_size to initialize_model_parallel#537

Merged
ananthsub merged 2 commits into NVIDIA-NeMo:main from nathan-az:pass-expert-tp-to-init on Sep 4, 2025

Conversation

@nathan-az
Contributor

@nathan-az nathan-az commented Sep 3, 2025

Due to the default handling in MCore, if this argument isn't passed, expert_tensor_parallel_size is forced to tensor_model_parallel_size. With this PR, we first check whether the parameter is set and, if not, fall back to the default behaviour by passing None.

Not extensively tested, but came across the issue and the fix seems straightforward.

(Hopefully) fixes #546

Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>
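
The forwarding logic described in the PR description can be sketched as follows. This is a hypothetical illustration, not the actual Megatron-Bridge patch: `cfg` and `model_parallel_kwargs` are made-up names standing in for wherever the parallelism config is read.

```python
# Hypothetical sketch of the fix described above. `cfg` and
# `model_parallel_kwargs` are illustrative names, not Megatron-Bridge code.
def model_parallel_kwargs(cfg: dict) -> dict:
    return {
        "tensor_model_parallel_size": cfg["tensor_model_parallel_size"],
        # Forward expert TP only if it was configured; otherwise pass None
        # so MCore's initialize_model_parallel applies its own default
        # instead of silently inheriting the tensor-parallel size.
        "expert_tensor_parallel_size": cfg.get("expert_tensor_parallel_size"),
    }
```

The key point is using `cfg.get(...)`, which yields `None` when the key is absent, rather than unconditionally copying `tensor_model_parallel_size` into the expert setting.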
@copy-pr-bot

copy-pr-bot bot commented Sep 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>
@nathan-az nathan-az changed the title from "Pass expert_tensor_parallel_size to initialize_model_parallel" to "Explicitly pass expert_tensor_parallel_size to to initialize_model_parallel" on Sep 3, 2025
@nathan-az nathan-az changed the title from "Explicitly pass expert_tensor_parallel_size to to initialize_model_parallel" to "Explicitly pass expert_tensor_parallel_size to initialize_model_parallel" on Sep 3, 2025
@ananthsub
Contributor

/ok to test 91fba7d

@ananthsub
Contributor

Thanks for your contribution @nathan-az - could you share the full stack trace of the error you ran into?

@nathan-az
Contributor Author

@ananthsub created an issue with a repro and stack trace. Let me know if I've overlooked anything. I'm working out the kinks of using NeMo RL and reporting/fixing issues along the way, but I'm not at all experienced in the Megatron/NeMo stack.
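
To see why inheriting the tensor-parallel size can break expert-parallel layouts, consider an illustrative (not Megatron-Core's actual) divisibility check: the expert rank grid ETP × EP × PP still has to fit into the world size, and a silently inherited ETP can make it too large.

```python
# Illustrative sketch only -- this is NOT Megatron-Core's actual
# validation logic, just a toy divisibility check showing the failure
# mode: expert TP (ETP) forced to TP can make the expert rank grid
# ETP * EP * PP no longer divide the world size.
def expert_layout_ok(world_size: int, etp: int, ep: int, pp: int) -> bool:
    return world_size % (etp * ep * pp) == 0

world, tp, pp, ep = 8, 4, 2, 2
# ETP silently inherited from TP: 4 * 2 * 2 = 16 ranks needed, > 8 -> invalid.
inherited_ok = expert_layout_ok(world, tp, ep, pp)
# ETP passed explicitly as 1: 1 * 2 * 2 = 4 divides 8 -> valid.
explicit_ok = expert_layout_ok(world, 1, ep, pp)
```

The numbers here (world size 8, TP=4, PP=2, EP=2) are assumed for illustration and do not come from the linked issue.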

@nathan-az
Contributor Author

nathan-az commented Sep 4, 2025

Also CC @yaoyu-33 since (AFAICT) some EP cases are currently broken and this fixes them - I'm not sure how often the NeMo RL submodule branch yuya/nemo-rl-use-chunkpatch is synced with main.

@yfw
Contributor

yfw commented Sep 4, 2025

Thank you @nathan-az for your contribution! I've verified this fixes an issue I was seeing with expert_tensor_parallel_size not being set correctly when testing NeMo RL with DeepSeek models. I've included this change in NVIDIA-NeMo/RL#1059 to make sure it lands in NeMo RL.

@ananthsub ananthsub merged commit cfe24f9 into NVIDIA-NeMo:main Sep 4, 2025
42 of 43 checks passed
ko3n1g pushed a commit that referenced this pull request Sep 4, 2025
…rallel` (#537)

* Pass expert_tensor_parallel_size to pstate init

Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>

* check correct key

Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>

---------

Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>
ananthsub pushed a commit that referenced this pull request Sep 4, 2025
…rallel` (#537) (#557)

* Pass expert_tensor_parallel_size to pstate init



* check correct key



---------

Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>
Co-authored-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>
paul-gibbons pushed a commit to paul-gibbons/Megatron-Bridge that referenced this pull request Oct 29, 2025
…rallel` (NVIDIA-NeMo#537)

* Pass expert_tensor_parallel_size to pstate init

Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>

* check correct key

Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>

---------

Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>
Signed-off-by: Paul Gibbons <pgibbons@nvidia.com>


Development

Successfully merging this pull request may close these issues.

Error computing expert_tensor_model_pipeline_parallel_size when using tensor_model_parallel_size

4 participants