Explicitly pass `expert_tensor_parallel_size` to `initialize_model_parallel` #537
Conversation
Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>
/ok to test 91fba7d
Thanks for your contribution @nathan-az - could you share the full stack trace of the error you ran into?
@ananthsub created an issue with a repro and stacktrace. Let me know if I've overlooked anything. I'm working out the kinks of using
Also CC @yaoyu-33 since (AFAICT) some EP cases are currently broken, and this fixes them - I'm not sure how often the NeMo-RL submodule branch
Thank you @nathan-az for your contribution! I've verified this fixes an issue I was seeing with expert_tensor_parallel_size not being set correctly when testing NeMo RL with DeepSeek models. I've included this change in this PR for NeMo RL: NVIDIA-NeMo/RL#1059 to make sure it gets included into NeMo RL.
…rallel` (#537)
* Pass expert_tensor_parallel_size to pstate init
Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>
* check correct key
Signed-off-by: Nathan Azrak <42650258+nathan-az@users.noreply.github.com>
Due to the default handling in MCore, if this isn't passed, `expert_tensor_parallel_size` is forced to `tensor_model_parallel_size`. With this PR, we first check whether the parameter is passed, and if not, fall back to the default behaviour with `None`. Not extensively tested, but I came across the issue and the fix seems straightforward.
(Hopefully) fixes #546
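The change described above can be sketched as follows. This is a minimal illustration of the pattern, not the actual diff: `build_parallel_kwargs` and the config dict are hypothetical stand-ins for the real NeMo code that assembles arguments for `initialize_model_parallel`.

```python
def build_parallel_kwargs(cfg: dict) -> dict:
    """Assemble kwargs for initialize_model_parallel (illustrative sketch)."""
    kwargs = {
        "tensor_model_parallel_size": cfg.get("tensor_model_parallel_size", 1),
    }
    # Forward expert_tensor_parallel_size only when it was explicitly
    # configured. Omitting the key preserves MCore's default handling,
    # under which the value is derived from tensor_model_parallel_size.
    if cfg.get("expert_tensor_parallel_size") is not None:
        kwargs["expert_tensor_parallel_size"] = cfg["expert_tensor_parallel_size"]
    return kwargs
```

The key point is that an explicitly configured value (e.g. ETP=1 with TP=2 for a MoE model) now reaches MCore instead of being silently overwritten by the TP size.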