[for #1766] fix fc1 gate/up split with TP & fix expert layers export when EP > 1 (#1817)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Pull request overview
This PR fixes a critical bug in the splitting of fused FC1 (gate/up) LoRA weights when Tensor Parallelism (TP) is enabled. The previous implementation incorrectly split the fused weights with a simple bisection, which didn't account for TP-aware weight ordering. This caused training to diverge and eventually collapse. The fix properly handles the interleaved shard structure where each TP shard contains both gate and up components.
Key changes:
- Added `_split_fused_fc1_linear_out_weight` method that implements TP-aware splitting logic
- Updated `_get_fused_adapter_linear_out_slices` and `_merge_lora_adapter_weights` to use the new splitting method
- Properly threaded the `is_expert` parameter through the call chain to handle both expert and non-expert TP configurations
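The TP-aware split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual implementation: it assumes the fused FC1 output dimension is laid out as consecutive TP-shard blocks, each holding a gate half followed by an up half, so a single global bisection would mix gate and up rows. The function name and signature are hypothetical.

```python
import numpy as np

def split_fused_fc1(fused_weight: np.ndarray, tp_size: int):
    """Hypothetical sketch of a TP-aware gate/up split.

    Assumes the fused weight's output dim is a concatenation of tp_size
    shard blocks, each containing [gate_half; up_half]. A naive bisection
    of the whole tensor would therefore interleave gate and up rows.
    """
    out_dim = fused_weight.shape[0]
    assert out_dim % (2 * tp_size) == 0
    shard = out_dim // tp_size   # rows per TP shard (gate half + up half)
    half = shard // 2            # rows of gate (or up) within one shard
    gate_parts, up_parts = [], []
    for rank in range(tp_size):
        block = fused_weight[rank * shard:(rank + 1) * shard]
        gate_parts.append(block[:half])   # gate rows of this shard
        up_parts.append(block[half:])     # up rows of this shard
    return np.concatenate(gate_parts, axis=0), np.concatenate(up_parts, axis=0)
```

Concatenating the per-shard halves restores contiguous gate and up matrices, which is what a naive `weight[:out_dim // 2]` / `weight[out_dim // 2:]` bisection gets wrong whenever `tp_size > 1`.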
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
Force-pushed from 8582166 to e66a042.
Copilot reviewed 3 out of 3 changed files in this pull request and generated 13 comments.
What does this PR do?
For the expert layers export fix when EP > 1: after the fix, with TP=4, PP=1, EP=8, ETP=1, CP=1, the training-inference mismatch also stabilizes.
For the fc1 gate/up split fix with TP: the previous method contained errors; although training initially converges, training and inference gradually diverge and the training eventually collapses.
After the fix, it now works fine!
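For the EP > 1 export issue, the usual pitfall is exporting expert layers under their rank-local index instead of the global one. A hedged sketch of the index mapping, assuming contiguous expert partitioning across EP ranks (the helper name is hypothetical, not from this PR):

```python
def global_expert_index(ep_rank: int, local_idx: int,
                        num_experts: int, ep_size: int) -> int:
    """Hypothetical sketch: with expert parallelism, each EP rank holds
    num_experts // ep_size experts, so exported expert-layer names must
    use the global index, not the rank-local one (assuming contiguous
    partitioning of experts across ranks)."""
    assert num_experts % ep_size == 0
    experts_per_rank = num_experts // ep_size
    return ep_rank * experts_per_rank + local_idx
```

For example, with 8 experts over EP=4, local expert 1 on rank 2 is global expert 5; exporting it as expert 1 would collide with rank 0's experts.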
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information