[for #1766] fix fc1 gate/up split with TP & fix expert layers export when EP > 1 (#1817)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Pull request overview
This PR fixes a critical bug in the splitting of fused FC1 (gate/up) LoRA weights when Tensor Parallelism (TP) is enabled. The previous implementation incorrectly split the fused weights with a simple bisection, which didn't account for TP-aware weight ordering. This caused training to diverge and eventually collapse. The fix properly handles the interleaved shard structure where each TP shard contains both gate and up components.
Key changes:
- Added `_split_fused_fc1_linear_out_weight` method that implements TP-aware splitting logic
- Updated `_get_fused_adapter_linear_out_slices` and `_merge_lora_adapter_weights` to use the new splitting method
- Properly threaded the `is_expert` parameter through the call chain to handle both expert and non-expert TP configurations
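The TP-aware split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual implementation: it assumes the fused FC1 output dimension is laid out as consecutive TP-shard blocks, each holding a gate half followed by an up half, so a single global bisection would mix gate and up rows. The function name and signature are hypothetical.

```python
import numpy as np

def split_fused_fc1(fused_weight: np.ndarray, tp_size: int):
    """Hypothetical sketch of a TP-aware gate/up split.

    Assumes the fused weight's output dim is a concatenation of tp_size
    shard blocks, each containing [gate_half; up_half]. A naive bisection
    of the whole tensor would therefore interleave gate and up rows.
    """
    out_dim = fused_weight.shape[0]
    assert out_dim % (2 * tp_size) == 0
    shard = out_dim // tp_size   # rows per TP shard (gate half + up half)
    half = shard // 2            # rows of gate (or up) within one shard
    gate_parts, up_parts = [], []
    for rank in range(tp_size):
        block = fused_weight[rank * shard:(rank + 1) * shard]
        gate_parts.append(block[:half])   # gate rows of this shard
        up_parts.append(block[half:])     # up rows of this shard
    return np.concatenate(gate_parts, axis=0), np.concatenate(up_parts, axis=0)
```

Concatenating the per-shard halves restores contiguous gate and up matrices, which is what a naive `weight[:out_dim // 2]` / `weight[out_dim // 2:]` bisection gets wrong whenever `tp_size > 1`.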
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
Force-pushed from 8582166 to e66a042.
Copilot reviewed 3 out of 3 changed files in this pull request and generated 13 comments.
What does this PR do?
For the expert layers export fix when EP > 1: after the fix, with TP=4, PP=1, EP=8, ETP=1, CP=1, the training-inference mismatch also stabilizes.
For the fc1 gate/up split fix with TP: the previous method contained errors; although training initially converges, training and inference gradually diverge and the training eventually collapses.
After the fix, it now works fine!
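For the EP > 1 export issue, the usual pitfall is exporting expert layers under their rank-local index instead of the global one. A hedged sketch of the index mapping, assuming contiguous expert partitioning across EP ranks (the helper name is hypothetical, not from this PR):

```python
def global_expert_index(ep_rank: int, local_idx: int,
                        num_experts: int, ep_size: int) -> int:
    """Hypothetical sketch: with expert parallelism, each EP rank holds
    num_experts // ep_size experts, so exported expert-layer names must
    use the global index, not the rank-local one (assuming contiguous
    partitioning of experts across ranks)."""
    assert num_experts % ep_size == 0
    experts_per_rank = num_experts // ep_size
    return ep_rank * experts_per_rank + local_idx
```

For example, with 8 experts over EP=4, local expert 1 on rank 2 is global expert 5; exporting it as expert 1 would collide with rank 0's experts.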
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information