
[LoRA] Fix LoRA merge and support CanonicalLoRA merge#1603

Merged
yaoyu-33 merged 4 commits intoNVIDIA-NeMo:mainfrom
HollowMan6:lora_merge
Dec 12, 2025


@HollowMan6
Contributor

@HollowMan6 HollowMan6 commented Dec 5, 2025

What does this PR do ?

The previous LoRA merge (#1310) contains several errors. This PR fixes them and adds support for CanonicalLoRA merge.

Changelog

  • Handle fused QKV/gate up correctly
  • Handle PP gathering correctly
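
To illustrate what merging separate adapters into a fused weight involves, here is a minimal sketch in plain Python. It assumes the standard LoRA update W' = W + (alpha / rank) * B @ A and a simplified fused layout in which the Q, K and V rows are simply stacked; the real fused layout in Megatron may differ (for example by interleaving heads per query group). All function names below are hypothetical and are not taken from this PR.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(lora_a, lora_b, alpha, rank):
    """Return (alpha / rank) * B @ A, the dense update implied by one adapter."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(lora_b, lora_a)]


def merge_canonical_into_fused(fused_w, adapters, row_counts, alpha, rank):
    """Add per-projection deltas (q, k, v) into the matching row block of a fused weight.

    Assumption: the fused weight stacks all Q rows, then all K rows, then all V rows.
    """
    merged = [row[:] for row in fused_w]  # copy so the input is left untouched
    offset = 0
    for name in ("q", "k", "v"):
        lora_a, lora_b = adapters[name]
        delta = lora_delta(lora_a, lora_b, alpha, rank)
        if len(delta) != row_counts[name]:
            raise ValueError(f"adapter '{name}' does not match its row block")
        for i, delta_row in enumerate(delta):
            merged[offset + i] = [w + d for w, d in zip(merged[offset + i], delta_row)]
        offset += row_counts[name]
    return merged
```

The point of the sketch is that each canonical adapter only touches its own slice of the fused weight, so the slice offsets must match the fused layout exactly; getting those offsets wrong silently corrupts the merged weight.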

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

✨ Presented to you with Mind Lab - A Lab for Experiential Intelligence.

@copy-pr-bot

copy-pr-bot bot commented Dec 5, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HollowMan6
Contributor Author

@HollowMan6 HollowMan6 requested a review from yaoyu-33 December 5, 2025 20:45
Previous LoRA merge contains several errors:
- It didn't handle fused QKV/gate up correctly
- The handling of PP gathering is problematic

Signed-off-by: Hollow Man <hollowman@opensuse.org>
@HollowMan6
Contributor Author

This PR has now been refactored to better support a future LoRA bridge. We first construct the AdapterWeightConversionTask objects using the information from _megatron_global_adapters_info_all_pp_ranks() and build a dict keyed by global_base_prefix for indexing those tasks. We then materialize (finish) the adapter tasks alongside the base weight conversion tasks in a streaming manner and store them in a list of AdapterWeight. Finally, we perform the LoRA merge using those AdapterWeight entries.
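
The flow described above can be sketched roughly as follows. This is a loose illustration in plain Python, not the code from this PR: the classes AdapterTask and MaterializedAdapter are hypothetical stand-ins for AdapterWeightConversionTask and AdapterWeight, their fields are assumptions, the adapter info is assumed to be already gathered across PP ranks, and the merge is done inline while streaming rather than from a stored list.

```python
from dataclasses import dataclass


@dataclass
class AdapterTask:
    """Hypothetical stand-in for AdapterWeightConversionTask."""
    global_base_prefix: str
    lora_a: list  # rank x in_features
    lora_b: list  # out_features x rank
    scale: float  # alpha / rank


@dataclass
class MaterializedAdapter:
    """Hypothetical stand-in for AdapterWeight: a task bound to a base weight name."""
    base_name: str
    task: AdapterTask


def index_tasks(adapter_infos):
    """Build a dict of adapter tasks keyed by global_base_prefix."""
    return {info["global_base_prefix"]: AdapterTask(**info) for info in adapter_infos}


def merge(weight, adapter):
    """Return weight + scale * B @ A for one materialized adapter."""
    t = adapter.task
    rank = len(t.lora_a)
    return [
        [w + t.scale * sum(t.lora_b[i][r] * t.lora_a[r][j] for r in range(rank))
         for j, w in enumerate(row)]
        for i, row in enumerate(weight)
    ]


def stream_merged(base_weights, tasks_by_prefix):
    """Yield (name, weight) pairs one at a time, merging when an adapter matches."""
    for name, weight in base_weights:
        prefix, _, leaf = name.rpartition(".")
        task = tasks_by_prefix.get(prefix) if leaf == "weight" else None
        if task is None:
            yield name, weight  # no adapter for this parameter: pass through
            continue
        adapter = MaterializedAdapter(base_name=name, task=task)
        yield name, merge(weight, adapter)
```

Keying the tasks by prefix means each base weight needs only one dictionary lookup as it streams past, so the full set of base weights never has to be held in memory at once.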

@yaoyu-33
Contributor

yaoyu-33 commented Dec 7, 2025

/ok to test 37b2a2e

Signed-off-by: Hollow Man <hollowman@opensuse.org>
@HollowMan6 HollowMan6 requested a review from yaoyu-33 December 7, 2025 11:07
Signed-off-by: Hollow Man <hollowman@opensuse.org>
@HollowMan6
Contributor Author

Test results on MoE (Qwen3-30B-A3B) LoRA also look good:

image

Canonical LoRA will work on MoE with the following PRs merged:

@yaoyu-33 yaoyu-33 enabled auto-merge (squash) December 7, 2025 19:15
@yaoyu-33
Contributor

yaoyu-33 commented Dec 7, 2025

/ok to test 19c4c5e

@copy-pr-bot copy-pr-bot bot requested a deployment to nemo-ci December 10, 2025 01:35 Abandoned
@copy-pr-bot copy-pr-bot bot requested a deployment to nemo-ci December 10, 2025 16:39 Abandoned