
[LoRA] MoE LoRA Refactor #40338

Merged
simon-mo merged 29 commits into main from moe-lora-refactor on Apr 26, 2026

Conversation

@jeejeelee (Collaborator) commented Apr 20, 2026

Motivation

Currently, MoE LoRA is wired in by monkey-patching methods on the modular kernel at construction time. FusedMoEWithLoRA._inject_lora_into_fused_moe wraps FusedMoEKernel.apply, TritonExperts.activation, and TritonExperts.moe_sum with fwd_decorator / act_decorator / moe_sum_decorator, and smuggles tensors between them through moe_state_dict. This has several concrete problems:
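
For illustration, a minimal self-contained sketch of this pattern (generic Python, not the real vLLM code; the names fwd_decorator, act_decorator, and moe_state_dict come from the description above, everything else is invented for the example):

# Methods are rebound at construction time and communicate through a shared dict.
moe_state_dict = {}

class Experts:
    def apply(self, x):
        return 2 * x

    def activation(self, x):
        return max(x, 0)

def fwd_decorator(fn):
    def wrapped(x):
        moe_state_dict["lora_delta"] = 1  # smuggled to the next wrapper
        return fn(x)
    return wrapped

def act_decorator(fn):
    def wrapped(x):
        # hidden dependency: only correct if fwd_decorator ran first
        return fn(x) + moe_state_dict.pop("lora_delta")
    return wrapped

experts = Experts()
experts.apply = fwd_decorator(experts.apply)            # monkey-patch
experts.activation = act_decorator(experts.activation)  # monkey-patch
print(experts.activation(experts.apply(3)))  # 7, via invisible cross-call state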

1. Hacky and hard to maintain

The LoRA contribution is hidden inside decorators on base-MoE methods. Someone reading the MoE experts code sees self.activation(...) and self.moe_sum(...) — they have no syntactic hint that these calls may actually run LoRA shrink/expand GEMMs, or that moe_state_dict carries cross-call state between them. Debugging requires holding the patching order in your head.

2. MoE changes don't see LoRA

Because the LoRA path lives outside the expert apply() functions, any refactor of TritonExperts / UnfusedOAITritonExperts / MarlinExperts has to re-discover the LoRA contract (what activation receives, what moe_sum receives, what shape intermediate_cache* has). Every MoE-side change risks breaking LoRA silently.

3. Hard to extend

Adding support for new features — EP (expert parallelism), additional quantized backends, new expert impls — means replicating or working around the decorator chain, and the state-dict plumbing assumes one specific control flow. There is no extension point for an expert that wants to apply LoRA at a different point.

Change

Treat LoRA as a first-class concern in the modular MoE kernel rather than an external patch.

1. MoELoRAContext: a single explicit payload

A new MoELoRAContext dataclass (vllm/lora/lora_context.py) packages all of the LoRA state a MoE forward pass needs (a sketch follows the list):

  • w13_lora_a_stacked / w13_lora_b_stacked / w2_lora_a_stacked /
    w2_lora_b_stacked
  • adapter_enabled, max_loras
  • routing/sharding info: top_k, w13_num_slices, fully_sharded, tp_rank,
    tp_size, local_num_experts
  • the active punica_wrapper
  • use_tuned_config (whether VLLM_TUNED_CONFIG_FOLDER is set)
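
A sketch of what this context could look like (field names follow the list above; the exact types are assumptions, e.g. the stacked weights are likely tuples of tensors, one per slice):

from dataclasses import dataclass

import torch

@dataclass
class MoELoRAContext:
    # stacked LoRA weights
    w13_lora_a_stacked: tuple[torch.Tensor, ...]
    w13_lora_b_stacked: tuple[torch.Tensor, ...]
    w2_lora_a_stacked: tuple[torch.Tensor, ...]
    w2_lora_b_stacked: tuple[torch.Tensor, ...]
    adapter_enabled: torch.Tensor
    max_loras: int
    # routing / sharding info
    top_k: int
    w13_num_slices: int
    fully_sharded: bool
    tp_rank: int
    tp_size: int
    local_num_experts: int
    punica_wrapper: object  # the active PunicaWrapper instance
    use_tuned_config: bool  # whether VLLM_TUNED_CONFIG_FOLDER is set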

FusedMoEWithLoRA.set_mapping builds this context once and stashes it on the base layer as FusedMoE._lora_context. MoERunnerBase forwards it into FusedMoEMethodBase.apply(..., lora_context=...), and FusedMoEModularMethod.apply / FusedMoEKernel.apply / FusedMoEKernelModularImpl.apply thread it through to FusedMoEExpertsModular.apply(..., lora_context=...).

The context is the only LoRA surface area seen by the MoE code path — there is no more hidden state passed between method wrappers.

2. LoRA compute inlined into the expert apply()

Expert implementations that support LoRA now call it directly inside their own apply() function, at the same logical point the decorators used to target (see the sketch after this list):

  • TritonExperts.apply (fused_moe.py): after the w13 GEMM and before
    activation, call self.apply_w13_lora(...) to add the LoRA delta to
    intermediate_cache1. After the w2 GEMM and before moe_sum, call
    self.apply_w2_lora(...) on intermediate_cache3, reusing the
    sorted_token_ids_lora tensors from the first call.
  • UnfusedOAITritonExperts.apply (gpt_oss_triton_kernels_moe.py):
    same pattern, adjusted for the gather/scatter layout that its two
    matmul_ogs calls produce.
  • MarlinExperts.apply (fused_marlin_moe.py): fused_marlin_moe
    consumes activation_func and moe_sum as callables, so the LoRA path
    wraps those two callables to inject apply_w13_lora / apply_w2_lora at
    the correct buffer state.
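
A self-contained sketch of the new control flow, using the TritonExperts case (the names apply_w13_lora, apply_w2_lora, and supports_lora come from this PR; shapes, the ctx layout, and the helper bodies are illustrative stand-ins, not vLLM's real API):

import torch
import torch.nn.functional as F

class TritonExpertsSketch:
    @staticmethod
    def supports_lora() -> bool:
        return True

    def apply(self, x, w13, w2, lora_context=None):
        cache1 = x @ w13                                    # w13 GEMM
        if lora_context is not None:
            cache1 += self.apply_w13_lora(lora_context, x)  # visible call site
        act = F.silu(cache1)                                # activation
        cache3 = act @ w2                                   # w2 GEMM
        if lora_context is not None:
            # the real code also reuses sorted_token_ids_lora from the
            # w13 call; omitted here for brevity
            cache3 += self.apply_w2_lora(lora_context, act)
        return cache3.sum(dim=0)                            # stand-in moe_sum

    def apply_w13_lora(self, ctx, x):
        # LoRA shrink (A) then expand (B): x @ A @ B
        return (x @ ctx["w13_a"]) @ ctx["w13_b"]

    def apply_w2_lora(self, ctx, act):
        return (act @ ctx["w2_a"]) @ ctx["w2_b"]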

FusedMoEExperts.supports_lora() defaults to False. Each expert impl that has a validated LoRA path overrides it to True (TritonExperts, UnfusedOAITritonExperts, MarlinExperts). FusedMoEWithLoRA.__init__ asserts that the selected expert impl reports supports_lora(), and oracle/unquantized.py::select_unquantized_moe_backend now filters the backend auto-selection by that flag, so unsupported backends (FlashInfer / AITER) are transparently skipped when LoRA is enabled instead of silently producing wrong output or crashing later.
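
A sketch of the selection guard (the function name comes from this PR; the candidate list and its structure are assumptions):

def select_unquantized_moe_backend(candidates, lora_enabled: bool):
    for backend in candidates:
        if lora_enabled and not backend.supports_lora():
            continue  # e.g. FlashInfer / AITER are skipped transparently
        return backend
    raise RuntimeError("no unquantized MoE backend supports this config")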

Because the LoRA shrink/expand is now visible in the expert source, anyone modifying TritonExperts.apply can see the LoRA call site and keep it correct; tests on the MoE path automatically cover the LoRA path as well.

3. LoRA computation stays in PunicaWrapper

MoE LoRA still respects the PunicaWrapper abstraction; the actual shrink/expand compute is not moved. Two new methods on PunicaWrapperBase — add_lora_w13 and add_lora_w2 — encapsulate config lookup (tuned vs. heuristic), moe_lora_align_block_size, and the add_lora_fused_moe call. PunicaWrapperGPU provides the concrete implementation. FusedMoEExpertsModular has thin helpers apply_w13_lora / apply_w2_lora that just forward the context fields to these methods.
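
A minimal sketch of this division of labor (only the names add_lora_w13, moe_lora_align_block_size, and add_lora_fused_moe come from this PR; the bodies are trivial stand-ins, not the real kernels):

import torch

class PunicaWrapperSketch:
    def add_lora_w13(self, y, x, a_stacked, b_stacked, topk_ids,
                     use_tuned_config: bool):
        # tuned vs. heuristic kernel config
        config = {"block_m": 64} if use_tuned_config else {"block_m": 16}
        sort_meta = self.moe_lora_align_block_size(topk_ids, config)
        self.add_lora_fused_moe(y, x, a_stacked, b_stacked, sort_meta, config)
        return sort_meta  # reused by the later add_lora_w2 call

    def moe_lora_align_block_size(self, topk_ids, config):
        # stand-in: the real helper sorts/pads token ids per expert block
        return torch.argsort(topk_ids.flatten())

    def add_lora_fused_moe(self, y, x, a, b, sort_meta, config):
        y += (x @ a) @ b  # stand-in for the fused shrink/expand kernel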

Test Plan

All LoRA and MoE tests on CI should pass.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>


@jeejeelee jeejeelee marked this pull request as draft April 20, 2026 09:05
@mergify (Bot) commented Apr 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jeejeelee.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@gemini-code-assist (Bot) left a comment

Code Review

This pull request refactors the LoRA implementation for Fused MoE layers by replacing the legacy decorator-based monkey-patching with an explicit MoELoRAContext propagation through the modular kernel path. It updates various quantization methods and MoE backends to handle this context. Feedback highlights potential AttributeError issues: one regarding the access of fused_experts on FusedMoEKernel, and another concerning the missing block_shape property in the FusedMoEExperts base class and its subclasses like MarlinExperts, which would cause runtime crashes when LoRA is enabled.

Comment thread on vllm/lora/layers/fused_moe.py (outdated):

-    fused_experts.moe_sum = moe_sum_decorator(
-        self.base_layer, fused_experts.moe_sum
+    assert (
+        isinstance(moe_kernel.fused_experts, FusedMoEExpertsModular)
gemini-code-assist (Bot) commented, severity: high

The check isinstance(moe_kernel.fused_experts, FusedMoEExpertsModular) might fail because FusedMoEKernel typically wraps the implementation in an impl attribute. You should likely check moe_kernel.impl.fused_experts instead, or ensure that FusedMoEKernel exposes fused_experts as a property.

Suggested change:
-    isinstance(moe_kernel.fused_experts, FusedMoEExpertsModular)
+    isinstance(moe_kernel.impl.fused_experts, FusedMoEExpertsModular)

@bnellnm (Collaborator) commented Apr 23, 2026

You could use the is_monolithic property instead of isinstance.

You might also need to add a supports_lora method to FusedMoEKernel.

Comment on lines +962 to +1001
def apply_w13_lora(
    self,
    lora_context: "MoELoRAContext",
    *,
    y: torch.Tensor,
    x: torch.Tensor,
    topk_ids: torch.Tensor,
    topk_weights: torch.Tensor,
    expert_map: torch.Tensor | None,
    w1: torch.Tensor,
    w2: torch.Tensor,
    num_tokens: int,
    top_k_num: int,
) -> tuple[
    torch.Tensor | None,
    torch.Tensor | None,
    torch.Tensor | None,
    torch.Tensor | None,
]:
    return lora_context.punica_wrapper.add_lora_w13(
        y,
        x,
        lora_context.w13_lora_a_stacked,
        lora_context.w13_lora_b_stacked,
        topk_ids,
        topk_weights,
        expert_map,
        w1,
        w2,
        num_tokens,
        top_k_num,
        lora_context.max_loras,
        lora_context.adapter_enabled,
        lora_context.local_num_experts,
        lora_context.top_k,
        lora_context.w13_num_slices,
        lora_context.fully_sharded,
        lora_context.use_tuned_config,
        block_shape=self.block_shape,
    )
gemini-code-assist (Bot) commented, severity: high

The apply_w13_lora method (and apply_w2_lora below) calls self.block_shape, but block_shape is not defined as an abstract property in the FusedMoEExperts base class. This will cause an AttributeError at runtime for any expert implementation that does not explicitly define it (e.g., MarlinExperts or UnfusedOAITritonExperts). Please add block_shape as an abstract property to FusedMoEExperts and implement it in all subclasses that support LoRA.
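
For illustration, a minimal sketch of the suggested fix (the return type is an assumption, inferred from block_shape being forwarded into the Punica call above):

from abc import ABC, abstractmethod

class FusedMoEExperts(ABC):
    @property
    @abstractmethod
    def block_shape(self) -> list[int] | None: ...

class MarlinExperts(FusedMoEExperts):
    @property
    def block_shape(self) -> list[int] | None:
        return None  # e.g. no block-quantization shape for this backend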

Comment on lines +662 to +663
def supports_lora() -> bool:
    return True
gemini-code-assist (Bot) commented, severity: high

MarlinExperts claims to support LoRA but does not implement the block_shape property required by apply_w13_lora and apply_w2_lora in the base class. This will lead to a crash when LoRA is enabled with Marlin quantization.

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@mergify mergify Bot removed the needs-rebase label Apr 20, 2026
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@mergify mergify Bot added the nvidia, rocm (Related to AMD ROCm), and intel-gpu (Related to Intel GPU) labels Apr 20, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Apr 20, 2026
@jeejeelee jeejeelee marked this pull request as ready for review April 20, 2026 12:00
@jeejeelee jeejeelee added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 20, 2026
Comment on lines +58 to 63
moe_kernel = FusedMoEKernel(
    prepare_finalize,
    self.base_layer.quant_method.select_gemm_impl(
        prepare_finalize, self.base_layer
    ),
)
Collaborator

Do we know if this case is ever hit now? Most methods have been switched over to the new MK (modular kernel) initialization pattern (_setup_kernel).

        index, :, : sliced_w2_lora_b.shape[1], : sliced_w2_lora_b.shape[2]
    ].copy_(sliced_w2_lora_b, non_blocking=True)

    def set_mapping(self, punica_wrapper):
Collaborator

Does this happen at runtime or is this part of the LoRA setup?

Collaborator (Author)

This is LoRA setup, not runtime. MoELoRAContext captures references to it, so the experts kernel sees fresh values without rebinding.

@bnellnm (Collaborator) commented Apr 24, 2026

It's probably too much for this PR but we could consider having separate subclasses for experts that support LoRA (so that the LoRA code could be completely isolated) and the setup in FusedMoEWithLoRA could construct the proper LoRA MK instead of rewriting or hijacking the existing MK.

@bnellnm (Collaborator) left a comment

I think at least one of the gemini comments (the one about isinstance) is relevant and should be fixed. Otherwise, I think it looks pretty good.

@github-project-automation github-project-automation Bot moved this to In review in NVIDIA Apr 24, 2026
@github-project-automation github-project-automation Bot moved this from To Triage to In progress in gpt-oss Issues & Enhancements Apr 24, 2026
@jeejeelee (Collaborator, Author) commented

It's probably too much for this PR but we could consider having separate subclasses for experts that support LoRA (so that the LoRA code could be completely isolated) and the setup in FusedMoEWithLoRA could construct the proper LoRA MK instead of rewriting or hijacking the existing MK.

Good point, I'll look into it further. Thanks!

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@jeejeelee jeejeelee requested a review from bnellnm April 24, 2026 13:00
@bnellnm (Collaborator) left a comment

LGTM. Thanks for the good work! Btw, do you know if the select_gemm_impl codepath ever gets triggered? It should largely be defunct now and will be removed once everything is migrated over to _setup_kernel.

@jeejeelee (Collaborator, Author) commented

LGTM. Thanks for the good work! Btw, do you know if the select_gemm_impl codepath ever gets triggered? It should largely be defunct now and will be removed once everything is migrated over to _setup_kernel.

First of all, thank you for the thorough review.

Yes, it looks like it still gets triggered — specifically by the unmigrated quant methods (AWQ-Marlin, compressed_tensors_moe).

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@github-project-automation github-project-automation Bot moved this from In progress to Ready in gpt-oss Issues & Enhancements Apr 25, 2026
@github-project-automation github-project-automation Bot moved this from In review to Ready in NVIDIA Apr 25, 2026
@simon-mo simon-mo enabled auto-merge (squash) April 25, 2026 17:15
@bnellnm (Collaborator) commented Apr 25, 2026

FYI, my guess is that #40794 is causing the LoRA failure. There was a similar issue when the truncate came before the reduce in a prior PR that was fixed by moving the trunc afterwards. I'm not sure what the best solution is here. Calling .contiguous() on the result of the truncation should "fix" the problem but feels like a bandaid.

@simon-mo simon-mo merged commit 8cd174f into main Apr 26, 2026
74 of 75 checks passed
@simon-mo simon-mo deleted the moe-lora-refactor branch April 26, 2026 01:55
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Apr 26, 2026
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Apr 26, 2026
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
jatseng-ai pushed a commit to jatseng-ai/vllm that referenced this pull request Apr 28, 2026
Lafunamor pushed a commit to Lafunamor/vllm that referenced this pull request May 1, 2026
Signed-off-by: Adrian <info@zzit.ch>

Labels

gpt-oss (Related to GPT-OSS models), intel-gpu (Related to Intel GPU), nvidia, ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)
