[LoRA][FusedMoE] Introduce FusedMoEPermuteExpertsUnpermuteWithLoRA #27959
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a significant and well-designed refactoring to better integrate LoRA with FusedMoE kernels. The new design, which replaces a complex decorator-based approach with a wrapper class FusedMoEPermuteExpertsUnpermuteWithLoRA and a mixin MkFusedExpertsSupportsLoRA, greatly improves code clarity and maintainability. The PR also correctly fixes a bug related to using chunking with LoRA by introducing lora_token_mapping_offset. I've identified a critical runtime bug in the new LoRA injection logic and a performance issue that should be addressed.
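For readers unfamiliar with the modular-kernel layering, here is a minimal sketch of the wrapper-plus-mixin shape described above (only the two class names come from the PR; every method and attribute below is an illustrative assumption, not the PR's actual API):

```python
class MkFusedExpertsSupportsLoRA:
    """Marker mixin: a fused-experts implementation that exposes the hook
    points the LoRA wrapper needs (illustrative, not the PR's definition)."""


class FusedMoEPermuteExpertsUnpermuteWithLoRA:
    """Wraps a base fused-experts object and adds LoRA contributions around
    its projections (sketch only)."""

    def __init__(self, base_experts: MkFusedExpertsSupportsLoRA, punica_wrapper):
        self.base_experts = base_experts
        self.punica_wrapper = punica_wrapper
        self.experts_forward_state = None  # populated once per forward pass

    def apply(self, *args, **kwargs):
        # Delegate the heavy lifting to the wrapped experts; LoRA outputs are
        # accumulated via hooks the base experts invoke (e.g. around the
        # gate/up activation), rather than via decorators on free functions.
        return self.base_experts.apply(*args, **kwargs)
```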
self.w1_lora_a_stacked = w1_lora_a_stacked
self.w1_lora_b_stacked = w1_lora_b_stacked
self.w3_lora_a_stacked = w3_lora_a_stacked
self.w3_lora_b_stacked = w3_lora_b_stacked
This assertion will fail at runtime. The activation_prologue is called as an instance method on the base_experts object, so args will contain self as the first positional argument. Therefore, len(args) will be 1, not 0.
self.w3_lora_b_stacked = w3_lora_b_stacked
assert len(args) == 1  # self
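To illustrate why len(args) is 1 here, a self-contained example (hypothetical names, not the PR's code) of a hook wrapping a bound method:

```python
class BaseExperts:
    def activation_prologue(self):
        pass


def with_hook(fn):
    def wrapper(*args, **kwargs):
        # The instance is passed as the first positional argument, so even a
        # call with no explicit arguments yields len(args) == 1, never 0.
        assert len(args) == 1
        return fn(*args, **kwargs)

    return wrapper


BaseExperts.activation_prologue = with_hook(BaseExperts.activation_prologue)
BaseExperts().activation_prologue()  # assertion passes: args == (instance,)
```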
for x in [
    self.w1_lora_a_stacked,
    self.w1_lora_b_stacked,
    self.w3_lora_a_stacked,
topk_ids = self.experts_forward_state.topk_ids
topk_weights = self.experts_forward_state.topk_weights
As noted in the TODO, add_lora_fused_moe performs an accumulation rather than an overwrite, which forces the buffer to be zeroed with fill_(0) before the kernel call. That pre-fill introduces unnecessary overhead, especially for large tensors; it would be more efficient if the kernel could write its output directly without this extra pass. A similar issue exists on line 151 for lora_down_output.
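To make the accumulate-versus-overwrite trade-off concrete, a small PyTorch sketch with generic tensors (not the PR's actual buffers or kernel):

```python
import torch

contribution = torch.randn(4, 8)

# Accumulate-style kernel: the output buffer must be zeroed first, which costs
# an extra pass over a potentially large tensor.
out_accum = torch.empty(4, 8)
out_accum.fill_(0)
out_accum += contribution

# Overwrite-style kernel: writes the result directly, no pre-fill needed.
out_overwrite = torch.empty(4, 8)
out_overwrite.copy_(contribution)

assert torch.equal(out_accum, out_overwrite)
```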
💡 Codex Review
Here are some automated review suggestions for this pull request.
def gateup_proj_lora(self):
    self._ensure_weights()

    assert self.experts_forward_state is not None
    assert self.w1_lora_a_stacked is not None
    hidden_states = self.experts_forward_state.hidden_states
    topk_ids = self.experts_forward_state.topk_ids
    topk_weights = self.experts_forward_state.topk_weights

    num_topk = topk_ids.size(-1)
    max_lora_rank = self.w1_lora_a_stacked.size(-2)

    w13_lora_a_stacked = [self.w1_lora_a_stacked, self.w3_lora_a_stacked]
    w13_lora_b_stacked = [self.w1_lora_b_stacked, self.w3_lora_b_stacked]

    # TODO (varun): Fix add_lora_fused_moe to overwrite output
    self.experts_forward_state.lora_gateup_output.fill_(0)
    assert self.punica_wrapper is not None
    self.punica_wrapper.add_lora_fused_moe(
        self.experts_forward_state.lora_gateup_output,
        hidden_states,
        w13_lora_a_stacked,
        w13_lora_b_stacked,
        topk_weights,
        self.experts_forward_state.sorted_token_ids_lora,
        self.experts_forward_state.expert_ids_lora,
        self.experts_forward_state.num_tokens_post_padded_lora,
        max_lora_rank,
        num_topk,
        self.experts_forward_state.config,
    )
Pass unquantized activations to LoRA kernels
The new FusedMoEPermuteExpertsUnpermuteWithLoRA calls gateup_proj_lora() with the hidden_states argument it receives from FusedMoEModularKernel (lines 103‑133). When MoE is quantized, that tensor is a1q, i.e. already quantized to fp8/int8. The LoRA Triton kernel invoked via punica_wrapper.add_lora_fused_moe asserts that its inputs are fp16/bf16 and equal to the LoRA weight dtype. With the current code, quantized MoE + LoRA will fail or produce incorrect results because the LoRA path now consumes quantized activations instead of the original fp16/bf16 activations that the previous implementation stored before quantization. This effectively breaks LoRA support for all quantized FusedMoE models.
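A minimal reproduction of the dtype mismatch being described, assuming an fp8-quantized MoE path (tensor names and the check itself are illustrative stand-ins, not the actual Triton kernel):

```python
import torch

lora_dtype = torch.bfloat16
a1 = torch.randn(16, 128, dtype=lora_dtype)  # activations before quantization
a1q = a1.to(torch.float8_e4m3fn)             # what a quantized MoE passes onward

def lora_kernel_input_check(x: torch.Tensor) -> None:
    # Stand-in for the kernel's assertion: inputs must be fp16/bf16 and match
    # the LoRA weight dtype.
    assert x.dtype in (torch.float16, torch.bfloat16) and x.dtype == lora_dtype

lora_kernel_input_check(a1)     # OK: original fp16/bf16 activations
# lora_kernel_input_check(a1q)  # AssertionError: quantized activations reach LoRA
```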
I have removed the flaky test, see #27966 (comment).
    activation, intermediate_cache2, intermediate_cache1.view(-1, N)
)
with self.maybe_activation_with_lora_hook(
    gateup_proj_output=intermediate_cache1,
I think we should use w13/w2, which is consistent with the fused_moe naming 😅
This comment also applies to all other places that use the gateup/down naming.
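For context on the hook used in the snippet above, a minimal context-manager sketch of what maybe_activation_with_lora_hook might look like (assumed behavior and signature, not the PR's implementation):

```python
import contextlib

@contextlib.contextmanager
def maybe_activation_with_lora_hook(gateup_proj_output, lora_gateup_fn=None):
    # If a LoRA adapter is active, let it add its gate/up contribution into the
    # projection output before the activation runs; otherwise this is a no-op.
    if lora_gateup_fn is not None:
        lora_gateup_fn(gateup_proj_output)
    yield
```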
Purpose
This PR better integrates LoRA with the FusedMoE modular kernel.
It also fixes an existing bug that occurs when using chunking with LoRA.
TODO: Add design doc.
Test Plan
- tests/lora/test_deepseekv2_tp.py
- tests/lora/test_gptoss.py
Test Result
- Other than tests/lora/test_gptoss.py, all tests pass.
- tests/lora/test_gptoss.py also fails on main; I verified that this PR produces the same outputs as main.