
Conversation

@varun-sundar-rabindranath (Contributor) commented Nov 3, 2025

Purpose

This PR better integrates LoRA with the FusedMoE modular kernel.
It also fixes an existing bug that occurs when using chunking with LoRA.

TODO: Add a design doc.

Test Plan

  • Added a test_fused_moe_lora_layer.py that checks the plumbing
  • Run tests/lora/test_deepseekv2_tp.py
  • Run tests/lora/test_gptoss.py

Test Result

All tests pass except tests/lora/test_gptoss.py.
tests/lora/test_gptoss.py also fails on main; I verified that this PR produces the same outputs as main.

mergify bot commented Nov 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Nov 3, 2025
@gemini-code-assist bot left a comment (Contributor)


Code Review

This pull request introduces a significant and well-designed refactoring to better integrate LoRA with FusedMoE kernels. The new design, which replaces a complex decorator-based approach with a wrapper class FusedMoEPermuteExpertsUnpermuteWithLoRA and a mixin MkFusedExpertsSupportsLoRA, greatly improves code clarity and maintainability. The PR also correctly fixes a bug related to using chunking with LoRA by introducing lora_token_mapping_offset. I've identified a critical runtime bug in the new LoRA injection logic and a performance issue that should be addressed.
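
To make the wrapper-plus-delegation design concrete, here is a minimal editorial sketch. It is not the PR's code: the class below only borrows the idea behind FusedMoEPermuteExpertsUnpermuteWithLoRA, and every name, signature, and shape in it is an assumption.

# Illustrative sketch only; the real vLLM classes and signatures differ.
import torch


class BaseFusedExperts:
    """Stand-in for the wrapped fused-experts (permute/experts/unpermute) object."""

    def apply(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Placeholder for the fused gate_up -> activation -> down computation.
        return hidden_states


class FusedExpertsWithLoRAWrapper:
    """Wraps a base experts object and adds a LoRA delta to its output.

    The wrapper owns no expert weights itself; it forwards to the wrapped
    kernel and applies the low-rank LoRA correction around it.
    """

    def __init__(self, base_experts: BaseFusedExperts,
                 lora_a: torch.Tensor, lora_b: torch.Tensor):
        self.base_experts = base_experts
        self.lora_a = lora_a  # (hidden, rank)
        self.lora_b = lora_b  # (rank, hidden)

    def apply(self, hidden_states: torch.Tensor) -> torch.Tensor:
        base_out = self.base_experts.apply(hidden_states)
        # LoRA delta computed from the unquantized activations.
        lora_delta = (hidden_states @ self.lora_a) @ self.lora_b
        return base_out + lora_delta


if __name__ == "__main__":
    h = 64
    experts = FusedExpertsWithLoRAWrapper(
        BaseFusedExperts(),
        lora_a=torch.randn(h, 8) * 0.01,
        lora_b=torch.randn(8, h) * 0.01,
    )
    out = experts.apply(torch.randn(4, h))
    print(out.shape)  # torch.Size([4, 64])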

self.w1_lora_a_stacked = w1_lora_a_stacked
self.w1_lora_b_stacked = w1_lora_b_stacked
self.w3_lora_a_stacked = w3_lora_a_stacked
self.w3_lora_b_stacked = w3_lora_b_stacked

critical

This assertion will fail at runtime. The activation_prologue is called as an instance method on the base_experts object, so args will contain self as the first positional argument. Therefore, len(args) will be 1, not 0.

Suggested change:
    assert len(args) == 1  # self
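
For readers unfamiliar with why *args contains self here, a small standalone Python example of the behavior the review describes (illustrative only, not the PR's hook wiring):

# Generic illustration: a function attached to a class and called through an
# instance receives that instance as the first positional argument.
class Experts:
    pass


def activation_prologue(*args, **kwargs):
    # args == (experts_instance,) when invoked as experts.activation_prologue()
    assert len(args) == 1  # self
    return args[0]


Experts.activation_prologue = activation_prologue

experts = Experts()
assert experts.activation_prologue() is experts  # len(args) == 1, not 0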

for x in [
    self.w1_lora_a_stacked,
    self.w1_lora_b_stacked,
    self.w3_lora_a_stacked,

critical

This assertion will also fail at runtime for the same reason as in activation_prologue. The activation_epilogue is called as an instance method, so len(args) will be 1, not 0.

Suggested change:
    assert len(args) == 1  # self

Comment on lines +109 to +183
topk_ids = self.experts_forward_state.topk_ids
topk_weights = self.experts_forward_state.topk_weights

high

As noted in the TODO, add_lora_fused_moe performs an accumulation rather than an overwrite. Zeroing out the buffer with fill_(0) before the kernel call introduces unnecessary overhead, especially for large tensors. This can impact performance. It would be more efficient if the kernel could write its output directly without needing this pre-fill step. A similar issue exists on line 151 for lora_down_output.
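
To illustrate the overhead being flagged, a minimal sketch of accumulate-vs-overwrite semantics (tensor names and sizes are made up, not taken from the PR):

import torch

M, N = 4096, 1024  # hypothetical buffer shape
out = torch.empty(M, N)
contribution = torch.randn(M, N)

# Accumulating kernel: the output must be zeroed first, which costs an extra
# full pass over the buffer (this is the fill_(0) the review flags).
out.fill_(0)
out += contribution

# Overwriting kernel: the first write initializes the buffer, no pre-fill needed.
out.copy_(contribution)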

@varun-sundar-rabindranath marked this pull request as ready for review November 3, 2025 05:21
@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 103 to 208
def gateup_proj_lora(self):
    self._ensure_weights()

    assert self.experts_forward_state is not None
    assert self.w1_lora_a_stacked is not None
    hidden_states = self.experts_forward_state.hidden_states
    topk_ids = self.experts_forward_state.topk_ids
    topk_weights = self.experts_forward_state.topk_weights

    num_topk = topk_ids.size(-1)
    max_lora_rank = self.w1_lora_a_stacked.size(-2)

    w13_lora_a_stacked = [self.w1_lora_a_stacked, self.w3_lora_a_stacked]
    w13_lora_b_stacked = [self.w1_lora_b_stacked, self.w3_lora_b_stacked]

    # TODO (varun): Fix add_lora_fused_moe to overwrite output
    self.experts_forward_state.lora_gateup_output.fill_(0)
    assert self.punica_wrapper is not None
    self.punica_wrapper.add_lora_fused_moe(
        self.experts_forward_state.lora_gateup_output,
        hidden_states,
        w13_lora_a_stacked,
        w13_lora_b_stacked,
        topk_weights,
        self.experts_forward_state.sorted_token_ids_lora,
        self.experts_forward_state.expert_ids_lora,
        self.experts_forward_state.num_tokens_post_padded_lora,
        max_lora_rank,
        num_topk,
        self.experts_forward_state.config,
    )


P1: Pass unquantized activations to LoRA kernels

The new FusedMoEPermuteExpertsUnpermuteWithLoRA calls gateup_proj_lora() with the hidden_states argument it receives from FusedMoEModularKernel (lines 103‑133). When MoE is quantized, that tensor is a1q, i.e. already quantized to fp8/int8. The LoRA Triton kernel invoked via punica_wrapper.add_lora_fused_moe asserts that its inputs are fp16/bf16 and equal to the LoRA weight dtype. With the current code, quantized MoE + LoRA will fail or produce incorrect results because the LoRA path now consumes quantized activations instead of the original fp16/bf16 activations that the previous implementation stored before quantization. This effectively breaks LoRA support for all quantized FusedMoE models.

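One possible direction, sketched with assumed names (fake_quantize, lora_delta, a1, a1q are all hypothetical, not the PR's code): keep a reference to the activations before quantization and feed that to the LoRA path, while the fused experts kernel consumes the quantized tensor.

import torch


def fake_quantize(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for fp8/int8 activation quantization (a1 -> a1q).
    scale = x.abs().max() / 127.0
    return (x / scale).round().clamp(-128, 127).to(torch.int8)


def lora_delta(x: torch.Tensor, lora_a: torch.Tensor, lora_b: torch.Tensor) -> torch.Tensor:
    # LoRA kernels expect fp16/bf16 activations matching the LoRA weight dtype.
    assert x.dtype in (torch.float16, torch.bfloat16)
    return (x @ lora_a) @ lora_b


hidden_states = torch.randn(8, 64, dtype=torch.bfloat16)
lora_a = torch.randn(64, 8, dtype=torch.bfloat16) * 0.01
lora_b = torch.randn(8, 64, dtype=torch.bfloat16) * 0.01

a1 = hidden_states       # keep a reference to the unquantized activations
a1q = fake_quantize(a1)  # what the quantized fused-MoE kernel consumes

delta = lora_delta(a1, lora_a, lora_b)  # OK: LoRA path sees bf16 activations
# lora_delta(a1q, lora_a, lora_b)       # would trip the dtype assert (int8)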

@jeejeelee (Collaborator)

I have removed the flaky test; see #27966 (comment).

Varun Sundar Rabindranath added 2 commits November 6, 2025 02:56
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
    activation, intermediate_cache2, intermediate_cache1.view(-1, N)
)
with self.maybe_activation_with_lora_hook(
    gateup_proj_output=intermediate_cache1,
Collaborator

I think we should use w13/w2 naming, which is consistent with the fused_moe names 😅
This also applies to all other places that use gateup and down.

mergify bot commented Nov 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Nov 7, 2025