Conversation
Code Review
This pull request refactors the LoRA implementation for Fused MoE layers by replacing the legacy decorator-based monkey-patching with an explicit `MoELoRAContext` propagated through the modular kernel path. It updates various quantization methods and MoE backends to handle this context. Feedback highlights two potential `AttributeError` issues: one regarding access to `fused_experts` on `FusedMoEKernel`, and another concerning the `block_shape` property missing from the `FusedMoEExperts` base class and subclasses such as `MarlinExperts`, which would cause runtime crashes when LoRA is enabled.
```diff
- fused_experts.moe_sum = moe_sum_decorator(
-     self.base_layer, fused_experts.moe_sum
+ assert (
+     isinstance(moe_kernel.fused_experts, FusedMoEExpertsModular)
```
The check `isinstance(moe_kernel.fused_experts, FusedMoEExpertsModular)` might fail because `FusedMoEKernel` typically wraps the implementation in an `impl` attribute. You should likely check `moe_kernel.impl.fused_experts` instead, or ensure that `FusedMoEKernel` exposes `fused_experts` as a property.
```diff
- isinstance(moe_kernel.fused_experts, FusedMoEExpertsModular)
+ isinstance(moe_kernel.impl.fused_experts, FusedMoEExpertsModular)
```
You could use the `is_monolithic` property instead of `isinstance`. You might also need to add a `supports_lora` method to `FusedMoEKernel`.
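A minimal sketch of that delegation, assuming the kernel keeps its experts implementation on `self.impl` as the thread suggests (illustrative only, not the PR's code):

```python
# Illustrative delegation: let FusedMoEKernel answer supports_lora() itself
# instead of callers isinstance-checking through the wrapper.
class FusedMoEKernel:
    def __init__(self, prepare_finalize, impl):
        self.prepare_finalize = prepare_finalize
        self.impl = impl  # wraps the experts implementation

    def supports_lora(self) -> bool:
        return self.impl.fused_experts.supports_lora()
```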
```python
def apply_w13_lora(
    self,
    lora_context: "MoELoRAContext",
    *,
    y: torch.Tensor,
    x: torch.Tensor,
    topk_ids: torch.Tensor,
    topk_weights: torch.Tensor,
    expert_map: torch.Tensor | None,
    w1: torch.Tensor,
    w2: torch.Tensor,
    num_tokens: int,
    top_k_num: int,
) -> tuple[
    torch.Tensor | None,
    torch.Tensor | None,
    torch.Tensor | None,
    torch.Tensor | None,
]:
    return lora_context.punica_wrapper.add_lora_w13(
        y,
        x,
        lora_context.w13_lora_a_stacked,
        lora_context.w13_lora_b_stacked,
        topk_ids,
        topk_weights,
        expert_map,
        w1,
        w2,
        num_tokens,
        top_k_num,
        lora_context.max_loras,
        lora_context.adapter_enabled,
        lora_context.local_num_experts,
        lora_context.top_k,
        lora_context.w13_num_slices,
        lora_context.fully_sharded,
        lora_context.use_tuned_config,
        block_shape=self.block_shape,
    )
```
The `apply_w13_lora` method (and `apply_w2_lora` below) accesses `self.block_shape`, but `block_shape` is not defined as an abstract property in the `FusedMoEExperts` base class. This will cause an `AttributeError` at runtime for any expert implementation that does not explicitly define it (e.g., `MarlinExperts` or `UnfusedOAITritonExperts`). Please add `block_shape` as an abstract property to `FusedMoEExperts` and implement it in all subclasses that support LoRA.
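One possible shape of that fix, sketched below; the `list[int] | None` return type is an assumption modeled on the `torch.Tensor | None` style in the excerpt above:

```python
# Sketch of the suggested fix: make block_shape part of the base contract so
# every expert impl must define it (the return type is an assumption).
from abc import ABC, abstractmethod


class FusedMoEExperts(ABC):
    @property
    @abstractmethod
    def block_shape(self) -> list[int] | None:
        """Block-quantization shape, or None when not block-quantized."""
        raise NotImplementedError


class MarlinExperts(FusedMoEExperts):
    @property
    def block_shape(self) -> list[int] | None:
        return None  # assumed: the Marlin path is not block-quantized
```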
```python
def supports_lora() -> bool:
    return True
```
```python
moe_kernel = FusedMoEKernel(
    prepare_finalize,
    self.base_layer.quant_method.select_gemm_impl(
        prepare_finalize, self.base_layer
    ),
)
```
Do we know if this case is ever hit now? Most methods have been switched over to the new MK initialization pattern (`_setup_kernel`).
```python
    index, :, : sliced_w2_lora_b.shape[1], : sliced_w2_lora_b.shape[2]
].copy_(sliced_w2_lora_b, non_blocking=True)
```

```python
def set_mapping(self, punica_wrapper):
```
Does this happen at runtime or is this part of the LoRA setup?
This is LoRA setup, not runtime. `MoELoRAContext` captures references to it, so the experts kernel sees fresh values without rebinding.
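For illustration, reference capture means in-place updates made during LoRA setup are visible through the context without rebinding (toy shapes, not the real buffers):

```python
# Toy demonstration of the reference capture described above.
import torch

stacked = [torch.zeros(2, 4, 8)]        # stacked LoRA buffer owned by the layer
context_view = tuple(stacked)           # context stores the same tensor objects
stacked[0].copy_(torch.randn(2, 4, 8))  # setup writes in place...
assert context_view[0].abs().sum() > 0  # ...and the context sees fresh values
```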
It's probably too much for this PR, but we could consider having separate subclasses for experts that support LoRA (so that the LoRA code could be completely isolated) and the setup in

Good point, I'll look into it further. Thanks!
bnellnm left a comment
LGTM. Thanks for the good work! Btw, do you know if the `select_gemm_impl` codepath ever gets triggered? It should largely be defunct now and will be removed once everything is migrated over to `_setup_kernel`.
First of all, thank you for the thorough review. Yes, it looks like it still gets triggered, specifically by the unmigrated quant methods (AWQ-Marlin, compressed_tensors_moe).
FYI, my guess is that #40794 is causing the LoRA failure. There was a similar issue when the truncate came before the reduce in a prior PR; it was fixed by moving the truncation afterwards. I'm not sure what the best solution is here. Calling
Motivation
Currently, MoE LoRA is wired in by monkey-patching methods on the modular kernel at construction time. `FusedMoEWithLoRA._inject_lora_into_fused_moe` wraps `FusedMoEKernel.apply`, `TritonExperts.activation`, and `TritonExperts.moe_sum` with `fwd_decorator` / `act_decorator` / `moe_sum_decorator`, and smuggles tensors between them through `moe_state_dict`. This has several concrete problems:

1. Hacky and hard to maintain

The LoRA contribution is hidden inside decorators on base-MoE methods. Someone reading the MoE experts code sees `self.activation(...)` and `self.moe_sum(...)`; they have no syntactic hint that these calls may actually run LoRA shrink/expand GEMMs, or that `moe_state_dict` carries cross-call state between them. Debugging requires holding the patching order in your head.

2. MoE changes don't see LoRA

Because the LoRA path lives outside the expert `apply()` functions, any refactor of `TritonExperts` / `UnfusedOAITritonExperts` / `MarlinExperts` has to re-discover the LoRA contract (what `activation` receives, what `moe_sum` receives, what shape `intermediate_cache*` has). Every MoE-side change risks breaking LoRA silently.

3. Hard to extend

Adding support for new features (EP, additional quantized backends, new expert impls) means replicating or working around the decorator chain, and the state-dict plumbing assumes one specific control flow. There is no extension point for an expert that wants to apply LoRA at a different point.
Change
Treat LoRA as a first-class concern in the modular MoE kernel rather than an external patch.

1. `MoELoRAContext`: a single explicit payload

A new `MoELoRAContext` dataclass (`vllm/lora/lora_context.py`) packages all of the LoRA state a MoE forward pass needs:

- `w13_lora_a_stacked` / `w13_lora_b_stacked` / `w2_lora_a_stacked` / `w2_lora_b_stacked`
- `adapter_enabled`, `max_loras`
- `top_k`, `w13_num_slices`, `fully_sharded`, `tp_rank`, `tp_size`, `local_num_experts`
- `punica_wrapper`
- `use_tuned_config` (whether `VLLM_TUNED_CONFIG_FOLDER` is set)

`FusedMoEWithLoRA.set_mapping` builds this context once and stashes it on the base layer as `FusedMoE._lora_context`. `MoERunnerBase` forwards it into `FusedMoEMethodBase.apply(..., lora_context=...)`, and `FusedMoEModularMethod.apply` / `FusedMoEKernel.apply` / `FusedMoEKernelModularImpl.apply` thread it through to `FusedMoEExpertsModular.apply(..., lora_context=...)`. The context is the only LoRA surface area seen by the MoE code path; there's no more hidden state passed between method wrappers.
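As a rough illustration of the payload, the context might look like the sketch below; the field types are assumptions inferred from this description, not the PR's literal definition:

```python
# Sketch of the MoELoRAContext payload; types are assumed, not verbatim.
from dataclasses import dataclass

import torch


@dataclass
class MoELoRAContext:
    w13_lora_a_stacked: tuple[torch.Tensor, ...]
    w13_lora_b_stacked: tuple[torch.Tensor, ...]
    w2_lora_a_stacked: tuple[torch.Tensor, ...]
    w2_lora_b_stacked: tuple[torch.Tensor, ...]
    adapter_enabled: torch.Tensor  # per-slot "LoRA active" flags (assumed)
    max_loras: int
    top_k: int
    w13_num_slices: int
    fully_sharded: bool
    tp_rank: int
    tp_size: int
    local_num_experts: int
    punica_wrapper: "PunicaWrapperBase"  # forward ref into vllm.lora
    use_tuned_config: bool  # True when VLLM_TUNED_CONFIG_FOLDER is set
```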
2. LoRA compute inlined into the expert `apply()`

Expert implementations that support LoRA now call it directly inside their own `apply()` function, at the same logical point the decorators used to target (a simplified sketch of the control flow follows this list):

- `TritonExperts.apply` (`fused_moe.py`): after the w13 GEMM and before `activation`, call `self.apply_w13_lora(...)` to add the LoRA delta to `intermediate_cache1`. After the w2 GEMM and before `moe_sum`, call `self.apply_w2_lora(...)` on `intermediate_cache3`, reusing the `sorted_token_ids_lora` tensors from the first call.
- `UnfusedOAITritonExperts.apply` (`gpt_oss_triton_kernels_moe.py`): same pattern, adjusted for the gather/scatter layout that its two `matmul_ogs` calls produce.
- `MarlinExperts.apply` (`fused_marlin_moe.py`): `fused_marlin_moe` consumes `activation_func` and `moe_sum` as callables, so the LoRA path wraps those two callables to inject `apply_w13_lora` / `apply_w2_lora` at the correct buffer state.
`FusedMoEExperts.supports_lora()` defaults to `False`. Each expert impl that has a validated LoRA path overrides it to `True` (`TritonExperts`, `UnfusedOAITritonExperts`, `MarlinExperts`). `FusedMoEWithLoRA.__init__` asserts that the selected expert impl reports `supports_lora()`, and `oracle/unquantized.py::select_unquantized_moe_backend` now filters the backend auto-selection by that flag, so unsupported backends (FlashInfer / AITER) are transparently skipped when LoRA is enabled instead of silently producing wrong output or crashing later.
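A hypothetical sketch of the kind of filtering this implies; the function and attribute names here are assumptions, not the PR's code:

```python
# Hypothetical backend filter: skip candidates whose experts impl does not
# report LoRA support. `candidates` and `experts_cls` are illustrative names.
def select_moe_backend(candidates, lora_enabled: bool):
    for backend in candidates:
        if lora_enabled and not backend.experts_cls.supports_lora():
            continue  # e.g. FlashInfer / AITER are skipped under LoRA
        return backend
    raise RuntimeError("no MoE backend supports the requested configuration")
```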
Because the LoRA shrink/expand is now visible in the expert source, anyone modifying `TritonExperts.apply` can see the LoRA call site and keep it correct; tests on the MoE path automatically cover the LoRA path as well.
3. LoRA computation stays in `PunicaWrapper`

MoE LoRA still respects the `PunicaWrapper` logic; the actual shrink/expand compute is not moved. Two new methods on `PunicaWrapperBase`, `add_lora_w13` and `add_lora_w2`, encapsulate the config lookup (tuned vs. heuristic), `moe_lora_align_block_size`, and the `add_lora_fused_moe` call. `PunicaWrapperGPU` provides the concrete implementation. `FusedMoEExpertsModular` has thin helpers `apply_w13_lora` / `apply_w2_lora` that just forward the context fields to these methods.
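For intuition, the shrink/expand that `add_lora_w13` ultimately performs is, in a toy dense form, the standard LoRA delta gathered per routed expert; the real kernels operate on sorted token blocks and stacked per-adapter weights, which this deliberately ignores:

```python
# Toy dense reference for the MoE LoRA shrink/expand; shapes are illustrative.
import torch


def moe_lora_delta(x, topk_ids, lora_a, lora_b):
    """x: [T, H]; topk_ids: [T, K]; lora_a: [E, r, H]; lora_b: [E, N, r].
    Returns the per-(token, expert-slot) LoRA delta of shape [T, K, N]."""
    a = lora_a[topk_ids]                             # [T, K, r, H]
    b = lora_b[topk_ids]                             # [T, K, N, r]
    shrink = torch.einsum("th,tkrh->tkr", x, a)      # LoRA "shrink" GEMM
    return torch.einsum("tkr,tknr->tkn", shrink, b)  # LoRA "expand" GEMM
```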
Test Plan

All LoRA and MoE tests on CI should pass.
Test Result