[Kernel] Multi-LoRA MoE CUDA BGMV kernel#40670
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. Agent GuidelinesIMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban. 🚀 |
There was a problem hiding this comment.
Code Review
This pull request introduces a new CUDA-based extension, _lora_C, providing optimized MoE BGMV kernels for LoRA. The implementation includes specialized CUDA kernels for shrink and expand operations, Python wrappers, and integration into the PunicaWrapperGPU with support for runtime backend selection. Key feedback highlights critical issues: skipping LoRA operations during CUDA graph capture leads to functional bugs, the LoRA scaling factor is incorrectly hardcoded to 1.0f, and the preallocated accumulation buffer size is insufficient for all supported model dimensions. Additionally, the backend selection logic for decode phases needs refinement to respect environment variable settings correctly.
| if torch.cuda.is_current_stream_capturing(): | ||
| return |
There was a problem hiding this comment.
Skipping the MoE LoRA operations during CUDA graph capture will result in the LoRA contribution being entirely missing from the captured graph. This means that during inference (replay), the model will behave as if no LoRA adapters are active, which is a critical functional bug. vLLM's CUDA graph capture is intended to record the operations so they can be replayed with different data (including different LoRA indices). If the kernels are not capture-safe (e.g., due to host-device synchronization or dynamic allocations), they should be refactored to be capture-safe rather than being skipped.
There was a problem hiding this comment.
This is intentional and matches vLLM's existing LoRA + CUDA graph design. vLLM applies LoRA contributions outside the captured graph, not inside it. The Triton MoE LoRA path has the same behavior - LoRA ops are never part of the captured graph. See cudagraph_specialize_lora in vLLM's config and the CudagraphDispatcher for how LoRA is handled during graph replay. Making these kernels capture-safe is a separate effort that would apply to both the Triton and CUDA path equally.
| #define CASE_MOE_SHRINK(in_T, out_T, W_T, narrow, wide) \ | ||
| case pack_u32(wide, narrow): \ | ||
| moe_bgmv_shrink_sliced<wide, narrow, in_T, out_T, W_T>( \ | ||
| Y, X, w_ptr, sorted_token_ids, expert_ids, lora_indices, \ | ||
| num_pairs, num_slices, num_experts, num_tokens, lora_stride, 1.0f); \ | ||
| return true; | ||
| FOR_MOE_ALL_WIDE_NARROW(CASE_MOE_SHRINK, T, T, T) | ||
| #undef CASE_MOE_SHRINK |
There was a problem hiding this comment.
The LoRA scaling factor (scale) is hardcoded to 1.0f in the kernel launcher. This will produce incorrect results for any LoRA adapter where the scaling factor (typically alpha / rank) is not 1.0. The scale parameter should be passed from the Python layer through the C++ dispatch functions to the kernel.
There was a problem hiding this comment.
The scale=1.0 is correct. vLLM pre-scales LoRA weights by alpha/rank during weight loading (see LoRALayerWeights), so the kernel doesn't need a runtime scale parameter. This matches the convention used by the existing Triton MoE LoRA kernels and all other LoRA kernels in vLLM.
| max_feat_out = 16384 | ||
| max_rank = self.lora_config.max_lora_rank | ||
| self._moe_shrink_out_bf16 = torch.zeros( | ||
| max_slices, max_pairs, max_rank, | ||
| dtype=torch.bfloat16, device=device) | ||
| self._moe_shrink_out_fp16 = torch.zeros( | ||
| max_slices, max_pairs, max_rank, | ||
| dtype=torch.float16, device=device) | ||
| self._moe_y_accum = torch.zeros( | ||
| max_num_batched_tokens * max_feat_out, | ||
| dtype=torch.float32, device=device) | ||
| self._moe_y_accum_max_feat = max_feat_out |
There was a problem hiding this comment.
The max_feat_out constant is hardcoded to 16384, but the supported dimensions in moe_bgmv_config.h go up to 28672. For models with larger intermediate dimensions (e.g., those using the legacy 28672 dimension), the _moe_y_accum buffer will be undersized, leading to a crash or out-of-bounds access during the view operation in add_lora_fused_moe_cuda (line 714) when processing large batches (prefill).
| max_feat_out = 16384 | |
| max_rank = self.lora_config.max_lora_rank | |
| self._moe_shrink_out_bf16 = torch.zeros( | |
| max_slices, max_pairs, max_rank, | |
| dtype=torch.bfloat16, device=device) | |
| self._moe_shrink_out_fp16 = torch.zeros( | |
| max_slices, max_pairs, max_rank, | |
| dtype=torch.float16, device=device) | |
| self._moe_y_accum = torch.zeros( | |
| max_num_batched_tokens * max_feat_out, | |
| dtype=torch.float32, device=device) | |
| self._moe_y_accum_max_feat = max_feat_out | |
| max_feat_out = 28672 | |
| max_rank = self.lora_config.max_lora_rank | |
| self._moe_shrink_out_bf16 = torch.zeros( | |
| max_slices, max_pairs, max_rank, | |
| dtype=torch.bfloat16, device=device) | |
| self._moe_shrink_out_fp16 = torch.zeros( | |
| max_slices, max_pairs, max_rank, | |
| dtype=torch.float16, device=device) | |
| self._moe_y_accum = torch.zeros( | |
| max_num_batched_tokens * max_feat_out, | |
| dtype=torch.float32, device=device) | |
| self._moe_y_accum_max_feat = max_feat_out |
There was a problem hiding this comment.
The 28672 dimensions in moe_bgmv_config.h are the feat_in for the shrink kernel (model hidden dim), while max_feat_out here covers the LoRA expand output dimension, which is <= 16384 for all currently supported models (GPT-OSS-120B, Nemotron-Super, Qwen3-30B, Nemotron-Nano). So this is safe for now, but I agree bumping the constant to 28672 would be good for future-proofing.
| def _use_cuda_for_moe_lora(self, num_tokens: int) -> bool: | ||
| """Decide whether to use CUDA BGMV or Triton for this call.""" | ||
| if _MOE_DECODE_THRESHOLD > 0: | ||
| if num_tokens <= _MOE_DECODE_THRESHOLD: | ||
| return _MOE_DECODE_USE_CUDA | ||
| else: | ||
| return _MOE_PREFILL_USE_CUDA | ||
| return _MOE_PREFILL_USE_CUDA |
There was a problem hiding this comment.
The logic in _use_cuda_for_moe_lora ignores the VLLM_MOE_LORA_DECODE_BACKEND setting when VLLM_MOE_LORA_DECODE_THRESHOLD is 0 (the default). This means that by default, the prefill backend setting is used for both prefill and decode, which contradicts the intention of having separate environment variables for each phase. A more sensible default would be to use a threshold (e.g., 1 or 32) or to check the phase explicitly.
There was a problem hiding this comment.
When the threshold is 0 (default), both VLLM_MOE_LORA_PREFILL_BACKEND and VLLM_MOE_LORA_DECODE_BACKEND default to cuda, so returning _MOE_PREFILL_USE_CUDA produces the correct result -- CUDA is used for both phases. The separate env vars are provided as opt-in overrides for debugging or A/B comparison. The threshold adds an optional batch-size cutoff on top of that. The default path is intentionally simple: use CUDA for everything.
jeejeelee
left a comment
There was a problem hiding this comment.
Sorry for the delayed response. I noticed that you submitted a similar PR for FI — I also think integrating this kernel into FI is better.
|
This pull request has merge conflicts that must be resolved before it can be |
| n_eid = min(expert_ids.view(-1).size(0), num_pairs) | ||
| self._moe_expert_ids_i64[:n_eid].copy_(expert_ids.view(-1)[:n_eid]) | ||
| expert_ids_i64 = self._moe_expert_ids_i64[:num_pairs] |
There was a problem hiding this comment.
🟡 Severity: MEDIUM
When expert_ids.numel() < num_pairs, only n_eid elements are copied into _moe_expert_ids_i64 (allocated via torch.empty, uninitialized). The slice [:num_pairs] includes the uninitialized tail, which is passed to the CUDA kernel as expert IDs. These garbage values index into w_ptr (raw GPU pointer array), causing out-of-bounds pointer dereference on GPU memory.
Helpful? Add 👍 / 👎
💡 Fix Suggestion
Suggestion: Zero-fill the uninitialized tail of the _moe_expert_ids_i64 buffer when n_eid < num_pairs to prevent garbage expert IDs from being passed to the CUDA kernel. Add self._moe_expert_ids_i64[n_eid:num_pairs].zero_() after the copy. The same fix should also be applied to the _moe_topk_weights_flat buffer at lines 673-676.
⚠️ Experimental Feature: This code suggestion is automatically generated. Please review carefully.
| n_eid = min(expert_ids.view(-1).size(0), num_pairs) | |
| self._moe_expert_ids_i64[:n_eid].copy_(expert_ids.view(-1)[:n_eid]) | |
| expert_ids_i64 = self._moe_expert_ids_i64[:num_pairs] | |
| n_eid = min(expert_ids.view(-1).size(0), num_pairs) | |
| self._moe_expert_ids_i64[:n_eid].copy_(expert_ids.view(-1)[:n_eid]) | |
| if n_eid < num_pairs: | |
| self._moe_expert_ids_i64[n_eid:num_pairs].zero_() | |
| expert_ids_i64 = self._moe_expert_ids_i64[:num_pairs] |
| int64_t expert_id = expert_ids[pair_idx]; | ||
| float topk_w = topk_weights[pair_idx]; | ||
| int64_t col_offset = slice_start_loc[slice_id]; | ||
| const W_T *W = w_ptr[slice_id * num_experts + expert_id] |
There was a problem hiding this comment.
🟡 Severity: MEDIUM
The expand kernel uses expert_id from expert_ids[pair_idx] to index into w_ptr without any bounds check (expert_id >= 0 && expert_id < num_experts). Existing MoE kernels (e.g., moe_align_sum_kernels.cu:126) validate this. An out-of-range expert_id causes OOB read of a raw device pointer from w_ptr, which is then dereferenced.
Helpful? Add 👍 / 👎
💡 Fix Suggestion
Suggestion: Add a bounds check for expert_id immediately after reading it from expert_ids[pair_idx] at line 281, before it is used to index into w_ptr. Add if (expert_id < 0 || expert_id >= num_experts) return; after line 281, consistent with the existing pattern in moe_align_sum_kernels.cu:126. Note that the shrink kernel (around line 66-70) has the same missing bounds check for eid when indexing w_ptr[slice_id * num_experts + eid] and should be fixed similarly.
⚠️ Experimental Feature: This code suggestion is automatically generated. Please review carefully.
| int64_t expert_id = expert_ids[pair_idx]; | |
| float topk_w = topk_weights[pair_idx]; | |
| int64_t col_offset = slice_start_loc[slice_id]; | |
| const W_T *W = w_ptr[slice_id * num_experts + expert_id] | |
| int64_t expert_id = expert_ids[pair_idx]; | |
| if (expert_id < 0 || expert_id >= num_experts) return; | |
| float topk_w = topk_weights[pair_idx]; | |
| int64_t col_offset = slice_start_loc[slice_id]; | |
| const W_T *W = w_ptr[slice_id * num_experts + expert_id] |
|
This pull request has merge conflicts that must be resolved before it can be |
Co-authored-by: Claude
Purpose
This PR adds custom CUDA BGMV kernels for MoE LoRA (shrink + expand), replacing the Triton SGMV kernels used in vllm. The CUDA kernels are optimized for decode-heavy MoE LoRA workloads with the following techniques:
The kernels support all LoRA rank values (8, 16, 32, 64) and cover model dimensions for GPT-OSS-120B and Nemotron-Nano-3-30B-A3B including TP-sharded variants.
Test Plan
Test Result
End-to-end OTPS p50 (higher is better):

Nemotron-Nano-3-30B-A3B
GPT-OSS-120B
