Conversation
Signed-off-by: wuchen <cntryroa@gmail.com>
Signed-off-by: banjuede <lmklhc@163.com>
Signed-off-by: bk-201 <joy25810@foxmail.com>
Signed-off-by: Danielle Robinson <dmmaddix@amazon.com>
Update to default_act_function and pass as callable
Signed-off-by: Chen Wu <cntryroa@gmail.com>
Adding test for gptoss
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces support for loading customized kernel configurations for fused_moe_lora from JSON files, aiming to improve performance. The changes are extensive, adding new CUDA/Triton kernels, corresponding tests, and adapting the LoRA layer injection mechanism to support FusedMoE layers. My review identified a critical bug in the LoRA injection logic for MoE layers, which would cause a runtime error. I've provided a specific comment and suggestion to fix this issue.
```python
expert_ids_lora = expert_ids_lora.view(max_loras, -1)
sorted_token_ids_lora = sorted_token_ids_lora.view(max_loras, -1)
intermediate_cache2 = moe_state_dict["intermediate_cache2"]
intermediate_cache3 = args[0]
```
There is an incorrect indexing of args here. The moe_sum_decorator wraps a method with the signature moe_sum(self, input, output). When this decorated method is called, args will be a tuple (self, input, output). Therefore, args[0] refers to the self object of the wrapped method, not the intermediate_cache3 tensor as intended. This should be args[1] to correctly access the input tensor. This bug will cause a TypeError at runtime when add_lora_fused_moe is called with an object instead of a tensor.
Suggested change:

```diff
- intermediate_cache3 = args[0]
+ intermediate_cache3 = args[1]
```
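The self-binding pitfall described above can be reproduced with a minimal sketch (the decorator and class names here are illustrative stand-ins, not the actual vLLM code):

```python
captured = []

def moe_sum_decorator(fn):
    def wrapper(*args, **kwargs):
        # When the decorated function is a method, the call
        # layer.moe_sum(input, output) passes args == (self, input, output),
        # so the input tensor lives at args[1], not args[0].
        captured.append(args[1])
        return fn(*args, **kwargs)
    return wrapper

class Layer:
    @moe_sum_decorator
    def moe_sum(self, input, output):
        return input + output

result = Layer().moe_sum(2, 3)  # captured[0] is 2 (the input), not the Layer
```

Indexing `args[0]` in the wrapper would hand the `Layer` instance to downstream tensor code, which is exactly the runtime `TypeError` described above.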
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
# get the expert_id to process curr shard
ind = lora_idx * stride_el + pid_m
expert_id = tl.load(expert_ids_ptr + ind, ind < top_k * stride_el, 0.0)
if expert_id == -1:
    return

# get a_ptr, b_ptr, c_ptr
cur_a_ptr = a_ptr + (slice_id % num_slice_a) * slice_a_size
cur_b_ptr = tl.load(b_ptr + slice_id).to(tl.pointer_type(tl.bfloat16))
cur_c_ptr = c_ptr + (slice_id % num_slice_c) * slice_c_size

offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N).to(tl.int64)) % N
offs_k = tl.arange(0, BLOCK_SIZE_K)

offs_token_id = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M).to(tl.int64)
token_ind = stride_tl * lora_idx + offs_token_id
offs_token = tl.load(
    sorted_token_ids_ptr + token_ind, token_ind < top_k * stride_tl, 0.0
)
```
Stop masking LoRA indices using expert top_k
The Triton kernel uses top_k (number of routed experts per token) to bound reads of expert_ids and sorted_token_ids via ind < top_k * stride_el and token_ind < top_k * stride_tl. Because ind/token_ind are computed as lora_idx * stride_* + …, these masks effectively zero out every LoRA adapter whose index is ≥ top_k. In a typical configuration top_k is 1 or 2 while max_loras can be dozens, so only the first top_k adapters ever contribute—later adapters silently produce no output. The kernel should guard against out‑of‑bounds with the number of LoRAs (or drop the mask entirely), not the number of experts.
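A plain-Python sketch of the mask arithmetic makes the failure mode concrete (the stride and count values here are illustrative, not taken from the kernel):

```python
# Each LoRA adapter owns a contiguous slot of stride_el entries in
# expert_ids; the kernel masks its loads with ind < top_k * stride_el.
stride_el = 8    # illustrative: entries per LoRA slot
top_k = 2        # experts routed per token
max_loras = 4    # adapters actually loaded

contributing = []
for lora_idx in range(max_loras):
    pid_m = 0
    ind = lora_idx * stride_el + pid_m
    if ind < top_k * stride_el:  # the kernel's mask
        contributing.append(lora_idx)

# Only adapters 0 and 1 pass the mask; adapters 2 and 3 are silently
# zeroed out because the bound uses top_k instead of max_loras.
print(contributing)  # [0, 1]
```

Bounding the mask by the number of LoRA slots (e.g. `ind < max_loras * stride_el`) instead of `top_k` would let every loaded adapter contribute.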
Purpose
This PR complements PR #26319.
Similar to PR #26319, it adds support for loading a customized kernel config for the fused_moe_lora kernel from a JSON file.
According to the benchmark results, together with PR #26319 this improves OTPS by 80%–90% when the concurrency is 1 or 2:

The LoRA config folder is passed in via `export VLLM_TUNED_CONFIG_FOLDER=/path/to/configs`. Without it, the kernel uses the default configs.
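A minimal sketch of how such a JSON-config lookup could work (the function name, file name, and config schema below are hypothetical; the actual vLLM lookup logic may differ):

```python
import json
import os

def load_fused_moe_lora_config(filename, default):
    """Return the tuned kernel config from VLLM_TUNED_CONFIG_FOLDER,
    falling back to the built-in default when unset or missing."""
    folder = os.environ.get("VLLM_TUNED_CONFIG_FOLDER")
    if not folder:
        return default
    path = os.path.join(folder, filename)
    if not os.path.isfile(path):
        return default
    with open(path) as f:
        return json.load(f)

# Without the env var set, the default config is used.
os.environ.pop("VLLM_TUNED_CONFIG_FOLDER", None)
default_cfg = {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 64}
cfg = load_fused_moe_lora_config("fused_moe_lora.json", default_cfg)
```

Falling back to defaults keeps behavior unchanged for users who have not run the tuning step.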