
Load json config #27226

Closed

yugong333 wants to merge 114 commits into vllm-project:main from dcmaddix:load_json_config

Conversation

Contributor

@yugong333 yugong333 commented Oct 20, 2025

Purpose

This PR complements PR #26319. Similar to PR #26319, it adds support for loading a customized kernel config for the fused_moe_lora kernel from a JSON file.

According to the benchmark results, together with PR #26319 this improves OTPS by 80%–90% at concurrency 1 and 2:
[benchmark results image]

The LoRA config folder is passed in via export VLLM_TUNED_CONFIG_FOLDER=/path/to/configs. Without it, the kernel falls back to the default configs.
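As a rough illustration of the mechanism described above, a tuned-config loader typically reads a JSON file from the configured folder and picks the entry whose batch-size key is closest to the actual workload. The helper names below (`load_tuned_moe_lora_config`, `pick_config`) are hypothetical, not vLLM's actual API; this is only a minimal sketch of the lookup pattern:

```python
import json
import os


def load_tuned_moe_lora_config(folder, key):
    """Load a tuned kernel config (e.g. BLOCK_SIZE_M per batch size) from a
    JSON file in `folder`, or return None so the caller falls back to defaults."""
    path = os.path.join(folder, f"{key}.json")
    if not os.path.isfile(path):
        return None
    with open(path) as f:
        # JSON object keys are strings; tuned configs are keyed by batch size M
        return {int(m): cfg for m, cfg in json.load(f).items()}


def pick_config(configs, M):
    """Pick the tuned entry whose batch-size key is closest to the actual M."""
    return configs[min(configs, key=lambda m: abs(m - M))]
```

In this sketch, an unset or empty folder simply yields `None`, matching the PR's described behavior of using default configs when `VLLM_TUNED_CONFIG_FOLDER` is not exported.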

wuchen and others added 30 commits July 23, 2025 03:59
Signed-off-by: wuchen <cntryroa@gmail.com>
Signed-off-by: banjuede <lmklhc@163.com>
jeejeelee and others added 14 commits October 14, 2025 09:51
Signed-off-by: bk-201 <joy25810@foxmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Danielle Robinson <dmmaddix@amazon.com>
Update to default_act_function and pass as callable
Signed-off-by: Chen Wu <cntryroa@gmail.com>
@mergify mergify bot added ci/build deepseek Related to DeepSeek models qwen Related to Qwen models labels Oct 20, 2025

mergify bot commented Oct 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yugong333.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for loading customized kernel configurations for fused_moe_lora from JSON files, aiming to improve performance. The changes are extensive, adding new CUDA/Triton kernels, corresponding tests, and adapting the LoRA layer injection mechanism to support FusedMoE layers. My review identified a critical bug in the LoRA injection logic for MoE layers, which would cause a runtime error. I've provided a specific comment and suggestion to fix this issue.

expert_ids_lora = expert_ids_lora.view(max_loras, -1)
sorted_token_ids_lora = sorted_token_ids_lora.view(max_loras, -1)
intermediate_cache2 = moe_state_dict["intermediate_cache2"]
intermediate_cache3 = args[0]

critical

There is an incorrect indexing of args here. The moe_sum_decorator wraps a method with the signature moe_sum(self, input, output). When this decorated method is called, args will be a tuple (self, input, output). Therefore, args[0] refers to the self object of the wrapped method, not the intermediate_cache3 tensor as intended. This should be args[1] to correctly access the input tensor. This bug will cause a TypeError at runtime when add_lora_fused_moe is called with an object instead of a tensor.

Suggested change
intermediate_cache3 = args[0]
intermediate_cache3 = args[1]
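The off-by-one described in the review comes from how Python binds methods: when a decorator wraps an instance method, the bound instance arrives as `args[0]` and the first real argument as `args[1]`. A minimal standalone repro (hypothetical names, not vLLM code):

```python
def capture_args(fn):
    """Decorator that records exactly what the wrapper receives."""
    def wrapper(*args, **kwargs):
        wrapper.seen = args  # for a method call, args[0] is the instance
        return fn(*args, **kwargs)
    return wrapper


class MoeLayer:
    @capture_args
    def moe_sum(self, input, output):
        return input + output


layer = MoeLayer()
layer.moe_sum(10, 20)
# MoeLayer.moe_sum.seen == (layer, 10, 20):
# args[0] is self, args[1] is `input` — hence the suggested args[1] fix.
```

This is why reading `args[0]` inside the wrapped `moe_sum(self, input, output)` yields the layer object rather than the input tensor.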


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +93 to +111
    # get the expert_id to process curr shard
    ind = lora_idx * stride_el + pid_m
    expert_id = tl.load(expert_ids_ptr + ind, ind < top_k * stride_el, 0.0)
    if expert_id == -1:
        return

    # get a_ptr, b_ptr, c_ptr
    cur_a_ptr = a_ptr + (slice_id % num_slice_a) * slice_a_size
    cur_b_ptr = tl.load(b_ptr + slice_id).to(tl.pointer_type(tl.bfloat16))
    cur_c_ptr = c_ptr + (slice_id % num_slice_c) * slice_c_size

    offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N).to(tl.int64)) % N
    offs_k = tl.arange(0, BLOCK_SIZE_K)

    offs_token_id = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M).to(tl.int64)
    token_ind = stride_tl * lora_idx + offs_token_id
    offs_token = tl.load(
        sorted_token_ids_ptr + token_ind, token_ind < top_k * stride_tl, 0.0
    )

P1: Stop masking LoRA indices using expert top_k

The Triton kernel uses top_k (number of routed experts per token) to bound reads of expert_ids and sorted_token_ids via ind < top_k * stride_el and token_ind < top_k * stride_tl. Because ind/token_ind are computed as lora_idx * stride_* + …, these masks effectively zero out every LoRA adapter whose index is ≥ top_k. In a typical configuration top_k is 1 or 2 while max_loras can be dozens, so only the first top_k adapters ever contribute—later adapters silently produce no output. The kernel should guard against out‑of‑bounds with the number of LoRAs (or drop the mask entirely), not the number of experts.
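The effect of the wrong mask bound can be shown without Triton. The sketch below (plain Python, hypothetical values) mimics `tl.load(ptr + ind, mask=ind < bound, other=0)`: with indices computed as `lora_idx * stride + offset`, a bound of `top_k * stride` only admits the first `top_k` adapters, while a bound of `max_loras * stride` covers them all:

```python
def masked_load(table, ind, bound, other=0):
    """Mimics tl.load(ptr + ind, mask=ind < bound, other=other) for a scalar index."""
    return table[ind] if ind < bound else other


stride_el, top_k, max_loras = 8, 2, 4
expert_ids = list(range(max_loras * stride_el))  # one row of ids per adapter

# Buggy bound (top_k * stride_el): adapters with lora_idx >= top_k are
# masked out and read the `other` value instead of their expert ids.
buggy = [masked_load(expert_ids, l * stride_el, top_k * stride_el)
         for l in range(max_loras)]

# Correct bound (max_loras * stride_el): every adapter's row is readable.
fixed = [masked_load(expert_ids, l * stride_el, max_loras * stride_el)
         for l in range(max_loras)]
```

With these toy values, `buggy` zeroes out adapters 2 and 3 while `fixed` reads all four rows, matching the review's point that only the first `top_k` adapters ever contribute under the current mask.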


@yugong333 yugong333 closed this Oct 20, 2025
@github-project-automation github-project-automation bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Oct 20, 2025

Labels

ci/build deepseek Related to DeepSeek models gpt-oss Related to GPT-OSS models needs-rebase qwen Related to Qwen models

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants