Conversation
Signed-off-by: wuchen <cntryroa@gmail.com>
Signed-off-by: banjuede <lmklhc@163.com>
Signed-off-by: bk-201 <joy25810@foxmail.com>
Signed-off-by: Danielle Robinson <dmmaddix@amazon.com>
Update to default_act_function and pass as callable
Signed-off-by: Chen Wu <cntryroa@gmail.com>
Adding test for gptoss
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces support for loading customized kernel configurations for fused_moe_lora from JSON files, aiming to improve performance. The changes are extensive, adding new CUDA/Triton kernels, corresponding tests, and adapting the LoRA layer injection mechanism to support FusedMoE layers. My review identified a critical bug in the LoRA injection logic for MoE layers, which would cause a runtime error. I've provided a specific comment and suggestion to fix this issue.
```python
expert_ids_lora = expert_ids_lora.view(max_loras, -1)
sorted_token_ids_lora = sorted_token_ids_lora.view(max_loras, -1)
intermediate_cache2 = moe_state_dict["intermediate_cache2"]
intermediate_cache3 = args[0]
```
There is an incorrect indexing of args here. The moe_sum_decorator wraps a method with the signature moe_sum(self, input, output). When this decorated method is called, args will be a tuple (self, input, output). Therefore, args[0] refers to the self object of the wrapped method, not the intermediate_cache3 tensor as intended. This should be args[1] to correctly access the input tensor. This bug will cause a TypeError at runtime when add_lora_fused_moe is called with an object instead of a tensor.
Suggested change:

```diff
- intermediate_cache3 = args[0]
+ intermediate_cache3 = args[1]
```
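The self-binding pitfall described above can be reproduced with a minimal sketch (the decorator and class names here are illustrative stand-ins, not the actual vLLM code):

```python
captured = []

def moe_sum_decorator(fn):
    def wrapper(*args, **kwargs):
        # When the decorated function is a method, the call
        # layer.moe_sum(input, output) passes args == (self, input, output),
        # so the input tensor lives at args[1], not args[0].
        captured.append(args[1])
        return fn(*args, **kwargs)
    return wrapper

class Layer:
    @moe_sum_decorator
    def moe_sum(self, input, output):
        return input + output

result = Layer().moe_sum(2, 3)  # captured[0] is 2 (the input), not the Layer
```

Indexing `args[0]` in the wrapper would hand the `Layer` instance to downstream tensor code, which is exactly the runtime `TypeError` described above.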
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
# get the expert_id to process curr shard
ind = lora_idx * stride_el + pid_m
expert_id = tl.load(expert_ids_ptr + ind, ind < top_k * stride_el, 0.0)
if expert_id == -1:
    return

# get a_ptr, b_ptr, c_ptr
cur_a_ptr = a_ptr + (slice_id % num_slice_a) * slice_a_size
cur_b_ptr = tl.load(b_ptr + slice_id).to(tl.pointer_type(tl.bfloat16))
cur_c_ptr = c_ptr + (slice_id % num_slice_c) * slice_c_size

offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N).to(tl.int64)) % N
offs_k = tl.arange(0, BLOCK_SIZE_K)

offs_token_id = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M).to(tl.int64)
token_ind = stride_tl * lora_idx + offs_token_id
offs_token = tl.load(
    sorted_token_ids_ptr + token_ind, token_ind < top_k * stride_tl, 0.0
)
```
Stop masking LoRA indices using expert top_k
The Triton kernel uses top_k (number of routed experts per token) to bound reads of expert_ids and sorted_token_ids via ind < top_k * stride_el and token_ind < top_k * stride_tl. Because ind/token_ind are computed as lora_idx * stride_* + …, these masks effectively zero out every LoRA adapter whose index is ≥ top_k. In a typical configuration top_k is 1 or 2 while max_loras can be dozens, so only the first top_k adapters ever contribute—later adapters silently produce no output. The kernel should guard against out‑of‑bounds with the number of LoRAs (or drop the mask entirely), not the number of experts.
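A plain-Python sketch of the mask arithmetic makes the failure mode concrete (the stride and count values here are illustrative, not taken from the kernel):

```python
# Each LoRA adapter owns a contiguous slot of stride_el entries in
# expert_ids; the kernel masks its loads with ind < top_k * stride_el.
stride_el = 8    # illustrative: entries per LoRA slot
top_k = 2        # experts routed per token
max_loras = 4    # adapters actually loaded

contributing = []
for lora_idx in range(max_loras):
    pid_m = 0
    ind = lora_idx * stride_el + pid_m
    if ind < top_k * stride_el:  # the kernel's mask
        contributing.append(lora_idx)

# Only adapters 0 and 1 pass the mask; adapters 2 and 3 are silently
# zeroed out because the bound uses top_k instead of max_loras.
print(contributing)  # [0, 1]
```

Bounding the mask by the number of LoRA slots (e.g. `ind < max_loras * stride_el`) instead of `top_k` would let every loaded adapter contribute.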
Purpose
This PR complements PR #26319.
Similar to PR #26319, it adds support for loading a customized kernel config for the fused_moe_lora kernel from a JSON file.
According to the benchmark results, together with PR #26319 this improves OTPS by 80%–90% when the concurrency is 1 or 2:

The LoRA config folder is passed in via `export VLLM_TUNED_CONFIG_FOLDER=/path/to/configs`. Without it, the kernel uses the default configs.
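A minimal sketch of how such a JSON-config lookup could work (the function name, file name, and config schema below are hypothetical; the actual vLLM lookup logic may differ):

```python
import json
import os

def load_fused_moe_lora_config(filename, default):
    """Return the tuned kernel config from VLLM_TUNED_CONFIG_FOLDER,
    falling back to the built-in default when unset or missing."""
    folder = os.environ.get("VLLM_TUNED_CONFIG_FOLDER")
    if not folder:
        return default
    path = os.path.join(folder, filename)
    if not os.path.isfile(path):
        return default
    with open(path) as f:
        return json.load(f)

# Without the env var set, the default config is used.
os.environ.pop("VLLM_TUNED_CONFIG_FOLDER", None)
default_cfg = {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 64}
cfg = load_fused_moe_lora_config("fused_moe_lora.json", default_cfg)
```

Falling back to defaults keeps behavior unchanged for users who have not run the tuning step.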