[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 by xyang16 · Pull Request #29708 · vllm-project/vllm

xyang16 · 2025-11-29T00:49:25Z

Purpose

This PR is to support FusedMoE LoRA Triton kernel for mxfp4 model.

Add UnfusedOAITritonExperts
Inject lora module in activation: Since matmul_ogs can fuse activation, set fused_activation to None to unfuse activation in the first matmul_ogs.
Inject lora module in moe_sum: This need to unfuse sum in second matmul_ogs. Grouped reduction does scatter + accumulate, it is essentially equal to: y[dst_indx // n_expts_act, :] += x[src_indx, :], so that scatter sum across multiple experts, and collapse M * topk to M rows. Therefore, we need to set routing_data.n_expts_act to 1, so it doesn't sum across multiple experts, in order unfuse moe_sum in the second matmul_ogs.
Added test_modular_oai_triton_moe.py

Test Plan

pytest -s -v tests/kernels/moe/test_modular_oai_triton_moe.py

Test Result

Tests passed.

Benchmark

Baseline (marlin):

VLLM_MXFP4_USE_MARLIN=1 vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 1 \
  --max-num-seqs 16 \
  --compilation-config '{"max_cudagraph_capture_size": 128, "compile_sizes": [1, 2, 4, 8, 16]}' \
  --enable-lora \
  --max-loras 6 \
  --lora-modules lora1=/opt/dlami/nvme/models/gpt-oss-20b-lora-gsm8k \
  --max-lora-rank 64

vllm bench serve \
  --model openai/gpt-oss-20b \
  --lora-modules lora1 \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --max-concurrency 16 \
  --num-prompts 1000 \
  --num-warmups 60 \
  --ignore-eos

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  142.34    
Total input tokens:                      215312    
Total generated tokens:                  199033    
Request throughput (req/s):              7.03      
Output token throughput (tok/s):         1398.29   
Peak output token throughput (tok/s):    1677.00   
Peak concurrent requests:                29.00     
Total Token throughput (tok/s):          2910.95   
---------------Time to First Token----------------
Mean TTFT (ms):                          43.24     
Median TTFT (ms):                        26.77     
P99 TTFT (ms):                           140.09    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.19     
Median TPOT (ms):                        10.93     
P99 TPOT (ms):                           17.44     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.12     
Median ITL (ms):                         9.73      
P99 ITL (ms):                            56.55     
==================================================

PR (triton):

Install triton_kernels

pip install "git+https://github.com/triton-lang/triton.git@0a2e3a391cbb9e13d29bf12a2a0005e358102d74#subdirectory=python/triton_kernels"

vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 1 \
  --max-num-seqs 16 \
  --compilation-config '{"max_cudagraph_capture_size": 128, "compile_sizes": [1, 2, 4, 8, 16]}' \
  --enable-lora \
  --max-loras 6 \
  --lora-modules lora1=/opt/dlami/nvme/models/gpt-oss-20b-lora-gsm8k \
  --max-lora-rank 64

vllm bench serve \
  --model openai/gpt-oss-20b \
  --lora-modules lora1 \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --max-concurrency 16 \
  --num-prompts 1000 \
  --num-warmups 60 \
  --ignore-eos

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  125.59    
Total input tokens:                      215312    
Total generated tokens:                  199033    
Request throughput (req/s):              7.96      
Output token throughput (tok/s):         1584.77   
Peak output token throughput (tok/s):    2171.00   
Peak concurrent requests:                29.00     
Total Token throughput (tok/s):          3299.15   
---------------Time to First Token----------------
Mean TTFT (ms):                          52.14     
Median TTFT (ms):                        19.03     
P99 TTFT (ms):                           129.60    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.82      
Median TPOT (ms):                        9.28      
P99 TPOT (ms):                           21.31     
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.72      
Median ITL (ms):                         7.32      
P99 ITL (ms):                            90.10     
==================================================

Accuracy Testing

20b deepep triton

VLLM_ALL2ALL_BACKEND="deepep_high_throughput" vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 1 \
  --data-parallel-size 2 \
  --enable-expert-parallel \
  --no-enable-prefix-caching

OPENAI_API_KEY=EMPTY python3 -m gpt_oss.evals --model openai/gpt-oss-20b --eval gpqa --n-threads 200 --reasoning-effort low

Writing report to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20251125_122239.html
{'chars': np.float64(55.219065656565654), 'chars:std': np.float64(220.291583893894), 'score': np.float64(0.5707070707070707), 'score:std': np.float64(0.4949752621616814)}
Writing results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20251125_122239.json
Writing all results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20251125_122239_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-20b-low_temp1.0_20251125_122239', 'metric': 0.5707070707070707}]

20b lora triton

vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 1 \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
  --max-num-seqs 16 \
  --enable-lora \
  --max-loras 6 \
  --lora-modules lora1=/opt/dlami/nvme/models/gpt-oss-20b-lora/checkpoint-13 \
  --max-lora-rank 64

OPENAI_API_KEY=EMPTY python3 -m gpt_oss.evals --model lora1 --eval gpqa --n-threads 200 --reasoning-effort low

Writing report to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20251126_124509.html
{'chars': np.float64(62.859848484848484), 'chars:std': np.float64(230.21663448129942), 'score': np.float64(0.577020202020202), 'score:std': np.float64(0.4940322747359399)}
Writing results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20251126_124509.json
Writing all results to /tmp/gpqa_openai__gpt-oss-20b-low_temp1.0_20251126_124509_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-20b-low_temp1.0_20251126_124509', 'metric': 0.577020202020202}]

20b lora marlin

VLLM_MXFP4_USE_MARLIN=1 vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 1 \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
  --max-num-seqs 16 \
  --enable-lora \
  --max-loras 6 \
  --lora-modules lora1=/opt/dlami/nvme/models/gpt-oss-20b-lora/checkpoint-13 \
  --max-lora-rank 64

OPENAI_API_KEY=EMPTY python3 -m gpt_oss.evals --model lora1 --eval gpqa --n-threads 200 --reasoning-effort low

Writing report to /tmp/gpqa_lora1-low_temp1.0_20251126_125725.html
{'chars': np.float64(25.56691919191919), 'chars:std': np.float64(165.94325370775), 'score': np.float64(0.5669191919191919), 'score:std': np.float64(0.4955015860245882)}
Writing results to /tmp/gpqa_lora1-low_temp1.0_20251126_125725.json
Writing all results to /tmp/gpqa_lora1-low_temp1.0_20251126_125725_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'lora1-low_temp1.0_20251126_125725', 'metric': 0.5669191919191919}]

Note:

#28971 got reverted by #29697 because of breaking tests. This PR redo #28971.

https://buildkite.com/vllm/ci/builds/41020/steps/canvas?jid=019ac943-dd5c-4673-97e4-736893ab6035
- Fix test_modular_oai_triton_moe.py by removing big MNK size test to avoid OOM.
https://buildkite.com/vllm/ci/builds/41020/steps/canvas?sid=019ac943-dd87-4ee9-b343-39cd7b152457
- test_olmoe_tp.py test failure should be unrelated, because [LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 #28971 only apply to mxfp4, and I can see it has been failing before.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@jeejeelee @DarkLight1337 Please take a look. Thanks a lot for reviewing!

Signed-off-by: Xin Yang <xyangx@amazon.com>

chatgpt-codex-connector · 2025-11-29T00:49:33Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

gemini-code-assist

Code Review

This pull request re-introduces support for FusedMoE LoRA with Triton kernels for mxfp4 quantization, which was previously reverted. The changes are well-structured and mainly involve:

Adding an UnfusedOAITritonExperts class to allow for LoRA injection by separating GEMM, activation, and reduction steps.
Updating the mxfp4 backend selection logic to enable the Triton backend for LoRA when available.
Adding a comprehensive test suite to validate the new unfused Triton kernel against a PyTorch reference implementation.

The changes look solid and align with the goal of modularizing the MoE kernels. I have a couple of suggestions for improving maintainability and robustness.

vllm/model_executor/layers/quantization/mxfp4.py

vllm/model_executor/layers/fused_moe/gpt_oss_triton_kernels_moe.py

tests/kernels/moe/test_modular_oai_triton_moe.py

Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Xin Yang <105740670+xyang16@users.noreply.github.com>

) Signed-off-by: Xin Yang <xyangx@amazon.com> Signed-off-by: Xin Yang <105740670+xyang16@users.noreply.github.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

) Signed-off-by: Xin Yang <xyangx@amazon.com> Signed-off-by: Xin Yang <105740670+xyang16@users.noreply.github.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>

) Signed-off-by: Xin Yang <xyangx@amazon.com> Signed-off-by: Xin Yang <105740670+xyang16@users.noreply.github.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>

[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4

d3119e5

Signed-off-by: Xin Yang <xyangx@amazon.com>

xyang16 requested review from WoosukKwon, jeejeelee, mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth and yewentao256 as code owners November 29, 2025 00:49

Fix test

b55736c

Signed-off-by: Xin Yang <xyangx@amazon.com>

mergify bot added the gpt-oss Related to GPT-OSS models label Nov 29, 2025

github-project-automation bot added this to gpt-oss Issues & Enhancements Nov 29, 2025

github-project-automation bot moved this to To Triage in gpt-oss Issues & Enhancements Nov 29, 2025

gemini-code-assist bot reviewed Nov 29, 2025

View reviewed changes

vllm/model_executor/layers/quantization/mxfp4.py Show resolved Hide resolved

vllm/model_executor/layers/fused_moe/gpt_oss_triton_kernels_moe.py Show resolved Hide resolved

jeejeelee reviewed Nov 29, 2025

View reviewed changes

tests/kernels/moe/test_modular_oai_triton_moe.py Outdated Show resolved Hide resolved

xyang16 force-pushed the fused_moe_lora_triton branch from 3e5e554 to b55736c Compare November 29, 2025 00:56

Update tests/kernels/moe/test_modular_oai_triton_moe.py

4cbbc93

Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Xin Yang <105740670+xyang16@users.noreply.github.com>

jeejeelee added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 29, 2025

xyang16 and others added 4 commits November 28, 2025 18:03

Merge branch 'main' into fused_moe_lora_triton

88df207

Merge branch 'main' into fused_moe_lora_triton

7590bca

Merge branch 'main' into fused_moe_lora_triton

9e78575

Merge branch 'main' into fused_moe_lora_triton

e627770

jeejeelee approved these changes Nov 30, 2025

View reviewed changes

github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Nov 30, 2025

jeejeelee merged commit a491b09 into vllm-project:main Nov 30, 2025
60 checks passed

github-project-automation bot moved this from Ready to Done in gpt-oss Issues & Enhancements Nov 30, 2025

xyang16 deleted the fused_moe_lora_triton branch November 30, 2025 03:13

jeejeelee mentioned this pull request Dec 4, 2025

[Bugfix] Fix FP8 MoE LoRA #29890

Merged

5 tasks

shyeh25 mentioned this pull request Dec 9, 2025

[Bug]: H200 GPT-OSS-120B has perf drop #30326

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4#29708

[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4#29708
jeejeelee merged 7 commits intovllm-project:mainfrom
xyang16:fused_moe_lora_triton

xyang16 commented Nov 29, 2025 •

edited by github-actions bot

Loading

Uh oh!

chatgpt-codex-connector bot commented Nov 29, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

xyang16 commented Nov 29, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Benchmark

Accuracy Testing

Uh oh!

chatgpt-codex-connector bot commented Nov 29, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

xyang16 commented Nov 29, 2025 •

edited by github-actions bot

Loading