
[Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe #31832

Merged
mgoin merged 4 commits into vllm-project:main from neuralmagic:nvfp4_cutlass_moe_fused_silu_quant
Jan 9, 2026

Conversation

@mgoin (Member) commented Jan 6, 2026

Purpose

We can fuse the silu_and_mul activation into the NVFP4 quantization operation for MoE, as we already do for dense NVFP4 as of #23671. This PR generalizes some utils to expose the same functionality for expert quantization. In the benchmarks below, this improves latency by ~2% and throughput by ~4%.
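For reference, silu_and_mul has the following semantics (a minimal pure-Python sketch of the activation the kernel folds into the quant step; names and the example row are illustrative, not the kernel's actual interface):

```python
import math

def silu(v: float) -> float:
    # SiLU(v) = v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def silu_and_mul(row: list[float]) -> list[float]:
    # The input row holds [gate | up] halves; the output is
    # silu(gate) * up, halving the width.
    d = len(row) // 2
    return [silu(row[i]) * row[d + i] for i in range(d)]

# Unfused path: y = silu_and_mul(x), then NVFP4-quantize y
#   (two kernel launches, one intermediate activation tensor).
# Fused path (this PR): silu_and_mul_scaled_fp4_experts_quant
#   computes the activation in-kernel and quantizes immediately.
```

The gain presumably comes from not materializing the intermediate activation tensor between the two kernels.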

Before:
[screenshot, Jan 6 2026]

After:
[screenshot, Jan 6 2026]

Test Plan

Benchmarks and evals

Test Result

Latency Benchmark

vllm serve nvidia/Qwen3-30B-A3B-NVFP4
vllm bench serve --input-len 100 --output-len 100 --num-prompts 8

# MAIN
============ Serving Benchmark Result ============
Successful requests:                     8         
Failed requests:                         0         
Benchmark duration (s):                  0.80      
Total input tokens:                      800       
Total generated tokens:                  800       
Request throughput (req/s):              10.06     
Output token throughput (tok/s):         1005.70   
Peak output token throughput (tok/s):    800.00    
Peak concurrent requests:                8.00      
Total token throughput (tok/s):          2011.41   
---------------Time to First Token----------------
Mean TTFT (ms):                          33.40     
Median TTFT (ms):                        34.42     
P99 TTFT (ms):                           40.03     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.62      
Median TPOT (ms):                        7.62      
P99 TPOT (ms):                           7.62      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.62      
Median ITL (ms):                         7.67      
P99 ITL (ms):                            8.22      
==================================================

# PR
============ Serving Benchmark Result ============
Successful requests:                     8         
Failed requests:                         0         
Benchmark duration (s):                  0.78      
Total input tokens:                      800       
Total generated tokens:                  800       
Request throughput (req/s):              10.25     
Output token throughput (tok/s):         1025.21   
Peak output token throughput (tok/s):    800.00    
Peak concurrent requests:                8.00      
Total token throughput (tok/s):          2050.42   
---------------Time to First Token----------------
Mean TTFT (ms):                          31.20     
Median TTFT (ms):                        32.42     
P99 TTFT (ms):                           34.46     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.53      
Median TPOT (ms):                        7.53      
P99 TPOT (ms):                           7.53      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.53      
Median ITL (ms):                         7.55      
P99 ITL (ms):                            8.30      
==================================================

Throughput Benchmark

vllm serve nvidia/Qwen3-30B-A3B-NVFP4
vllm bench serve --input-len 100 --output-len 100 --num-prompts 512

# MAIN
============ Serving Benchmark Result ============
Successful requests:                     512       
Failed requests:                         0         
Benchmark duration (s):                  2.65      
Total input tokens:                      51200     
Total generated tokens:                  51200     
Request throughput (req/s):              193.35    
Output token throughput (tok/s):         19335.14  
Peak output token throughput (tok/s):    30913.00  
Peak concurrent requests:                512.00    
Total token throughput (tok/s):          38670.29  
---------------Time to First Token----------------
Mean TTFT (ms):                          812.20    
Median TTFT (ms):                        793.07    
P99 TTFT (ms):                           1031.71   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.91     
Median TPOT (ms):                        18.14     
P99 TPOT (ms):                           20.28     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.93     
Median ITL (ms):                         16.15     
P99 ITL (ms):                            87.56     
================================================== 

# PR
============ Serving Benchmark Result ============
Successful requests:                     512       
Failed requests:                         0         
Benchmark duration (s):                  2.55      
Total input tokens:                      51200     
Total generated tokens:                  51200     
Request throughput (req/s):              200.45    
Output token throughput (tok/s):         20044.52  
Peak output token throughput (tok/s):    32996.00  
Peak concurrent requests:                512.00    
Total token throughput (tok/s):          40089.04  
---------------Time to First Token----------------
Mean TTFT (ms):                          776.44    
Median TTFT (ms):                        759.85    
P99 TTFT (ms):                           1005.40   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.20     
Median TPOT (ms):                        17.63     
P99 TPOT (ms):                           19.88     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.24     
Median ITL (ms):                         15.43     
P99 ITL (ms):                            100.69    
==================================================

Eval

vllm serve nvidia/Qwen3-30B-A3B-NVFP4
python tests/evals/gsm8k/gsm8k_eval.py --port 8000

# MAIN
Results:
Accuracy: 0.886
Invalid responses: 0.001
Total latency: 20.575 s
Questions per second: 64.106
Total output tokens: 153861
Output tokens per second: 7477.995

# PR
Results:
Accuracy: 0.883
Invalid responses: 0.000
Total latency: 20.371 s
Questions per second: 64.748
Total output tokens: 153967
Output tokens per second: 7558.026

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mgoin mgoin requested a review from pavanimajety as a code owner January 6, 2026 20:52
@mgoin mgoin changed the title from "[Perf][Kernel] Fused silu_and_mul_scaled_fp4_experts_quant for NVFP4 cutlass_moe" to "[Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe" Jan 6, 2026
@mergify mergify bot added the nvidia label Jan 6, 2026
@mgoin mgoin added the performance Performance-related issues label Jan 6, 2026
@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a fused kernel, silu_and_mul_scaled_fp4_experts_quant, to optimize the performance of Mixture of Experts (MoE) layers using NVFP4 quantization. The fusion is controlled by a new FUSE_SILU_MUL template parameter, and the changes are well-integrated into the existing kernel infrastructure. The PR also includes beneficial refactoring, such as centralizing input validation logic. The benchmark results provided in the description demonstrate a clear performance improvement. My review includes one suggestion to improve C++ code quality by using const references for read-only parameters, which is a good practice for correctness and can aid compiler optimizations. Overall, this is a solid performance enhancement.
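The two paths the fusion collapses can be pictured with a host-side sketch (toy e2m1-style rounding with hypothetical helper names; the real CUDA kernel also handles per-block scale factors and expert offsets, omitted here):

```python
import math

# The eight non-negative magnitudes representable in FP4 (e2m1).
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quant_fp4(v: float) -> float:
    # Round to the nearest representable e2m1 magnitude, keeping the sign.
    mag = min(FP4_VALUES, key=lambda q: abs(q - abs(v)))
    return math.copysign(mag, v)

def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))

def two_pass(row: list[float]) -> list[float]:
    d = len(row) // 2
    act = [silu(row[i]) * row[d + i] for i in range(d)]  # pass 1: write intermediate
    return [quant_fp4(a) for a in act]                   # pass 2: re-read and quantize

def fused(row: list[float]) -> list[float]:
    d = len(row) // 2
    # One pass: the activation stays in registers and is quantized immediately.
    return [quant_fp4(silu(row[i]) * row[d + i]) for i in range(d)]
```

Because the fused path applies the same rounding to the same activation values, its output matches the two-pass path exactly; only the intermediate memory traffic changes.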

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 7, 2026
@yewentao256 yewentao256 self-assigned this Jan 7, 2026
@yewentao256 (Member) left a comment:

Nice work! Just a few small thoughts

    float const* SFScale, uint32_t* out, uint32_t* SFout,
    uint32_t* input_offset_by_experts,
    uint32_t* output_scale_offset_by_experts, int n_experts,
    bool low_latency) {

@yewentao256 (Member): here low_latency seems to be unused?

@mgoin (Member, Author): Yeah, I'm not sure. I didn't touch this code, so I don't want to change that in this PR.

@yewentao256 (Member) left a comment:

LGTM, thanks for the work!

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 8, 2026
@mgoin mgoin merged commit 34cd32f into vllm-project:main Jan 9, 2026
93 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 9, 2026
@mgoin mgoin deleted the nvfp4_cutlass_moe_fused_silu_quant branch January 9, 2026 14:40
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
…m-project#31832)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…m-project#31832)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…m-project#31832)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>

Labels

nvidia · performance (Performance-related issues) · ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done

2 participants