
[Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe #31832

Merged
mgoin merged 4 commits into vllm-project:main from neuralmagic:nvfp4_cutlass_moe_fused_silu_quant
Jan 9, 2026

Conversation

@mgoin (Member) commented Jan 6, 2026

Purpose

We can fuse the silu_and_mul activation into the NVFP4 quantization operation for MoE, as we already do for dense NVFP4 as of #23671. This PR generalizes some utils to expose the same functionality for expert quantization. In the benchmarks below, this improves latency by ~2% and throughput by ~4%.
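For reference, silu_and_mul has the following semantics (a minimal pure-Python sketch of the activation the kernel folds into the quant step; names and the example row are illustrative, not the kernel's actual interface):

```python
import math

def silu(v: float) -> float:
    # SiLU(v) = v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def silu_and_mul(row: list[float]) -> list[float]:
    # The input row holds [gate | up] halves; the output is
    # silu(gate) * up, halving the width.
    d = len(row) // 2
    return [silu(row[i]) * row[d + i] for i in range(d)]

# Unfused path: y = silu_and_mul(x), then NVFP4-quantize y
#   (two kernel launches, one intermediate activation tensor).
# Fused path (this PR): silu_and_mul_scaled_fp4_experts_quant
#   computes the activation in-kernel and quantizes immediately.
```

The gain presumably comes from not materializing the intermediate activation tensor between the two kernels.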

Before:
[screenshot, Jan 6 2026]

After:
[screenshot, Jan 6 2026]

Test Plan

Benchmarks and evals

Test Result

Latency Benchmark

vllm serve nvidia/Qwen3-30B-A3B-NVFP4
vllm bench serve --input-len 100 --output-len 100 --num-prompts 8

# MAIN
============ Serving Benchmark Result ============
Successful requests:                     8         
Failed requests:                         0         
Benchmark duration (s):                  0.80      
Total input tokens:                      800       
Total generated tokens:                  800       
Request throughput (req/s):              10.06     
Output token throughput (tok/s):         1005.70   
Peak output token throughput (tok/s):    800.00    
Peak concurrent requests:                8.00      
Total token throughput (tok/s):          2011.41   
---------------Time to First Token----------------
Mean TTFT (ms):                          33.40     
Median TTFT (ms):                        34.42     
P99 TTFT (ms):                           40.03     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.62      
Median TPOT (ms):                        7.62      
P99 TPOT (ms):                           7.62      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.62      
Median ITL (ms):                         7.67      
P99 ITL (ms):                            8.22      
==================================================

# PR
============ Serving Benchmark Result ============
Successful requests:                     8         
Failed requests:                         0         
Benchmark duration (s):                  0.78      
Total input tokens:                      800       
Total generated tokens:                  800       
Request throughput (req/s):              10.25     
Output token throughput (tok/s):         1025.21   
Peak output token throughput (tok/s):    800.00    
Peak concurrent requests:                8.00      
Total token throughput (tok/s):          2050.42   
---------------Time to First Token----------------
Mean TTFT (ms):                          31.20     
Median TTFT (ms):                        32.42     
P99 TTFT (ms):                           34.46     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.53      
Median TPOT (ms):                        7.53      
P99 TPOT (ms):                           7.53      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.53      
Median ITL (ms):                         7.55      
P99 ITL (ms):                            8.30      
==================================================

Throughput Benchmark

vllm serve nvidia/Qwen3-30B-A3B-NVFP4
vllm bench serve --input-len 100 --output-len 100 --num-prompts 512

# MAIN
============ Serving Benchmark Result ============
Successful requests:                     512       
Failed requests:                         0         
Benchmark duration (s):                  2.65      
Total input tokens:                      51200     
Total generated tokens:                  51200     
Request throughput (req/s):              193.35    
Output token throughput (tok/s):         19335.14  
Peak output token throughput (tok/s):    30913.00  
Peak concurrent requests:                512.00    
Total token throughput (tok/s):          38670.29  
---------------Time to First Token----------------
Mean TTFT (ms):                          812.20    
Median TTFT (ms):                        793.07    
P99 TTFT (ms):                           1031.71   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.91     
Median TPOT (ms):                        18.14     
P99 TPOT (ms):                           20.28     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.93     
Median ITL (ms):                         16.15     
P99 ITL (ms):                            87.56     
================================================== 

# PR
============ Serving Benchmark Result ============
Successful requests:                     512       
Failed requests:                         0         
Benchmark duration (s):                  2.55      
Total input tokens:                      51200     
Total generated tokens:                  51200     
Request throughput (req/s):              200.45    
Output token throughput (tok/s):         20044.52  
Peak output token throughput (tok/s):    32996.00  
Peak concurrent requests:                512.00    
Total token throughput (tok/s):          40089.04  
---------------Time to First Token----------------
Mean TTFT (ms):                          776.44    
Median TTFT (ms):                        759.85    
P99 TTFT (ms):                           1005.40   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.20     
Median TPOT (ms):                        17.63     
P99 TPOT (ms):                           19.88     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.24     
Median ITL (ms):                         15.43     
P99 ITL (ms):                            100.69    
==================================================

Eval

vllm serve nvidia/Qwen3-30B-A3B-NVFP4
python tests/evals/gsm8k/gsm8k_eval.py --port 8000

# MAIN
Results:
Accuracy: 0.886
Invalid responses: 0.001
Total latency: 20.575 s
Questions per second: 64.106
Total output tokens: 153861
Output tokens per second: 7477.995

# PR
Results:
Accuracy: 0.883
Invalid responses: 0.000
Total latency: 20.371 s
Questions per second: 64.748
Total output tokens: 153967
Output tokens per second: 7558.026

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mgoin mgoin requested a review from pavanimajety as a code owner January 6, 2026 20:52
@mgoin mgoin changed the title from "[Perf][Kernel] Fused silu_and_mul_scaled_fp4_experts_quant for NVFP4 cutlass_moe" to "[Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe" Jan 6, 2026
@mergify mergify bot added the nvidia label Jan 6, 2026
@mgoin mgoin added the performance Performance-related issues label Jan 6, 2026
@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a fused kernel, silu_and_mul_scaled_fp4_experts_quant, to optimize the performance of Mixture of Experts (MoE) layers using NVFP4 quantization. The fusion is controlled by a new FUSE_SILU_MUL template parameter, and the changes are well-integrated into the existing kernel infrastructure. The PR also includes beneficial refactoring, such as centralizing input validation logic. The benchmark results provided in the description demonstrate a clear performance improvement. My review includes one suggestion to improve C++ code quality by using const references for read-only parameters, which is a good practice for correctness and can aid compiler optimizations. Overall, this is a solid performance enhancement.
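The two paths the fusion collapses can be pictured with a host-side sketch (toy e2m1-style rounding with hypothetical helper names; the real CUDA kernel also handles per-block scale factors and expert offsets, omitted here):

```python
import math

# The eight non-negative magnitudes representable in FP4 (e2m1).
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quant_fp4(v: float) -> float:
    # Round to the nearest representable e2m1 magnitude, keeping the sign.
    mag = min(FP4_VALUES, key=lambda q: abs(q - abs(v)))
    return math.copysign(mag, v)

def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))

def two_pass(row: list[float]) -> list[float]:
    d = len(row) // 2
    act = [silu(row[i]) * row[d + i] for i in range(d)]  # pass 1: write intermediate
    return [quant_fp4(a) for a in act]                   # pass 2: re-read and quantize

def fused(row: list[float]) -> list[float]:
    d = len(row) // 2
    # One pass: the activation stays in registers and is quantized immediately.
    return [quant_fp4(silu(row[i]) * row[d + i]) for i in range(d)]
```

Because the fused path applies the same rounding to the same activation values, its output matches the two-pass path exactly; only the intermediate memory traffic changes.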

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 7, 2026
@yewentao256 yewentao256 self-assigned this Jan 7, 2026
@yewentao256 (Member) left a comment:

Nice work! Just a few small thoughts

    float const* SFScale, uint32_t* out, uint32_t* SFout,
    uint32_t* input_offset_by_experts,
    uint32_t* output_scale_offset_by_experts, int n_experts,
    bool low_latency) {

@yewentao256 (Member): here low_latency seems to be unused?

@mgoin (Member, Author): Yeah, I'm not sure. I didn't touch this code, so I don't want to change that in this PR.

@yewentao256 (Member) left a comment:

LGTM, thanks for the work!

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 8, 2026
@mgoin mgoin merged commit 34cd32f into vllm-project:main Jan 9, 2026
93 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 9, 2026
@mgoin mgoin deleted the nvfp4_cutlass_moe_fused_silu_quant branch January 9, 2026 14:40
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
…m-project#31832)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…m-project#31832)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…m-project#31832)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>

Labels

nvidia · performance (Performance-related issues) · ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done

2 participants