[Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe #31832
mgoin merged 4 commits into vllm-project:main
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request introduces a fused kernel, silu_and_mul_scaled_fp4_experts_quant, to optimize the performance of Mixture of Experts (MoE) layers using NVFP4 quantization. The fusion is controlled by a new FUSE_SILU_MUL template parameter, and the changes are well-integrated into the existing kernel infrastructure. The PR also includes beneficial refactoring, such as centralizing input validation logic. The benchmark results provided in the description demonstrate a clear performance improvement. My review includes one suggestion to improve C++ code quality by using const references for read-only parameters, which is a good practice for correctness and can aid compiler optimizations. Overall, this is a solid performance enhancement.
yewentao256
left a comment
Nice work! Just a few small thoughts
    float const* SFScale, uint32_t* out, uint32_t* SFout,
    uint32_t* input_offset_by_experts,
    uint32_t* output_scale_offset_by_experts, int n_experts,
    bool low_latency) {
It seems low_latency is not used here?
Yeah... I'm not sure. I didn't touch this code, so I don't want to change that in this PR.
Signed-off-by: Michael Goin <mgoin64@gmail.com>
yewentao256
left a comment
LGTM, thanks for the work!
…m-project#31832) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Michael Goin <mgoin64@gmail.com>
…m-project#31832) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
We can easily fuse the silu_and_mul operation into the NVFP4 quantization operation for MoE, which we already do for dense NVFP4 as of #23671. This PR just generalizes some utils to expose the same functionality for expert quantization. This seems to improve latency by ~2% and throughput by ~4%.
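To make the fusion concrete, here is a hedged, pure-Python reference sketch of the semantics the fused kernel computes in one pass: silu_and_mul (SiLU on the gate half, multiplied by the up half) followed by per-block FP4-style quantization. The helper names (`fp4_block_quant`, `fused_silu_mul_quant`) and the dequantize-back representation are illustrative, not the actual kernel's API or encoding — the real kernel packs 4-bit e2m1 codes and FP8 scale factors on-device.

```python
import math

# FP4 e2m1 representable magnitudes.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def silu(v):
    return v / (1.0 + math.exp(-v))

def silu_and_mul(row):
    # First half of the row is the gate, second half the up-projection.
    d = len(row) // 2
    return [silu(row[i]) * row[d + i] for i in range(d)]

def fp4_block_quant(row, block=16):
    """Toy per-block quantization: one scale per `block` values, with each
    magnitude snapped to the nearest E2M1 point. Illustrative only."""
    out, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        amax = max(abs(v) for v in chunk) or 1.0
        scale = amax / 6.0  # 6.0 is the largest e2m1 magnitude
        scales.append(scale)
        for v in chunk:
            mag = min(E2M1, key=lambda g: abs(g - abs(v) / scale))
            out.append(math.copysign(mag * scale, v))
    return out, scales

def fused_silu_mul_quant(row, block=16):
    # What the fused kernel does in one pass: activation, then quantize,
    # without writing the intermediate activation back to global memory.
    return fp4_block_quant(silu_and_mul(row), block)
```

The win from fusion is not the math (which is identical to running the two steps separately) but skipping the round trip of the intermediate silu_and_mul output through global memory.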
Before:

After:

Test Plan
Benchmarks and evals
Test Result
Latency Benchmark
Throughput Benchmark
Eval
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.