[Perf][Kernel] Fuse SiLU+Mul into NVFP4 Expert Quantization for CUTLASS MoE#18612
JackChuang wants to merge 4 commits into sgl-project:main
Conversation
Summary of Changes

Hello @JackChuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request optimizes the CUTLASS FP4 Mixture-of-Experts (MoE) pipeline by fusing three previously separate operations (intermediate buffer allocation, SiLU+Mul activation, and scaled FP4 expert quantization) into a single CUDA kernel. This fusion reduces memory overhead and kernel launch costs, yielding roughly 5% improvement in latency and throughput for MoE models.
Code Review
This pull request introduces a performance optimization by fusing the SiLU+Mul operation into the NVFP4 expert quantization kernel for CUTLASS MoE. This is achieved by adding a new fused CUDA kernel, `silu_and_mul_scaled_fp4_experts_quant_packed`, which eliminates an intermediate buffer and a kernel launch. The changes span CUDA kernel implementations, PyTorch op registration, and Python wrappers.
My review focuses on code structure and style. I've identified a couple of areas for improvement:
- Refactoring duplicated code in the new CUDA kernel to improve maintainability.
- Adhering to Python's import conventions.
Overall, the changes are well-documented and the performance benefits are clearly demonstrated. The implementation appears correct and follows the logic described.
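As background for the review comments below, the elementwise math the fused kernel applies before FP4 quantization (SiLU on the gate half, multiplied by the up half) can be sketched in plain Python. The function names here are illustrative, not the extension's API:

```python
import math

def silu(x: float) -> float:
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def silu_and_mul(row: list) -> list:
    # `row` is one token of GEMM1 output with width 2*k: [gate | up].
    # The fused kernel reads both halves and emits SiLU(gate[i]) * up[i]
    # (width k), then quantizes to FP4 in the same pass.
    assert len(row) % 2 == 0, "input last dim must be even (2*k)"
    k = len(row) // 2
    gate, up = row[:k], row[k:]
    return [silu(g) * u for g, u in zip(gate, up)]

print(silu_and_mul([0.0, 1.0, 2.0, 3.0]))
```
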
```cpp
void silu_and_mul_scaled_fp4_experts_quant_packed_sm100a(
    torch::Tensor& output,
    torch::Tensor& output_scale,
    torch::Tensor const& input,
    torch::Tensor const& input_global_scale,
    torch::Tensor const& input_offset_by_experts,
    torch::Tensor const& output_scale_offset_by_experts) {
```
There is significant code duplication between this new function `silu_and_mul_scaled_fp4_experts_quant_packed_sm100a` and the existing `scaled_fp4_experts_quant_sm100a` function (lines 568-650). Both functions perform nearly identical checks for tensor properties, dimensions, and types.
To improve maintainability and reduce redundancy, consider refactoring the common logic into a shared helper function. This helper could accept a `bool use_silu_and_mul` parameter to handle the minor differences in logic, such as the calculation of `k` and the flag passed to `quant_impl`.
For example, you could create a helper function like this:
```cpp
void scaled_fp4_experts_quant_sm100a_impl(
    torch::Tensor& output,
    torch::Tensor& output_scale,
    torch::Tensor const& input,
    torch::Tensor const& input_global_scale,
    torch::Tensor const& input_offset_by_experts,
    torch::Tensor const& output_scale_offset_by_experts,
    bool use_silu_and_mul) {
  // ... all common checks and logic ...
  auto k = input.size(1);
  if (use_silu_and_mul) {
    TORCH_CHECK(k % 2 == 0, "input last dim must be even (2*k)");
    k /= 2;
  }
  // ... more checks ...
  // Call quant_impl with the use_silu_and_mul flag
  if (input.dtype() == at::ScalarType::Half) {
    quant_impl<half>(..., use_silu_and_mul, ...);
  } else if (input.dtype() == at::ScalarType::BFloat16) {
    quant_impl<__nv_bfloat16>(..., use_silu_and_mul, ...);
  }
}
```

Then, `silu_and_mul_scaled_fp4_experts_quant_packed_sm100a` and `scaled_fp4_experts_quant_sm100a` would become simple wrappers calling this helper.
```python
m_numtopk, k_input_doubled = input_tensor.shape
k = k_input_doubled // 2  # Actual feature dim after SiLU+mul reduces 2k -> k
```
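The dimension bookkeeping around this snippet can be sketched as follows. This is a hypothetical helper, not the PR's code; it assumes an NVFP4 scaling-group size of 16 and ignores any scale-layout padding or swizzling the real kernel may apply:

```python
def fused_quant_shapes(m_numtopk: int, k_input_doubled: int, group_size: int = 16):
    # SiLU+mul halves the last dim (2k -> k), FP4 packs two 4-bit
    # values per output byte, and one e4m3 scale covers each group
    # of `group_size` elements (assumed to be 16 here).
    assert k_input_doubled % 2 == 0, "input last dim must be even (2*k)"
    k = k_input_doubled // 2
    output_shape = (m_numtopk, k // 2)          # packed FP4, uint8 storage
    scale_shape = (m_numtopk, k // group_size)  # one scale per group
    return output_shape, scale_shape

print(fused_quant_shapes(4, 512))
```
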
```python
import os
```
The `import os` statement is inside the function. According to the PEP 8 style guide, imports should usually be at the top of the file. Please move this import to the top level to improve code style and consistency.
References
- PEP 8 recommends that imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.
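The recommended layout can be illustrated with a minimal module sketch (`cache_dir` is a hypothetical example function, not from this PR):

```python
# Preferred: module-level import, visible at the top of the file.
import os

def cache_dir() -> str:
    # The function body uses `os` without a local import.
    return os.path.join(os.path.expanduser("~"), ".cache")

print(cache_dir())
```
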
Summary
In the CUTLASS FP4 MoE pipeline, the path between GEMM1 and GEMM2 previously required three separate steps: allocate intermediate buffer → `silu_and_mul` → `scaled_fp4_experts_quant`. This PR fuses them into a single CUDA kernel, `silu_and_mul_scaled_fp4_experts_quant_packed`, eliminating one intermediate buffer allocation and one extra kernel launch. Inspired by vllm #31832.

Before → After this PR
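The control-flow change can be sketched with Python stubs; these are hypothetical stand-ins for the real torch ops that only record kernel launches, to show where the buffer and launch are saved:

```python
launches = []

def silu_and_mul(x):
    # Stub for the standalone activation kernel: requires its own
    # intermediate output buffer and a separate kernel launch.
    launches.append("silu_and_mul")
    return [0.0] * (len(x) // 2)

def scaled_fp4_experts_quant(x):
    # Stub for the standalone quantization kernel.
    launches.append("scaled_fp4_experts_quant")
    return x

def silu_and_mul_scaled_fp4_experts_quant_packed(x):
    # Stub for the fused kernel introduced by this PR: one launch,
    # no intermediate activation buffer.
    launches.append("fused")
    return [0.0] * (len(x) // 2)

c1 = [0.0] * 8  # GEMM1 output, width 2*k

# Before: two kernel launches plus an intermediate buffer.
before = scaled_fp4_experts_quant(silu_and_mul(c1))

# After: a single fused launch.
after = silu_and_mul_scaled_fp4_experts_quant_packed(c1)

print(launches)
```
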
Key Changes
CUDA kernel (`nvfp4_expert_quant.cu`):

- Added a `use_silu_and_mul` flag to the `cvt_fp16_to_fp4` kernels (both low-latency and offset-based variants). Previously SiLU+mul was implicitly tied to `mask != nullptr`; now it is an independent toggle.
- New entry point `silu_and_mul_scaled_fp4_experts_quant_packed_sm100a` uses expert offsets (not masks) to correctly handle non-uniform token distribution across experts.
- The input is `(m, 2*k)`, gate+up concatenated from the GEMM1 output; the kernel reads both halves, applies SiLU(gate)×up, then FP4-quantizes in one pass.

Op registration (`common_extension.cc`, `sgl_kernel_ops.h`, `nvfp4_quant_entry.cu`):

- Registers `silu_and_mul_scaled_fp4_experts_quant_packed` as a new torch op.

Python wrapper (`gemm.py`):

- `silu_and_mul_scaled_fp4_experts_quant_packed()` handles dimension calculation (`k = input.shape[1] // 2`), output/scale allocation, kernel dispatch, and reinterprets the scale output as `float8_e4m3fn` for GEMM2.

MoE integration (`cutlass_moe.py`):

- Replaces the separate activation and quantization steps with a single `silu_and_mul_scaled_fp4_experts_quant_packed(c1, ...)` call.

Experimental Results
Experimental Setup
Accuracy & Throughput & Latency Benchmark
Both latency and throughput improve by ~5%
Throughput Benchmark
1.4% gain
Latency Benchmark
2% gain