[NVIDIA] [2/N] Optimize silu_and_mul_scaled_fp4_grouped_quant perf
#9556
Conversation
Summary of Changes
Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request focuses on optimizing the performance of the silu_and_mul_scaled_fp4_grouped_quant operation, particularly for low-latency deep learning applications. The core change involves the introduction of a new CUDA kernel, cvt_fp16_to_fp4_masked, which efficiently handles the conversion from FP16 to FP4 while incorporating the SiLU and multiply operations when a mask is present. The system now intelligently selects between this new masked kernel and the existing unmasked kernel based on the input parameters, ensuring a more optimized execution path for relevant scenarios.
Highlights
- New CUDA Kernel for Masked Operations: Introduced the `cvt_fp16_to_fp4_masked` kernel to specifically handle scenarios where a mask is provided, integrating the `silu_and_mul` operation directly within the conversion process from FP16 to FP4.
- Conditional Kernel Dispatch: The `quant_impl` function now intelligently dispatches to either the existing `cvt_fp16_to_fp4` kernel or the new `cvt_fp16_to_fp4_masked` kernel based on whether a mask is present, ensuring optimized execution paths (a sketch of this dispatch follows the list).
- Performance Optimization for Low Latency: The changes target the performance of `silu_and_mul_scaled_fp4_grouped_quant` for deep-learning low-latency use cases, particularly when dealing with masked inputs.
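Since the PR diff itself isn't shown here, below is a minimal, self-contained sketch of that conditional dispatch. The real kernels (`cvt_fp16_to_fp4`, `cvt_fp16_to_fp4_masked`) are replaced by trivial stubs so the control flow compiles standalone; the `quant_impl_sketch` signature, parameter names, and launch dimensions are illustrative assumptions, not taken from sgl-kernel.

```cpp
// Sketch only: stub kernels stand in for the real FP16->FP4 conversion kernels.
#include <cuda_runtime.h>

__global__ void cvt_stub(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];  // placeholder for the unmasked FP16->FP4 conversion
}

__global__ void cvt_stub_masked(const float* in, float* out, const int* mask,
                                int n_experts, int tokens_per_expert, int hidden) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int total = n_experts * tokens_per_expert * hidden;
  if (i >= total) return;
  int expert = i / (tokens_per_expert * hidden);
  int token  = (i / hidden) % tokens_per_expert;
  // Skip padded tokens: only the first mask[expert] tokens of each expert are real.
  if (token < mask[expert]) out[i] = in[i];  // placeholder for fused silu_and_mul + FP4 quant
}

// Hypothetical host-side dispatch: pick the masked kernel only when a mask is given.
void quant_impl_sketch(const float* d_in, float* d_out, const int* d_mask,
                       int n_experts, int tokens_per_expert, int hidden,
                       cudaStream_t stream) {
  int n = n_experts * tokens_per_expert * hidden;
  dim3 block(256);
  dim3 grid((n + block.x - 1) / block.x);
  if (d_mask != nullptr) {
    cvt_stub_masked<<<grid, block, 0, stream>>>(d_in, d_out, d_mask,
                                                n_experts, tokens_per_expert, hidden);
  } else {
    cvt_stub<<<grid, block, 0, stream>>>(d_in, d_out, n);
  }
}
```

The real masked kernel also fuses `silu_and_mul` into the conversion; the host-side branch on the presence of a mask is the part the highlight above describes.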
Code Review
This pull request introduces a new CUDA kernel cvt_fp16_to_fp4_masked to optimize the silu_and_mul_scaled_fp4_grouped_quant operation, particularly for low-latency scenarios with masking. The new kernel partitions threads among experts to improve parallelism. While this is a solid optimization strategy, I've found a critical issue in the thread-to-expert mapping logic that could lead to out-of-bounds memory access and division-by-zero errors under certain launch configurations. My review includes a suggested fix to ensure the kernel's correctness and stability.
The current logic for assigning threads to experts has two potential critical issues:
- Division by zero: If the total number of threads (`gridDim.x * blockDim.x`) is less than `n_experts`, `threadsPerExpert` will be computed as zero, causing a division-by-zero error on the following lines when calculating `tid_in_expert` and `expert_idx`.
- Out-of-bounds access: If the total number of threads is not evenly divisible by `n_experts`, `expert_idx` can come out `>= n_experts` for some threads. This would lead to out-of-bounds access on `input_offset_by_experts` and other expert-indexed arrays later in the kernel.
These issues can lead to kernel crashes. I suggest adding guards to handle these cases safely.
int threadsPerExpert = gridDim.x * blockDim.x / n_experts;
if (threadsPerExpert == 0) {
// Not enough threads for at least one per expert.
return;
}
int expert_idx = tid / threadsPerExpert;
if (expert_idx >= n_experts) {
return;
}
int tid_in_expert = tid % threadsPerExpert;
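For illustration, here is a tiny standalone check, using assumed launch sizes rather than anything from the PR, that shows the two failure modes the guards above prevent:

```cpp
#include <cstdio>

int main() {
  // Case 1: fewer threads than experts -> threadsPerExpert becomes 0,
  // so the unguarded divisions would divide by zero.
  int total_threads = 4;   // stand-in for gridDim.x * blockDim.x
  int n_experts = 6;
  int threadsPerExpert = total_threads / n_experts;  // 4 / 6 == 0
  printf("case 1: threadsPerExpert = %d\n", threadsPerExpert);

  // Case 2: thread count not divisible by n_experts -> some threads map to
  // expert_idx >= n_experts and would index past the expert-offset arrays.
  total_threads = 8;
  threadsPerExpert = total_threads / n_experts;      // 8 / 6 == 1
  for (int tid = 0; tid < total_threads; ++tid) {
    int expert_idx = tid / threadsPerExpert;
    if (expert_idx >= n_experts) {
      printf("case 2: tid %d maps to expert %d (out of range)\n", tid, expert_idx);
    }
  }
  return 0;
}
```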
The collected perf can be seen here. The repro:
Feel free to ping me (maybe on Slack if I do not reply here) when this needs a review!
Hi, could you please test on these shapes (which are of most interest):
- 6 local experts
- 1024/512/256/128 tokens per local expert (1024 is of most interest, corresponding to 768 tokens per rank with 48 ranks), max_m=4096 (or another big number)
- hidden dim 7168/2048 (7168 is the hidden dim, 2048 is the MoE expert intermediate dim)
I will review once the test case above shows improvement.
Just pushed some changes. I noticed that I couldn't replace the existing kernels, because I see accuracy drops for dsr1 (trying to debug it with @pavanimajety since it affects the cutlass FP4 path). Since that is not tightly related to this work, I've made the new kernel available only when a mask is used for now. As for the kernel improvement for the requested shapes of interest: here. Basically, we see roughly 1.2x to 5x improvement over the previous version.
Added a benchmark script. Below is the output with varying M and K, with masks (max_m=4096). This PR focuses on improving the masked case.
Before:
After:
fzyzcjy left a comment
Made a very simple check, and the general direction looks reasonable to me; will review in more detail later.
Btw @Alcanderian, do you have interest in having a review as well? I am too busy these days :(
@kaixih, wait a bit please, thanks.
I need to bump this first: https://github.com/sgl-project/sglang/actions/runs/17276109894. Please commit after that.
@zhyncs Sure, thanks for the heads-up. The last commit enabled masked quant. With it, the leading quant performs almost the same as silu-quant-masked in the e2e benchmark with this PR. That said, the last commit is more of a “nice-to-have.” Once DeepEP supports swizzled NVFP4 output, we’ll be able to invoke GEMM right after the dispatch.
Btw, when #9199 (#9199 (comment)) passes accuracy, this kernel will be double-checked e2e.
@fzyzcjy This PR (#9199) already uses my changes for the accuracy tests (@wenscarl patched my changes into his internal repo). Without them, execution time doubles or even triples. I think my PR should be treated as a prerequisite. If possible, we should merge it first so that @wenscarl can run his tests more conveniently.


This PR further optimizes `silu_and_mul_scaled_fp4_grouped_quant` for DeepEP low-latency scenarios. When the mask contains small values, the kernel achieves up to 82× speedup over the previous version. The change also improves performance for non-masked NVFP4 expert quantization (`scaled_fp4_experts_quant`) and NVFP4 group quantization (`scaled_fp4_grouped_quant`), delivering about 1.5× gains.

The new design changes how threads are assigned in expert-aware kernels. Previously, all threads started with the first expert and then strided to the next, requiring extra computation to determine which expert they were handling; this needed frequent access to the offset tensors, hence earlier optimizations like here. Now, threads are evenly partitioned across experts, and each thread only processes its assigned expert. This (1) removes the need to recompute expert indices or re-access the offset tensors, and (2) makes early exit with masking straightforward (see the sketch below).
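A minimal sketch of this partitioning scheme, assuming a `[n_experts, max_tokens, hidden]` input layout and a per-expert token-count mask; the kernel and parameter names are illustrative, and the per-element copy is a placeholder for the fused silu_and_mul plus FP4 quantization done by the real kernel:

```cpp
#include <cuda_runtime.h>

// Each thread is statically assigned to one expert; it never recomputes its
// expert index from global offsets and can exit early when its expert has few
// (or zero) valid tokens according to the mask.
__global__ void expert_partitioned_sketch(const float* input, float* output,
                                          const int* mask,  // valid tokens per expert
                                          int n_experts, int max_tokens, int hidden) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int threads_per_expert = (gridDim.x * blockDim.x) / n_experts;
  if (threads_per_expert == 0) return;   // guard: too few threads overall
  int expert_idx = tid / threads_per_expert;
  if (expert_idx >= n_experts) return;   // guard: leftover threads
  int tid_in_expert = tid % threads_per_expert;

  // Early exit: only the first mask[expert_idx] tokens of this expert are real.
  int work = mask[expert_idx] * hidden;
  const float* in = input + (size_t)expert_idx * max_tokens * hidden;
  float* out = output + (size_t)expert_idx * max_tokens * hidden;

  // Stride only over this expert's valid elements.
  for (int i = tid_in_expert; i < work; i += threads_per_expert) {
    out[i] = in[i];  // placeholder for silu_and_mul + FP4 quantization
  }
}
```

Because each thread touches only its own expert's slice, experts with small mask values finish almost immediately, which is where the large speedups for small masks come from.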
cc. @wenscarl @fzyzcjy @kushanam