Fp4 MOE quant kernel optimization #8777
Conversation
Summary of Changes
Hello @jy-song-hub, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've ported vLLM's FP4 Mixture-of-Experts (MoE) kernel optimization to SGLang, specifically targeting NVIDIA Blackwell GPUs. This work aims to significantly improve the performance of expert-based FP4 quantization. The optimizations include a new grid-stride loop layout for better parallelism, dynamic launch configuration tuning, and specialized strategies for handling expert offsets based on problem size: using registers for small problems and shared memory with binary search for larger ones. These changes have resulted in an approximate 13% speedup on NVIDIA B200 hardware.
Highlights
- Grid-Stride Loop Layout: The kernel now utilizes a grid-stride loop layout, replacing the per-block row processing to enhance thread-level parallelism and overall GPU utilization.
- Dynamic Launch Configuration: Implemented adaptive launch configuration tuning that adjusts grid and block sizes based on the number of Streaming Multiprocessors (SMs) and problem size, aiming to improve GPU occupancy.
- Optimized Expert Offset Lookup (Small Problems): For smaller problem sizes where GPU blocks are not frequently reused, expert offsets are now efficiently read directly into registers, minimizing lookup overhead.
- Optimized Expert Offset Lookup (Large Problems): For larger problem sizes involving block reuse, expert offsets are loaded into shared memory and accessed via a binary search, providing efficient lookup.
- Performance Improvement: Achieved a notable performance gain of approximately 13% on NVIDIA B200 (Blackwell GPU) for FP4 MoE quantization, as verified by benchmarks.
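The grid-stride layout mentioned above is a standard CUDA pattern. The sketch below models it on the CPU (names like `grid_stride_cover` are illustrative, not from the actual kernel): every (block, thread) pair starts at its global index and advances by the total number of threads, so a fixed-size grid covers any number of work items exactly once.

```cpp
#include <cassert>
#include <vector>

// CPU sketch of a grid-stride loop. On a GPU, "block" and "thread" would be
// blockIdx.x and threadIdx.x; here we iterate them explicitly to show that
// every work item in [0, total_work) is visited exactly once.
void grid_stride_cover(int grid_dim, int block_dim, int total_work,
                       std::vector<int>& visit_count) {
  const int stride = grid_dim * block_dim;  // total threads in the grid
  for (int block = 0; block < grid_dim; ++block) {
    for (int thread = 0; thread < block_dim; ++thread) {
      for (int i = block * block_dim + thread; i < total_work; i += stride) {
        visit_count[i] += 1;  // on the GPU: process row i here
      }
    }
  }
}
```

Because the grid size no longer has to match the problem size, the launch configuration can be tuned to the SM count independently of the number of rows.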
Code Review
The PR introduces several optimizations to the FP4 MoE quantization kernel, including a grid-stride loop, launch tuning, and adaptive expert lookup strategies. These changes are well-aligned with the goal of improving performance. However, I've found a few critical issues in the implementation of the expert lookup logic, which could lead to out-of-bounds memory access and incorrect behavior. Specifically, the handling of cases where the number of experts is not a multiple of the vector size is incorrect in both the register-based and shared-memory-based kernels. I've also pointed out some areas for further optimization and robustness improvements. Addressing these issues is crucial for the correctness and stability of the kernel.
There are two issues in this expert search implementation:

- Out-of-bounds access: the vectorized loads (`__ldca(reinterpret_cast<const int4*>(...))`) can read past the end of the `input_offset_by_experts` array if `n_experts` is not a multiple of 16. This can lead to crashes or incorrect results.
- Incorrect loop control: the `break` statement only exits the inner loop. The outer `chunk_start` loop continues to iterate, which is inefficient and can lead to `expert_idx` being overwritten if another match is found (which would be incorrect).

Please consider handling the tail case where `n_experts` is not a multiple of 16 and ensure the search terminates once an expert is found.
```cpp
uint32_t local_offsets[17];
bool expert_found = false;
for (int chunk_start = 0; chunk_start < n_experts; chunk_start += 16) {
  if (chunk_start + 16 < n_experts) {
    *reinterpret_cast<int4*>(local_offsets) =
        __ldca(reinterpret_cast<const int4*>(
            &input_offset_by_experts[chunk_start]));
    *reinterpret_cast<int4*>(local_offsets + 4) =
        __ldca(reinterpret_cast<const int4*>(
            &input_offset_by_experts[chunk_start + 4]));
    *reinterpret_cast<int4*>(local_offsets + 8) =
        __ldca(reinterpret_cast<const int4*>(
            &input_offset_by_experts[chunk_start + 8]));
    *reinterpret_cast<int4*>(local_offsets + 12) =
        __ldca(reinterpret_cast<const int4*>(
            &input_offset_by_experts[chunk_start + 12]));
    local_offsets[16] = __ldca(&input_offset_by_experts[chunk_start + 16]);
  } else {
#pragma unroll
    for (int i = 0; i < 17 && chunk_start + i <= n_experts; ++i) {
      local_offsets[i] = __ldca(&input_offset_by_experts[chunk_start + i]);
    }
  }
  // Check against the 16 loaded offsets
#pragma unroll
  for (int i = 0; i < 16 && chunk_start + i < n_experts; i++) {
    if (rowIdx >= local_offsets[i] && rowIdx < local_offsets[i + 1]) {
      rowIdx_in_expert = rowIdx - local_offsets[i];
      expert_idx = chunk_start + i;
      expert_found = true;
      break;
    }
  }
  if (expert_found) {
    break;
  }
}
```
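The control flow of that suggestion can be checked on the CPU. The sketch below replaces the `__ldca` vector loads with plain reads (the load mechanics are not what is at issue) and keeps only the chunk-of-16 scan with an explicit tail bound; `find_expert` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// offsets has n_experts + 1 entries; expert e owns rows
// [offsets[e], offsets[e + 1]). Scans in chunks of 16 and returns as soon
// as the owning expert is found, so the outer loop terminates correctly.
int find_expert(const std::vector<uint32_t>& offsets, uint32_t row_idx) {
  const int n_experts = static_cast<int>(offsets.size()) - 1;
  for (int chunk_start = 0; chunk_start < n_experts; chunk_start += 16) {
    const int limit = std::min(16, n_experts - chunk_start);  // tail bound
    for (int i = 0; i < limit; ++i) {
      if (row_idx >= offsets[chunk_start + i] &&
          row_idx < offsets[chunk_start + i + 1]) {
        return chunk_start + i;  // expert_idx; ends the whole search
      }
    }
  }
  return -1;  // row does not belong to any expert
}
```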
This logic for loading expert offsets into shared memory is incorrect and can lead to out-of-bounds memory access.

- The vectorized load `*reinterpret_cast<const int4*>` can read past the end of `input_offset_by_experts` if `n_experts` is not a multiple of 4.
- For `n_experts` that is not a multiple of 4, the offsets in the tail are not loaded into shared memory, except for the very last one at `input_offset_by_experts[n_experts]`.

This will cause the subsequent binary search to work with incomplete/incorrect data and potentially crash. Please handle the tail elements correctly.
```cpp
for (int i = threadIdx.x * 4; i < n_experts - 3; i += blockDim.x * 4) {
  *reinterpret_cast<int4*>(&shared_input_offsets[i]) =
      *reinterpret_cast<const int4*>(&input_offset_by_experts[i]);
}
// Handle the tail case.
for (int i = (n_experts / 4) * 4 + threadIdx.x; i < n_experts + 1;
     i += blockDim.x) {
  shared_input_offsets[i] = input_offset_by_experts[i];
}
```
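The claim behind that suggestion is that the vectorized main loop plus the scalar tail together copy every offset in `[0, n_experts]`, whether or not `n_experts` is a multiple of 4. This can be simulated on the CPU by looping over all thread ids of one block (names are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU model of the suggested shared-memory fill: a vectorized main loop over
// groups of 4 (an int4 load on the GPU) plus a scalar tail loop that also
// covers the sentinel src[n_experts].
void fill_shared(const std::vector<uint32_t>& src, std::vector<uint32_t>& dst,
                 int block_dim) {
  const int n_experts = static_cast<int>(src.size()) - 1;
  for (int tid = 0; tid < block_dim; ++tid) {
    // Main loop: each "thread" copies 4 consecutive offsets.
    for (int i = tid * 4; i < n_experts - 3; i += block_dim * 4) {
      for (int j = 0; j < 4; ++j) dst[i + j] = src[i + j];
    }
    // Tail: remaining offsets from the last full group of 4 onward.
    for (int i = (n_experts / 4) * 4 + tid; i < n_experts + 1; i += block_dim) {
      dst[i] = src[i];
    }
  }
}
```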
This kernel always uses a binary search to find the expert, regardless of the `SMALL_NUM_EXPERTS` template parameter. For a small number of experts, a simple linear scan is often more efficient than a binary search due to less branching and better instruction-level parallelism.
Please consider adding a conditional linear scan for the `SMALL_NUM_EXPERTS=true` case.
```cpp
if constexpr (SMALL_NUM_EXPERTS) {
  // Linear scan for small number of experts
  for (int i = 0; i < n_experts; i++) {
    if (rowIdx >= shared_input_offsets[i] &&
        rowIdx < shared_input_offsets[i + 1]) {
      rowIdx_in_expert = rowIdx - shared_input_offsets[i];
      expert_idx = i;
      break;
    }
  }
} else {
  // Binary search through experts using shared memory
  int left = 0, right = n_experts - 1;
  while (left <= right) {
    int mid = (left + right) / 2;
    // Get offsets: shared_input_offsets[i] corresponds to
    // input_offset_by_experts[i]
    uint32_t mid_offset = shared_input_offsets[mid];
    uint32_t next_offset = shared_input_offsets[mid + 1];
    if (rowIdx >= mid_offset && rowIdx < next_offset) {
      rowIdx_in_expert = rowIdx - mid_offset;
      expert_idx = mid;
      break;
    } else if (rowIdx < mid_offset) {
      right = mid - 1;
    } else {
      left = mid + 1;
    }
  }
}
```
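Either lookup strategy must return the same expert for every row, including when some experts own zero rows (equal adjacent offsets). A CPU sketch of both, with illustrative names:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// off has n_experts + 1 entries; expert i owns rows [off[i], off[i + 1]).
// Linear scan, as proposed for the SMALL_NUM_EXPERTS case.
int linear_lookup(const std::vector<uint32_t>& off, uint32_t row) {
  const int n = static_cast<int>(off.size()) - 1;
  for (int i = 0; i < n; ++i)
    if (row >= off[i] && row < off[i + 1]) return i;
  return -1;
}

// Binary search over the same offsets, mirroring the existing kernel logic.
int binary_lookup(const std::vector<uint32_t>& off, uint32_t row) {
  int left = 0, right = static_cast<int>(off.size()) - 2;
  while (left <= right) {
    const int mid = (left + right) / 2;
    if (row >= off[mid] && row < off[mid + 1]) return mid;
    if (row < off[mid]) right = mid - 1;
    else left = mid + 1;
  }
  return -1;
}
```

The linear scan wins for small `n_experts` mainly because its branches are cheap and predictable; the binary search wins asymptotically as `n_experts` grows.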
Thanks for porting the kernel updates over! Please add lm_eval results for the FP4 models.
Thank you for the suggestion. I added lm_eval results to section Accuracy Test.
Force-pushed: faaec46 → 2d86cd4 → b7f77f1
pavanimajety left a comment
LGTM, I reviewed the original vLLM PR.
HydraQYH left a comment
@jy-song-hub @yicwang @rainj-me Great job! Although this PR has been merged, I still think there are some points that could be improved. I have written them in the comments below; I hope you can help address them later.
```cpp
// Grid, Block size.
// Each thread converts 8 values.
dim3 block(std::min(int(k / ELTS_PER_THREAD), 512));
int const workSizePerRow = k / ELTS_PER_THREAD;
```
For lines 450–453, can we use an ATen function to get this?
```cpp
int const totalWorkSize = m_topk * workSizePerRow;
dim3 block(std::min(workSizePerRow, 512));
// Get number of blocks per SM (assume we can fully utilize the SM).
int const numBlocksPerSM = 2048 / block.x;
```
Will registers per thread and shared memory become bottlenecks that limit occupancy?
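This question is about a real limitation of the `2048 / block.x` heuristic: 2048 resident threads per SM is only the thread-count bound, and real occupancy is lower whenever registers per thread or shared memory per block is the scarcer resource (on the GPU, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports the true per-SM block count). The sketch below just reproduces the launch arithmetic under that thread-count assumption; all names are illustrative.

```cpp
#include <algorithm>
#include <cassert>

struct LaunchConfig { int block; int grid; };

// Launch arithmetic assuming 2048 resident threads per SM (thread-count
// bound only; register/shared-memory limits would lower blocks_per_sm).
LaunchConfig pick_launch(int work_per_row, int total_work, int num_sms) {
  const int block = std::min(work_per_row, 512);
  const int blocks_per_sm = 2048 / block;
  const int needed = (total_work + block - 1) / block;  // blocks to cover work
  // Clamp to what can stay resident; a grid-stride loop covers the rest
  // when needed exceeds the clamp.
  return {block, std::min(needed, blocks_per_sm * num_sms)};
}
```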
```cpp
// Binary search through experts using shared memory
int left = 0, right = n_experts - 1;
while (left <= right) {
```
It seems that indexing the global scale has some overhead in both versions of cvt_fp16_to_fp4. Could we speed up this process by using an additional map data structure?
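One way to read this suggestion: precompute a row → expert table once, so the hot loop indexes the global scale in O(1) instead of searching the offsets array. A hypothetical CPU sketch (not the kernel's API), trading one extra pass and O(total_rows) memory for search-free lookups:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// offsets has n_experts + 1 entries; expert e owns rows
// [offsets[e], offsets[e + 1]). Builds a dense row -> expert map; rows owned
// by no expert stay -1 (cannot happen with contiguous offsets, kept for
// safety).
std::vector<int> build_row_to_expert(const std::vector<uint32_t>& offsets) {
  const int n_experts = static_cast<int>(offsets.size()) - 1;
  std::vector<int> row_to_expert(offsets.back(), -1);
  for (int e = 0; e < n_experts; ++e)
    for (uint32_t r = offsets[e]; r < offsets[e + 1]; ++r)
      row_to_expert[r] = e;
  return row_to_expert;
}
```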
```diff
  uint32_t* output_scale_offset_by_experts,
- int n_experts) {
+ int n_experts,
+ bool low_latency) {
```
The low_latency parameter is unused except for function overloading. Could this parameter be converted to a non-type template parameter later, so the two versions of cvt_fp16_to_fp4 can be merged into a single kernel?
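A minimal sketch of that refactor, using a stand-in function rather than the real kernel signature: the runtime bool becomes a non-type template parameter, and `if constexpr` selects the variant at compile time, so both paths come from one definition with no dead parameter.

```cpp
#include <cassert>

// Stand-in for cvt_fp16_to_fp4: one templated definition replaces two
// overloads distinguished only by an unused bool argument. Requires C++17.
template <bool LowLatency>
int convert_path() {
  if constexpr (LowLatency) {
    return 1;  // compile-time-selected low-latency path
  } else {
    return 0;  // throughput path
  }
}
```

Callers instantiate `convert_path<true>()` or `convert_path<false>()`, mirroring how the two kernel launches would pick the variant.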
Co-authored-by: Rain Jiang <[email protected]>
Motivation
Port vLLM's FP4 MoE kernel optimization (PR #19500) to SGLang, improving performance of expert-based FP4 quantization on NVIDIA Blackwell GPUs.
Modifications
This PR introduces several optimizations for FP4 expert quantization:
- Grid-stride loop layout, replacing per-block row processing
- Launch configuration tuning based on SM count and problem size
- Expert offsets read directly into registers for small problems
- Expert offsets in shared memory with binary search for large problems
Accuracy Test
We verified correctness through lm_eval on the gsm8k dataset:
python -m lm_eval --model sglang --model_args pretrained=/models/DeepSeek-R1-FP4,tp_size=4,ep_size=4 --tasks gsm8k --num_fewshot 5 --device cuda --batch_size auto --output_path ./results.json
Verification environment
Benchmark & Profiling
We performed benchmarking and profiling via:
pytest -s -v test_fp4_moe.py
Benchmark environment
Checklist