
Conversation

@jy-song-hub (Contributor) commented Aug 4, 2025

Motivation

Port vLLM's FP4 MoE kernel optimization (PR #19500) to SGLang, improving performance of expert-based FP4 quantization on NVIDIA Blackwell GPUs.

Modifications

This PR introduces several optimizations for FP4 expert quantization:

  • Switched to a grid-stride loop layout to replace per-block row processing, enabling better thread-level parallelism.
  • Added launch configuration tuning: if grid size is smaller than the number of SMs and block size is large, we double the grid size and halve the block size to improve occupancy.
  • For small problem sizes where blocks are not reused, expert offsets are read into registers for low-overhead lookup.
  • For large problem sizes where blocks are reused, expert offsets are first loaded into shared memory and then accessed via binary search for efficiency.

Accuracy Test

We verified correctness with lm_eval on the gsm8k dataset:
python -m lm_eval --model sglang --model_args pretrained=/models/DeepSeek-R1-FP4,tp_size=4,ep_size=4 --tasks gsm8k --num_fewshot 5 --device cuda --batch_size auto --output_path ./results.json

Tasks  Version  Filter            n-shot  Metric       Value   Stderr
gsm8k  3        flexible-extract  5       exact_match  0.9568  ± 0.0056
                strict-match      5       exact_match  0.9560  ± 0.0056

Verification environment

  • Hardware: NVIDIA B200 (Blackwell GPU)
  • SGLang version: v0.4.9.post6
  • CUDA version: 12.8
  • lm_eval version: v0.4.9.1

Benchmark & Profiling

We performed benchmarking and profiling with:
pytest -s -v test_fp4_moe.py

Version    Avg Time (sec)  Speedup
Baseline   68.22           -
Optimized  59.17           ~+13%

Benchmark environment

  • Hardware: NVIDIA B200 (Blackwell GPU)
  • SGLang version: v0.4.9.post6
  • CUDA version: 12.8

Checklist

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @jy-song-hub, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've ported vLLM's FP4 Mixture-of-Experts (MoE) kernel optimization to SGLang, specifically targeting NVIDIA Blackwell GPUs. This work aims to significantly improve the performance of expert-based FP4 quantization. The optimizations include a new grid-stride loop layout for better parallelism, dynamic launch configuration tuning, and specialized strategies for handling expert offsets based on problem size: using registers for small problems and shared memory with binary search for larger ones. These changes have resulted in an approximate 13% speedup on NVIDIA B200 hardware.

Highlights

  • Grid-Stride Loop Layout: The kernel now utilizes a grid-stride loop layout, replacing the per-block row processing to enhance thread-level parallelism and overall GPU utilization.
  • Dynamic Launch Configuration: Implemented adaptive launch configuration tuning that adjusts grid and block sizes based on the number of Streaming Multiprocessors (SMs) and problem size, aiming to improve GPU occupancy.
  • Optimized Expert Offset Lookup (Small Problems): For smaller problem sizes where GPU blocks are not frequently reused, expert offsets are now efficiently read directly into registers, minimizing lookup overhead.
  • Optimized Expert Offset Lookup (Large Problems): For larger problem sizes involving block reuse, expert offsets are loaded into shared memory and accessed via a binary search, providing efficient lookup.
  • Performance Improvement: Achieved a notable performance gain of approximately 13% on NVIDIA B200 (Blackwell GPU) for FP4 MoE quantization, as verified by benchmarks.

@gemini-code-assist (bot) left a comment

Code Review

The PR introduces several optimizations to the FP4 MoE quantization kernel, including a grid-stride loop, launch tuning, and adaptive expert lookup strategies. These changes are well-aligned with the goal of improving performance. However, I've found a few critical issues in the implementation of the expert lookup logic, which could lead to out-of-bounds memory access and incorrect behavior. Specifically, the handling of cases where the number of experts is not a multiple of the vector size is incorrect in both the register-based and shared-memory-based kernels. I've also pointed out some areas for further optimization and robustness improvements. Addressing these issues is crucial for the correctness and stability of the kernel.

Comment on lines 300 to 321

critical

There are two issues in this expert search implementation:

  1. Out-of-bounds access: The vectorized loads (__ldca(reinterpret_cast<const int4*>(...))) can read past the end of the input_offset_by_experts array if n_experts is not a multiple of 16. This can lead to crashes or incorrect results.
  2. Incorrect loop control: The break statement only exits the inner loop. The outer chunk_start loop continues to iterate, which is inefficient and can lead to expert_idx being overwritten if another match is found (which would be incorrect).

Please consider handling the tail case where n_experts is not a multiple of 16 and ensure the search terminates once an expert is found.

      uint32_t local_offsets[17];
      bool expert_found = false;
      for (int chunk_start = 0; chunk_start < n_experts; chunk_start += 16) {
        if (chunk_start + 16 < n_experts) {
          *reinterpret_cast<int4*>(local_offsets) =
              __ldca(reinterpret_cast<const int4*>(
                  &input_offset_by_experts[chunk_start]));
          *reinterpret_cast<int4*>(local_offsets + 4) =
              __ldca(reinterpret_cast<const int4*>(
                  &input_offset_by_experts[chunk_start + 4]));
          *reinterpret_cast<int4*>(local_offsets + 8) =
              __ldca(reinterpret_cast<const int4*>(
                  &input_offset_by_experts[chunk_start + 8]));
          *reinterpret_cast<int4*>(local_offsets + 12) =
              __ldca(reinterpret_cast<const int4*>(
                  &input_offset_by_experts[chunk_start + 12]));
          local_offsets[16] = __ldca(&input_offset_by_experts[chunk_start + 16]);
        } else {
          #pragma unroll
          for (int i = 0; i < 17 && chunk_start + i <= n_experts; ++i) {
            local_offsets[i] = __ldca(&input_offset_by_experts[chunk_start + i]);
          }
        }

        // Check against the 16 loaded offsets
        #pragma unroll
        for (int i = 0; i < 16 && chunk_start + i < n_experts; i++) {
          if (rowIdx >= local_offsets[i] && rowIdx < local_offsets[i + 1]) {
            rowIdx_in_expert = rowIdx - local_offsets[i];
            expert_idx = chunk_start + i;
            expert_found = true;
            break;
          }
        }
        if (expert_found) {
          break;
        }
      }

Comment on lines 377 to 379

critical

This logic for loading expert offsets into shared memory is incorrect and can lead to out-of-bounds memory access.

  1. The vectorized load *reinterpret_cast<const int4*> can read past the end of input_offset_by_experts if n_experts is not a multiple of 4.
  2. For n_experts that are not a multiple of 4, the offsets in the tail part are not loaded into shared memory, except for the very last one at input_offset_by_experts[n_experts].

This will cause the subsequent binary search to work with incomplete/incorrect data and potentially crash. Please handle the tail elements correctly.

    for (int i = threadIdx.x * 4; i < n_experts - 3; i += blockDim.x * 4) {
      *reinterpret_cast<int4*>(&shared_input_offsets[i]) =
          *reinterpret_cast<const int4*>(&input_offset_by_experts[i]);
    }
    // Handle the tail case.
    for (int i = (n_experts / 4) * 4 + threadIdx.x; i < n_experts + 1;
         i += blockDim.x) {
      shared_input_offsets[i] = input_offset_by_experts[i];
    }

Comment on lines 407 to 420

medium

This kernel always uses a binary search to find the expert, regardless of the SMALL_NUM_EXPERTS template parameter. For a small number of experts, a simple linear scan is often more efficient than a binary search due to less branching and better instruction-level parallelism.

Please consider adding a conditional linear scan for the SMALL_NUM_EXPERTS=true case.

    if constexpr (SMALL_NUM_EXPERTS) {
      // Linear scan for small number of experts
      for (int i = 0; i < n_experts; i++) {
        if (rowIdx >= shared_input_offsets[i] &&
            rowIdx < shared_input_offsets[i + 1]) {
          rowIdx_in_expert = rowIdx - shared_input_offsets[i];
          expert_idx = i;
          break;
        }
      }
    } else {
      // Binary search through experts using shared memory
      int left = 0, right = n_experts - 1;
      while (left <= right) {
        int mid = (left + right) / 2;
        // Get offsets: shared_input_offsets[i] corresponds to
        // input_offset_by_experts[i]
        uint32_t mid_offset = shared_input_offsets[mid];
        uint32_t next_offset = shared_input_offsets[mid + 1];

        if (rowIdx >= mid_offset && rowIdx < next_offset) {
          rowIdx_in_expert = rowIdx - mid_offset;
          expert_idx = mid;
          break;
        } else if (rowIdx < mid_offset) {
          right = mid - 1;
        } else {
          left = mid + 1;
        }
      }
    }


medium

The calculation int mid = (left + right) / 2; can lead to an integer overflow if left and right are large positive integers. While it might not be an issue with current constraints, it's safer to use a formulation that avoids this potential overflow.

      int mid = left + (right - left) / 2;

@pavanimajety (Collaborator)

Thanks for porting the kernel updates over! Please add lm_eval results for the FP4 models.

@jy-song-hub (Contributor, Author)

> Thanks for porting the kernel updates over! Please add lm_eval results for the FP4 models.

Thank you for the suggestion. I added the lm_eval results to the Accuracy Test section.

@yicwang (Contributor) commented Aug 13, 2025

Hi @HydraQYH @BBuf @ch-wan. Do you guys have some time to review and have this merged?

@pavanimajety (Collaborator) left a comment

LGTM, I reviewed the original vLLM PR.

@zhyncs merged commit 4fc09e0 into sgl-project:main on Aug 15, 2025
52 of 56 checks passed
@HydraQYH (Collaborator) left a comment

@jy-song-hub @yicwang @rainj-me Great job! Although this PR has been merged, I still think there are some points that could be improved. I've written them in the comments below and hope you can address them in a follow-up.

// Grid, Block size.
// Each thread converts 8 values.
dim3 block(std::min(int(k / ELTS_PER_THREAD), 512));
int const workSizePerRow = k / ELTS_PER_THREAD;

For lines 450-453, can we use an ATen function to get this?

int const totalWorkSize = m_topk * workSizePerRow;
dim3 block(std::min(workSizePerRow, 512));
// Get number of blocks per SM (assume we can fully utilize the SM).
int const numBlocksPerSM = 2048 / block.x;

Will registers per thread and shared memory become bottlenecks that limit occupancy?


// Binary search through experts using shared memory
int left = 0, right = n_experts - 1;
while (left <= right) {

It seems that indexing the global scale has some overhead in both versions of cvt_fp16_to_fp4. Could we speed up this process with an additional map data structure?

uint32_t* output_scale_offset_by_experts,
int n_experts) {
int n_experts,
bool low_latency) {

The low_latency parameter is unused in the kernel body and serves only to select the function overload. Could it be converted to a non-type template parameter later, so the two versions of cvt_fp16_to_fp4 are merged into a single kernel?

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025