
Conversation

@jy-song-hub (Contributor) commented Aug 4, 2025

Motivation

Port vLLM's FP4 MoE kernel optimization (PR #19500) to SGLang, improving performance of expert-based FP4 quantization on NVIDIA Blackwell GPUs.

Modifications

This PR introduces several optimizations for FP4 expert quantization:

  • Switched to a grid-stride loop layout to replace per-block row processing, enabling better thread-level parallelism.
  • Added launch configuration tuning: if grid size is smaller than the number of SMs and block size is large, we double the grid size and halve the block size to improve occupancy.
  • For small problem sizes where blocks are not reused, expert offsets are read into registers for low-overhead lookup.
  • For large problem sizes where blocks are reused, expert offsets are first loaded into shared memory and then accessed via binary search for efficiency.

Accuracy Test

We verified correctness with lm_eval on the gsm8k dataset:
python -m lm_eval --model sglang --model_args pretrained=/models/DeepSeek-R1-FP4,tp_size=4,ep_size=4 --tasks gsm8k --num_fewshot 5 --device cuda --batch_size auto --output_path ./results.json

Tasks  Version  Filter            n-shot  Metric       Value   Stderr
gsm8k  3        flexible-extract  5       exact_match  0.9568  ± 0.0056
                strict-match      5       exact_match  0.9560  ± 0.0056

Verification environment

  • Hardware: NVIDIA B200 (Blackwell GPU)
  • SGLang version: v0.4.9.post6
  • CUDA version: 12.8
  • lm_eval version: v0.4.9.1

Benchmark & Profiling

We performed benchmarking and profiling with:
pytest -s -v test_fp4_moe.py

Version    Avg Time (sec)  Speedup
Baseline   68.22           -
Optimized  59.17           ~+13%

Benchmark environment

  • Hardware: NVIDIA B200 (Blackwell GPU)
  • SGLang version: v0.4.9.post6
  • CUDA version: 12.8

Checklist

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @jy-song-hub, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've ported vLLM's FP4 Mixture-of-Experts (MoE) kernel optimization to SGLang, specifically targeting NVIDIA Blackwell GPUs. This work aims to significantly improve the performance of expert-based FP4 quantization. The optimizations include a new grid-stride loop layout for better parallelism, dynamic launch configuration tuning, and specialized strategies for handling expert offsets based on problem size: using registers for small problems and shared memory with binary search for larger ones. These changes have resulted in an approximate 13% speedup on NVIDIA B200 hardware.

Highlights

  • Grid-Stride Loop Layout: The kernel now utilizes a grid-stride loop layout, replacing the per-block row processing to enhance thread-level parallelism and overall GPU utilization.
  • Dynamic Launch Configuration: Implemented adaptive launch configuration tuning that adjusts grid and block sizes based on the number of Streaming Multiprocessors (SMs) and problem size, aiming to improve GPU occupancy.
  • Optimized Expert Offset Lookup (Small Problems): For smaller problem sizes where GPU blocks are not frequently reused, expert offsets are now efficiently read directly into registers, minimizing lookup overhead.
  • Optimized Expert Offset Lookup (Large Problems): For larger problem sizes involving block reuse, expert offsets are loaded into shared memory and accessed via a binary search, providing efficient lookup.
  • Performance Improvement: Achieved a notable performance gain of approximately 13% on NVIDIA B200 (Blackwell GPU) for FP4 MoE quantization, as verified by benchmarks.

@gemini-code-assist (bot) left a comment

Code Review

The PR introduces several optimizations to the FP4 MoE quantization kernel, including a grid-stride loop, launch tuning, and adaptive expert lookup strategies. These changes are well-aligned with the goal of improving performance. However, I've found a few critical issues in the implementation of the expert lookup logic, which could lead to out-of-bounds memory access and incorrect behavior. Specifically, the handling of cases where the number of experts is not a multiple of the vector size is incorrect in both the register-based and shared-memory-based kernels. I've also pointed out some areas for further optimization and robustness improvements. Addressing these issues is crucial for the correctness and stability of the kernel.

Comment on lines 300 to 321

critical

There are two issues in this expert search implementation:

  1. Out-of-bounds access: The vectorized loads (__ldca(reinterpret_cast<const int4*>(...))) can read past the end of the input_offset_by_experts array if n_experts is not a multiple of 16. This can lead to crashes or incorrect results.
  2. Incorrect loop control: The break statement only exits the inner loop. The outer chunk_start loop continues to iterate, which is inefficient and can lead to expert_idx being overwritten if another match is found (which would be incorrect).

Please consider handling the tail case where n_experts is not a multiple of 16 and ensure the search terminates once an expert is found.

      uint32_t local_offsets[17];
      bool expert_found = false;
      for (int chunk_start = 0; chunk_start < n_experts; chunk_start += 16) {
        if (chunk_start + 16 < n_experts) {
          *reinterpret_cast<int4*>(local_offsets) =
              __ldca(reinterpret_cast<const int4*>(
                  &input_offset_by_experts[chunk_start]));
          *reinterpret_cast<int4*>(local_offsets + 4) =
              __ldca(reinterpret_cast<const int4*>(
                  &input_offset_by_experts[chunk_start + 4]));
          *reinterpret_cast<int4*>(local_offsets + 8) =
              __ldca(reinterpret_cast<const int4*>(
                  &input_offset_by_experts[chunk_start + 8]));
          *reinterpret_cast<int4*>(local_offsets + 12) =
              __ldca(reinterpret_cast<const int4*>(
                  &input_offset_by_experts[chunk_start + 12]));
          local_offsets[16] = __ldca(&input_offset_by_experts[chunk_start + 16]);
        } else {
          #pragma unroll
          for (int i = 0; i < 17 && chunk_start + i <= n_experts; ++i) {
            local_offsets[i] = __ldca(&input_offset_by_experts[chunk_start + i]);
          }
        }

        // Check against the 16 loaded offsets
        #pragma unroll
        for (int i = 0; i < 16 && chunk_start + i < n_experts; i++) {
          if (rowIdx >= local_offsets[i] && rowIdx < local_offsets[i + 1]) {
            rowIdx_in_expert = rowIdx - local_offsets[i];
            expert_idx = chunk_start + i;
            expert_found = true;
            break;
          }
        }
        if (expert_found) {
          break;
        }
      }

Comment on lines 377 to 379

critical

This logic for loading expert offsets into shared memory is incorrect and can lead to out-of-bounds memory access.

  1. The vectorized load *reinterpret_cast<const int4*> can read past the end of input_offset_by_experts if n_experts is not a multiple of 4.
  2. For n_experts that are not a multiple of 4, the offsets in the tail part are not loaded into shared memory, except for the very last one at input_offset_by_experts[n_experts].

This will cause the subsequent binary search to work with incomplete/incorrect data and potentially crash. Please handle the tail elements correctly.

    for (int i = threadIdx.x * 4; i < n_experts - 3; i += blockDim.x * 4) {
      *reinterpret_cast<int4*>(&shared_input_offsets[i]) =
          *reinterpret_cast<const int4*>(&input_offset_by_experts[i]);
    }
    // Handle the tail case.
    for (int i = (n_experts / 4) * 4 + threadIdx.x; i < n_experts + 1;
         i += blockDim.x) {
      shared_input_offsets[i] = input_offset_by_experts[i];
    }

Comment on lines 407 to 420

medium

This kernel always uses a binary search to find the expert, regardless of the SMALL_NUM_EXPERTS template parameter. For a small number of experts, a simple linear scan is often more efficient than a binary search due to less branching and better instruction-level parallelism.

Please consider adding a conditional linear scan for the SMALL_NUM_EXPERTS=true case.

    if constexpr (SMALL_NUM_EXPERTS) {
      // Linear scan for small number of experts
      for (int i = 0; i < n_experts; i++) {
        if (rowIdx >= shared_input_offsets[i] &&
            rowIdx < shared_input_offsets[i + 1]) {
          rowIdx_in_expert = rowIdx - shared_input_offsets[i];
          expert_idx = i;
          break;
        }
      }
    } else {
      // Binary search through experts using shared memory
      int left = 0, right = n_experts - 1;
      while (left <= right) {
        int mid = (left + right) / 2;
        // Get offsets: shared_input_offsets[i] corresponds to
        // input_offset_by_experts[i]
        uint32_t mid_offset = shared_input_offsets[mid];
        uint32_t next_offset = shared_input_offsets[mid + 1];

        if (rowIdx >= mid_offset && rowIdx < next_offset) {
          rowIdx_in_expert = rowIdx - mid_offset;
          expert_idx = mid;
          break;
        } else if (rowIdx < mid_offset) {
          right = mid - 1;
        } else {
          left = mid + 1;
        }
      }
    }


medium

The calculation int mid = (left + right) / 2; can lead to an integer overflow if left and right are large positive integers. While it might not be an issue with current constraints, it's safer to use a formulation that avoids this potential overflow.

      int mid = left + (right - left) / 2;

@pavanimajety (Collaborator)

Thanks for porting the kernel updates over! Please add lm_eval results for the FP4 models.

@jy-song-hub (Contributor, Author)

> Thanks for porting the kernel updates over! Please add lm_eval results for the FP4 models.

Thank you for the suggestion. I added the lm_eval results to the Accuracy Test section.

@yicwang (Contributor) commented Aug 13, 2025

Hi @HydraQYH @BBuf @ch-wan. Do you guys have some time to review and have this merged?

@pavanimajety (Collaborator) left a comment

LGTM, I reviewed the original vLLM PR.

@zhyncs merged commit 4fc09e0 into sgl-project:main on Aug 15, 2025
52 of 56 checks passed
@HydraQYH (Collaborator) left a comment

@jy-song-hub @yicwang @rainj-me Great job! Although this PR has been merged, I still think there are some points that could be improved. I've written them in the comments below and hope you can address them in a follow-up.

// Grid, Block size.
// Each thread converts 8 values.
dim3 block(std::min(int(k / ELTS_PER_THREAD), 512));
int const workSizePerRow = k / ELTS_PER_THREAD;

For lines 450-453, can we use an ATen function to get this?

int const totalWorkSize = m_topk * workSizePerRow;
dim3 block(std::min(workSizePerRow, 512));
// Get number of blocks per SM (assume we can fully utilize the SM).
int const numBlocksPerSM = 2048 / block.x;

Will registers per thread and shared memory become bottlenecks that limit occupancy?


// Binary search through experts using shared memory
int left = 0, right = n_experts - 1;
while (left <= right) {

It seems that indexing the global scale has some overhead in both versions of cvt_fp16_to_fp4. Could we speed up this process with an additional map data structure?

uint32_t* output_scale_offset_by_experts,
int n_experts) {
int n_experts,
bool low_latency) {

The low_latency parameter is unused in the kernel body and serves only to select the function overload. Could it be converted to a non-type template parameter later, so the two versions of cvt_fp16_to_fp4 are merged into a single kernel?

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025