
[NVIDIA] Enable trtllm fp4 moe for qwen #13556

Closed
kaixih wants to merge 2 commits into sgl-project:main from kaixih:enable_fp4_qwen

Conversation

@kaixih
Collaborator

@kaixih kaixih commented Nov 19, 2025

Motivation

Inspired by #13489, this PR enables the NVFP4 FlashInfer TRT-LLM MoE backend. With this change, users can now set --moe-runner-backend flashinfer_trtllm when using NVFP4 checkpoints such as nvidia/Qwen3-30B-A3B-NVFP4.

In our tests, accuracy remains unchanged while performance improves by roughly 8% compared to the Cutlass FP4 MoE backend.

Modifications

Accuracy Tests

```shell
# Before (with cutlass fp4 moe):
+ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8 --port 30000
100%|███████████████████████████████████████████████████████████████████████████| 1319/1319 [02:58<00:00,  7.39it/s]
Accuracy: 0.907
Invalid: 0.000
Latency: 178.658 s
Output throughput: 958.693 token/s

# After (with flashinfer trtllm fp4 moe):
+ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8 --port 30000
100%|███████████████████████████████████████████████████████████████████████████| 1319/1319 [02:42<00:00,  8.11it/s]
Accuracy: 0.900
Invalid: 0.000
Latency: 162.700 s
Output throughput: 1035.904 token/s
```
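As a quick sanity check (not part of the PR), the throughput figures above imply the quoted ~8% gain:

```python
# Throughput numbers copied from the benchmark logs above (token/s).
before = 958.693    # cutlass fp4 moe
after = 1035.904    # flashinfer trtllm fp4 moe

speedup = after / before - 1
print(f"Throughput improvement: {speedup:.1%}")  # roughly 8%
```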
```shell
model_str=nvidia/Qwen3-30B-A3B-NVFP4

if [[ "$1" == "server0" ]]; then
  python3 -m sglang.launch_server \
    --model-path $model_str \
    --trust-remote-code \
    --disable-radix-cache \
    --max-running-requests 256 \
    --chunked-prefill-size 1024 \
    --mem-fraction-static 0.89 \
    --max-prefill-tokens 16384 \
    --quantization modelopt_fp4
fi

if [[ "$1" == "server1" ]]; then
  python3 -m sglang.launch_server \
    --model-path $model_str \
    --trust-remote-code \
    --disable-radix-cache \
    --max-running-requests 256 \
    --chunked-prefill-size 1024 \
    --mem-fraction-static 0.89 \
    --max-prefill-tokens 16384 \
    --quantization modelopt_fp4 \
    --moe-runner-backend flashinfer_trtllm
fi

if [[ "$1" == "bench" ]]; then
  curl -X POST "http://127.0.0.1:30000/flush_cache"
  cd /scratch/repo/sglang
  python3 benchmark/gsm8k/bench_sglang.py \
      --num-questions 2000 \
      --parallel 2000 \
      --num-shots 8 \
      --port 30000
fi
```

Benchmarking and Profiling

Checklist

@kaixih kaixih changed the title Enable trtllm fp4 moe for qwen [NVIDIA] Enable trtllm fp4 moe for qwen Nov 19, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the NVFP4 FlashInfer TRT-LLM backend for Mixture-of-Experts (MoE) operations, specifically targeting Qwen models. The primary goal is to enhance the inference performance of NVFP4 quantized models by switching from the existing Cutlass FP4 MoE backend to the more optimized FlashInfer TRT-LLM implementation. Benchmarking demonstrates a significant throughput improvement of about 8% without any degradation in accuracy.

Highlights

  • Performance Improvement: Integrates the NVFP4 FlashInfer TRT-LLM Mixture-of-Experts (MoE) backend, leading to an approximate 8% performance increase for Qwen models using NVFP4 checkpoints, while maintaining accuracy.
  • Backend Configuration: Users can now specify "--moe-runner-backend flashinfer_trtllm" when running models like "nvidia/Qwen3-30B-A3B-NVFP4" to utilize the new optimized backend.
  • Code Refinements: Adjusts data type handling for "router_logits" and "correction_bias" within the "trtllm_fp4_block_scale_moe" function call, and ensures proper "routed_scaling_factor" and "routing_method_type" are passed.
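The refinement in the last bullet could be sketched roughly as below (numpy stands in for torch here, and the float32 requirement and the function name are assumptions drawn from the description, not from the PR diff):

```python
import numpy as np

def normalize_routing_inputs(router_logits, correction_bias=None):
    # Cast routing logits to float32 unconditionally; guard the optional
    # correction bias with a None check before casting, so a model without
    # a bias does not crash the call into the MoE kernel.
    router_logits = np.asarray(router_logits, dtype=np.float32)
    if correction_bias is not None:
        correction_bias = np.asarray(correction_bias, dtype=np.float32)
    return router_logits, correction_bias
```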

@kaixih
Collaborator Author

kaixih commented Nov 19, 2025

@Fridge003 @b8zhong

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances the FlashInferFP4MoE implementation by making it more generic and robust. The changes correctly enable support for models like Qwen by removing hardcoded routing logic and making it configurable. Additionally, the added None checks for correction_bias and routed_scaling_factor improve the code's resilience. I have one suggestion to further improve robustness by ensuring a default routing method is used when none is specified, preventing potential runtime errors.
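The reviewer's suggestion (fall back to a default when no routing method is specified) might look like this sketch; the names and default values are hypothetical, not taken from the PR:

```python
def resolve_routing_args(correction_bias=None,
                         routed_scaling_factor=None,
                         routing_method_type=None):
    """Hypothetical sketch of defensive defaults: substitute safe values
    instead of passing None through to the kernel, which could otherwise
    raise at runtime."""
    if routed_scaling_factor is None:
        routed_scaling_factor = 1.0          # identity scaling
    if routing_method_type is None:
        routing_method_type = "renormalize"  # assumed default; check the real enum
    return correction_bias, routed_scaling_factor, routing_method_type
```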

@b8zhong
Collaborator

b8zhong commented Nov 19, 2025

@samuellees @kaixih, I see #13427 (which looks like the same thing; up to you which to merge)

@kaixih
Collaborator Author

kaixih commented Nov 19, 2025

Emm, yes, it's the same. I didn't notice that one. Let me close this one and focus on #13427.

@junna2016

junna2016 commented Dec 9, 2025

When I try to run a test on SGLang v0.5.6 on a GB200 platform:

launch server

```shell
export SGL_ENABLE_JIT_DEEPGEMM=false

model_path=/workspace/Qwen3-30B-A3B-NVFP4
port=6677

/opt/conda310/bin/python -m sglang.launch_server \
  --model-path ${model_path} \
  --moe-runner-backend flashinfer_trtllm \
  --quantization modelopt_fp4 \
  --trust-remote-code \
  --disable-radix-cache \
  --port ${port} \
  --mem-fraction-static 0.89 \
  --max-running-requests 1024 \
  --chunked-prefill-size 16384 \
  --max-prefill-tokens 16384 \
  --attention-backend trtllm_mha
```

send request

```python
import requests
from sglang.utils import print_highlight

headers = {
    "Content-Type": "application/json",
}

# text = "The capital of China is " * 1000
text = "Who are you? " * 1000
temperature = 0
max_new_tokens = 32
data = {
    "text": text,
    "sampling_params": {
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
    },
    "stream": False,
}
base_url = "http://localhost:6677/generate"

response = requests.post(base_url, json=data, headers=headers)
response.raise_for_status()

print_highlight(response.json())
```

The answer is wrong:

```
{'text': ' � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �', 'output_ids': [32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495], 'meta_info': {'id': '8cc0a49dd334493f95373f90a117f99d', 'finish_reason': {'type': 'length', 'length': 32}, 'prompt_tokens': 4001, 'weight_version': 'default', 'total_retractions': 0, 'completion_tokens': 32, 'cached_tokens': 0, 'e2e_latency': 0.20084381103515625, 'response_sent_to_client_ts': 1765248005.6215777}}
```

@b8zhong
Collaborator

b8zhong commented Dec 11, 2025

@junna2016 Could you try the tip of main? A few bugs have been fixed recently.

```shell
python -m sglang.launch_server --model-path nvidia/Qwen3-30B-A3B-NVFP4 --moe-runner-backend flashinfer_trtllm --quantization modelopt_fp4 --trust-remote-code
python3 -m sglang.test.send_one --stream
```

This now produces a correct response:

Sure! Here's a simple example of a FastAPI server. This example includes a basic route that returns a JSON response.

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    return {"message": "Hello, World!"}

@app.get("/items/{item_id}")
def read_item(item_id: int, q: str = None):
    return {"item_id": item_id, "q": q}
```
