
[NVIDIA] Enable trtllm fp4 moe for qwen #13556

Closed
kaixih wants to merge 2 commits into sgl-project:main from kaixih:enable_fp4_qwen

Conversation

@kaixih
Collaborator

@kaixih kaixih commented Nov 19, 2025

Motivation

Inspired by #13489, this PR enables the NVFP4 FlashInfer TRT-LLM MoE backend. With this change, users can now set --moe-runner-backend flashinfer_trtllm when using NVFP4 checkpoints such as nvidia/Qwen3-30B-A3B-NVFP4.

In our tests, accuracy remains unchanged while performance improves by roughly 8% compared to the Cutlass FP4 MoE backend.

Modifications

Accuracy Tests

```shell
# Before (with cutlass fp4 moe):
+ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8 --port 30000
100%|███████████████████████████████████████████████████████████████████████████| 1319/1319 [02:58<00:00,  7.39it/s]
Accuracy: 0.907
Invalid: 0.000
Latency: 178.658 s
Output throughput: 958.693 token/s

# After (with flashinfer trtllm fp4 moe):
+ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8 --port 30000
100%|███████████████████████████████████████████████████████████████████████████| 1319/1319 [02:42<00:00,  8.11it/s]
Accuracy: 0.900
Invalid: 0.000
Latency: 162.700 s
Output throughput: 1035.904 token/s
```
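As a quick sanity check (not part of the PR), the throughput figures above imply the quoted ~8% gain:

```python
# Throughput numbers copied from the benchmark logs above (token/s).
before = 958.693    # cutlass fp4 moe
after = 1035.904    # flashinfer trtllm fp4 moe

speedup = after / before - 1
print(f"Throughput improvement: {speedup:.1%}")  # roughly 8%
```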
```shell
model_str=nvidia/Qwen3-30B-A3B-NVFP4

if [[ "$1" == "server0" ]]; then
  python3 -m sglang.launch_server \
    --model-path $model_str \
    --trust-remote-code \
    --disable-radix-cache \
    --max-running-requests 256 \
    --chunked-prefill-size 1024 \
    --mem-fraction-static 0.89 \
    --max-prefill-tokens 16384 \
    --quantization modelopt_fp4
fi

if [[ "$1" == "server1" ]]; then
  python3 -m sglang.launch_server \
    --model-path $model_str \
    --trust-remote-code \
    --disable-radix-cache \
    --max-running-requests 256 \
    --chunked-prefill-size 1024 \
    --mem-fraction-static 0.89 \
    --max-prefill-tokens 16384 \
    --quantization modelopt_fp4 \
    --moe-runner-backend flashinfer_trtllm
fi

if [[ "$1" == "bench" ]]; then
  curl -X POST "http://127.0.0.1:30000/flush_cache"
  cd /scratch/repo/sglang
  python3 benchmark/gsm8k/bench_sglang.py \
      --num-questions 2000 \
      --parallel 2000 \
      --num-shots 8 \
      --port 30000
fi
```

Benchmarking and Profiling

Checklist

@kaixih kaixih changed the title Enable trtllm fp4 moe for qwen [NVIDIA] Enable trtllm fp4 moe for qwen Nov 19, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the NVFP4 FlashInfer TRT-LLM backend for Mixture-of-Experts (MoE) operations, specifically targeting Qwen models. The primary goal is to enhance the inference performance of NVFP4 quantized models by switching from the existing Cutlass FP4 MoE backend to the more optimized FlashInfer TRT-LLM implementation. Benchmarking demonstrates a significant throughput improvement of about 8% without any degradation in accuracy.

Highlights

  • Performance Improvement: Integrates the NVFP4 FlashInfer TRT-LLM Mixture-of-Experts (MoE) backend, leading to an approximate 8% performance increase for Qwen models using NVFP4 checkpoints, while maintaining accuracy.
  • Backend Configuration: Users can now specify "--moe-runner-backend flashinfer_trtllm" when running models like "nvidia/Qwen3-30B-A3B-NVFP4" to utilize the new optimized backend.
  • Code Refinements: Adjusts data type handling for "router_logits" and "correction_bias" within the "trtllm_fp4_block_scale_moe" function call, and ensures proper "routed_scaling_factor" and "routing_method_type" are passed.
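The refinement in the last bullet could be sketched roughly as below (numpy stands in for torch here, and the float32 requirement and the function name are assumptions drawn from the description, not from the PR diff):

```python
import numpy as np

def normalize_routing_inputs(router_logits, correction_bias=None):
    # Cast routing logits to float32 unconditionally; guard the optional
    # correction bias with a None check before casting, so a model without
    # a bias does not crash the call into the MoE kernel.
    router_logits = np.asarray(router_logits, dtype=np.float32)
    if correction_bias is not None:
        correction_bias = np.asarray(correction_bias, dtype=np.float32)
    return router_logits, correction_bias
```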

@kaixih
Collaborator Author

kaixih commented Nov 19, 2025

@Fridge003 @b8zhong

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances the FlashInferFP4MoE implementation by making it more generic and robust. The changes correctly enable support for models like Qwen by removing hardcoded routing logic and making it configurable. Additionally, the added None checks for correction_bias and routed_scaling_factor improve the code's resilience. I have one suggestion to further improve robustness by ensuring a default routing method is used when none is specified, preventing potential runtime errors.
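The reviewer's suggestion (fall back to a default when no routing method is specified) might look like this sketch; the names and default values are hypothetical, not taken from the PR:

```python
def resolve_routing_args(correction_bias=None,
                         routed_scaling_factor=None,
                         routing_method_type=None):
    """Hypothetical sketch of defensive defaults: substitute safe values
    instead of passing None through to the kernel, which could otherwise
    raise at runtime."""
    if routed_scaling_factor is None:
        routed_scaling_factor = 1.0          # identity scaling
    if routing_method_type is None:
        routing_method_type = "renormalize"  # assumed default; check the real enum
    return correction_bias, routed_scaling_factor, routing_method_type
```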

@b8zhong
Collaborator

b8zhong commented Nov 19, 2025

@samuellees @kaixih, I see #13427 (which looks like the same thing; up to you which to merge)

@kaixih
Collaborator Author

kaixih commented Nov 19, 2025

Emm, yes, it's the same. I didn't notice that one. Let me close this one and focus on #13427.

@junna2016

junna2016 commented Dec 9, 2025

When I try to run a test on SGLang v0.5.6 on a GB200 platform:

launch server

```shell
export SGL_ENABLE_JIT_DEEPGEMM=false

model_path=/workspace/Qwen3-30B-A3B-NVFP4
port=6677

/opt/conda310/bin/python -m sglang.launch_server \
  --model-path ${model_path} \
  --moe-runner-backend flashinfer_trtllm \
  --quantization modelopt_fp4 \
  --trust-remote-code \
  --disable-radix-cache \
  --port ${port} \
  --mem-fraction-static 0.89 \
  --max-running-requests 1024 \
  --chunked-prefill-size 16384 \
  --max-prefill-tokens 16384 \
  --attention-backend trtllm_mha
```

send request

```python
import requests
from sglang.utils import print_highlight

headers = {
    "Content-Type": "application/json",
}

# text = "The capital of China is " * 1000
text = "Who are you? " * 1000
temperature = 0
max_new_tokens = 32
data = {
    "text": text,
    "sampling_params": {
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
    },
    "stream": False,
}
base_url = "http://localhost:6677/generate"

response = requests.post(base_url, json=data, headers=headers)
response.raise_for_status()

print_highlight(response.json())
```

The answer is wrong:

```
{'text': ' � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �', 'output_ids': [32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495], 'meta_info': {'id': '8cc0a49dd334493f95373f90a117f99d', 'finish_reason': {'type': 'length', 'length': 32}, 'prompt_tokens': 4001, 'weight_version': 'default', 'total_retractions': 0, 'completion_tokens': 32, 'cached_tokens': 0, 'e2e_latency': 0.20084381103515625, 'response_sent_to_client_ts': 1765248005.6215777}}
```

@b8zhong
Collaborator

b8zhong commented Dec 11, 2025

@junna2016 Could you try the tip of main? A few bugs have been fixed recently.

```shell
python -m sglang.launch_server --model-path nvidia/Qwen3-30B-A3B-NVFP4 --moe-runner-backend flashinfer_trtllm --quantization modelopt_fp4 --trust-remote-code
python3 -m sglang.test.send_one --stream
```

This now produces a correct response:

Sure! Here's a simple example of a FastAPI server. This example includes a basic route that returns a JSON response.

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    return {"message": "Hello, World!"}

@app.get("/items/{item_id}")
def read_item(item_id: int, q: str = None):
    return {"item_id": item_id, "q": q}
```
