[NVIDIA] Enable trtllm fp4 moe for qwen #13556
kaixih wants to merge 2 commits into sgl-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces the NVFP4 FlashInfer TRT-LLM backend for Mixture-of-Experts (MoE) operations, specifically targeting Qwen models. The primary goal is to improve inference performance of NVFP4 quantized models by switching from the existing Cutlass FP4 MoE backend to the more optimized FlashInfer TRT-LLM implementation. Benchmarking demonstrates a throughput improvement of about 8% with no degradation in accuracy.
Code Review
This pull request enhances the FlashInferFP4MoE implementation by making it more generic and robust. The changes correctly enable support for models like Qwen by removing hardcoded routing logic and making it configurable. Additionally, the added None checks for correction_bias and routed_scaling_factor improve the code's resilience. I have one suggestion to further improve robustness by ensuring a default routing method is used when none is specified, preventing potential runtime errors.
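The reviewer's suggestion — fall back to a default routing method when none is configured — could be sketched as below. This is an illustrative sketch only: the names `RoutingMethod` and `get_routing_method` are hypothetical and not SGLang's actual API.

```python
# Hypothetical sketch of the suggested fix: when the model config does not
# specify a routing method, return a sensible default instead of raising
# at runtime. Enum members are illustrative, not SGLang's real names.
from enum import Enum

class RoutingMethod(Enum):
    RENORMALIZE = "renormalize"    # plain top-k softmax renormalization
    DEEPSEEK_V3 = "deepseek_v3"    # grouped top-k with correction bias

def get_routing_method(configured=None):
    """Return the configured routing method, defaulting to RENORMALIZE."""
    if configured is None:
        return RoutingMethod.RENORMALIZE
    return RoutingMethod(configured)
```

With this fallback, models like Qwen that do not set an explicit routing method still get a working default rather than a `None`-related runtime error.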
@samuellees @kaixih, I see #13427, which looks like the same change; up to you which one to merge.
Emm, yes, it's the same. I didn't notice that one. Let me close this one and focus on #13427.
When I try to run a test on SGLang v0.5.6 on a GB200 platform:

Launch server:

```shell
export SGL_ENABLE_JIT_DEEPGEMM=false
model_path=/workspace/Qwen3-30B-A3B-NVFP4
port=6677
```

Send request (the script as posted is truncated):

```python
import requests

API_KEY = None
headers = {
stream = False
# text = "The capital of China is " * 1000
response = requests.post(base_url, json=data, headers=headers)
print_highlight(response.json())
```

The answer is wrong:

```
{'text': ' � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �', 'output_ids': [32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495, 32495], 'meta_info': {'id': '8cc0a49dd334493f95373f90a117f99d', 'finish_reason': {'type': 'length', 'length': 32}, 'prompt_tokens': 4001, 'weight_version': 'default', 'total_retractions': 0, 'completion_tokens': 32, 'cached_tokens': 0, 'e2e_latency': 0.20084381103515625, 'response_sent_to_client_ts': 1765248005.6215777}}
```
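For reference, a self-contained sketch of such a request follows. It assumes SGLang's native `/generate` endpoint and the port (6677) from the report above; the payload fields shown are the commonly used ones, and the actual HTTP call is left out so the snippet runs without a live server.

```python
# Hedged reconstruction of the bug report's request. The endpoint path and
# payload shape assume SGLang's native generate API; port 6677 comes from
# the report. To actually send it: requests.post(base_url, json=data).
base_url = "http://localhost:6677/generate"

data = {
    "text": "The capital of China is " * 1000,   # long prompt from the report
    "sampling_params": {"max_new_tokens": 32},   # report shows 32 completion tokens
}

# Prompt length in characters (24 chars per repetition * 1000)
prompt_chars = len(data["text"])
```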
@junna2016 Could you try the tip of main? A few bugs have been fixed recently.
Motivation
Inspired by #13489, this PR enables the NVFP4 FlashInfer TRT-LLM MoE backend. With this change, users can set `--moe-runner-backend flashinfer_trtllm` when using NVFP4 checkpoints such as `nvidia/Qwen3-30B-A3B-NVFP4`. In our tests, accuracy remains unchanged while performance improves by roughly 8% compared to the Cutlass FP4 MoE backend.
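A launch invocation using the new backend might look as follows. The `--moe-runner-backend flashinfer_trtllm` flag and checkpoint name come from this PR description; the remaining flags and values (quantization mode, tensor-parallel size, port) are illustrative assumptions, not taken from the PR.

```shell
# Sketch: serving an NVFP4 checkpoint with the FlashInfer TRT-LLM MoE backend.
# Only the backend flag and model name are from the PR; other flags are examples.
python3 -m sglang.launch_server \
  --model-path nvidia/Qwen3-30B-A3B-NVFP4 \
  --moe-runner-backend flashinfer_trtllm \
  --tp 1 \
  --port 30000
```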
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist