Enable Flashinfer TRTLLM-GEN-MoE FP8 blockwise kernel for Qwen3-Next on Blackwell #12543
Merged: ispobock merged 15 commits into sgl-project:main from samuellees:trtllm-gen-moe-fp8-q3n on Nov 13, 2025
Commits
1dd327e  Enable flashinfer-trtllm-gen-moe fp8 blockwise backend for models lik… (samuellees)
4acf5d8  Merge branch 'main' into trtllm-gen-moe-fp8-q3n (samuellees)
b5a0207  reformat (samuellees)
24edae9  Merge branch 'main' into trtllm-gen-moe-fp8-q3n (samuellees)
507aa06  Merge branch 'main' into trtllm-gen-moe-fp8-q3n (samuellees)
836dc27  Add unit test for trtllm-gen-moe (samuellees)
f18434c  Merge branch 'main' into trtllm-gen-moe-fp8-q3n (samuellees)
587baaf  Merge branch 'main' into trtllm-gen-moe-fp8-q3n (Fridge003)
9167429  Merge branch 'main' into trtllm-gen-moe-fp8-q3n (samuellees)
8516b4f  Merge branch 'trtllm-gen-moe-fp8-q3n' of github.com:samuellees/sglang… (samuellees)
4e84d10  enable test case (samuellees)
7d875fc  Refactor RoutingMethodType to avoid import error (samuellees)
dda44d9  Merge branch 'main' into trtllm-gen-moe-fp8-q3n (samuellees)
82f96f1  move unit test into nightly dir (samuellees)
82c6f5c  Merge branch 'main' into trtllm-gen-moe-fp8-q3n (samuellees)
65 changes: 65 additions & 0 deletions
test/srt/nightly/test_flashinfer_trtllm_gen_moe_backend.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| import os | ||
| import unittest | ||
| from types import SimpleNamespace | ||
| | ||
| from sglang.srt.utils import kill_process_tree | ||
| from sglang.test.few_shot_gsm8k import run_eval | ||
| from sglang.test.test_utils import ( | ||
| DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH, | ||
| DEFAULT_URL_FOR_TEST, | ||
| CustomTestCase, | ||
| popen_launch_server, | ||
| ) | ||
| | ||
| | ||
| class TestFlashinferTrtllmGenMoeBackend(CustomTestCase): | ||
| @classmethod | ||
| def setUpClass(cls): | ||
| cls.model = "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8" | ||
| cls.base_url = DEFAULT_URL_FOR_TEST | ||
| cls.process = popen_launch_server( | ||
| cls.model, | ||
| cls.base_url, | ||
| timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH, | ||
| env={**os.environ, "SGLANG_ENABLE_JIT_DEEPGEMM": "False"}, | ||
| other_args=[ | ||
| "--attention-backend", | ||
| "triton", | ||
| "--moe-runner-backend", | ||
| "flashinfer_trtllm", | ||
| "--cuda-graph-max-bs", | ||
| "512", | ||
| "--tp-size", | ||
| "4", | ||
| "--ep-size", | ||
| "4", | ||
| "--mem-fraction-static", | ||
| "0.7", | ||
| "--mamba-ssm-dtype", | ||
| "bfloat16", | ||
| "--quantization", | ||
| "fp8", | ||
| ], | ||
| ) | ||
| | ||
| @classmethod | ||
| def tearDownClass(cls): | ||
| kill_process_tree(cls.process.pid) | ||
| | ||
| def test_gsm8k(self): | ||
| args = SimpleNamespace( | ||
| num_shots=5, | ||
| data_path=None, | ||
| num_questions=200, | ||
| max_new_tokens=512, | ||
| parallel=128, | ||
| host="http://127.0.0.1", | ||
| port=int(self.base_url.split(":")[-1]), | ||
| ) | ||
| metrics = run_eval(args) | ||
| print(f"{metrics=}") | ||
| self.assertGreater(metrics["accuracy"], 0.93) | ||
| | ||
| | ||
| if __name__ == "__main__": | ||
| unittest.main() |
The current logic for getting `routing_method_type` can result in `None` being passed to the `trtllm_fp8_block_scale_moe` kernel. If a model using `FlashInferFusedMoE` does not specify `routing_method_type`, it defaults to `None` in `FusedMoE.__init__`. In this case, `getattr(layer, "routing_method_type", ...)` will return `None`.

The kernel previously used a hardcoded value and likely does not handle `None`, which could lead to a runtime error. To make this more robust and ensure backward compatibility, it's better to explicitly check for `None` and fall back to `RoutingMethodType.DeepSeekV3`.
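A minimal sketch of the suggested fallback, not the code from this PR. The `RoutingMethodType` enum below is a stand-in with illustrative member values, `resolve_routing_method_type` is a hypothetical helper name, and `SimpleNamespace` objects stand in for MoE layers. It shows why the default argument of `getattr` alone is not enough: the default is only used when the attribute is missing, not when it exists and is `None`.

```python
from enum import IntEnum
from types import SimpleNamespace


class RoutingMethodType(IntEnum):
    # Stand-in for the real enum; these member values are illustrative only.
    Renormalize = 1
    DeepSeekV3 = 2


def resolve_routing_method_type(layer):
    """Hypothetical helper: return the layer's routing method, or DeepSeekV3 if unset."""
    routing_method_type = getattr(layer, "routing_method_type", None)
    # Covers both cases: attribute missing, and attribute present but None.
    if routing_method_type is None:
        return RoutingMethodType.DeepSeekV3
    return routing_method_type


# Stand-in layers: attribute left at None, attribute absent, attribute set.
layer_unset = SimpleNamespace(routing_method_type=None)
layer_missing = SimpleNamespace()
layer_set = SimpleNamespace(routing_method_type=RoutingMethodType.Renormalize)

# The pitfall: getattr's default does not apply when the attribute exists as None.
naive = getattr(layer_unset, "routing_method_type", RoutingMethodType.DeepSeekV3)
```

Here `naive` is `None`, while `resolve_routing_method_type(layer_unset)` and `resolve_routing_method_type(layer_missing)` both yield `RoutingMethodType.DeepSeekV3`, and an explicitly configured layer keeps its own value.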