[NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked #9199
Conversation
Summary of Changes
Hello @wenscarl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates support for Flashinfer's CuteDSL masked grouped GEMM into the Mixture-of-Experts (MoE) implementation, aiming to provide a low-latency backend. It introduces a new configuration flag, updates the MoE layer to conditionally use this new path, and ensures compatibility with modelopt_fp4 quantization. Additionally, a new test utility for masked grouped GEMM operations is included.
Highlights
- **New MoE Backend Option**: Introduces `--enable-flashinfer-cutedsl-moe` as a server argument to activate the Flashinfer CuteDSL MoE implementation, targeting low-latency scenarios.
- **MoE Layer Integration**: `ep_moe/layer.py` is updated to conditionally dispatch to a new `forward_flashinfer_masked` method when the CuteDSL MoE is enabled and DeepEP low-latency mode is active.
- **Quantization Requirement**: Enforces `modelopt_fp4` quantization as a prerequisite for utilizing the Flashinfer CuteDSL MoE backend.
- **Masked Grouped GEMM Test Utility**: A new file `w4a4_bf16_masked.py` is added, containing a reference implementation and pytest-based correctness tests for masked grouped GEMM operations, which are fundamental to the CuteDSL functionality.
- **DeepSeekV2 Model Update**: The DeepSeekV2 model's initialization is modified to enable the Flashinfer CuteDSL MoE if the corresponding global server argument is set.
Code Review
This pull request adds support for Flashinfer's CuteDSL masked grouped GEMM for MoE layers. The changes span server arguments, the model implementation, and utility functions. My review has identified some critical issues, including an incomplete placeholder function that will cause runtime errors and the disabling of package version checks, which poses a significant risk. There are also several medium-severity issues related to code duplication, code cleanliness, and TODOs that should be addressed before merging.
python/sglang/srt/utils.py
Outdated
The implementation of assert_pkg_version has been commented out and replaced with pass. This effectively disables all package version checks, which could lead to hard-to-debug runtime errors if incompatible package versions are used. This change seems risky for a production environment. Was this intended for temporary debugging? If so, it should be reverted before merging.
```diff
-    pass
-    # try:
-    #     installed_version = version(pkg)
-    #     if pkg_version.parse(installed_version) < pkg_version.parse(min_version):
-    #         raise Exception(
-    #             f"{pkg} is installed with version {installed_version}, which "
-    #             f"is less than the minimum required version {min_version}. " + message
-    #         )
-    # except PackageNotFoundError:
-    #     raise Exception(
-    #         f"{pkg} with minimum required version {min_version} is not installed. "
-    #         + message
-    #     )
+    try:
+        installed_version = version(pkg)
+        if pkg_version.parse(installed_version) < pkg_version.parse(min_version):
+            raise Exception(
+                f"{pkg} is installed with version {installed_version}, which "
+                f"is less than the minimum required version {min_version}. " + message
+            )
+    except PackageNotFoundError:
+        raise Exception(
+            f"{pkg} with minimum required version {min_version} is not installed. "
+            + message
+        )
```
This new file appears to be a test file for masked grouped GEMM.

- It seems to be in the wrong directory. Test files are usually located in a `test/` directory.
- The filename `w4a4_bf16_masked.py` suggests it's for W4A4 quantization, but the content is a general correctness test for grouped GEMM.
- There is a large block of commented-out code at the top of the file. This should be removed before merging.

Could you please move this file to an appropriate test directory, rename it to reflect its purpose (e.g., `test_grouped_gemm.py`), and remove the dead code?
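For readers unfamiliar with the operation under test, a reference for masked grouped-GEMM semantics might look like the sketch below: for each expert group `g`, only the first `masked_m[g]` rows of the activations are valid, and only those rows of the output are computed. All shapes, names, and the `nt` layout interpretation (`out = a @ b.T`) are illustrative assumptions, not code from `w4a4_bf16_masked.py`.

```python
# Illustrative reference for masked grouped GEMM ("nt": B is transposed).
# Not the actual test-file code; shapes and names are assumptions.
import torch


def ref_grouped_gemm_nt_masked(
    a: torch.Tensor,         # [G, M, K] activations per expert group
    b: torch.Tensor,         # [G, N, K] weights; "nt" means out = a @ b.T
    masked_m: torch.Tensor,  # [G] number of valid rows in each group
) -> torch.Tensor:
    G, M, _ = a.shape
    N = b.shape[1]
    out = torch.zeros(G, M, N, dtype=a.dtype)
    for g in range(G):
        m = int(masked_m[g])
        # Only the first masked_m[g] rows are valid; rows past m stay zero.
        out[g, :m] = a[g, :m] @ b[g].transpose(0, 1)
    return out


torch.manual_seed(0)
G, M, N, K = 4, 8, 16, 32
a = torch.randn(G, M, K)
b = torch.randn(G, N, K)
masked_m = torch.tensor([8, 3, 0, 5])
out = ref_grouped_gemm_nt_masked(a, b, masked_m)
print(bool((out[1, 3:] == 0).all()))  # True: rows past masked_m[1]=3 untouched
```

A correctness test would then compare a kernel's output against this reference only on the valid rows of each group.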
python/sglang/srt/server_args.py
Outdated
Force-pushed from 4319c6c to 32bf3e4.
**kaixih** left a comment:
Thanks for the prompt work! I’ve left some comments mainly about the behavior and scope of the MoE function.
**fzyzcjy** left a comment:
LGTM since this is again only temporary work and will be refined and fused later (and thus there is no need for very high code detail quality). Only some optional nits
LGTM by directly looking at the code (i.e. dnk accuracy etc); I think maybe we can move to the next part (the e2e integration) to know whether everything here is ok or there is some accuracy bug.
Force-pushed from 4e314d3 to 3cea8af.

Force-pushed from 362c0f1 to 325031e.

Force-pushed from 4d8f43e to 4cac99f.
looks great, could you please resolve the conflicts and retest on latest main

main seems better now, could you please handle the lint

after CI is green + paste your test results using latest code on gpqa-diamond and math-500 I think it can be merged

math-500 gpqa-diamond:

hmm gpqa looks not good... could you please repeat at least 8-16 times and show all values (just the average value, i.e. the first row), and ensure generate 32k tokens. cc @kaixih I think you have modified my script and made it work on b200 or other cases - could you please share the script with @wenscarl.

64k gpqa: 32k

great

cc @zhyncs this looks good to merge


@kaixih @kushanam @fzyzcjy
Motivation
Add `grouped_gemm_nt_masked` from flashinfer to support nvfp4 MoE. This PR exposes 2 APIs: `flashinfer_cutedsl_grouped_gemm_nt_masked` and `flashinfer_cutedsl_grouped_gemm_nt_masked`. Depends on 9200.
The next step is to integrate into EpMoE.
Modifications
Accuracy Tests
Accuracy: 0.980
Invalid: 0.000
Latency: 288.874 s
Output throughput: 93.390 token/s
math-500 data-set:
this PR
Benchmarking and Profiling
Checklist