[Quantization] add humming mxfp4 moe backend #41083
Merged: vllm-bot merged 7 commits into vllm-project:main from jinzhen-lin:humming_mxfp4_moe_backend on May 3, 2026.
Commits (7):
- 6575e94 add humming mxfp4 moe backend (jinzhen-lin)
- 70a813f fix comments (jinzhen-lin)
- 2addfaa fix (jinzhen-lin)
- 7c1aa48 fix pre-commit (jinzhen-lin)
- b76c081 fix typo (jinzhen-lin)
- 6255318 re -> regex (jinzhen-lin)
- a3698a7 disable grouped gemm by default (jinzhen-lin)
Review conversation:
bnellnm:
We should try to avoid passing the layer here if at all possible. It contains the modular kernels. If we ever construct the modular kernels at `__init__` time of the layer (which we are considering), then this will lead to all sorts of problems.
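For context, a minimal toy sketch of the construction-order hazard being described; the class and attribute names below are hypothetical and do not reflect vLLM's actual `FusedMoE` or modular-kernel interfaces.

```python
import torch


class ModularKernel:
    def __init__(self, layer):
        # Reading layer state at construction time is the fragile part:
        # if the kernel is built while the layer is still initializing,
        # these attributes may not exist yet.
        self.hidden_size = layer.w13_weight.shape[-1]
        self.layer = layer


class FusedMoELayer:
    def __init__(self):
        # If the layer ever constructs its modular kernels in its own
        # __init__ (the scenario described above), the kernel sees a
        # half-built layer: w13_weight is only assigned afterwards.
        self.kernel = ModularKernel(self)           # AttributeError here
        self.w13_weight = torch.empty(8, 128, 256)  # assigned too late


try:
    FusedMoELayer()
except AttributeError as exc:
    print("construction-order problem:", exc)
```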
jinzhen-lin:
Since Humming supports a wide variety of quantization combinations, the corresponding weight combinations are also quite numerous. To reduce complexity on the caller side, I prefer a layer-based approach. If directly passing the `FusedMoE` layer would cause issues, do you think it would be a good choice to extract all the required weights and reconstruct a temporary layer inside the modular kernels?
jinzhen-lin:
I don't quite understand what "construct the modular kernels at `__init__` time of the layer" means. Since the modular kernels currently require passing in a `FusedMoEQuantConfig`, and this config can only be fully defined after `process_weights_after_loading`, how are we supposed to construct the modular kernels at the `__init__` stage? Do you plan to pass these in as runtime variables?
bnellnm:
Even though the modular kernels require a `FusedMoEQuantConfig` at construction time, they don't really need much information from it (if any). We've been discussing removing this as a requirement for construction so that modular kernels can be instantiated at the same time as the quant methods that own them. This is to address other subtle order-of-initialization issues related to the `FusedMoE` layer, quant methods, SharedExperts, MoERunner, etc.
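A rough sketch of that direction, with made-up names (this is not vLLM's actual interface): decoupling the quant config from construction lets the kernel be created alongside its owning quant method, with quant details bound later.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class QuantConfigSketch:
    # Stand-in for FusedMoEQuantConfig; fields are illustrative only.
    quant_dtype: Optional[str] = None
    block_shape: Optional[Tuple[int, int]] = None


class ExpertsTodaySketch:
    # Current shape of the problem: the quant config is a constructor
    # requirement, so the kernel can only be built once
    # process_weights_after_loading has produced a complete config.
    def __init__(self, quant_config: QuantConfigSketch):
        self.quant_config = quant_config


class ExpertsProposedSketch:
    # Direction discussed above: nothing quant-specific is needed to
    # construct, so the kernel can be instantiated together with the
    # quant method that owns it; quant details are bound afterwards.
    def __init__(self) -> None:
        self.quant_config: Optional[QuantConfigSketch] = None

    def set_quant_config(self, quant_config: QuantConfigSketch) -> None:
        self.quant_config = quant_config
```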
jinzhen-lin:
So, are you planning to pass model parameters or layers as arguments to the `apply` function? (Many quantization methods have additional parameters besides weight and scale.) I can do the relevant refactoring work for Humming in advance.
bnellnm:
No, the layer will still be passed as a runtime arg to `apply`. It's only a problem when used as an argument to the `__init__` of any modular kernel objects.
jinzhen-lin:
Hi @bnellnm, since `layer` is not yet a parameter for methods like `apply`, `moe_problem_size`, or `workspace_shapes`, I can't remove it from the `__init__` arguments for now. Is there a timeline for the refactoring you mentioned?
bnellnm:
It looks like a number of the values you need are members of `FusedMoEConfig` or `FusedMoEQuantConfig`, which each MK has as attributes, e.g. hidden_size, number of experts, etc. The layer is not going to be passed down to the MK `apply` function. It is passed to the quant_method `apply` functions, and it is the quant_method's responsibility to unpack any data needed from the layer and pass it along to the MK.
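A minimal sketch of this division of responsibility, using hypothetical names rather than vLLM's exact signatures: the quant method's `apply` receives the layer, pulls out the concrete tensors, and forwards them to the MK, which takes sizes such as hidden_size and the number of experts from its own config-derived attributes.

```python
import torch


class HummingExpertsSketch:
    # Hypothetical experts/MK implementation. Sizes come from
    # FusedMoEConfig / FusedMoEQuantConfig-style attributes at
    # construction, not from the layer.
    def __init__(self, hidden_size: int, num_experts: int):
        self.hidden_size = hidden_size
        self.num_experts = num_experts

    def apply(self, x, w13, w2, w13_scale, w2_scale):
        # Only explicit tensors cross this boundary.
        ...


class HummingMoEMethodSketch:
    # Hypothetical quant-method wrapper.
    def __init__(self, hidden_size: int, num_experts: int):
        self.experts = HummingExpertsSketch(hidden_size, num_experts)

    def apply(self, layer, x: torch.Tensor):
        # The layer stops here: the quant method unpacks whatever the MK
        # needs and forwards it as plain arguments.
        return self.experts.apply(
            x,
            layer.w13_weight,
            layer.w2_weight,
            layer.w13_weight_scale,
            layer.w2_weight_scale,
        )
```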
jinzhen-lin:
Currently, the parameters for the MK `apply` function are fixed. Is there a way to pass specific parameters to the kernel? For example, the Humming MoE requires a `layer` object, and similarly, the Marlin MoE requires variables like `w13_g_idx`.
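One pattern that keeps the shared MK `apply` signature fixed is to bind kernel-specific tensors when the quant method constructs its experts object. The sketch below is illustrative only and does not reflect the actual Marlin MoE code in vLLM.

```python
import torch


class MarlinLikeExpertsSketch:
    # Illustrative only; not the real Marlin MoE implementation.
    def __init__(self, w13_g_idx: torch.Tensor, w2_g_idx: torch.Tensor):
        # Kernel-specific tensors are bound once, when the owning quant
        # method constructs its experts object, instead of widening the
        # shared MK apply signature.
        self.w13_g_idx = w13_g_idx
        self.w2_g_idx = w2_g_idx

    def apply(self, x, w13, w2):
        # Only the common arguments appear here; the kernel-specific
        # tensors are already available as attributes.
        ...
```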