[Quantization] add humming mxfp4 moe backend #41083
vllm-bot merged 7 commits into vllm-project:main
Conversation
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Code Review
This pull request integrates the Humming mixed-precision kernels into the vLLM framework, focusing on Mixture of Experts (MoE) layers and MXFP4 quantization support. Key changes include the addition of the 'humming' kernel type, refactoring of the expert classes to use standardized configuration objects, and a new utility module, humming_utils.py, for layer preparation and quantization configuration. The review feedback identifies several critical issues in the new code: an unnecessary self parameter in both the standalone function humming_is_layer_skipped and the static method humming_gemm_type, which would lead to runtime errors, as well as a logic error in the shape_config calculation for non-gated models.
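A minimal sketch of the first failure mode the review describes; the function name comes from the PR, but the signature and body below are hypothetical (the same reasoning applies to the stray self parameter on the static method):

```python
# Hypothetical reproduction: a module-level helper declared with a stray
# `self` parameter shifts every positional argument by one slot when the
# function is called as a free function.
def humming_is_layer_skipped(self, prefix, ignored_layers):  # buggy signature
    return any(prefix.startswith(name) for name in ignored_layers)

# humming_is_layer_skipped("model.layers.0.mlp", ["model.layers.0.mlp"])
# -> TypeError: humming_is_layer_skipped() missing 1 required positional
#    argument: 'ignored_layers'

# Fix: drop `self`, since there is no instance for it to bind to.
def humming_is_layer_skipped_fixed(prefix, ignored_layers):
    return any(prefix.startswith(name) for name in ignored_layers)
```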
Hi @jinzhen-lin, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
@jinzhen-lin Nice job. Could you add the corresponding accuracy comparison too?

Hi @huangzhilin-hzl, thank you for your interest. The accuracy tests are done now.
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com> Signed-off-by: Joachim Studnia <joachim@mistral.ai>
@@ -60,49 +59,44 @@ class HummingExpertsBase(mk.FusedMoEExpertsModular):
    def __init__(
        self,
        layer: torch.nn.Module,
We should try to avoid passing the layer here if at all possible. It contains the modular kernels. If we ever construct the modular kernels at __init__ time of the layer (which we are considering) then this will lead to all sorts of problems.
Since Humming supports a wide variety of quantization combinations, the corresponding weight combinations are also quite numerous. To reduce the complexity on the caller side, I prefer to use a layer-based approach. If directly passing the FusedMoE layer would cause issues, do you think it would be a good choice to directly extract all the required weights and reconstruct a temporary layer inside the modular kernels?
I don't quite understand what "construct the modular kernels at __init__ time of the layer" means. Since the modular kernels currently require passing in a FusedMoEQuantConfig, and this config can only be fully defined after process_weights_after_loading, how are we supposed to construct the modular kernels at the __init__ stage? Do you plan to pass these in as runtime variables?
Even though the modular kernels require a FusedMoEQuantConfig at construction time, they don't really need much information from it (if any). We've been discussing removing this as a requirement for construction so that modular kernels can be instantiated at the same time as the quant methods that own them. This is to address other subtle order of initialization issues related to the FusedMoE layer, quant methods, SharedExperts, MoERunner, etc.
So, are you planning to pass model parameters or layers as arguments to the apply function? (Many quantization methods have additional parameters besides weight and scale.) I can do the relevant refactoring work for humming in advance.
No, the layer will still be passed as a runtime arg to apply. It's only a problem when used as an argument to __init__ any modular kernel objects.
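A rough sketch of the direction described above, with hypothetical class and attribute names: nothing layer-shaped is captured at construction time, and the FusedMoE layer only appears as a runtime argument to apply:

```python
import torch


class HummingExpertsSketch:
    """Hypothetical modular-kernel-style experts class (not the PR's API)."""

    def __init__(self, num_experts: int, quant_dtype: str = "mxfp4"):
        # Only plain configuration is stored here, so this object can be
        # built at the same time as the quant method that owns it, before
        # process_weights_after_loading has run.
        self.num_experts = num_experts
        self.quant_dtype = quant_dtype

    def apply(self, layer: torch.nn.Module, hidden_states: torch.Tensor) -> torch.Tensor:
        # The layer, and any extra quantized parameters it carries (weights,
        # scales, zero points, ...), is read at call time instead.
        w13 = getattr(layer, "w13_weight", None)  # assumed attribute name
        w2 = getattr(layer, "w2_weight", None)    # assumed attribute name
        # ... fused MoE computation would go here ...
        return hidden_states
```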
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com> Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
This PR adds a Humming MXFP4 MoE backend.
Humming project: https://github.com/inclusionAI/humming/
In #34556, we added an initial integration of Humming. Now I am working on integrating the Humming backends into the dense/MoE kernel oracles.
This backend supports running DeepSeek-V4.
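For readers unfamiliar with the format: MXFP4 stores 4-bit E2M1 values together with one shared power-of-two (E8M0) scale per block of 32 elements. Below is a plain-PyTorch dequantization reference for orientation only; it is not the kernel this PR adds, and the unpacked-code layout is an assumption:

```python
import torch

# The 16 representable E2M1 values, indexed by the 4-bit code
# (sign bit is the MSB, so codes 8-15 are the negated magnitudes).
E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
)


def dequant_mxfp4(codes: torch.Tensor, scale_exp: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Reference MXFP4 dequantization.

    codes:     1-D integer tensor of 4-bit codes (already unpacked from the
               packed uint8 storage), length divisible by `block`.
    scale_exp: integer tensor of biased E8M0 exponents, one per block.
    """
    values = E2M1_LUT[codes.long()].reshape(-1, block)
    scales = torch.pow(2.0, scale_exp.float() - 127.0).reshape(-1, 1)  # E8M0 bias = 127
    return (values * scales).reshape(-1)
```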
Note for users: You must pass the --moe-backend humming flag to use this backend, as Humming is not currently a mandatory dependency of vLLM. I am working on releasing the Humming PyPI package; until then, you can install it from the project repository linked above.

Benchmark
DeepSeek-V4-Flash + H20 x 4
Service start command:
Bench command:
Bench result (TPS):
The performance gains are primarily driven by enhancements in the MoE kernel and the MoE sum kernel (a plain-PyTorch reference for the sum step is sketched after this list):
- marlin moe w4a16 kernel
- humming moe w4a16 kernel
- marlin moe sum kernel (torch.sum)
- humming moe sum kernel (introduced in #34556)
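As a reference for what the "moe sum" step computes, here is the plain-PyTorch combine that the Marlin path expresses with torch.sum; shapes and names are illustrative, not the PR's kernel interface:

```python
import torch


def moe_sum_reference(expert_out: torch.Tensor, topk_weights: torch.Tensor) -> torch.Tensor:
    """Weighted combine of per-expert outputs into the final hidden states.

    expert_out:   [num_tokens, top_k, hidden_size] outputs of the selected experts.
    topk_weights: [num_tokens, top_k] router weights for those experts.
    """
    return torch.sum(expert_out * topk_weights.unsqueeze(-1), dim=1)
```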
Accuracy Test
Marlin W4A16 (main)
Humming W4A16 (PR)
Humming W4A8 (PR)