Feat: Dynamic Quantization for MoE Layers in GPTQ Marlin Backend #19395
mgoin merged 5 commits into vllm-project:main from
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Do we need to override the config for MoeWNA16Config too?
This modification does not impact standard W4A16 models. However, if MoeWNA16 is meant to support mixed-precision quantization as well, it has not been adapted for that yet.
Friendly bump

I've merged the latest changes from the main branch. Could you please take another look at this PR?

Thanks! Looking forward to seeing this in the next release.
Hi @robertgshaw2-redhat |
Dynamic quantization does not take effect for MoE modules with gptq_marlin in vLLM
Problem:
When running Mixture of Experts (MoE) modules using the gptq_marlin backend in vLLM, the dynamic quantization settings from the gptq configuration are not applied.
Root Cause:
In the GPTQMarlinConfig class (defined in gptq_marlin.py), the get_quant_method function relies on a utility function, get_linear_quant_method (imported from gptq_utils.py), to handle per-layer dynamic quantization overrides for standard linear layers. For MoE layers, however, vLLM currently has no equivalent function to process and apply the dynamic settings.
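To make the gap concrete, here is a hedged sketch of how per-layer dynamic overrides are resolved. The `dynamic` dict, regex-key format (`+:`/`-:` prefixes), and `resolve_dynamic` helper below are illustrative stand-ins, not vLLM's exact implementation; the point is that linear layers get this matching applied while MoE layers previously did not.

```python
import re

# Hypothetical example of a GPTQ "dynamic" section: regex patterns keyed
# with "+:" (apply an override dict) or "-:" (exclude the layer from
# quantization), matched against each layer's full name.
dynamic = {
    "+:.*mlp\\.experts.*": {"bits": 8},
    "-:.*lm_head.*": {},
}

def resolve_dynamic(layer_name: str, dynamic: dict):
    """Return the matching override dict, None if the layer is excluded
    from quantization, or {} to fall back to the global config (sketch)."""
    for pattern, override in dynamic.items():
        sign, _, regex = pattern.partition(":")
        if re.fullmatch(regex, layer_name):
            return None if sign == "-" else override
    return {}
```

With the config above, an expert weight such as `model.layers.0.mlp.experts.0.gate_proj` picks up the 8-bit override, `lm_head` is excluded, and every other layer keeps the global settings.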
Solution:
To address this, we added a new helper, get_moe_quant_method (currently placed in gptq_marlin.py for simplicity), that mirrors the behavior of get_linear_quant_method but targets MoE layers: it processes the dynamic quantization settings and returns the appropriate quantization method for each MoE layer.
Following vLLM's existing structure, this helper should ideally live in gptq_utils.py (alongside get_linear_quant_method) and be imported into gptq_marlin.py.
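A minimal sketch of the helper's shape, assuming simplified stand-ins for vLLM's types: `FusedMoE`, `QuantConfig`, and `MoeQuantMethod` below are placeholders (the real code works with vLLM's FusedMoE layer and GPTQMarlinMoEMethod), and the `-:` exclusion case simply returns None here, whereas the real code falls back to an unquantized method.

```python
import copy
import re

class FusedMoE:
    """Placeholder for vLLM's FusedMoE layer type (illustrative only)."""

class QuantConfig:
    """Placeholder for the GPTQ Marlin quant config (illustrative only)."""
    def __init__(self, weight_bits=4, dynamic=None):
        self.weight_bits = weight_bits
        self.dynamic = dynamic or {}

class MoeQuantMethod:
    """Placeholder for the MoE quant method; just records its config."""
    def __init__(self, quant_config):
        self.quant_config = quant_config

def _dynamic_override(prefix, dynamic):
    # "+:regex" applies an override dict; "-:regex" excludes the layer.
    for pattern, override in dynamic.items():
        sign, _, regex = pattern.partition(":")
        if re.fullmatch(regex, prefix):
            return None if sign == "-" else override
    return {}

def get_moe_quant_method(config, layer, prefix, moe_method_cls):
    """Mirror of get_linear_quant_method, but for MoE layers (sketch)."""
    if not isinstance(layer, FusedMoE):
        return None
    # Clone so per-layer overrides never mutate the global config.
    cloned = copy.deepcopy(config)
    override = _dynamic_override(prefix, cloned.dynamic)
    if override is None:
        return None  # excluded; real code falls back to unquantized
    for key, value in override.items():
        setattr(cloned, key, value)
    return moe_method_cls(cloned)
```

Cloning the config per layer is the key design point: it lets one layer run at 8 bits while the rest of the model keeps the global 4-bit settings.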
Validation:
This fix can be verified with our DeepSeek-R1 model repository: https://huggingface.co/QuantTrio/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Medium