[Bugfix] Temporarily disable group quant rms norm fusion #30273
ElizaWszola wants to merge 4 commits into vllm-project:main
Conversation
Signed-off-by: ElizaWszola <ewszola@redhat.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request temporarily disables group quantization RMS norm fusion to address an issue with model config assertions on Vision-Language models. The changes involve commenting out the registration of certain fusion patterns and removing the logic that determines scale layout for group quantization.
My review identifies a potential incompleteness in disabling the feature. While FusedAddRMSNormGroupQuantPattern is disabled, RMSNormGroupQuantPattern for the same group shapes remains active, even though the corresponding tests are disabled. I've added comments to suggest disabling these patterns as well to prevent untested code paths from being active.
vllm/compilation/fusion.py
# FusedAddRMSNormGroupQuantPattern(
#     epsilon, FP8_DTYPE, group_shape=GroupShape(1, 128)
# ).register(self.patterns)
While FusedAddRMSNormGroupQuantPattern for group shape (1, 128) is correctly commented out, the corresponding RMSNormGroupQuantPattern on lines 485-487 remains active. Given that tests for this group shape are disabled in tests/compile/test_fusion.py, this leaves an untested code path. To fully disable the group quantization fusion as intended by this PR, RMSNormGroupQuantPattern for this group shape should also be commented out.
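The suggestion above can be sketched as follows. This is an illustrative mock only: the class names mirror those in vllm/compilation/fusion.py, but the registry and pattern classes here are hypothetical stand-ins, not vLLM's real implementations.

```python
# Toy stand-ins for the names in vllm/compilation/fusion.py (not the real classes).
class GroupShape:
    def __init__(self, rows: int, cols: int):
        self.rows, self.cols = rows, cols


class PatternRegistry:
    def __init__(self):
        self.registered: list[str] = []


class RMSNormGroupQuantPattern:
    def __init__(self, epsilon: float, dtype: str, group_shape: GroupShape):
        self.group_shape = group_shape

    def register(self, patterns: PatternRegistry) -> None:
        patterns.registered.append(type(self).__name__)


patterns = PatternRegistry()
for epsilon in (1e-5, 1e-6):  # hypothetical epsilon sweep
    # Already disabled by this PR:
    # FusedAddRMSNormGroupQuantPattern(
    #     epsilon, "fp8", group_shape=GroupShape(1, 128)
    # ).register(patterns)

    # Suggested here: also comment out the non-fused variant, so that no
    # group-quant fusion path stays active while its tests are disabled.
    # RMSNormGroupQuantPattern(
    #     epsilon, "fp8", group_shape=GroupShape(1, 128)
    # ).register(patterns)
    pass

print(patterns.registered)  # -> [] once both patterns are commented out
```

With both registrations commented out, the registry ends up empty, which is the "fully disabled" state the review asks for.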
vllm/compilation/fusion.py
# FusedAddRMSNormGroupQuantPattern(
#     epsilon, FP8_DTYPE, group_shape=GroupShape(1, 64)
# ).register(self.patterns)
💡 Codex Review
Here are some automated review suggestions for this pull request.
# TODO: add use_col_major_scales and use_e8m0 to MatcherQuantFP8
# after the issue with group quant rms fusion for VL models is fixed
self.quant_matcher = MatcherQuantFP8(key.quant)
Group quant fusion ignores col-major/e8m0 settings
RMSNormQuantPattern now always builds MatcherQuantFP8 without forwarding the use_col_major_scales or use_e8m0 flags, yet the RMSNormGroupQuantPattern registrations for group sizes 64/128 are still active. On platforms where block FP8 paths expect column-major scales or e8m0 (CUTLASS or DeepGEMM), the fused rms_norm_per_block_quant kernels will now be invoked with row-major scales and e4m3, producing mis-quantized outputs on those GPUs rather than just disabling the fusion.
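To make the layout/dtype concern above concrete, here is a small self-contained sketch of per-group FP8 scale computation for a (1, group_size) group shape, including the power-of-two (e8m0-style) rounding the reviewer mentions. This is a toy illustration, not vLLM's kernel code; the function name and constants are assumptions for the example only.

```python
import math

FP8_E4M3_MAX = 448.0  # max representable magnitude in float8_e4m3fn


def group_scales(row, group_size=128, use_e8m0=False):
    """Per-group quantization scales for one row, group shape (1, group_size).

    With use_e8m0=True, each scale is rounded up to a power of two, mimicking
    an exponent-only (e8m0) scale encoding. Block-scale GEMM backends such as
    CUTLASS/DeepGEMM may additionally expect the resulting [M, K/group_size]
    scale tensor in column-major layout, which is the mismatch described above.
    """
    scales = []
    for i in range(0, len(row), group_size):
        group = row[i:i + group_size]
        amax = max(abs(x) for x in group) or 1e-12  # avoid a zero scale
        scale = amax / FP8_E4M3_MAX
        if use_e8m0:
            # e8m0 scales carry only an exponent: round up to a power of two.
            scale = 2.0 ** math.ceil(math.log2(scale))
        scales.append(scale)
    return scales


row = [0.5] * 128 + [2.0] * 128            # hidden size 256 -> 2 groups
e4m3_scales = group_scales(row)            # fractional, row-major scales
e8m0_scales = group_scales(row, use_e8m0=True)  # power-of-two scales
```

A kernel compiled to consume the power-of-two variant would mis-scale outputs if handed the fractional variant (and vice versa), which is why silently dropping the `use_col_major_scales`/`use_e8m0` flags while the non-fused group-quant patterns stay registered can corrupt results instead of merely skipping the fusion.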
Note that when testing on Hopper, I currently replace with in
To check, @ElizaWszola, you do want this to go in first, right? (when compared with #30244)
Temporarily disable changes introduced by #27883 until we cleanly resolve the issue with model config assertions on VL models.
Testing:

Unit testing with tests/compile/test_fusion.py.

Tested e2e with:
- Qwen/Qwen3-30B-A3B-FP8
- Qwen/Qwen3-VL-4B-Instruct
- Qwen/Qwen3-VL-2B-Instruct-FP8

All tests have been done with both VLLM_USE_DEEP_GEMM=0 and VLLM_USE_DEEP_GEMM=1.