[Bugfix] Fix fusion for VL models (#30244)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Code Review
This pull request refactors the FP8 quantization fusion logic to make it more robust for Vision-Language (VL) models. The change correctly moves the decision-making for using deepgemm and column-major scales from a static configuration-based approach to a dynamic one that uses the weight tensor's shape at runtime. This is a solid improvement. I've identified one area of code duplication introduced in this change that should be addressed to improve maintainability.
vllm/compilation/matcher_utils.py
using_deepgemm = should_use_deepgemm_for_fp8_linear(
    self.model_dtype,
    weight,
)
use_col_major_scales = using_deepgemm or cutlass_block_fp8_supported()
This logic to determine using_deepgemm and use_col_major_scales is duplicated in vllm/compilation/fusion.py in FusedAddRMSNormGroupQuantPattern.replacement (line 267) and RMSNormGroupQuantPattern.replacement (line 331). To improve maintainability and prevent potential bugs from inconsistent updates, consider centralizing this logic. A helper method within the MatcherQuantFP8 class could be a good way to encapsulate this logic, which can then be called from all three locations.
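The suggested refactor could look roughly like the sketch below: one helper method on the matcher class becomes the single source of truth for the deepgemm / column-major-scales decision. The two vLLM helpers are stubbed here purely for illustration (the real ones probe hardware support and dtype), and the `scale_layout` method name is a hypothetical choice, not the PR's actual API.

```python
class FakeWeight:
    """Stand-in for a torch.Tensor; only .shape is needed for this sketch."""

    def __init__(self, *shape: int):
        self.shape = shape


def should_use_deepgemm_for_fp8_linear(model_dtype: str, weight: FakeWeight) -> bool:
    # Stub: the real vLLM helper also checks deepgemm availability and
    # platform support; here we only mimic a dtype + block-alignment check.
    return model_dtype == "bfloat16" and weight.shape[1] % 128 == 0


def cutlass_block_fp8_supported() -> bool:
    # Stub: a hardware probe in real vLLM.
    return False


class MatcherQuantFP8:
    def __init__(self, model_dtype: str):
        self.model_dtype = model_dtype

    def scale_layout(self, weight: FakeWeight) -> tuple[bool, bool]:
        """Single source of truth for (using_deepgemm, use_col_major_scales),
        callable from the matcher and from both replacement methods."""
        using_deepgemm = should_use_deepgemm_for_fp8_linear(self.model_dtype, weight)
        return using_deepgemm, using_deepgemm or cutlass_block_fp8_supported()
```

All three call sites would then call `self.scale_layout(weight)` instead of recomputing the pair locally.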
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/compilation/matcher_utils.py
using_deepgemm = should_use_deepgemm_for_fp8_linear(
    self.model_dtype,
    weight,
)
Skip deepgemm check for 1D RMSNorm weights
During the FP8 group quantization path the matcher now feeds the RMSNorm weight tensor into should_use_deepgemm_for_fp8_linear, but that helper assumes a 2D linear weight and unconditionally accesses weight.shape[1]. RMSNorm weights are 1D, so when the pattern is traced (or the replacement runs) this call raises IndexError: tuple index out of range, preventing the fused RMSNorm+quant pattern from compiling for group-quantized models such as VL configs.
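A minimal reproduction of this failure mode, with one possible guard (an assumption for illustration, not necessarily the PR's actual fix): skip the deepgemm check whenever the weight is not a 2D linear weight. The helper below is a stub that only mirrors the 2D-shape assumption Codex flags.

```python
class FakeWeight:
    """Stand-in for a torch.Tensor; only .shape matters here."""

    def __init__(self, *shape: int):
        self.shape = shape


def should_use_deepgemm_for_fp8_linear(model_dtype: str, weight: FakeWeight) -> bool:
    # Mirrors the assumption flagged above: a 2D linear weight, so
    # weight.shape[1] raises IndexError for a 1D RMSNorm weight.
    return model_dtype == "bfloat16" and weight.shape[1] % 128 == 0


def safe_should_use_deepgemm(model_dtype: str, weight: FakeWeight) -> bool:
    # Hypothetical guard: 1D (RMSNorm) weights can never take the
    # deepgemm path, so bail out before touching shape[1].
    if len(weight.shape) != 2:
        return False
    return should_use_deepgemm_for_fp8_linear(model_dtype, weight)


rmsnorm_weight = FakeWeight(4096)        # 1D, as in RMSNorm
linear_weight = FakeWeight(4096, 4096)   # 2D linear weight

try:
    should_use_deepgemm_for_fp8_linear("bfloat16", rmsnorm_weight)
    raised = False
except IndexError:
    raised = True  # the unguarded helper blows up on 1D weights
```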
ProExpertProg left a comment
Can we instead just create multiple patterns, one for column scales and one for row scales?
@ProExpertProg We will also need one for e8m0 if we don't want to check whether we use deepgemm during matching. Should I put all this information in QuantKey?
I think pass it like we're passing epsilon for now; it doesn't seem like something that belongs in QuantKey, at least for now.
@ProExpertProg I'm running into some duplicate pattern errors with this approach. Because it's a breaking bug (it breaks all VL models), would it be OK to land this PR as is and then make a follow-up with cleaner matching?
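The "multiple patterns" idea discussed above could be sketched as follows: register one pattern variant per scale layout, enumerated at registration time the same way epsilon already is, so matching never needs a runtime deepgemm/cutlass probe. `make_pattern` and `register_scale_layout_patterns` are hypothetical names standing in for vLLM's actual pattern-registration machinery.

```python
from itertools import product


def make_pattern(epsilon: float, col_major_scales: bool) -> dict:
    # Stand-in for a real fused-pattern object; in vLLM this would be
    # a pattern registered with the inductor pattern matcher.
    return {"epsilon": epsilon, "col_major_scales": col_major_scales}


def register_scale_layout_patterns(epsilons=(1e-5, 1e-6)) -> list:
    # One variant per (epsilon, scale layout) combination, mirroring how
    # epsilon is already enumerated when patterns are registered.
    return [
        make_pattern(eps, col_major)
        for eps, col_major in product(epsilons, (True, False))
    ]
```

The trade-off is a larger pattern set (hence the duplicate-pattern errors mentioned above), in exchange for match-time simplicity.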
This pull request has merge conflicts that must be resolved before it can be merged.
I have just tried this PR atop 67475a6 but got an error for Qwen3 MoE. Qwen3-VL-32B (non-MoE) is working like a charm, BTW.
Yikes, I mistakenly tested the BF16 checkpoint for Qwen3-VL-32B. #30336 indeed solves the dynamo compilation error, huge thanks!
Hi, I'd just like to mention that current main (since #28480 merged) is incompatible with DeepGEMM; see #28480 (comment). I have checked that commit 00e5cbb with this PR and #30336 cherry-picked works well for FP8 Llama 4 and Qwen3-VL MoE.
@cjackal I've been seeing the same error, thanks for identifying the root PR!
Fortunately there's an ongoing bugfix at #30399 🚀
    list[tuple[Any, ...]](flat_product(MODELS_GROUP_FP8, CUSTOM_OPS_QUANT_RMS_NORM)),
)
@pytest.mark.parametrize("inductor_graph_partition", [True, False])
def test_rms_group_quant(
We don't need a new test, just enable rmsnorm+quant fusion in the other tests!
My reasoning was to test a block-quant model separately, so I don't have to figure out the number of fusions for all models (with the current regex-based counting, block and non-block quant RMS fusions are counted together), and also so I don't have to make the code inside the tests more complicated (block-quant RMS fusions are supported only when FP8 is enabled).
We should address this in a follow-up, also to reduce the total test running time.
yewentao256 left a comment
LGTM, thanks for the work!
A fix for #27883 so the fusion code doesn't break VL models.
Testing:
All tests have been run on a Hopper GPU with both VLLM_USE_DEEP_GEMM=0 and VLLM_USE_DEEP_GEMM=1. Note that DeepGEMM runs currently require the changes from #30336.

E2E tests:
- Qwen/Qwen3-30B-A3B-FP8
- Qwen/Qwen3-VL-4B-Instruct
- Qwen/Qwen3-VL-2B-Instruct-FP8
- Qwen/Qwen3-VL-30B-A3B-Instruct-FP8

All FP8 models have been manually verified to produce the fused group quant RMS norm kernel during compilation.
Unit test:
tests/compile/test_fusion.py