[Bugfix] Fix fusion for VL models (#30244)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Code Review
This pull request refactors the FP8 quantization fusion logic to make it more robust for Vision-Language (VL) models. The change correctly moves the decision-making for using deepgemm and column-major scales from a static configuration-based approach to a dynamic one that uses the weight tensor's shape at runtime. This is a solid improvement. I've identified one area of code duplication introduced in this change that should be addressed to improve maintainability.
vllm/compilation/matcher_utils.py
using_deepgemm = should_use_deepgemm_for_fp8_linear(
    self.model_dtype,
    weight,
)
use_col_major_scales = using_deepgemm or cutlass_block_fp8_supported()
This logic to determine using_deepgemm and use_col_major_scales is duplicated in vllm/compilation/fusion.py in FusedAddRMSNormGroupQuantPattern.replacement (line 267) and RMSNormGroupQuantPattern.replacement (line 331). To improve maintainability and prevent potential bugs from inconsistent updates, consider centralizing this logic. A helper method within the MatcherQuantFP8 class could be a good way to encapsulate this logic, which can then be called from all three locations.
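The suggested refactor could look roughly like the sketch below: one helper method on the matcher class becomes the single source of truth for the deepgemm / column-major-scales decision. The two vLLM helpers are stubbed here purely for illustration (the real ones probe hardware support and dtype), and the `scale_layout` method name is a hypothetical choice, not the PR's actual API.

```python
class FakeWeight:
    """Stand-in for a torch.Tensor; only .shape is needed for this sketch."""

    def __init__(self, *shape: int):
        self.shape = shape


def should_use_deepgemm_for_fp8_linear(model_dtype: str, weight: FakeWeight) -> bool:
    # Stub: the real vLLM helper also checks deepgemm availability and
    # platform support; here we only mimic a dtype + block-alignment check.
    return model_dtype == "bfloat16" and weight.shape[1] % 128 == 0


def cutlass_block_fp8_supported() -> bool:
    # Stub: a hardware probe in real vLLM.
    return False


class MatcherQuantFP8:
    def __init__(self, model_dtype: str):
        self.model_dtype = model_dtype

    def scale_layout(self, weight: FakeWeight) -> tuple[bool, bool]:
        """Single source of truth for (using_deepgemm, use_col_major_scales),
        callable from the matcher and from both replacement methods."""
        using_deepgemm = should_use_deepgemm_for_fp8_linear(self.model_dtype, weight)
        return using_deepgemm, using_deepgemm or cutlass_block_fp8_supported()
```

All three call sites would then call `self.scale_layout(weight)` instead of recomputing the pair locally.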
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/compilation/matcher_utils.py
using_deepgemm = should_use_deepgemm_for_fp8_linear(
    self.model_dtype,
    weight,
)
Skip deepgemm check for 1D RMSNorm weights
During the FP8 group quantization path the matcher now feeds the RMSNorm weight tensor into should_use_deepgemm_for_fp8_linear, but that helper assumes a 2D linear weight and unconditionally accesses weight.shape[1]. RMSNorm weights are 1D, so when the pattern is traced (or the replacement runs) this call raises IndexError: tuple index out of range, preventing the fused RMSNorm+quant pattern from compiling for group-quantized models such as VL configs.
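A minimal reproduction of this failure mode, with one possible guard (an assumption for illustration, not necessarily the PR's actual fix): skip the deepgemm check whenever the weight is not a 2D linear weight. The helper below is a stub that only mirrors the 2D-shape assumption Codex flags.

```python
class FakeWeight:
    """Stand-in for a torch.Tensor; only .shape matters here."""

    def __init__(self, *shape: int):
        self.shape = shape


def should_use_deepgemm_for_fp8_linear(model_dtype: str, weight: FakeWeight) -> bool:
    # Mirrors the assumption flagged above: a 2D linear weight, so
    # weight.shape[1] raises IndexError for a 1D RMSNorm weight.
    return model_dtype == "bfloat16" and weight.shape[1] % 128 == 0


def safe_should_use_deepgemm(model_dtype: str, weight: FakeWeight) -> bool:
    # Hypothetical guard: 1D (RMSNorm) weights can never take the
    # deepgemm path, so bail out before touching shape[1].
    if len(weight.shape) != 2:
        return False
    return should_use_deepgemm_for_fp8_linear(model_dtype, weight)


rmsnorm_weight = FakeWeight(4096)        # 1D, as in RMSNorm
linear_weight = FakeWeight(4096, 4096)   # 2D linear weight

try:
    should_use_deepgemm_for_fp8_linear("bfloat16", rmsnorm_weight)
    raised = False
except IndexError:
    raised = True  # the unguarded helper blows up on 1D weights
```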
ProExpertProg left a comment
Can we instead just create multiple patterns, one for column scales and one for row scales?
@ProExpertProg We will also need one for e8m0 if we don't want to check whether we use deepgemm during matching. Should I put all this information in QuantKey?
I think pass it like we're passing epsilon for now; it doesn't seem like something that belongs in QuantKey, at least for now.
@ProExpertProg I'm running into some duplicate pattern errors with this approach. Because it's a breaking bug (it breaks all VL models), would it be OK to land this PR as is and then make a follow-up with cleaner matching?
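The "multiple patterns" idea discussed above could be sketched as follows: register one pattern variant per scale layout, enumerated at registration time the same way epsilon already is, so matching never needs a runtime deepgemm/cutlass probe. `make_pattern` and `register_scale_layout_patterns` are hypothetical names standing in for vLLM's actual pattern-registration machinery.

```python
from itertools import product


def make_pattern(epsilon: float, col_major_scales: bool) -> dict:
    # Stand-in for a real fused-pattern object; in vLLM this would be
    # a pattern registered with the inductor pattern matcher.
    return {"epsilon": epsilon, "col_major_scales": col_major_scales}


def register_scale_layout_patterns(epsilons=(1e-5, 1e-6)) -> list:
    # One variant per (epsilon, scale layout) combination, mirroring how
    # epsilon is already enumerated when patterns are registered.
    return [
        make_pattern(eps, col_major)
        for eps, col_major in product(epsilons, (True, False))
    ]
```

The trade-off is a larger pattern set (hence the duplicate-pattern errors mentioned above), in exchange for match-time simplicity.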
This pull request has merge conflicts that must be resolved before it can be merged.
I have just tried this PR atop 67475a6 but got an error for Qwen3 MoE. Qwen3-VL-32B (non-MoE) is working like a charm, BTW.
Yikes, I mistakenly tested the BF16 checkpoint for Qwen3-VL-32B. #30336 indeed solves the dynamo compilation error, huge thanks!
Hi, I'd just like to mention that current main (since #28480 merged) is incompatible with DeepGEMM; see #28480 (comment). I have checked that commit 00e5cbb with this PR and #30336 cherry-picked works well for FP8 Llama 4 and Qwen3-VL MoE.
@cjackal I've been seeing the same error, thanks for identifying the root PR!
Fortunately there's an ongoing bugfix at #30399 🚀
    list[tuple[Any, ...]](flat_product(MODELS_GROUP_FP8, CUSTOM_OPS_QUANT_RMS_NORM)),
)
@pytest.mark.parametrize("inductor_graph_partition", [True, False])
def test_rms_group_quant(
We don't need a new test, just enable rmsnorm+quant fusion in the other tests!
My reasoning was to test a block-quant model separately, so I don't have to figure out the number of fusions for all models (with the current regex-based counting, block and non-block quant RMS fusions are counted together), and also so I don't have to make the code inside the tests more complicated (block-quant RMS fusions are supported only when FP8 is enabled).
We should address this in a follow-up, also to reduce the total test running time.
yewentao256 left a comment
LGTM, thanks for the work!
A fix for #27883 so the fusion code doesn't break VL models.
Testing:
All tests have been run on a Hopper GPU with both VLLM_USE_DEEP_GEMM=0 and VLLM_USE_DEEP_GEMM=1. Note that DeepGEMM runs currently require the changes from #30336.

E2E tests:
- Qwen/Qwen3-30B-A3B-FP8
- Qwen/Qwen3-VL-4B-Instruct
- Qwen/Qwen3-VL-2B-Instruct-FP8
- Qwen/Qwen3-VL-30B-A3B-Instruct-FP8

All FP8 models have been manually verified to produce the fused group quant RMS norm kernel during compilation.
Unit test:
tests/compile/test_fusion.py