[BugFix]Qwen-Image performance regression by using omni RMSNorm(RMSNorm backend) by NumberWan · Pull Request #3933 · vllm-project/vllm-omni

NumberWan · 2026-05-28T08:31:24Z


Purpose

Mitigate the Qwen-Image diffusion performance regression observed in vLLM-Omni 0.20.0 (steady-state compiled runs) by switching Qwen-Image RMSNorm layers to vllm_omni.diffusion.layers.norm.RMSNorm, avoiding the vLLM RMSNorm extension kernel path and validating improvement via profiler totals.

| Metric | v0.18.0 | v0.20.0 (vLLM RMSNorm) | v0.20.0 (omni RMSNorm) |
| --- | --- | --- | --- |
| throughput_qps | 0.843 | 0.763 (−9.4%) | 0.792 (−6.0%) |
| latency_mean | 1.187 | 1.311 (+10.4%) | 1.263 (+6.4%) |
| latency_median | 1.181 | 1.307 (+10.7%) | 1.258 (+6.5%) |
| latency_p99 | 1.269 | 1.373 (+8.2%) | 1.316 (+3.7%) |
| latency_p95 | 1.234 | 1.362 (+10.4%) | 1.287 (+4.3%) |

There is still a regression vs v0.18.0 on client latency, but switching from vLLM RMSNorm to omni RMSNorm reduces the gap from ~10% (vLLM, latency_mean) to ~6% (omni).
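For readers who want to reproduce the percentages, here is a minimal sketch of the relative-change arithmetic. The dictionary names and helper functions are illustrative assumptions, not part of the PR. The inputs are the three-decimal values published in the table, so a recomputed figure can differ from the table in the last digit (for example, throughput_qps for vLLM RMSNorm recomputes to −9.5% rather than −9.4%).

```python
# Values copied from the table above (rounded to three decimals as published).
BASELINE = {"throughput_qps": 0.843, "latency_mean": 1.187, "latency_p99": 1.269}
VLLM_NORM = {"throughput_qps": 0.763, "latency_mean": 1.311, "latency_p99": 1.373}
OMNI_NORM = {"throughput_qps": 0.792, "latency_mean": 1.263, "latency_p99": 1.316}


def relative_change_pct(value: float, baseline: float) -> float:
    """Return the signed percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


def compare(candidate: dict, baseline: dict) -> dict:
    """Percentage change per metric, rounded to one decimal place."""
    return {
        name: round(relative_change_pct(candidate[name], baseline[name]), 1)
        for name in baseline
    }


if __name__ == "__main__":
    print("vLLM RMSNorm vs v0.18.0:", compare(VLLM_NORM, BASELINE))
    print("omni RMSNorm vs v0.18.0:", compare(OMNI_NORM, BASELINE))
```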

Description

Issue: Qwen-Image diffusion shows a performance regression when moving from vLLM-Omni 0.18.0 to 0.20.0 (steady-state compiled runs).
Observation: In 0.20.0, Qwen-Image uses vLLM’s RMSNorm extension kernels (_C::rms_norm / vllm::rms_norm_kernel) which contribute ~310ms self CUDA in our run.
A/B result: Switching Qwen-Image RMSNorm to vllm_omni.diffusion.layers.norm.RMSNorm:
- removes those vLLM RMSNorm extension rows (vllm_ext=0)
- reduces norm keyword aggregate self CUDA by ~309ms
- improves profiler_out_0.txt totals by -0.098s Self CPU and -0.092s Self CUDA
Change: Qwen-Image-only change in vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py to use vllm_omni.diffusion.layers.norm.RMSNorm.
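As background, the operation that both backends compute can be sketched in a few lines. This is the textbook RMSNorm definition written in plain Python for a single vector. It is not the code in vllm_omni/diffusion/layers/norm.py and not the vLLM extension kernel, and the `eps` default is an assumption chosen for illustration.

```python
import math


def rms_norm(x: list, weight: list, eps: float = 1e-6) -> list:
    """Normalize `x` by its root mean square, then scale elementwise by `weight`."""
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    # Mean of squared activations over the normalized (last) dimension.
    mean_square = sum(v * v for v in x) / len(x)
    # eps keeps the division stable when the input is close to zero.
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


if __name__ == "__main__":
    print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```

Both paths compute the same function, so the timing differences reported in this PR come from how the operation is dispatched and executed in each backend, not from a difference in the formula.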

We are investigating a Qwen-Image performance regression between vLLM-Omni 0.18.0 and 0.20.0 (see issue #3812). Profiling indicates notable differences in operator paths/buckets in compiled runs (e.g., mm/addmm ratio, CompiledFxGraph time, and normalization kernels).

To narrow down to a low-risk and actionable patch, we performed keyword aggregation and an A/B test specifically on RMSNorm backend used by Qwen-Image. The results suggest that the vLLM RMSNorm extension path is a measurable contributor to the regression in our environment, and switching to vllm_omni.diffusion.layers.norm.RMSNorm improves both keyword-aggregated norm time and overall profiler totals.

Test Plan

A/B comparison on v0.20.0 after compile and warmup (vLLM RMS_Norm vs. omni RMS_Norm).

Test Result

Compare (vLLM RMS_Norm / omni RMS_Norm) in profiling

vLLM RMS_Norm CPU time vs omni RMS_Norm CPU time

[Profiler images omitted: vLLM, Omni]

omni RMS_Norm: ~13us
vLLM RMS_Norm: ~57us

vLLM RMS_Norm CUDA time vs omni RMS_Norm CUDA time

[Profiler images omitted: vLLM, Omni]

omni RMS_Norm: ~31us
vLLM RMS_Norm: ~62us

Changed Comparison

| Item | vLLM RMSNorm (before) | omni RMSNorm (after) |
| --- | --- | --- |
| `profiler_out_0.txt` totals: Self CPU | 2.796s | 2.698s |
| `profiler_out_0.txt` totals: Self CUDA | 2.651s | 2.559s |
| keyword aggregate (`rms_norm\|layer_norm\|norm`): self CUDA | 503.527ms | 194.607ms |
| `vllm_ext` (`_C::rms_norm` + `vllm::rms_norm_kernel`): self CUDA | 310.067ms | 0ms |

Summary (vllm - omni):

Self CPU +0.098s
Self CUDA +0.092s
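The keyword aggregate and `vllm_ext` rows can be thought of as sums over profiler rows whose operator names match a pattern. Below is a minimal sketch of that aggregation, assuming the rows have already been parsed into (name, self CUDA time) pairs. The row timings are made up for illustration and are not taken from `profiler_out_0.txt`; the function names are hypothetical. Note that the `\|` in the table is markdown escaping, so the Python pattern uses a plain `|`.

```python
import re

# Hypothetical (operator name, self CUDA time in microseconds) rows.
# The names mirror those quoted in the table; the timings are invented.
ROWS = [
    ("_C::rms_norm", 120.0),
    ("vllm::rms_norm_kernel", 180.0),
    ("aten::native_layer_norm", 90.0),
    ("aten::mm", 400.0),
]

NORM_PATTERN = re.compile(r"rms_norm|layer_norm|norm")
VLLM_EXT_NAMES = ("_C::rms_norm", "vllm::rms_norm_kernel")


def aggregate_by_keyword(rows, pattern):
    """Sum self CUDA time over rows whose name matches the regex anywhere."""
    return sum(time_us for name, time_us in rows if pattern.search(name))


def aggregate_by_substring(rows, substrings):
    """Sum self CUDA time over rows whose name contains any given substring."""
    return sum(
        time_us for name, time_us in rows if any(s in name for s in substrings)
    )


if __name__ == "__main__":
    print("norm keyword aggregate (us):", aggregate_by_keyword(ROWS, NORM_PATTERN))
    print("vllm_ext aggregate (us):", aggregate_by_substring(ROWS, VLLM_EXT_NAMES))
```

Because the keyword pattern also matches the vLLM extension rows, removing those rows lowers both aggregates, which is consistent with the ~309ms and 310.067ms reductions reported above.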

End-to-end comparison: vLLM RMSNorm vs omni RMSNorm

diffusion_forward (1 occurrence per trace)
vLLM RMSNorm: 2.74s
omni RMSNorm: 2.64s
Delta (vLLM − omni): +0.101s (~+101.0ms)

What this PR changes

File: vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py
Change: Use vllm_omni.diffusion.layers.norm.RMSNorm for Qwen-Image RMSNorm layers (avoid vLLM RMSNorm extension path).
Scope: Qwen-Image only (no changes to other models/pipelines).


Signed-off-by: NumberWan <wantszkin2003@gmail.com>


NumberWan · 2026-05-28T09:01:17Z

@fhfuih @hsliuustc0106 PTAL

hsliuustc0106 · 2026-05-28T09:30:10Z

@SamitHuang @wtomin PTAL

fhfuih · 2026-05-29T02:36:36Z

Thanks for the PR. I also had an offline discussion with @NumberWan.

The performance gain from torch's RMSNorm probably became possible after #3352: I tried the same thing before #3352, and torch RMSNorm performed worse. (I didn't profile it, though; I only tested e2e latency as a coarse exploration.)

But since we can observe a performance gain now, and using torch's RMSNorm also reduces the engineering dependency on vLLM, it looks good to me.

I think we can use torch RMSNorm on Qwen-Image first, see if it is stable in the coming releases, and gradually promote this op to other diffusion models in future iterations (and also update doc pages/examples such as "add a diffusion model").

fhfuih · 2026-05-29T02:59:13Z


 logger = init_logger(__name__)

+# Before #41804 issue fixed, PyTorch RMSNorm replaced vLLM RMSNorm for Qwen-Image.


Actually, if torch RMSNorm gives a performance gain, we may not need to change it back to vLLM's RMSNorm regardless of their issue #41804.

ZhongsJie · May 29, 2026

Perhaps we should consider the potential benefits that vllm RMSNorm can bring to other platforms (such as NPU). I also noticed that vllm-omni has its own RMSNorm implementation (vllm_omni/diffusion/layers/norm.py). What is the purpose of this file, and why not consider making some replacement attempts only on CUDA?

NumberWan · May 29, 2026

> Perhaps we should consider the potential benefits that vllm RMSNorm can bring to other platforms (such as NPU). I also noticed that vllm-omni has its own RMSNorm implementation (vllm_omni/diffusion/layers/norm.py). What is the purpose of this file, and why not consider making some replacement attempts only on CUDA?

Thank you for your suggestion. I will keep tracking the performance of both RMSNorm implementations (vLLM and torch) to select the suitable one for each version.

NumberWan · May 29, 2026

> Perhaps we should consider the potential benefits that vllm RMSNorm can bring to other platforms (such as NPU). I also noticed that vllm-omni has its own RMSNorm implementation (vllm_omni/diffusion/layers/norm.py). What is the purpose of this file, and why not consider making some replacement attempts only on CUDA?

Thank you for your suggestion. I understand your consideration. Let me explore the performance on other platforms (such as NPU) and compare how the change affects them.

fhfuih · May 29, 2026

> I also noticed that vllm-omni has its own RMSNorm implementation

At least I tested this implementation when upgrading from 0.19 to 0.20. At that time, this implementation and the torch op were similarly slow. So @NumberWan, yes, you can also retest this implementation.

But

> What is the purpose of this file

This I don't know. AFAIK Wan used it, and otherwise it is seldom used.


SamitHuang · 2026-05-29T08:25:53Z

Can you test our previous setting:

512 independent requests with batch size=1, num_inference_steps=10, resolution 512x512

In this case, we have ~10% performance drop compared to v0.18

NumberWan · 2026-05-29T10:13:51Z

> Can you test our previous setting:
>
> 512 independent requests with batch size=1, num_inference_steps=10, resolution 512x512
>
> In this case, we have ~10% performance drop compared to v0.18

I used this setting and ran 50 requests. Here is the result:

| Metric | v0.18.0 | v0.20.0 (vLLM RMSNorm) | v0.20.0 (omni RMSNorm) |
| --- | --- | --- | --- |
| throughput_qps | 0.843 | 0.763 (−9.4%) | 0.792 (−6.0%) |
| latency_mean | 1.187 | 1.311 (+10.4%) | 1.263 (+6.4%) |
| latency_median | 1.181 | 1.307 (+10.7%) | 1.258 (+6.5%) |
| latency_p99 | 1.269 | 1.373 (+8.2%) | 1.316 (+3.7%) |
| latency_p95 | 1.234 | 1.362 (+10.4%) | 1.287 (+4.3%) |

There is still a regression vs v0.18.0 on client latency, but switching from vLLM RMSNorm to omni RMSNorm reduces the gap from ~10% (vLLM, latency_mean) to ~6% (omni).
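The thread does not say how the benchmark client derives these metrics, so the following is only a sketch of one common way to compute them from per-request latencies. The nearest-rank percentile rule, the throughput definition (completed requests divided by wall-clock time), and all function names are assumptions.

```python
import math
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (assumed definition)."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_s, wall_time_s):
    """Summarize per-request latencies (seconds) for one benchmark run."""
    ordered = sorted(latencies_s)
    return {
        "throughput_qps": len(ordered) / wall_time_s,
        "latency_mean": statistics.fmean(ordered),
        "latency_median": statistics.median(ordered),
        "latency_p95": percentile(ordered, 95),
        "latency_p99": percentile(ordered, 99),
    }


if __name__ == "__main__":
    # Made-up latencies for four sequential requests, not measured data.
    print(summarize([1.0, 2.0, 3.0, 4.0], wall_time_s=10.0))
```

With only 50 requests, p99 under a nearest-rank rule is simply the slowest request, so tail percentiles from a run of this size are more sensitive to a single outlier than the mean or median.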

hsliuustc0106 · 2026-05-29T10:39:19Z

What are the remaining gaps?

NumberWan · 2026-05-29T14:55:10Z

> What are the remaining gaps?

Not confirmed yet. I will continue to track this issue.


NumberWan added 2 commits May 28, 2026 08:14

Replace vLLM rms_norm by Torch RMSNorm

3ccc69d

Signed-off-by: NumberWan <wantszkin2003@gmail.com>

Replace vLLM rms_norm by Torch RMSNorm

3b4821e

Signed-off-by: NumberWan <wantszkin2003@gmail.com>

NumberWan requested review from Isotr0py, RuixiangMa, SamitHuang, ZJY0516, david6666666, princepride and wtomin as code owners May 28, 2026 08:31

NumberWan added 2 commits May 28, 2026 08:37

Replace vLLM rms_norm by Torch RMSNorm

60048ae

Signed-off-by: NumberWan <wantszkin2003@gmail.com>

format changed

b171b13

Signed-off-by: NumberWan <wantszkin2003@gmail.com>

Merge branch 'main' into qwen_perf

8c88f00

fhfuih reviewed May 29, 2026


Changed vllm RMSnorm to omni RMSnorm

d4bf12b

Signed-off-by: NumberWan <wantszkin2003@gmail.com>

NumberWan changed the title ~~[BugFix]Qwen-Image performance regression by using torch RMSNorm(RMSNorm backend)~~ [BugFix]Qwen-Image performance regression by using omni RMSNorm(RMSNorm backend) May 29, 2026

zhumingjue138 mentioned this pull request May 30, 2026

[Bug]: Nightly / CI failed - tests/e2e/accuracy/test_diffusers_backend_similarity.py::test_diffusers_backend_t2i_matches_diffusers[Qwen/Qwen-Image] #3867

Closed

1 task

hsliuustc0106 merged commit 6f4bd3e into vllm-project:main May 31, 2026
5 checks passed

This was referenced Jun 1, 2026

[Bug]: Nightly / CI failed - tests/e2e/accuracy/test_qwen_image.py::test_qwen_image_matches_diffusers #4029

Open

[BugFix]Qwen-Image performance regression by using torch RMSNorm(RMSNorm backend) #4074

Merged

86MaxCao pushed a commit to 86MaxCao/vllm-omni that referenced this pull request Jun 4, 2026

[BugFix]Qwen-Image performance regression by using omni RMSNorm(RMSNo…

65d6fca

…rm backend) (vllm-project#3933) Signed-off-by: NumberWan <wantszkin2003@gmail.com>



Review comments (May 29, 2026), in order: fhfuih, ZhongsJie, NumberWan, NumberWan, fhfuih
