[BugFix]Qwen-Image performance regression by using torch RMSNorm(RMSNorm backend) by NumberWan · Pull Request #4074 · vllm-project/vllm-omni

NumberWan · 2026-06-02T08:35:58Z

Purpose

Fix/mitigate the accuracy regression reported in #4029 (test_qwen_image_matches_diffusers, SSIM threshold ≥ 0.97): switch Qwen-Image RMSNorm layers to torch.nn.RMSNorm.

Background

This work started as an investigation into a Qwen-Image performance regression (v0.18.0 → v0.20.x), observed via the official serving benchmark (diffusion_benchmark_serving.py → latency_mean).
PR #3933 ([BugFix]Qwen-Image performance regression by using omni RMSNorm(RMSNorm backend)) switched Qwen-Image RMSNorm from vLLM RMSNorm to omni RMSNorm (an operator/backend path change).
Around the same time, issue #4029 ([Bug]: Nightly / CI failed - tests/e2e/accuracy/test_qwen_image.py::test_qwen_image_matches_diffusers) reported a nightly CI accuracy regression (SSIM vs diffusers).
This PR switches to torch.nn.RMSNorm as a pragmatic fallback: it avoids the vLLM RMSNorm extension path and reduces the risk of fused-kernel numerical drift affecting SSIM.

Change

File: vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py
Change: use torch.nn.RMSNorm for Qwen-Image RMSNorm layers
Scope: Qwen-Image diffusion transformer only
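
For readers unfamiliar with the operator being swapped, here is a minimal pure-Python sketch of the standard RMSNorm formula, y = x / sqrt(mean(x²) + eps) * weight. It is not the code in this PR: the function name `rms_norm`, the `eps` value, and the sample vector are assumptions chosen for illustration only.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm over a single vector (illustrative, not the PR's code).

    Computes x / sqrt(mean(x^2) + eps) * weight element-wise.
    The eps default here is an assumption, not a value taken from the PR.
    """
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


# Hypothetical hidden-state vector; a weight of all ones leaves only the normalization.
hidden = [1.0, -2.0, 3.0, -4.0]
out = rms_norm(hidden, weight=[1.0] * len(hidden))

# After normalization the root-mean-square of the output is very close to 1.
out_rms = math.sqrt(sum(v * v for v in out) / len(out))
print(out, out_rms)
```

Different backends may evaluate this same formula with different accumulation order or precision, which is one plausible source of the small numerical differences discussed in the Background.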

Test Plan

CI: run tests/e2e/accuracy/test_qwen_image.py::test_qwen_image_matches_diffusers (primary goal: confirm that #4029 is addressed).
Benchmark: run diffusion_benchmark_serving.py (512×512, 10 steps) to show that latency_mean improves compared with the vLLM RMSNorm path.
0.20.0 compile+warmup A/B version (vLLM RMS_Norm / Torch RMS_Norm)

Test Result

Accuracy: `test_qwen_image_matches_diffusers` in local test

Note: the rows below are local results (L20X + venv). They do not match the absolute values from nightly CI, but they are useful for comparing relative changes across RMSNorm backends.

Will be updated once CI test results are available.

| Variant (local) | Commit | SSIM | PSNR (dB) |
| --- | --- | --- | --- |
| baseline (before #3933 merged) | “Last PASS commit” `11c4fced` | 0.950571 | 26.75 |
| omni RMSNorm plateau (post-#3933) | `6f4bd3e` | 0.950902 | 26.75 |
| torch RMSNorm (this PR) | `b1e12d1a` | 0.959643 | 28.017508 |

Accuracy: `test_qwen_image_matches_diffusers` in CI

| Variant (CI) | Commit | SSIM | PSNR (dB) |
| --- | --- | --- | --- |
| baseline (before #3933 merged) | “Last PASS commit” `11c4fced` | 0.972616 | 30.022175 |
| omni RMSNorm plateau (post-#3933) | `6f4bd3e` | 0.941089 | 25.499989 |
| torch RMSNorm (after this PR) | `b1e12d1a` | 0.975093 | 30.727339 |
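
As a quick reading aid, the sketch below applies the SSIM ≥ 0.97 threshold from the Purpose section to the CI values reported above. It is only an illustration of the pass/fail rule, not the actual test code; the dictionary keys and variable names are made up for this example.

```python
# SSIM threshold quoted in the Purpose section of this PR.
SSIM_THRESHOLD = 0.97

# CI SSIM values copied from the table above (labels are illustrative).
ci_ssim = {
    "baseline (11c4fced)": 0.972616,
    "omni RMSNorm (6f4bd3e)": 0.941089,
    "torch RMSNorm (b1e12d1a)": 0.975093,
}

# A variant passes when its SSIM meets or exceeds the threshold.
passed = {name: ssim >= SSIM_THRESHOLD for name, ssim in ci_ssim.items()}

for name, ok in passed.items():
    print(f"{name}: SSIM={ci_ssim[name]:.6f} -> {'PASS' if ok else 'FAIL'}")
```

Under this rule the baseline and the torch RMSNorm variant clear the threshold in CI, while the omni RMSNorm variant does not.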

Workload: 512×512, 10 steps, 500 requests (+1 warmup), max-concurrency=1.
Baseline table (v0.18.0 as baseline) + torch RMSNorm (this PR):

| Metric | v0.18.0 | v0.20.0 (vLLM RMSNorm) | v0.20.0 (omni RMSNorm) | v0.20.0 (torch RMSNorm, this PR) |
| --- | --- | --- | --- | --- |
| throughput_qps | 0.843 | 0.742 (−11.9%) | 0.811 (−3.8%) | 0.829 (−1.6%) |
| latency_mean | 1.187 | 1.347 (+13.5%) | 1.233 (+3.9%) | 1.206 (+1.6%) |
| latency_median | 1.181 | 1.317 (+11.5%) | 1.223 (+3.5%) | 1.187 (+0.5%) |
| latency_p99 | 1.269 | 1.647 (+29.8%) | 1.363 (+7.4%) | 1.372 (+8.2%) |
| latency_p95 | 1.234 | 1.459 (+18.2%) | 1.298 (+5.2%) | 1.294 (+4.9%) |
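
The percentages in parentheses read as relative changes against the v0.18.0 column. A small sketch of that arithmetic for the latency_mean row follows; the helper name `pct_change` is hypothetical. Because the table shows rounded values, recomputing other cells from the displayed numbers may differ from the table in the last digit.

```python
def pct_change(value, baseline):
    """Relative change of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# latency_mean values copied from the table above.
baseline_latency_mean = 1.187  # v0.18.0
variants = {
    "vLLM RMSNorm": 1.347,
    "omni RMSNorm": 1.233,
    "torch RMSNorm": 1.206,
}

deltas = {
    name: round(pct_change(v, baseline_latency_mean), 1)
    for name, v in variants.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}% vs v0.18.0")
```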

Compare (vLLM RMS_Norm / Torch RMS_Norm) in profiling

vLLM RMS_Norm CPU time vs Torch RMS_Norm CPU time

vLLM RMS_Norm: ~57us
Torch RMS_Norm: ~14us

vLLM RMS_Norm CPU time vs Torch RMS_Norm CUDA time

vLLM RMS_Norm: ~62us
Torch RMS_Norm: ~31us

Signed-off-by: NumberWan <wantszkin2003@gmail.com>

hsliuustc0106

lgtm

hsliuustc0106 · 2026-06-02T13:51:25Z

can you provide a profiling for different versions and this PR?

NumberWan · 2026-06-03T01:56:32Z

> can you provide a profiling for different versions and this PR?

The difference in profiling comparison has been added to the description.

…orm backend) (vllm-project#4074) Signed-off-by: NumberWan <wantszkin2003@gmail.com>

…orm backend) (vllm-project#4074) Signed-off-by: NumberWan <wantszkin2003@gmail.com> Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

NumberWan requested review from Isotr0py, RuixiangMa, SamitHuang, ZJY0516, david6666666, princepride and wtomin as code owners June 2, 2026 08:35

NumberWan changed the title ~~Qwen perf~~ [BugFix]Qwen-Image performance regression by using torch RMSNorm(RMSNorm backend) Jun 2, 2026

Gaohan123 added this to the v0.22.0 milestone Jun 2, 2026

Gaohan123 added the ready (label to trigger buildkite CI) and diffusion-x2iat-test (label to trigger buildkite x2image + x2audio + x2text series of diffusion models test in nightly CI) labels Jun 2, 2026

Gaohan123 linked an issue Jun 2, 2026 that may be closed by this pull request

[Performance]: Qwen-Image on vLLM-Omni 0.18 -> latest performance regression #3812

Closed

1 task

Edited pre-commit

b18d70a

Signed-off-by: NumberWan <wantszkin2003@gmail.com>

NumberWan force-pushed the qwen_perf branch from bb9f17f to b18d70a Compare June 2, 2026 09:40

hsliuustc0106 approved these changes Jun 2, 2026

Gaohan123 merged commit 35ee3c7 into vllm-project:main Jun 2, 2026
8 checks passed

fhfuih mentioned this pull request Jun 4, 2026

[CI][bugfix]: Improve Qwen Image accuracy test with diffusers attn alignment #4143

Merged

5 tasks

86MaxCao pushed a commit to 86MaxCao/vllm-omni that referenced this pull request Jun 4, 2026

[BugFix]Qwen-Image performance regression by using torch RMSNorm(RMSN…

55f18c5

…orm backend) (vllm-project#4074) Signed-off-by: NumberWan <wantszkin2003@gmail.com>

Gaohan123 merged 1 commit into vllm-project:main from NumberWan:qwen_perf
