
[BugFix]Qwen-Image performance regression by using omni RMSNorm(RMSNorm backend)#3933

Merged
hsliuustc0106 merged 6 commits into
vllm-project:mainfrom
NumberWan:qwen_perf
May 31, 2026
Conversation

@NumberWan

@NumberWan NumberWan commented May 28, 2026

Contributor

Purpose

Mitigate the Qwen-Image diffusion performance regression observed in vLLM-Omni 0.20.0 (steady-state compiled runs) by switching Qwen-Image RMSNorm layers to vllm_omni.diffusion.layers.norm.RMSNorm, avoiding the vLLM RMSNorm extension kernel path and validating improvement via profiler totals.

| Metric | v0.18.0 | v0.20.0 (vLLM RMSNorm) | v0.20.0 (omni RMSNorm) |
|---|---|---|---|
| throughput_qps | 0.843 | 0.763 (−9.4%) | 0.792 (−6.0%) |
| latency_mean | 1.187 | 1.311 (+10.4%) | 1.263 (+6.4%) |
| latency_median | 1.181 | 1.307 (+10.7%) | 1.258 (+6.5%) |
| latency_p99 | 1.269 | 1.373 (+8.2%) | 1.316 (+3.7%) |
| latency_p95 | 1.234 | 1.362 (+10.4%) | 1.287 (+4.3%) |

There is still a regression vs v0.18.0 on client latency, but switching from vLLM RMSNorm to omni RMSNorm reduces the gap from ~10% (vLLM, latency_mean) to ~6% (omni).
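For anyone who wants to re-derive the percentages, here is a minimal sketch of the arithmetic, using the latency_mean values reported above. The helper name `pct_change` is made up for illustration and is not part of the benchmark tooling.

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of `value` versus `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# latency_mean values copied from the comparison above (units as reported there).
v018 = 1.187
v020_vllm = 1.311
v020_omni = 1.263

print(f"vLLM RMSNorm vs v0.18.0: {pct_change(v018, v020_vllm):+.1f}%")  # +10.4%
print(f"omni RMSNorm vs v0.18.0: {pct_change(v018, v020_omni):+.1f}%")  # +6.4%
```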

Description

  • Issue: Qwen-Image diffusion shows a performance regression when moving from vLLM-Omni 0.18.0 to 0.20.0 (steady-state compiled runs).
  • Observation: In 0.20.0, Qwen-Image uses vLLM’s RMSNorm extension kernels (_C::rms_norm / vllm::rms_norm_kernel) which contribute ~310ms self CUDA in our run.
  • A/B result: Switching Qwen-Image RMSNorm to vllm_omni.diffusion.layers.norm.RMSNorm:
    • removes those vLLM RMSNorm extension rows (vllm_ext=0)
    • reduces norm keyword aggregate self CUDA by ~309ms
    • improves profiler_out_0.txt totals by -0.098s Self CPU and -0.092s Self CUDA
  • Change: Qwen-Image-only change in vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py to use vllm_omni.diffusion.layers.norm.RMSNorm.
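For context, an RMSNorm layer rescales each hidden vector by the reciprocal of its root mean square and then applies a learned per-channel weight. The sketch below is a plain-Python reference of that math only. It is not the code in vllm_omni.diffusion.layers.norm or the vLLM extension kernel, and the function name and `eps` default are assumptions for illustration.

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """Reference RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


# Toy input, not taken from the model.
hidden = [1.0, -2.0, 3.0, -4.0]
out = rms_norm(hidden, weight=[1.0] * len(hidden))
print(out)
```

With unit weights the output has a mean square very close to 1, which is a quick sanity check when comparing two backends for numerical agreement.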

We are investigating a Qwen-Image performance regression between vLLM-Omni 0.18.0 and 0.20.0 (see issue #3812). Profiling indicates notable differences in operator paths/buckets in compiled runs (e.g., mm/addmm ratio, CompiledFxGraph time, and normalization kernels).

To narrow down to a low-risk and actionable patch, we performed keyword aggregation and an A/B test specifically on RMSNorm backend used by Qwen-Image. The results suggest that the vLLM RMSNorm extension path is a measurable contributor to the regression in our environment, and switching to vllm_omni.diffusion.layers.norm.RMSNorm improves both keyword-aggregated norm time and overall profiler totals.
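The keyword aggregation mentioned above can be sketched roughly as follows, assuming the profiler rows have already been parsed into (operator name, self CUDA ms) pairs. The parsing step, the function name, and the numbers in `toy_rows` are all hypothetical. Only the keyword pattern and the two vLLM kernel names come from this PR.

```python
import re

# Keyword pattern used for the "norm" bucket in this PR.
NORM_PATTERN = re.compile(r"rms_norm|layer_norm|norm")


def aggregate_self_cuda_ms(rows, pattern=NORM_PATTERN) -> float:
    """Sum self CUDA time over rows whose operator name matches `pattern`."""
    return sum(ms for name, ms in rows if pattern.search(name))


# Toy rows with made-up timings, only to show the mechanics.
toy_rows = [
    ("_C::rms_norm", 3.0),
    ("vllm::rms_norm_kernel", 2.5),
    ("aten::native_layer_norm", 1.5),
    ("aten::mm", 10.0),
]

total = aggregate_self_cuda_ms(toy_rows)
print(f"norm bucket: {total} ms")  # 7.0 ms for the toy rows
```

Because the pattern ends in a bare `norm`, it also catches any other operator with "norm" in its name, so the bucket is an upper bound on RMSNorm time rather than an exact figure.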

Test Plan

A/B comparison on 0.20.0 after compile and warmup (vLLM RMS_Norm vs. omni RMS_Norm).
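The test scripts are not included in this description. As a rough illustration of the warmup-then-measure pattern an A/B run like this relies on, here is a minimal sketch. The harness name and the two CPU-only stand-in workloads are hypothetical and are not the scripts used for the numbers in this PR.

```python
import time
from statistics import mean


def time_steady_state(fn, warmup: int = 3, iters: int = 10) -> float:
    """Mean wall-clock seconds per call, measured after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # absorbs one-time costs such as compilation or cache fills
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return mean(samples)


# Hypothetical CPU-only stand-ins for the two variants under test.
def variant_a() -> int:
    return sum(i * i for i in range(20_000))


def variant_b() -> int:
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    for name, fn in (("A", variant_a), ("B", variant_b)):
        print(f"variant {name}: {time_steady_state(fn) * 1e3:.3f} ms/call")
```

For GPU work the device would need to be synchronized before each clock read (for example with torch.cuda.synchronize()), otherwise the timer mostly captures kernel launch time.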

Test Result

Compare (vLLM RMS_Norm / omni RMS_Norm) in profiling

vLLM RMS_Norm CPU time vs omni RMS_Norm CPU time

vLLM

vllm_rms_norm

Omni

omni_rms_norm

omni RMS_Norm: ~13us
vLLM RMS_Norm: ~57us

vLLM RMS_Norm CUDA time vs omni RMS_Norm CUDA time

vLLM

vllm_rms_norm_GPU

Omni

omni_rms_norm_GPU

omni RMS_Norm: ~31us
vLLM RMS_Norm: ~62us

Changed Comparison

| Item | vLLM RMSNorm (before) | omni RMSNorm (after) |
|---|---|---|
| profiler_out_0.txt totals, Self CPU | 2.796s | 2.698s |
| profiler_out_0.txt totals, Self CUDA | 2.651s | 2.559s |
| keyword aggregate (rms_norm\|layer_norm\|norm), self CUDA | 503.527ms | 194.607ms |
| vllm_ext (_C::rms_norm + vllm::rms_norm_kernel), self CUDA | 310.067ms | 0ms |

Summary (vllm - omni):

  • Self CPU +0.098s
  • Self CUDA +0.092s
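The deltas in this summary follow directly from the before/after figures reported above. A small sketch of that subtraction, with dictionary keys that are made up for readability:

```python
# Figures copied from the before/after comparison above.
before = {"self_cpu_s": 2.796, "self_cuda_s": 2.651, "norm_self_cuda_ms": 503.527}
after = {"self_cpu_s": 2.698, "self_cuda_s": 2.559, "norm_self_cuda_ms": 194.607}

# Positive values mean the vLLM RMSNorm run was slower than the omni RMSNorm run.
deltas = {key: round(before[key] - after[key], 3) for key in before}

for key, value in deltas.items():
    print(f"{key}: +{value}")
```

The norm-bucket saving (about 309 ms) is larger than the change in total Self CUDA (about 92 ms), so the time removed from the norm kernels does not translate one-to-one into the overall total.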

End-to-end comparison: vLLM RMSNorm vs omni RMSNorm

diffusion_forward (1 occurrence per trace)
vLLM RMSNorm: 2.74s
omni RMSNorm: 2.64s
Delta (vLLM − omni): +0.101s (~+101.0ms)

What this PR changes

  • File: vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py
  • Change: Use vllm_omni.diffusion.layers.norm.RMSNorm for Qwen-Image RMSNorm layers (avoid vLLM RMSNorm extension path).
  • Scope: Qwen-Image only (no changes to other models/pipelines).


NumberWan added 2 commits May 28, 2026 08:14
Signed-off-by: NumberWan <wantszkin2003@gmail.com>
Signed-off-by: NumberWan <wantszkin2003@gmail.com>
NumberWan added 2 commits May 28, 2026 08:37
Signed-off-by: NumberWan <wantszkin2003@gmail.com>
Signed-off-by: NumberWan <wantszkin2003@gmail.com>
@NumberWan

Contributor Author

@fhfuih @hsliuustc0106 PTAL

@hsliuustc0106

Collaborator

@SamitHuang @wtomin PTAL

@fhfuih

fhfuih commented May 29, 2026

Contributor

Thanks for the PR. I also had an offline discussion with @NumberWan.

The performance gain from using torch's RMSNorm probably became possible only after #3352: I tried the same thing before #3352, and torch RMSNorm performed worse. (I didn't profile it, though; I only tested e2e latency as a coarse exploration.)

Since we can observe a performance gain now, and using torch's RMSNorm also reduces the engineering dependency on vllm, this looks good to me.

I think we can use torch RMSNorm on Qwen Image first, see if it's stable in the coming releases, and gradually promote this op to other diffusion models in future iterations (also doc pages/examples like "add a diffusion model")


logger = init_logger(__name__)

# Before #41804 issue fixed, PyTorch RMSNorm replaced vLLM RMSNorm for Qwen-Image.

Contributor

Actually if torch RMSNorm has performance gain, we may not need to change it back to vllm's RMSNorm regardless of their 41804 issue

Copy link
Copy Markdown


Perhaps we should consider the potential benefits that vllm RMSNorm can bring to other platforms (such as NPU). I also noticed that vllm-omni has its own RMSNorm implementation (vllm_omni/diffusion/layers/norm.py). What is the purpose of this file, and why not consider making some replacement attempts only on CUDA?

Contributor Author

Perhaps we should consider the potential benefits that vllm RMSNorm can bring to other platforms (such as NPU). I also noticed that vllm-omni has its own RMSNorm implementation (vllm_omni/diffusion/layers/norm.py). What is the purpose of this file, and why not consider making some replacement attempts only on CUDA?

Thank you for your suggestion. I will keep tracking the performance of both RMSNorm implementations (vllm / torch) so we can select the more suitable one for each version.

Contributor Author

Perhaps we should consider the potential benefits that vllm RMSNorm can bring to other platforms (such as NPU). I also noticed that vllm-omni has its own RMSNorm implementation (vllm_omni/diffusion/layers/norm.py). What is the purpose of this file, and why not consider making some replacement attempts only on CUDA?

Thank you for your suggestion; I understand your concern. Let me check the performance on other platforms (such as NPU) and compare how this change affects them.

Contributor

I also noticed that vllm-omni has its own RMSNorm implementation

At least, I tested this impl when upgrading from 0.19 to 0.20. At that time, this impl and the torch op were similarly slow. So yes, @NumberWan, you can also retest this impl.

But

What is the purpose of this file

This idk. AFAIK Wan used it, and otherwise it is seldom used
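The suggestion raised earlier in this thread, to switch the RMSNorm backend only on CUDA and keep vLLM's RMSNorm elsewhere (such as NPU), could look roughly like the sketch below. Everything in it is hypothetical: the enum, the function name, and the platform strings are illustrative and do not correspond to an existing vLLM-Omni API, and this PR does not implement such a switch.

```python
from enum import Enum


class NormBackend(Enum):
    VLLM = "vllm"
    OMNI = "omni"


def choose_norm_backend(platform: str) -> NormBackend:
    """Hypothetical policy: use omni RMSNorm only where it was measured (CUDA)."""
    # Other platforms keep vLLM's RMSNorm until they have been profiled.
    return NormBackend.OMNI if platform == "cuda" else NormBackend.VLLM


for platform in ("cuda", "npu"):
    print(platform, "->", choose_norm_backend(platform).value)
```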

Signed-off-by: NumberWan <wantszkin2003@gmail.com>
@NumberWan NumberWan changed the title [BugFix]Qwen-Image performance regression by using torch RMSNorm(RMSNorm backend) [BugFix]Qwen-Image performance regression by using omni RMSNorm(RMSNorm backend) May 29, 2026
@SamitHuang

SamitHuang commented May 29, 2026

Collaborator

Can you test our previous setting:

512 independent requests with batch size=1, num_inference_steps=10, resolution 512x512

In this case, we have ~10% performance drop compared to v0.18

@NumberWan

NumberWan commented May 29, 2026

Contributor Author

Can you test our previous setting:

512 independent requests with batch size=1, num_inference_steps=10, resolution 512x512

In this case, we have ~10% performance drop compared to v0.18

I used this setting and ran 50 requests, here is the result

| Metric | v0.18.0 | v0.20.0 (vLLM RMSNorm) | v0.20.0 (omni RMSNorm) |
|---|---|---|---|
| throughput_qps | 0.843 | 0.763 (−9.4%) | 0.792 (−6.0%) |
| latency_mean | 1.187 | 1.311 (+10.4%) | 1.263 (+6.4%) |
| latency_median | 1.181 | 1.307 (+10.7%) | 1.258 (+6.5%) |
| latency_p99 | 1.269 | 1.373 (+8.2%) | 1.316 (+3.7%) |
| latency_p95 | 1.234 | 1.362 (+10.4%) | 1.287 (+4.3%) |

There is still a regression vs v0.18.0 on client latency, but switching from vLLM RMSNorm to omni RMSNorm reduces the gap from ~10% (vLLM, latency_mean) to ~6% (omni).
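For reference, metrics with these names are typically derived from per-request latencies along the lines of the sketch below. It rests on assumptions: latencies in seconds, throughput defined as completed requests divided by wall time, and nearest-rank percentiles. The benchmark client used here may compute them differently, and the sample values are made up.

```python
import math
from statistics import mean, median


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_s: list[float], wall_time_s: float) -> dict[str, float]:
    """Aggregate per-request latencies into the metrics named in the table."""
    return {
        "throughput_qps": len(latencies_s) / wall_time_s,
        "latency_mean": mean(latencies_s),
        "latency_median": median(latencies_s),
        "latency_p95": percentile(latencies_s, 95),
        "latency_p99": percentile(latencies_s, 99),
    }


# Made-up latencies, only to exercise the helpers.
toy_latencies = [1.0, 1.1, 1.2, 1.3, 1.4]
stats = summarize(toy_latencies, wall_time_s=6.0)
print(stats)
```

With only 50 requests, p99 under nearest-rank is simply the slowest request, so the tail percentiles in a short run are more sensitive to a single outlier than the mean or median.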

@hsliuustc0106

Collaborator

What are the remaining gaps?

@NumberWan

Contributor Author

What are the remaining gaps?

Not confirmed yet; I will continue to track this issue.
