[BugFix]Qwen-Image performance regression by using omni RMSNorm(RMSNorm backend)#3933
Signed-off-by: NumberWan <wantszkin2003@gmail.com>
|
@fhfuih @hsliuustc0106 PTAL |
|
@SamitHuang @wtomin PTAL |
|
Thanks for the PR. I also had an offline discussion with @NumberWan. The performance gain from using torch's RMSNorm is probably only possible after #3352: I tried the same thing before #3352, and torch RMSNorm performed worse. (I didn't profile it though, only tested e2e latency as a coarse exploration.) But since we can observe a performance gain now, using torch's RMSNorm also reduces the engineering dependency on vllm. Looks good to me. I think we can use torch RMSNorm on Qwen Image first, see if it's stable in the coming releases, and gradually promote this op to other diffusion models in future iterations (also doc pages/examples like "add a diffusion model"). |
|
|
||
| logger = init_logger(__name__) | ||
|
|
||
| # Before #41804 issue fixed, PyTorch RMSNorm replaced vLLM RMSNorm for Qwen-Image. |
Actually if torch RMSNorm has performance gain, we may not need to change it back to vllm's RMSNorm regardless of their 41804 issue
Perhaps we should consider the potential benefits that vllm RMSNorm can bring to other platforms (such as NPU). I also noticed that vllm-omni has its own RMSNorm implementation (vllm_omni/diffusion/layers/norm.py). What is the purpose of this file, and why not consider making some replacement attempts only on CUDA?
Thank you for your suggestion. I will keep tracking the performance of both (vllm / torch) RMSNorm implementations to select the suitable one for each version.
Thank you for your suggestion. I understand your consideration. Let me explore performance on other platforms (such as NPU) and compare how the change affects them.
I also noticed that vllm-omni has its own RMSNorm implementation
At least I tested this impl when upgrading from 0.19 to 0.20. At that time, this impl and the torch op were similarly slow. So @NumberWan, yes, you can also retest this impl.
But
What is the purpose of this file
This idk. AFAIK Wan used it, and otherwise it is seldom used
|
Can you test our previous setting: 512 independent requests with batch size=1, num_inference_steps=10, resolution 512x512? In this case, we see a ~10% performance drop compared to v0.18. |
I used this setting and ran 50 requests, here is the result
There is still a regression vs v0.18.0 on client latency, but switching from vLLM RMSNorm to omni RMSNorm reduces the gap from ~10% (vLLM, latency_mean) to ~6% (omni). |
|
what are the left gaps? |
Not confirmed yet, will continue to track this issue. |
…rm backend) (vllm-project#3933) Signed-off-by: NumberWan <wantszkin2003@gmail.com>
Purpose
Mitigate the Qwen-Image diffusion performance regression observed in vLLM-Omni 0.20.0 (steady-state compiled runs) by switching Qwen-Image RMSNorm layers to vllm_omni.diffusion.layers.norm.RMSNorm, avoiding the vLLM RMSNorm extension kernel path and validating improvement via profiler totals.
There is still a regression vs v0.18.0 on client latency, but switching from vLLM RMSNorm to omni RMSNorm reduces the gap from ~10% (vLLM, latency_mean) to ~6% (omni).
Description
In 0.20.0 compiled runs, Qwen-Image goes through the vLLM RMSNorm extension kernels (_C::rms_norm / vllm::rms_norm_kernel), which contribute ~310ms self CUDA in our run.
Switching to vllm_omni.diffusion.layers.norm.RMSNorm:
removes the vLLM extension kernels (vllm_ext=0)
changes profiler_out_0.txt totals by -0.098s Self CPU and -0.092s Self CUDA
This PR updates vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py to use vllm_omni.diffusion.layers.norm.RMSNorm.
We are investigating a Qwen-Image performance regression between vLLM-Omni 0.18.0 and 0.20.0 (see issue #3812). Profiling indicates notable differences in operator paths/buckets in compiled runs (e.g., mm/addmm ratio, CompiledFxGraph time, and normalization kernels).
To narrow down to a low-risk and actionable patch, we performed keyword aggregation and an A/B test specifically on the RMSNorm backend used by Qwen-Image. The results suggest that the vLLM RMSNorm extension path is a measurable contributor to the regression in our environment, and switching to vllm_omni.diffusion.layers.norm.RMSNorm improves both keyword-aggregated norm time and overall profiler totals.
Test Plan
0.20.0 compile+warmup A/B version (vLLM RMS_Norm / omni RMS_Norm)
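The keyword aggregation mentioned in the Description can be sketched roughly as below. This is an illustration only: the rows and their timings are hypothetical placeholders (not values from profiler_out_0.txt), and the helper name `aggregate` and the bucket names are assumptions; only the keyword patterns are taken from this PR.

```python
import re

# Hypothetical (op name, self CUDA time in microseconds) rows.
# Placeholder values for illustration, NOT measurements from this PR.
ROWS = [
    ("_C::rms_norm", 120.0),
    ("vllm::rms_norm_kernel", 80.0),
    ("aten::layer_norm", 40.0),
    ("aten::mm", 500.0),
]

# Keyword buckets: patterns follow the ones named in this PR,
# the bucket names themselves are assumed.
BUCKETS = {
    "norm": re.compile(r"rms_norm|layer_norm|norm"),
    "vllm_ext": re.compile(r"_C::rms_norm|vllm::rms_norm_kernel"),
}


def aggregate(rows, buckets=BUCKETS):
    """Sum self CUDA time per bucket for every op whose name matches."""
    totals = {name: 0.0 for name in buckets}
    for op_name, self_cuda_us in rows:
        for bucket, pattern in buckets.items():
            if pattern.search(op_name):
                totals[bucket] += self_cuda_us
    return totals


totals = aggregate(ROWS)
```

Running the same aggregation on both traces and subtracting the per-bucket totals gives the kind of vLLM-vs-omni comparison reported under Test Result.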
Test Result
Compare (vLLM RMS_Norm / omni RMS_Norm) in profiling
vLLM RMS_Norm CPU time vs omni RMS_Norm CPU time
vLLM
Omni
omni RMS_Norm: ~13us
vLLM RMS_Norm: ~57us
vLLM RMS_Norm CUDA time vs omni RMS_Norm CUDA time
vLLM
omni RMS_Norm: ~31us
vLLM RMS_Norm: ~62us
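For reference, the per-call ratios implied by the approximate figures above can be worked out as follows. The helper `speedup` is a hypothetical name, and the inputs are the rounded values quoted in this section, so the ratios are only indicative.

```python
def speedup(baseline_us: float, candidate_us: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_us / candidate_us


# Approximate per-call times quoted above (vLLM RMS_Norm vs omni RMS_Norm).
cpu_speedup = speedup(57.0, 13.0)   # CPU time per call
cuda_speedup = speedup(62.0, 31.0)  # CUDA time per call
```

With these rounded inputs the omni path is roughly 4.4x cheaper per call on CPU time and 2x cheaper on CUDA time.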
Changed Comparison
profiler_out_0.txt totals: Self CPU
profiler_out_0.txt totals: Self CUDA
(rms_norm|layer_norm|norm): self CUDA
vllm_ext (_C::rms_norm + vllm::rms_norm_kernel): self CUDA
Summary (vllm - omni):
End-to-end comparison: vLLM RMSNorm vs omni RMSNorm
diffusion_forward (1 occurrence per trace)
vLLM RMSNorm: 2.74s
omni RMSNorm: 2.64s
Delta (vLLM − omni): +0.101s (~+101.0ms)
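As a sanity check on the end-to-end numbers, the delta and its share of the baseline can be recomputed from the rounded figures above. The function name `delta_and_share` is hypothetical; because the inputs are rounded to two decimals, the result (about 0.10s) differs slightly from the reported +0.101s, which presumably comes from unrounded trace values.

```python
def delta_and_share(baseline_s: float, candidate_s: float):
    """Return (absolute saving in seconds, saving as a fraction of baseline)."""
    delta = baseline_s - candidate_s
    return delta, delta / baseline_s


# Rounded diffusion_forward times quoted above: vLLM RMSNorm vs omni RMSNorm.
delta_s, share = delta_and_share(2.74, 2.64)
```

On these inputs the saving is about 100ms, or roughly 3.6% of the vLLM RMSNorm diffusion_forward time.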
What this PR changes
vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py: use vllm_omni.diffusion.layers.norm.RMSNorm for Qwen-Image RMSNorm layers (avoid the vLLM RMSNorm extension path).