[AutoRound] Support Qwen Omni W4A16 quantization model #2670
yiliu30
left a comment
Overall LGTM. It would be great to include a command for running the end-to-end example in the PR description.
Well-structured PR. 31 unit tests, comprehensive benchmarks, docs updated. The accuracy delta (-0.02 on omni_bench) is acceptable for W4A16.
@hsliuustc0106 @lishunyang12,
Rerun accuracy test with latest code:
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Fix CI
@lishunyang12,
Let's wait for CI to complete.
# Newer transformers use rope_parameters instead of rope_scaling
rope_params = getattr(talker_config.text_config, "rope_parameters", None) or {}
rope_params["rope_theta"] = talker_config.text_config.rope_theta
rope_params = dict(rope_params)
@amy-why-3459 transformers will be upgraded to 5.x next week, please check the logic here
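As a point of reference for the check requested above: in the quoted lines, rope_theta is written into the dictionary before dict(...) copies it, so when the config already carries a rope_parameters dict, that dict is modified in place. A minimal sketch of a copy-first variant follows. The helper name build_rope_params and the SimpleNamespace stand-in are hypothetical and not part of the PR, and whether rope_theta remains a top-level attribute in transformers 5.x is an assumption left open here, which is why the sketch reads it defensively.

```python
from types import SimpleNamespace


def build_rope_params(text_config):
    """Return a fresh rope-parameter dict without mutating text_config."""
    # Newer transformers expose rope_parameters; older configs may not.
    source = getattr(text_config, "rope_parameters", None) or {}
    # Copy before writing so the config's own dict stays untouched.
    rope_params = dict(source)
    # Assumption: rope_theta may be absent as a top-level attribute.
    rope_theta = getattr(text_config, "rope_theta", None)
    if rope_theta is not None:
        rope_params["rope_theta"] = rope_theta
    return rope_params


# Stand-in config object for illustration only (not a real transformers config).
cfg = SimpleNamespace(rope_parameters={"rope_type": "default"}, rope_theta=1_000_000.0)
params = build_rope_params(cfg)
```

With this ordering, params carries rope_theta while cfg.rope_parameters is left as it was.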
@hsliuustc0106 @lishunyang12,
…2670) Signed-off-by: lvliang-intel <liang1.lv@intel.com> Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Purpose
Enable loading offline AutoRound W4A16 quantized checkpoints for Qwen Omni multi-stage models (Qwen2.5-Omni and Qwen3-Omni). Extends the AutoRound/INC support added in #1777 (FLUX) to the Omni pipeline.
Related: #1325, #1777
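For readers unfamiliar with the W4A16 label: weights are stored as 4-bit integer codes with per-group scaling and are expanded back to floating point at compute time, while activations stay in 16-bit. The sketch below illustrates only that storage idea using plain round-to-nearest on a single weight group. It is not AutoRound's algorithm, which tunes the rounding rather than rounding to nearest, and it is not the checkpoint format this PR loads. The function names and the use of the group minimum as the offset are assumptions made for illustration.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one weight group."""
    qmax = (1 << bits) - 1  # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    # Fall back to 1.0 so a constant group does not produce a zero scale.
    scale = (hi - lo) / qmax or 1.0
    # Map each weight to an integer code in [0, qmax].
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Expand integer codes back to approximate floating-point weights."""
    return [c * scale + lo for c in codes]


group = [-0.8, -0.1, 0.0, 0.35, 0.7]
codes, scale, lo = quantize_group(group)
restored = dequantize_group(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(group, restored))
```

In this scheme the reconstruction error of each weight is bounded by half the group's scale, which is the kind of loss the accuracy comparison in this PR is meant to quantify end to end.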
Test Plan
tests/diffusion/quantization/test_component_routing.py covering
Test Result
Unit tests: all 31 tests pass.
CMD and Model:
Accuracy Test:
Using evalscope to test dataset omni-bench:
Performance Benchmark Results: FP8 (W8A8) vs INT4 (W4A16 AutoRound)
fp8_bench.json
int4_bench.json
Model: Qwen3-Omni-30B-A3B-Instruct
Hardware: 8x NVIDIA RTX 5090 D (32GB)
Configuration: Thinker TP=4 (GPU 0-3), Talker TP=2 (GPU 4-5), Code2Wav TP=2 (GPU 6-7)
Benchmark: 200 ShareGPT prompts, request_rate=10, max_concurrency=32, temperature=0, custom_output_len=256
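The stage placement listed above accounts for all eight GPUs (4 + 2 + 2) with no device shared between stages. A small sketch of how such a layout could be sanity-checked before launching; the dictionary schema and validate_layout are hypothetical and do not reflect vllm-omni's actual stage configuration format.

```python
# Hypothetical description of the benchmark layout from the text above.
STAGES = {
    "thinker": {"tp": 4, "gpus": [0, 1, 2, 3]},
    "talker": {"tp": 2, "gpus": [4, 5]},
    "code2wav": {"tp": 2, "gpus": [6, 7]},
}


def validate_layout(stages, total_gpus):
    """Check TP sizes against GPU lists and reject overlaps; return GPUs used."""
    used = []
    for name, stage in stages.items():
        if len(stage["gpus"]) != stage["tp"]:
            raise ValueError(f"{name}: tp={stage['tp']} but {len(stage['gpus'])} GPUs listed")
        used.extend(stage["gpus"])
    if len(set(used)) != len(used):
        raise ValueError("a GPU is assigned to more than one stage")
    if any(g < 0 or g >= total_gpus for g in used):
        raise ValueError("GPU index out of range")
    return len(used)


gpus_in_use = validate_layout(STAGES, total_gpus=8)
```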
Serving Performance
Text Generation
Audio Generation
Latency
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)