[AutoRound] Support Qwen Omni W4A16 quantization model by lvliang-intel · Pull Request #2670 · vllm-project/vllm-omni

lvliang-intel · 2026-04-10T06:19:42Z

Purpose

Enable loading of offline-quantized AutoRound W4A16 checkpoints for the Qwen Omni multi-stage models (Qwen2.5-Omni and Qwen3-Omni). This extends the AutoRound/INC support added in #1777 (FLUX) to the Omni pipeline.

Related: #1325, #1777
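
For readers unfamiliar with the W4A16 label: weights are stored as 4-bit integers plus per-group scaling metadata, while activations stay in 16-bit. The sketch below only illustrates that storage idea with plain round-to-nearest on a single group. It is not AutoRound's tuned rounding, and the asymmetric scheme, the group contents, and the function names are assumptions made for illustration; the PR does not state the scheme or group size of the checkpoints.

```python
def quantize_group(weights: list[float]) -> tuple[list[int], float, int]:
    """Asymmetric 4-bit round-to-nearest quantization of one weight group."""
    lo, hi = min(weights), max(weights)
    # 4 bits give 16 levels (codes 0..15); avoid dividing by zero for a constant group
    scale = (hi - lo) / 15 if hi > lo else 1.0
    zero_point = round(-lo / scale)  # integer offset that maps `lo` to code 0
    codes = [min(15, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes: list[int], scale: float, zero_point: int) -> list[float]:
    """Recover approximate weights; this result is what gets multiplied with 16-bit activations."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weight group, chosen only to show the round trip
group = [-0.8, -0.1, 0.0, 0.35, 0.7]
codes, scale, zero_point = quantize_group(group)
restored = dequantize_group(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(group, restored))
print(codes, f"scale={scale:.3f}", f"max_error={max_error:.3f}")
```

Each restored weight differs from the original by at most about half a quantization step, which is the error that methods such as AutoRound try to reduce by choosing the rounding more carefully.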

Test Plan

Unit tests in tests/diffusion/quantization/test_component_routing.py.
Manual end-to-end loading verified with the Qwen3-Omni-30B-A3B W4A16 and Qwen2.5-Omni-7B W4A16 checkpoints; the original (unquantized) models are also covered.

Test Result

Unit tests: all 31 tests pass.
Commands and models:

CUDA_VISIBLE_DEVICES=0,1 python examples/offline_inference/qwen2_5_omni/end2end.py --model Intel/Qwen2.5-Omni-7B-int4-AutoRound --output-wav output_audio --query-type use_audio
CUDA_VISIBLE_DEVICES=0,1 python examples/offline_inference/qwen3_omni/end2end.py --model Intel/Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound --output-wav output_audio --query-type use_audio

Model	BF16 Baseline Checkpoint Size (GiB)	W4A16 (AutoRound) Checkpoint Size (GiB)	Size Reduction
Qwen3-Omni-30B-A3B	66	25	62%
Qwen2.5-Omni-7B	21	12	43%
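
The size reduction figures follow directly from the checkpoint sizes in the table. A minimal sketch of the arithmetic, with a helper name that is illustrative rather than taken from the PR:

```python
def size_reduction_percent(baseline_gib: float, quantized_gib: float) -> float:
    """Relative checkpoint size reduction, as a percentage of the baseline size."""
    return (baseline_gib - quantized_gib) / baseline_gib * 100.0


# (BF16 size, W4A16 size) in GiB, copied from the table above
checkpoints = {
    "Qwen3-Omni-30B-A3B": (66, 25),
    "Qwen2.5-Omni-7B": (21, 12),
}

for model, (bf16_gib, w4a16_gib) in checkpoints.items():
    reduction = size_reduction_percent(bf16_gib, w4a16_gib)
    print(f"{model}: {reduction:.0f}% smaller")
```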

Accuracy Test:

CUDA_VISIBLE_DEVICES=0,1 vllm serve Intel/Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound --omni --port 8801 --max-model-len 32768 --served-model-name Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound
CUDA_VISIBLE_DEVICES=0,1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8801 --max-model-len 32768 --served-model-name Qwen3-Omni-30B-A3B-Instruct

Evaluated on the omni_bench dataset with evalscope:

python - <<'PY'
from evalscope import TaskConfig, run_task
task_cfg = TaskConfig(
    model='Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound',
    api_url='http://127.0.0.1:8801/v1',
    api_key='EMPTY',
    eval_type='openai_api',
    datasets=['omni_bench'],
    dataset_args={
        'omni_bench': {
            'extra_params': {
                'use_image': True,
                'use_audio': True,
            }
        }
    },
    eval_batch_size=1,
    generation_config={
        'max_tokens': 10000,
        'temperature': 0.0,
    },
    limit=100,
    ignore_errors=True,
)
run_task(task_cfg=task_cfg)
PY

Model	Dataset	Metric	Subset	Num	Score	Delta vs BF16
Qwen3-Omni-30B-A3B-Instruct	omni_bench	mean_acc	default	100	0.46	0.00
Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound	omni_bench	mean_acc	default	100	0.44	-0.02
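
With only 100 samples per run, the sampling noise on mean_acc is worth keeping in mind when reading the -0.02 delta. The sketch below estimates it with the binomial standard error. It treats the two runs as independent samples, which is a simplifying assumption because both models were scored on the same 100 items; the helper name is illustrative.

```python
import math


def accuracy_standard_error(acc: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(acc * (1.0 - acc) / n)


# Scores and sample count copied from the table above
bf16_acc, int4_acc, n = 0.46, 0.44, 100

se_bf16 = accuracy_standard_error(bf16_acc, n)
se_int4 = accuracy_standard_error(int4_acc, n)
delta = int4_acc - bf16_acc
# Standard error of the difference, assuming independent runs (a simplification)
se_delta = math.sqrt(se_bf16**2 + se_int4**2)

print(f"delta={delta:+.2f}, SE(bf16)={se_bf16:.3f}, SE(delta)={se_delta:.3f}")
```

Under these assumptions the observed delta is smaller than one standard error, so this sample size cannot distinguish the two models.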

Performance Benchmark Results: FP8 (W8A8) vs INT4 (W4A16 AutoRound)

fp8_bench.json
int4_bench.json

Model: Qwen3-Omni-30B-A3B-Instruct
Hardware: 8x NVIDIA RTX 5090 D (32GB)
Configuration: Thinker TP=4 (GPU 0-3), Talker TP=2 (GPU 4-5), Code2Wav TP=2 (GPU 6-7)
Benchmark: 200 ShareGPT prompts, request_rate=10, max_concurrency=32, temperature=0, custom_output_len=256
Diff: names the configuration with the better value for each metric and its relative gap to the other configuration

Serving Performance

Metric	FP8 (W8A8)	INT4 (W4A16)	Diff
Successful requests	200	200	-
Failed requests	0	0	-
Benchmark duration (s)	248.31	239.42	INT4 -3.6%
Request throughput (req/s)	0.81	0.84	INT4 +3.7%

Text Generation

Metric	FP8 (W8A8)	INT4 (W4A16)	Diff
Total generated tokens	40,015	39,902	-
Output token throughput (tok/s)	161.15	166.66	INT4 +3.4%
Peak output token throughput (tok/s)	768.00	704.00	FP8 +9.1%

Audio Generation

Metric	FP8 (W8A8)	INT4 (W4A16)	Diff
Total audio duration generated (s)	14,134.53	12,523.22	-
Audio throughput (audio dur/s)	56.92	52.31	FP8 +8.8%

Latency

Metric	FP8 (W8A8)	INT4 (W4A16)	Diff
Mean TTFT (ms)	90.14	107.01	FP8 -15.8%
Median TTFT (ms)	46.35	58.55	FP8 -20.9%
P99 TTFT (ms)	526.80	420.74	INT4 -20.1%
Mean TPOT (ms)	17.70	21.23	FP8 -16.6%
Median TPOT (ms)	12.29	15.73	FP8 -21.9%
P99 TPOT (ms)	47.42	50.58	FP8 -6.3%
Mean ITL (ms)	17.61	21.11	FP8 -16.6%
Median ITL (ms)	11.75	14.90	FP8 -21.1%
P99 ITL (ms)	128.49	131.67	FP8 -2.4%
Mean E2EL (ms)	36,718.00	35,465.59	INT4 -3.4%
Median E2EL (ms)	35,711.51	35,053.92	INT4 -1.8%
P99 E2EL (ms)	52,743.62	46,585.58	INT4 -11.7%
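
The Diff entries in these tables can be reproduced from the two measured values. The sketch below is one reading of the convention, inferred from the numbers rather than stated in the PR: name the configuration with the better value and report its relative difference, using the other configuration as the base. The function name and the `higher_is_better` flag are illustrative.

```python
def diff_label(fp8: float, int4: float, higher_is_better: bool) -> str:
    """Name the better configuration and its relative gap to the other one."""
    if fp8 == int4:
        return "-"
    fp8_wins = (fp8 > int4) if higher_is_better else (fp8 < int4)
    if fp8_wins:
        name, winner, loser = "FP8", fp8, int4
    else:
        name, winner, loser = "INT4", int4, fp8
    # The gap is expressed relative to the losing configuration's value
    pct = (winner - loser) / loser * 100.0
    return f"{name} {pct:+.1f}%"


# Values copied from the tables above
print(diff_label(248.31, 239.42, higher_is_better=False))  # benchmark duration
print(diff_label(768.00, 704.00, higher_is_better=True))   # peak output token throughput
print(diff_label(17.70, 21.23, higher_is_better=False))    # mean TPOT
```

Because the table values are rounded to two decimals, a recomputed percentage can differ from the reported one in the last digit.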

yiliu30

Overall LGTM. It would be great to include a command for running the end-to-end example in the PR description.

hsliuustc0106 · 2026-04-10T08:17:05Z

Well-structured PR. 31 unit tests, comprehensive benchmarks, docs updated. The accuracy delta (-0.02 on omni_bench) is acceptable for W4A16.

lvliang-intel · 2026-04-14T14:29:48Z

@hsliuustc0106 @lishunyang12,
please approve this PR if there are no further comments.

lvliang-intel · 2026-04-20T07:25:47Z

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

lishunyang12 · 2026-04-21T03:43:39Z

Fix CI

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

lvliang-intel · 2026-04-21T13:15:25Z

@lishunyang12,
CI finally passed, please help to approve and merge this PR, thanks.

lishunyang12 · 2026-04-21T13:44:19Z

let's wait for ci to complete

hsliuustc0106 · 2026-04-21T13:46:49Z

-            # Newer transformers use rope_parameters instead of rope_scaling
            rope_params = getattr(talker_config.text_config, "rope_parameters", None) or {}
-        rope_params["rope_theta"] = talker_config.text_config.rope_theta
+        rope_params = dict(rope_params)


@amy-why-3459 transformers will be upgraded to 5.x next week, please check the logic here
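
For context on the hunk above: it reads `rope_parameters` with an empty-dict fallback and then copies the result with `dict(...)`. A minimal sketch of why the copy matters follows, using `types.SimpleNamespace` as a hypothetical stand-in for the real config object. The helper name and the sample values are assumptions and not code from this PR.

```python
from types import SimpleNamespace


def read_rope_params(text_config) -> dict:
    """Return a private copy of the config's rope parameters (empty dict if absent)."""
    raw = getattr(text_config, "rope_parameters", None) or {}
    # Shallow copy: later edits to the returned dict do not mutate the shared config
    return dict(raw)


# Hypothetical config that exposes rope_parameters
text_config = SimpleNamespace(rope_parameters={"rope_theta": 1000000.0})
rope_params = read_rope_params(text_config)
rope_params["rope_type"] = "default"  # local change only
print(text_config.rope_parameters)    # the config still holds only rope_theta

# Hypothetical config without the attribute falls back to an empty dict
legacy_config = SimpleNamespace()
print(read_rope_params(legacy_config))
```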

lvliang-intel · 2026-04-22T11:44:43Z

@hsliuustc0106 @lishunyang12,
could you please approve the PR?

…2670) Signed-off-by: lvliang-intel <liang1.lv@intel.com> Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>

lvliang-intel requested a review from hsliuustc0106 as a code owner April 10, 2026 06:19

yiliu30 mentioned this pull request Apr 10, 2026

[RFC]: Intel Auto-Round x vLLM-Omni Quantization Support (2026 H1) #1325

Open

3 tasks

yiliu30 approved these changes Apr 10, 2026

Comment thread vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_talker.py Outdated

lishunyang12 reviewed Apr 11, 2026

Comment thread tests/e2e/offline_inference/test_qwen3_omni_autoround_w4a16.py Outdated

Comment thread vllm_omni/engine/stage_init_utils.py Outdated

Comment thread tests/diffusion/quantization/test_component_routing.py Outdated

hsliuustc0106 reviewed Apr 12, 2026

Comment thread vllm_omni/config/model.py

Comment thread vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_talker.py

yiliu30 mentioned this pull request Apr 15, 2026

[vllm-omni]: Omni Quant Support intel/auto-round#1507

Open

lishunyang12 mentioned this pull request Apr 15, 2026

[RFC]: Continuous Quantization Support #1854

Open

lvliang-intel force-pushed the feats/ar-w4a16-qwen-omni branch from 2f1ff83 to 1e6e929 Compare April 20, 2026 03:27

lishunyang12 added the "ready" label (label to trigger buildkite CI) Apr 20, 2026

lishunyang12 enabled auto-merge (squash) April 20, 2026 08:24

lishunyang12 disabled auto-merge April 20, 2026 09:00

Support Qwen Omni model with AutoRound

f9c1036

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

lvliang-intel force-pushed the feats/ar-w4a16-qwen-omni branch from 37655ff to f9c1036 Compare April 21, 2026 02:43

lvliang-intel added 2 commits April 21, 2026 11:52

fix ci

4af9e40

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

Merge branch 'main' into feats/ar-w4a16-qwen-omni

db126cd

lishunyang12 added the "merge-test" label (label to trigger buildkite merge test CI) and removed the "ready" label (label to trigger buildkite CI) Apr 21, 2026

lvliang-intel added 2 commits April 21, 2026 17:00

Merge branch 'main' into feats/ar-w4a16-qwen-omni

f37c3e6

Merge branch 'main' into feats/ar-w4a16-qwen-omni

97f917f

Merge branch 'main' into feats/ar-w4a16-qwen-omni

4947b4d

Merge branch 'main' into feats/ar-w4a16-qwen-omni

0716bbb

hsliuustc0106 reviewed Apr 21, 2026

Merge branch 'main' into feats/ar-w4a16-qwen-omni

8ba404b

lvliang-intel added 2 commits April 22, 2026 11:08

Merge branch 'main' into feats/ar-w4a16-qwen-omni

f15cdb6

Merge branch 'main' into feats/ar-w4a16-qwen-omni

a036509

lishunyang12 approved these changes Apr 22, 2026

lishunyang12 merged commit ee15f39 into vllm-project:main Apr 22, 2026
6 checks passed

lvliang-intel mentioned this pull request Apr 23, 2026

[AutoRound] Support GLM-Image W4A16 quantization model #3059

Open

5 tasks

lvliang-intel mentioned this pull request May 5, 2026

[AutoRound] Support WAN2.2 W4A16 quantization model #3353

Open

5 tasks

4 participants