[AutoRound] Support Qwen Omni W4A16 quantization model #2670

Merged
lishunyang12 merged 10 commits into
vllm-project:mainfrom
lvliang-intel:feats/ar-w4a16-qwen-omni
Apr 22, 2026

Contributor

@lvliang-intel lvliang-intel commented Apr 10, 2026

Purpose

Enable loading offline AutoRound W4A16 quantized checkpoints for Qwen Omni multi-stage models (Qwen2.5-Omni and Qwen3-Omni). Extends the AutoRound/INC support added in #1777 (FLUX) to the Omni pipeline.

Related: #1325, #1777
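For readers who want to confirm that a checkpoint directory really is an AutoRound W4A16 export before loading it, here is a minimal sketch that inspects the `quantization_config` block of a Hugging Face style `config.json`. The helper name `describe_quantization` and the sample values are hypothetical, and the field names (`quant_method`, `bits`, `group_size`) are assumed from the common layout of such configs; a given checkpoint may nest or name them differently.

```python
import json


def describe_quantization(config: dict) -> str:
    """Summarize the quantization scheme recorded in a config dict."""
    qcfg = config.get("quantization_config")
    if not qcfg:
        return "unquantized"
    method = qcfg.get("quant_method", "unknown")
    bits = qcfg.get("bits")
    group_size = qcfg.get("group_size")
    return f"{method}: {bits}-bit weights, group_size={group_size}"


# Hypothetical config.json contents (illustrative values, not read from a real checkpoint).
sample = json.loads(
    '{"quantization_config": {"quant_method": "auto-round", '
    '"bits": 4, "group_size": 128, "sym": true}}'
)
print(describe_quantization(sample))
```

In practice the dict would come from `json.load(open(path_to_config_json))` on the downloaded checkpoint.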

Test Plan

Test Result

Unit tests: all 31 tests pass.
Commands and models:

CUDA_VISIBLE_DEVICES=0,1 python examples/offline_inference/qwen2_5_omni/end2end.py --model Intel/Qwen2.5-Omni-7B-int4-AutoRound --output-wav output_audio --query-type use_audio
CUDA_VISIBLE_DEVICES=0,1 python examples/offline_inference/qwen3_omni/end2end.py --model Intel/Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound --output-wav output_audio --query-type use_audio
| Model | Metric | BF16 Baseline | W4A16 (AutoRound) |
| --- | --- | --- | --- |
| Qwen3-Omni-30B-A3B | Checkpoint Size (GiB) | 66 | 25 |
| | Size Reduction | -- | 62% |
| Qwen2.5-Omni-7B | Checkpoint Size (GiB) | 21 | 12 |
| | Size Reduction | -- | 43% |
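
The size-reduction percentages follow directly from the checkpoint sizes. A small sketch of the arithmetic, using a hypothetical helper `size_reduction_pct`:

```python
def size_reduction_pct(baseline_gib: float, quantized_gib: float) -> int:
    """Percentage of the baseline checkpoint size saved by quantization."""
    return round(100 * (baseline_gib - quantized_gib) / baseline_gib)


# Sizes in GiB taken from the table above.
print("Qwen3-Omni-30B-A3B:", size_reduction_pct(66, 25), "%")
print("Qwen2.5-Omni-7B:", size_reduction_pct(21, 12), "%")
```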

Accuracy Test:

CUDA_VISIBLE_DEVICES=0,1 vllm serve Intel/Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound --omni  --port 8801  --max-model-len 32768   --served-model-name Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound
CUDA_VISIBLE_DEVICES=0,1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni  --port 8801  --max-model-len 32768   --served-model-name Qwen3-Omni-30B-A3B-Instruct

Evaluated with evalscope on the omni_bench dataset:

python - <<'PY'
from evalscope import TaskConfig, run_task
task_cfg = TaskConfig(
    model='Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound',
    api_url='http://127.0.0.1:8801/v1',
    api_key='EMPTY',
    eval_type='openai_api',
    datasets=['omni_bench'],
    dataset_args={
        'omni_bench': {
            'extra_params': {
                'use_image': True,
                'use_audio': True,
            }
        }
    },
    eval_batch_size=1,
    generation_config={
        'max_tokens': 10000,
        'temperature': 0.0,
    },
    limit=100,
    ignore_errors=True,
)
run_task(task_cfg=task_cfg)
PY
| Model | Dataset | Metric | Subset | Num | Score | Delta vs BF16 |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-Omni-30B-A3B-Instruct | omni_bench | mean_acc | default | 100 | 0.46 | 0.00 |
| Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound | omni_bench | mean_acc | default | 100 | 0.44 | -0.02 |

Performance Benchmark Results: FP8 (W8A8) vs INT4 (W4A16 AutoRound)

fp8_bench.json
int4_bench.json

Model: Qwen3-Omni-30B-A3B-Instruct
Hardware: 8x NVIDIA RTX 5090 D (32GB)
Configuration: Thinker TP=4 (GPU 0-3), Talker TP=2 (GPU 4-5), Code2Wav TP=2 (GPU 6-7)
Benchmark: 200 ShareGPT prompts, request_rate=10, max_concurrency=32, temperature=0, custom_output_len=256

Serving Performance

| Metric | FP8 (W8A8) | INT4 (W4A16) | Diff |
| --- | --- | --- | --- |
| Successful requests | 200 | 200 | - |
| Failed requests | 0 | 0 | - |
| Benchmark duration (s) | 248.31 | 239.42 | INT4 -3.6% |
| Request throughput (req/s) | 0.81 | 0.84 | INT4 +3.7% |

Text Generation

| Metric | FP8 (W8A8) | INT4 (W4A16) | Diff |
| --- | --- | --- | --- |
| Total generated tokens | 40,015 | 39,902 | - |
| Output token throughput (tok/s) | 161.15 | 166.66 | INT4 +3.4% |
| Peak output token throughput (tok/s) | 768.00 | 704.00 | FP8 +9.1% |

Audio Generation

| Metric | FP8 (W8A8) | INT4 (W4A16) | Diff |
| --- | --- | --- | --- |
| Total audio duration generated (s) | 14,134.53 | 12,523.22 | - |
| Audio throughput (audio dur/s) | 56.92 | 52.31 | FP8 +8.8% |
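
As a sanity check, the aggregate throughput figures in the tables above appear to be plain totals divided by the benchmark duration. A minimal sketch, assuming that definition (the helper `rate` is hypothetical, not part of the benchmark tool):

```python
def rate(total: float, duration_s: float) -> float:
    """Aggregate throughput: total units produced per wall-clock second."""
    return round(total / duration_s, 2)


# (benchmark duration s, successful requests, generated tokens, audio seconds)
runs = {
    "FP8": (248.31, 200, 40_015, 14_134.53),
    "INT4": (239.42, 200, 39_902, 12_523.22),
}

for name, (duration, requests, tokens, audio_s) in runs.items():
    print(
        name,
        "req/s:", rate(requests, duration),
        "tok/s:", rate(tokens, duration),
        "audio s/s:", rate(audio_s, duration),
    )
```

Under this assumption the computed values match the request, output-token, and audio throughput rows as listed.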

Latency

| Metric | FP8 (W8A8) | INT4 (W4A16) | Diff |
| --- | --- | --- | --- |
| Mean TTFT (ms) | 90.14 | 107.01 | FP8 -18.7% |
| Median TTFT (ms) | 46.35 | 58.55 | FP8 -20.9% |
| P99 TTFT (ms) | 526.80 | 420.74 | INT4 -20.1% |
| Mean TPOT (ms) | 17.70 | 21.23 | FP8 -16.6% |
| Median TPOT (ms) | 12.29 | 15.73 | FP8 -21.9% |
| P99 TPOT (ms) | 47.42 | 50.58 | FP8 -6.3% |
| Mean ITL (ms) | 17.61 | 21.11 | FP8 -16.6% |
| Median ITL (ms) | 11.75 | 14.90 | FP8 -21.1% |
| P99 ITL (ms) | 128.49 | 131.67 | FP8 -2.4% |
| Mean E2EL (ms) | 36,718.00 | 35,465.59 | INT4 -3.4% |
| Median E2EL (ms) | 35,711.51 | 35,053.92 | INT4 -1.8% |
| P99 E2EL (ms) | 52,743.62 | 46,585.58 | INT4 -11.7% |
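
The Diff column names the better configuration and a relative gap. The convention is not stated in the results; the sketch below assumes it is `(winner - loser) / loser`, which is inferred from the rows rather than confirmed. The helper `diff_label` is hypothetical.

```python
def diff_label(fp8: float, int4: float, lower_is_better: bool) -> str:
    """Name the better configuration and its gap relative to the other one."""
    fp8_wins = (fp8 < int4) if lower_is_better else (fp8 > int4)
    winner, loser = (fp8, int4) if fp8_wins else (int4, fp8)
    pct = 100 * (winner - loser) / loser  # assumed convention
    name = "FP8" if fp8_wins else "INT4"
    return f"{name} {pct:+.1f}%"


print(diff_label(526.80, 420.74, lower_is_better=True))   # P99 TTFT
print(diff_label(17.70, 21.23, lower_is_better=True))     # Mean TPOT
print(diff_label(768.00, 704.00, lower_is_better=False))  # Peak output tok/s
```

This reproduces most rows to within rounding. The Mean TTFT row is an exception: the assumed convention gives about -15.8% rather than the listed -18.7%, which instead matches a gap taken relative to the FP8 value, so the original figures may have been computed differently or from the unrounded values in the attached JSON files.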

Contributor

@yiliu30 yiliu30 left a comment

Overall LGTM. It would be great to include a command for running the end-to-end example in the PR description.

Comment thread vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_talker.py Outdated
@hsliuustc0106
Collaborator

Well-structured PR. 31 unit tests, comprehensive benchmarks, docs updated. The accuracy delta (-0.02 on omni_bench) is acceptable for W4A16.

Comment thread tests/e2e/offline_inference/test_qwen3_omni_autoround_w4a16.py Outdated
Comment thread vllm_omni/engine/stage_init_utils.py Outdated
Comment thread tests/diffusion/quantization/test_component_routing.py Outdated
Comment thread vllm_omni/config/model.py
Comment thread vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_talker.py
@lvliang-intel
Contributor Author

@hsliuustc0106 @lishunyang12,
please approve this PR if there are no further comments.

@lvliang-intel
Contributor Author

lvliang-intel commented Apr 20, 2026

Reran the accuracy test with the latest code:

| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-Omni-30B-A3B-Instruct | omni_bench | mean_acc | default | 100 | 0.46 | default |
| Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound | omni_bench | mean_acc | default | 100 | 0.47 | default |
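
For context on how much weight a 0.01 or 0.02 difference can carry at `limit=100`, here is a rough sketch of the binomial standard error of an accuracy estimate. It treats each sample as an independent pass/fail trial, which is a simplifying assumption and not how evalscope reports results; `accuracy_std_error` is a hypothetical helper.

```python
import math


def accuracy_std_error(acc: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent samples."""
    return math.sqrt(acc * (1 - acc) / n)


se = accuracy_std_error(0.46, 100)
print(f"standard error at acc=0.46, n=100: {se:.3f}")

# Gaps between the INT4 and BF16 scores reported in this PR.
for label, int4_score in (("initial run", 0.44), ("rerun", 0.47)):
    gap = int4_score - 0.46
    print(f"{label}: gap {gap:+.2f}, within one standard error: {abs(gap) < se}")
```

Under these assumptions both reported gaps are smaller than one standard error, so a larger sample would be needed to separate the two models on this benchmark.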

@lishunyang12 lishunyang12 enabled auto-merge (squash) April 20, 2026 08:24
@lishunyang12 lishunyang12 disabled auto-merge April 20, 2026 09:00
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@lvliang-intel lvliang-intel force-pushed the feats/ar-w4a16-qwen-omni branch from 37655ff to f9c1036 Compare April 21, 2026 02:43
@lishunyang12
Collaborator

Fix CI

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@lishunyang12 added the merge-test label (triggers the Buildkite merge test CI) and removed the ready label (triggers the Buildkite CI) on Apr 21, 2026
@lvliang-intel
Contributor Author

@lishunyang12,
CI has finally passed. Please approve and merge this PR, thanks.

@lishunyang12
Collaborator

Let's wait for CI to complete.

# Newer transformers use rope_parameters instead of rope_scaling
rope_params = getattr(talker_config.text_config, "rope_parameters", None) or {}
# Copy before writing so the config's own dict is not mutated
rope_params = dict(rope_params)
rope_params["rope_theta"] = talker_config.text_config.rope_theta
Collaborator

@amy-why-3459 transformers will be upgraded to 5.x next week, please check the logic here
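
One possible way to make this lookup tolerant of both config layouts, sketched with a hypothetical helper `build_rope_params` (not code from this PR). It assumes that `rope_parameters` is the newer attribute, that `rope_scaling` is the older one, and that a top-level `rope_theta` may be absent in some versions; all three points should be checked against the transformers release actually in use.

```python
from types import SimpleNamespace


def build_rope_params(text_config) -> dict:
    """Collect RoPE settings from a config object without mutating it."""
    source = (
        getattr(text_config, "rope_parameters", None)
        or getattr(text_config, "rope_scaling", None)
        or {}
    )
    rope_params = dict(source)  # copy so the config's own dict stays untouched
    rope_theta = getattr(text_config, "rope_theta", None)
    if rope_theta is not None:
        # A top-level rope_theta, when present, takes precedence.
        rope_params["rope_theta"] = rope_theta
    return rope_params


# Stand-in config objects for illustration only.
newer = SimpleNamespace(rope_parameters={"rope_type": "default", "rope_theta": 1e6})
older = SimpleNamespace(rope_scaling=None, rope_theta=10000.0)

print(build_rope_params(newer))
print(build_rope_params(older))
```

Copying before assignment matters because the dict returned by `getattr` may be the config's own object, and writing to it would change the config seen by other stages.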

@lvliang-intel
Contributor Author

@hsliuustc0106 @lishunyang12,
could you please approve the PR?

@lishunyang12 lishunyang12 merged commit ee15f39 into vllm-project:main Apr 22, 2026
6 checks passed
qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
…2670)

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…2670)

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…2670)

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>