Support Qwen3 and Qwen2.5 Omni model quantization#1404
Conversation
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
for more information, see https://pre-commit.ci
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Thank you for the PR! Could you help verify all inferences (vLLM, Transformers 4, and Transformers 5) before merging?
Quantize and inference tested with Transformers 5.1.0.
vLLM tests are currently blocked because the latest vLLM version depends on an outdated Transformers release. Qwen3-Omni requires Transformers >= 5.1.0 to address several known issues.
Pull request overview
Adds quantization support for the Qwen3-Omni MoE model family by integrating model-specific loading/version gating, calibration forward behavior for thinker/talker, and custom multimodal block discovery.
Changes:
- Added an explicit Transformers version guard for qwen3_omni_moe (see the sketch after this list).
- Introduced Qwen3-Omni processor/template registration and model-specific multimodal block name discovery.
- Implemented a Qwen3-Omni-specific forward path to run thinker (and optionally talker) during calibration.
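A minimal sketch of what such a per-model version gate could look like (the helper and mapping names here are hypothetical; the PR's actual guard lives in auto_round/utils/model.py and may be structured differently):

```python
from packaging import version

import transformers

# Hypothetical mapping of model_type -> minimum required Transformers release.
_MIN_TRANSFORMERS_VERSION = {
    "qwen3_omni_moe": "5.1.0",
}


def check_transformers_version(model_type: str) -> None:
    """Fail fast if the installed Transformers is too old for this model type."""
    required = _MIN_TRANSFORMERS_VERSION.get(model_type)
    if required and version.parse(transformers.__version__) < version.parse(required):
        raise ImportError(
            f"{model_type} requires transformers>={required}, "
            f"but {transformers.__version__} is installed."
        )
```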
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pyproject.toml | Adds a project-specific word to typos’ allowlist. |
| auto_round/utils/model.py | Adds Transformers version guard and adjusts lm_head discovery logic. |
| auto_round/utils/common.py | Adds _no_split_modules normalization and extends multimodal ignore-key lists (see the sketch after this table). |
| auto_round/special_model_handler.py | Adds Qwen3-Omni special forward + block discovery + ignore-layer rule. |
| auto_round/compressors/shard_writer.py | Improves tie_word_embeddings lookup for nested multimodal configs. |
| auto_round/compressors/mllm/utils.py | Extends multimodal ignore-key list for Qwen3-Omni components. |
| auto_round/compressors/mllm/template.py | Registers a Qwen3-Omni model template with the new processor. |
| auto_round/compressors/mllm/processor.py | Adds a custom processor for Qwen3-Omni chat-template inputs. |
| auto_round/compressors/base.py | Imports the new normalization helper. |
| auto_round/auto_scheme/utils.py | Uses normalized _no_split_modules when dispatching across devices. |
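As an illustration of the `_no_split_modules` normalization listed for auto_round/utils/common.py, here is a minimal sketch; the real helper's name, location, and exact rules are assumptions:

```python
import torch.nn as nn


def normalize_no_split_modules(model: nn.Module) -> list[str]:
    """Collect `_no_split_modules` from a possibly nested multimodal model
    (e.g. a top-level wrapper plus thinker/talker submodels) and return a
    flat, de-duplicated list of module class names for device dispatching."""
    names: set[str] = set()
    candidates = [model] + [getattr(model, attr, None) for attr in ("thinker", "talker")]
    for module in candidates:
        if module is None:
            continue
        names.update(getattr(module, "_no_split_modules", None) or [])
    return sorted(names)
```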
The snippet flagged by the review:

```python
# Use text projection to convert thinker embeddings to talker space
if hasattr(model.talker, "text_projection"):
    # Get thinker embeddings
    thinker_embeds = model.thinker.get_input_embeddings()(input_ids)
    talker_inputs_embeds = model.talker.text_projection(thinker_embeds)
```
This path assumes input_ids is provided; if calibration runs with inputs_embeds (or other modalities without input_ids), this will throw and then be silently ignored (due to the broad except), meaning the talker forward never runs. Consider deriving inputs from inputs_embeds when present, or projecting from thinker_output.hidden_states[-1] (which you already compute) instead of re-embedding input_ids.
Suggested change:

```diff
-# Use text projection to convert thinker embeddings to talker space
-if hasattr(model.talker, "text_projection"):
-    # Get thinker embeddings
-    thinker_embeds = model.thinker.get_input_embeddings()(input_ids)
-    talker_inputs_embeds = model.talker.text_projection(thinker_embeds)
+# Use text projection to convert thinker hidden states to talker space
+if hasattr(model.talker, "text_projection"):
+    # Project thinker hidden states directly into the talker embedding space
+    talker_inputs_embeds = model.talker.text_projection(thinker_hidden)
```
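A hedged sketch of the broader fallback the comment describes, which also covers batches that arrive with inputs_embeds but no input_ids (variable names follow the snippet above and are otherwise assumptions):

```python
# Sketch only: pick the best available source for the talker's input embeddings
# instead of unconditionally re-embedding input_ids.
if hasattr(model.talker, "text_projection"):
    if input_ids is not None:
        # Text tokens available: embed them with the thinker's embedding table.
        thinker_states = model.thinker.get_input_embeddings()(input_ids)
    elif inputs_embeds is not None:
        # Calibration supplied embeddings directly (e.g. a multimodal-only batch).
        thinker_states = inputs_embeds
    else:
        # Fall back to the thinker's last hidden state, already computed above.
        thinker_states = thinker_output.hidden_states[-1]
    talker_inputs_embeds = model.talker.text_projection(thinker_states)
```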
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
You could update Transformers after installing vLLM.
Qwen2.5-Omni quantize and inference tests pass:

```bash
CUDA_VISIBLE_DEVICES=3 python quantize_qwen25_omni.py --model /mnt/disk2/lvl/Qwen2.5-Omni-3B --output tmp_qwen25_omni_w4a16 --iters 200
CUDA_VISIBLE_DEVICES=6 python run_qwen25_omni.py --model-dir tmp_qwen25_omni_w4a16 --enable-audio-output
```
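For context, a rough sketch of what a script like quantize_qwen25_omni.py might do with auto-round's MLLM flow (the AutoRoundMLLM arguments and the loading path shown here are assumptions, not the actual script from this PR):

```python
from transformers import AutoModel, AutoProcessor, AutoTokenizer

from auto_round import AutoRoundMLLM

model_path = "/mnt/disk2/lvl/Qwen2.5-Omni-3B"  # local checkpoint used above
model = AutoModel.from_pretrained(model_path, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# W4A16-style settings mirroring the command line above (argument names are assumptions).
autoround = AutoRoundMLLM(model, tokenizer, processor=processor,
                          bits=4, group_size=128, iters=200)
autoround.quantize()
autoround.save_quantized("tmp_qwen25_omni_w4a16")
```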
```python
SPECIAL_MULTIMODAL_BLOCK = {"deepseek_vl_v2": _get_deepseek_vl2_multimodal_block}


def _get_qwen2_5_omni_multimodal_block(model, quant_vision=False):
```
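To illustrate the registration pattern quoted above, a sketch of how a Qwen3-Omni entry could be added; the function body and attribute paths are illustrative assumptions, not the PR's actual discovery logic:

```python
def _get_qwen3_omni_moe_multimodal_block(model, quant_vision=False):
    """Return the block paths calibration should iterate over for Qwen3-Omni MoE."""
    blocks = ["thinker.model.layers"]            # language (thinker) decoder blocks
    if quant_vision:
        blocks.append("thinker.visual.blocks")   # optionally include vision blocks
    return blocks


SPECIAL_MULTIMODAL_BLOCK["qwen3_omni_moe"] = _get_qwen3_omni_moe_multimodal_block
```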
Since the code for these two models has grown to 300+ lines, it’s making the main file quite cluttered. Shall we refine this file later?
Sure, we will refactor this file later.
Awesome work, Liang Ge!
vLLM inference test with the quantized Qwen2.5-Omni model; accuracy is good.

```bash
CUDA_VISIBLE_DEVICES=5 python run_qwen25_omni_vllm.py --model-dir ./tmp_qwen25_omni_w4a16
```
vLLM inference test with the quantized Qwen3-Omni model; accuracy is not good. This looks like a vLLM issue, since the Transformers inference test is good for Qwen3-Omni.

```bash
CUDA_VISIBLE_DEVICES=5 python run_qwen3_omni_vllm.py --model-dir ./tmp_qwen3_omni_w4a16
```
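For a quick text-only sanity check with plain vLLM, independent of the full run_qwen3_omni_vllm.py script (a minimal sketch; the real script also exercises audio/vision inputs and may rely on vllm-omni):

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint produced above.
llm = LLM(model="./tmp_qwen3_omni_w4a16", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["What is 2 + 3? Answer with just the number."], params)
print(outputs[0].outputs[0].text)
```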
Verified with vllm-omni (based on PR vllm-project/vllm-omni#1777 and some adaptations).

1. Quantized Qwen2.5-Omni inference test (verify_qwen25_omni_w4a16.py)
   - [INFO] Running 7 prompts (3 text + 4 multimodal)...
   - [INFO] Inference completed in 93.6s
   - Text outputs: 3
     - Q1: What is 2 + 3? Answer with just the number.
     - Q2: Briefly describe what a neural network is in one sentence.
     - Q3: Translate 'Hello, how are you?' to Chinese.
   - [OK] Quantized model produced 3 text outputs
   - Multimodal test results: [OK] image: This image shows four different luxury cars - a white Rolls Royce, a Mercedes-Benz GLE SUV, a red Ferrari Portofino M, …
   - [INFO] Results saved to output_qwen25_omni_w4a16_verify/

2. Quantized Qwen3-Omni inference test
   - Text outputs: 3
     - Q1: What is 2 + 3? Answer with just the number.
     - Q2: Briefly describe what a neural network is in one sentence.
     - Q3: Translate 'Hello, how are you?' to Chinese.
   - [OK] Quantized model produced 3 text outputs
   - Multimodal test results: [OK] image: The composite image displays four luxury vehicles: a white Rolls-Royce Phantom, a grey Mercedes-Benz GLE SUV in a desert …
   - [INFO] Results saved to output_qwen3_omni_w4a16_verify/
wenhuach21 left a comment:
Great, thanks!
1. Please update the VLMs doc to show that these models are now supported; feel free to add it in another PR.
2. Please help upstream the quantized models to the Intel space.
Will upstream the quantized models to the Intel space once PR vllm-project/vllm-omni#1777 is merged, and will also update the documentation.
Just mentioning this PR in the model card as a requirement is fine.
Description
This update adds quantization support for Qwen3-Omni by integrating a custom MLLM processor and template, implementing dedicated forward logic for thinker/talker calibration, and introducing model-specific block discovery.
Note: This feature requires Transformers >= 5.1.0, as earlier versions contain compatibility issues with Qwen3-Omni.
Type of Change
Related Issues
#1387
Fixes or relates to #
Checklist Before Submitting