fix: do not apply FP8 quant config to vision/audio encoders for pre-quantized checkpoints by ianliuy · Pull Request #2702 · vllm-project/vllm-omni

ianliuy · 2026-04-12T01:35:58Z

Problem

For modelopt FP8/FP4/MXFP8 pre-quantized checkpoints, only the Thinker LM is quantized. Vision/audio encoder weights are BF16 with no FP8 scale tensors in the checkpoint.

Previously, `quant_config` was passed to all sub-components, causing `ModelOptFp8LinearMethod`'s FP8 kernel to run on BF16 encoder weights. Without valid scale tensors, the kernel produces garbage embeddings, so the model effectively ignores image and audio inputs (e.g., a solid red image described as 'Gray', speech transcribed as 'Yeah.').

This was a regression from v0.16, where the encoder linear layers did not propagate `quant_config`, making it effectively a no-op.

Fix

- **`qwen3_omni_moe_thinker.py`**: set `audio_quant_config = None` and `visual_quant_config = None` for pre-quantized methods. Also corrects the misleading comment that claimed the entire thinker was quantized.
- **`qwen2_5_omni_thinker.py`**: same fix for the visual encoder (the audio tower was already safe, since it is constructed without `quant_config`).
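The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `FakeQuantConfig` and `split_quant_configs` are hypothetical names, and the assumption that a config exposes its method name through `get_name()` is made only for this sketch. The method names in the set are the ones listed in this thread.

```python
from dataclasses import dataclass
from typing import Optional

# Method names quoted in the PR discussion for pre-quantized modelopt checkpoints.
_PRE_QUANTIZED_METHODS = frozenset({"modelopt", "modelopt_fp4", "modelopt_mxfp8"})


@dataclass
class FakeQuantConfig:
    """Hypothetical stand-in for a quantization config object."""

    name: str

    def get_name(self) -> str:
        return self.name


def split_quant_configs(quant_config: Optional[FakeQuantConfig]):
    """Return (language_model, visual, audio) configs for the thinker sub-components."""
    visual_quant_config = quant_config
    audio_quant_config = quant_config
    if quant_config is not None and quant_config.get_name() in _PRE_QUANTIZED_METHODS:
        # Only the Thinker LM carries quantized weights and scale tensors.
        # The encoders are stored in BF16, so they must not get a quant config.
        visual_quant_config = None
        audio_quant_config = None
    return quant_config, visual_quant_config, audio_quant_config
```

With this shape, a `modelopt` config still reaches the language model while both encoders receive `None`, and any other method is passed through to all three components unchanged.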

Impact

- Vision encoder (0.54B, ~1.1 GiB) and audio encoder (0.65B, ~1.3 GiB) run in BF16, which is negligible relative to the full model (~31 GiB)
- Encoders only run during prefill, so there is no impact on autoregressive decode speed
- Consistent with v0.16 behavior
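As a quick check of the memory claim, the arithmetic below uses only the approximate sizes quoted above. It assumes the ~31 GiB figure is the right denominator for comparison, which the description implies but does not spell out.

```python
# Approximate sizes quoted in the PR description, in GiB.
vision_encoder_gib = 1.1
audio_encoder_gib = 1.3
full_model_gib = 31.0

encoders_gib = vision_encoder_gib + audio_encoder_gib
share = encoders_gib / full_model_gib

print(f"Encoders in BF16: {encoders_gib:.1f} GiB ({share:.1%} of the full model)")
# prints: Encoders in BF16: 2.4 GiB (7.7% of the full model)
```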

Testing

Verified by the original reporter: the red image is described as 'Red', and the audio transcription is exact.

…rs for pre-quantized checkpoints

For modelopt FP8/FP4/MXFP8 pre-quantized checkpoints, only the Thinker LM is quantized. Vision and audio encoder weights remain in BF16 with no corresponding input_scale/weight_scale tensors.

Previously, the code passed the same quant_config to all sub-components, causing ModelOptFp8LinearMethod's FP8 kernel to run on BF16 encoder weights. This produced numerical garbage embeddings, making the model completely ignore image and audio inputs (e.g. red image described as 'Gray', speech transcribed as 'Yeah.').

Fix: set audio_quant_config = None and visual_quant_config = None for pre-quantized methods. Also corrects the misleading comment that claimed the entire thinker was quantized.

Fixes vllm-project#2686

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

chatgpt-codex-connector · 2026-04-12T01:36:04Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

lishunyang12 · 2026-04-12T07:41:23Z

Test results?

…r pre-quantized checkpoints

Same root cause as vllm-project#2686 (Qwen3-Omni): modelopt FP8/FP4/MXFP8 pre-quantized checkpoints only quantize the Thinker LM. The vision encoder weights remain BF16 with no FP8 scale tensors. Passing quant_config to the vision encoder causes FP8 kernels to run on BF16 weights, producing numerical garbage embeddings.

Note: audio tower in Qwen2.5-Omni already constructs without quant_config, so only visual encoder needs the guard.

Related to vllm-project#2686

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ianliuy · 2026-04-12T08:19:15Z

The original reporter (@mrwd2005) verified the fix in the issue description: after applying this exact workaround (setting `audio_quant_config` and `visual_quant_config` to None for pre-quantized methods), all multimodal tests pass:

| Test | Input | Before Fix | After Fix |
|---|---|---|---|
| Vision: red image | Solid red 100x100 | `Gray` | `Red` |
| Vision: blue image | Solid blue 100x100 | `Gray` | `Blue` |
| Vision: cat photo | 480x640 cat photo | Unrelated hallucination | Accurate |
| Audio: transcription | `Hello, today is a beautiful day...` | `Yeah.` | Exact transcription |

I don't have access to a GPU setup with the modelopt FP8 checkpoint to run end-to-end inference locally.

Changes summary:

- `qwen3_omni_moe_thinker.py`: set `audio_quant_config = None` and `visual_quant_config = None` in the `_PRE_QUANTIZED_METHODS` branch and corrected the misleading comment (2 functional lines + comment update)
- `qwen2_5_omni_thinker.py`: added the same guard for `visual_quant_config` (the audio tower was already safe, since it is constructed without `quant_config`)

The fix stops passing FP8 config to encoders whose weights are BF16 with no scale tensors.

ianliuy · 2026-04-12T08:25:44Z

Re: docs update (per @hsliuustc0106's request in #2686)

I checked the existing documentation at docs/user_guide/diffusion/quantization/overview.md, and it already correctly describes the intended behavior:

"Quantization is automatically scoped to the thinker's language_model; audio encoder, vision encoder, talker, and code2wav remain in BF16."

The per-component table also explicitly states that the audio and vision encoders are not quantized and stay in BF16.

So the docs were already correct; it was the code that didn't match the docs. This PR aligns the code with the documented behavior. No doc changes needed.

lishunyang12 · 2026-04-12T09:18:54Z

@mrwd2005 Can you help check whether this fixes the issue?

hsliuustc0106 · 2026-04-12T09:23:40Z

Bugfix without regression test. Manual verification only. Please add a test that asserts visual_quant_config and audio_quant_config are None for pre-quantized methods (modelopt, modelopt_fp4, modelopt_mxfp8).

Extract resolve_encoder_quant_config() into component_config.py so the routing logic is unit-testable. Add parametrized tests asserting that visual_quant_config and audio_quant_config are None for pre-quantized methods (modelopt, modelopt_fp4, modelopt_mxfp8). Addresses review feedback from hsliuustc0106 on PR vllm-project#2702. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>

ianliuy · 2026-04-14T02:34:19Z

@hsliuustc0106 Added regression test in cf4d87f.

What changed:

- Extracted `resolve_encoder_quant_config()` into `vllm_omni/quantization/component_config.py`. This is the exact conditional that decides whether encoders get `quant_config` or None. Both thinker models now use it (qwen2.5 calls the function directly; qwen3 imports the shared `PRE_QUANTIZED_METHODS` constant).
- New test file: `tests/model_executor/models/test_encoder_quant_config.py` (10 test cases):
  - Pre-quantized → None: parametrized over modelopt, modelopt_fp4, modelopt_mxfp8; asserts the return is None (i.e., `visual_quant_config` and `audio_quant_config` would be None)
  - Non-pre-quantized preserved: fp8, awq, gptq, bitsandbytes pass through unchanged
  - None input → None
  - `ComponentQuantizationConfig` passed through (caller handles `.resolve()`)
  - Constant exhaustiveness: asserts `PRE_QUANTIZED_METHODS` contains exactly the expected set

All 10 tests pass locally (pytest, CPU-only, WSL).
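A rough sketch of the helper and the ten cases listed above, for readers who want to see the routing logic in one place. The stub classes and `run_checks` are hypothetical, and the way the sketch recognizes a component-level config (an `isinstance` check) is an assumption rather than the actual implementation in `component_config.py`. Only the function name, the constant, and the expected behavior come from this thread.

```python
from dataclasses import dataclass

# Constant as quoted in the review comment on this PR.
PRE_QUANTIZED_METHODS = frozenset({"modelopt", "modelopt_fp4", "modelopt_mxfp8"})


@dataclass(frozen=True)
class StubQuantConfig:
    """Hypothetical stand-in for a plain quantization config."""

    name: str

    def get_name(self) -> str:
        return self.name


class StubComponentConfig:
    """Hypothetical stand-in for ComponentQuantizationConfig."""


def resolve_encoder_quant_config(quant_config):
    """Return the config an encoder should receive, or None to keep it in BF16."""
    if quant_config is None:
        return None
    if isinstance(quant_config, StubComponentConfig):
        # Assumption: component-level configs pass through; the caller resolves them.
        return quant_config
    if quant_config.get_name() in PRE_QUANTIZED_METHODS:
        return None
    return quant_config


def run_checks() -> int:
    """Run the ten cases described in the comment and return how many ran."""
    cases = 0
    for method in ("modelopt", "modelopt_fp4", "modelopt_mxfp8"):
        assert resolve_encoder_quant_config(StubQuantConfig(method)) is None
        cases += 1
    for method in ("fp8", "awq", "gptq", "bitsandbytes"):
        cfg = StubQuantConfig(method)
        assert resolve_encoder_quant_config(cfg) is cfg
        cases += 1
    assert resolve_encoder_quant_config(None) is None
    cases += 1
    component = StubComponentConfig()
    assert resolve_encoder_quant_config(component) is component
    cases += 1
    assert PRE_QUANTIZED_METHODS == {"modelopt", "modelopt_fp4", "modelopt_mxfp8"}
    cases += 1
    return cases
```

Keeping the decision in a small pure function like this is what lets the cases run on CPU without loading a checkpoint.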

lishunyang12 · 2026-04-14T02:40:32Z

Tested locally and multimodal input works properly without crashing.

ianliuy · 2026-04-14T02:54:14Z

@lishunyang12 Thanks for verifying locally! All CI checks are green now (pre-commit, build wheel, DCO, readthedocs all passing). Could you approve when you get a chance?

ianliuy · 2026-04-14T03:45:16Z

@lishunyang12 Thanks for the approval! Looks like the buildkite/vllm-omni check is waiting for the ready label to trigger the test pipeline. Could you add it when you get a chance?

lishunyang12 · 2026-04-14T14:11:42Z

+# Pre-quantized checkpoints (modelopt FP8/FP4/MXFP8) only quantize the
+# Thinker LM.  Vision and audio encoder weights remain in BF16 with no
+# corresponding scale tensors in the checkpoint.
+PRE_QUANTIZED_METHODS: frozenset[str] = frozenset({"modelopt", "modelopt_fp4", "modelopt_mxfp8"})


We don't have modelopt FP4 or modelopt MXFP8 checkpoints for now.

…uantized checkpoints (vllm-project#2702) Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ianliuy requested a review from hsliuustc0106 as a code owner April 12, 2026 01:35

ianliuy mentioned this pull request Apr 12, 2026

[Bug] Qwen3-Omni FP8 (modelopt): Vision and Audio Encoders Erroneously Quantized — Multimodal Inputs Completely Broken #2686

Closed

ianliuy force-pushed the fix/issue-2686 branch from 374bb58 to b79d9eb Compare April 12, 2026 01:43

ianliuy force-pushed the fix/issue-2686 branch from b79d9eb to e5286d9 Compare April 12, 2026 08:18

ianliuy force-pushed the fix/issue-2686 branch from cf4d87f to 2dd35c4 Compare April 14, 2026 02:39

ianliuy force-pushed the fix/issue-2686 branch from 2dd35c4 to 4e780bd Compare April 14, 2026 02:44

lishunyang12 approved these changes Apr 14, 2026

lishunyang12 enabled auto-merge (squash) April 14, 2026 03:37

ianliuy mentioned this pull request Apr 14, 2026

fix(fish_speech): use from_indices() instead of decode() for DAC decoder #2668

Open

Gaohan123 added the ready label to trigger buildkite CI label Apr 14, 2026

Gaohan123 added this to the v0.20.0 milestone Apr 14, 2026

lishunyang12 merged commit 53a9cf4 into vllm-project:main Apr 14, 2026
6 of 7 checks passed

lishunyang12 reviewed Apr 14, 2026

lishunyang12 merged 3 commits into vllm-project:main from ianliuy:fix/issue-2686

4 participants
