fix: do not apply FP8 quant config to vision/audio encoders for pre-quantized checkpoints #2702

Merged
lishunyang12 merged 3 commits into vllm-project:main from ianliuy:fix/issue-2686 on Apr 14, 2026

Conversation

@ianliuy
Contributor

@ianliuy ianliuy commented Apr 12, 2026

Fixes #2686

Problem

For modelopt FP8/FP4/MXFP8 pre-quantized checkpoints, only the Thinker LM is quantized. Vision/audio encoder weights are BF16 with no FP8 scale tensors in the checkpoint.

Previously, `quant_config` was passed to all sub-components, causing `ModelOptFp8LinearMethod`'s FP8 kernel to run on BF16 encoder weights. Without valid scale tensors, the kernel produces numerical garbage embeddings, so the model completely ignores image and audio inputs (e.g., a solid red image described as 'Gray', speech transcribed as 'Yeah.').

This was a regression from v0.16, where the encoder linear layers did not propagate `quant_config`, making it effectively a no-op.
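One way to confirm the mismatch on a given checkpoint is to group its tensor names by component and check which components ship scale tensors at all. The following is a minimal sketch: the tensor names and the `components_with_scales` helper are hypothetical and chosen for illustration, and only the `weight_scale`/`input_scale` suffixes come from the commit message.

```python
from collections import defaultdict

# Suffixes of the FP8 scale tensors mentioned in the commit message.
SCALE_SUFFIXES = ("weight_scale", "input_scale")


def components_with_scales(tensor_names, depth=2):
    """Map each component prefix to True if any of its tensors is a scale tensor."""
    has_scale = defaultdict(bool)
    for name in tensor_names:
        prefix = ".".join(name.split(".")[:depth])
        has_scale[prefix] |= name.endswith(SCALE_SUFFIXES)
    return dict(has_scale)


# Hypothetical tensor names; a real checkpoint index would supply these.
names = [
    "thinker.model.layers.0.mlp.down_proj.weight",
    "thinker.model.layers.0.mlp.down_proj.weight_scale",
    "thinker.model.layers.0.mlp.down_proj.input_scale",
    "thinker.visual.blocks.0.attn.qkv.weight",
    "thinker.audio_tower.layers.0.fc1.weight",
]

report = components_with_scales(names)
for component, quantized in report.items():
    print(f"{component}: {'has scale tensors' if quantized else 'no scale tensors'}")
```

A component that reports no scale tensors should not be handed an FP8 linear method.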

Fix

  • **`qwen3_omni_moe_thinker.py`**: set `audio_quant_config = None` and `visual_quant_config = None` for pre-quantized methods. Also corrects the misleading comment that claimed the entire thinker was quantized.
  • **`qwen2_5_omni_thinker.py`**: same fix for the visual encoder (the audio tower was already safe, since it is constructed without `quant_config`).
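The routing this fix introduces can be sketched roughly as follows. This is an illustration under assumptions, not the code in the diff: `StubQuantConfig` and `route_quant_config` are hypothetical names, and the only behavior taken from this PR is that the language model keeps its config while the encoders receive `None` for the three modelopt methods.

```python
from dataclasses import dataclass

# Method names taken from the PR; the rest of this sketch is hypothetical.
PRE_QUANTIZED_METHODS = frozenset({"modelopt", "modelopt_fp4", "modelopt_mxfp8"})


@dataclass
class StubQuantConfig:
    """Stand-in for a quantization config; only get_name() is used here."""

    name: str

    def get_name(self) -> str:
        return self.name


def route_quant_config(quant_config):
    """Decide which quant config each thinker sub-component receives."""
    encoder_config = quant_config
    if quant_config is not None and quant_config.get_name() in PRE_QUANTIZED_METHODS:
        # Encoder weights are BF16 with no scale tensors, so skip quantization.
        encoder_config = None
    return {
        "language_model": quant_config,
        "visual": encoder_config,
        "audio": encoder_config,
    }


routing = route_quant_config(StubQuantConfig("modelopt"))
print({component: cfg is not None for component, cfg in routing.items()})
```

For any method outside the set, all three components keep the same config object.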

Impact

  • Vision encoder (0.54B, ~1.1 GiB) and audio encoder (0.65B, ~1.3 GiB) run in BF16, which is negligible relative to the full model (~31 GiB)
  • Encoders only run during prefill, so there is no impact on autoregressive decode speed
  • Consistent with v0.16 behavior
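To put a number on the memory claim, the share of the model held by the BF16 encoders can be computed directly from the approximate figures above. This is only a back-of-the-envelope check using those rounded sizes.

```python
# Approximate sizes quoted in this PR description, in GiB.
encoder_gib = {"vision": 1.1, "audio": 1.3}
full_model_gib = 31.0

bf16_total = sum(encoder_gib.values())
share = bf16_total / full_model_gib

print(f"{bf16_total:.1f} GiB of {full_model_gib:.0f} GiB = {share:.1%}")
```

With these figures the two encoders together account for under a tenth of the model's footprint.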

Testing

Verified by the original reporter: red image → 'Red', audio transcription exact.

…rs for pre-quantized checkpoints

For modelopt FP8/FP4/MXFP8 pre-quantized checkpoints, only the Thinker
LM is quantized. Vision and audio encoder weights remain in BF16 with
no corresponding input_scale/weight_scale tensors.

Previously, the code passed the same quant_config to all sub-components,
causing ModelOptFp8LinearMethod's FP8 kernel to run on BF16 encoder
weights. This produced numerical garbage embeddings, making the model
completely ignore image and audio inputs (e.g. red image described as
'Gray', speech transcribed as 'Yeah.').

Fix: set audio_quant_config = None and visual_quant_config = None for
pre-quantized methods. Also corrects the misleading comment that claimed
the entire thinker was quantized.

Fixes vllm-project#2686

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ianliuy ianliuy requested a review from hsliuustc0106 as a code owner April 12, 2026 01:35
@lishunyang12
Collaborator

Test results?

…r pre-quantized checkpoints

Same root cause as vllm-project#2686 (Qwen3-Omni): modelopt FP8/FP4/MXFP8
pre-quantized checkpoints only quantize the Thinker LM. The vision
encoder weights remain BF16 with no FP8 scale tensors.

Passing quant_config to the vision encoder causes FP8 kernels to run
on BF16 weights, producing numerical garbage embeddings.

Note: audio tower in Qwen2.5-Omni already constructs without
quant_config, so only visual encoder needs the guard.

Related to vllm-project#2686

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ianliuy
Contributor Author

ianliuy commented Apr 12, 2026

The original reporter (@mrwd2005) verified the fix in the issue description: after applying this exact workaround (setting audio_quant_config and visual_quant_config to None for pre-quantized methods), all multimodal tests pass:

| Test | Input | Before Fix | After Fix |
| --- | --- | --- | --- |
| Vision: red image | Solid red 100x100 | Gray | Red |
| Vision: blue image | Solid blue 100x100 | Gray | Blue |
| Vision: cat photo | 480x640 cat photo | Unrelated hallucination | Accurate |
| Audio: transcription | Hello, today is a beautiful day... | Yeah. | Exact transcription |

I don't have access to a GPU setup with the modelopt FP8 checkpoint to run end-to-end inference locally.

Changes summary:

  • qwen3_omni_moe_thinker.py: set audio_quant_config = None and visual_quant_config = None in the _PRE_QUANTIZED_METHODS branch, and corrected the misleading comment (2 functional lines plus a comment update)
  • qwen2_5_omni_thinker.py: added the same guard for visual_quant_config (the audio tower was already safe, since it is constructed without quant_config)

The fix stops passing FP8 config to encoders whose weights are BF16 with no scale tensors.

@ianliuy
Contributor Author

ianliuy commented Apr 12, 2026

Re: docs update (per @hsliuustc0106's request in #2686)

I checked the existing documentation at docs/user_guide/diffusion/quantization/overview.md, and it already correctly describes the intended behavior:

"Quantization is automatically scoped to the thinker's language_model; audio encoder, vision encoder, talker, and code2wav remain in BF16."

And the per-component table explicitly states that the Audio/Vision encoders are not quantized and stay BF16.

So the docs were already correct; it was the code that didn't match the docs. This PR aligns the code with the documented behavior. No doc changes needed.

@lishunyang12
Collaborator

@mrwd2005 Can you help check whether this fixes the issue?

@hsliuustc0106
Collaborator

Bugfix without regression test. Manual verification only. Please add a test that asserts visual_quant_config and audio_quant_config are None for pre-quantized methods (modelopt, modelopt_fp4, modelopt_mxfp8).

ianliuy added a commit to ianliuy/vllm-omni that referenced this pull request Apr 14, 2026
Extract resolve_encoder_quant_config() into component_config.py so the
routing logic is unit-testable.  Add parametrized tests asserting that
visual_quant_config and audio_quant_config are None for pre-quantized
methods (modelopt, modelopt_fp4, modelopt_mxfp8).

Addresses review feedback from hsliuustc0106 on PR vllm-project#2702.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
@ianliuy
Contributor Author

ianliuy commented Apr 14, 2026

@hsliuustc0106 Added regression test in cf4d87f.

What changed:

  1. Extracted resolve_encoder_quant_config() into vllm_omni/quantization/component_config.py. This is the exact conditional that decides whether encoders get quant_config or None. Both thinker models now use it (qwen2.5 calls the function directly; qwen3 imports the shared PRE_QUANTIZED_METHODS constant).

  2. New test file: tests/model_executor/models/test_encoder_quant_config.py (10 test cases):

    • Pre-quantized → None: parametrized over modelopt, modelopt_fp4, modelopt_mxfp8; asserts the return is None (i.e., visual_quant_config and audio_quant_config would be None)
    • Non-pre-quantized preserved: fp8, awq, gptq, bitsandbytes pass through unchanged
    • None input → None
    • ComponentQuantizationConfig passed through (caller handles .resolve())
    • Constant exhaustiveness: asserts PRE_QUANTIZED_METHODS contains exactly the expected set

All 10 tests pass locally (pytest, CPU-only, WSL).
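For readers without the diff at hand, the cases above might look roughly like the sketch below. The real signature and body of `resolve_encoder_quant_config()` are not shown in this thread, so everything here is an assumption made for illustration: the stub classes, the `isinstance` check used to pass a component config through, and the plain loops standing in for pytest parametrization.

```python
# Constant as quoted in the review diff; everything else is hypothetical.
PRE_QUANTIZED_METHODS = frozenset({"modelopt", "modelopt_fp4", "modelopt_mxfp8"})


class StubQuantConfig:
    """Stand-in for a quantization config exposing get_name()."""

    def __init__(self, name):
        self.name = name

    def get_name(self):
        return self.name


class StubComponentConfig:
    """Stand-in for ComponentQuantizationConfig; the caller resolves it later."""


def resolve_encoder_quant_config(quant_config):
    """Assumed logic: drop the config for pre-quantized methods, else pass through."""
    if quant_config is None or isinstance(quant_config, StubComponentConfig):
        return quant_config
    if quant_config.get_name() in PRE_QUANTIZED_METHODS:
        return None
    return quant_config


def run_checks():
    """Run the five groups of cases and return how many checks executed."""
    checks = 0
    for method in ("modelopt", "modelopt_fp4", "modelopt_mxfp8"):
        assert resolve_encoder_quant_config(StubQuantConfig(method)) is None
        checks += 1
    for method in ("fp8", "awq", "gptq", "bitsandbytes"):
        cfg = StubQuantConfig(method)
        assert resolve_encoder_quant_config(cfg) is cfg
        checks += 1
    assert resolve_encoder_quant_config(None) is None
    checks += 1
    component = StubComponentConfig()
    assert resolve_encoder_quant_config(component) is component
    checks += 1
    assert PRE_QUANTIZED_METHODS == {"modelopt", "modelopt_fp4", "modelopt_mxfp8"}
    checks += 1
    return checks


print(f"{run_checks()} checks passed")
```

Keeping the decision in one pure function is what lets these checks run on CPU without loading a model.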

@lishunyang12
Collaborator

Tested locally and multimodal input works properly without crashing.

@ianliuy
Contributor Author

ianliuy commented Apr 14, 2026

@lishunyang12 Thanks for verifying locally! All CI checks are green now (pre-commit, build wheel, DCO, readthedocs all passing). Could you approve when you get a chance?

@lishunyang12 lishunyang12 enabled auto-merge (squash) April 14, 2026 03:37
@ianliuy
Contributor Author

ianliuy commented Apr 14, 2026

@lishunyang12 Thanks for the approval! Looks like the buildkite/vllm-omni check is waiting for the ready label to trigger the test pipeline. Could you add it when you get a chance?

@Gaohan123 Gaohan123 added the ready label to trigger buildkite CI label Apr 14, 2026
@Gaohan123 Gaohan123 added this to the v0.20.0 milestone Apr 14, 2026
@lishunyang12 lishunyang12 merged commit 53a9cf4 into vllm-project:main Apr 14, 2026
6 of 7 checks passed
# Pre-quantized checkpoints (modelopt FP8/FP4/MXFP8) only quantize the
# Thinker LM. Vision and audio encoder weights remain in BF16 with no
# corresponding scale tensors in the checkpoint.
PRE_QUANTIZED_METHODS: frozenset[str] = frozenset({"modelopt", "modelopt_fp4", "modelopt_mxfp8"})
Collaborator

We don't have modelopt FP4 or MXFP8 checkpoints for now.

y123456y78 pushed a commit to y123456y78/vllm-omni that referenced this pull request Apr 15, 2026
…uantized checkpoints (vllm-project#2702)

Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
…uantized checkpoints (vllm-project#2702)

Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…uantized checkpoints (vllm-project#2702)

Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…uantized checkpoints (vllm-project#2702)

Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Labels

ready label to trigger buildkite CI

Development

Successfully merging this pull request may close these issues.

[Bug] Qwen3-Omni FP8 (modelopt): Vision and Audio Encoders Erroneously Quantized — Multimodal Inputs Completely Broken

4 participants