fix: do not apply FP8 quant config to vision/audio encoders for pre-quantized checkpoints #2702
…rs for pre-quantized checkpoints

For modelopt FP8/FP4/MXFP8 pre-quantized checkpoints, only the Thinker LM is quantized. Vision and audio encoder weights remain in BF16 with no corresponding input_scale/weight_scale tensors.

Previously, the code passed the same quant_config to all sub-components, causing ModelOptFp8LinearMethod's FP8 kernel to run on BF16 encoder weights. This produced numerical garbage embeddings, making the model completely ignore image and audio inputs (e.g. red image described as 'Gray', speech transcribed as 'Yeah.').

Fix: set audio_quant_config = None and visual_quant_config = None for pre-quantized methods. Also corrects the misleading comment that claimed the entire thinker was quantized.

Fixes vllm-project#2686

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Test results? |
…r pre-quantized checkpoints

Same root cause as vllm-project#2686 (Qwen3-Omni): modelopt FP8/FP4/MXFP8 pre-quantized checkpoints only quantize the Thinker LM. The vision encoder weights remain BF16 with no FP8 scale tensors. Passing quant_config to the vision encoder causes FP8 kernels to run on BF16 weights, producing numerical garbage embeddings.

Note: the audio tower in Qwen2.5-Omni is already constructed without quant_config, so only the visual encoder needs the guard.

Related to vllm-project#2686

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
The original reporter (@mrwd2005) verified the fix in the issue description --- after applying this exact workaround (setting
I don't have access to a GPU setup with the modelopt FP8 checkpoint to run end-to-end inference locally. Changes summary:
The fix stops passing FP8 config to encoders whose weights are BF16 with no scale tensors. |
|
Re: docs update (per @hsliuustc0106's request in #2686) I checked the existing documentation at
And the per-component table explicitly states that the Audio/Vision encoders are not quantized and stay in BF16. So the docs were already correct; it was the code that didn't match them. This PR aligns the code with the documented behavior. No doc changes needed. |
|
@mrwd2005 Can you help check whether this fixes the issue? |
|
Bugfix without regression test. Manual verification only. Please add a test that asserts visual_quant_config and audio_quant_config are None for pre-quantized methods (modelopt, modelopt_fp4, modelopt_mxfp8). |
Extract resolve_encoder_quant_config() into component_config.py so the routing logic is unit-testable. Add parametrized tests asserting that visual_quant_config and audio_quant_config are None for pre-quantized methods (modelopt, modelopt_fp4, modelopt_mxfp8). Addresses review feedback from hsliuustc0106 on PR vllm-project#2702. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
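
The routing described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code merged in the PR: the signature and return shape of `resolve_encoder_quant_config()` are assumptions, `FakeQuantConfig` is a hypothetical stand-in for a real quantization config object, and passing the config through unchanged for methods that are not pre-quantized is likewise assumed from the description of the earlier behavior. Only the `PRE_QUANTIZED_METHODS` set and the "encoders get `None`" rule come from the PR itself.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Taken from the PR diff: methods whose checkpoints quantize only the Thinker LM.
PRE_QUANTIZED_METHODS: frozenset[str] = frozenset({"modelopt", "modelopt_fp4", "modelopt_mxfp8"})


@dataclass(frozen=True)
class FakeQuantConfig:
    """Hypothetical stand-in for a quantization config; only the method name matters here."""

    name: str

    def get_name(self) -> str:
        return self.name


def resolve_encoder_quant_config(
    quant_config: Optional[FakeQuantConfig],
) -> Tuple[Optional[FakeQuantConfig], Optional[FakeQuantConfig]]:
    """Return (visual_quant_config, audio_quant_config) for the encoders.

    Assumed signature. Pre-quantized checkpoints keep encoder weights in BF16
    with no scale tensors, so both encoders must be built without a quant config.
    """
    if quant_config is None:
        return None, None
    if quant_config.get_name() in PRE_QUANTIZED_METHODS:
        return None, None
    # Assumption: other methods keep the earlier behavior and share the config.
    return quant_config, quant_config


# Regression-style checks mirroring the parametrized tests described above.
for method in sorted(PRE_QUANTIZED_METHODS):
    visual_cfg, audio_cfg = resolve_encoder_quant_config(FakeQuantConfig(method))
    assert visual_cfg is None and audio_cfg is None
```

Keeping the decision in a small pure function like this is what makes it testable on CPU without loading a checkpoint, which matches the CPU-only test run reported in this thread.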
|
@hsliuustc0106 Added regression test in cf4d87f. What changed:
All 10 tests pass locally (pytest, CPU-only, WSL). |
Tested locally and multimodal input works properly without crashing. |
@lishunyang12 Thanks for verifying locally! All CI checks are green now (pre-commit, build wheel, DCO, readthedocs all passing). Could you approve when you get a chance? |
|
@lishunyang12 Thanks for the approval! Looks like the |
# Pre-quantized checkpoints (modelopt FP8/FP4/MXFP8) only quantize the
# Thinker LM. Vision and audio encoder weights remain in BF16 with no
# corresponding scale tensors in the checkpoint.
PRE_QUANTIZED_METHODS: frozenset[str] = frozenset({"modelopt", "modelopt_fp4", "modelopt_mxfp8"})
We don't have modelopt FP4 or modelopt MXFP8 checkpoints for now.
…uantized checkpoints (vllm-project#2702) Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes #2686
Problem
For modelopt FP8/FP4/MXFP8 pre-quantized checkpoints, only the Thinker LM is quantized. Vision/audio encoder weights are BF16 with no FP8 scale tensors in the checkpoint.
Previously, `quant_config` was passed to all sub-components, causing `ModelOptFp8LinearMethod`'s FP8 kernel to run on BF16 encoder weights. Without valid scale tensors, the kernel produces numerical garbage embeddings, so the model completely ignores image and audio inputs (e.g., a solid red image described as 'Gray', speech transcribed as 'Yeah.').
This was a regression from v0.16, where the encoder linear layers did not propagate `quant_config`, making it effectively a no-op.
Fix
Impact
Testing
Verified by the original reporter: red image described as 'Red', audio transcription exact.