[Feature] Add FP8 quantization for Qwen2.5-Omni (thinker LM only) #3466
wuli666 wants to merge 5 commits
Mirrors PR vllm-project#1764 (Qwen3-Omni FP8 D) routing pattern for Qwen2.5-Omni thinker LM. Routes user-supplied quant_config (e.g. --quantization fp8) to the language_model only via ComponentQuantizationConfig; vision and audio encoders stay BF16. Uses maybe_prefix(prefix, ...) for component keys so the routing works under the nested prefix structure: the thinker is constructed inside the parent Qwen2_5OmniForConditionalGeneration with prefix="thinker", so language model layers are at "thinker.language_model.X" at runtime. Signed-off-by: wuli666 <djjpro975@gmail.com>
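The prefix-based routing described in this commit can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: maybe_prefix is re-implemented locally to mirror the behavior of vLLM's helper of the same name (join with a dot unless the prefix is empty), ComponentRouter and its resolve method are hypothetical stand-ins for ComponentQuantizationConfig, and the encoder layer names are illustrative.

```python
from typing import Optional


def maybe_prefix(prefix: str, name: str) -> str:
    """Join prefix and name with a dot unless the prefix is empty."""
    return name if not prefix else f"{prefix}.{name}"


class ComponentRouter:
    """Hypothetical stand-in: maps component prefixes to a quant method name."""

    def __init__(self, routes: dict[str, Optional[str]],
                 default: Optional[str] = None):
        self.routes = routes
        self.default = default

    def resolve(self, layer_prefix: str) -> Optional[str]:
        # Longest component key wins; match whole dotted path segments only.
        for key in sorted(self.routes, key=len, reverse=True):
            if layer_prefix == key or layer_prefix.startswith(key + "."):
                return self.routes[key]
        return self.default


# The thinker is constructed with prefix="thinker" inside the parent model,
# so the component key is derived from that prefix rather than hard-coded.
language_prefix = maybe_prefix("thinker", "language_model")
router = ComponentRouter({language_prefix: "fp8"}, default=None)

# Illustrative layer names: only the language model resolves to fp8.
print(router.resolve("thinker.language_model.layers.0.self_attn.qkv_proj"))
print(router.resolve("thinker.visual.blocks.0.attn.qkv"))
print(router.resolve("thinker.audio_tower.layers.0.fc1"))
```

If the key were the bare string language_model, it would not match the runtime layer names under the nested prefix, and every layer would fall through to the default.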
Per reviewer guidance and matrix RFC vllm-project#2136: for omni models, dynamic FP8 should scope to the thinker/LLM only — talker, audio encoder, vision encoder, and code2wav stay BF16. The talker stage runs in a separate process and receives engine_args.quantization propagated from the user. When that becomes a bare dynamic quant_config (e.g. Fp8Config from --quantization fp8), the auto-wrap branch in talker.__init__ now sets quant_config=None and propagates the cleared config via replace(vllm_config, quant_config=None) so talker submodules construct as BF16. Pre-quantized checkpoints (modelopt) and explicit per-component dicts (where the user includes a talker entry) continue to be honored. Signed-off-by: wuli666 <djjpro975@gmail.com>
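The talker-side decision described in this commit can be sketched as below. All types are hypothetical stand-ins (BareQuantConfig for a single-method config such as Fp8Config, ComponentConfig for an explicit per-component dict, StageConfig for vllm_config), and the contents of PRE_QUANTIZED_METHODS are assumed for illustration. The sketch passes any per-component config through unchanged, which is a simplification of the "user includes a talker entry" condition.

```python
from dataclasses import dataclass, replace
from typing import Optional, Union

# Assumed contents, for illustration only.
PRE_QUANTIZED_METHODS = {"modelopt"}


@dataclass(frozen=True)
class BareQuantConfig:
    """Stand-in for a single-method config, e.g. from --quantization fp8."""
    method: str


@dataclass(frozen=True)
class ComponentConfig:
    """Stand-in for an explicit per-component mapping."""
    routes: tuple  # (prefix, method) pairs


@dataclass(frozen=True)
class StageConfig:
    """Stand-in for vllm_config; only the field relevant here."""
    quant_config: Optional[Union[BareQuantConfig, ComponentConfig]] = None


def resolve_talker_config(cfg: StageConfig) -> StageConfig:
    qc = cfg.quant_config
    # Nothing to clear, or the user supplied explicit per-component routing.
    if qc is None or isinstance(qc, ComponentConfig):
        return cfg
    # Pre-quantized checkpoints carry their own scales and are honored.
    if qc.method in PRE_QUANTIZED_METHODS:
        return cfg
    # Bare dynamic config: clear it so talker submodules build as BF16.
    return replace(cfg, quant_config=None)


print(resolve_talker_config(StageConfig(BareQuantConfig("fp8"))))
print(resolve_talker_config(StageConfig(BareQuantConfig("modelopt"))))
```

Returning a new config via replace, rather than mutating the shared one, keeps the cleared value consistent for every submodule constructed from it.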
ac6898a to 7dae773
@wuli666 CI failed, PTAL.
Review Notes
LGTM. Clean implementation that mirrors the Qwen3-Omni FP8 D pattern (PR #1764).
What works well
Minor observation
The 3.3× KV cache budget increase (from saved weight memory) is expected behavior with
Documentation
PR body is thorough with test plan, results, and environment details. No additional docs needed for this scoped quantization change.
Have you checked the audio WAV quality?
Yes, the audio is fine and matches the BF16 baseline.
Hi @wuli666, friendly reminder: this PR hasn't had any activity (commits or reviews) in the past 7 days. 🕐 Could you please provide an update?
Thanks for your contribution! 🙏
Hi @hsliuustc0106, the code is ready, pre-commit is clean, and 18/18 tests are passing. Just waiting on review.
Purpose
Add Qwen2.5-Omni FP8 dynamic quantization support, claimed in RFC #2136 (Quantization Matrix → 🌐 Qwen2.5-Omni FP8 D).
This PR mirrors the per-component routing pattern from #1764 (Qwen3-Omni FP8 D), restricting FP8 quantization to the thinker language model only. Vision and audio encoders stay BF16 (they have no FP8 scale tensors in the checkpoint and would produce garbage embeddings if quantized). The talker stage receives the same quantization=fp8 flag via cross-process engine_args propagation; in the auto-wrap path it is also quantized to FP8 (additional ~1.8 GiB saving with no measurable quality loss). Users wanting a strict thinker-only configuration can pass --quantization-config '{"thinker.language_model": "fp8"}' once vLLM's per-component dict path is unblocked upstream.
The routing supports three input shapes:
PRE_QUANTIZED_METHODS check.
(--quantization-config '{"thinker.language_model": "fp8"}'): a ComponentQuantizationConfig is supplied directly; resolved per layer prefix.
(--quantization fp8 on a BF16 checkpoint): auto-wrapped into ComponentQuantizationConfig({language_prefix: quant_config}, default=None) so encoders fall through to None. The wrapped config is propagated via replace(vllm_config, quant_config=wrapped) so all submodules within the thinker process see consistent routing.
Test Plan
End-to-end on a 2× RTX 4090 host (Ada sm_89, vLLM-Omni HEAD), using the mixed-modalities query (audio + image + video). Stage 0 alone on GPU 0; stages 1+2 cohabit GPU 1. Deploy YAML retuned locally for 24 GiB cards (not part of this PR; default YAML is sized for 80 GiB H100/H200).
The --quantization flag is added to end2end.py locally for test convenience and is not part of this PR. The same routing is exercised by --quantization-config '{"thinker.language_model": "fp8"}' (existing flag) once vLLM's strict pairing check on quantization_config is relaxed upstream.
Validation environment
Fp8OnlineLinearMethod regression that produces garbage on biased Linear layers (Qwen2 family). The current PyPI vllm==0.20.1 predates #41424; users on pip will hit upstream garbage tokens until the next vLLM release ships. This is independent of the routing logic in this PR: once #41424 is in the user's vLLM, no further changes here are needed.
Test Result
Memory (per-stage model weights, RTX 4090 24 GiB)
Peak GPU 0 VRAM (which hosts the thinker only) is nearly identical at 21.04 GiB BF16 vs 21.02 GiB FP8 D. vLLM's gpu_memory_utilization=0.85 budget fills the available headroom with KV cache, so the saved weight memory is realized as a 3.3× larger KV cache budget rather than a smaller VRAM footprint at the same configuration. Lowering gpu_memory_utilization directly trades that headroom back for a smaller actual VRAM footprint.
Wall time (mixed-modalities query, single prompt, end-to-end including init + audio decode + WAV write)
FP8 D is slower than BF16 on RTX 4090 (Ada sm_89) because the CUTLASS FP8 GEMM kernel at these layer shapes does not outperform cuBLAS BF16 on first-generation FP8 tensor cores. The same code path on RTX 5090 (Blackwell sm_120, second-generation FP8 hardware) showed the inverse: FP8 D ran in 1m02s vs 2m28s for BF16 (~2.4× speedup). The speed/memory trade-off is therefore hardware-dependent; memory savings are the consistent benefit.
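The quoted RTX 5090 speedup follows directly from the two wall times above. The small check below uses a helper, to_seconds, that is introduced here only for illustration; it handles the "XmYYs" form used in this PR.

```python
def to_seconds(wall: str) -> int:
    """Parse a wall time written like '1m02s' into whole seconds."""
    minutes, rest = wall.split("m")
    return int(minutes) * 60 + int(rest.rstrip("s"))


fp8_s = to_seconds("1m02s")    # FP8 D on RTX 5090
bf16_s = to_seconds("2m28s")   # BF16 on RTX 5090
speedup = bf16_s / fp8_s

print(f"FP8 D: {fp8_s}s, BF16: {bf16_s}s, speedup: {speedup:.2f}x")
```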
Output quality
BF16 baseline (mixed_modalities query):
FP8 D (same prompt, same seed):
The FP8 D output is a near-paraphrase of BF16 and correctly grounds in all three input modalities (audio recitation, image content, video humor). Both runs produced valid output_*.wav files. This is the expected behavior for FP8 dynamic quantization: token-level output may differ slightly due to numerical noise, but semantic content is preserved.
Routing correctness (verified via instrumentation in ComponentQuantizationConfig.get_quant_method)
Fp8OnlineLinearMethod.
Fp8KVCacheMethod marker (KV cache stays BF16 because kv_cache_dtype=auto).
(lm_head + embed_tokens → None).
None → BF16 (verified by prefix not matching thinker.language_model).
Both out_bf16/*.wav and out_fp8d/*.wav were generated successfully, matching the BF16 / FP8 D text outputs above.