[Bugfix] Add bias to SKIP_TENSORS to fix online FP8 for models with biased linears #39666

alankessler wants to merge 12 commits into
Conversation
…True

The layerwise reload mechanism wraps weight loaders for all tensors not in SKIP_TENSORS. This prevents bias parameters from loading correctly during online FP8 quantization, leaving them as zeros. Qwen2 is the most visible case (bias=True on qkv_proj), but any architecture with biased linear layers is affected.

Fixes: vllm-project#39663
Related: vllm-project#37334, vllm-project#38746

Signed-off-by: Alan Kessler <alankessler@gmail.com>
Code Review
This pull request introduces a regression test for online FP8 quantization on models with bias and updates the model loader to skip 'bias' tensors during reload to prevent output corruption. A review comment suggests also skipping 'w13_bias' and 'w2_bias' to ensure consistency and prevent similar issues in MoE layers.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: alankessler <alankessler@gmail.com>
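If adopted, the suggestion plausibly amounts to extending the same constant. A hedged sketch (the real set lives in vLLM's model loader and may contain other entries; only the MoE bias names come from the review comment above):

```python
# Hedged sketch of the review suggestion: also skip the MoE bias names
# "w13_bias" and "w2_bias" cited in the comment above, so MoE bias
# parameters load from the checkpoint the same way plain bias now does.
SKIP_TENSORS = {"weight_scale", "input_scale", "bias", "w13_bias", "w2_bias"}
```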
It looks like it's blocked on the pre-run check label gate. Could someone add the "ready" label to kick off the CI, please? Thank you!
@mgoin hi, would you please verify this to kick off CI? The contribution instructions say to ping you all if a PR hasn't moved in 7 days; it's been about 2 weeks. Thanks!
Thanks for finding this issue! Cc @kylesayrs
Failing checks are all unrelated to this PR: a schemathesis fuzzer flake, a seeded-sampling mismatch in test_cpu_offload, and a CUDA OOM in the fusion-e2e-tp2 suite.
Purpose
Fix online FP8 quantization producing garbage output for models with `bias=True` on linear layers (e.g. the Qwen2 family).

The layerwise reload mechanism (`initialize_online_processing`) wraps weight loaders for all tensors not in `SKIP_TENSORS`. This prevents bias parameters from loading correctly: they stay at zero instead of being populated from the checkpoint. Qwen2's `qkv_proj` has `bias=True` with values up to 147.0, so a zeroed bias completely corrupts the Q/K/V projections.

Same class of bug as #37334 and #38746.
Fixes #39663. Likely related to #27364 and #24025.
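For context, a minimal sketch of the gate described above, assuming a set-based `SKIP_TENSORS` check; the set contents besides `"bias"` and the helper names (`make_deferred_fp8_loader`, `named_params`) are illustrative, not vLLM's actual API:

```python
# Hypothetical sketch of the skip gate; only SKIP_TENSORS and
# initialize_online_processing are names taken from the PR itself.
SKIP_TENSORS = {"weight_scale", "input_scale", "bias"}  # the fix adds "bias"

def initialize_online_processing(named_params, make_deferred_fp8_loader):
    """Wrap weight loaders for online FP8, leaving SKIP_TENSORS alone."""
    for name, param in named_params:
        if name.rsplit(".", 1)[-1] in SKIP_TENSORS:
            continue  # bias keeps its stock loader and loads from checkpoint
        # everything else gets the layerwise-reload wrapper
        param.weight_loader = make_deferred_fp8_loader(param.weight_loader)
```

Before the fix, bias fell through to the wrapper and was never populated, which is why affected models emitted garbage.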
Test Plan
New integration test: loads `Qwen/Qwen2-0.5B` in BF16 (baseline) and in FP8 online, then compares logprobs with `check_logprobs_close`. A sketch of the pattern follows below.
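A hedged sketch of such a test, in the style of vLLM's model-comparison tests; it assumes the repo's `vllm_runner` fixture and the `check_logprobs_close` helper, and the prompts and arguments here are illustrative, not the PR's actual test code:

```python
import pytest
from tests.models.utils import check_logprobs_close  # path may differ by repo layout

MODEL = "Qwen/Qwen2-0.5B"
PROMPTS = ["The capital of France is"]

@pytest.mark.parametrize("max_tokens,num_logprobs", [(32, 5)])
def test_online_fp8_with_bias(vllm_runner, max_tokens, num_logprobs):
    # Baseline: unquantized BF16 run.
    with vllm_runner(MODEL, dtype="bfloat16") as m:
        baseline = m.generate_greedy_logprobs(PROMPTS, max_tokens, num_logprobs)
    # Online FP8 quantization of the same checkpoint.
    with vllm_runner(MODEL, quantization="fp8") as m:
        fp8_out = m.generate_greedy_logprobs(PROMPTS, max_tokens, num_logprobs)
    # Fails before the fix (zeroed bias corrupts Q/K/V), passes after.
    check_logprobs_close(
        outputs_0_lst=baseline,
        outputs_1_lst=fp8_out,
        name_0="bf16",
        name_1="fp8",
    )
```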
Test Result

Without fix: the test fails; FP8 output is `vpn.lua之心_burg.numpy...`

With fix: the test passes; FP8 logprobs match BF16.
Full `test_fp8.py` suite: 15 passed, 4 skipped, 0 failed (CUDA, RTX 5060 Ti, vLLM 0.19.0).

Also manually verified on Intel XPU (Arc Pro B70) with Qwen2.5-0.5B, Phi-2, GPT-2, and Mistral-Nemo (the bug was originally found because this failed on Intel XPU under Qwen and QwQ); all produce correct output with the fix, with no regressions.