
[Bugfix] Add bias to SKIP_TENSORS to fix online FP8 for models with biased linears #39666

Closed

alankessler wants to merge 12 commits into vllm-project:main from alankessler:fix/fp8-online-bias-loading


Conversation

@alankessler commented Apr 13, 2026

Purpose

Fix online FP8 quantization producing garbage output for models with bias=True on linear layers (e.g. the Qwen2 family).

The layerwise reload mechanism (initialize_online_processing) wraps the weight loaders of all tensors not in SKIP_TENSORS. This prevents bias parameters from loading correctly: they stay at zero instead of being populated from the checkpoint. Qwen2's qkv_proj has bias=True with values up to 147.0, so a zeroed bias completely corrupts the Q/K/V projections.
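The fix adds "bias" to that skip list so bias tensors keep their plain loaders. A minimal sketch of the idea, assuming SKIP_TENSORS is a collection of tensor-name suffixes in vllm/model_executor/model_loader/reload/meta.py (the neighboring entries and the helper below are illustrative, not vLLM's actual code):

```python
# vllm/model_executor/model_loader/reload/meta.py (illustrative sketch;
# only the SKIP_TENSORS name and the "bias" addition reflect this PR).
SKIP_TENSORS = (
    "weight_scale",  # illustrative existing entry
    "input_scale",   # illustrative existing entry
    "bias",          # the fix: load biases verbatim instead of wrapping them
)

def _should_wrap_loader(tensor_name: str) -> bool:
    """Hypothetical helper: only tensors outside SKIP_TENSORS get the
    wrapped loader installed by initialize_online_processing()."""
    return not tensor_name.endswith(SKIP_TENSORS)
```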

Same class of bug as #37334 and #38746.

Fixes #39663. Likely related to #27364 and #24025.

Test Plan

```
python -m pytest tests/quantization/test_fp8.py::test_fp8_online_bias_model -xvs
```

New integration test: loads Qwen/Qwen2-0.5B in BF16 (baseline) and FP8 online, compares logprobs with check_logprobs_close.
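A sketch of the shape of that test (not the PR's exact code), assuming the vllm_runner fixture and the check_logprobs_close helper from vLLM's test suite; the prompt, token counts, and argument names are approximations:

```python
# Sketch of the integration test described above; details are approximate.
from tests.models.utils import check_logprobs_close

MODEL = "Qwen/Qwen2-0.5B"  # qkv_proj has bias=True, so zeroed biases show up fast
PROMPTS = ["The capital of France is"]


def test_fp8_online_bias_model(vllm_runner):
    # Baseline: unquantized BF16 weights.
    with vllm_runner(MODEL, dtype="bfloat16") as bf16_model:
        bf16_outputs = bf16_model.generate_greedy_logprobs(
            PROMPTS, max_tokens=32, num_logprobs=5)

    # Same checkpoint with online FP8 quantization.
    with vllm_runner(MODEL, quantization="fp8") as fp8_model:
        fp8_outputs = fp8_model.generate_greedy_logprobs(
            PROMPTS, max_tokens=32, num_logprobs=5)

    # With biases loaded correctly, FP8 logprobs should track the baseline.
    check_logprobs_close(
        outputs_0_lst=bf16_outputs,
        outputs_1_lst=fp8_outputs,
        name_0="bf16",
        name_1="fp8-online",
    )
```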

Test Result

Without the fix, the test fails: the FP8 output is garbage tokens such as "vpn.lua之心_burg.numpy...".
With the fix, the test passes: the FP8 logprobs match the BF16 baseline.

Full test_fp8.py suite: 15 passed, 4 skipped, 0 failed (CUDA, RTX 5060 Ti, vLLM 0.19.0)

Also manually verified on Intel XPU (Arc Pro B70) with Qwen2.5-0.5B, Phi-2, GPT-2, and Mistral-Nemo (the bug was originally found because Qwen and QwQ failed on Intel XPU); all produce correct output with the fix, with no regressions.

…True

The layerwise reload mechanism wraps weight loaders for all tensors
not in SKIP_TENSORS. This prevents bias parameters from loading
correctly during online FP8 quantization, leaving them as zeros.

Qwen2 is the most visible case (bias=True on qkv_proj), but any
architecture with biased linear layers is affected.

Fixes: vllm-project#39663
Related: vllm-project#37334, vllm-project#38746

Signed-off-by: Alan Kessler <alankessler@gmail.com>
@mergify bot added the bug label Apr 13, 2026
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a regression test for online FP8 quantization on models with bias and updates the model loader to skip 'bias' tensors during reload to prevent output corruption. A review comment suggests also skipping 'w13_bias' and 'w2_bias' to ensure consistency and prevent similar issues in MoE layers.

Comment thread: vllm/model_executor/model_loader/reload/meta.py
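A sketch of that suggested extension (only "w13_bias" and "w2_bias" come from the review comment; the rest of the tuple is illustrative):

```python
# Sketch of the reviewer's proposed extension; non-MoE entries are illustrative.
SKIP_TENSORS = (
    "weight_scale",
    "input_scale",
    "bias",       # dense linear-layer biases (this PR)
    "w13_bias",   # fused gate/up-projection bias in MoE expert layers
    "w2_bias",    # down-projection bias in MoE expert layers
)
```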
@alankessler (Author)

It looks like this is blocked on the pre-run check label gate. Could someone add the "ready" label to kick off CI, please? Thank you!

@alankessler (Author)

Hi @mgoin, would you please take a look and kick off CI? The contribution instructions say to ping maintainers if a PR hasn't moved in 7 days; it's been about two weeks.

Thanks!

@mgoin (Member) commented Apr 27, 2026

Thanks for finding this issue! Cc @kylesayrs

@mgoin added the ready and quantization labels Apr 27, 2026
@alankessler (Author) commented Apr 27, 2026

The failing checks are all unrelated to this PR: a schemathesis fuzzer flake, a seeded-sampling mismatch in test_cpu_offload, and a CUDA OOM in the fusion-e2e-tp2 suite.

@alankessler (Author)

Closing — #41424 found the root cause. Thanks for reviewing, @mgoin!

@alankessler closed this May 6, 2026

Labels

bug (Something isn't working), quantization, ready (ONLY add when PR is ready to merge/full CI is needed)


Development

Successfully merging this pull request may close these issues.

[Bug]: Online FP8 quantization drops bias weights, which breaks Qwen2 and other models with bias=True
