
[Bugfix] Skip bias tensors in online FP8 quantization pipeline#39665

Closed
r266-tech wants to merge 1 commit into vllm-project:main from r266-tech:fix/fp8-online-quant-skip-bias

Conversation

@r266-tech
Contributor

Summary

Add "bias" to SKIP_TENSORS in vllm/model_executor/model_loader/reload/meta.py so bias parameters bypass make_online_process_loader wrapping during initialize_online_processing.

Problem: When using --quantization fp8 on BF16 checkpoints, models with bias=True linear layers (Qwen2/2.5, GPT-2, Phi, etc.) produce garbage output. The bias tensors get wrapped by the online processing pipeline but never materialize — they silently stay at zero.

Root cause: initialize_online_processing in layerwise.py wraps weight loaders for all tensors not in SKIP_TENSORS. Bias parameters (1D, small) don't need FP8 quantization and are not designed to flow through the deferred loading pipeline. Same class of bug as #37334 and #38746.
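
To make the failure mode concrete, here is a minimal sketch of the class of check involved (the helper name and the set contents below are illustrative assumptions, not the actual vLLM code):

```python
# Illustrative sketch only; SKIP_TENSORS contents and the helper name are
# hypothetical, not the real vllm implementation.
SKIP_TENSORS = {"e_score_correction_bias"}  # before this PR: no plain "bias"

def needs_online_processing(param_name: str) -> bool:
    """True if this parameter's weight loader gets wrapped for deferred FP8 handling."""
    leaf = param_name.rsplit(".", 1)[-1]  # e.g. "bias", "weight"
    return leaf not in SKIP_TENSORS

# A standard Qwen2 bias parameter gets wrapped and deferred but never
# materialized, so it silently stays at its zero-initialized value:
assert needs_online_processing("model.layers.0.self_attn.qkv_proj.bias")
```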

Fix: A one-line addition of "bias" to the skip set. All vLLM parallel linear layers (ColumnParallelLinear, QKVParallelLinear, RowParallelLinear, etc.) register bias as self.bias via Parameter(...) or register_parameter("bias", None), so the exact string "bias" matches all standard bias parameters.

Note: Custom non-standard bias parameters with different names (e.g., e_score_correction_bias) are already individually listed in SKIP_TENSORS. This fix targets the standard nn.Linear-style bias that all parallel linear layers inherit.
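
For reference, a minimal sketch of what the one-line change looks like (the surrounding entries are placeholders; only the "bias" line reflects this PR):

```python
# vllm/model_executor/model_loader/reload/meta.py (sketch; existing entries
# are abbreviated placeholders, only the "bias" addition is from this PR)
SKIP_TENSORS = {
    "e_score_correction_bias",  # non-standard bias already listed individually
    "bias",                     # new: covers the standard nn.Linear-style bias
}
```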

Fixes #39663

@r266-tech r266-tech requested a review from 22quinn as a code owner April 13, 2026 04:19
@mergify mergify Bot added the bug label Apr 13, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the metadata keys in vllm/model_executor/model_loader/reload/meta.py by adding "bias" to the existing set. There are no review comments, and I have no feedback to provide.

Add "bias" to SKIP_TENSORS so bias parameters are not wrapped by
make_online_process_loader during initialize_online_processing.
Without this, bias tensors on models with bias=True (Qwen2/2.5,
GPT-2, Phi, etc.) silently stay at zero when using --quantization fp8
on BF16 checkpoints.

Fixes vllm-project#39663

Signed-off-by: r266-tech <183631678+r266-tech@users.noreply.github.com>
@mergify mergify Bot added the ci/build, deepseek, frontend, llama, multi-modality, mistral, performance, qwen, gpt-oss, nvidia, rocm, intel-gpu, cpu, structured-output, speculative-decoding, v1, tpu, and tool-calling labels Apr 13, 2026
@mergify mergify Bot added the kv-connector label Apr 13, 2026

Development

Successfully merging this pull request may close these issues.

[Bug]: Online FP8 quantization drops bias weights, which breaks Qwen2 and other models with bias=True
