[Bugfix] Skip bias tensors in online FP8 quantization pipeline #39962
r266-tech wants to merge 2 commits into
Conversation
Code Review
This pull request addresses a bug where bias parameters were incorrectly handled by the online loader during FP8 quantization. It adds `"bias"` to the `SKIP_TENSORS` set in `vllm/model_executor/model_loader/reload/meta.py` to ensure bias follows the normal load path. A regression test has also been added to verify that bias parameters are skipped during meta capture. I have no feedback to provide.
I haven't been involved in this and haven't empirically validated anything!
@pstefa1707 Apologies for the mistaken @-mention; you weren't involved in this. I confused handles from another context. The actual reporter who provided the repro/validation is @alankessler in #39663. I'm correcting the PR body now.
You're really eager to poach my PR, huh @r266-tech #39666
Hi @alankessler, you're right to call this out. My apologies. I didn't internalize that you already had an open fix when I resubmitted this (I filed #39665 the same day you opened #39666, then resubmitted as #39962 after closing mine), and I should have deferred to your PR from the start. Closing this in favor of #39666, which was yours first and addresses the same fix. Sorry for the noise, and for the mistaken @-mention of @pstefa1707 earlier in the thread; that came from confusing handles while drafting the body and isn't an excuse. Rooting for #39666 to land.
Summary
Add `"bias"` to `SKIP_TENSORS` in `vllm/model_executor/model_loader/reload/meta.py` so bias parameters bypass `make_online_process_loader` wrapping during `initialize_online_processing`. Fixes #39663. Resubmit of #39665 (auto-closed by my notification cleanup without maintainer review).
Problem
With `--quantization fp8` on BF16 checkpoints, models that register `bias=True` linear layers (Qwen2/2.5, GPT-2, Phi, etc.) produce garbage output: bias tensors get wrapped by the online processing pipeline but never materialize; they silently stay at zero.
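A minimal repro sketch along the lines of #39663 (the model choice is illustrative; any BF16 checkpoint with `bias=True` linear layers should reproduce it):

```python
from vllm import LLM, SamplingParams

# Qwen2.5 registers bias=True on its QKV projections, so online FP8
# quantization left those bias tensors at zero before this fix.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", quantization="fp8")
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)  # garbage before the fix, coherent after
```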
Root cause

`initialize_online_processing` in `layerwise.py` wraps weight loaders for all tensors not in `SKIP_TENSORS`. Bias parameters (1D, small) don't need FP8 quantization and are not designed to flow through the deferred loading pipeline. The same class of bug was previously addressed for `e_score_correction_bias` (already in the set).

Reporter @alankessler filed #39663 with the full env capture and repro (Qwen2/2.5 producing garbage output under `--quantization fp8`).
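To make the mechanism concrete, here is a self-contained toy of the exact-name skip; the loop and helper name are assumptions about the shape of the real `initialize_online_processing`, not its actual code:

```python
import torch

# Toy stand-in: only the two names mentioned in this PR are shown.
SKIP_TENSORS = {"e_score_correction_bias", "bias"}

def names_wrapped_for_online_fp8(module: torch.nn.Module) -> list[str]:
    """Return the parameter names that would get the deferred online loader."""
    wrapped = []
    for name, _param in module.named_parameters(recurse=False):
        if name in SKIP_TENSORS:  # exact-name match: bias takes the normal path
            continue
        wrapped.append(name)      # everything else enters the online pipeline
    return wrapped

layer = torch.nn.Linear(16, 16, bias=True)
assert names_wrapped_for_online_fp8(layer) == ["weight"]  # bias is skipped
```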
Fix

Add `"bias"` to the `SKIP_TENSORS` set. vLLM parallel linear layers (`ColumnParallelLinear`, `QKVParallelLinear`, `RowParallelLinear`, etc.) register bias as exactly `self.bias` via `Parameter(...)` / `register_parameter("bias", ...)`, so the exact-name match in `SKIP_TENSORS` targets precisely the affected tensors.
Test

Added `test_capture_layer_to_meta_skips_bias`, a CPU-only unit test that verifies:

- `"bias"` is present in `SKIP_TENSORS`
- `torch.nn.Linear(bias=True)` → `capture_layer_to_meta` drops the bias but keeps the weight

The negative case (accidental over-skip) is constrained by the existing `test_reload_lifecycle` test, which exercises end-to-end capture/restore/materialize on a `torch.nn.Linear` and would fail if weight tensors were incorrectly skipped.
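A sketch of what the added test plausibly looks like (the `capture_layer_to_meta` signature and its return shape are assumptions; only the behaviors listed above are confirmed):

```python
import torch

from vllm.model_executor.model_loader.reload.meta import (
    SKIP_TENSORS, capture_layer_to_meta)


def test_capture_layer_to_meta_skips_bias():
    # The skip set itself contains the new entry.
    assert "bias" in SKIP_TENSORS

    # Capturing a bias=True Linear keeps the weight and drops the bias
    # (assumes the capture result is keyed by parameter name).
    layer = torch.nn.Linear(4, 4, bias=True)
    captured = capture_layer_to_meta(layer)
    assert "weight" in captured
    assert "bias" not in captured
```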
Notes

- No new mechanism: this reuses the same exact-name skip pattern as the existing `e_score_correction_bias` entry (also a narrow exact-name skip).