[Bugfix] Fix FP8 Bias Loading#41424

Merged
Isotr0py merged 4 commits into vllm-project:main from alex-jw-brooks:fix_bias_loads
May 3, 2026
Conversation

@alex-jw-brooks Contributor

Purpose

Fixes the underlying cause of #41284

The issue is that when layers have bias=True, we do the following:

  • Initialize the weight on the meta device and wrap its weight loader; allocate the bias normally (not on the meta device)
  • Params are generally yielded alphabetically when we load them, which means:
    • First, we load the bias normally
    • Then, we load the weight, which requires materializing the meta tensors, i.e. replacing each one with a new on-device tensor created via torch.empty_strided

The materialization currently does this to every parameter, including the bias, even though it should only apply to the weight. This corrupts the already-loaded bias values, which produces NaNs in forward() and ultimately garbage output.

The handling of NaNs is also why things worked for Granite Speech with fp8 in 0.17 but not in 0.20. I think the native forward doesn't handle NaNs in the same way, which is why the values diverge; I'll open a separate PR to discuss.
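The fix described above can be sketched as a minimal standalone illustration (this is not vLLM's actual materialize_layer; the function name, signature, and demo layer are assumptions): only parameters still on the meta device get replaced, so an already-loaded bias survives untouched.

```python
# Hypothetical sketch: materialize only meta-device parameters, leaving
# already-initialized tensors (e.g. a loaded bias) intact.
import torch
import torch.nn as nn


def materialize_layer(layer: nn.Module, device: str = "cpu") -> None:
    """Replace meta-device parameters with empty on-device tensors."""
    for name, param in list(layer.named_parameters(recurse=False)):
        if param.device.type != "meta":
            # Already materialized and possibly already loaded (e.g. the
            # bias); overwriting it here would corrupt its values.
            continue
        materialized = torch.empty_strided(
            param.shape, param.stride(), dtype=param.dtype, device=device
        )
        setattr(layer, name, nn.Parameter(materialized, requires_grad=False))


# Demo: a linear layer whose weight is still on meta, but whose bias has
# already been loaded with real values.
with torch.device("meta"):
    layer = nn.Linear(4, 4)
layer.bias = nn.Parameter(torch.ones(4), requires_grad=False)

materialize_layer(layer)
assert layer.weight.device.type == "cpu"          # weight materialized
assert torch.equal(layer.bias, torch.ones(4))     # bias preserved
```

The key guard is the `param.device.type != "meta"` check: without it, the bias would also be replaced by an uninitialized `torch.empty_strided` tensor, which is exactly the corruption this PR fixes.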

Test Plan

Added an explicit test; you can also verify the fix with a minimal fp8 example using Granite Speech.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

model_id = "ibm-granite/granite-speech-4.1-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        question = f"<|audio|>{question}"
    chat = [
        {
            "role": "user",
            "content": question
        }
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)

model = LLM(
    model=model_id,
    max_model_len=2048, # This may be needed for lower resource devices.
    limit_mm_per_prompt={"audio": 1},
    quantization="fp8",
)

question = "can you transcribe the speech into a written format?"
prompt_with_audio = get_prompt(
    question=question,
    has_audio=True,
)
audio = AudioAsset("mary_had_lamb").audio_and_sample_rate

inputs = {
    "prompt": prompt_with_audio,
    "multi_modal_data": {
        "audio": audio,
    }
}

outputs = model.generate(
    inputs,
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=64,
    ),
)
print(f"Audio Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")

Test Result

On main:

Generated text: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

After fix:

Generated text: the first words i spoke in the original phonograph a little piece of practical poetry mary had a little lamb its fleece was white as snow and everywhere that mary went the lamb was sure to go

CC @DarkLight1337 @robertgshaw2-redhat @lokashrinav

Signed-off-by: Alex Brooks <albrooks@redhat.com>
@alex-jw-brooks alex-jw-brooks requested a review from 22quinn as a code owner April 30, 2026 23:29

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the bug Something isn't working label Apr 30, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the materialize_layer function to ensure that only meta tensors are materialized, preventing the overwriting of already initialized non-meta tensors. A new test case, test_materialize_layer_preserves_non_meta_tensors, has been added to verify this logic. I have no feedback to provide.

@Isotr0py Isotr0py enabled auto-merge (squash) May 2, 2026 04:43
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label May 2, 2026
@Isotr0py Isotr0py merged commit db9a84e into vllm-project:main May 3, 2026
51 checks passed
joa-stdn pushed a commit to joa-stdn/vllm that referenced this pull request May 4, 2026
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Joachim Studnia <joachim@mistral.ai>
chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request May 6, 2026
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request May 7, 2026
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
ikaadil pushed a commit to ikaadil/vllm that referenced this pull request May 7, 2026
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>