
fix: Force float16 dtype for GGUF models to fix incorrect output #30090

Closed
kitaekatt wants to merge 3 commits into vllm-project:main from kitaekatt:fix-gguf-bfloat16-dtype

Conversation

@kitaekatt
Contributor

Summary

The GGUF dequantization CUDA kernels use half precision (fp16) internally via the dfloat typedef. When vLLM auto-selects bfloat16 on modern GPUs (the default behavior when dtype="auto"), the dtype mismatch causes garbage output (e.g., "????" tokens or random characters).

This fix:

  1. Auto-sets dtype to float16 for GGUF models when dtype="auto"
  2. Removes bfloat16 from supported dtypes to prevent explicit misuse
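Until this lands, explicitly forcing fp16 works around the issue on affected hardware (the GGUF path below is just a placeholder; depending on the model, a --tokenizer flag may also be needed):

$ vllm serve ./Qwen3-4B-Q4_K_M.gguf --dtype float16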

Test Plan

Tested on RTX 5090 (sm_120, Blackwell architecture) with Qwen/Qwen3-4B-GGUF:

Before fix:

  • Output: "????..." (garbage/question marks)
  • Model loads successfully but produces incorrect tokens

After fix:

  • Output: Correct, coherent responses
  • Performance: 583.8 tok/s
  • Tested with 5 diverse prompts (math, geography, code, logic)

Validation that bf16 is properly rejected:

$ python -c "..." --dtype bfloat16
# Now correctly errors with: "bfloat16 is not supported for quantization method gguf"
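The same check can also be reproduced from the Python API (a minimal sketch; the GGUF path is a placeholder, and the rejection is only expected on hardware where the bf16 restriction applies):

from vllm import LLM

try:
    # Placeholder local GGUF checkpoint; engine construction should fail where
    # bfloat16 is excluded from GGUF's supported activation dtypes.
    llm = LLM(model="./Qwen3-4B-Q4_K_M.gguf", dtype="bfloat16")
except ValueError as err:  # exact exception type is an assumption
    print(f"rejected: {err}")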

Root Cause Analysis

The GGUF quantization kernels in csrc/quantization/gguf/ define:

typedef half dfloat;  // fp16 used for dequantization output

When the model dtype is bfloat16, there's a mismatch between:

  • Kernel output: float16 (half)
  • Expected dtype: bfloat16

This causes the logits to be interpreted incorrectly, resulting in garbage token predictions.
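To illustrate why the result is garbage rather than just lower-precision output, reinterpreting fp16 bits as bf16 produces unrelated values (a standalone sketch of the bit-level mismatch, not the literal code path inside vLLM):

import torch

# Values a dequantization kernel might produce as fp16.
logits_fp16 = torch.tensor([0.1, -1.5, 3.25, 0.007], dtype=torch.float16)

# Both dtypes are 16 bits wide, so view() reinterprets the raw bits instead
# of converting the values, which is what a dtype mismatch amounts to.
logits_misread = logits_fp16.view(torch.bfloat16)

print(logits_fp16)     # the intended values
print(logits_misread)  # tiny/huge unrelated values, i.e. nonsense logits and tokens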

🤖 Generated with Claude Code


@gemini-code-assist (Bot) left a comment


Code Review

The pull request effectively addresses a critical dtype mismatch issue with GGUF models, where `bfloat16` was causing incorrect outputs due to internal `float16` dequantization kernels. By automatically setting the dtype to `float16` for GGUF models when `dtype="auto"` and explicitly removing `bfloat16` from the supported dtypes for GGUF, this change ensures correctness and prevents future misuse. The changes are well-explained and directly resolve the problem.

Comment thread vllm/engine/arg_utils.py Outdated
Comment on lines +1176 to +1184
# GGUF dequantization kernels use half precision (fp16) internally.
# Using bfloat16 causes incorrect output due to dtype mismatch.
# Force float16 for GGUF models unless user explicitly set dtype.
if self.dtype == "auto":
    self.dtype = "float16"
    logger.info(
        "GGUF models require float16 dtype. "
        "Setting dtype to float16 automatically."
    )
Contributor


critical

This logic correctly identifies GGUF models and forces the dtype to float16 when it's set to auto. This is a critical fix to ensure correct model operation and prevent the garbage output described in the PR.

Comment on lines +55 to +57
# GGUF dequantization kernels use half precision (fp16) internally.
# bfloat16 causes incorrect output due to dtype mismatch in the kernels.
# float32 is supported but not recommended for performance reasons.
Contributor


critical

Removing torch.bfloat16 from the list of supported activation dtypes for GGUF models is a crucial step to prevent the dtype mismatch that leads to incorrect outputs. This directly enforces the necessary float16 precision for GGUF dequantization kernels.

@Isotr0py
Member

Isotr0py commented Dec 5, 2025

Hmmm, but I remember that we are testing GGUF models with bf16, and the outputs are aligned with HF format ones for those tested models:

@pytest.mark.skipif(
    not is_quant_method_supported("gguf"),
    reason="gguf is not supported on this GPU type.",
)
@pytest.mark.parametrize(
    "model",
    [pytest.param(test_config, marks=test_config.marks) for test_config in MODELS],
)
@pytest.mark.parametrize("dtype", ["bfloat16"])
@pytest.mark.parametrize("max_tokens", [32])
@pytest.mark.parametrize("num_logprobs", [5])
@pytest.mark.parametrize("tp_size", [1])
def test_models(
    vllm_runner: type[VllmRunner],
    example_prompts: list[str],
    model: GGUFTestConfig,
    dtype: str,
    max_tokens: int,
    num_logprobs: int,
    tp_size: int,
) -> None:
    check_model_outputs(
        vllm_runner, example_prompts, model, dtype, max_tokens, num_logprobs, tp_size
    )

@kitaekatt
Contributor Author

kitaekatt commented Dec 9, 2025

Thanks for the feedback @Isotr0py! That's a really good point about the existing test coverage.

I observed the garbled output issue specifically on RTX 5090 (Blackwell, sm_120 architecture). It's possible this is hardware-specific - perhaps older GPUs handle the bf16→internal fp16 conversion differently, or there's something unique about sm_120's handling of mixed precision.

A specific model I tested with was bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M quantization). With --dtype bfloat16, the output was garbled question marks; with --dtype float16, the output was correct and coherent. I saw this on many other GGUF models as well; this is just one example.

A few possibilities:

  1. GPU architecture-specific: The issue might only manifest on newer Blackwell GPUs
  2. Model-specific: Different GGUF models might handle dtype differently based on their quantization
  3. Test coverage gap: The test models might work with bf16 but others don't

Would it be helpful if I added more details about the hardware environment to the PR description? Or could we potentially add a test case with the specific model I used to reproduce?
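For the test-coverage angle, one option (a sketch only, reusing the current_platform.has_device_capability helper and omitting the existing fixtures) would be to keep the bf16 parametrization but skip it on SM 120+ devices:

import pytest
from vllm.platforms import current_platform

# Hypothetical parametrization: keep bfloat16 coverage on older GPUs, but
# skip it on SM 120+ (Blackwell) where GGUF output is currently incorrect.
DTYPES = [
    "float16",
    pytest.param(
        "bfloat16",
        marks=pytest.mark.skipif(
            current_platform.has_device_capability(120),
            reason="GGUF bf16 output is incorrect on SM 120+ devices.",
        ),
    ),
]


@pytest.mark.parametrize("dtype", DTYPES)
def test_models(dtype: str) -> None:
    ...  # same body as the existing test above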

@kitaekatt force-pushed the fix-gguf-bfloat16-dtype branch from ef18f47 to eeb7339 on December 9, 2025 00:45
# GGUF dequantization kernels use half precision (fp16) internally.
# bfloat16 causes incorrect output due to dtype mismatch in the kernels.
# float32 is supported but not recommended for performance reasons.
return [torch.half, torch.float32]
Member


Hmm, I think we shouldn't remove BF16 support totally. Perhaps disable it only on Blackwell?

Contributor Author


Thanks for the feedback. You're right that existing tests show bfloat16 working on other architectures - the blanket removal was too aggressive.

Updated approach (middle ground):

  1. arg_utils.py: Default to float16 when dtype="auto" for GGUF
  2. gguf.py: Keep bfloat16 in supported_act_dtypes for explicit override

Options considered:

| Option | Pros | Cons |
| --- | --- | --- |
| 1. Remove bf16 entirely | Simple, guaranteed safe | Breaks working configs on older GPUs |
| 2. Blackwell-specific detection | Precise, preserves bf16 where it works | Maintenance burden, needs updating for future GPUs |
| 3. Default fp16, allow explicit bf16 | Safe default + user control | Users must opt in to bf16 |

Chose option 3 because:

  • Fixes the Blackwell issue without breaking existing setups
  • No architecture detection complexity
  • Users on hardware where bf16 works can still use --dtype bfloat16
  • The info log tells users how to override if needed

The performance difference between fp16/bf16 for GGUF is likely negligible since dequant kernels use fp16 internally anyway.

Let me know if you'd prefer option 2 (Blackwell detection) instead.

Member


I prefer option 2; it can be handled simply:

    def get_supported_act_dtypes(self) -> list[torch.dtype]:
        if current_platform.has_device_capability(120):
            logger.warning_once(
                "GGUF has precision issues with bfloat16 on SM 120+ devices. "
                "bfloat16 is unavailable for blackwell devices for now."
            )
            return [torch.half, torch.float32]
        return [torch.half, torch.bfloat16, torch.float32]

Contributor Author


Thanks! Implemented exactly as suggested - used current_platform.has_device_capability(120) with warning message. Pushed in commit 20d5288.

GGUF dequantization kernels use half precision (fp16) internally via the
`dfloat` typedef. On Blackwell GPUs (sm_120), using bfloat16 causes garbage
output due to dtype mismatch.

Approach taken (middle ground):
- arg_utils.py: Auto-set dtype to float16 when dtype="auto" for GGUF
- gguf.py: Keep bfloat16 in supported_act_dtypes for explicit override

This defaults to safe behavior while preserving user control. Users on
hardware where bfloat16 works can still use --dtype bfloat16 explicitly.

Options considered:
1. Blanket removal of bfloat16 from GGUF - rejected (breaks working configs)
2. Blackwell-specific detection - rejected (maintenance burden, edge cases)
3. Default fp16 + allow explicit bf16 - chosen (simple, safe, preserves choice)

Tested on RTX 5090 (sm_120) with Qwen3-4B-GGUF: 583.8 tok/s

Signed-off-by: Christina <truffle@gmail.com>
@kitaekatt force-pushed the fix-gguf-bfloat16-dtype branch from eeb7339 to aa7ff75 on December 9, 2025 15:37
…ity check

Instead of removing bfloat16 support globally, use device capability
detection to disable bfloat16 only on SM 120+ devices (Blackwell).

This preserves bfloat16 support on older architectures where tests show
it works correctly, while preventing precision issues on Blackwell.

Co-Authored-By: Isotr0py <isotr0py@users.noreply.github.com>
Signed-off-by: Christina <truffle@gmail.com>
@kitaekatt
Contributor Author

Thanks @Isotr0py! I've implemented Option 2 as suggested - the change now uses current_platform.has_device_capability(120) to check for Blackwell architecture.

Changes:

  • Added from vllm.platforms import current_platform import
  • get_supported_act_dtypes() now checks for SM 120+ capability
  • On Blackwell: returns [torch.half, torch.float32] with a warning
  • On older architectures: returns [torch.half, torch.bfloat16, torch.float32] (preserves existing behavior)

This preserves bfloat16 support for users on pre-Blackwell hardware where testing shows it works correctly.
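Concretely, the resulting method is the snippet suggested above plus the new import; shown here as a standalone sketch (the enclosing class name and module layout are assumptions about vLLM's source tree, not copied from this PR's diff):

import torch

from vllm.logger import init_logger
from vllm.platforms import current_platform  # newly added import

logger = init_logger(__name__)


class GGUFConfig:  # sketch: the GGUF quantization config class in gguf.py
    def get_supported_act_dtypes(self) -> list[torch.dtype]:
        if current_platform.has_device_capability(120):
            logger.warning_once(
                "GGUF has precision issues with bfloat16 on SM 120+ devices. "
                "bfloat16 is unavailable for blackwell devices for now."
            )
            return [torch.half, torch.float32]
        return [torch.half, torch.bfloat16, torch.float32]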

Comment thread vllm/engine/arg_utils.py Outdated
Comment on lines +1176 to +1185
# GGUF dequantization kernels use half precision (fp16) internally.
# bfloat16 causes incorrect output on some architectures (e.g., Blackwell).
# Default to float16 for safety; explicit --dtype bfloat16 still allowed
# for users on hardware where it works.
if self.dtype == "auto":
    self.dtype = "float16"
    logger.info(
        "GGUF models default to float16 (dequant kernels use fp16 "
        "internally). Use --dtype bfloat16 to override if needed."
    )
Member


Suggested change (deletes the block above)

Please revert this. This will break Gemma2 GGUF with auto dtype, because it doesn't support FP16.

We should just leave dtype to be handled by _resolve_auto_dtype automatically.

Contributor Author


Done! Reverted the arg_utils.py changes in commit 9848dd9. The PR now only contains the Blackwell-specific bfloat16 restriction in gguf.py, leaving dtype selection to _resolve_auto_dtype as you suggested.

Contributor Author


I'm downloading Gemma2 as well to test it; I'll put a list of the models I've tested the fix with in the comments when I'm done testing Gemma2.

Member

@Isotr0py left a comment


Otherwise LGTM.

Per review feedback: the arg_utils.py dtype override breaks Gemma2 GGUF
which doesn't support FP16. The Blackwell-specific bfloat16 restriction
in gguf.py's get_supported_act_dtypes() is sufficient - let
_resolve_auto_dtype handle dtype selection automatically.

Signed-off-by: Christina <truffle@gmail.com>
@Isotr0py enabled auto-merge (squash) on December 10, 2025 03:05
@github-actions (Bot) added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Dec 10, 2025
@kitaekatt closed this on Dec 10, 2025
auto-merge was automatically disabled December 10, 2025 17:42

Pull request was closed

@kitaekatt deleted the fix-gguf-bfloat16-dtype branch on December 10, 2025 17:42
@mergify
Contributor

mergify Bot commented Dec 10, 2025

⚠️ The sha of the head commit of this PR conflicts with #30408. Mergify cannot evaluate rules on this PR. ⚠️


Labels

ready ONLY add when PR is ready to merge/full CI is needed
