fix: Force float16 dtype for GGUF models to fix incorrect output#30090
kitaekatt wants to merge 3 commits into vllm-project:main
Conversation
Code Review
The pull request addresses a critical dtype mismatch issue with GGUF models, where bfloat16 was causing incorrect outputs because the dequantization kernels use float16 internally. By automatically setting the dtype to float16 for GGUF models when `dtype="auto"`, and explicitly removing bfloat16 from the supported dtypes for GGUF, this change ensures correctness and prevents future misuse. The changes are well-explained and directly resolve the problem.
```python
# GGUF dequantization kernels use half precision (fp16) internally.
# Using bfloat16 causes incorrect output due to dtype mismatch.
# Force float16 for GGUF models unless user explicitly set dtype.
if self.dtype == "auto":
    self.dtype = "float16"
    logger.info(
        "GGUF models require float16 dtype. "
        "Setting dtype to float16 automatically."
    )
```
```python
# GGUF dequantization kernels use half precision (fp16) internally.
# bfloat16 causes incorrect output due to dtype mismatch in the kernels.
# float32 is supported but not recommended for performance reasons.
```
Hmmm, but I remember that we are testing GGUF models with bf16, and the outputs are aligned with HF format ones for those tested models: vllm/tests/models/quantization/test_gguf.py, lines 157 to 180 in 0098a6e.
Thanks for the feedback @Isotr0py! That's a really good point about the existing test coverage. I observed the garbled output issue specifically on RTX 5090 (Blackwell, sm_120 architecture). It's possible this is hardware-specific: perhaps older GPUs handle the bf16→internal fp16 conversion differently, or there's something unique about sm_120's handling of mixed precision. A specific model I tested with was Qwen/Qwen3-4B-GGUF.

Would it be helpful if I added more details about the hardware environment to the PR description? Or could we potentially add a test case with the specific model I used to reproduce?
Force-pushed ef18f47 to eeb7339
```python
# GGUF dequantization kernels use half precision (fp16) internally.
# bfloat16 causes incorrect output due to dtype mismatch in the kernels.
# float32 is supported but not recommended for performance reasons.
return [torch.half, torch.float32]
```
Hmm, I think we shouldn't remove BF16 support totally. Perhaps disable it only for Blackwell?
Thanks for the feedback. You're right that existing tests show bfloat16 working on other architectures - the blanket removal was too aggressive.
Updated approach (middle ground):
- `arg_utils.py`: Default to float16 when `dtype="auto"` for GGUF
- `gguf.py`: Keep bfloat16 in `supported_act_dtypes` for explicit override
Options considered:
| Option | Pros | Cons |
|---|---|---|
| 1. Remove bf16 entirely | Simple, guaranteed safe | Breaks working configs on older GPUs |
| 2. Blackwell-specific detection | Precise, preserves bf16 where it works | Maintenance burden, needs updating for future GPUs |
| 3. Default fp16, allow explicit bf16 | Safe default + user control | Users must opt-in to bf16 |
Chose option 3 because:
- Fixes the Blackwell issue without breaking existing setups
- No architecture detection complexity
- Users on hardware where bf16 works can still use `--dtype bfloat16`
- The info log tells users how to override if needed
The performance difference between fp16/bf16 for GGUF is likely negligible since dequant kernels use fp16 internally anyway.
Let me know if you'd prefer option 2 (Blackwell detection) instead.
I prefer option 2, it can be handled simply:

```python
def get_supported_act_dtypes(self) -> list[torch.dtype]:
    if current_platform.has_device_capability(120):
        logger.warning_once(
            "GGUF has precision issues with bfloat16 on SM 120+ devices. "
            "bfloat16 is unavailable for blackwell devices for now."
        )
        return [torch.half, torch.float32]
    return [torch.half, torch.bfloat16, torch.float32]
```
Thanks! Implemented exactly as suggested - used current_platform.has_device_capability(120) with warning message. Pushed in commit 20d5288.
GGUF dequantization kernels use half precision (fp16) internally via the `dfloat` typedef. On Blackwell GPUs (sm_120), using bfloat16 causes garbage output due to dtype mismatch.

Approach taken (middle ground):
- arg_utils.py: Auto-set dtype to float16 when dtype="auto" for GGUF
- gguf.py: Keep bfloat16 in supported_act_dtypes for explicit override

This defaults to safe behavior while preserving user control. Users on hardware where bfloat16 works can still use --dtype bfloat16 explicitly.

Options considered:
1. Blanket removal of bfloat16 from GGUF - rejected (breaks working configs)
2. Blackwell-specific detection - rejected (maintenance burden, edge cases)
3. Default fp16 + allow explicit bf16 - chosen (simple, safe, preserves choice)

Tested on RTX 5090 (sm_120) with Qwen3-4B-GGUF: 583.8 tok/s

Signed-off-by: Christina <truffle@gmail.com>
Force-pushed eeb7339 to aa7ff75
…ity check

Instead of removing bfloat16 support globally, use device capability detection to disable bfloat16 only on SM 120+ devices (Blackwell). This preserves bfloat16 support on older architectures where tests show it works correctly, while preventing precision issues on Blackwell.

Co-Authored-By: Isotr0py <isotr0py@users.noreply.github.com>
Signed-off-by: Christina <truffle@gmail.com>
Thanks @Isotr0py! I've implemented Option 2 as suggested; the change now uses `current_platform.has_device_capability(120)` to restrict bfloat16 only on SM 120+ devices. This preserves bfloat16 support for users on pre-Blackwell hardware where testing shows it works correctly.
```python
# GGUF dequantization kernels use half precision (fp16) internally.
# bfloat16 causes incorrect output on some architectures (e.g., Blackwell).
# Default to float16 for safety; explicit --dtype bfloat16 still allowed
# for users on hardware where it works.
if self.dtype == "auto":
    self.dtype = "float16"
    logger.info(
        "GGUF models default to float16 (dequant kernels use fp16 "
        "internally). Use --dtype bfloat16 to override if needed."
    )
```
Please revert this. This will break Gemma2 GGUF with auto dtype, because it doesn't support FP16.

We should just leave dtype to be handled by `_resolve_auto_dtype` automatically.
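The auto-dtype path the reviewer describes can be sketched roughly like this; `resolve_auto_dtype` below is a hypothetical stand-in for vLLM's `_resolve_auto_dtype`, showing only the idea of deferring to the quantization method's supported dtypes rather than hard-coding an override:

```python
# Hypothetical sketch, not vLLM's actual _resolve_auto_dtype: "auto"
# resolves to a preferred dtype only if the quantization method supports
# it, otherwise it falls back to the first supported dtype.
def resolve_auto_dtype(requested: str,
                       supported: list[str],
                       preferred: str = "bfloat16") -> str:
    if requested != "auto":
        return requested      # explicit user choice wins
    if preferred in supported:
        return preferred      # e.g. bf16 where GGUF still allows it
    return supported[0]       # e.g. fp16 on SM 120+ (Blackwell)

# With bf16 stripped from the supported list (Blackwell), auto -> float16;
# where bf16 remains supported, auto keeps it.
print(resolve_auto_dtype("auto", ["float16", "float32"]))             # float16
print(resolve_auto_dtype("auto", ["float16", "bfloat16", "float32"])) # bfloat16
```

Under this scheme, restricting `get_supported_act_dtypes()` on Blackwell is enough: Gemma2 (which needs bf16 or fp32, not fp16) still resolves correctly on older hardware, with no special case in `arg_utils.py`.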
Done! Reverted the arg_utils.py changes in commit 9848dd9. The PR now only contains the Blackwell-specific bfloat16 restriction in gguf.py, leaving dtype selection to _resolve_auto_dtype as you suggested.
I'm downloading gemma2 as well to test it. I'll put a list of models I've tested the fix with in the comments when I'm done testing gemma2.
Per review feedback: the arg_utils.py dtype override breaks Gemma2 GGUF, which doesn't support FP16. The Blackwell-specific bfloat16 restriction in gguf.py's get_supported_act_dtypes() is sufficient - let _resolve_auto_dtype handle dtype selection automatically.

Signed-off-by: Christina <truffle@gmail.com>
Pull request was closed
Summary
GGUF dequantization kernels use half precision (fp16) internally via the `dfloat` typedef in the CUDA kernels. When vLLM auto-selects bfloat16 on modern GPUs (default behavior when `dtype="auto"`), the dtype mismatch causes garbage output (e.g., `"????"` tokens or random characters).

This fix forces float16 for GGUF models when `dtype="auto"`.

Test Plan
Tested on RTX 5090 (sm_120, Blackwell architecture) with `Qwen/Qwen3-4B-GGUF`:

Before fix: `"????..."` (garbage/question marks)

After fix: coherent output (583.8 tok/s)

Validation that bf16 is properly rejected:
Root Cause Analysis
The GGUF quantization kernels in `csrc/quantization/gguf/` define `dfloat` as `half` (fp16). When the model dtype is bfloat16, there's a mismatch between:

- the model's activation dtype (bfloat16)
- the kernels' internal compute dtype (`half`)

This causes the logits to be interpreted incorrectly, resulting in garbage token predictions.
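To see why confusing the two 16-bit formats corrupts values, here is a toy pure-Python illustration (not the actual kernel path, where the mismatch occurs through CUDA type conversions): the bit pattern of bfloat16 1.0, if decoded as IEEE float16, yields 1.875.

```python
import struct

def float_to_bf16_bits(x: float) -> int:
    # bfloat16 is simply the top 16 bits of an IEEE-754 float32 (truncated).
    f32_bits, = struct.unpack("<I", struct.pack("<f", x))
    return f32_bits >> 16

def fp16_bits_to_float(bits: int) -> float:
    # Decode IEEE-754 binary16: 1 sign bit, 5 exponent bits, 10 fraction bits.
    sign = -1.0 if bits >> 15 else 1.0
    exp = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    if exp == 0:  # subnormal
        return sign * (frac / 1024.0) * 2.0 ** -14
    return sign * (1.0 + frac / 1024.0) * 2.0 ** (exp - 15)

bits = float_to_bf16_bits(1.0)    # bf16 encoding of 1.0 -> 0x3f80
wrong = fp16_bits_to_float(bits)  # same bits read as fp16
print(hex(bits), wrong)           # 0x3f80 1.875
```

Because bfloat16 carries 8 exponent bits where float16 carries 5, every bit pattern decodes to a different (often wildly different) value under the wrong format, which is consistent with the garbage tokens observed.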
🤖 Generated with Claude Code