fix: Force float16 dtype for GGUF models to fix incorrect output#30090
kitaekatt wants to merge 3 commits into vllm-project:main
Conversation
Code Review
The pull request addresses a critical dtype mismatch issue with GGUF models, where bfloat16 was causing incorrect outputs because the dequantization kernels use float16 internally. By automatically setting the dtype to float16 for GGUF models when `dtype="auto"`, and explicitly removing bfloat16 from the supported dtypes for GGUF, this change ensures correctness and prevents future misuse. The changes are well-explained and directly resolve the problem.
```python
# GGUF dequantization kernels use half precision (fp16) internally.
# Using bfloat16 causes incorrect output due to dtype mismatch.
# Force float16 for GGUF models unless user explicitly set dtype.
if self.dtype == "auto":
    self.dtype = "float16"
    logger.info(
        "GGUF models require float16 dtype. "
        "Setting dtype to float16 automatically."
    )
```
```python
# GGUF dequantization kernels use half precision (fp16) internally.
# bfloat16 causes incorrect output due to dtype mismatch in the kernels.
# float32 is supported but not recommended for performance reasons.
```
Hmmm, but I remember that we are testing GGUF models with bf16, and the outputs are aligned with HF format ones for those tested models: vllm/tests/models/quantization/test_gguf.py, lines 157 to 180 in 0098a6e.
Thanks for the feedback @Isotr0py! That's a really good point about the existing test coverage. I observed the garbled output issue specifically on RTX 5090 (Blackwell, sm_120 architecture). It's possible this is hardware-specific: perhaps older GPUs handle the bf16→internal fp16 conversion differently, or there's something unique about sm_120's handling of mixed precision. A specific model I tested with was Qwen/Qwen3-4B-GGUF.

Would it be helpful if I added more details about the hardware environment to the PR description? Or could we potentially add a test case with the specific model I used to reproduce?
Force-pushed ef18f47 to eeb7339
```python
# GGUF dequantization kernels use half precision (fp16) internally.
# bfloat16 causes incorrect output due to dtype mismatch in the kernels.
# float32 is supported but not recommended for performance reasons.
return [torch.half, torch.float32]
```
Hmm, I think we shouldn't remove BF16 support totally. Perhaps disable it only for Blackwell?
Thanks for the feedback. You're right that existing tests show bfloat16 working on other architectures - the blanket removal was too aggressive.
Updated approach (middle ground):
- `arg_utils.py`: Default to float16 when `dtype="auto"` for GGUF
- `gguf.py`: Keep bfloat16 in `supported_act_dtypes` for explicit override
Options considered:
| Option | Pros | Cons |
|---|---|---|
| 1. Remove bf16 entirely | Simple, guaranteed safe | Breaks working configs on older GPUs |
| 2. Blackwell-specific detection | Precise, preserves bf16 where it works | Maintenance burden, needs updating for future GPUs |
| 3. Default fp16, allow explicit bf16 | Safe default + user control | Users must opt-in to bf16 |
Chose option 3 because:
- Fixes the Blackwell issue without breaking existing setups
- No architecture detection complexity
- Users on hardware where bf16 works can still use `--dtype bfloat16`
- The info log tells users how to override if needed
The performance difference between fp16/bf16 for GGUF is likely negligible since dequant kernels use fp16 internally anyway.
Let me know if you'd prefer option 2 (Blackwell detection) instead.
I prefer option 2, it can be handled simply:

```python
def get_supported_act_dtypes(self) -> list[torch.dtype]:
    if current_platform.has_device_capability(120):
        logger.warning_once(
            "GGUF has precision issues with bfloat16 on SM 120+ devices. "
            "bfloat16 is unavailable for blackwell devices for now."
        )
        return [torch.half, torch.float32]
    return [torch.half, torch.bfloat16, torch.float32]
```
Thanks! Implemented exactly as suggested - used current_platform.has_device_capability(120) with warning message. Pushed in commit 20d5288.
GGUF dequantization kernels use half precision (fp16) internally via the `dfloat` typedef. On Blackwell GPUs (sm_120), using bfloat16 causes garbage output due to dtype mismatch.

Approach taken (middle ground):
- arg_utils.py: Auto-set dtype to float16 when dtype="auto" for GGUF
- gguf.py: Keep bfloat16 in supported_act_dtypes for explicit override

This defaults to safe behavior while preserving user control. Users on hardware where bfloat16 works can still use --dtype bfloat16 explicitly.

Options considered:
1. Blanket removal of bfloat16 from GGUF - rejected (breaks working configs)
2. Blackwell-specific detection - rejected (maintenance burden, edge cases)
3. Default fp16 + allow explicit bf16 - chosen (simple, safe, preserves choice)

Tested on RTX 5090 (sm_120) with Qwen3-4B-GGUF: 583.8 tok/s

Signed-off-by: Christina <truffle@gmail.com>
Force-pushed eeb7339 to aa7ff75
…ity check

Instead of removing bfloat16 support globally, use device capability detection to disable bfloat16 only on SM 120+ devices (Blackwell). This preserves bfloat16 support on older architectures where tests show it works correctly, while preventing precision issues on Blackwell.

Co-Authored-By: Isotr0py <isotr0py@users.noreply.github.com>
Signed-off-by: Christina <truffle@gmail.com>
Thanks @Isotr0py! I've implemented Option 2 as suggested; the change now uses `current_platform.has_device_capability(120)` to restrict bfloat16 only on SM 120+ devices. This preserves bfloat16 support for users on pre-Blackwell hardware where testing shows it works correctly.
```python
# GGUF dequantization kernels use half precision (fp16) internally.
# bfloat16 causes incorrect output on some architectures (e.g., Blackwell).
# Default to float16 for safety; explicit --dtype bfloat16 still allowed
# for users on hardware where it works.
if self.dtype == "auto":
    self.dtype = "float16"
    logger.info(
        "GGUF models default to float16 (dequant kernels use fp16 "
        "internally). Use --dtype bfloat16 to override if needed."
    )
```
Please revert this. This will break Gemma2 GGUF with auto dtype, because it doesn't support FP16.

We should just leave dtype to be handled by `_resolve_auto_dtype` automatically.
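The auto-dtype path the reviewer describes can be sketched roughly like this; `resolve_auto_dtype` below is a hypothetical stand-in for vLLM's `_resolve_auto_dtype`, showing only the idea of deferring to the quantization method's supported dtypes rather than hard-coding an override:

```python
# Hypothetical sketch, not vLLM's actual _resolve_auto_dtype: "auto"
# resolves to a preferred dtype only if the quantization method supports
# it, otherwise it falls back to the first supported dtype.
def resolve_auto_dtype(requested: str,
                       supported: list[str],
                       preferred: str = "bfloat16") -> str:
    if requested != "auto":
        return requested      # explicit user choice wins
    if preferred in supported:
        return preferred      # e.g. bf16 where GGUF still allows it
    return supported[0]       # e.g. fp16 on SM 120+ (Blackwell)

# With bf16 stripped from the supported list (Blackwell), auto -> float16;
# where bf16 remains supported, auto keeps it.
print(resolve_auto_dtype("auto", ["float16", "float32"]))             # float16
print(resolve_auto_dtype("auto", ["float16", "bfloat16", "float32"])) # bfloat16
```

Under this scheme, restricting `get_supported_act_dtypes()` on Blackwell is enough: Gemma2 (which needs bf16 or fp32, not fp16) still resolves correctly on older hardware, with no special case in `arg_utils.py`.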
Done! Reverted the arg_utils.py changes in commit 9848dd9. The PR now only contains the Blackwell-specific bfloat16 restriction in gguf.py, leaving dtype selection to _resolve_auto_dtype as you suggested.
I'm downloading gemma2 as well to test it. I'll put a list of models I've tested the fix with in the comments when I'm done testing gemma2.
Per review feedback: the arg_utils.py dtype override breaks Gemma2 GGUF, which doesn't support FP16. The Blackwell-specific bfloat16 restriction in gguf.py's get_supported_act_dtypes() is sufficient - let _resolve_auto_dtype handle dtype selection automatically.

Signed-off-by: Christina <truffle@gmail.com>
Pull request was closed
Summary
GGUF dequantization kernels use half precision (fp16) internally via the `dfloat` typedef in the CUDA kernels. When vLLM auto-selects bfloat16 on modern GPUs (default behavior when `dtype="auto"`), the dtype mismatch causes garbage output (e.g., `"????"` tokens or random characters).

This fix forces float16 for GGUF models when `dtype="auto"`.

Test Plan
Tested on RTX 5090 (sm_120, Blackwell architecture) with `Qwen/Qwen3-4B-GGUF`:

Before fix: `"????..."` (garbage/question marks)

After fix: coherent output (583.8 tok/s)

Validation that bf16 is properly rejected:
Root Cause Analysis
The GGUF quantization kernels in `csrc/quantization/gguf/` define `dfloat` as `half` (fp16). When the model dtype is bfloat16, there's a mismatch between:

- the model's activation dtype (bfloat16)
- the kernels' internal compute dtype (`half`)

This causes the logits to be interpreted incorrectly, resulting in garbage token predictions.
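To see why confusing the two 16-bit formats corrupts values, here is a toy pure-Python illustration (not the actual kernel path, where the mismatch occurs through CUDA type conversions): the bit pattern of bfloat16 1.0, if decoded as IEEE float16, yields 1.875.

```python
import struct

def float_to_bf16_bits(x: float) -> int:
    # bfloat16 is simply the top 16 bits of an IEEE-754 float32 (truncated).
    f32_bits, = struct.unpack("<I", struct.pack("<f", x))
    return f32_bits >> 16

def fp16_bits_to_float(bits: int) -> float:
    # Decode IEEE-754 binary16: 1 sign bit, 5 exponent bits, 10 fraction bits.
    sign = -1.0 if bits >> 15 else 1.0
    exp = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    if exp == 0:  # subnormal
        return sign * (frac / 1024.0) * 2.0 ** -14
    return sign * (1.0 + frac / 1024.0) * 2.0 ** (exp - 15)

bits = float_to_bf16_bits(1.0)    # bf16 encoding of 1.0 -> 0x3f80
wrong = fp16_bits_to_float(bits)  # same bits read as fp16
print(hex(bits), wrong)           # 0x3f80 1.875
```

Because bfloat16 carries 8 exponent bits where float16 carries 5, every bit pattern decodes to a different (often wildly different) value under the wrong format, which is consistent with the garbage tokens observed.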
🤖 Generated with Claude Code