fix(gguf): Auto-select compatible dtype for GGUF models on Blackwell by kitaekatt · Pull Request #30410 · vllm-project/vllm

kitaekatt · 2025-12-10T17:45:56Z

Summary

Adds _resolve_dtype_conflict() to automatically select float32 when both bfloat16 (GGUF/Blackwell) and float16 (Gemma2/Gemma3) are disallowed.

Changes

Add dtype conflict resolution logic in model config
Auto-select float32 as compatible dtype when no other option works
Log informative message about automatic dtype selection

Root Cause

Gemma2/Gemma3 models disallow float16 (numerical instability), while GGUF on Blackwell disallows bfloat16 (precision issues). Only float32 works for both.

Testing

Tested with Gemma2 GGUF on RTX 5090 - model loads and runs correctly with auto-selected float32.

chatgpt-codex-connector · 2025-12-10T17:46:05Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

gemini-code-assist

Code Review

This pull request introduces a mechanism to automatically resolve data type conflicts for GGUF models, particularly on newer hardware like NVIDIA's Blackwell GPUs. The changes correctly identify scenarios where bfloat16 is disallowed by the hardware/quantization format and float16 is disallowed by the model, and fall back to float32. My review found a critical issue with the compute capability check for Blackwell GPUs, which would prevent this feature from working as intended. I've also included a suggestion to improve code quality.

gemini-code-assist · 2025-12-10T17:47:23Z

+        # bfloat16 has precision issues on SM 120+ devices (Blackwell).
+        if current_platform.has_device_capability(120):


There seems to be a misunderstanding of NVIDIA's compute capability versions. Blackwell architecture (e.g., RTX 5090) has a compute capability of 10.0, which translates to an integer value of 100 for has_device_capability. The current code uses 120, which corresponds to a future SM 12.0, and will not activate this logic on Blackwell devices. The comment is also misleading.

Other parts of the vLLM codebase already use 100 to refer to Blackwell's compute capability.

Please correct the value to 100 and update the comment for accuracy. Using a constant for this magic number would also improve readability and maintainability.

Suggested change

# bfloat16 has precision issues on SM 120+ devices (Blackwell).

if current_platform.has_device_capability(120):

# bfloat16 has precision issues on SM 10.0+ devices (Blackwell).

if current_platform.has_device_capability(100):

Good catch! Fixed in 318e2d3b4. Changed to has_device_capability(100) for Blackwell SM 10.0 and updated the comment/warning message accordingly.

gemini-code-assist · 2025-12-10T17:47:24Z

+                    import torch
+


The torch library is already imported at the top of this file (line 19). This local import is redundant and should be removed to improve code clarity and adhere to standard Python style practices.

Fixed in 318e2d3b4. Removed the redundant local import since torch is already imported at the top of the file.

mergify · 2025-12-15T20:48:55Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kitaekatt.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

kitaekatt · 2026-01-21T00:28:05Z

Testing performed:

Tested with GGUF models on RTX 5090 (32GB, Blackwell architecture):

bartowski/NousResearch_Hermes-4-14B-GGUF
tensorblock/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF

Ran models through a local benchmark runner for HumanEval and GSM8K. Verified code coverage by instrumenting the dtype selection path to confirm float16 is correctly applied for GGUF models.

Related PRs:

This is part of a series of GGUF pipeline fixes for Blackwell GPU compatibility:

fix(gguf): Auto-select compatible dtype for GGUF models on Blackwell #30410 (this PR): Fix GGUF dtype selection for Blackwell compatibility
fix(gguf): Skip lm_head mapping for models with tied word embeddings #30412: Tokenizer semaphore leak fix
fix(gguf): Use EOS token ID from GGUF metadata instead of HF tokenizer #30434: Shared memory barrier fix

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.}

Comment @cursor review or bugbot run to trigger another review on this PR

cursor · 2026-01-21T00:37:49Z

+                                preferred,
+                            )
+                            model_config.dtype = preferred
+                            break


Missing fallback when no preferred dtype matches compatible list

Low Severity

If compatible_dtypes is non-empty but contains no dtypes from dtype_preference (float16, bfloat16, float32), the for loop completes without break, leaving model_config.dtype unchanged at its original unsupported value. The code then continues to quant_config.maybe_update_config() and returns without raising an error, silently using an invalid dtype. For GGUF this cannot occur since all its supported dtypes are in the preference list, but it's a latent risk for future quantization methods.

Acknowledged - technically valid but cannot occur in practice since all quantization methods use float16/bfloat16/float32. Happy to add a defensive else: raise ValueError(...) in a follow-up if desired.

cursor · 2026-01-21T00:37:49Z

+                        f"{model_config.dtype} is not supported for quantization "
+                        f"method {model_config.quantization}. Supported dtypes: "
+                        f"{supported_dtypes}"
+                    )


LoRA dtype set before model dtype auto-correction

Medium Severity

In VllmConfig.__post_init__, lora_config.verify_with_model_config() is called before _get_quantization_config(). When lora_dtype is "auto", it copies model_config.dtype (e.g., bfloat16). Then _get_quantization_config() may change model_config.dtype to float32. This leaves lora_config.lora_dtype with the stale bfloat16 value, creating a dtype mismatch between LoRA weights and the model when using LoRA with GGUF on Blackwell.

Valid concern - this is a pre-existing initialization ordering issue in __post_init__, not introduced by this PR. Worth fixing separately but out of scope here (unless a human reviewer asks for it in which case, sure I'll do it).

Fixes Gemma3 GGUF models failing on Blackwell GPUs with --dtype auto. Problem: - Gemma3 blocks float16 (numerical instability) - GGUF on Blackwell blocks bfloat16 (precision issues) - Only float32 works, but dtype=auto picks bfloat16 → fails Changes: 1. gguf.py: Block bfloat16 on SM 120+ (Blackwell) devices 2. vllm.py: Auto-select compatible dtype when model and quantization restrictions conflict, instead of failing with an error This allows --dtype auto to work correctly with Gemma3 GGUF on Blackwell by automatically falling back to float32. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Christina <truffle@gmail.com>

kitaekatt · 2026-03-10T20:09:58Z

Validation Results

vLLM	transformers	Cherry-picked PRs	HumanEval	IFEval
HEAD	5.x	#30410, #30411, #30412, #30413, #30424, #30434, #30699, #30702, #31464, #33846	gem2-2b-gguf (42.1%), gemma3-1b (26.8%)	gem2-2b-gguf (65.6%)
HEAD	4.x	#30410, #30411, #30412, #30413, #30424, #30434, #30699, #30702, #31464, #33846	q3-moe-gguf (83.5%)	q3-moe-gguf (85.4%)

Tested on RTX 5090 (Blackwell, SM 120) with all listed PRs cherry-picked together; models listed under each benchmark passed that benchmark in the given environment, while the same models crash or fail without these PRs applied.

Rebased to current upstream HEAD and re-validated on RTX 5090 (Blackwell, SM 120). Fix confirmed still necessary — 3 GGUF models crash without it.

…uf-dtype # Conflicts: # vllm/config/vllm.py

kitaekatt · 2026-04-18T21:28:21Z

Hi @mgoin @Isotr0py — this has been sitting without review for a while. Just rebased on latest upstream/main (merge commit), so the branch should now be mergeable. Quick summary: Auto-select compatible dtype for GGUF models on Blackwell:when GGUF's bfloat16 restriction conflicts with a model's float16 restriction (e.g. Gemma3 GGUF on Blackwell), falls back to float32 automatically instead of erroring. Rebased on latest main.

Would appreciate a look when you have cycles. Thanks!

kitaekatt requested review from ProExpertProg, WoosukKwon, hmellor, houseroad, mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth, yewentao256 and youkaichao as code owners December 10, 2025 17:45

mergify Bot mentioned this pull request Dec 10, 2025

fix(gguf): Auto-select compatible dtype for GGUF models on Blackwell #30365

Closed

4 tasks

gemini-code-assist Bot reviewed Dec 10, 2025

View reviewed changes

kitaekatt marked this pull request as draft December 10, 2025 18:28

This was referenced Dec 15, 2025

fix(gemma2): Skip missing parameters during GGUF weight loading #30421

Closed

fix(gguf): GGUF model support fixes for Blackwell GPUs #30497

Closed

kitaekatt force-pushed the fix/30365-gemma3-gguf-dtype branch from 318e2d3 to 572e029 Compare December 15, 2025 20:48

mergify Bot added the needs-rebase label Dec 15, 2025

kitaekatt force-pushed the fix/30365-gemma3-gguf-dtype branch from 572e029 to c582994 Compare December 29, 2025 20:41

mergify Bot removed the needs-rebase label Dec 29, 2025

kitaekatt force-pushed the fix/30365-gemma3-gguf-dtype branch from c582994 to 7a6a37e Compare January 19, 2026 17:27

This was referenced Jan 21, 2026

fix(gguf): Skip lm_head mapping for models with tied word embeddings #30412

Open

fix(gguf): Use EOS token ID from GGUF metadata instead of HF tokenizer #30434

Closed

kitaekatt marked this pull request as ready for review January 21, 2026 00:28

cursor Bot reviewed Jan 21, 2026

View reviewed changes

kitaekatt force-pushed the fix/30365-gemma3-gguf-dtype branch from 7bdf3dd to 28d9d6f Compare January 22, 2026 21:42

kitaekatt force-pushed the fix/30365-gemma3-gguf-dtype branch from 28d9d6f to ee1ef34 Compare February 5, 2026 00:18

kitaekatt force-pushed the fix/30365-gemma3-gguf-dtype branch from ee1ef34 to 4064a25 Compare March 10, 2026 15:37

kitaekatt mentioned this pull request Mar 10, 2026

[Bugfix] Skip missing parameters during GGUF Gemma2 weight loading #30699

Closed

Merge remote-tracking branch 'upstream/main' into fix/30365-gemma3-gg…

b3d71b0

…uf-dtype # Conflicts: # vllm/config/vllm.py

		# bfloat16 has precision issues on SM 120+ devices (Blackwell).
		if current_platform.has_device_capability(120):

Uh oh!

Conversation

kitaekatt commented Dec 10, 2025

Summary

Changes

Root Cause

Testing

Uh oh!

chatgpt-codex-connector Bot commented Dec 10, 2025

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

kitaekatt Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

kitaekatt Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

mergify Bot commented Dec 15, 2025

Uh oh!

kitaekatt commented Jan 21, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jan 21, 2026

Choose a reason for hiding this comment

Missing fallback when no preferred dtype matches compatible list

Uh oh!

kitaekatt Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jan 21, 2026

Choose a reason for hiding this comment

LoRA dtype set before model dtype auto-correction

Uh oh!

kitaekatt Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

kitaekatt commented Mar 10, 2026

Validation Results

Uh oh!

kitaekatt commented Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant