
fix(gguf): Auto-select compatible dtype for GGUF models on Blackwell #30365

Closed
kitaekatt wants to merge 1 commit into vllm-project:main from kitaekatt:fix-gemma3-gguf-dtype

Conversation

@kitaekatt (Contributor)

Summary

Fixes dtype conflict for Gemma3 GGUF models on Blackwell GPUs (SM 120+) where --dtype auto fails because:

  1. Gemma3 blocks float16 (numerical instability); it only allows [bfloat16, float32]
  2. GGUF on Blackwell blocks bfloat16 (the dequant kernels use fp16); it only allows [float16, float32]
  3. Intersection: only float32 works (a minimal sketch of this intersection follows below)
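As an illustration of the conflict, the two lists below are taken from the restrictions above; this is a sketch, not code from the patch:

    import torch

    # Dtypes Gemma3 accepts: float16 is blocked for numerical stability.
    model_supported = [torch.bfloat16, torch.float32]
    # Dtypes the GGUF dequant kernels accept on Blackwell: bfloat16 is blocked.
    quant_supported = [torch.float16, torch.float32]

    # The intersection leaves a single workable dtype.
    compatible = [dt for dt in model_supported if dt in quant_supported]
    print(compatible)  # [torch.float32]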

Error before fix:

torch.bfloat16 is not supported for quantization method gguf.
Supported dtypes: [torch.float16, torch.float32]

Changes

  1. gguf.py: Block bfloat16 on Blackwell (SM 120+) via current_platform.has_device_capability(120)

  2. vllm.py: Add _resolve_dtype_conflict() to find a compatible dtype when model restrictions and quantization restrictions conflict, falling back to float32 when no other option exists (see the sketch after this list).
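A rough sketch of the resolution logic: the helper name _resolve_dtype_conflict comes from this PR, but the signature, the preference order, and the body below are assumptions, not the actual diff:

    import torch

    # Assumed preference order: faster, lower-precision dtypes first,
    # float32 only as the last resort.
    _DTYPE_PREFERENCE = [torch.bfloat16, torch.float16, torch.float32]

    def _resolve_dtype_conflict(
        model_dtypes: list[torch.dtype],
        quant_dtypes: list[torch.dtype],
    ) -> torch.dtype:
        """Pick a dtype permitted by both the model and the quant method."""
        for dt in _DTYPE_PREFERENCE:
            if dt in model_dtypes and dt in quant_dtypes:
                return dt
        # No overlap at all: fall back to float32, which both sides
        # accept in the scenario this PR targets.
        return torch.float32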

Test Plan

  • Tested with google/gemma-3-1b-it GGUF on RTX 5090 (Blackwell)
  • Server starts successfully with --dtype auto (an example invocation follows this list)
  • Inference produces correct output (211 tok/s)
  • Non-Gemma3 GGUF models still work (Gemma2, Qwen, etc.)
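For reference, a typical invocation for this scenario might look like the following; the GGUF file name is illustrative, and vLLM loads GGUF checkpoints from a local file while taking the tokenizer from the original model:

    vllm serve ./gemma-3-1b-it-q4_0.gguf --tokenizer google/gemma-3-1b-it --dtype auto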

Related

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.


@gemini-code-assist (Bot) left a comment


Code Review

This pull request effectively resolves a dtype conflict for GGUF models on Blackwell GPUs, particularly for models like Gemma3 that carry their own dtype restrictions. The changes are well-implemented. In vllm/model_executor/layers/quantization/gguf.py, bfloat16 is correctly excluded on SM 120+ devices. The new logic in vllm/config/vllm.py for automatic dtype conflict resolution is robust: it finds a compatible dtype by intersecting the model-supported and quantization-supported types, selects the most performant option, and warns the user. This is a solid fix that also handles similar conflicts in the future. I have not identified any high or critical severity issues.

Commit message

Fixes Gemma3 GGUF models failing on Blackwell GPUs with --dtype auto.

Problem:
- Gemma3 blocks float16 (numerical instability)
- GGUF on Blackwell blocks bfloat16 (precision issues)
- Only float32 works, but dtype=auto picks bfloat16 → fails

Changes:
1. gguf.py: Block bfloat16 on SM 120+ (Blackwell) devices
2. vllm.py: Auto-select compatible dtype when model and quantization
   restrictions conflict, instead of failing with an error

This allows --dtype auto to work correctly with Gemma3 GGUF on Blackwell
by automatically falling back to float32.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

mergify Bot commented Dec 10, 2025

⚠️ The sha of the head commit of this PR conflicts with #30410. Mergify cannot evaluate rules on this PR. ⚠️
