fix(gguf): Skip lm_head mapping for models with tied word embeddings #30405
kitaekatt wants to merge 2 commits into
Conversation
Fixes Gemma3 GGUF models failing on Blackwell GPUs with `--dtype auto`.

Problem:
- Gemma3 blocks float16 (numerical instability)
- GGUF on Blackwell blocks bfloat16 (precision issues)
- Only float32 works, but `dtype=auto` picks bfloat16 → fails

Changes:
1. gguf.py: Block bfloat16 on SM 120+ (Blackwell) devices
2. vllm.py: Auto-select a compatible dtype when model and quantization restrictions conflict, instead of failing with an error

This allows `--dtype auto` to work correctly with Gemma3 GGUF on Blackwell by automatically falling back to float32.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
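A minimal sketch of the fallback idea this commit describes. The names `PREFERENCE`, `resolve_auto_dtype`, and the two supported-dtype lists are illustrative assumptions, not vLLM's actual API:

```python
import torch

# Preference order for --dtype auto; illustrative, not vLLM's real logic.
PREFERENCE = [torch.bfloat16, torch.float16, torch.float32]

def resolve_auto_dtype(model_supported: list[torch.dtype],
                       quant_supported: list[torch.dtype]) -> torch.dtype:
    """Pick the first preferred dtype allowed by both the model and the
    quantization method, instead of erroring on the first conflict."""
    for dtype in PREFERENCE:
        if dtype in model_supported and dtype in quant_supported:
            return dtype
    raise ValueError("No dtype satisfies both model and quantization limits")

# Gemma3 blocks float16; GGUF on Blackwell blocks bfloat16.
# Only float32 survives the intersection, so auto falls back to it.
dtype = resolve_auto_dtype(
    model_supported=[torch.bfloat16, torch.float32],
    quant_supported=[torch.float16, torch.float32],
)
assert dtype == torch.float32
```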
For models like Gemma2 that use `tie_word_embeddings=True`, the `lm_head.weight` is initialized from `embed_tokens` weights rather than loaded separately. Add `lm_head.weight` to `sideload_params` to allow GGUF loading to succeed without requiring this parameter to be mapped.

Fixes: `RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight']`

Signed-off-by: Christina <christina@example.com>
Code Review
This pull request introduces several fixes related to GGUF model loading. The primary change, as indicated by the title, is to correctly handle models with tied word embeddings by skipping the mapping of lm_head.weight, which resolves a runtime error. Additionally, the PR includes changes to improve compatibility with newer hardware like NVIDIA's Blackwell GPUs by dynamically resolving dtype conflicts between model and quantization restrictions, and by disabling bfloat16 for GGUF on Blackwell due to precision issues. The changes are well-structured and address important compatibility issues. However, I've identified a critical issue in the device capability check for Blackwell GPUs that would prevent the intended fix from being applied. Please see the specific comment for details.
```python
def get_supported_act_dtypes(self) -> list[torch.dtype]:
    # GGUF dequantization kernels use half precision (fp16) internally.
    # bfloat16 has precision issues on SM 120+ devices (Blackwell).
    if current_platform.has_device_capability(120):
```
The check for device capability 120 seems incorrect for targeting Blackwell devices. Blackwell architecture has a compute capability of 10.0, which corresponds to 100 in this context (major * 10 + minor). The current value of 120 would target a future architecture (SM 12.0) and would fail to apply the bfloat16 restriction on Blackwell, leading to the precision issues this change is trying to solve.
Suggested change:

```diff
-    if current_platform.has_device_capability(120):
+    if current_platform.has_device_capability(100):
```
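For reference, a small sketch of the packed-integer encoding this comment relies on. The helper name is made up; only the `major * 10 + minor` convention comes from the review:

```python
def capability_int(major: int, minor: int) -> int:
    """Pack a CUDA compute capability as major * 10 + minor."""
    return major * 10 + minor

assert capability_int(10, 0) == 100  # Blackwell per the review (SM 10.0)
assert capability_int(12, 0) == 120  # what the original check targets (SM 12.0)
```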
Summary
Fixes `RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight']` for models using tied word embeddings.

Changes
For models like Gemma2 that use `tie_word_embeddings=True`, the `lm_head.weight` is initialized from `embed_tokens` weights rather than loaded separately. This PR adds `lm_head.weight` to `sideload_params` to allow GGUF loading to succeed without requiring this parameter to be mapped.

Root Cause
When `tie_word_embeddings=True`, the `lm_head.weight` is derived from the `embed_tokens` weights, so the GGUF file does not contain a separate `lm_head.weight` tensor for the loader to map.
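A minimal PyTorch illustration of why the tensor is absent (not vLLM code): with tied embeddings, the LM head and the input embedding are the same `Parameter`, so a checkpoint only needs to store it once.

```python
import torch.nn as nn

vocab_size, hidden = 8, 4
embed_tokens = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=False)
lm_head.weight = embed_tokens.weight  # tie: one shared Parameter

assert lm_head.weight is embed_tokens.weight
```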
Testing

Tested with `bartowski/gemma-2-2b-it-GGUF`; the model loads without the parameter mapping error.

Related