fix(gguf): Skip lm_head mapping for models with tied word embeddings #30405
kitaekatt wants to merge 2 commits into
Conversation
Fixes Gemma3 GGUF models failing on Blackwell GPUs with `--dtype auto`.

Problem:
- Gemma3 blocks float16 (numerical instability)
- GGUF on Blackwell blocks bfloat16 (precision issues)
- Only float32 works, but `dtype=auto` picks bfloat16 → fails

Changes:
1. gguf.py: Block bfloat16 on SM 120+ (Blackwell) devices
2. vllm.py: Auto-select a compatible dtype when model and quantization restrictions conflict, instead of failing with an error

This allows `--dtype auto` to work correctly with Gemma3 GGUF on Blackwell by automatically falling back to float32.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
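A minimal sketch of the fallback idea this commit describes. The names `PREFERENCE`, `resolve_auto_dtype`, and the two supported-dtype lists are illustrative assumptions, not vLLM's actual API:

```python
import torch

# Preference order for --dtype auto; illustrative, not vLLM's real logic.
PREFERENCE = [torch.bfloat16, torch.float16, torch.float32]

def resolve_auto_dtype(model_supported: list[torch.dtype],
                       quant_supported: list[torch.dtype]) -> torch.dtype:
    """Pick the first preferred dtype allowed by both the model and the
    quantization method, instead of erroring on the first conflict."""
    for dtype in PREFERENCE:
        if dtype in model_supported and dtype in quant_supported:
            return dtype
    raise ValueError("No dtype satisfies both model and quantization limits")

# Gemma3 blocks float16; GGUF on Blackwell blocks bfloat16.
# Only float32 survives the intersection, so auto falls back to it.
dtype = resolve_auto_dtype(
    model_supported=[torch.bfloat16, torch.float32],
    quant_supported=[torch.float16, torch.float32],
)
assert dtype == torch.float32
```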
For models like Gemma2 that use `tie_word_embeddings=True`, the `lm_head.weight` is initialized from `embed_tokens` weights rather than loaded separately. Add `lm_head.weight` to `sideload_params` to allow GGUF loading to succeed without requiring this parameter to be mapped.

Fixes: `RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight']`

Signed-off-by: Christina <christina@example.com>
Code Review
This pull request introduces several fixes related to GGUF model loading. The primary change, as indicated by the title, is to correctly handle models with tied word embeddings by skipping the mapping of lm_head.weight, which resolves a runtime error. Additionally, the PR includes changes to improve compatibility with newer hardware like NVIDIA's Blackwell GPUs by dynamically resolving dtype conflicts between model and quantization restrictions, and by disabling bfloat16 for GGUF on Blackwell due to precision issues. The changes are well-structured and address important compatibility issues. However, I've identified a critical issue in the device capability check for Blackwell GPUs that would prevent the intended fix from being applied. Please see the specific comment for details.
```python
def get_supported_act_dtypes(self) -> list[torch.dtype]:
    # GGUF dequantization kernels use half precision (fp16) internally.
    # bfloat16 has precision issues on SM 120+ devices (Blackwell).
    if current_platform.has_device_capability(120):
```
The check for device capability 120 seems incorrect for targeting Blackwell devices. Blackwell architecture has a compute capability of 10.0, which corresponds to 100 in this context (major * 10 + minor). The current value of 120 would target a future architecture (SM 12.0) and would fail to apply the bfloat16 restriction on Blackwell, leading to the precision issues this change is trying to solve.
Suggested change:

```diff
-    if current_platform.has_device_capability(120):
+    if current_platform.has_device_capability(100):
```
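For reference, a small sketch of the packed-integer encoding this comment relies on. The helper name is made up; only the `major * 10 + minor` convention comes from the review:

```python
def capability_int(major: int, minor: int) -> int:
    """Pack a CUDA compute capability as major * 10 + minor."""
    return major * 10 + minor

assert capability_int(10, 0) == 100  # Blackwell per the review (SM 10.0)
assert capability_int(12, 0) == 120  # what the original check targets (SM 12.0)
```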
Summary
Fixes `RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight']` for models using tied word embeddings.

Changes
For models like Gemma2 that use `tie_word_embeddings=True`, the `lm_head.weight` is initialized from `embed_tokens` weights rather than loaded separately. This PR adds `lm_head.weight` to `sideload_params` to allow GGUF loading to succeed without requiring this parameter to be mapped.

Root Cause
When `tie_word_embeddings=True`, the `lm_head.weight` is derived from the `embed_tokens` weights, so the GGUF file does not contain a separate `lm_head.weight` tensor for the loader to map.
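A minimal PyTorch illustration of why the tensor is absent (not vLLM code): with tied embeddings, the LM head and the input embedding are the same `Parameter`, so a checkpoint only needs to store it once.

```python
import torch.nn as nn

vocab_size, hidden = 8, 4
embed_tokens = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=False)
lm_head.weight = embed_tokens.weight  # tie: one shared Parameter

assert lm_head.weight is embed_tokens.weight
```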
Testing

Tested with `bartowski/gemma-2-2b-it-GGUF`; the model loads without the parameter mapping error.

Related