fix(gguf): Skip lm_head mapping for models with tied word embeddings #30405

Closed

kitaekatt wants to merge 2 commits into vllm-project:main from kitaekatt:fix-gguf-tied-embeddings

Conversation

@kitaekatt (Contributor)

Summary

Fixes RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight'] for models using tied word embeddings.

Changes

For models like Gemma2 that use tie_word_embeddings=True, the lm_head.weight is initialized from embed_tokens weights rather than loaded separately. This PR adds lm_head.weight to sideload_params to allow GGUF loading to succeed without requiring this parameter to be mapped.
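A minimal sketch of the idea, assuming a loader-side set named sideload_params as the PR description says (the exact hook in vLLM's GGUF loader may differ):

```python
# Hedged sketch: `sideload_params` and the config access follow the PR
# description, not a verified vLLM code path.
if getattr(hf_config, "tie_word_embeddings", False):
    # lm_head.weight is materialized from embed_tokens at init time,
    # so the GGUF file legitimately has no tensor to map onto it.
    sideload_params.add("lm_head.weight")
```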

Root Cause

When tie_word_embeddings=True:

  • Model shares weights between input embeddings and output projection
  • GGUF files don't contain separate lm_head.weight tensor
  • vLLM's GGUF loader expects to map all parameters or fail
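For illustration, a minimal self-contained example of weight tying (a toy module, not vLLM code):

```python
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy model showing why tied checkpoints carry no lm_head tensor."""

    def __init__(self, vocab_size: int = 256, hidden: int = 32):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.lm_head = nn.Linear(hidden, vocab_size, bias=False)
        # With tie_word_embeddings=True, the output projection reuses the
        # input embedding matrix, so exporters (including GGUF) store only
        # embed_tokens.weight.
        self.lm_head.weight = self.embed_tokens.weight
```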

Testing

Tested with bartowski/gemma-2-2b-it-GGUF - model loads without parameter mapping error.
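A hedged reproduction sketch (the local filename is illustrative; any quant from that repo should hit the same code path):

```python
from vllm import LLM

# Before this fix, loading a tied-embedding Gemma2 GGUF raised
# "Failed to map GGUF parameters (1): ['lm_head.weight']".
llm = LLM(
    model="./gemma-2-2b-it-Q4_K_M.gguf",  # from bartowski/gemma-2-2b-it-GGUF
    tokenizer="google/gemma-2-2b-it",     # GGUF files ship no HF tokenizer
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```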

Related

kitaekatt and others added 2 commits December 9, 2025 18:45
Fixes Gemma3 GGUF models failing on Blackwell GPUs with --dtype auto.

Problem:
- Gemma3 blocks float16 (numerical instability)
- GGUF on Blackwell blocks bfloat16 (precision issues)
- Only float32 works, but dtype=auto picks bfloat16 → fails

Changes:
1. gguf.py: Block bfloat16 on SM 120+ (Blackwell) devices
2. vllm.py: Auto-select compatible dtype when model and quantization
   restrictions conflict, instead of failing with an error

This allows --dtype auto to work correctly with Gemma3 GGUF on Blackwell
by automatically falling back to float32.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
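A hedged sketch of the dtype-resolution behavior the first commit describes (hypothetical helper; vLLM's actual config logic is structured differently):

```python
import torch

def resolve_auto_dtype(model_ok: set[torch.dtype],
                       quant_ok: set[torch.dtype]) -> torch.dtype:
    """Pick a dtype that both the model and the quantization method allow."""
    common = model_ok & quant_ok
    if not common:
        raise ValueError("no dtype satisfies both restrictions")
    # Preference mirrors --dtype auto: bf16 first, then fp16, then fp32.
    for dtype in (torch.bfloat16, torch.float16, torch.float32):
        if dtype in common:
            return dtype
    return common.pop()

# Gemma3 blocks fp16; GGUF on Blackwell blocks bf16 -> fp32 is selected.
assert resolve_auto_dtype(
    {torch.bfloat16, torch.float32},  # model restriction (Gemma3)
    {torch.float16, torch.float32},   # quant restriction (GGUF on Blackwell)
) is torch.float32
```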
For models like Gemma2 that use tie_word_embeddings=True, the lm_head.weight
is initialized from embed_tokens weights rather than loaded separately.
Add lm_head.weight to sideload_params to allow GGUF loading to succeed
without requiring this parameter to be mapped.

Fixes: RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight']

Signed-off-by: Christina <christina@example.com>
@kitaekatt kitaekatt closed this Dec 10, 2025
@kitaekatt kitaekatt deleted the fix-gguf-tied-embeddings branch December 10, 2025 17:41

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces several fixes related to GGUF model loading. The primary change, as indicated by the title, is to correctly handle models with tied word embeddings by skipping the mapping of lm_head.weight, which resolves a runtime error. Additionally, the PR includes changes to improve compatibility with newer hardware like NVIDIA's Blackwell GPUs by dynamically resolving dtype conflicts between model and quantization restrictions, and by disabling bfloat16 for GGUF on Blackwell due to precision issues. The changes are well-structured and address important compatibility issues. However, I've identified a critical issue in the device capability check for Blackwell GPUs that would prevent the intended fix from being applied. Please see the specific comment for details.

def get_supported_act_dtypes(self) -> list[torch.dtype]:
    # GGUF dequantization kernels use half precision (fp16) internally.
    # bfloat16 has precision issues on SM 120+ devices (Blackwell).
    if current_platform.has_device_capability(120):
critical

The check for device capability 120 looks too narrow for targeting Blackwell devices. Datacenter Blackwell (e.g. B200) has compute capability 10.0, which corresponds to 100 in this encoding (major * 10 + minor); 120 matches only SM 12.0, i.e. consumer Blackwell (RTX 50-series). As written, the bfloat16 restriction would not be applied on datacenter Blackwell, leading to the precision issues this change is trying to solve.

Suggested change
if current_platform.has_device_capability(120):
if current_platform.has_device_capability(100):
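For reference, the value the check compares against is major * 10 + minor, which can be inspected with standard PyTorch (the example devices are assumptions about common SKUs):

```python
import torch

major, minor = torch.cuda.get_device_capability()
print(major * 10 + minor)  # 90 on H100 (SM 9.0), 100 on B200 (SM 10.0),
                           # 120 on RTX 5090 (SM 12.0)
```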

@mergify (bot) commented Dec 10, 2025

⚠️ The sha of the head commit of this PR conflicts with #30412. Mergify cannot evaluate rules on this PR. ⚠️
