
fix(gguf): Skip lm_head mapping for models with tied word embeddings #30412

Open

kitaekatt wants to merge 2 commits into vllm-project:main from kitaekatt:fix/30405-tied-embeddings

fix(gguf): Skip lm_head mapping for models with tied word embeddings#30412
kitaekatt wants to merge 2 commits into
vllm-project:mainfrom
kitaekatt:fix/30405-tied-embeddings

Conversation

@kitaekatt
Contributor

Summary

Fixes RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight'] for models using tied word embeddings.

Changes

For models like Gemma2 that set tie_word_embeddings=True, add lm_head.weight to sideload_params so that GGUF loading succeeds.

Root Cause

When tie_word_embeddings=True, the model shares weights between the input embeddings and the output projection, so GGUF files don't contain a separate lm_head.weight tensor.
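
A minimal sketch of the idea (illustrative names, not the actual gguf_loader.py diff): when the config ties embeddings, the loader can treat lm_head.weight as a sideloaded parameter instead of trying to map it to a GGUF tensor.

```python
# Illustrative sketch: exempt lm_head.weight from GGUF tensor mapping
# when the model config ties input/output embeddings.
def get_sideload_params(hf_config) -> set[str]:
    sideload: set[str] = set()
    # Tied-embedding models (e.g. Gemma2) ship no separate lm_head tensor
    # in GGUF; the output projection reuses embed_tokens.weight instead.
    if getattr(hf_config, "tie_word_embeddings", False):
        sideload.add("lm_head.weight")
    return sideload
```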

Testing

Tested with bartowski/gemma-2-2b-it-GGUF - model loads without parameter mapping error.
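
For reference, the repro looks roughly like this (the exact GGUF filename is an assumption; vLLM loads GGUF from a local file and takes the tokenizer from the original HF repo):

```python
from vllm import LLM

# Before this fix, loading a tied-embedding GGUF model raised
# "Failed to map GGUF parameters (1): ['lm_head.weight']".
llm = LLM(model="./gemma-2-2b-it-Q4_K_M.gguf",
          tokenizer="google/gemma-2-2b-it")
print(llm.generate("Hello")[0].outputs[0].text)
```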


@gemini-code-assist Bot left a comment


Code Review

This pull request fixes loading of GGUF models with tied word embeddings, such as Gemma2, by correctly skipping the mapping of lm_head.weight. It also proactively addresses data type compatibility on newer hardware such as NVIDIA's Blackwell GPUs: bfloat16 is disallowed with GGUF quantization due to precision issues, and a compatible dtype is selected automatically when conflicts arise. The changes are well implemented and improve the robustness of GGUF model loading. The logic for handling tied embeddings and resolving dtype conflicts is sound. I have no major concerns with this pull request.
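
For the dtype handling the review mentions, a hedged sketch of the fallback logic (the function name is illustrative, not vLLM's actual API):

```python
import torch

def resolve_gguf_dtype(requested: torch.dtype) -> torch.dtype:
    """Pick a dtype compatible with GGUF dequantization."""
    # bfloat16's 8-bit mantissa loses precision when dequantizing GGUF
    # blocks, so fall back to float16, which round-trips more accurately.
    if requested == torch.bfloat16:
        return torch.float16
    return requested
```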

@mergify
Contributor

mergify Bot commented Dec 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kitaekatt.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify Bot added the needs-rebase label Dec 15, 2025
@kitaekatt force-pushed the fix/30405-tied-embeddings branch from a195d52 to a215a08 on December 29, 2025 20:42
@mergify Bot removed the needs-rebase label Dec 29, 2025
@kitaekatt force-pushed the fix/30405-tied-embeddings branch from a215a08 to 7146180 on January 19, 2026 17:27
@kitaekatt
Contributor Author

Testing performed:

Tested with GGUF models on RTX 5090 (32GB, Blackwell architecture):

  • bartowski/NousResearch_Hermes-4-14B-GGUF
  • tensorblock/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF

Ran models through a local benchmark runner for HumanEval and GSM8K. Verified the semaphore leak fix by running repeated model load/unload cycles without resource exhaustion.
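
The load/unload cycling looked roughly like this sketch (the model path, tokenizer repo, and cycle count are placeholders):

```python
import gc
import torch
from vllm import LLM

for _ in range(10):
    llm = LLM(model="./Hermes-4-14B-Q4_K_M.gguf",
              tokenizer="NousResearch/Hermes-4-14B")
    del llm
    gc.collect()
    torch.cuda.empty_cache()  # release cached GPU allocations each cycle
```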

Related PRs:

This is part of a series of GGUF pipeline fixes for Blackwell GPU compatibility:

For models like Gemma2 that use tie_word_embeddings=True, the lm_head.weight
is initialized from embed_tokens weights rather than loaded separately.
Add lm_head.weight to sideload_params to allow GGUF loading to succeed
without requiring this parameter to be mapped.

Fixes: RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight']

Signed-off-by: Christina <christina@example.com>
Signed-off-by: Christina <truffle@gmail.com>
@kitaekatt
Contributor Author

Validation Results

| vLLM | transformers | Cherry-picked PRs | HumanEval | IFEval |
|------|--------------|-------------------|-----------|--------|
| HEAD | 5.x | #30410, #30411, #30412, #30413, #30424, #30434, #30699, #30702, #31464, #33846 | gem2-2b-gguf (42.1%), gemma3-1b (26.8%) | gem2-2b-gguf (65.6%) |
| HEAD | 4.x | #30410, #30411, #30412, #30413, #30424, #30434, #30699, #30702, #31464, #33846 | q3-moe-gguf (83.5%) | q3-moe-gguf (85.4%) |

Tested on RTX 5090 (Blackwell, SM 120) with all listed PRs cherry-picked together; models listed under each benchmark passed that benchmark in the given environment, while the same models crash or fail without these PRs applied.

Rebased to current upstream HEAD and re-validated on RTX 5090 (Blackwell, SM 120). Fix confirmed still necessary — gem2-2b-gguf and gemma3-1b (both with tied embeddings) crash without it.

@kitaekatt
Contributor Author

Hi @22quinn @Isotr0py, this has been sitting without review for a while. I just rebased on latest upstream/main (merge commit), so the branch should now be mergeable. Quick summary: a small fix in gguf_loader.py that skips the lm_head mapping for models with tied word embeddings, which previously failed when the mapping was attempted.

Would appreciate a look when you have cycles. Thanks!

