fix(gguf): Skip lm_head mapping for models with tied word embeddings #30412

kitaekatt wants to merge 2 commits
Conversation
Code Review
This pull request introduces a fix for loading GGUF models with tied word embeddings, such as Gemma2, by correctly skipping the mapping of lm_head.weight. Additionally, it proactively addresses potential data type compatibility issues on newer hardware like NVIDIA's Blackwell GPUs by preventing the use of bfloat16 with GGUF quantization due to precision issues and implementing a mechanism to automatically select a compatible dtype when conflicts arise. The changes are well-implemented and improve the robustness of GGUF model loading. The logic for handling tied embeddings and resolving dtype conflicts is sound. I have no major concerns with this pull request.
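To make the dtype-conflict handling concrete, here is a minimal sketch of such a fallback; the function name and the exact fallback policy are assumptions for illustration, not the PR's actual code:

```python
import torch

def resolve_gguf_dtype(requested: torch.dtype) -> torch.dtype:
    """Hypothetical helper: pick a dtype compatible with GGUF dequantization.

    Assumption for illustration: bfloat16 is rejected for GGUF weights
    because its 7 mantissa bits (vs. float16's 10) lose precision on
    dequantized values, so we fall back to float16.
    """
    if requested == torch.bfloat16:
        return torch.float16
    return requested

# Usage: requesting bfloat16 resolves to a compatible dtype automatically.
assert resolve_gguf_dtype(torch.bfloat16) == torch.float16
assert resolve_gguf_dtype(torch.float32) == torch.float32
```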
This pull request has merge conflicts that must be resolved before it can be merged.
Testing performed: Tested with GGUF models on an RTX 5090 (32GB, Blackwell architecture). Ran the models through a local benchmark runner on HumanEval and GSM8K, and verified the semaphore leak fix by running repeated model load/unload cycles without resource exhaustion (a rough sketch of that check follows this comment).

Related PRs: This is part of a series of GGUF pipeline fixes for Blackwell GPU compatibility.
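A hedged sketch of the load/unload stress check, assuming a Linux host; `load_model` is a hypothetical stand-in for whatever the benchmark runner actually calls, and the file-descriptor count is only a cheap proxy for leaked OS resources:

```python
import gc
import os

def open_fd_count() -> int:
    # Linux-only: number of file descriptors currently held by this process.
    return len(os.listdir("/proc/self/fd"))

baseline = open_fd_count()
for _ in range(20):
    model = load_model("gemma-2-2b-it-GGUF")  # hypothetical loader stand-in
    del model
    gc.collect()

# A steadily growing fd count across cycles hints at leaked resources.
assert open_fd_count() <= baseline + 2, "possible resource leak"
```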
For models like Gemma2 that use tie_word_embeddings=True, the lm_head.weight is initialized from embed_tokens weights rather than loaded separately. Add lm_head.weight to sideload_params to allow GGUF loading to succeed without requiring this parameter to be mapped.

Fixes: RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight']

Signed-off-by: Christina <christina@example.com>
Signed-off-by: Christina <truffle@gmail.com>
Validation Results
Tested on an RTX 5090 (Blackwell, SM 120) with all listed PRs cherry-picked together; the models listed under each benchmark passed that benchmark in the given environment, while the same models crash or fail without these PRs applied. Rebased to current upstream HEAD and re-validated on the same RTX 5090. The fix is confirmed still necessary: gem2-2b-gguf and gemma3-1b (both with tied embeddings) crash without it.
Hi @22quinn @Isotr0py, this has been sitting without review for a while. Just rebased to current upstream HEAD. Would appreciate a look when you have cycles. Thanks!
Summary
Fixes `RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight']` for models using tied word embeddings.

Changes
For models like Gemma2 that use `tie_word_embeddings=True`, add `lm_head.weight` to `sideload_params` to allow GGUF loading to succeed. (A rough sketch of the idea follows.)
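A simplified stand-in for the mapping step, showing where `sideload_params` enters; the tensor names are toy values, and this is not the project's actual loader code:

```python
# Parameters listed in sideload_params are initialized elsewhere (for
# Gemma2, lm_head.weight is tied to embed_tokens), so their absence from
# the GGUF file must not count as a mapping failure.
sideload_params = {"lm_head.weight"}

expected_params = {"model.embed_tokens.weight", "lm_head.weight"}
gguf_tensors = {"model.embed_tokens.weight"}  # no lm_head tensor on disk

unmapped = [
    name
    for name in sorted(expected_params)
    if name not in gguf_tensors and name not in sideload_params
]
if unmapped:
    raise RuntimeError(
        f"Failed to map GGUF parameters ({len(unmapped)}): {unmapped}"
    )
print("all GGUF parameters mapped or sideloaded")
```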
Root Cause

When `tie_word_embeddings=True`, the model shares weights between the input embeddings and the output projection, so GGUF files don't contain a separate `lm_head.weight` tensor. (The snippet below illustrates the tying.)
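A minimal PyTorch sketch of the tying; the sizes are illustrative (loosely Gemma2-2B-like), not taken from the PR:

```python
import torch.nn as nn

vocab_size, hidden_size = 256_000, 2_304  # illustrative sizes

embed_tokens = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# tie_word_embeddings=True: the output projection reuses the embedding
# matrix, so checkpoints never serialize a separate lm_head.weight.
lm_head.weight = embed_tokens.weight
assert lm_head.weight.data_ptr() == embed_tokens.weight.data_ptr()
```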
Testing

Tested with `bartowski/gemma-2-2b-it-GGUF`; the model loads without the parameter mapping error.
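If the project under test is vLLM (an inference from the reviewers tagged above, so treat it as an assumption), a minimal load check could look like the following; the GGUF filename is a placeholder for whichever quantization was downloaded:

```python
from vllm import LLM

# GGUF loading in vLLM takes the local .gguf file as the model and the
# original HF repo for the tokenizer.
llm = LLM(
    model="./gemma-2-2b-it-Q4_K_M.gguf",  # placeholder local path
    tokenizer="google/gemma-2-2b-it",
)
out = llm.generate("Why is the sky blue?")
print(out[0].outputs[0].text)
```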