
fix(gemma2): Skip missing parameters during GGUF weight loading#30421

Closed
kitaekatt wants to merge 7 commits into vllm-project:main from kitaekatt:fix/gemma2-gguf-qweight-type

Conversation

@kitaekatt
Contributor

Summary

The GGUF loader yields quantization metadata parameters (qweight_type) for all quantized tensors, including embeddings. However, VocabParallelEmbedding doesn't have these parameters, causing a KeyError when loading GGUF Gemma2 models.

Root Cause: When loading GGUF models, gguf_quant_weights_iterator() yields qweight_type metadata for quantized weights. For embeddings like embed_tokens, this produces embed_tokens.qweight_type, but VocabParallelEmbedding doesn't have this parameter in params_dict.

Fix: Add a safety check to skip parameters not present in the model, matching the pattern already used in llama.py (lines 502-503).
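The guard described above can be sketched as follows. This is a minimal illustration of the skip pattern, not vLLM's actual `load_weights` (which dispatches to per-parameter weight loaders); names are modeled on the llama.py version:

```python
# Illustrative sketch of the guarded loading loop. `params_dict` maps
# parameter names the model actually declares; `weights` is the stream
# of (name, tensor) pairs from the GGUF iterator.
def load_weights(params_dict, weights):
    loaded = set()
    for name, tensor in weights:
        if name not in params_dict:
            # GGUF also yields metadata such as `embed_tokens.qweight_type`
            # that VocabParallelEmbedding does not declare; skip it instead
            # of raising KeyError.
            continue
        params_dict[name] = tensor  # stand-in for param.weight_loader(...)
        loaded.add(name)
    return loaded
```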

Error Before Fix

KeyError: 'embed_tokens.qweight_type'

Stack trace:

  • gguf_loader.py:338
  • gemma2.py:436
  • gemma2.py:369 (crash site)

Test Plan

  • Test with bartowski/gemma-2-2b-it-GGUF model
  • Verify model loads and generates tokens
  • Ensure no regression for non-GGUF Gemma2 models


@gemini-code-assist Bot left a comment


Code Review

This pull request introduces a key fix for loading GGUF Gemma2 models by skipping missing parameters, preventing a KeyError. It also includes several other valuable improvements across the codebase. Notably, it adds memory fences to the shared memory broadcast communicator to prevent race conditions, implements lazy tokenizer initialization to fix semaphore leaks with GGUF models, and enhances dtype compatibility for quantization. My review found one issue where an assertion in the GGUF MoE implementation unnecessarily restricts the supported activation functions, even though the underlying kernel seems to support more options. Overall, this is a strong PR that significantly improves robustness and model compatibility.

  ) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]:
  assert layer.activation == "silu", "Only SiLU activation is supported."
  if layer.apply_router_weight_on_input:
  assert activation == "silu", "Only SiLU activation is supported."

Severity: high

The assertion assert activation == "silu" seems overly restrictive. The underlying _fused_moe_gguf function appears to support both "silu" and "gelu" activations. If GELU is indeed supported by the GGUF MoE kernels, this assertion prevents its use. Please either relax the assertion to include "gelu" or clarify why it's restricted to "silu" despite the kernel's apparent capability.

Suggested change:

  - assert activation == "silu", "Only SiLU activation is supported."
  + assert activation in ("silu", "gelu"), "Only SiLU and GELU activations are supported."

@mergify

mergify Bot commented Dec 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kitaekatt.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify Bot added the needs-rebase label Dec 13, 2025
@kitaekatt force-pushed the fix/gemma2-gguf-qweight-type branch from 1c144b2 to 5ceb8d4 on December 15, 2025 15:40
kitaekatt and others added 7 commits December 15, 2025 09:40
…ore leak

GGUF models without precomputed merges trigger `build_merges_on_the_fly`
in the transformers library, which uses multiprocessing primitives.
When this happens in both the APIServer process (for request validation)
and the EngineCore subprocess (via StructuredOutputManager), the
subprocess leaks a semaphore, causing the server to hang indefinitely.

This change makes tokenizer initialization lazy in StructuredOutputManager:
- Tokenizer is only loaded when grammar_init() is first called
- Most inference requests don't use structured output, so the tokenizer
  in EngineCore is never loaded
- For requests that do use structured output, tokenizer is loaded on-demand

The fix resolves the following symptoms:
- Server hangs after "resource_tracker: There appear to be 1 leaked
  semaphore objects to clean up at shutdown"
- Tokenizer merges being built twice (once in APIServer, once in EngineCore)
- GGUF models failing to start even though weights load successfully

Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M).

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit a72d1f9)
Signed-off-by: Christina <truffle@gmail.com>
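The lazy-initialization change in this commit can be sketched as follows. This is a minimal illustration assuming a `tokenizer_factory` callable; the real StructuredOutputManager constructor and API differ:

```python
# Minimal sketch of lazy tokenizer initialization. Construction stores
# only the factory; the tokenizer itself is built on first use.
class StructuredOutputManager:
    def __init__(self, tokenizer_factory):
        self._tokenizer_factory = tokenizer_factory
        self._tokenizer = None  # nothing loaded at construction time

    def grammar_init(self):
        # Built on the first structured-output request only, so plain
        # inference in the EngineCore subprocess never triggers the
        # multiprocessing-backed merge construction.
        if self._tokenizer is None:
            self._tokenizer = self._tokenizer_factory()
        return self._tokenizer
```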
For models like Gemma2 that use tie_word_embeddings=True, the lm_head.weight
is initialized from embed_tokens weights rather than loaded separately.
Add lm_head.weight to sideload_params to allow GGUF loading to succeed
without requiring this parameter to be mapped.

Fixes: RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight']

Signed-off-by: Christina <christina@example.com>
(cherry picked from commit 9512f74)
Signed-off-by: Christina <truffle@gmail.com>
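The sideload check in this commit can be sketched as follows. Function and variable names here are illustrative, not vLLM's actual loader internals:

```python
# Hypothetical sketch: for tie_word_embeddings models, lm_head.weight is
# derived from embed_tokens rather than read from the GGUF file, so it is
# treated as "sideloaded" and excluded from the mapping check.
def check_gguf_mapping(model_params, gguf_params, tie_word_embeddings):
    sideload_params = set()
    if tie_word_embeddings:
        sideload_params.add("lm_head.weight")
    missing = set(model_params) - set(gguf_params) - sideload_params
    if missing:
        raise RuntimeError(
            f"Failed to map GGUF parameters ({len(missing)}): {sorted(missing)}"
        )
    return sideload_params
```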
…bility

GGUF-loaded configs may only have hidden_activation from config.json,
but Gemma2MLP model code expects hidden_act attribute. This adds a
post-processing step to copy hidden_activation to hidden_act when needed.

Fixes AttributeError: 'Gemma2Config' object has no attribute 'hidden_act'
when loading Gemma2 GGUF models.

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit 04ceef5)
Signed-off-by: Christina <truffle@gmail.com>
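The config post-processing in this commit amounts to a one-line attribute alias; a minimal sketch, with `normalize_gemma2_config` as an illustrative name:

```python
from types import SimpleNamespace

# GGUF-derived configs may expose only `hidden_activation` (from
# config.json); Gemma2MLP reads `hidden_act`, so mirror it over when absent.
def normalize_gemma2_config(config):
    if not hasattr(config, "hidden_act") and hasattr(config, "hidden_activation"):
        config.hidden_act = config.hidden_activation
    return config
```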
Fixes Gemma3 GGUF models failing on Blackwell GPUs with --dtype auto.

Problem:
- Gemma3 blocks float16 (numerical instability)
- GGUF on Blackwell blocks bfloat16 (precision issues)
- Only float32 works, but dtype=auto picks bfloat16 → fails

Changes:
1. gguf.py: Block bfloat16 on SM 120+ (Blackwell) devices
2. vllm.py: Auto-select compatible dtype when model and quantization
   restrictions conflict, instead of failing with an error

This allows --dtype auto to work correctly with Gemma3 GGUF on Blackwell
by automatically falling back to float32.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
(cherry picked from commit 9b115ed)
Signed-off-by: Christina <truffle@gmail.com>
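The fallback logic in this commit can be sketched with dtype names as strings; vLLM's real resolution consults torch dtypes and platform capability checks, so this is only an illustration of the selection order:

```python
# Pick the first preferred dtype not blocked by either the model or the
# quantization method. Under the constraints described above (model
# blocks float16, GGUF-on-Blackwell blocks bfloat16), "auto" lands on
# float32 instead of failing.
def resolve_auto_dtype(blocked_by_model, blocked_by_quant):
    blocked = set(blocked_by_model) | set(blocked_by_quant)
    for candidate in ("bfloat16", "float16", "float32"):
        if candidate not in blocked:
            return candidate
    raise ValueError("no dtype satisfies both model and quantization limits")
```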
…n layers

The NemotronHAttention class was missing rotary positional embeddings (RoPE),
causing token generation to fail despite successful model loading.

Root cause:
- NemotronHAttention.__init__() had no rotary_emb initialization
- forward() did not accept positions parameter or apply RoPE to Q, K
- NemotronHAttentionDecoderLayer.forward() did not pass positions to mixer

This fix:
1. Imports get_rope from vllm.model_executor.layers.rotary_embedding
2. Adds rotary_emb initialization in NemotronHAttention.__init__()
3. Updates forward() to accept positions and apply q, k = rotary_emb(positions, q, k)
4. Updates NemotronHAttentionDecoderLayer to pass positions to mixer
5. Adds rotary_emb.inv_freq filter in load_weights to skip computed weights

Without RoPE, attention layers operate without positional information, producing
corrupted attention scores and preventing coherent token generation.

Fixes inference failure for nvidia/NVIDIA-Nemotron-Nano-9B-v2 and similar
nemotron_h architecture models.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
(cherry picked from commit 053d5cc)
Signed-off-by: Christina <truffle@gmail.com>
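Steps 2-4 of the RoPE wiring described in this commit can be sketched structurally; all signatures here are illustrative, and the projection, rotary embedding, and attention are passed in as callables to keep the sketch self-contained:

```python
# Structural sketch of the fix: rotary_emb is initialized in __init__,
# forward accepts positions, and q/k are rotated before attention.
class NemotronHAttentionSketch:
    def __init__(self, get_rope, qkv_proj, attn):
        self.rotary_emb = get_rope()  # previously missing initialization
        self.qkv_proj = qkv_proj
        self.attn = attn

    def forward(self, positions, hidden_states):
        q, k, v = self.qkv_proj(hidden_states)
        # previously missing: apply RoPE so attention sees positions
        q, k = self.rotary_emb(positions, q, k)
        return self.attn(q, k, v)
```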
Address review feedback: Use model_config.dtype instead of
torch.get_default_dtype() to ensure rotary embeddings match
the model's actual dtype. Fall back to default dtype when
model_config is None (testing scenarios).

Signed-off-by: Christina <truffle@gmail.com>
The GGUF loader yields quantization metadata parameters (qweight_type)
for all quantized tensors, including embeddings. However,
VocabParallelEmbedding doesn't have these parameters, causing a
KeyError when loading GGUF Gemma2 models.

This adds a safety check to skip parameters not present in the model,
matching the pattern already used in llama.py (lines 502-503).

Fixes KeyError: 'embed_tokens.qweight_type' during engine core init.

Signed-off-by: Christina <truffle@gmail.com>
@kitaekatt force-pushed the fix/gemma2-gguf-qweight-type branch from 5ceb8d4 to 51b0e65 on December 15, 2025 15:40
@mergify Bot removed the needs-rebase label Dec 15, 2025
@kitaekatt
Contributor Author

Closing in favor of focused PRs. The unique fix from this PR has been moved to #30699. Other fixes already have their own PRs: #30409, #30410, #30411, #30412, #30413.

@kitaekatt closed this Dec 15, 2025
@kitaekatt deleted the fix/gemma2-gguf-qweight-type branch December 15, 2025 15:45