fix(gemma2): Skip missing parameters during GGUF weight loading #30421
kitaekatt wants to merge 7 commits into
Conversation
Code Review
This pull request introduces a key fix for loading GGUF Gemma2 models by skipping missing parameters, preventing a KeyError. It also includes several other valuable improvements across the codebase. Notably, it adds memory fences to the shared memory broadcast communicator to prevent race conditions, implements lazy tokenizer initialization to fix semaphore leaks with GGUF models, and enhances dtype compatibility for quantization. My review found one issue where an assertion in the GGUF MoE implementation unnecessarily restricts the supported activation functions, even though the underlying kernel seems to support more options. Overall, this is a strong PR that significantly improves robustness and model compatibility.
    ) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]:
        assert layer.activation == "silu", "Only SiLU activation is supported."
        if layer.apply_router_weight_on_input:
            assert activation == "silu", "Only SiLU activation is supported."
The assertion `assert activation == "silu"` seems overly restrictive. The underlying `_fused_moe_gguf` function appears to support both `"silu"` and `"gelu"` activations. If GELU is indeed supported by the GGUF MoE kernels, this assertion prevents its use. Please either relax the assertion to include `"gelu"` or clarify why it's restricted to `"silu"` despite the kernel's apparent capability.
Suggested change:
-    assert activation == "silu", "Only SiLU activation is supported."
+    assert activation in ("silu", "gelu"), "Only SiLU and GELU activations are supported."
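If the kernels do accept both activations, the dispatch the reviewer has in mind would look roughly like the sketch below. This is a hedged illustration only: `_apply_gated_activation` is a hypothetical helper, not `_fused_moe_gguf`'s real interface.

```python
import torch
import torch.nn.functional as F

# Hypothetical helper illustrating the SiLU/GELU dispatch the review comment
# assumes the GGUF MoE kernels can support; not vLLM's actual code.
def _apply_gated_activation(gate: torch.Tensor, up: torch.Tensor, activation: str) -> torch.Tensor:
    if activation == "silu":
        return F.silu(gate) * up
    if activation == "gelu":
        return F.gelu(gate) * up
    raise ValueError(f"Unsupported activation: {activation!r}")
```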
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 1c144b2 to 5ceb8d4
…ore leak

GGUF models without precomputed merges trigger `build_merges_on_the_fly` in the transformers library, which uses multiprocessing primitives. When this happens in both the APIServer process (for request validation) and the EngineCore subprocess (via StructuredOutputManager), the subprocess leaks a semaphore, causing the server to hang indefinitely.

This change makes tokenizer initialization lazy in StructuredOutputManager:
- Tokenizer is only loaded when `grammar_init()` is first called
- Most inference requests don't use structured output, so the tokenizer in EngineCore is never loaded
- For requests that do use structured output, the tokenizer is loaded on demand

The fix resolves the following symptoms:
- Server hangs after "resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown"
- Tokenizer merges being built twice (once in APIServer, once in EngineCore)
- GGUF models failing to start even though weights load successfully

Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M).

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit a72d1f9)
Signed-off-by: Christina <truffle@gmail.com>
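A minimal sketch of the lazy-initialization pattern described above, assuming a tokenizer factory is passed in; the class and method names mirror the commit message, but the bodies are illustrative rather than vLLM's actual implementation.

```python
from typing import Callable, Optional


class StructuredOutputManager:
    """Sketch: defer tokenizer construction until structured output is used."""

    def __init__(self, tokenizer_factory: Callable[[], object]):
        # The factory is stored, not called, so constructing the manager in the
        # EngineCore subprocess no longer triggers build_merges_on_the_fly.
        self._tokenizer_factory = tokenizer_factory
        self._tokenizer: Optional[object] = None

    @property
    def tokenizer(self) -> object:
        # The first structured-output request pays the initialization cost;
        # requests that never use structured output never build the tokenizer.
        if self._tokenizer is None:
            self._tokenizer = self._tokenizer_factory()
        return self._tokenizer

    def grammar_init(self, request: object) -> object:
        # Tokenizer is loaded on demand, only for structured-output requests.
        tokenizer = self.tokenizer
        # ... grammar compilation using `tokenizer` would follow here ...
        return tokenizer
```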
For models like Gemma2 that use tie_word_embeddings=True, the lm_head.weight is initialized from embed_tokens weights rather than loaded separately. Add lm_head.weight to sideload_params to allow GGUF loading to succeed without requiring this parameter to be mapped.

Fixes: RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight']

Signed-off-by: Christina <christina@example.com>
(cherry picked from commit 9512f74)
Signed-off-by: Christina <truffle@gmail.com>
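A hedged sketch of the mapping check this commit relaxes: `sideload_params` is the name used in the commit message, while `check_gguf_mapping` and its arguments are made up here for illustration.

```python
# Illustrative only: parameters listed in sideload_params are filled elsewhere
# (e.g. lm_head.weight copied from embed_tokens when tie_word_embeddings=True),
# so they must not count as "unmapped" GGUF parameters.
def check_gguf_mapping(model_param_names, gguf_mapped_names, sideload_params):
    missing = [
        name
        for name in model_param_names
        if name not in gguf_mapped_names and name not in sideload_params
    ]
    if missing:
        raise RuntimeError(f"Failed to map GGUF parameters ({len(missing)}): {missing}")


sideload_params = {"lm_head.weight"}  # tied to embed_tokens, never read from the GGUF file
```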
…bility

GGUF-loaded configs may only have hidden_activation from config.json, but Gemma2MLP model code expects the hidden_act attribute. This adds a post-processing step to copy hidden_activation to hidden_act when needed.

Fixes AttributeError: 'Gemma2Config' object has no attribute 'hidden_act' when loading Gemma2 GGUF models.

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit 04ceef5)
Signed-off-by: Christina <truffle@gmail.com>
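The attribute copy itself is small; a hedged sketch is below. `normalize_gemma2_config` is a hypothetical name, and the real change lives in whatever config post-processing hook vLLM already has.

```python
def normalize_gemma2_config(hf_config):
    # GGUF-derived configs may only carry `hidden_activation` (from config.json),
    # while Gemma2MLP reads `hidden_act`; copy it over when it is missing.
    if not hasattr(hf_config, "hidden_act") and hasattr(hf_config, "hidden_activation"):
        hf_config.hidden_act = hf_config.hidden_activation
    return hf_config
```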
Fixes Gemma3 GGUF models failing on Blackwell GPUs with --dtype auto.

Problem:
- Gemma3 blocks float16 (numerical instability)
- GGUF on Blackwell blocks bfloat16 (precision issues)
- Only float32 works, but dtype=auto picks bfloat16 → fails

Changes:
1. gguf.py: Block bfloat16 on SM 120+ (Blackwell) devices
2. vllm.py: Auto-select a compatible dtype when model and quantization restrictions conflict, instead of failing with an error

This allows --dtype auto to work correctly with Gemma3 GGUF on Blackwell by automatically falling back to float32.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
(cherry picked from commit 9b115ed)
Signed-off-by: Christina <truffle@gmail.com>
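A hedged sketch of the dtype resolution described in this commit; the function name, the capability check, and the two "blocked" sets are illustrative, not vLLM's exact logic.

```python
import torch


def resolve_auto_dtype(device_capability: tuple) -> torch.dtype:
    # Restrictions described in the commit: Gemma3 rejects float16, and GGUF on
    # SM 120+ (Blackwell) rejects bfloat16.
    blocked_by_model = {torch.float16}
    blocked_by_quant = {torch.bfloat16} if device_capability >= (12, 0) else set()

    # --dtype auto: pick the first candidate that neither restriction blocks.
    for candidate in (torch.bfloat16, torch.float16, torch.float32):
        if candidate not in blocked_by_model | blocked_by_quant:
            return candidate
    raise ValueError("No dtype satisfies both model and quantization restrictions")


# On Blackwell with a Gemma3 GGUF model this falls through to torch.float32.
```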
…n layers

The NemotronHAttention class was missing rotary positional embeddings (RoPE), causing token generation to fail despite successful model loading.

Root cause:
- NemotronHAttention.__init__() had no rotary_emb initialization
- forward() did not accept a positions parameter or apply RoPE to Q, K
- NemotronHAttentionDecoderLayer.forward() did not pass positions to the mixer

This fix:
1. Imports get_rope from vllm.model_executor.layers.rotary_embedding
2. Adds rotary_emb initialization in NemotronHAttention.__init__()
3. Updates forward() to accept positions and apply q, k = rotary_emb(positions, q, k)
4. Updates NemotronHAttentionDecoderLayer to pass positions to the mixer
5. Adds a rotary_emb.inv_freq filter in load_weights to skip computed weights

Without RoPE, attention layers operate without positional information, producing corrupted attention scores and preventing coherent token generation.

Fixes inference failure for nvidia/NVIDIA-Nemotron-Nano-9B-v2 and similar nemotron_h architecture models.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
(cherry picked from commit 053d5cc)
Signed-off-by: Christina <truffle@gmail.com>
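A condensed sketch of the wiring the commit describes. The `get_rope` import path and the `q, k = rotary_emb(positions, q, k)` call come from the commit message itself; the keyword arguments and the surrounding class are assumptions, not the actual NemotronHAttention code.

```python
import torch
from torch import nn

from vllm.model_executor.layers.rotary_embedding import get_rope  # import named in the commit


class AttentionWithRoPE(nn.Module):
    """Illustrative stand-in for NemotronHAttention after the fix."""

    def __init__(self, head_dim: int, max_position: int, rope_theta: float, dtype: torch.dtype):
        super().__init__()
        # (2) rotary_emb is created once in __init__; argument names are assumed.
        self.rotary_emb = get_rope(
            head_size=head_dim,
            rotary_dim=head_dim,
            max_position=max_position,
            base=rope_theta,
            dtype=dtype,
        )

    def forward(self, positions: torch.Tensor, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        # (3) Apply RoPE to Q and K before attention; without it the attention
        # scores carry no positional information.
        q, k = self.rotary_emb(positions, q, k)
        # ... attention over (q, k, v) would follow; the decoder layer passes
        # `positions` down to the mixer (step 4), and load_weights skips
        # rotary_emb.inv_freq (step 5).
        return q, k, v
```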
Address review feedback: Use model_config.dtype instead of torch.get_default_dtype() to ensure rotary embeddings match the model's actual dtype. Fall back to the default dtype when model_config is None (testing scenarios).

Signed-off-by: Christina <truffle@gmail.com>
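In code form, the fallback described above is a single conditional; `model_config` here stands for whatever config object carries the model dtype, and `rope_dtype` is a hypothetical helper name.

```python
import torch


def rope_dtype(model_config) -> torch.dtype:
    # Prefer the model's actual dtype so rotary embeddings match it; fall back
    # to the process-wide default only when no model_config is available
    # (e.g. in tests).
    return model_config.dtype if model_config is not None else torch.get_default_dtype()
```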
The GGUF loader yields quantization metadata parameters (qweight_type) for all quantized tensors, including embeddings. However, VocabParallelEmbedding doesn't have these parameters, causing a KeyError when loading GGUF Gemma2 models.

This adds a safety check to skip parameters not present in the model, matching the pattern already used in llama.py (lines 502-503).

Fixes KeyError: 'embed_tokens.qweight_type' during engine core init.

Signed-off-by: Christina <truffle@gmail.com>
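A hedged sketch of the guard this commit adds to Gemma2's weight loading, following the llama.py pattern it cites; the loop body is simplified and `_copy_loader` is a stand-in for vLLM's default weight loader.

```python
def _copy_loader(param, loaded_weight):
    # Stand-in for vLLM's default weight loader.
    param.data.copy_(loaded_weight)


def load_weights(named_parameters, weights):
    params_dict = dict(named_parameters)
    for name, loaded_weight in weights:
        # GGUF iterators also yield metadata such as embed_tokens.qweight_type;
        # modules like VocabParallelEmbedding have no matching parameter, so
        # skip instead of raising KeyError.
        if name not in params_dict:
            continue
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", _copy_loader)
        weight_loader(param, loaded_weight)
```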
Force-pushed from 5ceb8d4 to 51b0e65
Summary
The GGUF loader yields quantization metadata parameters (`qweight_type`) for all quantized tensors, including embeddings. However, `VocabParallelEmbedding` doesn't have these parameters, causing a `KeyError` when loading GGUF Gemma2 models.

Root Cause: When loading GGUF models, `gguf_quant_weights_iterator()` yields `qweight_type` metadata for quantized weights. For embeddings like `embed_tokens`, this produces `embed_tokens.qweight_type`, but `VocabParallelEmbedding` doesn't have this parameter in `params_dict`.

Fix: Add a safety check to skip parameters not present in the model, matching the pattern already used in `llama.py` (lines 502-503).

Error Before Fix

Stack trace: `gguf_loader.py:338` → `gemma2.py:436` → `gemma2.py:369` (crash site)

Test Plan

Tested with the `bartowski/gemma-2-2b-it-GGUF` model.
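For reference, loading the test model follows vLLM's usual GGUF flow: pass a local .gguf file as the model and point `tokenizer` at the original checkpoint. The quant filename below is an assumption; use whichever file you downloaded from bartowski/gemma-2-2b-it-GGUF.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="./gemma-2-2b-it-Q4_K_M.gguf",  # local GGUF file (assumed filename)
    tokenizer="google/gemma-2-2b-it",     # original tokenizer repo
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```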