fix(gemma2): Skip missing parameters during GGUF weight loading #30421
kitaekatt wants to merge 7 commits into
Conversation
Code Review
This pull request introduces a key fix for loading GGUF Gemma2 models by skipping missing parameters, preventing a KeyError. It also includes several other valuable improvements across the codebase. Notably, it adds memory fences to the shared memory broadcast communicator to prevent race conditions, implements lazy tokenizer initialization to fix semaphore leaks with GGUF models, and enhances dtype compatibility for quantization. My review found one issue where an assertion in the GGUF MoE implementation unnecessarily restricts the supported activation functions, even though the underlying kernel seems to support more options. Overall, this is a strong PR that significantly improves robustness and model compatibility.
    ) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]:
        assert layer.activation == "silu", "Only SiLU activation is supported."
        if layer.apply_router_weight_on_input:
            assert activation == "silu", "Only SiLU activation is supported."
The assertion `assert activation == "silu"` seems overly restrictive. The underlying `_fused_moe_gguf` function appears to support both `"silu"` and `"gelu"` activations. If GELU is indeed supported by the GGUF MoE kernels, this assertion prevents its use. Please either relax the assertion to include `"gelu"` or clarify why it's restricted to `"silu"` despite the kernel's apparent capability.
Suggested change:
-    assert activation == "silu", "Only SiLU activation is supported."
+    assert activation in ("silu", "gelu"), "Only SiLU and GELU activations are supported."
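If the kernels do accept both activations, the dispatch the reviewer has in mind would look roughly like the sketch below. This is a hedged illustration only: `_apply_gated_activation` is a hypothetical helper, not `_fused_moe_gguf`'s real interface.

```python
import torch
import torch.nn.functional as F

# Hypothetical helper illustrating the SiLU/GELU dispatch the review comment
# assumes the GGUF MoE kernels can support; not vLLM's actual code.
def _apply_gated_activation(gate: torch.Tensor, up: torch.Tensor, activation: str) -> torch.Tensor:
    if activation == "silu":
        return F.silu(gate) * up
    if activation == "gelu":
        return F.gelu(gate) * up
    raise ValueError(f"Unsupported activation: {activation!r}")
```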
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 1c144b2 to 5ceb8d4
…ore leak

GGUF models without precomputed merges trigger `build_merges_on_the_fly` in the transformers library, which uses multiprocessing primitives. When this happens in both the APIServer process (for request validation) and the EngineCore subprocess (via StructuredOutputManager), the subprocess leaks a semaphore, causing the server to hang indefinitely.

This change makes tokenizer initialization lazy in StructuredOutputManager:
- Tokenizer is only loaded when `grammar_init()` is first called
- Most inference requests don't use structured output, so the tokenizer in EngineCore is never loaded
- For requests that do use structured output, the tokenizer is loaded on demand

The fix resolves the following symptoms:
- Server hangs after "resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown"
- Tokenizer merges being built twice (once in APIServer, once in EngineCore)
- GGUF models failing to start even though weights load successfully

Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M).

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit a72d1f9)
Signed-off-by: Christina <truffle@gmail.com>
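A minimal sketch of the lazy-initialization pattern described above, assuming a tokenizer factory is passed in; the class and method names mirror the commit message, but the bodies are illustrative rather than vLLM's actual implementation.

```python
from typing import Callable, Optional


class StructuredOutputManager:
    """Sketch: defer tokenizer construction until structured output is used."""

    def __init__(self, tokenizer_factory: Callable[[], object]):
        # The factory is stored, not called, so constructing the manager in the
        # EngineCore subprocess no longer triggers build_merges_on_the_fly.
        self._tokenizer_factory = tokenizer_factory
        self._tokenizer: Optional[object] = None

    @property
    def tokenizer(self) -> object:
        # The first structured-output request pays the initialization cost;
        # requests that never use structured output never build the tokenizer.
        if self._tokenizer is None:
            self._tokenizer = self._tokenizer_factory()
        return self._tokenizer

    def grammar_init(self, request: object) -> object:
        # Tokenizer is loaded on demand, only for structured-output requests.
        tokenizer = self.tokenizer
        # ... grammar compilation using `tokenizer` would follow here ...
        return tokenizer
```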
For models like Gemma2 that use tie_word_embeddings=True, the lm_head.weight is initialized from embed_tokens weights rather than loaded separately. Add lm_head.weight to sideload_params to allow GGUF loading to succeed without requiring this parameter to be mapped.

Fixes: RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight']

Signed-off-by: Christina <christina@example.com>
(cherry picked from commit 9512f74)
Signed-off-by: Christina <truffle@gmail.com>
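A hedged sketch of the mapping check this commit relaxes: `sideload_params` is the name used in the commit message, while `check_gguf_mapping` and its arguments are made up here for illustration.

```python
# Illustrative only: parameters listed in sideload_params are filled elsewhere
# (e.g. lm_head.weight copied from embed_tokens when tie_word_embeddings=True),
# so they must not count as "unmapped" GGUF parameters.
def check_gguf_mapping(model_param_names, gguf_mapped_names, sideload_params):
    missing = [
        name
        for name in model_param_names
        if name not in gguf_mapped_names and name not in sideload_params
    ]
    if missing:
        raise RuntimeError(f"Failed to map GGUF parameters ({len(missing)}): {missing}")


sideload_params = {"lm_head.weight"}  # tied to embed_tokens, never read from the GGUF file
```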
…bility

GGUF-loaded configs may only have hidden_activation from config.json, but Gemma2MLP model code expects the hidden_act attribute. This adds a post-processing step to copy hidden_activation to hidden_act when needed.

Fixes AttributeError: 'Gemma2Config' object has no attribute 'hidden_act' when loading Gemma2 GGUF models.

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit 04ceef5)
Signed-off-by: Christina <truffle@gmail.com>
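The attribute copy itself is small; a hedged sketch is below. `normalize_gemma2_config` is a hypothetical name, and the real change lives in whatever config post-processing hook vLLM already has.

```python
def normalize_gemma2_config(hf_config):
    # GGUF-derived configs may only carry `hidden_activation` (from config.json),
    # while Gemma2MLP reads `hidden_act`; copy it over when it is missing.
    if not hasattr(hf_config, "hidden_act") and hasattr(hf_config, "hidden_activation"):
        hf_config.hidden_act = hf_config.hidden_activation
    return hf_config
```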
Fixes Gemma3 GGUF models failing on Blackwell GPUs with --dtype auto.

Problem:
- Gemma3 blocks float16 (numerical instability)
- GGUF on Blackwell blocks bfloat16 (precision issues)
- Only float32 works, but dtype=auto picks bfloat16 → fails

Changes:
1. gguf.py: Block bfloat16 on SM 120+ (Blackwell) devices
2. vllm.py: Auto-select a compatible dtype when model and quantization restrictions conflict, instead of failing with an error

This allows --dtype auto to work correctly with Gemma3 GGUF on Blackwell by automatically falling back to float32.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
(cherry picked from commit 9b115ed)
Signed-off-by: Christina <truffle@gmail.com>
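A hedged sketch of the dtype resolution described in this commit; the function name, the capability check, and the two "blocked" sets are illustrative, not vLLM's exact logic.

```python
import torch


def resolve_auto_dtype(device_capability: tuple) -> torch.dtype:
    # Restrictions described in the commit: Gemma3 rejects float16, and GGUF on
    # SM 120+ (Blackwell) rejects bfloat16.
    blocked_by_model = {torch.float16}
    blocked_by_quant = {torch.bfloat16} if device_capability >= (12, 0) else set()

    # --dtype auto: pick the first candidate that neither restriction blocks.
    for candidate in (torch.bfloat16, torch.float16, torch.float32):
        if candidate not in blocked_by_model | blocked_by_quant:
            return candidate
    raise ValueError("No dtype satisfies both model and quantization restrictions")


# On Blackwell with a Gemma3 GGUF model this falls through to torch.float32.
```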
…n layers

The NemotronHAttention class was missing rotary positional embeddings (RoPE), causing token generation to fail despite successful model loading.

Root cause:
- NemotronHAttention.__init__() had no rotary_emb initialization
- forward() did not accept a positions parameter or apply RoPE to Q, K
- NemotronHAttentionDecoderLayer.forward() did not pass positions to the mixer

This fix:
1. Imports get_rope from vllm.model_executor.layers.rotary_embedding
2. Adds rotary_emb initialization in NemotronHAttention.__init__()
3. Updates forward() to accept positions and apply q, k = rotary_emb(positions, q, k)
4. Updates NemotronHAttentionDecoderLayer to pass positions to the mixer
5. Adds a rotary_emb.inv_freq filter in load_weights to skip computed weights

Without RoPE, attention layers operate without positional information, producing corrupted attention scores and preventing coherent token generation.

Fixes inference failure for nvidia/NVIDIA-Nemotron-Nano-9B-v2 and similar nemotron_h architecture models.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
(cherry picked from commit 053d5cc)
Signed-off-by: Christina <truffle@gmail.com>
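A condensed sketch of the wiring the commit describes. The `get_rope` import path and the `q, k = rotary_emb(positions, q, k)` call come from the commit message itself; the keyword arguments and the surrounding class are assumptions, not the actual NemotronHAttention code.

```python
import torch
from torch import nn

from vllm.model_executor.layers.rotary_embedding import get_rope  # import named in the commit


class AttentionWithRoPE(nn.Module):
    """Illustrative stand-in for NemotronHAttention after the fix."""

    def __init__(self, head_dim: int, max_position: int, rope_theta: float, dtype: torch.dtype):
        super().__init__()
        # (2) rotary_emb is created once in __init__; argument names are assumed.
        self.rotary_emb = get_rope(
            head_size=head_dim,
            rotary_dim=head_dim,
            max_position=max_position,
            base=rope_theta,
            dtype=dtype,
        )

    def forward(self, positions: torch.Tensor, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        # (3) Apply RoPE to Q and K before attention; without it the attention
        # scores carry no positional information.
        q, k = self.rotary_emb(positions, q, k)
        # ... attention over (q, k, v) would follow; the decoder layer passes
        # `positions` down to the mixer (step 4), and load_weights skips
        # rotary_emb.inv_freq (step 5).
        return q, k, v
```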
Address review feedback: Use model_config.dtype instead of torch.get_default_dtype() to ensure rotary embeddings match the model's actual dtype. Fall back to the default dtype when model_config is None (testing scenarios).

Signed-off-by: Christina <truffle@gmail.com>
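In code form, the fallback described above is a single conditional; `model_config` here stands for whatever config object carries the model dtype, and `rope_dtype` is a hypothetical helper name.

```python
import torch


def rope_dtype(model_config) -> torch.dtype:
    # Prefer the model's actual dtype so rotary embeddings match it; fall back
    # to the process-wide default only when no model_config is available
    # (e.g. in tests).
    return model_config.dtype if model_config is not None else torch.get_default_dtype()
```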
The GGUF loader yields quantization metadata parameters (qweight_type) for all quantized tensors, including embeddings. However, VocabParallelEmbedding doesn't have these parameters, causing a KeyError when loading GGUF Gemma2 models.

This adds a safety check to skip parameters not present in the model, matching the pattern already used in llama.py (lines 502-503).

Fixes KeyError: 'embed_tokens.qweight_type' during engine core init.

Signed-off-by: Christina <truffle@gmail.com>
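A hedged sketch of the guard this commit adds to Gemma2's weight loading, following the llama.py pattern it cites; the loop body is simplified and `_copy_loader` is a stand-in for vLLM's default weight loader.

```python
def _copy_loader(param, loaded_weight):
    # Stand-in for vLLM's default weight loader.
    param.data.copy_(loaded_weight)


def load_weights(named_parameters, weights):
    params_dict = dict(named_parameters)
    for name, loaded_weight in weights:
        # GGUF iterators also yield metadata such as embed_tokens.qweight_type;
        # modules like VocabParallelEmbedding have no matching parameter, so
        # skip instead of raising KeyError.
        if name not in params_dict:
            continue
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", _copy_loader)
        weight_loader(param, loaded_weight)
```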
Force-pushed from 5ceb8d4 to 51b0e65
Summary
The GGUF loader yields quantization metadata parameters (`qweight_type`) for all quantized tensors, including embeddings. However, `VocabParallelEmbedding` doesn't have these parameters, causing a `KeyError` when loading GGUF Gemma2 models.

Root Cause: When loading GGUF models, `gguf_quant_weights_iterator()` yields `qweight_type` metadata for quantized weights. For embeddings like `embed_tokens`, this produces `embed_tokens.qweight_type`, but `VocabParallelEmbedding` doesn't have this parameter in `params_dict`.

Fix: Add a safety check to skip parameters not present in the model, matching the pattern already used in `llama.py` (lines 502-503).

Error Before Fix

Stack trace: `gguf_loader.py:338` → `gemma2.py:436` → `gemma2.py:369` (crash site)

Test Plan

Tested with the `bartowski/gemma-2-2b-it-GGUF` model.
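For reference, loading the test model follows vLLM's usual GGUF flow: pass a local .gguf file as the model and point `tokenizer` at the original checkpoint. The quant filename below is an assumption; use whichever file you downloaded from bartowski/gemma-2-2b-it-GGUF.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="./gemma-2-2b-it-Q4_K_M.gguf",  # local GGUF file (assumed filename)
    tokenizer="google/gemma-2-2b-it",     # original tokenizer repo
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```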