fix(gemma2): Add quant_config to embedding layer for GGUF support #30424
kitaekatt wants to merge 1 commit into
Conversation
Code Review
This pull request addresses a critical bug where Gemma2 GGUF models would produce garbage output. The root cause was correctly identified: the VocabParallelEmbedding layer was missing the quant_config parameter, causing quantized weights to be misinterpreted. The provided fix, which passes both quant_config and the layer prefix to the embedding layer, is correct and aligns with the implementation in other models.
While reviewing, I noticed that vllm/model_executor/models/gemma.py seems to have the same vulnerability. Applying a similar fix there would likely resolve the issue for Gemma1 GGUF models as well, ensuring consistent behavior across the Gemma model family. This is a great fix, well done!
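To make the mechanism concrete, here is a minimal toy sketch of why forwarding `quant_config` changes which embedding method is selected. These are stand-in classes written for illustration only, not vLLM's real implementations; the class and parameter names mirror those discussed in this PR, but the bodies are hypothetical.

```python
class UnquantizedEmbeddingMethod:
    """Stand-in: treats the stored weight tensor as plain floats."""
    name = "unquantized"

class GGUFEmbeddingMethod:
    """Stand-in: dequantizes GGUF blocks before the embedding lookup."""
    name = "gguf"

class GGUFConfig:
    """Stand-in for a GGUF quantization config (hypothetical)."""
    def get_quant_method(self, layer, prefix):
        return GGUFEmbeddingMethod()

class VocabParallelEmbedding:
    """Stand-in for the embedding layer (hypothetical signature)."""
    def __init__(self, vocab_size, hidden_size, quant_config=None, prefix=""):
        self.prefix = prefix
        # Without quant_config, the layer falls back to the unquantized
        # path, which would read quantized GGUF bytes as raw floats.
        if quant_config is None:
            self.method = UnquantizedEmbeddingMethod()
        else:
            self.method = quant_config.get_quant_method(self, prefix)

# Before the fix: quant_config is not forwarded.
broken = VocabParallelEmbedding(256000, 2304)
# After the fix: quant_config and prefix are forwarded.
fixed = VocabParallelEmbedding(
    256000, 2304, quant_config=GGUFConfig(), prefix="model.embed_tokens"
)
print(broken.method.name)  # unquantized
print(fixed.method.name)   # gguf
```

The same selection logic is why the reviewer's suggestion for `gemma.py` matters: any model constructing its embedding without the config silently takes the unquantized path.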
The Gemma2 model was missing the quant_config parameter in the VocabParallelEmbedding initialization, causing GGUF quantized embeddings to be misinterpreted as float values. Without quant_config, GGUF models use UnquantizedEmbeddingMethod, which calls F.embedding() directly on the quantized bytes, resulting in garbage output during inference. This is the same bug that was fixed for DeepSeek in commit aa375dc ("Missing quant_config in deepseek embedding layer (vllm-project#12836)").

The fix adds:
- quant_config parameter, to enable GGUFEmbeddingMethod selection
- prefix parameter, for proper weight mapping

Fixes Gemma2 GGUF models (gemma-2-2b-it-GGUF, etc.) producing garbage output like:

" GHFW側から ThinkmariKeywords!")...

Signed-off-by: Christina <truffle@gmail.com>
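The failure mode described above — quantized bytes interpreted as floats — can be demonstrated without vLLM at all. The sketch below uses a made-up 4-byte int8 block and scale purely for illustration; real GGUF block formats are more involved, but the byte-reinterpretation problem is the same.

```python
import struct

# Suppose a GGUF file stores four quantized int8 weight values.
quantized = struct.pack("4b", 12, -7, 100, -128)  # 4 bytes of int8 data

# Correct path (what a GGUF-aware embedding method effectively does):
# decode as int8, then dequantize with the block's scale (made up here).
scale = 0.05
dequantized = [q * scale for q in struct.unpack("4b", quantized)]

# Broken path (unquantized method applied to quantized storage): the same
# 4 bytes are read as one float32, yielding a meaningless value.
(misread,) = struct.unpack("f", quantized)

print(dequantized)  # small, sensible weight values
print(misread)      # a meaningless float (here, a tiny denormal)
```

Every embedding lookup under the broken path returns rows of such misread values, which is why generation degenerates into gibberish rather than crashing.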
Validation Results
Tested on RTX 5090 (Blackwell, SM 120) with all listed PRs cherry-picked together; the models listed under each benchmark passed that benchmark in the given environment, while the same models crash or fail without these PRs applied. Converting from draft to open.

Without quant_config on the embedding layer, GGUFEmbeddingMethod is not selected and quantized bytes are read as raw floats, producing gibberish. Validated on RTX 5090 (Blackwell, SM 120).
This PR consolidates four related GGUF bug fixes for Gemma2 and Gemma3 models, plus a style improvement from reviewer feedback.

**1. Add quant_config to embedding layer (PR vllm-project#30424)**

Pass quant_config to VocabParallelEmbedding in Gemma2Model so that GGUFEmbeddingMethod is selected instead of UnquantizedEmbeddingMethod. Without this, quantized bytes are read as raw floats, producing gibberish.

**2. Fix EOS token extraction for HF blob paths (PR vllm-project#30434)**

GGUF files served from HuggingFace Hub use blob paths that don't match the expected filename pattern. Extract the EOS token ID directly from GGUF metadata as a reliable fallback.

**3. Skip missing parameters during GGUF weight loading (PR vllm-project#30699)**

Gemma2 GGUF files may lack certain weight keys (e.g. embed_tokens.qweight_type). Skip missing parameters gracefully instead of raising KeyError.

**4. Use RMSNorm instead of GemmaRMSNorm for GGUF (PR vllm-project#31464)**

GGUF files store RMSNorm weights with +1 baked in (llama.cpp convention). GemmaRMSNorm adds 1 in its forward pass, causing double addition. Select plain RMSNorm at construction time for GGUF models. Applied to all norm layers in Gemma2 and Gemma3 (including q_norm/k_norm).

**Style: compact rms_norm_kwargs pattern (reviewer feedback)**

Use an rms_norm_kwargs dict to avoid repeating constructor arguments, per hmellor's review on PR vllm-project#31464.

Tested on RTX 5090 (Blackwell, SM 120) with gem2-2b-gguf and gemma3-1b. Supersedes PRs vllm-project#30424, vllm-project#30434, vllm-project#30699, vllm-project#31464.

Signed-off-by: Christina <truffle@gmail.com>
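The double-addition described in fix 4 can be shown with a small numeric sketch. These functions are toy scalar versions written for illustration, not vLLM's actual RMSNorm layers; the weight values are made up, but they reproduce the +1 convention clash between llama.cpp-style GGUF storage and a Gemma-style forward pass.

```python
def rms_norm(x, weight, eps=1e-6):
    """Plain RMSNorm: scales the normalized input by `weight` as stored."""
    rms = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    return [v / rms * w for v, w in zip(x, weight)]

def gemma_rms_norm(x, weight, eps=1e-6):
    """Gemma-style RMSNorm: adds 1 to the stored weight in forward()."""
    rms = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    return [v / rms * (1.0 + w) for v, w in zip(x, weight)]

x = [1.0, 2.0, 3.0]
stored_hf = [0.5, 0.5, 0.5]                  # HF-style weight (effective scale = 1 + w)
stored_gguf = [1.0 + w for w in stored_hf]   # GGUF bakes the +1 into the weight

correct = gemma_rms_norm(x, stored_hf)   # reference behavior
good = rms_norm(x, stored_gguf)          # plain RMSNorm on GGUF weights: matches
bad = gemma_rms_norm(x, stored_gguf)     # +1 applied twice: scale is off by 1

print(correct == good)  # True
print(bad == good)      # False
```

Selecting the norm class at construction time, as this PR does, keeps the forward pass branch-free instead of checking the checkpoint format on every call.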
Summary
Adds the `quant_config` parameter to `VocabParallelEmbedding` in the Gemma2 model.

Problem
Gemma2 GGUF models produce garbage output despite loading successfully:
Root Cause
Without `quant_config`, `VocabParallelEmbedding` uses `UnquantizedEmbeddingMethod`, which calls `F.embedding()` directly on quantized bytes, interpreting them as float values. This is the exact same bug that was fixed for DeepSeek in commit aa375dca9 ("#12836 - Missing quant_config in deepseek embedding layer").

The Fix
```diff
 self.embed_tokens = VocabParallelEmbedding(
     config.vocab_size,
     config.hidden_size,
+    quant_config=quant_config,
+    prefix=f"{prefix}.embed_tokens",
 )
```

Comparison
Affected Models
All Gemma2 GGUF variants:
Test Plan
🤖 Generated with Claude Code