fix(gemma2): Add quant_config to embedding layer for GGUF support#30424

Closed
kitaekatt wants to merge 1 commit into vllm-project:main from kitaekatt:fix/gemma2-embedding-quant-config

Conversation

@kitaekatt
Contributor

Summary

  • Adds quant_config parameter to VocabParallelEmbedding in Gemma2 model
  • Enables proper GGUF quantized embedding handling

Problem

Gemma2 GGUF models produce garbage output despite loading successfully:

```
Input: "Say hello in 5 words"
Output: " GHFW側から ThinkmariKeywords!");\rJahre Iliad幺个人..."
```

Root Cause

Without quant_config, VocabParallelEmbedding uses UnquantizedEmbeddingMethod which calls F.embedding() directly on quantized bytes, interpreting them as float values.
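To see why this yields gibberish rather than a crash, consider a minimal standalone sketch (illustrative byte values and scale only, not vLLM code): the same bytes produce sensible weights when dequantized with a scale, but an arbitrary float when read raw.

```python
import struct

# Four bytes standing in for a block of 8-bit quantized weights.
quantized_bytes = bytes([12, 200, 7, 255])

# Correct path: dequantize with the block's scale (scale is made up here).
scale = 0.01
dequantized = [b * scale for b in quantized_bytes]

# Buggy path: reinterpret the same bytes as a raw float32, which is
# effectively what happens when F.embedding() indexes into quantized
# storage as if it held float weights.
(misread,) = struct.unpack("<f", quantized_bytes)

print(dequantized)  # small, sensible weight values
print(misread)      # an arbitrary float unrelated to the weights
```

Every lookup returns such garbage values, so the model decodes random tokens instead of erroring out.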

This is the exact same bug that was fixed for DeepSeek in commit aa375dca9 ("#12836 - Missing quant_config in deepseek embedding layer").

The Fix

```diff
 self.embed_tokens = VocabParallelEmbedding(
     config.vocab_size,
     config.hidden_size,
+    quant_config=quant_config,
+    prefix=f"{prefix}.embed_tokens",
 )
```
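A hedged sketch of the selection logic the fix relies on (simplified stand-in classes, not the actual vLLM source): only when a quantization config is passed to the layer can a GGUF-aware embedding method be chosen.

```python
# Illustrative stand-ins for the two embedding-method classes.
class UnquantizedEmbeddingMethod:
    """Looks up rows directly; assumes float weights."""

class GGUFEmbeddingMethod:
    """Dequantizes GGUF blocks before the lookup."""

class GGUFConfig:
    # Stand-in for vLLM's GGUF quantization config object.
    def get_quant_method(self):
        return GGUFEmbeddingMethod()

def select_embedding_method(quant_config):
    # Without quant_config the layer has no way to know the weights
    # are quantized, so it falls back to the unquantized path.
    if quant_config is None:
        return UnquantizedEmbeddingMethod()
    return quant_config.get_quant_method()

print(type(select_embedding_method(None)).__name__)          # UnquantizedEmbeddingMethod
print(type(select_embedding_method(GGUFConfig())).__name__)  # GGUFEmbeddingMethod
```

Before the fix, Gemma2 always took the `None` branch, regardless of how the checkpoint was quantized.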

Comparison

| Model | quant_config Passed? | GGUF Works? |
|----------|------------------------|------------|
| Gemma2 | ❌ NO (before fix) | ❌ Garbage |
| Gemma3 | ✅ YES | ✅ Works |
| Llama | ✅ YES | ✅ Works |
| DeepSeek | ✅ YES (after #12836) | ✅ Works |

Affected Models

All Gemma2 GGUF variants:

  • gemma-2-2b-it-GGUF
  • gemma-2-9b-GGUF
  • gemma-2-27b-GGUF

Test Plan

  • Gemma2 GGUF model loads without errors
  • Inference produces coherent output (not garbage)
  • Compare output quality with Gemma3 GGUF or safetensors Gemma2

🤖 Generated with Claude Code


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses a critical bug where Gemma2 GGUF models would produce garbage output. The root cause was correctly identified: the VocabParallelEmbedding layer was missing the quant_config parameter, causing quantized weights to be misinterpreted. The provided fix, which passes both quant_config and the layer prefix to the embedding layer, is correct and aligns with the implementation in other models.

While reviewing, I noticed that vllm/model_executor/models/gemma.py seems to have the same vulnerability. Applying a similar fix there would likely resolve the issue for Gemma1 GGUF models as well, ensuring consistent behavior across the Gemma model family. This is a great fix, well done!

@kitaekatt kitaekatt force-pushed the fix/gemma2-embedding-quant-config branch from d41b71f to 5837f8e Compare December 15, 2025 20:30
@kitaekatt kitaekatt marked this pull request as ready for review December 15, 2025 20:30
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@kitaekatt kitaekatt marked this pull request as draft December 15, 2025 20:37
@kitaekatt kitaekatt force-pushed the fix/gemma2-embedding-quant-config branch from 5837f8e to 472b770 Compare December 29, 2025 20:43
@kitaekatt kitaekatt force-pushed the fix/gemma2-embedding-quant-config branch from 472b770 to 096b878 Compare January 19, 2026 17:27
@kitaekatt kitaekatt force-pushed the fix/gemma2-embedding-quant-config branch from 096b878 to 530a043 Compare February 5, 2026 00:18
The Gemma2 model was missing the quant_config parameter in the
VocabParallelEmbedding initialization, causing GGUF quantized
embeddings to be misinterpreted as float values.

Without quant_config, GGUF models use UnquantizedEmbeddingMethod
which calls F.embedding() directly on quantized bytes, resulting
in garbage output during inference.

This is the same bug that was fixed for DeepSeek in commit aa375dc
("Missing quant_config in deepseek embedding layer (vllm-project#12836)").

The fix adds:
- quant_config parameter to enable GGUFEmbeddingMethod selection
- prefix parameter for proper weight mapping

Fixes Gemma2 GGUF models (gemma-2-2b-it-GGUF, etc.) producing garbage
output like: " GHFW側から ThinkmariKeywords!")...

Signed-off-by: Christina <truffle@gmail.com>
@kitaekatt kitaekatt force-pushed the fix/gemma2-embedding-quant-config branch from 530a043 to 99fed9c Compare March 10, 2026 15:38
@kitaekatt kitaekatt marked this pull request as ready for review March 10, 2026 20:09
@kitaekatt
Contributor Author

Validation Results

| vLLM | transformers | Cherry-picked PRs | HumanEval | IFEval |
|------|--------------|-------------------|-----------|--------|
| HEAD | 5.x | #30410, #30411, #30412, #30413, #30424, #30434, #30699, #30702, #31464, #33846 | gem2-2b-gguf (42.1%), gemma3-1b (26.8%) | gem2-2b-gguf (65.6%) |
| HEAD | 4.x | #30410, #30411, #30412, #30413, #30424, #30434, #30699, #30702, #31464, #33846 | q3-moe-gguf (83.5%) | q3-moe-gguf (85.4%) |

Tested on RTX 5090 (Blackwell, SM 120) with all listed PRs cherry-picked together; models listed under each benchmark passed that benchmark in the given environment, while the same models crash or fail without these PRs applied.

Converting from draft to open. Without quant_config on the embedding layer, GGUFEmbeddingMethod is not selected and quantized bytes are read as raw floats, producing gibberish. Validated on RTX 5090 (Blackwell, SM 120).

kitaekatt added a commit to kitaekatt/vllm that referenced this pull request Mar 16, 2026
This PR consolidates four related GGUF bug fixes for Gemma2 and Gemma3
models, plus a style improvement from reviewer feedback.

**1. Add quant_config to embedding layer (PR vllm-project#30424)**
Pass quant_config to VocabParallelEmbedding in Gemma2Model so that
GGUFEmbeddingMethod is selected instead of UnquantizedEmbeddingMethod.
Without this, quantized bytes are read as raw floats producing gibberish.

**2. Fix EOS token extraction for HF blob paths (PR vllm-project#30434)**
GGUF files served from HuggingFace Hub use blob paths that don't match
the expected filename pattern. Extract EOS token ID directly from GGUF
metadata as a reliable fallback.

**3. Skip missing parameters during GGUF weight loading (PR vllm-project#30699)**
Gemma2 GGUF files may lack certain weight keys (e.g. embed_tokens.qweight_type).
Skip missing parameters gracefully instead of raising KeyError.
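The skip-missing behavior in item 3 can be sketched as follows (illustrative names, not vLLM's actual loader): keys absent from the GGUF file are skipped rather than triggering a KeyError.

```python
def load_gguf_weights(model_params, gguf_tensors):
    """Copy checkpoint tensors into model parameters, skipping absentees.

    model_params: dict of parameter name -> placeholder
    gguf_tensors: dict of tensor name -> data read from the GGUF file
    """
    loaded = []
    for name in model_params:
        if name not in gguf_tensors:
            # e.g. "embed_tokens.qweight_type" is absent from some
            # Gemma2 GGUF files; skip it instead of raising KeyError.
            continue
        model_params[name] = gguf_tensors[name]
        loaded.append(name)
    return loaded

params = {"embed_tokens.qweight": None, "embed_tokens.qweight_type": None}
tensors = {"embed_tokens.qweight": "data"}
print(load_gguf_weights(params, tensors))  # ['embed_tokens.qweight']
```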

**4. Use RMSNorm instead of GemmaRMSNorm for GGUF (PR vllm-project#31464)**
GGUF files store RMSNorm weights with +1 baked in (llama.cpp convention).
GemmaRMSNorm adds 1 in its forward pass, causing double addition.
Select plain RMSNorm at construction time for GGUF models. Applied to
all norm layers in Gemma2 and Gemma3 (including q_norm/k_norm).
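The double addition in item 4 can be demonstrated numerically with a minimal sketch (made-up weights; assumes the stored gamma already includes the +1, per the llama.cpp convention the text describes):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # Plain RMSNorm: scale by the stored weight as-is.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def gemma_rms_norm(x, weight, eps=1e-6):
    # Gemma-style variant: multiplies by (1 + weight) in forward.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * (1.0 + w) for v, w in zip(x, weight)]

x = [1.0, 2.0, 3.0]
stored_gamma = [1.1, 0.9, 1.0]  # GGUF checkpoint: +1 already baked in

correct = rms_norm(x, stored_gamma)       # uses gamma as stored
double = gemma_rms_norm(x, stored_gamma)  # effectively (1 + 1 + raw gamma)
print(correct)
print(double)
```

With the +1 counted twice, every activation is scaled roughly a factor too large, which is why plain RMSNorm must be selected at construction time for GGUF checkpoints.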

**Style: compact rms_norm_kwargs pattern (reviewer feedback)**
Use rms_norm_kwargs dict to avoid repeating constructor arguments,
per hmellor's review on PR vllm-project#31464.

Tested on RTX 5090 (Blackwell, SM 120) with gem2-2b-gguf and gemma3-1b.
Supersedes PRs vllm-project#30424, vllm-project#30434, vllm-project#30699, vllm-project#31464.

Signed-off-by: Christina <truffle@gmail.com>
@kitaekatt
Contributor Author

Closing in favor of consolidated PR #37220, as requested by @Isotr0py in #30434. All fixes from this PR are included in the consolidated version.

@kitaekatt kitaekatt closed this Mar 16, 2026