[Model][GGUF] Add Gemma4 GGUF serving glue #41589
lesj0610 wants to merge 1 commit into vllm-project:main from
Conversation
Teach the GGUF loader and tokenizer/processor setup about Gemma4-specific tensor names, local config handling, and tokenizer special IDs.

Keep this independent from the Gemma4 quant runtime and MoE activation changes.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
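For context, a minimal sketch of what the "tokenizer special IDs" patching could look like, assuming GGUF-style metadata keys and a simplified helper; the names here are illustrative, not the PR's actual code:

```python
from transformers import PreTrainedTokenizerBase

def patch_gemma4_special_ids(tokenizer: PreTrainedTokenizerBase,
                             gguf_metadata: dict) -> None:
    """Copy special token ids from GGUF metadata onto the HF tokenizer.

    The metadata keys below follow the common GGUF tokenizer convention;
    treat both the keys and this helper as assumptions for illustration.
    """
    for meta_key, attr in (
        ("tokenizer.ggml.bos_token_id", "bos_token_id"),
        ("tokenizer.ggml.eos_token_id", "eos_token_id"),
        ("tokenizer.ggml.padding_token_id", "pad_token_id"),
    ):
        token_id = gguf_metadata.get(meta_key)
        if token_id is not None and getattr(tokenizer, attr, None) != token_id:
            # Overwrite only when the GGUF file disagrees with (or fills in)
            # the sibling tokenizer config.
            setattr(tokenizer, attr, token_id)
```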
Code Review
This pull request introduces support for Gemma4 GGUF models, including manual tensor mappings, tokenizer patching for special token IDs, and improved configuration loading that prioritizes local JSON files. It also updates the Gemma4 multimodal model's embedding initialization. A critical bug was identified in the GGUF loader where missing '.weight' suffixes in the manual mapping for MoE expert weights would lead to a runtime error.
```python
add_mapping(
    f"blk.{idx}.ffn_gate_up_exps.weight",
    f"{layer_prefix}.moe.gate_up_proj.weight",
    handled_name=f"{layer_prefix}.experts.gate_up_proj",
)
add_mapping(
    f"blk.{idx}.ffn_down_exps.weight",
    f"{layer_prefix}.moe.down_proj.weight",
    handled_name=f"{layer_prefix}.experts.down_proj",
)
```
The handled_name for Gemma4 MoE expert weights (gate/up and down) is missing the .weight suffix. Since normalized_state_names (built at line 441) includes the suffix for weight tensors, this mismatch will cause the manual mapping to be skipped. Consequently, these parameters will be flagged as unmapped in the final check (line 509), leading to a RuntimeError during model loading.
```diff
 add_mapping(
     f"blk.{idx}.ffn_gate_up_exps.weight",
     f"{layer_prefix}.moe.gate_up_proj.weight",
-    handled_name=f"{layer_prefix}.experts.gate_up_proj",
+    handled_name=f"{layer_prefix}.experts.gate_up_proj.weight",
 )
 add_mapping(
     f"blk.{idx}.ffn_down_exps.weight",
     f"{layer_prefix}.moe.down_proj.weight",
-    handled_name=f"{layer_prefix}.experts.down_proj",
+    handled_name=f"{layer_prefix}.experts.down_proj.weight",
 )
```
Closing per author workflow correction. This split will be handled in fork-only PRs before any upstream submission.
Summary
Add the Gemma4-specific GGUF serving glue needed to load local Gemma4 GGUF repositories with sibling HF config/tokenizer files:
- Prefer a sibling `config.json` for local GGUF files when available instead of forcing transformers' GGUF parser, as sketched below

This PR intentionally contains no AutoRound quant runtime changes and no MoE activation changes.
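A minimal sketch of the intended lookup order, using a hypothetical helper name; the fallback call mirrors transformers' `gguf_file` support and should be verified against the installed version:

```python
from pathlib import Path

from transformers import AutoConfig, PretrainedConfig

def load_config_for_local_gguf(gguf_path: str) -> PretrainedConfig:
    gguf_file = Path(gguf_path)
    sibling_config = gguf_file.parent / "config.json"
    if sibling_config.is_file():
        # Prefer the sibling HF config shipped next to the local GGUF file.
        return AutoConfig.from_pretrained(gguf_file.parent)
    # Otherwise fall back to transformers' GGUF metadata parser.
    return AutoConfig.from_pretrained(gguf_file.parent, gguf_file=gguf_file.name)
```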
Duplicate-work check
Checked open PRs before opening:
Tests
Result:
8 passed.

AI assistance was used to prepare and split this change; I reviewed the resulting diff and validation output.