[Model][GGUF] Add Gemma4 GGUF serving glue #41589
lesj0610 wants to merge 1 commit into vllm-project:main from
Conversation
Teach the GGUF loader and tokenizer/processor setup about Gemma4-specific tensor names, local config handling, and tokenizer special IDs.

Keep this independent from the Gemma4 quant runtime and MoE activation changes.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
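For context, a minimal sketch of what the "tokenizer special IDs" patching could look like, assuming GGUF-style metadata keys and a simplified helper; the names here are illustrative, not the PR's actual code:

```python
from transformers import PreTrainedTokenizerBase

def patch_gemma4_special_ids(tokenizer: PreTrainedTokenizerBase,
                             gguf_metadata: dict) -> None:
    """Copy special token ids from GGUF metadata onto the HF tokenizer.

    The metadata keys below follow the common GGUF tokenizer convention;
    treat both the keys and this helper as assumptions for illustration.
    """
    for meta_key, attr in (
        ("tokenizer.ggml.bos_token_id", "bos_token_id"),
        ("tokenizer.ggml.eos_token_id", "eos_token_id"),
        ("tokenizer.ggml.padding_token_id", "pad_token_id"),
    ):
        token_id = gguf_metadata.get(meta_key)
        if token_id is not None and getattr(tokenizer, attr, None) != token_id:
            # Overwrite only when the GGUF file disagrees with (or fills in)
            # the sibling tokenizer config.
            setattr(tokenizer, attr, token_id)
```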
Code Review
This pull request introduces support for Gemma4 GGUF models, including manual tensor mappings, tokenizer patching for special token IDs, and improved configuration loading that prioritizes local JSON files. It also updates the Gemma4 multimodal model's embedding initialization. A critical bug was identified in the GGUF loader where missing '.weight' suffixes in the manual mapping for MoE expert weights would lead to a runtime error.
```python
add_mapping(
    f"blk.{idx}.ffn_gate_up_exps.weight",
    f"{layer_prefix}.moe.gate_up_proj.weight",
    handled_name=f"{layer_prefix}.experts.gate_up_proj",
)
add_mapping(
    f"blk.{idx}.ffn_down_exps.weight",
    f"{layer_prefix}.moe.down_proj.weight",
    handled_name=f"{layer_prefix}.experts.down_proj",
)
```
The handled_name for Gemma4 MoE expert weights (gate/up and down) is missing the .weight suffix. Since normalized_state_names (built at line 441) includes the suffix for weight tensors, this mismatch will cause the manual mapping to be skipped. Consequently, these parameters will be flagged as unmapped in the final check (line 509), leading to a RuntimeError during model loading.
```diff
 add_mapping(
     f"blk.{idx}.ffn_gate_up_exps.weight",
     f"{layer_prefix}.moe.gate_up_proj.weight",
-    handled_name=f"{layer_prefix}.experts.gate_up_proj",
+    handled_name=f"{layer_prefix}.experts.gate_up_proj.weight",
 )
 add_mapping(
     f"blk.{idx}.ffn_down_exps.weight",
     f"{layer_prefix}.moe.down_proj.weight",
-    handled_name=f"{layer_prefix}.experts.down_proj",
+    handled_name=f"{layer_prefix}.experts.down_proj.weight",
 )
```
Closing per author workflow correction. This split will be handled in fork-only PRs before any upstream submission.
Summary
Add the Gemma4-specific GGUF serving glue needed to load local Gemma4 GGUF repositories with sibling HF config/tokenizer files:
- Prefer a sibling `config.json` for local GGUF files when available instead of forcing transformers' GGUF parser, as sketched below

This PR intentionally contains no AutoRound quant runtime changes and no MoE activation changes.
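A minimal sketch of the intended lookup order, using a hypothetical helper name; the fallback call mirrors transformers' `gguf_file` support and should be verified against the installed version:

```python
from pathlib import Path

from transformers import AutoConfig, PretrainedConfig

def load_config_for_local_gguf(gguf_path: str) -> PretrainedConfig:
    gguf_file = Path(gguf_path)
    sibling_config = gguf_file.parent / "config.json"
    if sibling_config.is_file():
        # Prefer the sibling HF config shipped next to the local GGUF file.
        return AutoConfig.from_pretrained(gguf_file.parent)
    # Otherwise fall back to transformers' GGUF metadata parser.
    return AutoConfig.from_pretrained(gguf_file.parent, gguf_file=gguf_file.name)
```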
Duplicate-work check
Checked open PRs before opening:
Tests
Result:
8 passed.

AI assistance was used to prepare and split this change; I reviewed the resulting diff and validation output.