[Bugfix][Gemma4] Fix quantized MoE weight loading and KV cache spec merge #39582
neotea wants to merge 2 commits into vllm-project:main from
Conversation
Three issues prevent serving quantized Gemma 4 MoE checkpoints (e.g. `cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit`):

1. The 3D→2D expert weight explosion in `_weight_iterator` only matched exact projection names (`gate_up_proj`, `down_proj`). Compressed-tensors and AWQ checkpoints store quantized weights with underscore suffixes (`gate_up_proj_packed`, `down_proj_scale`, etc.), which were silently dropped, causing a `KeyError` during loading.
2. The `expert_params_mapping` loop only handled dot-separated suffixes (`.weight_scale`) and bare names, but not the underscore suffixes (`_packed`, `_scale`) used by compressed-tensors quantization.
3. `FullAttentionSpec.merge()` and `SinkFullAttentionSpec.merge()` did not forward `cache_dtype_str` to the merged spec, so the subsequent field-equality assertion always failed when `cache_dtype_str` was set.

Additionally, `Gemma4Router.proj` (`GateLinear`) was not receiving `quant_config`, which caused crashes when router weights were quantized.

Tested with `cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit` on a single RTX 3090 (24 GB): the model loads and generates correct output.

Signed-off-by: neotea <neotea@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
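To illustrate the first fix, here is a minimal sketch of prefix-based matching in the 3D→2D explosion. The helper name `_explode_expert_weights` and the suffix handling are hypothetical simplifications, not the actual vLLM implementation:

```python
# Hypothetical sketch: explode a stacked 3D expert tensor into per-expert
# 2D weights, matching projection names by prefix so quantized variants
# ("gate_up_proj_packed", "down_proj_scale", ...) are not silently dropped.
import torch

PROJ_PREFIXES = ("gate_up_proj", "down_proj")

def _explode_expert_weights(name: str, tensor: torch.Tensor):
    base = name.rsplit(".", 1)[-1]
    if tensor.ndim != 3 or not base.startswith(PROJ_PREFIXES):
        # Not a stacked expert weight; yield unchanged.
        yield name, tensor
        return
    prefix = name.rsplit(".", 1)[0]
    for expert_id in range(tensor.shape[0]):
        # "...experts.gate_up_proj_packed" -> "...experts.0.gate_up_proj_packed"
        yield f"{prefix}.{expert_id}.{base}", tensor[expert_id]
```

An exact-name check (`base in PROJ_PREFIXES`) would drop every suffixed quantized tensor; the prefix check keeps the suffix attached through the per-expert rename.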
Code Review
This pull request updates the Gemma4 model to support quantized weights from AWQ and compressed-tensors by handling underscore suffixes in weight names and passing the quantization configuration to MoE layers. It also modifies the KV cache interface to include `cache_dtype_str` during attention spec merging and improves assertion error messages. However, critical issues were identified in `vllm/v1/kv_cache_interface.py`, where `cache_dtype_str` is passed to the constructors of `FullAttentionSpec` and `SinkFullAttentionSpec`, which do not currently define this field, leading to a runtime `TypeError`.
```python
dtype=specs[0].dtype,
kv_quant_mode=specs[0].kv_quant_mode,
page_size_padded=specs[0].page_size_padded,
cache_dtype_str=specs[0].cache_dtype_str,
```
The `FullAttentionSpec` class (and its base `AttentionSpec`) does not define a `cache_dtype_str` field in its dataclass definition. Attempting to pass this keyword argument to the constructor will result in a `TypeError`. To fix this, you should move the `cache_dtype_str` field from `MLAAttentionSpec` up to `AttentionSpec` or `FullAttentionSpec` so it is available to all relevant attention specs during merging.
```python
dtype=specs[0].dtype,
kv_quant_mode=specs[0].kv_quant_mode,
page_size_padded=specs[0].page_size_padded,
cache_dtype_str=specs[0].cache_dtype_str,
```
`cache_dtype_str` was defined only on `MLAAttentionSpec`, but it is relevant for all attention types (e.g. FP8 KV cache, TurboQuant). Move it to the base `AttentionSpec` so `FullAttentionSpec.merge()` and `SinkFullAttentionSpec.merge()` can forward it without a `TypeError`. Addresses review feedback from gemini-code-assist.

Signed-off-by: neotea <neotea@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
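In simplified form, the fix looks roughly like this (a sketch with trimmed-down dataclasses; the real specs in `vllm/v1/kv_cache_interface.py` carry more fields and validation):

```python
# Simplified sketch of the field move and merge forwarding; the real
# AttentionSpec hierarchy in vllm/v1/kv_cache_interface.py is richer.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttentionSpec:
    dtype: str
    # Moved up from MLAAttentionSpec so all subclasses accept it.
    cache_dtype_str: Optional[str] = None

@dataclass
class FullAttentionSpec(AttentionSpec):
    @classmethod
    def merge(cls, specs: list["FullAttentionSpec"]) -> "FullAttentionSpec":
        merged = cls(
            dtype=specs[0].dtype,
            # Forward cache_dtype_str so the equality check below compares
            # like with like instead of None vs. "auto".
            cache_dtype_str=specs[0].cache_dtype_str,
        )
        for field in ("dtype", "cache_dtype_str"):
            assert all(
                getattr(s, field) == getattr(merged, field) for s in specs
            ), f"attention specs disagree on {field!r}"
        return merged
```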
Hmm, I tried this PR on top of current vLLM because it constantly failed with:

With this PR, I still get an error, but a different one:

What am I missing here?
Attempting a polite nudge here: I had to apply these changes locally to get up and running on an NVIDIA AGX Orin. Thanks!
Summary
Fixes four issues that prevent serving quantized Gemma 4 MoE checkpoints (compressed-tensors / AWQ format) with vLLM:
1. **3D expert weight explosion drops quantization suffixes:** `_weight_iterator` only matched exact projection names (`gate_up_proj`, `down_proj`). Quantized checkpoints use suffixed names like `gate_up_proj_packed` and `down_proj_scale`, which were silently dropped during the 3D→2D per-expert explosion, causing a `KeyError` at load time.
2. **Expert params mapping ignores underscore suffixes:** The mapping loop handled dot-separated suffixes (`.weight_scale`) and bare names, but not the underscore suffixes (`_packed`, `_scale`) from compressed-tensors quantization. Added a third matching branch that detects and remaps these (see the sketch after this list).
3. **`FullAttentionSpec.merge()` drops `cache_dtype_str`:** The merged spec was constructed without forwarding `cache_dtype_str` from the source specs. The subsequent field-equality assertion then compared `None` against `'auto'` and crashed. Fixed in both `FullAttentionSpec.merge()` and `SinkFullAttentionSpec.merge()`. Also improved the assertion message to show which field mismatches.
4. **`GateLinear` missing `quant_config`:** `Gemma4Router.proj` was not passing `quant_config` to `GateLinear`, causing crashes when router weights are quantized (e.g. in AutoRound / compressed-tensors checkpoints).
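A minimal sketch of the underscore-suffix branch from item 2 (the helper name and surrounding plumbing are illustrative; the actual loop consumes vLLM's `expert_params_mapping` tuples):

```python
# Illustrative sketch (hypothetical helper; the real loop lives in the
# Gemma4 weight loader and iterates expert_params_mapping entries).

def remap_expert_weight(name: str, weight_name: str, param_name: str):
    """Return the fused-parameter name for a checkpoint weight, or None."""
    idx = name.find(weight_name)
    if idx == -1:
        return None
    rest = name[idx + len(weight_name):]
    # Pre-existing branches: bare name ("") or dot-separated suffix
    # (".weight_scale"). New branch: underscore suffix ("_packed",
    # "_scale") as emitted by compressed-tensors / AWQ checkpoints.
    if rest == "" or rest.startswith(".") or rest.startswith("_"):
        return name.replace(weight_name, param_name, 1)
    return None
```

For example, `"...experts.3.gate_up_proj_packed"` now remaps the same way `"...experts.3.gate_up_proj"` does, with the `_packed` suffix carried through.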
Relationship to #39406

PR #39406 makes the fallthrough path more robust by skipping unrecognized keys instead of crashing. This PR is complementary: it fixes the root cause so that quantized weight names are correctly generated and mapped in the first place, rather than falling through to the error path.
Relationship to #39460
PR #39460 fixes GPTQ-Marlin-specific issues (ceiling division for scales, quantized gate weights, GELU activation). This PR fixes the upstream weight-name mapping that affects all compressed-tensors quantized Gemma 4 MoE checkpoints.
Test plan
- `cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit` (compressed-tensors W4A16) on a single RTX 3090 (24 GB VRAM)
- `google/gemma-4-26B-A4B-it` loading still works (needs 80 GB GPU)

AI assistance was used during development (Claude). All changes were reviewed and tested by the submitter.
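For reference, a smoke test along these lines exercises the quantized load path (the model name comes from the test plan above; the other settings are illustrative, and quantization is auto-detected from the checkpoint config):

```python
# Quick smoke test with the vLLM offline API. Model name comes from the
# PR's test plan; sampling and memory settings here are arbitrary.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    max_model_len=4096,          # keep the KV cache small for a 24 GB GPU
    gpu_memory_utilization=0.9,  # default; shown for clarity
)
outputs = llm.generate(
    ["Explain mixture-of-experts routing in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```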