[Bugfix][Gemma4] Fix quantized MoE weight loading and KV cache spec merge #39582

Open
neotea wants to merge 2 commits into vllm-project:main from neotea:fix/gemma4-quantized-moe-weight-loading

Conversation


@neotea commented on Apr 11, 2026

Summary

Fixes four issues that prevent serving quantized Gemma 4 MoE checkpoints (compressed-tensors / AWQ format) with vLLM:

  • 3D expert weight explosion drops quantization suffixes: _weight_iterator only matched exact projection names (gate_up_proj, down_proj). Quantized checkpoints use suffixed names like gate_up_proj_packed and down_proj_scale, which were silently dropped during the 3D→2D per-expert explosion, causing KeyError at load time. A sketch of the suffix-aware matching follows this list.

  • Expert params mapping ignores underscore suffixes: The mapping loop handled dot-separated suffixes (.weight_scale) and bare names, but not the underscore suffixes (_packed, _scale) from compressed-tensors quantization. Added a third matching branch that detects and remaps these.

  • FullAttentionSpec.merge() drops cache_dtype_str: The merged spec was constructed without forwarding cache_dtype_str from the source specs. The subsequent field-equality assertion then compared None against 'auto' and crashed. Fixed in both FullAttentionSpec.merge() and SinkFullAttentionSpec.merge(). Also improved the assertion message to show which field mismatches.

  • GateLinear missing quant_config: Gemma4Router.proj was not passing quant_config to GateLinear, causing crashes when router weights are quantized (e.g. in AutoRound / compressed-tensors checkpoints).
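
As a rough illustration of the suffix-aware matching described in the first two items: the helper names, the suffix list, and the checkpoint name layout below are assumptions for the sketch, not the actual vLLM internals (the real fix lives in _weight_iterator and expert_params_mapping).

```python
# Hypothetical sketch of suffix-aware expert weight-name handling; names and
# the suffix list are illustrative only.

MOE_PROJECTIONS = ("gate_up_proj", "down_proj")
# compressed-tensors / AWQ checkpoints append underscore suffixes such as
# _packed and _scale to the projection name.
QUANT_SUFFIXES = ("", "_packed", "_scale", "_weight_scale", "_zero_point")


def split_expert_weight_name(leaf_name: str):
    """Return (projection, suffix) if leaf_name is an expert weight, else None.

    Matching only the bare projection names drops quantized tensors such as
    gate_up_proj_packed; checking the known suffixes keeps them.
    """
    for proj in MOE_PROJECTIONS:
        for suffix in QUANT_SUFFIXES:
            if leaf_name == proj + suffix:
                return proj, suffix
    return None


def explode_experts(name: str, tensor, num_experts: int):
    """Explode a 3D stacked-expert tensor into per-expert 2D tensors,
    preserving any quantization suffix in the generated names."""
    prefix, _, leaf = name.rpartition(".")
    match = split_expert_weight_name(leaf)
    if match is None:
        yield name, tensor
        return
    proj, suffix = match
    for expert_id in range(num_experts):
        yield f"{prefix}.experts.{expert_id}.{proj}{suffix}", tensor[expert_id]
```

The corresponding change in the expert params mapping adds a branch that recognizes the same underscore suffixes when remapping checkpoint names onto the fused MoE parameters.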

Relationship to #39406

PR #39406 makes the fallthrough path more robust by skipping unrecognized keys instead of crashing. This PR is complementary — it fixes the root cause so that quantized weight names are correctly generated and mapped in the first place, rather than falling through to the error path.

Relationship to #39460

PR #39460 fixes GPTQ-Marlin specific issues (ceiling division for scales, quantized gate weights, GELU activation). This PR fixes the upstream weight name mapping that affects all compressed-tensors quantized Gemma 4 MoE checkpoints.

Test plan

  • Verified no duplicate PR exists for these specific fixes
  • Loaded cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit (compressed-tensors W4A16) on a single RTX 3090 (24 GB VRAM)
  • Model loads all 4 safetensor shards without errors (Marlin WNA16 MoE backend)
  • Server starts and serves requests via OpenAI-compatible API
  • Verified correct text generation output
  • Regression test: unquantized google/gemma-4-26B-A4B-it loading still works (needs 80 GB GPU)
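
For reference, a similar smoke test can be run offline through vLLM's Python API; the prompt and sampling values below are illustrative, and the test plan above used the OpenAI-compatible server instead.

```python
# Minimal offline generation check with vLLM's Python API; prompt and
# sampling values are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Write a haiku about GPUs."], params)
print(outputs[0].outputs[0].text)
```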

AI assistance was used during development (Claude). All changes were reviewed and tested by the submitter.

…erge

Three issues prevent serving quantized Gemma 4 MoE checkpoints
(e.g. cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit):

1. The 3D→2D expert weight explosion in _weight_iterator only matched
   exact projection names (gate_up_proj, down_proj). Compressed-tensors
   and AWQ checkpoints store quantized weights with underscore suffixes
   (gate_up_proj_packed, down_proj_scale, etc.) which were silently
   dropped, causing KeyError during loading.

2. The expert_params_mapping loop only handled dot-separated suffixes
   (.weight_scale) and bare names, but not the underscore suffixes
   (_packed, _scale) used by compressed-tensors quantization.

3. FullAttentionSpec.merge() and SinkFullAttentionSpec.merge() did not
   forward cache_dtype_str to the merged spec, so the subsequent
   field-equality assertion always failed when cache_dtype_str was set.

Additionally, Gemma4Router.proj (GateLinear) was not receiving
quant_config, which caused crashes when router weights were quantized.

Tested with cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit on a single RTX 3090
(24 GB) — model loads and generates correct output.

Signed-off-by: neotea <neotea@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
neotea requested a review from heheda12345 as a code owner on April 11, 2026 at 16:37.
The mergify bot added the v1 and bug labels on Apr 11, 2026.
@gemini-code-assist Bot left a comment (Contributor)

Code Review

This pull request updates the Gemma4 model to support quantized weights from AWQ and compressed-tensors by handling underscore suffixes in weight names and passing the quantization configuration to MoE layers. It also modifies the KV cache interface to include cache_dtype_str during attention spec merging and improves assertion error messages. However, critical issues were identified in vllm/v1/kv_cache_interface.py where cache_dtype_str is passed to the constructors of FullAttentionSpec and SinkFullAttentionSpec, which do not currently define this field, leading to runtime TypeErrors.

dtype=specs[0].dtype,
kv_quant_mode=specs[0].kv_quant_mode,
page_size_padded=specs[0].page_size_padded,
cache_dtype_str=specs[0].cache_dtype_str,

critical

The FullAttentionSpec class (and its base AttentionSpec) does not define a cache_dtype_str field in its dataclass definition. Attempting to pass this keyword argument to the constructor will result in a TypeError. To fix this, you should move the cache_dtype_str field from MLAAttentionSpec up to AttentionSpec or FullAttentionSpec so it is available to all relevant attention specs during merging.

dtype=specs[0].dtype,
kv_quant_mode=specs[0].kv_quant_mode,
page_size_padded=specs[0].page_size_padded,
cache_dtype_str=specs[0].cache_dtype_str,

critical

Similar to FullAttentionSpec, SinkFullAttentionSpec does not have a cache_dtype_str field. This call to the constructor will fail at runtime. Please ensure the field is defined in the class hierarchy before attempting to forward it in the merge method.
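
A rough sketch of the relocation suggested in these two comments, hoisting cache_dtype_str onto the shared base class so the merge methods can forward it; the field set and class layout below are simplified assumptions, not the real vLLM dataclasses.

```python
# Simplified sketch: move cache_dtype_str from MLAAttentionSpec to the shared
# AttentionSpec base so FullAttentionSpec.merge() and SinkFullAttentionSpec.merge()
# can forward it without a TypeError. The real classes carry more fields.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class AttentionSpec:
    block_size: int
    num_kv_heads: int
    head_size: int
    dtype: torch.dtype
    # Previously defined only on MLAAttentionSpec; defining it here makes it a
    # valid constructor argument for every attention spec subclass.
    cache_dtype_str: Optional[str] = None


@dataclass
class FullAttentionSpec(AttentionSpec):
    @classmethod
    def merge(cls, specs: list["FullAttentionSpec"]) -> "FullAttentionSpec":
        merged = cls(
            block_size=specs[0].block_size,
            num_kv_heads=specs[0].num_kv_heads,
            head_size=specs[0].head_size,
            dtype=specs[0].dtype,
            # Forwarding this field is the fix; leaving it out produced a
            # merged spec with None and tripped the field-equality assertion.
            cache_dtype_str=specs[0].cache_dtype_str,
        )
        for field_name in ("dtype", "cache_dtype_str"):
            assert all(
                getattr(s, field_name) == getattr(merged, field_name)
                for s in specs
            ), f"All attention specs must agree on {field_name!r}"
        return merged
```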

cache_dtype_str was defined only on MLAAttentionSpec, but it is
relevant for all attention types (e.g. FP8 KV cache, TurboQuant).
Move it to the base AttentionSpec so FullAttentionSpec.merge() and
SinkFullAttentionSpec.merge() can forward it without a TypeError.

Addresses review feedback from gemini-code-assist.

Signed-off-by: neotea <neotea@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
@herbertp

Hmm, I tried this PR on top of the current vLLM because it constantly failed with:
KeyError: 'layers.0.moe.experts.0.down_proj_packed'
when trying to serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit

With this PR, I still get an error, but a different one:
TypeError: GateLinear.__init__() got an unexpected keyword argument 'quant_config'

What am I missing here?
Thanks in advance,
Herbert

@dfry-lhzn

Attempting a polite nudge here: I had to apply these changes locally to get up and running on an NVIDIA AGX Orin. Thanks!
