[Bugfix][Gemma4] Fix quantized MoE weight loading and KV cache spec merge #39582
neotea wants to merge 2 commits into vllm-project:main from
Conversation
Three issues prevent serving quantized Gemma 4 MoE checkpoints (e.g. `cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit`):

1. The 3D→2D expert weight explosion in `_weight_iterator` only matched exact projection names (`gate_up_proj`, `down_proj`). Compressed-tensors and AWQ checkpoints store quantized weights with underscore suffixes (`gate_up_proj_packed`, `down_proj_scale`, etc.), which were silently dropped, causing a `KeyError` during loading.
2. The `expert_params_mapping` loop only handled dot-separated suffixes (`.weight_scale`) and bare names, but not the underscore suffixes (`_packed`, `_scale`) used by compressed-tensors quantization.
3. `FullAttentionSpec.merge()` and `SinkFullAttentionSpec.merge()` did not forward `cache_dtype_str` to the merged spec, so the subsequent field-equality assertion always failed when `cache_dtype_str` was set.

Additionally, `Gemma4Router.proj` (`GateLinear`) was not receiving `quant_config`, which caused crashes when router weights were quantized.

Tested with `cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit` on a single RTX 3090 (24 GB): the model loads and generates correct output.

Signed-off-by: neotea <neotea@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
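To illustrate the first fix, here is a minimal sketch of prefix-based matching in the 3D→2D explosion. The helper name `_explode_expert_weights` and the suffix handling are hypothetical simplifications, not the actual vLLM implementation:

```python
# Hypothetical sketch: explode a stacked 3D expert tensor into per-expert
# 2D weights, matching projection names by prefix so quantized variants
# ("gate_up_proj_packed", "down_proj_scale", ...) are not silently dropped.
import torch

PROJ_PREFIXES = ("gate_up_proj", "down_proj")

def _explode_expert_weights(name: str, tensor: torch.Tensor):
    base = name.rsplit(".", 1)[-1]
    if tensor.ndim != 3 or not base.startswith(PROJ_PREFIXES):
        # Not a stacked expert weight; yield unchanged.
        yield name, tensor
        return
    prefix = name.rsplit(".", 1)[0]
    for expert_id in range(tensor.shape[0]):
        # "...experts.gate_up_proj_packed" -> "...experts.0.gate_up_proj_packed"
        yield f"{prefix}.{expert_id}.{base}", tensor[expert_id]
```

An exact-name check (`base in PROJ_PREFIXES`) would drop every suffixed quantized tensor; the prefix check keeps the suffix attached through the per-expert rename.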
Code Review
This pull request updates the Gemma4 model to support quantized weights from AWQ and compressed-tensors by handling underscore suffixes in weight names and passing the quantization configuration to MoE layers. It also modifies the KV cache interface to include `cache_dtype_str` during attention spec merging and improves assertion error messages. However, critical issues were identified in `vllm/v1/kv_cache_interface.py`, where `cache_dtype_str` is passed to the constructors of `FullAttentionSpec` and `SinkFullAttentionSpec`, which do not currently define this field, leading to a runtime `TypeError`.
```python
dtype=specs[0].dtype,
kv_quant_mode=specs[0].kv_quant_mode,
page_size_padded=specs[0].page_size_padded,
cache_dtype_str=specs[0].cache_dtype_str,
```
The `FullAttentionSpec` class (and its base `AttentionSpec`) does not define a `cache_dtype_str` field in its dataclass definition. Attempting to pass this keyword argument to the constructor will result in a `TypeError`. To fix this, you should move the `cache_dtype_str` field from `MLAAttentionSpec` up to `AttentionSpec` or `FullAttentionSpec` so it is available to all relevant attention specs during merging.
```python
dtype=specs[0].dtype,
kv_quant_mode=specs[0].kv_quant_mode,
page_size_padded=specs[0].page_size_padded,
cache_dtype_str=specs[0].cache_dtype_str,
```
`cache_dtype_str` was defined only on `MLAAttentionSpec`, but it is relevant for all attention types (e.g. FP8 KV cache, TurboQuant). Move it to the base `AttentionSpec` so `FullAttentionSpec.merge()` and `SinkFullAttentionSpec.merge()` can forward it without a `TypeError`. Addresses review feedback from gemini-code-assist.

Signed-off-by: neotea <neotea@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
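In simplified form, the fix looks roughly like this (a sketch with trimmed-down dataclasses; the real specs in `vllm/v1/kv_cache_interface.py` carry more fields and validation):

```python
# Simplified sketch of the field move and merge forwarding; the real
# AttentionSpec hierarchy in vllm/v1/kv_cache_interface.py is richer.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttentionSpec:
    dtype: str
    # Moved up from MLAAttentionSpec so all subclasses accept it.
    cache_dtype_str: Optional[str] = None

@dataclass
class FullAttentionSpec(AttentionSpec):
    @classmethod
    def merge(cls, specs: list["FullAttentionSpec"]) -> "FullAttentionSpec":
        merged = cls(
            dtype=specs[0].dtype,
            # Forward cache_dtype_str so the equality check below compares
            # like with like instead of None vs. "auto".
            cache_dtype_str=specs[0].cache_dtype_str,
        )
        for field in ("dtype", "cache_dtype_str"):
            assert all(
                getattr(s, field) == getattr(merged, field) for s in specs
            ), f"attention specs disagree on {field!r}"
        return merged
```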
Hmm, I tried this PR on top of current vLLM because it constantly failed with:

With this PR, I still get an error, but a different one:

What am I missing here?
Attempting a polite nudge here: I had to apply these changes locally to get up and running on an NVIDIA AGX Orin. Thanks!
Summary
Fixes four issues that prevent serving quantized Gemma 4 MoE checkpoints (compressed-tensors / AWQ format) with vLLM:
1. **3D expert weight explosion drops quantization suffixes:** `_weight_iterator` only matched exact projection names (`gate_up_proj`, `down_proj`). Quantized checkpoints use suffixed names like `gate_up_proj_packed` and `down_proj_scale`, which were silently dropped during the 3D→2D per-expert explosion, causing a `KeyError` at load time.
2. **Expert params mapping ignores underscore suffixes:** The mapping loop handled dot-separated suffixes (`.weight_scale`) and bare names, but not the underscore suffixes (`_packed`, `_scale`) from compressed-tensors quantization. Added a third matching branch that detects and remaps these (see the sketch after this list).
3. **`FullAttentionSpec.merge()` drops `cache_dtype_str`:** The merged spec was constructed without forwarding `cache_dtype_str` from the source specs. The subsequent field-equality assertion then compared `None` against `'auto'` and crashed. Fixed in both `FullAttentionSpec.merge()` and `SinkFullAttentionSpec.merge()`. Also improved the assertion message to show which field mismatches.
4. **`GateLinear` missing `quant_config`:** `Gemma4Router.proj` was not passing `quant_config` to `GateLinear`, causing crashes when router weights are quantized (e.g. in AutoRound / compressed-tensors checkpoints).
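A minimal sketch of the underscore-suffix branch from item 2 (the helper name and surrounding plumbing are illustrative; the actual loop consumes vLLM's `expert_params_mapping` tuples):

```python
# Illustrative sketch (hypothetical helper; the real loop lives in the
# Gemma4 weight loader and iterates expert_params_mapping entries).

def remap_expert_weight(name: str, weight_name: str, param_name: str):
    """Return the fused-parameter name for a checkpoint weight, or None."""
    idx = name.find(weight_name)
    if idx == -1:
        return None
    rest = name[idx + len(weight_name):]
    # Pre-existing branches: bare name ("") or dot-separated suffix
    # (".weight_scale"). New branch: underscore suffix ("_packed",
    # "_scale") as emitted by compressed-tensors / AWQ checkpoints.
    if rest == "" or rest.startswith(".") or rest.startswith("_"):
        return name.replace(weight_name, param_name, 1)
    return None
```

For example, `"...experts.3.gate_up_proj_packed"` now remaps the same way `"...experts.3.gate_up_proj"` does, with the `_packed` suffix carried through.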
Relationship to #39406

PR #39406 makes the fallthrough path more robust by skipping unrecognized keys instead of crashing. This PR is complementary: it fixes the root cause so that quantized weight names are correctly generated and mapped in the first place, rather than falling through to the error path.
Relationship to #39460
PR #39460 fixes GPTQ-Marlin-specific issues (ceiling division for scales, quantized gate weights, GELU activation). This PR fixes the upstream weight-name mapping that affects all compressed-tensors quantized Gemma 4 MoE checkpoints.
Test plan
- `cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit` (compressed-tensors W4A16) on a single RTX 3090 (24 GB VRAM)
- `google/gemma-4-26B-A4B-it` loading still works (needs 80 GB GPU)

AI assistance was used during development (Claude). All changes were reviewed and tested by the submitter.
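For reference, a smoke test along these lines exercises the quantized load path (the model name comes from the test plan above; the other settings are illustrative, and quantization is auto-detected from the checkpoint config):

```python
# Quick smoke test with the vLLM offline API. Model name comes from the
# PR's test plan; sampling and memory settings here are arbitrary.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    max_model_len=4096,          # keep the KV cache small for a 24 GB GPU
    gpu_memory_utilization=0.9,  # default; shown for clarity
)
outputs = llm.generate(
    ["Explain mixture-of-experts routing in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```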