[Bugfix][Gemma4] Fix AutoRound GPTQ router and INC tail shards by lesj0610 · Pull Request #41588 · vllm-project/vllm

lesj0610 · 2026-05-04T01:08:15Z

Summary

Fix the remaining Gemma4 AutoRound GPTQ runtime pieces that are independent from MoE activation support:

instantiate the Gemma4 router projection with quantization metadata
dequantize AutoRound/GPTQ router qweight/qzeros/scales into the dense GateLinear router weight during loading
add an INC/GPTQ row-parallel fallback for TP shards that cross quantization group boundaries instead of routing those shards through Marlin/GPTQ kernels that assume group alignment

This PR intentionally does not include Gemma4 MoE gelu_tanh activation wiring or broad packed expert weight-name remapping.

Duplicate-work check

Checked open PRs before opening:

[Model] Fix Gemma4 MoE activation mismatch #41574 covers Gemma4 MoE gelu_tanh activation wiring. This PR excludes that activation change.
[Bugfix][Gemma4] Fix quantized MoE weight loading and KV cache spec merge #39582 covers Gemma4 quantized MoE packed/scale expert weight loading and KV cache spec merge. This PR excludes that broad expert mapping/KV-cache scope. It includes router quantization runtime handling, which [Bugfix][Gemma4] Fix quantized MoE weight loading and KV cache spec merge #39582 does not complete.
Fix gemma4 core: quantized inference + gguf loading + fused moe gelu_tanh #40281 is my older monolithic PR containing mixed Gemma4 quant/GGUF/activation changes. This split PR supersedes only its AutoRound quant-runtime portion.

Tests

.venv/bin/python -m py_compile \
  vllm/model_executor/layers/quantization/inc.py \
  vllm/model_executor/models/gemma4.py \
  tests/quantization/test_auto_round.py

pre-commit run ruff-check --files \
  tests/quantization/test_auto_round.py \
  vllm/model_executor/layers/quantization/inc.py \
  vllm/model_executor/models/gemma4.py

pre-commit run ruff-format --files \
  tests/quantization/test_auto_round.py \
  vllm/model_executor/layers/quantization/inc.py \
  vllm/model_executor/models/gemma4.py

.venv/bin/python -m pytest \
  tests/quantization/test_auto_round.py::test_autoround_gptq_router_weight_dequantizes_symmetric_zero_point \
  tests/quantization/test_auto_round.py::test_inc_gptq_row_parallel_tail_fallback_uses_global_group_indices \
  -q

Result: targeted tests passed.

AI assistance was used to prepare and split this change; I reviewed the resulting diff and validation output.

Add a correctness fallback for INC/GPTQ row-parallel shards whose TP partition crosses group boundaries, and dequantize Gemma4 AutoRound router projection weights into GateLinear's dense parameter during load.\n\nKeep Gemma4 MoE activation and packed expert name mapping out of this branch because those are covered by separate open PRs.\n\nCo-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

gemini-code-assist

Code Review

This pull request introduces the INCGPTQRowParallelTailLinearMethod fallback to handle row-parallel GPTQ layers where tensor parallel shards are not aligned with quantization group boundaries, and implements a dequantization-on-load strategy for MoE routers in the Gemma4 model. Feedback was provided to restrict the new fallback method to layers sharded along the input dimension to ensure correct group index calculation. Additionally, it was noted that the router projection should remain unquantized in memory to avoid a KeyError during the weight loading process.

gemini-code-assist · 2026-05-04T01:10:13Z

            sym,
        )
+        if (
+            isinstance(layer, LinearBase)


The INCGPTQRowParallelTailLinearMethod fallback is specifically designed for row-parallel layers where tensor parallelism can split a quantization group across ranks. Its create_weights implementation (lines 728-731) calculates shard_offset by multiplying the rank by the partition width, which is only correct for RowParallelLinear. For ColumnParallelLinear (and its variants like QKVParallelLinear), the input dimension is not sharded, so shard_offset should be 0 for all ranks. Using this fallback for non-row-parallel layers will result in incorrect group indices (g_idx) and corrupted dequantized weights on ranks > 0. You should restrict this fallback to layers that are actually sharded along the input dimension.

Suggested change

isinstance(layer, LinearBase)

isinstance(layer, LinearBase) and getattr(layer, "input_is_parallel", False)

gemini-code-assist · 2026-05-04T01:10:14Z

            config.num_experts,
            bias=False,
            out_dtype=torch.float32,
+            quant_config=quant_config,


The router projection should remain unquantized in memory to support the dequantization-on-load strategy implemented in load_weights. Passing quant_config here causes GateLinear to be instantiated as a quantized layer (e.g., using GPTQLinearMethod), which registers qweight instead of weight. Consequently, the load_weights logic at line 1481 will fail with a KeyError when it attempts to access router_name.weight. Since the router is small, dequantizing it into a dense weight during loading is a reasonable trade-off for performance, but it requires the layer itself to be unquantized.

Suggested change

quant_config=quant_config,

quant_config=None,

lesj0610 · 2026-05-04T01:22:34Z

Closing per author workflow correction. This split will be handled in fork-only PRs before any upstream submission.

lesj0610 mentioned this pull request May 4, 2026

Fix gemma4 core: quantized inference + gguf loading + fused moe gelu_tanh #40281

Closed

mergify Bot added the bug Something isn't working label May 4, 2026

gemini-code-assist Bot reviewed May 4, 2026

View reviewed changes

lesj0610 closed this May 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bugfix][Gemma4] Fix AutoRound GPTQ router and INC tail shards#41588

[Bugfix][Gemma4] Fix AutoRound GPTQ router and INC tail shards#41588
lesj0610 wants to merge 1 commit into
vllm-project:mainfrom
lesj0610:lesj/gemma4-autoround-quant-runtime-main

lesj0610 commented May 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 4, 2026

Uh oh!

gemini-code-assist Bot May 4, 2026

Uh oh!

lesj0610 commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	isinstance(layer, LinearBase)
	isinstance(layer, LinearBase) and getattr(layer, "input_is_parallel", False)

Uh oh!

Conversation

lesj0610 commented May 4, 2026

Summary

Duplicate-work check

Tests

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

lesj0610 commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant