[Bugfix][ROCm] Fix WNA16 MoE quant config init and Qwen3-VL tie_word_embeddings#34630
laudney wants to merge 2 commits into vllm-project:main
Conversation
Both MoeWNA16Method and CompressedTensorsWNA16MoEMethod pass self.moe_quant_config to fused_experts() without ensuring it has been initialized. When it is still None, fused_experts() falls back to FUSED_MOE_UNQUANTIZED_CONFIG (use_int4_w4a16=False), so the int4 packed-weight dimension assertion fails (hidden_size 2048 != w1 1024). Add a lazy-init guard in both apply() methods so the quant config is built on first use if ensure_moe_quant_config_init() hasn't run yet.
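The lazy-init guard described above can be sketched as follows. This is a minimal, self-contained illustration of the pattern, not vLLM's actual classes: `FakeQuantMethod` and the dict returned by `get_fused_moe_quant_config` are stand-ins for the real method objects and `FusedMoEQuantConfig`.

```python
# Sketch of the lazy-init guard pattern (illustrative names, not vLLM's API).
class FakeQuantMethod:
    def __init__(self):
        # May still be None when apply() runs first, e.g. if
        # ensure_moe_quant_config_init() was bypassed.
        self.moe_quant_config = None

    def get_fused_moe_quant_config(self, layer):
        # Stand-in for the real config builder; returns a plain dict here.
        return {"use_int4_w4a16": True, "layer": layer}

    def apply(self, layer):
        # Guard: build the quant config on first use so fused_experts()
        # never sees None and falls back to the unquantized config.
        if self.moe_quant_config is None:
            self.moe_quant_config = self.get_fused_moe_quant_config(layer)
        return self.moe_quant_config


method = FakeQuantMethod()
cfg = method.apply("layer0")
```

After the first `apply()` call the config is cached, so subsequent calls reuse the same object instead of rebuilding it.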
Some Qwen3-VL MoE configs lack tie_word_embeddings, causing an AttributeError during model init. Use getattr with a False default.
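The defensive access works as in this small sketch; `SimpleNamespace` stands in for a checkpoint config object that happens to be missing the field.

```python
from types import SimpleNamespace

# A config object that lacks the tie_word_embeddings field entirely,
# as some Qwen3-VL MoE checkpoints do.
config = SimpleNamespace(hidden_size=2048)

# Direct access (config.tie_word_embeddings) would raise AttributeError;
# getattr with a False default degrades gracefully instead.
tie = getattr(config, "tie_word_embeddings", False)
```

When the field is present, `getattr` returns its actual value, so behavior is unchanged for configs that set it.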
Code Review
This pull request introduces defensive bug fixes for ROCm/RDNA4 environments and Qwen3-VL MoE models. Specifically, it addresses an issue where the FusedMoEQuantConfig was not initialized before the first apply() call in the WNA16 quantization path, which could lead to incorrect kernel execution or failures during the first inference pass, especially when using torch.compile. Additionally, it adds safety to the tie_word_embeddings attribute access in Qwen3MoeLLMForCausalLM to prevent AttributeError when the field is missing from checkpoint configurations. These changes improve the robustness of the model executor without altering existing behavior for standard configurations.
```python
if self.moe_quant_config is None:
    self.moe_quant_config = self.get_fused_moe_quant_config(layer)
```
The lazy initialization of moe_quant_config here is critical for correctness when the standard initialization sequence is bypassed, such as during the first compiled forward pass. Without this, fused_experts would default to an unquantized configuration, leading to incorrect results for WNA16 quantized layers.
```python
if self.moe_quant_config is None:
    self.moe_quant_config = self.get_fused_moe_quant_config(layer)
```
Similar to the fix in compressed_tensors_moe.py, this lazy initialization ensures that the quantization configuration is available before the first kernel invocation. This is particularly important for backends that rely on fused_experts receiving a valid quant_config to select the appropriate optimized kernels.
```diff
     prefix=maybe_prefix(prefix, "lm_head"),
 )
-if self.config.tie_word_embeddings:
+if getattr(self.config, "tie_word_embeddings", False):
```
Do you have this PR in your branch? I think that it should have solved the quant config issue.
Thanks for the pointer — PR #34371 does cover the WNA16 quant config init issue. The other change here (the defensive `getattr` for `tie_word_embeddings`) is separate from that PR.
Summary
Two small bug fixes found while testing on ROCm/RDNA4:
- Lazy quant config init in `apply()`: the `FusedMoEQuantConfig` was not being set up before the first forward pass in the WNA16 quantization path, causing failures on first inference.
- `tie_word_embeddings` `AttributeError`: some Qwen3-VL MoE checkpoint configs lack the `tie_word_embeddings` field entirely. Changed direct attribute access to `getattr(..., False)` for safety.

Both fixes are defensive and should not affect existing behavior on any platform.
Test plan
- Verified model init with `tie_word_embeddings` missing from the config