@@ -1979,6 +1979,12 @@ def apply(
) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]:
from vllm.model_executor.layers.fused_moe import fused_experts

# Lazy init: moe_quant_config may not yet be set if
# ensure_moe_quant_config_init() hasn't run (e.g. during the first
# compiled forward pass with piecewise backends).
if self.moe_quant_config is None:
self.moe_quant_config = self.get_fused_moe_quant_config(layer)
Comment on lines +1985 to +1986 — Contributor review (severity: high):

The lazy initialization of moe_quant_config here is critical for correctness when the standard initialization sequence is bypassed, such as during the first compiled forward pass. Without it, fused_experts would fall back to an unquantized configuration, producing incorrect results for WNA16-quantized layers.

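The guard pattern the comment describes can be sketched in isolation. This is a hypothetical minimal example, not vLLM's actual class: `QuantMethod` and its config dict are stand-ins, and only the `if self.moe_quant_config is None` check mirrors the diff above.

```python
class QuantMethod:
    """Hypothetical stand-in for a vLLM quantization method class."""

    def __init__(self):
        # Normally set by an explicit init step; a compiled first
        # forward pass may reach apply() before that step runs.
        self.moe_quant_config = None

    def get_fused_moe_quant_config(self, layer):
        # Stand-in for the real per-layer config builder.
        return {"weight_bits": 4, "group_size": getattr(layer, "group_size", 128)}

    def apply(self, layer):
        # Lazy init: build the config on first use if the usual
        # initialization sequence was skipped.
        if self.moe_quant_config is None:
            self.moe_quant_config = self.get_fused_moe_quant_config(layer)
        # The real apply() would pass this config on to the fused
        # kernel; here we just return it for illustration.
        return self.moe_quant_config
```

Because the attribute is cached after the first call, repeated invocations reuse the same config object rather than rebuilding it.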

return fused_experts(
x,
layer.w13_weight_packed,
6 changes: 6 additions & 0 deletions vllm/model_executor/layers/quantization/moe_wna16.py
@@ -376,6 +376,12 @@ def apply(
f"Only SiLU activation is supported, not {layer.activation}."
)

# Lazy init: moe_quant_config may not yet be set if
# ensure_moe_quant_config_init() hasn't run (e.g. during the first
# compiled forward pass with piecewise backends).
if self.moe_quant_config is None:
self.moe_quant_config = self.get_fused_moe_quant_config(layer)
Comment on lines +382 to +383 — Contributor review (severity: high):

Similar to the fix in compressed_tensors_moe.py, this lazy initialization ensures the quantization configuration is available before the first kernel invocation. This matters most for backends that rely on fused_experts receiving a valid quant_config to select the appropriate optimized kernels.


return fused_experts(
x,
layer.w13_qweight,
2 changes: 1 addition & 1 deletion vllm/model_executor/models/qwen3_vl_moe.py
@@ -341,7 +341,7 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
quant_config=self.quant_config,
prefix=maybe_prefix(prefix, "lm_head"),
)
if self.config.tie_word_embeddings:
if getattr(self.config, "tie_word_embeddings", False):
Contributor review (severity: high):

Using getattr with a default of False is a safer way to access tie_word_embeddings. It prevents AttributeError crashes when loading checkpoints that omit this field from their configuration, which has been observed in some Qwen3-VL MoE variants.

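The defensive access pattern from this change can be shown on its own. A minimal sketch, assuming a HuggingFace-style config object; `SimpleNamespace` and `should_tie_weights` are illustrative stand-ins, not vLLM code.

```python
from types import SimpleNamespace


def should_tie_weights(config) -> bool:
    # getattr with a default avoids AttributeError for checkpoints
    # whose config omits the tie_word_embeddings field entirely.
    return getattr(config, "tie_word_embeddings", False)


# Config that explicitly sets the field, as most checkpoints do.
full_cfg = SimpleNamespace(tie_word_embeddings=True)

# Config missing the field, as in some Qwen3-VL MoE variants;
# direct attribute access here would raise AttributeError.
sparse_cfg = SimpleNamespace()
```

Defaulting to False is the conservative choice: when the checkpoint is silent, the model keeps separate embedding and lm_head weights rather than tying them.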
self.lm_head.weight = self.model.embed_tokens.weight
self.logits_processor = LogitsProcessor(self.config.vocab_size)
self.make_empty_intermediate_tensors = (