vllm/model_executor/layers/fused_moe/runner/moe_runner.py (6 changes: 0 additions, 6 deletions)
```diff
@@ -550,14 +550,10 @@ def forward(
             hidden_states
         )

-        # Record before `_maybe_pad_hidden_states` pads activations to match
-        # `moe_config.hidden_dim`, e.g. after `align_trtllm_fp4_moe_hidden_dim_for_fi`
-        routed_hidden_dim = hidden_states.shape[-1]
         hidden_states, og_hidden_dim = self._maybe_pad_hidden_states(
             shared_experts_input,
             hidden_states,
         )
```
**Comment on lines 553 to 556 — Contributor (severity: high):**

Reverting the padding detection logic reintroduces a bug in which `fused_output` and `shared_output` have mismatched dimensions when padding is applied (Fixes #35949). Instead of a full revert, these variables should be retained so the routed output can be unpadded before the outputs are combined:

```python
        # Record before `_maybe_pad_hidden_states` pads activations to match
        # `moe_config.hidden_dim`, e.g. after `align_trtllm_fp4_moe_hidden_dim_for_fi`
        routed_hidden_dim = hidden_states.shape[-1]
        hidden_states, og_hidden_dim = self._maybe_pad_hidden_states(
            shared_experts_input,
            hidden_states,
        )
        hidden_dim_was_padded = hidden_states.shape[-1] > routed_hidden_dim
```
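
For illustration, a minimal standalone sketch of the mismatch these variables guard against (plain `torch` with hypothetical sizes, not vLLM code): the routed path gets padded up to `moe_config.hidden_dim` while the shared-expert output keeps the original width, so the two cannot be combined until the routed output is sliced back.

```python
# Standalone sketch of the shape mismatch (hypothetical sizes, not vLLM code).
import torch

routed_hidden_dim = 2560  # activation width before padding (assumed)
padded_hidden_dim = 2880  # stand-in for moe_config.hidden_dim (assumed)

fused_output = torch.randn(4, padded_hidden_dim)   # routed path, padded
shared_output = torch.randn(4, routed_hidden_dim)  # shared path, unpadded

# shared_output + fused_output here would raise a size-mismatch RuntimeError,
# since (4, 2560) and (4, 2880) do not broadcast.
hidden_dim_was_padded = fused_output.shape[-1] > routed_hidden_dim
if hidden_dim_was_padded:
    fused_output = fused_output[..., :routed_hidden_dim]  # unpad routed output

combined = shared_output + fused_output  # both (4, 2560) now
```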

```diff
-        hidden_dim_was_padded = hidden_states.shape[-1] > routed_hidden_dim

         result = self._forward_entry(
             hidden_states,
@@ -577,8 +573,6 @@ def forward(

         # Extract outputs from result
         shared_output, fused_output = _unpack(result)
```

**Comment on `shared_output, fused_output = _unpack(result)` — Contributor (severity: high):**

The RuntimeError and garbled output reported in CI are caused by the non-contiguous tensor produced by slicing `fused_output`. Adding `.contiguous()` after the slice resolves these issues while preserving the fix for the shape mismatch when the shared and routed expert outputs are added.

Suggested change:
```diff
         shared_output, fused_output = _unpack(result)
+        if hidden_dim_was_padded:
+            fused_output = fused_output[..., :routed_hidden_dim].contiguous()
```
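
To see why the slice needs `.contiguous()`, a minimal standalone sketch (plain `torch`, hypothetical sizes, not vLLM code): narrowing the last dimension keeps the original strides, so the result is a non-contiguous view, which kernels that require densely packed inputs reject.

```python
# Standalone sketch: why slicing the last dim yields a non-contiguous tensor.
import torch

padded = torch.randn(4, 2880)  # hypothetical padded activations
view = padded[..., :2560]      # shape (4, 2560), but stride is still (2880, 1)
print(view.is_contiguous())    # False: rows are no longer densely packed
fixed = view.contiguous()      # copies into a densely packed buffer
print(fixed.is_contiguous())   # True
print(fixed.stride())          # (2560, 1)
```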

```diff
-        if hidden_dim_was_padded:
-            fused_output = fused_output[..., :routed_hidden_dim]

         # If combine kernel already reduced fused, reduce shared to match.
         # See note above re: the two all-reduce points.
```