[Bugfix] Restore moe_forward output shape invariant on TRTLLM MXFP4 path #41646
vllm-bot merged 2 commits into vllm-project:main from
Conversation
Code Review
This pull request updates `trtllm_mxfp4_moe.py` by replacing `self.hidden_dim_unpadded` with `self.hidden_dim` for the output tensor allocation in the `apply` method and by updating the `workspace_shapes` calculation to use `K`. I have no feedback to provide, as there were no review comments to evaluate.
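For context, a rough sketch of what the allocation change described above could look like; the class and constructor here are assumptions inferred from this comment, not the actual diff in `trtllm_mxfp4_moe.py`:

```python
import torch


class TrtLlmMxfp4MoE:
    """Hypothetical stand-in for the experts class in trtllm_mxfp4_moe.py."""

    def __init__(self, hidden_dim: int, hidden_dim_unpadded: int):
        self.hidden_dim = hidden_dim                    # kernel-aligned (padded) width
        self.hidden_dim_unpadded = hidden_dim_unpadded  # model's logical width

    def apply(self, hidden_states: torch.Tensor) -> torch.Tensor:
        num_tokens = hidden_states.shape[0]
        # Before: output = hidden_states.new_empty((num_tokens, self.hidden_dim_unpadded))
        # After: allocate at the padded hidden_dim so the runtime output shape
        # matches what the traced fake op expects under torch.compile.
        output = hidden_states.new_empty((num_tokens, self.hidden_dim))
        # ... kernel launch would write into `output` here ...
        return output
```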
Seems like there is an infra issue for the failing tests @mgoin. Should we retry them?
I retried the failed tests.
Is this test failure related?
I think so.
force-pushed from 30c2eea to 75fa232
force-pushed from eee67c9 to 93c153e
The b200 tests hit the bugged agent dgxb200-01-4. Can someone re-trigger them?
I ended up with a different approach: an explicit `hidden_dim_unpadded: int` argument on the four `_moe_forward` custom-op signatures. The fake allocates at `hidden_dim_unpadded`; the real op truncates with
I think we can get this parameter from Also I feel like the
I think we would also need to do something like this:
I chose to add the explicit parameter instead because I wasn't sure whether this would affect other consumers, but I'm happy to switch to this approach and then read the hidden dim like this:
I'm actually thinking it's better to restrict the fix to the TRT-LLM mxfp4 path. Let me work on that.
force-pushed from 54306df to 0927ec7
…t-oss MXFP4 + torch.compile

Fixes vllm-project#41645. The TRT-LLM MXFP4 experts kernel writes output at moe_config.hidden_dim_unpadded while _moe_forward_fake returned empty_like(hidden_states) at the kernel-aligned padded width, tripping inductor's assert_size_stride. Plumb the unpadded dim through the custom op signature so the fake matches the real op's allocation, gated to TRT-LLM MXFP4 only (other backends, including Cutlass MXFP4/MXFP8, write the full padded width).

Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
force-pushed from 0927ec7 to 3180762
@zyongye updated:
CI is now green.
Failing tests are not related. See #41887.
Purpose
Fixes #41645.
`gpt-oss-{20b,120b}` crashes under `torch.compile` with `tensor_parallel_size > 1` on Blackwell because of a fake/real shape mismatch in `vllm.moe_forward`:

- The real op (`TrtLlmMxfp4Experts{Monolithic,Modular}.apply`) writes output at `moe_config.hidden_dim_unpadded` (gpt-oss: 2880).
- `_moe_forward_fake` returned `torch.empty_like(hidden_states)`, whose last dim is the padded `hidden_dim` (gpt-oss: 3072 after the kernel-alignment pad in `_maybe_pad_hidden_states`).

When inductor's `assert_size_stride` checks the runtime tensor against the traced fake, it fires. The divergence was introduced by #40960, which added the kernel-alignment padding without updating the fake.
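For illustration, a minimal, self-contained sketch of the mismatch; the shapes mirror the gpt-oss numbers above, but everything else is hypothetical and not the actual vLLM code path:

```python
import torch

# Illustrative shapes only.
PADDED_HIDDEN = 3072    # kernel-aligned hidden_dim after _maybe_pad_hidden_states
UNPADDED_HIDDEN = 2880  # moe_config.hidden_dim_unpadded

hidden_states = torch.randn(8, PADDED_HIDDEN)

# What _moe_forward_fake traced: the same (padded) shape as the input.
fake_out = torch.empty_like(hidden_states)              # (8, 3072)

# What the TRT-LLM MXFP4 kernel actually produces: the unpadded width.
real_out = hidden_states.new_empty(8, UNPADDED_HIDDEN)  # (8, 2880)

# Inductor's runtime guard compares the real tensor against the traced fake,
# analogous to the assert_size_stride failure described above.
assert real_out.shape == fake_out.shape, (
    f"shape mismatch: real {tuple(real_out.shape)} vs fake {tuple(fake_out.shape)}"
)
```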
Approach
`vllm/model_executor/layers/fused_moe/runner/moe_runner.py`

Add `hidden_dim_unpadded: int` to the `_moe_forward`/`_moe_forward_shared` (and matching fake) signatures. The caller in `MoERunner.forward` computes the value via `_trtllm_mxfp4_unpadded_dim`: it returns `moe_config.hidden_dim_unpadded` only when the active backend is a `TrtLlmMxfp4ExpertsBase`, else 0. The fake allocates the narrow shape when the int is positive, else falls through to `empty_like(hidden_states)`.

Computing the discriminator caller-side rather than peeking at layer state inside the fake is necessary: doing the isinstance check inside `_moe_forward_fake` specializes the fake per `layer_name` and breaks `torch.compile` subgraph dedup (`tests/compile/h100/test_startup.py::test_moe_startup` is the canary that catches this). `moe_config.hidden_dim_unpadded` alone is also insufficient: it encodes the model's logical hidden dim, not whether the active kernel narrows its output. Cutlass MXFP4/MXFP8 writes the full padded width and would be mis-classified.
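A hedged sketch of that plumbing, with heavily simplified signatures (the real `_moe_forward`/`_moe_forward_fake` take more arguments, and `TrtLlmMxfp4ExpertsBase` below is only a placeholder class for the backend named in the text):

```python
import torch


class TrtLlmMxfp4ExpertsBase:
    """Placeholder standing in for the real TRT-LLM MXFP4 experts base class."""


def _trtllm_mxfp4_unpadded_dim(moe_config, experts) -> int:
    # Caller-side discriminator: only the TRT-LLM MXFP4 backend narrows its
    # output, so only then is the unpadded width passed to the custom op.
    if isinstance(experts, TrtLlmMxfp4ExpertsBase):
        return moe_config.hidden_dim_unpadded
    return 0


def _moe_forward_fake(hidden_states: torch.Tensor,
                      hidden_dim_unpadded: int) -> torch.Tensor:
    # A positive value means the real kernel writes the narrow (unpadded)
    # width, so the fake must allocate the same shape; otherwise fall back
    # to the old behaviour of mirroring the (padded) input.
    if hidden_dim_unpadded > 0:
        return hidden_states.new_empty(
            (*hidden_states.shape[:-1], hidden_dim_unpadded))
    return torch.empty_like(hidden_states)
```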