[Bugfix] Fix some issues with MoERunner PR #32344 (#34371)
vllm-bot merged 4 commits into vllm-project:main
Conversation
Signed-off-by: Bill Nell <bnell@redhat.com>
Code Review
This pull request introduces two key bug fixes for the MoE runner. First, it corrects the handling of the gate property in FusedMoE by checking the use_overlapped flag, ensuring the router is invoked correctly based on whether shared expert computation is overlapped. This aligns the gate's behavior with the shared_experts property. Second, it moves the ensure_moe_quant_config_init call into the _moe_forward and _moe_forward_shared custom op implementations. This is a good change to prevent issues with torch.compile by ensuring that this initialization logic with side effects is executed at runtime rather than during graph tracing. The changes are well-reasoned and improve the correctness and robustness of the MoE implementation. I have no further comments.
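The gate fix described above mirrors the existing `shared_experts` property. A minimal sketch of that pattern, using an illustrative stand-in class (not the actual vLLM `FusedMoE` implementation):

```python
class OverlappedMoELayer:
    """Illustrative stand-in for a fused-MoE layer with optional
    overlapped shared-expert computation (not the real vLLM class)."""

    def __init__(self, gate, shared_experts, use_overlapped: bool):
        self._gate = gate
        self._shared_experts = shared_experts
        self.use_overlapped = use_overlapped

    @property
    def shared_experts(self):
        # Shared experts are exposed to the caller only when their
        # computation is overlapped with the routed-expert computation.
        return self._shared_experts if self.use_overlapped else None

    @property
    def gate(self):
        # The fix: check use_overlapped before returning _gate, so the
        # gate's visibility matches the shared_experts property instead
        # of returning _gate unconditionally.
        return self._gate if self.use_overlapped else None
```

Whether the real code gates on `use_overlapped` or its negation is a vLLM implementation detail; the point of the fix is that both properties are now guarded by the same flag.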
This seems to fix my gpt-oss H200 issue. Main: fails. This PR: works.
Signed-off-by: Bill Nell <bnell@redhat.com>
…roject#34371) Signed-off-by: Bill Nell <bnell@redhat.com>
The code fix landed via vllm-project#34371 (31d992d). This adds a regression test to prevent future regressions: test_w4a16_moe_torch_compile loads a W4A16 MoE model with enforce_eager=False and verifies inference succeeds without the "Hidden size mismatch" assertion error. Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
…roject#34371) Signed-off-by: Bill Nell <bnell@redhat.com> Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
…roject#34371) Signed-off-by: Bill Nell <bnell@redhat.com>
Purpose
- Move the `ensure_moe_quant_config` call from `FusedMoE.forward_native` into `_moe_forward` and `_moe_forward_shared`. This is closer to how it was before, when it was hidden inside the custom op, and should avoid torch.compile issues.
- Fix handling of `gate`. The `use_overlapped` flag should have been checked before returning `_gate`.

Possible fix for #34357
Test Plan
- Ran `openai/gpt-oss-20b`
- Tested #34357; was able to repro it with a revision earlier than #32344
- Ran `nvidia/DeepSeek-R1-NVFP4`

Test Result
cc @mgoin , @robertgshaw2-redhat