[Bugfix] Add regression test for MoE quant_config under torch.compile#34335
Conversation
Code Review
This pull request correctly fixes a bug where the MoE quantization configuration was not being initialized at runtime under `torch.compile`. The issue stems from Dynamo not replaying attribute-mutation side effects from a traced function. The fix, moving the initialization call into `DefaultMoERunner.forward_impl` (a function executed eagerly within a custom op), is sound and directly addresses the problem. The targeted regression test (`test_w4a16_moe_torch_compile`) is excellent: it reproduces the failure and validates the fix. The changes are minimal, well-commented, and effective.
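To make the "attribute mutation not replayed" failure mode concrete, here is a stdlib-only analogy (all names hypothetical, not vLLM or Dynamo code): a toy "compiler" that traces a forward pass once and then replays only the recorded dataflow, discarding the side effect that initializes the config attribute.

```python
# Toy analogy of the bug: a tracer records the arithmetic of forward(),
# but the attribute mutation performed inside it is not replayed at runtime.

class Layer:
    def __init__(self):
        self.quant_config = None  # analogous to moe_quant_config

    def ensure_init(self):
        # Side effect: an attribute mutation, not part of the dataflow.
        if self.quant_config is None:
            self.quant_config = {"use_int4": True}

    def forward(self, x):
        self.ensure_init()  # side effect
        return x * 2        # dataflow: the only thing a tracer records


def toy_compile(layer):
    """Toy 'compiler': trace forward once, then replay only the dataflow.

    The replayed graph multiplies by 2 but never re-runs ensure_init,
    mirroring how the traced function's mutation is lost at runtime."""
    layer.forward(0)           # tracing pass
    layer.quant_config = None  # fresh runtime state: trace-time mutation not kept
    return lambda x: x * 2     # replayed graph: dataflow only


eager = Layer()
eager.forward(3)
assert eager.quant_config == {"use_int4": True}  # eager call initializes config

compiled_layer = Layer()
run = toy_compile(compiled_layer)
assert run(3) == 6                         # arithmetic replays fine...
assert compiled_layer.quant_config is None  # ...but the config was never set
```

Moving the initialization into code that always runs eagerly (here, anything outside the traced region; in the PR, `forward_impl` inside the custom op) sidesteps this entirely.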
```python
# is compiled by torch.compile/dynamo, and the attribute mutation
# side effect is not replayed at runtime. forward_impl runs inside
# the moe_forward custom op, so it is not compiled by dynamo.
layer.ensure_moe_quant_config_init()
```
Can you try putting this in `_moe_forward` and `_moe_forward_shared` instead?
This should be fixed on main now.
Right, the fix landed on main via #34371 (31d992d).
Sure, sounds good to me.
The code fix landed via vllm-project#34371 (31d992d). This PR adds the regression test: `test_w4a16_moe_torch_compile` loads a W4A16 MoE model with `enforce_eager=False` and verifies inference succeeds without the "Hidden size mismatch" assertion error. Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Done. This PR now only brings in the regression test.
yewentao256 left a comment
LGTM, thanks for the work!
Update: The fix landed separately via #34371 and this PR only adds the regression test.
Summary
After the MoE refactor (#32344), W4A16 models fail with `AssertionError: Hidden size mismatch 2048 != 1024` under torch.compile.

This is because `ensure_moe_quant_config_init()` is called in `FusedMoE.forward_native()`. When torch.compile is active, `forward_native` is traced by Dynamo, but the side effect of setting `self.quant_method.moe_quant_config` (an attribute mutation) is not replayed at runtime. This causes `moe_quant_config` to remain `None` when `DefaultMoERunner.forward_impl` executes inside the `moe_forward` custom op at runtime.

For W4A16-quantized MoE models (e.g. AWQ 4-bit), this means `use_int4_w4a16` is `False` instead of `True`, causing the assertion `hidden_states.size(1) == w1.size(2)` to fail because packed 4-bit weights have half the expected dimension.
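The specific numbers in the error (2048 vs. 1024) follow directly from 4-bit packing: two int4 values fit in one byte, so the packed dimension is half the logical one. A stdlib-only illustration (the `pack_int4` helper is hypothetical, not vLLM code):

```python
# Why packed 4-bit weights have half the expected size along the packed
# dimension: two int4 values (0..15) are stored per byte, so 2048 logical
# weights occupy only 1024 bytes.

def pack_int4(values):
    """Pack pairs of 4-bit values into single bytes (high nibble first)."""
    assert len(values) % 2 == 0
    return bytes((values[i] << 4) | values[i + 1]
                 for i in range(0, len(values), 2))

hidden_size = 2048
weights = [i % 16 for i in range(hidden_size)]  # toy int4 weight row
packed = pack_int4(weights)

assert len(packed) == 1024  # 2048 != 1024: the "Hidden size mismatch"
# A kernel that believes the weights are unpacked (use_int4_w4a16=False)
# compares the activation hidden size 2048 against the packed dimension
# 1024 and trips the assertion.
```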
Fix: call `layer.ensure_moe_quant_config_init()` at the start of `DefaultMoERunner.forward_impl()`, which runs inside the `moe_forward` custom op and is therefore not compiled by Dynamo.

Reproducer:
Test plan

- Added `test_w4a16_moe_torch_compile`, which loads a tiny W4A16 MoE model (`nm-testing/tinysmokeqwen3moe-W4A16-first-only-CTstable`) with `enforce_eager=False` and verifies that inference succeeds.
- Verified the test fails without the code fix (`AssertionError: Hidden size mismatch`) and passes with it.
- Ran `tests/kernels/moe/test_moe.py` (537 passed, 1 unrelated OOM skip).