Skip to content

[ROCm][Bugfix] Re-tag AITER MoE weights as preshuffled after replace_parameter#42061

Merged
vllm-bot merged 2 commits into
vllm-project:mainfrom
maeehart:fix-aiter-fp8-moe-shuffle-marker
May 9, 2026
Merged

[ROCm][Bugfix] Re-tag AITER MoE weights as preshuffled after replace_parameter#42061
vllm-bot merged 2 commits into
vllm-project:mainfrom
maeehart:fix-aiter-fp8-moe-shuffle-marker

Conversation

@maeehart

@maeehart maeehart commented May 8, 2026

Copy link
Copy Markdown
Contributor

Summary

In Fp8MoEMethod, Mxfp4MoEMethod, GptOssMxfp4MoEMethod, and UnquantizedFusedMoEMethod, the AITER MoE backends call rocm_aiter_ops.shuffle_weights() (FP8 / unquantized via aiter.ops.shuffle.shuffle_weight) or rocm_aiter_ops.shuffle_weight_a16w4() (MXFP4 BF16) to lay the MoE weights out for AITER's tuned 2-stage CK kernels.

The follow-up replace_parameter(layer, "w13_weight", ...) calls then wrap the shuffled tensors in fresh nn.Parameter instances, which do not propagate the custom is_shuffled = True Python attribute that the shuffle helper attaches to the inner tensor. As a result, AITER's runtime kernel selection reads getattr(layer.w13_weight, "is_shuffled", False) -> False, falls back to the non-tuned preshuffle-off (Nswizzle0) path, and emits

[aiter] [fused_moe] tuned config found for (...) but is_shuffled=False.
Tuned kernels are optimized for preshuffled weights (preshuffle_on).
Running with preshuffle_off may produce incorrect results.

once per MoE layer per worker. The Quark MoE method already handles this correctly (vllm/model_executor/layers/quantization/quark/quark_moe.py:1286-1287, which sets layer.w{13,2}_weight.is_shuffled = True after the equivalent shuffle_weights() + replace_parameter).

This PR re-tags the layer Parameters explicitly after the replace_parameter calls in the four affected MoE methods, bringing them in line with the Quark path.

Files changed

  • vllm/model_executor/layers/quantization/fp8.pyFp8MoEMethod (1 site)
  • vllm/model_executor/layers/quantization/mxfp4.pyGptOssMxfp4MoEMethod._setup_kernel and Mxfp4MoEMethod._setup_kernel (2 sites; both use the identical _setup_kernel block, so the patch is the same)
  • vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.pyUnquantizedFusedMoEMethod._setup_kernel (1 site)

AITER_MXFP4_FP8 is intentionally excluded: that backend uses Triton kernels (not AITER's 2-stage CK kernels) and does not go through the AITER is_shuffled check at runtime. The patch gates on Mxfp4MoeBackend.AITER_MXFP4_BF16 specifically.

Root cause walk-through (FP8 case, others analogous)

  1. vllm/model_executor/layers/fused_moe/oracle/fp8.py:445:
    elif fp8_backend == Fp8MoeBackend.AITER:
        w13, w2 = rocm_aiter_ops.shuffle_weights(w13, w2)
    shuffle_weights() returns tensors with tensor.is_shuffled = True (set in aiter/ops/shuffle.py).
  2. vllm/model_executor/layers/quantization/fp8.py:765:
    replace_parameter(layer, "w13_weight", w13)
    replace_parameter(layer, "w2_weight", w2)
    replace_parameter wraps the data in nn.Parameter(w13, requires_grad=False). The new Parameter is a separate Python object; Parameter.is_shuffled raises AttributeError, so getattr(..., False) -> False.
  3. At runtime, AITER's dispatcher (aiter/fused_moe.py) does
    is_shuffled = getattr(w1, "is_shuffled", False)
    ...
    if not is_shuffled and not run_1stage:
        logger.warning(f"[fused_moe] tuned config found for {keys} but is_shuffled=False ...")
    The data IS in the preshuffled layout (we ran the shuffle in step 1), but AITER doesn't know that, so it picks the non-tuned Nswizzle0 kernel variant.

Test plan

Empirical verification on DeepSeek-V3.2-Exp w8a8 block-scale FP8 running with --tensor-parallel-size 4 on MI355X:

warnings per process tuned 2-stage kernel selected
before 24 (is_shuffled=False ... may produce incorrect results) Nswizzle0 (preshuffle-off fallback)
after 0 Nswizzle1 (preshuffle-on, from aiter/configs/model_configs/a8w8_blockscale_tuned_fmoe_ds_v3.csv)

lm_eval gsm8k (num_concurrent=1, max_gen_toks=8192, --limit 24) flexible-extract: 0.9583 (no regression vs. unpatched).

For MXFP4 / unquantized, the patch is structurally identical and verified by code review against the Quark precedent (quark_moe.py:1286-1287); empirical confirmation will follow as we run MXFP4-BF16 / unquantized MoE workloads through the same harness.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions

github-actions Bot commented May 8, 2026

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify mergify Bot added rocm Related to AMD ROCm bug Something isn't working labels May 8, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD May 8, 2026
@mergify

mergify Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

Hi @maeehart, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

…parameter

In `Fp8MoEMethod`, `Mxfp4MoEMethod`, `GptOssMxfp4MoEMethod` and
`UnquantizedFusedMoEMethod`, the AITER MoE backends call
`rocm_aiter_ops.shuffle_weights()` (FP8 / unquantized via
`aiter.ops.shuffle.shuffle_weight()`) or `rocm_aiter_ops.shuffle_weight_a16w4()`
(MXFP4 BF16) to lay weights out for AITER's tuned 2-stage CK kernels. The
follow-up `replace_parameter(layer, "w13_weight", ...)` calls then wrap the
shuffled tensors in fresh `nn.Parameter` instances, which do not propagate the
custom Python attribute (`.is_shuffled = True`) that the shuffle helpers
attach to the inner tensor. As a result, AITER's runtime kernel selection
reads `getattr(layer.w13_weight, "is_shuffled", False) -> False`, falls back
to the non-tuned preshuffle-off path, and emits

    [aiter] [fused_moe] tuned config found for (...) but is_shuffled=False.
    Tuned kernels are optimized for preshuffled weights (preshuffle_on).
    Running with preshuffle_off may produce incorrect results.

once per MoE layer per worker. The Quark MoE method already handles this
correctly (see `vllm/model_executor/layers/quantization/quark/quark_moe.py`,
which sets `layer.w13_weight.is_shuffled = True` after the equivalent
`shuffle_weights()` + `replace_parameter`).

This change re-tags the layer Parameters explicitly after `replace_parameter`
in the four affected MoE methods. `AITER_MXFP4_FP8` is excluded because that
backend uses Triton kernels (not AITER's 2-stage CK kernels) and does not go
through the AITER `is_shuffled` check.

Tested empirically on DeepSeek-V3.2-Exp w8a8 block-scale FP8 (4xTP MI355X)
with `amdsiloai/vllm-private:nightly_20260508_aiter_v0.1.13-rc5_all_silo_prs`:
- Before: 24 "preshuffle_off may produce incorrect results" warnings per
  process.
- After (with this patch): zero such warnings; AITER selects the tuned
  preshuffle-on (Nswizzle1) kernel variants from the model_configs CSVs.

Address review feedback from @dllehr-amd: trim the in-code
comments to a single line per site.

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
@maeehart maeehart force-pushed the fix-aiter-fp8-moe-shuffle-marker branch from 0240b63 to 055d4e5 Compare May 8, 2026 11:42

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request ensures that the is_shuffled attribute is correctly propagated to MoE weights when using AITER backends in unquantized, FP8, and MXFP4 quantization layers. This fix prevents the AITER kernel dispatcher from falling back to non-tuned paths due to the attribute being lost during parameter replacement. I have no feedback to provide.

Comment thread vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py Outdated

@dllehr-amd dllehr-amd left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but we don't need the large comments for a simple fix

@maeehart maeehart force-pushed the fix-aiter-fp8-moe-shuffle-marker branch from 055d4e5 to a680c58 Compare May 8, 2026 13:15

@tjtanaa tjtanaa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label May 8, 2026
@tjtanaa tjtanaa enabled auto-merge (squash) May 8, 2026 14:07
@vllm-bot vllm-bot merged commit e8f9038 into vllm-project:main May 9, 2026
78 of 81 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD May 9, 2026
weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026
…parameter (vllm-project#42061)

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
…parameter (vllm-project#42061)

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
…parameter (vllm-project#42061)

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
…parameter (vllm-project#42061)

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026
…parameter (vllm-project#42061)

Signed-off-by: Markus Hartikainen <markus.hartikainen@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants