[ROCm][DeepSeek] Enable V3.2 TP4 AITER MLA#41835
Conversation
Code Review
The pull request modifies the DeepSeek-V2 model to preserve precision for the e_score_correction_bias by only pre-casting when a specific out_dtype is configured, and updates the ROCm AITER MLA backend to support 32 heads. Feedback indicates that the change in deepseek_v2.py may render the bias casting logic unreachable under default conditions where out_dtype is None, potentially leading to kernel compatibility issues; a code suggestion was provided to ensure out_dtype is set to float32 in such cases.
vllm/model_executor/models/deepseek_v2.py (348-355)
The addition of and self.gate.out_dtype is not None at line 351 makes this block unreachable in the default case, as GateLinear is initialized with out_dtype=None just a few lines above (lines 272-276). Consequently, the e_score_correction_bias will remain in its original float32 dtype.
If the aiter kernel still requires the bias dtype to match the gating output dtype (as noted in the previous comment at line 348), this will cause a mismatch because GateLinear defaults to the weight dtype (usually bfloat16) when out_dtype is None. To preserve precision while maintaining compatibility, you should likely explicitly set self.gate.out_dtype = torch.float32 when it is None on ROCm with AITER enabled, ensuring both the logits and the bias are float32.
if (
    self.is_rocm_aiter_moe_enabled
    and self.gate.e_score_correction_bias is not None
):
    if self.gate.out_dtype is None:
        self.gate.out_dtype = torch.float32
    self.gate.e_score_correction_bias.data = (
        self.gate.e_score_correction_bias.data.to(self.gate.out_dtype)
    )
Carry forward the DeepSeek-specific TP4 fixes from Frida's original ROCm HIP graph draft while keeping the shared graph replay fixes split out in vllm-project#41816. Co-authored-by: Frida Andersson <fanderss@amd.com> Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com>
Force-pushed from d8c87ea to d3f5324.
@claude review
Two-file ROCm/DeepSeek change: removes the stale num_heads=32 block in AITER MLA, and switches the AITER biased-grouped-topk path to explicitly request fp32 router logits via gate.set_out_dtype(torch.float32) instead of casting the fp32 correction bias down to the gate weight dtype. The logic is small and targeted, but the dtype change affects DeepSeek MoE routing on all ROCm+AITER configurations, and the PR description notes that runtime accuracy validation (GSM8K plus the HIP graph path with #41816) is still pending, so a human reviewer with ROCm context should confirm before merge.
Extended reasoning...
### Overview
This PR makes two ROCm/DeepSeek-specific changes. In vllm/v1/attention/backends/mla/rocm_aiter_mla.py, _AITER_UNSUPPORTED_HEADS is changed from [32] to an empty tuple (with a ClassVar annotation), removing the runtime block on num_heads=32 (DeepSeek-V3.2 TP4). The assertion site at line 423 still exists, so the check is now effectively a no-op until a new entry is added. In vllm/model_executor/models/deepseek_v2.py, the AITER biased-grouped-topk dtype handling is reworked: previously the e_score_correction_bias was cast to gate.out_dtype or gate.weight.dtype (so the bias was demoted to the gate weight dtype, since out_dtype was unset); now gate.set_out_dtype(torch.float32) is called first, the FusedMoE is constructed with router_logits_dtype=self.gate.out_dtype, and the bias is cast to that same fp32 dtype.

### Security risks
None identified — no auth, crypto, deserialization, or input-handling code is touched. The changes are purely numerical/dtype-related on the ROCm AITER path.

### Level of scrutiny
Medium. The MoE dtype change runs on every DeepSeek model when ROCm AITER MoE is enabled and the model has topk_method == "noaux_tc", which affects routing precision and potentially performance for a popular model family. GateLinear.set_out_dtype raises if out_dtype was already set, so the call is guarded by the gate being constructed earlier in __init__ without an out_dtype argument — that ordering is correct in the current code but is a coupling worth noting. The MLA num_heads=32 change is mechanical and low-risk on its own; the AITER kernel itself will surface unsupported configurations.

### Other factors
No inline bug comments were generated by the bug-hunting system. The test plan in the PR description explicitly defers runtime/accuracy validation (DSv3.2 TP4, GSM8K 5-shot, HIP graphs with #41816) to reviewers with ROCm hardware. Co-authored split from #41760; the GSM8K numbers cited were produced with the full pre-split stack, not this PR alone. Given that this PR changes routing-dtype behavior for a critical model family and the validation matrix shows the full fix requires #41816 plus an AITER build with num_heads=32 MLA support, a human reviewer (gshtras already invoked review) is better placed to confirm correctness than I am.
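For readers without the diff open, here is a minimal, self-contained sketch of the routing-dtype behavior described above. `GateSketch` is a toy stand-in, not vLLM's `GateLinear`; the `out_dtype`/`set_out_dtype` behavior and the bf16 default are assumptions taken from the review text, and the FusedMoE/AITER kernel call is not reproduced.

```python
import torch
import torch.nn as nn


class GateSketch(nn.Module):
    """Toy stand-in for vLLM's GateLinear; not the real class."""

    def __init__(self, hidden: int, n_experts: int, weight_dtype=torch.bfloat16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_experts, hidden, dtype=weight_dtype))
        # DeepSeek's noaux_tc correction bias is stored in float32.
        self.e_score_correction_bias = nn.Parameter(
            torch.randn(n_experts, dtype=torch.float32), requires_grad=False
        )
        self.out_dtype = None  # unset by default

    def set_out_dtype(self, dtype):
        assert self.out_dtype is None, "out_dtype already set"
        self.out_dtype = dtype

    def forward(self, x):
        # Old fallback: when out_dtype is unset, logits come out in the
        # weight dtype (usually bfloat16).
        out_dtype = self.out_dtype or self.weight.dtype
        return (x.to(self.weight.dtype) @ self.weight.t()).to(out_dtype)


gate = GateSketch(hidden=16, n_experts=8)

# Behavior described for this PR: request fp32 router logits up front and keep
# the correction bias in fp32, instead of demoting the bias to bf16.
gate.set_out_dtype(torch.float32)
gate.e_score_correction_bias.data = gate.e_score_correction_bias.data.to(gate.out_dtype)

logits = gate(torch.randn(4, 16, dtype=torch.bfloat16))
assert logits.dtype == torch.float32
assert gate.e_score_correction_bias.dtype == torch.float32
```

The point of the change is that both the router logits and the correction bias end up in float32, rather than the bias being demoted to the gate weight dtype.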
| """ | ||
|
|
||
| _AITER_MIN_MLA_HEADS: Final = 16 | ||
| _AITER_UNSUPPORTED_HEADS = [32] |
If we're certain head size 32 works, we can remove this check/var entirely. cc @ganyi1996ppo since this check was added in #41217
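As a rough illustration of what emptying the entry means in practice, here is a hedged sketch of the head-count guard pattern using a simplified stand-in class; the actual guard in rocm_aiter_mla.py (including the exact semantics of `_AITER_MIN_MLA_HEADS`) may differ.

```python
from typing import ClassVar, Final


class AiterMLAHeadsCheckSketch:
    """Simplified stand-in for the guard discussed above; not the real backend."""

    # Assumed to be a lower bound on supported head counts.
    _AITER_MIN_MLA_HEADS: Final = 16
    # Previously [32]; after this PR the container is empty, so the membership
    # check below never fires until a new entry is added.
    _AITER_UNSUPPORTED_HEADS: ClassVar[tuple[int, ...]] = ()

    @classmethod
    def check_num_heads(cls, num_heads: int) -> None:
        assert num_heads >= cls._AITER_MIN_MLA_HEADS, (
            f"unsupported head_num: {num_heads}"
        )
        assert num_heads not in cls._AITER_UNSUPPORTED_HEADS, (
            f"unsupported head_num: {num_heads}"
        )


# DeepSeek-V3.2 TP4 reaches local num_heads=32; with the empty tuple this passes.
AiterMLAHeadsCheckSketch.check_num_heads(32)
```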
(Diff context from vllm/model_executor/models/deepseek_v2.py)

    router_logits_dtype=self.gate.out_dtype,
    )

    # Pre-cast the bias to match the gate output dtype so the
Can we leave this comment here?
@akii96 Can you attach lm_eval numbers with the default routing_dtype (bfloat16) and your PR (float32)? I'm not sure how much impact this has, and whether we want to unilaterally switch to float32 for all models as this PR currently does.
@Rohan138 sorry! We had MAFs on DeepSeek which were only fixed late today; with those fixes plus this PR I was able to test this. GSM8K 5-shot:
| router_logits_dtype | Filter | exact_match | Stderr |
|---|---|---|---|
| bfloat16 (default, pre-PR) | flexible-extract | 0.9378 | ± 0.0067 |
| | strict-match | 0.9166 | ± 0.0076 |
| float32 (this PR) | flexible-extract | 0.9393 | ± 0.0066 |
| | strict-match | 0.9158 | ± 0.0076 |
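For context, numbers like these are typically produced with lm-evaluation-harness. A hedged sketch of one way to run the GSM8K 5-shot comparison through its Python API is below; the checkpoint name and model_args are illustrative assumptions, and the exact command used for this PR is not stated in the thread.

```python
# Hypothetical reproduction sketch using lm-evaluation-harness's Python API;
# the checkpoint path and model_args below are placeholders, not taken from this PR.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=deepseek-ai/DeepSeek-V3.2-Exp,"  # assumed checkpoint name
        "tensor_parallel_size=4"  # TP4 as discussed in this PR
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
)
# gsm8k reports exact_match under both flexible-extract and strict-match filters.
print(results["results"]["gsm8k"])
```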
Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com>
Signed-off-by: Libin Tang <libin.tang@intel.com>
Purpose
This PR enables the remaining DeepSeek-V3.2 TP4-specific pieces needed for ROCm AITER MLA serving.
DeepSeek-V3.2 TP4 reaches local MLA `num_heads=32`. Current ROCm AITER MLA rejects that head count through `_AITER_UNSUPPORTED_HEADS = [32]`, so the model can fail before reaching the HIP graph accuracy path with `unsupported head_num: 32`. This PR removes that stale block now that the AITER-side MLA kernel support for this shape exists.

This PR also preserves DeepSeek's fp32 MoE correction bias when `GateLinear.out_dtype` is unset. The previous fallback cast `e_score_correction_bias` to the gate weight dtype, which can lose routing precision for DeepSeek's biased grouped top-k path and contribute to wrong expert routing or incoherent output.

This is the DeepSeek-specific split from Frida Andersson's original draft PR #41760. The shared ROCm AITER HIP graph replay fixes were split separately into #41816. This PR intentionally does not duplicate those graph replay changes. The full DeepSeek-V3.2 TP4 HIP-graphs fix is expected to require this PR plus #41816, along with an AITER build that includes `num_heads=32` MLA kernel support.

Special thanks to Frida for the original investigation, patch direction, and DeepSeek-V3.2 validation. This PR is a minimal split of her DeepSeek-specific work so it can be reviewed independently from the cross-model graph replay fix.
Co-authored-by: Frida Andersson fanderss@amd.com
Test Plan
ROCm runtime validation still needed:
- Confirm serving no longer fails with `unsupported head_num: 32`.
- Confirm MLA runs with local `num_heads=32`.

Suggested validation matrix:

- `main`: fails with `unsupported head_num: 32`.
- This PR: `num_heads=32` is no longer blocked, but full HIP graph accuracy may still require #41816.

Test Result
Runtime result from Frida's original full draft fix stack in #41760 showed DeepSeek-V3.2 TP4 GSM8K 5-shot recovered to:
Those numbers were produced with the full #41760 fix stack before splitting. After the split, complete runtime validation should be rerun with this PR plus #41816 applied together.