[ROCm][DeepSeek] Enable V3.2 TP4 AITER MLA #41835

Merged

gshtras merged 1 commit into vllm-project:main from akii96:fix/deepseek-v32-tp4-aiter-mla on May 7, 2026

[ROCm][DeepSeek] Enable V3.2 TP4 AITER MLA#41835
gshtras merged 1 commit intovllm-project:mainfrom
akii96:fix/deepseek-v32-tp4-aiter-mla

Conversation

@akii96 akii96 (Contributor) commented May 6, 2026

Purpose

This PR enables the remaining DeepSeek-V3.2 TP4-specific pieces needed for ROCm AITER MLA serving.

DeepSeek-V3.2 at TP4 reaches a local MLA num_heads of 32. The current ROCm AITER MLA backend rejects that head count via `_AITER_UNSUPPORTED_HEADS = [32]`, so the model can fail with `unsupported head_num: 32` before ever reaching the HIP graph accuracy path. This PR removes that stale block, now that AITER-side MLA kernel support for this shape exists.
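
For context, a minimal sketch of the kind of guard being removed. The constant names follow the diff context quoted later in this thread; the helper function and the 128 global heads are illustrative assumptions (32 local heads × TP4), and the real check lives in vllm/v1/attention/backends/mla/rocm_aiter_mla.py:

```python
# Hypothetical reconstruction of the stale guard, not the exact vLLM source.
_AITER_MIN_MLA_HEADS = 16
_AITER_UNSUPPORTED_HEADS: list[int] = []  # was [32] before this PR

def check_mla_heads(num_heads: int) -> None:
    # Reject head counts the AITER MLA kernels cannot handle.
    if num_heads < _AITER_MIN_MLA_HEADS or num_heads in _AITER_UNSUPPORTED_HEADS:
        raise AssertionError(f"unsupported head_num: {num_heads}")

# DeepSeek-V3.2 at TP4: 128 global heads / 4 GPUs = 32 local heads.
check_mla_heads(128 // 4)  # raised while [32] was listed; passes after this PR
```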

This PR also preserves DeepSeek's fp32 MoE correction bias when `GateLinear.out_dtype` is unset. The previous fallback cast `e_score_correction_bias` to the gate weight dtype (typically bfloat16), which can lose routing precision in DeepSeek's biased grouped top-k path and contribute to wrong expert routing or incoherent output.
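
A toy numeric illustration (not from the PR) of why that cast matters: bfloat16 keeps only 8 bits of mantissa, so nearby fp32 bias values can collapse to the same value and erase the margin that separates two experts.

```python
import torch

# Two fp32 bias values ~5e-4 apart. Around 0.25, bf16's representable
# spacing is 2^-9 ≈ 1.95e-3, so this gap is under half a ULP and rounds away.
bias_fp32 = torch.tensor([0.2500, 0.2505], dtype=torch.float32)
bias_bf16 = bias_fp32.to(torch.bfloat16)

print(bias_fp32[1] - bias_fp32[0])  # ~5.0e-4: the two experts stay distinct
print(bias_bf16[1] - bias_bf16[0])  # 0.0: both collapse to 0.25, a routing tie
```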

This is the DeepSeek-specific split from Frida Andersson's original draft PR #41760. The shared ROCm AITER HIP graph replay fixes were split separately into #41816. This PR intentionally does not duplicate those graph replay changes. The full DeepSeek-V3.2 TP4 HIP-graphs fix is expected to require this PR plus #41816, along with an AITER build that includes num_heads=32 MLA kernel support.

Special thanks to Frida for the original investigation, patch direction, and DeepSeek-V3.2 validation. This PR is a minimal split of her DeepSeek-specific work so it can be reviewed independently from the cross-model graph replay fix.

Co-authored-by: Frida Andersson fanderss@amd.com

Test Plan

ROCm runtime validation is still needed:

  • Launch DeepSeek-V3.2 TP4 with ROCm AITER MLA enabled.
  • Confirm the model no longer fails with `unsupported head_num: 32`.
  • Confirm the runtime image includes AITER MLA kernel support for local num_heads=32.
  • Validate with HIP graphs enabled, using this PR plus #41816 ([ROCm] Disable AITER allreduce fusion for HIP graph replay).
  • Run GSM8K 5-shot or equivalent accuracy validation.
  • Run a serving benchmark before/after if reviewers want DS TP4 perf data.

Suggested validation matrix:

| Configuration | Expected result |
| --- | --- |
| Baseline main | DSv3.2 TP4 AITER MLA can fail with `unsupported head_num: 32`. |
| This PR only | num_heads=32 is no longer blocked, but full HIP graph accuracy may still require #41816. |
| This PR + #41816 | DSv3.2 TP4 HIP-graphs path should launch and produce coherent output. |

Test Result

Runtime results from Frida's original full draft fix stack in #41760 showed DeepSeek-V3.2 TP4 GSM8K 5-shot accuracy recovered to:

| Filter | Metric | Value | Stderr |
| --- | --- | --- | --- |
| flexible-extract | exact_match | 0.9318 | ± 0.0069 |
| strict-match | exact_match | 0.9113 | ± 0.0078 |

Those numbers were produced with the full #41760 fix stack before splitting. After the split, complete runtime validation should be rerun with this PR plus #41816 applied together.

@github-actions github-actions Bot commented May 6, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify mergify Bot added the deepseek (Related to DeepSeek models), rocm (Related to AMD ROCm), and v1 labels on May 6, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD on May 6, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

The pull request modifies the DeepSeek-V2 model to preserve precision for the e_score_correction_bias by only pre-casting when a specific out_dtype is configured, and updates the ROCm AITER MLA backend to support 32 heads. Feedback indicates that the change in deepseek_v2.py may render the bias casting logic unreachable under default conditions where out_dtype is None, potentially leading to kernel compatibility issues; a code suggestion was provided to ensure out_dtype is set to float32 in such cases.

I am having trouble creating individual review comments; my feedback is below.

vllm/model_executor/models/deepseek_v2.py (lines 348-355) — severity: high

The addition of `and self.gate.out_dtype is not None` at line 351 makes this block unreachable in the default case, as GateLinear is initialized with `out_dtype=None` just a few lines above (lines 272-276). Consequently, the `e_score_correction_bias` will remain in its original float32 dtype.

If the aiter kernel still requires the bias dtype to match the gating output dtype (as noted in the previous comment at line 348), this will cause a mismatch, because GateLinear defaults to the weight dtype (usually bfloat16) when `out_dtype` is None. To preserve precision while maintaining compatibility, you should likely explicitly set `self.gate.out_dtype = torch.float32` when it is None on ROCm with AITER enabled, ensuring both the logits and the bias are float32.

```python
        if (
            self.is_rocm_aiter_moe_enabled
            and self.gate.e_score_correction_bias is not None
        ):
            if self.gate.out_dtype is None:
                self.gate.out_dtype = torch.float32
            self.gate.e_score_correction_bias.data = (
                self.gate.e_score_correction_bias.data.to(self.gate.out_dtype)
            )
```

Carry forward the DeepSeek-specific TP4 fixes from Frida's original ROCm HIP graph draft while keeping the shared graph replay fixes split out in vllm-project#41816.
Co-authored-by: Frida Andersson <fanderss@amd.com>

Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com>
@akii96 akii96 force-pushed the fix/deepseek-v32-tp4-aiter-mla branch from d8c87ea to d3f5324 on May 6, 2026 16:52
@akii96 akii96 marked this pull request as ready for review May 6, 2026 16:54
@akii96 akii96 requested a review from tjtanaa as a code owner May 6, 2026 16:54

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@gshtras gshtras (Collaborator) commented May 6, 2026

@claude review


@claude claude Bot left a comment

Two-file ROCm/DeepSeek change: removes the stale num_heads=32 block in AITER MLA, and switches the AITER biased-grouped-topk path to explicitly request fp32 router logits via gate.set_out_dtype(torch.float32) (instead of casting the fp32 correction bias down to the gate weight dtype). Logic is small and targeted, but the dtype change affects DeepSeek MoE routing on all ROCm+AITER configurations and the PR description notes runtime accuracy validation (GSM8K + HIP graph path with #41816) is still pending — worth a human reviewer with ROCm context to confirm before merge.

Extended reasoning...

### Overview

This PR makes two ROCm/DeepSeek-specific changes. In vllm/v1/attention/backends/mla/rocm_aiter_mla.py, _AITER_UNSUPPORTED_HEADS is changed from [32] to an empty tuple (with a ClassVar annotation), removing the runtime block on num_heads=32 (DeepSeek-V3.2 TP4). The assertion site at line 423 still exists, so the check is now effectively a no-op until a new entry is added. In vllm/model_executor/models/deepseek_v2.py, the AITER biased-grouped-topk dtype handling is reworked: previously the e_score_correction_bias was cast to gate.out_dtype or gate.weight.dtype (so the bias was demoted to the gate weight dtype, since out_dtype was unset); now gate.set_out_dtype(torch.float32) is called first, the FusedMoE is constructed with router_logits_dtype=self.gate.out_dtype, and the bias is cast to that same fp32 dtype.

### Security risks

None identified — no auth, crypto, deserialization, or input-handling code is touched. The changes are purely numerical/dtype-related on the ROCm AITER path.

### Level of scrutiny

Medium. The MoE dtype change runs on every DeepSeek model when ROCm AITER MoE is enabled and the model has topk_method == "noaux_tc", which affects routing precision and potentially performance for a popular model family. GateLinear.set_out_dtype raises if out_dtype was already set, so the call is guarded by the gate being constructed earlier in __init__ without an out_dtype argument — that ordering is correct in the current code but is a coupling worth noting. The MLA num_heads=32 change is mechanical and low-risk on its own; the AITER kernel itself will surface unsupported configurations.

### Other factors

No inline bug comments were generated by the bug-hunting system. The test plan in the PR description explicitly defers runtime/accuracy validation (DSv3.2 TP4, GSM8K 5-shot, HIP graphs with #41816) to reviewers with ROCm hardware. This is a co-authored split from #41760; the GSM8K numbers cited were produced with the full pre-split stack, not this PR alone. Given that this PR changes routing-dtype behavior for a critical model family, and the validation matrix shows the full fix requires #41816 plus an AITER build with num_heads=32 MLA support, a human reviewer (gshtras already invoked review) is better placed to confirm correctness than I am.
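
To make the described dtype flow concrete, a minimal standalone sketch; `ToyGate` is a hypothetical stand-in for vLLM's GateLinear, modeling only the dtype plumbing described above:

```python
import torch

class ToyGate:
    """Hypothetical stand-in for GateLinear; only dtype fields are modeled."""
    def __init__(self, weight_dtype=torch.bfloat16, out_dtype=None):
        self.weight = torch.zeros(4, 4, dtype=weight_dtype)
        self.out_dtype = out_dtype

bias = torch.randn(8, dtype=torch.float32)  # e_score_correction_bias analogue
gate = ToyGate()

# Old fallback (per the review above): with out_dtype unset, the fp32 bias
# is demoted to the gate weight dtype, usually bfloat16.
old_bias = bias.to(gate.out_dtype or gate.weight.dtype)
assert old_bias.dtype == torch.bfloat16

# New behavior: request fp32 router logits up front (standing in for
# gate.set_out_dtype(torch.float32)), so the bias cast is a no-op.
gate.out_dtype = torch.float32
new_bias = bias.to(gate.out_dtype)
assert new_bias.dtype == torch.float32
```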

@gshtras gshtras added the ready ONLY add when PR is ready to merge/full CI is needed label May 6, 2026
"""

_AITER_MIN_MLA_HEADS: Final = 16
_AITER_UNSUPPORTED_HEADS = [32]
If we're certain head size 32 works, we can remove this check/var entirely. cc @ganyi1996ppo since this check was added in #41217

Quoted diff context (vllm/model_executor/models/deepseek_v2.py):

```python
            router_logits_dtype=self.gate.out_dtype,
        )

        # Pre-cast the bias to match the gate output dtype so the
```

Can we leave this comment here?

@Rohan138 Rohan138 (Contributor) commented May 6, 2026

@akii96 Can you attach lm-eval numbers with the default router_logits_dtype (bfloat16) and your PR (float32)? I'm not sure how big the impact of this is, and whether we want to unilaterally switch to float32 for all models as this PR currently does.

@gshtras gshtras merged commit c936548 into vllm-project:main on May 7, 2026
70 of 71 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD on May 7, 2026

@akii96 akii96 (Contributor, Author) commented May 7, 2026

@Rohan138 sorry! We had MAFs on DS that were only fixed late today; with those fixes plus this PR applied, I was able to test this:

GSM8K 5-shot: router_logits_dtype bf16 vs fp32

| router_logits_dtype | Filter | exact_match | Stderr |
| --- | --- | --- | --- |
| bfloat16 (default, pre-PR) | flexible-extract | 0.9378 | ± 0.0067 |
| bfloat16 (default, pre-PR) | strict-match | 0.9166 | ± 0.0076 |
| float32 (this PR) | flexible-extract | 0.9393 | ± 0.0066 |
| float32 (this PR) | strict-match | 0.9158 | ± 0.0076 |

libinta pushed a commit to libinta/vllm that referenced this pull request May 8, 2026
Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com>
Signed-off-by: Libin Tang <libin.tang@intel.com>

Labels

deepseek (Related to DeepSeek models), ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm), v1

Projects

Status: Done


3 participants