[New Model][ROCm] Add AMD support for DeepSeek V4 #40864
whx-sjtu wants to merge 6 commits into vllm-project:main
Conversation
Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu> Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: yasong.wang <yasong.wang@inferact.ai> Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
Documentation preview: https://vllm--40864.org.readthedocs.build/en/40864/

This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request implements support for the DeepSeek V4 model, introducing its unique Multi-Head Latent Attention (MLA) architecture with horizontally-fused kernels, HyperCompressed (mHC) blocks, and a Multi-Token Predictor (MTP) draft model. The changes include new CUDA and Triton kernels for fused operations, updates to quantization methods for MXFP4 and E8M0 scales, and the addition of a dedicated tokenizer and tool parser. Feedback identifies hardcoded environment variables in the Dockerfile, a hardcoded device index in the mHC layer that could cause issues in multi-GPU setups, redundant code blocks in the MXFP4 oracle, and a duplicate assignment in the top-k bias router.
```dockerfile
&& export SCCACHE_BUCKET=inferact-sccache \
&& export SCCACHE_REGION=us-west-2 \
&& export SCCACHE_S3_NO_CREDENTIALS=0 \
```

The SCCACHE environment variables are hardcoded to a specific bucket and region (`inferact-sccache`, `us-west-2`). These should be parameterized using the build arguments `${SCCACHE_BUCKET_NAME}`, `${SCCACHE_REGION_NAME}`, and `${SCCACHE_S3_NO_CREDENTIALS}` so the Dockerfile remains generic and reusable across environments. Suggested change:

```dockerfile
&& export SCCACHE_BUCKET=${SCCACHE_BUCKET_NAME} \
&& export SCCACHE_REGION=${SCCACHE_REGION_NAME} \
&& export SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS} \
```
```python
@cache
def compute_num_split(block_k: int, k: int | None, grid_size: int) -> int:
    device_props = torch.cuda.get_device_properties(0)
```
Using a hardcoded device index `0` in `torch.cuda.get_device_properties(0)` can lead to incorrect behavior in multi-GPU environments, where the worker might be assigned to a different GPU. It should use the current device index to ensure the correct properties are retrieved for the active worker. Suggested change:

```python
device_props = torch.cuda.get_device_properties(torch.cuda.current_device())
```
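The `@cache` decorator on `compute_num_split` makes the hardcoded index worse: whichever device is queried first has its answer memoized for every later caller. A torch-free sketch of the pitfall (the device table and split formula are hypothetical stand-ins for `torch.cuda.get_device_properties`; they are not the PR's actual logic):

```python
from functools import cache

# Hypothetical stand-in for a node whose GPUs differ (e.g. SM counts).
DEVICE_SM_COUNT = {0: 304, 1: 228}

@cache
def num_split_hardcoded(block_k: int) -> int:
    # BUG: always queries device 0; the answer is then cached and
    # reused even by workers pinned to a different GPU.
    return DEVICE_SM_COUNT[0] // block_k

@cache
def num_split_per_device(block_k: int, device: int) -> int:
    # FIX: the device index is part of the cache key, mirroring
    # get_device_properties(torch.cuda.current_device()).
    return DEVICE_SM_COUNT[device] // block_k

print(num_split_hardcoded(4))      # 76, regardless of the active GPU
print(num_split_per_device(4, 0))  # 76
print(num_split_per_device(4, 1))  # 57
```

Passing the device index explicitly also keeps the memoization correct when one process manages several GPUs.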
```python
elif mxfp4_backend == Mxfp4MoeBackend.AITER:
    from vllm._aiter_ops import rocm_aiter_ops

    if w13_bias is not None:
        w13_bias = w13_bias.data.to(torch.float32)
    if w2_bias is not None:
        w2_bias = w2_bias.data.to(torch.float32)

    e, n, k = w13_weight.shape

    w13_weight.view(torch.uint8).copy_(
        w13_weight.data.view(torch.uint8)
        .view(e, n // 2, 2, k)
        .permute(0, 2, 1, 3)
        .contiguous()
        .view(e, n, k)
    )
    w13_weight_scale.data = (
        w13_weight_scale.data.view(e, n // 2, 2, -1)
        .permute(0, 2, 1, 3)
        .contiguous()
        .view(e, n, -1)
    )

    w13_weight.data = w13_weight.data.view(torch.float4_e2m1fn_x2)
    w2_weight.data = w2_weight.data.view(torch.float4_e2m1fn_x2)

    w13_weight.data = rocm_aiter_ops.shuffle_weight_a16w4(w13_weight, 16, True)
    shuffled_w13_scale = rocm_aiter_ops.shuffle_scale_a16w4(
        w13_weight_scale.view(-1, w13_weight_scale.shape[-1]),
        num_experts,
        True,
    )

    w2_weight.data = rocm_aiter_ops.shuffle_weight_a16w4(w2_weight, 16, False)
    shuffled_w2_scale = rocm_aiter_ops.shuffle_scale_a16w4(
        w2_weight_scale.view(-1, w2_weight_scale.shape[-1]),
        num_experts,
        False,
    )

    if w13_bias is not None:
        w13_bias = (
            w13_bias.data.view(-1, n // 2, 2)
            .permute(0, 2, 1)
            .contiguous()
            .view(-1, n)
        )

    return (
        w13_weight,
        w2_weight,
        shuffled_w13_scale,
        shuffled_w2_scale,
        w13_bias,
        w2_bias,
    )
```
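The `view(e, n // 2, 2, k).permute(0, 2, 1, 3)` pair in this hunk regroups interleaved row pairs `[x0, y0, x1, y1, …]` into two contiguous halves `[x0, x1, …, y0, y1, …]`, which is the usual layout change for fused gate/up projections stored interleaved. A torch-free sketch of just the row movement (the row labels are illustrative, not real weights):

```python
def deinterleave_rows(rows: list) -> list:
    """Regroup [x0, y0, x1, y1, ...] into [x0, x1, ..., y0, y1, ...].

    Mirrors .view(n // 2, 2, k).permute(1, 0, 2).reshape(n, k) on the
    row axis: even-indexed rows first, then odd-indexed rows.
    """
    return rows[0::2] + rows[1::2]

rows = ["g0", "u0", "g1", "u1", "g2", "u2"]
print(deinterleave_rows(rows))  # ['g0', 'g1', 'g2', 'u0', 'u1', 'u2']
```

The same permutation is applied to `w13_weight_scale` and (inverted on a flattened view) to `w13_bias`, keeping weights, scales, and biases row-aligned.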
|
|
||
There are three identical `elif mxfp4_backend == Mxfp4MoeBackend.AITER:` blocks in the `convert_gpt_oss_weight_to_mxfp4_moe_kernel_format` function (starting at lines 873, 934, and 992). Because the conditions are identical, only the first block can ever execute; the others are unreachable and the redundancy bloats the file and makes maintenance difficult. The duplicate blocks should be removed.
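Deleting the later copies is the direct fix; if distinct backends ever do need the same conversion, hoisting the branch body into one helper prevents the duplication from coming back. A minimal, torch-free sketch of that structure (the helper name, the float-upcast stand-in for the real `torch.float32` conversion, and the simplified signatures are hypothetical):

```python
def _aiter_postprocess(w13_bias, w2_bias):
    """Single home for the formerly-triplicated branch body.

    Stands in for the real work (bias upcast, gate/up de-interleave,
    fp4 views, aiter shuffles) with a plain float upcast.
    """
    up13 = [float(b) for b in w13_bias] if w13_bias is not None else None
    up2 = [float(b) for b in w2_bias] if w2_bias is not None else None
    return up13, up2

def convert_weights(backend, w13_bias, w2_bias):
    # One branch, one implementation: a bug fix now lands in
    # exactly one place instead of three.
    if backend == "AITER":
        return _aiter_postprocess(w13_bias, w2_bias)
    raise NotImplementedError(backend)

print(convert_weights("AITER", [1, 2], None))  # ([1.0, 2.0], None)
```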
```python
self.routed_scaling_factor = routed_scaling_factor
self.scoring_func = scoring_func
self._hash_indices_table = hash_indices_table
```

The assignment `self.scoring_func = scoring_func` is duplicated (present on both line 257 and line 259). Suggested change:

```python
self.routed_scaling_factor = routed_scaling_factor
self._hash_indices_table = hash_indices_table
```
Purpose
This PR is based on #40760 and extends DeepSeek V4 support to AMD (ROCm) hardware.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.