
[New Model][ROCm] Add AMD support for DeepSeek V4#40864

Closed
whx-sjtu wants to merge 6 commits into vllm-project:main from ROCm:hexwang/dsv4_adapt

Conversation

whx-sjtu (Contributor) commented Apr 25, 2026

Purpose

This PR builds on #40760 and extends DeepSeek V4 support to AMD (ROCm).

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

zyongye and others added 6 commits April 24, 2026 02:58
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: yasong.wang <yasong.wang@inferact.ai>
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
@whx-sjtu whx-sjtu requested a review from xuechendi as a code owner April 25, 2026 05:44
claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

mergify Bot (Contributor) commented Apr 25, 2026

Documentation preview: https://vllm--40864.org.readthedocs.build/en/40864/

mergify Bot added the documentation, ci/build, deepseek, new-model, performance, gpt-oss, nvidia, rocm, and speculative-decoding labels Apr 25, 2026
@mergify mergify Bot added the v1 label Apr 25, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Apr 25, 2026
mergify Bot (Contributor) commented Apr 25, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @whx-sjtu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request implements support for the DeepSeek V4 model, introducing its unique Multi-Head Latent Attention (MLA) architecture with horizontally-fused kernels, HyperCompressed (mHC) blocks, and a Multi-Token Predictor (MTP) draft model. The changes include new CUDA and Triton kernels for fused operations, updates to quantization methods for MXFP4 and E8M0 scales, and the addition of a dedicated tokenizer and tool parser. Feedback identifies hardcoded environment variables in the Dockerfile, a hardcoded device index in the mHC layer that could cause issues in multi-GPU setups, redundant code blocks in the MXFP4 oracle, and a duplicate assignment in the top-k bias router.

Comment thread docker/Dockerfile
Comment on lines +286 to +288
&& export SCCACHE_BUCKET=inferact-sccache \
&& export SCCACHE_REGION=us-west-2 \
&& export SCCACHE_S3_NO_CREDENTIALS=0 \

high

The SCCACHE environment variables are hardcoded to a specific bucket and region (inferact-sccache, us-west-2). These should be parameterized using the build arguments ${SCCACHE_BUCKET_NAME}, ${SCCACHE_REGION_NAME}, and ${SCCACHE_S3_NO_CREDENTIALS} to ensure the Dockerfile remains generic and reusable across different environments.

        && export SCCACHE_BUCKET=${SCCACHE_BUCKET_NAME} \
        && export SCCACHE_REGION=${SCCACHE_REGION_NAME} \
        && export SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS} \


@cache
def compute_num_split(block_k: int, k: int | None, grid_size: int) -> int:
    device_props = torch.cuda.get_device_properties(0)

high

Using a hardcoded device index 0 in torch.cuda.get_device_properties(0) can lead to incorrect behavior in multi-GPU environments where the worker might be assigned to a different GPU. It should use the current device index to ensure correct properties are retrieved for the active worker.

Suggested change
device_props = torch.cuda.get_device_properties(0)
device_props = torch.cuda.get_device_properties(torch.cuda.current_device())
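The interaction between @cache and the hardcoded index is worth spelling out: because compute_num_split is memoized, whatever device is queried on the first call is baked into every later result for the same arguments. A minimal sketch of the failure mode, using a plain dict as a stand-in for torch.cuda so no GPU is required (the fake_* names are illustrative, not vLLM or torch APIs):

```python
from functools import cache

# Stand-ins for torch.cuda state; not real torch or vLLM APIs.
fake_device_props = {0: {"sm_count": 80}, 1: {"sm_count": 132}}
fake_current_device = 1  # pretend this worker was assigned GPU 1

@cache
def num_split_hardcoded(block_k: int) -> int:
    # Always reads device 0, even though the worker runs on device 1.
    return fake_device_props[0]["sm_count"] // block_k

@cache
def num_split_current(block_k: int) -> int:
    # Reads the worker's actual device.
    return fake_device_props[fake_current_device]["sm_count"] // block_k

print(num_split_hardcoded(16))  # 5  -- sized for the wrong GPU
print(num_split_current(16))    # 8  -- sized for the worker's GPU
```

Note also that the memoization means a wrong value, once computed, persists for the lifetime of the process, so the fix has to land before the first call rather than being correctable at runtime.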

Comment on lines +934 to +1049
elif mxfp4_backend == Mxfp4MoeBackend.AITER:
    from vllm._aiter_ops import rocm_aiter_ops

    if w13_bias is not None:
        w13_bias = w13_bias.data.to(torch.float32)
    if w2_bias is not None:
        w2_bias = w2_bias.data.to(torch.float32)

    e, n, k = w13_weight.shape

    w13_weight.view(torch.uint8).copy_(
        w13_weight.data.view(torch.uint8)
        .view(e, n // 2, 2, k)
        .permute(0, 2, 1, 3)
        .contiguous()
        .view(e, n, k)
    )
    w13_weight_scale.data = (
        w13_weight_scale.data.view(e, n // 2, 2, -1)
        .permute(0, 2, 1, 3)
        .contiguous()
        .view(e, n, -1)
    )

    w13_weight.data = w13_weight.data.view(torch.float4_e2m1fn_x2)
    w2_weight.data = w2_weight.data.view(torch.float4_e2m1fn_x2)

    w13_weight.data = rocm_aiter_ops.shuffle_weight_a16w4(w13_weight, 16, True)
    shuffled_w13_scale = rocm_aiter_ops.shuffle_scale_a16w4(
        w13_weight_scale.view(-1, w13_weight_scale.shape[-1]),
        num_experts,
        True,
    )

    w2_weight.data = rocm_aiter_ops.shuffle_weight_a16w4(w2_weight, 16, False)
    shuffled_w2_scale = rocm_aiter_ops.shuffle_scale_a16w4(
        w2_weight_scale.view(-1, w2_weight_scale.shape[-1]),
        num_experts,
        False,
    )

    if w13_bias is not None:
        w13_bias = (
            w13_bias.data.view(-1, n // 2, 2)
            .permute(0, 2, 1)
            .contiguous()
            .view(-1, n)
        )

    return (
        w13_weight,
        w2_weight,
        shuffled_w13_scale,
        shuffled_w2_scale,
        w13_bias,
        w2_bias,
    )

elif mxfp4_backend == Mxfp4MoeBackend.AITER:
    from vllm._aiter_ops import rocm_aiter_ops

    if w13_bias is not None:
        w13_bias = w13_bias.data.to(torch.float32)
    if w2_bias is not None:
        w2_bias = w2_bias.data.to(torch.float32)

    e, n, k = w13_weight.shape

    w13_weight.view(torch.uint8).copy_(
        w13_weight.data.view(torch.uint8)
        .view(e, n // 2, 2, k)
        .permute(0, 2, 1, 3)
        .contiguous()
        .view(e, n, k)
    )
    w13_weight_scale.data = (
        w13_weight_scale.data.view(e, n // 2, 2, -1)
        .permute(0, 2, 1, 3)
        .contiguous()
        .view(e, n, -1)
    )

    w13_weight.data = w13_weight.data.view(torch.float4_e2m1fn_x2)
    w2_weight.data = w2_weight.data.view(torch.float4_e2m1fn_x2)

    w13_weight.data = rocm_aiter_ops.shuffle_weight_a16w4(w13_weight, 16, True)
    shuffled_w13_scale = rocm_aiter_ops.shuffle_scale_a16w4(
        w13_weight_scale.view(-1, w13_weight_scale.shape[-1]),
        num_experts,
        True,
    )

    w2_weight.data = rocm_aiter_ops.shuffle_weight_a16w4(w2_weight, 16, False)
    shuffled_w2_scale = rocm_aiter_ops.shuffle_scale_a16w4(
        w2_weight_scale.view(-1, w2_weight_scale.shape[-1]),
        num_experts,
        False,
    )

    if w13_bias is not None:
        w13_bias = (
            w13_bias.data.view(-1, n // 2, 2)
            .permute(0, 2, 1)
            .contiguous()
            .view(-1, n)
        )

    return (
        w13_weight,
        w2_weight,
        shuffled_w13_scale,
        shuffled_w2_scale,
        w13_bias,
        w2_bias,
    )
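The view/permute sequence in the hunk above reorders the rows of w13 so that interleaved (gate, up) row pairs become two contiguous halves, which is the row-axis effect of .view(n // 2, 2, k).permute(1, 0, 2). A minimal pure-Python sketch of that index transformation, assuming rows are stored as consecutive gate/up pairs (no torch needed; the function name is illustrative, not vLLM code):

```python
def deinterleave_pairs(rows):
    """Reorder [g0, u0, g1, u1, ...] -> [g0, g1, ..., u0, u1, ...].

    This mirrors viewing n rows as (n // 2, 2) pairs and swapping
    the two leading axes, as the quoted kernel-format conversion does.
    """
    assert len(rows) % 2 == 0, "rows must come in (gate, up) pairs"
    return rows[0::2] + rows[1::2]

rows = ["g0", "u0", "g1", "u1", "g2", "u2"]
print(deinterleave_pairs(rows))  # ['g0', 'g1', 'g2', 'u0', 'u1', 'u2']
```

The real code applies the same reordering to the per-block scales so that weights and scales stay aligned after the shuffle.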


high

There are three identical elif mxfp4_backend == Mxfp4MoeBackend.AITER: blocks in the convert_gpt_oss_weight_to_mxfp4_moe_kernel_format function (starting at lines 873, 934, and 992). This redundancy significantly increases the file size and makes maintenance difficult. The duplicate blocks should be removed.
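One way to act on this feedback is to hoist the shared body into a single helper and dispatch on the backend, so the conversion logic exists exactly once. A hedged sketch of the shape such a refactor could take (the helper names and dict-dispatch structure are invented for illustration, not actual vLLM code):

```python
def _convert_aiter(weights: dict) -> dict:
    # Single home for the AITER conversion logic that was previously
    # copy-pasted into three elif branches.
    return {"backend": "AITER", **weights}

def _convert_triton(weights: dict) -> dict:
    return {"backend": "TRITON", **weights}

_CONVERTERS = {
    "AITER": _convert_aiter,
    "TRITON": _convert_triton,
}

def convert_weights(backend: str, weights: dict) -> dict:
    try:
        return _CONVERTERS[backend](weights)
    except KeyError:
        raise ValueError(f"unsupported mxfp4 backend: {backend}") from None

print(convert_weights("AITER", {"w13": 1})["backend"])  # AITER
```

With a table like this, adding a backend is one entry plus one function, and an accidental duplicate branch becomes impossible rather than merely unlikely.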

Comment on lines 258 to +260
self.routed_scaling_factor = routed_scaling_factor
self.scoring_func = scoring_func
self._hash_indices_table = hash_indices_table

high

The assignment self.scoring_func = scoring_func is duplicated (present on both line 257 and line 259).

Suggested change
self.routed_scaling_factor = routed_scaling_factor
self.scoring_func = scoring_func
self._hash_indices_table = hash_indices_table
self.routed_scaling_factor = routed_scaling_factor
self._hash_indices_table = hash_indices_table

@whx-sjtu whx-sjtu marked this pull request as draft April 25, 2026 07:27
@whx-sjtu whx-sjtu closed this Apr 25, 2026
@github-project-automation github-project-automation Bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Apr 25, 2026
@github-project-automation github-project-automation Bot moved this to Done in NVIDIA Apr 25, 2026
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Apr 25, 2026

Labels

ci/build, deepseek, documentation, gpt-oss, kv-connector, needs-rebase, new-model, nvidia, performance, rocm, speculative-decoding, tool-calling, v1

Projects

Status: Done


3 participants