[New Model][ROCm] Add AMD support for DeepSeek V4 #40864
whx-sjtu wants to merge 6 commits into vllm-project:main
Conversation
Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu> Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: yasong.wang <yasong.wang@inferact.ai> Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
Documentation preview: https://vllm--40864.org.readthedocs.build/en/40864/

This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request implements support for the DeepSeek V4 model, introducing its unique Multi-Head Latent Attention (MLA) architecture with horizontally-fused kernels, HyperCompressed (mHC) blocks, and a Multi-Token Predictor (MTP) draft model. The changes include new CUDA and Triton kernels for fused operations, updates to quantization methods for MXFP4 and E8M0 scales, and the addition of a dedicated tokenizer and tool parser. Feedback identifies hardcoded environment variables in the Dockerfile, a hardcoded device index in the mHC layer that could cause issues in multi-GPU setups, redundant code blocks in the MXFP4 oracle, and a duplicate assignment in the top-k bias router.
```dockerfile
&& export SCCACHE_BUCKET=inferact-sccache \
&& export SCCACHE_REGION=us-west-2 \
&& export SCCACHE_S3_NO_CREDENTIALS=0 \
```

The SCCACHE environment variables are hardcoded to a specific bucket and region (`inferact-sccache`, `us-west-2`). These should be parameterized using the build arguments `${SCCACHE_BUCKET_NAME}`, `${SCCACHE_REGION_NAME}`, and `${SCCACHE_S3_NO_CREDENTIALS}` so the Dockerfile remains generic and reusable across environments. Suggested change:

```dockerfile
&& export SCCACHE_BUCKET=${SCCACHE_BUCKET_NAME} \
&& export SCCACHE_REGION=${SCCACHE_REGION_NAME} \
&& export SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS} \
```
```python
@cache
def compute_num_split(block_k: int, k: int | None, grid_size: int) -> int:
    device_props = torch.cuda.get_device_properties(0)
```
Using a hardcoded device index `0` in `torch.cuda.get_device_properties(0)` can lead to incorrect behavior in multi-GPU environments, where the worker might be assigned to a different GPU. It should use the current device index to ensure the correct properties are retrieved for the active worker. Suggested change:

```python
device_props = torch.cuda.get_device_properties(torch.cuda.current_device())
```
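The `@cache` decorator on `compute_num_split` makes the hardcoded index worse: whichever device is queried first has its answer memoized for every later caller. A torch-free sketch of the pitfall (the device table and split formula are hypothetical stand-ins for `torch.cuda.get_device_properties`; they are not the PR's actual logic):

```python
from functools import cache

# Hypothetical stand-in for a node whose GPUs differ (e.g. SM counts).
DEVICE_SM_COUNT = {0: 304, 1: 228}

@cache
def num_split_hardcoded(block_k: int) -> int:
    # BUG: always queries device 0; the answer is then cached and
    # reused even by workers pinned to a different GPU.
    return DEVICE_SM_COUNT[0] // block_k

@cache
def num_split_per_device(block_k: int, device: int) -> int:
    # FIX: the device index is part of the cache key, mirroring
    # get_device_properties(torch.cuda.current_device()).
    return DEVICE_SM_COUNT[device] // block_k

print(num_split_hardcoded(4))      # 76, regardless of the active GPU
print(num_split_per_device(4, 0))  # 76
print(num_split_per_device(4, 1))  # 57
```

Passing the device index explicitly also keeps the memoization correct when one process manages several GPUs.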
```python
elif mxfp4_backend == Mxfp4MoeBackend.AITER:
    from vllm._aiter_ops import rocm_aiter_ops

    if w13_bias is not None:
        w13_bias = w13_bias.data.to(torch.float32)
    if w2_bias is not None:
        w2_bias = w2_bias.data.to(torch.float32)

    e, n, k = w13_weight.shape

    w13_weight.view(torch.uint8).copy_(
        w13_weight.data.view(torch.uint8)
        .view(e, n // 2, 2, k)
        .permute(0, 2, 1, 3)
        .contiguous()
        .view(e, n, k)
    )
    w13_weight_scale.data = (
        w13_weight_scale.data.view(e, n // 2, 2, -1)
        .permute(0, 2, 1, 3)
        .contiguous()
        .view(e, n, -1)
    )

    w13_weight.data = w13_weight.data.view(torch.float4_e2m1fn_x2)
    w2_weight.data = w2_weight.data.view(torch.float4_e2m1fn_x2)

    w13_weight.data = rocm_aiter_ops.shuffle_weight_a16w4(w13_weight, 16, True)
    shuffled_w13_scale = rocm_aiter_ops.shuffle_scale_a16w4(
        w13_weight_scale.view(-1, w13_weight_scale.shape[-1]),
        num_experts,
        True,
    )

    w2_weight.data = rocm_aiter_ops.shuffle_weight_a16w4(w2_weight, 16, False)
    shuffled_w2_scale = rocm_aiter_ops.shuffle_scale_a16w4(
        w2_weight_scale.view(-1, w2_weight_scale.shape[-1]),
        num_experts,
        False,
    )

    if w13_bias is not None:
        w13_bias = (
            w13_bias.data.view(-1, n // 2, 2)
            .permute(0, 2, 1)
            .contiguous()
            .view(-1, n)
        )

    return (
        w13_weight,
        w2_weight,
        shuffled_w13_scale,
        shuffled_w2_scale,
        w13_bias,
        w2_bias,
    )
```
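The `view(e, n // 2, 2, k).permute(0, 2, 1, 3)` pair in this hunk regroups interleaved row pairs `[x0, y0, x1, y1, …]` into two contiguous halves `[x0, x1, …, y0, y1, …]`, which is the usual layout change for fused gate/up projections stored interleaved. A torch-free sketch of just the row movement (the row labels are illustrative, not real weights):

```python
def deinterleave_rows(rows: list) -> list:
    """Regroup [x0, y0, x1, y1, ...] into [x0, x1, ..., y0, y1, ...].

    Mirrors .view(n // 2, 2, k).permute(1, 0, 2).reshape(n, k) on the
    row axis: even-indexed rows first, then odd-indexed rows.
    """
    return rows[0::2] + rows[1::2]

rows = ["g0", "u0", "g1", "u1", "g2", "u2"]
print(deinterleave_rows(rows))  # ['g0', 'g1', 'g2', 'u0', 'u1', 'u2']
```

The same permutation is applied to `w13_weight_scale` and (inverted on a flattened view) to `w13_bias`, keeping weights, scales, and biases row-aligned.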
|
|
||
There are three identical `elif mxfp4_backend == Mxfp4MoeBackend.AITER:` blocks in the `convert_gpt_oss_weight_to_mxfp4_moe_kernel_format` function (starting at lines 873, 934, and 992). Because the conditions are identical, only the first block can ever execute; the others are unreachable and the redundancy bloats the file and makes maintenance difficult. The duplicate blocks should be removed.
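Deleting the later copies is the direct fix; if distinct backends ever do need the same conversion, hoisting the branch body into one helper prevents the duplication from coming back. A minimal, torch-free sketch of that structure (the helper name, the float-upcast stand-in for the real `torch.float32` conversion, and the simplified signatures are hypothetical):

```python
def _aiter_postprocess(w13_bias, w2_bias):
    """Single home for the formerly-triplicated branch body.

    Stands in for the real work (bias upcast, gate/up de-interleave,
    fp4 views, aiter shuffles) with a plain float upcast.
    """
    up13 = [float(b) for b in w13_bias] if w13_bias is not None else None
    up2 = [float(b) for b in w2_bias] if w2_bias is not None else None
    return up13, up2

def convert_weights(backend, w13_bias, w2_bias):
    # One branch, one implementation: a bug fix now lands in
    # exactly one place instead of three.
    if backend == "AITER":
        return _aiter_postprocess(w13_bias, w2_bias)
    raise NotImplementedError(backend)

print(convert_weights("AITER", [1, 2], None))  # ([1.0, 2.0], None)
```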
```python
self.routed_scaling_factor = routed_scaling_factor
self.scoring_func = scoring_func
self._hash_indices_table = hash_indices_table
```

The assignment `self.scoring_func = scoring_func` is duplicated (present on both line 257 and line 259). Suggested change:

```python
self.routed_scaling_factor = routed_scaling_factor
self._hash_indices_table = hash_indices_table
```
Purpose
This PR is based on #40760 and extends DeepSeek V4 support to AMD (ROCm) hardware.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.