amd/deepseek_v4 integration 10/N optimize mhc performance #24355

Merged
HaiShaw merged 4 commits into sgl-project:amd/deepseek_v4 from HaiShaw:amd/deepseek_v4_mhc-improve_0504
May 4, 2026

@kkHuang-amd commented May 4, 2026

co-author: @1am9trash

Motivation

Update amd/deepseek_v4 integration branch

The following PRs have a large set of conflicts, so we use this PR and the upstream amd/deepseek_v4 branch to integrate in parallel:
#23600
#23608

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist (bot) left a comment

Code Review

This pull request integrates AMD-specific aiter multi-head compression (MHC) operations into the DeepSeek-V4 model, introduces new environment variables to control these features, and updates the run_dsv4.sh script with optimized configurations. The review feedback focuses on ensuring cross-platform compatibility by guarding aiter imports with platform checks, adhering to PEP-8 formatting standards, and improving performance by reducing frequent memory allocations during the forward pass. Additionally, redundant path assignments in the shell script were identified for cleanup.

@@ -1889,6 +1889,23 @@ def hc_pre_torch_impl(x, hc_fn):
# returned post should be [n, hc_mult]

high

The aiter library is an AMD-specific dependency and is not available on NVIDIA platforms. Since SGLANG_OPT_USE_AITER_MHC_PRE is enabled by default, this block will be executed on NVIDIA hardware if SGLANG_OPT_USE_TILELANG_MHC_PRE is disabled (which is necessary to use DeepGEMM), resulting in a ModuleNotFoundError. This check should be guarded by _is_hip to ensure portability.

Suggested change
# returned post should be [n, hc_mult]
if _is_hip and envs.SGLANG_OPT_USE_AITER_MHC_PRE.get():

gemini-code-assist proposed the wrong diff: the line number is incorrect.

You are absolutely correct, HaiShaw. My apologies for the incorrect line number in the previous suggestion. The if _is_hip condition should indeed guard the aiter block, specifically modifying the line if envs.SGLANG_OPT_USE_AITER_MHC_PRE.get():.

Since the original comment was anchored to line 1889, I cannot provide a direct code suggestion for line 1892 in the UI. However, the correct modification would be to change line 1892 from:

        if envs.SGLANG_OPT_USE_AITER_MHC_PRE.get():

to:

        if _is_hip and envs.SGLANG_OPT_USE_AITER_MHC_PRE.get():

This ensures that the aiter library, which is an AMD-specific dependency, is only used on AMD platforms, preventing ModuleNotFoundError on NVIDIA hardware.
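The guard discussed above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in deepseek_v4.py: `env_flag` and `select_mhc_pre_backend` are hypothetical helpers, `env_flag` stands in for `envs.SGLANG_OPT_USE_AITER_MHC_PRE.get()`, and the platform flag is passed in as an argument where the real module would use `_is_hip`.

```python
import importlib
import os


def env_flag(name: str, default: bool = True) -> bool:
    """Hypothetical stand-in for EnvBool(...).get(): parse a boolean env var."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


def select_mhc_pre_backend(is_hip: bool) -> str:
    """Return "aiter" only on AMD (HIP) platforms with the flag enabled.

    The platform check comes first, so the aiter import is never
    attempted on non-AMD hardware even though the flag defaults to True.
    """
    if is_hip and env_flag("SGLANG_OPT_USE_AITER_MHC_PRE"):
        try:
            # Lazy import: only reached when the platform check passed.
            importlib.import_module("aiter")
        except ImportError:
            # Assumption for this sketch: fall back instead of crashing.
            return "torch"
        return "aiter"
    return "torch"
```

With `is_hip=False` the function returns "torch" without touching the import, which is the behavior the reviewer asks for on NVIDIA hardware. Whether a silent fallback or a hard error is preferable when aiter is missing on an AMD machine is a design choice this sketch does not settle.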

Comment thread python/sglang/srt/models/deepseek_v4.py
Comment on lines +593 to 594
SGLANG_OPT_USE_AITER_MHC_POST= EnvBool(True)
# fmt: on

medium

There are missing spaces around the assignment operator = for the new environment variables. To maintain consistency with the rest of the file and adhere to PEP-8 standards, please add spaces around the equals sign.

Suggested change
SGLANG_OPT_USE_AITER_MHC_POST= EnvBool(True)
# fmt: on
SGLANG_OPT_USE_AITER_MHC_PRE = EnvBool(True)
SGLANG_OPT_USE_AITER_MHC_POST = EnvBool(True)

Comment thread python/sglang/srt/models/deepseek_v4.py Outdated
result = mhc_post(x, residual, post, comb)
return result

elif envs.SGLANG_OPT_USE_AITER_MHC_POST.get():

medium

Allocating a new tensor with torch.empty_like(residual) in every layer during every forward pass can be inefficient due to frequent memory allocations. If the aiter kernel supports it, consider using a pre-allocated workspace or an in-place operation to improve performance.
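One way to read the pre-allocated workspace suggestion is a small buffer cache keyed by shape, dtype, and device. The sketch below is an assumption-laden illustration: `OutputWorkspace` is a hypothetical class, and it only helps if the kernel can write into a caller-provided output tensor, which the thread does not confirm for the aiter kernel.

```python
import torch


class OutputWorkspace:
    """Hypothetical cache holding one output buffer per (shape, dtype, device)."""

    def __init__(self) -> None:
        self._buffers: dict = {}

    def get_like(self, ref: torch.Tensor) -> torch.Tensor:
        """Return a cached uninitialized buffer matching ref, allocating once."""
        key = (tuple(ref.shape), ref.dtype, ref.device)
        buf = self._buffers.get(key)
        if buf is None:
            # First request for this key: allocate and remember the buffer.
            buf = torch.empty_like(ref)
            self._buffers[key] = buf
        return buf


# Usage: repeated requests for the same shape reuse the same storage.
workspace = OutputWorkspace()
residual = torch.zeros(4, 8)
out_first = workspace.get_like(residual)
out_second = workspace.get_like(residual)
```

Reusing one buffer is only safe when the previous result is no longer needed at the time of the next call; if a layer's output is still live when the next layer runs, the layers would need separate buffers. PyTorch's caching allocator already reuses freed blocks, so the actual gain should be measured before adopting something like this.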

Comment thread python/run_dsv4.sh
Comment on lines +30 to +31
MODEL=/dockerx/data/deepseek-ai/DeepSeek-V4-Pro
MODEL=/dockerx/data/sgl-project/DeepSeek-V4-Flash-FP8/

medium

The MODEL variable is assigned twice consecutively, making the first assignment redundant. Additionally, these absolute paths are specific to a particular environment. It is recommended to use a single assignment and consider using a more generic path or an environment variable for better flexibility.

Suggested change
MODEL=/dockerx/data/deepseek-ai/DeepSeek-V4-Pro
MODEL=/dockerx/data/sgl-project/DeepSeek-V4-Flash-FP8/
MODEL=/dockerx/data/sgl-project/DeepSeek-V4-Flash-FP8/

@kkHuang-amd changed the title from "Replace naive mhc design to aiter design" to "amd/deepseek_v4 integration 10/N optimize mhc performance" May 4, 2026

@HaiShaw left a comment

Let's also update rocm.Dockerfile for the best config.

@HaiShaw HaiShaw merged commit b4fe024 into sgl-project:amd/deepseek_v4 May 4, 2026
1 check passed

3 participants