Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions docker/rocm.Dockerfile
@@ -561,6 +561,10 @@ ENV SGLANG_TOPK_TRANSFORM_512_TORCH=0
ENV SGLANG_OPT_USE_FUSED_COMPRESS=true
ENV SGLANG_OPT_USE_TILELANG_INDEXER=true
ENV SGLANG_HACK_FLASHMLA_BACKEND=tilelang
ENV SGLANG_OPT_USE_AITER_MHC_PRE=true
ENV SGLANG_OPT_USE_AITER_MHC_POST=true
ENV SGLANG_OPT_USE_TILELANG_MHC_PRE=false
ENV SGLANG_OPT_USE_TILELANG_MHC_POST=false

ENV NCCL_MIN_NCHANNELS=112
ENV ROCM_QUICK_REDUCE_QUANTIZATION=INT8
88 changes: 22 additions & 66 deletions python/run_dsv4.sh
@@ -1,80 +1,37 @@
#export CUDA_VISIBLE_DEVICES=0,1,2,3
#/dockerx/data/models/DeepSeek-V4-Flash
export SGLANG_OPT_USE_OLD_COMPRESSOR=true
export SGLANG_OPT_USE_TILELANG_SWA_PREPARE=false
export SGLANG_OPT_USE_JIT_KERNEL_FUSED_TOPK=false
export SGLANG_OPT_USE_FUSED_HASH_TOPK=false

#### FP8 model path ####
#export SGLANG_REASONING_EFFORT=max
#
#export SGLANG_OPT_USE_FUSED_COMPRESS=false #use PyTorch implemented compressor
#export SGLANG_OPT_USE_OLD_COMPRESSOR=true #use old compressor
#export SGLANG_OPT_USE_TILELANG_SWA_PREPARE=false #use old prepare
#export SGLANG_OPT_USE_JIT_KERNEL_FUSED_TOPK=false #use old topk
#export SGLANG_OPT_USE_FUSED_HASH_TOPK=false #AMD: hash_topk JIT needs CUDA toolchain
#
#export SGLANG_HACK_FLASHMLA_BACKEND=torch
#export SGLANG_HACK_FLASHMLA_BACKEND=tilelang
#export SGLANG_OPT_DEEPGEMM_HC_PRENORM=false #use old prenorm
#
#export SGLANG_OPT_USE_TILELANG_MHC_PRE=false #use torch hc_pre
#export SGLANG_OPT_USE_TILELANG_MHC_POST=false #use torch hc_post
#
#export SGLANG_ENABLE_THINKING=1
#export SGLANG_USE_AITER=1
#export SGLANG_USE_ROCM700A=1
#export SGLANG_TOPK_TRANSFORM_512_TORCH=1
#export SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1
#
#export SGLANG_OPT_DPSK_V4_RADIX=0
#export SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false #non-radix backend has no store_cache method
#export SGLANG_OPT_USE_FUSED_STORE_CACHE=false #fused_store_cache JIT needs CUDA toolchain
#
#export SGLANG_FORCE_TRITON_MOE_FP8=1 # this is required to apply swiglu_limit clamp in fused_moe_triton
#
#python3 -m sglang.launch_server \
# --model-path /dockerx/data2/models/DeepSeek-V4-Pro-FP8 \
# --trust-remote-code \
# --tp 8 \
# --dp 8 \
# --enable-dp-attention \
# --disable-radix-cache \
# --attention-backend compressed \
# --max-running-request 256 \
# --page-size 256 \
# --chunked-prefill-size 8192 \
# --port 8000 \
# --disable-shared-experts-fusion \
# --disable-cuda-graph \
# --tool-call-parser deepseekv4 \
# --reasoning-parser deepseek-v4
export SGLANG_OPT_DEEPGEMM_HC_PRENORM=false

#### FP4 model path ####
export SGLANG_REASONING_EFFORT=max

export SGLANG_OPT_USE_FUSED_COMPRESS=false #use PyTorch implemented compressor
export SGLANG_OPT_USE_OLD_COMPRESSOR=true #use old compressor
export SGLANG_OPT_USE_TILELANG_SWA_PREPARE=false #use old prepare
export SGLANG_OPT_USE_JIT_KERNEL_FUSED_TOPK=false #use old topk
export SGLANG_OPT_USE_FUSED_HASH_TOPK=false #AMD: hash_topk JIT needs CUDA toolchain

export SGLANG_HACK_FLASHMLA_BACKEND=tilelang
export SGLANG_OPT_DEEPGEMM_HC_PRENORM=false #use old prenorm

export SGLANG_OPT_USE_TILELANG_MHC_PRE=false #use torch hc_pre
export SGLANG_OPT_USE_TILELANG_MHC_POST=false #use torch hc_post
export SGLANG_OPT_USE_TILELANG_MHC_PRE=false
export SGLANG_OPT_USE_TILELANG_MHC_POST=false

export SGLANG_ENABLE_THINKING=1
export SGLANG_USE_AITER=1
export SGLANG_USE_ROCM700A=1
export SGLANG_TOPK_TRANSFORM_512_TORCH=1
export SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1

export SGLANG_OPT_DPSK_V4_RADIX=0
export SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false #non-radix backend has no store_cache method
export SGLANG_OPT_USE_FUSED_STORE_CACHE=false #fused_store_cache JIT needs CUDA toolchain
export SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false
export SGLANG_OPT_USE_FUSED_STORE_CACHE=false

# changed
export SGLANG_OPT_USE_FUSED_COMPRESS=true
export SGLANG_TOPK_TRANSFORM_512_TORCH=0
export SGLANG_OPT_USE_TILELANG_INDEXER=true
export SGLANG_HACK_FLASHMLA_BACKEND=tilelang
export SGLANG_REASONING_EFFORT=max
export SGLANG_FORCE_TRITON_MOE_FP8=0
export SGLANG_OPT_USE_AITER_MHC_PRE=true
export SGLANG_OPT_USE_AITER_MHC_POST=true

export SGLANG_FORCE_TRITON_MOE_FP8=0 # this is required to apply swiglu_limit clamp in fused_moe_triton
MODEL=/dockerx/data/deepseek-ai/DeepSeek-V4-Pro
MODEL=/dockerx/data/sgl-project/DeepSeek-V4-Flash-FP8/
Comment on lines +30 to +31
Contributor

medium

The MODEL variable is assigned twice consecutively, making the first assignment redundant. Additionally, these absolute paths are specific to a particular environment. It is recommended to use a single assignment and consider using a more generic path or an environment variable for better flexibility.

Suggested change
MODEL=/dockerx/data/deepseek-ai/DeepSeek-V4-Pro
MODEL=/dockerx/data/sgl-project/DeepSeek-V4-Flash-FP8/
MODEL=/dockerx/data/sgl-project/DeepSeek-V4-Flash-FP8/
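Following the reviewer's suggestion, a hedged sketch of a single assignment that can be overridden from the caller's environment (the fallback path is the FP8 checkpoint from this script and is environment-specific):

```shell
# Single assignment, overridable from the environment:
#   MODEL=/my/checkpoint ./run_dsv4.sh
# Falls back to the FP8 path used in this script otherwise.
MODEL="${MODEL:-/dockerx/data/sgl-project/DeepSeek-V4-Flash-FP8/}"
echo "Using model path: ${MODEL}"
```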


python3 -m sglang.launch_server \
--model-path /dockerx/data/deepseek-ai/DeepSeek-V4-Pro \
--model-path ${MODEL} \
--trust-remote-code \
--tp 8 \
--disable-radix-cache \
@@ -84,6 +41,5 @@ python3 -m sglang.launch_server \
--chunked-prefill-size 8192 \
--port 8000 \
--disable-shared-experts-fusion \
--disable-cuda-graph \
--tool-call-parser deepseekv4 \
--reasoning-parser deepseek-v4
2 changes: 2 additions & 0 deletions python/sglang/srt/environ.py
@@ -589,6 +589,8 @@ class Envs:
SGLANG_TOPK_TRANSFORM_512_TORCH = EnvBool(False)
SGLANG_OPT_BF16_FP32_GEMM_ALGO = EnvBool(False)
SGLANG_FORCE_TRITON_MOE_FP8 = EnvBool(False)
SGLANG_OPT_USE_AITER_MHC_PRE= EnvBool(True)
SGLANG_OPT_USE_AITER_MHC_POST= EnvBool(True)
# fmt: on
Comment on lines +593 to 594
Contributor

medium

There are missing spaces around the assignment operator = for the new environment variables. To maintain consistency with the rest of the file and adhere to PEP-8 standards, please add spaces around the equals sign.

Suggested change
SGLANG_OPT_USE_AITER_MHC_POST= EnvBool(True)
# fmt: on
SGLANG_OPT_USE_AITER_MHC_PRE = EnvBool(True)
SGLANG_OPT_USE_AITER_MHC_POST = EnvBool(True)
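For context, `EnvBool` here is sglang's typed env-flag descriptor. As a rough illustration only (not sglang's actual implementation), a boolean flag helper of this shape reads the process environment and falls back to a default:

```python
import os

# Illustrative stand-in for a boolean env-flag helper; sglang's real EnvBool
# lives in python/sglang/srt/environ.py and differs in detail.
class EnvBoolSketch:
    def __init__(self, name: str, default: bool):
        self.name = name
        self.default = default

    def get(self) -> bool:
        raw = os.environ.get(self.name)
        if raw is None:
            return self.default
        # Accept the usual truthy spellings; anything else is falsy.
        return raw.strip().lower() in ("1", "true", "yes", "on")

SGLANG_OPT_USE_AITER_MHC_PRE = EnvBoolSketch("SGLANG_OPT_USE_AITER_MHC_PRE", True)
print(SGLANG_OPT_USE_AITER_MHC_PRE.get())  # True unless the env var overrides it
```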


# EPD
10 changes: 10 additions & 0 deletions python/sglang/srt/layers/mhc.py
@@ -543,11 +543,21 @@ def mhc_pre(

if num_tokens <= 2048:
assert n_splits == 1
if hc_hidden_size == 16384:
hidden_block = 256
elif hc_hidden_size == 28672:
hidden_block = 128
else:
raise NotImplementedError(
f"mhc_pre splitk kernel only supports hc_hidden_size in {{16384, 28672}}, "
f"got {hc_hidden_size}"
)
kernel_0, kernel_1 = mhc_pre_gemm_sqrsum_splitk_kernel(
hc_mult3,
hc_hidden_size,
split_k=n_splits_pre,
token_block=32,
hidden_block=hidden_block,
)
partial_out = gemm_out_mul.new_empty(n_splits_pre, num_tokens, 32)
partial_sqrsum = gemm_out_sqrsum.new_empty(n_splits_pre, num_tokens)
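The `hidden_block` selection added above can equivalently be expressed as a lookup table; a sketch (the names here are illustrative, not the repo's):

```python
# Sketch: hidden_block choice per supported hc_hidden_size, mirroring the
# branch added in mhc_pre. Unsupported sizes fail loudly, as in the diff.
_HIDDEN_BLOCK = {16384: 256, 28672: 128}

def pick_hidden_block(hc_hidden_size: int) -> int:
    try:
        return _HIDDEN_BLOCK[hc_hidden_size]
    except KeyError:
        raise NotImplementedError(
            f"mhc_pre splitk kernel only supports hc_hidden_size in "
            f"{{16384, 28672}}, got {hc_hidden_size}"
        ) from None
```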
25 changes: 25 additions & 0 deletions python/sglang/srt/models/deepseek_v4.py
@@ -1889,6 +1889,23 @@ def hc_pre_torch_impl(x, hc_fn):
# returned post should be [n, hc_mult]
Contributor

high

The aiter library is an AMD-specific dependency and is not available on NVIDIA platforms. Since SGLANG_OPT_USE_AITER_MHC_PRE is enabled by default, this block will be executed on NVIDIA hardware if SGLANG_OPT_USE_TILELANG_MHC_PRE is disabled (which is necessary to use DeepGEMM), resulting in a ModuleNotFoundError. This check should be guarded by _is_hip to ensure portability.

Suggested change
# returned post should be [n, hc_mult]
if _is_hip and envs.SGLANG_OPT_USE_AITER_MHC_PRE.get():

Collaborator

gemini-code-assist proposed wrong diff, incorrect line #

Contributor

You are absolutely correct, HaiShaw. My apologies for the incorrect line number in the previous suggestion. The if _is_hip condition should indeed guard the aiter block, specifically modifying the line if envs.SGLANG_OPT_USE_AITER_MHC_PRE.get():.

Since the original comment was anchored to line 1889, I cannot provide a direct code suggestion for line 1892 in the UI. However, the correct modification would be to change line 1892 from:

        if envs.SGLANG_OPT_USE_AITER_MHC_PRE.get():

to:

        if _is_hip and envs.SGLANG_OPT_USE_AITER_MHC_PRE.get():

This ensures that the aiter library, which is an AMD-specific dependency, is only used on AMD platforms, preventing ModuleNotFoundError on NVIDIA hardware.

return y, post.squeeze(-1), comb

if _is_hip and envs.SGLANG_OPT_USE_AITER_MHC_PRE.get():
from aiter.ops.mhc import mhc_pre

post, comb, y = mhc_pre(
residual=x,
fn=hc_fn,
hc_scale=hc_scale,
hc_base=hc_base,
rms_eps=self.rms_norm_eps,
hc_pre_eps=self.hc_eps,
hc_sinkhorn_eps=self.hc_eps,
hc_post_mult_value=2.0,
sinkhorn_repeat=self.hc_sinkhorn_iters,
)
# returned post should be [n, hc_mult]
return y, post.squeeze(-1), comb

if envs.SGLANG_OPT_DEEPGEMM_HC_PRENORM.get():
# DeepGEMM implementation
import deep_gemm
@@ -1945,6 +1962,14 @@ def hc_post(
result = mhc_post(x, residual, post, comb)
HaiShaw marked this conversation as resolved.
return result

elif _is_hip and envs.SGLANG_OPT_USE_AITER_MHC_POST.get():
from aiter.ops.mhc import mhc_post

result = torch.empty_like(residual)
mhc_post(result, x, residual, post, comb)

return result

assert residual.shape == (x.shape[0], self.hc_mult, x.shape[-1])
assert post.shape == (x.shape[0], self.hc_mult)
assert comb.shape == (x.shape[0], self.hc_mult, self.hc_mult)
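Note the calling-convention difference this hunk introduces: the tilelang path returns its result, while the aiter `mhc_post` writes into a caller-allocated buffer. A toy sketch of that out-parameter style (the arithmetic is a placeholder, not the real mhc_post math):

```python
import torch

def toy_post_inplace(out: torch.Tensor, x: torch.Tensor, scale: float) -> None:
    # Kernel-style API: the caller owns the output buffer; we fill it in place.
    out.copy_(x * scale)

x = torch.ones(4, 3)
result = torch.empty_like(x)   # preallocate, as in the aiter branch above
toy_post_inplace(result, x, 2.0)
```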