
Add AMD support for DeepSeek V4#23608

Open
AgainstEntropy wants to merge 3 commits into sgl-project:main from AgainstEntropy:amd/deepseek_v4

Conversation


AgainstEntropy (Collaborator) commented Apr 24, 2026

Docker image

lmsysorg/sglang:v0.5.8-rocm700-mi35x-dsk

To build a Docker image yourself, please refer to the comment from @yushengsu-thu below.
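
If you prefer scripting the container launch, below is a hedged sketch using the Docker SDK for Python (docker-py); the /dev/kfd and /dev/dri device mappings and the video group are the usual ROCm container settings and may need adjusting for your host.

import docker

# Start the image above in the background with the typical AMD GPU
# device mappings; equivalent to a docker run with the same flags.
client = docker.from_env()
container = client.containers.run(
    "lmsysorg/sglang:v0.5.8-rocm700-mi35x-dsk",
    command="sleep infinity",
    devices=["/dev/kfd:/dev/kfd:rwm", "/dev/dri:/dev/dri:rwm"],
    group_add=["video"],
    ipc_mode="host",
    network_mode="host",
    detach=True,
)
print(container.short_id)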

Launch command

export CUDA_VISIBLE_DEVICES=0,1,2,3

export SGLANG_OPT_USE_FUSED_COMPRESS=false  # use the PyTorch-implemented compressor
export SGLANG_OPT_USE_OLD_COMPRESSOR=true  # use the old compressor
export SGLANG_OPT_USE_TILELANG_SWA_PREPARE=false  # use the old prepare path
export SGLANG_OPT_USE_JIT_KERNEL_FUSED_TOPK=false  # use the old topk
export SGLANG_OPT_USE_FUSED_HASH_TOPK=false  # AMD: the hash_topk JIT needs the CUDA toolchain

export SGLANG_HACK_FLASHMLA_BACKEND=torch
export SGLANG_OPT_DEEPGEMM_HC_PRENORM=false  # use the old prenorm

export SGLANG_OPT_USE_TILELANG_MHC_PRE=false  # use the torch hc_pre
export SGLANG_OPT_USE_TILELANG_MHC_POST=false  # use the torch hc_post

export SGLANG_ENABLE_THINKING=1
export SGLANG_USE_AITER=1
export SGLANG_USE_ROCM700A=1
export SGLANG_TOPK_TRANSFORM_512_TORCH=1
export SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1

export SGLANG_DSV4_FP4_EXPERTS=false

export SGLANG_OPT_DPSK_V4_RADIX=0
export SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false  # the non-radix backend has no store_cache method
export SGLANG_OPT_USE_FUSED_STORE_CACHE=false  # the fused_store_cache JIT needs the CUDA toolchain

export SGLANG_FORCE_TRITON_MOE_FP8=1  # required to apply the swiglu_limit clamp in fused_moe_triton

python3 -m sglang.launch_server \
    --model-path sgl-project/DeepSeek-V4-Flash-FP8 \
    --trust-remote-code \
    --tp 4 \
    --dp 4 \
    --enable-dp-attention \
    --disable-radix-cache \
    --attention-backend compressed \
    --max-running-requests 256 \
    --page-size 256 \
    --chunked-prefill-size 8192 \
    --port 30010 \
    --disable-shared-experts-fusion \
    --disable-cuda-graph \
    --tool-call-parser deepseekv4 \
    --reasoning-parser deepseek-v4
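
Once the server reports ready, the OpenAI-compatible endpoint can be smoke-tested with a minimal request such as the sketch below (the prompt and token budget are arbitrary).

import requests

# Assumes the launch command above is running on this host with
# --port 30010; any OpenAI-compatible client works the same way.
resp = requests.post(
    "http://localhost:30010/v1/chat/completions",
    json={
        "model": "sgl-project/DeepSeek-V4-Flash-FP8",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])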

Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <baizhouzhang@radixark.ai>
Co-authored-by: Baizhou Zhang <baizhou@radixark.ai>
Co-authored-by: Baizhou Zhang <soberedezhang@gmail.com>
Co-authored-by: DarkSharpness <2040703891@qq.com>
Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com>
Co-authored-by: Fridge803 <soberedezhang@gmail.com>
Co-authored-by: Ke Bao <26454835+ispobock@users.noreply.github.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Co-authored-by: Mingyi Lu <wisclmy0611@gmail.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Co-authored-by: Yueming Yuan <yy28@illinois.edu>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: Yusheng Su <yushengsu.thu@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: yueming-yuan <yym022502@gmail.com>

rahulvijayaraghavan added a commit to rahulvijayaraghavan/sglang that referenced this pull request Apr 27, 2026
Cherry-pick the torch reference FlashMLA implementation from
sgl-project#23608 (AgainstEntropy:amd/deepseek_v4) so XPU and
other non-CUDA backends can route SGLANG_HACK_FLASHMLA_BACKEND=torch
through ref_sparse_attn_decode instead of the CUDA flash_mla kernel.

- Add python/sglang/srt/flashmla_tests/{__init__,lib,quant,ref}.py and
  the kernelkit/ helper package, taken verbatim from PR sgl-project#23608.
- Replace debug_flash_mla_adapter.py with the PR's version, but move
  'import flash_mla' into the backend == 'kernel' branch so the torch
  fallback does not require the CUDA-only flash_mla module.
- Gate the top-level 'import flash_mla' in flashmla_tests/lib.py on
  is_cuda() (was: 'not is_hip()'), since on XPU torch.cuda is also
  unavailable. The two helpers that actually call into flash_mla
  (run_flash_mla_sparse_fwd / run_flash_mla_decode) are unused on the
  torch fallback path.
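
The import gate described above amounts to roughly the following sketch (illustrative only, not the verbatim contents of flashmla_tests/lib.py).

import torch

def is_cuda() -> bool:
    # torch.version.cuda is None on ROCm (HIP) and XPU builds, so this
    # check is stricter than 'not is_hip()' and also skips the
    # CUDA-only import on XPU.
    return torch.version.cuda is not None and torch.cuda.is_available()

if is_cuda():
    import flash_mla  # CUDA-only kernel package
else:
    flash_mla = None  # the torch fallback path never calls into it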
rahulvijayaraghavan added a commit to rahulvijayaraghavan/sglang that referenced this pull request Apr 27, 2026
Port the non-fused 'old' compressor implementation from PR sgl-project#23608 so
DeepSeek V4 can run on XPU (and other backends without the CUDA-only
fused compress kernels).

- environ.py: register SGLANG_OPT_USE_OLD_COMPRESSOR,
  SGLANG_OPT_USE_FUSED_COMPRESS, SGLANG_OPT_USE_FUSED_PAGED_COMPRESS,
  SGLANG_OPT_DPSK_V4_RADIX (names match PR sgl-project#23608).
- mem_cache/compress_state.py: add KVAndScoreOld dataclass for the
  non-paged compress state used by the old path.
- mem_cache/deepseekv4_memory_pool.py: add non-paged
  DeepSeekV4CompressState pools and accessors.
- models/deepseek_v4.py: add Compressor.compress_decode_old /
  compress_extend_old / compress_dispatch with self-rewire, and route
  forward() through compress_dispatch. use_fused_compress mirrors
  PR sgl-project#23608 (paged-fused vs fused vs old).
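
A hedged sketch of the compress_dispatch self-rewire shape described in the last bullet; the method and flag names come from the commit message, while the bodies and the _env_flag helper are placeholders.

import os

def _env_flag(name: str, default: bool = False) -> bool:
    # Hypothetical helper: parse the boolean-style env vars used above.
    return os.environ.get(name, str(default)).lower() in ("1", "true")

class Compressor:
    def compress_fused(self, x):
        return x  # placeholder for the fused CUDA path

    def compress_decode_old(self, x):
        return x  # placeholder for the non-fused torch path

    def compress_dispatch(self, x):
        # Choose an implementation from the env flags once, then
        # rewire the instance so later calls skip the flag checks.
        if _env_flag("SGLANG_OPT_USE_FUSED_COMPRESS"):
            impl = self.compress_fused
        else:
            impl = self.compress_decode_old
        self.compress_dispatch = impl  # self-rewire
        return impl(x)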
rahulvijayaraghavan added a commit to rahulvijayaraghavan/sglang that referenced this pull request Apr 27, 2026
…ct#23608

- fp8_paged_mqa_logits_torch: restore strict 1D (batch_size,) seq_lens
  assert.
- forward_c4_indexer: pass indexer_metadata.c4_seq_lens directly (1D)
  instead of an unsqueeze(-1) variant. Matches PR sgl-project#23608 so the torch
  fallback agrees with the CUDA kernel signature.
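
The restored shape contract, as a tiny illustration; the function name comes from the commit message and everything else is a placeholder.

import torch

def fp8_paged_mqa_logits_torch(seq_lens: torch.Tensor, *args):
    # Strict 1D contract: seq_lens has shape (batch_size,), so callers
    # must pass it directly rather than an unsqueeze(-1) variant.
    assert seq_lens.dim() == 1, "seq_lens must have shape (batch_size,)"

fp8_paged_mqa_logits_torch(torch.tensor([3, 7, 5]))  # ok
# fp8_paged_mqa_logits_torch(torch.tensor([3, 7, 5]).unsqueeze(-1))  # raises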
rahulvijayaraghavan added a commit to rahulvijayaraghavan/sglang that referenced this pull request Apr 27, 2026
On XPU, TopK.forward_native unconditionally forced
topk_config.torch_native=True, which routed DeepSeek V4 (sqrtsoftplus +
correction_bias + non-zero num_fused_shared_experts +
routed_scaling_factor) through fused_topk_torch_native. That naive path
ignores num_fused_shared_experts and routed_scaling_factor, so the
fused-shared-expert slot was never populated and weights were never
scaled — producing garbage tokens.

- topk.py forward_native: only force torch_native when the input is NOT
  the DSv4 sqrtsoftplus + correction_bias + non-grouped case, so it
  falls through to biased_topk_impl (matches PR sgl-project#23608 CUDA semantics).
- topk.py fused_topk_torch_native: add sqrtsoftplus branch and a
  defensive correction_bias.to(scores.device) cast as a fallback.
- deepseek_v4_topk.py biased_topk_impl: same defensive device cast for
  correction_bias, and disable @torch.compile on XPU (mirrors the NPU
  carve-out).
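
The defensive cast in the last two bullets boils down to the sketch below; the tensor names follow the commit message, and the function wrapper is illustrative.

import torch

def add_correction_bias(scores: torch.Tensor,
                        correction_bias: torch.Tensor) -> torch.Tensor:
    # Defensive fallback: on XPU the bias buffer can end up on a
    # different device than the router scores, so cast before adding.
    return scores + correction_bias.to(scores.device)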

HaiShaw (Collaborator) commented May 2, 2026

The SGLang daily image for AMD deepseek_v4 is posted at https://hub.docker.com/r/rocm/sgl-dev/tags (tags ending with -DSv4).

@tawan0109

@HaiShaw @AgainstEntropy @jhinpan Any plan to support MI300X? We have a good number of MI300X GPUs and would like to use them for DSv4.

