
Add AMD support for DeepSeek V4#23608

Open
AgainstEntropy wants to merge 3 commits into sgl-project:main from AgainstEntropy:amd/deepseek_v4

Conversation


AgainstEntropy (Collaborator) commented Apr 24, 2026

Docker image

lmsysorg/sglang:v0.5.8-rocm700-mi35x-dsk

To build a Docker image yourself, please refer to the comment from @yushengsu-thu below.
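
If you prefer scripting the container launch, below is a hedged sketch using the Docker SDK for Python (docker-py); the /dev/kfd and /dev/dri device mappings and the video group are the usual ROCm container settings and may need adjusting for your host.

import docker

# Start the image above in the background with the typical AMD GPU
# device mappings; equivalent to a docker run with the same flags.
client = docker.from_env()
container = client.containers.run(
    "lmsysorg/sglang:v0.5.8-rocm700-mi35x-dsk",
    command="sleep infinity",
    devices=["/dev/kfd:/dev/kfd:rwm", "/dev/dri:/dev/dri:rwm"],
    group_add=["video"],
    ipc_mode="host",
    network_mode="host",
    detach=True,
)
print(container.short_id)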

Launch command

export CUDA_VISIBLE_DEVICES=0,1,2,3

export SGLANG_OPT_USE_FUSED_COMPRESS=false  # use the PyTorch-implemented compressor
export SGLANG_OPT_USE_OLD_COMPRESSOR=true  # use the old compressor
export SGLANG_OPT_USE_TILELANG_SWA_PREPARE=false  # use the old prepare path
export SGLANG_OPT_USE_JIT_KERNEL_FUSED_TOPK=false  # use the old topk
export SGLANG_OPT_USE_FUSED_HASH_TOPK=false  # AMD: the hash_topk JIT needs the CUDA toolchain

export SGLANG_HACK_FLASHMLA_BACKEND=torch
export SGLANG_OPT_DEEPGEMM_HC_PRENORM=false  # use the old prenorm

export SGLANG_OPT_USE_TILELANG_MHC_PRE=false  # use the torch hc_pre
export SGLANG_OPT_USE_TILELANG_MHC_POST=false  # use the torch hc_post

export SGLANG_ENABLE_THINKING=1
export SGLANG_USE_AITER=1
export SGLANG_USE_ROCM700A=1
export SGLANG_TOPK_TRANSFORM_512_TORCH=1
export SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1

export SGLANG_DSV4_FP4_EXPERTS=false

export SGLANG_OPT_DPSK_V4_RADIX=0
export SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false  # the non-radix backend has no store_cache method
export SGLANG_OPT_USE_FUSED_STORE_CACHE=false  # the fused_store_cache JIT needs the CUDA toolchain

export SGLANG_FORCE_TRITON_MOE_FP8=1  # required to apply the swiglu_limit clamp in fused_moe_triton

python3 -m sglang.launch_server \
    --model-path sgl-project/DeepSeek-V4-Flash-FP8 \
    --trust-remote-code \
    --tp 4 \
    --dp 4 \
    --enable-dp-attention \
    --disable-radix-cache \
    --attention-backend compressed \
    --max-running-requests 256 \
    --page-size 256 \
    --chunked-prefill-size 8192 \
    --port 30010 \
    --disable-shared-experts-fusion \
    --disable-cuda-graph \
    --tool-call-parser deepseekv4 \
    --reasoning-parser deepseek-v4
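
Once the server reports ready, the OpenAI-compatible endpoint can be smoke-tested with a minimal request such as the sketch below (the prompt and token budget are arbitrary).

import requests

# Assumes the launch command above is running on this host with
# --port 30010; any OpenAI-compatible client works the same way.
resp = requests.post(
    "http://localhost:30010/v1/chat/completions",
    json={
        "model": "sgl-project/DeepSeek-V4-Flash-FP8",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])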

Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <baizhouzhang@radixark.ai>
Co-authored-by: Baizhou Zhang <baizhou@radixark.ai>
Co-authored-by: Baizhou Zhang <soberedezhang@gmail.com>
Co-authored-by: DarkSharpness <2040703891@qq.com>
Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com>
Co-authored-by: Fridge803 <soberedezhang@gmail.com>
Co-authored-by: Ke Bao <26454835+ispobock@users.noreply.github.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Co-authored-by: Mingyi Lu <wisclmy0611@gmail.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Co-authored-by: Yueming Yuan <yy28@illinois.edu>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: Yusheng Su <yushengsu.thu@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: yueming-yuan <yym022502@gmail.com>

rahulvijayaraghavan added a commit to rahulvijayaraghavan/sglang that referenced this pull request Apr 27, 2026
Cherry-pick the torch reference FlashMLA implementation from
sgl-project#23608 (AgainstEntropy:amd/deepseek_v4) so XPU and
other non-CUDA backends can route SGLANG_HACK_FLASHMLA_BACKEND=torch
through ref_sparse_attn_decode instead of the CUDA flash_mla kernel.

- Add python/sglang/srt/flashmla_tests/{__init__,lib,quant,ref}.py and
  the kernelkit/ helper package, taken verbatim from PR sgl-project#23608.
- Replace debug_flash_mla_adapter.py with the PR's version, but move
  'import flash_mla' into the backend == 'kernel' branch so the torch
  fallback does not require the CUDA-only flash_mla module.
- Gate the top-level 'import flash_mla' in flashmla_tests/lib.py on
  is_cuda() (was: 'not is_hip()'), since on XPU torch.cuda is also
  unavailable. The two helpers that actually call into flash_mla
  (run_flash_mla_sparse_fwd / run_flash_mla_decode) are unused on the
  torch fallback path.
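
The import gate described above amounts to roughly the following sketch (illustrative only, not the verbatim contents of flashmla_tests/lib.py).

import torch

def is_cuda() -> bool:
    # torch.version.cuda is None on ROCm (HIP) and XPU builds, so this
    # check is stricter than 'not is_hip()' and also skips the
    # CUDA-only import on XPU.
    return torch.version.cuda is not None and torch.cuda.is_available()

if is_cuda():
    import flash_mla  # CUDA-only kernel package
else:
    flash_mla = None  # the torch fallback path never calls into it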
rahulvijayaraghavan added a commit to rahulvijayaraghavan/sglang that referenced this pull request Apr 27, 2026
Port the non-fused 'old' compressor implementation from PR sgl-project#23608 so
DeepSeek V4 can run on XPU (and other backends without the CUDA-only
fused compress kernels).

- environ.py: register SGLANG_OPT_USE_OLD_COMPRESSOR,
  SGLANG_OPT_USE_FUSED_COMPRESS, SGLANG_OPT_USE_FUSED_PAGED_COMPRESS,
  SGLANG_OPT_DPSK_V4_RADIX (names match PR sgl-project#23608).
- mem_cache/compress_state.py: add KVAndScoreOld dataclass for the
  non-paged compress state used by the old path.
- mem_cache/deepseekv4_memory_pool.py: add non-paged
  DeepSeekV4CompressState pools and accessors.
- models/deepseek_v4.py: add Compressor.compress_decode_old /
  compress_extend_old / compress_dispatch with self-rewire, and route
  forward() through compress_dispatch. use_fused_compress mirrors
  PR sgl-project#23608 (paged-fused vs fused vs old).
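
A hedged sketch of the compress_dispatch self-rewire shape described in the last bullet; the method and flag names come from the commit message, while the bodies and the _env_flag helper are placeholders.

import os

def _env_flag(name: str, default: bool = False) -> bool:
    # Hypothetical helper: parse the boolean-style env vars used above.
    return os.environ.get(name, str(default)).lower() in ("1", "true")

class Compressor:
    def compress_fused(self, x):
        return x  # placeholder for the fused CUDA path

    def compress_decode_old(self, x):
        return x  # placeholder for the non-fused torch path

    def compress_dispatch(self, x):
        # Choose an implementation from the env flags once, then
        # rewire the instance so later calls skip the flag checks.
        if _env_flag("SGLANG_OPT_USE_FUSED_COMPRESS"):
            impl = self.compress_fused
        else:
            impl = self.compress_decode_old
        self.compress_dispatch = impl  # self-rewire
        return impl(x)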
rahulvijayaraghavan added a commit to rahulvijayaraghavan/sglang that referenced this pull request Apr 27, 2026
…ct#23608

- fp8_paged_mqa_logits_torch: restore strict 1D (batch_size,) seq_lens
  assert.
- forward_c4_indexer: pass indexer_metadata.c4_seq_lens directly (1D)
  instead of an unsqueeze(-1) variant. Matches PR sgl-project#23608 so the torch
  fallback agrees with the CUDA kernel signature.
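
The restored shape contract, as a tiny illustration; the function name comes from the commit message and everything else is a placeholder.

import torch

def fp8_paged_mqa_logits_torch(seq_lens: torch.Tensor, *args):
    # Strict 1D contract: seq_lens has shape (batch_size,), so callers
    # must pass it directly rather than an unsqueeze(-1) variant.
    assert seq_lens.dim() == 1, "seq_lens must have shape (batch_size,)"

fp8_paged_mqa_logits_torch(torch.tensor([3, 7, 5]))  # ok
# fp8_paged_mqa_logits_torch(torch.tensor([3, 7, 5]).unsqueeze(-1))  # raises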
rahulvijayaraghavan added a commit to rahulvijayaraghavan/sglang that referenced this pull request Apr 27, 2026
On XPU, TopK.forward_native unconditionally forced
topk_config.torch_native=True, which routed DeepSeek V4 (sqrtsoftplus +
correction_bias + non-zero num_fused_shared_experts +
routed_scaling_factor) through fused_topk_torch_native. That naive path
ignores num_fused_shared_experts and routed_scaling_factor, so the
fused-shared-expert slot was never populated and weights were never
scaled — producing garbage tokens.

- topk.py forward_native: only force torch_native when the input is NOT
  the DSv4 sqrtsoftplus + correction_bias + non-grouped case, so it
  falls through to biased_topk_impl (matches PR sgl-project#23608 CUDA semantics).
- topk.py fused_topk_torch_native: add sqrtsoftplus branch and a
  defensive correction_bias.to(scores.device) cast as a fallback.
- deepseek_v4_topk.py biased_topk_impl: same defensive device cast for
  correction_bias, and disable @torch.compile on XPU (mirrors the NPU
  carve-out).
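
The defensive cast in the last two bullets boils down to the sketch below; the tensor names follow the commit message, and the function wrapper is illustrative.

import torch

def add_correction_bias(scores: torch.Tensor,
                        correction_bias: torch.Tensor) -> torch.Tensor:
    # Defensive fallback: on XPU the bias buffer can end up on a
    # different device than the router scores, so cast before adding.
    return scores + correction_bias.to(scores.device)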

HaiShaw (Collaborator) commented May 2, 2026

The SGLang daily image for AMD deepseek_v4 is posted at https://hub.docker.com/r/rocm/sgl-dev/tags (tags ending with -DSv4).

@tawan0109

@HaiShaw @AgainstEntropy @jhinpan Any plan to support MI300X? We have a good number of MI300X GPUs and would like to use them for DSv4.

