Add AMD support for DeepSeek V4 #23608
Open
AgainstEntropy wants to merge 3 commits into sgl-project:main from
Conversation
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <baizhouzhang@radixark.ai>
Co-authored-by: Baizhou Zhang <baizhou@radixark.ai>
Co-authored-by: Baizhou Zhang <soberedezhang@gmail.com>
Co-authored-by: DarkSharpness <2040703891@qq.com>
Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com>
Co-authored-by: Fridge803 <soberedezhang@gmail.com>
Co-authored-by: Ke Bao <26454835+ispobock@users.noreply.github.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Co-authored-by: Mingyi Lu <wisclmy0611@gmail.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Co-authored-by: Yueming Yuan <yy28@illinois.edu>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: Yusheng Su <yushengsu.thu@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: yueming-yuan <yym022502@gmail.com>
rahulvijayaraghavan added a commit to rahulvijayaraghavan/sglang that referenced this pull request on Apr 27, 2026
Cherry-pick the torch reference FlashMLA implementation from sgl-project#23608 (AgainstEntropy:amd/deepseek_v4) so XPU and other non-CUDA backends can route SGLANG_HACK_FLASHMLA_BACKEND=torch through ref_sparse_attn_decode instead of the CUDA flash_mla kernel.

- Add python/sglang/srt/flashmla_tests/{__init__,lib,quant,ref}.py and the kernelkit/ helper package, taken verbatim from PR sgl-project#23608.
- Replace debug_flash_mla_adapter.py with the PR's version, but move 'import flash_mla' into the backend == 'kernel' branch so the torch fallback does not require the CUDA-only flash_mla module.
- Gate the top-level 'import flash_mla' in flashmla_tests/lib.py on is_cuda() (was: 'not is_hip()'), since torch.cuda is also unavailable on XPU. The two helpers that actually call into flash_mla (run_flash_mla_sparse_fwd / run_flash_mla_decode) are unused on the torch fallback path.
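The backend routing described in that commit can be sketched as a small helper. SGLANG_HACK_FLASHMLA_BACKEND is the environment variable named in the commit message; the function itself (name and return convention) is a hypothetical illustration, not sglang code. The key point is the deferred import: only the 'kernel' branch touches the CUDA-only flash_mla module, so the torch fallback works on backends where it cannot be imported.

```python
import os

def select_flashmla_backend() -> str:
    """Pick the FlashMLA backend: the CUDA 'kernel' path when
    flash_mla is importable, otherwise the torch reference path
    (ref_sparse_attn_decode). Hypothetical sketch of the routing
    described in the commit above."""
    backend = os.environ.get("SGLANG_HACK_FLASHMLA_BACKEND", "kernel")
    if backend == "kernel":
        try:
            # Deferred import: only the kernel branch needs the
            # CUDA-only flash_mla module.
            import flash_mla  # noqa: F401
        except ImportError:
            backend = "torch"
    return backend
```

Moving the import inside the branch, rather than at module top level, is what lets XPU and other non-CUDA backends load the adapter at all.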
rahulvijayaraghavan added a commit to rahulvijayaraghavan/sglang that referenced this pull request on Apr 27, 2026
Port the non-fused 'old' compressor implementation from PR sgl-project#23608 so DeepSeek V4 can run on XPU (and other backends without the CUDA-only fused compress kernels).

- environ.py: register SGLANG_OPT_USE_OLD_COMPRESSOR, SGLANG_OPT_USE_FUSED_COMPRESS, SGLANG_OPT_USE_FUSED_PAGED_COMPRESS, and SGLANG_OPT_DPSK_V4_RADIX (names match PR sgl-project#23608).
- mem_cache/compress_state.py: add a KVAndScoreOld dataclass for the non-paged compress state used by the old path.
- mem_cache/deepseekv4_memory_pool.py: add non-paged DeepSeekV4CompressState pools and accessors.
- models/deepseek_v4.py: add Compressor.compress_decode_old / compress_extend_old / compress_dispatch with self-rewire, and route forward() through compress_dispatch. use_fused_compress mirrors PR sgl-project#23608 (paged-fused vs fused vs old).
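The paged-fused vs fused vs old routing can be sketched as a flag lookup. The environment-variable names come from the commit message; the helper names, return values, and the precedence order (explicit old override first, then paged-fused, then fused) are assumptions for illustration only.

```python
import os

def _flag(name: str) -> bool:
    # Treat "1"/"true" as set; unset flags default to off.
    return os.environ.get(name, "0").lower() in ("1", "true")

def select_compressor() -> str:
    """Hypothetical sketch of compress_dispatch-style routing between
    the compressor implementations named in the commit above."""
    if _flag("SGLANG_OPT_USE_OLD_COMPRESSOR"):
        return "compress_old"
    if _flag("SGLANG_OPT_USE_FUSED_PAGED_COMPRESS"):
        return "compress_paged_fused"
    if _flag("SGLANG_OPT_USE_FUSED_COMPRESS"):
        return "compress_fused"
    # Backends without the CUDA fused kernels (e.g. XPU) land here.
    return "compress_old"
```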
rahulvijayaraghavan added a commit to rahulvijayaraghavan/sglang that referenced this pull request on Apr 27, 2026
…ct#23608

- fp8_paged_mqa_logits_torch: restore the strict 1D (batch_size,) seq_lens assert.
- forward_c4_indexer: pass indexer_metadata.c4_seq_lens directly (1D) instead of an unsqueeze(-1) variant.

Matches PR sgl-project#23608 so the torch fallback agrees with the CUDA kernel signature.
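The restored shape check amounts to rejecting anything but a 1D (batch_size,) tensor. A minimal sketch, written against a plain shape tuple so it runs without torch (in sglang the check would be on seq_lens.shape; the function name here is hypothetical):

```python
def assert_seq_lens_1d(shape: tuple, batch_size: int) -> None:
    """Strict (batch_size,) shape check like the one restored in
    fp8_paged_mqa_logits_torch: a 2D (batch_size, 1) tensor produced
    by unsqueeze(-1) must be rejected, not silently broadcast."""
    assert shape == (batch_size,), (
        f"seq_lens must be 1D with shape ({batch_size},), got {shape}"
    )
```

Keeping the assert strict is what guarantees the torch fallback and the CUDA kernel agree on the argument signature.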
rahulvijayaraghavan added a commit to rahulvijayaraghavan/sglang that referenced this pull request on Apr 27, 2026
On XPU, TopK.forward_native unconditionally forced topk_config.torch_native=True, which routed DeepSeek V4 (sqrtsoftplus + correction_bias + non-zero num_fused_shared_experts + routed_scaling_factor) through fused_topk_torch_native. That naive path ignores num_fused_shared_experts and routed_scaling_factor, so the fused-shared-expert slot was never populated and the weights were never scaled, producing garbage tokens.

- topk.py forward_native: only force torch_native when the input is NOT the DSv4 sqrtsoftplus + correction_bias + non-grouped case, so it falls through to biased_topk_impl (matches PR sgl-project#23608 CUDA semantics).
- topk.py fused_topk_torch_native: add a sqrtsoftplus branch and a defensive correction_bias.to(scores.device) cast as a fallback.
- deepseek_v4_topk.py biased_topk_impl: the same defensive device cast for correction_bias, and disable @torch.compile on XPU (mirrors the NPU carve-out).
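To make the failure mode concrete, here is a toy pure-Python version of the biased top-k routing the commit describes: experts are selected by score plus correction_bias, while the returned weights use the unbiased score scaled by routed_scaling_factor. Only the names quoted in the commit message come from the source; the exact selection/weighting split is an assumption for illustration, and the real biased_topk_impl operates on torch tensors and also handles fused shared experts.

```python
import math

def sqrtsoftplus(x: float) -> float:
    # sqrt(softplus(x)): the DSv4 scoring function named above.
    return math.sqrt(math.log1p(math.exp(x)))

def biased_topk(logits, correction_bias, k, routed_scaling_factor=1.0):
    """Toy sketch of biased top-k: correction_bias influences which
    experts are chosen, but not the weights they receive; the weights
    are the unbiased scores scaled by routed_scaling_factor."""
    scores = [sqrtsoftplus(x) for x in logits]
    biased = [s + b for s, b in zip(scores, correction_bias)]
    order = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)
    topk_ids = order[:k]
    topk_weights = [scores[i] * routed_scaling_factor for i in topk_ids]
    return topk_ids, topk_weights
```

Dropping routed_scaling_factor (as the naive fused_topk_torch_native path did) leaves every weight off by a constant factor, which is consistent with the garbage-token symptom described above.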
Collaborator
Integration branch:
Collaborator
SGLang daily image for AMD deepseek_v4 is posted at:
@HaiShaw @AgainstEntropy @jhinpan any plan to support MI300X? We have a good number of MI300X GPUs and want to use them for DSv4.
Docker image
lmsysorg/sglang:v0.5.8-rocm700-mi35x-dsk
To build a docker image yourself, please refer to the comment from @yushengsu-thu below.
Launch command