amd/deepseek_v4 integration 2/N - cuda graph 0426 #23832
HaiShaw merged 19 commits into sgl-project:amd/deepseek_v4
Conversation
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> Co-authored-by: Kangyan Zhou <kangyan.zhou@radixark.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…follow-up to sgl-project#23731) (sgl-project#23734) Co-authored-by: Byron Hsu <byron@periodiclabs.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llow-up to sgl-project#23731) (sgl-project#23732) Co-authored-by: Byron Hsu <byronhsu@noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
…kout (sgl-project#23747) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sgl-project#23749) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…project#23750) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> Co-authored-by: Kangyan Zhou <zky314343421@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gelu_and_mul (sgl-project#23707) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port DSv4 integration from the sglang-entropy codebase, including:
- DeepseekV4ForCausalLM model and NextN speculative decoding
- Compressed attention backend (deepseek_v4_backend, radix variant)
- DSv4 memory pool (deepseekv4_memory_pool, compress_state)
- DSv4 pool configurator and memory profiler
- Hash-based MoE routing (deepseek_v4_topk, HashTopK)
- DSv4 JIT kernels (CUDA .cuh headers + Python wrappers)
- Function call parser (deepseekv4_detector) and encoding (encoding_dsv4)
- Reasoning parser support (deepseek-v4)
- Environment variables for DSv4 configuration
- Config loading for the deepseek_v4 model_type via PretrainedConfig
- Integration with server_args, model_config, scheduler, and the forward pass
Made-with: Cursor
Made-with: Cursor
…ix bare except
- Use getattr with a None check instead of hasattr in _get_sliding_window_size to properly fall through None-valued attributes (e.g. Qwen2's sliding_window=None)
- Replace the bare except with except Exception in encoding_dsv4.py
Made-with: Cursor
- Pass out_cache_loc and actual_forward_mode in CudaGraphRunner replay
- Add **kwargs to all attention backend replay signatures for compatibility
- Rewrite fp8_paged_mqa_logits_torch as vectorized (no .item() / Python loops)
- Make topk_transform_512_pytorch_vectorized graph-capturable (cache arange tensors, use masked_fill_ instead of torch.tensor creation)
- Guard set_swa_loc for pools that don't support it (DeepSeekV4TokenToKVPool)
- Fix forward_normal_dual_stream missing input_ids_global for HashTopK
- Fix dummy out_cache_loc dtype (int32 -> int64) during capture
- Call on_after_cuda_graph_warmup_pass after warmup
Made-with: Cursor
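For the graph-capturability item above, a minimal sketch of the pattern (function and cache names hypothetical, not the actual topk_transform_512_pytorch_vectorized code): pre-allocate index tensors once and mutate in place with masked_fill_, since building a tensor with torch.tensor(...) inside a captured region triggers a host-to-device copy that breaks capture.

```python
import torch

# Cache of arange tensors keyed by (size, device); allocated on first
# use, before any CUDA graph capture, then reused on every call.
_arange_cache: dict = {}

def _cached_arange(n: int, device: torch.device) -> torch.Tensor:
    key = (n, device)
    if key not in _arange_cache:
        _arange_cache[key] = torch.arange(n, device=device)
    return _arange_cache[key]

def topk_mask(scores: torch.Tensor, k: int) -> torch.Tensor:
    # scores: [num_tokens, num_experts], assumed already sorted per row.
    positions = _cached_arange(scores.shape[-1], scores.device)
    # masked_fill_ takes a Python scalar, so no torch.tensor(...) is
    # created (its host-to-device copy is illegal during graph capture).
    scores.masked_fill_(positions >= k, float("-inf"))
    return scores
```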
On HIP/ROCm, deep_gemm_metadata is None. Move it to check_eq_fields instead of copy_fields so copy_metadata's assertion that all dataclass fields are accounted for passes during CUDA graph replay. Made-with: Cursor
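A hedged sketch of the field-accounting pattern this refers to (class and field sets are illustrative, not the actual SGLang dataclass): every field must land in exactly one of copy_fields or check_eq_fields, and platform-dependent None fields belong in the equality-checked set.

```python
import dataclasses
from dataclasses import dataclass
from typing import Any, Optional

import torch

@dataclass
class AttnMetadata:
    # Hypothetical fields for illustration.
    block_table: torch.Tensor
    seq_lens: torch.Tensor
    deep_gemm_metadata: Optional[Any]  # None on HIP/ROCm

# Fields whose tensors are copied into captured buffers on replay.
copy_fields = {"block_table", "seq_lens"}
# Fields only checked for equality; None-valued fields go here so the
# accounting assertion below still holds on platforms where they are unset.
check_eq_fields = {"deep_gemm_metadata"}

def copy_metadata(dst: AttnMetadata, src: AttnMetadata) -> None:
    all_fields = {f.name for f in dataclasses.fields(AttnMetadata)}
    # Every dataclass field must be accounted for in exactly one set.
    assert copy_fields | check_eq_fields == all_fields
    assert not (copy_fields & check_eq_fields)
    for name in copy_fields:
        getattr(dst, name).copy_(getattr(src, name))
    for name in check_eq_fields:
        assert getattr(dst, name) == getattr(src, name)
```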
Code Review
This pull request implements comprehensive support for DeepSeek-V4, introducing optimized CUDA kernels for compressed attention (C4/C128), expert-filtered activations, and a specialized memory pool architecture. It also updates diffusion pipelines to support request-local schedulers and adds extensive configuration for deployment. Review feedback identifies several critical issues in the attention backend, specifically regarding performance bottlenecks from host-device synchronizations, incorrect metadata copying during CUDA graph replay for variable batch sizes, and potential indexing errors in multi-token prediction (MTP). Additionally, improvements were suggested to enable expert filtering on non-CUDA platforms and to enhance thread safety by avoiding stateful caching within memory pools.
python/sglang/srt/layers/attention/deepseek_v4_backend_radix.py (1132)
Calling .item() on a GPU tensor causes a host-device synchronization. This is a significant performance bottleneck in the hot path and will break CUDA graph capture if this method is called during the capture phase. Consider using torch.cat with a sliced and expanded tensor to avoid the sync: torch.cat([req_pool_indices_repeated, req_pool_indices_repeated[-1:].expand(pad_size)]).
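A small sketch of the suggested sync-free padding (helper name hypothetical):

```python
import torch

def pad_by_repeating_last_row(x: torch.Tensor, pad_size: int) -> torch.Tensor:
    # x[-1:] is a view of the last row and expand() only broadcasts it,
    # so nothing here calls .item(): no host-device synchronization, and
    # the op remains legal inside CUDA graph capture.
    if pad_size == 0:
        return x
    return torch.cat([x, x[-1:].expand(pad_size, *x.shape[1:])])

# e.g. padding a batch of 4 request indices up to a captured size of 8
indices = torch.tensor([3, 7, 7, 9])
assert pad_by_repeating_last_row(indices, 4).tolist() == [3, 7, 7, 9, 9, 9, 9, 9]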
python/sglang/srt/layers/attention/deepseek_v4_backend_radix.py (190-237)
The copy_ implementation for DSV4AttnMetadataRadix (and its constituent metadata classes) uses dst_val.copy_(src_val) which requires exact shape matching. During CUDA graph replay, the destination tensors have the captured max_bs size, while the source tensors from temp_metadata have the current bs size. This will cause a runtime error when bs < max_bs. You should use slicing to copy only the active portion: dst_val[:src_val.shape[0]].copy_(src_val).
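A minimal sketch of the suggested sliced copy under the replay scenario described (variable names from the comment, sizes invented):

```python
import torch

# Captured destination buffers are sized for max_bs; on replay the source
# metadata from temp_metadata only covers the current batch size bs.
max_bs, bs, width = 8, 3, 64
dst_val = torch.zeros(max_bs, width)
src_val = torch.randn(bs, width)

# dst_val.copy_(src_val) raises on the shape mismatch when bs < max_bs.
# Slicing copies only the active prefix; the trailing rows hold stale
# data that kernels never read for a batch of size bs.
dst_val[:src_val.shape[0]].copy_(src_val)
```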
python/sglang/srt/layers/attention/deepseek_v4_backend_radix.py (1044)
The use of _pad_tensor_to_size (which internally uses torch.cat) is problematic for MTP if topk > 1. If q has bs * topk tokens but swa_page_indices only has bs rows, padding with 0 will result in incorrect indices for all but the first token of each request. It should likely be repeat_interleave(topk, dim=0) instead. Furthermore, torch.cat should be avoided in the forward path to maintain CUDA graph compatibility and performance.
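A toy sketch contrasting the two behaviors (shapes assumed from the comment):

```python
import torch

bs, topk, num_pages = 2, 3, 4
swa_page_indices = torch.arange(bs * num_pages).reshape(bs, num_pages)

# Zero-padding (what padding with 0 produces): rows beyond the first bs
# are zeros, so every speculative token after the first would read page
# index 0 instead of its own request's pages.
padded = torch.cat(
    [swa_page_indices, swa_page_indices.new_zeros(bs * (topk - 1), num_pages)]
)

# repeat_interleave: row i appears topk times in a row, so all topk
# draft tokens of request i see that request's page indices.
repeated = swa_page_indices.repeat_interleave(topk, dim=0)
assert repeated.shape == (bs * topk, num_pages)
```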
python/sglang/srt/layers/moe/moe_runner/triton_utils/fused_moe.py (532)
The expert filtering logic is restricted to _is_cuda. While the comment mentions that HIP/XPU fall through to the unfiltered path where the down kernel handles zeros, this results in redundant computation in the activation kernel for filtered tokens on those platforms. If the silu_and_mul JIT kernel supports filtering on ROCm, it should be enabled here as well to improve efficiency.
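A hedged sketch of what widening the gate could look like (silu_and_mul here is a toy reference implementation; the filtered variant and the platform flags are assumptions, not the actual fused_moe code):

```python
import torch
import torch.nn.functional as F

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Toy reference: SiLU(gate) * up over a last-dim split.
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up

def silu_and_mul_filtered(x: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    # Hypothetical filtered variant: only rows that survive expert
    # filtering are computed; filtered rows stay zero for the down kernel.
    out = x.new_zeros(x.shape[0], x.shape[-1] // 2)
    out[keep] = silu_and_mul(x[keep])
    return out

def apply_activation(x, keep, is_cuda: bool, is_hip: bool) -> torch.Tensor:
    # Widened gate: let HIP take the filtered path too, assuming the
    # ROCm build of the JIT kernel supports filtering.
    if is_cuda or is_hip:
        return silu_and_mul_filtered(x, keep)
    # Fallback: unfiltered activation; work on filtered tokens is wasted.
    return silu_and_mul(x)
```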
python/sglang/srt/mem_cache/deepseekv4_memory_pool.py (830)
Caching swa_loc in self.cached_loc assumes that the model forward pass is strictly sequential across layers and that only one forward pass happens at a time. While this is currently true for the SGLang model runner, it makes the code non-thread-safe and could lead to subtle bugs if the execution model changes (e.g., multi-threaded speculative decoding steps). Consider passing this metadata through the forward_batch or metadata objects instead of storing it as state in the pool.
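A sketch of the stateless alternative the review proposes (class and field names hypothetical): carry swa_loc on a per-forward metadata object instead of caching it on the shared pool.

```python
from dataclasses import dataclass

import torch

@dataclass
class ForwardBatchMeta:
    # Hypothetical per-forward metadata: one instance per forward pass, so
    # concurrent or interleaved forwards cannot clobber each other's state.
    swa_loc: torch.Tensor  # KV slot indices for this pass's SWA tokens

class SWAKVPool:
    def __init__(self, num_slots: int, dim: int):
        self.buf = torch.zeros(num_slots, dim)

    def write(self, meta: ForwardBatchMeta, values: torch.Tensor) -> None:
        # Stateless: the location arrives with the call rather than being
        # read from a self.cached_loc field set by an earlier layer, which
        # assumed strictly sequential layer execution in a single forward.
        self.buf[meta.swa_loc] = values
```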
Motivation
Update amd/deepseek_v4 integration branch
The following PRs have a large set of conflicts, so we use this PR and the upstream amd/deepseek_v4 branch to integrate in parallel:
#23600
#23608
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci