
amd/deepseek_v4 integration 1/N - 0426#23787

Merged
HaiShaw merged 14 commits into sgl-project:amd/deepseek_v4 from HaiShaw:amd/deepseek_v4_0426 on Apr 27, 2026

Conversation

@HaiShaw (Collaborator) commented Apr 27, 2026

Motivation

Update amd/deepseek_v4 integration branch

The following PRs have a large set of conflicts, so we use this PR and the upstream amd/deepseek_v4 branch to integrate them in parallel.
#23600
#23608

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

sglang-bot and others added 14 commits April 25, 2026 17:13
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Kangyan Zhou <kangyan.zhou@radixark.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…follow-up to sgl-project#23731) (sgl-project#23734)

Co-authored-by: Byron Hsu <byron@periodiclabs.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llow-up to sgl-project#23731) (sgl-project#23732)

Co-authored-by: Byron Hsu <byronhsu@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
…kout (sgl-project#23747)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sgl-project#23749)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…project#23750)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Kangyan Zhou <zky314343421@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gelu_and_mul (sgl-project#23707)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port DSv4 integration from sglang-entropy codebase including:
- DeepseekV4ForCausalLM model and NextN speculative decoding
- Compressed attention backend (deepseek_v4_backend, radix variant)
- DSv4 memory pool (deepseekv4_memory_pool, compress_state)
- DSv4 pool configurator and memory profiler
- Hash-based MoE routing (deepseek_v4_topk, HashTopK)
- DSv4 JIT kernels (CUDA .cuh headers + Python wrappers)
- Function call parser (deepseekv4_detector) and encoding (encoding_dsv4)
- Reasoning parser support (deepseek-v4)
- Environment variables for DSv4 configuration
- Config loading for deepseek_v4 model_type via PretrainedConfig
- Integration with server_args, model_config, scheduler, and forward pass

Made-with: Cursor
…ix bare except

- Use getattr with None check instead of hasattr in _get_sliding_window_size
  to properly fall through None-valued attributes (e.g. Qwen2's sliding_window=None)
- Replace bare except with except Exception in encoding_dsv4.py

Made-with: Cursor
github-actions bot added labels on Apr 27, 2026: quant (LLM Quantization), amd, dependencies (Pull requests that update a dependency file), deepseek, hicache (Hierarchical Caching for SGLang), sgl-kernel, diffusion (SGLang Diffusion), mthreads, jit-kernel
HaiShaw changed the title from "Amd/deepseek v4 0426" to "amd/deepseek_v4 integration 0426" on Apr 27, 2026
HaiShaw merged commit d773133 into sgl-project:amd/deepseek_v4 on Apr 27, 2026
51 of 105 checks passed
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces support for DeepSeek-V4, featuring compressed attention backends, a specialized indexer, and speculative decoding (NextN) with optimized JIT kernels. Key updates include a redesigned memory pool and fused metadata initialization. Feedback identifies critical logic and performance issues in the DeepseekV4BackendRadix and DeepseekV4MultiStepBackend implementations, specifically addressing off-by-one errors in speculative decoding loops, performance bottlenecks from host-device synchronization, and platform compatibility concerns on AMD/ROCm systems.

def flash_mla_with_kvcache_entrypoint(backend: str, **kwargs):
    if is_hip():
        # backend == "torch"
        import os

Severity: high

On AMD/ROCm platforms (is_hip()), the flash_mla kernel is typically not available. Defaulting to "kernel" will lead to an import error or crash. It is better to default to "torch" when running on HIP to ensure the fallback path is used unless explicitly overridden.

Suggested change
    import os
    backend = os.environ.get("SGLANG_HACK_FLASHMLA_BACKEND", "torch")
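
A standalone sketch of the suggested fallback, runnable outside sglang (the is_hip stub and the two dispatch placeholders are assumptions; only the env-var default comes from the suggestion above):

import os
import torch

def is_hip() -> bool:
    # Minimal stand-in for sglang's is_hip() helper: True on ROCm builds of torch.
    return torch.version.hip is not None

def flash_mla_with_kvcache_entrypoint(backend: str, **kwargs):
    if is_hip():
        # flash_mla kernels are typically unavailable on ROCm, so default to
        # the "torch" path unless the env var explicitly overrides it.
        backend = os.environ.get("SGLANG_HACK_FLASHMLA_BACKEND", "torch")
    if backend == "torch":
        pass  # placeholder for the reference (non-kernel) attention path
    else:
        pass  # placeholder for the flash_mla kernel path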

)

def init_forward_metadata(self, forward_batch: ForwardBatch):
    for i in range(self.speculative_num_steps - 1):

Severity: high

The loop range range(self.speculative_num_steps - 1) will skip the last backend in self.attn_backends. It should be range(self.speculative_num_steps) to ensure all backends are initialized.

Suggested change
-    for i in range(self.speculative_num_steps - 1):
+    for i in range(self.speculative_num_steps):
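
A tiny runnable illustration of the off-by-one (the backend list here is a hypothetical stand-in):

attn_backends = ["step0", "step1", "step2"]
speculative_num_steps = len(attn_backends)

# range(n - 1) stops at index n - 2, so the last backend is never visited.
print([attn_backends[i] for i in range(speculative_num_steps - 1)])  # ['step0', 'step1']
print([attn_backends[i] for i in range(speculative_num_steps)])      # ['step0', 'step1', 'step2']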

Comment on lines +1301 to +1308
            forward_mode=ForwardMode.DECODE,
            spec_info=forward_batch.spec_info,
            seq_lens_cpu=forward_batch.seq_lens_cpu,
            out_cache_loc=forward_batch.out_cache_loc,
        )
        temp_metadata = self.attn_backends[0].forward_metadata

        # Copy to other backends without recomputing

Severity: high

The loop range range(1, self.speculative_num_steps - 1) skips the last backend in self.attn_backends. For example, if speculative_num_steps is 2, the range is empty and the second backend is never updated. It should be range(1, self.speculative_num_steps). Additionally, self.forward_metadata for the MultiStepBackend instance itself should be set to temp_metadata to avoid using stale or None metadata during the forward pass.

        temp_metadata = self.attn_backends[0].forward_metadata
        self.forward_metadata = temp_metadata

        # Copy to other backends without recomputing
        for i in range(1, self.speculative_num_steps):
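
A toy, runnable model of the suggested fix (SimpleNamespace and the metadata dict are placeholders for the real backend and metadata objects): compute metadata once on backend 0, mirror it onto the multi-step backend's own forward_metadata, then share it with every remaining per-step backend.

from types import SimpleNamespace

speculative_num_steps = 2
attn_backends = [SimpleNamespace(forward_metadata=None) for _ in range(speculative_num_steps)]

temp_metadata = {"forward_mode": "DECODE"}  # placeholder for real metadata
attn_backends[0].forward_metadata = temp_metadata
forward_metadata = temp_metadata            # keep the wrapper's own view in sync

# range(1, n) visits indices 1 .. n-1; range(1, n - 1) is empty when n == 2.
for i in range(1, speculative_num_steps):
    attn_backends[i].forward_metadata = temp_metadata

assert all(b.forward_metadata is temp_metadata for b in attn_backends)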

Comment on lines +1141 to +1145
            req_pool_indices_repeated,
            (0, pad_size),
            value=req_pool_indices_repeated[-1].item(),
        )


Severity: medium

Calling .item() on a GPU tensor causes a host-device synchronization, which can significantly degrade performance in the hot path. Additionally, if num_tokens is 0, req_pool_indices_repeated[-1] will raise an IndexError. Consider using torch.cat with expand to perform the padding entirely on the device without synchronization.

            if num_tokens > 0:
                padding = req_pool_indices_repeated[-1:].expand(pad_size)
            else:
                padding = req_pool_indices_repeated.new_zeros(pad_size)
            req_pool_indices_repeated = torch.cat([req_pool_indices_repeated, padding])
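
The same idea as a self-contained helper (the function name is hypothetical): t[-1:] keeps a one-element view and expand() broadcasts it without copying, so the padding is built entirely on the device.

import torch

def pad_with_last_value(t: torch.Tensor, pad_size: int) -> torch.Tensor:
    if t.numel() > 0:
        # Broadcast the last element instead of reading it back via .item().
        padding = t[-1:].expand(pad_size)
    else:
        # Guard the empty case, where t[-1] would raise an IndexError.
        padding = t.new_zeros(pad_size)
    return torch.cat([t, padding])

req_pool_indices_repeated = torch.tensor([3, 3, 7])
print(pad_with_last_value(req_pool_indices_repeated, pad_size=2))  # tensor([3, 3, 7, 7, 7])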

Comment on lines +1318 to +1322
    if value == 0:
        return torch.cat(
            [tensor, tensor.new_zeros(size - tensor.shape[0], *tensor.shape[1:])],
            dim=0,
        )

Severity: medium

This function performs a torch.cat even when no padding is required (i.e., when size == tensor.shape[0]), which is inefficient. Also, it will fail if size < tensor.shape[0]. Adding a check to return the original tensor when padding is not needed would be better.

    if size <= tensor.shape[0]:
        return tensor
    if value == 0:
        return torch.cat(
            [tensor, tensor.new_zeros(size - tensor.shape[0], *tensor.shape[1:])],
            dim=0,
        )
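
Put together as a runnable sketch (pad_to_size is a hypothetical name; the non-zero value branch is an assumption about the elided remainder of the function):

import torch

def pad_to_size(tensor: torch.Tensor, size: int, value: float = 0.0) -> torch.Tensor:
    # Early return: skip torch.cat when no padding is needed, and avoid a
    # negative pad length when size < tensor.shape[0].
    if size <= tensor.shape[0]:
        return tensor
    pad_shape = (size - tensor.shape[0], *tensor.shape[1:])
    if value == 0:
        return torch.cat([tensor, tensor.new_zeros(pad_shape)], dim=0)
    return torch.cat([tensor, tensor.new_full(pad_shape, value)], dim=0)

x = torch.ones(3, 4)
assert pad_to_size(x, 3) is x            # no-op path returns the input unchanged
assert pad_to_size(x, 5).shape == (5, 4)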

HaiShaw changed the title from "amd/deepseek_v4 integration 0426" to "amd/deepseek_v4 integration 1/N - 0426" on Apr 27, 2026

Labels

amd, deepseek, dependencies (Pull requests that update a dependency file), diffusion (SGLang Diffusion), hicache (Hierarchical Caching for SGLang), jit-kernel, mthreads, quant (LLM Quantization), sgl-kernel


9 participants