
amd/deepseek_v4 integration 1/N - 0426#23787

Merged
HaiShaw merged 14 commits into sgl-project:amd/deepseek_v4 from HaiShaw:amd/deepseek_v4_0426 on Apr 27, 2026

Conversation

@HaiShaw (Collaborator) commented Apr 27, 2026

Motivation

Update amd/deepseek_v4 integration branch

The following PRs have a large set of conflicts, so we use this PR and the upstream amd/deepseek_v4 branch to integrate them in parallel.
#23600
#23608

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

sglang-bot and others added 14 commits April 25, 2026 17:13
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Kangyan Zhou <kangyan.zhou@radixark.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…follow-up to sgl-project#23731) (sgl-project#23734)

Co-authored-by: Byron Hsu <byron@periodiclabs.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llow-up to sgl-project#23731) (sgl-project#23732)

Co-authored-by: Byron Hsu <byronhsu@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
…kout (sgl-project#23747)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sgl-project#23749)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…project#23750)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Kangyan Zhou <zky314343421@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gelu_and_mul (sgl-project#23707)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port DSv4 integration from sglang-entropy codebase including:
- DeepseekV4ForCausalLM model and NextN speculative decoding
- Compressed attention backend (deepseek_v4_backend, radix variant)
- DSv4 memory pool (deepseekv4_memory_pool, compress_state)
- DSv4 pool configurator and memory profiler
- Hash-based MoE routing (deepseek_v4_topk, HashTopK)
- DSv4 JIT kernels (CUDA .cuh headers + Python wrappers)
- Function call parser (deepseekv4_detector) and encoding (encoding_dsv4)
- Reasoning parser support (deepseek-v4)
- Environment variables for DSv4 configuration
- Config loading for deepseek_v4 model_type via PretrainedConfig
- Integration with server_args, model_config, scheduler, and forward pass

Made-with: Cursor
…ix bare except

- Use getattr with None check instead of hasattr in _get_sliding_window_size
  to properly fall through None-valued attributes (e.g. Qwen2's sliding_window=None)
- Replace bare except with except Exception in encoding_dsv4.py

Made-with: Cursor
github-actions bot added labels on Apr 27, 2026: quant (LLM Quantization), amd, dependencies (Pull requests that update a dependency file), deepseek, hicache (Hierarchical Caching for SGLang), sgl-kernel, diffusion (SGLang Diffusion), mthreads, jit-kernel
HaiShaw changed the title from "Amd/deepseek v4 0426" to "amd/deepseek_v4 integration 0426" on Apr 27, 2026
HaiShaw merged commit d773133 into sgl-project:amd/deepseek_v4 on Apr 27, 2026
51 of 105 checks passed
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces support for DeepSeek-V4, featuring compressed attention backends, a specialized indexer, and speculative decoding (NextN) with optimized JIT kernels. Key updates include a redesigned memory pool and fused metadata initialization. Feedback identifies critical logic and performance issues in the DeepseekV4BackendRadix and DeepseekV4MultiStepBackend implementations, specifically addressing off-by-one errors in speculative decoding loops, performance bottlenecks from host-device synchronization, and platform compatibility concerns on AMD/ROCm systems.

def flash_mla_with_kvcache_entrypoint(backend: str, **kwargs):
    if is_hip():
        # backend == "torch"
        import os

Severity: high

On AMD/ROCm platforms (is_hip()), the flash_mla kernel is typically not available. Defaulting to "kernel" will lead to an import error or crash. It is better to default to "torch" when running on HIP to ensure the fallback path is used unless explicitly overridden.

Suggested change
    import os
    backend = os.environ.get("SGLANG_HACK_FLASHMLA_BACKEND", "torch")
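
A standalone sketch of the suggested fallback, runnable outside sglang (the is_hip stub and the two dispatch placeholders are assumptions; only the env-var default comes from the suggestion above):

import os
import torch

def is_hip() -> bool:
    # Minimal stand-in for sglang's is_hip() helper: True on ROCm builds of torch.
    return torch.version.hip is not None

def flash_mla_with_kvcache_entrypoint(backend: str, **kwargs):
    if is_hip():
        # flash_mla kernels are typically unavailable on ROCm, so default to
        # the "torch" path unless the env var explicitly overrides it.
        backend = os.environ.get("SGLANG_HACK_FLASHMLA_BACKEND", "torch")
    if backend == "torch":
        pass  # placeholder for the reference (non-kernel) attention path
    else:
        pass  # placeholder for the flash_mla kernel path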

)

def init_forward_metadata(self, forward_batch: ForwardBatch):
    for i in range(self.speculative_num_steps - 1):

Severity: high

The loop range range(self.speculative_num_steps - 1) will skip the last backend in self.attn_backends. It should be range(self.speculative_num_steps) to ensure all backends are initialized.

Suggested change
-    for i in range(self.speculative_num_steps - 1):
+    for i in range(self.speculative_num_steps):
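
A tiny runnable illustration of the off-by-one (the backend list here is a hypothetical stand-in):

attn_backends = ["step0", "step1", "step2"]
speculative_num_steps = len(attn_backends)

# range(n - 1) stops at index n - 2, so the last backend is never visited.
print([attn_backends[i] for i in range(speculative_num_steps - 1)])  # ['step0', 'step1']
print([attn_backends[i] for i in range(speculative_num_steps)])      # ['step0', 'step1', 'step2']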

Comment on lines +1301 to +1308
            forward_mode=ForwardMode.DECODE,
            spec_info=forward_batch.spec_info,
            seq_lens_cpu=forward_batch.seq_lens_cpu,
            out_cache_loc=forward_batch.out_cache_loc,
        )
        temp_metadata = self.attn_backends[0].forward_metadata

        # Copy to other backends without recomputing

Severity: high

The loop range range(1, self.speculative_num_steps - 1) skips the last backend in self.attn_backends. For example, if speculative_num_steps is 2, the range is empty and the second backend is never updated. It should be range(1, self.speculative_num_steps). Additionally, self.forward_metadata for the MultiStepBackend instance itself should be set to temp_metadata to avoid using stale or None metadata during the forward pass.

        temp_metadata = self.attn_backends[0].forward_metadata
        self.forward_metadata = temp_metadata

        # Copy to other backends without recomputing
        for i in range(1, self.speculative_num_steps):
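
A toy, runnable model of the suggested fix (SimpleNamespace and the metadata dict are placeholders for the real backend and metadata objects): compute metadata once on backend 0, mirror it onto the multi-step backend's own forward_metadata, then share it with every remaining per-step backend.

from types import SimpleNamespace

speculative_num_steps = 2
attn_backends = [SimpleNamespace(forward_metadata=None) for _ in range(speculative_num_steps)]

temp_metadata = {"forward_mode": "DECODE"}  # placeholder for real metadata
attn_backends[0].forward_metadata = temp_metadata
forward_metadata = temp_metadata            # keep the wrapper's own view in sync

# range(1, n) visits indices 1 .. n-1; range(1, n - 1) is empty when n == 2.
for i in range(1, speculative_num_steps):
    attn_backends[i].forward_metadata = temp_metadata

assert all(b.forward_metadata is temp_metadata for b in attn_backends)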

Comment on lines +1141 to +1145
            req_pool_indices_repeated,
            (0, pad_size),
            value=req_pool_indices_repeated[-1].item(),
        )


Severity: medium

Calling .item() on a GPU tensor causes a host-device synchronization, which can significantly degrade performance in the hot path. Additionally, if num_tokens is 0, req_pool_indices_repeated[-1] will raise an IndexError. Consider using torch.cat with expand to perform the padding entirely on the device without synchronization.

            if num_tokens > 0:
                padding = req_pool_indices_repeated[-1:].expand(pad_size)
            else:
                padding = req_pool_indices_repeated.new_zeros(pad_size)
            req_pool_indices_repeated = torch.cat([req_pool_indices_repeated, padding])
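
The same idea as a self-contained helper (the function name is hypothetical): t[-1:] keeps a one-element view and expand() broadcasts it without copying, so the padding is built entirely on the device.

import torch

def pad_with_last_value(t: torch.Tensor, pad_size: int) -> torch.Tensor:
    if t.numel() > 0:
        # Broadcast the last element instead of reading it back via .item().
        padding = t[-1:].expand(pad_size)
    else:
        # Guard the empty case, where t[-1] would raise an IndexError.
        padding = t.new_zeros(pad_size)
    return torch.cat([t, padding])

req_pool_indices_repeated = torch.tensor([3, 3, 7])
print(pad_with_last_value(req_pool_indices_repeated, pad_size=2))  # tensor([3, 3, 7, 7, 7])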

Comment on lines +1318 to +1322
    if value == 0:
        return torch.cat(
            [tensor, tensor.new_zeros(size - tensor.shape[0], *tensor.shape[1:])],
            dim=0,
        )

Severity: medium

This function performs a torch.cat even when no padding is required (i.e., when size == tensor.shape[0]), which is inefficient. Also, it will fail if size < tensor.shape[0]. Adding a check to return the original tensor when padding is not needed would be better.

    if size <= tensor.shape[0]:
        return tensor
    if value == 0:
        return torch.cat(
            [tensor, tensor.new_zeros(size - tensor.shape[0], *tensor.shape[1:])],
            dim=0,
        )
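
Put together as a runnable sketch (pad_to_size is a hypothetical name; the non-zero value branch is an assumption about the elided remainder of the function):

import torch

def pad_to_size(tensor: torch.Tensor, size: int, value: float = 0.0) -> torch.Tensor:
    # Early return: skip torch.cat when no padding is needed, and avoid a
    # negative pad length when size < tensor.shape[0].
    if size <= tensor.shape[0]:
        return tensor
    pad_shape = (size - tensor.shape[0], *tensor.shape[1:])
    if value == 0:
        return torch.cat([tensor, tensor.new_zeros(pad_shape)], dim=0)
    return torch.cat([tensor, tensor.new_full(pad_shape, value)], dim=0)

x = torch.ones(3, 4)
assert pad_to_size(x, 3) is x            # no-op path returns the input unchanged
assert pad_to_size(x, 5).shape == (5, 4)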

HaiShaw changed the title from "amd/deepseek_v4 integration 0426" to "amd/deepseek_v4 integration 1/N - 0426" on Apr 27, 2026

Labels

amd, deepseek, dependencies (Pull requests that update a dependency file), diffusion (SGLang Diffusion), hicache (Hierarchical Caching for SGLang), jit-kernel, mthreads, quant (LLM Quantization), sgl-kernel


9 participants