
amd/deepseek_v4 integration 2/N - cuda graph 0426#23832

Merged
HaiShaw merged 19 commits into sgl-project:amd/deepseek_v4 from HaiShaw:amd/deepseek_v4_cuda-graph_0426
Apr 27, 2026

Conversation


@kkHuang-amd kkHuang-amd commented Apr 27, 2026

Motivation

Update amd/deepseek_v4 integration branch

The following PRs have a large set of conflicts, so we use this PR and the upstream amd/deepseek_v4 branch to integrate them in parallel:
#23600
#23608

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

sglang-bot and others added 18 commits April 25, 2026 17:13
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Kangyan Zhou <kangyan.zhou@radixark.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…follow-up to sgl-project#23731) (sgl-project#23734)

Co-authored-by: Byron Hsu <byron@periodiclabs.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llow-up to sgl-project#23731) (sgl-project#23732)

Co-authored-by: Byron Hsu <byronhsu@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
…kout (sgl-project#23747)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sgl-project#23749)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…project#23750)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Kangyan Zhou <zky314343421@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gelu_and_mul (sgl-project#23707)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port DSv4 integration from sglang-entropy codebase including:
- DeepseekV4ForCausalLM model and NextN speculative decoding
- Compressed attention backend (deepseek_v4_backend, radix variant)
- DSv4 memory pool (deepseekv4_memory_pool, compress_state)
- DSv4 pool configurator and memory profiler
- Hash-based MoE routing (deepseek_v4_topk, HashTopK)
- DSv4 JIT kernels (CUDA .cuh headers + Python wrappers)
- Function call parser (deepseekv4_detector) and encoding (encoding_dsv4)
- Reasoning parser support (deepseek-v4)
- Environment variables for DSv4 configuration
- Config loading for deepseek_v4 model_type via PretrainedConfig
- Integration with server_args, model_config, scheduler, and forward pass

Made-with: Cursor
…ix bare except

- Use getattr with None check instead of hasattr in _get_sliding_window_size
  to properly fall through None-valued attributes (e.g. Qwen2's sliding_window=None)
- Replace bare except with except Exception in encoding_dsv4.py

Made-with: Cursor
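The hasattr-to-getattr change described above can be sketched as follows. The config class, attribute values, and default are illustrative, not SGLang's actual code:

```python
# Hypothetical sketch of the getattr-with-None-check pattern described above;
# the config class and default value are illustrative only.
class Qwen2LikeConfig:
    sliding_window = None  # attribute exists but is explicitly None

def get_sliding_window_size(config, default=-1):
    # hasattr() returns True even when the attribute's value is None, so an
    # hasattr-based check would wrongly take the attribute path; getattr with
    # a None check falls through to the default instead.
    value = getattr(config, "sliding_window", None)
    return default if value is None else value

print(get_sliding_window_size(Qwen2LikeConfig()))  # -1
```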
- Pass out_cache_loc and actual_forward_mode in CudaGraphRunner replay
- Add **kwargs to all attention backend replay signatures for compatibility
- Rewrite fp8_paged_mqa_logits_torch as vectorized (no .item() / Python loops)
- Make topk_transform_512_pytorch_vectorized graph-capturable (cache arange
  tensors, use masked_fill_ instead of torch.tensor creation)
- Guard set_swa_loc for pools that don't support it (DeepSeekV4TokenToKVPool)
- Fix forward_normal_dual_stream missing input_ids_global for HashTopK
- Fix dummy out_cache_loc dtype (int32 -> int64) during capture
- Call on_after_cuda_graph_warmup_pass after warmup

Made-with: Cursor
On HIP/ROCm, deep_gemm_metadata is None. Move it to check_eq_fields
instead of copy_fields so copy_metadata's assertion that all dataclass
fields are accounted for passes during CUDA graph replay.

Made-with: Cursor
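The copy_fields / check_eq_fields split and the "all fields accounted for" assertion described above can be sketched with a small dataclass. The field names and class layout here are hypothetical:

```python
from dataclasses import dataclass, fields

# Hypothetical sketch of the copy_fields / check_eq_fields split described in
# the commit above; field names and class layout are illustrative only.
@dataclass
class Meta:
    seq_lens: object            # changes per step -> copied on replay
    deep_gemm_metadata: object  # None on HIP/ROCm -> only checked for equality

COPY_FIELDS = {"seq_lens"}
CHECK_EQ_FIELDS = {"deep_gemm_metadata"}

def copy_metadata(dst, src):
    # The assertion the commit mentions: every dataclass field must be
    # accounted for by exactly one of the two groups.
    names = {f.name for f in fields(src)}
    assert names == COPY_FIELDS | CHECK_EQ_FIELDS, "unaccounted fields"
    for name in COPY_FIELDS:
        setattr(dst, name, getattr(src, name))
    for name in CHECK_EQ_FIELDS:
        assert getattr(dst, name) == getattr(src, name)
```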

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements comprehensive support for DeepSeek-V4, introducing optimized CUDA kernels for compressed attention (C4/C128), expert-filtered activations, and a specialized memory pool architecture. It also updates diffusion pipelines to support request-local schedulers and adds extensive configuration for deployment. Review feedback identifies several critical issues in the attention backend, specifically regarding performance bottlenecks from host-device synchronizations, incorrect metadata copying during CUDA graph replay for variable batch sizes, and potential indexing errors in multi-token prediction (MTP). Additionally, improvements were suggested to enable expert filtering on non-CUDA platforms and to enhance thread safety by avoiding stateful caching within memory pools.

I am having trouble creating individual review comments; my feedback is listed below.

python/sglang/srt/layers/attention/deepseek_v4_backend_radix.py (1132)

high

Calling .item() on a GPU tensor causes a host-device synchronization. This is a significant performance bottleneck in the hot path and will break CUDA graph capture if this method is called during the capture phase. Consider using torch.cat with a sliced and expanded tensor to avoid the sync: torch.cat([req_pool_indices_repeated, req_pool_indices_repeated[-1:].expand(pad_size)]).
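The sync-free padding the review suggests can be sketched like this; the tensor values are illustrative, while the variable names follow the review:

```python
import torch

# Sketch of the review's suggestion: pad by expanding the last element on
# device instead of reading a size via .item(), which forces a host-device
# sync and breaks CUDA graph capture. Values are illustrative.
req_pool_indices_repeated = torch.tensor([0, 0, 1, 1, 2, 2])
pad_size = 2
padded = torch.cat([
    req_pool_indices_repeated,
    req_pool_indices_repeated[-1:].expand(pad_size),
])
print(padded.tolist())  # [0, 0, 1, 1, 2, 2, 2, 2]
```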

python/sglang/srt/layers/attention/deepseek_v4_backend_radix.py (190-237)

high

The copy_ implementation for DSV4AttnMetadataRadix (and its constituent metadata classes) uses dst_val.copy_(src_val) which requires exact shape matching. During CUDA graph replay, the destination tensors have the captured max_bs size, while the source tensors from temp_metadata have the current bs size. This will cause a runtime error when bs < max_bs. You should use slicing to copy only the active portion: dst_val[:src_val.shape[0]].copy_(src_val).
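The sliced copy the review proposes can be sketched as below; the shapes are illustrative stand-ins for the captured and current batch sizes:

```python
import torch

# Sketch of the review's sliced copy: destination buffers were captured at
# max_bs, sources arrive at the current bs, so copy only the active rows.
# Shapes here are illustrative.
max_bs, bs, dim = 8, 3, 4
dst_val = torch.zeros(max_bs, dim)   # captured-size buffer
src_val = torch.ones(bs, dim)        # current-batch metadata
dst_val[:src_val.shape[0]].copy_(src_val)  # safe for any bs <= max_bs
```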

python/sglang/srt/layers/attention/deepseek_v4_backend_radix.py (1044)

medium

The use of _pad_tensor_to_size (which internally uses torch.cat) is problematic for MTP if topk > 1. If q has bs * topk tokens but swa_page_indices only has bs rows, padding with 0 will result in incorrect indices for all but the first token of each request. It should likely be repeat_interleave(topk, dim=0) instead. Furthermore, torch.cat should be avoided in the forward path to maintain CUDA graph compatibility and performance.
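The repeat_interleave fix the review suggests for the MTP case (topk > 1) can be sketched as follows; the values and shapes are illustrative:

```python
import torch

# Sketch of the review's repeat_interleave suggestion: q has bs * topk rows,
# so page indices must be repeated per draft token rather than zero-padded.
# Values and shapes are illustrative.
bs, topk = 2, 3
swa_page_indices = torch.tensor([[10, 11], [20, 21]])        # one row per request
per_token = swa_page_indices.repeat_interleave(topk, dim=0)  # one row per q token
print(per_token.shape[0])  # 6 == bs * topk
```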

python/sglang/srt/layers/moe/moe_runner/triton_utils/fused_moe.py (532)

medium

The expert filtering logic is restricted to _is_cuda. While the comment mentions that HIP/XPU fall through to the unfiltered path where the down kernel handles zeros, this results in redundant computation in the activation kernel for filtered tokens on those platforms. If the silu_and_mul JIT kernel supports filtering on ROCm, it should be enabled here as well to improve efficiency.

python/sglang/srt/mem_cache/deepseekv4_memory_pool.py (830)

medium

Caching swa_loc in self.cached_loc assumes that the model forward pass is strictly sequential across layers and that only one forward pass happens at a time. While this is currently true for the SGLang model runner, it makes the code non-thread-safe and could lead to subtle bugs if the execution model changes (e.g., multi-threaded speculative decoding steps). Consider passing this metadata through the forward_batch or metadata objects instead of storing it as state in the pool.

@HaiShaw HaiShaw changed the title Amd/deepseek v4 cuda graph 0426 amd/deepseek_v4 integration 2/N - cuda graph 0426 Apr 27, 2026
@HaiShaw HaiShaw merged commit f348386 into sgl-project:amd/deepseek_v4 Apr 27, 2026
1 check passed

Labels

amd, blackwell (SM100/SM120), deepseek, dependencies (pull requests that update a dependency file), diffusion (SGLang Diffusion), hicache (Hierarchical Caching for SGLang), jit-kernel, mthreads, npu, quant (LLM Quantization), sgl-kernel
