amd/deepseek_v4 integration 2/N - cuda graph 0426 #23832
HaiShaw merged 19 commits into sgl-project:amd/deepseek_v4
Conversation
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> Co-authored-by: Kangyan Zhou <kangyan.zhou@radixark.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…follow-up to sgl-project#23731) (sgl-project#23734) Co-authored-by: Byron Hsu <byron@periodiclabs.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llow-up to sgl-project#23731) (sgl-project#23732) Co-authored-by: Byron Hsu <byronhsu@noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
…kout (sgl-project#23747) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sgl-project#23749) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…project#23750) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> Co-authored-by: Kangyan Zhou <zky314343421@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gelu_and_mul (sgl-project#23707) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port DSv4 integration from the sglang-entropy codebase, including:
- DeepseekV4ForCausalLM model and NextN speculative decoding
- Compressed attention backend (deepseek_v4_backend, radix variant)
- DSv4 memory pool (deepseekv4_memory_pool, compress_state)
- DSv4 pool configurator and memory profiler
- Hash-based MoE routing (deepseek_v4_topk, HashTopK)
- DSv4 JIT kernels (CUDA .cuh headers + Python wrappers)
- Function call parser (deepseekv4_detector) and encoding (encoding_dsv4)
- Reasoning parser support (deepseek-v4)
- Environment variables for DSv4 configuration
- Config loading for the deepseek_v4 model_type via PretrainedConfig
- Integration with server_args, model_config, scheduler, and the forward pass
Made-with: Cursor
Made-with: Cursor
…ix bare except
- Use getattr with a None check instead of hasattr in _get_sliding_window_size to properly fall through None-valued attributes (e.g. Qwen2's sliding_window=None)
- Replace the bare except with except Exception in encoding_dsv4.py
Made-with: Cursor
- Pass out_cache_loc and actual_forward_mode in CudaGraphRunner replay
- Add **kwargs to all attention backend replay signatures for compatibility
- Rewrite fp8_paged_mqa_logits_torch as vectorized (no .item() / Python loops)
- Make topk_transform_512_pytorch_vectorized graph-capturable (cache arange tensors, use masked_fill_ instead of torch.tensor creation)
- Guard set_swa_loc for pools that don't support it (DeepSeekV4TokenToKVPool)
- Fix forward_normal_dual_stream missing input_ids_global for HashTopK
- Fix dummy out_cache_loc dtype (int32 -> int64) during capture
- Call on_after_cuda_graph_warmup_pass after warmup
Made-with: Cursor
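For the graph-capturability item above, a minimal sketch of the pattern (function and cache names hypothetical, not the actual topk_transform_512_pytorch_vectorized code): pre-allocate index tensors once and mutate in place with masked_fill_, since building a tensor with torch.tensor(...) inside a captured region triggers a host-to-device copy that breaks capture.

```python
import torch

# Cache of arange tensors keyed by (size, device); allocated on first
# use, before any CUDA graph capture, then reused on every call.
_arange_cache: dict = {}

def _cached_arange(n: int, device: torch.device) -> torch.Tensor:
    key = (n, device)
    if key not in _arange_cache:
        _arange_cache[key] = torch.arange(n, device=device)
    return _arange_cache[key]

def topk_mask(scores: torch.Tensor, k: int) -> torch.Tensor:
    # scores: [num_tokens, num_experts], assumed already sorted per row.
    positions = _cached_arange(scores.shape[-1], scores.device)
    # masked_fill_ takes a Python scalar, so no torch.tensor(...) is
    # created (its host-to-device copy is illegal during graph capture).
    scores.masked_fill_(positions >= k, float("-inf"))
    return scores
```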
On HIP/ROCm, deep_gemm_metadata is None. Move it to check_eq_fields instead of copy_fields so copy_metadata's assertion that all dataclass fields are accounted for passes during CUDA graph replay. Made-with: Cursor
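A hedged sketch of the field-accounting pattern this refers to (class and field sets are illustrative, not the actual SGLang dataclass): every field must land in exactly one of copy_fields or check_eq_fields, and platform-dependent None fields belong in the equality-checked set.

```python
import dataclasses
from dataclasses import dataclass
from typing import Any, Optional

import torch

@dataclass
class AttnMetadata:
    # Hypothetical fields for illustration.
    block_table: torch.Tensor
    seq_lens: torch.Tensor
    deep_gemm_metadata: Optional[Any]  # None on HIP/ROCm

# Fields whose tensors are copied into captured buffers on replay.
copy_fields = {"block_table", "seq_lens"}
# Fields only checked for equality; None-valued fields go here so the
# accounting assertion below still holds on platforms where they are unset.
check_eq_fields = {"deep_gemm_metadata"}

def copy_metadata(dst: AttnMetadata, src: AttnMetadata) -> None:
    all_fields = {f.name for f in dataclasses.fields(AttnMetadata)}
    # Every dataclass field must be accounted for in exactly one set.
    assert copy_fields | check_eq_fields == all_fields
    assert not (copy_fields & check_eq_fields)
    for name in copy_fields:
        getattr(dst, name).copy_(getattr(src, name))
    for name in check_eq_fields:
        assert getattr(dst, name) == getattr(src, name)
```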
Code Review
This pull request implements comprehensive support for DeepSeek-V4, introducing optimized CUDA kernels for compressed attention (C4/C128), expert-filtered activations, and a specialized memory pool architecture. It also updates diffusion pipelines to support request-local schedulers and adds extensive configuration for deployment. Review feedback identifies several critical issues in the attention backend, specifically regarding performance bottlenecks from host-device synchronizations, incorrect metadata copying during CUDA graph replay for variable batch sizes, and potential indexing errors in multi-token prediction (MTP). Additionally, improvements were suggested to enable expert filtering on non-CUDA platforms and to enhance thread safety by avoiding stateful caching within memory pools.
python/sglang/srt/layers/attention/deepseek_v4_backend_radix.py (1132)
Calling .item() on a GPU tensor causes a host-device synchronization. This is a significant performance bottleneck in the hot path and will break CUDA graph capture if this method is called during the capture phase. Consider using torch.cat with a sliced and expanded tensor to avoid the sync: torch.cat([req_pool_indices_repeated, req_pool_indices_repeated[-1:].expand(pad_size)]).
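A small sketch of the suggested sync-free padding (helper name hypothetical):

```python
import torch

def pad_by_repeating_last_row(x: torch.Tensor, pad_size: int) -> torch.Tensor:
    # x[-1:] is a view of the last row and expand() only broadcasts it,
    # so nothing here calls .item(): no host-device synchronization, and
    # the op remains legal inside CUDA graph capture.
    if pad_size == 0:
        return x
    return torch.cat([x, x[-1:].expand(pad_size, *x.shape[1:])])

# e.g. padding a batch of 4 request indices up to a captured size of 8
indices = torch.tensor([3, 7, 7, 9])
assert pad_by_repeating_last_row(indices, 4).tolist() == [3, 7, 7, 9, 9, 9, 9, 9]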
python/sglang/srt/layers/attention/deepseek_v4_backend_radix.py (190-237)
The copy_ implementation for DSV4AttnMetadataRadix (and its constituent metadata classes) uses dst_val.copy_(src_val) which requires exact shape matching. During CUDA graph replay, the destination tensors have the captured max_bs size, while the source tensors from temp_metadata have the current bs size. This will cause a runtime error when bs < max_bs. You should use slicing to copy only the active portion: dst_val[:src_val.shape[0]].copy_(src_val).
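A minimal sketch of the suggested sliced copy under the replay scenario described (variable names from the comment, sizes invented):

```python
import torch

# Captured destination buffers are sized for max_bs; on replay the source
# metadata from temp_metadata only covers the current batch size bs.
max_bs, bs, width = 8, 3, 64
dst_val = torch.zeros(max_bs, width)
src_val = torch.randn(bs, width)

# dst_val.copy_(src_val) raises on the shape mismatch when bs < max_bs.
# Slicing copies only the active prefix; the trailing rows hold stale
# data that kernels never read for a batch of size bs.
dst_val[:src_val.shape[0]].copy_(src_val)
```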
python/sglang/srt/layers/attention/deepseek_v4_backend_radix.py (1044)
The use of _pad_tensor_to_size (which internally uses torch.cat) is problematic for MTP if topk > 1. If q has bs * topk tokens but swa_page_indices only has bs rows, padding with 0 will result in incorrect indices for all but the first token of each request. It should likely be repeat_interleave(topk, dim=0) instead. Furthermore, torch.cat should be avoided in the forward path to maintain CUDA graph compatibility and performance.
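A toy sketch contrasting the two behaviors (shapes assumed from the comment):

```python
import torch

bs, topk, num_pages = 2, 3, 4
swa_page_indices = torch.arange(bs * num_pages).reshape(bs, num_pages)

# Zero-padding (what padding with 0 produces): rows beyond the first bs
# are zeros, so every speculative token after the first would read page
# index 0 instead of its own request's pages.
padded = torch.cat(
    [swa_page_indices, swa_page_indices.new_zeros(bs * (topk - 1), num_pages)]
)

# repeat_interleave: row i appears topk times in a row, so all topk
# draft tokens of request i see that request's page indices.
repeated = swa_page_indices.repeat_interleave(topk, dim=0)
assert repeated.shape == (bs * topk, num_pages)
```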
python/sglang/srt/layers/moe/moe_runner/triton_utils/fused_moe.py (532)
The expert filtering logic is restricted to _is_cuda. While the comment mentions that HIP/XPU fall through to the unfiltered path where the down kernel handles zeros, this results in redundant computation in the activation kernel for filtered tokens on those platforms. If the silu_and_mul JIT kernel supports filtering on ROCm, it should be enabled here as well to improve efficiency.
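A hedged sketch of what widening the gate could look like (silu_and_mul here is a toy reference implementation; the filtered variant and the platform flags are assumptions, not the actual fused_moe code):

```python
import torch
import torch.nn.functional as F

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Toy reference: SiLU(gate) * up over a last-dim split.
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up

def silu_and_mul_filtered(x: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    # Hypothetical filtered variant: only rows that survive expert
    # filtering are computed; filtered rows stay zero for the down kernel.
    out = x.new_zeros(x.shape[0], x.shape[-1] // 2)
    out[keep] = silu_and_mul(x[keep])
    return out

def apply_activation(x, keep, is_cuda: bool, is_hip: bool) -> torch.Tensor:
    # Widened gate: let HIP take the filtered path too, assuming the
    # ROCm build of the JIT kernel supports filtering.
    if is_cuda or is_hip:
        return silu_and_mul_filtered(x, keep)
    # Fallback: unfiltered activation; work on filtered tokens is wasted.
    return silu_and_mul(x)
```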
python/sglang/srt/mem_cache/deepseekv4_memory_pool.py (830)
Caching swa_loc in self.cached_loc assumes that the model forward pass is strictly sequential across layers and that only one forward pass happens at a time. While this is currently true for the SGLang model runner, it makes the code non-thread-safe and could lead to subtle bugs if the execution model changes (e.g., multi-threaded speculative decoding steps). Consider passing this metadata through the forward_batch or metadata objects instead of storing it as state in the pool.
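A sketch of the stateless alternative the review proposes (class and field names hypothetical): carry swa_loc on a per-forward metadata object instead of caching it on the shared pool.

```python
from dataclasses import dataclass

import torch

@dataclass
class ForwardBatchMeta:
    # Hypothetical per-forward metadata: one instance per forward pass, so
    # concurrent or interleaved forwards cannot clobber each other's state.
    swa_loc: torch.Tensor  # KV slot indices for this pass's SWA tokens

class SWAKVPool:
    def __init__(self, num_slots: int, dim: int):
        self.buf = torch.zeros(num_slots, dim)

    def write(self, meta: ForwardBatchMeta, values: torch.Tensor) -> None:
        # Stateless: the location arrives with the call rather than being
        # read from a self.cached_loc field set by an earlier layer, which
        # assumed strictly sequential layer execution in a single forward.
        self.buf[meta.swa_loc] = values
```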
Motivation
Update amd/deepseek_v4 integration branch
The following PRs have a large set of conflicts, so we use this PR and the upstream amd/deepseek_v4 branch to integrate in parallel:
#23600
#23608
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci