amd/deepseek_v4 integration 1/N - 0426#23787
Conversation
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> Co-authored-by: Kangyan Zhou <kangyan.zhou@radixark.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…follow-up to sgl-project#23731) (sgl-project#23734) Co-authored-by: Byron Hsu <byron@periodiclabs.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llow-up to sgl-project#23731) (sgl-project#23732) Co-authored-by: Byron Hsu <byronhsu@noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
…kout (sgl-project#23747) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sgl-project#23749) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…project#23750) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> Co-authored-by: Kangyan Zhou <zky314343421@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gelu_and_mul (sgl-project#23707) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port DSv4 integration from the sglang-entropy codebase, including:

- DeepseekV4ForCausalLM model and NextN speculative decoding
- Compressed attention backend (deepseek_v4_backend, radix variant)
- DSv4 memory pool (deepseekv4_memory_pool, compress_state)
- DSv4 pool configurator and memory profiler
- Hash-based MoE routing (deepseek_v4_topk, HashTopK)
- DSv4 JIT kernels (CUDA .cuh headers + Python wrappers)
- Function call parser (deepseekv4_detector) and encoding (encoding_dsv4)
- Reasoning parser support (deepseek-v4)
- Environment variables for DSv4 configuration
- Config loading for deepseek_v4 model_type via PretrainedConfig
- Integration with server_args, model_config, scheduler, and forward pass

Made-with: Cursor
…ix bare except

- Use getattr with a None check instead of hasattr in _get_sliding_window_size to properly fall through None-valued attributes (e.g. Qwen2's sliding_window=None)
- Replace bare except with except Exception in encoding_dsv4.py

Made-with: Cursor
Code Review
This pull request introduces support for DeepSeek-V4, featuring compressed attention backends, a specialized indexer, and speculative decoding (NextN) with optimized JIT kernels. Key updates include a redesigned memory pool and fused metadata initialization. Feedback identifies critical logic and performance issues in the DeepseekV4BackendRadix and DeepseekV4MultiStepBackend implementations, specifically addressing off-by-one errors in speculative decoding loops, performance bottlenecks from host-device synchronization, and platform compatibility concerns on AMD/ROCm systems.
```python
def flash_mla_with_kvcache_entrypoint(backend: str, **kwargs):
    if is_hip():
        # backend == "torch"
        import os
```
On AMD/ROCm platforms (`is_hip()`), the `flash_mla` kernel is typically not available. Defaulting to `"kernel"` will lead to an import error or crash. It is better to default to `"torch"` when running on HIP to ensure the fallback path is used unless explicitly overridden.
```suggestion
import os
backend = os.environ.get("SGLANG_HACK_FLASHMLA_BACKEND", "torch")
```
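As a rough sketch of the suggested default selection (the helper name and the non-HIP default are assumptions for illustration, not code from this PR):

```python
import os

def pick_flashmla_backend(on_hip: bool) -> str:
    # Hypothetical helper illustrating the review suggestion: on
    # AMD/ROCm (HIP) default to the "torch" fallback, since the
    # flash_mla kernel is typically unavailable there; elsewhere keep
    # the optimized "kernel" path. SGLANG_HACK_FLASHMLA_BACKEND still
    # overrides either default.
    default = "torch" if on_hip else "kernel"
    return os.environ.get("SGLANG_HACK_FLASHMLA_BACKEND", default)
```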
```python
)


def init_forward_metadata(self, forward_batch: ForwardBatch):
    for i in range(self.speculative_num_steps - 1):
```
```python
    forward_mode=ForwardMode.DECODE,
    spec_info=forward_batch.spec_info,
    seq_lens_cpu=forward_batch.seq_lens_cpu,
    out_cache_loc=forward_batch.out_cache_loc,
)
temp_metadata = self.attn_backends[0].forward_metadata

# Copy to other backends without recomputing
```
The loop range `range(1, self.speculative_num_steps - 1)` skips the last backend in `self.attn_backends`. For example, if `speculative_num_steps` is 2, the range is empty and the second backend is never updated. It should be `range(1, self.speculative_num_steps)`. Additionally, `self.forward_metadata` for the MultiStepBackend instance itself should be set to `temp_metadata` to avoid using stale or None metadata during the forward pass.
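The skipped index is easy to check in isolation (a plain-Python illustration, not code from the PR):

```python
# With speculative_num_steps = 2, the buggy bound leaves the range
# empty, so attn_backends[1] never receives the fresh metadata; the
# corrected bound visits index 1 as intended.
num_steps = 2
buggy = list(range(1, num_steps - 1))  # []
fixed = list(range(1, num_steps))      # [1]
```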
```suggestion
temp_metadata = self.attn_backends[0].forward_metadata
self.forward_metadata = temp_metadata
# Copy to other backends without recomputing
for i in range(1, self.speculative_num_steps):
```

```python
    req_pool_indices_repeated,
    (0, pad_size),
    value=req_pool_indices_repeated[-1].item(),
)
```
Calling `.item()` on a GPU tensor causes a host-device synchronization, which can significantly degrade performance in the hot path. Additionally, if `num_tokens` is 0, `req_pool_indices_repeated[-1]` will raise an `IndexError`. Consider using `torch.cat` with `expand` to perform the padding entirely on the device without synchronization.
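A NumPy analog of the suggested fix (the helper name and the zero-fill for the empty case are assumptions for illustration) shows the same pattern: replicate the last element via broadcasting and a single concatenate instead of reading it back element-by-element:

```python
import numpy as np

def pad_with_last(arr: np.ndarray, pad_size: int) -> np.ndarray:
    # Hypothetical NumPy sketch of the reviewer's suggestion: build the
    # padding by broadcasting the last element, rather than reading it
    # back with .item() (which on a GPU tensor forces a host-device
    # synchronization).
    if len(arr) > 0:
        padding = np.broadcast_to(arr[-1:], (pad_size,) + arr.shape[1:])
    else:
        # Empty input: fall back to zeros instead of indexing arr[-1].
        padding = np.zeros((pad_size,) + arr.shape[1:], dtype=arr.dtype)
    return np.concatenate([arr, padding])
```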
```suggestion
if num_tokens > 0:
    padding = req_pool_indices_repeated[-1:].expand(pad_size)
else:
    padding = req_pool_indices_repeated.new_zeros(pad_size)
req_pool_indices_repeated = torch.cat([req_pool_indices_repeated, padding])
```

```python
if value == 0:
    return torch.cat(
        [tensor, tensor.new_zeros(size - tensor.shape[0], *tensor.shape[1:])],
        dim=0,
    )
```
This function performs a `torch.cat` even when no padding is required (i.e., when `size == tensor.shape[0]`), which is inefficient. Also, it will fail if `size < tensor.shape[0]`. Adding a check to return the original tensor when padding is not needed would be better.
```suggestion
if size <= tensor.shape[0]:
    return tensor
if value == 0:
    return torch.cat(
        [tensor, tensor.new_zeros(size - tensor.shape[0], *tensor.shape[1:])],
        dim=0,
    )
```
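A NumPy sketch of the same helper with the early return added (the function name and the nonzero-`value` branch are assumptions, not code from the PR):

```python
import numpy as np

def pad_to_size(tensor: np.ndarray, size: int, value: float = 0) -> np.ndarray:
    # Sketch of the padding helper with the reviewer's early return:
    # when the tensor already has at least `size` rows, hand it back
    # unchanged instead of paying for an unnecessary concatenate.
    if size <= tensor.shape[0]:
        return tensor
    pad_shape = (size - tensor.shape[0],) + tensor.shape[1:]
    pad = np.full(pad_shape, value, dtype=tensor.dtype)
    return np.concatenate([tensor, pad], axis=0)
```

The early return also makes the `size < tensor.shape[0]` case a no-op rather than a crash, which matches the review's second concern.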
Motivation

Update the amd/deepseek_v4 integration branch.

The following PRs have a large set of conflicts, so we use this PR and the upstream `amd/deepseek_v4` branch to integrate in parallel:
#23600
#23608
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci