
Deepseek V4 #23882

Merged
hnyls2002 merged 428 commits into main from dsv4-rebase
May 8, 2026

Conversation

@hnyls2002 (Collaborator) commented Apr 27, 2026

Rebase progress

merge-base 0519b09 → target main ea794de (latest), 2880 commits across 29 batches (~100/batch). Done: 29 / 29. ✅

main:81449b4b (merged with 6bf5a265): fp8 MoE init + custom AR dispatch
  • fp8.py Fp8MoEMethod init: keep dsv4 is_fp4_expert gating, accept main use_mxfp8 so block_quant = use_mxfp8 or weight_block_size is not None
  • custom_all_reduce.py: dsv4 CustomAllReduceV2 JIT dispatch takes precedence; accept main _is_cuda or _is_musa for fallback
main:25508d11 (merged with a05bef1a): hybrid SWA pool memory-based sizing (needs review @ispobock)
  • model_runner_kv_cache_mixin.py (hybrid SWA pool sizing): keep dsv4 memory-based formula total_memory / denominator; reject main per-token formula. This is the DSv4PoolConfigurator call site
  • schedule_batch.py maybe_evict_swa: keep dsv4 SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW path on top of main's early-return refactor
  • swa_radix_cache.py cache_finished_req: keep dsv4 dec_lock_ref(skip_swa=…) + swa_prefix_lock_released plumbing
  • memory_pool.py ReqToTokenPool: keep dsv4 free_slots = range(1, size) (slot 0 reserved; minimal sketch below)
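A minimal sketch of the slot-0 reservation kept above (a simplified stand-in, not the real memory_pool.py class): slot 0 never enters the free list, which is what makes the runtime checker's expected_free = total - 1 invariant (batch ea6ff7b0 below) hold outside DECODE.

```python
from typing import List, Optional

class ReqSlotPoolSketch:
    """Simplified stand-in for ReqToTokenPool's free-slot bookkeeping."""

    def __init__(self, size: int):
        self.size = size
        # dsv4 behavior: start at 1, so slot 0 is reserved and never handed out
        self.free_slots: List[int] = list(range(1, size))

    def alloc(self) -> Optional[int]:
        return self.free_slots.pop() if self.free_slots else None

    def free(self, slot: int) -> None:
        assert slot != 0, "slot 0 is reserved and never allocated"
        self.free_slots.append(slot)

pool = ReqSlotPoolSketch(size=8)
assert len(pool.free_slots) == pool.size - 1  # expected_free = total - 1
```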
main:75997ebe (merged with 9cd53fde): nixl generic send_state for SWA/NSA/Mamba (needs review @ShangmingCai @xiezhq-hermann)
  • major nixl/conn.py: replace main's _send_mamba_state dispatcher (feat: add nsa and swa disagg support with nixl #18939) with dsv4 generic send_state for SWA/NSA/Mamba state buffers (shape sketched below); absorb main's PP-aware received_state_per_pp tracking + {room}_state_{pp_rank} notif format
  • 62f6077a follow-up: design choice TODO — mamba TP-slice currently unsupported on the generic path
  • 714eefe1 follow-up: fix fp8 MoE use_mxfp8 set before block_quant
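A hypothetical sketch of the generic send_state shape adopted above. The names state_bufs / send_buffer / notify are illustrative, not the real nixl/conn.py API; the point is one loop over whatever state buffers the model exposes, plus main's PP-aware notification key.

```python
from typing import Callable, Sequence

def send_state(
    room: str,
    pp_rank: int,
    state_bufs: Sequence[bytes],
    send_buffer: Callable[[bytes], None],
    notify: Callable[[str], None],
) -> None:
    # one generic path for SWA / NSA indexer / Mamba state buffers,
    # replacing a per-state-type dispatcher
    for buf in state_bufs:
        send_buffer(buf)
    # main's PP-aware notif format; the receiver tracks received_state_per_pp
    notify(f"{room}_state_{pp_rank}")

# toy usage with in-memory stubs
sent, notifs = [], []
send_state("room42", 0, [b"swa", b"nsa_c4", b"mamba"], sent.append, notifs.append)
assert notifs == ["room42_state_0"] and len(sent) == 3
```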
main:f4417475 (merged with bd8ff150): V4 arch detect + NSA tilelang HIP gating
main:5ddc84e3 (merged with e47d56a6): NSA cp rename + ROCm-aware fp8_dtype
  • major adopt main's NSA cp rename: attn_tp_* → attn_cp_* (context-parallel) in nsa/utils.py
  • accept main's refactor: quantize_and_rope_for_fp8 lifted from trtllm MLA backend to common attention/utils.py; hardcoded torch.float8_e4m3fn → ROCm-aware fp8_dtype (e4m3fnuz on fnuz GPUs)
  • moe_runner/deep_gemm.py: keep dsv4 _legacy_silu_and_mul for SGLANG_OPT_FIX_MEGA_MOE_MEMORY=False path, accept main _is_cuda guard
main:38a69652 (merged with b0da3713): reject upstream maybe_send_extra dispatcher (needs review @ShangmingCai @xiezhq-hermann)
  • major nixl/conn.py: reject main's new maybe_send_extra state-type dispatcher; keep dsv4 generic send_state path. Promote main's runtime length check to assert
  • 404ab04e follow-up: TODO — reconcile vs upstream _send_mamba_state dispatcher (feat: add nsa and swa disagg support with nixl #18939) for future state types
main:9bce3b04 (merged with d699109a): adopt ForwardInputBuffers + cuda graph runners port (needs review @ispobock)
  • major adopt main's ForwardInputBuffers reflection _share_one_buffer (replaces dsv4 per-runner inline create); eagle draft / draft-extend / multi-layer / piecewise runners ported to runner.buffers.xxx; keep dsv4 spec_hidden_size field
  • topk.py: accept main's drop of num_token_non_padded / expert_location_dispatch_info from fused_topk; keep dsv4 sqrtsoftplus + SGLANG_OPT_USE_JIT_KERNEL_FUSED_TOPK branch
  • fp8.py: keep dsv4 is_fp4_expert init, accept main with_bias = False
main:bdc1e46e (merged with ffacc0ff): main rename pass — FlashInferFusedMoE delete + dp_rank rename
main:ea6ff7b0 (merged with 3af6f1d3): streaming session controller + req-pool leak check (needs review @ispobock)
  • major scheduler_runtime_checker_mixin.py _check_req_pool: union main [Session] Add streaming mode with SessionAwareCache fast path #19171 streaming-session count (session_req_count) with dsv4 slot-0 reserve invariant (expected_free = total - 1 in non-DECODE; sketch below); TODO(DSV4) — combo not exercised in CI
  • deepseek_v2.py: keep dsv4 imports (fp8_kernel trio, SymmBuffer TYPE_CHECKING, DeepSeekMxfp4MoEMethod, batched_gemm_a8w8_...); main deleted the fp8_kernel + batched_gemm blocks as dead-import cleanup, dsv4 kept them — leaves 2 truly-dead imports for follow-up
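A sketch of the unioned invariant noted above, with hypothetical simplified arguments (the real check lives in scheduler_runtime_checker_mixin.py): streaming-session requests stay resident in the req pool, and dsv4 additionally reserves slot 0 outside DECODE.

```python
def check_req_pool(
    num_free: int,
    total_size: int,
    session_req_count: int,  # main #19171: reqs pinned by live streaming sessions
    is_decode: bool,
) -> None:
    expected_free = total_size - session_req_count
    if not is_decode:
        expected_free -= 1   # dsv4 slot-0 reserve (see ReqToTokenPool sketch above)
    assert num_free == expected_free, f"req pool leak: {num_free} != {expected_free}"

# 16-slot pool, 2 session-pinned reqs, slot 0 reserved -> 13 free expected
check_req_pool(num_free=13, total_size=16, session_req_count=2, is_decode=False)
```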
Follow-up: PD-MTP MetadataBuffers hidden state — asymmetric P/D unblock attempt, reverted on V4 (NSA c4/c128 indexer ring asymmetry, c7ff4e7c) (needs review @ispobock @ShangmingCai @xiezhq-hermann)
  • attempt scheduler.py MetadataBuffers hidden buffer (fix pd-mtp metadata buffer hidden size #23918, support asymmetric pd-mtp via mock spec hidden #23958): size to model_config.spec_hidden_size + model_config.dtype unconditionally on both P and D so the wire layout aligns regardless of which side runs spec — DSv4 4x pre_hc_head ferries correctly, and asymmetric P (no spec) ↔ D (EAGLE) bootstraps from zero-init mock conditioning
  • (fix UnboundLocalError on model_config in init_disaggregation #23959) UnboundLocalError follow-up to the attempt: default model_config to self.model_config before draft-worker dispatch (independent fix, kept after revert below)
  • 4c4aa719 open TODO: verify always-on full-size buffer side effects vs main's 16-byte padding — moot / closed by the conditional-path restore below
  • major revert on V4 (c7ff4e7c) — MetadataBuffer mock alone is insufficient: deepseekv4_memory_pool.get_compress_state_ring_size doubles the NSA c4/c128 indexer ring (8→16 / 128→256) under is_speculative, and get_state_buf_infos bakes ring_size into item_len ⇒ prefill (no spec) and decode (EAGLE) ship mismatched state-pool layouts (item_len 65536 vs 131072 at index 43, the c4 indexer state; worked sketch below). nixl send_state trips the layout assert; bypassing the size check would still leave a ring-buffer protocol mismatch (P writes ring=8, D reads ring=16) ⇒ corrupt indexer state. scheduler.py restored to main-style conditional (spec_hidden_size only when spec_algorithm.is_eagle(), else 16-byte / float32 padding). Asymm P/D on V4 stays unsupported until the indexer ring is unified across spec status
  • test_dsv4_pd_disagg_nixl.py: switch to fully-symmetric topology — both P and D run dp-attention + deepep + EAGLE MTP (steps=3, topk=1, draft=4); only base-gpu-id and disaggregation-mode differ. Removes attn_tp asymmetry (dp-attn matches → SWA item_len aligns) and spec asymmetry (both is_speculative → c4/c128 ring matches)
  • independent V4 PD dead-code fix bundled in same commit: drop dead asserts self.decode_tp_size == self.scheduler.tp_size (prefill.py) and self.prefill_pp_size == 1 (decode.py) — both attributes were removed by [PD] Remove unused server args for disaggregation #19618 cleanup, the V4 squash kept references → AttributeError on every V4 PD launch. pp_size == 1 sanity preserved on prefill side (peer-independent)
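A worked sketch of the layout mismatch behind the revert. The 8192 bytes/slot figure is implied by the logged numbers (65536 / 8), not read from the code; the two helpers are illustrative stand-ins for the real deepseekv4_memory_pool methods.

```python
def compress_state_ring_size(base_ring: int, is_speculative: bool) -> int:
    # deepseekv4_memory_pool doubles the NSA c4/c128 indexer ring under spec
    return base_ring * 2 if is_speculative else base_ring   # 8→16 / 128→256

def state_item_len(bytes_per_slot: int, ring_size: int) -> int:
    # get_state_buf_infos bakes ring_size into the per-item wire length
    return bytes_per_slot * ring_size

BYTES_PER_SLOT_C4 = 8192   # implied: 65536 / 8
prefill = state_item_len(BYTES_PER_SLOT_C4, compress_state_ring_size(8, False))
decode = state_item_len(BYTES_PER_SLOT_C4, compress_state_ring_size(8, True))
# prefill (no spec) and decode (EAGLE) disagree at index 43 -> nixl layout assert
assert (prefill, decode) == (65536, 131072)
```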
main:1135e214 (merged with ce7848e6): flashinfer bump + fp8 ignored-layer skip + glm45 detector
  • fp8.py Fp8MoEMethod dispatch: stack main's is_layer_skipped short-circuit (return UnquantizedFusedMoEMethod) before dsv4's DeepSeekMxfp4MoEMethod wrap (DSV4_MODE=2604 + FP4_EXPERTS + flashinfer_mxfp4 backend)
main:04e364d5 (merged with dcdd5f85): priority preemption rename
Pre-adopt: apply main MemoryPoolConfig framework + pool_configurator.py ahead of rebase (a9f3c47d, cherry-pick 2ac9024d)
  • major port main 3e8abc71's MemoryPoolConfig dataclass + _resolve_memory_pool_config / _apply_memory_pool_config / _init_pools decomposition into model_runner_kv_cache_mixin.py; eliminate server_args._draft_pool_config mutation hack — draft worker now receives the resolved MemoryPoolConfig via constructor chain (tp_worker → model_runner, plumbed through all 6 spec workers: eagle / eagle_v2 / multi_layer / multi_layer_v2 / standalone / standalone_v2)
  • new pool_configurator.py: DSv4PoolConfigurator class absorbs the deleted memory_profiler.py:DSv4MemoryCalculator + old set_num_tokens_hybrid_swa_compress mixin method into one resolve flow. Layered as a single if is_deepseek_compressed and is_hybrid_swa branch on top of main's path; future align to main [core] Extract pool sizing logic to pool_configurator.py #22384 free-function form is mechanical
  • behavior delta vs old dsv4: DSv4 path now uses main's _resolve_max_num_reqs (applies // dp_size on user-provided --max-running-requests); single-DP unaffected, DP-attention setup needs cold-start verify
main:3e8abc71 (merged with a77b1b1d): pool sizing finalize + ngram embedding + aiter fp8/compressed_tensors bf16
  • model_runner_kv_cache_mixin.py: keep ours' superset (DSv4 c4/c128/state fields + _resolve_dsv4_compressed_config + draft zero-out + unconditional state_dtype=fp32) over main's bare framework. Pre-adopt path covers everything; main has nothing dsv4-specific to add
  • deepseek_v2.py MoEGate.__init__ correction_bias_dtype: union dsv4 not is_hash_moe guard + main refactor (default fp32, then quant_config dispatch — modelopt_fp4 + flashinfer_trtllm → bf16, new _use_aiter and name in {"fp8","compressed_tensors"} → bf16 branch)
main:b227e53e (merged with 51bccbfb): health-check idle helper + flashinfer_trtllm_routed MoE + draft extend cuda graph helpers
  • major scheduler.py health-check path: drop dsv4 inline backport (261334394d) — main [Disagg] Fix health check false-positive in disagg is_fully_idle #20756's is_fully_idle(for_health_check=True) is the canonical fix and now covers all PD bootstrap/prealloc/transfer queue checks. Net loss: offload_tags guard (main's helper doesn't include it); follow-up if dsv4 offload path needs the guard at the health-check call site
  • token_dispatcher/standard.py: union dsv4 enable_flashinfer_mxfp4_moe + skip_local_expert_mapping (gated by SGLANG_OPT_MXFP4_SKIP_DISPATCHER_MAPPING) with main's new enable_flashinfer_trtllm_routed_moe — both flags chain into the EP local-mapping not guard
  • topk.py select_experts elif: chain dsv4 scoring_func == "sqrtsoftplus" branch (uses biased_topk_impl with optional JIT kernel) before main's new flashinfer_trtllm_routed + softmax + no bias branch (uses fused_topk_softmax_torch_raw_logits); fall-through to original fused_topk
  • eagle_worker_v2.py draft extend cuda graph capture: accept main's helper refactor (supports_cuda_draft_extend_graph wraps TritonMultiStep+TRTLLMMLA; new supports_hip_aiter_draft_extend_graph for HIP path); keep dsv4's DeepseekV4BackendRadix branch (gated by SGLANG_OPT_V4_DRAFT_EXTEND_CUDA_GRAPH) as extra or clause
main:71a54c1c (merged with 286924b6): RadixTree #20330 unified lock interface — apply dsv4 SWA leaf-lock release onto main API (needs review @ispobock)
  • major swa_radix_cache.py: rebase dsv4 fork-only commit 97d1a672fe release lock after window (ispobock 2026-04-26) onto main [RadixTree][8/N Refactor]: unify lock interface #20330 (97b2a8933 [RadixTree][8/N] unify lock interface) — main wraps (node, swa_uuid_for_lock) into DecLockRefParams dataclass + -> DecLockRefResult return. Resolve choice: keep skip_swa as Liskov-compatible extra default-valued kwarg on SWARadixCache.dec_lock_ref override, NOT in base_prefix_cache.py:DecLockRefParams — minimizes fork divergence surface for the ongoing [RadixTree][N/M] series (override shape sketched below). dec_swa_lock_only stays unchanged as fork-only API. cache_finished_req / cache_unfinished_req call sites updated to DecLockRefParams(swa_uuid_for_lock=..., skip_swa=req.swa_prefix_lock_released) form; inc_lock_ref caller unpacks new IncLockRefResult. TODO(DSV4) @ispobock review placed at the override
  • server_args.py DSv4.2-DSA Context Parallel: keep ours assert tp_size <= 8 (allow tp ∈ {1,2,4,8} for single-machine CP); accept main's new self.attn_cp_size = self.tp_size assignment
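A sketch of the resolve choice above, with simplified stand-ins for the real base_prefix_cache.py types: skip_swa is an extra default-valued kwarg only on the SWARadixCache override, so base-typed callers (and the rest of the [RadixTree] series) stay untouched.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DecLockRefParams:            # main #20330 shape (simplified)
    node: object
    swa_uuid_for_lock: Optional[str] = None

@dataclass
class DecLockRefResult:
    released: int = 0

class BasePrefixCache:
    def dec_lock_ref(self, params: DecLockRefParams) -> DecLockRefResult:
        return DecLockRefResult()

class SWARadixCache(BasePrefixCache):
    # Liskov-compatible: extra kwarg defaults, so base-typed callers still work
    def dec_lock_ref(
        self, params: DecLockRefParams, skip_swa: bool = False
    ) -> DecLockRefResult:
        if skip_swa:               # fork-only: SWA leaf lock already released
            return DecLockRefResult()
        return super().dec_lock_ref(params)
```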
main:574572b2 (merged with e1a43ba9): mscale helper extract + chat_template_kwargs dict refactor + transformers v5 KeyError fallback
  • model_config.py: accept main's compute_mla_mscale_scaling(rope_scaling, base_scaling) helper extract; opportunistically also collapse the same 4-line inline mscale calc inside the dsv4 DeepseekV4ForCausalLM elif branch onto the helper for file-wide consistency
  • serving_chat.py apply_chat_template kwargs: accept main's extra_template_kwargs dict refactor (collapses inline reasoning_effort= + **chat_template_kwargs into a single **extra_template_kwargs spread). Redirect dsv4 fork-only SGLANG_ENABLE_THINKING protocol from the deleted inline chat_template_kwargs dict into extra_template_kwargs["thinking"] = True at the unified setup site (moved from the top of the try block out to the shared setup outside it, with the rest); same for the Mistral fallback hunk. reasoning_parser map: union of dsv4 "deepseek-v4" + main's "mistral" already auto-merged
  • fused_moe.py fused_experts_impl: keep ours' restructured else: branch (DSv4 2604B clamp + SGLANG_OPT_SWIGLU_CLAMP_FUSION dispatch fold the cuda/hip/xpu path inside, replacing main's elif _is_cuda or _is_hip or _is_xpu: sibling). Follow main rename topk_ids → curr_topk_ids (chunked slice; main's PR is the bug fix) at ours' act_and_mul_triton call site
  • deepseek_v2.py DeepseekV2MoE.forward (2 hunks): union topk() kwargs — keep dsv4 topk_kwargs = {"input_ids": input_ids_global} if self.is_hash else {} + main's new expert_location_dispatch_info=dispatch_info (EPLB) → call as self.topk(..., expert_location_dispatch_info=dispatch_info, **topk_kwargs)
  • hf_transformers_utils.py get_config: accept main's new except KeyError as e: fallback for transformers v5 (raw PretrainedConfig.get_config_dict + _CONFIG_REGISTRY bypass); redirect the "deepseek_v32" in str(e) branch from main's _load_deepseek_v32_model(...) to ours' generalized _load_deepseek_temp_model(model_type="deepseek_v32", architecture="DeepseekV3ForCausalLM", ...) helper. _load_mistral_large_3_for_causal_LM caller signature, _fix_special_tokens_pattern hook, ours' deepseek_ref/v32 ValueError branches all auto-merged
main:2406ddfd (merged with f9e69522): keep dsv4 fork JIT custom AR v2 over main #19880 upstream rewrite (needs review @DarkSharpness)
  • major keep dsv4 fork's CustomAllReduceV2 JIT (originally d17d622fd1 prototype) over main [JIT Kernel][Feature] Support JIT custom all reduce (rewrite as v2) #19880 (2dd9196079 [JIT Kernel] Support JIT custom all reduce (rewrite as v2) by DarkSharpness + BBuf — same author team's polished upstream rewrite). 5 add/add files all kept ours: jit_kernel/all_reduce.py, include/sgl_kernel/distributed/{common,custom_all_reduce}.cuh, jit_kernel/tests/test_custom_all_reduce.py, srt/distributed/device_communicators/custom_all_reduce_v2.py
  • TODO(DSV4) @DarkSharpness: dsv4 fork's CustomAllReduceV2 impl diverges from main's [JIT Kernel][Feature] Support JIT custom all reduce (rewrite as v2) #19880 polished version. Main has additions ours lacks: can_use_custom_all_reduce_with_nvlink nvlink pre-check + is_in_piecewise_cuda_graph inplace optimization handling + bench_custom_all_reduce.py + 104-line ffi.h. Ours has: pull/push split into 2 modules + max_pull_blocks=0 disable path + min(NUM_CTA, max_pull_blocks) SM cap + __slots__ opt. Reconcile when ready — the two team versions need cross-pollination
  • scheduler.py get_new_batch_prefill: revert dsv4 fork commit 05ab33bf57 rm token_usage call in prefill delayer (fzyzcjy 2026-04-25 HACK token_usage = 0.5 # since it is unused) onto main [Fix #20389] Illegal memory access in triton attention for large token counts #20390's hybrid_swa / hybrid_ssm token-usage extension (correct calculation across pools, fallback to _get_token_info). TODO(DSV4) @fzyzcjy placed at site to re-evaluate the HACK if prefill scheduling latency regresses
main:80389fec (merged with 92e1c587): keep dsv4 fork HiSparse over main #20343 upstream rewrite + tilelang sparse_fwd refactor reject + DP-aware attn_cp_size (needs review @xiezhq-hermann)
  • major keep dsv4 fork's HiSparse implementation (originally d17d622fd1 prototype) over main HiSparse for Sparse Attention #20343 (13f4f010d8 HiSparse for Sparse Attention by xiezhq-hermann — same author's polished upstream rewrite, same pattern as batch 16's JIT custom AR v2). 3 add/add files all kept ours: jit_kernel/csrc/hisparse.cuh, srt/managers/hisparse_coordinator.py, srt/mem_cache/hisparse_memory_pool.py. schedule_batch.py keeps Req.hisparse_staging name (theirs Req.staging); scheduler.py keeps fork's full hisparse_coordinator init + SGLANG_FIX_SWA_CHUNKED_REQ_DOUBLE_FREE gate + decode batch dispatch flow; server_args.py keeps hierarchical_sparse_attention_extra_config name (theirs hisparse_config); model_runner_kv_cache_mixin.py reverts main's auto-merged HiSparseNSATokenToKVPool branch back to ours' single NSATokenToKVPool path (HiSparseNSATokenToKVPool class doesn't exist in dsv4 fork's hisparse_memory_pool.py)
  • TODO(DSV4) @xiezhq-hermann: dsv4 fork's HiSparse impl diverges from main's HiSparse for Sparse Attention #20343 polished version. Main's hisparse_coordinator.py has +148 lines, hisparse_memory_pool.py introduces HiSparseNSATokenToKVPool separate class + parse_hisparse_config(host_to_device_ratio) integration. Reconcile when ready
  • tilelang_kernel.py: keep ours' full tilelang_sparse_fwd (SGLANG_DSV4_ISOLATE single v1 kernel + _is_gfx95_supported small-batch decode partial+combine + gfx942 fallback) + ours-only fp8_paged_mqa_logits_kernel; reject main [AMD] Tilelang sparse fwd for dsv32 mi355/mi300 #19945 (855d15adf6 [AMD] Tilelang sparse fwd for dsv32 mi355/mi300) refactored partial+combine with adaptive inner_iter heuristic — duplicates ours path and would require non-trivial reconcile of the SGLANG_DSV4_ISOLATE gate
  • server_args.py DSv4.2-DSA Context Parallel: keep ours assert tp_size <= 8; accept main fix self.attn_cp_size = self.tp_size // self.dp_size (DP-aware divisor — ours' = self.tp_size didn't account for dp_size, single-DP unaffected but multi-DP attn_cp setup was off)
  • jit_kernel/tests/test_custom_all_reduce.py: keep ours (consistent with batch 16 — dsv4 fork's multiprocess_main framework + stage-b-kernel-unit-8-gpu-h200 CI tier; main version is disabled "requires multi-GPU distributed setup" on 1-gpu CI tier so wouldn't actually exercise the kernel)
main:18074e25 (merged with bfa0923d): NUMA helpers refactor to numa_utils.py + Ascend NPU hybrid_swa support
  • utils/common.py: accept main's deletion of NUMA helpers (refactored to utils/numa_utils.py with new get_numa_node_if_available / numa_bind_to_node API; scheduler.py:226 caller already uses new path); keep ours maybe_torch_compile (dsv4 fork-only, used by deepseek_v4_rope.py + deepseek_v4.py:113,992)
  • model_runner_kv_cache_mixin.py _init_pools: keep ours is_v4_model top-level branch + ascend mambaish guard; absorb main's new ascend is_hybrid_swa sub-branch (NPUMHATokenToKVPool + SWAKVPool for hybrid SWA on NPU)
main:1b45d81e (merged with 16f22f48): aiter_dsv3_router_gemm signature simplify + base64 encoding layer + spec backend rename
  • schedule_batch.py maybe_evict_swa: keep ours' early-return (decode + piecewise CG enabled + non-chunk_cache) + release_leaf_lock gate (fork-only SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW); main deletion is for main's own cleanup, dsv4 path needs both
  • tokenizer_manager.py + detokenizer_manager.py: keep ours' base64-encoding layer at detokenizer (with _extract_topk_base64 helper + indexer_topk field). Reject main Simplify routed experts test and move base64 encoding to tokenizer manager #21634 redesign (move base64 from detokenizer to tokenizer manager) — fork-only design choice; dsv4 has dual routed_experts + indexer_topk base64 paths
  • deepseek_v2.py MoEGate: accept main's aiter_dsv3_router_gemm(hidden_states, weight) 2-arg simplify (main dropped gemm_output_zero_allocator; ours' 3-arg signature is obsolete); linear_bf16_fp32 fallback to dsv4 JIT kernel preserved; keep ours' SymmBuffer TYPE_CHECKING import
  • eagle_worker_v2.py: accept main rename TritonMultiStepDraftBackend → TritonAttnBackend + TRTLLMMLAMultiStepDraftBackend → TRTLLMMLABackend; keep ours DeepseekV4BackendRadix import (fork-only)
main:658a2813 (merged with 5fc88fc9): HiSparse main #21198/#21591 + AMD NSA FP8 #21511 vs NightFall HEAD (needs review @xiezhq-hermann)
main:d72f58d1 (merged with c93d658f): scheduler output processor refactor (#22146 / #15562 / #22148) + reasoning tokens
main:a64905a7 (merged with 6e59b211): MemoryPoolConfigurator framework + DSv4 subclass (needs review @ispobock)
  • major pool_configurator.py + model_runner_kv_cache_mixin.py: adopt main [core] Extract pool sizing logic to pool_configurator.py #22384/[core] Introduce MemoryPoolConfigurator class hierarchy #22389 MemoryPoolConfigurator base class + DefaultPoolConfigurator / HybridSWAPoolConfigurator + create_memory_pool_configurator factory; convert dsv4 DSv4PoolConfigurator from standalone to MemoryPoolConfigurator subclass implementing calculate_pool_sizes / calculate_pool_sizes_from_max_tokens; extend MemoryPoolConfig with c4/c128 / c4_state / c128_state fields (default 0); factory dispatches is_deepseek_compressed and is_hybrid_swa to DSv4 first
  • major nixl/conn.py: keep ours' wire format (state_data_ptrs at msg[11], state_item_lens at msg[12]) + unified send_state + state-after-aux call order; reject main [Disagg][NIXL] Support Mamba state slice transfer for heterogeneous TP (Step 2/2 for Qwen3.5) #22240 _send_mamba_state_slice heterogeneous Mamba TP (fork has TODO @nealvaidya); fix auto-merge dup packed_state_item_lens
  • 5293af33 follow-up: DSv4PoolConfigurator align main pattern — pure __init__ (drop mr.state_dtype = fp32 write, mixin _apply_memory_pool_config covers it); spec inflate bytes_per_full_token *= (T+D)/T (mirror dflash scale_kv_cell_size_per_token_for_dflash) instead of shrinking available_bytes (worked sketch below)
  • c47ed92d follow-up: clean — drop dead self._mr; rename _solve_pool_sizes → _compute_dsv4_sizes (avoid sibling collision); docstring mentions spec inflate
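A worked sketch of the spec-inflate direction chosen in the 5293af33 follow-up. T and D are left abstract in the log (plausibly target vs draft token counts per step); the point is that the per-token cost grows instead of the byte budget shrinking, mirroring scale_kv_cell_size_per_token_for_dflash.

```python
def solve_num_tokens(available_bytes: int, bytes_per_full_token: int,
                     t: int, d: int) -> int:
    # inflate the per-token cost rather than shrinking available_bytes
    bytes_per_full_token = bytes_per_full_token * (t + d) // t
    return available_bytes // bytes_per_full_token

# same budget; with spec decoding on (d > 0) fewer pool tokens fit
assert solve_num_tokens(1 << 30, 4096, t=1, d=0) == 262144
assert solve_num_tokens(1 << 30, 4096, t=1, d=3) == 65536
```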
main:870a21bf (merged with ff6ba6c3): PoolStats refactor + max_pool_usage (needs review @ispobock)
  • major scheduler_runtime_checker_mixin.py: adopt main [mem] Introduce PoolStats dataclass; unify pool metrics and token_usage #22554/[metrics] Add PoolStats.update_scheduler_stats to deduplicate metrics assignment #22559/[mem] Flatten memory checkers into composable per-pool invariant checks #22562 PoolStats dataclass + composable _check_full_pool / _check_swa_pool / _check_mamba_pool + _check_all_pools driver (sketch below) + _maybe_log_idle_metrics + _check_tree_cache + flat on_idle; graft 5 fork-only additions on top: (1) _get_swa_token_info hisparse clamp max(0, full/swa_num_used), (2) _check_req_pool slot-0-reserved expected_free = req_total_size - 1 for non-DECODE, (3) on_idle skip mem leak check when enable_hisparse, (4) self_check_during_busy hybrid_swa branch using _check_full_pool + _check_swa_pool, (5) _get_batch_swa_uncached_sizes + _get_total_swa_uncached_sizes helpers (graft applied via 4b5ec30a)
  • scheduler.py: adopt main max_pool_usage = self.get_pool_stats().get_max_pool_usage() in prefill_delayer; drop dsv4 fork hack (05ab33bf57 token_usage = 0.5) and the manual hybrid_swa/ssm dispatcher
  • schedule_batch.py: drop fork-only SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN env (correctness, not opt-in — without -page_size SWA leaks via tombstoned leaf); align main Fix SWA eviction boundary and page-align chunked prefill #22470 unconditional -page_size
  • token_dispatcher/standard.py: OR-merge skip_local_expert_mapping — main's cutlass / cutedsl / trtllm_routed + ours' mxfp4 and SGLANG_OPT_MXFP4_SKIP_DISPATCHER_MAPPING; simplify use site to not skip_local_expert_mapping (fix auto-merge dup assignment that lost main's 3 cases)
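A sketch of the composable per-pool check pattern adopted above (minimal stand-in fields; the real PoolStats and _check_* live in scheduler_runtime_checker_mixin.py): each checker validates one pool's available + evictable == total invariant, and a flat driver composes them per model config.

```python
from dataclasses import dataclass

@dataclass
class PoolStats:                   # simplified stand-in for main's dataclass
    full_available: int
    full_evictable: int
    full_total: int
    swa_available: int = 0
    swa_evictable: int = 0
    swa_total: int = 0

def _check_full_pool(s: PoolStats) -> None:
    assert s.full_available + s.full_evictable == s.full_total, "full pool leak"

def _check_swa_pool(s: PoolStats) -> None:
    assert s.swa_available + s.swa_evictable == s.swa_total, "swa pool leak"

def _check_all_pools(s: PoolStats, is_hybrid_swa: bool) -> None:
    _check_full_pool(s)            # composable: one invariant per pool
    if is_hybrid_swa:
        _check_swa_pool(s)

_check_all_pools(PoolStats(3, 5, 8, 2, 2, 4), is_hybrid_swa=True)
```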
main:36891ab5 (merged with 8d45228d): SWA eviction interval + budget tracking + main subsumes fork's busy SWA check
  • schedule_batch.py maybe_evict_swa: 3-way union — adopt main env: add knob to control SWA eviction interval #22645 SGLANG_SWA_EVICTION_INTERVAL_MULTIPLIER tunable eviction_interval (replaces hard-coded % sliding_window_size; sketch below); keep ours' fork-only piecewise-CG early-return + SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW leaf-lock release after window. Three blocks orthogonal
  • schedule_policy.py: keep ours' min(_rem_tokens, rem_swa_tokens - page_size) + if hybrid_swa: return req chunked-prefill OOM guard — fork pre-cherry-picked main Fix hybrid swa chunked prefill oom #23174 (50fc2c9, ispobock, lands in later batch); accept main docstring on _swa_budget_for_req
  • major scheduler_runtime_checker_mixin.py: main Rename _alive_streaming_session_count; use _is_streaming helper #22755 subsumes batch 23 graft — drop fork-only _get_batch_swa_uncached_sizes + _get_total_swa_uncached_sizes helpers; adopt main's unified _get_total_uncached_sizes returning (full, swa) tuple + self_check_during_busy always-compute-both pattern (also folds main's new native _get_hisparse_token_info + is_hisparse PoolStats fields). Fork-only grafts retained: hisparse clamp in _get_swa_token_info, slot-0-reserved _check_req_pool, hisparse skip in on_idle
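A sketch of the tunable eviction cadence adopted above (env name from main #22645; the trigger condition around it is simplified here):

```python
import os

def should_evict_swa(seqlen: int, sliding_window_size: int) -> bool:
    mult = float(os.environ.get("SGLANG_SWA_EVICTION_INTERVAL_MULTIPLIER", "1"))
    # replaces the hard-coded "% sliding_window_size" cadence
    eviction_interval = max(1, int(sliding_window_size * mult))
    return seqlen % eviction_interval == 0

# multiplier > 1 evicts less often; 1 reproduces the old per-window cadence
os.environ["SGLANG_SWA_EVICTION_INTERVAL_MULTIPLIER"] = "2"
assert should_evict_swa(8192, 4096) and not should_evict_swa(4096, 4096)
```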
main:6e3bbef5 (merged with 948efa36): hf_transformers subpackage refactor (#21569) + transformers 5.5.4 + DSv4 v4 case
  • major hf_transformers_utils.py: adopt main Upgrade transformers to 5.5.3 and refactor hf_transformers_utils into subpackage #21569 — 1500-line monolithic file → 17-line backward-compat shim re-exporting from new hf_transformers/ subpackage (common.py / compat.py / config.py / tokenizer.py / processor.py / mistral_utils.py). All fork-only helpers were already subsumed by main: _load_mistral_large_3_for_causal_LM = mistral_utils.load_mistral_config, _ensure_llama_flash_attention2_compat moved into compat._patch_removed_symbols auto-applied via apply_all(), _patch_mistral_common_tokenizer = mistral_utils.patch_mistral_common_tokenizer, _fix_v5_add_bos_eos_token / _fix_added_tokens_encoding / get_rope_config moved to corresponding submodules
  • hf_transformers/common.py: generalize _load_deepseek_v32_model → _load_deepseek_temp_model(architecture, name_prefix, ...) for the dsv4 case
  • hf_transformers/config.py get_config: add elif "deepseek_v4" in str(e) branch (DSv4ForCausalLM dispatch) before existing dsv32 case in the combined (ValueError, KeyError) handler
  • hisparse_memory_pool.py: keep ours = NightFall HEAD (DSv4 path); main's separate HiSparseNSATokenToKVPool design is for NSA models, not used by fork. Verified get_num_new_pages dsv4-aware version (logical vs hisparse with compress_ratio divisor) already covers main [sgl] improve accuracy of additional page requirement during spec decode #22406 fix scope
  • test_serving_chat.py path: accept main migrate CPU-only unit tests from openai_server to unit/ #22965 rename openai_server/basic/ → unit/entrypoints/openai/ + _MockTokenizerManager simplification; rewrite Cases 1-4 to use fork's chat_encoding_spec attribute (replaces main's use_dpsk_v32_encoding); keep 5 fork-only dsv4 tests (test_dsv4_task_field_schema, test_latest_reminder_role_accepted, test_attach_task_to_last_user_message, test_dsv4_content_parts_list_normalized, test_dsv4_task_and_reminder_encode_end_to_end)
main:48daa831 (merged with 96718620): triton MoE runner refactor + multi-platform plugin + StreamingSession rename
  • major moe_runner/triton_utils/fused_moe.py: adopt main refactor(moe): de-duplicate triton MoE runner path into shared helpers #23019 — split fused_experts_impl into _prepare_fused_moe_run + _fused_moe_kernel_sequence helpers; thin fused_experts_impl delegates. Graft fork-only swiglu_limit through inplace/outplace_fused_experts → fused_experts → fused_experts_impl → _fused_moe_kernel_sequence (threading sketched below); DSv4 2604B clamp logic (env-gated SGLANG_DSV4_2604_SUBMODE + SGLANG_OPT_SWIGLU_CLAMP_FUSION with silu_and_mul_clamp fused kernel) moved into _fused_moe_kernel_sequence activation block; triton.py runner caller passes swiglu_limit=self.config.swiglu_limit
  • major model_runner_kv_cache_mixin.py: adopt main Multi platform Plugin #21388 multi-platform plugin (current_platform.is_out_of_tree() with get_{nsa,mla,mha}_kv_pool_cls() dispatch). DSv4 is_v4_model branch keeps top priority; plugin path inserted as elif before existing ascend branch. from sglang.srt.platforms import current_platform hoisted before the if-chain
  • major scheduler.py: adopt main move session to python/sglang/srt/session #23144/integrate streaming session into UnifiedRadixCache #23145/[core] Always-on StreamingSession in UnifiedRadixCache #23202 (Liangsheng Yin) SessionAwareCache → StreamingSession rename + not tree_cache.supports_streaming_session() guard (UnifiedRadixCache embeds StreamingSession, prevent double-wrap); adopt main's set_decode_producer_stream(self.forward_stream) call (race fix — fork's decode_producer_stream was always None, leaving decode_backup_stream not waiting on forward_stream)
  • nsa/utils.py: adopt main [Refactor] Deduplicate NSA utils.py into cp_utils.py for context parallel #22914 cp_utils dedup (8 symbols moved to layers/utils/cp_utils.py, can_cp_split rename, nsa_cp_metadata → attn_cp_metadata); graft fork-only _assert_cp_pure_extend + assert_tensor_identical_across_cp_ranks debug helpers (env-gated SGLANG_DEBUG_HACK_CP_ASSERT_PURE_EXTEND)
  • schedule_batch.py: 3-way union — adopt main env: add knob to control SWA eviction interval #22645 SGLANG_SWA_EVICTION_INTERVAL_MULTIPLIER tunable + keep ours' fork-only piecewise-CG early-return + SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW leaf-lock release
  • drop 3 stale legacy-docs additions (main moved docs/ → docs_new/; pre-commit hook rejects docs/ writes)
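A sketch of the swiglu_limit threading grafted in this batch. The signatures are hypothetical simplifications of the real moe_runner/triton_utils/fused_moe.py helpers, and the clamp semantics shown are illustrative only; the point is the fork-only parameter riding each delegating layer down to the activation block in the shared kernel-sequence helper.

```python
import torch
import torch.nn.functional as F

def _fused_moe_kernel_sequence(x: torch.Tensor,
                               swiglu_limit: float | None) -> torch.Tensor:
    gate, up = x.chunk(2, dim=-1)
    if swiglu_limit is not None:      # DSv4 2604B clamp at the activation block
        gate = gate.clamp(max=swiglu_limit)
        up = up.clamp(min=-swiglu_limit, max=swiglu_limit)
    return F.silu(gate) * up

def fused_experts_impl(x: torch.Tensor,
                       swiglu_limit: float | None = None) -> torch.Tensor:
    return _fused_moe_kernel_sequence(x, swiglu_limit)

def fused_experts(x: torch.Tensor,
                  swiglu_limit: float | None = None) -> torch.Tensor:
    # thin delegate: the kwarg is passed through unchanged
    return fused_experts_impl(x, swiglu_limit=swiglu_limit)

out = fused_experts(torch.randn(4, 16), swiglu_limit=7.0)  # runner passes config value
```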
main:bf98eb3a (merged with f6f62737): expert_mask_gpu refactor + MUSA backend + RoutedExpertsOutput overlap
  • routed_experts_capturer.py: adopt main [perf] support return_routed_experts with overlap scheduling #22911 — new RoutedExpertsOutput dataclass for overlap scheduling; _get_local_range + _prepare_routed_experts_output helpers; on_forward_end(no_copy_to_cpu=False). Graft fork-only deepep blocks: __init__ adds self.gather_buffer for attn-tp all-gather; capture() adds attn_tp_all_gather_into_tensor before device_cache.capture_fwd_routed_experts
  • token_dispatcher/standard.py: adopt main Move expert_mask_gpu from FusedMoE layer to StandardDispatcher #23585 — add num_local_experts field + expert_mask_gpu = None; use site splits _use_aiter (writes expert_mask_gpu) vs non-aiter (writes topk_ids remap). Graft fork-only skip_local_expert_mapping as outer guard if local_expert_mapping is not None and not skip_local_expert_mapping
  • fused_moe.py: adopt main [MUSA][16/N] Add MUSA backend support for layers and DeepSeek models (V2/V3/R1) #22774 MUSA — auto-merge picked up is_musa import + _is_musa flag + _silu_and_mul_musa + 4 moe_sum_reduce dispatch sites. Activation block grafts main's elif _is_musa into ours' DSv4 2604B else chain, after if _is_cuda or _is_hip or _is_xpu and before elif _has_vllm_ops (dispatch sketched below). MUSA uses the explicit clamp_ (fusion=False) path; fusion=True+MUSA is excluded by the upstream _is_cuda or _is_hip assert
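A sketch of the graft point described above, with the platform flags reduced to a string switch (the real code branches on module-level _is_* constants in fused_moe.py): MUSA slots in after the cuda/hip/xpu branch, and since the fused clamp path asserts cuda-or-hip upstream, MUSA always takes the explicit clamp_ (fusion=False) route.

```python
def activation_dispatch(backend: str, fusion: bool) -> str:
    _is_cuda, _is_hip, _is_xpu, _is_musa = (
        backend == b for b in ("cuda", "hip", "xpu", "musa")
    )
    if fusion:
        # SGLANG_OPT_SWIGLU_CLAMP_FUSION path; upstream asserts cuda/hip only
        assert _is_cuda or _is_hip, "clamp fusion excluded on MUSA"
        return "silu_and_mul_clamp"
    if _is_cuda or _is_hip or _is_xpu:
        return "silu_and_mul + explicit clamp_"
    elif _is_musa:                   # main #22774 branch, grafted here
        return "_silu_and_mul_musa + explicit clamp_"
    else:
        return "_has_vllm_ops fallback"

assert activation_dispatch("musa", fusion=False) == "_silu_and_mul_musa + explicit clamp_"
```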
main:9ffc0cc6 (merged with 8066dd09): centralized post-experts all-reduce + dp_reduce_scatterv bugfix; act_and_mul_triton kept for DSv4 swiglu_limit (needs review @ByronHsu)
main:ea794dee (merged with 958f3931): final batch — bumped target past PR-body's 3066ba8 to latest main; spec_hidden_size attribute + draft-kv-pool helper + Aiter RMSNorm layout
  • configs/model_config.py: spec_hidden_size calc — keep ours' DSv4-only env-gated form (hc_mult > 1 and SGLANG_FIX_MTP_HC_HIDDEN and SGLANG_DSV4_MODE=="2604") over main [spec decoding] add extra attribute 'spec_hidden_size' #23890's unconditional hc_mult > 1 (Mode 4: explicit kill switch is safer for the fork). is_hybrid_swa_model arch list: trivial concat — main adds MiMoV2ForCausalLM next to ours' DeepseekV4ForCausalLM/DeepseekV4ForCausalLMNextN
  • layers/layernorm.py RMSNorm.forward_aiter: concat — ours' HIP empty-batch early return (x.shape[0] == 0) stays as the fast path; graft main [AMD] Fix Aiter RMSNorm layout handling #23974 layout normalization (needs_reshape = x.dim() != 2 and residual is None → contiguous().reshape(-1, last_dim); non-contiguous → contiguous()) for non-2D Q/K slices (sketch below)
  • managers/scheduler.py init_disaggregation: adopt main [HiCache] feat: add draft KV cache backing for L2/L3 #21125 helper self._get_draft_kv_pool() -> (token_pool_or_None, model_config_or_None) in place of ours' inline if/elif/else; graft ours' DSv4 PD-spec invariant (fix UnboundLocalError on model_config in init_disaggregation #23959) — when helper returns model_config=None (no draft worker), default to self.model_config so MetadataBuffers branches downstream always have a non-None config
  • speculative/eagle_worker.py idle-batch hidden_size: adopt main [spec decoding] add extra attribute 'spec_hidden_size' #23890's compact ternary form — hidden*3 if eagle3 and aux else spec_hidden_size. Semantically equivalent to ours' if-statement form; main's is more concise and aligned with sibling sites
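A sketch of the merged forward_aiter shape handling (a plain-PyTorch stand-in; the real method calls the aiter rmsnorm op): the fork's empty-batch early return stays first, then main's layout normalization flattens non-2D Q/K slices before the kernel call.

```python
import torch

class RMSNormAiterSketch(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor, residual=None) -> torch.Tensor:
        if x.shape[0] == 0:                  # fork-only HIP empty-batch fast path
            return x
        needs_reshape = x.dim() != 2 and residual is None
        orig_shape = x.shape
        if needs_reshape:                    # main #23974: flatten non-2D slices
            x = x.contiguous().reshape(-1, orig_shape[-1])
        elif not x.is_contiguous():
            x = x.contiguous()
        out = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight
        return out.reshape(orig_shape) if needs_reshape else out

norm = RMSNormAiterSketch(64)
q = norm(torch.randn(2, 8, 64))              # non-2D Q slice exercises the reshape
```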

Post-rebase follow-up fixes

9a18d32a restore docs/ and docs_new/ to origin/main
  • The per-batch git checkout HEAD -- docs/ (used to bypass the fork's pre-commit hook that rejects changes under legacy docs/) inadvertently rolled back ALL of main's docs/ + docs_new/ updates accumulated during the rebase, leaving 23 files showing as fork-side deletions (-2186 / +69 net). The pre-rebase fork (origin/deepseek_v4) never modified either tree, so a one-shot git checkout origin/main -- docs/ docs_new/ restore is fully safe. Pushed via --no-verify as a rebase-artifact cleanup
382dd420 fix dangling use_dpsk_v32_encoding ref
  • entrypoints/openai/serving_chat.py:537: self.use_dpsk_v32_encoding → self.chat_encoding_spec == "dsv32". __init__ had already been refactored to set self.chat_encoding_spec ∈ {"dsv4", "dsv32", None} via _resolve_chat_encoding_spec, but this single call site was a stale reference. Cold start + import don't trip it; it only fires on a chat completion → _apply_jinja_template → AttributeError
53e4ee30 disable piecewise cuda graph for DSv4 archs
  • configs/model_config.py piecewise_cuda_graph_disabled_model_archs: add DeepseekV4ForCausalLM and DeepseekV4ForCausalLMNextN next to existing DeepseekV32ForCausalLM. The pre-rebase fork had no piecewise CG concept (implicitly default-disabled via DP-attn / HIP fallback paths); main turned piecewise CG on by default. run_flash_tp8.sh (pure TP8, no DP-attn) was the first recipe to expose the missing arch entry — the DSv4 compressed-attn path goes through dynamo and trips on the _patched_getfile skip
6348cb50 fix retract — disable SGLANG_OPT_SWA_RADIX_CACHE_COMPACT default (needs review @ispobock)
  • major environ.py: flip SGLANG_OPT_SWA_RADIX_CACHE_COMPACT default True → False with TODO(DSV4) @ispobock. The fork-only _compact_single_child_chain in swa_radix_cache.py removes child from swa_lru_list / full_lru_list via remove_node() when merging into parent, but does NOT decrement swa_evictable_size_ / full_evictable_size_ (sketch below). Combined with main [RadixTree][6/N Refactor]: Refactor SWARadixTree to simplify the computation and alignment of bigram keys. #19427's stable old_prefix_len = req.cache_protected_len + retract pressure, pool slot accounting drifts (avail + evictable > total) and the runtime checker's on_idle leak detector trips. Manifests as ValueError: pool memory leak detected! across all DP ranks and even #swa token: -1280, swa token usage: -0.01 negative counts mid-decode. Default-off is the conservative fix; @ispobock to audit / re-enable after fixing the size accounting in compact
  • new test test/registered/4-gpu-models/test_dsv4_swa_radix_retract.py: stress test that forces deterministic retract via SGLANG_TEST_RETRACT=1 + SGLANG_TEST_RETRACT_INTERVAL=3; 64 concurrent long-prompt reqs sharing a 30k+ token prefix; gates on scheduler liveness only. Currently passes with SGLANG_OPT_SWA_RADIX_CACHE_COMPACT=0 set in env (matches new default); will trip pool memory leak detected! if compact is re-enabled
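A minimal sketch of the accounting drift (a hypothetical toy structure; the real fields live on SWARadixCache): remove_node() pulls the merged child out of the LRU list, but the evictable-size counter keeps its tokens, so available + evictable overshoots the pool total and the on_idle leak detector fires.

```python
class SWATreeSketch:
    def __init__(self):
        self.swa_lru_list: list = []
        self.swa_evictable_size_ = 0

    def add_evictable(self, node: str, ntokens: int) -> None:
        self.swa_lru_list.append(node)
        self.swa_evictable_size_ += ntokens

    def compact_single_child_chain(self, node: str, ntokens: int,
                                   fixed: bool) -> None:
        self.swa_lru_list.remove(node)            # remove_node(): leaves the LRU
        if fixed:
            self.swa_evictable_size_ -= ntokens   # the missing decrement

t = SWATreeSketch()
t.add_evictable("child", 1280)
t.compact_single_child_chain("child", 1280, fixed=False)
# the counter still claims 1280 evictable tokens that no node backs any more:
assert t.swa_evictable_size_ == 1280 and t.swa_lru_list == []
```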


@hnyls2002 (Collaborator, Author):

/rerun-test registered/8-gpu-models/test_dsa_models_basic.py


github-actions Bot commented May 7, 2026

8-gpu-h200 (1 test): View workflow run

cd test/ && python3 registered/8-gpu-models/test_dsa_models_basic.py

@Fridge003 (Collaborator):

/rerun-stage stage-c-test-dsv4-4-gpu-b200


github-actions Bot commented May 7, 2026

❌ Stage stage-c-test-dsv4-4-gpu-b200 doesn't support isolated runs yet.

NVIDIA stages:

  • stage-a-test-1-gpu-small
  • stage-a-test-cpu
  • stage-b-test-1-gpu-small
  • stage-b-test-1-gpu-large
  • stage-b-test-2-gpu-large
  • stage-b-test-4-gpu-b200
  • stage-c-test-4-gpu-h100
  • stage-c-test-8-gpu-h200
  • stage-c-test-8-gpu-h20
  • stage-c-test-4-gpu-b200
  • stage-c-test-4-gpu-gb200
  • stage-c-test-deepep-4-gpu-h100
  • stage-c-test-deepep-8-gpu-h200
  • multimodal-gen-test-1-gpu
  • multimodal-gen-test-2-gpu
  • multimodal-gen-component-accuracy
  • multimodal-gen-component-accuracy-1-gpu
  • multimodal-gen-component-accuracy-2-gpu
  • multimodal-gen-test-1-b200

AMD stages:

  • sgl-kernel-unit-test-amd
  • sgl-kernel-unit-test-2-gpu-amd
  • stage-a-test-1-gpu-small-amd
  • stage-b-test-1-gpu-small-amd
  • stage-b-test-1-gpu-small-amd-nondeterministic
  • stage-b-test-1-gpu-small-amd-mi35x
  • stage-b-test-1-gpu-large-amd
  • stage-b-test-2-gpu-large-amd
  • multimodal-gen-test-1-gpu-amd
  • multimodal-gen-test-2-gpu-amd
  • stage-c-test-large-8-gpu-amd
  • stage-c-test-large-8-gpu-amd-mi35x

Other stages will be added soon. For now, use /rerun-failed-ci for those stages.

@hnyls2002 hnyls2002 changed the title from "Deepseek V4 rebase tracking" to "Deepseek V4" on May 8, 2026
@hnyls2002 hnyls2002 merged commit 35870d5 into main May 8, 2026
268 of 288 checks passed
@hnyls2002 hnyls2002 deleted the dsv4-rebase branch May 8, 2026 01:32
@fzyzcjy fzyzcjy mentioned this pull request May 8, 2026
Dogacel pushed a commit to Dogacel/sglang-fork that referenced this pull request May 8, 2026
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: yueming-yuan <yym022502@gmail.com>
Co-authored-by: DarkSharpness <2040703891@qq.com>
Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: yhyang201 <yhyang201@users.noreply.github.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Qiaolin Yu <90088090+qiaolin-yu@users.noreply.github.com>
Co-authored-by: Ethan (Yusheng) Su <11704492+yushengsu-thu@users.noreply.github.com>
Co-authored-by: Mingyi <27337995+wisclmy0611@users.noreply.github.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Yihao Wang <42559837+againstentropy@users.noreply.github.com>
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026
