## Conversation

**Contributor (bot):**

> **Warning** You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
**Author (Collaborator):** `/rerun-test registered/8-gpu-models/test_dsa_models_basic.py`
**Contributor (bot):** ✅
**Collaborator:** `/rerun-stage stage-c-test-dsv4-4-gpu-b200`
**Contributor (bot):** ❌ Stage … NVIDIA stages: … AMD stages: … Other stages will be added soon. For now, use …
*This pull request was closed.*
Dogacel pushed a commit to Dogacel/sglang-fork that referenced this pull request on May 8, 2026.
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request on May 8, 2026.
## Rebase progress
merge-base 0519b09 → target main ea794de (latest), 2880 commits across 29 batches (~100/batch). Done: 29 / 29. ✅
**main:81449b4b (merged with 6bf5a265): fp8 MoE init + custom AR dispatch**
- `fp8.py` `Fp8MoEMethod` init: keep dsv4 `is_fp4_expert` gating, accept main `use_mxfp8` so `block_quant = use_mxfp8 or weight_block_size is not None`
- `custom_all_reduce.py`: dsv4 `CustomAllReduceV2` JIT dispatch takes precedence; accept main `_is_cuda or _is_musa` for fallback
**main:25508d11 (merged with a05bef1a): hybrid SWA pool memory-based sizing (needs review @ispobock)**

- `model_runner_kv_cache_mixin.py` (hybrid SWA pool sizing): keep dsv4 memory-based formula `total_memory / denominator`; reject main per-token formula (sketched below). This is the `DSv4PoolConfigurator` call site
- `schedule_batch.py` `maybe_evict_swa`: keep dsv4 `SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW` path on top of main's early-return refactor
- `swa_radix_cache.py` `cache_finished_req`: keep dsv4 `dec_lock_ref(skip_swa=…)` + `swa_prefix_lock_released` plumbing
- `memory_pool.py` `ReqToTokenPool`: keep dsv4 `free_slots = range(1, size)` (slot 0 reserved)
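For reference, a minimal sketch of the retained sizing idea. All names and the denominator construction are illustrative assumptions, not the fork's actual code, which also sizes the DSv4 compressed-state pools:

```python
# Hypothetical sketch of memory-based pool sizing. The point of
# total_memory / denominator: both pools are derived from one byte
# budget, so they fit by construction, whereas a per-token formula
# sized from max_total_num_tokens can over-commit when the SWA pool's
# per-token cost differs from the full pool's.
def size_pools_memory_based(
    total_memory: int,                 # byte budget for all KV pools
    full_bytes_per_token: int,         # full-attention KV bytes per token
    swa_bytes_per_token: int,          # sliding-window KV bytes per token
    swa_tokens_per_full_token: float,  # assumed pool-size ratio
) -> tuple[int, int]:
    denominator = (
        full_bytes_per_token + swa_tokens_per_full_token * swa_bytes_per_token
    )
    full_tokens = int(total_memory / denominator)
    swa_tokens = int(full_tokens * swa_tokens_per_full_token)
    return full_tokens, swa_tokens


# e.g. a 64 GiB budget, 70 KiB/token full, 10 KiB/token SWA, 4x SWA tokens
print(size_pools_memory_based(64 << 30, 70 * 1024, 10 * 1024, 4.0))
```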
**main:75997ebe (merged with 9cd53fde): nixl generic send_state for SWA/NSA/Mamba (needs review @ShangmingCai @xiezhq-hermann)**

- `nixl/conn.py`: replace main's `_send_mamba_state` dispatcher (feat: add nsa and swa disagg support with nixl #18939) with dsv4 generic `send_state` for SWA/NSA/Mamba state buffers (sketched below); absorb main's PP-aware `received_state_per_pp` tracking + `{room}_state_{pp_rank}` notif format
- `use_mxfp8` set before `block_quant`
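A rough sketch of the generic-sender idea, assuming the pool exposes a uniform list of state buffers. `StateBuffer` and the `agent.send` / `agent.notify` calls are placeholders, not the real nixl API:

```python
from dataclasses import dataclass


@dataclass
class StateBuffer:
    name: str       # e.g. "swa", "nsa_c4", "mamba_conv", whatever the pool exposes
    data_ptr: int   # device pointer to the state region
    item_len: int   # bytes per request slot


def send_state(agent, peer, room: str, pp_rank: int,
               buffers: list[StateBuffer], slot: int) -> None:
    """One code path for every state-carrying architecture: iterate the
    buffers the memory pool advertises instead of dispatching on the
    model family (SWA vs NSA vs Mamba)."""
    for buf in buffers:
        agent.send(
            peer,
            src_ptr=buf.data_ptr + slot * buf.item_len,
            length=buf.item_len,
        )
    # PP-aware notification format absorbed from main
    agent.notify(peer, f"{room}_state_{pp_rank}")
```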
**main:f4417475 (merged with bd8ff150): V4 arch detect + NSA tilelang HIP gating**

- `model_config.py`: keep dsv4 `DeepseekV4ForCausalLM` arch detection branch
- `nsa/tilelang_kernel.py`: gate HIP path on `SGLANG_DSV4_ISOLATE` — when set, single v1 kernel; else main per-arch dispatch (gfx95 / gfx942). TODO to reconcile once main [AMD] Enable FP8 KV cache and FP8 attention kernel for NSA on MI300/MI355 with TileLang backend #21511 lands
- `fused_moe.py`: keep dsv4 `swiglu_limit` / `SGLANG_DSV4_2604_SUBMODE` branching, accept main `_has_vllm_ops` / `F.silu` fallback
**main:5ddc84e3 (merged with e47d56a6): NSA cp rename + ROCm-aware fp8_dtype**

- `attn_tp_*` → `attn_cp_*` (context-parallel) in `nsa/utils.py`
- `quantize_and_rope_for_fp8` lifted from the trtllm MLA backend to common `attention/utils.py`; hardcoded `torch.float8_e4m3fn` → ROCm-aware `fp8_dtype` (`e4m3fnuz` on fnuz GPUs)
- `moe_runner/deep_gemm.py`: keep dsv4 `_legacy_silu_and_mul` for the `SGLANG_OPT_FIX_MEGA_MOE_MEMORY=False` path, accept main `_is_cuda` guard
**main:38a69652 (merged with b0da3713): reject upstream maybe_send_extra dispatcher (needs review @ShangmingCai @xiezhq-hermann)**

- `nixl/conn.py`: reject main's new `maybe_send_extra` state-type dispatcher; keep dsv4 generic `send_state` path. Promote main's runtime length check to an assert; revisit the `_send_mamba_state` dispatcher (feat: add nsa and swa disagg support with nixl #18939) for future state types
**main:9bce3b04 (merged with d699109a): adopt ForwardInputBuffers + cuda graph runners port (needs review @ispobock)**

- `ForwardInputBuffers` reflection `_share_one_buffer` (replaces dsv4 per-runner inline create); eagle draft / draft-extend / multi-layer / piecewise runners ported to `runner.buffers.xxx`; keep dsv4 `spec_hidden_size` field
- `topk.py`: accept main's drop of `num_token_non_padded` / `expert_location_dispatch_info` from `fused_topk`; keep dsv4 `sqrtsoftplus` + `SGLANG_OPT_USE_JIT_KERNEL_FUSED_TOPK` branch
- `fp8.py`: keep dsv4 `is_fp4_expert` init, accept main `with_bias = False`
**main:bdc1e46e (merged with ffacc0ff): main rename pass — FlashInferFusedMoE delete + dp_rank rename**

- `fused_moe_triton/layer.py`: accept main [MoE Refactor] Refactor FlashInferFusedMoE into FusedMoE and flashinfer_trtllm.py #19266 deletion of the `FlashInferFusedMoE` subclass — `swiglu_limit` / multi-format topk refactored into the `FusedMoE` base + `flashinfer_trtllm.py` QuantInfo classes; dsv4 forward delta absorbed
**main:ea6ff7b0 (merged with 3af6f1d3): streaming session controller + req-pool leak check (needs review @ispobock)**

- `scheduler_runtime_checker_mixin.py` `_check_req_pool`: union main [Session] Add `streaming` mode with `SessionAwareCache` fast path #19171 streaming-session count (`session_req_count`) with the dsv4 slot-0 reserve invariant (`expected_free = total - 1` in non-DECODE, sketched below); TODO(DSV4) — combo not exercised in CI
- `deepseek_v2.py`: keep dsv4 imports (`fp8_kernel` trio, `SymmBuffer` TYPE_CHECKING, `DeepSeekMxfp4MoEMethod`, `batched_gemm_a8w8_...`); main deleted the `fp8_kernel` + `batched_gemm` blocks as dead-import cleanup, dsv4 kept them — leaves 2 truly-dead imports for follow-up
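One plausible shape of the unioned check, assuming streaming sessions pin request slots the way live requests do; apart from `expected_free`, the names are invented:

```python
def check_req_pool(
    req_total_size: int,
    num_free_slots: int,
    session_req_count: int,   # slots pinned by streaming sessions (main #19171)
    is_decode: bool,          # disaggregation mode of this scheduler
) -> None:
    # dsv4 reserves slot 0 in ReqToTokenPool (free_slots = range(1, size)),
    # so outside DECODE one slot can never be free.
    reserved = 0 if is_decode else 1
    expected_free = req_total_size - reserved - session_req_count
    if num_free_slots != expected_free:
        raise ValueError(
            f"req pool leak: free={num_free_slots}, expected={expected_free}"
        )
```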
**Follow-up: PD-MTP**

`MetadataBuffers` hidden state — asymmetric P/D unblock attempt, reverted on V4 (NSA c4/c128 indexer ring asymmetry, c7ff4e7c) (needs review @ispobock @ShangmingCai @xiezhq-hermann)

- `scheduler.py` `MetadataBuffers` hidden buffer (fix pd-mtp metadata buffer hidden size #23918, support asymmetric pd-mtp via mock spec hidden #23958): size to `model_config.spec_hidden_size` + `model_config.dtype` unconditionally on both P and D so the wire layout aligns regardless of which side runs spec — DSv4 4x `pre_hc_head` ferries correctly, and asymmetric P (no spec) ↔ D (EAGLE) bootstraps from zero-init mock conditioning
- `UnboundLocalError` follow-up to the attempt: default `model_config` to `self.model_config` before draft-worker dispatch (independent fix, kept after the revert below)
- `deepseekv4_memory_pool.get_compress_state_ring_size` doubles the NSA c4/c128 indexer ring (8→16 / 128→256) under `is_speculative`, and `get_state_buf_infos` bakes `ring_size` into `item_len` ⇒ prefill (no spec) and decode (EAGLE) ship mismatched state-pool layouts (item_len 65536 vs 131072 at index 43, the c4 indexer state; arithmetic sketched below). nixl `send_state` trips the layout assert; bypassing the size check would still leave a ring-buffer protocol mismatch (P writes ring=8, D reads ring=16) ⇒ corrupt indexer state
- `scheduler.py` restored to a main-style conditional (`spec_hidden_size` only when `spec_algorithm.is_eagle()`, else 16-byte / float32 padding). Asymmetric P/D on V4 stays unsupported until the indexer ring is unified across spec status
- `test_dsv4_pd_disagg_nixl.py`: switch to a fully-symmetric topology — both P and D run dp-attention + deepep + EAGLE MTP (steps=3, topk=1, draft=4); only base-gpu-id and disaggregation-mode differ. Removes attn_tp asymmetry (dp-attn matches → SWA `item_len` aligns) and spec asymmetry (both `is_speculative` → c4/c128 ring matches)
- `self.decode_tp_size == self.scheduler.tp_size` (prefill.py) and `self.prefill_pp_size == 1` (decode.py) — both attributes were removed by [PD] Remove unused server args for disaggregation #19618 cleanup; the V4 squash kept references → `AttributeError` on every V4 PD launch. `pp_size == 1` sanity preserved on the prefill side (peer-independent)
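The mismatch is plain arithmetic once `ring_size` is baked into `item_len`. A toy reconstruction (the 8192-byte base is an assumption chosen to reproduce the numbers above; real sizing comes from `get_state_buf_infos`):

```python
def c4_indexer_item_len(base_bytes: int, is_speculative: bool) -> int:
    # get_compress_state_ring_size doubles the ring under speculative decoding
    ring_size = 16 if is_speculative else 8
    return base_bytes * ring_size


prefill_item_len = c4_indexer_item_len(8192, is_speculative=False)  # 65536
decode_item_len = c4_indexer_item_len(8192, is_speculative=True)    # 131072

# nixl send_state compares per-buffer item_lens between peers and asserts;
# even without the assert, P would write ring=8 while D reads ring=16.
assert prefill_item_len != decode_item_len
```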
**main:1135e214 (merged with ce7848e6): flashinfer bump + fp8 ignored-layer skip + glm45 detector**

- `fp8.py` `Fp8MoEMethod` dispatch: stack main's `is_layer_skipped` short-circuit (return `UnquantizedFusedMoEMethod`) before dsv4's `DeepSeekMxfp4MoEMethod` wrap (DSV4_MODE=2604 + FP4_EXPERTS + flashinfer_mxfp4 backend)
**main:04e364d5 (merged with dcdd5f85): priority preemption rename**

- `scheduler.py` `_get_new_batch_prefill_raw`: accept main feat: Priority-based scheduling optimization (including default priority, preemption toggle, priority-based metrics, etc.) #17026 mechanical rename `try_preemption` → `enable_priority_preemption` (verified via NightFall optimize: `try_preemption` was an alias of `enable_priority_scheduling`, the rename preserves the value); keep dsv4 `or self.is_hybrid_swa` on the batch_is_full reset condition
**Pre-adopt: apply main `MemoryPoolConfig` framework + `pool_configurator.py` ahead of rebase (a9f3c47d, cherry-pick 2ac9024d)**

- `MemoryPoolConfig` dataclass + `_resolve_memory_pool_config` / `_apply_memory_pool_config` / `_init_pools` decomposition into `model_runner_kv_cache_mixin.py` (sketched below); eliminates the `server_args._draft_pool_config` mutation hack — the draft worker now receives the resolved `MemoryPoolConfig` via the constructor chain (tp_worker → model_runner, plumbed through all 6 spec workers: eagle / eagle_v2 / multi_layer / multi_layer_v2 / standalone / standalone_v2)
- `pool_configurator.py`: the `DSv4PoolConfigurator` class absorbs the deleted `memory_profiler.py`:`DSv4MemoryCalculator` + old `set_num_tokens_hybrid_swa_compress` mixin method into one resolve flow. Layered as a single `if is_deepseek_compressed and is_hybrid_swa` branch on top of main's path; future alignment to main [core] Extract pool sizing logic to pool_configurator.py #22384 free-function form is mechanical
- `_resolve_max_num_reqs` (applies `// dp_size` on user-provided `--max-running-requests`); single-DP unaffected, DP-attention setup needs cold-start verify
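A compressed sketch of the adopted shape, with an abridged field set; the real `MemoryPoolConfig` carries more fields, and DSv4 later extends it with compressed-state sizes:

```python
from dataclasses import dataclass


@dataclass
class MemoryPoolConfig:
    max_total_num_tokens: int
    max_num_reqs: int
    swa_num_tokens: int = 0  # hybrid-SWA pools; 0 for dense models


def resolve_memory_pool_config(
    kv_budget_bytes: int,
    bytes_per_token: int,
    max_running_requests: int,
    dp_size: int,
) -> MemoryPoolConfig:
    # _resolve_max_num_reqs applies // dp_size to the user-provided cap
    return MemoryPoolConfig(
        max_total_num_tokens=kv_budget_bytes // bytes_per_token,
        max_num_reqs=max_running_requests // dp_size,
    )


# The resolved config then flows resolve -> apply -> init_pools and down
# the constructor chain to the draft worker, instead of mutating
# server_args as a side channel.
```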
**main:3e8abc71 (merged with a77b1b1d): pool sizing finalize + ngram embedding + aiter fp8/compressed_tensors bf16**

- `model_runner_kv_cache_mixin.py`: keep ours' superset (DSv4 c4/c128/state fields + `_resolve_dsv4_compressed_config` + draft zero-out + unconditional `state_dtype=fp32`) over main's bare framework. The pre-adopt path covers everything; main has nothing dsv4-specific to add
- `deepseek_v2.py` `MoEGate.__init__` `correction_bias_dtype`: union dsv4 `not is_hash_moe` guard + main refactor (default fp32, then quant_config dispatch — `modelopt_fp4 + flashinfer_trtllm` → bf16, new `_use_aiter and name in {"fp8","compressed_tensors"}` → bf16 branch)
**main:b227e53e (merged with 51bccbfb): health-check idle helper + flashinfer_trtllm_routed MoE + draft extend cuda graph helpers**

- `scheduler.py` health-check path: drop the dsv4 inline backport (261334394d) — main [Disagg] Fix health check false-positive in disagg `is_fully_idle` #20756's `is_fully_idle(for_health_check=True)` is the canonical fix and now covers all PD bootstrap/prealloc/transfer queue checks. Net loss: the `offload_tags` guard (main's helper doesn't include it); follow-up if the dsv4 offload path needs the guard at the health-check call site
- `token_dispatcher/standard.py`: union dsv4 `enable_flashinfer_mxfp4_moe` + `skip_local_expert_mapping` (gated by `SGLANG_OPT_MXFP4_SKIP_DISPATCHER_MAPPING`) with main's new `enable_flashinfer_trtllm_routed_moe` — both flags chain into the EP local-mapping `not` guard
- `topk.py` `select_expert` elif: chain the dsv4 `scoring_func == "sqrtsoftplus"` branch (uses `biased_topk_impl` with optional JIT kernel) before main's new `flashinfer_trtllm_routed + softmax + no bias` branch (uses `fused_topk_softmax_torch_raw_logits`); fall through to the original `fused_topk`
- `eagle_worker_v2.py` draft extend cuda graph capture: accept main's helper refactor (`supports_cuda_draft_extend_graph` wraps `TritonMultiStep` + `TRTLLMMLA`; new `supports_hip_aiter_draft_extend_graph` for the HIP path); keep dsv4's `DeepseekV4BackendRadix` branch (gated by `SGLANG_OPT_V4_DRAFT_EXTEND_CUDA_GRAPH`) as an extra `or` clause
**main:71a54c1c (merged with 286924b6): RadixTree #20330 unified lock interface — apply dsv4 SWA leaf-lock release onto main API (needs review @ispobock)**

- `swa_radix_cache.py`: rebase dsv4 fork-only commit `97d1a672fe release lock after window` (ispobock 2026-04-26) onto main [RadixTree][8/N Refactor]: unify lock interface #20330 (`97b2a8933 [RadixTree][8/N] unify lock interface`) — main wraps `(node, swa_uuid_for_lock)` into a `DecLockRefParams` dataclass + `-> DecLockRefResult` return. Resolve choice: keep `skip_swa` as a Liskov-compatible extra default-valued kwarg on the `SWARadixCache.dec_lock_ref` override, NOT in `base_prefix_cache.py`:`DecLockRefParams` — minimizes the fork divergence surface for the ongoing `[RadixTree][N/M]` series (sketched below). `dec_swa_lock_only` stays unchanged as fork-only API. `cache_finished_req` / `cache_unfinished_req` call sites updated to the `DecLockRefParams(swa_uuid_for_lock=..., skip_swa=req.swa_prefix_lock_released)` form; the `inc_lock_ref` caller unpacks the new `IncLockRefResult`. TODO(DSV4) @ispobock review placed at the override
- `server_args.py` DSv4.2-DSA Context Parallel: keep ours `assert tp_size <= 8` (allow tp ∈ {1,2,4,8} for single-machine CP); accept main's new `self.attn_cp_size = self.tp_size` assignment
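A sketch of the override pattern with stub bodies. The point: callers typed against the base interface never see `skip_swa`, so the fork-only parameter adds no divergence to the shared dataclass:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DecLockRefParams:           # shared dataclass, untouched by the fork
    node: Any
    swa_uuid_for_lock: Optional[str] = None


@dataclass
class DecLockRefResult:
    evictable_size_delta: int = 0


class BasePrefixCache:
    def dec_lock_ref(self, params: DecLockRefParams) -> DecLockRefResult:
        return DecLockRefResult()


class SWARadixCache(BasePrefixCache):
    # Liskov-compatible: the extra kwarg has a default, so base-typed
    # callers are unaffected; only fork call sites pass skip_swa=True
    # (when the SWA lock was already released after the window).
    def dec_lock_ref(
        self, params: DecLockRefParams, skip_swa: bool = False
    ) -> DecLockRefResult:
        if not skip_swa:
            pass  # decrement the SWA-side lock here
        return super().dec_lock_ref(params)
```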
**main:574572b2 (merged with e1a43ba9): mscale helper extract + chat_template_kwargs dict refactor + transformers v5 KeyError fallback**

- `model_config.py`: accept main's `compute_mla_mscale_scaling(rope_scaling, base_scaling)` helper extract; opportunistically also collapse the same 4-line inline mscale calc inside the dsv4 `DeepseekV4ForCausalLM` elif branch onto the helper for file-wide consistency
- `serving_chat.py` `apply_chat_template` kwargs: accept main's `extra_template_kwargs` dict refactor (collapses inline `reasoning_effort=` + `**chat_template_kwargs` into a single `**extra_template_kwargs` spread). Redirect the dsv4 fork-only `SGLANG_ENABLE_THINKING` protocol from the deleted inline `chat_template_kwargs` dict into `extra_template_kwargs["thinking"] = True` at the unified setup site (try-block top → moved out to try-block-outer with the rest); same for the Mistral fallback hunk. `reasoning_parser` map: union of dsv4 `"deepseek-v4"` + main's `"mistral"` already auto-merged
- `fused_moe.py` `fused_experts_impl`: keep ours' restructured `else:` branch (DSv4 2604B clamp + `SGLANG_OPT_SWIGLU_CLAMP_FUSION` dispatch fold the cuda/hip/xpu path inside, replacing main's `elif _is_cuda or _is_hip or _is_xpu:` sibling). Follow main rename `topk_ids → curr_topk_ids` (chunked slice; main's PR is the bug fix) at ours' `act_and_mul_triton` call site
- `deepseek_v2.py` `DeepseekV2MoE.forward` (2 hunks): union `topk()` kwargs — keep dsv4 `topk_kwargs = {"input_ids": input_ids_global} if self.is_hash else {}` + main's new `expert_location_dispatch_info=dispatch_info` (EPLB) → call as `self.topk(..., expert_location_dispatch_info=dispatch_info, **topk_kwargs)`
- `hf_transformers_utils.py` `get_config`: accept main's new `except KeyError as e:` fallback for transformers v5 (raw `PretrainedConfig.get_config_dict` + `_CONFIG_REGISTRY` bypass); redirect the `"deepseek_v32" in str(e)` branch from main's `_load_deepseek_v32_model(...)` to ours' generalized `_load_deepseek_temp_model(model_type="deepseek_v32", architecture="DeepseekV3ForCausalLM", ...)` helper. The `_load_mistral_large_3_for_causal_LM` caller signature, `_fix_special_tokens_pattern` hook, and ours' deepseek_ref/v32 `ValueError` branches all auto-merged
**main:2406ddfd (merged with f9e69522): keep dsv4 fork JIT custom AR v2 over main #19880 upstream rewrite (needs review @DarkSharpness)**

- Keep ours (`d17d622fd1` prototype) over main [JIT Kernel][Feature] Support JIT custom all reduce (rewrite as v2) #19880 (`2dd9196079 [JIT Kernel] Support JIT custom all reduce (rewrite as v2)` by DarkSharpness + BBuf — the same author team's polished upstream rewrite). 5 add/add files all kept ours: `jit_kernel/all_reduce.py`, `include/sgl_kernel/distributed/{common,custom_all_reduce}.cuh`, `jit_kernel/tests/test_custom_all_reduce.py`, `srt/distributed/device_communicators/custom_all_reduce_v2.py`
- The `CustomAllReduceV2` impl diverges from main's #19880 polished version. Main has additions ours lacks: `can_use_custom_all_reduce_with_nvlink` nvlink pre-check + `is_in_piecewise_cuda_graph` inplace optimization handling + `bench_custom_all_reduce.py` + a 104-line `ffi.h`. Ours has: pull/push split into 2 modules + `max_pull_blocks=0` disable path + `min(NUM_CTA, max_pull_blocks)` SM cap + `__slots__` opt. Reconcile when ready — the two team versions need cross-pollination
- `scheduler.py` `get_new_batch_prefill`: revert dsv4 fork commit `05ab33bf57 rm token_usage call in prefill delayer` (fzyzcjy 2026-04-25 HACK `token_usage = 0.5  # since it is unused`) onto main [Fix #20389] Illegal memory access in triton attention for large token counts #20390's hybrid_swa / hybrid_ssm token-usage extension (correct calculation across pools, fallback to `_get_token_info`). TODO(DSV4) @fzyzcjy placed at the site to re-evaluate the HACK if prefill scheduling latency regresses
**main:80389fec (merged with 92e1c587): keep dsv4 fork HiSparse over main #20343 upstream rewrite + tilelang sparse_fwd refactor reject + DP-aware attn_cp_size (needs review @xiezhq-hermann)**

- Keep ours (`d17d622fd1` prototype) over main HiSparse for Sparse Attention #20343 (`13f4f010d8 HiSparse for Sparse Attention` by xiezhq-hermann — the same author's polished upstream rewrite, same pattern as batch 16's JIT custom AR v2). 3 add/add files all kept ours: `jit_kernel/csrc/hisparse.cuh`, `srt/managers/hisparse_coordinator.py`, `srt/mem_cache/hisparse_memory_pool.py`. `schedule_batch.py` keeps the `Req.hisparse_staging` name (theirs `Req.staging`); `scheduler.py` keeps the fork's full hisparse_coordinator init + `SGLANG_FIX_SWA_CHUNKED_REQ_DOUBLE_FREE` gate + decode batch dispatch flow; `server_args.py` keeps the `hierarchical_sparse_attention_extra_config` name (theirs `hisparse_config`); `model_runner_kv_cache_mixin.py` reverts main's auto-merged `HiSparseNSATokenToKVPool` branch back to ours' single `NSATokenToKVPool` path (the `HiSparseNSATokenToKVPool` class doesn't exist in the dsv4 fork's `hisparse_memory_pool.py`)
- Main's `hisparse_coordinator.py` has +148 lines; `hisparse_memory_pool.py` introduces a separate `HiSparseNSATokenToKVPool` class + `parse_hisparse_config` (host_to_device_ratio) integration. Reconcile when ready
- `tilelang_kernel.py`: keep ours' full `tilelang_sparse_fwd` (`SGLANG_DSV4_ISOLATE` single v1 kernel + `_is_gfx95_supported` small-batch decode partial+combine + gfx942 fallback) + ours-only `fp8_paged_mqa_logits_kernel`; reject main [AMD] Tilelang sparse fwd for dsv32 mi355/mi300 #19945 (`855d15adf6 [AMD] Tilelang sparse fwd for dsv32 mi355/mi300`) refactored partial+combine with adaptive `inner_iter` heuristic — it duplicates ours' path and would require a non-trivial reconcile of the SGLANG_DSV4_ISOLATE gate
- `server_args.py` DSv4.2-DSA Context Parallel: keep ours `assert tp_size <= 8`; accept main fix `self.attn_cp_size = self.tp_size // self.dp_size` (DP-aware divisor — ours' `= self.tp_size` didn't account for dp_size; single-DP unaffected, but multi-DP attn_cp setup was off)
- `jit_kernel/tests/test_custom_all_reduce.py`: keep ours (consistent with batch 16 — the dsv4 fork's `multiprocess_main` framework + `stage-b-kernel-unit-8-gpu-h200` CI tier; the main version is disabled "requires multi-GPU distributed setup" on the 1-gpu CI tier so it wouldn't actually exercise the kernel)
**main:18074e25 (merged with bfa0923d): NUMA helpers refactor to numa_utils.py + Ascend NPU hybrid_swa support**

- `utils/common.py`: accept main's deletion of the NUMA helpers (refactored to `utils/numa_utils.py` with the new `get_numa_node_if_available` / `numa_bind_to_node` API; the `scheduler.py:226` caller already uses the new path); keep ours `maybe_torch_compile` (dsv4 fork-only, used by `deepseek_v4_rope.py` + `deepseek_v4.py:113,992`)
- `model_runner_kv_cache_mixin.py` `_init_pools`: keep ours' `is_v4_model` top-level branch + ascend mambaish guard; absorb main's new ascend `is_hybrid_swa` sub-branch (`NPUMHATokenToKVPool` + `SWAKVPool` for hybrid SWA on NPU)
**main:1b45d81e (merged with 16f22f48): aiter_dsv3_router_gemm signature simplify + base64 encoding layer + spec backend rename**

- `schedule_batch.py` `maybe_evict_swa`: keep ours' early-return (decode + piecewise CG enabled + non-chunk_cache) + `release_leaf_lock` gate (fork-only `SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW`); main's deletion is for main's own cleanup, the dsv4 path needs both
- `tokenizer_manager.py` + `detokenizer_manager.py`: keep ours' base64-encoding layer at the detokenizer (with the `_extract_topk_base64` helper + `indexer_topk` field). Reject main Simplify routed experts test and move base64 encoding to tokenizer manager #21634 redesign (move base64 from detokenizer to tokenizer manager) — fork-only design choice; dsv4 has dual `routed_experts` + `indexer_topk` base64 paths
- `deepseek_v2.py` `MoEGate`: accept main's `aiter_dsv3_router_gemm(hidden_states, weight)` 2-arg simplify (main dropped `gemm_output_zero_allocator`; ours' 3-arg signature is obsolete); the `linear_bf16_fp32` fallback to the dsv4 JIT kernel is preserved; keep ours' `SymmBuffer` TYPE_CHECKING import
- `eagle_worker_v2.py`: accept main rename `TritonMultiStepDraftBackend → TritonAttnBackend` + `TRTLLMMLAMultiStepDraftBackend → TRTLLMMLABackend`; keep ours `DeepseekV4BackendRadix` import (fork-only)
**main:658a2813 (merged with 5fc88fc9): HiSparse main #21198/#21591 + AMD NSA FP8 #21511 vs NightFall HEAD (needs review @xiezhq-hermann)**

- `jit_kernel/csrc/hisparse.cuh`: keep ours' 4-way `(SeqLensT, ReqPoolIndicesT)` template dispatch — verified NightFall HEAD `31fb99be2` has the same superset (NightFall absorbed main style refinement for hisparse #21198 `SeqLensT` by the same author Zhiqiang Xie, then added `ReqPoolIndicesT` for dsv4); ASCII `->` / `<-` in comments (main has Unicode)
- `mem_cache/hisparse_memory_pool.py`: keep ours' `alloc_device_buffer` rewrite (newest-token reorder + page-aware surplus free) — verified NightFall HEAD `3c485ff30` identical. main [PD]: Add support for HiSparse to directly transfer the cache from Prefill to Decode DRAM. #21591 PD direct-to-host (`alloc_logical_only` + zero-filter in `alloc_device_buffer`) is main-only, not in NightFall scope
- `managers/hisparse_coordinator.py`: keep ours' coordinator rewrite (`compress_ratio` / `DeepSeekV4SingleKVPoolHost` / `translate_loc_from_full_to_hisparse_device` / `hisparse_staging`) — verified ours = NightFall HEAD `864ce2227`. main #21591 `admit_request_direct` + `_preload_to_device_buffer` + `naive_load_topk` + int64 seq_lens dispatch are all main-only
- `nsa/tilelang_kernel.py`: keep ours' `SGLANG_DSV4_ISOLATE` gate (single v1 kernel for dsv4) — verified NightFall HEAD `864ce2227` HIP path is just `sparse_attention_fwd_kernel_v1`. main [AMD] Enable FP8 KV cache and FP8 attention kernel for NSA on MI300/MI355 with TileLang backend #21511 AMD FP8 NSA path (`sparse_mla_fwd_decode_partial_fp8`) is a main-only AMD optimization; TODO future align once NightFall picks up FP8 NSA
**main:d72f58d1 (merged with c93d658f): scheduler output processor refactor (#22146 / #15562 / #22148) + reasoning tokens**

- `scheduler_output_processor_mixin.py`: adopt main Isolate spec V1 path in decode post-processing #22146 (by Liangsheng Yin) — `_handle_finished_req` helper extract + `is_spec_v1` early-continue; adopt main [Feature] Add Reasoning Tokens Usage #15562 — `_maybe_update_reasoning_tokens` helper + `reasoning_tokens` field; adopt Unify think_end_id to model_config as single source of truth #22148 — `self.model_config.think_end_id` as source of truth. Port the DSv4-only `maybe_collect_indexer_topk(req)` def + 2 call sites (prefill finished branch + `_handle_finished_req` helper). Ours' inline `(spec_algorithm.is_none() or is_spec_v2)` logprob gate is now equivalent to `if req.return_logprob:` after the spec_v1 short-circuit, no semantic change
**main:a64905a7 (merged with 6e59b211): MemoryPoolConfigurator framework + DSv4 subclass (needs review @ispobock)**

- `pool_configurator.py` + `model_runner_kv_cache_mixin.py`: adopt main [core] Extract pool sizing logic to pool_configurator.py #22384 / [core] Introduce `MemoryPoolConfigurator` class hierarchy #22389 — `MemoryPoolConfigurator` base class + `DefaultPoolConfigurator` / `HybridSWAPoolConfigurator` + `create_memory_pool_configurator` factory; convert dsv4 `DSv4PoolConfigurator` from standalone to a `MemoryPoolConfigurator` subclass implementing `calculate_pool_sizes` / `calculate_pool_sizes_from_max_tokens`; extend `MemoryPoolConfig` with c4/c128 / c4_state / c128_state fields (default 0); the factory dispatches `is_deepseek_compressed and is_hybrid_swa` to DSv4 first (sketched below)
- `nixl/conn.py`: keep ours' wire format (state_data_ptrs at msg[11], state_item_lens at msg[12]) + unified `send_state` + state-after-aux call order; reject main [Disagg][NIXL] Support Mamba state slice transfer for heterogeneous TP (Step 2/2 for Qwen3.5) #22240 `_send_mamba_state_slice` heterogeneous Mamba TP (fork has TODO @nealvaidya); fix auto-merge dup `packed_state_item_lens`
- `DSv4PoolConfigurator.__init__` (drop the `mr.state_dtype = fp32` write, mixin `_apply_memory_pool_config` covers it); spec inflate `bytes_per_full_token *= (T+D)/T` (mirror dflash `scale_kv_cell_size_per_token_for_dflash`) instead of shrinking `available_bytes` / `self._mr`; rename `_solve_pool_sizes` → `_compute_dsv4_sizes` (avoid sibling collision); docstring mentions the spec inflate
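A sketch of the hierarchy and dispatch order, with stand-in sizing bodies and an illustrative per-token constant; the spec-inflate comment mirrors the note above:

```python
class MemoryPoolConfigurator:
    bytes_per_full_token = 70 * 1024  # illustrative constant, not real sizing

    def calculate_pool_sizes(self, available_bytes: int) -> dict:
        raise NotImplementedError


class DefaultPoolConfigurator(MemoryPoolConfigurator):
    def calculate_pool_sizes(self, available_bytes: int) -> dict:
        return {"full_tokens": available_bytes // self.bytes_per_full_token}


class HybridSWAPoolConfigurator(DefaultPoolConfigurator):
    pass  # SWA-aware sizing elided


class DSv4PoolConfigurator(MemoryPoolConfigurator):
    def calculate_pool_sizes(self, available_bytes: int) -> dict:
        # Spec decode inflates the per-token cost by (T+D)/T (T target
        # tokens, D draft tokens) instead of shrinking available_bytes.
        bytes_per_token = self.bytes_per_full_token * (4 + 1) / 4
        full = int(available_bytes / bytes_per_token)
        # c4/c128 compressed-indexer pools sized alongside (elided)
        return {"full_tokens": full, "c4_tokens": full, "c128_tokens": full}


def create_memory_pool_configurator(cfg) -> MemoryPoolConfigurator:
    # DSv4 dispatched first, before the generic hybrid-SWA path
    if cfg.is_deepseek_compressed and cfg.is_hybrid_swa:
        return DSv4PoolConfigurator()
    if cfg.is_hybrid_swa:
        return HybridSWAPoolConfigurator()
    return DefaultPoolConfigurator()
```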
**main:870a21bf (merged with ff6ba6c3): PoolStats refactor + max_pool_usage (needs review @ispobock)**

- `scheduler_runtime_checker_mixin.py`: adopt main [mem] Introduce PoolStats dataclass; unify pool metrics and token_usage #22554 / [metrics] Add `PoolStats.update_scheduler_stats` to deduplicate metrics assignment #22559 / [mem] Flatten memory checkers into composable per-pool invariant checks #22562 — `PoolStats` dataclass + composable `_check_full_pool` / `_check_swa_pool` / `_check_mamba_pool` + `_check_all_pools` driver + `_maybe_log_idle_metrics` + `_check_tree_cache` + flat `on_idle` (sketched below); graft 5 fork-only additions on top: (1) `_get_swa_token_info` hisparse clamp `max(0, full/swa_num_used)`, (2) `_check_req_pool` slot-0-reserved `expected_free = req_total_size - 1` for non-DECODE, (3) `on_idle` skips the mem leak check when `enable_hisparse`, (4) `self_check_during_busy` hybrid_swa branch using `_check_full_pool + _check_swa_pool`, (5) `_get_batch_swa_uncached_sizes` + `_get_total_swa_uncached_sizes` helpers (graft applied via 4b5ec30a)
- `scheduler.py`: adopt main `max_pool_usage = self.get_pool_stats().get_max_pool_usage()` in the prefill_delayer; drop the dsv4 fork hack (`05ab33bf57 token_usage = 0.5`) and the manual hybrid_swa/ssm dispatcher
- `schedule_batch.py`: drop the fork-only `SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN` env (correctness, not opt-in — without `-page_size` SWA leaks via a tombstoned leaf); align with main Fix SWA eviction boundary and page-align chunked prefill #22470 unconditional `-page_size`
- `token_dispatcher/standard.py`: OR-merge `skip_local_expert_mapping` — main's `cutlass / cutedsl / trtllm_routed` + ours' `mxfp4 and SGLANG_OPT_MXFP4_SKIP_DISPATCHER_MAPPING`; simplify the use site to `not skip_local_expert_mapping` (fix auto-merge dup assignment that lost main's 3 cases)
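A sketch of the composable-checker shape with an abridged field set; the leak predicate here is the same one the retract fix at the end of these notes trips:

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    full_available: int
    full_evictable: int
    full_total: int

    def get_max_pool_usage(self) -> float:
        used = self.full_total - self.full_available - self.full_evictable
        return used / self.full_total


def _check_full_pool(stats: PoolStats) -> None:
    # invariant: free + evictable can never exceed the pool size; drift
    # above total is exactly the "pool memory leak detected!" condition
    if stats.full_available + stats.full_evictable > stats.full_total:
        raise ValueError("pool memory leak detected!")


def _check_all_pools(stats: PoolStats) -> None:
    # per-pool checks compose; SWA/mamba variants follow the same shape
    _check_full_pool(stats)
```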
**main:36891ab5 (merged with 8d45228d): SWA eviction interval + budget tracking + main subsumes fork's busy SWA check**

- `schedule_batch.py` `maybe_evict_swa`: 3-way union — adopt main env: add knob to control SWA eviction interval #22645 `SGLANG_SWA_EVICTION_INTERVAL_MULTIPLIER` tunable `eviction_interval` (replaces the hard-coded `% sliding_window_size`, sketched below); keep ours' fork-only piecewise-CG early-return + `SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW` leaf-lock release after window. The three blocks are orthogonal
- `schedule_policy.py`: keep ours' `min(_rem_tokens, rem_swa_tokens - page_size)` + `if hybrid_swa: return req` chunked-prefill OOM guard — the fork pre-cherry-picked main Fix hybrid swa chunked prefill oom #23174 (50fc2c9, ispobock, lands in a later batch); accept main's docstring on `_swa_budget_for_req`
- `scheduler_runtime_checker_mixin.py`: main Rename _alive_streaming_session_count; use _is_streaming helper #22755 subsumes the batch 23 graft — drop the fork-only `_get_batch_swa_uncached_sizes` + `_get_total_swa_uncached_sizes` helpers; adopt main's unified `_get_total_uncached_sizes` returning a `(full, swa)` tuple + the `self_check_during_busy` always-compute-both pattern (also folds main's new native `_get_hisparse_token_info` + `is_hisparse` PoolStats fields). Fork-only grafts retained: hisparse clamp in `_get_swa_token_info`, slot-0-reserved `_check_req_pool`, hisparse skip in `on_idle`
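A minimal sketch of the tunable cadence; the env name is from the notes, the surrounding eviction logic is paraphrased:

```python
import os

_MULTIPLIER = int(os.environ.get("SGLANG_SWA_EVICTION_INTERVAL_MULTIPLIER", "1"))


def should_run_swa_eviction(seq_len: int, sliding_window_size: int) -> bool:
    # replaces the hard-coded `seq_len % sliding_window_size == 0` cadence;
    # a larger multiplier evicts less often, trading memory for overhead
    eviction_interval = sliding_window_size * _MULTIPLIER
    return seq_len % eviction_interval == 0
```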
**main:6e3bbef5 (merged with 948efa36): hf_transformers subpackage refactor (#21569) + transformers 5.5.4 + DSv4 v4 case**

- `hf_transformers_utils.py`: adopt main Upgrade transformers to 5.5.3 and refactor hf_transformers_utils into subpackage #21569 — the 1500-line monolithic file becomes a 17-line backward-compat shim re-exporting from the new `hf_transformers/` subpackage (`common.py` / `compat.py` / `config.py` / `tokenizer.py` / `processor.py` / `mistral_utils.py`). All fork-only helpers were already subsumed by main: `_load_mistral_large_3_for_causal_LM` = `mistral_utils.load_mistral_config`, `_ensure_llama_flash_attention2_compat` moved into `compat._patch_removed_symbols` auto-applied via `apply_all()`, `_patch_mistral_common_tokenizer` = `mistral_utils.patch_mistral_common_tokenizer`, `_fix_v5_add_bos_eos_token` / `_fix_added_tokens_encoding` / `get_rope_config` moved to the corresponding submodules
- `hf_transformers/common.py`: generalize `_load_deepseek_v32_model` → `_load_deepseek_temp_model(architecture, name_prefix, ...)` for the dsv4 case
- `hf_transformers/config.py` `get_config`: add an `elif "deepseek_v4" in str(e)` branch (DSv4ForCausalLM dispatch) before the existing dsv32 case in the combined `(ValueError, KeyError)` handler
- `hisparse_memory_pool.py`: keep ours = NightFall HEAD (DSv4 path); main's separate `HiSparseNSATokenToKVPool` design is for NSA models, not used by the fork. Verified the `get_num_new_pages` dsv4-aware version (logical vs hisparse with `compress_ratio` divisor) already covers main [sgl] improve accuracy of additional page requirement during spec decode #22406 fix scope
- `test_serving_chat.py` path: accept main migrate CPU-only unit tests from openai_server to unit/ #22965 rename `openai_server/basic/` → `unit/entrypoints/openai/` + `_MockTokenizerManager` simplification; rewrite Cases 1-4 to use the fork's `chat_encoding_spec` attribute (replaces main's `use_dpsk_v32_encoding`); keep 5 fork-only dsv4 tests (`test_dsv4_task_field_schema`, `test_latest_reminder_role_accepted`, `test_attach_task_to_last_user_message`, `test_dsv4_content_parts_list_normalized`, `test_dsv4_task_and_reminder_encode_end_to_end`)
**main:48daa831 (merged with 96718620): triton MoE runner refactor + multi-platform plugin + StreamingSession rename**

- `moe_runner/triton_utils/fused_moe.py`: adopt main refactor(moe): de-duplicate triton MoE runner path into shared helpers #23019 — split `fused_experts_impl` into `_prepare_fused_moe_run` + `_fused_moe_kernel_sequence` helpers; a thin `fused_experts_impl` delegates. Graft fork-only `swiglu_limit` through `inplace/outplace_fused_experts` → `fused_experts` → `fused_experts_impl` → `_fused_moe_kernel_sequence`; the DSv4 2604B clamp logic (env-gated `SGLANG_DSV4_2604_SUBMODE` + `SGLANG_OPT_SWIGLU_CLAMP_FUSION` with the `silu_and_mul_clamp` fused kernel) moves into the `_fused_moe_kernel_sequence` activation block (see the sketch after this list); the `triton.py` runner caller passes `swiglu_limit=self.config.swiglu_limit`
- `model_runner_kv_cache_mixin.py`: adopt main Multi platform Plugin #21388 multi-platform plugin (`current_platform.is_out_of_tree()` with `get_{nsa,mla,mha}_kv_pool_cls()` dispatch). The DSv4 `is_v4_model` branch keeps top priority; the plugin path is inserted as an `elif` before the existing ascend branch. `from sglang.srt.platforms import current_platform` hoisted before the if-chain
- `scheduler.py`: adopt main move session to python/sglang/srt/session #23144 / integrate streaming session into UnifiedRadixCache #23145 / [core] Always-on `StreamingSession` in `UnifiedRadixCache` #23202 (Liangsheng Yin) — `SessionAwareCache → StreamingSession` rename + `not tree_cache.supports_streaming_session()` guard (UnifiedRadixCache embeds StreamingSession, prevent double-wrap); adopt main's `set_decode_producer_stream(self.forward_stream)` call (race fix — the fork's `decode_producer_stream` was always None, leaving `decode_backup_stream` not waiting on forward_stream)
- `nsa/utils.py`: adopt main [Refactor] Deduplicate NSA utils.py into cp_utils.py for context parallel #22914 cp_utils dedup (8 symbols moved to `layers/utils/cp_utils.py`, `can_cp_split` rename, `nsa_cp_metadata → attn_cp_metadata`); graft fork-only `_assert_cp_pure_extend` + `assert_tensor_identical_across_cp_ranks` debug helpers (env-gated `SGLANG_DEBUG_HACK_CP_ASSERT_PURE_EXTEND`)
- `schedule_batch.py`: 3-way union — adopt main env: add knob to control SWA eviction interval #22645 `SGLANG_SWA_EVICTION_INTERVAL_MULTIPLIER` tunable + keep ours' fork-only piecewise-CG early-return + `SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW` leaf-lock release
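An eager-mode sketch of what the fused `silu_and_mul_clamp` kernel computes. The clamp placement follows the common swiglu-limit pattern (gate capped from above, up-projection capped symmetrically) and is an assumption; the fork's kernel is the source of truth:

```python
import torch
import torch.nn.functional as F


def silu_and_mul_clamp_reference(gate_up: torch.Tensor, swiglu_limit: float) -> torch.Tensor:
    # gate_up packs [gate | up] along the last dim, as in fused MoE kernels
    gate, up = gate_up.chunk(2, dim=-1)
    gate = gate.clamp(max=swiglu_limit)                  # cap gate from above (assumed)
    up = up.clamp(min=-swiglu_limit, max=swiglu_limit)   # symmetric cap on up (assumed)
    return F.silu(gate) * up


out = silu_and_mul_clamp_reference(torch.randn(4, 256), swiglu_limit=7.0)
```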
**main:bf98eb3a (merged with f6f62737): expert_mask_gpu refactor + MUSA backend + RoutedExpertsOutput overlap**

- `routed_experts_capturer.py`: adopt main [perf] support return_routed_experts with overlap scheduling #22911 — new `RoutedExpertsOutput` dataclass for overlap scheduling; `_get_local_range` + `_prepare_routed_experts_output` helpers; `on_forward_end(no_copy_to_cpu=False)`. Graft fork-only deepep blocks: `__init__` adds `self.gather_buffer` for attn-tp all-gather; `capture()` adds `attn_tp_all_gather_into_tensor` before `device_cache.capture_fwd_routed_experts`
- `token_dispatcher/standard.py`: adopt main Move expert_mask_gpu from FusedMoE layer to StandardDispatcher #23585 — add `num_local_experts` field + `expert_mask_gpu = None`; the use site splits `_use_aiter` (writes `expert_mask_gpu`) vs non-aiter (writes the `topk_ids` remap). Graft fork-only `skip_local_expert_mapping` as the outer guard `if local_expert_mapping is not None and not skip_local_expert_mapping`
- `fused_moe.py`: adopt main [MUSA][16/N] Add MUSA backend support for layers and DeepSeek models (V2/V3/R1) #22774 MUSA — auto-merge picked up the `is_musa` import + `_is_musa` flag + `_silu_and_mul_musa` + 4 moe_sum_reduce dispatch sites. The activation block grafts main's `elif _is_musa` into ours' DSv4 2604B `else` chain, after `if _is_cuda or _is_hip or _is_xpu` and before `elif _has_vllm_ops`. MUSA uses the explicit `clamp_` (fusion=False) path; fusion=True + MUSA is excluded by the upstream `_is_cuda or _is_hip` assert
**main:9ffc0cc6 (merged with 8066dd09): centralized post-experts all-reduce + dp_reduce_scatterv bugfix; act_and_mul_triton kept for DSv4 swiglu_limit (needs review @ByronHsu)**

- `deepseek_v2.py` `DeepseekV2MoE.forward_normal_dual_stream` + `forward` (2 symmetric sites): adopt main refactor(moe): centralize post-experts all-reduce skip predicate #23748 centralized helper `should_skip_post_experts_all_reduce(is_tp_path=True, ...)` in place of ours' explicit 3-condition `not` chain; this is a bugfix — main's helper includes a 4th skip case `should_use_dp_reduce_scatterv()` (Fix Qwen3 MoE double-reduce when DP attention + EP + reduce_scatterv (#23729) #23731 / Apply should_use_dp_reduce_scatterv guard to remaining MoE models (follow-up to #23731) #23732 fix for the Qwen3 double-reduce; ours' deepseek_v2 was missing it), now aligned with sibling MoE models. Graft ours' `DeepSeekMxfp4MoEMethod + SGLANG_OPT_MXFP4_FUSE_RSF_SHARED_ADD` shared_output `add_` fusion block before the helper call
- `moe_runner/triton_utils/fused_moe_triton_kernels.py`: keep ours' `act_and_mul_triton` kernel + `SWIGLU_LIMIT` / `HAS_SWIGLU_LIMIT` compile-time consts. Reject main [MoE] Deprecate act_and_mul_triton; fold filter_expert into JIT silu/gelu_and_mul #23707 deprecation — the DSv4 2604B fused clamp + filter_expert path still depends on it (HF config `swiglu_limit` → `MoeRunnerConfig.swiglu_limit` → `act_and_mul_triton(swiglu_limit=...)`); main's `expert_ids` / `expert_step` kwargs on JIT `silu_and_mul` do not accept swiglu_limit, so there is no drop-in replacement
- `moe_runner/triton_utils/fused_moe.py`: keep ours — HEAD is empty in the conflict region (ours' DSv4 `else` branch already absorbed `elif _is_cuda or _is_hip or _is_xpu` + `elif _is_musa`, just drop the markers). Re-add `act_and_mul_triton` to the import block (main removed it in [MoE] Deprecate act_and_mul_triton; fold filter_expert into JIT silu/gelu_and_mul #23707)
- `moe_runner/deep_gemm.py`: adopt main [JIT Kernel] Reland JIT activation #22094 import-source switch to `sglang.jit_kernel.activation` + graft ours' `as _legacy_silu_and_mul` rename (used for the `SGLANG_OPT_FIX_MEGA_MOE_MEMORY=False` fallback path)
- `configs/model_config.py`: trivial concat — main adds an `HYV3ForCausalLM` draft branch in `_derive_special_models`, ours adds a new `_maybe_auto_set_dsv4_fp4_experts` method (probes the routed-expert safetensors dtype to auto-set `SGLANG_DSV4_FP4_EXPERTS`)
**main:ea794dee (merged with 958f3931): final batch — bumped target past the PR body's 3066ba8 to latest main; spec_hidden_size attribute + draft-kv-pool helper + Aiter RMSNorm layout**

- `configs/model_config.py` `spec_hidden_size` calc: keep ours' DSv4-only env-gated form (`hc_mult > 1 and SGLANG_FIX_MTP_HC_HIDDEN and SGLANG_DSV4_MODE=="2604"`) over main [spec decoding] add extra attribute 'spec_hidden_size' #23890's unconditional `hc_mult > 1` (Mode 4: an explicit kill switch is safer for the fork). `is_hybrid_swa_model` arch list: trivial concat — main adds `MiMoV2ForCausalLM` next to ours' `DeepseekV4ForCausalLM` / `DeepseekV4ForCausalLMNextN`
- `layers/layernorm.py` `RMSNorm.forward_aiter`: concat — ours' HIP empty-batch early return (`x.shape[0] == 0`) stays as the fast path; graft main [AMD] Fix Aiter RMSNorm layout handling #23974 layout normalization (`needs_reshape = x.dim() != 2 and residual is None` → `contiguous().reshape(-1, last_dim)`; non-contiguous → `contiguous()`) for non-2D Q/K slices
- `managers/scheduler.py` `init_disaggregation`: adopt main [HiCache] feat: add draft KV cache backing for L2/L3 #21125 helper `self._get_draft_kv_pool() -> (token_pool_or_None, model_config_or_None)` in place of ours' inline if/elif/else; graft ours' DSv4 PD-spec invariant (fix UnboundLocalError on model_config in init_disaggregation #23959) — when the helper returns `model_config=None` (no draft worker), default to `self.model_config` so the MetadataBuffers branches downstream always have a non-None config
- `speculative/eagle_worker.py` idle-batch `hidden_size`: adopt main [spec decoding] add extra attribute 'spec_hidden_size' #23890's compact ternary form — `hidden*3 if eagle3 and aux else spec_hidden_size`. Semantically equivalent to ours' if-statement form; main's is more concise and aligned with sibling sites
**Post-rebase follow-up fixes**

**9a18d32a restore docs/ and docs_new/ to origin/main**
- `git checkout HEAD -- docs/` (used to bypass the fork's `reject changes under legacy docs/` pre-commit hook) inadvertently rolled back ALL of main's docs/ + docs_new/ updates accumulated during the rebase, leaving 23 files showing as fork-side deletions (-2186 / +69 net). The pre-rebase fork (origin/deepseek_v4) never modified either tree, so a one-shot `git checkout origin/main -- docs/ docs_new/` restore is fully safe. Pushed via `--no-verify` as rebase-artifact cleanup
**382dd420 fix dangling `use_dpsk_v32_encoding` ref**

- `entrypoints/openai/serving_chat.py:537`: `self.use_dpsk_v32_encoding` → `self.chat_encoding_spec == "dsv32"`. `__init__` had already been refactored to set `self.chat_encoding_spec ∈ {"dsv4", "dsv32", None}` via `_resolve_chat_encoding_spec`, but this single callsite was a stale reference. Cold-start + import don't trip it; it only fires on chat completion → `_apply_jinja_template` → `AttributeError`
**53e4ee30 disable piecewise cuda graph for DSv4 archs**

- `configs/model_config.py` `piecewise_cuda_graph_disabled_model_archs`: add `DeepseekV4ForCausalLM` and `DeepseekV4ForCausalLMNextN` next to the existing `DeepseekV32ForCausalLM`. The pre-rebase fork had no piecewise CG concept (default-disabled implicitly via DP-attn / HIP fallback paths); main turned piecewise CG on by default. `run_flash_tp8.sh` (pure TP8, no DP-attn) was the first recipe to expose the missing arch entry — the DSv4 compressed-attn path goes through dynamo and trips on the `_patched_getfile` skip
**6348cb50 fix retract — disable `SGLANG_OPT_SWA_RADIX_CACHE_COMPACT` default (needs review @ispobock)**

- `environ.py`: flip the `SGLANG_OPT_SWA_RADIX_CACHE_COMPACT` default `True → False` with TODO(DSV4) @ispobock. The fork-only `_compact_single_child_chain` in `swa_radix_cache.py` removes `child` from `swa_lru_list` / `full_lru_list` via `remove_node()` when merging into its parent, but does NOT decrement `swa_evictable_size_` / `full_evictable_size_` (sketched below). Combined with main [RadixTree][6/N Refactor]: Refactor SWARadixTree to simplify the computation and alignment of bigram keys. #19427's stable `old_prefix_len = req.cache_protected_len` + retract pressure, pool slot accounting drifts (avail + evictable > total) and the runtime checker's `on_idle` leak detector trips. Manifests as `ValueError: pool memory leak detected!` across all DP ranks and even `#swa token: -1280, swa token usage: -0.01` negative counts mid-decode. Default-off is the conservative fix; @ispobock to audit / re-enable after fixing the size accounting in compact
- `test/registered/4-gpu-models/test_dsv4_swa_radix_retract.py`: stress test that forces deterministic retract via `SGLANG_TEST_RETRACT=1` + `SGLANG_TEST_RETRACT_INTERVAL=3`; 64 concurrent long-prompt reqs sharing a 30k+ token prefix; gates on scheduler liveness only. Currently passes with `SGLANG_OPT_SWA_RADIX_CACHE_COMPACT=0` set in env (matches the new default); will trip `pool memory leak detected!` if compact is re-enabled
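A toy reconstruction of the accounting drift, with the radix tree reduced to the two counters that matter. Everything here is illustrative; the real fix belongs in `_compact_single_child_chain`'s size bookkeeping:

```python
class SWARadixAccounting:
    """Toy model of the counter drift behind 'pool memory leak detected!'."""

    def __init__(self, total: int):
        self.total = total
        self.available = total
        self.swa_evictable_size_ = 0   # tokens held by evictable tree nodes

    def cache(self, n: int) -> None:
        self.available -= n
        self.swa_evictable_size_ += n

    def compact_single_child_chain(self, child_tokens: int) -> None:
        # Child is merged into its parent and removed from the LRU lists
        # via remove_node(). The parent's span now covers child_tokens, so
        # they are counted again when the parent is accounted; the missing
        # line is a matching `self.swa_evictable_size_ -= child_tokens`.
        self.swa_evictable_size_ += child_tokens

    def evict_all(self) -> None:
        # Eviction walks real nodes: it can only reclaim tokens that
        # exist, so the phantom count from compact survives.
        reclaimed = self.total - self.available
        self.available += reclaimed
        self.swa_evictable_size_ -= reclaimed

    def on_idle_check(self) -> None:
        if self.available + self.swa_evictable_size_ > self.total:
            raise ValueError("pool memory leak detected!")


pool = SWARadixAccounting(total=1024)
pool.cache(256)
pool.compact_single_child_chain(256)  # evictable = 512, but only 256 real
pool.evict_all()                      # available = 1024, evictable = 256
pool.on_idle_check()                  # 1024 + 256 > 1024 -> raises
```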