Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
60 commits
Select commit Hold shift + click to select a range
8c76d49
feat(disagg): add layer-pipelined KV transfer for PD disaggregation
Apr 22, 2026
7e9921d
style: apply black formatting to layer-pipelined KV transfer code
Apr 22, 2026
35986c3
feat(disagg): register pipelined KV transfer env vars in environ.py
Apr 22, 2026
b8d3b26
feat(disagg): adaptive pipeline group_size based on prompt length
Apr 23, 2026
ce6e669
feat(disagg): support different TP in layer-pipelined KV transfer
Apr 23, 2026
15da6d1
fix(disagg): disable layer-pipelined mode for Mamba/SWA/NSA models
Apr 23, 2026
144c7f5
feat(disagg): support Mamba/SWA/NSA state in layer-pipelined mode
Apr 23, 2026
02558c3
feat(disagg): add multimodal fallback guard for layer-pipelined mode
May 6, 2026
215ef2d
Update python/sglang/srt/disaggregation/mooncake/conn.py
michael7193 May 6, 2026
1069bb8
feat(disagg): add forward_split_prefill for Qwen3.5 layer-pipelined mode
UNIDY2002 May 6, 2026
10c4d08
refactor(disagg): refine multimodal guard to allow VL models with for…
May 6, 2026
4f0a327
fix(models): add missing import and fix formatting in qwen3_5.py
May 6, 2026
ff8e1a7
test(disagg): add CI tests for layer-pipelined KV transfer
May 6, 2026
346c1b0
feat(disagg): universal guard + FalconH1 forward_split_prefill
May 7, 2026
79d504e
fix(disagg): align pipelined result handler with normal path
May 7, 2026
4a4f0eb
fix(falcon_h1): import LogitsProcessorOutput to fix ruff F821 lint error
May 7, 2026
955d100
style: fix formatting in falcon_h1.py
michael7193 May 9, 2026
e47eed2
fix(disagg): correct import path for kv_to_page_indices in scheduler.py
michael7193 May 10, 2026
823a013
fix(disagg): use BaseSWAKVPool for isinstance check to fix F821 lint
michael7193 May 12, 2026
184404c
style: use black instead of ruff for falcon_h1.py formatting
michael7193 May 12, 2026
923f7ea
fix(disagg): add missing imports for HybridLinearKVPool, BaseSWAKVPoo…
michael7193 May 12, 2026
2e8424a
docs: add MOONCAKE_DEVICE isolation note for pipelined + hicache
michael7193 May 12, 2026
383b95d
ci: re-trigger CI (flaky H20/H200/AMD/NPU infra failures)
michael7193 May 13, 2026
cffad0c
fix: adapt split_init to upstream ModelWorkerBatch removal (#25516)
michael7193 May 18, 2026
3a792e8
style: format single-arg call to one line (black)
michael7193 May 18, 2026
a2a719e
refactor: address review feedback on layer-pipelined KV transfer
michael7193 May 18, 2026
10ee204
fix(ci): use correct register_cuda_ci params for pipelined test
michael7193 May 18, 2026
3c6582d
merge: resolve scheduler.py conflict with upstream component refactor
michael7193 May 19, 2026
8a20fb4
fix: remove unused time import in kv_pacing.py
michael7193 May 19, 2026
f16464e
Merge origin/main: resolve conflict in mooncake/conn.py
michael7193 May 19, 2026
d0008bb
Address ShangmingCai review comments
michael7193 May 20, 2026
af99f46
Fix lint: remove unused imports, fix line length
michael7193 May 20, 2026
1f6def3
cleanup: remove stale hisparse code and unrelated formatting changes
michael7193 May 21, 2026
5e86a71
fix: correct chunked-prefill guard and add transfer metrics recording
michael7193 May 21, 2026
90d372d
fix: route pipelined path through process_batch_result for full side …
michael7193 May 21, 2026
f7bf65c
fix: disable overlap schedule when pipelining enabled + add sanity ch…
michael7193 May 21, 2026
9408ded
Resolve merge conflict: move layer_id/cuda_event to common TransferKV…
michael7193 May 24, 2026
9b4cc8c
Fix AttributeError: use profiler_manager._profile_batch_predicate
michael7193 May 24, 2026
f76f39a
Remove undefined set_prefill_run_batch_start_time call
michael7193 May 24, 2026
6183bf7
Fix StateType.NSA -> StateType.DSA in _prepare_pipelined_state_indices
michael7193 May 25, 2026
2bc9e69
feat(pipelined): replace step-function with continuous adaptive formula
michael7193 May 25, 2026
13dd540
docs: improve pipeline formula comments with precise pipeline model
michael7193 May 25, 2026
2a9ec22
fix(pipelined): swap min/max iters if user misconfigures
michael7193 May 25, 2026
397d8ef
fix: remove hardcoded default group_size=10 and fix NSA→DSA comment
michael7193 May 26, 2026
f06bf37
fix(pipelined): set forward_context in forward_split_prefill
michael7193 May 26, 2026
8eaa3df
docs: clarify GROUP_SIZE default and fix formula comment in environ.py
michael7193 May 27, 2026
05e5826
docs: move env var docs from docs/ to docs_new/ (fix lint)
michael7193 May 27, 2026
375ae8d
fix(pipelined): harden _get_pipeline_group_size for edge cases
michael7193 May 27, 2026
34225bc
fix(pipelined): add guards for DP attention, EAGLE, and input_embeds
michael7193 May 27, 2026
da1c019
fix(pipelined): add EPLB guard to prevent lost routing statistics
michael7193 May 27, 2026
e235b92
[Disagg] Add empty batch guard in _get_pipeline_group_size
michael7193 Jun 2, 2026
de425de
fix(pipelined): add transfer early-abort and configurable saturation …
michael7193 Jun 3, 2026
1286256
fix(pipelined): guard against division-by-zero when SAT_MULTIPLIER <=…
michael7193 Jun 3, 2026
4e012a6
fix(pipelined): handle transfer fallback edge cases
michael7193 Jun 3, 2026
090ebc7
Merge origin/main into pipelined KV transfer PR
michael7193 Jun 3, 2026
6707f10
fix(pipelined): finalize metadata after layer transfers
michael7193 Jun 4, 2026
5f1dee2
fix(pipelined): materialize inputs before split prefill
michael7193 Jun 4, 2026
0d9b628
fix(pipelined): align edge-case semantics
michael7193 Jun 5, 2026
5cbeca7
fix(pipelined): use split-prefill capability guard
michael7193 Jun 5, 2026
c59d819
fix(pipelined): address split prefill review notes
michael7193 Jun 5, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 30 additions & 0 deletions docs_new/docs/references/environment_variables.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -673,6 +673,36 @@ SGLang supports various environment variables that can be used to configure its
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Force using PyTorch gather/scatter fallback instead of Triton fused kernels for staging operations. Useful for debugging.</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}><code>false</code></td>
</tr>
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><code>SGLANG_ENABLE_PIPELINED_KV_TRANSFER</code></td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Enable layer-pipelined KV transfer. Splits prefill into layer groups and transfers KV cache incrementally after each group, overlapping RDMA transfer with GPU compute. Only effective in PD disaggregation prefill mode. When enabled together with overlap schedule, overlap is automatically disabled (pipelined subsumes its CPU sync savings). When using with <code>--hicache-storage-backend mooncake</code>, set <code>MOONCAKE_DEVICE</code> to a separate IB port (e.g. <code>mlx5_5</code>) to avoid RDMA resource conflicts.</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}><code>false</code></td>
</tr>
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><code>SGLANG_PIPELINE_GROUP_SIZE</code></td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>(Optional) Override adaptive formula with a fixed number of layers per pipeline group. When not set, group_size is computed automatically based on prompt length.</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Not set (adaptive)</td>
</tr>
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><code>SGLANG_PIPELINE_MIN_TOKENS</code></td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Minimum average input token length to activate pipelined mode. Batches below this threshold use the normal (non-pipelined) path to avoid overhead on short prompts.</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}><code>3072</code></td>
</tr>
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><code>SGLANG_PIPELINE_SAT_MULTIPLIER</code></td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Multiplier for MIN_TOKENS that defines the prompt-length saturation point in the adaptive pipeline group-size formula.</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}><code>3.0</code></td>
</tr>
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><code>SGLANG_PIPELINE_MAX_ITERS</code></td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Maximum pipeline iterations (used for shortest eligible prompts). The adaptive formula linearly interpolates between MAX_ITERS and MIN_ITERS based on prompt length. Higher values mean smaller groups and more transfer overlap.</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}><code>10</code></td>
</tr>
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><code>SGLANG_PIPELINE_MIN_ITERS</code></td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Minimum pipeline iterations (used for longest prompts at saturation point = SAT_MULTIPLIER × MIN_TOKENS). Lower bound ensures large groups don't add excessive per-group overhead.</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}><code>4</code></td>
</tr>
</tbody>
</table>

Expand Down
27 changes: 27 additions & 0 deletions python/sglang/srt/disaggregation/base/conn.py
Original file line number Diff line number Diff line change
Expand Up @@ -124,6 +124,33 @@ def send(
"""
...

def send_layer(
self,
kv_indices: npt.NDArray[np.int32],
layer_id: int,
cuda_event: object,
is_last: bool = False,
state_indices: Optional[List[int]] = None,
):
"""Send a single layer's KV cache for layer-pipelined transfer.

Backends that support layer-pipelined KV transfer should override
this method. The default raises NotImplementedError so that
unsupported backends fail loudly if pipelining is misconfigured.
"""
raise NotImplementedError(
f"{type(self).__name__} does not support layer-pipelined KV transfer"
)

def send_final_metadata(
self,
state_indices: Optional[List[int]] = None,
):
"""Send final metadata after layer-pipelined KV transfer."""
raise NotImplementedError(
f"{type(self).__name__} does not support layer-pipelined KV transfer"
)

def pop_decode_prefix_len(self) -> int:
return 0

Expand Down
40 changes: 40 additions & 0 deletions python/sglang/srt/disaggregation/common/conn.py
Original file line number Diff line number Diff line change
Expand Up @@ -811,6 +811,46 @@ def _prepare_send_indices(

return kv_indices, index_slice, is_last_chunk, False

def _prepare_layer_send_indices(
self,
kv_indices: npt.NDArray[np.int32],
) -> Tuple[npt.NDArray[np.int32], slice, bool]:
"""Common pre-processing for per-layer sends.

Layer-pipelined transfer sends the same page slice once per layer, so it
must not advance curr_idx for every layer. It still needs the same CP
rank filtering as send(). Final status is sent separately after metadata
is written.
"""
num_indices = len(kv_indices)
index_slice = slice(self.curr_idx, self.curr_idx + num_indices)

if self.kv_mgr.enable_all_cp_ranks_for_transfer:
cache_key = (self.curr_idx, num_indices)
cached_key = getattr(self, "_layer_send_cache_key", None)
if cached_key == cache_key:
kv_indices, index_slice = self._layer_send_cache
else:
kv_indices, index_slice = filter_kv_indices_for_cp_rank(
self.kv_mgr,
kv_indices,
index_slice,
)
self._layer_send_cache_key = cache_key
self._layer_send_cache = (kv_indices, index_slice)
elif self.kv_mgr.is_dummy_cp_rank:
return kv_indices, index_slice, True

return kv_indices, index_slice, len(kv_indices) == 0

def _prepare_final_metadata_send(self) -> Tuple[slice, bool]:
index_slice = slice(self.curr_idx, self.curr_idx)
self.curr_idx = self.num_kv_indices
if self.kv_mgr.is_dummy_cp_rank:
self.kv_mgr.update_status(self.bootstrap_room, KVPoll.Success)
return index_slice, True
return index_slice, False

def send(
self,
kv_indices: npt.NDArray[np.int32],
Expand Down
2 changes: 2 additions & 0 deletions python/sglang/srt/disaggregation/common/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,8 @@ class TransferKVChunk:
prefill_aux_index: Optional[int]
state_indices: Optional[List]
chunk_id: Optional[int] = None
layer_id: Optional[int] = None
cuda_event: Optional[object] = None
trace_ctx: Union[TraceReqContext, TraceNullContext] = dataclasses.field(
default_factory=TraceNullContext
)
Expand Down
20 changes: 20 additions & 0 deletions python/sglang/srt/disaggregation/fake/conn.py
Original file line number Diff line number Diff line change
Expand Up @@ -63,6 +63,9 @@ def poll(self) -> KVPoll:
def get_transfer_metric(self) -> KVTransferMetric:
return KVTransferMetric()

def should_send_kv_chunk(self, num_pages: int, last_chunk: bool) -> bool:
return num_pages > 0 or last_chunk

def init(
self,
kv_indices: list[int],
Expand All @@ -83,6 +86,23 @@ def send(
f"FakeKVSender send with kv_indices: {kv_indices}, state_indices: {state_indices}"
)

def send_layer(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to implement this for the common backend (pass) as well, and might need to check if it is mooncake backend in server args for this feature

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe also add a send_layer with NotImplementedError for the base backend

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a2a719e:

  1. Added send_layer() with NotImplementedError to BaseKVSender so unsupported backends fail loudly instead of silently breaking.
  2. Added a backend guard in _get_pipeline_group_size — pipelining now only activates for MOONCAKE and FAKE backends. Other backends (nixl, mori, ascend) safely fall back to the normal transfer path.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — BaseKVSender.send_layer() now raises NotImplementedError by default. See a2a719e.

self,
kv_indices: npt.NDArray[np.int32],
layer_id: int = 0,
cuda_event=None,
is_last: bool = False,
state_indices=None,
):
"""Per-layer KV send stub for warmup."""
logger.debug(f"FakeKVSender send_layer layer_id={layer_id} is_last={is_last}")

def send_final_metadata(self, state_indices=None):
self.has_sent = True
logger.debug(
f"FakeKVSender send_final_metadata with state_indices: {state_indices}"
)

def failure_exception(self):
raise Exception("Fake KVSender Exception")

Expand Down
Loading
Loading