[Disagg] Layer-pipelined KV transfer: overlap RDMA with GPU compute by michael7193 · Pull Request #23515 · sgl-project/sglang

michael7193 · 2026-04-23T02:23:33Z

Motivation

In PD disaggregation mode, KV cache transfer happens after full prefill computation completes. For long prompts (≥1K tokens), this creates a significant TTFT bottleneck — the decode side must wait for all layers to be computed and then transferred sequentially.

This PR implements layer-pipelined KV transfer: instead of computing all layers then transferring all KV at once, we split layers into groups and transfer each group incrementally. Transfer of group N overlaps with GPU compute of group N+1, significantly reducing TTFT.

Related: #19931 (same direction, different approach)

Key Results

Benchmark environment: SGLang v0.4.10.post2, torch 2.7.1+cu126, Qwen2.5-72B-Instruct, 2×8 H20 GPUs (TP=4 each), 4×400G IB (RDMA), PD + Mooncake backend.
The pipelined KV transfer logic is algorithmically identical between v0.4.10.post2 and this PR — differences are limited to upstream API adaptation (environ.py registration, PP-aware pointers, EAGLE/staging guards). See "Code equivalence" section below.

TTFT (ms) — Prompt Length Sweep (C=32, output=256)

| Prompt (tokens) | Baseline | Pipelined | Δ% |
| --- | --- | --- | --- |
| 256 | 229 | 269 | +18% (normal path, below threshold) |
| 1024 | 696 | 274 | -61% |
| 4096 | 1192 | 378 | -68% |
| 8192 | 1234 | 637 | -48% |
| 16384 | 1070 | 922 | -14% |

TTFT p95 (ms)

| Prompt (tokens) | Baseline | Pipelined | Δ% |
| --- | --- | --- | --- |
| 1024 | 3378 | 589 | -83% |
| 4096 | 5553 | 859 | -85% |
| 8192 | 5344 | 1832 | -66% |
| 16384 | 6164 | 3029 | -51% |

Throughput (output tok/s)

| Prompt (tokens) | Baseline | Pipelined | Δ% |
| --- | --- | --- | --- |
| 1024 | 888 | 923 | +4% |
| 4096 | 723 | 858 | +19% |
| 16384 | 136 | 636 | +367% |

Multi-turn Dialogue (16 sessions × 10 turns)

| Metric | Baseline | Pipelined | Δ |
| --- | --- | --- | --- |
| Completed | 160 | 160 | same |
| Throughput | 485 t/s | 481 t/s | -1% |
| TTFT avg | 311 ms | 314 ms | +1% |

Extreme Stress (C=64, prompt=4096, output=1024)

| Metric | Baseline | Pipelined | Δ |
| --- | --- | --- | --- |
| Completed reqs | 128 | 128 | same |
| TPOT avg | 46.1 ms | 46.0 ms | 0% |
| Throughput | 1352 t/s | 1347 t/s | 0% |

Design

The feature is controlled by environment variables (registered in `environ.py`), disabled by default:

`SGLANG_PIPELINED_KV_TRANSFER=true`: enable the feature
`SGLANG_PIPELINE_MIN_TOKENS=3072`: token threshold; prompts below it use the normal path
`SGLANG_PIPELINE_MAX_ITERS=10`: maximum number of pipeline stages, used for the shortest eligible prompts (more overlap)
`SGLANG_PIPELINE_MIN_ITERS=4`: minimum number of pipeline stages, used for the longest prompts (less per-group overhead)
`SGLANG_PIPELINE_GROUP_SIZE=10`: (optional) override the adaptive formula with a fixed group size
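
As a rough illustration of how these settings fit together, the sketch below reads the five variables with the defaults named in this PR. It uses plain `os.environ` rather than the `envs` registry in `environ.py`. The `PipelineConfig` dataclass, the `load_pipeline_config` helper, and the boolean parsing rule are all hypothetical and not part of SGLang.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineConfig:
    enabled: bool
    min_tokens: int
    max_iters: int
    min_iters: int
    group_size: int  # 0 means "use the adaptive formula"


def load_pipeline_config(env: Optional[Mapping[str, str]] = None) -> PipelineConfig:
    """Read the pipelined-transfer settings from an environment mapping."""
    env = os.environ if env is None else env
    # Assumption: "1" and "true" (any case) count as enabled.
    flag = env.get("SGLANG_PIPELINED_KV_TRANSFER", "false").strip().lower()
    return PipelineConfig(
        enabled=flag in ("1", "true"),
        min_tokens=int(env.get("SGLANG_PIPELINE_MIN_TOKENS", "3072")),
        max_iters=int(env.get("SGLANG_PIPELINE_MAX_ITERS", "10")),
        min_iters=int(env.get("SGLANG_PIPELINE_MIN_ITERS", "4")),
        group_size=int(env.get("SGLANG_PIPELINE_GROUP_SIZE", "0")),
    )
```

Passing an explicit mapping instead of relying on the process environment keeps the helper easy to exercise in tests.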

How it works

`_get_pipeline_group_size(batch)`: per-batch decision that returns an adaptive group_size (>0), or 0 to skip the pipeline. A universal guard ensures that models without `forward_split_prefill` fall back safely to the normal path. Short prompts also fall back with zero overhead.
`run_batch_pipelined(batch, group_size)` in `Scheduler` — splits forward into layer groups using `model_runner.forward_split_prefill()`, enqueues per-layer KV transfer via `send_layer()` after each group. CUDA events synchronize GPU→transfer ordering. Pre-computes state indices for hybrid models via `_prepare_pipelined_state_indices()`.
`process_batch_result_pipelined_prefill()` — result handler that dispatches to `run_batch_pipelined` instead of `run_batch`, then follows the same downstream logic (including EAGLE spec_info propagation, staging sync, A2A MoE finalization).
`MooncakeKVManager.send_kvcache_layer()` — single-layer RDMA transfer supporting both MHA and MLA architectures via `get_mha_kv_ptrs_with_pp` / `get_mla_kv_ptrs_with_pp`.
`TpModelWorker.forward_batch_generation_split_{init,layer,sample}()` — three-phase split forward: init attention backend → run N layers per call → sample after last group.

Call chain

```
event_loop_normal_disagg_prefill
→ _get_pipeline_group_size(batch)
→ >0: run_batch_pipelined → split_init → [split_layer + send_layer] × N → split_sample
→ 0: run_batch (unchanged)
```
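
The shape of this loop can be imitated without a GPU. The following is only a simulation under stated assumptions: a `threading.Event` stands in for the CUDA event, a background thread stands in for the transfer worker, and `run_batch_pipelined_sketch` is a hypothetical name. It shows the ordering property only (a layer is sent after the group containing it has been computed), not the real scheduler code.

```python
import queue
import threading
from typing import List, Tuple


def run_batch_pipelined_sketch(num_layers: int, group_size: int) -> List[Tuple[str, int]]:
    """Simulate the grouped forward loop and return the ordered event log."""
    log: List[Tuple[str, int]] = []
    log_lock = threading.Lock()
    transfers: queue.Queue = queue.Queue()

    def transfer_worker() -> None:
        # Stand-in for the background transfer thread.
        while True:
            item = transfers.get()
            if item is None:  # sentinel: no more work
                return
            layer_id, ready = item
            ready.wait()  # analogous to waiting on the recorded CUDA event
            with log_lock:
                log.append(("send", layer_id))

    worker = threading.Thread(target=transfer_worker)
    worker.start()

    for start in range(0, num_layers, group_size):
        end = min(start + group_size, num_layers)
        with log_lock:
            log.append(("compute", end))  # "forward layers [start, end)"
        ready = threading.Event()
        ready.set()  # this group's compute has finished
        for layer_id in range(start, end):
            transfers.put((layer_id, ready))  # per-layer enqueue after each group

    transfers.put(None)
    worker.join()
    return log
```

In this toy version the sends for group N can run on the worker thread while the main thread moves on to group N+1, which is the overlap the call chain describes.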

Event loop compatibility

Pipelined mode integrates with event_loop_normal_disagg_prefill only. Combining it with overlap mode (event_loop_overlap_disagg_prefill) provides no additional benefit, for two reasons. First, pipelined mode already eliminates the .cpu() index sync that overlap mode defers: it pre-computes indices before the forward loop, and process_batch_result with pipelined=True skips send_kv_chunk entirely. Second, the per-group indices must be available during the forward pass, so they cannot be deferred to the next iteration.

Adaptive group_size (E1)

Instead of a fixed SGLANG_PIPELINE_GROUP_SIZE, group_size is automatically computed via a continuous formula that adapts to both prompt length and model depth:

sat_tokens = MIN_TOKENS × 3
t = clamp((avg_tokens - MIN_TOKENS) / (sat_tokens - MIN_TOKENS), 0, 1)
target_iters = clamp(MAX_ITERS - t × (MAX_ITERS - MIN_ITERS), MIN_ITERS, MAX_ITERS)
group_size = max(1, num_layers // target_iters)
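
A direct transcription of the formula into Python might look like the sketch below. The function name and signature are hypothetical, truncating `target_iters` to an integer before the division is an assumption of this sketch, and `min_tokens` is assumed to be positive.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def adaptive_group_size(
    avg_tokens: float,
    num_layers: int,
    min_tokens: int = 3072,
    max_iters: int = 10,
    min_iters: int = 4,
) -> int:
    """Layers per pipeline group for a batch with the given average prompt length."""
    sat_tokens = min_tokens * 3
    # t goes from 0 at the threshold to 1 at the saturation point.
    t = clamp((avg_tokens - min_tokens) / (sat_tokens - min_tokens), 0.0, 1.0)
    target_iters = clamp(max_iters - t * (max_iters - min_iters), min_iters, max_iters)
    # Assumption: the stage count is truncated to an integer.
    return max(1, num_layers // int(target_iters))
```

With the defaults, an 80-layer model gets 8 layers per group at the 3072-token threshold and 20 layers per group once the average prompt reaches 9216 tokens.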

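Why prompt length matters: with C the total compute time, T the total transfer time, and N the number of groups, the pipelined total time depends on which of the two is the bottleneck: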

Good bandwidth (T<C): total = C + T/N — last group's transfer is exposed
Poor bandwidth (T>C): total = C/N + T — first group's compute is exposed

Short prompts have a higher T/C ratio (attention compute is O(n²) but transfer is O(n)), so more groups (larger N) are needed to reduce the exposed T/N or C/N term. For long prompts compute dominates (T≪C), so even a few groups hide most of the transfer.
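
The two cases above can be checked numerically with a small model. This is an idealized sketch: it assumes perfect overlap and equal-sized groups, ignores per-group dispatch overhead, and uses hypothetical function names and arbitrary time units.

```python
def sequential_total(compute: float, transfer: float) -> float:
    """Baseline: transfer starts only after all layers are computed."""
    return compute + transfer


def pipelined_total(compute: float, transfer: float, num_groups: int) -> float:
    """Idealized total when transfer of group i overlaps compute of group i+1."""
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    if transfer <= compute:
        # Compute-bound: only the last group's transfer is exposed.
        return compute + transfer / num_groups
    # Transfer-bound: only the first group's compute is exposed.
    return compute / num_groups + transfer


if __name__ == "__main__":
    for n in (1, 4, 10):
        print(n, pipelined_total(100.0, 40.0, n))
```

With a single group the model reduces to the sequential baseline, and each additional group shrinks only the exposed term, which is why the benefit of a larger N flattens out.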

Why model depth is handled automatically — target_iters (number of pipeline stages) determines the overlap fraction (N-1)/N, which depends on T/C ratio, not on total layer count. The num_layers // target_iters division naturally produces appropriate group sizes for any depth (e.g., 80-layer → 8 layers/group, 32-layer → 3 layers/group at target_iters=10).

Configurable env vars (two new, both optional):

| Env Var | Default | Effect |
| --- | --- | --- |
| `SGLANG_PIPELINE_MAX_ITERS` | 10 | Iterations for shortest eligible prompts (more groups = less exposed time) |
| `SGLANG_PIPELINE_MIN_ITERS` | 4 | Floor for longest prompts (fewer groups = less per-group overhead) |

Tuning guide: Poor bandwidth → increase MAX_ITERS (e.g., 12) so network starts sooner; excellent bandwidth → decrease MIN_ITERS (e.g., 3) to reduce dispatch overhead.

Users can still override this with a fixed value via the SGLANG_PIPELINE_GROUP_SIZE env var (backward compatible).

Different TP support (E2)

`send_kvcache_layer()` supports MHA head slicing when prefill TP ≠ decode TP, using vectorized numpy addressing (same math as `send_kvcache_slice`). MLA is TP-invariant and needs no slicing.
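
To give a feel for the addressing involved, the sketch below computes one (address, length) region per token for a contiguous range of heads. It makes several assumptions that are not confirmed by this PR: a token-major layout within one layer's buffer, contiguous heads inside each token, and a hypothetical `head_slice_regions` helper. It also uses a plain list comprehension for readability, whereas the PR describes vectorized numpy addressing.

```python
from typing import List, Sequence, Tuple


def head_slice_regions(
    base_addr: int,
    token_indices: Sequence[int],
    heads_per_token: int,
    head_bytes: int,
    head_start: int,
    num_heads: int,
) -> List[Tuple[int, int]]:
    """Return (address, length) pairs covering heads
    [head_start, head_start + num_heads) of each listed token."""
    if head_start < 0 or num_heads < 1 or head_start + num_heads > heads_per_token:
        raise ValueError("head slice out of range")
    token_stride = heads_per_token * head_bytes  # bytes per token in this layer
    offset = head_start * head_bytes             # skip the heads before the slice
    length = num_heads * head_bytes              # bytes actually sent per token
    return [(base_addr + t * token_stride + offset, length) for t in token_indices]
```

Under these assumptions, a prefill rank holding 8 heads would send heads 4 to 7 of each token to a decode rank that owns only that half.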

Mamba/SWA/NSA state support (E4)

Hybrid models (Jamba, FalconH1, DeepSeek-R1 with SWA) are fully supported. `_prepare_pipelined_state_indices()` pre-computes state indices before the layer loop, then passes them through `send_layer(state_indices=...)` on the last layer to trigger `maybe_send_extra()`. This covers:

HybridLinearKVPool (Mamba SSM): `req_index_to_mamba_index_mapping`
SWAKVPool (Sliding Window): windowed page indices via `translate_loc_from_full_to_swa`
NSATokenToKVPool (Native Sparse Attention): full sequence page indices

No decode-side changes needed — decode already waits for all data (KV + state) before starting.

Universal guard + FalconH1 support (E7)

A universal `hasattr(model, "forward_split_prefill")` guard replaces the previous multimodal-only guard. This ensures:

Models without `forward_split_prefill` (e.g. DeepSeek-V2/V3) fall back safely, with no crash
Models with `forward_split_prefill` (LLaMA, Qwen, Gemma, FalconH1, etc.) use pipelined path
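
A capability check of this kind could be sketched as follows. The function name and keyword arguments are hypothetical. The three extra flags mirror the auto-disable conditions listed in the checklist (DP attention, EPLB, `input_embeds`), and the real guard in `_get_pipeline_group_size` may check more than this.

```python
def can_use_pipelined_transfer(
    model: object,
    *,
    enable_dp_attention: bool = False,
    enable_eplb: bool = False,
    has_input_embeds: bool = False,
) -> bool:
    """Return True only if the pipelined path is safe for this model and batch."""
    # Universal guard: the model must expose the split-prefill entry point.
    if not hasattr(model, "forward_split_prefill"):
        return False
    # Features the split forward path bypasses, so fall back to the normal path.
    if enable_dp_attention or enable_eplb or has_input_embeds:
        return False
    return True
```

Because the check is a plain attribute lookup, a model that gains `forward_split_prefill` later becomes eligible without any change to the guard.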

FalconH1 (Mamba hybrid) now has `forward_split_prefill`, enabling layer-pipelined transfer. Each layer's attention produces KV cache (transferred per-layer via pipeline), while SSM state is sent once at the end via `maybe_send_extra()` (SSM state is fixed-size, independent of sequence length — no benefit from per-layer pipelining).

MTP/EAGLE compatibility (E8)

Reviewed and confirmed that `process_batch_result_pipelined_prefill` correctly propagates EAGLE `spec_info` (`topk_p`, `topk_index`, `hidden_states`) to requests — identical to the normal path. MTP decode-side rollback is purely a decode-phase operation with no interaction with prefill-time pipelined transfer. Also aligned `copy_done.synchronize()`, `routed_experts_output.finalize()`, and `maybe_cache_unfinished_req` with the normal result handler.

Zero regression guarantee

When `SGLANG_PIPELINED_KV_TRANSFER=false` (default):

`_get_pipeline_group_size()` returns `0` on the first line
`run_batch_pipelined`, `process_batch_result_pipelined_prefill`, and all `split_*` methods are never called
`TransferKVChunk.layer_id` defaults to `None`, so `transfer_worker` always takes the existing path
`add_transfer_request` new parameters have default values — existing callers unaffected
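
The role of the `None` default can be shown with a reduced stand-in for the chunk type. This is a sketch under assumptions: `TransferChunkSketch` and `pick_transfer_path` are hypothetical, the real `TransferKVChunk` has more fields, and the real `transfer_worker` does more than choose a path.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransferChunkSketch:
    room: int
    layer_id: Optional[int] = None       # None: legacy all-layers chunk
    cuda_event: Optional[object] = None  # only set for per-layer chunks


def pick_transfer_path(chunk: TransferChunkSketch) -> str:
    """Choose the existing path unless the chunk targets a single layer."""
    # Compare against None explicitly: layer 0 is a valid layer id.
    if chunk.layer_id is None:
        return "all_layers"
    return "single_layer"
```

Existing callers that never set `layer_id` keep producing chunks that take the original path, which is the behavior the list above relies on.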

Code equivalence (v0.4.10.post2 → this PR)

Benchmarks were collected on v0.4.10.post2. This PR ports the same logic to upstream main with these adaptations:

`os.environ.get()` → `envs.XXX.get()` (central environ.py registry)
`send_kvcache_layer` uses PP-aware `get_mha_kv_ptrs_with_pp` / `get_mla_kv_ptrs_with_pp` (upstream helpers, equivalent at PP=1)
`process_batch_result_pipelined_prefill` adapted for upstream's EAGLE spec info, staging buffer, and `report_prefill_stats` APIs
Core algorithm (grouped layer forward loop, CUDA event sync, per-layer RDMA enqueue) is identical

Checklist

| # | Item | Status |
| --- | --- | --- |
| 1 | GQA, same TP, no MTP | ✅ Done (LLaMA, Qwen, Gemma verified) |
| 2 | Mamba hybrid (FalconH1), same TP | ✅ Done: `forward_split_prefill` implemented, SSM state via `maybe_send_extra()` |
| 3 | GQA + Mamba, different TP | ✅ Done (E2 head slicing + E4 state transfer) |
| 4 | MLA (DeepSeek-V2/V3) | ⚠️ Transport layer ready (`send_kvcache_layer` MLA path), but DeepSeek models lack `forward_split_prefill` due to complex TBO/A2A/CP interactions, so they fall back safely via the universal guard. Note: MLA compresses KV ~3.6x (1152 B vs 4096 B/token/layer for GQA), so the pipelined benefit is inherently smaller (~15-30% TTFT reduction vs 48-68% for GQA models); prioritizing GQA/MHA models is intentional. |
| 5 | MTP/EAGLE compatibility | ✅ Done: `spec_info` propagation verified, result handler aligned with normal path |
| 6 | PP support | ❌ Not started |
| 7 | EPLB (Expert Load Balancing) | ⚠️ Pipelined path bypasses the `expert_distribution_recorder.with_forward_pass()` context and `experts_capturer.on_forward_end()`, so it is automatically disabled when `enable_eplb=True` to ensure complete routing statistics. Full EPLB+pipelined support (wrapping split forward with capture logic) is deferred to a follow-up PR. |
| 8 | DP Attention | ⚠️ `forward_split_prefill` bypasses `prepare_mlp_sync_batch()`, so pipelining is automatically disabled when `enable_dp_attention=True`. |
| 9 | input_embeds (custom embeddings via API) | ⚠️ `forward_split_prefill` always calls `embed_tokens(input_ids)`, so pipelining is automatically disabled when any request has `input_embeds`. |

Modified Files

| File | Changes |
| --- | --- |
| `sglang/srt/disaggregation/prefill.py` | `_get_pipeline_group_size` (universal guard + adaptive formula), `_prepare_pipelined_state_indices`, pipelined event loop branch, `process_batch_result_pipelined_prefill` (with EAGLE/staging/A2A alignment) |
| `sglang/srt/managers/scheduler.py` | `run_batch_pipelined` (with state_indices pass-through), overlap auto-disable logic, info log in `dispatch_event_loop` |
| `sglang/srt/disaggregation/mooncake/conn.py` | `send_kvcache_layer` (MHA+MLA+head slicing), `_send_kvcache_layer_head_slice`, `TransferKVChunk` extension, `transfer_worker` dispatch, `MooncakeKVSender.send_layer` (with state_indices) |
| `sglang/srt/disaggregation/base/conn.py` | `BaseKVSender.send_layer` default (raises NotImplementedError) |
| `sglang/srt/disaggregation/common/utils.py` | `TransferKVChunk`: add `layer_id`, `cuda_event` fields |
| `sglang/srt/disaggregation/fake/conn.py` | `FakeKVSender.send_layer` stub (with state_indices) |
| `sglang/srt/managers/tp_worker.py` | `forward_batch_generation_split_{init,layer,sample}` |
| `sglang/srt/model_executor/model_runner.py` | `ModelRunner.forward_split_prefill` (with forward_context + attn_backend init) |
| `sglang/srt/models/falcon_h1.py` | `FalconH1ForCausalLM.forward_split_prefill` |
| `sglang/srt/models/qwen3_5.py` | `forward_split_prefill` for Qwen3.5 + VL (contributed by @UNIDY2002) |
| `sglang/srt/environ.py` | Register 5 env vars with adaptive formula comments |
| `docs_new/docs/references/environment_variables.mdx` | Document 5 env vars |
| `test/registered/disaggregation/test_disaggregation_pipelined.py` | CI tests for pipelined transfer |

Future Work

MLA `forward_split_prefill`: DeepSeek-V2/V3 has complex interactions with TBO, NSA context parallel, and A2A MoE. Deferred to a separate PR, ideally by someone familiar with DeepSeek internals.
NIXL backend support — NIXL already has async queue architecture (PR Nixl async transfer #23967); adding send_layer() is ~160 LOC. Planned as follow-up PR after this merges.
PP support — Cross-PP-stage pipeline coordination (long-term)

CI States

Latest PR Test (Base): ❌ Run #27007567360
Latest PR Test (Extra): ❌ Run #27007567259

ShangmingCai · 2026-04-23T07:23:39Z

CC: @UNIDY2002 Could you check this? I haven't gone through this PR carefully yet, but this seems like a cleaner implementation.

UNIDY2002 · 2026-04-23T10:15:50Z

Nice work. We took a different approach in #19931 (callback-driven, per-layer notifications from inside HybridLinearAttnBackend.forward()), but your split_init/layer/sample + send_layer design is cleaner — the scheduler controls layer ranges and Mooncake just executes transfers, which is the right separation.

We've been working on Qwen3.5-397B-A17B (hybrid linear attention + GQA + VL), and there are a couple of gaps we can help fill:

Qwen3.5 lacks forward_split_prefill — Upstream qwen3_5.py and its parent qwen3_vl.py only have forward(), so tp_worker.forward_batch_generation_split_layer() → model_runner.forward_split_prefill() would fail for this model. We have a working implementation that handles capture_aux_hidden_states / _is_layer_to_capture for DFLASH. Happy to port it as a follow-up.
Multimodal fallback — Qwen3.5-VL inherits from Qwen3VLForConditionalGeneration and needs general_mm_embed_routine for multimodal inputs, which is incompatible with split-prefill. It'd be useful to add a multimodal guard in _get_pipeline_group_size() so those batches fall back to the normal path. We hit this in our testing.

We'd like to collaborate on getting Qwen3.5 support into this PR (or a follow-up).

michael7193 · 2026-04-24T09:39:14Z

Thanks for the thoughtful review and kind words, @UNIDY2002!

Great to hear about your experience with #19931. The callback-driven approach is interesting — glad we converged on similar goals from different angles.

Both issues you raised are very practical:

forward_split_prefill for Qwen3.5 — Makes total sense. The current implementation assumes models provide forward_split_prefill, so hybrid models like Qwen3.5-397B-A17B would indeed need that. Would love to see your implementation — a follow-up PR sounds perfect.

Multimodal fallback — Good catch. Adding a multimodal guard in _get_pipeline_group_size() to fall back to the normal path is straightforward and the right thing to do. Happy to include it in this PR if you'd like to send a patch, or we can handle it in the follow-up together.

Very much looking forward to collaborating on Qwen3.5 support. Feel free to ping me anytime!

michael7193 · 2026-05-06T10:42:41Z

@UNIDY2002 Thanks for the catch — applied your is_last_chunk fix and also resolved the lint issues (missing import + formatting in qwen3_5.py). All pre-commit checks are passing now. ✅

@ShangmingCai Gentle ping — this PR is ready for review whenever you have a chance. Summary of what's been done since your last look:

Rebased onto latest main (conflict resolved cleanly)
Merged UNIDY2002's Qwen3.5 forward_split_prefill support
Added model-aware multimodal guard (VL models with forward_split_prefill can still use pipelining)
Fixed is_last_chunk param name bug (UNIDY2002's suggestion)
Lint all green

Happy to address any further feedback!

michael7193 · 2026-05-09T01:38:24Z

@ShangmingCai Friendly ping — this PR has been rebased onto the latest main (no conflicts). Would appreciate your review when you get a chance.

Also, could a maintainer add the run-ci label so the GPU tests can run? Lint is passing. Thanks!

ShangmingCai · 2026-05-09T04:30:14Z

Great! Too busy lately, let me trigger the CI first, will start to review next week. Thank you so much for the PR.

ShangmingCai · 2026-05-09T04:30:20Z

/tag-and-rerun-ci

ShangmingCai · 2026-05-09T04:31:44Z

This file has a lint error. Also, is this modification mis-added by cc?

The falcon_h1.py change is intentional — FalconH1 is a Mamba/Attention hybrid model where SSM conv states need special handling during layer-pipelined transfer (sent once at the final group via maybe_send_extra()). I'll fix the lint error in the next push.

ShangmingCai · 2026-05-09T04:33:01Z

Does this mean that we need to implement forward_split_prefill for every single model? This might not be a robust design. Will dive in next week.

Good question! Actually this is not a new pattern we're introducing — there are already 15 models in the upstream codebase that implement forward_split_prefill (llama, qwen, qwen2, qwen3, gemma, gemma2, gemma3, glm4, exaone4, sarvam_moe, qwen2_moe, qwen3_moe, etc.), added for chunked prefill / PP support.

Our layer-pipelined feature simply reuses this existing interface. The design has two layers of safety:

Guard fallback: If a model doesn't have forward_split_prefill, the pipelined path is automatically skipped and the request goes through the normal path (no crash, no regression).

Pattern is mechanical: For standard transformer models, the implementation is identical — embed → layers[start:end] → norm → logits. Only hybrid models (Mamba SSM, hybrid linear attention) need custom logic.

That said, if you'd prefer a more robust approach, we could add a default generic implementation in a base class that works for any standard transformer model, so new models get pipelined support for free without writing any code. Happy to explore that direction if you think it's worthwhile.

michael7193 · 2026-05-09T09:59:44Z

Fixed the falcon_h1.py formatting issue (commit 489bf9d). Could you re-trigger CI when you get a chance? Thanks!

michael7193 · 2026-05-10T01:26:08Z

Fixed the CI failure — root cause was ImportError: cannot import name 'kv_to_page_indices' from 'sglang.srt.disaggregation.utils'. The function was moved to sglang.srt.mem_cache.common in upstream main but scheduler.py still imported from the old path. Fixed in commit d204bf1.

@ShangmingCai Could you re-trigger CI when you get a chance? Thanks!

ShangmingCai · 2026-05-11T07:05:23Z

@michael7193 No problem, will do, but please fix lint first, or all the CI will be aborted.

michael7193 · 2026-05-13T03:31:11Z

@ShangmingCai Lint is fixed and passing now. CI completed — our test_disaggregation_pipelined.py passes (2/2 tests). The 11 remaining failures are all pre-existing infra issues (HiCache file backend flake, AMD perf regression, Docker build race, NPU timeout) unrelated to this PR. Ready for your review whenever you have time!

- Change SGLANG_PIPELINE_GROUP_SIZE default to EnvInt(0) with comment explaining adaptive formula is used when env var is not set
- Fix formula comment: use SAT_TOKENS = MIN_TOKENS * 3 notation instead of opaque "MIN_TOKENS * 2" denominator
- Update env var docs: note overlap schedule auto-disable behavior, clarify GROUP_SIZE is optional override

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The docs/ directory is deprecated; CI rejects changes there. Move pipelined KV transfer env var documentation to docs_new/docs/references/environment_variables.mdx and restore docs/references/environment_variables.md to upstream state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix division-by-zero when SGLANG_PIPELINE_MIN_TOKENS=0 (clamp to 1)
- Skip pipelining for very small models (<=4 layers) where per-group overhead outweighs any compute/transfer overlap benefit
- Move GROUP_SIZE user override before min_tokens check so explicit configuration always takes priority over heuristics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Disable pipelining when enable_dp_attention is set, because forward_split_prefill bypasses prepare_mlp_sync_batch() needed for DP buffer initialization
- Disable pipelining when draft_worker is present (EAGLE/spec decode), because run_batch_pipelined only iterates target model layers and would never transfer draft KV to the decode side
- Disable pipelining when any request has input_embeds, because forward_split_prefill always calls embed_tokens(input_ids) and silently ignores custom embeddings passed via API

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When enable_eplb is set, pipelined mode bypasses the expert_distribution_recorder.with_forward_pass() context and experts_capturer.on_forward_end() call in model_runner.forward(). This causes EPLB to lose routing statistics for pipelined batches, leading to suboptimal expert rebalancing decisions. Disable pipelining when EPLB is active to ensure complete routing data collection. This affects MoE models (Qwen3MoE, Qwen2MoE, SarvamMoE, ExaoneMoE) that have both forward_split_prefill and EPLB. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

michael7193 · 2026-05-28T01:35:27Z

@ShangmingCai Hi! We've done several rounds of self-review and addressed the issues we found (generality guards for DP Attention/EAGLE/input_embeds/EPLB, edge-case protections for small models, docs migration, etc.). The code should be in good shape now. Could you take another look when you get a chance? Thanks!

ShangmingCai · 2026-05-28T12:32:01Z

Will find time to review tomorrow.

Prevent ZeroDivisionError if batch.reqs is empty (defensive guard). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…multiplier

1. Early-abort in transfer_worker: skip queued chunks/batches whose room has already been marked Failed, avoiding redundant lock acquisitions, record_failure calls, and wasted RDMA doorbells after a layer fails.
2. Expose SGLANG_PIPELINE_SAT_MULTIPLIER env var (default 3.0) to make the adaptive formula's saturation point tunable across different network bandwidths (lower for 800G IB, higher for 200G-400G IB).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… 1.0 Clamp SAT_MULTIPLIER to at least 1.01 so sat_tokens > min_tokens, preventing ZeroDivisionError in the adaptive formula. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

michael7193 · 2026-06-03T06:26:02Z

Hi @ShangmingCai, how's the review going? Are there any remaining concerns or anything else you'd like me to address before merging? Thanks!

Ensure layer-pipelined disagg prefill preserves final completion metadata for zero-page and chunked requests, while falling back for unsupported heterogeneous staging paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Resolve conflicts between upstream Mooncake tracing changes and layer-pipelined KV transfer worker fields. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ShangmingCai · 2026-06-03T12:06:11Z

Sorry, I was temporarily held up by a few more critical features that required immediate review and testing. I can begin moving this PR forward for review today and tomorrow. I will also be pinging a few reviewers to support this, since this PR requires cross-module co-design.

Send final metadata only after layer-pipelined KV chunks are enqueued so aux buffers cannot race with worker finalization, and gate pipelined split prefill to audited Llama models while preserving normal ModelRunner forward hooks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Ensure the pipelined split-prefill path follows the normal forward path by resolving deferred input IDs before constructing ForwardBatch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Avoid generating tokens for prefill-only requests and keep pipelined transfer status/state handling consistent with the normal path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ShangmingCai · 2026-06-05T08:27:00Z

        self: Scheduler,
        batch: ScheduleBatch,
        result: GenerationBatchResult,
+        pipelined: bool = False,


Looks like this is not needed.

Done in 5cbeca7 — removed the unused pipelined argument. The per-request state (req.pipelined_kv_sent) is enough to distinguish the pipelined path, so the extra function parameter is not needed.

ShangmingCai · 2026-06-05T08:30:05Z

        ]

        self.capture_aux_hidden_states = False
+        self.supports_layer_pipelined_kv_transfer = type(self) is LlamaForCausalLM


Why does only Llama need this supports_layer_pipelined_kv_transfer flag? Do we need to set it for all supported models, such as Falcon and Qwen in this PR?

Good point. The Llama-only opt-in was over-conservative and inconsistent with the supported model checklist/benchmark (including Qwen and Falcon). I removed the redundant supports_layer_pipelined_kv_transfer flag and restored forward_split_prefill as the model capability guard. Unsupported cases still fall back via the existing backend/PP/DP-attention/EPLB/speculative/input_embeds/heterogeneous-TP guards.

ShangmingCai · 2026-06-05T09:16:52Z

    state_indices: Optional[List]
    chunk_id: Optional[int] = None
+    layer_id: Optional[int] = None
+    cuda_event: object = None


Should we make this Optional?

Yes, done in c59d819 — changed cuda_event to Optional[object] since normal chunks/final metadata chunks may not carry a CUDA event and use None.

ShangmingCai · 2026-06-05T09:18:03Z

+        Returns (LogitsProcessorOutput, event) for the final group
+        (when split_index reaches num_hidden_layers).
+        """
+        out = self.model_runner.forward(


Would it be better to call forward_split_prefill here instead?

I kept this routed through ModelRunner.forward intentionally. ModelRunner.forward still dispatches to forward_split_prefill for SPLIT_PREFILL, but it also preserves the common forward bookkeeping/context around it (e.g. recorder/profiler hooks and the _forward_raw preparation path). Calling forward_split_prefill directly here would bypass that shared wrapper. Added a short comment in c59d819 to make this explicit.

ShangmingCai · 2026-06-05T09:20:22Z

@UNIDY2002 @zhangxiaolei123456 Do you have time to take a look and test this PR? I think it is almost ready for production test.

UNIDY2002 · 2026-06-05T09:22:19Z

@UNIDY2002 @zhangxiaolei123456 Do you have time to take a look and test this PR? I think it is almost ready for production test.

I'll find some time to test it next week.

Remove the redundant model opt-in guard so pipelined KV transfer follows the PR's documented split-prefill capability contract, and drop an unused review-only parameter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zhangxiaolei123456 · 2026-06-05T09:25:03Z

@ShangmingCai OK

Clarify the optional CUDA event type and document why the split layer path goes through the common ModelRunner.forward wrapper. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

michael7193 requested review from ByronHsu, ShangmingCai, Ying1123, hnyls2002, merrymercy, wisclmy0611 and xiezhq-hermann as code owners April 23, 2026 02:23

github-actions Bot added the documentation Improvements or additions to documentation label Apr 23, 2026

ShangmingCai mentioned this pull request Apr 23, 2026

[Roadmap] Prefill-Decode Disaggregation Roadmap (2026 Q2) #21703

Open

16 tasks

UNIDY2002 mentioned this pull request Apr 28, 2026

[Qwen3.5] Add forward_split_prefill support for layer-pipelined KV transfer michael7193/sglang#1

Merged

UNIDY2002 reviewed May 6, 2026

Comment thread python/sglang/srt/disaggregation/mooncake/conn.py Outdated

michael7193 force-pushed the feature/layer-pipelined-kv-transfer branch from 39d680d to 155f9b7 Compare May 6, 2026 06:46

michael7193 force-pushed the feature/layer-pipelined-kv-transfer branch from b5267cb to 2547641 Compare May 9, 2026 01:38

github-actions Bot added the run-ci label May 9, 2026

ShangmingCai reviewed May 9, 2026

michael7193 force-pushed the feature/layer-pipelined-kv-transfer branch 2 times, most recently from 7bc1715 to 75d6aa3 Compare May 12, 2026 06:22

michael7193 and others added 2 commits May 26, 2026 19:41

michael7193 requested a review from JustinTong0323 as a code owner May 27, 2026 03:32

michael7193 and others added 3 commits May 27, 2026 01:29

michael7193 and others added 3 commits June 1, 2026 23:30

[Disagg] Add empty batch guard in _get_pipeline_group_size

e235b92

fix(pipelined): guard against division-by-zero when SAT_MULTIPLIER <=…

1286256

michael7193 and others added 2 commits June 3, 2026 02:12

fix(pipelined): handle transfer fallback edge cases

4e012a6

Merge origin/main into pipelined KV transfer PR

090ebc7

ShangmingCai assigned cctry, ShangmingCai and hnyls2002 Jun 3, 2026

michael7193 and others added 3 commits June 3, 2026 19:28

fix(pipelined): materialize inputs before split prefill

5f1dee2

fix(pipelined): align edge-case semantics

0d9b628

ShangmingCai reviewed Jun 5, 2026

fix(pipelined): use split-prefill capability guard

5cbeca7

fix(pipelined): address split prefill review notes

c59d819
