[Disagg][NIXL] Add staging buffer support for heterogeneous TP KV transfer by YAMY1234 · Pull Request #22536 · sgl-project/sglang

YAMY1234 · 2026-04-10T17:57:02Z

Motivation

NIXL disaggregated serving currently requires prefill and decode to use the same TP layout. When prefill uses TP4 and decode uses DEP4 (DP4+TP4+EP4), each prefill rank's KV cache must be split and sent to multiple decode ranks. Without staging buffers, the prefill side must issue prefill_tp × decode_tp separate RDMA transfers per chunk, saturating the RDMA descriptor table and adding significant per-transfer overhead.

The staging buffer approach (already implemented for mooncake in #19890) consolidates KV heads into a contiguous staging region on prefill, issues a single bulk RDMA transfer per rank pair, and lets the decode side scatter from the staging buffer into the final KV cache pages asynchronously.
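The gather/bulk-send/scatter idea can be illustrated with a minimal CPU-only sketch. Everything here is an assumption made for illustration: `HEAD_BYTES`, `gather_heads`, `scatter_heads`, and the byte layout are hypothetical, and the real code operates on GPU KV pages rather than Python `bytes`.

```python
from typing import Dict, List

# Hypothetical toy layout: each KV "page" stores all heads of one prefill rank back to back.
HEAD_BYTES = 4  # assumed bytes per head per page, for illustration only


def gather_heads(pages: List[bytes], head_start: int, head_end: int) -> bytes:
    """Pack heads [head_start, head_end) of every page into one contiguous buffer."""
    lo, hi = head_start * HEAD_BYTES, head_end * HEAD_BYTES
    return b"".join(page[lo:hi] for page in pages)


def scatter_heads(staging: bytes, num_pages: int) -> List[bytes]:
    """Split a received staging buffer back into per-page slices on the decode side."""
    per_page = len(staging) // num_pages
    return [staging[i * per_page:(i + 1) * per_page] for i in range(num_pages)]


# A prefill rank holding 4 heads across 3 pages; each head is filled with a distinct byte value.
pages = [bytes([p * 10 + h for h in range(4) for _ in range(HEAD_BYTES)]) for p in range(3)]

# Decode rank 0 wants heads 0-1 and decode rank 1 wants heads 2-3:
# one contiguous buffer (one bulk transfer) per rank pair instead of one per page slice.
staged: Dict[int, bytes] = {rank: gather_heads(pages, 2 * rank, 2 * rank + 2) for rank in range(2)}
received: Dict[int, List[bytes]] = {
    rank: scatter_heads(buf, num_pages=len(pages)) for rank, buf in staged.items()
}
```

In this toy setup each decode rank receives a single 24-byte buffer and recovers the same per-page head slices it would have received from separate per-page transfers.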

Collaborated with @Aphoh (the author of #18968).

Modification

This PR extends the existing staging buffer support from mooncake to NIXL. The core staging lifecycle logic (staging_handler.py, staging_buffer.py) is already shared between backends — this PR adds the NIXL-specific integration and refactors mooncake to use the newly shared functions.

Key differences from mooncake implementation:

- Transfer notifications via RDMA tags instead of ZMQ: NIXL encodes staging metadata (chunk index, page offset, etc.) in RDMA notification tags (stg tag), while mooncake uses ZMQ CHUNK_READY messages. The RDMA notification path avoids an extra network round-trip.
- RDMA notifications processed on main thread: Due to C++ thread-safety constraints in nixl_agent.get_new_notifs(), RDMA notification processing and scatter submission stay on the main thread poll path. The background thread only handles ZMQ STAGING_REQ messages. In mooncake, both notification processing and scatter run on the background thread.
- Staging buffer registration via callback: init_staging_buffers(register_fn, ...) and init_staging_allocator(register_fn, ...) now accept a generic register_fn callback instead of a mooncake engine directly, enabling NIXL to register buffers via nixl_agent. Mooncake is refactored to use the same callback pattern.
- Shared utility extraction: handle_watermark_msg(), handle_staging_rsp(), and handle_chunk_arrived() are extracted from mooncake into staging_handler.py as common functions used by both backends.
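The callback-based registration described above can be sketched roughly as follows. This is not the PR's code: the `(base_ptr, size_bytes)` callback signature, `ToyStagingBuffer`, and `init_staging_buffers_sketch` are hypothetical stand-ins, since the actual `register_fn` arguments are not shown in this description.

```python
from typing import Callable, List

# Hypothetical signature: the real register_fn arguments are not shown in the PR description.
RegisterFn = Callable[[int, int], None]  # (base_ptr, size_bytes)


class ToyStagingBuffer:
    """Stand-in for a staging buffer; a real one would wrap GPU memory."""

    def __init__(self, size_bytes: int) -> None:
        self.data = bytearray(size_bytes)
        self.base_ptr = id(self.data)  # placeholder for a device pointer
        self.size_bytes = size_bytes


def init_staging_buffers_sketch(
    register_fn: RegisterFn, count: int, size_bytes: int
) -> List[ToyStagingBuffer]:
    """Allocate buffers and hand each one to the transport through the callback."""
    buffers = [ToyStagingBuffer(size_bytes) for _ in range(count)]
    for buf in buffers:
        register_fn(buf.base_ptr, buf.size_bytes)
    return buffers


# Each backend supplies its own registration; these lists stand in for engine/agent calls.
mooncake_registered: List[int] = []
nixl_registered: List[int] = []

init_staging_buffers_sketch(
    lambda ptr, size: mooncake_registered.append(size), count=2, size_bytes=1024
)
init_staging_buffers_sketch(
    lambda ptr, size: nixl_registered.append(size), count=4, size_bytes=2048
)
```

The point of the pattern is that the shared initializer never imports a transport library; only the callback knows which backend is in use.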

Files changed: nixl/conn.py, common/staging_handler.py, mooncake/conn.py

Accuracy

GSM8K (full 1319 questions), Qwen3.5-397B-A17B-FP8, 1P1D TP4→DEP4, NIXL + staging:

Test	Job ID	Score	Std Dev
TP4→DEP4 NIXL + Staging	1433486	0.9786	0.1446
TP4→DEP4 NIXL + Staging (run 2)	1434413	0.9764	0.1519
Reference baseline (non-disagg)	—	0.978	—

Accuracy is consistent with the non-disaggregated baseline (~0.98), indicating that the staging buffer does not affect model correctness.

Performance

Setup: Qwen3.5-397B-A17B-FP8, 1P1D, GB200 Lyris cluster, ISL=1000, OSL=1000, sa-bench concurrency sweep 1→1024.

1. TP4→DEP4 Staging vs No-Staging (heterogeneous TP, measuring staging benefit)

Concurrency	No-Staging TPS	Staging TPS	Δ TPS	No-Staging TTFT (ms)	Staging TTFT (ms)	Δ TTFT
1	899	922	+2.6%	304	209	-31.1%
2	1,745	1,803	+3.3%	521	354	-32.1%
4	3,424	3,556	+3.9%	688	450	-34.5%
8	6,204	6,473	+4.3%	1,035	417	-59.7%
16	10,787	11,519	+6.8%	1,303	551	-57.7%
32	18,616	13,807	-25.8%	1,492	734	-50.8%
64	33,247	30,760	-7.5%	1,674	780	-53.4%
128	55,561	56,228	+1.2%	2,257	769	-65.9%
256	89,915	92,776	+3.2%	3,050	1,037	-66.0%
512	106,427	137,480	+29.2%	21,346	1,288	-94.0%
1024	109,244	179,232	+64.1%	66,803	7,467	-88.8%

At high concurrency (512–1024), staging delivers 29–64% throughput improvement and dramatically lower TTFT (1.3s vs 21s at c=512, 7.5s vs 67s at c=1024). The no-staging path saturates RDMA descriptors at high concurrency, causing TTFT to blow up. TPOT remains comparable across both configurations.
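For reference, the Δ columns appear to be plain relative changes against the no-staging run. A small check of the c=512 and c=1024 rows, using only numbers from the table above (`pct_delta` is a helper introduced here for illustration):

```python
def pct_delta(baseline: float, staged: float) -> float:
    """Relative change of the staging run versus the no-staging run, in percent."""
    return round((staged - baseline) / baseline * 100.0, 1)


# Values copied from the c=512 and c=1024 rows of the table.
tps_512 = pct_delta(106_427, 137_480)
ttft_512 = pct_delta(21_346, 1_288)
tps_1024 = pct_delta(109_244, 179_232)
ttft_1024 = pct_delta(66_803, 7_467)
```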

2. TP4→DEP4 Staging vs DEP4→DEP4 NIXL (heterogeneous vs homogeneous, measuring staging overhead)

Concurrency	DEP4→DEP4 TPS	TP4→DEP4 Staging TPS	Δ TPS	DEP4→DEP4 TTFT (ms)	Staging TTFT (ms)	DEP4→DEP4 TPOT (ms)	Staging TPOT (ms)
1	914	922	+0.9%	219	209	1.07	1.06
2	1,801	1,803	+0.1%	254	354	1.09	1.08
4	3,550	3,556	+0.2%	301	450	1.09	1.08
8	6,482	6,473	-0.1%	352	417	1.19	1.20
16	7,952	11,519	+44.9%	447	551	1.39	1.33
32	14,362	13,807	-3.9%	611	734	1.61	1.66
64	32,714	30,760	-6.0%	707	780	1.79	1.79
128	51,983	56,228	+8.2%	849	769	2.09	2.09
256	81,387	92,776	+14.0%	997	1,037	2.54	2.55
512	135,531	137,480	+1.4%	1,330	1,288	3.17	3.18
1024	192,839	179,232	-7.1%	7,471	7,467	4.37	4.34

TP4→DEP4 with staging has no systematic overhead compared to homogeneous DEP4→DEP4. TTFT and TPOT are comparable. At mid-to-high concurrency (128–512), TP4 prefill is actually more efficient than DEP4 prefill, so the heterogeneous layout outperforms homogeneous in those ranges.

…nsfer

Implement GPU staging buffer for NIXL backend to enable bulk RDMA transfers under heterogeneous TP, reducing RDMA work requests by ~1000x compared to per-head scatter transfers.

Key changes:
- server_args.py: Allow SGLANG_DISAGG_STAGING_BUFFER with NIXL backend
- Staging buffer lifecycle: _init_staging_prefill_ctx/decode_ctx, _init_staging_buffers, _init_staging_allocator, _register_staging_memory
- Prefill side: send_kvcache_staged() gathers KV heads into staging buffer then posts a single bulk RDMA write; _prefetch_staging_reqs() pre-sends STAGING_REQ to decode before forward starts
- Decode side: _start_decode_staging_thread() receives STAGING_REQ, allocates staging offsets, replies with STAGING_RSP; watermark-based flow control for ring buffer reuse
- Notification handling: stg notifications trigger chunk scatter; _maybe_submit_last_scatter() for final scatter after all chunks arrive
- KVArgsRegisterInfo: add staging_base_ptr, staging_total_size fields at msg[12]/msg[13]
- KVReceiver: send staging allocator metadata during registration
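As a rough illustration of "watermark-based flow control for ring buffer reuse", the sketch below models a ring allocator whose space is reused only after a consumer-reported watermark advances. `RingAllocatorSketch` and its policy (padding the tail so each chunk stays contiguous, counting padding as in-flight bytes) are assumptions for illustration; the actual StagingAllocator behavior is not shown in this PR text.

```python
from typing import Optional


class RingAllocatorSketch:
    """Toy ring allocator: offsets are reused only after the watermark advances."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.head = 0       # total bytes ever handed out, including padding (monotonic)
        self.watermark = 0  # total bytes the consumer has released (monotonic)

    def try_alloc(self, size: int) -> Optional[int]:
        """Return a byte offset into the ring, or None if the caller must wait."""
        if size > self.capacity:
            raise ValueError("chunk can never fit in the ring")
        offset = self.head % self.capacity
        # Skip the tail when the chunk would wrap, so every chunk stays contiguous.
        pad = self.capacity - offset if offset + size > self.capacity else 0
        if self.head + pad + size - self.watermark > self.capacity:
            return None  # not enough released space yet
        self.head += pad + size
        return (self.head - size) % self.capacity

    def advance_watermark(self, new_watermark: int) -> None:
        """Record consumer progress; the watermark never moves backwards."""
        self.watermark = max(self.watermark, new_watermark)


ring = RingAllocatorSketch(capacity=100)
first = ring.try_alloc(60)    # fits at the start of the ring
blocked = ring.try_alloc(60)  # would overwrite unreleased data, so the caller waits
ring.advance_watermark(60)    # consumer reports the first chunk as scattered
retry = ring.try_alloc(60)    # the released region at the start can now be reused
```

A producer that gets `None` would retry later rather than fail, which is the role the deferred re-enqueue plays in the transfer worker described further down.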

NIXL staging buffers were allocated with cudaMalloc instead of cuMemCreate. Pass custom_mem_pool from init_mooncake_custom_mem_pool() to StagingBuffer/StagingAllocator constructors, matching Mooncake behavior.

…ake pattern
- Extract staging transfer logic into helper methods
- Delegate common operations to staging_handler.py
- Remove unnecessary getattr/hasattr defensive checks
- Simplify NixlKVReceiver staging registration

- Extract _get_custom_mem_pool() in staging_handler.py to centralize mooncake custom memory pool initialization
- Change init_staging_buffers/init_staging_allocator to accept a register_fn callback instead of mooncake engine, making them transport-agnostic
- NIXL now delegates to staging_handler instead of directly importing mooncake.utils.init_mooncake_custom_mem_pool
- Replace hardcoded 8192 with DEFAULT_CHUNKED_PREFILL_SIZE constant

Made-with: Cursor

The refactor commit accidentally removed defensive getattr() calls and # type: ignore comments, and changed the is_dummy() method to @property. These were pre-existing patterns not introduced by the staging buffer code and should not be altered. Made-with: Cursor

…ategy, use StagingRegisterInfo in NIXL
- Fix prefetch_staging_reqs() is_dummy compatibility: handle both bool field (mooncake) and method (NIXL) via callable() check
- Remove misleading "Mooncake-specific" section header in staging_handler.py; most code is backend-agnostic
- Generalize PrefillStagingStrategy.check_ready() with session_id param so NIXL can pass req.agent_name instead of mooncake_session_id
- NIXL: use StagingRegisterInfo.from_zmq_fields() instead of manual staging_base_ptr/staging_total_size parsing
- NIXL: delegate readiness check to PrefillStagingStrategy.check_ready() instead of inlining chunk_idx/offset/watermark logic

Made-with: Cursor

These # --- section headers were not in the original codebase and add unnecessary noise. Made-with: Cursor

Consolidate the 3-way KV dispatch (same-TP, staging, slice-fallback) into a single _send_kv_for_req method, eliminating _send_kv_slice_fallback. Made-with: Cursor

Move inline stg tag parsing and aux staging checks into dedicated _handle_stg_notification and _handle_aux_notification methods, keeping the main notification dispatch loop concise. Made-with: Cursor

…mon handler
- Group scattered staging sub-functions in nixl/conn.py: _get_staging_strategy + _do_staging_transfer now adjacent to send_kvcache_staged, _handle_watermark_msg + _handle_staging_rsp now adjacent to _maybe_submit_last_scatter
- Extract DecodeStagingHandler.handle_chunk_arrived() in staging_handler.py, unifying the chunk writer tracking + scatter submission logic used by both NIXL (_handle_staging_chunk_arrived) and mooncake (CHUNK_READY handler)

Made-with: Cursor

…fer_request Restore the original inline if/elif/else call style for send_kvcache and send_kvcache_slice, adding staging as a new first branch without changing the existing call patterns. Made-with: Cursor

- Remove DEFAULT_CHUNKED_PREFILL_SIZE constant, restore inline `or 8192`
- Revert getattr(req, "staging", None) back to req.staging

Made-with: Cursor

Use self.decode_kv_args_table[req.agent_name].xxx consistently instead of extracting dst_info, matching the upstream code style. Made-with: Cursor

…ndler Both NIXL and mooncake had identical WATERMARK and STAGING_RSP message handling logic. Extract into staging_handler.py as shared functions, reducing duplication across backends. Made-with: Cursor

gemini-code-assist

Code Review

This pull request refactors the staging buffer logic to be transport-agnostic and introduces staging support for the NIXL disaggregation backend. Key changes include moving staging-related message handling and buffer initialization to a common handler, updating the Mooncake backend to utilize these shared utilities, and implementing RDMA-based staging transfers for NIXL. Feedback suggests implementing a retry mechanism in the staging transfer logic to avoid premature fallbacks to less efficient slice-based transfers and improving the robustness of notification tag parsing to handle potential underscores in agent names.

- Drop redundant `torch.cuda.current_stream().synchronize()` in `send_kvcache_staged`. `gather_all_layers_to_staging` already syncs its dedicated `_gather_stream` before returning, so the staging buffer is fully populated and visible to the NIC by the time the RDMA WRITE is posted (matches mooncake's behavior). Drops the resulting unused `import torch`.
- Replace `Optional[object]` fields on `TransferInfo` / `KVArgsRegisterInfo` with `Optional["StagingTransferInfo"]` / `Optional["StagingRegisterInfo"]` (forward refs under TYPE_CHECKING).
- Normalize `is_dummy` API: convert NIXL `TransferInfo.is_dummy()` to an `@property`, matching mooncake's plain attribute. Updates the two NIXL call sites and removes the `callable(tinfo.is_dummy)` hack from `prefetch_staging_reqs`.
- Inline `_handle_watermark_msg` / `_handle_staging_rsp` wrappers in NIXL `bootstrap_thread` so both backends call the common helpers the same way.
- Extract `_dispatch_kv_transfer` helper from `add_transfer_request` so each request appends exactly one kv handle, instead of three different `handles.append` sites in nested branches.
- Comment `split("_", 8)` to document the per-tag layout and why the maxsplit value lets `agent_name` (which can itself contain underscores) survive the split intact.
- Replace `chunk_id == 0` prefetch sentinel with an explicit `PrefillStagingContext.prefetched_rooms` set so `_prefetch_staging_reqs` is idempotent per room and the invariant ("fan out STAGING_REQ once per room") is local to the function instead of relying on caller behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
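The `split("_", 8)` point can be shown with a small sketch. The field names and order in `FIELDS` are hypothetical (the PR text does not list the real tag layout); the only assumption taken from the commit message is that `agent_name` is the final field and may itself contain underscores.

```python
from typing import Dict, Tuple

# Hypothetical tag layout: the PR text does not give the real field order or names.
FIELDS: Tuple[str, ...] = (
    "kind", "room", "chunk_idx", "page_offset", "num_pages",
    "is_last", "writer_rank", "num_writers", "agent_name",
)


def parse_stg_tag(tag: str) -> Dict[str, str]:
    """Split a notification tag into named fields, keeping agent_name intact."""
    # maxsplit=8 yields at most 9 pieces; the 9th keeps any remaining underscores.
    parts = tag.split("_", 8)
    if len(parts) != len(FIELDS):
        raise ValueError(f"malformed tag: {tag!r}")
    return dict(zip(FIELDS, parts))


tag = "stg_42_0_128_16_1_3_4_decode_agent_0"
parsed = parse_stg_tag(tag)
naive = tag.split("_")  # an unbounded split breaks agent_name into extra pieces
```

With the bounded split, `parsed["agent_name"]` is `"decode_agent_0"`, whereas the unbounded split produces 11 pieces and would misalign the fields.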

ShangmingCai · 2026-04-20T06:56:28Z

The Mooncake part LGTM

ShangmingCai · 2026-04-20T06:58:04Z

Looks good.

nvpohanh · 2026-05-08T05:43:06Z

@YAMY1234 could you rebase this? thanks

ShangmingCai · 2026-05-08T05:47:22Z

cc: @ishandhanani, we should be able to merge this PR once we get an approval from the nixl backend maintainer.

Bring in sgl-project#23967 (Nixl async transfer) and other main changes since the last merge. Conflicts were limited to python/sglang/srt/disaggregation/nixl/conn.py:

1. TransferInfo: kept main's `decode_prefix_len` field + `is_dummy()` method form, appended this PR's `staging` field at the tail. Updated 2 callers in this file from `req.is_dummy` to `req.is_dummy()`.
2. NixlKVManager.__init__ (PREFILL branch): kept this PR's `_init_staging_prefill_ctx()` AND main's `transfer_queues` / `transfer_worker` thread pool. Both run; staging ctx is initialized before workers spawn.
3. add_transfer_request: took main's async enqueue body (puts TransferKVChunk into transfer_queues[room % N], returns None) but kept this PR's `_prefetch_staging_reqs(bootstrap_room)` call before the enqueue. The staging dispatch (`_dispatch_kv_transfer`, `_do_staging_transfer`, `send_kvcache_staged`) is now temporarily dead code: enabling SGLANG_DISAGG_STAGING_BUFFER on NIXL has no effect until the next commit moves staging dispatch into `transfer_worker` (per the mooncake pattern).
4. update_transfer_status: kept this PR's tag-based dispatch (`_track_kv_arrival` / `_handle_stg_notification` / `_handle_aux_notification`) and merged main's "nokv" handling for decode-side radix cache hit (sgl-project#19746) into `_handle_aux_notification`.

After this commit the staging buffer code path is preserved but unused; plain heterogeneous-TP transfers fall back to send_kvcache_slice via the new async worker. The next commit will wire staging into the worker (per-worker staging buffer + deferred re-enqueue on watermark not-ready, matching mooncake).

Co-authored-by: Cursor <cursoragent@cursor.com>

…ke parity)

After the previous merge of sgl-project#23967 (Nixl async transfer), staging buffer dispatch lived only in the now-deleted synchronous path of add_transfer_request, leaving SGLANG_DISAGG_STAGING_BUFFER a no-op on NIXL. This commit ports the staging dispatch into transfer_worker, 1:1 mirroring mooncake's per-worker staging design.

1. PREFILL __init__: build N staging buffers (one per transfer_queue) before workers spawn, and pass each worker its private buffer (NixlKVManager.__init__). Removes the lazy single-buffer creation in set_kv_buffer_tensors -- mooncake-style, staging buffers no longer depend on kv_buffer_tensors.
2. _try_create_staging_strategy(staging_buffer) replaces _get_staging_strategy. Returns a fresh PrefillStagingStrategy bound to the caller's staging buffer. The strategy MUST be a worker-local variable; never cache on self -- multiple workers would race on the same staging ring.
3. transfer_worker(queue, staging_buffer=None) now lazy-creates a per-worker staging_strategy on the first chunk it sees, then for each req in a chunk picks among:
   - staging (heterogeneous TP, both sides registered, watermark ready) -> _do_staging_transfer
   - send_kvcache (MLA / homogeneous TP)
   - send_kvcache_slice (heterogeneous TP, no staging or staging hard-failed for this chunk)

   When staging is not ready (watermark/alloc pending), _do_staging_transfer re-enqueues the chunk and signals `staging_deferred=True`; the worker breaks the per-req loop and `continue`s the main loop without advancing room status, so the chunk gets retried on the next pop. Same control-flow as mooncake.transfer_worker.
4. _do_staging_transfer reshaped to (handle, deferred) return tuple:
   - (None, True) -> chunk re-enqueued, caller should defer
   - (None, False) -> hard fallback, caller should try slice
   - (handle, False) -> staging RDMA posted; handle joins the per-chunk handle list and is busy-polled to DONE alongside aux/state handles.

   Oversized chunks (cannot ever fit) raise immediately.
5. _dispatch_kv_transfer (the old synchronous-path entry) is removed. add_transfer_request stays a thin enqueue + _prefetch_staging_reqs wrapper.

Notes vs mooncake:
- NIXL workers do NOT need an executor (no per-slice ThreadPoolExecutor); send_kvcache_slice posts a single bulk transfer.
- NIXL workers do NOT send a separate ZMQ CHUNK_READY message: decode observes chunk arrival via the RDMA `stg_*` notification tag posted by send_kvcache_staged, which the decode-side receiver thread already handles.
- Memory: staging pool grows N x (one per worker, default SGLANG_DISAGGREGATION_QUEUE_SIZE=4). Tunable via SGLANG_DISAGG_STAGING_POOL_SIZE_MB.

Co-authored-by: Cursor <cursoragent@cursor.com>
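The (handle, deferred) contract and the re-enqueue control flow can be sketched in a single-threaded toy worker. All names and the chunk dictionary fields (`ready_after`, `staging_registered`, and so on) are hypothetical; the sketch only mirrors the three return shapes and the oversized-chunk rule listed in the commit message, not the real NIXL transfer code.

```python
import queue
from typing import List, Optional, Tuple


def do_staging_transfer_sketch(
    chunk: dict, q: "queue.Queue[dict]"
) -> Tuple[Optional[str], bool]:
    """Return (handle, deferred) following the three cases in the commit message."""
    if chunk["size"] > chunk["capacity"]:
        raise ValueError("oversized chunk can never fit")
    if not chunk["staging_registered"]:
        return None, False         # hard fallback: caller should try the slice path
    if chunk["ready_after"] > 0:
        chunk["ready_after"] -= 1  # simulate the watermark advancing over time
        q.put(chunk)               # re-enqueue so the chunk is retried on a later pop
        return None, True
    return f"staged-handle-{chunk['room']}", False


def run_worker(q: "queue.Queue[dict]") -> List[str]:
    """Drain the queue, deferring chunks whose staging space is not ready yet."""
    log: List[str] = []
    while not q.empty():
        chunk = q.get()
        handle, deferred = do_staging_transfer_sketch(chunk, q)
        if deferred:
            log.append(f"room {chunk['room']}: deferred")
            continue  # room status is not advanced; the chunk is already back in the queue
        if handle is None:
            handle = f"slice-handle-{chunk['room']}"
        log.append(f"room {chunk['room']}: {handle}")
    return log


work: "queue.Queue[dict]" = queue.Queue()
work.put({"room": 1, "size": 10, "capacity": 100, "staging_registered": True, "ready_after": 2})
work.put({"room": 2, "size": 10, "capacity": 100, "staging_registered": False, "ready_after": 0})
events = run_worker(work)
```

In this run room 1 is deferred twice before its staged transfer is posted, while room 2 falls back to the slice path immediately.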

ShangmingCai · 2026-05-11T07:02:27Z

need to fix lint as well

The shared helper prefetch_staging_reqs() in common/staging_handler.py was written under the assumption that TransferInfo.is_dummy is a plain attribute / @property on both backends. After merging upstream main (which introduced decode_prefix_len, sgl-project#19746), NIXL's TransferInfo.is_dummy was changed from @property to a regular method to consult decode_prefix_len. NIXL's own conn.py call sites were updated to use is_dummy() but this shared helper was missed.

Effect: tinfo.is_dummy evaluates to a bound-method object on NIXL, which is always truthy. The if branch is always taken, every STAGING_REQ is silently skipped, decode never allocates a staging chunk, STAGING_RSP never returns to prefill, the per-chunk staging info stays None, check_ready always returns not-ready, and the chunk is re-enqueued forever. The transfer worker spins in its dispatch loop and the prefill inflight queue never drains -- exactly the deadlock observed on lyris job 1733669 (11 inflight reqs, no further metrics, decode side never sees any stg_* notification). Mooncake is not affected because its TransferInfo.is_dummy is a real dataclass bool field.

Fix: normalize via callable() in the shared helper so it works for both the mooncake (attribute) and NIXL (method) shapes. This is the minimal single-point fix and does not require touching either backend's existing TransferInfo definition or any other call site.

Co-authored-by: Cursor <cursoragent@cursor.com>
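The failure mode is ordinary Python truthiness: a bound method object is always truthy, regardless of what it would return when called. A minimal reproduction, using hypothetical stand-in classes rather than the real TransferInfo definitions:

```python
from dataclasses import dataclass


@dataclass
class AttrStyleInfo:
    """Mooncake-like shape: is_dummy is a plain bool field."""
    is_dummy: bool = False


class MethodStyleInfo:
    """NIXL-like shape after the merge: is_dummy is a regular method."""

    def is_dummy(self) -> bool:
        return False


def is_dummy_normalized(tinfo) -> bool:
    """Work for both shapes by calling the attribute only when it is callable."""
    flag = tinfo.is_dummy
    return bool(flag() if callable(flag) else flag)


# The buggy check: the bound method object itself is truthy, so the branch is always taken.
buggy_check = bool(MethodStyleInfo().is_dummy)

fixed_method = is_dummy_normalized(MethodStyleInfo())
fixed_attr_false = is_dummy_normalized(AttrStyleInfo())
fixed_attr_true = is_dummy_normalized(AttrStyleInfo(is_dummy=True))
```

Here `buggy_check` is `True` even though the method returns `False`, which is why every request looked like a dummy and every STAGING_REQ was skipped.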

@ShangmingCai

…nup)

Addresses the three actionable comments from @ShangmingCai on PR sgl-project#22536:

1. Drop duplicate StagingRegisterInfo import. The class was imported both in the TYPE_CHECKING block and again lazily inside KVArgsRegisterInfo.from_zmq(). Promote it to a module-level runtime import (no circular dep risk -- staging_handler.py only imports stdlib + torch) and remove the redundant lazy import. Keep StagingTransferInfo in TYPE_CHECKING because it is only referenced from a forward-ref annotation.
2. Add an explanatory comment to KVArgsRegisterInfo.staging noting why the optional staging field must remain the LAST field of the dataclass -- from_zmq() relies on positional construction and the staging payload is a variable-length tail of the ZMQ frame.
3. Clean up per-room state in the prefill transfer_worker when the last chunk reports Success. Without this, prefetched_rooms / prefetch_requested / transfer_infos / req_to_decode_prefix_len grew without bound across long-running services as new bootstrap rooms kept arriving (mooncake's transfer_worker already does the equivalent transfer_infos.pop on Success -- this brings NIXL to parity and additionally sweeps the staging-only prefetch sets).

Also pick up an incidental black reformat of one call site in transfer_worker.

Co-authored-by: Cursor <cursoragent@cursor.com>
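The per-room bookkeeping described in item 3, together with the once-per-room prefetch from the earlier review commit, might look roughly like this. `PrefillStateSketch` and its method names are hypothetical; only the field names `prefetched_rooms` and `transfer_infos` come from the commit messages.

```python
from typing import Dict, List, Set


class PrefillStateSketch:
    """Toy model of per-room prefill state; not the real manager class."""

    def __init__(self) -> None:
        self.prefetched_rooms: Set[int] = set()
        self.transfer_infos: Dict[int, dict] = {}
        self.sent_staging_reqs: List[int] = []  # stands in for STAGING_REQ messages

    def prefetch_staging_reqs(self, room: int) -> None:
        """Fan out STAGING_REQ at most once per room, however often it is called."""
        if room in self.prefetched_rooms:
            return
        self.prefetched_rooms.add(room)
        self.sent_staging_reqs.append(room)

    def on_chunk_done(self, room: int, is_last: bool, success: bool) -> None:
        """Drop all per-room state once the last chunk succeeds, so nothing accumulates."""
        if is_last and success:
            self.prefetched_rooms.discard(room)
            self.transfer_infos.pop(room, None)


state = PrefillStateSketch()
state.transfer_infos[7] = {"agent_name": "decode_agent_0"}
state.prefetch_staging_reqs(7)
state.prefetch_staging_reqs(7)  # second call is a no-op
state.on_chunk_done(7, is_last=False, success=True)  # mid-request: state is kept
kept_mid_request = 7 in state.prefetched_rooms
state.on_chunk_done(7, is_last=True, success=True)   # final chunk: state is swept
```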

Trim the two prose blocks added in the previous review-feedback commit down to a single sentence each, keeping the substance (why staging is last, mooncake-parity cleanup) without restating it across multiple lines. Co-authored-by: Cursor <cursoragent@cursor.com>

iyastreb

I have tested it with my benchmark, and these are the results:

# p2d4, Best TTFT of 4 runs
num_prompts  before  after
1            34      32      
2            54      50      
4            57      53      
8            70      59      
16           88      74      
32           150     115
64           226     154
128          425     238
256          749     355
512          2078    1437

# p4d4, Best TTFT of 4 runs
num_prompts  before  after
1            31      31
2            45      45
4            47      48
8            52      51
16           58      60
32           94      97
64           127     127
128          231     182
256          327     314
512          633     577

It significantly speeds up the heterogeneous setup (p2d4), and even the homogeneous one at larger num_prompts (>64).

YAMY1234 · 2026-05-12T16:35:05Z

Thanks @iyastreb for the verification! Since it has been verified by the NIXL team, @ShangmingCai could you take a second look and merge it if it looks good to you? Thanks!😄

ShangmingCai · 2026-05-13T07:01:59Z

I have addressed the rebase request in bcb6a69, since #24932 has been merged. Let me trigger the related CI now.

ShangmingCai · 2026-05-13T07:06:35Z

/rerun-test test/registered/distributed/test_disaggregation_different_tp.py

github-actions · 2026-05-13T07:07:00Z

🚀 8-gpu-h20 (1 test): ❌ View workflow run

cd test/ && python3 registered/distributed/test_disaggregation_different_tp.py

ShangmingCai · 2026-05-13T10:52:29Z

/rerun-test test/registered/distributed/test_disaggregation_different_tp.py

github-actions · 2026-05-13T10:52:52Z

🚀 8-gpu-h20 (1 test): ✅ View workflow run

cd test/ && python3 registered/distributed/test_disaggregation_different_tp.py

…nsfer (sgl-project#22536) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Shangming Cai <csmthu@gmail.com>

YAMY1234 added 15 commits April 7, 2026 11:58

Fix NIXL staging buffer: use cuMemCreate via custom_mem_pool

04f3af9

Remove section divider comments added by refactor

615b513

Rename decode_info -> dst_info for consistency

ea9fe3d

Extract _send_kv_for_req to reduce add_transfer_request complexity

4676ff7

Extract staging notification handlers from update_transfer_status

8065a95

Revert _send_kv_for_req extraction, keep inline dispatch in add_trans…

466ebad

Revert unnecessary changes in staging_handler.py

dbb47c4

Restore original field access style in add_transfer_request

0a04f45

Move handle_watermark_msg and handle_staging_rsp to common staging_ha…

19d7f2d

YAMY1234 requested review from ByronHsu, ShangmingCai and hnyls2002 as code owners April 10, 2026 17:57

YAMY1234 marked this pull request as draft April 10, 2026 17:57

style: format staging buffer code with black

d72e9e6

gemini-code-assist Bot reviewed Apr 10, 2026

Comment thread python/sglang/srt/disaggregation/nixl/conn.py Outdated

Comment thread python/sglang/srt/disaggregation/nixl/conn.py Outdated

YAMY1234 and others added 2 commits April 10, 2026 11:04

fix: limit notification tag split to handle underscores in agent_name

8aedb1b

Merge branch 'main' into feat/nixl-staging-buffer-independent

c9c0d1c

YAMY1234 marked this pull request as ready for review April 17, 2026 06:57

YAMY1234 marked this pull request as draft April 17, 2026 07:14

YAMY1234 marked this pull request as ready for review April 17, 2026 08:46

nvpohanh mentioned this pull request Apr 17, 2026

[Tracking] Qwen3.5-397B (G)B200 Functional Support and Optimizations #20024

Open

ShangmingCai self-assigned this Apr 17, 2026

ShangmingCai reviewed Apr 20, 2026

Comment thread python/sglang/srt/disaggregation/mooncake/conn.py

The Mooncake part LGTM

ShangmingCai reviewed Apr 20, 2026

Comment thread python/sglang/srt/server_args.py

Looks good.

ShangmingCai reviewed May 8, 2026

Comment thread python/sglang/srt/disaggregation/nixl/conn.py Outdated

ShangmingCai reviewed May 8, 2026

Comment thread python/sglang/srt/disaggregation/nixl/conn.py

ShangmingCai reviewed May 8, 2026

Comment thread python/sglang/srt/disaggregation/nixl/conn.py

YAMY1234 and others added 2 commits May 10, 2026 14:41

YAMY1234 and others added 3 commits May 11, 2026 00:24

iyastreb approved these changes May 12, 2026

YAMY1234 changed the title ~~Disagg][NIXL] Add staging buffer support for heterogeneous TP KV transfer~~ [Disagg][NIXL] Add staging buffer support for heterogeneous TP KV transfer May 12, 2026

Merge branch 'main' into feat/nixl-staging-buffer-independent

bcb6a69

ShangmingCai approved these changes May 13, 2026

ShangmingCai merged commit 2a4d382 into sgl-project:main May 13, 2026
63 of 73 checks passed
