Nixl async transfer by ovidiusm · Pull Request #23967 · sgl-project/sglang

ovidiusm · 2026-04-28T21:43:22Z

Taken over from #20680

Motivation

This PR improves the performance of NixlKVManager by making KV transfer asynchronous and multi-threaded on the prefill node. Previously, add_transfer_request performed each chunk transfer synchronously and the caller (NixlKVSender) had to track and poll all transfer handles. With many decode instances and chunked transfers, this caused the prefill scheduler to block on transfer completion and limited throughput. This change aligns NIXL with the queue-based, multi-worker transfer design.

Performance

We ran Qwen3-32B PD disaggregation with NIXL and observed a clear improvement in transfer latency via NIXL telemetry:

Mean transfer time: 162,225 μs → 41,225 μs (about 4× lower).
Distribution: Before, transfer times had high variance with many samples in the 250k–1.2M μs range and a long tail; after the change, the vast majority of samples sit in the 34k–42k μs band with much lower variance and no large outliers.

Async multi-worker transfer removes the synchronous bottleneck on the prefill path: chunks are processed in parallel by worker threads, and decode instances are sharded across queues for better overlap, which explains the lower mean and significantly improved tail (P95/P99) latency.

Modifications

Async transfer with queue + worker pool (PREFILL mode)
- Introduced multiple FastQueue instances (count controlled by SGLANG_DISAGGREGATION_QUEUE_SIZE) and a ThreadPoolExecutor per queue (total worker count from SGLANG_DISAGGREGATION_THREAD_POOL_SIZE).
- Added a TransferKVChunk dataclass and daemon transfer_worker threads that consume chunks from the queues and execute send_kvcache / send_kvcache_slice, maybe_send_extra, and send_aux in the worker.
- Default thread pool size when SGLANG_DISAGGREGATION_THREAD_POOL_SIZE is not set: min(max(4, (0.5 * cpu_count) // 8), 12). The number of queues is read from SGLANG_DISAGGREGATION_QUEUE_SIZE (e.g. 4).
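
As a rough illustration of the default above, the formula can be evaluated for a few CPU counts. This is a sketch only: the helper names are hypothetical, and the integer cast before the floor division is an assumption about how the expression is evaluated, not something the PR text states.

```python
import os
from typing import Mapping, Optional


def default_thread_pool_size(cpu_count: Optional[int] = None) -> int:
    """Evaluate min(max(4, (0.5 * cpu_count) // 8), 12) as an integer."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    # Assumption: half the cores, floor-divided by 8, clamped to [4, 12].
    return min(max(4, int(0.5 * cpu_count) // 8), 12)


def resolve_thread_pool_size(env: Optional[Mapping[str, str]] = None) -> int:
    """Prefer the env var when it is set, otherwise fall back to the formula."""
    env = os.environ if env is None else env
    raw = env.get("SGLANG_DISAGGREGATION_THREAD_POOL_SIZE")
    return int(raw) if raw else default_thread_pool_size()


if __name__ == "__main__":
    for n in (8, 64, 128, 256):
        print(n, default_thread_pool_size(n))
```

Under these assumptions the default stays at 4 workers up to 64 cores and only reaches the cap of 12 on very large hosts, so setting the env var explicitly is the practical way to change it.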
Non-blocking add_transfer_request
- add_transfer_request no longer performs transfer inline; it enqueues a TransferKVChunk to transfer_queues[bootstrap_room % len(transfer_queues)] and returns None.
- Workers update request_status (e.g. Transferring, Success, Failed), so the sender no longer needs to hold or poll transfer handles.
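
The enqueue-and-return pattern described above can be sketched as follows. This is a simplified model, not the PR's code: `queue.Queue` stands in for FastQueue, a single daemon thread per queue replaces the per-queue ThreadPoolExecutor, the `TransferKVChunk` fields are hypothetical, and the actual transfer calls are replaced by a comment.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TransferKVChunk:
    # Hypothetical fields; the PR's dataclass carries the real transfer arguments.
    room: int
    chunk_id: int
    is_last: bool


class SketchKVManager:
    def __init__(self, num_queues: int = 4):
        self.transfer_queues: List[queue.Queue] = [queue.Queue() for _ in range(num_queues)]
        self.request_status: Dict[int, str] = {}
        self._lock = threading.Lock()
        for q in self.transfer_queues:
            threading.Thread(target=self.transfer_worker, args=(q,), daemon=True).start()

    def add_transfer_request(self, room: int, chunk_id: int, is_last: bool) -> None:
        # Non-blocking: shard by room so all chunks of one request share a queue.
        chunk = TransferKVChunk(room, chunk_id, is_last)
        self.transfer_queues[room % len(self.transfer_queues)].put(chunk)

    def transfer_worker(self, q: queue.Queue) -> None:
        while True:
            chunk = q.get()
            try:
                with self._lock:
                    self.request_status[chunk.room] = "Transferring"
                # The real worker would run send_kvcache / send_aux here.
                if chunk.is_last:
                    with self._lock:
                        self.request_status[chunk.room] = "Success"
            except Exception:
                with self._lock:
                    self.request_status[chunk.room] = "Failed"
            finally:
                q.task_done()

    def check_status(self, room: int) -> str:
        with self._lock:
            return self.request_status.get(room, "WaitingForInput")
```

Because the queue index depends only on the room, chunks of one request are consumed in order by the same worker in this model, while different rooms can make progress in parallel.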
NixlKVSender simplifications
- Removed xfer_handles; poll() now relies on kv_mgr.check_status(bootstrap_room) only.
- Added clear() to remove bootstrap_room from request_status when appropriate.
- Last-chunk path no longer deletes request_status in the sender; the worker clears transfer_infos and sets status to Success when the last chunk is done.
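
A minimal sketch of the simplified sender, assuming a manager that exposes `check_status` and a `request_status` dict as described above. `StubKVManager` and `SketchKVSender` are hypothetical names used for illustration only.

```python
class StubKVManager:
    """Hypothetical stand-in for the manager; only what the sender touches."""

    def __init__(self):
        self.request_status = {}

    def check_status(self, room):
        return self.request_status.get(room, "WaitingForInput")


class SketchKVSender:
    def __init__(self, kv_mgr, bootstrap_room):
        self.kv_mgr = kv_mgr
        self.bootstrap_room = bootstrap_room

    def poll(self):
        # No transfer handles to inspect: the manager's status table is the
        # only thing the sender consults.
        return self.kv_mgr.check_status(self.bootstrap_room)

    def clear(self):
        # Drop the room's entry once the request no longer needs it.
        self.kv_mgr.request_status.pop(self.bootstrap_room, None)
```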
Scheduler handling of Bootstrapping
- In prefill.py, requests in KVPoll.Bootstrapping are now treated as undone (together with WaitingForInput and Transferring) so the scheduler does not consider them complete before transfer progress.
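
The scheduler-side rule can be illustrated with a small classifier. The state names come from the text above; the numeric values and the `split_inflight` helper are placeholders for illustration and are not taken from prefill.py.

```python
from enum import IntEnum
from typing import Dict, List, Tuple


class KVPoll(IntEnum):
    # Names follow the PR text; the numeric values here are placeholders.
    Failed = 0
    Bootstrapping = 1
    WaitingForInput = 2
    Transferring = 3
    Success = 4


# States in which a request must stay in the in-flight list.
UNDONE_STATES = frozenset(
    {KVPoll.Bootstrapping, KVPoll.WaitingForInput, KVPoll.Transferring}
)


def split_inflight(polls: Dict[str, KVPoll]) -> Tuple[List[str], List[str]]:
    """Return (undone, done) request ids, preserving input order."""
    undone, done = [], []
    for rid, state in polls.items():
        (undone if state in UNDONE_STATES else done).append(rid)
    return undone, done
```

With asynchronous transfers a request can still be in Bootstrapping when the scheduler polls it, which is why that state has to count as undone in this model.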

Testing

python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --host 127.0.0.1 --port 8000: Accuracy: 0.945 with Qwen/Qwen3-8B
TestDisaggregationAccuracy passes with NIXL (score 0.76, throughput 3949 token/s)

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests. (Regression: ran test_disaggregation_basic.py; 7 tests passed.)
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so. (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci)
After green CI and required approvals, ask Merge Oncalls to merge.

ishandhanani · 2026-04-30T18:42:12Z

/tag-and-rerun-ci

ovidiusm · 2026-04-30T22:52:45Z

/tag-and-rerun-ci

ovidiusm · 2026-05-05T23:52:13Z

@ishandhanani @iyastreb could you please help with review? It's the same PR as #20680 but with conflicts resolved (and fixing the P>D issue from main)

ovidiusm · 2026-05-05T23:53:02Z

FYI @usernamehaha2022

ovidiusm · 2026-05-06T17:18:12Z

/tag-and-rerun-ci

ishandhanani · 2026-05-06T17:20:44Z

/tag-and-rerun-ci

ishandhanani · 2026-05-06T19:59:44Z

/rerun-failed-ci

ShangmingCai · 2026-05-07T09:30:50Z

-        except _NIXL_TRANSPORT_ERRORS as e:
-            logger.warning(
-                f"KVSender check_xfer_state failed for room {self.bootstrap_room}: {e}"
-            )
-            self._send_failed = True
-            self._send_error = e
-            return KVPoll.Failed  # type: ignore
-        if all(x == "DONE" for x in states):
-            if (
-                self._transfer_start_time is not None
-                and self._transfer_metric.transfer_latency_s is None
-            ):
-                self._transfer_metric.transfer_latency_s = (
-                    time.perf_counter() - self._transfer_start_time
-                )
-            return KVPoll.Success  # type: ignore
-        if any(x == "ERR" for x in states):
-            self._send_failed = True
-            self._send_error = RuntimeError(
-                f"NIXL transfer error for room {self.bootstrap_room}"


That's a good point. I have now changed the code to catch exceptions in the worker thread, pass them to the main thread, and raise them from there, so that we can detect _NIXL_TRANSPORT_ERRORS as before. The worker thread still has to catch all exceptions; otherwise it may die on other errors, which could cause hangs.
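
One way to picture the approach described in this reply is the sketch below. It is an assumption-laden model, not the merged code: `FakeTransportError` stands in for whatever exception types `_NIXL_TRANSPORT_ERRORS` groups, and the class and method names are hypothetical.

```python
import queue
import threading


class FakeTransportError(RuntimeError):
    """Stand-in for the exception types grouped under _NIXL_TRANSPORT_ERRORS."""


_TRANSPORT_ERRORS = (FakeTransportError,)


class SketchManager:
    def __init__(self):
        self.jobs = queue.Queue()
        self.status = {}
        self.errors = {}
        self.lock = threading.Lock()
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, room, job):
        self.jobs.put((room, job))

    def _worker(self):
        while True:
            room, job = self.jobs.get()
            try:
                job()
                with self.lock:
                    self.status[room] = "Success"
            except Exception as exc:
                # Catch everything so the worker survives; keep the exception
                # so the polling thread can raise it later.
                with self.lock:
                    self.errors[room] = exc
                    self.status[room] = "Failed"
            finally:
                self.jobs.task_done()

    def poll(self, room):
        with self.lock:
            err = self.errors.pop(room, None)
            status = self.status.get(room, "WaitingForInput")
        if err is None:
            return status
        try:
            raise err  # re-raised on the caller's thread
        except _TRANSPORT_ERRORS:
            return "Failed"  # transport errors are handled as a failed transfer
```

In this model a transport error surfaces as a failed poll, any other exception propagates to the caller of `poll`, and the worker keeps serving later jobs in both cases.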

ShangmingCai

Overall LGTM, but why remove _NIXL_TRANSPORT_ERRORS? I remember this was just added a short while ago.

ShangmingCai

LGTM

Bring in sgl-project#23967 (Nixl async transfer) and other main changes since the last merge.

Conflicts were limited to python/sglang/srt/disaggregation/nixl/conn.py:

1. TransferInfo: kept main's `decode_prefix_len` field + `is_dummy()` method form, appended this PR's `staging` field at the tail. Updated 2 callers in this file from `req.is_dummy` to `req.is_dummy()`.
2. NixlKVManager.__init__ (PREFILL branch): kept this PR's `_init_staging_prefill_ctx()` AND main's `transfer_queues` / `transfer_worker` thread pool. Both run; staging ctx is initialized before workers spawn.
3. add_transfer_request: took main's async enqueue body (puts TransferKVChunk into transfer_queues[room % N], returns None) but kept this PR's `_prefetch_staging_reqs(bootstrap_room)` call before the enqueue. The staging dispatch (`_dispatch_kv_transfer`, `_do_staging_transfer`, `send_kvcache_staged`) is now temporarily dead code: enabling SGLANG_DISAGG_STAGING_BUFFER on NIXL has no effect until the next commit moves staging dispatch into `transfer_worker` (per the mooncake pattern).
4. update_transfer_status: kept this PR's tag-based dispatch (`_track_kv_arrival` / `_handle_stg_notification` / `_handle_aux_notification`) and merged main's "nokv" handling for decode-side radix cache hit (sgl-project#19746) into `_handle_aux_notification`.

After this commit the staging buffer code path is preserved but unused; plain heterogeneous-TP transfers fall back to send_kvcache_slice via the new async worker. The next commit will wire staging into the worker (per-worker staging buffer + deferred re-enqueue on watermark not-ready, matching mooncake).

Co-authored-by: Cursor <cursoragent@cursor.com>

…ke parity)

After the previous merge of sgl-project#23967 (Nixl async transfer), staging buffer dispatch lived only in the now-deleted synchronous path of add_transfer_request, leaving SGLANG_DISAGG_STAGING_BUFFER a no-op on NIXL. This commit ports the staging dispatch into transfer_worker, 1:1 mirroring mooncake's per-worker staging design.

1. PREFILL __init__: build N staging buffers (one per transfer_queue) before workers spawn, and pass each worker its private buffer (NixlKVManager.__init__). Removes the lazy single-buffer creation in set_kv_buffer_tensors -- mooncake-style, staging buffers no longer depend on kv_buffer_tensors.
2. _try_create_staging_strategy(staging_buffer) replaces _get_staging_strategy. Returns a fresh PrefillStagingStrategy bound to the caller's staging buffer. The strategy MUST be a worker-local variable; never cache on self -- multiple workers would race on the same staging ring.
3. transfer_worker(queue, staging_buffer=None) now lazy-creates a per-worker staging_strategy on the first chunk it sees, then for each req in a chunk picks among:
- staging (heterogeneous TP, both sides registered, watermark ready) -> _do_staging_transfer
- send_kvcache (MLA / homogeneous TP)
- send_kvcache_slice (heterogeneous TP, no staging or staging hard-failed for this chunk)
When staging is not ready (watermark/alloc pending), _do_staging_transfer re-enqueues the chunk and signals `staging_deferred=True`; the worker breaks the per-req loop and `continue`s the main loop without advancing room status, so the chunk gets retried on the next pop. Same control-flow as mooncake.transfer_worker.
4. _do_staging_transfer reshaped to (handle, deferred) return tuple:
- (None, True) -> chunk re-enqueued, caller should defer
- (None, False) -> hard fallback, caller should try slice
- (handle, False) -> staging RDMA posted; handle joins the per-chunk handle list and is busy-polled to DONE alongside aux/state handles.
Oversized chunks (cannot ever fit) raise immediately.
5. _dispatch_kv_transfer (the old synchronous-path entry) is removed. add_transfer_request stays a thin enqueue + _prefetch_staging_reqs wrapper.

Notes vs mooncake:
- NIXL workers do NOT need an executor (no per-slice ThreadPoolExecutor); send_kvcache_slice posts a single bulk transfer.
- NIXL workers do NOT send a separate ZMQ CHUNK_READY message: decode observes chunk arrival via the RDMA `stg_*` notification tag posted by send_kvcache_staged, which the decode-side receiver thread already handles.
- Memory: staging pool grows N x (one per worker, default SGLANG_DISAGGREGATION_QUEUE_SIZE=4). Tunable via SGLANG_DISAGG_STAGING_POOL_SIZE_MB.

Co-authored-by: Cursor <cursoragent@cursor.com>
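
The (handle, deferred) contract in this commit message can be modeled with a toy worker loop. Everything below is hypothetical scaffolding: the chunk dictionaries, `CAPACITY`, `fake_do_staging_transfer`, and the readiness counter are invented for the sketch and only mirror the three return shapes and the re-enqueue behavior described in the text.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

CAPACITY = 10  # hypothetical staging buffer capacity


def fake_do_staging_transfer(
    chunk: dict, ready: bool, pending: Deque[dict]
) -> Tuple[Optional[str], bool]:
    if chunk["size"] > CAPACITY:
        # Oversized chunks can never fit, so fail immediately.
        raise ValueError(f"chunk for room {chunk['room']} exceeds staging capacity")
    if chunk.get("staging_failed"):
        return None, False  # hard fallback: caller should try the slice path
    if not ready:
        pending.append(chunk)  # re-enqueue for a later retry
        return None, True  # deferred: caller must not advance status
    return f"handle-{chunk['room']}", False  # transfer posted


def run_worker(chunks: List[dict], ready_after: int) -> List[Tuple[int, str]]:
    pending: Deque[dict] = deque(chunks)
    attempts = 0
    log: List[Tuple[int, str]] = []
    while pending:
        chunk = pending.popleft()
        attempts += 1
        # Pretend the staging buffer becomes ready after `ready_after` pops.
        handle, deferred = fake_do_staging_transfer(chunk, attempts > ready_after, pending)
        if deferred:
            continue  # chunk was re-enqueued; retry on a later pop
        log.append((chunk["room"], "slice" if handle is None else "staged"))
    return log
```

In this toy run a deferred chunk simply moves to the back of the queue, so a later chunk that falls back to the slice path can complete before it.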

github-actions Bot added the run-ci label Apr 30, 2026

ovidiusm marked this pull request as ready for review April 30, 2026 22:52

ovidiusm requested review from ByronHsu, ShangmingCai and hnyls2002 as code owners April 30, 2026 22:52

ovidiusm mentioned this pull request Apr 30, 2026

NixlKVManager: async multi-threaded KV transfer #20680

Open

5 tasks

ovidiusm requested a review from wisclmy0611 as a code owner May 4, 2026 14:36

github-actions Bot added the documentation, quant, and lora labels May 4, 2026

ovidiusm force-pushed the nixl-async-transfer branch from a50ea89 to 616ca55 Compare May 4, 2026 14:42

Nixl async transfer -- rebased onto latest main

28b6504

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>

ovidiusm force-pushed the nixl-async-transfer branch from 616ca55 to 28b6504 Compare May 5, 2026 09:05

Fix P>D notifications

752f02b

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>

ovidiusm mentioned this pull request May 5, 2026

[P/D disagg] - support decode side radix cache #19746

Merged

6 tasks

iyastreb approved these changes May 6, 2026

Merge remote-tracking branch 'upstream/main' into nixl-async-transfer

66f674d

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>

ovidiusm force-pushed the nixl-async-transfer branch from bf1059d to 66f674d Compare May 6, 2026 17:17

ishandhanani approved these changes May 6, 2026

ShangmingCai reviewed May 7, 2026

Comment thread python/sglang/srt/disaggregation/prefill.py Outdated

ShangmingCai reviewed May 7, 2026

ovidiusm added 4 commits May 7, 2026 14:51

Change check_status to default to WaitingForInput (the initial state …

b5d4d17

…after bootstrap) Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>

Merge remote-tracking branch 'upstream/main' into nixl-async-transfer

ef016ca

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>

Handle separately _NIXL_TRANSPORT_ERRORS and generic exceptions

227fa3b

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>

Propagate exceptions for logging and telemetry

ba5ad52

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>

ShangmingCai approved these changes May 7, 2026

ShangmingCai merged commit 811d138 into sgl-project:main May 7, 2026
56 of 64 checks passed

ovidiusm deleted the nixl-async-transfer branch May 7, 2026 14:06

LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026

Nixl async transfer (sgl-project#23967)

e335f70

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>

LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026

Nixl async transfer (sgl-project#23967)

7f54895

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>

michael7193 mentioned this pull request May 26, 2026

[Disagg] Layer-pipelined KV transfer: overlap RDMA with GPU compute #23515

Open


4 participants
