[P/D disagg] - support decode side radix cache #19746

ishandhanani wants to merge 57 commits into main
Conversation
@ishandhanani Can this feature be understood as follows: Current status of pd-disagg: Based on this PR's implementation:
Yep. This is correct.
- Set req.prefix_indices in _pre_alloc so init_next_round_input(None) computes extend_input_len correctly from the cached prefix length. Without this, prepare_for_prebuilt runs a full-length extend instead of a delta extend.
- Always call inc_lock_ref on the matched node (even on empty match) to match aggregated scheduler behavior. Prevents lock_ref underflow when cache_finished_req unconditionally calls dec_lock_ref.
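The first fix can be sketched with a hypothetical helper (the real computation lives inside `init_next_round_input`; the function name here is illustrative only): with `prefix_indices` populated, only the tokens beyond the cached prefix are extended.

```python
# Illustrative sketch, not sglang's actual code: with prefix_indices set,
# the extend length is the delta beyond the cached prefix rather than the
# full sequence.
def compute_extend_input_len(fill_ids: list[int], prefix_indices: list[int]) -> int:
    return len(fill_ids) - len(prefix_indices)

# Empty prefix (the bug described above): the full length is extended.
full = compute_extend_input_len([1, 2, 3, 4, 5, 6], [])

# A 4-token cached prefix: only the 2-token delta is extended.
delta = compute_extend_input_len([1, 2, 3, 4, 5, 6], [10, 11, 12, 13])
```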
Next step is testing with a larger model on B200. The step after that (maybe in a follow-up) is to do the same for mooncake.
@ishandhanani There seems to be a constraint here:
Could you share the exact command you used to run this? I'd like to reproduce it and test it on my side.
There are a few things here.
How is the warmup request handled? IIUC it has a fake bootstrap addr, so its KV is garbage and should not be inserted?
please address all the performance issues. this is actually not expected. |
Can you explain why this is not expected? If we have a larger batch being driven through the decode, we might take a slight hit on ITL. I might be missing something, so please feel free to correct me.
Remove the if/else branch for disable_radix_cache in pop_preallocated. _match_prefix_and_lock works with both RadixCache and ChunkCache — ChunkCache returns prefix_len=0 and lock/unlock are no-ops.
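A toy sketch of why one code path suffices (stub classes, not the real sglang types): a ChunkCache-like cache can report an empty match and treat lock/unlock as no-ops, so the caller never needs to branch on the cache type.

```python
class ChunkCacheStub:
    """Stand-in for ChunkCache: no prefix reuse, lock/unlock are no-ops."""
    root_node = None

    def match_prefix(self, token_ids):
        return [], self.root_node  # prefix_len == 0, last_node == root

    def inc_lock_ref(self, node):
        pass  # no-op

    def dec_lock_ref(self, node):
        pass  # no-op


def match_prefix_and_lock(cache, token_ids):
    """Single code path usable with either a radix cache or a chunk cache."""
    prefix_indices, last_node = cache.match_prefix(token_ids)
    cache.inc_lock_ref(last_node)  # safe even on an empty match
    return prefix_indices, last_node
```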
Add fail-fast guards for --disaggregation-decode-enable-radix-cache with:
- speculative decoding (checked in server_args)
- SWA hybrid models (checked in scheduler init, model-dependent)
- Mamba/SSM hybrid models (checked in scheduler init, model-dependent)
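The fail-fast pattern in a minimal sketch (function and argument names are assumptions, not sglang's actual API; the real checks live in server_args and scheduler init): raise at startup rather than failing mid-serving.

```python
# Illustrative guard only; names are hypothetical stand-ins for the real
# server_args validation described in the commit above.
def check_decode_radix_cache_compat(
    enable_decode_radix_cache: bool,
    speculative_algorithm: "str | None",
) -> None:
    if enable_decode_radix_cache and speculative_algorithm is not None:
        raise ValueError(
            "--disaggregation-decode-enable-radix-cache is not supported "
            "with speculative decoding"
        )
```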
Running requests' pages are already inserted into the radix tree during process_prebuilt -> cache_unfinished_req, so they should be properly accounted for by check_memory. The guard was masking potential accounting issues rather than fixing them.
For that long common prefix, the memory traffic of loading KV cache does not increase proportionally to batch size. Also, the 8x TTFT speedup only leads to 3x-ish concurrency improvements, which means the engine is not properly loaded. (8x looks sus to me)
…check" This reverts commit 7856688.
I have found a similar effort, #22234. Should we arrange a meeting and gather all these people who are interested in this feature and settle on a design together?
Oh nice. Yea, that would be good. They do some things very well. Maybe over Slack?
- Remove unnecessary running_batch guard in decode idle memory check: running requests' pages are already in the tree via process_prebuilt -> cache_unfinished_req.
- Move start_send_idx update after send() to prevent cursor advance on skipped/failed sends.
- Add TP sync guards in _update_handshake_waiters and pop_transferred to prevent gloo hangs from queue size divergence across TP ranks.
- Add diagnostics to Bug #14 KV cache full assertion in _pre_alloc.

Cursor fix and TP sync guards inspired by #22234 (yudian0504). Validated on sa-b200 Qwen 32B 3P1D at concurrency 128 and 256.
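The cursor fix can be illustrated with a toy sender loop (all names here are stand-ins, not the real NIXL connector code): advancing the cursor only after a successful send means a failed chunk is retried on the next call rather than silently skipped.

```python
# Toy model of the start_send_idx fix; sender and chunk types are invented.
def send_pending(chunks, send, start_send_idx: int) -> int:
    """Return the new cursor; it never advances past a failed send."""
    for i in range(start_send_idx, len(chunks)):
        if not send(chunks[i]):
            return i  # resume from the failed chunk on the next call
        start_send_idx = i + 1  # advance only after send() succeeds
    return start_send_idx
```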
req.req_pool_idx, : len(req.fill_ids)
]
kv_indices = kv_indices_orig[: req.kv_committed_len]
Let's check the different scenarios:
- Incremental transfer & success:
- inc_lock_ref(pop_preallocated) → dec_lock_ref+inc_lock_ref(cache_unfinished_req) → dec_lock_ref(cache_finished_req)
- Full transfer & success:
- inc_lock_ref(get_new_prebuilt_batch) → dec_lock_ref+inc_lock_ref(cache_unfinished_req) → dec_lock_ref(cache_finished_req)
- Incremental transfer & failure:
- inc_lock_ref(pop_preallocated) → dec_lock_ref(cache_finished_req via abort and release)
- Full transfer & failure:
- no inc_lock_ref → dec_lock_ref(root_node) # no-op
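The four scenarios can be checked against a toy model of the tree's lock-ref walk (in sglang, inc_lock_ref/dec_lock_ref follow parent pointers up to the root; this sketch keeps only that behavior and drops everything else):

```python
# Minimal stand-in for the radix tree's lock-ref bookkeeping.
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.lock_ref = 0

def inc_lock_ref(node, root):
    while node is not root:  # calls on root are no-ops
        node.lock_ref += 1
        node = node.parent

def dec_lock_ref(node, root):
    while node is not root:
        node.lock_ref -= 1
        node = node.parent

root = Node()
leaf = Node(parent=root)

# Scenario 1: incremental transfer & success
inc_lock_ref(leaf, root)                             # pop_preallocated
dec_lock_ref(leaf, root); inc_lock_ref(leaf, root)   # cache_unfinished_req
dec_lock_ref(leaf, root)                             # cache_finished_req

# Scenario 3: incremental transfer & failure
inc_lock_ref(leaf, root)                             # pop_preallocated
dec_lock_ref(leaf, root)                             # abort and release

# Scenarios 2 and 4 operate on root, so every call is a no-op.
dec_lock_ref(root, root)
```

After all scenarios, every lock_ref is back to its starting value, which is the balancing invariant the table above describes.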
Verified all four scenarios with unit tests in 4f2c34e (test/registered/unit/mem_cache/test_decode_radix_lock_ref.py). All pass.
Your analysis is correct. One nuance worth noting: inc_lock_ref and dec_lock_ref both use while node != root_node, so calls on root are no-ops. This means scenarios 2 and 4 (full transfer, where last_node = root) have trivially balanced lock_refs — the real balancing work happens in scenarios 1 and 3 (incremental transfer with actual prefix nodes).
The tests cover:
- Incremental transfer & success — exercises the full inc → dec+inc → dec chain ✓
- Full transfer & success — confirms root is no-op, leaf inc/dec balances ✓
- Incremental transfer & failure — inc_lock_ref then cache_finished_req(is_insert=False) properly dec's ✓
- Full transfer & failure — dec_lock_ref(root) is a safe no-op ✓
- Repeated incremental (5 iterations) — no cumulative lock_ref drift ✓
Warmup requests use FAKE_BOOTSTRAP_HOST, so no actual KV transfer occurs — the decode side runs forward passes on uninitialized memory. With decode radix cache enabled, cache_unfinished_req / cache_finished_req insert this garbage into the tree, where it could be matched by real requests sharing the same prefix tokens. Fix: call flush_cache after both warmup paths (custom --warmups and the general _execute_server_warmup) complete in disagg mode. The flush is safe because it runs before ServerStatus.Up is set, so no real traffic can race with it.
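The ordering argument can be sketched with a toy server (class and attribute names are invented, not the real sglang server code): the flush sits between warmup and the point where the status flips to Up, so no real request can observe the garbage entries.

```python
# Toy model of the flush-before-Up ordering; not the actual server code.
class ToyDecodeServer:
    def __init__(self):
        # Warmup in disagg mode can insert garbage KV into the tree.
        self.cache = {"warmup-prefix": "garbage-kv"}
        self.status = "Starting"

    def flush_cache(self):
        self.cache.clear()

    def finish_startup(self, disagg_mode: bool):
        if disagg_mode:
            self.flush_cache()  # runs before the status flips to Up
        self.status = "Up"      # only now can real traffic match the cache

server = ToyDecodeServer()
server.finish_startup(disagg_mode=True)
```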
Addressed in a6b6e8b.
Extract inline floor-alignment math into a shared helper in disaggregation/utils.py. Used in two places: prefix_len alignment and SWA window_start alignment.
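A sketch of such a floor-alignment helper (the name and exact signature are assumptions based on the commit message):

```python
# Hypothetical shared helper: round a length down to an alignment boundary,
# e.g. a KV page size or SWA window start.
def floor_align(value: int, alignment: int) -> int:
    """Round value down to the nearest multiple of alignment."""
    return (value // alignment) * alignment
```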
Revert cache_unfinished_req back to upstream: it now uses req.fill_ids directly without truncation. The decode-specific truncation to kv_committed_len is done at the two decode call sites instead: - _pre_alloc: when fill_ids is first set - get_new_prebuilt_batch: after init_next_round_input resets fill_ids This addresses review feedback (hzh0425, yudian0504) that modifying RadixTree internals is too aggressive — the truncation is a decode-disagg concern, not a tree concern.
Restore getattr(config, "rope_theta", 1000000) from main — our branch had config.rope_parameters["rope_theta"] from a bad merge resolution, which breaks on checkpoints without rope_parameters.
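The difference is easy to demonstrate with a bare config object (SimpleNamespace stands in for a HF config): the dict lookup raises on checkpoints without rope_parameters, while getattr falls back to the default.

```python
from types import SimpleNamespace

# A checkpoint config with neither rope_theta nor rope_parameters.
config = SimpleNamespace()

# Upstream behavior: falls back cleanly to the default.
rope_theta = getattr(config, "rope_theta", 1000000)

# The bad-merge variant raises on such checkpoints.
try:
    bad = config.rope_parameters["rope_theta"]
except AttributeError:
    bad = None
```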
…-decode # Conflicts: # python/sglang/srt/disaggregation/nixl/conn.py
Tests the four inc_lock_ref/dec_lock_ref scenarios identified by yudian0504 in PR #19746 review:
1. Incremental transfer & success (prefix match > 0)
2. Full transfer & success (no prefix match, root)
3. Incremental transfer & failure (prefix match, is_insert=False)
4. Full transfer & failure (root, dec_lock_ref is no-op)
5. Repeated incremental transfers (no lock_ref leak)

Key invariant: after each scenario, protected_size()==0 and root lock_ref==1 (unchanged).
Summary
In PD disaggregation, the decode worker can now use radix cache to reuse shared prefixes and request only the delta KV from prefill instead of transferring the full prefix on every turn.
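The delta idea in one function (an illustrative sketch; the real logic spans the NIXL connector and scheduler, and the function name here is invented): given the decode side's cached prefix length, prefill only needs to transfer the remaining KV slots.

```python
# Illustrative only: half-open [start, end) range of KV slots that prefill
# must send, given how much prefix the decode side already holds.
def kv_transfer_range(prefill_kv_len: int, decode_prefix_len: int):
    start = min(decode_prefix_len, prefill_kv_len)
    return start, prefill_kv_len
```

When decode has nothing cached the range covers everything; otherwise only the delta beyond the shared prefix moves over the wire.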
This is enabled with `--disaggregation-decode-enable-radix-cache` on the decode server. For now, this path is supported only with `--disaggregation-transfer-backend nixl`; `server_args.py` now rejects other transfer backends early when the decode radix cache flag is enabled. Mooncake support will follow in a separate PR.

Main Changes

- Send `decode_prefix_len` from decode to prefill for the NIXL path.
- Add `--disaggregation-decode-enable-radix-cache`.
- Require `--disaggregation-transfer-backend nixl` when that flag is set.

Interface
Enable decode radix cache on the decode worker with `--disaggregation-decode-enable-radix-cache`. Prefill continues to run with `--disaggregation-transfer-backend nixl`.

Note: DP attention is still experimental here. The flag is allowed, but good cache hit rates require prefix-aware DP routing.
Benchmark
Setup
Qwen/Qwen3-32B, FP8 KV cache, 3P1D, TP=2 per worker

Results
Decode-side logs show the reason for the throughput gain: baseline decode ran near KV capacity (`token_usage` ~ 0.99) and only fit ~37 running requests, while decode radix cache reduced duplicate prefix residency (`token_usage` ~ 0.75) and fit roughly 104-126 running requests. The ITL regression is expected from the larger decode batch.

Test Plan
`nixl` in `server_args.py`