
[UnifiedTree]: Fix Unified HiCache tombstone lock release replay #24972

Merged
ispobock merged 1 commit into sgl-project:main from hzh0425:hybrid_tree/fix-tomstone-lock-release
May 12, 2026

@hzh0425
Collaborator

@hzh0425 hzh0425 commented May 11, 2026

Motivation

Record component node ids skipped during lock acquire and pass them back
to release. This prevents a temporary admission lock from
decrementing SWA or Mamba locks that were acquired later after HiCache
load-back revived a tombstoned device value.

The main race is _lock_node -> init_load_back -> release: acquire sees
a component tombstone and skips it, load-back restores it, then release
must not consume the newer load-back/request lock.
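
The race described above can be sketched in a few lines of Python. This is a simplified illustration, not the code from this PR: `Node`, `ComponentData`, `acquire`, `release`, and `demo` are hypothetical names, and only `skip_lock_node_ids` is taken from the review summary. It assumes a component value of `None` stands for a tombstone and that each component keeps a per-node lock counter.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional, Set


@dataclass
class ComponentData:
    """Hypothetical per-node state for one component (e.g. SWA or Mamba)."""
    value: Optional[str] = None  # None stands for a tombstoned device value
    lock_ref: int = 0


@dataclass
class Node:
    node_id: int
    parent: Optional["Node"] = None
    comp: ComponentData = field(default_factory=ComponentData)


def path_to_root(node: Optional[Node]) -> Iterator[Node]:
    """Yield the node and all of its ancestors."""
    while node is not None:
        yield node
        node = node.parent


def acquire(node: Node) -> Set[int]:
    """Lock every live component on the path and return the skipped node ids."""
    skipped: Set[int] = set()
    for n in path_to_root(node):
        if n.comp.value is None:
            skipped.add(n.node_id)  # tombstone: nothing to lock, remember it
        else:
            n.comp.lock_ref += 1
    return skipped


def release(node: Node, skip_lock_node_ids: Set[int]) -> None:
    """Undo acquire() exactly: leave nodes that acquire() skipped untouched."""
    for n in path_to_root(node):
        if n.node_id in skip_lock_node_ids:
            continue  # any lock here belongs to someone else (e.g. load-back)
        assert n.comp.lock_ref > 0
        n.comp.lock_ref -= 1


def demo() -> tuple:
    root = Node(0, comp=ComponentData(value="kv"))
    child = Node(1, parent=root)  # component value is tombstoned

    skipped = acquire(child)      # temporary admission lock skips child

    # Load-back revives the tombstone and takes its own lock on child.
    child.comp.value = "kv"
    child.comp.lock_ref += 1

    release(child, skipped)       # must not consume the load-back lock
    return root.comp.lock_ref, child.comp.lock_ref


if __name__ == "__main__":
    print(demo())
```

In this sketch, a release that ignored the skipped set would decrement the revived node's counter and drop the lock that load-back had just taken, which is the failure mode the motivation describes.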

Modifications

Accuracy Tests

Speed Tests and Profiling


@hzh0425
Collaborator Author

hzh0425 commented May 11, 2026

/rerun-test test/registered/unit/mem_cache/test_unified_radix_cache_unittest.py test/registered/radix_cache/test_unified_radix_cache_kl.py

@github-actions
Contributor

github-actions Bot commented May 11, 2026

🚀 1-gpu-5090 (1 test): ✅ View workflow run

cd test/ && python3 registered/unit/mem_cache/test_unified_radix_cache_unittest.py

🚀 4-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/radix_cache/test_unified_radix_cache_kl.py

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request fixes a bug in the unified cache locking mechanism where temporary locks could incorrectly release nodes that were restored from tombstones while the lock was held. It introduces skip_lock_node_ids in the lock parameters to track nodes skipped during acquisition, ensuring that the release operation accurately mirrors the acquisition state. The fix is implemented for both SWA and Mamba components, and new unit tests have been added to verify that restored tombstones are not prematurely released. I have no feedback to provide as there were no review comments.

Comment thread python/sglang/srt/managers/schedule_policy.py
@baoskee

baoskee commented May 12, 2026

I've been running write_through_selective on this branch for the last 8 hours. I will let you know if it is stable. Thank you!

@ispobock ispobock merged commit 91907b7 into sgl-project:main May 12, 2026
178 of 195 checks passed
@baoskee

baoskee commented May 12, 2026

Hi, I am still getting the same error:

         ^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/managers/schedule_policy.py", line 878, in add_one_req
    new_indices, req.last_node = self.tree_cache.init_load_back(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_radix_cache.py", line 1390, in init_load_back
    loading_values = self.load_back(last_node, mem_quota, req=req)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_radix_cache.py", line 1240, in load_back
    t = comp.build_hicache_transfers(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_cache_components/swa_component.py", line 448, in build_hicache_transfers
    assert cd.host_value is not None or cd.value is not None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

[2026-05-12 06:06:13 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 4001, in run_scheduler_process
    scheduler.run_event_loop()
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1521, in run_event_loop
    dispatch_event_loop(self)
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 3871, in dispatch_event_loop
    scheduler.event_loop_overlap()
  File "/home/baoskee/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1570, in event_loop_overlap
    batch = self.get_next_batch_to_run()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2538, in get_next_batch_to_run
    new_batch = self.get_new_batch_prefill()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2592, in get_new_batch_prefill
    ret = self._get_new_batch_prefill_raw(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2723, in _get_new_batch_prefill_raw
    res = adder.add_one_req(
          ^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/managers/schedule_policy.py", line 878, in add_one_req
    new_indices, req.last_node = self.tree_cache.init_load_back(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_radix_cache.py", line 1390, in init_load_back
    loading_values = self.load_back(last_node, mem_quota, req=req)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_radix_cache.py", line 1240, in load_back
    t = comp.build_hicache_transfers(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_cache_components/swa_component.py", line 448, in build_hicache_transfers
    assert cd.host_value is not None or cd.value is not None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

[2026-05-12 06:06:13] SIGQUIT received. signum=None, frame=None. It usually means one child failed.

It looks like all 4 workers failed, but over time rather than all at once (see the TTFT spike, after which no requests are served).
[Screenshot 2026-05-11 at 11 59 29 PM]

Same configuration:

CUDA_VISIBLE_DEVICES=0,1 \
SGLANG_ENABLE_UNIFIED_RADIX_TREE=1 \
sglang serve \
  --model-path google/gemma-4-31B-it \
  --host 0.0.0.0 --port 30001 \
  --tp 2 \
  --context-length 32768 \
  --mem-fraction-static 0.7 \
  --kv-cache-dtype fp8_e4m3 \
  ...
  --enable-cache-report \
  --enable-metrics \
  --trust-remote-code \
  --page-size 64 \
  --enable-hierarchical-cache \
  --hicache-ratio 1 \
  --hicache-write-policy write_through_selective \
  --hicache-storage-prefetch-policy wait_complete \
  --hicache-io-backend kernel \
  --hicache-mem-layout page_first \
  --trust-remote-code \
  --log-level info

@baoskee

baoskee commented May 12, 2026

[Screenshot 2026-05-12 at 12 02 23 AM] Our full dashboard, if this helps.

@hzh0425
Collaborator Author

hzh0425 commented May 12, 2026

@baoskee how can I reach you on Slack?

BTW, I noticed that in your setup, hicache-ratio=1, which means the host memory size is equal to the device memory size.
Usually, we set it to at least 2.

@baoskee

baoskee commented May 12, 2026

I sent an invitation on Slack to hzh0425@apache.org. I had to set hicache-ratio=1 because our node did not have enough system RAM.

For some reason, the worker processes crash on startup (they won't even run) if I set hicache-ratio=2.

The nodes are B200s on a machine with 1800 GiB of RAM. You would think that's enough RAM...

SpencerGarnets added a commit to ai-blaise/optimization-playground that referenced this pull request May 12, 2026
…ack)

Brings in upstream sgl-project/sglang main commits since
096ad02 (merge base, Laguna-XS.2 model support).
Total: 28 upstream commits composed.

Custom-stack files preserved intact (entirely-ours, byte-identical to
origin/main):
  - Blackwell CuTe kernel suite (warp_decode_cute, g1_attention_cute,
    gated_norm_cute, layersplit_cute, fused_store_index_cache)
  - TurboQuant 2.5-bit dense KV cache path
  - HIGGS 2-bit dense KV cache path (with split-K decode)
  - NVFP4 IndexCache dispatcher (active gate)
  - quantization_config_dispatch (HF-config-driven runtime routing)
  - All custom server-args flags and runtime methods preserved

Verification:
  - 200+ merged Python files compile cleanly
  - Dispatcher symbol presence verified
  - HIGGS pool / TurboQuant pool classes present at expected lines
  - compressed_tensors_w4a4_nvfp4_moe imports clean
  - All custom server-args flags present (enable_higgs_dense_2bit_kv_cache,
    enable_turboquant_dense_kv_cache, turboquant_dense_kv_preset,
    indexer_quantization_declared, higgs_mla_decode_num_splits, etc.)

Manual-merged shared files (auto-merge gave broken/mixed output; cleaned
up post-merge):
  - python/sglang/srt/disaggregation/mooncake/conn.py: upstream's PR#24932
    refactored maybe_send_extra into a state-types-loop. Replayed our
    LayerSplit NSA state-index-length-mismatch check inside the SWA/NSA
    branch of the new loop body.
  - sgl-kernel/python/sgl_kernel/__init__.py: upstream's PR#23449 (Apple
    Silicon Metal kernel) wrapped the entire module body in
    `if darwin/arm64: from sgl_kernel.metal import * else: ...`. The
    auto-merge duplicated the file body; rewrote cleanly with upstream's
    structure and re-injected our `g1_gate_forward`,
    `warp_decode_cute_moe_forward`, and
    `warp_decode_cute_moe_packed_forward` imports plus `g1_gate_forward`
    in _DEBUG_EXPORT_NAMES.
  - python/sglang/srt/managers/scheduler_output_processor_mixin.py: line
    628 still referenced `result.num_accepted_drafts` (renamed by PR
    sgl-project#25038 to `num_correct_drafts`). Renamed in place.
  - python/sglang/srt/observability/scheduler_metrics_mixin.py: a block
    around the spec-decode logging path had mixed old/new names from
    auto-merge (lines 553/557/560). Renamed `spec_num_accepted_tokens`
    -> `spec_num_accept_tokens` and local `num_accepted_drafts` ->
    `num_correct_drafts` to match the rest of the file.
  - test/test_smc_info.py: stub Req mock used the old field names
    `spec_accepted_drafts` and `update_spec_acceptance_histogram`.
    Renamed to `spec_num_correct_drafts` and
    `update_spec_correct_drafts_histogram` per PR sgl-project#24081.

Auto-merge cleanly integrated upstream changes to:
  - server_args.py (new fields: prefill_only_disable_kv_cache,
    weight_loader_drop_cache_after_load, prefill_delayer_queue_min_ratio,
    prefill_delayer_max_delay_ms, speculative_draft_window_size, etc.)
  - mem_cache/memory_pool.py (new NoOpMHATokenToKVPool)
  - model_executor/model_runner_kv_cache_mixin.py (NoOpMHATokenToKVPool
    pool factory + _validate_prefill_only_disable_kv_cache_pool_family)
  - layers/attention/nsa_backend.py (spec rename
    num_accepted_drafts -> num_correct_drafts;
    num_accepted_tokens -> num_accept_tokens)
  - layers/attention/nsa/nsa_indexer.py (new _apply_q_scale_and_softmax_scale
    compile method; torch.mm replaces deep_gemm wrapper)
  - 28+ disaggregation/spec/runner files with mostly clean
    upstream-side-only integration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

----- upstream commit subjects (28) -----
fd3eb77 [Cookbook]: add Laguna-XS.2 (Poolside) (sgl-project#24730)
6be1a45 Fix swa component host hit (sgl-project#25085)
693f497 [NPU] use causal_conv1d_update_v2 for performance (sgl-project#24595)
1efe9e2 [Bug Fix] Reject incompatible combination of --disable-cuda-graph-padding and --enable-torch-compile (sgl-project#23903)
8d27ce7 Optimize uvicorn startup command (sgl-project#25041)
b35fd5f [fix] skip legacy minicpmv conv template for MiniCPM-V 4.6 (sgl-project#24998)
7582237 [Tiny Fix] Disable BCG when inner layer_model unresolved (sgl-project#25021)
ca3bc05 Deepseek-v4-Pro share expert tp1 (sgl-project#24949)
a72d3ae [Spec] Multi-layer mamba scatter cleanup; fix positional call bug (sgl-project#25030)
7128533 Revert "Migrate Intel CPU cases to the test/registered." (sgl-project#25044)
1f985c5 [Spec] Rename `accepted_indices` -> `accept_indices`; drop `_token_id` suffix per Rule 5 (sgl-project#25038)
ecf5d84 Migrate Intel CPU cases to the test/registered. (sgl-project#22670)
d7f4761 [PD] Refactor hybrid state transfer (sgl-project#24932)
91907b7 [UnifiedTree]: Fix Unified HiCache tombstone lock release replay (sgl-project#24972)
4ad63ad [Spec] Rename `accepted_drafts` -> `correct_drafts` for unambiguous naming (sgl-project#24081)
6bfb365 [PD] Rate limit prefill inflight polling warnings (sgl-project#24967)
6bb79c1 [Linear Attn] Add CUSTOM enum and plugin extensibility for kernel backends (sgl-project#24937)
cfc41d5 Fix kimi k2.5 mla eagle + dp attention (sgl-project#25033)
0f3932c [Fix] Qwen3-ASR config: set thinker_config before super().__init__ (sgl-project#24187)
f526e3f [Spec] Mamba scatter cleanup; fix multi-layer positional bug; dflash naming (sgl-project#25029)
10375a1 [NIXL][XPU] Fix uint64 overflow for mismatched P/D TP sizes (e.g. prefill_tp=1, decode_tp=2) (sgl-project#24648)
0a37d24 [diffusion] hardware: support sage attention backend on MUSA (attn backend, 21/N) (sgl-project#24752)
5495026 [HiCache] feat: default storage prefetch timeout (sgl-project#23309)
186eb42 Feat: Support SWA (Sliding Window Attention) for EAGLE-3 drafter (sgl-project#24664)
a75b79e Feat: Support newer EAGLE-3 drafters (sgl-project#24663)
f3a8189 [Spec] Internal rename per N2 v2 naming rule (sgl-project#25014)
bfc2eda [MUSA] Use MUSA-optimized operators in piecewise CUDA graph (sgl-project#23633)
74d70af [Apple Silicon] Add Metal kernel support in sgl-kernel (sgl-project#23449)
@hzh0425 hzh0425 deleted the hybrid_tree/fix-tomstone-lock-release branch May 13, 2026 16:31