[mem] Add v2 hybrid-pool (KV + MAMBA) support to HiCacheHF3FS #22601
parasol-aser wants to merge 1 commit into
Conversation
Implements the HiCacheStorage v2 interface for the 3FS backend so that hybrid models (Mamba/linear-attention, and in the future DSA) can offload both KV pages and auxiliary per-pool state to 3FS via HybridCacheController.

- Introduce _Hf3fsPoolEngine: a per-pool bundle of (file, client list, executor, metadata client, rank namespace, is_zero_copy, skip_backup) so each registered host pool has its own 3FS file and metadata scope.
- Construct the KV engine in __init__ so v1 callers keep working unchanged.
- Implement register_mem_host_pool_v2 to lazily allocate auxiliary (MAMBA/...) engines with their own preallocated files, clients and metadata namespaces. Idempotent and order-agnostic.
- Implement batch_exists_v2 / batch_get_v2 / batch_set_v2 mirroring the HiCacheFile semantics, including ALL_PAGES and TRAILING_PAGES hit policies, min-across-pools final hit, and per-pool result dicts.
- Refactor _batch_get / _batch_set to take an engine argument so both v1 and v2 entry points share the same IO core.
- Key namespacing: auxiliary pools prefix the metadata key with the pool name; KV keeps the bare key for backwards compatibility. MHA zero-copy -k/-v suffixing remains strictly KV-scoped.
- Per-pool skip_backup so MLA rank>0 still skips KV but backs up MAMBA on every rank. Fixes a pre-existing bug where skip_backup returned a scalar True instead of a per-key list.
- close() now iterates all engines; _engines is populated before the SIGTERM handler is installed.

Test plan:

- New test/registered/hicache/test_hicache_storage_3fs_hybrid.py uses the mock HF3FS client to cover: construction sanity, KV-only v2 fallback, ALL_PAGES and TRAILING_PAGES exists semantics, v2 set/get round-trip, MHA zero-copy + mamba interplay, MLA skip_backup KV-only scoping, partial-pool failure, and a no-pool error contract.
- Extended test_hicache_storage_3fs_backend.py with a TestHf3fsBackendHybrid end-to-end test for a hybrid model, gated on model availability.

Scope: PoolName.KV + PoolName.MAMBA.
DSA is deferred until a caller exists (see PLAN.md §3 and Appendix B).

Tracking issue: sgl-project#22572
Reference PRs: sgl-project#21259, sgl-project#20457

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review
This pull request implements a multi-pool storage interface for HiCacheHF3FS to support hybrid models, such as those combining KV and MAMBA pools. It introduces a per-pool engine architecture, lazy registration of auxiliary pools, and updated batch operations (exists, get, set) that handle multiple pools with configurable hit policies like ALL_PAGES and TRAILING_PAGES. Feedback focuses on improving the robustness of index releasing during failures, ensuring unique file paths for auxiliary pools across ranks to prevent collisions, and refactoring duplicated logic for pool name normalization and page size calculations.
```diff
 if indices:
-    self.metadata_client.confirm_write(
-        self.rank, [], [index[1] for index in indices]
+    engine.metadata_client.confirm_write(
+        engine.metadata_rank, [], [index[1] for index in indices]
     )
```
Releasing all indices in the indices list on a length mismatch can lead to data corruption. If is_written was True for some entries, those indices refer to existing pages that should not be released. Additionally, index[1] could be -1 if allocation failed for some keys. It's safer to filter the list to only release newly allocated, valid indices.
Suggested change:

```diff
-if indices:
-    engine.metadata_client.confirm_write(
-        engine.metadata_rank, [], [index[1] for index in indices]
-    )
+# free allocated pages
+if indices:
+    to_release = [idx for is_written, idx in indices if not is_written and idx != -1]
+    if to_release:
+        engine.metadata_client.confirm_write(
+            engine.metadata_rank, [], to_release
+        )
```
```python
# Strip the trailing rank suffix from the KV path (".<rank>") so we
# can substitute the *original* (per-rank) rank for aux pools.
kv_rank_suffix = f".{self.rank}"
if base.endswith(kv_rank_suffix):
    base = base[: -len(kv_rank_suffix)] + f".{self._original_rank}"
return f"{base}.{pool_name}{ext or '.bin'}"
```
The current logic for deriving the auxiliary pool file path relies on finding and replacing the KV rank suffix. If the file_path does not contain the expected .{self.rank} suffix (e.g., a shared base path), multiple ranks will collide on the same auxiliary file path. It is safer to always explicitly include the _original_rank in the auxiliary path to ensure isolation.
Suggested change:

```diff
-# Strip the trailing rank suffix from the KV path (".<rank>") so we
-# can substitute the *original* (per-rank) rank for aux pools.
-kv_rank_suffix = f".{self.rank}"
-if base.endswith(kv_rank_suffix):
-    base = base[: -len(kv_rank_suffix)] + f".{self._original_rank}"
-return f"{base}.{pool_name}{ext or '.bin'}"
+# Strip the trailing rank suffix from the KV path (".<rank>") if present
+kv_rank_suffix = f".{self.rank}"
+if base.endswith(kv_rank_suffix):
+    base = base[: -len(kv_rank_suffix)]
+# Always include the original rank to ensure isolation.
+return f"{base}.{self._original_rank}.{pool_name}{ext or '.bin'}"
```
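A minimal standalone version of the suggested derivation, showing why always appending the original rank avoids the collision the reviewer flags (function name and sample paths are hypothetical):

```python
def aux_file_path(base: str, rank: int, original_rank: int,
                  pool_name: str, ext: str = "") -> str:
    # Strip the KV rank suffix if present, then always append the
    # original rank so auxiliary paths never collide across ranks,
    # even when the base path carries no rank suffix at all.
    kv_rank_suffix = f".{rank}"
    if base.endswith(kv_rank_suffix):
        base = base[: -len(kv_rank_suffix)]
    return f"{base}.{original_rank}.{pool_name}{ext or '.bin'}"

# Shared base path (the problematic case): ranks still diverge.
print(aux_file_path("/data/hicache", 0, 0, "mamba"))  # /data/hicache.0.mamba.bin
print(aux_file_path("/data/hicache", 0, 1, "mamba"))  # /data/hicache.1.mamba.bin
```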
```python
except Exception as e:
    # The mini metadata server raises when its free-page list and
    # key-to-index map are both empty (over-allocation past the
    # configured num_pages). Surface that to the caller as a per-key
    # failure list rather than letting the exception escape — the
    # v2 contract guarantees a List[bool] result so the controller
    # can attribute the loss to this specific pool. PLAN.md §4 #5.
    logger.warning(
        "[Rank %s] HiCacheHF3FS batch_set (%s) capacity exhausted: %s",
        engine.metadata_rank,
        engine.pool_name,
        e,
    )
    return [False] * len(keys)
```
Catching a broad Exception here might mask unexpected errors (e.g., network issues, programming errors) as simple capacity exhaustion. It is recommended to catch more specific exceptions if possible, or at least include the stack trace in the log to aid debugging.
Suggested change:

```diff
     logger.warning(
         "[Rank %s] HiCacheHF3FS batch_set (%s) capacity exhausted: %s",
         engine.metadata_rank,
         engine.pool_name,
         e,
+        exc_info=True,
     )
     return [False] * len(keys)
```
```python
self._next_pool_idx += 1
# Use the original (per-rank) rank as the base. Mamba/DSA state is
# per-rank — even in MLA — so the metadata namespace must be too.
return self._original_rank + (self._pool_idx_map[pool_name] << 24)
```
Using a fixed shift of 24 bits for the pool namespace assumes that _original_rank will always be less than 2^24 (16,777,216); otherwise the pool-index bits would overlap the rank bits and two metadata namespaces could collide. Consider asserting this bound explicitly or documenting the assumption.
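A quick sketch of the bound implied by the 24-bit shift; the helper name and the explicit guard are illustrative, not the PR's actual code:

```python
POOL_IDX_BITS = 24  # the fixed shift noted in the review comment

def metadata_rank(original_rank: int, pool_idx: int) -> int:
    # If original_rank ever reached 2**24, its high bits would overlap
    # the pool-index bits and two (rank, pool) pairs could collide.
    if original_rank >= (1 << POOL_IDX_BITS):
        raise ValueError(
            f"rank {original_rank} does not fit in {POOL_IDX_BITS} bits"
        )
    return original_rank + (pool_idx << POOL_IDX_BITS)

print(metadata_rank(2, 0))  # 2 (KV pool keeps the bare rank)
print(metadata_rank(2, 1))  # 16777218 (first auxiliary pool)
```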
```python
if host_pool.layout in ["page_first", "page_first_direct"]:
    new_bpp = (
        host_pool.get_ksize_per_token() * host_pool.page_size
    )
else:
    new_bpp = (
        host_pool.get_size_per_token() * host_pool.page_size
    )
```
```python
name = (
    transfer.name.value
    if isinstance(transfer.name, PoolName)
    else str(transfer.name)
)
```
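The normalization above recurs at several call sites, and the review summary recommends deduplicating it. One possible helper, sketched with a stand-in enum (the real `PoolName` lives in `hicache_storage.py`; its member values here are assumptions):

```python
from enum import Enum

class PoolName(Enum):  # stand-in for the real enum in hicache_storage.py
    KV = "kv"
    MAMBA = "mamba"

def normalize_pool_name(name) -> str:
    """Accept either a PoolName member or a plain string key."""
    return name.value if isinstance(name, PoolName) else str(name)

print(normalize_pool_name(PoolName.MAMBA))  # mamba
print(normalize_pool_name("kv"))            # kv
```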
Summary
Implements the `HiCacheStorage` v2 interface for the 3FS backend so hybrid models (Mamba / linear attention, and later DSA) can offload both the KV cache and auxiliary per-pool state to 3FS via `HybridCacheController`. Today `HiCacheHF3FS` only implements the v1 KV-only methods and inherits `NotImplementedError` for everything v2; this PR closes that gap.

Tracking issue: #22572
Reference PRs: #21259 (Mooncake hybrid), #20457 (HybridCacheController groundwork)
What changes
In `python/sglang/srt/mem_cache/storage/hf3fs/storage_hf3fs.py`:

- New `_Hf3fsPoolEngine` dataclass bundling per-pool state: preallocated file, client list, `AtomicCounter`, `ThreadPoolExecutor`, metadata client, rank namespace, `is_zero_copy`, `skip_backup`.
- The KV engine is constructed in `__init__`, so v1 callers and non-hybrid deployments keep working unchanged.
- `register_mem_host_pool_v2` lazily creates auxiliary engines (MAMBA today) with their own preallocated files, clients and metadata scope. Idempotent and order-agnostic — we do not rely on KV being registered first.
- `batch_exists_v2` / `batch_get_v2` / `batch_set_v2` mirror `HiCacheFile`'s reference semantics: longest-prefix KV hit, per-pool `ALL_PAGES` / `TRAILING_PAGES` hit policies, `min` across pools for the final hit, and per-pool result dicts.
- Refactored `_batch_get` / `_batch_set` to take a `_Hf3fsPoolEngine`, so v1 and v2 share the same IO core. The existing MHA zero-copy `-k`/`-v` key doubling remains strictly KV-scoped.
- No changes to `mini_3fs_metadata_server.py`.
- Per-pool `skip_backup`: MLA rank>0 still skips the KV backup but MAMBA state is backed up on every rank (mamba state is per-rank and not shared across TP).
- Fixed a pre-existing bug: the `skip_backup` path returned a scalar `True` instead of a `[True] * N` list — corrected in the shared `_batch_set` so the v2 contract is honored.
- `close()` now iterates all engines; `self._engines` is populated before the SIGTERM handler is installed to avoid a startup window race.
- `from_env_config` now also accepts an inline `dict` config so unit tests don't need `SGLANG_HICACHE_HF3FS_CONFIG_PATH`, and forwards a `pools` sub-dict as `pools_extra_config` for per-pool file-size overrides (default: 0.1× the KV file size).
- `test/registered/hicache/test_hicache_storage_3fs_hybrid.py` (new) — mock-client unit tests exercising the v2 path end-to-end.
- `test/registered/hicache/test_hicache_storage_3fs_backend.py` — new `TestHf3fsBackendHybrid` end-to-end test that launches a hybrid/linear-attention model with `--hicache-storage-backend hf3fs --enable-hierarchical-cache --hicache-storage-prefetch-policy wait_complete`, primes the cache, flushes, re-runs the same prompt, and asserts a non-trivial cache hit on the second run.

Scope / non-goals

- `PoolName.KV` + `PoolName.MAMBA` are supported in this PR. This is sufficient for linear / hybrid-attention models (Qwen3.5-9B, etc.). DSA (DeepSeek Sparse Attention) is deferred until a v2 DSA caller exists — today Mooncake routes DSA through a separate embedding store, not through `PoolName`, and `HiMambaRadixCache` only registers KV + MAMBA. Adding `PoolName.DSA` is a follow-up once that caller lands.
- No wire-protocol change to `mini_3fs_metadata_server.py`. Pool separation is achieved by per-pool key prefixes and (for auxiliary pools) a distinct per-pool rank namespace on the metadata client.
- `backend_factory.py` is unchanged. The auxiliary pool's `bytes_per_page` is only known at `register_mem_host_pool_v2` time, which is called later by `HybridCacheController.attach_storage_backend` — verified end-to-end.
Unit tests (mock HF3FS client, no cluster needed) — all in `test/registered/hicache/test_hicache_storage_3fs_hybrid.py`:

- Construction — `self._engines` with distinct files, clients and executors; idempotent under double-registration.
- `batch_exists_v2` KV-only fallback — `pool_transfers=None` matches the v1 `batch_exists` result.
- `batch_exists_v2` `ALL_PAGES` — 4 KV pages, 2 DSA-style pages at slots `[0,1]`: asserts `final_hit == 2` and `extra_pool_hit_pages[X] == 2`.
- `batch_exists_v2` `TRAILING_PAGES` — 4 KV pages, mamba only at `[2,3]`: asserts `final_hit == 4` with trailing-window semantics matching `HiCacheFile` exactly.
- `batch_set_v2` + `batch_get_v2` round-trip — per-pool result dicts all `True`, host buffers populated correctly.
- MHA zero-copy interplay — `is_mla_model=False, is_zero_copy=True`: KV keys double into `-k`/`-v` but mamba keys do not.
- MLA `skip_backup` scoping — `is_mla_model=True, rank=2`: KV returns `[True]*N` (skipped) but MAMBA actually writes through.
- Partial-pool failure — KV all `True`, mamba `[True, False, False, False]`, no exception escapes.
- Error contract — `batch_exists_v2` without any pool registered raises a clear error, not `AttributeError`.
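The `ALL_PAGES` / `TRAILING_PAGES` cases above can be sketched as follows. This is one plausible reading consistent with the asserted numbers in the test plan, not the PR's actual implementation; function names are illustrative:

```python
def pool_hit_pages(kv_hit: int, present: list[bool], policy: str) -> int:
    if policy == "ALL_PAGES":
        # Every page up to the hit boundary must exist: count the
        # longest all-present prefix within the KV hit window.
        hit = 0
        for ok in present[:kv_hit]:
            if not ok:
                break
            hit += 1
        return hit
    # TRAILING_PAGES: only the state at the hit boundary matters, so
    # take the largest n <= kv_hit whose last page is present.
    for n in range(kv_hit, 0, -1):
        if present[n - 1]:
            return n
    return 0

def final_hit(kv_hit: int, pool_hits: list[int]) -> int:
    # min across pools, mirroring the reference semantics described above
    return min([kv_hit] + pool_hits)

# ALL_PAGES: 4 KV pages, aux pages only at slots [0, 1] -> final hit 2
all_pages = pool_hit_pages(4, [True, True, False, False], "ALL_PAGES")
# TRAILING_PAGES: 4 KV pages, mamba only at [2, 3] -> final hit 4
trailing = pool_hit_pages(4, [False, False, True, True], "TRAILING_PAGES")
print(final_hit(4, [all_pages]), final_hit(4, [trailing]))  # 2 4
```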
test_hicache_storage_3fs_backend.py):TestHf3fsBackendHybrid.test_hybrid_cache_hit_after_flush— launches a hybrid model with the 3FS backend, primes, flushes, re-prompts, assertscached_tokens > 700on the second run. Gated onDEFAULT_HYBRID_MAMBA_MODEL_NAME_FOR_TESTavailability.Regression:
test_hicache_storage_3fs_backend.pysuite — all pre-existing v1 tests still pass unchanged.test_hicache_storage_file_backend.py— no collateral damage (PR does not touchhicache_storage.py'sPoolNameenum or base-class stubs).Manual / cluster validation (optional, pre-merge):
--hicache-storage-backend hf3fson a Mamba model. Verify gsm8k accuracy is identical between first and flushed-second run, second-run TTFT drops substantially, andget_stats()reports non-zero prefetch/backup bandwidth for both the KV engine and the mamba engine.🤖 Generated with Claude Code