[mem] Add v2 hybrid-pool (KV + MAMBA) support to HiCacheHF3FS #22601
parasol-aser wants to merge 1 commit into
Conversation
Implements the HiCacheStorage v2 interface for the 3FS backend so that hybrid models (Mamba/linear-attention, and in the future DSA) can offload both KV pages and auxiliary per-pool state to 3FS via HybridCacheController.

- Introduce _Hf3fsPoolEngine: a per-pool bundle of (file, client list, executor, metadata client, rank namespace, is_zero_copy, skip_backup) so each registered host pool has its own 3FS file and metadata scope.
- Construct the KV engine in __init__ so v1 callers keep working unchanged.
- Implement register_mem_host_pool_v2 to lazily allocate auxiliary (MAMBA/...) engines with their own preallocated files, clients and metadata namespaces. Idempotent and order-agnostic.
- Implement batch_exists_v2 / batch_get_v2 / batch_set_v2 mirroring the HiCacheFile semantics, including ALL_PAGES and TRAILING_PAGES hit policies, min-across-pools final hit, and per-pool result dicts.
- Refactor _batch_get / _batch_set to take an engine argument so both v1 and v2 entry points share the same IO core.
- Key namespacing: auxiliary pools prefix the metadata key with the pool name; KV keeps the bare key for backwards compatibility. MHA zero-copy -k/-v suffixing remains strictly KV-scoped.
- Per-pool skip_backup so MLA rank>0 still skips KV but backs up MAMBA on every rank. Fixes a pre-existing bug where skip_backup returned a scalar True instead of a per-key list.
- close() now iterates all engines; _engines is populated before the SIGTERM handler is installed.

Test plan:

- New test/registered/hicache/test_hicache_storage_3fs_hybrid.py uses the mock HF3FS client to cover: construction sanity, KV-only v2 fallback, ALL_PAGES and TRAILING_PAGES exists semantics, v2 set/get round-trip, MHA zero-copy + mamba interplay, MLA skip_backup KV-only scoping, partial-pool failure, and a no-pool error contract.
- Extended test_hicache_storage_3fs_backend.py with a TestHf3fsBackendHybrid end-to-end test for a hybrid model, gated on model availability.

Scope: PoolName.KV + PoolName.MAMBA.
DSA is deferred until a caller exists (see PLAN.md §3 and Appendix B).

Tracking issue: sgl-project#22572
Reference PRs: sgl-project#21259, sgl-project#20457

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review
This pull request implements a multi-pool storage interface for HiCacheHF3FS to support hybrid models, such as those combining KV and MAMBA pools. It introduces a per-pool engine architecture, lazy registration of auxiliary pools, and updated batch operations (exists, get, set) that handle multiple pools with configurable hit policies like ALL_PAGES and TRAILING_PAGES. Feedback focuses on improving the robustness of index releasing during failures, ensuring unique file paths for auxiliary pools across ranks to prevent collisions, and refactoring duplicated logic for pool name normalization and page size calculations.
```diff
 if indices:
-    self.metadata_client.confirm_write(
-        self.rank, [], [index[1] for index in indices]
+    engine.metadata_client.confirm_write(
+        engine.metadata_rank, [], [index[1] for index in indices]
     )
```
Releasing all indices in the indices list on a length mismatch can lead to data corruption. If is_written was True for some entries, those indices refer to existing pages that should not be released. Additionally, index[1] could be -1 if allocation failed for some keys. It's safer to filter the list to only release newly allocated, valid indices.
Suggested change:

```diff
-if indices:
-    engine.metadata_client.confirm_write(
-        engine.metadata_rank, [], [index[1] for index in indices]
-    )
+# free allocated pages
+if indices:
+    to_release = [idx for is_written, idx in indices if not is_written and idx != -1]
+    if to_release:
+        engine.metadata_client.confirm_write(
+            engine.metadata_rank, [], to_release
+        )
```
```python
# Strip the trailing rank suffix from the KV path (".<rank>") so we
# can substitute the *original* (per-rank) rank for aux pools.
kv_rank_suffix = f".{self.rank}"
if base.endswith(kv_rank_suffix):
    base = base[: -len(kv_rank_suffix)] + f".{self._original_rank}"
return f"{base}.{pool_name}{ext or '.bin'}"
```
The current logic for deriving the auxiliary pool file path relies on finding and replacing the KV rank suffix. If the file_path does not contain the expected .{self.rank} suffix (e.g., a shared base path), multiple ranks will collide on the same auxiliary file path. It is safer to always explicitly include the _original_rank in the auxiliary path to ensure isolation.
Suggested change:

```diff
-# Strip the trailing rank suffix from the KV path (".<rank>") so we
-# can substitute the *original* (per-rank) rank for aux pools.
-kv_rank_suffix = f".{self.rank}"
-if base.endswith(kv_rank_suffix):
-    base = base[: -len(kv_rank_suffix)] + f".{self._original_rank}"
-return f"{base}.{pool_name}{ext or '.bin'}"
+# Strip the trailing rank suffix from the KV path (".<rank>") if present
+kv_rank_suffix = f".{self.rank}"
+if base.endswith(kv_rank_suffix):
+    base = base[: -len(kv_rank_suffix)]
+# Always include the original rank to ensure isolation.
+return f"{base}.{self._original_rank}.{pool_name}{ext or '.bin'}"
```
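A minimal standalone version of the suggested derivation, showing why always appending the original rank avoids the collision the reviewer flags (function name and sample paths are hypothetical):

```python
def aux_file_path(base: str, rank: int, original_rank: int,
                  pool_name: str, ext: str = "") -> str:
    # Strip the KV rank suffix if present, then always append the
    # original rank so auxiliary paths never collide across ranks,
    # even when the base path carries no rank suffix at all.
    kv_rank_suffix = f".{rank}"
    if base.endswith(kv_rank_suffix):
        base = base[: -len(kv_rank_suffix)]
    return f"{base}.{original_rank}.{pool_name}{ext or '.bin'}"

# Shared base path (the problematic case): ranks still diverge.
print(aux_file_path("/data/hicache", 0, 0, "mamba"))  # /data/hicache.0.mamba.bin
print(aux_file_path("/data/hicache", 0, 1, "mamba"))  # /data/hicache.1.mamba.bin
```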
```python
except Exception as e:
    # The mini metadata server raises when its free-page list and
    # key-to-index map are both empty (over-allocation past the
    # configured num_pages). Surface that to the caller as a per-key
    # failure list rather than letting the exception escape — the
    # v2 contract guarantees a List[bool] result so the controller
    # can attribute the loss to this specific pool. PLAN.md §4 #5.
    logger.warning(
        "[Rank %s] HiCacheHF3FS batch_set (%s) capacity exhausted: %s",
        engine.metadata_rank,
        engine.pool_name,
        e,
    )
    return [False] * len(keys)
```
Catching a broad Exception here might mask unexpected errors (e.g., network issues, programming errors) as simple capacity exhaustion. It is recommended to catch more specific exceptions if possible, or at least include the stack trace in the log to aid debugging.
Suggested change:

```diff
     logger.warning(
         "[Rank %s] HiCacheHF3FS batch_set (%s) capacity exhausted: %s",
         engine.metadata_rank,
         engine.pool_name,
         e,
+        exc_info=True,
     )
     return [False] * len(keys)
```
```python
self._next_pool_idx += 1
# Use the original (per-rank) rank as the base. Mamba/DSA state is
# per-rank — even in MLA — so the metadata namespace must be too.
return self._original_rank + (self._pool_idx_map[pool_name] << 24)
```
Using a fixed shift of 24 bits for the pool namespace assumes that _original_rank will always be less than 2^24 (16,777,216); otherwise the pool-index bits would overlap the rank bits and two metadata namespaces could collide. Consider asserting this bound explicitly or documenting the assumption.
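A quick sketch of the bound implied by the 24-bit shift; the helper name and the explicit guard are illustrative, not the PR's actual code:

```python
POOL_IDX_BITS = 24  # the fixed shift noted in the review comment

def metadata_rank(original_rank: int, pool_idx: int) -> int:
    # If original_rank ever reached 2**24, its high bits would overlap
    # the pool-index bits and two (rank, pool) pairs could collide.
    if original_rank >= (1 << POOL_IDX_BITS):
        raise ValueError(
            f"rank {original_rank} does not fit in {POOL_IDX_BITS} bits"
        )
    return original_rank + (pool_idx << POOL_IDX_BITS)

print(metadata_rank(2, 0))  # 2 (KV pool keeps the bare rank)
print(metadata_rank(2, 1))  # 16777218 (first auxiliary pool)
```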
```python
if host_pool.layout in ["page_first", "page_first_direct"]:
    new_bpp = (
        host_pool.get_ksize_per_token() * host_pool.page_size
    )
else:
    new_bpp = (
        host_pool.get_size_per_token() * host_pool.page_size
    )
```
```python
name = (
    transfer.name.value
    if isinstance(transfer.name, PoolName)
    else str(transfer.name)
)
```
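The normalization above recurs at several call sites, and the review summary recommends deduplicating it. One possible helper, sketched with a stand-in enum (the real `PoolName` lives in `hicache_storage.py`; its member values here are assumptions):

```python
from enum import Enum

class PoolName(Enum):  # stand-in for the real enum in hicache_storage.py
    KV = "kv"
    MAMBA = "mamba"

def normalize_pool_name(name) -> str:
    """Accept either a PoolName member or a plain string key."""
    return name.value if isinstance(name, PoolName) else str(name)

print(normalize_pool_name(PoolName.MAMBA))  # mamba
print(normalize_pool_name("kv"))            # kv
```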
Summary
Implements the `HiCacheStorage` v2 interface for the 3FS backend so hybrid models (Mamba / linear attention, and later DSA) can offload both the KV cache and auxiliary per-pool state to 3FS via `HybridCacheController`. Today `HiCacheHF3FS` only implements the v1 KV-only methods and inherits `NotImplementedError` for everything v2; this PR closes that gap.

Tracking issue: #22572
Reference PRs: #21259 (Mooncake hybrid), #20457 (HybridCacheController groundwork)
What changes
In `python/sglang/srt/mem_cache/storage/hf3fs/storage_hf3fs.py`:

- New `_Hf3fsPoolEngine` dataclass bundling per-pool state: preallocated file, client list, `AtomicCounter`, `ThreadPoolExecutor`, metadata client, rank namespace, `is_zero_copy`, `skip_backup`.
- The KV engine is constructed in `__init__`, so v1 callers and non-hybrid deployments keep working unchanged.
- `register_mem_host_pool_v2` lazily creates auxiliary engines (MAMBA today) with their own preallocated files, clients and metadata scope. Idempotent and order-agnostic — we do not rely on KV being registered first.
- `batch_exists_v2` / `batch_get_v2` / `batch_set_v2` mirror `HiCacheFile`'s reference semantics: longest-prefix KV hit, per-pool `ALL_PAGES` / `TRAILING_PAGES` hit policies, `min` across pools for the final hit, and per-pool result dicts.
- Refactored `_batch_get` / `_batch_set` to take a `_Hf3fsPoolEngine`, so v1 and v2 share the same IO core. The existing MHA zero-copy `-k`/`-v` key doubling remains strictly KV-scoped.
- No changes to `mini_3fs_metadata_server.py`.
- Per-pool `skip_backup`: MLA rank>0 still skips the KV backup but MAMBA state is backed up on every rank (mamba state is per-rank and not shared across TP).
- Fixed a pre-existing bug: the `skip_backup` path returned a scalar `True` instead of a `[True] * N` list — corrected in the shared `_batch_set` so the v2 contract is honored.
- `close()` now iterates all engines; `self._engines` is populated before the SIGTERM handler is installed to avoid a startup window race.
- `from_env_config` now also accepts an inline `dict` config so unit tests don't need `SGLANG_HICACHE_HF3FS_CONFIG_PATH`, and forwards a `pools` sub-dict as `pools_extra_config` for per-pool file-size overrides (default: 0.1× the KV file size).
- `test/registered/hicache/test_hicache_storage_3fs_hybrid.py` (new) — mock-client unit tests exercising the v2 path end-to-end.
- `test/registered/hicache/test_hicache_storage_3fs_backend.py` — new `TestHf3fsBackendHybrid` end-to-end test that launches a hybrid/linear-attention model with `--hicache-storage-backend hf3fs --enable-hierarchical-cache --hicache-storage-prefetch-policy wait_complete`, primes the cache, flushes, re-runs the same prompt, and asserts a non-trivial cache hit on the second run.

Scope / non-goals

- `PoolName.KV` + `PoolName.MAMBA` are supported in this PR. This is sufficient for linear / hybrid-attention models (Qwen3.5-9B, etc.). DSA (DeepSeek Sparse Attention) is deferred until a v2 DSA caller exists — today Mooncake routes DSA through a separate embedding store, not through `PoolName`, and `HiMambaRadixCache` only registers KV + MAMBA. Adding `PoolName.DSA` is a follow-up once that caller lands.
- No wire-protocol change to `mini_3fs_metadata_server.py`. Pool separation is achieved by per-pool key prefixes and (for auxiliary pools) a distinct per-pool rank namespace on the metadata client.
- `backend_factory.py` is unchanged. The auxiliary pool's `bytes_per_page` is only known at `register_mem_host_pool_v2` time, which is called later by `HybridCacheController.attach_storage_backend` — verified end-to-end.
Unit tests (mock HF3FS client, no cluster needed) — all in `test/registered/hicache/test_hicache_storage_3fs_hybrid.py`:

- Construction — `self._engines` with distinct files, clients and executors; idempotent under double-registration.
- `batch_exists_v2` KV-only fallback — `pool_transfers=None` matches the v1 `batch_exists` result.
- `batch_exists_v2` `ALL_PAGES` — 4 KV pages, 2 DSA-style pages at slots `[0,1]`: asserts `final_hit == 2` and `extra_pool_hit_pages[X] == 2`.
- `batch_exists_v2` `TRAILING_PAGES` — 4 KV pages, mamba only at `[2,3]`: asserts `final_hit == 4` with trailing-window semantics matching `HiCacheFile` exactly.
- `batch_set_v2` + `batch_get_v2` round-trip — per-pool result dicts all `True`, host buffers populated correctly.
- MHA zero-copy interplay — `is_mla_model=False, is_zero_copy=True`: KV keys double into `-k`/`-v` but mamba keys do not.
- MLA `skip_backup` scoping — `is_mla_model=True, rank=2`: KV returns `[True]*N` (skipped) but MAMBA actually writes through.
- Partial-pool failure — KV all `True`, mamba `[True, False, False, False]`, no exception escapes.
- Error contract — `batch_exists_v2` without any pool registered raises a clear error, not `AttributeError`.
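The `ALL_PAGES` / `TRAILING_PAGES` cases above can be sketched as follows. This is one plausible reading consistent with the asserted numbers in the test plan, not the PR's actual implementation; function names are illustrative:

```python
def pool_hit_pages(kv_hit: int, present: list[bool], policy: str) -> int:
    if policy == "ALL_PAGES":
        # Every page up to the hit boundary must exist: count the
        # longest all-present prefix within the KV hit window.
        hit = 0
        for ok in present[:kv_hit]:
            if not ok:
                break
            hit += 1
        return hit
    # TRAILING_PAGES: only the state at the hit boundary matters, so
    # take the largest n <= kv_hit whose last page is present.
    for n in range(kv_hit, 0, -1):
        if present[n - 1]:
            return n
    return 0

def final_hit(kv_hit: int, pool_hits: list[int]) -> int:
    # min across pools, mirroring the reference semantics described above
    return min([kv_hit] + pool_hits)

# ALL_PAGES: 4 KV pages, aux pages only at slots [0, 1] -> final hit 2
all_pages = pool_hit_pages(4, [True, True, False, False], "ALL_PAGES")
# TRAILING_PAGES: 4 KV pages, mamba only at [2, 3] -> final hit 4
trailing = pool_hit_pages(4, [False, False, True, True], "TRAILING_PAGES")
print(final_hit(4, [all_pages]), final_hit(4, [trailing]))  # 2 4
```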
test_hicache_storage_3fs_backend.py):TestHf3fsBackendHybrid.test_hybrid_cache_hit_after_flush— launches a hybrid model with the 3FS backend, primes, flushes, re-prompts, assertscached_tokens > 700on the second run. Gated onDEFAULT_HYBRID_MAMBA_MODEL_NAME_FOR_TESTavailability.Regression:
test_hicache_storage_3fs_backend.pysuite — all pre-existing v1 tests still pass unchanged.test_hicache_storage_file_backend.py— no collateral damage (PR does not touchhicache_storage.py'sPoolNameenum or base-class stubs).Manual / cluster validation (optional, pre-merge):
--hicache-storage-backend hf3fson a Mamba model. Verify gsm8k accuracy is identical between first and flushed-second run, second-run TTFT drops substantially, andget_stats()reports non-zero prefetch/backup bandwidth for both the KV engine and the mamba engine.🤖 Generated with Claude Code