
[mem] Add v2 hybrid-pool (KV + MAMBA) support to HiCacheHF3FS #22601

Closed
parasol-aser wants to merge 1 commit into sgl-project:main from parasol-aser:hicache-hf3fs-hybrid-v2

Conversation

@parasol-aser
Contributor

Summary

Implements the HiCacheStorage v2 interface for the 3FS backend so hybrid models (Mamba / linear attention, and later DSA) can offload both the KV cache and auxiliary per-pool state to 3FS via HybridCacheController. Today HiCacheHF3FS only implements the v1 KV-only methods and inherits NotImplementedError for everything v2; this PR closes that gap.

Tracking issue: #22572
Reference PRs: #21259 (Mooncake hybrid), #20457 (HybridCacheController groundwork)

What changes

  • python/sglang/srt/mem_cache/storage/hf3fs/storage_hf3fs.py

    • New _Hf3fsPoolEngine dataclass bundling per-pool state: preallocated file, client list, AtomicCounter, ThreadPoolExecutor, metadata client, rank namespace, is_zero_copy, skip_backup.
    • KV engine is built in __init__, so v1 callers and non-hybrid deployments keep working unchanged.
    • register_mem_host_pool_v2 lazily creates auxiliary engines (MAMBA today) with their own preallocated files, clients and metadata scope. Idempotent and order-agnostic — we do not rely on KV being registered first.
    • batch_exists_v2 / batch_get_v2 / batch_set_v2 mirror HiCacheFile's reference semantics: longest-prefix KV hit, per-pool ALL_PAGES / TRAILING_PAGES hit policies, min across pools for the final hit, and per-pool result dicts.
    • Refactored _batch_get / _batch_set to take a _Hf3fsPoolEngine, so v1 and v2 share the same IO core. The existing MHA zero-copy -k/-v key doubling remains strictly KV-scoped.
    • Key namespacing: auxiliary pools prefix metadata keys with the pool name; KV keeps the bare key for backwards compatibility. This is Option A from the plan — no wire-protocol changes to mini_3fs_metadata_server.py.
    • Per-pool skip_backup: MLA rank>0 still skips the KV backup but MAMBA state is backed up on every rank (mamba state is per-rank and not shared across TP).
    • Drive-by fix: the v1 skip_backup path returned a scalar True instead of a [True] * N list — corrected in the shared _batch_set so the v2 contract is honored.
    • close() now iterates all engines; self._engines is populated before the SIGTERM handler is installed to avoid a startup window race.
    • from_env_config now also accepts an inline dict config so unit tests don't need SGLANG_HICACHE_HF3FS_CONFIG_PATH, and forwards a pools sub-dict as pools_extra_config for per-pool file-size overrides (default: 0.1× the KV file size).
  • test/registered/hicache/test_hicache_storage_3fs_hybrid.py (new) — mock-client unit tests exercising the v2 path end-to-end.

  • test/registered/hicache/test_hicache_storage_3fs_backend.py — new TestHf3fsBackendHybrid end-to-end test that launches a hybrid/linear-attention model with --hicache-storage-backend hf3fs --enable-hierarchical-cache --hicache-storage-prefetch-policy wait_complete, primes the cache, flushes, re-runs the same prompt, and asserts a non-trivial cache hit on the second run.
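
The per-pool engine bundle and the skip_backup list contract described above can be sketched as follows. This is a minimal illustration, not the PR's code: the field names follow the description, but the types, the public class name, and the stub batch_set are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Hf3fsPoolEngine:
    """Hypothetical stand-in for the PR's _Hf3fsPoolEngine bundle."""

    pool_name: str        # "kv", "mamba", ...
    file_path: str        # this pool's preallocated 3FS file
    clients: List[Any]    # per-thread HF3FS clients
    metadata_client: Any
    metadata_rank: int    # per-pool rank namespace on the metadata server
    is_zero_copy: bool = False
    skip_backup: bool = False
    executor: ThreadPoolExecutor = field(
        default_factory=lambda: ThreadPoolExecutor(max_workers=4)
    )


def batch_set(engine: Hf3fsPoolEngine, keys: List[str]) -> List[bool]:
    # The drive-by fix above: the skip-backup path must return a
    # per-key list ([True] * N), never a scalar True, so the v2
    # contract (List[bool]) holds for every pool.
    if engine.skip_backup:
        return [True] * len(keys)
    raise NotImplementedError("IO core elided in this sketch")
```

Because the engine carries its own file, clients, executor, and metadata namespace, v1 and v2 entry points can share one IO core that simply takes an engine argument.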

Scope / non-goals

  • Only PoolName.KV + PoolName.MAMBA are supported in this PR. This is sufficient for linear / hybrid-attention models (Qwen3.5-9B, etc.). DSA (DeepSeek Sparse Attention) is deferred until a v2 DSA caller exists — today Mooncake routes DSA through a separate embedding store, not through PoolName, and HiMambaRadixCache only registers KV + MAMBA. Adding PoolName.DSA is a follow-up once that caller lands.
  • No changes to mini_3fs_metadata_server.py. Pool separation is achieved by per-pool key prefixes and (for auxiliary pools) a distinct per-pool rank namespace on the metadata client, avoiding any wire-protocol change.
  • backend_factory.py is unchanged. The auxiliary pool's bytes_per_page is only known at register_mem_host_pool_v2 time, which is called later by HybridCacheController.attach_storage_backend — verified end-to-end.
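
The Option A key-namespacing rule reduces to a small pure function. A sketch under assumptions: the canonical KV pool name and the prefix separator are guesses, not taken from the PR.

```python
KV_POOL = "kv"  # assumed canonical name; see PoolName in hicache_storage.py


def metadata_key(pool_name: str, key: str) -> str:
    # KV keeps the bare key so existing v1 metadata entries stay
    # readable; auxiliary pools are isolated by a pool-name prefix,
    # so no wire-protocol change is needed on the metadata server.
    if pool_name == KV_POOL:
        return key
    return f"{pool_name}/{key}"
```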

Test plan

Unit tests (mock HF3FS client, no cluster needed) — all in test/registered/hicache/test_hicache_storage_3fs_hybrid.py:

  • Construction sanity — register KV + MAMBA, assert both appear in self._engines with distinct files, clients and executors; idempotent under double-registration.
  • batch_exists_v2 KV-only fallback — pool_transfers=None matches the v1 batch_exists result.
  • batch_exists_v2 ALL_PAGES — 4 KV pages, 2 DSA-style pages at slots [0,1]: asserts final_hit == 2 and extra_pool_hit_pages[X] == 2.
  • batch_exists_v2 TRAILING_PAGES — 4 KV pages, mamba only at [2,3]: asserts final_hit == 4 with trailing-window semantics matching HiCacheFile exactly.
  • batch_set_v2 + batch_get_v2 round-trip — per-pool result dicts all True, host buffers populated correctly.
  • MHA zero-copy + mamba interplay — with is_mla_model=False, is_zero_copy=True, KV keys double into -k/-v but mamba keys do not.
  • Skip-backup is KV-only under MLA — with is_mla_model=True, rank=2, KV returns [True]*N (skipped) but MAMBA actually writes through.
  • Partial pool failure — tiny mamba file (1 page) + 4-page set: KV all True, mamba [True, False, False, False], no exception escapes.
  • Interface contract — batch_exists_v2 without any pool registered raises a clear error, not AttributeError.
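
The hit-policy arithmetic these tests exercise can be sketched as below. This is one plausible reading of the ALL_PAGES / TRAILING_PAGES semantics described above; the authoritative rules are HiCacheFile's in hicache_storage.py and may differ in detail.

```python
from typing import Dict, List, Tuple


def longest_prefix_hit(exists: List[bool]) -> int:
    """Number of leading pages present (the KV longest-prefix rule)."""
    n = 0
    for present in exists:
        if not present:
            break
        n += 1
    return n


def pool_hit(kv_hit: int, exists: List[bool], policy: str) -> int:
    if policy == "ALL_PAGES":
        # Every page in the hit window must exist in this pool.
        return min(kv_hit, longest_prefix_hit(exists))
    # TRAILING_PAGES: only the window's final page must exist in this
    # pool; shrink the window until its last page is present.
    n = kv_hit
    while n > 0 and not exists[n - 1]:
        n -= 1
    return n


def final_hit(kv_exists: List[bool], pools: Dict[str, Tuple[List[bool], str]]) -> int:
    # Final hit is the minimum across the KV prefix hit and every
    # auxiliary pool's policy-adjusted hit.
    kv_hit = longest_prefix_hit(kv_exists)
    return min([kv_hit] + [pool_hit(kv_hit, ex, pol) for ex, pol in pools.values()])
```

Under this reading, 4 KV pages with an ALL_PAGES pool present only at slots [0,1] yields a final hit of 2, while a TRAILING_PAGES pool present only at [2,3] leaves the full hit of 4 intact, matching the assertions above.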

End-to-end tests (extend test_hicache_storage_3fs_backend.py):

  • TestHf3fsBackendHybrid.test_hybrid_cache_hit_after_flush — launches a hybrid model with the 3FS backend, primes, flushes, re-prompts, asserts cached_tokens > 700 on the second run. Gated on DEFAULT_HYBRID_MAMBA_MODEL_NAME_FOR_TEST availability.

Regression:

  • Full test_hicache_storage_3fs_backend.py suite — all pre-existing v1 tests still pass unchanged.
  • test_hicache_storage_file_backend.py — no collateral damage (PR does not touch hicache_storage.py's PoolName enum or base-class stubs).

Manual / cluster validation (optional, pre-merge):

  • On a single-node mini-3FS cluster, run the command from [HiCache & HybridModel] mooncake backend support DSA & mamba model #21259 but with --hicache-storage-backend hf3fs on a Mamba model. Verify gsm8k accuracy is identical between first and flushed-second run, second-run TTFT drops substantially, and get_stats() reports non-zero prefetch/backup bandwidth for both the KV engine and the mamba engine.

🤖 Generated with Claude Code

Implements the HiCacheStorage v2 interface for the 3FS backend so that
hybrid models (Mamba/linear-attention, and in the future DSA) can offload
both KV pages and auxiliary per-pool state to 3FS via HybridCacheController.

- Introduce _Hf3fsPoolEngine: a per-pool bundle of (file, client list,
  executor, metadata client, rank namespace, is_zero_copy, skip_backup)
  so each registered host pool has its own 3FS file and metadata scope.
- Construct the KV engine in __init__ so v1 callers keep working unchanged.
- Implement register_mem_host_pool_v2 to lazily allocate auxiliary
  (MAMBA/...) engines with their own preallocated files, clients and
  metadata namespaces. Idempotent and order-agnostic.
- Implement batch_exists_v2 / batch_get_v2 / batch_set_v2 mirroring the
  HiCacheFile semantics, including ALL_PAGES and TRAILING_PAGES hit
  policies, min-across-pools final hit, and per-pool result dicts.
- Refactor _batch_get / _batch_set to take an engine argument so both
  v1 and v2 entry points share the same IO core.
- Key namespacing: auxiliary pools prefix the metadata key with the
  pool name, KV keeps the bare key for backwards compatibility. MHA
  zero-copy -k/-v suffixing remains strictly KV-scoped.
- Per-pool skip_backup so MLA rank>0 still skips KV but backs up MAMBA
  on every rank. Fix a pre-existing bug where skip_backup returned a
  scalar True instead of a per-key list.
- close() now iterates all engines; _engines is populated before the
  SIGTERM handler is installed.

Test plan:
- New test/registered/hicache/test_hicache_storage_3fs_hybrid.py uses the
  mock HF3FS client to cover: construction sanity, KV-only v2 fallback,
  ALL_PAGES and TRAILING_PAGES exists semantics, v2 set/get round-trip,
  MHA zero-copy + mamba interplay, MLA skip_backup KV-only scoping,
  partial-pool failure, and a no-pool error contract.
- Extended test_hicache_storage_3fs_backend.py with TestHf3fsBackendHybrid
  end-to-end test for a hybrid model, gated on model availability.

Scope: PoolName.KV + PoolName.MAMBA. DSA is deferred until a caller
exists (see PLAN.md §3 and Appendix B).

Tracking issue: sgl-project#22572
Reference PRs: sgl-project#21259, sgl-project#20457

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@gemini-code-assist (bot) left a comment


Code Review

This pull request implements a multi-pool storage interface for HiCacheHF3FS to support hybrid models, such as those combining KV and MAMBA pools. It introduces a per-pool engine architecture, lazy registration of auxiliary pools, and updated batch operations (exists, get, set) that handle multiple pools with configurable hit policies like ALL_PAGES and TRAILING_PAGES. Feedback focuses on improving the robustness of index releasing during failures, ensuring unique file paths for auxiliary pools across ranks to prevent collisions, and refactoring duplicated logic for pool name normalization and page size calculations.

Comment on lines 580 to 583
 if indices:
-    self.metadata_client.confirm_write(
-        self.rank, [], [index[1] for index in indices]
+    engine.metadata_client.confirm_write(
+        engine.metadata_rank, [], [index[1] for index in indices]
     )

high

Releasing all indices in the indices list on a length mismatch can lead to data corruption. If is_written was True for some entries, those indices refer to existing pages that should not be released. Additionally, index[1] could be -1 if allocation failed for some keys. It's safer to filter the list to only release newly allocated, valid indices.

Suggested change

    # free allocated pages
    if indices:
        to_release = [idx for is_written, idx in indices if not is_written and idx != -1]
        if to_release:
            engine.metadata_client.confirm_write(
                engine.metadata_rank, [], to_release
            )

Comment on lines +759 to +764
# Strip the trailing rank suffix from the KV path (".<rank>") so we
# can substitute the *original* (per-rank) rank for aux pools.
kv_rank_suffix = f".{self.rank}"
if base.endswith(kv_rank_suffix):
base = base[: -len(kv_rank_suffix)] + f".{self._original_rank}"
return f"{base}.{pool_name}{ext or '.bin'}"

high

The current logic for deriving the auxiliary pool file path relies on finding and replacing the KV rank suffix. If the file_path does not contain the expected .{self.rank} suffix (e.g., a shared base path), multiple ranks will collide on the same auxiliary file path. It is safer to always explicitly include the _original_rank in the auxiliary path to ensure isolation.

Suggested change

    # Strip the trailing rank suffix from the KV path (".<rank>") if present
    kv_rank_suffix = f".{self.rank}"
    if base.endswith(kv_rank_suffix):
        base = base[: -len(kv_rank_suffix)]
    # Always include the original rank to ensure isolation.
    return f"{base}.{self._original_rank}.{pool_name}{ext or '.bin'}"

Comment on lines +560 to +573
except Exception as e:
# The mini metadata server raises when its free-page list and
# key-to-index map are both empty (over-allocation past the
# configured num_pages). Surface that to the caller as a per-key
# failure list rather than letting the exception escape — the
# v2 contract guarantees a List[bool] result so the controller
# can attribute the loss to this specific pool. PLAN.md §4 #5.
logger.warning(
"[Rank %s] HiCacheHF3FS batch_set (%s) capacity exhausted: %s",
engine.metadata_rank,
engine.pool_name,
e,
)
return [False] * len(keys)

medium

Catching a broad Exception here might mask unexpected errors (e.g., network issues, programming errors) as simple capacity exhaustion. It is recommended to catch more specific exceptions if possible, or at least include the stack trace in the log to aid debugging.

Suggested change

    except Exception as e:
        # The mini metadata server raises when its free-page list and
        # key-to-index map are both empty (over-allocation past the
        # configured num_pages). Surface that to the caller as a per-key
        # failure list rather than letting the exception escape — the
        # v2 contract guarantees a List[bool] result so the controller
        # can attribute the loss to this specific pool. PLAN.md §4 #5.
        logger.warning(
            "[Rank %s] HiCacheHF3FS batch_set (%s) capacity exhausted: %s",
            engine.metadata_rank,
            engine.pool_name,
            e,
            exc_info=True,
        )
        return [False] * len(keys)

self._next_pool_idx += 1
# Use the original (per-rank) rank as the base. Mamba/DSA state is
# per-rank — even in MLA — so the metadata namespace must be too.
return self._original_rank + (self._pool_idx_map[pool_name] << 24)

medium

Using a fixed shift of 24 bits for the pool namespace assumes that _original_rank will always be less than $2^{24}$ (approx. 16.7 million). While this is likely true for most current deployments, it's a magic number constraint that could lead to collisions if ranks are assigned from a larger space (e.g., global unique IDs). Consider using a more robust scheme or documenting this limitation.
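
The reviewer's fixed-shift concern can be made explicit with a guard rather than left as an implicit limit. A sketch only: the function and constant names are assumptions, not the PR's code.

```python
POOL_SHIFT = 24  # pool index lives in bits 24 and above


def pool_metadata_rank(original_rank: int, pool_idx: int) -> int:
    # Turn the implicit 2**24 rank limit into a loud failure instead
    # of a silent namespace collision between pools.
    if not (0 <= original_rank < (1 << POOL_SHIFT)):
        raise ValueError(
            f"rank {original_rank} does not fit in the {POOL_SHIFT}-bit namespace"
        )
    return original_rank + (pool_idx << POOL_SHIFT)
```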

Comment on lines +811 to +818
if host_pool.layout in ["page_first", "page_first_direct"]:
new_bpp = (
host_pool.get_ksize_per_token() * host_pool.page_size
)
else:
new_bpp = (
host_pool.get_size_per_token() * host_pool.page_size
)

medium

The logic for calculating bytes_per_page based on the host pool layout is duplicated here and at lines 847-854, and it also mirrors logic in backend_factory.py. Consider refactoring this into a helper method or a property on the HostKVCache class to improve maintainability.

Comment on lines +967 to +971
name = (
transfer.name.value
if isinstance(transfer.name, PoolName)
else str(transfer.name)
)

medium

The normalization of the pool name from transfer.name is duplicated in several places (batch_exists_v2, _resolve_v2_transfer, batch_get_v2, batch_set_v2). Consider refactoring this into a small helper method.


Labels

hicache Hierarchical Caching for SGLang
