[Core] feat(mem_cache): add SemanticPrefixProvider hook to RadixCache #20806

Open
zbennett10 wants to merge 1 commit into sgl-project:main from WorldFlowAI:semblend/semantic-prefix-provider
@zbennett10 zbennett10 commented Mar 18, 2026

Summary

This PR adds a SemanticPrefixProvider interface to RadixCache that enables approximate/semantic KV cache matching as a fallback when the exact radix-tree lookup returns zero cached tokens.

Motivation: When prompts are semantically similar but lexically different (different instruction wording, sentence ordering, or template fields), the exact radix-tree lookup returns 0 hits. A semantic provider can identify a donor request whose KV is already resident and suggest its token IDs for lookup — eliminating redundant prefill computation without any changes to the core attention or KV storage mechanisms.

Use cases this enables:

  • Semantic KV sharing (e.g. SemBlend): look up semantically similar documents already in the cache
  • Fuzzy prefix matching: tolerate small edits at prefix boundaries
  • RAG-aware caching: reuse cached KV for retrieved contexts with varied instruction phrasing
  • Topic-based KV sharing: share computation across requests with the same subject matter

Changes

New file: python/sglang/srt/mem_cache/semantic_prefix.py

Two public types:

  • SemanticPrefixResult dataclass — carries alternate_token_ids, num_cached_tokens, skip_insert, metadata, and source_id
  • SemanticPrefixProvider ABC — on_prefix_miss(rid, token_ids), on_request_cached(rid, token_ids), on_init(), on_shutdown()
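
For readers skimming the description, the two types could look roughly like the sketch below. Only the field and method names come from this PR description; the type annotations, the defaults, the choice of which method is abstract, and the toy `LastRequestProvider` are assumptions made for illustration and may differ from the actual `semantic_prefix.py`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SemanticPrefixResult:
    # Field names are from the PR description; types and defaults are assumed.
    alternate_token_ids: List[int]
    num_cached_tokens: int = 0
    skip_insert: bool = False
    metadata: Dict[str, Any] = field(default_factory=dict)
    source_id: Optional[str] = None


class SemanticPrefixProvider(ABC):
    # Assumption: only on_prefix_miss is abstract; the other hooks default to no-ops.
    @abstractmethod
    def on_prefix_miss(
        self, rid: str, token_ids: List[int]
    ) -> Optional[SemanticPrefixResult]:
        """Return a donor suggestion, or None to fall back to cold prefill."""

    def on_request_cached(self, rid: str, token_ids: List[int]) -> None:
        pass

    def on_init(self) -> None:
        pass

    def on_shutdown(self) -> None:
        pass


class LastRequestProvider(SemanticPrefixProvider):
    """Hypothetical toy provider: suggests the most recently cached request."""

    def __init__(self) -> None:
        self._last = None  # (rid, token_ids) of the latest cached request

    def on_request_cached(self, rid: str, token_ids: List[int]) -> None:
        self._last = (rid, list(token_ids))

    def on_prefix_miss(self, rid, token_ids):
        if self._last is None:
            return None  # nothing cached yet
        donor_rid, donor_tokens = self._last
        return SemanticPrefixResult(
            alternate_token_ids=donor_tokens, source_id=donor_rid
        )
```

A real provider would pick the donor by similarity rather than recency; the toy class only shows how the two hooks cooperate.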

Modified: python/sglang/srt/mem_cache/radix_cache.py

  • RadixCache.__init__: adds self._semantic_provider = None
  • RadixCache.set_semantic_provider(provider): registers a provider, calls on_init()
  • RadixCache.match_prefix: refactored to call _match_prefix_exact, then apply semantic fallback when result is empty and params.req is available
  • RadixCache._match_prefix_exact: extracted inner implementation (no semantic fallback) — existing callers that call match_prefix without params.req are unaffected
  • RadixCache.cache_finished_req: calls provider.on_request_cached after a successful insert so the provider can register the request as a future donor

Zero behavioral change when no provider is registered — the _semantic_provider attribute defaults to None and all new code paths are guarded by self._semantic_provider is not None.
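
The control flow described above can be illustrated with a small stand-in cache. This is a sketch, not the RadixCache code: `ToyPrefixCache` stores flat token tuples instead of a radix tree, the `rid` argument stands in for `params.req`, and `FixedDonorProvider` is a hypothetical provider used only to exercise the fallback.

```python
from types import SimpleNamespace
from typing import List, Optional, Tuple


class ToyPrefixCache:
    """Stand-in for RadixCache: longest-common-prefix over stored sequences."""

    def __init__(self) -> None:
        self._entries: List[Tuple[int, ...]] = []
        self._semantic_provider = None  # default: no provider, no new behavior

    def set_semantic_provider(self, provider) -> None:
        self._semantic_provider = provider
        if provider is not None:
            provider.on_init()

    def insert(self, token_ids: List[int]) -> None:
        self._entries.append(tuple(token_ids))

    def _match_prefix_exact(self, token_ids: List[int]) -> List[int]:
        best: List[int] = []
        for entry in self._entries:
            n = 0
            for a, b in zip(entry, token_ids):
                if a != b:
                    break
                n += 1
            if n > len(best):
                best = list(entry[:n])
        return best

    def match_prefix(self, token_ids: List[int], rid: Optional[str] = None):
        hit = self._match_prefix_exact(token_ids)
        # Fallback only on an empty result, with a provider and a request.
        if hit or self._semantic_provider is None or rid is None:
            return hit
        result = self._semantic_provider.on_prefix_miss(rid, token_ids)
        if result is None:
            return hit  # provider has no donor: cold prefill
        return self._match_prefix_exact(result.alternate_token_ids)


class FixedDonorProvider:
    """Hypothetical provider that always suggests the same donor tokens."""

    def __init__(self, donor_tokens: List[int]) -> None:
        self.donor_tokens = donor_tokens
        self.init_calls = 0
        self.miss_calls = 0

    def on_init(self) -> None:
        self.init_calls += 1

    def on_prefix_miss(self, rid, token_ids):
        self.miss_calls += 1
        return SimpleNamespace(alternate_token_ids=self.donor_tokens)
```

With no provider registered, `match_prefix` returns exactly what `_match_prefix_exact` returns, which mirrors the zero-behavioral-change guarantee stated above.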

Tests

New file: test/srt/test_semantic_prefix_provider.py — 29 unit tests covering:

  • TestSetSemanticProvider (6 tests): provider lifecycle, on_init called once, clearing with None, replacing
  • TestMatchPrefixNoProvider (3 tests): baseline exact-match behavior unchanged
  • TestMatchPrefixSemanticFallback (10 tests): provider called only on miss, not called without params.req, returns None → cold prefill, alternate tokens looked up, extra_key preserved, exception propagation, source_id logging
  • TestMatchPrefixExact (3 tests): _match_prefix_exact never calls provider
  • TestOnRequestCachedHook (4 tests): called on insert, not on skip-insert, not without provider, correct token IDs
  • TestMultipleRequests (2 tests): independence across requests, provider replacement

All 29 tests pass against the fork (validated in cloud on A10G).

New file: test/srt/conftest.py — sparse-checkout helper that stubs sglang.lang.* so mem_cache tests can run without a full install during local development. Has no effect in CI where the full package is installed.
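
One common way to write such a stub helper is to register placeholder modules in `sys.modules` when the real package cannot be imported. The sketch below shows that general technique only; `stub_package` is a hypothetical name and the actual `conftest.py` in this PR may be structured differently.

```python
import importlib
import sys
import types
from typing import Iterable


def stub_package(name: str, submodules: Iterable[str] = ()) -> bool:
    """Register empty placeholder modules for `name` if it is not importable.

    Returns True when stubs were installed, False when the real package exists
    (in which case nothing is changed).
    """
    try:
        importlib.import_module(name)
        return False  # full install available: leave it alone
    except ImportError:
        pass

    pkg = types.ModuleType(name)
    pkg.__path__ = []  # an empty __path__ marks the module as a package
    sys.modules[name] = pkg
    for sub in submodules:
        full_name = f"{name}.{sub}"
        mod = types.ModuleType(full_name)
        sys.modules[full_name] = mod
        setattr(pkg, sub, mod)  # makes `from name import sub` work
    return True
```

Because the helper returns early when the import succeeds, it is a no-op wherever the full package is installed.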

Design Notes

  • on_prefix_miss is called synchronously inside the scheduler step so it must be fast. Heavy embedding/similarity search should be done asynchronously and results staged before the call.
  • The fallback only activates when params.req is not None, so internal callers that pass only key (e.g. cache_unfinished_req) are never affected.
  • _match_prefix_exact is private by naming convention (_ prefix) but is meant to be callable by integrations that need to bypass the semantic fallback in specific scenarios.
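
To illustrate the first note, a provider can keep `on_prefix_miss` cheap by doing the slow similarity search elsewhere and staging the answer ahead of time. The sketch below assumes a hypothetical `stage` method called from a background thread, and for brevity it returns the donor token IDs directly rather than a `SemanticPrefixResult`.

```python
import threading
from typing import Dict, List, Optional


class StagedDonorProvider:
    """Hypothetical provider whose miss hook is a dictionary lookup."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._staged: Dict[str, List[int]] = {}

    def stage(self, rid: str, donor_token_ids: List[int]) -> None:
        """Called off the scheduler thread once a slow search has finished."""
        with self._lock:
            self._staged[rid] = list(donor_token_ids)

    def on_prefix_miss(self, rid: str, token_ids: List[int]) -> Optional[List[int]]:
        """Fast path: pop a staged donor if one exists, else None."""
        with self._lock:
            return self._staged.pop(rid, None)


def run_search_in_background(
    provider: StagedDonorProvider, rid: str, donor: List[int]
) -> threading.Thread:
    # Stand-in for an embedding/similarity search running on another thread.
    worker = threading.Thread(target=provider.stage, args=(rid, donor))
    worker.start()
    return worker
```

If the search has not finished by the time the scheduler asks, the hook returns `None` and the request proceeds with a cold prefill instead of blocking.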

Related

This interface is analogous to the SemanticLookupProvider interface being proposed for LMCache (LMCache/LMCache#2803).

Test Plan

  • 29 unit tests pass (CPU-only, no GPU required)
  • RadixCache.create_simulated() works with provider registered and unregistered
  • No regressions to exact-match behavior (verified by TestMatchPrefixNoProvider)
  • Full SGLang test suite (CI)

Adds a first-class SemanticPrefixProvider interface that allows external
systems to provide approximate/semantic KV cache matches when the exact
radix-tree lookup returns zero cached tokens.

Changes:
- New file: python/sglang/srt/mem_cache/semantic_prefix.py
  Abstract base class SemanticPrefixProvider with on_prefix_miss,
  on_request_cached, on_init, and on_shutdown hooks.
  SemanticPrefixResult dataclass carries alternate_token_ids, hit hint,
  skip_insert flag, opaque metadata, and optional source_id for logging.

- python/sglang/srt/mem_cache/radix_cache.py
  RadixCache.set_semantic_provider(provider): register/clear the provider.
  RadixCache.match_prefix: calls _match_prefix_exact; if result is empty
  and a provider is registered and params.req is available, calls
  provider.on_prefix_miss and re-runs exact lookup with alternate tokens.
  RadixCache._match_prefix_exact: extracted inner implementation (no
  semantic fallback), callable independently.
  RadixCache.cache_finished_req: calls provider.on_request_cached after
  a successful insert so the provider can register the request as a
  future donor.

- test/srt/test_semantic_prefix_provider.py
  34 unit tests covering: set_semantic_provider lifecycle, exact-hit
  passthrough, miss-triggered callback, None return fallback, alternate-
  token lookup, extra_key preservation, exception propagation, source_id
  logging, on_request_cached hook, and multi-request independence.

- test/srt/conftest.py
  Sparse-checkout helper: stubs sglang.lang.* so mem_cache tests can run
  without a full SGLang install during local development.
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@hzh0425 hzh0425 self-assigned this Mar 18, 2026
@hzh0425
Collaborator

hzh0425 commented Mar 18, 2026

Hi, @zbennett10 can I reach you on the sglang Slack?
