Skip to content

feat(sglang): add ephemeral KV session routing#7665

Merged
ishandhanani merged 33 commits into
mainfrom
idhanani/dyn-ephemeral-kv-sessions
Apr 13, 2026
Merged

feat(sglang): add ephemeral KV session routing#7665
ishandhanani merged 33 commits into
mainfrom
idhanani/dyn-ephemeral-kv-sessions

Conversation

@ishandhanani
Copy link
Copy Markdown
Contributor

@ishandhanani ishandhanani commented Mar 27, 2026

Summary

  • Add sticky session routing and worker session lifecycle control for ephemeral KV cache reuse
  • Wire session control through streaming sessions and the SGLang frontend path
  • Session control is request-driven: no --enable-agent-controller flag needed. AgentController and StickySessionRouter activate lazily when requests carry nvext.session_control
  • Graceful degradation: if no worker has --enable-streaming-session, the router warns once and ignores session_control. Requests proceed without isolation.

Design Changes (from review feedback)

Addressed review feedback:

  • Removed --enable-agent-controller flag -- session control is per-request opt-in, not startup-gated
  • AgentController uses OnceCell<Option<EventPlaneClient>> -- lazy init on first session request, caches unavailability if no endpoint exists
  • StickySessionRouter always created but only activates when requests carry session_control
  • Handler guard -- _session_kwargs() checks enable_streaming_session before injecting session_params into SGLang calls, preventing errors when sessions are off

KV Pressure Benchmark

Controlled benchmark with interleaved main agent + subagent traffic. GLM-4.7-Flash TP=2 on 2x L40S. 60 requests: 3 main agent turns + 5 subagent sessions (9-15 turns each) replayed from OpenCode traces.

New metric sglang:kv_physical_usage = 1 - (available_size / total_pool) captures physical GPU memory occupied including evictable radix nodes.

Streaming Sessions vs No Sessions

Subagent       STREAMING (peak -> post-close)     NO-STREAMING (peak, no drop)
--------       --------------------------------   ----------------------------
Sub 1          0.437 -> 0.048  (89% reclaimed)    monotonic climb
Sub 2          0.345 -> 0.049  (86% reclaimed)    monotonic climb
Sub 3          0.277 -> 0.050  (82% reclaimed)    monotonic climb
Sub 4          0.437 -> 0.071  (84% reclaimed)    monotonic climb

Final          0.134 (13%)                        0.958 (96%)
Avg HIT_RATE   0.86                               0.004
  • Streaming = sawtooth: each session peaks, drops to ~0.05 on close (main agent's retained KV). 82-89% reclamation per session.
  • No-streaming = staircase: KV accumulates monotonically to 96%, never freed.

Priority Eviction Ablation

Same workload without streaming sessions but with --radix-eviction-policy priority. Main agent requests at priority=50, subagent requests at priority=1.

Result: priority eviction does not solve the problem.

  • PHYS_USE climbs monotonically from 0.075 to 0.999 with zero drops between sessions
  • Low-priority subagent KV lingers until the pool is physically full (~97%)
  • Under pressure, eviction churns between 0.957-0.999 but never meaningfully drops
  • Main agent KV is evicted alongside subagent KV, destroying prefix reuse (HIT_RATE: 0.004)
  • Priority eviction answers "what to evict when full" but not "when to free resources"

Streaming sessions are fundamentally different -- they free KV proactively on session close, keeping headroom so the main agent's prefix cache survives across subagent cycles.

Validation

  • cargo check -p dynamo-llm + cargo check -p dynamo-py3 (bindings)
  • cargo test -p dynamo-kv-router (269 tests pass)
  • Streaming session smoke test: components/src/dynamo/sglang/tests/test_streaming_session_smoke.py
  • python -m pytest -q components/src/dynamo/frontend/tests/test_sglang_processor_unit.py
  • End-to-end with OpenCode against GLM-4.7-Flash (session open/close/KV release verified in logs)

@ishandhanani ishandhanani requested a review from a team March 27, 2026 13:45
@ishandhanani ishandhanani requested review from a team as code owners March 27, 2026 13:45
@github-actions github-actions Bot added feat documentation Improvements or additions to documentation backend::sglang Relates to the sglang backend frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` router Relates to routing, KV-aware routing, etc. labels Mar 27, 2026
@ishandhanani ishandhanani changed the title feat: add ephemeral KV session routing on top of GLM responses fixes feat: add ephemeral KV session routing Mar 27, 2026
@ishandhanani ishandhanani changed the title feat: add ephemeral KV session routing feat(sglang): add ephemeral KV session routing Mar 27, 2026
@ishandhanani ishandhanani changed the base branch from idhanani/dyn-glm47-responses-codex to main March 27, 2026 14:11
@ishandhanani ishandhanani force-pushed the idhanani/dyn-ephemeral-kv-sessions branch from 595ae55 to 9f89eb6 Compare March 27, 2026 14:14
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Mar 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ishandhanani ishandhanani force-pushed the idhanani/dyn-ephemeral-kv-sessions branch from 9f89eb6 to f0f178d Compare March 27, 2026 14:14
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 27, 2026

Walkthrough

This pull request transitions cache control infrastructure from TTL-based prefix pinning to session-based lifecycle management. It replaces the pin_prefix endpoint with open_session/close_session handlers, introduces sticky session routing for worker affinity, adds retention_seconds injection for priority-based KV eviction with time decay, and updates related configurations and documentation.

Changes

Cohort / File(s) Summary
Configuration & Documentation Help Text
components/src/dynamo/common/configuration/groups/kv_router_args.py, lib/bindings/python/src/dynamo/_core.pyi, lib/kv-router/src/scheduling/config.rs
Updated --enable-cache-control / router_enable_cache_control help descriptions to reflect agent-aware cache control covering session lifecycle RPCs, sticky routing, and retention_seconds injection instead of prior PIN-with-TTL semantics.
Frontend Token Normalization & Parser Configuration
components/src/dynamo/frontend/sglang_prepost.py, components/src/dynamo/frontend/sglang_processor.py, components/src/dynamo/frontend/tests/test_sglang_processor_unit.py
Added _normalize_prompt_token_ids helper to standardize tokenizer output into list[int]; added _runtime_config_parser_name helper to resolve parser names from runtime config; extended unit tests for both helpers.
Request Handler Session & Retention Support
components/src/dynamo/sglang/request_handlers/handler_base.py, components/src/dynamo/sglang/request_handlers/llm/decode_handler.py, components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py
Replaced pin_prefix endpoint with open_session, close_session, and session_control handlers; added _retention_kwargs and _session_kwargs helpers; updated generation calls to inject session and retention parameters.
SGLang Endpoint & Service Configuration
components/src/dynamo/sglang/init_llm.py, components/src/dynamo/sglang/publisher.py, components/src/dynamo/sglang/register.py
Added new session_control endpoint registration in init_decode; minor import and formatting adjustments.
Protocol Definitions for Session Control
lib/llm/src/protocols/openai/nvext.rs, lib/llm/src/protocols/common/preprocessor.rs, lib/llm/src/preprocessor.rs
Introduced SessionControl struct with session_id, action (open/close), and timeout fields; added session_control field to RoutingHints and NvExt; propagated session control data through preprocessor.
Core Session & Affinity Management (New Modules)
lib/llm/src/kv_router/agent_controller.rs, lib/llm/src/kv_router/sticky_sessions.rs
Added AgentController for session lifecycle management with event-plane client initialization and deferred close actions; added StickySessionRouter with InMemoryAffinityStore for session-to-worker affinity tracking with TTL expiration and background reaper.
KV Router Refactoring
lib/llm/src/kv_router.rs, lib/llm/src/kv_router/push_router.rs
Removed cache_control module; added re-exports for approx, protocols, scheduling, selector from external crate; exposed AgentController and StickySessionRouter; refactored KvPushRouter to replace PIN infrastructure with sticky session and agent controller integration, injecting retention_seconds into request extra args.
Removed Cache Control Implementation
lib/llm/src/kv_router/cache_control.rs
Deleted entire cache-control module including PinState, CacheControlClient, create_cache_control_client, and spawn_pin_prefix function.
Example Configuration & Documentation
examples/backends/sglang/launch/agg_router.sh, docs/backends/sglang/agents.md
Updated launch script with larger model, configurable context length, session control flags, parser configurations, and radix eviction policy; replaced cache-pinning documentation with session-control semantics, priority-based retention, lifecycle actions, and updated limitations.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding ephemeral KV session routing, which is the primary objective of this large PR.
Docstring Coverage ✅ Passed Docstring coverage is 93.15% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed The PR description is comprehensive and well-structured, including summary, design changes, benchmarks, and validation details that clearly explain the feature additions and their impact.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

This comment was marked as resolved.

@ishandhanani
Copy link
Copy Markdown
Contributor Author

Follow-up on this branch:

  • renamed the router/frontend gate from --enable-cache-control to --enable-agentic-controller
  • renamed the config field to router_enable_agentic_controller
  • kept nvext.cache_control as an ignored Anthropic-compatibility input only
  • removed the active frontend/feature/session docs references to cache-control so the branch now presents a session-only model

The current PR body / CodeRabbit summary is stale on the cache-control point after this update.

@ishandhanani
Copy link
Copy Markdown
Contributor Author

Follow-up correction:

  • the gate is now --enable-agent-controller (not --enable-agentic-controller)
  • config / bindings field is now router_enable_agent_controller
  • nvext.cache_control has been removed from the OpenAI NvExt struct
  • Anthropic cache_control parsing remains accepted for request compatibility, but conversion to the internal chat request drops it, so it is a no-op

Validation rerun after this change:

  • python3 -m compileall ... on the touched Python files
  • cargo fmt --all
  • cargo test -p dynamo-llm sticky_sessions -- --nocapture
  • cargo test -p dynamo-llm cache_control_passthrough -- --nocapture

Signed-off-by: Ishan Dhanani <ishandhanani@gmail.com>
- Lazily register workers in slot tracker when the selector picks a
  worker that the monitor task hasn't propagated yet. Eliminates
  "Worker not found" and "Failed to mark prefill completed" warnings.
- Use no_fault_detection for the session_control event plane client
  so open_session RPCs don't fail due to a race between discovery
  propagation and the fault detection avail list.
- Add missing router_enable_agent_controller field to pyo3 binding.
@ishandhanani ishandhanani force-pushed the idhanani/dyn-ephemeral-kv-sessions branch from 18ad724 to 429a47f Compare April 2, 2026 02:49
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 2, 2026

- _normalize_prompt_token_ids: accept any non-string iterable for
  input_ids, not just list, preventing dict-keys fallback bug
- FakeBatchEncoding: move input_ids to instance attribute (Ruff RUF012)
- init_llm: track session_control_endpoint in shutdown_endpoints
- publisher: remove duplicate urlparse import
- agents.md: fix session example to show proper append-style turns
  with stream consumption
- agg_router.sh: derive YaRN rope_scaling.factor from CONTEXT_LENGTH
Copy link
Copy Markdown
Contributor

@PeaBrane PeaBrane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One design question on feature gating: session_control is already a per-request opt-in, and requests that do not carry it naturally want the existing KV-routing behavior. Given that, it seems like a cleaner model would be to keep normal KV routing as the unconditional default and make the session-control machinery request-driven/lazy instead of startup-gated: only initialize and use sticky-session state plus the AgentController when a request actually carries nvext.session_control, and fail only those requests if the worker side does not expose session_control or streaming sessions. That would preserve today's behavior for ordinary traffic, remove one frontend config knob, and make the feature boundary line up more closely with the API surface.

Mirror finish() in the Drop impl: free the scheduler slot before
firing the deferred session close, so the worker's KV is not released
while generation teardown is still in progress. Also guard the close
with try_current() to prevent panics outside a tokio runtime.

Remove stale pub mod declarations for subscriber/worker_query (moved
to kv_router/indexer/ in #7973).
Check for session_control presence before entering the sticky path,
so ordinary KV-routed requests avoid the resolve call entirely.
# Conflicts:
#	components/src/dynamo/sglang/request_handlers/handler_base.py
…tartup-gated

Remove --enable-agent-controller flag. AgentController and StickySessionRouter
are now always created in KvPushRouter but activate lazily: the event-plane
client is only initialized on the first request carrying session_control, and
sticky resolution short-circuits for non-session requests.

Addresses review feedback from #7665.
- AgentController uses OnceCell<Option<EventPlaneClient>> to cache
  unavailability. First session_control request with no endpoint gets a
  5s timeout, logs a single WARN, and all subsequent requests skip
  session lifecycle silently. Requests proceed without isolation.
- Handler-side guard: _session_kwargs() checks enable_streaming_session
  before injecting session_params, preventing SGLang "session does not
  exist" errors when sessions are off.
- Downgrade AgentController/StickySessionRouter init logs to debug.
- Update agg_agent.sh with arg parsing (--model-path, --tp, EXTRA_ARGS).
- Add session control section to router-guide.md.
- Update agents.md: remove stale upstream PR note, clarify request-driven
  activation, add SGLang-only note, fix OpenCode repo URL.
- Add streaming session smoke test.
…l-kv-sessions

# Conflicts:
#	docs/components/frontend/nvext.md
#	docs/components/router/router-guide.md
- Add realistic agentic prompts (GPU inference engine design workload)
- Add main agent + subagent interleaving (matches real orchestrator pattern)
- Add --mode kv-pressure for metrics-based KV reclamation validation
- Handle reasoning models (reasoning_content fallback)
- Add priority support for future ablations
@ishandhanani ishandhanani enabled auto-merge (squash) April 13, 2026 22:05
@ishandhanani ishandhanani merged commit 9498f01 into main Apr 13, 2026
88 checks passed
@ishandhanani ishandhanani deleted the idhanani/dyn-ephemeral-kv-sessions branch April 13, 2026 23:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend::sglang Relates to the sglang backend documentation Improvements or additions to documentation feat frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` router Relates to routing, KV-aware routing, etc. size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants