feat: DIS-323 [trtllm backend publisher] only publish kv event with the biggest window size to support kv routing with variable sliding window attention #2241
Conversation
Walkthrough
This update introduces new YAML configuration files for the Gemma3 backend with Variable Sliding Window Attention (VSWA), along with a markdown deployment guide. Several Llama4 Eagle model configuration files are updated to add new flags, rename keys, and adjust cache settings.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant EventSource as KV Cache Event Source
    participant Publisher as Publisher
    participant Downstream as Downstream Consumer
    loop For each KV cache event
        EventSource->>Publisher: Send KV cache event
        alt Event type is "created" and initial phase
            Publisher->>Publisher: update_max_window_size(event)
            Publisher->>Downstream: Publish event
        else Event type is "stored" or "removed"
            Publisher->>Publisher: processing_initial_created_events = False
            Publisher->>Downstream: Publish event
        else
            Publisher->>Publisher: should_drop_event(event)?
            alt should_drop_event = False
                Publisher->>Downstream: Publish event
            else should_drop_event = True
                Publisher--xDownstream: Drop event
            end
        end
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 1
🔭 Outside diff range comments (1)
components/backends/trtllm/launch/disagg_router.sh (1)
23-27: `clear_namespace` exit status is ignored – router may start with a dirty namespace
`python3 utils/clear_namespace.py --namespace dynamo` is executed without `&&` or an explicit exit-on-failure guard. A non-zero exit code will be silently ignored, contradicting the documented design decision (see learning from ai-dynamo#2137) that the router must not start if the namespace is uncleared.

```diff
-# run clear_namespace
-python3 utils/clear_namespace.py --namespace dynamo
+# run clear_namespace – abort if it fails
+python3 utils/clear_namespace.py --namespace dynamo || {
+  echo "[ERROR] Failed to clear namespace; aborting launch." >&2
+  exit 1
+}
```
♻️ Duplicate comments (1)
components/backends/trtllm/src/dynamo/trtllm/publisher.py (1)
412-419: Window size tracking implementation looks good. The method correctly tracks the maximum window size. Since this is now only called during the initial "created" events (as per lines 362-363), the efficiency concern from the previous review has been addressed.
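For reference, the tracking behaviour amounts to something like this minimal sketch (the method name comes from the review; the dict-style access to `window_size` is an assumption, not the actual TRTLLM event type):

```python
def update_max_window_size(self, event):
    # Track the largest window size observed across the initial "created"
    # events; this identifies the global attention layer under VSWA.
    self.max_window_size = max(self.max_window_size, event["window_size"])
```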
🧹 Nitpick comments (1)
components/backends/trtllm/launch/disagg_router.sh (1)
1-3: Consider adding strict shell options for safer scripting
Adding `set -euo pipefail` (and optionally `IFS=$'\n\t'`) early in the script will terminate on the first error, catch unset variables, and propagate pipeline failures, giving more predictable behaviour for production deployment scripts.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml (1 hunks)
- components/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml (1 hunks)
- components/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml (1 hunks)
- components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml (1 hunks)
- components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml (2 hunks)
- components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml (1 hunks)
- components/backends/trtllm/gemma3_sliding_window_attention.md (1 hunks)
- components/backends/trtllm/launch/disagg_router.sh (1 hunks)
- components/backends/trtllm/src/dynamo/trtllm/publisher.py (5 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally d...
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2137
File: components/backends/sglang/deploy/agg_router.yaml:0-0
Timestamp: 2025-07-28T17:00:07.968Z
Learning: In components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally designed to block the router from starting if it fails (using &&). This is a deliberate design decision where namespace clearing is a critical prerequisite and the router should not start with an uncleared namespace.
Applied to files:
components/backends/trtllm/launch/disagg_router.sh
📚 Learning: trtllm llm-api expects all caps for backend field names in configuration files. when migrating trtll...
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#2217
File: components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml:18-0
Timestamp: 2025-07-31T11:26:48.422Z
Learning: TRTLLM LLM-API expects all caps for backend field names in configuration files. When migrating TRTLLM configurations, backend values like "WideEP" should be changed to "WIDEEP" to comply with the API requirements.
Applied to files:
- components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml
- components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml
- components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml
- components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml
📚 Learning: the `--torch-backend=auto` flag works with vllm installations via uv pip install, even though it's n...
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
Applied to files:
- components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml
- components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml
🔇 Additional comments (16)
components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml (1)
22-30: Verify that the renamed key and new flag are recognised by TRTLLM
`enable_autotuner` and `speculative_model_dir` are newly introduced keys. Ensure the serving stack (both aggregated and disaggregated paths) already consumes these exact field names; otherwise they will be silently ignored.

components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml (1)
16-18: Backend value case-sensitivity
Previous migrations (learning #2217) showed TRTLLM backend values must be in ALL-CAPS (`PYTORCH`). Confirm that lowercase `pytorch` is still accepted by the latest parser; otherwise change to upper-case to avoid startup failures.

components/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml (1)
29-31: Same backend-case issue as the agg config
Double-check whether `backend: default` must be `DEFAULT`.

components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml (1)
24-32: Confirm new keys match runtime expectations
As with eagle_agg.yml, validate that:
- `enable_autotuner` is actually consumed by the decode worker.
- `speculative_model_dir` is the expected replacement for `pytorch_weights_path`.

A mismatch will silently revert to defaults and degrade performance.
components/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml (1)
16-30: LGTM! Configuration correctly implements VSWA. The configuration properly defines variable sliding window sizes with 5 layers using 512-token windows and 1 layer with a 32768-token window, which aligns with the Gemma3 model's alternating attention mechanism described in the documentation.
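For reference, the alternating pattern described in this comment can be written out explicitly (a hypothetical Python illustration; the actual settings live in the YAML config):

```python
# Gemma3's alternating attention under VSWA, as described above: five local
# layers with a 512-token sliding window for every global layer with a
# 32768-token window.
window_pattern = [512, 512, 512, 512, 512, 32768]

# Only KV events from the largest window (the global attention layer) are
# useful for Dynamo's prefix-matching KV routing.
routing_window = max(window_pattern)  # 32768
```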
components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml (4)
24-24: Configuration enhancement looks good. Explicitly setting `enable_autotuner: false` provides clear control over the autotuning behavior.
30-30: Key rename improves clarity. Renaming `pytorch_weights_path` to `speculative_model_dir` better reflects the actual content and purpose of this configuration field.
35-35: Cache optimization enabled. Enabling block reuse improves KV cache efficiency by allowing reuse of previously allocated blocks.
37-38: Cache transceiver configuration added. The addition of `cache_transceiver_config` maintains consistency with other engine configurations.

components/backends/trtllm/gemma3_sliding_window_attention.md (3)
18-21: Clear and informative documentation introduction. The explanation of VSWA and hardware requirements provides essential context for deployment.
23-28: Version requirements properly documented. Clear specification of TensorRT-LLM version requirements and build instructions for KV Routing support.
30-66: Comprehensive deployment instructions. All serving modes are covered with appropriate configuration files and launch scripts. The commands correctly reference the VSWA engine configurations.
components/backends/trtllm/src/dynamo/trtllm/publisher.py (4)
120-125: Efficient initialization for window size tracking. Good implementation with the `processing_initial_created_events` flag to limit window size checking to the initial phase, addressing the efficiency concern from the previous review.
298-300: Event filtering correctly implements VSWA support. The logic properly filters out KV events from non-global attention layers, ensuring only events with the largest window size are published.
305-305: State management for initial events is correct. The flag properly transitions from initial "created" event processing to normal operation when "stored" or "removed" events are received.
Also applies to: 346-346, 362-363
421-428: Event dropping logic correctly implements selective publishing. The method properly determines which events to drop based on window size, ensuring only events from the global attention layer (with max window size) are published after the initial phase.
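Putting the pieces together, the drop decision amounts to something like this minimal sketch (the method and attribute names mirror the review; the dict-style event access is an assumption, not the actual TRTLLM event type):

```python
def should_drop_event(self, event):
    # After the initial "created" phase, drop any KV event whose window size
    # is not the maximum observed; only the global attention layer (largest
    # window) is relevant for prefix-matching KV routing.
    return event["window_size"] != self.max_window_size
```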
wording update readme
fixing the eagle serving update trtllm commit
feat: DIS-323 [trtllm backend publisher] only publish kv event with the biggest window size to support kv routing with variable sliding window attention (ai-dynamo#2241)
Overview:
Update the TRTLLM backend publisher to emit only KV events corresponding to the largest window_size. This change enables compatibility with variable sliding window attention (VSWA) in KV routing.
Details:
When VSWA is enabled in TRTLLM, each attention layer emits a KV event to the same KV event manager. However, only the KV events from the global attention layer, which typically has the largest window_size, are relevant for Dynamo’s KV routing (which relies on prefix matching).
To avoid publishing redundant or irrelevant KV events from non-global layers, the TRTLLM backend publisher will be modified to filter and publish only the KV event with the largest window_size. TRTLLM already includes a window_size field in each KV event, which will be used to identify and select the appropriate event for publishing.
This selective publishing ensures that Dynamo receives only the meaningful KV event data necessary for accurate routing in VSWA scenarios.
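To make the selective publishing concrete, here is a hypothetical event trace (the dict-based event shape and the concrete values are illustrative only, not the actual TRTLLM event schema):

```python
# KV events as they might arrive from TRTLLM under VSWA (illustrative only).
events = [
    {"type": "created", "window_size": 512},    # local layer  -> published (initial phase)
    {"type": "created", "window_size": 32768},  # global layer -> published, max tracked
    {"type": "stored",  "window_size": 512},    # local layer  -> dropped
    {"type": "stored",  "window_size": 32768},  # global layer -> published
]

# The maximum window seen during the initial "created" burst identifies the
# global attention layer; afterwards only its events are forwarded.
max_window = max(e["window_size"] for e in events if e["type"] == "created")
published = [e for e in events
             if e["type"] == "created" or e["window_size"] == max_window]
```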
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)