
Conversation


@richardhuo-nv richardhuo-nv commented Aug 1, 2025

Overview:

Update the TRTLLM backend publisher to emit only KV events corresponding to the largest window_size. This change enables compatibility with variable sliding window attention (VSWA) in KV routing.

Details:

When VSWA is enabled in TRTLLM, each attention layer emits a KV event to the same KV event manager. However, only the KV events from the global attention layer, which typically has the largest window_size, are relevant for Dynamo’s KV routing (which relies on prefix matching).

To avoid publishing redundant or irrelevant KV events from non-global layers, the TRTLLM backend publisher will be modified to filter and publish only the KV event with the largest window_size. TRTLLM already includes a window_size field in each KV event, which will be used to identify and select the appropriate event for publishing.

This selective publishing ensures that Dynamo receives only the meaningful KV event data necessary for accurate routing in VSWA scenarios.
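
As a rough illustration, here is a minimal Python sketch of the selection rule, assuming each KV event exposes a numeric window_size field as described above; the function and the Event type are illustrative stand-ins, not the actual publisher code:

    from collections import namedtuple

    # Illustrative event shape: real TRTLLM KV events carry more fields.
    Event = namedtuple("Event", ["window_size", "payload"])

    def select_global_window_events(events):
        """Keep only the events whose window_size equals the largest one observed."""
        window_sizes = [e.window_size for e in events if e.window_size is not None]
        if not window_sizes:
            return list(events)  # nothing to filter on; pass everything through
        max_window_size = max(window_sizes)
        return [e for e in events if e.window_size == max_window_size]

    # Sliding-window-layer events (512) are dropped; global-layer events (32768) are kept.
    events = [Event(512, "swa block"), Event(32768, "global block"), Event(512, "swa block")]
    assert [e.payload for e in select_global_window_events(events)] == ["global block"]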

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features
    • Added configuration files and deployment guide for Gemma3 model with Variable Sliding Window Attention, supporting various serving modes and hardware recommendations.
  • Documentation
    • Introduced a new markdown guide detailing setup and deployment instructions for the Gemma3 model using VSWA.
  • Bug Fixes
    • Corrected file formatting by adding a missing newline in the disaggregated router script.
  • Enhancements
    • Improved event handling for key-value cache processing, enabling selective filtering based on window size for more refined event management.
  • Configuration Updates
    • Updated Llama4 Eagle model configs with new tuning flags, renamed keys, enabled block reuse, and added cache transceiver backend settings.


copy-pr-bot bot commented Aug 1, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@richardhuo-nv richardhuo-nv changed the title from "DRAFT: DIS-323 [trtllm backend publisher] only publish kv event with the biggest window size to support kv routing with variable sliding window attention" to "feat: DIS-323 [DRAFT] [trtllm backend publisher] only publish kv event with the biggest window size to support kv routing with variable sliding window attention" Aug 1, 2025
@github-actions github-actions bot added the feat label Aug 1, 2025
@jthomson04 jthomson04 self-requested a review August 5, 2025 16:19
Base automatically changed from rihuo/add_vswa to main August 5, 2025 18:09
@pull-request-size pull-request-size bot added size/L and removed size/M labels Aug 5, 2025

coderabbitai bot commented Aug 5, 2025

Walkthrough

This update introduces new YAML configuration files for the Gemma3 backend with Variable Sliding Window Attention (VSWA), along with a markdown deployment guide. Several Llama4 Eagle model configuration files are updated to add new flags, rename keys, and adjust cache settings. The Publisher class in the codebase is enhanced to filter KV cache events based on window size, introducing new attributes and methods. A shell script receives a formatting fix.

Changes

  • Gemma3 VSWA YAML Configs (components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml, .../vswa_decode.yaml, .../vswa_prefill.yaml): Added new YAML configuration files for Gemma3 VSWA, specifying tensor parallel size, backend (PyTorch), key-value cache window sizes, and cache transceiver settings.
  • Gemma3 VSWA Documentation (components/backends/trtllm/gemma3_sliding_window_attention.md): Added a markdown guide detailing prerequisites, hardware requirements, and launch commands for deploying the Gemma3 model with VSWA using Dynamo.
  • Llama4 Eagle Model Config Updates (components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml, .../eagle_decode.yaml, .../eagle_prefill.yaml): Updated YAML configs by adding the enable_autotuner flag, renaming pytorch_weights_path to speculative_model_dir in speculative_config, introducing cache_transceiver_config, and enabling enable_block_reuse in the prefill config.
  • Publisher Event Filtering Logic (components/backends/trtllm/src/dynamo/trtllm/publisher.py): Enhanced the Publisher class to track and filter KV cache events by window size, with new attributes, methods (update_max_window_size, should_drop_event), and event filtering logic in the async publishing task.
  • Shell Script Formatting (components/backends/trtllm/launch/disagg_router.sh): Fixed a missing newline at the end of the script; no functional changes.

Sequence Diagram(s)

sequenceDiagram
    participant EventSource as KV Cache Event Source
    participant Publisher as Publisher
    participant Downstream as Downstream Consumer

    loop For each KV cache event
        EventSource->>Publisher: Send KV cache event
        alt Event type is "created" and initial phase
            Publisher->>Publisher: update_max_window_size(event)
            Publisher->>Downstream: Publish event
        else Event type is "stored" or "removed"
            Publisher->>Publisher: processing_initial_created_events = False
            Publisher->>Downstream: Publish event
        else
            Publisher->>Publisher: should_drop_event(event)?
            alt should_drop_event = False
                Publisher->>Downstream: Publish event
            else should_drop_event = True
                Publisher--xDownstream: Drop event
            end
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

In the meadow of code where configs bloom bright,
Gemma3 and Eagle now take flight.
With sliding windows and caches anew,
The Publisher filters what’s passing through.
YAMLs align, scripts end just right—
A rabbit’s delight in the soft moonlight! 🐇✨




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
components/backends/trtllm/launch/disagg_router.sh (1)

23-27: clear_namespace exit status is ignored – router may start with a dirty namespace

python3 utils/clear_namespace.py --namespace dynamo is executed without && or an explicit exit-on-failure guard. A non-zero exit code will be silently ignored, contradicting the documented design decision (see learning from ai-dynamo #2137) that the router must not start if the namespace is uncleared.

-# run clear_namespace
-python3 utils/clear_namespace.py --namespace dynamo
+# run clear_namespace – abort if it fails
+python3 utils/clear_namespace.py --namespace dynamo || {
+  echo "[ERROR] Failed to clear namespace; aborting launch." >&2
+  exit 1
+}
♻️ Duplicate comments (1)
components/backends/trtllm/src/dynamo/trtllm/publisher.py (1)

412-419: Window size tracking implementation looks good.

The method correctly tracks the maximum window size. Since this is now only called during the initial "created" events (as per lines 362-363), the efficiency concern from the previous review has been addressed.

🧹 Nitpick comments (1)
components/backends/trtllm/launch/disagg_router.sh (1)

1-3: Consider adding strict shell options for safer scripting

Adding set -euo pipefail (and optionally IFS=$'\n\t') early in the script will terminate on first error, catch unset variables, and propagate pipeline failures, giving more predictable behaviour for production deployment scripts.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8f6d4d and f8759a7.

📒 Files selected for processing (9)
  • components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml (1 hunks)
  • components/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml (1 hunks)
  • components/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml (1 hunks)
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml (1 hunks)
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml (2 hunks)
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml (1 hunks)
  • components/backends/trtllm/gemma3_sliding_window_attention.md (1 hunks)
  • components/backends/trtllm/launch/disagg_router.sh (1 hunks)
  • components/backends/trtllm/src/dynamo/trtllm/publisher.py (5 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally d...
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2137
File: components/backends/sglang/deploy/agg_router.yaml:0-0
Timestamp: 2025-07-28T17:00:07.968Z
Learning: In components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally designed to block the router from starting if it fails (using &&). This is a deliberate design decision where namespace clearing is a critical prerequisite and the router should not start with an uncleared namespace.

Applied to files:

  • components/backends/trtllm/launch/disagg_router.sh
📚 Learning: trtllm llm-api expects all caps for backend field names in configuration files. when migrating trtll...
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#2217
File: components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml:18-0
Timestamp: 2025-07-31T11:26:48.422Z
Learning: TRTLLM LLM-API expects all caps for backend field names in configuration files. When migrating TRTLLM configurations, backend values like "WideEP" should be changed to "WIDEEP" to comply with the API requirements.

Applied to files:

  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml
  • components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml
📚 Learning: the `--torch-backend=auto` flag works with vllm installations via uv pip install, even though it's n...
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.

Applied to files:

  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml
🔇 Additional comments (16)
components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml (1)

22-30: Verify that renamed key and new flag are recognised by TRTLLM

enable_autotuner and speculative_model_dir are newly introduced keys. Ensure the serving stack (both aggregated and disaggregated paths) already consumes these exact field names; otherwise they will be silently ignored.

components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml (1)

16-18: Backend value case-sensitivity

Previous migrations (learning #2217) showed TRTLLM backend values must be in ALL-CAPS (PYTORCH). Confirm that lowercase pytorch is still accepted by the latest parser; otherwise change to upper-case to avoid startup failures.

components/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml (1)

29-31: Same backend-case issue as agg config

Double-check if backend: default must be DEFAULT.

components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml (1)

24-32: Confirm new keys match runtime expectations

As with eagle_agg.yml, validate that:

  1. enable_autotuner is actually consumed by the decode worker.
  2. speculative_model_dir is the expected replacement for pytorch_weights_path.

A mismatch will silently revert to defaults and degrade performance.

components/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml (1)

16-30: LGTM! Configuration correctly implements VSWA.

The configuration properly defines variable sliding window sizes with 5 layers using 512-token windows and 1 layer with a 32768-token window, which aligns with the Gemma3 model's alternating attention mechanism described in the documentation.

components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml (4)

24-24: Configuration enhancement looks good.

Explicitly setting enable_autotuner: false provides clear control over the autotuning behavior.


30-30: Key rename improves clarity.

Renaming pytorch_weights_path to speculative_model_dir better reflects the actual content and purpose of this configuration field.


35-35: Cache optimization enabled.

Enabling block reuse improves KV cache efficiency by allowing reuse of previously allocated blocks.


37-38: Cache transceiver configuration added.

The addition of cache_transceiver_config maintains consistency with other engine configurations.

components/backends/trtllm/gemma3_sliding_window_attention.md (3)

18-21: Clear and informative documentation introduction.

The explanation of VSWA and hardware requirements provides essential context for deployment.


23-28: Version requirements properly documented.

Clear specification of TensorRT-LLM version requirements and build instructions for KV Routing support.


30-66: Comprehensive deployment instructions.

All serving modes are covered with appropriate configuration files and launch scripts. The commands correctly reference the VSWA engine configurations.

components/backends/trtllm/src/dynamo/trtllm/publisher.py (4)

120-125: Efficient initialization for window size tracking.

Good implementation with the processing_initial_created_events flag to limit window size checking to the initial phase, addressing the efficiency concern from the previous review.


298-300: Event filtering correctly implements VSWA support.

The logic properly filters out KV events from non-global attention layers, ensuring only events with the largest window size are published.


305-305: State management for initial events is correct.

The flag properly transitions from initial "created" event processing to normal operation when "stored" or "removed" events are received.

Also applies to: 346-346, 362-363


421-428: Event dropping logic correctly implements selective publishing.

The method properly determines which events to drop based on window size, ensuring only events from the global attention layer (with max window size) are published after the initial phase.
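
For reference, a minimal sketch of the bookkeeping and drop decision described in these comments, using the attribute and method names mentioned in the walkthrough (update_max_window_size, should_drop_event, processing_initial_created_events); this is an assumed shape for illustration, not the verbatim publisher.py implementation:

    class WindowSizeFilter:
        """Illustrative stand-in for the Publisher's window-size filtering state."""

        def __init__(self) -> None:
            self.max_window_size = 0
            self.processing_initial_created_events = True

        def update_max_window_size(self, event) -> None:
            # Track the largest window_size seen while initial "created" events stream in.
            window_size = getattr(event, "window_size", None)
            if window_size is not None and window_size > self.max_window_size:
                self.max_window_size = window_size

        def should_drop_event(self, event) -> bool:
            # After the initial phase, drop events from smaller (non-global) windows.
            if self.processing_initial_created_events:
                return False
            window_size = getattr(event, "window_size", None)
            return window_size is not None and window_size < self.max_window_size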

@richardhuo-nv richardhuo-nv force-pushed the rihuo/vswa_kv_routing branch from f8759a7 to d0198a5 on August 6, 2025 00:32
@pull-request-size pull-request-size bot added size/M and removed size/L labels Aug 6, 2025
@richardhuo-nv richardhuo-nv force-pushed the rihuo/vswa_kv_routing branch from d0198a5 to 7c803d1 on August 6, 2025 00:36
@richardhuo-nv richardhuo-nv changed the title from "feat: DIS-323 [DRAFT] [trtllm backend publisher] only publish kv event with the biggest window size to support kv routing with variable sliding window attention" to "feat: DIS-323 [trtllm backend publisher] only publish kv event with the biggest window size to support kv routing with variable sliding window attention" Aug 6, 2025
@richardhuo-nv richardhuo-nv requested a review from tanmayv25 August 6, 2025 19:23
@richardhuo-nv richardhuo-nv force-pushed the rihuo/vswa_kv_routing branch from 7add6eb to 719c81d on August 6, 2025 21:50
@richardhuo-nv richardhuo-nv merged commit 2cf6776 into main Aug 7, 2025
13 of 16 checks passed
@richardhuo-nv richardhuo-nv deleted the rihuo/vswa_kv_routing branch August 7, 2025 01:34
mkhazraee pushed a commit to whoisj/dynamo that referenced this pull request Aug 8, 2025
…he biggest window size to support kv routing with variable sliding window attention (ai-dynamo#2241)