
Conversation


@richardhuo-nv richardhuo-nv commented Aug 1, 2025

Overview:

Update the TRTLLM backend publisher to emit only KV events corresponding to the largest window_size. This change enables compatibility with variable sliding window attention (VSWA) in KV routing.

Details:

When VSWA is enabled in TRTLLM, each attention layer emits a KV event to the same KV event manager. However, only the KV events from the global attention layer, which typically has the largest window_size, are relevant for Dynamo’s KV routing (which relies on prefix matching).

To avoid publishing redundant or irrelevant KV events from non-global layers, the TRTLLM backend publisher will be modified to filter and publish only the KV event with the largest window_size. TRTLLM already includes a window_size field in each KV event, which will be used to identify and select the appropriate event for publishing.

This selective publishing ensures that Dynamo receives only the meaningful KV event data necessary for accurate routing in VSWA scenarios.
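
As a rough illustration, here is a minimal Python sketch of the selection rule, assuming each KV event exposes a numeric window_size field as described above; the function and the Event type are illustrative stand-ins, not the actual publisher code:

    from collections import namedtuple

    # Illustrative event shape: real TRTLLM KV events carry more fields.
    Event = namedtuple("Event", ["window_size", "payload"])

    def select_global_window_events(events):
        """Keep only the events whose window_size equals the largest one observed."""
        window_sizes = [e.window_size for e in events if e.window_size is not None]
        if not window_sizes:
            return list(events)  # nothing to filter on; pass everything through
        max_window_size = max(window_sizes)
        return [e for e in events if e.window_size == max_window_size]

    # Sliding-window-layer events (512) are dropped; global-layer events (32768) are kept.
    events = [Event(512, "swa block"), Event(32768, "global block"), Event(512, "swa block")]
    assert [e.payload for e in select_global_window_events(events)] == ["global block"]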

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features
    • Added configuration files and deployment guide for Gemma3 model with Variable Sliding Window Attention, supporting various serving modes and hardware recommendations.
  • Documentation
    • Introduced a new markdown guide detailing setup and deployment instructions for the Gemma3 model using VSWA.
  • Bug Fixes
    • Corrected file formatting by adding a missing newline in the disaggregated router script.
  • Enhancements
    • Improved event handling for key-value cache processing, enabling selective filtering based on window size for more refined event management.
  • Configuration Updates
    • Updated Llama4 Eagle model configs with new tuning flags, renamed keys, enabled block reuse, and added cache transceiver backend settings.


copy-pr-bot bot commented Aug 1, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@richardhuo-nv richardhuo-nv changed the title from "DRAFT: DIS-323 [trtllm backend publisher] only publish kv event with the biggest window size to support kv routing with variable sliding window attention" to "feat: DIS-323 [DRAFT] [trtllm backend publisher] only publish kv event with the biggest window size to support kv routing with variable sliding window attention" Aug 1, 2025
@github-actions github-actions bot added the feat label Aug 1, 2025
@jthomson04 jthomson04 self-requested a review August 5, 2025 16:19
Base automatically changed from rihuo/add_vswa to main August 5, 2025 18:09
@pull-request-size pull-request-size bot added size/L and removed size/M labels Aug 5, 2025

coderabbitai bot commented Aug 5, 2025

Walkthrough

This update introduces new YAML configuration files for the Gemma3 backend with Variable Sliding Window Attention (VSWA), along with a markdown deployment guide. Several Llama4 Eagle model configuration files are updated to add new flags, rename keys, and adjust cache settings. The Publisher class in the codebase is enhanced to filter KV cache events based on window size, introducing new attributes and methods. A shell script receives a formatting fix.

Changes

  • Gemma3 VSWA YAML Configs (components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml, .../vswa_decode.yaml, .../vswa_prefill.yaml): Added new YAML configuration files for Gemma3 VSWA, specifying tensor parallel size, backend (PyTorch), key-value cache window sizes, and cache transceiver settings.
  • Gemma3 VSWA Documentation (components/backends/trtllm/gemma3_sliding_window_attention.md): Added a markdown guide detailing prerequisites, hardware requirements, and launch commands for deploying the Gemma3 model with VSWA using Dynamo.
  • Llama4 Eagle Model Config Updates (components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml, .../eagle_decode.yaml, .../eagle_prefill.yaml): Updated YAML configs by adding the enable_autotuner flag, renaming pytorch_weights_path to speculative_model_dir in speculative_config, introducing cache_transceiver_config, and enabling enable_block_reuse in the prefill config.
  • Publisher Event Filtering Logic (components/backends/trtllm/src/dynamo/trtllm/publisher.py): Enhanced the Publisher class to track and filter KV cache events by window size, with new attributes, methods (update_max_window_size, should_drop_event), and event filtering logic in the async publishing task.
  • Shell Script Formatting (components/backends/trtllm/launch/disagg_router.sh): Fixed a missing newline at the end of the script; no functional changes.

Sequence Diagram(s)

sequenceDiagram
    participant EventSource as KV Cache Event Source
    participant Publisher as Publisher
    participant Downstream as Downstream Consumer

    loop For each KV cache event
        EventSource->>Publisher: Send KV cache event
        alt Event type is "created" and initial phase
            Publisher->>Publisher: update_max_window_size(event)
            Publisher->>Downstream: Publish event
        else Event type is "stored" or "removed"
            Publisher->>Publisher: processing_initial_created_events = False
            Publisher->>Downstream: Publish event
        else
            Publisher->>Publisher: should_drop_event(event)?
            alt should_drop_event = False
                Publisher->>Downstream: Publish event
            else should_drop_event = True
                Publisher--xDownstream: Drop event
            end
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

In the meadow of code where configs bloom bright,
Gemma3 and Eagle now take flight.
With sliding windows and caches anew,
The Publisher filters what’s passing through.
YAMLs align, scripts end just right—
A rabbit’s delight in the soft moonlight! 🐇✨




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
components/backends/trtllm/launch/disagg_router.sh (1)

23-27: clear_namespace exit status is ignored – router may start with a dirty namespace

python3 utils/clear_namespace.py --namespace dynamo is executed without && or an explicit exit-on-failure guard. A non-zero exit code will be silently ignored, contradicting the documented design decision (see learning from ai-dynamo #2137) that the router must not start if the namespace is uncleared.

-# run clear_namespace
-python3 utils/clear_namespace.py --namespace dynamo
+# run clear_namespace – abort if it fails
+python3 utils/clear_namespace.py --namespace dynamo || {
+  echo "[ERROR] Failed to clear namespace; aborting launch." >&2
+  exit 1
+}
♻️ Duplicate comments (1)
components/backends/trtllm/src/dynamo/trtllm/publisher.py (1)

412-419: Window size tracking implementation looks good.

The method correctly tracks the maximum window size. Since this is now only called during the initial "created" events (as per lines 362-363), the efficiency concern from the previous review has been addressed.

🧹 Nitpick comments (1)
components/backends/trtllm/launch/disagg_router.sh (1)

1-3: Consider adding strict shell options for safer scripting

Adding set -euo pipefail (and optionally IFS=$'\n\t') early in the script will terminate on first error, catch unset variables, and propagate pipeline failures, giving more predictable behaviour for production deployment scripts.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8f6d4d and f8759a7.

📒 Files selected for processing (9)
  • components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml (1 hunks)
  • components/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml (1 hunks)
  • components/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml (1 hunks)
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml (1 hunks)
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml (2 hunks)
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml (1 hunks)
  • components/backends/trtllm/gemma3_sliding_window_attention.md (1 hunks)
  • components/backends/trtllm/launch/disagg_router.sh (1 hunks)
  • components/backends/trtllm/src/dynamo/trtllm/publisher.py (5 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally d...
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2137
File: components/backends/sglang/deploy/agg_router.yaml:0-0
Timestamp: 2025-07-28T17:00:07.968Z
Learning: In components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally designed to block the router from starting if it fails (using &&). This is a deliberate design decision where namespace clearing is a critical prerequisite and the router should not start with an uncleared namespace.

Applied to files:

  • components/backends/trtllm/launch/disagg_router.sh
📚 Learning: trtllm llm-api expects all caps for backend field names in configuration files. when migrating trtll...
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#2217
File: components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml:18-0
Timestamp: 2025-07-31T11:26:48.422Z
Learning: TRTLLM LLM-API expects all caps for backend field names in configuration files. When migrating TRTLLM configurations, backend values like "WideEP" should be changed to "WIDEEP" to comply with the API requirements.

Applied to files:

  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml
  • components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml
📚 Learning: the `--torch-backend=auto` flag works with vllm installations via uv pip install, even though it's n...
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.

Applied to files:

  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml
🔇 Additional comments (16)
components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml (1)

22-30: Verify that renamed key and new flag are recognised by TRTLLM

enable_autotuner and speculative_model_dir are newly introduced keys. Ensure the serving stack (both aggregated and disaggregated paths) already consumes these exact field names; otherwise they will be silently ignored.

components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml (1)

16-18: Backend value case-sensitivity

Previous migrations (learning #2217) showed TRTLLM backend values must be in ALL-CAPS (PYTORCH). Confirm that lowercase pytorch is still accepted by the latest parser; otherwise change to upper-case to avoid startup failures.

components/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml (1)

29-31: Same backend-case issue as agg config

Double-check if backend: default must be DEFAULT.

components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml (1)

24-32: Confirm new keys match runtime expectations

As with eagle_agg.yml, validate that:

  1. enable_autotuner is actually consumed by the decode worker.
  2. speculative_model_dir is the expected replacement for pytorch_weights_path.

A mismatch will silently revert to defaults and degrade performance.

components/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml (1)

16-30: LGTM! Configuration correctly implements VSWA.

The configuration properly defines variable sliding window sizes with 5 layers using 512-token windows and 1 layer with a 32768-token window, which aligns with the Gemma3 model's alternating attention mechanism described in the documentation.

components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml (4)

24-24: Configuration enhancement looks good.

Explicitly setting enable_autotuner: false provides clear control over the autotuning behavior.


30-30: Key rename improves clarity.

Renaming pytorch_weights_path to speculative_model_dir better reflects the actual content and purpose of this configuration field.


35-35: Cache optimization enabled.

Enabling block reuse improves KV cache efficiency by allowing reuse of previously allocated blocks.


37-38: Cache transceiver configuration added.

The addition of cache_transceiver_config maintains consistency with other engine configurations.

components/backends/trtllm/gemma3_sliding_window_attention.md (3)

18-21: Clear and informative documentation introduction.

The explanation of VSWA and hardware requirements provides essential context for deployment.


23-28: Version requirements properly documented.

Clear specification of TensorRT-LLM version requirements and build instructions for KV Routing support.


30-66: Comprehensive deployment instructions.

All serving modes are covered with appropriate configuration files and launch scripts. The commands correctly reference the VSWA engine configurations.

components/backends/trtllm/src/dynamo/trtllm/publisher.py (4)

120-125: Efficient initialization for window size tracking.

Good implementation with the processing_initial_created_events flag to limit window size checking to the initial phase, addressing the efficiency concern from the previous review.


298-300: Event filtering correctly implements VSWA support.

The logic properly filters out KV events from non-global attention layers, ensuring only events with the largest window size are published.


305-305: State management for initial events is correct.

The flag properly transitions from initial "created" event processing to normal operation when "stored" or "removed" events are received.

Also applies to: 346-346, 362-363


421-428: Event dropping logic correctly implements selective publishing.

The method properly determines which events to drop based on window size, ensuring only events from the global attention layer (with max window size) are published after the initial phase.
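
For reference, a minimal sketch of the bookkeeping and drop decision described in these comments, using the attribute and method names mentioned in the walkthrough (update_max_window_size, should_drop_event, processing_initial_created_events); this is an assumed shape for illustration, not the verbatim publisher.py implementation:

    class WindowSizeFilter:
        """Illustrative stand-in for the Publisher's window-size filtering state."""

        def __init__(self) -> None:
            self.max_window_size = 0
            self.processing_initial_created_events = True

        def update_max_window_size(self, event) -> None:
            # Track the largest window_size seen while initial "created" events stream in.
            window_size = getattr(event, "window_size", None)
            if window_size is not None and window_size > self.max_window_size:
                self.max_window_size = window_size

        def should_drop_event(self, event) -> bool:
            # After the initial phase, drop events from smaller (non-global) windows.
            if self.processing_initial_created_events:
                return False
            window_size = getattr(event, "window_size", None)
            return window_size is not None and window_size < self.max_window_size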

@richardhuo-nv richardhuo-nv force-pushed the rihuo/vswa_kv_routing branch from f8759a7 to d0198a5 on August 6, 2025 00:32
@pull-request-size pull-request-size bot added size/M and removed size/L labels Aug 6, 2025
@richardhuo-nv richardhuo-nv force-pushed the rihuo/vswa_kv_routing branch from d0198a5 to 7c803d1 on August 6, 2025 00:36
@richardhuo-nv richardhuo-nv changed the title from "feat: DIS-323 [DRAFT] [trtllm backend publisher] only publish kv event with the biggest window size to support kv routing with variable sliding window attention" to "feat: DIS-323 [trtllm backend publisher] only publish kv event with the biggest window size to support kv routing with variable sliding window attention" Aug 6, 2025
@richardhuo-nv richardhuo-nv requested a review from tanmayv25 August 6, 2025 19:23
@richardhuo-nv richardhuo-nv force-pushed the rihuo/vswa_kv_routing branch from 7add6eb to 719c81d on August 6, 2025 21:50
@richardhuo-nv richardhuo-nv merged commit 2cf6776 into main Aug 7, 2025
13 of 16 checks passed
@richardhuo-nv richardhuo-nv deleted the rihuo/vswa_kv_routing branch August 7, 2025 01:34
mkhazraee pushed a commit to whoisj/dynamo that referenced this pull request Aug 8, 2025
…he biggest window size to support kv routing with variable sliding window attention (ai-dynamo#2241)