Conversation

@richardhuo-nv
Contributor

@richardhuo-nv richardhuo-nv commented Sep 29, 2025

Overview:

There is a bug in TRTLLM KVBM offloading where G1 matched blocks can be offloaded a second time. This PR fixes that bug.

Details:

In the prefill stage, the scheduler output for a new request always reports 0 computed tokens; it only tells the connector about the device blocks and the prefill tokens.

Therefore, if the request is new and the computed position is greater than 0, but the slot's evaluated-blocks count is 0, then matched tokens are already present on the GPU and must not be offloaded again.
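The detection condition above can be sketched as a small predicate. This is an illustrative sketch only; the function and field names (`has_gpu_matched_blocks`, `computed_position`, `evaluated_blocks`) are stand-ins and may not match the actual identifiers in `slot.rs`:

```rust
// Hypothetical sketch of the G1-match detection described above.
// A new request always reports 0 computed tokens from the scheduler, so a
// non-zero computed position with no evaluated blocks implies the tokens
// were matched in the G1 (GPU) cache and must not be offloaded again.
fn has_gpu_matched_blocks(
    is_new_request: bool,
    computed_position: usize,
    evaluated_blocks: usize,
) -> bool {
    is_new_request && computed_position > 0 && evaluated_blocks == 0
}

fn main() {
    // New request with 128 tokens already matched on GPU, nothing evaluated yet.
    assert!(has_gpu_matched_blocks(true, 128, 0));
    // Cached requests are never treated as G1-matched onboarding.
    assert!(!has_gpu_matched_blocks(false, 128, 0));
    // A truly fresh request (position 0) has nothing matched.
    assert!(!has_gpu_matched_blocks(true, 0, 0));
}
```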

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Bug Fixes

    • More reliable handling of new vs. cached requests during scheduling.
    • Prevented negative candidate calculations, improving stability.
    • Refined offloading decisions for more accurate memory management.
  • Performance Improvements

    • Reduced unnecessary block evaluations for new requests, potentially lowering latency.
    • Improved disk-backed pool initialization to avoid side effects and ensure smoother reuse of configuration.
  • Refactor

    • Clarified onboarding logic for requests, improving readability and maintainability without changing user-facing behavior.

@richardhuo-nv richardhuo-nv requested a review from a team as a code owner September 29, 2025 23:55
@copy-pr-bot

copy-pr-bot bot commented Sep 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Sep 29, 2025

Walkthrough

Updated the Slot trait and VllmConnectorSlot method signatures to include an is_new_request flag, adjusted the internal logic around evaluated_blocks and offloading decisions, set an explicit true/false at the two call sites to distinguish new vs. cached requests, and changed disk pool layout construction to clone the config before use.

Changes

Cohort / File(s) Summary of Changes
Slot interface and implementation
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs
Added is_new_request: bool to apply_scheduler_output_with_computed_position in trait and impl; conditional advancement of evaluated_blocks when onboarding; switched to saturating_sub for num_candidate_blocks; offload condition uses > 0 instead of != 0.
Leader call sites (onboarding vs cached)
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs
Explicitly passed true for new requests and false for cached requests to apply_scheduler_output_with_computed_position, separating onboarding and cached paths.
Offload config handling
lib/llm/src/block_manager/offload.rs
Passed config.clone() to disk-backed build_layout, avoiding moving the original config.
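The updated candidate-block and offload logic from `slot.rs` can be illustrated as follows. This is a minimal sketch under assumed names (`num_candidate_blocks`, `computed_blocks`), not the exact code in the PR:

```rust
// Illustrative sketch of the updated offload decision: saturating_sub avoids
// underflow if evaluated_blocks exceeds the computed block count, and the
// offload path only runs when candidate blocks remain.
fn num_candidate_blocks(computed_blocks: usize, evaluated_blocks: usize) -> usize {
    computed_blocks.saturating_sub(evaluated_blocks)
}

fn main() {
    // Three blocks left to offload after one was already evaluated.
    assert_eq!(num_candidate_blocks(4, 1), 3);
    // Plain `-` would panic (debug) or wrap (release) here; saturating_sub yields 0.
    assert_eq!(num_candidate_blocks(1, 4), 0);

    let candidates = num_candidate_blocks(4, 4);
    if candidates > 0 {
        // Offload path: not taken here because candidates == 0.
        unreachable!();
    }
}
```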

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant Leader as KvConnectorLeader
  participant Slot as VllmConnectorSlot
  participant Scheduler

  Client->>Leader: build_connector_metadata(request)
  Leader->>Scheduler: get scheduler output
  alt New request
    note over Leader,Slot: is_new_request = true
    Leader->>Slot: apply_scheduler_output_with_computed_position(..., true)
    Slot->>Slot: Conditionally advance evaluated_blocks
    Slot->>Slot: Compute num_candidate_blocks via saturating_sub
    Slot->>Slot: Offload if num_candidate_blocks > 0
  else Cached request
    note over Leader,Slot: is_new_request = false
    Leader->>Slot: apply_scheduler_output_with_computed_position(..., false)
    Slot->>Slot: Skip onboarding advancement
    Slot->>Slot: Compute/decide offload as above
  end
  Leader-->>Client: connector metadata

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I twitch my ears at flags so new,
True for fresh hops, false for déjà vu.
Blocks advance where seedlings sprout,
Offloads nibble when carrots run out.
A cloned config in the burrow’s light—
Neat little trails, all routed right. 🥕🐇

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description Check ⚠️ Warning While the overview and details sections adequately describe the bug and its resolution, the “Where should the reviewer start?” section is left as the template placeholder and lacks any file references, and the related issue placeholder still uses “#xxx” instead of the actual issue number, so the description is incomplete. Please fill in the “Where should the reviewer start?” section with specific file paths or components to review, and replace “#xxx” in the Related Issues section with the actual issue number the PR closes.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title succinctly identifies the primary fix—skipping offloading of G1 matched blocks—and includes the relevant ticket reference, making it clear and focused on the core change without extraneous detail.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3aa3077 and cfb5f0a.

📒 Files selected for processing (3)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (3 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (2 hunks)
  • lib/llm/src/block_manager/offload.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
lib/llm/src/block_manager/offload.rs (1)
lib/llm/src/block_manager/layout.rs (3)
  • config (196-196)
  • config (509-511)
  • config (740-742)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: trtllm
  • GitHub Check: sglang
  • GitHub Check: vllm
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: tests (.)
🔇 Additional comments (8)
lib/llm/src/block_manager/offload.rs (1)

742-742: LGTM! Cloning config enables subsequent pool constructions.

The change from config to config.clone() ensures the original config remains available after the disk pool layout is built, allowing it to be reused for any subsequent pool constructions. This prevents a move error and is the correct approach.
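The move-vs-clone issue this change avoids can be shown with a minimal stand-in type. `LayoutConfig` and `build_layout` here are hypothetical placeholders for the real layout config and builder:

```rust
// Minimal illustration of why `config.clone()` is needed: passing `config`
// by value moves it, so any later use of the original fails to compile.
#[derive(Clone)]
struct LayoutConfig {
    num_blocks: usize,
}

fn build_layout(config: LayoutConfig) -> usize {
    config.num_blocks
}

fn main() {
    let config = LayoutConfig { num_blocks: 16 };
    let disk = build_layout(config.clone()); // clone: the original stays usable
    let host = build_layout(config); // would be a use-after-move error without the clone above
    assert_eq!(disk, host);
}
```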

lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (2)

337-337: LGTM! Correctly marks new requests for onboarding.

Passing is_new_request = true when applying scheduler output to new onboarding requests is correct and aligns with the PR's goal of distinguishing G1 matched blocks from fresh allocations during the prefill stage.


368-368: LGTM! Correctly marks cached requests.

Passing is_new_request = false for cached requests ensures that the slot logic does not treat these as onboarding requests, preventing double offloading of already-cached blocks.

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (5)

118-119: LGTM! Trait signature updated to support new/cached request distinction.

Adding the is_new_request: bool parameter to the trait method enables implementations to distinguish between onboarding new requests and processing cached requests, which is essential for the offloading bugfix.


596-596: LGTM! Implementation signature matches updated trait.

The implementation correctly adopts the new is_new_request parameter from the trait definition.


644-644: LGTM! Using saturating_sub prevents underflow.

The change from direct subtraction to saturating_sub is safer and prevents potential underflow if evaluated_blocks somehow exceeds the computed block count. This is a defensive improvement.


646-646: LGTM! Positive check is clearer for offloading decision.

Changing the condition from != 0 to > 0 makes the intent clearer: we only offload when there are positive candidate blocks. Functionally equivalent for usize, but semantically more explicit.


635-641: Block-skipping formula is correct. computed_position is a zero‐based token index, and (computed_position + 1) / block_size correctly computes the number of fully computed blocks without off‐by‐one errors.
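The formula from the comment above can be checked with a few worked cases, assuming a zero-based token index and integer division (`fully_computed_blocks` is an illustrative name, not the identifier in `slot.rs`):

```rust
// With computed_position as a zero-based token index, (computed_position + 1)
// is the number of computed tokens, and integer division by block_size gives
// the number of fully computed blocks.
fn fully_computed_blocks(computed_position: usize, block_size: usize) -> usize {
    (computed_position + 1) / block_size
}

fn main() {
    let block_size = 16;
    // Tokens 0..=15 exactly fill one block.
    assert_eq!(fully_computed_blocks(15, block_size), 1);
    // Token 16 starts a second block, which is not yet full.
    assert_eq!(fully_computed_blocks(16, block_size), 1);
    // Tokens 0..=31 fill two blocks.
    assert_eq!(fully_computed_blocks(31, block_size), 2);
}
```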



@richardhuo-nv
Contributor Author

/ok to test 9aa8230

fix

Signed-off-by: richardhuo-nv <[email protected]>
@richardhuo-nv richardhuo-nv merged commit 713e9e4 into main Oct 1, 2025
31 of 36 checks passed
@richardhuo-nv richardhuo-nv deleted the rihuo/fix_offload branch October 1, 2025 17:28
ziqifan617 pushed a commit that referenced this pull request Oct 1, 2025
nv-tusharma pushed a commit that referenced this pull request Oct 20, 2025

4 participants