[None][feat] KV Cache Connector API #6488
Conversation
📝 Walkthrough

Adds a KV-cache connector interface and wires it through C++ and Python: a new connector manager type and bindings (pybind/nanobind), an optional KvCacheConnectorManager threaded through the KV-cache managers and executor paths, a Python connector manager/scheduler/worker implementation, an example and tests, and a unique-primary-pool API.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant LLM_Executor
    participant PyExecutor
    participant KVCacheManager
    participant KvConnectorManager
    participant Worker
    Client->>LLM_Executor: create(..., kv_connector_config)
    LLM_Executor->>PyExecutor: create_py_executor(..., kv_connector_config)
    PyExecutor->>KvConnectorManager: construct(worker, scheduler)
    PyExecutor->>KVCacheManager: build and attach KvConnectorManager
    KVCacheManager->>PyExecutor: get_unique_primary_pool()
    PyExecutor->>Worker: register_kv_caches(primary_pool)
```
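The creation flow above wires a worker role and a scheduler role into one connector manager. As a rough illustration of how the worker side might be shaped in Python, here is a hedged sketch; the method names are taken from the diagrams in this summary, but the signatures and the `NoOpWorker` class are assumptions for illustration, not the actual TensorRT-LLM API:

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class KvConnectorWorker(ABC):
    """Worker-side hooks, named after the calls in the sequence diagrams."""

    @abstractmethod
    def start_load_kv(self) -> None:
        """Kick off asynchronous KV-cache loads for the scheduled batch."""

    @abstractmethod
    def wait_for_layer_load(self, layer_idx: int) -> None:
        """Block until layer `layer_idx` has its KV cache available."""

    @abstractmethod
    def save_kv_layer(self, layer_idx: int) -> None:
        """Begin (possibly asynchronous) saving of layer `layer_idx`'s KV cache."""

    @abstractmethod
    def get_finished(
        self,
        finished_gen_req_ids: List[int],
        started_loading_req_ids: List[int],
    ) -> Tuple[List[int], List[int]]:
        """Report request IDs whose async save/load operations have completed."""


class NoOpWorker(KvConnectorWorker):
    """Trivial connector: every async operation completes immediately."""

    def start_load_kv(self) -> None:
        pass

    def wait_for_layer_load(self, layer_idx: int) -> None:
        pass

    def save_kv_layer(self, layer_idx: int) -> None:
        pass

    def get_finished(self, finished_gen_req_ids, started_loading_req_ids):
        # Echo the inputs back: nothing is actually pending.
        return list(finished_gen_req_ids), list(started_loading_req_ids)


worker = NoOpWorker()
print(worker.get_finished([1, 2], [3]))  # → ([1, 2], [3])
```

A real connector would return only the subset of IDs whose transfers have actually finished, which is what makes the no-op version the degenerate case.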
```mermaid
sequenceDiagram
    participant Scheduler
    participant PyExecutor
    participant KvConnectorManager
    participant KVCacheManager
    participant DecoderLayer
    participant Worker
    Scheduler->>PyExecutor: scheduled_batch
    PyExecutor->>KvConnectorManager: take_scheduled_requests_pending_load(batch)
    KvConnectorManager->>Scheduler: build_connector_meta(output)
    KvConnectorManager->>Worker: start_load_kv(stream)
    loop per layer
        DecoderLayer->>KvConnectorManager: layer_pre_hook(layer_idx, stream)
        KvConnectorManager->>Worker: wait_for_layer_load(layer_idx, stream)
        DecoderLayer-->>DecoderLayer: forward
        DecoderLayer->>KvConnectorManager: layer_post_hook(layer_idx, stream)
        KvConnectorManager->>Worker: save_kv_layer(layer_idx, stream)
    end
    PyExecutor->>KvConnectorManager: get_finished()
    KvConnectorManager->>Scheduler: request_finished(req, block_ids)
    PyExecutor->>KVCacheManager: terminate/free as needed
```
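The per-layer loop above mirrors the usual pre-hook/post-hook pattern around each decoder layer's forward pass: wait for that layer's KV cache before attention, then start saving it after. A self-contained sketch of that ordering with stand-in classes (nothing here is the real TensorRT-LLM code; `FakeConnector` and `DecoderLayer` are illustrative only):

```python
class FakeConnector:
    """Records the order of load/save calls so the interleaving is visible."""

    def __init__(self):
        self.calls = []

    def wait_for_layer_load(self, layer_idx):
        self.calls.append(("load", layer_idx))

    def save_kv_layer(self, layer_idx):
        self.calls.append(("save", layer_idx))


class DecoderLayer:
    def __init__(self, idx, connector):
        self.idx = idx
        self.connector = connector

    def forward(self, x):
        # Pre-hook: make sure this layer's KV cache has arrived.
        self.connector.wait_for_layer_load(self.idx)
        y = x + 1  # stand-in for attention + MLP
        # Post-hook: start persisting this layer's KV cache.
        self.connector.save_kv_layer(self.idx)
        return y


conn = FakeConnector()
layers = [DecoderLayer(i, conn) for i in range(3)]
x = 0
for layer in layers:
    x = layer.forward(x)
print(conn.calls)
# → [('load', 0), ('save', 0), ('load', 1), ('save', 1), ('load', 2), ('save', 2)]
```

The point of the interleaving is that layer N+1's load can overlap with layer N's compute, which is why the hooks take a stream argument in the diagram.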
```mermaid
sequenceDiagram
    participant App
    participant KVCacheManager
    participant BlockManager
    participant WindowBlockManager
    participant KvConnectorManager
    App->>KVCacheManager: addSequence(reqId, inputLength, beamWidth, llmRequest, kvConn?)
    KVCacheManager->>BlockManager: addSequence(..., kvConn?, windowSize)
    BlockManager->>WindowBlockManager: addSequence(..., kvConn?)
    alt kvConn provided
        WindowBlockManager->>KvConnectorManager: getNumNewMatchedTokens(request, numComputedTokens)
        KvConnectorManager-->>WindowBlockManager: matchedTokens
        WindowBlockManager-->>WindowBlockManager: adjust prompt length
    end
    WindowBlockManager-->>BlockManager: allocate/context setup
    BlockManager-->>KVCacheManager: done
```
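The `getNumNewMatchedTokens` exchange above lets the connector report tokens it can supply from external storage, so the prefill is shortened accordingly. The bookkeeping reduces to simple arithmetic; a hedged sketch (the function name and the clamp to one remaining token are assumptions, the clamp kept so the model still computes logits for the first generated token):

```python
def remaining_prefill_tokens(prompt_len: int, num_computed: int, matched: int) -> int:
    """Tokens that still need a real prefill pass after subtracting
    locally reused tokens (num_computed) and connector-supplied
    tokens (matched)."""
    remaining = prompt_len - num_computed - matched
    if remaining < 1:
        # Assumed clamp: keep at least one token to prefill so the
        # forward pass still produces logits for the next token.
        remaining = 1
    return remaining


print(remaining_prefill_tokens(1024, 256, 512))  # → 256
print(remaining_prefill_tokens(10, 5, 5))        # → 1 (fully matched, clamped)
```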
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 6
🧹 Nitpick comments (14)
cpp/tensorrt_llm/pybind/CMakeLists.txt (1)
11-11: Keep source list alphabetical for easier maintenance

The newly added entry sits between `cacheTransceiver.cpp` and `kvCacheManager.cpp`; alphabetically it belongs after `kvCacheManager.cpp`. Not critical for build correctness, but keeping the order sorted reduces merge conflicts over time.

cpp/tensorrt_llm/pybind/bindings.cpp (2)
32-32: Include order nitpick

For readability the local include list is alphabetically sorted; inserting `kvCacheConnector.h` breaks that order. Consider moving it after `kvCacheManager.h`.
468-468: Use the existing `tbk` namespace alias

A dedicated alias (`tbk`) for `kv_cache_manager` is defined on Line 49 but not used here. Swapping to the alias keeps style consistent with the other `initBindings` calls.

```diff
- tb::kv_cache_manager::KVCacheManagerConnectorBindings::initBindings(mInternalBatchManager);
+ tbk::KVCacheManagerConnectorBindings::initBindings(mInternalBatchManager);
```

cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1)
1-16: Use consistent copyright header format.

The file has both SPDX-FileCopyrightText lines (lines 2-3) and the full Apache license text. For consistency with other files in the codebase, use one format throughout.
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
266-302: Add docstring to clarify the purpose and behavior.

The initialization logic is sound with appropriate validation checks and hook registration. Consider adding a docstring to explain:
- The purpose of the KV connector manager
- Why it's incompatible with certain features
- What the hooks accomplish

```diff
     def _maybe_init_kv_connector_manager(self):
+        """Initialize the KV Cache Connector Manager if provided.
+
+        The connector manager enables asynchronous and distributed KV cache management.
+        It's incompatible with KvCacheTransceiver, pipeline parallelism, and overlap scheduler
+        due to conflicting resource management approaches.
+
+        Registers pre/post hooks on decoder layers to coordinate KV cache loading/saving.
+        """
         if self.kv_connector_manager is not None:
```
913-939: Simplify duplicated conditional checks.

The conditional block starting at line 930 duplicates the same condition from line 913. Consider merging these blocks for clarity:

```diff
 if scheduled_batch.batch_size > 0 or (
         self.enable_attention_dp and self.dist.tp_size > 1):
     if self.kv_cache_transceiver:
         # For generation requests which have completed KV cache transfer
         self._prepare_disagg_gen_transmission_complete(scheduled_batch)

         # Return the first token to the client
         self._handle_first_token_response(scheduled_batch)

     self.resource_manager.prepare_resources(scheduled_batch)

     # We have to run this after we've run the KV cache manager (via the resource manager).
     # This takes requests that are pending an async load, and removes them from the
     # scheduled context batch.
     if self.kv_connector_manager:
         self.kv_connector_manager.take_scheduled_requests_pending_load(scheduled_batch)

-if scheduled_batch.batch_size > 0 or (
-        self.enable_attention_dp and self.dist.tp_size > 1):
     if self.drafter is not None:
         self.drafter.prepare_draft_tokens(scheduled_batch, self.resource_manager)

     if self.kv_connector_manager is not None:
         self.kv_connector_manager.build_connector_meta()
         self.kv_connector_manager.worker.start_load_kv()
```
1422-1426: Consider adding error handling for connector operations.

The logic correctly delegates resource management to the connector when applicable. Consider wrapping the connector call in error handling to ensure resources are properly freed even if the connector fails:

```diff
 def _terminate_request(self, request: LlmRequest):
-    if self.kv_connector_manager is None or not self.kv_connector_manager.request_finished(
-            request):
-        self.resource_manager.free_resources(request)
+    try:
+        if self.kv_connector_manager is None or not self.kv_connector_manager.request_finished(
+                request):
+            self.resource_manager.free_resources(request)
+    except Exception as e:
+        logger.warning(
+            f"Error checking connector request status, freeing resources anyway: {e}")
+        self.resource_manager.free_resources(request)
```

tests/unittest/_torch/test_connector.py (1)
108-154: Consider adding edge case tests.

The test correctly validates the filtering of async-loading requests. Consider adding tests for edge cases:
- Empty scheduled requests
- All requests pending async load
- No requests pending async load
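For the suggested edge cases, a self-contained sketch of what such tests could look like, using stand-in classes (`FakeBatch` and `FakeConnectorManager` are illustrative fakes, not the real test fixtures or the real manager API):

```python
class FakeBatch:
    def __init__(self, context_requests):
        self.context_requests = context_requests


class FakeConnectorManager:
    def __init__(self, pending_load_ids):
        self._pending = set(pending_load_ids)

    def take_scheduled_requests_pending_load(self, batch):
        # Remove requests that are still loading asynchronously.
        batch.context_requests = [
            r for r in batch.context_requests if r not in self._pending
        ]


def run_case(requests, pending):
    batch = FakeBatch(list(requests))
    FakeConnectorManager(pending).take_scheduled_requests_pending_load(batch)
    return batch.context_requests


print(run_case([], []))          # empty batch        → []
print(run_case([1, 2], [1, 2]))  # all pending load   → []
print(run_case([1, 2], []))      # none pending load  → [1, 2]
```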
tensorrt_llm/_torch/pyexecutor/_util.py (1)
399-402: Consider improving the error message for draft model incompatibility.

The validation and propagation logic is correct. Consider making the error message more descriptive:

```diff
 if self._kv_connector_manager is not None and self._draft_model_engine is not None:
     raise ValueError(
-        "Connector manager is not supported for draft model.")
+        "KV Cache Connector is not supported with draft models due to conflicting resource management.")
```

Also applies to: 423-438, 576-577
tensorrt_llm/_torch/pyexecutor/connector.py (5)
123-133: Fix line length to comply with 120 character limit.

Several lines exceed the 120 character limit specified in the coding guidelines. Apply this diff to fix the line length issues:

```diff
-            finished_gen_req_ids: The IDs of the requests that have finished generating tokens, and are now asynchronously saving.
+            finished_gen_req_ids: The IDs of the requests that have finished generating tokens, and are now
+                asynchronously saving.
-            started_loading_req_ids: The IDs of the requests that have started asynchronously loading.
+            started_loading_req_ids: The IDs of the requests that have started asynchronously loading.
-        Note: IDs may only be returned from this call after they've been provided in the `finished_gen_req_ids` and `started_loading_req_ids` arguments.
-        Additionally, the runtime will only take action based on these returned IDs once they've been returned by ALL workers. This allows some workers to take longer than others to complete the operations.
+        Note: IDs may only be returned from this call after they've been provided in the `finished_gen_req_ids`
+        and `started_loading_req_ids` arguments. Additionally, the runtime will only take action based on these
+        returned IDs once they've been returned by ALL workers. This allows some workers to take longer than
+        others to complete the operations.
```
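The "returned by ALL workers" rule described in this docstring can be captured on the runtime side with per-request counters. A minimal sketch under the assumption of a fixed, known worker count (class and method names are illustrative, not the actual runtime code):

```python
from collections import Counter


class FinishedAggregator:
    """Release a request ID only after every worker has reported it."""

    def __init__(self, world_size: int):
        self.world_size = world_size
        self.counts = Counter()

    def report(self, worker_finished_ids):
        """Record one worker's finished IDs; return IDs now finished everywhere."""
        released = []
        for req_id in worker_finished_ids:
            self.counts[req_id] += 1
            if self.counts[req_id] == self.world_size:
                released.append(req_id)
                del self.counts[req_id]
        return released


agg = FinishedAggregator(world_size=2)
print(agg.report([7]))     # only worker 0 has reported 7 → []
print(agg.report([7, 8]))  # worker 1 reports 7 → [7]; 8 still waits → [7]
```

This matches the stated contract: a slow worker simply delays release of an ID rather than causing the runtime to act on a partially completed transfer.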
178-180: Fix line length in docstring.

Apply this diff to fix the line length:

```diff
-            Whether the request is performing asynchronous saving operations.
-            If true, this indicates that the kv cache manager should wait to deallocate the blocks until the saving has completed (determined by `get_finished` on the workers).
+            Whether the request is performing asynchronous saving operations. If true, this indicates that
+            the kv cache manager should wait to deallocate the blocks until the saving has completed
+            (determined by `get_finished` on the workers).
```
276-283: Address the architectural concern with async loading.

The comment indicates this is "a bit ugly" because requests are modified after scheduling. This suggests a potential design issue where the connector's async loading decision comes too late in the pipeline.
Consider refactoring the architecture to make async loading decisions before scheduling, possibly by:
- Adding a pre-scheduling phase for connector decisions
- Or integrating connector logic into the scheduler itself
279-279: Fix line length issues.

Apply this diff to fix line length:

```diff
-        # This is problematic, since at the point when this function is called, the request has already been scheduled!
-        # Because of this, we need to remove it from our list of scheduled requests (see `take_scheduled_requests_pending_load`).
+        # This is problematic, since at the point when this function is called, the request has already been
+        # scheduled! Because of this, we need to remove it from our list of scheduled requests
+        # (see `take_scheduled_requests_pending_load`).
-        # If this request is being loaded asynchronously, in addition to removing it from the list of scheduled requests,
+        # If this request is being loaded asynchronously, in addition to removing it from the list of
+        # scheduled requests,
```

Also applies to: 301-301
327-327: Fix line length in docstrings and comments.

Apply this diff:

```diff
-            Whether the request is performing asynchronous saving operations. If true, we do not immediately call free_resources on the request.
+            Whether the request is performing asynchronous saving operations. If true, we do not immediately
+            call free_resources on the request.
-        # Pass these newly finished requests into get_finished, and get the list of requests that have finished saving and loading.
+        # Pass these newly finished requests into get_finished, and get the list of requests that have
+        # finished saving and loading.
```

Also applies to: 354-354
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (11 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (11 hunks)
- cpp/tensorrt_llm/pybind/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (1 hunks)
- tests/unittest/_torch/test_connector.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/models/modeling_utils.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/executor/worker.py
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tests/unittest/_torch/test_connector.py
- tensorrt_llm/_torch/pyexecutor/connector.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/models/modeling_utils.py
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- cpp/tensorrt_llm/pybind/bindings.cpp
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/executor/worker.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/executor/executor.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- tests/unittest/_torch/test_connector.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- tensorrt_llm/_torch/pyexecutor/connector.py
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camelcase prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope ...
Files:
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (10)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm.py (1)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
cpp/tensorrt_llm/pybind/CMakeLists.txt (1)
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-30T06:11:42.362Z
Learning: Applies to **/*.{cpp,h,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
tensorrt_llm/executor/worker.py (1)
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-30T06:11:42.362Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
925-925: Line too long (124 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
364-364: Line too long (146 > 120)
(E501)
366-366: Line too long (143 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/connector.py
124-124: Line too long (130 > 120)
(E501)
131-131: Line too long (152 > 120)
(E501)
132-132: Line too long (206 > 120)
(E501)
179-179: Line too long (176 > 120)
(E501)
279-279: Line too long (129 > 120)
(E501)
301-301: Line too long (125 > 120)
(E501)
327-327: Line too long (144 > 120)
(E501)
354-354: Line too long (131 > 120)
(E501)
🔇 Additional comments (60)
tensorrt_llm/llmapi/llm.py (1)
1008-1010: LGTM! Clean integration of KV connector configuration.

The addition of `kv_connector_config=self.args.connector_config` follows the established pattern for passing configuration objects to the executor. The parameter naming and structure are consistent with other config parameters like `lora_config`.

tensorrt_llm/llmapi/llm_args.py (1)
55-57: Import looks good – keep it.

New symbol `KvCacheConnectorConfig` is imported from the expected module; no concerns.

cpp/include/tensorrt_llm/batch_manager/llmRequest.h (3)
1843-1846: LGTM! The setter method follows established patterns in the class and correctly implements the simple assignment pattern. The naming convention adheres to the coding guidelines.

1848-1851: LGTM! The getter method is properly implemented with the `[[nodiscard]]` attribute and const qualifier. The implementation follows established patterns in the class and adheres to coding guidelines.

2027-2027: LGTM! The member variable follows the coding guidelines with proper naming convention (camelCase with 'm' prefix), appropriate initialization, and correct placement in the private section.
cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2)
22-22: LGTM! The include directive is properly formatted and positioned with other batch manager headers.

258-259: LGTM! The new property follows the established pattern and naming conventions. The addition properly exposes the KV cache connector async onboard state to Python.
tensorrt_llm/executor/worker.py (5)
16-16: LGTM! The import is properly placed and follows Python import conventions.

63-63: LGTM! The parameter addition follows proper Python typing conventions with an appropriate default value.

87-89: LGTM! The validation logic properly ensures KV connector config is only used with supported backends and provides a clear error message.

136-136: LGTM! The parameter is correctly passed only for the PyTorch backend, maintaining consistency with the validation logic.

642-642: LGTM! The `worker_main` function properly accepts and passes the KV connector config parameter, maintaining consistency with the constructor signature.
Also applies to: 769-770
cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (3)
1-16: LGTM! The file structure follows coding guidelines with proper copyright header, include guard, and minimal includes.

8-15: LGTM! The class declaration follows naming conventions and the static method signature is appropriate for pybind11 binding initialization.

17-22: LGTM! The namespace structure properly organizes pybind code and the using directive correctly imports the core connector namespace.
tensorrt_llm/executor/executor.py (3)
30-30: LGTM! The import is properly positioned with related imports and follows Python conventions.

356-356: LGTM! The parameter addition follows proper Python typing conventions with an appropriate default value and positioning.

402-403: LGTM! The parameter is consistently passed through all execution paths (MPI, single process, multi-process, Windows), ensuring KV connector configuration is available in all supported modes.
Also applies to: 417-418, 432-433, 446-447
tensorrt_llm/executor/proxy.py (2)
13-13: LGTM! The import statement correctly imports `KvCacheConnectorConfig` from the appropriate module.

50-50: LGTM! The `kv_connector_config` parameter is properly added to the constructor signature and correctly passed through to `worker_kwargs`, following the established pattern for configuration propagation.

Also applies to: 99-100
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)
192-194: LGTM! The function signature properly adds the optional `kv_connector_config` parameter following the existing pattern.

384-391: LGTM! The `kv_connector_manager` is correctly passed to `KvCacheCreator` and conditionally to `create_py_executor_instance` (omitted during KV cache estimation). This design allows the connector to be integrated into the resource management flow while avoiding unnecessary initialization during estimation.

Also applies to: 429-431, 475-475
cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (2)
29-39: LGTM! The trampoline class correctly implements the Python override for the pure virtual method using `PYBIND11_OVERRIDE_PURE_NAME` with proper argument forwarding and lifetime support for the Python object.
45-65: LGTM! The Python bindings are well-implemented:
- Read-only properties appropriately expose data members
- Tensor conversion from the internal representation to a Torch tensor is handled correctly
- Method naming follows Python conventions with snake_case
- `py::smart_holder` is appropriately used for the manager class
cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (2)
17-17: LGTM! The header guard follows the required pattern with the `TRTLLM_` prefix.
82-92: LGTM! The abstract interface is well-designed with a clear single responsibility. The pure virtual method allows for flexible implementations while maintaining a consistent interface for the C++ KV Cache Manager integration.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
136-136: LGTM! The `kv_connector_manager` parameter is properly added to the constructor and stored as an instance variable.
Also applies to: 153-154
918-920: LGTM! The new method properly delegates to the C++ implementation to retrieve KV cache connector pool data.
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
30-32: LGTM! The constructor changes properly integrate the KV Cache Connector feature with appropriate imports and initialization.
Also applies to: 131-150, 259-262
881-886: LGTM! The conditional assertion correctly handles the case where the KV Cache Connector can put requests to sleep, similar to the transceiver behavior.
tests/unittest/_torch/test_connector.py (2)
25-77: LGTM! The test correctly validates distributed synchronization of finished requests, ensuring requests are only marked as finished when all ranks agree.
79-106: LGTM! The test properly validates token matching functionality and async onboard flag propagation in a distributed setting.
tensorrt_llm/_torch/pyexecutor/_util.py (2)
24-24: LGTM! The `KvCacheCreator` constructor properly integrates the connector manager parameter following the existing pattern.
Also applies to: 47-48, 56-56
317-318: LGTM! The method correctly propagates the connector manager to supported cache managers and appropriately raises an error for the unsupported `MambaHybridCacheManager` case.
Also applies to: 325-329, 387-388
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4)
93-97: LGTM! Method signature updated correctly. The addition of the optional `returnNewBlockId` parameter and the return type change to `std::optional<SizeType32>` properly extend the interface for KV cache connector integration while maintaining backward compatibility.
100-106: LGTM! Optional parameter addition is appropriate. The optional `kvCacheConnectorManager` parameter enables KV cache connector awareness during sequence addition while preserving backward compatibility.
241-245: LGTM! New method follows established patterns. The `getKvCacheConnectorPoolsData` method is properly declared as a pure virtual method with the appropriate `[[nodiscard]]` attribute for a getter function.
351-354: LGTM! Python bindings properly expose the updated interface. The bindings correctly map the C++ methods to Python with appropriate default values and naming conventions.
Also applies to: 435-436
tensorrt_llm/_torch/pyexecutor/connector.py (6)
1-31: Excellent documentation and appropriate imports. The module docstring clearly explains the KV Cache Connector's purpose, responsibilities, and potential use cases. The imports are well-organized and appropriate for the implementation.
34-58: Well-structured dataclasses with clear documentation. The `RequestData` and `SchedulerOutput` dataclasses are properly designed with informative field comments and a convenient `add_request` helper method.
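A minimal sketch of what such dataclasses might look like. This is illustrative only: the field names beyond `add_request` are assumptions, not the actual definitions from `connector.py`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestData:
    """One scheduled request's metadata (field names are illustrative)."""
    request_id: int
    new_tokens: List[int]
    new_block_ids: List[int]
    computed_position: int


@dataclass
class SchedulerOutput:
    """Collects per-request scheduling metadata for the connector."""
    requests: List[RequestData] = field(default_factory=list)

    def add_request(self, request_id, new_tokens, new_block_ids,
                    computed_position):
        # Convenience helper in the spirit of the reviewed add_request method.
        self.requests.append(
            RequestData(request_id, new_tokens, new_block_ids,
                        computed_position))


out = SchedulerOutput()
out.add_request(7, [101, 102, 103], [0], 3)
```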
183-230: Well-implemented async request management. The `AsyncRequests` dataclass provides clean methods for managing asynchronous saving and loading requests with proper ownership transfer semantics.
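The ownership-transfer idea can be sketched roughly as follows; the `take_finished` helper and field shapes are hypothetical, shown only to illustrate pop-and-return semantics:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AsyncRequests:
    """Tracks requests with in-flight async saves/loads (illustrative)."""
    saving: Dict[int, object] = field(default_factory=dict)
    loading: Dict[int, object] = field(default_factory=dict)

    def take_finished(self, finished_ids: List[int]) -> list:
        """Pop finished requests, transferring ownership to the caller."""
        taken = []
        for rid in finished_ids:
            if rid in self.loading:
                taken.append(self.loading.pop(rid))
            elif rid in self.saving:
                taken.append(self.saving.pop(rid))
        return taken


reqs = AsyncRequests()
reqs.loading[1] = "req-1"
reqs.saving[2] = "req-2"
done = reqs.take_finished([1])  # request 1 leaves the tracker
```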
286-318: Correct implementation of request filtering and metadata building. The methods properly handle the removal of async loading requests from scheduled requests and coordinate metadata building across ranks.
341-391: Excellent distributed coordination implementation. The `get_finished` method implements a robust consensus protocol using MPI allgather and set intersection to ensure all workers agree on which requests have completed. This prevents race conditions in a distributed environment.
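The intersection step of that consensus protocol can be sketched in plain Python; here a list of per-rank ID lists stands in for the result of an MPI allgather:

```python
def consensus_finished(per_rank_finished):
    """Intersect each rank's finished-request IDs; only IDs reported by
    every rank reach consensus (list stands in for an MPI allgather)."""
    agreed = set(per_rank_finished[0])
    for rank_ids in per_rank_finished[1:]:
        agreed &= set(rank_ids)
    return agreed


# Rank 1 has not yet finished request 6, so only request 4 is acted upon.
finished = consensus_finished([[4, 6], [4], [4, 6]])
```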
395-403: Clean hook implementation for model integration. The pre/post hooks provide clear integration points for waiting on loads and triggering saves at the appropriate times during model execution.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
1201-1254: Good integration of the KV cache connector with proper safety checks. The implementation correctly handles the optional connector manager and includes a defensive check to ensure at least one uncomputed token remains. The logging provides good visibility into the token matching process.
1983-2056: Constraints properly enforced for the block ID return feature. The implementation correctly enforces that block ID return only works with a single window size and beam width 1, which aligns with the KV cache connector requirements. The optional return value is handled cleanly.
2065-2138: Excellent defensive programming and user-friendly warnings. The implementation includes proper checks for beam search compatibility and helpful warning messages when users specify a retention configuration or KV cache connector with block reuse disabled. This will help users identify configuration issues early.
2632-2649: Clean implementation of pool data aggregation. The method correctly extracts pool data and layer mappings, properly handling the interleaved data structure in the layer-to-pool mapping buffer.
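For intuition, deinterleaving such a mapping buffer might look like the sketch below. The exact layout of the real buffer (here assumed to be flat `[pool_idx, idx_in_pool, ...]` pairs) is an assumption for illustration:

```python
def split_layer_to_pool_mapping(buffer):
    """Deinterleave a flat [pool_idx, idx_in_pool, ...] buffer into
    per-layer (pool, index-in-pool) pairs (assumed layout)."""
    assert len(buffer) % 2 == 0, "buffer must hold pairs"
    return [(buffer[i], buffer[i + 1]) for i in range(0, len(buffer), 2)]


# Three layers: two in pool 0, one in pool 1.
mapping = split_layer_to_pool_mapping([0, 0, 0, 1, 1, 0])
```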
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (12)
19-19: LGTM! The include statement is properly formatted and necessary for the KV cache connector integration.
552-553: LGTM! The addition of the optional `kvCacheConnectorManager` parameter follows the established pattern of using `OptionalRef` and maintains backward compatibility while enabling new connector functionality.
751-751: LGTM! The new method declaration follows C++ best practices with the `[[nodiscard]]` attribute and const correctness. It provides necessary access to connector pool data.
886-888: LGTM! The modified signature maintains consistency with the `WindowBlockManager` changes and follows the established `OptionalRef` pattern. The addition of both connector manager and window size parameters is appropriate.
1142-1142: LGTM! The method signature is consistent with the `WindowBlockManager` equivalent and appropriately returns a vector for the multiple pools managed by `BlockManager`.
1250-1252: LGTM! The interface change consistently propagates the connector manager option through the inheritance hierarchy while maintaining backward compatibility.
1380-1380: LGTM! The new pure virtual method properly defines the abstract interface that all concrete KV cache managers must implement, ensuring consistent access to connector pool data.
1543-1545: LGTM! The method signature changes maintain consistency with the base class interface and preserve backward compatibility.
1553-1555: LGTM! The method signature changes maintain consistency with the base class interface and follow the established `OptionalRef` pattern.
1683-1683: LGTM! The override method properly implements the pure virtual method from the base class, completing the interface contract.
1689-1689: LGTM! The private method signature change supports the new functionality of optionally returning block IDs, maintaining consistency with the public API changes.
1240-1242: No caller updates required for the new return type. All existing call sites use the default `returnNewBlockId = false` and simply ignore the returned `std::optional<SizeType32>`, which is safe and preserves the previous "void" behavior. No code or test needs to be updated to capture or handle the optional value.
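The backward-compatible flag-plus-optional-return pattern is easy to see in a Python analogue of the C++ signature; the function body here is a placeholder, not the real allocation logic:

```python
from typing import Optional


def add_token(return_new_block_id: bool = False) -> Optional[int]:
    """With the default flag, callers see the old 'void' behavior (None,
    safely ignored); opting in yields the new block ID."""
    new_block_id = 42  # placeholder for a real block allocation
    return new_block_id if return_new_block_id else None


legacy = add_token()  # existing call sites ignore the return value
opted_in = add_token(return_new_block_id=True)
```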
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
376-418: Complex connector integration logic needs refactoring and line-formatting fixes. The nested conditionals and async onboarding logic are difficult to follow, and there are line length violations that need to be addressed.
Apply this fix for the line length violations:
```diff
-        # In add sequence, the connector API's get_num_new_matched_tokens is called.
-        # The result of this call may be that blocks will be loaded asynchronously.
+        # In add sequence, the connector API's get_num_new_matched_tokens is called.
+        # The result of this call may be that blocks will be loaded asynchronously.
-        # If so, we set the is_kv_cache_connector_async_onboard flag, and set the request state to be DISAGG_GENERATION_TRANS_IN_PROGRESS.
+        # If so, we set the is_kv_cache_connector_async_onboard flag, and set the
+        # request state to be DISAGG_GENERATION_TRANS_IN_PROGRESS.
-        # When the async load is complete, we set the request state back to CONTEXT_INIT.
+        # When the async load is complete, we set the request state back to
+        # CONTEXT_INIT.
```
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
264-300: LGTM: Comprehensive validation and hook registration logic. The method properly validates incompatible configurations and registers the necessary hooks. The error handling with specific `NotImplementedError` messages is appropriate for unsupported feature combinations.
Consider a minor optimization for the module iteration:
```diff
-        for _name, module in self.model_engine.model.named_modules():
+        for module in self.model_engine.model.modules():
             if isinstance(module, DecoderLayer):
```
Since the name is unused (indicated by `_name`), using `modules()` is slightly more efficient.
970-986: LGTM: Well-integrated KV connector workflow. The integration correctly handles the sequence of operations: resource preparation, request filtering, and the asynchronous loading trigger. The timing dependency comment is helpful.
Fix the line length issue on line 972:
```diff
-        # This takes requests that are pending an async load, and removes them from the scheduled context batch.
+        # This takes requests that are pending an async load, and removes them
+        # from the scheduled context batch.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/executor/worker.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/_torch/pyexecutor/_util.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
379-379: Line too long (146 > 120)
(E501)
381-381: Line too long (143 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/py_executor.py
972-972: Line too long (124 > 120)
(E501)
🔇 Additional comments (15)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (9)
13-14: LGTM: Import addition for KV cache connector support. The import of `KvCacheConnectorPoolsData` from the batch manager bindings is correctly added to support the new KV cache connector functionality.
20-20: LGTM: Import addition for connector components. The import of `KvCacheConnectorManager` and `SchedulerOutput` from the connector module is necessary for the KV cache connector integration.
136-136: LGTM: Constructor parameter and instance variable for connector manager. The optional `kv_connector_manager` parameter and its assignment to an instance variable follow good design patterns for optional feature integration.
Also applies to: 153-154
358-360: LGTM: SchedulerOutput initialization for connector coordination. Creating the `SchedulerOutput` object early in the method to collect scheduling metadata throughout the resource preparation process is a good architectural choice.
374-374: LGTM: Connector manager integration in sequence operations. The updates to `add_sequence` calls to pass the `kv_connector_manager` parameter enable proper coordination with the connector during sequence addition.
Also applies to: 384-386
392-399: LGTM: SchedulerOutput updates with request metadata. The calls to `scheduler_output.add_request()` with appropriate tokens, block IDs, and context positions correctly populate the scheduling metadata for connector coordination.
Also applies to: 407-412, 415-417, 427-429
420-420: LGTM: Generation batch token handling with connector awareness. The `add_token` call, updated to return `new_block_id` for generation requests, enables proper block tracking in the scheduler output.
431-433: LGTM: Connector manager scheduler output coordination. The conditional call to `set_scheduler_output()` when a connector manager is present properly coordinates the scheduling metadata with the connector system.
933-935: LGTM: KV cache connector pools data exposure. The new method `get_kv_cache_connector_pools_data()` correctly delegates to the underlying C++ implementation to expose connector pool data, maintaining consistency with the API design pattern used throughout the class.
tensorrt_llm/_torch/pyexecutor/py_executor.py (6)
30-30: LGTM: Necessary imports for KV Cache Connector functionality. The imports for `DecoderLayer` and `KvCacheConnectorManager` are appropriately added to support the new KV cache connector integration.
Also applies to: 32-32
148-149: LGTM: Parameter addition follows proper conventions. The `kv_connector_manager` parameter is correctly typed as optional with a sensible default value, maintaining backward compatibility.
257-260: LGTM: Clean instance variable assignment and initialization. The assignment and initialization call are properly structured and placed appropriately in the constructor sequence.
923-929: LGTM: Logical assertion relaxation for KV Cache Connector. The modification correctly handles the case where the KV Cache Connector can put requests to sleep, similar to the transceiver. The conditional logic and explanatory comment make the intention clear.
1012-1016: LGTM: Consistent cleanup logic for finished requests. The connector cleanup follows the same pattern as the transceiver cleanup, properly retrieving and terminating finished requests. The `elif` structure ensures mutually exclusive execution paths.
1463-1466: LGTM: Proper resource management delegation to the KV connector. The conditional logic correctly delegates resource management to the KV connector when it reports a request as finished, preventing double-freeing of resources while maintaining backward compatibility.
Force-pushed from 4326a24 to a2a8a82 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (3)
cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1)
59-65: Constructor still has an unnecessary-copy issue. The constructor takes references but stores by value, creating copies of the vectors. This matches the previous review comment about avoiding unnecessary copies.
Change the constructor parameters to take by value and use move semantics:
```diff
-    explicit KvCacheConnectorPoolsData(
-        std::vector<KvCacheConnectorPoolData>& poolsData, std::vector<SizeType32>& layerToPoolMapping)
-        : mPoolsData(poolsData)
-        , mLayerToPoolMapping(layerToPoolMapping)
+    explicit KvCacheConnectorPoolsData(
+        std::vector<KvCacheConnectorPoolData> poolsData, std::vector<SizeType32> layerToPoolMapping)
+        : mPoolsData(std::move(poolsData))
+        , mLayerToPoolMapping(std::move(layerToPoolMapping))
```
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
376-418: Complex connector logic still needs refactoring. The nested conditionals and state-management logic remain difficult to follow, matching the previous review comment that this needs a major refactor.
The TODO comment correctly identifies this as needing refactoring. Consider extracting this logic into separate methods for better readability:
```python
def _should_add_sequence(self, req: LlmRequest) -> bool:
    """Determine if a sequence should be added based on request state."""
    return req.is_first_context_chunk and not req.is_kv_cache_connector_async_onboard

def _handle_async_onboard_completion(self, req: LlmRequest,
                                     scheduler_output: SchedulerOutput):
    """Handle completion of async KV cache onboarding."""
    req.is_kv_cache_connector_async_onboard = False
    scheduler_output.add_request(
        req.request_id, req.get_tokens(0),
        self.impl.get_cache_block_ids(req.request_id,
                                      self.max_attention_window_vec[0]),
        req.context_current_position)
```
Additionally, the long lines should be reformatted to comply with the 120-character limit.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
1551-1567: Single window size constraint properly enforced. The implementation correctly enforces the constraint that the KV cache connector only works with a single window size. The error message clearly communicates this limitation.
🧹 Nitpick comments (8)
cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1)
67-80: Consider const-correctness for the getters. The getters return non-const references, allowing external modification of internal data. Consider whether const versions are needed for read-only access.
Add const overloads for the getters:
```diff
     std::vector<KvCacheConnectorPoolData>& getPoolsData()
     {
         return mPoolsData;
     }
+
+    std::vector<KvCacheConnectorPoolData> const& getPoolsData() const
+    {
+        return mPoolsData;
+    }

     std::vector<SizeType32>& getLayerToPoolMapping()
     {
         return mLayerToPoolMapping;
     }
+
+    std::vector<SizeType32> const& getLayerToPoolMapping() const
+    {
+        return mLayerToPoolMapping;
+    }
```
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
264-299: Consider improving the error messages for a better debugging experience. The initialization logic and hook registration are well-implemented. However, the error messages could be more informative by specifying which features are mutually exclusive.
```diff
         if self.kv_connector_manager is not None:
             if self.kv_cache_transceiver is not None:
                 raise NotImplementedError(
-                    "KV Cache Connector is not supported with KvCacheTransceiver."
+                    "KV Cache Connector and KvCacheTransceiver are mutually exclusive features. Please use only one."
                 )
             if self.dist.pp_size > 1:
                 raise NotImplementedError(
-                    "KV Cache Connector is not supported with pipeline parallelism."
+                    "KV Cache Connector is not supported with pipeline parallelism (pp_size > 1)."
                 )
             if not self.disable_overlap_scheduler:
                 raise NotImplementedError(
-                    "KV Cache Connector is not supported with overlap scheduler."
+                    "KV Cache Connector requires disable_overlap_scheduler=True."
                 )
```
tensorrt_llm/_torch/pyexecutor/connector.py (6)
60-134: Worker ABC is well-designed but has formatting issues. The abstract base class properly defines the worker interface with clear method signatures and documentation.
Fix line length issues for better readability:
```diff
     def get_finished(
             self, finished_gen_req_ids: list[int],
             started_loading_req_ids: list[int]) -> tuple[list[int], list[int]]:
         """
         Get the requests that have finished loading and saving.

         Args:
-            finished_gen_req_ids: The IDs of the requests that have finished generating tokens, and are now asynchronously saving.
+            finished_gen_req_ids: The IDs of the requests that have finished generating tokens,
+                and are now asynchronously saving.
             started_loading_req_ids: The IDs of the requests that have started asynchronously loading.

         Returns:
             The IDs of the requests that have finished saving.
             The IDs of the requests that have finished loading.

-        Note: IDs may only be returned from this call after they've been provided in the `finished_gen_req_ids` and `started_loading_req_ids` arguments.
-        Additionally, the runtime will only take action based on these returned IDs once they've been returned by ALL workers. This allows some workers to take longer than others to complete the operations.
+        Note: IDs may only be returned from this call after they've been provided in the
+        `finished_gen_req_ids` and `started_loading_req_ids` arguments.
+        Additionally, the runtime will only take action based on these returned IDs once
+        they've been returned by ALL workers. This allows some workers to take longer than
+        others to complete the operations.
         """
```
136-181: Scheduler ABC properly defines the orchestration interface. The abstract base class clearly specifies the scheduler's responsibilities for metadata building and request management.
Fix line length for consistency:
```diff
         Returns:
             Whether the request is performing asynchronous saving operations.
-            If true, this indicates that the kv cache manager should wait to deallocate the blocks until the saving has completed (determined by `get_finished` on the workers).
+            If true, this indicates that the kv cache manager should wait to deallocate the blocks
+            until the saving has completed (determined by `get_finished` on the workers).
         """
```
266-285: Token matching logic needs line-formatting fixes. The implementation correctly handles distributed token matching, but has formatting issues and a TODO that should be tracked.
Fix line length and consider tracking the TODO:
```diff
     def get_num_new_matched_tokens(self, request: LlmRequest,
                                    num_computed_tokens: int) -> int:
         num_tokens, load_kv_async = self._run_on_leader(
             lambda: self.scheduler.get_num_new_matched_tokens(
                 request, num_computed_tokens))
         if num_tokens == 0 and load_kv_async:
             raise RuntimeError(
                 "load_kv_async must be False when num_tokens is 0!")
         # TODO(jthomson04): This part is a bit ugly.
-        # When the connector indicates that a request will be loaded asynchronously, we need to suspend it's execution.
-        # This is problematic, since at the point when this function is called, the request has already been scheduled!
-        # Because of this, we need to remove it from our list of scheduled requests (see `take_scheduled_requests_pending_load`).
+        # When the connector indicates that a request will be loaded asynchronously, we need to
+        # suspend its execution. This is problematic, since at the point when this function is
+        # called, the request has already been scheduled! Because of this, we need to remove it
+        # from our list of scheduled requests (see `take_scheduled_requests_pending_load`).
         if load_kv_async:
             self.new_async_requests.loading[request.request_id] = request
             request.is_kv_cache_connector_async_onboard = True
         return num_tokens
```
286-310: Scheduled request filtering correctly implemented.The method properly removes asynchronously loading requests and updates their state.
Fix line length on line 301:
for req in scheduled_requests.context_requests: - # If this request is being loaded asynchronously, in addition to removing it from the list of scheduled requests, - # we also need to update it's state. + # If this request is being loaded asynchronously, in addition to removing it from the + # list of scheduled requests, we also need to update its state. if req.request_id in self.new_async_requests.loading.keys():
319-340: Request completion handling properly manages async operations. The method correctly delegates to the scheduler and updates request state for async saving.
Fix line length and typo:
```diff
         Returns:
-            Whether the request is performing asynchronous saving operations. If true, we do not immediately call free_resources on the request.
+            Whether the request is performing asynchronous saving operations. If true, we do not
+            immediately call free_resources on the request.
         """
```
341-391: Distributed synchronization logic is well-implemented. The `get_finished` method correctly implements the complex distributed coordination using MPI allgather to ensure all ranks agree on request completion.
Fix line length on line 354:
```diff
-        # Pass these newly finished requests into get_finished, and get the list of requests that have finished saving and loading.
+        # Pass these newly finished requests into get_finished, and get the list of requests
+        # that have finished saving and loading.
         (finished_saving, finished_loading) = self.worker.get_finished(
             finished_gen_req_ids, started_loading_req_ids)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (11 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (11 hunks)
- cpp/tensorrt_llm/pybind/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (1 hunks)
- tests/unittest/_torch/test_connector.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
- tensorrt_llm/models/modeling_utils.py
- cpp/tensorrt_llm/pybind/CMakeLists.txt
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
🚧 Files skipped from review as they are similar to previous changes (12)
- tensorrt_llm/llmapi/llm.py
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- tensorrt_llm/executor/worker.py
- tensorrt_llm/executor/executor.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/executor/proxy.py
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camelcase prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope ...
Files:
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/connector.py
- tests/unittest/_torch/test_connector.py
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/connector.py
- tests/unittest/_torch/test_connector.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/connector.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/_torch/test_connector.py (1)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/connector.py
124-124: Line too long (130 > 120)
(E501)
131-131: Line too long (152 > 120)
(E501)
132-132: Line too long (206 > 120)
(E501)
179-179: Line too long (176 > 120)
(E501)
279-279: Line too long (129 > 120)
(E501)
301-301: Line too long (125 > 120)
(E501)
327-327: Line too long (144 > 120)
(E501)
354-354: Line too long (131 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
379-379: Line too long (146 > 120)
(E501)
381-381: Line too long (143 > 120)
(E501)
🔇 Additional comments (40)
cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (4)
1-15: NVIDIA copyright header looks good. The copyright header includes the required current year (2024) and follows the standard NVIDIA format for TensorRT-LLM open source code.
32-54: LGTM on KvCacheConnectorPoolData class design. The class follows good C++ practices:
- Uses move semantics in constructor to avoid unnecessary copies
- Proper const-correctness on getters
- Private member variables with appropriate accessor methods
- Clear, descriptive naming
83-92: LGTM on abstract interface design. The KvCacheConnectorManager follows good abstract interface design:
- Pure virtual destructor with default implementation
- Clear, documented pure virtual method
- Appropriate parameter types and const-correctness
- Good documentation explaining the purpose
94-94: Namespace closing comment follows guidelines. The closing brace properly includes a comment identifying the namespace being closed, as required by the coding guidelines.
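To make the shape of this abstract interface concrete, here is a hedged Python sketch of a connector-manager analogue. The class and method names (`KvCacheConnectorManagerSketch`, `get_num_new_matched_tokens`) are illustrative assumptions, not the actual C++ signatures:

```python
from abc import ABC, abstractmethod


class KvCacheConnectorManagerSketch(ABC):
    """Hypothetical Python analogue of the abstract manager interface."""

    @abstractmethod
    def get_num_new_matched_tokens(self, request_id: int,
                                   num_computed_tokens: int) -> int:
        """Return how many extra tokens the connector can supply beyond
        what the local KV cache already holds for this request."""


class NoOpConnectorManager(KvCacheConnectorManagerSketch):
    """Trivial implementation: the connector never contributes tokens."""

    def get_num_new_matched_tokens(self, request_id, num_computed_tokens):
        return 0
```

A pure-virtual interface like this lets the KV-cache manager query external token matches without depending on any concrete connector backend.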
tensorrt_llm/_torch/pyexecutor/resource_manager.py (10)
13-15: Import looks correct. The import of KvCacheConnectorPoolsData from the internal batch manager bindings is properly structured and follows the existing pattern in the file.
20-20: Import follows proper namespace convention. The import maintains the namespace when importing from the connector module, which aligns with the Python coding guidelines.
136-137: Constructor parameter addition looks good. The optional kv_connector_manager parameter follows the existing pattern and is properly typed with Optional[KvCacheConnectorManager].
153-154: Storage of connector manager is appropriate. The connector manager is stored as an instance variable for later use in the resource management methods.
358-361: Scheduler output initialization is correct. Creating a SchedulerOutput instance to accumulate scheduling metadata for the connector is the right approach.
374-374: Connector manager passed to add_sequence calls. The connector manager is properly passed to the add_sequence calls, enabling coordination with the connector during sequence addition.
Also applies to: 385-386
420-420: Token addition with return value capture. The modification to capture the new_block_id from add_token is correct for tracking newly allocated blocks in generation requests.
425-430: Scheduler output update for generation requests. The logic to add the generation request to the scheduler output with the new block ID (if any) is implemented correctly.
431-433: Connector manager update is appropriate. The conditional check and call to set_scheduler_output properly updates the connector manager with the accumulated scheduling information.
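The accumulate-then-publish pattern described here can be sketched roughly as follows. All names (`SchedulerOutputSketch`, `add_generation_request`, the stub manager) are illustrative assumptions rather than the actual classes:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GenRequestEntry:
    """One scheduled generation request and its newly allocated block, if any."""
    request_id: int
    new_block_id: Optional[int]


@dataclass
class SchedulerOutputSketch:
    generation_requests: List[GenRequestEntry] = field(default_factory=list)

    def add_generation_request(self, request_id, new_block_id):
        self.generation_requests.append(GenRequestEntry(request_id, new_block_id))


class ConnectorManagerStub:
    def __init__(self):
        self.latest_output = None

    def set_scheduler_output(self, output):
        # The connector later builds its metadata from this snapshot.
        self.latest_output = output


# Resource-manager side: accumulate per-request metadata while scheduling,
# then hand the whole snapshot to the connector manager in one call.
output = SchedulerOutputSketch()
output.add_generation_request(7, new_block_id=42)
output.add_generation_request(8, new_block_id=None)  # no new block this step

manager = ConnectorManagerStub()
manager.set_scheduler_output(output)
```

Publishing one consolidated snapshot per scheduling step keeps the connector decoupled from the block manager's internal allocation loop.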
933-935: New method implementation looks correct. The get_kv_cache_connector_pools_data method properly delegates to the implementation and returns the connector pools data.
tensorrt_llm/_torch/pyexecutor/py_executor.py (6)
30-30: LGTM! Import additions are appropriate. The new imports for DecoderLayer and KvCacheConnectorManager are correctly placed and necessary for the KV cache connector integration.
Also applies to: 32-32
148-149: Constructor properly extended with KV connector manager parameter. The optional parameter follows best practices with type hints and an appropriate default value. The initialization is correctly deferred to a dedicated method.
Also applies to: 257-259
923-928: Assertion correctly relaxed for KV cache connector compatibility. The modified assertion appropriately accounts for both the KV cache transceiver's and the connector's ability to suspend requests, preventing false failures when no requests are immediately schedulable.
972-974: Proper integration of scheduled request filtering. The call to take_scheduled_requests_pending_load correctly removes asynchronously loading requests from the scheduled batch before forward execution.
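A hedged sketch of what this filtering step amounts to — the request shape and function body are illustrative assumptions, not the real implementation:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    request_id: int
    is_async_loading: bool = False


@dataclass
class ScheduledBatch:
    context_requests: List[Request] = field(default_factory=list)


def take_scheduled_requests_pending_load(batch: ScheduledBatch) -> List[Request]:
    """Remove requests whose KV blocks are still loading asynchronously.

    Held-back requests skip the forward pass for this iteration and are
    re-scheduled once their KV cache onboarding completes.
    """
    pending = [r for r in batch.context_requests if r.is_async_loading]
    batch.context_requests = [
        r for r in batch.context_requests if not r.is_async_loading
    ]
    return pending


batch = ScheduledBatch(context_requests=[
    Request(1, is_async_loading=True),
    Request(2, is_async_loading=False),
])
held = take_scheduled_requests_pending_load(batch)
```

Filtering in place keeps the forward path unaware of the connector: it only ever sees requests whose KV data is already resident.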
1008-1012: Finished request handling properly integrated. The connector manager correctly takes over the responsibility of tracking and terminating finished requests, avoiding conflicts with the standard resource management flow.
1459-1461: Resource management delegation correctly implemented. The conditional logic ensures that the connector manager handles resource cleanup for its managed requests, preventing double-free scenarios.
tests/unittest/_torch/test_connector.py (4)
1-18: Test setup properly configured for MPI testing. The imports and pickle configuration are correctly set up for distributed testing with cloudpickle and mpi4py.
25-76: Comprehensive test for distributed request completion synchronization. The test effectively validates the MPI allgather behavior, ensuring requests are only marked as finished when all ranks agree. The mock setup and assertions properly cover the distributed coordination logic.
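The agreement rule this test exercises can be sketched without MPI: a request counts as globally finished only once every rank has reported it. Here the allgather is simulated with a plain list of per-rank sets (the function name is illustrative):

```python
from typing import List, Set


def finished_on_all_ranks(per_rank_finished: List[Set[int]]) -> Set[int]:
    """Simulate allgather + intersection: a request id is globally finished
    only when every rank's worker has reported it finished."""
    result = set(per_rank_finished[0])
    for rank_set in per_rank_finished[1:]:
        result &= rank_set
    return result


# Rank 1 has not yet finished saving request 11, so only 10 is released.
gathered = [{10, 11}, {10}, {10, 11}]
globally_done = finished_on_all_ranks(gathered)
```

Intersecting the per-rank reports prevents a request from being freed while any rank is still reading or writing its KV blocks.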
79-105: Token matching test correctly validates leader-only behavior. The test properly verifies that token matching queries are handled only on the leader rank and that the async onboard flag is correctly set on requests.
108-153: Scheduled request filtering test validates critical functionality. The test properly verifies that asynchronously loading requests are removed from the scheduled batch while synchronously loading requests are preserved.
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4)
93-97: addToken signature properly extended with optional return value. The method correctly uses std::optional<SizeType32> for the return type and includes the new returnNewBlockId parameter with an appropriate default value.
99-106: addSequence signature correctly extended for connector integration. The method properly accepts optional references for both llmRequest and kvCacheConnectorManager, maintaining backward compatibility.
241-245: New virtual method properly declared for connector pool data. The getKvCacheConnectorPoolsData method follows the established pattern with the [[nodiscard]] attribute and proper const-correctness.
351-354: Python bindings correctly updated for new signatures. The bindings properly expose the new parameters and methods with appropriate default values and naming conventions.
Also applies to: 436-436
tensorrt_llm/_torch/pyexecutor/connector.py (5)
1-31: Well-structured module with clear documentation. The imports are appropriate and the module docstring provides excellent context about the KV Cache Connector's purpose and capabilities.
34-58: Data classes appropriately model request metadata. The RequestData and SchedulerOutput classes provide clean interfaces for tracking inflight requests and their associated metadata.
183-230: AsyncRequests helper class provides clean request management. The internal dataclass effectively encapsulates async request tracking with well-designed methods for manipulation and querying.
232-265: Manager initialization correctly enforces rank constraints. The constructor properly validates that the scheduler only exists on rank 0 and initializes request tracking structures.
392-403: Hook methods provide clean integration points. The scheduler output setter and layer/model hooks are properly implemented for executor integration.
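The layer-hook integration can be pictured as pre/post callbacks wrapped around each decoder layer's forward pass. This toy version (no torch; `wait_for_layer_load` and `save_kv_layer` are assumed names modeled on the worker interface) just records the call order the real hooks would follow:

```python
calls = []


class ConnectorWorkerStub:
    def wait_for_layer_load(self, layer_idx):
        # Pre-hook: block until this layer's KV blocks are onboarded.
        calls.append(("wait_load", layer_idx))

    def save_kv_layer(self, layer_idx):
        # Post-hook: kick off (possibly async) saving of this layer's KV.
        calls.append(("save", layer_idx))


def run_layer(worker, layer_idx):
    worker.wait_for_layer_load(layer_idx)
    calls.append(("forward", layer_idx))  # the layer's actual forward pass
    worker.save_kv_layer(layer_idx)


worker = ConnectorWorkerStub()
for idx in range(2):
    run_layer(worker, idx)
```

Per-layer hooks let loads and saves overlap with compute: layer N+1 can load while layer N runs its forward pass.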
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (7)
1200-1206: LGTM - Clean parameter forwarding. The method signature correctly adds the KV cache connector manager parameter and properly forwards it to the underlying window block manager.
1208-1254: Well-implemented KV cache connector integration. The method properly:
- Validates that beam search is not used with the connector
- Retrieves additional matched tokens from the connector
- Updates the prepopulated prompt length correctly
- Includes helpful debug logging
The constraint checking and integration logic appears sound.
1983-2050: Correctly implemented optional block ID return. The method properly:
- Validates that the returnNewBlockId feature only works with single window size and no beam search
- Conditionally returns the newly allocated block ID when requested
- Maintains backward compatibility with existing callers
The constraint validation is consistent with other KV cache connector limitations.
2052-2056: LGTM - Clean parameter addition. The method signature correctly adds the optional returnNewBlockId parameter and properly forwards it to the updateToken method.
2065-2137: Good validation and user guidance for connector usage. The method properly:
- Validates beam search compatibility with the connector
- Provides helpful warnings when the connector is specified but block reuse is disabled
- Maintains consistency with connector constraints across the codebase
The user-facing warnings will help developers understand when and why the connector is ignored.
2632-2648: Correctly implemented connector pools data aggregation. The method properly:
- Retrieves pool data from the block manager
- Extracts layer-to-pool mapping from the internal buffer
- Combines both pieces of information into the connector pools data structure
The buffer indexing logic (layer * 2) correctly accounts for the paired data structure.
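The paired-buffer indexing mentioned above can be sketched as follows. The exact buffer layout is an assumption extrapolated from the review's `layer * 2` remark — two entries per layer, with the pool index assumed to sit at offset `layer * 2`:

```python
from typing import List


def layer_to_pool_mapping(paired_buffer: List[int], num_layers: int) -> List[int]:
    """Extract one pool index per layer from a flat buffer holding two
    values per layer, reading the first value of each pair."""
    assert len(paired_buffer) >= num_layers * 2
    return [paired_buffer[layer * 2] for layer in range(num_layers)]


# Three layers: pool indices 0, 0, 1 interleaved with a second per-layer value.
buf = [0, 5, 0, 6, 1, 7]
mapping = layer_to_pool_mapping(buf, 3)
```

Deriving the mapping from the existing buffer avoids maintaining a second, separately synchronized layer-to-pool table.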
2497-2497: LGTM - Correct parameter addition. The call to updateToken correctly includes the returnNewBlockId parameter set to false, which is appropriate for the removeToken operation.
c0b41f9 to 7b378bd Compare
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
377-420: Complex connector integration logic needs refactoring. This logic contains the same complexity issues previously identified, with nested conditionals and state management that make it difficult to follow. The TODO comment confirms this needs refactoring.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/connector.py (1)
266-284: Address the technical debt noted in the TODO comment. The current approach of modifying request state after scheduling is acknowledged as "ugly" in the TODO comment. While the implementation works, consider refactoring this in a future iteration to make the async loading decision earlier in the scheduling process.
The comment on lines 276-279 indicates technical debt where async loading decisions cause post-scheduling modifications. Consider refactoring to make these decisions during the scheduling phase to avoid the need for take_scheduled_requests_pending_load.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (11 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (11 hunks)
- cpp/tensorrt_llm/pybind/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (1 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (1 hunks)
- tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
- tests/unittest/_torch/test_connector.py (1 hunks)
✅ Files skipped from review due to trivial changes (5)
- cpp/tensorrt_llm/pybind/CMakeLists.txt
- cpp/tensorrt_llm/pybind/bindings.cpp
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
🚧 Files skipped from review as they are similar to previous changes (15)
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/models/modeling_utils.py
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- tensorrt_llm/executor/worker.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/executor/executor.py
- tests/unittest/_torch/test_connector.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
- tensorrt_llm/_torch/pyexecutor/connector.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
- tensorrt_llm/_torch/pyexecutor/connector.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (5)
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/connector.py
124-124: Line too long (130 > 120)
(E501)
131-131: Line too long (152 > 120)
(E501)
132-132: Line too long (206 > 120)
(E501)
179-179: Line too long (176 > 120)
(E501)
279-279: Line too long (129 > 120)
(E501)
301-301: Line too long (125 > 120)
(E501)
327-327: Line too long (144 > 120)
(E501)
354-354: Line too long (131 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
380-380: Line too long (146 > 120)
(E501)
382-382: Line too long (143 > 120)
(E501)
🔇 Additional comments (22)
tests/integration/defs/llmapi/test_llm_api_connector.py (4)
1-30: LGTM! Well-structured imports and mocking setup. The copyright header, imports, and global mock setup follow TensorRT-LLM conventions and provide a clean foundation for integration testing.
32-64: LGTM! Excellent test utilities for call order verification. The CallTimeMonitor class is a well-designed utility that provides clear verification of method call ordering, which is crucial for testing distributed KV cache coordination.
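A minimal version of such a call-order monitor (a hypothetical reimplementation for illustration, not the test's actual code) could look like:

```python
import itertools


class CallTimeMonitorSketch:
    """Record a monotonically increasing tick for each named call so a test
    can assert that one method ran before another."""

    def __init__(self):
        self._ticks = itertools.count()
        self.call_times = {}

    def record(self, name):
        self.call_times.setdefault(name, []).append(next(self._ticks))

    def assert_called_before(self, first, second):
        assert self.call_times[first][0] < self.call_times[second][0], (
            f"{first} was not called before {second}")


monitor = CallTimeMonitorSketch()
monitor.record("build_connector_meta")
monitor.record("start_load_kv")
monitor.assert_called_before("build_connector_meta", "start_load_kv")
```

Recording ticks rather than wall-clock times keeps the ordering assertions deterministic even when the monitored calls are fast.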
66-99: LGTM! Comprehensive fixture setup for connector testing. The fixture properly mocks the connector components and sets up thorough monitoring of key API methods. This provides excellent test coverage for the KV cache connector integration.
101-159: LGTM! Comprehensive integration test for connector workflow. The test effectively validates the complete KV cache connector workflow including:
- Proper initialization and registration
- Correct sequencing of scheduler metadata building before worker operations
- Layer-wise KV cache loading and saving operations
- Request completion handling
The call ordering verification using CallTimeMonitor ensures the distributed coordination works correctly.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
13-20: LGTM! Appropriate imports for connector integration. The new imports correctly bring in the KV cache connector types and manager classes needed for the feature integration.
136-154: LGTM! Clean constructor enhancement for connector support. The optional kv_connector_manager parameter is properly added with appropriate type hints and follows the existing parameter pattern.
935-937: LGTM! Clean accessor method for connector pools data. The new method follows the established pattern of delegating to the underlying C++ implementation and provides necessary access to connector pool information.
tensorrt_llm/_torch/pyexecutor/py_executor.py (7)
30-30: LGTM! Necessary imports for KV Cache Connector integration. The imports are properly structured and follow existing patterns in the file.
Also applies to: 32-32
150-150: LGTM! Constructor parameter addition follows best practices. The optional parameter with proper type annotation and default value ensures backward compatibility while enabling the new KV Cache Connector functionality.
Also applies to: 258-261
265-295: LGTM! Excellent error checking and initialization logic. The method properly validates incompatible configurations and provides clear error messages. The hook registration on DecoderLayer modules follows a clean pattern for integrating with the model's forward pass.
918-924: LGTM! Correctly handles KV Cache Connector request suspension. The logic appropriately exempts the assertion when the KV Cache Connector is enabled, since it can suspend requests similarly to the transceiver. The explanatory comment makes the intent clear.
965-965: LGTM! Well-integrated KV Cache Connector operations. The connector operations are properly sequenced within the executor loop:
- Setting the warmup flag provides necessary context
- Async loading operations occur at the right time after resource preparation
- Finished request handling is appropriately placed in the cleanup phase
The control flow logic correctly ensures connector operations only run when enabled.
Also applies to: 968-973, 1006-1010
1360-1363: LGTM! Properly placed synchronization call. The `wait_for_save()` call is correctly positioned after the forward step to ensure all KV cache saving operations complete before continuing. The conditional check ensures it only runs when the connector is enabled.
1461-1464: LGTM! Proper resource management delegation. The conditional logic correctly delegates resource management to the KV Cache Connector when it's handling a request's lifecycle, preventing resource conflicts while maintaining backward compatibility.
tensorrt_llm/_torch/pyexecutor/connector.py (8)
1-31: LGTM! Excellent module documentation and clean imports. The comprehensive documentation clearly explains the KV Cache Connector's purpose, use cases, and architecture. The imports are well-organized and include all necessary components for the interface definition.
34-58: LGTM! Well-designed data classes with clear structure.Both
RequestDataandSchedulerOutputfollow dataclass best practices:
- Clear field documentation
- Appropriate use of
field(default_factory=list)for mutable defaults- Clean helper methods for building the data structures
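The `default_factory` pattern praised above avoids the classic shared-mutable-default bug. A minimal sketch, with field and method names assumed for illustration rather than copied from the real `RequestData`/`SchedulerOutput` definitions:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestData:
    request_id: int
    # default_factory gives each instance its own list, unlike a shared
    # `new_tokens: List[int] = []` default would.
    new_tokens: List[int] = field(default_factory=list)


@dataclass
class SchedulerOutput:
    requests: List[RequestData] = field(default_factory=list)

    def add_request(self, request_id, new_tokens=()):
        """Helper for building the output incrementally."""
        self.requests.append(RequestData(request_id, list(new_tokens)))


out = SchedulerOutput()
out.add_request(7, [1, 2, 3])
```

Each `SchedulerOutput()` starts with an independent empty list, so mutating one instance never leaks into another.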
60-134: LGTM! Comprehensive worker interface with excellent documentation. The abstract base class defines a clear contract for KV cache operations:
- Well-separated concerns for registration, loading, and saving
- Comprehensive docstrings explaining expected behavior
- The `get_finished` method design allows for proper coordination across distributed workers

Note: There are some line-length violations flagged by static analysis, but readability matters more than strict adherence to the 120-character limit in documentation.
136-181: LGTM! Well-designed scheduler interface with appropriate abstractions. The scheduler interface provides the right level of abstraction:
- Required methods are abstract while optional ones have safe defaults
- Clear documentation of expected behavior and return values
- Supports both synchronous and asynchronous operation patterns
The default implementation of `get_num_new_matched_tokens` returning `(0, False)` is a safe fallback.
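The "required methods abstract, optional methods with safe defaults" style described above can be sketched as follows. The method names come from the review text; everything else (the `NoopScheduler` subclass, the argument shapes) is an illustrative assumption:

```python
from abc import ABC, abstractmethod
from typing import Tuple


class KvConnectorScheduler(ABC):
    """Sketch of a scheduler-side connector interface."""

    @abstractmethod
    def build_connector_meta(self, scheduler_output):
        """Required: build the metadata that workers will consume."""

    def get_num_new_matched_tokens(self, request,
                                   num_computed_tokens) -> Tuple[int, bool]:
        """Optional: (number of externally matched tokens, whether the
        load is asynchronous). The default reports no external match."""
        return (0, False)


class NoopScheduler(KvConnectorScheduler):
    """Minimal concrete implementation: only the required method."""

    def build_connector_meta(self, scheduler_output):
        return None


sched = NoopScheduler()
```

A subclass that cannot match tokens externally simply inherits the `(0, False)` fallback and stays correct without overriding anything optional.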
183-230: LGTM! Excellent state management for asynchronous operations. The `AsyncRequests` class provides clean encapsulation of async request state:
- Efficient dictionary-based storage for fast lookups
- Proper ownership transfer in the `add_from` method
- Clean extraction and querying methods
- Well-documented interface for managing async saving/loading requests
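A small sketch of the idea described above: a dict keyed by request ID gives O(1) lookup, and `add_from` moves entries between containers so exactly one container owns each request at a time. Method names follow the review; the implementation details are assumptions, not the real class:

```python
class AsyncRequests:
    """Tracks in-flight async requests, keyed by request ID."""

    def __init__(self):
        self._requests = {}  # request_id -> request object

    def add(self, request_id, request):
        self._requests[request_id] = request

    def add_from(self, other, request_ids):
        """Take ownership of the given requests from `other`."""
        for rid in request_ids:
            self._requests[rid] = other._requests.pop(rid)

    def extract(self, request_ids):
        """Remove and return the matching requests."""
        return {
            rid: self._requests.pop(rid)
            for rid in request_ids if rid in self._requests
        }

    def __contains__(self, request_id):
        return request_id in self._requests


saving = AsyncRequests()
saving.add(1, "req1")
saving.add(2, "req2")

finished = AsyncRequests()
finished.add_from(saving, [1])  # request 1 moves: saving -> finished
```

After the transfer, request 1 lives only in `finished` while request 2 remains in `saving`, which is the ownership-transfer behavior the review highlights.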
232-265: LGTM! Well-structured coordinator class with proper rank validation. The constructor correctly enforces that the scheduler only exists on rank 0, which is essential for the distributed coordination pattern. The inheritance from the C++ binding provides the necessary low-level functionality.
341-390: Excellent distributed coordination logic in the `get_finished` method. The implementation correctly handles the complex distributed synchronization:
- Uses MPI allgather to collect results from all workers
- Employs set intersection to ensure consensus across all workers
- Proper state management for async requests through different phases
- Correct state transitions for finished loading requests
This ensures that resources are only freed when all workers agree the operations are complete.
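The consensus step described above can be sketched without MPI: each worker reports the set of request IDs it considers finished, and a request is only treated as globally finished once every worker agrees, i.e. the intersection over the allgather result. In real code `per_worker_finished` would come from an MPI allgather; here it is a plain list of sets for illustration:

```python
from functools import reduce


def finished_on_all_workers(per_worker_finished):
    """Return the request IDs every worker reported as finished.

    `per_worker_finished` stands in for the result of an MPI allgather:
    one iterable of finished request IDs per worker.
    """
    if not per_worker_finished:
        return set()
    return reduce(set.intersection, (set(s) for s in per_worker_finished))


# Worker 1 is still saving request 9, so only 3 and 5 are globally finished.
gathered = [{3, 5, 9}, {3, 5}, {5, 3, 9}]
done = finished_on_all_workers(gathered)
```

Only after a request ID appears in `done` would the manager free its resources, which prevents freeing blocks that a slower worker is still writing.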
395-403: LGTM! Clean hook integration for model execution. The hook methods provide clean integration points:
- `layer_pre_hook` waits for layer loading before computation
- `layer_post_hook` initiates layer saving after computation
- `model_post_hook` ensures all saving operations complete

This design allows seamless integration with the model's forward pass without modifying core model code.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
145-147: Break the long line to comply with the line-length limit. Line 146 exceeds the 120-character limit.

```diff
-    model.generate([
-        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
-    ], SamplingParams(max_tokens=NUM_TOKENS, ignore_eos=True))
+    model.generate([
+        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
+        "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
+    ], SamplingParams(max_tokens=NUM_TOKENS, ignore_eos=True))
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (8 hunks)
- tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case, and prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
🪛 Ruff (0.12.2)
tests/integration/defs/llmapi/test_llm_api_connector.py
146-146: Line too long (133 > 120)
(E501)
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/py_executor.py (6)
30-30: LGTM! The import additions are necessary for the KV Cache Connector functionality: `DecoderLayer` for registering forward hooks and `KvCacheConnectorManager` for connector management.

Also applies to: 32-32
132-150: LGTM! The constructor properly accepts the optional `kv_connector_manager` parameter and defers initialization to a dedicated method, following good separation of concerns.

Also applies to: 258-260
265-295: Well implemented initialization with proper validation. The method correctly validates incompatible configurations and properly integrates the KV connector with existing systems. The forward hook registration on `DecoderLayer` modules is essential for layer-level KV cache management.
918-923: LGTM! The assertion logic correctly excludes the KV cache connector case, mirroring the existing pattern for `kv_cache_transceiver`. The comment clearly explains why both features can put requests to sleep.
965-973: Well integrated KV connector operations in the executor loop. The connector operations are properly placed in the execution flow:
- Pre-forward: request preparation and KV loading initiation
- Post-forward: wait for save completion
- Post-batch: cleanup of finished requests
The integration maintains consistency with existing patterns and properly handles resource management.
Also applies to: 974-976, 1006-1009, 1360-1362
1461-1463: Proper resource management delegation to KV connector. The method correctly delegates resource freeing to the connector when available, preventing double-freeing and maintaining proper resource ownership between the connector and resource manager.
af4c2b9 to 9a5bab4 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
377-416: Address the line length violations and refactor as previously suggested. The complex connector integration logic still needs the refactoring suggested in the previous review. Additionally, lines 380 and 382 exceed the 120-character limit and should be reformatted.
Apply this diff to fix the line length violations:

```diff
-# The result of this call may be that blocks will be loaded asynchronously.
-# If so, we set the is_kv_cache_connector_async_onboard flag, and set the request state to be DISAGG_GENERATION_TRANS_IN_PROGRESS.
+# The result of this call may be that blocks will be loaded
+# asynchronously. If so, we set the is_kv_cache_connector_async_onboard
+# flag, and set the request state to be DISAGG_GENERATION_TRANS_IN_PROGRESS.
-# When the async load is complete, we set the request state back to CONTEXT_INIT.
-# When that happens, the request will go through this same code path, but with is_kv_cache_connector_async_onboard set to True.
+# When the async load is complete, we set the request state back to
+# CONTEXT_INIT. When that happens, the request will go through this same
+# code path, but with is_kv_cache_connector_async_onboard set to True.
```

tests/integration/defs/llmapi/test_llm_api_connector.py (2)
26-26: Remove sys.path manipulation in test file. This issue was already identified in a previous review.

28-43: Remove unused global mock variables and function. This issue was already identified in a previous review.
🧹 Nitpick comments (2)
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
149-149: Fix the line length violation. The line exceeds the 120-character limit. Consider breaking the long string literal.

```diff
-    model.generate([
-        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
-    ], SamplingParams(max_tokens=NUM_TOKENS, ignore_eos=True))
+    long_text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
+                 "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.")
+    model.generate([long_text], SamplingParams(max_tokens=NUM_TOKENS, ignore_eos=True))
```

tensorrt_llm/_torch/pyexecutor/connector.py (1)
126-126: Fix the line length violations. Several lines exceed the 120-character limit. Consider breaking long comments and parameter lists for better readability.
Example fixes:
```diff
-        Args:
-            finished_gen_req_ids: The IDs of the requests that have finished generating tokens, and are now asynchronously saving.
+        Args:
+            finished_gen_req_ids: The IDs of the requests that have finished
+                generating tokens, and are now asynchronously saving.
-        Note: IDs may only be returned from this call after they've been provided in the `finished_gen_req_ids` and `started_loading_req_ids` arguments.
+        Note: IDs may only be returned from this call after they've been provided
+        in the `finished_gen_req_ids` and `started_loading_req_ids` arguments.
```

Also applies to: 133-134, 181-181, 240-240, 291-291, 313-313, 339-339, 366-366
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (11 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (9 hunks)
- cpp/tensorrt_llm/pybind/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (1 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (1 hunks)
- tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
- tests/unittest/_torch/test_connector.py (1 hunks)
✅ Files skipped from review due to trivial changes (5)
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- cpp/tensorrt_llm/pybind/CMakeLists.txt
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
🚧 Files skipped from review as they are similar to previous changes (15)
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/models/modeling_utils.py
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- tensorrt_llm/executor/worker.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/llmapi/llm_args.py
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tests/unittest/_torch/test_connector.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
- tensorrt_llm/_torch/pyexecutor/connector.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
- tensorrt_llm/_torch/pyexecutor/connector.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor.pytensorrt_llm/_torch/pyexecutor/connector.pytensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.pytensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🪛 Ruff (0.12.2)
tests/integration/defs/llmapi/test_llm_api_connector.py
149-149: Line too long (133 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/connector.py
126-126: Line too long (130 > 120)
(E501)
133-133: Line too long (152 > 120)
(E501)
134-134: Line too long (206 > 120)
(E501)
181-181: Line too long (176 > 120)
(E501)
240-240: Line too long (123 > 120)
(E501)
291-291: Line too long (129 > 120)
(E501)
313-313: Line too long (125 > 120)
(E501)
339-339: Line too long (144 > 120)
(E501)
366-366: Line too long (131 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
380-380: Line too long (146 > 120)
(E501)
382-382: Line too long (143 > 120)
(E501)
🔇 Additional comments (22)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
13-14: LGTM! The new imports are properly placed and necessary for the KV cache connector integration. They follow the existing import patterns and maintain proper namespacing.
Also applies to: 20-20
136-136: LGTM! The constructor modification properly integrates the KV cache connector manager with correct type annotations and maintains backward compatibility through the optional parameter.
Also applies to: 153-154
937-939: LGTM! The new method properly exposes the KV cache connector pools data with correct type annotations and follows the established delegation pattern used throughout the class.
tensorrt_llm/_torch/pyexecutor/py_executor.py (7)
132-150: LGTM! The constructor parameter addition follows good practices with proper optional typing and placement at the end of the parameter list.
265-295: Well-structured connector initialization with proper validation. The method correctly enforces mutual exclusivity with incompatible features, provides clear error messages, and properly registers KV cache pools and decoder layer hooks. The precondition checks and initialization sequence are well-organized.
918-924: LGTM! The assertion relaxation correctly mirrors the existing `kv_cache_transceiver` logic. Both components can suspend requests, so the batch size assertion should only apply when neither is active. The explanatory comment is helpful.
968-973: Proper connector integration in the executor loop. The connector operations are called at appropriate points: removing pending async requests after resource preparation, handling metadata, and starting KV loading before the forward pass. The sequence is logical and well-integrated.
1006-1010: Correct handling of finished connector requests. The connector cleanup is properly placed after the transceiver checks, and resource freeing is appropriately delegated to the resource manager for finished requests.
1360-1362: LGTM! The `wait_for_save` call is correctly placed after the forward pass to ensure asynchronous saving operations complete before proceeding. The null check is appropriate for the optional connector manager.
1461-1464: Proper resource management delegation to connector.The conditional logic correctly delegates resource freeing to the connector manager when it handles the request, preventing double-freeing while maintaining fallback to the standard resource manager. This follows the connector's responsibility for managing async request resources.
tests/integration/defs/llmapi/test_llm_api_connector.py (3)
45-67: Well-designed test fixture with clear limitations. The fixture properly mocks the importlib mechanism to inject test doubles, and the single-process limitation is clearly documented with a good explanation of why MagicMock requires this constraint.
69-127: Comprehensive test coverage for connector scenarios. The test thoroughly validates the connector integration by verifying call counts, argument correctness, and proper sequencing of scheduler and worker method calls. The assertions effectively test the expected behavior patterns.
159-198: Well-structured async save test. The test properly configures the scheduler to simulate asynchronous saving and validates that the correct request IDs are passed to `get_finished` at the appropriate times. The assertion logic effectively verifies the async save behavior.

tensorrt_llm/_torch/pyexecutor/connector.py (9)
37-60: Well-designed data classes with clear documentation. The `RequestData` and `SchedulerOutput` classes provide clean abstractions for request tracking with proper type annotations and helpful documentation. The `add_request` helper method is a good convenience addition.
62-136: Comprehensive and well-documented worker interface. The abstract base class provides a clear contract for worker implementations with well-defined methods for KV cache operations. The documentation is thorough and explains the purpose and expected behavior of each method. The metadata binding mechanism enables flexible scheduler-worker coordination.
138-183: Well-structured scheduler interface with appropriate abstractions. The scheduler interface correctly focuses on orchestration responsibilities with clear method contracts. The combination of abstract methods for core functionality and default implementations for optional features provides good flexibility. The documentation effectively explains the expected behavior and return values.
185-232: Efficient helper class for async request management. The `AsyncRequests` class provides clean abstractions for managing saving and loading requests with efficient operations for transferring, extracting, and querying requests. The design supports the state transitions needed by the connector manager.
234-276: Solid distributed coordination design with proper MPI integration. The manager correctly enforces scheduler centralization on rank 0, implements proper three-phase async request tracking, and provides clean MPI coordination through the `_run_on_leader` method. The inheritance from the C++ binding enables effective integration with native components.
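The leader-only execution pattern mentioned above can be sketched as: rank 0 runs the scheduler callback and the result is broadcast to every rank. The broadcast is stubbed out here with an in-process stand-in; real code would use something like mpi4py's `comm.bcast`, and all names (`FakeComm`, `run_on_leader`) are illustrative assumptions:

```python
class FakeComm:
    """In-process stand-in for an MPI communicator shared by all ranks."""

    def __init__(self):
        self._payload = None

    def bcast(self, value, root_rank, my_rank):
        # The root stores its value; every rank returns the stored value.
        if my_rank == root_rank:
            self._payload = value
        return self._payload


def run_on_leader(fn, rank, comm):
    """Run `fn` on rank 0 only, then broadcast its result to all ranks."""
    result = fn() if rank == 0 else None
    return comm.bcast(result, root_rank=0, my_rank=rank)


comm = FakeComm()
# Rank 0 computes the scheduler decision; ranks 1..n receive the same value.
results = [
    run_on_leader(lambda: {"num_matched": 4}, rank, comm)
    for rank in range(3)
]
```

Keeping the scheduler decision on a single rank and broadcasting it avoids divergent state across workers, which is why the rank-0 validation in the constructor matters.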
278-296: Correctly handles a complex async loading scenario. The method properly validates async loading conditions and handles the challenging case of suspending already-scheduled requests. The detailed comment explains the architectural complexity, and the implementation correctly manages request state transitions for async loading.
298-351: Proper request scheduling and metadata coordination. The methods correctly handle the complex interactions between scheduling and async operations. Request state management is consistent, metadata coordination follows the proper leader pattern, and the filtering logic appropriately removes async requests from normal scheduling.
353-402: Sophisticated and correct async request completion handling. The `get_finished` method implements proper distributed coordination using MPI allgather to ensure all workers agree on completion before cleanup. The three-phase request tracking and intersection logic correctly handle race conditions and maintain consistent state across the distributed system.
407-415: Clean integration hooks for layer-level operations. The hook methods provide appropriate integration points for waiting on layer loads and triggering saves, enabling seamless integration with the model execution flow.
9a5bab4 to b678d9b (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
375-416: Address line length violations and consider the previous refactoring suggestion. The complex connector integration logic has line length violations on lines 381 and 383 (146 and 143 characters respectively, exceeding the 120-character limit).
Apply this diff to fix the line length violations:

```diff
-# The result of this call may be that blocks will be loaded asynchronously.
+# The result of this call may be that blocks will be loaded
+# asynchronously.
-# When that happens, the request will go through this same code path, but with is_kv_cache_connector_async_onboard set to True.
+# When that happens, the request will go through this same code path,
+# but with is_kv_cache_connector_async_onboard set to True.
```

The previous review comment about refactoring this complex logic into separate methods remains valid and should be addressed.
tests/integration/defs/llmapi/test_llm_api_connector.py (2)
27-27: Remove sys.path manipulation in test file. Modifying `sys.path` in test files is a code smell that can lead to import issues and makes tests fragile. Consider using proper module imports or test fixtures instead.

```diff
-sys.path.append(os.path.dirname(os.path.abspath(__file__)))
```
29-44: Remove unused global mock variables and function. The global `KvConnectorScheduler` and `KvConnectorWorker` variables and the `init_connector_classes` function are defined but never used. The actual mocking is done in the `connector` fixture.

```diff
-KvConnectorScheduler = MagicMock()
-KvConnectorWorker = MagicMock()
-
-
-def init_connector_classes():
-    KvConnectorScheduler.reset_mock()
-    KvConnectorWorker.reset_mock()
-
-    scheduler = MagicMock()
-    worker = MagicMock()
-
-    KvConnectorScheduler.return_value = scheduler
-    KvConnectorWorker.return_value = worker
-
-    return scheduler, worker
```
🧹 Nitpick comments (1)
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
153-153: Address the line length issue. The line exceeds the 120-character limit as flagged by static analysis.

```diff
-        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
+        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
+        "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (11 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (9 hunks)
- cpp/tensorrt_llm/pybind/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (1 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (1 hunks)
- tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
- tests/unittest/_torch/test_connector.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- cpp/tensorrt_llm/pybind/CMakeLists.txt
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
🚧 Files skipped from review as they are similar to previous changes (16)
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/models/modeling_utils.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- tests/unittest/_torch/test_connector.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- tensorrt_llm/executor/worker.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/integration/defs/llmapi/test_llm_api_connector.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/connector.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/integration/defs/llmapi/test_llm_api_connector.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/connector.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/connector.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🪛 Ruff (0.12.2)
tests/integration/defs/llmapi/test_llm_api_connector.py
153-153: Line too long (133 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/connector.py
128-128: Line too long (130 > 120)
(E501)
135-135: Line too long (152 > 120)
(E501)
136-136: Line too long (206 > 120)
(E501)
184-184: Line too long (176 > 120)
(E501)
243-243: Line too long (123 > 120)
(E501)
294-294: Line too long (129 > 120)
(E501)
316-316: Line too long (125 > 120)
(E501)
342-342: Line too long (144 > 120)
(E501)
369-369: Line too long (131 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
381-381: Line too long (146 > 120)
(E501)
383-383: Line too long (143 > 120)
(E501)
🔇 Additional comments (32)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
13-14: LGTM! Clean integration of KV cache connector imports and parameter. The imports maintain proper namespace conventions and the optional parameter design allows backward compatibility.
Also applies to: 20-20, 137-137, 154-154
361-361: LGTM! Clean scheduler output management pattern. The pattern of creating a scheduler output object to accumulate information and then dispatching it to the connector manager (when present) is well-designed and null-safe.
Also applies to: 437-438
419-435: LGTM! Well-structured generation batch processing. The pattern of capturing block IDs before and after token addition to compute deltas is clear and correctly implemented. The use of `scheduler_output` to communicate changes maintains consistency with the context batch processing.
939-940: LGTM! Clean delegation method. The new method properly exposes the KV cache connector pools data from the underlying implementation with correct type annotations.
tensorrt_llm/_torch/pyexecutor/py_executor.py (9)
132-150: LGTM! Constructor parameter addition follows established pattern. The new `kv_connector_manager` parameter is properly integrated into the constructor signature with appropriate default value and type annotation.
258-261: LGTM! Proper initialization of connector manager. The connector manager is stored as an instance variable and initialization is delegated to a separate method, following good separation of concerns.
265-295: Excellent compatibility validation and integration logic. The `_maybe_init_kv_connector_manager` method properly validates incompatible feature combinations and integrates the connector with the KV cache manager and model layers. The compatibility checks prevent runtime issues by failing early.

Key validations implemented:
- Incompatibility with `KvCacheTransceiver`
- Incompatibility with pipeline parallelism
- Incompatibility with the overlap scheduler
The integration with decoder layers using forward hooks is well-designed for layer-level KV cache coordination.
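As a rough illustration of that hook pattern, here is a minimal plain-Python sketch (the classes and the event log are hypothetical stand-ins, not TensorRT-LLM types): a pre-hook waits for the layer's KV load and a post-hook kicks off its save, bracketing each layer's forward pass.

```python
# Illustrative sketch of per-layer KV connector hooks.
# FakeConnectorManager / FakeLayer are stand-ins, not the real classes.
events = []


class FakeConnectorManager:
    def layer_pre_hook(self, layer_idx):
        # In the real flow this would block until the layer's KV blocks
        # have finished loading on the appropriate CUDA stream.
        events.append(f"wait_load:{layer_idx}")

    def layer_post_hook(self, layer_idx):
        # In the real flow this would start saving the layer's KV blocks.
        events.append(f"save:{layer_idx}")


class FakeLayer:
    def __init__(self, idx, manager):
        self.idx = idx
        self.manager = manager

    def forward(self, x):
        self.manager.layer_pre_hook(self.idx)   # registered as a pre-forward hook
        events.append(f"forward:{self.idx}")    # the attention computation itself
        self.manager.layer_post_hook(self.idx)  # registered as a post-forward hook
        return x


manager = FakeConnectorManager()
layers = [FakeLayer(i, manager) for i in range(2)]
x = object()
for layer in layers:
    x = layer.forward(x)

print(events)
# → ['wait_load:0', 'forward:0', 'save:0', 'wait_load:1', 'forward:1', 'save:1']
```

In the actual integration these callbacks would be attached via PyTorch's forward pre/post hook mechanism rather than called inline.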
921-926: Proper assertion relaxation for connector functionality. The modification correctly relaxes the batch size assertion when either the KV Cache Connector or transceiver is enabled, as both can suspend requests asynchronously. The comment clearly explains the reasoning.
968-976: Well-structured connector integration in executor loop. The connector integration in the main executor loop is properly placed after resource preparation but before the forward pass. The sequence of operations is logical:
- Remove requests pending async load
- Handle metadata
- Start KV loading
This placement ensures correct coordination between resource management and connector operations.
1009-1012: LGTM! Proper finished request handling for connector. The connector-specific finished request handling is well-integrated into the executor loop. The resource freeing is appropriately delegated to the connector manager, ensuring proper lifecycle management.
1363-1364: LGTM! Appropriate synchronization point for connector save operations. The `wait_for_save()` call is correctly placed after the forward pass to ensure all asynchronous save operations complete before returning the outputs.
1464-1466: Excellent resource management delegation pattern. The `_terminate_request` method correctly delegates resource freeing to the connector manager when present, avoiding double-freeing issues. The conditional logic ensures proper resource cleanup regardless of connector presence.
977-978: Batch size checks are intentional and consistent
Each `scheduled_batch.batch_size > 0` check occurs in a different execution context (queuing, assertion when the connector is disabled, scheduling with/without attention DP, post-transmission). They're not duplicated by mistake but enforce the correct behavior at each stage. No changes required.

tests/integration/defs/llmapi/test_llm_api_connector.py (7)
46-66: LGTM! Well-designed test fixture with proper mocking. The `model_with_connector` fixture properly mocks the importlib functionality to inject test doubles for the connector classes. The fixture yields both the model factory function and the mock objects for verification in tests.
68-70: LGTM! Appropriate environment variable for test constraints. Setting `TLLM_WORKER_USE_SINGLE_PROCESS=1` is necessary because MagicMocks don't work across process boundaries. The comment clearly explains the limitation and TODO for future improvement.
73-130: Comprehensive test coverage for basic connector functionality. The `test_connector_simple` test provides excellent coverage of the basic connector workflow:
- KV cache registration
- Forward pass coordination
- Layer-level load/save operations
- Request lifecycle management
The assertions verify correct call counts and argument values, ensuring the connector integrates properly with the executor loop.
132-161: LGTM! Good test for asynchronous onboarding scenario. The test properly simulates the asynchronous loading scenario by setting appropriate return values and verifies the correct handling of async loading requests.
163-202: LGTM! Thorough test for asynchronous saving operations. The `test_connector_async_save` test validates the async saving workflow, including proper request tracking through the saving pipeline and verification of finished request handling.
204-257: Excellent detailed verification of scheduler output. The `test_connector_scheduler_output` test provides comprehensive validation of the scheduler metadata generation, including:
- Token and block tracking across iterations
- Prefill vs generation phase handling
- Block allocation patterns
- Computed position tracking
The test also verifies the behavior when tokens are pre-matched, which is important for performance optimization.
259-306: LGTM! Comprehensive chunked context handling test. The `test_connector_scheduler_output_chunked_context` test validates the connector's handling of chunked prefill scenarios, ensuring proper token and block management across chunk boundaries. This is crucial for large context handling.

tensorrt_llm/_torch/pyexecutor/connector.py (12)
1-34: LGTM! Excellent module documentation and imports. The comprehensive module documentation clearly explains the purpose, responsibilities, and architecture of the KV Cache Connector. The import structure is well-organized and follows Python conventions.
37-61: LGTM! Well-designed data structures for request tracking. The `RequestData` and `SchedulerOutput` dataclasses provide clean interfaces for tracking request metadata. The `add_request` method on `SchedulerOutput` offers a convenient API for building scheduler metadata.
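A minimal sketch of what such dataclasses might look like — the field names here are assumptions based on the review's description, not the actual TensorRT-LLM definitions:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestData:
    """Per-request scheduling metadata (hypothetical fields)."""
    request_id: int
    new_tokens: List[int]       # tokens scheduled this iteration
    new_block_ids: List[int]    # KV blocks newly allocated for them
    computed_position: int      # position already computed in the KV cache


@dataclass
class SchedulerOutput:
    """Accumulates request metadata for one scheduling iteration."""
    requests: List[RequestData] = field(default_factory=list)

    def add_request(self, request_id, new_tokens, new_block_ids,
                    computed_position):
        self.requests.append(
            RequestData(request_id, new_tokens, new_block_ids,
                        computed_position))


out = SchedulerOutput()
out.add_request(0, [101, 102], [7], 0)
print(len(out.requests), out.requests[0].new_block_ids)
# → 1 [7]
```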
63-138: Excellent abstract worker interface design. The `KvCacheConnectorWorker` abstract base class provides a comprehensive interface for worker implementations:
- Clear separation of concerns between metadata binding and KV operations
- Well-documented abstract methods with clear responsibilities
- Proper lifecycle management (register → load → process → save → finish)
- Good error handling considerations in the `get_finished` method documentation

The interface design supports various connector implementations while maintaining consistency.
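The lifecycle described above can be sketched as an abstract base class. The method names follow the review's description, but the signatures are assumptions, and `NoOpWorker` is a hypothetical trivial implementation:

```python
from abc import ABC, abstractmethod


class KvCacheConnectorWorker(ABC):
    """Hypothetical shape of the worker-side connector interface."""

    @abstractmethod
    def register_kv_caches(self, kv_cache_tensor):
        """Bind the KV cache pool once at startup."""

    @abstractmethod
    def start_load_kv(self):
        """Kick off loading of externally cached KV blocks."""

    @abstractmethod
    def wait_for_layer_load(self, layer_idx):
        """Block until layer `layer_idx`'s KV blocks are loaded."""

    @abstractmethod
    def save_kv_layer(self, layer_idx):
        """Begin saving layer `layer_idx`'s KV blocks."""

    @abstractmethod
    def wait_for_save(self):
        """Block until all pending saves complete."""

    @abstractmethod
    def get_finished(self, finished_gen_req_ids):
        """Return (finished_saving_ids, finished_loading_ids)."""


class NoOpWorker(KvCacheConnectorWorker):
    """Trivial worker: every request finishes saving immediately."""
    def register_kv_caches(self, kv_cache_tensor): pass
    def start_load_kv(self): pass
    def wait_for_layer_load(self, layer_idx): pass
    def save_kv_layer(self, layer_idx): pass
    def wait_for_save(self): pass

    def get_finished(self, finished_gen_req_ids):
        return list(finished_gen_req_ids), []


w = NoOpWorker()
print(w.get_finished([3, 4]))
# → ([3, 4], [])
```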
140-186: LGTM! Well-structured scheduler interface. The `KvCacheConnectorScheduler` abstract base class provides a clean interface for orchestration logic:
- Clear separation between metadata building and token matching
- Proper async operation support
- Good documentation of return values and behavior expectations
The `get_num_new_matched_tokens` method design allows for flexible token matching strategies while supporting both sync and async loading.
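To make the contract concrete, here is a toy scheduler returning the `(num_matched_tokens, load_async)` pair the review describes. The class, the prefix store, and the threshold are all illustrative; the real TensorRT-LLM signature may differ:

```python
# Toy scheduler illustrating the (num_matched_tokens, load_async) contract.
class PrefixLookupScheduler:
    def __init__(self, store):
        # store maps a token prefix (tuple) -> number of externally
        # cached tokens available for that prefix (hypothetical backing).
        self.store = store

    def get_num_new_matched_tokens(self, prompt_tokens, num_computed_tokens):
        key = tuple(prompt_tokens[:4])
        matched = self.store.get(key, 0)
        # Only tokens beyond what is already computed locally count.
        extra = max(0, matched - num_computed_tokens)
        # Second element True means the blocks load asynchronously and the
        # request must be held back until loading completes.
        return extra, extra > 8


sched = PrefixLookupScheduler({(1, 2, 3, 4): 16})
print(sched.get_num_new_matched_tokens([1, 2, 3, 4, 5], 0))
# → (16, True)
```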
188-235: Excellent async request management design. The `AsyncRequests` dataclass provides a clean API for managing asynchronous saving and loading operations:
- Clear separation between saving and loading requests
- Efficient ID-based operations
- Proper state transfer methods (`add_from`, `extract_by_id`)
- Convenient query methods for getting ID sets
The implementation supports complex async request lifecycle management.
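A sketch of such a container, with the saving/loading split and the transfer methods named above (the field types and request payloads are assumptions for illustration):

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AsyncRequests:
    """Tracks requests with in-flight async KV operations (hypothetical)."""
    saving: Dict[int, str] = field(default_factory=dict)
    loading: Dict[int, str] = field(default_factory=dict)

    def add_from(self, other: "AsyncRequests") -> None:
        # Absorb another container's pending requests.
        self.saving.update(other.saving)
        self.loading.update(other.loading)

    def extract_by_id(self, req_ids: List[int]) -> "AsyncRequests":
        # Move the given request IDs out into a new container.
        out = AsyncRequests()
        for rid in req_ids:
            if rid in self.saving:
                out.saving[rid] = self.saving.pop(rid)
            if rid in self.loading:
                out.loading[rid] = self.loading.pop(rid)
        return out

    def loading_ids(self):
        return set(self.loading)


a = AsyncRequests(saving={1: "req1"}, loading={2: "req2"})
a.add_from(AsyncRequests(loading={3: "req3"}))
done = a.extract_by_id([2])
print(sorted(a.loading_ids()), sorted(done.loading_ids()))
# → [3] [2]
```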
237-279: LGTM! Proper MPI coordination architecture. The `KvCacheConnectorManager` initialization and `_run_on_leader` method demonstrate excellent distributed system design:
- Enforces single leader pattern (scheduler only on rank 0)
- Proper MPI broadcast pattern for coordinating operations
- Clean error handling with assertions
The architecture ensures consistent state across all ranks while centralizing orchestration logic.
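The leader-broadcast pattern boils down to: only rank 0 runs the scheduler callback, then the result is broadcast so every rank observes the same decision. A minimal sketch, where `FakeComm` stands in for an mpi4py-style communicator (real code would call `comm.bcast(value, root=0)`):

```python
# Sketch of the leader-broadcast pattern; FakeComm is a single-process
# stand-in for an MPI communicator, used here only for illustration.
class FakeComm:
    def __init__(self, rank, mailbox):
        self.rank = rank
        self.mailbox = mailbox  # shared "broadcast" buffer

    def bcast(self, value, root=0):
        if self.rank == root:
            self.mailbox.append(value)
        return self.mailbox[-1]


def run_on_leader(comm, scheduler_fn):
    # Only the leader (rank 0) executes the scheduler; everyone else
    # receives the broadcast result, so all ranks stay consistent.
    result = scheduler_fn() if comm.rank == 0 else None
    return comm.bcast(result, root=0)


mailbox = []
ranks = [FakeComm(r, mailbox) for r in range(2)]
results = [run_on_leader(ranks[0], lambda: 42),
           run_on_leader(ranks[1], lambda: 42)]
print(results)
# → [42, 42]
```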
281-299: Excellent token matching coordination with proper validation. The `get_num_new_matched_tokens` method properly coordinates token matching across ranks and handles async loading setup. The validation logic prevents invalid states (async loading with zero tokens), and the request state management is well-structured.

The TODO comment identifies a design challenge with request scheduling timing; this is good documentation of known limitations.
301-325: LGTM! Clean request filtering for async operations. The `take_scheduled_requests_pending_load` method properly removes async loading requests from the scheduled batch and updates their state appropriately. This prevents double-processing and ensures correct request lifecycle management.
326-333: LGTM! Proper metadata coordination and cleanup. The `handle_metadata` method correctly coordinates metadata building across ranks and binds it to the worker. The cleanup of scheduler output prevents memory leaks and ensures fresh state for each iteration.
334-354: Excellent request completion handling. The `request_finished` method properly coordinates async saving decisions across ranks and updates request state to prevent premature resource cleanup. The state management ensures resources remain allocated during async operations.
356-405: Sophisticated distributed coordination for async operations. The `get_finished` method implements excellent distributed coordination logic:
- Request State Management: Properly tracks requests through different async states
- Worker Coordination: Collects completion status from all workers
- Consensus Building: Uses MPI allgather to achieve consensus across all ranks
- State Transitions: Correctly transitions loading requests back to schedulable state
- Resource Management: Returns finished saving requests for proper cleanup
The implementation ensures that async operations are only considered complete when all workers agree, preventing race conditions in distributed environments.
407-418: LGTM! Clean hook implementations for layer coordination. The hook methods provide clean integration points for KV cache operations:
- `layer_pre_hook`: synchronizes KV loading before layer execution
- `layer_post_hook`: initiates KV saving after layer execution
- `model_post_hook`: ensures all saves complete before forward pass completion

The hook pattern integrates seamlessly with PyTorch's module system.
Force-pushed from b678d9b to 9cf5466
Actionable comments posted: 0
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
383-422: Address complex connector integration logic and line length violations. The nested conditionals and state management logic is difficult to follow, as correctly identified by the existing TODO comment. Additionally, static analysis flags line length violations on lines 386 and 388.
The refactoring suggested in the previous review remains valid. Additionally, please address the line length violations:
```diff
-        # TODO(jthomson04): This is begging for a mega refactor, and can likely be significantly simplified.
-        # In add sequence, the connector API's get_num_new_matched_tokens is called.
+        # TODO(jthomson04): This is begging for a mega refactor, and can likely be
+        # significantly simplified. In add sequence, the connector API's
+        # get_num_new_matched_tokens is called.
-        # The result of this call may be that blocks will be loaded asynchronously.
+        # The result of this call may be that blocks will be loaded
+        # asynchronously.
```

Consider extracting the complex conditional logic into helper methods as suggested in the previous review.
tests/integration/defs/llmapi/test_llm_api_connector.py (2)
27-27: Remove sys.path manipulation in test file. Modifying `sys.path` in test files is a code smell that can lead to import issues and makes tests fragile. Consider using proper module imports or test fixtures instead.

```diff
-sys.path.append(os.path.dirname(os.path.abspath(__file__)))
```
29-44: Remove unused global mock variables and function. The global `KvConnectorScheduler` and `KvConnectorWorker` variables and the `init_connector_classes` function are defined but never used. The actual mocking is done in the `connector` fixture.

```diff
-KvConnectorScheduler = MagicMock()
-KvConnectorWorker = MagicMock()
-
-def init_connector_classes():
-    KvConnectorScheduler.reset_mock()
-    KvConnectorWorker.reset_mock()
-
-    scheduler = MagicMock()
-    worker = MagicMock()
-
-    KvConnectorScheduler.return_value = scheduler
-    KvConnectorWorker.return_value = worker
-
-    return scheduler, worker
```
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
265-306: Comprehensive connector validation and initialization. The `_maybe_init_kv_connector_manager` method properly validates compatibility constraints and initializes the connector. The validation logic correctly prevents incompatible feature combinations and ensures uniform KV cache shapes across layers. However, there's a minor issue on line 291:

```diff
-        return ValueError(
+        raise ValueError(
```

tests/integration/defs/llmapi/test_llm_api_connector.py (1)
153-153: Fix line length violation. The line exceeds the 120-character limit specified in the static analysis hints.

```diff
-    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
+    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
+    "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
```

tensorrt_llm/_torch/pyexecutor/connector.py (1)
129-129: Address line length violations. Multiple lines exceed the 120-character limit. Consider breaking long lines for better readability. For example, line 129:

```diff
-    Args:
-        finished_gen_req_ids: The IDs of the requests that have finished generating tokens, and are now asynchronously saving.
+    Args:
+        finished_gen_req_ids: The IDs of the requests that have finished
+            generating tokens, and are now asynchronously saving.
```

Similar formatting should be applied to other long lines for consistency with the project's style guidelines.
Also applies to: 136-137, 185-185, 244-244, 295-295, 317-317, 343-343, 370-370
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (8 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (8 hunks)
- cpp/tensorrt_llm/pybind/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/runtime/torch.h (1 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (1 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (1 hunks)
- tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
- tests/unittest/_torch/test_connector.py (1 hunks)
✅ Files skipped from review due to trivial changes (5)
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- cpp/tensorrt_llm/pybind/CMakeLists.txt
- cpp/tensorrt_llm/pybind/bindings.cpp
- tensorrt_llm/models/modeling_utils.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
🚧 Files skipped from review as they are similar to previous changes (15)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/executor/worker.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/executor/executor.py
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- tests/unittest/_torch/test_connector.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic...
Files:
cpp/tensorrt_llm/runtime/torch.h
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/tensorrt_llm/runtime/torch.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/runtime/torch.h
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/connector.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tests/integration/defs/llmapi/test_llm_api_connector.py
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/connector.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tests/integration/defs/llmapi/test_llm_api_connector.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx} : do not use smart pointers that have been deprecated in c++11....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-01T07:34:42.734Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Do not use smart pointers that have been deprecated in C++11.
Applied to files:
cpp/tensorrt_llm/runtime/torch.h
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
cpp/tensorrt_llm/runtime/torch.h
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/connector.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
386-386: Line too long (146 > 120)
(E501)
388-388: Line too long (143 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/connector.py
129-129: Line too long (130 > 120)
(E501)
136-136: Line too long (152 > 120)
(E501)
137-137: Line too long (206 > 120)
(E501)
185-185: Line too long (176 > 120)
(E501)
244-244: Line too long (123 > 120)
(E501)
295-295: Line too long (129 > 120)
(E501)
317-317: Line too long (125 > 120)
(E501)
343-343: Line too long (144 > 120)
(E501)
370-370: Line too long (131 > 120)
(E501)
tests/integration/defs/llmapi/test_llm_api_connector.py
153-153: Line too long (133 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (37)
cpp/tensorrt_llm/runtime/torch.h (1)
44-51: LGTM! Improved error handling and memory safety. The deleter modification removes exception handling and replaces assertion with warning logging, which is safer for library boundaries. The conditional check ensures proper handling of deallocated memory scenarios while maintaining memory safety by always resetting the shared pointer.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (6)
18-18: LGTM! Proper import for KV cache connector components. The import statement correctly brings in the necessary connector components for the integration.
135-135: LGTM! Clean integration of connector manager parameter. The constructor properly accepts the optional `kv_connector_manager` parameter and stores it as an instance attribute, following good object-oriented design principles.
Also applies to: 152-153
364-367: LGTM! Proper scheduler output initialization. The creation of a `SchedulerOutput` object to accumulate scheduling information for the KV cache connector is well-implemented.
380-381: LGTM! Correct connector manager integration in sequence addition. The conditional passing of `kv_connector_manager` to `add_sequence` calls (excluding warmup batches) is implemented correctly for both code paths.
Also applies to: 393-394
423-441: LGTM! Proper token and block tracking for generation requests. The logic correctly tracks tokens and block IDs before and after token addition, computing delta block IDs for the scheduler output. The implementation handles the generation batch processing appropriately.
442-444: LGTM! Clean connector manager integration. The conditional check and scheduler output setting for the connector manager is implemented correctly, ensuring the connector is updated when present.
tensorrt_llm/_torch/pyexecutor/py_executor.py (10)
30-30: LGTM: Import added for connector integration. The `DecoderLayer` import is correctly added to support forward hook registration in the KV Cache Connector initialization.
32-32: LGTM: Connector manager import. The `KvCacheConnectorManager` import is properly added for the new connector functionality.
132-150: LGTM: Constructor parameter addition. The `kv_connector_manager` parameter is correctly added to the PyExecutor constructor with proper typing and default value.
258-261: LGTM: Connector manager initialization. The connector manager is properly stored and initialized with the dedicated method call.
932-937: Appropriate assertion relaxation for connector functionality. The batch size assertion is correctly relaxed when the KV Cache Connector is enabled, acknowledging that the connector can put requests to sleep similar to the transceiver.
979-987: Proper connector integration in execution loop. The KV Cache Connector integration follows the correct execution flow:
- Remove requests pending asynchronous load from scheduled batch
- Handle metadata building and broadcasting
- Start KV loading operations
This maintains proper isolation and orchestration of the connector operations.
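For readers following along, the three-step flow described above can be sketched in a few lines. This is a toy illustration, not the PR's implementation: `FakeConnectorManager` and its request dicts are stand-ins, and only the method names (`take_scheduled_requests_pending_load`, `handle_metadata`, `start_load_kv`) mirror the review text.

```python
class FakeConnectorManager:
    """Stand-in for the connector manager; records the call order."""

    def __init__(self):
        self.calls = []

    def take_scheduled_requests_pending_load(self, batch):
        self.calls.append("take")
        # Requests still loading asynchronously are removed from the batch.
        return [r for r in batch if not r.get("loading")]

    def handle_metadata(self):
        self.calls.append("metadata")

    def start_load_kv(self):
        self.calls.append("load")


def handle_kv_connector(manager, batch):
    """Run the three connector steps in order, if a manager is attached."""
    if manager is None:
        return batch
    batch = manager.take_scheduled_requests_pending_load(batch)
    manager.handle_metadata()
    manager.start_load_kv()
    return batch


mgr = FakeConnectorManager()
scheduled = handle_kv_connector(mgr, [{"id": 0}, {"id": 1, "loading": True}])
print(scheduled)  # [{'id': 0}]
print(mgr.calls)  # ['take', 'metadata', 'load']
```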
988-990: Conditional execution logic maintains performance. The conditional check ensures forward pass execution only when there are scheduled requests or attention DP is enabled, maintaining execution efficiency.
1375-1377: Proper synchronization for KV cache saving. The `wait_for_save()` call ensures all KV cache saving operations complete before proceeding, maintaining data consistency in the connector workflow.
1020-1024: Correct resource management delegation. The connector cleanup logic properly retrieves finished requests and delegates resource freeing to the connector when appropriate, maintaining proper resource lifecycle management.
1476-1478: Proper resource management delegation in termination. The `_terminate_request` method correctly delegates resource freeing to the KV Cache Connector when enabled, avoiding double-freeing of resources.
tests/integration/defs/llmapi/test_llm_api_connector.py (7)
46-66: Well-structured test fixture for connector mocking. The `model_with_connector` fixture properly mocks the connector components using importlib patching, which is a clean approach for testing the connector integration without requiring actual connector implementations.
68-71: Appropriate test limitation documentation. The comment clearly explains the limitation to single process testing due to MagicMock cross-process issues, and the environment variable setting is necessary for the current testing approach.
73-130: Comprehensive basic connector functionality test. The `test_connector_simple` test thoroughly validates the connector workflow including:
- KV cache registration
- Metadata building calls
- Load/save synchronization
- Request lifecycle management
The assertions verify the correct number of calls and proper sequencing of operations.
132-161: Effective asynchronous onboarding test. The test validates async loading behavior by setting up appropriate mock return values and verifying the correct handling of finished requests with asynchronous loading.
163-202: Thorough asynchronous saving test. The test validates async saving behavior by checking that finished requests are properly tracked and passed to the `get_finished` method on the final call.
204-257: Comprehensive scheduler output validation. The test validates that scheduler output contains correct metadata for both prefill and generation phases, including proper token and block ID tracking.
259-306: Appropriate chunked context testing. The test validates connector behavior with chunked prefill, ensuring that scheduler output correctly handles multi-chunk context processing.
tensorrt_llm/_torch/pyexecutor/connector.py (13)
1-36: Excellent interface documentation and design. The file header provides clear documentation of the KV Cache Connector's purpose, responsibilities, and architecture. The separation of scheduler and worker concerns is well-designed.
38-62: Well-structured data classes for request tracking. The `RequestData` and `SchedulerOutput` classes provide clean abstractions for tracking request metadata and scheduling information with appropriate helper methods.
64-139: Comprehensive worker interface definition. The `KvCacheConnectorWorker` abstract base class defines a clear contract for worker implementations with well-documented methods covering the full KV cache lifecycle.
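To make the shape of that contract concrete, here is a hedged skeleton of such a worker interface. The method names follow the review discussion, but the exact signatures (and the `NoOpWorker` subclass) are assumptions for illustration, not the PR's actual class.

```python
from abc import ABC, abstractmethod


class KvCacheConnectorWorkerSketch(ABC):
    """Illustrative worker-side connector contract (signatures assumed)."""

    @abstractmethod
    def register_kv_caches(self, kv_pool):
        """Record a handle to the KV cache pool once at startup."""

    @abstractmethod
    def start_load_kv(self):
        """Begin asynchronously onboarding KV blocks for scheduled requests."""

    @abstractmethod
    def wait_for_layer_load(self, layer_idx):
        """Block until the given layer's KV data has been loaded."""

    @abstractmethod
    def save_kv_layer(self, layer_idx):
        """Kick off saving of the given layer after its forward pass."""

    @abstractmethod
    def get_finished(self, finished_gen_req_ids, started_loading_req_ids):
        """Return (saved_ids, loaded_ids) whose async operations completed."""


class NoOpWorker(KvCacheConnectorWorkerSketch):
    """Trivial implementation that completes everything immediately."""

    def register_kv_caches(self, kv_pool):
        self.kv_pool = kv_pool

    def start_load_kv(self):
        pass

    def wait_for_layer_load(self, layer_idx):
        pass

    def save_kv_layer(self, layer_idx):
        pass

    def get_finished(self, finished_gen_req_ids, started_loading_req_ids):
        return list(finished_gen_req_ids), list(started_loading_req_ids)


worker = NoOpWorker()
print(worker.get_finished([1], [2]))  # ([1], [2])
```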
141-187: Complete scheduler interface with proper abstractions. The `KvCacheConnectorScheduler` interface provides appropriate abstractions for orchestration, with both required abstract methods and optional implementations for token matching.
189-236: Robust async request management. The `AsyncRequests` dataclass provides comprehensive methods for managing asynchronous saving and loading requests with proper state transfer and query capabilities.
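A toy version of such an async-request tracker might look like the following. The field and method names here are illustrative only and do not claim to match the PR's `AsyncRequests` dataclass.

```python
from dataclasses import dataclass, field


@dataclass
class AsyncRequestsSketch:
    """Tracks which request IDs are mid-save or mid-load (names assumed)."""

    saving: set = field(default_factory=set)
    loading: set = field(default_factory=set)

    def start_save(self, req_id):
        self.saving.add(req_id)

    def start_load(self, req_id):
        self.loading.add(req_id)

    def complete(self, req_id):
        """Drop a request from whichever async set it is in."""
        self.saving.discard(req_id)
        self.loading.discard(req_id)


tracker = AsyncRequestsSketch()
tracker.start_save(7)
tracker.start_load(9)
tracker.complete(7)
print(sorted(tracker.saving), sorted(tracker.loading))  # [] [9]
```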
238-270: Proper manager initialization with rank validation. The `KvCacheConnectorManager` constructor correctly enforces that the scheduler only exists on rank 0, ensuring proper distributed coordination.
271-280: Effective MPI coordination pattern. The `_run_on_leader` method provides a clean abstraction for executing functions on the leader rank and broadcasting results, which is essential for distributed coordination.
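The leader-then-broadcast pattern can be modeled without MPI: run the function only on rank 0, then broadcast its result so every rank sees the same value. The `ToyComm` class below is a stand-in for an MPI communicator; nothing here claims to match the PR's `_run_on_leader` internals.

```python
class ToyComm:
    """Minimal broadcast channel simulating an MPI communicator."""

    def __init__(self, rank, store):
        self.rank = rank
        self.store = store  # shared dict standing in for the network

    def bcast(self, value, root=0):
        if self.rank == root:
            self.store["payload"] = value
        return self.store["payload"]


def run_on_leader(comm, fn):
    """Execute fn only on rank 0, then broadcast its result to all ranks."""
    result = fn() if comm.rank == 0 else None
    return comm.bcast(result, root=0)


store = {}
ranks = [ToyComm(r, store) for r in range(3)]
# Rank 0 computes; ranks 1 and 2 receive the broadcast value.
results = [run_on_leader(c, lambda: 42) for c in ranks]
print(results)  # [42, 42, 42]
```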
282-300: Robust token matching with async handling. The `get_num_new_matched_tokens` method properly validates the async loading logic and correctly updates request state for asynchronous onboarding.
302-326: Correct request filtering for async operations. The `take_scheduled_requests_pending_load` method properly removes asynchronously loading requests from the scheduled batch and updates their state appropriately.
327-334: Clean metadata handling and distribution. The `handle_metadata` method properly coordinates metadata building on the leader and distributes it to all workers.
335-356: Proper request completion handling. The `request_finished` method correctly determines async saving behavior and updates request state, maintaining consistency with the async request management pattern.
357-406: Sophisticated distributed synchronization logic. The `get_finished` method implements complex but correct distributed coordination using MPI allgather to ensure all workers agree on completion status before proceeding. The state management for loading and saving requests is properly handled.
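The "act only once ALL workers agree" rule reduces to a set intersection over per-worker reports; the allgather merely collects those reports onto every rank. A hedged sketch of that agreement step (the helper name is made up):

```python
def globally_finished(per_worker_finished):
    """Return IDs that every worker has reported as finished.

    per_worker_finished: one iterable of request IDs per worker, as an
    MPI allgather would collect them onto each rank.
    """
    sets = [set(ids) for ids in per_worker_finished]
    done = set.intersection(*sets) if sets else set()
    return sorted(done)


# Worker 1 has not yet reported ID 1; worker 3 reported an extra ID 4.
print(globally_finished([[1, 2, 3], [2, 3], [3, 2, 4]]))  # [2, 3]
```

Only IDs present in every worker's report survive, which matches the documented guarantee that slow workers delay, rather than race, the runtime's cleanup.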
408-418: Clean hook implementations for layer coordination. The pre/post hook methods provide clean integration points for layer-level KV cache coordination, properly delegating to the worker implementation.
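The per-layer hook pattern is easy to picture with toy classes: a pre-hook waits for that layer's KV load before attention runs, and a post-hook triggers its save afterwards. The layer and worker below are illustrative stand-ins, not the PR's `DecoderLayer` hooks.

```python
class ToyWorker:
    """Records the load/save events the hooks would trigger."""

    def __init__(self):
        self.events = []

    def wait_for_layer_load(self, idx):
        self.events.append(("load", idx))

    def save_kv_layer(self, idx):
        self.events.append(("save", idx))


class ToyLayer:
    def __init__(self, idx, worker):
        self.idx, self.worker = idx, worker

    def forward(self, x):
        self.worker.wait_for_layer_load(self.idx)  # pre-hook
        y = x + 1                                  # stand-in for attention
        self.worker.save_kv_layer(self.idx)        # post-hook
        return y


worker = ToyWorker()
layers = [ToyLayer(i, worker) for i in range(2)]
x = 0
for layer in layers:
    x = layer.forward(x)
print(x)              # 2
print(worker.events)  # [('load', 0), ('save', 0), ('load', 1), ('save', 1)]
```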
Actionable comments posted: 1
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
364-442: Address line length violations and ongoing complexity. The connector integration logic remains complex as noted in previous reviews. Additionally, lines 385 and 387 exceed the 120-character limit and should be reformatted.
Apply this formatting fix for the line length violations:
-            # The result of this call may be that blocks will be loaded asynchronously.
-            # When that happens, the request will go through this same code path, but with is_kv_cache_connector_async_onboard set to True.
+            # The result of this call may be that blocks will be loaded
+            # asynchronously.
+            # When that happens, the request will go through this same code path,
+            # but with is_kv_cache_connector_async_onboard set to True.
tests/integration/defs/llmapi/test_llm_api_connector.py (2)
27-27: Remove sys.path manipulation in test file. Modifying `sys.path` in test files is a code smell that can lead to import issues and makes tests fragile. Consider using proper module imports or test fixtures instead.
-sys.path.append(os.path.dirname(os.path.abspath(__file__)))
29-44: Remove unused global mock variables and function. The global `KvConnectorScheduler` and `KvConnectorWorker` variables and the `init_connector_classes` function are defined but never used. The actual mocking is done in the `connector` fixture.
-KvConnectorScheduler = MagicMock()
-KvConnectorWorker = MagicMock()
-
-
-def init_connector_classes():
-    KvConnectorScheduler.reset_mock()
-    KvConnectorWorker.reset_mock()
-
-    scheduler = MagicMock()
-    worker = MagicMock()
-
-    KvConnectorScheduler.return_value = scheduler
-    KvConnectorWorker.return_value = worker
-
-    return scheduler, worker
-
-
🧹 Nitpick comments (2)
tests/integration/defs/llmapi/test_llm_api_connector.py (2)
162-163: Fix line length violations. The long string literal exceeds the 120-character limit. Consider breaking it into multiple lines or using a shorter test string.
-    model.generate([
-        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
-    ], SamplingParams(max_tokens=NUM_TOKENS, ignore_eos=True))
+    model.generate([
+        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
+        "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
+    ], SamplingParams(max_tokens=NUM_TOKENS, ignore_eos=True))
202-214: Fix line length violation in comment. The comment on line 202 exceeds the 120-character limit.
-    # On the last call to get_finished, we should be providing the async saving request. One extra token when using the overlap scheduler.
+    # On the last call to get_finished, we should be providing the async saving request.
+    # One extra token when using the overlap scheduler.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
tensorrt_llm/_torch/pyexecutor/py_executor.py (11 hunks)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (8 hunks)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (5 hunks)
tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/_torch/pyexecutor/_util.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.pytensorrt_llm/_torch/pyexecutor/resource_manager.pytests/integration/defs/llmapi/test_llm_api_connector.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.pytensorrt_llm/_torch/pyexecutor/resource_manager.pytests/integration/defs/llmapi/test_llm_api_connector.py
🧠 Learnings (2)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
385-385: Line too long (146 > 120)
(E501)
387-387: Line too long (143 > 120)
(E501)
tests/integration/defs/llmapi/test_llm_api_connector.py
162-162: Line too long (133 > 120)
(E501)
202-202: Line too long (138 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (23)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
18-18: LGTM! Import is correctly added to support the new KV cache connector functionality.
135-135: LGTM! The optional parameter is properly typed and maintains backward compatibility.
152-152: LGTM! Standard pattern for storing the connector manager as an instance variable.
tensorrt_llm/_torch/pyexecutor/py_executor.py (13)
30-32: LGTM! New imports for KV connector integration. The imports for `DecoderLayer` and `KvCacheConnectorManager` are correctly added to support the KV cache connector functionality.
132-150: Constructor parameter addition looks correct. The new `kv_connector_manager` parameter is properly added to the constructor with appropriate type hint and default value.
258-261: Good initialization of KV connector manager. The connector manager is properly stored as an instance variable and initialization is delegated to a dedicated method, following good separation of concerns.
927-933: Appropriate relaxation of batch size assertion. The logic correctly relaxes the assertion when the KV connector manager is enabled, as the connector can put requests to sleep similar to the transceiver. The comment clearly explains the rationale.
946-952: Well-structured KV connector execution method. The method properly orchestrates the connector workflow:
- Takes scheduled requests pending load
- Handles metadata communication
- Starts KV loading
The method follows the established pattern of conditional execution based on manager availability.
953-958: Proper cleanup of finished async save requests. The method correctly retrieves finished requests from the connector and delegates resource cleanup to the resource manager, maintaining proper separation of responsibilities.
987-990: Strategic placement of KV connector execution. The connector execution is properly placed after resource preparation but before the forward step, allowing the connector to modify the batch as needed.
991-993: Appropriate restructuring of conditional logic. The conditional logic is correctly restructured to accommodate the KV connector workflow while maintaining the existing execution flow.
1024-1025: Proper integration of async save request termination. The call to terminate async save requests is correctly placed after response handling, ensuring proper cleanup of connector-managed resources.
1092-1095: Consistent integration pattern in overlap scheduler. The KV connector execution and conditional restructuring follow the same pattern as in the non-overlap executor loop, maintaining consistency across execution paths.
1148-1149: Proper cleanup integration in overlap scheduler. The async save request termination is consistently integrated in the overlap scheduler execution path.
1380-1383: Critical synchronization point for KV connector. The `wait_for_save()` call after the forward step is crucial for ensuring KV cache consistency. This synchronization point prevents race conditions where the next iteration might start before the current iteration's KV cache operations complete.
1482-1485: Smart resource management delegation. The conditional logic properly delegates resource cleanup to the connector when a request is finished according to the connector's state. This prevents double-cleanup and ensures the connector can manage resource lifecycle appropriately.
tests/integration/defs/llmapi/test_llm_api_connector.py (7)
46-66: Well-designed test fixture for connector mocking. The fixture properly mocks the connector components using importlib patching, which is appropriate for testing the integration without requiring actual connector implementations. The use of function scope ensures clean test isolation.
68-71: Appropriate environment configuration for test constraints. The environment variable setting and comment clearly explain the limitation to single-process testing due to MagicMock serialization constraints. This is a reasonable testing approach for the current implementation.
73-138: Comprehensive basic connector functionality test. The test effectively validates:
- KV cache registration during model initialization
- Correct call counts for connector methods during token generation
- Layer-wise KV operations (load/save per layer per forward pass)
- Request lifecycle management
- Scheduler-worker coordination
The test coverage is thorough and validates the core connector integration points.
140-171: Good test coverage for async onboarding scenario. The test properly simulates the async onboarding workflow where the scheduler returns matched tokens and sets the async flag. The verification of call counts and request ID propagation is appropriate.
173-214: Effective async save workflow validation. The test correctly validates the async save scenario where the scheduler marks a request as finished and the connector handles it asynchronously. The verification of request ID propagation through the call chain is well done.
216-274: Thorough scheduler output validation. This test provides excellent coverage of the scheduler output structure, validating:
- Token counts and block allocation during prefill vs generation
- Block allocation logic based on position and block size
- Computed position tracking
- Impact of matched tokens on computed position
The mathematical assertions for block allocation are particularly valuable.
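The block-allocation arithmetic those assertions check boils down to rounding the token count up to the block size. A hedged example (the function name and the sample values are made up; 32 tokens per block is just an illustration):

```python
def blocks_needed(num_tokens, tokens_per_block):
    """KV blocks required for a request: token count, rounded up."""
    return -(-num_tokens // tokens_per_block)  # ceiling division

# 100 tokens at 32 tokens/block need 4 blocks; 64 need exactly 2.
print(blocks_needed(100, 32))  # 4
print(blocks_needed(64, 32))   # 2
```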
276-325: Good coverage of chunked prefill scenario. The test effectively validates the chunked prefill workflow, ensuring that:
- All tokens and blocks are provided upfront in the first chunk
- Subsequent chunks have correct computed positions but no new tokens/blocks
- Generation phase follows normal single-token pattern
This covers an important optimization scenario for large context processing.
Actionable comments posted: 2
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
265-301: Fix the ValueError bug on line 286. The implementation correctly validates configurations and sets up the KV connector, but there's a bug on line 286 where `return ValueError` should be `raise ValueError` (returning the exception object silently skips the error path).
-        return ValueError(
+        raise ValueError(
🧹 Nitpick comments (2)
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
143-143: Consider breaking long lines for better readability. Lines 143 and 183 exceed the 120-character limit. While long test strings are sometimes acceptable, consider breaking them for consistency:
-    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
+    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
+    "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
-    # On the last call to get_finished, we should be providing the async saving request. One extra token when using the overlap scheduler.
+    # On the last call to get_finished, we should be providing the async saving request.
+    # One extra token when using the overlap scheduler.
Also applies to: 183-183
tensorrt_llm/_torch/pyexecutor/connector.py (1)
129-129: Fix line length violations flagged by static analysis. Several lines exceed the 120-character limit. Please wrap these lines appropriately.
Apply these fixes for line length compliance:
-        finished_gen_req_ids: The IDs of the requests that have finished generating tokens, and are now asynchronously saving.
+        finished_gen_req_ids: The IDs of the requests that have finished
+            generating tokens, and are now asynchronously saving.
-        Note: IDs may only be returned from this call after they've been provided in the `finished_gen_req_ids` and `started_loading_req_ids` arguments.
-        Additionally, the runtime will only take action based on these returned IDs once they've been returned by ALL workers. This allows some workers to take longer than others to complete the operations.
+        Note: IDs may only be returned from this call after they've been provided
+        in the `finished_gen_req_ids` and `started_loading_req_ids` arguments.
+        Additionally, the runtime will only take action based on these returned IDs
+        once they've been returned by ALL workers. This allows some workers to take
+        longer than others to complete the operations.
-        If true, this indicates that the kv cache manager should wait to deallocate the blocks until the saving has completed (determined by `get_finished` on the workers).
+        If true, this indicates that the kv cache manager should wait to
+        deallocate the blocks until the saving has completed (determined by
+        `get_finished` on the workers).
-    1. Managing the state of async requests (both offload and onboard)
-    2. Handling MPI communication. We only run the leader on one rank, but need the results of the leader API on all ranks.
+    1. Managing the state of async requests (both offload and onboard)
+    2. Handling MPI communication. We only run the leader on one rank, but need
+       the results of the leader API on all ranks.
-        # When the connector indicates that a request will be loaded asynchronously, we need to suspend it's execution.
+        # When the connector indicates that a request will be loaded asynchronously,
+        # we need to suspend it's execution.
-            # If this request is being loaded asynchronously, in addition to removing it from the list of scheduled requests,
+            # If this request is being loaded asynchronously, in addition to removing
+            # it from the list of scheduled requests,
-            Whether the request is performing asynchronous saving operations. If true, we do not immediately call free_resources on the request.
+            Whether the request is performing asynchronous saving operations. If true,
+            we do not immediately call free_resources on the request.
-        # Pass these newly finished requests into get_finished, and get the list of requests that have finished saving and loading.
+        # Pass these newly finished requests into get_finished, and get the list of
+        # requests that have finished saving and loading.
Also applies to: 136-137, 185-185, 244-244, 295-295, 317-317, 343-343, 370-370
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/py_executor.py (11 hunks)
tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.pytests/integration/defs/llmapi/test_llm_api_connector.pytensorrt_llm/_torch/pyexecutor/connector.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.pytests/integration/defs/llmapi/test_llm_api_connector.pytensorrt_llm/_torch/pyexecutor/connector.py
🧠 Learnings (2)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor.pytensorrt_llm/_torch/pyexecutor/connector.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
🪛 Ruff (0.12.2)
tests/integration/defs/llmapi/test_llm_api_connector.py
143-143: Line too long (133 > 120)
(E501)
183-183: Line too long (138 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/connector.py
129-129: Line too long (130 > 120)
(E501)
136-136: Line too long (152 > 120)
(E501)
137-137: Line too long (206 > 120)
(E501)
185-185: Line too long (176 > 120)
(E501)
244-244: Line too long (123 > 120)
(E501)
295-295: Line too long (129 > 120)
(E501)
317-317: Line too long (125 > 120)
(E501)
343-343: Line too long (144 > 120)
(E501)
370-370: Line too long (131 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (24)
tensorrt_llm/_torch/pyexecutor/py_executor.py (7)
30-30: LGTM! Appropriate imports for KV connector functionality. The imports of `DecoderLayer` and `KvCacheConnectorManager` are necessary for the KV cache connector integration being added. Also applies to: 32-32
132-151: LGTM! Proper KV connector manager integration in constructor. The constructor changes correctly:
- Add an optional `kv_connector_manager` parameter
- Store it as an instance variable
- Call the initialization method to set up the connector

Also applies to: 258-261
927-932: LGTM! Correct handling of KV connector in batch preparation. The logic correctly excludes the batch size assertion when the KV connector manager is active, similar to the existing KV cache transceiver logic. The comment clearly explains why this is necessary.
946-958: LGTM! Well-structured KV connector handler methods. Both methods provide clean abstractions for KV connector operations:
- `_handle_kv_connector` properly sequences request handling, metadata processing, and KV loading
- `_terminate_async_save_requests` ensures proper resource cleanup for finished requests
- Both include appropriate null checks
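For illustration, the sequencing these two methods enforce can be sketched in plain Python. The names mirror the review's description; the class, the `call_order` list, and the simplified signatures are hypothetical, not the actual PyExecutor internals (which also deal with scheduler output and CUDA streams):

```python
class ConnectorAwareExecutor:
    """Hypothetical sketch of the handler sequencing described above."""

    def __init__(self, kv_connector_manager=None):
        self.kv_connector_manager = kv_connector_manager
        self.call_order = []  # recorded only for illustration

    def _handle_kv_connector(self, scheduled_batch):
        # Null check: the connector is optional.
        if self.kv_connector_manager is None:
            return
        # 1) Remove requests whose KV load is still pending from the batch.
        self.kv_connector_manager.take_scheduled_requests_pending_load(scheduled_batch)
        self.call_order.append("take_pending")
        # 2) Build/broadcast connector metadata for this iteration.
        self.kv_connector_manager.handle_metadata()
        self.call_order.append("metadata")
        # 3) Kick off asynchronous KV loads.
        self.kv_connector_manager.start_load_kv()
        self.call_order.append("load")

    def _terminate_async_save_requests(self):
        if self.kv_connector_manager is None:
            return
        # Free resources for requests whose async saves have completed.
        for req in self.kv_connector_manager.get_finished():
            self.call_order.append(f"free:{req}")
```

Both entry points degrade to no-ops when no connector is configured, which is what keeps the integration optional.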
988-993: LGTM! Proper integration of KV connector into executor loops. The KV connector is correctly integrated into all executor loop variants:
- `_handle_kv_connector` is called after resource preparation but before the forward step
- `_terminate_async_save_requests` is called after response handling for proper cleanup
- Integration is consistent across both standard and overlap scheduler loops
Also applies to: 1092-1095, 1024-1024, 1148-1148
1380-1383: LGTM! Proper synchronization of async save operations. The addition of `wait_for_save()` after the model forward ensures that asynchronous KV cache save operations complete before proceeding. The null check is appropriate.
1482-1484: LGTM! Appropriate delegation of resource management to KV connector. The conditional logic correctly delegates resource management to the KV connector when it reports the request as finished, avoiding duplicate resource cleanup.
tests/integration/defs/llmapi/test_llm_api_connector.py (5)
1-26: LGTM! Proper test file structure with required copyright header. The file follows appropriate testing conventions and includes the required NVIDIA copyright header as specified in the coding guidelines.
27-51: LGTM! Comprehensive test fixture setup for connector mocking. The fixture properly mocks the connector classes and handles the MagicMock cross-process limitation by setting `TLLM_WORKER_USE_SINGLE_PROCESS=1`. The mock setup is well-structured and provides the necessary interfaces for testing.
54-119: LGTM! Comprehensive test coverage of basic connector workflow. This test excellently validates the core connector functionality:
- Verifies proper registration and initialization
- Tests both scheduler variants (overlap/non-overlap)
- Validates correct call sequences and counts for all connector operations
- Ensures proper layer-wise processing through the forward passes
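The layer-wise processing the test validates (a pre-hook that waits for each layer's KV load before its forward, a post-hook that kicks off the save afterwards) can be sketched as follows. The `LayerHookRunner` class and the bare `layer_idx`-only signatures are illustrative assumptions; the real hooks also receive a CUDA stream:

```python
class LayerHookRunner:
    """Sketch of per-layer connector hooks around each layer's forward."""

    def __init__(self, worker, num_layers):
        self.worker = worker
        self.num_layers = num_layers

    def run_forward(self, forward_one_layer):
        for layer_idx in range(self.num_layers):
            # Pre-hook: block until this layer's KV data has been loaded.
            self.worker.wait_for_layer_load(layer_idx)
            forward_one_layer(layer_idx)
            # Post-hook: kick off saving this layer's freshly written KV data.
            self.worker.save_kv_layer(layer_idx)
```

This interleaving is what lets loads and saves overlap with compute on other layers instead of serializing with the whole forward pass.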
121-195: LGTM! Thorough testing of asynchronous connector operations. Both async tests provide excellent coverage:
- `test_connector_async_onboard` validates async loading with matched tokens
- `test_connector_async_save` verifies async saving request handling
- Both tests properly simulate connector behavior and validate the expected interaction patterns
197-306: LGTM! Detailed validation of scheduler output generation. Both tests provide thorough coverage of scheduler behavior:
- Token and block ID calculations are properly validated
- Block boundary conditions are tested
- Chunked prefill scenario is covered with correct position tracking
- Mathematical relationships between tokens, blocks, and positions are verified
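The token-to-block relationships these tests check reduce to ceil-division by the block size. A minimal sketch under assumed names (`blocks_needed` and `block_boundaries` are illustrative helpers, not functions from the PR):

```python
import math

def blocks_needed(num_tokens: int, tokens_per_block: int) -> int:
    """Number of KV cache blocks required to hold num_tokens tokens."""
    return math.ceil(num_tokens / tokens_per_block)

def block_boundaries(num_tokens: int, tokens_per_block: int) -> list:
    """Token positions at which a new block starts."""
    return list(range(0, num_tokens, tokens_per_block))
```

The boundary condition the tests exercise is exactly the off-by-one at a full block: 32 tokens with a block size of 32 need one block, 33 need two.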
tensorrt_llm/_torch/pyexecutor/connector.py (12)
1-14: LGTM! The import structure follows Python coding guidelines by maintaining namespaces and using appropriate import patterns.
38-62: LGTM! The dataclasses follow Python coding guidelines with proper snake_case naming, appropriate type hints, and correct use of `field(default_factory=list)` to avoid mutable default arguments.
64-139: LGTM! The abstract base class is well-designed with:
- Proper inheritance from ABC
- Clear method documentation using docstrings
- Appropriate snake_case naming conventions
- Clean metadata binding interface
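A condensed sketch of what such a worker-side ABC looks like. Method names are taken from the operations mentioned in this review; the simplified signatures (no streams, no tensors typed) and the `_metadata` attribute are assumptions for illustration:

```python
from abc import ABC, abstractmethod

class KvCacheConnectorWorker(ABC):
    """Simplified sketch of a worker-side connector interface."""

    def __init__(self):
        self._metadata = None

    def bind_connector_meta(self, metadata):
        # Metadata produced by the scheduler is bound before each step.
        self._metadata = metadata

    @abstractmethod
    def register_kv_caches(self, kv_cache_tensor):
        """Called once with the primary KV pool tensor."""

    @abstractmethod
    def start_load_kv(self):
        """Begin (possibly async) loading of externally cached KV blocks."""

    @abstractmethod
    def wait_for_layer_load(self, layer_idx: int):
        """Block until layer layer_idx's KV data is ready (pre-forward hook)."""

    @abstractmethod
    def save_kv_layer(self, layer_idx: int):
        """Kick off saving of layer layer_idx's KV data (post-forward hook)."""
```

Inheriting from `ABC` with `@abstractmethod` means instantiating the base class raises `TypeError`, so an incomplete connector implementation fails loudly at construction rather than mid-inference.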
141-187: LGTM! The scheduler abstract base class follows the same good design patterns as the worker, with clear documentation and appropriate method signatures.
189-236: LGTM! The `AsyncRequests` class provides a clean interface for managing asynchronous request state with clear documentation of the destructive nature of the transfer operations.
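The "destructive transfer" idea can be sketched as follows. This is an assumed shape — the field names (`saving`/`loading`) and the single `take_and_clear` method are illustrative, not the actual class:

```python
class AsyncRequests:
    """Sketch of destructive-transfer bookkeeping for in-flight requests.

    Assumption: the real class tracks requests with pending async saves and
    pending async loads separately.
    """

    def __init__(self, saving=None, loading=None):
        self.saving = list(saving or [])
        self.loading = list(loading or [])

    def take_and_clear(self):
        """Destructively move out all tracked requests.

        After this call the container is empty; the caller owns the result,
        so the same request can never be handed out twice.
        """
        taken = AsyncRequests(self.saving, self.loading)
        self.saving.clear()
        self.loading.clear()
        return taken
```

Making the transfer destructive is what turns the containers into a state machine: each request lives in exactly one stage at a time.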
238-270: LGTM! The constructor properly enforces the constraint that the scheduler only exists on rank 0, which is essential for correct MPI operation. The state management design with separate AsyncRequests objects creates a clear state machine for async operations.
271-280: LGTM! Excellent pattern for MPI coordination that ensures scheduler operations only run on the leader rank while broadcasting results to all ranks.
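The leader-run-and-broadcast pattern looks roughly like this. The helper name and the injected `broadcast` callable are assumptions; with real MPI this would be something like mpi4py's `comm.bcast`, where non-root ranks pass `None`:

```python
def run_on_leader(rank: int, broadcast, fn, *args):
    """Run fn only on rank 0 and broadcast its result to all ranks.

    `broadcast(value, root)` is a stand-in for an MPI broadcast; every rank
    (leader and followers) returns the leader's result.
    """
    result = fn(*args) if rank == 0 else None
    return broadcast(result, root=0)
```

The point is that only the leader pays the cost of (and has the state for) the scheduler call, while every rank ends up with an identical view of the decision.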
282-300: Approve with acknowledged design limitation. The implementation correctly handles MPI coordination and includes important validation. The TODO comment accurately identifies a design challenge where requests must be removed from scheduling after being scheduled; this is a known limitation of the current architecture rather than an error.
327-334: LGTM! The method correctly uses the MPI coordination pattern and properly manages the scheduler output lifecycle. The method name appropriately reflects the bind operation as noted in previous reviews.
335-355: LGTM! The method correctly coordinates with the scheduler and properly manages request state to prevent premature deallocation during asynchronous saving operations.
357-406: LGTM! Complex but well-implemented state machine that correctly coordinates asynchronous operations across all MPI ranks. The use of set intersection ensures consensus before considering requests globally finished, and state transitions are handled appropriately.
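The consensus step described here is a plain set intersection over per-rank reports; a minimal sketch (the function name and the "already gathered to one place" input shape are assumptions):

```python
def globally_finished(per_rank_finished):
    """Requests finished on *every* rank (set-intersection consensus).

    per_rank_finished: iterable of sets, one per rank, of locally finished
    request ids (e.g. as gathered to rank 0). A request only counts as
    globally finished once all ranks report it.
    """
    per_rank_finished = list(per_rank_finished)
    if not per_rank_finished:
        return set()
    result = set(per_rank_finished[0])
    for finished in per_rank_finished[1:]:
        result &= set(finished)
    return result
```

Intersection (rather than union) is the safe choice: a request whose async save is still in flight on even one rank must not have its blocks freed anywhere.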
408-416: LGTM! The remaining methods provide clean interfaces for scheduler output management and layer-level hooks that properly delegate to the worker implementation.
Signed-off-by: jthomson04 <[email protected]>
ceb86c8 to 209052a
Signed-off-by: jthomson04 <[email protected]>
Signed-off-by: jthomson04 <[email protected]>
Actionable comments posted: 1
♻️ Duplicate comments (2)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)
518-520: Clarify and document the single-window constraint for KV Cache Connector. Improve the error message to state what was detected and what users can do next. Also ensure the API docs/header call this out explicitly.
```diff
-    TLLM_CHECK_WITH_INFO(kvCacheConnectorManager == nullptr || uniqueWindowSizeToLayers.size() == 1,
-        "KV Cache Connector is not supported with multiple window sizes");
+    TLLM_CHECK_WITH_INFO(
+        kvCacheConnectorManager == nullptr || uniqueWindowSizeToLayers.size() == 1,
+        "KV Cache Connector is only supported when the model has a single max attention window; found %zu. "
+        "If you need multiple window sizes, disable the connector or configure a single window size.",
+        uniqueWindowSizeToLayers.size());
```
1199-1210: Clamp prepopulatedPromptLen (+ connector tokens) to context capacity; fix logging format; update reuse stats. As written, prepopulatedPromptLen + numConnectorMatchedTokens can exceed the capacity represented by numContextBlocks × tokensPerBlock, risking inconsistency in downstream consumers and buffer indexing. Clamp before setting it. Also fix the RequestId printf specifier and update reuse stats to include connector-matched tokens.

```diff
-    SizeType32 numConnectorMatchedTokens = 0;
-
-    // If we're using a KV cache connector, check if any additional blocks can be loaded.
-    if (mKvCacheConnectorManager && !llmRequest.isDummyRequest())
-    {
-        numConnectorMatchedTokens = mKvCacheConnectorManager->getNumNewMatchedTokens(llmRequest, prepopulatedPromptLen);
-    }
-
-    llmRequest.setPrepopulatedPromptLen(prepopulatedPromptLen + numConnectorMatchedTokens, getTokensPerBlock());
-    TLLM_LOG_DEBUG("addSequence: Request %lu, inputLength %d, prepopulatedPromptLen %d, numConnectorMatchedTokens %d",
-        llmRequest.mRequestId, inputLength, prepopulatedPromptLen, numConnectorMatchedTokens);
+    SizeType32 numConnectorMatchedTokens = 0;
+    // If we're using a KV cache connector, check if any additional tokens can be considered matched.
+    if (mKvCacheConnectorManager && !llmRequest.isDummyRequest())
+    {
+        numConnectorMatchedTokens = mKvCacheConnectorManager->getNumNewMatchedTokens(llmRequest, prepopulatedPromptLen);
+    }
+
+    // Ensure we never exceed the allocated context capacity.
+    SizeType32 const maxPrepopulatedTokens = numContextBlocks * getTokensPerBlock();
+    SizeType32 const requestedPrepopulated = prepopulatedPromptLen + numConnectorMatchedTokens;
+    SizeType32 const clampedPrepopulated = std::min(requestedPrepopulated, maxPrepopulatedTokens);
+    if (requestedPrepopulated != clampedPrepopulated)
+    {
+        TLLM_LOG_WARNING(
+            "Clamping prepopulated prompt length for request %" PRIu64 " from %d to %d (capacity=%d tokens across %d blocks)",
+            static_cast<uint64_t>(llmRequest.mRequestId), requestedPrepopulated, clampedPrepopulated,
+            maxPrepopulatedTokens, numContextBlocks);
+    }
+    // Account connector-matched tokens in reuse stats (post-clamp).
+    mReusedTokens += static_cast<double>(clampedPrepopulated - prepopulatedPromptLen);
+
+    llmRequest.setPrepopulatedPromptLen(clampedPrepopulated, getTokensPerBlock());
+    TLLM_LOG_DEBUG(
+        "addSequence: Request %" PRIu64 ", inputLength %d, prepopulatedPromptLen %d, numConnectorMatchedTokens %d",
+        static_cast<uint64_t>(llmRequest.mRequestId), inputLength, prepopulatedPromptLen, numConnectorMatchedTokens);
```
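The clamp arithmetic in the suggestion above is small enough to state directly. A Python mirror of it (the function name is illustrative, not code from the PR):

```python
def clamp_prepopulated_len(prepopulated_len: int,
                           connector_matched: int,
                           num_context_blocks: int,
                           tokens_per_block: int) -> int:
    """Clamp (prepopulated + connector-matched) tokens to block capacity.

    The allocated context capacity is num_context_blocks * tokens_per_block;
    the combined prepopulated length must never exceed it.
    """
    capacity = num_context_blocks * tokens_per_block
    return min(prepopulated_len + connector_matched, capacity)
```

For example, with 3 context blocks of 32 tokens (capacity 96), a reuse hit of 60 tokens plus 40 connector-matched tokens must be clamped from 100 down to 96.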
🧹 Nitpick comments (5)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
2410-2416: getUniquePrimaryPool guard is fine; consider mirroring improved error text. Method correctly guards single-window usage. Optionally align the error message wording with the constructor's check for consistency.

```diff
-    TLLM_CHECK_WITH_INFO(mBlockManager.getWindowSizesMetadata().size() == 1,
-        "getUniquePrimaryPool is only supported for a single window size");
+    TLLM_CHECK_WITH_INFO(
+        mBlockManager.getWindowSizesMetadata().size() == 1,
+        "getUniquePrimaryPool is only supported when the model has a single max attention window; found %zu.",
+        mBlockManager.getWindowSizesMetadata().size());
```

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)
- TLLM_CHECK_WITH_INFO(mBlockManager.getWindowSizesMetadata().size() == 1, - "getUniquePrimaryPool is only supported for a single window size"); + TLLM_CHECK_WITH_INFO( + mBlockManager.getWindowSizesMetadata().size() == 1, + "getUniquePrimaryPool is only supported when the model has a single max attention window; found %zu.", + mBlockManager.getWindowSizesMetadata().size());cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)
19-19: Prefer forward-declaring KvCacheConnectorManager in headers to reduce coupling and compile time. You only need the type for pointer/reference members and params here. Replacing the include with a forward declaration avoids pulling in more dependencies from this high-traffic header.
Apply this diff:
```diff
-#include "tensorrt_llm/batch_manager/kvCacheConnector.h"
+// Forward declare to avoid heavy header inclusion; include in the .cpp where it's used.
+namespace tensorrt_llm::batch_manager::kv_connector { class KvCacheConnectorManager; }
```

Additionally, ensure the corresponding .cpp includes the concrete header:

```cpp
#include "tensorrt_llm/batch_manager/kvCacheConnector.h"
```
838-840: Storing the connector manager in WindowBlockManager is appropriate. This centralizes connector usage at the per-window granularity, aligning with prior discussion. Consider documenting thread-safety expectations if the connector will be invoked from concurrent paths.
1293-1293: Add a brief contract note for getUniquePrimaryPool(). The .cpp enforces "single window size only." Capture that precondition in the header for discoverability.

```diff
-    [[nodiscard]] virtual runtime::ITensor::SharedPtr getUniquePrimaryPool() const = 0;
+    //! Return the primary pool tensor when exactly one attention window size is configured.
+    //! Precondition: getBlockManager().getWindowSizesMetadata().size() == 1.
+    [[nodiscard]] virtual runtime::ITensor::SharedPtr getUniquePrimaryPool() const = 0;
```
1391-1393: Fix `copyOnPartialReuse` typo and add missing connector parameter for API parity. Keep the parameter name consistent across all four `KVCacheManager` constructors in cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h and ensure the final overload with `int64_t stream` also accepts a connector:

- Rename every occurrence of `bool copyOnpartialReuse = true` to `bool copyOnPartialReuse = true`.
- In the last overload (around line 1405), extend the signature to include `std::shared_ptr<kv_connector::KvCacheConnectorManager> kvCacheConnectorManager = nullptr` after `copyOnPartialReuse`:

```diff
--- a/cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
+++ b/cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
@@ -1403,7 +1403,8 @@
         std::optional<TempAttentionWindowInputs> const& tempAttentionWindowInputs, nvinfer1::DataType dtype,
         SizeType32 sinkTokenLength, int64_t stream, std::optional<SizeType32> maxSequenceLength,
-        bool enablePartialReuse = true, bool copyOnpartialReuse = true);
+        bool enablePartialReuse = true, bool copyOnPartialReuse = true,
+        std::shared_ptr<kv_connector::KvCacheConnectorManager> kvCacheConnectorManager = nullptr);
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h(9 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp(13 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (4)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
- stream (4108-4114)
- stream (4194-4200)

cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)
- numLayers (287-388)
- numLayers (602-708)
- numLayers (602-604)

cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (4)
- numLayers (32-69)
- numLayers (32-33)
- numLayers (77-88)
- numLayers (77-77)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1)
- KvCacheConnectorManager (35-44)

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)
- getUniquePrimaryPool (2410-2415)
- getUniquePrimaryPool (2410-2410)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
538-539: Wiring the connector into per-window managers looks correct. The connector manager is forwarded to each WindowBlockManager. Good propagation.
580-582: Connector storage on WindowBlockManager is straightforward. The constructor accepts and stores the KvCacheConnectorManager. No concerns.
Also applies to: 605-606
1225-1231: Warning when connector is present but reuse is disabled is appropriate. Clear communication that the connector is ignored without reuse. LGTM.
1644-1651: Constructor parameter plumb-through is consistent. All KVCacheManager overloads accept and forward the connector to BlockManager correctly.
Also applies to: 1662-1673, 1696-1702
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)
540-542: Constructor wiring for KvCacheConnectorManager looks good. Appending KvCacheConnectorManager at the end maintains source compatibility and keeps ownership semantics clear.
1633-1633: getUniquePrimaryPool() override is fine; precondition is enforced in the .cpp. The runtime check in the implementation ensures this is only called for a single window size. Header docs (suggested above) will make this clearer to callers.
857-859: Confirmed: kvCacheConnectorManager is forwarded to each WindowBlockManager. I verified in BlockManager's constructor (cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp around lines 533-538) that

```cpp
mWindowBlockManagers.try_emplace(
    windowSize, dtype, windowSize, layersWithWindowSize, numKvHeadsPerLayer, sizePerHead, tokensPerBlock,
    allottedPrimaryBlocks, allottedSecondaryBlocks, maxNumSequences, stream, onboardBlocks, cacheType,
    secondaryOffloadMinPriority, mEventManager, enablePartialReuse, copyOnPartialReuse, kvCacheConnectorManager);
```

passes the `kvCacheConnectorManager` pointer through to every WindowBlockManager instance. No further changes needed.
LGTM for the LLM API part, thanks!
```python
        self.kv_cache_tensor = None
```

```python
    def register_kv_caches(self, kv_cache_tensor: torch.Tensor):
```
I noticed this example is a bit long. Could you add comments to the core logic/method to make it more intuitive? Since this example will be included in our documentation, such as this, we rely on both the code and comments to serve as guidance.
@Superjomn if that's ok with you, @richardhuo-nv will address in a separate MR since John's last day was last week.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
421-435: Fix VSWA handling and mirror connector logic in STAR branch. The `prepare_resources` method currently calls `get_cache_indices(req)` without an explicit `window_size` in the non-STAR path, which will raise a `ValueError` under VSWA (when `len(self.max_attention_window_vec) > 1`). In addition, the STAR `cp_type` branch skips both the connector gating (`_kv_connector_should_add_sequence`) and the post-alloc `update_state_after_alloc` call, leading to inconsistent behavior.

Required changes:
- Always pass an explicit `window_size` to `get_cache_indices` (e.g. `self.max_attention_window_vec[0]`) to avoid the VSWA hazard.
- In the STAR branch, wrap sequence addition in the same `_kv_connector_should_add_sequence` check and invoke `update_state_after_alloc` after obtaining `block_ids`.

Locations to update in tensorrt_llm/_torch/pyexecutor/resource_manager.py:
- Lines ~408-414 (non-STAR branch)
- Lines ~391-399 (STAR branch)
Proposed diff:
```diff
--- a/tensorrt_llm/_torch/pyexecutor/resource_manager.py
+++ b/tensorrt_llm/_torch/pyexecutor/resource_manager.py
@@ context_batch loop:
-                if 'cp_type' in self.mapping.cp_config and CpType.STAR == self.mapping.cp_config[
-                        'cp_type']:
-                    if req.ctx_iters == 0:
-                        seq_len = sum(
-                            len(ctx_block) for ctx_block in req.ctx_blocks)
-                        self.impl.add_sequence(
-                            req.py_request_id,
-                            seq_len + (len(req.query_id) if self.mapping.cp_rank
-                                       == self.mapping.cp_size - 1 else 0),
-                            req_beam_width, req)
+                if 'cp_type' in self.mapping.cp_config and CpType.STAR == self.mapping.cp_config['cp_type']:
+                    if req.ctx_iters == 0 and self._kv_connector_should_add_sequence(req):
+                        seq_len = sum(len(ctx_block) for ctx_block in req.ctx_blocks)
+                        self.impl.add_sequence(
+                            req.py_request_id,
+                            seq_len + (len(req.query_id)
+                                       if self.mapping.cp_rank == self.mapping.cp_size - 1
+                                       else 0),
+                            req_beam_width, req)
+                        if self.kv_connector_manager is not None:
+                            block_ids = self.get_cache_indices(
+                                req,
+                                self.max_attention_window_vec[0]
+                            )
+                            self.kv_connector_manager.update_state_after_alloc(
+                                req, block_ids)
                 else:
-                    if req.is_first_context_chunk and self._kv_connector_should_add_sequence(
-                            req):
+                    if req.is_first_context_chunk and self._kv_connector_should_add_sequence(req):
                         self.impl.add_sequence(req.py_request_id,
                                                req.prompt_len, req_beam_width,
                                                req)
                         for _ in range(self.num_extra_kv_tokens):
                             self.impl.add_token(req.py_request_id)
                         for _ in range(get_draft_token_length(req)):
                             self.impl.add_token(req.py_request_id)
                         if self.kv_connector_manager is not None:
-                            block_ids = self.get_cache_indices(req)
+                            block_ids = self.get_cache_indices(
+                                req,
+                                self.max_attention_window_vec[0]
+                            )
                             self.kv_connector_manager.update_state_after_alloc(
                                 req, block_ids)
```
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
45-50: Make kv_connector_manager truly optional by defaulting to None. The parameter is typed Optional but is still required. Default it to None to avoid churn and reduce risk of call-site omissions. This was requested previously and remains applicable.

```diff
-    def __init__(self, *, executor_config: ExecutorConfig,
+    def __init__(self, *, executor_config: ExecutorConfig,
                  model_engine: PyTorchModelEngine,
                  draft_model_engine: Optional[PyTorchModelEngine],
-                 mapping: Mapping, net_max_seq_len: int,
-                 kv_connector_manager: Optional[KvCacheConnectorManager]):
+                 mapping: Mapping, net_max_seq_len: int,
+                 kv_connector_manager: Optional[KvCacheConnectorManager] = None):
```

Also applies to: 57-57
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
336-337: Pass kv_connector_manager into C++ only when not None. If the binding does not accept the keyword in older builds, passing a None-typed kwarg can raise a TypeError. Guard the kw insertion to be safe across versions.

```diff
-            'kv_connector_manager': self.kv_connector_manager,
+            **({'kv_connector_manager': self.kv_connector_manager}
+               if self.kv_connector_manager is not None else {}),
```

tensorrt_llm/_torch/pyexecutor/_util.py (1)
332-336: Explicitly block unsupported configurations (Mamba hybrid, draft model). Clear NotImplementedError paths reduce ambiguity for integrators. Consider centralizing the guard into a small helper if this grows further, but fine as-is.
Also applies to: 413-416
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1674-1689: Broaden exception guard when querying cache indices on termination. `get_cache_indices` may raise ValueError for VSWA (window_size required). Even with the new fail-fast, catching both maintains robustness and prevents noisy terminations if future conditions change.

```diff
-        except IndexError:
+        except (IndexError, ValueError):
             # If the request has not yet been added to the kv cache manager,
             # we still need to free resources corresponding to other resource managers.
             self.resource_manager.free_resources(request)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (9)
cpp/tensorrt_llm/nanobind/CMakeLists.txt(1 hunks)cpp/tensorrt_llm/pybind/CMakeLists.txt(1 hunks)tensorrt_llm/_torch/pyexecutor/_util.py(7 hunks)tensorrt_llm/_torch/pyexecutor/py_executor.py(10 hunks)tensorrt_llm/_torch/pyexecutor/resource_manager.py(7 hunks)tensorrt_llm/executor/proxy.py(3 hunks)tensorrt_llm/executor/worker.py(6 hunks)tensorrt_llm/llmapi/llm.py(1 hunks)tests/integration/test_lists/test-db/l0_a10.yml(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/executor/worker.py
- tests/integration/test_lists/test-db/l0_a10.yml
- cpp/tensorrt_llm/nanobind/CMakeLists.txt
- tensorrt_llm/llmapi/llm.py
- cpp/tensorrt_llm/pybind/CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (1)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py (6)
- KvCacheConnectorManager (345-549)
- update_state_after_alloc (213-221)
- update_state_after_alloc (537-539)
- build_scheduler_output (306-338)
- build_scheduler_output (419-422)
- should_add_sequence (415-417)
tensorrt_llm/_torch/pyexecutor/_util.py (3)
- tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py (1): KvCacheConnectorManager (345-549)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1): KVCacheManager (142-953)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2): start_worker (337-343), PyExecutor (140-1862)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
- tensorrt_llm/_torch/modules/decoder_layer.py (1): DecoderLayer (10-34)
- tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py (11): register_kv_caches (99-106), layer_pre_hook (544-546), layer_post_hook (548-549), take_scheduled_requests_pending_load (424-450), handle_metadata (452-458), start_load_kv (109-113), get_finished (144-160), get_finished (486-535), wait_for_save (138-141), request_finished (199-210), request_finished (460-484)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
441-444: Right place to build connector scheduler output. Building the scheduler output after `prepare_resources` is the correct integration point. No issues.
445-448: Helper encapsulation looks good. Centralizing the "should add sequence" decision behind `_kv_connector_should_add_sequence` keeps call sites clean.
648-650: `get_unique_primary_pool` exposure is appropriate. Exposing the contiguous primary pool for registration with the connector worker is needed; method naming is clear.
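The helper pattern praised in the 445-448 comment can be sketched with toy stand-ins. All class and method bodies below are illustrative, not the actual TensorRT-LLM implementation; only the `should_add_sequence` / `_kv_connector_should_add_sequence` names come from the review:

```python
# Hypothetical sketch: a KV-cache manager defers the "should this sequence
# get a slot yet?" decision to an optional connector manager.

class ToyConnectorManager:
    def __init__(self, pending):
        # Request ids still waiting on an async KV load.
        self._pending = set(pending)

    def should_add_sequence(self, request_id):
        # Sequences with outstanding loads are not added yet.
        return request_id not in self._pending


class ToyKVCacheManager:
    def __init__(self, connector=None):
        self.connector = connector
        self.added = []

    def _kv_connector_should_add_sequence(self, request_id):
        # Centralized helper: without a connector, always add.
        if self.connector is None:
            return True
        return self.connector.should_add_sequence(request_id)

    def prepare_resources(self, request_ids):
        for rid in request_ids:
            if self._kv_connector_should_add_sequence(rid):
                self.added.append(rid)


mgr = ToyKVCacheManager(ToyConnectorManager(pending={2}))
mgr.prepare_resources([1, 2, 3])
print(mgr.added)  # [1, 3]
```

Keeping the `None` check inside the single helper is what keeps every call site in `prepare_resources` connector-agnostic.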
tensorrt_llm/_torch/pyexecutor/_util.py (2)
279-282: Clean separation of `estimating_kv_cache` from "attached connector". Great call making attachment conditional on `estimating_kv_cache`. This avoids connector side effects while probing capacity.
Also applies to: 323-325, 397-399
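The conditional-attachment idea from the 279-282 comment reduces to a one-line guard. The sketch below uses invented names (`build_kv_cache_manager`, `ToyKVCacheManager`) to show the shape of the pattern, not the real `_util.py` code:

```python
# Hypothetical sketch: attach the connector only on the real pass, never
# during the KV-cache capacity-estimation pass.

class ToyKVCacheManager:
    def __init__(self, connector=None):
        self.connector = connector


def build_kv_cache_manager(estimating_kv_cache, connector):
    # The estimation pass probes capacity; leaving the connector detached
    # guarantees it observes no side effects from throwaway allocations.
    attached = None if estimating_kv_cache else connector
    return ToyKVCacheManager(connector=attached)


probe = build_kv_cache_manager(estimating_kv_cache=True, connector="conn")
real = build_kv_cache_manager(estimating_kv_cache=False, connector="conn")
print(probe.connector, real.connector)  # None conn
```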
436-452: All call sites correctly handle `kv_connector_manager`. The two invocations of `create_py_executor_instance` in `py_executor_creator.py` (lines 447–456 and 576–584) now both pass the new `kv_connector_manager` parameter appropriately (using `None` during estimation and the real manager afterward). The implementation of `create_py_executor_instance` in `_util.py` accepts the optional `kv_connector_manager` and forwards it into the `PyExecutor` constructor. No further changes are needed.

tensorrt_llm/_torch/pyexecutor/py_executor.py (4)
142-161: Constructor: optional `kv_connector_manager` wiring is fine. Parameter threading looks good and leaves existing call sites intact.
990-997: Start-batch connector handshake is at the right point in the loop. Taking pending-load requests out, pushing metadata, then starting the load on the current stream matches the intended overlap.
999-1003: Freeing resources only after connector reports completion is correct. This pairs with `request_finished` deferral and avoids premature deallocation.
1004-1008: Waiting for synchronous saves post-forward is correct. Ensures on-stream KV writes finish before proceeding to next iteration.
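The per-iteration ordering the three comments above sign off on (take pending loads out, start the async load, gate each layer, then wait for saves and collect finished requests) can be sketched end to end. Everything here is a toy stand-in; the method names follow the connector API in this PR, but the bodies and the batch/request shapes are assumptions:

```python
# Hypothetical sketch of one executor iteration's connector handshake.

class ToyConnector:
    def __init__(self):
        self.events = []  # records call order for illustration

    def take_scheduled_requests_pending_load(self, batch):
        # Pull out requests whose KV is still being loaded asynchronously.
        self.events.append("take_pending")
        return [r for r in batch if not r.get("pending_load")]

    def start_load_kv(self):
        # Kick off async KV loads on the current stream.
        self.events.append("start_load")

    def layer_pre_hook(self, layer_idx):
        # Would block until this layer's KV blocks have arrived.
        self.events.append(f"wait_layer_{layer_idx}")

    def wait_for_save(self):
        # Would block until on-stream KV writes have completed.
        self.events.append("wait_save")

    def get_finished(self, request_ids):
        # Pretend every async save completed this iteration; resources for
        # these requests may now be freed.
        self.events.append("get_finished")
        return list(request_ids)


conn = ToyConnector()
batch = [{"id": 1}, {"id": 2, "pending_load": True}]
ready = conn.take_scheduled_requests_pending_load(batch)
conn.start_load_kv()
for layer_idx in range(2):
    conn.layer_pre_hook(layer_idx)
    # ... run this layer's forward pass ...
conn.wait_for_save()
finished = conn.get_finished([r["id"] for r in ready])
print(finished)  # [1]
```

The key invariant mirrored from the review: `get_finished` runs after `wait_for_save`, so resources are only released once the connector confirms the request is done.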
|
/bot run --disable-fail-fast |
|
PR_Github #16444 [ run ] triggered by Bot |
|
@richardhuo-nv The branch has fallen behind main and cannot be rebased via the UI due to conflicts. Could you do the rebase? |
It turns out I don't have permission to push to John's fork. @jthomson04 could you grant me access? |
|
PR_Github #16444 [ run ] completed with state |
|
Closing. |
This is a first implementation of a KV connector API in TRTLLM. See https://docs.google.com/document/d/104jQ_Qhx5-rb1d-8ZBSN-jYQESaRjNJVhJoT3X7_KGw/edit?tab=t.0#heading=h.dh5skq4lgud6 for details. This is a very limited initial implementation, so a couple of caveats apply:
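To make the shape of the API concrete, here is a hedged sketch of a no-op connector split into worker and scheduler halves, following the scheduler/worker division described in this PR. The method names are taken from the walkthrough; the signatures and return conventions below are assumptions, not the final interface:

```python
# Hypothetical no-op connector sketch (not the shipped interface).

class NoOpConnectorWorker:
    def register_kv_caches(self, primary_pool):
        # Receives the contiguous primary KV pool tensor at startup.
        self.pool = primary_pool

    def start_load_kv(self):
        pass  # nothing to load asynchronously

    def wait_for_layer_load(self, layer_idx):
        pass  # per-layer barrier before that layer's attention runs

    def wait_for_save(self):
        pass  # block until synchronous on-stream saves complete


class NoOpConnectorScheduler:
    def build_connector_meta(self, scheduler_output):
        # Metadata shipped to the worker each scheduling step.
        return {}

    def request_finished(self, request):
        # Assumed convention: False means no async save is outstanding,
        # so the request's KV resources may be freed immediately.
        return False


worker = NoOpConnectorWorker()
worker.register_kv_caches(primary_pool=object())
sched = NoOpConnectorScheduler()
print(sched.request_finished(request={"id": 1}))  # False
```

A real connector would replace the no-op bodies with, e.g., transfers to a remote KV store, while keeping the scheduler half free of any GPU work.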
Summary by CodeRabbit
New Features
Python API
Examples
Tests
Description
Test Coverage
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill
`kill`: Kill all running builds associated with the pull request.
skip
`skip --comment COMMENT`: Skip testing for the latest commit on the pull request.
`--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
`reuse-pipeline`: Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.