[None][feat] KV Cache Connector API #6488
Conversation
📝 Walkthrough

Adds a KV-cache connector interface and wires it through C++ and Python: a new connector manager type and bindings (pybind/nanobind), an optional KvCacheConnectorManager threaded through the KV-cache managers and executor paths, a Python connector manager/scheduler/worker implementation, an example and tests, and a unique-primary-pool API.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant LLM_Executor
    participant PyExecutor
    participant KVCacheManager
    participant KvConnectorManager
    participant Worker
    Client->>LLM_Executor: create(..., kv_connector_config)
    LLM_Executor->>PyExecutor: create_py_executor(..., kv_connector_config)
    PyExecutor->>KvConnectorManager: construct(worker, scheduler)
    PyExecutor->>KVCacheManager: build and attach KvConnectorManager
    KVCacheManager->>PyExecutor: get_unique_primary_pool()
    PyExecutor->>Worker: register_kv_caches(primary_pool)
```
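The creation flow above wires a worker role and a scheduler role into one connector manager. As a rough illustration of how the worker side might be shaped in Python, here is a hedged sketch; the method names are taken from the diagrams in this summary, but the signatures and the `NoOpWorker` class are assumptions for illustration, not the actual TensorRT-LLM API:

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class KvConnectorWorker(ABC):
    """Worker-side hooks, named after the calls in the sequence diagrams."""

    @abstractmethod
    def start_load_kv(self) -> None:
        """Kick off asynchronous KV-cache loads for the scheduled batch."""

    @abstractmethod
    def wait_for_layer_load(self, layer_idx: int) -> None:
        """Block until layer `layer_idx` has its KV cache available."""

    @abstractmethod
    def save_kv_layer(self, layer_idx: int) -> None:
        """Begin (possibly asynchronous) saving of layer `layer_idx`'s KV cache."""

    @abstractmethod
    def get_finished(
        self,
        finished_gen_req_ids: List[int],
        started_loading_req_ids: List[int],
    ) -> Tuple[List[int], List[int]]:
        """Report request IDs whose async save/load operations have completed."""


class NoOpWorker(KvConnectorWorker):
    """Trivial connector: every async operation completes immediately."""

    def start_load_kv(self) -> None:
        pass

    def wait_for_layer_load(self, layer_idx: int) -> None:
        pass

    def save_kv_layer(self, layer_idx: int) -> None:
        pass

    def get_finished(self, finished_gen_req_ids, started_loading_req_ids):
        # Echo the inputs back: nothing is actually pending.
        return list(finished_gen_req_ids), list(started_loading_req_ids)


worker = NoOpWorker()
print(worker.get_finished([1, 2], [3]))  # → ([1, 2], [3])
```

A real connector would return only the subset of IDs whose transfers have actually finished, which is what makes the no-op version the degenerate case.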
```mermaid
sequenceDiagram
    participant Scheduler
    participant PyExecutor
    participant KvConnectorManager
    participant KVCacheManager
    participant DecoderLayer
    participant Worker
    Scheduler->>PyExecutor: scheduled_batch
    PyExecutor->>KvConnectorManager: take_scheduled_requests_pending_load(batch)
    KvConnectorManager->>Scheduler: build_connector_meta(output)
    KvConnectorManager->>Worker: start_load_kv(stream)
    loop per layer
        DecoderLayer->>KvConnectorManager: layer_pre_hook(layer_idx, stream)
        KvConnectorManager->>Worker: wait_for_layer_load(layer_idx, stream)
        DecoderLayer-->>DecoderLayer: forward
        DecoderLayer->>KvConnectorManager: layer_post_hook(layer_idx, stream)
        KvConnectorManager->>Worker: save_kv_layer(layer_idx, stream)
    end
    PyExecutor->>KvConnectorManager: get_finished()
    KvConnectorManager->>Scheduler: request_finished(req, block_ids)
    PyExecutor->>KVCacheManager: terminate/free as needed
```
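The per-layer loop above mirrors the usual pre-hook/post-hook pattern around each decoder layer's forward pass: wait for that layer's KV cache before attention, then start saving it after. A self-contained sketch of that ordering with stand-in classes (nothing here is the real TensorRT-LLM code; `FakeConnector` and `DecoderLayer` are illustrative only):

```python
class FakeConnector:
    """Records the order of load/save calls so the interleaving is visible."""

    def __init__(self):
        self.calls = []

    def wait_for_layer_load(self, layer_idx):
        self.calls.append(("load", layer_idx))

    def save_kv_layer(self, layer_idx):
        self.calls.append(("save", layer_idx))


class DecoderLayer:
    def __init__(self, idx, connector):
        self.idx = idx
        self.connector = connector

    def forward(self, x):
        # Pre-hook: make sure this layer's KV cache has arrived.
        self.connector.wait_for_layer_load(self.idx)
        y = x + 1  # stand-in for attention + MLP
        # Post-hook: start persisting this layer's KV cache.
        self.connector.save_kv_layer(self.idx)
        return y


conn = FakeConnector()
layers = [DecoderLayer(i, conn) for i in range(3)]
x = 0
for layer in layers:
    x = layer.forward(x)
print(conn.calls)
# → [('load', 0), ('save', 0), ('load', 1), ('save', 1), ('load', 2), ('save', 2)]
```

The point of the interleaving is that layer N+1's load can overlap with layer N's compute, which is why the hooks take a stream argument in the diagram.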
```mermaid
sequenceDiagram
    participant App
    participant KVCacheManager
    participant BlockManager
    participant WindowBlockManager
    participant KvConnectorManager
    App->>KVCacheManager: addSequence(reqId, inputLength, beamWidth, llmRequest, kvConn?)
    KVCacheManager->>BlockManager: addSequence(..., kvConn?, windowSize)
    BlockManager->>WindowBlockManager: addSequence(..., kvConn?)
    alt kvConn provided
        WindowBlockManager->>KvConnectorManager: getNumNewMatchedTokens(request, numComputedTokens)
        KvConnectorManager-->>WindowBlockManager: matchedTokens
        WindowBlockManager-->>WindowBlockManager: adjust prompt length
    end
    WindowBlockManager-->>BlockManager: allocate/context setup
    BlockManager-->>KVCacheManager: done
```
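The `getNumNewMatchedTokens` exchange above lets the connector report tokens it can supply from external storage, so the prefill is shortened accordingly. The bookkeeping reduces to simple arithmetic; a hedged sketch (the function name and the clamp to one remaining token are assumptions, the clamp kept so the model still computes logits for the first generated token):

```python
def remaining_prefill_tokens(prompt_len: int, num_computed: int, matched: int) -> int:
    """Tokens that still need a real prefill pass after subtracting
    locally reused tokens (num_computed) and connector-supplied
    tokens (matched)."""
    remaining = prompt_len - num_computed - matched
    if remaining < 1:
        # Assumed clamp: keep at least one token to prefill so the
        # forward pass still produces logits for the next token.
        remaining = 1
    return remaining


print(remaining_prefill_tokens(1024, 256, 512))  # → 256
print(remaining_prefill_tokens(10, 5, 5))        # → 1 (fully matched, clamped)
```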
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 6
🧹 Nitpick comments (14)
cpp/tensorrt_llm/pybind/CMakeLists.txt (1)
11-11: Keep source list alphabetical for easier maintenance

The newly added entry sits between `cacheTransceiver.cpp` and `kvCacheManager.cpp`; alphabetically it belongs after `kvCacheManager.cpp`. Not critical for build correctness, but keeping the order sorted reduces merge conflicts over time.

cpp/tensorrt_llm/pybind/bindings.cpp (2)
32-32: Include order nitpick

For readability the local include list is alphabetically sorted; inserting `kvCacheConnector.h` breaks that order. Consider moving it after `kvCacheManager.h`.
468-468: Use the existing `tbk` namespace alias

A dedicated alias (`tbk`) for `kv_cache_manager` is defined on Line 49 but not used here. Swapping to the alias keeps style consistent with the other `initBindings` calls.

```diff
- tb::kv_cache_manager::KVCacheManagerConnectorBindings::initBindings(mInternalBatchManager);
+ tbk::KVCacheManagerConnectorBindings::initBindings(mInternalBatchManager);
```

cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1)
1-16: Use consistent copyright header format.

The file has both SPDX-FileCopyrightText lines (lines 2-3) and the full Apache license text. For consistency with other files in the codebase, use one format throughout.
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
266-302: Add docstring to clarify the purpose and behavior.

The initialization logic is sound with appropriate validation checks and hook registration. Consider adding a docstring to explain:
- The purpose of the KV connector manager
- Why it's incompatible with certain features
- What the hooks accomplish

```diff
     def _maybe_init_kv_connector_manager(self):
+        """Initialize the KV Cache Connector Manager if provided.
+
+        The connector manager enables asynchronous and distributed KV cache management.
+        It's incompatible with KvCacheTransceiver, pipeline parallelism, and overlap scheduler
+        due to conflicting resource management approaches.
+
+        Registers pre/post hooks on decoder layers to coordinate KV cache loading/saving.
+        """
         if self.kv_connector_manager is not None:
```
913-939: Simplify duplicated conditional checks.

The conditional block starting at line 930 duplicates the same condition from line 913. Consider merging these blocks for clarity:

```diff
 if scheduled_batch.batch_size > 0 or (
         self.enable_attention_dp and self.dist.tp_size > 1):
     if self.kv_cache_transceiver:
         # For generation requests which have completed KV cache transfer
         self._prepare_disagg_gen_transmission_complete(scheduled_batch)

         # Return the first token to the client
         self._handle_first_token_response(scheduled_batch)

     self.resource_manager.prepare_resources(scheduled_batch)

     # We have to run this after we've run the KV cache manager (via the resource manager).
     # This takes requests that are pending an async load, and removes them from the
     # scheduled context batch.
     if self.kv_connector_manager:
         self.kv_connector_manager.take_scheduled_requests_pending_load(scheduled_batch)

-if scheduled_batch.batch_size > 0 or (
-        self.enable_attention_dp and self.dist.tp_size > 1):
     if self.drafter is not None:
         self.drafter.prepare_draft_tokens(scheduled_batch, self.resource_manager)

     if self.kv_connector_manager is not None:
         self.kv_connector_manager.build_connector_meta()
         self.kv_connector_manager.worker.start_load_kv()
```
1422-1426: Consider adding error handling for connector operations.

The logic correctly delegates resource management to the connector when applicable. Consider wrapping the connector call in error handling to ensure resources are properly freed even if the connector fails:

```diff
 def _terminate_request(self, request: LlmRequest):
-    if self.kv_connector_manager is None or not self.kv_connector_manager.request_finished(
-            request):
-        self.resource_manager.free_resources(request)
+    try:
+        if self.kv_connector_manager is None or not self.kv_connector_manager.request_finished(
+                request):
+            self.resource_manager.free_resources(request)
+    except Exception as e:
+        logger.warning(
+            f"Error checking connector request status, freeing resources anyway: {e}")
+        self.resource_manager.free_resources(request)
```

tests/unittest/_torch/test_connector.py (1)
108-154: Consider adding edge case tests.

The test correctly validates the filtering of async-loading requests. Consider adding tests for edge cases:
- Empty scheduled requests
- All requests pending async load
- No requests pending async load
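For the suggested edge cases, a self-contained sketch of what such tests could look like, using stand-in classes (`FakeBatch` and `FakeConnectorManager` are illustrative fakes, not the real test fixtures or the real manager API):

```python
class FakeBatch:
    def __init__(self, context_requests):
        self.context_requests = context_requests


class FakeConnectorManager:
    def __init__(self, pending_load_ids):
        self._pending = set(pending_load_ids)

    def take_scheduled_requests_pending_load(self, batch):
        # Remove requests that are still loading asynchronously.
        batch.context_requests = [
            r for r in batch.context_requests if r not in self._pending
        ]


def run_case(requests, pending):
    batch = FakeBatch(list(requests))
    FakeConnectorManager(pending).take_scheduled_requests_pending_load(batch)
    return batch.context_requests


print(run_case([], []))          # empty batch        → []
print(run_case([1, 2], [1, 2]))  # all pending load   → []
print(run_case([1, 2], []))      # none pending load  → [1, 2]
```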
tensorrt_llm/_torch/pyexecutor/_util.py (1)
399-402: Consider improving the error message for draft model incompatibility.

The validation and propagation logic is correct. Consider making the error message more descriptive:

```diff
 if self._kv_connector_manager is not None and self._draft_model_engine is not None:
     raise ValueError(
-        "Connector manager is not supported for draft model.")
+        "KV Cache Connector is not supported with draft models due to conflicting resource management.")
```

Also applies to: 423-438, 576-577
tensorrt_llm/_torch/pyexecutor/connector.py (5)
123-133: Fix line length to comply with 120 character limit.

Several lines exceed the 120 character limit specified in the coding guidelines. Apply this diff to fix the line length issues:

```diff
-            finished_gen_req_ids: The IDs of the requests that have finished generating tokens, and are now asynchronously saving.
+            finished_gen_req_ids: The IDs of the requests that have finished generating tokens, and are now
+                asynchronously saving.
-            started_loading_req_ids: The IDs of the requests that have started asynchronously loading.
+            started_loading_req_ids: The IDs of the requests that have started asynchronously loading.
-        Note: IDs may only be returned from this call after they've been provided in the `finished_gen_req_ids` and `started_loading_req_ids` arguments.
-        Additionally, the runtime will only take action based on these returned IDs once they've been returned by ALL workers. This allows some workers to take longer than others to complete the operations.
+        Note: IDs may only be returned from this call after they've been provided in the `finished_gen_req_ids`
+        and `started_loading_req_ids` arguments. Additionally, the runtime will only take action based on these
+        returned IDs once they've been returned by ALL workers. This allows some workers to take longer than
+        others to complete the operations.
```
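The "returned by ALL workers" rule described in this docstring can be captured on the runtime side with per-request counters. A minimal sketch under the assumption of a fixed, known worker count (class and method names are illustrative, not the actual runtime code):

```python
from collections import Counter


class FinishedAggregator:
    """Release a request ID only after every worker has reported it."""

    def __init__(self, world_size: int):
        self.world_size = world_size
        self.counts = Counter()

    def report(self, worker_finished_ids):
        """Record one worker's finished IDs; return IDs now finished everywhere."""
        released = []
        for req_id in worker_finished_ids:
            self.counts[req_id] += 1
            if self.counts[req_id] == self.world_size:
                released.append(req_id)
                del self.counts[req_id]
        return released


agg = FinishedAggregator(world_size=2)
print(agg.report([7]))     # only worker 0 has reported 7 → []
print(agg.report([7, 8]))  # worker 1 reports 7 → [7]; 8 still waits → [7]
```

This matches the stated contract: a slow worker simply delays release of an ID rather than causing the runtime to act on a partially completed transfer.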
178-180: Fix line length in docstring.

Apply this diff to fix the line length:

```diff
-            Whether the request is performing asynchronous saving operations.
-            If true, this indicates that the kv cache manager should wait to deallocate the blocks until the saving has completed (determined by `get_finished` on the workers).
+            Whether the request is performing asynchronous saving operations. If true, this indicates that
+            the kv cache manager should wait to deallocate the blocks until the saving has completed
+            (determined by `get_finished` on the workers).
```
276-283: Address the architectural concern with async loading.

The comment indicates this is "a bit ugly" because requests are modified after scheduling. This suggests a potential design issue where the connector's async loading decision comes too late in the pipeline.
Consider refactoring the architecture to make async loading decisions before scheduling, possibly by:
- Adding a pre-scheduling phase for connector decisions
- Or integrating connector logic into the scheduler itself
279-279: Fix line length issues.

Apply this diff to fix line length:

```diff
-        # This is problematic, since at the point when this function is called, the request has already been scheduled!
-        # Because of this, we need to remove it from our list of scheduled requests (see `take_scheduled_requests_pending_load`).
+        # This is problematic, since at the point when this function is called, the request has already been
+        # scheduled! Because of this, we need to remove it from our list of scheduled requests
+        # (see `take_scheduled_requests_pending_load`).
-        # If this request is being loaded asynchronously, in addition to removing it from the list of scheduled requests,
+        # If this request is being loaded asynchronously, in addition to removing it from the list of
+        # scheduled requests,
```

Also applies to: 301-301
327-327: Fix line length in docstrings and comments.

Apply this diff:

```diff
-            Whether the request is performing asynchronous saving operations. If true, we do not immediately call free_resources on the request.
+            Whether the request is performing asynchronous saving operations. If true, we do not immediately
+            call free_resources on the request.
-        # Pass these newly finished requests into get_finished, and get the list of requests that have finished saving and loading.
+        # Pass these newly finished requests into get_finished, and get the list of requests that have
+        # finished saving and loading.
```

Also applies to: 354-354
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (11 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (11 hunks)
- cpp/tensorrt_llm/pybind/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (1 hunks)
- tests/unittest/_torch/test_connector.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/models/modeling_utils.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/executor/worker.py
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tests/unittest/_torch/test_connector.py
- tensorrt_llm/_torch/pyexecutor/connector.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/models/modeling_utils.py
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- cpp/tensorrt_llm/pybind/bindings.cpp
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/executor/worker.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/executor/executor.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- tests/unittest/_torch/test_connector.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- tensorrt_llm/_torch/pyexecutor/connector.py
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camelcase prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope ...
Files:
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (10)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm.py (1)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
cpp/tensorrt_llm/pybind/CMakeLists.txt (1)
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-30T06:11:42.362Z
Learning: Applies to **/*.{cpp,h,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
tensorrt_llm/executor/worker.py (1)
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-30T06:11:42.362Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
925-925: Line too long (124 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
364-364: Line too long (146 > 120)
(E501)
366-366: Line too long (143 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/connector.py
124-124: Line too long (130 > 120)
(E501)
131-131: Line too long (152 > 120)
(E501)
132-132: Line too long (206 > 120)
(E501)
179-179: Line too long (176 > 120)
(E501)
279-279: Line too long (129 > 120)
(E501)
301-301: Line too long (125 > 120)
(E501)
327-327: Line too long (144 > 120)
(E501)
354-354: Line too long (131 > 120)
(E501)
🔇 Additional comments (60)
tensorrt_llm/llmapi/llm.py (1)
1008-1010: LGTM! Clean integration of KV connector configuration.

The addition of `kv_connector_config=self.args.connector_config` follows the established pattern for passing configuration objects to the executor. The parameter naming and structure are consistent with other config parameters like `lora_config`.

tensorrt_llm/llmapi/llm_args.py (1)
55-57: Import looks good – keep it.

New symbol `KvCacheConnectorConfig` is imported from the expected module; no concerns.

cpp/include/tensorrt_llm/batch_manager/llmRequest.h (3)
1843-1846: LGTM! The setter method follows established patterns in the class and correctly implements the simple assignment pattern. The naming convention adheres to the coding guidelines.

1848-1851: LGTM! The getter method is properly implemented with the `[[nodiscard]]` attribute and const qualifier. The implementation follows established patterns in the class and adheres to coding guidelines.

2027-2027: LGTM! The member variable follows the coding guidelines with proper naming convention (camelCase with 'm' prefix), appropriate initialization, and correct placement in the private section.
cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2)
22-22: LGTM! The include directive is properly formatted and positioned with other batch manager headers.

258-259: LGTM! The new property follows the established pattern and naming conventions. The addition properly exposes the KV cache connector async onboard state to Python.
tensorrt_llm/executor/worker.py (5)
16-16: LGTM! The import is properly placed and follows Python import conventions.

63-63: LGTM! The parameter addition follows proper Python typing conventions with an appropriate default value.

87-89: LGTM! The validation logic properly ensures KV connector config is only used with supported backends and provides a clear error message.

136-136: LGTM! The parameter is correctly passed only for the PyTorch backend, maintaining consistency with the validation logic.

642-642: LGTM! The `worker_main` function properly accepts and passes the KV connector config parameter, maintaining consistency with the constructor signature.
Also applies to: 769-770
cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (3)
1-16: LGTM! The file structure follows coding guidelines with proper copyright header, include guard, and minimal includes.

8-15: LGTM! The class declaration follows naming conventions and the static method signature is appropriate for pybind11 binding initialization.

17-22: LGTM! The namespace structure properly organizes pybind code and the using directive correctly imports the core connector namespace.
tensorrt_llm/executor/executor.py (3)
30-30: LGTM! The import is properly positioned with related imports and follows Python conventions.

356-356: LGTM! The parameter addition follows proper Python typing conventions with an appropriate default value and positioning.

402-403: LGTM! The parameter is consistently passed through all execution paths (MPI, single process, multi-process, Windows), ensuring KV connector configuration is available in all supported modes.
Also applies to: 417-418, 432-433, 446-447
tensorrt_llm/executor/proxy.py (2)
13-13: LGTM! The import statement correctly imports `KvCacheConnectorConfig` from the appropriate module.

50-50: LGTM! The `kv_connector_config` parameter is properly added to the constructor signature and correctly passed through to `worker_kwargs`, following the established pattern for configuration propagation.

Also applies to: 99-100
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)
192-194: LGTM! The function signature properly adds the optional `kv_connector_config` parameter following the existing pattern.

384-391: LGTM! The `kv_connector_manager` is correctly passed to `KvCacheCreator` and conditionally to `create_py_executor_instance` (omitted during KV cache estimation). This design allows the connector to be integrated into the resource management flow while avoiding unnecessary initialization during estimation.

Also applies to: 429-431, 475-475
cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (2)
29-39: LGTM! The trampoline class correctly implements the Python override for the pure virtual method using `PYBIND11_OVERRIDE_PURE_NAME` with proper argument forwarding and lifetime support for the Python object.
45-65: LGTM! The Python bindings are well-implemented:
- Read-only properties appropriately expose data members
- Tensor conversion from the internal representation to a Torch tensor is handled correctly
- Method naming follows Python conventions with snake_case
- `py::smart_holder` is appropriately used for the manager class
cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (2)
17-17: LGTM! The header guard follows the required pattern with the `TRTLLM_` prefix.
82-92: LGTM! The abstract interface is well-designed with a clear single responsibility. The pure virtual method allows for flexible implementations while maintaining a consistent interface for the C++ KV Cache Manager integration.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
136-136: LGTM! The `kv_connector_manager` parameter is properly added to the constructor and stored as an instance variable.
Also applies to: 153-154
918-920: LGTM! The new method properly delegates to the C++ implementation to retrieve KV cache connector pool data.
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
30-32: LGTM! The constructor changes properly integrate the KV Cache Connector feature with appropriate imports and initialization.
Also applies to: 131-150, 259-262
881-886: LGTM! The conditional assertion correctly handles the case where the KV Cache Connector can put requests to sleep, similar to the transceiver behavior.
tests/unittest/_torch/test_connector.py (2)
25-77: LGTM! The test correctly validates distributed synchronization of finished requests, ensuring requests are only marked as finished when all ranks agree.
79-106: LGTM! The test properly validates token matching functionality and async onboard flag propagation in a distributed setting.
tensorrt_llm/_torch/pyexecutor/_util.py (2)
24-24: LGTM! The `KvCacheCreator` constructor properly integrates the connector manager parameter following the existing pattern.
Also applies to: 47-48, 56-56
317-318: LGTM! The method correctly propagates the connector manager to supported cache managers and appropriately raises an error for the unsupported `MambaHybridCacheManager` case.
Also applies to: 325-329, 387-388
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4)
93-97: LGTM! Method signature updated correctly. The addition of the optional `returnNewBlockId` parameter and the return type change to `std::optional<SizeType32>` properly extend the interface for KV cache connector integration while maintaining backward compatibility.
100-106: LGTM! Optional parameter addition is appropriate. The optional `kvCacheConnectorManager` parameter enables KV cache connector awareness during sequence addition while preserving backward compatibility.
241-245: LGTM! New method follows established patterns. The `getKvCacheConnectorPoolsData` method is properly declared as a pure virtual method with the appropriate `[[nodiscard]]` attribute for a getter function.
351-354: LGTM! Python bindings properly expose the updated interface. The bindings correctly map the C++ methods to Python with appropriate default values and naming conventions.
Also applies to: 435-436
tensorrt_llm/_torch/pyexecutor/connector.py (6)
1-31: Excellent documentation and appropriate imports. The module docstring clearly explains the KV Cache Connector's purpose, responsibilities, and potential use cases. The imports are well-organized and appropriate for the implementation.
34-58: Well-structured dataclasses with clear documentation. The `RequestData` and `SchedulerOutput` dataclasses are properly designed with informative field comments and a convenient `add_request` helper method.
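A minimal sketch of what such dataclasses might look like. This is illustrative only: the field names beyond `add_request` are assumptions, not the actual definitions from `connector.py`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestData:
    """One scheduled request's metadata (field names are illustrative)."""
    request_id: int
    new_tokens: List[int]
    new_block_ids: List[int]
    computed_position: int


@dataclass
class SchedulerOutput:
    """Collects per-request scheduling metadata for the connector."""
    requests: List[RequestData] = field(default_factory=list)

    def add_request(self, request_id, new_tokens, new_block_ids,
                    computed_position):
        # Convenience helper in the spirit of the reviewed add_request method.
        self.requests.append(
            RequestData(request_id, new_tokens, new_block_ids,
                        computed_position))


out = SchedulerOutput()
out.add_request(7, [101, 102, 103], [0], 3)
```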
183-230: Well-implemented async request management. The `AsyncRequests` dataclass provides clean methods for managing asynchronous saving and loading requests with proper ownership transfer semantics.
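The ownership-transfer idea can be sketched roughly as follows; the `take_finished` helper and field shapes are hypothetical, shown only to illustrate pop-and-return semantics:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AsyncRequests:
    """Tracks requests with in-flight async saves/loads (illustrative)."""
    saving: Dict[int, object] = field(default_factory=dict)
    loading: Dict[int, object] = field(default_factory=dict)

    def take_finished(self, finished_ids: List[int]) -> list:
        """Pop finished requests, transferring ownership to the caller."""
        taken = []
        for rid in finished_ids:
            if rid in self.loading:
                taken.append(self.loading.pop(rid))
            elif rid in self.saving:
                taken.append(self.saving.pop(rid))
        return taken


reqs = AsyncRequests()
reqs.loading[1] = "req-1"
reqs.saving[2] = "req-2"
done = reqs.take_finished([1])  # request 1 leaves the tracker
```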
286-318: Correct implementation of request filtering and metadata building. The methods properly handle the removal of async loading requests from scheduled requests and coordinate metadata building across ranks.
341-391: Excellent distributed coordination implementation. The `get_finished` method implements a robust consensus protocol using MPI allgather and set intersection to ensure all workers agree on which requests have completed. This prevents race conditions in a distributed environment.
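The intersection step of that consensus protocol can be sketched in plain Python; here a list of per-rank ID lists stands in for the result of an MPI allgather:

```python
def consensus_finished(per_rank_finished):
    """Intersect each rank's finished-request IDs; only IDs reported by
    every rank reach consensus (list stands in for an MPI allgather)."""
    agreed = set(per_rank_finished[0])
    for rank_ids in per_rank_finished[1:]:
        agreed &= set(rank_ids)
    return agreed


# Rank 1 has not yet finished request 6, so only request 4 is acted upon.
finished = consensus_finished([[4, 6], [4], [4, 6]])
```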
395-403: Clean hook implementation for model integration. The pre/post hooks provide clear integration points for waiting on loads and triggering saves at the appropriate times during model execution.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
1201-1254: Good integration of the KV cache connector with proper safety checks. The implementation correctly handles the optional connector manager and includes a defensive check to ensure at least one uncomputed token remains. The logging provides good visibility into the token matching process.
1983-2056: Constraints properly enforced for the block ID return feature. The implementation correctly enforces that block ID return only works with a single window size and beam width 1, which aligns with the KV cache connector requirements. The optional return value is handled cleanly.
2065-2138: Excellent defensive programming and user-friendly warnings. The implementation includes proper checks for beam search compatibility and helpful warning messages when users specify a retention configuration or KV cache connector with block reuse disabled. This will help users identify configuration issues early.
2632-2649: Clean implementation of pool data aggregation. The method correctly extracts pool data and layer mappings, properly handling the interleaved data structure in the layer-to-pool mapping buffer.
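For intuition, deinterleaving such a mapping buffer might look like the sketch below. The exact layout of the real buffer (here assumed to be flat `[pool_idx, idx_in_pool, ...]` pairs) is an assumption for illustration:

```python
def split_layer_to_pool_mapping(buffer):
    """Deinterleave a flat [pool_idx, idx_in_pool, ...] buffer into
    per-layer (pool, index-in-pool) pairs (assumed layout)."""
    assert len(buffer) % 2 == 0, "buffer must hold pairs"
    return [(buffer[i], buffer[i + 1]) for i in range(0, len(buffer), 2)]


# Three layers: two in pool 0, one in pool 1.
mapping = split_layer_to_pool_mapping([0, 0, 0, 1, 1, 0])
```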
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (12)
19-19: LGTM! The include statement is properly formatted and necessary for the KV cache connector integration.
552-553: LGTM! The addition of the optional `kvCacheConnectorManager` parameter follows the established pattern of using `OptionalRef` and maintains backward compatibility while enabling new connector functionality.
751-751: LGTM! The new method declaration follows C++ best practices with the `[[nodiscard]]` attribute and const correctness. It provides necessary access to connector pool data.
886-888: LGTM! The modified signature maintains consistency with the `WindowBlockManager` changes and follows the established `OptionalRef` pattern. The addition of both connector manager and window size parameters is appropriate.
1142-1142: LGTM! The method signature is consistent with the `WindowBlockManager` equivalent and appropriately returns a vector for the multiple pools managed by `BlockManager`.
1250-1252: LGTM! The interface change consistently propagates the connector manager option through the inheritance hierarchy while maintaining backward compatibility.
1380-1380: LGTM! The new pure virtual method properly defines the abstract interface that all concrete KV cache managers must implement, ensuring consistent access to connector pool data.
1543-1545: LGTM! The method signature changes maintain consistency with the base class interface and preserve backward compatibility.
1553-1555: LGTM! The method signature changes maintain consistency with the base class interface and follow the established `OptionalRef` pattern.
1683-1683: LGTM! The override method properly implements the pure virtual method from the base class, completing the interface contract.
1689-1689: LGTM! The private method signature change supports the new functionality of optionally returning block IDs, maintaining consistency with the public API changes.
1240-1242: No caller updates required for the new return type. All existing call sites use the default `returnNewBlockId = false` and simply ignore the returned `std::optional<SizeType32>`, which is safe and preserves the previous "void" behavior. No code or test needs to be updated to capture or handle the optional value.
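The backward-compatible flag-plus-optional-return pattern is easy to see in a Python analogue of the C++ signature; the function body here is a placeholder, not the real allocation logic:

```python
from typing import Optional


def add_token(return_new_block_id: bool = False) -> Optional[int]:
    """With the default flag, callers see the old 'void' behavior (None,
    safely ignored); opting in yields the new block ID."""
    new_block_id = 42  # placeholder for a real block allocation
    return new_block_id if return_new_block_id else None


legacy = add_token()  # existing call sites ignore the return value
opted_in = add_token(return_new_block_id=True)
```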
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
376-418: Complex connector integration logic needs refactoring and line-formatting fixes. The nested conditionals and async onboarding logic are difficult to follow, and there are line length violations that need to be addressed.
Apply this fix for the line length violations:
```diff
-        # In add sequence, the connector API's get_num_new_matched_tokens is called.
-        # The result of this call may be that blocks will be loaded asynchronously.
+        # In add sequence, the connector API's get_num_new_matched_tokens is called.
+        # The result of this call may be that blocks will be loaded asynchronously.
-        # If so, we set the is_kv_cache_connector_async_onboard flag, and set the request state to be DISAGG_GENERATION_TRANS_IN_PROGRESS.
+        # If so, we set the is_kv_cache_connector_async_onboard flag, and set the
+        # request state to be DISAGG_GENERATION_TRANS_IN_PROGRESS.
-        # When the async load is complete, we set the request state back to CONTEXT_INIT.
+        # When the async load is complete, we set the request state back to
+        # CONTEXT_INIT.
```
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
264-300: LGTM: Comprehensive validation and hook registration logic. The method properly validates incompatible configurations and registers the necessary hooks. The error handling with specific `NotImplementedError` messages is appropriate for unsupported feature combinations.
Consider a minor optimization for the module iteration:
```diff
-        for _name, module in self.model_engine.model.named_modules():
+        for module in self.model_engine.model.modules():
             if isinstance(module, DecoderLayer):
```
Since the name is unused (indicated by `_name`), using `modules()` is slightly more efficient.
970-986: LGTM: Well-integrated KV connector workflow. The integration correctly handles the sequence of operations: resource preparation, request filtering, and the asynchronous loading trigger. The timing dependency comment is helpful.
Fix the line length issue on line 972:
```diff
-        # This takes requests that are pending an async load, and removes them from the scheduled context batch.
+        # This takes requests that are pending an async load, and removes them
+        # from the scheduled context batch.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/executor/worker.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/_torch/pyexecutor/_util.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
379-379: Line too long (146 > 120)
(E501)
381-381: Line too long (143 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/py_executor.py
972-972: Line too long (124 > 120)
(E501)
🔇 Additional comments (15)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (9)
13-14: LGTM: Import addition for KV cache connector support. The import of `KvCacheConnectorPoolsData` from the batch manager bindings is correctly added to support the new KV cache connector functionality.
20-20: LGTM: Import addition for connector components. The import of `KvCacheConnectorManager` and `SchedulerOutput` from the connector module is necessary for the KV cache connector integration.
136-136: LGTM: Constructor parameter and instance variable for connector manager. The optional `kv_connector_manager` parameter and its assignment to an instance variable follow good design patterns for optional feature integration.
Also applies to: 153-154
358-360: LGTM: SchedulerOutput initialization for connector coordination. Creating the `SchedulerOutput` object early in the method to collect scheduling metadata throughout the resource preparation process is a good architectural choice.
374-374: LGTM: Connector manager integration in sequence operations. The updates to `add_sequence` calls to pass the `kv_connector_manager` parameter enable proper coordination with the connector during sequence addition.
Also applies to: 384-386
392-399: LGTM: SchedulerOutput updates with request metadata. The calls to `scheduler_output.add_request()` with appropriate tokens, block IDs, and context positions correctly populate the scheduling metadata for connector coordination.
Also applies to: 407-412, 415-417, 427-429
420-420: LGTM: Generation batch token handling with connector awareness. The `add_token` call, updated to return `new_block_id` for generation requests, enables proper block tracking in the scheduler output.
431-433: LGTM: Connector manager scheduler output coordination. The conditional call to `set_scheduler_output()` when a connector manager is present properly coordinates the scheduling metadata with the connector system.
933-935: LGTM: KV cache connector pools data exposure. The new method `get_kv_cache_connector_pools_data()` correctly delegates to the underlying C++ implementation to expose connector pool data, maintaining consistency with the API design pattern used throughout the class.
tensorrt_llm/_torch/pyexecutor/py_executor.py (6)
30-30: LGTM: Necessary imports for KV Cache Connector functionality. The imports for `DecoderLayer` and `KvCacheConnectorManager` are appropriately added to support the new KV cache connector integration.
Also applies to: 32-32
148-149: LGTM: Parameter addition follows proper conventions. The `kv_connector_manager` parameter is correctly typed as optional with a sensible default value, maintaining backward compatibility.
257-260: LGTM: Clean instance variable assignment and initialization. The assignment and initialization call are properly structured and placed appropriately in the constructor sequence.
923-929: LGTM: Logical assertion relaxation for KV Cache Connector. The modification correctly handles the case where the KV Cache Connector can put requests to sleep, similar to the transceiver. The conditional logic and explanatory comment make the intention clear.
1012-1016: LGTM: Consistent cleanup logic for finished requests. The connector cleanup follows the same pattern as the transceiver cleanup, properly retrieving and terminating finished requests. The `elif` structure ensures mutually exclusive execution paths.
1463-1466: LGTM: Proper resource management delegation to the KV connector. The conditional logic correctly delegates resource management to the KV connector when it reports a request as finished, preventing double-freeing of resources while maintaining backward compatibility.
Force-pushed from 4326a24 to a2a8a82 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (3)
cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1)
59-65: Constructor still has an unnecessary-copy issue. The constructor takes references but stores by value, creating copies of the vectors. This matches the previous review comment about avoiding unnecessary copies.
Change the constructor parameters to take by value and use move semantics:
```diff
-    explicit KvCacheConnectorPoolsData(
-        std::vector<KvCacheConnectorPoolData>& poolsData, std::vector<SizeType32>& layerToPoolMapping)
-        : mPoolsData(poolsData)
-        , mLayerToPoolMapping(layerToPoolMapping)
+    explicit KvCacheConnectorPoolsData(
+        std::vector<KvCacheConnectorPoolData> poolsData, std::vector<SizeType32> layerToPoolMapping)
+        : mPoolsData(std::move(poolsData))
+        , mLayerToPoolMapping(std::move(layerToPoolMapping))
```
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
376-418: Complex connector logic still needs refactoring. The nested conditionals and state-management logic remain difficult to follow, matching the previous review comment that this needs a major refactor.
The TODO comment correctly identifies this as needing refactoring. Consider extracting this logic into separate methods for better readability:
```python
def _should_add_sequence(self, req: LlmRequest) -> bool:
    """Determine if a sequence should be added based on request state."""
    return req.is_first_context_chunk and not req.is_kv_cache_connector_async_onboard

def _handle_async_onboard_completion(self, req: LlmRequest,
                                     scheduler_output: SchedulerOutput):
    """Handle completion of async KV cache onboarding."""
    req.is_kv_cache_connector_async_onboard = False
    scheduler_output.add_request(
        req.request_id, req.get_tokens(0),
        self.impl.get_cache_block_ids(req.request_id,
                                      self.max_attention_window_vec[0]),
        req.context_current_position)
```
Additionally, the long lines should be reformatted to comply with the 120-character limit.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
1551-1567: Single window size constraint properly enforced. The implementation correctly enforces the constraint that the KV cache connector only works with a single window size. The error message clearly communicates this limitation.
🧹 Nitpick comments (8)
cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1)
67-80: Consider const-correctness for the getters. The getters return non-const references, allowing external modification of internal data. Consider whether const versions are needed for read-only access.
Add const overloads for the getters:
```diff
     std::vector<KvCacheConnectorPoolData>& getPoolsData()
     {
         return mPoolsData;
     }
+
+    std::vector<KvCacheConnectorPoolData> const& getPoolsData() const
+    {
+        return mPoolsData;
+    }

     std::vector<SizeType32>& getLayerToPoolMapping()
     {
         return mLayerToPoolMapping;
     }
+
+    std::vector<SizeType32> const& getLayerToPoolMapping() const
+    {
+        return mLayerToPoolMapping;
+    }
```
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
264-299: Consider improving the error messages for a better debugging experience. The initialization logic and hook registration are well-implemented. However, the error messages could be more informative by specifying which features are mutually exclusive.
```diff
         if self.kv_connector_manager is not None:
             if self.kv_cache_transceiver is not None:
                 raise NotImplementedError(
-                    "KV Cache Connector is not supported with KvCacheTransceiver."
+                    "KV Cache Connector and KvCacheTransceiver are mutually exclusive features. Please use only one."
                 )
             if self.dist.pp_size > 1:
                 raise NotImplementedError(
-                    "KV Cache Connector is not supported with pipeline parallelism."
+                    "KV Cache Connector is not supported with pipeline parallelism (pp_size > 1)."
                 )
             if not self.disable_overlap_scheduler:
                 raise NotImplementedError(
-                    "KV Cache Connector is not supported with overlap scheduler."
+                    "KV Cache Connector requires disable_overlap_scheduler=True."
                 )
```
tensorrt_llm/_torch/pyexecutor/connector.py (6)
60-134: Worker ABC is well-designed but has formatting issues. The abstract base class properly defines the worker interface with clear method signatures and documentation.
Fix line length issues for better readability:
```diff
     def get_finished(
             self, finished_gen_req_ids: list[int],
             started_loading_req_ids: list[int]) -> tuple[list[int], list[int]]:
         """
         Get the requests that have finished loading and saving.

         Args:
-            finished_gen_req_ids: The IDs of the requests that have finished generating tokens, and are now asynchronously saving.
+            finished_gen_req_ids: The IDs of the requests that have finished generating tokens,
+                and are now asynchronously saving.
             started_loading_req_ids: The IDs of the requests that have started asynchronously loading.

         Returns:
             The IDs of the requests that have finished saving.
             The IDs of the requests that have finished loading.

-        Note: IDs may only be returned from this call after they've been provided in the `finished_gen_req_ids` and `started_loading_req_ids` arguments.
-        Additionally, the runtime will only take action based on these returned IDs once they've been returned by ALL workers. This allows some workers to take longer than others to complete the operations.
+        Note: IDs may only be returned from this call after they've been provided in the
+        `finished_gen_req_ids` and `started_loading_req_ids` arguments.
+        Additionally, the runtime will only take action based on these returned IDs once
+        they've been returned by ALL workers. This allows some workers to take longer than
+        others to complete the operations.
         """
```
136-181: Scheduler ABC properly defines the orchestration interface. The abstract base class clearly specifies the scheduler's responsibilities for metadata building and request management.
Fix line length for consistency:
```diff
         Returns:
             Whether the request is performing asynchronous saving operations.
-            If true, this indicates that the kv cache manager should wait to deallocate the blocks until the saving has completed (determined by `get_finished` on the workers).
+            If true, this indicates that the kv cache manager should wait to deallocate the blocks
+            until the saving has completed (determined by `get_finished` on the workers).
         """
```
266-285: Token matching logic needs line-formatting fixes. The implementation correctly handles distributed token matching, but has formatting issues and a TODO that should be tracked.
Fix line length and consider tracking the TODO:
```diff
     def get_num_new_matched_tokens(self, request: LlmRequest,
                                    num_computed_tokens: int) -> int:
         num_tokens, load_kv_async = self._run_on_leader(
             lambda: self.scheduler.get_num_new_matched_tokens(
                 request, num_computed_tokens))
         if num_tokens == 0 and load_kv_async:
             raise RuntimeError(
                 "load_kv_async must be False when num_tokens is 0!")
         # TODO(jthomson04): This part is a bit ugly.
-        # When the connector indicates that a request will be loaded asynchronously, we need to suspend it's execution.
-        # This is problematic, since at the point when this function is called, the request has already been scheduled!
-        # Because of this, we need to remove it from our list of scheduled requests (see `take_scheduled_requests_pending_load`).
+        # When the connector indicates that a request will be loaded asynchronously, we need to
+        # suspend its execution. This is problematic, since at the point when this function is
+        # called, the request has already been scheduled! Because of this, we need to remove it
+        # from our list of scheduled requests (see `take_scheduled_requests_pending_load`).
         if load_kv_async:
             self.new_async_requests.loading[request.request_id] = request
             request.is_kv_cache_connector_async_onboard = True
         return num_tokens
```
286-310: Scheduled request filtering correctly implemented.The method properly removes asynchronously loading requests and updates their state.
Fix line length on line 301:
for req in scheduled_requests.context_requests: - # If this request is being loaded asynchronously, in addition to removing it from the list of scheduled requests, - # we also need to update it's state. + # If this request is being loaded asynchronously, in addition to removing it from the + # list of scheduled requests, we also need to update its state. if req.request_id in self.new_async_requests.loading.keys():
319-340: Request completion handling properly manages async operations. The method correctly delegates to the scheduler and updates request state for async saving.
Fix line length and typo:
```diff
         Returns:
-            Whether the request is performing asynchronous saving operations. If true, we do not immediately call free_resources on the request.
+            Whether the request is performing asynchronous saving operations. If true, we do not
+            immediately call free_resources on the request.
         """
```
341-391: Distributed synchronization logic is well-implemented. The `get_finished` method correctly implements the complex distributed coordination using MPI allgather to ensure all ranks agree on request completion.
Fix line length on line 354:
```diff
-        # Pass these newly finished requests into get_finished, and get the list of requests that have finished saving and loading.
+        # Pass these newly finished requests into get_finished, and get the list of requests
+        # that have finished saving and loading.
         (finished_saving, finished_loading) = self.worker.get_finished(
             finished_gen_req_ids, started_loading_req_ids)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (11 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (11 hunks)
- cpp/tensorrt_llm/pybind/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (1 hunks)
- tests/unittest/_torch/test_connector.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
- tensorrt_llm/models/modeling_utils.py
- cpp/tensorrt_llm/pybind/CMakeLists.txt
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
🚧 Files skipped from review as they are similar to previous changes (12)
- tensorrt_llm/llmapi/llm.py
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- tensorrt_llm/executor/worker.py
- tensorrt_llm/executor/executor.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/executor/proxy.py
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camelcase prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope ...
Files:
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/connector.py
- tests/unittest/_torch/test_connector.py
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/connector.py
- tests/unittest/_torch/test_connector.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/connector.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/_torch/test_connector.py (1)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/connector.py
124-124: Line too long (130 > 120)
(E501)
131-131: Line too long (152 > 120)
(E501)
132-132: Line too long (206 > 120)
(E501)
179-179: Line too long (176 > 120)
(E501)
279-279: Line too long (129 > 120)
(E501)
301-301: Line too long (125 > 120)
(E501)
327-327: Line too long (144 > 120)
(E501)
354-354: Line too long (131 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
379-379: Line too long (146 > 120)
(E501)
381-381: Line too long (143 > 120)
(E501)
🔇 Additional comments (40)
cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (4)
1-15: NVIDIA copyright header looks good. The copyright header includes the required current year (2024) and follows the standard NVIDIA format for TensorRT-LLM open source code.
32-54: LGTM on KvCacheConnectorPoolData class design. The class follows good C++ practices:
- Uses move semantics in constructor to avoid unnecessary copies
- Proper const-correctness on getters
- Private member variables with appropriate accessor methods
- Clear, descriptive naming
83-92: LGTM on abstract interface design. The KvCacheConnectorManager follows good abstract interface design:
- Pure virtual destructor with default implementation
- Clear, documented pure virtual method
- Appropriate parameter types and const-correctness
- Good documentation explaining the purpose
94-94: Namespace closing comment follows guidelines. The closing brace properly includes a comment identifying the namespace being closed, as required by the coding guidelines.
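To make the shape of this abstract interface concrete, here is a hedged Python sketch of a connector-manager analogue. The class and method names (`KvCacheConnectorManagerSketch`, `get_num_new_matched_tokens`) are illustrative assumptions, not the actual C++ signatures:

```python
from abc import ABC, abstractmethod


class KvCacheConnectorManagerSketch(ABC):
    """Hypothetical Python analogue of the abstract manager interface."""

    @abstractmethod
    def get_num_new_matched_tokens(self, request_id: int,
                                   num_computed_tokens: int) -> int:
        """Return how many extra tokens the connector can supply beyond
        what the local KV cache already holds for this request."""


class NoOpConnectorManager(KvCacheConnectorManagerSketch):
    """Trivial implementation: the connector never contributes tokens."""

    def get_num_new_matched_tokens(self, request_id, num_computed_tokens):
        return 0
```

A pure-virtual interface like this lets the KV-cache manager query external token matches without depending on any concrete connector backend.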
tensorrt_llm/_torch/pyexecutor/resource_manager.py (10)
13-15: Import looks correct. The import of KvCacheConnectorPoolsData from the internal batch manager bindings is properly structured and follows the existing pattern in the file.
20-20: Import follows proper namespace convention. The import maintains the namespace when importing from the connector module, which aligns with the Python coding guidelines.
136-137: Constructor parameter addition looks good. The optional kv_connector_manager parameter follows the existing pattern and is properly typed with Optional[KvCacheConnectorManager].
153-154: Storage of connector manager is appropriate. The connector manager is stored as an instance variable for later use in the resource management methods.
358-361: Scheduler output initialization is correct. Creating a SchedulerOutput instance to accumulate scheduling metadata for the connector is the right approach.
374-374: Connector manager passed to add_sequence calls. The connector manager is properly passed to the add_sequence calls, enabling coordination with the connector during sequence addition.
Also applies to: 385-386
420-420: Token addition with return value capture. The modification to capture the new_block_id from add_token is correct for tracking newly allocated blocks in generation requests.
425-430: Scheduler output update for generation requests. The logic to add the generation request to the scheduler output with the new block ID (if any) is implemented correctly.
431-433: Connector manager update is appropriate. The conditional check and call to set_scheduler_output properly updates the connector manager with the accumulated scheduling information.
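The accumulate-then-publish pattern described here can be sketched roughly as follows. All names (`SchedulerOutputSketch`, `add_generation_request`, the stub manager) are illustrative assumptions rather than the actual classes:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GenRequestEntry:
    """One scheduled generation request and its newly allocated block, if any."""
    request_id: int
    new_block_id: Optional[int]


@dataclass
class SchedulerOutputSketch:
    generation_requests: List[GenRequestEntry] = field(default_factory=list)

    def add_generation_request(self, request_id, new_block_id):
        self.generation_requests.append(GenRequestEntry(request_id, new_block_id))


class ConnectorManagerStub:
    def __init__(self):
        self.latest_output = None

    def set_scheduler_output(self, output):
        # The connector later builds its metadata from this snapshot.
        self.latest_output = output


# Resource-manager side: accumulate per-request metadata while scheduling,
# then hand the whole snapshot to the connector manager in one call.
output = SchedulerOutputSketch()
output.add_generation_request(7, new_block_id=42)
output.add_generation_request(8, new_block_id=None)  # no new block this step

manager = ConnectorManagerStub()
manager.set_scheduler_output(output)
```

Publishing one consolidated snapshot per scheduling step keeps the connector decoupled from the block manager's internal allocation loop.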
933-935: New method implementation looks correct. The get_kv_cache_connector_pools_data method properly delegates to the implementation and returns the connector pools data.
tensorrt_llm/_torch/pyexecutor/py_executor.py (6)
30-30: LGTM! Import additions are appropriate. The new imports for DecoderLayer and KvCacheConnectorManager are correctly placed and necessary for the KV cache connector integration.
Also applies to: 32-32
148-149: Constructor properly extended with KV connector manager parameter. The optional parameter follows best practices with type hints and an appropriate default value. The initialization is correctly deferred to a dedicated method.
Also applies to: 257-259
923-928: Assertion correctly relaxed for KV cache connector compatibility. The modified assertion appropriately accounts for both the KV cache transceiver's and the connector's ability to suspend requests, preventing false failures when no requests are immediately schedulable.
972-974: Proper integration of scheduled request filtering. The call to take_scheduled_requests_pending_load correctly removes asynchronously loading requests from the scheduled batch before forward execution.
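A hedged sketch of what this filtering step amounts to — the request shape and function body are illustrative assumptions, not the real implementation:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    request_id: int
    is_async_loading: bool = False


@dataclass
class ScheduledBatch:
    context_requests: List[Request] = field(default_factory=list)


def take_scheduled_requests_pending_load(batch: ScheduledBatch) -> List[Request]:
    """Remove requests whose KV blocks are still loading asynchronously.

    Held-back requests skip the forward pass for this iteration and are
    re-scheduled once their KV cache onboarding completes.
    """
    pending = [r for r in batch.context_requests if r.is_async_loading]
    batch.context_requests = [
        r for r in batch.context_requests if not r.is_async_loading
    ]
    return pending


batch = ScheduledBatch(context_requests=[
    Request(1, is_async_loading=True),
    Request(2, is_async_loading=False),
])
held = take_scheduled_requests_pending_load(batch)
```

Filtering in place keeps the forward path unaware of the connector: it only ever sees requests whose KV data is already resident.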
1008-1012: Finished request handling properly integrated. The connector manager correctly takes over the responsibility of tracking and terminating finished requests, avoiding conflicts with the standard resource management flow.
1459-1461: Resource management delegation correctly implemented. The conditional logic ensures that the connector manager handles resource cleanup for its managed requests, preventing double-free scenarios.
tests/unittest/_torch/test_connector.py (4)
1-18: Test setup properly configured for MPI testing. The imports and pickle configuration are correctly set up for distributed testing with cloudpickle and mpi4py.
25-76: Comprehensive test for distributed request completion synchronization. The test effectively validates the MPI allgather behavior, ensuring requests are only marked as finished when all ranks agree. The mock setup and assertions properly cover the distributed coordination logic.
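The agreement rule this test exercises can be sketched without MPI: a request counts as globally finished only once every rank has reported it. Here the allgather is simulated with a plain list of per-rank sets (the function name is illustrative):

```python
from typing import List, Set


def finished_on_all_ranks(per_rank_finished: List[Set[int]]) -> Set[int]:
    """Simulate allgather + intersection: a request id is globally finished
    only when every rank's worker has reported it finished."""
    result = set(per_rank_finished[0])
    for rank_set in per_rank_finished[1:]:
        result &= rank_set
    return result


# Rank 1 has not yet finished saving request 11, so only 10 is released.
gathered = [{10, 11}, {10}, {10, 11}]
globally_done = finished_on_all_ranks(gathered)
```

Intersecting the per-rank reports prevents a request from being freed while any rank is still reading or writing its KV blocks.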
79-105: Token matching test correctly validates leader-only behavior. The test properly verifies that token matching queries are handled only on the leader rank and that the async onboard flag is correctly set on requests.
108-153: Scheduled request filtering test validates critical functionality. The test properly verifies that asynchronously loading requests are removed from the scheduled batch while synchronously loading requests are preserved.
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4)
93-97: addToken signature properly extended with optional return value. The method correctly uses std::optional<SizeType32> for the return type and includes the new returnNewBlockId parameter with an appropriate default value.
99-106: addSequence signature correctly extended for connector integration. The method properly accepts optional references for both llmRequest and kvCacheConnectorManager, maintaining backward compatibility.
241-245: New virtual method properly declared for connector pool data. The getKvCacheConnectorPoolsData method follows the established pattern with the [[nodiscard]] attribute and proper const-correctness.
351-354: Python bindings correctly updated for new signatures. The bindings properly expose the new parameters and methods with appropriate default values and naming conventions.
Also applies to: 436-436
tensorrt_llm/_torch/pyexecutor/connector.py (5)
1-31: Well-structured module with clear documentation. The imports are appropriate and the module docstring provides excellent context about the KV Cache Connector's purpose and capabilities.
34-58: Data classes appropriately model request metadata. The RequestData and SchedulerOutput classes provide clean interfaces for tracking inflight requests and their associated metadata.
183-230: AsyncRequests helper class provides clean request management. The internal dataclass effectively encapsulates async request tracking with well-designed methods for manipulation and querying.
232-265: Manager initialization correctly enforces rank constraints. The constructor properly validates that the scheduler only exists on rank 0 and initializes request tracking structures.
392-403: Hook methods provide clean integration points. The scheduler output setter and layer/model hooks are properly implemented for executor integration.
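The layer-hook integration can be pictured as pre/post callbacks wrapped around each decoder layer's forward pass. This toy version (no torch; `wait_for_layer_load` and `save_kv_layer` are assumed names modeled on the worker interface) just records the call order the real hooks would follow:

```python
calls = []


class ConnectorWorkerStub:
    def wait_for_layer_load(self, layer_idx):
        # Pre-hook: block until this layer's KV blocks are onboarded.
        calls.append(("wait_load", layer_idx))

    def save_kv_layer(self, layer_idx):
        # Post-hook: kick off (possibly async) saving of this layer's KV.
        calls.append(("save", layer_idx))


def run_layer(worker, layer_idx):
    worker.wait_for_layer_load(layer_idx)
    calls.append(("forward", layer_idx))  # the layer's actual forward pass
    worker.save_kv_layer(layer_idx)


worker = ConnectorWorkerStub()
for idx in range(2):
    run_layer(worker, idx)
```

Per-layer hooks let loads and saves overlap with compute: layer N+1 can load while layer N runs its forward pass.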
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (7)
1200-1206: LGTM - Clean parameter forwarding. The method signature correctly adds the KV cache connector manager parameter and properly forwards it to the underlying window block manager.
1208-1254: Well-implemented KV cache connector integration. The method properly:
- Validates that beam search is not used with the connector
- Retrieves additional matched tokens from the connector
- Updates the prepopulated prompt length correctly
- Includes helpful debug logging
The constraint checking and integration logic appears sound.
1983-2050: Correctly implemented optional block ID return. The method properly:
- Validates that the returnNewBlockId feature only works with single window size and no beam search
- Conditionally returns the newly allocated block ID when requested
- Maintains backward compatibility with existing callers
The constraint validation is consistent with other KV cache connector limitations.
2052-2056: LGTM - Clean parameter addition. The method signature correctly adds the optional returnNewBlockId parameter and properly forwards it to the updateToken method.
2065-2137: Good validation and user guidance for connector usage. The method properly:
- Validates beam search compatibility with the connector
- Provides helpful warnings when the connector is specified but block reuse is disabled
- Maintains consistency with connector constraints across the codebase
The user-facing warnings will help developers understand when and why the connector is ignored.
2632-2648: Correctly implemented connector pools data aggregation. The method properly:
- Retrieves pool data from the block manager
- Extracts layer-to-pool mapping from the internal buffer
- Combines both pieces of information into the connector pools data structure
The buffer indexing logic (layer * 2) correctly accounts for the paired data structure.
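The paired-buffer indexing mentioned above can be sketched as follows. The exact buffer layout is an assumption extrapolated from the review's `layer * 2` remark — two entries per layer, with the pool index assumed to sit at offset `layer * 2`:

```python
from typing import List


def layer_to_pool_mapping(paired_buffer: List[int], num_layers: int) -> List[int]:
    """Extract one pool index per layer from a flat buffer holding two
    values per layer, reading the first value of each pair."""
    assert len(paired_buffer) >= num_layers * 2
    return [paired_buffer[layer * 2] for layer in range(num_layers)]


# Three layers: pool indices 0, 0, 1 interleaved with a second per-layer value.
buf = [0, 5, 0, 6, 1, 7]
mapping = layer_to_pool_mapping(buf, 3)
```

Deriving the mapping from the existing buffer avoids maintaining a second, separately synchronized layer-to-pool table.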
2497-2497: LGTM - Correct parameter addition. The call to updateToken correctly includes the returnNewBlockId parameter set to false, which is appropriate for the removeToken operation.
c0b41f9 to 7b378bd Compare
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
377-420: Complex connector integration logic needs refactoring. This logic contains the same complexity issues previously identified, with nested conditionals and state management that make it difficult to follow. The TODO comment confirms this needs refactoring.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/connector.py (1)
266-284: Address the technical debt noted in the TODO comment. The current approach of modifying request state after scheduling is acknowledged as "ugly" in the TODO comment. While the implementation works, consider refactoring this in a future iteration to make the async loading decision earlier in the scheduling process.
The comment on lines 276-279 indicates technical debt where async loading decisions cause post-scheduling modifications. Consider refactoring to make these decisions during the scheduling phase to avoid the need for take_scheduled_requests_pending_load.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (11 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (11 hunks)
- cpp/tensorrt_llm/pybind/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (1 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (1 hunks)
- tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
- tests/unittest/_torch/test_connector.py (1 hunks)
✅ Files skipped from review due to trivial changes (5)
- cpp/tensorrt_llm/pybind/CMakeLists.txt
- cpp/tensorrt_llm/pybind/bindings.cpp
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
🚧 Files skipped from review as they are similar to previous changes (15)
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/models/modeling_utils.py
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- tensorrt_llm/executor/worker.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/executor/executor.py
- tests/unittest/_torch/test_connector.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
- tensorrt_llm/_torch/pyexecutor/connector.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
- tensorrt_llm/_torch/pyexecutor/connector.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (5)
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/connector.py
124-124: Line too long (130 > 120)
(E501)
131-131: Line too long (152 > 120)
(E501)
132-132: Line too long (206 > 120)
(E501)
179-179: Line too long (176 > 120)
(E501)
279-279: Line too long (129 > 120)
(E501)
301-301: Line too long (125 > 120)
(E501)
327-327: Line too long (144 > 120)
(E501)
354-354: Line too long (131 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
380-380: Line too long (146 > 120)
(E501)
382-382: Line too long (143 > 120)
(E501)
🔇 Additional comments (22)
tests/integration/defs/llmapi/test_llm_api_connector.py (4)
1-30: LGTM! Well-structured imports and mocking setup. The copyright header, imports, and global mock setup follow TensorRT-LLM conventions and provide a clean foundation for integration testing.
32-64: LGTM! Excellent test utilities for call order verification. The CallTimeMonitor class is a well-designed utility that provides clear verification of method call ordering, which is crucial for testing distributed KV cache coordination.
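A minimal version of such a call-order monitor (a hypothetical reimplementation for illustration, not the test's actual code) could look like:

```python
import itertools


class CallTimeMonitorSketch:
    """Record a monotonically increasing tick for each named call so a test
    can assert that one method ran before another."""

    def __init__(self):
        self._ticks = itertools.count()
        self.call_times = {}

    def record(self, name):
        self.call_times.setdefault(name, []).append(next(self._ticks))

    def assert_called_before(self, first, second):
        assert self.call_times[first][0] < self.call_times[second][0], (
            f"{first} was not called before {second}")


monitor = CallTimeMonitorSketch()
monitor.record("build_connector_meta")
monitor.record("start_load_kv")
monitor.assert_called_before("build_connector_meta", "start_load_kv")
```

Recording ticks rather than wall-clock times keeps the ordering assertions deterministic even when the monitored calls are fast.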
66-99: LGTM! Comprehensive fixture setup for connector testing. The fixture properly mocks the connector components and sets up thorough monitoring of key API methods. This provides excellent test coverage for the KV cache connector integration.
101-159: LGTM! Comprehensive integration test for connector workflow. The test effectively validates the complete KV cache connector workflow including:
- Proper initialization and registration
- Correct sequencing of scheduler metadata building before worker operations
- Layer-wise KV cache loading and saving operations
- Request completion handling
The call ordering verification using CallTimeMonitor ensures the distributed coordination works correctly.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
13-20: LGTM! Appropriate imports for connector integration. The new imports correctly bring in the KV cache connector types and manager classes needed for the feature integration.
136-154: LGTM! Clean constructor enhancement for connector support. The optional kv_connector_manager parameter is properly added with appropriate type hints and follows the existing parameter pattern.
935-937: LGTM! Clean accessor method for connector pools data. The new method follows the established pattern of delegating to the underlying C++ implementation and provides necessary access to connector pool information.
tensorrt_llm/_torch/pyexecutor/py_executor.py (7)
30-30: LGTM! Necessary imports for KV Cache Connector integration. The imports are properly structured and follow existing patterns in the file.
Also applies to: 32-32
150-150: LGTM! Constructor parameter addition follows best practices. The optional parameter with proper type annotation and default value ensures backward compatibility while enabling the new KV Cache Connector functionality.
Also applies to: 258-261
265-295: LGTM! Excellent error checking and initialization logic. The method properly validates incompatible configurations and provides clear error messages. The hook registration on DecoderLayer modules follows a clean pattern for integrating with the model's forward pass.
918-924: LGTM! Correctly handles KV Cache Connector request suspension. The logic appropriately exempts the assertion when the KV Cache Connector is enabled, since it can suspend requests similarly to the transceiver. The explanatory comment makes the intent clear.
965-965: LGTM! Well-integrated KV Cache Connector operations. The connector operations are properly sequenced within the executor loop:
- Setting the warmup flag provides necessary context
- Async loading operations occur at the right time after resource preparation
- Finished request handling is appropriately placed in the cleanup phase
The control flow logic correctly ensures connector operations only run when enabled.
Also applies to: 968-973, 1006-1010
1360-1363: LGTM! Properly placed synchronization call. The `wait_for_save()` call is correctly positioned after the forward step to ensure all KV cache saving operations complete before continuing. The conditional check ensures it only runs when the connector is enabled.
1461-1464: LGTM! Proper resource management delegation. The conditional logic correctly delegates resource management to the KV Cache Connector when it's handling a request's lifecycle, preventing resource conflicts while maintaining backward compatibility.
tensorrt_llm/_torch/pyexecutor/connector.py (8)
1-31: LGTM! Excellent module documentation and clean imports. The comprehensive documentation clearly explains the KV Cache Connector's purpose, use cases, and architecture. The imports are well-organized and include all necessary components for the interface definition.
34-58: LGTM! Well-designed data classes with clear structure.Both
RequestDataandSchedulerOutputfollow dataclass best practices:
- Clear field documentation
- Appropriate use of
field(default_factory=list)for mutable defaults- Clean helper methods for building the data structures
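The `default_factory` pattern praised above avoids the classic shared-mutable-default bug. A minimal sketch, with field and method names assumed for illustration rather than copied from the real `RequestData`/`SchedulerOutput` definitions:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestData:
    request_id: int
    # default_factory gives each instance its own list, unlike a shared
    # `new_tokens: List[int] = []` default would.
    new_tokens: List[int] = field(default_factory=list)


@dataclass
class SchedulerOutput:
    requests: List[RequestData] = field(default_factory=list)

    def add_request(self, request_id, new_tokens=()):
        """Helper for building the output incrementally."""
        self.requests.append(RequestData(request_id, list(new_tokens)))


out = SchedulerOutput()
out.add_request(7, [1, 2, 3])
```

Each `SchedulerOutput()` starts with an independent empty list, so mutating one instance never leaks into another.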
60-134: LGTM! Comprehensive worker interface with excellent documentation. The abstract base class defines a clear contract for KV cache operations:
- Well-separated concerns for registration, loading, and saving
- Comprehensive docstrings explaining expected behavior
- The `get_finished` method design allows for proper coordination across distributed workers

Note: There are some line-length violations flagged by static analysis, but readability matters more than strict adherence to the 120-character limit in documentation.
136-181: LGTM! Well-designed scheduler interface with appropriate abstractions. The scheduler interface provides the right level of abstraction:
- Required methods are abstract while optional ones have safe defaults
- Clear documentation of expected behavior and return values
- Supports both synchronous and asynchronous operation patterns
The default implementation of `get_num_new_matched_tokens` returning `(0, False)` is a safe fallback.
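The "required methods abstract, optional methods with safe defaults" style described above can be sketched as follows. The method names come from the review text; everything else (the `NoopScheduler` subclass, the argument shapes) is an illustrative assumption:

```python
from abc import ABC, abstractmethod
from typing import Tuple


class KvConnectorScheduler(ABC):
    """Sketch of a scheduler-side connector interface."""

    @abstractmethod
    def build_connector_meta(self, scheduler_output):
        """Required: build the metadata that workers will consume."""

    def get_num_new_matched_tokens(self, request,
                                   num_computed_tokens) -> Tuple[int, bool]:
        """Optional: (number of externally matched tokens, whether the
        load is asynchronous). The default reports no external match."""
        return (0, False)


class NoopScheduler(KvConnectorScheduler):
    """Minimal concrete implementation: only the required method."""

    def build_connector_meta(self, scheduler_output):
        return None


sched = NoopScheduler()
```

A subclass that cannot match tokens externally simply inherits the `(0, False)` fallback and stays correct without overriding anything optional.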
183-230: LGTM! Excellent state management for asynchronous operations. The `AsyncRequests` class provides clean encapsulation of async request state:
- Efficient dictionary-based storage for fast lookups
- Proper ownership transfer in the `add_from` method
- Clean extraction and querying methods
- Well-documented interface for managing async saving/loading requests
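A small sketch of the idea described above: a dict keyed by request ID gives O(1) lookup, and `add_from` moves entries between containers so exactly one container owns each request at a time. Method names follow the review; the implementation details are assumptions, not the real class:

```python
class AsyncRequests:
    """Tracks in-flight async requests, keyed by request ID."""

    def __init__(self):
        self._requests = {}  # request_id -> request object

    def add(self, request_id, request):
        self._requests[request_id] = request

    def add_from(self, other, request_ids):
        """Take ownership of the given requests from `other`."""
        for rid in request_ids:
            self._requests[rid] = other._requests.pop(rid)

    def extract(self, request_ids):
        """Remove and return the matching requests."""
        return {
            rid: self._requests.pop(rid)
            for rid in request_ids if rid in self._requests
        }

    def __contains__(self, request_id):
        return request_id in self._requests


saving = AsyncRequests()
saving.add(1, "req1")
saving.add(2, "req2")

finished = AsyncRequests()
finished.add_from(saving, [1])  # request 1 moves: saving -> finished
```

After the transfer, request 1 lives only in `finished` while request 2 remains in `saving`, which is the ownership-transfer behavior the review highlights.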
232-265: LGTM! Well-structured coordinator class with proper rank validation. The constructor correctly enforces that the scheduler only exists on rank 0, which is essential for the distributed coordination pattern. The inheritance from the C++ binding provides the necessary low-level functionality.
341-390: Excellent distributed coordination logic in the `get_finished` method. The implementation correctly handles the complex distributed synchronization:
- Uses MPI allgather to collect results from all workers
- Employs set intersection to ensure consensus across all workers
- Proper state management for async requests through different phases
- Correct state transitions for finished loading requests
This ensures that resources are only freed when all workers agree the operations are complete.
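The consensus step described above can be sketched without MPI: each worker reports the set of request IDs it considers finished, and a request is only treated as globally finished once every worker agrees, i.e. the intersection over the allgather result. In real code `per_worker_finished` would come from an MPI allgather; here it is a plain list of sets for illustration:

```python
from functools import reduce


def finished_on_all_workers(per_worker_finished):
    """Return the request IDs every worker reported as finished.

    `per_worker_finished` stands in for the result of an MPI allgather:
    one iterable of finished request IDs per worker.
    """
    if not per_worker_finished:
        return set()
    return reduce(set.intersection, (set(s) for s in per_worker_finished))


# Worker 1 is still saving request 9, so only 3 and 5 are globally finished.
gathered = [{3, 5, 9}, {3, 5}, {5, 3, 9}]
done = finished_on_all_workers(gathered)
```

Only after a request ID appears in `done` would the manager free its resources, which prevents freeing blocks that a slower worker is still writing.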
395-403: LGTM! Clean hook integration for model execution. The hook methods provide clean integration points:
- `layer_pre_hook` waits for layer loading before computation
- `layer_post_hook` initiates layer saving after computation
- `model_post_hook` ensures all saving operations complete

This design allows seamless integration with the model's forward pass without modifying core model code.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
145-147: Break the long line to comply with the line-length limit. Line 146 exceeds the 120-character limit.

```diff
-    model.generate([
-        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
-    ], SamplingParams(max_tokens=NUM_TOKENS, ignore_eos=True))
+    model.generate([
+        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
+        "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
+    ], SamplingParams(max_tokens=NUM_TOKENS, ignore_eos=True))
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (8 hunks)
- tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case, and prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
🪛 Ruff (0.12.2)
tests/integration/defs/llmapi/test_llm_api_connector.py
146-146: Line too long (133 > 120)
(E501)
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/py_executor.py (6)
30-30: LGTM! The import additions are necessary for the KV Cache Connector functionality: `DecoderLayer` for registering forward hooks and `KvCacheConnectorManager` for connector management.

Also applies to: 32-32
132-150: LGTM! The constructor properly accepts the optional `kv_connector_manager` parameter and defers initialization to a dedicated method, following good separation of concerns.

Also applies to: 258-260
265-295: Well implemented initialization with proper validation. The method correctly validates incompatible configurations and properly integrates the KV connector with existing systems. The forward hook registration on `DecoderLayer` modules is essential for layer-level KV cache management.
918-923: LGTM! The assertion logic correctly excludes the KV cache connector case, mirroring the existing pattern for `kv_cache_transceiver`. The comment clearly explains why both features can put requests to sleep.
965-973: Well integrated KV connector operations in the executor loop. The connector operations are properly placed in the execution flow:
- Pre-forward: request preparation and KV loading initiation
- Post-forward: wait for save completion
- Post-batch: cleanup of finished requests
The integration maintains consistency with existing patterns and properly handles resource management.
Also applies to: 974-976, 1006-1009, 1360-1362
1461-1463: Proper resource management delegation to KV connector. The method correctly delegates resource freeing to the connector when available, preventing double-freeing and maintaining proper resource ownership between the connector and resource manager.
af4c2b9 to 9a5bab4 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
377-416: Address the line length violations and refactor as previously suggested. The complex connector integration logic still needs the refactoring suggested in the previous review. Additionally, lines 380 and 382 exceed the 120-character limit and should be reformatted.
Apply this diff to fix the line length violations:

```diff
-# The result of this call may be that blocks will be loaded asynchronously.
-# If so, we set the is_kv_cache_connector_async_onboard flag, and set the request state to be DISAGG_GENERATION_TRANS_IN_PROGRESS.
+# The result of this call may be that blocks will be loaded
+# asynchronously. If so, we set the is_kv_cache_connector_async_onboard
+# flag, and set the request state to be DISAGG_GENERATION_TRANS_IN_PROGRESS.
-# When the async load is complete, we set the request state back to CONTEXT_INIT.
-# When that happens, the request will go through this same code path, but with is_kv_cache_connector_async_onboard set to True.
+# When the async load is complete, we set the request state back to
+# CONTEXT_INIT. When that happens, the request will go through this same
+# code path, but with is_kv_cache_connector_async_onboard set to True.
```

tests/integration/defs/llmapi/test_llm_api_connector.py (2)
26-26: Remove sys.path manipulation in test file. This issue was already identified in a previous review.

28-43: Remove unused global mock variables and function. This issue was already identified in a previous review.
🧹 Nitpick comments (2)
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
149-149: Fix the line length violation. The line exceeds the 120-character limit. Consider breaking the long string literal.

```diff
-    model.generate([
-        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
-    ], SamplingParams(max_tokens=NUM_TOKENS, ignore_eos=True))
+    long_text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
+                 "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.")
+    model.generate([long_text], SamplingParams(max_tokens=NUM_TOKENS, ignore_eos=True))
```

tensorrt_llm/_torch/pyexecutor/connector.py (1)
126-126: Fix the line length violations. Several lines exceed the 120-character limit. Consider breaking long comments and parameter lists for better readability.
Example fixes:
```diff
-        Args:
-            finished_gen_req_ids: The IDs of the requests that have finished generating tokens, and are now asynchronously saving.
+        Args:
+            finished_gen_req_ids: The IDs of the requests that have finished
+                generating tokens, and are now asynchronously saving.
-        Note: IDs may only be returned from this call after they've been provided in the `finished_gen_req_ids` and `started_loading_req_ids` arguments.
+        Note: IDs may only be returned from this call after they've been provided
+        in the `finished_gen_req_ids` and `started_loading_req_ids` arguments.
```

Also applies to: 133-134, 181-181, 240-240, 291-291, 313-313, 339-339, 366-366
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (11 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (9 hunks)
- cpp/tensorrt_llm/pybind/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (1 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (1 hunks)
- tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
- tests/unittest/_torch/test_connector.py (1 hunks)
✅ Files skipped from review due to trivial changes (5)
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- cpp/tensorrt_llm/pybind/CMakeLists.txt
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
🚧 Files skipped from review as they are similar to previous changes (15)
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/models/modeling_utils.py
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- tensorrt_llm/executor/worker.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/llmapi/llm_args.py
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tests/unittest/_torch/test_connector.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
- tensorrt_llm/_torch/pyexecutor/connector.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
- tensorrt_llm/_torch/pyexecutor/connector.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor.pytensorrt_llm/_torch/pyexecutor/connector.pytensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.pytensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🪛 Ruff (0.12.2)
tests/integration/defs/llmapi/test_llm_api_connector.py
149-149: Line too long (133 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/connector.py
126-126: Line too long (130 > 120)
(E501)
133-133: Line too long (152 > 120)
(E501)
134-134: Line too long (206 > 120)
(E501)
181-181: Line too long (176 > 120)
(E501)
240-240: Line too long (123 > 120)
(E501)
291-291: Line too long (129 > 120)
(E501)
313-313: Line too long (125 > 120)
(E501)
339-339: Line too long (144 > 120)
(E501)
366-366: Line too long (131 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
380-380: Line too long (146 > 120)
(E501)
382-382: Line too long (143 > 120)
(E501)
🔇 Additional comments (22)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
13-14: LGTM! The new imports are properly placed and necessary for the KV cache connector integration. They follow the existing import patterns and maintain proper namespacing.
Also applies to: 20-20
136-136: LGTM! The constructor modification properly integrates the KV cache connector manager with correct type annotations and maintains backward compatibility through the optional parameter.
Also applies to: 153-154
937-939: LGTM! The new method properly exposes the KV cache connector pools data with correct type annotations and follows the established delegation pattern used throughout the class.
tensorrt_llm/_torch/pyexecutor/py_executor.py (7)
132-150: LGTM! The constructor parameter addition follows good practices with proper optional typing and placement at the end of the parameter list.
265-295: Well-structured connector initialization with proper validation. The method correctly enforces mutual exclusivity with incompatible features, provides clear error messages, and properly registers KV cache pools and decoder layer hooks. The precondition checks and initialization sequence are well-organized.
918-924: LGTM! The assertion relaxation correctly mirrors the existing `kv_cache_transceiver` logic. Both components can suspend requests, so the batch size assertion should only apply when neither is active. The explanatory comment is helpful.
968-973: Proper connector integration in the executor loop. The connector operations are called at appropriate points: removing pending async requests after resource preparation, handling metadata, and starting KV loading before the forward pass. The sequence is logical and well-integrated.
1006-1010: Correct handling of finished connector requests. The connector cleanup is properly placed after the transceiver checks, and resource freeing is appropriately delegated to the resource manager for finished requests.
1360-1362: LGTM! The `wait_for_save` call is correctly placed after the forward pass to ensure asynchronous saving operations complete before proceeding. The null check is appropriate for the optional connector manager.
1461-1464: Proper resource management delegation to connector.The conditional logic correctly delegates resource freeing to the connector manager when it handles the request, preventing double-freeing while maintaining fallback to the standard resource manager. This follows the connector's responsibility for managing async request resources.
tests/integration/defs/llmapi/test_llm_api_connector.py (3)
45-67: Well-designed test fixture with clear limitations. The fixture properly mocks the importlib mechanism to inject test doubles, and the single-process limitation is clearly documented with a good explanation of why MagicMock requires this constraint.
69-127: Comprehensive test coverage for connector scenarios. The test thoroughly validates the connector integration by verifying call counts, argument correctness, and proper sequencing of scheduler and worker method calls. The assertions effectively test the expected behavior patterns.
159-198: Well-structured async save test. The test properly configures the scheduler to simulate asynchronous saving and validates that the correct request IDs are passed to `get_finished` at the appropriate times. The assertion logic effectively verifies the async save behavior.

tensorrt_llm/_torch/pyexecutor/connector.py (9)
37-60: Well-designed data classes with clear documentation. The `RequestData` and `SchedulerOutput` classes provide clean abstractions for request tracking with proper type annotations and helpful documentation. The `add_request` helper method is a good convenience addition.
62-136: Comprehensive and well-documented worker interface. The abstract base class provides a clear contract for worker implementations with well-defined methods for KV cache operations. The documentation is thorough and explains the purpose and expected behavior of each method. The metadata binding mechanism enables flexible scheduler-worker coordination.
138-183: Well-structured scheduler interface with appropriate abstractions. The scheduler interface correctly focuses on orchestration responsibilities with clear method contracts. The combination of abstract methods for core functionality and default implementations for optional features provides good flexibility. The documentation effectively explains the expected behavior and return values.
185-232: Efficient helper class for async request management. The `AsyncRequests` class provides clean abstractions for managing saving and loading requests with efficient operations for transferring, extracting, and querying requests. The design supports the state transitions needed by the connector manager.
234-276: Solid distributed coordination design with proper MPI integration. The manager correctly enforces scheduler centralization on rank 0, implements proper three-phase async request tracking, and provides clean MPI coordination through the `_run_on_leader` method. The inheritance from the C++ binding enables effective integration with native components.
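The leader-only execution pattern mentioned above can be sketched as: rank 0 runs the scheduler callback and the result is broadcast to every rank. The broadcast is stubbed out here with an in-process stand-in; real code would use something like mpi4py's `comm.bcast`, and all names (`FakeComm`, `run_on_leader`) are illustrative assumptions:

```python
class FakeComm:
    """In-process stand-in for an MPI communicator shared by all ranks."""

    def __init__(self):
        self._payload = None

    def bcast(self, value, root_rank, my_rank):
        # The root stores its value; every rank returns the stored value.
        if my_rank == root_rank:
            self._payload = value
        return self._payload


def run_on_leader(fn, rank, comm):
    """Run `fn` on rank 0 only, then broadcast its result to all ranks."""
    result = fn() if rank == 0 else None
    return comm.bcast(result, root_rank=0, my_rank=rank)


comm = FakeComm()
# Rank 0 computes the scheduler decision; ranks 1..n receive the same value.
results = [
    run_on_leader(lambda: {"num_matched": 4}, rank, comm)
    for rank in range(3)
]
```

Keeping the scheduler decision on a single rank and broadcasting it avoids divergent state across workers, which is why the rank-0 validation in the constructor matters.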
278-296: Correctly handles a complex async loading scenario. The method properly validates async loading conditions and handles the challenging case of suspending already-scheduled requests. The detailed comment explains the architectural complexity, and the implementation correctly manages request state transitions for async loading.
298-351: Proper request scheduling and metadata coordination. The methods correctly handle the complex interactions between scheduling and async operations. Request state management is consistent, metadata coordination follows the proper leader pattern, and the filtering logic appropriately removes async requests from normal scheduling.
353-402: Sophisticated and correct async request completion handling. The `get_finished` method implements proper distributed coordination using MPI allgather to ensure all workers agree on completion before cleanup. The three-phase request tracking and intersection logic correctly handle race conditions and maintain consistent state across the distributed system.
407-415: Clean integration hooks for layer-level operations. The hook methods provide appropriate integration points for waiting on layer loads and triggering saves, enabling seamless integration with the model execution flow.
9a5bab4 to b678d9b (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
375-416: Address line length violations and consider the previous refactoring suggestion. The complex connector integration logic has line length violations on lines 381 and 383 (146 and 143 characters respectively, exceeding the 120-character limit).
Apply this diff to fix the line length violations:

```diff
-# The result of this call may be that blocks will be loaded asynchronously.
+# The result of this call may be that blocks will be loaded
+# asynchronously.
-# When that happens, the request will go through this same code path, but with is_kv_cache_connector_async_onboard set to True.
+# When that happens, the request will go through this same code path,
+# but with is_kv_cache_connector_async_onboard set to True.
```

The previous review comment about refactoring this complex logic into separate methods remains valid and should be addressed.
tests/integration/defs/llmapi/test_llm_api_connector.py (2)
27-27: Remove sys.path manipulation in test file. Modifying `sys.path` in test files is a code smell that can lead to import issues and makes tests fragile. Consider using proper module imports or test fixtures instead.

```diff
-sys.path.append(os.path.dirname(os.path.abspath(__file__)))
```
29-44: Remove unused global mock variables and function. The global `KvConnectorScheduler` and `KvConnectorWorker` variables and the `init_connector_classes` function are defined but never used. The actual mocking is done in the `connector` fixture.

```diff
-KvConnectorScheduler = MagicMock()
-KvConnectorWorker = MagicMock()
-
-
-def init_connector_classes():
-    KvConnectorScheduler.reset_mock()
-    KvConnectorWorker.reset_mock()
-
-    scheduler = MagicMock()
-    worker = MagicMock()
-
-    KvConnectorScheduler.return_value = scheduler
-    KvConnectorWorker.return_value = worker
-
-    return scheduler, worker
```
🧹 Nitpick comments (1)
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
153-153: Address the line length issue. The line exceeds the 120-character limit as flagged by static analysis.

```diff
-        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
+        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
+        "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (11 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (9 hunks)
- cpp/tensorrt_llm/pybind/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (1 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (1 hunks)
- tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
- tests/unittest/_torch/test_connector.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- cpp/tensorrt_llm/pybind/CMakeLists.txt
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
🚧 Files skipped from review as they are similar to previous changes (16)
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/models/modeling_utils.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- tests/unittest/_torch/test_connector.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- tensorrt_llm/executor/worker.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/integration/defs/llmapi/test_llm_api_connector.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/connector.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/integration/defs/llmapi/test_llm_api_connector.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/connector.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/connector.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🪛 Ruff (0.12.2)
tests/integration/defs/llmapi/test_llm_api_connector.py
153-153: Line too long (133 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/connector.py
128-128: Line too long (130 > 120)
(E501)
135-135: Line too long (152 > 120)
(E501)
136-136: Line too long (206 > 120)
(E501)
184-184: Line too long (176 > 120)
(E501)
243-243: Line too long (123 > 120)
(E501)
294-294: Line too long (129 > 120)
(E501)
316-316: Line too long (125 > 120)
(E501)
342-342: Line too long (144 > 120)
(E501)
369-369: Line too long (131 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
381-381: Line too long (146 > 120)
(E501)
383-383: Line too long (143 > 120)
(E501)
🔇 Additional comments (32)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
13-14: LGTM! Clean integration of KV cache connector imports and parameter. The imports maintain proper namespace conventions and the optional parameter design allows backward compatibility.
Also applies to: 20-20, 137-137, 154-154
361-361: LGTM! Clean scheduler output management pattern. The pattern of creating a scheduler output object to accumulate information and then dispatching it to the connector manager (when present) is well-designed and null-safe.
Also applies to: 437-438
419-435: LGTM! Well-structured generation batch processing. The pattern of capturing block IDs before and after token addition to compute deltas is clear and correctly implemented. The use of `scheduler_output` to communicate changes maintains consistency with the context batch processing.
939-940: LGTM! Clean delegation method. The new method properly exposes the KV cache connector pools data from the underlying implementation with correct type annotations.
tensorrt_llm/_torch/pyexecutor/py_executor.py (9)
132-150: LGTM! Constructor parameter addition follows established pattern. The new `kv_connector_manager` parameter is properly integrated into the constructor signature with appropriate default value and type annotation.
258-261: LGTM! Proper initialization of connector manager. The connector manager is stored as an instance variable and initialization is delegated to a separate method, following good separation of concerns.
265-295: Excellent compatibility validation and integration logic. The `_maybe_init_kv_connector_manager` method properly validates incompatible feature combinations and integrates the connector with the KV cache manager and model layers. The compatibility checks prevent runtime issues by failing early.

Key validations implemented:
- Incompatibility with `KvCacheTransceiver`
- Incompatibility with pipeline parallelism
- Incompatibility with the overlap scheduler
The integration with decoder layers using forward hooks is well-designed for layer-level KV cache coordination.
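As a rough illustration of that hook pattern, here is a minimal plain-Python sketch (the classes and the event log are hypothetical stand-ins, not TensorRT-LLM types): a pre-hook waits for the layer's KV load and a post-hook kicks off its save, bracketing each layer's forward pass.

```python
# Illustrative sketch of per-layer KV connector hooks.
# FakeConnectorManager / FakeLayer are stand-ins, not the real classes.
events = []


class FakeConnectorManager:
    def layer_pre_hook(self, layer_idx):
        # In the real flow this would block until the layer's KV blocks
        # have finished loading on the appropriate CUDA stream.
        events.append(f"wait_load:{layer_idx}")

    def layer_post_hook(self, layer_idx):
        # In the real flow this would start saving the layer's KV blocks.
        events.append(f"save:{layer_idx}")


class FakeLayer:
    def __init__(self, idx, manager):
        self.idx = idx
        self.manager = manager

    def forward(self, x):
        self.manager.layer_pre_hook(self.idx)   # registered as a pre-forward hook
        events.append(f"forward:{self.idx}")    # the attention computation itself
        self.manager.layer_post_hook(self.idx)  # registered as a post-forward hook
        return x


manager = FakeConnectorManager()
layers = [FakeLayer(i, manager) for i in range(2)]
x = object()
for layer in layers:
    x = layer.forward(x)

print(events)
# → ['wait_load:0', 'forward:0', 'save:0', 'wait_load:1', 'forward:1', 'save:1']
```

In the actual integration these callbacks would be attached via PyTorch's forward pre/post hook mechanism rather than called inline.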
921-926: Proper assertion relaxation for connector functionality. The modification correctly relaxes the batch size assertion when either the KV Cache Connector or transceiver is enabled, as both can suspend requests asynchronously. The comment clearly explains the reasoning.
968-976: Well-structured connector integration in executor loop. The connector integration in the main executor loop is properly placed after resource preparation but before the forward pass. The sequence of operations is logical:
- Remove requests pending async load
- Handle metadata
- Start KV loading
This placement ensures correct coordination between resource management and connector operations.
1009-1012: LGTM! Proper finished request handling for connector. The connector-specific finished request handling is well-integrated into the executor loop. The resource freeing is appropriately delegated to the connector manager, ensuring proper lifecycle management.
1363-1364: LGTM! Appropriate synchronization point for connector save operations. The `wait_for_save()` call is correctly placed after the forward pass to ensure all asynchronous save operations complete before returning the outputs.
1464-1466: Excellent resource management delegation pattern. The `_terminate_request` method correctly delegates resource freeing to the connector manager when present, avoiding double-freeing issues. The conditional logic ensures proper resource cleanup regardless of connector presence.
977-978: Batch size checks are intentional and consistent
Each `scheduled_batch.batch_size > 0` check occurs in a different execution context (queuing, assertion when the connector is disabled, scheduling with/without attention DP, post-transmission). They're not duplicated by mistake but enforce the correct behavior at each stage. No changes required.

tests/integration/defs/llmapi/test_llm_api_connector.py (7)
46-66: LGTM! Well-designed test fixture with proper mocking. The `model_with_connector` fixture properly mocks the importlib functionality to inject test doubles for the connector classes. The fixture yields both the model factory function and the mock objects for verification in tests.
68-70: LGTM! Appropriate environment variable for test constraints. Setting `TLLM_WORKER_USE_SINGLE_PROCESS=1` is necessary because MagicMocks don't work across process boundaries. The comment clearly explains the limitation and TODO for future improvement.
73-130: Comprehensive test coverage for basic connector functionality. The `test_connector_simple` test provides excellent coverage of the basic connector workflow:
- KV cache registration
- Forward pass coordination
- Layer-level load/save operations
- Request lifecycle management
The assertions verify correct call counts and argument values, ensuring the connector integrates properly with the executor loop.
132-161: LGTM! Good test for asynchronous onboarding scenario. The test properly simulates the asynchronous loading scenario by setting appropriate return values and verifies the correct handling of async loading requests.
163-202: LGTM! Thorough test for asynchronous saving operations. The `test_connector_async_save` test validates the async saving workflow, including proper request tracking through the saving pipeline and verification of finished request handling.
204-257: Excellent detailed verification of scheduler output. The `test_connector_scheduler_output` test provides comprehensive validation of the scheduler metadata generation, including:
- Token and block tracking across iterations
- Prefill vs generation phase handling
- Block allocation patterns
- Computed position tracking
The test also verifies the behavior when tokens are pre-matched, which is important for performance optimization.
259-306: LGTM! Comprehensive chunked context handling test. The `test_connector_scheduler_output_chunked_context` test validates the connector's handling of chunked prefill scenarios, ensuring proper token and block management across chunk boundaries. This is crucial for large context handling.

tensorrt_llm/_torch/pyexecutor/connector.py (12)
1-34: LGTM! Excellent module documentation and imports. The comprehensive module documentation clearly explains the purpose, responsibilities, and architecture of the KV Cache Connector. The import structure is well-organized and follows Python conventions.
37-61: LGTM! Well-designed data structures for request tracking. The `RequestData` and `SchedulerOutput` dataclasses provide clean interfaces for tracking request metadata. The `add_request` method on `SchedulerOutput` offers a convenient API for building scheduler metadata.
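A minimal sketch of what such dataclasses might look like — the field names here are assumptions based on the review's description, not the actual TensorRT-LLM definitions:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestData:
    """Per-request scheduling metadata (hypothetical fields)."""
    request_id: int
    new_tokens: List[int]       # tokens scheduled this iteration
    new_block_ids: List[int]    # KV blocks newly allocated for them
    computed_position: int      # position already computed in the KV cache


@dataclass
class SchedulerOutput:
    """Accumulates request metadata for one scheduling iteration."""
    requests: List[RequestData] = field(default_factory=list)

    def add_request(self, request_id, new_tokens, new_block_ids,
                    computed_position):
        self.requests.append(
            RequestData(request_id, new_tokens, new_block_ids,
                        computed_position))


out = SchedulerOutput()
out.add_request(0, [101, 102], [7], 0)
print(len(out.requests), out.requests[0].new_block_ids)
# → 1 [7]
```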
63-138: Excellent abstract worker interface design. The `KvCacheConnectorWorker` abstract base class provides a comprehensive interface for worker implementations:
- Clear separation of concerns between metadata binding and KV operations
- Well-documented abstract methods with clear responsibilities
- Proper lifecycle management (register → load → process → save → finish)
- Good error handling considerations in the `get_finished` method documentation

The interface design supports various connector implementations while maintaining consistency.
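The lifecycle described above can be sketched as an abstract base class. The method names follow the review's description, but the signatures are assumptions, and `NoOpWorker` is a hypothetical trivial implementation:

```python
from abc import ABC, abstractmethod


class KvCacheConnectorWorker(ABC):
    """Hypothetical shape of the worker-side connector interface."""

    @abstractmethod
    def register_kv_caches(self, kv_cache_tensor):
        """Bind the KV cache pool once at startup."""

    @abstractmethod
    def start_load_kv(self):
        """Kick off loading of externally cached KV blocks."""

    @abstractmethod
    def wait_for_layer_load(self, layer_idx):
        """Block until layer `layer_idx`'s KV blocks are loaded."""

    @abstractmethod
    def save_kv_layer(self, layer_idx):
        """Begin saving layer `layer_idx`'s KV blocks."""

    @abstractmethod
    def wait_for_save(self):
        """Block until all pending saves complete."""

    @abstractmethod
    def get_finished(self, finished_gen_req_ids):
        """Return (finished_saving_ids, finished_loading_ids)."""


class NoOpWorker(KvCacheConnectorWorker):
    """Trivial worker: every request finishes saving immediately."""
    def register_kv_caches(self, kv_cache_tensor): pass
    def start_load_kv(self): pass
    def wait_for_layer_load(self, layer_idx): pass
    def save_kv_layer(self, layer_idx): pass
    def wait_for_save(self): pass

    def get_finished(self, finished_gen_req_ids):
        return list(finished_gen_req_ids), []


w = NoOpWorker()
print(w.get_finished([3, 4]))
# → ([3, 4], [])
```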
140-186: LGTM! Well-structured scheduler interface. The `KvCacheConnectorScheduler` abstract base class provides a clean interface for orchestration logic:
- Clear separation between metadata building and token matching
- Proper async operation support
- Good documentation of return values and behavior expectations
The `get_num_new_matched_tokens` method design allows for flexible token matching strategies while supporting both sync and async loading.
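To make the contract concrete, here is a toy scheduler returning the `(num_matched_tokens, load_async)` pair the review describes. The class, the prefix store, and the threshold are all illustrative; the real TensorRT-LLM signature may differ:

```python
# Toy scheduler illustrating the (num_matched_tokens, load_async) contract.
class PrefixLookupScheduler:
    def __init__(self, store):
        # store maps a token prefix (tuple) -> number of externally
        # cached tokens available for that prefix (hypothetical backing).
        self.store = store

    def get_num_new_matched_tokens(self, prompt_tokens, num_computed_tokens):
        key = tuple(prompt_tokens[:4])
        matched = self.store.get(key, 0)
        # Only tokens beyond what is already computed locally count.
        extra = max(0, matched - num_computed_tokens)
        # Second element True means the blocks load asynchronously and the
        # request must be held back until loading completes.
        return extra, extra > 8


sched = PrefixLookupScheduler({(1, 2, 3, 4): 16})
print(sched.get_num_new_matched_tokens([1, 2, 3, 4, 5], 0))
# → (16, True)
```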
188-235: Excellent async request management design. The `AsyncRequests` dataclass provides a clean API for managing asynchronous saving and loading operations:
- Clear separation between saving and loading requests
- Efficient ID-based operations
- Proper state transfer methods (`add_from`, `extract_by_id`)
- Convenient query methods for getting ID sets
The implementation supports complex async request lifecycle management.
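A sketch of such a container, with the saving/loading split and the transfer methods named above (the field types and request payloads are assumptions for illustration):

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AsyncRequests:
    """Tracks requests with in-flight async KV operations (hypothetical)."""
    saving: Dict[int, str] = field(default_factory=dict)
    loading: Dict[int, str] = field(default_factory=dict)

    def add_from(self, other: "AsyncRequests") -> None:
        # Absorb another container's pending requests.
        self.saving.update(other.saving)
        self.loading.update(other.loading)

    def extract_by_id(self, req_ids: List[int]) -> "AsyncRequests":
        # Move the given request IDs out into a new container.
        out = AsyncRequests()
        for rid in req_ids:
            if rid in self.saving:
                out.saving[rid] = self.saving.pop(rid)
            if rid in self.loading:
                out.loading[rid] = self.loading.pop(rid)
        return out

    def loading_ids(self):
        return set(self.loading)


a = AsyncRequests(saving={1: "req1"}, loading={2: "req2"})
a.add_from(AsyncRequests(loading={3: "req3"}))
done = a.extract_by_id([2])
print(sorted(a.loading_ids()), sorted(done.loading_ids()))
# → [3] [2]
```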
237-279: LGTM! Proper MPI coordination architecture. The `KvCacheConnectorManager` initialization and `_run_on_leader` method demonstrate excellent distributed system design:
- Enforces single leader pattern (scheduler only on rank 0)
- Proper MPI broadcast pattern for coordinating operations
- Clean error handling with assertions
The architecture ensures consistent state across all ranks while centralizing orchestration logic.
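The leader-broadcast pattern boils down to: only rank 0 runs the scheduler callback, then the result is broadcast so every rank observes the same decision. A minimal sketch, where `FakeComm` stands in for an mpi4py-style communicator (real code would call `comm.bcast(value, root=0)`):

```python
# Sketch of the leader-broadcast pattern; FakeComm is a single-process
# stand-in for an MPI communicator, used here only for illustration.
class FakeComm:
    def __init__(self, rank, mailbox):
        self.rank = rank
        self.mailbox = mailbox  # shared "broadcast" buffer

    def bcast(self, value, root=0):
        if self.rank == root:
            self.mailbox.append(value)
        return self.mailbox[-1]


def run_on_leader(comm, scheduler_fn):
    # Only the leader (rank 0) executes the scheduler; everyone else
    # receives the broadcast result, so all ranks stay consistent.
    result = scheduler_fn() if comm.rank == 0 else None
    return comm.bcast(result, root=0)


mailbox = []
ranks = [FakeComm(r, mailbox) for r in range(2)]
results = [run_on_leader(ranks[0], lambda: 42),
           run_on_leader(ranks[1], lambda: 42)]
print(results)
# → [42, 42]
```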
281-299: Excellent token matching coordination with proper validation. The `get_num_new_matched_tokens` method properly coordinates token matching across ranks and handles async loading setup. The validation logic prevents invalid states (async loading with zero tokens), and the request state management is well-structured.

The TODO comment identifies a design challenge with request scheduling timing; this is good documentation of known limitations.
301-325: LGTM! Clean request filtering for async operations. The `take_scheduled_requests_pending_load` method properly removes async loading requests from the scheduled batch and updates their state appropriately. This prevents double-processing and ensures correct request lifecycle management.
326-333: LGTM! Proper metadata coordination and cleanup. The `handle_metadata` method correctly coordinates metadata building across ranks and binds it to the worker. The cleanup of scheduler output prevents memory leaks and ensures fresh state for each iteration.
334-354: Excellent request completion handling. The `request_finished` method properly coordinates async saving decisions across ranks and updates request state to prevent premature resource cleanup. The state management ensures resources remain allocated during async operations.
356-405: Sophisticated distributed coordination for async operations. The `get_finished` method implements excellent distributed coordination logic:
- Request State Management: Properly tracks requests through different async states
- Worker Coordination: Collects completion status from all workers
- Consensus Building: Uses MPI allgather to achieve consensus across all ranks
- State Transitions: Correctly transitions loading requests back to schedulable state
- Resource Management: Returns finished saving requests for proper cleanup
The implementation ensures that async operations are only considered complete when all workers agree, preventing race conditions in distributed environments.
407-418: LGTM! Clean hook implementations for layer coordination. The hook methods provide clean integration points for KV cache operations:
- `layer_pre_hook`: synchronizes KV loading before layer execution
- `layer_post_hook`: initiates KV saving after layer execution
- `model_post_hook`: ensures all saves complete before forward pass completion

The hook pattern integrates seamlessly with PyTorch's module system.
Force-pushed from b678d9b to 9cf5466
Actionable comments posted: 0
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
383-422: Address complex connector integration logic and line length violations. The nested conditionals and state management logic is difficult to follow, as correctly identified by the existing TODO comment. Additionally, static analysis flags line length violations on lines 386 and 388.
The refactoring suggested in the previous review remains valid. Additionally, please address the line length violations:
```diff
-        # TODO(jthomson04): This is begging for a mega refactor, and can likely be significantly simplified.
-        # In add sequence, the connector API's get_num_new_matched_tokens is called.
+        # TODO(jthomson04): This is begging for a mega refactor, and can likely be
+        # significantly simplified. In add sequence, the connector API's
+        # get_num_new_matched_tokens is called.
-        # The result of this call may be that blocks will be loaded asynchronously.
+        # The result of this call may be that blocks will be loaded
+        # asynchronously.
```

Consider extracting the complex conditional logic into helper methods as suggested in the previous review.
tests/integration/defs/llmapi/test_llm_api_connector.py (2)
27-27: Remove sys.path manipulation in test file. Modifying `sys.path` in test files is a code smell that can lead to import issues and makes tests fragile. Consider using proper module imports or test fixtures instead.

```diff
-sys.path.append(os.path.dirname(os.path.abspath(__file__)))
```
29-44: Remove unused global mock variables and function. The global `KvConnectorScheduler` and `KvConnectorWorker` variables and the `init_connector_classes` function are defined but never used. The actual mocking is done in the `connector` fixture.

```diff
-KvConnectorScheduler = MagicMock()
-KvConnectorWorker = MagicMock()
-
-def init_connector_classes():
-    KvConnectorScheduler.reset_mock()
-    KvConnectorWorker.reset_mock()
-
-    scheduler = MagicMock()
-    worker = MagicMock()
-
-    KvConnectorScheduler.return_value = scheduler
-    KvConnectorWorker.return_value = worker
-
-    return scheduler, worker
```
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
265-306: Comprehensive connector validation and initialization. The `_maybe_init_kv_connector_manager` method properly validates compatibility constraints and initializes the connector. The validation logic correctly prevents incompatible feature combinations and ensures uniform KV cache shapes across layers. However, there's a minor issue on line 291:

```diff
-        return ValueError(
+        raise ValueError(
```

tests/integration/defs/llmapi/test_llm_api_connector.py (1)
153-153: Fix line length violation. The line exceeds the 120-character limit specified in the static analysis hints.

```diff
-    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
+    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
+    "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
```

tensorrt_llm/_torch/pyexecutor/connector.py (1)
129-129: Address line length violations. Multiple lines exceed the 120-character limit. Consider breaking long lines for better readability. For example, line 129:

```diff
-    Args:
-        finished_gen_req_ids: The IDs of the requests that have finished generating tokens, and are now asynchronously saving.
+    Args:
+        finished_gen_req_ids: The IDs of the requests that have finished
+            generating tokens, and are now asynchronously saving.
```

Similar formatting should be applied to other long lines for consistency with the project's style guidelines.
Also applies to: 136-137, 185-185, 244-244, 295-295, 317-317, 343-343, 370-370
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (8 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (8 hunks)
- cpp/tensorrt_llm/pybind/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/runtime/torch.h (1 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (1 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (1 hunks)
- tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
- tests/unittest/_torch/test_connector.py (1 hunks)
✅ Files skipped from review due to trivial changes (5)
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- cpp/tensorrt_llm/pybind/CMakeLists.txt
- cpp/tensorrt_llm/pybind/bindings.cpp
- tensorrt_llm/models/modeling_utils.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
🚧 Files skipped from review as they are similar to previous changes (15)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/executor/worker.py
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.h
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/executor/executor.py
- cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h
- tests/unittest/_torch/test_connector.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic...
Files:
cpp/tensorrt_llm/runtime/torch.h
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/tensorrt_llm/runtime/torch.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/runtime/torch.h
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/connector.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tests/integration/defs/llmapi/test_llm_api_connector.py
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/connector.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tests/integration/defs/llmapi/test_llm_api_connector.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx} : do not use smart pointers that have been deprecated in c++11....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-01T07:34:42.734Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Do not use smart pointers that have been deprecated in C++11.
Applied to files:
cpp/tensorrt_llm/runtime/torch.h
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
cpp/tensorrt_llm/runtime/torch.h
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/connector.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
386-386: Line too long (146 > 120)
(E501)
388-388: Line too long (143 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/connector.py
129-129: Line too long (130 > 120)
(E501)
136-136: Line too long (152 > 120)
(E501)
137-137: Line too long (206 > 120)
(E501)
185-185: Line too long (176 > 120)
(E501)
244-244: Line too long (123 > 120)
(E501)
295-295: Line too long (129 > 120)
(E501)
317-317: Line too long (125 > 120)
(E501)
343-343: Line too long (144 > 120)
(E501)
370-370: Line too long (131 > 120)
(E501)
tests/integration/defs/llmapi/test_llm_api_connector.py
153-153: Line too long (133 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (37)
cpp/tensorrt_llm/runtime/torch.h (1)
44-51: LGTM! Improved error handling and memory safety. The deleter modification removes exception handling and replaces assertion with warning logging, which is safer for library boundaries. The conditional check ensures proper handling of deallocated memory scenarios while maintaining memory safety by always resetting the shared pointer.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (6)
18-18: LGTM! Proper import for KV cache connector components. The import statement correctly brings in the necessary connector components for the integration.
135-135: LGTM! Clean integration of connector manager parameter. The constructor properly accepts the optional `kv_connector_manager` parameter and stores it as an instance attribute, following good object-oriented design principles.
Also applies to: 152-153
364-367: LGTM! Proper scheduler output initialization. The creation of a `SchedulerOutput` object to accumulate scheduling information for the KV cache connector is well-implemented.
380-381: LGTM! Correct connector manager integration in sequence addition. The conditional passing of `kv_connector_manager` to `add_sequence` calls (excluding warmup batches) is implemented correctly for both code paths.
Also applies to: 393-394
423-441: LGTM! Proper token and block tracking for generation requests. The logic correctly tracks tokens and block IDs before and after token addition, computing delta block IDs for the scheduler output. The implementation handles the generation batch processing appropriately.
442-444: LGTM! Clean connector manager integration. The conditional check and scheduler output setting for the connector manager is implemented correctly, ensuring the connector is updated when present.
tensorrt_llm/_torch/pyexecutor/py_executor.py (10)
30-30: LGTM: Import added for connector integration. The `DecoderLayer` import is correctly added to support forward hook registration in the KV Cache Connector initialization.
32-32: LGTM: Connector manager import. The `KvCacheConnectorManager` import is properly added for the new connector functionality.
132-150: LGTM: Constructor parameter addition. The `kv_connector_manager` parameter is correctly added to the PyExecutor constructor with proper typing and default value.
258-261: LGTM: Connector manager initialization. The connector manager is properly stored and initialized with the dedicated method call.
932-937: Appropriate assertion relaxation for connector functionality. The batch size assertion is correctly relaxed when the KV Cache Connector is enabled, acknowledging that the connector can put requests to sleep similar to the transceiver.
979-987: Proper connector integration in execution loop. The KV Cache Connector integration follows the correct execution flow:
- Remove requests pending asynchronous load from scheduled batch
- Handle metadata building and broadcasting
- Start KV loading operations
This maintains proper isolation and orchestration of the connector operations.
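For readers following along, the three-step flow described above can be sketched in a few lines. This is a toy illustration, not the PR's implementation: `FakeConnectorManager` and its request dicts are stand-ins, and only the method names (`take_scheduled_requests_pending_load`, `handle_metadata`, `start_load_kv`) mirror the review text.

```python
class FakeConnectorManager:
    """Stand-in for the connector manager; records the call order."""

    def __init__(self):
        self.calls = []

    def take_scheduled_requests_pending_load(self, batch):
        self.calls.append("take")
        # Requests still loading asynchronously are removed from the batch.
        return [r for r in batch if not r.get("loading")]

    def handle_metadata(self):
        self.calls.append("metadata")

    def start_load_kv(self):
        self.calls.append("load")


def handle_kv_connector(manager, batch):
    """Run the three connector steps in order, if a manager is attached."""
    if manager is None:
        return batch
    batch = manager.take_scheduled_requests_pending_load(batch)
    manager.handle_metadata()
    manager.start_load_kv()
    return batch


mgr = FakeConnectorManager()
scheduled = handle_kv_connector(mgr, [{"id": 0}, {"id": 1, "loading": True}])
print(scheduled)  # [{'id': 0}]
print(mgr.calls)  # ['take', 'metadata', 'load']
```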
988-990: Conditional execution logic maintains performance. The conditional check ensures forward pass execution only when there are scheduled requests or attention DP is enabled, maintaining execution efficiency.
1375-1377: Proper synchronization for KV cache saving. The `wait_for_save()` call ensures all KV cache saving operations complete before proceeding, maintaining data consistency in the connector workflow.
1020-1024: Correct resource management delegation. The connector cleanup logic properly retrieves finished requests and delegates resource freeing to the connector when appropriate, maintaining proper resource lifecycle management.
1476-1478: Proper resource management delegation in termination. The `_terminate_request` method correctly delegates resource freeing to the KV Cache Connector when enabled, avoiding double-freeing of resources.
tests/integration/defs/llmapi/test_llm_api_connector.py (7)
46-66: Well-structured test fixture for connector mocking. The `model_with_connector` fixture properly mocks the connector components using importlib patching, which is a clean approach for testing the connector integration without requiring actual connector implementations.
68-71: Appropriate test limitation documentation. The comment clearly explains the limitation to single process testing due to MagicMock cross-process issues, and the environment variable setting is necessary for the current testing approach.
73-130: Comprehensive basic connector functionality test. The `test_connector_simple` test thoroughly validates the connector workflow including:
- KV cache registration
- Metadata building calls
- Load/save synchronization
- Request lifecycle management
The assertions verify the correct number of calls and proper sequencing of operations.
132-161: Effective asynchronous onboarding test. The test validates async loading behavior by setting up appropriate mock return values and verifying the correct handling of finished requests with asynchronous loading.
163-202: Thorough asynchronous saving test. The test validates async saving behavior by checking that finished requests are properly tracked and passed to the `get_finished` method on the final call.
204-257: Comprehensive scheduler output validation. The test validates that scheduler output contains correct metadata for both prefill and generation phases, including proper token and block ID tracking.
259-306: Appropriate chunked context testing. The test validates connector behavior with chunked prefill, ensuring that scheduler output correctly handles multi-chunk context processing.
tensorrt_llm/_torch/pyexecutor/connector.py (13)
1-36: Excellent interface documentation and design. The file header provides clear documentation of the KV Cache Connector's purpose, responsibilities, and architecture. The separation of scheduler and worker concerns is well-designed.
38-62: Well-structured data classes for request tracking. The `RequestData` and `SchedulerOutput` classes provide clean abstractions for tracking request metadata and scheduling information with appropriate helper methods.
64-139: Comprehensive worker interface definition. The `KvCacheConnectorWorker` abstract base class defines a clear contract for worker implementations with well-documented methods covering the full KV cache lifecycle.
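To make the shape of that contract concrete, here is a hedged skeleton of such a worker interface. The method names follow the review discussion, but the exact signatures (and the `NoOpWorker` subclass) are assumptions for illustration, not the PR's actual class.

```python
from abc import ABC, abstractmethod


class KvCacheConnectorWorkerSketch(ABC):
    """Illustrative worker-side connector contract (signatures assumed)."""

    @abstractmethod
    def register_kv_caches(self, kv_pool):
        """Record a handle to the KV cache pool once at startup."""

    @abstractmethod
    def start_load_kv(self):
        """Begin asynchronously onboarding KV blocks for scheduled requests."""

    @abstractmethod
    def wait_for_layer_load(self, layer_idx):
        """Block until the given layer's KV data has been loaded."""

    @abstractmethod
    def save_kv_layer(self, layer_idx):
        """Kick off saving of the given layer after its forward pass."""

    @abstractmethod
    def get_finished(self, finished_gen_req_ids, started_loading_req_ids):
        """Return (saved_ids, loaded_ids) whose async operations completed."""


class NoOpWorker(KvCacheConnectorWorkerSketch):
    """Trivial implementation that completes everything immediately."""

    def register_kv_caches(self, kv_pool):
        self.kv_pool = kv_pool

    def start_load_kv(self):
        pass

    def wait_for_layer_load(self, layer_idx):
        pass

    def save_kv_layer(self, layer_idx):
        pass

    def get_finished(self, finished_gen_req_ids, started_loading_req_ids):
        return list(finished_gen_req_ids), list(started_loading_req_ids)


worker = NoOpWorker()
print(worker.get_finished([1], [2]))  # ([1], [2])
```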
141-187: Complete scheduler interface with proper abstractions. The `KvCacheConnectorScheduler` interface provides appropriate abstractions for orchestration, with both required abstract methods and optional implementations for token matching.
189-236: Robust async request management. The `AsyncRequests` dataclass provides comprehensive methods for managing asynchronous saving and loading requests with proper state transfer and query capabilities.
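A toy version of such an async-request tracker might look like the following. The field and method names here are illustrative only and do not claim to match the PR's `AsyncRequests` dataclass.

```python
from dataclasses import dataclass, field


@dataclass
class AsyncRequestsSketch:
    """Tracks which request IDs are mid-save or mid-load (names assumed)."""

    saving: set = field(default_factory=set)
    loading: set = field(default_factory=set)

    def start_save(self, req_id):
        self.saving.add(req_id)

    def start_load(self, req_id):
        self.loading.add(req_id)

    def complete(self, req_id):
        """Drop a request from whichever async set it is in."""
        self.saving.discard(req_id)
        self.loading.discard(req_id)


tracker = AsyncRequestsSketch()
tracker.start_save(7)
tracker.start_load(9)
tracker.complete(7)
print(sorted(tracker.saving), sorted(tracker.loading))  # [] [9]
```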
238-270: Proper manager initialization with rank validation. The `KvCacheConnectorManager` constructor correctly enforces that the scheduler only exists on rank 0, ensuring proper distributed coordination.
271-280: Effective MPI coordination pattern. The `_run_on_leader` method provides a clean abstraction for executing functions on the leader rank and broadcasting results, which is essential for distributed coordination.
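The leader-then-broadcast pattern can be modeled without MPI: run the function only on rank 0, then broadcast its result so every rank sees the same value. The `ToyComm` class below is a stand-in for an MPI communicator; nothing here claims to match the PR's `_run_on_leader` internals.

```python
class ToyComm:
    """Minimal broadcast channel simulating an MPI communicator."""

    def __init__(self, rank, store):
        self.rank = rank
        self.store = store  # shared dict standing in for the network

    def bcast(self, value, root=0):
        if self.rank == root:
            self.store["payload"] = value
        return self.store["payload"]


def run_on_leader(comm, fn):
    """Execute fn only on rank 0, then broadcast its result to all ranks."""
    result = fn() if comm.rank == 0 else None
    return comm.bcast(result, root=0)


store = {}
ranks = [ToyComm(r, store) for r in range(3)]
# Rank 0 computes; ranks 1 and 2 receive the broadcast value.
results = [run_on_leader(c, lambda: 42) for c in ranks]
print(results)  # [42, 42, 42]
```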
282-300: Robust token matching with async handling. The `get_num_new_matched_tokens` method properly validates the async loading logic and correctly updates request state for asynchronous onboarding.
302-326: Correct request filtering for async operations. The `take_scheduled_requests_pending_load` method properly removes asynchronously loading requests from the scheduled batch and updates their state appropriately.
327-334: Clean metadata handling and distribution. The `handle_metadata` method properly coordinates metadata building on the leader and distributes it to all workers.
335-356: Proper request completion handling. The `request_finished` method correctly determines async saving behavior and updates request state, maintaining consistency with the async request management pattern.
357-406: Sophisticated distributed synchronization logic. The `get_finished` method implements complex but correct distributed coordination using MPI allgather to ensure all workers agree on completion status before proceeding. The state management for loading and saving requests is properly handled.
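The "act only once ALL workers agree" rule reduces to a set intersection over per-worker reports; the allgather merely collects those reports onto every rank. A hedged sketch of that agreement step (the helper name is made up):

```python
def globally_finished(per_worker_finished):
    """Return IDs that every worker has reported as finished.

    per_worker_finished: one iterable of request IDs per worker, as an
    MPI allgather would collect them onto each rank.
    """
    sets = [set(ids) for ids in per_worker_finished]
    done = set.intersection(*sets) if sets else set()
    return sorted(done)


# Worker 1 has not yet reported ID 1; worker 3 reported an extra ID 4.
print(globally_finished([[1, 2, 3], [2, 3], [3, 2, 4]]))  # [2, 3]
```

Only IDs present in every worker's report survive, which matches the documented guarantee that slow workers delay, rather than race, the runtime's cleanup.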
408-418: Clean hook implementations for layer coordination. The pre/post hook methods provide clean integration points for layer-level KV cache coordination, properly delegating to the worker implementation.
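The per-layer hook pattern is easy to picture with toy classes: a pre-hook waits for that layer's KV load before attention runs, and a post-hook triggers its save afterwards. The layer and worker below are illustrative stand-ins, not the PR's `DecoderLayer` hooks.

```python
class ToyWorker:
    """Records the load/save events the hooks would trigger."""

    def __init__(self):
        self.events = []

    def wait_for_layer_load(self, idx):
        self.events.append(("load", idx))

    def save_kv_layer(self, idx):
        self.events.append(("save", idx))


class ToyLayer:
    def __init__(self, idx, worker):
        self.idx, self.worker = idx, worker

    def forward(self, x):
        self.worker.wait_for_layer_load(self.idx)  # pre-hook
        y = x + 1                                  # stand-in for attention
        self.worker.save_kv_layer(self.idx)        # post-hook
        return y


worker = ToyWorker()
layers = [ToyLayer(i, worker) for i in range(2)]
x = 0
for layer in layers:
    x = layer.forward(x)
print(x)              # 2
print(worker.events)  # [('load', 0), ('save', 0), ('load', 1), ('save', 1)]
```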
Actionable comments posted: 1
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
364-442: Address line length violations and ongoing complexity. The connector integration logic remains complex as noted in previous reviews. Additionally, lines 385 and 387 exceed the 120-character limit and should be reformatted.
Apply this formatting fix for the line length violations:
-            # The result of this call may be that blocks will be loaded asynchronously.
-            # When that happens, the request will go through this same code path, but with is_kv_cache_connector_async_onboard set to True.
+            # The result of this call may be that blocks will be loaded
+            # asynchronously.
+            # When that happens, the request will go through this same code path,
+            # but with is_kv_cache_connector_async_onboard set to True.
tests/integration/defs/llmapi/test_llm_api_connector.py (2)
27-27: Remove sys.path manipulation in test file. Modifying `sys.path` in test files is a code smell that can lead to import issues and makes tests fragile. Consider using proper module imports or test fixtures instead.
-sys.path.append(os.path.dirname(os.path.abspath(__file__)))
29-44: Remove unused global mock variables and function. The global `KvConnectorScheduler` and `KvConnectorWorker` variables and the `init_connector_classes` function are defined but never used. The actual mocking is done in the `connector` fixture.
-KvConnectorScheduler = MagicMock()
-KvConnectorWorker = MagicMock()
-
-
-def init_connector_classes():
-    KvConnectorScheduler.reset_mock()
-    KvConnectorWorker.reset_mock()
-
-    scheduler = MagicMock()
-    worker = MagicMock()
-
-    KvConnectorScheduler.return_value = scheduler
-    KvConnectorWorker.return_value = worker
-
-    return scheduler, worker
-
-
🧹 Nitpick comments (2)
tests/integration/defs/llmapi/test_llm_api_connector.py (2)
162-163: Fix line length violations. The long string literal exceeds the 120-character limit. Consider breaking it into multiple lines or using a shorter test string.
-    model.generate([
-        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
-    ], SamplingParams(max_tokens=NUM_TOKENS, ignore_eos=True))
+    model.generate([
+        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
+        "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
+    ], SamplingParams(max_tokens=NUM_TOKENS, ignore_eos=True))
202-214: Fix line length violation in comment. The comment on line 202 exceeds the 120-character limit.
-    # On the last call to get_finished, we should be providing the async saving request. One extra token when using the overlap scheduler.
+    # On the last call to get_finished, we should be providing the async saving request.
+    # One extra token when using the overlap scheduler.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
tensorrt_llm/_torch/pyexecutor/_util.py (7 hunks)
tensorrt_llm/_torch/pyexecutor/py_executor.py (11 hunks)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (8 hunks)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (5 hunks)
tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/_torch/pyexecutor/_util.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.pytensorrt_llm/_torch/pyexecutor/resource_manager.pytests/integration/defs/llmapi/test_llm_api_connector.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.pytensorrt_llm/_torch/pyexecutor/resource_manager.pytests/integration/defs/llmapi/test_llm_api_connector.py
🧠 Learnings (2)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
385-385: Line too long (146 > 120)
(E501)
387-387: Line too long (143 > 120)
(E501)
tests/integration/defs/llmapi/test_llm_api_connector.py
162-162: Line too long (133 > 120)
(E501)
202-202: Line too long (138 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (23)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
18-18: LGTM! Import is correctly added to support the new KV cache connector functionality.
135-135: LGTM! The optional parameter is properly typed and maintains backward compatibility.
152-152: LGTM! Standard pattern for storing the connector manager as an instance variable.
tensorrt_llm/_torch/pyexecutor/py_executor.py (13)
30-32: LGTM! New imports for KV connector integration. The imports for `DecoderLayer` and `KvCacheConnectorManager` are correctly added to support the KV cache connector functionality.
132-150: Constructor parameter addition looks correct. The new `kv_connector_manager` parameter is properly added to the constructor with appropriate type hint and default value.
258-261: Good initialization of KV connector manager. The connector manager is properly stored as an instance variable and initialization is delegated to a dedicated method, following good separation of concerns.
927-933: Appropriate relaxation of batch size assertion. The logic correctly relaxes the assertion when the KV connector manager is enabled, as the connector can put requests to sleep similar to the transceiver. The comment clearly explains the rationale.
946-952: Well-structured KV connector execution method. The method properly orchestrates the connector workflow:
- Takes scheduled requests pending load
- Handles metadata communication
- Starts KV loading
The method follows the established pattern of conditional execution based on manager availability.
953-958: Proper cleanup of finished async save requests. The method correctly retrieves finished requests from the connector and delegates resource cleanup to the resource manager, maintaining proper separation of responsibilities.
987-990: Strategic placement of KV connector execution. The connector execution is properly placed after resource preparation but before the forward step, allowing the connector to modify the batch as needed.
991-993: Appropriate restructuring of conditional logic. The conditional logic is correctly restructured to accommodate the KV connector workflow while maintaining the existing execution flow.
1024-1025: Proper integration of async save request termination. The call to terminate async save requests is correctly placed after response handling, ensuring proper cleanup of connector-managed resources.
1092-1095: Consistent integration pattern in overlap scheduler. The KV connector execution and conditional restructuring follow the same pattern as in the non-overlap executor loop, maintaining consistency across execution paths.
1148-1149: Proper cleanup integration in overlap scheduler. The async save request termination is consistently integrated in the overlap scheduler execution path.
1380-1383: Critical synchronization point for KV connector. The `wait_for_save()` call after the forward step is crucial for ensuring KV cache consistency. This synchronization point prevents race conditions where the next iteration might start before the current iteration's KV cache operations complete.
1482-1485: Smart resource management delegation. The conditional logic properly delegates resource cleanup to the connector when a request is finished according to the connector's state. This prevents double-cleanup and ensures the connector can manage resource lifecycle appropriately.
tests/integration/defs/llmapi/test_llm_api_connector.py (7)
46-66: Well-designed test fixture for connector mocking. The fixture properly mocks the connector components using importlib patching, which is appropriate for testing the integration without requiring actual connector implementations. The use of function scope ensures clean test isolation.
68-71: Appropriate environment configuration for test constraints. The environment variable setting and comment clearly explain the limitation to single-process testing due to MagicMock serialization constraints. This is a reasonable testing approach for the current implementation.
73-138: Comprehensive basic connector functionality test. The test effectively validates:
- KV cache registration during model initialization
- Correct call counts for connector methods during token generation
- Layer-wise KV operations (load/save per layer per forward pass)
- Request lifecycle management
- Scheduler-worker coordination
The test coverage is thorough and validates the core connector integration points.
140-171: Good test coverage for async onboarding scenario. The test properly simulates the async onboarding workflow where the scheduler returns matched tokens and sets the async flag. The verification of call counts and request ID propagation is appropriate.
173-214: Effective async save workflow validation. The test correctly validates the async save scenario where the scheduler marks a request as finished and the connector handles it asynchronously. The verification of request ID propagation through the call chain is well done.
216-274: Thorough scheduler output validation. This test provides excellent coverage of the scheduler output structure, validating:
- Token counts and block allocation during prefill vs generation
- Block allocation logic based on position and block size
- Computed position tracking
- Impact of matched tokens on computed position
The mathematical assertions for block allocation are particularly valuable.
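The block-allocation arithmetic those assertions check boils down to rounding the token count up to the block size. A hedged example (the function name and the sample values are made up; 32 tokens per block is just an illustration):

```python
def blocks_needed(num_tokens, tokens_per_block):
    """KV blocks required for a request: token count, rounded up."""
    return -(-num_tokens // tokens_per_block)  # ceiling division

# 100 tokens at 32 tokens/block need 4 blocks; 64 need exactly 2.
print(blocks_needed(100, 32))  # 4
print(blocks_needed(64, 32))   # 2
```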
276-325: Good coverage of chunked prefill scenario. The test effectively validates the chunked prefill workflow, ensuring that:
- All tokens and blocks are provided upfront in the first chunk
- Subsequent chunks have correct computed positions but no new tokens/blocks
- Generation phase follows normal single-token pattern
This covers an important optimization scenario for large context processing.
Actionable comments posted: 2
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
265-301: Fix the ValueError bug on line 286. The implementation correctly validates configurations and sets up the KV connector, but there's a bug on line 286 where `return ValueError` should be `raise ValueError` (returning the exception object silently skips the error path).
-        return ValueError(
+        raise ValueError(
🧹 Nitpick comments (2)
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
143-143: Consider breaking long lines for better readability. Lines 143 and 183 exceed the 120-character limit. While long test strings are sometimes acceptable, consider breaking them for consistency:
-    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
+    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
+    "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
-    # On the last call to get_finished, we should be providing the async saving request. One extra token when using the overlap scheduler.
+    # On the last call to get_finished, we should be providing the async saving request.
+    # One extra token when using the overlap scheduler.
Also applies to: 183-183
tensorrt_llm/_torch/pyexecutor/connector.py (1)
129-129: Fix line length violations flagged by static analysis. Several lines exceed the 120-character limit. Please wrap these lines appropriately.
Apply these fixes for line length compliance:
-        finished_gen_req_ids: The IDs of the requests that have finished generating tokens, and are now asynchronously saving.
+        finished_gen_req_ids: The IDs of the requests that have finished
+            generating tokens, and are now asynchronously saving.
-        Note: IDs may only be returned from this call after they've been provided in the `finished_gen_req_ids` and `started_loading_req_ids` arguments.
-        Additionally, the runtime will only take action based on these returned IDs once they've been returned by ALL workers. This allows some workers to take longer than others to complete the operations.
+        Note: IDs may only be returned from this call after they've been provided
+        in the `finished_gen_req_ids` and `started_loading_req_ids` arguments.
+        Additionally, the runtime will only take action based on these returned IDs
+        once they've been returned by ALL workers. This allows some workers to take
+        longer than others to complete the operations.
-        If true, this indicates that the kv cache manager should wait to deallocate the blocks until the saving has completed (determined by `get_finished` on the workers).
+        If true, this indicates that the kv cache manager should wait to
+        deallocate the blocks until the saving has completed (determined by
+        `get_finished` on the workers).
-    1. Managing the state of async requests (both offload and onboard)
-    2. Handling MPI communication. We only run the leader on one rank, but need the results of the leader API on all ranks.
+    1. Managing the state of async requests (both offload and onboard)
+    2. Handling MPI communication. We only run the leader on one rank, but need
+       the results of the leader API on all ranks.
-        # When the connector indicates that a request will be loaded asynchronously, we need to suspend it's execution.
+        # When the connector indicates that a request will be loaded asynchronously,
+        # we need to suspend it's execution.
-            # If this request is being loaded asynchronously, in addition to removing it from the list of scheduled requests,
+            # If this request is being loaded asynchronously, in addition to removing
+            # it from the list of scheduled requests,
-            Whether the request is performing asynchronous saving operations. If true, we do not immediately call free_resources on the request.
+            Whether the request is performing asynchronous saving operations. If true,
+            we do not immediately call free_resources on the request.
-        # Pass these newly finished requests into get_finished, and get the list of requests that have finished saving and loading.
+        # Pass these newly finished requests into get_finished, and get the list of
+        # requests that have finished saving and loading.
Also applies to: 136-137, 185-185, 244-244, 295-295, 317-317, 343-343, 370-370
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp (1 hunks)
tensorrt_llm/_torch/pyexecutor/connector.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/py_executor.py (11 hunks)
tests/integration/defs/llmapi/test_llm_api_connector.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheConnector.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.pytests/integration/defs/llmapi/test_llm_api_connector.pytensorrt_llm/_torch/pyexecutor/connector.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.pytests/integration/defs/llmapi/test_llm_api_connector.pytensorrt_llm/_torch/pyexecutor/connector.py
🧠 Learnings (2)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor.pytensorrt_llm/_torch/pyexecutor/connector.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
🪛 Ruff (0.12.2)
tests/integration/defs/llmapi/test_llm_api_connector.py
143-143: Line too long (133 > 120)
(E501)
183-183: Line too long (138 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/connector.py
129-129: Line too long (130 > 120)
(E501)
136-136: Line too long (152 > 120)
(E501)
137-137: Line too long (206 > 120)
(E501)
185-185: Line too long (176 > 120)
(E501)
244-244: Line too long (123 > 120)
(E501)
295-295: Line too long (129 > 120)
(E501)
317-317: Line too long (125 > 120)
(E501)
343-343: Line too long (144 > 120)
(E501)
370-370: Line too long (131 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (24)
tensorrt_llm/_torch/pyexecutor/py_executor.py (7)
30-30: LGTM! Appropriate imports for KV connector functionality. The imports of `DecoderLayer` and `KvCacheConnectorManager` are necessary for the KV cache connector integration being added. Also applies to: 32-32
132-151: LGTM! Proper KV connector manager integration in constructor. The constructor changes correctly:
- Add an optional `kv_connector_manager` parameter
- Store it as an instance variable
- Call the initialization method to set up the connector

Also applies to: 258-261
927-932: LGTM! Correct handling of KV connector in batch preparation. The logic correctly excludes the batch size assertion when the KV connector manager is active, similar to the existing KV cache transceiver logic. The comment clearly explains why this is necessary.
946-958: LGTM! Well-structured KV connector handler methods. Both methods provide clean abstractions for KV connector operations:
- `_handle_kv_connector` properly sequences request handling, metadata processing, and KV loading
- `_terminate_async_save_requests` ensures proper resource cleanup for finished requests
- Both include appropriate null checks
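For illustration, the sequencing these two methods enforce can be sketched in plain Python. The names mirror the review's description; the class, the `call_order` list, and the simplified signatures are hypothetical, not the actual PyExecutor internals (which also deal with scheduler output and CUDA streams):

```python
class ConnectorAwareExecutor:
    """Hypothetical sketch of the handler sequencing described above."""

    def __init__(self, kv_connector_manager=None):
        self.kv_connector_manager = kv_connector_manager
        self.call_order = []  # recorded only for illustration

    def _handle_kv_connector(self, scheduled_batch):
        # Null check: the connector is optional.
        if self.kv_connector_manager is None:
            return
        # 1) Remove requests whose KV load is still pending from the batch.
        self.kv_connector_manager.take_scheduled_requests_pending_load(scheduled_batch)
        self.call_order.append("take_pending")
        # 2) Build/broadcast connector metadata for this iteration.
        self.kv_connector_manager.handle_metadata()
        self.call_order.append("metadata")
        # 3) Kick off asynchronous KV loads.
        self.kv_connector_manager.start_load_kv()
        self.call_order.append("load")

    def _terminate_async_save_requests(self):
        if self.kv_connector_manager is None:
            return
        # Free resources for requests whose async saves have completed.
        for req in self.kv_connector_manager.get_finished():
            self.call_order.append(f"free:{req}")
```

Both entry points degrade to no-ops when no connector is configured, which is what keeps the integration optional.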
988-993: LGTM! Proper integration of KV connector into executor loops. The KV connector is correctly integrated into all executor loop variants:
- `_handle_kv_connector` is called after resource preparation but before the forward step
- `_terminate_async_save_requests` is called after response handling for proper cleanup
- Integration is consistent across both standard and overlap scheduler loops
Also applies to: 1092-1095, 1024-1024, 1148-1148
1380-1383: LGTM! Proper synchronization of async save operations. The addition of `wait_for_save()` after the model forward ensures that asynchronous KV cache save operations complete before proceeding. The null check is appropriate.
1482-1484: LGTM! Appropriate delegation of resource management to KV connector. The conditional logic correctly delegates resource management to the KV connector when it reports the request as finished, avoiding duplicate resource cleanup.
tests/integration/defs/llmapi/test_llm_api_connector.py (5)
1-26: LGTM! Proper test file structure with required copyright header. The file follows appropriate testing conventions and includes the required NVIDIA copyright header as specified in the coding guidelines.
27-51: LGTM! Comprehensive test fixture setup for connector mocking. The fixture properly mocks the connector classes and handles the MagicMock cross-process limitation by setting `TLLM_WORKER_USE_SINGLE_PROCESS=1`. The mock setup is well-structured and provides the necessary interfaces for testing.
54-119: LGTM! Comprehensive test coverage of basic connector workflow. This test excellently validates the core connector functionality:
- Verifies proper registration and initialization
- Tests both scheduler variants (overlap/non-overlap)
- Validates correct call sequences and counts for all connector operations
- Ensures proper layer-wise processing through the forward passes
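The layer-wise processing the test validates (a pre-hook that waits for each layer's KV load before its forward, a post-hook that kicks off the save afterwards) can be sketched as follows. The `LayerHookRunner` class and the bare `layer_idx`-only signatures are illustrative assumptions; the real hooks also receive a CUDA stream:

```python
class LayerHookRunner:
    """Sketch of per-layer connector hooks around each layer's forward."""

    def __init__(self, worker, num_layers):
        self.worker = worker
        self.num_layers = num_layers

    def run_forward(self, forward_one_layer):
        for layer_idx in range(self.num_layers):
            # Pre-hook: block until this layer's KV data has been loaded.
            self.worker.wait_for_layer_load(layer_idx)
            forward_one_layer(layer_idx)
            # Post-hook: kick off saving this layer's freshly written KV data.
            self.worker.save_kv_layer(layer_idx)
```

This interleaving is what lets loads and saves overlap with compute on other layers instead of serializing with the whole forward pass.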
121-195: LGTM! Thorough testing of asynchronous connector operations. Both async tests provide excellent coverage:
- `test_connector_async_onboard` validates async loading with matched tokens
- `test_connector_async_save` verifies async saving request handling
- Both tests properly simulate connector behavior and validate the expected interaction patterns
197-306: LGTM! Detailed validation of scheduler output generation. Both tests provide thorough coverage of scheduler behavior:
- Token and block ID calculations are properly validated
- Block boundary conditions are tested
- Chunked prefill scenario is covered with correct position tracking
- Mathematical relationships between tokens, blocks, and positions are verified
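The token-to-block relationships these tests check reduce to ceil-division by the block size. A minimal sketch under assumed names (`blocks_needed` and `block_boundaries` are illustrative helpers, not functions from the PR):

```python
import math

def blocks_needed(num_tokens: int, tokens_per_block: int) -> int:
    """Number of KV cache blocks required to hold num_tokens tokens."""
    return math.ceil(num_tokens / tokens_per_block)

def block_boundaries(num_tokens: int, tokens_per_block: int) -> list:
    """Token positions at which a new block starts."""
    return list(range(0, num_tokens, tokens_per_block))
```

The boundary condition the tests exercise is exactly the off-by-one at a full block: 32 tokens with a block size of 32 need one block, 33 need two.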
tensorrt_llm/_torch/pyexecutor/connector.py (12)
1-14: LGTM! The import structure follows Python coding guidelines by maintaining namespaces and using appropriate import patterns.
38-62: LGTM! The dataclasses follow Python coding guidelines with proper snake_case naming, appropriate type hints, and correct use of `field(default_factory=list)` to avoid mutable default arguments.
64-139: LGTM! The abstract base class is well-designed with:
- Proper inheritance from ABC
- Clear method documentation using docstrings
- Appropriate snake_case naming conventions
- Clean metadata binding interface
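A condensed sketch of what such a worker-side ABC looks like. Method names are taken from the operations mentioned in this review; the simplified signatures (no streams, no tensors typed) and the `_metadata` attribute are assumptions for illustration:

```python
from abc import ABC, abstractmethod

class KvCacheConnectorWorker(ABC):
    """Simplified sketch of a worker-side connector interface."""

    def __init__(self):
        self._metadata = None

    def bind_connector_meta(self, metadata):
        # Metadata produced by the scheduler is bound before each step.
        self._metadata = metadata

    @abstractmethod
    def register_kv_caches(self, kv_cache_tensor):
        """Called once with the primary KV pool tensor."""

    @abstractmethod
    def start_load_kv(self):
        """Begin (possibly async) loading of externally cached KV blocks."""

    @abstractmethod
    def wait_for_layer_load(self, layer_idx: int):
        """Block until layer layer_idx's KV data is ready (pre-forward hook)."""

    @abstractmethod
    def save_kv_layer(self, layer_idx: int):
        """Kick off saving of layer layer_idx's KV data (post-forward hook)."""
```

Inheriting from `ABC` with `@abstractmethod` means instantiating the base class raises `TypeError`, so an incomplete connector implementation fails loudly at construction rather than mid-inference.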
141-187: LGTM! The scheduler abstract base class follows the same good design patterns as the worker, with clear documentation and appropriate method signatures.
189-236: LGTM! The `AsyncRequests` class provides a clean interface for managing asynchronous request state with clear documentation of the destructive nature of the transfer operations.
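The "destructive transfer" idea can be sketched as follows. This is an assumed shape — the field names (`saving`/`loading`) and the single `take_and_clear` method are illustrative, not the actual class:

```python
class AsyncRequests:
    """Sketch of destructive-transfer bookkeeping for in-flight requests.

    Assumption: the real class tracks requests with pending async saves and
    pending async loads separately.
    """

    def __init__(self, saving=None, loading=None):
        self.saving = list(saving or [])
        self.loading = list(loading or [])

    def take_and_clear(self):
        """Destructively move out all tracked requests.

        After this call the container is empty; the caller owns the result,
        so the same request can never be handed out twice.
        """
        taken = AsyncRequests(self.saving, self.loading)
        self.saving.clear()
        self.loading.clear()
        return taken
```

Making the transfer destructive is what turns the containers into a state machine: each request lives in exactly one stage at a time.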
238-270: LGTM! The constructor properly enforces the constraint that the scheduler only exists on rank 0, which is essential for correct MPI operation. The state management design with separate AsyncRequests objects creates a clear state machine for async operations.
271-280: LGTM! Excellent pattern for MPI coordination that ensures scheduler operations only run on the leader rank while broadcasting results to all ranks.
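The leader-run-and-broadcast pattern looks roughly like this. The helper name and the injected `broadcast` callable are assumptions; with real MPI this would be something like mpi4py's `comm.bcast`, where non-root ranks pass `None`:

```python
def run_on_leader(rank: int, broadcast, fn, *args):
    """Run fn only on rank 0 and broadcast its result to all ranks.

    `broadcast(value, root)` is a stand-in for an MPI broadcast; every rank
    (leader and followers) returns the leader's result.
    """
    result = fn(*args) if rank == 0 else None
    return broadcast(result, root=0)
```

The point is that only the leader pays the cost of (and has the state for) the scheduler call, while every rank ends up with an identical view of the decision.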
282-300: Approve with acknowledged design limitation. The implementation correctly handles MPI coordination and includes important validation. The TODO comment accurately identifies a design challenge where requests must be removed from scheduling after being scheduled; this is a known limitation of the current architecture rather than an error.
327-334: LGTM! The method correctly uses the MPI coordination pattern and properly manages the scheduler output lifecycle. The method name appropriately reflects the bind operation as noted in previous reviews.
335-355: LGTM! The method correctly coordinates with the scheduler and properly manages request state to prevent premature deallocation during asynchronous saving operations.
357-406: LGTM! Complex but well-implemented state machine that correctly coordinates asynchronous operations across all MPI ranks. The use of set intersection ensures consensus before considering requests globally finished, and state transitions are handled appropriately.
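The consensus step described here is a plain set intersection over per-rank reports; a minimal sketch (the function name and the "already gathered to one place" input shape are assumptions):

```python
def globally_finished(per_rank_finished):
    """Requests finished on *every* rank (set-intersection consensus).

    per_rank_finished: iterable of sets, one per rank, of locally finished
    request ids (e.g. as gathered to rank 0). A request only counts as
    globally finished once all ranks report it.
    """
    per_rank_finished = list(per_rank_finished)
    if not per_rank_finished:
        return set()
    result = set(per_rank_finished[0])
    for finished in per_rank_finished[1:]:
        result &= set(finished)
    return result
```

Intersection (rather than union) is the safe choice: a request whose async save is still in flight on even one rank must not have its blocks freed anywhere.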
408-416: LGTM! The remaining methods provide clean interfaces for scheduler output management and layer-level hooks that properly delegate to the worker implementation.
Signed-off-by: jthomson04 <[email protected]>
ceb86c8 to 209052a
Signed-off-by: jthomson04 <[email protected]>
Signed-off-by: jthomson04 <[email protected]>
Actionable comments posted: 1
♻️ Duplicate comments (2)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)
518-520: Clarify and document the single-window constraint for KV Cache Connector. Improve the error message to state what was detected and what users can do next. Also ensure the API docs/header call this out explicitly.
```diff
-    TLLM_CHECK_WITH_INFO(kvCacheConnectorManager == nullptr || uniqueWindowSizeToLayers.size() == 1,
-        "KV Cache Connector is not supported with multiple window sizes");
+    TLLM_CHECK_WITH_INFO(
+        kvCacheConnectorManager == nullptr || uniqueWindowSizeToLayers.size() == 1,
+        "KV Cache Connector is only supported when the model has a single max attention window; found %zu. "
+        "If you need multiple window sizes, disable the connector or configure a single window size.",
+        uniqueWindowSizeToLayers.size());
```
1199-1210: Clamp prepopulatedPromptLen (+ connector tokens) to context capacity; fix logging format; update reuse stats. As written, prepopulatedPromptLen + numConnectorMatchedTokens can exceed the capacity represented by numContextBlocks × tokensPerBlock, risking inconsistency in downstream consumers and buffer indexing. Clamp before setting it. Also fix the RequestId printf specifier and update reuse stats to include connector-matched tokens.

```diff
-    SizeType32 numConnectorMatchedTokens = 0;
-
-    // If we're using a KV cache connector, check if any additional blocks can be loaded.
-    if (mKvCacheConnectorManager && !llmRequest.isDummyRequest())
-    {
-        numConnectorMatchedTokens = mKvCacheConnectorManager->getNumNewMatchedTokens(llmRequest, prepopulatedPromptLen);
-    }
-
-    llmRequest.setPrepopulatedPromptLen(prepopulatedPromptLen + numConnectorMatchedTokens, getTokensPerBlock());
-    TLLM_LOG_DEBUG("addSequence: Request %lu, inputLength %d, prepopulatedPromptLen %d, numConnectorMatchedTokens %d",
-        llmRequest.mRequestId, inputLength, prepopulatedPromptLen, numConnectorMatchedTokens);
+    SizeType32 numConnectorMatchedTokens = 0;
+    // If we're using a KV cache connector, check if any additional tokens can be considered matched.
+    if (mKvCacheConnectorManager && !llmRequest.isDummyRequest())
+    {
+        numConnectorMatchedTokens = mKvCacheConnectorManager->getNumNewMatchedTokens(llmRequest, prepopulatedPromptLen);
+    }
+
+    // Ensure we never exceed the allocated context capacity.
+    SizeType32 const maxPrepopulatedTokens = numContextBlocks * getTokensPerBlock();
+    SizeType32 const requestedPrepopulated = prepopulatedPromptLen + numConnectorMatchedTokens;
+    SizeType32 const clampedPrepopulated = std::min(requestedPrepopulated, maxPrepopulatedTokens);
+    if (requestedPrepopulated != clampedPrepopulated)
+    {
+        TLLM_LOG_WARNING(
+            "Clamping prepopulated prompt length for request %" PRIu64 " from %d to %d (capacity=%d tokens across %d blocks)",
+            static_cast<uint64_t>(llmRequest.mRequestId), requestedPrepopulated, clampedPrepopulated,
+            maxPrepopulatedTokens, numContextBlocks);
+    }
+    // Account connector-matched tokens in reuse stats (post-clamp).
+    mReusedTokens += static_cast<double>(clampedPrepopulated - prepopulatedPromptLen);
+
+    llmRequest.setPrepopulatedPromptLen(clampedPrepopulated, getTokensPerBlock());
+    TLLM_LOG_DEBUG(
+        "addSequence: Request %" PRIu64 ", inputLength %d, prepopulatedPromptLen %d, numConnectorMatchedTokens %d",
+        static_cast<uint64_t>(llmRequest.mRequestId), inputLength, prepopulatedPromptLen, numConnectorMatchedTokens);
```
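The clamp arithmetic in the suggestion above is small enough to state directly. A Python mirror of it (the function name is illustrative, not code from the PR):

```python
def clamp_prepopulated_len(prepopulated_len: int,
                           connector_matched: int,
                           num_context_blocks: int,
                           tokens_per_block: int) -> int:
    """Clamp (prepopulated + connector-matched) tokens to block capacity.

    The allocated context capacity is num_context_blocks * tokens_per_block;
    the combined prepopulated length must never exceed it.
    """
    capacity = num_context_blocks * tokens_per_block
    return min(prepopulated_len + connector_matched, capacity)
```

For example, with 3 context blocks of 32 tokens (capacity 96), a reuse hit of 60 tokens plus 40 connector-matched tokens must be clamped from 100 down to 96.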
🧹 Nitpick comments (5)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
2410-2416: getUniquePrimaryPool guard is fine; consider mirroring improved error text. Method correctly guards single-window usage. Optionally align the error message wording with the constructor's check for consistency.

```diff
-    TLLM_CHECK_WITH_INFO(mBlockManager.getWindowSizesMetadata().size() == 1,
-        "getUniquePrimaryPool is only supported for a single window size");
+    TLLM_CHECK_WITH_INFO(
+        mBlockManager.getWindowSizesMetadata().size() == 1,
+        "getUniquePrimaryPool is only supported when the model has a single max attention window; found %zu.",
+        mBlockManager.getWindowSizesMetadata().size());
```

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)
- TLLM_CHECK_WITH_INFO(mBlockManager.getWindowSizesMetadata().size() == 1, - "getUniquePrimaryPool is only supported for a single window size"); + TLLM_CHECK_WITH_INFO( + mBlockManager.getWindowSizesMetadata().size() == 1, + "getUniquePrimaryPool is only supported when the model has a single max attention window; found %zu.", + mBlockManager.getWindowSizesMetadata().size());cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)
19-19: Prefer forward-declaring KvCacheConnectorManager in headers to reduce coupling and compile time. You only need the type for pointer/reference members and params here. Replacing the include with a forward declaration avoids pulling in more dependencies from this high-traffic header.
Apply this diff:
```diff
-#include "tensorrt_llm/batch_manager/kvCacheConnector.h"
+// Forward declare to avoid heavy header inclusion; include in the .cpp where it's used.
+namespace tensorrt_llm::batch_manager::kv_connector { class KvCacheConnectorManager; }
```

Additionally, ensure the corresponding .cpp includes the concrete header:

```cpp
#include "tensorrt_llm/batch_manager/kvCacheConnector.h"
```
838-840: Storing the connector manager in WindowBlockManager is appropriate. This centralizes connector usage at the per-window granularity, aligning with prior discussion. Consider documenting thread-safety expectations if the connector will be invoked from concurrent paths.
1293-1293: Add a brief contract note for getUniquePrimaryPool(). The .cpp enforces "single window size only." Capture that precondition in the header for discoverability.

```diff
-    [[nodiscard]] virtual runtime::ITensor::SharedPtr getUniquePrimaryPool() const = 0;
+    //! Return the primary pool tensor when exactly one attention window size is configured.
+    //! Precondition: getBlockManager().getWindowSizesMetadata().size() == 1.
+    [[nodiscard]] virtual runtime::ITensor::SharedPtr getUniquePrimaryPool() const = 0;
```
1391-1393: Fix `copyOnPartialReuse` typo and add missing connector parameter for API parity. Keep the parameter name consistent across all four `KVCacheManager` constructors in cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h and ensure the final overload with `int64_t stream` also accepts a connector:

- Rename every occurrence of `bool copyOnpartialReuse = true` to `bool copyOnPartialReuse = true`.
- In the last overload (around line 1405), extend the signature to include `std::shared_ptr<kv_connector::KvCacheConnectorManager> kvCacheConnectorManager = nullptr` after `copyOnPartialReuse`:

```diff
--- a/cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
+++ b/cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
@@ -1403,7 +1403,8 @@
         std::optional<TempAttentionWindowInputs> const& tempAttentionWindowInputs, nvinfer1::DataType dtype,
         SizeType32 sinkTokenLength, int64_t stream, std::optional<SizeType32> maxSequenceLength,
-        bool enablePartialReuse = true, bool copyOnpartialReuse = true);
+        bool enablePartialReuse = true, bool copyOnPartialReuse = true,
+        std::shared_ptr<kv_connector::KvCacheConnectorManager> kvCacheConnectorManager = nullptr);
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h(9 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp(13 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (4)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
- stream (4108-4114)
- stream (4194-4200)

cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)
- numLayers (287-388)
- numLayers (602-708)
- numLayers (602-604)

cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (4)
- numLayers (32-69)
- numLayers (32-33)
- numLayers (77-88)
- numLayers (77-77)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h (1)
- KvCacheConnectorManager (35-44)

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)
- getUniquePrimaryPool (2410-2415)
- getUniquePrimaryPool (2410-2410)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
538-539: Wiring the connector into per-window managers looks correct. The connector manager is forwarded to each WindowBlockManager. Good propagation.
580-582: Connector storage on WindowBlockManager is straightforward. The constructor accepts and stores the KvCacheConnectorManager. No concerns.
Also applies to: 605-606
1225-1231: Warning when connector is present but reuse is disabled is appropriate. Clear communication that the connector is ignored without reuse. LGTM.
1644-1651: Constructor parameter plumb-through is consistent. All KVCacheManager overloads accept and forward the connector to BlockManager correctly.
Also applies to: 1662-1673, 1696-1702
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)
540-542: Constructor wiring for KvCacheConnectorManager looks good. Appending KvCacheConnectorManager at the end maintains source compatibility and keeps ownership semantics clear.
1633-1633: getUniquePrimaryPool() override is fine; precondition is enforced in the .cpp. The runtime check in the implementation ensures this is only called for a single window size. Header docs (suggested above) will make this clearer to callers.
857-859: Confirmed: kvCacheConnectorManager is forwarded to each WindowBlockManager. I verified in BlockManager's constructor (cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp around lines 533-538) that

```cpp
mWindowBlockManagers.try_emplace(
    windowSize, dtype, windowSize, layersWithWindowSize, numKvHeadsPerLayer, sizePerHead, tokensPerBlock,
    allottedPrimaryBlocks, allottedSecondaryBlocks, maxNumSequences, stream, onboardBlocks, cacheType,
    secondaryOffloadMinPriority, mEventManager, enablePartialReuse, copyOnPartialReuse, kvCacheConnectorManager);
```

passes the `kvCacheConnectorManager` pointer through to every WindowBlockManager instance. No further changes needed.
LGTM for the LLM API part, thanks!
```python
        self.kv_cache_tensor = None
```

```python
    def register_kv_caches(self, kv_cache_tensor: torch.Tensor):
```
I noticed this example is a bit long. Could you add comments to the core logic/method to make it more intuitive? Since this example will be included in our documentation, such as this, we rely on both the code and comments to serve as guidance.
@Superjomn if that's ok with you, @richardhuo-nv will address in a separate MR since John's last day was last week.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
421-435: Fix VSWA handling and mirror connector logic in STAR branch. The `prepare_resources` method currently calls `get_cache_indices(req)` without an explicit `window_size` in the non-STAR path, which will raise a `ValueError` under VSWA (when `len(self.max_attention_window_vec) > 1`). In addition, the STAR `cp_type` branch skips both the connector gating (`_kv_connector_should_add_sequence`) and the post-alloc `update_state_after_alloc` call, leading to inconsistent behavior.

Required changes:
- Always pass an explicit `window_size` to `get_cache_indices` (e.g. `self.max_attention_window_vec[0]`) to avoid the VSWA hazard.
- In the STAR branch, wrap sequence addition in the same `_kv_connector_should_add_sequence` check and invoke `update_state_after_alloc` after obtaining `block_ids`.

Locations to update in tensorrt_llm/_torch/pyexecutor/resource_manager.py:
- Lines ~408-414 (non-STAR branch)
- Lines ~391-399 (STAR branch)
Proposed diff:
```diff
--- a/tensorrt_llm/_torch/pyexecutor/resource_manager.py
+++ b/tensorrt_llm/_torch/pyexecutor/resource_manager.py
@@ context_batch loop:
-                if 'cp_type' in self.mapping.cp_config and CpType.STAR == self.mapping.cp_config[
-                        'cp_type']:
-                    if req.ctx_iters == 0:
-                        seq_len = sum(
-                            len(ctx_block) for ctx_block in req.ctx_blocks)
-                        self.impl.add_sequence(
-                            req.py_request_id,
-                            seq_len + (len(req.query_id) if self.mapping.cp_rank
-                                       == self.mapping.cp_size - 1 else 0),
-                            req_beam_width, req)
+                if 'cp_type' in self.mapping.cp_config and CpType.STAR == self.mapping.cp_config['cp_type']:
+                    if req.ctx_iters == 0 and self._kv_connector_should_add_sequence(req):
+                        seq_len = sum(len(ctx_block) for ctx_block in req.ctx_blocks)
+                        self.impl.add_sequence(
+                            req.py_request_id,
+                            seq_len + (len(req.query_id)
+                                       if self.mapping.cp_rank == self.mapping.cp_size - 1
+                                       else 0),
+                            req_beam_width, req)
+                        if self.kv_connector_manager is not None:
+                            block_ids = self.get_cache_indices(
+                                req,
+                                self.max_attention_window_vec[0]
+                            )
+                            self.kv_connector_manager.update_state_after_alloc(
+                                req, block_ids)
                 else:
-                    if req.is_first_context_chunk and self._kv_connector_should_add_sequence(
-                            req):
+                    if req.is_first_context_chunk and self._kv_connector_should_add_sequence(req):
                         self.impl.add_sequence(req.py_request_id,
                                                req.prompt_len, req_beam_width,
                                                req)
                         for _ in range(self.num_extra_kv_tokens):
                             self.impl.add_token(req.py_request_id)
                         for _ in range(get_draft_token_length(req)):
                             self.impl.add_token(req.py_request_id)
                         if self.kv_connector_manager is not None:
-                            block_ids = self.get_cache_indices(req)
+                            block_ids = self.get_cache_indices(
+                                req,
+                                self.max_attention_window_vec[0]
+                            )
                             self.kv_connector_manager.update_state_after_alloc(
                                 req, block_ids)
```
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
45-50: Make kv_connector_manager truly optional by defaulting to None. The parameter is typed Optional but is still required. Default it to None to avoid churn and reduce risk of call-site omissions. This was requested previously and remains applicable.

```diff
-    def __init__(self, *, executor_config: ExecutorConfig,
+    def __init__(self, *, executor_config: ExecutorConfig,
                  model_engine: PyTorchModelEngine,
                  draft_model_engine: Optional[PyTorchModelEngine],
-                 mapping: Mapping, net_max_seq_len: int,
-                 kv_connector_manager: Optional[KvCacheConnectorManager]):
+                 mapping: Mapping, net_max_seq_len: int,
+                 kv_connector_manager: Optional[KvCacheConnectorManager] = None):
```

Also applies to: 57-57
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
336-337: Pass kv_connector_manager into C++ only when not None. If the binding does not accept the keyword in older builds, passing a None-typed kwarg can raise a TypeError. Guard the kw insertion to be safe across versions.

```diff
-            'kv_connector_manager': self.kv_connector_manager,
+            **({'kv_connector_manager': self.kv_connector_manager}
+               if self.kv_connector_manager is not None else {}),
```

tensorrt_llm/_torch/pyexecutor/_util.py (1)
332-336: Explicitly block unsupported configurations (Mamba hybrid, draft model). Clear NotImplementedError paths reduce ambiguity for integrators. Consider centralizing the guard into a small helper if this grows further, but fine as-is.
Also applies to: 413-416
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1674-1689: Broaden exception guard when querying cache indices on termination. `get_cache_indices` may raise ValueError for VSWA (window_size required). Even with the new fail-fast, catching both maintains robustness and prevents noisy terminations if future conditions change.

```diff
-        except IndexError:
+        except (IndexError, ValueError):
             # If the request has not yet been added to the kv cache manager,
             # we still need to free resources corresponding to other resource managers.
             self.resource_manager.free_resources(request)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (9)
cpp/tensorrt_llm/nanobind/CMakeLists.txt(1 hunks)cpp/tensorrt_llm/pybind/CMakeLists.txt(1 hunks)tensorrt_llm/_torch/pyexecutor/_util.py(7 hunks)tensorrt_llm/_torch/pyexecutor/py_executor.py(10 hunks)tensorrt_llm/_torch/pyexecutor/resource_manager.py(7 hunks)tensorrt_llm/executor/proxy.py(3 hunks)tensorrt_llm/executor/worker.py(6 hunks)tensorrt_llm/llmapi/llm.py(1 hunks)tests/integration/test_lists/test-db/l0_a10.yml(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/executor/worker.py
- tests/integration/test_lists/test-db/l0_a10.yml
- cpp/tensorrt_llm/nanobind/CMakeLists.txt
- tensorrt_llm/llmapi/llm.py
- cpp/tensorrt_llm/pybind/CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (1)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py (6)
- KvCacheConnectorManager (345-549)
- update_state_after_alloc (213-221)
- update_state_after_alloc (537-539)
- build_scheduler_output (306-338)
- build_scheduler_output (419-422)
- should_add_sequence (415-417)
tensorrt_llm/_torch/pyexecutor/_util.py (3)
- tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py (1): KvCacheConnectorManager (345-549)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1): KVCacheManager (142-953)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2): start_worker (337-343), PyExecutor (140-1862)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
- tensorrt_llm/_torch/modules/decoder_layer.py (1): DecoderLayer (10-34)
- tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py (11): register_kv_caches (99-106), layer_pre_hook (544-546), layer_post_hook (548-549), take_scheduled_requests_pending_load (424-450), handle_metadata (452-458), start_load_kv (109-113), get_finished (144-160), get_finished (486-535), wait_for_save (138-141), request_finished (199-210), request_finished (460-484)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
441-444: Right place to build connector scheduler output. Building the scheduler output after `prepare_resources` is the correct integration point. No issues.
445-448: Helper encapsulation looks good. Centralizing the "should add sequence" decision behind `_kv_connector_should_add_sequence` keeps call sites clean.
648-650: `get_unique_primary_pool` exposure is appropriate. Exposing the contiguous primary pool for registration with the connector worker is needed; method naming is clear.
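The helper pattern praised in the 445-448 comment can be sketched with toy stand-ins. All class and method bodies below are illustrative, not the actual TensorRT-LLM implementation; only the `should_add_sequence` / `_kv_connector_should_add_sequence` names come from the review:

```python
# Hypothetical sketch: a KV-cache manager defers the "should this sequence
# get a slot yet?" decision to an optional connector manager.

class ToyConnectorManager:
    def __init__(self, pending):
        # Request ids still waiting on an async KV load.
        self._pending = set(pending)

    def should_add_sequence(self, request_id):
        # Sequences with outstanding loads are not added yet.
        return request_id not in self._pending


class ToyKVCacheManager:
    def __init__(self, connector=None):
        self.connector = connector
        self.added = []

    def _kv_connector_should_add_sequence(self, request_id):
        # Centralized helper: without a connector, always add.
        if self.connector is None:
            return True
        return self.connector.should_add_sequence(request_id)

    def prepare_resources(self, request_ids):
        for rid in request_ids:
            if self._kv_connector_should_add_sequence(rid):
                self.added.append(rid)


mgr = ToyKVCacheManager(ToyConnectorManager(pending={2}))
mgr.prepare_resources([1, 2, 3])
print(mgr.added)  # [1, 3]
```

Keeping the `None` check inside the single helper is what keeps every call site in `prepare_resources` connector-agnostic.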
tensorrt_llm/_torch/pyexecutor/_util.py (2)
279-282: Clean separation of `estimating_kv_cache` from "attached connector". Great call making attachment conditional on `estimating_kv_cache`. This avoids connector side effects while probing capacity.
Also applies to: 323-325, 397-399
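The conditional-attachment idea from the 279-282 comment reduces to a one-line guard. The sketch below uses invented names (`build_kv_cache_manager`, `ToyKVCacheManager`) to show the shape of the pattern, not the real `_util.py` code:

```python
# Hypothetical sketch: attach the connector only on the real pass, never
# during the KV-cache capacity-estimation pass.

class ToyKVCacheManager:
    def __init__(self, connector=None):
        self.connector = connector


def build_kv_cache_manager(estimating_kv_cache, connector):
    # The estimation pass probes capacity; leaving the connector detached
    # guarantees it observes no side effects from throwaway allocations.
    attached = None if estimating_kv_cache else connector
    return ToyKVCacheManager(connector=attached)


probe = build_kv_cache_manager(estimating_kv_cache=True, connector="conn")
real = build_kv_cache_manager(estimating_kv_cache=False, connector="conn")
print(probe.connector, real.connector)  # None conn
```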
436-452: All call sites correctly handle `kv_connector_manager`. The two invocations of `create_py_executor_instance` in `py_executor_creator.py` (lines 447–456 and 576–584) now both pass the new `kv_connector_manager` parameter appropriately (using `None` during estimation and the real manager afterward). The implementation of `create_py_executor_instance` in `_util.py` accepts the optional `kv_connector_manager` and forwards it into the `PyExecutor` constructor. No further changes are needed.

tensorrt_llm/_torch/pyexecutor/py_executor.py (4)
142-161: Constructor: optional `kv_connector_manager` wiring is fine. Parameter threading looks good and leaves existing call sites intact.
990-997: Start-batch connector handshake is at the right point in the loop. Taking pending-load requests out, pushing metadata, then starting the load on the current stream matches the intended overlap.
999-1003: Freeing resources only after connector reports completion is correct. This pairs with `request_finished` deferral and avoids premature deallocation.
1004-1008: Waiting for synchronous saves post-forward is correct. Ensures on-stream KV writes finish before proceeding to next iteration.
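The per-iteration ordering the three comments above sign off on (take pending loads out, start the async load, gate each layer, then wait for saves and collect finished requests) can be sketched end to end. Everything here is a toy stand-in; the method names follow the connector API in this PR, but the bodies and the batch/request shapes are assumptions:

```python
# Hypothetical sketch of one executor iteration's connector handshake.

class ToyConnector:
    def __init__(self):
        self.events = []  # records call order for illustration

    def take_scheduled_requests_pending_load(self, batch):
        # Pull out requests whose KV is still being loaded asynchronously.
        self.events.append("take_pending")
        return [r for r in batch if not r.get("pending_load")]

    def start_load_kv(self):
        # Kick off async KV loads on the current stream.
        self.events.append("start_load")

    def layer_pre_hook(self, layer_idx):
        # Would block until this layer's KV blocks have arrived.
        self.events.append(f"wait_layer_{layer_idx}")

    def wait_for_save(self):
        # Would block until on-stream KV writes have completed.
        self.events.append("wait_save")

    def get_finished(self, request_ids):
        # Pretend every async save completed this iteration; resources for
        # these requests may now be freed.
        self.events.append("get_finished")
        return list(request_ids)


conn = ToyConnector()
batch = [{"id": 1}, {"id": 2, "pending_load": True}]
ready = conn.take_scheduled_requests_pending_load(batch)
conn.start_load_kv()
for layer_idx in range(2):
    conn.layer_pre_hook(layer_idx)
    # ... run this layer's forward pass ...
conn.wait_for_save()
finished = conn.get_finished([r["id"] for r in ready])
print(finished)  # [1]
```

The key invariant mirrored from the review: `get_finished` runs after `wait_for_save`, so resources are only released once the connector confirms the request is done.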
|
/bot run --disable-fail-fast |
|
PR_Github #16444 [ run ] triggered by Bot |
|
@richardhuo-nv The branch has fallen behind main and cannot be rebased via the UI due to conflicts. Could you do the rebase? |
It turns out I don't have permission to push to John's fork. @jthomson04 could you grant me access? |
|
PR_Github #16444 [ run ] completed with state |
|
Closing. |
This is a first implementation of a KV connector API in TRTLLM. See https://docs.google.com/document/d/104jQ_Qhx5-rb1d-8ZBSN-jYQESaRjNJVhJoT3X7_KGw/edit?tab=t.0#heading=h.dh5skq4lgud6 for details. This is a very limited initial implementation, so a couple of caveats apply:
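To make the shape of the API concrete, here is a hedged sketch of a no-op connector split into worker and scheduler halves, following the scheduler/worker division described in this PR. The method names are taken from the walkthrough; the signatures and return conventions below are assumptions, not the final interface:

```python
# Hypothetical no-op connector sketch (not the shipped interface).

class NoOpConnectorWorker:
    def register_kv_caches(self, primary_pool):
        # Receives the contiguous primary KV pool tensor at startup.
        self.pool = primary_pool

    def start_load_kv(self):
        pass  # nothing to load asynchronously

    def wait_for_layer_load(self, layer_idx):
        pass  # per-layer barrier before that layer's attention runs

    def wait_for_save(self):
        pass  # block until synchronous on-stream saves complete


class NoOpConnectorScheduler:
    def build_connector_meta(self, scheduler_output):
        # Metadata shipped to the worker each scheduling step.
        return {}

    def request_finished(self, request):
        # Assumed convention: False means no async save is outstanding,
        # so the request's KV resources may be freed immediately.
        return False


worker = NoOpConnectorWorker()
worker.register_kv_caches(primary_pool=object())
sched = NoOpConnectorScheduler()
print(sched.request_finished(request={"id": 1}))  # False
```

A real connector would replace the no-op bodies with, e.g., transfers to a remote KV store, while keeping the scheduler half free of any GPU work.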
Summary by CodeRabbit
New Features
Python API
Examples
Tests
Description
Test Coverage
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill
`kill`: Kill all running builds associated with the pull request.
skip
`skip --comment COMMENT`: Skip testing for the latest commit on the pull request.
`--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
`reuse-pipeline`: Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.