Draft:[TRTLLM-7078][chore] optimal kvcache transfer for VWSA #6861
Conversation
📝 Walkthrough

Refactors KV-cache handling from a single global block range to per-window semantics: block hashes, block ranges, and transfer buffers are now keyed by attention-window size.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Client
participant DataTransceiver
participant BlockRange
participant CacheFormatter
participant Transport
Client->>DataTransceiver: submit request
DataTransceiver->>BlockRange: build/get per-window block-hash map
alt selective transfer (default)
DataTransceiver->>CacheFormatter: request with per-window hashes
loop each windowSize
CacheFormatter->>BlockRange: getBlockRangeForWindow(windowSize)
CacheFormatter->>Transport: send selected blocks for window
end
else full-window transfer (env flag)
DataTransceiver->>CacheFormatter: request without selective hashes
loop each windowSize
CacheFormatter->>BlockRange: get all/newly allocated blocks for window
CacheFormatter->>Transport: send all blocks for window
end
end
```

```mermaid
sequenceDiagram
autonumber
participant Transport
participant CacheFormatter
participant BlockRange
Transport->>CacheFormatter: deliver cache payload
loop each windowSize
CacheFormatter->>BlockRange: getBlockRangeForWindow(windowSize)
CacheFormatter->>CacheFormatter: map buffers into window's pools and write blocks
end
```
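Both flows hinge on a per-window block-hash map rather than a single global block list. A minimal, self-contained sketch of that shape (the `SizeType32` alias and the literal window sizes/hashes below are assumptions for illustration, not values from the PR):

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

using SizeType32 = int; // stand-in for tensorrt_llm::runtime::SizeType32

// Attention-window size -> hashes of the KV-cache blocks to transfer for that window.
using BlockHashesPerWindow = std::unordered_map<SizeType32, std::vector<std::size_t>>;

int main()
{
    BlockHashesPerWindow hashesPerWindow;
    hashesPerWindow[4096] = {0xabcU, 0xdefU}; // hypothetical full-attention window
    hashesPerWindow[512] = {0x123U};          // hypothetical sliding window
    return static_cast<int>(hashesPerWindow.size());
}
```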
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~55 minutes
Actionable comments posted: 7
🔭 Outside diff range comments (6)
cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (2)
140-157: Fix data race: protect map lookup and erase with the mutex

find() and erase() on mRequestToSession must be done under mMtxForMap. Currently, find() happens before acquiring the lock.
Apply:
```diff
 void DataSenderImpl::release(LlmRequest::RequestIdType requestId)
 {
-    auto it = mRequestToSession.find(requestId);
-    TLLM_CHECK(it != mRequestToSession.end());
-    std::unique_lock<std::mutex> lk(mMtxForMap);
+    std::unique_lock<std::mutex> lk(mMtxForMap);
+    auto it = mRequestToSession.find(requestId);
+    TLLM_CHECK(it != mRequestToSession.end());
     if (!common::getEnvKVCacheTransferOutputPath().empty())
     {
         if (!mMeasuresFile.is_open())
         {
             auto outputPath = getTransferOutputPath("send");
             mMeasuresFile.open(outputPath);
             TLLM_CHECK_WITH_INFO(
                 mMeasuresFile.is_open(), "Failed to open transfer output file: %s", outputPath.string().c_str());
         }
         it->second.exportMeasure(mMeasuresFile, true);
     }
     mRequestToSession.erase(it);
 }
```
218-223: Handle "index not found" case and fix variable name typo

valideConnectionIdx is misspelled, and std::find may return end() when i is not picked, yielding an invalid index.
Apply:
```diff
-    // TODO: index -> validConnectionIdx conversion
-    auto valideConnectionIdx = std::find(pickUpIdx.begin(), pickUpIdx.end(), i) - pickUpIdx.begin();
+    // TODO: index -> validConnectionIdx conversion
+    auto itIdx = std::find(pickUpIdx.begin(), pickUpIdx.end(), i);
+    if (itIdx == pickUpIdx.end())
+    {
+        // Skip connections that are not selected by the picker.
+        continue;
+    }
+    auto validConnectionIdx = static_cast<size_t>(itIdx - pickUpIdx.begin());
     auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connection);
     TLLM_CHECK(agentConnection != nullptr);
     TLLM_CHECK(cacheBufferId.has_value());
     const_cast<executor::kv_cache::AgentConnection*>(agentConnection)
-        ->sendRequestAndBufferInfo(requestInfo, cacheBufferId, valideConnectionIdx);
+        ->sendRequestAndBufferInfo(requestInfo, cacheBufferId, validConnectionIdx);
```

cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (2)
76-83: Fix pybind11 type: use py::class_ (not py::classh)

py::classh is invalid and will not compile. Use py::class_.
Apply:
```diff
-    py::classh<tb::BaseCacheTransceiver, PyCacheTransceiver>(m, "BaseCacheTransceiver")
+    py::class_<tb::BaseCacheTransceiver, PyCacheTransceiver>(m, "BaseCacheTransceiver")
         .def("respond_and_send_async", &BaseCacheTransceiver::respondAndSendAsync)
         .def("request_and_receive_sync", &BaseCacheTransceiver::requestAndReceiveSync)
         .def("request_and_receive_async", &BaseCacheTransceiver::requestAndReceiveAsync)
         .def("check_context_transfer_status", &BaseCacheTransceiver::checkContextTransferStatus)
         .def("check_gen_transfer_status", &BaseCacheTransceiver::checkGenTransferStatus)
         .def("check_gen_transfer_complete", &BaseCacheTransceiver::checkGenTransferComplete);
```
88-95: Fix pybind11 type in the derived class binding

Same issue: py::classh should be py::class_. Otherwise the binding fails to build.
Apply:
```diff
-    py::classh<tb::CacheTransceiver, tb::BaseCacheTransceiver>(m, "CacheTransceiver")
+    py::class_<tb::CacheTransceiver, tb::BaseCacheTransceiver>(m, "CacheTransceiver")
         .def(py::init<tb::kv_cache_manager::BaseKVCacheManager*, std::vector<SizeType32>, SizeType32, SizeType32,
                  runtime::WorldConfig, nvinfer1::DataType, executor::kv_cache::CacheState::AttentionType,
                  std::optional<executor::CacheTransceiverConfig>>(),
             py::arg("cache_manager"), py::arg("num_kv_heads_per_layer"), py::arg("size_per_head"),
             py::arg("tokens_per_block"), py::arg("world_config"), py::arg("dtype"), py::arg("attention_type"),
             py::arg("cache_transceiver_config") = std::nullopt);
```

cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1)
20-30: Missing include; header should be self-contained

This header declares APIs using std::map but does not include <map>. Add it to avoid relying on transitive includes.
Apply:
#include "tensorrt_llm/batch_manager/kvCacheManager.h" #include "tensorrt_llm/executor/executor.h" #include "tensorrt_llm/runtime/bufferManager.h" #include "tensorrt_llm/runtime/iTensor.h" +#include <map> #include <atomic> #include <condition_variable> #include <cstddef> #include <optional> #include <unordered_map> #include <vector>cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
18-34: Consider explicitly bringing SizeType32 into scope or fully-qualify it

This header uses SizeType32 but does not declare a using alias in-scope. If not provided transitively, compilation will fail. Prefer fully-qualifying to avoid leaking aliases in a public header.
Example:
- Replace occurrences of SizeType32 with tensorrt_llm::runtime::SizeType32
Or add near includes:
using SizeType32 = tensorrt_llm::runtime::SizeType32;
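For illustration, a hedged sketch of what the fully-qualified spelling could look like on one of the new accessors (the exact method signature is an assumption, not the header's actual declaration):

```cpp
[[nodiscard]] std::unordered_map<tensorrt_llm::runtime::SizeType32, std::vector<size_t>> const&
getBlockHashesPerWindow() const noexcept;
```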
🧹 Nitpick comments (10)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)
30-34: Consider using initializer list for mBlockIds to avoid unnecessary copy.

Since blockIds is moved into the member initializer list, but the vector is copied when passed to the constructor, consider passing by rvalue reference to avoid the extra copy:

```diff
-    BlockRangeForWindow(std::vector<SizeType32> blockIds, runtime::ITensor::SharedPtr pool)
-        : mBlockIds(std::move(blockIds))
+    BlockRangeForWindow(std::vector<SizeType32>&& blockIds, runtime::ITensor::SharedPtr pool)
+        : mBlockIds(std::move(blockIds))
         , mPool(std::move(pool))
     {
     }
```
145-146: Remove commented-out code.

The commented line appears to be debug code that should be removed.
```diff
-    // cacheManager.getBlockManager.getPrimaryPool(0);
     auto poolNum = mManager->getNumPools();
```

cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1)
100-102: Binding signature LGTM; consider adding minimal docstrings

The python-exposed pre_alloc_buffer_size now matches the C++ signature and includes tokens_per_block. Optional: add a docstring for clarity.
Example:
- .def_static("pre_alloc_buffer_size", &tb::kv_cache_manager::CacheTransBufferManager::preAllocBufferSize, + .def_static("pre_alloc_buffer_size", &tb::kv_cache_manager::CacheTransBufferManager::preAllocBufferSize, py::arg("cache_size_bytes_per_token_per_window"), py::arg("tokens_per_block"), - py::arg("cache_transceiver_config") = py::none()); + py::arg("cache_transceiver_config") = py::none(), + R"doc( +Compute the pre-allocation size for cache transfer buffers. +Args: + cache_size_bytes_per_token_per_window: dict[int,int] of window_size -> size per token (bytes) + tokens_per_block: number of tokens per contiguous KV block + cache_transceiver_config: optional transceiver configuration +)doc");cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (2)
49-51: Typo: supportFbaricMemory() → supportFabricMemory()

Public API has a spelling mistake. Renaming is breaking, but better to correct now or provide a deprecated alias.
Suggested change:
- Rename method to supportFabricMemory() and, if needed, keep a deprecated inline alias for one release cycle.
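A minimal sketch of that migration path, assuming the method is const and the deprecated spelling simply forwards to the corrected one (signatures are assumptions):

```cpp
[[nodiscard]] bool supportFabricMemory() const;

// Kept for one release cycle so existing callers keep compiling, with a build-time nudge.
[[deprecated("Renamed: use supportFabricMemory().")]] [[nodiscard]] inline bool supportFbaricMemory() const
{
    return supportFabricMemory();
}
```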
18-18: Consider adding include guards per project guideline

Guidelines prefer named include guards over pragma once. Optional, if consistent with the repository style.
Example:

```cpp
#ifndef TRTLLM_CACHE_TRANS_BUFFER_H
#define TRTLLM_CACHE_TRANS_BUFFER_H
// ...
#endif // TRTLLM_CACHE_TRANS_BUFFER_H
```
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)
126-129: LGTM: Non-destructive move pattern in test is clear

Copying to blockIdsVec and poolCopy, then moving into BlockRangeForWindow preserves original values and tests the rvalue overload.
You could also construct directly from copies without intermediate variables:
```cpp
auto range = BlockRangeForWindow(std::vector(blockIds.begin(), blockIds.end()), pool);
```

cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)
213-219: Consider documenting the rationale for adding tokensPerBlock to validTokenNum calculation.

The logic now adds tokensPerBlock to the valid token count when windowSize < maxNumTokens. While this appears to be accounting for an additional block boundary, the reasoning should be documented for maintainability.

Add a comment explaining the calculation:
```diff
     auto validTokenNum
-        = (windowSize < maxNumTokens.value() ? (windowSize + tokensPerBlock) : maxNumTokens.value());
+        = (windowSize < maxNumTokens.value()
+                ? (windowSize + tokensPerBlock) // Add one extra block for boundary alignment
+                : maxNumTokens.value());
```

cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
953-956: Parameter name inconsistency: windowSize vs blockPoolIdx in function signature.

The function signature shows windowSize but the old parameter name was blockPoolIdx. The documentation comment for this parameter should be updated to reflect the new semantics.

Consider adding a documentation comment:
```diff
-void fillBlockData(tensorrt_llm::runtime::ITensor& blockData, int blockId, size_t initial, int windowSize = 0)
+// Fill block data for testing
+// @param blockData The block tensor to fill
+// @param blockId The block identifier
+// @param initial Initial value for data generation
+// @param windowSize The window size (0 uses default from block manager)
+void fillBlockData(tensorrt_llm::runtime::ITensor& blockData, int blockId, size_t initial, int windowSize = 0)
```

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)
75-80: Optional: Make serialization order deterministic to improve reproducibility

Iterating an unordered_map yields a non-deterministic order. While deserialize reads into an unordered_map (so round-trip works), serialization artifacts (e.g., logs, test fixtures, cross-process replay) may differ run-to-run. If determinism matters, serialize entries sorted by key.
Apply this diff inside serialize to sort keys by windowSize:
```diff
-    su::serialize(requestInfo.mBlockHashesPerWindow.size(), os);
-    for (auto const& [windowSize, blockHashes] : requestInfo.mBlockHashesPerWindow)
-    {
-        su::serialize(windowSize, os);
-        su::serialize(blockHashes, os);
-    }
+    su::serialize(requestInfo.mBlockHashesPerWindow.size(), os);
+    std::vector<SizeType32> keys;
+    keys.reserve(requestInfo.mBlockHashesPerWindow.size());
+    for (auto const& kv : requestInfo.mBlockHashesPerWindow)
+    {
+        keys.push_back(kv.first);
+    }
+    std::sort(keys.begin(), keys.end());
+    for (auto const windowSize : keys)
+    {
+        su::serialize(windowSize, os);
+        su::serialize(requestInfo.mBlockHashesPerWindow.at(windowSize), os);
+    }
```

Additionally, add this include near the top of the file:
```cpp
#include <algorithm>
```
238-244: Avoid copying potentially large per-window hash maps on the hot path

blockHashesPerWindow is populated via getBlockHashesPerWindow(), which likely returns a const reference; assigning it to a local map copies the entire structure. Given this runs in a loop and maps contain vectors, this can be expensive.
- Prefer moving the map out of RequestInfo, e.g., add a RequestInfo::takeBlockHashesPerWindow() that returns the map by value and clears the source, then:
- auto info = mSender->recvRequestInfo();
- auto blockHashesPerWindow = info.takeBlockHashesPerWindow(); // moved, no copy
- If changing RequestInfo is undesirable, consider returning the map by value from getBlockHashesPerWindow() and using NRVO/move.
- As a lighter alternative, store the map with the request’s Response state so you don’t carry a copy across iterations and only move it into llmRequest at count == 0.
The final move into llmRequest at Line 263 is good; the bottleneck is the preceding copy into blockHashesPerWindow.
Also applies to: 263-263
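A hedged sketch of the move-out accessor suggested in the first bullet above (the member name is taken from the diffs; everything else is an assumption for illustration):

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

using SizeType32 = int; // stand-in for tensorrt_llm::runtime::SizeType32

class RequestInfo
{
public:
    // Moves the per-window hashes out and leaves the member empty, so callers avoid a deep copy.
    std::unordered_map<SizeType32, std::vector<std::size_t>> takeBlockHashesPerWindow()
    {
        return std::exchange(mBlockHashesPerWindow, {});
    }

private:
    std::unordered_map<SizeType32, std::vector<std::size_t>> mBlockHashesPerWindow;
};
```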
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.h (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (12 hunks)
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2 hunks)
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2 hunks)
- tests/integration/defs/accuracy/test_disaggregated_serving.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/integration/defs/accuracy/test_disaggregated_serving.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files: cpp/tensorrt_llm/common/envUtils.h, cpp/include/tensorrt_llm/batch_manager/llmRequest.h, cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h, cpp/tensorrt_llm/common/envUtils.cpp, cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp, cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiver.h, cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp, cpp/tests/batch_manager/cacheTransceiverTest.cpp, cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp, cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp, cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp, cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files: cpp/tensorrt_llm/common/envUtils.h, cpp/include/tensorrt_llm/batch_manager/llmRequest.h, cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h, cpp/tensorrt_llm/common/envUtils.cpp, cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp, cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiver.h, cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp, cpp/tests/batch_manager/cacheTransceiverTest.cpp, cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp, cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp, cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp, cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM_<NAME>_H without underscore prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files: cpp/tensorrt_llm/common/envUtils.h, cpp/include/tensorrt_llm/batch_manager/llmRequest.h, cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h, cpp/tensorrt_llm/batch_manager/dataTransceiver.h, cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files: cpp/tensorrt_llm/common/envUtils.h, cpp/include/tensorrt_llm/batch_manager/llmRequest.h, cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h, cpp/tensorrt_llm/common/envUtils.cpp, cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp, cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiver.h, cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp, cpp/tests/batch_manager/cacheTransceiverTest.cpp, cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp, cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp, cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp, cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files: cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, cpp/tensorrt_llm/common/envUtils.cpp, cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp, cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp, cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp, cpp/tests/batch_manager/cacheTransceiverTest.cpp, cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp, cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp, cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🧠 Learnings (1)
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiver.h, cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🧬 Code Graph Analysis (11)

cpp/tensorrt_llm/common/envUtils.h (1)
- cpp/tensorrt_llm/common/envUtils.cpp (2): getEnvKVCacheTransferAllBlocksForWindow (449-453), getEnvKVCacheTransferAllBlocksForWindow (449-449)

cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3): nodiscard (973-980), nodiscard (1429-1432), nodiscard (2479-2497)

cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (3)
- cpp/tensorrt_llm/common/envUtils.cpp (2): getEnvDisableSelectiveCacheTransfer (321-325), getEnvDisableSelectiveCacheTransfer (321-321)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (10): llmRequest (421-460), llmRequest (421-421), llmRequest (882-914), llmRequest (882-882), llmRequest (916-923), llmRequest (916-916), llmRequest (925-951), llmRequest (925-925), requestId (200-208), requestId (200-201)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1): RequestInfo (45-97)

cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (2): preAllocBufferSize (246-285), preAllocBufferSize (246-248)

cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1): BlockRangeForWindow (27-56)

cpp/tensorrt_llm/batch_manager/dataTransceiver.h (2)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (3): nodiscard (61-112), nodiscard (123-126), nodiscard (133-138)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (1): nodiscard (53-113)

cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (2)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (1): maxNumTokens (390-406)
- cpp/tensorrt_llm/common/envUtils.cpp (2): getEnvKVCacheTransferAllBlocksForWindow (449-453), getEnvKVCacheTransferAllBlocksForWindow (449-449)

cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2): getBlockRangeForSending (43-82), getBlockRangeForSending (43-43)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (8): llmRequest (421-460), llmRequest (421-421), llmRequest (882-914), llmRequest (882-882), llmRequest (916-923), llmRequest (916-916), llmRequest (925-951), llmRequest (925-925)

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
- cpp/tensorrt_llm/common/envUtils.cpp (4): getEnvKVCacheTransferAllBlocksForWindow (449-453), getEnvKVCacheTransferAllBlocksForWindow (449-449), getEnvDisableSelectiveCacheTransfer (321-325), getEnvDisableSelectiveCacheTransfer (321-321)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (8): llmRequest (421-460), llmRequest (421-421), llmRequest (882-914), llmRequest (882-882), llmRequest (916-923), llmRequest (916-916), llmRequest (925-951), llmRequest (925-925)

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1): RequestInfo (45-97)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1): requestInfo (179-179)

cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1): blockIds (48-48)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (2): requestId (200-208), requestId (200-201)
🔇 Additional comments (27)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
61-78: Empty window metadata case cannot occur
BlockManager always initializes its mWindowSizeToMetadata map with at least one window size, so getWindowSizesMetadata() never returns an empty map. The fromNewlyAllocatedBlockIds factory thus safely assumes non-empty metadata and needs no additional empty-check.

cpp/tensorrt_llm/common/envUtils.h (1)
119-120: LGTM! The new function declaration follows the established pattern and is properly placed in the namespace.
cpp/tensorrt_llm/common/envUtils.cpp (1)
449-453: LGTM! The implementation correctly follows the established pattern using static caching and getBoolEnv.
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)
119-120: LGTM! The test correctly passes the tokensPerBlock parameter to match the updated API signature.

163-164: LGTM! The test correctly passes the tokensPerBlock parameter consistent with the API change.

cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (2)
66-69: Getter name and semantics LGTM; consider noexcept and [[nodiscard]] (already present)The accessor and storage type are appropriate for per-window mapping. No further changes needed.
53-56: Serialization for RequestInfo now includesmBlockHashesPerWindowcorrectlyAll three methods in
cpp/tensorrt_llm/batch_manager/dataTransceiver.*have been updated to account for the per-window hash map:
RequestInfo::serialize(...)writes the map size and each(windowSize, blockHashes)pairRequestInfo::deserialize(...)reads them back intomBlockHashesPerWindowRequestInfo::serializedSize(...)sums their sizesUnit tests (e.g.
cacheTransceiverTest.cpp) exercise the round-trip. No further changes are required.cpp/include/tensorrt_llm/batch_manager/llmRequest.h (3)
1834-1837: LGTM! Method signature refactored to support per-window block hashes.The change from a single vector to a per-window map (
std::unordered_map<SizeType32, std::vector<size_t>>) aligns well with the PR's objective of optimizing KV cache transfer on a per-window basis.
1839-1842: LGTM! Getter method properly updated for per-window access.The method is correctly renamed and returns the appropriate per-window map type, maintaining consistency with the setter method.
1839-1842: LGTM! Getter method properly updated for per-window access.

The method is correctly renamed and returns the appropriate per-window map type, maintaining consistency with the setter method.
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
113-127: LGTM! Per-window iteration correctly implemented for MLA format operation.

The change properly iterates over windows using the new per-window API, validates that window sizes match the number of pools, and accumulates blocks per window correctly.
313-326: LGTM! Per-window iteration correctly implemented for MLA unformat operation.

The unformat operation properly mirrors the format operation's per-window approach, ensuring consistency in block handling across both operations.
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (4)
43-82: LGTM! Per-window block range selection logic is well-implemented.

The function correctly implements selective cache transfer with per-window granularity, properly handling reuse information and environment overrides. The debug logging provides good visibility into the decision-making process.

84-111: LGTM! Receiving block range logic correctly handles per-window transfers.

The function properly manages different scenarios (single window, all blocks for window, selective transfer) and correctly builds per-window ranges for receiving.

252-276: LGTM! Format operation correctly updated for per-window block handling.

The non-layer-wise path properly builds per-window block collections and validates window sizes against pool counts. The debug logging provides good visibility.

483-502: LGTM! Unformat operation properly mirrors the format operation's per-window approach.

The receiving side correctly builds per-window output buffers and includes appropriate validation and logging.
cpp/tests/batch_manager/cacheTransceiverTest.cpp (4)
429-438: LGTM! Test properly updated to use per-window iteration.

The test correctly iterates over windows and their blocks, properly filling cache data for each block.

447-458: LGTM! Verification logic correctly updated for per-window structure.

The test properly verifies the received cache data using per-window iteration, maintaining consistency with the sending side.
1082-1084: Good adaptation of the random value generation to use window size.

The change from blockPoolIdx to windowSize in the seed calculation maintains deterministic test behavior while aligning with the new per-window architecture.

1267-1267: Test parameter reduction requires justification.

The test now uses fewer parameter values {60, 30, 60, 10} compared to the original. Please verify this doesn't reduce test coverage for important scenarios. Could you explain why the test parameters were reduced from {30, 10, 60, 30, 60, 10} to {60, 30, 60, 10}? Is this related to the per-window changes or an optimization?

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (6)
26-26: Header include for SizeType32 is appropriate

Including runtime/common.h to access SizeType32 is correct for the new per-window mapping.

37-37: Alias introduction for SizeType32

The using-declaration keeps the code concise and consistent with the new API. LGTM.

45-51: New RequestInfo ctor with per-window hashes uses proper move semantics

The rvalue-ref constructor and member initialization are correct; moving blockHashesPerWindow avoids a copy. LGTM.

56-57: Equality operator extended to include per-window map

Comparison logic is correct; std::unordered_map operator== is order-independent. LGTM.

105-111: serializedSize updated for per-window map

Accounting for the map size and each entry (key + value vector) is correct. LGTM.
88-95: Ignore the decltype "bug"; only consider reserving buckets

The use of decltype(blockHashes) in the initializer is valid—blockHashes is in scope within its own initializer—so no change is needed for the template argument. You can, however, optionally reserve the map's buckets for a small performance gain:

- File: cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (lines 88–95)
- After deserializing size, add:

```diff
 auto size = su::deserialize<decltype(blockHashesPerWindow.size())>(is);
+blockHashesPerWindow.reserve(size);
 for (size_t i = 0; i < size; i++)
 {
     auto windowSize = su::deserialize<SizeType32>(is);
     std::vector<size_t> blockHashes = su::deserialize<decltype(blockHashes)>(is);
     blockHashesPerWindow.emplace(windowSize, std::move(blockHashes));
 }
```

If on-wire compatibility across heterogeneous platforms is a concern, you may also consider using a fixed-width type (e.g. std::vector<std::uint64_t>) instead of std::vector<size_t> in both serialization and deserialization.

Likely an incorrect or invalid review comment.
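If that fixed-width route were ever taken, a hedged sketch of the conversion at the serialization boundary (it assumes su::serialize accepts std::vector<std::uint64_t> the same way it accepts std::vector<size_t>):

```cpp
// Normalize hashes to a fixed-width wire type before writing; identical layout on LP64,
// stable across platforms where size_t differs.
std::vector<std::uint64_t> wireHashes(blockHashes.begin(), blockHashes.end());
su::serialize(wireHashes, os);
```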
```cpp
static size_t preAllocBufferSize(std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
    SizeType32 tokensPerBlock,
    std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig = std::nullopt);
```
💡 Verification agent
🧩 Analysis chain
API change acknowledged; ensure all call sites updated and add [[nodiscard]]
The new tokensPerBlock parameter is appropriate. Consider marking the function [[nodiscard]] to prevent accidental ignoring of the computed size. Also verify all call sites were updated.
Apply:
```diff
-    static size_t preAllocBufferSize(std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
+    [[nodiscard]] static size_t preAllocBufferSize(
+        std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
         SizeType32 tokensPerBlock,
         std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig = std::nullopt);
```

Run to find legacy calls missing the new argument:
🏁 Script executed:
```bash
#!/bin/bash
# Locate all invocations of preAllocBufferSize and show argument lists.
rg -n -A2 -B2 'preAllocBufferSize\('
```

Length of output: 3358
Add [[nodiscard]] to preAllocBufferSize declaration
All call sites—both in tests and in trtGptModelInflightBatching.cpp—already pass the new tokensPerBlock argument, so no updates there are needed. Please annotate the API to prevent accidental ignores:
• In cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h, line 63:

```diff
-    static size_t preAllocBufferSize(
+    [[nodiscard]] static size_t preAllocBufferSize(
         std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
         SizeType32 tokensPerBlock,
         std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig = std::nullopt);
```

• (Optional) Mirror the attribute on the definition in cacheTransBuffer.cpp for consistency:
```diff
-size_t CacheTransBufferManager::preAllocBufferSize(
+[[nodiscard]] size_t CacheTransBufferManager::preAllocBufferSize(
     std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
     SizeType32 tokensPerBlock,
     std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
[[nodiscard]] static size_t preAllocBufferSize(std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
    SizeType32 tokensPerBlock,
    std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig = std::nullopt);
```
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h around lines 63-66, the
declaration of preAllocBufferSize lacks the [[nodiscard]] attribute which could
allow callers to accidentally ignore the returned size; add [[nodiscard]] before
the static return type to enforce usage. Also (optional) add the same
[[nodiscard]] attribute to the function definition in cacheTransBuffer.cpp to
keep declaration and definition consistent.
```cpp
#include <map>
#include <string>

#include "tensorrt_llm/batch_manager/cacheTransceiver.h"
```
Missing <unordered_map> include for new RequestInfo member
RequestInfo now uses std::unordered_map, but this header does not include <unordered_map>. Add it to keep the header self-contained.
Apply:
```diff
 #include <map>
+#include <unordered_map>
 #include <string>
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.h around line 24, the header
now relies on std::unordered_map via the RequestInfo member but does not include
<unordered_map>, breaking self-containment; add #include <unordered_map> near
other standard includes in this header so the type is available without relying
on transitive includes.
Force-pushed: fecbc76 to 3fe3fdb (Compare)
/bot run --add-multi-gpu-test
PR_Github #15612 [ run ] triggered by Bot
Actionable comments posted: 2
🔭 Outside diff range comments (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)
238-265: Logic bug: requested block-hashes can be lost when multiple counterparts exist

blockHashesPerWindow is a local variable reinitialized on each loop iteration. When getCounterpartsCount(reqId) > 1, you only assign it during the first iteration (when receiving RequestInfo), but call setRequestedBlockHashes when count reaches 0, likely a later iteration where blockHashesPerWindow is empty. Result: the LlmRequest may be updated with an empty map.
Fix by setting the hashes immediately after receiving RequestInfo (while holding the responder mutex), and remove the late update.
Apply:
```diff
-    std::unordered_map<SizeType32, std::vector<size_t>> blockHashesPerWindow;
+    std::unordered_map<SizeType32, std::vector<size_t>> blockHashesPerWindow;
     if (!isSending() && !mReadyResponses.empty())
     {
         auto const& requestInfo = mSender->recvRequestInfo();
         auto reqId = requestInfo.getRequestId();
         blockHashesPerWindow = requestInfo.getBlockHashesPerWindow();
+        // Update the matching LlmRequest immediately to avoid losing hashes across iterations.
+        {
+            std::unique_lock lk(mResponderMutex);
+            auto itResp = mReadyResponses.find(reqId);
+            if (itResp != mReadyResponses.end() && itResp->second.mRequest)
+            {
+                itResp->second.mRequest->setRequestedBlockHashes(blockHashesPerWindow);
+            }
+            else
+            {
+                TLLM_LOG_WARNING("Received RequestInfo for unknown request ID: %zu", reqId);
+            }
+        }
+
         mCurrentRequest = reqId;
         if (mRemainSendCount.find(reqId) == mRemainSendCount.end())
         {
             mRemainSendCount[reqId] = mSender->getCounterpartsCount(reqId);
         }
     }
@@
-    // TODO(zhengd): pass the hashes directly instead of update llmRequest
-    auto llmRequest = it->second.mRequest;
-    llmRequest->setRequestedBlockHashes(std::move(blockHashesPerWindow));
+    // Hashes already set when RequestInfo was received.
```
♻️ Duplicate comments (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
18-25: Header is missing required standard includes

This header uses std::vector, std::unique_ptr, and std::unordered_map but does not include <vector>, <memory>, or <unordered_map>. Depending on transitive includes is fragile and can break builds.
Apply:
```diff
 #pragma once

 #include <fstream>
 #include <future>
 #include <map>
+#include <memory>
+#include <unordered_map>
+#include <vector>
 #include <string>

 #include "tensorrt_llm/batch_manager/cacheTransceiver.h"
```
🧹 Nitpick comments (2)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
24-24: Avoid unnecessary header coupling: move cacheTransceiver.h include to the .cpp

If no types from cacheTransceiver.h are referenced in this header (only in the .cpp), prefer including it in dataTransceiver.cpp to reduce compile-time dependencies.
Would you like me to generate a minimal patch that forward-declares any needed symbols (if any) and moves this include?
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)
75-81: Serialize unordered_map deterministically (sort keys before writing)

Iterating std::unordered_map yields nondeterministic order across runs/platforms. For stable artifacts, logs, or cross-version comparisons, serialize entries in a deterministic order (e.g., ascending windowSize).
Apply:
```diff
-    su::serialize(requestInfo.mBlockHashesPerWindow.size(), os);
-    for (auto const& [windowSize, blockHashes] : requestInfo.mBlockHashesPerWindow)
-    {
-        su::serialize(windowSize, os);
-        su::serialize(blockHashes, os);
-    }
+    su::serialize(requestInfo.mBlockHashesPerWindow.size(), os);
+    std::vector<SizeType32> keys;
+    keys.reserve(requestInfo.mBlockHashesPerWindow.size());
+    for (auto const& kv : requestInfo.mBlockHashesPerWindow) { keys.push_back(kv.first); }
+    std::sort(keys.begin(), keys.end());
+    for (auto const& windowSize : keys)
+    {
+        su::serialize(windowSize, os);
+        su::serialize(requestInfo.mBlockHashesPerWindow.at(windowSize), os);
+    }
@@
-    totalSize += su::serializedSize(requestInfo.mBlockHashesPerWindow.size());
-    for (auto const& [windowSize, blockHashes] : requestInfo.mBlockHashesPerWindow)
-    {
-        totalSize += su::serializedSize(windowSize);
-        totalSize += su::serializedSize(blockHashes);
-    }
+    totalSize += su::serializedSize(requestInfo.mBlockHashesPerWindow.size());
+    std::vector<SizeType32> keys2;
+    keys2.reserve(requestInfo.mBlockHashesPerWindow.size());
+    for (auto const& kv : requestInfo.mBlockHashesPerWindow) { keys2.push_back(kv.first); }
+    std::sort(keys2.begin(), keys2.end());
+    for (auto const& windowSize : keys2)
+    {
+        totalSize += su::serializedSize(windowSize);
+        totalSize += su::serializedSize(requestInfo.mBlockHashesPerWindow.at(windowSize));
+    }
```

Also applies to: 100-113
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (12)
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- cpp/tensorrt_llm/common/envUtils.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
🧠 Learnings (4)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/common/envUtils.h, cpp/tensorrt_llm/batch_manager/dataTransceiver.h, cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tensorrt_llm/common/envUtils.h, cpp/tensorrt_llm/batch_manager/dataTransceiver.h, cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/batch_manager/dataTransceiver.h, cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🔇 Additional comments (7)
cpp/tensorrt_llm/common/envUtils.h (1)
119-120: Add Doxygen comment for KV-cache transfer flag

The getEnvKVCacheTransferAllBlocksForWindow API is backed by TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW (default: false, as getBoolEnv returns true only if the var is set to "1"). Please update the declaration in cpp/tensorrt_llm/common/envUtils.h as follows:

```diff
--- a/cpp/tensorrt_llm/common/envUtils.h
+++ b/cpp/tensorrt_llm/common/envUtils.h
@@ -118,7 +118,9 @@ namespace tensorrt_llm::common
 {
-bool getEnvKVCacheTransferAllBlocksForWindow();
+//! Returns whether to transfer all KV-cache blocks for each attention window during cache transmission.
+//! Controlled by the environment variable `TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW` (default: false).
+bool getEnvKVCacheTransferAllBlocksForWindow();
```
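A sketch of the "static caching and getBoolEnv" pattern this flag reportedly follows (per the earlier LGTM on envUtils.cpp lines 449-453); getBoolEnv's exact signature is an assumption:

```cpp
bool getEnvKVCacheTransferAllBlocksForWindow()
{
    // Cached on first call; getBoolEnv is assumed to return true only when the
    // variable is set to "1", per the review comment above.
    static bool const sCached = getBoolEnv("TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW");
    return sCached;
}
```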
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

53-56: API shift to per-window hashes looks consistent

The new constructor and accessor for per-window block hashes are coherent, return by const&, and align with the per-window model propagated across the PR.

Also applies to: 66-69, 93-94
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)
101-106: Early return via needSendCache() is appropriate

The guard avoids unnecessary work when this rank shouldn't send cache. Looks good.
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (4)
45-52: Constructor and equality updates align with per-window state

The added constructor and operator== changes look correct and consistent with the new per-window storage.
Also applies to: 56-58
26-26: Include for SizeType32 is correct

Including runtime/common.h for SizeType32 is appropriate.
70-98: Backward compatibility of RequestInfo wire format

Switching from a flat vector to a per-window map changes the wire format. If mixed-version context/gen executors are possible, consider version-tagging the serialized stream and supporting both decode paths.
Would you confirm if cross-version KV-cache transmission is supported? If yes, I can propose a minimal version byte header and fallback deserializer.
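A minimal sketch of what such a version tag could look like (the constant value, function names, and the idea of branching on the tag are all assumptions, not the PR's current behavior):

```cpp
#include <cstdint>
#include <istream>
#include <ostream>

// Hypothetical wire-version tag for RequestInfo streams.
constexpr std::uint8_t kREQUEST_INFO_WIRE_VERSION = 2; // 1 = legacy flat vector, 2 = per-window map

void writeWireVersion(std::ostream& os)
{
    os.put(static_cast<char>(kREQUEST_INFO_WIRE_VERSION));
}

std::uint8_t readWireVersion(std::istream& is)
{
    // Caller branches on the result: legacy flat-vector decode vs per-window-map decode.
    return static_cast<std::uint8_t>(is.get());
}
```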
20-32: No remaining stale API usages detected
Ran a repository-wide search and confirmed that neither getBlockHashes() nor mBlockHashes appear anywhere in the codebase. No further cleanup is needed.
PR_Github #15612 [ run ] completed with state
Force-pushed: 5f3c422 to 885509d (Compare)
/bot run --add-multi-gpu-test
Actionable comments posted: 3
🔭 Outside diff range comments (2)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
178-183: Fix: don't force all blocks under pool-0 window; use the per-window map

Mapping everything under pool 0's window breaks multi-window MLAs. Use the per-window collection built above.
```diff
-    SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(0);
-    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocksPerWindow;
-    inputKvCacheBlocksPerWindow.emplace(window, inputKvCacheBlocks);
+    // inputKvCacheBlocksPerWindow was populated above while collecting blocks per window
     tensorrt_llm::executor::kv_cache::splitKVCacheDispatch(
         inputKvCacheBlocksPerWindow, outputSplitCaches, destConfig, selfConfig, selfIdx, bufferManager);
```
492-501: Fix: use the built per-window map in concat instead of forcing pool-0 window

Passing all blocks under a single window key corrupts layout when multiple windows exist.
```diff
-    {
-        std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> outputCachesPerWindow;
-        SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(0);
-        outputCachesPerWindow.emplace(window, outputBuffers);
-        NVTX3_SCOPED_RANGE(formatInputConcatenate);
-
-        // recvSplitCaches size == ppdomainsize
-        executor::kv_cache::concatKvCacheV2Dispatch(
-            recvSplitCaches, outputCachesPerWindow, destConfig, selfConfig, selfIdx, bufferManager);
-    }
+    {
+        NVTX3_SCOPED_RANGE(formatInputConcatenate);
+        // recvSplitCaches size == ppdomainsize
+        executor::kv_cache::concatKvCacheV2Dispatch(
+            recvSplitCaches, outputCachesPerWindow, destConfig, selfConfig, selfIdx, bufferManager);
+    }
```
♻️ Duplicate comments (7)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1)
63-65: Add [[nodiscard]] to preAllocBufferSize to prevent ignored results

The return value is critical for buffer sizing. Mark it [[nodiscard]] to avoid accidental discarding. This mirrors earlier feedback and keeps the API safer.
```diff
-    static size_t preAllocBufferSize(std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
+    [[nodiscard]] static size_t preAllocBufferSize(
+        std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
         SizeType32 tokensPerBlock,
         std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig = std::nullopt);
```
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
24-24: Header is not self-contained: missing <unordered_map> (and SizeType32 definition)
This header uses std::unordered_map and SizeType32 in public declarations but does not include the corresponding headers. Transitive includes may mask this locally but break other translation units.
```diff
 #include <map>
+#include <unordered_map>
 #include <string>
 #include "tensorrt_llm/batch_manager/cacheTransceiver.h"
+#include "tensorrt_llm/runtime/common.h"
```
cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1)
181-188: Only override RequestInfo when there are new blocks; move the map to avoid copies
Currently you always reassign RequestInfo even if no newly allocated blocks exist. Guard the override and move the map to leverage the rvalue overload and avoid unnecessary copies.
```diff
-    auto disableSelectiveCacheTransfer = common::getEnvDisableSelectiveCacheTransfer();
+    auto disableSelectiveCacheTransfer = common::getEnvDisableSelectiveCacheTransfer();
     if (!disableSelectiveCacheTransfer)
     {
         auto* cacheManager = mFormatter->getCacheManager();
         auto blockRange
             = kv_cache_manager::BlockRange::fromNewlyAllocatedBlockIds(*cacheManager, llmRequest.mRequestId);
-        requestInfo = RequestInfo(requestId, blockRange.getBlockHashesPerWindow(), mSelfState);
+        if (auto blockHashesPerWindow = blockRange.getBlockHashesPerWindow(); !blockHashesPerWindow.empty())
+        {
+            requestInfo = RequestInfo(requestId, std::move(blockHashesPerWindow), mSelfState);
+        }
     }
```
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
313-327: Bug: receiving path also flattens per-window blocks; build per-window map once
Same issue on receive: a flat outputBuffers loses window grouping and leads to incorrect concat.
```diff
-    auto const& windowSizes = blockRange.getWindowSizes();
-    TLLM_CHECK_WITH_INFO(
-        static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");
-    // TODO(oargov): are we sure the other side has the same number of pools? this might not hold for pp_size>1...
-    size_t blockNum = 0;
-    for (auto const& windowSize : windowSizes)
-    {
-        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
-        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
-        {
-            outputBuffers.push_back(it);
-            blockNum++;
-        }
-    }
+    auto const& windowSizes = blockRange.getWindowSizes();
+    TLLM_CHECK_WITH_INFO(static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");
+    // Build per-window mapping and maintain a flat view for zcopy/size calculations.
+    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> outputCachesPerWindow;
+    size_t blockNum = 0;
+    for (auto const& windowSize : windowSizes)
+    {
+        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+        auto& vec = outputCachesPerWindow[windowSize];
+        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
+        {
+            vec.push_back(it);
+            outputBuffers.push_back(it); // flat view for zcopy path
+            ++blockNum;
+        }
+    }
```
113-121: Bug: per-window blocks are flattened; build a true per-window map during collection
You create a flat inputKvCacheBlocks vector, losing per-window grouping required by split/concat logic when multiple windows exist.
Apply this change to collect both flat and per-window views:
```diff
     int blockNum = 0;
-    std::vector<runtime::ITensor::SharedPtr> inputKvCacheBlocks;
+    std::vector<runtime::ITensor::SharedPtr> inputKvCacheBlocks;
+    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocksPerWindow;
     auto const numPools = mCacheManager->getBlockManager().getNumPools();
     auto blockRange = getBlockRangeForSending(mCacheManager, llmRequest);
     auto const& windowSizes = blockRange.getWindowSizes();
     TLLM_CHECK_WITH_INFO(
         static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");
     for (auto const& windowSize : windowSizes)
     {
         auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
         for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
         {
             inputKvCacheBlocks.push_back(it);
+            inputKvCacheBlocksPerWindow[windowSize].push_back(it);
             blockNum++;
         }
     }
```
Note: follow-up diffs below switch split/concat to use inputKvCacheBlocksPerWindow.
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)
82-85: Unify existence check and fix message grammar/format.
Use the same backing set as other checks (pools are authoritative for window presence), and fix "exists" -> "exist". Also cast to an integral type to match the %d format.
Apply this diff:
```diff
-    TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
-        "Window size %d should exists", windowSize);
+    TLLM_CHECK_WITH_INFO(mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(),
+        "Window size %d should exist", static_cast<int>(windowSize));
```
89-95: Make the window existence check consistent and fix grammar.
Keep consistency with other methods by validating against pools, and fix the message.
Apply this diff:
```diff
-    TLLM_CHECK_WITH_INFO(
-        mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d should exists", windowSize);
+    TLLM_CHECK_WITH_INFO(
+        mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(),
+        "Window size %d should exist", static_cast<int>(windowSize));
```
🧹 Nitpick comments (5)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
53-56: Constructor overload looks good; ensure serialization/deserialization updated accordingly
You've added a per-window hashes ctor taking an rvalue map. Confirm serialize/deserialize/serializedSize cover the new layout and remain version-safe.
Would you like me to generate a verification script that locates RequestInfo::serialize/deserialize definitions and checks for handling of unordered_map<SizeType32, std::vector<size_t>>?
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
225-233: Layer-wise path: confirm per-window iteration is intended for MLA alt-window layouts
Looping windowSizes inside the per-layer loop means each layer slice is sent for every window. If BlockRange partitions blocks disjointly per window (as expected), this is fine; if not, you may duplicate slices. Please confirm BlockRange::getBlockRangeForWindow(windowSize) returns disjoint sets across windows; a debug check is sketched below.
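One way to back that confirmation is a debug-only disjointness assertion. This sketch reuses only the BlockRange calls shown elsewhere in this PR; the free function itself is hypothetical:

```cpp
#include <unordered_set>

// Sketch: every block buffer should appear under exactly one window.
void checkWindowsDisjoint(BlockRange const& blockRange)
{
    std::unordered_set<void const*> seen;
    for (auto const& windowSize : blockRange.getWindowSizes())
    {
        auto range = blockRange.getBlockRangeForWindow(windowSize);
        for (auto it = range.begin(); it != range.end(); ++it)
        {
            TLLM_CHECK_WITH_INFO(
                seen.insert(it->data()).second, "Block buffer shared across windows");
        }
    }
}
```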
484-502: Per-window receive collection: LGTM; keep both per-window and flat views if zcopy path is used
The new per-window mapping plus block count accumulation is correct. Ensure subsequent zcopy paths iterate per-window or use a flat view consistently.
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)
70-82: Serialize/deserialize/size updated for per-window; consider stable ordering (optional).
The current implementation serializes an unordered_map, which is fine for round-trip and equality checks. If you ever compare serialized blobs across processes or runs, consider serializing window keys in sorted order to ensure deterministic byte streams.
Example change (conceptual):
```diff
-    for (auto const& [windowSize, blockHashes] : requestInfo.mBlockHashesPerWindow)
+    std::vector<SizeType32> keys;
+    keys.reserve(requestInfo.mBlockHashesPerWindow.size());
+    for (auto const& kv : requestInfo.mBlockHashesPerWindow) keys.push_back(kv.first);
+    std::sort(keys.begin(), keys.end());
+    for (auto const& windowSize : keys)
     {
-        su::serialize(windowSize, os);
-        su::serialize(blockHashes, os);
+        su::serialize(windowSize, os);
+        su::serialize(requestInfo.mBlockHashesPerWindow.at(windowSize), os);
     }
```
Apply similarly in serializedSize. The deserializer can remain unchanged.
Also applies to: 84-98, 100-112
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
97-111: Nit: pre-reserve per-window hash vectors to reduce reallocations.
Minor perf win when many blocks exist per window.
Apply this diff:
```diff
     for (auto const& [windowSize, blockIds] : mBlockIdsPerWindow)
     {
-        for (auto const& blockId : blockIds)
+        auto& hashes = blockHashesPerWindow[windowSize];
+        hashes.reserve(blockIds.size());
+        for (auto const& blockId : blockIds)
         {
-            blockHashesPerWindow[windowSize].emplace_back(
-                blockManager.getBlockById(blockId, windowSize)->getHash());
+            hashes.emplace_back(blockManager.getBlockById(blockId, windowSize)->getHash());
         }
     }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (18)
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (12 hunks)
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2 hunks)
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2 hunks)
- tests/integration/defs/accuracy/test_disaggregated_serving.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/integration/defs/accuracy/test_disaggregated_serving.py
🚧 Files skipped from review as they are similar to previous changes (9)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tensorrt_llm/common/envUtils.h
- cpp/tensorrt_llm/common/envUtils.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
🧠 Learnings (5)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (16)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (2)
66-69: Getter naming and noexcept/[[nodiscard]] are consistent: LGTM
The const-ref getter for per-window hashes is clear and avoids copies.
93-94: Data member rename to per-window is consistent: LGTM
Switching to mBlockHashesPerWindow aligns with the new API and the rest of the PR.
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)
251-271: Building per-window map and total element count: LGTM
Collecting input blocks keyed by window and summing element counts aligns with splitKVCacheDispatch's expectations (element counts, not bytes). Good.
cpp/tests/batch_manager/cacheTransceiverTest.cpp (7)
430-437: Per-window write path looks correct (BlockRangeForWindow + device memset).
Good migration to per-window iteration. Using it->data() and it->getSizeInBytes() with cudaMemset per block is correct for the sentinel-based iterator, and aligns with the new BlockRangeForWindow API.
891-903: Per-window fill path is consistent and synced on correct stream.
The switch to iterating getWindowSizes() and filling via fillBlockData(*it, ...) per window is consistent with the new BlockRange APIs. Synchronizing the buffer manager stream for onlyWindowSize before sending is a good touch.
934-949: Verify off-by-one in per-window verification range.
startBlockId = size - (maxBlockInWindow + 1) makes the verifier include maxBlockInWindow + 1 trailing blocks. If intent is to check exactly one window's worth of blocks, consider dropping the + 1. If the extra one is intentional (e.g., to include the current partial block), please add a brief comment to document it.
Would you like me to adjust and re-run the tests to confirm both behaviors? I can also draft a small helper to compute the exact expected tail length based on tokensPerBlock and post-prefill semantics (per the retrieved learning); a rough sketch follows.
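For reference, the helper could look roughly like this. It is a sketch only: the rounding up of windowSize follows the post-prefill assumption that blocks are filled to capacity, and all names are illustrative:

```cpp
#include <algorithm>

// Hypothetical helper: how many trailing blocks one window can cover after
// prefill, when every block holds tokensPerBlock tokens.
int expectedTailBlocks(int windowSize, int tokensPerBlock, int totalBlocks)
{
    int const maxBlocksInWindow = (windowSize + tokensPerBlock - 1) / tokensPerBlock;
    return std::min(totalBlocks, maxBlocksInWindow);
}

// startBlockId would then be totalBlocks - expectedTailBlocks(...), without the +1.
```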
953-1014: fillBlockData: window-aware path and token indexing are solid.
- Passing windowSize down is correct and selecting the buffer manager via that size matches per-window pools.
- Using tokensPerBlock from the model config to advance token indices is aligned with the attention-window onboarding learning.
1016-1080: verifyBlockData mirrors fill logic correctly.
The verification path uses the same window-aware buffer manager and token indexing scheme, which keeps generation/verification in sync.
1082-1109: Deterministic seeding includes windowSize: good for per-window reproducibility.
Hashing windowSize into the seed ensures expected values differ across windows, avoiding cross-window collisions while remaining deterministic.
650-650: Doubling mMaxNumSequences to 16: check runtime/memory budget.
Increasing from 8 to 16 scales the total number of blocks and transferred bytes proportionally. Given recent CI failures, please confirm this does not push test time or memory over your CI limits on multi-GPU runs.
If helpful, I can add a quick gtest environment flag to reduce mMaxNumSequences under CI (e.g., via getenv) while keeping 16 for local stress testing; see the sketch below.
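A minimal sketch of that override (the environment variable name is hypothetical):

```cpp
#include <cstdlib>

// Sketch: let CI shrink the test size without changing the local default.
int maxNumSequencesForTest(int defaultValue = 16)
{
    if (char const* env = std::getenv("TLLM_TEST_MAX_NUM_SEQUENCES"))
    {
        return std::atoi(env);
    }
    return defaultValue;
}
```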
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)
45-52: RequestInfo: per-window block hashes constructor is sound.
Storing hashes as unordered_map<SizeType32, std::vector<size_t>> aligns with the per-window design. Moving the map in avoids copies.
238-279: Responder path correctly adopts per-window hashes into LlmRequest.
Receiving blockHashesPerWindow and forwarding via llmRequest->setRequestedBlockHashes(std::move(...)) is a clean handoff that avoids extra copies. The logic to wait for all counterparts before setting and sending remains intact.
Also applies to: 263-265
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4)
30-56: BlockRangeForWindow + iterator scaffold are clean and minimal.
Encapsulating per-window block IDs and pool with a sentinel-iterator is a neat fit for range-for and explicit begin/end loops.
113-120: getBlockRangeForWindow: good defensive check; return by value is fine.
Validating presence in mPoolsPerWindow and returning a self-contained BlockRangeForWindow avoids dangling references and keeps iteration simple.
137-166: Constructors correctly populate pools-per-window and block IDs.
- fromAllBlockIds: pulls per-window block IDs for the request; good.
- fromNewlyAllocatedBlockIds: collects IDs per window; good.
187-193: Iterator correctness and sentinel comparison look good.
Bounds checks, lazy slice creation, and sentinel-based termination are implemented properly. The implicit conversion to SharedPtr matches call sites; a stripped-down sketch of the sentinel pattern follows the range list below.
Also applies to: 230-233, 242-249, 255-258
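For readers new to the pattern, a stripped-down sentinel iterator looks like this (a sketch, not the PR's exact types):

```cpp
#include <cstddef>
#include <vector>

// Sketch: iteration ends when the index reaches the ID count, so no
// past-the-end iterator object has to be materialized.
struct BlockSentinel
{
};

class BlockIterator
{
public:
    explicit BlockIterator(std::vector<int> const& ids)
        : mIds(ids)
    {
    }

    int operator*() const
    {
        return mIds[mIdx];
    }

    BlockIterator& operator++()
    {
        ++mIdx;
        return *this;
    }

    bool operator==(BlockSentinel) const
    {
        return mIdx == mIds.size();
    }

private:
    std::vector<int> const& mIds;
    std::size_t mIdx{0};
};
```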
```diff
 auto const& windowSizes = blockRange.getWindowSizes();
 for (auto const& windowSize : windowSizes)
 {
-    std::vector<uint8_t> bytes(block.getSizeInBytes());
-    TLLM_CUDA_CHECK(cudaMemcpy(bytes.data(), block.data(), block.getSizeInBytes(), cudaMemcpyDeviceToHost));
-    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-        [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+    auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+    for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
+    {
+        std::vector<uint8_t> bytes(it->getSizeInBytes());
+        TLLM_CUDA_CHECK(cudaMemcpy(bytes.data(), it->data(), it->getSizeInBytes(), cudaMemcpyDeviceToHost));
+        EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
+            [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+    }
 }
```
Bug: precedence in equality check causes wrong assertion.
i == llmRequest->getPromptLen() & 0xff is parsed as (i == llmRequest->getPromptLen()) & 0xff. Use parentheses (or an explicit cast) to compare i against the masked byte.
Apply this diff:
```diff
-            EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-                [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+            EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
+                [&llmRequest](uint8_t i) {
+                    return i == static_cast<uint8_t>(llmRequest->getPromptLen() & 0xff);
+                }));
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
auto const& windowSizes = blockRange.getWindowSizes();
for (auto const& windowSize : windowSizes)
{
    auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
    for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
    {
        std::vector<uint8_t> bytes(it->getSizeInBytes());
        TLLM_CUDA_CHECK(cudaMemcpy(
            bytes.data(),
            it->data(),
            it->getSizeInBytes(),
            cudaMemcpyDeviceToHost));
        EXPECT_TRUE(std::all_of(
            bytes.begin(),
            bytes.end(),
            [&llmRequest](uint8_t i) {
                return i == static_cast<uint8_t>(
                    llmRequest->getPromptLen() & 0xff);
            }));
    }
}
```
🤖 Prompt for AI Agents
In cpp/tests/batch_manager/cacheTransceiverTest.cpp around lines 447 to 457, the
lambda in the EXPECT_TRUE uses the expression i == llmRequest->getPromptLen() &
0xff which is parsed as (i == llmRequest->getPromptLen()) & 0xff; change it to
compare the masked value explicitly, e.g. i == (llmRequest->getPromptLen() &
0xff) or i == static_cast<uint8_t>(llmRequest->getPromptLen()), so the byte-wise
comparison is performed correctly.
885509d to f200189 (Compare)
/bot run --add-multi-gpu-test
PR_Github #15743 [ run ] triggered by Bot
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (4)
138-146: Zero-copy path must iterate per-window blocks
Send blocks grouped by window to preserve ordering and sizes per window.
Apply:
```diff
-    for (size_t i = 0; i < connections.size(); i++)
-    {
-        for (auto const& block : inputKvCacheBlocks)
-        {
-            session.send(i, block->data(), block->getSizeInBytes());
-        }
-    }
+    for (size_t i = 0; i < connections.size(); i++)
+    {
+        for (auto const& [w, blocks] : inputKvCacheBlocksPerWindow)
+        {
+            (void) w;
+            for (auto const& block : blocks)
+            {
+                session.send(i, block->data(), block->getSizeInBytes());
+            }
+        }
+    }
```
178-183: Don't remap all blocks under pool 0's window
You already have inputKvCacheBlocksPerWindow; use it directly.
Apply:
```diff
-    SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(0);
-    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocksPerWindow;
-    inputKvCacheBlocksPerWindow.emplace(window, inputKvCacheBlocks);
+    // inputKvCacheBlocksPerWindow was built above
```
492-501: Use the built per-window map for concat, not a single-window wrapper
Stop wrapping all buffers under pool 0's window. Pass the real per-window map.
Apply:
```diff
-    {
-        std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> outputCachesPerWindow;
-        SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(0);
-        outputCachesPerWindow.emplace(window, outputBuffers);
-        NVTX3_SCOPED_RANGE(formatInputConcatenate);
-
-        // recvSplitCaches size == ppdomainsize
-        executor::kv_cache::concatKvCacheV2Dispatch(
-            recvSplitCaches, outputCachesPerWindow, destConfig, selfConfig, selfIdx, bufferManager);
-    }
+    {
+        NVTX3_SCOPED_RANGE(formatInputConcatenate);
+        // recvSplitCaches size == ppdomainsize
+        executor::kv_cache::concatKvCacheV2Dispatch(
+            recvSplitCaches, outputCachesPerWindow, destConfig, selfConfig, selfIdx, bufferManager);
+    }
```
153-156: Use per-window block size for buffer allocation
Derive cacheBlockSize from the per-window map rather than the flattened vector to handle multiple attention windows correctly.
File: cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp, lines 153-156
```diff
- auto cacheBlockSize = inputKvCacheBlocks.at(0)->getSize();
+ auto cacheBlockSize = inputKvCacheBlocksPerWindow.begin()->second.at(0)->getSize();
```
- Ensure that inputKvCacheBlocksPerWindow is populated with one entry per window returned by blockRange.getWindowSizes(), not just a single window.
- Add or update tests covering multiple attention windows to verify that the computed buffer sizes and splits remain correct.
♻️ Duplicate comments (5)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)
246-285: Avoid duplicated validTokenNum logic (constructor vs. static function)
The same policy is implemented in two places; centralize to a helper to reduce drift.
You can add a small internal helper (in an anonymous namespace) and call it from both places:
```cpp
// Place near top of this translation unit
namespace
{
inline size_t calculateValidTokenNum(size_t windowSize, size_t maxNumTokens, size_t tokensPerBlock)
{
    auto v = (windowSize < maxNumTokens) ? std::min(maxNumTokens, windowSize + tokensPerBlock) : maxNumTokens;
    if (tensorrt_llm::common::getEnvKVCacheTransferAllBlocksForWindow())
    {
        v = maxNumTokens;
    }
    return v;
}
} // namespace
```
Then replace the inlined computations at Lines 213-219 and 265-271 with calculateValidTokenNum(...).
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
91-95: Use consistent container for validation in setBlockIdsForAllWindows
Align with setBlockIdsForWindow; validate against mBlockIdsPerWindow, not mPoolsPerWindow.
Apply:
```diff
-        TLLM_CHECK_WITH_INFO(
-            mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d should exists", windowSize);
+        TLLM_CHECK_WITH_INFO(
+            mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
+            "Window size %d should exist", static_cast<int>(windowSize));
```
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
113-127: Bug: per-window blocks are flattened; preserve per-window grouping
Flattening discards window boundaries and later uses pool 0's window for all. Build a true per-window map upfront.
Apply:
```diff
-    int blockNum = 0;
-    std::vector<runtime::ITensor::SharedPtr> inputKvCacheBlocks;
+    int blockNum = 0;
+    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocksPerWindow;
@@
-    for (auto const& windowSize : windowSizes)
-    {
-        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
-        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
-        {
-            inputKvCacheBlocks.push_back(it);
-            blockNum++;
-        }
-    }
+    for (auto const& windowSize : windowSizes)
+    {
+        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+        auto& vec = inputKvCacheBlocksPerWindow[windowSize];
+        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
+        {
+            vec.push_back(it);
+            ++blockNum;
+        }
+    }
```
313-327: Receiving path: also preserve per-window grouping (don't flatten)
Build outputCachesPerWindow and keep a flat view only where needed (zcopy/size checks).
Apply:
```diff
-    auto const& windowSizes = blockRange.getWindowSizes();
-    TLLM_CHECK_WITH_INFO(
-        static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");
-    // TODO(oargov): are we sure the other side has the same number of pools? this might not hold for pp_size>1...
-    size_t blockNum = 0;
-    for (auto const& windowSize : windowSizes)
-    {
-        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
-        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
-        {
-            outputBuffers.push_back(it);
-            blockNum++;
-        }
-    }
+    auto const& windowSizes = blockRange.getWindowSizes();
+    TLLM_CHECK_WITH_INFO(static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");
+    // Build per-window mapping and a flat view for zcopy and size computations.
+    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> outputCachesPerWindow;
+    size_t blockNum = 0;
+    for (auto const& windowSize : windowSizes)
+    {
+        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+        auto& vec = outputCachesPerWindow[windowSize];
+        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
+        {
+            vec.push_back(it);
+            ++blockNum;
+        }
+    }
+    for (auto const& [w, vec] : outputCachesPerWindow)
+    {
+        (void) w;
+        outputBuffers.insert(outputBuffers.end(), vec.begin(), vec.end());
+    }
```
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
455-457: Fix bitwise precedence in byte comparison
The current lambda evaluates (i == llmRequest->getPromptLen()) & 0xff. Compare against the masked byte explicitly.
Apply:
```diff
-    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-        [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
+        [&llmRequest](uint8_t i) {
+            return i == static_cast<uint8_t>(llmRequest->getPromptLen() & 0xff);
+        }));
```
🧹 Nitpick comments (3)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)
259-284: Local naming style nit: prefer lowerCamelCase for locals
TransferBufferSize/RecvBufferCount/SendBufferCount/PreAllocBufferSize are local variables; per guidelines, use lowerCamelCase.
Apply:
```diff
-    size_t TransferBufferSize = common::getEnvMemSizeForKVCacheTransferBuffer();
+    size_t transferBufferSize = common::getEnvMemSizeForKVCacheTransferBuffer();
@@
-    TransferBufferSize = 0;
+    transferBufferSize = 0;
@@
-    TransferBufferSize += validTokenNum * cacheSizeBytesPerToken;
+    transferBufferSize += validTokenNum * cacheSizeBytesPerToken;
@@
     if (useFabricMemory)
     {
-        TransferBufferSize = FabricMemory::getAlignedSize(TransferBufferSize);
+        transferBufferSize = FabricMemory::getAlignedSize(transferBufferSize);
     }
-    size_t RecvBufferCount = common::getEnvRequestKVCacheConcurrent() ? common::getEnvKVCacheRecvBufferCount() : 1;
-    size_t SendBufferCount = common::getEnvParallelCacheSend() ? common::getEnvKVCacheSendMaxConcurrenceNum() : 1;
-    size_t PreAllocBufferSize = TransferBufferSize * (RecvBufferCount + SendBufferCount);
-    return PreAllocBufferSize;
+    size_t recvBufferCount = common::getEnvRequestKVCacheConcurrent() ? common::getEnvKVCacheRecvBufferCount() : 1;
+    size_t sendBufferCount = common::getEnvParallelCacheSend() ? common::getEnvKVCacheSendMaxConcurrenceNum() : 1;
+    size_t preAllocBufferSize = transferBufferSize * (recvBufferCount + sendBufferCount);
+    return preAllocBufferSize;
```
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)
80-85: Fix message grammar and format spec; keep types explicit
Minor polish and safer formatting.
Apply:
```diff
-    TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
-        "Window size %d should exists", windowSize);
+    TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
+        "Window size %d should exist", static_cast<int>(windowSize));
```
115-121: Add check for blockIds presence before at() to avoid exceptions
getBlockRangeForWindow validates pools but not block IDs; add a symmetric check for mBlockIdsPerWindow.
Apply:
```diff
     TLLM_CHECK_WITH_INFO(
         mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d not found", windowSize);
+    TLLM_CHECK_WITH_INFO(
+        mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(), "Block IDs for window %d not found",
+        static_cast<int>(windowSize));
     auto pool = mPoolsPerWindow.at(windowSize).front();
     auto blockIds = mBlockIdsPerWindow.at(windowSize);
     return BlockRangeForWindow(std::move(blockIds), std::move(pool));
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (18)
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (12 hunks)
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2 hunks)
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2 hunks)
- tests/integration/defs/accuracy/test_disaggregated_serving.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/integration/defs/accuracy/test_disaggregated_serving.py
🚧 Files skipped from review as they are similar to previous changes (12)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
- cpp/tensorrt_llm/common/envUtils.cpp
- cpp/tensorrt_llm/common/envUtils.h
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h🧠 Learnings (4)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
255-258: Begin() implementation LGTM
The iterator contract and sentinel usage are consistent and safe.
cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)
429-437: Per-window prefill in test is fine
Filling per-window blocks with the prompt length is appropriate for reuse checks.
649-650: Test capacity increase LGTM
Raising mMaxNumSequences to 16 improves coverage for windowed attention scenarios.
1267-1269: DP test lengths updated; windowSize is accounted in expected-value generator
Verified that generateExpectedValue mixes in windowSize when seeding, so the varied request lengths (60, 30, 60, 10) produce correct expected values.
```cpp
auto validTokenNum
    = (windowSize < maxNumTokens.value() ? (windowSize + tokensPerBlock) : maxNumTokens.value());
if (common::getEnvKVCacheTransferAllBlocksForWindow())
{
    validTokenNum = maxNumTokens.value();
}
```
Clamp validTokenNum to maxNumTokens to avoid overallocation
When windowSize < maxNumTokens, adding tokensPerBlock can exceed maxNumTokens, leading to buffer overallocation. Clamp to maxNumTokens.
Apply:
```diff
-    auto validTokenNum
-        = (windowSize < maxNumTokens.value() ? (windowSize + tokensPerBlock) : maxNumTokens.value());
+    auto validTokenNum = (windowSize < maxNumTokens.value())
+        ? std::min(maxNumTokens.value(), windowSize + tokensPerBlock)
+        : maxNumTokens.value();
     if (common::getEnvKVCacheTransferAllBlocksForWindow())
     {
         validTokenNum = maxNumTokens.value();
     }
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
auto validTokenNum = (windowSize < maxNumTokens.value())
    ? std::min(maxNumTokens.value(), windowSize + tokensPerBlock)
    : maxNumTokens.value();
if (common::getEnvKVCacheTransferAllBlocksForWindow())
{
    validTokenNum = maxNumTokens.value();
}
```
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp around lines 213 to 219,
validTokenNum is computed as (windowSize < maxNumTokens ? windowSize +
tokensPerBlock : maxNumTokens) which can exceed maxNumTokens when tokensPerBlock
is added; change the logic to clamp validTokenNum to maxNumTokens (e.g., compute
tentative = windowSize + tokensPerBlock when windowSize < maxNumTokens, then set
validTokenNum = std::min(tentative, maxNumTokens.value()); keep the existing
override for getEnvKVCacheTransferAllBlocksForWindow() that forces
maxNumTokens).
```cpp
auto validTokenNum = (static_cast<size_t>(windowSize) < maxNumTokens.value()
    ? static_cast<size_t>(windowSize) + tokensPerBlock
    : maxNumTokens.value());
if (common::getEnvKVCacheTransferAllBlocksForWindow())
{
    validTokenNum = maxNumTokens.value();
}
```
Same clamping bug in preAllocBufferSize; fix mirrors constructor
Ensure validTokenNum never exceeds maxNumTokens here as well.
Apply:
```diff
-    auto validTokenNum = (static_cast<size_t>(windowSize) < maxNumTokens.value()
-        ? static_cast<size_t>(windowSize) + tokensPerBlock
-        : maxNumTokens.value());
+    auto validTokenNum = (static_cast<size_t>(windowSize) < maxNumTokens.value())
+        ? std::min(maxNumTokens.value(), static_cast<size_t>(windowSize) + static_cast<size_t>(tokensPerBlock))
+        : maxNumTokens.value();
     if (common::getEnvKVCacheTransferAllBlocksForWindow())
     {
         validTokenNum = maxNumTokens.value();
     }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
auto validTokenNum = (static_cast<size_t>(windowSize) < maxNumTokens.value())
    ? std::min(maxNumTokens.value(),
        static_cast<size_t>(windowSize) + static_cast<size_t>(tokensPerBlock))
    : maxNumTokens.value();
if (common::getEnvKVCacheTransferAllBlocksForWindow())
{
    validTokenNum = maxNumTokens.value();
}
```
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp around lines 265 to 271,
validTokenNum is computed but not properly clamped to maxNumTokens causing the
same overflow bug as elsewhere; adjust the logic so after computing the
tentative validTokenNum you clamp it to at most maxNumTokens.value() (i.e., set
validTokenNum = std::min(validTokenNum, maxNumTokens.value())), and keep the
existing branch for getEnvKVCacheTransferAllBlocksForWindow() so that when that
flag is set validTokenNum is assigned maxNumTokens.value(); ensure types remain
size_t and no extra arithmetic can exceed the cap.
```cpp
nb::arg("cache_size_bytes_per_token_per_window"), nb::arg("tokens_per_block"),
nb::arg("cache_transceiver_config") = nb::none());
```
💡 Verification agent
🧩 Analysis chain
Binding signature update looks correct; ensure Python callers are updated
The added tokens_per_block argument and defaulted config align with the updated C++ signature.
Run this to find Python/C++ callsites still using the old arity:
🏁 Script executed:
```bash
#!/bin/bash
# Verify all callsites of pre_alloc_buffer_size pass tokens_per_block
rg -n -C2 -P '\bpre_alloc_buffer_size\s*\(' --type=py --type=cpp --type=cxx --type=cu
```
Length of output: 115
🏁 Script executed:
```bash
#!/bin/bash
# Verify all callsites of pre_alloc_buffer_size pass tokens_per_block across relevant file types
rg -n -C2 '\bpre_alloc_buffer_size\s*\(' -g '*.py' -g '*.cpp' -g '*.cc' -g '*.cu'
```
Length of output: 725
Update Python wrapper to pass tokens_per_block to C++ API
The Python static method pre_alloc_buffer_size in
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
still invokes the C++ binding with only two arguments. You need to:
• Change the method signature to accept the new tokens_per_block parameter (and default the config):
```diff
-    @staticmethod
-    def pre_alloc_buffer_size(kv_cache_size_per_token: int,
-                              cache_transceiver_config: CacheTransceiverConfig):
-        return CacheTransBufferManagerCpp.pre_alloc_buffer_size(
-            kv_cache_size_per_token, cache_transceiver_config)
+    @staticmethod
+    def pre_alloc_buffer_size(
+        kv_cache_size_per_token: int,
+        tokens_per_block: int,
+        cache_transceiver_config: CacheTransceiverConfig = None
+    ):
+        return CacheTransBufferManagerCpp.pre_alloc_buffer_size(
+            kv_cache_size_per_token, tokens_per_block, cache_transceiver_config
+        )
```
• Ensure any callers of pre_alloc_buffer_size in Python now supply the new tokens_per_block argument.
Committable suggestion skipped: line range outside the PR's diff.
PR_Github #15743 [ run ] completed with state
f200189 to 3f7dc47 Compare
/bot run --add-multi-gpu-test
Actionable comments posted: 0
♻️ Duplicate comments (6)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
24-24: Missing <unordered_map> include for new RequestInfo member
RequestInfo now uses std::unordered_map, but this header does not include <unordered_map>. Add it to keep the header self-contained.
Add the missing include after line 22:
 #include <map>
+#include <unordered_map>
 #include <string>
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (2)
213-219: Potential buffer overallocation when windowSize + tokensPerBlock exceeds maxNumTokens
When windowSize < maxNumTokens, adding tokensPerBlock can exceed maxNumTokens, leading to buffer overallocation. The calculation should be clamped.
Apply this fix to properly clamp the value:
-    auto validTokenNum
-        = (windowSize < maxNumTokens.value() ? (windowSize + tokensPerBlock) : maxNumTokens.value());
+    auto validTokenNum = (windowSize < maxNumTokens.value())
+        ? std::min(maxNumTokens.value(), windowSize + tokensPerBlock)
+        : maxNumTokens.value();
     if (common::getEnvKVCacheTransferAllBlocksForWindow())
     {
         validTokenNum = maxNumTokens.value();
     }
265-271: Same clamping issue in preAllocBufferSize method
The static preAllocBufferSize method has the same potential overflow issue where validTokenNum can exceed maxNumTokens.
Apply the same fix here:
-    auto validTokenNum = (static_cast<size_t>(windowSize) < maxNumTokens.value()
-        ? static_cast<size_t>(windowSize) + tokensPerBlock
-        : maxNumTokens.value());
+    auto validTokenNum = (static_cast<size_t>(windowSize) < maxNumTokens.value())
+        ? std::min(maxNumTokens.value(), static_cast<size_t>(windowSize) + static_cast<size_t>(tokensPerBlock))
+        : maxNumTokens.value();
     if (common::getEnvKVCacheTransferAllBlocksForWindow())
     {
         validTokenNum = maxNumTokens.value();
     }
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)
82-85: Fix format specifier type mismatch in error message
The format string uses %d but windowSize is SizeType32 (likely uint32_t). This can cause undefined behavior.
Fix the format specifier:
 TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
-    "Window size %d should exists", windowSize);
+    "Window size %u should exist", windowSize);
Also note the typo: "exists" should be "exist".
89-95: Inconsistent validation between setBlockIdsForWindow and setBlockIdsForAllWindows
setBlockIdsForWindow checks existence in mBlockIdsPerWindow, but setBlockIdsForAllWindows checks against mPoolsPerWindow. This inconsistency could cause issues.
Make the validation consistent:
 void setBlockIdsForAllWindows(std::unordered_map<SizeType32, std::vector<SizeType32>> blockIdsPerWindow)
 {
     for (auto const& [windowSize, blockIds] : blockIdsPerWindow)
     {
         TLLM_CHECK_WITH_INFO(
-            mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d should exists", windowSize);
+            mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(), "Window size %u should exist", windowSize);
     }
     mBlockIdsPerWindow = std::move(blockIdsPerWindow);
 }
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
447-457: Fix operator precedence issue in lambda expression
The expression i == llmRequest->getPromptLen() & 0xff is parsed as (i == llmRequest->getPromptLen()) & 0xff due to operator precedence, which is incorrect.
Fix the precedence issue:
 EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-    [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+    [&llmRequest](uint8_t i) { return i == static_cast<uint8_t>(llmRequest->getPromptLen() & 0xff); }));
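The precedence trap can be reproduced in isolation; a minimal sketch (values chosen purely for illustration):

```cpp
#include <cstdio>

int main()
{
    int const promptLen = 300;
    unsigned char const byte = 44; // 300 & 0xff == 44

    // == binds tighter than &, so this is (byte == promptLen) & 0xff, i.e. 0.
    bool const wrong = byte == promptLen & 0xff;
    // Masking first compares the actual low byte and yields true.
    bool const right = byte == static_cast<unsigned char>(promptLen & 0xff);

    std::printf("wrong=%d right=%d\n", wrong, right);
    return 0;
}
```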
🧹 Nitpick comments (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
145-152: Remove commented-out code
Line 145 contains commented-out code that should be removed. Dead code should not be left in the codebase.
- // cacheManager.getBlockManager.getPrimaryPool(0); auto poolNum = mManager->getNumPools();
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (18)
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (12 hunks)
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2 hunks)
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2 hunks)
- tests/integration/defs/accuracy/test_disaggregated_serving.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/integration/defs/accuracy/test_disaggregated_serving.py
🚧 Files skipped from review as they are similar to previous changes (12)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
- cpp/tensorrt_llm/common/envUtils.cpp
- cpp/tensorrt_llm/common/envUtils.h
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
🧠 Learnings (4)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (2)
53-55: LGTM! Per-window block hash handling looks good
The constructor signature properly accepts per-window block hashes using move semantics, which is appropriate for transferring ownership of potentially large data structures.
66-69: LGTM! Getter properly returns const reference
The getter correctly returns a const reference and is marked noexcept, following best practices for non-modifying accessors.
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
27-56: LGTM! BlockRangeForWindow class design is solid
The new BlockRangeForWindow class properly encapsulates per-window block iteration with appropriate use of move semantics and const-correctness.
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2)
55-55: LGTM! Test properly uses move semantics for BlockRangeForWindow
The test correctly transfers ownership of both blockIds and pool using std::move.
126-128: LGTM! Test correctly creates intermediate vectors for move semantics
The test properly creates local copies before moving them into BlockRangeForWindow, ensuring correct ownership transfer.
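The ownership-transfer pattern these tests exercise can be sketched with a self-contained toy; the class below is illustrative only and is not the real BlockRangeForWindow:

```cpp
#include <cstdio>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy per-window view that takes ownership of its block ids via move.
class WindowView
{
public:
    explicit WindowView(std::vector<int> blockIds)
        : mBlockIds(std::move(blockIds))
    {
    }

    std::vector<int> const& blockIds() const
    {
        return mBlockIds;
    }

private:
    std::vector<int> mBlockIds;
};

int main()
{
    std::unordered_map<int, std::vector<int>> idsPerWindow{{512, {0, 1, 2}}, {4096, {3, 4}}};
    for (auto& [windowSize, ids] : idsPerWindow)
    {
        WindowView view(std::move(ids)); // ownership moves into the view; 'ids' is left empty
        std::printf("window %d holds %zu blocks\n", windowSize, view.blockIds().size());
    }
    return 0;
}
```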
BlockRangeForWindow, ensuring correct ownership transfer.cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)
649-649: Increased max sequences without justification
mMaxNumSequences has been doubled from 8 to 16. This could impact memory usage and test performance.
Is this increase necessary for the new per-window functionality? Consider documenting why this change was made or if it's just for testing purposes.
953-956: LGTM! Per-window API properly integrated into tests
The fillBlockData and verifyBlockData functions have been correctly updated to accept a windowSize parameter with appropriate default value handling.
1267-1267: Test data set reduced without explanation
The test loop has been reduced from {60, 30, 60, 10} to just {60, 30, 60, 10}, but the impact isn't clear.
Was this intentional? If the test coverage is being reduced, please ensure it still adequately tests the per-window functionality.
PR_Github #15763 [ run ] triggered by Bot
PR_Github #15763 [ run ] completed with state
3f7dc47 to ef0a620 Compare
Actionable comments posted: 4
♻️ Duplicate comments (4)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
87-95: Make window validation consistent across setters (and check the authoritative map).
setBlockIdsForWindow validates against mBlockIdsPerWindow while setBlockIdsForAllWindows validates against mPoolsPerWindow. The latter is the authoritative list of supported windows (derived from the manager). Validate against mPoolsPerWindow in both to: (1) allow first-time insertion into mBlockIdsPerWindow, and (2) avoid false negatives.
-    void setBlockIdsForWindow(SizeType32 windowSize, std::vector<SizeType32> blockIds)
+    void setBlockIdsForWindow(SizeType32 windowSize, std::vector<SizeType32> blockIds)
     {
-        TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
-            "Window size %d should exists", windowSize);
+        TLLM_CHECK_WITH_INFO(mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(),
+            "Window size %d should exist", static_cast<int>(windowSize));
         mBlockIdsPerWindow[windowSize] = std::move(blockIds);
     }
 @@
-        for (auto const& [windowSize, blockIds] : blockIdsPerWindow)
+        for (auto const& [windowSize, blockIds] : blockIdsPerWindow)
         {
-            TLLM_CHECK_WITH_INFO(
-                mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d should exists", windowSize);
+            TLLM_CHECK_WITH_INFO(mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(),
+                "Window size %d should exist", static_cast<int>(windowSize));
         }
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
60-79: Fix off-by-one in window cropping and guard map access.
Use ceil_div for blocks-in-window and avoid .at() on potentially missing keys. This was raised previously and remains unfixed here.
-    bool needReuse = !common::getEnvDisableSelectiveCacheTransfer();
-    auto const& requestedBlockHashesPerWindow = llmRequest.getRequestedBlockHashesPerWindow();
-    for (auto const& [windowSize, metadata] : windowsMetadata)
-    {
-        SizeType32 reuseStartBlockIdx
-            = (needReuse && requestedBlockHashesPerWindow.at(windowSize).size() > 0
-                  && requestedBlockHashesPerWindow.at(windowSize).size() < blockIdsPerWindow.at(windowSize).size())
-            ? (blockIdsPerWindow.at(windowSize).size() - requestedBlockHashesPerWindow.at(windowSize).size())
-            : 0;
-        auto windowStartBlockIdx = needSendAllForWindow
-            ? 0
-            : static_cast<SizeType32>(blockIdsPerWindow.at(windowSize).size())
-                - (windowSize / cacheManager->getBlockManager().getTokensPerBlock() + 1);
-        // TODO: promptLen to get the startBlockIdx
-        SizeType32 startBlockIdx = std::max(0, std::max(reuseStartBlockIdx, windowStartBlockIdx));
+    bool needReuse = !common::getEnvDisableSelectiveCacheTransfer();
+    auto const& requestedBlockHashesPerWindow = llmRequest.getRequestedBlockHashesPerWindow();
+    auto const tokensPerBlock = cacheManager->getBlockManager().getTokensPerBlock();
+    for (auto const& [windowSize, metadata] : windowsMetadata)
+    {
+        auto const& ids = blockIdsPerWindow.at(windowSize);
+        SizeType32 reuseStartBlockIdx = 0;
+        if (needReuse)
+        {
+            auto it = requestedBlockHashesPerWindow.find(windowSize);
+            auto requested = (it != requestedBlockHashesPerWindow.end()) ? it->second.size() : 0UL;
+            if (requested > 0 && requested < ids.size())
+            {
+                reuseStartBlockIdx = static_cast<SizeType32>(ids.size() - requested);
+            }
+        }
+        SizeType32 blocksInWindow = static_cast<SizeType32>((windowSize + tokensPerBlock - 1) / tokensPerBlock);
+        SizeType32 windowStartBlockIdx = needSendAllForWindow
+            ? 0
+            : static_cast<SizeType32>(
+                std::max<ptrdiff_t>(0, static_cast<ptrdiff_t>(ids.size()) - static_cast<ptrdiff_t>(blocksInWindow)));
+        SizeType32 startBlockIdx = std::max(reuseStartBlockIdx, windowStartBlockIdx);
         TLLM_LOG_DEBUG(
             "getBlockRangeForSending windowSize: %d, startBlockIdx: %d reuseStartBlockIdx: %d windowStartBlockIdx: %d",
             windowSize, startBlockIdx, reuseStartBlockIdx, windowStartBlockIdx);
-        blockRange.setBlockIdsForWindow(windowSize,
-            std::vector<SizeType32>(
-                blockIdsPerWindow.at(windowSize).begin() + startBlockIdx, blockIdsPerWindow.at(windowSize).end()));
+        blockRange.setBlockIdsForWindow(windowSize, std::vector<SizeType32>(ids.begin() + startBlockIdx, ids.end()));
     }
100-111: Receiving side: same ceil_div and clamping needed.
Mirror the fix during receive to avoid off-by-one errors and negative indices.
-    for (auto const& [windowSize, metadata] : windowsMetadata)
-    {
-        auto const& blockIdsPerWindow = blockRange.getBlockIdsPerWindow();
-        auto windowStartBlockIdx = static_cast<SizeType32>(blockIdsPerWindow.at(windowSize).size())
-            - (windowSize / cacheManager->getBlockManager().getTokensPerBlock() + 1);
-        SizeType32 startBlockIdx = std::max(0, windowStartBlockIdx);
-        blockRange.setBlockIdsForWindow(windowSize,
-            std::vector<SizeType32>(
-                blockIdsPerWindow.at(windowSize).begin() + startBlockIdx, blockIdsPerWindow.at(windowSize).end()));
-    }
+    auto const tokensPerBlock = cacheManager->getBlockManager().getTokensPerBlock();
+    for (auto const& [windowSize, metadata] : windowsMetadata)
+    {
+        auto const& ids = blockRange.getBlockIdsPerWindow().at(windowSize);
+        SizeType32 blocksInWindow = static_cast<SizeType32>((windowSize + tokensPerBlock - 1) / tokensPerBlock);
+        SizeType32 startBlockIdx = static_cast<SizeType32>(
+            std::max<ptrdiff_t>(0, static_cast<ptrdiff_t>(ids.size()) - static_cast<ptrdiff_t>(blocksInWindow)));
+        blockRange.setBlockIdsForWindow(windowSize, std::vector<SizeType32>(ids.begin() + startBlockIdx, ids.end()));
+    }
429-457: Fix precedence in byte verification; compare masked byte.
The current expression is parsed as (i == promptLen) & 0xff. Compare against the masked byte.
-    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-        [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
+        [&llmRequest](uint8_t i) {
+            return i == static_cast<uint8_t>(llmRequest->getPromptLen() & 0xff);
+        }));
🧹 Nitpick comments (10)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (3)
80-85: Fix format specifier and grammar in TLLM_CHECK_WITH_INFO.
Use a proper specifier for SizeType32 (or cast) and fix "exists" → "exist".
-    TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
-        "Window size %d should exists", windowSize);
+    TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
+        "Window size %d should exist", static_cast<int>(windowSize));
145-151: Remove dead/commented-out code.
The commented line adds noise and violates repo guidelines on dead code.
- // cacheManager.getBlockManager.getPrimaryPool(0);
1-1: Optional: adopt include guards per project guideline.
Headers prefer named include guards (TRTLLM__H). #pragma once is fine in practice, but consider aligning with the stated repo guideline.
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)
380-383: Wrong printf specifier for size_t.
defaultTransSize is size_t; %d is UB on LP64. Use %zu or cast.
-    TLLM_LOG_INFO("defaultTransSize: %d", defaultTransSize);
+    TLLM_LOG_INFO("defaultTransSize: %zu", static_cast<size_t>(defaultTransSize));
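For reference, a minimal example of the matching specifier for size_t, plus the cast fallback when a legacy format string expects int:

```cpp
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> v(42);
    // size() returns std::size_t; %zu is the portable specifier for it.
    std::printf("size: %zu\n", v.size());
    // If a legacy format string expects int, cast explicitly instead.
    std::printf("size: %d\n", static_cast<int>(v.size()));
    return 0;
}
```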
97-137: Test-only process management: ensure waitpid decl.
These test cases use fork()/waitpid(). Consider adding <sys/wait.h> explicitly to avoid relying on transitive includes across platforms.
 #include <csignal>
+#include <sys/wait.h>
Also applies to: 141-181, 329-367, 371-437
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (4)
253-257: Printf specifier mismatch for size_t.
windowSizes.size() is size_t. Fix the format string to avoid UB.
-    TLLM_LOG_DEBUG(
-        mpi::MpiComm::world().getRank(), " blockRange.getWindowSizes(); windowSizes size: %d", windowSizes.size());
+    TLLM_LOG_DEBUG(
+        mpi::MpiComm::world().getRank(), " blockRange.getWindowSizes(); windowSizes size: %zu",
+        static_cast<size_t>(windowSizes.size()));
545-548: Remove commented-out call.
// blockRange.updatePoolIdx(poolIdx); is dead code.
-    // blockRange.updatePoolIdx(poolIdx);
484-502: Printf specifier mismatch for multiple size_t logs.
Use %zu for size_t.
-    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " unformat windowSizes size: %d", windowSizes.size());
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " unformat windowSizes size: %zu",
+        static_cast<size_t>(windowSizes.size()));
 @@
-    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), "outputBuffersPerWindow size: %ld,blockNum: %d , windowSizes: %ld",
-        outputBuffersPerWindow.size(), blockNum, windowSizes.size());
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(),
+        "outputBuffersPerWindow size: %zu, blockNum: %d, windowSizes: %zu",
+        static_cast<size_t>(outputBuffersPerWindow.size()), blockNum, static_cast<size_t>(windowSizes.size()));
321-335: Agent path: assert message typo and clarity.
Minor wording cleanup: "dupliacete" → "duplicate".
-    TLLM_LOG_DEBUG(" formatOutput bufferTargetNum: %d, targetNum: %d, peerDuplicateHeadFactor: %d dupliacete:%d ",
+    TLLM_LOG_DEBUG(" formatOutput bufferTargetNum: %d, targetNum: %d, peerDuplicateHeadFactor: %d, duplicate:%d ",
         bufferTargetNum, targetNum, peerDuplicateHeadFactor, targetInfo.mDupHeadFactor);
933-949: Window verification: use ceil_div for blocks in window.
maxBlockInWindow = windowSize / tokensPerBlock undercounts when windowSize isn't a multiple of tokensPerBlock; also the + 1 in the start index is a tell. Use ceil_div and drop the ad-hoc +1.
-    int maxBlockInWindow = windowSize / mCacheState->getModelConfig().mTokensPerBlock;
-    int startBlockId = std::max(0, static_cast<int>(blockRangeForWindow.size()) - (maxBlockInWindow + 1));
+    auto const tpb = mCacheState->getModelConfig().mTokensPerBlock;
+    int blocksInWindow = static_cast<int>((windowSize + tpb - 1) / tpb);
+    int startBlockId = std::max(0, static_cast<int>(blockRangeForWindow.size()) - blocksInWindow);
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (18)
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (12 hunks)
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2 hunks)
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2 hunks)
- tests/integration/defs/accuracy/test_disaggregated_serving.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/integration/defs/accuracy/test_disaggregated_serving.py
🚧 Files skipped from review as they are similar to previous changes (13)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
- cpp/tensorrt_llm/common/envUtils.h
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/common/envUtils.cpp
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
🧠 Learnings (6)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
Applied to files:
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Applied to files:
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
🧬 Code Graph Analysis (3)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (2)
- preAllocBufferSize (246-285)
- preAllocBufferSize (246-248)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
cpp/tensorrt_llm/common/envUtils.cpp (4)
- getEnvKVCacheTransferAllBlocksForWindow (449-453)
- getEnvKVCacheTransferAllBlocksForWindow (449-449)
- getEnvDisableSelectiveCacheTransfer (321-325)
- getEnvDisableSelectiveCacheTransfer (321-321)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (8)
- llmRequest (421-460)
- llmRequest (421-421)
- llmRequest (882-914)
- llmRequest (882-882)
- llmRequest (916-923)
- llmRequest (916-916)
- llmRequest (925-951)
- llmRequest (925-925)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (2)
- tensorrt_llm (35-190)
- `` (66-69)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (3)
- tensorrt_llm (34-123)
- nodiscard (106-109)
- `` (101-104)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)
187-193: Iterator construction precondition is stricter than necessary.
The iterator never constructs an "end" iterator (you use a sentinel), so the current check is safe. Just calling out that if an end-iterator constructor is ever added, the precondition should allow mIdx == size().
Would you like me to extend BlockIterator with an explicit end() constructor for symmetry with begin()/Sentinel, or keep the sentinel-only design?
255-258: LGTM: clean sentinel-based iteration.
The sentinel pattern keeps the iterator minimal and avoids storing a second iterator. Nice.
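For readers unfamiliar with the pattern, a self-contained sketch of sentinel-terminated iteration (unrelated to the real BlockIterator internals):

```cpp
#include <cstdio>

struct Sentinel
{
};

// Iterator that carries its own limit and compares against a stateless sentinel.
struct CountingIter
{
    int value;
    int limit;

    int operator*() const
    {
        return value;
    }

    CountingIter& operator++()
    {
        ++value;
        return *this;
    }

    bool operator==(Sentinel) const
    {
        return value == limit;
    }

    bool operator!=(Sentinel s) const
    {
        return !(*this == s);
    }
};

struct CountingRange
{
    int limit;

    CountingIter begin() const
    {
        return {0, limit};
    }

    Sentinel end() const
    {
        return {};
    }
};

int main()
{
    // Range-for with differing begin()/end() types requires C++17.
    for (int v : CountingRange{3})
    {
        std::printf("%d\n", v);
    }
    return 0;
}
```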
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)
119-121: LGTM: tests updated for new preAllocBufferSize signature.
Passing tokensPerBlock matches the new API. Assertions remain consistent.
164-165: LGTM: second test updated to new signature.
Consistent with the kSELF path.
cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)
953-1014: Good test coverage for per-window data paths.
The helpers fillBlockData/verifyBlockData now parameterize by window and seed; this improves confidence in per-window correctness.
Also applies to: 1016-1080
647-705: Test capacity: mMaxNumSequences doubled (16).
Change aligns with expanded window coverage; ensure CI resources can handle increased allocations.
Would you like me to add a gtest filter to run a reduced subset under resource-constrained CI, and leave full suite for nightly?
892-903: No printf specifier issues found in cacheTransceiverTest.cpp
I’ve verified that there are no instances of logging a container’s .size() using %d in this file, so no %zu replacements are needed here. All existing %d specifiers match their argument types.
 size_t allCacheBlockSize = 0;
 auto const& windowSizes = blockRange.getWindowSizes();
 TLLM_LOG_DEBUG(
     mpi::MpiComm::world().getRank(), " blockRange.getWindowSizes(); windowSizes size: %d", windowSizes.size());
 TLLM_CHECK_WITH_INFO(
     static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");
 std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocks;
-for (auto poolIdx = 0; poolIdx < numPools; poolIdx++)
+for (auto const& windowSize : windowSizes)
 {
-    blockRange.updatePoolIdx(poolIdx);
-    SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(poolIdx);
-    TLLM_CHECK_WITH_INFO(inputKvCacheBlocks.find(window) == inputKvCacheBlocks.end(),
-        "window size already exists, which is not supported");
-    inputKvCacheBlocks.emplace(window, std::vector<runtime::ITensor::SharedPtr>());
-    auto maxBlockThisWindow = window / selfConfig.getModelConfig().mTokensPerBlock;
-    SizeType32 blockNumThisWindow = 0;
-    for (auto it = blockRange.begin(); it != blockRange.end(); ++it)
+    auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " format windowSize: %d blockRangeForWindow size: %d",
+        windowSize, blockRangeForWindow.size());
+    inputKvCacheBlocks.emplace(windowSize, std::vector<runtime::ITensor::SharedPtr>());
+    for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
     {
-        blockNum++;
-        inputKvCacheBlocks.at(window).push_back(it);
+        inputKvCacheBlocks.at(windowSize).push_back(it);
         allCacheBlockSize += it->getSize();
-        blockNumThisWindow++;
-        if (blockNumThisWindow >= maxBlockThisWindow)
-        {
-            break;
-        }
+        blockNum++;
     }
 }
 TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), "inputKvCacheBlocks size: %ld,blockNum: %d , windowSizes: %ld",
     inputKvCacheBlocks.size(), blockNum, windowSizes.size());
💡 Verification agent
🧩 Analysis chain
Automated scan for remaining element/byte mixups.
There may be other places with getSize() used for buffer sizing or network IO.
Also applies to: 481-502, 616-747
🏁 Script executed:
#!/bin/bash
# Find potential unit mix-ups: getSize() used near send/recv/bytes/BufferSize/targetBuffer
rg -n -C2 -e 'getSize\(\)' cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
rg -n -C2 -e '(targetBuffer(Size|Bytes)|getSizeInBytes\(\)|session\.(send|recv)\()' cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp

Length of output: 8400
Ensure byte-accurate sizing for all buffer allocations and network I/O
A number of calls to getSize() appear in contexts where the API expects bytes rather than element counts. Mixing element counts into byte-based calculations will lead to incorrect buffer sizes and potential data corruption or deadlocks. Please update these to use getSizeInBytes() (or multiply by element size) consistently.
Key locations to fix:
- Input cache sizing, line 269: allCacheBlockSize += it->getSize(); should be allCacheBlockSize += it->getSizeInBytes();
- Preallocated send buffer element size, line 383: auto sendBufferEleSize = preAllocSendBuffer->getSize(); should be auto sendBufferEleSize = preAllocSendBuffer->getSizeInBytes();
- Output cache sizing, line 496: cacheBlockSizeSum += it->getSize(); should be cacheBlockSizeSum += it->getSizeInBytes();
- General network I/O loops: session.send(..., layer->data(), layer->getSizeInBytes()) is correct, but ensure no calls still use layer->getSize(). Scan the blocks around lines 481–502 and 616–747 for similar patterns and correct them.
Consider adding a lint rule or helper wrapper to prevent accidental use of getSize() in byte-sensitive contexts going forward.
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around lines 251–275 (and
the other ranges called out), the code mixes element counts and byte sizes:
replace uses of it->getSize() with it->getSizeInBytes() for allCacheBlockSize at
line ~269, change preAllocSendBuffer->getSize() to
preAllocSendBuffer->getSizeInBytes() at line ~383, replace cacheBlockSizeSum +=
it->getSize() with cacheBlockSizeSum += it->getSizeInBytes() at line ~496, and
scan the nearby blocks (around 481–502 and 616–747) to ensure all network/send
calls and buffer-size calculations use getSizeInBytes() (or multiply by element
byte-size) instead of getSize(); optionally add a small helper or a linter check
to prevent future accidental use of getSize() in byte-sensitive contexts.
🛠️ Refactor suggestion
Unit bug: mixing elements and bytes when computing transfer sizes.
allCacheBlockSize and blockNum are later used to derive targetBufferSize and sliced for send/recv as bytes; currently you add it->getSize() (elements). This mis-sizes transfers for non-1-byte dtypes and causes incorrect slicing.
- size_t allCacheBlockSize = 0;
+ size_t allCacheBlockBytes = 0;
@@
- inputKvCacheBlocks.at(windowSize).push_back(it);
- allCacheBlockSize += it->getSize();
+ inputKvCacheBlocks.at(windowSize).push_back(it);
+ allCacheBlockBytes += it->getSizeInBytes();
blockNum++;
@@
- auto const targetBufferSize = allCacheBlockSize / targetNum * peerDuplicateHeadFactor;
+ auto const targetBufferBytes = allCacheBlockBytes / targetNum * peerDuplicateHeadFactor;
@@
- auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
- cacheBufferId, bufferTargetNum, targetBufferSize, bufferManager);
+ auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
+ cacheBufferId, bufferTargetNum, targetBufferBytes, bufferManager);Follow-up: in the chunked-send path below (bufferCoverTargetNum == 0), convert byte counts to element counts when slicing.
- size = outputSplitCaches[bufferIdx]->getSizeInBytes();
+ size = outputSplitCaches[bufferIdx]->getSizeInBytes();
@@
- size = targetBufferSize;
- size_t remainSendSize = targetBufferSize;
+ size = targetBufferBytes;
+ size_t remainSendBytes = targetBufferBytes;
while (remainSendSize > 0)
{
TLLM_CHECK(preAllocSendBuffer != nullptr);
- auto sendBufferEleSize = preAllocSendBuffer->getSize();
- auto sendSize = std::min(remainSendSize, sendBufferEleSize);
- auto copySlice = runtime::ITensor::slice(
- outputSplitCaches[bufferIdx], targetBufferSize - remainSendSize, sendSize);
- auto copyTargetSlice = runtime::ITensor::slice(preAllocSendBuffer, 0, sendSize);
+ auto elemSize = tensorrt_llm::common::getDTypeSize(preAllocSendBuffer->getDataType());
+ auto sendBufferBytes = preAllocSendBuffer->getSizeInBytes();
+ auto sendBytes = std::min(remainSendBytes, sendBufferBytes);
+ auto sendElems = sendBytes / elemSize;
+ auto offsetBytes = targetBufferBytes - remainSendBytes;
+ auto offsetElems = offsetBytes / elemSize;
+ auto copySlice = runtime::ITensor::slice(outputSplitCaches[bufferIdx], offsetElems, sendElems);
+ auto copyTargetSlice = runtime::ITensor::slice(preAllocSendBuffer, 0, sendElems);
bufferManager.copy(*copySlice, *copyTargetSlice);
bufferManager.getStream().synchronize();
session.send(processIdx, copyTargetSlice->data(), copyTargetSlice->getSizeInBytes());
- remainSendSize -= sendSize;
+ remainSendBytes -= sendBytes;
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
 // Rename and switch from element-count to byte-count accumulation
 size_t allCacheBlockBytes = 0;
 auto const& windowSizes = blockRange.getWindowSizes();
 TLLM_LOG_DEBUG(
     mpi::MpiComm::world().getRank(), " blockRange.getWindowSizes(); windowSizes size: %d", windowSizes.size());
 TLLM_CHECK_WITH_INFO(
     static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");
 std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocks;
 for (auto const& windowSize : windowSizes)
 {
     auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
     TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(),
         " format windowSize: %d blockRangeForWindow size: %d",
         windowSize, blockRangeForWindow.size());
     inputKvCacheBlocks.emplace(windowSize, std::vector<runtime::ITensor::SharedPtr>());
     for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
     {
         inputKvCacheBlocks.at(windowSize).push_back(it);
         allCacheBlockBytes += it->getSizeInBytes();
         blockNum++;
     }
 }
 TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(),
     "inputKvCacheBlocks size: %ld, blockNum: %d, windowSizes: %ld",
     inputKvCacheBlocks.size(), blockNum, windowSizes.size());
 // Compute target buffer in bytes, not elements
 auto const targetBufferBytes = allCacheBlockBytes / targetNum * peerDuplicateHeadFactor;
 auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
     cacheBufferId, bufferTargetNum, targetBufferBytes, bufferManager);
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around lines 251 to 275,
the code adds it->getSize() (elements) to allCacheBlockSize but later treats
allCacheBlockSize as bytes for buffer sizing and slicing; change accumulation to
byte counts (elements * element-size-in-bytes or use a provided
getByteSize()/getSizeInBytes() API) and ensure blockNum remains
element-agnostic; additionally, in the chunked-send path (where
bufferCoverTargetNum == 0) convert any byte-based offsets/lengths to element
counts before indexing/slicing tensors so slicing uses element counts not bytes.
 size_t blockNum = 0;
 size_t cacheBlockSizeSum = 0;
-for (auto poolIdx = 0; poolIdx < numPools; poolIdx++)
+auto windowSizes = blockRange.getWindowSizes();
+TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " unformat windowSizes size: %d", windowSizes.size());
+for (auto const& windowSize : windowSizes)
 {
-    blockRange.updatePoolIdx(poolIdx);
-    SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(poolIdx);
-    TLLM_CHECK_WITH_INFO(outputBuffersPerWindow.find(window) == outputBuffersPerWindow.end(),
-        "window size already exists, which is not supported");
-    outputBuffersPerWindow.emplace(window, std::vector<runtime::ITensor::SharedPtr>());
-    auto maxBlockThisWindow = window / selfConfig.getModelConfig().mTokensPerBlock;
-    SizeType32 blockNumThisWindow = 0;
-    for (auto it = blockRange.begin(); it != blockRange.end(); ++it)
+    auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " unformat windowSize: %d blockRangeForWindow size: %d",
+        windowSize, blockRangeForWindow.size());
+    outputBuffersPerWindow.emplace(windowSize, std::vector<runtime::ITensor::SharedPtr>());
+    for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
     {
-        blockNum++;
-        blockNumThisWindow++;
-        outputBuffersPerWindow.at(window).push_back(it);
+        outputBuffersPerWindow.at(windowSize).push_back(it);
         cacheBlockSizeSum += it->getSize();
-        if (blockNumThisWindow >= maxBlockThisWindow)
-        {
-            break;
-        }
+        blockNum++;
     }
🛠️ Refactor suggestion
Unit bug on receive: cacheBlockSizeSum must be bytes; align chunking logic.
Mirrors the send-side bug. Accumulate bytes and use bytes-to-elements conversion for slicing.
- size_t cacheBlockSizeSum = 0;
+ size_t cacheBlockBytesSum = 0;
@@
- cacheBlockSizeSum += it->getSize();
+ cacheBlockBytesSum += it->getSizeInBytes();
@@
- TLLM_CHECK(cacheBlockSizeSum % targetNum == 0);
- auto targetBufferSize = cacheBlockSizeSum / targetNum;
+ TLLM_CHECK(cacheBlockBytesSum % targetNum == 0);
+ auto targetBufferBytes = cacheBlockBytesSum / targetNum;
@@
- auto [recvSplitCachestmp, bufferCoverTargetNumtmp, onlyUseDynamicBuffer]
- = mCacheTransBufferManager->getOrAllocateRecvBuffers(
- cacheBufferId, targetNum, targetBufferSize, bufferManager);
+ auto [recvSplitCachestmp, bufferCoverTargetNumtmp, onlyUseDynamicBuffer]
+ = mCacheTransBufferManager->getOrAllocateRecvBuffers(
+ cacheBufferId, targetNum, targetBufferBytes, bufferManager);And in the chunked path:
- size_t remainRecvSize = targetBufferSize;
+ size_t remainRecvBytes = targetBufferBytes;
while (remainRecvSize > 0)
{
TLLM_CHECK(preAllocRecvBuffer != nullptr);
- auto recvBufferEleSize = preAllocRecvBuffer->getSize();
- auto recvSize = std::min(remainRecvSize, recvBufferEleSize);
- auto recvSlice = runtime::ITensor::slice(preAllocRecvBuffer, 0, recvSize);
- auto copySlice = runtime::ITensor::slice(
- recvSplitCaches[processIdx], targetBufferSize - remainRecvSize, recvSize);
+ auto elemSize = tensorrt_llm::common::getDTypeSize(preAllocRecvBuffer->getDataType());
+ auto recvBufferBytes = preAllocRecvBuffer->getSizeInBytes();
+ auto recvBytes = std::min(remainRecvBytes, recvBufferBytes);
+ auto recvElems = recvBytes / elemSize;
+ auto offsetBytes = targetBufferBytes - remainRecvBytes;
+ auto offsetElems = offsetBytes / elemSize;
+ auto recvSlice = runtime::ITensor::slice(preAllocRecvBuffer, 0, recvElems);
+ auto copySlice = runtime::ITensor::slice(recvSplitCaches[processIdx], offsetElems, recvElems);
size += recvSlice->getSizeInBytes();
llmRequest.updateKvCacheSize((*recvSlice).getSizeInBytes());
session.recv(pickUpConnections[processIdx], recvSlice->data(), recvSlice->getSizeInBytes());
bufferManager.copy(*recvSlice, *copySlice);
bufferManager.getStream().synchronize();
- remainRecvSize -= recvSize;
+ remainRecvBytes -= recvBytes;
  }
Also applies to: 616-624
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around lines 481-498 (and
also apply the same change at 616-624): cacheBlockSizeSum is being accumulated
in element counts instead of bytes which breaks chunking; change accumulation to
sum bytes (e.g., use each tensor/block's byte size or multiply element count by
element size) and, where slicing/chunking computes element counts from byte
sizes, convert bytes back to element counts by dividing by the element byte-size
before creating slices; apply the identical byte-based accumulation and
bytes-to-elements conversion in the chunked path at lines 616-624.
Signed-off-by: Chuang Zhu <[email protected]>
Signed-off-by: Chuang Zhu <[email protected]>
ef0a620 to 39306d6
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)
240-241: Bug: double-free of receive buffer index
recvBufferId is freed at Line 237 and then freed again at Lines 240-241. This can lead to undefined behavior in the buffer index allocator.
Apply:
- // Free buffer indices
- mTransBufferManager->freeBufferIndexForRecv(recvBufferId);
+ // Free buffer indices: already freed above; no-op here
+ (void) recvBufferId;
289-314: Bug: unreachable code after exit() in child process
exit(testing::Test::HasFailure() ? 1 : 0) at Line 289 terminates the child; the subsequent receive-buffer tests are never executed.
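For context before the fix below, the intended shape in isolation: a hedged sketch using plain POSIX fork, with names invented for illustration, in which the child runs every check and exits exactly once.
#include <cstdlib>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    pid_t pid = fork();
    if (pid == 0)
    {
        bool failed = false;
        // ... run every child-side check here, recording failures ...
        _exit(failed ? 1 : 0); // single exit point, reached only after all checks
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WEXITSTATUS(status); // parent propagates the child's verdict
}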
Refactor so the child exits only once, after all assertions:
- thread0.join();
- exit(testing::Test::HasFailure() ? 1 : 0);
-
- auto recvBufferId = mTransBufferManager->assignBufferIndexForRecv();
+ thread0.join();
+
+ // Continue with recv buffer assertions before exiting
+ auto recvBufferId = mTransBufferManager->assignBufferIndexForRecv();
  EXPECT_TRUE(recvBufferId.has_value());
  EXPECT_EQ(recvBufferId.value(), 0);
  auto recvBufferId2 = mTransBufferManager->assignBufferIndexForRecv();
  EXPECT_TRUE(recvBufferId2.has_value());
  EXPECT_EQ(recvBufferId2.value(), 1);
  auto time2 = std::chrono::steady_clock::now();
@@
  thread1.join();
  exit(testing::Test::HasFailure() ? 1 : 0);
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)
195-205: Measurement bug: size should be bytes, not elements, in chunked send
In the chunked branch (bufferCoverTargetNum == 0), size is set to targetBufferSize (elements). appendMeasure expects bytes. Compute bytes via dtype size and optionally rename remainSendSize to remainSendElems for clarity.
- size_t size;
+ size_t size = 0;
@@
  else
  {
      // bufferCoverTargetNum=0, mSendBuffer size < one outputSlice
      // send multiple times
-     size = targetBufferSize;
-     size_t remainSendSize = targetBufferSize;
+     auto const elemSize = tensorrt_llm::common::getDTypeSize(outputSplitCaches.at(cacheIdx)->getDataType());
+     size_t remainSendElems = targetBufferSize;
+     size = targetBufferSize * elemSize;
-     while (remainSendSize > 0)
+     while (remainSendElems > 0)
      {
          TLLM_CHECK(preAllocSendBuffer != nullptr);
          auto sendBufferEleSize = preAllocSendBuffer->getSize();
-         auto sendSize = std::min(remainSendSize, sendBufferEleSize);
+         auto sendSize = std::min(remainSendElems, sendBufferEleSize);
          auto copySlice = runtime::ITensor::slice(
-             outputSplitCaches.at(cacheIdx), targetBufferSize - remainSendSize, sendSize);
+             outputSplitCaches.at(cacheIdx), targetBufferSize - remainSendElems, sendSize);
          auto copyTargetSlice = runtime::ITensor::slice(preAllocSendBuffer, 0, sendSize);
          bufferManager.copy(*copySlice, *copyTargetSlice);
          bufferManager.getStream().synchronize();
          session.send(processIdx, copyTargetSlice->data(), copyTargetSlice->getSizeInBytes());
-         remainSendSize -= sendSize;
+         remainSendElems -= sendSize;
      }
  }
Also applies to: 217-233
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
368-371: Bug: copying from wrong source buffer index in fallback path
In the send fallback, you calculate bufferIdx, but copy from outputSplitCaches[processIdx]. This mismatches topology mapping and can transmit wrong data.
- bufferManager.copy(*outputSplitCaches[processIdx], *outputSplitCaches.at(sendBufferIdx));
+ bufferManager.copy(*outputSplitCaches.at(bufferIdx), *outputSplitCaches.at(sendBufferIdx));
377-394: Measurement bug: size should be bytes in chunked send
size is assigned targetBufferSize (elements). appendMeasure expects bytes. Compute bytes once or accumulate.
- size = targetBufferSize;
- size_t remainSendSize = targetBufferSize;
+ auto const elemSize = tensorrt_llm::common::getDTypeSize(outputSplitCaches.at(bufferIdx)->getDataType());
+ size = targetBufferSize * elemSize;
+ size_t remainSendSize = targetBufferSize; // elements
  while (remainSendSize > 0)
  {
      TLLM_CHECK(preAllocSendBuffer != nullptr);
      auto sendBufferEleSize = preAllocSendBuffer->getSize();
      auto sendSize = std::min(remainSendSize, sendBufferEleSize);
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
665-671: Bug: cacheType is shadowed and assignment is lost when kvFactor == 1
This introduces a new local cacheType instead of updating the outer one, leaving the type unchanged.
  CacheType cacheType = CacheType::kSELF;
  if (kvFactor == 1)
  {
-     auto cacheType = CacheType::kSELFKONLY;
+     cacheType = CacheType::kSELFKONLY;
  }
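A self-contained demonstration of the shadowing pitfall, with the types simplified from the test:
enum class CacheType
{
    kSELF,
    kSELFKONLY
};

CacheType pickCacheType(int kvFactor)
{
    CacheType cacheType = CacheType::kSELF;
    if (kvFactor == 1)
    {
        auto cacheType = CacheType::kSELFKONLY; // declares a NEW local; the outer variable is untouched
        (void) cacheType;
    }
    return cacheType; // always kSELF; drop `auto` in the branch to actually assign
}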
♻️ Duplicate comments (8)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
139-145: Zero-copy path must iterate per-window buckets
Once per-window mapping is built, send each window's blocks rather than a flat list.
- for (size_t i = 0; i < connections.size(); i++)
- {
-     for (auto const& block : inputKvCacheBlocks)
-     {
-         session.send(i, block->data(), block->getSizeInBytes());
-     }
- }
+ for (size_t i = 0; i < connections.size(); i++)
+ {
+     for (auto const& [w, blocks] : inputKvCacheBlocksPerWindow)
+     {
+         (void) w;
+         for (auto const& block : blocks)
+         {
+             session.send(i, block->data(), block->getSizeInBytes());
+         }
+     }
+ }
113-126: Bug: per-window blocks are flattened and then re-keyed under a single window
You iterate per-window but flatten into inputKvCacheBlocks and later emplace all blocks under pool 0's window. This loses per-window grouping and can break split/concat when multiple windows exist.
Update to build and use a true per-window map throughout:
- int blockNum = 0;
- std::vector<runtime::ITensor::SharedPtr> inputKvCacheBlocks;
+ int blockNum = 0;
+ std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocksPerWindow;
@@
  for (auto const& windowSize : windowSizes)
  {
-     auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
-     for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
-     {
-         inputKvCacheBlocks.push_back(it);
-         blockNum++;
-     }
+     auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+     auto& vec = inputKvCacheBlocksPerWindow[windowSize];
+     for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
+     {
+         vec.push_back(it);
+         ++blockNum;
+     }
  }
@@
- SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(0);
- std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocksPerWindow;
- inputKvCacheBlocksPerWindow.emplace(window, inputKvCacheBlocks);
+ // inputKvCacheBlocksPerWindow was built above per window; use as-is
Also adjust zero-copy loop below (Lines 139-145) to iterate the per-window map instead of the flat vector.
Also applies to: 178-183
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
45-79: Off-by-one in window cropping and unsafe map access when computing reuse start index
- windowStartBlockIdx uses windowSize / tokensPerBlock + 1, which overestimates by one for exact multiples.
- requestedBlockHashesPerWindow.at(windowSize) may throw for missing key.
  auto const& windowsMetadata = cacheManager->getBlockManager().getWindowSizesMetadata();
@@
- auto const& blockIdsPerWindow = blockRange.getBlockIdsPerWindow();
+ auto const& blockIdsPerWindow = blockRange.getBlockIdsPerWindow();
+ auto const tokensPerBlock = cacheManager->getBlockManager().getTokensPerBlock();
@@
- SizeType32 reuseStartBlockIdx
-     = (needReuse && requestedBlockHashesPerWindow.at(windowSize).size() > 0
-           && requestedBlockHashesPerWindow.at(windowSize).size() < blockIdsPerWindow.at(windowSize).size())
-     ? (blockIdsPerWindow.at(windowSize).size() - requestedBlockHashesPerWindow.at(windowSize).size())
-     : 0;
- auto windowStartBlockIdx = needSendAllForWindow
-     ? 0
-     : static_cast<SizeType32>(blockIdsPerWindow.at(windowSize).size())
-         - (windowSize / cacheManager->getBlockManager().getTokensPerBlock() + 1);
+ auto const& ids = blockIdsPerWindow.at(windowSize);
+ SizeType32 reuseStartBlockIdx = 0;
+ if (needReuse)
+ {
+     auto it = requestedBlockHashesPerWindow.find(windowSize);
+     auto requested = (it != requestedBlockHashesPerWindow.end()) ? it->second.size() : 0UL;
+     if (requested > 0 && requested < ids.size())
+     {
+         reuseStartBlockIdx = static_cast<SizeType32>(ids.size() - requested);
+     }
+ }
+ SizeType32 blocksInWindow = static_cast<SizeType32>((windowSize + tokensPerBlock - 1) / tokensPerBlock);
+ SizeType32 windowStartBlockIdx = needSendAllForWindow
+     ? 0
+     : static_cast<SizeType32>(
+         std::max<ptrdiff_t>(0, static_cast<ptrdiff_t>(ids.size()) - static_cast<ptrdiff_t>(blocksInWindow)));
@@
- blockRange.setBlockIdsForWindow(windowSize,
-     std::vector<SizeType32>(
-         blockIdsPerWindow.at(windowSize).begin() + startBlockIdx, blockIdsPerWindow.at(windowSize).end()));
+ blockRange.setBlockIdsForWindow(windowSize,
+     std::vector<SizeType32>(ids.begin() + startBlockIdx, ids.end()));
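Both halves of the fix reduce to two small idioms, shown here in isolation with illustrative names: ceil division for the number of blocks that cover a window, and clamping the start index at zero.
#include <algorithm>
#include <cstddef>

// Blocks needed to cover windowSize tokens: ceil(windowSize / tokensPerBlock).
// E.g. windowSize=64, tokensPerBlock=32 -> 2, where `64 / 32 + 1` would give 3.
int blocksForWindow(int windowSize, int tokensPerBlock)
{
    return (windowSize + tokensPerBlock - 1) / tokensPerBlock;
}

// First block index inside the window, clamped so short sequences start at 0.
int windowStartIdx(std::size_t numBlockIds, int blocksInWindow)
{
    return std::max(0, static_cast<int>(numBlockIds) - blocksInWindow);
}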
100-111: Receiving side: same off-by-one window cropping
Use ceil division and clamp to zero, mirroring the send path.
- for (auto const& [windowSize, metadata] : windowsMetadata)
+ auto const tokensPerBlock = cacheManager->getBlockManager().getTokensPerBlock();
+ for (auto const& [windowSize, metadata] : windowsMetadata)
  {
-     auto const& blockIdsPerWindow = blockRange.getBlockIdsPerWindow();
-     auto windowStartBlockIdx = static_cast<SizeType32>(blockIdsPerWindow.at(windowSize).size())
-         - (windowSize / cacheManager->getBlockManager().getTokensPerBlock() + 1);
-     SizeType32 startBlockIdx = std::max(0, windowStartBlockIdx);
-     blockRange.setBlockIdsForWindow(windowSize,
-         std::vector<SizeType32>(
-             blockIdsPerWindow.at(windowSize).begin() + startBlockIdx, blockIdsPerWindow.at(windowSize).end()));
+     auto const& ids = blockRange.getBlockIdsPerWindow().at(windowSize);
+     SizeType32 blocksInWindow = static_cast<SizeType32>((windowSize + tokensPerBlock - 1) / tokensPerBlock);
+     SizeType32 startBlockIdx = static_cast<SizeType32>(
+         std::max<ptrdiff_t>(0, static_cast<ptrdiff_t>(ids.size()) - static_cast<ptrdiff_t>(blocksInWindow)));
+     blockRange.setBlockIdsForWindow(windowSize,
+         std::vector<SizeType32>(ids.begin() + startBlockIdx, ids.end()));
  }
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
455-456: Fix operator precedence in byte-wise equality check
i == llmRequest->getPromptLen() & 0xff is parsed as (i == llmRequest->getPromptLen()) & 0xff. Compare against the masked byte explicitly.
- EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-     [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+ EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
+     [&llmRequest](uint8_t i)
+     { return i == static_cast<uint8_t>(llmRequest->getPromptLen() & 0xff); }));
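A runnable two-liner demonstrating the precedence trap, with values invented for illustration:
#include <cstdint>
#include <cstdio>

int main()
{
    int n = 0x1ff;    // low byte is 0xff
    uint8_t i = 0xff;
    bool wrong = (i == n) & 0xff;                      // how `i == n & 0xff` parses -> false
    bool right = i == static_cast<uint8_t>(n & 0xff);  // intended comparison -> true
    std::printf("%d %d\n", wrong, right);              // prints "0 1"
}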
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (3)
82-85: TLLM_CHECK message: grammar and format specifier; cast windowSize
Fix "exists" to "exist" and ensure the format argument type matches the specifier. Safer to cast to int with %d (or switch to %u if SizeType32 is unsigned).
- TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
-     "Window size %d should exists", windowSize);
+ TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
+     "Window size %d should exist", static_cast<int>(windowSize));
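The general rule, as a tiny standalone example: a printf-style specifier must match the promoted argument type exactly, so cast when a typedef's width is not guaranteed.
#include <cstdint>
#include <cstdio>

int main()
{
    std::size_t count = 42;
    int32_t windowSize = 1024;
    // %zu matches size_t; the explicit cast guarantees %d sees an int.
    std::printf("count=%zu window=%d\n", count, static_cast<int>(windowSize));
}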
89-95: Consistent validation: check against mBlockIdsPerWindow, not mPoolsPerWindow
setBlockIdsForWindow validates presence in mBlockIdsPerWindow but this function checks mPoolsPerWindow. Align them to avoid inconsistent behavior.
  for (auto const& [windowSize, blockIds] : blockIdsPerWindow)
  {
-     TLLM_CHECK_WITH_INFO(
-         mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d should exists", windowSize);
+     TLLM_CHECK_WITH_INFO(
+         mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
+         "Window size %d should exist", static_cast<int>(windowSize));
  }
115-120: Guard against missing block IDs for a valid window
mBlockIdsPerWindow.at(windowSize) can throw if IDs weren't set yet. Return an empty range instead of throwing.
  BlockRangeForWindow getBlockRangeForWindow(SizeType32 windowSize) const
  {
      TLLM_CHECK_WITH_INFO(
          mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d not found", windowSize);
      auto pool = mPoolsPerWindow.at(windowSize).front();
-     auto blockIds = mBlockIdsPerWindow.at(windowSize);
-     return BlockRangeForWindow(std::move(blockIds), std::move(pool));
+     auto it = mBlockIdsPerWindow.find(windowSize);
+     std::vector<SizeType32> blockIds = (it == mBlockIdsPerWindow.end()) ? std::vector<SizeType32>{} : it->second;
+     return BlockRangeForWindow(std::move(blockIds), std::move(pool));
  }
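The lookup pattern behind that fix, sketched with standard containers (the real class may use a different map type; names here are illustrative):
#include <unordered_map>
#include <vector>

using SizeType32 = int; // stand-in for the project's typedef

// Returns the block IDs for a window, or an empty vector if none were set yet,
// instead of letting map::at throw.
std::vector<SizeType32> blockIdsOrEmpty(
    std::unordered_map<SizeType32, std::vector<SizeType32>> const& idsPerWindow, SizeType32 windowSize)
{
    auto it = idsPerWindow.find(windowSize);
    return it == idsPerWindow.end() ? std::vector<SizeType32>{} : it->second;
}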
🧹 Nitpick comments (11)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)
381-381: Format specifier mismatch for size_t
defaultTransSize is size_t but the log uses %d. Prefer %zu (or cast).
- TLLM_LOG_INFO("defaultTransSize: %d", defaultTransSize);
+ TLLM_LOG_INFO("defaultTransSize: %zu", static_cast<size_t>(defaultTransSize));
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)
127-130: Constructing range with moved inputs: LGTM; minor micro-nit
The move of blockIdsVec and poolCopy is fine. If pool isn't reused afterward, you can avoid the temporary poolCopy and move pool directly to reduce one shared_ptr copy.
- auto poolCopy = pool;
- auto range = BlockRangeForWindow(std::move(blockIdsVec), std::move(poolCopy));
+ auto range = BlockRangeForWindow(std::move(blockIdsVec), std::move(pool));
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
153-154: Units: clarify elements vs. bytes for cacheBlockSize
getSize() returns elements. That's fine for computing targetBufferSize (elements), but be consistent by naming to reflect units, and ensure measurement uses bytes later.
- auto cacheBlockSize = inputKvCacheBlocks.at(0)->getSize();
+ auto cacheBlockElemCount = inputKvCacheBlocksPerWindow.begin()->second.front()->getSize();
Follow-up: see the measurement fix below.
168-174: Agent connection precondition checks: LGTM; minor variable naming
agentConnnecion is misspelled (extra 'n'). Rename for readability.
- auto* agentConnnecion = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
- if (agentConnnecion != nullptr)
+ auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
+ if (agentConnection != nullptr)
  {
      TLLM_CHECK_WITH_INFO(bufferCoverTargetNum == pPDomainSize, "Agent need all buffer pre-allocated");
      TLLM_CHECK(onlyUseDynamicBuffer == false);
  }
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (4)
22-22: Unused include: kvCacheEventManager.h
This header isn't referenced in this TU. Drop it to keep includes minimal.
-#include "tensorrt_llm/batch_manager/kvCacheEventManager.h"
251-272: Logging format specifier: size_t vs %d
windowSizes.size() is size_t. Use %zu (or cast) to avoid UB.
- TLLM_LOG_DEBUG(
-     mpi::MpiComm::world().getRank(), " blockRange.getWindowSizes(); windowSizes size: %d", windowSizes.size());
+ TLLM_LOG_DEBUG(
+     mpi::MpiComm::world().getRank(), " blockRange.getWindowSizes(); windowSizes size: %zu",
+     static_cast<size_t>(windowSizes.size()));
Same applies to other logs printing container sizes.
484-497: Units on receive: accumulate bytes not elements for cacheBlockSizeSum
You sum it->getSize() (elements) into cacheBlockSizeSum and then compare modulo/derive targetBufferSize used for bytes-based recv/sizing checks. Prefer summing bytes for clearer invariants, or keep elements but ensure all downstream uses treat it as elements.
- size_t cacheBlockSizeSum = 0;
+ size_t cacheBlockBytesSum = 0;
@@
-     cacheBlockSizeSum += it->getSize();
+     cacheBlockBytesSum += it->getSizeInBytes();
If you adopt bytes, adjust the modulo and targetBufferSize accordingly and convert back to elements when slicing.
616-624: Ensure all ITensor::slice offsets use element counts after switching to byte-based sizes
I've confirmed every call to ITensor::slice in cacheFormatter.cpp. Only the slices that use targetBufferSize (which you've refactored to be in bytes) still pass byte offsets directly to slice, which expects element counts. You'll need to divide those byte offsets by the element size (e.g. via getElementSizeInBytes()) before calling slice. The remaining slice calls already work in element units and require no change.
• Lines 386-390
  auto copySlice = runtime::ITensor::slice(
      outputSplitCaches[bufferIdx], targetBufferSize - remainSendSize, sendSize);
• Lines 737-740
  auto copySlice = runtime::ITensor::slice(
      recvSplitCaches[processIdx], targetBufferSize - remainRecvSize, recvSize);
Both offsets (targetBufferSize - remainXxxSize) must be converted from bytes to element counts.
45-51: Per-window block hash transport: constructor/eq/serde updates look correct
- RequestInfo stores mBlockHashesPerWindow and equality compares it.
- Serialization writes map size followed by (window, hashes) pairs and restores them in deserialization.
- serializedSize matches the encoding.
Two optional improvements:
- For deterministic wire format across compilers/architectures, consider serializing a sorted std::vector of pairs or use std::map instead of std::unordered_map.
- Pre-reserve blockHashesPerWindow in deserialize after reading size to reduce rehashing.
- std::unordered_map<SizeType32, std::vector<size_t>> blockHashesPerWindow;
- auto size = su::deserialize<decltype(blockHashesPerWindow.size())>(is);
+ std::unordered_map<SizeType32, std::vector<size_t>> blockHashesPerWindow;
+ auto size = su::deserialize<decltype(blockHashesPerWindow.size())>(is);
+ blockHashesPerWindow.reserve(size);
Also applies to: 56-58, 75-81, 88-98, 105-113
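One way to realize the sorted-pairs encoding the first suggestion describes; a sketch where writePair stands in for the project's serialization helpers:
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

template <typename Writer>
void serializeSorted(std::unordered_map<int, std::vector<std::size_t>> const& m, Writer&& writePair)
{
    // Copy to a vector and sort by key so the wire order no longer depends on
    // the hash seed or standard-library implementation.
    std::vector<std::pair<int, std::vector<std::size_t>>> entries(m.begin(), m.end());
    std::sort(entries.begin(), entries.end(), [](auto const& a, auto const& b) { return a.first < b.first; });
    for (auto const& e : entries)
    {
        writePair(e.first, e.second);
    }
}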
cpp/tests/batch_manager/cacheTransceiverTest.cpp (2)
939-947: Potential off-by-one in visible-window verification
startBlockId = size - (maxBlockInWindow + 1) effectively verifies maxBlockInWindow + 1 blocks; was the intention to verify only the last maxBlockInWindow blocks? If so, drop the + 1.
- int startBlockId = std::max(0, static_cast<int>(blockRangeForWindow.size()) - (maxBlockInWindow + 1));
+ int startBlockId = std::max(0, static_cast<int>(blockRangeForWindow.size()) - maxBlockInWindow);
If the extra block is intentional (to include the current block), please add a brief comment to document this choice.
431-437: Minor: prefer range-for for clarity
Using range-for improves readability with your custom iterator/sentinel.
- auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
- for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
- {
-     // fill cache with tokens (= request length), for reuse test
-     TLLM_CUDA_CHECK(cudaMemset(it->data(), llmRequest->getPromptLen(), it->getSizeInBytes()));
- }
+ for (auto& block : blockRange.getBlockRangeForWindow(windowSize))
+ {
+     // Fill with prompt length for reuse test
+     TLLM_CUDA_CHECK(cudaMemset(block->data(), llmRequest->getPromptLen(), block->getSizeInBytes()));
+ }
Apply the same pattern in the receiver validation loop below.
Also applies to: 451-457
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (18)
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
- cpp/tensorrt_llm/common/envUtils.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (12 hunks)
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2 hunks)
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2 hunks)
- tests/integration/defs/accuracy/test_disaggregated_serving.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/integration/defs/accuracy/test_disaggregated_serving.py
🚧 Files skipped from review as they are similar to previous changes (10)
- cpp/tensorrt_llm/common/envUtils.h
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
- cpp/tensorrt_llm/common/envUtils.cpp
- cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h
- cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
🧠 Learnings (7)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
Applied to files:
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Applied to files:
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
🧬 Code graph analysis (6)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (2)
preAllocBufferSize (246-285), preAllocBufferSize (246-248)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
RequestInfo(45-97)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)
cpp/tensorrt_llm/common/envUtils.cpp (4)
getEnvKVCacheTransferAllBlocksForWindow (449-453), getEnvKVCacheTransferAllBlocksForWindow (449-449), getEnvDisableSelectiveCacheTransfer (321-325), getEnvDisableSelectiveCacheTransfer (321-321)
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
BlockRangeForWindow(27-56)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (10)
tensorrt_llm (37-265), nodiscard (547-550), nodiscard (555-558), nodiscard (578-581), nodiscard (765-773), nodiscard (776-779), nodiscard (782-785), nodiscard (1021-1024), nodiscard (1186-1194), nodiscard (1232-1240)
cpp/include/tensorrt_llm/runtime/gptDecoder.h (1)
ITensor(145-152)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
getBlockRangeForSending (43-82), getBlockRangeForSending (43-43)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (8)
llmRequest (421-460), llmRequest (421-421), llmRequest (882-914), llmRequest (882-882), llmRequest (916-923), llmRequest (916-916), llmRequest (925-951), llmRequest (925-925)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (11)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)
119-121: PreAllocBufferSize API update: passing tokensPerBlock looks correct
The test now forwards tokensPerBlock to preAllocBufferSize; this aligns with the implementation using tokensPerBlock to bound validTokenNum. No functional concerns here.
164-165: Mirror the API update in TestPreAllocBufferSize2: LGTM
The second test also passes tokensPerBlock to preAllocBufferSize. Consistent with the first test.
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)
55-55: API migration to BlockRangeForWindow: LGTM
Switching to BlockRangeForWindow with moved blockIds and pool matches the new API and preserves behavior.
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)
101-104: Early exit when no cache needs sending: LGTM
Short-circuiting avoids unnecessary setup work; consistent with CacheFormatter behavior.
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)
268-271: Units consistency verified: no action needed
I've audited all instances of getSize() and getSizeInBytes() around buffer allocation, slicing, and session.send/session.recv calls in cacheFormatter.cpp.
- getSize() is only ever used for element counts (e.g. computing allCacheBlockSize, slicing tensors).
- getSizeInBytes() is used exclusively when specifying byte lengths for sends/receives and updating cache sizes.
There are no mismatches between element counts and byte counts.
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)
26-26: Include for SizeType32: LGTM
Adding runtime/common.h is appropriate for SizeType32.
238-244: DataResponder path switches to per-window hashes: LGTM
Receiving the per-window hashes and forwarding via setRequestedBlockHashes(...) aligns with the per-window design.
Also applies to: 263-264
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
649-650: Resource usage check: doubled mMaxNumSequences from 8 to 16
This increases total blocks and memory footprints in tests. Confirm CI/GPU memory headroom; earlier L0 pipelines failed, and this might contribute.
Would you like me to probe the failing pipelines’ memory-related logs and suggest parameterizations to keep coverage while reducing peak memory?
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (3)
20-21: Include looks correct for new per-window pool typing
Adding iTensor.h is appropriate for runtime::ITensor::SharedPtr usage in per-window structures.
187-193: Iterator/sentinel correctness looks good
- Constructor guards against invalid non-zero indices and updates state.
- Sentinel comparison uses size() and avoids deref on empty ranges.
Also applies to: 230-233, 244-248
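For readers unfamiliar with the pattern being praised, a stripped-down range with the same begin/size-sentinel shape; all names here are illustrative, not the actual BlockRangeForWindow members.
#include <cstddef>
#include <vector>

class IntBlockRange
{
public:
    explicit IntBlockRange(std::vector<int> values)
        : mValues(std::move(values))
    {
    }

    class Iterator
    {
    public:
        Iterator(std::vector<int> const& values, std::size_t idx)
            : mValues(&values)
            , mIdx(idx)
        {
        }

        int const& operator*() const
        {
            return (*mValues)[mIdx]; // never dereferenced when the range is empty
        }

        Iterator& operator++()
        {
            ++mIdx;
            return *this;
        }

        bool operator!=(Iterator const& other) const
        {
            return mIdx != other.mIdx;
        }

    private:
        std::vector<int> const* mValues;
        std::size_t mIdx;
    };

    Iterator begin() const
    {
        return Iterator(mValues, 0);
    }

    Iterator end() const
    {
        return Iterator(mValues, mValues.size()); // sentinel index == size()
    }

    std::size_t size() const
    {
        return mValues.size();
    }

private:
    std::vector<int> mValues;
};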
146-152: Initialization of pools and block IDs per window is clear and consistent
Constructors correctly populate mPoolsPerWindow and mBlockIdsPerWindow across all pools/windows.
Also applies to: 158-166
- mWindowSize = newWindowSize;
- mBlockIds = mManager->getSequence(mRequestId).getCacheBlockIds(mWindowSize).at(kFIRST_AND_ONLY_BEAM);
+ std::vector<SizeType32> getWindowSizes() const
+ {
+     std::vector<SizeType32> windowSizes;
+     for (auto const& [windowSize, _] : mPoolsPerWindow)
+     {
+         windowSizes.push_back(windowSize);
+     }
+     return windowSizes;
+ }
🛠️ Refactor suggestion
Return window sizes in deterministic order
Current implementation iterates an unordered_map, making order non-deterministic across processes and builds. Sorting eliminates test flakiness and keeps sender/receiver blockIdx alignment.
std::vector<SizeType32> getWindowSizes() const
{
- std::vector<SizeType32> windowSizes;
- for (auto const& [windowSize, _] : mPoolsPerWindow)
- {
- windowSizes.push_back(windowSize);
- }
- return windowSizes;
+ std::vector<SizeType32> windowSizes;
+ windowSizes.reserve(mPoolsPerWindow.size());
+ for (auto const& [windowSize, _] : mPoolsPerWindow)
+ {
+ windowSizes.push_back(windowSize);
+ }
+ std::sort(windowSizes.begin(), windowSizes.end());
+ return windowSizes;
  }
Add the missing header include as well:
#pragma once
#include "tensorrt_llm/batch_manager/kvCacheManager.h"
#include "tensorrt_llm/runtime/iTensor.h"
+#include <algorithm>
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
#pragma once
#include "tensorrt_llm/batch_manager/kvCacheManager.h"
#include "tensorrt_llm/runtime/iTensor.h"
#include <algorithm>

std::vector<SizeType32> getWindowSizes() const
{
    std::vector<SizeType32> windowSizes;
    windowSizes.reserve(mPoolsPerWindow.size());
    for (auto const& [windowSize, _] : mPoolsPerWindow)
    {
        windowSizes.push_back(windowSize);
    }
    std::sort(windowSizes.begin(), windowSizes.end());
    return windowSizes;
}
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h around lines 122 to
130, the function returns window sizes by iterating an unordered_map which
yields a non-deterministic order; collect the window sizes as you do now then
sort the vector before returning (use std::sort on the collected windowSizes) to
ensure deterministic ordering, and also add the missing include for <algorithm>
at the top of the header so std::sort is available.
+ auto const& windowSizes = blockRange.getWindowSizes();
+ for (auto const& windowSize : windowSizes)
+ {
-     // fill cache with tokens (= request length), for reuse test
-     TLLM_CUDA_CHECK(cudaMemset(block.data(), llmRequest->getPromptLen(), block.getSizeInBytes()));
+     auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+     for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
+     {
+         // fill cache with tokens (= request length), for reuse test
+         TLLM_CUDA_CHECK(cudaMemset(it->data(), llmRequest->getPromptLen(), it->getSizeInBytes()));
+     }
+ }
💡 Verification agent
🧩 Analysis chain
Add a device sync after cudaMemset to avoid read-after-write races
The memsets run on the default stream and the responder may read on another stream. Add a sync before starting the send to remove potential flakiness.
}
- mFutures.emplace_back(mResponder->respondAndSendAsync(*llmRequest));
+ // Ensure memset to device memory completes before send starts.
+ TLLM_CUDA_CHECK(cudaDeviceSynchronize());
+ mFutures.emplace_back(mResponder->respondAndSendAsync(*llmRequest));
Also applies to: 439-440
🏁 Script executed:
#!/bin/bash
# Search for respondAndSendAsync usage and definition across the codebase
rg -n "respondAndSendAsync" -C5
Length of output: 18131
Ensure device memory is fully initialized before launching async transfers
The cudaMemset calls in the reuse test execute on the default stream but don’t guarantee completion before the subsequent async send/receive, leading to potential read–after–write flakiness. Insert an explicit device synchronization immediately after the memset loops and before invoking any asynchronous transfer.
• File: cpp/tests/batch_manager/cacheTransceiverTest.cpp
– Lines 434–440: after the inner cudaMemset loop, add TLLM_CUDA_CHECK(cudaDeviceSynchronize()); before both mResponder->respondAndSendAsync(...) and mRequester->requestAndReceiveAsync(...).
Example patch:
}
+ // Wait for all cudaMemset operations to finish before launching async transfer
+ TLLM_CUDA_CHECK(cudaDeviceSynchronize());
mFutures.emplace_back(mResponder->respondAndSendAsync(*llmRequest));
}
else
{
+ // Ensure device memory is ready before requestAndReceiveAsync
+ TLLM_CUDA_CHECK(cudaDeviceSynchronize());
auto future = mRequester->requestAndReceiveAsync(*llmRequest);
future.get();
  }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
auto const& windowSizes = blockRange.getWindowSizes();
for (auto const& windowSize : windowSizes)
{
    auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
    for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
    {
        // fill cache with tokens (= request length), for reuse test
        TLLM_CUDA_CHECK(cudaMemset(it->data(), llmRequest->getPromptLen(), it->getSizeInBytes()));
    }
}
// Wait for all cudaMemset operations to finish before launching async transfer
TLLM_CUDA_CHECK(cudaDeviceSynchronize());
mFutures.emplace_back(mResponder->respondAndSendAsync(*llmRequest));
}
else
{
    // Ensure device memory is ready before requestAndReceiveAsync
    TLLM_CUDA_CHECK(cudaDeviceSynchronize());
    auto future = mRequester->requestAndReceiveAsync(*llmRequest);
    future.get();
}
🤖 Prompt for AI Agents
In cpp/tests/batch_manager/cacheTransceiverTest.cpp around lines 429 to 438, the
cudaMemset loops initialize device memory on the default stream but do not
guarantee completion before subsequent asynchronous send/receive calls; add an
explicit device synchronization (e.g.,
TLLM_CUDA_CHECK(cudaDeviceSynchronize());) immediately after the memset loops
and before calling mResponder->respondAndSendAsync(...) and
mRequester->requestAndReceiveAsync(...) so the device memory is fully
initialized prior to any async transfers.
+ auto const& windowSizes = blockRange.getWindowSizes();
+ int blockIdx = 0;
+ for (auto const& windowSize : windowSizes)
+ {
-     blockRange.updatePoolIdx(poolIdx);
-     TLLM_LOG_DEBUG("update poolIdx: %d", poolIdx);
-     for (auto& block : blockRange)
+     auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+     TLLM_LOG_DEBUG("update windowSize: %d", windowSize);
+     for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
      {
-         fillBlockData(block, blockIdx, llmRequest->getPromptLen(), poolIdx);
+         fillBlockData(*it, blockIdx, llmRequest->getPromptLen(), windowSize);
          blockIdx++;
      }
-     TLLM_LOG_DEBUG("blockPoolIdx: %d finish fill block data", poolIdx);
+     TLLM_LOG_DEBUG("windowSize: %d finish fill block data", windowSize);
  }
💡 Verification agent
❓ Verification inconclusive
Iteration order of windowSizes is unspecified; make it deterministic
getWindowSizes() returns a vector built from an unordered_map. Relying on its iteration order leads to non-deterministic blockIdx sequencing across processes, which can break the sender/receiver’s matching. Sort the window sizes before iterating.
- auto const& windowSizes = blockRange.getWindowSizes();
- int blockIdx = 0;
+ auto windowSizes = blockRange.getWindowSizes();
+ std::sort(windowSizes.begin(), windowSizes.end());
+ int blockIdx = 0;
  for (auto const& windowSize : windowSizes)
Ensure deterministic iteration by sorting window sizes
To avoid non-deterministic blockIdx assignments (since getWindowSizes() pulls from an unordered_map), make a local, mutable copy of the returned vector and sort it before looping. Applying your proposed diff will guarantee consistent ordering across runs:
- auto const& windowSizes = blockRange.getWindowSizes();
+ // Copy and sort to enforce deterministic order (unordered_map has unspecified iteration)
+ auto windowSizes = blockRange.getWindowSizes();
+ std::sort(windowSizes.begin(), windowSizes.end());
int blockIdx = 0;
for (auto const& windowSize : windowSizes) {
      auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
Tagging this change as a mandatory refactor to ensure correct matching between sender and receiver.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Copy and sort to enforce deterministic order (unordered_map has unspecified iteration)
auto windowSizes = blockRange.getWindowSizes();
std::sort(windowSizes.begin(), windowSizes.end());
int blockIdx = 0;
for (auto const& windowSize : windowSizes)
{
    auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
    TLLM_LOG_DEBUG("update windowSize: %d", windowSize);
    for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
    {
        fillBlockData(*it, blockIdx, llmRequest->getPromptLen(), windowSize);
        blockIdx++;
    }
    TLLM_LOG_DEBUG("windowSize: %d finish fill block data", windowSize);
}
🤖 Prompt for AI Agents
In cpp/tests/batch_manager/cacheTransceiverTest.cpp around lines 891 to 903,
getWindowSizes() returns a vector derived from an unordered_map so iteration
order is non-deterministic; make a local mutable copy of the returned vector,
sort it (e.g. std::sort ascending) and then iterate over the sorted copy when
computing blockIdx and filling block data to ensure deterministic block index
assignment; add #include <algorithm> if not already present.
  for (auto const& windowSize : windowSizes)
  {
-     blockRange.updatePoolIdx(poolIdx);
-     for (auto& block : blockRange)
+     auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+     int maxBlockInWindow = windowSize / mCacheState->getModelConfig().mTokensPerBlock;
+     int startBlockId = std::max(0, static_cast<int>(blockRangeForWindow.size()) - (maxBlockInWindow + 1));
+     int blockIdInWindow = 0;
+     for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
      {
-         verifyBlockData(block, blockIdx, llmRequest->getPromptLen(), poolIdx);
+         if (blockIdInWindow >= startBlockId)
+         {
+             verifyBlockData(*it, blockIdx, llmRequest->getPromptLen(), windowSize);
+         }
          blockIdx++;
+         blockIdInWindow++;
      }
💡 Verification agent
🧩 Analysis chain
Apply deterministic window iteration during verification too
Same determinism concern on the generation side; sort window sizes before iterating to keep blockIdx and verification in sync with the sender.
- auto const& windowSizes = blockRange.getWindowSizes();
+ auto windowSizes = blockRange.getWindowSizes();
+ std::sort(windowSizes.begin(), windowSizes.end());
  for (auto const& windowSize : windowSizes)
🏁 Script executed:
#!/bin/bash
# Search for getWindowSizes method usage or definition
rg -nP 'getWindowSizes\s*\(' -C3 --type=cpp
# Locate the BlockRange class definition
rg -nP 'class\s+\w*BlockRange' -C3 --type=cpp
Length of output: 8787
🏁 Script executed:
#!/bin/bash
# Show the declaration and context of BlockRange
rg -nP 'class BlockRange' -C10 --type=cpp cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
# Show the declaration of mPoolsPerWindow
rg -nP 'mPoolsPerWindow' -C5 --type=cpp cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
Length of output: 4151
Ensure deterministic window ordering in cacheTransceiverTest
The call to getWindowSizes() pulls keys from an unordered_map, yielding a non-deterministic iteration order. To keep blockIdx and verification aligned with the sender, sort the window sizes before the loop in cpp/tests/batch_manager/cacheTransceiverTest.cpp:
- auto const& windowSizes = blockRange.getWindowSizes();
+ // Copy and sort for deterministic iteration
+ auto windowSizes = blockRange.getWindowSizes();
+ std::sort(windowSizes.begin(), windowSizes.end());
for (auto const& windowSize : windowSizes)
{
auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
…
  }
- This change should be applied at every occurrence of auto const& windowSizes = blockRange.getWindowSizes(); in the test file.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Copy and sort for deterministic iteration
auto windowSizes = blockRange.getWindowSizes();
std::sort(windowSizes.begin(), windowSizes.end());
for (auto const& windowSize : windowSizes)
{
    auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
    int maxBlockInWindow = windowSize / mCacheState->getModelConfig().mTokensPerBlock;
    int startBlockId = std::max(0, static_cast<int>(blockRangeForWindow.size()) - (maxBlockInWindow + 1));
    int blockIdInWindow = 0;
    for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
    {
        if (blockIdInWindow >= startBlockId)
        {
            verifyBlockData(*it, blockIdx, llmRequest->getPromptLen(), windowSize);
        }
        blockIdx++;
        blockIdInWindow++;
    }
}
🤖 Prompt for AI Agents
In cpp/tests/batch_manager/cacheTransceiverTest.cpp around lines 935 to 949, the
test obtains window sizes from blockRange via an unordered_map which yields
non-deterministic iteration order; this misaligns blockIdx/verification with the
sender. Replace uses of auto const& windowSizes = blockRange.getWindowSizes() by
copying the keys into a vector, sorting that vector (e.g., std::sort) and
iterate over the sorted windowSizes instead; apply the same change at every
occurrence of getWindowSizes() in this test file so verification order is
deterministic and matches the sender.
Signed-off-by: Chuang Zhu <[email protected]>
Summary by CodeRabbit
New Features
Refactor
API
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-listparameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.mdand the
scripts/test_to_stage_mapping.pyhelper.kill
killKill all running builds associated with pull request.
skip
skip --comment COMMENTSkip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipelineReuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
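For example, using only the flags documented above, typical PR comments might look like the following (the pipeline id and stage name are placeholders, not real values):

/bot run --disable-fail-fast --gpu-type "A30"
/bot run --reuse-test 12345 --stage-list "A10-PyTorch-1"
/bot kill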