
Conversation

@chuangz0
Collaborator

@chuangz0 chuangz0 commented Aug 13, 2025

Summary by CodeRabbit

  • New Features

    • Per-window block management with new per-window range/iteration APIs and a new env flag to optionally transfer all blocks for a window.
  • Refactor

    • Cache transfer/format/unformat flows converted to per-window processing with per-window pools, iteration, validation, logging, and per-window request/metadata/serialization.
  • API

    • Buffer pre-allocation API (and Python/nanobind bindings) now requires tokens-per-block.
  • Tests

    • Unit/integration tests updated to per-window semantics; one accuracy test skip removed.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without due care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing results without due care and validation can break the top of tree.

@chuangz0 chuangz0 requested a review from a team as a code owner August 13, 2025 09:11
@coderabbitai
Contributor

coderabbitai bot commented Aug 13, 2025

📝 Walkthrough

Refactors KV-cache handling from a single global block range to per-window semantics: adds BlockRangeForWindow and per-window BlockRange data structures and APIs, propagates per-window requested block hashes through transceiver/formatter flows, adds an env flag to force full-window transfer, and extends buffer preallocation API to accept tokensPerBlock.
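The central data-structure change above is replacing a flat list of requested block hashes with a map keyed by attention-window size. A minimal sketch of that shape, using the map type named later in this review (the helper function is illustrative, not part of the real API):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using SizeType32 = std::int32_t;

// Per-window requested block hashes, as described in the walkthrough:
// attention-window size -> hashes of the blocks requested for that window.
using BlockHashesPerWindow = std::unordered_map<SizeType32, std::vector<std::size_t>>;

// Illustrative helper: total number of requested blocks across all windows.
std::size_t totalRequestedBlocks(BlockHashesPerWindow const& hashesPerWindow)
{
    std::size_t total = 0;
    for (auto const& [windowSize, blockHashes] : hashesPerWindow)
    {
        total += blockHashes.size();
    }
    return total;
}
```

Consumers (transceiver, formatters) iterate the map and handle each window's hashes independently instead of assuming one global block range.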

Changes

Cohort / File(s) Summary
KV cache per-window API
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
Adds BlockRangeForWindow (ctor, begin/end/size, Sentinel); updates BlockIterator to reference BlockRangeForWindow; refactors BlockRange to per-window maps (mBlockIdsPerWindow, mPoolsPerWindow), adds factories (fromAllBlockIds, fromNewlyAllocatedBlockIds), constructors, setters/getters (setBlockIdsForWindow, setBlockIdsForAllWindows, getBlockHashesPerWindow, getBlockRangeForWindow, getWindowSizes, getBlockIdsPerWindow); adds include for runtime/iTensor.h.
Request / RequestInfo per-window hashes
cpp/include/tensorrt_llm/batch_manager/llmRequest.h, cpp/tensorrt_llm/batch_manager/dataTransceiver.h, cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp, .../dataTransceiverImpl.cpp
Replaces flat vector of requested block hashes with std::unordered_map<SizeType32, std::vector<size_t>> per-window: members, constructor signatures, getters (getRequestedBlockHashesPerWindow / getBlockHashesPerWindow), serializers/deserializers, equality and callsites updated.
Cache formatter & MLA formatter (per-window flows)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp, cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
Reworks format/unformat and send/receive to iterate per-window: compute per-window ranges, crop per-window block IDs (setBlockIdsForWindow), replace pool-based loops with window-based loops, add per-window logging and runtime checks, and adjust zero-copy prealloc logic in sending path.
Transfer buffer prealloc API & bindings
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h, cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp, cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp, cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp, cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp, cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
Adds SizeType32 tokensPerBlock parameter to CacheTransBufferManager::preAllocBufferSize; implementation uses tokensPerBlock in valid-token calculations and allows env override; updates all callsites and Python/nanobind bindings and unit tests.
CacheTransBuffer logic changes
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
Adjusts per-layer valid-token calculation to use windowSize + tokensPerBlock when windowSize < maxNumTokens, with env override to force full-window token count.
Env flag
cpp/tensorrt_llm/common/envUtils.h, cpp/tensorrt_llm/common/envUtils.cpp
Adds getEnvKVCacheTransferAllBlocksForWindow() to expose cached TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW boolean.
Tests updated to per-window semantics
cpp/tests/batch_manager/cacheTransceiverTest.cpp, cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
Tests converted to per-window traversal (getWindowSizes() + getBlockRangeForWindow); helper signatures changed to accept windowSize; BlockRangeForWindow construction updated to use moved arguments.
Integration test change
tests/integration/defs/accuracy/test_disaggregated_serving.py
Removed a runtime skip so the test runs for both overlap_scheduler values.
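To make the per-window API concrete, here is a minimal sketch of the BlockRangeForWindow shape listed in the table above: a set of block IDs bound to the memory pool of one attention window. The pool type is simplified to a plain shared_ptr here; the real class wraps runtime::ITensor::SharedPtr and also exposes a Sentinel-based end:

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

using SizeType32 = std::int32_t;

// Simplified stand-in for the real BlockRangeForWindow (pool type is a
// placeholder; iteration over block IDs is the part being illustrated).
class BlockRangeForWindow
{
public:
    BlockRangeForWindow(std::vector<SizeType32> blockIds, std::shared_ptr<int> pool)
        : mBlockIds(std::move(blockIds))
        , mPool(std::move(pool))
    {
    }

    [[nodiscard]] auto begin() const
    {
        return mBlockIds.begin();
    }

    [[nodiscard]] auto end() const
    {
        return mBlockIds.end();
    }

    [[nodiscard]] std::size_t size() const
    {
        return mBlockIds.size();
    }

private:
    std::vector<SizeType32> mBlockIds;
    std::shared_ptr<int> mPool;
};
```

A BlockRange then holds one such range per window size (mBlockIdsPerWindow / mPoolsPerWindow), and callers retrieve them via getBlockRangeForWindow(windowSize).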

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant DataTransceiver
  participant BlockRange
  participant CacheFormatter
  participant Transport

  Client->>DataTransceiver: submit request
  DataTransceiver->>BlockRange: build/get per-window block-hash map
  alt selective transfer (default)
    DataTransceiver->>CacheFormatter: request with per-window hashes
    loop each windowSize
      CacheFormatter->>BlockRange: getBlockRangeForWindow(windowSize)
      CacheFormatter->>Transport: send selected blocks for window
    end
  else full-window transfer (env flag)
    DataTransceiver->>CacheFormatter: request without selective hashes
    loop each windowSize
      CacheFormatter->>BlockRange: get all/newly allocated blocks for window
      CacheFormatter->>Transport: send all blocks for window
    end
  end
sequenceDiagram
  autonumber
  participant Transport
  participant CacheFormatter
  participant BlockRange

  Transport->>CacheFormatter: deliver cache payload
  loop each windowSize
    CacheFormatter->>BlockRange: getBlockRangeForWindow(windowSize)
    CacheFormatter->>CacheFormatter: map buffers into window's pools and write blocks
  end
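The branch between the two sending modes in the diagrams is gated by the TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW flag added in this PR. A minimal sketch of the helper (the function and variable names match the walkthrough; the implementation is an assumption — the real envUtils helper caches the parsed value rather than re-reading the environment each call):

```cpp
#include <cstdlib>
#include <cstring>

// Simplified stand-in for getEnvKVCacheTransferAllBlocksForWindow():
// returns true when the env flag requests transferring all blocks per window
// instead of only the selectively requested ones. No caching here.
bool getEnvKVCacheTransferAllBlocksForWindow()
{
    char const* value = std::getenv("TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW");
    return value != nullptr && std::strcmp(value, "1") == 0;
}
```

When the flag is set, the formatter skips the per-window hash lookup and sends every block for each window.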

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Possibly related PRs

Suggested labels

KV-Cache Management

Suggested reviewers

  • Shixiaowei02
  • zhengd-nv
  • tomeras91
  • Funatiq


📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 39306d6 and 78b2de5.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
🧠 Learnings (7)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
🧬 Code graph analysis (1)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
cpp/tensorrt_llm/common/envUtils.cpp (4)
  • getEnvKVCacheTransferAllBlocksForWindow (449-453)
  • getEnvKVCacheTransferAllBlocksForWindow (449-449)
  • getEnvDisableSelectiveCacheTransfer (321-325)
  • getEnvDisableSelectiveCacheTransfer (321-321)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (8)
  • llmRequest (421-460)
  • llmRequest (421-421)
  • llmRequest (882-914)
  • llmRequest (882-882)
  • llmRequest (916-923)
  • llmRequest (916-916)
  • llmRequest (925-951)
  • llmRequest (925-925)
🔇 Additional comments (4)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (4)

22-22: LGTM - Adding kvCacheEventManager header.

The include for kvCacheEventManager is appropriate for the per-window KV cache handling functionality being added.


45-89: Function correctly implements per-window KV cache transfer logic.

The implementation properly handles:

  • Per-window block range selection with getBlockRangeForSending
  • Environment variable gating (needSendAllForWindow)
  • Per-window requested block hashes lookup with safe access patterns
  • Window-specific block cropping logic

The logic for handling single-window scenarios (Line 68) where context and generation phases may have different window sizes is well-considered.


91-118: Per-window receiving logic looks correct.

The function properly:

  • Uses window metadata to determine transfer behavior
  • Handles both selective and full cache transfer modes
  • Applies per-window block range trimming when needed
  • Uses appropriate BlockRange factory methods based on cache reuse settings

The logic flow is sound and aligns with the sending counterpart.


232-252: Per-window iteration in layerwise path is implemented correctly.

The code properly iterates over window sizes and uses getBlockRangeForWindow() to get window-specific block ranges. The layer slicing and sending logic remains intact while being adapted for the per-window approach.


@chuangz0 chuangz0 changed the title [TRTLLM-7078][chore] optimal kvcache transfer Draft:[TRTLLM-7078][chore] optimal kvcache transfer Aug 13, 2025
@chuangz0 chuangz0 marked this pull request as draft August 13, 2025 09:13
@chuangz0 chuangz0 requested review from Shixiaowei02 and removed request for Tabrizian and schetlur-nv August 13, 2025 09:14
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🔭 Outside diff range comments (6)
cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (2)

140-157: Fix data race: protect map lookup and erase with the mutex

find() and erase() on mRequestToSession must be done under mMtxForMap. Currently, find() happens before acquiring the lock.

Apply:

 void DataSenderImpl::release(LlmRequest::RequestIdType requestId)
 {
-    auto it = mRequestToSession.find(requestId);
-    TLLM_CHECK(it != mRequestToSession.end());
-    std::unique_lock<std::mutex> lk(mMtxForMap);
+    std::unique_lock<std::mutex> lk(mMtxForMap);
+    auto it = mRequestToSession.find(requestId);
+    TLLM_CHECK(it != mRequestToSession.end());
     if (!common::getEnvKVCacheTransferOutputPath().empty())
     {
         if (!mMeasuresFile.is_open())
         {
             auto outputPath = getTransferOutputPath("send");
             mMeasuresFile.open(outputPath);
             TLLM_CHECK_WITH_INFO(
                 mMeasuresFile.is_open(), "Failed to open transfer output file: %s", outputPath.string().c_str());
         }
         it->second.exportMeasure(mMeasuresFile, true);
     }
     mRequestToSession.erase(it);
 }

218-223: Handle “index not found” case and fix variable name typo

valideConnectionIdx is misspelled, and std::find may return end() when i is not picked, yielding an invalid index.

Apply:

-            // TODO: index -> validConnectionIdx conversion
-            auto valideConnectionIdx = std::find(pickUpIdx.begin(), pickUpIdx.end(), i) - pickUpIdx.begin();
+            // TODO: index -> validConnectionIdx conversion
+            auto itIdx = std::find(pickUpIdx.begin(), pickUpIdx.end(), i);
+            if (itIdx == pickUpIdx.end())
+            {
+                // Skip connections that are not selected by the picker.
+                continue;
+            }
+            auto validConnectionIdx = static_cast<size_t>(itIdx - pickUpIdx.begin());
             auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connection);
             TLLM_CHECK(agentConnection != nullptr);
             TLLM_CHECK(cacheBufferId.has_value());
             const_cast<executor::kv_cache::AgentConnection*>(agentConnection)
-                ->sendRequestAndBufferInfo(requestInfo, cacheBufferId, valideConnectionIdx);
+                ->sendRequestAndBufferInfo(requestInfo, cacheBufferId, validConnectionIdx);
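The data-race fix suggested above reduces to one rule: every access to the shared map — find, use, and erase — must happen under the same mutex. A self-contained sketch of the pattern (types and names are placeholders; the real code guards mRequestToSession and uses TLLM_CHECK instead of throwing):

```cpp
#include <cstdint>
#include <mutex>
#include <stdexcept>
#include <unordered_map>

// Minimal registry demonstrating lock-before-lookup: the mutex is acquired
// before find(), so no other thread can mutate the map between lookup and erase.
class SessionRegistry
{
public:
    void add(std::uint64_t requestId, int session)
    {
        std::unique_lock<std::mutex> lk(mMtx);
        mRequestToSession.emplace(requestId, session);
    }

    void release(std::uint64_t requestId)
    {
        std::unique_lock<std::mutex> lk(mMtx); // acquire BEFORE find()
        auto it = mRequestToSession.find(requestId);
        if (it == mRequestToSession.end())
        {
            throw std::runtime_error("unknown request id");
        }
        mRequestToSession.erase(it);
    }

    std::size_t size()
    {
        std::unique_lock<std::mutex> lk(mMtx);
        return mRequestToSession.size();
    }

private:
    std::mutex mMtx;
    std::unordered_map<std::uint64_t, int> mRequestToSession;
};
```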
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (2)

76-83: Fix pybind11 type: use py::class_ (not py::classh)

py::classh is invalid and will not compile. Use py::class_.

Apply:

-    py::classh<tb::BaseCacheTransceiver, PyCacheTransceiver>(m, "BaseCacheTransceiver")
+    py::class_<tb::BaseCacheTransceiver, PyCacheTransceiver>(m, "BaseCacheTransceiver")
         .def("respond_and_send_async", &BaseCacheTransceiver::respondAndSendAsync)
         .def("request_and_receive_sync", &BaseCacheTransceiver::requestAndReceiveSync)
         .def("request_and_receive_async", &BaseCacheTransceiver::requestAndReceiveAsync)
         .def("check_context_transfer_status", &BaseCacheTransceiver::checkContextTransferStatus)
         .def("check_gen_transfer_status", &BaseCacheTransceiver::checkGenTransferStatus)
         .def("check_gen_transfer_complete", &BaseCacheTransceiver::checkGenTransferComplete);

88-95: Fix pybind11 type in the derived class binding

Same issue: py::classh should be py::class_. Otherwise the binding fails to build.

Apply:

-    py::classh<tb::CacheTransceiver, tb::BaseCacheTransceiver>(m, "CacheTransceiver")
+    py::class_<tb::CacheTransceiver, tb::BaseCacheTransceiver>(m, "CacheTransceiver")
         .def(py::init<tb::kv_cache_manager::BaseKVCacheManager*, std::vector<SizeType32>, SizeType32, SizeType32,
                 runtime::WorldConfig, nvinfer1::DataType, executor::kv_cache::CacheState::AttentionType,
                 std::optional<executor::CacheTransceiverConfig>>(),
             py::arg("cache_manager"), py::arg("num_kv_heads_per_layer"), py::arg("size_per_head"),
             py::arg("tokens_per_block"), py::arg("world_config"), py::arg("dtype"), py::arg("attention_type"),
             py::arg("cache_transceiver_config") = std::nullopt);
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1)

20-30: Missing include; header should be self-contained

This header declares APIs using std::map but does not include <map>. Add it to avoid relying on transitive includes.

Apply:

 #include "tensorrt_llm/batch_manager/kvCacheManager.h"
 #include "tensorrt_llm/executor/executor.h"
 #include "tensorrt_llm/runtime/bufferManager.h"
 #include "tensorrt_llm/runtime/iTensor.h"
+#include <map>
 #include <atomic>
 #include <condition_variable>
 #include <cstddef>
 #include <optional>
 #include <unordered_map>
 #include <vector>
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

18-34: Consider explicitly bringing SizeType32 into scope or fully-qualify it

This header uses SizeType32 but does not declare a using alias in-scope. If not provided transitively, compilation will fail. Prefer fully-qualifying to avoid leaking aliases in a public header.

Example:

  • Replace occurrences of SizeType32 with tensorrt_llm::runtime::SizeType32
    Or add near includes:
    using SizeType32 = tensorrt_llm::runtime::SizeType32;
🧹 Nitpick comments (10)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)

30-34: Consider using initializer list for mBlockIds to avoid unnecessary copy.

blockIds is moved in the member initializer list, but the vector is copied when the caller passes an lvalue by value; consider taking an rvalue reference to avoid that extra copy.

-    BlockRangeForWindow(std::vector<SizeType32> blockIds, runtime::ITensor::SharedPtr pool)
-        : mBlockIds(std::move(blockIds))
+    BlockRangeForWindow(std::vector<SizeType32>&& blockIds, runtime::ITensor::SharedPtr pool)
+        : mBlockIds(blockIds)
         , mPool(std::move(pool))
     {
     }

145-146: Remove commented-out code.

The commented line appears to be debug code that should be removed.

-        // cacheManager.getBlockManager.getPrimaryPool(0);
         auto poolNum = mManager->getNumPools();
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1)

100-102: Binding signature LGTM; consider adding minimal docstrings

The python-exposed pre_alloc_buffer_size now matches the C++ signature and includes tokens_per_block. Optional: add a docstring for clarity.

Example:

-        .def_static("pre_alloc_buffer_size", &tb::kv_cache_manager::CacheTransBufferManager::preAllocBufferSize,
+        .def_static("pre_alloc_buffer_size", &tb::kv_cache_manager::CacheTransBufferManager::preAllocBufferSize,
             py::arg("cache_size_bytes_per_token_per_window"), py::arg("tokens_per_block"),
-            py::arg("cache_transceiver_config") = py::none());
+            py::arg("cache_transceiver_config") = py::none(),
+            R"doc(
+Compute the pre-allocation size for cache transfer buffers.
+Args:
+  cache_size_bytes_per_token_per_window: dict[int,int] of window_size -> size per token (bytes)
+  tokens_per_block: number of tokens per contiguous KV block
+  cache_transceiver_config: optional transceiver configuration
+)doc");
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (2)

49-51: Typo: supportFbaricMemory() → supportFabricMemory()

Public API has a spelling mistake. Renaming is breaking, but better to correct now or provide a deprecated alias.

Suggested change:

  • Rename method to supportFabricMemory() and, if needed, keep a deprecated inline alias for one release cycle.

18-18: Consider adding include guards per project guideline

Guidelines prefer named include guards over pragma once. Optional, if consistent with the repository style.

Example:

  • Add:
    • #ifndef TRTLLM_CACHE_TRANS_BUFFER_H
    • #define TRTLLM_CACHE_TRANS_BUFFER_H
    • ...
    • #endif // TRTLLM_CACHE_TRANS_BUFFER_H
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)

126-129: LGTM: Non-destructive move pattern in test is clear

Copying to blockIdsVec and poolCopy, then moving into BlockRangeForWindow preserves original values and tests the rvalue overload.

You could also construct directly from copies without intermediate variables:
auto range = BlockRangeForWindow(std::vector(blockIds.begin(), blockIds.end()), pool);

cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)

213-219: Consider documenting the rationale for adding tokensPerBlock to validTokenNum calculation.

The logic now adds tokensPerBlock to the valid token count when windowSize < maxNumTokens. While this appears to be accounting for an additional block boundary, the reasoning should be documented for maintainability.

Add a comment explaining the calculation:

 auto validTokenNum
-    = (windowSize < maxNumTokens.value() ? (windowSize + tokensPerBlock) : maxNumTokens.value());
+    = (windowSize < maxNumTokens.value() 
+        ? (windowSize + tokensPerBlock)  // Add one extra block for boundary alignment
+        : maxNumTokens.value());
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

953-956: Parameter name inconsistency: windowSize vs blockPoolIdx in function signature.

The parameter was renamed from blockPoolIdx to windowSize, but there is no documentation comment describing what the parameter now means; one should be added to capture the new semantics.

Consider adding a documentation comment:

-void fillBlockData(tensorrt_llm::runtime::ITensor& blockData, int blockId, size_t initial, int windowSize = 0)
+// Fill block data for testing
+// @param blockData The block tensor to fill
+// @param blockId The block identifier
+// @param initial Initial value for data generation
+// @param windowSize The window size (0 uses default from block manager)
+void fillBlockData(tensorrt_llm::runtime::ITensor& blockData, int blockId, size_t initial, int windowSize = 0)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)

75-80: Optional: Make serialization order deterministic to improve reproducibility

Iterating an unordered_map yields a non-deterministic order. While deserialize reads into an unordered_map (so round-trip works), serialization artifacts (e.g., logs, test fixtures, cross-process replay) may differ run-to-run. If determinism matters, serialize entries sorted by key.

Apply this diff inside serialize to sort keys by windowSize:

-    su::serialize(requestInfo.mBlockHashesPerWindow.size(), os);
-    for (auto const& [windowSize, blockHashes] : requestInfo.mBlockHashesPerWindow)
-    {
-        su::serialize(windowSize, os);
-        su::serialize(blockHashes, os);
-    }
+    su::serialize(requestInfo.mBlockHashesPerWindow.size(), os);
+    std::vector<SizeType32> keys;
+    keys.reserve(requestInfo.mBlockHashesPerWindow.size());
+    for (auto const& kv : requestInfo.mBlockHashesPerWindow)
+    {
+        keys.push_back(kv.first);
+    }
+    std::sort(keys.begin(), keys.end());
+    for (auto const windowSize : keys)
+    {
+        su::serialize(windowSize, os);
+        su::serialize(requestInfo.mBlockHashesPerWindow.at(windowSize), os);
+    }

Additionally, add this include near the top of the file:

#include <algorithm>

238-244: Avoid copying potentially large per-window hash maps on the hot path

blockHashesPerWindow is populated via getBlockHashesPerWindow(), which likely returns a const reference; assigning it to a local map copies the entire structure. Given this runs in a loop and maps contain vectors, this can be expensive.

  • Prefer moving the map out of RequestInfo, e.g., add a RequestInfo::takeBlockHashesPerWindow() that returns the map by value and clears the source, then:
    • auto info = mSender->recvRequestInfo();
    • auto blockHashesPerWindow = info.takeBlockHashesPerWindow(); // moved, no copy
  • If changing RequestInfo is undesirable, consider returning the map by value from getBlockHashesPerWindow() and using NRVO/move.
  • As a lighter alternative, store the map with the request’s Response state so you don’t carry a copy across iterations and only move it into llmRequest at count == 0.

The final move into llmRequest at Line 263 is good; the bottleneck is the preceding copy into blockHashesPerWindow.

Also applies to: 263-263

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc5f766 and fecbc76.

📒 Files selected for processing (17)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.h (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (12 hunks)
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H without an underscore prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🧠 Learnings (1)
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🧬 Code Graph Analysis (11)
cpp/tensorrt_llm/common/envUtils.h (1)
cpp/tensorrt_llm/common/envUtils.cpp (2)
  • getEnvKVCacheTransferAllBlocksForWindow (449-453)
  • getEnvKVCacheTransferAllBlocksForWindow (449-449)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
  • nodiscard (973-980)
  • nodiscard (1429-1432)
  • nodiscard (2479-2497)
cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (3)
cpp/tensorrt_llm/common/envUtils.cpp (2)
  • getEnvDisableSelectiveCacheTransfer (321-325)
  • getEnvDisableSelectiveCacheTransfer (321-321)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (10)
  • llmRequest (421-460)
  • llmRequest (421-421)
  • llmRequest (882-914)
  • llmRequest (882-882)
  • llmRequest (916-923)
  • llmRequest (916-916)
  • llmRequest (925-951)
  • llmRequest (925-925)
  • requestId (200-208)
  • requestId (200-201)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
  • RequestInfo (45-97)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (2)
  • preAllocBufferSize (246-285)
  • preAllocBufferSize (246-248)
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
  • BlockRangeForWindow (27-56)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (2)
cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (3)
  • nodiscard (61-112)
  • nodiscard (123-126)
  • nodiscard (133-138)
cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (1)
  • nodiscard (53-113)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (2)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)
  • maxNumTokens (390-406)
cpp/tensorrt_llm/common/envUtils.cpp (2)
  • getEnvKVCacheTransferAllBlocksForWindow (449-453)
  • getEnvKVCacheTransferAllBlocksForWindow (449-449)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
  • getBlockRangeForSending (43-82)
  • getBlockRangeForSending (43-43)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (8)
  • llmRequest (421-460)
  • llmRequest (421-421)
  • llmRequest (882-914)
  • llmRequest (882-882)
  • llmRequest (916-923)
  • llmRequest (916-916)
  • llmRequest (925-951)
  • llmRequest (925-925)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
cpp/tensorrt_llm/common/envUtils.cpp (4)
  • getEnvKVCacheTransferAllBlocksForWindow (449-453)
  • getEnvKVCacheTransferAllBlocksForWindow (449-449)
  • getEnvDisableSelectiveCacheTransfer (321-325)
  • getEnvDisableSelectiveCacheTransfer (321-321)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (8)
  • llmRequest (421-460)
  • llmRequest (421-421)
  • llmRequest (882-914)
  • llmRequest (882-882)
  • llmRequest (916-923)
  • llmRequest (916-916)
  • llmRequest (925-951)
  • llmRequest (925-925)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
  • RequestInfo (45-97)
cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1)
  • requestInfo (179-179)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)
  • blockIds (48-48)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (2)
  • requestId (200-208)
  • requestId (200-201)
🔇 Additional comments (27)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)

61-78: Empty window metadata case cannot occur
BlockManager always initializes its mWindowSizeToMetadata map with at least one window size, so getWindowSizesMetadata() never returns an empty map. The fromNewlyAllocatedBlockIds factory thus safely assumes non-empty metadata and needs no additional empty-check.

cpp/tensorrt_llm/common/envUtils.h (1)

119-120: LGTM!

The new function declaration follows the established pattern and is properly placed in the namespace.

cpp/tensorrt_llm/common/envUtils.cpp (1)

449-453: LGTM!

The implementation correctly follows the established pattern using static caching and getBoolEnv.

cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)

119-120: LGTM!

The test correctly passes the tokensPerBlock parameter to match the updated API signature.


163-164: LGTM!

The test correctly passes the tokensPerBlock parameter consistent with the API change.

cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)

55-55: LGTM: Updated test to use BlockRangeForWindow with moves

Correctly constructs BlockRangeForWindow by moving blockIds and pool to validate ownership transfer semantics.

cpp/tensorrt_llm/batch_manager/dataTransceiver.h (2)

66-69: Getter name and semantics LGTM; consider noexcept and [[nodiscard]] (already present)

The accessor and storage type are appropriate for per-window mapping. No further changes needed.


53-56: Serialization for RequestInfo now includes mBlockHashesPerWindow correctly

All three methods in cpp/tensorrt_llm/batch_manager/dataTransceiver.* have been updated to account for the per-window hash map:

  • RequestInfo::serialize(...) writes the map size and each (windowSize, blockHashes) pair
  • RequestInfo::deserialize(...) reads them back into mBlockHashesPerWindow
  • RequestInfo::serializedSize(...) sums their sizes

Unit tests (e.g. cacheTransceiverTest.cpp) exercise the round-trip. No further changes are required.

cpp/include/tensorrt_llm/batch_manager/llmRequest.h (3)

1834-1837: LGTM! Method signature refactored to support per-window block hashes.

The change from a single vector to a per-window map (std::unordered_map<SizeType32, std::vector<size_t>>) aligns well with the PR's objective of optimizing KV cache transfer on a per-window basis.


1839-1842: LGTM! Getter method properly updated for per-window access.

The method is correctly renamed and returns the appropriate per-window map type, maintaining consistency with the setter method.


2036-2036: LGTM! Member variable properly updated for per-window storage.

The member variable is correctly changed from a single vector to a per-window map, consistent with the API changes.

cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)

113-127: LGTM! Per-window iteration correctly implemented for MLA format operation.

The change properly iterates over windows using the new per-window API, validates that window sizes match the number of pools, and accumulates blocks per window correctly.


313-326: LGTM! Per-window iteration correctly implemented for MLA unformat operation.

The unformat operation properly mirrors the format operation's per-window approach, ensuring consistency in block handling across both operations.

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (4)

43-82: LGTM! Per-window block range selection logic is well-implemented.

The function correctly implements selective cache transfer with per-window granularity, properly handling reuse information and environment overrides. The debug logging provides good visibility into the decision-making process.


84-111: LGTM! Receiving block range logic correctly handles per-window transfers.

The function properly manages different scenarios (single window, all blocks for window, selective transfer) and correctly builds per-window ranges for receiving.


252-276: LGTM! Format operation correctly updated for per-window block handling.

The non-layer-wise path properly builds per-window block collections and validates window sizes against pool counts. The debug logging provides good visibility.


483-502: LGTM! Unformat operation properly mirrors the format operation's per-window approach.

The receiving side correctly builds per-window output buffers and includes appropriate validation and logging.

cpp/tests/batch_manager/cacheTransceiverTest.cpp (4)

429-438: LGTM! Test properly updated to use per-window iteration.

The test correctly iterates over windows and their blocks, properly filling cache data for each block.


447-458: LGTM! Verification logic correctly updated for per-window structure.

The test properly verifies the received cache data using per-window iteration, maintaining consistency with the sending side.


1082-1084: Good adaptation of the random value generation to use window size.

The change from blockPoolIdx to windowSize in the seed calculation maintains deterministic test behavior while aligning with the new per-window architecture.


1267-1267: Test parameter reduction requires justification.

The test now uses fewer parameter values {60, 30, 60, 10} compared to the original. Please verify this doesn't reduce test coverage for important scenarios.

Could you explain why the test parameters were reduced from {30, 10, 60, 30, 60, 10} to {60, 30, 60, 10}? Is this related to the per-window changes or an optimization?

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (6)

26-26: Header include for SizeType32 is appropriate

Including runtime/common.h to access SizeType32 is correct for the new per-window mapping.


37-37: Alias introduction for SizeType32

The using-declaration keeps the code concise and consistent with the new API. LGTM.


45-51: New RequestInfo ctor with per-window hashes uses proper move semantics

The rvalue-ref constructor and member initialization are correct; moving blockHashesPerWindow avoids a copy. LGTM.


56-57: Equality operator extended to include per-window map

Comparison logic is correct; std::unordered_map operator== is order-independent. LGTM.


105-111: serializedSize updated for per-window map

Accounting for the map size and each entry (key + value vector) is correct. LGTM.


88-95: Ignore the decltype “bug”; only consider reserving buckets

The use of decltype(blockHashes) in the initializer is valid—blockHashes is in scope within its own initializer—so no change is needed for the template argument. You can, however, optionally reserve the map’s buckets for a small performance gain:

  • File: cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (lines 88–95)
    After deserializing size, add:
        auto size = su::deserialize<decltype(blockHashesPerWindow.size())>(is);
    +   blockHashesPerWindow.reserve(size);
        for (size_t i = 0; i < size; i++)
        {
            auto windowSize = su::deserialize<SizeType32>(is);
            std::vector<size_t> blockHashes = su::deserialize<decltype(blockHashes)>(is);
            blockHashesPerWindow.emplace(windowSize, std::move(blockHashes));
        }

If on-wire compatibility across heterogeneous platforms is a concern, you may also consider using a fixed-width type (e.g. std::vector<std::uint64_t>) instead of std::vector<size_t> in both serialization and deserialization.

Likely an incorrect or invalid review comment.

Comment on lines 63 to 66
static size_t preAllocBufferSize(std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
SizeType32 tokensPerBlock,
std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig = std::nullopt);


💡 Verification agent

🧩 Analysis chain

API change acknowledged; ensure all call sites updated and add [[nodiscard]]

The new tokensPerBlock parameter is appropriate. Consider marking the function [[nodiscard]] to prevent accidental ignoring of the computed size. Also verify all call sites were updated.

Apply:

-    static size_t preAllocBufferSize(std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
+    [[nodiscard]] static size_t preAllocBufferSize(
+        std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
         SizeType32 tokensPerBlock,
         std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig = std::nullopt);

Run to find legacy calls missing the new argument:


🏁 Script executed:

#!/bin/bash
# Locate all invocations of preAllocBufferSize and show argument lists.
rg -n -A2 -B2 'preAllocBufferSize\('

Length of output: 3358


Add [[nodiscard]] to preAllocBufferSize declaration

All call sites—both in tests and in trtGptModelInflightBatching.cpp—already pass the new tokensPerBlock argument, so no updates there are needed. Please annotate the API to prevent accidental ignores:

• In cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h, line 63:

-    static size_t preAllocBufferSize(
+    [[nodiscard]] static size_t preAllocBufferSize(
         std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
         SizeType32 tokensPerBlock,
         std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig = std::nullopt);

• (Optional) Mirror the attribute on the definition in cacheTransBuffer.cpp for consistency:

- size_t CacheTransBufferManager::preAllocBufferSize(
+ [[nodiscard]] size_t CacheTransBufferManager::preAllocBufferSize(
    std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow, 
    SizeType32 tokensPerBlock,
    std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
static size_t preAllocBufferSize(std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
SizeType32 tokensPerBlock,
std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig = std::nullopt);
[[nodiscard]] static size_t preAllocBufferSize(std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
SizeType32 tokensPerBlock,
std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig = std::nullopt);
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h around lines 63-66, the
declaration of preAllocBufferSize lacks the [[nodiscard]] attribute which could
allow callers to accidentally ignore the returned size; add [[nodiscard]] before
the static return type to enforce usage. Also (optional) add the same
[[nodiscard]] attribute to the function definition in cacheTransBuffer.cpp to
keep declaration and definition consistent.

#include <map>
#include <string>

#include "tensorrt_llm/batch_manager/cacheTransceiver.h"

⚠️ Potential issue

Missing <unordered_map> include for new RequestInfo member

RequestInfo now uses std::unordered_map, but this header does not include <unordered_map>. Add it to keep the header self-contained.

Apply:

 #include <map>
+#include <unordered_map>
 #include <string>

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.h around line 24, the header
now relies on std::unordered_map via the RequestInfo member but does not include
<unordered_map>, breaking self-containment; add #include <unordered_map> near
other standard includes in this header so the type is available without relying
on transitive includes.

@chuangz0 chuangz0 changed the title Draft:[TRTLLM-7078][chore] optimal kvcache transfer Draft:[TRTLLM-7078][chore] optimal kvcache transfer for VWSA Aug 14, 2025
@chuangz0 chuangz0 force-pushed the kvcache_transfer_for_vwsa_optimal branch from fecbc76 to 3fe3fdb Compare August 18, 2025 07:43
@chuangz0
Copy link
Collaborator Author

/bot run --add-multi-gpu-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15612 [ run ] triggered by Bot

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

238-265: Logic bug: requested block-hashes can be lost when multiple counterparts exist

blockHashesPerWindow is a local variable reinitialized on each loop iteration. When getCounterpartsCount(reqId) > 1, you only assign it during the first iteration (when receiving RequestInfo), but call setRequestedBlockHashes when count reaches 0, likely a later iteration where blockHashesPerWindow is empty. Result: the LlmRequest may be updated with an empty map.

Fix by setting the hashes immediately after receiving RequestInfo (while holding the responder mutex), and remove the late update.

Apply:

-                std::unordered_map<SizeType32, std::vector<size_t>> blockHashesPerWindow;
+                std::unordered_map<SizeType32, std::vector<size_t>> blockHashesPerWindow;
                 if (!isSending() && !mReadyResponses.empty())
                 {
                     auto const& requestInfo = mSender->recvRequestInfo();
                     auto reqId = requestInfo.getRequestId();
                     blockHashesPerWindow = requestInfo.getBlockHashesPerWindow();
 
+                    // Update the matching LlmRequest immediately to avoid losing hashes across iterations.
+                    {
+                        std::unique_lock lk(mResponderMutex);
+                        auto itResp = mReadyResponses.find(reqId);
+                        if (itResp != mReadyResponses.end() && itResp->second.mRequest)
+                        {
+                            itResp->second.mRequest->setRequestedBlockHashes(blockHashesPerWindow);
+                        }
+                        else
+                        {
+                            TLLM_LOG_WARNING("Received RequestInfo for unknown request ID: %zu", reqId);
+                        }
+                    }
+
                     mCurrentRequest = reqId;
                     if (mRemainSendCount.find(reqId) == mRemainSendCount.end())
                     {
                         mRemainSendCount[reqId] = mSender->getCounterpartsCount(reqId);
                     }
                 }
@@
-                        // TODO(zhengd): pass the hashes directly instead of update llmRequest
-                        auto llmRequest = it->second.mRequest;
-                        llmRequest->setRequestedBlockHashes(std::move(blockHashesPerWindow));
+                        // Hashes already set when RequestInfo was received.
♻️ Duplicate comments (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

18-25: Header is missing required standard includes

This header uses std::vector, std::unique_ptr, and std::unordered_map but does not include <vector>, <memory>, or <unordered_map>. Depending on transitive includes is fragile and can break builds.

Apply:

 #pragma once
 #include <fstream>
 #include <future>
 #include <map>
+#include <memory>
+#include <unordered_map>
+#include <vector>
 #include <string>
 
 #include "tensorrt_llm/batch_manager/cacheTransceiver.h"
🧹 Nitpick comments (2)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

24-24: Avoid unnecessary header coupling: move cacheTransceiver.h include to the .cpp

If no types from cacheTransceiver.h are referenced in this header (only in the .cpp), prefer including it in dataTransceiver.cpp to reduce compile-time dependencies.

Would you like me to generate a minimal patch that forward-declares any needed symbols (if any) and moves this include?

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

75-81: Serialize unordered_map deterministically (sort keys before writing)

Iterating std::unordered_map yields nondeterministic order across runs/platforms. For stable artifacts, logs, or cross-version comparisons, serialize entries in a deterministic order (e.g., ascending windowSize).

Apply:

-    su::serialize(requestInfo.mBlockHashesPerWindow.size(), os);
-    for (auto const& [windowSize, blockHashes] : requestInfo.mBlockHashesPerWindow)
-    {
-        su::serialize(windowSize, os);
-        su::serialize(blockHashes, os);
-    }
+    su::serialize(requestInfo.mBlockHashesPerWindow.size(), os);
+    std::vector<SizeType32> keys;
+    keys.reserve(requestInfo.mBlockHashesPerWindow.size());
+    for (auto const& kv : requestInfo.mBlockHashesPerWindow) { keys.push_back(kv.first); }
+    std::sort(keys.begin(), keys.end());
+    for (auto const& windowSize : keys)
+    {
+        su::serialize(windowSize, os);
+        su::serialize(requestInfo.mBlockHashesPerWindow.at(windowSize), os);
+    }
@@
-    totalSize += su::serializedSize(requestInfo.mBlockHashesPerWindow.size());
-    for (auto const& [windowSize, blockHashes] : requestInfo.mBlockHashesPerWindow)
-    {
-        totalSize += su::serializedSize(windowSize);
-        totalSize += su::serializedSize(blockHashes);
-    }
+    totalSize += su::serializedSize(requestInfo.mBlockHashesPerWindow.size());
+    std::vector<SizeType32> keys2;
+    keys2.reserve(requestInfo.mBlockHashesPerWindow.size());
+    for (auto const& kv : requestInfo.mBlockHashesPerWindow) { keys2.push_back(kv.first); }
+    std::sort(keys2.begin(), keys2.end());
+    for (auto const& windowSize : keys2)
+    {
+        totalSize += su::serializedSize(windowSize);
+        totalSize += su::serializedSize(requestInfo.mBlockHashesPerWindow.at(windowSize));
+    }

Also applies to: 100-113
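As a self-contained illustration of the suggestion above, the following sketch sorts the keys before writing so the output no longer depends on std::unordered_map iteration order. Plain text tokens stand in for su::serialize here; that substitution (and the function names) are assumptions for the sketch, not the real wire format.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for su::serialize: writes each value as a text token.
template <typename T>
void serializeValue(T const& value, std::ostream& os)
{
    os << value << ' ';
}

// Serialize an unordered_map deterministically by visiting keys in sorted order.
std::string serializeDeterministic(std::unordered_map<int, std::vector<std::size_t>> const& map)
{
    std::ostringstream os;
    serializeValue(map.size(), os);
    std::vector<int> keys;
    keys.reserve(map.size());
    for (auto const& kv : map)
    {
        keys.push_back(kv.first);
    }
    std::sort(keys.begin(), keys.end());
    for (int key : keys)
    {
        serializeValue(key, os);
        for (auto hash : map.at(key))
        {
            serializeValue(hash, os);
        }
    }
    return os.str();
}
```

Two maps built in different insertion orders now produce byte-identical streams, which is what makes cross-run comparisons stable.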

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fecbc76 and 3fe3fdb.

📒 Files selected for processing (12)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H, without leading/trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
🧠 Learnings (4)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🔇 Additional comments (7)
cpp/tensorrt_llm/common/envUtils.h (1)

119-120: Add Doxygen comment for KV-cache transfer flag

The getEnvKVCacheTransferAllBlocksForWindow API is backed by TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW (default: false, since getBoolEnv returns true only when the variable is set to "1"). Please update the declaration in cpp/tensorrt_llm/common/envUtils.h as follows:

--- a/cpp/tensorrt_llm/common/envUtils.h
+++ b/cpp/tensorrt_llm/common/envUtils.h
@@ -118,7 +118,9 @@ namespace tensorrt_llm::common {
- bool getEnvKVCacheTransferAllBlocksForWindow();
+ //! Returns whether to transfer all KV-cache blocks for each attention window during cache transmission.
+ //! Controlled by the environment variable `TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW` (default: false).
+ bool getEnvKVCacheTransferAllBlocksForWindow();
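For readers unfamiliar with the helper, the semantics described above can be sketched as follows. The actual getBoolEnv implementation in envUtils.cpp is not shown in this review, so treat this as an assumption about its behavior rather than the real code.

```cpp
#include <cstdlib>
#include <cstring>

// Assumed semantics of getBoolEnv: true only when the variable is set to
// exactly "1"; unset or any other value yields false.
bool getBoolEnv(char const* name)
{
    char const* value = std::getenv(name);
    return value != nullptr && std::strcmp(value, "1") == 0;
}

bool getEnvKVCacheTransferAllBlocksForWindow()
{
    return getBoolEnv("TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW");
}
```

Note that under these semantics even values like "true" or "TRUE" leave the flag disabled, which is worth stating in the Doxygen comment.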
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

53-56: API shift to per-window hashes looks consistent

The new constructor and accessor for per-window block hashes are coherent, return by const&, and align with the per-window model propagated across the PR.

Also applies to: 66-69, 93-94

cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)

101-106: Early return via needSendCache() is appropriate

The guard avoids unnecessary work when this rank shouldn’t send cache. Looks good.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (4)

45-52: Constructor and equality updates align with per-window state

The added constructor and operator== changes look correct and consistent with the new per-window storage.

Also applies to: 56-58


26-26: Include for SizeType32 is correct

Including runtime/common.h for SizeType32 is appropriate.


70-98: Backward compatibility of RequestInfo wire format

Switching from a flat vector to a per-window map changes the wire format. If mixed-version context/gen executors are possible, consider version-tagging the serialized stream and supporting both decode paths.

Would you confirm if cross-version KV-cache transmission is supported? If yes, I can propose a minimal version byte header and fallback deserializer.
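A minimal sketch of the proposed version tagging, assuming a token-based stream rather than the real su::serialize binary format; the sentinel value and function names here are hypothetical. A new-format stream leads with a sentinel that cannot be a plausible legacy hash count, so the reader can dispatch to either decode path.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Sentinel chosen so it cannot collide with a legacy leading value (a hash count).
constexpr std::uint64_t kVERSION_TAG = 0xFFFFFFFFFFFFFFFFull;
constexpr std::uint64_t kVERSION_PER_WINDOW = 2;

using HashesPerWindow = std::unordered_map<int, std::vector<std::uint64_t>>;

// New writer: tag + version + per-window payload.
std::string serializePerWindow(HashesPerWindow const& map)
{
    std::ostringstream os;
    os << kVERSION_TAG << ' ' << kVERSION_PER_WINDOW << ' ' << map.size() << ' ';
    for (auto const& [windowSize, hashes] : map)
    {
        os << windowSize << ' ' << hashes.size() << ' ';
        for (auto h : hashes)
        {
            os << h << ' ';
        }
    }
    return os.str();
}

// Reader that accepts both layouts: a legacy stream starts with the flat hash
// count, a new stream starts with the sentinel tag.
HashesPerWindow deserialize(std::string const& bytes, int legacyWindowSize)
{
    std::istringstream is(bytes);
    std::uint64_t first = 0;
    is >> first;
    HashesPerWindow result;
    if (first == kVERSION_TAG)
    {
        std::uint64_t version = 0;
        std::uint64_t numWindows = 0;
        is >> version >> numWindows;
        for (std::uint64_t w = 0; w < numWindows; ++w)
        {
            int windowSize = 0;
            std::uint64_t count = 0;
            is >> windowSize >> count;
            auto& hashes = result[windowSize];
            for (std::uint64_t i = 0; i < count; ++i)
            {
                std::uint64_t h = 0;
                is >> h;
                hashes.push_back(h);
            }
        }
    }
    else
    {
        // Legacy flat vector: 'first' is the hash count; attribute all hashes
        // to the caller-supplied default window.
        auto& hashes = result[legacyWindowSize];
        for (std::uint64_t i = 0; i < first; ++i)
        {
            std::uint64_t h = 0;
            is >> h;
            hashes.push_back(h);
        }
    }
    return result;
}
```

The same idea carries over to the binary format: reserve one leading word as a tag, and keep the legacy decoder behind the fallback branch until mixed-version deployments are no longer supported.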


20-32: No remaining stale API usages detected
Ran a repository-wide search and confirmed that neither getBlockHashes() nor mBlockHashes appear anywhere in the codebase. No further cleanup is needed.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15612 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11752 completed with status: 'FAILURE'

@chuangz0 chuangz0 force-pushed the kvcache_transfer_for_vwsa_optimal branch from 5f3c422 to 885509d Compare August 19, 2025 02:37
@chuangz0
Copy link
Collaborator Author

/bot run --add-multi-gpu-test

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🔭 Outside diff range comments (2)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)

178-183: Fix: don’t force all blocks under pool-0 window; use the per-window map

Mapping everything under pool 0’s window breaks multi-window MLAs. Use the per-window collection built above.

-    SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(0);
-    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocksPerWindow;
-    inputKvCacheBlocksPerWindow.emplace(window, inputKvCacheBlocks);
+    // inputKvCacheBlocksPerWindow was populated above while collecting blocks per window
     tensorrt_llm::executor::kv_cache::splitKVCacheDispatch(
         inputKvCacheBlocksPerWindow, outputSplitCaches, destConfig, selfConfig, selfIdx, bufferManager);

492-501: Fix: use the built per-window map in concat instead of forcing pool-0 window

Passing all blocks under a single window key corrupts layout when multiple windows exist.

-        {
-            std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> outputCachesPerWindow;
-            SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(0);
-            outputCachesPerWindow.emplace(window, outputBuffers);
-            NVTX3_SCOPED_RANGE(formatInputConcatenate);
-
-            // recvSplitCaches size == ppdomainsize
-            executor::kv_cache::concatKvCacheV2Dispatch(
-                recvSplitCaches, outputCachesPerWindow, destConfig, selfConfig, selfIdx, bufferManager);
-        }
+        {
+            NVTX3_SCOPED_RANGE(formatInputConcatenate);
+            // recvSplitCaches size == ppdomainsize
+            executor::kv_cache::concatKvCacheV2Dispatch(
+                recvSplitCaches, outputCachesPerWindow, destConfig, selfConfig, selfIdx, bufferManager);
+        }
♻️ Duplicate comments (7)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1)

63-65: Add [[nodiscard]] to preAllocBufferSize to prevent ignored results

The return value is critical for buffer sizing. Mark it [[nodiscard]] to avoid accidental discarding. This mirrors earlier feedback and keeps the API safer.

-    static size_t preAllocBufferSize(std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
+    [[nodiscard]] static size_t preAllocBufferSize(
+        std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
         SizeType32 tokensPerBlock,
         std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig = std::nullopt);
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

24-24: Header is not self-contained: missing <unordered_map> (and SizeType32 definition)

This header uses std::unordered_map and SizeType32 in public declarations but does not include the corresponding headers. Transitive includes may mask this locally but break other translation units.

 #include <map>
+#include <unordered_map>
 #include <string>

 #include "tensorrt_llm/batch_manager/cacheTransceiver.h"
+#include "tensorrt_llm/runtime/common.h"
cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1)

181-188: Only override RequestInfo when there are new blocks; move the map to avoid copies

Currently you always reassign RequestInfo even if no newly allocated blocks exist. Guard the override and move the map to leverage the rvalue overload and avoid unnecessary copies.

-    auto disableSelectiveCacheTransfer = common::getEnvDisableSelectiveCacheTransfer();
+    auto disableSelectiveCacheTransfer = common::getEnvDisableSelectiveCacheTransfer();
     if (!disableSelectiveCacheTransfer)
     {
         auto* cacheManager = mFormatter->getCacheManager();
         auto blockRange
             = kv_cache_manager::BlockRange::fromNewlyAllocatedBlockIds(*cacheManager, llmRequest.mRequestId);
-        requestInfo = RequestInfo(requestId, blockRange.getBlockHashesPerWindow(), mSelfState);
+        if (auto blockHashesPerWindow = blockRange.getBlockHashesPerWindow(); !blockHashesPerWindow.empty())
+        {
+            requestInfo = RequestInfo(requestId, std::move(blockHashesPerWindow), mSelfState);
+        }
     }
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)

313-327: Bug: receiving path also flattens per-window blocks; build per-window map once

Same issue on receive: a flat outputBuffers loses window grouping and leads to incorrect concat.

-    auto const& windowSizes = blockRange.getWindowSizes();
-    TLLM_CHECK_WITH_INFO(
-        static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");
-    // TODO(oargov): are we sure the other side has the same number of pools? this might not hold for pp_size>1...
-    size_t blockNum = 0;
-    for (auto const& windowSize : windowSizes)
-    {
-        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
-        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
-        {
-            outputBuffers.push_back(it);
-            blockNum++;
-        }
-    }
+    auto const& windowSizes = blockRange.getWindowSizes();
+    TLLM_CHECK_WITH_INFO(static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");
+    // Build per-window mapping and maintain a flat view for zcopy/size calculations.
+    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> outputCachesPerWindow;
+    size_t blockNum = 0;
+    for (auto const& windowSize : windowSizes)
+    {
+        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+        auto& vec = outputCachesPerWindow[windowSize];
+        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
+        {
+            vec.push_back(it);
+            outputBuffers.push_back(it); // flat view for zcopy path
+            ++blockNum;
+        }
+    }

113-121: Bug: per-window blocks are flattened; build a true per-window map during collection

You create a flat inputKvCacheBlocks vector, losing per-window grouping required by split/concat logic when multiple windows exist.

Apply this change to collect both flat and per-window views:

     int blockNum = 0;
-    std::vector<runtime::ITensor::SharedPtr> inputKvCacheBlocks;
+    std::vector<runtime::ITensor::SharedPtr> inputKvCacheBlocks;
+    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocksPerWindow;
     auto const numPools = mCacheManager->getBlockManager().getNumPools();
     auto blockRange = getBlockRangeForSending(mCacheManager, llmRequest);
     auto const& windowSizes = blockRange.getWindowSizes();
     TLLM_CHECK_WITH_INFO(
         static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");
     for (auto const& windowSize : windowSizes)
     {
         auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
         for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
         {
             inputKvCacheBlocks.push_back(it);
+            inputKvCacheBlocksPerWindow[windowSize].push_back(it);
             blockNum++;
         }
     }

Note: follow-up diffs below switch split/concat to use inputKvCacheBlocksPerWindow.

cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)

82-85: Unify existence check and fix message grammar/format.

Use the same backing set as other checks (pools are authoritative for window presence), and fix “exists” -> “exist”. Also cast to an integral type to match the %d format.

Apply this diff:

-        TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
-            "Window size %d should exists", windowSize);
+        TLLM_CHECK_WITH_INFO(mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(),
+            "Window size %d should exist", static_cast<int>(windowSize));

89-95: Make the window existence check consistent and fix grammar.

Keep consistency with other methods by validating against pools, and fix the message.

Apply this diff:

-            TLLM_CHECK_WITH_INFO(
-                mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d should exists", windowSize);
+            TLLM_CHECK_WITH_INFO(
+                mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(),
+                "Window size %d should exist", static_cast<int>(windowSize));
🧹 Nitpick comments (5)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

53-56: Constructor overload looks good; ensure serialization/deserialization updated accordingly

You’ve added a per-window hashes ctor taking an rvalue map. Confirm serialize/deserialize/serializedSize cover the new layout and remain version-safe.

Would you like me to generate a verification script that locates RequestInfo::serialize/deserialize definitions and checks for handling of unordered_map<SizeType32, std::vector<size_t>>?

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)

225-233: Layer-wise path: confirm per-window iteration is intended for MLA alt-window layouts

Looping windowSizes inside the per-layer loop means each layer slice is sent for every window. If BlockRange partitions blocks disjointly per window (as expected), this is fine; if not, you may duplicate slices. Please confirm BlockRange::getBlockRangeForWindow(windowSize) returns disjoint sets across windows.
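If useful for a quick unit check, the disjointness assumption can be asserted with a small helper. The container shape and names below are hypothetical; in the real code the per-window block ids live inside BlockRange.

```cpp
#include <cassert>
#include <set>
#include <unordered_map>
#include <vector>

// Returns true when no block id appears under more than one window.
bool windowsAreDisjoint(std::unordered_map<int, std::vector<int>> const& blockIdsPerWindow)
{
    std::set<int> seen;
    for (auto const& [windowSize, ids] : blockIdsPerWindow)
    {
        for (int id : ids)
        {
            if (!seen.insert(id).second)
            {
                return false;
            }
        }
    }
    return true;
}
```

If block ids are only unique within a pool rather than globally, the check would need (windowSize, blockId) pairs instead; that detail depends on BlockManager internals not visible in this diff.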


484-502: Per-window receive collection—LGTM; keep both per-window and flat views if zcopy path is used

The new per-window mapping plus block count accumulation is correct. Ensure subsequent zcopy paths iterate per-window or use a flat view consistently.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

70-82: Serialize/deserialize/size updated for per-window — consider stable ordering (optional).

Current implementation serializes an unordered_map, which is fine for round-trip and equality checks. If you ever compare serialized blobs across processes or runs, consider serializing window keys in sorted order to ensure deterministic byte streams.

Example change (conceptual):

-    for (auto const& [windowSize, blockHashes] : requestInfo.mBlockHashesPerWindow)
+    std::vector<SizeType32> keys;
+    keys.reserve(requestInfo.mBlockHashesPerWindow.size());
+    for (auto const& kv : requestInfo.mBlockHashesPerWindow) keys.push_back(kv.first);
+    std::sort(keys.begin(), keys.end());
+    for (auto const& windowSize : keys)
     {
-        su::serialize(windowSize, os);
-        su::serialize(blockHashes, os);
+        su::serialize(windowSize, os);
+        su::serialize(requestInfo.mBlockHashesPerWindow.at(windowSize), os);
     }

Apply similarly in serializedSize. Deserializer can remain unchanged.

Also applies to: 84-98, 100-112

cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)

97-111: Nit: pre-reserve per-window hash vectors to reduce reallocations.

Minor perf win when many blocks exist per window.

Apply this diff:

     for (auto const& [windowSize, blockIds] : mBlockIdsPerWindow)
     {
-        for (auto const& blockId : blockIds)
+        auto& hashes = blockHashesPerWindow[windowSize];
+        hashes.reserve(blockIds.size());
+        for (auto const& blockId : blockIds)
         {
-            blockHashesPerWindow[windowSize].emplace_back(
-                blockManager.getBlockById(blockId, windowSize)->getHash());
+            hashes.emplace_back(blockManager.getBlockById(blockId, windowSize)->getHash());
         }
     }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5f3c422 and 885509d.

📒 Files selected for processing (18)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (12 hunks)
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
🚧 Files skipped from review as they are similar to previous changes (9)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H without leading/trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
🧠 Learnings (5)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (16)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (2)

66-69: Getter naming and noexcept/[[nodiscard]] are consistent—LGTM

The const-ref getter for per-window hashes is clear and avoids copies.


93-94: Data member rename to per-window is consistent—LGTM

Switching to mBlockHashesPerWindow aligns with the new API and the rest of the PR.

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)

251-271: Building per-window map and total element count—LGTM

Collecting input blocks keyed by window and summing element counts aligns with splitKVCacheDispatch’s expectations (element counts, not bytes). Good.

cpp/tests/batch_manager/cacheTransceiverTest.cpp (7)

430-437: Per-window write path looks correct (BlockRangeForWindow + device memset).

Good migration to per-window iteration. Using it->data() and it->getSizeInBytes() with cudaMemset per block is correct for the sentinel-based iterator, and aligns with the new BlockRangeForWindow API.


891-903: Per-window fill path is consistent and synced on correct stream.

The switch to iterating getWindowSizes() and filling via fillBlockData(*it, ...) per window is consistent with the new BlockRange APIs. Synchronizing the buffer manager stream for onlyWindowSize before sending is a good touch.


934-949: Verify off-by-one in per-window verification range.

startBlockId = size - (maxBlockInWindow + 1) makes the verifier include maxBlockInWindow + 1 trailing blocks. If the intent is to check exactly one window's worth of blocks, consider dropping the + 1. If the extra block is intentional (e.g., to include the current partial block), add a brief comment documenting it.

Would you like me to adjust and re-run the tests to confirm both behaviors? I can also draft a small helper to compute the exact expected tail length based on tokensPerBlock and post-prefill semantics (per the retrieved learning).


953-1014: fillBlockData: window-aware path and token indexing are solid.

  • Passing windowSize down is correct and selecting the buffer manager via that size matches per-window pools.
  • Using tokensPerBlock from the model config to advance token indices is aligned with the attention-window onboarding learning.

1016-1080: verifyBlockData mirrors fill logic correctly.

The verification path uses the same window-aware buffer manager and token indexing scheme, which keeps generation/verification in sync.


1082-1109: Deterministic seeding includes windowSize — good for per-window reproducibility.

Hashing windowSize into the seed ensures expected values differ across windows, avoiding cross-window collisions while remaining deterministic.


650-650: Doubling mMaxNumSequences to 16: check runtime/memory budget.

Increasing from 8 to 16 scales the total number of blocks and transferred bytes proportionally. Given recent CI failures, please confirm this does not push test time or memory over your CI limits on multi-GPU runs.

If helpful, I can add a quick gtest environment flag to reduce mMaxNumSequences under CI (e.g., via getenv) while keeping 16 for local stress testing.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)

45-52: RequestInfo: per-window block hashes constructor is sound.

Storing hashes as unordered_map<SizeType32, std::vector<size_t>> aligns with the per-window design. Moving the map in avoids copies.


238-279: Responder path correctly adopts per-window hashes into LlmRequest.

Receiving blockHashesPerWindow and forwarding via llmRequest->setRequestedBlockHashes(std::move(...)) is a clean handoff that avoids extra copies. The logic to wait for all counterparts before setting and sending remains intact.

Also applies to: 263-265

cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4)

30-56: BlockRangeForWindow + iterator scaffold are clean and minimal.

Encapsulating per-window block IDs and pool with a sentinel-iterator is a neat fit for range-for and explicit begin/end loops.


113-120: getBlockRangeForWindow: good defensive check; return by value is fine.

Validating presence in mPoolsPerWindow and returning a self-contained BlockRangeForWindow avoids dangling references and keeps iteration simple.


137-166: Constructors correctly populate pools-per-window and block IDs.

  • fromAllBlockIds: pulls per-window block IDs for the request; good.
  • fromNewlyAllocatedBlockIds: collects IDs per window; good.

187-193: Iterator correctness and sentinel comparison look good.

Bounds checks, lazy slice creation, and sentinel-based termination are implemented properly. The implicit conversion to SharedPtr matches call sites.

Also applies to: 230-233, 242-249, 255-258

Comment on lines +447 to +457
auto const& windowSizes = blockRange.getWindowSizes();
for (auto const& windowSize : windowSizes)
{
    std::vector<uint8_t> bytes(block.getSizeInBytes());
    TLLM_CUDA_CHECK(cudaMemcpy(bytes.data(), block.data(), block.getSizeInBytes(), cudaMemcpyDeviceToHost));
    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
        [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
    auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
    for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
    {
        std::vector<uint8_t> bytes(it->getSizeInBytes());
        TLLM_CUDA_CHECK(cudaMemcpy(bytes.data(), it->data(), it->getSizeInBytes(), cudaMemcpyDeviceToHost));
        EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
            [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
    }
⚠️ Potential issue

Bug: precedence in equality check causes wrong assertion.

i == llmRequest->getPromptLen() & 0xff is parsed as (i == llmRequest->getPromptLen()) & 0xff. Use parentheses (or an explicit cast) to compare i against the masked byte.

Apply this diff:

-                    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-                        [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+                    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
+                        [&llmRequest](uint8_t i) {
+                            return i == static_cast<uint8_t>(llmRequest->getPromptLen() & 0xff);
+                        }));
🤖 Prompt for AI Agents
In cpp/tests/batch_manager/cacheTransceiverTest.cpp around lines 447 to 457, the
lambda in the EXPECT_TRUE uses the expression i == llmRequest->getPromptLen() &
0xff which is parsed as (i == llmRequest->getPromptLen()) & 0xff; change it to
compare the masked value explicitly, e.g. i == (llmRequest->getPromptLen() &
0xff) or i == static_cast<uint8_t>(llmRequest->getPromptLen()), so the byte-wise
comparison is performed correctly.

@chuangz0 chuangz0 force-pushed the kvcache_transfer_for_vwsa_optimal branch from 885509d to f200189 Compare August 19, 2025 07:30
@chuangz0
Collaborator Author

/bot run --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #15743 [ run ] triggered by Bot

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (4)

138-146: Zero-copy path must iterate per-window blocks

Send blocks grouped by window to preserve ordering and sizes per window.

Apply:

-        for (size_t i = 0; i < connections.size(); i++)
-        {
-            for (auto const& block : inputKvCacheBlocks)
-            {
-                session.send(i, block->data(), block->getSizeInBytes());
-            }
-        }
+        for (size_t i = 0; i < connections.size(); i++)
+        {
+            for (auto const& [w, blocks] : inputKvCacheBlocksPerWindow)
+            {
+                (void) w;
+                for (auto const& block : blocks)
+                {
+                    session.send(i, block->data(), block->getSizeInBytes());
+                }
+            }
+        }

178-183: Don’t remap all blocks under pool 0’s window

You already have inputKvCacheBlocksPerWindow; use it directly.

Apply:

-    SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(0);
-    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocksPerWindow;
-    inputKvCacheBlocksPerWindow.emplace(window, inputKvCacheBlocks);
+    // inputKvCacheBlocksPerWindow was built above

492-501: Use the built per-window map for concat, not a single-window wrapper

Stop wrapping all buffers under pool 0’s window. Pass the real per-window map.

Apply:

-        {
-            std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> outputCachesPerWindow;
-            SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(0);
-            outputCachesPerWindow.emplace(window, outputBuffers);
-            NVTX3_SCOPED_RANGE(formatInputConcatenate);
-
-            // recvSplitCaches size == ppdomainsize
-            executor::kv_cache::concatKvCacheV2Dispatch(
-                recvSplitCaches, outputCachesPerWindow, destConfig, selfConfig, selfIdx, bufferManager);
-        }
+        {
+            NVTX3_SCOPED_RANGE(formatInputConcatenate);
+            // recvSplitCaches size == ppdomainsize
+            executor::kv_cache::concatKvCacheV2Dispatch(
+                recvSplitCaches, outputCachesPerWindow, destConfig, selfConfig, selfIdx, bufferManager);
+        }

153-156: Use per-window block size for buffer allocation

Derive cacheBlockSize from the per-window map rather than the flattened vector to handle multiple attention windows correctly.

  • File: cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp, lines 153–156
  • auto cacheBlockSize = inputKvCacheBlocks.at(0)->getSize();
  • auto cacheBlockSize = inputKvCacheBlocksPerWindow.begin()->second.at(0)->getSize();
  • Ensure that `inputKvCacheBlocksPerWindow` is populated with one entry per window returned by `blockRange.getWindowSizes()`, not just a single window.
  • Add or update tests covering multiple attention windows to verify that the computed buffer sizes and splits remain correct.

♻️ Duplicate comments (5)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)

246-285: Avoid duplicated validTokenNum logic (constructor vs. static function)

The same policy is implemented in two places; centralize to a helper to reduce drift.

You can add a small internal helper (in an anonymous namespace) and call it from both places:

// Place near top of this translation unit
namespace {
inline size_t calculateValidTokenNum(size_t windowSize, size_t maxNumTokens, size_t tokensPerBlock)
{
    auto v = (windowSize < maxNumTokens)
        ? std::min(maxNumTokens, windowSize + tokensPerBlock)
        : maxNumTokens;
    if (tensorrt_llm::common::getEnvKVCacheTransferAllBlocksForWindow())
    {
        v = maxNumTokens;
    }
    return v;
}
} // namespace

Then replace the inlined computations at Lines 213-219 and 265-271 with calculateValidTokenNum(...).

cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)

91-95: Use consistent container for validation in setBlockIdsForAllWindows

Align with setBlockIdsForWindow; validate against mBlockIdsPerWindow, not mPoolsPerWindow.

Apply:

-            TLLM_CHECK_WITH_INFO(
-                mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d should exists", windowSize);
+            TLLM_CHECK_WITH_INFO(
+                mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
+                "Window size %d should exist", static_cast<int>(windowSize));
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)

113-127: Bug: per-window blocks are flattened; preserve per-window grouping

Flattening discards window boundaries and later uses pool 0’s window for all. Build a true per-window map upfront.

Apply:

-    int blockNum = 0;
-    std::vector<runtime::ITensor::SharedPtr> inputKvCacheBlocks;
+    int blockNum = 0;
+    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocksPerWindow;
@@
-    for (auto const& windowSize : windowSizes)
-    {
-        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
-        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
-        {
-            inputKvCacheBlocks.push_back(it);
-            blockNum++;
-        }
-    }
+    for (auto const& windowSize : windowSizes)
+    {
+        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+        auto& vec = inputKvCacheBlocksPerWindow[windowSize];
+        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
+        {
+            vec.push_back(it);
+            ++blockNum;
+        }
+    }

313-327: Receiving path: also preserve per-window grouping (don’t flatten)

Build outputCachesPerWindow and keep a flat view only where needed (zcopy/size checks).

Apply:

-    auto const& windowSizes = blockRange.getWindowSizes();
-    TLLM_CHECK_WITH_INFO(
-        static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");
-    // TODO(oargov): are we sure the other side has the same number of pools? this might not hold for pp_size>1...
-    size_t blockNum = 0;
-    for (auto const& windowSize : windowSizes)
-    {
-        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
-        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
-        {
-            outputBuffers.push_back(it);
-            blockNum++;
-        }
-    }
+    auto const& windowSizes = blockRange.getWindowSizes();
+    TLLM_CHECK_WITH_INFO(static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");
+    // Build per-window mapping and a flat view for zcopy and size computations.
+    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> outputCachesPerWindow;
+    size_t blockNum = 0;
+    for (auto const& windowSize : windowSizes)
+    {
+        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+        auto& vec = outputCachesPerWindow[windowSize];
+        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
+        {
+            vec.push_back(it);
+            ++blockNum;
+        }
+    }
+    for (auto const& [w, vec] : outputCachesPerWindow)
+    {
+        (void) w;
+        outputBuffers.insert(outputBuffers.end(), vec.begin(), vec.end());
+    }
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

455-457: Fix bitwise precedence in byte comparison

The current lambda evaluates (i == llmRequest->getPromptLen()) & 0xff. Compare against the masked byte explicitly.

Apply:

-                    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-                        [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+                    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
+                        [&llmRequest](uint8_t i) {
+                            return i == static_cast<uint8_t>(llmRequest->getPromptLen() & 0xff);
+                        }));
🧹 Nitpick comments (3)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)

259-284: Local naming style nit: prefer lowerCamelCase for locals

TransferBufferSize/RecvBufferCount/SendBufferCount/PreAllocBufferSize are local variables; per guidelines, use lowerCamelCase.

Apply:

-    size_t TransferBufferSize = common::getEnvMemSizeForKVCacheTransferBuffer();
+    size_t transferBufferSize = common::getEnvMemSizeForKVCacheTransferBuffer();
@@
-        TransferBufferSize = 0;
+        transferBufferSize = 0;
@@
-            TransferBufferSize += validTokenNum * cacheSizeBytesPerToken;
+            transferBufferSize += validTokenNum * cacheSizeBytesPerToken;
@@
-    if (useFabricMemory)
+    if (useFabricMemory)
     {
-        TransferBufferSize = FabricMemory::getAlignedSize(TransferBufferSize);
+        transferBufferSize = FabricMemory::getAlignedSize(transferBufferSize);
     }
-    size_t RecvBufferCount = common::getEnvRequestKVCacheConcurrent() ? common::getEnvKVCacheRecvBufferCount() : 1;
-    size_t SendBufferCount = common::getEnvParallelCacheSend() ? common::getEnvKVCacheSendMaxConcurrenceNum() : 1;
-    size_t PreAllocBufferSize = TransferBufferSize * (RecvBufferCount + SendBufferCount);
-    return PreAllocBufferSize;
+    size_t recvBufferCount = common::getEnvRequestKVCacheConcurrent() ? common::getEnvKVCacheRecvBufferCount() : 1;
+    size_t sendBufferCount = common::getEnvParallelCacheSend() ? common::getEnvKVCacheSendMaxConcurrenceNum() : 1;
+    size_t preAllocBufferSize = transferBufferSize * (recvBufferCount + sendBufferCount);
+    return preAllocBufferSize;
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)

80-85: Fix message grammar and format spec; keep types explicit

Minor polish and safer formatting.

Apply:

-        TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
-            "Window size %d should exists", windowSize);
+        TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
+            "Window size %d should exist", static_cast<int>(windowSize));

115-121: Add check for blockIds presence before at() to avoid exceptions

getBlockRangeForWindow validates pools but not block IDs; add a symmetric check for mBlockIdsPerWindow.

Apply:

     TLLM_CHECK_WITH_INFO(
         mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d not found", windowSize);
+    TLLM_CHECK_WITH_INFO(
+        mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(), "Block IDs for window %d not found",
+        static_cast<int>(windowSize));
     auto pool = mPoolsPerWindow.at(windowSize).front();
     auto blockIds = mBlockIdsPerWindow.at(windowSize);
     return BlockRangeForWindow(std::move(blockIds), std::move(pool));
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 885509d and f200189.

📒 Files selected for processing (18)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (12 hunks)
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
🚧 Files skipped from review as they are similar to previous changes (12)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H without underscore prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
🧠 Learnings (4)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)

255-258: Begin() implementation LGTM

The iterator contract and sentinel usage are consistent and safe.

cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)

429-437: Per-window prefill in test is fine

Filling per-window blocks with the prompt length is appropriate for reuse checks.


649-650: Test capacity increase LGTM

Raising mMaxNumSequences to 16 improves coverage for windowed attention scenarios.


1267-1269: DP test lengths updated – windowSize is accounted for in the expected-value generator
Verified that generateExpectedValue mixes in windowSize when seeding, so the varied request lengths (60, 30, 60, 10) produce correct expected values.

Comment on lines +213 to +219
auto validTokenNum
= (windowSize < maxNumTokens.value() ? (windowSize + tokensPerBlock) : maxNumTokens.value());
if (common::getEnvKVCacheTransferAllBlocksForWindow())
{
validTokenNum = maxNumTokens.value();
}


⚠️ Potential issue

Clamp validTokenNum to maxNumTokens to avoid overallocation

When windowSize < maxNumTokens, adding tokensPerBlock can exceed maxNumTokens, leading to buffer overallocation. Clamp to maxNumTokens.

Apply:

-            auto validTokenNum
-                = (windowSize < maxNumTokens.value() ? (windowSize + tokensPerBlock) : maxNumTokens.value());
+            auto validTokenNum = (windowSize < maxNumTokens.value())
+                ? std::min(maxNumTokens.value(), windowSize + tokensPerBlock)
+                : maxNumTokens.value();
             if (common::getEnvKVCacheTransferAllBlocksForWindow())
             {
                 validTokenNum = maxNumTokens.value();
             }
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp around lines 213 to 219,
validTokenNum is computed as (windowSize < maxNumTokens ? windowSize +
tokensPerBlock : maxNumTokens) which can exceed maxNumTokens when tokensPerBlock
is added; change the logic to clamp validTokenNum to maxNumTokens (e.g., compute
tentative = windowSize + tokensPerBlock when windowSize < maxNumTokens, then set
validTokenNum = std::min(tentative, maxNumTokens.value()); keep the existing
override for getEnvKVCacheTransferAllBlocksForWindow() that forces
maxNumTokens).

Comment on lines +265 to +271
auto validTokenNum = (static_cast<size_t>(windowSize) < maxNumTokens.value()
? static_cast<size_t>(windowSize) + tokensPerBlock
: maxNumTokens.value());
if (common::getEnvKVCacheTransferAllBlocksForWindow())
{
validTokenNum = maxNumTokens.value();
}

⚠️ Potential issue

Same clamping bug in preAllocBufferSize; fix mirrors constructor

Ensure validTokenNum never exceeds maxNumTokens here as well.

Apply:

-            auto validTokenNum = (static_cast<size_t>(windowSize) < maxNumTokens.value()
-                    ? static_cast<size_t>(windowSize) + tokensPerBlock
-                    : maxNumTokens.value());
+            auto validTokenNum = (static_cast<size_t>(windowSize) < maxNumTokens.value())
+                ? std::min(maxNumTokens.value(), static_cast<size_t>(windowSize) + static_cast<size_t>(tokensPerBlock))
+                : maxNumTokens.value();
             if (common::getEnvKVCacheTransferAllBlocksForWindow())
             {
                 validTokenNum = maxNumTokens.value();
             }
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp around lines 265 to 271,
validTokenNum is computed but not properly clamped to maxNumTokens causing the
same overflow bug as elsewhere; adjust the logic so after computing the
tentative validTokenNum you clamp it to at most maxNumTokens.value() (i.e., set
validTokenNum = std::min(validTokenNum, maxNumTokens.value())), and keep the
existing branch for getEnvKVCacheTransferAllBlocksForWindow() so that when that
flag is set validTokenNum is assigned maxNumTokens.value(); ensure types remain
size_t and no extra arithmetic can exceed the cap.

Comment on lines +103 to 105
nb::arg("cache_size_bytes_per_token_per_window"), nb::arg("tokens_per_block"),
nb::arg("cache_transceiver_config") = nb::none());
}

💡 Verification agent

🧩 Analysis chain

Binding signature update looks correct; ensure Python callers are updated

The added tokens_per_block argument and defaulted config align with the updated C++ signature.

Run this to find Python/C++ callsites still using the old arity:


🏁 Script executed:

#!/bin/bash
# Verify all callsites of pre_alloc_buffer_size pass tokens_per_block
rg -n -C2 -P '\bpre_alloc_buffer_size\s*\(' --type=py --type=cpp --type=cxx --type=cu

Length of output: 115


🏁 Script executed:

#!/bin/bash
# Verify all callsites of pre_alloc_buffer_size pass tokens_per_block across relevant file types
rg -n -C2 '\bpre_alloc_buffer_size\s*\(' -g '*.py' -g '*.cpp' -g '*.cc' -g '*.cu'

Length of output: 725


Update Python wrapper to pass tokens_per_block to C++ API

The Python static method pre_alloc_buffer_size in
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
still invokes the C++ binding with only two arguments. You need to:

• Change the method signature to accept the new tokens_per_block parameter (and default the config):

-    @staticmethod
-    def pre_alloc_buffer_size(kv_cache_size_per_token: int,
-                              cache_transceiver_config: CacheTransceiverConfig):
-        return CacheTransBufferManagerCpp.pre_alloc_buffer_size(
-            kv_cache_size_per_token, cache_transceiver_config)
+    @staticmethod
+    def pre_alloc_buffer_size(
+        kv_cache_size_per_token: int,
+        tokens_per_block: int,
+        cache_transceiver_config: CacheTransceiverConfig = None
+    ):
+        return CacheTransBufferManagerCpp.pre_alloc_buffer_size(
+            kv_cache_size_per_token, tokens_per_block, cache_transceiver_config
+        )

• Ensure any callers of pre_alloc_buffer_size in Python now supply the new tokens_per_block argument.

Committable suggestion skipped: line range outside the PR's diff.

@tensorrt-cicd
Collaborator

PR_Github #15743 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11832 completed with status: 'FAILURE'

@chuangz0 chuangz0 force-pushed the kvcache_transfer_for_vwsa_optimal branch from f200189 to 3f7dc47 on August 19, 2025 09:42
@chuangz0
Collaborator Author

/bot run --add-multi-gpu-test


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (6)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

24-24: Missing <unordered_map> include for new RequestInfo member

RequestInfo now uses std::unordered_map, but this header does not include <unordered_map>. Add it to keep the header self-contained.

Add the missing include alongside the existing standard headers (kept in sorted order):

 #include <map>
 #include <string>
+#include <unordered_map>
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (2)

213-219: Potential buffer overallocation when windowSize + tokensPerBlock exceeds maxNumTokens

When windowSize < maxNumTokens, adding tokensPerBlock can exceed maxNumTokens, leading to buffer overallocation. The calculation should be clamped.

Apply this fix to properly clamp the value:

-            auto validTokenNum
-                = (windowSize < maxNumTokens.value() ? (windowSize + tokensPerBlock) : maxNumTokens.value());
+            auto validTokenNum = (windowSize < maxNumTokens.value())
+                ? std::min(maxNumTokens.value(), windowSize + tokensPerBlock)
+                : maxNumTokens.value();
             if (common::getEnvKVCacheTransferAllBlocksForWindow())
             {
                 validTokenNum = maxNumTokens.value();
             }

265-271: Same clamping issue in preAllocBufferSize method

The static preAllocBufferSize method has the same potential overflow issue where validTokenNum can exceed maxNumTokens.

Apply the same fix here:

-            auto validTokenNum = (static_cast<size_t>(windowSize) < maxNumTokens.value()
-                    ? static_cast<size_t>(windowSize) + tokensPerBlock
-                    : maxNumTokens.value());
+            auto validTokenNum = (static_cast<size_t>(windowSize) < maxNumTokens.value())
+                ? std::min(maxNumTokens.value(), static_cast<size_t>(windowSize) + static_cast<size_t>(tokensPerBlock))
+                : maxNumTokens.value();
             if (common::getEnvKVCacheTransferAllBlocksForWindow())
             {
                 validTokenNum = maxNumTokens.value();
             }
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)

82-85: Fix format specifier type mismatch in error message

The format string uses %d but windowSize is SizeType32 (likely uint32_t). This can cause undefined behavior.

Fix the format specifier:

         TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
-            "Window size %d should exists", windowSize);
+            "Window size %u should exist", windowSize);

Also note the typo: "exists" should be "exist".


89-95: Inconsistent validation between setBlockIdsForWindow and setBlockIdsForAllWindows

setBlockIdsForWindow checks existence in mBlockIdsPerWindow, but setBlockIdsForAllWindows checks against mPoolsPerWindow. This inconsistency could cause issues.

Make the validation consistent:

     void setBlockIdsForAllWindows(std::unordered_map<SizeType32, std::vector<SizeType32>> blockIdsPerWindow)
     {
         for (auto const& [windowSize, blockIds] : blockIdsPerWindow)
         {
             TLLM_CHECK_WITH_INFO(
-                mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d should exists", windowSize);
+                mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(), "Window size %u should exist", windowSize);
         }
         mBlockIdsPerWindow = std::move(blockIdsPerWindow);
     }
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

447-457: Fix operator precedence issue in lambda expression

The expression i == llmRequest->getPromptLen() & 0xff is parsed as (i == llmRequest->getPromptLen()) & 0xff due to operator precedence, which is incorrect.

Fix the precedence issue:

                     EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-                        [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+                        [&llmRequest](uint8_t i) { return i == static_cast<uint8_t>(llmRequest->getPromptLen() & 0xff); }));
🧹 Nitpick comments (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)

145-152: Remove commented-out code

Line 145 contains commented-out code that should be removed. Dead code should not be left in the codebase.

-        // cacheManager.getBlockManager.getPrimaryPool(0);
         auto poolNum = mManager->getNumPools();
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f200189 and 3f7dc47.

📒 Files selected for processing (18)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (12 hunks)
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
🚧 Files skipped from review as they are similar to previous changes (12)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H without underscore prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
🧠 Learnings (4)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (2)

53-55: LGTM! Per-window block hash handling looks good

The constructor signature properly accepts per-window block hashes using move semantics, which is appropriate for transferring ownership of potentially large data structures.


66-69: LGTM! Getter properly returns const reference

The getter correctly returns a const reference and is marked noexcept, following best practices for non-modifying accessors.

cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)

27-56: LGTM! BlockRangeForWindow class design is solid

The new BlockRangeForWindow class properly encapsulates per-window block iteration with appropriate use of move semantics and const-correctness.

cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2)

55-55: LGTM! Test properly uses move semantics for BlockRangeForWindow

The test correctly transfers ownership of both blockIds and pool using std::move.


126-128: LGTM! Test correctly creates intermediate vectors for move semantics

The test properly creates local copies before moving them into BlockRangeForWindow, ensuring correct ownership transfer.

cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)

649-649: Increased max sequences without justification

The mMaxNumSequences has been doubled from 8 to 16. This could impact memory usage and test performance.

Is this increase necessary for the new per-window functionality? Consider documenting why this change was made or if it's just for testing purposes.


953-956: LGTM! Per-window API properly integrated into tests

The fillBlockData and verifyBlockData functions have been correctly updated to accept windowSize parameter with appropriate default value handling.


1267-1267: Test data set changed without explanation

The request lengths in the test loop are now {60, 30, 60, 10}, but the impact of the change isn't clear.

Was this intentional? If the test coverage is being reduced, please ensure it still adequately tests the per-window functionality.

@tensorrt-cicd
Collaborator

PR_Github #15763 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15763 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11853 completed with status: 'FAILURE'

@chuangz0 chuangz0 force-pushed the kvcache_transfer_for_vwsa_optimal branch from 3f7dc47 to ef0a620 on August 21, 2025 03:16

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (4)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)

87-95: Make window validation consistent across setters (and check the authoritative map).

setBlockIdsForWindow validates against mBlockIdsPerWindow while setBlockIdsForAllWindows validates against mPoolsPerWindow. The latter is the authoritative list of supported windows (derived from the manager). Validate against mPoolsPerWindow in both to: (1) allow first-time insertion into mBlockIdsPerWindow, and (2) avoid false negatives.

-    void setBlockIdsForWindow(SizeType32 windowSize, std::vector<SizeType32> blockIds)
+    void setBlockIdsForWindow(SizeType32 windowSize, std::vector<SizeType32> blockIds)
     {
-        TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
-            "Window size %d should exists", windowSize);
+        TLLM_CHECK_WITH_INFO(mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(),
+            "Window size %d should exist", static_cast<int>(windowSize));
         mBlockIdsPerWindow[windowSize] = std::move(blockIds);
     }
@@
-        for (auto const& [windowSize, blockIds] : blockIdsPerWindow)
+        for (auto const& [windowSize, blockIds] : blockIdsPerWindow)
         {
-            TLLM_CHECK_WITH_INFO(
-                mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d should exists", windowSize);
+            TLLM_CHECK_WITH_INFO(mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(),
+                "Window size %d should exist", static_cast<int>(windowSize));
         }
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)

60-79: Fix off-by-one in window cropping and guard map access.

Use ceil_div for blocks-in-window and avoid .at() on potentially missing keys. This was raised previously and remains unfixed here.

-    bool needReuse = !common::getEnvDisableSelectiveCacheTransfer();
-    auto const& requestedBlockHashesPerWindow = llmRequest.getRequestedBlockHashesPerWindow();
-    for (auto const& [windowSize, metadata] : windowsMetadata)
-    {
-        SizeType32 reuseStartBlockIdx
-            = (needReuse && requestedBlockHashesPerWindow.at(windowSize).size() > 0
-                  && requestedBlockHashesPerWindow.at(windowSize).size() < blockIdsPerWindow.at(windowSize).size())
-            ? (blockIdsPerWindow.at(windowSize).size() - requestedBlockHashesPerWindow.at(windowSize).size())
-            : 0;
-        auto windowStartBlockIdx = needSendAllForWindow
-            ? 0
-            : static_cast<SizeType32>(blockIdsPerWindow.at(windowSize).size())
-                - (windowSize / cacheManager->getBlockManager().getTokensPerBlock() + 1);
-        // TODO: promptLen to get the startBlockIdx
-        SizeType32 startBlockIdx = std::max(0, std::max(reuseStartBlockIdx, windowStartBlockIdx));
+    bool needReuse = !common::getEnvDisableSelectiveCacheTransfer();
+    auto const& requestedBlockHashesPerWindow = llmRequest.getRequestedBlockHashesPerWindow();
+    auto const tokensPerBlock = cacheManager->getBlockManager().getTokensPerBlock();
+    for (auto const& [windowSize, metadata] : windowsMetadata)
+    {
+        auto const& ids = blockIdsPerWindow.at(windowSize);
+        SizeType32 reuseStartBlockIdx = 0;
+        if (needReuse)
+        {
+            auto it = requestedBlockHashesPerWindow.find(windowSize);
+            auto requested = (it != requestedBlockHashesPerWindow.end()) ? it->second.size() : 0UL;
+            if (requested > 0 && requested < ids.size())
+            {
+                reuseStartBlockIdx = static_cast<SizeType32>(ids.size() - requested);
+            }
+        }
+        SizeType32 blocksInWindow = static_cast<SizeType32>((windowSize + tokensPerBlock - 1) / tokensPerBlock);
+        SizeType32 windowStartBlockIdx = needSendAllForWindow
+            ? 0
+            : static_cast<SizeType32>(
+                std::max<ptrdiff_t>(0, static_cast<ptrdiff_t>(ids.size()) - static_cast<ptrdiff_t>(blocksInWindow)));
+        SizeType32 startBlockIdx = std::max(reuseStartBlockIdx, windowStartBlockIdx);
         TLLM_LOG_DEBUG(
             "getBlockRangeForSending windowSize: %d, startBlockIdx: %d reuseStartBlockIdx: %d windowStartBlockIdx: %d",
             windowSize, startBlockIdx, reuseStartBlockIdx, windowStartBlockIdx);
-        blockRange.setBlockIdsForWindow(windowSize,
-            std::vector<SizeType32>(
-                blockIdsPerWindow.at(windowSize).begin() + startBlockIdx, blockIdsPerWindow.at(windowSize).end()));
+        blockRange.setBlockIdsForWindow(windowSize, std::vector<SizeType32>(ids.begin() + startBlockIdx, ids.end()));
     }

100-111: Receiving side: same ceil_div and clamping needed.

Mirror the fix during receive to avoid off-by-one errors and negative indices.

-    for (auto const& [windowSize, metadata] : windowsMetadata)
-    {
-        auto const& blockIdsPerWindow = blockRange.getBlockIdsPerWindow();
-        auto windowStartBlockIdx = static_cast<SizeType32>(blockIdsPerWindow.at(windowSize).size())
-            - (windowSize / cacheManager->getBlockManager().getTokensPerBlock() + 1);
-        SizeType32 startBlockIdx = std::max(0, windowStartBlockIdx);
-        blockRange.setBlockIdsForWindow(windowSize,
-            std::vector<SizeType32>(
-                blockIdsPerWindow.at(windowSize).begin() + startBlockIdx, blockIdsPerWindow.at(windowSize).end()));
-    }
+    auto const tokensPerBlock = cacheManager->getBlockManager().getTokensPerBlock();
+    for (auto const& [windowSize, metadata] : windowsMetadata)
+    {
+        auto const& ids = blockRange.getBlockIdsPerWindow().at(windowSize);
+        SizeType32 blocksInWindow = static_cast<SizeType32>((windowSize + tokensPerBlock - 1) / tokensPerBlock);
+        SizeType32 startBlockIdx = static_cast<SizeType32>(
+            std::max<ptrdiff_t>(0, static_cast<ptrdiff_t>(ids.size()) - static_cast<ptrdiff_t>(blocksInWindow)));
+        blockRange.setBlockIdsForWindow(windowSize, std::vector<SizeType32>(ids.begin() + startBlockIdx, ids.end()));
+    }
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

429-457: Fix precedence in byte verification; compare masked byte.

Current expression is parsed as (i == promptLen) & 0xff. Compare against the masked byte.

-                    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-                        [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+                    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
+                        [&llmRequest](uint8_t i) {
+                            return i == static_cast<uint8_t>(llmRequest->getPromptLen() & 0xff);
+                        }));
🧹 Nitpick comments (10)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (3)

80-85: Fix format specifier and grammar in TLLM_CHECK_WITH_INFO.

Use a proper specifier for SizeType32 (or cast) and fix “exists” → “exist”.

-        TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
-            "Window size %d should exists", windowSize);
+        TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
+            "Window size %d should exist", static_cast<int>(windowSize));

145-151: Remove dead/commented-out code.

The commented line adds noise and violates repo guidelines on dead code.

-        // cacheManager.getBlockManager.getPrimaryPool(0);

1-1: Optional: adopt include guards per project guideline.

Headers prefer named include guards (TRTLLM_<NAME>_H). #pragma once is fine in practice, but consider aligning with the stated repo guideline.

cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)

380-383: Wrong printf specifier for size_t.

defaultTransSize is size_t; %d is UB on LP64. Use %zu or cast.

-        TLLM_LOG_INFO("defaultTransSize: %d", defaultTransSize);
+        TLLM_LOG_INFO("defaultTransSize: %zu", static_cast<size_t>(defaultTransSize));

97-137: Test-only process management: ensure waitpid decl.

These test cases use fork()/waitpid(). Consider adding <sys/wait.h> explicitly to avoid relying on transitive includes across platforms.

 #include <csignal>
+#include <sys/wait.h>

Also applies to: 141-181, 329-367, 371-437

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (4)

253-257: Printf specifier mismatch for size_t.

windowSizes.size() is size_t. Fix the format string to avoid UB.

-        TLLM_LOG_DEBUG(
-            mpi::MpiComm::world().getRank(), " blockRange.getWindowSizes(); windowSizes size: %d", windowSizes.size());
+        TLLM_LOG_DEBUG(
+            mpi::MpiComm::world().getRank(), " blockRange.getWindowSizes(); windowSizes size: %zu",
+            static_cast<size_t>(windowSizes.size()));

545-548: Remove commented-out call.

// blockRange.updatePoolIdx(poolIdx); is dead code.

-                // blockRange.updatePoolIdx(poolIdx);

484-502: Printf specifier mismatch for multiple size_t logs.

Use %zu for size_t.

-    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " unformat windowSizes size: %d", windowSizes.size());
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " unformat windowSizes size: %zu",
+        static_cast<size_t>(windowSizes.size()));
@@
-    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), "outputBuffersPerWindow size: %ld,blockNum: %d , windowSizes: %ld",
-        outputBuffersPerWindow.size(), blockNum, windowSizes.size());
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(),
+        "outputBuffersPerWindow size: %zu, blockNum: %d, windowSizes: %zu",
+        static_cast<size_t>(outputBuffersPerWindow.size()), blockNum, static_cast<size_t>(windowSizes.size()));

321-335: Agent path: assert messages typo and clarity.

Minor wording cleanup: “dupliacete” → “duplicate”.

-        TLLM_LOG_DEBUG(" formatOutput bufferTargetNum: %d, targetNum: %d, peerDuplicateHeadFactor: %d dupliacete:%d ",
+        TLLM_LOG_DEBUG(" formatOutput bufferTargetNum: %d, targetNum: %d, peerDuplicateHeadFactor: %d, duplicate:%d ",
             bufferTargetNum, targetNum, peerDuplicateHeadFactor, targetInfo.mDupHeadFactor);
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

933-949: Window verification: use ceil_div for blocks in window.

maxBlockInWindow = windowSize / tokensPerBlock undercounts when windowSize isn't a multiple of tokensPerBlock; the ad-hoc + 1 in the start index only papers over that. Use ceil_div and drop the + 1.

-            int maxBlockInWindow = windowSize / mCacheState->getModelConfig().mTokensPerBlock;
-            int startBlockId = std::max(0, static_cast<int>(blockRangeForWindow.size()) - (maxBlockInWindow + 1));
+            auto const tpb = mCacheState->getModelConfig().mTokensPerBlock;
+            int blocksInWindow = static_cast<int>((windowSize + tpb - 1) / tpb);
+            int startBlockId = std::max(0, static_cast<int>(blockRangeForWindow.size()) - blocksInWindow);
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3f7dc47 and ef0a620.

📒 Files selected for processing (18)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (12 hunks)
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
🚧 Files skipped from review as they are similar to previous changes (13)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H without underscore prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
🧠 Learnings (6)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.

Applied to files:

  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
🧬 Code Graph Analysis (3)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (2)
  • preAllocBufferSize (246-285)
  • preAllocBufferSize (246-248)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
cpp/tensorrt_llm/common/envUtils.cpp (4)
  • getEnvKVCacheTransferAllBlocksForWindow (449-453)
  • getEnvKVCacheTransferAllBlocksForWindow (449-449)
  • getEnvDisableSelectiveCacheTransfer (321-325)
  • getEnvDisableSelectiveCacheTransfer (321-321)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (8)
  • llmRequest (421-460)
  • llmRequest (421-421)
  • llmRequest (882-914)
  • llmRequest (882-882)
  • llmRequest (916-923)
  • llmRequest (916-916)
  • llmRequest (925-951)
  • llmRequest (925-925)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (2)
  • tensorrt_llm (35-190)
  • `` (66-69)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (3)
  • tensorrt_llm (34-123)
  • nodiscard (106-109)
  • `` (101-104)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)

187-193: Iterator construction precondition is stricter than necessary.

The iterator never constructs an “end” iterator (you use sentinel), so the current check is safe. Just calling out that if an end-iterator constructor is ever added, the precondition should allow mIdx == size().

Would you like me to extend BlockIterator with an explicit end() constructor for symmetry with begin()/Sentinel, or keep the sentinel-only design?


255-258: LGTM: clean sentinel-based iteration.

The sentinel pattern keeps the iterator minimal and avoids storing a second iterator. Nice.

cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)

119-121: LGTM: tests updated for new preAllocBufferSize signature.

Passing tokensPerBlock matches the new API. Assertions remain consistent.


164-165: LGTM: second test updated to new signature.

Consistent with the kSELF path.

cpp/tests/batch_manager/cacheTransceiverTest.cpp (3)

953-1014: Good test coverage for per-window data paths.

The helpers fillBlockData/verifyBlockData now parameterize by window and seed; this improves confidence in per-window correctness.

Also applies to: 1016-1080


647-705: Test capacity: mMaxNumSequences doubled (16).

Change aligns with expanded window coverage; ensure CI resources can handle increased allocations.

Would you like me to add a gtest filter to run a reduced subset under resource-constrained CI, and leave full suite for nightly?


892-903: No printf specifier issues found in cacheTransceiverTest.cpp
I’ve verified that there are no instances of logging a container’s .size() using %d in this file, so no %zu replacements are needed here. All existing %d specifiers match their argument types.

Comment on lines 251 to 275
     size_t allCacheBlockSize = 0;
+    auto const& windowSizes = blockRange.getWindowSizes();
+    TLLM_LOG_DEBUG(
+        mpi::MpiComm::world().getRank(), " blockRange.getWindowSizes(); windowSizes size: %d", windowSizes.size());
+    TLLM_CHECK_WITH_INFO(
+        static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");

     std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocks;
-    for (auto poolIdx = 0; poolIdx < numPools; poolIdx++)
+    for (auto const& windowSize : windowSizes)
     {
-        blockRange.updatePoolIdx(poolIdx);
-        SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(poolIdx);
-        TLLM_CHECK_WITH_INFO(inputKvCacheBlocks.find(window) == inputKvCacheBlocks.end(),
-            "window size already exists, which is not supported");
-        inputKvCacheBlocks.emplace(window, std::vector<runtime::ITensor::SharedPtr>());
-        auto maxBlockThisWindow = window / selfConfig.getModelConfig().mTokensPerBlock;
-        SizeType32 blockNumThisWindow = 0;
-        for (auto it = blockRange.begin(); it != blockRange.end(); ++it)
+        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+        TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " format windowSize: %d blockRangeForWindow size: %d",
+            windowSize, blockRangeForWindow.size());
+        inputKvCacheBlocks.emplace(windowSize, std::vector<runtime::ITensor::SharedPtr>());
+        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
         {
-            blockNum++;
-            inputKvCacheBlocks.at(window).push_back(it);
+            inputKvCacheBlocks.at(windowSize).push_back(it);
             allCacheBlockSize += it->getSize();
-            blockNumThisWindow++;
-            if (blockNumThisWindow >= maxBlockThisWindow)
-            {
-                break;
-            }
+            blockNum++;
         }
     }
     TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), "inputKvCacheBlocks size: %ld,blockNum: %d , windowSizes: %ld",
         inputKvCacheBlocks.size(), blockNum, windowSizes.size());

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Automated scan for remaining element/byte mixups.

There may be other places with getSize() used for buffer sizing or network IO.

Also applies to: 481-502, 616-747


🏁 Script executed:

#!/bin/bash
# Find potential unit mix-ups: getSize() used near send/recv/bytes/BufferSize/targetBuffer
rg -n -C2 -e 'getSize\(\)' cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
rg -n -C2 -e '(targetBuffer(Size|Bytes)|getSizeInBytes\(\)|session\.(send|recv)\()' cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp

Length of output: 8400


Ensure byte‐accurate sizing for all buffer allocations and network I/O

A number of calls to getSize() appear in contexts where the API expects bytes rather than element counts. Mixing element counts into byte‐based calculations will lead to incorrect buffer sizes and potential data corruption or deadlocks. Please update these to use getSizeInBytes() (or multiply by element size) consistently.

Key locations to fix:

  • Input cache sizing
    • Line 269:
      allCacheBlockSize += it->getSize();
      ➔ should be
      allCacheBlockSize += it->getSizeInBytes();
  • Preallocated send buffer element size
    • Line 383:
      auto sendBufferEleSize = preAllocSendBuffer->getSize();
      ➔ should be
      auto sendBufferEleSize = preAllocSendBuffer->getSizeInBytes();
  • Output cache sizing
    • Line 496:
      cacheBlockSizeSum += it->getSize();
      ➔ should be
      cacheBlockSizeSum += it->getSizeInBytes();
  • General network I/O loops
    • Anywhere session.send(..., layer->data(), layer->getSizeInBytes()) is correct, but ensure no calls still use layer->getSize().
    • Scan the blocks around lines 481–502 and 616–747 for similar patterns and correct them.

Consider adding a lint rule or helper wrapper to prevent accidental use of getSize() in byte‐sensitive contexts going forward.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around lines 251–275 (and
the other ranges called out), the code mixes element counts and byte sizes:
replace uses of it->getSize() with it->getSizeInBytes() for allCacheBlockSize at
line ~269, change preAllocSendBuffer->getSize() to
preAllocSendBuffer->getSizeInBytes() at line ~383, replace cacheBlockSizeSum +=
it->getSize() with cacheBlockSizeSum += it->getSizeInBytes() at line ~496, and
scan the nearby blocks (around 481–502 and 616–747) to ensure all network/send
calls and buffer-size calculations use getSizeInBytes() (or multiply by element
byte-size) instead of getSize(); optionally add a small helper or a linter check
to prevent future accidental use of getSize() in byte-sensitive contexts.

🛠️ Refactor suggestion

Unit bug: mixing elements and bytes when computing transfer sizes.

allCacheBlockSize and blockNum are later used to derive targetBufferSize and sliced for send/recv as bytes; currently you add it->getSize() (elements). This mis-sizes transfers for non-1-byte dtypes and causes incorrect slicing.

-        size_t allCacheBlockSize = 0;
+        size_t allCacheBlockBytes = 0;
@@
-                inputKvCacheBlocks.at(windowSize).push_back(it);
-                allCacheBlockSize += it->getSize();
+                inputKvCacheBlocks.at(windowSize).push_back(it);
+                allCacheBlockBytes += it->getSizeInBytes();
                 blockNum++;
@@
-        auto const targetBufferSize = allCacheBlockSize / targetNum * peerDuplicateHeadFactor;
+        auto const targetBufferBytes = allCacheBlockBytes / targetNum * peerDuplicateHeadFactor;
@@
-        auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
-            cacheBufferId, bufferTargetNum, targetBufferSize, bufferManager);
+        auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
+            cacheBufferId, bufferTargetNum, targetBufferBytes, bufferManager);

Follow-up: in the chunked-send path below (bufferCoverTargetNum == 0), convert byte counts to element counts when slicing.

-                size = outputSplitCaches[bufferIdx]->getSizeInBytes();
+                size = outputSplitCaches[bufferIdx]->getSizeInBytes();
@@
-                size = targetBufferSize;
-                size_t remainSendSize = targetBufferSize;
+                size = targetBufferBytes;
+                size_t remainSendBytes = targetBufferBytes;
-                while (remainSendSize > 0)
+                while (remainSendBytes > 0)
                 {
                     TLLM_CHECK(preAllocSendBuffer != nullptr);
-                    auto sendBufferEleSize = preAllocSendBuffer->getSize();
-                    auto sendSize = std::min(remainSendSize, sendBufferEleSize);
-                    auto copySlice = runtime::ITensor::slice(
-                        outputSplitCaches[bufferIdx], targetBufferSize - remainSendSize, sendSize);
-                    auto copyTargetSlice = runtime::ITensor::slice(preAllocSendBuffer, 0, sendSize);
+                    auto elemSize = tensorrt_llm::common::getDTypeSize(preAllocSendBuffer->getDataType());
+                    auto sendBufferBytes = preAllocSendBuffer->getSizeInBytes();
+                    auto sendBytes = std::min(remainSendBytes, sendBufferBytes);
+                    auto sendElems = sendBytes / elemSize;
+                    auto offsetBytes = targetBufferBytes - remainSendBytes;
+                    auto offsetElems = offsetBytes / elemSize;
+                    auto copySlice = runtime::ITensor::slice(outputSplitCaches[bufferIdx], offsetElems, sendElems);
+                    auto copyTargetSlice = runtime::ITensor::slice(preAllocSendBuffer, 0, sendElems);
                     bufferManager.copy(*copySlice, *copyTargetSlice);
                     bufferManager.getStream().synchronize();
                     session.send(processIdx, copyTargetSlice->data(), copyTargetSlice->getSizeInBytes());
-                    remainSendSize -= sendSize;
+                    remainSendBytes -= sendBytes;
                 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Rename and switch from element-count to byte-count accumulation
size_t allCacheBlockBytes = 0;
auto const& windowSizes = blockRange.getWindowSizes();
TLLM_LOG_DEBUG(
mpi::MpiComm::world().getRank(), " blockRange.getWindowSizes(); windowSizes size: %d", windowSizes.size());
TLLM_CHECK_WITH_INFO(
static_cast<int>(windowSizes.size()) == numPools, "window sizes should be the same as numPools");
std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocks;
for (auto const& windowSize : windowSizes)
{
auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(),
" format windowSize: %d blockRangeForWindow size: %d",
windowSize, blockRangeForWindow.size());
inputKvCacheBlocks.emplace(windowSize, std::vector<runtime::ITensor::SharedPtr>());
for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
{
inputKvCacheBlocks.at(windowSize).push_back(it);
allCacheBlockBytes += it->getSizeInBytes();
blockNum++;
}
}
TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(),
"inputKvCacheBlocks size: %ld, blockNum: %d, windowSizes: %ld",
inputKvCacheBlocks.size(), blockNum, windowSizes.size());
// Compute target buffer in bytes, not elements
auto const targetBufferBytes = allCacheBlockBytes / targetNum * peerDuplicateHeadFactor;
auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
cacheBufferId, bufferTargetNum, targetBufferBytes, bufferManager);
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around lines 251 to 275,
the code adds it->getSize() (elements) to allCacheBlockSize but later treats
allCacheBlockSize as bytes for buffer sizing and slicing; change accumulation to
byte counts (elements * element-size-in-bytes or use a provided
getByteSize()/getSizeInBytes() API) and ensure blockNum remains
element-agnostic; additionally, in the chunked-send path (where
bufferCoverTargetNum == 0) convert any byte-based offsets/lengths to element
counts before indexing/slicing tensors so slicing uses element counts not bytes.

Comment on lines 481 to 498
    size_t blockNum = 0;
    size_t cacheBlockSizeSum = 0;
-    for (auto poolIdx = 0; poolIdx < numPools; poolIdx++)
+    auto windowSizes = blockRange.getWindowSizes();
+    TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " unformat windowSizes size: %d", windowSizes.size());
+    for (auto const& windowSize : windowSizes)
     {
-        blockRange.updatePoolIdx(poolIdx);
-        SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(poolIdx);
-        TLLM_CHECK_WITH_INFO(outputBuffersPerWindow.find(window) == outputBuffersPerWindow.end(),
-            "window size already exists, which is not supported");
-        outputBuffersPerWindow.emplace(window, std::vector<runtime::ITensor::SharedPtr>());
-        auto maxBlockThisWindow = window / selfConfig.getModelConfig().mTokensPerBlock;
-        SizeType32 blockNumThisWindow = 0;
-        for (auto it = blockRange.begin(); it != blockRange.end(); ++it)
+        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+        TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), " unformat windowSize: %d blockRangeForWindow size: %d",
+            windowSize, blockRangeForWindow.size());
+        outputBuffersPerWindow.emplace(windowSize, std::vector<runtime::ITensor::SharedPtr>());
+
+        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
         {
-            blockNum++;
-            blockNumThisWindow++;
-            outputBuffersPerWindow.at(window).push_back(it);
+            outputBuffersPerWindow.at(windowSize).push_back(it);
             cacheBlockSizeSum += it->getSize();
-            if (blockNumThisWindow >= maxBlockThisWindow)
-            {
-                break;
-            }
+            blockNum++;
         }

🛠️ Refactor suggestion

Unit bug on receive: cacheBlockSizeSum must be bytes; align chunking logic.

Mirrors the send-side bug. Accumulate bytes and use bytes-to-elements conversion for slicing.

-    size_t cacheBlockSizeSum = 0;
+    size_t cacheBlockBytesSum = 0;
@@
-            cacheBlockSizeSum += it->getSize();
+            cacheBlockBytesSum += it->getSizeInBytes();
@@
-    TLLM_CHECK(cacheBlockSizeSum % targetNum == 0);
-    auto targetBufferSize = cacheBlockSizeSum / targetNum;
+    TLLM_CHECK(cacheBlockBytesSum % targetNum == 0);
+    auto targetBufferBytes = cacheBlockBytesSum / targetNum;
@@
-                    auto [recvSplitCachestmp, bufferCoverTargetNumtmp, onlyUseDynamicBuffer]
-                        = mCacheTransBufferManager->getOrAllocateRecvBuffers(
-                            cacheBufferId, targetNum, targetBufferSize, bufferManager);
+                    auto [recvSplitCachestmp, bufferCoverTargetNumtmp, onlyUseDynamicBuffer]
+                        = mCacheTransBufferManager->getOrAllocateRecvBuffers(
+                            cacheBufferId, targetNum, targetBufferBytes, bufferManager);

And in the chunked path:

-                        size_t remainRecvSize = targetBufferSize;
+                        size_t remainRecvBytes = targetBufferBytes;
-                        while (remainRecvSize > 0)
+                        while (remainRecvBytes > 0)
                         {
                             TLLM_CHECK(preAllocRecvBuffer != nullptr);
-                            auto recvBufferEleSize = preAllocRecvBuffer->getSize();
-                            auto recvSize = std::min(remainRecvSize, recvBufferEleSize);
-                            auto recvSlice = runtime::ITensor::slice(preAllocRecvBuffer, 0, recvSize);
-                            auto copySlice = runtime::ITensor::slice(
-                                recvSplitCaches[processIdx], targetBufferSize - remainRecvSize, recvSize);
+                            auto elemSize = tensorrt_llm::common::getDTypeSize(preAllocRecvBuffer->getDataType());
+                            auto recvBufferBytes = preAllocRecvBuffer->getSizeInBytes();
+                            auto recvBytes = std::min(remainRecvBytes, recvBufferBytes);
+                            auto recvElems = recvBytes / elemSize;
+                            auto offsetBytes = targetBufferBytes - remainRecvBytes;
+                            auto offsetElems = offsetBytes / elemSize;
+                            auto recvSlice = runtime::ITensor::slice(preAllocRecvBuffer, 0, recvElems);
+                            auto copySlice = runtime::ITensor::slice(recvSplitCaches[processIdx], offsetElems, recvElems);
                             size += recvSlice->getSizeInBytes();
                             llmRequest.updateKvCacheSize((*recvSlice).getSizeInBytes());
                             session.recv(pickUpConnections[processIdx], recvSlice->data(), recvSlice->getSizeInBytes());
                             bufferManager.copy(*recvSlice, *copySlice);
                             bufferManager.getStream().synchronize();
-                            remainRecvSize -= recvSize;
+                            remainRecvBytes -= recvBytes;
                         }

Also applies to: 616-624

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around lines 481-498 (and
also apply the same change at 616-624): cacheBlockSizeSum is being accumulated
in element counts instead of bytes which breaks chunking; change accumulation to
sum bytes (e.g., use each tensor/block's byte size or multiply element count by
element size) and, where slicing/chunking computes element counts from byte
sizes, convert bytes back to element counts by dividing by the element byte-size
before creating slices; apply the identical byte-based accumulation and
bytes-to-elements conversion in the chunked path at lines 616-624.

Signed-off-by: Chuang Zhu <[email protected]>
Signed-off-by: Chuang Zhu <[email protected]>
@chuangz0 chuangz0 force-pushed the kvcache_transfer_for_vwsa_optimal branch from ef0a620 to 39306d6 Compare August 21, 2025 08:45
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)

240-241: Bug: double-free of receive buffer index

recvBufferId is freed at Line 237 and then freed again at Lines 240–241. This can lead to undefined behavior in the buffer index allocator.

Apply:

-        // Free buffer indices
-        mTransBufferManager->freeBufferIndexForRecv(recvBufferId);
+        // Free buffer indices — already freed above; no-op here
+        (void) recvBufferId;

289-314: Bug: unreachable code after exit() in child process

exit(testing::Test::HasFailure() ? 1 : 0) at Line 289 terminates the child; the subsequent receive-buffer tests are never executed.

Refactor so the child exits only once, after all assertions:

-        thread0.join();
-        exit(testing::Test::HasFailure() ? 1 : 0);
-
-        auto recvBufferId = mTransBufferManager->assignBufferIndexForRecv();
+        thread0.join();
+
+        // Continue with recv buffer assertions before exiting
+        auto recvBufferId = mTransBufferManager->assignBufferIndexForRecv();
         EXPECT_TRUE(recvBufferId.has_value());
         EXPECT_EQ(recvBufferId.value(), 0);
         auto recvBufferId2 = mTransBufferManager->assignBufferIndexForRecv();
         EXPECT_TRUE(recvBufferId2.has_value());
         EXPECT_EQ(recvBufferId2.value(), 1);
         auto time2 = std::chrono::steady_clock::now();
@@
         thread1.join();
-        exit(testing::Test::HasFailure() ? 1 : 0);
+        exit(testing::Test::HasFailure() ? 1 : 0);
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)

195-205: Measurement bug: size should be bytes, not elements, in chunked send

In the chunked branch (bufferCoverTargetNum == 0), size is set to targetBufferSize (elements). appendMeasure expects bytes. Compute bytes via dtype size and optionally rename remainSendSize to remainSendElems for clarity.

-        size_t size;
+        size_t size = 0;
@@
-        else
+        else
         {
             // bufferCoverTargetNum=0, mSendBuffer size < one outputSlice
             // send multiple times
-            size = targetBufferSize;
-            size_t remainSendSize = targetBufferSize;
+            auto const elemSize = tensorrt_llm::common::getDTypeSize(outputSplitCaches.at(cacheIdx)->getDataType());
+            size_t remainSendElems = targetBufferSize;
+            size = targetBufferSize * elemSize;
-            while (remainSendSize > 0)
+            while (remainSendElems > 0)
             {
                 TLLM_CHECK(preAllocSendBuffer != nullptr);
-                auto sendBufferEleSize = preAllocSendBuffer->getSize();
-                auto sendSize = std::min(remainSendSize, sendBufferEleSize);
+                auto sendBufferEleSize = preAllocSendBuffer->getSize();
+                auto sendSize = std::min(remainSendElems, sendBufferEleSize);
                 auto copySlice = runtime::ITensor::slice(
-                    outputSplitCaches.at(cacheIdx), targetBufferSize - remainSendSize, sendSize);
+                    outputSplitCaches.at(cacheIdx), targetBufferSize - remainSendElems, sendSize);
                 auto copyTargetSlice = runtime::ITensor::slice(preAllocSendBuffer, 0, sendSize);
                 bufferManager.copy(*copySlice, *copyTargetSlice);
                 bufferManager.getStream().synchronize();
                 session.send(processIdx, copyTargetSlice->data(), copyTargetSlice->getSizeInBytes());
-
-                remainSendSize -= sendSize;
+                remainSendElems -= sendSize;
             }
         }

Also applies to: 217-233

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)

368-371: Bug: copying from wrong source buffer index in fallback path

In the send fallback, you calculate bufferIdx, but copy from outputSplitCaches[processIdx]. This mismatches topology mapping and can transmit wrong data.

-                bufferManager.copy(*outputSplitCaches[processIdx], *outputSplitCaches.at(sendBufferIdx));
+                bufferManager.copy(*outputSplitCaches.at(bufferIdx), *outputSplitCaches.at(sendBufferIdx));

377-394: Measurement bug: size should be bytes in chunked send

size is assigned targetBufferSize (elements). appendMeasure expects bytes. Compute bytes once or accumulate.

-                size = targetBufferSize;
-                size_t remainSendSize = targetBufferSize;
+                auto const elemSize = tensorrt_llm::common::getDTypeSize(outputSplitCaches.at(bufferIdx)->getDataType());
+                size = targetBufferSize * elemSize;
+                size_t remainSendSize = targetBufferSize; // elements
                 while (remainSendSize > 0)
                 {
                     TLLM_CHECK(preAllocSendBuffer != nullptr);
                     auto sendBufferEleSize = preAllocSendBuffer->getSize();
                     auto sendSize = std::min(remainSendSize, sendBufferEleSize);
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

665-671: Bug: cacheType is shadowed and assignment is lost when kvFactor == 1

This introduces a new local cacheType instead of updating the outer one, leaving the type unchanged.

         CacheType cacheType = CacheType::kSELF;
         if (kvFactor == 1)
         {
-            auto cacheType = CacheType::kSELFKONLY;
+            cacheType = CacheType::kSELFKONLY;
         }
♻️ Duplicate comments (8)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)

139-145: Zero-copy path must iterate per-window buckets

Once per-window mapping is built, send each window’s blocks rather than a flat list.

-        for (size_t i = 0; i < connections.size(); i++)
-        {
-            for (auto const& block : inputKvCacheBlocks)
-            {
-                session.send(i, block->data(), block->getSizeInBytes());
-            }
-        }
+        for (size_t i = 0; i < connections.size(); i++)
+        {
+            for (auto const& [w, blocks] : inputKvCacheBlocksPerWindow)
+            {
+                (void) w;
+                for (auto const& block : blocks)
+                {
+                    session.send(i, block->data(), block->getSizeInBytes());
+                }
+            }
+        }

113-126: Bug: per-window blocks are flattened and then re-keyed under a single window

You iterate per-window but flatten into inputKvCacheBlocks and later emplace all blocks under pool 0’s window. This loses per-window grouping and can break split/concat when multiple windows exist.

Update to build and use a true per-window map throughout:

-    int blockNum = 0;
-    std::vector<runtime::ITensor::SharedPtr> inputKvCacheBlocks;
+    int blockNum = 0;
+    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocksPerWindow;
@@
-    for (auto const& windowSize : windowSizes)
+    for (auto const& windowSize : windowSizes)
     {
-        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
-        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
-        {
-            inputKvCacheBlocks.push_back(it);
-            blockNum++;
-        }
+        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
+        auto& vec = inputKvCacheBlocksPerWindow[windowSize];
+        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
+        {
+            vec.push_back(it);
+            ++blockNum;
+        }
     }
@@
-    SizeType32 window = mCacheManager->getBlockManager().getPoolWindowSize(0);
-    std::map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> inputKvCacheBlocksPerWindow;
-    inputKvCacheBlocksPerWindow.emplace(window, inputKvCacheBlocks);
+    // inputKvCacheBlocksPerWindow was built above per window; use as-is

Also adjust zero-copy loop below (Lines 139–145) to iterate the per-window map instead of the flat vector.

Also applies to: 178-183

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)

45-79: Off-by-one in window cropping and unsafe map access when computing reuse start index

  • windowStartBlockIdx uses windowSize / tokensPerBlock + 1, which overestimates by one for exact multiples.
  • requestedBlockHashesPerWindow.at(windowSize) may throw for missing key.
     auto const& windowsMetadata = cacheManager->getBlockManager().getWindowSizesMetadata();
@@
-    auto const& blockIdsPerWindow = blockRange.getBlockIdsPerWindow();
+    auto const& blockIdsPerWindow = blockRange.getBlockIdsPerWindow();
+    auto const tokensPerBlock = cacheManager->getBlockManager().getTokensPerBlock();
@@
-        SizeType32 reuseStartBlockIdx
-            = (needReuse && requestedBlockHashesPerWindow.at(windowSize).size() > 0
-                  && requestedBlockHashesPerWindow.at(windowSize).size() < blockIdsPerWindow.at(windowSize).size())
-            ? (blockIdsPerWindow.at(windowSize).size() - requestedBlockHashesPerWindow.at(windowSize).size())
-            : 0;
-        auto windowStartBlockIdx = needSendAllForWindow
-            ? 0
-            : static_cast<SizeType32>(blockIdsPerWindow.at(windowSize).size())
-                - (windowSize / cacheManager->getBlockManager().getTokensPerBlock() + 1);
+        auto const& ids = blockIdsPerWindow.at(windowSize);
+        SizeType32 reuseStartBlockIdx = 0;
+        if (needReuse)
+        {
+            auto it = requestedBlockHashesPerWindow.find(windowSize);
+            auto requested = (it != requestedBlockHashesPerWindow.end()) ? it->second.size() : 0UL;
+            if (requested > 0 && requested < ids.size())
+            {
+                reuseStartBlockIdx = static_cast<SizeType32>(ids.size() - requested);
+            }
+        }
+        SizeType32 blocksInWindow = static_cast<SizeType32>((windowSize + tokensPerBlock - 1) / tokensPerBlock);
+        SizeType32 windowStartBlockIdx = needSendAllForWindow
+            ? 0
+            : static_cast<SizeType32>(
+                  std::max<ptrdiff_t>(0, static_cast<ptrdiff_t>(ids.size()) - static_cast<ptrdiff_t>(blocksInWindow)));
@@
-        blockRange.setBlockIdsForWindow(windowSize,
-            std::vector<SizeType32>(
-                blockIdsPerWindow.at(windowSize).begin() + startBlockIdx, blockIdsPerWindow.at(windowSize).end()));
+        blockRange.setBlockIdsForWindow(windowSize,
+            std::vector<SizeType32>(ids.begin() + startBlockIdx, ids.end()));

100-111: Receiving side: same off-by-one window cropping

Use ceil division and clamp to zero, mirroring the send path.

-    for (auto const& [windowSize, metadata] : windowsMetadata)
+    auto const tokensPerBlock = cacheManager->getBlockManager().getTokensPerBlock();
+    for (auto const& [windowSize, metadata] : windowsMetadata)
     {
-        auto const& blockIdsPerWindow = blockRange.getBlockIdsPerWindow();
-        auto windowStartBlockIdx = static_cast<SizeType32>(blockIdsPerWindow.at(windowSize).size())
-            - (windowSize / cacheManager->getBlockManager().getTokensPerBlock() + 1);
-        SizeType32 startBlockIdx = std::max(0, windowStartBlockIdx);
-        blockRange.setBlockIdsForWindow(windowSize,
-            std::vector<SizeType32>(
-                blockIdsPerWindow.at(windowSize).begin() + startBlockIdx, blockIdsPerWindow.at(windowSize).end()));
+        auto const& ids = blockRange.getBlockIdsPerWindow().at(windowSize);
+        SizeType32 blocksInWindow = static_cast<SizeType32>((windowSize + tokensPerBlock - 1) / tokensPerBlock);
+        SizeType32 startBlockIdx = static_cast<SizeType32>(
+            std::max<ptrdiff_t>(0, static_cast<ptrdiff_t>(ids.size()) - static_cast<ptrdiff_t>(blocksInWindow)));
+        blockRange.setBlockIdsForWindow(windowSize,
+            std::vector<SizeType32>(ids.begin() + startBlockIdx, ids.end()));
     }
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

455-456: Fix operator precedence in byte-wise equality check

i == llmRequest->getPromptLen() & 0xff is parsed as (i == llmRequest->getPromptLen()) & 0xff. Compare against the masked byte explicitly.

-                    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
-                        [&llmRequest](uint8_t i) { return i == llmRequest->getPromptLen() & 0xff; }));
+                    EXPECT_TRUE(std::all_of(bytes.begin(), bytes.end(),
+                        [&llmRequest](uint8_t i) {
+                            return i == static_cast<uint8_t>(llmRequest->getPromptLen() & 0xff);
+                        }));
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (3)

82-85: TLLM_CHECK message: grammar and format specifier; cast windowSize

Fix “exists” → “exist” and ensure the format argument type matches the specifier. Safer to cast to int with %d (or switch to %u if SizeType32 is unsigned).

-        TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
-            "Window size %d should exists", windowSize);
+        TLLM_CHECK_WITH_INFO(mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
+            "Window size %d should exist", static_cast<int>(windowSize));

89-95: Consistent validation: check against mBlockIdsPerWindow, not mPoolsPerWindow

setBlockIdsForWindow validates presence in mBlockIdsPerWindow but this function checks mPoolsPerWindow. Align them to avoid inconsistent behavior.

-        for (auto const& [windowSize, blockIds] : blockIdsPerWindow)
+        for (auto const& [windowSize, blockIds] : blockIdsPerWindow)
         {
-            TLLM_CHECK_WITH_INFO(
-                mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d should exists", windowSize);
+            TLLM_CHECK_WITH_INFO(
+                mBlockIdsPerWindow.find(windowSize) != mBlockIdsPerWindow.end(),
+                "Window size %d should exist", static_cast<int>(windowSize));
         }

115-120: Guard against missing block IDs for a valid window

mBlockIdsPerWindow.at(windowSize) can throw if IDs weren’t set yet. Return an empty range instead of throwing.

     BlockRangeForWindow getBlockRangeForWindow(SizeType32 windowSize) const
     {
         TLLM_CHECK_WITH_INFO(
             mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d not found", windowSize);
         auto pool = mPoolsPerWindow.at(windowSize).front();
-        auto blockIds = mBlockIdsPerWindow.at(windowSize);
-        return BlockRangeForWindow(std::move(blockIds), std::move(pool));
+        auto it = mBlockIdsPerWindow.find(windowSize);
+        std::vector<SizeType32> blockIds = (it == mBlockIdsPerWindow.end()) ? std::vector<SizeType32>{} : it->second;
+        return BlockRangeForWindow(std::move(blockIds), std::move(pool));
     }
🧹 Nitpick comments (11)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)

381-381: Format specifier mismatch for size_t

defaultTransSize is size_t but the log uses %d. Prefer %zu (or cast).

-        TLLM_LOG_INFO("defaultTransSize: %d", defaultTransSize);
+        TLLM_LOG_INFO("defaultTransSize: %zu", defaultTransSize);
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)

127-130: Constructing range with moved inputs — LGTM; minor micro-nit

The move of blockIdsVec and poolCopy is fine. If pool isn't reused afterward, you can avoid the temporary poolCopy and move pool directly to reduce one shared_ptr copy.

-    auto poolCopy = pool;
-    auto range = BlockRangeForWindow(std::move(blockIdsVec), std::move(poolCopy));
+    auto range = BlockRangeForWindow(std::move(blockIdsVec), std::move(pool));
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)

153-154: Units: clarify elements vs. bytes for cacheBlockSize

getSize() returns elements. That’s fine for computing targetBufferSize (elements), but be consistent by naming to reflect units, and ensure measurement uses bytes later.

-    auto cacheBlockSize = inputKvCacheBlocks.at(0)->getSize();
+    auto cacheBlockElemCount = inputKvCacheBlocksPerWindow.begin()->second.front()->getSize();

Follow-up: see the measurement fix below.


168-174: Agent connection precondition checks — LGTM; minor variable naming

agentConnnecion is misspelled (extra 'n'). Rename for readability.

-    auto* agentConnnecion = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
-    if (agentConnnecion != nullptr)
+    auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
+    if (agentConnection != nullptr)
     {
         TLLM_CHECK_WITH_INFO(bufferCoverTargetNum == pPDomainSize, "Agent need all buffer pre-allocated");
         TLLM_CHECK(onlyUseDynamicBuffer == false);
     }
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (4)

22-22: Unused include: kvCacheEventManager.h

This header isn’t referenced in this TU. Drop it to keep includes minimal.

-#include "tensorrt_llm/batch_manager/kvCacheEventManager.h"

251-272: Logging format specifier: size_t vs %d

windowSizes.size() is size_t. Use %zu (or cast) to avoid UB.

-        TLLM_LOG_DEBUG(
-            mpi::MpiComm::world().getRank(), " blockRange.getWindowSizes(); windowSizes size: %d", windowSizes.size());
+        TLLM_LOG_DEBUG(
+            mpi::MpiComm::world().getRank(), " blockRange.getWindowSizes(); windowSizes size: %zu",
+            static_cast<size_t>(windowSizes.size()));

Same applies to other logs printing container sizes.


484-497: Units on receive: accumulate bytes not elements for cacheBlockSizeSum

You sum it->getSize() (elements) into cacheBlockSizeSum and then compare modulo/derive targetBufferSize used for bytes-based recv/sizing checks. Prefer summing bytes for clearer invariants, or keep elements but ensure all downstream uses treat it as elements.

-    size_t cacheBlockSizeSum = 0;
+    size_t cacheBlockBytesSum = 0;
@@
-            cacheBlockSizeSum += it->getSize();
+            cacheBlockBytesSum += it->getSizeInBytes();

If you adopt bytes, adjust the modulo and targetBufferSize accordingly and convert back to elements when slicing.


616-624: Ensure all ITensor::slice offsets use element counts after switching to byte-based sizes

I’ve confirmed every call to ITensor::slice in cacheFormatter.cpp. Only the slices that use targetBufferSize (which you’ve refactored to be in bytes) still pass byte offsets directly to slice, which expects element counts. You’ll need to divide those byte offsets by the element size (e.g. via getElementSizeInBytes()) before calling slice. The remaining slice calls already work in element units and require no change.

• Lines 386–390

auto copySlice = runtime::ITensor::slice(
    outputSplitCaches[bufferIdx], targetBufferSize - remainSendSize, sendSize);

• Lines 737–740

auto copySlice = runtime::ITensor::slice(
    recvSplitCaches[processIdx], targetBufferSize - remainRecvSize, recvSize);

Both offsets (targetBufferSize - remainXxxSize) must be converted from bytes to element counts.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

45-51: Per-window block hash transport: constructor/eq/serde updates look correct

  • RequestInfo stores mBlockHashesPerWindow and equality compares it.
  • Serialization writes map size followed by (window, hashes) pairs and restores them in deserialization.
  • serializedSize matches the encoding.

Two optional improvements:

  • For deterministic wire format across compilers/architectures, consider serializing a sorted std::vector of pairs or use std::map instead of std::unordered_map.
  • Pre-reserve blockHashesPerWindow in deserialize after reading size to reduce rehashing.
-    std::unordered_map<SizeType32, std::vector<size_t>> blockHashesPerWindow;
-    auto size = su::deserialize<decltype(blockHashesPerWindow.size())>(is);
+    std::unordered_map<SizeType32, std::vector<size_t>> blockHashesPerWindow;
+    auto size = su::deserialize<decltype(blockHashesPerWindow.size())>(is);
+    blockHashesPerWindow.reserve(size);

Also applies to: 56-58, 75-81, 88-98, 105-113

cpp/tests/batch_manager/cacheTransceiverTest.cpp (2)

939-947: Potential off-by-one in visible-window verification

startBlockId = size - (maxBlockInWindow + 1) effectively verifies maxBlockInWindow + 1 blocks; was the intention to verify only the last maxBlockInWindow blocks? If so, drop the + 1.

-            int startBlockId = std::max(0, static_cast<int>(blockRangeForWindow.size()) - (maxBlockInWindow + 1));
+            int startBlockId = std::max(0, static_cast<int>(blockRangeForWindow.size()) - maxBlockInWindow);

If the extra block is intentional (to include the current block), please add a brief comment to document this choice.


431-437: Minor: prefer range-for for clarity

Using range-for improves readability with your custom iterator/sentinel.

-                auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
-                for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
-                {
-                    // fill cache with tokens (= request length), for reuse test
-                    TLLM_CUDA_CHECK(cudaMemset(it->data(), llmRequest->getPromptLen(), it->getSizeInBytes()));
-                }
+                for (auto& block : blockRange.getBlockRangeForWindow(windowSize))
+                {
+                    // Fill with prompt length for reuse test
+                    TLLM_CUDA_CHECK(cudaMemset(block.data(), llmRequest->getPromptLen(), block.getSizeInBytes()));
+                }

Apply the same pattern in the receiver validation loop below.

Also applies to: 451-457

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ef0a620 and 39306d6.

📒 Files selected for processing (18)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (5 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (1 hunks)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
  • cpp/tensorrt_llm/common/envUtils.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (12 hunks)
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
🚧 Files skipped from review as they are similar to previous changes (10)
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H, without underscores as prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
🧠 Learnings (7)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.

Applied to files:

  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
🧬 Code graph analysis (6)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (2)
  • preAllocBufferSize (246-285)
  • preAllocBufferSize (246-248)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
  • RequestInfo (45-97)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)
cpp/tensorrt_llm/common/envUtils.cpp (4)
  • getEnvKVCacheTransferAllBlocksForWindow (449-453)
  • getEnvKVCacheTransferAllBlocksForWindow (449-449)
  • getEnvDisableSelectiveCacheTransfer (321-325)
  • getEnvDisableSelectiveCacheTransfer (321-321)
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
  • BlockRangeForWindow (27-56)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (10)
  • tensorrt_llm (37-265)
  • nodiscard (547-550)
  • nodiscard (555-558)
  • nodiscard (578-581)
  • nodiscard (765-773)
  • nodiscard (776-779)
  • nodiscard (782-785)
  • nodiscard (1021-1024)
  • nodiscard (1186-1194)
  • nodiscard (1232-1240)
cpp/include/tensorrt_llm/runtime/gptDecoder.h (1)
  • ITensor (145-152)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
  • getBlockRangeForSending (43-82)
  • getBlockRangeForSending (43-43)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (8)
  • llmRequest (421-460)
  • llmRequest (421-421)
  • llmRequest (882-914)
  • llmRequest (882-882)
  • llmRequest (916-923)
  • llmRequest (916-916)
  • llmRequest (925-951)
  • llmRequest (925-925)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)

119-121: PreAllocBufferSize API update: passing tokensPerBlock looks correct

The test now forwards tokensPerBlock to preAllocBufferSize; this aligns with the implementation using tokensPerBlock to bound validTokenNum. No functional concerns here.


164-165: Mirror the API update in TestPreAllocBufferSize2 — LGTM

The second test also passes tokensPerBlock to preAllocBufferSize. Consistent with the first test.

cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)

55-55: API migration to BlockRangeForWindow — LGTM

Switching to BlockRangeForWindow with moved blockIds and pool matches the new API and preserves behavior.

cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)

101-104: Early exit when no cache needs sending — LGTM

Short-circuiting avoids unnecessary setup work; consistent with CacheFormatter behavior.

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)

268-271: Units consistency verified – no action needed

I’ve audited all instances of getSize() and getSizeInBytes() around buffer allocation, slicing, and session.send/session.recv calls in cacheFormatter.cpp.

  • getSize() is only ever used for element counts (e.g. computing allCacheBlockSize, slicing tensors).
  • getSizeInBytes() is used exclusively when specifying byte lengths for sends/receives and updating cache sizes.

There are no mismatches between element counts and byte counts.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)

26-26: Include for SizeType32 — LGTM

Adding runtime/common.h is appropriate for SizeType32.


238-244: DataResponder path switches to per-window hashes — LGTM

Receiving the per-window hashes and forwarding via setRequestedBlockHashes(...) aligns with the per-window design.

Also applies to: 263-264

cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

649-650: Resource usage check: doubled mMaxNumSequences from 8 to 16

This increases the total block count and memory footprint in tests. Confirm CI/GPU memory headroom; earlier L0 pipelines failed, and this change might contribute.
Would you like me to probe the failing pipelines’ memory-related logs and suggest parameterizations to keep coverage while reducing peak memory?

cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (3)

20-21: Include looks correct for new per-window pool typing

Adding iTensor.h is appropriate for runtime::ITensor::SharedPtr usage in per-window structures.


187-193: Iterator/sentinel correctness looks good

  • Constructor guards against invalid non-zero indices and updates state.
  • Sentinel comparison uses size() and avoids deref on empty ranges.

Also applies to: 230-233, 244-248


146-152: Initialization of pools and block IDs per window is clear and consistent

Constructors correctly populate mPoolsPerWindow and mBlockIdsPerWindow across all pools/windows.

Also applies to: 158-166

Comment on lines +122 to 130
std::vector<SizeType32> getWindowSizes() const
{
std::vector<SizeType32> windowSizes;
for (auto const& [windowSize, _] : mPoolsPerWindow)
{
mWindowSize = newWindowSize;
mBlockIds = mManager->getSequence(mRequestId).getCacheBlockIds(mWindowSize).at(kFIRST_AND_ONLY_BEAM);
windowSizes.push_back(windowSize);
}
return windowSizes;
}

🛠️ Refactor suggestion

Return window sizes in deterministic order

Current implementation iterates an unordered_map, making order non-deterministic across processes and builds. Sorting eliminates test flakiness and keeps sender/receiver blockIdx alignment.

     std::vector<SizeType32> getWindowSizes() const
     {
-        std::vector<SizeType32> windowSizes;
-        for (auto const& [windowSize, _] : mPoolsPerWindow)
-        {
-            windowSizes.push_back(windowSize);
-        }
-        return windowSizes;
+        std::vector<SizeType32> windowSizes;
+        windowSizes.reserve(mPoolsPerWindow.size());
+        for (auto const& [windowSize, _] : mPoolsPerWindow)
+        {
+            windowSizes.push_back(windowSize);
+        }
+        std::sort(windowSizes.begin(), windowSizes.end());
+        return windowSizes;
     }

Add the missing header include as well:

 #pragma once

 #include "tensorrt_llm/batch_manager/kvCacheManager.h"
 #include "tensorrt_llm/runtime/iTensor.h"
+#include <algorithm>
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    std::vector<SizeType32> getWindowSizes() const
    {
        std::vector<SizeType32> windowSizes;
        for (auto const& [windowSize, _] : mPoolsPerWindow)
        {
            windowSizes.push_back(windowSize);
        }
        return windowSizes;
    }

#pragma once

#include "tensorrt_llm/batch_manager/kvCacheManager.h"
#include "tensorrt_llm/runtime/iTensor.h"
#include <algorithm>

    std::vector<SizeType32> getWindowSizes() const
    {
        std::vector<SizeType32> windowSizes;
        windowSizes.reserve(mPoolsPerWindow.size());
        for (auto const& [windowSize, _] : mPoolsPerWindow)
        {
            windowSizes.push_back(windowSize);
        }
        std::sort(windowSizes.begin(), windowSizes.end());
        return windowSizes;
    }
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h around lines 122 to
130, the function returns window sizes by iterating an unordered_map which
yields a non-deterministic order; collect the window sizes as you do now then
sort the vector before returning (use std::sort on the collected windowSizes) to
ensure deterministic ordering, and also add the missing include for <algorithm>
at the top of the header so std::sort is available.

Comment on lines +429 to 438
auto const& windowSizes = blockRange.getWindowSizes();
for (auto const& windowSize : windowSizes)
{
// fill cache with tokens (= request length), for reuse test
TLLM_CUDA_CHECK(cudaMemset(block.data(), llmRequest->getPromptLen(), block.getSizeInBytes()));
auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
{
// fill cache with tokens (= request length), for reuse test
TLLM_CUDA_CHECK(cudaMemset(it->data(), llmRequest->getPromptLen(), it->getSizeInBytes()));
}
}

💡 Verification agent

🧩 Analysis chain

Add a device sync after cudaMemset to avoid read-after-write races

The memsets run on the default stream and the responder may read on another stream. Add a sync before starting the send to remove potential flakiness.

             }
-            mFutures.emplace_back(mResponder->respondAndSendAsync(*llmRequest));
+            // Ensure memset to device memory completes before send starts.
+            TLLM_CUDA_CHECK(cudaDeviceSynchronize());
+            mFutures.emplace_back(mResponder->respondAndSendAsync(*llmRequest));

Also applies to: 439-440


🏁 Script executed:

#!/bin/bash
# Search for respondAndSendAsync usage and definition across the codebase
rg -n "respondAndSendAsync" -C5

Length of output: 18131


Ensure device memory is fully initialized before launching async transfers

The cudaMemset calls in the reuse test execute on the default stream but don’t guarantee completion before the subsequent async send/receive, leading to potential read–after–write flakiness. Insert an explicit device synchronization immediately after the memset loops and before invoking any asynchronous transfer.

• File: cpp/tests/batch_manager/cacheTransceiverTest.cpp
– Lines 434–440: after the inner cudaMemset loop, add TLLM_CUDA_CHECK(cudaDeviceSynchronize()); before both mResponder->respondAndSendAsync(...) and mRequester->requestAndReceiveAsync(...).

Example patch:

             }
+            // Wait for all cudaMemset operations to finish before launching async transfer
+            TLLM_CUDA_CHECK(cudaDeviceSynchronize());
             mFutures.emplace_back(mResponder->respondAndSendAsync(*llmRequest));
         }
         else
         {
+            // Ensure device memory is ready before requestAndReceiveAsync
+            TLLM_CUDA_CHECK(cudaDeviceSynchronize());
             auto future = mRequester->requestAndReceiveAsync(*llmRequest);
             future.get();
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    auto const& windowSizes = blockRange.getWindowSizes();
    for (auto const& windowSize : windowSizes)
    {
        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
        {
            // fill cache with tokens (= request length), for reuse test
            TLLM_CUDA_CHECK(cudaMemset(it->data(), llmRequest->getPromptLen(), it->getSizeInBytes()));
        }
    }
auto const& windowSizes = blockRange.getWindowSizes();
for (auto const& windowSize : windowSizes)
{
auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
{
// fill cache with tokens (= request length), for reuse test
TLLM_CUDA_CHECK(cudaMemset(it->data(), llmRequest->getPromptLen(), it->getSizeInBytes()));
}
}
// Wait for all cudaMemset operations to finish before launching async transfer
TLLM_CUDA_CHECK(cudaDeviceSynchronize());
mFutures.emplace_back(mResponder->respondAndSendAsync(*llmRequest));
}
else
{
// Ensure device memory is ready before requestAndReceiveAsync
TLLM_CUDA_CHECK(cudaDeviceSynchronize());
auto future = mRequester->requestAndReceiveAsync(*llmRequest);
future.get();
}
🤖 Prompt for AI Agents
In cpp/tests/batch_manager/cacheTransceiverTest.cpp around lines 429 to 438, the
cudaMemset loops initialize device memory on the default stream but do not
guarantee completion before subsequent asynchronous send/receive calls; add an
explicit device synchronization (e.g.,
TLLM_CUDA_CHECK(cudaDeviceSynchronize());) immediately after the memset loops
and before calling mResponder->respondAndSendAsync(...) and
mRequester->requestAndReceiveAsync(...) so the device memory is fully
initialized prior to any async transfers.

Comment on lines +891 to 903
auto const& windowSizes = blockRange.getWindowSizes();
int blockIdx = 0;
for (auto const& windowSize : windowSizes)
{
blockRange.updatePoolIdx(poolIdx);
TLLM_LOG_DEBUG("update poolIdx: %d", poolIdx);
for (auto& block : blockRange)
auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
TLLM_LOG_DEBUG("update windowSize: %d", windowSize);
for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
{
fillBlockData(block, blockIdx, llmRequest->getPromptLen(), poolIdx);
fillBlockData(*it, blockIdx, llmRequest->getPromptLen(), windowSize);
blockIdx++;
}
TLLM_LOG_DEBUG("blockPoolIdx: %d finish fill block data", poolIdx);
TLLM_LOG_DEBUG("windowSize: %d finish fill block data", windowSize);
}

💡 Verification agent

❓ Verification inconclusive

Iteration order of windowSizes is unspecified; make it deterministic

getWindowSizes() returns a vector built from an unordered_map. Relying on its iteration order leads to non-deterministic blockIdx sequencing across processes, which can break the sender/receiver’s matching. Sort the window sizes before iterating.

-        auto const& windowSizes = blockRange.getWindowSizes();
-        int blockIdx = 0;
+        auto windowSizes = blockRange.getWindowSizes();
+        std::sort(windowSizes.begin(), windowSizes.end());
+        int blockIdx = 0;
         for (auto const& windowSize : windowSizes)
         {

Ensure deterministic iteration by sorting window sizes

To avoid non-deterministic blockIdx assignments (since getWindowSizes() pulls from an unordered_map), make a local, mutable copy of the returned vector and sort it before looping. Applying your proposed diff will guarantee consistent ordering across runs:

-        auto const& windowSizes = blockRange.getWindowSizes();
+        // Copy and sort to enforce deterministic order (unordered_map has unspecified iteration)
+        auto windowSizes = blockRange.getWindowSizes();
+        std::sort(windowSizes.begin(), windowSizes.end());
         int blockIdx = 0;
         for (auto const& windowSize : windowSizes)
         {
             auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);

Tagging this change as a mandatory refactor to ensure correct matching between sender and receiver.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    auto const& windowSizes = blockRange.getWindowSizes();
    int blockIdx = 0;
    for (auto const& windowSize : windowSizes)
    {
        auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
        TLLM_LOG_DEBUG("update windowSize: %d", windowSize);
        for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
        {
            fillBlockData(*it, blockIdx, llmRequest->getPromptLen(), windowSize);
            blockIdx++;
        }
        TLLM_LOG_DEBUG("windowSize: %d finish fill block data", windowSize);
    }
// Copy and sort to enforce deterministic order (unordered_map has unspecified iteration)
auto windowSizes = blockRange.getWindowSizes();
std::sort(windowSizes.begin(), windowSizes.end());
int blockIdx = 0;
for (auto const& windowSize : windowSizes)
{
auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
TLLM_LOG_DEBUG("update windowSize: %d", windowSize);
for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
{
fillBlockData(*it, blockIdx, llmRequest->getPromptLen(), windowSize);
blockIdx++;
}
TLLM_LOG_DEBUG("windowSize: %d finish fill block data", windowSize);
}
🤖 Prompt for AI Agents
In cpp/tests/batch_manager/cacheTransceiverTest.cpp around lines 891 to 903,
getWindowSizes() returns a vector derived from an unordered_map so iteration
order is non-deterministic; make a local mutable copy of the returned vector,
sort it (e.g. std::sort ascending) and then iterate over the sorted copy when
computing blockIdx and filling block data to ensure deterministic block index
assignment; add #include <algorithm> if not already present.

Comment on lines +935 to 949

        for (auto const& windowSize : windowSizes)
        {
            auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
            int maxBlockInWindow = windowSize / mCacheState->getModelConfig().mTokensPerBlock;
            int startBlockId = std::max(0, static_cast<int>(blockRangeForWindow.size()) - (maxBlockInWindow + 1));
            int blockIdInWindow = 0;
            for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
            {
                if (blockIdInWindow >= startBlockId)
                {
                    verifyBlockData(*it, blockIdx, llmRequest->getPromptLen(), windowSize);
                }
                blockIdx++;
                blockIdInWindow++;
            }
💡 Verification agent

🧩 Analysis chain

Apply deterministic window iteration during verification too

Same determinism concern on the generation side; sort window sizes before iterating to keep blockIdx and verification in sync with the sender.

-        auto const& windowSizes = blockRange.getWindowSizes();
+        auto windowSizes = blockRange.getWindowSizes();
+        std::sort(windowSizes.begin(), windowSizes.end());
         for (auto const& windowSize : windowSizes)

🏁 Script executed:

#!/bin/bash
# Search for getWindowSizes method usage or definition
rg -nP 'getWindowSizes\s*\(' -C3 --type=cpp
# Locate the BlockRange class definition
rg -nP 'class\s+\w*BlockRange' -C3 --type=cpp

Length of output: 8787


🏁 Script executed:

#!/bin/bash
# Show the declaration and context of BlockRange
rg -nP 'class BlockRange' -C10 --type=cpp cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
# Show the declaration of mPoolsPerWindow
rg -nP 'mPoolsPerWindow' -C5 --type=cpp cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h

Length of output: 4151


Ensure deterministic window ordering in cacheTransceiverTest
The call to getWindowSizes() pulls keys from an unordered_map, yielding a non-deterministic iteration order. To keep blockIdx and verification aligned with the sender, sort the window sizes before the loop in cpp/tests/batch_manager/cacheTransceiverTest.cpp:

-        auto const& windowSizes = blockRange.getWindowSizes();
+        // Copy and sort for deterministic iteration
+        auto windowSizes = blockRange.getWindowSizes();
+        std::sort(windowSizes.begin(), windowSizes.end());

         for (auto const& windowSize : windowSizes)
         {
             auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
             …
         }

– This change should be applied at every occurrence of auto const& windowSizes = blockRange.getWindowSizes(); in the test file.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

+        // Copy and sort for deterministic iteration
+        auto windowSizes = blockRange.getWindowSizes();
+        std::sort(windowSizes.begin(), windowSizes.end());
         for (auto const& windowSize : windowSizes)
         {
             auto blockRangeForWindow = blockRange.getBlockRangeForWindow(windowSize);
             int maxBlockInWindow = windowSize / mCacheState->getModelConfig().mTokensPerBlock;
             int startBlockId = std::max(0, static_cast<int>(blockRangeForWindow.size()) - (maxBlockInWindow + 1));
             int blockIdInWindow = 0;
             for (auto it = blockRangeForWindow.begin(); it != blockRangeForWindow.end(); ++it)
             {
                 if (blockIdInWindow >= startBlockId)
                 {
                     verifyBlockData(*it, blockIdx, llmRequest->getPromptLen(), windowSize);
                 }
                 blockIdx++;
                 blockIdInWindow++;
             }
🤖 Prompt for AI Agents
In cpp/tests/batch_manager/cacheTransceiverTest.cpp around lines 935 to 949, the
test obtains window sizes from blockRange via an unordered_map which yields
non-deterministic iteration order; this misaligns blockIdx/verification with the
sender. Replace uses of auto const& windowSizes = blockRange.getWindowSizes() by
copying the keys into a vector, sorting that vector (e.g., std::sort) and
iterate over the sorted windowSizes instead; apply the same change at every
occurrence of getWindowSizes() in this test file so verification order is
deterministic and matches the sender.

Signed-off-by: Chuang Zhu <[email protected]>
@chuangz0 chuangz0 closed this Sep 24, 2025