[https://nvbugs/5627710][fix] Fix synchronization bugs in KvCacheTransferManager that can cause corrupted blocks #9056

thorjohnsen · 2025-11-11T02:45:04Z

Summary by CodeRabbit

New Features
- Added new synchronization methods for KV cache transfer and buffer management operations, improving coordination between internal components.
- Exposed new Python APIs for cache synchronization and block refresh operations.
Improvements
- Enhanced granular tracking of cache read/write operations for better state management.
- Improved synchronization timing in resource allocation workflows to ensure proper cache consistency.

Description

There were multiple issues with the old code:

Not enough synchronization
The transfer manager kept track of all pending offloads and ensured that onboarding would not start until a pending offload of the block had been finished, but this is not enough. For instance, it is possible that multiple GPU blocks will offload to the same host block in a single step. The old code would not wait for the first offload to finish before starting the second, leading to a corrupted block that is a mix of two different blocks.

The case of two GPU blocks being offloaded into the same host block was extremely unlikely when the transfer manager was first written, but adding priority-based eviction changed that. If a block is assigned a lower-than-default priority before offloading, it will be first in line for eviction when another host block is needed for offloading, making it extremely likely that the same host block will be written to twice in a single step.

The new code records events for all pending reads from a block and all pending writes to a block. When a new offload or onboarding is scheduled, the block copy will wait for pending writes to the source block and will wait for pending reads and writes from/to the destination block.
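
The waiting rule above can be modelled with a small sketch. This is a toy model, not the TensorRT-LLM implementation: `TransferTracker`, `schedule_copy`, the tuple block keys, and the integer event ids are all hypothetical stand-ins for the real maps of CUDA events.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class TransferTracker:
    """Toy model of per-block read/write tracking (not the TensorRT-LLM API)."""

    pending_reads: dict = field(default_factory=dict)   # block key -> event ids
    pending_writes: dict = field(default_factory=dict)  # block key -> event ids
    _event_ids: count = field(default_factory=count)

    def schedule_copy(self, src, dst):
        """Schedule a copy src -> dst; return (event id, events it must wait on)."""
        waits = (
            self.pending_writes.get(src, [])    # source must be completely written
            + self.pending_reads.get(dst, [])   # nobody may still be reading dst
            + self.pending_writes.get(dst, [])  # earlier writes to dst must finish
        )
        event = next(self._event_ids)
        self.pending_reads.setdefault(src, []).append(event)
        self.pending_writes.setdefault(dst, []).append(event)
        return event, waits

    def sync(self):
        """Model a full synchronization: nothing is pending afterwards."""
        self.pending_reads.clear()
        self.pending_writes.clear()


# Two GPU blocks offloaded to the same host block in one step.
tracker = TransferTracker()
first, first_waits = tracker.schedule_copy(("gpu", 1), ("host", 0))
second, second_waits = tracker.schedule_copy(("gpu", 2), ("host", 0))
```

In this model the second offload reports the first offload's event in `second_waits`, which is the ordering the old code did not enforce.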

Incorrect synchronization
The transfer manager kept track of all pending offloads, but it keyed them by the wrong index. It used the block id to identify the blocks affected by an offload, but the block id is unrelated to the raw memory blocks involved in the copy. The block id is a logical block number that identifies a block inside the KV cache manager. A block instance holds metadata for a particular KV cache block, including the address of the raw memory block that holds the KV state; that address is the memory pool block offset, and it is the correct key. Blocks exchange these offsets all the time: for instance, when a block is offloaded to a host block, the two blocks swap memory pool block offsets after the memcpy is scheduled.

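
A minimal sketch of why the key matters, assuming a simplified `Block` with only a logical id and a backing memory slot. The names `Block`, `schedule_copy`, `second_offload_waits`, and the string pool labels are hypothetical and chosen for readability; the real code uses integer offsets and CUDA events.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int    # logical id used inside the KV cache manager
    pool_index: str  # raw memory block that currently backs this logical block


def schedule_copy(src, dst, pending_writes, key_of):
    """Record a write to dst; report whether an earlier write to the same key is pending."""
    key = key_of(dst)
    must_wait = key in pending_writes
    pending_writes.add(key)
    # After the memcpy is scheduled, the two blocks swap their backing memory.
    src.pool_index, dst.pool_index = dst.pool_index, src.pool_index
    return must_wait


def second_offload_waits(key_of):
    """Offload two GPU blocks into the same host memory and check the second one waits."""
    a, b, h = Block(10, "gpu0"), Block(11, "gpu1"), Block(20, "host0")
    pending = set()
    schedule_copy(a, h, pending, key_of)         # a's data goes to host0; a now owns host0
    return schedule_copy(b, a, pending, key_of)  # a is evicted; b's data goes to host0 too


missed = second_offload_waits(lambda blk: blk.block_id)     # keyed by logical id
detected = second_offload_waits(lambda blk: blk.pool_index)  # keyed by backing memory
```

Keyed by `block_id`, the two writes to `host0` are filed under different keys (20 and 10), so the conflict goes unnoticed; keyed by `pool_index`, both land on `host0` and the second copy is told to wait.
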
No sync before and/or after addSequence loop
All offloading and onboarding happens in a loop that calls addSequence for every new sequence added to the batch in a single step. The old code relied on explicit synchronization by the decoder to ensure that all blocks were valid for offloading before the addSequence loop, but later developments such as the asynchronous decoder removed that guarantee, so the transfer manager must now be explicitly synchronized with the buffer manager before the addSequence loop. Likewise, the buffer manager, which runs all the prefill and decode kernels, must be made to wait for the onboarding and offloading streams after the addSequence loop. The original C++ executor already did this, but the much more recent Python executor did not.
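
The intended ordering within one step can be sketched as follows. `FakeKvCacheImpl` and `prepare_step` are hypothetical and only record call order; the method names `sync_transfer_manager_with_buffer_manager` and `refresh_blocks` are the bindings named in this PR, while the single-argument `add_sequence` is a simplification. The sketch also assumes that `refresh_blocks` is the call that makes the buffer manager wait for the transfer streams.

```python
class FakeKvCacheImpl:
    """Hypothetical stand-in for the bound KV cache manager; records call order only."""

    def __init__(self):
        self.calls = []

    def sync_transfer_manager_with_buffer_manager(self):
        self.calls.append("sync_transfer_manager_with_buffer_manager")

    def add_sequence(self, request_id):
        # Simplified signature; offloads and onboards would be scheduled here.
        self.calls.append(f"add_sequence({request_id})")

    def refresh_blocks(self):
        self.calls.append("refresh_blocks")


def prepare_step(impl, new_request_ids):
    """Bracket the addSequence loop with the two synchronization points."""
    # Before the loop: transfer streams wait for work already queued on the buffer manager.
    impl.sync_transfer_manager_with_buffer_manager()
    for request_id in new_request_ids:
        impl.add_sequence(request_id)
    # After the loop: the buffer manager waits for the scheduled transfers (assumed).
    impl.refresh_blocks()


impl = FakeKvCacheImpl()
prepare_step(impl, [7, 8])
```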

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

Signed-off-by: Iman Tabrizian <[email protected]> .

Signed-off-by: thorjohnsen <[email protected]> .

Signed-off-by: thorjohnsen <[email protected]>

coderabbitai · 2025-11-11T02:53:45Z

Walkthrough

This change introduces synchronization mechanisms between the KV cache transfer manager and buffer manager. It adds syncTransferManagerWithBufferManager() methods across the manager hierarchy (BaseKVCacheManager, KVCacheManager, BlockManager, WindowBlockManager), implements syncWithBufferManager() in KVCacheTransferManager, refactors pending I/O tracking from a single map to separate read/write maps, and integrates synchronization calls in cache allocation and Python resource management.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Header declarations `cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h` | Added `syncTransferManagerWithBufferManager()` methods to WindowBlockManager, BlockManager, and KVCacheManager; added pure virtual `syncTransferManagerWithBufferManager()` to BaseKVCacheManager. |
| Transfer manager header `cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h` | Added public `syncWithBufferManager()` method; replaced `mPendingOffloads` map with separate `mPendingReads` and `mPendingWrites` maps using KVCacheIndex::UnderlyingType keys. |
| KV cache manager implementations `cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp` | Implemented BlockManager::syncTransferManagerWithBufferManager() to iterate WindowBlockManagers and call their sync methods; implemented WindowBlockManager::syncTransferManagerWithBufferManager() delegating to transfer manager. |
| Transfer manager implementation `cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp` | Implemented `syncWithBufferManager()` to synchronize internal copy streams; refactored onboard/offload flows to use pending reads/writes tracking with event waits; updated `syncTransfers()` to clear the new tracking maps. |
| Allocation logic `cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp` | Added call to `kvCacheManager.syncTransferManagerWithBufferManager()` at start of AllocateKvCache::operator(). |
| Python bindings (nanobind) `cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp` | Added PyKvCacheManager::syncTransferManagerWithBufferManager() override; exposed Python bindings for `sync_transfer_manager_with_buffer_manager` and `refresh_blocks` with gil-scoped-release guards. |
| Python bindings (pybind) `cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp` | Added PyKvCacheManager::syncTransferManagerWithBufferManager() override using NB_OVERRIDE_PURE; exposed Python bindings for `sync_transfer_manager_with_buffer_manager` and `refresh_blocks`. |
| Python resource manager `tensorrt_llm/_torch/pyexecutor/resource_manager.py` | Added call to `self.impl.sync_transfer_manager_with_buffer_manager()` before context batch processing; added call to `self.impl.refresh_blocks()` after KV cache state update. |

Sequence Diagram

sequenceDiagram
    participant alloc as AllocateKvCache
    participant kvMgr as KVCacheManager
    participant blkMgr as BlockManager
    participant wbMgr as WindowBlockManager
    participant xferMgr as KVCacheTransferManager
    participant bufMgr as BufferManager

    alloc->>kvMgr: syncTransferManagerWithBufferManager()
    kvMgr->>blkMgr: syncTransferManagerWithBufferManager()
    blkMgr->>wbMgr: syncTransferManagerWithBufferManager() (all windows)
    wbMgr->>xferMgr: syncWithBufferManager()
    xferMgr->>xferMgr: Wait on internal streams<br/>(offload/onboard)
    xferMgr->>bufMgr: Synchronize state
    xferMgr->>xferMgr: Clear pending reads/writes
    xferMgr-->>wbMgr: return
    wbMgr-->>blkMgr: return
    blkMgr-->>kvMgr: return
    kvMgr-->>alloc: return
    
    Note over alloc: Continue with context processing
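
The delegation chain in the diagram, including the fan-out over all window block managers, can be sketched like this. The Python class and method names mirror the C++ names from the walkthrough but are a hypothetical model; the `sync_count` counter and the window sizes are illustration only.

```python
class TransferManager:
    def __init__(self):
        self.sync_count = 0  # illustration only: counts how often sync was requested

    def sync_with_buffer_manager(self):
        self.sync_count += 1


class WindowBlockManager:
    def __init__(self):
        self.transfer_manager = TransferManager()

    def sync_transfer_manager_with_buffer_manager(self):
        self.transfer_manager.sync_with_buffer_manager()


class BlockManager:
    def __init__(self, window_sizes):
        # One window block manager (and transfer manager) per attention window size.
        self.window_block_managers = {w: WindowBlockManager() for w in window_sizes}

    def sync_transfer_manager_with_buffer_manager(self):
        for manager in self.window_block_managers.values():
            manager.sync_transfer_manager_with_buffer_manager()


class KVCacheManager:
    def __init__(self, window_sizes):
        self.block_manager = BlockManager(window_sizes)

    def sync_transfer_manager_with_buffer_manager(self):
        self.block_manager.sync_transfer_manager_with_buffer_manager()


kv_manager = KVCacheManager([1024, 4096])
kv_manager.sync_transfer_manager_with_buffer_manager()
sync_counts = [
    m.transfer_manager.sync_count
    for m in kv_manager.block_manager.window_block_managers.values()
]
```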

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

kvCacheTransferManager.cpp: Requires careful review of the new pending reads/writes synchronization logic, event recording patterns, and the replacement of the single pending offloads map—particularly the onboard/offload flows and event wait semantics.
Data structure migration: The transition from mPendingOffloads (int32_t keys) to mPendingReads and mPendingWrites (KVCacheIndex::UnderlyingType keys) needs verification for correctness across all affected methods.
Integration points: Verify that syncTransferManagerWithBufferManager() is called at appropriate times in AllocateKvCache and resource_manager.py relative to context processing.
Python binding consistency: Ensure both nanobind and pybind implementations correctly expose the new methods with proper gil-scoped-release guards.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.58% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly identifies a synchronization bug fix in KvCacheTransferManager that causes corrupted blocks, directly matching the changeset's focus on synchronization between transfer manager and buffer manager. |
| Description check | ✅ Passed | PR description provides detailed explanation of issues and solutions, but lacks Test Coverage section details and partially addresses PR Checklist items. |

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp (1)
29-63: Also sync the cross KV cache manager before addSequence

We now require syncTransferManagerWithBufferManager() to run before the first KVCacheManager::addSequence() each step so that any outstanding offload/onboard work is properly ordered. On this path we call it for kvCacheManager, but the optional crossKvCacheManager immediately enters addSequence() without the same synchronization. That leaves its transfer manager using stale pending state, which can let previous-step copies overlap with the new context allocation.

Please guard the optional and issue the sync before you start using it, e.g.:
kvCacheManager.syncTransferManagerWithBufferManager();
if (crossKvCacheManager)
{
    crossKvCacheManager->syncTransferManagerWithBufferManager();
}
Without this, cross-attention caches can still hit the race we’re trying to eliminate.

🧹 Nitpick comments (3)

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)
740-741: Use Doxygen comment format for function documentation.

The comment should use //! instead of // to conform to Doxygen format and match the style of other methods in this class.

As per coding guidelines.

Apply this diff:
-    // \brief Sync internal streams used by transfer manager with buffer manager stream
+    //! \brief Sync internal streams used by transfer manager with buffer manager stream
     void syncTransferManagerWithBufferManager();
1142-1143: Use Doxygen comment format for function documentation.

The comment should use //! instead of // to conform to Doxygen format and match the style of other methods in this class.

As per coding guidelines.

Apply this diff:
-    // \brief Sync internal streams used by transfer manager with buffer manager stream
+    //! \brief Sync internal streams used by transfer manager with buffer manager stream
     void syncTransferManagerWithBufferManager();
1341-1341: Consider adding a Doxygen comment for the new pure virtual method.

Adding a brief Doxygen comment would improve API documentation and help developers understand the purpose of this method.

Example:
+    //! \brief Synchronize internal streams used by transfer manager with buffer manager stream
     virtual void syncTransferManagerWithBufferManager() = 0;

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1fd1145 and 307479b.

📒 Files selected for processing (8)

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4 hunks)
cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h (2 hunks)
cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp (1 hunks)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1 hunks)
cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (2 hunks)
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (2 hunks)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)

🧰 Additional context used

📓 Path-based instructions (8)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}