
Conversation

@brb-nv
Collaborator

@brb-nv brb-nv commented Sep 9, 2025

Description

This PR fixes https://nvbugspro.nvidia.com/bug/5501557.

Background:

  • Nemotron Super has 80 layers, of which 49 are attention layers and 31 are no-ops.
  • calculateCacheSizePerToken is used to compute the memory allocation for disagg serving's CacheTransBufferManager. It works fine for models with only attention layers, but it fails for Nemotron Super because of the mix of layer types.

Root cause:

  • The per-token cache-size computation walks over every layer on the rank and looks up per-attention-layer data (e.g., numKVHeads) with those indices. That data only exists for the 49 attention layers, so once no-op layers enter the mix the lookup indexes past the end of the vector (the out-of-bounds access in the bug).

Other issues:

  • groupLayersByWindowSize doesn't account for pipeline parallelism (PP). [example]
  • calculateCacheSizePerTokenForSingleWindowSize expects the layer ids passed in to be global, not local to a PP rank. [explained]

Fix:

  • We look up numKVHeads specifically for attention layers.
  • We also use global layer ids that account for pipeline parallelism (see the sketch below).
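
In essence, the new per-token sizing helper does the following (a condensed sketch mirroring the excerpts quoted in the review comments further down, not the verbatim implementation):

// Compute the global ids of this PP rank's attention layers.
auto const numLocalAttnLayers = modelConfig.getNbAttentionLayers(
    worldConfig.getPipelineParallelism(), worldConfig.getPipelineParallelRank());
auto const numLowerRankAttnLayers = modelConfig.countLowerRankLayers(ModelConfig::LayerType::kATTENTION,
    worldConfig.getPipelineParallelism(), worldConfig.getPipelineParallelRank());
auto const startAttnLayerId = numLowerRankAttnLayers;
auto const endAttnLayerId = numLowerRankAttnLayers + numLocalAttnLayers;

// Group global attention-layer ids by window size. maxAttentionWindowVec may not be
// stretched to the number of layers yet, so cycle through it.
auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
std::map<SizeType32, std::vector<SizeType32>> uniqueWindowSizeToLayers;
for (SizeType32 layerIdx = startAttnLayerId; layerIdx < endAttnLayerId; layerIdx++)
{
    auto const windowSize = maxAttentionWindowVec.at(layerIdx % numNonUniqueWindowSizes);
    uniqueWindowSizeToLayers[windowSize].push_back(layerIdx);
}

// For each window, derive bytes per token from the per-token cache size of its layers.
std::map<SizeType32, SizeType32> cacheSizeBytesPerTokenPerWindow;
for (auto const& [windowSize, globalLayerIds] : uniqueWindowSizeToLayers)
{
    auto const cacheSizePerToken = BaseKVCacheManager::calculateCacheSizePerTokenForSingleWindowSize(
        modelConfig, globalLayerIds, isCrossAttention, kvFactor);
    cacheSizeBytesPerTokenPerWindow[windowSize]
        = cacheSizePerToken * BufferDataType(modelConfig.getKvDataType()).getSize();
}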

Test Coverage

Unit test:

./cpp/build/tests/unit_tests/executor/executorTestSmall

Running mmlu_pro eval on gemma-2-9b:

// Create extra llmapi options for trtllm-serve.
$ cat gemma2_extra_llmapi_config.yml 
kv_cache_config:
  max_attention_window: [4096, 8192]
  enable_block_reuse: false

// Launch the model server.
$ trtllm-serve /home/scratch.trt_llm_data/llm-models/gemma/gemma-2-9b-it/ --backend trt --tp_size 1 --port 8000 --max_batch_size 8  --extra_llm_api_options gemma2_extra_llmapi_config.yml

// Launch eval client.
$ docker run --rm -it --ipc host --network host -u root -p 8000:8000 nvcr.io/nvidia/eval-factory/simple-evals:25.07.3

// Run eval command within eval client.
# core_evals_simple_evals run_eval --eval_type mmlu_pro --model_id gemma/gemma-2-9b-it --model_url http://127.0.0.1:8000/v1/chat/completions --model_type chat --output_dir /workspace/results_new/

Testing the originally failing model:

// Convert checkpoint to TRTLLM format.
$ python examples/models/core/nemotron_nas/convert_checkpoint.py --model_dir ../random/hf_models/Llama-3_3-Nemotron-Super-49B-v1_5/ --output_dir nemotron_super_bf16_tp2_ckpt/ --dtype bfloat16 --tp_size 2 --trust_remote_code

// Build the engines in advance.
$ trtllm-build --checkpoint_dir nemotron_super_bf16_tp2_ckpt/ --output_dir nemotron_super_bf16_tp2_eng/ --gemm_plugin auto --kv_cache_type paged

// Run model inference.
$ mpirun -n 2 python examples/run.py --engine_dir nemotron_super_bf16_tp2_eng/ --tokenizer_dir ../random/hf_models/Llama-3_3-Nemotron-Super-49B-v1_5/ --max_output_len 512

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
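
For example, a comment combining the flags above (using the example stage name from this help text) could be:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast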

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

Summary by CodeRabbit

  • Bug Fixes

    • Per-token KV-cache sizing now correctly accounts for global layer indices across pipeline ranks, avoiding under- or over-allocation while preserving wrap-around when attention-window config is shorter than total layers.
  • Refactor

    • Centralized the layer-to-window grouping using a global-layer view so cache planning is consistent across ranks with no change to external behavior.
  • New Features

    • Added a public helper to compute per-token cache sizes for disaggregated layers.
  • Tests

    • Added unit tests validating cache-size distributions across window and layer configurations.

@brb-nv brb-nv requested a review from a team as a code owner September 9, 2025 05:53
@coderabbitai
Contributor

coderabbitai bot commented Sep 9, 2025

📝 Walkthrough

Adds a new static helper TrtGptModelInflightBatching::calculateCacheSizePerTokenForDisagg that computes per-token KV-cache sizes by grouping global attention-layer IDs (per PP rank) by attention-window size with wrap-around; replaces prior inline lambda in createKvCacheManager to call the helper and adds unit tests exercising the calculation. Public API extended with the new static method.

Changes

Cohort / File(s) Summary of changes
Global KV-cache sizing logic
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
Add calculateCacheSizePerTokenForDisagg(...) that computes global-layer IDs for this pipeline rank, groups layers by entries in maxAttentionWindowVec (wrapping when vec shorter than total layers), calls BaseKVCacheManager::calculateCacheSizePerTokenForSingleWindowSize per group, multiplies by KV dtype size and kvFactor to produce bytes-per-token per window, and replace previous inline lambda in createKvCacheManager with this helper.
Public API (header)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
Add public [[nodiscard]] static std::map<SizeType32, SizeType32> calculateCacheSizePerTokenForDisagg(runtime::ModelConfig const&, runtime::WorldConfig const&, std::vector<SizeType32> const&, bool, SizeType32); declaration placed before terminateRequest.
Unit tests
cpp/tests/unit_tests/executor/executorTestSmall.cpp
Add calculateCacheSizePerTokenHelper(...), include <unordered_map>, and a new test suite TrtInflightBatchingTest with four cases validating per-token cache size distribution across attention-window vectors, attention/RNN layer counts, and kvFactor.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant PP as PP Rank
  participant Model as ModelConfig/WorldConfig
  participant Base as BaseKVCacheManager

  Note over PP,Model: Input: modelConfig, worldConfig, maxAttentionWindowVec, isCrossAttention, kvFactor
  PP->>Model: query totalLayers, hiddenSize, kv dtype, layer partitioning
  PP->>PP: compute numLowerRankAttnLayers, numLocalAttnLayers
  PP->>PP: derive global layer IDs for local layers (start..end)
  PP->>PP: map each global layer ID -> window size (wrap when vec shorter)
  loop per unique window size
    PP->>Base: calculateCacheSizePerTokenForSingleWindowSize(globalLayerIds, ...)
    Base-->>PP: per-window per-token units
    PP->>PP: multiply by KV dtype size * kvFactor → bytes per token
  end
  PP-->>PP: return map(windowSize -> bytesPerToken)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The pull request title succinctly and accurately describes the primary change by referencing the fixed out-of-bounds vector access issue for models with multiple layer types and includes the relevant NVBugs ID, making it clear and specific without extraneous detail.
Description Check ✅ Passed The PR description is thorough and follows the repository template's main sections: it includes a clear Description with background, root cause, and fix, Test Coverage with unit and evaluation steps, and the PR Checklist and CI/bot instructions, plus a link to the NVBugs report; this gives reviewers sufficient context and repro steps to validate the change. The required information to understand and test the change is present and well explained. Therefore the description is mostly complete.

@brb-nv brb-nv force-pushed the user/brb/fix-out-of-bounds-vector-access branch 2 times, most recently from 3ddd8f1 to d71a260 on September 9, 2025 05:57
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

272-297: Good fix: use global attention-layer indices; add a guard for empty maxAttentionWindowVec

The switch to global attention-layer indexing (startAttnLayerId/endAttnLayerId) correctly avoids out-of-bounds when models interleave no-op and attention layers. One edge case: protect against an empty maxAttentionWindowVec to prevent modulo-by-zero on layerIdx % numNonUniqueWindowSizes.

Apply this diff to add a safety check:

           auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+          TLLM_CHECK_WITH_INFO(numNonUniqueWindowSizes > 0,
+              "maxAttentionWindowVec must not be empty when computing per-window cache size.");
           std::map<SizeType32, std::vector<SizeType32>> uniqueWindowSizeToLayers;
           for (SizeType32 layerIdx = startAttnLayerId; layerIdx < endAttnLayerId; layerIdx++)
           {
               // maxAttentionWindowVec may or may not be stretched to the length of numLayers yet.
               // If not stretched yet, we cycle through the window sizes.
               auto const windowSize = maxAttentionWindowVec.at(layerIdx % numNonUniqueWindowSizes);
               uniqueWindowSizeToLayers[windowSize].push_back(layerIdx);
           }
🧹 Nitpick comments (4)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (4)

70-78: Make standard library includes explicit (map/tuple/chrono).

This file uses std::map, std::tuple, and std::chrono but doesn’t include their headers locally. Relying on transitive includes is brittle and may break with header changes.

Apply this diff:

 #include <algorithm>
 #include <cstddef>
+#include <chrono>
+#include <map>
 #include <memory>
 #include <optional>
 #include <stdexcept>
+#include <tuple>
 #include <thread>
 #include <utility>
 #include <vector>

83-86: Replace magic literal ‘2’ (K and V) with a named constant.

The kvFactor = 2 appears multiple times and encodes “K and V tensors per token”. Define a file-local constant for clarity and consistency.

Apply this diff:

 namespace tensorrt_llm::batch_manager
 {
 
+namespace
+{
+    // Number of KV tensors per token (K and V).
+    SizeType32 constexpr kKV_TENSORS_PER_TOKEN{2};
+}
+
 bool TrtGptModelInflightBatching::executorConfigIsValid(

300-307: Use the named constant instead of literal ‘2’.

Apply this diff:

-        auto const cacheSizeBytesPerTokenPerWindow = calculateCacheSizePerToken(
-            mModelConfig, mWorldConfig, getMaxAttentionWindowVec(), mModelConfig.useCrossAttention(), 2);
+        auto const cacheSizeBytesPerTokenPerWindow = calculateCacheSizePerToken(
+            mModelConfig, mWorldConfig, getMaxAttentionWindowVec(), mModelConfig.useCrossAttention(), kKV_TENSORS_PER_TOKEN);

673-676: Use the named constant for kvFactor here too.

Apply this diff:

-    auto blocksPerWindow = KVCacheManager::calculateMaxNumBlocks(kvCacheConfig, isCrossAttention, kvDtype, mModelConfig,
-        mWorldConfig, windowSizeToLayers, freePrimaryMemBytes, freeSecondaryMemBytes, extraCostMemory, 2);
+    auto blocksPerWindow = KVCacheManager::calculateMaxNumBlocks(kvCacheConfig, isCrossAttention, kvDtype, mModelConfig,
+        mWorldConfig, windowSizeToLayers, freePrimaryMemBytes, freeSecondaryMemBytes, extraCostMemory, kKV_TENSORS_PER_TOKEN);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 49dcc0d and d71a260.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🧠 Learnings (7)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🔇 Additional comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)

670-676: No action required: windowSizeToLayers is attention-only.
getNumKvHeadsPerLayerLocalRange uses countLowerRankLayers/countLocalLayers with LayerType::kATTENTION, so numLayers includes only attention layers and cannot index non-attention layers.


268-297: Confirm indexing semantics of getNumKvHeadsForGivenLayers
Inspect ModelConfig::getNumKvHeadsForGivenLayers in cpp/include/tensorrt_llm/runtime/modelConfig.h to ensure it expects full-model layer indices (global) – not zero-based per-rank local IDs – when computing KV-head counts for calculateCacheSizePerTokenForSingleWindowSize.

@brb-nv brb-nv force-pushed the user/brb/fix-out-of-bounds-vector-access branch from d71a260 to ccbd9be on September 9, 2025 15:14
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)

273-281: Prefer explicit SizeType32 for indices/counts to avoid implicit narrowing.

Make the types explicit to match downstream APIs and prevent subtle sign/width mismatches across platforms.

-            const auto numLocalAttnLayers = modelConfig.getNbAttentionLayers(
+            SizeType32 const numLocalAttnLayers = modelConfig.getNbAttentionLayers(
                 worldConfig.getPipelineParallelism(), worldConfig.getPipelineParallelRank());
-            const auto numLowerRankAttnLayers = modelConfig.countLowerRankLayers(ModelConfig::LayerType::kATTENTION,
+            SizeType32 const numLowerRankAttnLayers = modelConfig.countLowerRankLayers(ModelConfig::LayerType::kATTENTION,
                 worldConfig.getPipelineParallelism(), worldConfig.getPipelineParallelRank());
-            const auto startAttnLayerId = numLowerRankAttnLayers;
-            const auto endAttnLayerId = numLowerRankAttnLayers + numLocalAttnLayers;
-            auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+            SizeType32 const startAttnLayerId = numLowerRankAttnLayers;
+            SizeType32 const endAttnLayerId = numLowerRankAttnLayers + numLocalAttnLayers;
+            SizeType32 const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());

70-78: Include explicitly to avoid reliance on transitive headers.

This TU uses std::map directly (e.g., BlocksPerWindow, uniqueWindowSizeToLayers). Add the header explicitly.

 #include <algorithm>
 #include <cstddef>
+#include <map>
 #include <memory>
 #include <optional>
 #include <stdexcept>
 #include <thread>
 #include <utility>
 #include <vector>
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d71a260 and ccbd9be.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🧠 Learnings (7)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

272-289: Correctly scopes window lookups to attention layers (prevents OOB).

Using global attention-layer ids derived from countLowerRankLayers/getNbAttentionLayers fixes the out-of-bounds access when non-attention layers are present on the rank. The modulo wrap keeps compatibility with compact window vectors. Looks good.

@brb-nv
Collaborator Author

brb-nv commented Sep 9, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18258 [ run ] triggered by Bot

Collaborator

@pcastonguay pcastonguay left a comment


Lgtm but would like @chuangz0 to have a look since he's the author of those lines.

@tensorrt-cicd
Collaborator

PR_Github #18258 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #365 completed with status: 'FAILURE'

@brb-nv
Collaborator Author

brb-nv commented Sep 9, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18263 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

86-107: Guard against empty maxAttentionWindowVec (avoid divide-by-zero/out-of-range).

If the vector is empty, layerIdx % numNonUniqueWindowSizes and .at() will fault. Add a defensive check.

     auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+    TLLM_CHECK_WITH_INFO(
+        numNonUniqueWindowSizes > 0,
+        "maxAttentionWindowVec must contain at least one entry");
     std::map<SizeType32, std::vector<SizeType32>> uniqueWindowSizeToLayers;
🧹 Nitpick comments (3)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1)

151-154: Document the new public API and clarify units/semantics.

Add Doxygen explaining:

  • Returns: bytes per token keyed by attention window size.
  • Inputs: maxAttentionWindowVec cycling semantics; PP assumptions (global attention-layer IDs per local PP rank).
  • kvFactor meaning (e.g., 2 for K+V).

Also consider a default value for kvFactor = 2 in the declaration to match all call sites.

-    [[nodiscard]] static std::map<SizeType32, SizeType32> calculateCacheSizePerToken(
-        runtime::ModelConfig const& modelConfig, runtime::WorldConfig const& worldConfig,
-        std::vector<SizeType32> const& maxAttentionWindowVec, bool isCrossAttention, SizeType32 kvFactor);
+    //! @brief Compute per-token KV-cache footprint in BYTES for this PP rank, grouped by attention-window size.
+    //! @param modelConfig Model configuration; uses global attention-layer indices.
+    //! @param worldConfig World configuration; PP rank is used to select this rank's attention layers.
+    //! @param maxAttentionWindowVec Window sizes; if shorter than #attention layers, values are cycled.
+    //! @param isCrossAttention Whether computing cross-attention cache size.
+    //! @param kvFactor Multiplicative factor for K/V (e.g., 2 for K+V).
+    //! @return map: window size -> bytes per token.
+    [[nodiscard]] static std::map<SizeType32, SizeType32> calculateCacheSizePerToken(
+        runtime::ModelConfig const& modelConfig, runtime::WorldConfig const& worldConfig,
+        std::vector<SizeType32> const& maxAttentionWindowVec, bool isCrossAttention, SizeType32 kvFactor /* = 2 */);
cpp/tests/unit_tests/executor/executorTestSmall.cpp (1)

225-287: Optional: add PP coverage.

Consider an extra case with pipelineParallelism > 1 (e.g., PP=2) to verify global attention-layer indexing per rank and window distribution are correct across PP splits.

I can draft a minimal PP test that configures WorldConfig with PP=2 and checks per-rank mappings.

cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

86-119: Remove stray semicolon after function definition.

}; after a free function definition is unnecessary noise.

-    return cacheSizeBytesPerTokenPerWindow;
-};
+    return cacheSizeBytesPerTokenPerWindow;
+}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ccbd9be and f083d44.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🧠 Learnings (5)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🧬 Code graph analysis (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
  • calculateCacheSizePerToken (86-118)
  • calculateCacheSizePerToken (86-88)
cpp/include/tensorrt_llm/runtime/modelConfig.h (1)
  • ModelConfig (31-645)
cpp/tests/unit_tests/executor/executorTestSmall.cpp (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
  • calculateCacheSizePerToken (86-118)
  • calculateCacheSizePerToken (86-88)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/tests/unit_tests/executor/executorTestSmall.cpp (1)

14-14: No action needed.

The added header isn’t used in this diff; fine to keep if future assertions use it.

cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

111-115: No translation needed: calculateCacheSizePerTokenForSingleWindowSize expects global layer indices. We build globalLayerIds via startAttnLayerId = countLowerRankLayers() and pass those into the function, which indexes into the full mNumKvHeads… vectors. Pipeline‐parallelism is already accounted for by selecting only this rank’s layers.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

99-107: Add guard for empty maxAttentionWindowVec (avoid modulo-by-zero and at() on empty).

This prevents UB when the vector is empty.

Apply:

-    auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+    auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+    TLLM_CHECK_WITH_INFO(
+        numNonUniqueWindowSizes > 0,
+        "maxAttentionWindowVec must contain at least one entry");
     std::map<SizeType32, std::vector<SizeType32>> uniqueWindowSizeToLayers;
     for (SizeType32 layerIdx = startAttnLayerId; layerIdx < endAttnLayerId; layerIdx++)
     {
         // maxAttentionWindowVec may or may not be stretched to the length of numLayers yet.
         // If not stretched yet, we cycle through the window sizes.
-        auto const windowSize = maxAttentionWindowVec.at(layerIdx % numNonUniqueWindowSizes);
+        auto const windowSize = maxAttentionWindowVec.at(layerIdx % numNonUniqueWindowSizes);
         uniqueWindowSizeToLayers[windowSize].push_back(layerIdx);
     }
cpp/tests/unit_tests/executor/executorTestSmall.cpp (1)

212-219: Stabilize test by setting KV dtype explicitly to FLOAT.

Prevents flakiness if ModelConfig’s KV dtype default isn’t 4 bytes.

     auto modelConfig = runtime::ModelConfig(
         vocabSize, nbLayers, nbAttentionLayers, nbRnnLayers, nbHeads, hiddenSize, nvinfer1::DataType::kFLOAT);
     modelConfig.useGptAttentionPlugin(true);
     modelConfig.setModelVariant(runtime::ModelConfig::ModelVariant::kGpt);
     modelConfig.setKVCacheType(runtime::ModelConfig::KVCacheType::kPAGED);
+    // Ensure 4-byte expectation in assertions.
+    modelConfig.setKvDataType(nvinfer1::DataType::kFLOAT);
🧹 Nitpick comments (4)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)

108-115: Minor: compute element size once.

Avoid repeated BufferDataType construction in the loop.

Apply:

-    std::map<SizeType32, SizeType32> cacheSizeBytesPerTokenPerWindow;
-    for (auto const& [windowSize, globalLayerIds] : uniqueWindowSizeToLayers)
+    std::map<SizeType32, SizeType32> cacheSizeBytesPerTokenPerWindow;
+    auto const bytesPerElem = BufferDataType(modelConfig.getKvDataType()).getSize();
+    for (auto const& [windowSize, globalLayerIds] : uniqueWindowSizeToLayers)
     {
         auto const cacheSizePerToken = BaseKVCacheManager::calculateCacheSizePerTokenForSingleWindowSize(
             modelConfig, globalLayerIds, isCrossAttention, kvFactor);
-        auto const cacheSizeBytesPerToken = cacheSizePerToken * BufferDataType(modelConfig.getKvDataType()).getSize();
+        auto const cacheSizeBytesPerToken = cacheSizePerToken * bytesPerElem;
         cacheSizeBytesPerTokenPerWindow[windowSize] = cacheSizeBytesPerToken;
     }

304-306: Name the KV factor; avoid magic literal.

Define a local constexpr (e.g., kKV_FACTOR = 2) and use it here.

-        auto const cacheSizeBytesPerTokenPerWindow = calculateCacheSizePerTokenForDisagg(
-            mModelConfig, mWorldConfig, getMaxAttentionWindowVec(), mModelConfig.useCrossAttention(), 2);
+        auto const cacheSizeBytesPerTokenPerWindow = calculateCacheSizePerTokenForDisagg(
+            mModelConfig, mWorldConfig, getMaxAttentionWindowVec(), mModelConfig.useCrossAttention(), kKV_FACTOR);

Add near file top (inside anonymous namespace or at file scope):

constexpr runtime::SizeType32 kKV_FACTOR = 2; // K and V
cpp/tests/unit_tests/executor/executorTestSmall.cpp (2)

14-14: Remove unused include.

<unordered_map> is not used.

-#include <unordered_map>

225-287: Optional: add a CROSS-attention case.

Add a case with isCrossAttention=true (and an encoder-length window) to exercise the cross path.

I can add a minimal test that sets maxAttentionWindowVec={maxEncoderLen} and toggles isCrossAttention=true; want me to push a patch?

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f083d44 and bccda5d.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
🧠 Learnings (5)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🧬 Code graph analysis (1)
cpp/tests/unit_tests/executor/executorTestSmall.cpp (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
  • calculateCacheSizePerTokenForDisagg (86-118)
  • calculateCacheSizePerTokenForDisagg (86-88)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)

304-323: Cross/self prealloc windowing mismatch — separate CROSS windows.
You’re feeding the same cacheSizeBytesPerTokenPerWindow (built from getMaxAttentionWindowVec()) into both SELF and CROSS pre-allocation, but CROSS should use encoder-only window sizes (e.g. {getMaxEncoderLen()}). Either compute and pass a CROSS-specific window map to calculateCacheSizePerTokenForDisagg/preAllocBufferSize, or verify that the existing logic internally distinguishes CROSS semantics to avoid mis-sized buffer allocations.
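
A minimal standalone sketch of the distinction follows; it is not the repo's API, and the struct, helper, window sizes, and head counts are illustrative assumptions. Per-token KV bytes are summed over the layers assigned to a window, with SELF keyed by the decoder attention windows and CROSS keyed by the encoder length alone.

#include <cstdint>
#include <map>
#include <vector>

struct LayerKvInfo
{
    int numKvHeads; // 0 for no-op layers, so they contribute nothing
    int headSize;
};

// Bytes of K and V stored per token for one group of layers sharing a window.
std::uint64_t kvBytesPerToken(std::vector<LayerKvInfo> const& layers, int kvFactor, int dtypeBytes)
{
    std::uint64_t bytes = 0;
    for (auto const& layer : layers)
    {
        bytes += static_cast<std::uint64_t>(layer.numKvHeads) * layer.headSize * kvFactor * dtypeBytes;
    }
    return bytes;
}

int main()
{
    int const kvFactor = 2;   // K and V
    int const dtypeBytes = 2; // e.g. a 16-bit KV-cache dtype

    // SELF: layers partitioned across the decoder attention windows.
    std::vector<LayerKvInfo> const layersForWindow4096{{8, 128}, {8, 128}};
    std::vector<LayerKvInfo> const layersForWindow8192{{8, 128}, {0, 0}}; // includes one no-op layer
    std::map<int, std::uint64_t> const selfBytesPerWindow{
        {4096, kvBytesPerToken(layersForWindow4096, kvFactor, dtypeBytes)},
        {8192, kvBytesPerToken(layersForWindow8192, kvFactor, dtypeBytes)}};

    // CROSS: keyed by the encoder length only, not by the decoder windows.
    int const maxEncoderLen = 2048;
    std::vector<LayerKvInfo> const crossLayers{{8, 128}, {8, 128}};
    std::map<int, std::uint64_t> const crossBytesPerWindow{
        {maxEncoderLen, kvBytesPerToken(crossLayers, kvFactor, dtypeBytes)}};

    return selfBytesPerWindow.at(4096) > 0 && crossBytesPerWindow.at(maxEncoderLen) > 0 ? 0 : 1;
}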


101-113: Ignore this check—calculateCacheSizePerTokenForSingleWindowSize is defined to take global layer indices and is being passed globalLayerIds correctly.

Likely an incorrect or invalid review comment.

…el with multiple layer types

Signed-off-by: Balaram Buddharaju <[email protected]>
@brb-nv brb-nv force-pushed the user/brb/fix-out-of-bounds-vector-access branch from bccda5d to 0b387d6 Compare September 10, 2025 02:14
@brb-nv
Collaborator Author

brb-nv commented Sep 10, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #18285 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18263 [ run ] completed with state ABORTED

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

99-106: Guard against empty maxAttentionWindowVec (prevents modulo-by-zero and at() OOB).

If maxAttentionWindowVec is empty, layerIdx % numNonUniqueWindowSizes is a modulo by zero (undefined behavior), and any at() lookup on the empty vector would throw.

Apply:

-    auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+    auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+    TLLM_CHECK_WITH_INFO(
+        numNonUniqueWindowSizes > 0,
+        "maxAttentionWindowVec must contain at least one entry");
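
A standalone sketch of the guarded pattern (the function name, the assert standing in for TLLM_CHECK_WITH_INFO, and the omission of pipeline-parallel layer offsets are simplifications, not the repo's code):

#include <cassert>
#include <map>
#include <vector>

// Group global layer ids by window size; the window list is applied cyclically
// (layer i gets windowSizes[i % windowSizes.size()]), so an empty list must be
// rejected up front instead of reaching a modulo-by-zero.
std::map<int, std::vector<int>> groupLayersByWindow(int numLayers, std::vector<int> const& windowSizes)
{
    assert(!windowSizes.empty() && "windowSizes must contain at least one entry");
    std::map<int, std::vector<int>> layersPerWindow;
    for (int layerIdx = 0; layerIdx < numLayers; ++layerIdx)
    {
        int const windowSize = windowSizes.at(layerIdx % static_cast<int>(windowSizes.size()));
        layersPerWindow[windowSize].push_back(layerIdx);
    }
    return layersPerWindow;
}

int main()
{
    // Alternating window pattern, e.g. {4096, 8192} tiled over 8 layers.
    auto const groups = groupLayersByWindow(8, {4096, 8192});
    return groups.at(4096).size() == 4 && groups.at(8192).size() == 4 ? 0 : 1;
}
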
🧹 Nitpick comments (3)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (3)

108-115: Compute dtype size once; use 64-bit for bytes then clamp with a checked cast.

Avoid repeated BufferDataType construction in the loop and guard against overflow when converting to SizeType32.

Apply:

-    std::map<SizeType32, SizeType32> cacheSizeBytesPerTokenPerWindow;
+    std::map<SizeType32, SizeType32> cacheSizeBytesPerTokenPerWindow;
+    auto const elemSizeBytes = BufferDataType(modelConfig.getKvDataType()).getSize();
     for (auto const& [windowSize, globalLayerIds] : uniqueWindowSizeToLayers)
     {
         auto const cacheSizePerToken = BaseKVCacheManager::calculateCacheSizePerTokenForSingleWindowSize(
             modelConfig, globalLayerIds, isCrossAttention, kvFactor);
-        auto const cacheSizeBytesPerToken = cacheSizePerToken * BufferDataType(modelConfig.getKvDataType()).getSize();
-        cacheSizeBytesPerTokenPerWindow[windowSize] = cacheSizeBytesPerToken;
+        auto const bytes64 = static_cast<uint64_t>(cacheSizePerToken) * static_cast<uint64_t>(elemSizeBytes);
+        TLLM_CHECK_WITH_INFO(
+            bytes64 <= static_cast<uint64_t>(std::numeric_limits<SizeType32>::max()),
+            "Per-token KV cache size (bytes) exceeds SizeType32 range");
+        cacheSizeBytesPerTokenPerWindow[windowSize] = static_cast<SizeType32>(bytes64);
     }

Also add (outside this hunk) the header:

#include <limits>
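
For reference, here is a standalone sketch of that checked-narrowing pattern; the helper name and the exception standing in for TLLM_CHECK_WITH_INFO are placeholders, not the repo's API.

#include <cstdint>
#include <limits>
#include <stdexcept>

// Accumulate byte counts in 64 bits, then narrow with an explicit range check
// rather than a silent cast.
std::int32_t checkedNarrowToInt32(std::uint64_t value)
{
    if (value > static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max()))
    {
        throw std::overflow_error("per-token KV cache size (bytes) exceeds the 32-bit range");
    }
    return static_cast<std::int32_t>(value);
}

int main()
{
    std::uint64_t const bytes64 = static_cast<std::uint64_t>(1024) * 1024 * 2; // elements * dtype size
    return checkedNarrowToInt32(bytes64) == 2 * 1024 * 1024 ? 0 : 1;
}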

118-118: Remove trailing semicolon after function definition.

Minor style nit; avoid stray semicolons after function bodies.

-};
+}

304-306: Replace magic literal ‘2’ with a named constant for kvFactor.

Improves readability and avoids scattering the meaning of kvFactor.

-        auto const cacheSizeBytesPerTokenPerWindow = calculateCacheSizePerTokenForDisagg(
-            mModelConfig, mWorldConfig, getMaxAttentionWindowVec(), mModelConfig.useCrossAttention(), 2);
+        SizeType32 constexpr kKV_FACTOR = 2; // keys + values per token
+        auto const cacheSizeBytesPerTokenPerWindow = calculateCacheSizePerTokenForDisagg(
+            mModelConfig, mWorldConfig, getMaxAttentionWindowVec(), mModelConfig.useCrossAttention(), kKV_FACTOR);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bccda5d and 0b387d6.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
  • cpp/tests/unit_tests/executor/executorTestSmall.cpp
🔇 Additional comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)

111-113: The requested source excerpts will help clarify whether calculateCacheSizePerTokenForSingleWindowSize uses global or local layer IDs.


90-107: No action needed for zero local attention layers
preAllocBufferSize safely returns 0 for an empty map, and downstream buffer allocation handles this case without assertions.

@brb-nv
Collaborator Author

brb-nv commented Sep 10, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #18319 [ run ] triggered by Bot

@schetlur-nv schetlur-nv merged commit c0e4fce into NVIDIA:release/1.0 Sep 11, 2025
7 checks passed
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 15, 2025
…el with multiple layer types (NVIDIA#7636)

Signed-off-by: Balaram Buddharaju <[email protected]>
dominicshanshan pushed additional commits to dominicshanshan/TensorRT-LLM that referenced this pull request between Sep 15 and Sep 19, 2025
chzblych pushed a commit that referenced this pull request Sep 22, 2025
…el with multiple layer types (#7636)

Signed-off-by: Balaram Buddharaju <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
JunyiXu-nv pushed a commit to JunyiXu-nv/TensorRT-LLM that referenced this pull request Sep 22, 2025
…el with multiple layer types (NVIDIA#7636)

Signed-off-by: Balaram Buddharaju <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
nv-lschneider pushed a commit to nv-lschneider/TensorRT-LLM that referenced this pull request Sep 22, 2025
…el with multiple layer types (NVIDIA#7636)

Signed-off-by: Balaram Buddharaju <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>