[https://nvbugs/5501557][fix] Fix out-of-bounds vector access for model with multiple layer types #7636
Conversation
📝 Walkthrough

Adds a new static helper TrtGptModelInflightBatching::calculateCacheSizePerTokenForDisagg that computes per-token KV-cache sizes by grouping global attention-layer IDs (per PP rank) by attention-window size with wrap-around; replaces the prior inline lambda in createKvCacheManager with a call to the helper, and adds unit tests exercising the calculation. The public API is extended with the new static method.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant PP as PP Rank
    participant Model as ModelConfig/WorldConfig
    participant Base as BaseKVCacheManager
    Note over PP,Model: Input: modelConfig, worldConfig, maxAttentionWindowVec, isCrossAttention, kvFactor
    PP->>Model: query totalLayers, hiddenSize, kv dtype, layer partitioning
    PP->>PP: compute numLowerRankAttnLayers, numLocalAttnLayers
    PP->>PP: derive global layer IDs for local layers (start..end)
    PP->>PP: map each global layer ID -> window size (wrap when vec shorter)
    loop per unique window size
        PP->>Base: calculateCacheSizePerTokenForSingleWindowSize(globalLayerIds, ...)
        Base-->>PP: per-window per-token units
        PP->>PP: multiply by KV dtype size * kvFactor -> bytes per token
    end
    PP-->>PP: return map(windowSize -> bytesPerToken)
```
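For orientation, here is a minimal, self-contained sketch of the grouping step the walkthrough describes. It is not the PR's implementation: the bytesPerTokenForLayers callback stands in for BaseKVCacheManager::calculateCacheSizePerTokenForSingleWindowSize plus the dtype-size/kvFactor multiplication, and the integer types are simplified.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

// Sketch only: group this rank's global attention-layer IDs by window size (with wrap-around)
// and ask a per-window callback for the bytes-per-token of each group.
std::map<int32_t, int64_t> cacheSizePerTokenPerWindow(int32_t numLowerRankAttnLayers, int32_t numLocalAttnLayers,
    std::vector<int32_t> const& maxAttentionWindowVec,
    std::function<int64_t(std::vector<int32_t> const&)> const& bytesPerTokenForLayers)
{
    // Global IDs of the attention layers owned by this PP rank.
    int32_t const startAttnLayerId = numLowerRankAttnLayers;
    int32_t const endAttnLayerId = numLowerRankAttnLayers + numLocalAttnLayers;
    auto const numWindowSizes = static_cast<int32_t>(maxAttentionWindowVec.size());

    // Window sizes cycle when the vector is shorter than the number of attention layers.
    std::map<int32_t, std::vector<int32_t>> windowSizeToLayers;
    for (int32_t layerIdx = startAttnLayerId; layerIdx < endAttnLayerId; ++layerIdx)
    {
        auto const windowSize = maxAttentionWindowVec.at(layerIdx % numWindowSizes);
        windowSizeToLayers[windowSize].push_back(layerIdx);
    }

    // One bytes-per-token entry per unique window size.
    std::map<int32_t, int64_t> bytesPerTokenPerWindow;
    for (auto const& [windowSize, layerIds] : windowSizeToLayers)
    {
        bytesPerTokenPerWindow[windowSize] = bytesPerTokenForLayers(layerIds);
    }
    return bytesPerTokenPerWindow;
}
```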
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks (2 passed, 1 warning)
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Force-pushed from 3ddd8f1 to d71a260
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)
272-297: Good fix: use global attention-layer indices; add a guard for empty maxAttentionWindowVec

The switch to global attention-layer indexing (startAttnLayerId/endAttnLayerId) correctly avoids out-of-bounds when models interleave no-op and attention layers. One edge case: protect against an empty maxAttentionWindowVec to prevent modulo-by-zero on layerIdx % numNonUniqueWindowSizes.
Apply this diff to add a safety check:
```diff
 auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+TLLM_CHECK_WITH_INFO(numNonUniqueWindowSizes > 0,
+    "maxAttentionWindowVec must not be empty when computing per-window cache size.");
 std::map<SizeType32, std::vector<SizeType32>> uniqueWindowSizeToLayers;
 for (SizeType32 layerIdx = startAttnLayerId; layerIdx < endAttnLayerId; layerIdx++)
 {
     // maxAttentionWindowVec may or may not be stretched to the length of numLayers yet.
     // If not stretched yet, we cycle through the window sizes.
     auto const windowSize = maxAttentionWindowVec.at(layerIdx % numNonUniqueWindowSizes);
     uniqueWindowSizeToLayers[windowSize].push_back(layerIdx);
 }
```
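To make the wrap-around concrete, a small standalone example with illustrative values (not taken from the PR): three window sizes and attention layers 3..8 on a rank.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

int main()
{
    // Illustrative values only: the window vector is shorter than the layer range, so it cycles.
    std::vector<int32_t> const maxAttentionWindowVec{4096, 1024, 1024};
    std::map<int32_t, std::vector<int32_t>> uniqueWindowSizeToLayers;
    for (int32_t layerIdx = 3; layerIdx < 9; ++layerIdx)
    {
        auto const windowSize = maxAttentionWindowVec.at(layerIdx % static_cast<int32_t>(maxAttentionWindowVec.size()));
        uniqueWindowSizeToLayers[windowSize].push_back(layerIdx);
    }
    // Prints "1024 -> 4 5 7 8" and "4096 -> 3 6": layers sharing a window size are grouped together.
    for (auto const& [windowSize, layers] : uniqueWindowSizeToLayers)
    {
        std::cout << windowSize << " ->";
        for (auto const layerId : layers)
        {
            std::cout << ' ' << layerId;
        }
        std::cout << '\n';
    }
    return 0;
}
```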
🧹 Nitpick comments (4)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (4)
70-78: Make standard library includes explicit (map/tuple/chrono).

This file uses std::map, std::tuple, and std::chrono but doesn’t include their headers locally. Relying on transitive includes is brittle and may break with header changes.
Apply this diff:
```diff
 #include <algorithm>
 #include <cstddef>
+#include <chrono>
+#include <map>
 #include <memory>
 #include <optional>
 #include <stdexcept>
+#include <tuple>
 #include <thread>
 #include <utility>
 #include <vector>
```
83-86: Replace magic literal ‘2’ (K and V) with a named constant.

The kvFactor = 2 appears multiple times and encodes “K and V tensors per token”. Define a file-local constant for clarity and consistency.
Apply this diff:
```diff
 namespace tensorrt_llm::batch_manager
 {
+namespace
+{
+    // Number of KV tensors per token (K and V).
+    SizeType32 constexpr kKV_TENSORS_PER_TOKEN{2};
+}
+
 bool TrtGptModelInflightBatching::executorConfigIsValid(
```
300-307: Use the named constant instead of literal ‘2’.

Apply this diff:
```diff
-    auto const cacheSizeBytesPerTokenPerWindow = calculateCacheSizePerToken(
-        mModelConfig, mWorldConfig, getMaxAttentionWindowVec(), mModelConfig.useCrossAttention(), 2);
+    auto const cacheSizeBytesPerTokenPerWindow = calculateCacheSizePerToken(
+        mModelConfig, mWorldConfig, getMaxAttentionWindowVec(), mModelConfig.useCrossAttention(), kKV_TENSORS_PER_TOKEN);
```
673-676: Use the named constant for kvFactor here too.

Apply this diff:
```diff
-    auto blocksPerWindow = KVCacheManager::calculateMaxNumBlocks(kvCacheConfig, isCrossAttention, kvDtype, mModelConfig,
-        mWorldConfig, windowSizeToLayers, freePrimaryMemBytes, freeSecondaryMemBytes, extraCostMemory, 2);
+    auto blocksPerWindow = KVCacheManager::calculateMaxNumBlocks(kvCacheConfig, isCrossAttention, kvDtype, mModelConfig,
+        mWorldConfig, windowSizeToLayers, freePrimaryMemBytes, freeSecondaryMemBytes, extraCostMemory, kKV_TENSORS_PER_TOKEN);
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🧠 Learnings (7)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🔇 Additional comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
670-676: No action required: windowSizeToLayers is attention-only.
getNumKvHeadsPerLayerLocalRange uses countLowerRankLayers/countLocalLayers with LayerType::kATTENTION, so numLayers includes only attention layers and cannot index non-attention layers.
268-297: Confirm indexing semantics of getNumKvHeadsForGivenLayers
Inspect ModelConfig::getNumKvHeadsForGivenLayers in cpp/include/tensorrt_llm/runtime/modelConfig.h to ensure it expects full-model layer indices (global) – not zero-based per-rank local IDs – when computing KV-head counts for calculateCacheSizePerTokenForSingleWindowSize.
Force-pushed from d71a260 to ccbd9be
Actionable comments posted: 2
🧹 Nitpick comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
273-281: Prefer explicit SizeType32 for indices/counts to avoid implicit narrowing.

Make the types explicit to match downstream APIs and prevent subtle sign/width mismatches across platforms.
```diff
-    const auto numLocalAttnLayers = modelConfig.getNbAttentionLayers(
+    SizeType32 const numLocalAttnLayers = modelConfig.getNbAttentionLayers(
         worldConfig.getPipelineParallelism(), worldConfig.getPipelineParallelRank());
-    const auto numLowerRankAttnLayers = modelConfig.countLowerRankLayers(ModelConfig::LayerType::kATTENTION,
+    SizeType32 const numLowerRankAttnLayers = modelConfig.countLowerRankLayers(ModelConfig::LayerType::kATTENTION,
         worldConfig.getPipelineParallelism(), worldConfig.getPipelineParallelRank());
-    const auto startAttnLayerId = numLowerRankAttnLayers;
-    const auto endAttnLayerId = numLowerRankAttnLayers + numLocalAttnLayers;
-    auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+    SizeType32 const startAttnLayerId = numLowerRankAttnLayers;
+    SizeType32 const endAttnLayerId = numLowerRankAttnLayers + numLocalAttnLayers;
+    SizeType32 const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
```
70-78: Include <map> explicitly to avoid reliance on transitive headers.

This TU uses std::map directly (e.g., BlocksPerWindow, uniqueWindowSizeToLayers). Add the header explicitly.
```diff
 #include <algorithm>
 #include <cstddef>
+#include <map>
 #include <memory>
 #include <optional>
 #include <stdexcept>
 #include <thread>
 #include <utility>
 #include <vector>
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)
272-289: Correctly scopes window lookups to attention layers (prevents OOB).

Using global attention-layer ids derived from countLowerRankLayers/getNbAttentionLayers fixes the out-of-bounds access when non-attention layers are present on the rank. The modulo wrap keeps compatibility with compact window vectors. Looks good.
/bot run

PR_Github #18258 [ run ] triggered by Bot
pcastonguay left a comment
Lgtm but would like @chuangz0 to have a look since he's the author of those lines.
PR_Github #18258 [ run ] completed with state

/bot run

PR_Github #18263 [ run ] triggered by Bot
Actionable comments posted: 1
♻️ Duplicate comments (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)
86-107: Guard against empty maxAttentionWindowVec (avoid divide-by-zero/out-of-range).

If the vector is empty, layerIdx % numNonUniqueWindowSizes and .at() will fault. Add a defensive check:

```diff
 auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+TLLM_CHECK_WITH_INFO(
+    numNonUniqueWindowSizes > 0,
+    "maxAttentionWindowVec must contain at least one entry");
 std::map<SizeType32, std::vector<SizeType32>> uniqueWindowSizeToLayers;
```
🧹 Nitpick comments (3)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1)
151-154: Document the new public API and clarify units/semantics.

Add Doxygen explaining:
- Returns: bytes per token keyed by attention window size.
- Inputs: maxAttentionWindowVec cycling semantics; PP assumptions (global attention-layer IDs per local PP rank).
- kvFactor meaning (e.g., 2 for K+V).
Also consider a default value for kvFactor = 2 in the declaration to match all call sites.
```diff
-    [[nodiscard]] static std::map<SizeType32, SizeType32> calculateCacheSizePerToken(
-        runtime::ModelConfig const& modelConfig, runtime::WorldConfig const& worldConfig,
-        std::vector<SizeType32> const& maxAttentionWindowVec, bool isCrossAttention, SizeType32 kvFactor);
+    //! @brief Compute per-token KV-cache footprint in BYTES for this PP rank, grouped by attention-window size.
+    //! @param modelConfig Model configuration; uses global attention-layer indices.
+    //! @param worldConfig World configuration; PP rank is used to select this rank's attention layers.
+    //! @param maxAttentionWindowVec Window sizes; if shorter than #attention layers, values are cycled.
+    //! @param isCrossAttention Whether computing cross-attention cache size.
+    //! @param kvFactor Multiplicative factor for K/V (e.g., 2 for K+V).
+    //! @return map: window size -> bytes per token.
+    [[nodiscard]] static std::map<SizeType32, SizeType32> calculateCacheSizePerToken(
+        runtime::ModelConfig const& modelConfig, runtime::WorldConfig const& worldConfig,
+        std::vector<SizeType32> const& maxAttentionWindowVec, bool isCrossAttention, SizeType32 kvFactor /* = 2 */);
```

cpp/tests/unit_tests/executor/executorTestSmall.cpp (1)
225-287: Optional: add PP coverage.

Consider an extra case with pipelineParallelism > 1 (e.g., PP=2) to verify global attention-layer indexing per rank and window distribution are correct across PP splits.
I can draft a minimal PP test that configures WorldConfig with PP=2 and checks per-rank mappings.
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)
86-119: Remove stray semicolon after function definition.

A `};` after a free function definition is unnecessary noise.

```diff
-    return cacheSizeBytesPerTokenPerWindow;
-};
+    return cacheSizeBytesPerTokenPerWindow;
+}
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
cpp/tests/unit_tests/executor/executorTestSmall.cpp (2 hunks)
🧬 Code graph analysis (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
calculateCacheSizePerToken (86-118), calculateCacheSizePerToken (86-88)
cpp/include/tensorrt_llm/runtime/modelConfig.h (1)
ModelConfig(31-645)
cpp/tests/unit_tests/executor/executorTestSmall.cpp (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
calculateCacheSizePerToken (86-118), calculateCacheSizePerToken (86-88)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/tests/unit_tests/executor/executorTestSmall.cpp (1)
14-14: No action needed.

The added header isn’t used in this diff; fine to keep if future assertions use it.
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)
111-115: No translation needed: calculateCacheSizePerTokenForSingleWindowSize expects global layer indices. We build globalLayerIds via startAttnLayerId = countLowerRankLayers() and pass those into the function, which indexes into the full mNumKvHeads… vectors. Pipeline parallelism is already accounted for by selecting only this rank's layers.
Actionable comments posted: 0
♻️ Duplicate comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)
99-107: Add guard for empty maxAttentionWindowVec (avoid modulo-by-zero and at() on empty).

This prevents UB when the vector is empty.
Apply:
```diff
-    auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+    auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+    TLLM_CHECK_WITH_INFO(
+        numNonUniqueWindowSizes > 0,
+        "maxAttentionWindowVec must contain at least one entry");
     std::map<SizeType32, std::vector<SizeType32>> uniqueWindowSizeToLayers;
     for (SizeType32 layerIdx = startAttnLayerId; layerIdx < endAttnLayerId; layerIdx++)
     {
         // maxAttentionWindowVec may or may not be stretched to the length of numLayers yet.
         // If not stretched yet, we cycle through the window sizes.
-        auto const windowSize = maxAttentionWindowVec.at(layerIdx % numNonUniqueWindowSizes);
+        auto const windowSize = maxAttentionWindowVec.at(layerIdx % numNonUniqueWindowSizes);
         uniqueWindowSizeToLayers[windowSize].push_back(layerIdx);
     }
```

cpp/tests/unit_tests/executor/executorTestSmall.cpp (1)
212-219: Stabilize test by setting KV dtype explicitly to FLOAT.

Prevents flakiness if ModelConfig’s KV dtype default isn’t 4 bytes.
```diff
 auto modelConfig = runtime::ModelConfig(
     vocabSize, nbLayers, nbAttentionLayers, nbRnnLayers, nbHeads, hiddenSize, nvinfer1::DataType::kFLOAT);
 modelConfig.useGptAttentionPlugin(true);
 modelConfig.setModelVariant(runtime::ModelConfig::ModelVariant::kGpt);
 modelConfig.setKVCacheType(runtime::ModelConfig::KVCacheType::kPAGED);
+// Ensure 4-byte expectation in assertions.
+modelConfig.setKvDataType(nvinfer1::DataType::kFLOAT);
```
🧹 Nitpick comments (4)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
108-115: Minor: compute element size once.

Avoid repeated BufferDataType construction in the loop.
Apply:
```diff
-    std::map<SizeType32, SizeType32> cacheSizeBytesPerTokenPerWindow;
-    for (auto const& [windowSize, globalLayerIds] : uniqueWindowSizeToLayers)
+    std::map<SizeType32, SizeType32> cacheSizeBytesPerTokenPerWindow;
+    auto const bytesPerElem = BufferDataType(modelConfig.getKvDataType()).getSize();
+    for (auto const& [windowSize, globalLayerIds] : uniqueWindowSizeToLayers)
     {
         auto const cacheSizePerToken = BaseKVCacheManager::calculateCacheSizePerTokenForSingleWindowSize(
             modelConfig, globalLayerIds, isCrossAttention, kvFactor);
-        auto const cacheSizeBytesPerToken = cacheSizePerToken * BufferDataType(modelConfig.getKvDataType()).getSize();
+        auto const cacheSizeBytesPerToken = cacheSizePerToken * bytesPerElem;
         cacheSizeBytesPerTokenPerWindow[windowSize] = cacheSizeBytesPerToken;
     }
```
304-306: Name the KV factor; avoid magic literal.

Define a local constexpr (e.g., kKV_FACTOR = 2) and use it here.
```diff
-    auto const cacheSizeBytesPerTokenPerWindow = calculateCacheSizePerTokenForDisagg(
-        mModelConfig, mWorldConfig, getMaxAttentionWindowVec(), mModelConfig.useCrossAttention(), 2);
+    auto const cacheSizeBytesPerTokenPerWindow = calculateCacheSizePerTokenForDisagg(
+        mModelConfig, mWorldConfig, getMaxAttentionWindowVec(), mModelConfig.useCrossAttention(), kKV_FACTOR);
```

Add near file top (inside anonymous namespace or at file scope):

```cpp
constexpr runtime::SizeType32 kKV_FACTOR = 2; // K and V
```

cpp/tests/unit_tests/executor/executorTestSmall.cpp (2)
14-14: Remove unused include.

<unordered_map> is not used.
```diff
-#include <unordered_map>
```
225-287: Optional: add a CROSS-attention case.

Add a case with isCrossAttention=true (and an encoder-length window) to exercise the cross path.
I can add a minimal test that sets maxAttentionWindowVec={maxEncoderLen} and toggles isCrossAttention=true; want me to push a patch?
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2 hunks)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
cpp/tests/unit_tests/executor/executorTestSmall.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
🧬 Code graph analysis (1)
cpp/tests/unit_tests/executor/executorTestSmall.cpp (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
calculateCacheSizePerTokenForDisagg (86-118), calculateCacheSizePerTokenForDisagg (86-88)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
304-323: Cross/self prealloc windowing mismatch — separate CROSS windows.
You’re feeding the same cacheSizeBytesPerTokenPerWindow (built from getMaxAttentionWindowVec()) into both SELF and CROSS pre-allocation, but CROSS should use encoder-only window sizes (e.g. {getMaxEncoderLen()}). Either compute and pass a CROSS-specific window map to calculateCacheSizePerTokenForDisagg/preAllocBufferSize, or verify that the existing logic internally distinguishes CROSS semantics to avoid mis-sized buffer allocations.
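A minimal sketch of the suggested separation, assuming getMaxEncoderLen() is the right source for the CROSS window (hypothetical wiring, not verified against the actual call sites):

```cpp
// Hypothetical sketch: build a CROSS-specific window vector from the encoder length and reuse
// the same helper, instead of feeding the SELF window map into both pre-allocation paths.
std::vector<SizeType32> const crossWindowVec{mModelConfig.getMaxEncoderLen()};
auto const crossCacheSizeBytesPerTokenPerWindow = calculateCacheSizePerTokenForDisagg(
    mModelConfig, mWorldConfig, crossWindowVec, /*isCrossAttention=*/true, /*kvFactor=*/2);
```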
101-113: Ignore this check: calculateCacheSizePerTokenForSingleWindowSize is defined to take global layer indices and is being passed globalLayerIds correctly. Likely an incorrect or invalid review comment.
…el with multiple layer types Signed-off-by: Balaram Buddharaju <[email protected]>
Force-pushed from bccda5d to 0b387d6
/bot run --disable-fail-fast

PR_Github #18285 [ run ] triggered by Bot

PR_Github #18263 [ run ] completed with state
Actionable comments posted: 0
♻️ Duplicate comments (1)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)
99-106: Guard against empty maxAttentionWindowVec (prevents modulo-by-zero and at() OOB).

If maxAttentionWindowVec is empty, layerIdx % numNonUniqueWindowSizes divides by zero and at() will throw.
Apply:
```diff
-    auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+    auto const numNonUniqueWindowSizes = static_cast<SizeType32>(maxAttentionWindowVec.size());
+    TLLM_CHECK_WITH_INFO(
+        numNonUniqueWindowSizes > 0,
+        "maxAttentionWindowVec must contain at least one entry");
```
🧹 Nitpick comments (3)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (3)
108-115: Compute dtype size once; use 64-bit for bytes then clamp with a checked cast.

Avoid repeated BufferDataType construction in the loop and guard against overflow when converting to SizeType32.
Apply:
```diff
-    std::map<SizeType32, SizeType32> cacheSizeBytesPerTokenPerWindow;
+    std::map<SizeType32, SizeType32> cacheSizeBytesPerTokenPerWindow;
+    auto const elemSizeBytes = BufferDataType(modelConfig.getKvDataType()).getSize();
     for (auto const& [windowSize, globalLayerIds] : uniqueWindowSizeToLayers)
     {
         auto const cacheSizePerToken = BaseKVCacheManager::calculateCacheSizePerTokenForSingleWindowSize(
             modelConfig, globalLayerIds, isCrossAttention, kvFactor);
-        auto const cacheSizeBytesPerToken = cacheSizePerToken * BufferDataType(modelConfig.getKvDataType()).getSize();
-        cacheSizeBytesPerTokenPerWindow[windowSize] = cacheSizeBytesPerToken;
+        auto const bytes64 = static_cast<uint64_t>(cacheSizePerToken) * static_cast<uint64_t>(elemSizeBytes);
+        TLLM_CHECK_WITH_INFO(
+            bytes64 <= static_cast<uint64_t>(std::numeric_limits<SizeType32>::max()),
+            "Per-token KV cache size (bytes) exceeds SizeType32 range");
+        cacheSizeBytesPerTokenPerWindow[windowSize] = static_cast<SizeType32>(bytes64);
     }
```

Also add (outside this hunk) the header:

```cpp
#include <limits>
```
118-118: Remove trailing semicolon after function definition.

Minor style nit; avoid stray semicolons after function bodies.
```diff
-};
+}
```
304-306: Replace magic literal ‘2’ with a named constant for kvFactor.

Improves readability and avoids scattering the meaning of kvFactor.
```diff
-    auto const cacheSizeBytesPerTokenPerWindow = calculateCacheSizePerTokenForDisagg(
-        mModelConfig, mWorldConfig, getMaxAttentionWindowVec(), mModelConfig.useCrossAttention(), 2);
+    SizeType32 constexpr kKV_FACTOR = 2; // keys + values per token
+    auto const cacheSizeBytesPerTokenPerWindow = calculateCacheSizePerTokenForDisagg(
+        mModelConfig, mWorldConfig, getMaxAttentionWindowVec(), mModelConfig.useCrossAttention(), kKV_FACTOR);
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2 hunks)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
cpp/tests/unit_tests/executor/executorTestSmall.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
- cpp/tests/unit_tests/executor/executorTestSmall.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
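For orientation, here is a tiny illustrative snippet, not taken from the repository, showing how several of these conventions (Allman braces, 'k'-prefixed constants, 'm'-prefixed private members, lowerCamelCase methods, namespace closing comments) combine:

```cpp
namespace example
{

int constexpr kMAX_RETRIES = 3; // constant: 'k' prefix + UPPER_SNAKE_CASE

class FooBar // type name: CamelCase starting with an uppercase letter
{
public:
    explicit FooBar(int nbValues)
        : mNbValues(nbValues) // private member: 'm' prefix + CamelCase
    {
    }

    int getNbValues() const // method: lowerCamelCase
    {
        return mNbValues;
    }

private:
    int mNbValues;
};

} // namespace example
```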
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🧠 Learnings (5)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
111-113: The requested source excerpts will help clarify whether `calculateCacheSizePerTokenForSingleWindowSize` uses global or local layer IDs.
90-107: No action needed for zero local attention layers
preAllocBufferSize safely returns 0 for an empty map, and downstream buffer allocation handles this case without assertions.
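To make the global-vs-local distinction concrete, here is a minimal sketch of how rank-local attention layers could be mapped to global layer IDs under pipeline parallelism; all names and the even-split assumption are hypothetical simplifications rather than the actual ModelConfig/WorldConfig API:

```cpp
#include <vector>

// Hypothetical helper: assumes attention layers are split evenly across
// pipeline-parallel ranks and numbered contiguously across the whole model.
std::vector<int> globalAttnLayerIdsForRank(int ppRank, int ppSize, int numTotalAttnLayers)
{
    int const layersPerRank = numTotalAttnLayers / ppSize; // even split assumed
    int const startId = ppRank * layersPerRank;            // first global id owned by this rank
    int const endId = startId + layersPerRank;             // one past the last global id

    std::vector<int> globalIds;
    for (int layerId = startId; layerId < endId; ++layerId)
    {
        globalIds.push_back(layerId); // global ids, not rank-local 0-based indices
    }
    return globalIds;
}
```

Passing these global IDs (rather than 0..layersPerRank-1) is what the per-window calculation relies on; if a rank owns no attention layers, the vector and therefore the resulting window map are simply empty, which matches the 'no action needed' note above.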
/bot run --disable-fail-fast
PR_Github #18319 [ run ] triggered by Bot
…el with multiple layer types (NVIDIA#7636) Signed-off-by: Balaram Buddharaju <[email protected]>
…el with multiple layer types (NVIDIA#7636) Signed-off-by: Balaram Buddharaju <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Description
This MR fixes https://nvbugspro.nvidia.com/bug/5501557.
Background:
`CacheTransBufferManager`. While it works fine for models with only attention layers, it's failing for Nemotron Super because of the multiple layer types.

Root cause:
Other issues:
- `groupLayersByWindowSize` doesn't account for PP. [example]
- `calculateCacheSizePerTokenForSingleWindowSize` expects the layer ids passed to it to be global (and not local as in PP ranks); see the sketch below. [explained]

Fix:
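To illustrate the grouping the corrected computation relies on, namely bucketing global attention-layer IDs by their attention-window size, here is a minimal hypothetical sketch. It is not the PR's actual implementation; the wrap-around over a shorter window vector and all names are assumptions for illustration only.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical sketch: group global attention-layer IDs by attention-window size.
// If the window vector is shorter than the number of layers, cycle through it
// (an assumption about how per-layer window sizes are specified).
std::map<int, std::vector<int>> groupGlobalLayersByWindow(
    std::vector<int> const& globalAttnLayerIds, std::vector<int> const& windowSizes)
{
    std::map<int, std::vector<int>> windowToLayers;
    if (windowSizes.empty())
    {
        return windowToLayers; // nothing to group against
    }
    for (int const globalId : globalAttnLayerIds)
    {
        auto const windowSize = windowSizes[static_cast<std::size_t>(globalId) % windowSizes.size()];
        windowToLayers[windowSize].push_back(globalId);
    }
    return windowToLayers;
}
```

With, say, global IDs {4, 5, 6, 7} on one PP rank and window sizes {1024, 4096}, layers 4 and 6 fall in the 1024 bucket and layers 5 and 7 in the 4096 bucket, so the per-token size is computed once per distinct window rather than indexed out of bounds.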
Test Coverage
Unit test:
Running mmlu_pro eval on gemma-2-9b:
Testing the originally failing model:
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provides a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill
`kill`

Kill all running builds associated with pull request.
skip
`skip --comment COMMENT`

Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
Bug Fixes
Refactor
New Features
Tests