@@ -272,8 +272,8 @@ TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer
auto [numKvHeadsPerLayerBegin, numKvHeadsPerLayerEnd] = modelConfig.getNumKvHeadsPerLayerLocalRange(
worldConfig.getPipelineParallelism(), worldConfig.getPipelineParallelRank(), isCrossAttention);
auto numKvHeadsPerLayer = std::vector<SizeType32>(numKvHeadsPerLayerBegin, numKvHeadsPerLayerEnd);
- auto windowSizeLayers
-     = BaseKVCacheManager::groupLayersByWindowSize(maxAttentionWindowVec, modelConfig.getNbLayers());
+ auto const numLayers = static_cast<SizeType32>(numKvHeadsPerLayer.size());
+ auto const windowSizeLayers = KVCacheManager::groupLayersByWindowSize(maxAttentionWindowVec, numLayers);
brb-nv (Collaborator) commented on Sep 9, 2025:
Here's my issue with groupLayersByWindowSize: it doesn't take pipeline parallelism (PP) into account, which leads to incorrect memory computation.
https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp#L2177-L2192

Let's take an example of a model with 4 layers and maxAttentionWindowVec = {2, 3, 4, 5}:

Case 1: PP=1
ppRank 0; layers = {l0, l1, l2, l3}
Input to groupLayersByWindowSize: maxAttentionWindowVec={2, 3, 4, 5}, numLayers=4.
Output from groupLayersByWindowSize: {2:{0}, 3:{1}, 4:{2}, 5:{3}}.

Case 2: PP=2
ppRank 0; layers = {l0, l1}
Input to groupLayersByWindowSize: maxAttentionWindowVec={2, 3, 4, 5}, numLayers=2.
Output from groupLayersByWindowSize: {2:{0}, 3:{1}}.
ppRank 1; layers = {l2, l3}
Input to groupLayersByWindowSize: maxAttentionWindowVec={2, 3, 4, 5}, numLayers=2.
Output from groupLayersByWindowSize: {2:{0}, 3:{1}}. // THIS SHOULD HAVE BEEN {4:{0}, 5:{1}}, WHERE 0 AND 1 ARE LOCAL LAYER IDS.

The problem is that it only takes numLayers as input. Instead, it should take a layer range (where the layer ids are global) to get the math right, or maxAttentionWindowVec must be different for each PP rank.

Please let me know if I'm missing something.
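
For concreteness, here is a minimal, self-contained sketch of the two behaviors. This is not the actual TensorRT-LLM implementation (the real one is linked above); groupLayersLocalOnly, groupLayersWithGlobalRange, and firstGlobalLayer are made-up names for illustration.

```cpp
#include <cstdint>
#include <map>
#include <vector>

using SizeType32 = std::int32_t;

// Mirrors the current behavior as I understand it: the function only sees the
// local layer count, so every PP rank indexes maxAttentionWindowVec from 0.
std::map<SizeType32, std::vector<SizeType32>> groupLayersLocalOnly(
    std::vector<SizeType32> const& maxAttentionWindowVec, SizeType32 numLayers)
{
    auto const numWindows = static_cast<SizeType32>(maxAttentionWindowVec.size());
    std::map<SizeType32, std::vector<SizeType32>> groups;
    for (SizeType32 localLayer = 0; localLayer < numLayers; ++localLayer)
    {
        auto const windowSize = maxAttentionWindowVec[localLayer % numWindows];
        groups[windowSize].push_back(localLayer);
    }
    return groups;
}

// Suggested alternative: look up the window size by global layer id while
// keeping the grouped layer ids local to the PP rank.
std::map<SizeType32, std::vector<SizeType32>> groupLayersWithGlobalRange(
    std::vector<SizeType32> const& maxAttentionWindowVec, SizeType32 firstGlobalLayer,
    SizeType32 numLocalLayers)
{
    auto const numWindows = static_cast<SizeType32>(maxAttentionWindowVec.size());
    std::map<SizeType32, std::vector<SizeType32>> groups;
    for (SizeType32 localLayer = 0; localLayer < numLocalLayers; ++localLayer)
    {
        auto const globalLayer = firstGlobalLayer + localLayer;
        auto const windowSize = maxAttentionWindowVec[globalLayer % numWindows];
        groups[windowSize].push_back(localLayer);
    }
    return groups;
}

// With maxAttentionWindowVec = {2, 3, 4, 5} and PP=2:
//   rank 0: groupLayersWithGlobalRange(vec, /*firstGlobalLayer=*/0, 2) -> {2:{0}, 3:{1}}
//   rank 1: groupLayersWithGlobalRange(vec, /*firstGlobalLayer=*/2, 2) -> {4:{0}, 5:{1}}
```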

std::map<SizeType32, SizeType32> cacheSizeBytesPerTokenPerWindow;
for (auto const& [windowSize, managedLayers] : windowSizeLayers)
{
brb-nv (Collaborator) commented on Sep 9, 2025:
There's a similar issue with calculateCacheSizePerTokenForSingleWindowSize, which in turn calls getNumKvForGivenLayers.

To my understanding, getNumKvForGivenLayers expects global layer ids. Currently, we pass in local layer ids when using PP.
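
In case it helps, here's a rough sketch of the local-to-global translation I have in mind before the ids reach getNumKvForGivenLayers. toGlobalLayerIds is a hypothetical helper; it assumes layers are split evenly across PP ranks, with totalLayers/ppSize/ppRank coming from modelConfig.getNbLayers(), worldConfig.getPipelineParallelism(), and worldConfig.getPipelineParallelRank(). The actual layer distribution and the real signature of getNumKvForGivenLayers may differ.

```cpp
#include <cstdint>
#include <map>
#include <vector>

using SizeType32 = std::int32_t;

// Hypothetical helper: map each window-size group of local layer ids to the
// corresponding global layer ids for this PP rank, assuming an even split of
// layers across ranks.
std::map<SizeType32, std::vector<SizeType32>> toGlobalLayerIds(
    std::map<SizeType32, std::vector<SizeType32>> const& windowSizeToLocalLayers,
    SizeType32 totalLayers, SizeType32 ppSize, SizeType32 ppRank)
{
    auto const layersPerRank = totalLayers / ppSize;  // assumes an even split
    auto const globalOffset = ppRank * layersPerRank; // first global layer on this rank

    std::map<SizeType32, std::vector<SizeType32>> windowSizeToGlobalLayers;
    for (auto const& [windowSize, localLayers] : windowSizeToLocalLayers)
    {
        auto& globalLayers = windowSizeToGlobalLayers[windowSize];
        globalLayers.reserve(localLayers.size());
        for (auto const localLayer : localLayers)
        {
            globalLayers.push_back(globalOffset + localLayer);
        }
    }
    return windowSizeToGlobalLayers;
}

// Example with the 4-layer model above, PP=2, ppRank=1: a corrected local
// grouping {4:{0}, 5:{1}} becomes {4:{2}, 5:{3}}, which is what a function
// expecting global layer ids should receive.
```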
