2 changes: 1 addition & 1 deletion docs/source/legacy/advanced/kv-cache-reuse.md
@@ -64,7 +64,7 @@ There are a few pitfalls that can prevent kv cache reuse when that seems possible

KV cache state for system prompts will remain reusable until memory is needed for launching a new request or propagating an existing one. When this happens, reusable blocks are evicted based on LRU. System prompts that are frequently used have a better chance of remaining reusable, but there is no guarantee, since launching new requests takes priority over possible reuse. Running with a larger batch size or longer output sequence lengths, for example, will reduce the probability of KV cache blocks being reused, since it increases memory needs.
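
To see why larger batches or longer outputs squeeze out reusable blocks, it helps to estimate how much KV cache memory the active requests occupy. The sketch below is a back-of-the-envelope calculation, not a TensorRT-LLM API; the model dimensions (layer count, KV heads, head size, FP16 storage) are illustrative assumptions, substitute your model's values.

```python
# Rough, illustrative estimate of KV cache memory demand (not a TensorRT-LLM API).
# All model dimensions below are hypothetical placeholders.

def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 32,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for `num_tokens` tokens."""
    # Factor of 2 accounts for storing both the key and the value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

# A larger batch or longer outputs means more live tokens, so fewer blocks stay
# free for reuse: e.g. 64 concurrent sequences of 1024 tokens each.
demand = kv_cache_bytes(num_tokens=64 * 1024)
print(f"~{demand / 2**30:.1f} GiB of KV cache occupied by active requests")
```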

KV cache state is stored in blocks, and each block holds multiple tokens. Only full blocks can be shared by multiple requests, so the block size matters. Partially matched blocks can also be reused, but that creates a new copy of the block for each sequence. The block size is a trade-off: a larger block size may improve the efficiency of compute kernels, but it reduces the likelihood of KV cache state reuse. The block size defaults to 128 tokens; this can be changed when the model is built with the trtllm-build command, for example

```trtllm-build --tokens_per_block 32 ...```
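
The effect of the block size on reuse can be reasoned about with simple arithmetic: only whole blocks of a matching prefix can be shared in place, while the trailing partial block is copied per sequence. Below is a hedged sketch of that arithmetic in plain Python; it is not part of TensorRT-LLM and the numbers are just examples.

```python
# Illustrative arithmetic only; the function name and values are hypothetical.

def reuse_breakdown(shared_prefix_tokens: int, tokens_per_block: int):
    """Split a shared prompt prefix into blocks that can be shared in place
    and a trailing partial block that each sequence copies."""
    full_blocks = shared_prefix_tokens // tokens_per_block
    partial_tokens = shared_prefix_tokens % tokens_per_block
    return full_blocks, partial_tokens

# A 1000-token system prompt with the default 128-token blocks:
print(reuse_breakdown(1000, 128))  # (7, 104): 7 shared blocks, 104 tokens copied
# The same prompt with 32-token blocks shares a larger fraction in place:
print(reuse_breakdown(1000, 32))   # (31, 8): 31 shared blocks, 8 tokens copied
```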
