[doc][serve][llm] Add user guide for kv-cache offloading #58025
Merged: kouroshHakha merged 10 commits into ray-project:master from kouroshHakha:kh/multiconnector-doc on Oct 23, 2025.
Commits (10, all by kouroshHakha):

- a9e152b Add KV connector factory and MultiConnector support
- f283756 wip
- 781f529 addressed gemini feedback
- 5165202 wip
- 640c15d wip
- cb5ac73 wip
- ff41842 [doc][serve][llm] Add user guide for kv-cache offloading
- fc2e31b wip
- 15750fb Merge branch 'master' into kh/multiconnector-doc
- dc2df9d comments
doc/source/serve/llm/user-guides/kv-cache-offloading.md (275 additions, 0 deletions):

(kv-cache-offloading-guide)=
# KV cache offloading

Extend KV cache capacity by offloading to CPU or local storage for larger batch sizes and reduced GPU memory pressure.

:::{note}
Ray Serve doesn't provide KV cache offloading out of the box, but integrates seamlessly with vLLM solutions. This guide demonstrates one such integration: LMCache.
:::

KV cache offloading moves key-value caches from GPU memory to alternative storage such as CPU RAM or local disk. This approach reduces GPU memory pressure and enables you to serve larger batches or longer contexts without running out of GPU memory.

Benefits of KV cache offloading:

- **Increased capacity**: Store more KV caches by using CPU RAM or local storage instead of relying solely on GPU memory
- **Cache reuse across requests**: Save and reuse previously computed KV caches for repeated or similar prompts, reducing prefill computation
- **Flexible storage backends**: Choose from multiple storage options including local CPU, disk, or distributed systems

Consider KV cache offloading when your application has repeated prompts or multi-turn conversations where you can reuse cached prefills.

## Deploy with LMCache

LMCache provides KV cache offloading with support for multiple storage backends.

### Prerequisites

Install LMCache:

```bash
uv pip install lmcache
```

### Basic deployment

The following example shows how to deploy with LMCache for local CPU offloading:

::::{tab-set}
:::{tab-item} Python
```python
from ray.serve.llm import LLMConfig, build_openai_app
import ray.serve as serve

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2-0.5B-Instruct"
    },
    engine_kwargs={
        "tensor_parallel_size": 1,
        "kv_transfer_config": {
            "kv_connector": "LMCacheConnectorV1",
            "kv_role": "kv_both",
        }
    },
    runtime_env={
        "env_vars": {
            "LMCACHE_LOCAL_CPU": "True",
            "LMCACHE_CHUNK_SIZE": "256",
            "LMCACHE_MAX_LOCAL_CPU_SIZE": "100",  # 100GB
        }
    }
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```
:::

:::{tab-item} YAML
```yaml
applications:
- name: llm-with-lmcache
  route_prefix: /
  import_path: ray.serve.llm:build_openai_app
  runtime_env:
    env_vars:
      LMCACHE_LOCAL_CPU: "True"
      LMCACHE_CHUNK_SIZE: "256"
      LMCACHE_MAX_LOCAL_CPU_SIZE: "100"
  args:
    llm_configs:
      - model_loading_config:
          model_id: qwen-0.5b
          model_source: Qwen/Qwen2-0.5B-Instruct
        engine_kwargs:
          tensor_parallel_size: 1
          kv_transfer_config:
            kv_connector: LMCacheConnectorV1
            kv_role: kv_both
```

Deploy with:

```bash
serve run config.yaml
```
:::
::::
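
After the application starts, you can verify it by sending a request to the OpenAI-compatible endpoint that `build_openai_app` exposes. The following is a minimal sketch, assuming the default Serve HTTP address (`http://localhost:8000`), the `model_id` from the config above, and that the `openai` client package is installed; none of these details are prescribed by the example itself.

```python
# Minimal client sketch (assumptions: Serve listening on localhost:8000,
# model_id "qwen-0.5b" as configured above, and the `openai` package installed).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

response = client.chat.completions.create(
    model="qwen-0.5b",
    messages=[{"role": "user", "content": "Summarize KV cache offloading in one sentence."}],
)
print(response.choices[0].message.content)
```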

## Compose multiple KV transfer backends with MultiConnector

You can combine multiple KV transfer backends using `MultiConnector`. This is useful when you want both local offloading and cross-instance transfer in disaggregated deployments.

### When to use MultiConnector

Use `MultiConnector` to combine multiple backends when:

- You're using prefill/decode disaggregation and want both cross-instance transfer (NIXL) and local offloading (LMCache)
- You need different KV transfer strategies for different layers of your architecture
- You want to maximize cache capacity while maintaining efficient cross-instance communication

### Deploy with MultiConnector

The following example shows how to combine NIXL (for cross-instance transfer) with LMCache (for local offloading) in a prefill/decode deployment:

:::{note}
The order of connectors matters. Since you want to prioritize local KV cache lookup through LMCache, it appears first in the list, before the NIXL connector.
:::

::::{tab-set}
:::{tab-item} Python
```python
from ray.serve.llm import LLMConfig, build_pd_openai_app
import ray.serve as serve

# Shared KV transfer config combining NIXL and LMCache
kv_config = {
    "kv_connector": "MultiConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "connectors": [
            {
                "kv_connector": "LMCacheConnectorV1",
                "kv_role": "kv_both",
            },
            {
                "kv_connector": "NixlConnector",
                "kv_role": "kv_both",
                "backends": ["UCX"],
            }
        ]
    }
}

prefill_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2-0.5B-Instruct"
    },
    engine_kwargs={
        "tensor_parallel_size": 1,
        "kv_transfer_config": kv_config,
    },
    runtime_env={
        "env_vars": {
            "LMCACHE_LOCAL_CPU": "True",
            "LMCACHE_CHUNK_SIZE": "256",
            "UCX_TLS": "all",
        }
    }
)

decode_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2-0.5B-Instruct"
    },
    engine_kwargs={
        "tensor_parallel_size": 1,
        "kv_transfer_config": kv_config,
    },
    runtime_env={
        "env_vars": {
            "LMCACHE_LOCAL_CPU": "True",
            "LMCACHE_CHUNK_SIZE": "256",
            "UCX_TLS": "all",
        }
    }
)

pd_config = {
    "prefill_config": prefill_config,
    "decode_config": decode_config,
}

app = build_pd_openai_app(pd_config)
serve.run(app)
```
:::

:::{tab-item} YAML
```yaml
applications:
- name: pd-multiconnector
  route_prefix: /
  import_path: ray.serve.llm:build_pd_openai_app
  runtime_env:
    env_vars:
      LMCACHE_LOCAL_CPU: "True"
      LMCACHE_CHUNK_SIZE: "256"
      UCX_TLS: "all"
  args:
    prefill_config:
      model_loading_config:
        model_id: qwen-0.5b
        model_source: Qwen/Qwen2-0.5B-Instruct
      engine_kwargs:
        tensor_parallel_size: 1
        kv_transfer_config:
          kv_connector: MultiConnector
          kv_role: kv_both
          kv_connector_extra_config:
            connectors:
              - kv_connector: LMCacheConnectorV1
                kv_role: kv_both
              - kv_connector: NixlConnector
                kv_role: kv_both
                backends: ["UCX"]
    decode_config:
      model_loading_config:
        model_id: qwen-0.5b
        model_source: Qwen/Qwen2-0.5B-Instruct
      engine_kwargs:
        tensor_parallel_size: 1
        kv_transfer_config:
          kv_connector: MultiConnector
          kv_role: kv_both
          kv_connector_extra_config:
            connectors:
              - kv_connector: LMCacheConnectorV1
                kv_role: kv_both
              - kv_connector: NixlConnector
                kv_role: kv_both
                backends: ["UCX"]
```

Deploy with:

```bash
serve run config.yaml
```
:::
::::

## Configuration parameters

### LMCache environment variables

- `LMCACHE_LOCAL_CPU`: Set to `"True"` to enable local CPU offloading
- `LMCACHE_CHUNK_SIZE`: Size of KV cache chunks, in tokens (default: 256)
- `LMCACHE_MAX_LOCAL_CPU_SIZE`: Maximum CPU storage size in GB (see the sizing sketch below)
- `LMCACHE_PD_BUFFER_DEVICE`: Buffer device for prefill/decode scenarios (default: "cpu")

For the full list of LMCache configuration options, see the [LMCache configuration reference](https://docs.lmcache.ai/api_reference/configurations.html).
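
To pick a value for `LMCACHE_MAX_LOCAL_CPU_SIZE`, it can help to estimate how much CPU memory the offloaded cache actually needs. The following back-of-the-envelope sketch uses the standard per-token KV size formula (one key and one value vector per layer, times KV heads, head dimension, and bytes per element); the Qwen2-0.5B-Instruct shape values below are illustrative assumptions, so substitute the numbers from your own model's config.

```python
# Rough KV cache sizing sketch. The model shape values are assumptions for
# Qwen2-0.5B-Instruct; in practice, read them from the model's config.json.
num_layers = 24        # assumed number of transformer layers
num_kv_heads = 2       # assumed number of key/value heads (GQA)
head_dim = 64          # assumed head dimension
dtype_bytes = 2        # fp16/bf16

# Each cached token stores one key and one value vector per layer.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

tokens_to_cache = 2_000_000  # e.g., many long conversations kept warm in CPU RAM
required_gib = kv_bytes_per_token * tokens_to_cache / (1024 ** 3)
print(f"~{kv_bytes_per_token} bytes/token, ~{required_gib:.1f} GiB for {tokens_to_cache:,} tokens")
```

Set `LMCACHE_MAX_LOCAL_CPU_SIZE` comfortably above this estimate, and below the host's available RAM, so cache entries aren't evicted before they can be reused.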

### MultiConnector configuration

- `kv_connector`: Set to `"MultiConnector"` to compose multiple backends
- `kv_connector_extra_config.connectors`: List of connector configurations to compose. Order matters: connectors earlier in the list take priority.
- Each connector in the list uses the same configuration format as standalone connectors

## Performance considerations

KV cache offloading trades a small amount of latency for reduced GPU memory pressure and larger effective cache capacity. Consider these factors:

**Overhead in cache-miss scenarios**: When there are no cache hits, offloading adds modest overhead (~10-15%) compared to pure GPU caching. This overhead comes from the additional hashing, data movement, and management operations.

**Benefits with cache hits**: When caches can be reused, offloading significantly reduces prefill computation. For example, in multi-turn conversations where users return after minutes of inactivity, LMCache retrieves the conversation history from CPU rather than recomputing it, cutting time to first token for follow-up requests.
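
As a concrete illustration of the cache-hit case, the sketch below replays a multi-turn conversation against the OpenAI-compatible endpoint: the follow-up request resends the full conversation history, so its shared prefix can be served from the offloaded cache instead of being re-prefilled. The endpoint address, model id, and use of the `openai` client are assumptions carried over from the deployment example above.

```python
# Hypothetical multi-turn flow: the second request reuses the cached prefix.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")  # assumed address
history = [{"role": "user", "content": "Here is a long document: ..."}]

first = client.chat.completions.create(model="qwen-0.5b", messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content})

# Follow-up turn: the prompt prefix (document + first exchange) is unchanged,
# so with offloading enabled its KV cache can be fetched rather than recomputed.
history.append({"role": "user", "content": "Now list the three main takeaways."})
second = client.chat.completions.create(model="qwen-0.5b", messages=history)
print(second.choices[0].message.content)
```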

**Network transfer costs**: When combining MultiConnector with cross-instance transfer (such as NIXL), ensure that the benefits of disaggregation outweigh the network transfer costs.

## See also

- {doc}`Prefill/decode disaggregation <prefill-decode>` - Deploy LLMs with separated prefill and decode phases
- [LMCache documentation](https://docs.lmcache.ai/) - Comprehensive LMCache configuration and features