(kv-cache-offloading-guide)=
# KV cache offloading

Extend KV cache capacity by offloading to CPU memory or local disk for larger batch sizes and reduced GPU memory pressure.

:::{note}
Ray Serve doesn't provide KV cache offloading out of the box, but integrates seamlessly with vLLM solutions. This guide demonstrates one such integration: LMCache.
:::


Benefits of KV cache offloading:

- **Increased capacity**: Store more KV caches by using CPU RAM or local storage instead of relying solely on GPU memory
- **Cache reuse across requests**: Save and reuse previously computed KV caches for repeated or similar prompts, reducing prefill computation
- **Flexible storage backends**: Choose from multiple storage options including local CPU, disk, or distributed systems

Consider KV cache offloading when your application has repeated prompts or multi-turn conversations where you can reuse cached prefills. If follow-up queries in a conversation don't arrive immediately, the GPU evicts those caches to make room for other concurrent requests, causing cache misses. Offloading KV caches to CPU memory or other storage backends, which have much larger capacity, preserves them for longer periods.

## Deploy with LMCache

LMCache provides KV cache offloading with support for multiple storage backends.

### Prerequisites

Install LMCache:

```bash
uv pip install lmcache
```
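
LMCache needs to live in the same Python environment as Ray Serve and the vLLM engine it hooks into. If you're setting up the environment from scratch, the following is a minimal sketch, assuming the `ray[serve,llm]` extras (which pull in Ray Serve's LLM APIs and a compatible vLLM) fit your setup; pin versions as appropriate for your deployment:

```bash
# Assumption: the ray[serve,llm] extras provide ray.serve.llm and vLLM;
# adjust or pin versions to match your cluster image.
uv pip install "ray[serve,llm]" lmcache
```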

### Basic deployment

The following example shows how to deploy with LMCache for local CPU offloading:

::::{tab-set}
:::{tab-item} Python
```python
from ray.serve.llm import LLMConfig, build_openai_app
import ray.serve as serve

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2-0.5B-Instruct"
    },
    engine_kwargs={
        "tensor_parallel_size": 1,
        "kv_transfer_config": {
            "kv_connector": "LMCacheConnectorV1",
            "kv_role": "kv_both",
        }
    },
    runtime_env={
        "env_vars": {
            "LMCACHE_LOCAL_CPU": "True",
            "LMCACHE_CHUNK_SIZE": "256",
            "LMCACHE_MAX_LOCAL_CPU_SIZE": "100",  # 100 GB
        }
    }
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```
:::

:::{tab-item} YAML
```yaml
applications:
  - name: llm-with-lmcache
    route_prefix: /
    import_path: ray.serve.llm:build_openai_app
    runtime_env:
      env_vars:
        LMCACHE_LOCAL_CPU: "True"
        LMCACHE_CHUNK_SIZE: "256"
        LMCACHE_MAX_LOCAL_CPU_SIZE: "100"
    args:
      llm_configs:
        - model_loading_config:
            model_id: qwen-0.5b
            model_source: Qwen/Qwen2-0.5B-Instruct
          engine_kwargs:
            tensor_parallel_size: 1
            kv_transfer_config:
              kv_connector: LMCacheConnectorV1
              kv_role: kv_both
```

Deploy with:

```bash
serve run config.yaml
```
:::
::::
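
After the deployment is up, you can send requests through the OpenAI-compatible API that `build_openai_app` exposes. The following sketch assumes the default Serve HTTP address (`http://localhost:8000`), the `qwen-0.5b` model ID from the config above, and that the `openai` Python client is installed; `api_key` is a placeholder on the assumption that no key enforcement is configured:

```python
from openai import OpenAI

# base_url assumes the default Ray Serve HTTP address; adjust if you changed it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen-0.5b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain KV cache offloading in one sentence."},
    ],
)
print(response.choices[0].message.content)
```

Requests that share a long prefix, such as the same system prompt or conversation history, are the ones that benefit from LMCache, because their KV caches can be restored from CPU memory instead of being recomputed.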

## Compose multiple KV transfer backends with MultiConnector

You can combine multiple KV transfer backends using `MultiConnector`. This is useful when you want both local offloading and cross-instance transfer in disaggregated deployments.

### When to use MultiConnector

Use `MultiConnector` to combine multiple backends when you're using prefill/decode disaggregation and want both cross-instance transfer (NIXL) and local offloading.


The following example shows how to combine NIXL (for cross-instance transfer) with LMCache (for local offloading) in a prefill/decode deployment:

:::{note}
The order of connectors matters. Since you want to prioritize local KV cache lookup through LMCache, it appears first in the list, before the NIXL connector.
:::

::::{tab-set}
:::{tab-item} Python
```python
from ray.serve.llm import LLMConfig, build_pd_openai_app
import ray.serve as serve

# Shared KV transfer config combining NIXL and LMCache
kv_config = {
    "kv_connector": "MultiConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "connectors": [
            {
                "kv_connector": "LMCacheConnectorV1",
                "kv_role": "kv_both",
            },
            {
                "kv_connector": "NixlConnector",
                "kv_role": "kv_both",
                "backends": ["UCX"],
            }
        ]
    }
}

prefill_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2-0.5B-Instruct"
    },
    engine_kwargs={
        "tensor_parallel_size": 1,
        "kv_transfer_config": kv_config,
    },
    runtime_env={
        "env_vars": {
            "LMCACHE_LOCAL_CPU": "True",
            "LMCACHE_CHUNK_SIZE": "256",
            "UCX_TLS": "all",
        }
    }
)

decode_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2-0.5B-Instruct"
    },
    engine_kwargs={
        "tensor_parallel_size": 1,
        "kv_transfer_config": kv_config,
    },
    runtime_env={
        "env_vars": {
            "LMCACHE_LOCAL_CPU": "True",
            "LMCACHE_CHUNK_SIZE": "256",
            "UCX_TLS": "all",
        }
    }
)

pd_config = {
    "prefill_config": prefill_config,
    "decode_config": decode_config,
}

app = build_pd_openai_app(pd_config)
serve.run(app)
```
:::

:::{tab-item} YAML
```yaml
applications:
  - name: pd-multiconnector
    route_prefix: /
    import_path: ray.serve.llm:build_pd_openai_app
    runtime_env:
      env_vars:
        LMCACHE_LOCAL_CPU: "True"
        LMCACHE_CHUNK_SIZE: "256"
        UCX_TLS: "all"
    args:
      prefill_config:
        model_loading_config:
          model_id: qwen-0.5b
          model_source: Qwen/Qwen2-0.5B-Instruct
        engine_kwargs:
          tensor_parallel_size: 1
          kv_transfer_config:
            kv_connector: MultiConnector
            kv_role: kv_both
            kv_connector_extra_config:
              connectors:
                - kv_connector: LMCacheConnectorV1
                  kv_role: kv_both
                - kv_connector: NixlConnector
                  kv_role: kv_both
                  backends: ["UCX"]
      decode_config:
        model_loading_config:
          model_id: qwen-0.5b
          model_source: Qwen/Qwen2-0.5B-Instruct
        engine_kwargs:
          tensor_parallel_size: 1
          kv_transfer_config:
            kv_connector: MultiConnector
            kv_role: kv_both
            kv_connector_extra_config:
              connectors:
                - kv_connector: LMCacheConnectorV1
                  kv_role: kv_both
                - kv_connector: NixlConnector
                  kv_role: kv_both
                  backends: ["UCX"]
```

Deploy with:

```bash
serve run config.yaml
```
:::
::::

## Configuration parameters

### LMCache environment variables

- `LMCACHE_LOCAL_CPU`: Set to `"True"` to enable local CPU offloading
- `LMCACHE_CHUNK_SIZE`: Size of KV cache chunks in tokens (default: 256)
- `LMCACHE_MAX_LOCAL_CPU_SIZE`: Maximum CPU storage size in GB
- `LMCACHE_PD_BUFFER_DEVICE`: Buffer device for prefill/decode scenarios (default: "cpu")

For the full list of LMCache configuration options, see the [LMCache configuration reference](https://docs.lmcache.ai/api_reference/configurations.html).
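
For local CPU offloading, the main knob to size is `LMCACHE_MAX_LOCAL_CPU_SIZE`. The following is a rough sizing sketch, not an official guideline: it budgets about half of host RAM for the LMCache CPU pool (Linux-only, since it reads `os.sysconf`) and builds the `env_vars` dict you would pass through `runtime_env` in `LLMConfig`:

```python
import os

# Heuristic assumption: reserve roughly 50% of host RAM for the LMCache CPU pool.
total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
max_local_cpu_gb = max(1, int(total_bytes / (1024 ** 3) * 0.5))

lmcache_env_vars = {
    "LMCACHE_LOCAL_CPU": "True",
    "LMCACHE_CHUNK_SIZE": "256",
    "LMCACHE_MAX_LOCAL_CPU_SIZE": str(max_local_cpu_gb),  # size in GB
}
print(lmcache_env_vars)  # pass as runtime_env={"env_vars": lmcache_env_vars}
```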

### MultiConnector configuration

- `kv_connector`: Set to `"MultiConnector"` to compose multiple backends
- `kv_connector_extra_config.connectors`: List of connector configurations to compose. Order matters: connectors earlier in the list take priority.
- Each connector in the list uses the same configuration format as standalone connectors

## Performance considerations

Extending KV cache beyond local GPU memory introduces overhead for managing and looking up caches across different memory hierarchies. This creates a tradeoff: you gain larger cache capacity but may experience increased latency. Consider these factors:

**Overhead in cache-miss scenarios**: When there are no cache hits, offloading adds modest overhead (~10-15%) compared to pure GPU caching, based on our internal experiments. This overhead comes from the additional hashing, data movement, and management operations.

**Benefits with cache hits**: When caches can be reused, offloading significantly reduces prefill computation. For example, in multi-turn conversations where users return after minutes of inactivity, LMCache retrieves the conversation history from CPU memory rather than recomputing it, which cuts time to first token for follow-up requests.
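
A quick way to observe this effect is to time the first streamed token for a repeated long prefix. A minimal sketch, assuming the basic deployment above is running at the default Serve address and the `openai` client is installed; the second call should show a lower time to first token when the shared prefix is served from cache:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# A long shared prefix stands in for accumulated conversation history.
history = [{"role": "system", "content": "You are a helpful assistant. " * 200}]

def time_to_first_token(messages):
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="qwen-0.5b", messages=messages, stream=True, max_tokens=32
    )
    for _ in stream:  # the first streamed chunk roughly marks the end of prefill
        return time.perf_counter() - start

cold = time_to_first_token(history + [{"role": "user", "content": "Hello!"}])
warm = time_to_first_token(history + [{"role": "user", "content": "Hello again!"}])
print(f"TTFT cold: {cold:.3f}s, warm: {warm:.3f}s")
```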

**Network transfer costs**: When combining MultiConnector with cross-instance transfer (such as NIXL), ensure that the benefits of disaggregation outweigh the network transfer costs.


## See also

- {doc}`Prefill/decode disaggregation <prefill-decode>` - Deploy LLMs with separated prefill and decode phases
- [LMCache documentation](https://docs.lmcache.ai/) - Comprehensive LMCache configuration and features