Commit ac943b3

[doc][serve][llm] Add user guide for kv-cache offloading (#58025)
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
1 parent 4f2db49 commit ac943b3

File tree

2 files changed: +270 -0 lines changed


doc/source/serve/llm/user-guides/index.md

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ How-to guides for deploying and configuring Ray Serve LLM features.
 Model loading <model-loading>
 Prefill/decode disaggregation <prefill-decode>
+KV cache offloading <kv-cache-offloading>
 Prefix-aware routing <prefix-aware-routing>
 Multi-LoRA deployment <multi-lora>
 vLLM compatibility <vllm-compatibility>
Lines changed: 269 additions & 0 deletions
@@ -0,0 +1,269 @@
(kv-cache-offloading-guide)=
# KV cache offloading

Extend KV cache capacity by offloading to CPU memory or local disk for larger batch sizes and reduced GPU memory pressure.

:::{note}
Ray Serve doesn't provide KV cache offloading out of the box, but integrates seamlessly with vLLM solutions. This guide demonstrates one such integration: LMCache.
:::

Benefits of KV cache offloading:

- **Increased capacity**: Store more KV caches by using CPU RAM or local storage instead of relying solely on GPU memory
- **Cache reuse across requests**: Save and reuse previously computed KV caches for repeated or similar prompts, reducing prefill computation
- **Flexible storage backends**: Choose from multiple storage options, including local CPU, disk, or distributed systems

Consider KV cache offloading when your application has repeated prompts or multi-turn conversations where you can reuse cached prefills. If consecutive conversation queries aren't sent immediately, the GPU evicts these caches to make room for other concurrent requests, causing cache misses. Offloading KV caches to CPU memory or other storage backends, which have much larger capacity, preserves them for longer periods.

## Deploy with LMCache

LMCache provides KV cache offloading with support for multiple storage backends.

### Prerequisites

Install LMCache:

```bash
uv pip install lmcache
```

### Basic deployment

The following example shows how to deploy with LMCache for local CPU offloading:

::::{tab-set}
:::{tab-item} Python
```python
from ray.serve.llm import LLMConfig, build_openai_app
import ray.serve as serve

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2-0.5B-Instruct",
    },
    engine_kwargs={
        "tensor_parallel_size": 1,
        "kv_transfer_config": {
            "kv_connector": "LMCacheConnectorV1",
            "kv_role": "kv_both",
        },
    },
    runtime_env={
        "env_vars": {
            "LMCACHE_LOCAL_CPU": "True",
            "LMCACHE_CHUNK_SIZE": "256",
            "LMCACHE_MAX_LOCAL_CPU_SIZE": "100",  # 100 GB
        }
    },
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```
:::

:::{tab-item} YAML
```yaml
applications:
- name: llm-with-lmcache
  route_prefix: /
  import_path: ray.serve.llm:build_openai_app
  runtime_env:
    env_vars:
      LMCACHE_LOCAL_CPU: "True"
      LMCACHE_CHUNK_SIZE: "256"
      LMCACHE_MAX_LOCAL_CPU_SIZE: "100"
  args:
    llm_configs:
    - model_loading_config:
        model_id: qwen-0.5b
        model_source: Qwen/Qwen2-0.5B-Instruct
      engine_kwargs:
        tensor_parallel_size: 1
        kv_transfer_config:
          kv_connector: LMCacheConnectorV1
          kv_role: kv_both
```

Deploy with:

```bash
serve run config.yaml
```
:::
::::

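Once the application is running, you can exercise the cache through the OpenAI-compatible endpoint. The following is a minimal sketch, assuming the default Serve HTTP address (`http://localhost:8000`) and the third-party `openai` Python client; requests that repeat a long shared prefix are the ones that benefit from offloading:

```python
from openai import OpenAI

# The API key isn't validated by the deployment but is required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# A long shared prefix is what gets reused from the KV cache across turns.
shared_context = "You are a helpful assistant. " + "Reference document text. " * 500

for question in ["Summarize the document.", "List three key points from it."]:
    response = client.chat.completions.create(
        model="qwen-0.5b",
        messages=[
            {"role": "system", "content": shared_context},
            {"role": "user", "content": question},
        ],
    )
    print(response.choices[0].message.content)
```
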
## Compose multiple KV transfer backends with MultiConnector

You can combine multiple KV transfer backends using `MultiConnector`. This is useful when you want both local offloading and cross-instance transfer in disaggregated deployments.

### When to use MultiConnector

Use `MultiConnector` to combine multiple backends when you're using prefill/decode disaggregation and want both cross-instance transfer (NIXL) and local offloading.

The following example shows how to combine NIXL (for cross-instance transfer) with LMCache (for local offloading) in a prefill/decode deployment:

:::{note}
The order of connectors matters. Since you want to prioritize local KV cache lookup through LMCache, it appears first in the list, before the NIXL connector.
:::

::::{tab-set}
:::{tab-item} Python
```python
from ray.serve.llm import LLMConfig, build_pd_openai_app
import ray.serve as serve

# Shared KV transfer config combining NIXL and LMCache
kv_config = {
    "kv_connector": "MultiConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "connectors": [
            {
                "kv_connector": "LMCacheConnectorV1",
                "kv_role": "kv_both",
            },
            {
                "kv_connector": "NixlConnector",
                "kv_role": "kv_both",
                "backends": ["UCX"],
            },
        ]
    },
}

prefill_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2-0.5B-Instruct",
    },
    engine_kwargs={
        "tensor_parallel_size": 1,
        "kv_transfer_config": kv_config,
    },
    runtime_env={
        "env_vars": {
            "LMCACHE_LOCAL_CPU": "True",
            "LMCACHE_CHUNK_SIZE": "256",
            "UCX_TLS": "all",
        }
    },
)

decode_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2-0.5B-Instruct",
    },
    engine_kwargs={
        "tensor_parallel_size": 1,
        "kv_transfer_config": kv_config,
    },
    runtime_env={
        "env_vars": {
            "LMCACHE_LOCAL_CPU": "True",
            "LMCACHE_CHUNK_SIZE": "256",
            "UCX_TLS": "all",
        }
    },
)

pd_config = {
    "prefill_config": prefill_config,
    "decode_config": decode_config,
}

app = build_pd_openai_app(pd_config)
serve.run(app)
```
:::

:::{tab-item} YAML
```yaml
applications:
- name: pd-multiconnector
  route_prefix: /
  import_path: ray.serve.llm:build_pd_openai_app
  runtime_env:
    env_vars:
      LMCACHE_LOCAL_CPU: "True"
      LMCACHE_CHUNK_SIZE: "256"
      UCX_TLS: "all"
  args:
    prefill_config:
      model_loading_config:
        model_id: qwen-0.5b
        model_source: Qwen/Qwen2-0.5B-Instruct
      engine_kwargs:
        tensor_parallel_size: 1
        kv_transfer_config:
          kv_connector: MultiConnector
          kv_role: kv_both
          kv_connector_extra_config:
            connectors:
            - kv_connector: LMCacheConnectorV1
              kv_role: kv_both
            - kv_connector: NixlConnector
              kv_role: kv_both
              backends: ["UCX"]
    decode_config:
      model_loading_config:
        model_id: qwen-0.5b
        model_source: Qwen/Qwen2-0.5B-Instruct
      engine_kwargs:
        tensor_parallel_size: 1
        kv_transfer_config:
          kv_connector: MultiConnector
          kv_role: kv_both
          kv_connector_extra_config:
            connectors:
            - kv_connector: LMCacheConnectorV1
              kv_role: kv_both
            - kv_connector: NixlConnector
              kv_role: kv_both
              backends: ["UCX"]
```

Deploy with:

```bash
serve run config.yaml
```
:::
::::

## Configuration parameters

### LMCache environment variables

- `LMCACHE_LOCAL_CPU`: Set to `"True"` to enable local CPU offloading
- `LMCACHE_CHUNK_SIZE`: Size of KV cache chunks, in tokens (default: 256)
- `LMCACHE_MAX_LOCAL_CPU_SIZE`: Maximum CPU storage size in GB
- `LMCACHE_PD_BUFFER_DEVICE`: Buffer device for prefill/decode scenarios (default: "cpu")

For the full list of LMCache configuration options, see the [LMCache configuration reference](https://docs.lmcache.ai/api_reference/configurations.html).

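Because these settings are plain environment variables, you can assemble them programmatically before passing them to `runtime_env`. The following is an illustrative sketch (not part of Ray Serve or LMCache) that sizes `LMCACHE_MAX_LOCAL_CPU_SIZE` from currently free host memory using the third-party `psutil` package; on a multi-node cluster, base the size on the node that actually runs the replicas:

```python
import psutil  # Third-party package, used here only to inspect host memory.


def lmcache_env_vars(ram_fraction: float = 0.5, chunk_size: int = 256) -> dict:
    """Illustrative helper: size the CPU cache to a fraction of free host RAM."""
    free_gb = psutil.virtual_memory().available / (1024 ** 3)
    return {
        "LMCACHE_LOCAL_CPU": "True",
        "LMCACHE_CHUNK_SIZE": str(chunk_size),
        "LMCACHE_MAX_LOCAL_CPU_SIZE": str(max(1, int(free_gb * ram_fraction))),
    }


# Example usage: runtime_env={"env_vars": lmcache_env_vars()} in your LLMConfig.
```
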
### MultiConnector configuration

- `kv_connector`: Set to `"MultiConnector"` to compose multiple backends
- `kv_connector_extra_config.connectors`: List of connector configurations to compose. Order matters: connectors earlier in the list take priority (see the compact reference after this list).
- Each connector in the list uses the same configuration format as standalone connectors

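Put together, these settings reduce to the same `kv_transfer_config` dict used in the deployment example above, repeated here only as a compact reference for the ordering rule:

```python
# Connector order expresses lookup priority: LMCache (local offload) is consulted
# before NIXL (cross-instance transfer).
kv_transfer_config = {
    "kv_connector": "MultiConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "connectors": [
            {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"},
            {"kv_connector": "NixlConnector", "kv_role": "kv_both", "backends": ["UCX"]},
        ]
    },
}
```
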
## Performance considerations

Extending the KV cache beyond local GPU memory introduces overhead for managing and looking up caches across different memory hierarchies. This creates a tradeoff: you gain larger cache capacity but may experience increased latency. Consider these factors:

**Overhead in cache-miss scenarios**: When there are no cache hits, offloading adds modest overhead (~10-15%) compared to pure GPU caching, based on our internal experiments. This overhead comes from the additional hashing, data movement, and management operations.

**Benefits with cache hits**: When caches can be reused, offloading avoids much of the prefill computation. For example, in multi-turn conversations where users return after minutes of inactivity, LMCache retrieves the conversation history from CPU rather than recomputing it, significantly reducing time to first token for follow-up requests.

**Network transfer costs**: When combining MultiConnector with cross-instance transfer (such as NIXL), ensure that the benefits of disaggregation outweigh the network transfer costs.

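One rough way to observe the cache-hit benefit described above is to time the same long-prefix request twice against a running deployment and compare the time to first token. This is a minimal sketch, assuming the basic deployment above, the default Serve address, and the third-party `openai` client; note that if the prefix still sits in the GPU prefix cache, the second measurement reflects a GPU hit rather than an offloaded one:

```python
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# A long shared prefix makes the prefill cost, and therefore cache reuse, visible.
long_context = "You are a support assistant. " + "Background detail. " * 2000


def time_to_first_token() -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="qwen-0.5b",
        messages=[
            {"role": "system", "content": long_context},
            {"role": "user", "content": "Summarize the background in one sentence."},
        ],
        stream=True,
        max_tokens=32,
    )
    for _ in stream:
        # The first streamed chunk arrives once prefill has finished.
        return time.perf_counter() - start
    return time.perf_counter() - start


print(f"first request TTFT:  {time_to_first_token():.2f}s")  # Prefill computed from scratch.
print(f"second request TTFT: {time_to_first_token():.2f}s")  # Prefix served from cache.
```
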
## See also

- {doc}`Prefill/decode disaggregation <prefill-decode>` - Deploy LLMs with separated prefill and decode phases
- [LMCache documentation](https://docs.lmcache.ai/) - Comprehensive LMCache configuration and features
