2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -68,6 +68,8 @@ nav:
- Adapter Rollout: guides/adapter-rollout.md
- InferencePool Rollout: guides/inferencepool-rollout.md
- Metrics: guides/metrics.md
- Configuration Guide:
- Prefix Cache Aware Plugin: guides/epp-configuration/prefix-aware.md
- Implementer's Guide: guides/implementers.md
- Performance:
- Benchmark: performance/benchmark/index.md
89 changes: 89 additions & 0 deletions site-src/guides/epp-configuration/prefix-aware.md
@@ -0,0 +1,89 @@
# Prefix Cache Aware Plugin Configuration

Contributor: is the link not pointing to main intentionally?

Contributor Author: this is a permanent link. Main can change.

The [prefix cache plugin](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/7617439188b410670ed0f1ff805a3b7f9918a75b/pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go#L63)
takes advantage of the prefix caching feature of model servers (e.g., [vllm APC](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html)),
and optimizes request scheduling by routing requests that share the longest
prefixes to the same server as much as possible, while balancing server load by considering kv-cache
utilization and queue depth.
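
As a rough illustration of that scoring idea, here is a minimal Go sketch. It is not the plugin's actual code: the function name, the weights, and the load formula are all assumptions made for illustration.

```go
package main

import "fmt"

// score ranks a candidate server: reward the number of prefix blocks the
// server is believed to have cached, and penalize load derived from kv-cache
// utilization and queue depth. The names, weights, and load formula here are
// illustrative assumptions, not the plugin's actual implementation.
func score(matchedBlocks int, kvCacheUtil float64, queueDepth int) float64 {
	const prefixWeight = 1.0 // reward per matched prefix block (assumed)
	const loadWeight = 2.0   // penalty multiplier for load (assumed)
	load := kvCacheUtil + float64(queueDepth)/10.0
	return prefixWeight*float64(matchedBlocks) - loadWeight*load
}

func main() {
	// Server A has a long cached prefix but is busy; server B is idle with no match.
	fmt.Printf("server A: %.1f\n", score(8, 0.9, 5))
	fmt.Printf("server B: %.1f\n", score(0, 0.1, 0))
}
```

With these assumed weights, a long matched prefix dominates unless the server's queue grows deep, which mirrors the balancing behavior described above.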

## Enable the prefix cache plugin

Currently the prefix cache aware plugin is implemented in the V2 scheduler as an experimental feature.
To enable it, set the following environment variables when starting the EndpointPicker (EPP):

```
EXPERIMENTAL_USE_SCHEDULER_V2: true
ENABLE_PREFIX_CACHE_SCHEDULING: true
```

See the [Use Helm section](#helm) to install an inferencepool with these environment variables set.


## Customize the prefix cache plugin

The prefix cache plugin exposes the following advanced configuration options via environment variables:

* `PREFIX_CACHE_HASH_BLOCK_SIZE`: The plugin matches prefixes in units of blocks. This is the size
of each block in bytes. The vLLM default block size is 16 tokens. Assuming 4 characters per token, the EPP
default is set to 64. The default is recommended unless performance is critical for use cases with
extremely long inputs.

* `PREFIX_CACHE_MAX_PREFIX_BLOCKS`: The maximum number of blocks to match when looking for a shared prefix. The default is
128 (or 128*64 = 8,192 characters, roughly 2,048 tokens). This setting trades prefix match accuracy
off against performance.

* `PREFIX_CACHE_LRU_CAPACITY`: The maximum capacity of the prefix LRU indexer, in number of block hashes. The analysis
below shows how to estimate this value.

The prefix cache plugin estimates the prefix cache indexes in the model servers' HBM. In the perfect
scenario, EPP has exactly the same prefix cache entries per model server as that server's HBM cache entries. If
the EPP cache is smaller than the HBM cache, a positive EPP cache match is more accurate, but there are more
false cache misses. If the EPP cache is larger than the HBM cache, then there are more false cache hits.
Therefore **the EPP prefix cache indexer size should be as close as possible to the HBM cache size.**

NOTE: EPP builds its prefix cache based on characters, while the model server maintains prefix cache entries
in tokens, so a character <-> token conversion is needed.
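
To make the block-based, character-level matching concrete, here is a small Go sketch of chunking and hashing a prompt. The chained-hash scheme is an assumption modeled on vLLM's automatic prefix caching; EPP's actual hashing may differ.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// hashBlocks splits a prompt into fixed-size character blocks and hashes each
// block together with the previous block's hash, so one block hash identifies
// the whole prefix up to that point. The chaining scheme is an assumption
// modeled on vLLM's automatic prefix caching; EPP's actual hashing may differ.
func hashBlocks(prompt string, blockSize, maxBlocks int) [][32]byte {
	var hashes [][32]byte
	var prev [32]byte
	for i := 0; i+blockSize <= len(prompt) && len(hashes) < maxBlocks; i += blockSize {
		h := sha256.Sum256(append(prev[:], prompt[i:i+blockSize]...))
		hashes = append(hashes, h)
		prev = h
	}
	return hashes
}

func main() {
	prompt := "You are a helpful assistant. Please answer the following question as concisely as you can: what is prefix caching?"
	// Defaults from this guide: 64-byte blocks, at most 128 blocks per request.
	hashes := hashBlocks(prompt, 64, 128)
	fmt.Printf("%d chars -> %d complete block hash(es)\n", len(prompt), len(hashes))
}
```

In this sketch only complete blocks are hashed, so a trailing partial block does not contribute to matching, and `PREFIX_CACHE_MAX_PREFIX_BLOCKS` caps how much of a long prompt is indexed.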

Below are the formulas to estimate the EPP prefix indexer size:

```
max_kv_tokens_per_server = (HBM_size - model_size) / kv_size_per_token
lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token) / prefix_indexer_hash_block_size
lru_indexer_capacity_total = max_num_servers * lru_indexer_capacity_per_server
```

Let's take an example:

* Model: llama3 8B
* Accelerator: Nvidia H100 80GB
* Num replicas: 3
* Estimated # characters per token: 4 ([source](https://genai.stackexchange.com/questions/34/how-long-is-a-token))

```
max_kv_tokens_per_server = (80GB - 16GB) / 128KB = 500,000
# assume avg_chars_per_token = 4, prefix_indexer_hash_block_size = 64 (default)
# each entry is about 358 bytes, so the memory footprint is about 11 MB per server
lru_indexer_capacity_per_server = 500,000 * 4 / 64 = 31,250
lru_indexer_capacity_total = 3 * 31,250 = 93,750
```
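
If you prefer to script this estimate, the following Go sketch simply encodes the formulas above. The constants mirror the example's assumptions (decimal GB/KB, ~16GB of model weights, ~128KB of KV cache per token); they are estimates, not measured values.

```go
package main

import "fmt"

// lruCapacity encodes the sizing formulas from this guide. Every input is a
// rough operator-supplied estimate, not a measured value.
func lruCapacity(hbmBytes, modelBytes, kvBytesPerToken int64, charsPerToken, blockSize, replicas int) (perServer, total int64) {
	maxKVTokensPerServer := (hbmBytes - modelBytes) / kvBytesPerToken
	perServer = maxKVTokensPerServer * int64(charsPerToken) / int64(blockSize)
	total = perServer * int64(replicas)
	return perServer, total
}

func main() {
	// Decimal units, matching the arithmetic in the example above.
	const GB = int64(1_000_000_000)
	const KB = int64(1_000)
	// llama3 8B (~16GB weights, ~128KB KV cache per token) on H100 80GB, 3 replicas.
	perServer, total := lruCapacity(80*GB, 16*GB, 128*KB, 4, 64, 3)
	fmt.Println(perServer, total) // prints: 31250 93750
}
```

The result for the per-server capacity feeds directly into `PREFIX_CACHE_LRU_CAPACITY` in the Helm command below.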

See the [Use Helm section](#helm) to install an inferencepool with these environment variables set.


<a id="helm"></a>
## Use Helm

Use the following reference command to install an inferencepool with the prefix
cache plugin environment variables configured:

```txt
$ helm install triton-llama3-8b-instruct \
--set inferencePool.modelServers.matchLabels.app=triton-llama3-8b-instruct \
--set inferencePool.modelServerType=triton-tensorrt-llm \
--set provider.name=[none|gke] \
--set inferenceExtension.env.EXPERIMENTAL_USE_SCHEDULER_V2=true \
--set inferenceExtension.env.ENABLE_PREFIX_CACHE_SCHEDULING=true \
--set inferenceExtension.env.PREFIX_CACHE_LRU_CAPACITY=93750 \
--set inferenceExtension.env.PREFIX_CACHE_MAX_PREFIX_BLOCKS=1024 \
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
```