Add prefix cache plugin configuration guide #923
Merged

# Prefix Cache Aware Plugin Configuration

The [prefix cache plugin](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/7617439188b410670ed0f1ff805a3b7f9918a75b/pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go#L63)
takes advantage of the prefix caching feature of model servers (e.g., [vLLM automatic prefix caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html))
and optimizes request scheduling by placing requests that share the longest
prefixes on the same server as much as possible, while balancing server load by considering kv-cache
utilization and queue depth.

## Enable the prefix cache plugin

Currently, the prefix cache aware plugin is implemented in the V2 scheduler as an experimental feature.
To enable it, set the following environment variables when starting the EndpointPicker (EPP):

```
EXPERIMENTAL_USE_SCHEDULER_V2: true
ENABLE_PREFIX_CACHE_SCHEDULING: true
```

See the [Use Helm section](#helm) to install an InferencePool with these environment variables set.

## Customize the prefix cache plugin

The prefix cache plugin exposes the following advanced configuration options via environment variables:

* `PREFIX_CACHE_HASH_BLOCK_SIZE`: The plugin matches prefixes in units of blocks. This is the size
  of each block, in characters. The vLLM default block size is 16 tokens; assuming 4 characters per
  token, the EPP default is set to 64. The default is recommended unless performance is critical for
  use cases with extremely long inputs.

* `PREFIX_CACHE_MAX_PREFIX_BLOCKS`: The maximum number of blocks to consider when matching a prefix.
  The default is 128 (i.e., 128 * 64 = 8192 characters, or roughly 2048 tokens). This is useful for
  trading off prefix match accuracy against performance.

* `PREFIX_CACHE_LRU_CAPACITY`: The maximum capacity of the prefix LRU indexer, in number of block
  hashes. The analysis below shows how to estimate this.

The prefix cache plugin keeps an index that estimates the prefix cache entries in each model server's
HBM. In the perfect scenario, EPP has exactly the same prefix cache entries per model server as that
server's HBM cache entries. If the EPP cache is smaller than the HBM cache, a positive EPP cache match
is more accurate, but there are more false cache misses. If the EPP cache is larger than the HBM cache,
then there are more false cache hits. Therefore, **the EPP prefix cache indexer size should be as close
as possible to the HBM cache size.**

NOTE: EPP builds its prefix cache based on characters, while the model server maintains prefix cache
entries in tokens, so a character <-> token conversion is needed.
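
To make the block-based matching of `PREFIX_CACHE_HASH_BLOCK_SIZE` and `PREFIX_CACHE_MAX_PREFIX_BLOCKS`
concrete, here is a minimal Python sketch of chunking a prompt into fixed-size character blocks and
hashing each prefix; the chained-hash scheme and the function name are illustrative assumptions, not
the plugin's actual implementation:

```python
import hashlib

HASH_BLOCK_SIZE = 64      # PREFIX_CACHE_HASH_BLOCK_SIZE default, in characters
MAX_PREFIX_BLOCKS = 128   # PREFIX_CACHE_MAX_PREFIX_BLOCKS default

def prefix_block_hashes(prompt: str) -> list[bytes]:
    """Hash each full 64-character block chained with its predecessor,
    so hash i identifies the entire prefix up to block i (a sketch)."""
    hashes: list[bytes] = []
    prev = b""
    for start in range(0, len(prompt), HASH_BLOCK_SIZE):
        block = prompt[start:start + HASH_BLOCK_SIZE]
        if len(block) < HASH_BLOCK_SIZE:
            break  # ignore the partial trailing block
        prev = hashlib.sha256(prev + block.encode("utf-8")).digest()
        hashes.append(prev)
        if len(hashes) == MAX_PREFIX_BLOCKS:
            break  # cap the match depth to bound scheduling cost
    return hashes

# Two prompts sharing a 128-character prefix yield the same first two hashes,
# so the scheduler can steer the second request to the server that saw the first.
```
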
Below are the formulas to estimate the EPP prefix indexer size:

```
max_kv_tokens_per_server = (HBM_size - model_size) / kv_size_per_token
lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token) / prefix_indexer_hash_block_size
lru_indexer_capacity_total = max_num_servers * lru_indexer_capacity_per_server
```

Let's take an example:

* Model: llama3 8B
* Accelerator: Nvidia H100 80GB
* Num replicas: 3
* Estimated # characters per token: 4 ([source](https://genai.stackexchange.com/questions/34/how-long-is-a-token))

```
max_kv_tokens_per_server = (80GB - 16GB) / 128KB = 500,000
# assume avg_chars_per_token = 4, prefix_indexer_hash_block_size = 64 (default)
# each entry is about 358 B, so the memory footprint is about 11 MB per server
lru_indexer_capacity_per_server = 500,000 * 4 / 64 = 31250
lru_indexer_capacity_total = 3 * 31250 = 93750
```
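
The same arithmetic as a small Python helper that reproduces the numbers above (a sketch; the function
and its parameters are illustrative and not part of EPP):

```python
def estimate_lru_capacity(hbm_gb: float, model_gb: float, kv_kb_per_token: float,
                          num_servers: int, avg_chars_per_token: int = 4,
                          hash_block_size: int = 64) -> int:
    """Estimate PREFIX_CACHE_LRU_CAPACITY across all model server replicas."""
    max_kv_tokens_per_server = (hbm_gb - model_gb) * 1e9 / (kv_kb_per_token * 1e3)
    capacity_per_server = max_kv_tokens_per_server * avg_chars_per_token / hash_block_size
    return int(num_servers * capacity_per_server)

# llama3 8B on H100 80GB, 3 replicas -> prints 93750
print(estimate_lru_capacity(hbm_gb=80, model_gb=16, kv_kb_per_token=128, num_servers=3))
```
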
See the [Use Helm section](#helm) to install an InferencePool with these environment variables set.

<a id="helm"></a>
## Use Helm

Use the following reference command to install an InferencePool with the prefix cache plugin
environment variables configured:

```txt
$ helm install triton-llama3-8b-instruct \
  --set inferencePool.modelServers.matchLabels.app=triton-llama3-8b-instruct \
  --set inferencePool.modelServerType=triton-tensorrt-llm \
  --set provider.name=[none|gke] \
  --set inferenceExtension.env.EXPERIMENTAL_USE_SCHEDULER_V2=true \
  --set inferenceExtension.env.ENABLE_PREFIX_CACHE_SCHEDULING=true \
  --set inferenceExtension.env.PREFIX_CACHE_LRU_CAPACITY=93750 \
  --set inferenceExtension.env.PREFIX_CACHE_MAX_PREFIX_BLOCKS=1024 \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
```

---

**Review comment:** is the link not pointing to main intentionally?

**Reply:** this is a permanent link. Main can change.