feat: Configure LRUCacheSize using the numGPUBlocks for approximate prefix cache #1748
/assign @liu-cong

/lgtm

cc: @kfswain @ahg-g @nirrozenbaum can you take a look for the approval?

Will this work with sglang? And more generally, will this introduce a regression if the model server doesn't emit this metric?
If the metric is not available, the default configuration will continue to be used (current behavior), so no regression.

/approve

[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ahg-g, zetxqx.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This change introduces the ability to configure the `numGPUBlocks` for the approximate prefix cache.
Key changes:
- The `indexer` now considers `numOfGPUBlocks` from the server's metrics when creating a new LRU cache. This allows the cache size to be dynamically adjusted based on the server's capacity.
- If `numOfGPUBlocks` is not available, a default LRU size is used.
- The `Add` method in the `indexer` now accepts a `Server` struct, which includes both the `ServerID` and `numOfGPUBlocks`.
Which issue(s) this PR fixes:
Fixes partially #1304
Fixes #1512
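For reviewers who want a concrete picture of the sizing behavior described under the key changes, here is a minimal sketch in Go. It is not the code from this PR: the field types, the `defaultLRUSize` value, the `blockLRU` type, and the helper names are all assumptions made for illustration. Only the names `indexer`, `Add`, `Server`, `ServerID`, and `numOfGPUBlocks` come from the description.

```go
package main

import "container/list"

// Server mirrors the struct named in the description. The field types are assumed.
type Server struct {
	ServerID       string
	NumOfGPUBlocks int // 0 is treated here as "metric not reported" (assumption)
}

// defaultLRUSize is a hypothetical fallback, not the project's real default.
const defaultLRUSize = 1024

// lruCapacity picks the reported GPU block count when it is positive,
// and falls back to the default otherwise.
func lruCapacity(s Server) int {
	if s.NumOfGPUBlocks > 0 {
		return s.NumOfGPUBlocks
	}
	return defaultLRUSize
}

// blockLRU is a tiny LRU set of block hashes (illustrative only).
type blockLRU struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[uint64]*list.Element
}

func newBlockLRU(capacity int) *blockLRU {
	return &blockLRU{capacity: capacity, order: list.New(), items: map[uint64]*list.Element{}}
}

// add records a hash and evicts the least recently used one when over capacity.
func (c *blockLRU) add(hash uint64) {
	if el, ok := c.items[hash]; ok {
		c.order.MoveToFront(el)
		return
	}
	c.items[hash] = c.order.PushFront(hash)
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(uint64))
	}
}

// indexer keeps one LRU per server, sized when the server is first seen.
type indexer struct {
	perServer map[string]*blockLRU
}

func newIndexer() *indexer {
	return &indexer{perServer: map[string]*blockLRU{}}
}

// Add stores the hashes for the given server, creating its LRU on first use.
func (i *indexer) Add(hashes []uint64, s Server) {
	cache, ok := i.perServer[s.ServerID]
	if !ok {
		cache = newBlockLRU(lruCapacity(s))
		i.perServer[s.ServerID] = cache
	}
	for _, h := range hashes {
		cache.add(h)
	}
}
```

In this sketch, calling `Add` for a server that reports 2 GPU blocks yields an LRU that holds at most 2 hashes, while a server with no reported value gets `defaultLRUSize`. The capacity is fixed at creation time here; whether the real indexer resizes an existing cache when the metric changes is not stated in the description.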
Does this PR introduce a user-facing change?: