🧠 llmisvc: set kv-cache metric to vllm:kv_cache_usage_perc #1020
adam-d-young wants to merge 3 commits into opendatahub-io:master
Conversation
Refs: RHOAIENG-41868 Explicitly set --kv-cache-usage-percentage-metric so EPP does not rely on the legacy default (vllm:gpu_cache_usage_perc) from released GIE versions. Signed-off-by: Adam Young <adam.young@redhat.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: adam-d-young.

Hi @adam-d-young. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test.
Walkthrough: Two LLM scheduler configuration files were updated to add two new command-line arguments.
Please consider backporting to the ODH/RHOAI 3.0 and 3.2 streams, since those ship the EPP scheduler defaults that still reference `vllm:gpu_cache_usage_perc`.
/ok-to-test
```yaml
- --modelServerMetricsHttpsInsecureSkipVerify
- --certPath
- "/etc/ssl/certs"
- --kv-cache-usage-percentage-metric
```
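The diff above is truncated at the new flag; per the PR description, the args list pairs it with an explicit value. A sketch of the relevant fragment (not the full config file):

```yaml
args:
  - --modelServerMetricsHttpsInsecureSkipVerify
  - --certPath
  - "/etc/ssl/certs"
  - --kv-cache-usage-percentage-metric
  - "vllm:kv_cache_usage_perc"
```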
Hi, can this change break backwards compatibility? If not, I think we can go ahead with it.
A good question. I've found two potential issues with backward compatibility:
- Flag style: kebab-case flags look like they were introduced in GIE v1.0.0, but this repo appears to vendor GIE v0.5.0, which uses camelCase. I've pushed a new commit to switch to camelCase.
- vLLM compatibility: From what I can tell, `vllm:kv_cache_usage_perc` was introduced in vLLM 0.9.x (vLLM PR #18354), which predates LLMInferenceService. If that's correct, there shouldn't be any break to backward compatibility.
I'd appreciate confirmation on both points.
Note: I'll be opening a corresponding PR against red-hat-data-services/kserve with kebab-case (`--kv-cache-usage-percentage-metric`) for RHOAI 3.x, which uses GIE v1.0.0. It doesn't look like RHOAI, which is what I was intending to fix, is actually built from this repo, but please correct me if I'm wrong.
To fix RHOAI we need to cherry-pick it to the release-v0.15 branch, which will land in the next RHOAI version, 3.3 at this point.
- I am not sure about it, @bartoszmajsak can you please confirm it? I think that KServe upstream is updating the GIE to 1.0, but not sure how we will do it for ODH and RHOAI.
- ok, it should be fine, the vllm version we are using is v0.11.x.
@spolti Ad 1. It's coming with kserve#4886, it's similar to what @KillianGolds did with #996 (some of the feedback in upstream we shared is based on this amazing work). I think we will need to somehow consolidate them and push some further improvements upstream if we find gaps (one I can think of is zero downtime update)
The upstream scheduler (llm-d-inference-scheduler) vendors GIE v0.5.0, which uses camelCase flags (--kvCacheUsagePercentageMetric), not kebab-case (--kv-cache-usage-percentage-metric). GIE changed to kebab-case in v1.0.0, but opendatahub-io/kserve targets the upstream scheduler which is still on v0.5.0.
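Spelled out, the two flag styles discussed above would look like this in a scheduler args list (an illustrative sketch; only the flag names and metric value are taken from this thread):

```yaml
# GIE v0.5.0, as vendored by llm-d-inference-scheduler: camelCase flag
- --kvCacheUsagePercentageMetric
- "vllm:kv_cache_usage_perc"

# GIE v1.0.0 and later, targeted by the RHOAI 3.x follow-up: kebab-case flag
- --kv-cache-usage-percentage-metric
- "vllm:kv_cache_usage_perc"
```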
PR needs rebase.
@adam-d-young: The following tests failed.

Full PR test history. Your PR dashboard.
Refs: RHOAIENG-41868
What this PR does / why we need it:
Explicitly sets `--kv-cache-usage-percentage-metric` to `vllm:kv_cache_usage_perc` in the default LLM scheduler config so EPP does not rely on the legacy default `vllm:gpu_cache_usage_perc` from released GIE versions.

Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):

RHOAIENG-41868
Type of changes
Please delete options that are not relevant.
Feature/Issue validation/testing:
Validated by running the scheduler with the flag override and confirming `Flags processed` shows `kv-cache-usage-percentage-metric=vllm:kv_cache_usage_perc`. Logs attached to RHOAIENG-41868.
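That check can be reproduced from the scheduler's startup logs; a sketch of the idea, where the sample log line is an assumption modeled on the description above (the real line comes from the EPP pod's output, e.g. via `kubectl logs` with deployment and namespace names that vary per install):

```shell
# Simulated EPP startup line confirming which KV-cache metric flag took effect.
log_line='Flags processed: kv-cache-usage-percentage-metric=vllm:kv_cache_usage_perc'

# Extract just the flag assignment to verify the override.
echo "$log_line" | grep -o 'kv-cache-usage-percentage-metric=[^ ]*'
# → kv-cache-usage-percentage-metric=vllm:kv_cache_usage_perc
```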
Special notes for your reviewer:
This PR does not change any image versions. It only sets an explicit metric flag in the default config (manifests + Helm) as a workaround until the scheduler/GIE released default is updated.
The legacy default comes from the EPP/GIE scheduler flag `--kv-cache-usage-percentage-metric` (GIE v1.2.1 defaults to `vllm:gpu_cache_usage_perc`, while vLLM exposes `vllm:kv_cache_usage_perc`). Setting the flag explicitly in the shipped LLMInferenceServiceConfig is the safest short-term fix until RHOAI picks up a scheduler image/GIE release where the default is corrected.

Upstream reference: gateway-api-inference-extension/pkg/epp/server/runserver.go in tag `v1.2.1` sets the legacy default; gateway-api-inference-extension/pkg/epp/server/options.go on `main` uses `vllm:kv_cache_usage_perc`.

Release note:
NONE