
🧠 llmisvc: set kv-cache metric to vllm:kv_cache_usage_perc #1020

Open

adam-d-young wants to merge 3 commits into opendatahub-io:master from adam-d-young:fix/kv-cache-metric-flag

Conversation


@adam-d-young adam-d-young commented Dec 15, 2025

Refs: RHOAIENG-41868

What this PR does / why we need it:
Explicitly sets --kv-cache-usage-percentage-metric to vllm:kv_cache_usage_perc in the default LLM scheduler config so EPP does not rely on the legacy default vllm:gpu_cache_usage_perc from released GIE versions.
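As a sketch, the change amounts to two extra lines in the scheduler container's args in the shipped LLMInferenceServiceConfig. The surrounding fields and the container name below are assumptions for illustration; note that the final revision of this PR switches to the camelCase spelling --kvCacheUsagePercentageMetric to match the vendored GIE v0.5.0 (see the review discussion).

```yaml
# Hypothetical excerpt of config-llm-scheduler.yaml; only the last two
# args lines are what this PR adds, the rest is illustrative context.
containers:
  - name: llm-scheduler
    args:
      # Pin the metric explicitly so EPP does not fall back to the legacy
      # GIE default (vllm:gpu_cache_usage_perc):
      - --kv-cache-usage-percentage-metric
      - vllm:kv_cache_usage_perc
```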

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
RHOAIENG-41868

Type of changes

  • Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing:

Validated by running the scheduler with the flag override and confirming Flags processed shows kv-cache-usage-percentage-metric=vllm:kv_cache_usage_perc.

Logs attached to RHOAIENG-41868

Special notes for your reviewer:
This PR does not change any image versions. It only sets an explicit metric flag in the default config (manifests + Helm) as a workaround until the scheduler/GIE released default is updated.
The legacy default comes from the EPP/GIE scheduler flag --kv-cache-usage-percentage-metric (GIE v1.2.1 defaults to vllm:gpu_cache_usage_perc, while vLLM exposes vllm:kv_cache_usage_perc). Setting the flag explicitly in the shipped LLMInferenceServiceConfig is the safest short-term fix until RHOAI picks up a scheduler image/GIE release where the default is corrected.
Upstream reference: gateway-api-inference-extension/pkg/epp/server/runserver.go in tag v1.2.1 sets the legacy default; gateway-api-inference-extension/pkg/epp/server/options.go on main uses vllm:kv_cache_usage_perc.

Release note:
NONE

Summary by CodeRabbit

  • New Features
    • Scheduler now emits KV cache usage percentage for improved resource monitoring.
    • Added a configurable option to report that metric (labelled vllm:kv_cache_usage_perc) so operators can collect and visualize KV cache utilization alongside existing metrics.


Refs: RHOAIENG-41868

Explicitly set --kv-cache-usage-percentage-metric so EPP does not rely on the legacy default (vllm:gpu_cache_usage_perc) from released GIE versions.

Signed-off-by: Adam Young <adam.young@redhat.com>

openshift-ci bot commented Dec 15, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adam-d-young
Once this PR has been reviewed and has the lgtm label, please assign spolti for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Dec 15, 2025

Hi @adam-d-young. Thanks for your PR.

I'm waiting for an org member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


coderabbitai bot commented Dec 15, 2025

Walkthrough

Two LLM scheduler configuration files were updated to append one new command-line flag and its value to the scheduler container args: --kvCacheUsagePercentageMetric followed by the metric name vllm:kv_cache_usage_perc. The addition appears in both the Helm template and the config file; no control flow or other settings were changed.

Changes

LLM Scheduler KV Cache Metrics
Files: charts/llmisvc-resources/templates/config-llm-scheduler.yaml, config/llmisvcconfig/config-llm-scheduler.yaml
Summary: Appended the flag --kvCacheUsagePercentageMetric and its value vllm:kv_cache_usage_perc to the scheduler container args, enabling KV cache usage percentage metric exposure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Small number of files but configuration affects runtime metrics surface; verify arg naming/casing and consistency between Helm template and config file.
  • Confirm metric label format aligns with metrics consumer and no duplication with existing metrics.
  • Ensure YAML quoting/escaping and line ordering preserve container args ordering in templating context.

Poem

🐰
A tiny hop, a metric made bright,
KV usage beams into the night,
vllm hums a soft data song,
Counters dancing all day long,
Hooray — the scheduler's metrics take flight!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly and specifically describes the main change: setting the kv-cache metric to a specific vLLM metric name in the LLM scheduler configuration.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; check skipped.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae49c3c and 19fdd93.

📒 Files selected for processing (2)
  • charts/llmisvc-resources/templates/config-llm-scheduler.yaml (1 hunks)
  • config/llmisvcconfig/config-llm-scheduler.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • charts/llmisvc-resources/templates/config-llm-scheduler.yaml
  • config/llmisvcconfig/config-llm-scheduler.yaml


@adam-d-young (Author)

Please consider backporting to the ODH/RHOAI 3.0 and 3.2 streams, since those ship the EPP scheduler defaults that still reference vllm:gpu_cache_usage_perc. This does not apply to RHOAI 2.25 because the llmisvc/EPP scheduler path is not shipped/used there.

@bartoszmajsak

/ok-to-test

```yaml
- --modelServerMetricsHttpsInsecureSkipVerify
- --certPath
- "/etc/ssl/certs"
- --kv-cache-usage-percentage-metric
```
Member:

Hi, can this change break backward compatibility? If not, I think we can go ahead with it.

Author:

A good question. I've found two potential issues with backward compatibility:

  1. Flag style: kebab-case flags look like they were introduced in GIE v1.0.0, but this repo appears to vendor GIE v0.5.0, which uses camelCase. I've pushed a new commit to switch to camelCase.
  2. vLLM Compatibility: From what I can tell, vllm:kv_cache_usage_perc was introduced in vLLM 0.9.x (vLLM PR #18354) which predates LLMInferenceService. If that's correct, there shouldn't be any break to backward compatibility.

I'd appreciate confirmation on both points.

Note: I'll be opening a corresponding PR against red-hat-data-services/kserve with kebab-case (--kv-cache-usage-percentage-metric) for RHOAI 3.x, which uses GIE v1.0.0. It doesn't look like RHOAI, which is what I was intending to fix, is actually built from this repo, but please correct me if I'm wrong.
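The vLLM-compatibility question in point 2 boils down to which metric name a given vLLM build actually exposes on /metrics. A small sketch of that check follows; the function and sample payload are illustrative, not taken from EPP.

```go
package main

import (
	"fmt"
	"strings"
)

// kvCacheMetricName reports which of the two KV-cache usage metric names
// appears in a Prometheus-format metrics payload, preferring the current
// vLLM name. Illustrative only; EPP's real scraping logic differs.
func kvCacheMetricName(metrics string) string {
	for _, name := range []string{"vllm:kv_cache_usage_perc", "vllm:gpu_cache_usage_perc"} {
		for _, line := range strings.Split(metrics, "\n") {
			if strings.HasPrefix(line, name) {
				return name
			}
		}
	}
	return "" // neither name present: EPP would see no KV-cache signal
}

func main() {
	// Hypothetical excerpt of a vLLM >= 0.9.x /metrics response.
	sample := `vllm:kv_cache_usage_perc{model_name="demo"} 0.42`
	fmt.Println(kvCacheMetricName(sample))
}
```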

Member:

To fix RHOAI we need to cherry-pick it to the release-v0.15 branch, which will land in the next RHOAI version, 3.3 at this point.

  1. I am not sure about it, @bartoszmajsak can you please confirm it? I think that KServe upstream is updating the GIE to 1.0, but not sure how we will do it for ODH and RHOAI.
  2. ok, it should be fine, the vllm version we are using is v0.11.x.


@spolti Ad 1. It's coming with kserve#4886, it's similar to what @KillianGolds did with #996 (some of the feedback in upstream we shared is based on this amazing work). I think we will need to somehow consolidate them and push some further improvements upstream if we find gaps (one I can think of is zero downtime update)

Adam Young and others added 2 commits December 19, 2025 16:11
The upstream scheduler (llm-d-inference-scheduler) vendors GIE v0.5.0,
which uses camelCase flags (--kvCacheUsagePercentageMetric), not
kebab-case (--kv-cache-usage-percentage-metric).

GIE changed to kebab-case in v1.0.0, but opendatahub-io/kserve targets
the upstream scheduler which is still on v0.5.0.
@openshift-merge-robot

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


openshift-ci bot commented Mar 23, 2026

@adam-d-young: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/e2e-llm-inference-service (commit 98624a7, required): rerun with /test e2e-llm-inference-service
  • ci/prow/e2e-raw (commit 98624a7, required): rerun with /test e2e-raw
  • ci/prow/e2e-predictor (commit 98624a7, required): rerun with /test e2e-predictor
  • ci/prow/e2e-graph (commit 98624a7, required): rerun with /test e2e-graph


Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Projects

Status: New/Backlog


4 participants