Pluggable metrics collection #1237

elevran · 2025-07-27T12:28:07Z

Implementation of pluggable metrics collection.

The PR can be reviewed in three parts:

mostly refactored and simplified metrics code from pkg/epp/backend/metrics now under pkg/epp/datalayer/metrics. Tests were moved along with the refactored code.
the goroutine is no longer part of metrics.PodMetrics; it was moved into datalayer/collector.go
changes to go.mod/sum

PR open to collect feedback. The following are yet to be done and shall be added (to this PR or a follow up)

additional unit tests (will be added on this PR)
logging in functions (see open question regarding context.Context in DataSource.Collect and extractor.Extract)
hooking pluggable metrics with rest of system (follow up PR)
- datastore (e.g., inferencepool target port changes, management of Collector for every endpoint, etc.)
- default registration of the metrics data source (e.g., in runner.go?)
- removal of pkg/epp/backend/metrics
global statistics (and their logging) for metrics data source (separate PR or this)
support for k8s based collectors and extractor (separate PR. Plan: extend DS interface GKV(), add new Registry for k8s based sources. Do not pass k8s sources to Collector and invoke their Collect(ep) directly from datastore)

netlify · 2025-07-27T12:28:54Z

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`809ed0c`
🔍 Latest deploy log	https://app.netlify.com/projects/gateway-api-inference-extension/deploys/689215fd98e0cf00088fcd78
😎 Deploy Preview	https://deploy-preview-1237--gateway-api-inference-extension.netlify.app

elevran · 2025-07-27T13:31:56Z

@kfswain @ahg-g @nirrozenbaum - this is a large PR.

If interested and it would expedite review, I can break it up into smaller pieces. My concern was that doing so may cause loss of the bigger picture/end-to-end flow.

Example break into smaller units (they'd still be 100's of lines each):

refactor backend/metrics Spec and Mapping in place (can be one or two PRs. This could be a little tricky since it moves some functionality from free functions to method receivers)
move Spec, Mapping and adding spec_test to datalayer/metrics, leaving type aliases behind (another PR).
create extractor, data source and related under datalayer/metrics
add Collector to datalayer

This would leave us where we are now but with 4 smaller PRs.
Let me know if that would be helpful and I will refactor the PR into smaller items.

elevran · 2025-07-27T13:37:23Z

side note:
any idea why fatcontext errors were not showing up in make lint but are in pull-gateway-api-inference-extension-verify-main?
Seems to be a tighter loop if both pass/fail the same.

elevran · 2025-07-28T09:56:06Z

side note:
any idea why fatcontext errors were not showing up in make lint but are in pull-gateway-api-inference-extension-verify-main?
Seems to be a tighter loop if both pass/fail the same.

So apparently I was running with the latest 1.x (1.64.8) whereas the GIE Makefile downloads an earlier release (1.62.2).
Any objections to upgrading?
I know the fatcontext linter was wrong in its assessment.

nirrozenbaum · 2025-07-28T10:41:35Z

So apparently I was running with the latest 1.x (1.64.8) whereas the GIE Makefile downloads an earlier release (1.62.2). Any objections to upgrading? I know the fatcontext linter was wrong in its assessment.

any downside from upgrading to latest (v2.3.0)?

elevran · 2025-07-28T11:15:13Z

any downside from upgrading to latest (v2.3.0)?

Not much.
v2 uses a slightly different configuration format (e.g., there's no timeout). You can convert using golangci-lint migrate.
The use of newer versions of linters could potentially uncover some issues which have not been reported on before, but hopefully not too many and those reported should be fixed anyway.
The only potential additional work might have to do with CICD tooling.

Suggest moving this discussion over to a new issue #1240.

elevran · 2025-07-29T14:37:24Z

/retest

elevran · 2025-07-30T11:56:50Z

/cc @nirrozenbaum @kfswain

nirrozenbaum

I completed first pass.
added some comments. nothing seems like a big issue.

pkg/epp/datalayer/collector.go

pkg/epp/datalayer/metrics/datasource.go

pkg/epp/datalayer/metrics/client.go

pkg/epp/datalayer/metrics/datasource.go

pkg/epp/datalayer/metrics/client.go

kfswain · 2025-07-31T00:50:17Z

pkg/epp/datalayer/metrics/client.go

+func newClient() *client {
+	return &client{
+		cl: &http.Client{
+			Timeout: 10 * time.Second,


Simple sounds good for now, but I bet future PRs we will need to think about configurability.

the client timeout would be governed by the context passed into Get. Since we scrape on a short cycle, I expect we'll set the timeouts very low, and I'm not sure we really need to expose them to users.
Other customizations (e.g., TLSConfig) would be added. I haven't thought about which configurations are specific to the data source (as opposed to system wide), but they would be supported via the config file, as we do for Plugins.

for the metrics client you're probably right about the short interval (not much sense in having large scrape interval).
is this true also for other collectors?
I think ideally we would like to generalize (while not over complicating) the scrape interval and timeout (which doesn't seem to be over complication).
as a starting point we could also define the timeout as a function of the scraping interval, e.g., timeout = 3 * scrape interval.

There are two classes of http.Client timeouts:

http.Client.Timeout can be used to set a hard timeout for the global client interactions (e.g., DNS resolution, TCP connection, etc.).

per "transaction" timeouts are handled by context.WithTimeout and http.NewRequestWithContext

The initial protection was to ensure we don't have long timeouts in making connections to a non-existent server endpoint (i.e., wrong port configured, target Pod has terminated, etc.) - so main concern was the global client interaction.

Regardless, I will add "client configuration" as parameter (e.g., control over Transport.MaxIdle connections and their timeouts, TLSConfig, etc.) so there's one function to set defaults and a clear way to configure alternate values in the future, should it be needed.

kfswain · 2025-07-31T00:53:00Z

pkg/epp/datalayer/metrics/client.go

+	}
+}
+
+func (cl *client) Get(ctx context.Context, target *url.URL, ep datalayer.Addressable) (PrometheusMetricMap, error) {


Similar to the conversation below: since this seems to be a general function signature, do we want to abstract away the Prometheus assumption (perhaps an endpoint uses OpenTelemetry)?

Also similar to the conversation below: that's fine if we don't tackle this now.

agree.
There's probably a generic HTTP client wrapper we can expose to return any (or HTTP response); and a specific client that is used for /metrics which would turn the output to a concrete type used in the DataSource.
Will tackle it once we have the second use case (e.g., /v1/models?) and have a better idea on what's common and what's adaptable between clients.

pkg/epp/datalayer/metrics/datasource.go

pkg/epp/datalayer/metrics/extractor.go

pkg/epp/datalayer/metrics/datasource.go

pkg/epp/datalayer/metrics/spec.go

kfswain · 2025-07-31T01:14:31Z

pkg/epp/datalayer/metrics/spec.go

+
+	// Be liberal in accepting inputs that are missing quotes around label values,
+	// allowing both {label=value} and {label=\"value\"} inputs
+	quoted := addQuotesToLabelValues(spec)


should we be liberal in acceptance? I'm not sold on the value of accepting the performance hit of running regex in production vs just validating that metrics are working as expected in lower envs first.

rationale:

unlike k8s labels, Prometheus label values allow spaces (e.g., status="200 OK" is a valid expression), so we should be prepared to handle those as well.

standardizing user inputs on the Prometheus expected format allows us to use their official parser (which replaces the handcrafted parser code we currently have).

this is only done once when the metrics.Spec is created and not in any performance critical code. The result of parsing is stored on the spec object (name, labels) for use when retrieving the actual metrics defined by the Spec - just as in the current code.

So I think this change is desirable from a functional standpoint and should not have any noticeable runtime performance hit in production environments.
Makes sense?
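
For illustration only, one way to normalize unquoted label values with a single precompiled regular expression is sketched below. This is not the PR's addQuotesToLabelValues: the `quoteLabelValues` name and the pattern are assumptions, and the pattern is deliberately simple (for example, it does not handle an `=` inside an already quoted value).

```go
package main

import (
	"fmt"
	"regexp"
)

// unquotedLabelValue matches name=value where the value does not start with a
// quote and runs up to the next comma or closing brace. It is compiled once at
// package initialization, so no compilation happens on the scrape path.
var unquotedLabelValue = regexp.MustCompile(`([A-Za-z_][A-Za-z0-9_]*)=([^",}\s][^,}]*)`)

// quoteLabelValues rewrites {label=value} into {label="value"} and leaves
// values that are already quoted untouched.
func quoteLabelValues(spec string) string {
	return unquotedLabelValue.ReplaceAllString(spec, `${1}="${2}"`)
}

func main() {
	for _, spec := range []string{
		`requests_total{label=value}`,
		`requests_total{label="value"}`,
		`requests_total{status=200 OK,method="GET"}`,
	} {
		fmt.Printf("%s -> %s\n", spec, quoteLabelValues(spec))
	}
}
```

Because the normalization would run once per Spec at construction time, its cost is paid at startup rather than on every metrics refresh.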

elevran · 2025-07-31T15:36:29Z

Addressed most of the review comments. The following are still open and I will handle them over the next couple of days:

address comments regarding use of metrics.client and metrics.clientmap (e.g., global vars, client configuration, etc.)
additional unit tests in datalayer/metrics
some missing logs to allow debug/observe via logging messages

At this point I will ask for another review cycle. The remaining items will be handled in a separate PR:

hook this into the data layer (do we want a feature flag for some soak test time or do we feel confident to switch over with existing end-to-end tests for conformance and confirmation)
enable producing and logging cross model server metrics for the entire datastore
additional collectors/extractors (extractor possibly needed for wide EP - leader metrics include rank so no longer in MSP expected format, workers are not exposing metrics? k8s annotations needed to determine leader/workers roles)

pkg/epp/datalayer/collector.go

pkg/epp/datalayer/metrics/client.go

ahg-g · 2025-07-31T12:28:51Z

pkg/epp/datalayer/metrics/client.go

+var (
+	cleanupTick          = 30 * time.Second
+	maxIdleTime          = time.Minute
+	defaultClientFactory = newClientFactory()
+)


I think a singleton pattern is reasonable here, probably better than trying to plumb this through many layers

pkg/epp/datalayer/collector.go

ahg-g · 2025-08-01T20:59:22Z

pkg/epp/datalayer/endpoint.go

 	AttributeMap
 }
+
+// ModelServer is an implementation of the Endpoint interface.


We may want to design this such that a pod is not running a single model server, but more than one. The case is data parallelism for MoE models. I know this is not a design we settled on yet, but that may well happen.

agree, this is what we're seeing in data parallel (multiple scheduling "targets" in the same Pod).

open to different naming

what (if anything) should change at this time in the Endpoint/ModelServer API?

The name is fine I think, I just noticed the pod parameter, not sure if we are doing anything in the code that assumes a 1:1 relationship between modelServer and pod, if not, then we should be fine.

Used the same structure as PodMetrics - an atomic.Pointer to Pod info (not the k8s Pod). So I think there is a "Pod" (IP, name, labels) per "ModelServer" but a Pod could potentially appear in multiple ModelServers.
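
A minimal sketch of that shape, assuming hypothetical `PodInfo` and `ModelServer` types (the field names, the `Port` field and the constructor are illustrative and not the PR's definitions): each ModelServer holds an atomic.Pointer to Pod info, and two ModelServers may point at the same Pod.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// PodInfo is a hypothetical snapshot of Pod attributes (not the k8s Pod object).
type PodInfo struct {
	Name    string
	Address string
	Labels  map[string]string
}

// ModelServer is a hypothetical scheduling target. Several ModelServers may
// reference the same Pod, for example one per data-parallel rank.
type ModelServer struct {
	Port int // assumption: targets in one Pod are told apart by port
	pod  atomic.Pointer[PodInfo]
}

// NewModelServer creates a target that initially points at the given Pod info.
func NewModelServer(pod *PodInfo, port int) *ModelServer {
	ms := &ModelServer{Port: port}
	ms.pod.Store(pod)
	return ms
}

// GetPod returns the current Pod info without taking a lock.
func (ms *ModelServer) GetPod() *PodInfo { return ms.pod.Load() }

// UpdatePod atomically swaps in a new snapshot; readers see old or new, never a mix.
func (ms *ModelServer) UpdatePod(pod *PodInfo) { ms.pod.Store(pod) }

func main() {
	pod := &PodInfo{Name: "vllm-0", Address: "10.0.0.1"}
	rank0 := NewModelServer(pod, 8000)
	rank1 := NewModelServer(pod, 8001)
	fmt.Println("same pod:", rank0.GetPod() == rank1.GetPod())

	// Updating one target does not affect the other.
	rank0.UpdatePod(&PodInfo{Name: "vllm-0", Address: "10.0.0.2"})
	fmt.Println(rank0.GetPod().Address, rank1.GetPod().Address)
}
```

Nothing in this layout forces a 1:1 relationship between ModelServer and Pod; whether the rest of the code assumes one is a separate question.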

pkg/epp/datalayer/metrics/client.go

pkg/epp/datalayer/metrics/datasource.go

elevran · 2025-08-05T14:33:53Z

@ahg-g anything else that should be addressed in this PR?

kfswain · 2025-08-06T19:19:27Z

/lgtm
/approve

k8s-ci-robot · 2025-08-06T19:19:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elevran, kfswain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [kfswain]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 27, 2025

k8s-ci-robot requested review from ahg-g and nirrozenbaum July 27, 2025 12:28

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 27, 2025

elevran changed the title ~~[WIP] Pluggable metrics collection~~ Pluggable metrics collection Jul 28, 2025

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 28, 2025

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 29, 2025

elevran force-pushed the metrics_collection branch from b5b1157 to 3c4f686 Compare July 29, 2025 12:01

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 29, 2025

elevran force-pushed the metrics_collection branch from 3c4f686 to fc56a16 Compare July 29, 2025 14:30

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 30, 2025

elevran force-pushed the metrics_collection branch from 3a3f218 to 1c29551 Compare July 30, 2025 11:55

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 30, 2025

k8s-ci-robot requested a review from kfswain July 30, 2025 11:56

nirrozenbaum reviewed Jul 30, 2025

kfswain reviewed Jul 31, 2025

ahg-g reviewed Aug 1, 2025

elevran requested a review from nirrozenbaum August 2, 2025 10:35

elevran requested review from ahg-g and kfswain August 2, 2025 10:35

elevran mentioned this pull request Aug 2, 2025

Fixes for make fmt-imports #1287

Merged

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 5, 2025

elevran added 9 commits August 5, 2025 17:31

pluggable metrics collection

045f973

Signed-off-by: Etai Lev Ran <[email protected]>

use the provided context

1d502d6

Signed-off-by: Etai Lev Ran <[email protected]>

derive new context outside anonymous function

fc2519d

Signed-off-by: Etai Lev Ran <[email protected]>

make fatcontext happy?

b47a4a2

Signed-off-by: Etai Lev Ran <[email protected]>

mock ticker

24d8571

Signed-off-by: Etai Lev Ran <[email protected]>

split LoRA metric Spec from standard Spec

0c9475c

Signed-off-by: Etai Lev Ran <[email protected]>

add collector test

d2082d6

Signed-off-by: Etai Lev Ran <[email protected]>

address review comments

77a2bca

Signed-off-by: Etai Lev Ran <[email protected]>

review comments on use of client

809ed0c

Signed-off-by: Etai Lev Ran <[email protected]>

elevran force-pushed the metrics_collection branch from 5f1c437 to 809ed0c Compare August 5, 2025 14:32

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 5, 2025

k8s-ci-robot assigned kfswain Aug 6, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 6, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 6, 2025

k8s-ci-robot merged commit 5bb70f6 into kubernetes-sigs:main Aug 6, 2025
9 checks passed

elevran deleted the metrics_collection branch August 12, 2025 08:58

This was referenced Aug 17, 2025

extensible data layer: EPP should allow configurable metrics collection #703

Closed

Enable pluggable datalayer as experimental feature #1391

Merged
