Contributor

@elevran elevran commented Jul 27, 2025

Implementation of pluggable metrics collection.

The PR can be reviewed in three parts:

  • mostly refactored and simplified metrics code from pkg/epp/backend/metrics now under pkg/epp/datalayer/metrics. Tests were moved along with the refactored code.
  • the goroutine is no longer part of metrics.PodMetrics and was moved into datalayer/collector.go
  • changes to go.mod/sum

PR open to collect feedback. The following are yet to be done and shall be added (to this PR or a follow up)

  • additional unit tests (will be added on this PR)
  • logging in functions (see open question regarding context.Context in DataSource.Collect and extractor.Extract)
  • hooking pluggable metrics with rest of system (follow up PR)
    • datastore (e.g., inferencepool target port changes, management of Collector for every endpoint, etc.)
    • default registration of the metrics data source (e.g., in runner.go?)
    • removal of pkg/epp/backend/metrics
  • global statistics (and their logging) for metrics data source (separate PR or this)
  • support for k8s based collectors and extractor (separate PR. Plan: extend DS interface GKV(), add new Registry for k8s based sources. Do not pass k8s sources to Collector and invoke their Collect(ep) directly from datastore)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 27, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 27, 2025
netlify bot commented Jul 27, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 809ed0c
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/689215fd98e0cf00088fcd78
😎 Deploy Preview https://deploy-preview-1237--gateway-api-inference-extension.netlify.app

Contributor Author

elevran commented Jul 27, 2025

@kfswain @ahg-g @nirrozenbaum - this is a large PR.

If interested and it would expedite review, I can break it up into smaller pieces. My concern was that doing so may cause loss of the bigger picture/end-to-end flow.

Example break into smaller units (they'd still be 100's of lines each):

  1. refactor backend/metrics Spec and Mapping in place (can be one or two PRs. This could be a little tricky since moving some functionality from free functions to method receivers)
  2. move Spec, Mapping and adding spec_test to datalayer/metrics, leaving type aliases behind (another PR).
  3. create extractor, data source and related under datalayer/metrics
  4. add Collector to datalayer

This would leave us where we are now but with 4 smaller PRs.
Let me know if that would be helpful and I will refactor the PR into smaller items.

Contributor Author

elevran commented Jul 27, 2025

side note:
any idea why fatcontext errors were not showing up in make lint but are in pull-gateway-api-inference-extension-verify-main?
Seems to be a tighter loop if both pass/fail the same.

Contributor Author

elevran commented Jul 28, 2025

side note:
any idea why fatcontext errors were not showing up in make lint but are in pull-gateway-api-inference-extension-verify-main?
Seems to be a tighter loop if both pass/fail the same.

So apparently I was running with the latest 1.x (1.64.8) whereas the GIE Makefile downloads an earlier release (1.62.2).
Any objections to upgrading?
I know the fatcontext linter was wrong in its assessment.

@nirrozenbaum
Contributor

So apparently I was running with the latest 1.x (1.64.8) whereas the GIE Makefile downloads an earlier release (1.62.2). Any objections to upgrading? I know the fatcontext linter was wrong in its assessment.

any downside from upgrading to latest (v2.3.0)?

@elevran elevran changed the title [WIP] Pluggable metrics collection Pluggable metrics collection Jul 28, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 28, 2025
Contributor Author

elevran commented Jul 28, 2025

any downside from upgrading to latest (v2.3.0)?

Not much.
v2 uses a slightly different configuration format (e.g., there's no timeout). You can convert using golangci-lint migrate.
The use of newer versions of linters could potentially uncover some issues which have not been reported on before, but hopefully not too many and those reported should be fixed anyway.
The only potential additional work might have to do with CICD tooling.

Suggest moving this discussion over to a new issue #1240.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 29, 2025
@elevran elevran force-pushed the metrics_collection branch from b5b1157 to 3c4f686 Compare July 29, 2025 12:01
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 29, 2025
@elevran elevran force-pushed the metrics_collection branch from 3c4f686 to fc56a16 Compare July 29, 2025 14:30
Contributor Author

elevran commented Jul 29, 2025

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 30, 2025
@elevran elevran force-pushed the metrics_collection branch from 3a3f218 to 1c29551 Compare July 30, 2025 11:55
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 30, 2025
Contributor Author

elevran commented Jul 30, 2025

/cc @nirrozenbaum @kfswain

@k8s-ci-robot k8s-ci-robot requested a review from kfswain July 30, 2025 11:56
Contributor

@nirrozenbaum nirrozenbaum left a comment

I completed first pass.
added some comments. nothing seems like a big issue.

func newClient() *client {
	return &client{
		cl: &http.Client{
			Timeout: 10 * time.Second,
Collaborator

Simple sounds good for now, but I bet in future PRs we will need to think about configurability.

Contributor Author

the client timeout would be governed by the context passed into Get. Since we scrape on a short cycle, I expect we'll set timeouts very low, and I'm not sure we really need to expose them to users.
Other customizations (e.g., TLSConfig) would be added. I haven't thought through which configurations are specific to the data source (as opposed to system-wide), but they would be supported via the config file, as we do for Plugins.

Contributor

for the metrics client you're probably right about the short interval (not much sense in having large scrape interval).
is this true also for other collectors?
I think ideally we would like to generalize (while not over complicating) the scrape interval and timeout (which doesn't seem to be over complication).
as a starting point we could also define the timeout as a function of the scraping interval, e.g., timeout = 3 * scrape interval.

Contributor Author

@elevran elevran Jul 31, 2025

There are two classes of http.Client timeouts:

  • http.Client.Timeout can be used to set a hard timeout for the global client interactions (e.g., DNS resolution, TCP connection, etc.).
  • per "transaction" timeouts are handled by context.WithTimeout and http.NewRequestWithContext

The initial protection was to ensure we don't have long timeouts in making connections to a non-existent server endpoint (i.e., wrong port configured, target Pod has terminated, etc.) - so main concern was the global client interaction.

Regardless, I will add "client configuration" as parameter (e.g., control over Transport.MaxIdle connections and their timeouts, TLSConfig, etc.) so there's one function to set defaults and a clear way to configure alternate values in the future, should it be needed.

}
}

func (cl *client) Get(ctx context.Context, target *url.URL, ep datalayer.Addressable) (PrometheusMetricMap, error) {
Collaborator

Similar to the conversation below: since this seems to be a general function signature, do we want to abstract away the Prometheus assumption (perhaps an endpoint uses OpenTelemetry)?

Collaborator

Also similar to the conversation below; that's fine if we don't tackle this now

Contributor Author

agree.
There's probably a generic HTTP client wrapper we can expose to return any (or HTTP response); and a specific client that is used for /metrics which would turn the output to a concrete type used in the DataSource.
Will tackle it once we have the second use case (e.g., /v1/models?) and have a better idea on what's common and what's adaptable between clients.


// Be liberal in accepting inputs that are missing quotes around label values,
// allowing both {label=value} and {label=\"value\"} inputs
quoted := addQuotesToLabelValues(spec)
Collaborator

should we be liberal in acceptance? I'm not sold on the value of accepting the performance hit of running regex in production vs just validating that metrics are working as expected in lower envs first.

Contributor Author

@elevran elevran Jul 31, 2025

rationale:

  • unlike k8s labels, Prometheus label values allow spaces (e.g., status="200 OK" is a valid expression), so we should be prepared to handle those as well.
  • standardizing user inputs on the Prometheus expected format allows us to use their official parser (which replaces the handcrafted parser code we currently have).
  • this is only done once, when the metrics.Spec is created, and not in any performance-critical code. The result of parsing is stored on the spec object (name, labels) for use when retrieving the actual metrics defined by the Spec - just as in the current code.

So I think this change is desirable from a functional standpoint and should not have any noticeable runtime performance hit in production environments.
Makes sense?
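To make the normalization step concrete, here is a minimal sketch of a one-time quoting pass. The implementation of `addQuotesToLabelValues` is not shown in this thread, so `quoteLabelValues` below is a hypothetical stand-in with an assumed regular expression. It deliberately handles only simple cases: a quoted value that itself contains a comma followed by `name=` would be rewritten incorrectly.

```go
package main

import (
	"fmt"
	"regexp"
)

// unquotedValue matches label=value pairs that follow '{' or ',' and whose
// value does not start with a double quote. Values may contain inner spaces;
// trailing whitespace is left outside the captured value.
var unquotedValue = regexp.MustCompile(
	`([{,]\s*[a-zA-Z_][a-zA-Z0-9_]*\s*=\s*)([^",}\s](?:[^",}]*[^",}\s])?)`)

// quoteLabelValues wraps bare label values in double quotes and leaves
// already-quoted values untouched. It is meant to run once, when a spec
// string is parsed, not on the scrape path.
func quoteLabelValues(spec string) string {
	return unquotedValue.ReplaceAllString(spec, `${1}"${2}"`)
}

func main() {
	for _, spec := range []string{
		`requests_total{status=200 OK,model="llama"}`,
		`queue_size{pool=default}`,
		`queue_size`,
	} {
		fmt.Printf("%s -> %s\n", spec, quoteLabelValues(spec))
	}
}
```

After this pass the string is in the quoted form a standard Prometheus-style parser expects, and the parsed name and labels can be cached on the spec object.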

Collaborator

Ack SGTM

Contributor Author

elevran commented Jul 31, 2025

Addressed most of the review comments. The following are still open and I will handle them over the next couple of days:

  • address comments regarding use of metrics.client and metrics.clientmap (e.g., global vars, client configuration, etc.)
  • additional unit tests in datalayer/metrics
  • some missing logs to allow debug/observe via logging messages

At this point I will ask for another review cycle. The remaining items will be handled in separate PR

  • hook this into the data layer (do we want a feature flag for some soak test time or do we feel confident to switch over with existing end-to-end tests for conformance and confirmation)
  • enable producing and logging cross model server metrics for the entire datastore
  • additional collectors/extractors (extractor possibly needed for wide EP - leader metrics include rank so no longer in MSP expected format, workers are not exposing metrics? k8s annotations needed to determine leader/workers roles)

Comment on lines 50 to 41
var (
	cleanupTick          = 30 * time.Second
	maxIdleTime          = time.Minute
	defaultClientFactory = newClientFactory()
)
Contributor

I think a singleton pattern is reasonable here, probably better than trying to plumb this through many layers

AttributeMap
}

// ModelServer is an implementation of the Endpoint interface.
Contributor

We may want to design this such that a pod is not running a single model server, but more than one. The case is data parallelism for MoE models. I know this is not a design we settled on yet, but that may well happen.

Contributor Author

agree, this is what we're seeing in data parallel (multiple scheduling "targets" in the same Pod).

  • open to different naming
  • what (if anything) should change at this time in the Endpoint/ModelServer API?

Contributor

The name is fine I think, I just noticed the pod parameter, not sure if we are doing anything in the code that assumes a 1:1 relationship between modelServer and pod, if not, then we should be fine.

Contributor Author

@elevran elevran Aug 4, 2025

Used the same structure as PodMetrics - an atomic.Pointer to Pod info (not the k8s Pod). So I think there is a "Pod" (IP, name, labels) per "ModelServer" but a Pod could potentially appear in multiple ModelServers.
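The relationship described here can be sketched as follows. The `PodInfo` fields, the `Port` field and the accessor names are hypothetical and chosen only for illustration; the point is that each ModelServer holds its own atomic pointer to pod info, while several ModelServers may point at the same Pod.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// PodInfo is a hypothetical snapshot of addressing data (not the k8s Pod object).
type PodInfo struct {
	Name    string
	Address string
	Labels  map[string]string
}

// ModelServer keeps its pod info behind an atomic pointer: readers never see
// a partially updated value, and writers swap in a complete new snapshot.
type ModelServer struct {
	Port int
	pod  atomic.Pointer[PodInfo]
}

// UpdatePod replaces the stored snapshot.
func (m *ModelServer) UpdatePod(p *PodInfo) { m.pod.Store(p) }

// GetPod returns the current snapshot, or nil if none was stored yet.
func (m *ModelServer) GetPod() *PodInfo { return m.pod.Load() }

func main() {
	pod := &PodInfo{Name: "vllm-0", Address: "10.0.0.7"}

	// Two scheduling targets (for example, data-parallel ranks) backed by the
	// same Pod: one PodInfo per ModelServer, but no 1:1 Pod-to-server mapping.
	rank0 := &ModelServer{Port: 8000}
	rank1 := &ModelServer{Port: 8001}
	rank0.UpdatePod(pod)
	rank1.UpdatePod(pod)

	for _, ms := range []*ModelServer{rank0, rank1} {
		fmt.Printf("%s -> %s:%d\n", ms.GetPod().Name, ms.GetPod().Address, ms.Port)
	}
}
```

Code that keys state by Pod name alone would conflate the two targets, which is the 1:1 assumption the review comment asks about.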

@elevran elevran requested a review from nirrozenbaum August 2, 2025 10:35
@elevran elevran requested review from ahg-g and kfswain August 2, 2025 10:35
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 5, 2025
elevran added 9 commits August 5, 2025 17:31
Signed-off-by: Etai Lev Ran <[email protected]>
@elevran elevran force-pushed the metrics_collection branch from 5f1c437 to 809ed0c Compare August 5, 2025 14:32
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 5, 2025
Contributor Author

elevran commented Aug 5, 2025

@ahg-g anything else that should be addressed in this PR?

Collaborator

kfswain commented Aug 6, 2025

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 6, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elevran, kfswain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 6, 2025
@k8s-ci-robot k8s-ci-robot merged commit 5bb70f6 into kubernetes-sigs:main Aug 6, 2025
9 checks passed
@elevran elevran deleted the metrics_collection branch August 12, 2025 08:58

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

5 participants