Pluggable metrics collection #1237
Conversation
@kfswain @ahg-g @nirrozenbaum - this is a large PR. If interested, and if it would expedite review, I can break it up into smaller pieces. My concern is that doing so may lose the bigger picture/end-to-end flow. An example split into smaller units (each would still be hundreds of lines):
This would leave us where we are now, but with 4 smaller PRs.
Side note:
So apparently I was running with the latest 1.x (…).
Any downside from upgrading to the latest (v2.3.0)?

Not much. Suggest moving this discussion over to a new issue #1240.
/retest
nirrozenbaum left a comment:
I completed a first pass and added some comments. Nothing seems like a big issue.
pkg/epp/datalayer/metrics/client.go
Outdated
```go
func newClient() *client {
	return &client{
		cl: &http.Client{
			Timeout: 10 * time.Second,
```
Simple sounds good for now, but I bet in future PRs we will need to think about configurability.
The client timeout would be governed by the context passed into Get. Since we scrape on a short cycle, I expect we'll set the timeouts very low, and I'm not sure we really need to expose them to users.
Other customizations (e.g., TLSConfig) would be added. I haven't thought about what needs data-source-specific (as opposed to system-wide) configuration, but it would be supported via the config file, as we do for Plugins.
For the metrics client you're probably right about the short interval (not much sense in having a large scrape interval). Is this true also for the other collectors?
I think ideally we would like to generalize the scrape interval and timeout, while not over-complicating things (and this doesn't seem like an over-complication).
As a starting point we could also define the timeout as a function of the scrape interval, e.g., timeout = 3 * scrape interval.
There are two classes of http.Client timeouts:
- `http.Client.Timeout` can be used to set a hard timeout for the global client interactions (e.g., DNS resolution, TCP connection, etc.).
- Per-"transaction" timeouts are handled by `context.WithTimeout` and `http.NewRequestWithContext`.

The initial protection was to ensure we don't have long timeouts when connecting to a non-existent server endpoint (i.e., wrong port configured, target Pod has terminated, etc.), so the main concern was the global client interaction.
Regardless, I will add "client configuration" as a parameter (e.g., control over Transport.MaxIdle connections and their timeouts, TLSConfig, etc.) so there's one function to set defaults and a clear way to configure alternate values in the future, should it be needed.
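For illustration, a minimal sketch of the two timeout layers described above; the timeout values and the scrape URL are placeholders, not taken from the PR:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Global client timeout: caps the whole exchange (connection setup,
	// request, and reading the response body) for every call through this client.
	client := &http.Client{Timeout: 10 * time.Second}

	// Per-request ("transaction") timeout: scoped to a single scrape and
	// typically much shorter, e.g., derived from the scrape interval.
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://localhost:9090/metrics", nil)
	if err != nil {
		panic(err)
	}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("scrape failed:", err) // covers context deadline exceeded as well
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```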
```go
	}
}

func (cl *client) Get(ctx context.Context, target *url.URL, ep datalayer.Addressable) (PrometheusMetricMap, error) {
```
Similar to the conversation below: since this seems to be a general function signature, do we want to abstract away the Prometheus assumption (perhaps an endpoint uses OpenTelemetry)?
Also similar to the conversation below; that's fine if we don't tackle this now.
Agree.
There's probably a generic HTTP client wrapper we can expose to return `any` (or the HTTP response), and a specific client used for /metrics that would turn the output into a concrete type used in the DataSource.
Will tackle it once we have the second use case (e.g., /v1/models?) and a better idea of what's common and what's adaptable between clients.
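One possible shape for that split, purely as an illustration (the wrapper and client names below are assumptions, not code from the PR):

```go
package metrics

import (
	"context"
	"io"
	"net/http"
	"net/url"
)

// rawClient is a hypothetical generic wrapper: it only fetches bytes from a
// target URL and leaves interpretation to callers.
type rawClient struct {
	cl *http.Client
}

func (r *rawClient) fetch(ctx context.Context, target *url.URL) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, target.String(), nil)
	if err != nil {
		return nil, err
	}
	resp, err := r.cl.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

// metricsClient is a hypothetical /metrics-specific client that turns the raw
// payload into a concrete type consumed by the DataSource.
type metricsClient struct {
	raw *rawClient
}

func (m *metricsClient) Get(ctx context.Context, target *url.URL) (map[string]string, error) {
	body, err := m.raw.fetch(ctx, target)
	if err != nil {
		return nil, err
	}
	// A real implementation would parse the Prometheus exposition format here;
	// this placeholder just returns the payload under a single key.
	return map[string]string{"raw": string(body)}, nil
}
```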
```go
	// Be liberal in accepting inputs that are missing quotes around label values,
	// allowing both {label=value} and {label="value"} inputs
	quoted := addQuotesToLabelValues(spec)
```
Should we be liberal in acceptance? I'm not sold on the value of accepting the performance hit of running a regex in production vs. just validating that metrics are working as expected in lower environments first.
Rationale:
- Unlike k8s labels, Prometheus label values allow spaces (e.g., `status="200 OK"` is a valid expression), so we should be prepared to handle those as well.
- Standardizing user inputs on the Prometheus expected format allows us to use their official parser (which replaces the handcrafted parser code we currently have).
- This is only done once when the `metrics.Spec` is created and not in any performance-critical code. The result of parsing is stored on the spec object (name, labels) for use when retrieving the actual metrics defined by the Spec - just as in the current code.

So I think this change is desirable from a functional standpoint and should not have any noticeable runtime performance hit in production environments.
Makes sense?
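As a rough illustration of that one-time normalization (the regex and example specs are assumptions; the PR's `addQuotesToLabelValues` may differ):

```go
package main

import (
	"fmt"
	"regexp"
)

// unquotedLabelValue matches label=value pairs whose value is not already
// wrapped in double quotes (a hypothetical pattern; the PR's may differ).
var unquotedLabelValue = regexp.MustCompile(`([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*([^",}]+)`)

// addQuotesToLabelValues normalizes a spec such as
//	queue_len{model=llama, status=200 OK}
// into the Prometheus-expected form
//	queue_len{model="llama", status="200 OK"}
// so the official Prometheus parser can be used downstream.
func addQuotesToLabelValues(spec string) string {
	return unquotedLabelValue.ReplaceAllString(spec, `$1="$2"`)
}

func main() {
	fmt.Println(addQuotesToLabelValues(`queue_len{model=llama, status=200 OK}`))
	fmt.Println(addQuotesToLabelValues(`queue_len{model="llama"}`)) // already quoted, left unchanged
}
```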
Ack SGTM
Addressed most of the review comments. The following are still open and I will handle them over the next couple of days.
At this point I will ask for another review cycle. The remaining items will be handled in separate PRs.
pkg/epp/datalayer/metrics/client.go
Outdated
```go
var (
	cleanupTick          = 30 * time.Second
	maxIdleTime          = time.Minute
	defaultClientFactory = newClientFactory()
)
```
I think a singleton pattern is reasonable here, probably better than trying to plumb this through many layers
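A rough sketch of how such a package-level singleton might be structured, based on the variables shown above (the factory internals and the clientFor method are assumptions, not the PR's code):

```go
package metrics

import (
	"net/http"
	"sync"
	"time"
)

var (
	cleanupTick = 30 * time.Second
	maxIdleTime = time.Minute

	// defaultClientFactory is the package-level singleton, so callers don't
	// need the factory plumbed through every layer.
	defaultClientFactory = newClientFactory()
)

// clientFactory hands out HTTP clients keyed by target and remembers when each
// was last used, so clients idle longer than maxIdleTime could be dropped on a
// cleanupTick cycle (hypothetical behavior inferred from the variables above).
type clientFactory struct {
	mu       sync.Mutex
	clients  map[string]*http.Client
	lastUsed map[string]time.Time
}

func newClientFactory() *clientFactory {
	return &clientFactory{
		clients:  map[string]*http.Client{},
		lastUsed: map[string]time.Time{},
	}
}

// clientFor returns the cached client for a key, creating it on first use.
func (f *clientFactory) clientFor(key string) *http.Client {
	f.mu.Lock()
	defer f.mu.Unlock()
	if cl, ok := f.clients[key]; ok {
		f.lastUsed[key] = time.Now()
		return cl
	}
	cl := &http.Client{Timeout: 10 * time.Second}
	f.clients[key] = cl
	f.lastUsed[key] = time.Now()
	return cl
}
```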
```go
	AttributeMap
}

// ModelServer is an implementation of the Endpoint interface.
```
We may want to design this such that a pod is not running a single model server, but more than one. The case is data parallelism for MoE models. I know this is not a design we settled on yet, but that may well happen.
Agree, this is what we're seeing in data parallel (multiple scheduling "targets" in the same Pod).
- Open to different naming.
- What (if anything) should change at this time in the Endpoint/ModelServer API?
The name is fine, I think. I just noticed the pod parameter; I'm not sure if we are doing anything in the code that assumes a 1:1 relationship between modelServer and pod. If not, then we should be fine.
I used the same structure as PodMetrics: an atomic.Pointer to Pod info (not the k8s Pod). So I think there is a "Pod" (IP, name, labels) per "ModelServer", but a Pod could potentially appear in multiple ModelServers.
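A minimal sketch of the structure described here (the PodInfo fields and method names are assumptions; only the atomic.Pointer-to-pod-info pattern comes from the comment above):

```go
package datalayer

import "sync/atomic"

// PodInfo captures the scheduling-relevant subset of a Kubernetes Pod
// (a hypothetical shape, not the k8s core/v1 Pod type).
type PodInfo struct {
	Name   string
	IP     string
	Labels map[string]string
}

// ModelServer keeps pod info behind an atomic.Pointer so collectors can swap
// it without locking; several ModelServers could reference the same PodInfo.
type ModelServer struct {
	pod atomic.Pointer[PodInfo]
	// AttributeMap, metrics state, etc. would live alongside it.
}

// UpdatePod atomically replaces the pod info, e.g., when the Pod's IP changes.
func (ms *ModelServer) UpdatePod(p *PodInfo) { ms.pod.Store(p) }

// GetPod returns the current snapshot (nil until the first update).
func (ms *ModelServer) GetPod() *PodInfo { return ms.pod.Load() }
```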
@ahg-g anything else that should be addressed in this PR?
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: elevran, kfswain
Implementation of pluggable metrics collection.
The PR can be reviewed in three parts:
- Refactoring of existing code: `pkg/epp/backend/metrics` is now under `pkg/epp/datalayer/metrics`. Tests were moved along with the refactored code.
- Functionality that was part of `metrics.PodMetrics` was moved into `datalayer/collector.go`.
- …

PR open to collect feedback; a rough sketch of the pluggable pieces referenced below follows after the list. The following are yet to be done and shall be added (to this PR or a follow-up):
- … (`DataSource.Collect` and `extractor.Extract`)
- … `datastore` (e.g., inferencepool target port changes, management of a Collector for every endpoint, etc.)
- … (`runner.go`?)
- … `pkg/epp/backend/metrics` … (`Collect(ep)` directly from datastore)
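To make the shape of these pluggable pieces more concrete, here is a hypothetical sketch of data source and extractor interfaces along these lines; all names and signatures are illustrative assumptions, not the PR's actual API:

```go
package datalayer

import (
	"context"
	"time"
)

// Addressable is anything with a scrape-able address (hypothetical).
type Addressable interface {
	Address() string
}

// Extractor turns raw data fetched from an endpoint into typed attributes.
type Extractor interface {
	// Extract parses a raw payload and updates the endpoint's attributes.
	Extract(raw any, ep Addressable) error
}

// DataSource collects raw data for an endpoint and hands it to its
// registered extractors, making collection pluggable per source.
type DataSource interface {
	// Name identifies the source (e.g., "metrics").
	Name() string
	// AddExtractor registers an extractor to run on each collection.
	AddExtractor(ex Extractor) error
	// Collect fetches from the endpoint and invokes the extractors.
	Collect(ctx context.Context, ep Addressable) error
}

// Collector drives a DataSource for a single endpoint on a fixed interval,
// roughly the role of datalayer/collector.go in the description above.
type Collector struct {
	Source   DataSource
	Endpoint Addressable
	Interval time.Duration
}

// Run loops until the context is cancelled, collecting once per tick.
func (c *Collector) Run(ctx context.Context) {
	ticker := time.NewTicker(c.Interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = c.Source.Collect(ctx, c.Endpoint) // errors would normally be logged
		}
	}
}
```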