-
Notifications
You must be signed in to change notification settings - Fork 1.7k
KEP-4785: Move RSM to beta #5766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -97,6 +97,9 @@ tags, and then generate with `hack/update-toc.sh`. | |
| - [Integration tests](#integration-tests) | ||
| - [e2e tests](#e2e-tests) | ||
| - [Graduation Criteria](#graduation-criteria) | ||
| - [Alpha](#alpha) | ||
| - [Beta](#beta) | ||
| - [Stable](#stable) | ||
| - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) | ||
| - [Version Skew Strategy](#version-skew-strategy) | ||
| - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) | ||
|
|
@@ -343,13 +346,15 @@ The controller abides by the following principals: | |
| * Library support: The module is **not** intended to be used as a library, and | ||
| as such, does not export any functions or types, with `pkg/` being an | ||
| exception (for e.g., managed resource types). | ||
| * Metrics stability: There are **no** metrics stability guarantees, as the | ||
| * Metrics stability: There are **no** [metrics stability] guarantees, as the | ||
| metrics are dynamically generated. | ||
| * No middle-ware: The configuration is `unmarshal`led into a set of stores that | ||
| the codebase directly operates on. Unlike Kube State Metrics, there is **no** | ||
| controller-defined parsing middle-ware that processes the configuration before | ||
| it is used, in order to cut down on unnecessary complexity as much as | ||
| possible. | ||
| it is consumed (by the resolvers), in order to cut down on unnecessary | ||
| complexity as much as possible. | ||
|
|
||
| [metrics stability]: https://kubernetes.io/blog/2021/04/23/kubernetes-release-1.21-metrics-stability-ga | ||
|
|
||
| ### Risks and Mitigations | ||
|
|
||
|
|
@@ -365,16 +370,14 @@ How will UX be reviewed, and by whom? | |
| Consider including folks who also work outside the SIG or subproject. | ||
| --> | ||
|
|
||
| N/A since the managed resource offered by Resource State Metrics provides the | ||
| ability to define metric configurations that are a super-set of the | ||
| expressibility that Kube State Metrics' Custom Resource State configurations | ||
| have to offer. However, users will need to familiarize themselves with the | ||
| supported DSLs to be able to translate their configuration into a format that | ||
| the controller can understand, missing which could result in invalid metrics. | ||
| Mostly N/A since the concept that Resource State Metrics builds on top | ||
| of has already been offered through Kube State Metrics' Custom Resource | ||
| State configurations since the feature-set's inception. | ||
|
|
||
| Also, since this aims to deprecate the Custom Resource State API, a timeline | ||
| will be published for users on the Kube State Metrics' repository and a PSA on | ||
| the appropriate channel(s), once the controller graduates to stable. | ||
| Also, since this aims to deprecate the Custom Resource State | ||
| feature-set, a timeline will be published for users on the Kube State | ||
| Metrics' repository and a PSA on the appropriate channels, once the | ||
| controller graduates to stable. | ||
|
|
||
| ## Design Details | ||
|
|
||
|
|
@@ -392,31 +395,35 @@ generation. | |
| - At its core, the controller relies on its managed resource, | ||
| `ResourceMetricsMonitor` to fetch the metric generation configuration. Parts | ||
| of the configuration may be defined using different `resolver`s, such as | ||
| `unstructured` or `CEL`. | ||
| `unstructured` or `CEL` (more resolvers may be added in the future). | ||
| - Once fetched, the controller `unmarshal`s the configuration YAML directly into | ||
| `stores` which are a set of metric `families`, which in turn are a set of | ||
| `metrics`. | ||
| - Metric `stores` are created based on its respective GVKR (a type that embeds | ||
| `schema.GroupVersionKind`, `schema.GroupVersionResource` to avoid [plural | ||
| ambiguities]), and reflectors for the specified resource are initialized, and | ||
| populate the stores on its update. | ||
| - All generated metrics are hardcoded to `gauge`s by design, as Prometheus | ||
| currently does not support some OpenMetrics-specified metrics' types, such as | ||
| `Info` and `StateSets`, but more importantly, because these metrics can be | ||
| expressed using `gauge`s. | ||
| - All generated metrics are hardcoded to `gauge`s by design, as metrics | ||
| backends in the ecosystem do not support some OpenMetrics-specified | ||
| metrics' types, such as `Info` and `StateSets`, but more importantly, | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could this be part of an auto-negotiation on the protocol level?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, I pushed the same negotiation behavior as KSM to RSM, which respects the selected KEP content above. However, one could argue that we must maintain compatibility with Prometheus' progress on OpenMetrics, and thus, atleast support The only reason I didn't do this and had a hard-pin on But I realize this makes sense, and the complexity is worth the benefits the community gets out of RSM, so I'll put this in the TODO before we go stable. PLMK if that makes sense, or if you'd like a different approach.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
| because these metrics can be expressed using `gauge`s. This also helps | ||
| us discard concerns and complexities around maintaining support for | ||
| other metric types in the long run, which, as we've also observed in | ||
| Kube State Metrics' Custom Resource State API, adds significant | ||
| maintenance overhead. | ||
| - `/metrics` pings on `RSM_MAIN_PORT` trigger the server to write the raw | ||
| metrics defined in the configuration, combined with its appropriate header(s), | ||
| in the response. | ||
| - `/external` pings on `RSM_MAIN_PORT` trigger the server to write the raw | ||
| metrics defined in the `./extenal` directory, combined with its appropriate | ||
| header(s), in the response. | ||
| - `/metrics` pings on `RSM_SELF_PORT` trigger the server to write the raw | ||
| metrics about the process itself, combined with its appropriate header(s), in | ||
| the response. | ||
|
|
||
| At the moment, the `spec` houses a single `configuration` field, which defines | ||
| the metric generation configuration as follows (please note that the schema is | ||
| fast-moving at this point and may be subject to change: | ||
| At the moment, the `spec` houses a single `configuration` field, which | ||
| defines the metric generation configuration as follows (please note that | ||
| the schema is fast-moving at this point and may be subject to change, | ||
| preferably towards one that incorporates the YAML-based configuration | ||
| string into the managed resource's schema itself, allowing for tighter | ||
| integration and validation: | ||
|
|
||
| ```yaml | ||
| generators: # Set of metrics stores for each CR we want to generate metrics for. | ||
|
|
@@ -486,7 +493,9 @@ generators: # Set of metrics stores for each CR we want to generate metrics for. | |
|
|
||
| It's also worth mentioning that unlike Kube State Metrics' Custom Resource | ||
| State, Resource State Metrics supports recursively generating samples from | ||
| nested data structures, all from a single expression. Assuming we have a query, | ||
| nested data structures, all from a single expression. | ||
|
|
||
| Similarly, in Resource State Metrics, assuming we have a query, | ||
| ``` | ||
| o.spec | ||
| ``` | ||
|
|
@@ -530,7 +539,8 @@ test_metric{os="linux", tags="", key_1="value1", key_2="value2", app_i | |
| Note that the order of samples, as well as their labelsets, guaranteed to be | ||
| stable across runs. | ||
|
|
||
| The `status`, on the other hand, is a set of `metav1.Condition`s, like so: | ||
| The `ResourceMetricsMonitor` `status` is a set of `metav1.Condition`s, | ||
| like so: | ||
|
|
||
| ```yaml | ||
| status: | ||
|
|
@@ -618,7 +628,11 @@ For Beta and GA, add links to added tests together with links to k8s-triage for | |
| https://storage.googleapis.com/k8s-triage/index.html | ||
| --> | ||
|
|
||
| N/A. | ||
| `k8s-triage` is not relevant since the controller resides out-of-tree. | ||
|
|
||
| However, we do plan on covering as much surface as possible with | ||
| integration tests before graduating to Stable. These tests usually reside | ||
| with the end-to-end tests (see below). | ||
|
|
||
| ##### e2e tests | ||
|
|
||
|
|
@@ -704,6 +718,36 @@ in back-to-back releases. | |
| - Deprecate the flag | ||
| --> | ||
|
|
||
| #### Alpha | ||
|
|
||
| - Implement foundational ideas discussed in the proposal (till Alpha) to | ||
| establish confidence through the proof-of-work. | ||
| - This also includes adding support for extending the configuration parsing | ||
| techniques via resolvers. | ||
|
|
||
| #### Beta | ||
|
|
||
| - Add support for admission web-hooks to validate the configuration on | ||
| creation/update of the managed resource. | ||
| - Add support for `starlark` resolver to help define more complex logic, in | ||
| addition to mainstream DSLs seen across the ecosystem for similar use-cases, | ||
| such as `jq` and `jsonpath` ([`prometheus-community/json_exporter`], | ||
| [`openmcp-project/metrics-operator`], etc.). | ||
|
|
||
| [`openmcp-project/metrics-operator`]: https://github.com/openmcp-project/metrics-operator#metric | ||
| [`prometheus-community/json_exporter`]: https://github.com/prometheus-community/json_exporter | ||
|
|
||
| #### Stable | ||
|
|
||
| - Support all relevant metric types that Prometheus' OpenMetrics implementation | ||
| allows. This would entail `Counter`s currently, and `Info` and `Stateset` in | ||
| the future. See [expfmt.MetricFamilyToOpenMetrics] for more. | ||
| - Complete API surface test coverage for all applicable and relevant test | ||
| types. | ||
| - Support payload compression (through gzip, zstd, etc.). | ||
|
|
||
| [expfmt.MetricFamilyToOpenMetrics]: https://pkg.go.dev/github.com/prometheus/common@v0.67.5/expfmt#MetricFamilyToOpenMetrics | ||
|
|
||
| ### Upgrade / Downgrade Strategy | ||
|
|
||
| <!-- | ||
|
|
@@ -929,7 +973,7 @@ Longer term, we may want to require automated upgrade/rollback tests, but we | |
| are missing a bunch of machinery and tooling and can't do that now. | ||
| --> | ||
|
|
||
| TBD (when targeting beta). | ||
| N/A | ||
|
|
||
| ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we allow supplying configuration as a regular config file as well? I wonder if there's a usecase to run RSM outside of the cluster.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Folks can use similar semantics as seen in
make localto run this locally. That being said, doing away with RMMs and relying on file blobs will strip away the added feature-set of the controller lifecycle (for e.g., status updates) that users can benefit from. RSM could mirror that blob to an RMM resource, but I'm not sure if the file-only direction would be worth the added complexity.