Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
98 changes: 71 additions & 27 deletions keps/sig-instrumentation/4785-resource-state-metrics/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,9 @@ tags, and then generate with `hack/update-toc.sh`.
- [Integration tests](#integration-tests)
- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [Stable](#stable)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
Expand Down Expand Up @@ -343,13 +346,15 @@ The controller abides by the following principals:
* Library support: The module is **not** intended to be used as a library, and
as such, does not export any functions or types, with `pkg/` being an
exception (for e.g., managed resource types).
* Metrics stability: There are **no** metrics stability guarantees, as the
* Metrics stability: There are **no** [metrics stability] guarantees, as the
metrics are dynamically generated.
* No middle-ware: The configuration is `unmarshal`led into a set of stores that
the codebase directly operates on. Unlike Kube State Metrics, there is **no**
controller-defined parsing middle-ware that processes the configuration before
it is used, in order to cut down on unnecessary complexity as much as
possible.
it is consumed (by the resolvers), in order to cut down on unnecessary
complexity as much as possible.

[metrics stability]: https://kubernetes.io/blog/2021/04/23/kubernetes-release-1.21-metrics-stability-ga

### Risks and Mitigations

Expand All @@ -365,16 +370,14 @@ How will UX be reviewed, and by whom?
Consider including folks who also work outside the SIG or subproject.
-->

N/A since the managed resource offered by Resource State Metrics provides the
ability to define metric configurations that are a super-set of the
expressibility that Kube State Metrics' Custom Resource State configurations
have to offer. However, users will need to familiarize themselves with the
supported DSLs to be able to translate their configuration into a format that
the controller can understand, missing which could result in invalid metrics.
Mostly N/A since the concept that Resource State Metrics builds on top
of has already been offered through Kube State Metrics' Custom Resource
State configurations since the feature-set's inception.

Also, since this aims to deprecate the Custom Resource State API, a timeline
will be published for users on the Kube State Metrics' repository and a PSA on
the appropriate channel(s), once the controller graduates to stable.
Also, since this aims to deprecate the Custom Resource State
feature-set, a timeline will be published for users on the Kube State
Metrics' repository and a PSA on the appropriate channels, once the
controller graduates to stable.

## Design Details

Expand All @@ -392,31 +395,35 @@ generation.
- At its core, the controller relies on its managed resource,
`ResourceMetricsMonitor` to fetch the metric generation configuration. Parts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we allow supplying configuration as a regular config file as well? I wonder if there's a usecase to run RSM outside of the cluster.

Copy link
Member Author

@rexagod rexagod Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Folks can use similar semantics as seen in make local to run this locally. That being said, doing away with RMMs and relying on file blobs will strip away the added feature-set of the controller lifecycle (for e.g., status updates) that users can benefit from. RSM could mirror that blob to an RMM resource, but I'm not sure if the file-only direction would be worth the added complexity.

of the configuration may be defined using different `resolver`s, such as
`unstructured` or `CEL`.
`unstructured` or `CEL` (more resolvers may be added in the future).
- Once fetched, the controller `unmarshal`s the configuration YAML directly into
`stores` which are a set of metric `families`, which in turn are a set of
`metrics`.
- Metric `stores` are created based on its respective GVKR (a type that embeds
`schema.GroupVersionKind`, `schema.GroupVersionResource` to avoid [plural
ambiguities]), and reflectors for the specified resource are initialized, and
populate the stores on its update.
- All generated metrics are hardcoded to `gauge`s by design, as Prometheus
currently does not support some OpenMetrics-specified metrics' types, such as
`Info` and `StateSets`, but more importantly, because these metrics can be
expressed using `gauge`s.
- All generated metrics are hardcoded to `gauge`s by design, as metrics
backends in the ecosystem do not support some OpenMetrics-specified
metrics' types, such as `Info` and `StateSets`, but more importantly,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be part of an auto-negotiation on the protocol level?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I pushed the same negotiation behavior as KSM to RSM, which respects the selected KEP content above. However, one could argue that we must maintain compatibility with Prometheus' progress on OpenMetrics, and thus, atleast support Counters as well right now.

The only reason I didn't do this and had a hard-pin on Gauges was only because this would open us up to supporting all relevant metric types that RMM allows users to construct, which, in the future, will, at the very least, entail Info and Stateset types.

But I realize this makes sense, and the complexity is worth the benefits the community gets out of RSM, so I'll put this in the TODO before we go stable. PLMK if that makes sense, or if you'd like a different approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because these metrics can be expressed using `gauge`s. This also helps
us discard concerns and complexities around maintaining support for
other metric types in the long run, which, as we've also observed in
Kube State Metrics' Custom Resource State API, adds significant
maintenance overhead.
- `/metrics` pings on `RSM_MAIN_PORT` trigger the server to write the raw
metrics defined in the configuration, combined with its appropriate header(s),
in the response.
- `/external` pings on `RSM_MAIN_PORT` trigger the server to write the raw
metrics defined in the `./extenal` directory, combined with its appropriate
header(s), in the response.
- `/metrics` pings on `RSM_SELF_PORT` trigger the server to write the raw
metrics about the process itself, combined with its appropriate header(s), in
the response.

At the moment, the `spec` houses a single `configuration` field, which defines
the metric generation configuration as follows (please note that the schema is
fast-moving at this point and may be subject to change:
At the moment, the `spec` houses a single `configuration` field, which
defines the metric generation configuration as follows (please note that
the schema is fast-moving at this point and may be subject to change,
preferably towards one that incorporates the YAML-based configuration
string into the managed resource's schema itself, allowing for tighter
integration and validation:

```yaml
generators: # Set of metrics stores for each CR we want to generate metrics for.
Expand Down Expand Up @@ -486,7 +493,9 @@ generators: # Set of metrics stores for each CR we want to generate metrics for.

It's also worth mentioning that unlike Kube State Metrics' Custom Resource
State, Resource State Metrics supports recursively generating samples from
nested data structures, all from a single expression. Assuming we have a query,
nested data structures, all from a single expression.

Similarly, in Resource State Metrics, assuming we have a query,
```
o.spec
```
Expand Down Expand Up @@ -530,7 +539,8 @@ test_metric{os="linux", tags="", key_1="value1", key_2="value2", app_i
Note that the order of samples, as well as their labelsets, guaranteed to be
stable across runs.

The `status`, on the other hand, is a set of `metav1.Condition`s, like so:
The `ResourceMetricsMonitor` `status` is a set of `metav1.Condition`s,
like so:

```yaml
status:
Expand Down Expand Up @@ -618,7 +628,11 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
https://storage.googleapis.com/k8s-triage/index.html
-->

N/A.
`k8s-triage` is not relevant since the controller resides out-of-tree.

However, we do plan on covering as much surface as possible with
integration tests before graduating to Stable. These tests usually reside
with the end-to-end tests (see below).

##### e2e tests

Expand Down Expand Up @@ -704,6 +718,36 @@ in back-to-back releases.
- Deprecate the flag
-->

#### Alpha

- Implement foundational ideas discussed in the proposal (till Alpha) to
establish confidence through the proof-of-work.
- This also includes adding support for extending the configuration parsing
techniques via resolvers.

#### Beta

- Add support for admission web-hooks to validate the configuration on
creation/update of the managed resource.
- Add support for `starlark` resolver to help define more complex logic, in
addition to mainstream DSLs seen across the ecosystem for similar use-cases,
such as `jq` and `jsonpath` ([`prometheus-community/json_exporter`],
[`openmcp-project/metrics-operator`], etc.).

[`openmcp-project/metrics-operator`]: https://github.com/openmcp-project/metrics-operator#metric
[`prometheus-community/json_exporter`]: https://github.com/prometheus-community/json_exporter

#### Stable

- Support all relevant metric types that Prometheus' OpenMetrics implementation
allows. This would entail `Counter`s currently, and `Info` and `Stateset` in
the future. See [expfmt.MetricFamilyToOpenMetrics] for more.
- Complete API surface test coverage for all applicable and relevant test
types.
- Support payload compression (through gzip, zstd, etc.).

[expfmt.MetricFamilyToOpenMetrics]: https://pkg.go.dev/github.com/prometheus/common@v0.67.5/expfmt#MetricFamilyToOpenMetrics

### Upgrade / Downgrade Strategy

<!--
Expand Down Expand Up @@ -929,7 +973,7 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

TBD (when targeting beta).
N/A

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ approvers:
- "@richabanker"

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
Expand Down