
KEP-4785: Move RSM to beta #5766

Open
rexagod wants to merge 3 commits into kubernetes:master from
rexagod:rsm-beta-grad


@rexagod
Member

@rexagod rexagod commented Jan 4, 2026

  • One-line PR description: Alpha features were implemented a while ago, and after discussing the future of this at length with @chrischdi and @mrueg, I believe the proposed BETA criteria make sense.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jan 4, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 4, 2026
@rexagod rexagod mentioned this pull request Jan 4, 2026
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jan 4, 2026
Alpha features were implemented a while ago, and after discussing the
future of this at length with other maintainers and stakeholders, I
believe the proposed BETA criteria make sense.

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
@haarchri

@rexagod, thanks for the BETA enhancement proposal.

We would like to explicitly call out that, from the Crossplane community’s perspective, having the RSM repository hosted under the kubernetes-sigs org would be highly valuable, ideally by the end of January or start of February 2026.

This would:

  • Give us confidence to commit time and actively contribute to RSM
  • Make it easier to gather structured feedback from the Crossplane community
  • Provide a neutral, shared ownership model rather than depending on a personal repository
  • Help us collaboratively build a solid, production-grade RSM over time

Regarding resolvers: we believe CEL is a strong default and will cover the majority of use cases well (it has already seen meaningful testing in Crossplane contexts with XRs, Claims, and Managed Resources).
At the same time, we see that some users will need more complex or expressive metric logic.
We agree that this likely means supporting additional resolvers over time (e.g., embedded Go or Starlark), and we are engaged and ready to help drive that evolution as needs become clearer. (We can also add more user stories around different personas using RSM over time.)

Importantly, we are intentionally holding back further independent efforts in projects like crossplane-state-metrics (crossplane/crossplane#6865), because we would strongly prefer to converge on a single, shared solution in the ecosystem rather than fragment efforts across multiple implementations.

We understand that moving the repo (https://github.com/rexagod/resource-state-metrics) into kubernetes-sigs introduces additional overhead (CI, image publishing, release processes).
From our side, it is fine if the initial move happens before all automation is fully in place, as long as there is clear intent and a path (e.g., tracking issues) to iterate on this post-move.

To accelerate feedback, we currently maintain a fork (https://github.com/haarchri/resource-state-metrics) where we have added CI, image publishing, and release automation (already shared with a few Crossplane community folks).
This allows folks to start using RSM end-to-end today and gives us fast, practical feedback from real integrations, which we intend to feed back upstream.

Happy to help where possible and looking forward to contributing once the repo setup enables broader community involvement.

expressed using `gauge`s.
- All generated metrics are hardcoded to `gauge`s by design, as metrics
backends in the ecosystem do not support some OpenMetrics-specified
metric types, such as `Info` and `StateSet`, but more importantly,
Member


Could this be part of an auto-negotiation on the protocol level?

Member Author


Right, I pushed the same negotiation behavior as KSM to RSM, which respects the selected KEP content above. However, one could argue that we must maintain compatibility with Prometheus' progress on OpenMetrics and thus, at least, support Counters as well right now.

The only reason I didn't do this, and instead hard-pinned to Gauges, was that it would open us up to supporting all relevant metric types that RMMs allow users to construct, which in the future will, at the very least, entail Info and StateSet types.

But I realize this makes sense, and the complexity is worth the benefits the community gets out of RSM, so I'll add this to the TODO list before we go stable. Please let me know if that makes sense, or if you'd like a different approach.
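One way to picture the protocol-level auto-negotiation raised above is a minimal Go sketch, assuming the scraper's `Accept` header decides whether OpenMetrics-only types (Info, StateSet) could be emitted natively, with everything else staying pinned to gauges. `negotiateFormat` and `supportsInfoAndStateSet` are hypothetical helpers, not RSM's or KSM's actual code, and a real implementation would also honor q-values.

```go
package main

import (
	"fmt"
	"strings"
)

// Content types for the two exposition formats a scraper may ask for.
const (
	formatText        = "text/plain; version=0.0.4; charset=utf-8"
	formatOpenMetrics = "application/openmetrics-text; version=1.0.0; charset=utf-8"
)

// negotiateFormat (hypothetical) returns the OpenMetrics content type only
// when the scraper's Accept header lists it; otherwise it falls back to the
// classic text format. q-values are ignored to keep the sketch short.
func negotiateFormat(accept string) string {
	for _, part := range strings.Split(accept, ",") {
		mediaType := strings.TrimSpace(strings.SplitN(part, ";", 2)[0])
		if strings.EqualFold(mediaType, "application/openmetrics-text") {
			return formatOpenMetrics
		}
	}
	return formatText
}

// supportsInfoAndStateSet (hypothetical) reports whether OpenMetrics-only
// types could be emitted natively; under the text format they would have
// to stay expressed as gauges.
func supportsInfoAndStateSet(contentType string) bool {
	return contentType == formatOpenMetrics
}

func main() {
	ct := negotiateFormat("application/openmetrics-text;version=1.0.0,text/plain;version=0.0.4;q=0.5")
	fmt.Println(ct, supportsInfoAndStateSet(ct))
}
```

Keeping gauges as the fallback means scrapers that never ask for OpenMetrics see exactly the current hard-pinned behavior.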

Member Author


- Add support for `starlark` resolver to help define more complex logic,
in addition to mainstream DSLs seen across the ecosystem for similar
use-cases, such as `jq` and [`jsonpath`].
- Explore areas that could benefit from (gzipped) compression, such as
Member


Maybe consider supporting zstd as well, if it makes sense?

the configuration field in the managed resource, or the payload
endpoint(s).

[`jsonpath`]: https://github.com/openmcp-project/metrics-operator#metric
Member


@@ -392,31 +395,35 @@ generation.
- At its core, the controller relies on its managed resource,
`ResourceMetricsMonitor` to fetch the metric generation configuration. Parts
Member


Will we allow supplying configuration as a regular config file as well? I wonder if there's a use case for running RSM outside of the cluster.

Member Author

@rexagod rexagod Feb 24, 2026


Folks can use semantics similar to those in `make local` to run this locally. That being said, doing away with RMMs and relying on file blobs would strip away the features the controller lifecycle adds (e.g., status updates), which users benefit from. RSM could mirror such a blob to an RMM resource, but I'm not sure the file-only direction would be worth the added complexity.

@rexagod
Member Author

rexagod commented Feb 15, 2026

👋🏼 Thank you for the reviews here, folks!

@haarchri Apologies for the delay here; I'm currently working on this.

I've also managed to allocate some spare cycles in the upcoming week to dedicate to this, and I'll make sure the migration, along with planned updates, happens in the same time period. I'll also ensure your thoughts above are reflected in the patches.

@rexagod
Member Author

rexagod commented Feb 23, 2026

FYI @haarchri I've migrated the repository to https://github.com/kubernetes-sigs/resource-state-metrics.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 24, 2026
@rexagod rexagod requested a review from mrueg February 25, 2026 09:40
@rexagod
Member Author

rexagod commented Mar 3, 2026

(bump)

Also, please note that I've implemented everything I had in mind for both BETA and GA of RSM; PTAL at the TODO section. We should be good to graduate this to BETA in this release and use the soak time between BETA and STABLE graduation to gather stakeholders' input, both community and downstream, to build confidence.

#### Stable

- Consider supporting all relevant metric types that Prometheus' OpenMetrics
implementation allows. This would entail `Counter`s currently, and `Info` and
Member


Is there a use case for counters in RSM?

Member Author


I wanted RSM to be as close to the OM spec as possible, and to respect the nuances it entails.

For example, with CRS, a counter will be created as a gauge with no accompanying series. Not only will a `_total` metric advertise its type as gauge in its metadata (even though the help text may say otherwise, increasing the overall ambiguity), but the user may or may not create a `_created` series to accompany that metric, which helps Prometheus calculate precise rates across resets.

RSM does not allow the user to specify a type; instead, it looks at the metric name's suffix and, if it maps to a supported OM type, does everything the OM spec advises for that type.
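The suffix-driven behavior described above could look roughly like the following minimal Go sketch. `inferType` and `renderCounter` are hypothetical helpers, only the `_total` to counter rule is shown (every other name stays a gauge), and the metric name and timestamp in `main` are made up for illustration; RSM's actual code and golden tests may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// omType is a hypothetical enum for the types inferred from a metric name.
type omType string

const (
	typeGauge   omType = "gauge"
	typeCounter omType = "counter"
)

// inferType maps a user-supplied metric name to an OpenMetrics type by its
// suffix. Per OpenMetrics, a counter's family name omits the `_total` suffix,
// which only appears on the sample. Anything else falls back to a gauge.
func inferType(name string) (family string, t omType) {
	if strings.HasSuffix(name, "_total") {
		return strings.TrimSuffix(name, "_total"), typeCounter
	}
	return name, typeGauge
}

// renderCounter writes a counter family in OpenMetrics text form, including
// the `_created` sample that lets the scraper account for resets precisely.
func renderCounter(family, help string, value float64, createdUnix int64) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s counter\n", family)
	fmt.Fprintf(&b, "# HELP %s %s\n", family, help)
	fmt.Fprintf(&b, "%s_total %g\n", family, value)
	fmt.Fprintf(&b, "%s_created %d\n", family, createdUnix)
	return b.String()
}

func main() {
	family, t := inferType("example_reconciles_total")
	if t == typeCounter {
		fmt.Print(renderCounter(family, "Reconciles observed.", 3, 1767484800))
	}
	fmt.Print("# EOF\n")
}
```

Deriving the type from the name, rather than from a user-supplied field, keeps the `# TYPE` metadata and the `_total`/`_created` samples consistent by construction.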

https://github.com/kubernetes-sigs/resource-state-metrics/blob/5383f7da5dfd6f1fe60755c82302ecc92906a101/tests/golden/unstructured/resourcemetricsmonitor-counter.yaml

addition to mainstream DSLs seen across the ecosystem for similar use-cases,
such as `jq` and `jsonpath` ([`prometheus-community/json_exporter`],
[`openmcp-project/metrics-operator`], etc.).
- Explore areas that could benefit from (gzipped and zstd) compression, such as
Member


Since this is a graduation criterion, it should be concrete. Either you think implementing compression is mandatory for GA and add it as a requirement here, or it is something for the future that is not mentioned as a criterion.

Member Author


Right, I'll put this up for GA.

Comment on lines +744 to +746
- Consider supporting all relevant metric types that Prometheus' OpenMetrics
implementation allows. This would entail `Counter`s currently, and `Info` and
`Stateset` in the future. See [expfmt.MetricFamilyToOpenMetrics] for more.
Member


Similar comment to the one I made on https://github.com/kubernetes/enhancements/pull/5766/changes#r2945360746. Supporting OpenMetrics types could be done outside of this KEP's scope.

Generally, we also don't list additional functionality as GA criteria, since the period before GA is meant for testing and gathering user feedback.

Member Author


I pushed for OM type support, along with every other planned effort, to land as early as possible, so that the soak time before GA could help us test most of the feature sets that CRS already had a blueprint for, and which users would expect from RSM out of the box at GA, while letting us move fast since no GA-level stability guarantees apply yet. Shipping these features early reinforces what RSM considers integral to its functionality, while less important efforts have been moved beyond GA.

This can be seen at play in https://github.com/crossplane-contrib/resource-state-metrics, where we resolved bugs and addressed feature requests to work towards the more feature-complete GA that the community is looking forward to. There are golden tests covering all expected behaviors based on the user feedback so far, and I expect their number to grow (even before GA) thanks to the soak time we saved by shipping these features earlier in the cycle.

PS: RSM supports the relevant OM types as of now; please see kubernetes-sigs/resource-state-metrics@81dfeaa#diff-bd1b0d7b07d02221f7aa583d7fe1f0d1822d6476f452cdb9f071c0d4c81f53faR74-R89.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.


5 participants