feat: now possible to only output non-resource related metrics #1823

metacosm · 2023-03-15T16:33:01Z

scrocquesel

It seems that without per-resource metrics we end up totally blind. Would it be possible to keep the metrics per resource kind?

...pport/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetrics.java

metacosm · 2023-03-20T08:11:21Z

It's still possible to get them, but you will need to activate a flag in future versions, i.e. probably starting from v5. This PR should keep the current behavior as-is.
Are per-resource metrics that interesting by themselves, though, or are they a symptom of the SDK needing better, more structured logging and/or tracing?

metacosm · 2023-03-20T08:12:58Z

Also, we could indeed remove the name/namespace tags only

scrocquesel · 2023-03-21T19:46:03Z

> Also, we could indeed remove the name/namespace tags only

+1
I guess the point was to disable the per-resource name/namespace tags because of label cardinality (which fragments the data into many time series), not because the event metrics themselves are uninteresting.
It could still be interesting to observe events flowing.
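
The cardinality concern can be illustrated without any metrics library. The following is a minimal sketch, not Micrometer or JOSDK code: the `SeriesCardinality` class, the `events.received` metric name and the tag values are all hypothetical. It assumes that each unique combination of metric name and tags becomes its own time series, which is how label-based backends generally behave.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: counts one time series per unique (metric name + tags) combination. */
final class SeriesCardinality {
  private final Set<List<String>> series = new HashSet<>();

  /** Records a sample; only a previously unseen tag combination creates a new series. */
  void record(String metric, String... tags) {
    final List<String> key = new ArrayList<>();
    key.add(metric);
    key.addAll(List.of(tags));
    series.add(key);
  }

  int count() {
    return series.size();
  }

  /** Simulates one event per resource, with or without name/namespace tags. */
  static int simulate(int resources, boolean perResource) {
    final SeriesCardinality cardinality = new SeriesCardinality();
    for (int i = 0; i < resources; i++) {
      if (perResource) {
        // name/namespace tags: every resource gets its own series
        cardinality.record("events.received",
            "kind", "ConfigMap", "namespace", "default", "name", "cm-" + i);
      } else {
        // kind-only tags: all resources of a kind share one series
        cardinality.record("events.received", "kind", "ConfigMap");
      }
    }
    return cardinality.count();
  }

  public static void main(String[] args) {
    System.out.println(simulate(1000, true));  // prints 1000
    System.out.println(simulate(1000, false)); // prints 1
  }
}
```

Dropping only the name/namespace tags keeps the event flow observable per kind while keeping the number of series independent of the number of resources.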

metacosm · 2023-03-22T09:52:12Z

Turns out that removing meters without delay doesn't work reliably: cleaning operations happen in different threads and depend heavily on how and when the API server responds to the SDK. I will therefore remove the possibility of removing the meters immediately.
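
Delayed removal can be sketched with the JDK alone. This is a rough illustration under stated assumptions, not the SDK's implementation: meters are represented by plain string ids, and the `DelayedMeterCleaner` class with its method names is hypothetical. Each deletion schedules its own removal task after a fixed delay instead of removing the meter right away.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: removes a meter some time after its resource was deleted. */
final class DelayedMeterCleaner {
  private final Set<String> meters = ConcurrentHashMap.newKeySet();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final long delayMillis;

  DelayedMeterCleaner(long delayMillis) {
    this.delayMillis = delayMillis;
  }

  void register(String meterId) {
    meters.add(meterId);
  }

  boolean isRegistered(String meterId) {
    return meters.contains(meterId);
  }

  /** Called when the resource goes away: the removal is deferred, never immediate. */
  void onResourceDeleted(String meterId) {
    scheduler.schedule(() -> {
      meters.remove(meterId);
    }, delayMillis, TimeUnit.MILLISECONDS);
  }

  /** Stops accepting new work and waits for already scheduled removals to run. */
  void close() {
    scheduler.shutdown();
    try {
      scheduler.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    final DelayedMeterCleaner cleaner = new DelayedMeterCleaner(100);
    cleaner.register("default/my-resource");
    cleaner.onResourceDeleted("default/my-resource");
    cleaner.close();
    System.out.println(cleaner.isRegistered("default/my-resource"));
  }
}
```

The delay gives late events (for example, a reconciliation triggered by finalizer removal) time to arrive before the meter is dropped, at the cost of one scheduled task per deleted resource.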

metacosm · 2023-03-23T14:18:01Z

@scrocquesel made the per-resource collection more granular and added docs, would appreciate a new review 😄

scrocquesel

A few things.

I also want to share my thoughts on the schedule-based cleaner, as I don't know whether creating lots of timers causes memory/perf issues. I think the objective is to keep stale meters until they have been collected one last time by the scraper, so the configured delay should be a little longer than the scrape interval. We could remove meters in batches once they have spent delay seconds in a per-generation collection. I guess that with two generations, we can rotate every delay seconds: stale meters are recorded in the current generation and cleaned upon the next switch.

Ideally, I would love to have some sort of event source and be able to clean all stale meters when the event is raised. The event would be raised by the metrics endpoint route handler after returning metrics to the scraper. I checked the Quarkus route handler implementation, but currently it is not possible to know when metrics are collected. If it makes sense, we may open an issue at Quarkus. JOSDK could expose a signal or method to be wired up in QOSDK.
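
The two-generation idea described above can be sketched as follows. This is only an illustration of the proposal, not code from the PR: the `TwoGenerationCleaner` class, its method names and the string meter ids are hypothetical, and the `remover` callback stands in for whatever actually removes a meter from the registry. A single periodic timer (or a "metrics were scraped" signal, if one existed) would call `rotate()`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical sketch: batches stale meters into two generations rotated by one timer. */
final class TwoGenerationCleaner {
  // meters that became stale since the last rotation
  private Set<String> current = new HashSet<>();
  // meters that have already survived one full rotation period
  private Set<String> previous = new HashSet<>();
  private final Consumer<String> remover;

  TwoGenerationCleaner(Consumer<String> remover) {
    this.remover = remover;
  }

  /** Records a meter whose resource was deleted; nothing is removed yet. */
  synchronized void markStale(String meterId) {
    current.add(meterId);
  }

  /**
   * Meant to be invoked every `delay` seconds: removes the older generation,
   * then ages the current one. A stale meter therefore lives between one and
   * two periods, long enough to be scraped once if delay exceeds the scrape interval.
   */
  synchronized void rotate() {
    previous.forEach(remover);
    previous = current;
    current = new HashSet<>();
  }

  public static void main(String[] args) {
    final List<String> removed = new ArrayList<>();
    final TwoGenerationCleaner cleaner = new TwoGenerationCleaner(removed::add);
    cleaner.markStale("a");
    cleaner.rotate(); // "a" ages, nothing removed yet
    cleaner.markStale("b");
    cleaner.rotate(); // "a" removed, "b" ages
    System.out.println(removed); // prints [a]
  }
}
```

Compared with one scheduled task per deleted resource, this needs a single timer regardless of how many meters go stale, in exchange for a less precise removal time.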

scrocquesel · 2023-03-23T18:32:33Z

docs/documentation/features.md

-is a matter of instantiating `MicrometerMetrics` with the desired values and tell `ConfigurationServiceProvider` about
-it as shown above.
+The micrometer implementation is typically created using one of the provided factory methods which, depending on which
+is used, will return either a ready to use instance or a builder allowing users to customized how the implementation


Shouldn't the builder pattern be consistent, so that we have only one entry point to the configuration? `MicrometerMetrics.builder()`, which would then be self-documented with javadoc. By default, the builder would not configure per-resource collection, and you could switch to the per-resource builder if needed.

    MicrometerMetrics.newMicrometerMetricsBuilder(new LoggingMeterRegistry())
        .collectingMetricsPerResource(perResourceBuilder -> perResourceBuilder.withCleanUpDelayInSeconds(60))
        .build();

Having several entry points lets us provide frequently used configurations without users having to create them with the builder. It also keeps API calls stable: the implementation of these factory methods can change if we change the default behavior, while the semantics of the methods won't. So I'd be in favor of keeping things as they are for now.
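
The trade-off being discussed, named factory methods layered on top of a builder, can be sketched in isolation. Everything here is hypothetical and deliberately not the `MicrometerMetrics` API: `MetricsSettings`, `defaults()`, `perResourceWithDelay(...)` and the builder methods are invented for illustration, and the default values are assumptions.

```java
/** Hypothetical settings object showing factory methods that delegate to a builder. */
final class MetricsSettings {
  private final boolean perResource;
  private final int cleanUpDelaySeconds;

  private MetricsSettings(boolean perResource, int cleanUpDelaySeconds) {
    this.perResource = perResource;
    this.cleanUpDelaySeconds = cleanUpDelaySeconds;
  }

  /** Stable entry point: callers get "the defaults", whatever those become later. */
  static MetricsSettings defaults() {
    return builder().build();
  }

  /** Stable entry point for a frequently used variation. */
  static MetricsSettings perResourceWithDelay(int seconds) {
    return builder().collectingPerResource(true).withCleanUpDelayInSeconds(seconds).build();
  }

  /** Single builder for everything the factory methods do not cover. */
  static Builder builder() {
    return new Builder();
  }

  boolean isPerResource() {
    return perResource;
  }

  int cleanUpDelaySeconds() {
    return cleanUpDelaySeconds;
  }

  static final class Builder {
    private boolean perResource = false; // assumed default for this sketch
    private int cleanUpDelaySeconds = 1; // assumed default for this sketch

    Builder collectingPerResource(boolean perResource) {
      this.perResource = perResource;
      return this;
    }

    Builder withCleanUpDelayInSeconds(int seconds) {
      this.cleanUpDelaySeconds = seconds;
      return this;
    }

    MetricsSettings build() {
      return new MetricsSettings(perResource, cleanUpDelaySeconds);
    }
  }

  public static void main(String[] args) {
    System.out.println(MetricsSettings.defaults().isPerResource());
    System.out.println(MetricsSettings.perResourceWithDelay(60).cleanUpDelaySeconds());
  }
}
```

If the builder's defaults change in a later release, callers of `defaults()` follow along without code changes, while callers who spelled out their configuration through the builder keep exactly what they asked for.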

docs/documentation/features.md

...pport/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetrics.java

metacosm · 2023-03-24T09:00:09Z

> A few things.
>
> I also want to share my thoughts on the schedule-based cleaner, as I don't know whether creating lots of timers causes memory/perf issues. I think the objective is to keep stale meters until they have been collected one last time by the scraper, so the configured delay should be a little longer than the scrape interval. We could remove meters in batches once they have spent delay seconds in a per-generation collection. I guess that with two generations, we can rotate every delay seconds: stale meters are recorded in the current generation and cleaned upon the next switch.

I have to admit that I'm not too familiar with how scraping works, and I'm not sure whether our implementation can actually be notified that a scrape occurred.

> Ideally, I would love to have some sort of event source and be able to clean all stale meters when the event is raised. The event would be raised by the metrics endpoint route handler after returning metrics to the scraper. I checked the Quarkus route handler implementation, but currently it is not possible to know when metrics are collected. If it makes sense, we may open an issue at Quarkus. JOSDK could expose a signal or method to be wired up in QOSDK.

That's an interesting idea. I'm not sure how feasible that is, though.

scrocquesel · 2023-03-24T09:58:43Z

Looking at failedReconciliation, I wonder if we should also reduce cardinality on exception with a builder.includeExceptionTag().

scrocquesel

Found some discrepancies between the docs and the implementation.

docs/documentation/features.md

scrocquesel · 2023-03-24T13:31:29Z

...pport/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetrics.java

@@ -135,40 +167,35 @@ public <T> T timeControllerExecution(ControllerExecution<T> execution) {
      });
      final var successType = execution.successTypeName(result);
      registry
-          .counter(execName + ".success", "controller", name, "type", successType)
+          .counter(execName + SUCCESS_SUFFIX, CONTROLLER, name, TYPE, successType)


The doc says it has resource metadata. I guess it should reuse the tags list above. The same applies to the failure counter below.

I think we should leave it like this for this release to keep the current behavior. I'll fix the doc instead. I'm also unsure whether adding more metadata is useful for this metric (at least, the GVK can be determined by looking at the controller definition from its name).

That said, the metadata is on the timers which doesn't really make sense either and should probably be removed there as well in the next major release.

Makes sense. There is operator.sdk.reconciliations.failed if users want to build alerts on failures for a specific GVK.

metacosm · 2023-03-24T14:04:22Z

> Looking at failedReconciliation, I wonder if we should also reduce cardinality on exception with a builder.includeExceptionTag().

I don't actually want to make things too configurable or too fine-grained with the default implementation. It shouldn't be too hard for people to use that as a starting point for their own implementation.

Fixes #1812.

This is needed because the finalizer will trigger a reconciliation that adds a resource-specific metric.

Co-authored-by: Sébastien CROCQUESEL <[email protected]>

We now still collect GVK information when per-resource collection is switched off.

[skip ci]

Co-authored-by: Sébastien CROCQUESEL <[email protected]>

sonarcloud · 2023-03-27T10:45:28Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
11 Code Smells

32.8% Coverage
0.0% Duplication

...pport/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetrics.java

[skip ci] Co-authored-by: Attila Mészáros <[email protected]>

metacosm self-assigned this Mar 15, 2023

metacosm requested a review from csviri March 15, 2023 16:33

metacosm force-pushed the simple-metrics branch from ae1c020 to de0b4fc Compare March 15, 2023 18:53

metacosm marked this pull request as ready for review March 17, 2023 17:52

scrocquesel reviewed Mar 19, 2023

metacosm requested a review from scrocquesel March 23, 2023 14:18

scrocquesel reviewed Mar 23, 2023

scrocquesel reviewed Mar 24, 2023

metacosm force-pushed the next branch from 29453e8 to 3ec1335 Compare March 27, 2023 09:45

csviri and others added 13 commits March 27, 2023 12:40

feat: bounded cache for informers (#1718)

4e5794f

fix: typo caffein -> caffeine (#1795)

6c5fafa

feat: now possible to only output non-resource related metrics

0ae01d1

Fixes #1812.

refactor: extract abstract test fixture to add tests with variations

f761dbc

fix: add missing annotation

0d75b87

tests: add more test variations

805fbe0

fix: make operator non-static so it's registered once per test subclass

76d25f9

feat: introduce builder for MicrometerMetrics, fix test

dd2d360

fix: exclude more tags when not collecting per resource

57cf246

fix: registry should be per-instance to ensure test independence

b51ea84

fix: make sure we wait a little to ensure event is properly processed

9b14151

fix: make things work on Java 11, format

54ddeec

fix: also clean metrics on finalizer removal

feb8b06

This is needed because the finalizer will trigger a reconciliation that adds a resource-specific metric.

metacosm and others added 15 commits March 27, 2023 12:41

fix: format

edc9530

refactor: extract common tags

6d14663

Co-authored-by: Sébastien CROCQUESEL <[email protected]>

feat: make per-resource collecting finer-grained

916d849

We now still collect GVK information when per-resource collection is switched off.

fix: do not create tag for group if not present

bc5b5f4

fix: remove unreliable no-delay implementation, defaulting to 1s delay

445c891

refactor: renamed & documented factory methods to make things clearer

7565e1b

docs: updated metrics section for code changes

e22dc75

feat: avoid emitting tag on empty value

9c8d77e

docs: update

2624f7b

fix: format

3fd613e

[skip ci]

refactor: use Tag more directly, avoid unneeded work, use constants

ee88028

fix: change will happen instead of might

40441f0

docs: add missing timer

3bf2045

Co-authored-by: Sébastien CROCQUESEL <[email protected]>

docs: fix wrong & missing information

70462a8

refactor: add constants

4505761

metacosm force-pushed the simple-metrics branch from 4f1589b to 4505761 Compare March 27, 2023 10:42

csviri reviewed Mar 27, 2023

csviri approved these changes Mar 27, 2023

fix: wording

31d1326

[skip ci] Co-authored-by: Attila Mészáros <[email protected]>

metacosm merged commit 0b208a2 into next Mar 27, 2023

metacosm deleted the simple-metrics branch March 27, 2023 13:47
