Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: now possible to only output non-resource related metrics #1823

Merged
merged 29 commits into from
Mar 27, 2023

Conversation

metacosm
Copy link
Collaborator

Fixes #1812.

@metacosm metacosm self-assigned this Mar 15, 2023
@metacosm metacosm requested a review from csviri March 15, 2023 16:33
@metacosm metacosm marked this pull request as ready for review March 17, 2023 17:52
Copy link
Contributor

@scrocquesel scrocquesel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that without per resource metrics we ended totally blind. Would it be possible to keep the metrics per resource kind ?

@metacosm
Copy link
Collaborator Author

It's still possible to get them but you would need to activate a flag in future versions, i.e. staring from v5 probably. This PR should keep the current behavior as-is.
Are per-resource metrics that interesting by themselves or are they symptomatic of the SDK needing maybe better, more structured logging and/or tracing, though?

@metacosm
Copy link
Collaborator Author

Also, we could indeed remove the name/namespace tags only

@scrocquesel
Copy link
Contributor

Also, we could indeed remove the name/namespace tags only

+1
I guess the issue was to disable per-resource name/namespace due to label cardinality (which produces time series fragmentation) not because the event metrics by themself are not much interesting.
It still could be interesting to observe events flowing.

@metacosm
Copy link
Collaborator Author

Turns out that removing meters without delay doesn't work reliably due to the fact that cleaning operations open in different threads and heavily depend on how / when the API server responds to the SDK so I will remove the possibility of immediately removing the meters.

@metacosm
Copy link
Collaborator Author

@scrocquesel made the per-resource collection more granular and added docs, would appreciate a new review 😄

Copy link
Contributor

@scrocquesel scrocquesel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some few things.

Also want to share my thoughts on per schedule cleaner as I don't know if creating lots of timer has memory/perf issues. I think the objective is to keep stale meters until they have been collected one last time by the scrapper. Thus the configured delay should be a little more than the scrapping delay. We may remove meters by batch after they have spend delay seconds in a per generation collection. I guess with two generations, we can rotate every delay seconds where stale meters are recorded and clean them upon next switch.

Ideally, I would love to have some sort of a event souurce and being able to clean all stale meters when the event is raised. The event would be raised by the metric endpoint route handler after returning metrics to the scrapper. I checked quarkus route handler implementation but currenlty it is not possible to know when metrics are collected. If it make sense we may open an issue at quarkus. JOSDK could expose the a signal or method to be wired up in QOSDK.

is a matter of instantiating `MicrometerMetrics` with the desired values and tell `ConfigurationServiceProvider` about
it as shown above.
The micrometer implementation is typically created using one of the provided factory methods which, depending on which
is used, will return either a ready to use instance or a builder allowing users to customized how the implementation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't builder pattern be consistent so we have only one entry point to the configuration ? MicrometerMetrics.builder() and then be self documented with javadoc. By default, the builder is not configuring per resource and you can switch to the per resource builder if needed.

MicrometerMetrics.newMicrometerMetricsBuilder(new LoggingMeterRegistry())
    .collectingMetricsPerResource(perResourceBuilder ->
        perResourceBuilder.withCleanUpDelayInSeconds(60))
    .build();

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having several entry points allows to provide often used configurations without users having to create them with the builder. It also allows to have stable API calls since these factory method implementations can be changed if we change the default behavior while the semantics of the method won't so I'd be in favor of keeping things like they are for now.

docs/documentation/features.md Outdated Show resolved Hide resolved
@metacosm
Copy link
Collaborator Author

Some few things.

Also want to share my thoughts on per schedule cleaner as I don't know if creating lots of timer has memory/perf issues. I think the objective is to keep stale meters until they have been collected one last time by the scrapper. Thus the configured delay should be a little more than the scrapping delay. We may remove meters by batch after they have spend delay seconds in a per generation collection. I guess with two generations, we can rotate every delay seconds where stale meters are recorded and clean them upon next switch.

I have to admit that I'm not too familiar with how the scrapping work and I'm not sure whether our implementation can actually be notified that scrapping occurred.

Ideally, I would love to have some sort of a event souurce and being able to clean all stale meters when the event is raised. The event would be raised by the metric endpoint route handler after returning metrics to the scrapper. I checked quarkus route handler implementation but currenlty it is not possible to know when metrics are collected. If it make sense we may open an issue at quarkus. JOSDK could expose the a signal or method to be wired up in QOSDK.

That's an interesting idea. I'm not sure how feasible that is, though.

@scrocquesel
Copy link
Contributor

Looking at failedReconciliation, I wonder if we should also reduce cardinality on exception with a builder.includeExceptionTag().

Copy link
Contributor

@scrocquesel scrocquesel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Find some discrepancies between doc and impl

docs/documentation/features.md Show resolved Hide resolved
@@ -135,40 +167,35 @@ public <T> T timeControllerExecution(ControllerExecution<T> execution) {
});
final var successType = execution.successTypeName(result);
registry
.counter(execName + ".success", "controller", name, "type", successType)
.counter(execName + SUCCESS_SUFFIX, CONTROLLER, name, TYPE, successType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc says it has resoure metadata. I guess it should reuse the tags list above. Same apply for the failure below

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should leave it like this for this release to keep the current behavior. I'll fix the doc instead. I'm also unsure whether adding more metadata is useful for this metric (at least, the GVK can be determined by looking at the controller definition from its name).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, the metadata is on the timers which doesn't really make sense either and should probably be removed there as well in the next major release.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense there is operator.sdk.reconciliations.failed if the user want to build alerts on failure for a specific GVK

@metacosm
Copy link
Collaborator Author

Looking at failedReconciliation, I wonder if we should also reduce cardinality on exception with a builder.includeExceptionTag().

I don't actually want to make things too configurable or too fine-grained with the default implementation. It shouldn't be too hard for people to use that as a starting point for their own implementation.

@sonarcloud
Copy link

sonarcloud bot commented Mar 27, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 11 Code Smells

32.8% 32.8% Coverage
0.0% 0.0% Duplication

[skip ci]

Co-authored-by: Attila Mészáros <[email protected]>
@metacosm metacosm merged commit 0b208a2 into next Mar 27, 2023
@metacosm metacosm deleted the simple-metrics branch March 27, 2023 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants