Skip to content

Conversation

@xBis7
Copy link
Contributor

@xBis7 xBis7 commented Dec 7, 2022

What changes were proposed in this pull request?

For Prometheus, if a metric is unregistered and not pushed to the sink any more it will still exist in the map in PrometheusMetricsSink and it will be presented to the user. In this PR, we are storing all the metrics pushed to the sink to a second map which will be cleared every time we call flush(). This way, if a metric is stale and not pushed to the sink, it will not be presented to the user. This implementation is the same that was followed for hadoop in this PR. The other issues described in the hadoop PR seem to have been previously resolved for ozone.

This problem was uncovered and discussed here: #3781 (comment)

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7576

How was this patch tested?

This patch was tested with a new unit test. It was also tested manually in docker for the case discussed in #3781. The metrics that become stale are no longer available after a short period of time.

@kerneltime
Copy link
Contributor

@hemanthboyina

@xBis7
Copy link
Contributor Author

xBis7 commented Dec 13, 2022

@kerneltime Most methods from TestPrometheusMetricsSink actually belong to an integration test.

This class should only be testing PrometheusMetricsSink but instead it's executing a bunch of classes from the hadoop-commons jar. testNamingCamelCase(), testNamingRocksDB(), testNamingPipeline(), testNamingSpaces() are unit tests for PrometheusMetricsSink, the rest of the methods don't belong in this class.

Also, if you try to publish the metrics more than once, you get Prometheus has a full queue and can't consume the given metrics. and the second publish never gets processed.

I think we should fix any issues, cleanup and refactor TestPrometheusMetricsSink as part of this PR.

@xBis7
Copy link
Contributor Author

xBis7 commented Dec 15, 2022

@smengcl This is the PR with the test changes we were discussing.

@kerneltime
Copy link
Contributor

@xBis7 I have not tried out the change against a UI, I have a question. Let's say no more objects are being created in a bucket. Will this drop reporting the object count as a metric post flush? What does a dashboard such as graphana report for the metric?

@xBis7
Copy link
Contributor Author

xBis7 commented Dec 16, 2022

Let's say no more objects are being created in a bucket. Will this drop reporting the object count as a metric post flush?

@kerneltime No, it won't. Although, a metric might not have an updated value, it will still be pushed to the sink and therefore presented. This PR removes only the metrics that get unregistered.

It wouldn't make sense to present only the metrics that get an update on their value because we would end up with different metrics every time and it would be really hard to track changes.

This issue was discoreved in #3781 where there were some metrics that after some operations we would unregister them but PrometheusMetricsSink would still have them stored in the map and present them as if they were actively waiting for a change.

@kerneltime
Copy link
Contributor

As part of a separate review do you want to look into if writeMetrics should be made threadsafe?

@xBis7
Copy link
Contributor Author

xBis7 commented Dec 19, 2022

@kerneltime Thanks for reviewing this. I made all reads and writes synchronized for thread-safety.

@kerneltime kerneltime merged commit c40cb07 into apache:master Dec 21, 2022
@kerneltime
Copy link
Contributor

Thank you @xBis7 for your contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants