HDDS-7576. Prometheus metrics do not remove stale metrics until restart #4057

xBis7 · 2022-12-07T16:45:03Z

What changes were proposed in this pull request?

For Prometheus, if a metric is unregistered and no longer pushed to the sink, it still exists in the map in PrometheusMetricsSink and is still presented to the user. In this PR, we store all the metrics pushed to the sink in a second map, which is cleared every time we call flush(). This way, a stale metric that is no longer pushed to the sink is not presented to the user. This is the same approach that was followed for hadoop in this PR. The other issues described in the hadoop PR seem to have been resolved previously for ozone.
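As a rough illustration of the two-map idea described above, here is a minimal sketch. It is not the actual PrometheusMetricsSink code: the class `StaleAwareSink`, its fields, and the `putMetric` and `render` methods are hypothetical names chosen for this example, and the real sink may swap or clear its maps differently.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a Prometheus sink; not the actual Ozone class. */
final class StaleAwareSink {
  // Snapshot presented to the user; replaced wholesale on every flush().
  private Map<String, String> served = new TreeMap<>();
  // Metrics pushed to the sink since the last flush().
  private Map<String, String> pending = new TreeMap<>();

  /** Records one metric line; assumed to be called on every publish cycle. */
  void putMetric(String key, String line) {
    pending.put(key, line);
  }

  /** Promotes the freshly pushed metrics and starts an empty collection round. */
  void flush() {
    served = pending;
    pending = new TreeMap<>();
  }

  /** Returns the metrics currently presented, one per line, sorted by key. */
  String render() {
    return String.join("\n", served.values());
  }

  public static void main(String[] args) {
    StaleAwareSink sink = new StaleAwareSink();

    // First cycle: two registered metrics are pushed.
    sink.putMetric("key_count", "key_count 10");
    sink.putMetric("pipeline_bytes", "pipeline_bytes 42");
    sink.flush();
    System.out.println(sink.render());

    // Second cycle: pipeline_bytes was unregistered, so only key_count is
    // pushed. Its value did not change, yet it is still presented.
    sink.putMetric("key_count", "key_count 10");
    sink.flush();
    System.out.println(sink.render());
  }
}
```

In this sketch a metric disappears only when it stops being pushed. A metric whose value simply did not change is pushed again on the next cycle and therefore stays visible.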

This problem was uncovered and discussed here: #3781 (comment)

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7576

How was this patch tested?

This patch was tested with a new unit test. It was also tested manually in docker for the case discussed in #3781. The metrics that become stale are no longer available after a short period of time.

kerneltime · 2022-12-12T17:14:44Z

@hemanthboyina

xBis7 · 2022-12-13T16:25:15Z

@kerneltime Most methods from TestPrometheusMetricsSink actually belong to an integration test.

This class should only be testing PrometheusMetricsSink, but instead it's executing a bunch of classes from the hadoop-commons jar. testNamingCamelCase(), testNamingRocksDB(), testNamingPipeline() and testNamingSpaces() are unit tests for PrometheusMetricsSink; the rest of the methods don't belong in this class.

Also, if you try to publish the metrics more than once, you get "Prometheus has a full queue and can't consume the given metrics." and the second publish never gets processed.

I think we should fix any issues, and clean up and refactor TestPrometheusMetricsSink, as part of this PR.

xBis7 · 2022-12-15T18:04:34Z

@smengcl This is the PR with the test changes we were discussing.

kerneltime · 2022-12-16T08:15:53Z

@xBis7 I have not tried out the change against a UI, but I have a question. Let's say no more objects are being created in a bucket. Will this drop reporting the object count as a metric post flush? What does a dashboard such as Grafana report for the metric?

xBis7 · 2022-12-16T08:28:52Z

> Let's say no more objects are being created in a bucket. Will this drop reporting the object count as a metric post flush?

@kerneltime No, it won't. Even if a metric's value has not been updated, the metric is still pushed to the sink and therefore presented. This PR removes only the metrics that get unregistered.

It wouldn't make sense to present only the metrics whose value was updated, because we would end up with a different set of metrics every time and it would be really hard to track changes.

This issue was discovered in #3781, where some metrics were unregistered after certain operations but PrometheusMetricsSink still had them stored in the map and presented them as if they were actively waiting for a change.

kerneltime · 2022-12-19T08:02:11Z

As part of a separate review, do you want to look into whether writeMetrics should be made thread-safe?

xBis7 · 2022-12-19T14:47:14Z

@kerneltime Thanks for reviewing this. I made all reads and writes synchronized for thread-safety.
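For readers unfamiliar with the pattern, the following is a minimal sketch of what synchronizing all reads and writes on a sink's maps could look like. It assumes plain `synchronized` methods on a hypothetical `SynchronizedSink` class; the names `putMetric`, `servedCount`, and `demo` are made up for this example and the merged change may be structured differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sink whose map accesses all share one monitor lock. */
final class SynchronizedSink {
  private final Map<String, String> pending = new HashMap<>();
  private final Map<String, String> served = new HashMap<>();

  /** Write path: records a metric pushed during the current cycle. */
  synchronized void putMetric(String key, String line) {
    pending.put(key, line);
  }

  /** Replaces the served snapshot with the metrics pushed since last flush. */
  synchronized void flush() {
    served.clear();
    served.putAll(pending);
    pending.clear();
  }

  /** Read path: number of metrics currently presented. */
  synchronized int servedCount() {
    return served.size();
  }

  /** Two writer threads push 1000 distinct metrics each, then one flush. */
  static int demo() {
    SynchronizedSink sink = new SynchronizedSink();
    Thread[] writers = new Thread[2];
    for (int t = 0; t < writers.length; t++) {
      final String prefix = "writer" + t + "_metric";
      writers[t] = new Thread(() -> {
        for (int i = 0; i < 1000; i++) {
          sink.putMetric(prefix + i, prefix + i + " 1");
        }
      });
      writers[t].start();
    }
    try {
      for (Thread writer : writers) {
        writer.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for writers", e);
    }
    sink.flush();
    return sink.servedCount();
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Because every method takes the same lock, concurrent writers cannot corrupt the underlying `HashMap`, and a reader never observes a half-finished flush.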

kerneltime · 2022-12-21T22:51:20Z

Thank you @xBis7 for your contribution!

xBis7 force-pushed the HDDS-7576 branch from 1550424 to bdc20bd Compare December 7, 2022 17:32

kerneltime self-requested a review December 12, 2022 17:14

kerneltime approved these changes Dec 19, 2022

xBis7 added 11 commits December 19, 2022 15:32

refreshing prom sink on flush to remove stale metrics

2f7f82e

testRemovingStareMetricsOnFlush, new metrics stored to new string

47ffb06

sink.flush()

2e575a9

cleanup

4fc0c23

test method javadoc

6e32e19

test fixed

d0be04d

publish metrics only once

a532dc0

refactoring TestPrometheusMetricsSink - new class under integration

b9da34d

comments fixed

3269b98

moving TestPrometheusMetrics and renaming

3fe35cf

make writeMetrics synchronized

5ec3880

xBis7 force-pushed the HDDS-7576 branch from 337b981 to 5ec3880 Compare December 19, 2022 13:43

synchronize reads and writes to sink maps

9a79e8c

kerneltime merged commit c40cb07 into apache:master Dec 21, 2022
