prometheus duplicate metrics #1941

Closed · jayasai470 opened this issue Jan 30, 2023 · 5 comments · Fixed by #1953
Labels: bug Something isn't working

jayasai470 commented Jan 30, 2023
The Prometheus metrics endpoint returns duplicate metrics and does not update them properly for the SimpleWebServer API endpoints.

Steps to reproduce
I created a sample repo which reproduces this issue: https://github.com/jayasai470/opentelemetry-cpp-sample

What is the expected behavior?
Get metrics without duplicates

What is the actual behavior?

# HELP http_server_duration_count 1.8.1
# TYPE http_server_duration_count counter
http_server_duration_count{method="GET",status="200",path="/api/health"} 3 1674066565733
http_server_duration_count{method="GET",status="200",path="/api/test"} 2 1674066565733
# HELP http_server_duration_count 1.8.1
# TYPE http_server_duration_count counter
http_server_duration_count{method="GET",status="200",path="/api/test"} 2 1674066566733
http_server_duration_count{method="GET",status="200",path="/api/health"} 3 1674066566733
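The repro configuration is not quoted here, but this behaviour corresponds to wiring the PrometheusExporter behind a PeriodicExportingMetricReader, roughly as in the 1.8.x Prometheus example. A minimal sketch (option values and names are illustrative and may differ from the sample repo):

#include <chrono>
#include <memory>
#include <opentelemetry/exporters/prometheus/exporter.h>
#include <opentelemetry/sdk/metrics/export/periodic_exporting_metric_reader.h>
#include <opentelemetry/sdk/metrics/meter_provider.h>

namespace metric_sdk       = opentelemetry::sdk::metrics;
namespace metrics_exporter = opentelemetry::exporter::metrics;

void InitMetrics()
{
  metrics_exporter::PrometheusExporterOptions opts;
  opts.url = "localhost:9464";  // scrape endpoint served by the exporter's exposer

  // In 1.8.x the Prometheus exporter is a push exporter, so it is driven by a
  // periodic reader; every tick pushes another cumulative snapshot to it.
  std::unique_ptr<metric_sdk::PushMetricExporter> exporter{
      new metrics_exporter::PrometheusExporter(opts)};

  metric_sdk::PeriodicExportingMetricReaderOptions reader_opts;
  reader_opts.export_interval_millis = std::chrono::milliseconds(1000);
  std::unique_ptr<metric_sdk::MetricReader> reader{
      new metric_sdk::PeriodicExportingMetricReader(std::move(exporter), reader_opts)};

  auto provider = std::shared_ptr<metric_sdk::MeterProvider>(new metric_sdk::MeterProvider());
  provider->AddMetricReader(std::move(reader));
}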
@jayasai470 jayasai470 added the bug Something isn't working label Jan 30, 2023
esigo (Member) commented Jan 30, 2023

@jayasai470 thanks for providing the repro steps.
There is no issue on the Prometheus side. The same meter data is being exported at every interval configured for the periodic metric reader.
CC @lalitb

@lalitb lalitb self-assigned this Jan 30, 2023
lalitb (Member) commented Jan 31, 2023

@jayasai470 - The reason for these duplicates is that the Prometheus exporter only supports cumulative temporality. This means that if the application/instrumentation generates no measurements after a particular time, the periodic exporter will keep exporting the total aggregated requests. As mentioned here:

Cumulative temporality means that successive data points repeat the starting timestamp. For example, from start time T0, cumulative data points cover time ranges (T0, T1], (T0, T2], (T0, T3], and so on.
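To make this concrete, here is an illustrative scrape sequence for a cumulative counter (the values are made up; only the semantics matter). Each data point is the running total since the start time T0, so an idle interval simply repeats the previous total:

# scrape at T1: covers (T0, T1]
http_server_duration_count{method="GET",path="/api/health"} 3
# scrape at T2, no new requests since T1: covers (T0, T2], same total again
http_server_duration_count{method="GET",path="/api/health"} 3
# scrape at T3, two more requests arrived: covers (T0, T3]
http_server_duration_count{method="GET",path="/api/health"} 5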

lalitb (Member) commented Feb 2, 2023

Response from @jayasai470 on Slack:

I have moved the counter initialization to a separate method, thinking the same counter would be retrieved even when it is initialized multiple times:
jayasai470/opentelemetry-cpp-sample@7695e2f
With this change I can see the counter getting incremented properly, but I still see values like the ones below; I assume these are temporal values (a sketch of this counter-reuse pattern follows the output below).

http_server_request_count{method="GET",path="/info"} 1 1675277982770
http_server_request_count{method="GET",path="/info"} 2 1675277991773
http_server_request_count{method="GET",path="/info"} 2 1675277993773
# HELP exposer_transferred_bytes_total Transferred bytes to metrics services
# TYPE exposer_transferred_bytes_total counter
exposer_transferred_bytes_total 0
# HELP exposer_scrapes_total Number of times metrics were scraped
# TYPE exposer_scrapes_total counter
exposer_scrapes_total 0
# HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
# TYPE exposer_request_latencies summary
exposer_request_latencies_count 0
exposer_request_latencies_sum 0
exposer_request_latencies{quantile="0.5"} Nan
exposer_request_latencies{quantile="0.9"} Nan
exposer_request_latencies{quantile="0.99"} Nan
# HELP http_server_request_count 1.8.1
# TYPE http_server_request_count counter
http_server_request_count{method="GET",path="/info"} 1 1675277982770
# HELP http_server_request_count 1.8.1
# TYPE http_server_request_count counter
http_server_request_count{method="GET",path="/info"} 2 1675277991773
# HELP http_server_request_count 1.8.1
# TYPE http_server_request_count counter
http_server_request_count{method="GET",path="/info"} 2 1675277992773
# HELP http_server_request_count 1.8.1
# TYPE http_server_request_count counter
http_server_request_count{method="GET",path="/info"} 2 1675277993773
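(For reference, a minimal sketch of that counter-reuse pattern with the opentelemetry-cpp metrics API; the meter, instrument, and label names are illustrative and not the exact code in the linked commit.)

#include <map>
#include <string>
#include <opentelemetry/common/key_value_iterable_view.h>
#include <opentelemetry/metrics/provider.h>

void CountRequest(const std::string &method, const std::string &path)
{
  // The counter is created once and reused for every request, so all
  // increments land on a single cumulative series.
  static auto counter = []() {
    auto provider = opentelemetry::metrics::Provider::GetMeterProvider();
    auto meter    = provider->GetMeter("sample_server", "1.0.0");
    return meter->CreateUInt64Counter("http_server_request_count");
  }();

  std::map<std::string, std::string> labels{{"method", method}, {"path", path}};
  counter->Add(1, opentelemetry::common::KeyValueIterableView<decltype(labels)>{labels});
}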

FYI, below is the response from our Node.js SDK:

http_server_request_count_total{method="GET",path="/info"} 2 1675274675379

--> I don't see this value duplicated here, unlike the cumulative-temporality behaviour you were mentioning in the comments

# HELP http_server_request_count_total Total number of HTTP requests
# UNIT http_server_request_count_total requests
# TYPE http_server_request_count_total counter
http_server_request_count_total{method="GET",path="/info"} 2 1675274675379
# HELP http_server_response_count_total Total number of HTTP responses
# UNIT http_server_response_count_total responses
# TYPE http_server_response_count_total counter
http_server_response_count_total{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest"} 2 1675274675379
# HELP http_server_abort_count_total Total number of data transfers aborted
# UNIT http_server_abort_count_total requests
# TYPE http_server_abort_count_total counter
# HELP http_server_duration The duration of the inbound HTTP request
# UNIT http_server_duration ms
# TYPE http_server_duration histogram
http_server_duration_count{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest"} 2 1675274675379
http_server_duration_sum{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest"} 0.008635375 1675274675379
http_server_duration_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="0"} 0 1675274675379
http_server_duration_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="5"} 2 1675274675379
http_server_duration_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="10"} 2 1675274675379
http_server_duration_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="25"} 2 1675274675379
http_server_duration_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="50"} 2 1675274675379
http_server_duration_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="75"} 2 1675274675379
http_server_duration_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="100"} 2 1675274675379
http_server_duration_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="250"} 2 1675274675379
http_server_duration_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="500"} 2 1675274675379
http_server_duration_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="1000"} 2 1675274675379
http_server_duration_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="+Inf"} 2 1675274675379
# HELP http_server_request_size Size of incoming bytes
# UNIT http_server_request_size By
# TYPE http_server_request_size histogram
http_server_request_size_count{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest"} 2 1675274675379
http_server_request_size_sum{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest"} 0 1675274675379
http_server_request_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="0"} 0 1675274675379
http_server_request_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="5"} 2 1675274675379
http_server_request_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="10"} 2 1675274675379
http_server_request_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="25"} 2 1675274675379
http_server_request_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="50"} 2 1675274675379
http_server_request_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="75"} 2 1675274675379
http_server_request_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="100"} 2 1675274675379
http_server_request_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="250"} 2 1675274675379
http_server_request_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="500"} 2 1675274675379
http_server_request_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="1000"} 2 1675274675379
http_server_request_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="+Inf"} 2 1675274675379
# HELP http_server_response_size Size of outgoing bytes
# UNIT http_server_response_size By
# TYPE http_server_response_size histogram
http_server_response_size_count{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest"} 2 1675274675379
http_server_response_size_sum{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest"} 3832 1675274675379
http_server_response_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="0"} 0 1675274675379
http_server_response_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="5"} 0 1675274675379
http_server_response_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="10"} 0 1675274675379
http_server_response_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="25"} 0 1675274675379
http_server_response_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="50"} 0 1675274675379
http_server_response_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="75"} 0 1675274675379
http_server_response_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="100"} 0 1675274675379
http_server_response_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="250"} 0 1675274675379
http_server_response_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="500"} 0 1675274675379
http_server_response_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="1000"} 0 1675274675379
http_server_response_size_bucket{method="GET",status="401",path="/info",service_name="nestjs_http",module="rest",le="+Inf"} 2 1675274675379
# HELP http_server_response_success_count_total Total number of all successful responses
# UNIT http_server_response_success_count_total responses
# TYPE http_server_response_success_count_total counter
# HELP http_server_response_error_count_total Total number of all response errors
# TYPE http_server_response_error_count_total counter
http_server_response_error_count_total 2 1675274675379
# HELP http_client_request_error_count_total Total number of client error requests
# TYPE http_client_request_error_count_total counter
http_client_request_error_count_total 2 1675274675379

At no point does this show duplicated count values, and the numbers are incremented properly.

lalitb (Member) commented Feb 2, 2023

@jayasai470 - I did some tests with the Prometheus exporter, and your analysis is correct. On every fetch, the Prometheus exporter should return the cumulative aggregated values from all configured instruments, with no duplicates.

@esigo FYI - I think we shouldn't be using PeriodicMetricReader with PrometheusExporter. With every periodic fetch, the cumulative metrics get stored in the Collector's vector and the old data is never deleted. This vector should only contain the latest cumulative data, not all previous data.

This can be achieved if we collect metrics only when there is an HTTP request to pull them:

http request -> PrometheusCollector::Collect() -> PullMetricReader::Collect() -> SyncMetricStorage::Collect()

And PrometheusExporter should be implemented as a PullMetricReader (similar to how the JS and Python SDKs implement it).

Javascript implementation : https://github.com/open-telemetry/opentelemetry-js/blob/2b59c2820c99e7f65012e01b6a310b8438b79e89/experimental/packages/opentelemetry-exporter-prometheus/src/PrometheusExporter.ts#L30

Python implementation : https://github.com/open-telemetry/opentelemetry-python/blob/2e14fe1295445d0c3cb6b5a29cc8db8c7db9f03c/exporter/opentelemetry-exporter-prometheus/src/opentelemetry/exporter/prometheus/__init__.py#L116

In general, PeriodicMetricReader works well for the push model, where we push the latest cumulative metrics to an agent, not for the pull model, where the metrics would have to sit in intermediate storage until they are pulled.
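A rough sketch of that pull-based wiring (this is not the actual implementation from #1953; prometheus-cpp's Collectable interface and the SDK's MetricReader::Collect callback are used only to illustrate the call chain above):

#include <vector>
#include <prometheus/collectable.h>
#include <prometheus/metric_family.h>
#include <opentelemetry/sdk/metrics/export/metric_producer.h>
#include <opentelemetry/sdk/metrics/metric_reader.h>

// Sketch only: a collectable that pulls from the SDK on every scrape, so the
// exposed data is always the latest cumulative state and nothing piles up
// between scrapes.
class OnDemandPrometheusCollector : public prometheus::Collectable
{
public:
  explicit OnDemandPrometheusCollector(opentelemetry::sdk::metrics::MetricReader *reader)
      : reader_(reader)
  {}

  // prometheus-cpp calls Collect() whenever the /metrics endpoint is scraped.
  std::vector<prometheus::MetricFamily> Collect() const override
  {
    std::vector<prometheus::MetricFamily> families;
    reader_->Collect([&](opentelemetry::sdk::metrics::ResourceMetrics &data) {
      // Translate `data` into prometheus::MetricFamily entries here; the real
      // exporter has translation utilities for this step.
      return true;
    });
    return families;
  }

private:
  opentelemetry::sdk::metrics::MetricReader *reader_;
};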

jayasai470 (Author) commented
@lalitb @esigo thanks for your deep dive into this issue. Please let me know if I can assist in closing it.
