stats: report sample count as an integer to prevent loss of precision#6274
Conversation
When the number of samples in a histogram exceeds roughly 2^24, the value of the "+Inf" bucket in Prometheus stats output is rounded and rendered in scientific notation. This causes incorrect results in the Prometheus histogram_quantile function, which assumes that the rounding-error between the 1 hour and +Inf buckets represents some number of requests that took in excess of 1 hour. libcirclhist actually stores the sample count as a uint64_t, so stop implicitly converting it to double and output the count precisely (as we do with the non-infinite buckets). Also modifies the output format of the sum metric to avoid scientific notation. Risk Level: low Testing: added test case Doc Changes: n/a Release Notes: n/a Signed-off-by: Stephan Zuercher <zuercher@gmail.com>
| response.add(fmt::format("{0}_bucket{{{1}le=\"+Inf\"}} {2}\n", metric_name, hist_tags, | ||
| stats.sampleCount())); | ||
| response.add(fmt::format("{0}_sum{{{1}}} {2}\n", metric_name, tags, stats.sampleSum())); | ||
| response.add(fmt::format("{0}_sum{{{1}}} {2:.32g}\n", metric_name, tags, stats.sampleSum())); |
There was a problem hiding this comment.
My only concern with this is that it's changing something user visible and might break parsers. But if I understand correctly this output is meant to be parsed specifically by Prometheus scraper, so it's probably not being scraped by someone's custom stuff. WDYT?
There was a problem hiding this comment.
I agree. It's also changing it to a purely [0-9]+(\.[0-9]+)? format, so I think that makes it less likely to be misinterpreted. To be fair, I believe prometheus can consume the scientific notation values correctly outside of the tags it uses for histogram bucketing. I changed this so that it matched the other outputs.
|
Ping @dnoe |
|
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
|
@dnoe do you have time to move this forward or shall I look for another reviewer? |
dnoe
left a comment
There was a problem hiding this comment.
LGTM. Sorry for the delay, I recently adjusted email filters in an attempt to not lose track of PRs and this got lost in the transition.
* master: (137 commits) test: router upstream log to v2 config stubs (envoyproxy#6499) remove idle timeout validation (envoyproxy#6500) build: Change namespace of chromium_url. (envoyproxy#6506) coverage: exclude chromium_url (envoyproxy#6498) fix(tracing): allow 256 chars in path tag (envoyproxy#6492) Common: Introduce StopAllIteration filter status for decoding and encoding filters (envoyproxy#5954) build: update PGV url (envoyproxy#6495) subset lb: avoid partitioning host lists on worker threads (envoyproxy#6302) ci: Make envoy_select_quiche no-op. (envoyproxy#6393) watcher: notify when watched files are modified (envoyproxy#6215) stat: Add counterFromStatName(), gaugeFromStatName(), and histogramFromStatName() (envoyproxy#6475) bump to 1.11.0-dev (envoyproxy#6490) release: bump to 1.10.0 (envoyproxy#6489) hcm: path normalization. (#1) build: import manually minified Chrome URL lib. (envoyproxy#3) codec: reject embedded NUL in headers. (envoyproxy#2) Added veryfication if path contains query params and add them to path header (envoyproxy#6466) redis: basic integration test for redis_proxy (envoyproxy#6450) stats: report sample count as an integer to prevent loss of precision (envoyproxy#6274) Added VHDS protobuf message and updated RouteConfig to include it. (envoyproxy#6418) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>
When the number of samples in a histogram exceeds roughly 2^24, the
value of the "+Inf" bucket in Prometheus stats output is rounded
and rendered in scientific notation. This causes incorrect results
in the Prometheus histogram_quantile function, which assumes that
the rounding-error between the 1 hour and +Inf buckets represents
some number of requests that took in excess of 1 hour.
libcirclhist actually stores the sample count as a uint64_t, so
stop implicitly converting it to double and output the count
precisely (as we do with the non-infinite buckets). Also
modifies the output format of the sum metric to avoid scientific
notation.
Risk Level: low
Testing: added test case
Doc Changes: n/a
Release Notes: n/a
Signed-off-by: Stephan Zuercher zuercher@gmail.com