Skip to content

stats: report sample count as an integer to prevent loss of precision#6274

Merged
mattklein123 merged 1 commit intoenvoyproxy:masterfrom
zuercher:zuercher_fix_sample_count
Apr 3, 2019
Merged

stats: report sample count as an integer to prevent loss of precision#6274
mattklein123 merged 1 commit intoenvoyproxy:masterfrom
zuercher:zuercher_fix_sample_count

Conversation

@zuercher
Copy link
Member

When the number of samples in a histogram exceeds roughly 2^24, the
value of the "+Inf" bucket in Prometheus stats output is rounded
and rendered in scientific notation. This causes incorrect results
in the Prometheus histogram_quantile function, which assumes that
the rounding-error between the 1 hour and +Inf buckets represents
some number of requests that took in excess of 1 hour.

libcirclhist actually stores the sample count as a uint64_t, so
stop implicitly converting it to double and output the count
precisely (as we do with the non-infinite buckets). Also
modifies the output format of the sum metric to avoid scientific
notation.

Risk Level: low
Testing: added test case
Doc Changes: n/a
Release Notes: n/a

Signed-off-by: Stephan Zuercher zuercher@gmail.com

When the number of samples in a histogram exceeds roughly 2^24, the
value of the "+Inf" bucket in Prometheus stats output is rounded
and rendered in scientific notation. This causes incorrect results
in the Prometheus histogram_quantile function, which assumes that
the rounding-error between the 1 hour and +Inf buckets represents
some number of requests that took in excess of 1 hour.

libcirclhist actually stores the sample count as a uint64_t, so
stop implicitly converting it to double and output the count
precisely (as we do with the non-infinite buckets). Also
modifies the output format of the sum metric to avoid scientific
notation.

Risk Level: low
Testing: added test case
Doc Changes: n/a
Release Notes: n/a

Signed-off-by: Stephan Zuercher <zuercher@gmail.com>
response.add(fmt::format("{0}_bucket{{{1}le=\"+Inf\"}} {2}\n", metric_name, hist_tags,
stats.sampleCount()));
response.add(fmt::format("{0}_sum{{{1}}} {2}\n", metric_name, tags, stats.sampleSum()));
response.add(fmt::format("{0}_sum{{{1}}} {2:.32g}\n", metric_name, tags, stats.sampleSum()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern with this is that it's changing something user visible and might break parsers. But if I understand correctly this output is meant to be parsed specifically by Prometheus scraper, so it's probably not being scraped by someone's custom stuff. WDYT?

Copy link
Member Author

@zuercher zuercher Mar 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. It's also changing it to a purely [0-9]+(\.[0-9]+)? format, so I think that makes it less likely to be misinterpreted. To be fair, I believe prometheus can consume the scientific notation values correctly outside of the tags it uses for histogram bucketing. I changed this so that it matched the other outputs.

@dnoe dnoe self-assigned this Mar 13, 2019
@zuercher
Copy link
Member Author

Ping @dnoe

@stale
Copy link

stale bot commented Mar 27, 2019

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 27, 2019
@zuercher
Copy link
Member Author

zuercher commented Apr 1, 2019

@dnoe do you have time to move this forward or shall I look for another reviewer?

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Apr 1, 2019
Copy link
Contributor

@dnoe dnoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Sorry for the delay, I recently adjusted email filters in an attempt to not lose track of PRs and this got lost in the transition.

@mattklein123 mattklein123 merged commit 406547d into envoyproxy:master Apr 3, 2019
mpuncel added a commit to mpuncel/envoy that referenced this pull request Apr 8, 2019
* master: (137 commits)
  test: router upstream log to v2 config stubs (envoyproxy#6499)
  remove idle timeout validation (envoyproxy#6500)
  build: Change namespace of chromium_url. (envoyproxy#6506)
  coverage: exclude chromium_url (envoyproxy#6498)
  fix(tracing): allow 256 chars in path tag (envoyproxy#6492)
  Common: Introduce StopAllIteration filter status for decoding and encoding filters (envoyproxy#5954)
  build: update PGV url (envoyproxy#6495)
  subset lb: avoid partitioning host lists on worker threads (envoyproxy#6302)
  ci: Make envoy_select_quiche no-op. (envoyproxy#6393)
  watcher: notify when watched files are modified (envoyproxy#6215)
  stat: Add counterFromStatName(), gaugeFromStatName(), and histogramFromStatName() (envoyproxy#6475)
  bump to 1.11.0-dev (envoyproxy#6490)
  release: bump to 1.10.0 (envoyproxy#6489)
  hcm: path normalization. (#1)
  build: import manually minified Chrome URL lib. (envoyproxy#3)
  codec: reject embedded NUL in headers. (envoyproxy#2)
  Added veryfication if path contains query params and add them to path header (envoyproxy#6466)
  redis: basic integration test for redis_proxy (envoyproxy#6450)
  stats: report sample count as an integer to prevent loss of precision (envoyproxy#6274)
  Added VHDS protobuf message and updated RouteConfig to include it. (envoyproxy#6418)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@zuercher zuercher deleted the zuercher_fix_sample_count branch November 7, 2019 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants