Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the inclusivity of exponential histogram bounds #2633

Merged
merged 23 commits into from
Aug 12, 2022

Conversation

jmacd
Copy link
Contributor

@jmacd jmacd commented Jun 28, 2022

Fixes #2611 (for the most part)

Changes

The OpenTelemetry OTLP v0.11 exponential histogram defined lower-inclusive histogram buckets. The rationale for this decision is that the computations involved are simpler than for upper-inclusive histogram buckets, relatively speaking, due to the nature of IEEE 754's floating point representation.

The Prometheus project is strongly opposed to the use of lower-inclusive boundaries for histogram buckets because of historical precedent. In brief, there are existing modes of querying (existing) histogram data that are defined with upper-inclusive boundaries => it is possible to make exact inferences using existing histogram data => future histograms should preserve this mode of inference.

The Prometheus project has (with limits) adopted the base-2 exponential boundary structure defined in OTLP v0.11, so the only major difference between theirs and OpenTelemetry's is this issue. The proposed changes that will bring harmony in this space are: (1) this PR, (2) a demonstration of the aggregator change (OTel-Go), and (3) a comments-only change in the protocol.

The counter argument to making this change can be summarized:

  1. It is slightly simpler to use lower-inclusive definitions
  2. Exactness, even for power of two queries, is unwelcome overhead
  3. Exactness is not possible for non-power-of-two queries
  4. Negative-range numbers have opposite inclusivity in both proposals, counter to the claim of consistency.

The argument in favor of this proposal can be summarized:

  1. Prometheus will be happy (even without exactness)
  2. OTel never required exactness, so it's not a breaking change
  3. Checking for exact for powers-of-two is cheap
  4. Table-lookup implementations become less relatively-complex.

Please see the changes to the mapping functions prototyped in open-telemetry/opentelemetry-go#2982.

CC: @gouthamve @beorn7 @RichiH.

@jmacd jmacd requested review from a team June 28, 2022 00:36
@oertl
Copy link
Contributor

oertl commented Jun 28, 2022

@jmacd

  1. Table-lookup implementations become less relatively-complex.

Isn't it rather the opposite? The floating-point representations of values 1.0 and 1.00001 have the same exponent, therefore it is easier to map them to the same bucket than to put them in different buckets.

Copy link
Member

@reyang reyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@reyang reyang added area:sdk Related to the SDK spec:metrics Related to the specification/metrics directory labels Jun 28, 2022
@jmacd
Copy link
Contributor Author

jmacd commented Jun 28, 2022

Isn't it rather the opposite? The floating-point representations of values 1.0 and 1.00001 have the same exponent, therefore it is easier to map them to the same bucket than to put them in different buckets.

I think it's established that lower-inclusive boundaries have less "absolute" complexity, that's demonstrated in these changes. I suspect your point is that even a table-lookup implementation has to do more work here, because it requires a new test for significand==0--it's the same in every one of these methods, we're adding an explicit test for power-of-two values and subtracting 1 from the calculated index.

As I recall, @oertl your O(1) table-lookup implementation used 2 comparisons per call. I believe your point is that now it will require 3 comparisons per call?

I was trying to find a silver lining here: the complexity difference between table-lookup and log()-based methods is shrinking, even as they become more complex.

@oertl
Copy link
Contributor

oertl commented Jun 28, 2022

@jmacd

As I recall, @oertl your O(1) table-lookup implementation used 2 comparisons per call. I believe your point is that now it will require 3 comparisons per call?

Yes, but maybe it is possible to avoid additional branching for powers of two by decrementing the floating-point input value first (e.g., using Math.nextDown in Java). Anyway, I do not see how table-lookup implementations would become "less relatively-complex" with exclusive lower bounds, which is one of your arguments in favor of this proposal.

@jmacd
Copy link
Contributor Author

jmacd commented Jun 30, 2022

@oertl I struck that sentence about relative complexity. (It didn't land well!)

You're right that nextAfter can change the boundary condition between one and another inclusivity (Prometheus made this suggestion, too, but it looks like equal complexity to me--it costs more than one additional comparison at least.)

I looked back at the table lookup mapping function I had prototyped, which I derived from yours (included in this draft PR). To get a rough idea, the program to generate the table of constants is about the same size as the library that performs the lookup and the reverse mapping function. The library to perform the lookup/reverse-mapping is about the same size of, but already more complex than the logarithm-based mapping function. To me the change you would make in a table lookup function will leave an implementation with about the same complexity as before. On the other hand, the change to be made in a logarithm or exponent-based lookup function leaves behind substantially more complexity than it had before, and this is what I meant by relative complexity. The logarithm implementation suffers a lot, the table lookup implementation suffers a little.

Copy link
Member

@jack-berg jack-berg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on the implementation but left some comments.

specification/metrics/data-model.md Outdated Show resolved Hide resolved
specification/metrics/data-model.md Outdated Show resolved Hide resolved
specification/metrics/data-model.md Show resolved Hide resolved
specification/metrics/data-model.md Show resolved Hide resolved
specification/metrics/data-model.md Show resolved Hide resolved
@beorn7
Copy link

beorn7 commented Jul 5, 2022

Sorry for the separate comments rather than doing one review.

I think this looks fine from the Prometheus perspective. Many thanks for doing this!

@jmacd jmacd closed this Jul 5, 2022
@jmacd
Copy link
Contributor Author

jmacd commented Jul 5, 2022

As explained in #2633 (comment), I no longer think this is a good idea.

@jmacd
Copy link
Contributor Author

jmacd commented Jul 7, 2022

As explained in #2611 (comment), I still consider this change acceptable, but the text had some errors. I will re-open this after correcting the mistakes.

Joshua MacDonald added 3 commits July 7, 2022 14:22
@jmacd jmacd reopened this Jul 7, 2022
Joshua MacDonald added 2 commits July 7, 2022 14:53
Copy link
Member

@reyang reyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I've prototyped this in OpenTelemetry .NET, as I can tell, it is working very well. @jmacd thank you for great work! 👏

@jmacd
Copy link
Contributor Author

jmacd commented Aug 5, 2022

@open-telemetry/specs-approvers Please approve this PR that was requested by the Prometheus developers as it will improve collaboration between these groups.

Copy link
Member

@gouthamve gouthamve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@dashpole dashpole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:sdk Related to the SDK spec:metrics Related to the specification/metrics directory
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prometheus and OTel High Resolution Histograms incompatibilities
9 participants