Use sync.Map for exponential histogram aggregations#8077

Draft
dashpole wants to merge 8 commits into open-telemetry:main from dashpole:exphist_syncmap

@dashpole
Contributor

@dashpole dashpole commented Mar 19, 2026

Part of #7796

This applies the same approach as I did for fixed-bucket histograms (#7474) to exponential histograms.

Changes

  • Narrow the sync.Mutex from guarding the entire map to guarding only the scale and the positive/negative buckets.
  • Split expoHistogram into deltaExpoHistogram and cumulativeExpoHistogram: TODO

This does not make the buckets concurrent-safe. That will be done in subsequent PRs.
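For readers unfamiliar with the pattern, here is a minimal sketch of a sync.Map whose entries each carry their own mutex. It is not the code in this PR: `point`, `aggregator`, the string key, and the count/sum fields are hypothetical stand-ins for the real attribute-set key and the scale and bucket state.

```go
package main

import (
	"fmt"
	"sync"
)

// point is a hypothetical stand-in for one attribute set's data point.
// Its mutex guards only this point's state, not the whole map.
type point struct {
	mu    sync.Mutex
	count uint64
	sum   float64
}

// aggregator maps a hypothetical string attribute key to its *point.
type aggregator struct {
	points sync.Map // effectively map[string]*point
}

// measure records value under key. Only the matching point is locked, so
// measurements for different keys do not contend on a shared lock.
func (a *aggregator) measure(key string, value float64) {
	v, ok := a.points.Load(key)
	if !ok {
		// LoadOrStore keeps exactly one point per key when goroutines race.
		v, _ = a.points.LoadOrStore(key, &point{})
	}
	p := v.(*point)
	p.mu.Lock()
	p.count++
	p.sum += value
	p.mu.Unlock()
}

// snapshot returns the count and sum recorded for key, or zeros if the
// key has never been measured.
func (a *aggregator) snapshot(key string) (uint64, float64) {
	v, ok := a.points.Load(key)
	if !ok {
		return 0, 0
	}
	p := v.(*point)
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.count, p.sum
}

func main() {
	var a aggregator
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			a.measure("a", 1)
			a.measure("b", 2)
		}()
	}
	wg.Wait()
	count, sum := a.snapshot("b")
	fmt.Println(count, sum)
}
```

The fast path is a lock-free Load; LoadOrStore is only reached the first time a key is seen.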

@dashpole dashpole added the "Skip Changelog" label (PRs that do not require a CHANGELOG.md entry) Mar 19, 2026
@codecov

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 88.32685% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.0%. Comparing base (8d70624) to head (5ad698a).

Files with missing lines                                Patch %   Lines
...metric/internal/aggregate/exponential_histogram.go   87.8%     20 Missing and 6 partials ⚠️
sdk/metric/internal/aggregate/atomic.go                 89.7%     2 Missing and 2 partials ⚠️
Additional details and impacted files


@@           Coverage Diff           @@
##            main   #8077     +/-   ##
=======================================
- Coverage   82.0%   82.0%   -0.1%     
=======================================
  Files        308     308             
  Lines      24060   24228    +168     
=======================================
+ Hits       19748   19882    +134     
- Misses      3936    3961     +25     
- Partials     376     385      +9     
Files with missing lines                                Coverage Δ
sdk/metric/internal/aggregate/aggregate.go              100.0% <100.0%> (ø)
sdk/metric/internal/aggregate/atomic.go                 89.9% <89.7%> (-2.8%) ⬇️
...metric/internal/aggregate/exponential_histogram.go   92.8% <87.8%> (-7.2%) ⬇️

... and 1 file with indirect coverage changes


@dashpole
Contributor Author

dashpole commented Apr 3, 2026

One part of this that is challenging to resolve is handling underflow for cumulative metrics. The current design mirrors how the histogram implementation works: collection swaps hot and cold, reads the cold point, and then merges the cold point back into the hot point.
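The swap, read, and merge cycle can be sketched with a plain counter standing in for the histogram point. This is a simplified illustration, not the code in this PR: `hotCold` and its fields are hypothetical names, a single collector is assumed, and the sketch does not wait for in-flight measurements to finish before reading the cold slot, which a real implementation would need to do.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// hotCold is a hypothetical cumulative counter with two slots. Writers
// add to the hot slot; the collector reads and drains the cold slot.
type hotCold struct {
	hotIdx atomic.Uint32    // index (0 or 1) of the slot writers use
	slots  [2]atomic.Uint64 // the two points, reduced to plain counters
}

// measure adds n to whichever slot is currently hot.
func (h *hotCold) measure(n uint64) {
	h.slots[h.hotIdx.Load()&1].Add(n)
}

// collect returns the cumulative total. It assumes a single collector.
func (h *hotCold) collect() uint64 {
	cold := h.hotIdx.Load() & 1
	hot := cold ^ 1
	h.hotIdx.Store(hot)             // swap: new measurements go to the other slot
	total := h.slots[cold].Swap(0)  // read the cold slot and reset it
	h.slots[hot].Add(total)         // merge cold back into hot to stay cumulative
	return total
}

func main() {
	var h hotCold
	h.measure(3)
	fmt.Println(h.collect())
	h.measure(2)
	fmt.Println(h.collect())
}
```

With a counter the merge step is a single addition and cannot fail. With exponential histograms the merge may require rescaling, which is where the underflow described next can surface.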

The issue comes during the merge process. It is possible that an observation made to the hot point should underflow, but we don't find that out until we try to merge the cold point into the hot one. For example:

attrs := attribute.NewSet()
maxSize := 2
h := newExpoHistogram(maxSize, ...)
h.measure(ctx, math.MaxFloat64, attrs, ...)
go h.collect(...)
h.measure(ctx, math.SmallestNonzeroFloat64, attrs, ...)
// Assume collect() finishes after the second measure and tries to merge
// an exp histogram with math.MaxFloat64 into an exp histogram
// with math.SmallestNonzeroFloat64. This will underflow, but we
// can't remove the underflowed measurement after it has been
// aggregated.

This is an extremely rare case: underflow is only possible with maxSize <= 2, and only when one measurement is 2^1024 times greater than the other.

Some options I've come up with to deal with it:

  1. Fix it properly, but with significant complexity:
    1. Add a separate tracker that uses three atomic bits to record which scale -10 buckets have been seen, so that a measurement that would underflow can be dropped. This will probably have a small performance cost as well.
    2. "Pre-scale" buckets before swapping to make underflow impossible when merging cold back into hot. This is quite a bit more complex.
  2. Best-effort removal of underflowed measurements during the merge process:
    1. Remove the underflowed bucket counts, and lower the overall count by the same amount. Lower the sum proportional to the count to keep the average the same.
    2. Put the smallest underflowed measurements into the zero bucket. Raise the zero threshold, and move the smallest underflowed bucket counts to the zero count. In the worst case, the zero_threshold would be raised to 1.0, but the remaining range would fit into a single scale -10 bucket.

I'm planning to implement the proper fix (option 1.i), but I wanted to document this in case it comes up later. Option 2.i is also appealing given how extremely rare this should be in practice.
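For reference, option 2.i could look roughly like the sketch below. `expoPoint` and `dropUnderflow` are hypothetical names, the point is reduced to a single slice of positive bucket counts ordered from the lowest index upward, and it is assumed that count is at least the sum of the bucket counts.

```go
package main

import "fmt"

// expoPoint is a hypothetical, simplified data point. counts holds the
// bucket counts ordered from the lowest bucket index upward.
type expoPoint struct {
	count  uint64
	sum    float64
	counts []uint64
}

// dropUnderflow removes the lowest buckets until at most maxSize remain.
// It lowers count by the number of dropped measurements and scales sum by
// the same ratio, so sum/count is unchanged.
func dropUnderflow(p *expoPoint, maxSize int) {
	if len(p.counts) <= maxSize {
		return
	}
	excess := len(p.counts) - maxSize
	var dropped uint64
	for _, c := range p.counts[:excess] {
		dropped += c
	}
	p.counts = p.counts[excess:]
	if p.count == 0 {
		return // nothing recorded, so there is nothing to rescale
	}
	remaining := p.count - dropped
	p.sum *= float64(remaining) / float64(p.count)
	p.count = remaining
}

func main() {
	p := &expoPoint{count: 4, sum: 40, counts: []uint64{1, 0, 3}}
	dropUnderflow(p, 2)
	fmt.Println(p.count, p.sum, p.counts)
}
```

The reported average is preserved, but the sum is no longer the exact sum of the remaining measurements, which is why this is best-effort.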

@dashpole dashpole force-pushed the exphist_syncmap branch 3 times, most recently from fa32a82 to 906aa6e Compare April 6, 2026 19:59
MrAlias added a commit that referenced this pull request Apr 8, 2026
Some small testing improvements forked from
#8077.

This also fixes a flake where the order in which sums are added can
change the resulting sum. Use assertSumEqual to handle this similar to
other places in the test.

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>