Optimize histogram reservoir #7443
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##            main   #7443     +/-  ##
=======================================
- Coverage   86.2%   86.2%    -0.1%
=======================================
  Files        302     302
  Lines      21973   21971       -2
=======================================
- Hits       18949   18947       -2
  Misses      2643    2643
  Partials     381     381
=======================================
```
Forked from this discussion here: #7443 (comment)

It seems like a good idea for us as a group to align on and document what we are comfortable with in terms of how ordered measurements are reflected in collected metric data.

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
On further reflection, I fixed the copying issue before running the benchmark, so it is perhaps reasonable that less racy code runs slower. It would be good if the tests and/or linter detected the issue.
I also see slightly worse results, but agree it is definitely better to be correct. I'll work on a test.
I added a ConcurrentSafe test, and verified that it fails (quite spectacularly) with the previous atomic.Value implementation.
The ConcurrentSafe test found another race condition around my usage of sync.Pool, which I'm looking into.
MrAlias left a comment:
Overall, looks good to me. Just testing cleanup.
Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
I put the Measure benchmarks in the description. Not as good of an improvement as I had expected, so I'll have to dig into that more...
@dmathieu if you have the chance to review, this is related to other optimization PRs.
I figured out why the benchmark results were so poor: the benchmark was recording all observations in a single bucket, and each bucket has its own lock, so there were effectively no parallelism gains. I switched the benchmark to record observations in different buckets, and that shows this is a ~70% performance improvement when exemplars are being recorded.
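The benchmark change described in this comment can be sketched like this. The reservoir here is a hypothetical toy with one lock per bucket, not the actual SDK benchmark: the point is that `b.RunParallel` goroutines must target different buckets, so they take different per-bucket locks instead of all serializing on one.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// bucket guards one exemplar slot with its own lock (toy model of the
// per-bucket locking described above; not the SDK code).
type bucket struct {
	mu    sync.Mutex
	value float64
}

type reservoir struct{ buckets []bucket }

// Offer stores a value in bucket i, locking only that bucket.
func (r *reservoir) Offer(i int, v float64) {
	b := &r.buckets[i]
	b.mu.Lock()
	b.value = v
	b.mu.Unlock()
}

func main() {
	r := &reservoir{buckets: make([]bucket, 64)}
	// Spread observations over all buckets so parallel goroutines
	// rarely contend on the same lock. If every iteration used bucket
	// 0 instead, the benchmark would show no parallelism gain.
	res := testing.Benchmark(func(b *testing.B) {
		b.RunParallel(func(pb *testing.PB) {
			var i int
			for pb.Next() {
				i++
				r.Offer(i%len(r.buckets), float64(i))
			}
		})
	})
	fmt.Println(res.N > 0)
}
```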
~Depends on #7441, #7443~

This improves the concurrent performance of the fixed size reservoir's Offer function by 4x (i.e. a 75% reduction). This improves the performance of Measure() for fixed-size reservoirs by 60% overall.

Accomplish this by:

* using a single atomic for count and next. This assumes that both can fit in a uint32.
* only using a lock to guard changing `w` and `next` together.

Offer benchmarks:

```
                            │  main.txt   │           fixedsize.txt           │
                            │   sec/op    │   sec/op     vs base              │
FixedSizeReservoirOffer-24    185.25n ± 4%   45.58n ± 1%  -75.40% (p=0.002 n=6)
```

Measure benchmarks:

```
                                                                          │  main.txt   │           fixedsize.txt            │
                                                                          │   sec/op    │    sec/op     vs base              │
SyncMeasure/NoView/ExemplarsEnabled/Int64Counter/Attributes/0-24            175.45n ± 6%   67.01n ±  9%  -61.81% (p=0.002 n=6)
SyncMeasure/NoView/ExemplarsEnabled/Int64Counter/Attributes/1-24            170.25n ± 1%   69.82n ±  6%  -58.99% (p=0.002 n=6)
SyncMeasure/NoView/ExemplarsEnabled/Int64Counter/Attributes/10-24           167.40n ± 2%   64.52n ± 10%  -61.46% (p=0.002 n=6)
SyncMeasure/NoView/ExemplarsEnabled/Float64Counter/Attributes/0-24          173.55n ± 0%   69.17n ± 12%  -60.14% (p=0.002 n=6)
SyncMeasure/NoView/ExemplarsEnabled/Float64Counter/Attributes/1-24          169.50n ± 1%   68.55n ±  5%  -59.56% (p=0.002 n=6)
SyncMeasure/NoView/ExemplarsEnabled/Float64Counter/Attributes/10-24         166.95n ± 1%   65.82n ±  6%  -60.58% (p=0.002 n=6)
SyncMeasure/NoView/ExemplarsEnabled/Int64UpDownCounter/Attributes/0-24      168.85n ± 1%   67.99n ± 11%  -59.73% (p=0.002 n=6)
SyncMeasure/NoView/ExemplarsEnabled/Int64UpDownCounter/Attributes/1-24      173.50n ± 1%   66.69n ±  2%  -61.56% (p=0.002 n=6)
SyncMeasure/NoView/ExemplarsEnabled/Int64UpDownCounter/Attributes/10-24     171.30n ± 5%   67.73n ±  8%  -60.46% (p=0.002 n=6)
SyncMeasure/NoView/ExemplarsEnabled/Float64UpDownCounter/Attributes/0-24    168.90n ± 2%   67.69n ±  9%  -59.92% (p=0.002 n=6)
SyncMeasure/NoView/ExemplarsEnabled/Float64UpDownCounter/Attributes/1-24    173.35n ± 2%   68.25n ±  9%  -60.63% (p=0.002 n=6)
SyncMeasure/NoView/ExemplarsEnabled/Float64UpDownCounter/Attributes/10-24   172.95n ± 2%   70.90n ±  7%  -59.01% (p=0.002 n=6)
geomean                                                                      171.0n         67.83n       -60.33%
```

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Co-authored-by: Robert Pająk <pellared@hotmail.com>
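The "single atomic for count and next" idea in the list above can be sketched as follows. This is an illustrative reconstruction, not the actual `FixedSizeReservoir` code: both counters are packed into one `atomic.Uint64` (assuming each fits in a `uint32`, as the description notes), so they advance together with a single lock-free atomic add.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// countAndNext packs count (high 32 bits) and next (low 32 bits) into
// a single uint64 so both advance in one atomic operation.
// Hypothetical names; not the SDK's actual fields.
type countAndNext struct {
	state atomic.Uint64
}

// advance increments both fields atomically and returns their values
// as observed before the increment.
func (c *countAndNext) advance() (count, next uint32) {
	const delta = 1<<32 | 1 // +1 to count, +1 to next
	old := c.state.Add(delta) - delta
	return uint32(old >> 32), uint32(old)
}

func main() {
	var c countAndNext
	for i := 0; i < 3; i++ {
		c.advance()
	}
	count, next := c.advance()
	fmt.Println(count, next) // prints: 3 3
}
```

Because a single Add updates both halves, readers never observe count and next out of sync, which is what makes the separate lock only necessary when `w` and `next` must change together.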
### Added

- Add `Enabled` method to all synchronous instrument interfaces (`Float64Counter`, `Float64UpDownCounter`, `Float64Histogram`, `Float64Gauge`, `Int64Counter`, `Int64UpDownCounter`, `Int64Histogram`, `Int64Gauge`) in `go.opentelemetry.io/otel/metric`. This stabilizes the synchronous instrument enabled feature, allowing users to check if an instrument will process measurements before performing computationally expensive operations. (#7763)
- Add `AlwaysRecord` sampler in `go.opentelemetry.io/otel/sdk/trace`. (#7724)
- Add `go.opentelemetry.io/otel/semconv/v1.39.0` package. The package contains semantic conventions from the `v1.39.0` version of the OpenTelemetry Semantic Conventions. See the [migration documentation](https://github.com/open-telemetry/opentelemetry-go/blob/298cbedf256b7a9ab3c21e41fc5e3e6d6e4e94aa/semconv/v1.39.0/MIGRATION.md) for information on how to upgrade from `go.opentelemetry.io/otel/semconv/v1.38.0`. (#7783, #7789)

### Changed

- `Exporter` in `go.opentelemetry.io/otel/exporters/prometheus` ignores metrics with the scope `go.opentelemetry.io/contrib/bridges/prometheus`. This prevents scrape failures when the Prometheus exporter is misconfigured to get data from the Prometheus bridge. (#7688)
- Improve performance of concurrent histogram measurements in `go.opentelemetry.io/otel/sdk/metric`. (#7474)
- Add experimental observability metrics in `go.opentelemetry.io/otel/exporters/stdout/stdoutmetric`. (#7492)
- Improve the concurrent performance of `HistogramReservoir` in `go.opentelemetry.io/otel/sdk/metric/exemplar` by 4x. (#7443)
- Improve performance of concurrent synchronous gauge measurements in `go.opentelemetry.io/otel/sdk/metric`. (#7478)
- Improve performance of concurrent exponential histogram measurements in `go.opentelemetry.io/otel/sdk/metric`. (#7702)
- Improve the concurrent performance of `FixedSizeReservoir` in `go.opentelemetry.io/otel/sdk/metric/exemplar`. (#7447)
- The `rpc.grpc.status_code` attribute in the experimental metrics emitted from `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc` is replaced with the `rpc.response.status_code` attribute to align with the semantic conventions. (#7854)
- The `rpc.grpc.status_code` attribute in the experimental metrics emitted from `go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc` is replaced with the `rpc.response.status_code` attribute to align with the semantic conventions. (#7854)

### Fixed

- Fix bad log message when key-value pairs are dropped because of key duplication in `go.opentelemetry.io/otel/sdk/log`. (#7662)
- Fix `DroppedAttributes` on `Record` in `go.opentelemetry.io/otel/sdk/log` to not count the non-attribute key-value pairs dropped because of key duplication. (#7662)
- Fix `SetAttributes` on `Record` in `go.opentelemetry.io/otel/sdk/log` to not log that attributes are dropped when they are actually not dropped. (#7662)
- `WithHostID` detector in `go.opentelemetry.io/otel/sdk/resource` to use the full path for the `ioreg` command on Darwin (macOS). (#7818)
- Fix missing `request.GetBody` in `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp` to correctly handle the HTTP/2 GOAWAY frame. (#7794)

### Deprecated

- Deprecate `go.opentelemetry.io/otel/exporters/zipkin`. For more information, see the [OTel blog post deprecating the Zipkin exporter](https://opentelemetry.io/blog/2025/deprecating-zipkin-exporters/). (#7670)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This improves the concurrent performance of the histogram reservoir's Offer function by 4x (i.e. a 75% reduction).

This is accomplished by locking each measurement rather than locking the entire storage. Also, extracting the trace context from the context.Context is deferred until collection time. This improves the performance of Offer, which is on the measure hot path: exemplars are often overwritten, so deferring the operation until Collect reduces the overall work.
Benchmarks for Measure:
I explored using a `[]atomic.Pointer[measurement]`, but it had similar performance while being much more complex (it needed a `sync.Pool` to eliminate allocations). The single-threaded performance of that solution was also much worse. See main...dashpole:optimize_histogram_reservoir_old.