-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[connector/spanmetrics] Produce delta temporality span metrics with timestamps representing an uninterrupted series #31780
[connector/spanmetrics] Produce delta temporality span metrics with timestamps representing an uninterrupted series #31780
Conversation
@@ -322,6 +361,7 @@ func (p *connectorImp) resetState() { | |||
// and span metadata such as name, kind, status_code and any additional | |||
// dimensions the user has configured. | |||
func (p *connectorImp) aggregateMetrics(traces ptrace.Traces) { | |||
startTimestamp := pcommon.NewTimestampFromTime(time.Now()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this just to make testing easier, since before that the startTimestamp
was regenerated during each iteration of the for loop on the next line
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #31780 +/- ##
==========================================
+ Coverage 81.88% 81.89% +0.01%
==========================================
Files 1858 1858
Lines 172702 172739 +37
==========================================
+ Hits 141410 141459 +49
+ Misses 26980 26967 -13
- Partials 4312 4313 +1 ☔ View full report in Codecov by Sentry. |
f97f823
to
397ec40
Compare
Thanks for the issue and PR @swar8080. Hopefully @portertech or someone from @open-telemetry/collector-contrib-approvers can review. |
startTime = lastTimestamp | ||
} | ||
// Collect lastDeltaTimestamps keys that need to be updated. Metrics can share the same key, so defer the update. | ||
deltaMetricKeys[mk] = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we update lastDeltaTimestamps
here directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This anonymous function is first called for the calls
metric and then called with the same key for the duration
metric. So if we update it here then the duration
metric will get the incorrect value from the cache
That is what I was trying to say in the comment above this line, but feel free to change the wording if it's unclear!
@@ -125,6 +129,16 @@ func newConnector(logger *zap.Logger, config component.Config, ticker *clock.Tic | |||
resourceMetricsKeyAttributes[attr] = s | |||
} | |||
|
|||
var lastDeltaTimestamps *simplelru.LRU[metrics.Key, pcommon.Timestamp] | |||
if cfg.GetAggregationTemporality() == pmetric.AggregationTemporalityDelta { | |||
lastDeltaTimestamps, err = simplelru.NewLRU[metrics.Key, pcommon.Timestamp](cfg.GetDeltaTimestampCacheSize(), func(k metrics.Key, v pcommon.Timestamp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use cache.NewCache
instead? This will help us maintain uniformity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could but cache.NewCache
keeps items evicted from its LRU cache in memory, which isn't needed here like it is for the other caches. We would have to remember to call RemoveEvictedItems()
to discard those items from memory
cec515c
to
a5182b5
Compare
…mestamps representing an uninterrupted series. This can avoid significant memory usage compared to producing cumulative span metrics, as long a downstream component can convert from delta back to cumulative, which can depend on the timestamps being uninterrupted.
a5182b5
to
c419f4c
Compare
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Hi @Frapschen @portertech is this something you still want to include? The component should be usable with So this MR is mostly to allow using delta metrics as an optimization. It reduces memory/CPU by not buffering the cumulative counts since Also having timestamps aligned with the specification might prevent future bug reports. For example, i'm still maintaining a fork with this timestamp change in order to work with #32521 (which is also forked). |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great change to fix a nasty bug! I'm not an approver but I think this is in a mergeable state
@swar8080 would you have time to fix the merge conflicts? |
…es on span metrics connector
@@ -61,6 +61,7 @@ require ( | |||
github.com/grpc-ecosystem/grpc-gateway/v2 v2.20.0 // indirect | |||
github.com/hashicorp/go-version v1.7.0 // indirect | |||
github.com/hashicorp/golang-lru v1.0.2 // indirect | |||
github.com/hashicorp/golang-lru/v2 v2.0.7 // indirect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needed because this module depends on spanmetricsconnector
Finished implementing the most recent review feedback so this seems ready to merge cc @mx-psi |
@swar8080 I see some failures on main on Windows that may be related to this PR. Can you take a look?
|
Closes #31671
Description:
Currently delta temporality span metrics are produced with (StartTimestamp, Timestamp)'s of
(T1, T2), (T3, T4) ...
. However, the specification says that the correct pattern for an uninterrupted delta series is(T1, T2), (T2, T3) ...
This misalignment with the spec can confuse downstream components' conversion from delta temporality to cumulative temporality, causing each data point to be viewed as a cumulative counter "reset". An example of this is in
prometheusexporter
The conversion issue forces you to generate cumulative span metrics, which use significantly more memory to cache the cumulative counts.
At work, I applied this patch to our collectors and switched to producing delta temporality metrics for
prometheusexporter
to then convert to cumulative. That caused a significant drop in-memory usage:Testing:
prometheusexporter
to make sure counter values are cumulative and no longer being reset after receiving each delta data point