Mark Spans which have been selected as Exemplars, to enable downstream tail sampling #9951

0x006EA1E5 · 2023-11-25T17:27:42Z

Is your feature request related to a problem? Please describe.

The OTel Collector supports tail sampling, which lets the Collector make sampling decisions based on the whole trace.

Clients are unaware of sampling decisions, as they happen downstream.

However, exemplar selection happens at the client. The client selects a span that participated in the measurement of a metric (that is, was in scope at the moment of measurement), and that span is attached to the metric datapoint when it is exported.

Currently, the collector has no way of knowing that a span has been selected as an exemplar, so it cannot make sampling decisions based on that fact. As a result, exemplar traces will often be dropped, giving a poor experience when users try to navigate from a metric via the exemplar to the (missing, not sampled) trace.

Describe the solution you'd like

This issue has been addressed in the Prometheus Java Client v1.0, which has support for marking spans as "exemplars".

Essentially, when a span is selected, the attribute exemplar="true" is added to the span.

Downstream tail sampling can then be easily configured to sample any trace where a span has that attribute.
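
To illustrate the decision the collector would make, here is a minimal sketch in plain Java. It is only a model: in practice this would be configured in the collector's tail sampling setup rather than coded, and `ExemplarTracePolicy` and `shouldSample` are hypothetical names, not part of any OpenTelemetry API. Each span is represented simply as a map of its attributes.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical model of a tail-sampling policy; not a collector API. */
final class ExemplarTracePolicy {

    /**
     * Returns true if any span in the trace carries exemplar="true".
     * Each map holds the string attributes of one span in the trace.
     */
    static boolean shouldSample(List<Map<String, String>> spanAttributes) {
        for (Map<String, String> attrs : spanAttributes) {
            if ("true".equals(attrs.get("exemplar"))) {
                return true; // one marked span is enough to keep the whole trace
            }
        }
        return false;
    }
}
```

A trace with one marked span is kept in full, while a trace with no marked spans falls through to whatever other sampling policies are configured.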

Describe alternatives you've considered

For the tail sampling use case, I see no sensible alternative to marking the span.

However, instead of simply adding this "marking" behaviour, it may be worth considering adding a generic extension point for the exemplar selection.

Additional context

I have the following observations, although they are based on the last time I looked into the source code, which was a few months ago:

It appears the agent is selecting exemplars based on a "last span seen" strategy. It seems that every span that participates in a measurement is (temporarily) selected as the exemplar until the next span is seen, at which point this next span replaces the previous one as the selected exemplar.

The result is that, as things stand, simply marking each selected span with the exemplar="true" attribute will not work, as effectively all participating spans will get marked. Instead, the selection strategy should avoid the case where a significant number of spans are marked as exemplars but are never actually used as exemplars in an exported metric datapoint. For example, a "first seen span" strategy would work, similar to how the Prometheus Client functions.
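
As a rough illustration of the "first seen span" idea, here is a minimal sketch in plain Java. It is not the agent's or the Prometheus client's actual implementation; `SketchSpan` and `FirstSeenExemplarSelector` are hypothetical stand-ins, and `markAsExemplar()` only models setting the exemplar="true" attribute. The point is that the span is marked at measurement time, and only one span is marked per export interval.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a span; not the OpenTelemetry Span interface. */
final class SketchSpan {
    final String spanId;
    private volatile boolean exemplar;

    SketchSpan(String spanId) {
        this.spanId = spanId;
    }

    /** Models setting the attribute exemplar="true" on the span. */
    void markAsExemplar() {
        exemplar = true;
    }

    boolean isExemplar() {
        return exemplar;
    }
}

/** Keeps the first span offered in each export interval and marks only that span. */
final class FirstSeenExemplarSelector {
    private final AtomicReference<SketchSpan> selected = new AtomicReference<>();

    /** Called at measurement time; returns true if this span became the exemplar. */
    boolean offer(SketchSpan span) {
        // Only the first span offered since the last export wins the slot.
        if (selected.compareAndSet(null, span)) {
            span.markAsExemplar();
            return true;
        }
        return false; // later spans are ignored and stay unmarked
    }

    /** Called at metric export; returns the exemplar (or null) and starts a new interval. */
    SketchSpan collectAndReset() {
        return selected.getAndSet(null);
    }
}
```

With this shape, every marked span is also the one attached to the next exported datapoint, so the marking does not depend on waiting for the metric export.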

Spans are typically processed some time before the metric datapoint and its exemplars are exported. We would want to export the spans (with the exemplar="true" attribute) as soon as possible, but we could be waiting perhaps 30 seconds until the exemplar is exported. Therefore, we cannot wait until the moment of metric export to make the exemplar selection, as the span could have long since been exported, and it would be too late to set any attributes.

The Prometheus client uses the attribute exemplar="true". Perhaps this should be defined in the semantic conventions.

trask · 2023-11-27T21:18:06Z

Linking to related: open-telemetry/opentelemetry-java#5915 and open-telemetry/opentelemetry-specification#2922

0x006EA1E5 added the "enhancement" (New feature or request) and "needs triage" (New issue that requires triage) labels on Nov 25, 2023
