Mark Spans which have been selected as Exemplars, to enable downstream tail sampling #9951

Open
0x006EA1E5 opened this issue Nov 25, 2023 · 1 comment
Labels
enhancement New feature or request needs triage New issue that requires triage

0x006EA1E5 commented Nov 25, 2023

Is your feature request related to a problem? Please describe.

The OpenTelemetry Collector supports tail sampling, which lets it make sampling decisions based on the whole trace.

Clients are unaware of sampling decisions, as they happen downstream.

However, exemplar selection happens at the client. The client selects a span that participated in the measurement of a metric (that is, it was in scope at the moment of measurement), and that span is attached as an exemplar to the metric datapoint when it is exported.

Currently, the collector has no way of knowing that a span has been selected as an exemplar, so it cannot make sampling decisions based on that fact. As a result, exemplar traces will often be dropped, giving a poor experience when users try to navigate from a metric via the exemplar to the (missing, unsampled) trace.

Describe the solution you'd like

This issue has been addressed in the Prometheus Java Client v1.0, which has support for marking spans as "exemplars".

Essentially, when a span is selected, the attribute exemplar="true" is added to the span.

Downstream tail sampling can then be easily configured to sample any trace where a span has that attribute.
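The rule itself is simple enough to model in a few lines. The following Java sketch only illustrates the decision "keep the trace if any span carries exemplar="true""; in practice this would be expressed as a tail sampling policy in the collector configuration rather than in application code. `ExemplarTraceFilter` and `FinishedSpan` are hypothetical names invented for this illustration and are not OpenTelemetry types.

```java
import java.util.List;
import java.util.Map;

// Models the tail sampling rule: keep a trace if any of its spans is marked.
final class ExemplarTraceFilter {
    // Attribute name taken from the issue text.
    static final String EXEMPLAR_KEY = "exemplar";

    static boolean shouldKeep(List<FinishedSpan> trace) {
        for (FinishedSpan span : trace) {
            if ("true".equals(span.attributes.get(EXEMPLAR_KEY))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<FinishedSpan> marked = List.of(
                new FinishedSpan("t1", Map.of()),
                new FinishedSpan("t1", Map.of(EXEMPLAR_KEY, "true")));
        List<FinishedSpan> unmarked = List.of(new FinishedSpan("t2", Map.of()));

        System.out.println(shouldKeep(marked));   // true
        System.out.println(shouldKeep(unmarked)); // false
    }
}

// Hypothetical, minimal stand-in for an exported span; not an OpenTelemetry type.
final class FinishedSpan {
    final String traceId;
    final Map<String, String> attributes;

    FinishedSpan(String traceId, Map<String, String> attributes) {
        this.traceId = traceId;
        this.attributes = attributes;
    }
}
```

Because the decision depends on a single attribute lookup per span, it stays cheap even for large traces, which is part of why marking the span is attractive for this use case.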

Describe alternatives you've considered

For the tail sampling use case, I see no sensible alternative to marking the span.

However, instead of simply adding this "marking" behaviour, it may be worth considering adding a generic extension point for the exemplar selection.

Additional context

I have the following observations, although they are based on the last time I looked at the source code, which was a few months ago:

It appears the agent is selecting exemplars based on a "last span seen" strategy. It seems like every span that participates in a measurement is (temporarily) selected as the exemplar until the next span is seen, at which point this next span becomes the selected exemplar, replacing the previous one.

As things stand, simply marking each selected span with the exemplar="true" attribute will not work, because effectively all participating spans would get marked. Instead, the selection strategy should avoid marking a significant number of spans that never actually get used as exemplars in an exported metric datapoint. For example, a "first seen span" strategy would work, similar to how the Prometheus client functions.

Spans are typically processed some time before the metric datapoint and its exemplars are exported. We would want to export the spans (with the exemplar="true" attribute) as soon as possible, but we could be waiting perhaps 30 seconds until the exemplar is exported. Therefore, we cannot wait until the moment of metric export to make the exemplar selection: the span could have long since been exported, and it would be too late to set any attributes.
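A "first seen span" strategy that also respects this timing constraint could look roughly like the sketch below. It is a simplified model, not the agent's or the Prometheus client's actual implementation: `FirstSeenExemplarSelector` and `DemoSpan` are hypothetical names, the sketch assumes one selector per metric series, and it assumes the span is still open (and therefore writable) at measurement time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Selects at most one exemplar span per collection interval ("first seen" wins).
final class FirstSeenExemplarSelector {
    private final AtomicReference<DemoSpan> selected = new AtomicReference<>();

    // Called at measurement time, while the span can still accept attributes.
    // Returns true only if this span became the exemplar for the current interval.
    boolean offer(DemoSpan span) {
        if (selected.compareAndSet(null, span)) {
            span.attributes.put("exemplar", "true"); // mark immediately, not at export
            return true;
        }
        return false; // an exemplar already exists; leave this span unmarked
    }

    // Called at metric export: returns the exemplar (or null) and starts a new interval.
    DemoSpan collectAndReset() {
        return selected.getAndSet(null);
    }

    public static void main(String[] args) {
        FirstSeenExemplarSelector selector = new FirstSeenExemplarSelector();
        DemoSpan a = new DemoSpan("a");
        DemoSpan b = new DemoSpan("b");

        selector.offer(a);
        selector.offer(b);

        System.out.println(a.attributes); // {exemplar=true}
        System.out.println(b.attributes); // {}
        System.out.println(selector.collectAndReset().spanId); // a
    }
}

// Hypothetical stand-in for a recording span; not the OpenTelemetry Span API.
final class DemoSpan {
    final String spanId;
    final Map<String, String> attributes = new ConcurrentHashMap<>();

    DemoSpan(String spanId) {
        this.spanId = spanId;
    }
}
```

With this shape, every marked span is guaranteed to be the one returned at the next export, so no span is marked without being used. The trade-off is that the exemplar is always the earliest measurement in the interval rather than the most recent one.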

The Prometheus client uses the attribute exemplar="true". Perhaps this should be defined in the semantic conventions.

trask commented Nov 27, 2023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request needs triage New issue that requires triage
Projects
None yet
Development

No branches or pull requests

2 participants