Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow samplers to be called during different moments in the Span lifetime #307

Open
bogdandrutu opened this issue Oct 11, 2019 · 14 comments
Labels
area:sampling Related to trace sampling area:sdk Related to the SDK release:after-ga Not required before GA release, and not going to work on before GA spec:trace Related to the specification/trace directory
Milestone

Comments

@bogdandrutu
Copy link
Member

Currently Sampler is called only onSpanCreation.

Based on the comment #296 (comment) we should consider to allow:

  • onStart - span creation time
  • onSpanNameUpdate - when updateName is called
  • onSetAttribute - when setAttribute is called
  • onEnd - when end is called.
@jmacd
Copy link
Contributor

jmacd commented Oct 15, 2019

If I were a sampler with enough sophistication to respond to each of these events in a span lifecycle, I might also like to maintain state between the calls in the span processor state itself. This feels like something for a specialized SDK, not something for a public sampler API IMO.

@paivagustavo
Copy link
Member

We've hit some questions (open-telemetry/opentelemetry-go#224) that I think are related with this issue.

By the current specs, if a span that was initially defined as isRecording == false it can drop all tracing events but what happens if someone tries to update the span name or set a new attribute after adding events? The span would have incomplete data if it became sampled.

To have a span that contains all data we would need to keep every event, attribute, status and etc from every span, if we are doing this what is the point of the IsRecording at all?

@jmacd
Copy link
Contributor

jmacd commented Oct 22, 2019

I would like to propose a different solution. Presently, there are open issues in the Go repository about removing the WithRecord option and about the UpdateName vs SetName debate, which are both connected with this issue.

Instead of "re-triggering" the sampler API or re-evaluating it at different moments during the span lifetime, I see two good options:

--
First, a specialized SDK should be able to implement an arbitrarily complex state machine to perform a complex sampling decision that might change upon the arrival of any new information. It could potentially decide to not record a Span using the initial information, or it could decide to begin recording and stop at some later time when a new attribute, a change of name, or an event occur. This is the arbitrarily complex case and SDKs are free to do this.

--
The simple case, which I believe is being advocated for OTel to support generally, is to support a relatively flexible sampling API. The one we have today has raised this confusion about UpdateName vs SetName and this question of re-evaluation, but it hasn't helped us decide whether to record a span or not. Now that the spec says the WithRecord option is not a span parameter, does this become part of the sampling API? It seems there are really two questions the sampling API needs to answer:

  1. Given the initial arguments given to start a Span, decide whether to record this span. In this call, we get the initial name of the span, the initial attributes, and any other options passed to Start(). The sampler has enough information to decide whether to record the span or not. It still does not know the latency, the full set of attributes, or the events.
  2. At the end of a span, assuming it was recorded, the Sampler API has an opportunity to decide whether to sample or not, given the complete and finalized SpanData structure. This is an opportunity to make a final decision.

This addresses the confusion. A sampler can make relatively simplistic decisions up front, or it can make fully-informed decisions at the end of the span, but that is all it can do. This is, after all, the simple case. This suggests that there should be TWO calls in the Sampler API. At the start, decide whether to record. At the end, decide whether to sample.

@Oberon00
Copy link
Member

Note that when the Sampler (or the exporter for that matter) decides to drop the Span only at step 2, the Span's context may already have been propagated with an Sampled=true TraceOptions flag. I.e., if the decision changes from sampled in to sampled out any time after span creation (step 1), then you will probably get broken traces.

@paivagustavo
Copy link
Member

@Oberon00 what is this different than someone updating a span names and it be sampled out after context propagation may happen? If sampler happens anywhere after creation there will be always chance for broken traces.

@Oberon00
Copy link
Member

Oberon00 commented Oct 22, 2019

@paivagustavo: Exactly that was my point 😃

EDIT: My understanding of the current sampling approach is that sampling happens immediately at span creation only.

@carlosalberto
Copy link
Contributor

what is this different than someone updating a span names and it be sampled out after context propagation may happen?

AFAIK this is something that tracers like Jaeger already might run into, and this is done on at their own risk ;)

I see two good options

I really like the second option. Given that a lot of things have been added lately, and more of them will, I'd stick to the simplest, correct choice whenever possible.

@jmacd
Copy link
Contributor

jmacd commented Oct 24, 2019

Remember that there are more uses for a Span than just constructing "complete" traces. If I want to study the statistical properties of a certain operation, let's say, I might decide to record all Spans and use some kind of latency-weighted sampling logic, for example. There's no requirement to collect a full trace, and I'd like to have control over whether spans are recorded or not. Z-pages are the other important use-case.

@carlosalberto
Copy link
Contributor

Moving this issue to the next milestone as there's no consensus here yet.

@yurishkuro
Copy link
Member

Responding to @jmacd's #307 (comment) about restricting sampler to two decisions only, instead of multiple decisions proposed in the issue description. Restricting the second decision only to span.finish means that any other activity the request has performed that resulted in downstream RPCs will not be recorded if the original decision as No, since it will be propagated to other services. In contrast, allowing multiple sampling decisions on every update of the span means the information needed for a Yes decision (like a certain tag) can become available early in the span life cycle.

@toumorokoshi
Copy link
Member

FYI I've created an OTEP around this: open-telemetry/oteps#115. Will attempt to get discussions moving, so review is much appreciated!

@jmacd
Copy link
Contributor

jmacd commented Jun 11, 2020

Responding to @jmacd's #307 (comment) about restricting sampler to two decisions only, instead of multiple decisions proposed in the issue description.

Just to clarify my earlier remark, I had described two descisions, but not two sampling decisions.

One was "attach a real span object vs a no-op span", which equates with "Maybe sample" or "Definitely do not sample"--this is a decision to possibly sample the span before all the information is available. If the first decision is No, there will be no information recorded.

The second decision was a tail-sampling decision where all the information is available. I view my suggestion is pretty close to open-telemetry/oteps#115.

@bogdandrutu bogdandrutu added the spec:trace Related to the specification/trace directory label Jun 12, 2020
@carlosalberto carlosalberto added area:sdk Related to the SDK release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Jun 30, 2020
@andrewhsu andrewhsu added the priority:p3 Lowest priority level label Jul 17, 2020
@tigrannajaryan
Copy link
Member

Given that open-telemetry/oteps#115 has no approvals yet and there is no clear understanding of how this issue will be resolved do we still believe this is a must-have for 1.0 GA?

It appears to me this functionality can be added in backwards compatible manner and if so can be postponed and addressed after GA.

Thoughts?

@toumorokoshi
Copy link
Member

I think it's fine to leave until after GA, since my main concern (discouraging calling update_name) has been removed from the spec.

That said the OTEP as it stands has either addressed all feedback or waiting on responses from others. If someone can help guide me in how to get more eyes I'm happy to get right back to working on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:sampling Related to trace sampling area:sdk Related to the SDK release:after-ga Not required before GA release, and not going to work on before GA spec:trace Related to the specification/trace directory
Projects
None yet
Development

No branches or pull requests

10 participants