Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] Allow additional sampling hooks #115

Conversation

toumorokoshi
Copy link
Member

Hi,

This OTEP is trying to consolidate a few discussions around sampling and deferred span creation. There's a few issues that have been in the spec issues for a while, I thought it would be good to maybe raise this as an OTEP.

@yurishkuro yurishkuro changed the title Feature/add sample to span end [feat] Allow sampling at span end Jun 11, 2020
- making idempotence one option to ensure consistent sampling results
  and not skew statistical results
- adding missing open discussion and pro of sample decisions propagating
  to other services.

shouldRetry is only valid when paired with a RECORD or RECORD_AND_SAMPLED result.
A NOT_RECORD result results in no further calls to the sampler, regardless of
the shouldRetry value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the opposite of how Jaeger samplers behave - a YES decision is final, a NO decision may be retryable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is different a bit from Jaeger. The intention was to enable the SDK to make a decision around how to handle "MAYBE" decisions (whether to consider them a Yes or No), as well as allow the SDK / Sampler make the decision on whether a Yes or No is final.

More rationalle here: #115 (comment)

@toumorokoshi toumorokoshi changed the title [feat] Allow sampling at span end [feat] Allow additional sampling hooks Aug 11, 2020
@toumorokoshi
Copy link
Member Author

@jmacd @yurishkuro I'd like to do whatever I can to help move this forward. I'm itching to do some other OT stuff but don't want to leave work unfinished.

Would you mind giving more feedback and calling out what you feel like are blockers, or approving? I feel your approval would go a long way in the viability of this OTEP.

@yurishkuro
Copy link
Member

@toumorokoshi I like the overall direction, but my concern is with the handling of what you called "MAYBE" state. Specifically, per https://github.com/open-telemetry/oteps/pull/115/files#r446273010, I would not be able to implement Jaeger SDK behavior, because this OTEP takes the opposite approach, without conclusive arguments imo. What if we made this aspect configurable?

shouldRetry is only valid when paired with a RECORD or RECORD_AND_SAMPLED result.
A NOT_RECORD result results in no further calls to the sampler, regardless of
the shouldRetry value.

@toumorokoshi
Copy link
Member Author

@toumorokoshi I like the overall direction, but my concern is with the handling of what you called "MAYBE" state. Specifically, per https://github.com/open-telemetry/oteps/pull/115/files#r446273010, I would not be able to implement Jaeger SDK behavior, because this OTEP takes the opposite approach, without conclusive arguments imo. What if we made this aspect configurable?

shouldRetry is only valid when paired with a RECORD or RECORD_AND_SAMPLED result.
A NOT_RECORD result results in no further calls to the sampler, regardless of
the shouldRetry value.

Thanks @yurishkuro for looking!

I think I personally am ambivalent about the concept of recording / not recording, but the rationalle for this behavior was to allow the usage of noop spans if the initial decision is to not record. I'm having trouble finding the original ticket, I'll see if I can find it.

Looking through the Jaeger code linked, I don't think the concept of "record" exists: A span is never replaced with a dummy version, it is always recording values.

If that is the case, I think one could just not use the "NOT_RECORD" state at all in a sampler to achieve a Jaeger client-like behavior. The Jaeger-style sampler would return on the following decisions:

  • "RECORD" -> sampled: False, retryable: True
  • "RECORD_AND_SAMPLED" -> sampled: true, retryable: False

Alternative: remove NOT_RECORD state

An option is to also just remove the concept of a NOT_RECORD state, and by extension not inject noop spans at all. I personally feel this is a good idea for other reasons:

  • SDK users having to reason about edge cases where attributes are not recorded or stored
  • seeing inconsistent performance of functions based on recording decision)

Either way I believe Jaeger-client style behavior is achievable.

@yurishkuro
Copy link
Member

The reason Jaeger does not use true no-op spans is because regardless of the sampling decision for trace collection users usually still want unique trace ID for every trace, which can be used for other correlation purposes (e.g. tagging logs). I don't see how that's possible to achieve with no-op spans unless one still allocates unique instances of no-op spans pointing to distinct SpanContext's (which is not how no-op works in OpenTracing, there no-op span is a singleton that does not incur allocation cost). Jaeger SDKs do support effective no-op because when the span is not sampled, all its write operations are short-circuited to do nothing (but it does require an atomic read of a flag, so slightly more expensive than real no-op span). Strictly speaking, it's possible to implement another span type that has true no-op functions, as long as the decision is known at span creation.

Looking through the Jaeger code linked, I don't think the concept of "record" exists: A span is never replaced with a dummy version, it is always recording values.

That's not completely accurate - Jaeger SDKs have another hidden sampling state called "finalized", which is set to true as soon as the sampler returns non-retryable decision, which makes the most common cases and most common samplers much more efficient, since a finalized sampling means the write ops (e.g. span.setAttribute) do not need to be echoed into sampler callbacks anymore. I think it's kind of related to RECORD mode in OTel:

Jaeger: decision.sample Jaeger: decision.retryable OTEL decision
false true RECORD
false false NOT_RECORD
true false RECORD_AND_SAMPLED

However, the additional differences are:

  • in Jaeger the sampled state is shared between all spans of a given trace that are not finished in the given process (which allows sampling decisions based on tags on children spans affect parent span)
  • the "finalized" state is explicitly set to true if the parent span context was remote, i.e. downstream services don't attempt to make changes to sampling (which means it's possible that parent A starts non-sampled, calls child B that does not record its data, then A gets an attribute that trips sampling to on, and child C will be recording/sampling data).

Is it worth including these in the OTEP? The main objective of that design in Jaeger was graceful fallback onto "normal" behavior where you only have simplistic samplers saying yes/no and making it efficient (i.e. short-circuiting the rest of the callbacks).

@toumorokoshi
Copy link
Member Author

@yurishkuro thanks for the thorough reply!

I believe the Jaeger behavior as you described it is pretty much how OpenTelemetry works today:

  • effective no-op spans: write short-circuit, but a custom SpanContext is passed to allow a unique traceId.
  • the choice for a no-op span is based on the initial sampling choice (which is the only path that will cause an OTEL decision to be "NOT_RECORD")

I think I understand your concern a bit better now: the sticking point is the ability for the Sampler to return back a "NOT_RECORD" result. If you want a no decision to be retryble, then I would just implement the Sampler to never return NOT_RECORD, so either "RECORD" or "RECORD_AND_SAMPLED".

But looking at the mechanics, I feel like this could all be simplified by having "record" by a calculation of the sample decision and the retryable decision. So slightly modifying your table, we could have is_recording be a function:

Jaeger: decision.sample / OTEL sampled Jaeger: decision.retryable / OTEL shouldRetry OTEL is_recording() return value
false true true
false false false
true false true
true true true

This would be a modification of the "SamplingResult" struct in the SDK. This would also still enable the mostly no-op spans that you described.

Is it worth including these in the OTEP? The main objective of that design in Jaeger was graceful fallback onto "normal" behavior where you only have simplistic samplers saying yes/no and making it efficient (i.e. short-circuiting the rest of the callbacks).

I think the two additional behaviors you called out are really good additions to the spec, but can be tackled outside of this OTEP (which is just advocating for additional sampling hooks).

I'm happy to keep filing follow-up PRs and discussions to get all the behavioral issues smoothed out. Unless you feel like this is a blocker for your own approval, I'd like to commit to following up separately on those instead.

Regarding the risks if we go that route:

  • the shared sampling choice in a trace is not possible (or at least not defined in the current spec). To enable jeager-client like behavior, this would have to be addressed
  • the "finalized" state being honored will be possible with the APIs exposed, since they will consume the span and thus the SpanContext. I.E. This behavior could be replicated by a custom sampler. To make it possible in the SDK as well, I think this would be a matter of adding a flag to the built-in samplers to honor the parent's sampling choices.


- Pro: enables setting the recording decision early, which can skip additional
processing for instrumentations that use the isRecording field.
- Pro: enables setting the recording decision early, which can be used
Copy link
Contributor

@lzchen lzchen Aug 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can the decision to use a cheaper Span implementation (such as noop span) upon a hook decision be made without having a span created already? Unless I am misunderstanding something, if we aren't using deferred span creation, wouldn't a "real" span be created if the initial sampling decision (upon starting the span) was set to true?

e.g.

class CustomSampler():
...
   def should_sample(...):
      return True
...
   def onUpdateName(span, name):
       return False
...
tracer_provider = TracerProvider(sampler=CustomSampler())
span = tracer.start_as_current_span("oldname") -> always sampled, so real Span is created
span.updateName("new name") -> what happens here? The real Span is already created, so how do we "save on processing"? How is this "setting the recording decision early", compared to setting it on creation?

@tedsuo
Copy link
Contributor

tedsuo commented Feb 6, 2023

@toumorokoshi I'm closing this since the Sampler API has changed significantly since this OTEP was opened. Please update or open a new PR if you are still interested in this.

@tedsuo tedsuo closed this Feb 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants