-
Notifications
You must be signed in to change notification settings - Fork 893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify Span ID creation with sampling. #998
Specify Span ID creation with sampling. #998
Conversation
specification/trace/sdk.md
Outdated
note: this must be done before calling `ShouldSample`, because it expects | ||
a valid trace Id as input). | ||
2. Query the `Sampler`'s [`ShouldSample`](#shouldsample) method. | ||
3. If the decision is `DROP`, use the parent span ID or all-zero if there is none. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That means: If there is no parent and the decision is drop, we will have a span with the new trace ID but an invalid span ID. The SpanContext as a whole is thus invalid. We could also specify that the generated trace ID should be discarded in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this may be worth re-considering: Generating trace+span ID for an unsampled root has implications if the span is injected. If we generate an invalid spancontext (with zero span ID), we will inject nothing as we have an invalid traceparent header as per W3C (cf. also #753). This causes downstream components to potentially create traces / run sampling logic that they would not if they got an incoming unsampled traceparent header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also specify that the generated trace ID should be discarded in that case.
I'd be fine with this.
we will inject nothing as we have an invalid traceparent header as per W3C (cf. also #753). This causes downstream components to potentially create traces
Could you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carlosalberto We never inject an invalid SpanContext, at least using W3C. If we would generate a new ID instead of using zero, we would inject a valid spancontext with unset sampled flag. This will likely cause the receiver to suppress the trace. On the other hand, if we use zero, the downstream component will not receive any tracestate and will likely start a new trace.
Now that I write, this, I realize that this even applies to local child spans: Normally if the root span is not sampled, local child spans would be unsampled too, but with this they would become root spans.
This is bad. I think we will have to create Span and Trace ID even for "dropped" root spans after all. It seems there is a dependency of #864 "Option to allow "default" IDs for unsampled traces" on #753 "Current "Invalid" SpanContext definition precludes TraceState-only SpanContext". We would have to have some sentinel SpanContext that is "valid, unsampled, but without any IDs". That's out of scope here of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in 4006816. Maybe this demonstrates how easy it is to mess up here and that it is thus important to have this specified clearly? 😃
specification/trace/sdk.md
Outdated
a valid trace Id as input). | ||
2. Query the `Sampler`'s [`ShouldSample`](#shouldsample) method. | ||
3. If the decision is `DROP`, use the parent span ID or all-zero if there is none. | ||
Otherwise (if the decision is not `DROP`) a new span ID MUST be generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That means: RECORD_ONLY spans get a new span ID, always. I don't know what the purpose of RECORD_ONLY is (should we remove it?), but having different spans with the same trace+span ID seems worse than breaking the trace in case any children of the RECORD_ONLY span become RECORD_AND_SAMPLE .
from the issue triage mtg today, labelling this as however, if this is just a editorial change that does not affect the trace freeze, please speak |
@andrewhsu As stated in the issue comment, this PR is exactly the part of the issue that is not doable after-ga. |
@Oberon00 will you be able get the approvals today? Other options are not doing this in 1.0 at all or asking for an exception from spec freeze. |
@tigrannajaryan I won't. In that case, it seems we will have to look around after GA and document all the paths taken by any SDKs as allowed and maybe introduce some options. |
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
I changed the corresponding issue to a new issue #1060, so we can re-evaluate what to do here for GA. |
Relabeled to allowed-for-ga to match #1060. |
Co-authored-by: Armin Ruech <[email protected]>
Conflicts: spec-compliance-matrix.md
specification/trace/sdk.md
Outdated
3. If the decision is `DROP` and there is a valid parent span ID, use it. | ||
Otherwise (if the decision is not `DROP` or there was no valid parent span ID) | ||
a new span ID MUST be generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This is tricky, but I like @Oberon00's proposal that DROP sampling decisions at the root of a trace should generate a new trace ID and new span ID without the sampled bit set. Am I understanding?
@open-telemetry/specs-approvers Please review this one. We bumped its related issue as "Required-for-GA" so we definitely needs reviews ;) |
(note: this must be done before calling `ShouldSample`, because it expects | ||
a valid trace ID as input). | ||
2. Query the `Sampler`'s [`ShouldSample`](#shouldsample) method. | ||
3. If the decision is `DROP` and there is a valid parent span ID, reuse it as the new `Span`'s span ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this mean two spans with the same id? I can imagine this could become extremely confusing to someone trying to debug things. As a user, I think I'd be super confused if I saw a new Span created with the same ID as the parent, even if that span might not end up going anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's two spans with the same ID the same way as calling extract and then inject creates a copy of the injected span (with the same ID). It is more of a hollow shell of a span than an actual span.
even if that span might not end up going anywhere
This span cannot end up going anywhere. A dropped span is always non-recording, so there is nothing that could end up anywhere beside the span+trace ID itself (not even a parent span ID). What is more, SpanProcessors are not invoked with dropped spans.
RECORD_ONLY spans are a different story, and they do get a new ID, see #998 (comment)
All that being said, I think I could agree that this may make the mental model of sampling a bit more complicated than just always creating a new ID. But I hope that the benefits, namely ability to resume sampling in a child and having the closes sampled span as parent, and a slight(?) performance improvement would be worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I think I'd prefer attacking this problem via a smarter SpanContext that allowed propagation of the relevant bits, rather than generating weird internal spans for the sake of propagating the SpanContext.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, better yet, move the TraceState and TraceFlags out of the span altogether and propagate them separately in the Context, rather than hanging them off a span.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should startSpan return for a dropped span then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regeneration looks like a safer choice that will not break SDK going forward
I think changing from any of them to the other will break some uses.
and is easier to explain
Fair enough.
the idea is that instrumented library needs a reliable unique ID of the operation
If you have an ID that is not associated with any trace on the tracing backend, will that be more helpful than the "inexact" ID of the parent operation? Note, you will have a valid span ID in any case.
I would say we go GA with this, and discuss how to improve on it using things like adding a GetUniqueOperationId
function to Span
that returns the span ID if it was generated for that span or generates a new ID that is cached and propagated to children if it is not. I recon propagating this ID cross-process may be difficult, but the same goes for propagating the ID of the last sampled span, and it seems unlikely that you call GetUniqueOperationId on an extracted span directly (it would not be very unique, since the parent's process would have the same ID if it was sampled at least).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regeneration looks like a safer choice that will not break SDK going forward
I think changing from any of them to the other will break some uses.
and is easier to explain
Fair enough.
the idea is that instrumented library needs a reliable unique ID of the operation
If you have an ID that is not associated with any trace on the tracing backend, will that be more helpful than the "inexact" ID of the parent operation? Note, you will have a valid span ID in any case.
I would say we go GA with this, and discuss how to improve on it using things like adding a
GetUniqueOperationId
function toSpan
that returns the span ID if it was generated for that span or generates a new ID that is cached and propagated to children if it is not. I recon propagating this ID cross-process may be difficult, but the same goes for propagating the ID of the last sampled span, and it seems unlikely that you call GetUniqueOperationId on an extracted span directly (it would not be very unique, since the parent's process would have the same ID if it was sampled at least).
The assumption that there is a trace backend may not be true for everybody. And having trace backend of your caller having a record of a parent span Id is not helpful to solve your problem. So the decision we made at @open-telemetry/technical-committee is to go with span ID regeneration. This creates a nice guarantee of local uniqueness of a spanID that various telemetry signals may take a dependency on.
How strongly do you disagree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not very strongly. I do think my proposed solution is the better trade-off but I concede that always generating a new ID is also a reasonable option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Oberon00 Thanks for the answer. I think we will go with the TC decision for now then - if you have time, would you mind updating this PR to match the proposal? Else, I can help by creating a spin-off of this PR. Let me know ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be best to create a spin-off PR. That way we clearly document that this option was rejected.
Ping @yurishkuro |
@SergeyKanzhelev you mentioned during the specification sig that you will document the decision that was discussed during the TC meeting. |
@Oberon00 please rebase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SergeyKanzhelev you mentioned during the specification sig that you will document the decision that was discussed during the TC meeting.
sorry for delay. This comment summarizes it: #998 (comment) but I was thinking to expand a bit.
Fixes #1060.
Partial fix for #864 (remainder would require an API change to not pass a trace ID to the sampler if there is none).
Changes