Default Tracer propagating SpanContext by default or not #208

carlosalberto · 2019-08-09T16:58:02Z

Besides the base Tracer interface, the Java implementation offers a default, no-op DefaultTracer (https://github.com/open-telemetry/opentelemetry-java/blob/master/api/src/main/java/io/opentelemetry/trace/DefaultTracer.java) that propagates SpanContext.

This has the advantage of providing context propagation by default, but adds a little bit of implementation-specific part on the API side (as actual in-process propagation needs to be done, which in turn requires additional dependencies, etc).

Consider having this no-op Tracer as a pure no-op instance, without any context propagation at all.

(A possible alternative would be offering the no-op + context propagation option in the SDK, where actual implementation code can take place)

cc @bogdandrutu @iredelmeier

The text was updated successfully, but these errors were encountered:

c24t · 2019-08-09T22:00:56Z

Suppose an application has a dependency on an OT web framework or RPC integration, but doesn't load the SDK. Don't we want to support context propagation by default in that case?

On the other hand, if there's no mechanism for including the span context in outgoing calls then we shouldn't propagate the context in-process.

It might be helpful to frame the problem this way: libraries depend on OT either to:

generate spans for internal operations (e.g. a ORM integration)
de/serialize span data for incoming/outgoing requests (e.g. a RPC integration)

If an application depends on any library like (2) -- or has an explicit dependency on the API -- it and all other libraries it depends on should get a DefaultTracer by default. If all of the application's dependencies are like (1) then those dependencies should get a true NoopTracer by default.

Does this sound right to you? Zero-config inter-process context prop seems like an important use case to me, but if it turns out that the current defaults mean people are wasting cycles propagating spans internally that are never exported or passed down the call chain then that's a good argument for changing them.

arminru · 2019-08-12T12:14:02Z

related to #183

tsloughter · 2019-08-12T15:11:16Z

It would make things a lot simpler if it was just a no-op tracer and require the SDK or other impl if context propagation is required.

As I say in #183 it is very unclear what is expected of this default tracer. And I think it leads to potentially more confusion and even bad data if the default tracer is propagating span context without creating new spans. So the DefaultTracer either needs to be no-op or it needs to be the regular tracer found in the SDK.

The only other option I can think of is the default tracer acts the same as the SDK tracer for span start and finish but anything else like recording events, adding attributes, etc would be no-ops.

Hm, or, the default tracer could be the same as the SDK tracer only when called with a remote parent. That way when the remote context comes in to B in a sequence like A->B->C a new span is created and propagated to C but any span creation within B is a no-op since it is a local parent span.

tedsuo · 2019-10-04T01:06:41Z

Hi all, trying to resolve this ticket for the v0.2 release.

For now, I suggest we resolve this by saying the the default will be a No-op for the time being. Some reasoning below.

Property of least surprise

A noop implementation is the most intuitive result, when no SDK is present. It is not possible to correctly intuit any other behavior, one must read the docs. If one has not installed an SDK or taken other action, what are the chances that one has read the docs? Extra headers added to your requests out of nowhere is surprising behavior indeed.

Unexplored edge cases

We have not fully explored all the ways that this could turn sour. Here are some examples:

Untrustworthy observations: A sequence A->B->C, where B has no SDK but propagates context. The backend reports the sequence A->C. Operators take mistaken action, assuming that A is talking directly to C. Would you believe anything your observability system reports, after being bitten by this?
Turning it off: if there is automatic behavior as a result of adding the API to your library, then there is no way to disable it without configuration. If disabling this behavior requires a code change and a redeploy, that is essentially the equivalent of installing an SDK. Any other approach to configuration would involve the API aggressively touching additional surface areas of the process, such as environment variables or command line arguments. This is generally considered bad behavior by a library; many people would not like it. Not allowing propagation to be disabled opens the door to leaking data, unavoidable overhead, and other problems.
Calculating fingerprints: changing the header size can lead to things like oauth not working, load balancers truncating headers, and other surprises. Can you imagine trying to track this down to automated behavior in a library you had not heard of or explicitly enabled? Yikes.

Some of the above is hand wavy. My intention is to show that enabling this behavior is not a simple choice. In a world where we will be asking many, many libraries to natively integrate our API, avoiding surprising (and potentially dangerous) behavior should be a priority. I suggest that even though there are some advantages to propagation-by-default, for now they are outweighed by the risks and unknowns.

Oberon00 · 2019-10-04T14:02:29Z

Regarding your first point ("Untrustworthy observations"): There was discussion in Python (open-telemetry/opentelemetry-python#159) to solve this by generating new Span IDs even in the API-only propagation. Then instead of A->C you would have a broken trace of two parts A and ???->C. However, I don't like that approach either because it (a) requires additional code (span ID generation) in the API that is also not free to execute at runtime and (b) I actually prefer the "untrustworthy" observation. It is always just a question of level of detail: If service B is a Layer-2 router it will also not be visible in the trace.

One IMHO particularly important point from that disussion: The "untrustworthy observation" may also occur if the SDK is installed and configured but the spans for A and C are sampled while the one for B is not.

tsloughter · 2019-10-04T14:54:56Z

+1 to no-op.

carlosalberto · 2019-10-04T19:20:10Z

+1 to no-op for now.

SergeyKanzhelev · 2019-10-07T19:37:44Z

Enabling API everywhere is a long term goal. For now API enablement is quite involved process. Mostly people will enable it together with SDK. So API-only behavior can also be seen as API with the "disabled" SDK behavior. Where propagation is expected from my opinion.

Also with the move to separation of context propagation into a separate package in v0.3 that is discussed in oteps - automatic context propagation even in absence of SDK may again be a reality.

So I'd recommend to keep propagation behavior in API-only mode and wait for end user feedback.

c24t · 2019-10-08T17:01:10Z

Some comments on "untrustworthy observations" only:

There was discussion in Python ... to solve this by generating new Span IDs even in the API-only propagation

FWIW my reading of the spec is that we always have to generate new span IDs, proxies excepted.

@SergeyKanzhelev is actually the author of that part of the spec: w3c/trace-context#108. He may be able to give us the word of god here.

The "untrustworthy observation" may also occur if the SDK is installed and configured but the spans for A and C are sampled while the one for B is not.

It's technically possible, but I don't think this is likely to happen in practice. The client that creates the first span makes the sampling decision, and downstream clients can either respect that decision or choose not to sample.

This is the difference between the ProbabilitySampler (respects parent sampling decision) and AlwaysOnSampler (doesn't) in OC, and the reason we default to ProbabilitySampler with probability 1.

c24t · 2019-10-08T18:04:02Z

My intention is to show that enabling this behavior is not a simple choice. In a world where we will be asking many, many libraries to natively integrate our API, avoiding surprising (and potentially dangerous) behavior should be a priority.

Well said. I agree that we should be careful about making this library too magical and stick to conservative defaults. And that there are a lot of edge cases still to consider.

To make sense of these examples we have to distinguish in- and inter-process context propagation. AFAICT, in every language, the API package itself doesn't change outbound headers. For example:

The JS library ships with plugins for http and grpc, and a separate package to load these plugins automatically.
In python we provide WSGI middleware to extract the trace context from inbound requests and insert it into the application context, and a requests integration to attach trace context headers to outbound http requests made via the requests library.

The question here isn't whether we're going to add extra headers to requests, but whether applications need to depend on the SDK (in addition to whichever plugins they need) to propagate the trace context.

tedsuo · 2019-10-09T17:26:25Z

@c24t I agree, I do not see how distributed context propagation can occur without installing plugins... so we are talking about a scenario where the are plugins already installed. Eventually native instrumentation become may become popular, and all software libraries could automatically provide context/header propagation. Until that happens, this scenario concerning automated context propagation – without knowledge or action on the part of the application developer – is pretty much an edge case. I don't think there is an obvious way to enable API-only behavior.

More realistically, at this time, I suggest we focus on providing the following for our users:

sane SDK defaults - if the config defaults for all of our SDKs are different, and do not compose into a working system, there will be breakages. So we should define critical SDK defaults in the spec.
Automatic installation and setup - to the degree possible, every language should provide an installer which automates the setup of both SDK and plugins, which will remove any room for configuration error.

This combo – a zero-config auto-installation – simplifies the process of organizational adoption.

@SergeyKanzhelev I do understand the intent - developers are bad at configuring their systems, and distributed tracing systems benefit from every individual service being properly configured. So it would be great if we could enable context propagation without requiring the application developer to take action. However, for the reasons stated above, I do not believe that an API-only approach would effectively solve this problem any time soon. And since any default behavior creates edge cases which have not been addressed, it seems like this idea needs more time to incubate. Up till now, we've taken a conservative approach to adding new features which have not been validated - I suggest we continue to think about this feature (preferably via an RFC), and for now ship with a no-op implementation.

SergeyKanzhelev · 2019-10-10T20:35:22Z

Interestingly that the spec is currently already saying this while the decision that was made in v0.1 was to use W3C everywhere. I'm not sure what to do with this issue as it suggest to make a change that is already in a spec (is there any other mentions I am missing?)

The implementation MUST provide no-op binary and text Propagators, which the Tracer SHOULD use by default if other propagators are not configured. SDKs SHOULD use the W3C HTTP Trace Context as the default text format. For more details, see trace-context.

As for the arguments - we all saying the same words and listing the same assumptions about the state of the world. However we getting to the different conclusion somehow.

I believe that this project with upcoming improvements like context propagation separation, pre-installed OT collector exporter, introducing logging support is going to the direction when the mere fact of presence of OT API must initiate propagation of context. So earlier we will declare that - easier it will be to explain to end users.

This is what every partner in W3C discussions is looking for. End users control less components in their infrastructure and no matter whether they can enable monitoring or not - they want to have an end-to-end visibility. So ideally, the context propagation is a feature of a platform itself. In other words, the feature of an API-only deployments.

Anyway, I'd suggest we change the wording in specs to the W3C in API-only. And get the end-user feedback.

Thoughts?

tsloughter · 2019-10-10T20:42:54Z

@SergeyKanzhelev that spec only says what to do when crossing process boundaries. Does context propagate within the process when no SDK is enabled? And if it does, how? How here referring to what is the current active context supposed to be when only the API is loaded and a start_span is encountered? Does the parent span context remain the active span?

SergeyKanzhelev · 2019-10-10T21:29:54Z

@tsloughter it's a good question that definitely need a clarification, even for the SDK-only case. I'd say whatever works with the propagators must read/write contexts all the time. Spans needs to be associated with the context. However for inner spans - tracer may return the same span all the time or reuse active.

carlosalberto · 2020-01-28T01:43:45Z

Closing this issue and opening a new one, as propagation went away from the Tracer component.

iredelmeier added api vs sdk bounds labels Aug 13, 2019

carlosalberto mentioned this issue Aug 20, 2019

Minimal implementation is unclear on context propagation #183

Closed

Oberon00 mentioned this issue Sep 16, 2019

Adding a working propagator, adding to integrations and example open-telemetry/opentelemetry-python#137

Merged

SergeyKanzhelev added this to the Alpha v0.2 milestone Sep 27, 2019

tedsuo self-assigned this Oct 4, 2019

SergeyKanzhelev modified the milestones: Alpha v0.2, Alpha v0.3 Oct 16, 2019

Oberon00 mentioned this issue Oct 22, 2019

Define where to handle invalid span contexts open-telemetry/opentelemetry-python#233

Closed

fbogsany mentioned this issue Oct 23, 2019

Update Tracer#in_span to finish the Span open-telemetry/opentelemetry-ruby#118

Merged

carlosalberto closed this as completed Jan 28, 2020

carlosalberto mentioned this issue Jan 28, 2020

Default global propagators #428

Closed

Oberon00 mentioned this issue Jan 30, 2020

Mark local root spans in distributed parts of a trace. #366

Closed

arminru mentioned this issue Feb 7, 2020

Initial composite/global Propagators update for OTEP 66. #440

Merged

Oberon00 mentioned this issue Mar 18, 2020

Remove wording about default interprocess propagator #519

Closed

Oberon00 mentioned this issue Jul 5, 2020

Specify the behavior of the Tracer APIs in the absence of an SDK #689

Closed

carlosalberto mentioned this issue Jul 20, 2020

Have the global Propagators default to a no-op instance. #721

Closed

TuckTuckFloof pushed a commit to TuckTuckFloof/opentelemetry-specification that referenced this issue Oct 15, 2020

Remove default parameters from Recordable interface (open-telemetry#208)

b096d8e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Default Tracer propagating SpanContext by default or not #208

Default Tracer propagating SpanContext by default or not #208

carlosalberto commented Aug 9, 2019

c24t commented Aug 9, 2019

arminru commented Aug 12, 2019

tsloughter commented Aug 12, 2019

tedsuo commented Oct 4, 2019 •

edited

Loading

Oberon00 commented Oct 4, 2019

tsloughter commented Oct 4, 2019

carlosalberto commented Oct 4, 2019

SergeyKanzhelev commented Oct 7, 2019

c24t commented Oct 8, 2019

c24t commented Oct 8, 2019

tedsuo commented Oct 9, 2019

SergeyKanzhelev commented Oct 10, 2019

tsloughter commented Oct 10, 2019

SergeyKanzhelev commented Oct 10, 2019

carlosalberto commented Jan 28, 2020

Default Tracer propagating SpanContext by default or not #208

Default Tracer propagating SpanContext by default or not #208

Comments

carlosalberto commented Aug 9, 2019

c24t commented Aug 9, 2019

arminru commented Aug 12, 2019

tsloughter commented Aug 12, 2019

tedsuo commented Oct 4, 2019 • edited Loading

Property of least surprise

Unexplored edge cases

Oberon00 commented Oct 4, 2019

tsloughter commented Oct 4, 2019

carlosalberto commented Oct 4, 2019

SergeyKanzhelev commented Oct 7, 2019

c24t commented Oct 8, 2019

c24t commented Oct 8, 2019

tedsuo commented Oct 9, 2019

SergeyKanzhelev commented Oct 10, 2019

tsloughter commented Oct 10, 2019

SergeyKanzhelev commented Oct 10, 2019

carlosalberto commented Jan 28, 2020

tedsuo commented Oct 4, 2019 •

edited

Loading