Specify fallback behaviour when Span.end() was not called due to an error or by accident #1859

svrnm · 2021-08-10T17:28:34Z

What are you trying to achieve?

When I run an application with OpenTelemetry parent spans and their child spans being emitted, I might run into a situation where the application might crash or end before the Span.end() was called on the parent span. This leads to undefined behaviour. Jaeger for example reports "undefined parent id" but leaves the spans disconnected.

Since one purpose of OpenTelemetry is to help finding the cause of an application issue (like a crash), I believe, that the Span.end() api needs a clear specification what needs to be done in such a situation.

I see a few options (and there might be more):

Provide an error handler, that finishes all "partial" spans on crash/application end.
Add a mechanism to the collector that reconstructs a missing parent from child spans.
Allow parent spans to be send as "partial" to the collector before the duration/end is clear or allow the duration of the parent span to be updated later (e.g. after each child I update the parent's duration)
<other options>

Additional context.

This come up at the discussion for open-telemetry/oteps#169

The text was updated successfully, but these errors were encountered:

jkwatson · 2021-08-10T17:49:52Z

In the Java SDK, we don't keep references to the user's spans currently. So, if we were to be required to implement something like the first or 3rd option listed here, we'd have to keep references to the user's spans, potentially leading to memory leaks in these cases. It would be far worse to have a memory leak in the application, as opposed to having some broken/incomplete traces being emitted.

Java also provides a special optional SpanProcessor implementation that allow users to find their un-ended spans, but it's definitely opt-in since it does introduce a significant performance penalty: https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk-extensions/tracing-incubator/src/main/java/io/opentelemetry/sdk/extension/incubator/trace/LeakDetectingSpanProcessor.java#L30

Oberon00 · 2021-08-10T19:31:34Z

The spec leaves that deliberately unspecified to allow languages that are not garbage collected to leak spans if end is not called. Any other option can introduce a significant performance hit.

Oberon00 · 2021-08-10T19:33:32Z

Actually this is (somewhat confusingly, I admit) written in the section on span creation: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#span-creation

Any span that is created MUST also be ended. This is the responsibility of the user. API implementations MAY leak memory or other resources (including, for example, CPU time for periodic work that iterates all spans) if the user forgot to end the span.

svrnm · 2021-08-11T09:22:55Z

Ok, I did miss the part in the section about span creation. Would it make sense then to clarify that the responsibility to end all spans also includes adding fail-safe mechanisms which might be language specific? E.g. a language with garbage collection could use something like a destructor/finalize to end the span if needed or there is some optional implementation like the mentioned LeakDetectionSpanProcessor?

So, if we were to be required to implement something like the first or 3rd option listed here, we'd have to keep references to the user's spans, potentially leading to memory leaks in these cases.
Any other option can introduce a significant performance hit.

I agree with this point especially for option (1), but how does option (3) lead to memory leaks/introduce a performance hit: An user creates a parent span and transmit it with an interim or no end time stamp. In the same sense it is now the responsibility of the user to track that instance (or at least keep the spanid) and end (and update) the span later. The big impact I see here is more on the collector/backend site, which needs to allow spans to be updated.

Oberon00 · 2021-08-11T10:17:16Z

The arguments you bring up are specific to garbage collected languages, and these are free to provide more guarantees. The API spec here is written for the least common denominator. Imagine a C API:

otel_Span* span = otel_start_span("foo", otel_SpanKind_Client);
// .. do something with the span
otel_end_span(span); // Ends the span, freeing any memory associated.

However, just writing this I realize that this is acutally not allowed since otel_get_span_context(span) needs to work even after ending, which means that you'd need a separate otel_free_span() function or similar. That's a shame. @open-telemetry/cpp-approvers was this an issue in C++ too?

svrnm · 2021-08-12T07:58:34Z

What I mean by "clarify that the responsibility to end all spans also includes adding fail-safe mechanisms" is adding a few words into the section (addition in bold):

Any span that is created MUST also be ended. This is the responsibility of the user. This also includes cases where the application ends unexpectedly due to a crash or an early termination. API implementations MAY leak memory or other resources (including, for example, CPU time for periodic work that iterates all spans) if the user forgot to end the span.

That's not a perfect wording, but I hope it gives an idea what I mean?

The arguments you bring up are specific to garbage collected languages, and these are free to provide more guarantees. The API spec here is written for the least common denominator. Imagine a C API: [...]

If there would be an option that otel_start_span already exports a partial span (without end timestamp set yet) to the collector, and otel_end_span does export the full span (with end timestamp and maybe some more updated properties) to the collector, how is this specific to garbage collected languages?

Oberon00 · 2021-08-12T08:04:38Z

If there would be an option that otel_start_span already exports a partial span

That option can be implemented by the user, there is an OnStart callback in the SpanProcessor. The experimental Z pages feature does just that (implemented e.g. in Java).

Oberon00 · 2021-08-12T08:07:34Z

That's not a perfect wording, but I hope it gives an idea what I mean?

No, I don't really understand anymore I think. In the case where the application terminates unexpectedly, the user can obviously not be expected to end all spans. Are you suggesting that all spans must be sent already in OnStart? That would be some significant network overhead and also I don't think most backends will handle it well when they receive the same span with different attributes twice.

svrnm · 2021-08-12T08:28:42Z

That option can be implemented by the user, there is an OnStart callback in the SpanProcessor. The experimental Z pages feature does just that (implemented e.g. in Java).

Let me check this out then!

svrnm · 2021-08-12T08:38:13Z

No, I don't really understand anymore I think. In the case where the application terminates unexpectedly, the user can obviously not be expected to end all spans. Are you suggesting that all spans must be sent already in OnStart? That would be some significant network overhead and also I don't think most backends will handle it well when they receive the same span with different attributes twice.

I am not suggesting that all spans must be send already in OnStart: my line of thinking was, that depending on the language the user can (at least try) to handle that situation with whatever mechanism that language has for those circumstances: error handler, try..catch..., Signal Handling, Signal Traps, Destructors, ...

To put it differently, if unexpected terminations/crashes are an exception to that rule of Any span that is created MUST also be ended, then the word MUST needs to be replaced with a SHOULD

Oberon00 · 2021-08-12T08:42:14Z

Unexpected termination, e.g. through SIGKILL can be used to violate most MUST requirements in this spec, I don't think you can word everything for it.

This MUST is problematic anyway since it is a requirement placed on the users, so it's not an actual spec-requirement. The actually meaningful spec language is the "API implementations MAY leak memory or other resources (including, for example, CPU time for periodic work that iterates all spans) if the user forgot to end the span." ("forgot" should probably be replaced by "does not").

pyohannes · 2021-08-12T23:53:37Z

However, just writing this I realize that this is acutally not allowed since otel_get_span_context(span) needs to work even after ending, which means that you'd need a separate otel_free_span() function or similar. That's a shame. @open-telemetry/cpp-approvers was this an issue in C++ too?

In the C++ API spans are returned as shared pointers, so the span lifetime is decoupled from memory management. Ending a span doesn't automatically free memory.

anuraaga · 2021-08-13T02:15:12Z

@svrnm Thanks a lot for filing this!

For reference, Brave (Zipkin) uses a weak map from Span -> (SpanContext + Start Time) to detect leaks. A reference queue allows the weak map to be notified when a Span is garbage collected, and if it wasn't ended, then a synthetic span is exported at that point using the SpanContext and an extra attribute, flush: true. This has been invaluable as users instrument asynchronous frameworks and it's fairly easy to miss ending a span (usually because of not catching an exception somewhere). The Brave version doesn't keep track of stacktraces so has very little overhead so it's opt-out there.

I think it would be great to have fallback behavior defined in the spec, especially if we can add a semantic convention to indicate spans that are exported due to garbage collection.

Oberon00 · 2021-08-13T08:08:32Z

I think it would be great to have fallback behavior defined in the spec, especially if we can add a semantic convention to indicate spans that are exported due to garbage collection.

That would make sense. But this should be optional to implement for SDKs. I would even say, if a SIG can gain performance by implementing Spans in way that does not allow leak detection, they should go for it. We shouldn't optimize for the error-case. Of course, having a (compile-time?) option that optionally allows additional debugging is helpful if possible.

anuraaga · 2021-08-13T08:11:20Z

@Oberon00 Yup, agree that this should be optional for SDKs, just a definition to ensure consistency if implemented but not a requirement.

carlosalberto · 2021-08-16T14:20:45Z

I think it would be great to have fallback behavior defined in the spec, especially if we can add a semantic convention to indicate spans that are exported due to garbage collection.

A nice although big change IMHO. Maybe it would require an OTEP, but it would definitely somebody writing a prototype at least ;)

Oberon00 · 2021-08-16T14:25:51Z

if we can add a semantic convention to indicate spans that are exported due to garbage collection.

A nice although big change IMHO

Do you think? It requires the language runtime to notify you before a span object gets GCd. Then you can resurrect the span call span.SetAttribute("otel.leaked_span", true) (+ maybe an error status) and span.End() and that's it. Implementable today.

carlosalberto · 2021-08-16T14:33:00Z

language runtime to notify you before a span object gets GCd

This may be the tricky part, as other languages may have a hard time doing this without impacting performance. If not, we should be good to go! (also, making this optional would help). In any case, I'm guessing prototypes in different languages will help.

Oberon00 · 2021-08-16T14:56:20Z

This may be the tricky part, as other languages may have a hard time doing this without impacting performance.

That's one reason why it should be optional. And even optional in two senses IMHO: If a SIG decides it cannot / does not want to implement it, they don't need to. And if they decide to implement it, is MUST be default-off and can optionally be enabled (e.g. by adding an additional SpanProcessor, or with a compile-time #define).

svrnm added the spec:trace Related to the specification/trace directory label Aug 10, 2021

github-actions bot assigned jsuereth Aug 10, 2021

svrnm closed this as completed Jan 17, 2024

svrnm closed this as not planned Won't fix, can't repro, duplicate, stale Jan 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specify fallback behaviour when Span.end() was not called due to an error or by accident #1859

Specify fallback behaviour when Span.end() was not called due to an error or by accident #1859

svrnm commented Aug 10, 2021

jkwatson commented Aug 10, 2021

Oberon00 commented Aug 10, 2021

Oberon00 commented Aug 10, 2021 •

edited

Loading

svrnm commented Aug 11, 2021

Oberon00 commented Aug 11, 2021 •

edited

Loading

svrnm commented Aug 12, 2021

Oberon00 commented Aug 12, 2021

Oberon00 commented Aug 12, 2021 •

edited

Loading

svrnm commented Aug 12, 2021

svrnm commented Aug 12, 2021

Oberon00 commented Aug 12, 2021

pyohannes commented Aug 12, 2021

anuraaga commented Aug 13, 2021

Oberon00 commented Aug 13, 2021

anuraaga commented Aug 13, 2021 •

edited

Loading

carlosalberto commented Aug 16, 2021

Oberon00 commented Aug 16, 2021

carlosalberto commented Aug 16, 2021

Oberon00 commented Aug 16, 2021

Specify fallback behaviour when Span.end() was not called due to an error or by accident #1859

Specify fallback behaviour when Span.end() was not called due to an error or by accident #1859

Comments

svrnm commented Aug 10, 2021

jkwatson commented Aug 10, 2021

Oberon00 commented Aug 10, 2021

Oberon00 commented Aug 10, 2021 • edited Loading

svrnm commented Aug 11, 2021

Oberon00 commented Aug 11, 2021 • edited Loading

svrnm commented Aug 12, 2021

Oberon00 commented Aug 12, 2021

Oberon00 commented Aug 12, 2021 • edited Loading

svrnm commented Aug 12, 2021

svrnm commented Aug 12, 2021

Oberon00 commented Aug 12, 2021

pyohannes commented Aug 12, 2021

anuraaga commented Aug 13, 2021

Oberon00 commented Aug 13, 2021

anuraaga commented Aug 13, 2021 • edited Loading

carlosalberto commented Aug 16, 2021

Oberon00 commented Aug 16, 2021

carlosalberto commented Aug 16, 2021

Oberon00 commented Aug 16, 2021

Oberon00 commented Aug 10, 2021 •

edited

Loading

Oberon00 commented Aug 11, 2021 •

edited

Loading

Oberon00 commented Aug 12, 2021 •

edited

Loading

anuraaga commented Aug 13, 2021 •

edited

Loading