Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify fallback behaviour when Span.end() was not called due to an error or by accident #1859

Closed
svrnm opened this issue Aug 10, 2021 · 19 comments
Assignees
Labels
spec:trace Related to the specification/trace directory

Comments

@svrnm
Copy link
Member

svrnm commented Aug 10, 2021

What are you trying to achieve?

When I run an application with OpenTelemetry parent spans and their child spans being emitted, I might run into a situation where the application might crash or end before the Span.end() was called on the parent span. This leads to undefined behaviour. Jaeger for example reports "undefined parent id" but leaves the spans disconnected.

Since one purpose of OpenTelemetry is to help finding the cause of an application issue (like a crash), I believe, that the Span.end() api needs a clear specification what needs to be done in such a situation.

I see a few options (and there might be more):

  • Provide an error handler, that finishes all "partial" spans on crash/application end.
  • Add a mechanism to the collector that reconstructs a missing parent from child spans.
  • Allow parent spans to be send as "partial" to the collector before the duration/end is clear or allow the duration of the parent span to be updated later (e.g. after each child I update the parent's duration)
  • <other options>

Additional context.

This come up at the discussion for open-telemetry/oteps#169

@svrnm svrnm added the spec:trace Related to the specification/trace directory label Aug 10, 2021
@jkwatson
Copy link
Contributor

In the Java SDK, we don't keep references to the user's spans currently. So, if we were to be required to implement something like the first or 3rd option listed here, we'd have to keep references to the user's spans, potentially leading to memory leaks in these cases. It would be far worse to have a memory leak in the application, as opposed to having some broken/incomplete traces being emitted.

Java also provides a special optional SpanProcessor implementation that allow users to find their un-ended spans, but it's definitely opt-in since it does introduce a significant performance penalty: https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk-extensions/tracing-incubator/src/main/java/io/opentelemetry/sdk/extension/incubator/trace/LeakDetectingSpanProcessor.java#L30

@Oberon00
Copy link
Member

The spec leaves that deliberately unspecified to allow languages that are not garbage collected to leak spans if end is not called. Any other option can introduce a significant performance hit.

@Oberon00
Copy link
Member

Oberon00 commented Aug 10, 2021

Actually this is (somewhat confusingly, I admit) written in the section on span creation: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#span-creation

Any span that is created MUST also be ended. This is the responsibility of the user. API implementations MAY leak memory or other resources (including, for example, CPU time for periodic work that iterates all spans) if the user forgot to end the span.

@svrnm
Copy link
Member Author

svrnm commented Aug 11, 2021

Ok, I did miss the part in the section about span creation. Would it make sense then to clarify that the responsibility to end all spans also includes adding fail-safe mechanisms which might be language specific? E.g. a language with garbage collection could use something like a destructor/finalize to end the span if needed or there is some optional implementation like the mentioned LeakDetectionSpanProcessor?

So, if we were to be required to implement something like the first or 3rd option listed here, we'd have to keep references to the user's spans, potentially leading to memory leaks in these cases.
Any other option can introduce a significant performance hit.

I agree with this point especially for option (1), but how does option (3) lead to memory leaks/introduce a performance hit: An user creates a parent span and transmit it with an interim or no end time stamp. In the same sense it is now the responsibility of the user to track that instance (or at least keep the spanid) and end (and update) the span later. The big impact I see here is more on the collector/backend site, which needs to allow spans to be updated.

@Oberon00
Copy link
Member

Oberon00 commented Aug 11, 2021

The arguments you bring up are specific to garbage collected languages, and these are free to provide more guarantees. The API spec here is written for the least common denominator. Imagine a C API:

otel_Span* span = otel_start_span("foo", otel_SpanKind_Client);
// .. do something with the span
otel_end_span(span); // Ends the span, freeing any memory associated.

However, just writing this I realize that this is acutally not allowed since otel_get_span_context(span) needs to work even after ending, which means that you'd need a separate otel_free_span() function or similar. That's a shame. @open-telemetry/cpp-approvers was this an issue in C++ too?

@svrnm
Copy link
Member Author

svrnm commented Aug 12, 2021

What I mean by "clarify that the responsibility to end all spans also includes adding fail-safe mechanisms" is adding a few words into the section (addition in bold):

Any span that is created MUST also be ended. This is the responsibility of the user. This also includes cases where the application ends unexpectedly due to a crash or an early termination. API implementations MAY leak memory or other resources (including, for example, CPU time for periodic work that iterates all spans) if the user forgot to end the span.

That's not a perfect wording, but I hope it gives an idea what I mean?

The arguments you bring up are specific to garbage collected languages, and these are free to provide more guarantees. The API spec here is written for the least common denominator. Imagine a C API: [...]

If there would be an option that otel_start_span already exports a partial span (without end timestamp set yet) to the collector, and otel_end_span does export the full span (with end timestamp and maybe some more updated properties) to the collector, how is this specific to garbage collected languages?

@Oberon00
Copy link
Member

If there would be an option that otel_start_span already exports a partial span

That option can be implemented by the user, there is an OnStart callback in the SpanProcessor. The experimental Z pages feature does just that (implemented e.g. in Java).

@Oberon00
Copy link
Member

Oberon00 commented Aug 12, 2021

That's not a perfect wording, but I hope it gives an idea what I mean?

No, I don't really understand anymore I think. In the case where the application terminates unexpectedly, the user can obviously not be expected to end all spans. Are you suggesting that all spans must be sent already in OnStart? That would be some significant network overhead and also I don't think most backends will handle it well when they receive the same span with different attributes twice.

@svrnm
Copy link
Member Author

svrnm commented Aug 12, 2021

That option can be implemented by the user, there is an OnStart callback in the SpanProcessor. The experimental Z pages feature does just that (implemented e.g. in Java).

Let me check this out then!

@svrnm
Copy link
Member Author

svrnm commented Aug 12, 2021

No, I don't really understand anymore I think. In the case where the application terminates unexpectedly, the user can obviously not be expected to end all spans. Are you suggesting that all spans must be sent already in OnStart? That would be some significant network overhead and also I don't think most backends will handle it well when they receive the same span with different attributes twice.

I am not suggesting that all spans must be send already in OnStart: my line of thinking was, that depending on the language the user can (at least try) to handle that situation with whatever mechanism that language has for those circumstances: error handler, try..catch..., Signal Handling, Signal Traps, Destructors, ...

To put it differently, if unexpected terminations/crashes are an exception to that rule of Any span that is created MUST also be ended, then the word MUST needs to be replaced with a SHOULD

@Oberon00
Copy link
Member

Unexpected termination, e.g. through SIGKILL can be used to violate most MUST requirements in this spec, I don't think you can word everything for it.

This MUST is problematic anyway since it is a requirement placed on the users, so it's not an actual spec-requirement. The actually meaningful spec language is the "API implementations MAY leak memory or other resources (including, for example, CPU time for periodic work that iterates all spans) if the user forgot to end the span." ("forgot" should probably be replaced by "does not").

@pyohannes
Copy link
Contributor

However, just writing this I realize that this is acutally not allowed since otel_get_span_context(span) needs to work even after ending, which means that you'd need a separate otel_free_span() function or similar. That's a shame. @open-telemetry/cpp-approvers was this an issue in C++ too?

In the C++ API spans are returned as shared pointers, so the span lifetime is decoupled from memory management. Ending a span doesn't automatically free memory.

@anuraaga
Copy link
Contributor

@svrnm Thanks a lot for filing this!

For reference, Brave (Zipkin) uses a weak map from Span -> (SpanContext + Start Time) to detect leaks. A reference queue allows the weak map to be notified when a Span is garbage collected, and if it wasn't ended, then a synthetic span is exported at that point using the SpanContext and an extra attribute, flush: true. This has been invaluable as users instrument asynchronous frameworks and it's fairly easy to miss ending a span (usually because of not catching an exception somewhere). The Brave version doesn't keep track of stacktraces so has very little overhead so it's opt-out there.

I think it would be great to have fallback behavior defined in the spec, especially if we can add a semantic convention to indicate spans that are exported due to garbage collection.

@Oberon00
Copy link
Member

I think it would be great to have fallback behavior defined in the spec, especially if we can add a semantic convention to indicate spans that are exported due to garbage collection.

That would make sense. But this should be optional to implement for SDKs. I would even say, if a SIG can gain performance by implementing Spans in way that does not allow leak detection, they should go for it. We shouldn't optimize for the error-case. Of course, having a (compile-time?) option that optionally allows additional debugging is helpful if possible.

@anuraaga
Copy link
Contributor

anuraaga commented Aug 13, 2021

@Oberon00 Yup, agree that this should be optional for SDKs, just a definition to ensure consistency if implemented but not a requirement.

@carlosalberto
Copy link
Contributor

I think it would be great to have fallback behavior defined in the spec, especially if we can add a semantic convention to indicate spans that are exported due to garbage collection.

A nice although big change IMHO. Maybe it would require an OTEP, but it would definitely somebody writing a prototype at least ;)

@Oberon00
Copy link
Member

if we can add a semantic convention to indicate spans that are exported due to garbage collection.

A nice although big change IMHO

Do you think? It requires the language runtime to notify you before a span object gets GCd. Then you can resurrect the span call span.SetAttribute("otel.leaked_span", true) (+ maybe an error status) and span.End() and that's it. Implementable today.

@carlosalberto
Copy link
Contributor

language runtime to notify you before a span object gets GCd

This may be the tricky part, as other languages may have a hard time doing this without impacting performance. If not, we should be good to go! (also, making this optional would help). In any case, I'm guessing prototypes in different languages will help.

@Oberon00
Copy link
Member

This may be the tricky part, as other languages may have a hard time doing this without impacting performance.

That's one reason why it should be optional. And even optional in two senses IMHO: If a SIG decides it cannot / does not want to implement it, they don't need to. And if they decide to implement it, is MUST be default-off and can optionally be enabled (e.g. by adding an additional SpanProcessor, or with a compile-time #define).

@svrnm svrnm closed this as completed Jan 17, 2024
@svrnm svrnm closed this as not planned Won't fix, can't repro, duplicate, stale Jan 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:trace Related to the specification/trace directory
Projects
None yet
Development

No branches or pull requests

7 participants