-
Notifications
You must be signed in to change notification settings - Fork 897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify fallback behaviour when Span.end() was not called due to an error or by accident #1859
Comments
In the Java SDK, we don't keep references to the user's spans currently. So, if we were to be required to implement something like the first or 3rd option listed here, we'd have to keep references to the user's spans, potentially leading to memory leaks in these cases. It would be far worse to have a memory leak in the application, as opposed to having some broken/incomplete traces being emitted. Java also provides a special optional SpanProcessor implementation that allow users to find their un-ended spans, but it's definitely opt-in since it does introduce a significant performance penalty: https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk-extensions/tracing-incubator/src/main/java/io/opentelemetry/sdk/extension/incubator/trace/LeakDetectingSpanProcessor.java#L30 |
The spec leaves that deliberately unspecified to allow languages that are not garbage collected to leak spans if end is not called. Any other option can introduce a significant performance hit. |
Actually this is (somewhat confusingly, I admit) written in the section on span creation: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#span-creation
|
Ok, I did miss the part in the section about span creation. Would it make sense then to clarify that the responsibility to end all spans also includes adding fail-safe mechanisms which might be language specific? E.g. a language with garbage collection could use something like a destructor/finalize to end the span if needed or there is some optional implementation like the mentioned LeakDetectionSpanProcessor?
I agree with this point especially for option (1), but how does option (3) lead to memory leaks/introduce a performance hit: An user creates a parent span and transmit it with an interim or no end time stamp. In the same sense it is now the responsibility of the user to track that instance (or at least keep the spanid) and end (and update) the span later. The big impact I see here is more on the collector/backend site, which needs to allow spans to be updated. |
The arguments you bring up are specific to garbage collected languages, and these are free to provide more guarantees. The API spec here is written for the least common denominator. Imagine a C API: otel_Span* span = otel_start_span("foo", otel_SpanKind_Client);
// .. do something with the span
otel_end_span(span); // Ends the span, freeing any memory associated. However, just writing this I realize that this is acutally not allowed since |
What I mean by "clarify that the responsibility to end all spans also includes adding fail-safe mechanisms" is adding a few words into the section (addition in bold):
That's not a perfect wording, but I hope it gives an idea what I mean?
If there would be an option that |
That option can be implemented by the user, there is an OnStart callback in the SpanProcessor. The experimental Z pages feature does just that (implemented e.g. in Java). |
No, I don't really understand anymore I think. In the case where the application terminates unexpectedly, the user can obviously not be expected to end all spans. Are you suggesting that all spans must be sent already in OnStart? That would be some significant network overhead and also I don't think most backends will handle it well when they receive the same span with different attributes twice. |
Let me check this out then! |
I am not suggesting that all spans must be send already in OnStart: my line of thinking was, that depending on the language the user can (at least try) to handle that situation with whatever mechanism that language has for those circumstances: error handler, try..catch..., Signal Handling, Signal Traps, Destructors, ... To put it differently, if unexpected terminations/crashes are an exception to that rule of |
Unexpected termination, e.g. through SIGKILL can be used to violate most MUST requirements in this spec, I don't think you can word everything for it. This MUST is problematic anyway since it is a requirement placed on the users, so it's not an actual spec-requirement. The actually meaningful spec language is the "API implementations MAY leak memory or other resources (including, for example, CPU time for periodic work that iterates all spans) if the user forgot to end the span." ("forgot" should probably be replaced by "does not"). |
In the C++ API spans are returned as shared pointers, so the span lifetime is decoupled from memory management. Ending a span doesn't automatically free memory. |
@svrnm Thanks a lot for filing this! For reference, Brave (Zipkin) uses a weak map from Span -> (SpanContext + Start Time) to detect leaks. A reference queue allows the weak map to be notified when a Span is garbage collected, and if it wasn't ended, then a synthetic span is exported at that point using the SpanContext and an extra attribute, I think it would be great to have fallback behavior defined in the spec, especially if we can add a semantic convention to indicate spans that are exported due to garbage collection. |
That would make sense. But this should be optional to implement for SDKs. I would even say, if a SIG can gain performance by implementing Spans in way that does not allow leak detection, they should go for it. We shouldn't optimize for the error-case. Of course, having a (compile-time?) option that optionally allows additional debugging is helpful if possible. |
@Oberon00 Yup, agree that this should be optional for SDKs, just a definition to ensure consistency if implemented but not a requirement. |
A nice although big change IMHO. Maybe it would require an OTEP, but it would definitely somebody writing a prototype at least ;) |
Do you think? It requires the language runtime to notify you before a span object gets GCd. Then you can resurrect the span call |
This may be the tricky part, as other languages may have a hard time doing this without impacting performance. If not, we should be good to go! (also, making this optional would help). In any case, I'm guessing prototypes in different languages will help. |
That's one reason why it should be optional. And even optional in two senses IMHO: If a SIG decides it cannot / does not want to implement it, they don't need to. And if they decide to implement it, is MUST be default-off and can optionally be enabled (e.g. by adding an additional SpanProcessor, or with a compile-time |
What are you trying to achieve?
When I run an application with OpenTelemetry parent spans and their child spans being emitted, I might run into a situation where the application might crash or end before the
Span.end()
was called on the parent span. This leads to undefined behaviour. Jaeger for example reports "undefined parent id" but leaves the spans disconnected.Since one purpose of OpenTelemetry is to help finding the cause of an application issue (like a crash), I believe, that the Span.end() api needs a clear specification what needs to be done in such a situation.
I see a few options (and there might be more):
<other options>
Additional context.
This come up at the discussion for open-telemetry/oteps#169
The text was updated successfully, but these errors were encountered: