Proposal: Add option to export processing spans in ForceFlush or periodically #373
Comments
I implemented something like this for LightStep that we called 'meta events' - they were treated like spans by our recorder, but they would give the backend information about when a span started and ended, and when the span context was injected or extracted. I generally like the idea of some kind of diagnostic information about a span being available to exporters.
I worry about lock ordering and the potential for deadlock when I read this proposal, though I agree that using the stop-time field as an indicator for completed spans is solid. AFAICT, OpenCensus did not implement z-pages using a span processor; it relied on internal logic. I have some familiarity with an internal Google implementation similar to OC z-pages, which is where this fear comes from. If there's a lock protecting the list of all live spans, you have to be careful when viewing z-pages or else you stop the world.
@jmacd That's certainly a valid concern that we'd have to deal with in the implementations, yes.
@jmacd I do agree with you about the locking problem, but I think the current SpanProcessor supports that implementation: add the ReadableSpan to a list of active spans in onStart and remove it in onEnd; when needed, you can iterate over the list to read the in-flight spans. I think the standard SpanProcessor for the exporter pipeline should not support this, but if necessary a vendor can implement their own SpanProcessor that does report in-flight spans to the exporter framework. Also, this can be added later, because the framework we built supports this.
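For illustration, a minimal sketch of that idea against the opentelemetry-java SpanProcessor interface (using today's method signatures, which may differ slightly from the ones available when this comment was written; the class name and the snapshot() helper are invented for the example):

```java
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.common.CompletableResultCode;
import io.opentelemetry.sdk.trace.ReadWriteSpan;
import io.opentelemetry.sdk.trace.ReadableSpan;
import io.opentelemetry.sdk.trace.SpanProcessor;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Keeps a set of spans that have started but not yet ended, readable on demand. */
public final class InFlightSpanProcessor implements SpanProcessor {
  private final Set<ReadableSpan> inFlight = ConcurrentHashMap.newKeySet();

  @Override
  public void onStart(Context parentContext, ReadWriteSpan span) {
    inFlight.add(span); // register the span when it starts
  }

  @Override
  public boolean isStartRequired() {
    return true;
  }

  @Override
  public void onEnd(ReadableSpan span) {
    inFlight.remove(span); // drop it once it has finished
  }

  @Override
  public boolean isEndRequired() {
    return true;
  }

  /** Snapshot of the spans that are currently in flight. */
  public Set<ReadableSpan> snapshot() {
    return Collections.unmodifiableSet(inFlight);
  }

  @Override
  public CompletableResultCode shutdown() {
    inFlight.clear();
    return CompletableResultCode.ofSuccess();
  }
}
```

Using a concurrent set keeps onStart/onEnd lock-free, which also sidesteps the stop-the-world concern raised above when the set is read for display.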
@arminru I think you can use the presence of the end timestamp in the SpanData to determine if the Span is still in-process or finished. Does that solve your problem?
@bogdandrutu I also wouldn't add it to the standard exporter pipeline. What I'm proposing is to add it to the spec so that the OTel implementations provide a SpanProcessor which forwards these processing spans. If a user/operator is interested in them, they just have to plug that processor into their pipeline - it would be an opt-in feature, not the default. Users might, however, plug a default exporter into the …
@arminru I see your point now. I think this is an edge case, and I'm not sure we need to design the API with this mistake in mind. I'm not saying this mistake cannot happen, but in my mind every vendor will implement a "SpanProcessor build()" function that returns the SpanProcessor needed for that vendor, and if the vendor does not support in-flight spans, it should not return a processor instance that passes in-flight spans along. In the ideal case, most users should only have to do:

```java
TracerFactorySdk tracerFactorySdk = OpenTelemetrySdk.getTraceFactory();
VendorExporter.init(tracerFactorySdk, options...);
```

Only advanced users should need to build their own exporter pipeline and SpanProcessor.
@bogdandrutu Alright, I'm fine with relying on the end timestamp to check whether a span is processing or finished - this way we don't have to add a potentially confusing second status field to the span data.
I don't think that is really a good option. E.g., consider the Java definition of an end timestamp:

```java
private long getEndNanoTimeInternal() {
  return hasBeenEnded ? endEpochNanos : clock.now();
}
```

This shows two problems: First, we can't check the "presence" of the end timestamp, since we get clock.now() if it isn't present. Second, if …

Maybe the situation is better in other languages (I'm pretty sure it is, e.g., in Python), but if there isn't an implementation of the span processor in the SDK already, I think the processing span processor will be very difficult or impossible to implement, due to lots of tiny problems and edge cases like this that no one has thought about, because no one has used the SDK in a similar way.
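To make the problem concrete, here is a small sketch of what a reliable check could look like if the SDK exposed an explicit "ended" flag instead of a timestamp heuristic - recent opentelemetry-java versions do provide ReadableSpan#hasEnded(), if I'm not mistaken; the surrounding class is invented for the example:

```java
import io.opentelemetry.sdk.trace.ReadableSpan;
import io.opentelemetry.sdk.trace.data.SpanData;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

final class InFlightFilter {

  /**
   * Filters a collection of live spans down to the ones that have not ended yet.
   * Relies on an explicit flag rather than "end timestamp == 0", which the getter
   * above makes impossible to observe.
   */
  static List<SpanData> snapshotInFlight(Collection<ReadableSpan> candidates) {
    return candidates.stream()
        .filter(span -> !span.hasEnded())  // explicit flag, not a timestamp heuristic
        .map(ReadableSpan::toSpanData)     // immutable snapshot, safe to hand to an exporter
        .collect(Collectors.toList());
  }
}
```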
I added an example Java implementation in open-telemetry/opentelemetry-java#697.
* Simplify span end time. The only reason to have now() returned in getEndEpochNanos seems to be that it kinda makes sense for getLatencyMs, but that method is package-private and never used. See also open-telemetry/opentelemetry-specification#373 (comment).
* Add back getLatencyNs, add it to ReadableSpan.
* Rename getLatencyNs to getLatencyNanos.
When I was starting with OpenTelemetry in our projects, I didn't even consider that unfinished spans might never be exported. I'm surprised that the only state of an operation in a trace is "finished". I have a data-processing workflow built on queues, so my messages often sit in a queue and it takes a considerable amount of time to process them. I need to know whether a message is still in the queue or is currently being processed. I'll probably create a workaround by linking "put in the queue" and "processing started" to some log messages I can find by trace ID, but it would be nice if this were included in the framework.
I'm also very curious about the current state of this. Can anyone from the @open-telemetry folks provide an update here? 🙏
There is a "Z pages" feature implemented in some languages, e.g. Java that allows you to view in-progress spans. Also, the spec now allows you to react to a span being started and even poll it for updated attributes. However, I know of no backend that supports displaying in-progress spans. |
Thanks! I'll check it out.
That said, is it just a matter of writing a custom exporter so I can gather whatever I want? Seems suspiciously easy 😄 We use OTelExporter to feed Elastic APM, but still, we'll be writing our own visualizations because no out-of-the-box backend supports everything we want (mainly we'd like to show traces to the client, so we have to remove details, enrich them with other info, make it pretty, etc.).
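For reference, a custom exporter for finished spans really is about this small; a hedged sketch against the opentelemetry-java SpanExporter interface (the class name and the println destination are placeholders, and note that in-flight spans would still need a custom SpanProcessor, since exporters only ever receive ended spans):

```java
import io.opentelemetry.sdk.common.CompletableResultCode;
import io.opentelemetry.sdk.trace.data.SpanData;
import io.opentelemetry.sdk.trace.export.SpanExporter;
import java.util.Collection;

/** Forwards finished spans to whatever custom storage or visualization you like. */
public final class MyVisualizationExporter implements SpanExporter {

  @Override
  public CompletableResultCode export(Collection<SpanData> spans) {
    for (SpanData span : spans) {
      // Strip or enrich attributes here before handing the data to your own backend.
      System.out.printf("%s %s took %d ns%n",
          span.getTraceId(), span.getName(),
          span.getEndEpochNanos() - span.getStartEpochNanos());
    }
    return CompletableResultCode.ofSuccess();
  }

  @Override
  public CompletableResultCode flush() {
    return CompletableResultCode.ofSuccess();
  }

  @Override
  public CompletableResultCode shutdown() {
    return CompletableResultCode.ofSuccess();
  }
}
```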
@mitoihs It may not be quite as clear-cut, but I've done something similar by having one span for "sending a message to a queue" and another span for "processing a message from the queue". There is no active span while the message sits in the queue. When visualizing a trace, you can see that a "processing message from queue" span is not present yet, which means the message is still in the queue. Hopefully that's helpful.
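A hedged sketch of that two-span pattern with the OpenTelemetry Java API - MyQueue, MyMessage, handle(), and the span names are placeholders, and whether you link or parent the consumer span to the producer's context is a matter of taste:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;

final class QueueTracing {

  /** Producer side: one span that covers only the enqueue, not the time spent waiting. */
  static void sendToQueue(Tracer tracer, MyQueue queue, byte[] payload) {
    Span send = tracer.spanBuilder("send my-queue").setSpanKind(SpanKind.PRODUCER).startSpan();
    try {
      // The producer's span context travels with the message (how it is serialized is up to you).
      queue.put(payload, send.getSpanContext());
    } finally {
      send.end(); // ends immediately; no span is active while the message sits in the queue
    }
  }

  /** Consumer side: a second span, started only once processing actually begins. */
  static void processFromQueue(Tracer tracer, MyMessage message) {
    Span process = tracer.spanBuilder("process my-queue")
        .setSpanKind(SpanKind.CONSUMER)
        .addLink(message.producerSpanContext()) // ties the two spans together in the trace view
        .startSpan();
    try {
      handle(message);
    } finally {
      process.end();
    }
  }

  private static void handle(MyMessage message) { /* application logic */ }

  // Placeholders for your own messaging code.
  interface MyQueue { void put(byte[] payload, SpanContext producerContext); }
  interface MyMessage { SpanContext producerSpanContext(); }
}
```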
Are there any tools or platforms, at this point, capable of receiving and working with unfinished spans? (Not just exporting them from an application via zpages, but sending them to a logging/tracing service and reading them there.) |
Really surprised that OpenTelemetry cannot handle my scenario... I have two machines: machine 1, which is the client, and machine 2, which is the server. So this means the span on machine 1 is not finished and the third span on machine 2 is not finished. The span on machine 1 and the third, stuck span on machine 2 are not reported, as they are not ended. So the output in Jaeger is really confusing, as it does not associate the two completed spans with a parent. I really need a processor that could export started spans as well as ended spans, and a backend that could then tie up the completed spans with the correct, still-running parent span. I could then ask Jaeger for all the incomplete spans and drill down into which child span is causing the problem. Is the above on any roadmaps? Also, I'm not sure why the proposed implementation PR is marked as obsolete?
Seems like this is quite a huge oversight. +1
I would extend this issue even further to support indefinite-lifetime spans. I would like to be able to start a span that never ends, and yet report its current state (i.e., "still alive!", e.g., a change of attributes) with new events every 30 seconds. Related: #2692 (comment)
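With today's API, the closest approximation is probably to attach periodic events to the long-lived span; a rough sketch (the scheduling, names, and the queue.depth attribute are made up, and the events still only reach ordinary exporters once the span finally ends, which is exactly the limitation this issue is about):

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class HeartbeatSpan {

  static void run(Tracer tracer) {
    Span lifetime = tracer.spanBuilder("worker-lifetime").startSpan();
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Every 30 seconds, record the current state as an event on the still-open span.
    scheduler.scheduleAtFixedRate(
        () -> lifetime.addEvent(
            "still alive",
            Attributes.of(AttributeKey.longKey("queue.depth"), currentQueueDepth())),
        30, 30, TimeUnit.SECONDS);

    // Only on shutdown does the span end and everything recorded on it get exported.
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      scheduler.shutdownNow();
      lifetime.end();
    }));
  }

  private static long currentQueueDepth() {
    return 0L; // placeholder for whatever state you want to report
  }
}
```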
Feels like you could model this span as a Resource and report standalone Events against it.
AWS X-Ray |
How to handle long-running spans?

The handler is a long-running operation - it performs HTTP requests and database calls, which create spans automatically via autoinstrumentation. While the handler is running, I can't see anything in Elastic APM. If I look into the OpenTelemetry logs, spans are sent to the Elastic exporter when they end - which is fine, but they don't appear in the UI. I suspect it's because the parent span hasn't finished yet, so there is no parent to attach them to. How should I handle the following scenario?

Now I can see them in the UI as the nested spans end, but I don't have them grouped under the parent span. Is that something wrong with the way Elastic displays it, or is it an issue with OpenTelemetry itself? How do you solve scenarios like these?
completes SAF-1569

Originally I just wanted to report why semgrep-core was killed in the traces, and send any traces that may have been mid-flight when it happened. Since the OTel spec [doesn't support partial traces](open-telemetry/opentelemetry-specification#373), I decided to simply raise the kill signal as an error, like we already do when we receive Ctrl-C. OTel libraries are supposed to report [exceptions in spans and then exit them](https://opentelemetry.io/docs/specs/semconv/exceptions/exceptions-spans/), but it turns out the OTel library [does not do this](imandra-ai/ocaml-opentelemetry#50). So this entire time, if there was EVER an error in a span, including file and memory timeouts, we would never send the trace! So this PR makes it so we exit a span whenever an error is caught (and then we re-raise it), and we add the error attributes the library is /supposed/ to add. So now we can see which spans have errors, their error message and stacktrace, and we actually see the spans.

## Test plan

```bash
semgrep --pro --trace --config p/default --trace-endpoint semgrep-dev --max-memory 10 --interfile-timeout 1 --timeout 1 --timeout-threshold=1
# wait a bit
# then ctrl c or kill -15 <pid>
```

Then check Jaeger to see that the traces have error info and were sent after Ctrl-C or kill: [example](https://jaeger-dev2.corp.semgrep.dev/trace/2a86f622e267e7164c29ea0dd37c5286)

synced from Pro e8114ab0dccd079b51a0381e4168d5383ab87bbe
Hello, has there been any update on this issue? The changes would be very helpful to my org and I would be interested in participating in the implementation. We have cases where a process might be running for 10 hours and only in the last 10 minutes are spans emitted. It would be valuable to be able to see spans (along with their tags and events) as soon as they are started. Here is an example.

This is how it would appear in the UI. Corresponding thread on the OpenTelemetry Slack channel with people interested in this. cc @svrnm
Thank you @xBis7!
Unfortunately not, but based on the discussion on Slack there is some bigger interest in this topic, so maybe we can reignite the discussion. Note that this is a gnarly topic, as you can see from the previous discussion (ongoing since 2019) and the Slack thread, so there will not be a quick solution.
Currently, exporters in OTel only get access to spans that are already finished in order to send them to the backend.
This proposal adds an option to also allow exporting spans that are currently processing, i.e., already started but not yet ended. They might also be referred to as open, unfinished, running, or "in-flight" spans. (I'd like to avoid the term "active" here since it has different semantics in OTel.)
This allows the backend to display long-running spans and provide details about them in real time. It further allows debugging the instrumentation code, since it reveals spans that are started but never ended.
The backend can also use this information for performance improvements, because it receives information about spans ahead of time. For example, it can decide to keep other, dependent parent or child spans in memory when currently processing spans are expected to arrive later on, or otherwise persist them to disk.
OpenCensus had functionality like this as part of its zPages feature:
https://github.com/census-instrumentation/opencensus-java/blob/master/impl_core/src/main/java/io/opencensus/implcore/trace/export/InProcessRunningSpanStore.java
This would have the following implications for the implementation:
A SpanProcessor implementing this feature would take spans provided via the span start hook and provide those to the exporter. The current default processors only use the span end hook, on which they hand the span over to the exporter right away (simple span processor) or batch it first (batching span processor).
The information on whether a span is processing or finished could, for example, be indicated by:

a) an additional flag or status field on the exported span data,
b) the presence or absence of the end timestamp, or
c) a dedicated method on the exporter besides `export`, something like `exportProcessing`.

I would prefer option C since this provides the clearest distinction and spares exporters that don't want to handle processing spans from having to filter all provided spans - they could just make `exportProcessing` a no-op (see the sketch below).

What's your opinion on a feature like that? I'd like to gather some feedback before I create an OTEP for it.
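Purely for illustration, a hedged sketch of what option C could look like in Java; ProcessingSpanExporter and exportProcessing are hypothetical names, not part of any SDK:

```java
import io.opentelemetry.sdk.common.CompletableResultCode;
import io.opentelemetry.sdk.trace.data.SpanData;
import io.opentelemetry.sdk.trace.export.SpanExporter;
import java.util.Collection;

/**
 * Hypothetical extension of the exporter interface for option C: finished spans keep
 * going through export(), while in-flight spans arrive via a separate method that
 * exporters can simply leave as a no-op if they are not interested.
 */
public interface ProcessingSpanExporter extends SpanExporter {

  /** Called with snapshots of spans that have started but not yet ended. */
  default CompletableResultCode exportProcessing(Collection<SpanData> processingSpans) {
    return CompletableResultCode.ofSuccess(); // ignoring in-flight spans is the default
  }
}
```

An opt-in SpanProcessor could then snapshot the spans it saw in onStart and hand them to exportProcessing on ForceFlush or on a timer, while the regular export path for finished spans stays untouched.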