Proposal: Add option to export processing spans in ForceFlush or periodically #373

arminru · 2019-12-05T14:22:36Z

Currently, exporters in OTel only get access to spans that are already finished in order to send them to the backend.
This proposal adds an option to also allow exporting spans that are currently processing, i.e., already started but not yet ended. They might also be referred to as open, unfinished, running or "in-flight" spans. (I'd like to avoid the term active here since this has a different semantic in OTel.)

This allows the backend to display long-running spans and provide details about them in realtime. It further allows for debugging the instrumentation code since it will reveal spans that are started but not being ended.
The backend can also use this information for performance improvements because it receives information about spans ahead of time. This, for example, allows the backend to make decisions about keeping other, dependent parent or child spans in memory if there are currently processing spans expected to arrive later on or otherwise persist them on disk.

OpenCensus had a functionality like this used in their zPages feature:

This would have the following implications on the implementation:

A SpanProcessor implementing this feature would take Spans provided via the span start hook and provide those to the exporter. The current default processors only use the end span hook on which they hand the span over to the export right away (simple span processor) or batch it first (batching span processor).

The information on whether a span is processing or finished could, for example, be indicated by:

A) adding a field ProcessingStatus (or LifecycleStatus?) to the Span object which can have the values processing and finished, or
B) without any adaption by simply checking if the (mandatory) span end time is set - if it is not, the span is not yet finished, or
C) extending the SpanExporter interface by a method in addition to the existing export method, something like exportProcessing.

I would prefer option C since this provides the clearest distinction and spare the need for filtering all provided spans in exporters that don't want to handle processing spans - they could just make exportProcessing a no-op.

What's your opinion on a feature like that? I'd like to gather some feedback before I create an OTEP for it.

The text was updated successfully, but these errors were encountered:

austinlparker · 2019-12-05T16:17:07Z

I implemented something like this for LightStep that we called 'meta events' - they were treated like spans to our recorder, but they would give the backend information about when a span started and ended, and when the span context was injected or extracted. I generally like the idea of some kind of diagnostic information about a span being available to exporters.

jmacd · 2019-12-05T21:50:34Z

I worry about lock ordering and the potential for deadlock when I read this proposal, though I agree that using the stop-time field as an indicator for completed spans is solid. AFAICT, OpenCensus did not implement z-pages using a span processor, it relied on internal logic.

I have some familiarity with an internal Google implementation similar to OC z-pages, which is where this fear comes from. If there's a lock protecting the list of all live spans, you have to be careful when viewing z-pages or else stop the world.

arminru · 2019-12-06T08:26:08Z

@jmacd That's certainly a valid concern that we'd have to deal with in the implementations, yes.

bogdandrutu · 2019-12-06T08:27:55Z

@jmacd I do agree with you about the locking problem, but I think the current SpanProcessor supports that implementation. Add the ReadableSpan to a list of active Spans onStart then remove onEnd, when needed you can iterate over the list to read in-flight spans.

I think the standard SpanProcessor for the exporter pipeline should not support this, but if necessary a vendor can implement their own SpanProcessor that does report in-flight spans to the exporter framework.

Also this can be added later, because the framework that we built supports this.

bogdandrutu · 2019-12-06T08:32:50Z

@arminru I think you can use the presence of the end timestamp in the SpanData to determine if the Span is still in-process or finished. Does that solve your problem?

arminru · 2019-12-06T09:33:10Z

@bogdandrutu
The presence of the end timestamp would be one way of distinguishing processing from finished spans, yes.

I also wouldn't add it to the standard exporter pipeline. What I'm proposing is to add it to the spec so that the OTel implementations add a SpanProcessor which forwards these processing spans. If a user/operator is interested in them, they just have to plug that processor into their pipeline - it would be an opt-in feature, not the default.

Users might, however, plug a default exporter into the ProcessingSpanProcessor (we'll find a better name for that) so we'll have to make sure this works properly.
Option C mentioned above would require the exporter interface to be extended. A separate method for processing spans would be added to assure that exporters neatly and efficiently handle both kinds of spans - the current/default exporters would just make exportProcessingSpans() a no-op an be fine. This would require a spec change for the exporter interface.
With Option B (which you just proposed as well) exporters would have to check each end timestamp (or some convenience method like isFinished()) first in order to know which ones to ignore, before they can rely on assumptions for finished spans.

bogdandrutu · 2019-12-06T09:44:32Z

@arminru I see your point now, I think this is an edge case, and not sure if we need to design the API with this mistake in mind. I don't say that this is not a possible mistake that can happen, but in my mind every vendor will implement a "SpanProcessor build()" function that returns the SpanProcessor needed for that vendor and if no support for in-flight spans then it should not return the processor instance that does pass in-flight spans.

In an ideal case most of the users should do:

  TracerFactorySdk tracerFactorySdk = OpenTelemetrySdk.getTraceFactory()
  VendorExporter.init(tracerFactorySdk, options...)

Only in case of advance users they should build their own exporter pipeline and SpanProcessor.

arminru · 2019-12-09T10:27:21Z

@bogdandrutu Alright, I'm fine with relying on the end timestamp for checking if a span is processing or finished - this way we don't have to add a potentially confusing, second status field to Span and can keep the exporter interface as it is.
I still would like to propose this kind of processor for reporting processing spans into the SDK spec. This way implementations would include it by default and users or vendors can, if desired, activate it by adding it to the pipeline.

Oberon00 · 2019-12-12T10:11:00Z

@bogdandrutu

I think you can use the presence of the end timestamp in the SpanData to determine if the Span is still in-process or finished

I don't think that is really a good option. E.g. consider the Java definition of an end timestamp:

  private long getEndNanoTimeInternal() {
    return hasBeenEnded ? endEpochNanos : clock.now();
  }

This shows two problems: First, we can't check the "presence" of the end timestamp since we get clock.now() if it isn't present. Second, if endEpochNanos != 0 would be a valid replacement for hasBeenEnded, why wouldn't the SDK use it instead of adding an additonal boolean? I think it should be possible for a processing spans processor to use the same definition as the SDK.

Maybe the situation is better in other languages (I'm pretty sure it is e.g. in Python), but if there isn't an implementation of the span processor in the SDK already, I think the processing span processor will be very difficult/impossible to implement due to lots of tiny problems and edge cases like this that no one has thought about because no one had used the SDK in a similar way.

The only reason to have now() returned in getEndEpochNanos seems to be that it kinda makes sense for getLatencyMs, but that method is package-private and never used. See also open-telemetry/opentelemetry-specification#373 (comment).

Oberon00 · 2019-12-13T12:55:29Z

I added an example Java implementation in open-telemetry/opentelemetry-java#697.

The only reason to have now() returned in getEndEpochNanos seems to be that it kinda makes sense for getLatencyMs, but that method is package-private and never used. See also open-telemetry/opentelemetry-specification#373 (comment).

* Simplify span end time. The only reason to have now() returned in getEndEpochNanos seems to be that it kinda makes sense for getLatencyMs, but that method is package-private and never used. See also open-telemetry/opentelemetry-specification#373 (comment). * Add back getLatencyNs, add it to ReadableSpan. * Rename getLatencyNs to getLatencyNanos.

mitoihs · 2021-06-28T08:31:47Z

When I was starting with OpenTelemetry in our projects I didn't even think that not-finished spans might be never logged. I'm surprised that the only state of an operation in a trace is "finished".

I have a data processing workflow on queues so my messages often stay in a queue and it takes a considerable amount of time to process them. I have to know if the message is in the queue or if it is currently processed. I'll probably create a workaround by linking "put in the queue" and "processing started" to some log messages I could find by trace ID but it would be nice if it will be included in the framework.

schickling · 2021-06-28T10:41:18Z

I'm also very curious about the current state of this. Can anyone from the @open-telemetry folks provide an update here? 🙏

Oberon00 · 2021-06-28T11:41:01Z

There is a "Z pages" feature implemented in some languages, e.g. Java that allows you to view in-progress spans.

Also, the spec now allows you to react to a span being started and even poll it for updated attributes. However, I know of no backend that supports displaying in-progress spans.

mitoihs · 2021-06-28T11:58:38Z

There is a "Z pages" feature implemented in some languages, e.g. Java that allows you to view in-progress spans.

Thanks! I'll check it out.

Also, the spec now allows you to react to a span being started and even poll it for updated attributes. However, I know of no backend that supports displaying in-progress spans.

That said, is it just a matter of writing a custom exporter and I could gather whatever I want? Seems suspiciously easy 😄

We use OTelExporter to feed Elastic APM but still, we'll be writing our own visualizations because no out-of-the-box backend supports all we want (mainly we'd like to show traces to the client so we have to remove details, enhance with another info, make it pretty etc.).

kenfinnigan · 2021-06-28T13:09:51Z

@mitoihs It may not be quite as clear cut, but I've done something similar by having a span on "sending a message to a queue", and another span on "processing message from queue". There is no active span while the message is in a queue.

Visualizing a trace you can see the "processing message from queue" spans are not present and therefore are still in the queue.

Hopefully that's helpful

joshtriplett · 2022-04-07T07:05:02Z

Are there any tools or platforms, at this point, capable of receiving and working with unfinished spans? (Not just exporting them from an application via zpages, but sending them to a logging/tracing service and reading them there.)

adelel1 · 2022-06-20T17:10:20Z

Really surprised that opentelemetry cannot handle my scenario...

I have 2 machines, machine 1 which is the client and machine 2 which is the server.
Machine 1 sends a request to machine 2, request on Machine 1 is represented by a span.
Machine 2 receives the request and sequentially does 3 bits of work represented by 3 child spans. Now the first 2 spans on Machine 2 complete, but the 3rd span gets stuck.

So this means the span on Machine 1 is not finished and the 3rd span on Machine 2 is not finished.
Looking at Jaeger it reports the 2 completed spans on Machine 2, but with a warning stating unknown parent span.

The span on Machine 1 and the 3rd stuck span on Machine 3 is not reported as they are not ended.

So the output on Jaeger is really confused as it does not associate the 2 completed spans with a parent.

I really need a processor that could export start spans, and end spans, and a backend could then tie up the completed spans with correct started parent span.

I could then ask Jaeger for all the incomplete spans, and then drill down into which child span is causing the problem.

Is the above on any roadmaps?

Also not sure why the proposed implementation PR is marked as obsolete?

simingy · 2022-07-11T17:01:37Z

seems like this is quite a huge oversight.
without the ability to see in-flight (currently processing spans), we don't have the ability to debug a live system.
further, if spans are not sent out until done - if the traced application dies, the spans are never sent out - we would have a gap in telemetry.

+1

jmacd · 2022-07-29T16:21:24Z

I would extend this issue even further to support indefinite-lifetime spans. I would like to be able to start a span that never ends, and yet I would like to report its current state (i.e., "still alive!", e.g., change of attributes) with new events every 30 seconds.

Related: #2692 (comment)

scheler · 2022-08-06T17:50:23Z

I would like to be able to start a span that never ends, and yet I would like to report its current state (i.e., "still alive!", e.g., change of attributes) with new events every 30 seconds.

Feels like you could model this span as a Resource and report standalone Events against it.

nakedible · 2023-06-03T07:36:18Z

Are there any tools or platforms, at this point, capable of receiving and working with unfinished spans? (Not just exporting them from an application via zpages, but sending them to a logging/tracing service and reading them there.)

AWS X-Ray

zszepczynski · 2023-10-26T11:05:22Z

How to handle long running spans?
I have a service that receives message from a socket.io and runs long running jobs.
in js I'm doing something similar to this:

 const activeContext = propagation.extract(context.active(), headers)
return tracer.startActiveSpan(
  `${socketIo.namespace}[${socketIo.room}] ${MessagingOperationValues.RECEIVE}`,
  activeContext,
  (span) => {
    return endSpan(handler, span)
  }

Handler is a long running operation - it performs http requests/ database calls which are creating spans automatically with autoinstrumentation. While handler is running - I can't see anything in Elastic APM. If I look into opentelemetry logs, spans are sent to elastic exporter when they end - which is fine but they don't appear in the UI. I suspect it's because the parent span hasn't finished yet, so there is no parent to attach them to. How should I handle following scenario?
I replaced it with

return context.with(activeContext, ()=>{
    return handler();
  });

Now I can see them in UI as nested span end but I don't have them grouped in the parent span. Is that something wrong with the way Elastic displays it? Or is it the issue with opentelemetry itself? How do you solve following scenarios?

completes SAF-1569 Originally I just wanted to report why semgrep-core was killed in the traces, and send any traces that may have been mid flight when it happened. Since Otel the spec [doesn't support partial traces](open-telemetry/opentelemetry-specification#373), I decided to simply raise the kill signal as an error like we already do when we receive Ctrl c. Otel libraries are supposed to report [exceptions in spans and then exit them](https://opentelemetry.io/docs/specs/semconv/exceptions/exceptions-spans/), but it turns out the otel library [does not do this](imandra-ai/ocaml-opentelemetry#50). So this entire time if there was EVER an error, including file and memory time outs, in a span, we would never send the trace! So this PR makes it so we exit a span whenever an error is caught (and then we reraise it), and the error attributes the library is /supposed/ to add, we add instead! So now we can see what spans have errors, their error message, stacktrace, and we actually see them. ## Test plan ```bash semgrep --pro --trace --config p/default --trace-endpoint semgrep-dev --max-memory 10 --interfile-timeout 1 --timeout 1 --timeout-threshold=1 # wait a bit # then ctrl c or kill -15 <pid> ``` then check jaeger to see traces have error info, and sent traces after ctrl c or kill: [example](https://jaeger-dev2.corp.semgrep.dev/trace/2a86f622e267e7164c29ea0dd37c5286) synced from Pro e8114ab0dccd079b51a0381e4168d5383ab87bbe

xBis7 · 2024-10-15T15:29:56Z

Hello, has there been any update on this issue?

The changes will be very helpful to my org and I would be interested to participate in the implementation.

We have cases where a process might be running for 10h and only in the last 10 minutes there are emitted spans. It would be valuable to be able to see spans(along with their tags and events) as soon as they are started.

Here is an example

at 12:45 span A is started - whenever events are flushed or upon forceFlush it is sent to the collector
at 12:46 event 1a is added to span A - that also makes it to the collector
12:47 span B (child of A started) - it's also sent to the collector
12:48 event 2a is added to span A and event 1b is added to span B - all of these are sent to the collector
12:49 span B completes
12:50 span A completes

This is how it would appear in the UI

Corresponding thread on the opentelemetry slack channel with people interested in this
https://cloud-native.slack.com/archives/CJFCJHG4Q/p1728915811387139

cc @svrnm

svrnm · 2024-10-16T06:26:15Z

Thank you @xBis7!

Hello, has there been any update on this issue?

Unfortunately not, but based on the discussion on slack, there is some bigger interest into this topic, so maybe we can re-ignite the discussion. Note, that this is a gnarly topic, what you can see from the previous discussion ongoing since 2019 and the slack thread, so there will not be a quick solution

Oberon00 mentioned this issue Dec 12, 2019

[On Hold] Improve behavior of un-ended ReadableSpan open-telemetry/opentelemetry-java#693

Closed

Oberon00 mentioned this issue Dec 13, 2019

DISCUSSION: Add SpanWatcherProcessor (spec#373) open-telemetry/opentelemetry-java#697

Closed

mayurkale22 mentioned this issue Dec 16, 2019

feat: introduce ended property on Span open-telemetry/opentelemetry-js#625

Merged

jmacd mentioned this issue Jun 9, 2020

zPages proposal open-telemetry/oteps#110

Merged

bogdandrutu added the spec:trace Related to the specification/trace directory label Jun 12, 2020

reyang added the area:api Cross language API specification issue label Jun 30, 2020

carlosalberto added the release:after-ga Not required before GA release, and not going to work on before GA label Jul 2, 2020

anuraaga mentioned this issue Jan 30, 2022

Support long running multi tenant workflows open-telemetry/opentelemetry-java#4133

Open

simingy mentioned this issue Jul 11, 2022

export unclosed span open-telemetry/opentelemetry-python#2752

Closed

Oberon00 mentioned this issue Jul 26, 2022

Allow SpanExporter to write partially completed spans during shutdown #2692

Open

jmacd changed the title ~~Proposal: Add option to export processing spans~~ Proposal: Add option to export processing spans in ForceFlush Jul 29, 2022

jmacd changed the title ~~Proposal: Add option to export processing spans in ForceFlush~~ Proposal: Add option to export processing spans in ForceFlush or periodically Jul 29, 2022

Oberon00 mentioned this issue Sep 27, 2022

Spec format for spans in build logs and bash script output #2773

Closed

abitrolly mentioned this issue Dec 20, 2022

Declare OTLP/JSON Stable #2930

Merged

dyladan added the triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted label Mar 29, 2024

github-actions bot added the triage:followup Needs follow up during triage label Sep 19, 2024

trask removed the triage:followup Needs follow up during triage label Sep 24, 2024

github-actions bot added the triage:followup Needs follow up during triage label Sep 25, 2024

trask removed the triage:followup Needs follow up during triage label Sep 26, 2024

github-actions bot added the triage:followup Needs follow up during triage label Oct 16, 2024

danielgblanco removed the triage:followup Needs follow up during triage label Oct 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Add option to export processing spans in ForceFlush or periodically #373

Proposal: Add option to export processing spans in ForceFlush or periodically #373

arminru commented Dec 5, 2019

austinlparker commented Dec 5, 2019

jmacd commented Dec 5, 2019

arminru commented Dec 6, 2019

bogdandrutu commented Dec 6, 2019 •

edited

Loading

bogdandrutu commented Dec 6, 2019

arminru commented Dec 6, 2019

bogdandrutu commented Dec 6, 2019

arminru commented Dec 9, 2019

Oberon00 commented Dec 12, 2019 •

edited

Loading

Oberon00 commented Dec 13, 2019

mitoihs commented Jun 28, 2021

schickling commented Jun 28, 2021

Oberon00 commented Jun 28, 2021 •

edited

Loading

mitoihs commented Jun 28, 2021

kenfinnigan commented Jun 28, 2021

joshtriplett commented Apr 7, 2022

adelel1 commented Jun 20, 2022

simingy commented Jul 11, 2022

jmacd commented Jul 29, 2022

scheler commented Aug 6, 2022

nakedible commented Jun 3, 2023

zszepczynski commented Oct 26, 2023

xBis7 commented Oct 15, 2024 •

edited

Loading

svrnm commented Oct 16, 2024

Proposal: Add option to export processing spans in ForceFlush or periodically #373

Proposal: Add option to export processing spans in ForceFlush or periodically #373

Comments

arminru commented Dec 5, 2019

austinlparker commented Dec 5, 2019

jmacd commented Dec 5, 2019

arminru commented Dec 6, 2019

bogdandrutu commented Dec 6, 2019 • edited Loading

bogdandrutu commented Dec 6, 2019

arminru commented Dec 6, 2019

bogdandrutu commented Dec 6, 2019

arminru commented Dec 9, 2019

Oberon00 commented Dec 12, 2019 • edited Loading

Oberon00 commented Dec 13, 2019

mitoihs commented Jun 28, 2021

schickling commented Jun 28, 2021

Oberon00 commented Jun 28, 2021 • edited Loading

mitoihs commented Jun 28, 2021

kenfinnigan commented Jun 28, 2021

joshtriplett commented Apr 7, 2022

adelel1 commented Jun 20, 2022

simingy commented Jul 11, 2022

jmacd commented Jul 29, 2022

scheler commented Aug 6, 2022

nakedible commented Jun 3, 2023

zszepczynski commented Oct 26, 2023

xBis7 commented Oct 15, 2024 • edited Loading

svrnm commented Oct 16, 2024

bogdandrutu commented Dec 6, 2019 •

edited

Loading

Oberon00 commented Dec 12, 2019 •

edited

Loading

Oberon00 commented Jun 28, 2021 •

edited

Loading

xBis7 commented Oct 15, 2024 •

edited

Loading