
Document span events <-> log-based events conversion #4393

Open
lmolkova opened this issue Feb 1, 2025 · 8 comments
Labels
sig-issue A specific SIG should look into this before discussing at the spec spec:logs Related to the specification/logs directory

Comments

@lmolkova
Contributor

lmolkova commented Feb 1, 2025

The long-term vision for events is to use log-based events instead of span events.

There are different (not mutually-exclusive) migration strategies we could use:

  1. Producers may switch to logs API gradually following Span Events API deprecation
  2. SDK may start converting span events to log-based events (based on the opt-in feature flag)
  3. OTel (or vendors) may provide SDK and collector processors that perform conversion
  4. Backends can handle span events and log-based events in a similar way on the ingestion side
    ...

Span event -> log event conversion is straightforward:

Span event property | Log event property
--- | ---
Name | EventName
Timestamp (if provided) | Timestamp
Attributes | Attributes
Span context | Span context
Severity: not provided
Body: not provided

The only challenge is getting access to the Logs SDK pipeline from the Trace SDK.
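A minimal Python sketch of the span event -> log event mapping in the table above. `SpanContext`, `SpanEvent`, `LogRecord`, and `span_event_to_log_event` are hypothetical stand-ins for illustration, not the OpenTelemetry SDK classes; the Logs SDK pipeline access is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SpanContext:
    # Hypothetical stand-in for the identifiers of the span the event was on.
    trace_id: int
    span_id: int


@dataclass
class SpanEvent:
    # Hypothetical stand-in for a span event recorded by the Trace SDK.
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    timestamp_ns: Optional[int] = None  # None when the caller did not provide one


@dataclass
class LogRecord:
    # Hypothetical stand-in for a log-based event.
    event_name: Optional[str]
    attributes: Dict[str, Any]
    span_context: Optional[SpanContext]
    timestamp_ns: Optional[int] = None
    severity_number: Optional[int] = None  # span events carry no severity
    body: Any = None                       # span events carry no body


def span_event_to_log_event(event: SpanEvent, parent: SpanContext) -> LogRecord:
    """Map a span event onto a log-based event following the table above."""
    return LogRecord(
        event_name=event.name,              # Name -> EventName
        attributes=dict(event.attributes),  # Attributes -> Attributes (copied)
        span_context=parent,                # Span context -> Span context
        timestamp_ns=event.timestamp_ns,    # copied only if provided
    )
```

Severity and body are simply left unset, which matches the "not provided" rows.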


Log event -> span event conversion is more complicated.

The motivation for defining this conversion is to:

  • support backends, such as Jaeger, that support traces but not logs
  • provide backward compatibility for other possible cases

Log event property | Span event property
--- | ---
EventName | Name if EventName is provided; otherwise define a convention, e.g. otel.log
Timestamp (if provided) | Timestamp
Observed timestamp (if Timestamp not provided) | Timestamp
Attributes (standard) | Attributes
Attributes (extended) | Drop, flatten, or serialize extended attributes
Span context | Attach to the current span, or drop if there is no current span
Severity Number (if provided) | Define semantic convention (log.record.severity.number)
Severity Text (if provided) | Define semantic convention (log.record.severity.text)
Body | Drop, flatten, or serialize to log.record.body
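A sketch of the log event -> span event direction, assuming serialization (rather than dropping or flattening) is the chosen option for extended attributes and bodies. `otel.log`, `log.record.severity.number`, `log.record.severity.text`, and `log.record.body` are the names suggested in the table, not established semantic conventions; the dataclasses and `log_event_to_span_event` are hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

FALLBACK_EVENT_NAME = "otel.log"  # example convention from the table, not a standard
_PRIMITIVES = (str, bool, int, float)


@dataclass
class LogRecord:
    # Hypothetical stand-in for a log-based event.
    attributes: Dict[str, Any]
    event_name: Optional[str] = None
    timestamp_ns: Optional[int] = None
    observed_timestamp_ns: Optional[int] = None
    severity_number: Optional[int] = None
    severity_text: Optional[str] = None
    body: Any = None


@dataclass
class SpanEvent:
    name: str
    timestamp_ns: Optional[int]
    attributes: Dict[str, Any]


@dataclass
class Span:
    # Hypothetical stand-in for the currently active span.
    events: List[SpanEvent] = field(default_factory=list)


def _is_standard(value: Any) -> bool:
    """Primitives or homogeneous arrays of primitives (allowed as span attributes)."""
    if isinstance(value, _PRIMITIVES):
        return True
    if isinstance(value, (list, tuple)):
        return all(isinstance(v, _PRIMITIVES) for v in value) and len({type(v) for v in value}) <= 1
    return False


def _serialize(value: Any) -> Optional[str]:
    """JSON-encode an extended value; return None (drop it) if it cannot be encoded."""
    if value is None:
        return None
    try:
        return json.dumps(value, sort_keys=True)
    except (TypeError, ValueError):  # e.g. bytes are not JSON-serializable
        return None


def log_event_to_span_event(record: LogRecord, current_span: Optional[Span]) -> Optional[SpanEvent]:
    if current_span is None:
        return None  # no current span: drop the event
    attrs: Dict[str, Any] = {}
    for key, value in record.attributes.items():
        if _is_standard(value):
            attrs[key] = value
        else:
            encoded = _serialize(value)
            if encoded is not None:
                attrs[key] = encoded
    if record.severity_number is not None:
        attrs["log.record.severity.number"] = record.severity_number
    if record.severity_text is not None:
        attrs["log.record.severity.text"] = record.severity_text
    if record.body is not None:
        body = record.body if isinstance(record.body, str) else _serialize(record.body)
        if body is not None:
            attrs["log.record.body"] = body
    event = SpanEvent(
        name=record.event_name or FALLBACK_EVENT_NAME,
        # Prefer Timestamp; fall back to Observed timestamp.
        timestamp_ns=record.timestamp_ns if record.timestamp_ns is not None else record.observed_timestamp_ns,
        attributes=attrs,
    )
    current_span.events.append(event)
    return event
```

Flattening nested maps into dotted keys would be an equally valid choice; serializing keeps the conversion lossless for JSON-compatible values at the cost of opaque strings on the span.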

We should document translation rules to have consistency across languages. Implementation would remain optional and done as a part of contrib repos, but this might change depending on the migration path.

Related:

@lmolkova lmolkova added the spec:logs Related to the specification/logs directory label Feb 1, 2025
@lmolkova lmolkova added this to Logs SIG Feb 1, 2025
@arminru arminru changed the title Document span events <-> log-based events connversion Document span events <-> log-based events conversion Feb 4, 2025
@adriangb

adriangb commented Feb 6, 2025

Coming here from #376 (comment)

A couple points I made in that thread:

  • I think some things that are using logs could have just been using spans but chose logs because of the artificial limitations imposed on span attributes (does not support nested attributes, does not support bytes, does not support nulls, etc.).
  • I don't see why we had to move span events -> Events API -> Logs API when we could have just removed the limitations on span attributes and effectively gone in the opposite direction of merging the Events API into spans.
  • I have yet to see a truly compelling use case for logs other than the limitations of span attributes, the fact that there is no standard for a zero-duration span or for marking a span as an event (I think this would be an easy spec: zero duration? a marker field at the protocol level?), and the fact that there are no good SDK APIs for emitting events as spans (also easy once you agree on the spec).

In other words, I would have made the following changes to the traces spec:

  • Remove limitations on attributes so that they match log attributes and the wire protocol
  • Add a flag on a span to mark it as a log / event
  • Add SDK APIs to emit "logs" as spans with the above flag and zero duration
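A rough sketch of what this proposal could look like at the SDK level. The `Span` model, the `is_log` marker, `emit_log_as_span`, and `render` are all hypothetical illustrations of the idea; none of them is an existing OpenTelemetry or OTLP field or API.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Span:
    # Hypothetical span model; `is_log` is the proposed marker, not an existing OTLP field.
    name: str
    start_time_ns: int
    end_time_ns: int
    attributes: Dict[str, Any] = field(default_factory=dict)
    is_log: bool = False

    @property
    def duration_ns(self) -> int:
        return self.end_time_ns - self.start_time_ns


def emit_log_as_span(name: str, attributes: Optional[Dict[str, Any]] = None,
                     timestamp_ns: Optional[int] = None) -> Span:
    """Emit a "log" as a zero-duration span carrying the log marker."""
    ts = time.time_ns() if timestamp_ns is None else timestamp_ns
    return Span(name=name, start_time_ns=ts, end_time_ns=ts,
                attributes=dict(attributes or {}), is_log=True)


def render(span: Span, understands_log_flag: bool) -> str:
    """How a backend might display the span, depending on whether it knows the marker."""
    if span.is_log and understands_log_flag:
        return f"[log] {span.name}"
    # Older backends ignore the marker and just see a zero-duration span.
    return f"{span.name} ({span.duration_ns} ns)"
```

The point of the design is backward compatibility: a backend that has never heard of the marker still receives a valid span, just one with zero duration.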

This has the following benefits:

  • Backends don't have to make any changes to get basic support for logs / events (they just show up as zero duration spans). With minimal changes (UI only?) they can add very nice support for logs / events.
  • Users don't have to do any more configuration of LoggerProvider, BatchLogRecordProcessor, etc.
  • Sampling and other processing that depends on time works better because it doesn't have to merge two different streams of information.
  • This avoids overloading the logs API, which as I understand it was initially designed to accommodate legacy systems and thus is always going to have some ugly warts (e.g. the fact that both Body and Attributes are structured).
  • It's still possible to define a mapping from logs to spans marked as logs, albeit having to do something about merging Body and Attributes and some other bits. So you immediately are able to pipe logs from legacy systems into backends that don't support Logs.

@adriangb

adriangb commented Feb 6, 2025

I'm happy to bring up these points on a call, wherever is appropriate if not here.

@lmolkova
Contributor Author

lmolkova commented Feb 6, 2025

The motivation to use logs for something goes way beyond span limitations. I think @open-telemetry/android-approvers @open-telemetry/semconv-mobile-approvers and Client SIG can write volumes about their journey to writing-event-as-zero-duration-spans and back.

  1. Not everything is instrumented with tracing or needs to be instrumented with spans. Logs/events capture point-in-time things, spans capture things that last. Spans have context and are in general, much more expensive in terms of perf
  2. There's an enormous amount of vanilla logs usage in the industry. Some of it should have been metrics and spans, but not all; logs exist for a reason.
  3. Logs and events are orthogonal to tracing and could be useful when the corresponding span (if there was one) is sampled out.

Some examples of logs that are not spans and shouldn't be on spans:

  • startup errors
  • Out-of-memory exception
  • receiving a message pushed by the other side (not processing it yet, just recording the fact that it was received)
  • scrolling through a timeline and recording user behavior on it

So the OTel community is moving towards using logs to record data that does not fit into a tree-like trace structure.

I'm happy to bring up these points on a call, wherever is appropriate if not here.

If you want to discuss using spans to record logs, there have been a lot of discussions in the past. I can find these two: #2125 and #4123, and maybe @breedx-splk or @bidetofevil have some good pointers on why spans are not a replacement for events.

On span events -> events, the Logs SIG or the spec is a good place to discuss. Check out #3406.

On logs as span attributes, see above: I don't think it's a viable approach (lack of severity, filtering, non-tracing use cases, the existing logging world, etc.).

On complex attributes: it's been actively discussed recently and is allowed on logs (events). You can find a lot of context on #2888, #2888, #3858, open-telemetry/semantic-conventions#1669

The TL;DR: allowing them on spans is not out of the question, but it is quite a contentious topic with no consensus in sight. You're welcome to drive it in the Spec call, but I recommend first researching the history and understanding the concerns.

@adriangb

adriangb commented Feb 6, 2025

The TL;DR on my points below is that I still think a couple of small, backward-compatible tweaks to the tracing signal could have enabled a lot of the use cases that have moved to logs, with a lot less downside than adding a new signal, let alone a new signal that is encumbered by being designed for backward compatibility with legacy systems.

Not everything is instrumented with tracing or needs to be instrumented with spans.

I'm not sure what the point is here. Are you saying there are things which need to emit logs but don't need to emit spans? What if they only emit logs that go via the span machinery? I just don't understand what the advantage of instrumenting something with OTEL logging is vs. instrumenting it with OTEL tracing, assuming you emit the same amount of telemetry and aren't comparing doing nothing with logging with auto-instrumenting every dependency via tracing.

Logs/events capture point-in-time things, spans capture things that last

My whole point is that it wouldn't be that hard to expand the definition of a span to encapsulate things that have no duration. Call it zero-duration spans, or call it logs if that sounds better. As you say, that was asked in #2125 and it got no response. I agree with your opinion in #4123 (comment): I would like spans to have been reused for this purpose. I don't even think we'd need to make it a "new thing" without a duration or status; just add a new field that marks that span as a "log". This has the advantage of being completely backward compatible down to the wire protocol level. Old backends will support it (probably pretty well), and new or updated backends can do something fancier than displaying zero duration (e.g. not display a duration at all) based on the new field/flag. We can give these things a semantic meaning and semantic conventions without changing the wire protocol or requiring backends to do something new.

Spans have context and are in general, much more expensive in terms of perf

As you say, there are examples where you don't want to propagate the existing context. Some examples that I've encountered, or that come up in the linked discussions, are:

  • Long running batch processing workload that starts a new child trace for each item processed and links them to the parent to avoid having one huge tree.
  • UI monitoring (aka real user monitoring), in which case you don't want every interaction in an SPA to be a child of some long-lived span created when the page first loaded.

But logs can also have context! And you can make a contextless span by creating a new trace. What you want in the tree and not is somewhat arbitrary and I think OTEL should have guidance and APIs to make these things easier, but I don't think a new signal is necessary and I see no reason why this can't and shouldn't be modeled under the existing paradigms.

Performance-wise, it seems to me that at the SDK level the machinery is very similar between logs and tracing and thus would have similar performance. I don't think creating 16 bytes of randomness is more than a drop in the bucket compared to the overhead of instrumenting with OTEL, at least not in Python or Rust. I agree that systems that emit logs to a file with no context or buffering probably have less overhead, but that's a completely different topic, more akin to piping data from a legacy system into an OTEL system, which already has existing solutions.

Some examples of logs that are not spans and shouldn't be on spans:

I think the place where we are not agreeing is on what is a span. I'm proposing you make a new thing that's the same as a span (same internal implementation, same wire protocol, etc.) except that it semantically has no duration and no status. Physically it can have zero duration and an unset status, that doesn't impact the semantic meaning. Once you have that thing, all of your examples fit within it nicely.

On complex attributes: it's been actively discussed recently and is allowed on logs (events). You can find a lot of context on #2888, #2888, #3858, open-telemetry/semantic-conventions#1669

I've been around the block on that one as well... I've been banging the table for a single attributes model that is a superset of JSON for years. As per above, I suspect a lot of uses of logs are actually just uses of complex attributes, and logs happen to support them for legacy-system support reasons.

Somehow we've now gotten ourselves into this weird place where a backend is basically forced to support logs (which have complex attributes) but we are also saying that we don't want to add them to spans because it might break backends that don't support complex attributes.

I know the proposal of "logs" via the tracing machinery and complex attributes works and is useful to users because:

  • We've written an OTEL backend and SDK that supports "logs" by packaging them up as zero-duration spans with a flag. Users love it. We've literally had customers say "we use you because you support logs". It was easy to implement because we do almost zero special handling in the backend; we just have the frontend display this data differently.
  • I've written log -> span mappers to get logs out of a legacy system into an OTEL backend that had no special support for logs (this was before the Logs API existed).
  • We've come up with a whole system to JSON-encode complex attributes into strings and rehydrate them in the backend. It's a PITA to implement, but from a user's perspective it works very well. We've again had users say "we use you because I can basically chuck anything into a log/span and not think about it". I have also pointed this out elsewhere but will point it out again here: many (most?) newer backends are being built on things like ClickHouse or DataFusion, which all have support for JSON-like data.
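A minimal sketch of the general "JSON-encode complex attributes, rehydrate in the backend" technique described in the last bullet. This is not the commenter's actual system; the marker attribute `example.json_encoded_keys` and the helper names are hypothetical.

```python
import json
from typing import Any, Dict

# Hypothetical attribute listing which keys hold JSON-encoded values.
JSON_KEYS_ATTR = "example.json_encoded_keys"
_PRIMITIVES = (str, bool, int, float)


def _is_standard(value: Any) -> bool:
    """True for values that plain span attributes already allow."""
    if isinstance(value, _PRIMITIVES):
        return True
    if isinstance(value, list):
        return all(isinstance(v, _PRIMITIVES) for v in value) and len({type(v) for v in value}) <= 1
    return False


def encode_attributes(attrs: Dict[str, Any]) -> Dict[str, Any]:
    """SDK side: JSON-encode complex values into strings and record which keys were encoded."""
    out: Dict[str, Any] = {}
    encoded_keys = []
    for key, value in attrs.items():
        if _is_standard(value):
            out[key] = value
        else:
            out[key] = json.dumps(value)  # raises TypeError for e.g. bytes
            encoded_keys.append(key)
    if encoded_keys:
        out[JSON_KEYS_ATTR] = sorted(encoded_keys)
    return out


def decode_attributes(attrs: Dict[str, Any]) -> Dict[str, Any]:
    """Backend side: rehydrate the encoded values and drop the marker attribute."""
    encoded_keys = set(attrs.get(JSON_KEYS_ATTR, []))
    return {
        key: json.loads(value) if key in encoded_keys else value
        for key, value in attrs.items()
        if key != JSON_KEYS_ATTR
    }
```

Recording the encoded keys explicitly means an ordinary string attribute that merely looks like JSON is never decoded by mistake.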

@trask
Member

trask commented Feb 6, 2025

hi @adriangb,

The OTEP which defines the vision for OpenTelemetry Events was widely discussed and approved.

I understand your concerns and needs, but the decisions around logs and events have been a huge balancing act among lots of concerns, lots of needs, lots of people, and lots of ecosystems.

We are fully committed to supporting users who want to funnel their events to span events, but it will (eventually) require some kind of opt-in strategy (could potentially be as simple as a declarative configuration property).

It will be hard to change this direction given the amount of discussions and approvals that went into it, but if you'd like to try I'd suggest adding a topic to the Specification SIG meeting agenda and raising it there.

@lmolkova
Contributor Author

lmolkova commented Feb 6, 2025

@adriangb To add to what @trask said, if the core need for you is to have complex attributes on spans, I think you can find a lot of support in the community (myself included). There are a lot of valid points both for and against it in #2888, #3858, and linked discussions. It needs a champion to drive it and reach consensus.

@adriangb

adriangb commented Feb 7, 2025

Yeah, I think what I'm bringing up are orthogonal concerns in some sense, although they are related:
(1) There is a strong need for complex attributes in tracing.
(2) I think events could have been more easily shimmed into tracing than logs, but maybe that ship has sailed.

Reading the discussions and motivations, it feels like (1) and (2) are related insofar as things have been biased toward logs because logs have complex attributes that make it easier to support use cases like LLM chats. And now we've ended up with the worst of all worlds: as a backend developer I have to support complex attributes because I have to support logs (which also bring a lot of other complexity, e.g. what to do with the log body), but I can't benefit from the power of complex attributes in spans.

I'll try to join the next SIG meeting to campaign for complex attributes at least.

@adriangb

Backends can handle span events and log-based events in a similar way on the ingestion side

I'll make a note on this. We rolled this out and then had to roll it back because, currently, every span an exception propagates through generates a span event, and thus a log event if you treat them the same during ingestion. This is very noisy and runs counter to the intent of #4333, which specifically tries to address this:

It's NOT RECOMMENDED to record the same error as it propagates through the call stack, or
to attach the same instance of an exception to multiple log records.

But there is no such guidance / recommendation for span events. So if you treat them similarly things get quite messy.

IMO a reasonable compromise would be to make a change to span events / recording of exceptions on spans to:

  1. Record better-structured data on spans outside of span events: namely, the proposal in Record span-ending exceptions as span attributes instead of span event or log #4429 (or similar) to at least record exception.type and exception.message as attributes on the span, with exception.stacktrace being configurable due to cost concerns.
  2. Change SDKs to only record a span event for exceptions on the first span the exception passes through to match the recommendation for exceptions sent as logs.
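A minimal sketch of the two proposed SDK changes together, using a hypothetical `Span` model and `record_exception` helper rather than the OpenTelemetry API. It assumes an ad-hoc marker attribute on the exception object (the name is made up) is an acceptable way to remember that a span event was already recorded, and it uses the class's `__qualname__` for `exception.type` for brevity; a real SDK may format that value differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

# Hypothetical marker set on the exception once a span event has been recorded for it.
_RECORDED_FLAG = "_example_recorded_as_span_event"


@dataclass
class Span:
    # Hypothetical stand-in for an SDK span.
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    events: List[Tuple[str, Dict[str, Any]]] = field(default_factory=list)


def record_exception(span: Span, exc: BaseException) -> None:
    attrs = {
        "exception.type": type(exc).__qualname__,
        "exception.message": str(exc),
    }
    # (1) Always record structured exception attributes on the span itself.
    span.attributes.update(attrs)
    # (2) Add a span event only on the first span the exception passes through.
    if not getattr(exc, _RECORDED_FLAG, False):
        span.events.append(("exception", dict(attrs)))
        setattr(exc, _RECORDED_FLAG, True)
```

For an exception raised inside an `inner` span and re-raised through an `outer` span, both spans get `exception.type` and `exception.message` attributes, but only `inner` gets an `exception` event, mirroring the "don't record the same error as it propagates" guidance for logs.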

Then if span events are deprecated in favor of logs the semantics at least match up better and the transition is easier.

Projects
Status: No status
Development

No branches or pull requests

4 participants