logs: Mechanism to preserve a log body while also parsing it #3932

Closed
djaglowski opened this issue Mar 8, 2024 · 32 comments
Labels
area:data-model (For issues related to data model); enhancement (New feature or request); spec:logs (Related to the specification/logs directory); triage:deciding:community-feedback (Open to community discussion. If the community can provide sufficient reasoning, it may be accepted)

Comments

@djaglowski
Member

djaglowski commented Mar 8, 2024

It is often not possible to send the same log data to multiple backends. This seems undesirable and unnecessary. Ultimately, I propose a new field on the log data model. To illustrate the problem, here is a detailed example:


Suppose I have read the following log from a file called foo.txt:
[INFO] 8/3/24 12:34:56 { "message": "hello", "foo": { "bar": "baz" } }

Prior to any processing, I have the following:

Attributes:
  log.file.name: foo.txt
Body: [INFO] 8/3/24 12:34:56 { "message": "hello", "foo": { "bar": "baz" } }

At this point, I'd like to send this log to several different backends. However, each backend has different requirements which appear to be conflicting. Our data model is flexible enough to support any one of these options, but it does not appear possible to support all of them at once.


Backend 1 intends to perform all necessary parsing from scratch, so it needs the entire original log with no modifications whatsoever. Optionally, a log.type attribute can indicate which parsing algorithm to apply, so perhaps we add an attribute.

Attributes:
  log.file.name: foo.txt
  log.type: foo
Body: [INFO] 8/3/24 12:34:56 { "message": "hello", "foo": { "bar": "baz" } }

Backend 2 needs the body to be a string, but it won't perform any parsing. Therefore, we need to extract the timestamp and severity prior to sending. Optionally, we could remove the corresponding portions of the string, but leaving them in place allows us to send this to Backend 1 as well.

Attributes:
  log.file.name: foo.txt
  log.type: foo
Severity: INFO
Timestamp: 8/3/24 12:34:56
Body: [INFO] 8/3/24 12:34:56 { "message": "hello", "foo": { "bar": "baz" } }

So far so good. We've made no destructive modifications to the log and can send it to both Backends 1 and 2.
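
A minimal sketch of this non-destructive step, assuming a simplified in-memory record. The `LogRecord` class and `LINE_PATTERN` regex are hypothetical and are not part of any OpenTelemetry API. The timestamp is kept as text because the example's date format is ambiguous.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Optional

# Hypothetical pattern for the example line: "[LEVEL] <date> <time> <rest>"
LINE_PATTERN = re.compile(
    r"^\[(?P<severity>[A-Z]+)\]\s+(?P<timestamp>\S+\s+\S+)\s+(?P<rest>.*)$"
)


@dataclass
class LogRecord:
    body: Any
    attributes: dict = field(default_factory=dict)
    severity: Optional[str] = None
    timestamp: Optional[str] = None  # kept as text; the date format is ambiguous


def extract_fields(record: LogRecord) -> LogRecord:
    """Set severity and timestamp from a string body without modifying the body."""
    if isinstance(record.body, str):
        match = LINE_PATTERN.match(record.body)
        if match:
            record.severity = match.group("severity")
            record.timestamp = match.group("timestamp")
    return record


raw = '[INFO] 8/3/24 12:34:56 { "message": "hello", "foo": { "bar": "baz" } }'
record = extract_fields(
    LogRecord(body=raw, attributes={"log.file.name": "foo.txt", "log.type": "foo"})
)
print(record.severity, record.timestamp)
```

Because the body is only read, the same record remains usable for a backend that wants the untouched string.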


Backend 3 expects structured log bodies, so ideally we would parse the body and overwrite it.

Attributes:
  log.file.name: foo.txt
Severity: INFO
Timestamp: 8/3/24 12:34:56
Body: 
  message: hello
  foo:
    bar: baz

This is no longer compatible with Backends 1 or 2.

Arguably, we could reformat this a bit to make it compatible with Backend 2. Specifically, we could consider the "message" to be the body, and flatten everything else into attributes. This kind of manipulation relies heavily on interpretation.

attributes:
  log.file.name: foo.txt
  foo.bar: baz
severity: INFO
timestamp: 8/3/24 12:34:56
body: hello
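
The reformatting above could be sketched roughly as follows. The function names are hypothetical, and the choice of `message_key` is exactly the interpretation step that makes this approach fragile.

```python
def flatten(value: dict, prefix: str = "") -> dict:
    """Flatten nested maps into dot-separated keys, e.g. {"foo": {"bar": 1}} -> {"foo.bar": 1}."""
    flat = {}
    for key, item in value.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(item, dict):
            flat.update(flatten(item, name))
        else:
            flat[name] = item
    return flat


def message_as_body(body: dict, attributes: dict, message_key: str = "message"):
    """Return (new_body, new_attributes), treating one key of the parsed body as the body.

    Which key counts as "the message" is an assumption made by whoever configures this.
    """
    remaining = {k: v for k, v in body.items() if k != message_key}
    new_attributes = {**attributes, **flatten(remaining)}
    return body.get(message_key), new_attributes


body, attrs = message_as_body(
    {"message": "hello", "foo": {"bar": "baz"}},
    {"log.file.name": "foo.txt"},
)
print(body, attrs)
```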

Either way, it appears to be impossible to send the same data to both Backends 1 and 3, unless we first copy it and then process it separately, which should not be necessary.


I think we should formalize the notion of "original log body" in some way, so that the same log can be sent to all three backends. The alternative appears to be fragmentation and processing which is highly sensitive to specific needs of eventual consumers.

A semantic convention would potentially be enough here, but this need is so closely related to the basic mechanics of working with logs that I think a new field on the data model may be more appropriate. My attempt at defining this field would look like this:

### OriginalBody (Optional)

A copy of the Log Record's original Body value.

If the field contains a value, it MUST be exactly the value which the Body first contained.

When this field contains a value, it SHOULD be assumed that the Body field was changed.

Likewise, when the field does not contain a value, it SHOULD be assumed that the Body field has NOT changed.

This field would allow both Backends 1 and 2 to implement some very simple logic to prefer OriginalBody over Body, thus allowing the log to be processed with much less sensitivity to the needs of eventual consumers. The example log could look like this:

Attributes:
  log.file.name: foo.txt
  log.type: foo
Severity: INFO
Timestamp: 8/3/24 12:34:56
Body: 
  message: hello
  foo:
    bar: baz
OriginalBody: [INFO] 8/3/24 12:34:56 { "message": "hello", "foo": { "bar": "baz" } }
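
As a rough illustration of the "very simple logic" an exporter for Backends 1 or 2 might use, here is a sketch over a hypothetical record type. `original_body` stands in for the proposed OriginalBody field, which does not exist in the data model today.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class LogRecord:
    body: Any
    original_body: Optional[str] = None  # stand-in for the proposed OriginalBody field


def body_for_raw_backend(record: LogRecord) -> str:
    """Prefer the original body; if absent, assume the body was never changed."""
    if record.original_body is not None:
        return record.original_body
    # Fallback for records without an original: stringify whatever is in the body.
    return record.body if isinstance(record.body, str) else str(record.body)


parsed = LogRecord(
    body={"message": "hello", "foo": {"bar": "baz"}},
    original_body='[INFO] 8/3/24 12:34:56 { "message": "hello", "foo": { "bar": "baz" } }',
)
untouched = LogRecord(body="[INFO] plain line")
print(body_for_raw_backend(parsed))
print(body_for_raw_backend(untouched))
```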


@djaglowski djaglowski added the spec:logs Related to the specification/logs directory label Mar 8, 2024
@tedsuo tedsuo added triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted enhancement New feature or request area:data-model For issues related to data model labels Mar 12, 2024
@Dylan-M

Dylan-M commented Apr 22, 2024

Another common use case for my team is the following:

Parsed, Enriched and Reduced to one Destination (Such as Google Cloud Logging)
Original Raw to "Cold Storage" for compliance (Such as Google Cloud Storage)

This is a very common requirement in the financial sector, for example.

@djaglowski
Member Author

I think enrichment & reduction are meaningful changes to the data and therefore if you need to do this for the sake of one backend only, you must copy the data stream. The same is true for other data types too.

What I'm specifically highlighting in this issue is that you sometimes cannot represent the same information in any one way that is compatible with multiple backends.

@jsuereth
Contributor

So if I understand the proposal correctly, what you want is:

  1. all log processing would be done in the collector. You parse based on the strictest API
  2. Log exporters would need to be updated to ignore parsed values and things if they plan to do their own parsing in their backends.

In your example this seems to work. Would love to hear from vendors here to understand if they are bought into the idea and would update their log exporters appropriately...

ALSO - would this only affect logs-parsed-from-string-sources or would it also be available in an SDK?

@pellared
Member

How would it affect Logs Bridge API and Event API?

@djaglowski
Member Author

So if I understand the proposal correctly, what you want is:

  1. all log processing would be done in the collector. You parse based on the strictest API
  2. Log exporters would need to be updated to ignore parsed values and things if they plan to do their own parsing in their backends.

In your example this seems to work. Would love to hear from vendors here to understand if they are bought into the idea and would update their log exporters appropriately...

What I'm suggesting is intended to be much less intrusive. I'm not suggesting we prescribe where parsing occurs - actually quite the opposite. This would remove an implicit constraint where users are sometimes forced to choose between parsing in the collector and leaving the work for a backend.

I also don't think what I'm suggesting would require changes to any exporters. The idea is to provide an optional improvement for those which prefer an original/string body. Currently, exporters which prefer or require a string body, when presented with a non-string body, will typically just toString the content. Instead, they could check if an original body is present, and if so use that instead.

ALSO - would this only affect logs-parsed-from-string-sources or would it also be available in an SDK?

Generally this is intended only for logs-parsed-from-string-sources. However, I don't think we necessarily need to forbid SDKs from using it. Perhaps there is a logging library out there which prepends some context onto logs before a corresponding appender could have a chance to map it into our data model. e.g. Developer writes logger.Debug("foo") and what the appender receives is "[DEBUG] foo". In such a case, the appender might want to set body="foo" and originalBody="[DEBUG] foo". To be clear, I'm not sure if such an annoying library exists, but in theory this would allow the opportunity to adjust its output at the source while still preserving exactly the original content for those who care. Typically though, I would expect this field is just not used anywhere in instrumentation libraries.

@djaglowski
Member Author

How would it affect Logs Bridge API and Event API?

I don't think there would necessarily be any impact on either. We could choose to allow implementors of the Logs Bridge API to use this with the same guidance I proposed above. See the example in my previous comment.

@jack-berg
Member

It is often not possible to send the same log data to multiple backends. This seems undesirable and unnecessary

I disagree with this framing. It sounds like it's inconvenient rather than not possible:

  • A collector could be configured with a connector such that the pipeline for one backend performs the parsing and another pipeline doesn't
  • Surely an OTLP receiver is capable of receiving the parsed AnyValue representation of the structured body. If an OTLP receiver can do the parsing on the backend, then that's the icing on the cake, but the structured AnyValue representation should be the "least common denominator" that every backend supports. If so, then the solution is to do the parsing in the collector layer, and send the AnyValue representation to both backends. This may be less convenient, but it's not impossible.

@djaglowski
Member Author

the structured AnyValue representation should be the "least common denominator" that every backend supports.

I don't think this matches the reality of the backend landscape, now or in the foreseeable future. Logs (or the bodies of logs) have regularly been transmitted as strings for decades and will continue to be in many cases for years to come. Many backends were designed from the ground up with this in mind and others will continue to provide particular behaviors based on the original string. As mentioned on today's call, some users simply prefer to have the original value available alongside a parsed representation.

The choice to model a log body as AnyValue was not made in order to mandate that backends support structured logs. It was chosen so that our data model could support either structured or unstructured representations. I think we made a mistake in not having an unambiguous string representation and that adding a new optional field is the best way to support all log backends.

I disagree with this framing. It sounds like it's inconvenient rather than not possible:

That's fair, although I would argue it is unreasonably inconvenient. As much as I am a proponent of connectors, I think the notion of retaining the original body is a natural and real concern of many users and backends and should not require replication of data streams. This also doesn't solve the problem of needing both representations simultaneously (which was not part of the issue as originally written up but is a requirement in some cases.)

@jack-berg
Member

I don't think this matches the reality of the backend landscape, now or in the foreseeable future.
The choice to model a log body as AnyValue was not made in order to mandate that backends support structured logs. It was chosen so that our data model could support either structured or unstructured representations

A backend which doesn't support AnyValue bodies doesn't fully support OTLP. By giving users and tools the ability to model and transmit structured logs via OTLP, we did set expectations about what backends should support. Not supporting AnyValue body is like not supporting span events or span kind or a particular metric type. An OTLP receiver might not support those things, but if so, OpenTelemetry shouldn't be obligated to provide an alternative representation for that information.

A backend which accepts OTLP but prefers a JSON string representation can always encode AnyValue to JSON. By doing so they can rid themselves of having to qualify their support of OTLP.

As mentioned on today's call, some users simply prefer to have the original value available alongside a parsed representation.

I don't understand what value this could have if the translation to AnyValue is lossless.

@andykellr

andykellr commented May 1, 2024

A backend which doesn't support AnyValue bodies doesn't fully support OTLP.

As I understand the issue, this doesn't seem to be about backend support for OTLP and structured log bodies. Consider instead: Should it be possible to write an exporter that sends an original log body using any protocol that a backend supports? Should it be possible to send to this exporter while also sending a parsed log body to a different backend?

I don't understand what value this could have if the translation to AnyValue is lossless.

A translation is not always lossless. If the receiver is ingesting nginx for example (choose any version and configuration), the log bodies could vary considerably and simply encoding as JSON will not preserve the original text. A backend that has chosen to implement parsing would still be subjected to any parsing and interpretation done in the collector.
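
To make the lossiness concrete, here is a small sketch using the example line from the issue description. The prefix-splitting rule is a hypothetical stand-in for whatever parsing a collector applies. Re-encoding the parsed body as JSON drops the severity and timestamp prefix and normalizes whitespace, so the result is not the original text.

```python
import json

original = '[INFO] 8/3/24 12:34:56 { "message": "hello", "foo": { "bar": "baz" } }'

# Parsing: split off the prefix (hypothetical rule) and decode the JSON payload.
prefix, _, payload = original.partition(" {")
parsed_body = json.loads("{" + payload)

# Re-encoding the structured body as JSON, as a string-only backend might do.
reencoded = json.dumps(parsed_body)

print(reencoded)
print(reencoded == original)
```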

Log backend architectures have taken different approaches to parsing at the edge vs parsing in the backend. There are advantages to parsing in the backend (e.g. rapid iteration of parsing logic) but rather than debate them, I don't think OpenTelemetry should take an opinionated stance. Instead it should support both backends that expect structured logs and backends that expect original logs. If we take that as our premise, then the case Dan makes for parsing being lossy because it replaces the Body is an issue that should be resolved.

@djaglowski
Member Author

A backend which doesn't support AnyValue bodies doesn't fully support OTLP. By giving users and tools the ability to model and transmit structured logs via OTLP, we did set expectations about what backends should support. Not supporting AnyValue body is like not supporting span events or span kind or a particular metric type. An OTLP receiver might not support those things, but if so, OpenTelemetry shouldn't be obligated to provide an alternative representation for that information.

In my opinion, if there is a whole category of use cases and accompanying backends for which the data model is a poor fit, we should question whether the data model is correct. Saying that these backends aren't supporting OTLP and therefore should change is a bit like designing a "universal" screwdriver, realizing it doesn't work for some types of screws, and then telling the screws to change.

My hope with this issue was to highlight a specific instance where having an original log body field would be useful. However, I think it's too narrowly focused since we do not have agreement that preservation of traditional string logs is a requirement in enough cases. It probably makes more sense to open a new issue but to reframe this more broadly:

  • There is value in having an unambiguous string field for the original log body. Many backends do not care about it, but others would use it when present.
  • Some users need to send the original log alongside a parsed representation. It often isn't lossless and lossiness is beside the point if their goal is to ingest data which is both semantically useful and compliant.
  • An attribute isn't really appropriate because the value is not metadata, is extremely high cardinality, and is basically never intended for indexing.

@pellared
Member

pellared commented May 6, 2024

An attribute isn't really appropriate because the value is not metadata, is extremely high cardinality, and is basically never intended for indexing.

Who said that log record attributes should be used for indexing (by default)? People are using structured logging libraries to emit "parametrized" messages and we should not expect that they have low cardinality.

@djaglowski
Member Author

Who said that log record attributes should be used for indexing (by default)?

I didn't say that. I said a string message is basically never intended for indexing. The point is that there isn't any reason to treat a string message as an attribute, other than that it is the only option if we do not provide a place in the data model for the original body.

People are using structured logging libraries to emit "parametrized" messages and we should not expect that they have low cardinality.

A parameterized message, taken as a whole, is of course not appropriate for indexing but often several of its attributes are.

@pellared
Member

pellared commented May 6, 2024

that it is the only option if we do not provide a place in the data model for the original body.

I do not see it as a disadvantage (rather the opposite). We can make a common logs semantic convention for it.

@djaglowski
Member Author

We can make a common logs semantic convention for it.

Of course everything could be a semantic convention. We have a data model anyways because certain aspects of telemetry are so common that it makes sense to encode them in more concrete ways. Logs as strings have been the simplest and most familiar representation of telemetry for decades. We do not have an unambiguous representation for this and instead treat it as if it is obscure. This seems like an obvious miss by the project and one that will continue to cause confusion and frustration among those who work with traditional logs.

@djaglowski
Member Author

Here is a scenario which demonstrates the necessity of having both parsed and raw representations of a log in the same payload, as well as a subsequent discussion of how this should be accomplished.

Scenario

Suppose an organization has a microservices application where at least some of the services emit logs to files or other traditional log media. In order to meet a compliance requirement, any logs matching “business logic X” must be redacted according to a specific set of rules and then retained in what is otherwise their original form. For this scenario, let's assume "business logic X" applies to 10% of all logs.

At the same time, the application must be managed day-to-day which requires sending semantically rich logs to a modern observability backend. For this scenario, let's assume that 10% of logs matching "business logic X" are useful for day-to-day operations.

Organizationally, each service within the application is owned by a dedicated team. There is also an application-level ops team responsible for ensuring that all of the above requirements are met.

Naive Solution

Since the ops team is ultimately responsible for ensuring the requirements, they could ask all of the service teams to send over all their logs. The ops team could then apply "business logic X" in a centralized collector. They can also apply their other business rules and route data appropriately.

image

Problem - Log Volume

Since 90% of logs will be discarded when "business logic X" is applied, it doesn't make sense to send all logs over the network before applying the logic. At best this is very wasteful, but it may not even be possible due to bandwidth constraints. Therefore, the service teams should apply "business logic X" before exporting their logs to the ops team.

Problem - Parsing Complexity

In order to apply "business logic X", it may be necessary to parse the logs. This isolates meaningful values and allows for semantic reasoning. Parsing logs is often complicated and requires familiarity with the service that emitted the logs, which the ops team does not have.

Even if the ops team was familiar enough with all the various log formats, it would still be impractical for them to handle parsing. First, they would have to manage a layer for separating the various formats from one another so that appropriate parsing rules may be applied. Of course they would also have to manage the parsing rules themselves. Any time there's a new service, a change to the log formats emitted by a service, or a bug in the parsing rules, this would require a reconfiguration of the centralized gateway collector.

image

Edge Parsing

The ops team can avoid many of these problems by asking the service teams to parse their own logs and then apply "business logic X" before sending to the ops team. However, in order to meet compliance requirements, the service teams still need to send over a raw copy of the logs which match "business logic X". As noted previously, in many cases it is necessary to parse the logs in order to determine which logs must be sent. Therefore, the service teams would be asked to (1) parse the logs while somehow retaining the raw log, (2) apply "business logic X", and (3) send over both raw and parsed versions of the remaining logs.
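
The three steps can be sketched as a small generator. Everything here is hypothetical: `parse_kv` is a toy parser standing in for service-specific rules, `matches_logic_x` stands in for "business logic X", and the `original_body` key is only one possible place to carry the raw log.

```python
from typing import Callable, Iterable, Iterator, Optional


def edge_pipeline(
    raw_lines: Iterable[str],
    parse: Callable[[str], Optional[dict]],
    matches_logic_x: Callable[[dict], bool],
) -> Iterator[dict]:
    """Yield one record per matching log, carrying both parsed and raw forms."""
    for raw in raw_lines:
        parsed = parse(raw)  # (1) parse, keeping `raw` untouched
        if parsed is None:
            continue  # unparseable lines are dropped in this sketch
        if not matches_logic_x(parsed):  # (2) apply "business logic X"
            continue
        yield {"body": parsed, "original_body": raw}  # (3) send both together


def parse_kv(line: str) -> Optional[dict]:
    """Toy parser for 'key=value key=value' lines."""
    try:
        return dict(pair.split("=", 1) for pair in line.split())
    except ValueError:
        return None  # a token without '=' makes the line unparseable


lines = ["type=payment user=alice", "type=debug detail=x", "garbage"]
records = list(edge_pipeline(lines, parse_kv, lambda r: r.get("type") == "payment"))
print(records)
```

Filtering happens on the parsed form, yet the record that leaves the service still contains the exact text that was read.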

There is currently no definitive way to accomplish this but we can look at the possibility of sending parsed and raw logs separately, or together with the raw log placed in a clever position within the parsed log.

Separate requests for raw and parsed

There are some clear downsides to sending the parsed and raw logs as separate requests. It doubles the network traffic and introduces inconsistency when a request fails, since the corresponding request may succeed. Even if these downsides were accepted, the service teams will still struggle to meet the requirements because of the necessary order of operations.

Naively, one team might think to copy the logs immediately as they are read. However, since they cannot apply "business logic X" to raw logs, both copies would have to be parsed anyways. If "business logic X" allows, they may choose to apply less thorough parsing rules to the "raw" stream, but they would still need to determine a clever place to store the raw log temporarily in order to ensure that it is not overwritten when parsing rules are applied. They would then restore the original log body after parsing.

Another team might prefer to delay copying of their log stream until after parsing rules and "business logic X" are applied. This allows them to simplify their configuration somewhat, but they still need to cleverly preserve the raw log first. Then eventually after copying the stream, they can restore the raw log on one copy, and perhaps also delete the raw log from the other copy.

Ultimately, the requirements are such that each team will likely have to invest time into a solution that goes beyond the basic requirements of parsing and applying business rules, and instead is a problem of working around the data model. It would be reasonable to expect that some teams will misunderstand, misapply, or ignore the requirements because the possible solutions are too nuanced.

image

Single request per log

The work required of the service teams becomes much simpler if they should only send one request. However, there must be an agreed upon place for storing the original log so that all the service teams place it in the same location, and so the ops team can rely on finding it there.

image

Attribute vs Data Model

Log attributes are defined by our data model as (emphasis mine) "Additional information about the specific event occurrence." A traditional string log is a whole standalone representation of the event. It may not be the ideal representation, but it is clearly not "additional information" about the event.

Our data model also describes the benefits of named top-level fields, one of which is being able to have an unambiguous data type. Traditional string logs are common and best represented as strings (or byte sequences).

Additionally, our data model indicates the criteria used to accept a named top-level field. The first states, (emphasis mine) "The field needs to be either mandatory for all records or be frequently present in well-known log and event formats (such as Timestamp) or is expected to be often present in log records in upcoming logging systems (such as TraceId)." Many of the most well known log formats are strings. The other criterion states, "The field’s semantics must be the same for all known log and event formats and can be mapped directly and unambiguously to this data model." I believe the semantics of the original log string are quite clear and consistent when taken as the complete original representation of the event.

@jmacd
Contributor

jmacd commented May 14, 2024

I support the idea of retaining original data in a top-level bytes field. How would you annotate the receiver that is responsible for parsing the record? It seems to me that if you see an original log body field you would also want to know which piece of code or logic is responsible for constructing the log record--is that a semantic convention?

@djaglowski
Member Author

How would you annotate the receiver that is responsible for parsing the record? It seems to me that if you see an original log body field you would also want to know which piece of code or logic is responsible for constructing the log record--is that a semantic convention?

If we want to capture this I think a semantic convention would be appropriate. It's not entirely clear to me what the semantics of it would be though since parsing may occur in multiple separate stages. Perhaps we can address this later based on feedback.

@djaglowski
Member Author

@open-telemetry/specs-approvers please review

@tigrannajaryan
Member

Additionally, our data model indicates the criteria used to accept a named top-level field. The first states, (emphasis mine) "The field needs to be either mandatory for all records or be frequently present in well-known log and event formats (such as Timestamp) or is expected to be often present in log records in upcoming logging systems (such as TraceId)." Many of the most well known log formats are strings. The other criterion states, "The field’s semantics must be the same for all known log and event formats and can be mapped directly and unambiguously to this data model." I believe the semantics of the original log string are quite clear and consistent when taken as the complete original representation of the event.

@djaglowski I think I am convinced that there is a need to retain the original log body (thank you for the diagrams and detailed description). I also agree with you that parsing preferably should happen on the edge where the knowledge about the log format is.

However I am not convinced that the use case you describe is frequent enough to warrant a top-level field.

Here is what I suggest we do: we need to see a large number of upvotes from community and support from many spec approvers for this capability to make it a top-level field. Otherwise make it a semantic convention and store in an attribute, e.g. log.original_body or log.record.original. Performance and (compressed) size differences are likely going to be marginal anyway.
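
For comparison, the attribute-based alternative might look like the sketch below. It uses `log.record.original`, one of the example names floated above, as an ordinary attribute key. The helper function is hypothetical, and JSON is assumed as the body format only for illustration.

```python
import json

ORIGINAL_KEY = "log.record.original"  # one of the attribute names suggested above


def parse_preserving_original(body, attributes: dict):
    """Parse a JSON string body, saving the original text in an attribute first."""
    if not isinstance(body, str):
        return body, attributes
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return body, attributes  # leave unparseable bodies untouched
    new_attributes = dict(attributes)
    # Never overwrite an earlier copy: the first stored value is the true original.
    new_attributes.setdefault(ORIGINAL_KEY, body)
    return parsed, new_attributes


raw = '{"message": "hello", "foo": {"bar": "baz"}}'
body, attrs = parse_preserving_original(raw, {"log.file.name": "foo.txt"})
print(body)
print(attrs[ORIGINAL_KEY])
```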

@Dylan-M

Dylan-M commented May 21, 2024

Here is what I suggest we do: we need to see a large number of upvotes from community and support from many spec approvers for this capability to make it a top-level field. Otherwise make it a semantic convention and store in an attribute, e.g. log.original_body or log.record.original. Performance and (compressed) size differences are likely going to be marginal anyway.

And where would these upvotes be done at? I'm not familiar with this process, but this is a frequent enough need for me that I feel it should be a top level field. Twice in just the last week I've had to work around this by manually manipulating data in an annoying way that wouldn't have been needed if this was present.

If you're just doing the thumbs up thing, I've already added mine to both your comment and the original post.

@tigrannajaryan
Member

And where would these upvotes be done at?

On this very issue. The issues list is sortable by upvotes which I find to be a useful proxy for demand: https://github.com/open-telemetry/opentelemetry-specification/issues?q=is%3Aissue+is%3Aopen+sort%3Areactions-%2B1-desc

Twice in just the last week I've had to work around this by manually manipulating data in an annoying way that wouldn't have been needed if this was present.

@Dylan-M can you tell more? Would this remain annoying if it was a log record attribute with a semantic convention?

@Dylan-M

Dylan-M commented May 21, 2024

@Dylan-M can you tell more? Would this remain annoying if it was a log record attribute with a semantic convention?

Sure:

Use case 1:
1 Log entry, 2 Destinations
Specifically, Google Cloud logging - expects data in parsed format to create a nice jsonPayload and Google Security Operations (formerly known as Chronicle) - expects the original raw unparsed log

Use case 2:
Need to do some filtering, which is easier on parsed data. However, the destination needed to be raw logs.

In both use cases, I handled it (well, BindPlane, which is the product I work with to manage OTel configurations did this automatically for me) by having 1 receiver, and using it in 2 logs pipelines.

As a last processor before going to the exporter, it either removes my added _raw field, or rewrites the raw field over the body, making it a string again. Whichever is appropriate for that destination.

If this was a top level field, the exporter helper could have a setting added to handle this automatically:

send_log_as: raw | parsed | both

Obviously, this is just my opinion on how I would approach it, but it seems simple and direct.
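
A sketch of what such a setting might do per destination. `send_log_as` is the hypothetical option named above, not an existing exporter helper setting, and storing the preserved original under a `_raw` attribute is an assumption about where that field lives.

```python
from copy import deepcopy

RAW_KEY = "_raw"  # assumed attribute holding the preserved original


def apply_send_log_as(record: dict, mode: str) -> dict:
    """Shape a record for one destination. `mode` is 'raw', 'parsed', or 'both'."""
    out = deepcopy(record)
    attributes = out.setdefault("attributes", {})
    if mode == "raw":
        # Restore the original string over the parsed body, then drop the copy.
        if RAW_KEY in attributes:
            out["body"] = attributes.pop(RAW_KEY)
    elif mode == "parsed":
        attributes.pop(RAW_KEY, None)
    elif mode != "both":
        raise ValueError(f"unknown send_log_as mode: {mode!r}")
    return out


record = {"body": {"message": "hello"}, "attributes": {RAW_KEY: '{"message": "hello"}'}}
print(apply_send_log_as(record, "raw"))
print(apply_send_log_as(record, "parsed"))
```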

@Dylan-M

Dylan-M commented May 21, 2024

I support the idea of retaining original data in a top-level bytes field. How would you annotate the receiver that is responsible for parsing the record? It seems to me that if you see an original log body field you would also want to know which piece of code or logic is responsible for constructing the log record--is that a semantic convention?

None of my users have ever cared what parsed it, as long as the data ended up in their expected formats. As Dan said in his earlier response to you, parsing frequently occurs in multiple places.

For example, a JSON log coming out of PCF. You have a Transform Processor that converts the entry from a JSON string to an actual map. Now we can manipulate it in another processor that removes empty fields. Another that removes fields where data is duplicated under attributes, such as hostname/ip and other such pieces of data. Another that parses and promotes the timestamp to the top level timestamp field. Lastly, another that deletes the original timestamp from the body.

Now, let's address the elephant in the room: All of those are instances of the transform processor. So, all of the rules could be combined into a single processor. However, we've found it is often better to take them in small reusable chunks that can be inserted into multiple pipelines. Especially the timestamp parsing/removal. That typically applies to a number of pipelines, while the JSON parsing might not. Same with the removal of empty values.
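
The "small reusable chunks" idea can be illustrated outside the collector with plain functions. This is not transform processor configuration, only a sketch of the composition. The step names and the `timestamp` body key are hypothetical.

```python
import json
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[Record], Record]


def parse_json_body(record: Record) -> Record:
    """Convert a JSON string body into a map; leave other bodies alone."""
    body = record.get("body")
    if isinstance(body, str):
        try:
            return {**record, "body": json.loads(body)}
        except json.JSONDecodeError:
            pass
    return record


def drop_empty_fields(record: Record) -> Record:
    """Remove body keys whose value is None or an empty string."""
    body = record.get("body")
    if isinstance(body, dict):
        kept = {k: v for k, v in body.items() if v not in (None, "")}
        return {**record, "body": kept}
    return record


def promote_timestamp(record: Record) -> Record:
    """Move body["timestamp"] to the top-level field and delete it from the body."""
    body = record.get("body")
    if isinstance(body, dict) and "timestamp" in body:
        rest = {k: v for k, v in body.items() if k != "timestamp"}
        return {**record, "timestamp": body["timestamp"], "body": rest}
    return record


def run_pipeline(record: Record, steps: List[Step]) -> Record:
    """Apply each step in order, like processors in a pipeline."""
    for step in steps:
        record = step(record)
    return record


json_pipeline = [parse_json_body, drop_empty_fields, promote_timestamp]
result = run_pipeline({"body": '{"timestamp": "t1", "msg": "hi", "host": ""}'}, json_pipeline)
print(result)
```

A pipeline for logs that are already structured could reuse `drop_empty_fields` and `promote_timestamp` without the JSON step.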

The other question you might have: Why multiple pipelines? That is an easy one, it is the inverse of the above. Say I have 3 different files I'm monitoring with the filelog receiver. But they're all different formats, and have different requirements for processors. That requires (sort of) multiple pipelines. One for each, with only the processors applicable to that pipeline.

I say "sort of" on the multiple pipelines, because yes, you could do it all in one. If you do however, you need more complex "where" rules on the transform processor operations. The more complex those rules, the more likely they are to contain errors.

Building configurations for customers is my daily bread and butter, so I've had to address many of these points with them. My approach may not be suitable for everyone, but I tend to work in difficult large enterprise environments with complicated, and often conflicting, requirements.

Hopefully all of that makes sense, sometimes I tend to exposition dump too much ;)

@tigrannajaryan
Member

To add a bit more: we may want to record more than just the raw original bytes, but also some additional information about those original bytes.

For example we may want to record the encoding of original bytes. Use case: the filelog receiver in the Collector knows the encoding of the data it reads and can record this information alongside the original bytes. Knowing the encoding can help backends interpret the data correctly.
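As a rough illustration of why the encoding matters, the Python sketch below decodes preserved bytes using an encoding recorded next to them. The attribute names `log.record.original` and `log.record.original.encoding` are hypothetical placeholders, not defined conventions, and the record layout is an assumption.

```python
# Bytes as a file reader might have captured them from a Latin-1 encoded file.
raw = "h\u00e9llo".encode("latin-1")  # b'h\xe9llo'

record = {
    "attributes": {
        "log.record.original": raw,                  # hypothetical attribute name
        "log.record.original.encoding": "latin-1",   # hypothetical attribute name
    }
}


def decode_original(record, default="utf-8"):
    """Decode the preserved bytes with the recorded encoding, else a default."""
    attrs = record["attributes"]
    encoding = attrs.get("log.record.original.encoding", default)
    return attrs["log.record.original"].decode(encoding)


text = decode_original(record)

# Without the recorded encoding, a backend that assumes UTF-8 cannot decode
# these bytes at all.
without_encoding = {"attributes": {"log.record.original": raw}}
try:
    decode_original(without_encoding)
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```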

Another example: we may want to record the offset or sequence number of the body. Use case: the filelog receiver can truncate body if it reaches the maximum size (and not at the correct boundary defined by delimiter). The backends may want to reconstruct and stitch the bodies before processing and to do that they need to know the sequencing of individual log records (which is not guaranteed to be preserved by the Collector).
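A minimal sketch of the stitching use case, assuming each truncated piece carries its byte offset in a hypothetical `log.record.offset` attribute (the name, the record layout, and the 16-byte limit are all assumptions for illustration):

```python
def split_body(data, max_size):
    """Emulate a reader that cuts a body at max_size bytes, tagging each
    piece with the byte offset at which it started."""
    return [
        {"body": data[i:i + max_size], "attributes": {"log.record.offset": i}}
        for i in range(0, len(data), max_size)
    ]


def stitch(records):
    """Reassemble the original body regardless of arrival order."""
    ordered = sorted(records, key=lambda r: r["attributes"]["log.record.offset"])
    return b"".join(r["body"] for r in ordered)


original = b'{"message": "a very long log line that exceeds the limit"}'
pieces = split_body(original, 16)

# Delivery order is not guaranteed, so simulate records arriving reversed.
restored = stitch(list(reversed(pieces)))
```

Sorting on the offset is what lets the backend recover the body even though ordering was lost in transit.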

Given that there is potentially more data to record together with body bytes I think it strengthens the argument that this data needs to be modelled as multiple attributes defined in semantic conventions. It is unlikely that we will want to add multiple top-level fields to record this data.

@djaglowski
Member Author

@tigrannajaryan, thanks for your review on this. I like the suggestions about potentially recording other information (e.g. sequence number or encoding) as attributes.

For the raw log itself, I can't reconcile how it fits our definition of attributes: "Additional information about the specific event occurrence." Is your perspective on this that the raw log is additional information about the log event, or is your opinion that we should model it as an attribute anyways?

> I am not convinced that the use case you describe is frequent enough to warrant a top-level field.

I want to highlight that the use case as described is a composition of multiple motivations for retaining the original log (compliance, reasoning about raw logs, parsing portability). I think it is representative of many use cases. I'm also happy to find distinct use cases if that's the distinguishing factor. That said, I think it first makes sense to determine whether the original log is additional information about the event.

@tigrannajaryan
Member

> Is your perspective on this that the raw log is additional information about the log event, or is your opinion that we should model it as an attribute anyways?

I believe "Additional" here should be read as "anything else that doesn't already have a place to record that data", so I don't see a problem with adding the original body or any other information we would like to add about the log record in the attributes. In my opinion this does not contradict the spec's intent.

> I want to highlight that the use case as described is a composition of multiple motivations for retaining the original log (compliance, reasoning about raw logs, parsing portability). I think it is representative of many use cases. I'm also happy to find distinct use cases if that's the distinguishing factor.

I don't doubt this. The ability to record the original body was something I also felt may be necessary. The reason I am opposed to adding it as a top-level field is that I don't yet see the evidence of it meeting the bar of it being "frequently present in well-known log and event formats".

There are also downsides to adding top-level fields. For example it can increase in-memory size of every log record (e.g. by 24 bytes in 64bit Go for a byte slice field) even if the field is empty. And if we were to add 3 new fields (body bytes, encoding and offset), that's even more extra memory potentially wasted if the data is not present.
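The 24-byte figure follows from the Go slice header, which holds a data pointer, a length, and a capacity, each one machine word. A back-of-the-envelope sketch in Python, assuming a 64-bit platform and counting only the header (not the allocation behind it); the record count is an arbitrary illustration:

```python
WORD_BYTES = 8                    # one machine word on a 64-bit platform
SLICE_HEADER = 3 * WORD_BYTES     # Go slice header: pointer, length, capacity


def empty_field_overhead(num_records, slice_fields=1):
    """Bytes spent on empty byte-slice fields across num_records records."""
    return num_records * slice_fields * SLICE_HEADER


# One unused byte-slice field across one million in-memory log records.
overhead = empty_field_overhead(1_000_000)  # 24_000_000 bytes, about 24 MB
```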

In my opinion we need a strong justification to add a new top-level field that clearly shows the cost of not doing so is higher because an attribute would use more space and is slower. I don't see the evidence of that and I think as usual the burden of proof is on whoever suggests the change.

@djaglowski
Member Author

> I don't yet see the evidence of it meeting the bar of it being "frequently present in well-known log and event formats".

I would argue that the most well known log formats are traditional formats that are represented as strings or byte sequences, such as syslog, journald, windows event log, and popular file log formats such as those used by docker or containerd. As written, the requirement is easily met because the value is always present in its original representation.

The bar being applied here seems to be whether or not the field would be frequently present in OTel's log format. My understanding of the intent behind the language is that we wanted to ensure well-known formats can be adequately represented, not to establish a utilization threshold.

> And if we were to add 3 new fields (body bytes, encoding and offset), that's even more extra memory potentially wasted if the data is not present.

I would suggest that encoding and offset are truly "additional information" about the log and therefore should be attributes, whereas the original bytes are the log. Your interpretation of what "additional" means is pragmatic enough to allow the entire original log to be called an attribute, but I don't think the alternative is that other top-level fields would be made necessary if we accepted the one I've proposed.

> In my opinion we need a strong justification to add a new top-level field that clearly shows the cost of not doing so is higher

100% agree

> because an attribute would use more space and is slower.

Why would space and speed be the only factors considered in this decision? They are prominent concerns of course but the primary motivation for this proposal is usability and I think this should be a priority for us as well.

> as usual the burden of proof is on whoever suggests the change.

I think I've clearly demonstrated a shortcoming in our data model and made the best case I can for how the proposed top-level field would satisfy the documented requirements. Beyond that this is an appeal to usability.

Many logs are strings or byte sequences, even if we consider that to be an outdated representation. By not providing a direct representation of this fact within our data model, we are taking what should be intuitive and making it obscure. These logs will continue to be a ubiquitous telemetry medium for a long time and, as potentially the person in the community who fields the most questions about them, I am increasingly convinced that we missed the mark by designating the Body field as the appropriate field in which to place them. We intended it to be flexible enough for either structured or unstructured logs but in practice it is overloaded. A semantic convention which defines the log as an attribute of itself is just doubling down on the usability problem rather than relieving it. I think my proposal is the best way to untangle this.

At this point I think I've made the best case I can for the proposal so if there's no appetite for moving forward with it we can close the issue.

@tigrannajaryan
Member

> I would argue that the most well known log formats are traditional formats that are represented as strings or byte sequences, such as syslog, journald, windows event log, and popular file log formats such as those used by docker or containerd.

That is already achievable today by putting the original bytes in the Body field. What I am asking for is a demonstration that simultaneously the original bytes and the parsed version need to be present in the log record and that it is indeed a frequent case. As far as I know that is not the case for traditional formats, but I may be wrong and you can point me to some examples.

> ... the primary motivation for this proposal is usability and I think this should be a priority for us as well.

Can you expand on the usability aspect? I am not sure I see the usability problem with original body being a log attribute. We already have the necessary machinery to work with log attributes in the Collector (e.g. using filelog operators) and they don't seem particularly more burdensome than to work with the Body field. Perhaps I am missing something.

> At this point I think I've made the best case I can for the proposal so if there's no appetite for moving forward with it we can close the issue.

Let's not give up just yet. :-) And after all mine is just one opinion, others may have a different opinion and I am open to reconsidering.

@djaglowski
Member Author

> That is already achievable today by putting the original bytes in the Body field.

Right, but when a log is parsed there is usually an unstructured message which is naturally placed in the body.

e.g. 77 <86>1 2015-08-05T21:58:59.693Z 192.168.2.132 inactive - - - Something happened

If I read and parse this syslog, it is mostly well defined fields which can be mapped into either top-level fields or attributes, but "Something happened" is a "human-readable string message (including multi-line) describing the event in a free form", which is one of the explicitly defined purposes of the Body. So I can either choose to have this message isolated and placed in the Body, which matches expectations of many backends, or retain the original message in its full form. The field serves both purposes but users may need both.
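To illustrate the tension, here is a rough Python sketch that parses the syslog line above into fields, places the free-form message in the body, and keeps the full original alongside it. The regular expression is a simplified reading of the RFC 5424 header, not a complete parser. The leading number is assumed to be a framing prefix and is not validated, and every attribute name (including `log.record.original`) is a placeholder chosen for this example.

```python
import re

# Simplified RFC 5424 header: optional leading count, PRI, version, timestamp,
# hostname, app name, proc id, msg id, structured data, then the message.
SYSLOG = re.compile(
    r"^(?:(?P<count>\d+) )?"
    r"<(?P<pri>\d{1,3})>(?P<version>\d{1,2}) "
    r"(?P<timestamp>\S+) (?P<hostname>\S+) (?P<appname>\S+) "
    r"(?P<procid>\S+) (?P<msgid>\S+) (?P<sd>-|\[.*\])"
    r"(?: (?P<msg>.*))?$",
    re.DOTALL,
)


def parse_syslog(raw):
    """Return a record with the message in the body and the original kept."""
    m = SYSLOG.match(raw)
    if m is None:
        # Unparseable input: leave the raw line in the body untouched.
        return {"body": raw, "attributes": {}}
    pri = int(m["pri"])
    return {
        "timestamp": m["timestamp"],
        "body": m["msg"],
        "attributes": {
            "log.record.original": raw,      # placeholder attribute name
            "syslog.facility": pri // 8,     # PRI = facility * 8 + severity
            "syslog.severity": pri % 8,
            "host.name": m["hostname"],
            "app.name": m["appname"],
        },
    }


line = ("77 <86>1 2015-08-05T21:58:59.693Z 192.168.2.132 "
        "inactive - - - Something happened")
record = parse_syslog(line)
```

Here the body carries only "Something happened" while the full line survives under the placeholder attribute, which is the attribute-based variant of having both representations at once.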

> What I am asking for is a demonstration that simultaneously the original bytes and the parsed version need to be present in the log record and that it is indeed a frequent case.

Let me preface this by specifically setting aside the question of top-level field vs attributes. I don't see a point in discussing that if we haven't settled this question.

I have already provided a very detailed example which demonstrates this necessity in at least some cases and I believe you agreed with my assessment. The question of frequency is obviously more difficult to demonstrate but maybe we can agree that the following two statements are independently true.

  1. Users frequently need to parse their logs prior to arrival in a backend.
  2. Users frequently need to preserve their original logs.

As I understand it, one of the primary value propositions of OpenTelemetry is that data collection is largely decoupled from export in order to avoid traditional observability problems such as vendor lock-in. A vendor-neutral data model is key to this, but the fan-in/fan-out model used in our collector pipelines is perhaps a better illustration of how this works in practice. When you need another data source, just add a receiver. When you need to export to another backend, just add a new exporter.

If we accept that both representations of logs are valid and necessary at times, but consider simultaneous transport to be an edge case, we are tightly coupling ingest and export. Our users are still experiencing a form of vendor lock-in which I do not believe is intended or congruous with the project's goals.

Going back to the scenario described above, suppose there is initially no requirement to archive logs. This is the kind of simple scenario that the data model currently supports well.

image

Starting from this "parsed-only" solution, it's quite painful to add an archive requirement or simply switch to a vendor which expects raw logs. The ops team should reasonably expect to just add another exporter to their existing pipeline. Instead they must reckon with the fact that their entire data pipeline was designed around a "flavor" of logs.

image

In order to add the archive backend they can ask the service teams to send over both representations. The mechanism for doing so is currently ambiguous but realistically they would either ask all teams to update their configs to copy the original body to an attribute, or if they are less wise they wind up with this mess:

image

A similarly cumbersome process would play out if they had started with an archive-only pipeline and then decided to send structured logs to another backend. Either way, as far as the user is concerned this is just about as painful as traditional vendor lock-in problems.

The point is, if we accept that either representation is frequently needed, then users should not be locked into one or the other. It should be reasonably straightforward to collect the data once, process it as needed, and switch or add new backends without entirely rearchitecting their data pipelines. The fundamental reason why this is still a problem is that the data model does not support both in a straightforward manner.
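A small Python sketch of "collect once, process as needed, fan out", using the example log from the top of this issue. It models the workaround of copying the original into an attribute (here the placeholder name `log.record.original`); the record layout, function names, and exporter names are assumptions for illustration, not Collector APIs.

```python
import json

RAW = '[INFO] 8/3/24 12:34:56 { "message": "hello", "foo": { "bar": "baz" } }'


def collect(raw, file_name):
    """Collect once: the untouched log line starts out in the body."""
    return {"attributes": {"log.file.name": file_name}, "body": raw}


def parse_preserving_original(record):
    """Parse the JSON portion into the body, keeping the original alongside."""
    raw = record["body"]
    parsed = dict(record, attributes=dict(record["attributes"]))
    parsed["attributes"]["log.record.original"] = raw  # placeholder name
    parsed["body"] = json.loads(raw[raw.index("{"):])
    return parsed


def export_archive(record):
    """Archive backend: wants the original line, byte for byte."""
    return record["attributes"].get("log.record.original", record["body"])


def export_structured(record):
    """Structured backend: wants the parsed body."""
    return record["body"]


# Adding a backend means adding one entry here, not redesigning the pipeline.
EXPORTERS = {"archive": export_archive, "structured": export_structured}

record = parse_preserving_original(collect(RAW, "foo.txt"))
payloads = {name: export(record) for name, export in EXPORTERS.items()}
```

Because the record carries both representations, each exporter picks the one its backend expects and the ingest side stays the same.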

Is it "frequently" necessary to ship both representations in the same payload? It depends how we measure this. If in terms of % of payloads globally which strictly must contain both representations, it's probably not frequent. However, many users at some point need to add or switch backends, and when they do they very frequently run into this problem. I think the problem may not be as visible to individual backend vendors but when focusing on ingestion, processing, and routing to multiple vendors, this is frequently a major pain point.

@djaglowski
Member Author

It's clear there isn't support for this proposal at this point so I will close it and propose a semantic convention instead.

@felixbarny

There's a similar thing in ECS - the event.original field. It captures the original message/body field but only supports a string value. Maybe we can bring over this field from ECS to SemConv.
