logs: Mechanism to preserve a log body while also parsing it #3932
Comments
Another common use case for my team is the following: Parsed, Enriched and Reduced to one Destination (Such as Google Cloud Logging) This is a very common requirement in the financial sector, for example. |
I think enrichment & reduction are meaningful changes to the data and therefore if you need to do this for the sake of one backend only, you must copy the data stream. The same is true for other data types too. What I'm specifically highlighting in this issue is that you sometimes cannot represent the same information in any one way that is compatible with multiple backends. |
So if I understand the proposal correctly, what you want is:
In your example this seems to work. Would love to hear from vendors here to understand if they are bought into the idea and would update their log exporters appropriately... ALSO - would this only affect logs-parsed-from-string-sources or would it also be available in an SDK? |
How would it affect Logs Bridge API and Event API? |
What I'm suggesting is intended to be much less intrusive. I'm not suggesting we prescribe where parsing occurs - actually quite the opposite. This would remove an implicit constraint where users are sometimes forced to choose between parsing in the collector or leaving the work for a backend. I also don't think what I'm suggesting would require changes to any exporters. The idea is to provide an optional improvement for those which prefer an original/string body. Currently, exporters which prefer or require a string body, when presented with a non-string body, will typically just
Generally this is intended only for logs-parsed-from-string-sources. However, I don't think we necessarily need to forbid SDKs from using it. Perhaps there is a logging library out there which prepends some context onto logs before a corresponding appender could have a chance to map it into our data model. e.g. Developer writes |
I don't think there would necessarily be any impact on either. We could choose to allow implementors of the Logs Bridge API to use this with the same guidance I proposed above. See the example in my previous comment. |
I disagree with this framing. It sounds like it's inconvenient rather than not possible:
|
I don't think this matches the reality of the backend landscape, now or in the foreseeable future. Logs (or the bodies of logs) have regularly been transmitted as strings for decades and will continue to be in many cases for years to come. Many backends were designed from the ground up with this in mind and others will continue to provide particular behaviors based on the original string. As mentioned on today's call, some users simply prefer to have the original value available alongside a parsed representation. The choice to model a log body as AnyValue was not made in order to mandate that backends support structured logs. It was chosen so that our data model could support either structured or unstructured representations. I think we made a mistake in not having an unambiguous string representation and that adding a new optional field is the best way to support all log backends.
That's fair, although I would argue it is unreasonably inconvenient. As much as I am a proponent of connectors, I think the notion of retaining the original body is a natural and real concern of many users and backends and should not require replication of data streams. This also doesn't solve the problem of needing both representations simultaneously (which was not part of the issue as originally written up but is a requirement in some cases.) |
A backend which doesn't support AnyValue bodies doesn't fully support OTLP. By giving users and tools the ability to model and transmit structured logs via OTLP, we did set expectations about what backends should support. Not supporting AnyValue body is like not supporting span events or span kind or a particular metric type. An OTLP receiver might not support those things, but if so, OpenTelemetry shouldn't be obligated to provide an alternative representation for that information. A backend which accepts OTLP but prefers a JSON string representation can always encode AnyValue to JSON. By doing so they can rid themselves of having to qualify their support of OTLP.
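To make the "encode AnyValue to JSON" suggestion concrete, here is a minimal sketch. It assumes an AnyValue body is modeled as plain Python values (strings, numbers, booleans, bytes, lists, dicts); the function name `any_value_to_json` and the choice of base64 for byte values are illustrative assumptions, not something defined in this thread.

```python
import base64
import json


def any_value_to_json(value):
    """Encode an AnyValue-like Python value as a compact JSON string."""

    def encode_bytes(obj):
        # JSON has no byte type; base64 is an assumed convention here.
        if isinstance(obj, (bytes, bytearray)):
            return base64.b64encode(bytes(obj)).decode("ascii")
        raise TypeError(f"unsupported AnyValue member: {type(obj).__name__}")

    return json.dumps(value, default=encode_bytes, separators=(",", ":"))


# A structured body as a backend might receive it over OTLP.
structured_body = {"message": "hello", "foo": {"bar": "baz"}}
encoded = any_value_to_json(structured_body)
```

Note that this produces a string representation of the structured value, which is not necessarily the text the log originally had.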
I don't understand what value this could have if the translation to AnyValue is lossless. |
As I understand the issue, this doesn't seem to be about backend support for OTLP and structured log bodies. Consider instead: Should it be possible to write an exporter that sends an original log body using any protocol that a backend supports? Should it be possible to send to this exporter while also sending a parsed log body to a different backend?
A translation is not always lossless. If the receiver is ingesting nginx for example (choose any version and configuration), the log bodies could vary considerably and simply encoding as JSON will not preserve the original text. A backend that has chosen to implement parsing would still be subjected to any parsing and interpretation done in the collector. Log backend architectures have taken different approaches to parsing at the edge vs parsing in the backend. There are advantages to parsing in the backend (e.g. rapid iteration of parsing logic) but rather than debate them, I don't think OpenTelemetry should take an opinionated stance. Instead it should support both backends that expect structured logs and backends that expect original logs. If we take that as our premise, then the case Dan makes for parsing being lossy because it replaces the Body is an issue that should be resolved. |
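A small sketch of why the translation can be lossy, using the sample line from the issue description. It assumes the parser keeps only the JSON portion of the line; the variable names are illustrative.

```python
import json

# Sample line from the issue description.
original = '[INFO] 8/3/24 12:34:56 { "message": "hello", "foo": { "bar": "baz" } }'

# Assumed parsing step: everything from the first "{" is treated as JSON.
json_text = original[original.index("{"):]
parsed = json.loads(parsed_input := json_text)

# Re-encoding the parsed value yields valid JSON, but not the original text:
# the prefix is gone and the whitespace inside the braces is normalized.
reencoded = json.dumps(parsed)
is_lossless = reencoded == original
```

Here `reencoded` carries the same fields as the original, yet a backend that wants to parse the raw line itself can no longer recover it.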
In my opinion, if there is a whole category of use cases and accompanying backends for which the data model is a poor fit, we should question whether the data model is correct. Saying that these backends aren't supporting OTLP and therefore should change is a bit like designing a "universal" screwdriver, realizing it doesn't work for some types of screws, and then telling the screws to change. My hope with this issue was to highlight a specific instance where having an original log body field would be useful. However, I think it's too narrowly focused since we do not have agreement that preservation of traditional string logs is a requirement in enough cases. It probably makes more sense to open a new issue but to reframe this more broadly:
|
Who said that log record attributes should be used for indexing (by default)? People are using structured logging libraries to emit "parametrized" messages and we should not expect that they have low cardinality. |
I didn't say that. I said a string message is basically never intended for indexing. The point is that there isn't any reason to treat a string message as an attribute, other than that it is the only option if we do not provide a place in the data model for the original body.
A parameterized message, taken as a whole, is of course not appropriate for indexing but often several of its attributes are. |
I do not see it as a disadvantage (rather the opposite). We can make a common logs semantic convention for it. |
Of course everything could be a semantic convention. We have a data model anyways because certain aspects of telemetry are so common that it makes sense to encode them in more concrete ways. Logs as strings have been the simplest and most familiar representation of telemetry for decades. We do not have an unambiguous representation for this and instead treat it as if it is obscure. This seems like an obvious miss by the project and one that will continue to cause confusion and frustration among those who work with traditional logs. |
I support the idea of retaining original data in a top-level |
If we want to capture this I think a semantic convention would be appropriate. It's not entirely clear to me what the semantics of it would be though since parsing may occur in multiple separate stages. Perhaps we can address this later based on feedback. |
@open-telemetry/specs-approvers please review |
@djaglowski I think I am convinced that there is a need to retain the original log body (thank you for the diagrams and detailed description). I also agree with you that parsing preferably should happen on the edge where the knowledge about the log format is. However I am not convinced that the use case you describe is frequent enough to warrant a top-level field. Here is what I suggest we do: we need to see a large number of upvotes from community and support from many spec approvers for this capability to make it a top-level field. Otherwise make it a semantic convention and store in an attribute, e.g. |
And where would these upvotes be done at? I'm not familiar with this process, but this is a frequent enough need for me that I feel it should be a top level field. Twice in just the last week I've had to work around this by manually manipulating data in an annoying way that wouldn't have been needed if this was present. If you're just doing the thumbs up thing, I've already added mine to both your comment and the original post. |
On this very issue. The issues list is sortable by upvotes which I find to be a useful proxy for demand: https://github.com/open-telemetry/opentelemetry-specification/issues?q=is%3Aissue+is%3Aopen+sort%3Areactions-%2B1-desc
@Dylan-M can you tell more? Would this remain annoying if it was a log record attribute with a semantic convention? |
Sure: Use case 1: Use case 2: In both use cases, I handled it (well, BindPlane, which is the product I work with to manage OTel configurations did this automatically for me) by having 1 receiver, and using it in 2 logs pipelines. As a last processor before going to the exporter, it either removes my added If this was a top level field, the exporter helper could have a setting added to handle this automatically:
Obviously, this is just my opinion on how I would approach it, but it seems simple and direct. |
None of my users have ever cared what parsed it, as long as the data ended up in their expected formats. As Dan said in his earlier response to you, parsing frequently occurs in multiple places. For example, a JSON log coming out of PCF. You have a Transform Processor that converts the entry from a JSON string to an actual map. Now we can manipulate it in another processor that removes empty fields. Another that removes fields where data is duplicated under attributes, such as hostname/ip and other such pieces of data. Another that parses and promotes the timestamp to the top level timestamp field. Lastly, another that deletes the original timestamp from the body. Now, let's address the elephant in the room: All of those are instances of the transform processor. So, all of the rules could be combined into a single processor. However, we've found it is often better to take them in small reusable chunks that can be inserted into multiple pipelines. Especially the timestamp parsing/removal. That typically applies to a number of pipelines, while the JSON parsing might not. Same with the removal of empty values. The other question you might have: Why multiple pipelines? That is an easy one, it is the inverse of the above. Say I have 3 different files I'm monitoring with the filelog receiver. But they're all different formats, and have different requirements for processors. That requires (sort of) multiple pipelines. One for each, with only the processors applicable to that pipeline. I say "sort of" on the multiple pipelines, because yes, you could do it all in one. If you do however, you need more complex "where" rules on the transform processor operations. The more complex those rules, the more likely for them to have errors. Building configurations for customers is my daily bread and butter, so I've had to address many of these points with them.
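The "small reusable chunks" idea can be sketched outside the Collector as a list of tiny stage functions. This is plain Python, not Collector or transform processor configuration; the record layout, the stage names, and the ISO 8601 timestamp format are all assumptions made for illustration.

```python
import json
from datetime import datetime


def parse_json_body(record):
    # Convert a JSON string body into a map.
    if isinstance(record["body"], str):
        record["body"] = json.loads(record["body"])
    return record


def drop_empty_fields(record):
    # Remove body fields whose values are empty.
    record["body"] = {
        k: v for k, v in record["body"].items() if v not in ("", None, [], {})
    }
    return record


def promote_timestamp(record):
    # Parse the body's timestamp into the top-level field (ISO 8601 assumed).
    raw = record["body"].get("timestamp")
    if raw is not None:
        record["timestamp"] = datetime.fromisoformat(raw)
    return record


def delete_body_timestamp(record):
    # Drop the now-duplicated timestamp from the body.
    record["body"].pop("timestamp", None)
    return record


def run_pipeline(record, stages):
    for stage in stages:
        record = stage(record)
    return record


# Each pipeline picks only the stages that apply to its format.
json_pipeline = [parse_json_body, drop_empty_fields, promote_timestamp, delete_body_timestamp]

record = {
    "timestamp": None,
    "attributes": {},
    "body": '{"timestamp": "2024-08-03T12:34:56+00:00", "msg": "started", "user": ""}',
}
result = run_pipeline(record, json_pipeline)
```

After the first stage runs, the original string body is gone, which is the situation the rest of this thread is concerned with.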
My approach may not be suitable for everyone, but I tend to work in difficult large enterprise environments with complicated, and often conflicting, requirements. Hopefully all of that makes sense, sometimes I tend to exposition dump too much ;) |
To add a bit more: we may want to record more than just the raw original bytes, but also some additional information about those original bytes. For example we may want to record the encoding of original bytes. Use case: the Another example: we may want to record the offset or sequence number of the body. Use case: the Given that there is potentially more data to record together with body bytes I think it strengthens the argument that this data needs to be modelled as multiple attributes defined in semantic conventions. It is unlikely that we will want to add multiple top-level fields to record this data. |
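As a rough sketch of why the encoding matters alongside the raw bytes: without it, a consumer cannot reliably turn the bytes back into text. The attribute names `log.original.bytes` and `log.original.encoding` are hypothetical placeholders, not an existing semantic convention.

```python
text = "héllo wörld"

# Raw bytes as read from the source, plus the encoding they were written in.
record = {
    "body": {"message": text},
    "attributes": {
        "log.original.bytes": text.encode("utf-16-le"),  # hypothetical name
        "log.original.encoding": "utf-16-le",            # hypothetical name
    },
}


def decode_original(record):
    """Decode the preserved bytes using the recorded encoding."""
    attrs = record["attributes"]
    return attrs["log.original.bytes"].decode(attrs["log.original.encoding"])


recovered = decode_original(record)

# Guessing the wrong encoding does not reproduce the original text.
wrong_guess = record["attributes"]["log.original.bytes"].decode("utf-8", errors="replace")
```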
@tigrannajaryan, thanks for your review on this. I like the suggestions about potentially recording other information (e.g. sequence number or encoding) as attributes. For the raw log itself, I can't reconcile how it fits our definition of attributes.
I want to highlight that the use case as described is a composition of multiple motivations for retaining the original log (compliance, reasoning about raw logs, parsing portability). I think it is representative of many use cases. I'm also happy to find distinct use cases if that's the distinguishing factor. That said, I think it first makes sense to determine whether the original log is additional information about the event. |
I believe "Additional" here should be read as "anything else that doesn't already have a place to record that data", so I don't see a problem with adding the original body or any other information we would like to add about the log record in the attributes. In my opinion this does not contradict the spec's intent.
I don't doubt this. The ability to record the original body was something I also felt may be necessary. The reason I am opposed to adding it as a top-level field is that I don't yet see evidence that it meets the bar of being "frequently present in well-known log and event formats". There are also downsides to adding top-level fields. For example it can increase the in-memory size of every log record (e.g. by 24 bytes in 64-bit Go for a byte slice field) even if the field is empty. And if we were to add 3 new fields (body bytes, encoding and offset), that's even more extra memory potentially wasted if the data is not present. In my opinion we need a strong justification to add a new top-level field that clearly shows the cost of not doing so is higher because an attribute would use more space and is slower. I don't see the evidence of that and I think as usual the burden of proof is on whoever suggests the change. |
I would argue that the most well known log formats are traditional formats that are represented as strings or byte sequences, such as syslog, journald, windows event log, and popular file log formats such as those used by docker or containerd. As written, the requirement is easily met because the value is always present in its original representation. The bar being applied here seems to be whether or not the field would be frequently present in OTel's log format. My understanding of the intent behind the language is that we wanted to ensure well-known formats can be adequately represented, not to establish a utilization threshold.
I would suggest that encoding and offset are truly "additional information" about the log and therefore should be attributes, whereas the original bytes are the log. Your interpretation of what "additional" means is pragmatic enough to allow the entire original log to be called an attribute, but I don't think the alternative is that other top-level fields would be made necessary if we accepted the one I've proposed.
100% agree
Why would space and speed be the only factors considered in this decision? They are prominent concerns of course but the primary motivation for this proposal is usability and I think this should be a priority for us as well.
I think I've clearly demonstrated a shortcoming in our data model and made the best case I can for how the proposed top-level field would satisfy the documented requirements. Beyond that this is an appeal to usability. Many logs are strings or byte sequences, even if we consider that to be an outdated representation. By not providing a direct representation of this fact within our data model, we are taking what should be intuitive and making it obscure. These logs will continue to be a ubiquitous telemetry medium for a long time and as potentially the person in the community who fields the most questions about them, I am increasingly convinced that we missed the mark by designating the Body field as the appropriate field in which to place them. We intended it to be flexible enough for either structured or unstructured logs but in practice it is overloaded. A semantic convention which defines the log as an attribute of itself is just doubling down on the usability problem rather than relieving it. I think my proposal is the best way to untangle this. At this point I think I've made the best case I can for the proposal so if there's no appetite for moving forward with it we can close the issue. |
That is already achievable today by putting the original bytes in the
Can you expand on the usability aspect? I am not sure I see the usability problem with original body being a log attribute. We already have the necessary machinery to work with log attributes in the Collector (e.g. using filelog operators) and they don't seem particularly more burdensome than to work with the Body field. Perhaps I am missing something.
Let's not give up just yet. :-) And after all mine is just one opinion, others may have a different opinion and I am open to reconsidering. |
Right, but when a log is parsed there is usually an unstructured message which is naturally placed in the body. e.g. If I read and parse this syslog, it is mostly well defined fields which can be mapped into either top-level fields or attributes, but
Let me preface this by specifically setting aside the question of top-level field vs attributes. I don't see a point in discussing that if we haven't settled this question. I have already provided a very detailed example which demonstrates this necessity in at least some cases and I believe you agreed with my assessment. The question of frequency is obviously more difficult to demonstrate but maybe we can agree that the following two statements are independently true.
As I understand it, one of the primary value propositions of OpenTelemetry is that data collection is largely decoupled from export in order to avoid traditional observability problems such as vendor lock-in. A vendor-neutral data model is key to this, but the fan-in/fan-out model used in our collector pipelines is perhaps a better illustration of how this works in practice. When you need another data source, just add a receiver. When you need to export to another backend, just add a new exporter. If we accept that both representations of logs are valid and necessary at times, but consider simultaneous transport to be an edge case, we are tightly coupling ingest and export. Our users are still experiencing a form of vendor lock-in which I do not believe is intended or congruous with the project's goals. Going back to the scenario described above, suppose there is initially no requirement to archive logs. This is the kind of simple scenario that the data model currently supports well. Starting from this "parsed-only" solution, it's quite painful to add an archive requirement or simply switch to a vendor which expects raw logs. The ops team should reasonably expect to just add another exporter to their existing pipeline. Instead they must reckon with the fact that their entire data pipeline was designed around a "flavor" of logs. In order to add the archive backend they can ask the service teams to send over both representations. The mechanism for doing so is currently ambiguous but realistically they would either ask all teams to update their configs to copy the original body to an attribute, or if they are less wise they wind up with this mess: A similarly cumbersome process would play out if they had started with an archive-only pipeline and then decided to send structured logs to another backend. Either way, as far as the user is concerned this is just about as painful as traditional vendor lock-in problems.
The point is, if we accept that either representation is frequently needed, then users should not be locked into one or the other. It should be reasonably straightforward to collect the data once, process it as needed, and switch or add new backends without entirely rearchitecting their data pipelines. The fundamental reason why this is still a problem is that the data model does not support both in a straightforward manner. Is it "frequently" necessary to ship both representations in the same payload? It depends how we measure this. If in terms of % of payloads globally which strictly must contain both representations, it's probably not frequent. However, many users at some point need to add or switch backends, and when they do they very frequently run into this problem. I think the problem may not be as visible to individual backend vendors but when focusing on ingestion, processing, and routing to multiple vendors, this is frequently a major pain point. |
It's clear there isn't support for this proposal at this point so I will close it and propose a semantic convention instead. |
There's a similar thing in ECS - the |
It is often not possible to send the same log data to multiple backends. This seems undesirable and unnecessary. Ultimately, I propose a new field on the log data model. To illustrate the problem, here is a detailed example:
Suppose I have read the following log from a file called foo.txt:
[INFO] 8/3/24 12:34:56 { "message": "hello", "foo": { "bar": "baz" } }
Prior to any processing, I have the following:
At this point, I'd like to send this log to several different backends. However, each backend has different requirements which appear to be conflicting. Our data model is flexible enough to support any one of these options, but it does not appear possible to support all of them at once.
Backend 1 intends to perform all necessary parsing from scratch, so it needs the entire original log with no modifications whatsoever. Optionally, a log.type attribute can indicate which parsing algorithm to apply, so perhaps we add an attribute.
Backend 2 needs the body to be a string, but it won't perform any parsing. Therefore, we need to extract the timestamp and severity prior to sending. Optionally, we could remove the corresponding portions of the string, but leaving them in place allows us to send this to Backend 1 as well.
So far so good. We've made no destructive modifications to the log and can send it to both Backends 1 and 2.
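The non-destructive step for Backends 1 and 2 could look roughly like the sketch below. The record layout, the function name, and the `log.type` value are hypothetical, and the date is assumed to be month/day/year, which the sample line does not confirm.

```python
import re
from datetime import datetime

# Sample line from the issue description.
LINE = '[INFO] 8/3/24 12:34:56 { "message": "hello", "foo": { "bar": "baz" } }'

# Assumed layout: "[SEVERITY] M/D/YY HH:MM:SS <rest>".
HEADER = re.compile(
    r"^\[(?P<severity>[A-Z]+)\] "
    r"(?P<timestamp>\d{1,2}/\d{1,2}/\d{2} \d{2}:\d{2}:\d{2}) "
)


def extract_header_fields(body: str) -> dict:
    """Copy severity and timestamp out of the body without modifying it."""
    record = {
        "body": body,  # left untouched so Backend 1 can still parse it
        "timestamp": None,
        "severity_text": None,
        "attributes": {"log.type": "example_format"},  # hypothetical value
    }
    match = HEADER.match(body)
    if match:
        record["severity_text"] = match.group("severity")
        record["timestamp"] = datetime.strptime(
            match.group("timestamp"), "%m/%d/%y %H:%M:%S"
        )
    return record


record = extract_header_fields(LINE)
```

Because the body is only read, never rewritten, the same record still satisfies a backend that wants the original line.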
Backend 3 expects structured log bodies, so ideally we would parse the body and overwrite it.
This is no longer compatible with Backends 1 or 2.
Arguably, we could reformat this a bit to make it compatible with Backend 2. Specifically, we could consider the "message" to be the body, and flatten everything else into attributes. This kind of manipulation relies heavily on interpretation.
Either way, it appears to be impossible to send the same data to both Backends 1 and 3, unless we first copy it and then process it separately, which should not be necessary.
I think we should formalize the notion of "original log body" in some way, so that the same log can be sent to all three backends. The alternative appears to be fragmentation and processing which is highly sensitive to specific needs of eventual consumers.
A semantic convention would potentially be enough here, but this need is so closely related to the basic mechanics of working with logs that I think a new field on the data model may be more appropriate. My attempt at defining this field would look like this:
### OriginalBody (Optional) A copy of the Log Record's original Body value. If the field contains a value, it MUST be exactly the value which the Body first contained. When this field contains a value, it SHOULD be assumed that the Body field was changed. Likewise, when the field does not contain a value, it SHOULD be assumed that the Body field has NOT changed.
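A minimal sketch of the semantics described in this definition, plus the kind of preference logic a string-oriented exporter might apply. The `LogRecord` class, `set_body`, and `preferred_string_body` are hypothetical names for illustration only; using `None` to mean "no value" is a simplification that would not distinguish an original body that was itself empty.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class LogRecord:
    body: Any
    original_body: Optional[Any] = None  # stands in for the proposed OriginalBody
    attributes: dict = field(default_factory=dict)


def set_body(record: LogRecord, new_body: Any) -> None:
    """Replace Body, capturing the first value it held exactly once."""
    if record.original_body is None and new_body != record.body:
        record.original_body = record.body
    record.body = new_body


def preferred_string_body(record: LogRecord) -> str:
    """Prefer the original body when present, otherwise fall back to Body."""
    value = record.original_body if record.original_body is not None else record.body
    return value if isinstance(value, str) else json.dumps(value)


LINE = '[INFO] 8/3/24 12:34:56 { "message": "hello", "foo": { "bar": "baz" } }'
record = LogRecord(body=LINE)

# Two parsing stages: only the first one populates original_body.
set_body(record, json.loads(LINE[LINE.index("{"):]))
set_body(record, {"message": "hello"})

untouched = LogRecord(body="plain text")
```

With this shape, a structured-log backend reads `body`, while a backend that wants the raw line reads `original_body` when it is set, so one record can serve both.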
This field would allow both Backends 1 and 2 to implement some very simple logic to prefer OriginalBody over Body, thus allowing the log to be processed with much less sensitivity to the needs of eventual consumers. The example log could look like this:
Related: