-
Notifications
You must be signed in to change notification settings - Fork 893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability for OTLP response to signal partial acceptance of data #2454
Comments
Confused about why do you need 2 counters, since you have access to the request you know the total number, so you can derive one from the other. |
I was initially thinking 3 different outcomes (Accepted, Bad, Dropped). Not sure if we want to make Bad and Dropped to be one category. |
This is also useful for metrics. It's easy to happen that only some metrics are ingested due to many reasons, e.g. exceeded quotas, bad or invalid data. The OTLP spec even has a section for partial success under "limitations" https://github.com/open-telemetry/opentelemetry-specification/blob/v1.9.0/specification/protocol/otlp.md#partial-success The only way to achieve such"partial success" today is by returning a 400 from the server. This is because the spec says on 200 cases, it should return a It's particularly bad today because users will get 400's even though some of their data were normally ingested. Or even worse, if during the ingestion some "data massage" has to happen and the server would like to indicate this to the user, there's no way as 200 is an empty response.
I think having Bad and dropped is useful in cases such as:
|
Yes, this is a known limitation from day one of OTLP :-)
Looks like the definitions of Dropped vs Bad may need some additional discussion to agree on the semantics. We can probably start with just Another future addition can be an error message or a list of error messages or list of error structures, so that the response not only indicates that only part of the data was accepted, but also what exactly was wrong with the data that was not accepted. As for the response codes: I believe we want to return HTTP 200 for partial success so that existing senders who are unaware of this new response are unaffected. |
I guess that's fair, but I would like to challenge it a bit, as I'm not sure it really improves the status quo. In the OP you mentioned this, (emphasizes mine)
Then, if we implement only returning the Accepted, a message in the Collector's OTLP exporter would look like:
It only tells me how many of "my data" got rejected, but not why they got rejected. This is the part where I'm not sure. Further more, in your reply:
I think we need this from the start. Without it, we only have some indication of it. OK it is better than today, but it doesn't really help the user in finding out what they are doing wrong and why their data is getting rejected. So, does it even help with a first version without this? 🤔 |
I don't mind having more detailed response, it just needs more thought about what exactly we want to record. So, for example we could do this much more detailed version (probably excessive): message ExportLogsServiceResponse {
string error_message = 1; // non-empty human readable message if a partial failure, to help troubleshoot.
int64 accepted_log_records = 2;
repeated RejectedLogRecord rejected_log_records = 2; // optional
// do the same for Resources and Scopes?
}
message RejectedLogRecord {
// do we need a reference to the log record that caused the problem? if yes, need a triplet of indexes.
string error_message = 1; // human-readable message about the particular log record.
// machine readable fields pointing to the particular problem in the log record
repeated FieldError field_errors = 2;
}
message FieldError {
enum FieldRef {
Timestamp, ...
}
enum ErrorType {
BadValue, ...
}
string error_message = 1;
FieldRef field_ref = 2;
ErrorType error_type = 3;
} It is hard to tell how far we want to go. My gut feeling is that the best return on investment will come from a simple "accepted" counter and a human readable error message that can be logged and will help with troubleshooting. |
Yes, 100% agreed. The combination of 200 + number of accepted + a message in the body would definitely be a much better start. Then servers could do their best to indicate what went wrong. I like how it is described in the google.rpc.Status
|
do we expect any implementation to retry the whole batch when receiving the partial acceptance status? If so, do we need some indication on whether the error is retry-able or not for the items that wasn't accepted? |
It would be useful to have such information if we included the details of how each individual record was processed. Otherwise, with just the |
I'm not sure I would expect any exporter implementation to retry a partial acceptance status, since that would probably complicate both the client and the server implementation, no? For example, for metrics, I assume every back-end has their own internal format and technicalities on how they handle data. How would an exporter retry, when a back-end doesn't accept the way metric keys are spelled (e.g. special characters in keys)? Or when their quota exceeded, or when they don't have privileges to ingest certain log types. There's just too many variables involved. I see the partial acceptance as a more valuable way for back-ends to try a "best effort" to retain data, while having a good way of signaling users that they should pay attention to what they are sending, maybe even hinting them with doc links on how to fix. |
I agree. Maybe in the future we can have more granular and detailed information in the response that would allow the exporter to have more complicated retrying logic, but for now I think we should keep it simple. |
I would be happy to help with this @tigrannajaryan. What could be a start? I guess defining new protos and then updating the OTLP spec? |
Great! Yes, start with submitting a proposal in the proto repo. Make sure that interoperability between senders/receivers implementing different OTLP versions (pre and post this change) is maintained. Once we have a consensus there we can move forward. |
Hi @tigrannajaryan, I finally opened a PR for this. open-telemetry/opentelemetry-proto#390 |
Each OTLP request can carry a batch of data some of which may be acceptable for the receiver and some may be unacceptable. Today the receiver cannot signal this situation to the sender. The receiver has to either accept the entire batch or reject the entire batch. Typically implementations accept the entire batch and then drop the unwanted portion of the data. This works from the receiver's perspective however the sender remains uniformed that something may be wrong.
A typical situation when this can happen is when OTLP logs are used for carrying different types of events, e.g. profiling events, client-side browser events, generic application logs, etc. These data is sent to the backend where these various event types may correspond to different product features that the end user may or may not be entitled to. So, for example if the user is entitled to generic application logs but is not entitled to profiling events the batch needs to be processed and profiling events need to be dropped from it.
While this approach works it leaves the senders uninformed which complicates the troubleshooting.
We need a way for OTLP to signal in its response that the request was partially accepted. This will allow the sender to surface the problem to the user (e.g. it can be a log message in the Collector's OTLP exporter).
One possible approach is to return counters of accepted and dropped data in the response, e.g.
If necessary a finer granularity information about individual records may be also added to the response to further aid the troubleshooting, e.g. add a message explaining the reason for dropping or why the data is considered bad.
The text was updated successfully, but these errors were encountered: