Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability for OTLP response to signal partial acceptance of data #2454

Closed
tigrannajaryan opened this issue Mar 29, 2022 · 14 comments · Fixed by #2696
Closed

Add ability for OTLP response to signal partial acceptance of data #2454

tigrannajaryan opened this issue Mar 29, 2022 · 14 comments · Fixed by #2696
Assignees
Labels
enhancement New feature or request spec:protocol Related to the specification/protocol directory

Comments

@tigrannajaryan
Copy link
Member

Each OTLP request can carry a batch of data some of which may be acceptable for the receiver and some may be unacceptable. Today the receiver cannot signal this situation to the sender. The receiver has to either accept the entire batch or reject the entire batch. Typically implementations accept the entire batch and then drop the unwanted portion of the data. This works from the receiver's perspective however the sender remains uniformed that something may be wrong.

A typical situation when this can happen is when OTLP logs are used for carrying different types of events, e.g. profiling events, client-side browser events, generic application logs, etc. These data is sent to the backend where these various event types may correspond to different product features that the end user may or may not be entitled to. So, for example if the user is entitled to generic application logs but is not entitled to profiling events the batch needs to be processed and profiling events need to be dropped from it.

While this approach works it leaves the senders uninformed which complicates the troubleshooting.

We need a way for OTLP to signal in its response that the request was partially accepted. This will allow the sender to surface the problem to the user (e.g. it can be a log message in the Collector's OTLP exporter).

One possible approach is to return counters of accepted and dropped data in the response, e.g.

message ExportLogsServiceResponse {
  int64 accepted_log_records = 1;
  int64 dropped_log_records = 2;
  // may also add counters for Resources and Scopes.
}

If necessary a finer granularity information about individual records may be also added to the response to further aid the troubleshooting, e.g. add a message explaining the reason for dropping or why the data is considered bad.

@tigrannajaryan tigrannajaryan added the spec:protocol Related to the specification/protocol directory label Mar 29, 2022
@tigrannajaryan tigrannajaryan self-assigned this Mar 29, 2022
@bogdandrutu
Copy link
Member

Confused about why do you need 2 counters, since you have access to the request you know the total number, so you can derive one from the other.

@tigrannajaryan
Copy link
Member Author

tigrannajaryan commented Mar 30, 2022

I was initially thinking 3 different outcomes (Accepted, Bad, Dropped). Not sure if we want to make Bad and Dropped to be one category.
Bad = don't understand how to interpret this.
Dropped = I see what this means but I am refusing to accept this.

@joaopgrassi
Copy link
Member

joaopgrassi commented Mar 31, 2022

This is also useful for metrics. It's easy to happen that only some metrics are ingested due to many reasons, e.g. exceeded quotas, bad or invalid data.

The OTLP spec even has a section for partial success under "limitations" https://github.com/open-telemetry/opentelemetry-specification/blob/v1.9.0/specification/protocol/otlp.md#partial-success

The only way to achieve such"partial success" today is by returning a 400 from the server. This is because the spec says on 200 cases, it should return a ExportTraceServiceResponse which is empty. If I want to return something to the user, we are stuck with 400 where we can return a Protobuf-encoded Status with something like you described.

It's particularly bad today because users will get 400's even though some of their data were normally ingested. Or even worse, if during the ingestion some "data massage" has to happen and the server would like to indicate this to the user, there's no way as 200 is an empty response.

I was initially thinking 3 different outcomes (Accepted, Bad, Dropped). Not sure if we want to make Bad and Dropped to be one category.

I think having Bad and dropped is useful in cases such as:

  • Dropped: The data is completely wrong and I can't do anything with this
  • Bad/Invalid: The data is invalid, but the server did xyz to it, and it was accepted after - This gives the user a chance to see they are sending wrong data and how they should fix it.

@tigrannajaryan
Copy link
Member Author

The OTLP spec even has a section for partial success under "limitations" https://github.com/open-telemetry/opentelemetry-specification/blob/v1.9.0/specification/protocol/otlp.md#partial-success

Yes, this is a known limitation from day one of OTLP :-)

I think having Bad and dropped is useful in cases such as:

  • Dropped: The data is completely wrong and I can't do anything with this
  • Bad/Invalid: The data is invalid, but the server did xyz to it, and it was accepted after - This gives the user a chance to see they are sending wrong data and how they should fix it.

Looks like the definitions of Dropped vs Bad may need some additional discussion to agree on the semantics. We can probably start with just Accepted and difference of what was sent and what was accepted can be derived by the sender. Dropped, Bad, etc may be added later and it will be a backward compatible addition.

Another future addition can be an error message or a list of error messages or list of error structures, so that the response not only indicates that only part of the data was accepted, but also what exactly was wrong with the data that was not accepted.

As for the response codes: I believe we want to return HTTP 200 for partial success so that existing senders who are unaware of this new response are unaffected.

@arminru arminru added the enhancement New feature or request label Mar 31, 2022
@joaopgrassi
Copy link
Member

Looks like the definitions of Dropped vs Bad may need some additional discussion to agree on the semantics. We can probably start with just Accepted and difference of what was sent and what was accepted can be derived by the sender. Dropped, Bad, etc may be added later and it will be a backward compatible addition.

I guess that's fair, but I would like to challenge it a bit, as I'm not sure it really improves the status quo.

In the OP you mentioned this, (emphasizes mine)

We need a way for OTLP to signal in its response that the request was partially accepted. This will allow the sender to surface the problem to the user (e.g. it can be a log message in the Collector's OTLP exporter).

Then, if we implement only returning the Accepted, a message in the Collector's OTLP exporter would look like:

Sent: 50, Accepted: 40, Rejected: 10. 

It only tells me how many of "my data" got rejected, but not why they got rejected. This is the part where I'm not sure. Further more, in your reply:

Another future addition can be an error message or a list of error messages or list of error structures, so that the response not only indicates that only part of the data was accepted, but also what exactly was wrong with the data that was not accepted.

I think we need this from the start. Without it, we only have some indication of it. OK it is better than today, but it doesn't really help the user in finding out what they are doing wrong and why their data is getting rejected. So, does it even help with a first version without this? 🤔

@tigrannajaryan
Copy link
Member Author

I don't mind having more detailed response, it just needs more thought about what exactly we want to record.

So, for example we could do this much more detailed version (probably excessive):

message ExportLogsServiceResponse {
  string error_message = 1; // non-empty human readable message if a partial failure, to help troubleshoot.
  int64 accepted_log_records = 2;  
  repeated RejectedLogRecord rejected_log_records  = 2; // optional
  // do the same for Resources and Scopes?
}
message RejectedLogRecord {
  // do we need a reference to the log record that caused the problem? if yes, need a triplet of indexes.

  string error_message = 1; // human-readable message about the particular log record.
  
  // machine readable fields pointing to the particular problem in the log record
  repeated FieldError field_errors = 2;
}
message FieldError {
  enum FieldRef {
    Timestamp, ...
  }
  enum ErrorType {
    BadValue, ...  
  }
  string error_message = 1;
  FieldRef field_ref = 2;
  ErrorType error_type = 3;
}

It is hard to tell how far we want to go. My gut feeling is that the best return on investment will come from a simple "accepted" counter and a human readable error message that can be logged and will help with troubleshooting.

@joaopgrassi
Copy link
Member

joaopgrassi commented Apr 7, 2022

It is hard to tell how far we want to go. My gut feeling is that the best return on investment will come from a simple "accepted" counter and a human readable error message that can be logged and will help with troubleshooting.

Yes, 100% agreed. The combination of 200 + number of accepted + a message in the body would definitely be a much better start. Then servers could do their best to indicate what went wrong. I like how it is described in the google.rpc.Status

// A developer-facing human-readable error message in English. It should
// both explain the error and offer an actionable resolution to it.
string message = 2;

@SergeyKanzhelev
Copy link
Member

do we expect any implementation to retry the whole batch when receiving the partial acceptance status? If so, do we need some indication on whether the error is retry-able or not for the items that wasn't accepted?

@tigrannajaryan
Copy link
Member Author

do we expect any implementation to retry the whole batch when receiving the partial acceptance status? If so, do we need some indication on whether the error is retry-able or not for the items that wasn't accepted?

It would be useful to have such information if we included the details of how each individual record was processed. Otherwise, with just the accepted/error_message fields I think it is not so useful, since the server can just reject the entire request and for the entire request we already have a way to tell the client whether they can or cannot retry.

@joaopgrassi
Copy link
Member

I'm not sure I would expect any exporter implementation to retry a partial acceptance status, since that would probably complicate both the client and the server implementation, no?

For example, for metrics, I assume every back-end has their own internal format and technicalities on how they handle data. How would an exporter retry, when a back-end doesn't accept the way metric keys are spelled (e.g. special characters in keys)? Or when their quota exceeded, or when they don't have privileges to ingest certain log types. There's just too many variables involved.

I see the partial acceptance as a more valuable way for back-ends to try a "best effort" to retain data, while having a good way of signaling users that they should pay attention to what they are sending, maybe even hinting them with doc links on how to fix.

@tigrannajaryan
Copy link
Member Author

I see the partial acceptance as a more valuable way for back-ends to try a "best effort" to retain data, while having a good way of signaling users that they should pay attention to what they are sending, maybe even hinting them with doc links on how to fix.

I agree. Maybe in the future we can have more granular and detailed information in the response that would allow the exporter to have more complicated retrying logic, but for now I think we should keep it simple.

@joaopgrassi
Copy link
Member

joaopgrassi commented Apr 11, 2022

I would be happy to help with this @tigrannajaryan. What could be a start? I guess defining new protos and then updating the OTLP spec?

@tigrannajaryan
Copy link
Member Author

I would be happy to help with this @tigrannajaryan. What could be a start? I guess defining new protos and then updating the OTLP spec?

Great! Yes, start with submitting a proposal in the proto repo. Make sure that interoperability between senders/receivers implementing different OTLP versions (pre and post this change) is maintained. Once we have a consensus there we can move forward.

@joaopgrassi
Copy link
Member

Hi @tigrannajaryan, I finally opened a PR for this. open-telemetry/opentelemetry-proto#390

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request spec:protocol Related to the specification/protocol directory
Projects
None yet
5 participants