Add ability for OTLP response to signal partial acceptance of data #2454

tigrannajaryan · 2022-03-29T15:17:55Z

Each OTLP request can carry a batch of data some of which may be acceptable for the receiver and some may be unacceptable. Today the receiver cannot signal this situation to the sender. The receiver has to either accept the entire batch or reject the entire batch. Typically implementations accept the entire batch and then drop the unwanted portion of the data. This works from the receiver's perspective however the sender remains uniformed that something may be wrong.

A typical situation when this can happen is when OTLP logs are used for carrying different types of events, e.g. profiling events, client-side browser events, generic application logs, etc. These data is sent to the backend where these various event types may correspond to different product features that the end user may or may not be entitled to. So, for example if the user is entitled to generic application logs but is not entitled to profiling events the batch needs to be processed and profiling events need to be dropped from it.

While this approach works it leaves the senders uninformed which complicates the troubleshooting.

We need a way for OTLP to signal in its response that the request was partially accepted. This will allow the sender to surface the problem to the user (e.g. it can be a log message in the Collector's OTLP exporter).

One possible approach is to return counters of accepted and dropped data in the response, e.g.

message ExportLogsServiceResponse {
  int64 accepted_log_records = 1;
  int64 dropped_log_records = 2;
  // may also add counters for Resources and Scopes.
}

If necessary a finer granularity information about individual records may be also added to the response to further aid the troubleshooting, e.g. add a message explaining the reason for dropping or why the data is considered bad.

bogdandrutu · 2022-03-30T17:28:12Z

Confused about why do you need 2 counters, since you have access to the request you know the total number, so you can derive one from the other.

tigrannajaryan · 2022-03-30T21:22:00Z

I was initially thinking 3 different outcomes (Accepted, Bad, Dropped). Not sure if we want to make Bad and Dropped to be one category.
Bad = don't understand how to interpret this.
Dropped = I see what this means but I am refusing to accept this.

joaopgrassi · 2022-03-31T14:25:07Z

This is also useful for metrics. It's easy to happen that only some metrics are ingested due to many reasons, e.g. exceeded quotas, bad or invalid data.

The OTLP spec even has a section for partial success under "limitations" https://github.com/open-telemetry/opentelemetry-specification/blob/v1.9.0/specification/protocol/otlp.md#partial-success

The only way to achieve such"partial success" today is by returning a 400 from the server. This is because the spec says on 200 cases, it should return a ExportTraceServiceResponse which is empty. If I want to return something to the user, we are stuck with 400 where we can return a Protobuf-encoded Status with something like you described.

It's particularly bad today because users will get 400's even though some of their data were normally ingested. Or even worse, if during the ingestion some "data massage" has to happen and the server would like to indicate this to the user, there's no way as 200 is an empty response.

I was initially thinking 3 different outcomes (Accepted, Bad, Dropped). Not sure if we want to make Bad and Dropped to be one category.

I think having Bad and dropped is useful in cases such as:

Dropped: The data is completely wrong and I can't do anything with this
Bad/Invalid: The data is invalid, but the server did xyz to it, and it was accepted after - This gives the user a chance to see they are sending wrong data and how they should fix it.

tigrannajaryan · 2022-03-31T14:38:32Z

The OTLP spec even has a section for partial success under "limitations" https://github.com/open-telemetry/opentelemetry-specification/blob/v1.9.0/specification/protocol/otlp.md#partial-success

Yes, this is a known limitation from day one of OTLP :-)

I think having Bad and dropped is useful in cases such as:

Dropped: The data is completely wrong and I can't do anything with this

Bad/Invalid: The data is invalid, but the server did xyz to it, and it was accepted after - This gives the user a chance to see they are sending wrong data and how they should fix it.

Looks like the definitions of Dropped vs Bad may need some additional discussion to agree on the semantics. We can probably start with just Accepted and difference of what was sent and what was accepted can be derived by the sender. Dropped, Bad, etc may be added later and it will be a backward compatible addition.

Another future addition can be an error message or a list of error messages or list of error structures, so that the response not only indicates that only part of the data was accepted, but also what exactly was wrong with the data that was not accepted.

As for the response codes: I believe we want to return HTTP 200 for partial success so that existing senders who are unaware of this new response are unaffected.

joaopgrassi · 2022-04-07T12:41:24Z

Looks like the definitions of Dropped vs Bad may need some additional discussion to agree on the semantics. We can probably start with just Accepted and difference of what was sent and what was accepted can be derived by the sender. Dropped, Bad, etc may be added later and it will be a backward compatible addition.

I guess that's fair, but I would like to challenge it a bit, as I'm not sure it really improves the status quo.

In the OP you mentioned this, (emphasizes mine)

We need a way for OTLP to signal in its response that the request was partially accepted. This will allow the sender to surface the problem to the user (e.g. it can be a log message in the Collector's OTLP exporter).

Then, if we implement only returning the Accepted, a message in the Collector's OTLP exporter would look like:

Sent: 50, Accepted: 40, Rejected: 10.

It only tells me how many of "my data" got rejected, but not why they got rejected. This is the part where I'm not sure. Further more, in your reply:

Another future addition can be an error message or a list of error messages or list of error structures, so that the response not only indicates that only part of the data was accepted, but also what exactly was wrong with the data that was not accepted.

I think we need this from the start. Without it, we only have some indication of it. OK it is better than today, but it doesn't really help the user in finding out what they are doing wrong and why their data is getting rejected. So, does it even help with a first version without this? 🤔

tigrannajaryan · 2022-04-07T13:05:28Z

I don't mind having more detailed response, it just needs more thought about what exactly we want to record.

So, for example we could do this much more detailed version (probably excessive):

message ExportLogsServiceResponse {
  string error_message = 1; // non-empty human readable message if a partial failure, to help troubleshoot.
  int64 accepted_log_records = 2;  
  repeated RejectedLogRecord rejected_log_records  = 2; // optional
  // do the same for Resources and Scopes?
}
message RejectedLogRecord {
  // do we need a reference to the log record that caused the problem? if yes, need a triplet of indexes.

  string error_message = 1; // human-readable message about the particular log record.
  
  // machine readable fields pointing to the particular problem in the log record
  repeated FieldError field_errors = 2;
}
message FieldError {
  enum FieldRef {
    Timestamp, ...
  }
  enum ErrorType {
    BadValue, ...  
  }
  string error_message = 1;
  FieldRef field_ref = 2;
  ErrorType error_type = 3;
}

It is hard to tell how far we want to go. My gut feeling is that the best return on investment will come from a simple "accepted" counter and a human readable error message that can be logged and will help with troubleshooting.

joaopgrassi · 2022-04-07T13:15:40Z

It is hard to tell how far we want to go. My gut feeling is that the best return on investment will come from a simple "accepted" counter and a human readable error message that can be logged and will help with troubleshooting.

Yes, 100% agreed. The combination of 200 + number of accepted + a message in the body would definitely be a much better start. Then servers could do their best to indicate what went wrong. I like how it is described in the google.rpc.Status

// A developer-facing human-readable error message in English. It should
// both explain the error and offer an actionable resolution to it.
string message = 2;

SergeyKanzhelev · 2022-04-07T20:36:23Z

do we expect any implementation to retry the whole batch when receiving the partial acceptance status? If so, do we need some indication on whether the error is retry-able or not for the items that wasn't accepted?

tigrannajaryan · 2022-04-07T20:56:36Z

do we expect any implementation to retry the whole batch when receiving the partial acceptance status? If so, do we need some indication on whether the error is retry-able or not for the items that wasn't accepted?

It would be useful to have such information if we included the details of how each individual record was processed. Otherwise, with just the accepted/error_message fields I think it is not so useful, since the server can just reject the entire request and for the entire request we already have a way to tell the client whether they can or cannot retry.

joaopgrassi · 2022-04-11T08:17:49Z

I'm not sure I would expect any exporter implementation to retry a partial acceptance status, since that would probably complicate both the client and the server implementation, no?

For example, for metrics, I assume every back-end has their own internal format and technicalities on how they handle data. How would an exporter retry, when a back-end doesn't accept the way metric keys are spelled (e.g. special characters in keys)? Or when their quota exceeded, or when they don't have privileges to ingest certain log types. There's just too many variables involved.

I see the partial acceptance as a more valuable way for back-ends to try a "best effort" to retain data, while having a good way of signaling users that they should pay attention to what they are sending, maybe even hinting them with doc links on how to fix.

tigrannajaryan · 2022-04-11T16:50:44Z

I see the partial acceptance as a more valuable way for back-ends to try a "best effort" to retain data, while having a good way of signaling users that they should pay attention to what they are sending, maybe even hinting them with doc links on how to fix.

I agree. Maybe in the future we can have more granular and detailed information in the response that would allow the exporter to have more complicated retrying logic, but for now I think we should keep it simple.

joaopgrassi · 2022-04-11T18:17:56Z

I would be happy to help with this @tigrannajaryan. What could be a start? I guess defining new protos and then updating the OTLP spec?

tigrannajaryan · 2022-04-11T18:45:08Z

I would be happy to help with this @tigrannajaryan. What could be a start? I guess defining new protos and then updating the OTLP spec?

Great! Yes, start with submitting a proposal in the proto repo. Make sure that interoperability between senders/receivers implementing different OTLP versions (pre and post this change) is maintained. Once we have a consensus there we can move forward.

joaopgrassi · 2022-05-09T09:52:38Z

Hi @tigrannajaryan, I finally opened a PR for this. open-telemetry/opentelemetry-proto#390

tigrannajaryan added the spec:protocol Related to the specification/protocol directory label Mar 29, 2022

tigrannajaryan self-assigned this Mar 29, 2022

github-actions bot assigned SergeyKanzhelev Mar 29, 2022

tigrannajaryan unassigned SergeyKanzhelev Mar 29, 2022

arminru added the enhancement New feature or request label Mar 31, 2022

tigrannajaryan assigned joaopgrassi and unassigned tigrannajaryan Apr 20, 2022

joaopgrassi mentioned this issue Apr 27, 2022

Make it possible to indicate partial success in an OTLP export response dynatrace-oss-contrib/opentelemetry-proto#2

Merged

joaopgrassi mentioned this issue May 9, 2022

Make it possible to indicate partial success in an OTLP export response open-telemetry/opentelemetry-proto#390

Closed

joaopgrassi mentioned this issue Jun 29, 2022

Add support for partial success in an OTLP export response #2636

Closed

This was referenced Jul 22, 2022

Make it possible to indicate partial success in an OTLP export response [2] open-telemetry/opentelemetry-proto#414

Merged

Add support for partial success in an OTLP export response [2] #2696

Merged

jmacd closed this as completed in #2696 Aug 2, 2022

joshdover mentioned this issue May 10, 2023

Add ability for OTLP to indicate which records were rejected for retryable reasons open-telemetry/opentelemetry-proto#470

Open

tigrannajaryan mentioned this issue Mar 28, 2024

fix(otlp-spec): Clarified "reject" meaning of the HTTP PartialSuccess explanation. open-telemetry/opentelemetry-proto#539

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ability for OTLP response to signal partial acceptance of data #2454

Add ability for OTLP response to signal partial acceptance of data #2454

tigrannajaryan commented Mar 29, 2022

bogdandrutu commented Mar 30, 2022

tigrannajaryan commented Mar 30, 2022 •

edited

Loading

joaopgrassi commented Mar 31, 2022 •

edited

Loading

tigrannajaryan commented Mar 31, 2022

joaopgrassi commented Apr 7, 2022

tigrannajaryan commented Apr 7, 2022

joaopgrassi commented Apr 7, 2022 •

edited

Loading

SergeyKanzhelev commented Apr 7, 2022

tigrannajaryan commented Apr 7, 2022

joaopgrassi commented Apr 11, 2022

tigrannajaryan commented Apr 11, 2022

joaopgrassi commented Apr 11, 2022 •

edited

Loading

tigrannajaryan commented Apr 11, 2022

joaopgrassi commented May 9, 2022

Add ability for OTLP response to signal partial acceptance of data #2454

Add ability for OTLP response to signal partial acceptance of data #2454

Comments

tigrannajaryan commented Mar 29, 2022

bogdandrutu commented Mar 30, 2022

tigrannajaryan commented Mar 30, 2022 • edited Loading

joaopgrassi commented Mar 31, 2022 • edited Loading

tigrannajaryan commented Mar 31, 2022

joaopgrassi commented Apr 7, 2022

tigrannajaryan commented Apr 7, 2022

joaopgrassi commented Apr 7, 2022 • edited Loading

SergeyKanzhelev commented Apr 7, 2022

tigrannajaryan commented Apr 7, 2022

joaopgrassi commented Apr 11, 2022

tigrannajaryan commented Apr 11, 2022

joaopgrassi commented Apr 11, 2022 • edited Loading

tigrannajaryan commented Apr 11, 2022

joaopgrassi commented May 9, 2022

tigrannajaryan commented Mar 30, 2022 •

edited

Loading

joaopgrassi commented Mar 31, 2022 •

edited

Loading

joaopgrassi commented Apr 7, 2022 •

edited

Loading

joaopgrassi commented Apr 11, 2022 •

edited

Loading