fix(otlp-spec): Clarified "reject" meaning of the HTTP PartialSuccess explanation. #539

bwplotka · 2024-03-27T19:36:47Z

👋🏽

I was studying the OTLP spec's partial success mechanism proposed 2y ago for the Prometheus Remote Write 2.0 spec design purposes. I noticed one detail that might need clarification. The partial success mechanism has this specification regarding the "final" status code:

If the request is only partially accepted
(i.e. when the server accepts only parts of the data and rejects the rest), the
server MUST respond with HTTP 200 OK.

Does it mean that if e.g. we write 10 samples and server writes successfully 9, but can't write 1 sample due to some tmp error (503), it should still provide 200 status code? Or that 503 code? It feels the latter make more sense to me, but this sentence is a bit fuzzy.

One way of interpreting the "reject" word in that sentence, is that server rejecting samples means not-retryable errors only (e.g. 400). If that's true then some clarification would be nice. I went ahead and proposed changes to make it explicit. Is my assumption here correct? Is this helpful?

…anation

tigrannajaryan

This is contrary to what the spec says today.

Yes, it says HTTP 200 must be returned even if part of the data cannot be accepted. For metrics the rejected_data_points field contain the number of data points that the server did not accept.

This is part of the stable spec and cannot be changed.

If you need additional information about what you would like to be in the partial success response to help understand the failure that should be discussed separately and may be added as non-breaking change.

bwplotka · 2024-03-28T07:06:19Z

Thanks for a quick response!

Yes, it says HTTP 200 must be returned even if part of the data cannot be accepted. For metrics the rejected_data_points field contain the number of data points that the server did not accept.

Just to clarify: "cannot be accepted" also INCLUDES retryable failures? The same ambiguity exists in the current wording ("server rejects"), which I am trying to interpret here.

To be very clear again--do you mean that current OTLP spec says:

"HTTP 200 must be returned even if part of the data was written, but for the other part server asked to retry (e.g. due to tmp error)?"

To be very specific - this occurs often when OTLP (or any other stream of samples) is going through proxies that reshards written series and parts of it was written, parts of it requires a retry.

tigrannajaryan · 2024-03-28T14:22:44Z

Just to clarify: "cannot be accepted" also INCLUDES retryable failures? The same ambiguity exists in the current wording ("server rejects"), which I am trying to interpret here.

Yes, that is correct.

When it says "reject" it probably should say "drop".

Essentially the server can do one of 4 things:

Full Success. Accept everything successfully and return 200.
Failure, Retryable. Reject everything and return one of the retryable HTTP response codes.
Failure, Non-retryable. Reject everything and return 400.
Accept parts of data, drop some other parts, return 200 and indicate that parts of data were dropped by setting non-zero rejected_data_points value.

There is no provision for accepting parts of data and somehow indicate that the remaining data must be retried. To make this possible the response would need to describe which part of the data exactly should be retried and there is simply no way to do it currently.

jsuereth · 2024-03-28T14:53:33Z

@tigrannajaryan #4 is NOT in-line with what my understanding was for this feature when it was added and I think is actually pretty detrimental if retryable failure can return 200.

Specifically, because the existing rejection doesn't tell you which points are retry-able, and in practice could lead to very poor network usage and overloading. I think this is a pretty glaring hole in OTLP if we allowed that, and we should work on fixing this ASAP if true.

tigrannajaryan · 2024-03-28T16:26:51Z

@tigrannajaryan #4 is NOT in-line with what my understanding was for this feature when it was added and I think is actually pretty detrimental if retryable failure can return 200.

@jsuereth are you reading the spec differently? I don't know how else to interpret the spec.

Specifically, because the existing rejection doesn't tell you which points are retry-able, and in practice could lead to very poor network usage and overloading. I think this is a pretty glaring hole in OTLP if we allowed that, and we should work on fixing this ASAP if true.

Yes, what you say is correct. It is a hole, but that's the way it is. The plan was that we can fix it by adding the required information to Export*PartialSuccess messages when the need arises. See for example open-telemetry/opentelemetry-specification#2454 (comment)

tigrannajaryan · 2024-04-11T17:41:49Z

@jsuereth let me know if you read the spec differently. We may need to change the wording to make it clearer.

bwplotka · 2024-04-22T11:22:09Z

Thanks!

I agree supporting partial-retry-ability (as you proposed here) with detailed response semantics allowing to retry certain series is out of question for now, it's complex and might be not needed. However this is NOT what I am asking here.

My point is that the current wording does NOT support case of:

Server being explicitly idempotent
Server written part of samples, but it wants FULL request to be retried.

Spec says, it's not possible, always return 200 because you ingested something already.

If that's the intention and is acceptable, then nothing to change (other than maybe clarification).

tigrannajaryan · 2024-05-01T15:18:53Z

@bwplotka the case you are looking for (idempotent server) is possible to fake today if you control the server implementation. Make the server return one of the retryable error codes and the client will re-send the entire batch. It won't look pretty since it will likely be logged as an error by the client but the net result will be what you expect: the client will retry the full request.

I don't think this is great, but it should work.

tigrannajaryan · 2024-05-08T14:14:20Z

Closing this PR for now, I think we have clarity that the PR's changes are considered breaking and we are not allowed that.

However I am open to extending the partial success to cover more use cases if it is done while preserving interoperability with all protocol versions (I have not put any thought into how exactly it can be done).

fix(otlp-spec): Clarified "reject" meaning of the PartialSuccess expl…

ebcc6df

…anation

bwplotka requested a review from a team March 27, 2024 19:36

github-actions bot assigned bogdandrutu Mar 27, 2024

Update specification.md

1bae6be

bwplotka changed the title ~~fix(otlp-spec): Clarified "reject" meaning of the PartialSuccess explanation.~~ fix(otlp-spec): Clarified "reject" meaning of the HTTP PartialSuccess explanation. Mar 27, 2024

tigrannajaryan requested changes Mar 28, 2024

View reviewed changes

tigrannajaryan closed this May 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(otlp-spec): Clarified "reject" meaning of the HTTP PartialSuccess explanation. #539

fix(otlp-spec): Clarified "reject" meaning of the HTTP PartialSuccess explanation. #539

bwplotka commented Mar 27, 2024

tigrannajaryan left a comment

bwplotka commented Mar 28, 2024

tigrannajaryan commented Mar 28, 2024

jsuereth commented Mar 28, 2024 •

edited

Loading

tigrannajaryan commented Mar 28, 2024

tigrannajaryan commented Apr 11, 2024

bwplotka commented Apr 22, 2024 •

edited

Loading

tigrannajaryan commented May 1, 2024

tigrannajaryan commented May 8, 2024

fix(otlp-spec): Clarified "reject" meaning of the HTTP PartialSuccess explanation. #539

fix(otlp-spec): Clarified "reject" meaning of the HTTP PartialSuccess explanation. #539

Conversation

bwplotka commented Mar 27, 2024

tigrannajaryan left a comment

Choose a reason for hiding this comment

bwplotka commented Mar 28, 2024

tigrannajaryan commented Mar 28, 2024

jsuereth commented Mar 28, 2024 • edited Loading

tigrannajaryan commented Mar 28, 2024

tigrannajaryan commented Apr 11, 2024

bwplotka commented Apr 22, 2024 • edited Loading

tigrannajaryan commented May 1, 2024

tigrannajaryan commented May 8, 2024

jsuereth commented Mar 28, 2024 •

edited

Loading

bwplotka commented Apr 22, 2024 •

edited

Loading