Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(otlp-spec): Clarified "reject" meaning of the HTTP PartialSuccess explanation. #539

Closed
wants to merge 2 commits into from

Conversation

bwplotka
Copy link

👋🏽

I was studying the OTLP spec's partial success mechanism proposed 2y ago for the Prometheus Remote Write 2.0 spec design purposes. I noticed one detail that might need clarification. The partial success mechanism has this specification regarding the "final" status code:

If the request is only partially accepted
(i.e. when the server accepts only parts of the data and rejects the rest), the
server MUST respond with HTTP 200 OK.

Does it mean that if e.g. we write 10 samples and server writes successfully 9, but can't write 1 sample due to some tmp error (503), it should still provide 200 status code? Or that 503 code? It feels the latter make more sense to me, but this sentence is a bit fuzzy.

One way of interpreting the "reject" word in that sentence, is that server rejecting samples means not-retryable errors only (e.g. 400). If that's true then some clarification would be nice. I went ahead and proposed changes to make it explicit. Is my assumption here correct? Is this helpful?

@bwplotka bwplotka requested a review from a team March 27, 2024 19:36
@bwplotka bwplotka changed the title fix(otlp-spec): Clarified "reject" meaning of the PartialSuccess explanation. fix(otlp-spec): Clarified "reject" meaning of the HTTP PartialSuccess explanation. Mar 27, 2024
Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is contrary to what the spec says today.

Yes, it says HTTP 200 must be returned even if part of the data cannot be accepted. For metrics the rejected_data_points field contain the number of data points that the server did not accept.

This is part of the stable spec and cannot be changed.

If you need additional information about what you would like to be in the partial success response to help understand the failure that should be discussed separately and may be added as non-breaking change.

@bwplotka
Copy link
Author

Thanks for a quick response!

Yes, it says HTTP 200 must be returned even if part of the data cannot be accepted. For metrics the rejected_data_points field contain the number of data points that the server did not accept.

Just to clarify: "cannot be accepted" also INCLUDES retryable failures? The same ambiguity exists in the current wording ("server rejects"), which I am trying to interpret here.

To be very clear again--do you mean that current OTLP spec says:

"HTTP 200 must be returned even if part of the data was written, but for the other part server asked to retry (e.g. due to tmp error)?"

To be very specific - this occurs often when OTLP (or any other stream of samples) is going through proxies that reshards written series and parts of it was written, parts of it requires a retry.

@tigrannajaryan
Copy link
Member

Just to clarify: "cannot be accepted" also INCLUDES retryable failures? The same ambiguity exists in the current wording ("server rejects"), which I am trying to interpret here.

Yes, that is correct.

When it says "reject" it probably should say "drop".

Essentially the server can do one of 4 things:

  1. Full Success. Accept everything successfully and return 200.
  2. Failure, Retryable. Reject everything and return one of the retryable HTTP response codes.
  3. Failure, Non-retryable. Reject everything and return 400.
  4. Accept parts of data, drop some other parts, return 200 and indicate that parts of data were dropped by setting non-zero rejected_data_points value.

There is no provision for accepting parts of data and somehow indicate that the remaining data must be retried. To make this possible the response would need to describe which part of the data exactly should be retried and there is simply no way to do it currently.

@jsuereth
Copy link
Contributor

jsuereth commented Mar 28, 2024

@tigrannajaryan #4 is NOT in-line with what my understanding was for this feature when it was added and I think is actually pretty detrimental if retryable failure can return 200.

Specifically, because the existing rejection doesn't tell you which points are retry-able, and in practice could lead to very poor network usage and overloading. I think this is a pretty glaring hole in OTLP if we allowed that, and we should work on fixing this ASAP if true.

@tigrannajaryan
Copy link
Member

@tigrannajaryan #4 is NOT in-line with what my understanding was for this feature when it was added and I think is actually pretty detrimental if retryable failure can return 200.

@jsuereth are you reading the spec differently? I don't know how else to interpret the spec.

Specifically, because the existing rejection doesn't tell you which points are retry-able, and in practice could lead to very poor network usage and overloading. I think this is a pretty glaring hole in OTLP if we allowed that, and we should work on fixing this ASAP if true.

Yes, what you say is correct. It is a hole, but that's the way it is. The plan was that we can fix it by adding the required information to Export*PartialSuccess messages when the need arises. See for example open-telemetry/opentelemetry-specification#2454 (comment)

@tigrannajaryan
Copy link
Member

@jsuereth let me know if you read the spec differently. We may need to change the wording to make it clearer.

@bwplotka
Copy link
Author

bwplotka commented Apr 22, 2024

Thanks!

I agree supporting partial-retry-ability (as you proposed here) with detailed response semantics allowing to retry certain series is out of question for now, it's complex and might be not needed. However this is NOT what I am asking here.

My point is that the current wording does NOT support case of:

  • Server being explicitly idempotent
  • Server written part of samples, but it wants FULL request to be retried.

Spec says, it's not possible, always return 200 because you ingested something already.

If that's the intention and is acceptable, then nothing to change (other than maybe clarification).

@tigrannajaryan
Copy link
Member

@bwplotka the case you are looking for (idempotent server) is possible to fake today if you control the server implementation. Make the server return one of the retryable error codes and the client will re-send the entire batch. It won't look pretty since it will likely be logged as an error by the client but the net result will be what you expect: the client will retry the full request.

I don't think this is great, but it should work.

@tigrannajaryan
Copy link
Member

Closing this PR for now, I think we have clarity that the PR's changes are considered breaking and we are not allowed that.

However I am open to extending the partial success to cover more use cases if it is done while preserving interoperability with all protocol versions (I have not put any thought into how exactly it can be done).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants