-
Notifications
You must be signed in to change notification settings - Fork 264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(otlp-spec): Clarified "reject" meaning of the HTTP PartialSuccess explanation. #539
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is contrary to what the spec says today.
Yes, it says HTTP 200 must be returned even if part of the data cannot be accepted. For metrics the rejected_data_points
field contain the number of data points that the server did not accept.
This is part of the stable spec and cannot be changed.
If you need additional information about what you would like to be in the partial success response to help understand the failure that should be discussed separately and may be added as non-breaking change.
Thanks for a quick response!
Just to clarify: "cannot be accepted" also INCLUDES retryable failures? The same ambiguity exists in the current wording ("server rejects"), which I am trying to interpret here. To be very clear again--do you mean that current OTLP spec says: "HTTP 200 must be returned even if part of the data was written, but for the other part server asked to retry (e.g. due to tmp error)?" To be very specific - this occurs often when OTLP (or any other stream of samples) is going through proxies that reshards written series and parts of it was written, parts of it requires a retry. |
Yes, that is correct. When it says "reject" it probably should say "drop". Essentially the server can do one of 4 things:
There is no provision for accepting parts of data and somehow indicate that the remaining data must be retried. To make this possible the response would need to describe which part of the data exactly should be retried and there is simply no way to do it currently. |
@tigrannajaryan #4 is NOT in-line with what my understanding was for this feature when it was added and I think is actually pretty detrimental if retryable failure can return 200. Specifically, because the existing rejection doesn't tell you which points are retry-able, and in practice could lead to very poor network usage and overloading. I think this is a pretty glaring hole in OTLP if we allowed that, and we should work on fixing this ASAP if true. |
@jsuereth are you reading the spec differently? I don't know how else to interpret the spec.
Yes, what you say is correct. It is a hole, but that's the way it is. The plan was that we can fix it by adding the required information to |
@jsuereth let me know if you read the spec differently. We may need to change the wording to make it clearer. |
Thanks! I agree supporting partial-retry-ability (as you proposed here) with detailed response semantics allowing to retry certain series is out of question for now, it's complex and might be not needed. However this is NOT what I am asking here. My point is that the current wording does NOT support case of:
Spec says, it's not possible, always return 200 because you ingested something already. If that's the intention and is acceptable, then nothing to change (other than maybe clarification). |
@bwplotka the case you are looking for (idempotent server) is possible to fake today if you control the server implementation. Make the server return one of the retryable error codes and the client will re-send the entire batch. It won't look pretty since it will likely be logged as an error by the client but the net result will be what you expect: the client will retry the full request. I don't think this is great, but it should work. |
Closing this PR for now, I think we have clarity that the PR's changes are considered breaking and we are not allowed that. However I am open to extending the partial success to cover more use cases if it is done while preserving interoperability with all protocol versions (I have not put any thought into how exactly it can be done). |
👋🏽
I was studying the OTLP spec's partial success mechanism proposed 2y ago for the Prometheus Remote Write 2.0 spec design purposes. I noticed one detail that might need clarification. The partial success mechanism has this specification regarding the "final" status code:
Does it mean that if e.g. we write 10 samples and server writes successfully 9, but can't write 1 sample due to some tmp error (503), it should still provide 200 status code? Or that 503 code? It feels the latter make more sense to me, but this sentence is a bit fuzzy.
One way of interpreting the "reject" word in that sentence, is that server rejecting samples means not-retryable errors only (e.g. 400). If that's true then some clarification would be nice. I went ahead and proposed changes to make it explicit. Is my assumption here correct? Is this helpful?