-
Notifications
You must be signed in to change notification settings - Fork 2.1k
[chore] RFC: Count PartialSuccess in exporterhelper
#15280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,48 @@ | ||
| # Count `PartialSuccess` in `exporterhelper` | ||
|
|
||
| Author: @braydonk | ||
|
|
||
| ## Intro | ||
|
|
||
| The current `exporterhelper` observability behaviour is to [opaquely recognize any error to mean that the entire batch has failed](https://github.com/open-telemetry/opentelemetry-collector/blob/91b32efbbda0180db8b063262589abaf6a0c9df9/exporter/exporterhelper/internal/obs_report_sender.go#L163), counting the entire size of the current request as being failed. This is not always the case; some backends will implement [the PartialSuccess portion of the OTLP specification](https://opentelemetry.io/docs/specs/otlp/#partial-success) or may [have a similar partial rejection protocol](https://docs.cloud.google.com/logging/docs/reference/v2/rest/v2/entries/write#:~:text=Optional.%20Whether%20a%20batch%27s%20valid%20entries%20should%20be%20written%20even%20if%20some%20other%20entry%20failed%20due%20to%20a%20permanent%20error%20such%20as%20INVALID_ARGUMENT%20or%20PERMISSION_DENIED.%20If%20any%20entry%20failed%2C%20then%20the%20response%20status%20is%20the%20response%20status%20of%20one%20of%20the%20failed%20entries.) that makes it so only some items within a request failed while the rest succeeded. Under the current `exporterhelper` functionality, it is impossible to properly count this scenario. | ||
|
|
||
| ## Proposal | ||
|
|
||
| Two new error APIs will be introduced: | ||
|
|
||
| ### `Countable` error | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we need to have this public surface?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
what is the unit for "failures"? May not be bad to have this as "PartialError" and map 1:1 the PartialSuccess. 1$ question: In case of a connector (like traces to metrics) if we fail and get back that we failed to export 2 metrics. For components prior to the trace to metrics connector that number "2" does not mean "2 spans failed". How do we treat this case?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Do you mean does the interface need to be public? Not necessarily, I could just make a public function
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
In the error itself I intentionally exclude unit. The unit would be decided by whoever is consuming the error to record failures. In my unfinished
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I hadn't considered connectors, since I was only thinking of exporters for this RFC. However I think most of the design will still fit in, but the big open question of how to record this for more complex scenarios like (I'm not sure I understand how error metrics are supposed to be recorded at all for
Are you talking about the scenario where something failed downstream from the metrics post-conversion? Assuming the simple scenario of 1-to-1 traces to metrics and no fanout, I see two scenarios:
I wasn't part of any connector design discussions that happened or may be happening, so the above is based on my own conclusions by reading and not by any definitive explanation given to me. |
||
|
|
||
| This is an error that can any other error can be wrapped in along with a count of failures. This will allow for cases where producers of an error want to include an amount of failures. | ||
|
|
||
| This API may end up being useful in other places than `PartialSuccess`, such as [in `receiverhelper`](https://github.com/open-telemetry/opentelemetry-collector/issues/14440). | ||
|
|
||
| ### `PartialSuccess` error | ||
|
bogdandrutu marked this conversation as resolved.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this error meant to specifically be the OTLP "PartialSuccess" error? Or is this meant to be a collector type that means "The operation succeeded overall, but some parts failed"? If it is OTLP-specific, it shouldn't be in exporterhelper, and shouldn't be used by other components. If it is a generic "this thing partially succeeded" error, then it shouldn't do things like set gRPC status codes, etc. |
||
|
|
||
| A `PartialSuccess` error will wrap an internal error as permanent, implement [GRPC status code resolution](https://pkg.go.dev/google.golang.org/grpc/status#FromError) to code `OK`, and wrap it all as a `Countable` with a count of failures. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why does it need to implement gRPC status code resolution? It looks like FromError looks for wrapped errors as well (using errors.As: https://github.com/grpc/grpc-go/blob/cb18228317ff523e63d931b4058b0329585b7dcd/status/status.go#L113). So if, for example, the OTLP exporter wants to provide the gRPC status code OK in its PartialSuccess error, it should just be able to use standard Go error wrapping to do so. |
||
|
|
||
| #### Why should the internal error be permanent? | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @open-telemetry/technical-committee is this a correct read of the protocol definition? I personally think it is.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. However, I think the protocol specification can be improved/fixed to allow the service to provide more information regarding the partial success, which would allow the client to retry the items that failed to be delivered if these are retriable.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I think this is the correct read. The OTLP spec says:
The partial success response signals that some items are successfully received and some others are bad and cannot be received, so no retries are needed for either. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
@tigrannajaryan What could be a reason for being bad? Why couldn't they be received? (We scenarios, where it's not acceptable to have kind of unknown failures)
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Any violation of requirements defined in https://github.com/open-telemetry/opentelemetry-proto/tree/main/opentelemetry/proto could be the reason the data cannot be received. |
||
|
|
||
| A partial success SHOULD NOT be retried. When a server responds according to or similar to the OTLP `PartialSuccess` spec, it is not possible to determine which items exactly failed, and retrying to whole request would end up retrying items that already succeeded. As a result, the rest of the upstream pipeline's behaviour (crucially, the `retry_sender`) does not need to change how it reacts to a `PartialSuccess` compared to any other `PermanentError`. | ||
|
|
||
| #### Why does it implement status code `OK`? | ||
|
|
||
| A `PartialSuccess` is nominally a success according to the spec. As a result, when `exporterhelper` processes this it needs to know that the Go error it received doesn't actually correspond to a real failure to send data.This will allow accurate setting of the span status (it shouldn't be `Failed` upon `PartialSuccess`) and determining of `error.type` (a new custom type called `Partial_Success` would be introduced). | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we need to clearly define if we want this error to be propagated by the exporter. If we agree that exporter clearly reports the failed portion as
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the error needs to be propogated by the exporter for sure, I can't think of any nice way around that. This means that the GRPC status code resolution is then necessary for The rest of the error API was designed such that the error could be worked with as any other permanent error when any other part of exporterhelper other than |
||
|
|
||
| ## Implementation Plan | ||
|
|
||
| This will be done as two PRs: | ||
|
|
||
| 1. [Implementing the new error APIs](https://github.com/open-telemetry/opentelemetry-collector/compare/main...braydonk:opentelemetry-collector:countable_consumer_error) | ||
| 2. [Implementing the counting in `exporterhelper`'s `obs_report_sender`](https://github.com/braydonk/opentelemetry-collector/compare/countable_consumer_error...braydonk:opentelemetry-collector:exporterhelper_partial_success) | ||
| 3. Recognize and return `PartialSuccess` errors from `otlp` and `otlp_http` exporters (no draft yet) | ||
|
|
||
| ## Out of Scope | ||
|
|
||
| The following is out of scope for this RFC: | ||
|
|
||
| * Making the upstream components respond differently to `PartialSuccess` | ||
| - I would wager that it's unlikely it ever needs to, but regardless this RFC is scoped completely to self observability entirely within the exporter in this scenario, whereas the rest of the upstream pipeline will only ever respond to this the same as any other permanent error | ||
| * Fully counting data drops throughout the pipeline | ||
| - The new `Countable` interface does open the possibility of counting partial failures at various points throughout the pipeline, which might be worth exploring down the line. The only thing I want to address right now is ensuring these are properly counted at the exporter metric level. | ||
| * Partial retries | ||
| - This is purely covering scenarios that align with the OTLP spec's definition of `PartialSuccess`, which does not come with any functionality to determine exactly which items failed | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given that
Otherwise, does it mean we will have behaviour where |
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand why we need this one... What use cases we want to cover where this one should be used instead of
PartialSuccess?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it's possible to reduce this just to
PartialSuccessError, but since this is also going to be used by theprometheusreceiver (see #14440) I thought thePartialSuccesserror, which is so tied to exporter use cases, wouldn't make sense. Wanted to make it flexible so they could report any error they might want and simply wrap it in a count.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a reason why the Prometheus receiver couldn't use a partialsuccess error, as long as we can provide the failed telemetry count somehow.