Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 48 additions & 0 deletions docs/rfcs/partial-success-counting.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
# Count `PartialSuccess` in `exporterhelper`

Author: @braydonk

## Intro

The current `exporterhelper` observability behaviour is to [opaquely recognize any error to mean that the entire batch has failed](https://github.com/open-telemetry/opentelemetry-collector/blob/91b32efbbda0180db8b063262589abaf6a0c9df9/exporter/exporterhelper/internal/obs_report_sender.go#L163), counting the entire size of the current request as being failed. This is not always the case; some backends will implement [the PartialSuccess portion of the OTLP specification](https://opentelemetry.io/docs/specs/otlp/#partial-success) or may [have a similar partial rejection protocol](https://docs.cloud.google.com/logging/docs/reference/v2/rest/v2/entries/write#:~:text=Optional.%20Whether%20a%20batch%27s%20valid%20entries%20should%20be%20written%20even%20if%20some%20other%20entry%20failed%20due%20to%20a%20permanent%20error%20such%20as%20INVALID_ARGUMENT%20or%20PERMISSION_DENIED.%20If%20any%20entry%20failed%2C%20then%20the%20response%20status%20is%20the%20response%20status%20of%20one%20of%20the%20failed%20entries.) that makes it so only some items within a request failed while the rest succeeded. Under the current `exporterhelper` functionality, it is impossible to properly count this scenario.

## Proposal

Two new error APIs will be introduced:

### `Countable` error
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why we need this one... What use cases we want to cover where this one should be used instead of PartialSuccess?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it's possible to reduce this just to PartialSuccessError, but since this is also going to be used by the prometheus receiver (see #14440) I thought the PartialSuccess error, which is so tied to exporter use cases, wouldn't make sense. Wanted to make it flexible so they could report any error they might want and simply wrap it in a count.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a reason why the Prometheus receiver couldn't use a partialsuccess error, as long as we can provide the failed telemetry count somehow.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to have this public surface?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an error that can any other error can be wrapped in along with a count of failures. This will allow for cases where producers of an error want to include an amount of failures.

what is the unit for "failures"? May not be bad to have this as "PartialError" and map 1:1 the PartialSuccess.

1$ question: In case of a connector (like traces to metrics) if we fail and get back that we failed to export 2 metrics. For components prior to the trace to metrics connector that number "2" does not mean "2 spans failed". How do we treat this case?

Copy link
Copy Markdown
Contributor Author

@braydonk braydonk May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to have this public surface?

Do you mean does the interface need to be public? Not necessarily, I could just make a public function func GetErrorCount(err error) (errCount int, isCountable bool). I thought that was more awkward than the current API of getting the error as xconsumererror.Countable and then calling the Count method on it. But the former makes for less public surface.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the unit for "failures"?

In the error itself I intentionally exclude unit. The unit would be decided by whoever is consuming the error to record failures. In my unfinished otlp exporter draft this ends up working out based on the fact that the exporter itself has a signal type, so when obsReportSender is recording failures for a logging otlp exporter, that failed_sent metric already has a unit of {log}.

Copy link
Copy Markdown
Contributor Author

@braydonk braydonk May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of a connector (like traces to metrics)

I hadn't considered connectors, since I was only thinking of exporters for this RFC. However I think most of the design will still fit in, but the big open question of how to record this for more complex scenarios like fanoutconsumer was blocking the ability to at least solve this for exporterhelper and receiverhelper, where the case is more clear-cut.

(I'm not sure I understand how error metrics are supposed to be recorded at all for fanoutconsumer scenarios, let alone being able to report partial counts, but I think if the error recording behaviour is clear for fanoutconsumer the partial part is relatively simple to slot in).

For components prior to the trace to metrics connector that number "2" does not mean "2 spans failed".

Are you talking about the scenario where something failed downstream from the metrics post-conversion? Assuming the simple scenario of 1-to-1 traces to metrics and no fanout, I see two scenarios:

  1. The trace/span failed to be converted to a metric or was refused by the conversion process in some way. This seems pretty clear that "1" failed item is "1 failed span in conversion" and would be 1 refused span that obsconsumer could potentially record by recognizing xconsumererror.Countable here.
  2. The metric failed post-conversion by something down its pipeline. I think it's essentially impossible to determine whether a failed metric in that pipeline somehow lines up with a particular span that was converted to a metric. I think it's fair to say that the tracestometrics connector did its job, and considers all spans successful, and if the metric fails after successful conversion then that's for components further down the line to properly report.

I wasn't part of any connector design discussions that happened or may be happening, so the above is based on my own conclusions by reading and not by any definitive explanation given to me.


This is an error that can any other error can be wrapped in along with a count of failures. This will allow for cases where producers of an error want to include an amount of failures.

This API may end up being useful in other places than `PartialSuccess`, such as [in `receiverhelper`](https://github.com/open-telemetry/opentelemetry-collector/issues/14440).

### `PartialSuccess` error
Comment thread
bogdandrutu marked this conversation as resolved.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this error meant to specifically be the OTLP "PartialSuccess" error? Or is this meant to be a collector type that means "The operation succeeded overall, but some parts failed"? If it is OTLP-specific, it shouldn't be in exporterhelper, and shouldn't be used by other components. If it is a generic "this thing partially succeeded" error, then it shouldn't do things like set gRPC status codes, etc.


A `PartialSuccess` error will wrap an internal error as permanent, implement [GRPC status code resolution](https://pkg.go.dev/google.golang.org/grpc/status#FromError) to code `OK`, and wrap it all as a `Countable` with a count of failures.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it need to implement gRPC status code resolution? It looks like FromError looks for wrapped errors as well (using errors.As: https://github.com/grpc/grpc-go/blob/cb18228317ff523e63d931b4058b0329585b7dcd/status/status.go#L113). So if, for example, the OTLP exporter wants to provide the gRPC status code OK in its PartialSuccess error, it should just be able to use standard Go error wrapping to do so.


#### Why should the internal error be permanent?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@open-telemetry/technical-committee is this a correct read of the protocol definition? I personally think it is.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

However, I think the protocol specification can be improved/fixed to allow the service to provide more information regarding the partial success, which would allow the client to retry the items that failed to be delivered if these are retriable.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is the correct read. The OTLP spec says:

The client MUST NOT retry the request when it receives a partial success response where the partial_success is populated.

The partial success response signals that some items are successfully received and some others are bad and cannot be received, so no retries are needed for either.

Copy link
Copy Markdown

@hilmarf hilmarf May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and some others are bad and cannot be received, so no retries are needed for either.

@tigrannajaryan What could be a reason for being bad? Why couldn't they be received? (We scenarios, where it's not acceptable to have kind of unknown failures)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What could be a reason for being bad?

Any violation of requirements defined in https://github.com/open-telemetry/opentelemetry-proto/tree/main/opentelemetry/proto could be the reason the data cannot be received.


A partial success SHOULD NOT be retried. When a server responds according to or similar to the OTLP `PartialSuccess` spec, it is not possible to determine which items exactly failed, and retrying to whole request would end up retrying items that already succeeded. As a result, the rest of the upstream pipeline's behaviour (crucially, the `retry_sender`) does not need to change how it reacts to a `PartialSuccess` compared to any other `PermanentError`.

#### Why does it implement status code `OK`?

A `PartialSuccess` is nominally a success according to the spec. As a result, when `exporterhelper` processes this it needs to know that the Go error it received doesn't actually correspond to a real failure to send data.This will allow accurate setting of the span status (it shouldn't be `Failed` upon `PartialSuccess`) and determining of `error.type` (a new custom type called `Partial_Success` would be introduced).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to clearly define if we want this error to be propagated by the exporter. If we agree that exporter clearly reports the failed portion as dropped in the internal telemetry and returns nil, we don't need to include GRPC status code resolution. However, in order to decide on this, we need to clearly outline how the internal telemetry will be affected in the exporter and upstream components (if we propagate)

Copy link
Copy Markdown
Contributor Author

@braydonk braydonk May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the error needs to be propogated by the exporter for sure, I can't think of any nice way around that. This means that the GRPC status code resolution is then necessary for obs_report_sender.go to count the error with an appropriate error.type and recognize not to mark the span as failed.

The rest of the error API was designed such that the error could be worked with as any other permanent error when any other part of exporterhelper other than obs_report_sender.go receive it.


## Implementation Plan

This will be done as two PRs:

1. [Implementing the new error APIs](https://github.com/open-telemetry/opentelemetry-collector/compare/main...braydonk:opentelemetry-collector:countable_consumer_error)
2. [Implementing the counting in `exporterhelper`'s `obs_report_sender`](https://github.com/braydonk/opentelemetry-collector/compare/countable_consumer_error...braydonk:opentelemetry-collector:exporterhelper_partial_success)
3. Recognize and return `PartialSuccess` errors from `otlp` and `otlp_http` exporters (no draft yet)

## Out of Scope

The following is out of scope for this RFC:

* Making the upstream components respond differently to `PartialSuccess`
- I would wager that it's unlikely it ever needs to, but regardless this RFC is scoped completely to self observability entirely within the exporter in this scenario, whereas the rest of the upstream pipeline will only ever respond to this the same as any other permanent error
* Fully counting data drops throughout the pipeline
- The new `Countable` interface does open the possibility of counting partial failures at various points throughout the pipeline, which might be worth exploring down the line. The only thing I want to address right now is ensuring these are properly counted at the exporter metric level.
* Partial retries
- This is purely covering scenarios that align with the OTLP spec's definition of `PartialSuccess`, which does not come with any functionality to determine exactly which items failed
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that consumererror can currently be used to trigger partial retries in the retry_sender, won't it be necessary to continue to support both:

  1. partial successes that can be partially retried and counted partially once retries fully fail
  2. partial successes that can only be counted

Otherwise, does it mean we will have behaviour where consumererror.NewTraces(err, failedTraces) indicates partial failure but it isn't counted partially?

Loading