
Amend Pipeline Component Telemetry RFC to add a "rejected" outcome #11956

Merged: mx-psi merged 4 commits into open-telemetry:main from jade-guiton-dd:pipeline-instr-outcomes on Jan 28, 2025

Conversation

@jade-guiton-dd (Contributor) commented Dec 18, 2024

Context

The Pipeline Component Telemetry RFC was recently accepted (#11406). The document states the following regarding error monitoring:

> For both [consumed and produced] metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the corresponding function call returned an error. Specifically, consumed measurements will be recorded with `outcome` as `failure` when a call from the previous component to the `ConsumeX` function returns an error, and `success` otherwise. Likewise, produced measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and `success` otherwise.

Observability requirements for stable pipeline components were also recently merged (#11772). The document states the following regarding error monitoring:

> The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so this should either:
>
> - only include errors internal to the component, or;
> - allow distinguishing said errors from ones originating in an external service, or propagated from downstream Collector components.

Because errors are typically propagated across `ConsumeX` calls in a pipeline (except for components with an internal queue, like `processor/batch`), the error observability mechanism proposed by the RFC implies that Pipeline Telemetry will record failures for every component interface upstream of the component that actually emitted the error. This does not match the goals set out in the observability requirements, and makes it much harder to tell from the emitted telemetry which component the errors are coming from.

Description

This PR amends the Pipeline Component Telemetry RFC with the following:

  • restrict the `outcome=failure` value to cases where the error comes from the very next component (the component on which `ConsumeX` was called);
  • add a third possible value for the `outcome` attribute: `rejected`, for cases where an error observed at an interface comes from further downstream (the component did not "fail", but its output was "rejected");
  • propose a mechanism to determine which of the two values should be used.

The current proposal for the mechanism is for the pipeline instrumentation layer to wrap errors in an unexported `downstream` struct, which upstream layers can inspect with `errors.As` to determine whether the error has already been "attributed" to a component. This is the same mechanism currently used for tracking permanent vs. retryable errors. Please check the diff for details.
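
For illustration, here is a minimal Go sketch of that mechanism, under the assumption of an unexported wrapper type as described in the diff; the names `downstreamError`, `markDownstream`, and `outcomeFor` are hypothetical, not the actual Collector code:

```go
package pipeline

import "errors"

// downstreamError marks an error as already "attributed" to a downstream
// component, mirroring how permanent errors are wrapped today.
// (Hypothetical sketch; not the actual Collector implementation.)
type downstreamError struct {
	err error
}

func (e downstreamError) Error() string { return e.err.Error() }
func (e downstreamError) Unwrap() error { return e.err }

// markDownstream wraps an error before it is propagated further upstream,
// so that upstream instrumentation layers know it has been attributed.
func markDownstream(err error) error {
	if err == nil {
		return nil
	}
	return downstreamError{err: err}
}

// outcomeFor classifies the error returned by a ConsumeX call: a bare
// error means the next component itself failed, while a wrapped error
// was merely propagated from further downstream.
func outcomeFor(err error) string {
	if err == nil {
		return "success"
	}
	var de downstreamError
	if errors.As(err, &de) {
		return "rejected"
	}
	return "failure"
}
```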

Possible alternatives

There are a few alternatives to this amendment, which were discussed as part of the observability requirements PR:

  • loosen the observability requirements for stable components to not require distinguishing internal errors from downstream ones → makes it harder to identify the source of an error;
  • modify the way we use the `Consumer` API to no longer propagate errors upstream → prevents proper propagation of backpressure through the pipeline (although this is likely already a problem with the `batch` processor);
  • let component authors write their own custom telemetry to solve the problem → higher barrier to entry, especially for people wanting to open-source existing components.

@jade-guiton-dd added the Skip Changelog (PRs that do not require a CHANGELOG.md entry) and Skip Contrib Tests labels on Dec 18, 2024
@codecov (Bot) commented Dec 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.70%. Comparing base (ced38e8) to head (98db301).
Report is 107 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11956      +/-   ##
==========================================
+ Coverage   91.67%   91.70%   +0.02%     
==========================================
  Files         455      462       +7     
  Lines       24039    24749     +710     
==========================================
+ Hits        22038    22695     +657     
- Misses       1629     1672      +43     
- Partials      372      382      +10     


@github-actions (Bot) commented Jan 3, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

Comment thread on docs/rfcs/component-universal-telemetry.md (outdated):

> The upstream component which called `ConsumeX` will have this `outcome` attribute applied to its produced measurements, and the downstream component that `ConsumeX` was called on will have the attribute applied to its consumed measurements.
>
> Errors should be "tagged as coming from downstream" the same way permanent errors are currently handled: they can be wrapped in a `type downstreamError struct { err error }` wrapper error type, then checked with `errors.As`. Note that care may need to be taken when dealing with the `multiError`s returned by the `fanoutconsumer`. (If PR #11085 introducing a single generic `Error` type is merged, an additional `downstream bool` field can be added to it to serve the same purpose.)
A Member commented:

If I understand correctly, this may be a breaking change for some components, if they are checking for error types using something other than `errors.As`. I think it's OK though, and those components should update to use `errors.As` anyway. However, we should be aware of this when implementing changes, in case it is a widespread problem.

@jade-guiton-dd (Contributor, Author) replied:

You're absolutely right. I think those components would already be broken anyway, because of the `permanentError` and `multiError` wrappers we already use.
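
To illustrate the difference, a small self-contained sketch (the error types here are simplified stand-ins, not the actual `consumererror` implementations):

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the real error types.
type permanentError struct{ msg string }

func (e permanentError) Error() string { return e.msg }

type downstreamError struct{ err error }

func (e downstreamError) Error() string { return e.err.Error() }
func (e downstreamError) Unwrap() error { return e.err }

func main() {
	var err error = downstreamError{err: permanentError{msg: "bad data"}}

	// A direct type assertion only sees the outermost wrapper,
	// so it stops matching once an instrumentation layer wraps the error.
	if _, ok := err.(permanentError); !ok {
		fmt.Println("type assertion no longer matches")
	}

	// errors.As walks the Unwrap chain, so it still finds the
	// permanent error inside the wrapper.
	var pe permanentError
	if errors.As(err, &pe) {
		fmt.Println("errors.As still matches:", pe.msg)
	}
}
```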

A Contributor commented:

will we need / want to update this once #11085 is merged?

@jade-guiton-dd (Contributor, Author) replied on Jan 7, 2025:

If #11085 gets merged before this PR, I'll update this paragraph to only include the parenthetical. If this PR gets merged first, I think presenting the two alternatives is probably good enough? But we could make a second amendment if we feel the need to.

@jade-guiton-dd marked this pull request as ready for review on January 7, 2025, requested a review from a team as a code owner, and requested a review from @mx-psi.
@jaronoff97 (Contributor) left a comment:

One minor suggestion for simpler language, otherwise looks great, thank you!


@mx-psi added the rfc:final-comment-period (This RFC is in the final comment period phase) label on Jan 13, 2025
@mx-psi (Member) commented Jan 13, 2025

Given the two approvals and the announcements on #otel-collector-dev and at the Collector SIG meeting last week, this is entering its final comment period. cc @open-telemetry/collector-approvers

Comment on lines +93 to +95:

> For both metrics, an `outcome` attribute with possible values `success`, `failure`, and `rejected` should be automatically recorded, based on whether the corresponding function call returned successfully, returned an internal error, or propagated an error from a component further downstream.
A Contributor commented:

I would like to ensure there is detail about the OTLP PartialSuccess fields `rejected_spans`, `rejected_log_records`, and `rejected_data_points`. Here's what I propose:

An OTLP exporter that receives success with one of the rejected counts set will:

  • return nil indicating success
  • count N-j success points where N is the item count and j is the number rejected
  • count j rejected points
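
For concreteness, a hedged sketch of that counting for traces; `ExportTraceServiceResponse` and its `PartialSuccess` field are the real OTLP protobuf types, but `recordOutcome` and the surrounding function are hypothetical:

```go
package example

import (
	coltracepb "go.opentelemetry.io/proto/otlp/collector/trace/v1"
)

// recordOutcome stands in for however the exporter would emit
// outcome-attributed item counts.
func recordOutcome(outcome string, count int64) { /* ... */ }

// handlePartialSuccess applies the proposed counting: for a response
// carrying a partial success, count N-j items as success and j as
// rejected, while still returning nil at the Consumer API level.
func handlePartialSuccess(resp *coltracepb.ExportTraceServiceResponse, sent int64) error {
	// Proto getters are nil-safe: rejected is 0 when PartialSuccess is unset.
	rejected := resp.GetPartialSuccess().GetRejectedSpans()
	recordOutcome("success", sent-rejected)
	if rejected > 0 {
		recordOutcome("rejected", rejected)
	}
	return nil
}
```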

@jade-guiton-dd (Contributor, Author) replied on Jan 14, 2025:

In this RFC, the exporter wouldn't be the one outputting the metrics; this would be done by an external mechanism: a wrapper of the `Consumer` API placed between each pair of components.

This means that adding information about partial successes to pipeline instrumentation metrics would require a whole new mechanism for propagating partial successes through the Collector pipeline (and probably other considerations such as how retry exporters should handle them), which sounds like a much broader discussion...

Considering this mechanism would likely only be used by the OTLP exporter, I would suggest emitting a custom metric in said exporter instead of including it as part of general pipeline telemetry. Even in the latter case, I think this may warrant a separate PR, or even an RFC of its own.

A Member replied:

From the conversation above this seems like something that can be dealt with independently from this PR.

@jmacd I will merge this on Tuesday unless you consider this a blocker for this PR

A Contributor replied:

I think the consequences of not getting this right are pretty severe, as OpenTelemetry has advised vendors to use the rejected counts and partial success to indicate when well-formed data cannot be accepted for backend-specific reasons. Also, I think OpenTelemetry should prioritize the user experience for its own protocol over others.

There is already a red flag for me, in this PR, because the term "rejected" is being introduced without a definition.

We have an existing definition for rejected items in OpenTelemetry, which is what happens following a partial success, and we used to refer to failures as "refused" or "failed" in various Collector observability metrics. That said, I'm ready to accept a wider definition for "rejected", but if an OTLP exporter returns partial success and we count 0 rejected points, while using a separate OTLP-specific metric to count rejected points, I think the user experience will be bad, especially for OTLP users.

> In this RFC, the exporter wouldn't be the one outputting the metrics; this would be done by an external mechanism: a wrapper of the `Consumer` API placed between each pair of components.

I think we could improve this situation by having senders return (error, Details), although that would be a pretty big change. The best way to avoid my concerns, in the short term, is to not overload the term "rejected" and instead favor the term we've used in the past, "refused".

I have pretty strong feelings about how we define "dropped" as well. I don't think data should ever count as both rejected/refused and dropped, so I think some definitions would help; keep in mind that "rejected" is already defined.

@djaglowski (Member) replied on Jan 17, 2025:

> The best way to avoid my concerns, in the short term, is to not overload the term "rejected" and instead favor the term we've used in the past, "refused".

This seems reasonable to me. I don't think we're intending to modify any definitions outside of what is defined in this RFC.

The larger problem, if I understand correctly, boils down to reconciling the "all or nothing" interface that stands between components with the notion of partial success between the exporter and its destination. I suggest this is a separate discussion from this PR.

A Contributor replied:

I don't wish to block this effort, and I agree we should disambiguate requests that fail inside a component vs. downstream. The term "reject" was chosen for its dictionary definition ("dismiss as inadequate", "failure to meet standards"), because it describes a returned judgement about the data. If there's a downstream failure because of a timeout, an unavailable destination, and so on, the term "reject" feels less applicable to me. For "refuse", the dictionary has a more applicable definition ("not willing to perform an action").

I think "rejected" is an OK term for downstream failures, but receivers have been using "refused" for this. Mostly, I hope we can eventually count the OTLP partial success rejections. This is discussed in #9243.

@jade-guiton-dd (Contributor, Author) replied:

I see, I wasn't at all aware of the ongoing discussions about partial successes, and this distinction between "internal error / invalid data" and "data is valid but intentionally rejected for backend-specific reasons". Given this and the current use of the word "refused" in receiver metrics, I absolutely agree we should use that instead. To be honest, I hadn't thought very hard about the exact word used, so I'll update the PR with that in mind.

@mx-psi (Member) commented Jan 22, 2025

I'll give this a few more days since there was a minor change, and merge it on Tuesday next week.

@mx-psi added this pull request to the merge queue on Jan 28, 2025
@github-merge-queue Bot removed this pull request from the merge queue due to failed status checks on Jan 28, 2025
@mx-psi added this pull request to the merge queue again on Jan 28, 2025
Merged via the queue into open-telemetry:main with commit 1c4726a on Jan 28, 2025
sfc-gh-sili pushed a commit to sfc-gh-sili/opentelemetry-collector that referenced this pull request on Feb 3, 2025
github-merge-queue Bot pushed a commit that referenced this pull request Jul 9, 2025
#### Description

This PR updates the [Pipeline Component Telemetry RFC](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md) with the following changes:

- Reflect implementation choices that have been made since the RFC was written:
  1. using instrumentation scope attributes instead of datapoint attributes to identify component instances (see discussion in #12217 and open-telemetry/opentelemetry-go#6404)
  2. automatically injecting these attributes, without changes to component code
  3. changing the instrumentation scope name used for pipeline metrics
- Slightly change the semantics of `outcome = refused`:

  The current planned behavior (from #11956) is that, in the case of a pipeline A → B where component B returns an error, the "consumed" metric for B and the "produced" metric for A should both have `outcome = failure`.

  I fear that this may lead users to think that a failure occurred in A, and would like to restrict `outcome = failure` to only be associated with the component that "failed", i.e. component B. The "produced" metric associated with A would instead have `outcome = refused`.

  This incidentally makes implementation slightly easier, since an instrumentation layer will not need different error wrapping behavior between the "producer" layer and the "consumer" layer.

  See draft PR #13234 for an example implementation.

As this is a non-trivial change to an RFC, it may need to follow the RFC process.

Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com>
github-merge-queue Bot pushed a commit that referenced this pull request Jul 25, 2025
#### Description

The last remaining part of #12676 is to implement the `outcome: failure` part of the Pipeline Component Telemetry RFC ([see here](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md#auto-instrumented-metrics)). This is done by introducing a `downstream` error wrapper struct, to distinguish errors coming from the next component from errors bubbled up from further downstream.

#### Important note

This PR implements things slightly differently from what the text of the RFC describes.

If a pipeline contains components `A → B` and an error occurs in B, this PR records:
- `otelcol.component.outcome = failure` in the `otelcol.*.consumed.*` metric for B
- `otelcol.component.outcome = refused` in the `otelcol.*.produced.*` metric for A

whereas the RFC would set both `outcome`s to `failure`.

This is programmatically simpler (no need for different behavior between the `obsconsumer` around the output of A and the one around the input of B), but more importantly, I think it is clearer for users as well: `outcome = failure` only occurs on metrics associated with the component where the failure actually occurred.

This subtlety wasn't discussed in depth in #11956, which introduced `outcome = refused`, so I took the liberty of making this change. If necessary, I can file another RFC amendment to match or, if there are objections, implement the RFC as-is instead.
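
A hedged sketch of that uniform per-layer behavior, assuming a `downstream` wrapper like the one described in the RFC amendment; `obsConsumer` and `record` are illustrative names, not the actual `obsconsumer` code:

```go
package example

import (
	"context"
	"errors"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// downstreamError marks errors already attributed to a component.
type downstreamError struct{ err error }

func (e downstreamError) Error() string { return e.err.Error() }
func (e downstreamError) Unwrap() error { return e.err }

// obsConsumer is an instrumentation layer wrapped around the input of a
// component; record stands in for the real metric emission.
type obsConsumer struct {
	next   consumer.Traces
	record func(outcome string)
}

func (c *obsConsumer) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
	err := c.next.ConsumeTraces(ctx, td)

	var de downstreamError
	switch {
	case err == nil:
		c.record("success")
	case errors.As(err, &de):
		// The next component only propagated someone else's error.
		c.record("refused")
	default:
		// The next component itself failed: this is the only layer that
		// records "failure"; wrapping makes every upstream layer record
		// "refused" for the same error.
		c.record("failure")
		err = downstreamError{err: err}
	}
	return err
}
```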

#### Link to tracking issue

Fixes #12676

#### Testing

I've updated the existing tests in `obsconsumer` to expect a `downstream`-wrapped error to exit the `obsconsumer` layer. I may add more tests later.

#### Documentation

None.

---------

Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com>
Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>

Labels

rfc:final-comment-period (This RFC is in the final comment period phase), Skip Changelog (PRs that do not require a CHANGELOG.md entry), Skip Contrib Tests


5 participants